Parsers and Parser Combinators
What is a parser?
A parser (in the singular) is a unit of computational logic that evaluates a string of text and (1) determines whether it matches a specified pattern, and (2) returns the text that matches the pattern.
Every parser is looking for something – some pattern. When evaluating text, a parser seeks to answer two questions:
- Is the text an example of my pattern?
- What is the text that matches my pattern?
For example, the simplest parser in the world might just look for the letter “e” – call it LetterEParser
. If we passed in “elephant,” it would find the letter “e” at the start of that text, then return “true” and the letter “e”.
Clearly, that’s not a great use of a parser, since it’s just as easy to evaluate the text yourself.
But what about a parser that looks for any letter and not something else? Call this AnySingleLetterParser
.
If we passed in “elephant,” it would again return true and the letter “e.”
If we passed in “anteater,” it would return true and the letter “a.”
If we passed in “007,” it would return false since “0” is not a letter.
We could just as easily do the same for numbers. Let’s call that AnySingleNumberParser
.
Additionally, we could create a SpecificLetterParser
that we configured with the specific letter we were looking for. It would return “true” if the text started with that letter.
The important point: parsers are small units of code that look for specific patterns. We tend to discuss “parsers” as the grand constructs that break down massive stretches of text into data structures, but those are really combinations of many parsers.The system that coordinates these smaller parsers is called, appropriately, a “parser combinator.”
Combining Parsers
Our prior example was admittedly trivial. The real power of parsers is when they’re combined to create more complicated patterns.
What if we could run two parsers consecutively to make sure they both match, and then get back the combined return values of both of them?
So we could do something like this:
(AnySingleLetterParser + AnySingleLetterParser1)
If we ran this on “elephant,” both parsers would return true, and we’d get back the combined result of “el.”
If we ran it on “a1,” we’d get back “false” and nothing, because the second parser would return false (and we’re requiring that they both return true).
We do this by creating another parser that simply “wraps” two instances of AnySingleLetterParser
and makes sure they both match. We can call this TwoLetterSequenceParser
.
From the “outside,” this is a parser like any other. It takes text input, and it will return true or false and the text that matched. Internally, however, it creates two instances of AnySingleLetterParser
and executes them both in sequence, ensures they both return true, and combines the result.
Now, that’s pretty specific and not very helpful, but what if we were more generic?
Since a parser is a generic unit of logic that returns consistent values, we can create a SequenceParser
that wraps any number of other parsers, and executes them in order, and concatenates the result. If all the “inner” parsers return true, the “outer” parser will return true with the combined result of all the inner parsers. If any one of the inner parsers returns false, the outer parser returns false and nothing.
Let’s create another parser called AnyParser
. This will wrap other parsers, and execute them in order. If any one of them returns true, the AnyParser
itself will return true and return the result of the inner parser that returned true (and skip the rest – it just uses the first parser that works).
Now we’ve done something potentially useful.
Let’s say we want to parse the course numbers at the local university. For example, the course number for the introductory business course in the morning is “BU101A” and the afternoon class is “BU101B.” All the course numbers fit this format.
What we need to do is something like this:
- A sequence of
- Any letter
- Any letter
- A sequence of
- Any number
- Any number
- Any number
- Either
- “A”
- “B”
So, this is three parsers. Or … is it?
It is three parsers, but they themselves form a sequence.
We need all three to return true, and we have SequenceParser
to do exactly this. There’s no reason why SequenceParser
can’t just wrap other SequenceParser
instances, because every parser acts the same way. SequenceParser
works like any other parser, and there’s no reason one instance can’t execute another instance.
So what we really need is this:
SequenceParser
SequenceParser
AnySingleLetterParser
AnySingleLetterParser
SequenceParser
AnySingleNumberParser
AnySingleNumberParser
AnySingleNumberParser
AnyParser
SpecificLetterParser
(“A”)SpecificLetterParser
(“B”)
What we’ve created is a “tree” of parsers, all working from an initial parser (the “root” parser). Every parser – including the root parser – is responsible for its own logic.
All we have to do is execute the outer (“root”) parser. It will then execute the three inner parsers as part of its logic. Those, in turn, will each call their inner parsers, passing back true or false and whatever they “collect.” If everything returns true, we’ll receive a result of true from the outer parser, along with the combined text that matched.
This is what we mentioned earlier: a “parser combinator.”
Manipulating the Cursor
Here’s the key to making this work – each parser works on the same input text. The text that’s passed to one parser, is then passed to the next, then the next, etc. But each parser “reads forward” as it matches (as we stated in the assumptions above, we’re assuming left-to-right).
The input text has a positional designator – called a “cursor” – that indicates the current location of evaluation. This is the current character in the text that the parser is operating on.
If a parser matches, it pushes the cursor forward. It effectively “claims” that part of the text, and tells the next parser, “We’re good so far. Just work from this point forward.”
SequenceParser
AnySingleLetterParser
AnySingleLetterParser
In our example above, the first AnySingleLetterParser
would start with a cursor of 1, meaning it’s looking at the first character. It would match that “B” (since it’s looking for any letter). It would then set the cursor to 2.
The next AnyLetterParser
would then start at character 2 – the “U”. It would match, then set the cursor to 3. And so on. Every parser is responsible for claiming the section of the text that it matched on, so the next parser can start after that.
In our example, the next parser is the second SequenceParser
, so it would evaluate the text starting at position 3.
If a parser does not match, it’s responsible for resetting the cursor back to where it was when it started work. And remember that a “match” might mean that it contains inner parsers that are executing – so if these inner parsers don’t match, the outer parser has to back the cursor up.
Essentially: if you don’t match, then leave the cursor where you found it.
In our example above, the final “second level” parser will return true if any of its inner parsers returns true. In the case of a course number that ends in a “B,” the first SpecificLetterParser
will return false. When this happens, that parser needs to reset the cursor, so that the second SpecificLetterParser
can take its turn to try to match the “B.” If the first parser didn’t reset the cursor, the second one would start in the wrong position (off the end of the string, in this case).