Writing that Parser

Dennis J. McWherter, Jr. bio photo By Dennis J. McWherter, Jr. Comment

There have been many articles written about parsing and when regexes are generally insufficient. Some of these come with full-blown proofs containing the pumping lemma (every Junior CS student€™s nightmare) and others describe the complexity involved in trying for something like PCRE which feels (read: I have no evidence that this claim is true or false) stronger than the formal definition of a regular expression. While these articles are great, the direction given feels a little light. The usual advice is €œuse a real parser€ (whatever that means– usually some sort of yacc) or €œuse tool X, never roll your own.€ I simply disagree with the €œnever roll your own€ philosophy unless it€™s directed toward a very common language (i.e. HTML, XML, JSON, etc.) and it€™s not part of your €œsecret sauce€ for your product. On the other hand, while I agree yacc is pretty good advice in most cases, these articles do not discuss the details to get from point A to point B. Here I€™ll try to fill some of that gap.

What€™s your Problem?

First of all, there must be a reason you€™re looking for a parser. Do you want to write a small (or large?) programming language (i.e. compile and execute), do you need a quick domain-specific language (DSL), or are you looking to manipulate a language in some other way (i.e. coloring, syntax checking, etc.)? In any case, you will likely want to start by writing a grammar that describes the language you wish to parse (or, if this is an existing language, find the existing grammar). If it€™s important enough (or you€™re curious enough), you can even prove that your grammar is correct.

After you have identified the problem you€™re solving and have a grammar in hand, we can start to evaluate how best to go about creating our parser. For almost all the techniques we€™ll explore, the grammar of your language is the foundation. Where and how you use that grammar is determined by your use-case.

What kind of tools should I use?

As you may have guessed, this is a quite common problem. People are always creating DSLs or programming languages and, consequently, there are tried and true solutions which already exist to help your productivity. Depending on your language preferences and problem constraints, you can consider lex/yacc (C/C++), antlr (Java), Happy (Haskell). These packages ingest your grammar and parse it for you. Consequently, you can take the parsed result and update your internal state to then manipulate it as needed.

I want to write my own Programming Language!

Why? Because the world doesn€™t have enough of them yet? (end sarcasm) If you€™re looking to write a custom programming language, a yacc (as mentioned earlier) is likely the best route to take. There are already many primers on this, but we will go through a very brief example.

First, we must come up with a use-case and then its corresponding grammar. We€™ll write a short language which does nothing but basic arithmetic (+, -, /, *) on integers and can print integers or strings. We€™ll call it ArithLang. Sounds simple enough, so let€™s see what such a grammar would look like (in yacc):

arithlang : /* Empty */
          | arithlang stmt  { pg_destroy(); pg_init(); }
          | arithlang NEWLINE
stmt : append
     | PRINT append     { pg_print_graph(); } /* This is where we make the interpretation "observable" */
append : eval
       | append APPEND eval
eval : exp              { pg_insert_int($1); }
     | QUOTE str QUOTE  { isInStr = 0; pg_insert_str(str_get()); }
exp : NUM                       { $$ = $1; }
    | OPENPAREN exp CLOSEPAREN  { $$ = $2; }
    | exp DIV exp               { $$ = $1 / $3; }
    | exp MUL exp               { $$ = $1 * $3; }
    | exp ADD exp               { $$ = $1 + $3; }
    | exp SUB exp               { $$ = $1 - $3; }
str : /* Empty */       { isInStr = 1; str_clear(); } /* Always clear first */
    | str CHAR          { str_update($2); }
    | str WHITESPACE    { str_update_str($2); }
    | str NUM           { sprintf(num, "%d", $2); str_update_str(num); }
    | str ESCAPE QUOTE  { str_update('"'); }
    | str ADD           { str_update('+'); }
    | str SUB           { str_update('-'); }
    | str MUL           { str_update('*'); }
    | str DIV           { str_update('/'); }
    | str OPENPAREN     { str_update('('); }
    | str CLOSEPAREN    { str_update(')'); }
    | str ESCAPE ESCAPE { str_update('\\'); }
    | str APPEND        { str_update('@'); }
    | str PRINT         { str_update_str("print"); }

Without getting into too many details about this grammar (i.e. nuances having to deal with LALR parsers and so forth), we can see that we deal entirely with €œtokens.€ In fact, there is not a single string represented in parsing the structure of our input. The reason for this is that our lexer parse strings into these tokens for us creating a nice separation of concern between parsing text and parsing the structure of our grammar. Since it is a very popular solution, we will use lex/yacc to parse our grammar in this example. In short, lex is the program which interprets text as language objects and yacc takes these objects and produces the structure that is the program. If you€™re interested in the full solution, you can see the full code here

Once you have the language parsed, you can do whatever you want with it. That is, you can compile the language to machine code (see LLVM or C–) or you can interpret it. Interpreting your program is often as simple as performing a particular action based on the token.

For this example I actually made it interpreted. In summary, the observable effects (i.e. the €œinterpretation€) occurs in the pg_print_graph() method (in helpers.c) which is called when we see the €œPRINT€ token. 

I need a DSL

So you€™re trying to simplify the logic of some tedious task, right? This is a valiant effort (though often difficult). That said, as before, the real work is in deciding the grammar and since we assume that have a sufficient grammar, this is a piece of cake.

Suppose we want to create a DSL for anyone who keeps track of inventory. In short, our grammar will support listing items by name and then providing a log of buys and sells which include the number of items bought or sold along with the corresponding price. This time we€™ll use antlr to represent our DSL.

ilang: (item NEWLINE)*;
ops: operation
   | operation NEWLINE ops
operation: TAB action WS amount WS price;
action: BUY
      | SELL
amount: INTEGER;

From here we can do more interesting things (as before) once the grammar is parsed. For instance, we can use the visitor pattern (i.e. stream parsing) to obtain aggregate information such as total profit or number of items in stock based on the parsed tree. You can see how I implemented such situations by viewing the full source here.

As you can see, most of these parsing problems are very similar in nature. That is, the goal is to parse a grammar and then perform some operations on the result. Consequently, solving this problem for programming languages vs. DSLs is not much different. That is, you start with a grammar, use the tool you€™re most familiar/comfortable with to parse it, and perform the desired operations based on the result. Sounds simple right?

Can I do this from scratch?

Well, yes. Yes you can. In fact, I encourage you to do so as it is a non-trivial problem. The tools above are some examples which make this process a little easier, but we can certainly perform all of these operations without employing their help. That being said– though it is non-trivial– parsing and lexing aren€™t always the most exciting problems and they€™re certainly error-prone. As a result, I would recommend against doing this for any commercial project you€™re working on unless you have a very good reason to do so (i.e. it is your product or there are other significant limitations you face with existing solutions). 

With that in mind, learning new things is always a good thing. If that is your motivation to rolling your own solution, I suggest you do it!

That was fast

We lightly touched on a lot of concepts (whether you realized it or not) pretty quickly. I didn€™t go into much detail on anything (read: there are many articles out there that does this already probably better than I can), but I expect that you have gained a clearer overall picture of this parsing business. As usual, I am happy to answer any outstanding questions you may have below!

comments powered by Disqus