Elsewhere > Write You a Parser for Fun and Win
Posted 2013-11-26 on viget.com
As a software developer, you’re probably familiar with the concept of a parser, at least at a high level. Maybe you took a course on compilers in school, or downloaded a copy of Create Your Own Programming Language, but this isn’t the sort of thing many of us get paid to work on. I’m writing this post to describe a real-world web development problem to which creating a series of parsers was the best, most elegant solution. This is more in-the-weeds than I usually like to go with these things, but stick with me – this is cool stuff.
The Problem
Our client, the Chronicle of Higher Education, hired us to build Vitae, a series of tools for academics to find and apply to jobs, chief among which is the profile, an online résumé of sorts. I’m not sure when the last time you looked at a career academic’s CV was, but these suckers are long, packed with degrees, publications, honors, etc. We created some slick Backbone-powered interactions for creating and editing individual items, but a user with 70 publications still faced a long road to create her profile.
Since academics are accustomed to following well-defined formats (e.g. bibliographies), KV had the idea of creating formats for each profile element, and giving users the option to create and edit all their data of a given type at once, as text. So, for example, a user might enter his degrees in the following format:
Duke University
; Ph.D.; Biomedical Engineering
University of North Carolina
2010; M.S.; Biology
2007; B.S.; Biology
That is to say, the user has a bachelor’s and a master’s in Biology from UNC, and is working on a Ph.D. in Biomedical Engineering at Duke.
The Solution
My initial, naïve approach to processing this input involved splitting it up by line and attempting to suss out what each line was supposed to be. It quickly became apparent that this was untenable for even one model, let alone the 15+ that we eventually needed. Chris suggested creating custom parsers for each resource, an approach I’d initially written off as being too heavy-handed for our needs.
What is a parser, you ask? According to Wikipedia, it’s
a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure – giving a structural representation of the input, checking for correct syntax in the process.
Sounds about right. I investigated Treetop, the most well-known Ruby library for creating parsers, but I found it to be targeted more toward building standalone tools rather than use inside a larger application. Searching further, I found Parslet, a “small Ruby library for constructing parsers in the PEG (Parsing Expression Grammar) fashion.” Parslet turned out to be the perfect tool for the job. Here, for example, is a basic parser for the above degree input:
class DegreeParser < Parslet::Parser
root :degree_groups
rule(:degree_groups) { degree_group.repeat(0, 1) >>
additional_degrees.repeat(0) }
rule(:degree_group) { institution_name >>
(newline >> degree).repeat(1).as(:degrees_attributes) }
rule(:additional_degrees) { blank_line.repeat(2) >> degree_group }
rule(:institution_name) { line.as(:institution_name) }
rule(:degree) { year.as(:year).maybe >>
semicolon >>
name >>
semicolon >>
field_of_study }
rule(:name) { segment.as(:name) }
rule(:field_of_study) { segment.as(:field_of_study) }
rule(:year) { spaces >>
match("[0-9]").repeat(4, 4) >>
spaces }
rule(:line) { spaces >>
match('[^ \r\n]').repeat(1) >>
match('[^\r\n]').repeat(0) }
rule(:segment) { spaces >>
match('[^ ;\r\n]').repeat(1) >>
match('[^;\r\n]').repeat(0) }
rule(:blank_line) { spaces >> newline >> spaces }
rule(:newline) { str("\r").maybe >> str("\n") }
rule(:semicolon) { str(";") }
rule(:space) { str(" ") }
rule(:spaces) { space.repeat(0) }
end
Let’s take this line-by-line:
2: the root
directive tells the parser what rule to start parsing
with.
4-5: degree_groups
is a Parslet rule. It can reference other
rules, Parslet instructions, or both. In this case, degree_groups
, our
parsing root, is made up of zero or one degree_group
followed by any
number of additional_degrees
.
7-8: a degree_group
is defined as an institution name followed by
one more more newline + degree combinations. The .as
method defines
the keys in the resulting output hash. Use names that match up with your
ActiveRecord objects for great justice.
10: additional_degrees
is just a blank line followed by another
degree_group
.
12: institution_name
makes use of our line
directive (which
we’ll discuss in a minute) and simply gives it a name.
14-18: Here’s where a degree (e.g. “1997; M.S.; Psychology”) is
defined. We use the year
rule, defined on line 23 as four digits in a
row, give it the name “year,” and make it optional with the .maybe
method. .maybe
is similar to the .repeat(0, 1)
we used earlier, the
difference being that the latter will always put its results in an
array. After that, we have a semicolon, the name of the degree, another
semicolon, and the field of study.
20-21: name
and field_of_study
are segments, text content
terminated by semicolons.
23-25: a year
is exactly four digits with optional whitespace on
either side.
27-29: a line
(used here for our institution name) is at least one
non-newline, non-whitespace character plus everything up to the next
newline.
31-33: a segment
is like a line
, except it also terminates at
semicolons.
35-39: here we put names to some literal string matches, like semicolons, spaces, and newlines.
In the actual app, the common rules between parsers (year, segment, newline, etc.) are part of a parent class so that only the resource-specific instructions would be included in this parser. Here’s what we get when we pass our degree info to this new parser:
[{:institution_name=>"Duke University"@0,
:degrees_attributes=>
[{:name=>" Ph.D."@17, :field_of_study=>" Biomedical Engineering"@24}]},
{:institution_name=>"University of North Carolina"@49,
:degrees_attributes=>
[{:year=>"2010"@78, :name=>" M.S."@83, :field_of_study=>" Biology"@89},
{:year=>"2007"@98, :name=>" B.S."@103, :field_of_study=>" Biology"@109}]}]
The values are Parslet nodes, and the @XX
indicates where in the input
the rule was matched. With a little bit of string coercion, this output
can be fed directly into an ActiveRecord model. If the user’s input is
invalid, Parslet makes it similarly straightforward to point out the
offending line.
This component of Vitae was incredibly satisfying to work on, because it solved a real-world issue for our users while scratching a nerdy personal itch. I encourage you to learn more about parsers (and Parslet specifically) and to look for ways to use them in projects both personal and professional.