Gazelle
a system for building fast, reusable parsers
Gazelle is a system currently in development for parsing context-free grammars, especially programming languages. It takes inspiration from yacc/bison and ANTLR, but seeks to take these ideas to the next level.
Why another parser generator? What does Gazelle bring to the table?
- a VM-based approach: grammars compile to bytecode, not C
- Traditional parser generators compile a grammar to imperative code in C or another procedural language. In contrast, Gazelle compliles a grammar to byte-code which is then loaded by the very-light-weight Gazelle runtime (or "VM"). This approach makes it much easier to save the intermediate, incremental state of a parse. The state of the parse and all its state machines can be inspected and visualized. The API can give you many options at run-time for how you want to consume the output (AST only, AST+parse tree, etc) without forcing you to to tweak code generation options and recompile. For dynamic languages (Lua, Python, Ruby, etc), wrapping a high-performance VM will yield much better performance than generating code in the dynamic language.
- a standard, language-independent AST format (Protocol Buffers)
- Google's Protocol Buffers serialization format provides a schema language for describing tree structures with useful types like strings, integers, enums, and floating-point built-in. A protocol buffer schema can be used to expose a SAX-like event based parsing interface or a DOM-like tree, and both can be conveniently consumed from any language. And we can efficiently serialize these trees to either binary or text format.
- reusable parsers
- Unlike existing parsing tools, Gazelle puts a major focus on making grammars reusable. Grammars written using Gazelle should be usable out-of-the-box for anything from a compiler to the syntax-highlighting in a text editor. Making grammars reusable will let the open source community cooperate to write and maintain a set of canonical parsers for all popular languages, so that you can use off-the-shelf parsers instead of reinventing the wheel.
- language agnostic
- Many existing parsing systems are tied to one specific host language
(for example, Python has PyParsing, Ruby has Treetop, etc). Grammars
written using these systems are useless in the case that you want to invoke
these parsers from a different language. This hinders reusability and
wide adoption.
Gazelle parsers can be used from any language by writing bindings for the Gazelle C Runtime (and runtimes could be written for other platforms like the JVM or .NET). You only need to write these wrappers once — once they're written, you can parse any language that has a Gazelle grammar. And because the runtime is in C, you get very fast parsing, even if you are parsing from a high-level language like Ruby that wouldn't be a speed demon on its own. - the fastest parsing money can buy
- Gazelle uses fast algorithms, keeps its memory footprint minimal, and avoids memory allocation/deallocation wherever possible. The APIs are designed such that the parser could be JITted at runtime, opening up access to SSE, which lets you do things like 16 bytewise comparisons in a single instruction. SSE 4.2 even has specific instructions to accelerate parsing.
- integrated lexing and parsing
- The Gazelle grammar contains all necessary lexing and parsing information in a single unified grammar description language. There is no need to create and maintain a separate lexer. Any tricky work you would otherwise need to do in the lexer (for example, using lexer states) is automatically inferred from the grammar by Gazelle.
- modular grammars
- Embedding one language inside another is quite common these days, especially with templating languages like PHP or RHTML that embed an imperative language inside HTML. Gazelle will let you directly express the idea that RHTML is just HTML with Ruby in between these special tags. And because Gazelle integrates lexing and parsing, you don't have to worry about writing a lexer smart enough to know when to switch between languages.
- lightweight, embeddable compiler
- The compiler will be available as a library, just as the runtime is. This means that users of dynamic languages can opt to load the grammar in text form directly, thereby eliminating a compile step of any kind.
Another Angle
There are an uncountable number of parsing frameworks out there. Too many. Gazelle seeks to bring some unity to this landscape through two main goals:
- making parsers reusable, so the community can cooperate to write really good ones
- making parsers usable from any language, so that different language communities can cooperate.
If these goals are accomplished, then fast, complete, correct parsers for common languages can become a commodity, like they ought to be. No one should have to "roll their own" unless they are parsing something very new or unusual.
Status
As of January, 2011, Gazelle is in a somewhat dormant state while I work on my Protocol Buffers implementation upb, which is a prerequisite for further Gazelle development. I remain extremely motivated to return to Gazelle once upb is ready.
Gazelle has reached a state where it is genuinely useful for doing event-based parsing of very simple languages, but its API will change significantly when protocol buffer support is integrated. So users will likely want to hold off on adopting Gazelle at this time.
There is still much more to come: operator-precedence parsing, embedded Lua to deal with tricky situations, and character set support, just to name a few.
Project Resources
Gazelle is free software, released under the BSD license.
- Downloads are on Google Code (direct download link: gazelle-0.4.tar.gz, 120kb).
- News and status updates are posted on the "Gazelle" category of my blog.
- Questions/Comments should go to the gazelle-users Google Group.
- Development Repository uses Git and is hosted at the gazelle repository on GitHub.
- Documentation: the manual is included in the source distribution, but a pre-built version is also kept online: the Gazelle Manual.
- Bug/Issue tracking is on the Google Code issue tracker.
Gazelle is written by Joshua Haberman. To reach me directly, email joshua@reverberate.org.
If you were looking for the Gazelle Movie Editor, its homepage is here.
Thanks to the Bradshaw Foundation for the Gazelle pic.