Colm programming language released: best parser-writer ever (complang.org)

87 points by edwardog 15 years ago | 42 comments

ScottBurson 15 years ago |

Thurston claims that no previous grammar system supports his three requirements of generalized parsing, grammar-dependent scanning, and context-dependent parsing.

I would argue that Prolog Definite Clause Grammars, which date back to the early 1970s, have all three of these properties. Furthermore, since the context is maintained functionally, by threading additional values through the productions, no "undo actions" are required; Prolog's built-in backtracking is all that's needed.

Of course, the problem with DCGs is performance: they're exponential in the worst case. But I think they deserve mention in a dissertation like this anyway. Also, any backtracking parser risks exponential worst-case performance; it will be interesting to see how Colm avoids this fate (I've only read the first few pages so far).
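
To make "threading additional values through the productions" concrete, here is a minimal sketch in Python rather than Prolog. It is an analogue of the DCG idea, not a DCG and not anything taken from Colm. The toy grammar (`let NAME` declarations followed by uses of declared names) and the function names are assumptions made up for illustration. Each rule is a generator that yields every (position, context) pair it can reach, so backtracking is just moving on to the next yielded alternative, and because the context is an immutable frozenset nothing ever has to be undone.

```python
def decl(tokens, pos, ctx):
    # decl -> "let" NAME
    # Yields a NEW context; the caller's context is left untouched.
    if pos + 1 < len(tokens) and tokens[pos] == "let":
        yield pos + 2, ctx | {tokens[pos + 1]}

def use(tokens, pos, ctx):
    # use -> NAME, accepted only if NAME is in the threaded context.
    if pos < len(tokens) and tokens[pos] in ctx:
        yield pos + 1, ctx

def stmts(tokens, pos, ctx):
    # stmts -> empty | (decl | use) stmts
    yield pos, ctx
    for rule in (decl, use):
        for next_pos, next_ctx in rule(tokens, pos, ctx):
            yield from stmts(tokens, next_pos, next_ctx)

def accepts(source):
    # The input is accepted if any alternative consumes every token.
    tokens = source.split()
    return any(p == len(tokens) for p, _ in stmts(tokens, 0, frozenset()))
```

With this sketch, `accepts("let x x")` succeeds while `accepts("x let x")` fails, because the use of `x` is checked against the context as it stood at that point in the input. Like a DCG, it explores alternatives blindly and can take exponential time on an unfavourable grammar.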

alan-crowe 15 years ago | |

I remember, back in 1985, trying to write a parser in Prolog on a VAX 11/750. I didn't know what I was doing and my code fell into the exponential case trap. Even minimalist examples appeared to crash as they took unreasonably long, so I wasn't getting clues to the problem, and just gave up.

Consequently I want to rephrase the first sentence of your third paragraph. The problem with DCGs is performance: they're exponential in the naive case. That might not sound too bad today, but back in 1985 the hype was that Prolog let you program declaratively. Just declare what a parse looks like. Reasonable performance in the naive case was the popular selling point for Prolog.

thurston 15 years ago | |

Where is the grammar-dependent scanning?

Note that threading the context through the parse tree while maintaining fully generalized parsing requires keeping all versions of the parsing context in memory. Consider making a C++ parser in that way, i.e., every time you modify the structures you build in memory, you make a copy of them first.

Hemospectrum 15 years ago | | |

If you know that every subfield of your state object is immutable, and you're using references instead of "local" copies (inconvenient unless you have garbage collection) then your copying costs are limited to the size of the overall state object plus whichever field changed.

Obviously this makes no sense for C++, but Clojure and OCaml get away with it and in Haskell it's the standard way of implementing almost every stateful computation.

swannodette 15 years ago | | |

If your data structures are persistent data structures you don't incur the costs of copying.
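
A minimal sketch of the structural sharing being described, in Python. The names `Cons`, `ParseState`, and `declare_typedef` are hypothetical and chosen only for illustration; nothing here comes from Colm or from any real parser. Each "modification" builds one small new state object and one new list cell, and everything else is shared by reference with the older version, which stays valid for any parse alternative still holding it.

```python
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Cons:
    # One immutable cell of a singly linked list.
    head: str
    tail: Optional["Cons"] = None

@dataclass(frozen=True)
class ParseState:
    # Hypothetical parsing context: declared type names plus a scope depth.
    typedefs: Optional[Cons] = None
    scope_depth: int = 0

def declare_typedef(state, name):
    # Allocates one new ParseState and one new Cons cell; the rest of the
    # typedef list is shared with the previous state, not copied.
    return replace(state, typedefs=Cons(name, state.typedefs))

def names(cell):
    # Walk the list from newest to oldest declaration.
    out = []
    while cell is not None:
        out.append(cell.head)
        cell = cell.tail
    return out

before = ParseState()
with_t = declare_typedef(before, "size_t")
branch_a = declare_typedef(with_t, "A")  # one parse alternative
branch_b = declare_typedef(with_t, "B")  # a competing alternative
```

Here `branch_a` and `branch_b` both point at the same `size_t` cell, so keeping several versions of the context alive costs one cell per change rather than one full copy per change. This relies on garbage collection to reclaim cells once no alternative refers to them.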

jws 15 years ago |

Notice that the DNS example is parsing a binary DNS request, not a text file.

thurston 15 years ago | |

:) If I had my way this comment would be closer to the top. Not many grammar-based parsing systems can claim raw DNS parsing.
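
For readers wondering why a binary DNS request is awkward for grammar-based tools, here is a hand-written Python sketch of the header and question section as laid out in RFC 1035. It is not Colm code and does not show how Colm handles it. The function name `parse_dns_query` is an assumption, and name compression is deliberately left out. The point is that each label is prefixed by its own length, so how many bytes the next "token" covers depends on data already read.

```python
import struct

def parse_dns_query(data: bytes):
    # Fixed 12-byte header: six big-endian 16-bit fields
    # (ID, flags, QDCOUNT, ANCOUNT, NSCOUNT, ARCOUNT).
    ident, flags, qdcount, _an, _ns, _ar = struct.unpack_from("!6H", data, 0)
    pos = 12
    questions = []
    for _ in range(qdcount):
        labels = []
        while True:
            length = data[pos]  # length byte decides how much to read next
            pos += 1
            if length == 0:     # a zero-length label ends the name
                break
            if length & 0xC0:
                raise ValueError("compressed names are not handled in this sketch")
            labels.append(data[pos:pos + length].decode("ascii"))
            pos += length
        # Each question ends with 16-bit QTYPE and QCLASS fields.
        qtype, qclass = struct.unpack_from("!2H", data, pos)
        pos += 4
        questions.append((".".join(labels), qtype, qclass))
    return {"id": ident, "flags": flags, "questions": questions}
```

A usage example: feeding it a header built with `struct.pack("!6H", 0x1234, 0x0100, 1, 0, 0, 0)` followed by `b"\x07example\x03com\x00"` and `struct.pack("!2H", 1, 1)` yields one question for `example.com` with type 1 and class 1.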

haberman 15 years ago |

From my quick scan of the thesis, the basic design seems to be a programming language in which you write both the parser and any transformations you want to perform. It's not clear whether there is an easily-accessible parse tree serialization that you can use to load the output into another language, or whether you'd have to invent that yourself.

I think it's generally a hard sell if you try to convince people that they need to write their algorithms in your special language. Parsing tools deliver value because grammars are easier to write than the imperative code that implements those grammars. That value offsets the cost of having to learn a new special-purpose language. But imperative programming languages are already pretty good at tree traversal and transformation, so there's little benefit to using a special-purpose language for this.

I think that the next big thing in parsing will be a runtime that easily integrates into other languages so that the parsing framework can handle only the parsing and all of the tree traversal and transformation can be performed using whatever language the programmer was already using. This requires much less buy-in to a special-purpose language.

thurston 15 years ago | |

Colm has built-in serialization. There is still some work to do in this area though. Colm will preserve whitespace for minimal disruption of untransformed text, but figuring out what to do at the boundaries between modified and unmodified trees can be tricky.

You are right, people want to use general purpose languages for the more complex algorithms. I agree a means of embedding is necessary and I have kept this in mind, though not yet achieved it. I would very much like to be able to parse, transform, then have the option to import the data into another environment and carry on there.

haberman 15 years ago | | |

Thanks for the info. What is the built-in serialization format?

beza1e1 15 years ago | |

ANTLR can do this, although it does not work that well. I used the C backend, which is pretty directly ported from the Java backend. C-in-Java-style is pretty awkward.

bdfh42 15 years ago |

Quote "Colm does not yet have any documentation".

Then I would hazard that it is not yet a language, as without documentation it has no "grammar". At best it is a patois.

thurston 15 years ago | |

Grammar: http://svn.complang.org/colm/trunk/colm/lmparse.kl

wzdd 15 years ago | |

TXL, its apparent predecessor, is very well documented (http://www.txl.ca/). TXL is a very interesting approach to parsing and worth reading up on if you're interested in the area (or are waiting for documentation for Colm :)

scscsc 15 years ago | |

There seems to be a PhD thesis behind it, so you should check it for the grammar.

colomon 15 years ago |

It would be interesting to see someone who understood both this and Perl 6's grammars do a comparison. Based on Colm's quick description and my rough understanding of Perl 6 grammars, they sound like they are roughly equally powerful. But I admit I'm not sure I understand what "transformation language" means...

audreyt 15 years ago | |

Although similar in expressive power, Colm offers instruction logging to auto-reverse global state changes upon backtracking, something Perl 6 grammars do not (yet) support; at the moment we need to manually manage them with embedded blocks.
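
The general idea of logging changes so they can be reversed on backtracking can be sketched in Python as below. This is a generic undo log, not Colm's actual mechanism, and `UndoLog` and `try_alternatives` are hypothetical names invented for the example. Every write to shared state goes through the log, which remembers the previous value; when an alternative fails, the log is rolled back to where it stood before that alternative started.

```python
_MISSING = object()  # sentinel meaning "key was absent before the write"

class UndoLog:
    """Records how to reverse each change made to shared (global) state."""

    def __init__(self):
        self._entries = []

    def mark(self):
        # A mark is simply the current length of the log.
        return len(self._entries)

    def set(self, table, key, value):
        # Remember the old value (or its absence) before overwriting it.
        self._entries.append((table, key, table.get(key, _MISSING)))
        table[key] = value

    def rollback(self, mark):
        # Undo the most recent writes first until the log is back at the mark.
        while len(self._entries) > mark:
            table, key, old = self._entries.pop()
            if old is _MISSING:
                del table[key]
            else:
                table[key] = old

def try_alternatives(alternatives, symbols, log):
    # Run each alternative in order; reverse its side effects if it fails.
    for alt in alternatives:
        mark = log.mark()
        if alt(symbols, log):
            return True
        log.rollback(mark)
    return False
```

An alternative here is any callable taking `(symbols, log)` and returning True on success. The manual approach mentioned for Perl 6 corresponds to writing the rollback by hand inside each alternative instead of routing writes through a log.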

chocolateboy 15 years ago | | |

Re: "reverse global state changes upon backtracking": this sounds similar to the (manual) "undo actions" supported by the Kelbt parser [1], perhaps unsurprisingly as it was developed by the same author :-)

[1] http://www.complang.org/kelbt/

colomon 15 years ago | | |

Thanks!

thurston 15 years ago | |

If they are then I don't deserve to be called "Dr. Thurston!"

Twisol 15 years ago |

Adrian Thurston (the creator of Colm) is also responsible for the fantastic Ragel state machine generator.

DrCatbox 15 years ago |

I am more interested in DSNP. How come this project has not received more fame than the infamous Diaspora? http://www.complang.org/dsnp/

thurston 15 years ago | |

There are some difficult problems in that space. I've posted to HN and reddit a few times, but mostly I've been working on it quietly so I can focus. Lately, that's starting to change. I'll be talking about it at FSW 11 in Berlin in a few weeks.

DrCatbox 15 years ago | | |

My Google sense failed me this time around: I couldn't find any information on this FSW 11 in Berlin. Care to explain? Is it a conference, and can anybody come?

I am really interested in DSNP and am fairly well versed in GNU/Linux and can do some programming, Java and Python mostly. I work as a web-frontend developer guy. Can I be of some help? Do you need testers, peers, documenters?

Barrasmara 15 years ago |

This kind of sounds like Semantic Designs' DMS Software Reengineering Toolkit and the PARLANSE language.

thurston 15 years ago | |

They are related systems. DMS is much more mature.