https://news.ycombinator.com/item?id=30414683
https://news.ycombinator.com/item?id=30414879
I spent a year or two working with PEGs, and ran into similar issues multiple times. Adding a new production could totally screw up seemingly unrelated parses that worked fine before.
As the author points out, Earley parsing with some disambiguation rules (production precedence, etc.) has been much less finicky/annoying to work with. It's also reasonably fast for small parses even with a dumb implementation. Would suggest for prototyping/settings when runtime ambiguity is not a showstopper, despite the remaining issues described in the article re: having a separate lexer.
> Earley parsing with some disambiguation rules
Any idea why GLR always gets ignored?
GLR would probably have (much) better performance but I'm usually not parsing huge files (or would hand-roll one if I were). I've not yet found an explanation of GLR (or even LR for that matter) that's quite as simple as PEGs or Earley (suggestions welcome tho!).
Both GCC and LLVM implement recursive decent parsers for their C compilers.
Parser generators are an abomination inflicted upon us by academia, solving a non problem, and poorly.
A hand-written recursive descent parser (with an embedded Pratt parser to handle expressions/operators) solves all the problems that parser generators struggle with. The big/tricky "issue" mentioned in the article - composing or embedding one parser in another - is a complete non-issue with recursive-descent - it's just a function call. Other basic features of parsing: informative/useful error messages, recovery (i.e. don't just blow up with the first error), seamless integration with the rest of the host language, remain issues with all parser generators but are simply not issues with recursive descent.
And that's before you consider non-functional advantages of recursive-descent: debuggability, no change/additions to the build system, fewer/zero dependencies, no requirement to learn a complex DSL, etc.
This is the controversial part, Lisp aficionados to the contrary.
You can parse quite a lot more than Lisp with techniques from 1965.
My brain can do it, why can't my computer?
Grammar-based parsing for natural language isn't anywhere close to working, sadly, and may never be.
Parsing: The Solved Problem That Isn't (2011) - https://news.ycombinator.com/item?id=8505382 - Oct 2014 (70 comments)
Parsing: the solved problem that isn't - https://news.ycombinator.com/item?id=2327313 - March 2011 (47 comments)
Certain parser generators make life easier by supporting actions on parser/lexer rules. This is great and all, but it has the downside that the grammar you provide is no longer reusable. There's no way for others to import that grammar and provide custom actions for them.
I don't know. In my opinion parsing theory is already solved. Whether it's PEG, LL, LR, LALR, whatever. One of those is certainly good enough for the kind of data you're trying to parse. I think the biggest annoyance is the tooling.
Pros: * They're just a technique/library that you can use in your own language without the seperate generation step.
* They're simple enough that I often roll my own rather than using an existing library.
* They let you stick code into your parsing steps - logging, extra information, constructing your own results directly, e.g.
* The same technique works for lexing and parsing - just write a parser from bytes to tokens, and a second parser from tokens to objects.
* Depending on your languages syntax, you can get your parser code looking a lot like the bnf grammar you're trying to implement.
Cons: * You will eventually run into left-recursion problems. It can be nightmarish trying to change the code so it 'just works'. You really need to step back and grok left-recursion itself - no handholding from parser combinators.
* Same thing with precedence - you just gotta learn how to do it. Fixing left-recursion didn't click for me until I learned how to do precedence.
Double quotes in C code mean begin and end of a string. But strings contain quotes too. And newlines. Etc.
So we got the cumbersome invention of escape codes, and so characters strings in source (itself a character string) are not literally the strings they represent.
ugly, yes. problematic? no.
You can't always just copy and paste some text into code, without adding escape encodings.
Now write code that generates C code with strings, that generates C code with strings, and ... (ahem!)
It's not a big deal, but it isn't zero friction either. Relevant here because it might be the most prevalent example of what happens when even two simple grammars collide.
Unless it's a regex....
But this ignores all sorts of other steps you can take. Targeting multiple execution environments is an obvious step. Optimization is another. Trivial local optimizations like shifts over multiplications by 2 and fusing operations to take advantage of the machine that is executing it. Less trivial full program optimizations that can propagate constants across source files.
And preemptive execution is a huge consideration, of course. Very little code runs in a way that can't be interrupted for some other code to run in the meantime. To the point that we don't even think of what this implies anymore. Despite accumulators being a very basic execution unit on most every computer. (Though, I think I'm thankful that reentrancy is the norm nowadays in functions.)
Even English can be parsed first into the sounds. This is why puns work. Consider the joke, "why should you wear glasses to math class? It helps with division." That only works if you go to the sounds first. And you will have optionality in where to go from there.
So, for parsing programs, we often first decide on primitives for execution. For teaching, this is often basic math operations. But in reality, you have far more than the basic math operations. And, as I was saying, you can do more with the intermediate representation than you probably realize at the outset.
A somewhat breathless description of all of this is in the Marpa parser documentation:
https://jeffreykegler.github.io/Marpa-web-site/
In practice, I've found that computers are so fast, that with just the Joop Leo optimizations, 'naive' Earley parsing is Good Enough™: https://loup-vaillant.fr/tutorials/earley-parsing/(I may be talking out of my ass here.)
In particular, you can do (a subset of) the following in sequence:
* write your own grammar in whatever bespoke language you want
* compose those grammars into a single grammar
* generate a Bison grammar from that grammar
* run `bison --xml` instead of actually generating code
* read the XML file and implement your own (trivial) runtime so you can easily handle ownership issues
In particular, I am vehemently opposed to the idea of implementing parsers separately using some non-proven tool/theory, since that way leads to subtle grammar incompatibilities later.
https://soft-dev.org/pubs/html/diekmann_tratt__dont_panic/
https://drops.dagstuhl.de/storage/00lipics/lipics-vol166-eco...
I don't know if that's specific to tree-sitter though, I'm sure there are other incremental parsers. I have to say that I've tried ANTLR and tree-sitter, and I absolutely love tree-sitter. It's a joy to work with.
Also Tree Sitter only does half the parsing job - you get a tree on nodes, but you have to do your own parse of that tree to get useful structures out.
I prefer Chumsky or Nom which go all the way.
No, it isn't. And incremental parsing is older than 2011 too (like at least the 70s).
For example: https://dl.acm.org/doi/pdf/10.1145/357062.357066
Because the grammar for a parser generator is usually much simpler than most general purpose programming languages, it is typically relatively straightforward to handwrite a parser for it.
If you get stuck on LR(0), here's the idea: you use a state machine with a stack, and (b) treat nonterminal symbols like any other input tokens. How? Like below. Say you had the rules:
#1: Add: Add '+' NUMBER;
#2: Add: NUMBER;
Initially, you could be in the beginning of any of these rules: either in rule #1 offset 0 (before the Add), or in rule #2 offset 0 (before the NUMBER). Push ({(#1, 0), (#2, 0)}, <null>) onto your stack, where the second list is the symbols you've seen so far (nothing so far). Now consume the first token in the input; let's say it was a NUMBER (say, "55"). Go through every possible new location in your stack, and update where it could be now, and push the new candidate locations along with the symbol you just saw, filtering out any locations where you went past the end of the rule. In this case, that means pushing ({(#1, 1), (#2, 1)}, NUMBER). Now, carefully examine your stack. You've seen [<null>, NUMBER] so far (i.e., just [NUMBER]), and you're either in the middle of rule #1, or at the end of rule #2. Well, it can't be #1, because the last symbol you saw was NUMBER, not Add. Therefore it's #2, and you've finished rule #2, which means you've recognized an Add. Therefore, pop all the symbols in the production of #2 from the stack (which leaves the stack empty), then push the output back (i.e. "reduce" it to the Add symbol). This leaves you with a stack that has just one item, ({(#1, 1)}, Add). Since you're no longer at the end of a production, consume the next input symbol, then go repeat this process, reducing as much as you can every time before consuming an input. [1]Once you understand this intuitively (and if you stare at it long enough, you'll realize the process I just explained resembles that of a regex NFA), then learn how to optimize this by preprocessing the set of possible locations into simple integers instead of entire sets, so you don't have to do O(|rules|) work every iteration. Then move onto LR(1), which is merely about disambiguating the rules using a lookahead. After that, GLR is "just" keeping a DAG instead of a stack (and it reduces to a linked-list version of a stack if there is no ambiguity)... best of luck.
[1] I lied a little here, in that the process starts with reducing, not shifting. Because you might need to reduce (possibly multiple times) before you consume any inputs. But that's easier to explain here after the fact, than when you're initially learning it.
(… I have.)
I've moved towards designing languages now that operate over CSV source. That adds an extra dimension while still enabling convenient editing - just turn off all the parsing behavior in the spreadsheet and you can edit it as "plain text". Although, column alignment isn't always desirable in this case.
I think you cover that with "debuggability" remark.
Here is something else. Yacc/Bison parsing uses its own stack for the symbols. Whereas recursive descent uses the regular run-time stack.
Pop quiz: which stack is visible to your garbage collector, and which isn't?
In TXR Lisp, which uses a Bison/Byacc generated parser, GC is suspended while yyparse() executes. Otherwise it would prematurely collect anything that is only stored in the Yacc stack, and so there would have to be a GC hook to traverse that stack (which would depend on undocumented features of the generated code, and likely have to have some separate coding for Bison versus Byacc.)
Precedence climbing is a natural and efficient way to parse expressions in a recursive decent parser. It's used by Clang in LLVM.
https://eli.thegreenplace.net/2012/08/02/parsing-expressions...
Exactly the recipe I swear by as well!
Like using a hyperlink to code example.
[0]: 1963 operator precedence description: https://dl.acm.org/doi/10.1145/321172.321179
let a = 12;
let b = a + 5;
...
Tree-Sitter will give you a tree like Node(type="file", range=..., children=[
Node(name="let_item", range=... children=[
Node(name="identifier", range=...)
Node(name="expression", range=..., children=[
Node(name="integer_literal", range=...)
...
Whereas Nom/Chumsky will give you: struct File {
let_items: Vec<LetItem>,
..
};
struct LetItem {
name: String,
expression: Expression,
};
...
Essentially Tree-Sitter's output is untyped, and ad-hoc, whereas Nom/Chumksy's is fully validated and statically typed.In some cases Tree-Sitter's output is totally fine (e.g. for syntax highlighting, or rough code intelligence). But if you're going to want to do stuff with the data like actually process/compile it, or provide 100% accurate code intelligence then I think Nom/Chumksy make more sense.
The downsides of Nom/Chunksy are: pretty advanced Rust with lots of generics (error messages can be quite something!), and keeping track of source code spans (where did the `LetItem` come from) can be a bit of a pain, whereas Tree-Sitter does that automatically.
Tree-sitter's output is closer to being "dynamic" than "untyped", though.
It's not too hard to build a layer on top of tree-sitter (out of the core lib) to generate statically typed APIs. I haven't felt the need for that yet, but it may be worth exploring.
> actually process/compile it
At work, I built a custom embedded DSL, using tree-sitter for parsing. It has worked well enough so far. The dynamically-typed nature of tree-sitter actually made it easier to port the DSL to multiple runtimes.
> provide 100% accurate code intelligence
Totally agree that tree-sitter cannot be used for this, if we are aiming for 100%.
Maybe you need to optimize the data, maybe you need to do some error checking. Lots of code is syntactically valid but not semantically valid, and usually those semantic errors will persist into the AST (in my limited experience).
> until you need to get your string through several levels of escape. how many backslashes to add? depends on how deep your pipe is and how each of those layers is defined
When the same character is used as both an open and close delimiter, you have to disambiguate between three possibilities: opening a new string, closing the current string (which may or may not be embedded) and a literal character as a constituent of the current string. By convention, an unescaped double-quote inside a string indicates closing that string, so you need different escapes to indicate opening embedded strings and constituents.
You could have done that by using two different escape characters, but for historical reasons there is only one escape character: the backslash. So that one character has to do double-duty to disambiguate two different cases. But in fact it's even worse than that because string parsers have a very shallow understanding of backslashes. To a string parser, a backslash followed by another character means only that the following character should be treated as a constituent. So you still need to disambiguate between actual constituents and opening an embedded string, and the only way to do that, because all you have is the backslash, is with more backslashes. The whole mess is just a stupid historical accident.
If you used balanced quotes you only have one case that needs to be escaped: constituents. So you never need multiple escapes.
Note that I made a mistake when I wrote:
> Only if you want to refer to [a close-quote character] literally as a closing quote rather than having it act as a closing quote.
You have to escape both open and close quotes to refer to them as constituents. In other words you would need to write something like this:
«Here is an example of a «nested string». The start of a nested string is denoted by a \« character. The end of a nested string is denoted by a \» character.»
Note that it doesn't matter how many levels deep you are:
«Even when you write «a nested string that refers to \« or \» characters» you only need one level of escape.»
Note that when you refer to quote characters as balanced pairs as in the examples above you don't actually need the escapes. The above strings will parse just fine even without the backslashes, and they will print out exactly as you expect. The only "problem" will be that they will contain embedded strings that you probably did not intend. The only time escapes are actually required is when referring to an quote characters as constituents without balancing them. This will always be the case if you refer to a close-quote without a corresponding preceding open-quote, which is the reason I got it wrong: escaping close-quotes will be more common than escaping open-quotes, but both will be needed occasionally.
Human language has a pretty clear distinction between syntax and sementics. This is how we recognize that "colorless green ideas sleep furiously" are perfectly well formed, if meaningless. In contrast, "I is happy" is meaningful and unambiguous, but grammatically incorrect.
In terms of syntax, English (like most, if not all) languages is literally ambigous.
Consider the sentence structure:
Subject Verb Object Prepositional-Phrase.
This can be either:
(Subject Verb (Object Prepositional Phrase))
Or
(Subject Verb Object ) Prepositional Phrase.
For instance, consider the sentence "I saw a man with binoculars".
In any sense of the word, this example is structually ambigous.
Exactly! So glad we're on the same page.
Language is created by and intended for big brains with huge amounts of knowledge about each other and about the world. Relying on external knowledge makes it extremely compact and flexible, but also means your parser needs a similar level of knowledge to function.
My favorite funny phenomenon in this area is when a sentence has two unambiguously different parses that mean exactly the same thing.
Politicians lie.
Cast iron sinks.
Politicians lie in cast iron sinks.
It's not actually ambiguous, but I think it's a lovely illustration of the subtleties of the problem.
An actually ambiguous example: I saw a politician lying in a cast iron sink.
"I couldn't fit the trophy in my suitcase because it was too big."
"I couldn't fit the trophy in my suitcase because it was too small."
I would also advocate the principle that you don't escape the escape character by doubling it. There are two problems with replacing \ with \\: firstly the length of the string doubles with each nested quotation; secondly you can't tell at a glance whether \\\\\\\\\\\\\\\\\\\n contains a newline character or an n because it depends on whether the number of backslashes is odd or even.
Another useful principle is to escape a quote character with a sequence that does not contain that character: then it is much easier to check whether the quotes are balanced because you don't need to check whether any of them are escaped.
So here's a possible algorithm for quoting a string: first identify the top-level quote characters that don't match (this is not totally trivial but it isn't difficult or computationally expensive); then, in parts of the string that are not inside nested quotes, but only there, replace « with \<, » with \>, and \ with \_ (say). Does that work?
That leaves only the problem of escaping the escape character, and here again there is no need to constrain ourselves to ascii. There is no reason that the escape character needs to be backslash. In fact, that is a particularly poor choice because backslash, being an ascii character, is extremely precious real estate. In fact, it is doubly precious because it actually has a balanced partner in the forward slash, so if you are going to use backslash for any special purpose it should be partnered with forward slash as a balanced set (which open up the problem of what to use for the directory delimiter in your operating system, but that's another can o' worms).
I think the Right Answer is simply to choose a different character to serve as the escape character inside balanced strings. My first pick would probably be ␛, but there are obviously a lot of other possibilities.
This points to a potential danger of this approach: there are a lot of unicode characters that render very similarly, like U and ᑌ. You would need to choose the unicode characters with special meanings very judiciously, and make sure that when you are writing code you have an editor that renders them in some distinctive way so you can be sure you're typing what you think you're typing. But that seems doable.
We're really not. I'm saying you've incorrectly defined the English language. By that definition, no piece of text is English.