Nevertheless, I think I'm learning more from this book than most other books I've tried before that are far more theoretical or abstract. I'm still eager to reach the chapter on implementing C types. I think it's a good book, but it requires more effort than something like Crafting Interpreters or Writing a Compiler/Interpreter in Go, while also covering topics not in those books.
Plus, you get to become proficient in OCaml, which is a pretty good language.
It feels like a more advanced version of Crafting Interpreters.
I haven’t looked at the OCaml implementation at all. The text and unit tests are all you need.
Discussion on the Ada Forum: https://forum.ada-lang.io/t/writing-a-c-compiler/1024
Is duck typing the pseudo-unsafe alternative? (Not unsafe as in accessing unsafe memory, but as in throwing exceptions if the duck-typed function doesn’t exist on the current type)
Can C handle both?
Coming from a static type system like rust and c#, i’m doing alot of “if this is a duck, duck.quack()” and i’m looking for faster alternatives and less verbosity if possible
data Tree a = Leaf a | Branch (Tree a) (Tree a)
fromList :: [a] -> Tree [a]
fromList = Leaf
toList :: Tree [a] -> [a]
toList (Leaf x) = x
toList (Branch (Leaf x) r) = x ++ toList r
toList (Branch (Branch l1 l2) r)
= toList (Branch l1 (Branch l2 r))
append :: Tree [a] -> Tree [a] -> Tree [a]
append = Branch
In a language that doesn't have tree pattern matching, the code wouldn't be this short and easy to understand, and I don't think it could be replicated just by having duck typing. Rust has pattern matching, but because it's primarily focused on lower-level concerns like pointers and memory ownership, pattern matching isn't this nice, because you have to pattern match on the pointer first.Since a compiler is all about tree manipulation, support for tree pattern matching should be a boon.
[0]: http://h2.jaguarpaw.co.uk/posts/demystifying-dlist/
[1]: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...
https://archive.org/details/byte-magazine-1978-09 (part 1)
All 3 parts of Tiny Pascal:
https://albillo.hpcalc.org/publications/Easter%20Egg%20-%20T...
On topic, though: wouldn't a simpler language (maybe even a pseudo language) be a better target for a first learning compiler. I understand they don't build a full C compiler, but still. It looks to me like there's a lot of complexity add from choosing such a lofty target.
For writing an interpreter or transpiler, there are probably better options, but for a true compiler I can’t think of a better choice than C (or at least a subset of C).
SML is very dated and the standard library and ecosystem lack many things that are considered table stakes for a viable programming language nowadays. And F# and Scala are fine as enterprise languages, but being tied to .NET and Java respectively makes them less desirable for implementing a language that won't itself be coupled to one of those runtimes.
So OCaml was probably the most mainstream choice among the languages with appropriate tools, as funny as that sounds. And honestly, once you get over the syntax, it doesn’t actually have anything outrageous.
It's been covered on several threads here over the years [1].
[0]: https://craftinginterpreters.com/ [1]: https://hn.algolia.com/?q=crafting+interpreters
I’ve been bored with building line-of-business applications, despite designing for complex requirements in high-volume distributed systems.
In fact I took a break from CS learning entirely 9 months ago. Even reading HN. I’ve been studying electronics and analog signal processing instead.
But now that I’ve built about 50 guitar pedals of increasing complexity, I feel ready to switch back to CS studies again.
https://www.amazon.com/Writing-Compiler-Programming-Language...
https://onlinebooks.library.upenn.edu/webbin/book/lookupid?k...
Aside from that, I encourage everyone who cites Compiler Construction to actually work through the first 10% of the book and then count the number of errata.
While they teach similar content, they have a different approach.
There are literally thousands of compiler design books out there, I don't really see anything particularly comparable between this book and Wirth's
That last book was Allen Holub's "Compiler Design in C", which is from 1990. Here's how the blurb on the back describes it:
> Allen I. Holub's Compiler Design in C offers a comprehensive, new approach to compilers that proves to be more accessible to computer science students than the other strictly mathematical books.
> With this method in mind, the book features three major aspects:
> (1) The author develops fully functional versions of lex and yacc (tools available in the UNIX® operating system to write compilers), (2) he uses lex and yacc to develop a complete C compiler that includes parts of C that are normally left out of compiler design books (eg., the complete C "type" system, and structures), and (3) the version of yacc developed here improves on the UNIX version of yacc in two ways (error recovery and the parser, which automatically produces a window-oriented debugging environment in which the parse and value stacks are visible).
It's out of print, but the author has made a searchable PDF available on his website [1]. I found it quite useful.
Holub seems to like the "learn by doing" approach. He's got another book, "Holub on Patterns" that teaches all the design patterns from the gang of four book organically by developing two programs that together use all of those patterns. The two programs are an embedded SQL interpreter and a GUI application for Conway's Game of Life.
PS: Ooh. It occurred to me that No Starch Press books are often available on O'Reilly Learning. I checked and this one is there. So I guess it is going on my "don't need but am doing out of curiosity" pile after all.
Several "compiler light" style articles and books kind of walk over that part, and it can be non-trivial to do properly, especially with modern expectations.
I remember way back in the day, one of the early C compilers for the PDP, and, honestly, it could almost be argued that ed(1) had better error messages than what that thing produced.
A lot of simple compilers hit an error and just give up.
So, just curious what the approach was in this book.
[0] actually from the readme in the github repo[1] it seems to be a C subset, not all of C
Was featured here a couple of times.
Unfortunately the timing of the release is quite unfortunate with regards to the summer holidays. Will take a look at it next year
https://news.ycombinator.com/item?id=40940799
So maybe you saw it then.
Each chapter of the book includes a test suite to run against the code you’ve written.
In some ways, the tests in this book feel very similar to the labs in the book Computer Systems: A programmers perspective — which is high praise!
Alternatively, I would like to learn about not just how to make a compiler, but also simultaneously a debugger, hot-reloading, etc.
However, it's also boring.
Nevertheless the contents of the book cover all the techniques required to write an assembler, if you'd really like to
If there is some library that can help create machine code from assembly instructions on a line by line basis (at least as opposed to invoking a separate program that generates the entire binary collectively from the assembly code), that could also work.
In my case, I already know enough of the lexer, parser, etc., parts. What's missing is going all the way to making a debugger, profiler, etc.
It's also "fun" if some instructions come in different sizes... and you may need stronger restrictions on allowed expressions in that case.
https://www.linuxfromscratch.org/
"Linux From Scratch (LFS) is a project that provides you with step-by-step instructions for building your own custom Linux system, entirely from source code."
The most obvious change you'll see is the use of SSA, which has become the dominant representation in IR starting 25-30 years ago.
There's also been an increase in the importance of compiler IRs, and especially the concept of code passing through multiple IRs before reaching machine code.
Formal semantics has become more of a thing in the past decade or so. It's now routine that even weak memory models have a detailed formal model of how they work. In LLVM, it's now a requirement that you demonstrate formal correctness of new InstCombine transformations (which are essentially peephole optimizations).
The use of parser generators has really fallen into disrepute; everything has transitioned to handrolled parsers these days. Language standards themselves are starting to rely on context-sensitive keywords, which are hard to implement in a generator-based lexer/parser setup.
Optimizations have generally broadened in scope; we're now seeing whole-function level of optimization being the default way to look at stuff for analysis, and there's been a variety of techniques introduced to make whole-program optimization (aka LTO) much more tractable.
Another subtle but major shift is that compilers are increasingly reliant on inferred analysis from dumber representations over the original declared intent of the code. For example, SROA in LLVM (which breaks up structs so that individual fields can be independently allocated in registers) relies not on looking at the way the struct is declared but the offsets within the struct that various load and store operations use.
A final major shift is the trend towards support for ancillary tooling in the programming language space, so that the compiler isn't merely a tool that goes from source to object code, but is something that can be actively queried for information about the source code. Things like the language server, or smart formatting, or automatic refactoring tooling.
That "middle-pass" approach that will let you address many targets is still valid; the trick is finding a sufficiently robust and flexible internal representation at the right level. You also have to be able to out-guess the chip vendors where before you could go to the architect or a complete "System" book and get the real scoop, including things you shouldn't do. Oddly enough, there is simultaneously useful and completely worthless documentation scattered about the internet.
You might want to take a look at Muchnick and Jones' _Program_Flow_Analysis_ (yes, it's from 1981) but chapters 4-6 can be applied at code-generation time. How that fits modern Intel processors (for example) is unknown. Idealizing your processor as a RISC-V might be a reasonable way to proceed but in the end, you'll have to translate the code for the target -- it will be reasonably straight-forward if you drive it all from tables but it's not trivial.
https://news.ycombinator.com/item?id=40940799
> So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago?
As far as I can tell, the main difference is that static single assignment (SSA) as an intermediate form was not the norm 30 years ago, but it is nowadays. Also, in newer books, it's more common to go over global register allocation now, whether that's graph coloring or linear scan register allocation. If you read old compiler books, the main optimizations they talk about are use-def chains, moving computations out of loops, and using the local and tree-based Sethi-Ullman register allocation algorithm.
Today most languages are front-ends for LLVM IR, but LLVM is very slow and takes a long time to optimize. Many new languages target x86/arm directly with their own weakly optimized backends, and output an LLVM IR for "release builds".
and, while we're talking about ocaml, ocaml does use ocamllex and ocamlyacc for its own parser
so, while you can certainly do without parser generators, they have very commonly been used for making real-world programming languages. almost every programming language anyone here has ever heard of was first implemented with a parser generator. the main exceptions are probably fortran, cobol, algol, lisps, c, and pascal
Now-a-days, the difference between "big compiler optimized" and "little compiler not optimized" can be quite dramatic; but, is probably no more than 4x — certainly within range of the distinction between "systems programming language" and "high tuned JITted scripting language". I think most people are perfectly fine with the performance of highly-tuned scripting languages. The result is that all of the overhead of "big compiler" is just ... immaterial; overhead. This is especially true for the case of extremely well-tuned code, where the algorithm and — last resort — assembly, will easily beat out the best optimizer by at least an order-of-magnitude, or more.
Were they? GCC abandoned bison in favour of their own parser relatively recently.
(imagine a medieval accountant trying to learn to do long division in roman numerals. he'll be much better off learning the western arabic numerals fibonacci is so excited about)
Modern Compiler Implementation in ML: https://www.cs.princeton.edu/~appel/modern/ml/
As an undergrad student, I think the C version is kinda easier to understand, though.
A good engineer should be able to use the right tool for the job
https://www.amazon.com/Retargetable-Compiler-Design-Implemen...
Like if I’d see the book on a shelf I would instantly guess it’s related to compilers. And I bet that’s completely intentional homage to the original.
so gcc has literally been using a parser-generator-generated parser for c for more than half its existence, at which point it had already become the most popular c compiler across the unix world and basically the only one used for linux, which had already mostly annihilated the proprietary unix market. it was also imposingly dominant in the embedded space
and i think that kind of development path is fairly typical; getting a parser up and running is easier with a parser generator, but it can be tricky to get it to do strange things when that's what you want (especially with lalr, less so with peg)
Most mainstream languages have a GC, and don't support distinguishing between values on the stack or references, don't need to deal with lifetimes or don't provide the safety you get with them, etc.
I'm curious though, could you give an example of syntax you consider convoluted, and how you would do it instead?
Not sure about Jai.
That's what JIT libraries do, for example asmjit: https://github.com/asmjit/asmjit/blob/master/test/asmjit_tes...
Also much of that work is heavily dependent on the used operating system.
Nevertheless, I'm wishing you all the best on your journey!
i use debuggers a lot when i'm programming in assembly, and from time to time when i'm programming in c or c++, but in ocaml i've never needed one. it's not that i've never had bugs in ocaml, but they tend to be of a different flavor, a flavor for which breakpoints are of little value
it sounds like your f# experience is different; what kinds of bugs have you recently found the debugger valuable for in f#?
debug logging is usually a superior option to breakpoint debugging for problems to which both are applicable, because debug logging shows you the whole history of your program's execution rather than just a single point in time. breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem. it's like grinding in a video game. i'd rather program the computer to do that labor for me, at which point i no longer need the debugger
(except in c and assembly)
1. I am developing on windows so that's an issue for me and
2. I don't use emacs, I use VScode and I've not been able to get the experimental debugger working for the VScode plugin.
It's not that the debugger is purely for fixing bugs; I use it as an active part of development. I want to be able to freeze the program and inspect values as they are running. I may know what the types or state of the program are just by viewing the code, but I want to be able to inspect the actual data for myself as the program is running.
> debug logging shows you the whole history of your program's execution rather than just a single point in time
breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs
> breakpoint debugging requires a lot of manual labor, painstakingly operating the machine to navigate the execution to the state that has the problem
I see things the opposite as you: print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source, unless you are constructing your program piecemeal with the repl, which I consider to be extremely tedious as well. To me, a debugger with breakpoints is a far more efficient way to code than print debugging.
I think there is a cultural difference in software engineering between people who use debuggers and people who don't. John Carmack once pointed out that people who come from the game dev and Windows/PC world use debuggers while people from the linux and web dev world tend not to. It seems to be a matter of preference/taste, and I think FP programmers seem to have a distaste for debuggers and graphical debugging/development environments
"... and graphical debugging/development environments" The loud ones might be saying they use Arch btw (on ThinkPads no less) and lean heavily on using neovim inside the current popular Rust rewrite of tmux, but I personally don't care for it.
VS Code is neat. I mostly still use Emacs because that's where the tooling investment has historically been, but my ideal is definitely much closer to Smalltalk-meets-Oberon.
i think that's true! but i don't think it's purely a matter of preference; it's also a matter of what the ecosystems support well and what you're trying to achieve. lisp is a huge exception to the rule about fp programmers; lisp systems, even including emacs, have extremely strong debugging support. but non-lisp fp languages tend to heavily emphasize static typing, thinking things through ahead of time, and correctness by construction, which reduce the need for debugging. but those are more valuable for writing compilers than for writing games or uis in general, where it's hard to define what counts as 'correct' ahead of time but easy to recognize it when you test it interactively
webdev of course has the problem that you can't stop your http response handler while the browser is waiting for a response. and, often, it's sadly not very concerned with ux. and it's often concerned with operating systems in a way where you have to debug problems after they occur, in part because of the scale of the problems
automated testing is another ecosystem thing that reduces the need for debuggers; to a significant extent it competes with static typing
one of the things i really appreciate about godot is being able to continuously adjust parameters and observe variables in my games as they're running. godot is motherfucking awesome, man. i definitely don't have a distaste for graphical debugging and development environments!
> breakpoints also provide stack traces, so they provide a kind of history as well. I'd rather inspect and interact with a program than dig through thousands of lines of logs (...) print debugging is tedious and requires restarting the program from the beginning whenever you make a change to the source
oh, see, i have programs like grep and emacs that dig through thousands of lines of logs for me. often when i'm debugging from logs i don't run the program at all; i just look at the logs and the source code. sometimes the program is running someplace i can't interact with it—memorably, on some occasions, on a satellite out of range of the groundstation. and usually exceptions on python or java give me a pretty decent stack trace, though other log messages unfortunately don't
there's another ecosystem support issue here—although i've sometimes developed on systems that supported fix-and-continue (cmucl, gforth, squeak, basic-80, godot), python and gcc support it very poorly or not at all. so for me i have to restart the program from the beginning whenever i make a change to the source in any case, whether i'm doing printf debugging or not. godot, again, is a very nice exception to this rule, and incidentally lets me add print debugging to the game while it's running
one of the great things about a breakpoint debugger, from my point of view, is that it makes it possible to add logging after the fact to a running program without editing the source or restarting it
i really appreciate you sharing your experience!
I meant to say the idea of a parser generator is a solution to a problem that that real world langs don't really have. When writing a programming language, your issue isnt how much time the parser is going to take to write, or how complex it's going to be. The parser is a relatively trivial part of the problem.
Due to language designers often being taught to develop langauges in this fashion, many have relied on these tools. But the modern view of compliers as "programming langauge UIs" and the focus on DX, i'd argue its actively pathological to use a parser generator.
Much academic work has, til recently, focused on these areas -- whereas today, the bulk of the difficulty is in understanding SSA/LLVM/ARM/Basic Optimizatiosn/etc. details which are "boring, circumstantial" etc. and not really research projects. I was just pointing this out since a lot of people, myself included, go down the EBNF parser-generator rabbit hole and think inventing a langauge is some formal exercise -- when the reality is the opposite: it's standard programming-engineering work.
but no language starts out as a 'real world lang'; every language is initially a toy language, and only becomes a 'real world lang' in the unlikely case that it turns out to be useful. and parser generators are very useful for creating toy languages. that's why virtually every real world lang you've ever used was implemented first using a parser generator, even if the parser you're using for it now is handwritten
having a formally defined grammar is also very helpful for dx features like syntax highlighting and automated refactoring
The f18 compiler’s parser uses parser combinations to construct a backtracking recursive descent parser that builds a parse tree for the whole source file before doing any semantic analysis. This approach allows good error recovery without having to revert any updates to the symbol table.
i assume that by 'parser combinations' you mean parser combinators
what i meant about fortran is that the first fortran compiler didn't use a parser generator
Repeating others, today’s compilers are really just “optimizing compilers”, there is no room for toying in production environments.
If you find some time to go through gcc bugzilla you'll find shockingly simple snippets of code that miscompiled (often by optimization passes), with fixes never backported to older versions that production environments like RHEL are using.
I still insist that a production grade compiler can’t leave performance on table. Which is where the current battlefield is.
in summary, optimizing compilers for c or pascal or zig or rust or whatever can only be used for code where considerations like compatibility, ease of programming, security, and predictability are more important than performance
probably the vast majority of production code is already in python and javascript, which don't even have reasonable compilers at all
You can write compilers in almost any language. I fail to see how C, C++, or even Java or Python aren’t the right tool for the job here. I like pattern matching too, but given that hundreds of successful production compilers have been written without pattern matching, it’s surely just a personal preference.
But even for hobby projects, it’s just a matter of personal preference. OCaml is great for implementing compilers. So are Go, C++, and Java.
Nevertheless, OCaml is very strong in compiler design. For example Rust and Hack were written in OCaml initially.
Nevertheless you are not wrong that compilers needing the very last bit of performance like the JVM and LLVM tend to be written in C++
But the barrier is quite a lot more tending to high performance/very high performance and not toy/production
Java and Python are suitable for implementing a toy Compiler and the auther invites you to use any language you like. Just the reference implementation is using OCaml
I would however argue that using C++ is quite advanced since it does not have pattern matching and using C is just masochm. You will be fighting against the language to do even trivial things instead of fighting the actual problem at hand
I was more so pushing back on the the implication that if it’s not OCaml, it’s not the right tool for the job.
Like, I honestly can’t think of a mainstream language in which it would be hard to implement a C compiler in.
Developers in Unix shops before the Linux era used debuggers.
Obviously, the GNU project developed a debugger for an audience. GNU wouldn't be a complete replacement for proprietary systems like Unix with only compilers, but no debugger.
My point was that it's necessary. How would you implement the same features Rust and C++ have, without garbage collection, but with simpler syntax?
Even plain and simple C99 compilers can be reasonably written by a solo dev, so a motivated small team...
For example, at $WORK I had to reimplement Apache's mod_rewrite.c in Rust, and I made it 100x faster for our particular use-case.
I could've done that in C as well, sure, but the simplicity of the C language just moves the complexity to the code itself; whereas, with Rust, I whipped out a prototype in just 3 days, and was able to freely pass around pointers to data, with zero allocations, zero copying, and every time it compiled I knew it was guaranteed to be safe.
You can't get that safety in C. You can't get that speed in Java.
You can't do this in any other language that does not have these features. I absolutely agree that the syntax is horrible... but I see no other way to achieve this.
Main post is about a C compiler, post I responded is saying
> When your computer was anemic, and could barely do the tasks required for it
which I could only interpret as a machine that can run a single process at a time, so not really about gpus or what.
if you've tried godot i'm interested to hear what you think about it
The obscene complexity of the rust language (like c++) makes a toolchain beyond anything reasonable to code alternatives: that reason alone is sufficient to avoid it.
You can have as many "features" you want, the anti-feature of absurd/grotesque syntax complexity _alone_ buries it.
There is nothing more to say or argue about. This is dead simple.
Your side has a significantly higher cost technical dependency than we don't have.
maybe you're talking about stuff you haven't released?
I will soon likely create a probabilistic programming language and compiler.
But how can you have assurance which grammar it defines, or that it even defines a well-defined grammar?
I’m well aware that some languages don’t bother defining a proper grammar, or define it without having a mechanism to ensure their implementation matches it, but lacking that assurance is exactly the drawback of not using a parser generator.