Writing a C Compiler: Build a Real Programming Language from Scratch (nostarch.com)

274 points by shoggouth 1 year ago | 156 comments

signaru 1 year ago |

Have read the first few chapters and it expects that you either read the accompanying source code or implement your own and pass the tests. The pseudocode presented in the book often looks like function calls, with the inner details left out of the book. Furthermore, as already pointed out in another comment, the available implementation is in OCaml, which is probably not something many C programmers have experience with.

Nevertheless, I think I'm learning more from this book than most other books I've tried before that are far more theoretical or abstract. I'm still eager to reach the chapter on implementing C types. I think it's a good book, but it requires more effort than something like Crafting Interpreters or Writing a Compiler/Interpreter in Go, while also covering topics not in those books.

wrycoder 1 year ago | |

Nand2Tetris is also like that - they provide an outline and tests, but you have to do the work. And, having the implementation language be different from the target language reduces confusion.

Plus, you get to become proficient in OCaml, which is a pretty good language.

kragen 1 year ago | | |

that's a good point—it was pretty confusing when i wrote ur-scheme in scheme, or stoneknifeforth in stoneknifeforth, because i kept getting confused about which level of the language i was changing things in

myko 1 year ago | |

I thought this book looked neat but closed the tab before reading the comments here, and after this one decided to go ahead and buy it. Sounds really fun!

synack 1 year ago |

I’ve been working through this book implementing the compiler in Ada. So far, I’m really enjoying it. The book doesn’t make too many assumptions about implementation details, leaving you free to experiment and fill in the blanks yourself.

It feels like a more advanced version of Crafting Interpreters.

I haven’t looked at the OCaml implementation at all. The text and unit tests are all you need.

Discussion on the Ada Forum: https://forum.ada-lang.io/t/writing-a-c-compiler/1024

francogt 1 year ago |

I see many comments saying that the book implements the C compiler in OCaml. In the introduction the author states that the book actually uses pseudocode, so you are free to implement it in any language. The only recommendation is that you use a language with pattern matching, because the pseudocode makes heavy use of it. The reference implementation is in OCaml.

markus_zhang 1 year ago | |

Thanks, can you please lemme know which part uses pattern matching? I'd assume mostly the lexer, but the parser should just be something that consumes the tokens and spits out an AST. Unless of course it combines the two.

shawn_w 1 year ago | | |

Presumably anything that walks the syntax tree.

CrimsonCape 1 year ago | |

Question for HN, pattern matching is defined as “runtime type/value checking”, is that correct?

Is duck typing the pseudo-unsafe alternative? (Not unsafe as in accessing unsafe memory, but as in throwing exceptions if the duck-typed function doesn’t exist on the current type)

Can C handle both?

Coming from a static type system like Rust and C#, i’m doing a lot of “if this is a duck, duck.quack()” and i’m looking for faster alternatives and less verbosity if possible

trealira 1 year ago | | |

One thing is that pattern matching can make tree manipulation code succinct and easier to read. For example, take this article[0] that describes the difference list algorithm (in Haskell). Basically, it's kind of like a rope, but for lists. It's a tree with lists at the leaves, and when you want to convert it into a list, you rewrite the tree to be right-leaning, and then concatenate all the lists at once. This turns repeated concatenation at the end of lists from a quadratic-time operation into a linear-time one (repeated strcat calls are an example of the quadratic case in C [1]). The code can be written like this:

  data Tree a = Leaf a | Branch (Tree a) (Tree a)

  fromList :: [a] -> Tree [a]
  fromList = Leaf

  toList :: Tree [a] -> [a]
  toList (Leaf x) = x
  toList (Branch (Leaf x) r) = x ++ toList r
  toList (Branch (Branch l1 l2) r)
               = toList (Branch l1 (Branch l2 r))

  append :: Tree [a] -> Tree [a] -> Tree [a]
  append = Branch

In a language that doesn't have tree pattern matching, the code wouldn't be this short and easy to understand, and I don't think it could be replicated just by having duck typing. Rust has pattern matching, but because it's primarily focused on lower-level concerns like pointers and memory ownership, pattern matching isn't this nice, because you have to pattern match on the pointer first.

Since a compiler is all about tree manipulation, support for tree pattern matching should be a boon.
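
To make that concrete for a compiler pass, here is a minimal sketch of constant folding over a tiny expression tree. The Expr type and the foldConsts and simplify names are made up for illustration; they are not from the book, its reference implementation, or the article linked below.

```haskell
-- Hypothetical mini-AST, for illustration only.
data Expr
  = Const Int
  | Var String
  | Add Expr Expr
  | Mul Expr Expr
  deriving (Eq, Show)

-- Fold constant subexpressions bottom-up: children first, then the node.
foldConsts :: Expr -> Expr
foldConsts (Add l r) = simplify (Add (foldConsts l) (foldConsts r))
foldConsts (Mul l r) = simplify (Mul (foldConsts l) (foldConsts r))
foldConsts e         = e

-- Rewrite a single node whose children are already folded.
-- Each equation is one tree-shaped pattern; the first match wins.
simplify :: Expr -> Expr
simplify (Add (Const a) (Const b)) = Const (a + b)
simplify (Mul (Const a) (Const b)) = Const (a * b)
simplify (Add e (Const 0))         = e
simplify (Add (Const 0) e)         = e
simplify (Mul e (Const 1))         = e
simplify (Mul (Const 1) e)         = e
simplify e                         = e
```

In GHCi, foldConsts (Add (Var "x") (Mul (Const 2) (Const 3))) gives Add (Var "x") (Const 6). Without pattern matching, each simplify equation would turn into nested tag checks and field accesses.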

[0]: http://h2.jaguarpaw.co.uk/posts/demystifying-dlist/

[1]: https://en.wikipedia.org/wiki/Joel_Spolsky#Schlemiel_the_Pai...

ashconnor 1 year ago | |

Useful list considering that feature: https://en.wikipedia.org/wiki/Category:Pattern_matching_prog...

songbird23 1 year ago | |

I can implement it in rust?

WalterBright 1 year ago |

I learned how to write a compiler by studying BYTE magazine in the 70's which published the source to a complete Pascal compiler as an article!

https://archive.org/details/byte-magazine-1978-09 (part 1)

All 3 parts of Tiny Pascal:

https://albillo.hpcalc.org/publications/Easter%20Egg%20-%20T...

barelyauser 1 year ago | |

The Byte magazine is incredible. First time reading it. The archive.org collection is a gold mine for learning. Thank you very much for posting it.

nj5rq 1 year ago | |

Thank you for sharing this, very useful. The BYTE magazine is absolutely amazing, it's a shame nothing similar could be done today.

hasbot 1 year ago |

So what's different about writing a compiler in 2024 than say 10, 20, or 30 years ago? When I started writing compilers in the 80's and 90's lex/flex and yacc/bison were popular. ANTLR came out but I never had a chance to use it. Everything after lexing and parsing was always hand rolled.

jerjerjer 1 year ago |

I uh misread the title and thought someone built a C compiler in Scratch.

On topic, though: wouldn't a simpler language (maybe even a pseudo language) be a better target for a first learning compiler? I understand they don't build a full C compiler, but still. It looks to me like there's a lot of complexity added by choosing such a lofty target.

tuveson 1 year ago | |

What do you think would make a better target? C maps pretty closely to assembly, so it seems like it would be the simplest. Maybe Pascal or BASIC, but most people these days don’t have experience with Pascal, and BASIC would probably be too simple for a full-length book.

For writing an interpreter or transpiler, there are probably better options, but for a true compiler I can’t think of a better choice than C (or at least a subset of C).

fuhsnn 1 year ago |

chibicc[0] complements this book nicely: in addition to a basic compiler, it guides you through writing the preprocessor and driver, which, although not addressed much in the literature, are the missing link between the compiler built from the book and real C projects.

[0] https://github.com/rui314/chibicc

markus_zhang 1 year ago | |

Thanks, I wish the companion book were ready!

carom 1 year ago |

I took a compilers course in university and the course culminated in having a compiler for C Minus (a subset of C). The professor noted how each year the line count of the compilers was dropping as students found libraries or languages that made it easier. I think the evolution was Java -> Antlr -> Python. I used OCaml and emitted LLVM and blew that metric out of the water.

ccmcarey 1 year ago | |

Blew it out of the water with more or less lines of code? :)

carom 1 year ago | | |

Far fewer, to the point of another student asking me what I even did for the project because I didn't have to implement any of the algorithms.

the_panopticon 1 year ago |

In OCaml, interesting. I was similarly surprised when I learned that the first Rust compiler was written in OCaml, too https://users.rust-lang.org/t/understanding-how-the-rust-com...

bunderbunder 1 year ago | |

ML (short for "meta-language") was originally designed for use in programming language research, and really shines for that purpose. And OCaml is probably the most pragmatic dialect for the purpose.

SML is very dated and the standard library and ecosystem lack many things that are considered table stakes for a viable programming language nowadays. And F# and Scala are fine as enterprise languages, but being tied to .NET and Java respectively makes them less desirable for implementing a language that won't itself be coupled to one of those runtimes.

mananaysiempre 1 year ago | |

Tree processing is best done in a language with decent algebraic datatypes and pattern matching. I would’ve preferred Standard ML, but, well, pot-ay-to, pot-ah-to. Haskell is another choice but the techniques you need to use there (while undeniably gaining you some possibilities) don’t really generalize to other languages, so you’re now writing a book about compiler construction in Haskell rather than just compiler construction. Ditto for Rust. Kotlin has deliberately anemic pattern matching. C# or F# leave you depending on Microsoft’s benevolence (sic). Metalua and Sweet.js both have decent ADT support but both are pretty much dead. Racket exists, I guess, and there are some pattern-matching libraries for normal Scheme as well, but the charisma malus of the parenthesis is real even if I don’t understand what causes it.

So OCaml was probably the most mainstream choice among the languages with appropriate tools, as funny as that sounds. And honestly, once you get over the syntax, it doesn’t actually have anything outrageous.

Coolbeanstoo 1 year ago |

This looks cool, been interested in learning more about compilers since I did the basics in college. Lots of things seem to focus on making interpreters and never make it to the code generation part, so it's nice to see that this features information about that.

spinningslate 1 year ago | |

No disrespect to the book that's the subject of this thread, which I haven't read, but Bob Nystrom's Crafting Interpreters [0] is a fantastic book. It covers all phases of compilation, including both an interpreter and a VM.

It's been covered on several threads here over the years [1].

[0]: https://craftinginterpreters.com/

[1]: https://hn.algolia.com/?q=crafting+interpreters

jcpst 1 year ago | | |

I remember seeing this a while back. That typesetting is beautiful. Thank you for bringing it up here, I might have to pick that one up.

I’ve been bored with building line-of-business applications, despite designing for complex requirements in high-volume distributed systems.

In fact I took a break from CS learning entirely 9 months ago. Even reading HN. I’ve been studying electronics and analog signal processing instead.

But now that I’ve built about 50 guitar pedals of increasing complexity, I feel ready to switch back to CS studies again.

agent281 1 year ago | | |

This book covers compiling to assembly whereas Crafting Interpreters only has a bytecode VM implementation. We'll see how good this book is when it drops, but I think that's a worthwhile feature that Crafting Interpreters punted on.

shoggouth 1 year ago |

It also will be available via Amazon after August 20, 2024.

https://www.amazon.com/Writing-Compiler-Programming-Language...

sergius 1 year ago |

How does it compare with N. Wirth's?

https://onlinebooks.library.upenn.edu/webbin/book/lookupid?k...

cxr 1 year ago | |

Wirth's book does not implement a "real" programming language. Whatever your thoughts on Oberon and Pascal-like SHOUTCASE languages, it's largely irrelevant. Oberon is arguably a "real" language (and operating system), but Wirth's book does not cover the implementation of Oberon. It covers the implementation of Oberon0, an inarguably toy subset of Oberon. (Actually, "subset" is not even correct.) The example code has also diverged from the book, with Wirth abandoning the strategy described in the book for avoiding redundant initialization of the module static base, among other things.

Aside from that, I encourage everyone who cites Compiler Construction to actually work through the first 10% of the book and then count the number of errata.

hdbxbxndj 1 year ago | |

The book is a very hands-on tutorial, whereas Wirth's is basic literature for the general case.

While they teach similar content, they have a different approach.

There are literally thousands of compiler design books out there, and I don't really see anything particularly comparable between this book and Wirth's.

anta40 1 year ago | |

Similar to studying OS concepts using Silberschatz's Operating System Concepts and Tanenbaum's Operating Systems Design and Implementation. The former only explains the theoretical ideas, while the latter is the documentation of an implementation.

tzs 1 year ago |

I don't really need to know how to build a compiler, and I've got enough other "don't need but am doing out of curiosity" things going on that I don't need any more of those, but if it wasn't $70 I'd probably get it anyway. It would be interesting to compare it to the last compiler-building book I read and see how things have changed. Based on the comments here a lot has changed.

That last book was Allen Holub's "Compiler Design in C", which is from 1990. Here's how the blurb on the back describes it:

> Allen I. Holub's Compiler Design in C offers a comprehensive, new approach to compilers that proves to be more accessible to computer science students than the other strictly mathematical books.

> With this method in mind, the book features three major aspects:

> (1) The author develops fully functional versions of lex and yacc (tools available in the UNIX® operating system to write compilers), (2) he uses lex and yacc to develop a complete C compiler that includes parts of C that are normally left out of compiler design books (eg., the complete C "type" system, and structures), and (3) the version of yacc developed here improves on the UNIX version of yacc in two ways (error recovery and the parser, which automatically produces a window-oriented debugging environment in which the parse and value stacks are visible).

It's out of print, but the author has made a searchable PDF available on his website [1]. I found it quite useful.

Holub seems to like the "learn by doing" approach. He's got another book, "Holub on Patterns" that teaches all the design patterns from the gang of four book organically by developing two programs that together use all of those patterns. The two programs are an embedded SQL interpreter and a GUI application for Conway's Game of Life.

PS: Ooh. It occurred to me that No Starch Press books are often available on O'Reilly Learning. I checked and this one is there. So I guess it is going on my "don't need but am doing out of curiosity" pile after all.

[1] https://holub.com/compiler/

whartung 1 year ago |

What approach does this book take to error recovery?

Several "compiler light" style articles and books kind of walk over that part, and it can be non-trivial to do properly, especially with modern expectations.

I remember way back in the day, one of the early C compilers for the PDP, and, honestly, it could almost be argued that ed(1) had better error messages than what that thing produced.

A lot of simple compilers hit an error and just give up.

So, just curious what the approach was in this book.
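
For reference, the usual textbook alternative to giving up is panic-mode recovery: report the error, skip ahead to a synchronizing token, and keep parsing. Below is a minimal sketch of that idea, not a description of what this book does. The Token and Stmt types and the one-statement grammar (ident = num ;) are invented for illustration.

```haskell
-- Hypothetical token and statement types, for illustration only.
data Token = TIdent String | TNum Int | TAssign | TSemi
  deriving (Eq, Show)

data Stmt = Assign String Int
  deriving (Eq, Show)

-- Parse a whole token stream, collecting every error message
-- instead of stopping at the first one.
parseProgram :: [Token] -> ([String], [Stmt])
parseProgram [] = ([], [])
parseProgram (TIdent x : TAssign : TNum n : TSemi : rest) =
  let (errs, stmts) = parseProgram rest
  in (errs, Assign x n : stmts)
parseProgram toks@(t : _) =
  -- No statement starts here: report it, then resynchronize.
  let (errs, stmts) = parseProgram (synchronize toks)
  in (("unexpected " ++ show t) : errs, stmts)

-- Panic mode: drop tokens up to and including the next semicolon.
-- At least one token is always consumed, so parsing terminates.
synchronize :: [Token] -> [Token]
synchronize = drop 1 . dropWhile (/= TSemi)
```

Feeding it a = 1; b 2; c = 3; as tokens yields one error for the malformed middle statement and still returns the two valid assignments on either side of it.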

badsectoracula 1 year ago |

Weird that this is about building a C compiler[0] in OCaml. I expected the implementation language to also be C, both for consistency and because i'm willing to bet that there are more people who can read C than OCaml.

[0] actually from the readme in the github repo[1] it seems to be a C subset, not all of C

[1] https://github.com/nlsandler/nqcc2

quibono 1 year ago |

I swear I've seen this cover before... is this a new release or an updated edition of an older book?

halfcat 1 year ago | |

”Automate the Boring Stuff with Python” has a similar cover, by the same publisher.

jdnendm 1 year ago | |

Book is not yet published but has been in early access for a couple of years.

Was featured here a couple of times.

The timing of the release is quite unfortunate with regard to the summer holidays. Will take a look at it next year.

sgbeal 1 year ago | | |

> Book is not yet published but has been in early access for a couple of years.

According to the top post's link, it was released in July 2024.

byteplane 1 year ago | | |

It’s actually out now, I have a copy! Ordered directly from No Starch Press.

thejteam 1 year ago | |

There was a HN article about the same book about a month ago:

https://news.ycombinator.com/item?id=40940799

So maybe you saw it then.

orktes 1 year ago | |

Many compiler-related books take inspiration from the "Dragon book" (Compilers: Principles, Techniques and Tools). So there are likely lots of books with similar-looking covers.

hdbxbxndj 1 year ago | | |

The cover looks nothing like the dragon book however?

Almondsetat 1 year ago | |

I believe the author first started by making blog posts and then interrupted them to simply make a book about it

sunday_serif 1 year ago |

I’m working through this book now and really enjoying it!

Each chapter of the book includes a test suite to run against the code you’ve written.

In some ways, the tests in this book feel very similar to the labs in the book Computer Systems: A Programmer's Perspective, which is high praise!

alok-g 1 year ago |

I would love to see a book that talks about going all the way to generate machine code, i.e., not stopping at generation of assembly.

Alternatively, I would like to learn about not just how to make a compiler, but also simultaneously a debugger, hot-reloading, etc.

synack 1 year ago | |

The debugger book is coming soon. https://nostarch.com/building-a-debugger

alok-g 1 year ago | | |

Awesome! Thanks.

hdbxbxndj 1 year ago | |

Writing a simple assembler is trivial. Even macro assemblers are very easy.

However, it's also boring.

Nevertheless, the contents of the book cover all the techniques required to write an assembler, if you'd really like to.

alok-g 1 year ago | | |

I understand that assembly file can be parsed in the same way. However, I want to learn about the machine instructions to the level of bits, and likewise the layouts of binary files. Unless I am able to go all the way to machine code loaded in memory, I would not know where in memory to add a breakpoint instruction when a developer wants the same on a line of code.

If there is some library that can help create machine code from assembly instructions on a line by line basis (at least as opposed to invoking a separate program that generates the entire binary collectively from the assembly code), that could also work.

In my case, I already know enough of the lexer, parser, etc., parts. What's missing is going all the way to making a debugger, profiler, etc.
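
To give a flavor of the bit-level part, here is a minimal sketch that hand-encodes three 32-bit x86 instructions and computes the byte offset where each one starts. The Reg and Instr types and the function names are hypothetical, and this is nowhere near a full assembler. The opcode bytes themselves are standard x86: B8+r followed by a little-endian imm32 for mov r32, imm32; C3 for ret; CC for int3.

```haskell
import Data.Bits (shiftR, (.&.))
import Data.Word (Word8, Word32)

-- Hypothetical instruction subset, for illustration only.
-- Constructor order matches the x86 register numbers 0..3.
data Reg = EAX | ECX | EDX | EBX
  deriving (Eq, Show, Enum)

data Instr
  = MovImm Reg Word32  -- mov r32, imm32
  | Ret                -- ret
  | Int3               -- int3, the one-byte breakpoint instruction
  deriving (Eq, Show)

-- Split a 32-bit value into four bytes, least significant first.
le32 :: Word32 -> [Word8]
le32 w = [fromIntegral ((w `shiftR` s) .&. 0xFF) | s <- [0, 8, 16, 24]]

-- Encode one instruction as raw machine-code bytes.
encode :: Instr -> [Word8]
encode (MovImm r imm) = (0xB8 + fromIntegral (fromEnum r)) : le32 imm
encode Ret            = [0xC3]
encode Int3           = [0xCC]

-- Byte offset at which each instruction starts in the output.
offsets :: [Instr] -> [Int]
offsets = init . scanl (+) 0 . map (length . encode)
```

A table like the one offsets produces is what connects a source line to an address: a debugger sets a software breakpoint by saving the byte at that address and overwriting it with 0xCC.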

peterfirefly 1 year ago | | |

There can be weird interactions unless there are strong enough limits on what kind of expressions the assembler allows. Especially if it supports conditional assembly and loops in the macros. One ugly way around it -- which causes its own headaches -- is to introduce pass-sensitive conditional assembly (as in "if in pass 1/2/...").

It's also "fun" if some instructions come in different sizes... and you may need stronger restrictions on allowed expressions in that case.

sim7c00 1 year ago |

cool, remember some tutorials online i think from the same author (not 100% sure) doing stuff around c compilation in python. shame it's not in a language i want to learn. the other book on compilers i got is almost too heavy to lift! :D

i_don_t_know 1 year ago |

Somewhat unrelated: Is there a book that walks you through building a database system from storage to queries, optimizer, execution, indexing, transactions, etc?

rednab 1 year ago | |

Database Design and Implementation, ISBN 3030338355 ¹). Java source code for the SimpleDB system from the book available from the author's website ²).

¹) https://www.amazon.com/dp/3030338355/

²) http://www.cs.bc.edu/~sciore/simpledb/

kragen 1 year ago | |

transaction processing by gray (rip) and reuter was pretty close back in the 90s. i don't think it covered query optimization because it's really about tp monitors rather than databases, but, perhaps surprisingly, it does cover the other topics you're asking about

rramadass 1 year ago | |

In the early 90's Al Stevens wrote 2 books C Database Development and C++ Database Development with source code which might be a good starting point.

myth_drannon 1 year ago | | |

Interesting suggestion! here is the book on archive.org: https://archive.org/details/cdatabasedevelop00stev/mode/2up

sylware 1 year ago |

I wonder why there is not the same book for c++... mmmmh... I really wonder... (irony).

sylware 1 year ago | |

It is because c++ has an absurdly and grotesquely massive and complex syntax (like rust...).

stevefolta 1 year ago | | |

Yeah, Rust is the language for people who think C++ is not complex (or hostile) _enough_.

viraj_shah 1 year ago |

Dropping this one here! (no affiliation)

https://www.linuxfromscratch.org/

"Linux From Scratch (LFS) is a project that provides you with step-by-step instructions for building your own custom Linux system, entirely from source code."

pull_my_finger 1 year ago | |

Why though? It doesn't seem to be related at all to the OP other than both are tutorial books?

jsnnsjxj 1 year ago | |

This has nothing to do with the post?