Writing a C compiler in 500 lines of Python

Writing a C compiler in 500 lines of Python (vgel.me)

510 points by vgel 2 years ago | 165 comments

brundolf 2 years ago |

> Instead, we'll be single-pass: code generation happens during parsing

IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once

speps 2 years ago | |

Linked from another thread: http://cm.bell-labs.co/who/dmr/chist.html

It explains the memory limits and what happened :)

> After the TMG version of B was working, Thompson rewrote B in itself (a bootstrapping step). During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size. For example, B introduced generalized assignment operators, using x=+y to add y to x. The notation came from Algol 68 [Wijngaarden 75] via McIlroy, who had incorporated it into his version of TMG. (In B and early C, the operator was spelled =+ instead of += ; this mistake, repaired in 1976, was induced by a seductively easy way of handling the first form in B's lexical analyzer.)

bsder 2 years ago | | |

Infix parsing chews up a remarkable amount of code and memory.

It's scary just how much easier it is to parse languages without infix parsing.

bee_rider 2 years ago | | |

I wonder why =+ is so obviously a mistake. It does look vaguely wrong for some reason, but I’m prejudiced by current languages.
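
One concrete reason, as a minimal sketch: with the old spelling, `x=-1` reads as "subtract 1 from x" rather than "assign -1 to x". The tokenizer below is a hypothetical greedy (longest-match) lexer written for illustration, not B's actual lexical analyzer; `tokenize`, `OLD_STYLE`, and `NEW_STYLE` are made-up names.

```python
def tokenize(src, operators):
    """Greedy longest-match tokenizer over a fixed operator list (illustrative only)."""
    ops = sorted(operators, key=len, reverse=True)  # try longer operators first
    tokens, i = [], 0
    while i < len(src):
        ch = src[i]
        if ch.isspace():
            i += 1
        elif ch.isalnum() or ch == "_":
            # Identifier or number: consume the whole run of word characters.
            j = i
            while j < len(src) and (src[j].isalnum() or src[j] == "_"):
                j += 1
            tokens.append(src[i:j])
            i = j
        else:
            for op in ops:
                if src.startswith(op, i):
                    tokens.append(op)
                    i += len(op)
                    break
            else:
                raise SyntaxError(f"unexpected character {ch!r}")
    return tokens


OLD_STYLE = ["=", "+", "-", "=+", "=-"]  # B / early C spelling
NEW_STYLE = ["=", "+", "-", "+=", "-="]  # spelling used after the change

old_tokens = tokenize("x=-1", OLD_STYLE)  # '=-' is one token: subtract 1 from x
new_tokens = tokenize("x=-1", NEW_STYLE)  # '=' then '-': assign -1 to x
```

With the operator after the `=`, a unary minus or plus directly following an assignment gets swallowed into the compound operator unless the programmer adds a space.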

WalterBright 2 years ago | |

You're exactly right. This makes for a small, memory-efficient compiler. But this entails a lot of compromises that we're not willing to put up with anymore, because there's no longer a reason to.

vgel 2 years ago | |

I'm not sure, haven't looked at the codebases of old compilers in a long time. Definitely a lot of the language is pretty amenable to it, especially if you have unstructured jumps for e.g. the for advancement statement. I had a distinct feeling while writing the compiler every time I added a new feature that "wow, the semantics work exactly how I'd like them to for ease of implementation."

Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

eru 2 years ago | | |

What you are saying is true for a naive C compiler.

Once you want to optimize or analyse, things become more complicated.

> Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.

Type inference also spans a whole function, so you can't do it in a single pass through the code. (But it's still tamer than in eg Haskell, where type inference considers your whole program.)

kazinator 2 years ago | |

In C, nothing you have not parsed yet (what is to the right) is necessary for what you've already parsed (what lies to the left). (Necessary for type checking it or translating it.)

E.g. to call a function later in the file, you need a prior declaration. Or else, an implicit one (possibly wrong) will be assumed from the call itself.

This is not true in some C++ situations, like class declarations. In a class definition there can be functions with bodies. Those are inline functions. The functions can freely refer to each other in either direction. Type checking a class declaration therefore requires all of it to be parsed.

A one pass language is advantageous even if you're building a serious multi-pass compiler with optimization. This is because that exercise doesn't require an AST! Multi-pass doesn't mean AST.

Building an AST doesn't require just more memory, but more code and development work: more complexity in the code to build the abstraction and to traverse it. It's useful if you need to manipulate or analyze the program in ways that are closely related to the source language. If the language cannot be checked in one pass, you might need one; you wouldn't want to be doing the checking on an intermediate representation, where you've lost the relationship to the code. AST building can be reused for other purposes, like code formatting, refactoring, communicating with an IDE for code completion and whatnot.

If the only thing you're going to do with an AST is walk it up and down to do some checks, and then to generate code, and you do all that in an order that could have been done without the AST (like a bottom-up, left to right traversal), then it was kind of a waste to construct it; those checks and generation could have been done as the phrase structure rules were parsed.

frutiger 2 years ago | |

I once read that this is why the MSVC compiler didn't support two-pass template instantiation until very recently: the original compiler implemented templates almost like a macro that re-emitted a stream of tokens with the template parameters replaced.

pragma_x 2 years ago | |

I can't say if that was a design goal, but it sure looks like it. That's also the way to avoid scaling compiler memory use to program size.

At first I thought that it wasn't possible for C. After I thought about it, as long as you disallow forward references, and rely on a single source file as input, it's possible to compile a complete C program in one pass. Anything else requires a preprocessor (e.g. "#include") and/or linker (e.g. "extern" and prototypes) to solve. The implementation in the article dodges all of these and focuses on a very pure subset of C.

brundolf 2 years ago | | |

I think this goal may also have shifted over time. I remember when I learned C, we used c89, which required declaring all local variables at the top of the block. This seemed like a weird/arbitrary requirement at the time (and is no longer required in later versions), but it makes a lot of sense in a single-pass context! It would allow the stack frame for the current function to be fully sized before any other logic is compiled
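
A rough sketch of that idea, under stated assumptions: if every declaration appears before the first statement, a one-pass compiler can assign each local a frame offset and know the total frame size before emitting any code for the body. The type sizes, the natural-alignment rule, and the name `layout_frame` are all made up for illustration and are not taken from any real compiler.

```python
SIZES = {"char": 1, "short": 2, "int": 4}  # assumed sizes in bytes


def layout_frame(decls):
    """Assign a frame offset to each (type, name) declaration, in source order.

    Returns (offsets, frame_size). Each variable is aligned to its own size.
    """
    offsets, size = {}, 0
    for ctype, name in decls:
        width = SIZES[ctype]
        size = (size + width - 1) // width * width  # round up to alignment
        offsets[name] = size
        size += width
    return offsets, size


# Declarations as they would appear at the top of a C89 block:
#     char c; int i; short s;
offsets, frame_size = layout_frame([("char", "c"), ("int", "i"), ("short", "s")])
```

Once `frame_size` is known, the function prologue can be emitted immediately, without going back to patch it after the body has been compiled.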

Gibbon1 2 years ago | | |

I think one thing some early compilers did was read the source serially in one pass and write the output serially in one pass. If you were doing multiple passes you had to do that for each pass. That means your compiler speed is IO bound. So one pass is faster.

My cousin's ex had a workflow in the late 70's that involved two floppy drives and a little dance to compile and link. Later he got a 5M hard drive which improved things a lot.

zabzonk 2 years ago | |

using recursive descent, you don't need to build an ast

jjtheblunt 2 years ago | | |

the call stack during recursive descent is an ephemeral ast, for the recursive descent parsers I've written.
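
To make that concrete, here is a minimal sketch of a recursive descent parser that emits stack-machine code as it parses and never builds a tree. It is not the article's compiler: the grammar is limited to integers, `+`, `-`, `*`, and parentheses, and `compile_expr` is a hypothetical name. The output uses WebAssembly-style instruction names.

```python
import re


def compile_expr(src):
    """Compile an arithmetic expression to stack code in one pass, with no AST."""
    # Illustrative lexer: anything that is not a number or operator is skipped.
    tokens = re.findall(r"\d+|[-+*()]", src)
    pos = 0
    out = []

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        if pos >= len(tokens):
            raise SyntaxError("unexpected end of input")
        tok = tokens[pos]
        pos += 1
        return tok

    def factor():
        tok = advance()
        if tok == "(":
            expr()
            if advance() != ")":
                raise SyntaxError("expected ')'")
        elif tok.isdigit():
            out.append(f"i32.const {tok}")
        else:
            raise SyntaxError(f"unexpected token {tok!r}")

    def term():
        factor()
        while peek() == "*":
            advance()
            factor()
            out.append("i32.mul")  # emitted once both operands are on the stack

    def expr():
        term()
        while peek() in ("+", "-"):
            op = advance()
            term()
            out.append("i32.add" if op == "+" else "i32.sub")

    expr()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return out


code = compile_expr("1 + 2 * (3 - 4)")
```

The nesting of `expr`, `term`, and `factor` calls mirrors the tree that an AST-building parser would have allocated; here it exists only on the Python call stack while each subexpression is being compiled.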

pjmlp 2 years ago | | |

Only if the compiler doesn't do anything beyond basic peephole optimizations.

mati365 2 years ago |

I made a similar project in TypeScript[1]. Basically a multipass compiler that generates x86 assembly, compiles it to binary and runs it. The worst parts were the register allocator, designing the IR code, and the assembler.

[1] https://github.com/Mati365/ts-c-compiler

vgel 2 years ago | |

Ooh, this is cool! Using WASM let me avoid writing a register allocator (though I probably would have just used the stack if I had targeted x86/ARM since I wasn't going for speed).

amedvednikov 2 years ago | |

Nice project!

Joker_vD 2 years ago |

I am pretty certain the following is a valid "for"-loop translation:

    block
        ;; code for "i = 0"
        loop
            ;; code for "i < 5"
            i32.eqz
            br_if 1
        
            i32.const 1
            loop
                if
                    ;; code for "i = i + 1"
                    br 2
                else
                end
        
                ;; code for "j = j * 2 + 1"

                i32.const 0
            end
        end
    end

It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass:

        ;; code for "i = 0"
    .loop_test:
        ;; code for "i < 5"
        jz  .loop_end
        jmp .loop_body
    .loop_incr:
        ;; code for "i = i + 1"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
        jmp .loop_incr
    .loop_end:

Of course, normally you'd want to re-arrange things like so:

        ;; code for "i = 0"
        jmp .loop_test
    .loop_body:
        ;; code for "j = j * 2 + 1"
    .loop_incr:
        ;; code for "i = i + 1"
    .loop_test:
        ;; code for "i < 5"
        jnz .loop_body
    .loop_end:

I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)
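
For reference, a minimal sketch of how a single-pass code generator might produce the first assembly listing above, where every clause is emitted in the order it appears in the source. The function name `emit_for_single_pass` and the label scheme are hypothetical; each argument is assumed to be a list of already-generated instruction strings, and the condition code is assumed to leave the zero flag set when the test is false.

```python
def emit_for_single_pass(init, cond, incr, body, label="loop"):
    """Emit a C-style for loop with init, cond, incr, body kept in source order.

    The increment code is emitted before the body (as the parser encounters it),
    so two extra jumps are needed to run the body first at execution time.
    """
    return [
        *init,
        f".{label}_test:",
        *cond,
        f"    jz  .{label}_end",    # condition false: leave the loop
        f"    jmp .{label}_body",   # condition true: skip over the increment
        f".{label}_incr:",
        *incr,
        f"    jmp .{label}_test",
        f".{label}_body:",
        *body,
        f"    jmp .{label}_incr",
        f".{label}_end:",
    ]


lines = emit_for_single_pass(
    init=["    ;; code for i = 0"],
    cond=["    ;; code for i < 5"],
    incr=["    ;; code for i = i + 1"],
    body=["    ;; code for j = j * 2 + 1"],
)
print("\n".join(lines))
```

The cost of staying single-pass is visible in the output: two unconditional jumps per iteration that the rearranged layout avoids.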

vgel 2 years ago | |

Oh, interesting--I remember messing around with flags on the stack but was having issues with the WASM analyzer (it doesn't like possible inconsistencies with the number of parameters left on the stack between blocks). I think your solution might get around that, though!

tptacek 2 years ago |

A time-honored approach!

https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...

(minus directly emitting opcodes, and fitting into 500 lines, of course.)

ak_111 2 years ago |

Somewhat unrelated question, but I think one of the more difficult things about learning C for coders who are used to scripting languages is getting your head around how the various scalar data types like short, int, long,... (and the unsigned/hex version of each) are represented, how they relate to each other, and how they relate to the platform.

I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?
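
A small sketch of the platform dependence in question, using Python's `ctypes` to ask the local C ABI how wide each type is. The sizes of `short`, `int`, `long`, and `long long` depend on the platform (for example, `long` is 4 bytes on 64-bit Windows and 8 bytes on 64-bit Linux), while the fixed-width types that C99 added in `<stdint.h>` are the same everywhere they exist. The dictionary names are just for illustration.

```python
import ctypes

# Platform-dependent widths: the C standard only guarantees minimum ranges.
platform_sizes = {
    name: ctypes.sizeof(ctype)
    for name, ctype in [
        ("short", ctypes.c_short),
        ("int", ctypes.c_int),
        ("long", ctypes.c_long),
        ("long long", ctypes.c_longlong),
    ]
}

# Fixed-width counterparts, mirroring int32_t / int64_t from <stdint.h>.
fixed_sizes = {
    "int32_t": ctypes.sizeof(ctypes.c_int32),
    "int64_t": ctypes.sizeof(ctypes.c_int64),
}

for name, size in {**platform_sizes, **fixed_sizes}.items():
    print(f"{name:10} {size} bytes")
```

So modern C does offer fixed sizes on request; the plain `int`/`long` names keep their flexible widths for compatibility with existing code and platforms.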

kaycebasques 2 years ago |

Is there a C compiler written in Python that aims for maximum readability rather than trying to get as much done under X lines of code?

vgel 2 years ago | |

I think the code is fairly readable! It's formatted with Black (and therefore limited to reasonable line lengths) and well-commented.

IMO, being under X lines of code is part of the readability—10,000 lines of code is hard to approach no matter how readable it otherwise is.

muth02446 2 years ago | |

Not quite a C compiler but arguably better:

http://cwerg.org

WalterBright 2 years ago |

This looks a lot like the Tiny Pascal compiler that BYTE published a listing of back in 1978.

http://www.trs-80.org/tiny-pascal/

I figured out the basics of how a compiler works by going through it line by line.

vgel 2 years ago | |

Oh, that's neat (funny that they skipped out on similar things to me, like GOTO and structs :-)

I didn't see a link to the source in the article, but this seems to be it: https://sourceforge.net/p/tiny-pascal/code/HEAD/tree/NorthSt...

dugmartin 2 years ago | |

I think Borland’s Turbo Pascal was also a single pass compiler that emitted machine code as COM files.

kwhitefoot 2 years ago | | |

Surely it is a feature of all Pascal compilers that they are single pass. I thought that it was part of the specification of the language that it be possible to compile in a single pass.

bemmu 2 years ago | | |

It makes development so much more fun when you see the results right away.

Pressing "build" in Turbo Pascal on my 386sx it was already done before you could even perceive any delay. Instant.

andrewmcwatters 2 years ago | |

Thanks for sharing this, Walter. I'm always curious where language developers get their experience from.

WalterBright 2 years ago | | |

Figuring out how recursive descent worked was just magical.

marcodiego 2 years ago |

It is interesting to think that 500 lines of code is something one can write in one or two days. But writing a C compiler in 500 lines of comprehensible code (even in Python) is a challenge in itself that may take months after a few years of solid learning.

I wonder if this is a good path to becoming an extremely productive developer. If someone spends time developing projects like this, but for different areas... A kernel, a compressor, renderer, multimedia/network stack, AI/ML... Will that turn a good dev into a 0.1 Bellard?

jll29 2 years ago |

Writing your own compiler

- demystifies compilers, interpreters, linkers/loaders and related systems software, which you now understand. This understanding will no doubt one day help in your debugging efforts;

- elevates you to become a higher level developer: you are now a tool smith who can make their own language if needed (e.g. to create domain specific languages embedded in larger systems you architect).

So congratulations, on top of other forms of abstraction, you have mastered meta-linguistic abstraction (see the latter part of Structure and Interpretation of Computer Programs, preferably the 1st or 2nd ed.).

mananaysiempre 2 years ago |

> [Building parse trees] is really great, good engineering, best practices, recommended by experts, etc. But... it takes too much code, so we can't do it.

It takes too much code in Python. (Not a phrase one gets to say often, but it’s generally true for tree processing code.) In, say, SML this sort of thing is wonderfully concise.

meitham 2 years ago |

Actually with SLY (https://sly.readthedocs.io) now dead, what is the recommended Lexer/Parser library in Python?

bfLives 2 years ago | |

I’m partial to the Python port of parsec. (https://pythonhosted.org/parsec/)

nn3 2 years ago |

Just for comparison the LOCs for some other small C or C like compilers. It's not that far away from Ritchie's

C4x86 | 0.6K (very close)

small C (x86) | 3.1K

Ritchie's earliest struct compiler | 2.3K

v7 Unix C compiler | 10.2K

chibicc | 8.4K

Biederman's romcc | 25.0K

userbinator 2 years ago | |

This one is certainly stretching the definition of "C like", but it's just under 512 bytes : https://news.ycombinator.com/item?id=36064971

vgel 2 years ago | |

Oh, C4 is neat—technically it has me beat since it also implements the VM to run the code—though their formatting definitely takes advantage of long lines :-)

Shocka1 2 years ago |

These kinds of posts are one of the things that keep me coming back to HN. Right when I start thinking I'm a professional badass for implementing several features with great well tested code in record time, I stumble upon posts like this that put me in my place.

rcarmo 2 years ago |

I have to wonder if there's a Scheme to WASM compiler out there someplace right now I haven't found yet.

vgel 2 years ago | |

Looks like Schism (https://github.com/schism-lang/schism) got part of the way there, but it unfortunately seems to be dead.

cnity 2 years ago | |

Have you seen Guile Hoot?

https://gitlab.com/spritely/guile-hoot

rcarmo 2 years ago | | |

No, thanks! Had a look, doesn’t seem to be ready to support WASI, but it’s active.

aldousd666 2 years ago |

This is crazy cool! Esolangs have been a hobby of mine, (more just an interest lately, since I haven't built any in a while,) so this is like a fun code golf game for compilation. Nice work, and even better, nice explanation article!

varispeed 2 years ago |

I wrote a C compiler back in the day as a learning exercise. It was the most fun and rewarding project.

jokoon 2 years ago |

I don't see him using match/case... while it's clearly a good use case.

MrYellowP 2 years ago |

I am really confused by what people call compilers nowadays. This is now a compiler that takes input text and generates output text, which then gets read by a compiler that takes input text and generates JIT code for execution.

This is more of a transpiler, than an actual compiler.

Am I missing something?

traes 2 years ago | |

To quote the great Bob Nystrom's Crafting Interpreters, "Compiling is an implementation technique that involves translating a source language to some other — usually lower-level — form. When you generate bytecode or machine code, you are compiling. When you transpile to another high-level language, you are compiling too."

Nowadays, people generally understand a compiler to be a program that reads, parses, and translates programs from one language to another. The fundamental structure of a machine code compiler and a WebAssembly compiler is virtually identical -- would this project somehow be more of a "real" compiler if instead of generating text it generated binary that encoded the exact same information? Would it become a "real" compiler if someone built a machine that runs on WebAssembly instead of running it virtually?
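
To illustrate how thin the line between text and binary output is, here is a minimal sketch that turns a few WebAssembly text instructions into the corresponding binary instruction bytes. It covers only four opcodes and produces only the instruction stream, not a complete module with header and sections; `assemble` and `sleb128` are hypothetical helper names.

```python
def sleb128(value):
    """Encode a signed integer as LEB128, the immediate format of i32.const."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7  # arithmetic shift keeps the sign for negative numbers
        sign_bit_set = bool(byte & 0x40)
        done = (value == 0 and not sign_bit_set) or (value == -1 and sign_bit_set)
        out.append(byte if done else byte | 0x80)  # high bit marks continuation
        if done:
            return bytes(out)


# Opcode bytes from the WebAssembly binary format.
OPCODES = {"i32.const": 0x41, "i32.add": 0x6A, "i32.sub": 0x6B, "i32.mul": 0x6C}


def assemble(lines):
    """Translate text instructions into the equivalent binary instruction bytes."""
    code = bytearray()
    for line in lines:
        name, *args = line.split()
        code.append(OPCODES[name])
        if name == "i32.const":
            code += sleb128(int(args[0]))
    return bytes(code)


binary = assemble(["i32.const 2", "i32.const 3", "i32.add"])
print(binary.hex(" "))
```

Both forms carry the same information; the binary one is just a table lookup and an integer encoding away from the text.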

The popular opinion is that splitting hairs about this is useless, and the definition of a compiler has thus relaxed to include "transpilers" as well as machine code targeting compilers (at least in my dev circles).

teddyh 2 years ago |

For some value of “C”:

> Notably, it doesn't support:

> structs :-( would be possible with more code, the fundamentals were there, I just couldn't squeeze it in

> enums / unions

> preprocessor directives (this would probably be 500 lines by itself...)

> floating point. would also be possible, the wasm_type stuff is in, again just couldn't squeeze it in

> 8 byte types (long/long long or double)

> some other small things like pre/post cremements, in-place initialization, etc., which just didn't quite fit

> any sort of standard library or i/o that isn't returning an integer from main()

> casting expressions

vgel 2 years ago | |

Well, I set the 500 line budget up front, and that was really as much as I could fit with reasonable formatting. I'll be excited to see your 500 line C compiler supporting all those features once it's done ;-)

spease 2 years ago | |

C--23

(Respect to the author for doing this, I just couldn’t resist the obvious joke)

vgel 2 years ago | | |

I actually almost made it a C-- (https://www.cs.tufts.edu/~nr/c--/download/ppdp.pdf) compiler, but IIRC the `goto`s made me go with the regular C subset instead.

pjmlp 2 years ago | |

Basically like many C compilers outside UNIX during the 1980's.

RatC did not need 500 lines for its preprocessor support, by the way.

fan_of_yoinked 2 years ago |

I love the graphic - would go see the world's largest Chomsky

moomin 2 years ago |

Inevitably we have to ask: and how many lines of C in library functions?

hamilyon2 2 years ago |

So, given that Python is interpreted and very well understood, can we say that we are sure this compiler does not include the Thompson virus?

rhabarba 2 years ago |

Finally, one can have inefficient C.

MaxBarraclough 2 years ago | |

There's always the CINT interpreter for C and C++.

https://root.cern.ch/root/html534/guides/users-guide/CINT.ht...

brnt 2 years ago | | |

A PTSD trigger for me. Only half joking. Funny thing is, I never checked out Cling to see if it was at long last the real deal.

wiseowise 2 years ago | |

Why would language choice of compiler make any difference for efficiency of final output?

NeuroCoder 2 years ago | | |

They didn't say the language was the issue. It doesn't support the full C spec. But if you want a reason why language might be an issue for a compiler, it could make compilation time slower. But I think the point of this project is not real world use but fun demonstration of skill

vgel 2 years ago | | |

Maybe not the language choice, but the codegen of this compiler is terrible because of the single-pass shortcuts (for example, it unconditionally loads the result of all assignment operations back to the stack just in case you want to write `a = b = 1`, even though 99% of the time that load is immediately thrown away.)
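
As a rough illustration of what a later pass could clean up, here is a sketch of a peephole filter over emitted text instructions that deletes a value pushed only to be discarded. It assumes the redundant reload shows up as a `local.get` immediately followed by `drop`; that instruction pattern is an assumption for the example and may not match what the compiler in the article actually emits. `drop_dead_loads` is a hypothetical name.

```python
def drop_dead_loads(instrs):
    """Remove each `local.get` that is immediately discarded by a `drop`."""
    out = []
    for ins in instrs:
        if ins == "drop" and out and out[-1].startswith("local.get "):
            out.pop()  # the value was pushed only to be thrown away
        else:
            out.append(ins)
    return out


# Hypothetical output for the statement `a = 1;` followed by a later use of `a`.
emitted = [
    "i32.const 1",
    "local.set $a",
    "local.get $a",  # reloaded in case the assignment is used as an expression
    "drop",          # ...but the statement discards it
    "local.get $a",
]
cleaned = drop_dead_loads(emitted)
```

Removing the pair is safe here because `local.get` has no side effects; doing the same for memory loads would need more care, since a load can trap.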

folmar 2 years ago | |

Always remember _bashcc_.

ForOldHack 2 years ago |

The *point* of a compiler is to compile itself.

HumblyTossed 2 years ago | |

Is it?

Jake_K 2 years ago |

Interesting stuff

golemarms 2 years ago |

Cool. Now try writing a Python compiler in 500 lines of C.

_chu1 2 years ago | |

The fact this is hidden says something about the disparity here.