Oops, I Wrote a C++ Compiler

Oops, I Wrote a C++ Compiler(praeclarum.org)

162 points by mpalme 7 years ago | 39 comments

chubot 7 years ago |

The process of creating the parser went very smoothly thanks to this. The real work involves creating C# syntax classes that mirror the grammar. These classes form the Abstract Syntax Tree (AST) of the compiler.

To address this problem, I use Zephyr ASDL to describe the abstract syntax of a language (which CPython also uses!)

It's a little DSL that lets you write an ML-style algebraic data type, and generates a bunch of code for you. In CPython, you go from ~100 lines ASDL schema to ~10,000 lines of C! It underlies the 'ast' module in the stdlib.

Why is Zephyr ASDL? http://www.oilshell.org/blog/2016/12/11.html

It eliminates the need for a lot of boilerplate he pointed out:

https://github.com/praeclarum/CLanguage/tree/master/CLanguag...

fmihaila 7 years ago | |

For those writing language processors in Java, I highly recommend having a look at JastAdd [1]. It's a great framework for implementing semantic analysis using attribute grammars [2]. The tool automatically generates Java classes to represent ASTs, based on a high-level description (a CFG grammar) of the desired AST shape. You can use any Java-based parser as long as the AST it produces is constructed from those classes. That's only the starting point; the real power is in the excellent support for expressing attribute computation. It uses a high-level, lazy equational formalism extended by the programmer with custom logic, which is then turned into automatically-generated Java code. Compilers/interpreters written in JastAdd tend to be very well structured and easy to maintain. It's one of my all-time favourite libraries, actively maintained, and very well documented.

[1] http://jastadd.org/web/documentation/concept-overview.php

[2] https://en.wikipedia.org/wiki/Attribute_grammar

lopatin 7 years ago | | |

And for Scala, there's Kiama https://bitbucket.org/inkytonik/kiama

amelius 7 years ago |

Please don't call it C++ if it doesn't even support constructors and destructors.

People will be trying to port their existing code, and will be unpleasantly surprised.

> But the truth is, I’m just going to keep listening to users and improving the parts I can. I am not trying to recreate GCC or LLVM.

Why not use GCC or LLVM as part of your project? That seems like a better investment of your time.

jchw 7 years ago | |

It is very well explained? GCC and Clang would be nightmares to integrate into an app that runs on many platforms including iOS, and even once you did, you'd still need to implement an interpreter, because you're not allowed to JIT on the app store. Compiling a toolchain for a desktop platform is easy. Cross compiling a toolchain is moderately easy. But embedding a toolchain into an iOS app, where you have to get it and all of its dependencies compiling for iOS, definitely seems like a massive challenge.

comex 7 years ago | | |

For Clang/LLVM, it wouldn’t be that hard. It’s one source tree to build (Clang is in a separate repo from LLVM core, but you check it out into a specific subdirectory and they build together), all C++, no external dependencies, and the result is a set of static libraries you can easily link into an iOS app.

As for interpreting — there are a few options. The simplest approach would be to use LLVM‘s built-in IR interpreter (lli), which did exist in 2010.

Alternately, if you want more control: these days, the targets supported by upstream LLVM include:

- eBPF

- WebAssembly

- AVR

The first two are bytecode formats, both designed for (relative) simplicity and portability. There are open-source interpreters available for both, or alternately they’re both simple enough that you could relatively easily write your own. AVR, on the other hand, is the actual instruction set used by the Arduino; I’m not sure what the status is of open-source AVR emulators (I see a few projects on GitHub but don’t know how good or complete they are), but there’s probably something out there that would work.

But to be fair, all three of those targets were merged into LLVM upstream only recently, much later than 2010. Also, 2010 was the year that Clang got C++ support working well enough to become self-hosting, according to Wikipedia [1]; it was still relatively experimental, not nearly as much of a “safe bet” as it is today. (Still, even then, it was pretty much guaranteed to work better than writing your own C++ compiler from scratch! :p)

GCC is older but much harder to embed, even today. For one thing, it’s under the GPL. iCircuit is closed source, and even if the author were willing to open it, there are supposedly legal issues with publishing GPL code to the App Store. And there are technical obstacles as well: last I checked, GCC still expects to be run as the main executable as part of a traditional Unix toolchain, not embedded as a library. If that has changed, it was only recently.

[1] https://en.m.wikipedia.org/wiki/Clang#Status_history

AgentME 7 years ago | | |

If they integrated Clang, then they could use Clang to compile code to WebAssembly, which can be executed by iOS's Javascript engine and pre-processed by wasm-metering[0] to set execution op limits so infinite loops can be recovered from.

[0] https://github.com/ewasm/wasm-metering

ezoe 7 years ago | | |

Why people still use stupid Apple products which unbelievably forbid the certain computation on the machine we own(except we don't) and have complete control of(also we don't).

craftyguy 7 years ago | |

> Please don't call it C++ if it doesn't even support constructors and destructors.

At least they recognize this is missing, and call it out.

> Why not use GCC or LLVM as part of your project? That seems like a better investment of your time.

Learning opportunity? Maybe they're trying to solve a problem? Or do you think the LLVM folks should have just used GCC instead of wasting their time on LLVM?

umanwizard 7 years ago | | |

Aren't constructors and destructors basically the main idea of c++?

peterkelly 7 years ago | |

> Why not use GCC or LLVM as part of your project?

That's not how the world gets compiler people.

Someone had to write GCC and LLVM, and someone will have to write whatever comes next. Languages and tooling only advance because there are those who choose to take the hard road instead of "just using that thing that someone else made".

lilyball 7 years ago | | |

The successor to LLVM isn't going to be written by someone just hacking on an app and decides they could use a basic compiler. It's going to be written by someone who has the explicit goal of replacing LLVM, fully understands LLVM's design and tradeoffs, and has ideas about an architecture for a successor that improves on the design.

Encouraging people to use LLVM instead of home-growing a toy compiler can still get you compiler people. LLVM is just the backend, not the frontend, so you still need to write the compiler frontend yourself, and once you've learned LLVM, if you're so inclined you may start hacking on LLVM and therefore become a backend compiler person too.

FartyMcFarter 7 years ago | |

Am I seeing things or there's a recent trend of claiming things are C++ compilers when they're nowhere near that?

Just yesterday there was this, which doesn't even support structs:

https://news.ycombinator.com/item?id=17843293

I'm all for simple instructive projects, but we don't need to describe them in such misleading ways.

userbinator 7 years ago | |

Indeed, for a moment I thought that someone had actually figured out how to condense the C++ standard enough to write a full compiler for it, like we've seen many times here on HN with C/C subset compilers.

It's more accurately a "C superset" compiler, or perhaps "C+".

mhh__ 7 years ago | |

Integrating LLVM requires quite a lot of mental effort (Let's not even discuss GCC...), whereas writing from scratch can be "simpler" mentally (at least to start with).

Aardwolf 7 years ago | |

The destructors happen to be one of the most important features that distinguishes C++ from C imho (automatic cleanup at end of scope), so would be great to implement those!

jlarocco 7 years ago | |

> People will be trying to port their existing code, and will be unpleasantly surprised.

I agree with you 100%, but this made me laugh.

mynegation 7 years ago |

It takes hundreds, if not thousands person-years to write C++ compiler intentionally, damn next to impossible to write it accidentally as title implies.

Frank is a talented engineer, I used to follow his work on Calca very closely. But as others noted, this does not seem to be anywhere close to C++. The problem with parsing C++ starts with C. Say you see the code “T(t);” at the beginning of a function body. What is it? Declaration of variable t of type T? Call of function T on variable t? You cannot parse it properly without symbol table. No context free parser handles C properly, let alone C++. It gets progressively and exponentially worse from here.

sedachv 7 years ago | |

> You cannot parse it properly without symbol table. No context free parser handles C properly, let alone C++. It gets progressively and exponentially worse from here.

C is actually pretty easy if you ignore all of the bad advice to use parser generators. I looked into the various YACC grammars for C I could find on the Internet, and all of them either had bugs or were incomplete. TCC[1] has a simple recursive-descent parser. With a recursive descent parser you also have the option of implementing the C pre-processor in the same step. Turns out I was able to implement a single-pass C parser and pre-processor as a bunch of Common Lisp read macros[2].

I have not looked into it, but the approach for C++ looks like it would be very different because template instantiation needs its own step.

[1] https://bellard.org/tcc/ [2] https://github.com/vsedach/Vacietis/blob/master/compiler/rea...

userbinator 7 years ago | |

Parsing is the easy part of a C++ compiler, and the T t example you mentioned has a relatively simple solution (which is, as you say, using the symbol table). In fact that ambiguity is present in C as well (with typedefs), and yet it hasn't really stopped many others from writing their own C compilers which handle that case.

It's the things like templates, multiple/virtual inheritance, and argument-dependent lookup/overload resolution which are difficult to get right, and the reason that even somewhat fully-featured C++ compilers are rare.

njsubedi 7 years ago |

This post reminded me of myself in 2012. Where did all that energy go?

ncmncm 7 years ago |

Does this seem insane to anyone else?

Emulating an Arduino, you get no benefit at all from re-implementing a compiler. The Instruction Set Architecture is rigorously defined. All you need to do is implement that, plus some peripherals. Write a compiler from ATmega machine code to your VM, if you like.

This C++-- will never be actually useful to anyone, and people developing for Arduino already have more than one anyway. It might be educational, but almost all the work has gone into the least useful and least educational parts of it.

The destructor is by far the single most essential feature of C++. After that, constructors, templates, and the standard library take 2nd, 3rd, and 4th place. If you think integer conversions are important, you have not learned much at all.

There are already several ATmega emulators done sanely. One more would not be bad, and could break new ground. Use virtualization primitives to make an x86 program translated from ATmega think it's a real chip, and emulate the Arduino 100x faster than the real one.

maxxxxx 7 years ago |

Nice! I think doing this is a great learning experience. I think I understand how to write a basic compiler but I am completely stumped by writing an optimizer. How do you model that?

archgoon 7 years ago | |

The llvm compiler has it's optimization routines here:

https://github.com/llvm-mirror/llvm/tree/master/lib/Transfor...

You can play around with how it takes in llvm code to optimized llvm code by using the opt tool.

Here's a basic example.

  $ cat test.c
  #include <stdio.h>
  int main() {
    for (int i = 0; i < 3; i++) {
      printf("hello world");
    }
  }

  $ clang -c -S -emit-llvm test.c
  $ cat test.ll
  ** too long replacing with gist**
  https://gist.github.com/cwgreene/b6f33d40ad735a448ed15057d91fdbdc

  $ opt -S -print-before-all -O2 test.ll
  ** too long, replacing with gist**
  https://gist.github.com/cwgreene/99b3075dbfffae35a5745934ded217fb
  ** final result**
  @.str = private unnamed_addr constant [12 x i8] c"hello world\00", align 1

  ; Function Attrs: nounwind uwtable
  define i32 @main() #0 {
  entry:
    %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
    %call.1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
    %call.2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
    ret i32 0
  }

The command `opt -S -print-before-all -O2` can be manually modified to specify exactly which compiler passes you want to see. You can run `opt --help` to see a list of all compiler passes. (you can figure out what code performs by looking at the * * * IR Dump Before `MODULE NAME` * * * and finding the associated logic in the Lib directory)

In the case of llvm, the idea for them is they break the code into things called 'basic blocks' which, along with using Single Static Assignment, are more easily analyzed, transformed, and simplified.

mhh__ 7 years ago | |

It's a bit idiosyncratic but Steven Muchnick's "Advanced Compiler Design and implementation" is something of a bible for optimising compiler design. Of course, other books are available.

I haven't look around recently enough to name any off-hand but there are some really good free textbooks/monographs(etc.) on optimisation around. Search "ssabook", there's a really great data flow analysis book out there somewhere which if I find I will edit and include.

aaaaaaaaaab 7 years ago |

More like a “C++ [1] compiler”.

[1] restrictions may apply

saagarjha 7 years ago | |

To be honest, that's most C++ compilers.

int_19h 7 years ago | | |

Most C++ compilers still implement most of the C++ standard. This thing basically implements C, and adds methods to structs.

mhh__ 7 years ago | | |

That's only technically true. The main C++ compilers in common use basically support, aim to support or have collectively agreed not to support parts of the C++ standard.

aaaaaaaaaab 7 years ago | | |

Touché

cift 7 years ago |

This looks like a cool app, the post says it's available on Android but I can't seem to find it on the Play Store. Anyone know what happened?

hasahmed 7 years ago |

Very cool

$ cat test.c #include <stdio.h> int main() { for (int i = 0; i < 3; i++) { printf("hello world"); } } $ clang -c -S -emit-llvm test.c $ cat test.ll ** too long replacing with gist** https://gist.github.com/cwgreene/b6f33d40ad735a448ed15057d91fdbdc $ opt -S -print-before-all -O2 test.ll ** too long, replacing with gist** https://gist.github.com/cwgreene/99b3075dbfffae35a5745934ded217fb ** final result** @.str = private unnamed_addr constant [12 x i8] c"hello world\00", align 1 ; Function Attrs: nounwind uwtable define i32 @main() #0 { entry: %call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2 %call.1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2 %call.2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2 ret i32 0 }