Oops, I Wrote a C++ Compiler(praeclarum.org) |
Oops, I Wrote a C++ Compiler(praeclarum.org) |
To address this problem, I use Zephyr ASDL to describe the abstract syntax of a language (which CPython also uses!)
It's a little DSL that lets you write an ML-style algebraic data type, and generates a bunch of code for you. In CPython, you go from ~100 lines ASDL schema to ~10,000 lines of C! It underlies the 'ast' module in the stdlib.
Why is Zephyr ASDL? http://www.oilshell.org/blog/2016/12/11.html
It eliminates the need for a lot of boilerplate he pointed out:
https://github.com/praeclarum/CLanguage/tree/master/CLanguag...
[1] http://jastadd.org/web/documentation/concept-overview.php
People will be trying to port their existing code, and will be unpleasantly surprised.
> But the truth is, I’m just going to keep listening to users and improving the parts I can. I am not trying to recreate GCC or LLVM.
Why not use GCC or LLVM as part of your project? That seems like a better investment of your time.
As for interpreting — there are a few options. The simplest approach would be to use LLVM‘s built-in IR interpreter (lli), which did exist in 2010.
Alternately, if you want more control: these days, the targets supported by upstream LLVM include:
- eBPF
- WebAssembly
- AVR
The first two are bytecode formats, both designed for (relative) simplicity and portability. There are open-source interpreters available for both, or alternately they’re both simple enough that you could relatively easily write your own. AVR, on the other hand, is the actual instruction set used by the Arduino; I’m not sure what the status is of open-source AVR emulators (I see a few projects on GitHub but don’t know how good or complete they are), but there’s probably something out there that would work.
But to be fair, all three of those targets were merged into LLVM upstream only recently, much later than 2010. Also, 2010 was the year that Clang got C++ support working well enough to become self-hosting, according to Wikipedia [1]; it was still relatively experimental, not nearly as much of a “safe bet” as it is today. (Still, even then, it was pretty much guaranteed to work better than writing your own C++ compiler from scratch! :p)
GCC is older but much harder to embed, even today. For one thing, it’s under the GPL. iCircuit is closed source, and even if the author were willing to open it, there are supposedly legal issues with publishing GPL code to the App Store. And there are technical obstacles as well: last I checked, GCC still expects to be run as the main executable as part of a traditional Unix toolchain, not embedded as a library. If that has changed, it was only recently.
At least they recognize this is missing, and call it out.
> Why not use GCC or LLVM as part of your project? That seems like a better investment of your time.
Learning opportunity? Maybe they're trying to solve a problem? Or do you think the LLVM folks should have just used GCC instead of wasting their time on LLVM?
That's not how the world gets compiler people.
Someone had to write GCC and LLVM, and someone will have to write whatever comes next. Languages and tooling only advance because there are those who choose to take the hard road instead of "just using that thing that someone else made".
Encouraging people to use LLVM instead of home-growing a toy compiler can still get you compiler people. LLVM is just the backend, not the frontend, so you still need to write the compiler frontend yourself, and once you've learned LLVM, if you're so inclined you may start hacking on LLVM and therefore become a backend compiler person too.
Just yesterday there was this, which doesn't even support structs:
https://news.ycombinator.com/item?id=17843293
I'm all for simple instructive projects, but we don't need to describe them in such misleading ways.
It's more accurately a "C superset" compiler, or perhaps "C+".
I agree with you 100%, but this made me laugh.
Frank is a talented engineer, I used to follow his work on Calca very closely. But as others noted, this does not seem to be anywhere close to C++. The problem with parsing C++ starts with C. Say you see the code “T(t);” at the beginning of a function body. What is it? Declaration of variable t of type T? Call of function T on variable t? You cannot parse it properly without symbol table. No context free parser handles C properly, let alone C++. It gets progressively and exponentially worse from here.
C is actually pretty easy if you ignore all of the bad advice to use parser generators. I looked into the various YACC grammars for C I could find on the Internet, and all of them either had bugs or were incomplete. TCC[1] has a simple recursive-descent parser. With a recursive descent parser you also have the option of implementing the C pre-processor in the same step. Turns out I was able to implement a single-pass C parser and pre-processor as a bunch of Common Lisp read macros[2].
I have not looked into it, but the approach for C++ looks like it would be very different because template instantiation needs its own step.
[1] https://bellard.org/tcc/ [2] https://github.com/vsedach/Vacietis/blob/master/compiler/rea...
It's the things like templates, multiple/virtual inheritance, and argument-dependent lookup/overload resolution which are difficult to get right, and the reason that even somewhat fully-featured C++ compilers are rare.
Emulating an Arduino, you get no benefit at all from re-implementing a compiler. The Instruction Set Architecture is rigorously defined. All you need to do is implement that, plus some peripherals. Write a compiler from ATmega machine code to your VM, if you like.
This C++-- will never be actually useful to anyone, and people developing for Arduino already have more than one anyway. It might be educational, but almost all the work has gone into the least useful and least educational parts of it.
The destructor is by far the single most essential feature of C++. After that, constructors, templates, and the standard library take 2nd, 3rd, and 4th place. If you think integer conversions are important, you have not learned much at all.
There are already several ATmega emulators done sanely. One more would not be bad, and could break new ground. Use virtualization primitives to make an x86 program translated from ATmega think it's a real chip, and emulate the Arduino 100x faster than the real one.
https://github.com/llvm-mirror/llvm/tree/master/lib/Transfor...
You can play around with how it takes in llvm code to optimized llvm code by using the opt tool.
Here's a basic example.
$ cat test.c
#include <stdio.h>
int main() {
for (int i = 0; i < 3; i++) {
printf("hello world");
}
}
$ clang -c -S -emit-llvm test.c
$ cat test.ll
** too long replacing with gist**
https://gist.github.com/cwgreene/b6f33d40ad735a448ed15057d91fdbdc
$ opt -S -print-before-all -O2 test.ll
** too long, replacing with gist**
https://gist.github.com/cwgreene/99b3075dbfffae35a5745934ded217fb
** final result**
@.str = private unnamed_addr constant [12 x i8] c"hello world\00", align 1
; Function Attrs: nounwind uwtable
define i32 @main() #0 {
entry:
%call = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
%call.1 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
%call.2 = tail call i32 (i8*, ...) @printf(i8* getelementptr inbounds ([12 x i8], [12 x i8]* @.str, i64 0, i64 0)) #2
ret i32 0
}
The command `opt -S -print-before-all -O2` can be manually modified to specify exactly which compiler passes you want to see. You can run `opt --help` to see a list of all compiler passes. (you can figure out what code performs by looking at the * * * IR Dump Before `MODULE NAME` * * * and finding the associated logic in the Lib directory)In the case of llvm, the idea for them is they break the code into things called 'basic blocks' which, along with using Single Static Assignment, are more easily analyzed, transformed, and simplified.
I haven't look around recently enough to name any off-hand but there are some really good free textbooks/monographs(etc.) on optimisation around. Search "ssabook", there's a really great data flow analysis book out there somewhere which if I find I will edit and include.
[1] restrictions may apply
Those parts were ripped out of the standard. (The big one is export templates, where the expert compiler implementers when asked for feedback about how to implement it said "don't").
> I'm doing a (free) operating system (just a hobby, won't be big and professional like gnu) for 386(486) AT clones.
—Linus Torvalds, in a post[1] to comp.os.minix, 25 August 1991 (emphasis mine)
[1]: https://groups.google.com/forum/#!msg/comp.os.minix/dlNtH7RR...
And from where would they get those ideas? Tweaking LLVM is not enough to learn the necessary trade-offs. You need to write a compiler from the ground up. In fact you need to do it multiple times. Most likely you will not end up writing a LLVM replacement. Most likely you will end up a better programmer.
The genius of LLVM is that it's a stable, well-defined IR layer in the middle of the compilation stack that lets you switch out either the frontend or the backend without having to do the entire messy, boring, critical implementation of the other piece. And you can support pretty wacky frontend languages or pretty wacky backends (people compile LLVM to C source or FPGAs, for example).
Check the papers related to PL.8 compiler architecture.