IIRC, C was specifically designed to allow single-pass compilation, right? I.e. in many languages you don't know what needs to be output without parsing the full AST, but in C, syntax directly implies semantics. I think I remember hearing this was because early computers couldn't necessarily fit the AST for an entire code file in memory at once
It explains the memory limits and what happened :)
> After the TMG version of B was working, Thompson rewrote B in itself (a bootstrapping step). During development, he continually struggled against memory limitations: each language addition inflated the compiler so it could barely fit, but each rewrite taking advantage of the feature reduced its size. For example, B introduced generalized assignment operators, using x=+y to add y to x. The notation came from Algol 68 [Wijngaarden 75] via McIlroy, who had incorporated it into his version of TMG. (In B and early C, the operator was spelled =+ instead of += ; this mistake, repaired in 1976, was induced by a seductively easy way of handling the first form in B's lexical analyzer.)
It's scary just how much easier it is to parse languages without infix parsing.
Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.
Once you want to optimize or analyse, things become more complicated.
> Compare that to, say, Rust, which would be pretty painful to single-pass compile with all the non-local behavior around traits.
Type inference also spans a whole function, so you can't do it in a single pass through the code. (But it's still tamer than in eg Haskell, where type inference considers your whole program.)
E.g. to call a function later in the file, you need a prior declaration. Or else, an implicit one (possibly wrong) will be assumed from the call itself.
This is not true in some C++ situations, like class declarations. In a class definition there can be functions with bodies. Those are inline functions. The functons can freely refer to each other in either direction. Type checking a class declaration therefore requires all of it to be parsed.
A one pass language is advantageous even if you're building a serious multi-pass compiler with optimization. This is because that exercise doesn't require an AST! Multi-pass doesn't mean AST.
Building an AST doesn't require just more memory, but more code and development work: more complexity in the code to build the abstraction and to traverse it. It's useful if you need to manipulate or analyze the program in ways that are closely related to the source language. If the language cannot be checked in one pass, you might need one; you wouldn't want to be doing the checking on an intermediate representation, where you've lost the relationship to the code. AST building can be reused for other purposes, like code formatting, refactoring, communicating with an IDE for code completion and whatnot.
If the only thing you're going to do with an AST is walk it up and down to do some checks, and then to generate code, and you do all that in an order that could have been done without the AST (like a bottom-up, left to right traversal), then it was kind of a waste to construct it; those checks and generation could have been done as the phrase structure rules were parsed.
At first I thought that it wasn't possible for C. After I thought about it, as long as you disallow forward references, and rely on a single source file as input, it's possible to compile a complete C program in one pass. Anything else requires a preprocessor (e.g "#include") and/or linker (e.g. "extern" and prototypes) to solve. The implementation in the article dodges all of these and focuses on a very pure subset of C.
My cousins ex had a workflow in the late 70's that involved two floppy drives and a little dance to compile and link. Later he got a 5M hard drive which improved things a lot.
block
;; code for "i = 0"
loop
;; code for "i < 5"
i32.eqz
br_if 1
i32.const 1
loop
if
;; code for "i = i + 1"
br 2
else
end
;; code for "j = j * 2 + 1"
i32.const 0
end
end
end
It doesn't require cloning the lexer so probably would still fit in 500 lines? But yeah, in normal assembly it's way easier, even in one-pass: ;; code for "i = 0"
.loop_test:
;; code for "i < 5"
jz .loop_end
jmp .loop_body
.loop_incr:
;; code for "i = i + 1"
jmp .loop_test
.loop_body:
;; code for "j = j * 2 + 1"
jmp .loop_incr
.loop_end:
Of course, normally you'd want to re-arrange things like so: ;; code for "i = 0"
jmp .loop_test
.loop_body:
;; code for "j = j * 2 + 1"
.loop_incr:
;; code for "i = i + 1"
.loop_test:
;; code for "i < 5"
jnz .loop_body
.loop_end:
I propose the better loop syntax for languages with one-pass implementations, then: "for (i = 0) { j = j * 2 + 1; } (i = i + 1; i < 5);" :)https://www.blackhat.com/presentations/win-usa-04/bh-win-04-...
(minus directly emitting opcodes, and fitting into 500 lines, of course.)
I am wondering if this complexity exists due to historical reasons, in other words if you were to invent C today you would just define int as always being 32, long as 64 and provide much more sane and well-defined rules on how the various datatypes relate to each other, without losing anything of what makes C a popular low-level language?
IMO, being under X lines of code is part of the readability—10,000 lines of code is hard to approach no matter how readable it otherwise is.
http://www.trs-80.org/tiny-pascal/
I figured out the basics of how a compiler works by going through it line by line.
I didn't see a link to the source in the article, but this seems to be it: https://sourceforge.net/p/tiny-pascal/code/HEAD/tree/NorthSt...
Pressing "build" in Turbo Pascal on my 386sx it was already done before you could even perceive any delay. Instant.
I wonder if is this a good path to becoming an extremely productive developer. If some one spends time developing projects like this, but for different areas... A kernel, a compressor, renderer, multimedia/network stack, IA/ML... Will that turn a good dev into a 0.1 Bellard?
- demystifies compilers, interpreters, linkers/loaders and related systems software, which you now understand. This understanding will no doubt one day help in your debugging efforts;
- elevates you to become a higher level developer: you are now a tool smith who can make their own language if needed (e.g. to create domain specific languages embedded in larger systems you architect).
So congratulations, on top of other forms of abstraction, you have mastered meta-linguistic abstraction (see the latter part of Structure and Interpretation of Computer Programs, preferably the 1st or 2nd ed.).
It takes too much code in Python. (Not a phrase one gets to say often, but it’s generally true for tree processing code.) In, say, SML this sort of thing is wonderfully concise.
C4x86 | 0.6K (very close)
small C (x86) | 3.1K
Ritchie's earliest struct compiler | 2.3K
v7 Unix C compiler | 10.2K
chibicc | 8.4K
Biederman's romcc | 25.0K
This is more of a transpiler, than an actual compiler.
Am I missing something?
Nowadays, people generally understand a compiler to be a program that reads, parses, and translates programs from one language to another. The fundamental structure of a machine code compiler and a WebAssembly compiler is virtually identical -- would this project somehow be more of a "real" compiler if instead of generating text it generated binary that encoded the exact same information? Would it become a "real" compiler if someone built a machine that runs on WebAssembly instead of running it virtually?
The popular opinion is that splitting hairs about this is useless, and the definition of a compiler has thus relaxed to include "transpilers" as well as machine code targeting compilers (at least in my dev circles).
> Notably, it doesn't support:
> structs :-( would be possible with more code, the fundamentals were there, I just couldn't squeeze it in
> enums / unions
> preprocessor directives (this would probably be 500 lines by itself...)
> floating point. would also be possible, the wasm_type stuff is in, again just couldn't squeeze it in
> 8 byte types (long/long long or double)
> some other small things like pre/post cremements, in-place initialization, etc., which just didn't quite fit any sort of standard library or i/o that isn't returning an integer from main()
> casting expressions
(Respect to the author for doing this, I just couldn’t resist the obvious joke)
RatC did not need 500 lines for its preprocessor support, by the way.
https://root.cern.ch/root/html534/guides/users-guide/CINT.ht...
You'd lose something because those decisions would be impractical for 8-bit and 16-bit targets (which still exist in the world of embedded programming).
My main point is that a C programmer today is forced to learn dozens of rules just to cater for many niche platforms that they will probably never target, so if you were to separate those two use cases you can get a much more neat C that targets modern 64 bit architectures with all the power of traditional C but a bit less portability.
That's a nice theory and is what we've got, but it falls down in a few places.
The first is that the "int" world has got a bit munged - some platforms make some slightly strange choices for long and short and so you can't always rely on it (although int is usually pretty sensible).
The other is that when doing unsigned maths, rollover is silent so generally you really need to know the exact size at coding time so that you can ensure that rollover doesn't happen silently.
Together, these mean that you're generally just better using uint32_t (etc.) all over the place and you get more predictable results.
Off topic, but a log scale might be useful: 0.1 Bellard --> -10 deciBellards. That allows for: 0.001 Bellard --> -30 deciBellards.
Problem: Programmers with negative productivity cannot be represented on the same log scale.
Are they a 4σ programmer, 1σ programmer, -0.5σ programmer, -2σ programmer?
Plus, most people are "average" not negative productivity, and CDFs let you use really fun stuff like Beta Distributions (variable, shapable distributions) and Gamma Distributions (exponential distributions). They're super sweet as far as probability statistics.
[1] https://en.wikipedia.org/wiki/Cumulative_distribution_functi...
[2] https://en.wikipedia.org/wiki/Probability_density_function
This is similar to the problem of price-to-earnings ratio. The ratio goes asymptotic as earnings goes through zero. It would be better to quote earnings-to-price ratio. Another screwy reciprocal unit is miles per gallon for cars.
The feeling that you write code somewhere in the skies and have no idea how something works underneath has always really bugged me when I've used something.
The function of the university in the near future will probably just be to have like-minded curious people to discuss ideas with, and to get a better grasp of what problems need to be solved (specifically scientific ideas, rather than just applying engineering).
The prestige element (specifically of certain universities over others, perhaps not university over high school) is dwindling, and hopefully will be abolished with this new generation.
The people behind this project avoided that caveat by simply not implementing C. Apparently they kept a bit of the syntax but then proceeded to cherry-pick features that suited them and not make.an effort to even try to comply with any version of the standard.
Essentially all you need is a grammar, parser library and a couple of tree walkers to convert the AST first to expand macros and then convert to assembly.
A production compiler with all its optimisation steps is of course far more complex and more modern languages have many more features, but C is really pretty simple (the K&R book is concise and good!) as it was built to work on the computers of half a century ago.
https://www.amazon.com/500-Lines-Less-Amy-Brown/dp/132987127...
It was super fun and interesting. But I wouldn't say it was a terribly useful exercise that has greatly enriched me as a programmer.
And somehow I have ended up with a very strong bias against DSLs.
Most DSLs are bad, partially because most people are bad at designing languages.
Embedded DSLs can be quite neat. Eg Haskell makes it easy to embed something like DSLs inside your Haskell code.
(If you squint a bit, the ability to define your own functions in any language goes in that direction of allowing you to define your own mini-sub-language that handles your problem better. Haskell (and the Lisps) just dial that kind of approach up to 11.)
But DSLs are tempting, especially among the more passionate programmers, so at work we've ended up with a lot of them. All of them are tripping hazards.
Not sure if most people will get much extra out of writing their own compiler, too?
I’d bet range anxiety was also a thing for early gasoline powered cars, so the early adopters of those probably preferred mpg over gpm.
About half are better than median though.
ANTLR is pretty good and is supported across several languages and something I had previously used for some quick Elasticsearch query syntax munging in Python. It also means you can often start from an already existing grammar.
The JS version of ANTLR didn't seem to work for me so for the SQL/JSONPath stuff ended up using the Moo lever and Nearly parser which was rather pleasant. https://nearley.js.org
I attempted to write a parser for Wikitext once in ANTLR which was an interesting nightmare.
Some dialects and optimising compilers, had multiple passes.
You are right about Pascal's original design. Though I'm not sure if that's still true about modern versions of the language?
(-<> "alive"
(and "kicking")
(is "prefix notation" <>)
(in "all lisps"))Which led us to imperative programming and object oriented programming. Which everybody now recognizes as limited in various artificial ways.
Which we are now struggling to shake off given that the majority of chips have a whopping number of parallel cores.
Programming languages restrain your thinking about your solution space.
Note that in several countries in Europe, studying is free of charge or only costs a symbolic fee (scholarships aside).
The university system is thousands of years old, and it would be disastrous if it were abolished. I'm a self-taught developer, who also obtained two Masters and a Ph.D. later, and I can attest to it that the speed of learning at uni cannot be compared with the speed of learning via self study, especially self-study in isolation.
Having said this, learning only in groups/herds is also not the best approach IMHO, there is something to sitting alone and figuring something out without external help, at least occasionally, as it trains your analytical skills, your ability to concentrate and it gives you grit/perseverance.
Seminars/labs and office hours is where the biggest value of university can come from, but that’s the part that is often left to TA…
Motivation and internet access can get some of those things, but even among those things one would struggle to find them in one place and so readily available to them.
My perspective on the value dwindling comes from several graduate degrees and teaching at some of these 'prestigious' universities in the northeast.
It is of course just my singular experience, but the handful of research institutions I have worked at actually provided more of a typical 'academic' atmosphere than the universities, unfortunately. I still really want to believe that universities provide the most innovative research environment, but the incentive and operational structures don't really make it as optimal.
The reason I mention the new generation viewing its value as dwindling is that the cost isn't really 'worth it' and many of the best educational resources are actually free online now.
I'm an academic, so the fact that the investment doesn't pay off doesn't really bother me (academics typically don't care about money), but many people do want some financial advantage from education, and that isn't as present anymore.
I’m very much pro-education, and think that US high school should include the first two years of a current university education (grades should be roughly split between D, C, B and A, where D means “meets curriculum requirements”).
After that, I don’t think many people should go to a university. Instead, the state should provide up to 10 years of vocational education (splittable and redeemable at any point in life) or 8 years of university education (subject to a B or better average in high school).
Most vocational programs would be under 3 years, so you’d get three-four cracks at finding the right profession for you.
University track would be targeted at educators, researchers and entrepreneurs.
I think this would be better for every socioeconomic demographic in the US, and also help alleviate the housing shortage, enable modern manufacturing, fix many current issues with healthcare, and so on.)
however, the engineering culture which took the time to tell me about all these cool things and let me grow into being an expert in them seems to be largely gone.
The same is true of any field. Medicine for example. Then again I have several acres of swamp behind my place to help me when lerning from my mistakes.
Anything from programmable authorization to custom views becomes safely and relatively easily available. Say what you want about ESB pattern, but you can implement ESB transformers in a custom DSL.
Obviously not every piece of software needs these features, but when it does DSLs are very handy.
At work, we had two painful DSLs for monitoring queries. Our team changed monitoring systems entirely just to use SQL instead.
You can do anything with it because it allows launching other programs, but I'd say it was designed specifically for the domain of launching other programs and building data pipelines with their in- and outputs.
IMO Bash and friends are too powerful and generic to be considered DSLs.
Consider
x = -5
now how about: x =- 5
One makes x negative 5, the other subtracts 5 from it. And under this design of the operator the only difference is a space.It's the same with +, just that we're not used to seeing unary plus in the wild.
I think you mean: x -= 5, which indeed is different.
Kind of how we use (*p)->next instead of *p->next where p is node_t**
So that is problematic; If {a}{=}{-}{b} means a = a - b, then you have no way to write a = -b without parentheses like a = (-b).
It had better have an extremely high payoff to justify that cost.
See http://wiki.haskell.org/Embedded_domain_specific_language
Because the overheads are lower, the payoff doesn't have to be as high to make it worthwhile.
(However, there's still some overhead, otherwise you wouldn't really label them as embedded-DSLs.)
But even then. I was never really able to get into Observable because although it looks like JavaScript, there are kind of hidden objects and methods available to you that I found weirdly hard to discover.
Stumbled across your post about spin2win a while ago and I'm glad you found it useful! I still use it every day and wish I could fix the jankiness but oh well hahaha.
A perfectly-designed DSL is still bad just because it's a whole 'nother language for you to build, and for others to learn, that probably isn't needed for whatever project.
Imagine if we had to watch out for this as a common pitfall:
// BUG! Actually subtracts x from current val of neg_x.
neg_x = -x;
Even moreso, how would these two lines behave? Would they differ in semantics? n = -5
n =- 5
Overall, -= is just so much less ambiguous.EDIT: To your point about ->, I personally think C would be better if:
*p->next
parsed as: (*p)->next
instead of: *(p->next)
but maybe now I'm not thinking through all the parsing impliciations :)Regarding "n = -5", it would presumably be interpreted as "n=(-5)", same as today. Operators don't have spaces in them. So "n- -5" is "n-(-5)", rather than "n--5" (not valid).
so really the best way out is to be as verbose as possible imo; a = a + c or auto nodep = *nodepp; nodep->next;
Compilers and compute performance have grown to make the difference negligible for code output and compilation times but they definitely take a lot of mental complexity out of such scenarios (anything helps when grokking 10k+ lines of code).
Kind of like Jupyter notebooks for JavaScript? Kind of?