The Nanopass paper link doesn’t work.
I disagree with the author's point: Chapter 2 of the "Dragon book" ("Compilers: Principles, Techniques, and Tools" by Aho et al.) is a self-contained, end-to-end introduction to compilers, and it can be read on its own, ignoring the rest of that excellent book.
Another fantastic intro to compiler writing is Niklaus Wirth's short book "Compiler Construction", which explains, and contains, the surprisingly short source code of a complete compiler (the whole book is highly understandable: pristine clarity, really), all in under 100 pages total (99).
(I learned enough from these two sources to write a compiler in high school.)
Abdulaziz Ghuloum
http://scheme2006.cs.uchicago.edu/11-ghuloum.pdf
Abstract
Compilers are perceived to be magical artifacts, carefully crafted by the wizards, and unfathomable by the mere mortals. Books on compilers are better described as wizard-talk: written by and for a clique of all-knowing practitioners. Real-life compilers are too complex to serve as an educational tool. And the gap between real-life compilers and the educational toy compilers is too wide. The novice compiler writer stands puzzled facing an impenetrable barrier, “better write an interpreter instead.”
The goal of this paper is to break that barrier. We show that building a compiler can be as easy as building an interpreter. The compiler we construct accepts a large subset of the Scheme programming language and produces assembly code for the Intel-x86 architecture, the dominant architecture of personal computing. The development of the compiler is broken into many small incremental steps. Every step yields a fully working compiler for a progressively expanding subset of Scheme. Every compiler step produces real assembly code that can be assembled then executed directly by the hardware. We assume that the reader is familiar with the basic computer architecture: its components and execution model. Detailed knowledge of the Intel-x86 architecture is not required.
The development of the compiler is described in detail in an extended tutorial. Supporting material for the tutorial such as an automated testing facility coupled with a comprehensive test suite are provided with the tutorial. It is our hope that current and future implementors of Scheme find in this paper the motivation for developing high-performance compilers and the means for achieving that goal.
[1] Ometa https://tinlizzie.org/VPRIPapers/tr2007003_ometa.pdf
[2] Other ometa papers https://tinlizzie.org/IA/index.php/Papers_from_Viewpoints_Re...
[3] Adaptive compilation https://youtu.be/CfYnzVxdwZE?t=4575
the PhD thesis https://www.researchgate.net/publication/309254446_Adaptive_...
[4] Is it really "Complex"? Or did we just make it "Complicated"? Alan Kay https://youtu.be/ubaX1Smg6pY?t=3605
I have ignored all the stuff about parsing theory, parser generators, custom DSLs, formal grammars, etc., and instead have just been using the wonderful Megaparsec parser combinator library. I can easily follow the parsing logic, it's unambiguous (only one successful parse is possible, even if it might not be what you intended), it's easy to compose and reuse parser functions (particularly helpful for whitespace-sensitive parsing/line-fold handling), and it removes the tedious lexer/parser split you get with traditional parsing approaches.
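To show what the combinator style looks like without the Haskell machinery, here is a minimal Python sketch (hypothetical names like `char`, `many`, and `seq` are loose analogues of the Megaparsec originals, not its actual API): a parser is just a function from `(text, position)` to `(value, new_position)` or `None` on failure, and bigger parsers are built by composing smaller ones.

```python
# Minimal parser-combinator sketch: a parser is a function taking
# (s, i) and returning (value, new_index), or None on failure.

def char(c):
    """Parser that matches exactly one character c."""
    def parse(s, i):
        if i < len(s) and s[i] == c:
            return c, i + 1
        return None
    return parse

def many(p):
    """Apply parser p zero or more times, collecting results."""
    def parse(s, i):
        out = []
        while (r := p(s, i)) is not None:
            v, i = r
            out.append(v)
        return out, i
    return parse

def seq(*ps):
    """Run parsers in sequence; fail if any one fails."""
    def parse(s, i):
        out = []
        for p in ps:
            r = p(s, i)
            if r is None:
                return None
            v, i = r
            out.append(v)
        return out, i
    return parse

# Example: one 'a', then any further 'a's, then a 'b'.
ab = seq(char('a'), many(char('a')), char('b'))
print(ab("aaab", 0))  # (['a', ['a', 'a'], 'b'], 4)
print(ab("b", 0))     # None
```

The composability the comment praises falls out directly: `ab` is itself an ordinary parser and can be passed to `many` or `seq` like any primitive.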
Google search points me to https://github.com/cesarghali/PL241-Compiler/blob/master/DLX... for a description of the architecture and possibly https://bernsteinbear.com/assets/img/linear-scan-ra-context-... for the register allocation algorithm
There are some excellent books out there. In its own way, the dragon book is excellent, but it is a terrible starting place.
Here are a bunch of references from the same vintage as OP. I recommend starting with a book that actually walks through the process of building a compiler and doesn't spend its time exclusively with theory.
The first edition was my first CS textbook back in the '90s, and as a young programmer I learned a lot from it. A couple of years ago, however, I started on a modern compiler back-end and found that my knowledge needed quite a lot of updating.
The 2nd ed. covers data-flow analysis, which is very important. However, modern compilers (GCC, LLVM, Cranelift, ...) are built around an intermediate representation in Static Single Assignment (SSA) form. The 2nd ed. has only a single page about SSA, and you'd need to learn a lot of additional theory about its properties to actually use it properly.
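For straight-line code, the heart of SSA is just variable versioning: every assignment gets a fresh name, so each name is defined exactly once. The toy Python sketch below (my own illustration, not from the thread) does only this renaming step; a real SSA construction additionally needs dominance frontiers and phi-nodes at control-flow join points, which is the theory the 2nd ed. barely touches.

```python
def to_ssa(stmts):
    """Rename each assignment target to a fresh version (x -> x1, x2, ...)
    and rewrite uses to refer to the current version.
    stmts is a list of (target_variable, expression_tokens) pairs."""
    version = {}
    out = []
    for target, expr in stmts:
        # Rewrite uses with the latest version of each variable seen so far.
        expr = tuple(f"{v}{version[v]}" if v in version else v for v in expr)
        # Every assignment defines a brand-new version of its target.
        version[target] = version.get(target, 0) + 1
        out.append((f"{target}{version[target]}", expr))
    return out

prog = [('x', ('1',)),
        ('x', ('x', '+', '1')),
        ('y', ('x', '*', '2'))]
for stmt in to_ssa(prog):
    print(stmt)
# ('x1', ('1',))
# ('x2', ('x1', '+', '1'))
# ('y1', ('x2', '*', '2'))
```

Once every definition is unique like this, analyses such as constant propagation become simple lookups, which is why modern IRs are built around it.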
What taught me how to write an optimizer was a Stanford summer course taught by Ullman and Hennessy.
The code generator was my own concoction, and is apparently quite unlike any other one out there!
I have the Dragon Book, but have never actually read it. So sue me.
But, to also be fair, the above random access method does not work when you don't know what you don't know. So I understand why having a light, but good introduction to the topic is important, and I believe that's what the author is pointing out.
I think that the nanopass architecture is especially well suited for compilers implemented by LLMs as they're excellent at performing small and well defined pieces of work. I'd love to see Anthropic try their C compiler experiment again but with a Nanopass framework to build on.
I've recently been looking in to adding Nanopass support to Langkit, which would allow for writing a Nanopass compiler in Ada, Java, Python, or a few other languages [3].
[1]: https://andykeep.com/pubs/dissertation.pdf
Some years later I (re-) discovered Forth, and I thought "why not?" and built my own forth in 32-bit Intel assembly, _that_ brought back the wonder and "magical" feeling of compilers again. All in less than 4KB.
I guess I wasn't the right audience for the dragon book.
The book is famous for its SSA treatment. Chapters 1-8 are not required to understand SSA. This allows you to walk away with a clear win. Refer to 9.2 if you're struggling with dominance + liveness.
http://www.r-5.org/files/books/computers/compilers/writing/K...
So this made me do a runnable cheat sheet for Crafting Interpreters. I keep the parsing demonstrative, and the AST is a little more Lisp-y than the book's.
Disclaimer: it's meant to convey the essence of what you'll learn, it is NOT by any means a replacement for the book. I'd also describe the book as more of an experience (including some things Nystrom clearly enjoyed, like the visitor pattern) than a compilers manual. If anyone's interested, I can do a separate visitor-pattern cheat sheet too, also in Python.
I turned it into a 'public-facing artifact' from private scripts with an AI agent.
[0] https://ouatu.ro/blog/crafting-interpreters-cheat-sheet/
And the best thing about the parser combinator approach is that each is just a kind of parser, something like
type Lexer = ParsecT e ByteString m [Token]
type Parser = ParsecT e [Token] m Expr
All the usual helper functions like many or sepBy work equally well in the lexing and parsing phases. It really beats getting to the parentheses-interacting-with-ordering-of-division-operations stage and still having to think "have I already trimmed off the whitespace here or not?"
In fact, inventing new programming languages and writing compilers for them used to be so much of a trend that people created YACC (Yet Another Compiler Compiler) to make it easier.
I quite like "understanding and writing compilers" by Richard Bornat - written in the 1970s using BCPL as the implementation language, so rather old-fashioned, but it gives a friendly gentle overview of how to do it, without excessive quantities of parsing theory.
This one? https://people.inf.ethz.ch/wirth/CompilerConstruction/Compil...
> And after Volumes 1--5 are done, God willing, I plan to publish Volume 6 (the theory of context-free languages) and Volume 7 (Compiler techniques), but only if the things I want to say about those topics are still relevant and still haven't been said.
- types and typing
- optimization passes
- object files, executables, libraries and linking
Then two of them would be sufficient for writing a compiler.

I work in PL, and from my first compiler to today, have always found recursive descent easiest, most effective (fewer bugs, better error diagnostics, fast enough), and most intuitive. Many popular language compilers use recursive descent: I know of at least C# (Roslyn) and Rust, but I believe most do, except Haskell (GHC) and OCaml.
The LR algorithm was simple once I learned it, and yacc-like LR (and antlr-like LL) parser generators were straightforward once I learned how to resolve conflicts. But recursive descent (at least to me) is simpler and more straightforward.
LR being more expressive than LL has never mattered. A hand-written recursive descent parser is most expressive: it has unlimited lookahead, and can modify parsed AST nodes (e.g. reordering for precedence, converting if into if-else).
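A hand-written recursive descent parser really is this direct. The hypothetical Python sketch below handles operator precedence the standard way, with one function per precedence level, and builds the AST nodes in the right shape as it goes (the "reordering for precedence" the comment mentions):

```python
import re

def tokenize(src):
    """Split input into numbers and the operators we handle."""
    return re.findall(r"\d+|[+*()]", src)

class Parser:
    def __init__(self, tokens):
        self.toks = tokens
        self.pos = 0

    def peek(self):
        return self.toks[self.pos] if self.pos < len(self.toks) else None

    def eat(self, expected=None):
        tok = self.peek()
        if expected is not None and tok != expected:
            raise SyntaxError(f"expected {expected!r}, got {tok!r}")
        self.pos += 1
        return tok

    # expr := term ('+' term)*      -- lowest precedence
    def expr(self):
        node = self.term()
        while self.peek() == '+':
            self.eat('+')
            node = ('+', node, self.term())
        return node

    # term := atom ('*' atom)*      -- binds tighter than '+'
    def term(self):
        node = self.atom()
        while self.peek() == '*':
            self.eat('*')
            node = ('*', node, self.atom())
        return node

    # atom := number | '(' expr ')'
    def atom(self):
        if self.peek() == '(':
            self.eat('(')
            node = self.expr()
            self.eat(')')
            return node
        return ('num', int(self.eat()))

print(Parser(tokenize("1+2*3")).expr())
# ('+', ('num', 1), ('*', ('num', 2), ('num', 3)))
```

Note how precedence needs no grammar tables at all: `expr` calls `term` for its operands, so `*` naturally nests deeper than `+`, and the error message in `eat` can say exactly what was expected where.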
The only solution that comes close is tree-sitter, because it implements GLR, provides helpful conflict messages, and provides basic IDE support (e.g. syntax highlighting) almost for free. But it’s a build dependency, while recursive descent parsers can be written in most languages with zero dependencies and minimal boilerplate.
A lot of people say the dragon book is difficult, so I suppose there must be something there. But I don't see what it is, I thought it was quite accessible.
I'm curious, what parts/aspects of the dragon book make it difficult to start with?
https://www.eis.mdx.ac.uk/staffpages/r_bornat/#compilerbook
https://www.eis.mdx.ac.uk/staffpages/r_bornat/books/compilin...
Another alternative is basing the language on S-expressions, for which a parser is extremely simple to write.
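To back up that "extremely simple" claim, here is a complete S-expression reader in a dozen lines of Python (a generic illustration; the token-splitting trick of padding parentheses with spaces is a well-known one):

```python
def parse_sexpr(src):
    """Parse one S-expression into nested Python lists."""
    # Pad parentheses with spaces so a plain split() tokenizes everything.
    tokens = src.replace('(', ' ( ').replace(')', ' ) ').split()

    def read(i):
        tok = tokens[i]
        if tok == '(':
            out, i = [], i + 1
            while tokens[i] != ')':
                node, i = read(i)
                out.append(node)
            return out, i + 1  # skip the closing ')'
        # Atoms: integers become ints, everything else stays a symbol string.
        return (int(tok) if tok.lstrip('-').isdigit() else tok), i + 1

    node, _ = read(0)
    return node

print(parse_sexpr("(define (square x) (* x x))"))
# ['define', ['square', 'x'], ['*', 'x', 'x']]
```

Because the syntax is already the AST, the rest of such a compiler can start transforming the nested lists immediately, with no separate grammar, lexer, or parser generator.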
But then, pushing regular language theory into the curriculum, just to rush over it so you can use it for parsing, is way worse.
On the other hand technical books can be so overwhelmingly difficult that you need to go outside and do hours of learning to understand one tidbit of it
This ( https://github.com/tpn/pdfs/blob/master/Compiler%20Construct... ) seems to be a previous version (2005) and it's 131 pages long
Admittedly, volumes 5-7 wouldn't be as massive as volume 4 (it sort of turns out that almost all interesting algorithms end up being categorized as belonging in volume 4), so you probably wouldn't have a half-dozen subvolumes per topic, but it's still too many books down the line, especially if he plans to revise volumes 1-3 before working on anything else.
This would be like asking for a book on designing grammar. It's just too disjoint a field to have any kind of reasonable baseline, and it's drop-dead easy to knock a basic one together. With those two things being equal, just as with grammar, the answer to this is any resource about implementing the language you're trying to ape.
Types and Programming Languages, Benjamin C Pierce
> object files, executables, libraries and linking
Linkers and Loaders, John R Levine
A hand-written recursive descent parser is something you do later when you start to industrialize your code, to get better error messages, make the parser incremental, etc.
Bison/ANTLR are code generators, they do not fit well in that model.
I would now agree with that. My compiler experience was on a team that happened to have a LALR expert, Jeanne Musinski PhD, a student of Jeffrey Ullman. She invented a better error recovery for the language. Recursive descent would have been perfectly suited to the task.
> LR being more expressive than LL has never mattered.
Quite agree. One might guess that a language that needs that might be hard to program in.
Most of the work is actually the backend, and people sort of delude themselves into "creating a language" just because they have an AST.
The course I did was organized perfectly, with big parts of compiler boilerplate already written, so I only had to implement parser/lexer rules and the translation of language structures into assembly instructions. It was also a compiler for a language designed just for this course, with the intention of it being specifically easy to write a compiler for, not easy to program in.
Without this I can imagine it being a painful experience
https://web.archive.org/web/20190712115536/http://home.iae.n...
PS: Klong's intro to statistics: even if the compiler looks like a joke, it isn't. It can be damn useful. Far easier than Excel. And it comes with a command to output a PS file with your chart embedded.
Intro to statistics with Klong
https://t3x.org/klong/klong-intro.txt.html
https://t3x.org/klong/klong-ref.txt.html
On S9, well, it has Unix, Curses, sockets and so on support with an easy API. So it's damn easy to write something if you know Scheme/Ncurses and try stuff in seconds. You can complete the "Concrete Abstractions" book with it, and just adapt the graphic functions to create the (frame) one for SICP (and a few more).
And as we are doing compilers... with SICP you create from some simulator to some Scheme interpreter in itself.
The reasonable baseline would be something like Java 1. Scalars, arrays and classes. If I remember correctly, Lox even skips arrays as an exercise for the user.
I repeatedly skip parts that are not important to me when reading books like this. I grabbed a book about embedded design and skipped about half of it, which was bus protocols, as I knew I wouldn't need it. There is no need to read the dragon book from front to back.
> But there's a reason most modern resources skip over all of that and just make the reader write a recursive descent parser.
Unless the reason is explicitly stated there is no way to verify it's any good.
There's a reason people use AI to do their homework - it just doesn't mean it's a good one.
I can think of plenty of arguments for why you wouldn't look into the pros and cons of different parsing strategies in an introduction to compilers; "everyone is (or isn't) doing it" is not among them.
In the end, it has to be written down somewhere, and if no other book is doing it for whatever reason, then the dragon book it shall be. You can always recommend skipping that part if someone asks about what book to use.

One of them was a compilers course done by karpathy. It was pure joy and a great learning experience.
Also in my experience the joy of doing a course was much stronger correlated with the teacher's qualities rather than the subject itself.
It is also only the case for some compilers that most of the work is the backend, though of course all of this depends on how "backend" is defined. Is the backend just codegen, or is it all of the analysis between parsing and codegen? If you target a high-level language, which is very appropriate for one's first few compilers, the backend can be quite simple. At the simplest, no AST is even necessary and the compiler can just mechanically translate one syntax into another in a single pass.
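That simplest case really can be one recursive function. As a hypothetical toy (not from the thread), here is a "compiler" from S-expression-style arithmetic, represented as nested Python tuples, to Python source text, done in a single pass with no separate backend at all:

```python
def compile_expr(sexpr):
    """Translate nested prefix tuples like ('+', 1, ('*', 2, 3))
    into equivalent infix Python source text."""
    if isinstance(sexpr, (int, float)):
        return str(sexpr)          # literals pass straight through
    op, *args = sexpr
    # Fully parenthesize, so target-language precedence never matters.
    return '(' + f' {op} '.join(compile_expr(a) for a in args) + ')'

src = compile_expr(('+', 1, ('*', 2, 3)))
print(src)         # (1 + (2 * 3))
print(eval(src))   # 7
```

Targeting a high-level language means the host's own evaluator serves as your runtime and your test oracle, which is exactly why it suits a first compiler.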
Want to Write a Compiler? Read These Two Papers (2008) - https://news.ycombinator.com/item?id=10786842 - Dec 2015 (70 comments)
Want to Write a Compiler? Just Read These Two Papers. - https://news.ycombinator.com/item?id=2927784 - Aug 2011 (77 comments)
Want to Write a Compiler? Just Read These Two Papers - https://news.ycombinator.com/item?id=231758 - June 2008 (39 comments)
However, I could dig out the references to it. Apparently it was a course by Prof. Alex Aiken, and karpathy was a TA.
This repository seems to be a later version of the same course: https://github.com/gboduljak/stanford-compilers-coursework
Edit: found the videos from the course on the archive https://archive.org/details/academictorrents_e31e54905c7b266...
I've never been a good book learner, but I love taking something apart and tinkering with it to learn. A small toy compiler is way better than any book, and it's not like the LLM didn't absorb the book anyway during training.
Regardless, it is incredibly reckless to ask Claude to generate assembly if you don't understand assembly, and it's irresponsible to recommend this as advice for newbies. They will not be able to scan the source code for red flags like us pros. Nor will they think "this C compiler is totally untrustworthy, I should test it on a VM."
Imagine you don't know anything about programming, and you want to learn how to do it. You take a look at Amazon.com, and there's a highly recommended set of books by Knute or something with a promising title, The Art of Computer Programming, so you buy them. Now imagine that it's not just a poor choice, but that all the books on programming are written at that level.
That's the situation with books about writing compilers.
It's not that they're bad books, they're just too broadly scoped, and the authors present so much information that it's hard to know where to begin. Some books are better than others, but there are still the thick chapters about converting regular expressions into executable state machines and different types of grammars and so on. After slogging through it all you will have undoubtedly expanded your knowledge, but you're no closer to actually writing a working compiler.
Not surprisingly, the opaqueness of these books has led to the myth that compilers are hard to write.
The best source for breaking this myth is Jack Crenshaw's series, Let's Build a Compiler!, which started in 1988. This is one of those gems of technical writing where what's assumed to be a complex topic ends up being suitable for a first year programming class. He focuses on compilers of the Turbo Pascal class: single pass, parsing and code generation are intermingled, and only the most basic of optimizations are applied to the resulting code. The original tutorials used Pascal as the implementation language, but there's a C version out there, too. If you're truly adventurous, Marcel Hendrix has done a Forth translation (and as Forth is an interactive language, it's easier to experiment with and understand than the C or Pascal sources).
As good as it is, Crenshaw's series has one major omission: there's no internal representation of the program at all. That is, no abstract syntax tree. It is indeed possible to bypass this step if you're willing to give up flexibility, but the main reason it's not in the tutorials is because manipulating trees in Pascal is out of sync with the simplicity of the rest of the code he presents. If you're working in a higher level language--Python, Ruby, Erlang, Haskell, Lisp--then this worry goes away. It's trivially easy to create and manipulate tree-like representations of data. Indeed, this is what Lisp, Erlang, and Haskell were designed for.
That brings me to A Nanopass Framework for Compiler Education [PDF] by Sarkar, Waddell, and Dybvig. The details of this paper aren't quite as important as the general concept: a compiler is nothing more than a series of transformations of the internal representation of a program. The authors promote using dozens or hundreds of compiler passes, each being as simple as possible. Don't combine transformations; keep them separate. The framework mentioned in the title is a way of specifying the inputs and outputs for each pass. The code is in Scheme, which is dynamically typed, so data is validated at runtime.
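The nanopass idea can be sketched in a few lines of Python (my own illustration under an assumed tuple-based IR; the real framework is Scheme, with declared grammars for each intermediate language and runtime validation between passes). Each pass is a tiny, self-contained tree transformation, and the compiler is nothing but the pipeline:

```python
# Each pass is one small, separate transformation of a tuple-tree IR.

def fold_constants(node):
    """Pass 1: collapse '+' nodes whose operands are all integer literals."""
    if isinstance(node, tuple):
        node = tuple(fold_constants(child) for child in node)
        if node[0] == '+' and all(isinstance(c, int) for c in node[1:]):
            return sum(node[1:])
    return node

def lower_neg(node):
    """Pass 2: rewrite unary negation as subtraction from zero."""
    if isinstance(node, tuple):
        node = tuple(lower_neg(child) for child in node)
        if node[0] == 'neg':
            return ('-', 0, node[1])
    return node

def compile_ir(node, passes=(fold_constants, lower_neg)):
    """The whole compiler: run each pass over the IR in order."""
    for p in passes:
        node = p(node)
    return node

print(compile_ir(('neg', ('+', 1, 2))))  # ('-', 0, 3)
```

Keeping the passes separate, exactly as the paper urges, means each one can be tested, reordered, or replaced on its own; adding an optimization is adding one more small function to the tuple.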
After writing a compiler or two, then go ahead and plunk down the cash for the infamous Dragon Book or one of the alternatives. Maybe. Or you might not need them at all.
June 29, 2008
Regarding test coverage, this is a toy compiler. Don't use it to compile production code! Regarding while loops and such, again, this is a simple compiler intended only to compile sort and search functions written in C.
> Don't use it to compile production code!
This is an understatement. A more useful warning would be "don't use it to compile any code with a while loop." Seriously, this compiler looks terrible. Worse than useless.
If you really want AI to make a toy compiler just to help you learn, use Python or Javascript as a compilation target, so that the LLM's dumb bugs are mostly contained, and much easier to understand. Learn assembly programming separately.
The compiler is indeed useless for any purpose other than learning how compilers work. It has all the key pieces such as a lexer, abstract syntax tree, parser, code generator, and it is easy to understand.
If the general approach taken by the compiler is wrong then I would agree it is useless even for learning. But you are not making that claim, only claiming to have found some bugs.
The code quality is of course unacceptably terrible but there is no point in reviewing 1500 lines of LLM output. A starting point: learners will get nothing out of this without better comments. I understand what's going on since this is all Compilers 101. But considering it's a) stingily commented and b) incorrect, this project is 100% useless for learners. It's indefensible AI slop.
I am only asking you to backup your own assertions. If you can't then I would have to assume that you are denigrating AI because you are threatened by it.
I pointed to two very specific problems with this project, and also directly mentioned that assembly generation for while loops was broken. It's not a technical report but it's a lot of substance for an HN comment. Instead of addressing any of that substance, you resorted to an insulting ad hominem. You are thoughtlessly rejecting contrary evidence, then pretending I'm not providing any. It's straight out of the Flat Earther playbook.