LLMs as Compilers
What if we think about LLM coding as if it’s a compiler stage?
I want to explore the relationship of LLMs to compilers, inspired in part by articles like my AI skeptic friends are all nuts, and in part by a lot of time spent experimenting with LLMs for threat modeling. (On that front, I feel like I should have more useful things to say before saying them.) And by the way, by “experimenting,” I don’t mean just vibe-threat-modeling (although I’ve done some), but rather carefully constructing prompts, feeding them to multiple engines, scoring the results, and evaluating the evaluations.1 That’s a slow process, but I still feel there could be something meaningful, if we can get there. The other inspiration was talking with a friend who builds static analysis tools at a very large firm with a famously high bar for developer interviews. He mentioned that people are now checking in their LLM prompts.
In a sense, turning human-readable code into machine-executable code is what a compiler does, and that’s very similar to one way we use LLMs. You give it a prompt, and it turns that into code, which a later stage, either a reviewer or a compiler, tells you is shit.
When computers were new, compilers were actually pretty bad at writing code compared to smart, dedicated people. In fact, the people didn’t need to be that smart to outdo compilers well into the 90s. (If you go back and read “Smashing the stack for fun and profit,”2 you’ll see the concept of nop sleds, because compilers would just insert strings of no-ops, which (amongst other effects) made writing exploits easier.) The only people who can actually think about machine code these days work in either compilers (and runtimes and chip development) or exploit development.3
There’s a very longstanding aspiration to make programming languages that look like natural languages. You can see this in languages like SQL, but even more in the many programming languages named “English.” These languages have not been incredibly successful at their goal of letting anyone program, probably because language syntax has not been as big a barrier as thinking about algorithms, edge cases, and other elements of programming.
Today, LLMs are pretty bad at writing code. They’re better where they have more training data, such as Python, and worse where they have less data or the data isn’t consistent. (This is a particular problem with Swift: Apple has revised the language roughly annually, so the training data is both scarce and inconsistent as conventions change.) Before reading Tom’s “all nuts” essay, I’d have said they’re better when you give them prompts that look like specifications. I know folks who like to write a long comment explaining what a function does, then feed the comment to an LLM to get code that matches it. (I personally find that pattern far preferable to having an LLM in my IDE, because the IDE version wants to intrude whenever it thinks it has something to say, which is often when I’m deep in thought.)
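To make that pattern concrete, here’s a minimal, invented example: the comment is the spec you’d hand the model, and the function below it is the kind of code you’d hope to get back. Both the spec and parse_duration are hypothetical, not from any particular model or codebase.

```python
# Spec, written first and pasted into the model as the prompt:
#
#   parse_duration(text) -> int
#   Accepts strings like "90s", "5m", "2h", or a bare integer (seconds).
#   Units are case-insensitive; surrounding whitespace is ignored.
#   Returns the duration in seconds, or raises ValueError otherwise.

def parse_duration(text: str) -> int:
    """Turn a human-friendly duration string into seconds."""
    s = text.strip().lower()
    if s.isdigit():
        return int(s)
    units = {"s": 1, "m": 60, "h": 3600}
    if len(s) > 1 and s[-1] in units and s[:-1].isdigit():
        return int(s[:-1]) * units[s[-1]]
    raise ValueError(f"unrecognized duration: {text!r}")
```

The point is the workflow, not the function: the spec carries the thinking, and the review step checks the code against it.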
When we think about LLMs as compilers, we could design them to compile directly to a low-level target, such as WASM or x86. Or we can consider them as compiling from a high-level language, such as English, to an intermediate language like Python, to be compiled further. I think this is a more useful frame, and it lets us start to think about linters and other such tools that work on both English and high-level programming languages. We get other value from using intermediate languages, especially if we pick languages that give us various safety properties, such as Rust or Go. And language design always includes a tradeoff of pickiness versus usability (I’ve been saying for 20 years that I don’t like GCC’s choice of tradeoff). With LLMs as compilers, we might make different tradeoffs in how tools like GCC work, or we might even design new languages with the goal of being easy to use in LLM-driven workflows.
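Here’s a minimal sketch of that frame, assuming a hypothetical generate_code() stand-in for whatever model you actually call; the only real machinery is Python’s ast module, playing the role of the cheapest possible linter on the intermediate language.

```python
import ast

def generate_code(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: English prompt in, Python source out."""
    raise NotImplementedError("wire up your model of choice here")

def compile_prompt(prompt: str) -> str:
    """Treat the LLM as a front end emitting Python as an intermediate language,
    then run ordinary back-end checks before anything reaches a reviewer."""
    source = generate_code(prompt)
    try:
        ast.parse(source)  # bare-minimum "linter" stage: does it even parse?
    except SyntaxError as err:
        raise ValueError(f"model emitted unparseable code: {err}") from err
    return source  # hand off to real linters, type checkers, tests, review...
```

The same structure leaves room for a second set of linters upstream, over the English prompt itself.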
We can also think about how to build different security analysis techniques. For example, with Cursor, many people are starting to create “design” and “development” documents that specify how a system should work. We might ask “is threat modeling a linter on the design documents?” or “what parts of threat modeling might be linters on design documents?” Can we treat “falsehoods programmers believe” as a set of test cases to be run?
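As one illustration of that last question, a hedged sketch: the display_name placeholder and the sample names are invented, but each test case encodes one of the classic “falsehoods programmers believe about names.”

```python
import pytest

def display_name(raw: str) -> str:
    """Placeholder for whatever your codebase actually does with user names."""
    return raw.strip()

# Each entry encodes a "falsehood programmers believe about names."
FALSEHOOD_NAMES = [
    "李小龙",                  # names aren't always ASCII
    "O'Brien",                 # names can contain punctuation
    "Nguyễn Thị Minh Khai",    # names can have many parts and diacritics
    "X Æ A-12",                # names can contain digits
]

@pytest.mark.parametrize("name", FALSEHOOD_NAMES)
def test_display_name_survives_real_names(name):
    # The weakest useful property: nothing crashes, nothing is silently emptied.
    assert display_name(name) != ""
```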
I’ve been using this model of LLMs as additional compiler stages for a bit, and finding it evocative and interesting.
Footnotes
- Along the lines of ACSE-Eval: Can LLMs threat model real-world cloud infrastructure? I hope to say more about that paper; short form: they’re doing what I hope to be able to do.
- Jesus, Google is bad at authoritative sources. The Phrack article isn’t on page 1.
- For fun, next time you meet someone claiming to be a “full stack engineer,” ask them to explain the most common MOV instructions and how a chip implements them.