If you don't like the results or the process, you have to switch targets or add new intermediates. For example, instead of doing description -> implementation, do description -> spec -> plan -> implementation.
There are people playing around with straight machine-code generation, or with integrating ML into the optimisation backend; compiling via a translation to an existing language is already a given in vibe coding with agents.
Speaking of which, using agentic runtimes is hardly any different from writing programs: there are some instructions which then get executed just like any other application, and whether they get compiled before execution or plainly interpreted becomes a runtime implementation detail.
Are we there yet without hallucinations?
Not yet, however the box is already open, and there are enough people trying to make it happen.
This is technically true. But unimportant. When I write code in a higher level language and it gets compiled to machine code, ultimately I am testing statically generated code for correctness. I don’t care what type of weird tricks the compiler did for optimizations.
How is that any different than when someone is testing LLM-generated C code? I’m still testing C code that isn’t going to magically be changed by the LLM without my intervention, any more than my C code is going to be changed without my recompiling it.
On this latest project I was on, the Python code generated by Codex was “correct” on the happy path. But there were subtle bugs in the distributed locking mechanics and some other concurrency controls I specified. Ironically, those were both caught by throwing the code into ChatGPT in thinking mode.
No one is using an LLM to compute whether a number is even or odd at runtime.
One current idea of mine is to iteratively make things more and more specific; this is the approach I take with psuedocode-expander ([0]), and it has proven generally useful. I think there's a lot of value in the LLM building from the top down with human feedback, for instance, instead of one-shot generating something linearly. I give a lot more examples on the repo for this project, and encourage any feedback or thoughts on LLM-driven code generation in a more sustainable way than vibe coding.
[0]: https://github.com/explosion-Scratch/psuedocode-expander/
We have mechanisms for ensuring output from humans, and those are nothing like ensuring the output from a compiler. We have checks on people, we have whole industries of people whose whole careers are managing people, to manage other people, to manage other people.
With regard to predictability, LLMs essentially behave like people in this manner. The same kind of checks that we use for people are needed for them, not the same kind of checks we use for software.
This is why I think the better goal is an abstraction layer that differentiates human decisions from default (LLM) decisions. A sweeping "compiler" locks humans out of the decision making process.
It all feels to me like the guys who make videos of using electric drills to hammer in a nail - sure, you can do that, but it is the wrong tool for the job. Everyone knows the phrase: "When all you have is a hammer, everything looks like a nail." But we need to also keep in mind the other side of that coin: "When all you have is nails, all you need is a hammer." LLMs are not a replacement for everything that happens to be digital.
This could be a good way to learn how robust your tests are, and also what accidental complexity could be removed by doing a rewrite. But I doubt that the results would be so good that you could ask a coding agent to regenerate the source code all the time, like we do for compilers and object code.
>From one gut feeling I derive much consolation: I suspect that machines to be programmed in our native tongues —be it Dutch, English, American, French, German, or Swahili— are as damned difficult to make as they would be to use.
The more I use LLMs, the more I find this true. Haskell made me think for minutes before writing one line of code. Result? I stopped using Haskell and went back to Python because with Py I can "think while I code". The separation of thinking|coding phases in Haskell is what my lazy mind didn't want to tolerate.
Same goes with LLMs. I want the model to "get" what I mean, but oftentimes (esp. with Codex) I must be very specific about the project scope and spec. Codex doesn't let me "think while I vibe", because every change is costly and you'd better have a good recovery plan (git?) when Codex goes astray.
I think this is an interesting development, because we (linguists and logicians in particular) have spent a long time developing a highly specified language that leaves no room for ambiguity. One could say that natural language was considered deficient – and now we are moving in the exact opposite direction.
Using LLMs to do something like what a compiler can already do is also modelling LLMs as infinite rather than finite. In fact in this particular situation not only are they finite, they're grotesquely finite, in particular, they are expensive. For example, there is no world where we just replace our entire infrastructure from top to bottom with LLMs. To see that, compare the computational effort of adding 10 8-digit numbers with an LLM versus a CPU. Or, if you prefer something a bit less slanted, the computational costs of serving a single simple HTTP request with modern systems versus an LLM. The numbers run something like LLMs being trillions of times more expensive, as an opening bid, and if the AIs continue to get more expensive it can get even worse than that.
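To put a rough number on the CPU side of that comparison: summing ten 8-digit values is a handful of add instructions, executed in nanoseconds, with no model inference involved. A minimal C sketch (the values are arbitrary):

  /* Rough illustration of the cost gap: summing ten 8-digit numbers compiles
     down to roughly one add per element, i.e. nanoseconds on any modern CPU.
     An LLM must run a full forward pass per generated token to do the same. */
  #include <stdio.h>

  int main(void) {
      long nums[10] = {12345678, 23456789, 34567890, 45678901, 56789012,
                       67890123, 78901234, 89012345, 90123456, 10293847};
      long sum = 0;
      for (int i = 0; i < 10; i++)
          sum += nums[i];            /* one add per element */
      printf("%ld\n", sum);
      return 0;
  }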
For similar reasons, using LLMs as a compiler is very unlikely to ever produce anything even remotely resembling a payback versus the cost of doing so. Let the AI improve the compiler instead. (In another couple of years. I suspect today's AIs would find it virtually impossible to significantly improve an already-optimized compiler today.)
Moreover, remember, oh, maybe two years back when it was all the rage to have AIs be able to explain why they gave the answer they did? Yeah, I know, in the frenzied greed to be the one to grab the money on the table, this has sort of fallen by the wayside, but code is already the ultimate example of that. We ask the LLM to do things, it produces code we can examine, and the LLM session then dies away leaving only the code. This is a good thing. This means we can still examine what the resulting system is doing. In a lot of ways we hardly even care what the LLM was "thinking" or "intending", we end up with a fantastically auditable artifact. Even if you are not convinced of the utility of a human examining it, it is also an artifact that the next AI will spend less of its finite resources simply trying to understand and have more left over to actually do the work.
We may find that we want different programming languages for AIs. Personally I think we should always try to retain that ability for humans to follow it, even if we build something like that. We've already put the effort into building AIs that produce human-legible code and I think it's probably not that great a penalty in the long run to retain that. At the moment it is hard to even guess what such a thing would look like, though, as the AIs are advancing far faster than anyone (or any AI) could produce, test, prove out, and deploy such a language, against the advantage of other AIs simply getting better at working with the existing coding systems.
Why? Because new languages have an IR in their compilation path?
You can see it clearly if you just translate the article's expensive vocabulary into plain English. When the author writes, 'When you hand-build, the space of possibilities is explored through design decisions you’re forced to confront,' they are just saying, 'When you write code yourself, you have to choose how to write it.' When they claim, 'contextuality is dominated by functional correctness,' they just mean, 'Usually, we just care if the code works.' When they warn about 'inviting us to outsource functional precision itself,' they really mean, 'LLMs let you be lazy.' And finally, 'strengthening the will to specify' is just a dramatic way of saying, 'We need to write better requirements.' It is obscurantism, plain and simple: using complexity to hide the fact that the insight is trivial.
But that is just an aesthetic problem to me. Worse: the argument collapses entirely when you look at the logical leap between the premises.
The author basically argues that because Natural Language is vague, engineers will inevitably stop caring about the details and just accept whatever reasonable output the AI gives. This is pure armchair psychology. It assumes that just because the tool allows for vagueness, professionals will suddenly abandon the concept of truth or functional requirements. That is a massive, unsubstantiated jump.
We use fuzzy matching to find contacts on our phones all the time. Just because the search algorithm is imprecise doesn't mean we stop caring if we call the right person. We don't say, 'Well, the fuzzy match gave me Bob instead of Bill, I guess I'll just talk to Bob now.' The hard constraint, the functional requirement of talking to the specific person you need, remains absolute. Similarly, in software, the code either compiles and passes the tests, or it doesn't. The medium of creation might be fuzzy, but the execution environment is binary. We aren't going to drift into accepting broken banking software just because the prompt was in English.
This entire essay feels like the work of those social psychology types who have now been thoroughly discredited by the replication crisis in psychology. The ones who were more concerned with dazzling people with verbal skills than with being right. It is unnecessarily complex, relying on projection of dreamt-up concepts and behavior rather than observation. It tries to sound profound by turning a technical discussion into a philosophical crisis, but underneath the word salad, it is not just shallow, it is wrong.
One of the first things I tried to have an LLM do is transpile. These days that works really well. You find an interesting project in Python, I'm a JS guy, boom, JS version. Very helpful.
This feels like the same debate assembly programmers had about C in the 70s. "You don’t understand what the compiler is doing, therefore it’s dangerous". Eventually we realised the important thing isn’t how the code was authored but whether the behaviour is correct, testable, and maintainable.
If code generated by an LLM:
- passes a real test suite (not toy tests),
- meets performance/security constraints,
- goes through review like any other change,
then the acceptance criteria haven’t changed. The test suite is part of the spec. If the spec is enforced in CI, the authoring tool is secondary. The real risk isn’t "LLMs as compilers", it’s letting changes bypass verification and ownership. We solved that with C, with large dependency trees, with codegen tools. Same playbook applies here.
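As a hedged sketch of what "the test suite is part of the spec" can mean in practice (parse_port and its cases are hypothetical, not something from the thread), CI enforces these assertions regardless of whether a human or an LLM authored the implementation:

  #include <assert.h>
  #include <stdio.h>
  #include <stdlib.h>

  /* Hypothetical function under test: parse a decimal TCP port, -1 on error. */
  static int parse_port(const char *s) {
      char *end;
      long v = strtol(s, &end, 10);
      if (*s == '\0' || *end != '\0' || v < 1 || v > 65535)
          return -1;
      return (int)v;
  }

  int main(void) {
      /* The assertions, not the prompt, are the enforceable part of the spec. */
      assert(parse_port("8080") == 8080);   /* happy path                */
      assert(parse_port("0") == -1);        /* boundary: port 0 rejected */
      assert(parse_port("65536") == -1);    /* boundary: out of range    */
      assert(parse_port("80x") == -1);      /* trailing garbage          */
      assert(parse_port("") == -1);         /* empty string              */
      puts("spec satisfied");
      return 0;
  }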
If you give expected input and get expected output, why does it matter how the code was written?
yet nobody complained about this
in fact engineers appreciate that: "we are not replaceable code-monkey cogs in the machine, as management would like"
But for reference, we don't (usually) care which register the compiler uses for which variable, we just care that it works, with no bugs. If the non-determinism of LLMs means the variable is called file, filename, fileName, or file_name, breaking with convention, why do we care? At the level Claude lets us work with code now, it's immaterial.
Compilation isn't stable. If you clear caches and recompile, you don't get a bit-for-bit exact copy, especially on today's multi-core processors, without doing extra work to get there.
The obvious has been stated.
Stop this. This is such a stupid way of describing mistakes from AI. Please try to use the confusion matrix or any other framing. If you're going to try and make arguments, it's hard to take them seriously if you keep regurgitating that LLMs hallucinate. It's not a well-defined term, so if you continually make this your core argument, it becomes disingenuous.
Well, you can always set temperature to 0, but that doesn't remove hallucinations.
You see a business you like, boom competing business.
These are going to turn into business factories.
Anthropic has a business factory. They can make new businesses. Why do they need to sell that at all once it works?
We're focusing on a compiler implementation. Classic engineering mindset. We focus on the neat things that entertain us. But the real story is what these models will actually be doing to create value.
This is not really a point about whether LLMs can currently be used as English compilers, but more questioning whether determinism of the final machine code output is a critical property of a build system.
"This gets to my core point. What changes with LLMs isn’t primarily nondeterminism, unpredictability, or hallucination. It’s that the programming interface is functionally underspecified by default."
> passes a real test suite (not toy tests)
“not toy tests” is doing a lot of heavy lifting here. Like an immeasurable amount of lifting.
Can you formally verify prose?
> But yeah, they probably don't fit the bill of English based code to machine code
Which is why LLMs cannot be compilers that transform code to machine code.
The same thing happens in JavaScript. I debug it using a JavaScript debugger, not with gdb. Even when using a bash script, you don’t debug it by going into the program's source code, you just consult the man pages.
When using an LLM, I would expect not to go and verify the code to see whether it is actually semantically correct.
You might not, but plenty of others do. On the JVM, for example, anyone building a performance-sensitive application has to care about what the compiler emits and how the JIT behaves. Simple things like accidental boxing, megamorphic calls preventing inlining, etc. have massive effects.
I've spent many hours benchmarking, inspecting in JITWatch, etc.
Those checks work for people because humans and most living beings respond well to reward/punishment mechanisms. It’s the whole basis of society.
> not the same kind of checks we use for software.
We do have systems that are non-deterministic (computer vision, various forecasting models…). We judge those by their accuracy and the likelihood of false positives or false negatives (when it’s a classifier). Why not use those metrics?
The whole benefit of computers is that they don't make stupid mistakes like humans do. If you give a computer the ability to make random mistakes all you have done is made the computer shitty. We don't need checks, we need to not deliberately make our computers worse.
For context, my initial implementation went through the official AWS open source process (it’s no longer there) five years ago, and I’m still getting occasional emails and LinkedIn messages because it’s one of the best ways to solve the problem that is publicly available. The last couple of times, I basically gave the person the instructions I gave ChatGPT (since I couldn’t give them the code) and told them to have it regenerate the code in Python. It would do much better than what I wrote when I didn’t know the service as well as I do now, and the service has more features that you have to be concerned about.
"Deterministic" is not the the right constraint to introduce here. Plenty of software is non-deterministic (such as LLMs! But also, consensus protocols, request routing architecture, GPU kernels, etc) so why not compilers?
What a compiler needs is not determinism, but semantic closure. A system is semantically closed if the meanings of its outputs are fully defined within the system, correctness can be evaluated internally and errors are decidable. LLMs are semantically open. A semantically closed compiler will never output nonsense, even if its output is nondeterministic. But two runs of a (semantically closed) nondeterministic compiler may produce two correct programs, one being faster on one CPU and the other faster on another. Or such a compiler can be useful for enhancing security, e.g. programs behave identically, resist fingerprinting.
Nondeterminism simply means the compiler selects any element of an equivalence class. Semantic closure ensures the equivalence class is well‑defined.
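A tiny C illustration of that equivalence-class idea, borrowing the even/odd example mentioned elsewhere in the thread: both bodies below satisfy the same specification, so a semantically closed but nondeterministic compiler would be free to emit either, and never something outside the class.

  /* Two members of one equivalence class: either implementation is a correct
     answer to the spec "return whether n is even". */
  #include <assert.h>
  #include <stdbool.h>

  bool is_even_arith(int n) { return n % 2 == 0; }   /* arithmetic form */
  bool is_even_bits(int n)  { return (n & 1) == 0; } /* bitwise form    */

  int main(void) {
      for (int n = -4; n <= 4; n++)
          assert(is_even_arith(n) == is_even_bits(n));
      return 0;
  }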
They are designed to be deterministic when temperature=0. Some hardware configurations are known to defy that assumption, but when running on perfect hardware they most definitely are.
What you call compilers are also nondeterministic on 'faulty' hardware, so...
Like I said above, I do know to watch out for implementations that “Work on my Machine” but don’t work at scale or involve concurrency. But I have had to check for the same issues when I delegate work to more junior developers.
This is not meant to be an insult toward you. But since I haven’t done front-end development for well over a decade, a front-end developer might as well be a “human LLM” to me. I’m going to give you the business requirements and constraints and you are going to come back with a website. I am just going to check that it meets the business requirements and not tell you the how. I’m definitely not going to look at the code.
I just had a web project I had to modify for a new project. I used Codex and didn’t look at a line of code. Yeah, I know JavaScript. But I have no idea whether the code from the initial developer, who worked on another project I led, or the Codex changes were idiomatic. I know the developer and Codex met my functional requirements.
There are even efforts to guarantee this for many packages on Linux - it’s a core property of security because it lets you validate that the compilation process or environment wasn’t tampered with illicitly, by being able to verify by building from scratch.
Now, actually managing to fix all inputs and getting deterministic output can be challenging, but that’s less to do with the compiler and more to do with the challenge of completely controlling the entire environment: the profile you are using for PGO, isolating paths on the build machine being injected into the binary, programs that have things in their source or build system that are non-deterministic (e.g. incorporating the build time into the binary), and so on.
> PGO seems like it ought to have a random element.
PGO should be deterministic based on the runs used to generate the profile. The runs are tracking information that should be deterministic--how many times does the branch get taken versus not taken, etc. HWPGO, which relies on hardware counters to generate profiling information, may be less deterministic because the hardware counters end up having some statistical slip to them.
Hence it is hard to do benchmarks with various kinds of GC and dynamic compilers.
You can't even expect deterministic code generation for the same source code across various compilers.
LLM code completion compares unfavourably to the (heuristic, nigh-instant) picklist implementations we used to use, both at the low-level (how often does it autocomplete the right thing?) and at the high-level (despite many believing they're more effective, the average programmer is less effective when using AI tools). We need reasons to believe that LLMs are great and do all things, therefore we look for measurements that paint it in a good light (e.g. lines of code written, time to first working prototype, inclination to output Doom source code verbatim).
The reason we're all using (or pretending to use) LLMs now is not because they're good. It's almost entirely unrelated.
If they are junior developers working in Java, they may just as well build an AbstractFactoryConcurrentSingletonBean because that’s what they learned in school, just as an LLM would from training on code it found on the Internet.
Yes, I know every millisecond a company like Google can shave off is multiplied by billions of transactions a day and can save real money on infrastructure. But even at a second-tier company like Salesforce, it probably doesn’t matter.
or does your binary always come out differently each time you compile the same file??
You can try it: compile the same file 10 times and diff the resultant binaries.
Now try to prompt a bunch of LLMs 10 times and diff the returned rubbish.
Humans, in all their non deterministic brain glory, long ago realized they don't want their software to behave like their coworkers after a couple of margaritas.
No. LLMs are undefined behavior.
Then my job became being assigned a larger implementation and, depending on how large the implementation was, designing specifications for others to do some or all of the work and validating the final product for correctness. I definitely didn’t pore over every line of code - especially not for front-end work that I stopped doing around the same time.
The same is true for LLMs. I treat them like junior developers, and I am slowly starting to treat them like halfway-competent mid-level ticket takers.
I’ve been going round and round in my mind about a particular discussion around LLMs: are they really similar to compilers? Are we headed toward a world where people don’t look at the underlying code for their programs?
People have been making versions of this argument since Andrej Karpathy’s “English is the hottest new programming language.” Computer science has been advancing language design by building higher and higher level languages; this is the latest iteration: maybe we no longer need a separate language to express ourselves to machines; we can just use our native tongues (or at least English).
My stance has been pretty rigid for some time: LLMs hallucinate, so they aren’t reliable building blocks. If you can’t rely on the translation step, you can’t treat it as a serious abstraction layer because it provides no stable guarantees about the underlying system.
As models get better, hallucinations become less central (even though models still make plenty of mistakes). Lately I’ve been thinking about a different question: imagine an LLM that never “hallucinates” in the usual sense, one that reliably produces some plausible implementation of what you asked. Would that make it the next generation of compiler? And what would that mean for programming and software engineering in general?
This post is my stab at that question. The core of my argument is simple:
Specifying systems is hard; and we are lazy.
Before getting to what that means in practice, I want to pin down something else: what does it mean for a language to be “higher level”?
Programming is, at a fundamental level, the act of making a computer do something. Computers are very dumb from the point of view of a human: you need to tell the computer exactly what to do; there's no inference. A computer fundamentally doesn't even have the notion of a value, type, or concept; everything is a series of bits, which are processed to generate other bits, and we bring the meaning to this whole ordeal. Very early on, people started by building arithmetic and logical instructions into computers: you would have two different bit sequences, each denoting a number, and you could add, subtract, or multiply them. In order to make a computer do something, you could denote your data in terms of a bunch of numbers, map your logical operations onto those ALU instructions, and interpret the result in your domain at the end. Then you could define a bunch of operations on your domain, which would be compiled down to those smaller ALU instructions, and voila, you have a compiler at hand.
This compiler is, admittedly, kind of redundant. It doesn't do anything you wouldn't be able to do yourself, because you essentially have a direct mapping between your two languages: your higher-level language desugars into a bunch of lower-level ALU instructions, so anyone would be able to implement the same mapping very easily, and even go further, perhaps just writing the ALU instructions themselves.
What real higher-level languages do is give you an entirely new language that is eventually mapped to the underlying instruction set through non-trivial mechanisms, in order to reduce the mental complexity on the side of the programmer. For instance, instruction sets do not have the concept of variables, nor loops, nor data structures. You can definitely build a sequence of instructions that amounts to a binary search tree, but the mental burden of the process is orders of magnitude higher than in any classic programming language. Structs, Enums, Classes, Loops, Conditionals, Exceptions, Variables, and Functions are all constructs that exist in higher-level languages and are compiled away when going down the stack.
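To make that concrete, here is the binary-search-tree example sketched in C; the comments gesture, loosely, at what each construct is lowered to (the exact output depends on the compiler and target).

  /* None of these constructs exist at the instruction-set level; the compiler
     lowers them, roughly, to labels, conditional jumps, compares, and
     base-plus-offset loads. */
  #include <stddef.h>

  struct Node { int key; struct Node *left, *right; };

  int contains(const struct Node *n, int key) {
      while (n != NULL) {                  /* loop   -> label + conditional jump */
          if (key == n->key)               /* branch -> compare + jump           */
              return 1;
          n = (key < n->key) ? n->left     /* field access -> offset from base   */
                             : n->right;
      }
      return 0;
  }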
There's a crucial aspect of compilation, which is that the programmer gives away some control; that's essentially what removes the mental burden. If a programming language doesn't give away any control, it arguably isn't a very useful abstraction layer, because it did not absolve you of any responsibility that comes with that control. One of the first examples of this type of control we gave away is code layout. If you are writing handwritten assembly, you control where the code lives in the program memory. When you move to a language with structured control flow and callable procedures, you no longer have exact control over when the instructions for a particular piece of code are fetched, or how basic blocks are arranged in memory. Other examples are more common: the runtime of a language works in the background to absolve you of other responsibilities, such as manual memory management, which itself was an abstraction for managing how your data is organized in memory in the first place.
This loss of control raises a question: how do we know the abstraction is implemented correctly? More importantly, what does it mean for an abstraction to be correct?
There are a few layers to the answer. First, mature abstractions are defined against some semantics: what behaviors are allowed, what behaviors are forbidden, and what guarantees you’re meant to rely on. In C, malloc gives you a pointer to a block of memory of at least the requested size (or NULL), suitably aligned, which you may later free. It doesn’t give you “exclusive ownership” in the language-theoretic sense, but it does define a contract you can program against.
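A small C sketch of programming against that contract; nothing here depends on any particular allocator implementation.

  /* The contract described above, as you program against it: malloc returns a
     suitably aligned pointer to at least n bytes, or NULL; the caller checks
     for NULL and eventually frees the block. */
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  static char *duplicate(const char *s) {
      size_t n = strlen(s) + 1;
      char *copy = malloc(n);      /* contract: >= n bytes, aligned, or NULL  */
      if (copy == NULL)
          return NULL;             /* the failure mode is part of the contract */
      memcpy(copy, s, n);
      return copy;                 /* caller is now responsible for free()     */
  }

  int main(void) {
      char *c = duplicate("hello");
      if (c != NULL) {
          puts(c);
          free(c);
      }
      return 0;
  }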
Second, we validate implementations with testing (and sometimes proofs), because these guarantees are at least in principle checkable. Third, in practice, guarantees are contextual: most programs care that allocation works; only some care deeply about allocator performance, fragmentation behavior, or contention, those are the cases where people swap allocators or drop down a level.
This highlights a critical point: abstraction guarantees aren’t uniform; they’re contextual. Most of the time, that contextuality is dominated by functional correctness: “does it do what it says?” Programming languages made enormous progress by giving us abstractions whose functional behavior can be specified precisely and tested relentlessly. We can act as if push/pop on a Python list has the same semantics as a vector in C++ even when the underlying implementation differs wildly across languages and runtimes.
LLM-based programming challenges this domination because the “language” (natural language) doesn’t come with precise semantics. That makes it much harder to even state what functional correctness should mean without building a validation/verification suite around it (tests, types, contracts, formal specs).
This gets to my core point. What changes with LLMs isn’t primarily nondeterminism, unpredictability, or hallucination. It’s that the programming interface is functionally underspecified by default. Natural language leaves gaps; many distinct programs can satisfy the same prompt. The LLM must fill those gaps.
Just as a garbage-collected runtime takes control over how and when memory is reclaimed, “programming in English” relinquishes control over which exact program gets built to fulfill your requirements. The underspecification forces the model to guess the data model, edge cases, error behavior, security posture, performance tradeoffs in your program, analogous to how an allocator chooses an allocation strategy.
This creates quite a novel danger in how we write programs.
Humans have always written vague requirements; that part isn’t new. What’s new is how directly an LLM can turn vagueness into running code, inviting us to outsource functional precision itself. We can leave meaningful behavioral choices to a generator and only react to the outcome.
If you say “give me a note-taking app,” you’re not describing one program, you’re describing a huge space of programs. The LLM can return one of a billion “reasonable” implementations: something Notion-like, Evernote-like, Apple Notes-like, or something novel. The danger is that “reasonable” choices can still be wrong for your intent, and you won’t notice which commitments got made until later.
This pushes development toward an iterative refinement loop: write an imprecise spec, get one of the possible implementations, inspect it, refine the spec, repeat until you’re satisfied. In this mode, you become more like a consumer selecting from generated artifacts than a producer deliberately constructing one.
And you also lose something subtle: when you hand-build, the “space of possibilities” is explored through design decisions you’re forced to confront. With a magic genie, those decisions get made for you; you only see the surface of what you ended up with.
I don’t think this point is widely internalized yet: hallucinations aren’t the only problem. Even in a hallucination-free world, the ability to take the easy way out on specification plays into a dangerously lazy part of the mind. You can see it in the semi-conscious slips (I’m guilty too): accept-all-edits, “one more prompt and it’ll be fine,” and slow drifting into software you don’t really understand.
That’s why I think the will to specify is going to become increasingly important. We already see LLMs shine when they’re given concrete constraints: optimization, refactors, translations, migrations. Tasks that used to be so labor-intensive we’d laugh at the timeline become feasible when the target behavior is well specified and backed by robust test suites.
It’s been true for a long time that specifying a piece of software is often harder than building it. But we may be entering a world where: if you can specify, you can build. If that’s right, then specification and verification become the bottleneck, and therefore the core skill.
This isn’t my most polished post, but I wanted to get the idea out. I do think it’s possible to treat LLMs as compiler-like, in the loose sense that they translate a specification into an executable artifact. But the control we relinquish to that translation layer is larger than it has ever been.
Traditional compilers reduce the need to stare at lower layers by replacing low-level control with defined semantics and testable guarantees. LLMs also reduce the need to read source code in many contexts, but the control you lose isn’t naturally bounded by a formal language definition. You can lose control all the way into becoming a consumer of software you meant to produce, and it’s frighteningly easy to accept that drift without noticing.
So: I don’t think we should fully accept the compiler analogy without qualification. As LLMs become a central toolchain component, we’ll need ways to strengthen the will to specify, and to make specification and verification feel as “normal” as writing code used to.
LLMs are not designed for that.
No, a compiler needs determinism. The article is quite correct on this point: if you can't trust that the output of a tool will be consistent, you can't use it as a building block. A stochastic compiler is simply not fit for purpose.
I am not. To me that describes a debugging fiasco. I don't want "semantic closure," I want correctness and exact repeatability.
To say the least, this is garbage compared to compilers
That a compiler might pick among different specific implementations in the same equivalence class is exactly what you want a multi-architecture optimizing compiler to do. You don't want it choosing randomly between different optimization choices within an optimization level; that would be non-deterministic at compile time and largely useless, assuming that there is at most one most-optimized equivalent. I always want the compiler to choose to xor a register with itself to clear it, rather than explicitly setting it to zero, if that's faster and makes the most sense given the inputs/constraints.
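For instance, a minimal C version of that xor-versus-mov choice: both encodings are correct under the same semantics, and which one you get is the compiler's call, which is exactly the point.

  /* On x86-64, GCC and Clang at -O1 and above typically emit "xor eax, eax"
     here rather than "mov eax, 0"; either encoding is a correct lowering. */
  int zero(void) {
      return 0;
  }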
But even with C, execution still isn't completely deterministic, with out-of-order execution, branch prediction, cache hits vs. misses, etc. Didn’t exactly this cause some of the worst processor-level security issues we had seen in years?
There's this really good blog post about how autovectorization is not a programming model https://pharr.org/matt/blog/2018/04/18/ispc-origins
The point is that you want to reliably express semantics in the top level language, tool, API etc. because that's the only way you can build a stable mental model on top of that. Needing to worry about if something actually did something under the hood is awful.
Now of course, that depends on the level of granularity YOU want. When writing plain code, even if it's expressively rich in the logic and semantics (e.g. c++ template metaprogramming), sometimes I don't necessarily care about the specific linker and assembly details (but sometimes I do!)
The issue I think is that building a reliable mental model of an LLM is hard. Note that "reliable" is the key word - consistent. Be it consistently good or bad. The frustrating thing is that it can sometimes deliver great value and sometimes brick horribly and we don't have a good idea for the mental model yet.
To constrain said possibility space, we tether to absolute memes (LLMs are fully stupid or LLMs are a superset of humans).
Idk where I'm going with this
But most LLM services on purpose introduce randomness, so you don’t get the same result for the same input you control as a user.
When isn't that true?
  #include <stdio.h>
  int main() {
      printf("Continue?\n");
  }

and

  #include <stdio.h>
  int main() {
      printf("Continue?\n");
      printf("Continue?\n");
  }

do not see the compiler produce equivalent outputs, and I am not sure how they ever could. They are not equivalent programs. Adding additional instructions to a program is expected to see a change in what the compiler does with the program. There are legitimate compiler use cases, e.g. search-based optimization, superoptimization, diversification, etc., where reproducibility is not the main constraint. It's worth leaving conceptual space for those use cases rather than treating deterministic output as a defining property of all compilers.
Over the past decade, part of my job has been to design systems, talk to “stakeholders” and delegate some work and do some myself. I’m neither a web developer nor a mobile developer.
I don’t look at a line of code for those types of implementations. I do make sure they work. From my perspective, those that I delegated to might as well be “human LLMs”.
Meanwhile, you press the "shuffle" button, and code-gen creates different code. But this isn't necessarily the part that's supposed to be reproducible, and isn't how you actually go about comparing the output. Instead, maybe two different rounds of code generation are "equal" if the test suite passes for both. Not precisely the equivalence-class stuff the parent is talking about, but it's a simple way of thinking about it that might be helpful.
You are attempting to hedge and leave room for a non-deterministic compiler, presumably to argue that something like vibe-compilation is valuable. However, you've offered no real use cases for a non-deterministic compiler, and I assert that such a tool would largely be useless in the real world. There is already a huge gap between requirements gathering, the expression of those requirements, and their conversion into software. Adding even more randomness at the layer of translating high level programming languages into low level machine code would be a gross regression.
But the determinism/non-determinism axis isn't the core issue here. The issue is that they are trained by gradient descent, which produces instability/unpredictability in their output. I can give one a set of rules and a broad collection of examples in its context window. How often it will correctly apply the supplied rules to the input stream is entirely unpredictable. LLMs are fundamentally unpredictable as a computing paradigm. LLMs' training process is stochastic, though I hesitate to call them "fundamentally stochastic".
With LLMs the output depends on the phases of the moon.
Often it seems like tech maximalists are the most against tech reliability.
https://thinkingmachines.ai/blog/defeating-nondeterminism-in...
As with LLMs, unless you ask for the output to be nondeterministic. But any compiler can be made nondeterministic if you ask for it. That's not something unique to LLMs.
> With LLMs the output depends on the phases of the moon.
If you are relying on a third-party service to run the LLM, quite possibly. Without control over the hardware, configuration, etc. then there is all kinds of fuckery that they can introduce. A third-party can make any compiler nondeterministic.
But that's not a limitation of LLMs. By design, they are deterministic.
I suggest that when they dereference a pointer, it can go a bit forward or backward in memory, as long as it is mostly correct.
Imagine that - you got your project done ahead of schedule (which looks great on your OKRs) AND finally achieved your dream of no longer being dependent on those stupid overpaid, antisocial software engineers, and all it cost you was the company's reputation. Boeing management would be proud.
Lots of business leaders will do the math and decide this is the way to operate from now on.
You cannot formally verify prose, or the text that LLMs generate, when attempting to compare them to what a compiler does. So even in this sense that is completely false.
No one can guarantee that the outputs will be 100% faithful to the instructions you are giving to the LLM, which is why you do not trust it. As long as it is made up of artificial neurons that predict the next token, it is fundamentally a stochastic model and unpredictable.
One can maliciously craft an input to mess up the network to get the LLM to produce a different output or outright garbage.
Compilers have reproducible builds and formal verification of their functionality. No such thing exists for LLMs. Thus, comparing LLMs to a compiler and suggesting that LLMs are 'fundamentally deterministic', or are even more than a compiler, is completely absurd.
On a practical level, existing implementations are nondeterministic because they don't take care to always perform mathematically reorderable operations in the same order every time. Floating-point arithmetic is not associative, so those variations change the output. It's absolutely possible to fix this and perform the operations in the same order every time; implementors just don't bother. It's not very useful, especially when almost everything runs with a non-zero temperature.
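A minimal C demonstration of why the summation order matters:

  /* Floating-point addition is not associative: regrouping the same three
     values changes the rounded result. */
  #include <stdio.h>

  int main(void) {
      double a = 1e16, b = -1e16, c = 1.0;
      printf("%g\n", (a + b) + c);   /* prints 1 */
      printf("%g\n", a + (b + c));   /* prints 0 */
      return 0;
  }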
I think the whole nondeterminism thing is overblown anyway. Mathematical nondeterminism and practical nondeterminism aren't the same thing. With a compiler, it's not just that identical input produces identical output. It's also that semantically identical input produces semantically identical output. If I add an extra space somewhere whitespace isn't significant in the language I'm using, this should not change the output (aside from debug info that includes column numbers, anyway). My deterministic JSON decoder should not only decode the same values for two runs on identical JSON, a change in one value in the input should produce the same values in the output except for the one that changed.
LLMs inherently fail at this regardless of temperature or determinism.