There is no evidence offered. No attempt to measure the benefits.
The article has some good practical tips, and it's not on the author, but man, I really wish we'd stop abusing the term "engineering" in a desperate attempt to stroke our own egos and/or convince people to give us money. It's pathetic. Coming up with good inputs to LLMs is more art than science; it's a craft. Call a spade a spade.
But, interestingly, the behavior of LLMs in different contexts is also the subject of scientific research.
For example, his first listed design pattern is RAG. To implement such a system from scratch, you'd need to construct a data layer (commonly a vector database), retrieval logic, etc.
In fact I think the author largely agrees with you re: crafting prompts. He has a whole section admonishing "prompt engineering" as magical incantations, which he differentiates from his focus here (software which needs to be built around an LLM).
I understand the general uneasiness around using "engineering" when discussing a stochastic model, but I think it's worth pointing out that there is a lot of engineering work required to build the software systems around these models. Writing software to parse context-free grammars into masks to be applied at inference, for example, is as much "engineering" as any other common software engineering project.
We don’t have that, yet. For instance, experiments show that not all parts of the context window are equally well attended to. Imagine trying to engineer a bridge when no one really knows how strong steel is.
Most engineering disciplines have to deal with tolerances and uncertainty - the real world is non-deterministic.
Software engineering is easy in comparison because computers always do exactly what you tell them to do.
The ways LLMs fail (and the techniques you have to use to account for that) have more in common with physical engineering disciplines than software engineering does!
If you are the Cincinnatian poet Caleb Kaiser, we went to college together and I’d love to catch up. Email in profile.
If you aren’t, disregard this. Sorry to derail the thread.
Can you make a thing that’ll serve its purpose and look good for years under those constraints? A professional carpenter can.
We have it easy in software.
But they really shouldn't, because obviously scheduling and logistics are difficult, involving a lot of uncertainty and tolerances.
As the author points out, many of the patterns are fundamentally about in-context learning, and this in particular has been subject to a ton of research from the mechanistic interpretability crew. If you're curious, I think this line of research is fascinating: https://transformer-circuits.pub/2022/in-context-learning-an...
If an engineer built an internal combustion engine that misfired 60% of the time, it simply wouldn't work.
If an engineer measured things with a ruler that only measured correctly 40% of the time, that would be the apt analogy.
The tool isn't what makes engineering a practice, it's the rigor and the ability to measure and then use the measurements to predict outcomes to make things useful.
Can you predict the outcome from an LLM with an "engineered" prompt?
No, and you aren't qualified to even comment on it since your only claim to fame is a fucking web app
Ah yes, the God-given free parameters in the Standard Model, including, obviously, the random seed of a transformer. What if we just set the inference temperature to 0? The randomness in LLMs is a technical choice to generate variation in the selection of the next token. Physical engineering? Come on.
Software engineering blurs the lines, sure, but woodworking isn't engineering ever.
When the central component of your system is a black box that you cannot reason about, have no theory around, and have essentially no control over (a model update can completely change your system behavior) engineering is basically impossible from the start.
Practices like using autoscorers to try and constrain behaviors help, but this doesn't make the enterprise any more of an engineering discipline, because of the black box problem. Traditional engineering disciplines are able to call themselves engineering only because they are built on sophisticated physical theories that give them a precise understanding of the behaviors of materials under specified conditions. No such precision is possible with LLMs, as far as I have seen.
The determinism of traditional computing isn't really relevant here and targets the wrong logical level. We engineer systems, not programs.
I’m going to start a second career in lottery “engineering”, since that’s a stochastic process too.
Engineers are not just dealing with a world of total chaos, observing the output of the chaos, and cargo culting incantations that seem to work for right now [1]…oh wait nevermind we’re doing a different thing today! Have you tried paying for a different tool, because all of the real engineers are using Qwghlm v5 Dystopic now?
There’s actually real engineering going on in the training and refining of these models, but I personally wouldn’t include the prompting fad of the week to fall under that umbrella.
[1] I hesitate to write that sentence because there was a period where, say, bridges and buildings were constructed in this manner. They fell down a lot, and eventually we made predictable, consistent theoretical models that guide actual engineering, as it is practiced today. Will LLM stuff eventually get there? Maybe! But right now we’re still plainly in the phase of trying random shit and seeing what falls down.
Whoa, where did that come from?
Does that really work? And is it affected by the almost continuous silent model updates? And gpt-5 has a "hidden" system prompt, even thru the API, which seemed to undergo several changes since launch...
I'm honestly a bit confused at the negativity here. The article is incredibly benign and reasonable. Maybe a bit surface level and not incredibly in depth, but at a glance, it gives fair and generally accurate summaries of the actual mechanisms behind inference. The examples it gives for "context engineering patterns" are actual systems that you'd need to implement (RAG, structured output, tool calling, etc.), not just a random prompt, and they're all subject to pretty thorough investigation from the research community.
The article even echoes your sentiments about "prompt engineering," down to the use of the word "incantation". From the piece:
> This was the birth of so-called "prompt engineering", though in practice there was often far less "engineering" than trial-and-error guesswork. This could often feel closer to uttering mystical incantations and hoping for magic to happen, rather than the deliberate construction and rigorous application of systems thinking that epitomises true engineering.
Trial and error and fumbling around and creating rules of thumb for systems you don’t entirely understand is the purest form of engineering.
> Are we still calling this things engineering?
But my point stands. The non-deterministic nature of LLMs is an implementation detail, not even close to a physical constraint as the parent comment suggests.
The problem is - and it’s a problem common to AI right now - you can’t generalize anything from it. The next thing that drives LLMs forward could be an extension of what you read about here, or it could be a totally random other thing. There are a million monkeys tapping on keyboards, and the hope is that someone taps out Shakespeare’s brain.
What would "generalizing" the information in this article mean? I think the author does a good job of contextualizing most of the techniques under the general umbrella of in-context learning. What would it mean to generalize further beyond that?
Dang, so we don't even know why it's not deterministic, or how to make it so? That's quite surprising! So if I'm reading this right, it doesn't just have to do with LLM providers cutting costs or making changes or whatever. You can't even get determinism locally. That's wild.
But I did read something just the other day about LLMs being invertible. It goes over my head but it sounds like they got a pretty reliable mapping from inputs to outputs, at least?
https://news.ycombinator.com/item?id=45758093
> Transformer components such as non-linear activations and normalization are inherently non-injective, suggesting that different inputs could map to the same output and prevent exact recovery of the input from a model's representations. In this paper, we challenge this view. First, we prove mathematically that transformer language models mapping discrete input sequences to their corresponding sequence of continuous representations are injective and therefore lossless, a property established at initialization and preserved during training. Second, we confirm this result empirically through billions of collision tests on six state-of-the-art language models, and observe no collisions.
The distinction here appears to be between the output tokens versus some sort of internal state?
While this affects all models, it seems, I think the case gets worse for LLMs in particular because I would imagine all backends, including proprietary ones, are batching users' prompts. Other concurrent requests seem to change the output of your request, and if there is even a one-token change to the input or output, especially on large inputs or outputs, the divergence can compound. vLLM's documentation also mentions this: https://docs.vllm.ai/en/latest/usage/faq.html
So how does one benchmark AI/ML models and LLMs reliably (let's ignore arguing over the flaws of the metrics themselves, and focus on the fact that the output for any particular input can diverge given the above)? You'd also want to redo evals as soon as any hardware or software stack changes are made to the production environment.
Seems like one needs to set up a highly deterministic backend, by forcing deterministic behavior in PyTorch and using a backend which doesn't do batching, for an initial eval. That would allow for troubleshooting with non-varying output, to get a better sense of how consistent the model is without the noise of batching and non-deterministic GPU calculations/kernels etc.
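Something like this, for instance (a rough sketch assuming a Hugging Face causal LM run locally; the model name is just a placeholder):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Pin every controllable source of randomness before running evals.
torch.manual_seed(0)
torch.use_deterministic_algorithms(True)   # error out on non-deterministic kernels
torch.backends.cudnn.benchmark = False     # avoid input-dependent autotuned kernels
# On CUDA you may also need to set CUBLAS_WORKSPACE_CONFIG=:4096:8 in the environment.

MODEL = "gpt2"  # placeholder; substitute the model under evaluation
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL).eval()

def run_once(prompt: str) -> str:
    """Single-request, batch-size-1, greedy decoding: no sampling, no batching effects."""
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
    return tok.decode(out[0], skip_special_tokens=True)
```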
Then, for production, when determinism isn't guaranteed because you'd need batching and non-determinism for performance, I would think that one would want to do multiple runs in various real-world situations (such as multiple users doing all sorts of different queries at the same time) and do some sort of averaging of the results. But I'm not entirely sure, because I would imagine the types of queries other users are making would then change the results fairly significantly. I'm not sure how much the batching that vLLM does would change the results of the output; but vLLM's own docs say that batching does influence the outputs.
The best writing I've seen about this is from Hamel Husain - https://hamel.dev/blog/posts/llm-judge/ and https://hamel.dev/blog/posts/evals-faq/ are both excellent.
As our use of LLMs has changed from conversational chatbots and into integral decision-making components of complex systems, our inference approach must also evolve. The practice of "prompt engineering", in which precise wording is submitted to the LLM to elicit desired responses, has serious limitations. And so this is giving way to a more general practice of considering every token fed into the LLM in a way that is more dynamic, targeted, and deliberate. This expanded, more structured practice is what we now call "context engineering."
Throughout, we'll use a toy example of understanding how an LLM might help us answer a subjective question such as "What is the best sci-fi film?"
An LLM is a machine learning model that understands language by modelling it as a sequence of tokens and learning the meaning of those tokens from the patterns of their co-occurrence in large datasets. The number of tokens that the model can comprehend is a fixed quantity for each model, often in the hundreds of thousands, and is known as the context window.

LLMs are trained through repeated exposure to coherent token sequences — normally large textual databases scraped from the internet. Once trained, we use the LLM by running "inference" (i.e. prediction) of the next token based on all the previous tokens in a sequence. This sequence of previous tokens is what we used to refer to as the prompt.
Inference continues the token sequence by adding high-probability tokens to the sequence one at a time.
When prompted to complete the sentence "the best sci-fi film ever made is...", the highest-probability tokens to be generated might be `probably`, `star`, and `wars`.
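To make the mechanics concrete, here is a minimal sketch of that one-token-at-a-time loop, using GPT-2 via Hugging Face purely as an illustration (the completions quoted in this article come from much larger models):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")                   # small model, illustration only
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

ids = tok("The best sci-fi film ever made is", return_tensors="pt").input_ids

# Inference: repeatedly append the highest-probability next token to the sequence.
for _ in range(5):
    with torch.no_grad():
        logits = model(ids).logits          # shape: (1, sequence_length, vocab_size)
    next_id = logits[0, -1].argmax()        # most probable continuation of the sequence
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)

print(tok.decode(ids[0]))
```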
Early uses of LLMs focused on this mode of "completion", taking partially written texts and predicting each subsequent token in order to complete the text along the desired lines. While impressive at the time, this was limiting in several ways, including that it was difficult to instruct the LLM exactly how you wished the text to be completed.
To address this limitation, model providers started training their models to expect sequences of tokens that framed conversations, with special tokens inserted to indicate the hand-off between two speakers. By learning to replicate this "chat" framing when generating a completion, models were suddenly far more usable in conversational settings, and therefore easier to instruct. The context window started to be filled more greedily by different types of messages: system messages (special instructions telling the LLM what to do) and the chat history of both the user's messages and the LLM's own responses.
With a chat framing, we can instruct the LLM that it is "a film critic" before "asking" it what the best sci-fi film is. Maybe we'll now get the response tokens `blade` and `runner`, as the AI plays the role of a speaker likely to reflect critical rather than popular consensus.
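Concretely, the chat framing is just more tokens in the context window. A small sketch using the Hugging Face chat-template helper shows how a system message and a user message are rendered into the special-token format a chat-tuned model expects (the specific model and its delimiters are incidental):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")  # any chat-tuned model

messages = [
    {"role": "system", "content": "You are a film critic."},
    {"role": "user", "content": "What is the best sci-fi film?"},
]

# Render the conversation into the delimited token sequence the model was trained on.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)   # roughly: "<|system|> You are a film critic. <|user|> ... <|assistant|>"
```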
The crucial point to understand here is that the LLM architecture did not change — it was still just predicting the next token one at a time. But it was now doing that with a worldview learned from a training dataset that framed everything in terms of delimited back-and-forth conversations, and so would consistently respond in kind.
In this setting, getting the most out of LLMs involved finding the perfect sequence of prompt tokens to elicit the best completions. This was the birth of so-called "prompt engineering", though in practice there was often far less "engineering" than trial-and-error guesswork. This could often feel closer to uttering mystical incantations and hoping for magic to happen, rather than the deliberate construction and rigorous application of systems thinking that epitomises true engineering.
We might try imploring the AI to reflect critical consensus with a smarter system prompt, something like "You are a knowledgeable and fair film critic who is aware of the history of cinema awards." We might hope that this will "trick" the LLM into generating more accurate answers, but this hope rests on linguistic probability and offers no guarantees.
As LLMs got smarter and more reliable, we were able to feed them more complex sequences of tokens, covering different types of structured and unstructured data. This enabled LLMs to produce completions that displayed "knowledge" of probable token sequences based on novel structures in the prompt, rather than just remembered patterns from their training dataset. This mode of feeding examples to the LLM is known as in-context learning because the LLM appears to "learn" how to produce output purely based on example sequences within its context window.
This approach led to an explosion of different token sequences that we might programmatically include within the prompt:
In our sci-fi film example, our prompt could include many things to help the LLM: historic box office receipts, lists of the hundred greatest films from various publications, Rotten Tomatoes ratings, the full history of Oscar winners, etc.
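As a sketch of what that programmatic assembly might look like (the data sources and helper below are hypothetical placeholders):

```python
# Hypothetical data sources; in a real system these would come from files, APIs, or a database.
critics_lists = ["Publication A's 100 greatest films: ...", "Publication B's top sci-fi list: ..."]
box_office = ["Historic box office receipts: ..."]
awards = ["Oscar winners by year: ..."]
ratings = ["Rotten Tomatoes ratings snapshot: ..."]

def build_prompt(question: str) -> str:
    """Assemble the context window: instructions first, then supporting data, then the question."""
    sections = [
        "You are a knowledgeable and fair film critic.",
        "## Critics' lists\n" + "\n".join(critics_lists),
        "## Box office\n" + "\n".join(box_office),
        "## Awards\n" + "\n".join(awards),
        "## Ratings\n" + "\n".join(ratings),
        "## Question\n" + question,
    ]
    return "\n\n".join(sections)

prompt = build_prompt("What is the best sci-fi film?")
```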
Suddenly, our 100,000+ token context window isn't looking so generous anymore, as we stuff it with tokens from all kinds of places. This expansion of context not only depletes the available context window for output generation, it also increases the overall footprint and complexity of what the LLM is paying attention to at any one time. This then increases the risk of failure modes such as hallucination. As such, we must start approaching its construction with more nuance — considering brevity, relevance, timeliness, safety, and other factors.
At this point, we aren't simply "prompt engineering" anymore. We are beginning to engineer the entire context in which generation occurs.
Language encodes knowledge, but it also encodes meaning, logic, structure, and thought. Training an LLM to encode knowledge of what exists in the world, and to be capable of producing language that would describe it, therefore, also produces a system capable of simulating thought. This is, in fact, the key utility of an LLM, and to take advantage of it requires a mindset shift in how we approach inference.
To adopt context engineering as an approach to LLM usage is to reject using the LLM as a mystical oracle to approach, pray to with muttered incantations, and await the arrival of wisdom. We instead think of briefing a skilled analyst: bringing them all the relevant information to sift through, clearly and precisely defining the task at hand, documenting the tools available to complete it, and avoiding reliance on outdated, imperfectly remembered training data. In practice, our integration of the LLM shifts from "crafting the perfect prompt", towards instead the precise construction of exactly the right set of tokens needed to complete the task at hand. Managing context becomes an engineering problem, and the LLM is reframed as a task solver whose output is natural language.
Let's consider a simple question you might wish an LLM to answer for you:
What is the average weekly cinema box office revenue in the UK?
In "oracle" mode, our LLM will happily quote a value learned from the data in its training dataset prior to its cutoff:
As of 2019, the UK box office collects roughly £24 million in revenue per week on average.
This answer from GPT-4.1 is accurate, but imprecise and outdated. Through context engineering, we can do a lot better. Consider what additional context we might feed into the context window before generating the first token of the response: for instance, a retrieved document stating that the 2024 UK box office total was £979 million, and a description of a calculator function the LLM may ask us to call.

That context should be enough for the LLM to know how to: look for data for 2024; extract the total figure of £979 million from the document; and call an external function to precisely divide that by 52 weeks. Assuming the caller then runs that calculation and invokes the LLM again, with all the above context, plus its own output, plus the result of the calculation, we will then get our accurate answer:
Across the full year of 2024, the UK box office collected £18.8 million in revenue per week on average.
Even this trivial example involves multiple ways of engineering the context before generating the answer: retrieving up-to-date data, describing the tools available, and feeding the tool's result back into the context for the final completion.
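As a rough sketch of how that two-pass flow is wired up in code, here is the same example using an OpenAI-style tool-calling API (the exact request shape varies by provider, and the document string stands in for the retrieved context described above):

```python
import json
from openai import OpenAI

client = OpenAI()

# Describe the external function the LLM may ask us to run.
tools = [{
    "type": "function",
    "function": {
        "name": "divide",
        "description": "Precisely divide one number by another.",
        "parameters": {
            "type": "object",
            "properties": {"numerator": {"type": "number"}, "denominator": {"type": "number"}},
            "required": ["numerator", "denominator"],
        },
    },
}]

messages = [
    {"role": "system", "content": "Answer using the supplied document and tools; do not guess."},
    {"role": "user", "content": "Document: the UK box office total for 2024 was £979 million.\n"
                                "Question: What is the average weekly cinema box office revenue in the UK?"},
]

# First pass: with the document and tool description in context, the model requests a calculation.
first = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]        # assumes the model chose to call the tool
args = json.loads(call.function.arguments)           # e.g. {"numerator": 979, "denominator": 52}

# The caller runs the calculation and feeds the result back into the context.
result = args["numerator"] / args["denominator"]
messages += [first.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": str(result)}]

# Second pass: with the tool result now in context, the model produces the accurate answer.
final = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
```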
Fortunately, we do not need to invent a new approach every single time.
Retrieval-augmented generation (RAG) is a fashionable technique for injecting external knowledge into the context window at inference time. Leaving aside implementation details of how to identify the correct documents to include, we can clearly see that this is another specific form of context engineering. It is a useful and obvious way to use pre-trained LLMs in contexts that need access to knowledge outside the training dataset.
For a correct answer, our application needs to be aware of up-to-date film reviews, ratings, and awards, to track new films and critical opinion after the point the model was trained. By including relevant extracts in the context window, we enable our LLM to generate completions with today's data and avoid hallucination.
To do this, we can search for relevant documents and then include them in the context window. If this sounds conceptually simple, that is because it is — though reliable implementation is not trivial and requires robust engineering.
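A minimal retrieval sketch, assuming a sentence-transformers embedding model and a small in-memory corpus (a production system would add a vector database, chunking, and re-ranking):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")   # any sentence-embedding model will do

# Toy corpus; in practice these would be chunks of reviews, ratings, and awards coverage.
documents = [
    "Review of a 2024 sci-fi release: ...",
    "This year's awards round-up: ...",
    "Rotten Tomatoes ratings snapshot: ...",
]
doc_vecs = embedder.encode(documents, normalize_embeddings=True)

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query by cosine similarity."""
    q = embedder.encode([query], normalize_embeddings=True)[0]
    scores = doc_vecs @ q
    return [documents[i] for i in np.argsort(scores)[::-1][:k]]

question = "What is the best sci-fi film?"
context = "\n\n".join(retrieve(question))
prompt = f"Using only the sources below, answer the question.\n\n{context}\n\nQuestion: {question}"
```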
Complex systems can be brittle and opaque to build. We need a way to scale complexity without harming our ability to maintain, debug, and reason about our code. Fortunately, we can apply the same thinking that traditional software design used to solve this same problem.
We can think of RAG as simply the first of many design patterns for context engineering. And just as with other software engineering design patterns, in future we will find that most complex systems will have to employ variations and combinations of such patterns in order to be most effective.
In software engineering, design patterns promote reusable software by providing proven, general solutions to common design problems. They encourage composition over inheritance, meaning systems are built from smaller, interchangeable components rather than rigid class hierarchies. They make your codebase more flexible, testable, and easier to maintain or extend. They are a crucial piece of the software design toolkit, that enable engineers to build large functioning codebases that can scale over time.
Some examples of software engineering design patterns include:
- Factory: standardises object creation to make isolated testing easier
- Decorator: extends behaviour without editing the original
- Command: passes work around as a value, similar to a lambda function
- Facade: hides internals with a simple interface to promote abstraction
- Dependency injection: wires modules externally using configuration

These patterns were developed over a long time, though many were first codified in a single book. Context engineering is a nascent field, but already we see some common patterns emerging that adapt LLMs well to certain tasks:
- RAG: inject retrieved documents based on relevance to user intent
- Tool calling: list available tools and inject results into the context
- Structured output: fix a JSON/XML schema for the LLM completions (sketched further below)
- Chain of thought / ReAct: emit reasoning tokens before answering
- Context compression: summarise long history into pertinent facts
- Memory: store and recall salient facts across sessions

In our examples above, we have already used some of these patterns:
- RAG for getting film reviews, critics' lists, and box office data
- Tool calling to calculate weekly revenues accurately
Some of the other techniques, such as ReAct, could help our LLM to frame and verify its responses more carefully, counterbalancing the weight of linguistic probability learnt from its training data.
By seeing each as a context engineering design pattern, we are able to pick the right ones for the task at hand, compose them into an "agent", and avoid compromising our ability to test and reason about our code.
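As one example, the structured output pattern listed above can be as simple as pinning a schema in the context and validating every completion against it. A sketch assuming pydantic, with the actual LLM call left abstract:

```python
import json
from pydantic import BaseModel, ValidationError

class FilmVerdict(BaseModel):
    title: str
    year: int
    reasoning: str

# Included in the context so the LLM knows exactly what shape of answer is acceptable.
SCHEMA_INSTRUCTION = (
    "Respond with JSON only, matching this schema:\n"
    + json.dumps(FilmVerdict.model_json_schema(), indent=2)
)

def parse_verdict(completion: str) -> FilmVerdict:
    """Validate the LLM's completion against the fixed schema."""
    try:
        return FilmVerdict.model_validate_json(completion)
    except ValidationError:
        # In practice: retry, repair, or fall back rather than silently accepting bad output.
        raise
```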
Production systems that rely on LLMs for decision-making and action will naturally evolve towards multiple agents with different specialisations: safety guardrails; information retrieval; knowledge distillation; human interaction; etc. Each of these is a component that interprets a task and then returns a sequence of tokens indicating actions to take, the information retrieved, or both.
For our multi-agent film ranker, we might need several agents:
- Chatbot Agent: to maintain a conversation with the user
- Safety Agent: to check that the user is not acting maliciously
- Preference Agent: to recall whether the user wants to ignore some reviews
- Critic Agent: to synthesise sources and make a final decision
Each of these is specialised for a given task, but this can be done purely through engineering the context they consume, including outputs from other agents in the system.
Outputs are then passed around the system and into the context windows of other agents. At every step, the crucial aspect to consider is the patterns by which token sequences are generated, and how the output of one agent will be used as context for another agent to complete its own task. The hand-off token sequence is effectively the contract for agent interaction — apply as much rigour to it as you would any other API within your software architecture.
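One way to apply that rigour is to make the hand-off an explicit, typed contract rather than free-form text. A sketch with invented agent and field names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CriticVerdict:
    """Contract for the Critic Agent -> Chatbot Agent hand-off (names are illustrative)."""
    film: str
    confidence: float        # 0.0 to 1.0
    sources: list[str]       # which retrieved documents supported the verdict

def render_for_chatbot(verdict: CriticVerdict) -> str:
    """Serialise the verdict into the token sequence the Chatbot Agent receives as context."""
    cited = "; ".join(verdict.sources)
    return (f"Critic verdict: {verdict.film} "
            f"(confidence {verdict.confidence:.2f}; sources: {cited})")
```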
Context engineering is the nascent but critical discipline that governs how we are able to effectively guide LLMs into solving the tasks we feed into them. As a subfield of software engineering, it benefits from systems and design thinking, and we can learn lessons from the application of design patterns for producing software that is modular, robust, and comprehensible.
When working with LLMs, we must therefore:

- consider every token fed into the context window deliberately, weighing brevity, relevance, timeliness, and safety
- compose established context engineering design patterns (RAG, tool calling, structured output, and so on) rather than reinventing ad-hoc prompts
- treat the token sequences handed between agents as carefully as any other API contract

By doing these, we can control in-context learning with the same rigour as any other engineered software.