We've got a QA agent that needs to run through, say, 200 markdown files of requirements in a browser session. It's a cool system that has really improved our team's efficiency. For the longest time we tried everything to get a prompt like the following working: "Look in this directory at the requirements files. For each requirement file, create a todo list item to determine if the application meets the requirements outlined in that file". In other words: letting the model manage the high-level control flow.
This started breaking down after ~30 files. Sometimes it would miss a file. Sometimes it would triple-test a bundle of files and take 10 minutes instead of 3. An error in one file would convince it that it needed to re-test four previous files, for no reason. It was very frustrating. We quickly discovered during testing that there was no consistency to its (Opus 4.6 and GPT 5.4, IIRC) ability to actually orchestrate the workflow. Sometimes it would work, sometimes it wouldn't. I've also tested it once or twice against Opus 4.7 and GPT 5.5; not as extensively, but they seem to have the same problems.
We ended up creating a super basic deterministic harness around the model: for each test case, trigger the model to test that case, store the result in an array, write the results to a file. This has made the system a billion times more reliable. But it's also made the agent impossible to run on any managed agent platform (Cursor Cloud Agents, Anthropic, etc.) because they're all so gigapilled on "the agent has to run everything" that they can't see how valuable these systems can be if you just add a wee bit of determinism at the right place.
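For the curious, the harness really is about that simple. A minimal sketch, assuming a `requirements/` directory and the `claude` CLI's print mode (any single-shot model invocation would do):

```python
import json
import subprocess
from pathlib import Path

results = []
for req_file in sorted(Path("requirements").glob("*.md")):
    # The model handles exactly one file per invocation; the loop,
    # the ordering, and the bookkeeping stay deterministic.
    prompt = ("Determine whether the application meets the requirements "
              "below. Answer PASS or FAIL with a short reason.\n\n"
              + req_file.read_text())
    verdict = subprocess.run(["claude", "-p", prompt],
                             capture_output=True, text=True).stdout.strip()
    results.append({"file": req_file.name, "verdict": verdict})
    # Write after every file so a crash never loses completed work.
    Path("results.json").write_text(json.dumps(results, indent=2))
```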
"One thing that I have seen in the wild quite a bit is taking the agent pattern and sprinkling it into a broader more deterministic DAG." - https://github.com/humanlayer/12-factor-agents/blob/main/REA...
I started working on it piece by piece about 14 years ago. It was originally targeted at junior developers, to give them the necessary security and scalability guardrails while maintaining as much flexibility as possible. It's very flexible; most of Saasufy itself is built using Saasufy. Only the actual user service and orchestration is custom backend code.
Also, I designed it in a way that helps the user fast-track their learning of important concepts like authentication, access control, and schema validation.
It turns out that all of these things that junior devs need are exactly what LLMs need as well.
I tested it with Claude Code originally and got consistently great results. More recently, I tested with https://pi.dev with GPT 5.5 and it seemed to be on par.
This is essentially declarative programming. Most traditional programming is imperative, which is what most developers are used to: I give an exact set of instructions and expect them to be executed as written. Agents are far more declarative than imperative: you give them a result, and they work on getting that result. The problem, of course, is that in something declarative like, say, SQL, the result is going to be pretty consistent and well-defined, but you're still trusting the underlying engine on how to go about it.
Thinking about agents declaratively has helped me a lot, rather than trying to design these Rube Goldberg "control" systems around them. Didn't get it right? OK, I validated that it's not correct; let's try again or approach it differently.
If you really need something imperative, then write something imperative! Or have the agent do so. This stuff reads like trying to use the wrong tool for the job.
The models are already good enough for code generation. What we need is a harness around them that deterministically enforces a specific path and "leashes" the model's output to align with the user's intention as much as possible. You can't make the output of the model deterministic, but you can make everything around it deterministic.
Trying to make enforcement work with prompts is like a government agency investigating/auditing itself: there's no incentive to find problems, so you'll inevitably get "All good, boss!"
This is the only way to guarantee AI usage doesn't burn you. Any automation beyond this is just theater, no matter how much that hurts to hear/undermines your business model.
A bird sings, a duck quacks. You don't expect the duck to start singing now, do you?
Agents aren't reliable; use workflows instead.
Even skills are not a catch-all, because besides the supply chain risk from using skills you pull from someone else, a lot of tasks require an assortment of skills.
I've accommodated this with my agent team (mostly Sonnets, FWIW) by developing what we call "operational reflexes". Basically, common tasks that require multiple domains of expertise are given a lockfile defining which skills are most relevant (even which fragment of a skill) and how in-depth/verbose each element needs to be, so the same task gets done the same way, with minimal hallucinations or external sources.
A coordinator agent assigns the tasks, selects the relevant lockfile, and sends it along, or passes it to another agent with a different lockfile geared toward reviewing.
It's a bit involved, but this workflow dramatically increased the quality of the technical output I get from my agents, and I don't really need to write many prompts like this myself.
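The comment doesn't show its lockfile format, but the idea might look something like this (all field and skill names invented for illustration):

```python
# Hypothetical "operational reflex" lockfile for one recurring task.
RELEASE_NOTES_REFLEX = {
    "task": "write-release-notes",
    "skills": [
        # Which skill, which fragment of it, and how verbose to be.
        {"skill": "changelog-style", "fragment": "tone", "depth": "full"},
        {"skill": "git-forensics", "fragment": "commit-grouping", "depth": "summary"},
    ],
    # Handed to a second agent whose lockfile is geared toward reviewing.
    "reviewer_lockfile": "release-notes-review",
}
```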
hmmmmmm maybe i could vibecode a harness based on that pi thing i've heard about, and integrate it closer with jj instead of relying on llms knowing how to use it, and make certain stages guaranteed to run... oh dear
edit: also i can't bring myself to believe the 'ultimate' form or whatever stabilizes out will be chat-based interfaces for coding and code generation
i think it's just that openai happened to strike gold with ChatGPT and nobody has time to figure anything else out because they've got to get the bazillion investor dollars with something that happens to kinda work
also afaiu all these instruct models are based on 'base' models that 'just' do text prediction, without replying with a chat format; will we see code generation models that output just code without the chat stuff?
I created it to address this exact issue. It is a vendor-neutral ESLint-style policy engine and currently supports Claude Code, Codex, and Copilot.
It uses the agent's hook payloads and session history to enforce the policies. It can be set up to block commits if a file has been modified since the checks were last run, disallow content or commands using string or regex matching, and enforce TDD without any extra reporter setup, and it works with any language.
Feedback welcome: https://github.com/nizos/probity
My first thought was: agents seem nice, but I think AI workflows are a better bet. However, I didn't really understand AI or agents in depth, and I felt like I was just "doing things the old way" and that removing flexibility from agents was a ridiculous idea.
After some research I got the impression that I was right. A well-defined workflow and scope is just what's needed for AI. It's cheaper and more consistent. It probably even makes the whole thing run well with non-SOTA models.
Here's a pretty specific example of what I mean, but maybe food for thought:
Podcast (20 minute digest): https://pub-6333550e348d4a5abe6f40ae47d2925c.r2.dev/EP008.ht...
The alternative is running your ten lines of Python in the most expensive, slowest, least reliable way possible. (Sure is popular though)
For example, most people were using the agents for internet research. They would spin for hours, get distracted, or forget what they were supposed to be doing.
Meanwhile `import duckduckgo` and `import llm` and you can write ten lines that do the same thing in 20 seconds, actually run deterministically, and cost 50x less.
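Roughly those ten lines, hedging on the names: the search library is actually published as `duckduckgo_search`, `llm` here is that package's Python API, and the model name is just an example:

```python
import llm
from duckduckgo_search import DDGS

query = "deterministic control flow for LLM agents"
hits = DDGS().text(query, max_results=5)  # list of {title, href, body} dicts
context = "\n".join(f"{h['title']}: {h['body']}" for h in hits)

model = llm.get_model("gpt-4o-mini")  # any model configured for llm works
print(model.prompt(f"Summarize these search results:\n{context}").text())
```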
The current models are much better -- good enough that Auto-GPT is real now! -- but running poorly specified control flow in the most expensive way possible is still a bad idea.
0 - https://stripe.dev/blog/minions-stripes-one-shot-end-to-end-...
The second it works, bake the workflow into the harness. Yesterday I was doing just that, and the whole agent loop disappeared because the process could be condensed into a one-shot request (plus one MorphLLM fast apply) built from careful context construction. (It was an Autoresearcher.)
It's externally orchestrated and managed, not run by an agent driving the loop.
The goal is to force LLMs to produce exactly what you want every time.
I will be open sourcing soon. You can use whatever harness or tools you already use, you just delegate the actual implementation to the engine.
Swamp teaches your Agent to build and execute repeatable workflows, makes all the data they produce searchable, and enables your team to collaborate.
We also built Swamp and Swamp Club using Swamp. You can see that process in the lab[2]. This combines all of the creativity of the LLM for the parts that matter, while providing deterministic outcomes for the parts you need to be deterministic.
Especially all bookkeeping logic should move into the symbolic layer: https://zby.github.io/commonplace/notes/scheduler-llm-separa...
couldn't you "just" have it orchestrate a bunch of subagents? a la the superpower skill
definitely a worse solution, non deterministic orchestration + way higher token usage (unless there's a way to hide the subagent output from the orchestrator agent? i haven't used any of these platforms) but could work in some cases
I feel like I'm falling out of whatever is popular these days. I've been using prepaid tokens and custom harnesses for a long time now. It just seems to work. I can ignore most of the news. Copilot & friends are currently dead to me for the problems I've expressly targeted. For some codebases it's not even in the same room of performance anymore, despite using the exact same GPT5.4 base model.
At the time I had access to only 4o, and there was no way to guarantee that the agent would follow the flowchart if I just mentioned it in its prompt. What I ended up doing was wrapping the agent in a loop that kept feeding it the next step in the flowchart. In a way, a custom harness for the agent.
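A stripped-down version of that wrapper, with illustrative step text; the flowchart lives in code, and the model only ever sees one step at a time:

```python
import subprocess

FLOWCHART = [
    "Open the ticket and restate the reported problem.",
    "Reproduce the problem and paste the exact error output.",
    "Propose a fix and apply it.",
    "Re-run the reproduction and confirm the error is gone.",
]

transcript = ""
for i, step in enumerate(FLOWCHART, 1):
    # The loop, not the model, decides what comes next.
    prompt = f"{transcript}\nStep {i} of {len(FLOWCHART)}: {step}"
    out = subprocess.run(["claude", "-p", prompt],
                         capture_output=True, text=True).stdout
    transcript += f"\nStep {i} result:\n{out}"
```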
I'm a firm believer that a "thin harness" is the wrong approach for this reason and that workflows should be enforced in code. Doing that lets you make sure the workflow is always followed, and it reduces tokens since the LLM no longer has to consider the workflow or read the workflow instructions. But it also allows more interesting things: you can split plans into steps and feed them through a workflow one by one (so the model no longer needs such strong multi-step instruction following); you can give each workflow stage its own context or prompts; and you can add workflow-stage-specific verification.
Based on my experience with Claude Code and Kilo Code, I've been building a workflow engine for this exact purpose: it lets you define sequences, branches, and loops in a configuration file that it then steps through. I've opted for passing JSON data between stages and using the `jq` language for logic and data extraction. The engine itself is written in hand-coded Rust (the recent Claude Code bugs taught me that the core has to be solid), while the actual LLM calls are done in a subprocess (currently my own TypeScript + Vercel AI SDK harness, but the plan is to support third-party ones like the claude code cli, codex cli, etc. too, in order to be able to use their subscriptions).
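Not their engine (that's the Rust part), but the shape of the idea can be sketched in a few lines: each stage's JSON output gets filtered through a `jq` expression before the next stage sees it. The stage prompts and expressions here are made up:

```python
import json
import subprocess

def jq(expr: str, data: dict) -> dict:
    """Run a jq expression over a JSON value via the jq CLI."""
    out = subprocess.run(["jq", expr], input=json.dumps(data),
                         capture_output=True, text=True, check=True)
    return json.loads(out.stdout)

def call_model(prompt: str) -> str:
    """Placeholder for the LLM subprocess (claude -p, codex exec, ...)."""
    raise NotImplementedError

# Illustrative two-stage sequence, as a would-be config file might define it.
stages = [
    {"prompt": 'List failing tests as JSON like {"failing": [...]}.',
     "extract": "{tests: .failing}"},
    {"prompt": "Fix the tests named in the input, then echo the input.",
     "extract": "."},
]

data: dict = {}
for stage in stages:
    raw = call_model(f"{stage['prompt']}\nInput: {json.dumps(data)}")
    data = jq(stage["extract"], json.loads(raw))  # a parse failure stops the run loudly
```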
I'm not quite ready to share it just yet, but I thought it was interesting to mention since it aims to solve the exact problem that OP is talking about.
However, there are some things that I think need a foundational, next-generation improvement of some sort. The way LLMs sort of smudge away "NEVER DO X", and can even after a lot of work end up reading it as a bit of a "PLEASE DO X", seems fundamental to how they work. It can be easy to lose track of this while we are still in the initial flush of figuring out what they can do (despite all we've already found), but LLMs are not everything we're looking for out of AI.
There should be some sort of architecture that can take a "NEVER DO X" and treat it as a human would. There should be some sort of architecture that instead of having a "context window" has memory hierarchies something like we do, where if two people have sufficiently extended conversations with what was initially the same AI, the resulting two AIs are different not just in their context windows but have actually become two individuals.
I of course have no more idea what this looks like than anyone else. But I don't see any reason to think LLMs are the last word in AI.
Both designs (Lightroom, game engines) have worked successfully.
There's probably nothing that prevents mixing both approaches in the same "app".
Slash commands, for instance, are a misfeature. I should never have to wait for the chatbot to finish a turn just to check the status of my context window or how much money I've spent this session. Control should be orthogonal to the chat loop.
Even things that have nothing to do with controlling the text generator's input and output are entangled with chat actions for no good reason except "it's a chat thing, let's pretend we're operating an IRC bot".
There are a zillion LLM agents out there nowadays, but none of them really separate control from the agent loop from presentation well. (A few do at least have headless modes, which is cool.)
The reason "DO NOT SKIP" fails is that your agent is responsible for too many things, and there are things in context taking attention away from this guidance.
But nobody said the agent that does enforcement must be the same agent that builds. While you can likely encode some smart decision-making logic in your deterministic control flow, you either make it too rigid to work well, or you make it so complex that at that point you might as well just use the agent; it will be cheaper to set up and maintain.
You essentially need three base agents (a rough sketch follows the list):
- Supervisor that manages the loop and kicks the right things into gear if things break down
- Orchestrator that delegates things to appropriate agents and enforces guardrails where appropriate
- Workers that execute units of work. These may take many shapes.
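A rough sketch of that three-layer shape, with `call_agent` standing in for whatever runs a single LLM turn (the role prompts are invented):

```python
def call_agent(role: str, task: str) -> str:
    """Placeholder: one LLM turn under the given role prompt."""
    raise NotImplementedError

def worker(task: str) -> str:
    return call_agent("worker", task)

def orchestrator(goal: str) -> list[str]:
    # Delegation itself can be a model call; executing the plan is code.
    plan = call_agent("orchestrator", f"Split into one worker task per line: {goal}")
    return [worker(t) for t in plan.splitlines() if t.strip()]

def supervisor(goal: str, max_attempts: int = 3) -> list[str]:
    for _ in range(max_attempts):
        results = orchestrator(goal)
        verdict = call_agent("supervisor", f"Do these satisfy '{goal}'? YES/NO: {results}")
        if verdict.strip().upper().startswith("YES"):
            return results
    raise RuntimeError("supervisor gave up")  # a loud failure, not a silent one
```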
I've tried doing something similar with AI by running a prompt several times and then having an agent pick the best response. It works fairly well, but it burns a lot of tokens.
My personal opinion is that AI and agents are being misrepresented… The amount of setup, guidance, and testing required to create a smarter version of a form is insane.
At the moment my small test is:
- Compressed instructions (to fit within the 8k limit)
- 9 different types of policies to guide the agent (JSON)
- 3 actual documents outlining domain knowledge (JSON)
- 8 topics (hint harvesting, guide rails, and the pieces of information prepared as adaptive cards for the user)
- 3 tools (to allow for connectors)
The whole thing is as robust as I can make it but it still feels like a house of cards and I expect some random hiccup will cause a failure.
Markdown files are a good reference but they are a weak enforcement tool and go stale easily.
Avoid burying yourself in more skills docs that you're not even writing yourself and probably never read. Focus that effort on deterministic tooling. (Not that skills or prompts are bad; I agree a meta-skill that tells an agent which subagents to run and in what order is useful.)
I am finding that the better the quality gates are, the lower-quality LLM you can use for the same result (at a cost of time).
Still have yet to see a universal treatment that tackles this well.
Codex's short context and todo-list system combined somehow help here, though. Because of the frequent compaction, the model is forced to recheck which todo list items are not done yet and which workflow skill it has to use. I used to leave it for multiple hours to do a big cleanup, and it finished without obvious issues.
We need to define agents in code, and drive them through semi-deterministic workflows. Kick subtasks off to agents where appropriate, but do things like gather context and deal with agent output deterministically.
This is a massive boost in accuracy, cost efficiency, AND speed. Stop using tokens to do the deterministic parts of the task!
https://github.com/yieldthought/flow
Happily, 5.5 is good at writing and using it.
Aren't some benchmarks giving the model multiple shots at a problem and keeping only the successful result if it appears, ignoring the failure rate?
I also think having granular, tightly controlled steps is much friendlier to implementing smaller, cheaper, more specialized models rather than using some ginormous behemoth of a model that can automate your tests, or crank out 5 novels of CSI fan fic in a snap.
Everyone misses this pattern with skills: you can just drop code alongside a SKILL.md to guarantee certain behaviors, but for some reason everyone's addicted to writing prompts. You don't even need to build a CLI. A simple skill.py with tasks does it. You can even have helpers that call `claude -p`!
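Something in this spirit, with invented task names; SKILL.md just tells the agent to run `python skill.py <task>` instead of improvising:

```python
# skill.py, dropped next to SKILL.md.
import subprocess
import sys

def run_tests() -> None:
    # Deterministic: the same command every time, no prompt involved.
    subprocess.run(["pytest", "-q"], check=True)

def summarize_diff() -> None:
    diff = subprocess.run(["git", "diff", "--stat"],
                          capture_output=True, text=True).stdout
    # Helpers can themselves call back into the model when useful.
    subprocess.run(["claude", "-p", f"Summarize this diff:\n{diff}"])

if __name__ == "__main__":
    {"run-tests": run_tests, "summarize-diff": summarize_diff}[sys.argv[1]]()
```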
Sorry, you thought a prompt was a suitable replacement for a testing suite?
And then you run into similar issues as the LLM does (silent failures, loops, contradictions) unless you're very careful.
The essence might be the same closed-world-assumption problem. In the LLM case this manifests as hallucination rather than admitting it does not know.
Or the opposite, depending on how you ask the questions: some automated code review tools _always_ find issues, even when they don't really exist, or when they exist in the scope of a function but not once the function is wired into the project.
I think it's a step more abstract than that, what we're doing... How about "narrative programming"? (Though we could debate whether "programming" is still an applicable word.)
Yes, it may look like declarative programming, but it's within an illusion: we aren't actually describing our goals "to" an AI that interprets them. Instead, there's a story-document where our human stand-in character has dialogue with a computer-character, and up in the real world we're hoping that the LLM will append more text in a way that makes a cohesive longer story with something useful that can be mined from it.
It's not just an academic distinction: if we know there's a story, that gives us a better model for understanding (and strategizing about) the relationship between inputs and outputs. For example, it helps us understand risks like prompt injection, and it provides guidance for the kinds of training data we do (or don't) want it trained on.
And to your point: instructing a (non-deterministic) LLM declaratively ("get me to this end state") compounds the likelihood of going off the rails.
Presuming you meant burns you out though.
why tf would i ever need this
Deterministic workflows that use AI to help perform the steps not requiring human input have been an area of interest for me for some time. It's particularly interesting how you are using the AI to determine what a step has achieved and what the next step's action should be.
Combine it with workflow elements that do handle human steps, together with a notification/routing/task system, and you'd have a helpful system for so many people.
You prompt for what you want it to do, and it will write e.g. Python scripts as needed for the looping part, using, for example, `claude -p` for the LLM call.
You can build this in 10 minutes.
I don't use a cloud platform, so I can't comment on that part. I'd say just run it on your own hardware, it's probably cheaper too.
It is the same in terraform - yes, the HCL spec defines things very precisely, but you're kind of at the mercy of how the provider and the provider API decide to handle what you wrote, which can be very messy and inconsistent even when nothing changed on your side at all. LLM/agent usage feels a lot like that to me, in the sense that it's declarative and can be a bit lossy. As a result there are things I could technically do in terraform but never would, because I need imperativeness.
My main point is that I think people are trying to ram agents into a ton of cases where they might not need or even want to be used, and stuff like this gets written. Maybe not, but I see it day to day. For instance, I have a really hard time convincing coworkers who complain about the reliability of MCP responses with their agents that they could simply take an API key, have the agent write a script that uses it, and strictly bound/define the response format they want, rather than letting the agent or server just guess. For some reason there is an inclination to "let the agent decide how to do everything."
I think that's probably what this article is getting at, but I am saying: instead of trying to create these elaborate control flows with validation checks everywhere to rein in an unruly application making dumb decisions, why not just use the agent to write deterministic automation, instead of using the agent as the automation?
It will make a mistake and you will get burned, so you have to babysit it.
Your agent can write a Python script to loop and simply call `claude -p` or `codex exec`.
For simple workflows this seems good enough and can be set up in 10 minutes without third party software.
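For what it's worth, the ten-minute version might look like this: loop in code, one `claude -p` call per item, and deterministic validation of the non-deterministic reply (the classification task is invented):

```python
import json
import subprocess

def ask(prompt: str, retries: int = 3) -> dict:
    for _ in range(retries):
        raw = subprocess.run(["claude", "-p", prompt],
                             capture_output=True, text=True).stdout
        try:
            return json.loads(raw)  # validate in code, don't trust the reply
        except json.JSONDecodeError:
            prompt += "\nReply with valid JSON only."
    raise ValueError("model never produced valid JSON")

for ticket in ["TICKET-1", "TICKET-2"]:
    print(ask(f'Classify {ticket} as bug or feature. JSON: {{"kind": "..."}}'))
```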
What do you think?
Your comment EXACTLY mirrors my experience. Week 1 was ever-expanding prompts and degrading performance. Week 2 has been all about actually defining the objects precisely (notes, tasks, projects, people, etc.) and defining methods for performing well-defined operations against these objects. The agent surface has, as you rightly point out, shrunk to a translation layer that converts natural language into commands and args that pass the input validator.
I just had Claude write itself a couple shell scripts to handle a bunch of common cases (like running tests) in my workflow where it just couldn't figure it out efficiently. Now it just runs those tools and sets things up instead of spinning in circles for half an hour.
Every time it tries to ask me if it can run some one-off crazy shell or python one-liner to do something, I've started asking myself if I should have it write a tool I can auto-approve instead.
I think you are on to something. But I also think this sort of system lends itself to not needing really good LLMs to do impressive things. I've noticed that the quality of a lot of these LLMs just gets worse the more datapoints they need to track. But if you break the work up into smaller, easier-to-consume chunks, all of a sudden you need a much less capable LLM to get results comparable to or better than the SOTA.
Why pay extra money for Opus 4.7 when you could run Qwen 3.6 35b for free and get similar results?
If a business can get away with some margin of error being acceptable, more power to them. But if not (or if doing so would cause additional problems, which I'd imagine is true for a non-trivial number of orgs), it's wise to consider the nature of the tool a lot of people are suggesting is mandatory when you depend on consistent, predictable results.
Such an LLM might have fared better with the strawberry test.
In the GUI I can see the context indicator and usage stats.
It also makes it easier to jump between conversations and see the updates.
Sometimes I use Claude Code or opencode in the terminal, and my experience is much poorer compared to the Codex desktop app.
That’s why you see “base” vs “instruct” models for example — base is just that, the basic language model that models language, but doesn’t follow instructions yet.
Especially the open weights models have lots of variants, eg tuned for math, tuned for code, tuned for deep thinking, etc.
But it’s definitely a post train thing, usually done by generating synthetic data using other models.
Other things don't though.
llm -> prompt -> result
llm -> prompt + prompt encoded as skill -> result
llm -> prompt + deterministic code encoded as skill -> result
I do think prompting to generate code early can shortcut that path to deterministic code, but we're still essentially embedding deterministic code in a non-deterministic wrapper. There is a missing layer of determinism that, in many cases, is what actually makes long-horizon tasks successful. We need deterministic code outside the non-deterministic boundary, via an agentic loop or framework. This puts us in a place where the non-deterministic decision making is sandwiched between layers of determinism:
deterministic agentic flows -> non-deterministic decision making -> deterministic tools
This has been a very powerful pattern in my experiments and it gets even stronger when the agents are building their own determinism via tools like auto-researcher.
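A minimal sketch of the sandwich, under the assumption that the model only ever picks from a closed tool set; anything outside the set is rejected by code rather than obeyed:

```python
import subprocess

# Bottom layer: deterministic tools (illustrative choices).
TOOLS = {
    "run_tests": lambda: subprocess.run(["pytest", "-q"],
                                        capture_output=True, text=True).stdout,
    "show_diff": lambda: subprocess.run(["git", "diff", "--stat"],
                                        capture_output=True, text=True).stdout,
}

def call_model(prompt: str) -> str:
    """Placeholder for the non-deterministic middle layer."""
    raise NotImplementedError

# Top layer: a deterministic loop with a hard iteration bound.
state = "start"
for _ in range(10):
    choice = call_model(f"State: {state}\nPick one of {sorted(TOOLS)} or DONE.").strip()
    if choice == "DONE":
        break
    if choice in TOOLS:  # unknown output is ignored, not executed
        state = TOOLS[choice]()[:2000]
```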
The hardware control team delivers a spec as a document and spreadsheet. The mobile team was using that to code the interface library and validating their code against the server. I converted the document to TSV, sent some parts to Claude, and had it write a parser for the TSV, keeping all the nuances of the human-written spec. It took more than 150 iterations to get the parser to handle all edge cases and generate an intermediate output as JSON. Then Claude helped me write a code generator, using some custom glue on top of Apollo, to generate the code that is consumed by the mobile app.
This whole pipeline runs as part of Github actions and calls Claude only when our library validator fails. There is an md file which is sent to Claude on failure as part of the request to figure out what went wrong, propose a solution and create a PR. This is followed by a human review, rework and merge. Total credits consumed to get here < $350.
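Sketching the failure path of that pipeline (the script names and the md file path are illustrative; `gh pr create` assumes the GitHub CLI is available in the workflow):

```python
import subprocess
import sys

validator = subprocess.run(["./validate_library.sh"],
                           capture_output=True, text=True)
if validator.returncode == 0:
    sys.exit(0)  # happy path: Claude is never called

# On failure, send the playbook plus the validator output to the model.
notes = open("failure_playbook.md").read()
prompt = (f"{notes}\n\nValidator output:\n{validator.stdout}{validator.stderr}\n"
          "Figure out what went wrong, propose a solution, and apply it.")
subprocess.run(["claude", "-p", prompt], check=True)
subprocess.run(["gh", "pr", "create", "--fill"], check=True)  # human review follows
```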
Determinism is a different matter. Scripts and hooks are really the main levers you can pull there, but yeah, a decent script and a cron job will handle certain things much better (and for a fraction of the cost).
Somewhere in between that, I guess, are the varying levels of intelligence more likely to make the "right" decision for anything you throw at them.
I see this as the most robust way to build a predictable system that runs in a controlled way while taking advantage of probabilistic AIs and reducing the impact of their hallucinations.
LLMs simply can't be trusted to follow instructions in the general case, no matter how much you constrain them. The power of very large probabilistic models is that they basically solved the _frame problem_ of classic AI: logical reasoning didn't work for general tasks because you can't encode all common-sense knowledge as axioms, and inference engines lost their way trying to solve large problems.
LLMs fix those handicaps, as they contain huge amounts of real world knowledge and they're capable of finding facts relevant to the problem at hand in an efficient way. Any autonomous system using them should exploit this benefit.
Hooks do wonders here. The payload contains a lot of information about the pending action the agent wants to take. Combine that with the most recent n events from the agent's session history and you have a rich enough context to pass to another agent to validate the action through the SDK.
This way the validation uses the same subscription you're logged in to, whether you're using Claude Code, Codex, or Copilot. The validation agent responds in a JSON format that you can easily parse and return, allowing you to let the action through or block it with direction and guidance. I'm genuinely impressed by how well this works considering how simple it is.
You can find my approach here: https://github.com/nizos/probity
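For anyone who hasn't written one: a hook of this kind is just a script fed JSON on stdin. A minimal sketch in Probity's spirit (not its actual code; the payload fields follow Claude Code's PreToolUse hooks, where exit code 2 blocks the pending action and stderr is returned to the agent):

```python
import json
import subprocess
import sys

event = json.load(sys.stdin)  # the pending action: tool name plus its input
action = f"{event.get('tool_name')}: {json.dumps(event.get('tool_input'))}"

# Ask a second agent, on the same subscription, to vet the action.
verdict = subprocess.run(
    ["claude", "-p",
     "Policy: never modify files under vendor/. Reply ALLOW or DENY "
     f"with a reason.\nPending action: {action}"],
    capture_output=True, text=True).stdout.strip()

if verdict.startswith("DENY"):
    print(verdict, file=sys.stderr)  # the reason is fed back to the agent
    sys.exit(2)                      # block the action
```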
I get what you're trying to say but in practice architecting what you propose is considerably more difficult. Why not build it and try to get hired by one of the bigcos?
- Please consult me when you encounter any ambiguous edge cases
Attaching the AI to production to directly do things with API calls is bad. For me, the only use case where the app should do any AI stuff is reading/categorizing/etc., basically replacing the "R" in old CRUD apps. If you want to use that same new AI-based "R" endpoint to auto-fill forms for the "C", "U", and "D" based on a prompt, that's cool, but it should never mutate anything for a customer before a human reviews it. Basically, CRUD apps are still CRUD apps (and this will always be true); they just have the benefit of a very intelligent "R" endpoint that can auto-complete forms for customers (or your internal tooling/Jenkins pipelines/etc.), or suggest (but never invoke) an action.
I think most problems with AI tend to be around whether you can deterministically test the thing you are asking it to do.
How many of us would ever show our work without first going to check the thing we just built?
https://www.decisional.com/blog/workflow-automation-should-b...
I think there is a fundamental incentive problem: code + LLM + harness is bound to be more efficient, but the labs want you to burn tokens, so they are not going to tell you to use the code, just to burn more tokens. They are asking us to forget about token cost and reliability for now; the model will become better.
This means that most people just believe their agent should be able to do anything with the help of some model fairy dust, plus prompts + skills.
People need to watch their agents fail in production to be able to come to the right conclusion unfortunately.
Making an unreliable, nondeterministic system give reliable results for a bounded task with well-understood parameters is... like half of engineering, no?
There's a huge difference between "generate this code, here's a vague feature description" and "here's a list of criteria, assign this input to one of these buckets" -- the latter is obviously subject to prompt engineering, hallucination, etc. -- but so is a human pipeline!
07 May, 2026
Thesis: reliable agents tackling complex tasks need deterministic control flow encoded in software, not increasingly elaborate prompt chains
If you’ve ever resorted to MANDATORY or DO NOT SKIP, you’ve hit the ceiling of prompting.
Imagine a programming language where statements are suggestions and functions return “Success” while hallucinating. Reasoning becomes impossible; reliability collapses as complexity grows.
Software scales through recursive composability: systems built from libraries, modules, and functions. It’s code all the way down. Code exposes predictable behavior, enabling local reasoning. Prompt chains lack this property. While useful for narrow tasks, prompts are non-deterministic, weakly specified, and difficult to verify.
Reliability requires moving logic out of prose and into runtime. We need deterministic scaffolds: explicit state transitions and validation checkpoints that treat the LLM as a component, not the system.
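As a sketch, one shape such a scaffold can take: explicit state transitions, a validation checkpoint after every model call, and the LLM treated as a component behind a placeholder function. The states and validators here are illustrative:

```python
from typing import Callable

def call_model(prompt: str) -> str:
    """Placeholder: the LLM is a component behind this function."""
    raise NotImplementedError

def checkpoint(output: str, validate: Callable[[str], bool]) -> str:
    # Verification is programmatic; "Success" from the model proves nothing.
    if not validate(output):
        raise ValueError(f"validation failed: {output[:80]!r}")
    return output

STATES: dict[str, tuple[str, Callable[[str], bool]]] = {
    "plan":      ("Write a numbered plan.",        lambda s: s.strip() != ""),
    "implement": ("Implement step 1 of the plan.", lambda s: "def " in s),
}
TRANSITIONS = {"plan": "implement", "implement": None}

state: str | None = "plan"
while state is not None:
    prompt, validate = STATES[state]
    checkpoint(call_model(prompt), validate)  # fail loudly, never silently
    state = TRANSITIONS[state]
```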
But deterministic orchestration is only half the battle. In a system prone to silent failure, an agent without aggressive error detection is just a fast way to reach the wrong conclusion. Without programmatic verification, we are left with three options:
It's one thing for a model to be very clearly instructed to add a REST endpoint to an existing Django app and add a button connected to it on the front end, versus "Design me a YouTube". The smaller models can pretty dependably do the first and fall flat on the second.
They just want features. They don't really care about duplicated work, so half of them reinvent the TUI rendering wheel. Pluggability is something that might be actually hostile to their interests in lock-in. And the AI labs probably think "after a couple more scaling cycles, our models will be so good that our agents can just rewrite themselves from scratch"; until they hit a compute or power wall, it always looks rational to them to defer rearchitecting.
Another real possibility is that if you work on an agent with a really clean architecture and publish it in hopes of getting hired by some AI company, all of them think "that looks great, but we don't want to rearchitect right now". Your code winds up in the training set, and a year and a half from now, existing agents can "one-shot" rewrites along the lines of your design because they're "smarter".
As for me, I'm not that interested, personally. There are other things I want to build and I'm working on those.
Hell, a telegram bot can handle that just fine.
E.g. loop in the code, process the subagent's non-deterministic response for each individual task.
This takes 10 minutes to set up, you just need to run something like /skill-builder and describe the desired workflow.
I imagine many people just don’t know that it’s possible. I only discovered it a few days ago myself.
It worked on the first try.
...which is why we write deterministic code to take the human out of the pipeline. One of the early uses of computers was calculating firing tables for artillery, to replace teams of humans that were doing the calculations by hand (and usually with multiple humans performing each calculation to catch errors). If early computers had a 99% chance of hallucinating the wrong answer to an artillery firing table, the response from the governments and militaries that used them would not be to keep using computers to calculate them. It would be to go back to having humans do it with lots of manual verification steps and duplicated work to be sure of the results.
If you're trying to make LLMs (a vague simulacrum of humans) with their inherent and unsolvable[1] hallucination problems replace deterministic systems, people are going to eventually decide to return to the tried and true deterministic systems.
Good luck with that. Users will flood you with complaints if a button moves 5px to the left after a design update. A program that is generated at runtime, with not just a variable UI but also UX and workflows, would get you death threats.
Are google search results modifying your software at runtime?
Take our agent chat, for example: the output text is a UI, and agents can generate charts and even constrained UI elements.
Isn’t that created and adapted at run time?
If you mean agents live-modifying your code, I think that's pretty much here as well. They can read the logs and send PRs.
The only question is how fast that loop will execute, going from days or hours to minutes or seconds, and what validation gates it needs to pass.
My git repo is pretty much self-modifying personal software at this point, which I interface with through the IDE chat window.
But I don’t think we will ever lose the intermediary deterministic language (code) between the llm and the execution engine.
It would be prohibitively expensive to run everything through models all the time.
But I am starting to think we need a more precise language than English when talking with LLMs. That can do both precision and ambiguity when you need either.
Why not check the logprobs of the output and take action when the probabilities of the first and second most likely tokens are too similar (or below a certain threshold)?
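This is doable today where the API exposes logprobs; for example, with the OpenAI client (the model name and the 0.1 margin are arbitrary examples):

```python
import math
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": "Is this commit message clear? Answer yes or no."}],
    logprobs=True,
    top_logprobs=2,   # also return the runner-up token
    max_tokens=1,
)

# Compare the probabilities of the top two candidate tokens.
top = resp.choices[0].logprobs.content[0].top_logprobs
p1, p2 = (math.exp(t.logprob) for t in top)
if p1 - p2 < 0.1:  # first and second choice too close: escalate, don't trust
    print(f"low-confidence answer ({p1:.2f} vs {p2:.2f}); flag for review")
```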
Of course: have it write tests first, and run them to check its work.
Works well for refactoring, but greenfield implementations still rely on a spec that is guaranteed to be incomplete, overcomplete and wrong in many ways.
But if you're trying to tell me that every time you list criteria you get them all perfectly matched, you're clearly gifted.
They are particularly bad at complex multiline parsing, writing all sorts of weird/crude Python/awk scripts and getting confused in the process.
I wish they would use Perl 6 grammars or Haskell's Parsec or similar and write better parsing scripts.
The problem is that outside of that, most people want boring, regular interfaces so they can get in, solve the problem, and get out. They don't want to "love" it or care if it's "sexy"; they want it to work and get out of the way.
LLMs transmogrifying your software at every request assumes people are software architects and creators who love the computer interface, and that just doesn't describe the bulk of the population.
Most people use computers to consume things or to access things, not for their own sake, and they certainly don't think "what if I just had code to do x..." unless x makes them a lot of money.
I think the core issue is that non-deterministic output is great for a chatbot experience where you want unpredictable randomness so it feels less like talking to the mirror - but when it comes to coding I think we're pretty fundamentally misaligned in sticking to that non-deterministic approach so firmly.
Though "chaotic" is, I believe, the better word here: a single-letter change may produce wildly different results.
We just choose to use more random inference rules, because they have better results.