However, I do want to mention that the “recommended” flow these days isn’t to separate out a tool request the way you have. That is, instead of asking an LLM to route to a tool, extracting that, running the tool, passing output back to the LLM, etc., you simply pass the tool definitions, the prompt, and structured-output expectations, and let the LLM (and your caller library) manage the tool-use loop.
That’s how modern LLMs are trained in post-training, so I suspect you’ll get different (and potentially worse?) results when you try to subvert this with a small, local model.
Letting the LLM do this comes with all the downsides you mentioned, but it’s also more likely to be in-distribution, and it’s easier to compose multiple tool calls.
Anyway, thanks for sharing! I’d love to see evals on a task comparing results when an LLM is involved in tool selection versus when it is handed tool output only. If I’m wrong about quality degradation, then there’s a lot to like about your local tool routing.
If LLMs could handle determinism better, I’d say having a single chat-based entrypoint into a plethora of services makes sense. But as they stand, it doesn’t. Simpler control flow, and constraining the number and type of downstream services that sit behind a single interface, is the way to go, I think.
That said, I agree we should keep the ambition to move to the one-size-fits-all approach.
From the article:
Each LLM call incurs latency, cost, and token overhead. More subtly, it compounds context:
every step includes not only the original query, but intermediate outputs and scratchpad logic from earlier prompts.
This creates a growing burden on both inference and model performance.
I was working with agents over a year ago, before the common workflows had really been set in stone. At that time we were heavily doctoring the context to give the LLM a very streamlined representation of what had occurred during a given run. Is this not standard practice?

You'd still need to figure out what payload to give the tool based on your context.
But I guess depending on your business case it might be worth it. It's not something I'd do from the beginning, though.
For complex real-world agent flows though, tool use is often the only thing that the LLM is expected to do. Like in a coding agent:
```
# trace format: Tool(args) > exit_code, output
User: Develop a program to ...
Agent: Bash("touch main.py") > 0, ""
Agent: Edit("main.py", initial_patch) > 0, ""
Agent: Bash("python main.py") > 1, "SyntaxError: ..."
Agent: Edit("main.py", fix_patch) > 0, ""
Agent: Bash("python main.py") > 0, "OK"
Agent: FINISH
```
Here, tool selection (+ writing the arguments) is actually the whole job. It's also easy to see that if you omit even one of the tool use records in the middle, the agent wouldn't work at all.
Speaking generically: anywhere in your workflow where you feel the task is not hard, you can use a smaller, cheaper LM.
Smaller LMs come with reduced accuracy, particularly in tail cases, so in the real world this doesn't work out.
Also, is the Gumbel-softmax usage intentional? It looks like a straightforward classifier that just needs a regular softmax.
https://gist.github.com/viksit/c67d1d960c4cec89488290496defb...
I guess that applies when you're not able to fine-tune the LLM you're using. Presumably Anthropic has a lot of data too.
quick note: it doesn’t have to be an rnn. i’ve got a follow-up example coming that uses a transformer-style ToolController with self attention, more expressive routing, etc.
but here's the thing: when you rely on few-shot bootstrapping of the LLM, you never end up updating the model's priors. even after 100k tool calls, you're still stuck in the same polluted context window, and it's all stateless.
this gets worse fast with more than 3–4 tool calls, especially when there’s branching logic (e.g., if api1 > 5, go left, else right).
what this approach offers is: backprop through tool calls. you can tune prompts and update priors across the full workflow, end to end. trying to develop this intuition a bit more, and would love feedback.
thanks for the suggestion on the eval — will post that comparison soon.
I think of an llm as a differentiable interpreter of a program. it should do decision making (tool selection, argument routing), branching logic via weights + gates etc.
so as a differentiable state machine:
- each state == a stage in your workflow
- transitions == tool calls
- encode this as an rnn or graph
and learn transitions and actions via supervision or RL. rough sketch below:
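a minimal sketch of this idea, assuming a GRU cell as the transition function (all names here are illustrative, not from the thread):

```python
import torch
import torch.nn as nn

# illustrative sketch of a differentiable state machine over tools;
# class/variable names are hypothetical.
class WorkflowStateMachine(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, num_tools: int):
        super().__init__()
        self.cell = nn.GRUCell(input_dim, hidden_dim)  # state transition
        self.head = nn.Linear(hidden_dim, num_tools)   # action = tool logits

    def forward(self, obs_seq: torch.Tensor):
        # obs_seq: (T, B, input_dim), encoded query + tool outputs per step
        h = torch.zeros(obs_seq.size(1), self.cell.hidden_size)
        logits = []
        for obs in obs_seq:               # each step == a stage in the workflow
            h = self.cell(obs, h)         # transition to the next state
            logits.append(self.head(h))   # which tool to call in this state
        # supervise with cross-entropy on logged tool choices, or train with RL
        return torch.stack(logits)
```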
re: different tools (apis vs mcps). in my mind, there should be no real difference in what kind of tool is being called at a given moment, since I model this as a softmax over a label set of tools.
that said, an idea I want to investigate is whether tools can live in a learned embedding space, where selection isn’t a softmax over discrete labels but a nearest-neighbor or attention mechanism over continuous vectors.
this is the intuition I'm developing as we speak, here and in some of my other comments on this thread (see the differentiable state machine comment). a rough sketch of the embedding-space idea (all names hypothetical):
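```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# hypothetical sketch: tools live in a learned embedding space, and selection
# is scaled dot-product attention over those vectors rather than a fixed
# softmax head. names and dimensions are illustrative.
class EmbeddingToolSelector(nn.Module):
    def __init__(self, query_dim: int, tool_dim: int, num_tools: int):
        super().__init__()
        self.tool_emb = nn.Parameter(torch.randn(num_tools, tool_dim))
        self.proj = nn.Linear(query_dim, tool_dim)  # map query into tool space

    def forward(self, query_vec: torch.Tensor):
        q = self.proj(query_vec)                               # (B, tool_dim)
        scores = q @ self.tool_emb.T / self.tool_emb.size(1) ** 0.5
        # attention weights over tools; argmax (or top-k nearest neighbors)
        # picks the tool. a new tool is a new row, with no retrained head.
        return F.softmax(scores, dim=-1)
```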
Great points about not updating priors. I also thought about it a bit more and realized that there’s a way you can largely mitigate the out-of-distribution inference requests after local tool selection, if you wanted to.
The tool use loop in an inference framework builds up history of each interaction and sends that along with each subsequent request. You could create “synthetic history”, where you send the LLM history containing the prompt, your local tool selection masquerading as though the LLM generated it, and the tool response. This would be in-distribution but still rely on your local tool routing.
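A hedged sketch of that synthetic history, assuming an OpenAI-style chat message format (the helper name and call id below are made up):

```python
import json

# "Synthetic history": the local router's decision is inserted as though the
# LLM had made the tool call itself, so the follow-up request stays
# in-distribution. Assumes an OpenAI-style message format.
def build_synthetic_history(prompt, tool_name, tool_args, tool_output):
    call_id = "call_local_0"  # fabricated id for the masqueraded call
    return [
        {"role": "user", "content": prompt},
        {   # pretend the model chose the tool (it was our local router)
            "role": "assistant",
            "content": None,
            "tool_calls": [{
                "id": call_id,
                "type": "function",
                "function": {"name": tool_name,
                             "arguments": json.dumps(tool_args)},
            }],
        },
        {"role": "tool", "tool_call_id": call_id, "content": tool_output},
    ]
```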
If this works well enough, then I think your approach is very powerful once you’ve decided on a task and set of tools and are able to commit to training on that. Definitely want to try this myself now.
Looking forward to seeing more! I take it your substack is the best place to follow along?
this gets worse when you're chaining 3–4+ tools. context gets noisy, priors stay frozen, and there's prompt soup.
my intuition here is: you can learn the tool routing and the llm prompts before and after the call (and you can always swap out the rnn for a more expressive encoder model and backprop through the whole thing).
super useful when you’re building complex workflows -- it gives you a way to learn the full pipeline, not just guess and hope.
Right tool for the step to the right extent.
Feels like soft skills for software development.
also, if you're down, love to connect and talk more about what use cases / techniques you're using. I'm @viksit on X dms if that works.
in my mind the biggest difference is between llms that are invoked during a workflow, and llms that are invoked when _creating_ code (codegen).
for the former, tools can stay well defined as long as they are few in number. but at some point the system needs to examine a library of tools, understand how to call and integrate each one, and at its peak even create new tools to talk to systems not already in that library (codegen).
The Hacker News discussion trended at #9, with excellent feedback and comments here.
Modern agentic architectures rely heavily on chaining LLM calls. A typical pattern looks like:
1. Use an LLM to decide which tool to invoke
2. Call the tool (e.g. search, calculator, API)
3. Use another LLM call to interpret the result and generate a final response
This structure is easy to reason about, simple to prototype, and generalizes well.
But it scales poorly.
Each LLM call incurs latency, cost, and token overhead. More subtly, it compounds context: every step includes not only the original query, but intermediate outputs and scratchpad logic from earlier prompts. This creates a growing burden on both inference and model performance.
The consequence is that most agent stacks are paying GPT-4 to do what amounts to classical control flow — tool selection — with no reuse, no abstraction, and no efficiency gains at scale.
Instead of using an LLM to route between tools, we can model the decision as a trainable function. A differentiable controller learns tool selection from data — typically via reinforcement or supervised fine-tuning — and runs entirely outside the LLM.
The benefits are architectural:
- Local execution — avoids external API calls
- Determinism — removes stochastic sampling from routing
- Composability — integrates natively with PyTorch / DSPy pipelines
- Control — tool choice is explainable and debuggable
A minimal example looks like this (PyTorch):
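(The exact snippet isn't reproduced here; this is a minimal sketch consistent with the description below, with illustrative names and layer sizes.)

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch: tokenized text in, softmax distribution over tools out.
class ToolController(nn.Module):
    def __init__(self, vocab_size: int, num_tools: int, dim: int = 64):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, dim)  # layer 1: token embeddings
        self.fc1 = nn.Linear(dim, dim)            # layer 2
        self.fc2 = nn.Linear(dim, dim)            # layer 3
        self.out = nn.Linear(dim, num_tools)      # layer 4: tool logits

    def forward(self, token_ids: torch.Tensor):
        x = self.emb(token_ids).mean(dim=1)       # pool over the token axis
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return F.softmax(self.out(x), dim=-1)     # distribution over tools
```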
This is a simple 4-layer network: input is tokenized text; output is a softmax distribution over tools. Because it’s differentiable, you can backpropagate from downstream task reward and improve the router over time.
We can either get data from existing logs or use GPT to create a synthetic dataset. (The cost is one-time per tool controller, versus ongoing LLM calls in production.)
We train the controller with a plain Adam optimizer.
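A hedged training sketch, assuming a DataLoader (`train_loader`, not shown) of (token_ids, tool_label) pairs built from logs or the synthetic dataset:

```python
import torch

controller = ToolController(vocab_size=10_000, num_tools=3)
optimizer = torch.optim.Adam(controller.parameters(), lr=1e-3)
loss_fn = torch.nn.NLLLoss()  # forward() returns probabilities, so take logs

for token_ids, tool_label in train_loader:  # assumed (inputs, labels) loader
    probs = controller(token_ids)
    loss = loss_fn(torch.log(probs + 1e-9), tool_label)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```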
And finally, the demo!
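An illustrative run (the tool label set and the `tokenize` helper are assumptions, not from the original code):

```python
TOOLS = ["search", "calculator", "weather"]   # hypothetical label set

token_ids = tokenize("what is 17 * 24?")      # assumed helper -> (1, T) ids
probs = controller(token_ids)
tool = TOOLS[probs.argmax(dim=-1).item()]
print(tool, probs.tolist())                   # e.g. "calculator", [[...]]
```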
For completeness, this is how we’d do it via an LLM directly.
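A sketch of that LLM-routed baseline, assuming an OpenAI-style client (the model name and prompt wording are placeholders):

```python
from openai import OpenAI

client = OpenAI()
resp = client.chat.completions.create(
    model="gpt-4.1",  # placeholder model choice
    messages=[{
        "role": "user",
        "content": "Pick one tool from [search, calculator, weather] for the "
                   "query below. Reply with the tool name only.\n\n"
                   "Query: what is 17 * 24?",
    }],
)
tool = resp.choices[0].message.content.strip()  # stochastic, and a full
                                                # LLM call on every route
```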
And as a bonus, here's how you might integrate it into a DSPy pipeline.
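A hedged sketch of that integration (module and signature names are illustrative; `tokenize` is the same assumed helper as above):

```python
import dspy

class RoutedAgent(dspy.Module):
    def __init__(self, controller, tools):
        super().__init__()
        self.controller = controller    # the trained ToolController
        self.tools = tools              # dict: tool name -> callable
        self.answer = dspy.Predict("query, tool_result -> answer")

    def forward(self, query: str):
        token_ids = tokenize(query)     # assumed tokenizer helper
        idx = self.controller(token_ids).argmax(dim=-1).item()
        name = list(self.tools)[idx]    # routing happens outside the LLM
        result = self.tools[name](query)
        # only the original query and the selected tool's result reach the LLM
        return self.answer(query=query, tool_result=result)
```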
Prompt-based planners incur a hidden penalty: context inflation.
Each new prompt must reintroduce the full conversation history, prior decisions, and any scratch output. The result is exponential growth in irrelevant tokens, particularly in multi-hop workflows.
This leads to:
- Token tax — redundant tokens sent repeatedly
- Truncation risk — long contexts hit model limits earlier
- Attention dilution — more tokens competing for limited compute
- Leakage — planner logic unintentionally affects final output
By contrast, a differentiable router operates entirely out-of-band. The only input to the final LLM call is the original query and the selected tool’s result. Context length is constant regardless of tool depth.
This architectural separation restores clarity to the final model call — reducing hallucinations, improving determinism, and reclaiming inference capacity for core reasoning.
The shift to differentiable routing mirrors a broader trend:
Separating declarative control logic from generative inference.
Current agentic systems blur this line. Tool selection is handled in the same modality — and often the same model — as natural language generation. This creates coupling where there should be composition.
Differentiable programming is one way to decouple the two:
- LLMs focus on generation and synthesis
- Lightweight neural modules handle routing, orchestration, and control
The result is a more modular, inspectable, and scalable architecture — one that avoids paying transformer inference costs for classical programming constructs.
To drive this home, let's consider a planner that routes queries between a search API and a calculator tool. Each query invokes:
1. One LLM call to plan
2. One LLM call to interpret the tool's result
3. One LLM call to generate the final answer
At GPT-4.1 prices, with roughly 75 input / 75 output tokens per call, the arithmetic works out as follows:
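A back-of-envelope calculation (hedged: assumes GPT-4.1 list prices of about $2 per 1M input tokens and $8 per 1M output tokens at the time of writing):

```python
price_in, price_out = 2.00 / 1e6, 8.00 / 1e6    # assumed $/token rates
cost_per_call = 75 * price_in + 75 * price_out  # = $0.00075
llm_routed = 3 * cost_per_call      # plan + interpret + final answer
local_routed = 1 * cost_per_call    # final answer only; routing is local
print(llm_routed, local_routed)     # ~$0.00225 vs ~$0.00075 per run
```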
Routing locally leaves only the final generation call: a 3× reduction in cost per run, with larger savings as tool chains grow in complexity.
In early-stage workflows, LLM routing is fast to implement and flexible to extend. But at scale, it’s structurally inefficient — economically and architecturally.
Differentiable controllers offer an excellent alternative. They reduce cost, improve performance, and clarify model behavior. They mark a step toward LLM systems that look less like prompt chains — and more like programs.