Wayfinder Router: deterministic routing of queries between local and hosted LLM

We need LLM query routing at the OS level, like mobile data. I know it will sound crazy, but hear me out. I think about AI inference as infrastructure. I do not want to pay for it separately in every app I use it in. I do not think "I have to pay for YouTube's mobile data, and WhatsApp's mobile data, etc." I pay for mobile data infrastructure and let my device route it appropriately. In fact, if we ever go the local LLM route, you could have LLM capabilities without access to the internet (or the local LAN), and your OS/computer is the only thing capable of doing that routing for you.

It's funny how much that first paragraph is Claude's voice. I don't know how it got trained so hard to use, "the shape of" for everything.

There's a hidden tax with routing this way: the original model loses context of what was done and either introduces a regression or hallucinates.

I think this sort of behaviour started happening more frequently as agentic/AI programming became more common.

Back in the day (lol, reads like a long time ago but that's probably a few months?), you would not say "edit this typo"; you would just open the file and not be lazy, and the harness would detect a user change and ground itself.

I feel like now, when I edit outside the AI flow, it goes and introduces a regression or gets lost thinking it didn't do that and something must have gone wrong.

I'm not sure I understand what this is trying to solve?

If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?

Otherwise this would only be useful for one-off prompts as far as I can tell.

And if it did keep a context to be passed around, it would always land hot (not in the cache).

we could use some composability.

today any kind of routing requires implementing an http proxy to put in the middle

ideally harnesses would support a routing plugin which receives the new whole context and returns just where to send it, and the harness does that. no http proxy. obviously some complications if you want to route from codex to anthropic or openrouter.

but we need to decouple the context building and routing decision from the actual http requests sending, we need to be able to insert "context/routing plugins" in the chain

Some kind of routing of prompts to different models does make sense. But the use case of saving money on simple prompts... I think that has only a slight benefit. "Fix my typo" doesn't use many tokens anyhow. Also, model switching still requires carrying over context, so it does have some overhead, right?

There are so many proxies like this now, but I can tell you from first-hand experience this is not going to work. You cannot just route away from a situation at such a high level, especially when we are talking about models that are quite different in behaviour, with different context windows and tuned to different tool uses. The harness is doing all kinds of funky things to compensate for issues (like tool call truncation), and a proxy that routes dynamically like this will work against the very same strategies that make the harness work.

Interesting concept, works in theory, but I cannot see this being part of a larger system.

Love to see local/cloud routing explicitly supported.

I'm building another router for routing between local and remote models, ShowHN coming up later today. Here's a sneak preview of the github: https://github.com/try-works/role-model

We are developing many applications in my company, some of them safety critical. A natural kind of routing could happen per development phase, with git as the interface. One agent works on branch a and is responsible for brainstorming, planning, and specs, and the other is responsible for code and tests. The first agent creates tickets for the second one and the second one consumes these. This works with today’s standard harness.

Slight tangent, but “Wayfinder sits behind whatever OpenAI-compatible client you already use” reminds me that descriptions of where proxies sit in the information flow always seem so arbitrary to me:

  - “after the client”
  - “reverse proxy” (in front of servers)
  - “proxy” (in front of client)

I always have to look this up, surely there must be a standardized way to describe this?

Has anyone tried the others listed? Any feedback?

It'd be nice to just have a command prefix e.g.

/local fix my typo

can you send to multiple LLMs to compare responses? From that create a heuristic of which LLM gets what.

I do this manually with a desktop app called BoltAI that lets you continue the whole conversation at your LLM of choice.

It doesn't sound crazy at all, this seems almost obvious. The OS should provide a chat completions server and the user should be able to select the underlying LLM's server. This should be just like selecting a default search engine or browser.

Hopefully the EU forces US tech giants to do this. God knows Apple and Google won't do this on their own. They gotta get that sweet default provider revenue.

It's funny how much that first paragraph is Claude's voice. I don't know how it got trained so hard to use, "the shape of" for everything.

Loads of ed sheeran in the training data?

Do you want the honest answer?

I'm not sure I understand what this is trying to solve?

If a prompt I give routes to one model, and then another prompt to another model, how does one tie the context together such that the next model knows what's going on?

Otherwise this would only be useful for one-off prompts as far as I can tell.

And if it did keep a context to be passed around, it would always land hot (not in the cache).

Every turn of a conversation with an LLM is getting the whole conversation. Caching complicates the picture, but not by a huge amount. That's why a short question at the end of a long conversation chews tokens faster than it would in a fresh session.

So, a conversation that's ongoing with one model then switching to another would presumably send the whole conversation and the new question. Which defeats the purpose of splitting traffic...so, you're not wrong to question how this actually improves things for anything other than short sessions, which you could choose your own model for if it's a small problem.

Here's a use case: You want to extend the GPT 5.5 quota in your Codex subscription by routing some % of requests to DeepSeek V4 Pro. A router needs to figure out which requests to route where, for the appropriate difficulty level.

Another use case: You have two models on your local device. One is large and fairly powerful but slow, the other is smaller, faster and good at tool calls and chat, but not great for writing and reviewing code. If you route between them per request, you can get a better developer experience with preserved performance.

The linked repo aims to help you achieve these things, as do I with the role-model router and protocol that I linked in another comment.

I'm not sure if output of easy commands like "summarize this" are added back to the context? I always assumed they are in a separate UI layer?

Interesting concept, works in theory, but I cannot see this being part of a larger system.

This is not choosing between different models, really. You should check the (interesting, yet sadly very slop-padded) readme. It’s about trying to make a binary decision: Is this a hard or easy question, and about making that decision extremely fast. They suggest putting another router that chooses the model behind it. I’m not sure how well it would work, but the idea is interesting and different than other routers.

"after the client" and "in front of client" can mean the same thing depending on your viewpoint.

That’s what I did with Pi, super simple :)

"after the client" and "in front of client" can mean the same thing depending on your viewpoint.

Exactly, that’s my point

we could use some composability.

today any kind of routing requires implementing an http proxy to put in the middle

but we need to decouple the context building and routing decision from the actual http requests sending, we need to be able to insert "context/routing plugins" in the chain

Mine does this. Only I don’t use the whole context with the router because that’s wildly resource intensive and slow.

Then, a bit like open router, it does a classifier job with a fast model to choose which one should process the turn.

In my case I usually don’t do local vs remote… although it can. Now I use it for thinking vs no-think against my preferred local model, which is a huge time saver even with the added classification step.

I'm still waiting for an isolated protocol so we don't have to run the harness directly on any of the code base's infrastructure. Something as simple as piping everything into and out of an ssh shell would be better than anything I've tested so far.

i've created the protocol, role-model: https://github.com/try-works/role-model

Deterministic prompt-complexity routing — send each prompt to your local or cloud model, offline, with no model call to decide.

Quickstart · Benchmark · How it compares · Explainer · Changelog

- No model call to decide the route
- Deterministic and fully offline
- Calibrate on your own data
- Bring your own key, self-hosted

Wayfinder looks at a prompt's structure (length, headings, lists, code) and its wording (proofs, math, hard constraints), then tells you whether to send it to your small local model or your big cloud one. It decides in microseconds, runs offline, and never calls another model to make the call: no API key, no network, no model call to decide. You get a score and a recommendation, and what you do with it is up to you.

Cheap prompts stay local and hard ones go to the expensive model, so you stop paying top-tier prices for "summarize this" and "fix my typo."

How it compares

Most routers decide by calling a model: a trained classifier, an LLM judge, or a hosted API. That adds latency, cost, and randomness to the exact step meant to save you money. Wayfinder reads structure and wording instead, so the decision is free and the same every time.

| router | decides by | model call? | self-host | calibrate |
|---|---|---|---|---|
| Wayfinder | deterministic structural score | no | yes | yes |
| RouteLLM | trained classifier (preference data) | yes | yes | retrain |
| NotDiamond / Martian | learned, hosted | yes | no | via platform |
| OpenRouter (Auto) | hosted auto-router | yes | no | - |
| Bifrost / LiteLLM | provider gateway (not complexity-routed) | no | yes | n/a |

The gateways in the last two rows (OpenRouter, Bifrost, LiteLLM) answer a different question: which provider serves a call, by price, availability, and failover. Wayfinder answers which tier a prompt deserves: cheap vs expensive, by difficulty, decided offline. The two compose. Run Wayfinder to make the cheap-vs-expensive call, and a gateway underneath to reach the providers.

Wayfinder is not chasing a top accuracy number. What it gives you is a routing decision you can run offline, with no model call, and tune on your own traffic.

By default it scores prompt structure only. It can also read lexical cues (proofs, math, constraints), but those ship off by default: a double-blind test on independently-authored prompts showed the lexical lift does not generalize (it catches ~20% of unseen hard prompts and loses to a plain word-count baseline), so they are opt-in. Raise their weights only if you've calibrated them to your own traffic's vocabulary.

A prompt whose difficulty is purely semantic (a subtle code snippet, an innocent-looking "what is the 100th prime number?") has no structural tell, and a semantic router will beat Wayfinder there. What holds up under the blind test is the part to rely on: a deterministic, sub-millisecond, offline routing decision with no model call. The benchmark (make benchmark) shows where it wins and where it loses, against honest baselines and a perfect oracle. Point it at RouterBench or RouterArena for graded numbers.

New here, or weighing it up? The FAQ gives straight answers — including where it loses (it's no better than random on RouterBench's short-but-hard items) and why you'd still run it.

Try the demo (no keys)

Two ways to see the routing decision for yourself — no API keys, no models, nothing on the network.

In your terminal — a decision-first chat in the Wayfinder palette. The terminal chat ships in the default install, so there's nothing extra to add — or run it with no install at all via uvx:

uvx wayfinder-router chat --dry-run      # zero install, zero keys
# or:  pip install wayfinder-router && wayfinder-router chat

[Image: Wayfinder terminal chat showing a routed prompt, the decision, the reply, and the running savings]

Every turn shows where it routed (● LOCAL / ◆ CLOUD), the structural score and why (/why), and the running savings vs always-cloud. /init sets up models without leaving the chat, /route · /local · /cloud force a turn, and conversations persist across sessions (/threads).

In your browser — the web chat UI with a live threshold slider:

pip install "wayfinder-router[gateway]"
wayfinder-router webchat --dry-run
# opens http://127.0.0.1:8088/demo

webchat is a thin launcher over serve (the gateway and its /demo page; --no-open, --port, --host 0.0.0.0, --dry-run); serve is the headless command. Both surfaces show, for every message, where it routed (local vs cloud), the complexity score and why (the feature breakdown), and the cost saved vs always-cloud. With no config both are decision-only (--dry-run for the web; the terminal's preview), so you can poke at it with zero setup. To get real replies, run wayfinder-router init to scaffold [gateway.models] (then wayfinder-router doctor to confirm your keys resolve) — see Quickstart.

Works with any OpenAI-compatible API

Wayfinder forwards each call to an OpenAI-style /chat/completions endpoint — so if your provider speaks that (and most do), it just works. A tier is one base_url, a model name, and a key read from the environment at request time; no SDK, no per-provider code. Pair a free local model with a hosted one, or run two cloud tiers.

…plus Groq, Together, OpenRouter, Fireworks, DeepSeek, and local servers (vLLM, LM Studio, llama.cpp), and any OpenAI-compatible endpoint that takes a Bearer key.

Quickstart

Put Wayfinder in front of your models. Your app keeps speaking the OpenAI API; you just change one base_url.

Scaffold a config — init writes a starter wayfinder-router.toml (keyless local Ollama → Anthropic cloud) plus a .env.example, then checks your keys:
```
pip install "wayfinder-router[gateway]"
wayfinder-router init                 # starter config (hybrid preset)
wayfinder-router init --preset openai # two OpenAI tiers (gpt-4o-mini → gpt-4o)
wayfinder-router init --preset gemini # two Gemini tiers (gemini-2.5-flash → gemini-2.5-pro)
wayfinder-router init --interactive   # pick providers/models step by step
```
Or describe your two models in wayfinder-router.toml by hand:
```
[routing]
threshold = 0.5            # below -> local, at/above -> cloud

[gateway.models.local]
base_url = "http://localhost:11434/v1"
model = "llama3.2"

[gateway.models.cloud]
base_url = "https://api.openai.com/v1"
model = "gpt-4o"
api_key_env = "OPENAI_API_KEY"   # read from this env var, never stored
# api_key_cmd = "op read op://Private/OpenAI/credential"  # optional: fill it from a vault
```
Wayfinder never stores secrets: a model names an env var (api_key_env) and the key is read from your environment at request time. There is nothing to "install" — just export the variable. Prefer not to paste a raw key into your shell? Add an optional api_key_cmd and Wayfinder fills that variable from your secret store at startup — op read … (1Password), security … (macOS Keychain), secret-tool … (Linux), pass/gopass, vault kv get …, aws secretsmanager get-secret-value …, bw, doppler, gcloud secrets …, or any command that prints the secret. The key is held in memory only, still never written to disk. wayfinder-router doctor detects which of these tools you have installed and suggests the exact line.
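To make the lookup concrete, here is a minimal sketch of resolving a key from an environment variable with an optional command fallback. It is not Wayfinder's code: `resolve_key` is a hypothetical helper, and the ordering (environment first, command only when the variable is unset) is an assumption for illustration.

```python
import os
import shlex
import subprocess
from typing import Optional


def resolve_key(api_key_env: str, api_key_cmd: Optional[str] = None) -> Optional[str]:
    """Return the secret for a model, or None if it cannot be resolved.

    Hypothetical helper. Assumed order: the environment variable wins,
    and the command is only consulted when the variable is unset.
    """
    value = os.environ.get(api_key_env)
    if value:
        return value
    if api_key_cmd:
        # Run the secret-store command without a shell and capture stdout.
        result = subprocess.run(
            shlex.split(api_key_cmd), capture_output=True, text=True, check=True
        )
        secret = result.stdout.strip()
        return secret or None  # held in memory only, never written anywhere
    return None
```

The secret is returned to the caller and never persisted, which mirrors the "memory only" behaviour described above.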

Set your key(s), then run the gateway. doctor re-checks the config and whether each model's key resolves (✓ set / ✗ not set) before you start:

export ANTHROPIC_API_KEY=sk-...     # or OPENAI_API_KEY, per your config
wayfinder-router doctor             # ✓/✗ per model — is each key set?
wayfinder-router serve --port 8088

Point your existing client at it. No code change:

import openai

client = openai.OpenAI(base_url="http://localhost:8088/v1", api_key="unused")
client.chat.completions.create(model="auto", messages=[{"role": "user", "content": "..."}])

Easy prompts go local, hard ones go cloud, and every response carries x-wayfinder-router-model and x-wayfinder-router-score so you can see where it went. Want to force a tier for one request? Set model="local" or "cloud" (or prefer-local / prefer-hosted), move the cut for a single call with an X-Wayfinder-Threshold header, or start a chat message with /local or /cloud (see Steer a single request).
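If you want those two headers in code, a small helper along these lines would do. `route_info` is a hypothetical name, and the only things taken from the text above are the header names.

```python
from typing import Mapping, Optional, TypedDict


class RouteInfo(TypedDict):
    model: Optional[str]
    score: Optional[float]


def route_info(headers: Mapping[str, str]) -> RouteInfo:
    """Extract the routing decision from response headers (case-insensitive)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw_score = lowered.get("x-wayfinder-router-score")
    return {
        "model": lowered.get("x-wayfinder-router-model"),
        "score": float(raw_score) if raw_score is not None else None,
    }
```

With a recent 1.x OpenAI Python SDK, `client.chat.completions.with_raw_response.create(...)` should return an object whose `.headers` can be passed straight to this helper; check that against the SDK version you run.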

Check it's working:

curl -s localhost:8088/healthz
# {"status":"ok","models":["cloud","local"]}

curl -s -D - -o /dev/null http://localhost:8088/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"auto","messages":[{"role":"user","content":"hi"}]}' \
  | grep -i x-wayfinder-router
# x-wayfinder-router-model: local
# x-wayfinder-router-score: 0.00

No backends yet? wayfinder-router serve --dry-run answers with the routing decision instead of calling an upstream, so you can feel the routing in 30 seconds before wiring up real models.

Install

| command | what you get |
|---|---|
| `pip install wayfinder-router` | scorer, CLI, Python API, and the terminal chat (`chat`); the scorer/library imports stay dependency-light |
| `pip install "wayfinder-router[gateway]"` | adds the OpenAI-compatible routing gateway, the common case for serving |
| `pip install "wayfinder-router[ui]"` | adds the local calibrate / explain / configure UI |
| `pip install "wayfinder-router[all]"` | gateway and UI on top of the default install |

How it works

Wayfinder sits behind whatever OpenAI-compatible client you already use. You point that client's base_url at the gateway once, and from then on it is invisible. The same client serves a request whether it routes local or hosted.

  your client   (chat app, IDE, agent, or code)
       |
       v
  Wayfinder gateway   scores, picks a model
       |
       |-- low  -->  local    (Ollama, vLLM)
       |-- high -->  hosted   (OpenAI, any /v1)
       |
       v
  response returns via the same client,
  with x-wayfinder-router-* headers

A few things follow from this:

- The interface in front is yours. A chat GUI (Open WebUI, LibreChat), an IDE assistant with a custom endpoint (Cursor, Continue), an agent framework, or your own code on the OpenAI SDK. Want a chat window today? Put Open WebUI in front and point it at the gateway.
- Local and hosted are backends, not apps. The local model is just a server (Ollama, LM Studio, vLLM, llama.cpp) speaking OpenAI's /v1; the hosted one is the same shape. The user never switches UIs and usually never knows which model answered.
- The score is computed, not a second opinion. Asking a model how hard a prompt is would be slow, non-deterministic, and would cost a model call to decide whether to make a model call. Wayfinder scans the prompt instead, reading structure (length, headings, steps, links, code, tables) and difficulty cues in the wording (reasoning terms, math symbols, constraints), and turns that into a 0.0-1.0 value that it compares to your threshold. Same prompt, same threshold, same answer. It is a proxy for difficulty, not a verdict, which is why the threshold is yours to tune.
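To illustrate what "computed, not a second opinion" can look like, here is a toy structural scorer. The feature set, weights, and clamping are all assumptions made for this sketch; Wayfinder's real features and formula may differ.

```python
import re

# Hypothetical weights, chosen only for illustration.
WEIGHTS = {
    "word_count": 0.004,
    "heading_count": 0.05,
    "list_item_count": 0.03,
    "code_fence_count": 0.1,
}
FENCE = "`" * 3  # a Markdown code fence marker


def extract_features(prompt: str) -> dict:
    """Count simple structural signals with regular expressions."""
    return {
        "word_count": len(prompt.split()),
        "heading_count": len(re.findall(r"^#{1,6}[ \t]", prompt, flags=re.MULTILINE)),
        "list_item_count": len(
            re.findall(r"^[ \t]*(?:[-*]|\d+\.)[ \t]", prompt, flags=re.MULTILINE)
        ),
        "code_fence_count": prompt.count(FENCE) // 2,
    }


def score(prompt: str) -> float:
    """Weighted sum of the features, clamped to at most 1.0."""
    features = extract_features(prompt)
    raw = sum(WEIGHTS[name] * value for name, value in features.items())
    return min(1.0, raw)


def route(prompt: str, threshold: float = 0.5) -> str:
    """Below the threshold goes local; at or above goes cloud."""
    return "cloud" if score(prompt) >= threshold else "local"
```

Nothing here is random or networked, so the same prompt and threshold always produce the same route, which is the property the text relies on.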

Keys are read from the environment at request time and never touch the config file or the scored path.

Score a prompt from the CLI

echo "Summarise this paragraph in one sentence." | wayfinder-router route -

Recommended Model: local
Complexity Score: 0.00  (mode: tiered)

Tiers:
  >= 0.00  local <-
  >= 0.50  cloud

Contributing Features:
  Word Count: 6
  ...

Add --json for machine consumers (an agent reads this and routes to its own model):

{
  "schema_version": "3",
  "score": 0.66,
  "recommendation": "cloud",
  "mode": "tiered",
  "features": { "word_count": 545, "heading_count": 12, "reasoning_term_count": 3, "...": 0 },
  "tiers": [{ "min_score": 0.0, "model": "local" }, { "min_score": 0.5, "model": "cloud" }]
}
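A consumer of that JSON only needs the `recommendation` field. The sketch below assumes an agent that keeps its own table of backends; `BACKENDS` and `backend_for` are hypothetical names, and the sample is a trimmed copy of the output above.

```python
import json

# Trimmed copy of the --json output shown above.
SAMPLE = (
    '{"schema_version": "3", "score": 0.66, "recommendation": "cloud", '
    '"mode": "tiered", "tiers": [{"min_score": 0.0, "model": "local"}, '
    '{"min_score": 0.5, "model": "cloud"}]}'
)

# Hypothetical mapping from a recommended name to the agent's own endpoints.
BACKENDS = {
    "local": "http://localhost:11434/v1",
    "cloud": "https://api.openai.com/v1",
}


def backend_for(decision_json: str) -> str:
    """Return the base URL the agent should call for this decision."""
    decision = json.loads(decision_json)
    name = decision["recommendation"]
    if name not in BACKENDS:
        raise KeyError(f"no backend configured for {name!r}")
    return BACKENDS[name]
```

In practice the JSON would come from the CLI's standard output rather than a constant.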

Configure routing

Wayfinder reads its own wayfinder-router.toml, found by walking up from where you run it. There are three modes, in precedence order (classifier > tiers > threshold); the scalar-score weights apply to any of them.

Binary (the default) is a single cut:

[routing]
threshold = 0.6
weights = { word_count = 4.0, list_item_count = 2.5 }

--threshold N overrides it for one run; WAYFINDER_ROUTER_THRESHOLD overrides it from the environment.

To switch the lexical cues on, raise their weights and cut at the knee — the one held-out improvement over the structural default on real frontier traffic (skill −0.038 → +0.057, 61% cost saved on RouterBench). See docs/lexical-routing.md and the ready-to-edit examples/wayfinder-router.lexical.toml; recalibrate the threshold to your own traffic (a ~20-prompt bootstrap is only a smoke test — see benchmarks/calibration-eval.md).

Tiered routes ordered score bands to any number of models:

[[routing.tiers]]
min_score = 0.0
model = "llama-3b"
[[routing.tiers]]
min_score = 0.3
model = "llama-70b"
[[routing.tiers]]
min_score = 0.6
model = "claude-cloud"
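Read as code, the bands above amount to "take the highest tier whose min_score is at or below the score". A minimal sketch, assuming that inclusive lower bound (it matches the `>=` shown in the CLI output earlier); `pick_tier` is a hypothetical name:

```python
from bisect import bisect_right

# Tiers from the example config above, sorted by min_score.
TIERS = [(0.0, "llama-3b"), (0.3, "llama-70b"), (0.6, "claude-cloud")]


def pick_tier(score: float, tiers=TIERS) -> str:
    """Return the model of the highest band whose min_score <= score."""
    bounds = [min_score for min_score, _ in tiers]
    index = bisect_right(bounds, score) - 1
    return tiers[max(index, 0)][1]
```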

Classifier is a fitted multinomial-logistic model, argmax over per-model linear scores. You usually generate it with calibrate rather than write it by hand.
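As a rough picture of "argmax over per-model linear scores", consider the sketch below. The weights and biases are made up for illustration; a real fit would come out of calibrate, and the config layout here is not Wayfinder's.

```python
# Hypothetical per-model parameters, invented for this example.
MODELS = {
    "local": {"bias": 1.0, "weights": {"word_count": -0.01, "heading_count": -0.2}},
    "cloud": {"bias": 0.0, "weights": {"word_count": 0.01, "heading_count": 0.2}},
}


def linear_score(features: dict, params: dict) -> float:
    """Bias plus the dot product of weights and feature values."""
    return params["bias"] + sum(
        weight * features.get(name, 0.0) for name, weight in params["weights"].items()
    )


def classify(features: dict, models: dict = MODELS) -> str:
    """Pick the model with the largest linear score.

    Softmax is monotonic, so the argmax over linear scores equals the argmax
    over class probabilities. Iterating in sorted order makes ties resolve
    to the alphabetically first name, keeping the result deterministic.
    """
    return max(sorted(models), key=lambda name: linear_score(features, models[name]))
```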

Each [gateway.models.<name>] block maps a routed name to an upstream base_url, a model, and an optional api_key_env (the name of an environment variable, never the secret itself). The gateway is the only part that touches keys or the network; the scorer, config, and calibrator stay pure and offline.

Calibrate on your data

The cut is a proxy, so tune it against your own traffic. wayfinder-router calibrate reads a labeled JSONL dataset ({"text": ..., "label": ...}) and prints a config fragment. It runs offline and never calls a model; the labels are your ground truth.

wayfinder-router calibrate data.jsonl --mode threshold              # sweep the binary cut
wayfinder-router calibrate data.jsonl --mode tiers                  # ordinal multi-model
wayfinder-router calibrate data.jsonl --mode classifier --out wayfinder-router.toml

The fragment drops straight into wayfinder-router.toml; the accuracy and chosen breakpoints print to stderr. The classifier is fit by deterministic L2-regularized Newton/IRLS, pure Python, converging in a handful of iterations.
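For intuition, sweeping a binary cut can be as simple as trying each observed score as the threshold and keeping the most accurate one. This is a sketch of the general idea, not the calibrator's implementation; it assumes the prompts have already been scored, and `sweep_threshold` is a hypothetical name.

```python
def sweep_threshold(scored):
    """scored: list of (score, label) pairs, label being 'local' or 'cloud'.

    Returns (best_threshold, accuracy). On ties the lowest threshold wins.
    """
    candidates = sorted({score for score, _ in scored} | {0.0, 1.0})
    best_threshold, best_accuracy = 0.0, -1.0
    for threshold in candidates:
        correct = sum(
            ("cloud" if score >= threshold else "local") == label
            for score, label in scored
        )
        accuracy = correct / len(scored)
        if accuracy > best_accuracy:
            best_threshold, best_accuracy = threshold, accuracy
    return best_threshold, best_accuracy
```

Because the loop is exhaustive over a fixed candidate set, the result is deterministic for a given dataset.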

To pick a cut in cost terms instead of bare accuracy, use a cost-aware objective. --objective knee chooses the cost-aware knee automatically (it maximizes quality-recovered × cost-saved — no target to guess, and it can't collapse to always-routing-to-the-expensive-model the way pure accuracy does on skewed labels); --objective cost-quality --target-savings X instead holds a specific savings floor. Add --weights to score with — and emit — custom feature weights, e.g. the lexical opt-in, so the output is a complete, deployable config (see docs/lexical-routing.md):

wayfinder-router calibrate data.jsonl --mode threshold --objective knee \
  --costs local=0.2,cloud=1.0 \
  --weights reasoning_term_count=5,math_symbol_count=3,constraint_term_count=1.5

Cost is metadata only — it shapes the calibrated cut and is reported on the /metrics endpoint, but never enters a per-request decision, which stays deterministic and free.
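The knee objective can be sketched in a few lines. The definitions of "quality recovered" and "cost saved" below are assumptions made for this example (the source only gives the product form), and `knee_threshold` is a hypothetical name.

```python
def knee_threshold(scored, costs):
    """scored: list of (score, needs_cloud) pairs; costs: {'local': c, 'cloud': c}.

    Assumed definitions, not taken from Wayfinder's source:
      quality_recovered = share of prompts sent to a tier that is good enough
      cost_saved        = 1 - (mean routed cost / always-cloud cost)
    Returns (threshold, objective) maximizing quality_recovered * cost_saved.
    """
    n = len(scored)
    best = (0.0, -1.0)
    for threshold in sorted({score for score, _ in scored} | {0.0, 1.0}):
        to_cloud = [score >= threshold for score, _ in scored]
        good = sum(
            went_cloud or not needs_cloud
            for went_cloud, (_, needs_cloud) in zip(to_cloud, scored)
        )
        mean_cost = sum(costs["cloud"] if c else costs["local"] for c in to_cloud) / n
        objective = (good / n) * (1 - mean_cost / costs["cloud"])
        if objective > best[1]:
            best = (threshold, objective)
    return best
```

Routing everything to cloud gives a cost saving of zero and therefore an objective of zero, which is why this objective cannot collapse to always-expensive the way plain accuracy can.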

Steer a single request

The deployment's config sets the default boundary, but a client can override the decision for one request over plain OpenAI transport. An override only changes where the request goes; the prompt is still scored, and nothing adds a model call.

- The model field is a routing directive. auto (or any normal model id) lets Wayfinder decide; a configured endpoint name (local, cloud) pins the request there; prefer-local / prefer-hosted pin to the low / high end of your router (prefer-cloud still works as an alias of prefer-hosted).
- An X-Wayfinder-Threshold header re-cuts the decision for that request, a number in 0.0-1.0 reusing your weights (binary routers only).
- An in-message /directive (opt-in: [gateway] slash_directives = true) lets a plain chat box steer routing: start a message with /local, /cloud, /prefer-hosted, or /auto and it pins that turn (stripped before the model sees it). Only known directives are acted on; anything else starting with / is left as ordinary text (WF-ADR-0036).

# Pin one call to cloud regardless of score:
client.chat.completions.create(model="cloud", messages=[...])
# Or move the cut for one call (keep model="auto"):
client.chat.completions.create(
    model="auto", messages=[...], extra_headers={"X-Wayfinder-Threshold": "0.8"}
)

Each response adds x-wayfinder-router-mode (scored / pinned / threshold-override) next to the -model and -score headers, so you can see which channel decided the route.
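The in-message directive rule (act only on known directives, leave everything else alone) could be parsed roughly like this. `split_directive` is a hypothetical helper, not the gateway's code.

```python
KNOWN_DIRECTIVES = {"/local", "/cloud", "/prefer-hosted", "/auto"}


def split_directive(message: str):
    """Return (directive or None, message text without a known directive)."""
    parts = message.split(maxsplit=1)
    # Only a known directive at the very start of the message is acted on.
    if parts and message.startswith(parts[0]) and parts[0] in KNOWN_DIRECTIVES:
        rest = parts[1] if len(parts) > 1 else ""
        return parts[0][1:], rest
    return None, message
```

Anything else that starts with a slash, such as a file path, passes through unchanged.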

Drive it from a chat UI (no fork)

Because the model field is a routing directive, any OpenAI-compatible chat UI can drive routing with no code change: the app's normal model dropdown becomes a per-conversation routing picker (auto / prefer-local / prefer-hosted / a pinned endpoint). The gateway lists these at GET /v1/models, so a UI discovers them on its own.

- LibreChat: copy examples/librechat.yaml and examples/docker-compose.override.yml into your checkout, run docker compose up, and pick the "Wayfinder" endpoint.
- Open WebUI: add an OpenAI connection pointing at the gateway; it auto-discovers the routing options.

See examples/ for both. The one thing a stock UI can't express is a live per-conversation threshold slider; that's what the wayfinder-chat fork adds, and this no-fork path proves it out first.

See where requests go

Wayfinder's controls are spread across the tools you already run, so it's easy not to notice it working. Four surfaces show or steer routing:

| surface | what it shows | where |
|---|---|---|
| Model dropdown | the routing picker (`auto` / `prefer-local` / `prefer-hosted` / a pinned endpoint) | your client, from `GET /v1/models` |
| Response headers | where each request went and why (`-model` / `-score` / `-mode` / `-request-id`) | every response |
| Debug body field | the decision inside the response body, opt-in | request header `X-Wayfinder-Debug: true` |
| Dashboard | recent decisions, per-model counts, scores (metadata only, never prompt text) | `GET /router` (JSON at `/router/recent`) |

The dashboard is separate from the off-path wayfinder-router ui console, which is for tuning, not production traffic.

Learn from feedback

Don't guess the cut, learn it from your own judgment of local versus hosted output. The loop is: collect judgments, calibrate, route automatically.

Bootstrap it with A/B onboarding. For each sample prompt, wayfinder-router onboard runs both arms and asks which was good enough; the answer is a label:

wayfinder-router onboard prompts.jsonl --arms local,cloud --calibrate > wayfinder-router.toml

The comparison goes to stderr; --calibrate prints the resulting config to stdout. Each judgment appends a {"text", "label"} line to a feedback log, which is itself the calibrate dataset, so the log turns straight into a config.

To skip the manual grading, let wayfinder-router judge label automatically. It runs both tiers and asks an automated judge "was the cheaper tier good enough?" — the same sufficiency question, no person in the loop:

wayfinder-router judge prompts.jsonl --arms local,cloud --gold gold.jsonl > wayfinder-router.toml

The built-in judge is a deterministic text comparator that abstains rather than guess when it can't tell. Because a bad label would silently degrade live routing, judge will only emit a config once it passes trust gates — agreement with your human-labeled --gold set (Cohen's κ ≥ 0.6), out-of-fold lift over the majority baseline, and both arms represented. If the gates fail it prints the confusion matrix and refuses (the labels are still recorded). Pass --save-comparisons out.jsonl to also keep the raw responses (off by default — it's a body store).
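For reference, the agreement gate rests on Cohen's kappa, which compares observed agreement with the agreement expected by chance. The sketch below is a plain implementation of the standard formula; the gate function and its handling of inputs are assumptions, and it ignores abstentions.

```python
from collections import Counter


def cohens_kappa(judge_labels, gold_labels) -> float:
    """Cohen's kappa for two equal-length label sequences."""
    if len(judge_labels) != len(gold_labels) or not gold_labels:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(gold_labels)
    observed = sum(j == g for j, g in zip(judge_labels, gold_labels)) / n
    judge_freq, gold_freq = Counter(judge_labels), Counter(gold_labels)
    # Chance agreement: product of each rater's marginal rate, summed over labels.
    expected = sum(judge_freq[label] * gold_freq[label] for label in gold_freq) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def passes_agreement_gate(judge_labels, gold_labels, minimum: float = 0.6) -> bool:
    """True when kappa reaches the minimum (0.6 in the text above)."""
    return cohens_kappa(judge_labels, gold_labels) >= minimum
```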

Once you're routing automatically, keep it honest by recording which model was actually good enough:

curl localhost:8088/v1/feedback -d '{"text": "...", "label": "cloud"}'

Then re-fit on a schedule from cron, a k8s CronJob, or a click in the UI. Recalibration rewrites only the [routing] section and preserves your [gateway] endpoints, and a running gateway hot-reloads the result with no restart:

wayfinder-router recalibrate                  # log -> calibrate -> write config
wayfinder-router recalibrate --min-labels 50  # no-op until you have enough signal

The judging runs models, so it lives in the gateway layer (with your key); the scoring core stays untouched and the log carries no secrets.

Deploy and integrate

The CLI, onboarding, and UI are for operators and bootstrapping. In production, prompts flow through the gateway (transparent) or the library (in-process), so routing happens where prompts already are.

Run the gateway as a service, sidecar or standalone:

docker build -t wayfinder-router . && docker run -p 8088:8088 -v "$PWD/data:/data" wayfinder-router
# or: docker compose up gateway   (see docker-compose.example.yml)

Point your existing client at it with no app change. Anything that speaks the OpenAI API takes a base_url, including agent frameworks (LangChain, LlamaIndex), IDE assistants with a custom endpoint (Cursor, Continue), and gateways like LiteLLM:

client = openai.OpenAI(base_url="http://localhost:8088/v1", api_key="unused")

See Integration recipes for copy-paste setup across chat UIs (Open WebUI, LibreChat, Jan), editors (Continue, Cline, Zed, JetBrains), agent frameworks (LangChain, LlamaIndex, CrewAI, AutoGen, the OpenAI Agents SDK, the Vercel AI SDK), and CLIs (aider, Copilot CLI) — plus the canonical OPENAI_BASE_URL / OPENAI_API_KEY pair.

Claude Code speaks Anthropic's Messages API rather than OpenAI's, so the gateway exposes a POST /v1/messages adapter (WF-DESIGN-0011) that translates Anthropic ⇄ OpenAI in both directions — streaming and tool use included. Point it at the gateway root and Claude Code routes through Wayfinder like any other client:

export ANTHROPIC_BASE_URL="http://localhost:8088"   # client appends /v1/messages
export ANTHROPIC_API_KEY="unused"                   # the gateway uses each upstream's own key
claude
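
To make the adapter's job concrete, here is a minimal sketch of the request-side translation for text-only messages. It is not the gateway's implementation (WF-DESIGN-0011 covers streaming and tool use, which this skips). The helper names `anthropic_to_openai` and `_flatten_content` are hypothetical, and the field mapping assumes only the public request shapes of the two APIs.

```python
from typing import Any


def _flatten_content(content: Any) -> str:
    """Anthropic content is either a string or a list of typed blocks."""
    if isinstance(content, str):
        return content
    # Keep only text blocks; tool_use and image blocks are out of scope here.
    return "".join(b.get("text", "") for b in content if b.get("type") == "text")


def anthropic_to_openai(body: dict[str, Any]) -> dict[str, Any]:
    """Translate a text-only Anthropic Messages request into an
    OpenAI chat-completions request (hypothetical helper)."""
    messages: list[dict[str, str]] = []
    # Anthropic carries the system prompt at the top level;
    # OpenAI expects it as a leading message with role "system".
    system = body.get("system")
    if system:
        messages.append({"role": "system", "content": _flatten_content(system)})
    for msg in body.get("messages", []):
        messages.append(
            {"role": msg["role"], "content": _flatten_content(msg["content"])}
        )
    out: dict[str, Any] = {"model": body["model"], "messages": messages}
    # Pass through parameters that share a name in both APIs.
    for key in ("max_tokens", "temperature", "top_p", "stream"):
        if key in body:
            out[key] = body[key]
    return out
```

The response direction would do the reverse (an OpenAI `choices[0].message` back into Anthropic content blocks), and a real adapter also has to map tool calls and streaming events, which is where most of the complexity sits.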

Wire feedback from wherever your users are. Your app, IDE, or chat shows a thumbs-up or thumbs-down and posts the judgment; the next recalibration learns from it:

fetch("http://localhost:8088/v1/feedback", {
  method: "POST",
  body: JSON.stringify({ text: prompt, label: wasGoodEnough ? "local" : "cloud" }),
});
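
For a Python backend, the same call can be sketched with the standard library alone. The `text` and `label` fields and the URL come from the examples above; the helper name `build_feedback_request` is hypothetical, and sending an explicit JSON content type is an assumption about what the gateway accepts. The optional bearer token applies only when `WAYFINDER_ROUTER_FEEDBACK_TOKEN` is set on the gateway.

```python
import json
import urllib.request
from typing import Optional

FEEDBACK_URL = "http://localhost:8088/v1/feedback"  # gateway address used in the examples


def build_feedback_request(
    text: str, good_enough: bool, token: Optional[str] = None
) -> urllib.request.Request:
    """Build (but do not send) a POST to /v1/feedback."""
    payload = {"text": text, "label": "local" if good_enough else "cloud"}
    # Assumption: the gateway accepts an explicit JSON content type.
    headers = {"Content-Type": "application/json"}
    if token:
        # Needed only when WAYFINDER_ROUTER_FEEDBACK_TOKEN is set on the gateway.
        headers["Authorization"] = f"Bearer {token}"
    return urllib.request.Request(
        FEEDBACK_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers=headers,
        method="POST",
    )


# Sending is one call once the gateway is running:
# urllib.request.urlopen(build_feedback_request(prompt, True), timeout=10)
```

Keeping request construction separate from sending makes the payload easy to unit-test without a running gateway.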

The gateway forwards asynchronously and streams: a request with `stream: true` comes back as Server-Sent Events, so chat clients render tokens as they arrive. An upstream timeout or connection failure returns an OpenAI-shaped error instead of a bare 500, every response carries a request ID for tracing, and routing decisions and reload failures are logged. The knobs:

| setting | effect |
| --- | --- |
| `WAYFINDER_ROUTER_TIMEOUT` / `serve --timeout` | upstream timeout in seconds (default 60) |
| `WAYFINDER_ROUTER_FEEDBACK_TOKEN` | when set, `/v1/feedback` requires `Authorization: Bearer <token>` |
| `serve --dry-run` | return routing decisions without calling any upstream |
| `GET /healthz` | reports `degraded` and lists `missing_keys` when a configured `api_key_env` is unset |
| `GET /router` | read-only dashboard of recent decisions, with `X-Wayfinder-Debug: true` surfacing one in the body |
| `GET /v1/savings?period=today\|7d\|30d\|all` | realized vs always-frontier cost and the savings between them, per route (WF-DESIGN-0007) |
| `WAYFINDER_ROUTER_SAVINGS_FILE` | where the savings ledger is persisted (default `<config-dir>/wayfinder-savings.json`) |
| `[gateway] retries` / `breaker_threshold` / `breaker_cooldown` | reliability: bounded retries on transport/`429`/`5xx`, and a per-target circuit breaker (WF-ADR-0031) |
| `[gateway] failover = same-tier\|degrade\|escalate` | on exhaustion, stay on the tier (default), fall to a cheaper one (never raises cost), or a dearer one (opt-in); per-request `X-Wayfinder-Failover` |
| `[gateway.models.<name>] fallbacks = [...]` / `context_window` | same-tier endpoints to try on failure; skip a target whose window can't fit the prompt. Responses carry `x-wayfinder-router-served-by` |
| `[gateway.budget] limit` / `window = day\|month\|all` / `on_breach = degrade\|block` | spend cap: once `limit` realized cost is reached, degrade to the cheapest tier (default, never raises cost) or block with HTTP 402. Surfaced via `x-wayfinder-router-budget`; needs real `cost_per_1k` prices (WF-ADR-0032) |
| `[gateway.cache] enabled` / `ttl` / `max_entries` / `max_bytes` | exact-match response cache: replay a stored answer for an identical deterministic request, giving instant, free repeats. Off by default; in-memory only; raise `max_bytes` (default 64 MiB) for more. A hit is free and surfaced via `x-wayfinder-router-cache: hit\|miss`; disabling purges it (WF-ADR-0033) |
| `[gateway.rate_limit] rpm` / `tpm` / `window` | cap requests-per-minute and/or upstream-tokens-per-minute over a fixed `window` (default 60s); on breach returns `429` with `Retry-After`. The outermost guardrail (checked before scoring); gateway-wide. Successful responses carry `X-RateLimit-Limit`/`-Remaining`/`-Reset` so clients can self-pace; surfaced via `x-wayfinder-router-rate-limit` and `wayfinder_router_rate_limited_total` (WF-ADR-0034) |
| `[gateway.keys.<id>] hash` / `tags` / `models` (+ nested `budget` / `rate_limit`) | virtual API keys: when any is set, `/v1/` requires a valid `Authorization: Bearer` token (else `401`). Mint with `wayfinder-router keys new`; only the SHA-256 hash is stored. Spend & savings are attributed per key (`by_key` in `/v1/savings`, `wayfinder_router_key_requests_total`); a key can carry its own budget/rate-limit (strictest wins) and a `models` allowlist (clamps to the nearest allowed tier) (WF-ADR-0035) |
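
The virtual-key entry says only the SHA-256 hash of a key is stored. A minimal sketch of what hash-only verification can look like follows. The `wf-` prefix, the hex encoding, and the function names `mint_key` and `verify_bearer` are assumptions for illustration, not the actual output format of `wayfinder-router keys new` or the gateway's code.

```python
import hashlib
import hmac
import secrets
from typing import Optional


def mint_key() -> tuple[str, str]:
    """Return (plaintext_token, sha256_hex). Only the hash would be persisted."""
    token = "wf-" + secrets.token_urlsafe(32)  # hypothetical token format
    return token, hashlib.sha256(token.encode("utf-8")).hexdigest()


def verify_bearer(authorization: Optional[str], stored_hashes: set[str]) -> bool:
    """Check an `Authorization: Bearer <token>` header against stored hashes."""
    if not authorization or not authorization.startswith("Bearer "):
        return False
    presented = hashlib.sha256(
        authorization[len("Bearer "):].encode("utf-8")
    ).hexdigest()
    # Each individual comparison is constant-time.
    return any(hmac.compare_digest(presented, h) for h in stored_hashes)
```

The point of the design is that a leaked config file exposes hashes rather than usable tokens: the plaintext is shown once at mint time and never written down by the server.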

Explain and tune

To see why a prompt routed where it did, ask for the per-feature breakdown: each feature's value, its normalized level, its weight, and its share of the score.

wayfinder-router route prompt.md --explain
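
As a rough illustration of what such a breakdown contains, the sketch below computes each feature's contribution and share under the assumption that the score is a plain weighted sum of normalized levels. The feature names, the weights, and the `Contribution` type are made up for the example; the real breakdown comes from `--explain` or `explain_score`.

```python
from dataclasses import dataclass


@dataclass
class Contribution:
    name: str
    level: float         # normalized feature level in [0, 1]
    weight: float
    contribution: float  # weight * level
    share: float         # fraction of the total score


def breakdown(levels: dict[str, float], weights: dict[str, float]) -> list[Contribution]:
    """Per-feature contributions for a weighted-sum score (assumed form)."""
    raw = {name: weights.get(name, 0.0) * lvl for name, lvl in levels.items()}
    total = sum(raw.values())
    rows = [
        Contribution(name, levels[name], weights.get(name, 0.0), c,
                     c / total if total else 0.0)
        for name, c in raw.items()
    ]
    # Largest contributors first, as a contribution bar chart would show them.
    return sorted(rows, key=lambda r: r.contribution, reverse=True)


# Hypothetical features and weights, for illustration only.
rows = breakdown(
    {"length": 0.5, "code_blocks": 1.0, "reasoning_words": 0.25},
    {"length": 0.2, "code_blocks": 0.4, "reasoning_words": 0.4},
)
for r in rows:
    print(f"{r.name:16s} level={r.level:.2f} weight={r.weight:.2f} share={r.share:.0%}")
```

Reading the shares side by side is what makes a routing decision auditable: a prompt that went to the hosted model can be traced back to the one or two features that pushed it over the threshold.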

For interactive tuning there's a local web UI:

Explain — paste a prompt; see the score, the tier ladder, and contribution bars, and drag a threshold slider to watch routing change live.
Calibrate — paste a labeled dataset, run a mode, and see accuracy, the sweep curve, and the resulting config fragment.
Configure — edit wayfinder-router.toml with live validation and save.
Onboard — A/B a local and a hosted model in the browser, judge each, and calibrate from the log (needs [gateway] for the model calls).

pip install "wayfinder-router[ui]"
wayfinder-router ui --port 8099    # then open http://localhost:8099

The UI is a thin wrapper over the same pure functions; it never calls a model, and no secret appears in it.
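
What the threshold slider in the Explain tab changes can be sketched in a few lines: a score is compared against an ordered list of cut points and lands on a tier. This assumes that a score at or above a threshold moves up to the next tier; the tier names and the `pick_tier` helper are illustrative, not the package's API.

```python
from bisect import bisect_right


def pick_tier(score: float, thresholds: list[float], tiers: list[str]) -> str:
    """Map a score to a tier given ascending thresholds.

    len(tiers) must be len(thresholds) + 1. A score equal to a threshold
    goes to the higher tier (an assumption of this sketch).
    """
    if len(tiers) != len(thresholds) + 1:
        raise ValueError("need exactly one more tier than thresholds")
    if sorted(thresholds) != list(thresholds):
        raise ValueError("thresholds must be ascending")
    return tiers[bisect_right(thresholds, score)]


# Binary ladder with a single threshold of 0.7.
print(pick_tier(0.42, [0.7], ["local", "cloud"]))
# Three-tier ladder with hypothetical cut points.
print(pick_tier(0.55, [0.3, 0.7], ["local", "mid", "cloud"]))
```

Because the mapping is a pure function of the score and the cut points, moving a slider only needs to re-run this lookup, with no model call involved.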

Python API

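from wayfinder_router import score_complexity, RoutingConfig, explain_score

prompt_text = "Refactor this module and explain the trade-offs."  # any prompt string

# Score the prompt against a binary config with a 0.7 threshold.
result = score_complexity(prompt_text, config=RoutingConfig.binary(threshold=0.7))
print(result.recommendation, result.score, result.features)

# Per-feature contributions under the default weights.
for fc in explain_score(result.features, RoutingConfig().weights):
    print(fc.name, fc.contribution)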

Origin

Wayfinder started as a routing experiment inside a larger requirements tool and was split out because routing is a runtime concern, not a knowledge one: a prompt router shouldn't make you install an engine you don't need. The result is a small, focused tool whose scoring core stays dependency-free: you can import wayfinder_router and score prompts with nothing but the standard library (WF-ADR-0001, WF-ADR-0029).

Repository layout

wayfinder-router/
  wayfinder_router/   the package: scorer, tiers + classifier, config loader/writer,
                      offline calibration (Newton/IRLS), explain, the feedback log and
                      onboarding harness, recalibration, CLI, and the optional gateway
                      and local UI (the impure layers, behind their extras)
  tests/              scorer, config, calibration, explain, feedback, onboard,
                      recalibrate, CLI, gateway, and UI coverage
  decisions/          design notes behind the tool's own choices
  docs/               the FAQ and the lexical-routing guide
  Dockerfile, docker-compose.example.yml   deploy the gateway as a service

Test

pip install -e ".[dev]"   # or: pip install pytest
make test

I've created the protocol, role-model: https://github.com/try-works/role-model

Great name, but ironically hard to reason about from a role perspective, at least from the README.

Does this interfere with cache hits? Could a single conversation or task span multiple roles?

Why are you building this? Does this maximize my token value by saving the hard tasks for the hard model? Does it maximize cache hits as part of its scoring? Does it help agents develop a specialist mindset? Are you anticipating users will have many local models hot, or is this also a model load/unload controller?

I'm building this so that I can decide, as a user and on my own device, that certain types of workloads should be handled by my Qwen model, keeping the data on my device, while other workloads go to more capable models.

A router alone isn't enough for this, because the information needed to make detailed, accurate routing decisions doesn't currently exist. There are no standards either: every lab, and maybe even every inference provider, has its own way of implementing reasoning, chat templates, caching, tool use, and so on. All of these differences make models non-interoperable.

What we need is applications that clearly specify their requests so they can be routed accurately to a provider, whether local or remote. For that, they need a standard protocol for model requests and intent.

I wrote a longer piece here: https://news.ycombinator.com/item?id=48706181

Hacker Times

