DESIGN.md:

> Each rule below is enforced mechanically by the skill, not left to vibes.

> R1. Repo docs are the memory; not in HANDOFF.md = didn't happen

SKILL.md:

> Not in docs/HANDOFF.md = didn't happen. Refuse to judge results that exist only in conversation or builder chat output.

"Mechanical enforcement" just means "prompting the LLM a bit extra" these days? It (still) amazes me how much effort and how many tokens we expend on what could and should be a two-line script...
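For what it's worth, here is roughly what that script could look like. This is only a sketch: it assumes a result counts as "recorded" when its identifier appears verbatim in docs/HANDOFF.md. The function name `recorded_in_handoff` and the idea of a lane-style identifier are my own assumptions, not anything taken from the project.

```python
from pathlib import Path


def recorded_in_handoff(result_id: str, handoff: Path = Path("docs/HANDOFF.md")) -> bool:
    """Return True only if the handoff file exists and mentions result_id verbatim.

    Hypothetical helper: the real HANDOFF.md layout is not shown in this thread,
    so a plain substring match stands in for whatever lookup the project uses.
    """
    return handoff.is_file() and result_id in handoff.read_text(encoding="utf-8")
```

A wrapper could call this before any judging step and exit non-zero when it returns False, which is a check the model cannot talk its way around.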

I know how to reduce Fable tokens by 100%: https://www.anthropic.com/news/fable-mythos-access

I do exactly this with awman workflows: https://github.com/prettysmartdev/awman/blob/main/docs/05-wo...

You can use any agent and/or model for each step and share context between them.

I actually just started doing this by having Fable roleplay as Jeff Dean and to use Codex as Sanjay driving the implementation and have them go back and forth. Works really well and it’s cool to see AI pair program

ANNNNNND it's gone. Guys, I found a way to reduce Fable token usage 100%. You can find it here: github.com/USGov/idiotic-overreach.

Yes, I'm using Fable to inspect the code and generate the plan and architectural docs, then Gemini to implement, then Fable again to review and find bugs. It saves a lot of usage.

Fable will do this itself, by spawning Opus/Sonnet subagents to do easy work.

Reduce Fable tokens by 80%, simply by not using it!

> I am fairly convinced this is the shape serious agent work keeps converging toward.

"this" being "plan with expensive model, implement with cheap model".

Anyone who follows HN would be hard-pressed to disagree; this architecture is re-invented twice monthly.

https://www.facebook.com/groups/vibecodinglife/posts/1946207... https://github.com/openai/codex/discussions/10628 https://build5nines.com/stop-burning-premium-requests-how-to...

> Not because it is aesthetically pleasing. Because every other shape eventually runs into the same boring failures: context rot, self-grading, goalpost drift, and merge chaos.

Actual failure isn't boring. But struggling through a generated software project that celebrates its own genius and doesn't have a single self-critical or genuinely reflective thing to say...at least watching paint dry I might get giddy off the fumes.

I'm not interested in critiquing the project itself, either, you'll just run that through a model, too.

Fool me once. Fool me twice. Fool me thirty three times and here we are trying lucky number 34.

Reducing token usage is this year's "one weird trick". It doesn't make sense on the face of it.

Even if one discovered something that millions (billions?) of dollars of AI compute and the best statisticians in the world were not able to find via exhaustive research, domain search and training... what do you think are the chances this won't be folded into the next update of every model, making the rigmarole moot?

Extraordinary claims require extraordinary evidence, and technology-shattering innovations in AI are not known to come from a markdown file.

Last night I switched back to Codex for a minute having burned through my tokens for the week with Fable and oh boy I had a terrible experience. Running in circles over simple problems (which I ended up solving myself, like a peasant) and running "terraform apply" several times despite several instructions all over the place to never do that. The performance difference was stark.

Reduce fable token usage even more by not using it. What a clever idea, op! Wow.

Agents are in a wacky state, which makes projects like this fall into a weird spot. E.g., I vaguely expect my agent to do two disparate things: manage dependency injection for tools, prompt modifications, etc., but also be the sort of “brain trust” that controls the flow of execution (can we stop now, do we keep going, etc.).

This project is meant to be the latter, but there’s not a clean way to integrate that into Claude Code or Codex because they expect to do both.

Pi can do it, but then your users can’t use their Claude subscriptions, so you have to kludgily try to do the same thing via LLM prompts.

>https://www.facebook.com/groups/vibecodinglife/posts/1946207...

Wow, linking a Facebook group post might actually be worse than X. Is there an xcancel alternative for Facebook?

I don't disagree with any of this. It is generated software, and it's not a novel idea. I didn't mean for it to come off like that. It's just solving an itch that I couldn't find a solution to and I'm getting a lot of personal utility out of it. I do have a lot of experience with agentic memory, multi-agent systems and harnesses and wasn't super impressed by the workflow of Fable calling opus subagents so I figured I'd apply best practices to what already exists to make it a teensy bit better and easier to use.

/advisor has been a really good experience for me, especially with only a Pro plan.

I exclusively use sonnet and advisor is basically “hey opus chime in on my approach”. been working great as far as i can tell.

I had a similar experience. So far Fable has been a game changer, at least for the work I used it for. Having said that, I think its writing is definitely worse than GPT 5.5. Ethan Mollick also observed the same. He called it more "Claudy." It generates worse academic prose than other frontier models.

Could you provide some details, if possible, like what model & thinking effort, what kinds of tasks? I used to swap between Claude Code and Codex often, and these days use Codex more because of the usage limits. Wondering if I should go to Claude for a month, I get a strange FOMO when I read vague comments like this.

The one major difference I noticed is that the GPT models are more analytical (e.g. better at mathematical analysis and code review), whereas Claude models tend to write more straightforward code. Besides that I don't really see any significant differences.

There are a few gotchas with swapping, like being careful with AGENTS.md/CLAUDE.md naming (Claude Code only recognizes CLAUDE.md, and I think Codex only works with AGENTS.md), and updating skill files to match the tool.

incentives aren’t aligned

I just symlink AGENTS.md and CLAUDE.md
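A minimal sketch of that symlink approach, assuming AGENTS.md is the file you actually edit and CLAUDE.md should just point at it. The helper name `link_agent_files` is made up for illustration; only `os.symlink` and `pathlib` from the standard library are used.

```python
import os
from pathlib import Path


def link_agent_files(repo: Path, source: str = "AGENTS.md", alias: str = "CLAUDE.md") -> Path:
    """Make `alias` a relative symlink to `source` inside `repo` and return the link path."""
    target = repo / source
    link = repo / alias
    if not target.is_file():
        raise FileNotFoundError(target)
    if link.is_symlink():
        link.unlink()  # replace a stale or dangling link
    elif link.exists():
        raise FileExistsError(f"{link} is a regular file; merge it by hand first")
    # A relative target keeps the link valid if the repo directory is moved.
    os.symlink(source, link)
    return link
```

On Windows, creating symlinks may need Developer Mode or elevated privileges, so a plain copy step might be the safer fallback there.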

I was using gpt-5.5 high. Writing terraform code for GCP, debugging app launch and Dockerfile issues, that sort of thing. It was going in loops hallucinating features of GCP, looking things up in strange ways, running terraform apply after being explicitly told in the last interaction not to, and overall not solving problems. These were very straightforward tasks and it couldn't be trusted for five minutes. It's the difference in what I would trust an early senior engineer to do vs what I would trust an unreliable high school intern to do.

GPT 5.5 xhigh is better than Opus and Sonnet.

Not in my subjective experience sadly

I don’t know why you’re getting downvoted. It’s true. Averaged across a wide variety of benchmarks, Fable is the only Anthropic model that performs better than GPT 5.5 xhigh.

architect-loop

Claude Fable handles planning and review; GPT-5.5 Codex handles implementation and research. Two Claude Code skills wire that split into a repo-centered loop: specs and gates are written first, Codex works in fresh contexts, and Fable reviews the evidence before anything is integrated. It runs on the subscriptions you already have — no API keys required by default.

Install (30 seconds)

git clone https://github.com/DanMcInerney/architect-loop
cd architect-loop && ./install.sh        # Windows: .\install.ps1
npm i -g @openai/codex@latest            # the builder (Codex CLI >= 0.133)

./install.sh --project installs to the current repo only instead of globally. You need Claude Code on any paid plan and the Codex CLI signed into a ChatGPT plan.

Use (two commands)

/architect                                      # the build loop
/architect-research <what you're considering>   # the research loop

/architect runs one work block: judge the last run, spec the next slice, dispatch builders. /architect-research is for when you're still deciding what to build — its cited report feeds the build loop's PRD.

/architect

[Diagram: /architect flow]

One short Fable session per work block — judgment only, it never writes code:

Spec + gates first. Fable specs a one-PR slice, splits it into 1–4 lanes whose file sets are checked for overlap, and commits the acceptance gates to docs/gates/ before any builder starts. Gates are read-only; a builder edit to a gate file fails the slice automatically.
Parallel isolated builders. One fresh codex exec (xhigh) per lane, each in its own git worktree. Builders must argue with the spec before building (silent compliance = defect), build only their declared files, and report raw results — they do not have commit access in the sandbox.
Fable judges and integrates. It runs the gate commands itself (builder claims are hearsay), reads the diff against the spec's intent (passing tests ≠ mergeable work), then commits and merges passing lanes. Judgment happens in a fresh session because the cited evidence favors fresh-context review.
The repo is the only memory. docs/HANDOFF.md (a short table of contents, pruned every session), docs/gates/, docs/lanes/, git history. Not in the repo = didn't happen.
Supervision built in. Liveness checks on dispatched runs, stall triage (diagnose the child process tree, kill the narrowest thing), explicit timeouts on every long command.
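The lane-overlap and gate-tamper checks described above can be sketched as plain set operations. This is an illustrative sketch, not the project's code: the function names, the lane dictionary shape, and the idea of feeding in changed paths from a diff are all assumptions. Only the docs/gates/ location comes from the text above.

```python
from itertools import combinations


def overlapping_lanes(lanes: dict[str, set[str]]) -> dict[tuple[str, str], set[str]]:
    """Return every pair of lanes whose declared file sets intersect."""
    clashes = {}
    for (a, files_a), (b, files_b) in combinations(sorted(lanes.items()), 2):
        shared = files_a & files_b
        if shared:
            clashes[(a, b)] = shared
    return clashes


def lane_violations(declared: set[str], changed: set[str],
                    gate_dir: str = "docs/gates/") -> list[str]:
    """List reasons to fail a lane: edits to gate files or files outside its declared set."""
    problems = []
    for path in sorted(changed):
        if path.startswith(gate_dir):
            problems.append(f"gate tampered: {path}")
        elif path not in declared:
            problems.append(f"outside declared files: {path}")
    return problems
```

In practice the `changed` set would presumably come from something like `git diff --name-only` between the freeze commit and the lane's worktree, with an empty result from both functions being the precondition for running the gate commands.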

/architect-research

[Diagram: /architect-research flow]

Scout-first, like the production deep-research systems — no fixed lane taxonomy:

A cheap Codex scout maps the topic (~10 searches): canonical terminology, the load-bearing systems and papers, the named people, the topic's natural fault lines. Skipped for comparisons and fact-finds.
Fable designs 3–6 topic-specific lanes from the scout's map, drawing per-source-class tactics from a library (academic citation snowballing, dependents-not-stars repo evidence, emerging-vs-hype gating, production pattern mining, expert tracking) — checked for overlap and gaps before dispatch.
Parallel Codex researchers run under hard budgets: search caps, ≤5 subjects per lane, saturation stop, strict findings discipline (URL + date + quote + confidence tag; NOT FOUND beats inference; no recommendations). Expert opinion runs as a second wave, roster-seeded by the first.
Fable verifies and writes. ≥2 independent sources per load-bearing claim, adversarial falsification searches, citations only from URLs actually fetched — then one author writes one decision-oriented report. Gathering parallelizes; synthesis never does.
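The findings discipline and the two-source rule above can be sketched as a small record type plus a filter. Everything here is an assumption made for illustration: the `Finding` fields mirror the "URL + date + quote + confidence tag" list, and "independent" is approximated as "different hostname", which is cruder than what a real verifier would need.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class Finding:
    claim: str
    url: str
    date: str        # ISO date string, e.g. "2025-01-31"
    quote: str
    confidence: str  # free-form tag, e.g. "high", "medium", "low"


def is_well_formed(f: Finding) -> bool:
    """A finding needs a fetchable-looking URL plus a date, a quote and a confidence tag."""
    return all([f.url.startswith(("http://", "https://")), f.date, f.quote, f.confidence])


def supported_claims(findings: list[Finding], min_sources: int = 2) -> set[str]:
    """Claims backed by at least `min_sources` well-formed findings on distinct hosts."""
    hosts: dict[str, set[str]] = {}
    for f in findings:
        if is_well_formed(f):
            hosts.setdefault(f.claim, set()).add(urlparse(f.url).hostname or "")
    return {claim for claim, seen in hosts.items() if len(seen) >= min_sources}
```

Claims that fall outside the returned set would be reported as NOT FOUND rather than inferred, in line with the discipline described above.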

Why this shape

Each design choice is source-backed (full citations in DESIGN.md):

Weak planners hurt more than weak executors — so the architect model does the design, and builders get explicit specs.
Manager + worktree-isolated workers is a well-supported topology for shared-artifact software work; naive shared-file coordination collapses throughput.
Frozen external gates beat trusting the agent — but agents game visible tests and their passing PRs are frequently unmergeable, so the architect also reads the diff.
Memory files rot — so the handoff stays a short map, and detail lives in linked gate/lane files.
The surveyed production deep-research systems use planner-designed decomposition rather than fixed lanes — so research lanes are designed per topic, after a scout pass.

What's in the box

File	What it is
DESIGN.md	The design document — 12 enforced rules, failure-mode table, cited sources
skills/architect/SKILL.md	The architect role: hard rules + procedure
skills/architect/dispatch.md	Verified `codex exec` commands, builder block, worktree fan-out, stall triage
skills/architect/research.md	Slice-scale inline fact-check fan-out
skills/architect/HANDOFF.template.md	The repo-memory file
skills/architect-research/SKILL.md	Research orchestration: scout → design → fan out → verify → write
skills/architect-research/lanes.md	Scout block + source-class tactics library with verified endpoints
tests/validate_skills.py	Repo sanity checks (frontmatter limits, links, fences)

FAQ

Do I need API keys? No. Claude Code runs on your Claude plan; Codex CLI on your ChatGPT plan.

What does a run cost? Builder/researcher runs draw on your ChatGPT plan's 5-hour and weekly quotas; a multi-hour run is a meaningful fraction of a weekly window. Fable's architect sessions are minutes, not hours.

What if a builder wrecks things? Nothing reaches a branch until the architect's tamper, boundary, and gate checks pass — worktrees are discarded and re-dispatched from the freeze commit.

Can I watch a run? Yes — every dispatch prints the builder block, so you can paste it into an interactive codex session with /goal instead.

Why two skills? Research-grade fan-out costs ~15× chat-level tokens — it should be a deliberate act, not a side-effect of the build loop.

Origin

The original idea came from this X post by @jumperz about using Fable with Codex subagents. I built architect-loop because I couldn't find an easy way to run that pattern, and because it seemed useful to add a few extra operational best practices on top of what Fable can already do when calling Codex subagents.

License

MIT

The problem is that there are a bunch of benchmarks, the model providers often don't even use the same benchmarks, a bunch of them have known problems, and it's expensive to do your own benchmarks.

I am a GPT 5.x booster since to me it just feels smarter, and I generally felt like the benchmarks backed me up, but it's not every benchmark, so sadly we're mostly arguing about vibes.

SWEBench-Pro was a big one, though apparently Claude was reading solutions out of the .git folder it wasn't meant to have access to among other problems.

I find it fascinating that every time this kind of discussion comes up, people talk about night and day experiences between Claude and Codex, in both directions. I’m really wondering what people are doing to get such different outcomes.

I’m currently working on two projects/clients, one using Claude, one using Codex. I have a strong preference for the latter, but not because I think it is much more intelligent or writes much better code. It is simply because I find the way of interacting with it more pleasant: it is more literal and mechanical, makes fewer assumptions and/or double-checks them, and is less proactive in my experience. At least until some updates over the last few weeks.

I think I like Codex for the same reason tbh. I think it's just general misanthropy or autism or something lol. Most people seem to prefer Claude.

For me, Codex was visibly smarter than Claude until 4.8 came out; it would regularly do better debugging and, IMO, write better code. I think 4.8 is close.

I think Claude is widely regarded to have a big lead in front-end, which I do not work on.

Claude's Ultrathink is pretty cool, though it eats up tokens like nothing else obviously.