> Each rule below is enforced mechanically by the skill, not left to vibes.
> R1. Repo docs are the memory; not in HANDOFF.md = didn't happen
SKILL.md:
> Not in docs/HANDOFF.md = didn't happen. Refuse to judge results that exist only in conversation or builder chat output.
"Mechnical enforcement" just means "prompting the LLM a bit extra" these days? It (still) amazes me how much effort and tokens we expend on what could and should be a two line script...
You can use any agent and/or model for each step and share context between them.
> I am fairly convinced this is the shape serious agent work keeps converging toward.
"this" being "plan with expensive model, implement with cheap model".
Anyone who follows HN would be hard-pressed to disagree; this architecture is re-invented twice monthly.
https://www.facebook.com/groups/vibecodinglife/posts/1946207... https://github.com/openai/codex/discussions/10628 https://build5nines.com/stop-burning-premium-requests-how-to...
> Not because it is aesthetically pleasing. Because every other shape eventually runs into the same boring failures: context rot, self-grading, goalpost drift, and merge chaos.
Actual failure isn't boring. But struggling through a generated software project that celebrates its own genius and doesn't have a single self-critical or genuinely reflective thing to say...at least watching paint dry I might get giddy off the fumes.
I'm not interested in critiquing the project itself, either; you'll just run that through a model, too.
This project is meant to be the latter, but there’s not a clean way to integrate that into Claude Code or Codex because they expect to do both.
Pi can do it, but then your users can’t use their Claude subscriptions, so you have to kludgily try to do the same thing via LLM prompts.
Even if one discovered something that millions (billions?) of dollars of AI compute and the best statisticians in the world were not able to find via exhaustive research, domain search and training... what do you think are the chances this won't be folded into the next update of every model, making the rigmarole moot?
Extraordinary claims require extraordinary evidence, and technology-shattering innovations in AI are not known to come from a markdown file.
wow linking a facebook groups post might actually be worse than x, is there an xcancel alternative for facebook?
The one major difference I noticed is that the GPT models are more analytical (e.g. better at mathematical analysis, code review), whereas Claude models tend to write more straightforward code. Besides that I don't really see any significant differences.
There are a few gotchas with swapping, like being careful with AGENTS.md/CLAUDE.md naming (Claude Code only recognizes CLAUDE.md, and I think Codex only works with AGENTS.md), and updating skill files to match the tool.
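One way to sidestep the naming gotcha is to keep both files present. The following is a minimal sketch, assuming the two tools read exactly the file names mentioned above and that a plain copy is acceptable; `mirror_instructions` is a hypothetical helper, not part of either tool.

```python
import shutil
from pathlib import Path

# File names as described in the comment above (assumed, not verified here).
NAMES = ("AGENTS.md", "CLAUDE.md")


def mirror_instructions(repo_root):
    """Copy whichever instruction file exists to the missing name.

    Returns the path that was created, or None if nothing was done
    (both files already present, or neither).
    """
    root = Path(repo_root)
    present = [n for n in NAMES if (root / n).is_file()]
    if len(present) != 1:
        return None
    source = root / present[0]
    target = root / (NAMES[1] if present[0] == NAMES[0] else NAMES[0])
    shutil.copyfile(source, target)
    return target
```

A copy can drift from its source over time; a symlink would avoid that where the platform and tooling allow it.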
I exclusively use sonnet and advisor is basically “hey opus chime in on my approach”. been working great as far as i can tell.
Claude Fable handles planning and review; GPT-5.5 Codex handles implementation and research. Two Claude Code skills wire that split into a repo-centered loop: specs and gates are written first, Codex works in fresh contexts, and Fable reviews the evidence before anything is integrated. It runs on the subscriptions you already have — no API keys required by default.
git clone https://github.com/DanMcInerney/architect-loop
cd architect-loop && ./install.sh # Windows: .\install.ps1
npm i -g @openai/codex@latest # the builder (Codex CLI >= 0.133)
./install.sh --project installs to the current repo only instead of
globally. You need Claude Code on any paid
plan and the Codex CLI signed into a ChatGPT plan.
/architect # the build loop
/architect-research <what you're considering> # the research loop
/architect runs one work block: judge the last run, spec the next slice,
dispatch builders. /architect-research is for when you're still deciding
what to build — its cited report feeds the build loop's PRD.

One short Fable session per work block — judgment only, it never writes code:
docs/gates/ before any builder starts. Gates are read-only; a builder
edit to a gate file fails the slice automatically.

codex exec (xhigh) per lane, each in its own git worktree. Builders must
argue with the spec before building (silent compliance = defect), build
only their declared files, and report raw results; they do not have
commit access in the sandbox.

docs/HANDOFF.md (a short table of contents, pruned every session),
docs/gates/, docs/lanes/, git history. Not in the repo = didn't happen.
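The read-only gate rule and the declared-files rule lend themselves to a scripted check. Below is a minimal sketch, assuming the list of changed paths is already available (for example from `git diff --name-only` against the freeze commit); `slice_violations` and its labels are hypothetical and not the project's actual implementation.

```python
from pathlib import PurePosixPath


def slice_violations(changed, declared, gate_dir="docs/gates"):
    """Return human-readable violations for one builder's changed files.

    changed:  repo-relative paths the builder touched
    declared: repo-relative paths the builder said it would touch
    gate_dir: directory treated as read-only (assumed layout)
    """
    gate = PurePosixPath(gate_dir)
    allowed = {PurePosixPath(p) for p in declared}
    violations = []
    for raw in changed:
        path = PurePosixPath(raw)
        if path == gate or gate in path.parents:
            # Any edit under the gate directory counts as tampering.
            violations.append(f"tamper: {raw}")
        elif path not in allowed:
            # Touched a file outside the declared set.
            violations.append(f"boundary: {raw}")
    return violations
```

An empty list means the slice passes both checks; any entry would be grounds to discard the worktree.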
Scout-first, like the production deep-research systems — no fixed lane taxonomy:
Each design choice is source-backed (full citations in DESIGN.md):
| File | What it is |
|---|---|
| DESIGN.md | The design document — 12 enforced rules, failure-mode table, cited sources |
| skills/architect/SKILL.md | The architect role: hard rules + procedure |
| skills/architect/dispatch.md | Verified codex exec commands, builder block, worktree fan-out, stall triage |
| skills/architect/research.md | Slice-scale inline fact-check fan-out |
| skills/architect/HANDOFF.template.md | The repo-memory file |
| skills/architect-research/SKILL.md | Research orchestration: scout → design → fan out → verify → write |
| skills/architect-research/lanes.md | Scout block + source-class tactics library with verified endpoints |
| tests/validate_skills.py | Repo sanity checks (frontmatter limits, links, fences) |
Do I need API keys? No. Claude Code runs on your Claude plan; Codex CLI on your ChatGPT plan.
What does a run cost? Builder/researcher runs draw on your ChatGPT plan's 5-hour and weekly quotas; a multi-hour run is a meaningful fraction of a weekly window. Fable's architect sessions are minutes, not hours.
What if a builder wrecks things? Nothing reaches a branch until the architect's tamper, boundary, and gate checks pass — worktrees are discarded and re-dispatched from the freeze commit.
Can I watch a run? Yes — every dispatch prints the builder block, so you
can paste it into an interactive codex session with /goal instead.
Why two skills? Research-grade fan-out costs ~15× chat-level tokens — it should be a deliberate act, not a side-effect of the build loop.
The original idea came from this X post by @jumperz about using Fable with Codex subagents. I built architect-loop because I couldn't find an easy way to run that pattern, and because it seemed useful to add a few extra operational best practices on top of what Fable can already do when calling Codex subagents.
MIT
I was using gpt-5.5 high. Writing terraform code for GCP, debugging app launch and Dockerfile issues, that sort of thing. It was going in loops hallucinating features of GCP, looking things up in strange ways, running terraform apply after being explicitly told in the last interaction not to, and overall not solving problems. These were very straightforward tasks and it couldn't be trusted for five minutes. It's the difference in what I would trust an early senior engineer to do vs what I would trust an unreliable high school intern to do.
I am a GPT 5.x booster since to me it just feels smarter, and I generally felt like the benchmarks backed me up, but it's not every benchmark, so sadly we're mostly arguing about vibes.
SWEBench-Pro was a big one, though apparently Claude was reading solutions out of the .git folder it wasn't meant to have access to among other problems.
I’m currently working on two projects/clients, one using Claude, one using Codex. I have a strong preference for the latter, but not because I think it is much more intelligent or writes much better code. It is simply because I find the way of interacting with it more pleasant: in my experience it is more literal and mechanical, makes fewer assumptions and/or double-checks them, and is less proactive. At least until some updates over the last few weeks.
For me, Codex was visibly smarter than Claude until 4.8 came out: it would regularly do better debugging and, IMO, write better code. I think 4.8 is close.
I think Claude is widely regarded to have a big lead in front-end, which I do not work on.
Claude's Ultrathink is pretty cool, though it eats up tokens like nothing else obviously.