My automated doubt development process

Strangely reminiscent of an Electric Monk:

The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself, video recorders watched tedious television for you, thus saving you the bother of looking at it yourself; Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.

-- Douglas Adams, "Dirk Gently's Holistic Detective Agency"

I think it's common to develop an adversarial-collaborative approach to getting some semblance of quality out of AI. I personally favour using multiple models for different roles, maintaining a set of continuity documentation, and having the plan surface human-verifiable deliverables as soon as feasible. It probably does involve more attention than most people would tolerate.

Most of the writing I see about spec-driven development starts with a product requirements document that is assumed to be valid. But I doubt that's the case. If so, you would've written about it, and probably would've involved agents in the research that goes into it. My gut feeling is that there's much more emphasis on implementing the feature than on questioning whether it's relevant, feasible, and based on valid assumptions.

I've stumbled on the same workflow, except for one thing: if I just do as OP does, Claude Code tends to overengineer. For example, it'll build complex solutions to super rare race conditions that have trivial fallout. But I've found that all it takes is a "skeptical pass". Here's how it goes: after a bunch of specialist subagents review the plan or implementation, and after their findings are deduplicated and synthesized, the main agent buckets them into A) trivial/obvious fix, B) multiple possible resolutions, but the LLM had a strong lean, so it went with it on its own, C) genuine ambiguity, where it asks me what to do (and presents its lean), and D) wontfix. Crucially, after doing this, I have it run a "skeptical pass" where it takes a hard look at these findings and sees whether some of them deserve to be downgraded. Generally, a lot of things make their way into wontfix this way. I find I don't need to push back against overengineering myself; I can have the LLM do it, and it'll actually do a decent job.
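The triage in this comment could be sketched roughly as follows. This is only an illustration: the four bucket names follow the comment, while the `Finding` shape, the `likelihood` and `fallout` fields, and the fixed downgrade rule are assumptions standing in for a judgement the LLM makes itself.

```typescript
// Hypothetical types: only the four buckets come from the comment above.
type Bucket = "trivial" | "strong-lean" | "ambiguous" | "wontfix";

interface Finding {
  summary: string;
  bucket: Bucket;
  likelihood: "rare" | "common"; // assumed field: how often the problem occurs
  fallout: "trivial" | "serious"; // assumed field: what happens when it does
}

// Skeptical pass: downgrade anything that is both rare and low-impact.
// A fixed rule stands in here for the model's own second look.
function skepticalPass(findings: Finding[]): Finding[] {
  return findings.map((f): Finding =>
    f.bucket !== "wontfix" && f.likelihood === "rare" && f.fallout === "trivial"
      ? { ...f, bucket: "wontfix" }
      : f
  );
}

const reviewed: Finding[] = [
  { summary: "Race between two cache refreshes", bucket: "strong-lean", likelihood: "rare", fallout: "trivial" },
  { summary: "Missing null check on config load", bucket: "trivial", likelihood: "common", fallout: "serious" },
  { summary: "Retry policy for uploads", bucket: "ambiguous", likelihood: "common", fallout: "serious" },
];

const afterPass = skepticalPass(reviewed);

// Only genuinely ambiguous findings are escalated to the human.
const needsHuman = afterPass.filter((f) => f.bucket === "ambiguous");

console.log(afterPass.map((f) => `${f.bucket}: ${f.summary}`).join("\n"));
```

The point of the sketch is the ordering: bucketing happens first, and the downgrade step runs over the already-bucketed findings rather than replacing the review.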

This has been my attempt at wrangling the new AI-assisted development that seems to be overtaking the software engineering profession. I jumped headfirst into LLM development after observing the trends of the last year, and it appears this process might be a viable path forward.

Yes, it's called "questioning attitude", one of the traits of a healthy nuclear safety culture (and a good thing to apply in other fields!)

https://www.nrc.gov/docs/ML1433/ML14338A739.pdf

Biased cause I work there, but that’s where software like Tactiq shines. We just added an MCP, and now the agent has access to the meetings when writing the plan.

Last week I had three meetings with three stakeholders, and the agent was able to gather everyone’s ideas and make sure they are all working together in the feature.

There's a selection bias here that nobody's mentioned. The people who had bad experiences with this approach probably aren't commenting.

I’ve had similar feelings: how can I trust this if I no longer write the code directly?

I wrote an /assess tool. I designed it to be token-light, but it assesses everything I could do to regain trust and to help the AI improve my code base, not by adding features but by adding discipline.

This sounds harder than just writing the code.

When you put it like that, it really does, lol.

Most of my energy goes into refining a PRD these days.

Then how come that process is not agentic and not well-described?

Personally it's well-defined and agentic - just not circulated.

/understand - agents interrogate the problem
/huddle - Thinking panel turns it into a PRD - attacks the premise, PRDs regularly die here
/tm - claude-task-master breaks the survivor into a dependency graph

Nobody writes this half up because "agent talked me out of building it" demos worse than "agent built it".

Sorry, I have to ask: how senior are you? The notion that I’d allow an agent to talk me out of something seems weird. In 99% of cases, it’s the other way around. Architecture is just not where they shine.

What’s your process? My experience matches yours, but then again I usually just give a few lines to codex. I imagine if I tried harder to give detailed specs as input, the agent would have a lot more room to spot flaws and kill the plan.

Usually when they push back, it’s for obvious reasons, things I already know and actively decided to ignore. They are trained on mediocre software, and it shows.

I like using voice input a lot, I get way more info out of my brain and into the context that way.

Process-wise, for bug fixes I usually just throw the ticket in, optionally with some thoughts on how to fix it. But if I don’t know the cause, I let it write instrumentation tests until the bug is reproduced, and then the fix is easy.

For new features in brownfield projects, I usually need to align with team members because we’re closely aligned between platforms. We iterate on what you could call a spec, which is just a mix of requirements, magic numbers we want, algorithms we’ve picked (often by vibing prototypes), and sometimes very specific detail on parts that must be done right. E.g., for interfaces with other teams where there’s not yet a document describing them, we put that in the spec as well. We do use agents to shoot holes in those specs, and often they find inconsistencies. But architecturally, they seem to get too caught up in what’s already in those specs, and personally I haven’t seen any worthwhile feedback that I’d have taken up.

Sometimes we use this spec to vibe a first draft. Often the draft is so good that it can be bent to our liking. Sometimes, it just serves as a reservoir of ideas, and the feature must be implemented (with assistance) by re-assembling the pieces differently.

This process originated out of a lack of trust. I lost trust early in my AI-assisted development by allowing our LLM partners to do too much, too quickly, and without the standard engineering practices I had come to internalize. Trust was regained by automating as much doubt as I could muster. What does performing doubt look like? Critiquing the implementation of an artifact, and doing so repeatedly. If you are using AI to write code, specs, docs or any other artifact, you may find this piece useful.

I use subagents quite a bit. They are the fulcrum of the entire process. Each is specialized to audit from a perspective that a standard instance of Claude wouldn't necessarily cover. The core idea in all of this is automated doubt from multiple perspectives and the front-loading of scrutiny. The more parallax coverage in AI development, the better: different vantage points catch different defects, the way two eyes give you depth. The development process goes something like this:

Phase 1 — Design

It starts with an idea or a feature I'd like to build and a specification. As in any good development practice, it's usually wise to start with a spec, PRD, plan, or whatever flavor of design document you prefer. I ask Claude to write the spec, and I spend 2–5 minutes skimming the file to verify that the core implementation aspects of the idea are captured. This is where the iteration process begins.

I start with a Pre-implementation workflow (a slash command in Claude Code), which consists of three agents performing the first round of doubt: Pre-Implementation Architect, Documentation Validator and Assumption Excavator. These agents assess design quality, scope and completeness, and surface documentation gaps and the hidden assumptions that exist in the spec. The main terminal agent folds all relevant findings into the spec, usually 10–25 depending on the scope of the idea.
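As a rough sketch of the shape of this step (not the actual slash command, which drives LLM subagents), the workflow amounts to running each reviewer over the spec, collecting findings, and folding them back into the document. The `Agent` and `Finding` types, the stub review rules, and the "Open findings" section are all assumptions made for illustration.

```typescript
// Hypothetical shapes: real subagents are LLM prompts, not functions.
interface Finding {
  agent: string;
  note: string;
}

interface Agent {
  name: string;
  review: (spec: string) => string[]; // returns notes about the spec
}

// Stub reviewers standing in for the three pre-implementation agents.
const preImplementation: Agent[] = [
  {
    name: "Pre-Implementation Architect",
    review: (spec) => (spec.length > 2000 ? ["Scope looks large; consider splitting"] : []),
  },
  {
    name: "Documentation Validator",
    review: (spec) => (spec.indexOf("Usage") === -1 ? ["No usage section"] : []),
  },
  {
    name: "Assumption Excavator",
    review: (spec) =>
      spec.indexOf("API") !== -1 ? ["Assumes the API already returns the needed fields"] : [],
  },
];

// Run every agent over the same spec and tag each note with its source.
function runWorkflow(spec: string, agents: Agent[]): Finding[] {
  const findings: Finding[] = [];
  for (const agent of agents) {
    for (const note of agent.review(spec)) {
      findings.push({ agent: agent.name, note });
    }
  }
  return findings;
}

// Fold findings into the spec so the next round reviews the updated text.
function foldIntoSpec(spec: string, findings: Finding[]): string {
  if (findings.length === 0) return spec;
  const lines = findings.map((f) => `- [${f.agent}] ${f.note}`);
  return `${spec}\n\nOpen findings:\n${lines.join("\n")}`;
}

const draft = "Feature: show run history using the stats API.";
const findings = runWorkflow(draft, preImplementation);
const revised = foldIntoSpec(draft, findings);
console.log(revised);
```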

Example findings:

Assumption Excavator: "executionStatsSchema in registry-sdk returns {totalCount, recentCount, windowMinutes}. Spec assumes {avgScore, medianDurationMs, passRate, lastRunDate, lastRunScore}. Entire history section unbuildable without new API endpoint"

Pre-Implementation Architect: "HarnessProfile embeds mcp.read/merge/remove/write methods alongside path config. Consider extracting McpConfigStrategy to separate concerns. Each harness file will grow to 80–120 lines otherwise."
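The Assumption Excavator finding above boils down to comparing the fields a spec assumes against the fields a schema actually returns. A minimal sketch of that comparison, using the field names quoted in the finding; the `missingFields` helper is hypothetical and is not part of registry-sdk.

```typescript
// Field names are taken from the quoted finding; the helper is illustrative.
const actualFields = ["totalCount", "recentCount", "windowMinutes"];
const assumedFields = ["avgScore", "medianDurationMs", "passRate", "lastRunDate", "lastRunScore"];

// Return every field the spec relies on that the schema does not provide.
function missingFields(assumed: string[], actual: string[]): string[] {
  return assumed.filter((field) => actual.indexOf(field) === -1);
}

const missing = missingFields(assumedFields, actualFields);
console.log(`${missing.length} of ${assumedFields.length} assumed fields are missing`);
```

Here every assumed field is absent, which is why the finding calls the whole history section unbuildable without a new endpoint.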

The scope determines the number of iterations I make. If the scope calls for it, the iteration continues with the next set of agents: Gap Analyzer, Implied Completeness Detector and Ambiguity Mapper. These agents in particular are excellent at finding the omitted aspects of the system that would be missed if left unaddressed. When gaps are discovered, they are added to the spec.

Example findings:

Gap Analyst: "McpConfigStrategy defines read/merge/write but does not specify behavior for malformed input, permission denied, partial write failure, or file locking. Destructive operation on user config files across 4 harnesses in 3 formats."

Implied Completeness Detector: "Manifest records version at root but installation state per-harness. When v0.3.0 user (Claude Code) runs v0.4.0 with --harness opencode, behavior undefined. Per-harness versioning or upgrade reconciliation not addressed."

For practical use:

Small scope: Pre-implementation only
Medium scope: Pre-implementation with Gap, Implied, Ambiguity
Large scope: Full sweep with multiple runs with each, occasionally dipping into other specialized agents
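The scope tiers above can be written down as a small lookup. This is a sketch under assumptions: the agent names come from the post, but the `ReviewPlan` shape and the `minRuns` values are invented here, since the post only says "multiple runs" for large scope and does not list the other specialized agents.

```typescript
type Scope = "small" | "medium" | "large";

interface ReviewPlan {
  agents: string[];
  minRuns: number; // assumption: the post gives no exact run counts
}

const PRE_IMPLEMENTATION = [
  "Pre-Implementation Architect",
  "Documentation Validator",
  "Assumption Excavator",
];

const GAP_ROUND = ["Gap Analyzer", "Implied Completeness Detector", "Ambiguity Mapper"];

function planForScope(scope: Scope): ReviewPlan {
  switch (scope) {
    case "small":
      return { agents: PRE_IMPLEMENTATION.slice(), minRuns: 1 };
    case "medium":
      return { agents: PRE_IMPLEMENTATION.concat(GAP_ROUND), minRuns: 1 };
    case "large":
      // Full sweep: the same six agents, run more than once. Any additional
      // specialized agents would be chosen by hand and are not modelled here.
      return { agents: PRE_IMPLEMENTATION.concat(GAP_ROUND), minRuns: 2 };
  }
}

console.log(planForScope("medium").agents.join(", "));
```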

Now I pause and spend roughly 15–60 minutes reading the spec. If everything checks out and the spec is ready for development, I ask Claude to generate a companion checklist that we can update and follow along with. The checklist is created as a separate file and helps if you need to step away and close out a session mid-development.

Phase 2 — Development

Claude pulls up the spec and checklist and begins development. If I'm picking the spec up with the development partly complete in a new session, I usually ask Claude to explore, or send a Chain Tracer or Deep Explore subagent for the complete picture prior to restarting.

One aspect of my development process that might stand out, and that I would like to highlight: I don't use subagents for writes. This comes back to the trust angle. My experiences of spawning subagents for writes went awry often enough, causing more harm than good, that I drew a temporary line in the sand. I say temporary because this will undoubtedly change. As I understand it, there are methods for proper swarm orchestration, worktrees and agent-to-spec driven development, but that's a bit beyond my trust level for now. Sometimes the Claude terminal agent will spawn subagents for bulk updates, but I prefer a single Claude Code terminal instance building out the project.

I work through all phases of the specification until complete and verify that the build works. Then comes the post-implementation development process. I mentioned automated doubt, and this is where it shines. The next several iterations involve running a Post-Implementation workflow consisting of the following subagents: Code Validator, Type Safety Validator, Test Architect, Code Optimizer, Public Interface Validator and Security Analyst. These agents audit the codebase and report findings on code and testing quality, security posture, duplication, performance considerations, semantic or structural integrity, documentation, the public interface, and so on. The first run usually generates 15–35 findings (depending on the scope), usually with the first 15–20 flagged as critical or high severity. I address these findings and re-run the Post-Implementation workflow, then tackle the next set of issues, then the next, and so on until I've reached my idea of what quality ought to look like.

Example findings:

Code Validator: "Every other execution method calls trackIfEnabled() after completion. startPipeline() returns PipelineHandle directly without tracking. Async pipeline users get no tracking data."

Security Analyst: "PreflightError includes shellQuote-expanded target path verbatim. Error messages containing resolved filesystem paths may propagate to tracking API and dashboard."
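The audit, fix, re-run cycle in this phase can be pictured as a bounded loop. The sketch below is an assumption-heavy simplification: the stopping rule (no critical or high findings left) and the `maxRounds` cap are illustrative choices, whereas the post leaves the stopping point to the operator's judgement, and `stubAudit` returns canned data in place of real agent output.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

interface Finding {
  severity: Severity;
  note: string;
}

// Hypothetical audit function: takes the round number, returns findings.
type Audit = (round: number) => Finding[];

function iterateUntilAcceptable(
  audit: Audit,
  maxRounds: number
): { rounds: number; remaining: Finding[] } {
  let remaining: Finding[] = [];
  for (let round = 1; round <= maxRounds; round++) {
    remaining = audit(round);
    const blocking = remaining.filter(
      (f) => f.severity === "critical" || f.severity === "high"
    );
    if (blocking.length === 0) return { rounds: round, remaining };
    // In the real process the blocking findings are fixed here,
    // and the workflow is run again on the updated code.
  }
  return { rounds: maxRounds, remaining };
}

// Canned results: severity drops as earlier findings get addressed.
const stubAudit: Audit = (round) =>
  round === 1
    ? [
        { severity: "critical", note: "Untracked async pipeline" },
        { severity: "low", note: "Naming nit" },
      ]
    : round === 2
    ? [{ severity: "high", note: "Path leaked in error message" }]
    : [{ severity: "low", note: "Naming nit" }];

const result = iterateUntilAcceptable(stubAudit, 5);
console.log(`stopped after ${result.rounds} rounds with ${result.remaining.length} finding(s) left`);
```

Low-severity findings that remain at the end are the "preference territory" the post mentions: they are reported but do not keep the loop going.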

Phase 3 — Wrap-up and Ship

Once I'm satisfied with what I'm ready to release, and everything checks out in both practical and quality terms, I run the final workflow: Ship. This workflow consists of the following agents: Code Validator, Type Safety Validator, Test Architect, Code Auditor, Public Interface Validator, Security Analyst, Anxiety Reader, API Contract Validator (if there is an API) and Release Readiness Validator. It finalizes the iterative process tackled in the previous phase. Five of the nine agents were also in the post-implementation workflow, so they should be finding very little or entering preference territory. The others check the API contract (if relevant), runtime consistency, what could break, and the release posture of the system. When this runs, the question is: is this ready for release? Depending on the complexity, this may require two or more iterations of Ship.
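The overlap between the two workflows can be checked directly from the agent lists given in the post. The snippet below only restates those lists; the variable names are illustrative.

```typescript
// Agent lists as named in the post for each workflow.
const postImplementation = [
  "Code Validator",
  "Type Safety Validator",
  "Test Architect",
  "Code Optimizer",
  "Public Interface Validator",
  "Security Analyst",
];

const ship = [
  "Code Validator",
  "Type Safety Validator",
  "Test Architect",
  "Code Auditor",
  "Public Interface Validator",
  "Security Analyst",
  "Anxiety Reader",
  "API Contract Validator",
  "Release Readiness Validator",
];

// Agents that already ran during post-implementation versus new perspectives.
const shared = ship.filter((agent) => postImplementation.indexOf(agent) !== -1);
const shipOnly = ship.filter((agent) => postImplementation.indexOf(agent) === -1);

console.log(`${shared.length}/${ship.length} Ship agents also ran post-implementation`);
console.log(`New at Ship: ${shipOnly.join(", ")}`);
```

Code Optimizer appears only in the post-implementation list and Code Auditor only in Ship, so the two are counted as different agents here.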

Example findings:

Anxiety Reader: "Promise.allSettled fires all agents simultaneously with no concurrency limit, risking resource exhaustion and API rate limits."

Code Auditor: "File I/O errors in writeReportFiles caught by handleCoreError which gives SDK-specific hints instead of filesystem-specific messaging."
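One common way to address a finding like the Anxiety Reader's is to cap how many tasks run at once while still collecting every result, success or failure. The sketch below shows that general technique and is not the fix applied in the author's codebase: `allSettledWithLimit`, `makeAgentCall` and the limit of 2 are hypothetical, and the settled-result shape simply mirrors what `Promise.allSettled` returns.

```typescript
type Task<T> = () => Promise<T>;

type Settled<T> =
  | { status: "fulfilled"; value: T }
  | { status: "rejected"; reason: unknown };

// Run tasks with at most `limit` in flight; never rejects, like allSettled.
async function allSettledWithLimit<T>(tasks: Task<T>[], limit: number): Promise<Settled<T>[]> {
  const results: Settled<T>[] = new Array(tasks.length);
  let next = 0; // index of the next task to claim

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claimed synchronously, so no two workers share a task
      try {
        results[index] = { status: "fulfilled", value: await tasks[index]() };
      } catch (reason) {
        results[index] = { status: "rejected", reason };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return results;
}

// Demo: fake agent calls that record how many run at the same time.
let active = 0;
let peak = 0;

const makeAgentCall = (name: string, fail = false): Task<string> => async () => {
  active++;
  peak = Math.max(peak, active);
  await new Promise<void>((resolve) => setTimeout(resolve, 5));
  active--;
  if (fail) throw new Error(`${name} failed`);
  return `${name} done`;
};

const demo = allSettledWithLimit(
  [makeAgentCall("a"), makeAgentCall("b", true), makeAgentCall("c"), makeAgentCall("d"), makeAgentCall("e")],
  2
);

demo.then((results) => {
  console.log(results.map((r) => r.status).join(", "), "| peak concurrency:", peak);
});
```

Results stay in input order regardless of completion order, which keeps downstream reporting per agent straightforward.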

Conclusion

On the philosophical end, this is the negotiation between the artifacts, the agents and the operator, and it is where the idea of quality converges. We all have an idea of what quality means to us; even the agents themselves have ideas of what both quantifies and qualifies as quality. This is the agreement we make with ourselves and the agents: what constitutes readiness. The foundation of it all is the idea that we are aiming for some form of consistency, usability, readability and maintainability and, underneath those, something we can be more confident in. Quality can be a subjective state with objective goals. I iterate until those ideas converge. How do you know when to terminate the loop? I'd like to think it's intuitive: a combination of patience, practice, judgement and your expertise in asking the right questions. Is the juice worth the squeeze for this next fix or feature? It comes back to your personal thresholds for whatever state of the project you are ready to release. The artist is never finished; is the engineer? It ultimately comes down to the operator. The good thing about versioning is that you can always add, subtract or modify in some manner, and how that quality manifests is derived from preference and the artifact's trajectory.

One consideration of the method, and one I can state with confidence: this process is not necessarily cheap on tokens. For those of us who have spent countless hours burning through tokens and hitting usage limits, this can play a major role in how we develop with AI. For some projects, this process is absolutely overkill; for others, it's simply not enough and requires appending an entirely different set of agents to audit. My personal inclination is to run this process and run it repeatedly. I'd like to ensure the code I am developing with Claude or any other AI system can be verified, validated and, ideally, meet a higher standard. Some projects may require nothing more than a Code Validator and Test Architect for review; others involve 40+ agents from multiple perspectives. If there is one agent worth trying on any artifact (codebase, spec, docs, etc.), it's the Assumption Excavator, as it is near-universally applicable.

This process originated out of a lack of trust and has developed into a trust signal.

The agents, commands, and pipelines referenced in this post are available at github.com/aself101/agents-and-pipelines.

Hacker Times
