The Electric Monk was a labor-saving device, like a dishwasher or a video recorder. Dishwashers washed tedious dishes for you, thus saving you the bother of washing them yourself, video recorders watched tedious television for you, thus saving you the bother of looking at it yourself; Electric Monks believed things for you, thus saving you what was becoming an increasingly onerous task, that of believing all the things the world expected you to believe.
-- Douglas Adams, "Dirk Gently's Holistic Detective Agency"
Last week I had three meetings with three stakeholders, and the agent was able to gather everyone’s ideas and make sure they are all working together in the feature.
I wrote an /assess tool. I designed it to be token light but assesses on everything I could do to regain trust and help AI to improve my code base not by add features but by adding discipline.
/understand - agents interrogate the problem /huddle - Thinking panel turns it into a PRD - attacks the premise, PRDs regularly die here /tm - claude-task-master breaks the survivor into a dependency graph
Nobody writes this half up because "agent talked me out of building it" demos worse than "agent built it".
I like using voice input a lot, I get way more info out of my brain and into the context that way.
Process wise, for bug fixes I usually just throw the ticket in and optionally some thoughts on how to fix. But if I don’t know the cause, I let it write instrumentation tests until the bug is reproed, and then the fix is easy.
For new features in brownfield projects, I usually need to align with team members because we‘re closely aligned between platforms. We iterate on what you could call a spec, which is just a mix of requirements, magic numbers we want, algorithms we‘ve picked (often by vibing prototypes), and sometimes going very specific on parts that must be done right. Eg for interfaces with other teams, and there’s not yet a document to describe that, we put that in the spec as well. We do use agents to shoot holes in those specs, and often they find inconsistencies. But architecturally, they seem to get caught up too much in what’s already in those specs, and personally I haven’t seen any worthwhile feedback that I‘d have taken up.
Sometimes we use this spec to vibe a first draft. Often the draft is so good that it can be bent to our liking. Sometimes, it just serves as a reservoir of ideas, and the feature must be implemented (with assistance) by re-assembling the pieces differently.
This process originated out of a lack of trust. I lost trust early in my AI-assisted development due to allowing our LLM partners to do too much, too quickly and without the standard engineering practices I had come to internalize. Trust was regained by automating as much doubt as I could muster. What does performing doubt look like? Critiquing the implementation of an artifact and doing so, repeatedly. If you are using AI to write code, specs, docs or any artifact, you may find this piece useful.
I use subagents, quite a bit. They inhabit the fulcrum of the entire process. They are specialized in ways that audit perspectival surfaces a standard instantiation of Claude wouldn't necessarily cover. The core idea in all of this is automated doubt from multiple perspectives and the front-loading of scrutiny. The more parallax coverage in AI development, the better; where different vantage points catch different defects, the way two eyes give you depth. The development process goes something like this:
It starts with an idea or a feature I'd like to build and a specification. Like any good development practice, it's usually wise to start with a spec, PRD, plan, or whatever flavor of design preferred. I ask Claude to write the spec and I spend 2–5 minutes skimming the file to verify the core implementation aspects of the idea are captured. This is where the iteration process begins.
I start with a Pre-implementation workflow (slash command in Claude Code), which consists of three agents performing the first round of doubt: Pre-Implementation Architect, Documentation Validator and Assumption Excavator. These agents do several things: verify design quality, scope assessment, completeness, documentation gaps and all the hidden assumptions that exist in the spec. All relevant findings discovered are folded into the spec by the main terminal agent — usually 10–25 depending on the scope of the idea.
Example findings:
Assumption Excavator: "executionStatsSchema in registry-sdk returns {totalCount, recentCount, windowMinutes}. Spec assumes {avgScore, medianDurationMs, passRate, lastRunDate, lastRunScore}. Entire history section unbuildable without new API endpoint"
Pre-Implementation Architect: "HarnessProfile embeds mcp.read/merge/remove/write methods alongside path config. Consider extracting McpConfigStrategy to separate concerns. Each harness file will grow to 80–120 lines otherwise."
The scope determines the amount of iterations I make. If the scope calls for it, the iteration continues with the next set of agents: Gap Analyzer, Implied Completeness Detector, Ambiguity Mapper. These agents in particular are excellent at finding all the omitted aspects of the system that will be missed if left unaddressed. When the gaps are discovered, they are added to the spec.
Example findings:
Gap Analyst: "McpConfigStrategy defines read/merge/write but does not specify behavior for malformed input, permission denied, partial write failure, or file locking. Destructive operation on user config files across 4 harnesses in 3 formats."
Implied Completeness Detector: "Manifest records version at root but installation state per-harness. When v0.3.0 user (Claude Code) runs v0.4.0 with --harness opencode, behavior undefined. Per-harness versioning or upgrade reconciliation not addressed."
For practical use:
Now I pause and spend some time to read the spec, ~15–60 min. If everything checks out and the spec is ready for development, I ask Claude to generate a companion checklist that we can update and follow along. The checklist is created as a separate file and helps if you need to step away and close out a session mid-dev.
Claude pulls up the spec and checklist and begins development. If I'm picking the spec up with the development partly complete in a new session, I usually ask Claude to explore, or send a Chain Tracer or Deep Explore subagent for the complete picture prior to restarting.
One aspect of my development process that might stand out and that I would like to highlight: I don't use subagents for writes. This comes back to the trust angle. My experiences of spawning subagents for writes gone awry, often causing more harm than good, led to a temporary line drawn in the sand. I also say temporary, because this will undoubtedly change. As I understand it, there are methods for proper swarm orchestration, worktrees, agent-to-spec driven dev, but that's a bit beyond my trust level now. Sometimes the Claude terminal agent will spawn them for bulk updates, but I prefer a single Claude Code terminal instance building out the project.
I tackle all phases of the specification until complete. Verify the build works, and then comes the post-implementation development process. I mentioned automated doubt and this is where it shines. The next several iterations of the development process involve running a Post-Implementation workflow consisting of the following subagents: Code Validator, Type Safety Validator, Test Architect, Code Optimizer, Public Interface Validator and Security Analyst. These agents audit the codebase and provide findings: code & testing quality, security posture, duplication, performance considerations, semantic or structural integrity, documentation, the public interface, etc. The first run usually generates (depending on the scope) 15–35 findings, usually with the first 15–20 findings flagged as critical or high severity. These findings are addressed and I re-run the Post-implementation workflow. Then tackle the next set of issues, then the next and so on until I've reached my idea of what quality ought to look like.
Example findings:
Code Validator: "Every other execution method calls trackIfEnabled() after completion. startPipeline() returns PipelineHandle directly without tracking. Async pipeline users get no tracking data."
Security Analyst: "PreflightError includes shellQuote-expanded target path verbatim. Error messages containing resolved filesystem paths may propagate to tracking API and dashboard."
Once I've satisfied my preference for what I'm ready to release and everything checks out both in a practical and quality manner, I then run the final workflow: Ship. This workflow consists of the following agents: Code Validator, Type Safety Validator, Test Architect, Code Auditor, Public Interface Validator, Security Analyst, Anxiety Reader, API Contract Validator (if API), Release Readiness Validator. This workflow finalizes the iterative process tackled in the previous phase. 5/9 agents were all in the post-implementation workflow, so they should be finding very little or entering preference territory, the others are checking the API contract (if relevant), runtime consistency, what could break and the release posture of the system. When this runs, the question is: is this ready for release? Depending on the complexity, this may require 2+ iterations of Ship.
Example findings:
Anxiety Reader: "Promise.allSettled fires all agents simultaneously with no concurrency limit, risking resource exhaustion and API rate limits."
Code Auditor: "File I/O errors in writeReportFiles caught by handleCoreError which gives SDK-specific hints instead of filesystem-specific messaging."
On the philosophical end, this is the negotiation between the artifacts, the agents and the operator and where the idea of quality converges. We all have an idea of what quality means to us, even the agents themselves have ideas of what both quantifies and qualifies as quality. This is the agreement we make with ourselves and the agents: what constitutes readiness. The foundation of it all is the idea that we are aiming for some form of consistency, usability, readability, maintainability — and underneath those, something we can be more confident in. Quality can be a subjective state, with objective goals. I iterate until those ideas converge. How do you know when to terminate the loop? I'd like to think it's intuitive: the combination of patience, practice, judgement and your expertise in asking the right questions. Is the juice worth the squeeze for this next fix or feature? It comes back to the personal thresholds for whatever state of the project you are ready to release. The artist is never finished, is the engineer? It ultimately comes down to the operator. The good thing about versioning, is that you can always add, subtract or modify in some manner and how that quality manifests is derived from preference and the artifact's trajectory.
One consideration of the method, and one I can state with confidence: this process is not necessarily cheap on the tokens. For those of us who have spent countless hours burning through tokens and hitting usage limits, this can play a major role in how we develop with AI. For some projects, this process is absolutely overkill, and for others, it's simply not enough and requires appending an entirely different set of agents to audit. My personal inclination is to run this process and run it repeatedly. I'd like to ensure the code I am developing with Claude or any other AI system can be verified, validated and ideally, meet a higher standard. Some projects may require nothing more than a Code Validator and Test Architect for review, others involve 40+ agents from multiple perspectives. If there is at least one agent that should be tried out on any artifact — codebase, spec, docs, etc — it's the Assumption Excavator, as it is near universally applicable.
This process originated out of a lack of trust and has developed into a trust signal.
The agents, commands, and pipelines referenced in this post are available at github.com/aself101/agents-and-pipelines.