I've used the analogy of a circular saw before ("it's not really sawing, because you can't feel the wood...")

It's easy to just slab on a Skil saw, but through the beam and it'll be somewhat straight. But when every manual stroke counts, there's enough time on a human time scale to correct every little mistake. It's definitely possible to become skilled at using the circular saw, but it takes effort that it feels like you don't need at first.

This is similar. LLMs are so powerful for writing code that it's easy to become complacent and forget your role as the engineer using the tool: guaranteeing correctness, security, safety and performance of the end result. When you're not invested in every if-statement, forgetting to check edge cases is really easy to do. And as much as I like Claude writing test cases for me, I also have to ensure the coverage is decent, that the implicit assumptions made about external library code is correct, etc. It takes a lot of effort to do it right. I don't know why Mycelium thinks they invented interfaces for module boundaries, but I'm pretty sure they are still as suceptible to that "0" not behaving as you'd expect, or the empty string being interpreted as "missing." Or the CSG algorithm working, except if your hole edges are incident with some boundary edges.

I have no idea what am I reading

> Mycelium structures applications as directed graphs of pure data transformations. Each node (cell) has explicit input/output schemas. Cells are developed and tested in complete isolation, then composed into workflows that are validated at compile time. Routing between cells is determined by dispatch predicates defined at the workflow level — handlers compute data, the graph decides where it goes.

No still don't understand

> Mycelium uses Maestro state machines and Malli contracts to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement.

Nope, still don't

How is that different compared to a human dev that would miss bugs?

It's starting to become obvious that if you can't effectively use AI to build systems it is a skill issue. At the moment it is a mysterious, occasionally fickle, tool - but if you provide the correct feedback mechanisms and provide small tweaks and context at idiosyncrasies, it's possible to get agents to reliably build very complex systems.

I've used the analogy of a circular saw before ("it's not really sawing, because you can't feel the wood...")

Your analogy with a Skil saw is genius! You can cut much faster but it's also much more dangerous. Just like the AI indeed.

How is that different compared to a human dev that would miss bugs?

The people using these programs are generating massive amounts of code, and you won't convince me they're actually carefully understanding most of it, if they even read it all. And it bypasses the first verification step where you are actively typing in the code that will be run.

The ability to plan long term and anticipate flow on errors.

I didn't downvote you, but I've noticed recently a trend on HN that just asking a normal question (possibly to start a discussion) gets downvoted for no apparent reason and without any explanation. Not good etiquette, in my opinion.

Yes, this is simply 'technical debt'.

They should try to fix technical debt before going to the next round. Of course Claude can probably also do this.

I have no idea what am I reading

No still don't understand

> Mycelium uses Maestro state machines and Malli contracts to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement.

Nope, still don't

I don't understand why the poster (which is the author) links us to a slop report of a test for their library. It would be much more effective to cover part of this info into the README where we get the context of what they want to achieve (where there is a very clear "Why?" section), and then link to it instead. I have flagged it as AI slop.

Yeah it reads very Time Cube...

The top-level README gives a bit better idea. Armed with that the explanation might sound a bit more understandable.

I'm not familiar with the project (or Clojure), but let me try to explain!

> Mycelium structures applications as directed graphs of pure data transformations.

There is a graph that describes how the data flows in the system. `fn(x) -> x + 1` in a hypothetical language would be a node that takes in a value and outputs a value. The graph would then arrange that function to be called as a result of a previous node computing the parameter x for it.

> Each node (cell) has explicit input/output schemas.

Input and output of a node must comply to a defined schema, which I presume is checked at runtime, as Clojure is a dynamically typed language. So functions (aka nodes) have input and output types and presumably they should try to be pure. My guess is there should be nodes dedicated for side effects.

> Cells are developed and tested in complete isolation, then composed into workflows that are validated at compile time.

Sounds like they are pure functions. Workflows are validated at compile time, even if the nodes themselves are in Clojure.

> Routing between cells is determined by dispatch predicates defined at the workflow level — handlers compute data, the graph decides where it goes.

When the graph is built, you don't just need to travel all outgoing edges from a node to the next, but you can place predicates on those edges. The aforementioned nodes do not have these predicates, so I suppose suppose the predicates would be their own small pure-ish functions with the same as input data as a node would get, but their output value is only a boolean.

> Mycelium uses Maestro state machines and

Maestro is a Clojure library for Finite State Machines: https://github.com/yogthos/maestro

> Malli contracts

Malli looks like a parsing/data structure specification EDSL for Clojure: https://github.com/metosin/malli

> to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement.

Well, beats me. I don't know what is "The Law of the Graph" and Internet doesn't seem to know either. I suppose it tries to say how you can from the processing graph to see that given input data to the ingress of the graph you have high confidence that you will get expected data at the final egress from the graph.

I do think these kind of guardrails can be beneficial for AI agents developing code. I feel that for that application some additional level of redundancy can improve code quality, even if the guards are generated by the AI code to begin with.

> At the moment it is a mysterious, occasionally fickle, tool - but if you provide the correct feedback mechanisms and provide small tweaks and context at idiosyncrasies, it's possible to get agents to reliably build very complex.

This sounds like arguing you can use these models to beat a game of whack-a-mole if you just know all the unknown unknowns and prompt it correctly about them.

This is an assertion that is impossible to prove or disprove.

> It's starting to become obvious that if you can't effectively use AI to build systems it is a skill issue.

I think it's fair to say that you can get a long way with Claude very quickly if you're an individual or part of a very small team working on a greenfield project. Certainly at project sizes up to around 100k lines of code, it's pretty great.

But I've been working startups off and on since 2024.

My last "big" job was with a company that had a codebase well into the millions of lines of code. And whilst I keep in contact with a bunch of the team there, and I know they do use Claude and other similar tools, I don't get the vibe it's having quite the same impact. And these are very talented engineers, so I don't think it's a skill either.

I think it's entirely possible that Claude is a great tool for bootstrapping and/or for solo devs or very small teams, but becomes considerably less effective when scaled across very large codebases, multiple teams, etc.

For me, on that last point, the jury is out. Hopefully the company I'm working with now grows to a point where that becomes a problem I need to worry about but, in the meantime, Claude is doing great for us.

Partially agree, but I think "skill issue" undersells the genuine reliability problem the original post is describing.

The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it.

But the latent bug problem isn't really a skill issue. It's a property of how these systems work: the agent optimises for making the current test pass, not for building something that stays correct as requirements change. Round 1 decisions get baked in as assumptions that round 3 never questions — and no amount of better prompting fixes that.

The fix isn't better prompting. It's treating agent-generated code with the same scepticism you'd apply to code from a contractor who won't be around to maintain it — more tests, explicit invariants, and not letting the agent touch the architecture without a human reviewing the design first.

Say i buy into your mysticism based take, is it a useful tool if it blows up in damn well near every professionals face?

lets say i accept you and you alone have the deep majiks required to use this tool correctly, when major platform devs could not so far, what makes this tool useful? Billions of dollars and environment ruining levels of worth it?

I'd say the only real use for these tools to date has been mass surveillance, and sometimes semi useful boilerplate.

According to the https://blog.katanaquant.com/p/your-llm-doesnt-write-correct... previously discussed on HN, it may be at least partially true:

> The vibes are not enough. Define what correct means. Then measure.

> if you provide the correct feedback

And how do you define correct feedback? If the output is correct?

I'd agree, I've been building a personal assistant (https://github.com/skorokithakis/stavrobot) and I'm amazed that, for the first time ever, LLMs manage to build reliably, with much fewer bugs than I'd expect from a human, and without the repo devolving to unmaintainability after a few cycles.

It's really amazing, we've crossed a threshold, and I don't know what that means for our jobs.

I don't understand LISP or Clojure, but it seems to be some kind of library for making web services out of LISP, which has some separate components that are somehow well defined. And somehow it's all related to AI.

Again I don't know much about Clojure and I am too slow for functional programming in general.

Mycelium Benchmark Scaling Analysis

How schema-enforced cells change the reliability equation as system complexity grows.

The Four Benchmarks

We ran four progressively complex benchmarks, each building on the previous one. The traditional approach achieved 100% test passage through V1 and V2, but the V1 shipping bug finally surfaced in V3 — causing 17 test failures after being latent for two full rounds of development.

Benchmark	Subsystems	Tests	Traditional	Mycelium	Traditional LOC	Mycelium LOC
Checkout Pipeline	3	8 / 39 assertions	39/39 pass	39/39 pass	~130	~230 + manifest
Order Lifecycle V1	6	18 / 136 assertions	136/136 pass	136/136 pass	~540	~590 + ~130 manifest
Order Lifecycle V2	11	30 / 235 assertions	235/235 pass	235/235 pass	~722	~900 + ~360 manifest
Order Lifecycle V3	15	52 / 383 assertions	366/383 pass	383/383 pass	~920	~1146 + ~440 manifest

Benchmark 1: Checkout Pipeline

Scale: ~130 lines, 3 subsystems (discounts, tax/shipping, payment).

What it tests: A single linear pipeline -- items in, total out.

Traditional Approach

37/39 assertions passed on first attempt
1 bug: floating-point rounding in tax calculation (50.0 * 0.0725 = 3.6249... rounds to 3.62 instead of 3.63)
Fixed in 1 iteration

Mycelium Approach

39/39 assertions passed on first logic execution (zero logic bugs)
4 iterations of compiler-guided structural fixes before first run: missing :on-error, undeclared data flow keys, dead-end graph route
Each fix was guided by a clear error message

Verdict at This Scale

Both approaches work fine. The problem is small enough that an AI agent can hold the entire system in context. The traditional approach is simpler and faster to implement. Mycelium's overhead (~~100 extra lines + manifest) is proportionally high (~~75%) and hard to justify for a problem this small.

Mycelium advantage: The structural validators caught 3 issues that would have been silent in the traditional approach (missing error handling, undeclared data flow, dead-end route). But at this scale, these issues are easy to catch through testing or code review.

Latent bugs: Traditional 0, Mycelium 0.

Benchmark 2: Order Lifecycle V1

Scale: ~540 lines, 6 subsystems (item expansion, promotions with 5 stacking types, per-item tax with state exemptions, multi-warehouse shipping, split payment, loyalty points with tiered earning).

What it tests: Three interacting workflows (placement, returns, modification) that share data contracts. Returns must correctly reverse the forward calculation, including proportional discount distribution, per-item tax, and split payment reversal.

How It Was Built

The traditional approach was built by 4 separate AI subagents:

Agent 1: Order placement (~342 lines)
Agent 2: Returns processing (~136 lines) -- given spec + tests, no placement source
Agent 3: Order modification (~58 lines) -- given spec + tests, no other source
Agent 4: Added COMBO75 feature by modifying code it didn't write

Traditional Approach

18/18 tests passing, 136 assertions
2 latent bugs discovered:
1. :shipping-detail vs :shipping-groups -- Returns code destructures :shipping-detail but placement outputs :shipping-groups. Gets nil, silently produces $0 shipping refund for all defective returns
2. Double inventory reservation -- modify-order calls place-order which re-reserves inventory without releasing the original reservation

Mycelium Approach

18/18 tests passing on first attempt, 136 assertions
0 latent bugs
Returns manifest explicitly requires :shipping-groups :any in its input schema, making the key mismatch impossible
Modification workflow uses the same placement workflow, so inventory semantics are consistent

Verdict at This Scale

This is the tipping point. The traditional approach has crossed the threshold where implicit contracts between components fail silently. Two independently competent AI agents (placement and returns) produced code that connects incorrectly through a key name mismatch. All tests pass because no test exercises the specific path (defective return of an item with non-zero shipping cost).

The key insight: The bug is not in any single agent's work. Each agent's code is internally correct. The bug is in the contract between agents -- a contract that exists only implicitly in the traditional approach and explicitly in the mycelium manifest.

Latent bugs: Traditional 2, Mycelium 0.

Benchmark 3: Order Lifecycle V2

Scale: ~722 lines, 11 subsystems. Five new features added by 5 sequential AI subagents, each modifying the existing V1 codebase:

Bulk pricing (quantity discounts before all other promos)
Store credit (third payment method in 3-way waterfall)
Gift wrapping (per-item, separate tax rate, refund rules)
Restocking fees (category-dependent on changed-mind returns)
Multi-currency (display-time conversion with separate display fields)

Traditional Approach

30/30 tests passing, 241 assertions
4 latent bugs (2 carried forward from V1 + 2 new):
1. :shipping-detail vs :shipping-groups -- still present after 5 more agents touched the code
2. Double inventory reservation -- still present
3. currency-rates duplicated in 3 files (placement, returns, modification)
4. gift-wrap-cost-per-item duplicated in 2 files (placement, returns)

Mycelium Approach

30/30 tests passing on first attempt, 235 assertions
0 latent bugs
3-way parallel join for tax + shipping + gift-wrap declared in manifest
Each new feature maps to 1-2 new cells with explicit schemas

The Shipping Bug: Experimentally Confirmed

We ran a targeted scenario to prove the traditional bug produces wrong results:

Scenario: Return novel (defective) from headphones+novel order
East warehouse shipping cost: $6.39

Traditional shipping-refund: $0.00  (WRONG)
Mycelium shipping-refund:    $6.39  (CORRECT)

The traditional approach silently loses $6.39 per affected refund. This is not a theoretical concern -- it's a financial calculation error that would affect real transactions.

V1 Bugs Persist Through V2 Development

The most striking finding: 5 additional AI agents worked on the traditional codebase and none detected or fixed the V1 bugs. Each agent:

Read the existing source code
Added their feature successfully
Made their tests pass
Left the existing bugs untouched

This happens because each agent only reads enough context to complete its assigned task. The :shipping-detail bug is invisible unless you specifically compare the returns code's destructuring against the placement code's output keys -- a cross-file analysis that agents skip when focused on adding a feature.

Latent bugs: Traditional 4, Mycelium 0.

Verdict at This Scale

The overhead ratio has shifted. Mycelium's manifest and structural code is ~360 lines on top of ~900 lines of implementation -- about 40% overhead. But the value delivered has grown faster:

Schema validation prevents an entire class of cross-module key mismatches
Manifest serves as machine-readable architecture documentation
Parallel joins are declared, not manually implemented
Each cell can be implemented with only its schema as context
New features map to isolated cells without touching existing cell logic

Benchmark 4: Order Lifecycle V3

Scale: ~920 lines traditional, ~1146 lines mycelium, 15 subsystems. Seven new cross-cutting features designed to create maximum interaction pressure:

Subscription pricing -- 15% off base price, excludes COMBO75 eligibility
Bundle products -- composite items, opaque for promotions, component-level inventory
Tiered shipping -- weight tiers + $3/hazmat + $5/oversized, differentiated free-shipping rules
Warranty add-ons -- per-category cost, 8% service tax, 50% refund on changed-mind (vs gift wrap's $0)
Auto-upgrade loyalty tier -- bronze->silver at $500, recomputes shipping and loyalty only
County-level tax -- surcharges with exemption overrides (Buffalo overrides clothing, Austin overrides digital)
Partial fulfillment -- fulfilled/backordered split, shipping only for fulfilled items

These 7 features create a 48-cell interaction matrix of cross-cutting concerns.

Traditional Approach

35/52 tests passing, 366/383 assertions
17 test failures -- all from the V1 :shipping-detail bug
5 latent bugs (2 from V1, 3 new duplications)

Mycelium Approach

52/52 tests passing, 383/383 assertions -- first attempt
0 latent bugs
4-way parallel join for tax + shipping + gift-wrap + warranty
18 placement cells, 10 returns cells, all schema-bounded

The V1 Bug Finally Explodes

The :shipping-detail vs :shipping-groups bug was introduced in V1, survived V2 untouched, and finally causes test failures in V3. Here's why:

In V1 and V2, shipping was simple: flat rate or free (gold/platinum, subtotal >= $75). Most tested scenarios had free shipping, so the returns code's nil shipping-detail produced the accidentally-correct $0.00 refund. V3's tiered shipping with hazmat and oversized surcharges means most orders now have non-zero shipping costs. Defective returns must refund that shipping, and the nil lookup silently returns $0.

One root cause, 17 failures. The single wrong key name cascades:

shipping-refund: $0 instead of $3-8 (8 tests)
total-refund: wrong by the missing shipping amount (8 tests)
payment-refund / display amounts: cascading errors (1 test)

Test	Expected	Actual	Missing
T11 (laptop defective)	$1041.11 total refund	$1033.11	-$8.00
T17 (headphones defective)	$69.19	$65.82	-$3.37
T24 (gift-wrapped laptop)	$1046.50	$1038.50	-$8.00
T42 (warranty laptop)	$1095.10	$1087.10	-$8.00
T49 (sub+warranty+wrap)	$945.53	$937.53	-$8.00
T52 (bundle return)	$1040.09	$1032.09	-$8.00

In a production system, every defective return would silently under-refund the customer's shipping cost.

All New Features Implemented Correctly

The traditional approach correctly implements all 7 V3 features for placement:

Subscription pricing properly excludes COMBO75 (T32 passes)
Bundles don't trigger ELEC10/BUNDLE5/COMBO75 (T33-34 pass)
County tax overrides work correctly (T35-37 pass)
Tiered shipping computes correctly (T38-40 pass)
Warranty with 50% changed-mind refund (T43 passes)
Auto-upgrade triggers correctly (T44 passes)
Partial fulfillment splits correctly (T46-47 placement assertions pass)

The failures are not from incompetence. They're from the structural impossibility of tracking key names across module boundaries without a contract system.

Verdict at This Scale

The V2 analysis predicted 8-12 latent bugs at 15 subsystems. The reality was more interesting: rather than just accumulating more latent bugs, the existing V1 bug detonated. V3's tiered shipping created enough non-zero shipping scenarios that the :shipping-detail bug went from latent to catastrophic.

This is the time-bomb pattern: a bug introduced in round 1 of development explodes in round 3, when new features create paths through the code that previous tests never exercised. The codebase grew from 540 to 920 lines, 12 features were added across 2 rounds, and the bug was invisible to every test suite, code review, and AI agent that touched the code -- until it wasn't.

The Scaling Curve

Bug Growth

Benchmark	Subsystems	Traditional Bugs	Test Failures
Checkout	3	0	0
V1	6	2 latent	0
V2	11	4 latent	0
V3	15	5 (1 surfaced)	17

The traditional approach doesn't just accumulate bugs linearly -- the bugs interact with new features to create cascading failures. A latent bug that was harmless at 6 subsystems becomes a 17-assertion catastrophe at 15 subsystems because the new features create triggering conditions.

Mycelium stays at zero because every cross-component contract is explicit in the manifest. A bug like :shipping-detail vs :shipping-groups cannot exist when the manifest declares exactly which keys flow between cells.

Overhead Amortization

Benchmark	Traditional LOC	Mycelium Total LOC	Overhead %	Bugs Prevented
Checkout	130	230	77%	0
V1	540	720	33%	2
V2	722	1260	74%*	4
V3	920	1586	72%	5 + 17 test failures

*Overhead percentage stabilizes around 70-75% at scale, but the value delivered grows superlinearly. At V3 scale, ~440 lines of manifest prevent 5 latent bugs and 17 test failures that would silently produce wrong financial calculations.

The overhead percentage isn't the right metric. The right metric is bugs prevented per line of manifest. At V3 scale, the manifest prevented the single bug that cascaded into 17 assertion failures across 8 test cases.

Context Requirements

Benchmark	Lines per module	Can agent hold full context?	Result
Checkout	130 (1 file)	Yes	Both approaches work
V1	180 avg (3 files)	Mostly	Traditional: 2 cross-file bugs
V2	240 avg (3 files)	Strained	Traditional: 4 bugs, 0 fixed
V3	307 avg (3 files)	No	Traditional: 17 test failures

The traditional approach degrades because agents must hold the full system in context to avoid cross-module mismatches. As the system grows, the agent's effective context becomes a shrinking fraction of the total codebase.

Mycelium cells are independently implementable. An agent implementing :return/calc-restocking needs only:

Input schema: [:returned-detail, :reason]
Output schema: [:restocking-fee]
Business rules for restocking fee calculation

It does not need to read placement code, other returns cells, or understand the full data flow. The schema is the complete specification for that unit of work.

Why Tests Don't Catch These Bugs

All benchmarks through V2 achieve 100% test passage in both approaches. The latent bugs survive because:

Tests verify expected behavior, not contract compliance. A test for "return headphones, changed-mind" checks the refund amount. It doesn't check that the returns code reads the correct key from the placement output.
Test scenarios have accidental coverage gaps. Every defective-return test in V1 and V2 happens to involve items with free shipping (>$75 warehouse subtotal). No test returns a defective item where shipping was actually charged.
Agents write tests that match their implementation. When Agent 2 writes returns code that uses :shipping-detail, it also writes tests that don't exercise the :shipping-detail code path with non-zero shipping. The bug and the test gap are correlated because they come from the same incomplete understanding.
Passing tests create false confidence. After 30 tests pass, the natural conclusion is "the code is correct." The latent bugs only appear when you specifically probe for them with targeted scenarios.
New features can detonate old bugs. V3 proves that 100% test passage is a snapshot in time. V1's bug was harmless through V2 and catastrophic in V3. The bug didn't change -- the surrounding code did.

Schema validation is orthogonal to testing. It doesn't check business logic correctness -- tests do that. It checks structural correctness: "does the data flowing between components have the right shape?" This is precisely the category of bug that tests miss and that grows with system complexity.

Manifests as Persistent Context

Beyond preventing bugs, mycelium manifests address a practical problem in AI-assisted development: context compaction.

When an AI agent's conversation grows too long, earlier context gets compressed or dropped. The agent must rebuild its understanding by re-reading source files. In a traditional codebase, this means re-tracing data flow through hundreds of lines of imperative code to reconstruct which keys connect which modules, what the data shapes are, and how subsystems interact.

Mycelium manifests externalize this knowledge:

The manifest is a persistent context map. Reading placement.edn (~100 lines) gives the full DAG, every cell's input/output schema, and all dependencies. That's the entire system architecture in one structured file. In the traditional approach, reconstructing the same understanding requires reading 920 lines across 3 files.
Schemas are contracts that survive compaction. The :shipping-groups key name exists only in the agent's working memory in the traditional approach. Once that memory is compacted, the contract is lost. In mycelium, it's written in the manifest and survives any number of compactions.
Bounded context means less to rebuild. After compaction, an agent working on :return/calc-warranty-refund reads the cell schema (5 lines) and knows exactly what it receives and must produce. It doesn't need to re-read placement code, shipping code, or understand the full pipeline.
The 48-cell interaction matrix is externalized. V3's 15 subsystems create 48 cross-cutting interactions. No agent can hold all 48 in context simultaneously. Mycelium doesn't require it -- each cell only needs its own schema boundary. The manifest encodes the global structure so the agent doesn't have to.

Context compaction puts the agent in the same position as a new developer joining the project. Mycelium manifests serve as onboarding documentation that is also executable contracts. The traditional approach has no equivalent.

The Fundamental Asymmetry

Traditional codebases require global knowledge to avoid cross-module bugs. As the system grows, maintaining global knowledge becomes impossible for any single agent (or human). The V3 benchmark proves this: 15 subsystems and 48 interactions exceed what any agent can hold in context, and the result is a V1 bug that explodes into 17 test failures.

Mycelium requires only local knowledge (the cell's schema) to implement each component correctly. As the system grows, local knowledge stays constant per cell. The manifests encode the global structure, and the framework validates it automatically.

This is the same asymmetry that makes type systems valuable in large codebases: not because they help with small programs, but because they prevent the class of errors that grows fastest with system size.

The V3 results confirm the projection from V2: as complexity grows, the traditional approach's bug surface area grows combinatorially while mycelium's stays at zero. The only question was whether those bugs would remain latent or surface -- and V3 answered definitively.

Yes, this is simply 'technical debt'.

They should try to fix technical debt before going to the next round. Of course Claude can probably also do this.

Yeah it reads very Time Cube...

Again I don't know much about Clojure and I am too slow for functional programming in general.

I think it's all the angry people who were wrong about AI. Every week they come on HN and say AI is utter crap and useless, and every week AI becomes more and more part of the developer work flow.

I would be pissed off too if I was a hypocrite who was so sure AI was total garbage and was now at the same time needing to use claude on a daily basis.

A lot of developers are going through an identity crisis where their skills are becoming more and more useless and they need to attack comments like the above in a desperate but futile attempt to make themselves matter.

The ability to plan long term and anticipate flow on errors.

You chosen AI agent can and will plan as far ahead as you tell it to plan. If you tell it to do things in iteration 1 that might block it in iteration 3 then you should have told it more about where you want to go back in iteration 1.

According to the https://blog.katanaquant.com/p/your-llm-doesnt-write-correct... previously discussed on HN, it may be at least partially true:

> The vibes are not enough. Define what correct means. Then measure.

> It's starting to become obvious that if you can't effectively use AI to build systems it is a skill issue.

But I've been working startups off and on since 2024.

Partially agree, but I think "skill issue" undersells the genuine reliability problem the original post is describing.

The skill part is real — giving the agent the right context, breaking tasks into the right size, knowing when to intervene. Most people aren't doing that well and their results reflect it.

The top-level README gives a bit better idea. Armed with that the explanation might sound a bit more understandable.

I'm not familiar with the project (or Clojure), but let me try to explain!

> Mycelium structures applications as directed graphs of pure data transformations.

> Each node (cell) has explicit input/output schemas.

> Cells are developed and tested in complete isolation, then composed into workflows that are validated at compile time.

Sounds like they are pure functions. Workflows are validated at compile time, even if the nodes themselves are in Clojure.

> Routing between cells is determined by dispatch predicates defined at the workflow level — handlers compute data, the graph decides where it goes.

> Mycelium uses Maestro state machines and

Maestro is a Clojure library for Finite State Machines: https://github.com/yogthos/maestro

> Malli contracts

Malli looks like a parsing/data structure specification EDSL for Clojure: https://github.com/metosin/malli

> to define "The Law of the Graph," providing a high-integrity environment where humans architect and AI agents implement.

Say i buy into your mysticism based take, is it a useful tool if it blows up in damn well near every professionals face?

I'd say the only real use for these tools to date has been mass surveillance, and sometimes semi useful boilerplate.

> is it a useful tool if it blows up in damn well near every professionals face

It doesn't, that's ego-preserving cope. Saying that this stuff doesn't work for "damn well near every professional" because it doesn't work for you is like a thief saying "Everybody else steals, why are you picking on me"? It's not true, it's something you believe to protect your own self-image.

Honestly people are in such a weird place with this shit. I'm not saying don't read the fucking code - but I managed to get my setup to write 100k lines of indistinguishable SWE code in a week or so. The main limitation was my reading speed. This is something like a 10x speedup for me.

> if you provide the correct feedback

And how do you define correct feedback? If the output is correct?

I don't know if you deliberately cut-off the full point, but for the benefit of those with tired eyes I said 'feedback mechanisms', i.e. feedback in the control system sense.

It's really amazing, we've crossed a threshold, and I don't know what that means for our jobs.

No bugs means nothing if bugs get hidden and llms are great at hiding bugs and will absolutely fail to find some fairly critical ones. Your own repo, which is slop at best, fails to meet its core premise

> Another AI agent. This one is awesome, though, and very secure.

it isn't secure. It took me less than three minutes to find a vulnerability. Start engaging with your own code, it isn't as good as you think it is.

edit: i had kimi "red team" it out of curiosity, it found the main critical vulnerability i did and several others

Severity - Count - Categories

Critical - 2 - SQL Injection, Path Traversal

High - 4 - SSRF, Auth Bypass, Privilege Escalation, Secret Exposure

Medium - 3 - DoS, Information Disclosure, Injection

You need to sit down and really think about what people who do know what they're doing are saying. You're going to get yourself into deep trouble with this. I'm not a security specialist, i take a recreational interest in security, and llm's are by no means expert. A human with skill and intent would, i would gamble, be able fuck your shit up in a major way.

This sounds like arguing you can use these models to beat a game of whack-a-mole if you just know all the unknown unknowns and prompt it correctly about them.

This is an assertion that is impossible to prove or disprove.

No it's more like if you knew how to build it before - LLM agents help you build it faster. There's really no useful analogy I can think of, but it fits my current role perfectly because my work is constantly interrupted by prod support, coordination, planning, context switching between issues etc.

I rarely have blocks of "flow time" to do focused work. With LLMs I can keep progressing in parallel and then when I get to the block of time where I can actually dive deep it's review and guidance again - focus on high impact stuff instead of the noise.

I don't think I'm any faster with this than my theoretical speed (LLMs spend a lot of time rebuilding context between steps, I have a feeling current level of agents is terrible at maintaining context for larger tasks, and also I'm guessing the model context length is white a lie - they might support working with 100k tokens but agents keep reloading stuff to context because old stuff is ignored).

In practice I can get more done because I can get into the flow and back onto the task a lot faster. Will see how this pans out long term, but in current role I don't think there are alternatives, my performance would be shit otherwise.

>This is an assertion that is impossible to prove or disprove.

This is a joke right? There are complex systems that exist today that are built exclusively via AI. Is that not obvious?

The existence of such complex systems IS proof. I don't understand how people walk around claiming there's no proof? Really?

The same is true with human engineers - isn't this just what engineering is?

I think it's all the angry people who were wrong about AI. Every week they come on HN and say AI is utter crap and useless, and every week AI becomes more and more part of the developer work flow.

I would be pissed off too if I was a hypocrite who was so sure AI was total garbage and was now at the same time needing to use claude on a daily basis.

At some point I have to describe what I need with such a level of detail and precision, that if I had chosen to wrote it in my language of interest I'd have had the code I wanted long before i finish writing all the boundry conditions, safe guards, long term pitfalls etc. Thats if the system even respects all of those things, it might just lose track of some instruction at some point or make some bs placeholder and now we're back to bug hunting, only i'm now wading through code that is best described as slop.

> is it a useful tool if it blows up in damn well near every professionals face

It works fine for me, because I don't trust to to do anything more than line changes, self-contained prototypes or minor updates. I've seen what happens when people give llms full control over authoring, and it has never worked out. Just look at the dumpster fire spotify has become. Its a buggy mess that barely works and is openly touted as being primarily llm developed at present.

point me towards something complex which llms have contributed towards signifucantly without massive oversight where they didnt fuck things up. I'll eat my words happily, with just a single example.

> Another AI agent. This one is awesome, though, and very secure.

it isn't secure. It took me less than three minutes to find a vulnerability. Start engaging with your own code, it isn't as good as you think it is.

edit: i had kimi "red team" it out of curiosity, it found the main critical vulnerability i did and several others

Severity - Count - Categories

Critical - 2 - SQL Injection, Path Traversal

High - 4 - SSRF, Auth Bypass, Privilege Escalation, Secret Exposure

Medium - 3 - DoS, Information Disclosure, Injection

I don't know if you deliberately cut-off the full point, but for the benefit of those with tired eyes I said 'feedback mechanisms', i.e. feedback in the control system sense.

How does one verify 100k lines in a week? Let alone evaluate it to being SWE equivallent? That's super human. I like to think I am pretty good at what I do, but really critically engaging with 100k lines in a week is beyond even 10 of me. Forgive my skepticism, but I'm going to hazard the guess that you don't know what the fuck you're doing. You've lost your goddamn mind if you think you're doing anything other than skim read at a rate of 42 lines a minute for your entire work day without a break.

You could probably replace LLM with "junior engineer" here as it sounds like you're basically a manager now. The big negative that LLMs have in comparison with junior engineers is that they can't learn and internalise new information based on feedback.

>This is an assertion that is impossible to prove or disprove.

This is a joke right? There are complex systems that exist today that are built exclusively via AI. Is that not obvious?

The existence of such complex systems IS proof. I don't understand how people walk around claiming there's no proof? Really?

The same is true with human engineers - isn't this just what engineering is?

The assertion was "if you really know how to prompt, give feedback, do small corrections and fix LLM errors, then everything works fine".

It is impossible to prove or disprove because if everything DOES NOT work fine you can always say that the prompts were bad, the agent was not configured correctly, the model was old, etc. And if it DOES work, then all of the previous was done correctly, but without any decent definition of what correct means.

point me towards something complex which llms have contributed towards signifucantly without massive oversight where they didnt fuck things up. I'll eat my words happily, with just a single example.

On Saturday I had claude generate ~10k of lines of Lua code which uses the libASS subtitle format to build up nearly two dozen GUI widgets from subtitle drawing primitives, including nestable scrollable containers with clipping, drop down menus, animated tab bars, and everything else I could think of. I read probably about 100 lines of code myself that day, I "verified" the code only by testing out the demo claude was updating through the process.

Then on Sunday I woke up and had claude bang out a series of half a dozen projects each using this GUI library. First, a script that simply offers to loop a video when the end is reached. Updated several of my old scripts that just print text without any graphical formatting. Then more adventurous, a playlist visualizer with support for drag to reorder. Another that gives a nice little control overlay for TTS reading normal media subtitles. Another that let's people select clips from whatever they're watching, reorder them and write out an edit decision list, maybe I'll turn this one into a complete NLE today when I get home from work.

Reading every line of code? Why? The shit works, if I notice a bug I go back to claude and demand a "thoughtful and well reasoned" fix, without even caring what the fix will be so long as it works.

The concepts and building blocks used for all of this is shit I've learned myself the hard way, but to do it all myself would take weeks and I would certainly take many shortcuts, like certainly skipping animations and only implementing the bare minimum. The reason I could make that stuff work fast is because I already broadly knew the problem space, I've probably read the mpv manpage a thousand times before, so when the agent says its going to bind to shift+wheel for horizonal scrolling, I can tell it no, mpv has WHEEL_LEFT and RIGHT, use those. I can tell it to pump its brakes and stop planning to load a PNG overlay, because mpv will only load raw pixel data that way. I can tell it that dragging UI elements without simultaneously dragging the whole window certainly must be possible, because the first party OSC supports it so it should go read that mess of code and figure it out, which it dutifully does. If you know the problem space, you can get a whole lot done very fast, in a way that demonstrably works. Does it have bugs? I'd eat a hat if it doesn't. They'll get fixed if/when I find them. I'm not worried about it. Reading every line of code is for people writing airliner autopilots, not cheeky little desktop programs.

"The big negative that LLMs have in comparison with junior engineers is that they can't learn and internalise new information based on feedback."

No, but they can take "notes" and can load those notes into context. That does work, but is of course not so easy as it is with humans.

It is all about cleaning up and maintaining a tidy context.

The assertion was "if you really know how to prompt, give feedback, do small corrections and fix LLM errors, then everything works fine".

Reading every line of code? Why? The shit works, if I notice a bug I go back to claude and demand a "thoughtful and well reasoned" fix, without even caring what the fix will be so long as it works.

"The big negative that LLMs have in comparison with junior engineers is that they can't learn and internalise new information based on feedback."

No, but they can take "notes" and can load those notes into context. That does work, but is of course not so easy as it is with humans.

It is all about cleaning up and maintaining a tidy context.

Hacker Times

Hacker Times

Claude built a system in 3 rounds, latent bugs from round 1 exploded in round 3

Discussion

Discussion

Mycelium Benchmark Scaling Analysis

The Four Benchmarks

Benchmark 1: Checkout Pipeline

Traditional Approach

Mycelium Approach

Verdict at This Scale

Benchmark 2: Order Lifecycle V1

How It Was Built

Traditional Approach

Mycelium Approach

Verdict at This Scale

Benchmark 3: Order Lifecycle V2

Traditional Approach

Mycelium Approach

The Shipping Bug: Experimentally Confirmed

V1 Bugs Persist Through V2 Development

Verdict at This Scale

Benchmark 4: Order Lifecycle V3

Traditional Approach

Mycelium Approach

The V1 Bug Finally Explodes

All New Features Implemented Correctly

Verdict at This Scale

The Scaling Curve

Bug Growth

Overhead Amortization

Context Requirements

Why Tests Don't Catch These Bugs

Manifests as Persistent Context

The Fundamental Asymmetry