Those of us with decades of experience who use coding agents for hours per day have learned that even with extensive context engineering, these models do not magically cover more than about half of the testing space.
If you asked your coding agent to develop a memory allocator, it would not also "automatically verify" the memory allocator against all failure modes. It is your responsibility as an engineer to bring long-term learning and regular contact with the real world to bear on the testing approach.
LLM code review (with the right tools and setup) can encourage engineers to write more pragmatic code, write more in-depth tests, and submit better code overall without involving someone else, but a language model can't replace an actual human code review for production systems. We also shouldn't think of code as simply "correct" or "incorrect," because that framing doesn't capture quality or security: code can pass every test and still have major security issues the tests don't cover, or consume far too much CPU or memory in production.
Again, I'm not opposed to AI coding. I know a lot of people are. I have multiple open source projects that were 100% created with AI assistants, and I wrote a blog post about it that you can see in my post history. I'm not anti-AI, but I do think that developers have some responsibility for the code they create with those tools.
There is a subset of things where it would be OK to do this right now: instances where the cost of utter failure is relatively low. For visual results the benchmark is often "does it look right?" rather than "is it strictly accurate?"
We are simply shuffling cognitive and entropic complexity around and calling it intelligence. As you said, at the end of the day the engineer - like the pilot - is ultimately the responsible party at all stages of the journey.
But if you actually can specify what the program is supposed to do, this can work. It's appropriate where the task is hard to do but easy to specify. A file system or a database can be specified in terms of large arrays. Most of the complexity of a file system is in performance and reliability; what it's supposed to do from the API perspective isn't that complicated. The same can be said for garbage collectors and other complex systems that do something conceptually simple but hard to do right.
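A rough sketch of what "specify against a simple model" can mean: replay the same operations against both the real implementation and a trivially-correct reference, and require they agree. All names here (TinyStore, check_against_model) are illustrative, not from any library.

```python
class TinyStore:
    """A trivial key-value store standing in for a complex implementation
    (a real one might have caching, persistence, concurrency, etc.)."""

    def __init__(self) -> None:
        self._data: dict = {}

    def put(self, key, value) -> None:
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


def check_against_model(ops) -> None:
    """Replay the same operations against the store and a plain dict;
    the two must agree after every step."""
    store, model = TinyStore(), {}
    for key, value in ops:
        store.put(key, value)
        model[key] = value
        assert store.get(key) == model.get(key)


check_against_model([("a", 1), ("b", 2), ("a", 3)])
```

The dict plays the role of the "large array" spec: it says nothing about performance or reliability, only about what the API must return.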
Probably not going to help with a web page user interface. If you had a spec for what it was supposed to do, you'd have the design.
This doesn’t matter in the age of AI - when you get a new requirement just tell the AI to fulfill it and the old requirements (perhaps backed by a decent test suite?) and let it figure out the details, up to and including totally trashing the old implementation and creating an entirely new one from scratch that matches all the requirements.
For performance, give the AI a benchmark and let it figure it out as well. You can create teams of agents each coming up with an implementation and killing the ones that don’t make the cut.
Or so goes the gospel in the age of AI. I'm being totally sarcastic; I don't believe in AI coding.
2026-03-16
I've been wondering what it would take for me to use unreviewed AI-generated code in a production setting.
To that end, I ran an experiment that has changed my mindset from "I must always review AI-generated code" to "I must always verify AI-generated code." By "review" I mean reading the code line by line. By "verify" I mean confirming the code is correct, whether through review, machine-enforceable constraints, or both.
I had a coding agent generate a solution to a simplified FizzBuzz problem. Then, I had it iteratively check its solution against several predefined constraints:
(1) The code must pass property-based tests (see Appendix B for a primer). This constrains the solution space to ensure the requirements are met. This includes tests verifying that no exceptions are raised and tests verifying that latency is sufficiently low.
(2) The code must pass mutation testing (see Appendix C for a primer). Mutation testing is typically used to expand your test suite. However, if we assume our tests are correct, we can instead use it to restrict the code. This constrains the solution space to ensure that only the requirements are met.
(3) The code must have no side effects.
(4) Since I'm using Python, I also enforce type-checking and linting, but a different programming language might not need those checks.
These checks seem sufficient for me to trust the generated code without looking at it. The remaining space of invalid-but-passing programs exists, but it's small and hard to land in by accident.
I was concerned that the generated code would be unmaintainable. However, I'm starting to think that maintainability and readability aren't relevant in this context. We should treat the output like compiled code.
The overhead of setting up these constraints currently outweighs the cost of just reading the code. But it establishes a baseline that can be chipped away at as agents and tooling improve.
The repo fizzbuzz-without-human-review implements these checks in Python, allowing you to try this for yourself.
Software tests commonly check specific inputs against specific outputs:
def test_returns_fizzbuzz_for_multiples_of_3_and_5() -> None:
    assert fizzbuzz(15) == "FizzBuzz"
    assert fizzbuzz(30) == "FizzBuzz"
Property-based tests run against a wider range of values. The property-based test below (using Hypothesis) runs fizzbuzz with 100 semi-random multiples of both 3 and 5, favoring "interesting" cases like zero or extremely large numbers.
@given(n=st.integers(min_value=1).map(lambda n: n * 3 * 5))
def test_returns_fizzbuzz_for_multiples_of_3_and_5(n: int) -> None:
    assert fizzbuzz(n) == "FizzBuzz"
Compared to testing specific input, this approach gives us more confidence that a given "property" of the system holds, at the cost of being slower, nondeterministic, and more complex.
For additional information, the Hypothesis docs are a good starting point.
Mutation testing tools like mutmut change your code in small ways, like swapping operators or tweaking constants, then re-run your test suite. If your tests fail, the "mutant" code is "killed" (good), and if your tests pass, the mutant "survives" (bad).
As an example, consider the following code:
def double(n: int) -> int:
    print(f"DEBUG n={n}")
    return n * 2

def test_doubles_input() -> None:
    assert double(3) == 6
Mutating print(f"DEBUG n={n}") to print(None) leaves test_doubles_input passing, so the mutant survives. You would fix it by removing the side effect or adding a test for it.
Thanks to Taha Vasowalla and other reviewers for their feedback on an early draft of this post.