A new era for software testing

> I have the feeling that the introduction of automatic QA may raise the bar of quality for new releases of software, and maybe partially compensate for the lower quality of the code produced at high speed with the use of automatic programming.

In theory. The only difference between today and "the aughts" is that we have machines that can spit out a ton of code very quickly.

Nothing has changed about the discipline or honesty around testing (you can skip automated tests even faster now if you wish). You can and should work with AI to write tests, but you have to know the difference between a good test and a "looks good on paper" test in order for it to truly be effective and raise the quality of what you're building.

Writing unit tests used to be the bane of my existence. I used to hate them. Oftentimes, the LoC for unit tests was 3x the LoC of the actual code.

But not any more! Now I point the LLM to the code and order it to write unit tests, covering all edge cases, etc. I'd rather spend 3 hours arguing with the LLM than writing unit tests! :-D

Scenario testing is the new word for it and I think this is a game changer.

Two of the reasons I never liked writing tests are:

- they didn’t seem to usually assert much internal logic

- they would have to be maintained along with the original code

I think scenario testing is much better instead because the actual way a person uses a feature hardly changes but the internals might change a lot.

So imagine I’m making an e-commerce website. There are lots of internal mechanisms. I’ll have an agent testing all the functionality as if it were a customer. This gives me much more confidence while writing code because the testing is far less correlated with the code's internals.

Tomorrow I can change a lot of internals but the testing agent stays the same.

There’s something to note though: not all code can be scenario tested, for example data engineering and other work where the feedback time is huge.
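A rough, deterministic sketch of the same idea, assuming a hypothetical `Shop` class (every name here is made up for illustration, not taken from a real codebase): the scenario drives the shop only through what a customer can do, so the internals can be rewritten without touching it.

```python
from dataclasses import dataclass, field


@dataclass
class Shop:
    """Hypothetical e-commerce backend; its internals are free to change."""
    prices: dict[str, int]  # item name -> price in cents
    _cart: dict[str, int] = field(default_factory=dict)  # internal detail

    def add_to_cart(self, item: str, quantity: int = 1) -> None:
        if item not in self.prices:
            raise KeyError(f"unknown item: {item}")
        if quantity <= 0:
            raise ValueError("quantity must be positive")
        self._cart[item] = self._cart.get(item, 0) + quantity

    def remove_from_cart(self, item: str) -> None:
        self._cart.pop(item, None)

    def checkout(self) -> int:
        """Return the total in cents and empty the cart."""
        total = sum(self.prices[name] * qty for name, qty in self._cart.items())
        self._cart.clear()
        return total


def customer_scenario(shop: Shop) -> int:
    """Act like a customer: public operations only, no peeking at _cart."""
    shop.add_to_cart("book", 2)
    shop.add_to_cart("pen")
    shop.remove_from_cart("pen")
    return shop.checkout()
```

If `_cart` later becomes a list of line items or a database table, `customer_scenario` should not need to change, which is the property being argued for.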

I believe this can work if done on top of traditional testing. I would feel very uneasy replacing deterministic (ok, not always, but mostly) test suites with something that is not deterministic at all.

are we just re-inventing Playwright tests, except 10x slower and infinitely more expensive?

i feel like i'm going insane

>Scenario testing is the new word

How is scenario different from a behavior (as in Behavior-Driven Development)?

Gherkin and things like Cucumber are not something new, are they?

> Two of the reasons I never liked writing tests

Are you an engineer? You must test your "creation". Or would you expect the microwave oven you just bought to be tested by your child while getting burned?

I think this is just TDD or unit test dogma and I’m personally not a fan.

Unit tests and deterministic tests are hard to get right and need to be done at the correct boundary.

I have seen many people push unit tests dogmatically, but this often leads to hard-to-maintain tests that mostly exist just to change along with the main code itself.

A good way to understand if your unit tests are good: are you changing them along with changing your actual code? Then it’s a bad test. I think the argument for “it’s just documentation” is weak.

I am curious how often, in your experience, the LLM must also update the tests. I find that if LLMs write tests after the implementation exists, they are either extremely brittle because they are coupled to the implementation, or they cover little of value because they mock everything to the point of testing nothing.
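To make the "mock everything" failure mode concrete, here is a small sketch. The functions `apply_discount` and `price_order` are hypothetical examples, not from any project mentioned in this thread. The first test replaces the only interesting logic with a mock, so it keeps passing even if the discount code is wrong; the second checks an observable result.

```python
from unittest.mock import Mock


def apply_discount(total_cents: int, percent: int) -> int:
    """Return the total after a percentage discount, rounding the discount down."""
    return total_cents - (total_cents * percent) // 100


def price_order(items: list[int], percent: int, discount=apply_discount) -> int:
    """Sum the item prices and pass the sum through the discount function."""
    return discount(sum(items), percent)


def test_over_mocked() -> None:
    # The discount logic is mocked away, so this only verifies the wiring.
    fake = Mock(return_value=900)
    assert price_order([500, 500], 10, discount=fake) == 900
    fake.assert_called_once_with(1000, 10)


def test_behavior() -> None:
    # Runs the real code path and asserts on what a caller would observe.
    assert price_order([500, 500], 10) == 900
    assert price_order([], 10) == 0
    assert price_order([999], 0) == 999
```

A test like `test_over_mocked` also tends to break whenever the call signature changes, which is the brittleness described above.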

since the rise of agentic coding tools, it feels like we're in a new "eternal september" of people discovering ui end-to-end test automation.

'I never liked writing tests' is not the same as 'I don't write tests'.

I don’t disagree with your point, but there is still value in having unit tests that change along with the code. It’s less than a “proper” test, but when these tests break _unexpectedly_, it’s still more signal than you’d have without them. Like, always changing `file.go` alongside `file_test.go` may be acceptable if you catch errors that impact `serve_test.go` unexpectedly.

Of course, if you’re just watching Claude changing both and saying “LGTM” then it’s not very valuable.

Also the merits of documentation and specs. It’s been eye-opening to see the subset of developers who were almost disdainful about writing documentation for their colleagues but are now tripping over themselves to do so for their clanker.

People are rediscovering everything. Some people have proposed using a more formal language to tell the AI precisely what code to write. That's a compiler.

antirez 4 days ago. 42129 views.

Automatic programming dramatically speeds up writing software in certain use cases and in the right hands. In my experience the output does not reach the structural quality and economy of complexity of the best hand-written software. However, not all software is stellar, and my feeling is that automatic programming, most of the time and if well managed, surpasses the quality of decently developed hand-written code.

Yet there is a tradeoff between quality and time when writing new software with AI. In certain projects I developed this tradeoff can be brutal: projects that might take many months are completed in a few weeks. However, there are domains where LLMs simply open new, strictly more powerful ways to automate processes, without any compromise on quality. One of those domains is software QA and testing.

Traditionally, software is tested using test suites composed of locally scoped tests and integration tests (think of Redis: testing whether SET foo 10 is matched by GET foo => 10 is one thing, testing whether replication works in this case is another), and then by QA passes that are usually executed manually and that can catch holes in the runnable test suite. It is a known fact that covering all the lines of the code does not mean covering all the possible states. Moreover, integration testing is structurally hard: timing issues, setups, and certain quality outputs that can only be inspected visually and not checked automatically leave a lot of testing opportunities unexploited because of time or logistical constraints.
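The gap between line coverage and state coverage can be illustrated with a small sketch. The function `average_latency` and its test are hypothetical and not taken from Redis or any real test suite: the single test executes every line of the function, yet one reachable state still crashes.

```python
def average_latency(samples: list[float], drop_first: bool) -> float:
    """Mean of the samples, optionally discarding the first (warm-up) sample."""
    if drop_first:
        samples = samples[1:]
    return sum(samples) / len(samples)


def test_covers_every_line() -> None:
    # This one call runs every line above, so line coverage reports 100%.
    assert average_latency([10.0, 20.0, 30.0], drop_first=True) == 25.0


def untested_state_crashes() -> bool:
    """A single sample with drop_first=True leaves nothing to average."""
    try:
        average_latency([5.0], drop_first=True)
    except ZeroDivisionError:
        return True
    return False
```

The crash depends on a combination of input values rather than on an unexecuted line, which is why a coverage report alone would not flag it.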

LLMs offer a new way to do QA on top of the existing testing methodologies. The idea is to create a markdown file where an AI agent is asked to work as a QA engineer, performing a number of manual tests on the new release. For instance, in the case of DwarfStar (an inference engine for open-weights LLMs) I use the following approach. In the markdown file, the agent is asked to check which commits are new on top of the already released version of the software project. Then the model is given a list of things to perform, like:

- Check that distributed inference works across MacBook A and MacBook B, making sure the output is coherent, the inference works with all the GGUF files we have on both machines, ...
- Make sure this release does not contain any speed regression.

And so forth. Notably, in the speed regression part, I don't have to tell the agent what the previous expected speed was, as this is a moving target that changes with new releases and new optimizations. Similarly, the integration test for distributed inference does not require many instructions: at the start of the file there are just the SSH endpoints, the key to use, the paths, and so forth.
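One way to read the "moving target" remark, which is an assumption on my part and not something the post spells out, is that the baseline is measured from the previously released build at QA time instead of being stored as a fixed number. A minimal sketch of the comparison step, with a hypothetical function name and an arbitrary 5% tolerance:

```python
from statistics import median


def is_speed_regression(
    baseline_runs: list[float],
    candidate_runs: list[float],
    tolerance: float = 0.05,
) -> bool:
    """Compare median throughput (e.g. tokens per second) of two builds.

    Returns True when the candidate's median is more than `tolerance`
    (as a fraction) below the baseline's median. Both lists are assumed
    to be measured on the same machine during the same QA pass.
    """
    if not baseline_runs or not candidate_runs:
        raise ValueError("need at least one run per build")
    baseline = median(baseline_runs)
    candidate = median(candidate_runs)
    return candidate < baseline * (1.0 - tolerance)
```

Using the median of several runs is a design choice to dampen noise from a single slow run; the tolerance would need tuning per project.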

The agent is asked to check the long list of QA activities *especially* in light of the added commits, starting with an inspection of the changes and the identification of what could be affected, so that the QA pass specializes in finding specific regressions.
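A minimal sketch of how such a brief could be assembled, assuming the previous release is marked by a git tag. The function names, the prompt wording, and the tag convention are illustrative assumptions, not the author's actual markdown file.

```python
import subprocess


def commits_since(tag: str, repo: str = ".") -> list[str]:
    """One-line summaries of the commits made after `tag`, read from `git log`."""
    out = subprocess.run(
        ["git", "-C", repo, "log", "--oneline", f"{tag}..HEAD"],
        check=True,
        capture_output=True,
        text=True,
    ).stdout
    return [line for line in out.splitlines() if line.strip()]


def build_qa_brief(commits: list[str], checklist: list[str]) -> str:
    """Compose the text handed to the agent: new commits first, then the checks."""
    lines = [
        "You are acting as a QA engineer for this release.",
        "",
        "New commits since the last release:",
    ]
    lines += [f"- {commit}" for commit in commits] or ["- (none found)"]
    lines += [
        "",
        "Start by inspecting these changes and list what they could affect.",
        "Then perform the checks below, focusing on the affected areas:",
    ]
    lines += [f"{n}. {item}" for n, item in enumerate(checklist, start=1)]
    return "\n".join(lines)
```

Keeping the commit list separate from the checklist means the list of QA activities can stay stable across releases while the focus of each pass shifts with the changes.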

In the case of Redis Arrays, I used a similar methodology, asking the agent to build a large array-based Redis application, to set up a production environment with replication and persistence, and to simulate usage of the application for days and with many users, checking whether anything was odd.

Testing that uses these approaches may also move toward the more psychological side of software quality, asking the agent to identify all the new features that may look surprising, insufficiently documented, or generally sloppy from the POV of the user. These are all things that had to be done manually before, and that most of the time were skipped.

I have the feeling that the introduction of automatic QA may raise the bar of quality for new releases of software, and maybe partially compensate for the lower quality of the code produced at high speed with the use of automatic programming.

Hacker Times

Discussion
