It’s starting to feel like I belong to the so-called old-school dinosaurs in this matter, but I’ll present a slightly different view on the subject.
Fundamentally, the situation, as I see it, is that no (reasonable) number of unit tests can in itself guarantee the correctness of a program’s operation in a typical situation. Only in exceptional cases can test batteries be made exhaustively comprehensive.
Why? Mainly because, for example, a program that only takes one 32-bit integer as input would need over four billion test cases if it were to be comprehensively tested for inputs. In some situations, such brute force testing is possible, but almost always in practice, the combinatorial explosion of the input space meets anyone who intends to test the program completely.
For this reason, another key component, in my opinion, is that one tries to understand, by reading (or writing) code, that the right problem is being solved. Once the correctness of the logic has been ascertained, a reasonable number of unit tests can ensure that, for example, corner cases were not messed up and that a kind of test bench is obtained, against which changes made with the same thought can be tested.
If the code hasn’t even been looked at, one is practically relying on faith that the code even intends to solve the right problem, and not just to pass test cases. This is related, for example, to my earlier message regarding agent test deceptions. The essential thing in those deception cases, in my opinion, is not how they now affect some benchmark result list. These are the things that can also be encountered when one creates a test bench for the outputs of agents.
As one example, I could bring up the following:
A more detailed analysis of what this deception looked like was found behind the link:
So, in a difficult situation, the agent built a bunch of if-statements as a solution, where for each test case, the answer that was supposed to be obtained from the test case was returned.
This is one type of (very blatant, of course) example of how pure unit tests can be passed, but yet there has not even been an attempt to solve the actual problem. It is easy to imagine similar, more subtle “rubber band solutions” that pass test cases but do not solve the actual problem. This is one example of a program overfitting to test data, about which I wrote something some time ago.
One way to (at least partially) prevent this would be for some of the test cases to be such that agents are never shown them, but they act at a later stage as a kind of “acid test” for verifying the agent’s output. But this, of course, then breaks the automation pipeline if the whole thing was intended to be automated and eventually give the agent automatic feedback on errors.
The agent-based approach is, of course, to then set up agents to examine the outputs of agents, as in the blog post’s explanation. But this, of course, does not bring me the understanding of whether the written code is trying to solve the problem it was supposed to be solving. The whole thing rests on faith that the agents managed to get the job done and tested among themselves.
As I’ve said before, this probably suits some tasks, but in others, the tolerances for messing up are smaller. And my own hard-won experience strongly argues against letting go of anything whose operation one does not understand sufficiently well. Historically, in such situations, it has quite often happened that one encounters the situation later, and possibly also a disgruntled customer in addition to just a technical problem.
Well, but as I said, these thoughts are not really in fashion nowadays. ![]()


