ChatGPT, other language models and AI

It’s starting to feel like I belong to the so-called old-school dinosaurs in this matter, but I’ll present a slightly different view on the subject.

Fundamentally, the situation, as I see it, is that no (reasonable) number of unit tests can in itself guarantee the correctness of a program’s operation in a typical situation. Only in exceptional cases can test batteries be made exhaustively comprehensive.

Why? Mainly because, for example, a program that only takes one 32-bit integer as input would need over four billion test cases if it were to be comprehensively tested for inputs. In some situations, such brute force testing is possible, but almost always in practice, the combinatorial explosion of the input space meets anyone who intends to test the program completely.

For this reason, another key component, in my opinion, is that one tries to understand, by reading (or writing) code, that the right problem is being solved. Once the correctness of the logic has been ascertained, a reasonable number of unit tests can ensure that, for example, corner cases were not messed up and that a kind of test bench is obtained, against which changes made with the same thought can be tested.

If the code hasn’t even been looked at, one is practically relying on faith that the code even intends to solve the right problem, and not just to pass test cases. This is related, for example, to my earlier message regarding agent test deceptions. The essential thing in those deception cases, in my opinion, is not how they now affect some benchmark result list. These are the things that can also be encountered when one creates a test bench for the outputs of agents.

As one example, I could bring up the following:

A more detailed analysis of what this deception looked like was found behind the link:

So, in a difficult situation, the agent built a bunch of if-statements as a solution, where for each test case, the answer that was supposed to be obtained from the test case was returned.

This is one type of (very blatant, of course) example of how pure unit tests can be passed, but yet there has not even been an attempt to solve the actual problem. It is easy to imagine similar, more subtle “rubber band solutions” that pass test cases but do not solve the actual problem. This is one example of a program overfitting to test data, about which I wrote something some time ago.

One way to (at least partially) prevent this would be for some of the test cases to be such that agents are never shown them, but they act at a later stage as a kind of “acid test” for verifying the agent’s output. But this, of course, then breaks the automation pipeline if the whole thing was intended to be automated and eventually give the agent automatic feedback on errors.

The agent-based approach is, of course, to then set up agents to examine the outputs of agents, as in the blog post’s explanation. But this, of course, does not bring me the understanding of whether the written code is trying to solve the problem it was supposed to be solving. The whole thing rests on faith that the agents managed to get the job done and tested among themselves.

As I’ve said before, this probably suits some tasks, but in others, the tolerances for messing up are smaller. And my own hard-won experience strongly argues against letting go of anything whose operation one does not understand sufficiently well. Historically, in such situations, it has quite often happened that one encounters the situation later, and possibly also a disgruntled customer in addition to just a technical problem.

Well, but as I said, these thoughts are not really in fashion nowadays. :slight_smile:

19 Likes

Nowadays, the situation is so fortunate (or unfortunate) that you don’t need to understand anything about the code in coding, as long as the programming project is small and simple. More and more skilled coders are also moving towards utilizing agent coding instead of manual coding. Of course, there is also a lot of programming work where language models should not or even cannot really be utilized.

A practical example of beginner vibe coding is probably in order. You open the terminal and through it, the coding tool. You write there, for example, “/plan I would like to make a multiplayer snake game.” After that, the agent thinks about what a snake game is and what multiplayer means, and asks you a few clarifying multiple-choice questions. From these, the agent forms a plan for the game, and if you approve it, it codes you a multiplayer snake game.

However, while playing with your friend, you notice the game is a bit laggy and the color scheme is boring, so you take a screenshot of the game and write in the terminal “/plan Examine this image [Image] as if you were a visual design professional and develop a more exciting color scheme for the game. The game is also laggy, so go through the code and suggest how performance could be improved.” And again, the agent goes to work, drawing up a plan by analyzing the image, examining the code, and creating suggestions for you to approve on how to make it better. In this style, even a complete novice can produce functional code as long as the programs are simple and limited.

Hardcore vibe coders then develop many different agents that communicate with each other, each possessing predefined skills and viewing the task from their own perspective, and so on. You hit the nail on the head there: the more often these agent coding software produce functional, high-quality code, the lazier people become and the less they check the details of the work done by the machine. It’s quite normal nowadays that people no longer bother to read the code themselves, but instead have an agent go through the repo and summarize in a memo what the code does and what the dependencies of the parts are.

20 Likes

For a programmer, writing the code itself takes a surprisingly long time, even though the task is mechanical. AI speeds this up at least 10-fold.

Thus, a programmer can instantly move up a level, to think about bigger picture designs. Agents are quite good at this too, and can break down problems into specs, with very few errors occurring. This is naturally influenced by the skills and other tips available to the agent in its context.

When less time is needed for the bigger picture, a programmer can spin up multiple features simultaneously. Or, alternatively, go into founder mode and put on a product manager’s hat. Agents are very helpful in this too, provided there’s a strong documentation culture in the organization: with MCP, you can easily give an agent access to the company’s Notion/Linear/Slack/ERP to gather more context.

The number of lines of code itself is a rather clumsy metric for measuring a programmer’s efficiency/value, and in my ~ten-year career, no credible company has ever used it. A better metric is how much less time an AI-assisted team takes to do, for example, tasks that previously took a month. It’s almost bewildering how challenging things (refactorings, disentangling integration dependencies, new features, etc.) can be done in a couple of days when it previously would have taken weeks (and the quality is better).

Of course, using these most effectively requires professional skill. If you’ve never programmed, you can’t currently get any truly robust production software by “vibe coding.” But in the hands of a skilled coder, the productivity leap is genuinely at that 10x level and even more.

11 Likes

Yes, but this applies to any code written by someone else that you haven’t reviewed yourself. There’s a lot of faith involved, but once you have enough good experiences with code written by an agent, you start to trust it quite blindly. This is, of course, a double-edged sword; on the one hand, carelessness exposes you to serious mistakes, but on the other hand, a reduction in paranoia allows for faster implementation. It’s then up to you and your organization to decide which is valued more, and where to find the right balance.

As an analogy (perhaps a clumsy one), agents can be compared to a new colleague: when a summer intern who has barely finished their freshman year joins the company, more experienced colleagues scrutinize their code very carefully. Once that person has made their mistakes, learned from them, and been with the company for a while, much more leeway and trust are given to the new colleague’s abilities during reviews. Similarly, once an organization/programmer is accustomed to agents and has seen what they can do routinely and where they still struggle, it’s easier to focus one’s energy correctly during reviews.

Models and working methods are also developing at such an incredible pace that if your understanding of agents’ capabilities is based on a state a few months ago, that understanding is already hopelessly outdated.

But these are just my thoughts — you can preach the gospel of AI with all your might, but no one will believe it until they’ve seen for themselves how well they can work for their own tasks.

9 Likes

I don’t know if that was intentional or accidental, but I’m definitely going to start using this gem!

14 Likes

This, in my opinion, is the most essential observation in the whole thread. We had “nothing” 5 years ago. Now some people still grumble that the model does this or doesn’t do that, even though everyone can see that this whole thing has absolutely exploded, and there’s no end in sight for the development. Maybe it would be better to think about where we might be in one, three, or five years, rather than harping on what some Sonnet 3.7 couldn’t do.

7 Likes

If it remained unclear, I want to clarify that it wasn’t about what Sonnet 3.7 could or couldn’t do. It was an example of how blindly trusting unit tests can backfire.

Similarly, there have been creative solutions available with newer models:

On Terminal-Bench 2, a Claude Opus 4.6 agent (via Meta-Harness) tasked with implementing an adaptive rejection sampler wrote code that always prints “PASS” when run. The verifier executes the agent’s code (printing “PASS”), then runs its own checks (printing “FAIL”), but only checks whether the output contains “PASS.” Since the agent’s output comes first, the verifier passes despite the actual tests failing.

But generally, regarding this test cheating, my point is that the more one relies on agents for both coding and testing, the less precise understanding anyone in the organization has about, for example, what has actually been tested.

7 Likes

I disagree about it being mechanical, even though from the outside it only looks like staring at a screen and banging on a keyboard. Of course, it’s possible that there are operating models somewhere where “writing code” can be reduced to purely mechanical delivery. I just haven’t encountered such a thing myself.

And what “surprisingly long” means is, of course, relative. Various sources accumulated over time seem to suggest that in programming, more than 10 times as much time is spent (or has been spent) reading code than writing it. Of course, this varies depending on the source and measurement method. See, for example, a quote from Robert Martin’s book Clean Code:

“Indeed, the ratio of time spent reading versus writing is well over 10 to 1. We are constantly reading old code as part of the effort to write new code. …[Therefore,] making it easy to read makes it easier to write.”

This need for reading is often related to writing new code, not so much to code review sessions (although code is certainly read during and for them).

Perhaps writing new code into an existing codebase feels slow because it involves reading the old codebase and ensuring various invariants before anything truly mechanical happens?

However, I argue that reading related to writing code is far from a mechanical task. Which also brings us to this:

As an analogy (perhaps a clumsy one), agents can be compared to a new colleague: when a summer intern who has barely finished their freshman year joins the company, more experienced colleagues scrutinize their code very carefully. Once that person has made their mistakes, learned from them, and been with the company for a while, much more leeway and trust are given to the new colleague’s abilities in reviews. Similarly, when an organization/programmer gets used to agents and sees what they can do routinely and where they still have difficulties, it’s easier to focus energy correctly in reviews.

I may also be badly out of fashion with my next argument: Although code reviews are often part of the process, I believe significantly more understanding of how the codebase works is built during the reading of code directly related to making / writing code changes. At the same time, the entire design is verified, and refactoring needs are identified. And in this context, one gradually and inevitably becomes familiar with the output of others, even in parts where code reviews have not been held.

So, in my view, writing code, in addition to mechanical typing, also functions more broadly in developing and sharing an understanding of the codebase and the software being developed in general. Writing code and the associated reading of code is, in some sense, a method of communication within the team. And in my opinion, it is precisely through writing code that a collectively much deeper understanding of how the entire software being developed works is achieved than, for example, by just holding individual code reviews. Not to mention a situation where the codebase would no longer be manually read at all, and things would only be discussed at a high level in quick meetings.

And furthermore, when merely reading code without the goal of making changes to the codebase, the reading easily becomes cursory. In such cases, the understanding of how things work gained through reading easily remains superficial. The forced, rather precise communication that follows from writing also fails to occur.

What, then, is the consequence if writing code is abandoned? In my view, a large part of thoughtful code reading is also abandoned, which leads to an erosion of understanding. And the inevitable consequence of this, in my opinion, is that the codebase gradually begins to feel increasingly alien even to the developers.

Perhaps at the extreme, there is a situation where none of the developers would actually want to look under the hood anymore. In some sense, in this process, the company loses a more precise understanding of how the software it produces actually works.

Whether this is a good or bad situation probably depends on the application and the company. Does it matter if a software company knows how their product is built in a technical sense? Maybe, maybe not. But if it doesn’t matter, is it possible that at some point, as the size of the software grows, the ability to oversee and guide the agents’ activities will erode because the technical situational awareness has deteriorated?

Well, this is just a stream of consciousness. Let no one take it too seriously. :slight_smile:

12 Likes

So, in the future, will the situation be that a large part of the code is created, developed, fixed, and reviewed by AI? From all of this, the AI will then also create a memo, which a human (coder-god) will save as proof somewhere of their “work” without opening the document or checking what Tero AI and its agents have accomplished? The KPI developed by the company also measures efficiency (=speed), because its data collection can also be given to an agent more easily than starting to ponder how good the code is. Finally, there will be complete unawareness of how good the code is, and with a serious face, selling this output to the customer.

6 Likes

This is key with all tools - use them for the purpose they are best suited for. So, don’t hammer everything possible. AI tools have their place, and in these fields, they must be adopted because the artisanal programmer, who relies solely on their conviction, will eventually fall behind others in productivity.

The original questioner I was responding to wanted a general overview of AI coding, and I wrote my text with that in mind. Agent-driven coding is a hot topic right now, which is why I brought it up. The test-based approach is what I’m familiar with – it’s by no means new, having been around for several years, and test libraries (pre-built and custom) can be directly transferred as a metric for agent code. Writing about software development, by the way, is like scribbling a universal “learn to write” book. One reader jots down horoscopes, another churns out a mathematics dissertation, and a third draws a comic strip. You can’t please all target audiences (or cover all sub-segments) with the same text, because needs and starting points differ.

So, I don’t believe we are at opposite ends of the spectrum in our views. I also want human eyes to meticulously review the software code for something like a Boeing, even though, as is known, sub-par code has slipped through despite various checks. Ultimately, the customer determines what they are willing to pay for. As the saying goes in China, “we have quality for every wallet.”

8 Likes

At least currently, there are situations where code should be written 100% by AI (e.g., demonstrating UI logic) and situations where code should be written 100% manually (e.g., fighter jet control systems), but the vast majority of programming will fall somewhere between these extremes. Additionally, in real life, programming involves a lot more than just churning out code, and the quality of the code created for a system is critically important, so software developers’ jobs will not disappear in the future either.

I don’t see the issue with relying on AI as big as it’s often made out to be. Software development is already based on trust, as no one has time to examine every line of code in every software library, or, for example, to verify that the compiler used translates the code correctly for the computer to read. You just have to trust that things fundamentally work and that the tests used are good enough to catch unwanted bugs. While it’s not wise to work completely blindly, the overprotective, ultra-neurotic approach that European companies commonly adopt around these tools is not very sensible behavior in my opinion.

20 Likes

AI coding can be directly compared to the CNC/robot revolution in the metal industry, for example. Computer-controlled machining is always consistent in quality and significantly faster than manual work. They produce 99% of finished products. These machines still always have operators, even though theoretically they wouldn’t be necessary. It’s simply faster and ultimately more cost-effective to employ someone who genuinely knows the production line inside out than to risk long delays in problem situations. Then there’s the 1% of jobs that are still done by hand-turning. These are usually very small orders or works of art/status symbols where the manual craftsmanship itself has value.

The same will undoubtedly happen in coding. AI will churn out code quickly and consistently. The entire responsibility will then fall on the architect and the testing department to ensure the final product matches what was ordered.

Now that I think about it more closely, this has already been seen in coding: 99% of coders work with C or Python or some other higher-level language because it’s fast and cost-effective. Then there’s the 1% of hotheads who still happily code in machine language and complain when new languages take their jobs, even though the result is purely suboptimal.

11 Likes

I was referring specifically to the writing phase – of course, each of us has our own way of writing code, but at least I do the actual thinking before I start writing syntax, and routine tasks, in particular, are quite mechanical.

This also becomes more efficient when you can ask AI for a summary of the code’s functionality at the required abstraction level. It ultimately comes down to trust, which builds and grows the more you see AI performing a specific task flawlessly.

I think I completely understand your perspective; a few months ago, I myself doubted whether a language model could truly displace a skilled professional, or if it would make zoomer coders lazy, spreading sloppy “vibe code” without any understanding of the fundamentals, and in the long run, the artisans would prevail. However, at that time, I hadn’t yet grasped how these tools would reshape my job description: I have since become convinced that humans no longer need to be as knowledgeable about implementation and fundamentals as before, but with agents, we are moving to a new, higher level of abstraction. Programming as a profession, in the sense I have understood it during my career, is dead (or dying).

I work on web and mobile, and naturally, most of my code relies on OSS (Open-Source Software) libraries – if, as a programmer, I had to be aware of exactly what those libraries do behind the interfaces, I would inevitably fall behind my competitors, as I would be using my time quite inefficiently.

In the past, code was written directly onto punch cards sparingly and slowly, but with the increase in computing power, it was possible to move to higher levels of abstraction, coding faster and (in accordance with Jevons’ paradox) more wastefully. Although a modern programmer certainly understands firmware and computer operations much less than professionals of the last millennium, ultimately it hasn’t mattered much. Software is just a tool for writing applications, and high-level tools enable the programming of sufficiently functional applications even if one isn’t Kukko Pärssinen. This is certainly a sore point for many coders, and initially, I too mourned the fact that years of honing an artisan attitude suddenly didn’t achieve much. Perhaps some “code as an end in itself”-style artistry will emerge as a reaction, but that will remain a very small niche’s bearded babbling. The show goes on.

15 Likes

If @Markus_Hav would still visit this Inde forum as before, it would be interesting to hear if he has tried caveman:

Because everyone knows that Claude Code is too heavy and Anthropic cannot or does not want to fix the problem, people have started to make the simplifications he misses in Claude themselves. Caveman is probably the most promising of these recent projects :grin:

And yes, it’s just a simple skill, which proves his idea of how much room there is for optimization in Claude Code. But Anthropic doesn’t have to invent these themselves, as the community invents solutions for them and they just copy the best ones. Of course, since Anthropic makes money based on the number of tokens, there are some pretty perverse incentives in terms of optimization.

14 Likes

A classic example of this, of course, is SQLite, which has exceptionally comprehensive tests, yet bugs are still found regularly. The number of code lines in the tests is almost 600 times larger than the code lines of the library being tested, so we can talk about exceptionally large test coverage, and even that is not enough. (Of course, when the implementation language is good old C, shooting oneself in the foot is easy.)

2 Likes
5 Likes

Microsoft wants to sell AI agents their own licenses:

5 Likes

I’m a recently retired coder and have used numerous languages in my work, starting with assembler and ADA, and mostly C. Lisp was beyond my comprehension at one point, but otherwise, a language is a language, and we just go with it.

I admit that I fell off the AI bandwagon right at the end of my career. So I don’t know what those agents (whatever they may be…) are truly capable of.

I did, of course, try AI in a few small tasks, mainly related to scripting, in situations where I had to use a programming language unfamiliar to me. The results were quite mixed. The generated code certainly looked clear, but when I looked into it more closely (because I had to find out why it didn’t always work), it was completely clear that it wasn’t finished. It was also easier to make corrections myself than to try and explain to the AI what it should do. Often, the result was just a loop where it eventually suggested the original wrong solution.

However, I don’t belittle the possibilities of AI, because development seems to be so fast that my own experiences can be deemed outdated cases, and with today’s models, the outcome could be a function brought to working order faster for the needs in question.

Still, I am concerned about the direction of development. If, in the coming years, most code is generated and tested by AI, and no coder actually even reads through the generated code mass, how will the most difficult bugs to find be fixed?

Throughout my career, I fixed bugs made by others because others couldn’t find them. I certainly don’t claim that I always produced error-free code myself, but not many came my way, and when one occasionally did, it was terribly annoying… I never wanted to find the culprit of a bug, but to fix it and try to learn from it. I usually checked the entire software for similar bugs at the same time, for example, if it was about the wrong way to use a certain function.

I managed to find a large number of bugs only because I dared to question the operation of any function or other entity coded by anyone, until I was completely sure that it worked correctly and that the strange crashes and vague operation of that software entity could not be due to that particular part of the code.

Eventually, a suspicious piece of code was found somewhere, where error handling was not perfect, meaning it was possible that some called library function (which I trusted and/or its source code was not available) returned something unexpected in some special situation, which the calling code was not prepared for.

Then, when I improved the reaction to errors in that context, the result was often a more stable entity. The problems were such that I could not prove that the particular point was the root cause of the reported problems, but at least it seemed that there were ingredients for disaster at that point, if the timings of different threads in a multi-core CPU happened to align perfectly. The worst part was that in many cases the code was quite old and had worked for years, at least well enough, but when more powerful servers were introduced, parallelism increased and the probability of hidden timing problems or resource leaks increased significantly.

So the question is: how can AI find and fix or, fundamentally, prevent such highly probable problems if no experienced coder even wants to start going through the code with their own eyes? How is AI prompted to look for potential places in the code mass where something unexpected might happen?

In my own career, at least, creating code that seemingly worked and performed the desired operation was not a problem, but the refinement so that at least most potential error situations were handled at least somewhat controllably, was the majority of the actual work. That is, 20% for creating the actual functionality, 80% for testing and refinement. If AI now significantly speeds up that process, what then will be the proportion of the whole that is needed so that the software does not crash in some special situation?

I am always amazed by a truly annoying bug in the text editor component in Windows, which has always been there: the cursor occasionally jumps unexpectedly to the wrong line when editing a line in a particular way. I have just never managed to reproduce it controllably, but it always surprises me almost every time I write something longer. This time it didn’t happen once. So, has it finally been fixed… I doubt it…

27 Likes

Perhaps patricide, or the severing of spiritual ChatGPT compatibility, is too drastic a solution, and that’s why Claude permanently has the –chatter-like-a-Savonian flag enabled.

User feedback can sometimes lead to surprising results in the features of the final product. Dish soap has to bubble to seem effective in the eyes of the dish washer. Probably some AI users suspect a faster-acting language model to be dumber, meaning it doesn’t feel like it’s “thinking” enough. A long reasoning time is therefore associated with better quality, because it was adopted (back then) as a feature of an advanced product.

Slowing down wouldn’t be a bad thing, though. The AI craze has already reached a critical point with the accelerating pace of releases. The “Slow the f#ck down” article is a good wake-up call for the entire industry on how easily things can go wrong if, in the spirit of Top Gear, mere (release) speed drives operations and FOMO bosses measure and promote their company’s activities with AI code lines.

It’s worth a read.

Coding agents are sirens, luring you in with their speed of code generation and jagged intelligence, often completing a simple task with high quality at breakneck velocity. Things start falling apart when you think: “Oh golly, this thing is great. Computer, do my work!”.

9 Likes

Answer: when such a bug is detected, and AI can’t find the cause, a human coder is put to go through the code as before.

But we are indeed living in interesting times; it remains to be seen what the job description of coders will be like in the future. 1% coding, 10% prompting, and 89% debugging?

In principle, AI seems to be quite good at finding bugs too, as exemplified by the thousands of 0-day vulnerabilities, some of which have existed for decades unnoticed until now AI models have found them. But it remains to be seen how it works in practice.

7 Likes