ChatGPT, other language models and AI

Some more fun brainstorming sessions with AI. The theme is the currency of the future and the division of people into two AI groups:

  • Fiat and cryptocurrencies would lose their significance, as in the future, the only currency would be what buys computing time from AI—i.e., tokens. If you want AGI-controlled robot workers to build you a new house, you would need a stack of tokens instead of euros to pay the AI company (or the state) for the design, construction, and materials.
  • People would have to choose between two “countries” to live in based on privacy:
    • Either you would give up your privacy and receive tokens as a universal basic income without doing anything, but the amount would be limited and you couldn’t accumulate more by doing more or better work. Would people bother doing anything if tokens come in regardless?
    • Or you would move to a community in the countryside that has its own locally-run AGI powered by solar energy, but then you would have no fixed income at all. You would have to earn tokens through your own hard work and increase your income by inventing better things that you can build with AGI and sell to others for additional tokens.

I chuckled a bit, as this reminds me of a certain country…

3 Likes

It’s easy and fun to bounce ideas around with AI. A few months ago, I mentioned a project here that was on the verge of being ready for testing, and yet here I am, still working on it. It’s nice to create and build new features. It’s easy to forget about old debt and take on new debt.

Over the weekend, I added an MCP server to the project and tried it out through Claude (inspired a bit by Inderes’ MCP, but of course, it was already in the code :wink:). The data currently includes vectorized 2025 reports for US companies (approx. 8,000 companies) and company figures from 2020–2025. On top of that, Finnish and Swedish companies.

Man, this is fun!

14 Likes

An interesting post from Aaron: https://x.com/i/status/2048950940661932319

His point, in a nutshell, is that humans are always needed for the final touches and guidance. The goalposts of work are constantly shifting, and even if AI could handle 99% of current tasks, in the future, that might only represent, say, 50% of the required input. The reasoning is that the market always expects more and better results, and when the most generic parts can be executed quickly and cheaply with AI, the bottleneck shifts elsewhere, and human input will never be completely phased out. It follows Jevons’ paradox.

I’ve envisioned a similar future myself; it remains to be seen how the development of AI tools continues and whether the market will eventually saturate so heavily that there is no longer such a great need for humans, at least in the software industry.

8 Likes

A news story gained traction on Hacker News over the weekend, reporting that an Erdős problem was solved by giving it to ChatGPT with a very simple prompt. The solver was an amateur in their twenties with no formal university-level mathematical education. Just a regular guy, basically.

5 Likes

There wasn’t really anything new in this per se; this has been done before for Erdős problems. What was new was that mathematicians had been thinking about this problem and hadn’t come up with a solution. ChatGPT came up with a novel approach that actually solved the problem. This seems to be completely new and is a rather significant matter.

7 Likes

I came across a phenomenon today that is starting to emerge internationally among sellers, buyers, and consultants.

When a B2B negotiation takes place on Teams or Zoom, the buying side can record the meeting (or use automatic transcription), feed the seller’s claims into an AI in real-time for verification, and analyze the recording afterward, ensuring the seller’s promises are in black and white. For example, a buyer could check if Huikkanen’s talk from two weeks ago is consistent with today’s chatter.

The revenue of many listed companies relies significantly on B2B sales, where the seller’s ability to “sell the vision” has traditionally been key. If buyers begin to systematically verify claims on the fly, it will change sales dynamics and likely the structure and costs of sales organizations.

Gong is an example of such a tool on the sellers’ side, but now the same logic could work in reverse: the buyer analyzes the seller. I noticed this phenomenon by chance today while I was trying to sell an idea to a client.

Do you have personal experience with this, either as a seller or a buyer? Which listed companies’ sales models are most vulnerable to this change? Is this a threat or an opportunity?

14 Likes

I came across this article from a week ago regarding AI monetization models and their sustainability.


Gartner’s Sommer studies long-term economic market trends related to generative AI, including calculating just how much money is at stake. Between 2024 and 2029, he said, Gartner estimates that capital investment in AI data centers will reach about $6.3 trillion — a “massive amount of money.”

To avoid a write-down of these assets, major AI model providers would ideally generate a return on invested capital (ROIC) of about 25 percent, Sommer said. (That’s about what Amazon, Microsoft, and Google tend to earn on their overall capital investments.) On the other hand, if the returns fall below 12 percent, institutional capital loses interest — there’s better money elsewhere, Sommer said. Below 7 percent, you’re in write-down territory, which is “an unmitigated disaster for all of the investors in this technology,” Sommer said.

To reach that bare minimum of 7 percent, Gartner forecasts that large AI companies would need to earn cumulatively close to $7 trillion in AI-driven revenue through 2029, which is close to $2 trillion per year by the end of the period. In order to achieve “historic returns,” the providers would need to earn nearly $8.2 trillion in the same period.

OpenAI has already made $600 billion in spending commitments through 2030, the company said in February, which Sommer says is already a “massive step down” from the $1.4 trillion it had planned before. Based on OpenAI’s revenue forecasts and potential compound annual growth, Sommer said that even in the best-case scenario, he predicts that the lab would only hit a fraction of the overall spend required to hit that 7 percent ROIC.

To hit investors’ revenue expectations, providers would need to process a “mind-bending” number of tokens, Sommer said.

By most measures, companies’ numbers are already pretty big. Google announced it was processing 1.3 quadrillion tokens in October, for instance. If you add all the providers’ estimates up, Sommer said, you get 100 to 200 quadrillion tokens a year. But to achieve the $2 trillion in annual spend Gartner calculated, providers would need to be generating, by conservative estimates, a cumulative 10 sextillion tokens per year. (To make that slightly less abstract, a quadrillion has 15 zeros, and a sextillion has 21.) Even assuming a very generous profit margin of 10 percent per token, that would mean that token consumption between now and 2030 would need to grow by 50,000–100,000x.

There is plenty of other interesting content in the article, but the 50,000–100,000x growth figure at the end of the quote particularly caught my eye. Unless the Gartner analyst’s calculations are off by several orders of magnitude, achieving a 50,000x – 100,000x increase in token consumption by 2030 seems outright impossible.
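
As a quick sanity check, the multiplier can be reproduced from the token volumes quoted above. This is only a sketch using the article's round numbers; the variable names are my own.

```python
# Token volumes as quoted from the article; names are illustrative.
QUADRILLION = 10**15
SEXTILLION = 10**21

current_low = 100 * QUADRILLION    # tokens/year today, low end of the estimate
current_high = 200 * QUADRILLION   # tokens/year today, high end of the estimate
required = 10 * SEXTILLION         # tokens/year needed for ~$2 trillion in annual spend

growth_min = required // current_high  # measured against the high estimate
growth_max = required // current_low   # measured against the low estimate

print(f"required growth: {growth_min:,}x to {growth_max:,}x")
```

So the 50,000–100,000x range follows directly from dividing 10 sextillion by 100–200 quadrillion.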

If we consider how much additional computing capacity is expected for AI inference, one estimate for growth between 2025 and 2030 is provided by JLL (2026 Global Data Center Outlook):

The data center sector is projected to increase by 97 GW between 2025 and 2030, effectively doubling in size over a five-year period. By 2030, global data center capacity could reach 200 GW. This rapid growth will be driven largely by hyperscale cloud expansion and AI demand.

In the report, the share of AI inference is estimated as follows:

From this graph and the more detailed figures presented above, one can quickly calculate that, measured in watts, the inference capacity available in data centers in 2025 was approx. 9% * 97 GW = 8.7 GW, and in 2030 it is expected to be approx. 37% * 200 GW = 74 GW. Thus, the growth factor is 74 GW / 8.7 GW = approx. 8.5. We could perhaps optimistically round this up to a 10x figure.
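
The same arithmetic as a small script, in case someone wants to plug in different shares. The inputs are exactly the figures used in the paragraph above; the variable names are illustrative.

```python
# Inputs as used in the post (inference shares read from the JLL chart).
capacity_2025_gw = 97.0
capacity_2030_gw = 200.0
inference_share_2025 = 0.09
inference_share_2030 = 0.37

inference_2025_gw = inference_share_2025 * capacity_2025_gw
inference_2030_gw = inference_share_2030 * capacity_2030_gw
power_growth = inference_2030_gw / inference_2025_gw

print(f"2025: {inference_2025_gw:.1f} GW, "
      f"2030: {inference_2030_gw:.1f} GW, "
      f"growth: {power_growth:.1f}x")
```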

This, of course, only measures growth through power consumption. If we think about how many tokens 2030 hardware can produce, we have to try to estimate the growth in compute efficiency (flops/W) and the development of compute required by models (flops/token).

To support this estimation, NVIDIA provides the following rough figures for the last couple of GPU generations:

NVIDIA Blackwell is the correct choice when electricity dominates TCO because Blackwell Ultra (GB300 NVL72) delivers up to 50x higher throughput per megawatt versus Hopper resulting in 35x lower cost per million tokens, while Blackwell (GB200 NVL72) delivers 10x throughput per megawatt and 15x lower cost per million tokens versus the prior generation, together producing the highest revenue per kilowatt-hour of any independently benchmarked inference platform.

One significant component in the achieved efficiency improvements, alongside others, has been the shift to lower-precision quantization (fp4), which allows for computation on simpler hardware. Regarding quantization, we’ve likely reached the end of the road for the most part, but perhaps other similar architectural improvements can be found and implemented over the coming years. Let’s extrapolate then, that the compute capacity installed in 2030 will be approx. 50x more energy-efficient (measured in flops/W required for AI computation) compared to what is currently available.

The amount of compute required by models (flops/token) seems to have grown sharply in recent years as model sizes have increased. However, if we optimistically assume that despite the expected growth in model size, the required amount of compute remains the same even in 2030 (as model optimization might allow this despite size increases), we can set the multiplier for this to 1.

So, with the 2030 compute capacity, data centers could process approximately 10 * 50 * 1 = 500 times the amount of tokens for the AI models of that time, relative to how many tokens can be processed for current AI models right now.

At first glance, a 500-fold increase sounds like a good addition, but then again, the Gartner analyst assumed a 50,000x – 100,000x volume of tokens to be processed for the cost calculations behind data centers to be on a sustainable footing. In this quickly calculated 500x capacity, there is still a deficit equivalent to a 100x–200x additional multiplier; in other words, we are off by a couple of orders of magnitude. To get the numbers to line up, we would need either significantly more data center capacity than predicted, significantly more energy-efficient hardware, or much more optimized models. And likely a combination of all of these. Or alternatively, the token multipliers behind those cost calculations would need to be brought down significantly.
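
Put together as a rough sketch (all three factors are the back-of-the-envelope assumptions from this post, not measured values):

```python
# Back-of-the-envelope factors from the post; every one of them is an assumption.
power_growth = 10             # inference power 2025 -> 2030, rounded up from ~8.5x
efficiency_growth = 50        # flops/W improvement, extrapolated from NVIDIA's claims
flops_per_token_growth = 1    # compute per token assumed unchanged

# Tokens scale with power and efficiency, and inversely with compute per token.
capacity_multiplier = power_growth * efficiency_growth / flops_per_token_growth

required_low, required_high = 50_000, 100_000   # Gartner-derived growth range
gap_low = required_low / capacity_multiplier
gap_high = required_high / capacity_multiplier

print(f"capacity: {capacity_multiplier:.0f}x, "
      f"remaining gap: {gap_low:.0f}x to {gap_high:.0f}x")
```

Changing any single factor by 10x still leaves at least one order of magnitude uncovered, which is why a combination would be needed.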

Well, it’s possible that while drinking my morning coffee, I missed something essential, as I didn’t even manage to get into the right ballpark.

20 Likes

I watched a presentation a while ago in which Norges Bank Investment Management explains all the different ways they use AI. Some of these methods are similar to what you’re talking about. A good presentation overall, with plenty of practical examples.

2 Likes

Great post! And Gartner’s estimate is indeed very aggressive—especially if you try to reverse engineer it based solely on electricity and hardware. In that case, that +100x order-of-magnitude gap would certainly be real.

But there is probably no reason to assume that the 2025 token and its associated throughput process will remain constant until 2030.

If you cumulatively factor in, for example, MoE sparsity or specialized models with routing (activating only a small part of trillion-parameter models or using a fully downstream-specialized model), this alone might close that 100x gap. Add on top of that speculative decoding (i.e., a small model generates candidate tokens and a large model verifies them) and KV-cache/context optimizations in serving frameworks like SGLang, and suddenly those couple of orders of magnitude start being bridged.
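
For anyone unfamiliar with the draft-and-verify idea mentioned above, here is a toy sketch of its greedy variant. Both "models" are made-up stand-in functions (the draft simply copies the target and is deliberately wrong at some positions), and production systems use probabilistic acceptance rather than exact matching, so treat this as an illustration of the control flow only.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a token prefix to the next token


def target_model(prefix: List[int]) -> int:
    """Stand-in for the large, expensive model (toy deterministic rule)."""
    return (sum(prefix) * 31 + len(prefix)) % 50


def draft_model(prefix: List[int]) -> int:
    """Stand-in for the small model: agrees with the target except at some positions."""
    guess = target_model(prefix)
    return guess if len(prefix) % 4 else (guess + 1) % 50


def greedy_decode(model: Model, prompt: List[int], n_new: int) -> List[int]:
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: Model, draft: Model, prompt: List[int],
                       n_new: int, k: int = 4) -> Tuple[List[int], int]:
    seq = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(seq) < goal:
        # 1) The draft model proposes up to k tokens one by one.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks every proposed position. A real system scores
        #    all of them in one batched forward pass; here it is a plain loop.
        target_passes += 1
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if tok != expected:
                # First mismatch: keep the accepted prefix plus the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            seq.extend(proposal)  # every proposed token was accepted
    return seq, target_passes


prompt = [1, 2, 3]
baseline = greedy_decode(target_model, prompt, 12)
output, passes = speculative_decode(target_model, draft_model, prompt, 12)
print(output == baseline, passes)
```

The point is that the output is identical to what the large model would have produced alone, while the number of sequential large-model passes drops whenever the draft guesses correctly.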

Current baseline usage is still largely interactive premium inference: a human writes a prompt and a large dense model generates a response with a relatively high latency and quality budget. Such usage is expensive and quite inefficient.

If, however, 2030 inference looks more like internet backend traffic (communication between agents, background inference over documents, video, sensor data, and business processes), then a token will no longer correspond to the current GPT-5-type expensive frontier token.

9 Likes

Good point.

I’ll clarify, however, that The Verge article was from only about a week ago, so (admittedly with limited basis) I assume its estimates for token volumes and related multipliers are quite current. The year 2025 was included mainly because I was looking for a good reference for recent estimates of data center inference capacity. I found figures for 2025, but not a similar benchmark for 2026. (Consequently, there might be a bit of extra optimism in the data center capacity growth multiplier, as the multiplier was calculated based on 2025 rather than the current situation.)

Correct me if I’m wrong, but it seems most frontier models (with the possible exception of Anthropic’s models) have been using the MoE approach to cut inference costs since roughly the first half of 2025. Therefore, I suspect that the multipliers outlined across the entire AI industry in The Verge’s piece should be thought of as building on top of the MoE models in use around the turn of 2025/2026. For example, Google’s token estimates from October 2025 are mentioned separately, and at least by then, Google was already heavily utilizing MoE.

Is it foreseeable that by fine-tuning current MoE models, another 100x optimization in computational needs could be squeezed out on top of what is currently in use? There are presumably some limits to how small individual expert networks can be partitioned without quality suffering. (Or, alternatively, one ends up running several invocations of smaller experts in parallel, etc.)

6 Likes

This was good! Unfortunately, in many places it’s not clear whether the discussion is about fixed-price subscriptions or API inference. One often hears that API inference prices will skyrocket once enough people have been hooked. I personally don’t believe there’s significant room to raise API prices unless the models’ capabilities grow in the same proportion. The text mentioned open models, and I spent a few weeks collecting data specifically related to this. Currently, the trend seems to be that when a new closed SOTA model is released, a roughly equally good open model enters the market about 6 months later, costing only a fraction of what the closed SOTA model costs. I therefore think it will be difficult to significantly raise API prices unless open models are left far behind.

Below are a few visualizations based on the data I collected. The arena.ai scores have been used as a benchmark, as there isn’t a single benchmark that would be relevant for the entire period. arena.ai scores are also harder to “benchmax”, although they do have their own issues.

Here is a graph visualizing how long it takes from the release of a closed SOTA model for an open model to reach the same level (y-axis).

Here, on the other hand, are the prices of the models themselves with an 80/20 ratio of input/output tokens. Note: logarithmic y-axis.
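
For reference, a minimal sketch of how a blended 80/20 price and the closed/open ratio can be computed. The GPT-4 list prices used here ($30 input, $60 output per million tokens) are my assumption for illustration and are not stated in the post; the function name is made up.

```python
def blended_price(input_per_mtok: float, output_per_mtok: float,
                  input_share: float = 0.8) -> float:
    """Price of one million tokens when `input_share` of them are input tokens."""
    return input_share * input_per_mtok + (1.0 - input_share) * output_per_mtok


# Assumed GPT-4 launch list prices (not from the post): $30 in, $60 out per Mtok.
closed_price = blended_price(30.0, 60.0)
open_price = 0.27  # blended open-model price from the GPT-4 row of the table
ratio = closed_price / open_price

print(f"closed: ${closed_price:.2f}/Mtok, ratio: {ratio:.1f}x")
```

With those assumed inputs the result lines up with the GPT-4 row ($36.00 and 133.3x).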

Here it is in table format:

| Closed SOTA model | Released | Arena bar | Matched by | Lag (mo) | Status | Closed $/Mtok | Open $/Mtok | Ratio |
|---|---|---|---|---|---|---|---|---|
| GPT-4 | 2023-03-14 | 1286 | gemma-2-27b-it | 15.5 | matched | $36.00 | $0.27 | 133.3x |
| GPT-4 Turbo (1106) | 2023-11-06 | 1312 | mistral-large-2407 | 8.6 | matched | $14.00 | $2.80 | 5.0x |
| GPT-4 Turbo (0125) | 2024-01-25 | 1313 | mistral-large-2407 | 5.9 | matched | $14.00 | $2.80 | 5.0x |
| Claude 3 Opus | 2024-03-04 | 1321 | llama-3.1-405b-instruct-bf16 | 6.4 | matched | $27.00 | $3.50 | 7.7x |
| GPT-4 Turbo (Apr 2024) | 2024-04-09 | 1324 | llama-3.1-405b-instruct-bf16 | 5.2 | matched | $14.00 | $3.50 | 4.0x |
| GPT-4o | 2024-05-13 | 1345 | deepseek-v3 | 7.6 | matched | $7.00 | $0.44 | 16.1x |
| o1-preview | 2024-09-17 | 1388 | deepseek-r1 | 4.2 | matched | $24.00 | $0.88 | 27.3x |
| o1 | 2024-12-30 | 1402 | deepseek-r1-0528 | 5.6 | matched | $24.00 | $0.88 | 27.3x |
| GPT-4.5 preview | 2025-03-03 | 1444 | kimi-k2.5-thinking | 10.9 | matched | $90.00 | $0.98 | 91.8x |
| Gemini 2.5 Pro | 2025-06-24 | 1448 | kimi-k2.5-thinking | 7.2 | matched | $3.00 | $0.98 | 3.1x |
| Claude 4.1 Opus (thinking) | 2025-08-14 | 1449 | kimi-k2.5-thinking | 5.5 | matched | $27.00 | $0.98 | 27.6x |
| Claude 4.5 Sonnet | 2025-10-01 | 1453 | glm-5 | 4.3 | matched | $5.40 | $1.44 | 3.8x |
| Claude 4.5 Sonnet (thinking) | 2025-10-03 | 1453 | glm-5 | 4.3 | matched | $5.40 | $1.44 | 3.8x |
| Gemini 3 Pro | 2025-11-16 | 1486 | | 5.4+ | still leading | $4.00 | | |
| Claude 4.6 Opus (thinking) | 2026-02-06 | 1501 | | 2.7+ | still leading | $9.00 | | |
| claude-opus-4-7-thinking | 2026-04-17 | 1503 | | 0.4+ | still leading | $9.00 | | |

13 Likes

I personally see it such that organizations (software houses in particular) become so hooked on specific vendors that they will tolerate price increases to a significant degree as long as the results are good, because switching vendors is challenging, especially for large companies. Smaller outfits, of course, move more agilely and may be more prone to optimizing costs.

But absolutely interesting observations in this thread indeed, there’s plenty of food for thought here – thanks to everyone!

6 Likes

What’s the vendor lock-in here? In traditional software, lock-in occurs when all data, processes, and employee expertise are tied to, say, SAP or Salesforce. For now, switching LLM models is like changing socks. Claude had a brief head start when the good tooling built around Claude Code provided an advantage. Well, even that code was leaked by mistake and Open Source folks clean-roomed a competing implementation.

8 Likes

If we are talking about tools like Claude, there are certainly opportunities there to push through aggressive price increases. On the Claude side, Enterprise customers have already been forced into API pricing instead of a fixed subscription, which increases costs at least tenfold for power users. Neither OpenAI nor Anthropic offers those most expensive fixed-price subscriptions for team/enterprise users.

But if we are talking about pure API inference, the risks for this are much lower. If an organization’s agents are running in, for example, Google’s, Amazon’s, or Microsoft’s cloud, inference for open models is also available from all of them. Of course, cloud services could raise the prices of all models, but vendor lock-in is still significantly lower because switching providers is reasonably easy and usually would not cause any visible change for the organization’s employees.

7 Likes

This is indeed an interesting observation. My own subjective feeling has been that open-source models lag at least a year or more behind the best models. But based on benchmarks, that doesn’t seem to be the case. The flaw in my impression is, of course, that my experience with open-source models has mostly come from running them locally (or offline), and I haven’t used a proper GPU cluster, but rather setups assembled from high-end consumer hardware. In these cases, you naturally have to run heavily quantized (compressed) models, which probably gives a slightly wrong impression of the state of things.

From the last sentences of that Verge article:

Therefore, Sommer said, a sustainable business model “would require that genAI be infused in everything from billboards to checkout kiosks,” with providers taking a cut of all of those transactions.

When you combine this with your observation, it makes me wonder whether this AI business is on a sustainable footing for those big players producing closed models. In other words, are we potentially in the heart of the famous AI bubble?

The fact that a sustainable revenue model requires 50,000 – 100,000x more token trade might not only cause headaches on the token production side, but in the spirit of that article, markets must also be found where 100,000x more tokens can be sold than now. This means significantly more users must be found—to put it bluntly, AI would have to be integrated almost everywhere.

And even if markets were found for that amount or more, open-source models are breathing down the necks of these closed models. Once a closed model reaches an “acceptable” level for a certain use, an open-source model will be capable of the same shortly after. Thinking this way, there is a small window of opportunity where early adopters benefit from the better capabilities of a closed model, but after a while, deployment for the “masses” can then be done based on open-source models. In this case, inference costs are practically just the price of acquiring and maintaining infrastructure plus some small premium, and it may be that in the long run, it is indeed difficult for a closed-model manufacturer to earn more than this from inference?

If things go this way, will today’s big closed-model producers find enough buyers for their (more expensive) tokens?

I was wondering the same thing. An example is this AMD case from a few weeks ago, where they filed a bug report with Anthropic listing their grievances and switched providers:

“We have switched to another provider which is doing superior quality work, but Claude has been good to us, and we are leaving this in the hopes that Anthropic can fix their product,” Laurenzo explained, while declining to go into details in a comment citing NDAs about whatever new tool her team is using. That said, Laurenzo did warn Anthropic that it’s still early in the AI coding game and Anthropic is looking at giving up the top spot if its behavior continues.

10 Likes

I guess there isn’t really any actual vendor lock-in, I’m just more skeptical about large organizations’ willingness/ability to change tools they’ve found to be good “like socks.” Of course, I might be underestimating large companies – @Eevitsi already pointed out that at least AMD seems to be able to switch service providers quite agilely on the fly. Maybe I’ll eat my words soon anyway if the Anthropics and OpenAIs of the world don’t manage to keep their lead over generic open-source LLM operators.

Does that 50k-100k multiplier need to come solely from increased volume? Couldn’t more tokens be processed at the same cost (if energy were the biggest expense) with more efficient hardware and software?

If currently (numbers pulled out of a hat) a data center processes a billion tokens consuming a megawatt-hour and the profit is $100, a tenfold return, $1,000, is achieved either by:

  1. raising the token price tenfold (and hoping demand doesn’t dry up)
  2. selling 10 billion tokens at the old price
  3. finding a way to process those billion tokens with 0.1 megawatt-hours (assuming the most significant cost in processing tokens is the energy consumed)
  4. a combination of the previous options

Or was the assumption that hardware and software become more efficient already baked into that 50k-100k estimate?

2 Likes

It might also matter what industry the company is in? In the software sector, the threshold might be lower due to the nature of the field, as long as a new model can sufficiently do the same as the old one. But then if we take some local auto repair shop like “Jaska’s” (once AI starts being used there), there’s probably inertia in switching if the new system requires a certain amount of work before it’s back in working order in the same way as the old one. Well, I could be wrong.

It’s indeed a good question what Gartner’s estimate is based on. Fundamentally, as I see it, the cost of producing tokens is estimated to be significantly cheaper in 2030 than it is today. And then the idea is that there can’t be an insane premium on inference relative to production costs.

Statements from the same analyst from about a month ago:
https://www.gartner.com/en/newsroom/press-releases/2026-03-25-gartner-predicts-that-by-2030-performing-inference-on-an-llm-with-1-trillion-parameters-will-cost-genai-providers-over-90-percent-less-than-in-2025

AI tokens are the units of data that GenAI models process. For the purposes of this analysis a token is 3.5 bytes of data, or approximately 4 characters.

“These cost improvements will be driven by a combination of semiconductor and infrastructure efficiency improvements, model design innovations, higher chip utilization, increased use of inference-specialized silicon, and application of edge devices for specific use cases,” said Will Sommer, Sr. Director Analyst at Gartner.

As a result of these trends, Gartner forecasts LLMs in 2030 will be up to 100 times more cost-efficient than the earliest models of similar size developed in 2022.

At least back then, it was assumed that between 2022 → 2030, a 100x improvement in cost-efficiency would be achieved thanks to the development of hardware, models, and other infrastructure. I can’t say to what extent electricity prices, HW prices, and related changes have been accounted for in those calculations. There are upward pressures on both. Regarding HW, prices have already run high in many respects, but for electricity, this will show up if data centers don’t produce their own power and rely on local infrastructure/production.

One could get a more comprehensive article from there if one were a subscriber. It might explain the background of the calculation better.

However, the estimate in the linked Verge article can also be approached from the perspective that it assumes an annual sales price of 2,000 billion dollars ($= 2 \cdot 10^{12}$) for $10^{22}$ tokens. This amount of tokens corresponds to $10^{16}$ bundles of a million tokens, so the price becomes $\frac{2 \cdot 10^{12}\ \$}{10^{16}\ \text{MTok}} = 2 \cdot 10^{-4}\ \frac{\$}{\text{MTok}} = 0.0002\ \frac{\$}{\text{MTok}}$.

This is really much more affordable compared to current lower-end costs. For example, Deepseek v4 offers inference (Models & Pricing | DeepSeek API Docs) at a price of $0.14/MTok (input) and $0.28/MTok (output). For cached tokens, the price is $0.0028/MTok.

If one assumes that there isn’t much margin in Deepseek’s inference prices, then Gartner’s estimate implies tokens priced at about 1/1000th of current prices (and still about 1/10th of the price of cached tokens), so that “100x cheaper inference costs between 2022 → 2030” is still something different from what’s at the base of these estimates. The gap would shrink to about 100x only if that 2,000 billion were the bottom-line share (the 10% profit margin) rather than revenue.

However, I interpret this to mean how much is spent on buying tokens annually, not how much of it remains below the line.
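
The comparison can be checked with a few lines. The revenue and token figures are the ones from the Verge article and the DeepSeek prices are the ones listed above; the variable names are illustrative.

```python
annual_revenue_usd = 2e12   # ~$2 trillion per year, read as revenue
annual_tokens = 1e22        # 10 sextillion tokens per year

# Implied price per million tokens.
implied_usd_per_mtok = annual_revenue_usd / (annual_tokens / 1e6)

# DeepSeek list prices in $/MTok, as quoted in this post.
deepseek_usd_per_mtok = {"input": 0.14, "output": 0.28, "cached input": 0.0028}

# How many times more expensive each current price is than the implied one.
ratios = {kind: price / implied_usd_per_mtok
          for kind, price in deepseek_usd_per_mtok.items()}

print(f"implied price: ${implied_usd_per_mtok:.4f}/MTok")
for kind, ratio in ratios.items():
    print(f"{kind}: about {ratio:,.0f}x the implied price")
```

If the $2 trillion were instead a 10% profit share, the revenue (and thus the implied price) would be ten times higher, and every ratio would shrink by a factor of ten.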

I suppose a reasonable suspicion arises that the analyst’s estimates might also have one zero too many or too few (depending on which way you look at it). More detailed assumptions for those calculations would need to be known. There might be good explanations for the observed differences, but based on just a couple of figures, it’s hard to guess. After all, an astronomical number of zeros are flying back and forth in these figures.

4 Likes

If we take a Gartner-esque hand-waving break here, we can look into the future and outline (hallucinate?) an “AI ecosystem” where, like bacteria, AI models specialize in handling specific types of data and being good at it.

IBM’s Granite (Apache-licensed) emphasizes its focus on tool use and scraping content from tables and other paper-based sources. These specialized models can evolve to be very efficient in their narrow field of expertise, and compared to generalist models, they (allegedly) achieve better quality and speed with lighter resource requirements.

For the developer, investing in such niche capabilities can be very profitable, as the competition is fought on value-add rather than maximum context windows or other expensive computation requiring SOTA hardware. Switching away from such a model is likely the hardest, as the differences between providers in a narrow segment can be significant.

A future listed company could very well be an MCP operator (Model Context Protocol operator) – a sort of Yellow Pages for these language models – that holds agreements with the best language models for specific business areas.

7 Likes

I’ve made similar observations myself, and although the frameworks and ecosystems built around SOTA models are still far ahead of open-weight models, I’m often amazed by the current performance of near-free models. One massive obstacle to price increases that is too rarely brought up in discussions is the ‘good enough’ cliff. When a model is good enough for a customer’s specific use case, there is no longer any reason to pay even a cent extra for a more expensive model, and the race for revenue shifts to who can produce these ‘good enough’ models at the lowest possible price. It’s sexy in the media to compete over who has the biggest and most capable (model), but the use cases for which these are mostly used today don’t actually require a SOTA model. The small portion of high-paying customers who genuinely need the AI equivalent of a sports car are obviously not significant enough to fund this entire circus.

I would even argue, somewhat controversially, that we have already secretly reached this situation and hit that ‘good enough’ cliff, as Anthropic and OpenAI constantly degrade the user experience in a desperate attempt to manage reckless cash burn. If corporations were previously satisfied with GPT 5 for basic use, I don’t see why an open-weight model like GLM 5.1 wouldn’t serve the exact same purpose. It’s a total waste of money to pay for a SOTA model for a use case where it isn’t needed.

9 Likes

An interesting article from researchers at Meta, Stanford, and Harvard, where they set AI to build software from scratch.

[!quote]
ProgramBench: Can Language Models Rebuild Programs From Scratch?

Abstract

Turning ideas into full software projects from scratch has become a popular use case for language models. Agents are being deployed to seed, maintain, and grow codebases over extended periods with minimal human oversight. Such settings require models to make high-level software architecture decisions. However, existing benchmarks measure focused, limited tasks such as fixing a single bug or developing a single, specified feature. We therefore introduce ProgramBench to measure the ability of software engineering agents to develop software holistically. In ProgramBench, given only a program and its documentation, agents must architect and implement a codebase that matches the reference executable’s behavior. End-to-end behavioral tests are generated via agent-driven fuzzing, enabling evaluation without prescribing implementation structure. Our 200 tasks range from compact CLI tools to widely used software such as FFmpeg, SQLite, and the PHP interpreter. We evaluate 9 LMs and find that none fully resolve any task, with the best model passing 95% of tests on only 3% of tasks. Models favor monolithic, single-file implementations that diverge sharply from human-written code.

Even though the task assignments ranged from relatively easy to extremely difficult, not a single model completed a single task by passing all test cases. The end results were also monolithic piles of code that differ significantly in structure from the content of a corresponding human-written code repository.

[!quote]
To bridge this gap, we introduce ProgramBench, a benchmark that challenges software engineering (SWE) agents to produce code that recovers the functionality of a software program (e.g., executables, .dmg’s, .pkg’s). Given a program and documentation, a SWE-agent, defined as an LM equipped with an agent scaffold to interact with a terminal environment (Yang et al., 2024a), must write source code and a compile script that reproduces the original program’s behavior. Every software design decision is entirely the model’s to make.

In the study, AI agents were given the program’s documentation and a working example program as input. The goal was to produce the source code for a program implementing equivalent functionality based on this and to compile the actual program from it.

The statistics for the code repositories underlying the tasks were as follows:

The median size for the codebases underlying the tasks was quite small, about 8,600 lines. The smallest program had 212 lines of source code. The largest ones, of course, are then very large.

The success in creating a program matching the test cases varied by model as follows:

The visualization of passed test cases across tasks and models looks like this:

In the smallest tasks, more tests were passed; in the larger ones, significantly fewer; and the most difficult tasks resulted in practically zero success.

The article also contains interesting statistics on how the development of the codebase progressed at different stages for different models: GPT 5.4 produced most of the codebase in a single edit, while Anthropic’s models and Gemini progressed in slightly smaller increments.

This evidence (as well) suggests that agentic coding is still at a level where fully automated “code factories” cannot be built on top of current solutions, except in marketing pitches.

22 Likes