Patient Comet · Economics

LLMflation

A company spent $500 million on Claude in a single month. Not through carelessness. Through AI doing exactly what it was supposed to do. This article is about the economic pattern behind that bill, and why every team managing AI spend will recognise it.

Nadim A. Massih · 1 June 2026 · 11 min read

The Bill That Changed the Conversation

In May 2026, the number that changed every CFO conversation about enterprise AI surfaced quietly. It did not make every front page.

An unnamed company had run up a $500 million bill on Anthropic’s Claude in a single month. Not through negligence, exactly. Through deployment. They had given their entire workforce access to the model, with no usage caps and no governance. Agentic workflows compounded the problem. These are automated sequences where the AI plans, acts, reads the result, and acts again, re-sending the full conversation history at every step. What looked like a modest per-query cost had turned into a figure that exceeded most companies’ entire annual technology budget (Tom’s Hardware / Fast Company, May 2026).

Microsoft’s story broke around the same time and was more specific. The company had invited thousands of engineers in its Experiences and Devices division to use Claude Code starting in December 2025. By May 2026, the tool had become the preferred choice inside the division. Token costs per engineer began outpacing what Microsoft paid those engineers in salary. The company set a June 30 cutoff and redirected engineers to Copilot CLI (The Next Web, 2026).

Uber had got there first. Its operations chief told The Information in April that the company had burned through its entire 2026 AI coding budget in four months: roughly 5,000 engineers on Claude Code, with the heaviest users spending $500 to $2,000 each per month (Tom’s Hardware, 2026).

None of these are outliers. They are case studies in a structural problem: the price of AI has collapsed, so every new use case looks affordable. Then you do the arithmetic on volume.

The term for the dynamic is LLMflation. a16z (Andreessen Horowitz, a major Silicon Valley venture fund) coined it in 2024 for the rapid fall in the price of AI inference; this article uses it for the resulting paradox, where AI costs per call keep falling while AI bills keep rising. A model good enough to pass a standard knowledge test cost roughly $60 per million tokens (each token is roughly three-quarters of a word) in late 2021. By 2024 it was $0.06, a fall of roughly 1,000× in three years, faster than the PC era’s decline in compute (a16z, 2024). But AI charges per action, not per seat. Lower prices triggered far more actions, not lower bills: over two years, total enterprise AI spend rose more than 300% (Ramp, 2026).
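The arithmetic behind the paradox is short. A bill is price per token multiplied by tokens used, so a falling price and a rising bill together imply how fast volume must have grown. The sketch below is illustrative only: it treats the 1,000× price fall and the 300% spend rise as if they covered the same period and the same models, which they do not, and the function name is made up for this example.

```python
def implied_volume_multiplier(price_drop: float, spend_multiplier: float) -> float:
    """Return how many times token volume must have grown.

    bill = price_per_token * tokens, so
    tokens_after / tokens_before = spend_multiplier * price_drop.
    """
    if price_drop <= 0 or spend_multiplier <= 0:
        raise ValueError("both multipliers must be positive")
    return spend_multiplier * price_drop


# Assumption: a 300% rise in spend means the bill is 4x its starting level.
growth = implied_volume_multiplier(price_drop=1_000, spend_multiplier=4.0)
print(f"Implied token volume: {growth:,.0f}x")  # Implied token volume: 4,000x
```

Because many enterprises run far pricier models than the $0.06 tier, real volume growth is likely well below this ceiling. The point is only that volume, not price, is what moves the bill.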

Figure: The LLMflation paradox. Cost per million tokens falls from $60 to $0.06 (↘ 1,000×) while average enterprise AI spend rises from baseline to +320% (↗), 2021-2026; the lines cross in 2023. Source: a16z, 2024; Ramp, 2026.
Token prices fell roughly 1,000× over three years. Enterprise AI spend rose more than 300% in two. This is LLMflation. (a16z, 2024; Ramp, 2026)

The Meter Never Stops

Three things are happening simultaneously inside most enterprises. They look unrelated. They are the same problem.

The first: usage consistently outruns the forecast. In 2025, 85% of companies missed their AI cost projections by more than 10%; 24% missed by more than 50% (Mavvrik / Benchmarkit, 2025). Two years ago, 31% of finance teams actively managed AI spend. Today it is 98%. The number more than tripled not because companies got disciplined, but because the bills arrived and demanded attention (State of FinOps, 2026).

The second: agents turn the meter into a fire hose. A standard AI chat interaction uses a few thousand tokens. An agentic workflow re-sends the entire conversation context (the AI’s working memory) at every step. Industry data puts agent token use at 10-100× a standard chat. Stanford’s Digital Economy Lab measured the most complex agentic tasks at 1,000× a standard reasoning call. Simple tool-calling agents consume 5,000-15,000 tokens per task; complex multi-agent systems consume 200,000 to over 1,000,000 tokens per task (Stanford Digital Economy Lab, 2025). A million tokens is roughly 750,000 words: the equivalent of asking the model to read and write War and Peace from scratch for a single complex task. Goldman Sachs estimates agents will multiply total enterprise token demand 24× by 2030 (Goldman Sachs, 2026).
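Why re-sending the context is so expensive can be shown with a few lines of arithmetic. The sketch below assumes a simple model that is not taken from any of the cited sources: every step sends the whole history as input, and every step adds a fixed number of tokens to that history. The function name and the token counts are hypothetical.

```python
def agent_input_tokens(steps: int, base_context: int, added_per_step: int) -> int:
    """Total input tokens billed across an agent run that re-sends its history."""
    if steps < 0:
        raise ValueError("steps must not be negative")
    total = 0
    context = base_context
    for _ in range(steps):
        total += context            # the full history is sent again
        context += added_per_step   # this step's result joins the history
    return total


# Hypothetical sizes: a 2,000-token starting prompt, 1,500 tokens added per step.
single = agent_input_tokens(steps=1, base_context=2_000, added_per_step=1_500)
for steps in (1, 5, 20, 50):
    total = agent_input_tokens(steps, base_context=2_000, added_per_step=1_500)
    print(f"{steps:>2} steps: {total:>9,} input tokens ({total / single:.1f}x one chat turn)")
```

Under this model the total grows with the square of the number of steps, not in proportion to it, which is one way a task with a few dozen steps can reach the hundreds-of-times multipliers quoted above.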

The third: the most expensive model is rarely the right model. The price gap between the cheapest production AI model and the most expensive frontier reasoning model is currently about 4,500×. The cheapest runs at around $0.04 per million tokens. The most expensive frontier reasoning models run at $180 per million tokens. Most enterprises have deployed the top tier to everyone. Analysis consistently shows that roughly 85% of enterprise queries could be handled by budget-tier models (the ones at the $0.04 end) with no meaningful quality loss (FinOps Foundation, the industry body tracking AI infrastructure costs, 2026).
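The effect of that gap on a blended price can be sketched directly. The example assumes, for illustration only, that the share of queries sent to the budget tier equals the share of tokens, which is unlikely to hold exactly in practice. The function name is hypothetical; the two prices are the ones quoted above.

```python
def blended_price(budget_share: float, budget_price: float, frontier_price: float) -> float:
    """Average USD per million tokens when a share of traffic uses the budget tier."""
    if not 0.0 <= budget_share <= 1.0:
        raise ValueError("budget_share must be between 0 and 1")
    return budget_share * budget_price + (1.0 - budget_share) * frontier_price


BUDGET, FRONTIER = 0.04, 180.0   # USD per million tokens, from the text

print(f"Price gap: {FRONTIER / BUDGET:,.0f}x")
everyone_on_frontier = blended_price(0.0, BUDGET, FRONTIER)
routed = blended_price(0.85, BUDGET, FRONTIER)
print(f"All frontier: ${everyone_on_frontier:.2f} per million tokens")
print(f"85% routed to budget: ${routed:.2f} per million tokens")
```

Under these assumptions almost all of the blended price comes from the 15% of traffic still on the frontier tier. What reaches that tier matters far more than the price of the budget model.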

The hidden overhead compounds everything. OpsLyft (an AI cost analytics firm) found that most teams underestimate their true AI bill by 40-60%. A call that looks like $0.05 can land as high as $0.20 once embeddings (text to searchable numbers), retrieval, re-ranking, output validation, retries, and prompt overflow are counted (OpsLyft, 2026).

This Comes Straight Off Your Margin

Inference is not a technology line item. It is cost of goods sold: the costs that come directly out of gross margin, the number investors use to price companies.

84% of companies reported that AI was already eroding gross margins by more than 6% (Mavvrik / Benchmarkit, 2025). Bessemer’s portfolio data puts AI-native gross margins at 50-60%, against 70-90% for mature SaaS: software economics is quietly turning into manufacturing economics (Bessemer, 2025). Even OpenAI, the company whose products power much of this spending, ran $13B in revenue against $22B in costs in 2025, a $9B net loss at scale (Fortune, 2025).

The pattern is consistent: the more AI gets used, the more it costs, and the less of that cost is recovered in measurable output. This is not a price problem. It is a volume and governance problem. The companies that have protected their margins are not the ones with cheaper models. They are the ones that controlled who was using which model for what. There is a reason for that. It comes before usage.

The Access Problem

The piece most AI cost discussions miss is the one that makes every other problem harder to fix.

In 2026, Meta created internal leaderboards ranking employees by token consumption, a practice that quickly became known as “tokenmaxxing” (spending tokens competitively, not productively). Employees competed to spend the most tokens. They threw entire projects at agentic AI with no discipline about whether the output was good, whether a cheaper model would have worked, or whether the task suited AI at all. The leaderboard rewarded consumption, not outcomes (industry reporting, 2026).

Microsoft’s experience tells a version of the same story. The engineers who used Claude Code most heavily were not necessarily the ones producing the best software. Token costs outpaced salaries. And Faros AI found that code churn (lines of code deleted versus added, a measure of rework) increased by more than 800% in teams with high AI adoption. The teams spending the most on AI were producing the most waste (Faros AI, 2026).

The data points to something the economics alone do not explain: AI amplifies what you bring to it.

It is worth saying plainly that broad access to powerful tools has genuinely produced unexpected value. The marketer with no technical background who caught a product flaw through an unusual prompt. The junior analyst who used AI to try an approach a veteran would have ruled out on instinct, and was right. Ungated access is not without merit. The argument here is not for restriction: it is for matching model tier to demonstrated task competence, with a clear path to qualification for anyone willing to earn it.

A senior engineer with fifteen years of experience using Claude Code produces faster, better-reviewed code at lower cost, because they can evaluate the output, catch the errors the AI cannot catch, and prompt with precision. Evaluation means spotting when the model has hallucinated a citation (invented a source that does not exist), reversed a logical condition in code, or produced a plausible-looking answer that is wrong in a way only domain knowledge reveals. A junior developer with six months of experience using the same tool produces faster, worse code. They cannot tell good output from plausible-looking output. They iterate blindly on prompts they cannot evaluate, and burn tokens on approaches an experienced engineer would immediately dismiss.

The 4,500× price gap between the cheapest and most expensive models is not the real cost driver. The real driver is the gap between what a skilled user produces per dollar and what an unskilled user produces per dollar. A budget model in the hands of someone with deep domain expertise costs almost nothing and produces real value. A frontier reasoning model in the hands of someone without the background to evaluate its output costs $180 per million tokens and produces fast, confident, unreviewed work that someone else will eventually have to fix.

This is not a controversial idea when applied anywhere else. We do not give junior surgeons operating theatres without supervision. We do not give junior analysts sole authority over financial models that go to a board. We understand, in every other context, that tools amplify skill. Amplifying the absence of skill at speed is not progress. We have simply forgotten to apply this to AI.

Access to frontier AI models (most powerful, most expensive) is a professional resource. It should be treated like one.

Figure: How agents multiply token cost. Relative token cost multiplier on an approximate log scale (10× to 10,000×): standard chat, tool-calling agent 5–30×, coding agent up to 1,000×, complex multi-agent 1,000–10,000×. Source: Stanford Digital Economy Lab, 2025.
The shift to agentic AI did not change what the model charges per token. It changed how many tokens each task uses. (Stanford Digital Economy Lab, 2025)

Access to frontier AI models is a professional resource. The companies that figured this out first are the ones paying half as much for twice the output.

Govern the Usage and the Access

The companies that have escaped LLMflation are not the ones with the cheapest models. McKinsey puts them at roughly 6% of enterprises and labels them “AI high performers.” They are running two governance layers simultaneously: usage controls and access controls.

Usage governance: four controls that compound

Route every task to the cheapest capable model. A cascade architecture (routing tasks to cheaper models first) sends simple calls to the budget tier and escalates complex ones to frontier models only when the task requires it. Analysis consistently shows this cuts inference cost (the cost of running the model) by 40-70%, because most queries never need to leave the budget tier (FinOps Foundation, 2025).
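A cascade of this kind can be sketched in a few lines. This is a minimal illustration, not any vendor’s API: the Tier class, the cascade function, the stub models and the accept check are all hypothetical stand-ins, and a real quality check would be far more involved than testing for a non-empty answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Tier:
    name: str
    price_per_million: float      # USD per million tokens
    run: Callable[[str], str]     # stand-in for a real model call


def cascade(task: str, tiers: List[Tier],
            good_enough: Callable[[str, str], bool]) -> Tuple[str, str]:
    """Try tiers cheapest-first and return (tier name, answer) for the first
    answer that passes the check. If none passes, return the last tier's answer."""
    if not tiers:
        raise ValueError("at least one tier is required")
    ordered = sorted(tiers, key=lambda t: t.price_per_million)
    answer = ""
    for tier in ordered:
        answer = tier.run(task)
        if good_enough(task, answer):
            return tier.name, answer
    return ordered[-1].name, answer


# Stub models: the budget stub gives up on long tasks by returning an empty string.
budget = Tier("budget", 0.04, lambda task: "short answer" if len(task) < 40 else "")
frontier = Tier("frontier", 180.0, lambda task: "detailed answer")


def accept(task: str, answer: str) -> bool:
    return bool(answer)           # placeholder quality check


print(cascade("Summarise this paragraph", [frontier, budget], accept))
print(cascade("Refactor the billing module and explain every trade-off",
              [frontier, budget], accept))
```

One design consequence is visible in the loop: an escalated task pays for the failed budget attempt as well as the frontier call, so a cascade only saves money when most tasks stop at the first tier.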

Cache the answers you have already paid for. Semantic caching (reusing stored AI answers) removes about 31% of repeat queries before they reach the model. Prompt caching (reusing unchanged prompt sections) makes cached tokens roughly one-tenth the standard price. Add batch pricing on top (roughly 50% discount) and a call that would have run at $0.20 runs at about $0.01 (about 5% of standard cost) (OpenAI, 2026).
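The stacking arithmetic can be checked in a few lines. The sketch uses the ratios quoted above (cached tokens at one-tenth of the standard price, a 50% batch discount) and adds one assumption of its own: a cached_fraction parameter for the share of a call’s cost that is eligible for the cached rate. All names are hypothetical.

```python
def discounted_cost(standard_cost: float,
                    cached_fraction: float = 1.0,
                    cache_price_ratio: float = 0.10,
                    batch_discount: float = 0.50) -> float:
    """Cost of a call after prompt caching and batch pricing are stacked."""
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    cached = standard_cost * cached_fraction * cache_price_ratio
    uncached = standard_cost * (1.0 - cached_fraction)
    return (cached + uncached) * (1.0 - batch_discount)


best_case = discounted_cost(0.20)                        # every token cached
half_cached = discounted_cost(0.20, cached_fraction=0.5)
print(f"Fully cached and batched: ${best_case:.3f}")     # $0.010
print(f"Half cached and batched:  ${half_cached:.3f}")   # $0.055
```

The 5% figure is the best case, where the whole call qualifies for the cached rate. With only half of it cached, the same call costs about 27.5% of the standard price, so the share of the prompt that stays unchanged between calls determines how much of the discount is actually realised.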

Cap the spend before it surprises you. Set token budgets as hard ceilings per team and per workflow. Fix retry logic that can loop indefinitely. Define maximum context windows for each use case. The $500M bill happened because there were no ceilings.
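A hard ceiling and a bounded retry loop can be sketched together. The example is a minimal illustration under stated assumptions: TokenBudget, BudgetExceeded and call_with_retries are hypothetical names, the token estimate per attempt is supplied by the caller, and a real client would catch its own specific error types instead of every exception.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push usage past the ceiling."""


class TokenBudget:
    def __init__(self, ceiling: int) -> None:
        self.ceiling = ceiling
        self.used = 0

    def charge(self, tokens: int) -> None:
        if self.used + tokens > self.ceiling:
            raise BudgetExceeded(f"ceiling of {self.ceiling:,} tokens would be exceeded")
        self.used += tokens


def call_with_retries(call: Callable[[], T], budget: TokenBudget,
                      est_tokens: int, max_retries: int = 3) -> T:
    """Run call() at most 1 + max_retries times, charging the budget before each attempt."""
    last_error: Optional[Exception] = None
    for _ in range(1 + max_retries):
        budget.charge(est_tokens)       # raises BudgetExceeded at the ceiling
        try:
            return call()
        except Exception as err:        # assumption: any failure is retryable
            last_error = err
    raise RuntimeError(f"gave up after {1 + max_retries} attempts") from last_error


def flaky_model() -> str:
    raise TimeoutError("model timed out")   # stub that always fails


team_budget = TokenBudget(ceiling=10_000)
try:
    call_with_retries(flaky_model, team_budget, est_tokens=4_000)
except BudgetExceeded as stop:
    print("stopped:", stop)
print("tokens charged:", team_budget.used)   # tokens charged: 8000
```

Charging before each attempt means the ceiling stops a failing loop even when the retry limit is set too high, which is the failure mode an uncapped agent is exposed to.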

Measure cost per outcome, not cost per call. “Cost per call” tells you almost nothing. “Cost per resolved customer query,” “cost per code review,” “cost per draft approved”: these are the numbers that tell you whether the spend is working. The vendors that price this way (Intercom at $0.99 per resolved query, Salesforce Agentforce at roughly $2 per conversation) already know their cost per outcome. The discipline turns the meter into a dial (Intercom, 2026).
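The difference between the two measures is easy to see with numbers. The workflows and figures below are invented for illustration and do not come from any of the cited sources; the Workflow class is a hypothetical helper.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    calls: int
    cost_per_call: float   # USD, including overhead
    outcomes: int          # e.g. resolved queries or approved drafts

    @property
    def total_spend(self) -> float:
        return self.calls * self.cost_per_call

    @property
    def cost_per_outcome(self) -> float:
        if self.outcomes <= 0:
            raise ValueError(f"{self.name}: no completed outcomes to divide by")
        return self.total_spend / self.outcomes


# Hypothetical month: one workflow is cheap per call, the other cheap per result.
cheap_calls = Workflow("budget bot", calls=10_000, cost_per_call=0.05, outcomes=400)
pricey_calls = Workflow("tuned agent", calls=3_000, cost_per_call=0.20, outcomes=1_500)

for w in (cheap_calls, pricey_calls):
    print(f"{w.name}: ${w.cost_per_call:.2f} per call, "
          f"${w.cost_per_outcome:.2f} per outcome")
```

In this made-up case the workflow that costs four times as much per call is roughly three times cheaper per completed outcome, which is the comparison a per-call dashboard would hide.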

Access governance: the layer most companies skip

Tier model access by role and demonstrated competence. Not everyone needs frontier reasoning. Not everyone should have it. The qualification for moving up a tier is not seniority: it is demonstrated domain knowledge, meaning the ability to evaluate what the model produces in that domain and catch what it cannot catch.

The upgrade is available to anyone. A junior developer on the budget tier earns frontier access when they can demonstrate they can evaluate what a frontier model produces in their domain: spotting the errors, assessing the quality, catching the hallucinations. That is not hierarchy preservation. It is competence development with a measurable gate.
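The gate described here can be expressed as a small policy check. This is only a sketch of the idea: the Member class, the tier names and the qualify step are hypothetical, and how an organisation actually tests evaluation skill is left open.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class Member:
    name: str
    # Domains in which this person has passed the evaluation gate.
    qualified_domains: Set[str] = field(default_factory=set)


def qualify(member: Member, domain: str) -> None:
    """Record that the member has demonstrated evaluation skill in a domain."""
    member.qualified_domains.add(domain)


def resolve_tier(member: Member, task_domain: str,
                 requested: str = "frontier", fallback: str = "budget") -> str:
    """Grant the requested tier only in a domain where the member has qualified."""
    return requested if task_domain in member.qualified_domains else fallback


junior = Member("junior developer")
print(resolve_tier(junior, "backend"))   # budget
qualify(junior, "backend")
print(resolve_tier(junior, "backend"))   # frontier
print(resolve_tier(junior, "finance"))   # budget
```

Keying the check on domain and not on job title keeps the gate tied to what the person can evaluate, so the same user can hold frontier access for one kind of task and budget access for another.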

The advantage is real, and it is unequal: when skilled engineers have frontier access and unskilled ones are routed to budget tiers, the skilled engineers work faster and the budget recovers its balance. The companies with ungated access to everything get neither the speed nor the savings.

The cost reduction from running both governance layers together reaches 60 to 90 per cent against an ungoverned baseline (FinOps Foundation, 2026). Most companies are still paying full price. Not all of them think that is the wrong call.

Bubble, or Breakthrough?

Three honest positions, all with something right in them.

The builder
“AI is cheap and getting cheaper. Building cost controls too early slows down experiments that might be valuable.”

This is half right: over-governing early-stage experiments can kill them. Where it fails: “too early” was 2021. In 2026, the bill has arrived.

The bear
“Do not build on prices propped up by a loss-making machine.”

OpenAI’s $9B net loss, roughly $1 trillion in AI infrastructure pledged across seven major vendors in overlapping commitments (The Register, 2025), and Gartner’s estimate that 25% of planned 2026 AI budgets will slip to 2027 (Gartner, 2026): these are real signals. Two-thirds of companies already plan to move some workloads back to their own hardware.

The operator
“Both describe the same reason to run with controls.”

The strongest counter-argument to this article’s access governance prescription is worth hearing directly: skill-gated access preserves existing hierarchies rather than disrupting them. The companies that will win are the ones that train everyone to use AI well, not the ones that restrict it to people who already know what they are doing. That is a genuinely strong position. The answer is in the framing. This is not permanent restriction: it is a qualification path, open to anyone willing to demonstrate they can evaluate the output in their domain. If prices are artificially low, they will rise, and the companies that know their cost per outcome will adapt. Control wins under either future.

The Take

The Price of the Wrong Hands

LLMflation is not a price problem. It is a volume and access problem, and the companies solving it are the ones who decided, before deploying anything, who needs access to what and why.

Microsoft cancelled the Claude licences because the engineers used the tool. That is not a failure of the tool. It is a failure of the access model. When every engineer gets frontier AI access with no ceiling and no qualification requirement, the ones who use it most are not necessarily the ones producing the best work. They are the ones most comfortable spending tokens on tasks they cannot fully evaluate.

The $500M bill is the same story at enterprise scale: no caps, no tiers, no way to measure whether the output was worth what the month cost.

The price of AI is falling faster than almost any technology in history. The bill is rising because ungoverned access at scale overwhelms any per-unit saving. The multiplication effect of agentic AI makes it worse: these are automated sequences where the AI acts, reads its result, and acts again without a human approving each step.

The companies that solve this are not the ones negotiating better token rates. They treat model access as a professional resource, not a default setting. They measure cost per outcome. They route tasks to the cheapest capable model. They gate frontier access to people with the domain competence to use it properly, and they make the qualification path clear and earnable.

Six per cent of companies have built this. The other ninety-four are paying for the gap (McKinsey, 2025).

Where to start
  1. Get one honest cost-per-outcome number this quarter. Pick one use case: customer query resolution, code review, document drafting. Measure what it costs per completed outcome. Not per call, not per seat. That one number will tell you more about your AI economics than any tool audit.
  2. Audit who has access to what model tier. List every frontier model deployment in your organisation. For each one, ask: does the person using this have the domain expertise to evaluate what it produces? If the answer is no, move them to the standard tier and define what qualification looks like for the upgrade.
  3. Route before you spend. Before your next AI deployment, define which tasks go to which tier. Budget for summarising and drafting. Standard for analysis. Frontier only for tasks where the specific capability of the frontier model is the difference between a good outcome and a bad one.
  4. Cap before you surprise your board. Set token budget ceilings per team and per workflow before you deploy, not after the bill arrives. Fix any retry logic that can loop. Define maximum context windows. The ceiling is not a constraint on productivity. It is the thing that makes AI spend predictable.

The right AI tool in the right hands costs almost nothing and produces real value. The wrong AI tool in the wrong hands costs $180 per million tokens and produces confident, fast, unreviewed work that someone else will fix.

Written by Nadim A. Massih, AI & Tech Strategist
Common questions

Questions, answered first

If AI is getting cheaper, why is the bill going up?

Because AI is priced per action, not per seat. When each call gets 1,000× cheaper, companies find 1,000 new things worth doing, and the total volume climbs far faster than the price falls. Token prices fell roughly 1,000× over three years while total enterprise AI spend rose more than 300% in two.

What is LLMflation?

The paradox where the price of each AI call keeps falling but the total cost of AI keeps rising, driven by volume, agentic multiplication, and the tendency to use the most expensive model for every task regardless of whether it is needed.

Why did Microsoft cancel Claude?

Cost. Engineers used it heavily. It became the preferred tool over Microsoft’s own Copilot, and token costs per engineer began outpacing what Microsoft paid those engineers in salary. The company cancelled most licences in its Experiences and Devices division with a June 30, 2026 cutoff and redirected developers to Copilot CLI, which Microsoft also has a strategic interest in building adoption for.

Why does experience matter for AI access?

AI amplifies what the user brings. A domain expert using AI produces faster, better, more reliable work, because they can evaluate the output, catch the errors, and prompt with precision. Evaluation means spotting when the model has hallucinated a citation (invented a source that does not exist), reversed a logical condition in code, or produced a plausible-looking answer that is wrong in a way only domain knowledge reveals. A person without domain background produces output they cannot evaluate, burns tokens on approaches an expert would immediately dismiss, and ships fast, confident, unreviewed work. The cost is tokens. The risk is what those tokens produced.

What is the cheapest way to cut an AI bill?

Route (send simple tasks to budget models), cache (stop re-paying for queries already answered), cap (set hard ceilings before deploying), and measure cost per outcome. Combined effect: 60-90% cost reduction against an ungoverned baseline. None of these require new models or new vendors.

Receipts

Sources & references

Tom’s Hardware / Fast Company, May 2026

$500M spent by one enterprise on Anthropic’s Claude in a single month following uncapped employee access and agentic loop deployment.

Cybernews / The Next Web, 2026

Microsoft cancels Claude Code licences in Experiences & Devices division (Windows, Microsoft 365, Outlook, Teams, Surface), June 30, 2026. Token costs outpaced engineer salaries. Engineers redirected to Copilot CLI.

Tom’s Hardware; eeNews Europe, 2026

Uber burned through 2026 AI coding budget in four months; ~5,000 engineers on Claude Code; heavy users $500-$2,000/month individually. GitHub Copilot moved to usage-based billing June 2026.

Faros AI, 2026

Code churn (lines deleted vs added) increased 800%+ under high AI adoption. More AI tokens did not produce better code.

Ramp, 2026

Average business spends 13× more on AI than at start of 2025.

a16z “LLMflation,” Nov 2024; Epoch AI

Roughly 1,000× cost decline per million tokens 2021-2024; $60 to $0.06.

OpsLyft, 2026

True AI bills run 40-60% higher than visible token cost once hidden overhead is counted.

Fortune, Nov 2025

OpenAI roughly $13B revenue, roughly $22B spend, roughly $9B net loss.

Stanford Digital Economy Lab, 2025

Agentic tasks consume 10-100× standard chat tokens; most complex tasks roughly 1,000×.

Mavvrik / Benchmarkit, 2025

85% of companies miss AI cost forecasts by more than 10%; 24% by more than 50%; 84% see gross margin erosion more than 6%.

Bessemer, 2025

AI-native gross margins 50-60% vs mature SaaS 70-90%.

FinOps Foundation; arXiv, 2025

Routing / cascade architecture cuts inference cost 40-70%; all four controls combined: 60-90%.

OpenAI; Anthropic API docs, 2026

Prompt caching roughly 90% off cached tokens; Batch API roughly 50%; stacked to about 5% of standard cost. Semantic caching removes about 31% of repeat queries.

Intercom, 2026; SaaStr, 2026

Fin at $0.99 per resolved query; Salesforce Agentforce roughly $2 per conversation.

McKinsey, State of AI, Nov 2025

About 6% of companies are AI high performers; more than 80% see no clear enterprise bottom-line effect.

Goldman Sachs, 2026

AI agents will multiply enterprise token demand roughly 24× by 2030.

Gartner, 2026

25% of planned 2026 AI budgets will slip to 2027.

The Register; Calcalist, 2025

About $1T in AI infrastructure pledged across the same seven vendors in overlapping commitments.

Meta internal reporting, 2026

“Tokenmaxxing” leaderboards; employees ranked by token consumption.
