AI Maturity in Tech Orgs — Mid 2026
If you want a quick read on how mature an organisation's AI Engineering adoption is, ask how they ship and what they measure. Over the past three years the answer has moved through three phases: whether engineers use the tools, how much they use them, and whether that usage produces more value than going without. Most organisations sat clearly in one of the three. Each phase felt sensible at the time, but advances in model and harness capability kept making the previous phase's measures obsolete. We've now reached the point where new questions should be asked and new workflows adopted. Let's look back on those phases before we look to July 2026 and beyond.
Tool adoption: "please use the tools" (2023–2025)
The first phase started with ChatGPT and ran for about two years. The organisational problem was getting software engineers to adopt AI tools. Leaders bought licences, ran enablement sessions, appointed champions, and stood up dashboards tracking seat activation and weekly active usage.
The approach was rational at the time. Usage varied a lot between engineers, the gap between users and non-users was real, and encouraging adoption was the highest-leverage thing a leader could do. When usage is the differentiator, measuring usage and designing systems to encourage adoption is the right call.
The inflection point
The inflection point came in late 2025. The models had been improving steadily, and the release of Opus 4.5 in November 2025 was a noticeable step up that felt like a superpower. The harnesses around the models improved as well, helping turn a good model into an effective augmented software engineer. That's the window where the combination went from promising to clearly good enough, and by early 2026 not using the tools had become the anomaly among software engineers. A glance through commit histories showed agent co-authorship had become the norm on almost every commit.
Adoption saturated on its own. Engineers no longer needed enablement sessions or champions. They adopted the tools because they worked. Dashboards measuring who was using AI went green within a couple of quarters and stopped telling anyone anything. I wrote in December about how management practices move in fashion cycles, and measurement follows the same pattern. The metrics that define good management in one cycle become the outdated artefacts of the next.
The immature response: tokenmaxxing
With adoption saturated, organisations needed a new number, and the immature ones reached for the next countable thing: consumption. Who uses the most tokens? Who produces the most lines of code, the most pull requests? Through early 2026 this hardened into a habit with a name, tokenmaxxing. Forbes profiled the "AI Gods". Sendbird ran a company leaderboard ranking employees from Beginner up to AI God, a title reserved for people using at least a hundred million tokens a day. Meta, Nvidia and Databricks all celebrated their top spenders in one form or another.
The gaming started immediately. Amazon employees admitted to running AI on tasks that didn't need it to pump their internal usage scores. Meta reportedly had engineers leaving token-burning bots running unattended, and by April had capped internal token spend as costs approached the billions. The public ranking dashboard was quietly shut down.
Tokens are the new lines of code, an input measure standing in for an output measure. The industry ran this experiment decades ago and abandoned it. Goodhart's law applied as it always does. Once token consumption became a target, it measured enthusiasm for consuming tokens and nothing else. By late May Fortune had declared tokenmaxxing dead. The same numbers that leaderboards celebrated in February, finance teams were pushing down by June.
Growing up: measure outcomes
The more mature metric, in mid-2026, is cost per successful task: the all-in spend to resolve a ticket, merge a pull request or ship a change, including retries and failed attempts. Cost per token tells you how cheap your inference is, not whether the work got done. Cost per successful task is also the first AI metric that finance and engineering can read the same way, which matters now that model subsidies are ending and the bills are real.
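As a minimal sketch of the arithmetic, assuming you can export one record per attempt with its spend and whether it succeeded (the Attempt fields, ticket names and dollar figures below are hypothetical, not taken from any particular tool): add up every dollar spent, including retries and failures, and divide by the number of tasks that actually got done.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    task_id: str     # hypothetical ticket or pull request identifier
    cost_usd: float  # all-in spend for this one attempt
    succeeded: bool  # True if this attempt resolved the task


def cost_per_successful_task(attempts):
    """Total spend (including failed attempts) divided by distinct tasks resolved."""
    total_spend = sum(a.cost_usd for a in attempts)
    resolved = {a.task_id for a in attempts if a.succeeded}
    if not resolved:
        return None  # nothing shipped, so the metric is undefined
    return total_spend / len(resolved)


# Illustrative data only: one retry, one clean success, one abandoned task.
attempts = [
    Attempt("TICKET-1", 0.50, False),
    Attempt("TICKET-1", 0.75, True),
    Attempt("TICKET-2", 0.25, True),
    Attempt("TICKET-3", 1.50, False),
]

print(cost_per_successful_task(attempts))  # 1.5
```

Note that the abandoned task still counts in the numerator. Spend that produced nothing is exactly what cost per token hides.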
To be clear, Goodhart's law applies here too. Slice tasks smaller and the number improves without anything real changing, though there is likely a sweet spot between the size of a unit of work and its cost. The difference is what the metric is anchored to: it binds spend to a completed outcome rather than to raw activity, which makes it harder to game invisibly than tokens or lines of code ever were. It's an improvement on what came before, not an end state, and I expect it to keep evolving.
Getting that number down is a portfolio problem. Frontier models earn their price on orchestration and on the tasks that genuinely need frontier intelligence, and waste money on almost everything else. Organisations doing this well route routine work to cheaper and open-weight models and save the expensive intelligence for where it counts.
Cursor is the clearest product expression of this. Their in-house model Composer, now on its third generation, exists so the product can blend a fast, cheap native model with frontier models and route by task. It's a preview of what happens when right-sizing the model to the task becomes the norm. That capability might be baked into the harness, as it is with Cursor, provided by a framework, or built as an internal platform capability in organisations running their own agent infrastructure. Most teams will buy it rather than build it.
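For teams that do build it, the core of task-based routing can be very small. The sketch below is an illustration under stated assumptions, not how Cursor or any framework implements it: the tier names, the task categories and the route function are all hypothetical. The one design choice it encodes is that the cheap model is the default and the frontier model is the exception.

```python
from enum import Enum


class Tier(Enum):
    CHEAP = "cheap"        # fast native or open-weight model (hypothetical label)
    FRONTIER = "frontier"  # most capable, most expensive model (hypothetical label)


# Hypothetical task kinds assumed to genuinely need frontier intelligence.
FRONTIER_KINDS = {"architecture_design", "novel_debugging"}


def route(task_kind: str, needs_orchestration: bool = False) -> Tier:
    """Send orchestration and hard tasks to the frontier tier, everything else to cheap."""
    if needs_orchestration or task_kind in FRONTIER_KINDS:
        return Tier.FRONTIER
    return Tier.CHEAP


print(route("rename_variable").value)                        # cheap
print(route("architecture_design").value)                    # frontier
print(route("rename_variable", needs_orchestration=True).value)  # frontier
```

In practice the hard part is not the function but agreeing on what belongs in the frontier set, and revisiting that list as cheaper models improve.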
For engineering leaders the implication is that model spend deserves the same thinking as headcount. You wouldn't staff every project with your most senior engineers, and you shouldn't run every task through your most expensive model. Whether you buy the routing or build it, the discipline of matching spend to task is organisational work, not a configuration change.
Even cost per successful task is a waypoint, though. It tells you the machine is efficient, not that it's pointed anywhere useful. The measure that matters more as organisations mature is time to validated outcome: how long it takes to go from a customer problem to a shipped change you've confirmed actually moved something, for the customer or the business. That's the number that connects delivery speed with organisational and customer goals, and optimising it stops being a tooling question very quickly.
When the backlog stops being the constraint
What few teams planned for in 2026 is the shrinking backlog. Teams that sized their roadmaps around pre-AI throughput are finishing them early, sometimes by quarters, and discovering they don't know what comes next. A depleted backlog sounds like a nice problem to have until you sit with what it means. Building things is no longer the constraint on your organisation.
The theory of constraints predicts this. Improving throughput at one stage of a system moves the bottleneck rather than removing it. Code generation got fast, so the constraint moved, first downstream to code review, where senior engineers became the chokepoint for a rising volume of generated changes, then past review to the harder question of deciding what is worth building at all.
The failure mode I think is playing out in many otherwise capable organisations is senior leaders acting as the funnel. When every idea, priority call and spec has to pass through one or two people, the bottleneck has been relocated to the most expensive and least scalable resource in the building. Leaders who were proud of being the quality gate on everything are now, structurally, the reason their teams are waiting.
What maturity looks like next
If the constraint has moved from building to deciding, the response is to distribute the deciding. Push a product mindset down to every engineer, empowered with real authority to make decisions. Teams that can go from customer problem to shipped change without passing through a leadership approval chain are the ones that turn AI throughput into outcomes. Teams that can't will accumulate inventory.
Empowerment on its own won't get you there, though, because our development processes were designed for a world where coding was expensive. The SDLC most of us grew up with, with its quarterly planning, refinement sessions, estimation rituals, steering groups and stakeholder alignment, is a machine for rationing scarce engineering capacity. Run it unchanged with AI-augmented teams and the process becomes the bottleneck. You can generate a feature in a day and then spend three weeks walking it through ceremonies that protect capacity and process you no longer need to protect. Real productivity gains across complex products and larger organisations require rethinking what the SDLC looks like in 2026 and beyond, rather than accelerating the coding step inside the old one.
The review chokepoint is the clearest example of what an evolved SDLC changes. Review can now be partially or fully automated, and done well it's better than the human version. A specialised agent reviews every change for security. Another checks adherence to your ADRs and architectural principles. Another holds the line on conventions. Run in parallel, they're more thorough than any single reviewer and they approach the change from angles one person never would. Humans stay in the loop where judgment is genuinely contested, and stop being the queue everywhere else.
Once verification is automated and trustworthy, the loop extends further. Teams can wire agents into the observability stack to identify customer problems and errors, prototype the fix, and in the safest categories ship it before a human has had to act. Whether a change ships automatically or waits for a person is a risk-tier decision, the same one you already make about deploys today.
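The risk-tier decision can be written down as policy rather than left to habit. This is a minimal sketch, assuming each change has already been assigned a tier and that automated verification returns a single pass or fail; the tier names, the threshold and the ship_decision function are hypothetical and would need to match your own deploy policy.

```python
from enum import IntEnum


class RiskTier(IntEnum):
    LOW = 1     # e.g. copy changes or config behind a feature flag (assumed category)
    MEDIUM = 2
    HIGH = 3    # e.g. payments, auth, data migrations (assumed category)


# Hypothetical policy: only the safest tier may ship without a person acting.
AUTO_SHIP_MAX_TIER = RiskTier.LOW


def ship_decision(tier: RiskTier, verification_passed: bool) -> str:
    """Return 'block', 'auto_ship' or 'human_review' for a proposed change."""
    if not verification_passed:
        return "block"  # failed verification never ships, whatever the tier
    if tier <= AUTO_SHIP_MAX_TIER:
        return "auto_ship"
    return "human_review"


print(ship_decision(RiskTier.LOW, True))    # auto_ship
print(ship_decision(RiskTier.HIGH, True))   # human_review
print(ship_decision(RiskTier.LOW, False))   # block
```

Raising AUTO_SHIP_MAX_TIER is then an explicit, reviewable decision, which is the point: trust in automation is extended one tier at a time.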
In practice, the shape of the work changes. More of the effort moves upfront, into understanding the real problem before anything gets built, because deciding is now the scarce work. Downstream of the decision, automate as much as you can. And with the cost of trying something so much lower, throw more experiments at your customers and learn faster what works and what doesn't.
The organisational shape that follows is smaller than what we're used to. Two or three engineers, design, product and a couple of domain experts, working with high autonomy and low coordination costs, can now deliver faster than the larger, better-resourced teams of the past. The old team sizes existed to parallelise coding work, and the coordination overhead was the price of that parallelism. When the work no longer needs parallelising, the overhead is just drag. Small teams with real decision-making authority and direct access to domain knowledge are the unit that matches the new constraint.
The thread running through the whole arc is that AI maturity was never about the tooling alone. Phase one organisations bought the tools, tokenmaxxing organisations measured the tools, and the most forward-thinking organisations are redesigning themselves around what the tools made abundant, while learning to manage the new scarcity: judgment about what to build. The organisations that get this right won't be the ones with the biggest token bills or even the best cost per task. They'll be the ones that noticed the bottlenecks had moved and reorganised around an AI-native development lifecycle.
It's an unsettling, fascinating and exciting time!