McKinsey's Q1 2026 State of AI report found that only 4% of enterprises report material business impact from their AI investments, despite 78% having adopted AI development tools. The gap isn't a technology failure — it's a measurement failure. Here's what engineering leaders are getting wrong, and what the organizations actually capturing AI value are doing differently.
For the past two years, the message from technology vendors, analysts, and conference keynotes has been consistent: adopt AI tools, move fast, and productivity gains will follow. Enterprises listened. By Q1 2026, 78% of organizations had deployed at least one AI-assisted development tool — GitHub Copilot, Cursor, Codeium, or one of dozens of competitors. AI coding assistant spend nearly doubled year-over-year from 2024 to 2025. Internal AI initiatives proliferated across every tier of the enterprise.
And now, quietly but unmistakably, the board questions have started. What are we actually getting for this? Where is the productivity dividend we were promised? If our engineers are shipping code twice as fast, why isn't our product roadmap twice as long? The questions are reasonable. In many cases, the engineering leaders receiving them don't have satisfying answers — not because the technology isn't delivering, but because they haven't built the measurement infrastructure to demonstrate it.
McKinsey's Q1 2026 State of AI report put a number on the gap that should be required reading for every VP of Engineering defending an AI budget: only 4% of enterprises report material revenue impact from their AI investments. Not immaterial impact — material impact. The kind that shows up in a quarterly earnings call. Seventy-eight percent of companies adopted the tools. Four percent are demonstrably benefiting at the business level. That is not a technology adoption problem. That is a measurement and integration problem.
The root of the measurement gap is a mismatch between what engineering teams track and what business performance actually requires. When AI tools were deployed, the most natural metrics to track were the ones closest to the tools themselves: pull request velocity, lines of code committed, story points completed per sprint, percentage of code attributed to AI suggestions. These numbers went up. Engineering leaders reported them upward; executives heard them and, reasonably, expected to see them flow into business outcomes. They mostly didn't, and neither side has clearly understood why.
Activity metrics and outcome metrics are not the same thing, and in the context of AI tools, they can actively diverge. An AI coding assistant that doubles the rate of code production does not automatically double the rate of value delivery. It can increase the rate of PR merges while simultaneously increasing the volume of code in the system — which increases maintenance burden, increases surface area for defects, and increases the cognitive overhead of code review. The activity went up. The outcome did not move at the same rate, and in some cases moved in the wrong direction.
Gartner's 2025 analysis of enterprise AI tool deployments found that 62% failed to generate measurable productivity improvement at the team or product level, even in organizations where individual developer output metrics showed clear gains. The mechanism is now well understood: AI tools shift where time is spent rather than simply adding capacity. Code generation is faster; code review, debugging, and production support are more demanding. Teams that measured only the front end of that equation saw impressive numbers. Teams that measured the full development lifecycle found that the gains were real but smaller than expected, and heavily dependent on how the tools were integrated.
There is also a confounding factor that almost nobody is adjusting for: the composition of the work changes when AI tools are adopted. Engineers who previously spent 40% of their time on boilerplate and routine implementation now spend more time on review, architecture, and judgment calls — which are intrinsically harder to measure and slower to complete. Measuring output in lines of code before and after AI adoption is like measuring a logistics network's efficiency in trucks dispatched rather than goods delivered. The metric exists; it's just measuring the wrong thing.
The organizations that do report material business impact from AI investments — the 4% in McKinsey's data — are not using different tools. In most cases they are using the same GitHub Copilot, the same Cursor, the same internal AI pipelines as their peers. What distinguishes them is how they integrated those tools into their development system, and critically, what they measured from the start.
The defining characteristic of high-ROI AI adopters is that they treated AI tool deployment as a systems-level engineering investment rather than a tool rollout. Before the tools were deployed, they defined what business outcome they were trying to move — not 'increase developer productivity' in the abstract, but 'reduce time from feature approval to customer deployment by 30%' or 'reduce P1 incident rate by 20% over six months.' The tool was then deployed in a context specifically designed to move that metric, and the metric was tracked from day one.
A second distinguishing factor is explicit AI governance from the outset. High-ROI organizations created clear protocols for which types of work AI tools should be applied to, how AI-generated code would be reviewed, what the quality acceptance criteria were for AI-assisted work, and how production incidents originating from AI-generated code would be tracked separately. They treated AI as an input to a quality-managed system rather than a raw accelerant applied uniformly across all work types.
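As a small illustration of that last point, the sketch below shows one way AI-originated incidents might be tracked separately, assuming pull requests are tagged when AI assistance was material and incidents are linked back to the change that caused them. The tagging convention and field names are hypothetical, not taken from any specific tool.

```python
# Minimal sketch: counting production incidents that trace back to AI-assisted
# changes. The "ai-assisted" tag and field names are illustrative conventions.
incidents = [
    {"id": "INC-101", "severity": "P1", "origin_change_tags": ["ai-assisted"]},
    {"id": "INC-102", "severity": "P2", "origin_change_tags": []},
    {"id": "INC-103", "severity": "P1", "origin_change_tags": ["ai-assisted"]},
    {"id": "INC-104", "severity": "P3", "origin_change_tags": []},
]

ai_related = [i for i in incidents if "ai-assisted" in i["origin_change_tags"]]
share = len(ai_related) / len(incidents)

print(f"incidents traced to AI-assisted changes: "
      f"{len(ai_related)}/{len(incidents)} ({share:.0%})")
```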
The third factor is counterintuitive: the highest-ROI organizations preserved rather than redeployed their senior engineering capacity. The naive approach is to use AI tools to free seniors from routine work and redirect them toward producing more feature volume. The high-ROI organizations instead used AI tools to reduce the junior and mid-level time spent on routine work, while keeping senior capacity focused on architecture, review quality, and production reliability. The result: more code moving faster through a system with consistent quality gates, rather than code moving faster through a degraded review process.
Building an AI ROI measurement framework that will satisfy a board is not technically complex, but it requires a deliberate decision to instrument the development process at the outcome level rather than the activity level. The starting point is selecting metrics that have a clear line from engineering behavior to business result. Time-to-feature — defined as calendar days from feature approval to production availability — is the single most useful engineering metric for board-level ROI conversations, because it translates directly to competitive responsiveness and revenue timing. Defect escape rate (defects reaching production per release) and mean time to recover (MTTR) from production incidents are the quality and reliability equivalents.
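To make those definitions concrete, here is a minimal sketch of how the three outcome metrics might be computed from basic delivery and incident records. The data structures, field names, and values are placeholders for illustration, not output from any particular system.

```python
from datetime import datetime
from statistics import mean

# Illustrative records only; field names and values are placeholders.
features = [  # one record per shipped feature
    {"approved": datetime(2026, 1, 6), "deployed": datetime(2026, 1, 29)},
    {"approved": datetime(2026, 1, 12), "deployed": datetime(2026, 2, 3)},
]
releases = [  # defects found in production, attributed per release
    {"defects_escaped": 3},
    {"defects_escaped": 1},
]
incidents = [  # production incidents with open/resolve timestamps
    {"opened": datetime(2026, 2, 1, 9, 0), "resolved": datetime(2026, 2, 1, 12, 30)},
    {"opened": datetime(2026, 2, 9, 14, 0), "resolved": datetime(2026, 2, 9, 22, 0)},
]

# Time-to-feature: calendar days from feature approval to production availability.
time_to_feature = mean((f["deployed"] - f["approved"]).days for f in features)

# Defect escape rate: defects reaching production per release.
defect_escape_rate = mean(r["defects_escaped"] for r in releases)

# MTTR: mean hours from incident open to resolution.
mttr_hours = mean((i["resolved"] - i["opened"]).total_seconds() / 3600 for i in incidents)

print(f"time-to-feature:    {time_to_feature:.1f} days")
print(f"defect escape rate: {defect_escape_rate:.1f} per release")
print(f"MTTR:               {mttr_hours:.1f} hours")
```

None of this requires new tooling; most organizations already have the approval dates, release records, and incident timestamps, they simply never joined them into a single view.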
These outcome metrics need to be complemented by team health metrics that serve as leading indicators. The SPACE framework — Satisfaction and wellbeing, Performance, Activity, Communication and collaboration, Efficiency and flow — adapted for AI-augmented teams provides a structured approach to capturing the team-level dynamics that precede outcome changes. In particular, tracking Efficiency and flow (the degree to which engineers can complete work without interruption or context-switching) alongside Activity metrics is critical: AI tools that improve activity while fragmenting flow can generate short-term output gains that mask medium-term productivity erosion.
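One way to operationalize this is to pair each SPACE dimension with a small set of concrete signals and review them together, so that Activity gains are never reported without the corresponding flow data. The mapping below is illustrative; the specific signals under each dimension are assumptions, not a prescribed set.

```python
# Illustrative mapping of SPACE dimensions to concrete signals for an
# AI-augmented team; the signals listed are examples, not a prescribed set.
space_metrics = {
    "Satisfaction and wellbeing": ["developer survey score", "voluntary attrition"],
    "Performance": ["defect escape rate", "time-to-feature (days)"],
    "Activity": ["PRs merged per week", "share of AI-assisted commits"],
    "Communication and collaboration": ["median PR review turnaround (hours)"],
    "Efficiency and flow": ["uninterrupted focus hours per day", "context switches per day"],
}

# Reviewing Activity and Efficiency and flow side by side is the point:
# rising activity with shrinking focus time is an early warning, not a win.
for dimension, signals in space_metrics.items():
    print(f"{dimension}: {', '.join(signals)}")
```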
The measurement cycle matters as much as the metric selection. The most defensible AI ROI analysis uses a 90-day pre-deployment baseline on all selected metrics, followed by a 90-day post-deployment comparison, with explicit controls for any changes in team composition, project complexity, or business conditions during that period. Organizations that skip the baseline period are essentially asking themselves to remember what performance looked like before the tools were introduced — a task human memory performs poorly. The 90-day minimum is also important because AI tool adoption typically has a productivity dip in weeks two through six as engineers adjust workflows, followed by a recovery and improvement phase. Measuring at 30 days will show the wrong result for most teams.
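The sketch below illustrates the baseline-versus-post comparison and why the early dip matters for the timing of the read; the weekly time-to-feature samples are invented for illustration only.

```python
from statistics import mean

# Hypothetical weekly time-to-feature samples (calendar days); invented values.
baseline_90d = [24, 26, 23, 27, 25, 24, 26, 25, 23, 26, 24, 25, 27]  # 90 days pre-deployment
post_90d     = [28, 27, 26, 27, 26, 26, 22, 21, 20, 19, 19, 18, 19]  # 90 days post-deployment

baseline_avg = mean(baseline_90d)
post_avg = mean(post_90d)
adjustment_dip = mean(post_90d[:6])  # roughly weeks one through six

print(f"baseline average:        {baseline_avg:.1f} days")
print(f"post-deployment average: {post_avg:.1f} days "
      f"({(post_avg - baseline_avg) / baseline_avg:+.1%})")
print(f"weeks 1-6 after rollout: {adjustment_dip:.1f} days (worse than baseline)")
```

In this invented series the full 90-day window shows a clear improvement, while a 30-day read would have captured only the adjustment dip and reported a regression.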
One metric that has emerged as a particularly high-signal indicator for AI ROI is revenue per engineer — calculated as total product revenue divided by total engineering headcount. It is a coarse metric, but it captures both the numerator (whether AI tools are helping build more valuable product) and the denominator (whether AI tools are enabling the same outcomes with a more efficient team). Organizations where AI tools are generating genuine leverage typically show improvement in revenue per engineer over a 12-month horizon even when shorter-term metrics are mixed.
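The calculation itself is trivial; the discipline is in tracking it consistently over the 12-month horizon. A minimal sketch with placeholder figures:

```python
# Revenue per engineer = total product revenue / total engineering headcount.
# The figures below are placeholders, not data from any real organization.
def revenue_per_engineer(annual_revenue: float, engineering_headcount: int) -> float:
    return annual_revenue / engineering_headcount

before = revenue_per_engineer(48_000_000, 120)  # year before AI tooling
after = revenue_per_engineer(54_000_000, 118)   # 12 months after rollout

print(f"before: ${before:,.0f} per engineer")
print(f"after:  ${after:,.0f} per engineer ({after / before - 1:+.1%})")
```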
The AI ROI measurement gap becomes especially acute when engineering work is distributed across outsourcing partnerships. The same board pressure that is landing on internal engineering leaders is increasingly landing on the relationship between enterprise clients and their software development partners — and most outsourcing contracts are not structured to handle it. Standard time-and-materials agreements were designed to bill hours, not demonstrate outcomes. When a client asks 'how much of our development throughput gain is attributable to your AI tooling?' most outsourcing partners cannot answer the question in any rigorous way.
This creates a risk that is currently underappreciated: outsourcing partners who position themselves as 'AI-augmented' without the measurement infrastructure to substantiate the claim are selling a narrative rather than demonstrable value. An offshore partner that uses AI tools to generate code faster, bills the same or similar hours, and provides no visibility into quality metrics or outcome data is, in effect, extracting the AI productivity gain for itself while passing the hidden costs — debugging overhead, review burden, production incidents — back to the client. This is not a hypothetical dynamic; it is the predictable result of incentive structures that reward hours over outcomes.
For CTOs and engineering leaders evaluating or re-evaluating outsourcing relationships in 2026, the right questions to ask have changed substantially from previous years. Rather than 'do your engineers use AI coding tools?' — which will receive a yes from every partner regardless of truth — the questions should be: What outcome metrics do you track and report at the team level? How do you differentiate AI-generated from human-written code in your quality processes? What is your post-deployment defect rate, and has it changed since you introduced AI tools? Can you share time-to-feature data from comparable recent engagements? Partners who cannot answer these questions concretely are not AI-augmented in any meaningful sense; they are AI-washed.
Senior-led nearshore teams have a structural advantage in this accountability transition. Smaller, more senior teams with clear domain ownership produce less code at higher quality, have shorter review cycles, and generate outcomes that are measurable because the team structure is simple enough to make cause-and-effect tractable. The volume outsourcing model — large teams of mid-level engineers generating high activity metrics — is precisely the model that struggles to demonstrate AI ROI, because the activity metrics were always high and the outcome metrics were never the primary delivery metric. When boards start demanding outcome accountability from outsourcing relationships, senior-led teams with transparent metrics are the obvious answer.
The AI ROI reckoning is not a sign that AI tools failed to deliver. The tools are delivering — selectively, inconsistently, and in ways that require measurement discipline to capture and communicate. The 4% of enterprises reporting material business impact are not lucky; they are better at connecting engineering inputs to business outputs, and they built that connection intentionally before the tools were deployed rather than after the board started asking questions. For engineering leaders facing this conversation now, the path is not to promise what the tools haven't yet demonstrated — it is to build the measurement infrastructure that makes the demonstration possible. Instrument at the outcome level, not the activity level. Establish a baseline before the next tool deployment. Adapt your SPACE metrics for the AI-augmented context your teams are actually working in. And if your outsourcing partners cannot report on outcomes rather than hours, treat that as the liability it is becoming, not the industry norm it has historically been. The AI productivity era requires a measurement culture that most engineering organizations built only partially — and the boards doing the asking are right to push for the rest of it.
StepTo helps European and US companies build senior-led nearshore engineering teams in Serbia. Let's talk about what your next engagement could look like.
StepTo Editorial
stepto.net