Why do most AI pilots fail to deliver ROI?

MIT's Project NANDA found that 95% of enterprise generative AI pilots produce no measurable P&L impact, and the cause is a learning gap rather than model quality or regulation. Organizations buy tool access but never build the fluency for the spend to compound.

Does AI fluency training actually improve engineering throughput?

After running its largest team through the Immerse AI fluency program, Gnar AI measured 43% more tickets and 93% more story points completed per sprint, with no other changes to the team, codebase, or process. The only new variable was the training.

Why your AI pilots haven't delivered ROI

Most AI pilots fail on a learning gap, not the technology. We lived it, then measured what closing it did: 43% more tickets and 93% more story points per sprint.

We made the same mistake most companies are making with AI right now. We just made it about a year earlier.

When we rolled out Claude Code to our engineering team, we did what everyone does. Bought the seats. Gave everyone access. Let them loose. And if any team could pick up agentic development by osmosis, it should have been ours. We build software for a living. Our engineers are senior. This is our home turf.

A few of them took off immediately. Watching them work was one of the most exciting things I’ve seen in ten years of running this company. They were shipping in an afternoon what used to take days, and the quality held.

Most of the team used it like expensive autocomplete.

That gap, between the few who transformed and the many who didn’t, turned out to be the most important thing we’ve learned about AI adoption. It’s also the reason most pilots fail.

Why most AI pilots fail

The MIT finding: 95% of pilots show no P&L impact

MIT’s Project NANDA put a number on this. Their 2025 report, “The GenAI Divide: State of AI in Business 2025,” drew on more than 300 AI initiatives, 52 organizational interviews, and surveys of 153 executives. It found that 95% of enterprise generative AI pilots produce no measurable impact on the P&L.

Read past the headline, though, because the diagnosis matters more than the failure rate. MIT’s researchers found the problem wasn’t model quality, and it wasn’t regulation. It was a learning gap. Organizations never learn to use the tools well enough for the spend to compound.

Spending is up, but the learning gap remains

Meanwhile, the money keeps flowing. Gartner’s 2026 CIO and Technology Executive Survey of 2,500 technology leaders found 89% plan to increase AI spending this year, and Kris van Riper, a practice VP at Gartner, summed up the shift bluntly: “2025 was about AI pilots, discovery and experimentation. 2026 will be about delivering agentic AI ROI.” Midmarket CIOs are feeling that pressure most. They’re now expected to prove their AI programs produced measurable outcomes, not a sprawl of parallel pilots competing for budget.

So spending is up, patience is running out, and the most rigorous study we have says the real problem is learning.

That matches what we lived.

A license is not a capability

Trace the incentives and the failure pattern makes sense. Seats are easy to sell and easy to buy. A GitHub Copilot license for every developer is a clean line item: approved in one meeting, deployed in a week, reported to the board as “AI adoption.” Training is messier. It changes how people work, it takes weeks, and it doesn’t come bundled with the subscription.

So companies buy access and call it adoption.

We did the same. (Our seats were Claude Code seats, and we’re a certified Claude partner, so believe me, I wanted “give everyone the tool” to be the whole answer.)

What actually separated the top performers

But access just gave everyone a starting line. The engineers who excelled weren’t more talented than the rest of the team. They had rebuilt their workflows around the tool: how they wrote context, how they broke down work, how they reviewed agent output. Everyone else was waiting for the tool to make them faster without changing anything about how they worked.

New gear never made anyone better at lacrosse, either. The reps did.

What we built instead

One of the engineers who excelled used to teach software engineering. So we made him an offer: help us turn what you figured out into something the whole team can learn.

That became Immerse, the AI fluency program we built at Gnar AI, the AI strategy and fluency brand from The Gnar Company. It runs seven levels, taught in cohorts. It starts with context and prompting and ends in production-grade agentic development: the practice of having AI agents carry out multi-step coding work under human review, using multi-agent workflows, context engineering, and review discipline. Not a lunch-and-learn. A curriculum, with skills you demonstrate before you move up a level.

Then we ran our team through it.

The results: 43% more tickets, 93% more story points

The cleanest data came from our largest team, which works in a codebase that’s 8 to 10 years old. Legacy code, real customers, the kind of system where AI demos usually go to die. Which is exactly why it’s the measurement I trust most.

After the team went through Immerse, we measured 43% more tickets completed per sprint and 93% more story points closed per sprint.

Nothing else changed. Same engineers, same codebase, same sprint cadence, same backlog. No new hires, no process overhaul, no tooling change. They already had Claude Code. The only new variable was the training.

I’ll be straight about one thing: I don’t have a tidy explanation for why story points jumped more than twice as much as ticket count (93% against 43%). My best read is that the team stopped avoiding the big tickets. When the gnarly refactor stops being scary, you pull it into the sprint instead of letting it age in the backlog. But that part is interpretation. The throughput numbers are measured.
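One piece of arithmetic does follow directly from the two measured figures. If tickets per sprint rose 43% and story points per sprint rose 93%, then the average story points per completed ticket rose by roughly 35%, which is consistent with bigger tickets being pulled into the sprint. A minimal Python sketch of that calculation, assuming both percentages are measured against the same baseline sprints (the function name is illustrative):

```python
def points_per_ticket_change(ticket_uplift: float, point_uplift: float) -> float:
    """Relative change in average story points per ticket.

    Both uplifts are fractions relative to the same baseline
    (0.43 means 43% more). Assumes the baseline values are nonzero.
    """
    return (1 + point_uplift) / (1 + ticket_uplift) - 1


# The two measured uplifts reported above.
change = points_per_ticket_change(ticket_uplift=0.43, point_uplift=0.93)
print(f"Average points per ticket changed by {change:.1%}")
```

This only shows that the average ticket got larger in story-point terms. It says nothing about why, so the explanation above remains an interpretation.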

It worked well enough that we now run the same program for client engineering teams, and we’re seeing the same shape of results: a measured baseline, a trained cohort, and a velocity jump you can put in front of a board.

How to run an AI pilot that actually delivers ROI

Four things I’d do differently if I were starting over, learned the expensive way:

1. Baseline before you buy. Capture tickets per sprint, story points per sprint, and cycle time before anyone gets a seat. If you don’t have a “before,” you will never be able to defend the spend, no matter how well it goes.
2. Train a cohort, not the whole org. Our biggest gains came from structured training of one team, not broad access for everyone. Pick a team with real delivery pressure, give them the program, and let the results recruit the next cohort.
3. Measure 60 to 90 days out, against the baseline. If you can’t draw a line from the training to the throughput, you don’t have a pilot. You have a subscription.
4. Ask every vendor what changes. Before you sign, ask: “What behavior changes, and how will we measure it?” If the answer is a feature list, walk.
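
The baseline-then-measure steps above can be sketched in a few lines of Python. This is an illustrative sketch under assumed inputs: the metric names, the function names, and every number in it are hypothetical, and it assumes you can export per-sprint totals from whatever tracker you use.

```python
from statistics import mean


def pct_change(baseline: list[float], after: list[float]) -> float:
    """Percent change of the mean of `after` relative to the mean of `baseline`."""
    base = mean(baseline)
    if base == 0:
        raise ValueError("baseline mean is zero; percent change is undefined")
    return (mean(after) - base) / base * 100


def compare_to_baseline(baseline: dict[str, list[float]],
                        after: dict[str, list[float]]) -> dict[str, float]:
    """Percent change per metric, for metrics present in both periods."""
    return {name: pct_change(baseline[name], after[name])
            for name in baseline if name in after}


# Hypothetical per-sprint numbers, invented for illustration only.
# Baseline: sprints captured before anyone gets a seat or any training.
baseline_sprints = {
    "tickets": [20, 22, 18, 20],
    "story_points": [30, 34, 32, 32],
    "cycle_time_days": [6.0, 5.0, 7.0, 6.0],
}
# Follow-up: sprints from the 60-to-90-day measurement window.
post_training_sprints = {
    "tickets": [24, 26, 25, 25],
    "story_points": [40, 44, 44, 48],
    "cycle_time_days": [4.5, 4.0, 5.0, 4.5],
}

report = compare_to_baseline(baseline_sprints, post_training_sprints)
for metric, change in report.items():
    print(f"{metric}: {change:+.1f}%")
```

For cycle time, lower is better, so a negative change there is an improvement. Averaging over several sprints on each side, rather than comparing single sprints, keeps one unusually light or heavy sprint from dominating the comparison.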

The pilots didn’t fail because the AI wasn’t ready. The teams weren’t.

Ours wasn’t either. Until we trained it.

Immerse is the AI fluency program we run for client engineering teams at Gnar AI. If you want to know where your team stands today, start with our AI readiness assessment.