The Productivity Audit

Sixteen claims. Sixteen studies. The numbers don't match.
By Nadia Byer · March 25, 2026

The AI industry says developers are 55% faster. The independent studies say experienced developers are 19% slower. Both claims have data behind them. The difference is what the data actually measures.

Sixteen claims. Sixteen underlying studies — or, in some cases, the conspicuous absence of one. Sorted by methodology, scored by evidence quality, compared against independent data. The short version: AI makes developers faster at writing code. Whether that makes them more productive is a different question, and the answer is not what the industry is selling.

The Scorecard

Who | Claim | Evidence | Verdict
GitHub | 55% faster coding | 95 devs, 1 task, 1 language | Greenfield toy, not real work
GitHub | 46% of code is AI-generated | Copilot user telemetry | Misleading denominator
Microsoft/GitHub/Accenture | 26% more completed tasks | 4,867 devs, 3 RCTs | More output ≠ more productivity
Google (Pichai) | 25% AI-assisted code, +10% velocity | Earnings call + podcast | Unverifiable
Amazon | $2B savings, 4.5x velocity | Internal executive claim | Followed by 6.3M-order outage (third-party estimate)
Microsoft/GitHub | +3.62% readability | Internal, vendor-published | Trivially small, contradicted by independent data
METR | Experienced devs 19% slower with AI | 16 devs, RCT, real repos, screen-recorded | Small but most rigorous study available
DORA 2024 | Throughput down 1.5%, stability down 7.2% | Large annual survey + telemetry | Strongest systemic signal (2025 update reversed throughput finding)
Faros AI | 98% more PRs, review time up 91% | 10,000 devs, 1,255 teams | The bottleneck moved
GitClear | Duplication rate up 48%, 5+ line blocks up 8x, churn up 44-84% | 211M lines, longitudinal | Strongest quality signal
Sonar | 96% don't trust AI code; only 48% verify it | 1,100+ dev survey | The trust-verification gap
Workday | 37% of AI time savings lost to rework | 3,200 respondents | Net gains smaller than advertised
BCG | 14% of AI users experience "brain fry" | 1,488 workers | The review bottleneck degrades
CodeRabbit | AI code has 1.7x more issues | 470 PRs | Consistent with independent data
Anthropic | Engineers fully delegate 0-20% of tasks | Published report | Most honest self-assessment published
Harness/LeadDev | 67% spend more time debugging AI code | 500 engineers | The speed giveth, the debugging taketh away

Read the "Verdict" column top to bottom. Vendor-funded claims measure speed on toy tasks. Independent studies measure what happens to actual software delivery. They reach opposite conclusions.

The Denominator Problem

Every impressive AI productivity number has a denominator nobody can see.

The Study Design Problem

GitHub's marquee number — 55% faster coding — comes from 95 developers completing one task: implement an HTTP server in JavaScript. No existing codebase. No code review. No deployment. No debugging. No collaboration. No maintenance. The study measured how fast you can write a toy server alone in a room.1

METR measured something different. Sixteen experienced open-source developers performed 246 real tasks on their own repositories — codebases averaging ten years old, over a million lines, 22,000+ GitHub stars. Randomized. Screen-recorded. Compared against no-AI controls.2

GitHub found massive gains. METR found a 19% slowdown. The difference is not the technology. It is the study design. Greenfield tasks with no context will always show AI at its best. Real work on mature codebases tells a different story.

The Microsoft/GitHub/Accenture three-RCT study deserves credit for better methodology: 4,867 developers, randomized access, real work environments. It found 26% more completed tasks per week. Real finding. But "more completed tasks" is not "more productivity." A developer who completes 26% more tasks while generating 1.7x more issues per PR is not more productive. They are more prolific. Different thing.3

That study also used GPT-3.5-era Copilot. The tool has changed. The study has not been replicated with current models.

"46% of code is now AI-generated." GitHub's most-repeated statistic. It means 46% of code written by Copilot users originates from Copilot suggestions. Not all code everywhere. Acceptance rate: 27-30%. But press coverage launders it into "AI writes nearly half of all code." Different claim.4

"More than 25% of all new code at Google is now generated by AI." Sundar Pichai, earnings call. The "+10% velocity" figure came separately, in a Lex Fridman podcast interview. Neither claim was accompanied by a study, paper, or methodology. "25% of new code" could mean anything depending on how you define "new code," "generated," and "AI." Investors heard a number. The number has no visible denominator.5

"$2 billion in cost savings" and "4.5x developer velocity." Amazon SVP Dave Treadwell, internal communications. Neither figure audited. Neither externally verified. Executive assertions, now treated as established facts.6

The claim is not the evidence. Show us the denominator.

The Perception Gap

The most consistent finding in this audit is about self-deception.

METR's developers predicted AI would make them 24% faster. After completing tasks, they believed it had made them 20% faster. Measured result: 19% slower. The perception gap: 39 to 43 percentage points.2

They used the tool, experienced the slowdown, and still believed they'd been faster.

DORA 2024 found the same pattern at the system level. 75.9% of respondents used AI daily. Individual developers reported feeling more productive. Delivery throughput decreased 1.5%. Delivery stability dropped 7.2%.7 (Note: the 2025 DORA update found the throughput relationship had reversed to positive, though stability remained negatively impacted. We cite the 2024 data because it covers the same period as most other studies in this audit.)

Workday surveyed 3,200 employees. Only 14% report consistent net-positive outcomes from AI use. Fourteen percent. For a technology positioned as the most transformative productivity tool since the spreadsheet.8

The tools feel like they're helping. The data says the help comes with invoices.

The Toil Shift

Sonar's framing is the most honest any vendor has produced: AI didn't reduce developer toil. It moved it.9

The hard part used to be writing code. Now it is reviewing, verifying, debugging code you didn't write, and re-understanding systems after a machine changed them. 88% of developers report negative AI impacts on technical debt. 53% cite AI generating "code that looked correct but was unreliable." 45% say debugging AI code takes longer than debugging their own.

The number that should keep every engineering manager awake: 96% of developers do not fully trust AI-generated code. Only 48% always verify it before committing.10

More than half of developers commit AI-generated code without consistently verifying it. Shipped with a shrug, by people who don't trust it.

Toil still consumes 23-25% of the work week regardless of AI adoption level. The toil just changed names.

The Review Tax

The productivity gain at the keyboard becomes a productivity loss at the review stage.

Faros AI telemetry from 10,000 developers across 1,255 teams: high AI adoption produced 98% more PRs merged and a 91% increase in review time. PR size grew 154%. Teams that handled 10-15 pull requests per week now face 50-100.11

LinearB's analysis of 8.1 million PRs found AI-generated PRs wait 4.6 times longer before review begins. Acceptance rate: 32.7%, versus 84.4% for human-written PRs.12

BCG quantified the cognitive cost. 1,488 workers. High AI oversight: 14% more mental effort, 12% more mental fatigue, 19% more information overload, 33% more decision fatigue. 39% more major errors when reviewers are fatigued.13

That last number is the one that matters. The review bottleneck is not just slow. It is degrading. Tired reviewers make worse decisions. The system fails at the exact point it is supposed to provide safety.

Amdahl's Law applies. The unaccelerated part of the pipeline — human judgment — now dominates the total. You can generate code at machine speed. You cannot review it at machine speed. The bottleneck did not disappear. It moved downstream, to review.
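
A back-of-the-envelope sketch makes the ceiling concrete. The numbers below are illustrative assumptions, not figures from any study in this audit: suppose coding is 30% of total delivery time and AI makes that slice 5x faster.

```python
# Amdahl's Law with illustrative, made-up numbers: accelerating only the
# coding slice of delivery caps the end-to-end gain, because review,
# debugging, and maintenance are not accelerated.

def overall_speedup(accelerated_fraction: float, local_speedup: float) -> float:
    """Amdahl's Law: 1 / ((1 - p) + p / s)."""
    return 1.0 / ((1.0 - accelerated_fraction) + accelerated_fraction / local_speedup)

# Coding assumed to be 30% of delivery time, made 5x faster by AI.
print(round(overall_speedup(0.30, 5.0), 2))   # 1.32 -> a 32% end-to-end gain, not 5x

# Even infinitely fast code generation cannot beat the ceiling of 1 / 0.70.
print(round(overall_speedup(0.30, 1e9), 2))   # 1.43
```

That ceiling also assumes review time holds flat. The Faros and BCG numbers above suggest it does not.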

The Quality Invoice

The code ships faster. It also breaks faster. Nobody includes the second part in the productivity calculation.

GitClear: 211 million changed lines, January 2020 to December 2024. Code duplication rose sharply — copy/paste went from 8.3% to 12.3% of changed lines, a 48% increase, and duplicated blocks of five or more lines grew 8x. Churn surged — 7.9% of new code revised within two weeks, up from 5.5%. Refactoring collapsed from 25% to under 10%. Copy/paste exceeded refactored code for the first time in their recorded history.14

CodeRabbit: 470 pull requests. AI-authored PRs average 10.83 issues each, versus 6.45 for human-only code. Logic errors 75% more common. Security vulnerabilities up to 2.74x more prevalent.15

Veracode: 100+ LLMs, 80 real-world tasks. 45% of AI-generated code contained security vulnerabilities. 86% failed to defend against cross-site scripting. 88% vulnerable to log injection.16

Apiiro: AI-generated code introducing 10,000+ new security findings per month by June 2025 — a 10x spike in six months. Privilege escalation paths up 322%.17

Microsoft claimed a 3.62% improvement in readability and 2.94% in reliability. Published by the vendor. No independent replication. Every independent dataset found quality declining. The vendor's quality claims are contradicted by every measurement they did not conduct themselves.

The Amazon Lesson

Amazon's AI productivity story is the most complete case study in this audit.

The claim: SVP Dave Treadwell set an 80% weekly usage target for Kiro, Amazon's AI coding assistant. 70% of engineers complied. 21,000 AI agents deployed. $2 billion in claimed savings. 4.5x developer velocity.6

The result: Four Sev-1 production incidents in 90 days. Kiro autonomously deleted and recreated an entire AWS Cost Explorer environment — 13-hour outage. Amazon Q generated incorrect delivery times — 120,000 lost orders. A six-hour Amazon.com shopping outage dropped orders 99% across North American marketplaces. 6.3 million orders lost.

Internal documents referenced "a trend of incidents" with "high blast radius" linked to "Gen-AI assisted changes." Those references were removed before meetings.

The response: 90-day safety reset targeting 335 critical systems. Two-person code review for all changes. Senior engineer sign-off for all AI-assisted production changes. Director and VP-level audits.

Amazon's public statement: "Only one of the incidents involved AI, and the cause was unrelated to AI."

The $2 billion savings figure was calculated before the incidents. The 4.5x velocity was measured before the incidents. Neither number accounts for outage costs, remediation, or the mandatory human review overhead that followed. Amazon claimed AI would eliminate the bottleneck of human review. Then they re-imposed mandatory human review as the fix for AI-generated failures.

Speed gains produced incidents, which produced safety mandates, which eliminated the speed gains.

· · ·

What Is Actually True

AI helps with boilerplate, autocomplete, and greenfield tasks. Multiple studies confirm this. Writing a new function from scratch with no legacy context? AI will save you time.

Less experienced developers see larger gains. The Microsoft three-RCT study found this. METR — which found a net slowdown — used experienced developers on familiar codebases. The less you know, the more a pattern-matching autocomplete can teach you. The more you know, the more you spend correcting it.

The gains are smaller than advertised. Probably 10-26%, not 55% or 4.5x. That is the gross number, before subtracting the costs that follow.

The gains often do not survive contact with production. DORA found system-level delivery degrading despite individual developers feeling faster. Faros found the review bottleneck absorbing the output gains. Amazon found the velocity producing cascading failures.

The cognitive and quality costs are real and poorly measured. BCG's "brain fry" data, GitClear's quality analysis, and Sonar's trust-verification gap all point to costs absent from every vendor's productivity calculation.

Nobody knows the net effect. Vendor studies measure time-to-completion on a task. Independent studies measure what happens after — the review tax, the quality invoice, the maintenance debt, the cognitive load. No study has measured all of it end to end. The number everyone wants — "what is the actual ROI of AI-assisted development?" — does not exist yet.

Anthropic's own data may be the most honest anchor. Engineers integrate AI into roughly 60% of their work but fully delegate only 0-20% of tasks.18 That is a long way from 55% faster or 4.5x velocity.

The Audit's Verdict

The industry is selling a speed number. The studies that support it measure speed. The studies that measure everything else — review time, code quality, security, delivery stability, cognitive load, maintenance cost — find that the speed comes with invoices.

The invoices are not in the marketing.

"55% faster." Faster at what? "46% of code." Whose code? "$2 billion in savings." Audited by whom?

Show us the denominator, the methodology, and the full cost.

The claim is not the evidence.

Disclosure

This article was researched and written with the assistance of Claude, an AI made by Anthropic. Anthropic's data is cited in this piece and held to the same standard as every other source. The irony of AI auditing AI productivity claims is not lost on us — it is, in fact, the point. Corrections and reader perspectives welcome at bustah_oa@sloppish.com.

Sources

  1. GitHub, "Research: Quantifying GitHub Copilot's impact on developer productivity and happiness." 95 developers, JavaScript HTTP server task. Link.
  2. METR, "Measuring the Impact of Early AI Assistance on Experienced Open-Source Developer Productivity," July 2025. 16 developers, 246 tasks, RCT. Blog | arXiv.
  3. Microsoft/GitHub/Accenture, three randomized controlled trials, 4,867 developers. 26.3% increase in pull requests per week. Link.
  4. GitHub Copilot usage statistics compiled from multiple sources: Tenet, Panto.
  5. Sundar Pichai, Google Q3 2025 earnings call. No published methodology or supporting study.
  6. Amazon AI coding mandate and incident cascade compiled from: Autonoma, Fortune, The Register, eWeek.
  7. DORA, State of AI-Assisted Software Development 2024/2025. Link.
  8. Workday, "Global Workforce Report," 3,200 respondents. 37% of time savings lost to rework; 14% report consistent net-positive outcomes.
  9. Sonar, "How AI Is Redefining Technical Debt," 2025/2026. Link.
  10. Sonar 2026 developer survey and "Verification Gap" press release. Link.
  11. Faros AI, "AI Productivity Paradox." Telemetry from 1,255 teams and 10,000+ developers. Link.
  12. LinearB, 2026 Software Engineering Benchmarks Report. 8.1 million PRs, 4,800 teams, 42 countries. Link.
  13. BCG, "When Using AI Leads to 'Brain Fry,'" Harvard Business Review, March 2026. 1,488 US workers. Link.
  14. GitClear, "AI Copilot Code Quality 2025." 211 million changed lines, Jan 2020–Dec 2024. Link.
  15. CodeRabbit, "State of AI vs Human Code Generation Report," December 2025. 470 PRs. Link.
  16. Veracode, "GenAI Code Security Report," 2025. 100+ LLMs, 80 real-world tasks. Link.
  17. Apiiro, "4x Velocity, 10x Vulnerabilities," June 2025. Link.
  18. Anthropic internal data on AI task delegation. Engineers integrate AI into ~60% of work, fully delegate 0-20% of tasks.