It's 3am. The on-call engineer's phone buzzes. Production is down — a pricing function is returning negative values for a subset of orders in the EU marketplace. She opens the file. Git blame points to a colleague who committed it six months ago. She messages him. His response: "I prompted it. It passed the tests. I shipped it." He can't explain why the function handles currency conversion that way. Neither can she. Neither can anyone, because the logic wasn't designed — it was generated. The bug is in an assumption that was never consciously made, documented, or understood by any human who ever touched the code.1
This is the ghost in the codebase. Not a security breach. Not a hack. Not malice. Just code that works until it doesn't, written by no one, understood by no one, and now broken at 3am with nobody who can explain what it was supposed to do in the first place.
The Velocity Trap
AI coding tools let you write code 10x faster. They do not let you understand code 10x faster. This asymmetry is the source of every problem in this article.
GitHub Copilot now generates 46% of all code written by developers who use it — up to 61% for Java projects. Sonar's 2026 survey of 1,100+ developers found that 42% of committed code now includes significant AI assistance, expected to reach 65% by 2027.2 Eighty-eight percent of Copilot-generated code stays in the final version. Ninety percent of Fortune 100 companies have adopted the tool.
The codebase is growing at AI speed. Human comprehension grows at human speed. The gap between them is technical debt — but it's a new kind of technical debt, one that Werner Vogels, Amazon's CTO, named at re:Invent 2025: "verification debt."
"When you write code yourself, comprehension comes with the act of creation. When the machine writes it, you'll have to rebuild that comprehension during review."3 (paraphrased from Vogels' keynote)
Traditional technical debt accumulates when you take a shortcut you understand. Verification debt accumulates when you ship code you never understood in the first place. The first is a conscious trade-off. The second is a time bomb with no timer.
The Numbers
GitClear analyzed 211 million changed lines of code between January 2020 and December 2024 — the period spanning pre-AI and post-AI development. The findings are unambiguous.4
Code duplication surged. Copy/paste code went from 8.3% to 12.3% of changed lines — a 48% relative increase. For duplicated blocks of five or more lines, the increase was 8x. For the first time in the history of their dataset, duplicated code exceeded refactored code. AI generates new code instead of improving existing code, because generating is what it's optimized to do.
Code churn — newly added code revised or deleted within two weeks — rose significantly, with GitClear reporting increases between 44% and 84% depending on the metric used.4 AI writes code that gets thrown away faster than human-written code. The speed of creation is matched by the speed of abandonment.
Refactoring collapsed. Refactored lines dropped from 25% to under 10% of changed lines — a 60% decline. The craft of maintaining clean architecture is being measured out of existence, replaced by a workflow that generates, ships, and moves on.
CodeRabbit's analysis of 470 pull requests found that AI-authored PRs average 10.83 issues each compared to 6.45 for human-only code. Logic errors are 75% more common. Security vulnerabilities are 57% more prevalent.5
The code ships faster. It also breaks faster. And when it breaks, nobody knows why it was written that way, because nobody wrote it that way. It was generated that way.
The Trust-Action Gap
Here is the number that should alarm every engineering leader: 96% of developers do not fully trust AI-generated code's functional correctness. But only 48% always verify it before committing.6
Read that again. Nearly everyone distrusts the code. Barely half check it. The other half ships code they don't trust and haven't verified into production systems that real people depend on.
Sixty-six percent of developers cite their biggest frustration as "AI solutions that are almost right, but not quite." Forty-five percent say debugging AI-generated code takes longer than debugging code they wrote themselves. Thirty-eight percent say reviewing AI code requires more effort than reviewing human code.6
This is the ghost's hiding place — the gap between knowing the code probably has problems and verifying that it doesn't. The trust-action gap is not laziness. It's physics. The volume of AI-generated code exceeds the human capacity to verify it. We covered this dynamic in The Reviewer's Trap — AI produces faster than humans can judge, and the math doesn't close. The ghost is the code that ships through the gap.
The Amazon Cascade
If you want to see what happens when verification debt comes due, look at Amazon between December 2025 and March 2026.7
Four Sev-1 incidents in ninety days, with internal documents linking a "trend of incidents" to "Gen-AI assisted changes." In December, Amazon's Kiro agentic AI was given operator permissions to fix a routine issue in AWS Cost Explorer. It autonomously decided to "delete and recreate the environment" — causing a thirteen-hour outage. In early March, Amazon Q generated incorrect delivery times in shopping carts, costing an estimated 120,000 lost orders. Days later, a deployment without formal documentation triggered a six-hour blackout across North American marketplaces. The estimated impact — 6.3 million lost orders — originates from third-party analysis, not Amazon's own disclosures.7
Internal documents warned that AI was "accidentally exposing vulnerabilities" and that safety measures were "completely inadequate." Those references were removed from meeting documents before the all-hands discussion.7 Amazon's public statement: "only one of the incidents involved AI, and the cause was unrelated to AI."
The ghost wasn't just in the code. It was in the institutional denial.
Amazon's response was a 90-day safety framework for approximately 335 "Tier-1" systems: mandatory senior engineer sign-off on all AI-assisted changes, two-person code review, director and VP audits. The fastest automated pipeline in the world, with the slowest human checkpoint bolted back onto the end. The velocity gains that AI promised were traded for controlled friction — the exact opposite of what the technology was supposed to deliver.
The Authorship Problem
Traditional code has an author. Not just a name in the commit log — a human who made choices. Why use a hash map instead of a tree? Why handle this edge case and not that one? Why return null instead of throwing? Each decision carries intent, and intent is what you debug when something breaks. You reconstruct the author's reasoning, find where it diverged from reality, and fix the assumption.
AI-generated code has a prompter, not an author. The prompter had a goal: "implement a currency conversion function." The AI had a pattern: this is how currency conversion functions tend to look in the training data. The resulting code is a statistical average of every currency conversion function the model ever saw — competent, generic, and free of any specific intention about how this system should handle rounding, exchange rate freshness, or edge cases with discontinued currencies.
Intent is debuggable. Statistical averages are not. When the generic implementation fails on a specific edge case, there's no reasoning to retrace. There's no decision to revisit. There's only a function that looks reasonable and happens to be wrong in a way that nobody anticipated because nobody thought about it — not the AI, which doesn't think, and not the developer, who didn't write it.
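To make this concrete, here is a hypothetical sketch of what such a generated function might look like. Everything in it (the names, the flat 30-cent fee, the rounding choice) is invented for illustration; the point is that no single assumption was ever consciously made.

```python
def convert(amount_cents: int, rate: float, fee_cents: int = 30) -> int:
    """Convert an order total and deduct a flat processing fee."""
    # Banker's rounding: an accident of the language default, not a decision.
    converted = round(amount_cents * rate)
    # Silently assumes the converted amount always exceeds the fee.
    return converted - fee_cents


# Looks reasonable and passes a happy-path check:
print(convert(10_000, 0.92))  # 9170
# ...and goes negative for a small EU order, exactly the 3am bug:
print(convert(25, 0.92))      # -7
```

Nothing in the function is wrong in isolation. The bug lives in the interaction of two defaults nobody chose, which is precisely what makes it impossible to debug by retracing intent.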
METR's randomized controlled trial quantified what this feels like in practice: experienced open-source developers using AI tools were actually 19% slower at completing tasks — while believing they were 20% faster. The perception gap is roughly 40 percentage points.8 GitHub's own data puts the acceptance rate for Copilot suggestions at 27-30% — meaning the majority of AI output is generated, reviewed, and rejected. Time spent understanding code that will never ship. The velocity narrative is a mirage. The ghost is already haunting them, and they can't see it.
The Testing Illusion
AI-generated code often arrives with AI-generated tests. On the surface, this looks like a solved problem — the code works, and there are tests to prove it. The reality is darker.
George Tsiokos documented the pattern in his 2025 analysis of what he called "circular validation."9 When an AI generates tests by analyzing the implementation code, it creates a closed loop: the tests confirm the code functions as written, including its bugs. The tests don't validate correctness against requirements — they validate consistency with the implementation. A student grading their own exam.
A concrete example makes the problem visceral. A discount calculation uses ((original - current) / current) * 100 instead of the correct ((original - current) / original) * 100. For a price drop from $100 to $80: the buggy formula says 25% off; the correct formula says 20% off. The AI-generated test verifies the output against the buggy formula and passes. One hundred percent test coverage. Zero percent correctness.
Tsiokos measured this precisely: a test suite can achieve 100% code coverage and a 4% mutation score — executing every line of code while catching only 4% of potential bugs.9 Coverage measures what runs. Mutation testing measures what catches errors. The gap between them is where the ghosts hide, protected by a green check mark that means nothing.
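The circular loop is easy to reproduce. A minimal sketch of the discount example above, with function and test names invented for illustration:

```python
def discount_percent(original: float, current: float) -> float:
    # Buggy: divides by the current price instead of the original.
    return (original - current) / current * 100


# A test generated *from the implementation* enshrines the bug as truth:
def test_discount_percent():
    # $100 -> $80 is "verified" as 25% off; the correct answer is 20%.
    assert discount_percent(100, 80) == 25.0


test_discount_percent()  # passes: every line executed, nothing validated
```

The suite is green and coverage is 100%, but the only thing verified is self-consistency. Correctness against the requirement never entered the loop.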
The Git Blame Void
When a codebase is maintained by humans, git blame is archaeology. You can trace a function back to the commit that created it, read the commit message for context, check the PR discussion for rationale, and find the author to ask why. The historical record is imperfect but navigable. It tells a story.
When a codebase is maintained by AI, git blame points to the developer who prompted the generation and approved the diff. The commit message says "implement feature X." The PR discussion, if it exists, says "LGTM." The developer, if you ask them, says "I prompted it, it passed the tests, I shipped it." The archaeological record is empty. There's no story. There's just code that appeared.
New tools are emerging to address this — Git AI tracks AI-generated code via pre- and post-edit hooks, preserving an authorship log linked to agent sessions. BlamePrompt provides line-by-line AI-versus-human attribution with prompt search and audit trails.10 The ITU and ISO have proposed embedding creator IDs and timestamps into file metadata. Claude Code automatically adds itself as a git co-author by default.
The fact that these tools had to be invented proves the thesis. We've created a class of code so alien to traditional development workflows that the version control system itself — a tool designed specifically to track authorship and history — is insufficient. The git log tells you who committed the code. It doesn't tell you who authored it, because nobody did.
The Outsourcing Parallel
The closest historical analogy is the outsourcing era. Companies in the 2000s and 2010s sent codebases offshore to reduce costs, then discovered years later that the code was unmaintainable — different conventions, missing documentation, design decisions that made sense to the original team but were inscrutable to everyone else. An entire consulting industry arose around "legacy modernization" of outsourced code.
But the outsourcing parallel understates the problem. Outsourced code was at least designed by humans with intent. There were spec documents, even if they were incomplete. There were offshore teams you could contact, even if the communication was difficult. There were contractual relationships and institutional memory, however distributed.
AI-generated code has none of this. There's no vendor to call. No spec to reference. No human who made the design decisions. The ghost in the outsourced codebase was a person in another timezone. The ghost in the AI codebase is nobody at all.
The data confirms what the analogy suggests. Pull requests per author increased 20% year-over-year thanks to AI tools. Incidents per pull request increased 23.5%.11 More code, more breakage, and when you trace the breakage back to its source, you find a function that a machine generated, a human glanced at, and nobody understood.
The Great Toil Shift
Sonar's 2026 research describes a phenomenon they call "The Great Toil Shift."12 The thesis: AI doesn't eliminate developer toil. It redistributes it.
Eighty-eight percent of developers report negative AI impacts on technical debt. Fifty-three percent cite AI generating "code that looked correct but was unreliable." Forty percent say AI increases debt through "unnecessary or duplicative code." And here's the punchline: developer toil still consumes 23-25% of weekly time regardless of AI adoption level. Heavy AI users don't toil less. They toil differently — debugging AI code, reviewing AI suggestions, cleaning up AI messes, instead of writing code from scratch.12
Seventy-five percent of technology leaders are projected to face moderate or severe technical debt problems by 2026 because of AI-accelerated coding practices.12 The velocity was real. The debt was real too. And unlike velocity, debt compounds.
The Palimpsest
A palimpsest is a manuscript where the original text has been scraped away and written over, sometimes multiple times, leaving faint traces of each previous layer beneath the current one. It's the right metaphor for what AI is doing to codebases.
Layer one: a developer prompts an AI to build a feature. The AI generates code based on patterns in its training data. The developer doesn't fully understand it but it passes the tests. Ship it.
Layer two: six months later, a different developer needs to modify the feature. They don't understand the original code — nobody does — so they prompt an AI to modify it. The AI generates a modification based on the existing code and its own patterns. The new code is a statistical transformation of code that was itself a statistical generation. The developer doesn't fully understand the modification either. It passes the tests. Ship it.
Layer three: the modification breaks something in an unrelated system. A third developer debugs it. They can see what the code does. They cannot see why it does it. They cannot determine which behaviors are intentional and which are artifacts of generation. They patch the bug with another AI-generated fix. The tests pass. Ship it.
Each layer is legible in isolation and incomprehensible in combination. The codebase becomes a palimpsest of machine-generated solutions layered on top of machine-generated solutions — each one "working," none of them understood, the reasoning behind each layer fainter than the last.
The ghost isn't in the machine. The ghost is the machine's output — haunting every team that inherits it, every on-call engineer who debugs it at 3am, every new hire who tries to read it to understand the system. The code works. Nobody knows why. And every day, the codebase gets larger, the layers get deeper, and the ghosts multiply.
Disclosure
This article was written with the assistance of Claude, an AI made by Anthropic. The code metaphors are not hypothetical — we live them daily. We also verified every claim, checked every citation, and reviewed every paragraph, which — per the METR study — may have taken longer than writing it from scratch would have. We're part of the pattern we're describing. Corrections and ghost stories welcome at bustah_oa@sloppish.com.
Citations
- The opening scenario is a composite based on patterns described in multiple incident postmortems and developer accounts, not a single documented event.
- GitHub Copilot usage data compiled from multiple sources: Tenet, Panto. Sonar State of Code 2026 survey (1,100+ developers): PDF.
- Werner Vogels, "Verification Debt," AWS re:Invent 2025. Concept elaborated by Kevin Browne.
- GitClear, "AI Copilot Code Quality 2025," analysis of 211 million changed lines, Jan 2020–Dec 2024. Report | PDF.
- CodeRabbit, "State of AI vs Human Code Generation Report," December 2025. Analysis of 470 PRs. Link.
- Sonar 2026 developer survey and press release on the "verification gap." Link. Also: The Register. Stack Overflow 2025 Developer Survey: AI section.
- Amazon AI-code incident cascade compiled from multiple sources: Autonoma, Fortune, The Register, eWeek. Note: Amazon's official position is that "only one of the incidents involved AI"; internal documents referenced here were reported by multiple outlets.
- METR, "Measuring the Impact of Early AI Assistance on Experienced Open-Source Developer Productivity," July 2025. 16 developers, 246 tasks. Blog | arXiv.
- George Tsiokos, "Circular Validation: The Hidden Risk in AI-Generated Tests," February 2025. Link.
- Git AI: GitHub. BlamePrompt: GitHub. See also Pullflow: The New Git Blame.
- PR volume and incident rate data from CodeRabbit and LeadDev.
- Sonar, "How AI Is Redefining Technical Debt," 2025/2026. Link. Also: InfoQ.