There is a sentence I keep coming back to, sitting in my office in Berlin, watching a Claude Code session politely rewrite a function that worked perfectly well an hour ago: the memory of your project is not your codebase.

I wrote that line a few weeks back, almost as a throwaway. The longer I stare at it, the more I think it is the only thing that matters in 2026.

Because here is what nobody is being honest about: the AI coding tools work. They genuinely work. I ship faster than I have in twenty years of writing software. I have agents that can scaffold a microservice, write the tests, fix the failing ones, and open the PR — sometimes while I am making coffee. And yet, when I open a codebase that I have been “AI-assisting” for three months and try to remember why a particular invariant exists, why the retry logic has those specific numbers, why the auth flow was structured one way instead of another — there is nothing there. Not in the code. Not in my head. Not in the agent’s context window, which evaporated three sessions ago.

This is not technical debt. We know what technical debt is. This is something else. This is cognitive debt, and it is the thing that will eat us if we keep pretending the only problem is bad code.

Let me try to explain what I mean.

The metaphor we already understood

Ward Cunningham coined “technical debt” in 1992, in an experience report at OOPSLA called The WyCash Portfolio Management System. He was trying to explain to his boss why they needed to rewrite — a word he used, refactoring not yet being in the vocabulary — a financial application. He had just read Lakoff and Johnson’s Metaphors We Live By, and reached for the financial metaphor because his manager was a finance person. The phrase stuck not because it was clever but because it was true: shipping first-time code is like going into debt. A little debt speeds things up, as long as you pay it back. Don’t pay it back, and the interest grinds the project to a halt.

Martin Fowler, seventeen years later, added the famous quadrant: deliberate vs. inadvertent, prudent vs. reckless. The honest senior engineer who says “we know this is hacky, we’ll pay for it later” is in a completely different moral universe from the team that doesn’t know it’s racking up debt at all. Both kinds of team exist. Both incur interest. Only one of them knows the balance is growing.

The industry has spent thirty years building tools, processes, and entire consulting practices around this metaphor. McKinsey put a number on it: by their estimates, technical debt represents 20–40% of the total value of a company’s technology estate before depreciation, and 30% of CIOs say more than 20% of their new-product budget is diverted to servicing existing debt. CISQ’s 2022 Cost of Poor Software Quality report put accumulated US technical debt at roughly $1.52 trillion, alongside at least $2.41 trillion in total cost of poor software quality. These are not small numbers. They are the dark matter of every engineering organization.

We understand technical debt. We may not always fix it, but we understand it. We can argue about it in language an executive will tolerate.

What we do not understand — what most engineering organizations are not even measuring yet — is the second balance sheet that has appeared on the books over the last three years.

The older sibling — cognitive debt is not new

Before I get to the AI part, I want to be careful about something. The phrase “cognitive debt” started trending in 2025, mostly off the back of one MIT Media Lab paper, and the discourse has compressed it down to “ChatGPT makes you dumber.” That framing is lazy. Cognitive debt — as a phenomenon — is older than ChatGPT by decades. The MIT paper accelerated and named something software engineering has lived with forever.

Cognitive load theory was articulated by John Sweller in 1988. The idea is brutally simple: human working memory is tiny, and every artifact you build either reduces or increases the cognitive load required to understand it. How tiny? Miller’s seven-plus-or-minus-two was revised downward by Nelson Cowan, in his 2001 Behavioral and Brain Sciences paper “The Magical Number 4 in Short-Term Memory,” to roughly four chunks for novel information once you account for output interference and other confounds. Felienne Hermans, in The Programmer’s Brain (Manning, 2021), translated this directly into how programmers read code: three cognitive processes (long-term memory for syntax and patterns, short-term memory for the immediate context, working memory for the active reasoning), each with hard biological limits. Programming, she argues, is the cognitive task we are worst-suited for as a species, because it makes simultaneous demands on all three.

What happens when a programmer leaves a project? Their long-term memory walks out the door with them. The codebase still exists; the model of the codebase does not. This is what we used to call project amnesia or knowledge amnesia — and it has been the dominant failure mode of long-lived software for as long as long-lived software has existed. We invented metrics for it: the bus factor, the truck number, popularized at the 1994 PLoP conference and turned into engineering folklore. We invented sociological terms: tribal knowledge, the Dead Sea effect (Bruce Webster, 2008), where talent evaporates and salt — undocumented decisions — crystallizes. We invented documentation practices to fight it: Michael Nygard’s Architecture Decision Records, formalized in his November 2011 essay Documenting Architecture Decisions, written to address exactly this: “A new person coming on to a project may be perplexed, baffled, delighted, or infuriated by some past decision. Without understanding the rationale or consequences, this person has only two choices: blindly accept the decision, or blindly change it.”

That is cognitive debt. That has always been cognitive debt. The interest payment is paid every time a new engineer onboards, every time someone has to reverse-engineer the why from the what, every time a refactor is blocked because nobody remembers which constraint the ugly bit was protecting.

We have known about this for thirty years. We mostly have not fixed it. We use phrases like “documentation rot” and “the docs are out of date” and shrug, because the economics never quite worked: writing the docs costs time now, and the benefit accrues to some future engineer (often a future version of yourself) who isn’t budgeted.

What AI assistants did was not invent cognitive debt. What they did was accelerate it by an order of magnitude and change its shape in a way we are still struggling to articulate.

What MIT actually measured

In June 2025, Nataliya Kosmyna and seven co-authors at the MIT Media Lab — including Pattie Maes — posted a preprint titled Your Brain on ChatGPT: Accumulation of Cognitive Debt when Using an AI Assistant for Essay Writing Task (arXiv:2506.08872). Fifty-four participants. Four sessions. Three groups: LLM users, search-engine users, brain-only writers. EEG monitoring throughout.

Read past the headlines and the actual findings are sharper than the discourse around them suggests. From the paper itself:

“EEG revealed significant differences in brain connectivity: Brain-only participants exhibited the strongest, most distributed networks; Search Engine users showed moderate engagement; and LLM users displayed the weakest connectivity. Cognitive activity scaled down in relation to external tool use.”

That last sentence is the one that should haunt anyone shipping AI-assisted code. Cognitive activity scaled down in relation to external tool use. The more you offloaded, the less your brain did. Not the same amount with a different shape. Less.

Specifics, as summarized in the PubMed Central writeup of the paper (PMC12723506): the LLM group exhibited up to 55% reduced connectivity compared with brain-only participants across alpha and beta bands. In Session 1, 83.3% of LLM users could not accurately quote their own essays minutes after writing them — versus roughly 11% in the other groups. The most damning finding came from Session 4, where the protocol switched: LLM users were asked to write without the tool, while brain-only writers were given one. Of the LLM-to-Brain cohort, 78% still could not quote a single passage from essays they had personally signed their names to. The authors write: “Over four months, LLM users consistently underperformed at neural, linguistic, and behavioral levels.”

Kosmyna’s framing of the term is what matters: cognitive debt is the accumulation of long-term cognitive cost from over-reliance on a tool that does the thinking for you. Her own line, from the ODSC podcast: “There is no cognitive credit card. You cannot pay this debt off.”

Now — and this is important — the paper is a preprint, n=54, only 18 completed Session 4, and the task was essay writing, not coding. The authors do not claim it generalizes to engineering work. A reasonable critique published in arXiv:2601.00856 (Stanković et al., University of Vienna and TU Dresden) argues that some results need more conservative interpretation. I take all of that seriously.

But I also work in software, and I have watched my own engineers — and myself — do the coding equivalent of what those subjects did with essays. We accept the suggestion. We move on. We could not, if pressed, explain why the diff is the way it is. The structure is the AI’s, not ours. The reasoning, if it ever existed, lived for a few hundred tokens in someone else’s context window and is now gone.

METR, GitClear, and the productivity illusion

If you want hard evidence that this matters at the engineering level — not just the essay-writing level — there are two studies you need to know about.

In July 2025, METR (Model Evaluation & Threat Research) published a randomized controlled trial titled Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity. Sixteen experienced developers, 246 real tasks from their own repositories (projects they had worked on for an average of five years), Cursor Pro with Claude 3.5/3.7 Sonnet. Everyone expected a speedup: the developers forecast 24%, and the experts METR consulted (economists, ML researchers) forecast 38–39%.

The result: developers using AI were 19% slower. After completing the study, they still believed they had been 20% faster. A 39-point gap between perception and reality. From the paper: “After completing the study, developers estimate that allowing AI reduced completion time by 20%. Surprisingly, we find that allowing AI actually increases completion time by 19% — AI tooling slowed developers down.”

(METR partially walked this back in February 2026, noting that developers who most believed AI helped them refused to participate in the no-AI arm even at $50/hour, which is methodologically interesting but doesn’t change the central finding for the participants who did run the trial.)

The cognitive mechanism METR identified was extra cognitive load and context-switching introduced by the AI — exactly Sweller’s extraneous load, layered onto already-demanding work. The screen recordings showed developers spending 9% of total task time just reviewing and modifying AI-generated code. That is overhead that didn’t exist before. You can feel fast and still be slow, because the felt-experience of typing fewer characters does not capture the time you spent later debugging something you didn’t write.

The second study, GitClear’s AI Copilot Code Quality research (Bill Harding and Matthew Kloster, 2024 baseline and the 2025 follow-up — 211 million lines of code analyzed across 2020–2024), measured the codebase-level consequences:

Refactoring activity collapsed. The percentage of changed lines associated with refactoring fell from 25% in 2021 to under 10% in 2024.

Code duplication exploded. GitClear tracked an eightfold increase in the frequency of code blocks with five or more lines that duplicate adjacent code — a prevalence of code duplication ten times higher than two years prior.

Code churn — lines reverted or significantly modified within two weeks of authorship — grew from 3.1% in 2020 to 5.7% in 2024.

For the first time in the dataset, in 2024, the introduction of repeated code surpassed refactoring activity.
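
To make the churn definition concrete, here is a minimal sketch of how such a rate could be computed. It assumes a hypothetical record per authored line, with the authoring date and the date of the first revert or significant modification. The `LineRecord` type and the sample data are my own illustration, not GitClear's schema or method.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

CHURN_WINDOW = timedelta(days=14)  # "within two weeks of authorship"


@dataclass(frozen=True)
class LineRecord:
    """Hypothetical record for one authored line (not GitClear's schema)."""
    authored: date
    # Date the line was first reverted or significantly modified, if ever.
    reworked: Optional[date] = None


def churn_rate(lines: list[LineRecord]) -> float:
    """Fraction of authored lines reworked within the churn window."""
    if not lines:
        return 0.0
    churned = sum(
        1
        for line in lines
        if line.reworked is not None
        and line.reworked - line.authored <= CHURN_WINDOW
    )
    return churned / len(lines)


sample = [
    LineRecord(date(2024, 3, 1), date(2024, 3, 5)),   # reworked after 4 days: churn
    LineRecord(date(2024, 3, 1), date(2024, 4, 20)),  # reworked after 50 days: not churn
    LineRecord(date(2024, 3, 2)),                     # never reworked
    LineRecord(date(2024, 3, 3), date(2024, 3, 17)),  # reworked after exactly 14 days: churn
]
print(f"churn rate: {churn_rate(sample):.1%}")  # prints "churn rate: 50.0%"
```

The point of the window is that it separates ordinary long-term maintenance from code that had to be fixed almost as soon as it landed.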

Read those numbers together. The codebase is filling up with copy-pasted blocks (because AI doesn’t grep for existing utilities, it generates new ones). The churn rate says we are increasingly fixing code within days of writing it. And we are not refactoring — not consolidating, not improving — at anywhere near historical rates. We are generating mass without metabolism.

Now compose the three findings. MIT says your brain disengages when you offload. METR says you are slower than you feel. GitClear says your codebase is becoming a graveyard of half-understood fragments. This is not a coincidence. This is the shape of cognitive debt in the age of AI-assisted development.

The new shape of amnesia

Here is what is genuinely new, and what we need to name precisely.

In the old world, project amnesia had a single source: humans leaving. People resigned, were promoted, went on parental leave, retired. Knowledge walked out with them. The mitigations — ADRs, design docs, runbooks, pairing, code reviews — were all attempts to externalize the cognitive model into artifacts that could outlive the individual.

In the new world, project amnesia has three sources, and only one of them is the old one.

First source: humans still leave. That hasn’t changed.

Second source: AI assistants have no persistent memory across sessions. Every Claude Code session, every Cursor conversation, every Copilot exchange starts from nothing. The model that “designed” a system on Monday has, on Tuesday, zero recollection of why. The reasoning that produced the code did exist, briefly, in a context window — and then it was garbage-collected like any other ephemeral process. The codebase persists. The justification does not. We have invented a new kind of contributor: one that ships production code and then has organizational amnesia by morning.

Third source — and this is the one that should keep you up at night — humans are not building mental models in the first place, because they outsourced the thinking. This is the cognitive debt the MIT paper measured. You can no longer assume that “the engineer who wrote this code understands it.” For a growing fraction of code, no one understands it. Not the AI (it forgot). Not the human (they never modeled it). Not a future maintainer (it was never explained). The code works — until it doesn’t — and when it doesn’t, the only reasoning available is whatever can be reconstructed from the diff and the test suite.

This is the knowledge graveyard. It is the GitClear duplicate blocks and the MIT 78% recall failure and the METR 9% review overhead, all the same thing, viewed from different instruments.

The vibe-coding wave — Andrej Karpathy’s February 2, 2025 post on X coining the term, then Collins Dictionary’s Word of the Year for 2025, then 25% of Y Combinator’s W25 batch with 95% AI-generated codebases — accelerated this further. Karpathy was honest about what he was describing: “I ‘Accept All’ always, I don’t read the diffs anymore. When I get error messages I just copy paste them in with no comment, usually that fixes it. The code grows beyond my usual comprehension, I’d have to really read through it for a while.” He was describing a weekend mode. The industry adopted it as a default. CodeRabbit’s December 17, 2025 “State of AI vs Human Code Generation” report (released via BusinessWire) found AI-generated PRs contain ~1.7× more issues on average than human-written PRs — 10.83 issues per PR versus 6.45 for human-only PRs — with elevated rates of logic errors, 75% more misconfigurations, and 2.74× higher security vulnerability density. Georgia Tech’s Vibe Security Radar tracked CVEs attributable to AI coding tools climbing from 6 in January 2026 to 35 in March 2026, with estimates that the true count is 5–10× higher because most AI tools leave no commit metadata.

The interest payment on cognitive debt is now showing up as security debt, maintenance debt, and recall debt. All at once. All on the same balance sheet.

Why the agents won’t save themselves

The optimistic position — the one I see promoted by every vendor right now — is that bigger context windows and better models will fix this. A million tokens. Ten million tokens. Eventually the agent will hold the entire codebase in its head and the amnesia problem solves itself.

This is wrong, and it is wrong for the same reason that more RAM does not fix a bad database design. A bigger buffer is still a buffer. It is expensive, it is ephemeral, and it dissolves at the end of the session. It does not think. Vector stores are slightly better, since they can retrieve fragments that sound like your query, but they still cannot answer the questions that matter for engineering: “Why did we choose this approach over the three alternatives? What constraint was this protecting? Who promised what to whom, and when did that promise change?”

This is the bet I have been making for the last two years, and it is the spine of the book I just finished: agents do not need bigger context windows. They need world models.

A world model, in the agent sense, is not a memory. It is a structured, evolving representation of what the system knows about its environment: the entities, the relationships, the temporal order, the causal structure, the commitments. The research community has been converging on this from multiple directions. AriGraph, from Anokhin and colleagues at AIRI (IJCAI 2025, originally arXiv:2407.04363), showed that an LLM agent equipped with a knowledge-graph world model integrating semantic and episodic memory outperformed every memory baseline in complex zero-shot tasks (text adventures so hard that humans struggled). The Graph World Model paper from Feng, Wu, Lin, and You (ICML 2025, arXiv:2507.10539) generalized this further: a unified message-passing world model that operates over graph-structured states across six domains, from multi-agent coordination to retrieval-augmented generation to planning. Earlier, the Worldformer of Ammanabrolu and Riedl (NeurIPS 2021, Learning Knowledge Graph-based World Models of Textual Environments) had already shown that predicting the difference in the graph (not the next token, but the next state of the world) was a stronger basis for agent reasoning than language modeling alone.

The pattern, across all of this work, is the same: agents that have a typed, graph-structured, persistent representation of the world they are acting in — episodic memory of what happened, semantic memory of how things relate, procedural memory of how to do things — outperform agents that have only a context window. By a lot.

Now apply this to software engineering, which is what I have been doing.

If your codebase had a real world model — a metagraph representation of components, contracts, decisions, promises, behaviors, evolution over time — then a Claude Code session next Tuesday would not start from nothing. It would attach to a persistent shared cognitive substrate that survives the session boundary. The reasoning behind a decision would not live for 200K tokens and die. It would live in the world model. The human onboarding next month would query the same world model. The AI in six months running on a different vendor’s foundation model would query the same world model.
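
A minimal sketch of what "surviving the session boundary" could look like, assuming a deliberately tiny typed graph serialized to JSON. The `WorldModel` class, the `justified_by` relation, and the retry-policy example are all hypothetical names of mine. A real metagraph would carry far more structure (time, contracts, promises) than this.

```python
import json
from dataclasses import dataclass, field


@dataclass
class WorldModel:
    """Tiny typed graph: nodes carry a kind, edges carry a relation."""
    nodes: dict[str, dict] = field(default_factory=dict)
    # Each edge is (source id, relation, target id).
    edges: list[tuple[str, str, str]] = field(default_factory=list)

    def add_node(self, node_id: str, kind: str, text: str) -> None:
        self.nodes[node_id] = {"kind": kind, "text": text}

    def add_edge(self, source: str, relation: str, target: str) -> None:
        if source not in self.nodes or target not in self.nodes:
            raise KeyError("both endpoints must exist before linking them")
        self.edges.append((source, relation, target))

    def why(self, node_id: str) -> list[str]:
        """Follow 'justified_by' edges transitively and collect the rationale."""
        seen, stack, reasons = {node_id}, [node_id], []
        while stack:
            current = stack.pop()
            for source, relation, target in self.edges:
                if source == current and relation == "justified_by" and target not in seen:
                    seen.add(target)
                    reasons.append(self.nodes[target]["text"])
                    stack.append(target)
        return reasons

    def to_json(self) -> str:
        return json.dumps({"nodes": self.nodes, "edges": self.edges})

    @classmethod
    def from_json(cls, raw: str) -> "WorldModel":
        data = json.loads(raw)
        # JSON turns tuples into lists, so convert them back on load.
        return cls(nodes=data["nodes"], edges=[tuple(e) for e in data["edges"]])


monday = WorldModel()
monday.add_node("retry-policy", "component", "HTTP client retries 3 times with backoff")
monday.add_node("adr-007", "decision", "Cap retries at 3 to stay inside the upstream rate limit")
monday.add_node("c-rate-limit", "constraint", "Upstream API allows 10 requests per second per key")
monday.add_edge("retry-policy", "justified_by", "adr-007")
monday.add_edge("adr-007", "justified_by", "c-rate-limit")

saved = monday.to_json()               # Monday's session ends; only the JSON survives
tuesday = WorldModel.from_json(saved)  # Tuesday's session attaches to the same model
print(tuesday.why("retry-policy"))
```

Tuesday's session has never seen Monday's conversation, yet `why("retry-policy")` returns the decision and the constraint behind it. That is a question neither a context window nor a similarity search is built to answer.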

This is the move from prompt engineering to context engineering to what I think is the real frontier: world model engineering. Prompt engineering was about phrasing. Context engineering — Tobi Lütke’s term, popularized by Karpathy in mid-2025 — was about what you cram into the window. World model engineering is about building the persistent, structured, queryable substrate that outlives the window.

If you want the deeper argument, that is what my book World Models for AI Agents (https://leanpub.com/agenticworldmodels) is for. It covers episodic and semantic memory, temporal knowledge graphs, metagraphs as a backbone, JEPA-style latent world models, promise theory for multi-agent coordination, and — crucially — why retrieval is not enough and why context windows are not memory. I do not want to recapitulate the whole book here. I want to make a different argument: that the same world-model framing applies to the engineering process itself, not just the agents we are building.

Specs are the cascading substrate

Here is where I want to get concrete, because abstract arguments about cognitive substrates are easy to nod along to and impossible to act on.

The substrate already exists in latent form, in your engineering organization, and you are throwing it away every day. It is the specification.

Spec-driven development is having a moment. GitHub Spec Kit, open-sourced in September 2025 and now supporting 30+ AI coding agents (Claude Code, Cursor, Gemini CLI, Codex CLI, Copilot, Kiro, Windsurf, the rest of the alphabet), is the highest-profile entrant. AWS Kiro, released mid-2025, embeds the same idea natively in an IDE. Tessl pushes toward spec-as-source: the spec is the code, and the implementation files carry “DO NOT EDIT” comments. Martin Fowler’s team at Thoughtworks has been writing extensively about Spec-Driven Development as a coherent methodology, identifying three rigor levels (spec as guide, spec as living artifact, spec as source).

The workflow is consistent across all of them: constitution (or steering, or principles) → specify → plan → tasks → implement. Each phase produces a Markdown (or richer) artifact that feeds the next. The agent does not start from a prompt; it starts from a cascade of specifications.

I have been writing about this cascade for a while now. It is exactly what it sounds like: a layered hierarchy of intent that flows downward. At the top, principles that are immutable for the project (Spec Kit calls them the constitution). Below that, architectural decisions — ADRs in the Nygard format, or the modernized MADR 4.0.0, or Olaf Zimmermann’s Y-Statements. Below that, component contracts. Below that, behavior specifications. Below that, code patterns and generation rules. At every level, the why is preserved, not just the what.
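
The layering above can be sketched as a small data structure. This is an illustration under my own assumptions: the layer names, the `SpecItem` type, and the sample entries are hypothetical, and no spec-driven tool is claimed to store its cascade this way. What the sketch shows is the property that matters, namely that every item below the top layer can be traced upward to a reason.

```python
from dataclasses import dataclass
from typing import Optional

# Layer order from the paragraph above, top of the cascade first.
LAYERS = ["principle", "decision", "contract", "behavior", "pattern"]


@dataclass(frozen=True)
class SpecItem:
    """One hypothetical entry in the cascade."""
    item_id: str
    layer: str
    why: str
    parent: Optional[str] = None  # id of the higher-layer item this one derives from


def trace(cascade: dict[str, SpecItem], item_id: str) -> list[str]:
    """Walk upward from an item to its root, collecting each layer's 'why'."""
    chain, seen = [], set()
    current: Optional[str] = item_id
    while current is not None and current not in seen:
        seen.add(current)  # guards against accidental cycles
        item = cascade[current]
        chain.append(f"{item.layer}: {item.why}")
        current = item.parent
    return chain


def orphans(cascade: dict[str, SpecItem]) -> list[str]:
    """Items below the top layer whose parent is missing or not in a higher layer."""
    bad = []
    for item in cascade.values():
        if item.layer == LAYERS[0]:
            continue
        parent = cascade.get(item.parent) if item.parent else None
        if parent is None or LAYERS.index(parent.layer) >= LAYERS.index(item.layer):
            bad.append(item.item_id)
    return bad


items = [
    SpecItem("p1", "principle", "Customer data never leaves the EU"),
    SpecItem("d1", "decision", "Host the primary database in Frankfurt", parent="p1"),
    SpecItem("c1", "contract", "Storage API rejects non-EU region parameters", parent="d1"),
    SpecItem("b9", "behavior", "Exports are retried every hour"),  # no parent recorded
]
cascade = {item.item_id: item for item in items}

for line in trace(cascade, "c1"):
    print(line)
print(orphans(cascade))  # prints ['b9']
```

An orphan like `b9` is cognitive debt in miniature: a behavior that exists, with no recorded reason above it.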

The cascade is the world model. Or, more precisely: the cascade is what you feed to the world model so it has something to attach to.

Here is the move that changes the equation. In the old documentation paradigm, you wrote ADRs and design docs for future humans. The economics were terrible — write now, benefit accrues later, to someone who probably won’t read them anyway. That is why ADRs, despite being a known-good practice since 2011, are sitting in two-year-old GitHub folders across a hundred thousand companies, untouched.

In the new paradigm, you write specs for the agent that will be in the loop tomorrow morning. The benefit accrues immediately. The spec is not future-tense documentation that nobody will read; it is present-tense context that the AI will read on the next /specify, the next /plan, the next /implement. The same artifact, by the same hands, but with an entirely different ROI profile because the consumer changed from “future humans, maybe” to “every agent, every session, definitely.”

This flips the documentation economics. For the first time in thirty years, writing the ADR pays you back in the same week.

And — this is the part I find genuinely beautiful — the spec serves both readers. The human can read the constitution and the design docs and reconstruct the model. The agent can ingest the same cascade and build its context. The cognitive load is shared between human and machine, with the spec as the externalized substrate. Sweller’s working-memory bottleneck is partially relieved, because what used to live only in the senior engineer’s head now lives in the cascade. The MIT cognitive-debt finding is mitigated, because the human is forced to write the spec, which is the thinking step that LLM-only workflows skip. The METR slowdown is mitigated, because the agent has structured context instead of probabilistic guesses about what you meant.

This is the argument I make in detail in my other book, Clarity Engineer: Code Is the Side Effect (https://leanpub.com/clarityengineer). The thesis there is even shorter: software engineers are not code writers; we are clarity traders. We translate the ambiguity of business requirements into the precision that machines demand. AI handles the translation step faster than ever — but only if you first achieve the clarity that makes translation possible. Code is what comes out the back end. The real work is upstream, in the specification of intent.

What to actually do on Monday morning

I want to be honest about what this implies, because it implies more work, not less.

If you take cognitive debt seriously, you have to stop measuring AI productivity by lines-of-code-per-day or tickets-closed-per-sprint. Those metrics — which the GitClear report shows can be juiced trivially by duplication — measure the speed of fragment generation. They do not measure whether the codebase has a coherent world model behind it. A better proxy: can a new engineer (or a fresh AI session) reconstruct the reasoning behind a non-trivial decision in under 30 minutes? If yes, you have low cognitive debt. If no — if the only person who knows is the senior engineer who happened to be on the Slack thread when Claude proposed the design — you have high cognitive debt, and the interest payments are coming.
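
If you want to turn that proxy into a number, one rough way is to sample a handful of non-trivial decisions, time a newcomer (or a fresh agent session) on each, and report the share reconstructed inside the bar. The sketch below assumes exactly that procedure. The function name and the sample decisions are hypothetical, and this is not an established metric.

```python
from typing import Optional

THRESHOLD_MINUTES = 30  # the 30-minute bar from the paragraph above


def reconstruction_rate(trials: dict[str, Optional[float]]) -> float:
    """Share of sampled decisions whose rationale was reconstructed in time.

    `trials` maps a decision label to the minutes needed to reconstruct
    its rationale, or None if the person (or agent) gave up.
    """
    if not trials:
        raise ValueError("need at least one trial")
    passed = sum(
        1
        for minutes in trials.values()
        if minutes is not None and minutes < THRESHOLD_MINUTES
    )
    return passed / len(trials)


trials = {
    "retry limits in the HTTP client": 12.0,
    "why auth uses short-lived tokens": 45.0,
    "why orders are soft-deleted": None,  # nobody could find out
    "choice of message queue": 25.0,
}
print(f"{reconstruction_rate(trials):.0%} of sampled decisions reconstructed in time")
```

With this sample the rate is 50%. The absolute number matters less than its direction over time.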

Concretely: start writing ADRs again, but write them for the agent. Use the Nygard template or MADR — they are five short sections, the bar is low. Put them in docs/adr/ and reference them from your spec-kit constitution. The agent will read them. So will the next human. The same artifact serves both.
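
To show how low the bar is, here is a minimal sketch that renders the five Nygard sections (title, status, context, decision, consequences) as Markdown and proposes a path under docs/adr/. The zero-padded file naming, the helper names, and the sample decision are my assumptions, not part of Nygard's essay or of any tool.

```python
import re
from pathlib import PurePosixPath

ADR_DIR = PurePosixPath("docs/adr")  # directory named in the text above


def slugify(title: str) -> str:
    """Lowercase the title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def adr_path(number: int, title: str) -> PurePosixPath:
    # Zero-padded numbering is an assumed convention, not a requirement.
    return ADR_DIR / f"{number:04d}-{slugify(title)}.md"


def render_adr(number: int, title: str, status: str,
               context: str, decision: str, consequences: str) -> str:
    """Render title, status, context, decision, and consequences as Markdown."""
    return (
        f"# {number}. {title}\n\n"
        f"## Status\n\n{status}\n\n"
        f"## Context\n\n{context}\n\n"
        f"## Decision\n\n{decision}\n\n"
        f"## Consequences\n\n{consequences}\n"
    )


path = adr_path(7, "Cap HTTP retries at three")
text = render_adr(
    7,
    "Cap HTTP retries at three",
    "Accepted",
    "The upstream API rate-limits each key; unbounded retries caused lockouts.",
    "The HTTP client retries at most three times with exponential backoff.",
    "Some transient failures surface to callers; lockouts stop.",
)
print(path)  # prints docs/adr/0007-cap-http-retries-at-three.md
print(text)
```

Five short fields, one file. The writing takes minutes, and the context field is the part no agent can recover from the diff.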

Adopt one of the spec-driven tools. Spec Kit is free, agent-agnostic, has 97K GitHub stars, and produces Markdown artifacts you own. Start with a small feature, not a brownfield monster. Notice what happens to your delivery cadence after the first four or five features: you will probably write more spec and less code; the agent will produce code more aligned with intent; you will catch ambiguities at the spec layer instead of the debugging layer.

Treat the codebase as having two artifacts, not one: the code, and the world model that explains the code. The world model lives in some combination of ADRs, specs, knowledge graphs, and the structured artifacts your spec-driven tool produces. Maintain it like you maintain the code. When something changes, both change. When you ship, you ship both. When you onboard, you onboard both.
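
"When something changes, both change" can be enforced mechanically, at least crudely. The sketch below assumes you already have the list of changed paths for a change set (for example from `git diff --name-only`), and that code lives under src/ while the world model lives under docs/adr/ and specs/. Those directory names and the `check_change` function are my assumptions.

```python
CODE_PREFIXES = ("src/",)                 # assumed location of the code
MODEL_PREFIXES = ("docs/adr/", "specs/")  # assumed location of the world model


def check_change(changed_paths: list[str]) -> tuple[bool, str]:
    """Return (ok, message) for one change set.

    The change set fails when code moved but no ADR or spec moved with it.
    """
    code = [p for p in changed_paths if p.startswith(CODE_PREFIXES)]
    model = [p for p in changed_paths if p.startswith(MODEL_PREFIXES)]
    if code and not model:
        return False, f"{len(code)} code file(s) changed without an ADR or spec update"
    return True, "ok"


print(check_change(["src/http/client.py"]))
print(check_change(["src/http/client.py", "docs/adr/0007-cap-http-retries-at-three.md"]))
print(check_change(["README.md"]))
```

This is a blunt heuristic: plenty of code changes need no new rationale, so in practice you would want an explicit opt-out rather than a hard gate. Its value is that skipping the world model becomes a visible choice instead of a silent default.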

And, please, read the diff. Karpathy’s “I ‘Accept All’ always, I don’t read the diffs anymore” was a weekend mode. We adopted it as a default. The numbers above are what that default costs.

Coda

I started by quoting myself: the memory of your project is not your codebase. Let me end by saying what I mean by it now, after writing all of the above.

The codebase is the artifact. The world model is the memory. The cascade of specifications is how the world model is constructed, maintained, and shared between humans and agents. Cognitive debt is what happens when you let the codebase grow without the world model — when you optimize for speed of code generation and forget that generation without comprehension is debt with a shorter repayment window.

The MIT paper, the METR study, the GitClear report — they are not three different problems. They are the same problem, measured by three different instruments. The fix is the same. The fix is to make the why a first-class artifact again, and to give it to both the humans and the agents as a shared substrate they can both rely on.

This is the work of the next five years. Not bigger context windows. Not smarter models. The patient, unglamorous work of building world models that outlive sessions, of writing specifications that cascade, of treating clarity as the actual product and code as the side effect.

If you want to go deeper on the agent-architecture side — episodic and semantic memory, knowledge graphs as world models, promise theory, metagraphs, the path from memory to identity — World Models for AI Agents (https://leanpub.com/agenticworldmodels) is where I make the technical case in full. If you want the engineering-practice side — the cascade of specifications, intent documentation, working with Claude Code without losing yourself in it, what AI-native organizations actually look like — Clarity Engineer: Code Is the Side Effect (https://leanpub.com/clarityengineer) is the field guide.

But honestly, if you take only one thing from this essay: there is no cognitive credit card. The interest compounds whether you are looking or not. Build the world model. Write the spec. Read the diff.

The agents will not save themselves. We have to give them — and ourselves — something to remember with.