Developers spend roughly 58% of their time on program comprehension (Xia et al., 2018, a large-scale field study). New job, inherited project, PR review in an unfamiliar module, your own code from six months ago — the problem is always the same: you need a mental model of how this system works, and the code doesn't explain itself.
The default approach is to say "explain this codebase" and get a generic summary. The result reads like a textbook: accurate but useless. You learn that "the API layer handles requests" but not why authentication uses middleware in some routes and inline checks in others, or that the utils/ directory is actually three unrelated things that grew together by accident.
This technique builds a mental model in layers, starting from what users see and drilling inward. Each layer answers a different question, and each layer's answers inform the next layer's questions.
The approach is grounded in program comprehension research — Littman et al. found that developers who used a systematic strategy (tracing data flow across the whole program) made successful modifications, while developers who studied code as-needed (reading only what seemed locally relevant) consistently failed. The layers here formalize what expert developers do intuitively.
The technique
Layer 1: Shape and entry points
Start with the bird's-eye view. Don't read code yet — read structure. Directory names, config files, and entry points are what cognitive scientists call beacons — they trigger pattern recognition ("this is a Next.js app with Pages Router") that lets you chunk entire subsystems before reading a single line.
I'm new to this codebase. Help me build a mental model, starting with the shape.
1. Look at the top-level directory structure. What kind of project is this? (monorepo, single app, library, etc.)
2. Read package.json (or equivalent). What are the key scripts? What framework and language?
3. Find the entry points — where does execution start? For a web app: pages/routes. For a library: the main export. For a CLI: the bin script.
4. Find the configuration files. What tools are configured and what do the configs tell you about how the team works? (strict TypeScript? heavy linting? CI/CD?)
Give me:
- A one-paragraph summary of what this project IS
- A list of entry points (the "front doors" into the code)
- Anything surprising or unusual about the structure
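You can gather the raw material for this layer yourself before asking anything. A minimal sketch using only POSIX tools; the `repo_shape` helper name is my own, and the ignore list is an assumption you should adjust:

```shell
#!/bin/sh
# repo_shape: print a project's top-level layout plus a file count per
# directory, so you can see where the bulk of the code lives before
# reading any of it.
repo_shape() {
  dir="${1:-.}"
  echo "== top-level entries =="
  ls -1 "$dir"
  echo
  echo "== files per top-level directory =="
  for d in "$dir"/*/; do
    count=$(find "$d" -type f ! -path '*/node_modules/*' ! -path '*/.git/*' | wc -l)
    printf '%6d  %s\n' "$count" "$d"
  done
}
```

Paste the output, along with package.json, into the Layer 1 prompt so the model starts from facts instead of rediscovering them.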
Layer 2: Data flow
Now trace how information moves through the system. This is the most important layer — research consistently shows that developers who trace data flow globally understand systems more accurately than those who only read code locally.
Trace the data flow for [the most important entry point from Layer 1].
Starting from where data enters the system (user request, file read, API call), follow it through every transformation until it reaches the user (rendered page, API response, file output).
For each step:
- What file handles this step?
- What is the data shape at this point? (What fields exist? What types?)
- What transformation happens? (filtering, sorting, enriching, formatting)
- Where does the data go next?
Draw the flow as a simple chain:
[entry] → file:function → file:function → [output]
If the flow branches (e.g., same data goes to both a page and an RSS feed), show each branch.
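As a concrete illustration, a traced flow for a hypothetical static blog might come back looking like this; every file and function name here is invented:

```
[GET /posts/hello-world]
  → pages/posts/[slug].tsx:getStaticProps
  → lib/posts.ts:getPostBySlug        (reads markdown file, parses frontmatter)
  → lib/markdown.ts:renderMarkdown    (markdown string → HTML string)
  → pages/posts/[slug].tsx:PostPage   (HTML + metadata → rendered page)
  → [HTML response]
```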
Layer 3: Conventions and patterns
Every codebase has implicit rules that nobody wrote down. These are the things that make code "feel right" or "feel wrong" to the team, and they're invisible until you violate one.
Read 3-5 files that do similar things (e.g., multiple page components, multiple API routes, multiple test files). Compare them and identify:
## Conventions (patterns that are consistent)
- Naming: How are files, functions, variables, and components named?
- Structure: How are files organized internally? (imports order, section grouping)
- Error handling: How do errors propagate? (thrown, returned, logged, swallowed)
- State management: How is state handled? (props, context, stores, URL params)
- Data fetching: Where does data fetching happen? (server-side, client-side, both)
## Inconsistencies (patterns that vary)
- Places where files that should follow the same pattern don't
- Newer code that uses a different pattern than older code (reveals an ongoing migration)
- Areas where two different approaches coexist
## Unwritten rules
Based on what's consistent, what would a new developer need to know to write code that "fits" this codebase? List these as explicit rules, even if they were never documented.
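Some convention signals can be counted mechanically before you ask about them. A rough sketch that tallies file-naming styles under a directory; the regexes are heuristics and the `naming_styles` name is mine:

```shell
#!/bin/sh
# naming_styles: count file-name styles under a directory to spot the
# dominant convention and the outliers. Crude patterns; a file can
# slip between buckets in odd cases.
naming_styles() {
  dir="${1:-.}"
  files=$(find "$dir" -type f | sed 's|.*/||')
  echo "kebab-case:  $(echo "$files" | grep -c '^[a-z0-9]\{1,\}\(-[a-z0-9]\{1,\}\)\{1,\}\.')"
  echo "camelCase:   $(echo "$files" | grep -c '^[a-z][a-z0-9]*[A-Z]')"
  echo "PascalCase:  $(echo "$files" | grep -c '^[A-Z]')"
}
```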
Layer 4: The archaeology
Code is a geological record — layers of decisions, some intentional, some accidental, some actively harmful. This layer uncovers the why behind the what.
Now dig deeper. For the areas you've explored:
1. **Hotspot analysis**: Run this to find the most-changed files in the past year:
git log --format=format: --name-only --since=12.months | grep -v '^$' | sort | uniq -c | sort -rn | head -20
Files that change most often are where bugs cluster, design problems live, and understanding is most needed.
2. **Temporal coupling**: Look for files that frequently change together in commits but don't import each other. This reveals hidden dependencies the import graph doesn't show.
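A rough version of this report can be scripted. The sketch below pairs up the files that appear in each commit and ranks the pairs by frequency; there is no statistical filtering, so treat the top entries as leads to investigate, not proof of coupling (`cochange_pairs` is an invented name):

```shell
#!/bin/sh
# cochange_pairs: for each commit, emit every pair of files changed
# together, then rank pairs by how often they co-occur. Cross-check
# the top pairs against the import graph: frequent co-change with no
# import between the files is the hidden-dependency smell.
cochange_pairs() {
  git -C "${1:-.}" log --format='@%h' --name-only |
  awk '
    /^@/ { n = 0; next }                   # commit boundary: reset list
    NF   { for (i = 0; i < n; i++)         # pair with earlier files
             pairs[files[i] " + " $0]++
           files[n++] = $0 }
    END  { for (p in pairs) print pairs[p], p }
  ' | sort -rn | head -20
}
```

Pair counting is quadratic in the number of files per commit, so on repos with giant commits add a `--since` to bound the history.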
3. **Pickaxe search**: When you find a pattern and want to know when it was introduced:
git log -S'function_name' --oneline
This shows only the commits that added or removed occurrences of that string — far more precise than grepping through the full history.
4. **Dead code and vestigial structures**: Look for:
- Exported functions that nothing imports
- Config options that are never used
- TODO/FIXME/HACK comments — these are developer breadcrumbs
- Comments that reference removed features or old behavior
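Part of this can be checked mechanically. A grep-based sketch for the "exported but never imported" case, assuming an ES-module JavaScript/TypeScript codebase; it misses re-exports and dynamic access and over-reports on common words, so the output is a list of suspects, not a verdict (`unreferenced_exports` is my name for it):

```shell
#!/bin/sh
# unreferenced_exports: list names that are exported somewhere but
# mentioned in no other file. Pure text search, no AST awareness.
unreferenced_exports() {
  dir="${1:-.}"
  grep -rhoE --include='*.js' --include='*.ts' \
      'export (function|const|class) [A-Za-z_][A-Za-z0-9_]*' "$dir" |
  awk '{ print $3 }' | sort -u |
  while read -r name; do
    # count files that mention the name at all (including its definer)
    uses=$(grep -rl --include='*.js' --include='*.ts' "$name" "$dir" | wc -l | tr -d ' ')
    if [ "$uses" -le 1 ]; then echo "$name"; fi
  done
}
```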
5. **Decision residue**: Based on the patterns, inconsistencies, and git history, what decisions were made and why? Where does the code reveal:
- A migration that's half-finished
- A pattern that was adopted and then abandoned
- A dependency that was added for one use case and is now load-bearing
- A performance optimization that adds complexity
Summarize: what are the 3-5 most important things a new developer needs to understand about this codebase that they wouldn't learn from reading the README?
When to use it
- First week at a new job — build the mental model before writing any code
- Inheriting a project from another team or developer
- Before a major refactor — understand the full landscape, not just the part you're changing
- When reviewing a PR in an unfamiliar part of the codebase — run Layers 1-2 on that area first
- Returning to your own code after months away
When NOT to use it
- On a project you wrote last week — you already have the mental model
- On a tiny project (under 10 files) — just read it
- When you only need to understand one function — use the Impact Analysis technique instead
Why layers matter
Reading code file-by-file is like reading a dictionary to learn a language — you understand every word but can't speak. This is explained by cognitive science: your working memory holds only 3-5 "chunks" of information at once. Without a framework to organize what you're reading, every file is a new isolated fact that displaces the last one. The layered approach builds understanding in the order your brain needs it:
- Shape tells you where things are (spatial orientation — now every file has a "slot")
- Data flow tells you how things connect (causal reasoning — facts become a story)
- Conventions tell you how to write code that fits (pattern matching — you can predict what you'll find)
- Archaeology tells you why things are the way they are (historical context — the story has a plot)
Skipping to Layer 4 without Layers 1-3 produces the wrong kind of understanding — you learn interesting facts about the codebase without a framework to organize them.
Tips
- The layers are a teaching structure, not a rigid sequence. In practice you'll bounce between layers as new questions arise — that's expected and healthy. The layers ensure you build each type of understanding rather than fixating on one.
- Layer 3 (conventions) is the most immediately practical. If you need to start contributing code quickly, run Layers 1 and 3 and save 2 and 4 for later.
- The "unwritten rules" from Layer 3 are excellent candidates for adding to a CLAUDE.md file — see the CLAUDE.md Starter for how to structure that.
- After running the layers, pick a small bug or task to work on. This tests your mental model against reality and is where real understanding solidifies.
- For deeper forensic analysis, see Adam Tornhill's Your Code as a Crime Scene, which extends these ideas with quantitative methods like complexity-churn analysis and geographic profiling of code changes.