The Debug with Context skill works when you have a stack trace and can reproduce the bug locally. But the hardest bugs don't cooperate: they happen in production but not locally, they occur intermittently, or they develop gradually (performance degrades over weeks until someone notices). For these, you can't set breakpoints. You can't reproduce on demand. You have to reason backwards from incomplete evidence to a root cause — like a detective, not a debugger.
This technique is systematic hypothesis elimination, adapted from Andreas Zeller's scientific debugging methodology and Google's SRE troubleshooting framework. Instead of guessing a cause and hoping the fix works, you generate multiple hypotheses, design tests that would eliminate each one, and narrow down until one survives.
## The technique

### Phase 0: Triage
For production incidents, stop the bleeding before you investigate. This phase is from Google's SRE playbook: mitigate first, diagnose second.
```
The system is experiencing [symptom]. Before investigating the root cause:

1. **Assess impact**: How many users are affected? Is functionality degraded or completely broken?
2. **Mitigate**: Can we reduce impact immediately? (rollback recent deploy, restart service, enable feature flag, route traffic away from affected component)
3. **Preserve evidence**: Before restarting anything, capture: logs, metrics dashboards, error rates, memory/CPU state, connection pool stats. Restarts destroy the in-memory evidence needed for diagnosis.

Mitigation is NOT diagnosis. We may restart a service to stop the bleeding, but we still need to find and fix the root cause or it will recur.
```
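The "preserve evidence" step is easy to script ahead of an incident. A minimal sketch in Python, assuming you wire in whatever your stack actually exposes (the pool-stats and error-rate lambdas below are stand-ins, not real APIs):

```python
import json
import time
from pathlib import Path

def snapshot_evidence(captures, out_dir="incident-evidence"):
    """Persist every piece of volatile evidence before anyone restarts
    anything. `captures` maps an evidence name to a zero-argument callable
    returning a JSON-serializable value (log tail, pool stats, error rates)."""
    stamp = time.strftime("%Y%m%dT%H%M%S")
    dest = Path(out_dir) / stamp
    dest.mkdir(parents=True, exist_ok=True)
    for name, capture in captures.items():
        try:
            payload = {"captured_at": stamp, "data": capture()}
        except Exception as exc:
            # A capture that fails is itself evidence; record the failure.
            payload = {"captured_at": stamp, "error": repr(exc)}
        (dest / f"{name}.json").write_text(json.dumps(payload, indent=2))
    return dest

# Wire in whatever your stack actually exposes; these lambdas are stand-ins.
evidence_dir = snapshot_evidence({
    "pool_stats": lambda: {"active": 48, "max": 50},
    "error_rates": lambda: {"http_5xx_per_min": 120},
})
```

Because the snapshot is timestamped and append-only, you can restart the service immediately afterward without losing the diagnosis material.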
### Phase 1: Gather evidence
Before theorizing, collect everything available. The most common debugging mistake is forming a hypothesis too early and then only looking for confirming evidence.
```
I need to diagnose a problem. Before suggesting any cause or fix, gather evidence.

## The symptom

[Describe what's happening — the observable behavior, not your theory about why]

## Gather these (check all that are available):

1. **Error evidence**: Exact error messages, stack traces, log entries around the time of the incident. Include timestamps.
2. **Scope**: How widespread is this? One user? All users? One endpoint? All endpoints? One region? This narrows the search space dramatically.
3. **Timeline**: When did this start? Was it gradual or sudden? A sudden onset suggests a deployment or configuration change. Gradual onset suggests resource exhaustion, data growth, or a race condition that becomes more likely under load.
4. **What changed**: Check `git log` for recent deployments. Check for config changes, dependency upgrades, environment variable changes, infrastructure changes, certificate expirations, DNS changes. "Nothing changed" is almost never true — something always changed. Maybe it was data volume, traffic patterns, or a third-party API.
5. **What still works**: Is there a related feature or endpoint that functions correctly? The difference between "works" and "broken" often points directly at the cause.
6. **Sanity checks**: Verify the obvious first. Is the service running? Is the database reachable? Are environment variables set correctly? Is there disk space? (Agans' Rule 7: "Check the plug.")
7. **Reproduction**: Can you reproduce it? If intermittent, what conditions seem to correlate? (time of day, specific user actions, load level, specific data)

Present the evidence without interpretation. We'll theorize in the next step.
```
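The scope question ("one user or all users? one endpoint or all endpoints?") is often answerable with a few lines of log crunching. A sketch, assuming a simplified whitespace-separated `timestamp endpoint user status` log format (adapt the parsing to your real schema):

```python
from collections import Counter

def scope_report(log_lines):
    """Count server errors by endpoint, user, and status code to answer
    the scope question. Assumes a 'timestamp endpoint user status' line
    format; adapt the split to your real log schema."""
    endpoints, users, statuses = Counter(), Counter(), Counter()
    for line in log_lines:
        _ts, endpoint, user, status = line.split()
        if status.startswith("5"):
            endpoints[endpoint] += 1
            users[user] += 1
            statuses[status] += 1
    return {"endpoints": endpoints, "users": users, "statuses": statuses}

logs = [
    "12:00:01 /checkout u1 500",
    "12:00:02 /checkout u2 500",
    "12:00:03 /search u3 200",
]
report = scope_report(logs)
# All 5xx errors on one endpoint across multiple users: the search space
# narrows to /checkout, not to a single user's data.
```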
### Phase 2: Generate hypotheses
With evidence collected, generate multiple possible causes. The key word is multiple — a single hypothesis creates confirmation bias.
```
Based on the evidence gathered, generate 3-5 hypotheses for the root cause. For each one:

1. **Hypothesis**: [What you think is happening]
2. **Prediction**: [If this hypothesis is true, what specific, observable thing do you expect to see when you test it? State this BEFORE looking.]
3. **Explains**: [Which pieces of evidence this hypothesis accounts for]
4. **Doesn't explain**: [Which pieces of evidence this hypothesis can NOT account for — this is critical for ranking]
5. **Test**: [A specific action that would confirm or eliminate this hypothesis — a log query, a code path trace, a reproduction step, a data inspection]

Rank hypotheses by:
- How much evidence they explain (and how little they contradict)
- How testable they are (prefer hypotheses you can eliminate quickly)
- How cheap the test is (run the 5-minute test before the 2-hour test)

Important: At least one hypothesis should be something other than a code bug. Consider:
- Data issues (corrupt data, unexpected values, schema mismatch)
- Infrastructure (resource exhaustion, network partitioning, clock skew)
- Race conditions (timing-dependent behavior that varies by load)
- External dependencies (third-party API changes, certificate expiry, DNS)
- Configuration drift (environment variable differs between environments)
```
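The ranking criteria can be made mechanical: more evidence explained and less contradicted ranks higher, with test cost as the tie-breaker. A sketch with illustrative hypotheses (the names and minute estimates are invented for the example):

```python
from dataclasses import dataclass

@dataclass
class Hypothesis:
    cause: str
    explains: list      # evidence this hypothesis accounts for
    contradicts: list   # evidence it cannot account for
    test_minutes: int   # estimated cost of the eliminating test

def rank(hypotheses):
    """Most evidence explained and least contradicted first;
    the cheaper test wins ties."""
    return sorted(
        hypotheses,
        key=lambda h: (len(h.contradicts) - len(h.explains), h.test_minutes),
    )

hypotheses = [
    Hypothesis("connection pool exhaustion",
               explains=["intermittent"], contradicts=["no code change"],
               test_minutes=5),
    Hypothesis("data growth hit an unindexed query",
               explains=["intermittent", "gradual onset", "no deploy"],
               contradicts=[], test_minutes=10),
]
ranked = rank(hypotheses)  # the data-growth hypothesis ranks first
```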
### Phase 3: Eliminate
Work through hypotheses in ranked order. Change one variable at a time — changing multiple things simultaneously makes it impossible to attribute cause.
```
Test hypothesis [N]: [description]

Prediction: [what I expect to see if this hypothesis is true]
Test: [specific action]

Run the test and report:
- What you found
- Whether the prediction matched (confirmed), contradicted (eliminated), or was inconclusive
- If eliminated: move to the next hypothesis
- If confirmed: proceed to Phase 4
- If inconclusive: what additional evidence would resolve it?
```
Important: If none of your hypotheses survive, go back to Phase 1. The most common cause of stalled debugging is incomplete evidence, not insufficient cleverness.
For bugs where you know it "used to work," git bisect is the most powerful and underused tool. It's binary search applied to version history — it finds the exact commit that introduced a regression in O(log n) steps.
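`git bisect run` automates that binary search when you hand it a script whose exit code reports good or bad. A Python sketch of such a predicate; the test path `tests/test_regression.py` is a placeholder for whatever actually reproduces your bug:

```python
import subprocess
import sys

# Exit codes `git bisect run` understands:
#   0 = this commit is good, 1-124 = bad, 125 = can't test this commit, skip.
#
# Typical session, once this file is saved as bisect_check.py:
#   git bisect start
#   git bisect bad HEAD
#   git bisect good v1.4.0     # last version known to work (illustrative tag)
#   git bisect run python3 bisect_check.py

def bug_present():
    """Replace with your real reproduction: the failing test, a timed
    request, a grep for the bad log line."""
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/test_regression.py", "-q"],
        capture_output=True,
    )
    return result.returncode != 0

def bisect_exit_code(present):
    """Map the check result to the exit codes git bisect expects."""
    return 1 if present else 0
```

Append `sys.exit(bisect_exit_code(bug_present()))` as the script's last line; bisect then narrows thousands of commits to the guilty one in a dozen or so automated runs.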
### Phase 4: Verify and fix
Once you have a confirmed root cause, don't just fix the symptom. Understand why the bug exists and whether the same pattern exists elsewhere.
```
The root cause is [confirmed hypothesis].

Before writing the fix:

1. **Why does this bug exist?** What allowed it to happen? (Missing validation, untested edge case, implicit assumption, race condition window)
2. **Why wasn't it caught?** Is there a gap in tests, monitoring, or code review that let this through?
3. **Is this a systemic pattern?** Search the codebase for the same pattern. If the bug is "we don't validate X before using it," check every place X is used.
4. **Write the fix.** The fix should address the root cause, not the symptom. If the symptom is a null pointer exception, the fix is not a null check — it's ensuring the value is never null at that point.
5. **Prove the fix.** Remove the fix and confirm the bug returns. Reapply and confirm it's gone. If you can't demonstrate this cycle, you haven't proven causation. (Agans' Rule 9: "If you didn't fix it, it ain't fixed.")
6. **Write the regression test.** A test that reproduces the exact conditions that caused this bug. This test should fail without the fix and pass with it.
7. **Add monitoring.** If this bug could recur (or a variant of it), what metric or alert would catch it early? (Error rate spike, latency increase, specific log pattern)
```
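Steps 4 to 6 in miniature, with a hypothetical bug: an order-total function that mishandled empty carts. The fix addresses the root cause, and the regression test pins the exact triggering condition:

```python
def order_total(prices, shipping=5.0):
    """Order total. The (hypothetical) original version added shipping
    unconditionally, so empty carts were billed 5.0: that was the bug."""
    if not prices:  # the fix addresses the root cause, not the symptom
        return 0.0
    return sum(prices) + shipping

def test_empty_cart_regression():
    """Reproduces the exact triggering condition: fails without the
    empty-cart fix, passes with it."""
    assert order_total([]) == 0.0

def test_normal_orders_unaffected():
    assert order_total([10.0, 20.0]) == 35.0

test_empty_cart_regression()
test_normal_orders_unaffected()
```

Deleting the `if not prices` branch makes the first test fail again, which is exactly the remove/reapply proof cycle step 5 asks for.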
## Example
Symptom: API responses are intermittently slow (2-5s instead of the usual 100ms). Started ~3 days ago. No deployments in that window.
Phase 2 hypotheses:
1. **Database connection pool exhaustion**
   - Prediction: Connection pool metrics will show the pool at max capacity during slow requests
   - Explains: Intermittent slowness (the pool sometimes has connections, sometimes doesn't)
   - Doesn't explain: Why it started 3 days ago with no code change
   - Test: Check connection pool metrics and active connection count over the past week
2. **External API slowness (payment provider, auth service)**
   - Prediction: External API response time logs will show increased latency starting 3 days ago
   - Explains: Intermittent slowness (depends on external service response time)
   - Doesn't explain: N/A (need to check)
   - Test: Check external API response time logs for the last week
3. **Data growth hitting an unindexed query**
   - Prediction: Slow query logs will show a specific query with increasing execution time
   - Explains: Gradual onset (data grows daily), intermittent slowness (only slow on large result sets), and the lack of a deployment (data change, not code change)
   - Test: Check slow query logs, check table sizes, look for sequential scans
4. **Memory leak causing GC pauses**
   - Prediction: Memory usage trend will show a sawtooth pattern with increasing peaks
   - Explains: Intermittent slowness (GC pauses are unpredictable), gradual onset (memory grows over time)
   - Test: Check memory usage trend over the past week, check GC pause duration logs
Phase 3 outcome: Hypothesis 3 confirmed — slow query logs show a WHERE clause filtering on an unindexed column. The table crossed 500K rows 3 days ago, crossing the threshold where the query planner switches from index scan to sequential scan.
Phase 4: Added the missing index. Response times returned to normal. Found 2 other queries with the same pattern — added indexes preemptively. Added a slow query alert at 500ms. Wrote a regression test that verifies query performance with a large dataset.
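The diagnosis and fix can be reproduced in miniature with SQLite's `EXPLAIN QUERY PLAN` (stdlib `sqlite3`; the table and index names are invented for illustration). The plan flips from a full scan to an index search once the index exists, which is also a usable shape for the regression test:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")

def plan_for(query):
    """Flatten EXPLAIN QUERY PLAN output into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
    return " ".join(row[-1] for row in rows)  # last column is the detail text

query = "SELECT * FROM orders WHERE status = 'pending'"
before = plan_for(query)   # full-table scan: the slow path from the example

conn.execute("CREATE INDEX idx_orders_status ON orders(status)")
after = plan_for(query)    # now searches via idx_orders_status

assert "SCAN" in before and "idx_orders_status" in after
```

The same check, pointed at your real schema, catches the planner silently falling back to a sequential scan long before a table crosses the size threshold where users notice.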
## When to use it
- Production errors that don't reproduce locally
- Intermittent bugs (happen sometimes, but not always)
- Performance degradation that developed gradually
- User-reported issues where the report is vague ("it's slow," "it broke")
- Post-incident investigation when you need to find the root cause, not just restart the service
## When NOT to use it
- When you have a clear stack trace pointing to the exact line — use Debug with Context instead
- When the fix is obvious (typo, wrong variable name, missing import) — just fix it
- When the issue is a known limitation, not a bug — document it, don't debug it
## The hypothesis discipline
The hardest part of root cause analysis is not forming hypotheses — it's resisting the urge to commit to the first one. Research on debugging psychology (Kessler & Tschuggnall) shows that confirmation bias is pervasive: developers seek evidence that supports their current theory and interpret ambiguous evidence as confirming it. Worse, debiasing attempts "repeatedly fail" — you can't just tell yourself to be objective.
The only reliable counter is structural: build the process so it forces disconfirmation.
The natural instinct:
- See the symptom
- Think of one possible cause
- Assume that's the cause
- Spend 3 hours finding evidence that confirms your theory
- Discover it was something else entirely
The disciplined approach:
- See the symptom
- Think of 3-5 possible causes
- For each, state what evidence would disprove it (not confirm it)
- Run the cheapest, fastest disproving test first
- Narrow down by elimination, not confirmation
Elimination works because a single piece of contradictory evidence kills a hypothesis entirely. Confirmation is weaker — you can always find evidence that seems to confirm a theory. The most dangerous pattern is "explaining away" contradictions: "well, maybe the cache was stale, which is why the metric doesn't show what I expected." Each auxiliary explanation makes the theory more complex and less falsifiable.
## Debugging specific bug classes
Intermittent bugs: Often require a Phase 0.5 — instrument and wait. Deploy enhanced logging/monitoring and wait for the next occurrence. You can't debug what you can't observe.
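"Instrument and wait" can be as small as a decorator that logs rich context only when a call exceeds a latency threshold, so the next occurrence arrives with evidence attached. A sketch (threshold and field names are illustrative):

```python
import functools
import json
import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("instrument")

def instrument(threshold_ms=500):
    """Log structured context only for calls slower than the threshold."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                if elapsed_ms > threshold_ms:
                    # Structured, greppable, and truncated; never log secrets.
                    log.warning(json.dumps({
                        "slow_call": fn.__name__,
                        "elapsed_ms": round(elapsed_ms, 1),
                        "args": repr(args)[:200],
                    }))
        return wrapper
    return decorator

@instrument(threshold_ms=100)
def lookup(user_id):
    time.sleep(0.15)  # stand-in for the intermittently slow operation
    return user_id
```

Fast calls cost one timer read and produce no output, so this is cheap enough to leave deployed until the bug fires.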
Performance degradation: Use Brendan Gregg's USE method — for every resource (CPU, memory, disk, network), check Utilization, Saturation, and Errors. Watch out for averaged metrics hiding burst behavior: a dashboard showing "80% average CPU" might be hiding 100% spikes that cause queuing.
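The averaging trap is easy to demonstrate with ten synthetic one-second CPU samples: the mean looks healthy while the max shows repeated saturation:

```python
# Ten synthetic 1-second CPU% samples with intermittent saturation spikes.
samples = [78, 81, 80, 100, 79, 100, 82, 100, 80, 80]

mean = sum(samples) / len(samples)              # 86.0: looks acceptable
peak = max(samples)                             # 100: the resource saturates
p90 = sorted(samples)[int(0.9 * (len(samples) - 1))]

# A dashboard plotting only `mean` hides the spikes that cause queuing;
# plot max and high percentiles (or USE-method saturation metrics) alongside.
```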
## Tips
- "What changed?" is the most powerful question. If the answer is "nothing," dig deeper — something always changed.
- "What still works?" is the second most powerful question. The difference between the working path and the broken path often points directly at the cause.
- If you've been debugging for more than 30 minutes without progress, go back to Phase 1 and check whether you're missing evidence.
- Keep a log of your hypothesis elimination. When you eventually find the root cause, this log becomes the postmortem's investigation timeline.
- Print debugging is fine — as long as each print statement tests a specific prediction. Scattering print statements randomly without a hypothesis is shotgun debugging.