Incident Debugging Playbook

SEV-1/2/3/4 incident response protocols. Log analysis, root cause investigation (5 Whys, Fishbone), postmortem templates, and on-call procedures.

Production Playbook for DevOps and Plugin Maintainers

Debugging production incidents in multi-agent Claude Code workflows requires systematic approaches to log analysis, root cause identification, and rapid remediation. This playbook provides battle-tested debugging techniques, incident response workflows, postmortem templates, and real-world examples of common failure modes.

Incident Classification

Severity Levels

| Severity | Impact | Response Time | Example |
|----------|--------|---------------|---------|
| SEV-1 | Production down | Immediate | All agents failing, API completely offline |
| SEV-2 | Major degradation | 15 minutes | 50%+ error rate, critical features broken |
| SEV-3 | Minor degradation | 1 hour | Intermittent failures, single plugin broken |
| SEV-4 | Cosmetic issues | 24 hours | UI bugs, non-critical warnings |
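
The severity table can be turned into a small classifier so that alerts arrive pre-labelled. The following is a minimal sketch: the `HealthSnapshot` shape and its field names are hypothetical, and the only threshold taken from the table is the "50%+ error rate" rule for SEV-2.

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

// Hypothetical health snapshot; field names are illustrative only.
interface HealthSnapshot {
  serviceUp: boolean;   // false when production is fully down
  errorRate: number;    // fraction of failing requests, 0..1
  userFacing: boolean;  // false for cosmetic or warning-only issues
}

// Response-time targets in minutes, copied from the severity table.
const RESPONSE_MINUTES: Record<Severity, number> = {
  'SEV-1': 0,
  'SEV-2': 15,
  'SEV-3': 60,
  'SEV-4': 24 * 60,
};

function classifySeverity(s: HealthSnapshot): Severity {
  if (!s.serviceUp) return 'SEV-1';        // production down
  if (s.errorRate >= 0.5) return 'SEV-2';  // "50%+ error rate"
  if (s.userFacing) return 'SEV-3';        // intermittent or single-plugin failures
  return 'SEV-4';                          // cosmetic issues
}
```

Treat the result as a starting label; the on-call engineer should still be free to escalate or downgrade it.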

Common Incident Types

enum IncidentType {
  API_FAILURE = 'api_failure',           // Claude API unreachable
  RATE_LIMIT = 'rate_limit',             // 429 errors from API
  TIMEOUT = 'timeout',                    // Agent/tool timeouts
  MEMORY_LEAK = 'memory_leak',           // Process memory exhaustion
  PLUGIN_CRASH = 'plugin_crash',         // Plugin process died
  DATA_CORRUPTION = 'data_corruption',   // Invalid data in DB/cache
  PERFORMANCE = 'performance',           // Slow response times
  AUTHENTICATION = 'authentication'      // Auth failures
}

interface Incident {
  id: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  type: IncidentType;
  startTime: number;
  affectedUsers: number;
  errorRate: number;
  description: string;
}

Initial Response Protocol

First 5 Minutes (SEV-1/SEV-2)

Step 1: Assess Impact

# Check current error rate
tail -n 1000 /var/log/claude-code.log | grep -c ERROR

# Check affected users
grep "ERROR" /var/log/claude-code.log | awk '{print $5}' | sort -u | wc -l

# Check service health
curl http://localhost:3333/api/status
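
The same impact numbers can be computed in code when the shell is not convenient. This sketch assumes the multi-line log layout shown under Log Analysis (a `[timestamp] [LEVEL] [component]` header followed by `key: value` lines); `assessImpact` and `ImpactSummary` are hypothetical names.

```typescript
interface ImpactSummary {
  totalEntries: number;
  errorEntries: number;
  errorRate: number;      // errorEntries / totalEntries, 0 for an empty log
  affectedUsers: number;  // distinct userId values attached to ERROR entries
}

function assessImpact(logText: string): ImpactSummary {
  const users = new Set<string>();
  let totalEntries = 0;
  let errorEntries = 0;
  let inError = false; // true while reading the metadata of an ERROR entry

  for (const line of logText.split('\n')) {
    // Entry header: [timestamp] [LEVEL] [component] message
    const header = line.match(/^\[[^\]]+\] \[([A-Z]+)\] /);
    if (header) {
      totalEntries++;
      inError = header[1] === 'ERROR';
      if (inError) errorEntries++;
      continue;
    }
    // Metadata line such as "  userId: user-456"
    const user = line.match(/^\s*userId:\s*(\S+)/);
    if (inError && user && user[1]) users.add(user[1]);
  }

  return {
    totalEntries,
    errorEntries,
    errorRate: totalEntries === 0 ? 0 : errorEntries / totalEntries,
    affectedUsers: users.size,
  };
}
```

Unlike `grep -c ERROR`, this counts entries rather than lines, so a stack trace that mentions "ERROR" does not inflate the total.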

Step 2: Check Obvious Issues

// Quick health check script
import { exec } from 'child_process';
import { promisify } from 'util';

const execAsync = promisify(exec);

async function quickHealthCheck(): Promise<{ healthy: boolean; issues: string[] }> {
  const issues: string[] = [];

  // 1. Check Claude API connectivity with a minimal request
  try {
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.ANTHROPIC_API_KEY ?? '',
        'anthropic-version': '2023-06-01', // required by the Messages API
        'content-type': 'application/json'
      },
      body: JSON.stringify({ model: 'claude-3-5-haiku-20241022', messages: [{ role: 'user', content: 'test' }], max_tokens: 10 })
    });
    // A non-2xx status means the API answered but rejected the call (auth, rate limit, overload)
    if (!response.ok) issues.push(`Claude API returned HTTP ${response.status}`);
  } catch (error) {
    issues.push('Network connectivity issue');
  }

  // 2. Check disk space (percentage used on /)
  const { stdout } = await execAsync("df -h / | tail -1 | awk '{print $5}' | sed 's/%//'");
  if (parseInt(stdout, 10) > 90) issues.push('Disk space critical');

  // 3. Check memory (heap of this Node.js process only, not the whole host)
  const memUsage = process.memoryUsage();
  if (memUsage.heapUsed / memUsage.heapTotal > 0.9) issues.push('Memory exhaustion');

  return { healthy: issues.length === 0, issues };
}

Step 3: Stabilize (if possible)

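# Restart failed services
systemctl restart claude-code-daemon
pm2 restart all

# Clear cache if corrupted
# WARNING: FLUSHALL deletes every key in every database on this Redis instance
redis-cli FLUSHALL

# Rate limit protection: accept up to 25 new connections per minute (burst 100)
# and drop the rest. Without the DROP rule the ACCEPT rule limits nothing.
iptables -A INPUT -p tcp --dport 80 --syn -m limit --limit 25/minute --limit-burst 100 -j ACCEPT
iptables -A INPUT -p tcp --dport 80 --syn -j DROP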

Communication Template

# Incident Alert: [TITLE]

Severity: SEV-2
Status: Investigating
Started: 2025-12-24 14:35 UTC
Affected: ~1,200 users (15% of total)

## Current Impact
  • Agent execution failing with 429 errors
  • Error rate: 68% (normal: <1%)
  • No data loss

## Actions Taken
  • ✅ Identified rate limit exhaustion (14:40)
  • ✅ Implemented emergency rate limiting (14:42)
  • 🔄 Monitoring recovery (14:45)

## Next Update
In 15 minutes or when resolved.
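
Filling in the template by hand during an incident is error-prone, so it can help to render it from structured fields. This is a sketch only: `StatusUpdate` and `renderStatusUpdate` are hypothetical names, and the Markdown heading style in the output is an assumption about how the update will be posted.

```typescript
interface StatusUpdate {
  title: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  status: string;
  startedUtc: string;
  affected: string;
  impact: string[];
  actions: string[];
  nextUpdate: string;
}

function renderStatusUpdate(u: StatusUpdate): string {
  // Render each item as a bullet, matching the template's list style.
  const bullets = (items: string[]): string => items.map(i => `  • ${i}`).join('\n');
  return [
    `# Incident Alert: ${u.title}`,
    '',
    `Severity: ${u.severity}`,
    `Status: ${u.status}`,
    `Started: ${u.startedUtc}`,
    `Affected: ${u.affected}`,
    '',
    '## Current Impact',
    bullets(u.impact),
    '',
    '## Actions Taken',
    bullets(u.actions),
    '',
    '## Next Update',
    u.nextUpdate,
  ].join('\n');
}
```

Keeping the fields in one object also makes it easy to post the same update to several channels without retyping it.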

Common Failure Modes

1. Rate Limit Exhaustion

Symptoms:

Error 429: Rate limit exceeded
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2025-12-24T15:00:00Z

Diagnosis:

// queryLogs and getLastAPIHeaders are application-specific helpers (not shown)
async function diagnoseRateLimits(): Promise<void> {
  // Check recent API calls (PostgreSQL interval syntax)
  const recentCalls = await queryLogs("SELECT COUNT(*) FROM api_calls WHERE timestamp > NOW() - INTERVAL '1 minute'");
  console.log(`API calls in last minute: ${recentCalls}`);

  // Check rate limit headers from last successful call
  const lastHeaders = await getLastAPIHeaders();
  console.log('Remaining requests:', lastHeaders['anthropic-ratelimit-requests-remaining']);
  console.log('Reset time:', lastHeaders['anthropic-ratelimit-requests-reset']);
}

Fix:

// Implement token bucket rate limiter
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class EmergencyRateLimiter {
  private tokens = 50; // Match API tier
  private lastRefill = Date.now();

  // Wait until a token is available, then consume it
  async throttle(): Promise<void> {
    this.refill();
    while (this.tokens < 1) {
      await sleep(100);
      this.refill();
    }
    this.tokens--;
  }

  // Add tokens in proportion to the time elapsed since the last refill
  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * (50 / 60); // 50 per minute
    this.tokens = Math.min(50, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }
}
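
The reset header shown in the symptoms says exactly when the request budget returns, which is often a better wait time than a fixed sleep. A minimal sketch, assuming the two header names from the symptoms above; `msUntilReset` and the one-second fallback are hypothetical.

```typescript
type RateLimitHeaders = Record<string, string | undefined>;

// How long to pause before the next request, based on the last response headers.
function msUntilReset(headers: RateLimitHeaders, nowMs: number): number {
  const remaining = Number(headers['anthropic-ratelimit-requests-remaining'] ?? '1');
  if (remaining > 0) return 0; // budget left, no need to wait

  const resetAt = Date.parse(headers['anthropic-ratelimit-requests-reset'] ?? '');
  if (Number.isNaN(resetAt)) return 1000; // assumed fallback when the header is missing

  return Math.max(0, resetAt - nowMs);
}
```

Passing the current time in as a parameter keeps the function deterministic and easy to test.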

2. Agent Timeout

Symptoms:

Error: Agent execution timed out after 300000ms
Task: code-review
Conversation: abc-123-def

Diagnosis:

# Check for hung processes
ps aux | grep claude | grep -v grep

# Check system load
uptime
# Output: load average: 12.5, 8.3, 5.2 (CPU overload!)

# Check for blocking I/O
iotop -o -d 5

Fix:

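// Implement aggressive timeouts
class TimeoutManager {
  async executeWithTimeout<T>(
    fn: () => Promise<T>,
    timeoutMs: number
  ): Promise<T> {
    let timer: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timer = setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs);
    });
    try {
      // Note: losing the race does not cancel fn(); it only stops waiting for it
      return await Promise.race([fn(), timeout]);
    } finally {
      clearTimeout(timer); // avoid leaving a pending timer after fn() settles
    }
  }
}

// Usage
const timeout = new TimeoutManager();
const result = await timeout.executeWithTimeout(
  () => agent.execute(task),
  30000 // 30 second hard limit
);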
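
A race-based timeout stops waiting, but the underlying work keeps running and can still hold connections or memory. One way to address this, sketched here under the assumption that the wrapped work accepts an `AbortSignal`, is to abort on deadline. `runWithDeadline` is a hypothetical name.

```typescript
async function runWithDeadline<T>(
  work: (signal: AbortSignal) => Promise<T>,
  timeoutMs: number
): Promise<T> {
  const controller = new AbortController();
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(() => {
      controller.abort(); // tell the work to stop
      reject(new Error(`Timeout after ${timeoutMs}ms`));
    }, timeoutMs);
  });

  try {
    return await Promise.race([work(controller.signal), deadline]);
  } finally {
    clearTimeout(timer); // never leave a timer pending
  }
}
```

The signal only helps if the work honours it, for example by passing it to `fetch` or by checking `signal.aborted` between steps.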

3. Memory Leak

Symptoms:

# Memory usage climbing over time
free -m
#              total   used   free
# Mem:         16384  15892    492  # Critical!

# Process memory
ps aux --sort=-%mem | head -5
# claude-daemon: 8.2GB (!)

Diagnosis:

// Track memory usage over time
setInterval(() => {
  const usage = process.memoryUsage();
  console.log(JSON.stringify({
    timestamp: Date.now(),
    heapUsed: usage.heapUsed / 1024 / 1024, // MB
    heapTotal: usage.heapTotal / 1024 / 1024,
    external: usage.external / 1024 / 1024,
    rss: usage.rss / 1024 / 1024
  }));

  // Trigger GC if usage > 80% (a stopgap: GC cannot free memory that is still referenced)
  if (usage.heapUsed / usage.heapTotal > 0.8) {
    global.gc?.(); // Only defined when Node runs with --expose-gc
  }
}, 60000); // Every minute
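
Once samples are being logged, a steady upward trend is the clearest sign of a leak. The sketch below fits a least-squares line through the samples and reports the slope; `HeapSample` and `heapGrowthMbPerMinute` are hypothetical names, and what counts as "too fast" is left to the caller.

```typescript
interface HeapSample {
  timestamp: number;   // epoch milliseconds
  heapUsedMb: number;
}

// Least-squares slope of heap usage over time, in MB per minute.
function heapGrowthMbPerMinute(samples: HeapSample[]): number {
  const first = samples[0];
  if (!first || samples.length < 2) return 0;

  // Convert to (minutes since first sample, MB) points.
  const points = samples.map(s => ({
    x: (s.timestamp - first.timestamp) / 60000,
    y: s.heapUsedMb,
  }));
  const meanX = points.reduce((sum, p) => sum + p.x, 0) / points.length;
  const meanY = points.reduce((sum, p) => sum + p.y, 0) / points.length;

  let numerator = 0;
  let denominator = 0;
  for (const p of points) {
    numerator += (p.x - meanX) * (p.y - meanY);
    denominator += (p.x - meanX) ** 2;
  }
  return denominator === 0 ? 0 : numerator / denominator;
}
```

A slope that stays positive across several garbage collection cycles is worth a heap snapshot; a sawtooth that averages out to zero usually is not.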

Common Causes:

// ❌ Leak: Global cache never cleared
const cache = new Map<string, any>();
function addToCache(key: string, value: any) {
  cache.set(key, value); // Grows forever!
}

// ✅ Fix: LRU cache with size limit
// (lru-cache v10+ exposes a named LRUCache export; older versions used a default export)
import { LRUCache } from 'lru-cache';
const boundedCache = new LRUCache<string, any>({ max: 1000 });
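
To show what the size limit actually does, here is a hand-rolled bounded cache built on `Map` insertion order. It is a stand-in for illustration, not the `lru-cache` implementation; `BoundedCache` is a hypothetical name.

```typescript
class BoundedCache<K, V> {
  private entries = new Map<K, V>();
  private readonly maxEntries: number;

  constructor(maxEntries: number) {
    this.maxEntries = maxEntries;
  }

  get(key: K): V | undefined {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key) as V;
    // Re-insert so this key becomes the most recently used.
    this.entries.delete(key);
    this.entries.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    this.entries.delete(key);
    this.entries.set(key, value);
    if (this.entries.size > this.maxEntries) {
      // Map iterates in insertion order, so the first key is the least recently used.
      const oldest = this.entries.keys().next().value as K;
      this.entries.delete(oldest);
    }
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Memory use is now bounded by `maxEntries` times the largest value, so a single very large value can still hurt; a byte-based limit would be the next step.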

4. Plugin Crash Loop

Symptoms:

# PM2 showing rapid restarts
pm2 status
# plugin-server | errored | 47 restarts in 2 minutes

# Logs show crash
tail -f /var/log/pm2/plugin-server-error.log
# Error: ECONNREFUSED 127.0.0.1:5432
# (PostgreSQL connection failed)

Diagnosis:

# Check dependencies
docker ps | grep postgres
# (empty - PostgreSQL container not running!)

# Check network
netstat -tulpn | grep 5432
# (no listener on port 5432)

Fix:

# Restart dependency
docker-compose up -d postgres

# Verify connectivity
psql -h localhost -U user -d database -c "SELECT 1"

# Restart plugin
pm2 restart plugin-server
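
A crash loop like the one above often comes from a process that exits the moment a dependency is unavailable. One possible mitigation, sketched here as an assumption rather than something the plugin already does, is to wait for the dependency with capped exponential backoff before giving up. `backoffDelayMs`, `waitForDependency`, and the default values are hypothetical.

```typescript
// Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 500, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Polls isReady until it returns true or maxAttempts is reached.
async function waitForDependency(
  isReady: () => Promise<boolean>,
  maxAttempts = 8,
  sleep: (ms: number) => Promise<void> = ms => new Promise<void>(r => setTimeout(r, ms))
): Promise<boolean> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    // A thrown error (for example ECONNREFUSED) counts as "not ready".
    const ready = await isReady().catch(() => false);
    if (ready) return true;
    await sleep(backoffDelayMs(attempt));
  }
  return false;
}
```

In the PostgreSQL case, `isReady` could run `SELECT 1` through the plugin's own database client and return whether it succeeded.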

Debugging Techniques

1. Binary Search Debugging

Problem: Unknown change broke production

# Use git bisect to find breaking commit
git bisect start
git bisect bad HEAD              # Current version is broken
git bisect good v1.2.0           # Last known good version

# Git will check out commits for testing
# Test each commit:
npm install && npm run build && npm test

# Mark results
git bisect good   # if tests pass
git bisect bad    # if tests fail

# Git will find the exact breaking commit

# When done, return to the commit you started from
git bisect reset
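
The reason bisecting is fast is that each test halves the remaining range. The sketch below shows the idea on an in-memory list; it illustrates the search strategy only and is not how git implements it. `findFirstBad` is a hypothetical name, and it assumes the list is ordered oldest to newest with a known-good first entry and a known-bad last entry.

```typescript
function findFirstBad<T>(
  commits: T[],
  isBad: (commit: T) => boolean
): { firstBad: T; tests: number } {
  let good = 0;                  // index of a known-good commit
  let bad = commits.length - 1;  // index of a known-bad commit
  let tests = 0;

  // Invariant: commits[good] is good and commits[bad] is bad.
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    tests++;
    if (isBad(commits[mid])) {
      bad = mid;
    } else {
      good = mid;
    }
  }
  return { firstBad: commits[bad], tests };
}
```

With 1,000 commits between the good and bad versions, this needs about 10 test runs instead of up to 1,000.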

2. Correlation Analysis

Find patterns in failures:

interface FailureEvent {
  timestamp: number;
  errorType: string;
  userId?: string;
  pluginName?: string;
  duration: number;
}

// groupBy and mode are assumed helpers (not shown): groupBy buckets items by key,
// mode returns the most frequent value in an array
function analyzeFailureCorrelations(failures: FailureEvent[]): void {
  // Group by time windows (one bucket per hour)
  const byHour = groupBy(failures, f => Math.floor(f.timestamp / 3600000));

  // Find spike times (more than 100 failures in one hour)
  const spikes = Object.entries(byHour)
    .filter(([_, events]) => events.length > 100)
    .map(([hour, events]) => ({
      hour: new Date(parseInt(hour, 10) * 3600000),
      count: events.length,
      topError: mode(events.map(e => e.errorType))
    }));

  console.log('Failure spikes:', spikes);

  // Find common attributes
  const byPlugin = groupBy(failures, f => f.pluginName ?? 'unknown');
  const suspiciousPlugin = Object.entries(byPlugin)
    .sort((a, b) => b[1].length - a[1].length)[0];

  // Guard against an empty failure list
  if (suspiciousPlugin) {
    console.log(`Most failures from plugin: ${suspiciousPlugin[0]} (${suspiciousPlugin[1].length} errors)`);
  }
}
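
The analysis above calls `groupBy` and `mode` without defining them. A minimal sketch of both follows; these are assumed implementations written for this example, not helpers taken from a particular library.

```typescript
// Buckets items by the string form of keyOf(item).
function groupBy<T>(items: T[], keyOf: (item: T) => string | number): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const key = String(keyOf(item));
    if (!groups[key]) groups[key] = [];
    groups[key].push(item);
  }
  return groups;
}

// Returns the most frequent value, or undefined for an empty array.
// Ties go to the value that reached the top count first.
function mode<T>(values: T[]): T | undefined {
  const counts = new Map<T, number>();
  let best: T | undefined;
  let bestCount = 0;
  for (const value of values) {
    const count = (counts.get(value) ?? 0) + 1;
    counts.set(value, count);
    if (count > bestCount) {
      best = value;
      bestCount = count;
    }
  }
  return best;
}
```

Because object keys are strings, numeric hour buckets come back as strings, which is why the analysis parses them again with `parseInt`.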

3. Distributed Tracing

Track a request across services:

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('claude-code');

// `agent` is the application's agent runner (not shown)
async function executeAgent(agentName: string, task: any): Promise<any> {
  const span = tracer.startSpan('agent.execute', {
    attributes: {
      'agent.name': agentName,
      'task.id': task.id
    }
  });

  try {
    // Execute agent logic
    const result = await agent.run(task);

    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('result.success', true);

    return result;
  } catch (error) {
    // A caught value is not guaranteed to be an Error instance
    const err = error instanceof Error ? error : new Error(String(error));
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: err.message
    });
    span.recordException(err);
    throw error;
  } finally {
    span.end();
  }
}

Log Analysis

Parsing Claude Code Logs

Log Format:

[2025-12-24T14:35:22.123Z] [ERROR] [agent:code-review] Rate limit exceeded
  conversationId: abc-123-def
  userId: user-456
  errorCode: 429
  retryAfter: 12
  stack: Error: Rate limit exceeded
    at callClaude (/app/src/api.ts:45:11)

Analysis Script:

import { readFileSync } from 'fs';

interface LogEntry {
  timestamp: Date;
  level: 'ERROR' | 'WARN' | 'INFO';
  component: string;
  message: string;
  metadata: Record<string, any>;
}

// Parses one multi-line entry: a "[timestamp] [LEVEL] [component] message" header
// followed by optional "key: value" metadata lines
function parseLog(entry: string): LogEntry | null {
  const match = entry.match(/^\[(.*?)\] \[(.*?)\] \[(.*?)\] ([\s\S]*)$/);
  if (!match) return null;

  const [, timestamp, level, component, rest] = match;
  const lines = rest.split('\n');
  const message = lines[0];

  // Parse metadata
  const metadata: Record<string, any> = {};
  for (const line of lines.slice(1)) {
    const metaMatch = line.match(/^\s*(\w+): (.+)$/);
    if (metaMatch) {
      const [, key, value] = metaMatch;
      metadata[key] = value;
    }
  }

  return {
    timestamp: new Date(timestamp),
    level: level as any,
    component,
    message,
    metadata
  };
}

// groupBy is an assumed helper (not shown) that buckets items by key
function analyzeLogs(logPath: string): void {
  const content = readFileSync(logPath, 'utf-8');
  // Split before each line that opens a new "[timestamp]" header, so that
  // metadata lines stay attached to their entry
  const logs = content.split(/\n(?=\[\d{4}-\d{2}-\d{2}T)/)
    .map(parseLog)
    .filter(Boolean) as LogEntry[];

  // Error count by component
  const errorsByComponent = groupBy(
    logs.filter(l => l.level === 'ERROR'),
    l => l.component
  );

  console.log('Errors by component:');
  Object.entries(errorsByComponent)
    .sort((a, b) => b[1].length - a[1].length)
    .forEach(([component, errors]) => {
      console.log(`  ${component}: ${errors.length}`);
    });

  // Recent errors (last 5 minutes)
  const recentErrors = logs.filter(l =>
    l.level === 'ERROR' &&
    Date.now() - l.timestamp.getTime() < 300000
  );

  console.log(`\nRecent errors: ${recentErrors.length}`);
  recentErrors.slice(0, 10).forEach(err => {
    console.log(`  ${err.timestamp.toISOString()} - ${err.message}`);
  });
}

Using Analytics Daemon

// Query analytics daemon for incident patterns
const ws = new WebSocket('ws://localhost:3456');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  // Track rate limit warnings
  if (data.type === 'rate_limit.warning') {
    console.warn(`⚠️ Rate limit approaching: ${data.current}/${data.limit}`);
  }

  // Track errors
  if (data.type === 'llm.call' && data.error) {
    console.error(`❌ LLM call failed: ${data.error}`);
  }
};

// Query historical data
const response = await fetch('http://localhost:3333/api/sessions');
const sessions = await response.json();
const failedSessions = sessions.filter(s => s.errorCount > 0);

console.log(`Failed sessions: ${failedSessions.length}/${sessions.length}`);

Root Cause Analysis

The 5 Whys Method

Example: Agent Timeout Incident

  • Why did the agent time out?
    → Because it took > 300 seconds to respond
  • Why did it take so long?
    → Because the Claude API call was slow (280s)
  • Why was the API call slow?
    → Because we sent a 50,000 token prompt
  • Why did we send such a large prompt?
    → Because the code-reviewer agent included the entire codebase in context
  • Why did it include the entire codebase?
    → Because the file globbing pattern had no exclusions

Root Cause: File globbing pattern **/* matched all files, including node_modules (500MB)

Fix: Update file globbing to exclude node_modules

// Before: includes everything
const files = glob.sync('**/*');

// After: exclude dependencies
const files = glob.sync('**/*', {
  ignore: ['node_modules/**', '.git/**', 'dist/**']
});
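
Excluding directories fixes this incident, but a size budget guards against the next oversized input as well. The following sketch combines both checks on a list of candidate files; `CandidateFile`, `selectFilesForPrompt`, and the byte budget are hypothetical, and bytes are used only as a rough proxy for prompt size.

```typescript
interface CandidateFile {
  path: string;       // forward-slash separated, relative to the repo root
  sizeBytes: number;
}

const EXCLUDED_DIRS = ['node_modules', '.git', 'dist'];

function selectFilesForPrompt(files: CandidateFile[], maxTotalBytes: number): CandidateFile[] {
  const selected: CandidateFile[] = [];
  let total = 0;

  for (const file of files) {
    // Skip anything under a dependency, VCS, or build directory.
    const segments = file.path.split('/');
    if (segments.some(segment => EXCLUDED_DIRS.includes(segment))) continue;

    // Skip files that would push the prompt over budget.
    if (total + file.sizeBytes > maxTotalBytes) continue;

    selected.push(file);
    total += file.sizeBytes;
  }
  return selected;
}
```

Logging which files were skipped, and why, makes it much easier to explain a review that missed part of the codebase.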

Fishbone Diagram (Ishikawa)

interface RootCauseAnalysis {
  problem: string;
  categories: {
    people?: string[];
    process?: string[];
    technology?: string[];
    environment?: string[];
  };
  rootCause: string;
  fix: string;
}

const analysis: RootCauseAnalysis = {
  problem: 'Agent timeout causing 68% error rate',
  categories: {
    people: [
      'Developer added file globbing without testing',
      'No code review caught the issue'
    ],
    process: [
      'No integration tests for large codebases',
      'No performance testing in CI/CD'
    ],
    technology: [
      'Glob pattern included node_modules (500MB)',
      'No size limit on prompts',
      'No timeout on file reading'
    ],
    environment: [
      'Production codebase larger than test repos',
      'No staging environment for testing'
    ]
  },
  rootCause: 'Missing file size validation and glob pattern filtering',
  fix: 'Add file exclusion patterns and max prompt size validation'
};

Recovery Procedures

Emergency Rollback

# Immediate rollback to last known good version
git log --oneline | head -5
# c534df4 (HEAD) feat: Add new feature (BROKEN)
# 3946b1f docs: Update README
# fc73caa (tag: v1.2.0) fix: Bug fix (LAST GOOD)

# Rollback
# WARNING: reset --hard discards uncommitted changes and rewrites the branch.
# On a shared branch, prefer: git revert --no-edit fc73caa..HEAD
git reset --hard fc73caa
npm install
npm run build
pm2 restart all

# Deploy
./deploy.sh production

# Verify
curl http://api.example.com/health

Circuit Breaker Reset

// Manually reset circuit breaker after fixing issue
interface CircuitBreaker {
  state: 'closed' | 'open' | 'half-open';
  failures: number;
}

class CircuitBreakerManager {
  private breakers = new Map<string, CircuitBreaker>();

  reset(serviceName: string): void {
    const breaker = this.breakers.get(serviceName);
    if (breaker) {
      breaker.state = 'closed';
      breaker.failures = 0;
      console.log(`✓ Reset circuit breaker for ${serviceName}`);
    }
  }

  resetAll(): void {
    for (const service of this.breakers.keys()) {
      this.reset(service);
    }
    console.log('✓ Reset all circuit breakers');
  }
}
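
For context on what a reset changes, here is a minimal sketch of a breaker with the `state` and `failures` fields the manager touches. The threshold, cooldown, and method names are assumptions made for this example, not the playbook's actual breaker.

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

class SimpleCircuitBreaker {
  state: BreakerState = 'closed';
  failures = 0;
  private openedAt = 0;
  private readonly failureThreshold: number;
  private readonly cooldownMs: number;

  constructor(failureThreshold = 5, cooldownMs = 30000) {
    this.failureThreshold = failureThreshold;
    this.cooldownMs = cooldownMs;
  }

  // Returns true when a call may proceed at time nowMs.
  canRequest(nowMs: number): boolean {
    if (this.state === 'open' && nowMs - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // allow a trial call after the cooldown
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.state = 'closed';
    this.failures = 0;
  }

  recordFailure(nowMs: number): void {
    this.failures++;
    // A failed trial call, or too many failures, opens the breaker.
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = nowMs;
    }
  }
}
```

A manual reset skips the cooldown, so it should only be used once the underlying fault is confirmed fixed; otherwise the breaker simply trips again.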

Data Recovery

# Recover from backup
BACKUP_DATE="2025-12-24-14:00"

# Stop services
pm2 stop all

# Restore database
# A plain-SQL dump (.sql) is restored with psql. pg_restore only reads
# custom, directory, or tar format archives (pg_dump -Fc, -Fd, -Ft).
psql -d database_prod -f "backups/backup_${BACKUP_DATE}.sql"

# Restore files
rsync -av "backups/files_${BACKUP_DATE}/" /var/lib/claude-code/

# Restart
pm2 restart all

# Verify data integrity
psql -d database_prod -c "SELECT COUNT(*) FROM conversations"

Postmortem Templates

Incident Postmortem

# Postmortem: Agent Timeout Incident (2025-12-24)

Date: 2025-12-24
Duration: 14:35 - 15:15 UTC (40 minutes)
Severity: SEV-2
Impact: 1,200 users (15%), 68% error rate

## Summary
The code-reviewer agent began timing out because prompts included far too many files, causing a 68% error rate for 40 minutes.

## Timeline (UTC)
  • 14:35 - First timeout alerts
  • 14:40 - Error rate reaches 68%
  • 14:42 - On-call engineer paged
  • 14:45 - Root cause identified (file globbing)
  • 14:50 - Fix deployed to staging
  • 14:55 - Fix deployed to production
  • 15:00 - Error rate drops to 5%
  • 15:15 - Incident resolved, error rate < 1%

## Root Cause
File globbing pattern **/* included the node_modules/ directory (500MB), creating prompts exceeding the Claude API's context limits and causing timeouts.

## Contributing Factors
  • No file size validation before prompt construction
  • No integration tests with large codebases
  • No staging environment for testing

## What Went Well
  • Fast root cause identification (10 minutes)
  • Effective rollback procedure
  • Clear communication to affected users

## What Went Poorly
  • No monitoring alerts before user reports
  • No prompt size limit was in place to prevent the issue
  • Fix took 20 minutes to deploy

## Action Items
  • [ ] P0: Add file size validation (Owner: @dev, Due: 2025-12-25)
  • [ ] P0: Implement max prompt size limit (Owner: @dev, Due: 2025-12-25)
  • [ ] P1: Add monitoring for agent timeouts (Owner: @ops, Due: 2025-12-27)
  • [ ] P1: Create staging environment (Owner: @ops, Due: 2025-12-30)
  • [ ] P2: Add integration tests with large repos (Owner: @qa, Due: 2026-01-05)

## Lessons Learned
  • File operations need size limits
  • Production testing with realistic data is critical
  • Monitoring must detect issues before users report them
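
The durations quoted in a postmortem are easy to get wrong when computed by hand. This sketch derives them from the timeline marks; `TimelineMarks` and `incidentMetrics` are hypothetical names, and it assumes all timestamps fall on the same UTC day.

```typescript
// Minutes since midnight for an "HH:MM" timestamp.
function toMinutes(hhmm: string): number {
  const parts = hhmm.split(':').map(Number);
  return (parts[0] ?? 0) * 60 + (parts[1] ?? 0);
}

interface TimelineMarks {
  started: string;
  paged: string;
  rootCause: string;
  fixDeployed: string;
  resolved: string;
}

// All values are minutes measured from the first alert.
function incidentMetrics(t: TimelineMarks) {
  const start = toMinutes(t.started);
  return {
    timeToPage: toMinutes(t.paged) - start,
    timeToRootCause: toMinutes(t.rootCause) - start,
    timeToMitigate: toMinutes(t.fixDeployed) - start,
    duration: toMinutes(t.resolved) - start,
  };
}
```

For the example timeline, `incidentMetrics({ started: '14:35', paged: '14:42', rootCause: '14:45', fixDeployed: '14:55', resolved: '15:15' })` gives 7, 10, 20, and 40 minutes, matching the figures in the template.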

Best Practices

DO ✅

  • Log structured data
// ✅ Structured logging
logger.error('Agent execution failed', {
  agentName: 'code-reviewer',
  conversationId: 'abc-123',
  errorCode: 429,
  duration: 1234
});

// ❌ Unstructured
console.log('Error in code-reviewer agent');
  • Set up alerts before incidents
// Alert on error rate > 5%
if (errorRate > 0.05) {
  pagerDuty.trigger({
    severity: 'critical',
    title: 'High error rate detected',
    details: `Error rate: ${(errorRate * 100).toFixed(1)}%`
  });
}
  • Keep runbooks updated
# Agent Timeout Runbook

1. Check logs: tail -f /var/log/claude-code.log | grep TIMEOUT
2. Identify pattern: Which agents are timing out?
3. Check system resources: top, free -m, df -h
4. If rate limits: Implement emergency throttling
5. If resource exhaustion: Restart services
  • Test recovery procedures
# Monthly disaster recovery drill
./test-recovery.sh
# 1. Trigger circuit breaker
# 2. Verify monitoring alerts
# 3. Execute rollback
# 4. Verify service restoration
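
The alerting practice above needs an `errorRate` value from somewhere. One way to produce it, sketched here with hypothetical names and an assumed five-minute window, is a sliding window over recent request outcomes.

```typescript
class ErrorRateWindow {
  private events: { at: number; failed: boolean }[] = [];
  private readonly windowMs: number;

  constructor(windowMs = 5 * 60000) {
    this.windowMs = windowMs;
  }

  // Record one request outcome at time nowMs (epoch milliseconds).
  record(failed: boolean, nowMs: number): void {
    this.events.push({ at: nowMs, failed });
    this.prune(nowMs);
  }

  // Fraction of failed requests inside the window, 0 when there is no traffic.
  errorRate(nowMs: number): number {
    this.prune(nowMs);
    if (this.events.length === 0) return 0;
    const failures = this.events.filter(e => e.failed).length;
    return failures / this.events.length;
  }

  private prune(nowMs: number): void {
    const cutoff = nowMs - this.windowMs;
    this.events = this.events.filter(e => e.at > cutoff);
  }
}

// Mirrors the 5% threshold used in the alert example.
function shouldPage(rate: number, threshold = 0.05): boolean {
  return rate > threshold;
}
```

In practice it is worth requiring a minimum number of requests in the window before paging, so that one failure out of two requests at 3 a.m. does not wake anyone.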

DON'T ❌

  • Don't skip postmortems
// ❌ Mark as resolved without learning
incident.status = 'resolved';

// ✅ Document and learn
incident.status = 'resolved';
await createPostmortem(incident);
await scheduleReview(incident);
  • Don't blame individuals
# ❌ Blame-focused
Root cause: Developer X wrote bad code

# ✅ System-focused
Root cause: Missing code review process for file operations
  • Don't ignore warning signs
// ❌ Suppress warnings
if (memoryUsage > 0.8) {
  // TODO: Fix later
}

// ✅ Alert and track
if (memoryUsage > 0.8) {
  logger.warn('High memory usage', { usage: memoryUsage });
  metrics.gauge('memory.usage', memoryUsage);
}

Tools & Resources

Monitoring Tools

Analytics Daemon (from this marketplace):

cd packages/analytics-daemon
pnpm start
# Real-time monitoring on http://localhost:3333

System Monitoring:

# CPU, memory, disk
htop

# Network
iftop

# Disk I/O
iotop

Log Aggregation

Centralized logging:

# Ship logs to central server
tail -f /var/log/claude-code.log | nc logserver.example.com 514

Summary

Key Takeaways:

  • Classify incidents immediately - SEV-1 requires an immediate response, SEV-2 a response within 15 minutes
  • Follow response protocol - Assess, stabilize, communicate
  • Use systematic debugging - Binary search, correlation analysis, tracing
  • Analyze logs effectively - Structured logging enables fast analysis
  • Find root causes - 5 Whys and Fishbone diagrams prevent recurrence
  • Document everything - Postmortems are learning opportunities
  • Test recovery procedures - Practice makes perfect

Incident Response Checklist:

  • [ ] Classify severity (SEV-1 through SEV-4)
  • [ ] Assess impact (error rate, affected users)
  • [ ] Check obvious issues (API, disk, memory)
  • [ ] Stabilize systems (restart, rate limit, rollback)
  • [ ] Communicate status to stakeholders
  • [ ] Identify root cause (5 Whys, logs, metrics)
  • [ ] Deploy fix and verify recovery
  • [ ] Write postmortem within 24 hours
  • [ ] Create action items with owners and dates
  • [ ] Schedule review meeting with team

Last Updated: 2025-12-24

Author: Jeremy Longshore

Related Playbooks: Multi-Agent Rate Limits, MCP Server Reliability