Production Playbook for DevOps and Plugin Maintainers
Debugging production incidents in multi-agent Claude Code workflows requires systematic approaches to log analysis, root cause identification, and rapid remediation. This playbook provides battle-tested debugging techniques, incident response workflows, postmortem templates, and real-world examples of common failure modes.
Incident Classification
Severity Levels
| Severity | Impact | Response Time | Example |
|---|---|---|---|
| SEV-1 | Production down | Immediate | All agents failing, API completely offline |
| SEV-2 | Major degradation | 15 minutes | 50%+ error rate, critical features broken |
| SEV-3 | Minor degradation | 1 hour | Intermittent failures, single plugin broken |
| SEV-4 | Cosmetic issues | 24 hours | UI bugs, non-critical warnings |
Common Incident Types
enum IncidentType {
  API_FAILURE = 'api_failure',          // Claude API unreachable
  RATE_LIMIT = 'rate_limit',            // 429 errors from API
  TIMEOUT = 'timeout',                  // Agent/tool timeouts
  MEMORY_LEAK = 'memory_leak',          // Process memory exhaustion
  PLUGIN_CRASH = 'plugin_crash',        // Plugin process died
  DATA_CORRUPTION = 'data_corruption',  // Invalid data in DB/cache
  PERFORMANCE = 'performance',          // Slow response times
  AUTHENTICATION = 'authentication'     // Auth failures
}

interface Incident {
  id: string;
  severity: 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';
  type: IncidentType;
  startTime: number;
  affectedUsers: number;
  errorRate: number;
  description: string;
}
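As a sketch, the severity table can be turned into a triage helper. The thresholds below are illustrative assumptions, not official policy:

```typescript
type Severity = 'SEV-1' | 'SEV-2' | 'SEV-3' | 'SEV-4';

// Map observed impact onto the severity table above.
// Thresholds here are illustrative, not an official policy.
function classifySeverity(errorRate: number, apiOffline: boolean): Severity {
  if (apiOffline || errorRate >= 0.95) return 'SEV-1'; // production down
  if (errorRate >= 0.5) return 'SEV-2';                // major degradation
  if (errorRate >= 0.01) return 'SEV-3';               // intermittent failures
  return 'SEV-4';                                      // cosmetic only
}
```

During triage the initial severity can be computed from the first error-rate reading and then revised as more signal arrives.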
Initial Response Protocol
First 5 Minutes (SEV-1/SEV-2)
Step 1: Assess Impact
# Check current error rate
tail -n 1000 /var/log/claude-code.log | grep -c ERROR
# Check affected users
grep "ERROR" /var/log/claude-code.log | awk '{print $5}' | sort -u | wc -l
# Check service health
curl http://localhost:3333/api/status
Step 2: Check Obvious Issues
// Quick health check script
import { exec } from 'node:child_process';
import { promisify } from 'node:util';
const execAsync = promisify(exec);

async function quickHealthCheck(): Promise<{ healthy: boolean; issues: string[] }> {
  const issues: string[] = [];

  // 1. Check Claude API connectivity
  try {
    const response = await fetch('https://api.anthropic.com/v1/messages', {
      method: 'POST',
      headers: {
        'x-api-key': process.env.ANTHROPIC_API_KEY!,
        'anthropic-version': '2023-06-01',
        'content-type': 'application/json'
      },
      body: JSON.stringify({ model: 'claude-3-5-haiku-20241022', messages: [{ role: 'user', content: 'test' }], max_tokens: 10 })
    });
    if (!response.ok) issues.push(`Claude API returned ${response.status}`);
  } catch (error) {
    issues.push('Network connectivity issue');
  }

  // 2. Check disk space
  const { stdout } = await execAsync("df -h / | tail -1 | awk '{print $5}' | sed 's/%//'");
  if (parseInt(stdout) > 90) issues.push('Disk space critical');

  // 3. Check memory
  const memUsage = process.memoryUsage();
  if (memUsage.heapUsed / memUsage.heapTotal > 0.9) issues.push('Memory exhaustion');

  return { healthy: issues.length === 0, issues };
}
Step 3: Stabilize (if possible)
# Restart failed services
systemctl restart claude-code-daemon
pm2 restart all
# Clear cache if corrupted (destructive: wipes ALL Redis keys)
redis-cli FLUSHALL
# Rate limit protection
iptables -A INPUT -p tcp --dport 80 -m limit --limit 25/minute --limit-burst 100 -j ACCEPT
Communication Template
# Incident Alert: [TITLE]
Severity: SEV-2
Status: Investigating
Started: 2025-12-24 14:35 UTC
Affected: ~1,200 users (15% of total)
## Current Impact
- Agent execution failing with 429 errors
- Error rate: 68% (normal: <1%)
- No data loss
## Actions Taken
- ✅ Identified rate limit exhaustion (14:40)
- ✅ Implemented emergency rate limiting (14:42)
- 🔄 Monitoring recovery (14:45)
## Next Update
In 15 minutes or when resolved.
Common Failure Modes
1. Rate Limit Exhaustion
Symptoms:
Error 429: Rate limit exceeded
anthropic-ratelimit-requests-remaining: 0
anthropic-ratelimit-requests-reset: 2025-12-24T15:00:00Z
Diagnosis:
async function diagnoseRateLimits(): Promise<void> {
// Check recent API calls
const recentCalls = await queryLogs('SELECT COUNT(*) FROM api_calls WHERE timestamp > NOW() - INTERVAL 1 MINUTE');
console.log(`API calls in last minute: ${recentCalls}`);
// Check rate limit headers from last successful call
const lastHeaders = await getLastAPIHeaders();
console.log('Remaining requests:', lastHeaders['anthropic-ratelimit-requests-remaining']);
console.log('Reset time:', lastHeaders['anthropic-ratelimit-requests-reset']);
}
Fix:
// Implement token bucket rate limiter
const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

class EmergencyRateLimiter {
  private tokens = 50; // Match your API tier's requests-per-minute limit
  private lastRefill = Date.now();

  async throttle(): Promise<void> {
    this.refill();
    while (this.tokens < 1) {
      await sleep(100);
      this.refill();
    }
    this.tokens--;
  }

  private refill() {
    const now = Date.now();
    const elapsed = (now - this.lastRefill) / 1000;
    const tokensToAdd = elapsed * (50 / 60); // Refill at 50 tokens per minute
    this.tokens = Math.min(50, this.tokens + tokensToAdd);
    this.lastRefill = now;
  }
}
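When 429s do slip through, retries should back off rather than hammer the API. A minimal sketch of exponential backoff with full jitter, honoring a Retry-After value when the response provides one; the 1s base and 60s cap are assumptions to tune:

```typescript
// Delay before retry number `attempt` (0-based).
// Prefers the server's Retry-After value when present.
function backoffDelayMs(attempt: number, retryAfterSec?: number): number {
  if (retryAfterSec !== undefined) return retryAfterSec * 1000;
  const baseMs = 1000;                               // 1s initial delay (assumed)
  const capMs = 60_000;                              // never wait more than 60s (assumed)
  const expMs = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.floor(Math.random() * expMs);          // full jitter
}
```

Full jitter spreads retries out so a fleet of agents does not re-stampede the API at the same instant the limit resets.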
2. Agent Timeout
Symptoms:
Error: Agent execution timed out after 300000ms
Task: code-review
Conversation: abc-123-def
Diagnosis:
# Check for hung processes
ps aux | grep claude | grep -v grep
# Check system load
uptime
# Output: load average: 12.5, 8.3, 5.2 (CPU overload!)
# Check for blocking I/O
iotop -o -d 5
Fix:
// Implement aggressive timeouts
class TimeoutManager {
  async executeWithTimeout<T>(
    fn: () => Promise<T>,
    timeoutMs: number
  ): Promise<T> {
    return Promise.race([
      fn(),
      new Promise<never>((_, reject) =>
        setTimeout(() => reject(new Error(`Timeout after ${timeoutMs}ms`)), timeoutMs)
      )
    ]);
  }
}

// Usage
const timeout = new TimeoutManager();
const result = await timeout.executeWithTimeout(
  () => agent.execute(task),
  30000 // 30 second hard limit
);
3. Memory Leak
Symptoms:
# Memory usage climbing over time
free -m
#               total        used        free
# Mem:          16384       15892         492    # Critical!
# Process memory
ps aux --sort=-%mem | head -5
# claude-daemon: 8.2GB (!)
Diagnosis:
// Track memory usage over time
setInterval(() => {
  const usage = process.memoryUsage();
  console.log(JSON.stringify({
    timestamp: Date.now(),
    heapUsed: usage.heapUsed / 1024 / 1024, // MB
    heapTotal: usage.heapTotal / 1024 / 1024,
    external: usage.external / 1024 / 1024,
    rss: usage.rss / 1024 / 1024
  }));

  // Trigger GC if usage > 80% (global.gc only exists with --expose-gc)
  if (usage.heapUsed / usage.heapTotal > 0.8 && global.gc) {
    global.gc();
  }
}, 60000); // Every minute
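The samples logged above can also be checked for a sustained upward trend rather than a single spike. A sketch using a least-squares slope over recent heap readings; the 1 MB-per-sample threshold is an assumption to tune:

```typescript
// Returns true when heap samples (in MB, oldest first) climb steadily.
function leakSuspected(heapMb: number[], minSlopeMbPerSample = 1): boolean {
  const n = heapMb.length;
  if (n < 2) return false;
  const xMean = (n - 1) / 2;
  const yMean = heapMb.reduce((a, b) => a + b, 0) / n;
  let num = 0;
  let den = 0;
  for (let i = 0; i < n; i++) {
    num += (i - xMean) * (heapMb[i] - yMean);
    den += (i - xMean) ** 2;
  }
  return num / den >= minSlopeMbPerSample; // least-squares slope
}
```

A slope check distinguishes a genuine leak (monotonic growth) from normal GC sawtooth behavior, which averages out to a near-zero slope.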
Common Causes:
// ❌ Leak: Global cache never cleared
const cache = new Map<string, any>();
function addToCache(key: string, value: any) {
  cache.set(key, value); // Grows forever!
}
// ✅ Fix: LRU cache with size limit
import LRU from 'lru-cache';
const cache = new LRU<string, any>({ max: 1000 });
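If pulling in lru-cache is not an option, a Map's insertion order is enough for a minimal LRU. A sketch, not tuned for production use:

```typescript
// Minimal LRU: a Map iterates in insertion order, so re-inserting on access
// keeps the least recently used key at the front for eviction.
class SimpleLRU<K, V> {
  private map = new Map<K, V>();
  constructor(private max: number) {}

  get(key: K): V | undefined {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key)!;
    this.map.delete(key); // re-insert to mark as recently used
    this.map.set(key, value);
    return value;
  }

  set(key: K, value: V): void {
    if (this.map.has(key)) this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.max) {
      // evict least recently used (first key in insertion order)
      this.map.delete(this.map.keys().next().value as K);
    }
  }
}
```

Either way, the point is the same: the cache stays bounded no matter how long the process runs.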
4. Plugin Crash Loop
Symptoms:
# PM2 showing rapid restarts
pm2 status
# plugin-server | errored | 47 restarts in 2 minutes
# Logs show crash
tail -f /var/log/pm2/plugin-server-error.log
# Error: ECONNREFUSED 127.0.0.1:5432
# (PostgreSQL connection failed)
Diagnosis:
# Check dependencies
docker ps | grep postgres
# (empty - PostgreSQL container not running!)
# Check network
netstat -tulpn | grep 5432
# (no listener on port 5432)
Fix:
# Restart dependency
docker-compose up -d postgres
# Verify connectivity
psql -h localhost -U user -d database -c "SELECT 1"
# Restart plugin
pm2 restart plugin-server
Debugging Techniques
1. Binary Search Debugging
Problem: Unknown change broke production
# Use git bisect to find breaking commit
git bisect start
git bisect bad HEAD # Current version is broken
git bisect good v1.2.0 # Last known good version
# Git will check out commits for testing
# Test each commit:
npm install && npm run build && npm test
# Mark results
git bisect good # if tests pass
git bisect bad # if tests fail
# Git will find the exact breaking commit
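What bisect automates is plain binary search. A sketch of the same idea in code, assuming good commits strictly precede the first bad one and `isBad` stands in for a test run:

```typescript
// Find the first "bad" commit in a list ordered oldest to newest,
// assuming all good commits come before all bad ones.
function firstBad(commits: string[], isBad: (c: string) => boolean): string | null {
  let lo = 0;
  let hi = commits.length - 1;
  let found = -1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (isBad(commits[mid])) {
      found = mid;   // candidate; look for an earlier bad commit
      hi = mid - 1;
    } else {
      lo = mid + 1;  // breakage is later
    }
  }
  return found === -1 ? null : commits[found];
}
```

This is why bisect needs only about log2(N) test runs: each run halves the remaining range of candidate commits.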
2. Correlation Analysis
Find patterns in failures:
interface FailureEvent {
  timestamp: number;
  errorType: string;
  userId?: string;
  pluginName?: string;
  duration: number;
}

function analyzeFailureCorrelations(failures: FailureEvent[]): void {
  // Group by time windows
  const byHour = groupBy(failures, f => Math.floor(f.timestamp / 3600000));

  // Find spike times
  const spikes = Object.entries(byHour)
    .filter(([_, events]) => events.length > 100)
    .map(([hour, events]) => ({
      hour: new Date(parseInt(hour) * 3600000),
      count: events.length,
      topError: mode(events.map(e => e.errorType))
    }));
  console.log('Failure spikes:', spikes);

  // Find common attributes
  const byPlugin = groupBy(failures, f => f.pluginName);
  const suspiciousPlugin = Object.entries(byPlugin)
    .sort((a, b) => b[1].length - a[1].length)[0];
  console.log(`Most failures from plugin: ${suspiciousPlugin[0]} (${suspiciousPlugin[1].length} errors)`);
}
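The snippet above assumes `groupBy` and `mode` helpers that are not shown; minimal versions might look like:

```typescript
// Group items by a derived key (undefined keys land in 'unknown').
function groupBy<T>(items: T[], key: (item: T) => string | number | undefined): Record<string, T[]> {
  const groups: Record<string, T[]> = {};
  for (const item of items) {
    const k = String(key(item) ?? 'unknown');
    (groups[k] ??= []).push(item);
  }
  return groups;
}

// Most frequent value (ties broken by first occurrence).
function mode(values: string[]): string | undefined {
  const counts = new Map<string, number>();
  let best: string | undefined;
  let bestCount = 0;
  for (const v of values) {
    const c = (counts.get(v) ?? 0) + 1;
    counts.set(v, c);
    if (c > bestCount) {
      best = v;
      bestCount = c;
    }
  }
  return best;
}
```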
3. Distributed Tracing
Track request across services:
import { trace, context, SpanStatusCode } from '@opentelemetry/api';
const tracer = trace.getTracer('claude-code');
async function executeAgent(agentName: string, task: any): Promise<any> {
  const span = tracer.startSpan('agent.execute', {
    attributes: {
      'agent.name': agentName,
      'task.id': task.id
    }
  });

  try {
    // Execute agent logic
    const result = await agent.run(task);
    span.setStatus({ code: SpanStatusCode.OK });
    span.setAttribute('result.success', true);
    return result;
  } catch (error) {
    span.setStatus({
      code: SpanStatusCode.ERROR,
      message: (error as Error).message
    });
    span.recordException(error as Error);
    throw error;
  } finally {
    span.end();
  }
}
Log Analysis
Parsing Claude Code Logs
Log Format:
[2025-12-24T14:35:22.123Z] [ERROR] [agent:code-review] Rate limit exceeded
conversationId: abc-123-def
userId: user-456
errorCode: 429
retryAfter: 12
stack: Error: Rate limit exceeded
at callClaude (/app/src/api.ts:45:11)
Analysis Script:
import { readFileSync } from 'fs';

interface LogEntry {
  timestamp: Date;
  level: 'ERROR' | 'WARN' | 'INFO';
  component: string;
  message: string;
  metadata: Record<string, any>;
}

function parseLog(entry: string): LogEntry | null {
  const match = entry.match(/^\[(.*?)\] \[(.*?)\] \[(.*?)\] ([\s\S]*)/);
  if (!match) return null;

  const [, timestamp, level, component, rest] = match;
  const lines = rest.split('\n');
  const message = lines[0];

  // Parse metadata
  const metadata: Record<string, any> = {};
  for (const line of lines.slice(1)) {
    const metaMatch = line.match(/^\s*(\w+): (.+)$/);
    if (metaMatch) {
      const [, key, value] = metaMatch;
      metadata[key] = value;
    }
  }

  return {
    timestamp: new Date(timestamp),
    level: level as any,
    component,
    message,
    metadata
  };
}
function analyzeLogs(logPath: string): void {
  const content = readFileSync(logPath, 'utf-8');
  const logs = content.split('\n')
    .map(parseLog)
    .filter(Boolean) as LogEntry[];

  // Error rate by component
  const errorsByComponent = groupBy(
    logs.filter(l => l.level === 'ERROR'),
    l => l.component
  );

  console.log('Errors by component:');
  Object.entries(errorsByComponent)
    .sort((a, b) => b[1].length - a[1].length)
    .forEach(([component, errors]) => {
      console.log(`  ${component}: ${errors.length}`);
    });

  // Recent errors (last 5 minutes)
  const recentErrors = logs.filter(l =>
    l.level === 'ERROR' &&
    Date.now() - l.timestamp.getTime() < 300000
  );

  console.log(`\nRecent errors: ${recentErrors.length}`);
  recentErrors.slice(0, 10).forEach(err => {
    console.log(`  ${err.timestamp.toISOString()} - ${err.message}`);
  });
}
Using Analytics Daemon
// Query analytics daemon for incident patterns
const ws = new WebSocket('ws://localhost:3456');

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  // Track rate limit warnings
  if (data.type === 'rate_limit.warning') {
    console.warn(`⚠️ Rate limit approaching: ${data.current}/${data.limit}`);
  }

  // Track errors
  if (data.type === 'llm.call' && data.error) {
    console.error(`❌ LLM call failed: ${data.error}`);
  }
};
// Query historical data
const response = await fetch('http://localhost:3333/api/sessions');
const sessions = await response.json();
const failedSessions = sessions.filter(s => s.errorCount > 0);
console.log(`Failed sessions: ${failedSessions.length}/${sessions.length}`);
Root Cause Analysis
The 5 Whys Method
Example: Agent Timeout Incident
- Why did the agent timeout?
→ Because it took > 300 seconds to respond
- Why did it take so long?
→ Because the Claude API call was slow (280s)
- Why was the API call slow?
→ Because we sent a 50,000 token prompt
- Why did we send such a large prompt?
→ Because the code-reviewer agent included entire codebase in context
- Why did it include the entire codebase?
→ Root Cause: File globbing pattern `**/*` matched all files including node_modules (500MB)
Fix: Update file globbing to exclude node_modules
// Before: includes everything
const files = glob.sync('**/*');

// After: exclude dependencies
const files = glob.sync('**/*', {
  ignore: ['node_modules/**', '.git/**', 'dist/**']
});
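The same incident also motivates the "max prompt size" action item below. A sketch of a byte-budget guard over candidate files before prompt construction; the 200 KB default is an assumption:

```typescript
// Greedily include files until the prompt byte budget is exhausted;
// everything else is reported as skipped so the agent can say so.
function selectFilesWithinBudget(
  files: { path: string; bytes: number }[],
  maxBytes = 200_000
): { included: string[]; skipped: string[] } {
  const included: string[] = [];
  const skipped: string[] = [];
  let used = 0;
  for (const f of files) {
    if (used + f.bytes <= maxBytes) {
      included.push(f.path);
      used += f.bytes;
    } else {
      skipped.push(f.path);
    }
  }
  return { included, skipped };
}
```

Sorting the input by relevance before calling this means the budget is spent on the files that matter most.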
Fishbone Diagram (Ishikawa)
interface RootCauseAnalysis {
  problem: string;
  categories: {
    people?: string[];
    process?: string[];
    technology?: string[];
    environment?: string[];
  };
  rootCause: string;
  fix: string;
}

const analysis: RootCauseAnalysis = {
  problem: 'Agent timeout causing 68% error rate',
  categories: {
    people: [
      'Developer added file globbing without testing',
      'No code review caught the issue'
    ],
    process: [
      'No integration tests for large codebases',
      'No performance testing in CI/CD'
    ],
    technology: [
      'Glob pattern included node_modules (500MB)',
      'No size limit on prompts',
      'No timeout on file reading'
    ],
    environment: [
      'Production codebase larger than test repos',
      'No staging environment for testing'
    ]
  },
  rootCause: 'Missing file size validation and glob pattern filtering',
  fix: 'Add file exclusion patterns and max prompt size validation'
};
Recovery Procedures
Emergency Rollback
# Immediate rollback to last known good version
git log --oneline | head -5
# c534df4 (HEAD) feat: Add new feature (BROKEN)
# 3946b1f docs: Update README
# fc73caa (tag: v1.2.0) fix: Bug fix (LAST GOOD)
# Rollback
git reset --hard fc73caa
npm install
npm run build
pm2 restart all
# Deploy
./deploy.sh production
# Verify
curl http://api.example.com/health
Circuit Breaker Reset
// Manually reset circuit breaker after fixing issue
class CircuitBreakerManager {
  private breakers = new Map<string, CircuitBreaker>();

  reset(serviceName: string): void {
    const breaker = this.breakers.get(serviceName);
    if (breaker) {
      breaker.state = 'closed';
      breaker.failures = 0;
      console.log(`✓ Reset circuit breaker for ${serviceName}`);
    }
  }

  resetAll(): void {
    for (const service of this.breakers.keys()) {
      this.reset(service);
    }
    console.log('✓ Reset all circuit breakers');
  }
}
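The manager above assumes a `CircuitBreaker` exposing mutable `state` and `failures` fields; a minimal sketch of that shape:

```typescript
// Opens after `threshold` consecutive failures; reset by a success or by
// the manager. Half-open probing is omitted to keep the sketch small.
class CircuitBreaker {
  state: 'closed' | 'open' = 'closed';
  failures = 0;
  constructor(private threshold = 5) {}

  recordFailure(): void {
    this.failures++;
    if (this.failures >= this.threshold) this.state = 'open';
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed';
  }

  allowRequest(): boolean {
    return this.state === 'closed';
  }
}
```

Callers check `allowRequest()` before hitting the downstream service and fail fast while the breaker is open, instead of piling up doomed requests.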
Data Recovery
# Recover from backup
BACKUP_DATE="2025-12-24-14:00"
# Stop services
pm2 stop all
# Restore database
psql -d database_prod -f backups/backup_${BACKUP_DATE}.sql
# Restore files
rsync -av backups/files_${BACKUP_DATE}/ /var/lib/claude-code/
# Restart
pm2 restart all
# Verify data integrity
psql -d database_prod -c "SELECT COUNT(*) FROM conversations"
Postmortem Templates
Incident Postmortem
# Postmortem: Agent Timeout Incident (2025-12-24)

Date: 2025-12-24
Duration: 14:35 - 15:15 UTC (40 minutes)
Severity: SEV-2
Impact: 1,200 users (15%), 68% error rate

## Summary
Code-reviewer agent began timing out due to excessive file inclusion in prompts, causing a 68% error rate for 40 minutes.

## Timeline (UTC)
- 14:35 - First timeout alerts
- 14:40 - Error rate reaches 68%
- 14:42 - On-call engineer paged
- 14:45 - Root cause identified (file globbing)
- 14:50 - Fix deployed to staging
- 14:55 - Fix deployed to production
- 15:00 - Error rate drops to 5%
- 15:15 - Incident resolved, error rate < 1%

## Root Cause
File globbing pattern `**/*` included the node_modules/ directory (500MB), creating prompts exceeding Claude API's context limits and causing timeouts.

## Contributing Factors
- No file size validation before prompt construction
- No integration tests with large codebases
- No staging environment for testing

## What Went Well
- Fast root cause identification (10 minutes)
- Effective rollback procedure
- Clear communication to affected users

## What Went Wrong
- No monitoring alerts before user reports
- No prompt size limits in place to prevent the issue
- Fix took 20 minutes to deploy

## Action Items
- [ ] P0: Add file size validation (Owner: @dev, Due: 2025-12-25)
- [ ] P0: Implement max prompt size limit (Owner: @dev, Due: 2025-12-25)
- [ ] P1: Add monitoring for agent timeouts (Owner: @ops, Due: 2025-12-27)
- [ ] P1: Create staging environment (Owner: @ops, Due: 2025-12-30)
- [ ] P2: Add integration tests with large repos (Owner: @qa, Due: 2026-01-05)

## Lessons Learned
- File operations need size limits
- Production testing with realistic data is critical
- Monitoring must detect issues before users report them
Best Practices
DO ✅
- Log structured data
// ✅ Structured logging
logger.error('Agent execution failed', {
  agentName: 'code-reviewer',
  conversationId: 'abc-123',
  errorCode: 429,
  duration: 1234
});

// ❌ Unstructured
console.log('Error in code-reviewer agent');
- Set up alerts before incidents
// Alert on error rate > 5%
if (errorRate > 0.05) {
  pagerDuty.trigger({
    severity: 'critical',
    title: 'High error rate detected',
    details: `Error rate: ${(errorRate * 100).toFixed(1)}%`
  });
}
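For the alert above to be meaningful, `errorRate` should come from a sliding window rather than a lifetime average. A sketch; the 60-second window is an assumption:

```typescript
// Sliding-window error rate: old events age out, so the rate reflects
// current traffic rather than the whole process lifetime.
class RollingErrorRate {
  private events: { t: number; error: boolean }[] = [];
  constructor(private windowMs = 60_000) {}

  record(error: boolean, now = Date.now()): void {
    this.events.push({ t: now, error });
    this.events = this.events.filter(e => now - e.t <= this.windowMs);
  }

  rate(now = Date.now()): number {
    const live = this.events.filter(e => now - e.t <= this.windowMs);
    if (live.length === 0) return 0;
    return live.filter(e => e.error).length / live.length;
  }
}
```

A lifetime average would mask a sudden spike behind hours of healthy traffic; a rolling window crosses the 5% threshold within seconds of degradation.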
- Keep runbooks updated
# Agent Timeout Runbook
1. Check logs: `tail -f /var/log/claude-code.log | grep TIMEOUT`
2. Identify pattern: Which agents are timing out?
3. Check system resources: `top`, `free -m`, `df -h`
4. If rate limits: Implement emergency throttling
5. If resource exhaustion: Restart services
- Test recovery procedures
# Monthly disaster recovery drill
./test-recovery.sh
# 1. Trigger circuit breaker
# 2. Verify monitoring alerts
# 3. Execute rollback
# 4. Verify service restoration
DON'T ❌
- Don't skip postmortems
// ❌ Mark as resolved without learning
incident.status = 'resolved';
// ✅ Document and learn
incident.status = 'resolved';
await createPostmortem(incident);
await scheduleReview(incident);
- Don't blame individuals
# ❌ Blame-focused
Root cause: Developer X wrote bad code
# ✅ System-focused
Root cause: Missing code review process for file operations
- Don't ignore warning signs
// ❌ Suppress warnings
if (memoryUsage > 0.8) {
  // TODO: Fix later
}

// ✅ Alert and track
if (memoryUsage > 0.8) {
  logger.warn('High memory usage', { usage: memoryUsage });
  metrics.gauge('memory.usage', memoryUsage);
}
Tools & Resources
Monitoring Tools
Analytics Daemon (from this marketplace):
cd packages/analytics-daemon
pnpm start
# Real-time monitoring on http://localhost:3333
System Monitoring:
# CPU, memory, disk
htop
# Network
iftop
# Disk I/O
iotop
Log Aggregation
Centralized logging:
# Ship logs to central server
tail -f /var/log/claude-code.log | nc logserver.example.com 514
External Tools
- Datadog - APM and monitoring
- Sentry - Error tracking
- PagerDuty - Incident management
- Grafana - Dashboards
- ELK Stack - Log analysis
Summary
Key Takeaways:
- Classify incidents immediately - SEV-1/2 require immediate response
- Follow response protocol - Assess, stabilize, communicate
- Use systematic debugging - Binary search, correlation analysis, tracing
- Analyze logs effectively - Structured logging enables fast analysis
- Find root causes - 5 Whys and Fishbone diagrams prevent recurrence
- Document everything - Postmortems are learning opportunities
- Test recovery procedures - Practice makes perfect
Incident Response Checklist:
- [ ] Classify severity (SEV-1 through SEV-4)
- [ ] Assess impact (error rate, affected users)
- [ ] Check obvious issues (API, disk, memory)
- [ ] Stabilize systems (restart, rate limit, rollback)
- [ ] Communicate status to stakeholders
- [ ] Identify root cause (5 Whys, logs, metrics)
- [ ] Deploy fix and verify recovery
- [ ] Write postmortem within 24 hours
- [ ] Create action items with owners and dates
- [ ] Schedule review meeting with team
Last Updated: 2025-12-24
Author: Jeremy Longshore
Related Playbooks: Multi-Agent Rate Limits, MCP Server Reliability