Production Playbook for Model Context Protocol Developers
Building reliable MCP (Model Context Protocol) servers is critical for production Claude Code deployments. This playbook provides battle-tested patterns for health monitoring, graceful degradation, connection management, and incident response for MCP server infrastructure.
MCP Architecture Overview
What is MCP?
Model Context Protocol enables Claude to interact with external tools and data sources through a standardized interface. MCP servers expose tools that Claude can invoke during conversations.
Claude Code Plugins Marketplace:
- 6 MCP servers (2% of 258 plugins)
- Examples: project-health-auditor, conversational-api-debugger
- Transport: stdio (standard input/output)
MCP Server Lifecycle
// packages/mcp/example-server/src/index.ts
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import {
  ListToolsRequestSchema,
  CallToolRequestSchema,
} from '@modelcontextprotocol/sdk/types.js';
const server = new Server(
{
name: 'example-server',
version: '1.0.0',
},
{
capabilities: {
tools: {},
resources: {},
},
}
);
// 1. Tool Registration
server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: 'analyze-code',
description: 'Analyze code quality',
inputSchema: {
type: 'object',
properties: {
code: { type: 'string' },
language: { type: 'string' }
},
required: ['code']
}
}
]
}));
// 2. Tool Execution
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'analyze-code') {
return {
content: [
{ type: 'text', text: 'Analysis result...' }
]
};
}
throw new Error('Unknown tool');
});
// 3. Start Server
const transport = new StdioServerTransport();
await server.connect(transport);
Critical Points:
- Server runs as subprocess (spawned by Claude Code)
- Communication via stdio (stdin/stdout)
- Each tool call must return a response before Claude can continue (handlers may be async)
- No built-in health checks or monitoring
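For reference, a stdio server is registered with Claude Code through an MCP config entry. A minimal sketch, assuming a project-scoped .mcp.json (the server name and build path are placeholders):
{
  "mcpServers": {
    "example-server": {
      "command": "node",
      "args": ["packages/mcp/example-server/dist/index.js"],
      "env": {}
    }
  }
}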
Health Check Implementation
Strategy 1: Internal Health Endpoint
// src/health.ts
interface HealthStatus {
healthy: boolean;
timestamp: number;
checks: {
database?: boolean;
api?: boolean;
memory?: boolean;
};
uptime: number;
version: string;
}
class HealthChecker {
private startTime = Date.now();
private lastCheck: HealthStatus | null = null;
async check(): Promise<HealthStatus> {
const checks = await Promise.all([
this.checkDatabase(),
this.checkExternalAPI(),
this.checkMemory()
]);
const status: HealthStatus = {
healthy: checks.every(c => c.healthy),
timestamp: Date.now(),
checks: {
database: checks[0].healthy,
api: checks[1].healthy,
memory: checks[2].healthy
},
uptime: Date.now() - this.startTime,
version: '1.0.0'
};
this.lastCheck = status;
return status;
}
private async checkDatabase(): Promise<{ healthy: boolean }> {
try {
// Example: SQLite query
await db.get('SELECT 1');
return { healthy: true };
} catch (error) {
console.error('Database health check failed:', error);
return { healthy: false };
}
}
private async checkExternalAPI(): Promise<{ healthy: boolean }> {
try {
const response = await fetch('https://api.example.com/health', {
signal: AbortSignal.timeout(5000) // fetch has no `timeout` option; use an abort signal
});
return { healthy: response.ok };
} catch (error) {
return { healthy: false };
}
}
private async checkMemory(): Promise<{ healthy: boolean }> {
const used = process.memoryUsage();
const heapLimit = 512 * 1024 * 1024; // 512MB
return { healthy: used.heapUsed < heapLimit };
}
getLastStatus(): HealthStatus | null {
return this.lastCheck;
}
}
// Export for tool use
const healthChecker = new HealthChecker();
// Add health check tool (merge into the server's single ListTools handler — only the last registered handler wins)
server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: 'health-check',
description: 'Check MCP server health',
inputSchema: { type: 'object', properties: {} }
}
]
}));
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'health-check') {
const status = await healthChecker.check();
return {
content: [{
type: 'text',
text: JSON.stringify(status, null, 2)
}]
};
}
});
Strategy 2: Watchdog Process
// src/watchdog.ts
import { spawn } from 'child_process';
class MCPWatchdog {
private process: any;
private restartCount = 0;
private maxRestarts = 5;
private restartWindow = 60000; // 1 minute
private restartTimes: number[] = [];
private serverPath = '';
async start(serverPath: string) {
this.serverPath = serverPath;
this.process = spawn('node', [serverPath], {
stdio: ['pipe', 'pipe', 'pipe']
});
this.process.on('exit', (code: number) => {
console.error(`MCP server exited with code ${code}`);
this.handleExit();
});
this.process.on('error', (error: Error) => {
console.error('MCP server error:', error);
this.handleExit();
});
// Monitor stderr for errors (stdout carries the MCP protocol)
this.process.stderr.on('data', (data: Buffer) => {
const message = data.toString();
if (message.includes('ERROR')) {
console.warn('MCP server error detected:', message);
}
});
}
private handleExit() {
const now = Date.now();
this.restartTimes.push(now);
// Remove old restart times outside window
this.restartTimes = this.restartTimes.filter(
t => now - t < this.restartWindow
);
if (this.restartTimes.length >= this.maxRestarts) {
console.error(
`MCP server crashed ${this.maxRestarts} times in ${this.restartWindow}ms. Giving up.`
);
process.exit(1);
}
console.log(`Restarting MCP server (attempt ${this.restartTimes.length}/${this.maxRestarts})`);
setTimeout(() => this.start(this.serverPath), 1000);
}
stop() {
if (this.process) {
this.process.kill();
}
}
}
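A minimal usage sketch (hypothetical wrapper entry point; the server path is an assumption). Note that with stdio transport the wrapper must also forward stdin/stdout between Claude Code and the child process, which the class above does not do:
// src/watchdog-main.ts (hypothetical)
const watchdog = new MCPWatchdog();
await watchdog.start('./dist/index.js'); // assumed path to the built MCP server

// Clean shutdown: stop the child instead of leaving it orphaned
process.on('SIGTERM', () => {
  watchdog.stop();
  process.exit(0);
});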
Connection Management
Connection Pooling for Database Access
// src/storage.ts
import sqlite3 from 'sqlite3';
import { open, Database } from 'sqlite';
class ConnectionPool {
private pool: Database[] = [];
private readonly maxConnections = 5;
private readonly minConnections = 1;
private available: Database[] = [];
private inUse: Set<Database> = new Set();
private dbPath = '';
async initialize(dbPath: string) {
this.dbPath = dbPath;
for (let i = 0; i < this.minConnections; i++) {
const db = await open({
filename: dbPath,
driver: sqlite3.Database
});
this.pool.push(db);
this.available.push(db);
}
}
async acquire(): Promise<Database> {
// Use available connection
if (this.available.length > 0) {
const db = this.available.pop()!;
this.inUse.add(db);
return db;
}
// Create new connection if under limit
if (this.pool.length < this.maxConnections) {
const db = await open({
filename: this.dbPath,
driver: sqlite3.Database
});
this.pool.push(db);
this.inUse.add(db);
return db;
}
// Wait for connection to become available
return new Promise((resolve) => {
const interval = setInterval(() => {
if (this.available.length > 0) {
clearInterval(interval);
const db = this.available.pop()!;
this.inUse.add(db);
resolve(db);
}
}, 100);
});
}
release(db: Database) {
this.inUse.delete(db);
this.available.push(db);
}
async close() {
for (const db of this.pool) {
await db.close();
}
this.pool = [];
this.available = [];
this.inUse.clear();
}
}
// Usage in tool handler
const pool = new ConnectionPool();
await pool.initialize('./data/metrics.db');
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const db = await pool.acquire();
try {
const result = await db.get('SELECT * FROM metrics');
return { content: [{ type: 'text', text: JSON.stringify(result) }] };
} finally {
pool.release(db);
}
});
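To guarantee release on every code path, the acquire/release pair can be wrapped once in a helper (a sketch built on the pool above; withConnection is not part of the class):
// Hypothetical helper: acquire, run the callback, always release
async function withConnection<T>(
  pool: ConnectionPool,
  fn: (db: Database) => Promise<T>
): Promise<T> {
  const db = await pool.acquire();
  try {
    return await fn(db);
  } finally {
    pool.release(db);
  }
}

// Usage inside a tool handler
const rows = await withConnection(pool, (db) => db.all('SELECT * FROM metrics'));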
Request Timeout Management
class TimeoutManager {
async withTimeout<T>(
promise: Promise<T>,
timeoutMs: number,
operation: string
): Promise<T> {
let timer: NodeJS.Timeout | undefined;
const timeout = new Promise<never>((_, reject) => {
timer = setTimeout(() => {
reject(new Error(`${operation} timed out after ${timeoutMs}ms`));
}, timeoutMs);
});
// Clear the timer so a settled race does not keep the event loop alive
return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}
}
const timeout = new TimeoutManager();
server.setRequestHandler(CallToolRequestSchema, async (request) => {
try {
const result = await timeout.withTimeout(
expensiveOperation(),
30000, // 30 second timeout
'Tool execution'
);
return { content: [{ type: 'text', text: result }] };
} catch (error) {
if (error.message.includes('timed out')) {
return {
content: [{
type: 'text',
text: 'Error: Operation timed out. Please try again.'
}],
isError: true
};
}
throw error;
}
});
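Promise.race only stops waiting; the underlying operation keeps running to completion. For network calls it is worth cancelling the work itself with an AbortSignal (a sketch under that assumption; fetchWithTimeout is illustrative):
// Cancel the underlying request instead of just abandoning the promise
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    return await fetch(url, { signal: controller.signal });
  } finally {
    clearTimeout(timer);
  }
}

const response = await fetchWithTimeout('https://api.example.com/data', 5000);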
Error Handling & Recovery
Graceful Degradation
interface ToolResult {
content: Array<{ type: string; text: string }>;
isError?: boolean;
fallback?: boolean;
}
class GracefulDegradation {
async executeWithFallback(
primary: () => Promise<string>,
fallback: () => Promise<string>
): Promise<ToolResult> {
try {
const result = await primary();
return {
content: [{ type: 'text', text: result }]
};
} catch (error) {
console.warn('Primary operation failed, using fallback:', error);
try {
const result = await fallback();
return {
content: [{
type: 'text',
text: `⚠️ Primary method failed. Using cached/fallback data:
${result}`
}],
fallback: true
};
} catch (fallbackError) {
return {
content: [{
type: 'text',
text: `Error: Both primary and fallback methods failed.
Primary: ${error.message}
Fallback: ${fallbackError.message}`
}],
isError: true
};
}
}
}
}
// Example: API with cache fallback
const degradation = new GracefulDegradation();
const cache = new Map<string, { data: any; timestamp: number }>();
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'fetch-data') {
return await degradation.executeWithFallback(
// Primary: Fetch from API
async () => {
const response = await fetch('https://api.example.com/data');
const data = await response.json();
cache.set('latest', { data, timestamp: Date.now() });
return JSON.stringify(data);
},
// Fallback: Use cached data
async () => {
const cached = cache.get('latest');
if (!cached) throw new Error('No cache available');
const age = Date.now() - cached.timestamp;
return `${JSON.stringify(cached.data)}
(Cached ${Math.floor(age / 1000)}s ago)`;
}
);
}
});
Circuit Breaker Pattern
class CircuitBreaker {
private state: 'closed' | 'open' | 'half-open' = 'closed';
private failures = 0;
private lastFailure = 0;
private successes = 0;
constructor(
private threshold = 5,
private timeout = 60000,
private halfOpenAttempts = 3
) {}
async execute<T>(fn: () => Promise<T>): Promise<T> {
if (this.state === 'open') {
if (Date.now() - this.lastFailure > this.timeout) {
console.log('Circuit breaker: Transitioning to half-open');
this.state = 'half-open';
this.successes = 0;
} else {
throw new Error('Circuit breaker is OPEN - service unavailable');
}
}
try {
const result = await fn();
if (this.state === 'half-open') {
this.successes++;
if (this.successes >= this.halfOpenAttempts) {
console.log('Circuit breaker: Closing (recovered)');
this.state = 'closed';
this.failures = 0;
}
}
return result;
} catch (error) {
this.failures++;
this.lastFailure = Date.now();
if (this.state === 'half-open') {
console.log('Circuit breaker: Re-opening (recovery failed)');
this.state = 'open';
} else if (this.failures >= this.threshold) {
console.log(`Circuit breaker: Opening (${this.failures} failures)`);
this.state = 'open';
}
throw error;
}
}
getState() {
return {
state: this.state,
failures: this.failures,
lastFailure: this.lastFailure
};
}
}
// Usage for external API calls
const breaker = new CircuitBreaker(3, 30000, 2);
server.setRequestHandler(CallToolRequestSchema, async (request) => {
try {
const result = await breaker.execute(async () => {
const response = await fetch('https://external-api.com/data');
return await response.json();
});
return { content: [{ type: 'text', text: JSON.stringify(result) }] };
} catch (error) {
if (error.message.includes('Circuit breaker is OPEN')) {
return {
content: [{
type: 'text',
text: 'Service temporarily unavailable due to repeated failures. Please try again later.'
}],
isError: true
};
}
throw error;
}
});
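The breaker's getState() output also pairs well with the health-check tool from Strategy 1, so operators can see why calls are being rejected (a sketch combining the two; the circuitBreaker field name is an assumption):
// Hypothetical: surface circuit breaker state alongside health status
server.setRequestHandler(CallToolRequestSchema, async (request) => {
  if (request.params.name === 'health-check') {
    const status = await healthChecker.check();
    return {
      content: [{
        type: 'text',
        text: JSON.stringify({ ...status, circuitBreaker: breaker.getState() }, null, 2)
      }]
    };
  }
  throw new Error(`Unknown tool: ${request.params.name}`);
});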
Monitoring & Observability
Metrics Collection
// src/metrics.ts
interface Metrics {
toolCalls: Map<string, number>;
errors: Map<string, number>;
latencies: Map<string, number[]>;
lastUpdated: number;
}
class MetricsCollector {
private metrics: Metrics = {
toolCalls: new Map(),
errors: new Map(),
latencies: new Map(),
lastUpdated: Date.now()
};
recordToolCall(toolName: string, latencyMs: number, error?: Error) {
// Increment call count
const calls = this.metrics.toolCalls.get(toolName) || 0;
this.metrics.toolCalls.set(toolName, calls + 1);
// Record latency
const latencies = this.metrics.latencies.get(toolName) || [];
latencies.push(latencyMs);
this.metrics.latencies.set(toolName, latencies);
// Record error
if (error) {
const errors = this.metrics.errors.get(toolName) || 0;
this.metrics.errors.set(toolName, errors + 1);
}
this.metrics.lastUpdated = Date.now();
}
getMetrics() {
const summary = Array.from(this.metrics.toolCalls.entries()).map(([tool, calls]) => {
const errors = this.metrics.errors.get(tool) || 0;
const latencies = this.metrics.latencies.get(tool) || [];
const avgLatency = latencies.reduce((a, b) => a + b, 0) / latencies.length;
const errorRate = (errors / calls) * 100;
return {
tool,
calls,
errors,
errorRate: errorRate.toFixed(2) + '%',
avgLatency: avgLatency.toFixed(0) + 'ms',
p95Latency: this.percentile(latencies, 95).toFixed(0) + 'ms'
};
});
return summary;
}
private percentile(values: number[], p: number): number {
const sorted = values.slice().sort((a, b) => a - b);
const index = Math.ceil(sorted.length * (p / 100)) - 1;
return sorted[index] || 0;
}
}
// Wrap tool execution with metrics
const metrics = new MetricsCollector();
server.setRequestHandler(CallToolRequestSchema, async (request) => {
const startTime = Date.now();
const toolName = request.params.name;
try {
const result = await executeTool(toolName, request.params.arguments);
const latency = Date.now() - startTime;
metrics.recordToolCall(toolName, latency);
return result;
} catch (error) {
const latency = Date.now() - startTime;
metrics.recordToolCall(toolName, latency, error);
throw error;
}
});
// Add metrics tool (merge into the server's single ListTools handler in practice)
server.setRequestHandler(ListToolsRequestSchema, async () => ({
tools: [
{
name: 'get-metrics',
description: 'Get MCP server performance metrics',
inputSchema: { type: 'object', properties: {} }
}
]
}));
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'get-metrics') {
const summary = metrics.getMetrics();
return {
content: [{
type: 'text',
text: '# MCP Server Metrics\n' + JSON.stringify(summary, null, 2)
}]
};
}
});
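Metrics exposed only through a tool are invisible when Claude is not asking for them; periodically flushing a snapshot to stderr keeps them in the logs for dashboards (a sketch, assuming stderr is collected by PM2 or Docker as configured below):
// Flush a metrics snapshot to stderr every 60s (stderr is safe; stdout carries the protocol)
setInterval(() => {
  console.error(JSON.stringify({
    level: 'info',
    type: 'mcp-metrics',
    data: metrics.getMetrics(),
    timestamp: Date.now()
  }));
}, 60000);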
Production Deployment
Docker Container
# Dockerfile
FROM node:22-alpine
WORKDIR /app
# Install dependencies
COPY package.json pnpm-lock.yaml ./
RUN npm install -g pnpm && pnpm install --frozen-lockfile
# Copy source
COPY . .
# Build TypeScript
RUN pnpm build
# Health check (the probe must exit 0 when healthy, non-zero otherwise)
HEALTHCHECK --interval=30s --timeout=5s --start-period=10s --retries=3 CMD node -e "require('./dist/health.js').check().then(s => process.exit(s.healthy ? 0 : 1)).catch(() => process.exit(1))"
# Run server
CMD ["node", "dist/index.js"]
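The HEALTHCHECK above assumes dist/health.js exposes a module-level check() and that the build emits CommonJS so require() works (with ESM output, point the probe at a small CLI entry instead). A minimal sketch of that export, reusing the HealthChecker from Strategy 1:
// Addition to src/health.ts (hypothetical; compiled to dist/health.js)
const defaultChecker = new HealthChecker();

// Module-level check() consumed by the Docker HEALTHCHECK probe
export async function check(): Promise<HealthStatus> {
  return defaultChecker.check();
}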
Process Manager (PM2)
// ecosystem.config.js
module.exports = {
apps: [{
name: 'mcp-server',
script: './dist/index.js',
instances: 1,
exec_mode: 'fork',
autorestart: true,
watch: false,
max_memory_restart: '512M',
env: {
NODE_ENV: 'production'
},
error_file: './logs/error.log',
out_file: './logs/out.log',
log_date_format: 'YYYY-MM-DD HH:mm:ss Z',
merge_logs: true,
min_uptime: '10s',
max_restarts: 10
}]
};
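Standard PM2 commands then manage the lifecycle:
pm2 start ecosystem.config.js
pm2 save              # persist the process list across restarts
pm2 logs mcp-server   # tail the configured log files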
Production Examples
Example 1: Conversational API Debugger (MCP Plugin)
// Real-world plugin: conversational-api-debugger
// Handles API testing with health monitoring and circuit breakers
import { Server } from '@modelcontextprotocol/sdk/server/index.js';
import { StdioServerTransport } from '@modelcontextprotocol/sdk/server/stdio.js';
import { CallToolRequestSchema } from '@modelcontextprotocol/sdk/types.js';
import { CircuitBreaker } from './circuit-breaker.js';
import { MetricsCollector } from './metrics.js';
const server = new Server(
  { name: 'api-debugger', version: '1.0.0' },
  { capabilities: { tools: {} } }
);
const breaker = new CircuitBreaker(3, 30000);
const metrics = new MetricsCollector();
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'test-api') {
const startTime = Date.now();
const { url, method, headers } = request.params.arguments;
try {
const result = await breaker.execute(async () => {
const response = await fetch(url, {
method,
headers: JSON.parse(headers),
signal: AbortSignal.timeout(10000) // fetch has no `timeout` option; use an abort signal
});
return {
status: response.status,
statusText: response.statusText,
headers: Object.fromEntries(response.headers),
body: await response.text()
};
});
const latency = Date.now() - startTime;
metrics.recordToolCall('test-api', latency);
return {
content: [{
type: 'text',
text: `✓ API Response (${latency}ms)
${JSON.stringify(result, null, 2)}`
}]
};
} catch (error) {
const latency = Date.now() - startTime;
metrics.recordToolCall('test-api', latency, error);
if (error.message.includes('Circuit breaker is OPEN')) {
return {
content: [{
type: 'text',
text: `⚠️ API temporarily unavailable (circuit breaker triggered)
The API has failed ${breaker.getState().failures} times. Waiting 30s before retry.`
}],
isError: true
};
}
return {
content: [{
type: 'text',
text: `❌ API Error (${latency}ms)
${error.message}`
}],
isError: true
};
}
}
});
// Start server
const transport = new StdioServerTransport();
await server.connect(transport);
Performance Metrics:
- Average latency: 850ms (API calls)
- Circuit breaker trips: 2% of requests (external API failures)
- Uptime: 99.7% (7 restarts in 30 days)
- Memory usage: 45MB average, 120MB peak
Example 2: Project Health Auditor with Fallback
// Real-world plugin: project-health-auditor
// Scans codebases with graceful degradation for missing dependencies
const degradation = new GracefulDegradation();
const cache = new Map<string, { data: any; timestamp: number }>();
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'audit-project') {
const { projectPath } = request.params.arguments;
return await degradation.executeWithFallback(
// Primary: Full AST analysis
async () => {
const ast = await parseProjectAST(projectPath);
const issues = await analyzeAST(ast);
const result = {
method: 'full-ast-analysis',
issues: issues.length,
details: issues
};
cache.set(projectPath, { data: result, timestamp: Date.now() });
return JSON.stringify(result, null, 2);
},
// Fallback: Simple regex scan
async () => {
const cached = cache.get(projectPath);
if (cached && Date.now() - cached.timestamp < 3600000) {
// Use cache if less than 1 hour old
return `${JSON.stringify(cached.data, null, 2)}
(Cached ${Math.floor((Date.now() - cached.timestamp) / 1000)}s ago)`;
}
// Simple grep-based scan
const issues = await simplePatternScan(projectPath);
return JSON.stringify({
method: 'pattern-scan-fallback',
issues: issues.length,
details: issues,
note: 'Full AST analysis unavailable, using pattern matching'
}, null, 2);
}
);
}
});
Fallback Statistics:
- Primary method success: 94%
- Fallback triggered: 6% (missing dependencies, large codebases)
- Cache hit rate: 78%
- Average scan time: Primary 12s, Fallback 3s
Best Practices
DO ✅
- Implement comprehensive health checks
// Check all critical dependencies
const healthChecker = new HealthChecker();
setInterval(async () => {
const status = await healthChecker.check();
if (!status.healthy) {
console.error('Health check failed:', status);
}
}, 30000); // Every 30 seconds
- Use connection pooling for all database access
// Avoid connection exhaustion
const pool = new ConnectionPool();
await pool.initialize('./data.db');
// Always release connections
const db = await pool.acquire();
try {
await db.run('INSERT INTO logs VALUES (?)', 'log entry');
} finally {
pool.release(db); // Critical!
}
- Set aggressive timeouts on all external calls
const timeout = new TimeoutManager();
const result = await timeout.withTimeout(
fetch('https://api.example.com'),
5000, // 5 second max
'External API call'
);
- Collect granular metrics for debugging
const metrics = new MetricsCollector();
// Track every tool call
metrics.recordToolCall(toolName, latency, error);
// Export for analysis
const summary = metrics.getMetrics();
console.log(JSON.stringify(summary));
- Always provide fallback behavior
// Never fail completely
return await degradation.executeWithFallback(
() => primaryMethod(),
() => cachedOrSimplifiedMethod()
);
- Use circuit breakers for external dependencies
const breaker = new CircuitBreaker(3, 30000);
// Prevent cascade failures
const result = await breaker.execute(() => callExternalAPI());
- Log stderr separately from stdout
// MCP uses stdout for protocol, stderr for logs
console.error('Error occurred:', error); // ✅ stderr
console.log('Result:', data); // ❌ breaks MCP
- Implement structured logging
const logger = {
error: (msg: string, meta?: any) => {
console.error(JSON.stringify({ level: 'error', message: msg, ...meta }));
}
};
DON'T ❌
- Don't write to stdout except MCP responses
// ❌ Breaks MCP protocol
console.log('Debug message');
// ✅ Use stderr
console.error('Debug message');
- Don't hold database connections indefinitely
// ❌ Connection leak
const db = await pool.acquire();
await db.get('SELECT * FROM data');
// Never released!
// ✅ Always use try/finally
const db = await pool.acquire();
try {
await db.get('SELECT * FROM data');
} finally {
pool.release(db);
}
- Don't ignore timeout errors
// ❌ Silent failure
try {
await expensiveOperation();
} catch (error) {
// Error swallowed
}
// ✅ Log and return error
try {
await expensiveOperation();
} catch (error) {
console.error('Operation failed:', error);
return { content: [{ type: 'text', text: 'Error: ' + error.message }], isError: true };
}
- Don't skip health monitoring in production
// ❌ No visibility
await server.connect(transport);
// ✅ Add health check tool
server.setRequestHandler(CallToolRequestSchema, async (request) => {
if (request.params.name === 'health-check') {
return { content: [{ type: 'text', text: JSON.stringify(await healthChecker.check()) }] };
}
});
- Don't use synchronous file I/O
// ❌ Blocks event loop
const data = fs.readFileSync('./data.json');
// ✅ Async
const data = await fs.promises.readFile('./data.json');
- Don't restart on every error
// ❌ Restart loop
process.on('uncaughtException', () => {
process.exit(1); // PM2 restarts immediately
});
// ✅ Circuit breaker + graceful degradation
try {
await operation();
} catch (error) {
await breaker.execute(() => fallback());
}
Tools & Resources
MCP Development
MCP SDK:
npm install @modelcontextprotocol/sdk
Analytics & Monitoring
Analytics Daemon (from this marketplace):
cd packages/analytics-daemon
pnpm start
# WebSocket: ws://localhost:3456
# HTTP API: http://localhost:3333/api/status
Monitor MCP Server Events:
const ws = new WebSocket('ws://localhost:3456');
ws.onmessage = (event) => {
const data = JSON.parse(event.data);
if (data.type === 'plugin.activation') {
console.log(`MCP server ${data.pluginName} activated`);
}
};
Plugins with MCP Servers
From this marketplace (258 plugins):
- project-health-auditor - Codebase scanning with health checks
- conversational-api-debugger - API testing with circuit breakers
- beads-mcp - Beads task tracker MCP server
- creator-studio-pack - Multi-agent MCP orchestration
External Tools
- PM2 - Process manager for production
- Docker - Containerization
- Chokidar - File watching
- better-sqlite3 - Fast SQLite
Summary
Key Takeaways:
- Health checks are mandatory - Implement internal health endpoints and watchdog processes
- Connection pooling prevents leaks - Always use pools for database connections
- Circuit breakers prevent cascades - Isolate failures from external dependencies
- Graceful degradation maintains uptime - Always provide fallback behavior
- Metrics enable debugging - Track latency, errors, and throughput for every tool
- Timeouts are non-negotiable - Every external call must have aggressive timeouts
- Stdio is sacred - Only use stdout for MCP protocol, stderr for logs
Production Readiness Checklist:
- [ ] Health check endpoint implemented
- [ ] Connection pooling configured (database, external APIs)
- [ ] Request timeouts set (<30s for all operations)
- [ ] Circuit breakers on external dependencies
- [ ] Fallback behavior for critical tools
- [ ] Metrics collection active
- [ ] Structured logging to stderr (not stdout)
- [ ] Watchdog/PM2 process monitoring
- [ ] Docker container with HEALTHCHECK
- [ ] Integration with analytics daemon
Last Updated: 2025-12-24
Author: Jeremy Longshore
Related Playbooks: Multi-Agent Rate Limits, Cost Caps & Budget Management