Production Engineering

Production Playbooks

Comprehensive technical guides for building production-grade Claude Code plugin systems. Each playbook provides deep implementation details, production-ready code examples, and real-world patterns learned from operating large-scale AI agent deployments.

Playbooks • 53k words

Featured Playbooks

Infrastructure Featured

MCP Server Reliability

Self-healing MCP servers with circuit breakers, exponential backoff, health checks, and automatic recovery. Production-grade Model Context Protocol implementations.

~18 min read • 3,500 words

Infrastructure Featured

Self-Hosted Stack Setup

Full infrastructure deployment with Docker/Kubernetes. Ollama, PostgreSQL, Redis, Prometheus, Grafana, and Nginx: a complete production stack with monitoring and backups.

~28 min read • 5,500 words

Cost Featured

Cost Attribution System

Multi-dimensional cost tracking (team/project/user/workflow). Automatic tagging, chargeback models, budget enforcement, and usage analytics for AI operations.

~28 min read • 5,500 words

AI Architecture Featured

Advanced Tool Use

Dynamic tool discovery, programmatic orchestration, and parameter guidance. Tool Search Tool (85% token reduction), Programmatic Tool Calling (37% efficiency gains), and Tool Use Examples (90% parameter accuracy). Enterprise-scale agent architecture.

~33 min read • 6,500 words

All Playbooks

Cost

Multi-Agent Rate Limits

Prevent API throttling in concurrent multi-agent systems. Token bucket algorithms, sliding windows, priority queues, and backpressure handling for Claude API rate limits.

~14 min read • 2,800 words
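
As a rough illustration of the token bucket idea mentioned above (not the playbook's own code), the sketch below refills tokens at a fixed rate and rejects a request when too few remain. The class name, parameters, and injectable clock are all hypothetical choices made for this example.

```python
import time


class TokenBucket:
    """Holds up to `capacity` tokens, refilled at `rate` tokens per second."""

    def __init__(self, capacity, rate, clock=time.monotonic):
        self.capacity = float(capacity)
        self.rate = float(rate)
        self.clock = clock              # injectable so tests can fake time
        self.tokens = float(capacity)   # start full
        self.updated = clock()

    def _refill(self):
        # Add tokens for the time elapsed since the last update, capped at capacity.
        now = self.clock()
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now

    def try_acquire(self, cost=1.0):
        """Spend `cost` tokens if available; return False instead of blocking."""
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True
        return False
```

A caller that gets `False` can queue the request or apply backpressure; one bucket per agent or per API key is one possible arrangement.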

Cost

Cost Caps & Budget Management

Hard budget controls for AI spending. Real-time spend tracking, automatic shutoffs, team quotas, and financial safeguards to prevent runaway costs.

~16 min read • 3,200 words
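
One way to picture a hard budget control with team quotas is the sketch below. It is an assumption-laden illustration, not the playbook's implementation: `BudgetGuard`, `BudgetExceeded`, and the use of integer cents are hypothetical choices.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a charge would push spend past a hard cap."""


class BudgetGuard:
    """Rejects any charge that would exceed a team quota or the global cap."""

    def __init__(self, global_cap_cents, team_quota_cents):
        self.global_cap_cents = global_cap_cents
        self.team_quota_cents = dict(team_quota_cents)
        self.spent_by_team = {team: 0 for team in team_quota_cents}

    @property
    def total_spent(self):
        return sum(self.spent_by_team.values())

    def charge(self, team, cents):
        # Integer cents avoid floating-point drift in running totals.
        if team not in self.team_quota_cents:
            raise KeyError(f"unknown team: {team}")
        if self.spent_by_team[team] + cents > self.team_quota_cents[team]:
            raise BudgetExceeded(f"team quota exceeded for {team}")
        if self.total_spent + cents > self.global_cap_cents:
            raise BudgetExceeded("global cap exceeded")
        # Only record the spend once both checks pass.
        self.spent_by_team[team] += cents
```

Checking before recording means a rejected request never counts against the budget; the caller decides whether a `BudgetExceeded` triggers a shutoff or an alert.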

Infrastructure

MCP Server Reliability

Self-healing MCP servers with circuit breakers, exponential backoff, health checks, and automatic recovery. Production-grade Model Context Protocol implementations.

~18 min read • 3,500 words
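
To make the circuit breaker and exponential backoff terms above concrete, here is a minimal sketch. It does not use any MCP library and is not taken from the playbook; `CircuitBreaker`, `CircuitOpenError`, `backoff_delay`, and their default values are hypothetical.

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; call skipped")
            # Timeout elapsed: half-open, let one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


def backoff_delay(attempt, base=0.5, cap=30.0, rng=random.random):
    """Full-jitter backoff: rng() scaled by min(cap, base * 2**attempt)."""
    return rng() * min(cap, base * (2 ** attempt))
```

A retry loop would sleep for `backoff_delay(attempt)` between attempts and stop retrying when `CircuitOpenError` is raised.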

Infrastructure

Ollama Migration Guide

Switch from OpenAI/Anthropic to self-hosted LLMs. Complete migration path: local setup, prompt translation, performance benchmarks, and cost analysis.

~23 min read • 4,500 words

Operations

Incident Debugging Playbook

SEV-1/2/3/4 incident response protocols. Log analysis, root cause investigation (5 Whys, Fishbone), postmortem templates, and on-call procedures.

~25 min read • 5,000 words

Infrastructure

Self-Hosted Stack Setup

Full infrastructure deployment with Docker/Kubernetes. Ollama, PostgreSQL, Redis, Prometheus, Grafana, and Nginx: a complete production stack with monitoring and backups.

~28 min read • 5,500 words

Security

Compliance & Audit Guide

SOC 2, GDPR, HIPAA, PCI DSS implementation. Audit logging with immutable signatures, RBAC, data privacy (PII redaction), and regulatory compliance.

~30 min read • 6,000 words
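
The guide's exact signing scheme is not described here. As one plausible reading of "audit logging with immutable signatures", the sketch below chains HMAC-SHA256 signatures so that editing or dropping an earlier entry invalidates the log. The function names and entry layout are hypothetical.

```python
import hashlib
import hmac
import json


def _sign(key, prev_sig, event):
    # Sign the previous signature together with a canonical JSON form of the event.
    payload = json.dumps(event, sort_keys=True)
    return hmac.new(key, (prev_sig + payload).encode("utf-8"), hashlib.sha256).hexdigest()


def append_entry(log, key, event):
    """Append `event`, linking it to the previous entry's signature."""
    prev_sig = log[-1]["sig"] if log else ""
    log.append({"event": event, "prev": prev_sig, "sig": _sign(key, prev_sig, event)})


def verify_log(log, key):
    """Return True only if every entry's link and signature check out."""
    prev_sig = ""
    for entry in log:
        expected = _sign(key, prev_sig, entry["event"])
        if entry["prev"] != prev_sig or not hmac.compare_digest(expected, entry["sig"]):
            return False
        prev_sig = entry["sig"]
    return True
```

This makes tampering detectable rather than impossible. Real immutability would also need write-once storage and careful key management, which are outside this sketch.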

Operations

Team Presets & Workflows

Team standardization and collaboration. Plugin bundles, workflow templates, automated onboarding, and multi-layer configuration hierarchy (org/team/project/individual).

~25 min read • 5,000 words

Cost

Cost Attribution System

Multi-dimensional cost tracking (team/project/user/workflow). Automatic tagging, chargeback models, budget enforcement, and usage analytics for AI operations.

~28 min read • 5,500 words

Operations

Progressive Enhancement Patterns

Safe AI feature rollout strategies. Feature flags (0% → 100%), A/B testing, canary deployments, graceful degradation, and automated rollback on failures.

~28 min read • 5,500 words
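
A common way to implement a 0% → 100% flag rollout is deterministic hash bucketing, sketched below. This is an assumed approach, not necessarily the one the playbook uses, and `rollout_bucket` and `is_enabled` are hypothetical names.

```python
import hashlib


def rollout_bucket(flag, user_id):
    """Map (flag, user) to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag, user_id, percent):
    """Enable the flag for users whose bucket falls below `percent`."""
    return rollout_bucket(flag, user_id) < percent
```

Because the bucket depends only on the flag and user, raising `percent` only ever adds users, and setting it to 0 acts as an immediate rollback.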

AI Architecture