speak-incident-runbook
Execute Speak incident response procedures with triage, mitigation, and postmortem. Use when responding to Speak-related outages, investigating errors, or running post-incident reviews for language learning feature failures. Trigger with phrases like "speak incident", "speak outage", "speak down", "speak on-call", "speak emergency", "speak broken". allowed-tools: Read, Grep, Bash(kubectl:*), Bash(curl:*) version: 1.0.0 license: MIT author: Jeremy Longshore <jeremy@intentsolutions.io>
Allowed Tools
No tools specified
Provided by Plugin
speak-pack
Claude Code skill pack for Speak AI Language Learning Platform (24 skills)
Installation
This skill is included in the speak-pack plugin:
/plugin install speak-pack@claude-code-plugins-plus
Click to copy
Instructions
# Speak Incident Runbook
## Overview
Rapid incident response procedures for Speak language learning-related outages.
## Prerequisites
- Access to Speak dashboard and status page
- kubectl access to production cluster
- Prometheus/Grafana access
- Communication channels (Slack, PagerDuty)
## Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Complete outage | < 15 min | Speak API unreachable, all lessons failing |
| P2 | Degraded service | < 1 hour | High latency, speech recognition failing |
| P3 | Minor impact | < 4 hours | Specific language unavailable, slow scoring |
| P4 | No user impact | Next business day | Monitoring gaps, minor bugs |
## Quick Triage
```bash
#!/bin/bash
# speak-triage.sh
echo "=== Speak Quick Triage ==="
echo "Time: $(date)"
echo ""
# 1. Check Speak status
echo "1. Speak Status Page:"
curl -s https://status.speak.com/api/status | jq '.status'
echo ""
# 2. Check our integration health
echo "2. Our Health Check:"
curl -s https://api.yourapp.com/health | jq '.services.speak'
echo ""
# 3. Check active sessions
echo "3. Active Sessions:"
curl -s "localhost:9090/api/v1/query?query=speak_active_sessions" | jq '.data.result[0].value[1]'
echo ""
# 4. Check error rate (last 5 min)
echo "4. Error Rate (5m):"
curl -s "localhost:9090/api/v1/query?query=rate(speak_api_errors_total[5m])" | jq '.data.result'
echo ""
# 5. Check lesson completion rate
echo "5. Lesson Completion Rate (1h):"
curl -s "localhost:9090/api/v1/query?query=rate(speak_lessons_completed_total[1h])/rate(speak_lessons_started_total[1h])" | jq '.data.result[0].value[1]'
echo ""
# 6. Recent error logs
echo "6. Recent Errors:"
kubectl logs -l app=speak-integration --since=5m 2>/dev/null | grep -i error | tail -10
```
## Decision Tree
```
Speak API returning errors?
โโ YES: Is status.speak.com showing incident?
โ โโ YES โ Wait for Speak to resolve. Enable fallback/offline mode.
โ โโ NO โ Our integration issue. Check credentials, config, code.
โโ NO: Are lessons working?
โโ YES: Is speech recognition working?
โ โโ YES โ Likely resolved or intermittent. Monitor.
โ โโ NO โ Audio processing issue. Check audio infrastructure.
โโ NO โ Our service issue. Check pods, memory, database.
```
## Immediate Actions by Error Type
### 401/403 - Authentication Failures
```bash
# Verify API credentials are set
kubectl get secret speak-secrets -o jsonpath='{.data.api-key}' | base64 -d | head -c 10
echo "..."
# Check if credentials were rotated
# โ Verify in Speak developer dashboard
# Remediation: Update secret and restart pods
kubectl create secret generic speak-secrets \
--from-literal=api-key=NEW_KEY \
--from-literal=app-id=APP_ID \
--dry-run=client -o yaml | kubectl apply -f -
kubectl rollout restart deployment/speak-integration
```
### 429 - Rate Limited
```bash
# Check current rate limit status
curl -v https://api.speak.com/v1/health \
-H "Authorization: Bearer ${SPEAK_API_KEY}" 2>&1 | grep -i "rate"
# Enable request queuing/throttling
kubectl set env deployment/speak-integration SPEAK_RATE_LIMIT_MODE=queue
# For persistent issues, contact Speak for limit increase
```
### 500/503 - Speak Server Errors
```bash
# Enable graceful degradation (offline lessons if available)
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE=true
# Enable cached responses
kubectl set env deployment/speak-integration SPEAK_CACHE_ONLY=true
# Update status page for users
# Monitor Speak status for resolution
```
### Speech Recognition Failures
```bash
# Check audio processing service
kubectl get pods -l component=audio-processor
# Check audio queue depth
curl -s "localhost:9090/api/v1/query?query=speak_audio_queue_depth"
# Scale audio processors if needed
kubectl scale deployment/audio-processor --replicas=5
# Verify audio storage is accessible
gsutil ls gs://your-audio-bucket/
```
## Communication Templates
### Internal (Slack)
```
๐ด P1 INCIDENT: Speak Language Learning
Status: INVESTIGATING
Impact: Users cannot start or complete lessons
Languages affected: All
Current action: Checking Speak API status and our integration
Next update: [Time + 15 min]
Incident commander: @[name]
War room: #speak-incident-[date]
```
### External (Status Page)
```
Language Learning Service Disruption
We are experiencing issues with our language learning features.
Users may be unable to:
- Start new lessons
- Complete pronunciation exercises
- Access speech recognition
We are actively working with our service provider to resolve this issue.
Offline vocabulary practice remains available.
Last updated: [timestamp]
```
### User Notification (In-App)
```
We're experiencing technical difficulties with live lessons.
While we work on a fix, you can:
โข Practice vocabulary flashcards (offline)
โข Review previous lesson notes
โข Listen to saved audio examples
We'll notify you when full service is restored.
```
## Fallback Modes
### Enable Offline Mode
```typescript
// Fallback when Speak API is unavailable
async function getLesson(config: LessonConfig): Promise {
if (process.env.SPEAK_FALLBACK_MODE === 'true') {
return getOfflineLesson(config);
}
try {
return await speakService.startSession(config);
} catch (error) {
if (isApiUnavailable(error)) {
console.warn('Speak API unavailable, using fallback');
return getOfflineLesson(config);
}
throw error;
}
}
```
### Cached Responses
```typescript
// Serve cached content when API is slow/unavailable
async function getTutorPrompt(sessionId: string): Promise {
const cacheKey = `prompt:${sessionId}`;
// Try cache first during incidents
if (process.env.SPEAK_CACHE_ONLY === 'true') {
const cached = await cache.get(cacheKey);
if (cached) return cached;
throw new Error('Prompt not in cache during incident');
}
// Normal flow with cache fallback
try {
const prompt = await speakService.tutor.getPrompt(sessionId);
await cache.set(cacheKey, prompt, 3600);
return prompt;
} catch (error) {
const cached = await cache.get(cacheKey);
if (cached) return cached;
throw error;
}
}
```
## Post-Incident
### Evidence Collection
```bash
#!/bin/bash
# collect-speak-incident-evidence.sh
INCIDENT_DIR="speak-incident-$(date +%Y%m%d-%H%M)"
mkdir -p "$INCIDENT_DIR"
# Export logs
kubectl logs -l app=speak-integration --since=2h > "$INCIDENT_DIR/logs.txt"
# Export metrics
curl "localhost:9090/api/v1/query_range?query=speak_api_errors_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$INCIDENT_DIR/errors.json"
curl "localhost:9090/api/v1/query_range?query=speak_lessons_started_total&start=$(date -d '2 hours ago' +%s)&end=$(date +%s)&step=60" > "$INCIDENT_DIR/lessons.json"
# Capture current state
kubectl get pods -l app=speak-integration -o yaml > "$INCIDENT_DIR/pods.yaml"
kubectl get events --sort-by='.lastTimestamp' > "$INCIDENT_DIR/events.txt"
# Generate debug bundle
./speak-debug-bundle.sh
mv speak-debug-*.tar.gz "$INCIDENT_DIR/"
echo "Evidence collected in $INCIDENT_DIR"
```
### Postmortem Template
```markdown
## Incident: Speak [Error Type]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Affected Users:** N (X% of daily active learners)
### Summary
[1-2 sentence description of what happened]
### User Impact
- Lessons affected: N
- Languages affected: [list]
- Features unavailable: [list]
- Estimated learning time lost: X hours
### Timeline
- HH:MM - [Event]
- HH:MM - Alert fired
- HH:MM - Incident declared
- HH:MM - [Mitigation action]
- HH:MM - Service restored
### Root Cause
[Technical explanation]
### Contributing Factors
- [Factor 1]
- [Factor 2]
### Action Items
| Priority | Action | Owner | Due |
|----------|--------|-------|-----|
| P1 | [Preventive measure] | @name | [date] |
| P2 | [Improvement] | @name | [date] |
### Lessons Learned
- What went well: [list]
- What could be improved: [list]
```
## Output
- Issue identified and categorized
- Mitigation applied
- Stakeholders notified
- Evidence collected for postmortem
- Fallback modes enabled if needed
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't reach status page | Network issue | Use mobile or VPN |
| kubectl fails | Auth expired | Re-authenticate |
| Metrics unavailable | Prometheus down | Check backup metrics |
| Fallback not working | Cache empty | Pre-warm cache |
## Examples
### One-Line Health Check
```bash
curl -sf https://api.yourapp.com/health | jq '.services.speak.status' || echo "UNHEALTHY"
```
### Quick Fallback Toggle
```bash
# Enable fallback
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE=true
# Disable fallback (restore normal)
kubectl set env deployment/speak-integration SPEAK_FALLBACK_MODE-
```
## Resources
- [Speak Status Page](https://status.speak.com)
- [Speak Support](https://support.speak.com)
- [Internal Runbook Wiki](#)
## Next Steps
For data handling, see `speak-data-handling`.