databricks-incident-runbook
Execute Databricks incident response procedures with triage, mitigation, and postmortem. Use when responding to Databricks-related outages, investigating job failures, or running post-incident reviews for pipeline failures. Trigger with phrases like "databricks incident", "databricks outage", "databricks down", "databricks on-call", "databricks emergency", "job failed". allowed-tools: Read, Grep, Bash(databricks:*) version: 1.0.0 license: MIT author: Jeremy Longshore <jeremy@intentsolutions.io>
Allowed Tools
No tools specified
Provided by Plugin
databricks-pack
Claude Code skill pack for Databricks (24 skills)
Installation
This skill is included in the databricks-pack plugin:
/plugin install databricks-pack@claude-code-plugins-plus
Click to copy
Instructions
# Databricks Incident Runbook
## Overview
Rapid incident response procedures for Databricks-related outages.
## Prerequisites
- Access to Databricks workspace
- CLI configured with appropriate permissions
- Access to monitoring dashboards
- Communication channels (Slack, PagerDuty)
## Severity Levels
| Level | Definition | Response Time | Examples |
|-------|------------|---------------|----------|
| P1 | Production data pipeline down | < 15 min | Critical ETL failed, data not updating |
| P2 | Degraded performance | < 1 hour | Slow queries, partial failures |
| P3 | Non-critical issues | < 4 hours | Dev cluster issues, delayed non-critical jobs |
| P4 | No user impact | Next business day | Monitoring gaps, documentation |
## Quick Triage
```bash
#!/bin/bash
# quick-triage.sh - Run this first during any incident
echo "=== Databricks Quick Triage ==="
echo "Time: $(date)"
echo ""
# 1. Check Databricks status
echo "--- Databricks Status ---"
curl -s https://status.databricks.com/api/v2/status.json | jq '.status.description'
echo ""
# 2. Check workspace connectivity
echo "--- Workspace Connectivity ---"
databricks workspace list / --output json | jq -r '.[] | .path' | head -5
if [ $? -eq 0 ]; then
echo "Workspace: CONNECTED"
else
echo "Workspace: CONNECTION FAILED"
fi
echo ""
# 3. Check recent job failures
echo "--- Recent Job Failures (last 1 hour) ---"
databricks runs list --limit 20 --output json | \
jq -r '.runs[] | select(.state.result_state == "FAILED") | "\(.run_id): \(.run_name) - \(.state.state_message)"'
echo ""
# 4. Check cluster status
echo "--- Running Clusters ---"
databricks clusters list --output json | \
jq -r '.clusters[] | select(.state == "RUNNING" or .state == "ERROR") | "\(.cluster_id): \(.cluster_name) [\(.state)]"'
echo ""
# 5. Check for errors in last hour
echo "--- Recent Errors ---"
# Query system tables via SQL warehouse or notebook
```
## Decision Tree
```
Job/Pipeline failing?
โโ YES: Is it a single job or multiple?
โ โโ SINGLE JOB โ Check job-specific issues
โ โ โโ Cluster failed to start โ Check cluster events
โ โ โโ Code error โ Check task output/logs
โ โ โโ Data issue โ Check source data
โ โ โโ Permission error โ Check grants
โ โ
โ โโ MULTIPLE JOBS โ Likely infrastructure issue
โ โโ Check Databricks status page
โ โโ Check workspace quotas
โ โโ Check network connectivity
โ
โโ NO: Is it a performance issue?
โโ Slow queries โ Check query plan, cluster sizing
โโ Slow cluster startup โ Check instance availability
โโ Data freshness โ Check upstream pipelines
```
## Immediate Actions by Error Type
### Job Failed - Code Error
```bash
# 1. Get run details
RUN_ID="your-run-id"
databricks runs get --run-id $RUN_ID
# 2. Get detailed error output
databricks runs get-output --run-id $RUN_ID | jq '.error'
# 3. Check task-level errors
databricks runs get --run-id $RUN_ID | jq '.tasks[] | select(.state.result_state == "FAILED") | {task: .task_key, error: .state.state_message}'
# 4. If notebook task, get notebook output
# (View in UI or use jobs API to get cell outputs)
```
### Cluster Failed to Start
```bash
# 1. Check cluster events
CLUSTER_ID="your-cluster-id"
databricks clusters events --cluster-id $CLUSTER_ID --limit 20
# 2. Common causes and fixes
# - QUOTA_EXCEEDED: Terminate unused clusters
# - CLOUD_PROVIDER_LAUNCH_ERROR: Check instance availability
# - DRIVER_UNREACHABLE: Network/firewall issue
# 3. Quick fix - restart cluster
databricks clusters restart --cluster-id $CLUSTER_ID
# 4. Check cluster logs
databricks clusters get --cluster-id $CLUSTER_ID | jq '.termination_reason'
```
### Permission/Auth Errors
```bash
# 1. Check current user
databricks current-user me
# 2. Check job permissions
databricks permissions get jobs --job-id $JOB_ID
# 3. Check table permissions (run in notebook)
# SHOW GRANTS ON TABLE catalog.schema.table
# 4. Fix: Grant necessary permissions
databricks permissions update jobs --job-id $JOB_ID --json '{
"access_control_list": [{
"user_name": "user@company.com",
"permission_level": "CAN_MANAGE_RUN"
}]
}'
```
### Data Quality Failures
```sql
-- Quick data quality check
SELECT
COUNT(*) as total_rows,
COUNT(DISTINCT id) as unique_ids,
SUM(CASE WHEN amount IS NULL THEN 1 ELSE 0 END) as null_amounts,
MIN(created_at) as oldest_record,
MAX(created_at) as newest_record
FROM catalog.schema.table
WHERE created_at > current_timestamp() - INTERVAL 1 DAY;
-- Check for recent changes
DESCRIBE HISTORY catalog.schema.table LIMIT 10;
-- Restore to previous version if needed
RESTORE TABLE catalog.schema.table TO VERSION AS OF 5;
```
## Communication Templates
### Internal (Slack)
```
:red_circle: **P1 INCIDENT: [Brief Description]**
**Status:** INVESTIGATING
**Impact:** [Describe user/business impact]
**Started:** [Time]
**Current Action:** [What you're doing now]
**Next Update:** [Time]
**Incident Commander:** @[name]
**Thread:** [link]
```
### External (Status Page)
```
**Data Pipeline Delay**
We are experiencing delays in data processing. Some reports may show stale data.
**Impact:** Dashboard data may be up to [X] hours delayed
**Started:** [Time] UTC
**Current Status:** Our team is actively investigating
We will provide updates every 30 minutes.
Last updated: [Timestamp]
```
## Post-Incident
### Evidence Collection
```bash
#!/bin/bash
# collect-incident-evidence.sh
INCIDENT_ID=$1
RUN_ID=$2
CLUSTER_ID=$3
mkdir -p "incident-$INCIDENT_ID"
# Job run details
databricks runs get --run-id $RUN_ID > "incident-$INCIDENT_ID/run_details.json"
databricks runs get-output --run-id $RUN_ID > "incident-$INCIDENT_ID/run_output.json"
# Cluster info
if [ -n "$CLUSTER_ID" ]; then
databricks clusters get --cluster-id $CLUSTER_ID > "incident-$INCIDENT_ID/cluster_info.json"
databricks clusters events --cluster-id $CLUSTER_ID --limit 50 > "incident-$INCIDENT_ID/cluster_events.json"
fi
# Create summary
cat << EOF > "incident-$INCIDENT_ID/summary.md"
# Incident $INCIDENT_ID
**Date:** $(date)
**Run ID:** $RUN_ID
**Cluster ID:** $CLUSTER_ID
## Evidence Collected
- run_details.json
- run_output.json
- cluster_info.json
- cluster_events.json
EOF
tar -czf "incident-$INCIDENT_ID.tar.gz" "incident-$INCIDENT_ID"
echo "Evidence collected: incident-$INCIDENT_ID.tar.gz"
```
### Postmortem Template
```markdown
## Incident: [Title]
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** P[1-4]
**Incident Commander:** [Name]
### Summary
[1-2 sentence description of what happened]
### Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | [First alert/detection] |
| HH:MM | [Investigation started] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Incident resolved] |
### Root Cause
[Technical explanation of what went wrong]
### Impact
- **Data Impact:** [Tables affected, rows impacted]
- **Users Affected:** [Number, types]
- **Duration:** [How long data was unavailable/stale]
- **Financial Impact:** [If applicable]
### Detection
- **How detected:** [Alert, user report, monitoring]
- **Time to detect:** [Minutes from issue start]
- **Detection gap:** [What could have caught this sooner]
### Response
- **Time to respond:** [Minutes from detection]
- **What worked:** [Effective response actions]
- **What didn't:** [Ineffective actions, dead ends]
### Action Items
| Priority | Action | Owner | Due Date |
|----------|--------|-------|----------|
| P1 | [Preventive measure] | [Name] | [Date] |
| P2 | [Monitoring improvement] | [Name] | [Date] |
| P3 | [Documentation update] | [Name] | [Date] |
### Lessons Learned
1. [Key learning 1]
2. [Key learning 2]
3. [Key learning 3]
```
## Instructions
### Step 1: Quick Triage
Run the triage script to identify the issue source.
### Step 2: Follow Decision Tree
Determine if the issue is Databricks-side or code/data issue.
### Step 3: Execute Immediate Actions
Apply the appropriate remediation for the error type.
### Step 4: Communicate Status
Update internal and external stakeholders.
### Step 5: Collect Evidence
Document everything for postmortem.
## Output
- Issue identified and categorized
- Remediation applied
- Stakeholders notified
- Evidence collected for postmortem
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Can't access workspace | Token expired | Re-authenticate |
| CLI commands fail | Network issue | Check VPN |
| Logs unavailable | Cluster terminated | Check cluster events |
| Restore fails | Retention exceeded | Check vacuum settings |
## Examples
### One-Line Job Health Check
```bash
databricks runs list --job-id $JOB_ID --limit 5 --output json | \
jq -r '.runs[] | "\(.start_time): \(.state.result_state)"'
```
### Quick Cluster Restart
```bash
databricks clusters restart --cluster-id $CLUSTER_ID && \
echo "Cluster restart initiated"
```
## Resources
- [Databricks Status Page](https://status.databricks.com)
- [Databricks Support](https://help.databricks.com)
- [Community Forum](https://community.databricks.com)
## Next Steps
For data handling, see `databricks-data-handling`.