# databricks-prod-checklist

Execute the Databricks production deployment checklist and rollback procedures. Use when deploying Databricks jobs to production, preparing for launch, or implementing go-live procedures. Trigger with phrases like "databricks production", "deploy databricks", "databricks go-live", or "databricks launch checklist".

allowed-tools: Read, Bash(databricks:*), Grep
version: 1.0.0
license: MIT
author: Jeremy Longshore <jeremy@intentsolutions.io>
## Allowed Tools
Read, Bash(databricks:*), Grep
## Provided by Plugin
databricks-pack: Claude Code skill pack for Databricks (24 skills)
## Installation
This skill is included in the databricks-pack plugin:
```
/plugin install databricks-pack@claude-code-plugins-plus
```
## Instructions
# Databricks Production Checklist
## Overview
Complete checklist for deploying Databricks jobs and pipelines to production.
## Prerequisites
- Staging environment tested and verified
- Production workspace access
- Unity Catalog configured
- Monitoring and alerting ready
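A quick way to confirm these prerequisites from the shell is sketched below, assuming the unified Databricks CLI (v0.200+) is installed and already authenticated against the production workspace; the `/Repos/prod` path mirrors the one used later in this checklist.

```bash
# Confirm the CLI is authenticated as the expected identity (service principal or user)
databricks current-user me

# Confirm Unity Catalog is reachable by listing catalogs
databricks catalogs list

# Confirm access to the production repo path used by the jobs below
databricks workspace list /Repos/prod
```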
## Instructions
### Step 1: Pre-Deployment Configuration
#### Security
- [ ] Service principal configured for automation
- [ ] Secrets in Databricks Secret Scopes, not env vars (see the sketch after this checklist)
- [ ] Token expiration set (max 90 days)
- [ ] Unity Catalog permissions configured
- [ ] Cluster policies enforced
- [ ] IP access lists configured
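A minimal sketch of the secret-scope and service principal checks, assuming the unified Databricks CLI; the scope name `prod-etl`, the key `warehouse-password`, and the principal name `etl-automation` are placeholders, not values defined by this pack.

```bash
# Create a production secret scope and store a credential in it (names are placeholders)
databricks secrets create-scope prod-etl
databricks secrets put-secret prod-etl warehouse-password --string-value "$WAREHOUSE_PASSWORD"

# Confirm the scope exists and review who can read it
databricks secrets list-scopes
databricks secrets list-acls prod-etl

# Confirm the automation service principal is registered in the workspace
databricks service-principals list | grep etl-automation
```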
#### Infrastructure
- [ ] Production cluster pool configured (verification sketch after this list)
- [ ] Instance types validated for workload
- [ ] Autoscaling configured appropriately
- [ ] Spot instance ratio set (cost vs reliability)
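The pool and policy checks can be scripted as below; this is only a sketch, and it assumes the production pool and policy names contain `prod`, which is a naming convention rather than anything this pack enforces.

```bash
# Confirm the production instance pool exists and note its pool ID for the job config
databricks instance-pools list | grep -i prod

# Confirm a cluster policy exists to enforce instance types and autoscale limits
databricks cluster-policies list | grep -i prod
```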
### Step 2: Code Quality Verification
#### Testing
- [ ] Unit tests passing
- [ ] Integration tests on staging data
- [ ] Data quality tests defined
- [ ] Performance benchmarks met
```bash
# Run tests via Asset Bundles
databricks bundle validate -t prod
databricks bundle run -t staging test-job
# Verify test results (RUN_ID comes from the bundle run output or the Jobs UI)
databricks runs get --run-id $RUN_ID | jq '.state.result_state'
```
#### Code Review
- [ ] No hardcoded credentials (see the scan sketch after this checklist)
- [ ] Error handling covers all failure modes
- [ ] Logging is production-appropriate
- [ ] Delta Lake best practices followed
- [ ] No `collect()` on large datasets
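A lightweight static scan can back up the first and last items above. This is a sketch only; the `src/` path is an assumption, so point it at wherever the pipeline notebooks and modules live.

```bash
# Flag likely hardcoded credentials (prints matches with file and line number)
grep -rnEi '(password|secret|token)[[:space:]]*=' src/ && echo "REVIEW: possible hardcoded credentials"

# Flag collect() calls that can pull large datasets onto the driver
grep -rn '\.collect()' src/ && echo "REVIEW: collect() usage found"
```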
### Step 3: Job Configuration
```yaml
# resources/prod_job.yml
resources:
  jobs:
    etl_pipeline:
      name: "prod-etl-pipeline"
      tags:
        environment: production
        team: data-engineering
        cost_center: analytics
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "America/New_York"
      email_notifications:
        on_failure:
          - "oncall@company.com"
        on_success:
          - "data-team@company.com"
      webhook_notifications:
        on_failure:
          - id: "slack-webhook-id"
      max_concurrent_runs: 1
      timeout_seconds: 14400  # 4 hours
      tasks:
        - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: /Repos/prod/pipelines/bronze
          timeout_seconds: 3600
        - task_key: silver_transform
          depends_on:
            - task_key: bronze_ingest
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: /Repos/prod/pipelines/silver
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "14.3.x-scala2.12"
            node_type_id: "Standard_DS3_v2"
            autoscale:            # autoscale and num_workers are mutually exclusive
              min_workers: 2
              max_workers: 8
            spark_conf:
              spark.sql.shuffle.partitions: "200"
              spark.databricks.delta.optimizeWrite.enabled: "true"
            # Optional: draw nodes from a pre-warmed pool; instance_pool_id replaces node_type_id
            # instance_pool_id: "prod-pool-id"
```
### Step 4: Deployment Commands
```bash
# Pre-flight checks
echo "=== Pre-flight Checks ==="
databricks workspace list /Repos/prod/ # Verify repo exists
databricks clusters list | grep prod # Verify pools/clusters
databricks secrets list-scopes # Verify secrets
# Deploy with Asset Bundles
echo "=== Deploying ==="
databricks bundle deploy -t prod
# Verify deployment
databricks bundle summary -t prod
databricks jobs list | grep prod-etl
# Manual trigger to verify (JOB_ID from the `databricks jobs list` output above)
echo "=== Verification Run ==="
RUN_ID=$(databricks jobs run-now --job-id $JOB_ID | jq -r '.run_id')
echo "Run ID: $RUN_ID"
# Monitor run: poll until it reaches a terminal lifecycle state
while true; do
  STATE=$(databricks runs get --run-id $RUN_ID | jq -r '.state.life_cycle_state')
  echo "Run $RUN_ID state: $STATE"
  case "$STATE" in TERMINATED|SKIPPED|INTERNAL_ERROR) break ;; esac
  sleep 30
done
databricks runs get --run-id $RUN_ID | jq -r '.state.result_state'
```
### Step 5: Monitoring Setup
```python
# monitoring/health_check.py
from datetime import datetime
from itertools import islice

from databricks.sdk import WorkspaceClient


def check_job_health(w: WorkspaceClient, job_id: int) -> dict:
    """Check job health metrics over the 10 most recent completed runs."""
    # list_runs returns a paginated iterator, newest first; take the 10 most recent
    runs = list(islice(w.jobs.list_runs(job_id=job_id, completed_only=True), 10))
    if not runs:
        return {"status": "NO_RUNS", "healthy": False}

    # Calculate success rate (result_state is an enum; compare its value)
    successful = sum(
        1 for r in runs
        if r.state and r.state.result_state and r.state.result_state.value == "SUCCESS"
    )
    success_rate = successful / len(runs)

    # Calculate average duration in minutes (timestamps are epoch milliseconds)
    durations = [
        (r.end_time - r.start_time) / 1000 / 60
        for r in runs
        if r.end_time and r.start_time
    ]
    avg_duration = sum(durations) / len(durations) if durations else 0

    # Check the most recent run
    last_run = runs[0]
    last_state = (
        last_run.state.result_state.value
        if last_run.state and last_run.state.result_state
        else None
    )

    return {
        "status": "HEALTHY" if success_rate > 0.9 else "DEGRADED",
        "healthy": success_rate > 0.9 and last_state == "SUCCESS",
        "success_rate": success_rate,
        "avg_duration_minutes": avg_duration,
        "last_run_state": last_state,
        "last_run_time": datetime.fromtimestamp(last_run.start_time / 1000),
    }
```
### Step 6: Rollback Procedure
```bash
#!/bin/bash
# rollback.sh - Emergency rollback procedure
JOB_ID=$1
PREVIOUS_VERSION=$2
echo "=== ROLLBACK INITIATED ==="
echo "Job: $JOB_ID"
echo "Target Version: $PREVIOUS_VERSION"
# 1. Pause the job schedule (keep the cron and timezone from the job config, flip pause_status)
echo "Pausing job schedule..."
databricks jobs update --job-id $JOB_ID --json '{"new_settings": {"schedule": {"quartz_cron_expression": "0 0 6 * * ?", "timezone_id": "America/New_York", "pause_status": "PAUSED"}}}'
# 2. Cancel active runs
echo "Cancelling active runs..."
databricks runs list --job-id $JOB_ID --active-only --output JSON | \
  jq -r '.runs[].run_id' | \
  xargs -I {} databricks runs cancel --run-id {}
# 3. Redeploy the previous bundle version
echo "Rolling back to version $PREVIOUS_VERSION..."
git checkout "$PREVIOUS_VERSION"  # check out the last known-good bundle source
databricks bundle deploy -t prod --force
# 4. Re-enable schedule
echo "Re-enabling schedule..."
# (the bundle deploy above restores the job settings; unpause the schedule here if step 1 paused it)
# 5. Trigger verification run
echo "Triggering verification run..."
databricks jobs run-now --job-id $JOB_ID
echo "=== ROLLBACK COMPLETE ==="
```
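A hypothetical invocation, where both the job ID and the version tag are placeholders:

```bash
# Roll job 123456789 back to the last known-good bundle version
./rollback.sh 123456789 v1.4.2
```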
## Output
- Deployed production job
- Health checks passing
- Monitoring active
- Rollback procedure documented
## Error Handling
| Alert | Condition | Severity |
|-------|-----------|----------|
| Job Failed | `result_state = FAILED` | P1 |
| Long Running | Duration > 2x average | P2 |
| Consecutive Failures | 3+ failures in a row | P1 |
| Data Quality | Expectations failed | P2 |
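As a sketch of how the consecutive-failures alert could be wired up with the CLI and jq: the `jobs list-runs` flags mirror the Jobs ListRuns API, `JOB_ID` is supplied by the caller, and the paging hand-off is left as a placeholder.

```bash
# Count failures among the 3 most recent completed runs; 3/3 failed meets the P1 condition
FAILS=$(databricks jobs list-runs --job-id "$JOB_ID" --completed-only --output json \
  | jq '[(.runs // .)[:3][] | select(.state.result_state == "FAILED")] | length')

if [ "$FAILS" -ge 3 ]; then
  echo "P1: job $JOB_ID has 3+ consecutive failures"  # hand off to your paging tool here
fi
```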
## Examples
### Production Health Dashboard Query
```sql
-- Job health metrics from Unity Catalog system tables.
-- job_run_timeline can emit several rows per run (result_state is set only on the
-- terminal row), so aggregate to one row per run first. Join system.lakeflow.jobs
-- on workspace_id and job_id if the job name is also needed.
WITH runs AS (
  SELECT
    job_id,
    run_id,
    MIN(period_start_time) AS start_time,
    MAX(period_end_time) AS end_time,
    MAX(result_state) AS result_state
  FROM system.lakeflow.job_run_timeline
  WHERE period_start_time > current_timestamp() - INTERVAL 7 DAYS
  GROUP BY job_id, run_id
)
SELECT
  job_id,
  COUNT(*) AS total_runs,
  SUM(CASE WHEN result_state = 'SUCCEEDED' THEN 1 ELSE 0 END) AS successes,
  AVG(timestampdiff(SECOND, start_time, end_time)) / 60 AS avg_minutes,
  MAX(start_time) AS last_run
FROM runs
GROUP BY job_id
ORDER BY total_runs DESC
```
### Pre-Production Verification
```bash
# Comprehensive pre-prod check: validate the bundle, then review what would be deployed
databricks bundle validate -t prod && \
databricks bundle summary -t prod && \
echo "Validation passed, ready to deploy"
```
## Resources
- [Databricks Production Best Practices](https://docs.databricks.com/dev-tools/bundles/best-practices.html)
- [Job Configuration](https://docs.databricks.com/workflows/jobs/jobs.html)
- [Monitoring Guide](https://docs.databricks.com/administration-guide/workspace/monitoring.html)
## Next Steps
For version upgrades, see `databricks-upgrade-migration`.