databricks-reference-architecture
Implement a Databricks reference architecture with a best-practice project layout. Use when designing new Databricks projects, reviewing architecture, or establishing standards for Databricks applications. Trigger with phrases like "databricks architecture", "databricks best practices", "databricks project structure", "how to organize databricks", or "databricks layout".
Version: 1.0.0 | License: MIT | Author: Jeremy Longshore <jeremy@intentsolutions.io>
Allowed Tools
Read, Grep
Provided by Plugin
databricks-pack
Claude Code skill pack for Databricks (24 skills)
Installation
This skill is included in the databricks-pack plugin:
/plugin install databricks-pack@claude-code-plugins-plus
Instructions
# Databricks Reference Architecture
## Overview
Production-ready architecture patterns for Databricks data platforms.
## Prerequisites
- Understanding of Databricks components
- Unity Catalog configured
- Asset Bundles knowledge
- CI/CD pipeline setup
## Project Structure
```
databricks-platform/
├── databricks.yml                  # Main Asset Bundle config
├── bundles/
│   ├── base.yml                    # Shared configurations
│   ├── dev.yml                     # Dev environment overrides
│   ├── staging.yml                 # Staging overrides
│   └── prod.yml                    # Production overrides
├── src/
│   ├── pipelines/                  # Data pipelines
│   │   ├── bronze/
│   │   │   ├── ingest_orders.py
│   │   │   ├── ingest_customers.py
│   │   │   └── ingest_products.py
│   │   ├── silver/
│   │   │   ├── clean_orders.py
│   │   │   ├── dedupe_customers.py
│   │   │   └── enrich_products.py
│   │   └── gold/
│   │       ├── agg_daily_sales.py
│   │       ├── agg_customer_360.py
│   │       └── agg_product_metrics.py
│   ├── ml/
│   │   ├── features/
│   │   │   └── customer_features.py
│   │   ├── training/
│   │   │   └── churn_model.py
│   │   └── serving/
│   │       └── model_inference.py
│   ├── utils/
│   │   ├── __init__.py
│   │   ├── delta.py                # Delta Lake utilities
│   │   ├── quality.py              # Data quality checks
│   │   └── logging.py              # Structured logging
│   └── notebooks/                  # Development notebooks
│       ├── exploration/
│       └── adhoc/
├── resources/
│   ├── jobs/
│   │   ├── etl_jobs.yml
│   │   ├── ml_jobs.yml
│   │   └── quality_jobs.yml
│   ├── dlt/
│   │   └── sales_pipeline.yml
│   ├── clusters/
│   │   └── cluster_policies.yml
│   └── sql/
│       └── dashboards.yml
├── tests/
│   ├── unit/
│   │   ├── test_delta.py
│   │   └── test_quality.py
│   ├── integration/
│   │   ├── test_bronze_ingest.py
│   │   └── test_ml_pipeline.py
│   └── fixtures/
│       └── sample_data/
├── docs/
│   ├── architecture.md
│   ├── runbooks/
│   └── api/
├── infrastructure/
│   ├── terraform/
│   │   ├── main.tf
│   │   ├── unity_catalog.tf
│   │   └── workspaces.tf
│   └── scripts/
│       └── bootstrap.sh
├── .github/
│   └── workflows/
│       ├── ci.yml
│       ├── deploy-staging.yml
│       └── deploy-prod.yml
├── pyproject.toml
└── README.md
```
## Layered Architecture
```
┌───────────────────────────────────────────────────────────────┐
│                      Presentation Layer                       │
│           (Dashboards, Reports, APIs, Applications)           │
├───────────────────────────────────────────────────────────────┤
│                         Serving Layer                         │
│            (SQL Warehouses, Model Endpoints, APIs)            │
├───────────────────────────────────────────────────────────────┤
│                          Gold Layer                           │
│      (Business Aggregations, Features, Curated Datasets)      │
├───────────────────────────────────────────────────────────────┤
│                         Silver Layer                          │
│           (Cleansed, Conformed, Deduplicated Data)            │
├───────────────────────────────────────────────────────────────┤
│                         Bronze Layer                          │
│              (Raw Data with Metadata, Immutable)              │
├───────────────────────────────────────────────────────────────┤
│                        Ingestion Layer                        │
│        (Auto Loader, DLT, Streaming, Batch Connectors)        │
├───────────────────────────────────────────────────────────────┤
│                         Source Systems                        │
│    (Databases, Files, APIs, Streaming, SaaS Applications)     │
└───────────────────────────────────────────────────────────────┘
```
## Unity Catalog Structure
```sql
-- Catalog hierarchy for data platform
CREATE CATALOG IF NOT EXISTS data_platform;
-- Medallion layer and ML schemas
CREATE SCHEMA IF NOT EXISTS data_platform.bronze;
CREATE SCHEMA IF NOT EXISTS data_platform.silver;
CREATE SCHEMA IF NOT EXISTS data_platform.gold;
CREATE SCHEMA IF NOT EXISTS data_platform.ml_features;
CREATE SCHEMA IF NOT EXISTS data_platform.ml_models;
-- Shared reference data
CREATE SCHEMA IF NOT EXISTS data_platform.reference;
-- Governance setup
GRANT USAGE ON CATALOG data_platform TO `data-engineers`;
GRANT SELECT ON SCHEMA data_platform.gold TO `data-analysts`;
GRANT ALL PRIVILEGES ON SCHEMA data_platform.ml_features TO `data-scientists`;
```
## Key Components
### Step 1: Delta Lake Configuration
```python
# src/utils/delta.py
from pyspark.sql import DataFrame, SparkSession


class DeltaConfig:
    """Centralized Delta Lake configuration."""

    # Table properties applied to every table
    DEFAULT_PROPERTIES = {
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true",
        "delta.deletionVectors.enabled": "true",
        "delta.enableChangeDataFeed": "true",
    }

    # Default retention for time travel
    RETENTION_HOURS = 168  # 7 days

    @staticmethod
    def create_table_with_defaults(
        spark: SparkSession,
        df: DataFrame,
        table_name: str,
        partition_by: list[str] | None = None,
        cluster_by: list[str] | None = None,
    ) -> None:
        """Create a Delta table with standard configurations."""
        writer = df.write.format("delta")

        # Prefer liquid clustering for new tables; otherwise fall back to partitioning
        if cluster_by:
            writer = writer.clusterBy(*cluster_by)  # liquid clustering, requires DBR 14.2+
        elif partition_by:
            writer = writer.partitionBy(*partition_by)

        # Write the table
        writer.saveAsTable(table_name)

        # Apply standard table properties
        for key, value in DeltaConfig.DEFAULT_PROPERTIES.items():
            spark.sql(
                f"ALTER TABLE {table_name} SET TBLPROPERTIES ('{key}' = '{value}')"
            )
```
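A minimal usage sketch for the helper above, as it might be called from a Silver pipeline. The catalog, table, and column names, and the `utils.delta` import path, are illustrative rather than part of the skill:

```python
# Hypothetical usage: publish a Silver table with liquid clustering and default properties.
from pyspark.sql import SparkSession

from utils.delta import DeltaConfig  # import path depends on how src/ is packaged

spark = SparkSession.builder.getOrCreate()

orders_df = spark.table("data_platform.bronze.orders")  # example source table

DeltaConfig.create_table_with_defaults(
    spark=spark,
    df=orders_df,
    table_name="data_platform.silver.orders",
    cluster_by=["order_date", "customer_id"],  # example clustering keys
)
```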
### Step 2: Data Quality Framework
```python
# src/utils/quality.py
from dataclasses import dataclass
from typing import Callable

from pyspark.sql import DataFrame


@dataclass
class QualityCheck:
    """Data quality check definition."""

    name: str
    check_fn: Callable[[DataFrame], bool]
    severity: str = "ERROR"  # ERROR, WARNING, INFO


class QualityFramework:
    """Centralized data quality checks."""

    def __init__(self, spark):
        self.spark = spark
        self.checks: list[QualityCheck] = []
        self.results: list[dict] = []

    def add_check(self, check: QualityCheck) -> None:
        """Register a quality check."""
        self.checks.append(check)

    def run_checks(self, df: DataFrame, table_name: str) -> bool:
        """Run all registered checks; return False if any ERROR-severity check fails."""
        all_passed = True
        for check in self.checks:
            try:
                passed = check.check_fn(df)
                self.results.append({
                    "table": table_name,
                    "check": check.name,
                    "passed": passed,
                    "severity": check.severity,
                })
                if not passed and check.severity == "ERROR":
                    all_passed = False
            except Exception as e:
                self.results.append({
                    "table": table_name,
                    "check": check.name,
                    "passed": False,
                    "severity": check.severity,
                    "error": str(e),
                })
                all_passed = False
        return all_passed


# Standard checks
def not_null_check(column: str) -> QualityCheck:
    return QualityCheck(
        name=f"not_null_{column}",
        check_fn=lambda df: df.filter(f"{column} IS NULL").count() == 0,
    )


def unique_check(columns: list[str]) -> QualityCheck:
    cols = "_".join(columns)
    return QualityCheck(
        name=f"unique_{cols}",
        check_fn=lambda df: df.groupBy(columns).count().filter("count > 1").count() == 0,
    )


def range_check(column: str, min_val, max_val) -> QualityCheck:
    return QualityCheck(
        name=f"range_{column}",
        check_fn=lambda df: df.filter(
            f"{column} < {min_val} OR {column} > {max_val}"
        ).count() == 0,
    )
```
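A short sketch of how the framework might be wired into a pipeline. The table and column names are placeholders, and failing the run on an ERROR-severity check is one possible policy, not the only one:

```python
# Hypothetical usage: validate a Silver table before publishing downstream.
from pyspark.sql import SparkSession

from utils.quality import (  # import path depends on how src/ is packaged
    QualityFramework,
    not_null_check,
    range_check,
    unique_check,
)

spark = SparkSession.builder.getOrCreate()
orders_df = spark.table("data_platform.silver.orders")  # example table

qf = QualityFramework(spark)
qf.add_check(not_null_check("order_id"))
qf.add_check(unique_check(["order_id"]))
qf.add_check(range_check("order_amount", 0, 1_000_000))

if not qf.run_checks(orders_df, "data_platform.silver.orders"):
    # Fail the job so orchestration and alerting can react
    raise RuntimeError(f"Quality checks failed: {qf.results}")
```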
### Step 3: Job Template
```yaml
# resources/jobs/etl_job_template.yml
resources:
  jobs:
    ${job_name}:
      name: "${bundle.name}-${job_name}-${bundle.target}"
      tags:
        domain: ${domain}
        tier: ${tier}
        owner: ${owner}
        environment: ${bundle.target}
      schedule:
        quartz_cron_expression: ${schedule}
        timezone_id: "UTC"
      max_concurrent_runs: 1
      timeout_seconds: ${timeout_seconds}
      queue:
        enabled: true
      email_notifications:
        on_failure: ${alert_emails}
      health:
        rules:
          - metric: RUN_DURATION_SECONDS
            op: GREATER_THAN
            value: ${duration_threshold}
      tasks:
        - task_key: ${task_name}
          job_cluster_key: ${cluster_key}
          libraries:
            - whl: ../artifacts/data_platform/*.whl
          notebook_task:
            notebook_path: ${notebook_path}
            base_parameters:
              catalog: ${var.catalog}
              schema: ${var.schema}
              run_date: "{{job.parameters.run_date}}"
```
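For reference, a hypothetical job definition with the template's placeholders filled in. Every concrete value (job name, cron, cluster spec, paths) is an example to adapt, not a requirement:

```yaml
# resources/jobs/etl_jobs.yml -- example values only
resources:
  jobs:
    bronze_orders_ingest:
      name: "${bundle.name}-bronze_orders_ingest-${bundle.target}"
      tags:
        domain: sales
        tier: bronze
        environment: ${bundle.target}
      schedule:
        quartz_cron_expression: "0 0 2 * * ?"   # daily at 02:00 UTC
        timezone_id: "UTC"
      max_concurrent_runs: 1
      timeout_seconds: 3600
      job_clusters:
        - job_cluster_key: etl_cluster
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: "i3.xlarge"           # cloud-specific; adjust for your provider
            num_workers: 2
      tasks:
        - task_key: ingest_orders
          job_cluster_key: etl_cluster
          notebook_task:
            notebook_path: ../src/pipelines/bronze/ingest_orders.py
            base_parameters:
              catalog: ${var.catalog}
              schema: bronze
```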
### Step 4: Monitoring Dashboard
```sql
-- Create monitoring views for operational dashboard
-- Job health summary
CREATE OR REPLACE VIEW data_platform.monitoring.job_health AS
SELECT
job_name,
DATE(start_time) as run_date,
COUNT(*) as total_runs,
SUM(CASE WHEN result_state = 'SUCCESS' THEN 1 ELSE 0 END) as successes,
SUM(CASE WHEN result_state = 'FAILED' THEN 1 ELSE 0 END) as failures,
AVG(duration) / 60000 as avg_duration_minutes,
MAX(duration) / 60000 as max_duration_minutes
FROM system.lakeflow.job_run_timeline
WHERE start_time > current_timestamp() - INTERVAL 7 DAYS
GROUP BY job_name, DATE(start_time);
-- Data freshness tracking
CREATE OR REPLACE VIEW data_platform.monitoring.data_freshness AS
SELECT
table_catalog,
table_schema,
table_name,
MAX(commit_timestamp) as last_update,
TIMESTAMPDIFF(HOUR, MAX(commit_timestamp), current_timestamp()) as hours_since_update
FROM system.information_schema.table_history
GROUP BY table_catalog, table_schema, table_name;
```
## Data Flow Diagram
```
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│    Sources    │─────▶│   Ingestion   │─────▶│    Bronze     │
│  (S3, JDBC)   │      │ (Auto Loader, │      │  (Raw Data)   │
└───────────────┘      │   DLT, APIs)  │      └───────┬───────┘
                       └───────────────┘              │
                                                      ▼
                       ┌───────────────┐      ┌───────────────┐
                       │    Silver     │◀─────│   Transform   │
                       │  (Cleansed)   │      │    (Spark)    │
                       └───────┬───────┘      └───────────────┘
                               │
                               ▼
┌───────────────┐      ┌───────────────┐      ┌───────────────┐
│     Serve     │◀─────│     Gold      │◀─────│   Aggregate   │
│  (Warehouse,  │      │  (Analytics)  │      │    (Spark)    │
│     APIs)     │      └───────────────┘      └───────────────┘
└───────────────┘
```
## Instructions
### Step 1: Initialize Project
```bash
# Clone template
git clone https://github.com/databricks/bundle-examples.git
cd bundle-examples/default-python
# Customize for your project
mv databricks.yml.template databricks.yml
```
### Step 2: Configure Environments
Set up dev, staging, and prod targets in `databricks.yml`.
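A minimal sketch of what those targets might look like; the bundle name, workspace hosts, and paths are placeholders to adapt. Targets can equally be split out into the `bundles/*.yml` override files shown in the project structure:

```yaml
# databricks.yml -- minimal sketch; hosts and paths are placeholders
bundle:
  name: data_platform

include:
  - bundles/*.yml
  - resources/*/*.yml

variables:
  catalog:
    default: data_platform

targets:
  dev:
    mode: development
    default: true
    workspace:
      host: https://<dev-workspace>.cloud.databricks.com
  staging:
    mode: production
    workspace:
      host: https://<staging-workspace>.cloud.databricks.com
      root_path: /Shared/.bundle/staging/${bundle.name}
  prod:
    mode: production
    workspace:
      host: https://<prod-workspace>.cloud.databricks.com
      root_path: /Shared/.bundle/prod/${bundle.name}
```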
### Step 3: Implement Pipelines
Create Bronze, Silver, and Gold pipelines following the medallion architecture.
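As a starting point, a hedged sketch of a Bronze ingestion step using Auto Loader; the storage paths, catalog, and metadata column names are placeholders:

```python
# src/pipelines/bronze/ingest_orders.py -- illustrative sketch, not a prescribed implementation
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

RAW_PATH = "s3://example-bucket/raw/orders/"             # placeholder source location
CHECKPOINT = "s3://example-bucket/_checkpoints/orders"   # placeholder checkpoint/schema location
TARGET = "data_platform.bronze.orders"

(
    spark.readStream.format("cloudFiles")                # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", CHECKPOINT)
    .load(RAW_PATH)
    # Bronze keeps the raw payload and adds ingestion metadata
    .withColumn("_ingested_at", F.current_timestamp())
    .withColumn("_source_file", F.col("_metadata.file_path"))
    .writeStream.option("checkpointLocation", CHECKPOINT)
    .trigger(availableNow=True)                          # incremental, batch-style run
    .toTable(TARGET)
)
```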
### Step 4: Add Quality Checks
Integrate the data quality framework into each layer.
## Output
- Structured project layout
- Medallion architecture implemented
- Data quality framework integrated
- Monitoring dashboards ready
## Error Handling
| Issue | Cause | Solution |
|-------|-------|----------|
| Circular dependencies | Wrong layering | Review medallion flow |
| Config not loading | Wrong paths | Verify bundle includes |
| Quality check failures | Bad data | Add data quarantine (see sketch below) |
| Schema drift | Source changes | Enable schema evolution |
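For the quarantine remedy above, one possible pattern is to route failing rows to a quarantine table instead of failing the whole batch. A hedged sketch with placeholder table names and an illustrative rule:

```python
# Quarantine pattern sketch -- table names and the rule are illustrative
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.table("data_platform.bronze.orders")  # example input

# Rows violating the rule are quarantined; clean rows continue to Silver
rule = F.col("order_id").isNotNull() & (F.col("order_amount") >= 0)

(
    df.filter(~rule)
    .withColumn("_quarantined_at", F.current_timestamp())
    .write.format("delta").mode("append")
    .saveAsTable("data_platform.monitoring.quarantine_orders")
)

(
    df.filter(rule)
    .write.format("delta").mode("overwrite")
    .saveAsTable("data_platform.silver.orders")
)
```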
## Examples
### Quick Setup Script
```bash
#!/bin/bash
# Initialize reference architecture
mkdir -p src/{pipelines/{bronze,silver,gold},ml,utils,notebooks}
mkdir -p resources/{jobs,dlt,clusters}
mkdir -p tests/{unit,integration}
mkdir -p infrastructure/terraform
mkdir -p docs/runbooks
# Create initial files
touch src/utils/{__init__,delta,quality,logging}.py
touch resources/jobs/etl_jobs.yml
touch databricks.yml
echo "Reference architecture initialized!"
```
## Resources
- [Databricks Architecture](https://docs.databricks.com/lakehouse-architecture/index.html)
- [Medallion Architecture](https://www.databricks.com/glossary/medallion-architecture)
- [Asset Bundles](https://docs.databricks.com/dev-tools/bundles/index.html)
- [Unity Catalog Best Practices](https://docs.databricks.com/data-governance/unity-catalog/best-practices.html)
## Flagship Skills
For multi-environment setup, see `databricks-multi-env-setup`.