- mkdocs.yml: Material theme with dark/light mode, search, tabs, sitemap - scripts/generate-docs.py: auto-generates docs from all SKILL.md files - docs/index.md: landing page with domain overview and quick install - docs/getting-started.md: installation guide for Claude Code, Codex, OpenClaw - docs/skills/: 170 skill pages + 9 domain index pages - .github/workflows/static.yml: MkDocs build + GitHub Pages deploy - .gitignore: exclude site/ build output Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1000 lines
24 KiB
Markdown
1000 lines
24 KiB
Markdown
---
|
|
title: "Senior Data Engineer"
|
|
description: "Senior Data Engineer - Claude Code skill from the Engineering - Core domain."
|
|
---
|
|
|
|
# Senior Data Engineer
|
|
|
|
**Domain:** Engineering - Core | **Skill:** `senior-data-engineer` | **Source:** [`engineering-team/senior-data-engineer/SKILL.md`](https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/senior-data-engineer/SKILL.md)
|
|
|
|
---
|
|
|
|
|
|
# Senior Data Engineer
|
|
|
|
Production-grade data engineering skill for building scalable, reliable data systems.
|
|
|
|
## Table of Contents
|
|
|
|
1. [Trigger Phrases](#trigger-phrases)
|
|
2. [Quick Start](#quick-start)
|
|
3. [Workflows](#workflows)
|
|
- [Building a Batch ETL Pipeline](#workflow-1-building-a-batch-etl-pipeline)
|
|
- [Implementing Real-Time Streaming](#workflow-2-implementing-real-time-streaming)
|
|
- [Data Quality Framework Setup](#workflow-3-data-quality-framework-setup)
|
|
4. [Architecture Decision Framework](#architecture-decision-framework)
|
|
5. [Tech Stack](#tech-stack)
|
|
6. [Reference Documentation](#reference-documentation)
|
|
7. [Troubleshooting](#troubleshooting)
|
|
|
|
---
|
|
|
|
## Trigger Phrases
|
|
|
|
Activate this skill when you see:
|
|
|
|
**Pipeline Design:**
|
|
- "Design a data pipeline for..."
|
|
- "Build an ETL/ELT process..."
|
|
- "How should I ingest data from..."
|
|
- "Set up data extraction from..."
|
|
|
|
**Architecture:**
|
|
- "Should I use batch or streaming?"
|
|
- "Lambda vs Kappa architecture"
|
|
- "How to handle late-arriving data"
|
|
- "Design a data lakehouse"
|
|
|
|
**Data Modeling:**
|
|
- "Create a dimensional model..."
|
|
- "Star schema vs snowflake"
|
|
- "Implement slowly changing dimensions"
|
|
- "Design a data vault"
|
|
|
|
**Data Quality:**
|
|
- "Add data validation to..."
|
|
- "Set up data quality checks"
|
|
- "Monitor data freshness"
|
|
- "Implement data contracts"
|
|
|
|
**Performance:**
|
|
- "Optimize this Spark job"
|
|
- "Query is running slow"
|
|
- "Reduce pipeline execution time"
|
|
- "Tune Airflow DAG"
|
|
|
|
---
|
|
|
|
## Quick Start
|
|
|
|
### Core Tools
|
|
|
|
```bash
|
|
# Generate pipeline orchestration config
|
|
python scripts/pipeline_orchestrator.py generate \
|
|
--type airflow \
|
|
--source postgres \
|
|
--destination snowflake \
|
|
--schedule "0 5 * * *"
|
|
|
|
# Validate data quality
|
|
python scripts/data_quality_validator.py validate \
|
|
--input data/sales.parquet \
|
|
--schema schemas/sales.json \
|
|
--checks freshness,completeness,uniqueness
|
|
|
|
# Optimize ETL performance
|
|
python scripts/etl_performance_optimizer.py analyze \
|
|
--query queries/daily_aggregation.sql \
|
|
--engine spark \
|
|
--recommend
|
|
```
|
|
|
|
---
|
|
|
|
## Workflows
|
|
|
|
### Workflow 1: Building a Batch ETL Pipeline
|
|
|
|
**Scenario:** Extract data from PostgreSQL, transform with dbt, load to Snowflake.
|
|
|
|
#### Step 1: Define Source Schema
|
|
|
|
```sql
|
|
-- Document source tables
|
|
SELECT
|
|
table_name,
|
|
column_name,
|
|
data_type,
|
|
is_nullable
|
|
FROM information_schema.columns
|
|
WHERE table_schema = 'source_schema'
|
|
ORDER BY table_name, ordinal_position;
|
|
```
|
|
|
|
#### Step 2: Generate Extraction Config
|
|
|
|
```bash
|
|
python scripts/pipeline_orchestrator.py generate \
|
|
--type airflow \
|
|
--source postgres \
|
|
--tables orders,customers,products \
|
|
--mode incremental \
|
|
--watermark updated_at \
|
|
--output dags/extract_source.py
|
|
```
|
|
|
|
#### Step 3: Create dbt Models
|
|
|
|
```sql
|
|
-- models/staging/stg_orders.sql
|
|
WITH source AS (
|
|
SELECT * FROM {{ source('postgres', 'orders') }}
|
|
),
|
|
|
|
renamed AS (
|
|
SELECT
|
|
order_id,
|
|
customer_id,
|
|
order_date,
|
|
total_amount,
|
|
status,
|
|
_extracted_at
|
|
FROM source
|
|
WHERE order_date >= DATEADD(day, -3, CURRENT_DATE)
|
|
)
|
|
|
|
SELECT * FROM renamed
|
|
```
|
|
|
|
```sql
|
|
-- models/marts/fct_orders.sql
|
|
{{
|
|
config(
|
|
materialized='incremental',
|
|
unique_key='order_id',
|
|
cluster_by=['order_date']
|
|
)
|
|
}}
|
|
|
|
SELECT
|
|
o.order_id,
|
|
o.customer_id,
|
|
c.customer_segment,
|
|
o.order_date,
|
|
o.total_amount,
|
|
o.status
|
|
FROM {{ ref('stg_orders') }} o
|
|
LEFT JOIN {{ ref('dim_customers') }} c
|
|
ON o.customer_id = c.customer_id
|
|
|
|
{% if is_incremental() %}
|
|
WHERE o._extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
|
|
{% endif %}
|
|
```
|
|
|
|
#### Step 4: Configure Data Quality Tests
|
|
|
|
```yaml
|
|
# models/marts/schema.yml
|
|
version: 2
|
|
|
|
models:
|
|
- name: fct_orders
|
|
description: "Order fact table"
|
|
columns:
|
|
- name: order_id
|
|
tests:
|
|
- unique
|
|
- not_null
|
|
- name: total_amount
|
|
tests:
|
|
- not_null
|
|
- dbt_utils.accepted_range:
|
|
min_value: 0
|
|
max_value: 1000000
|
|
- name: order_date
|
|
tests:
|
|
- not_null
|
|
- dbt_utils.recency:
|
|
datepart: day
|
|
field: order_date
|
|
interval: 1
|
|
```
|
|
|
|
#### Step 5: Create Airflow DAG
|
|
|
|
```python
|
|
# dags/daily_etl.py
|
|
from airflow import DAG
|
|
from airflow.providers.postgres.operators.postgres import PostgresOperator
|
|
from airflow.operators.bash import BashOperator
|
|
from airflow.utils.dates import days_ago
|
|
from datetime import timedelta
|
|
|
|
default_args = {
|
|
'owner': 'data-team',
|
|
'depends_on_past': False,
|
|
'email_on_failure': True,
|
|
'email': ['data-alerts@company.com'],
|
|
'retries': 2,
|
|
'retry_delay': timedelta(minutes=5),
|
|
}
|
|
|
|
with DAG(
|
|
'daily_etl_pipeline',
|
|
default_args=default_args,
|
|
description='Daily ETL from PostgreSQL to Snowflake',
|
|
schedule_interval='0 5 * * *',
|
|
start_date=days_ago(1),
|
|
catchup=False,
|
|
tags=['etl', 'daily'],
|
|
) as dag:
|
|
|
|
extract = BashOperator(
|
|
task_id='extract_source_data',
|
|
bash_command='python /opt/airflow/scripts/extract.py --date {{ ds }}',
|
|
)
|
|
|
|
transform = BashOperator(
|
|
task_id='run_dbt_models',
|
|
bash_command='cd /opt/airflow/dbt && dbt run --select marts.*',
|
|
)
|
|
|
|
test = BashOperator(
|
|
task_id='run_dbt_tests',
|
|
bash_command='cd /opt/airflow/dbt && dbt test --select marts.*',
|
|
)
|
|
|
|
notify = BashOperator(
|
|
task_id='send_notification',
|
|
bash_command='python /opt/airflow/scripts/notify.py --status success',
|
|
trigger_rule='all_success',
|
|
)
|
|
|
|
extract >> transform >> test >> notify
|
|
```
|
|
|
|
#### Step 6: Validate Pipeline
|
|
|
|
```bash
|
|
# Test locally
|
|
dbt run --select stg_orders fct_orders
|
|
dbt test --select fct_orders
|
|
|
|
# Validate data quality
|
|
python scripts/data_quality_validator.py validate \
|
|
--table fct_orders \
|
|
--checks all \
|
|
--output reports/quality_report.json
|
|
```
|
|
|
|
---
|
|
|
|
### Workflow 2: Implementing Real-Time Streaming
|
|
|
|
**Scenario:** Stream events from Kafka, process with Flink/Spark Streaming, sink to data lake.
|
|
|
|
#### Step 1: Define Event Schema
|
|
|
|
```json
|
|
{
|
|
"$schema": "http://json-schema.org/draft-07/schema#",
|
|
"title": "UserEvent",
|
|
"type": "object",
|
|
"required": ["event_id", "user_id", "event_type", "timestamp"],
|
|
"properties": {
|
|
"event_id": {"type": "string", "format": "uuid"},
|
|
"user_id": {"type": "string"},
|
|
"event_type": {"type": "string", "enum": ["page_view", "click", "purchase"]},
|
|
"timestamp": {"type": "string", "format": "date-time"},
|
|
"properties": {"type": "object"}
|
|
}
|
|
}
|
|
```
|
|
|
|
#### Step 2: Create Kafka Topic
|
|
|
|
```bash
|
|
# Create topic with appropriate partitions
|
|
kafka-topics.sh --create \
|
|
--bootstrap-server localhost:9092 \
|
|
--topic user-events \
|
|
--partitions 12 \
|
|
--replication-factor 3 \
|
|
--config retention.ms=604800000 \
|
|
--config cleanup.policy=delete
|
|
|
|
# Verify topic
|
|
kafka-topics.sh --describe \
|
|
--bootstrap-server localhost:9092 \
|
|
--topic user-events
|
|
```
|
|
|
|
#### Step 3: Implement Spark Streaming Job
|
|
|
|
```python
|
|
# streaming/user_events_processor.py
|
|
from pyspark.sql import SparkSession
|
|
from pyspark.sql.functions import (
|
|
from_json, col, window, count, avg,
|
|
to_timestamp, current_timestamp
|
|
)
|
|
from pyspark.sql.types import (
|
|
StructType, StructField, StringType,
|
|
TimestampType, MapType
|
|
)
|
|
|
|
# Initialize Spark
|
|
spark = SparkSession.builder \
|
|
.appName("UserEventsProcessor") \
|
|
.config("spark.sql.streaming.checkpointLocation", "/checkpoints/user-events") \
|
|
.config("spark.sql.shuffle.partitions", "12") \
|
|
.getOrCreate()
|
|
|
|
# Define schema
|
|
event_schema = StructType([
|
|
StructField("event_id", StringType(), False),
|
|
StructField("user_id", StringType(), False),
|
|
StructField("event_type", StringType(), False),
|
|
StructField("timestamp", StringType(), False),
|
|
StructField("properties", MapType(StringType(), StringType()), True)
|
|
])
|
|
|
|
# Read from Kafka
|
|
events_df = spark.readStream \
|
|
.format("kafka") \
|
|
.option("kafka.bootstrap.servers", "localhost:9092") \
|
|
.option("subscribe", "user-events") \
|
|
.option("startingOffsets", "latest") \
|
|
.option("failOnDataLoss", "false") \
|
|
.load()
|
|
|
|
# Parse JSON
|
|
parsed_df = events_df \
|
|
.select(from_json(col("value").cast("string"), event_schema).alias("data")) \
|
|
.select("data.*") \
|
|
.withColumn("event_timestamp", to_timestamp(col("timestamp")))
|
|
|
|
# Windowed aggregation
|
|
aggregated_df = parsed_df \
|
|
.withWatermark("event_timestamp", "10 minutes") \
|
|
.groupBy(
|
|
window(col("event_timestamp"), "5 minutes"),
|
|
col("event_type")
|
|
) \
|
|
.agg(
|
|
count("*").alias("event_count"),
|
|
approx_count_distinct("user_id").alias("unique_users")
|
|
)
|
|
|
|
# Write to Delta Lake
|
|
query = aggregated_df.writeStream \
|
|
.format("delta") \
|
|
.outputMode("append") \
|
|
.option("checkpointLocation", "/checkpoints/user-events-aggregated") \
|
|
.option("path", "/data/lake/user_events_aggregated") \
|
|
.trigger(processingTime="1 minute") \
|
|
.start()
|
|
|
|
query.awaitTermination()
|
|
```
|
|
|
|
#### Step 4: Handle Late Data and Errors
|
|
|
|
```python
|
|
# Dead letter queue for failed records
|
|
from pyspark.sql.functions import current_timestamp, lit
|
|
|
|
def process_with_error_handling(batch_df, batch_id):
|
|
try:
|
|
# Attempt processing
|
|
valid_df = batch_df.filter(col("event_id").isNotNull())
|
|
invalid_df = batch_df.filter(col("event_id").isNull())
|
|
|
|
# Write valid records
|
|
valid_df.write \
|
|
.format("delta") \
|
|
.mode("append") \
|
|
.save("/data/lake/user_events")
|
|
|
|
# Write invalid to DLQ
|
|
if invalid_df.count() > 0:
|
|
invalid_df \
|
|
.withColumn("error_timestamp", current_timestamp()) \
|
|
.withColumn("error_reason", lit("missing_event_id")) \
|
|
.write \
|
|
.format("delta") \
|
|
.mode("append") \
|
|
.save("/data/lake/dlq/user_events")
|
|
|
|
except Exception as e:
|
|
# Log error, alert, continue
|
|
logger.error(f"Batch {batch_id} failed: {e}")
|
|
raise
|
|
|
|
# Use foreachBatch for custom processing
|
|
query = parsed_df.writeStream \
|
|
.foreachBatch(process_with_error_handling) \
|
|
.option("checkpointLocation", "/checkpoints/user-events") \
|
|
.start()
|
|
```
|
|
|
|
#### Step 5: Monitor Stream Health
|
|
|
|
```python
|
|
# monitoring/stream_metrics.py
|
|
from prometheus_client import Gauge, Counter, start_http_server
|
|
|
|
# Define metrics
|
|
RECORDS_PROCESSED = Counter(
|
|
'stream_records_processed_total',
|
|
'Total records processed',
|
|
['stream_name', 'status']
|
|
)
|
|
|
|
PROCESSING_LAG = Gauge(
|
|
'stream_processing_lag_seconds',
|
|
'Current processing lag',
|
|
['stream_name']
|
|
)
|
|
|
|
BATCH_DURATION = Gauge(
|
|
'stream_batch_duration_seconds',
|
|
'Last batch processing duration',
|
|
['stream_name']
|
|
)
|
|
|
|
def emit_metrics(query):
|
|
"""Emit Prometheus metrics from streaming query."""
|
|
progress = query.lastProgress
|
|
if progress:
|
|
RECORDS_PROCESSED.labels(
|
|
stream_name='user-events',
|
|
status='success'
|
|
).inc(progress['numInputRows'])
|
|
|
|
if progress['sources']:
|
|
# Calculate lag from latest offset
|
|
for source in progress['sources']:
|
|
end_offset = source.get('endOffset', {})
|
|
# Parse Kafka offsets and calculate lag
|
|
```
|
|
|
|
---
|
|
|
|
### Workflow 3: Data Quality Framework Setup
|
|
|
|
**Scenario:** Implement comprehensive data quality monitoring with Great Expectations.
|
|
|
|
#### Step 1: Initialize Great Expectations
|
|
|
|
```bash
|
|
# Install and initialize
|
|
pip install great_expectations
|
|
|
|
great_expectations init
|
|
|
|
# Connect to data source
|
|
great_expectations datasource new
|
|
```
|
|
|
|
#### Step 2: Create Expectation Suite
|
|
|
|
```python
|
|
# expectations/orders_suite.py
|
|
import great_expectations as gx
|
|
|
|
context = gx.get_context()
|
|
|
|
# Create expectation suite
|
|
suite = context.add_expectation_suite("orders_quality_suite")
|
|
|
|
# Add expectations
|
|
validator = context.get_validator(
|
|
batch_request={
|
|
"datasource_name": "warehouse",
|
|
"data_asset_name": "orders",
|
|
},
|
|
expectation_suite_name="orders_quality_suite"
|
|
)
|
|
|
|
# Schema expectations
|
|
validator.expect_table_columns_to_match_ordered_list(
|
|
column_list=[
|
|
"order_id", "customer_id", "order_date",
|
|
"total_amount", "status", "created_at"
|
|
]
|
|
)
|
|
|
|
# Completeness expectations
|
|
validator.expect_column_values_to_not_be_null("order_id")
|
|
validator.expect_column_values_to_not_be_null("customer_id")
|
|
validator.expect_column_values_to_not_be_null("order_date")
|
|
|
|
# Uniqueness expectations
|
|
validator.expect_column_values_to_be_unique("order_id")
|
|
|
|
# Range expectations
|
|
validator.expect_column_values_to_be_between(
|
|
"total_amount",
|
|
min_value=0,
|
|
max_value=1000000
|
|
)
|
|
|
|
# Categorical expectations
|
|
validator.expect_column_values_to_be_in_set(
|
|
"status",
|
|
["pending", "confirmed", "shipped", "delivered", "cancelled"]
|
|
)
|
|
|
|
# Freshness expectation
|
|
validator.expect_column_max_to_be_between(
|
|
"order_date",
|
|
min_value={"$PARAMETER": "now - timedelta(days=1)"},
|
|
max_value={"$PARAMETER": "now"}
|
|
)
|
|
|
|
# Referential integrity
|
|
validator.expect_column_values_to_be_in_set(
|
|
"customer_id",
|
|
value_set={"$PARAMETER": "valid_customer_ids"}
|
|
)
|
|
|
|
validator.save_expectation_suite(discard_failed_expectations=False)
|
|
```
|
|
|
|
#### Step 3: Create Data Quality Checks with dbt
|
|
|
|
```yaml
|
|
# models/marts/schema.yml
|
|
version: 2
|
|
|
|
models:
|
|
- name: fct_orders
|
|
description: "Order fact table with data quality checks"
|
|
|
|
tests:
|
|
# Row count check
|
|
- dbt_utils.equal_rowcount:
|
|
compare_model: ref('stg_orders')
|
|
|
|
# Freshness check
|
|
- dbt_utils.recency:
|
|
datepart: hour
|
|
field: created_at
|
|
interval: 24
|
|
|
|
columns:
|
|
- name: order_id
|
|
description: "Unique order identifier"
|
|
tests:
|
|
- unique
|
|
- not_null
|
|
- relationships:
|
|
to: ref('dim_orders')
|
|
field: order_id
|
|
|
|
- name: total_amount
|
|
tests:
|
|
- not_null
|
|
- dbt_utils.accepted_range:
|
|
min_value: 0
|
|
max_value: 1000000
|
|
inclusive: true
|
|
- dbt_expectations.expect_column_values_to_be_between:
|
|
min_value: 0
|
|
row_condition: "status != 'cancelled'"
|
|
|
|
- name: customer_id
|
|
tests:
|
|
- not_null
|
|
- relationships:
|
|
to: ref('dim_customers')
|
|
field: customer_id
|
|
severity: warn
|
|
```
|
|
|
|
#### Step 4: Implement Data Contracts
|
|
|
|
```yaml
|
|
# contracts/orders_contract.yaml
|
|
contract:
|
|
name: orders_data_contract
|
|
version: "1.0.0"
|
|
owner: data-team@company.com
|
|
|
|
schema:
|
|
type: object
|
|
properties:
|
|
order_id:
|
|
type: string
|
|
format: uuid
|
|
description: "Unique order identifier"
|
|
customer_id:
|
|
type: string
|
|
not_null: true
|
|
order_date:
|
|
type: date
|
|
not_null: true
|
|
total_amount:
|
|
type: decimal
|
|
precision: 10
|
|
scale: 2
|
|
minimum: 0
|
|
status:
|
|
type: string
|
|
enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
|
|
|
|
sla:
|
|
freshness:
|
|
max_delay_hours: 1
|
|
completeness:
|
|
min_percentage: 99.9
|
|
accuracy:
|
|
duplicate_tolerance: 0.01
|
|
|
|
consumers:
|
|
- name: analytics-team
|
|
usage: "Daily reporting dashboards"
|
|
- name: ml-team
|
|
usage: "Churn prediction model"
|
|
```
|
|
|
|
#### Step 5: Set Up Quality Monitoring Dashboard
|
|
|
|
```python
|
|
# monitoring/quality_dashboard.py
|
|
from datetime import datetime, timedelta
|
|
import pandas as pd
|
|
|
|
def generate_quality_report(connection, table_name: str) -> dict:
|
|
"""Generate comprehensive data quality report."""
|
|
|
|
report = {
|
|
"table": table_name,
|
|
"timestamp": datetime.now().isoformat(),
|
|
"checks": {}
|
|
}
|
|
|
|
# Row count check
|
|
row_count = connection.execute(
|
|
f"SELECT COUNT(*) FROM {table_name}"
|
|
).fetchone()[0]
|
|
report["checks"]["row_count"] = {
|
|
"value": row_count,
|
|
"status": "pass" if row_count > 0 else "fail"
|
|
}
|
|
|
|
# Freshness check
|
|
max_date = connection.execute(
|
|
f"SELECT MAX(created_at) FROM {table_name}"
|
|
).fetchone()[0]
|
|
hours_old = (datetime.now() - max_date).total_seconds() / 3600
|
|
report["checks"]["freshness"] = {
|
|
"max_timestamp": max_date.isoformat(),
|
|
"hours_old": round(hours_old, 2),
|
|
"status": "pass" if hours_old < 24 else "fail"
|
|
}
|
|
|
|
# Null rate check
|
|
null_query = f"""
|
|
SELECT
|
|
SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) as null_order_id,
|
|
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_customer_id,
|
|
COUNT(*) as total
|
|
FROM {table_name}
|
|
"""
|
|
null_result = connection.execute(null_query).fetchone()
|
|
report["checks"]["null_rates"] = {
|
|
"order_id": null_result[0] / null_result[2] if null_result[2] > 0 else 0,
|
|
"customer_id": null_result[1] / null_result[2] if null_result[2] > 0 else 0,
|
|
"status": "pass" if null_result[0] == 0 and null_result[1] == 0 else "fail"
|
|
}
|
|
|
|
# Duplicate check
|
|
dup_query = f"""
|
|
SELECT COUNT(*) - COUNT(DISTINCT order_id) as duplicates
|
|
FROM {table_name}
|
|
"""
|
|
duplicates = connection.execute(dup_query).fetchone()[0]
|
|
report["checks"]["duplicates"] = {
|
|
"count": duplicates,
|
|
"status": "pass" if duplicates == 0 else "fail"
|
|
}
|
|
|
|
# Overall status
|
|
all_passed = all(
|
|
check["status"] == "pass"
|
|
for check in report["checks"].values()
|
|
)
|
|
report["overall_status"] = "pass" if all_passed else "fail"
|
|
|
|
return report
|
|
```
|
|
|
|
---
|
|
|
|
## Architecture Decision Framework
|
|
|
|
Use this framework to choose the right approach for your data pipeline.
|
|
|
|
### Batch vs Streaming
|
|
|
|
| Criteria | Batch | Streaming |
|
|
|----------|-------|-----------|
|
|
| **Latency requirement** | Hours to days | Seconds to minutes |
|
|
| **Data volume** | Large historical datasets | Continuous event streams |
|
|
| **Processing complexity** | Complex transformations, ML | Simple aggregations, filtering |
|
|
| **Cost sensitivity** | More cost-effective | Higher infrastructure cost |
|
|
| **Error handling** | Easier to reprocess | Requires careful design |
|
|
|
|
**Decision Tree:**
|
|
```
|
|
Is real-time insight required?
|
|
├── Yes → Use streaming
|
|
│ └── Is exactly-once semantics needed?
|
|
│ ├── Yes → Kafka + Flink/Spark Structured Streaming
|
|
│ └── No → Kafka + consumer groups
|
|
└── No → Use batch
|
|
└── Is data volume > 1TB daily?
|
|
├── Yes → Spark/Databricks
|
|
└── No → dbt + warehouse compute
|
|
```
|
|
|
|
### Lambda vs Kappa Architecture
|
|
|
|
| Aspect | Lambda | Kappa |
|
|
|--------|--------|-------|
|
|
| **Complexity** | Two codebases (batch + stream) | Single codebase |
|
|
| **Maintenance** | Higher (sync batch/stream logic) | Lower |
|
|
| **Reprocessing** | Native batch layer | Replay from source |
|
|
| **Use case** | ML training + real-time serving | Pure event-driven |
|
|
|
|
**When to choose Lambda:**
|
|
- Need to train ML models on historical data
|
|
- Complex batch transformations not feasible in streaming
|
|
- Existing batch infrastructure
|
|
|
|
**When to choose Kappa:**
|
|
- Event-sourced architecture
|
|
- All processing can be expressed as stream operations
|
|
- Starting fresh without legacy systems
|
|
|
|
### Data Warehouse vs Data Lakehouse
|
|
|
|
| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|
|
|---------|-------------------------------|---------------------------|
|
|
| **Best for** | BI, SQL analytics | ML, unstructured data |
|
|
| **Storage cost** | Higher (proprietary format) | Lower (open formats) |
|
|
| **Flexibility** | Schema-on-write | Schema-on-read |
|
|
| **Performance** | Excellent for SQL | Good, improving |
|
|
| **Ecosystem** | Mature BI tools | Growing ML tooling |
|
|
|
|
---
|
|
|
|
## Tech Stack
|
|
|
|
| Category | Technologies |
|
|
|----------|--------------|
|
|
| **Languages** | Python, SQL, Scala |
|
|
| **Orchestration** | Airflow, Prefect, Dagster |
|
|
| **Transformation** | dbt, Spark, Flink |
|
|
| **Streaming** | Kafka, Kinesis, Pub/Sub |
|
|
| **Storage** | S3, GCS, Delta Lake, Iceberg |
|
|
| **Warehouses** | Snowflake, BigQuery, Redshift, Databricks |
|
|
| **Quality** | Great Expectations, dbt tests, Monte Carlo |
|
|
| **Monitoring** | Prometheus, Grafana, Datadog |
|
|
|
|
---
|
|
|
|
## Reference Documentation
|
|
|
|
### 1. Data Pipeline Architecture
|
|
See `references/data_pipeline_architecture.md` for:
|
|
- Lambda vs Kappa architecture patterns
|
|
- Batch processing with Spark and Airflow
|
|
- Stream processing with Kafka and Flink
|
|
- Exactly-once semantics implementation
|
|
- Error handling and dead letter queues
|
|
|
|
### 2. Data Modeling Patterns
|
|
See `references/data_modeling_patterns.md` for:
|
|
- Dimensional modeling (Star/Snowflake)
|
|
- Slowly Changing Dimensions (SCD Types 1-6)
|
|
- Data Vault modeling
|
|
- dbt best practices
|
|
- Partitioning and clustering
|
|
|
|
### 3. DataOps Best Practices
|
|
See `references/dataops_best_practices.md` for:
|
|
- Data testing frameworks
|
|
- Data contracts and schema validation
|
|
- CI/CD for data pipelines
|
|
- Observability and lineage
|
|
- Incident response
|
|
|
|
---
|
|
|
|
## Troubleshooting
|
|
|
|
### Pipeline Failures
|
|
|
|
**Symptom:** Airflow DAG fails with timeout
|
|
```
|
|
Task exceeded max execution time
|
|
```
|
|
|
|
**Solution:**
|
|
1. Check resource allocation
|
|
2. Profile slow operations
|
|
3. Add incremental processing
|
|
```python
|
|
# Increase timeout
|
|
default_args = {
|
|
'execution_timeout': timedelta(hours=2),
|
|
}
|
|
|
|
# Or use incremental loads
|
|
WHERE updated_at > '{{ prev_ds }}'
|
|
```
|
|
|
|
---
|
|
|
|
**Symptom:** Spark job OOM
|
|
```
|
|
java.lang.OutOfMemoryError: Java heap space
|
|
```
|
|
|
|
**Solution:**
|
|
1. Increase executor memory
|
|
2. Reduce partition size
|
|
3. Use disk spill
|
|
```python
|
|
spark.conf.set("spark.executor.memory", "8g")
|
|
spark.conf.set("spark.sql.shuffle.partitions", "200")
|
|
spark.conf.set("spark.memory.fraction", "0.8")
|
|
```
|
|
|
|
---
|
|
|
|
**Symptom:** Kafka consumer lag increasing
|
|
```
|
|
Consumer lag: 1000000 messages
|
|
```
|
|
|
|
**Solution:**
|
|
1. Increase consumer parallelism
|
|
2. Optimize processing logic
|
|
3. Scale consumer group
|
|
```bash
|
|
# Add more partitions
|
|
kafka-topics.sh --alter \
|
|
--bootstrap-server localhost:9092 \
|
|
--topic user-events \
|
|
--partitions 24
|
|
```
|
|
|
|
---
|
|
|
|
### Data Quality Issues
|
|
|
|
**Symptom:** Duplicate records appearing
|
|
```
|
|
Expected unique, found 150 duplicates
|
|
```
|
|
|
|
**Solution:**
|
|
1. Add deduplication logic
|
|
2. Use merge/upsert operations
|
|
```sql
|
|
-- dbt incremental with dedup
|
|
{{
|
|
config(
|
|
materialized='incremental',
|
|
unique_key='order_id'
|
|
)
|
|
}}
|
|
|
|
SELECT * FROM (
|
|
SELECT
|
|
*,
|
|
ROW_NUMBER() OVER (
|
|
PARTITION BY order_id
|
|
ORDER BY updated_at DESC
|
|
) as rn
|
|
FROM {{ source('raw', 'orders') }}
|
|
) WHERE rn = 1
|
|
```
|
|
|
|
---
|
|
|
|
**Symptom:** Stale data in tables
|
|
```
|
|
Last update: 3 days ago
|
|
```
|
|
|
|
**Solution:**
|
|
1. Check upstream pipeline status
|
|
2. Verify source availability
|
|
3. Add freshness monitoring
|
|
```yaml
|
|
# dbt freshness check
|
|
sources:
|
|
- name: raw
|
|
freshness:
|
|
warn_after: {count: 12, period: hour}
|
|
error_after: {count: 24, period: hour}
|
|
loaded_at_field: _loaded_at
|
|
```
|
|
|
|
---
|
|
|
|
**Symptom:** Schema drift detected
|
|
```
|
|
Column 'new_field' not in expected schema
|
|
```
|
|
|
|
**Solution:**
|
|
1. Update data contract
|
|
2. Modify transformations
|
|
3. Communicate with producers
|
|
```python
|
|
# Handle schema evolution
|
|
df = spark.read.format("delta") \
|
|
.option("mergeSchema", "true") \
|
|
.load("/data/orders")
|
|
```
|
|
|
|
---
|
|
|
|
### Performance Issues
|
|
|
|
**Symptom:** Query takes hours
|
|
```
|
|
Query runtime: 4 hours (expected: 30 minutes)
|
|
```
|
|
|
|
**Solution:**
|
|
1. Check query plan
|
|
2. Add proper partitioning
|
|
3. Optimize joins
|
|
```sql
|
|
-- Before: Full table scan
|
|
SELECT * FROM orders WHERE order_date = '2024-01-15';
|
|
|
|
-- After: Partition pruning
|
|
-- Table partitioned by order_date
|
|
SELECT * FROM orders WHERE order_date = '2024-01-15';
|
|
|
|
-- Add clustering for frequent filters
|
|
ALTER TABLE orders CLUSTER BY (customer_id);
|
|
```
|
|
|
|
---
|
|
|
|
**Symptom:** dbt model takes too long
|
|
```
|
|
Model fct_orders completed in 45 minutes
|
|
```
|
|
|
|
**Solution:**
|
|
1. Use incremental materialization
|
|
2. Reduce upstream dependencies
|
|
3. Pre-aggregate where possible
|
|
```sql
|
|
-- Convert to incremental
|
|
{{
|
|
config(
|
|
materialized='incremental',
|
|
unique_key='order_id',
|
|
on_schema_change='sync_all_columns'
|
|
)
|
|
}}
|
|
|
|
SELECT * FROM {{ ref('stg_orders') }}
|
|
{% if is_incremental() %}
|
|
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
|
|
{% endif %}
|
|
```
|