# Data Pipeline Architecture

Comprehensive guide to designing and implementing production data pipelines.

## Table of Contents

1. [Architecture Patterns](#architecture-patterns)
2. [Batch Processing](#batch-processing)
3. [Stream Processing](#stream-processing)
4. [Exactly-Once Semantics](#exactly-once-semantics)
5. [Error Handling](#error-handling)
6. [Data Ingestion Patterns](#data-ingestion-patterns)
7. [Orchestration](#orchestration)

---

## Architecture Patterns

### Lambda Architecture

The Lambda architecture combines batch and stream processing for comprehensive data handling.

```
┌─────────────────────────────────────┐
│            Data Sources             │
└─────────────────┬───────────────────┘
                  │
┌─────────────────▼───────────────────┐
│        Message Queue (Kafka)        │
└───────┬─────────────────┬───────────┘
        │                 │
┌───────▼───────────┐ ┌───▼─────────────────┐
│    Batch Layer    │ │     Speed Layer     │
│  (Spark/Airflow)  │ │  (Flink/Spark SS)   │
└───────┬───────────┘ └───┬─────────────────┘
        │                 │
┌───────▼───────────┐ ┌───▼─────────────────┐
│  Master Dataset   │ │   Real-time Views   │
│   (Data Lake)     │ │    (Redis/Druid)    │
└───────┬───────────┘ └───┬─────────────────┘
        │                 │
    ┌───▼─────────────────▼───────┐
    │        Serving Layer        │
    │ (Merged Batch + Real-time)  │
    └─────────────────────────────┘
```

**Components:**

1. **Batch Layer**
   - Processes complete historical data
   - Creates precomputed batch views
   - Handles complex transformations, ML training
   - Reprocessable from raw data

2. **Speed Layer**
   - Processes real-time data stream
   - Creates real-time views for recent data
   - Low latency, simpler transformations
   - Compensates for batch layer delay

3. **Serving Layer**
   - Merges batch and real-time views
   - Responds to queries
   - Provides unified interface

**Implementation Example:**

```python
# Note: max/sum below are the Spark column functions, shadowing the builtins
from pyspark.sql.functions import col, count, from_json, max, sum, window

# Batch layer: Daily aggregation with Spark
def batch_daily_aggregation(spark, date):
    """Process a full day of data for batch views."""
    raw_df = spark.read.parquet(f"s3://data-lake/raw/events/date={date}")

    aggregated = raw_df.groupBy("user_id", "event_type") \
        .agg(
            count("*").alias("event_count"),
            sum("revenue").alias("total_revenue"),
            max("timestamp").alias("last_event")
        )

    aggregated.write \
        .mode("overwrite") \
        .partitionBy("event_type") \
        .parquet(f"s3://data-lake/batch-views/daily_agg/date={date}")

# Speed layer: Real-time aggregation with Spark Structured Streaming
def speed_realtime_aggregation(spark):
    """Process streaming data for real-time views."""
    stream_df = spark.readStream \
        .format("kafka") \
        .option("kafka.bootstrap.servers", "kafka:9092") \
        .option("subscribe", "events") \
        .load()

    # event_schema is the StructType of the event payload, defined elsewhere
    parsed = stream_df.select(
        from_json(col("value").cast("string"), event_schema).alias("data")
    ).select("data.*")

    aggregated = parsed \
        .withWatermark("timestamp", "5 minutes") \
        .groupBy(
            window("timestamp", "1 minute"),
            "user_id",
            "event_type"
        ) \
        .agg(count("*").alias("event_count"))

    # The "redis" sink requires a third-party connector (e.g. spark-redis) on the classpath
    query = aggregated.writeStream \
        .format("redis") \
        .option("host", "redis") \
        .outputMode("update") \
        .start()

    return query
```
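
The serving layer then merges the two views at query time. A minimal sketch, assuming the batch view is the daily Parquet aggregate written above and `realtime_df` holds the speed-layer counts for the not-yet-batch-processed window at the same grain:

```python
from pyspark.sql.functions import coalesce, col, lit

def serving_layer_merge(spark, date, realtime_df):
    """Illustrative serving-layer merge of the batch view and the real-time view."""
    batch_view = spark.read.parquet(
        f"s3://data-lake/batch-views/daily_agg/date={date}"
    )

    # Full outer join so users seen in only one layer are still returned
    merged = batch_view.alias("b").join(
        realtime_df.alias("r"),
        on=["user_id", "event_type"],
        how="full_outer",
    ).select(
        col("user_id"),
        col("event_type"),
        (coalesce(col("b.event_count"), lit(0))
         + coalesce(col("r.event_count"), lit(0))).alias("event_count"),
    )
    return merged
```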

### Kappa Architecture

Kappa simplifies Lambda by using only stream processing with replay capability.

```
┌─────────────────────────────────────┐
│            Data Sources             │
└─────────────────┬───────────────────┘
                  │
┌─────────────────▼───────────────────┐
│    Immutable Log (Kafka/Kinesis)    │
│          (Long retention)           │
└─────────────────┬───────────────────┘
                  │
┌─────────────────▼───────────────────┐
│          Stream Processor           │
│       (Flink/Spark Streaming)       │
└─────────────────┬───────────────────┘
                  │
┌─────────────────▼───────────────────┐
│            Serving Layer            │
│      (Database/Data Warehouse)      │
└─────────────────────────────────────┘
```

**Key Principles:**

1. **Single Processing Path**: All data processed as streams
2. **Immutable Log**: Kafka/Kinesis as source of truth with long retention
3. **Reprocessing via Replay**: Re-run stream processor from beginning when needed

**Reprocessing Strategy:**

```python
# Reprocessing in Kappa architecture
import uuid
from datetime import datetime

from kafka import KafkaConsumer, TopicPartition

class KappaReprocessor:
    """Handle reprocessing by replaying from Kafka."""

    def __init__(self, kafka_config, flink_job):
        self.kafka = kafka_config
        self.job = flink_job  # wrapper exposing set_config/seek_to_offsets/etc.

    def reprocess(self, from_timestamp_ms: int):
        """Reprocess all data from a specific timestamp (epoch milliseconds)."""

        # 1. Start new consumer group reading from timestamp
        new_consumer_group = f"reprocess-{uuid.uuid4()}"

        # 2. Configure stream processor with new group
        self.job.set_config({
            "group.id": new_consumer_group,
            "auto.offset.reset": "none"  # We'll set offsets manually
        })

        # 3. Seek to timestamp
        offsets = self._get_offsets_for_timestamp(from_timestamp_ms)
        self.job.seek_to_offsets(offsets)

        # 4. Write to new output table/topic
        output_table = f"events_reprocessed_{datetime.now().strftime('%Y%m%d')}"
        self.job.set_output(output_table)

        # 5. Run until caught up
        self.job.run_until_caught_up()

        # 6. Swap output tables atomically
        self._atomic_table_swap("events", output_table)

    def _get_offsets_for_timestamp(self, timestamp_ms):
        """Get Kafka offsets for a specific timestamp (epoch milliseconds)."""
        consumer = KafkaConsumer(bootstrap_servers=self.kafka["brokers"])
        partitions = consumer.partitions_for_topic("events")

        offsets = {}
        for partition in partitions:
            tp = TopicPartition("events", partition)
            offset = consumer.offsets_for_times({tp: timestamp_ms})
            offsets[tp] = offset[tp].offset

        return offsets
```
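
The `_atomic_table_swap` step is left abstract above. If the serving store is a database with transactional DDL (PostgreSQL, for example), it could delegate to a helper along these lines; `execute_sql` is the same assumed helper used elsewhere in this guide:

```python
def atomic_table_swap(live_table: str, new_table: str):
    """Swap the reprocessed table into place (sketch; PostgreSQL-style DDL)."""
    # DDL is transactional in PostgreSQL, so both renames commit together or not at all
    execute_sql(f"""
        BEGIN;
        ALTER TABLE {live_table} RENAME TO {live_table}_old;
        ALTER TABLE {new_table} RENAME TO {live_table};
        COMMIT;
    """)
    execute_sql(f"DROP TABLE IF EXISTS {live_table}_old")
```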

### Medallion Architecture (Bronze/Silver/Gold)

Common in data lakehouses (Databricks, Delta Lake).

```
┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Bronze    │────▶│   Silver    │────▶│    Gold     │
│ (Raw Data)  │     │ (Cleansed)  │     │ (Analytics) │
└─────────────┘     └─────────────┘     └─────────────┘
       │                   │                   │
       ▼                   ▼                   ▼
  Landing zone        Validated,          Aggregated,
  Append-only         deduplicated,       business-ready
  Schema evolution    standardized        Star schema
```

**Implementation with Delta Lake:**

```python
from delta.tables import DeltaTable
# Note: max/sum below are the Spark column functions, shadowing the builtins
from pyspark.sql.functions import (
    col, count, countDistinct, current_timestamp, input_file_name,
    max, sum, to_date, when,
)

# Bronze: Raw ingestion
def ingest_to_bronze(spark, source_path, bronze_path):
    """Ingest raw data to the bronze layer."""
    df = spark.read.format("json").load(source_path)

    # Add metadata
    df = df.withColumn("_ingested_at", current_timestamp()) \
        .withColumn("_source_file", input_file_name())

    df.write \
        .format("delta") \
        .mode("append") \
        .option("mergeSchema", "true") \
        .save(bronze_path)

# Silver: Cleansing and validation
def bronze_to_silver(spark, bronze_path, silver_path):
    """Transform bronze to silver with cleansing."""
    bronze_df = spark.read.format("delta").load(bronze_path)

    # Read the last processed version (helper assumed; tracking a per-record
    # commit version requires e.g. Delta change data feed or an audit column)
    last_version = get_last_processed_version(silver_path, "bronze")

    # Get only new records
    new_records = bronze_df.filter(col("_commit_version") > last_version)

    # Cleanse and validate
    silver_df = new_records \
        .filter(col("user_id").isNotNull()) \
        .filter(col("event_type").isin(["click", "view", "purchase"])) \
        .withColumn("event_date", to_date("timestamp")) \
        .dropDuplicates(["event_id"])

    # Merge to silver (upsert)
    silver_table = DeltaTable.forPath(spark, silver_path)

    silver_table.alias("target") \
        .merge(
            silver_df.alias("source"),
            "target.event_id = source.event_id"
        ) \
        .whenMatchedUpdateAll() \
        .whenNotMatchedInsertAll() \
        .execute()

# Gold: Business aggregations
def silver_to_gold(spark, silver_path, gold_path):
    """Create business-ready aggregations in the gold layer."""
    silver_df = spark.read.format("delta").load(silver_path)

    # Daily user metrics
    daily_metrics = silver_df \
        .groupBy("user_id", "event_date") \
        .agg(
            count("*").alias("total_events"),
            countDistinct("session_id").alias("sessions"),
            sum(when(col("event_type") == "purchase", col("revenue")).otherwise(0)).alias("revenue"),
            max("timestamp").alias("last_activity")
        )

    # Write as gold table
    daily_metrics.write \
        .format("delta") \
        .mode("overwrite") \
        .partitionBy("event_date") \
        .save(gold_path + "/daily_user_metrics")
```

---

## Batch Processing

### Apache Spark Best Practices

#### Memory Management

```python
from pyspark.sql import SparkSession

# Optimal Spark configuration for batch jobs
spark = SparkSession.builder \
    .appName("BatchETL") \
    .config("spark.executor.memory", "8g") \
    .config("spark.executor.cores", "4") \
    .config("spark.driver.memory", "4g") \
    .config("spark.sql.shuffle.partitions", "200") \
    .config("spark.sql.adaptive.enabled", "true") \
    .config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
    .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
    .getOrCreate()
```

**Memory Tuning Guidelines:**

| Data Size | Executors | Memory/Executor | Cores/Executor |
|-----------|-----------|-----------------|----------------|
| < 10 GB   | 2-4       | 4-8 GB          | 2-4            |
| 10-100 GB | 10-20     | 8-16 GB         | 4-8            |
| 100+ GB   | 50+       | 16-32 GB        | 4-8            |
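
These bands can be encoded as a small starting-point helper; the thresholds below simply mirror the table and should be tuned per workload (the function and its numbers are illustrative, not a Spark API):

```python
from pyspark.sql import SparkSession

def suggest_executor_config(data_size_gb: float) -> dict:
    """Rough executor sizing based on the tuning table above (illustrative)."""
    if data_size_gb < 10:
        return {"executors": 4, "executor_memory": "8g", "executor_cores": 4}
    if data_size_gb < 100:
        return {"executors": 20, "executor_memory": "16g", "executor_cores": 4}
    return {"executors": 50, "executor_memory": "32g", "executor_cores": 4}

# Example: size a job for ~250 GB of input
config = suggest_executor_config(250)
spark = (
    SparkSession.builder
    .appName("BatchETL")
    .config("spark.executor.instances", str(config["executors"]))
    .config("spark.executor.memory", config["executor_memory"])
    .config("spark.executor.cores", str(config["executor_cores"]))
    .getOrCreate()
)
```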

#### Partition Optimization

```python
# Repartition vs Coalesce

# Repartition: Full shuffle, use for increasing partitions
df_repartitioned = df.repartition(100, "date")  # Partition by column

# Coalesce: No shuffle, use for decreasing partitions
df_coalesced = df.coalesce(10)  # Reduce partitions without shuffle

# Optimal partition size: 128-256 MB each
# Calculate partitions:
# num_partitions = total_data_size_mb / 200

# Check current partitions
print(f"Current partitions: {df.rdd.getNumPartitions()}")

# Repartition for optimal join performance
large_df = large_df.repartition(200, "join_key")
small_df = small_df.repartition(200, "join_key")
result = large_df.join(small_df, "join_key")
```

#### Join Optimization

```python
# Broadcast join for small tables (< 10 MB by default)
from pyspark.sql.functions import broadcast

# Explicit broadcast hint
result = large_df.join(broadcast(small_df), "key")

# Increase broadcast threshold if needed
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")

# Sort-merge join for large tables
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")

# Bucket tables for frequent joins
df.write \
    .bucketBy(100, "customer_id") \
    .sortBy("customer_id") \
    .mode("overwrite") \
    .saveAsTable("bucketed_orders")
```

#### Caching Strategy

```python
# Cache when:
# 1. DataFrame is used multiple times
# 2. After expensive transformations
# 3. Before iterative operations

# Use MEMORY_AND_DISK for large datasets
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)

# Cache only necessary columns
df.select("id", "value").cache()

# Unpersist when done
df.unpersist()

# Check storage
spark.catalog.clearCache()  # Clear all caches
```

### Airflow DAG Patterns

#### Idempotent Tasks

```python
# Always design idempotent tasks
from datetime import datetime, timedelta

from airflow.decorators import dag, task

@dag(
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),  # static start_date keeps runs reproducible
    catchup=True,
    default_args={
        "retries": 3,
        "retry_delay": timedelta(minutes=5),
    }
)
def idempotent_etl():

    @task
    def extract(execution_date=None):
        """Idempotent extraction - same date always returns same data."""
        date_str = execution_date.strftime("%Y-%m-%d")

        # Query for the specific date only
        query = f"""
            SELECT * FROM source_table
            WHERE DATE(created_at) = '{date_str}'
        """
        return query_database(query)

    @task
    def transform(data):
        """Pure function - no side effects."""
        return [transform_record(r) for r in data]

    @task
    def load(data, execution_date=None):
        """Idempotent load - delete before insert or use MERGE."""
        date_str = execution_date.strftime("%Y-%m-%d")

        # Option 1: Delete and reinsert
        execute_sql(f"DELETE FROM target WHERE date = '{date_str}'")
        insert_data(data)

        # Option 2: Use MERGE/UPSERT
        # MERGE INTO target USING source ON target.id = source.id
        # WHEN MATCHED THEN UPDATE
        # WHEN NOT MATCHED THEN INSERT

    raw = extract()
    transformed = transform(raw)
    load(transformed)

dag = idempotent_etl()
```

#### Backfill Pattern

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def process_date(ds, **kwargs):
    """Process a single date - supports backfill."""
    logical_date = datetime.strptime(ds, "%Y-%m-%d")

    # Always process the specific date, not "latest"
    data = extract_for_date(logical_date)
    transformed = transform(data)

    # Use a partition/date-specific target
    load_to_partition(transformed, partition=ds)

with DAG(
    "backfillable_etl",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=True,  # Enable backfill
    max_active_runs=3,  # Limit parallel backfills
) as dag:

    process = PythonOperator(
        task_id="process",
        python_callable=process_date,
    )

# Backfill command:
# airflow dags backfill -s 2024-01-01 -e 2024-01-31 backfillable_etl
```

---

## Stream Processing

### Apache Kafka Architecture

#### Topic Design

```bash
# Create topic with proper configuration:
#   retention.ms=604800000       -> 7 days
#   retention.bytes=107374182400 -> 100 GB
#   min.insync.replicas=2        -> durability (with acks=all)
#   segment.bytes=1073741824     -> 1 GB segments
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 24 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config retention.bytes=107374182400 \
  --config cleanup.policy=delete \
  --config min.insync.replicas=2 \
  --config segment.bytes=1073741824
```

**Partition Count Guidelines:**

| Throughput     | Partitions | Notes                           |
|----------------|------------|---------------------------------|
| < 10K msg/s    | 6-12       | Single consumer can handle      |
| 10K-100K msg/s | 24-48      | Multiple consumers needed       |
| > 100K msg/s   | 100+       | Scale consumers with partitions |
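
A common back-of-the-envelope estimate is to divide the target topic throughput by what a single partition sustains on your hardware, then add headroom. The helper below is only a sketch; the per-partition throughput figure is an assumption you should replace with a measured value:

```python
import math

def estimate_partitions(
    target_msgs_per_sec: int,
    per_partition_msgs_per_sec: int = 5_000,  # assumed; benchmark your cluster
    headroom: float = 1.5,
) -> int:
    """Rough partition count from target throughput (illustrative only)."""
    needed = target_msgs_per_sec / per_partition_msgs_per_sec
    return max(6, math.ceil(needed * headroom))

print(estimate_partitions(50_000))  # 15 -> round up to a broker-friendly count such as 24
```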

**Partition Key Selection:**

```python
# Good partition keys: Even distribution, related data together
# For user events: user_id (events for same user on same partition)
# For orders: order_id (if no ordering needed) or customer_id (if needed)

import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    key_serializer=lambda k: k.encode('utf-8')
)

def send_event(event):
    # Use user_id as key for user-based partitioning
    producer.send(
        topic='user-events',
        key=event['user_id'],  # Partition key
        value=event
    )
```

### Spark Structured Streaming

#### Watermarks and Late Data

```python
from pyspark.sql.functions import col, from_json, window

# Read stream (`schema` is the event StructType, defined elsewhere)
events = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "events") \
    .load() \
    .select(from_json(col("value").cast("string"), schema).alias("data")) \
    .select("data.*")

# Add watermark for late data handling
# Data arriving more than 10 minutes late will be dropped
windowed_counts = events \
    .withWatermark("event_time", "10 minutes") \
    .groupBy(
        window("event_time", "5 minutes", "1 minute"),  # 5-min windows, 1-min slide
        "event_type"
    ) \
    .count()

# Write with append mode (only final results for complete windows)
query = windowed_counts.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/windowed_counts") \
    .start()
```

**Watermark Behavior:**

```
Timeline: ─────────────────────────────────────────────▶
Events:      E1        E2        E3      E4 (late)    E5
             │         │         │       │            │
Time:      10:00     10:02     10:05   10:03        10:15
                                 ▲        ▲
                                 │        │
                    Current watermark     Arrives at 10:15
                    = max_event_time      but event_time = 10:03
                      - threshold
                    = 10:05 - 10 min      If watermark > event_time,
                    = 9:55                the event is dropped (too late)
```

In this example the watermark is 9:55, so E4 (event_time 10:03) is still accepted even though it arrives at 10:15; an event with an event_time earlier than 9:55 would be dropped.
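
The same arithmetic can be sketched in plain Python, which makes the accept/drop decision explicit (illustrative only; this is not the Spark API):

```python
from datetime import datetime, timedelta

threshold = timedelta(minutes=10)
max_event_time_seen = datetime(2024, 1, 1, 10, 5)  # E3
watermark = max_event_time_seen - threshold         # 09:55

def is_accepted(event_time: datetime) -> bool:
    """An event is kept only if its event_time is not older than the watermark."""
    return event_time >= watermark

print(is_accepted(datetime(2024, 1, 1, 10, 3)))  # True  - E4 is late but within threshold
print(is_accepted(datetime(2024, 1, 1, 9, 50)))  # False - older than the watermark, dropped
```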

#### Stateful Operations

```python
# Session windows with arbitrary stateful processing.
# Note: flatMapGroupsWithState is the Scala/Java API; PySpark (3.4+) exposes the
# equivalent applyInPandasWithState, used below.
# `events` is a streaming DataFrame with user_id, timestamp, and value columns.
from typing import Iterator, Tuple

import pandas as pd
from pyspark.sql.streaming.state import GroupState, GroupStateTimeout

def session_aggregation(
    key: Tuple[str],
    pdfs: Iterator[pd.DataFrame],
    state: GroupState,
) -> Iterator[pd.DataFrame]:
    """Aggregate events into sessions with a 30-minute inactivity timeout."""
    (user_id,) = key

    if state.hasTimedOut:
        # Session expired after 30 minutes of inactivity: emit it and clear state
        session_start, event_count, total_value = state.get
        state.remove()
        yield pd.DataFrame({
            "user_id": [user_id],
            "session_start": [session_start],
            "event_count": [event_count],
            "total_value": [total_value],
        })
        return

    # Get or initialize state: (session_start, event_count, total_value)
    if state.exists:
        session_start, event_count, total_value = state.get
    else:
        session_start, event_count, total_value = None, 0, 0.0

    # Fold the new micro-batch of events into the session
    for pdf in pdfs:
        if len(pdf) == 0:
            continue
        if session_start is None:
            session_start = pdf["timestamp"].min()
        event_count += len(pdf)
        total_value += float(pdf["value"].sum())

    state.update((session_start, event_count, total_value))
    # Reset the inactivity timeout on every batch of activity
    state.setTimeoutDuration("30 minutes")
    return  # nothing emitted until the session times out

# Apply the stateful operation
sessions = events.groupBy("user_id").applyInPandasWithState(
    session_aggregation,
    outputStructType="user_id STRING, session_start TIMESTAMP, event_count LONG, total_value DOUBLE",
    stateStructType="session_start TIMESTAMP, event_count LONG, total_value DOUBLE",
    outputMode="append",
    timeoutConf=GroupStateTimeout.ProcessingTimeTimeout,
)
```

---

## Exactly-Once Semantics

### Producer Idempotence

```python
import json

from kafka import KafkaProducer

# Enable idempotent producer
# (parameter names follow the Kafka producer API; check that your client
# version supports idempotence -- the Java client and confluent-kafka do)
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',                               # Wait for all in-sync replicas
    enable_idempotence=True,                  # Exactly-once per partition
    max_in_flight_requests_per_connection=5,  # Maximum allowed with idempotence
    retries=2147483647,                       # Effectively infinite retries
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

# The broker deduplicates based on the producer ID and per-partition sequence numbers
for i in range(100):
    producer.send('topic', {'id': i, 'data': 'value'})

producer.flush()
```

### Transactional Processing

```python
from kafka import KafkaConsumer, KafkaProducer
from kafka.errors import KafkaError
from kafka.structs import OffsetAndMetadata

# Transactional producer
# (parameter and method names follow the Kafka producer API; transactions
# require a client version that implements them)
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    transactional_id='my-transactional-id',  # Enable transactions
    enable_idempotence=True,
    acks='all'
)

producer.init_transactions()

def process_with_transactions(consumer, producer):
    """Read-process-write with exactly-once semantics."""

    try:
        producer.begin_transaction()

        # Read
        records = consumer.poll(timeout_ms=1000)

        for tp, messages in records.items():
            for message in messages:
                # Process
                result = transform(message.value)

                # Write to output topic
                producer.send('output-topic', result)

        # Commit the consumed offsets and the produced records atomically
        offsets = {
            tp: OffsetAndMetadata(consumer.position(tp), "")
            for tp in consumer.assignment()
        }
        producer.send_offsets_to_transaction(offsets, consumer.group_id)
        producer.commit_transaction()

    except KafkaError:
        producer.abort_transaction()
        raise
```

### Spark Exactly-Once to External Systems

```python
from pyspark.sql.functions import lit

# Use foreachBatch with idempotent writes
def write_to_database_idempotent(batch_df, batch_id):
    """Write a micro-batch with exactly-once semantics."""

    # Add batch_id for deduplication
    batch_with_id = batch_df.withColumn("batch_id", lit(batch_id))

    # Stage the batch, then MERGE for an idempotent write
    batch_with_id.write \
        .format("jdbc") \
        .option("url", "jdbc:postgresql://localhost/db") \
        .option("dbtable", "staging_events") \
        .option("driver", "org.postgresql.Driver") \
        .mode("append") \
        .save()

    # Merge staging to final (idempotent); execute_sql is a helper that runs
    # SQL against the target database. Note: MERGE ... SET * is warehouse-style
    # syntax (Delta/Snowflake); on PostgreSQL use INSERT ... ON CONFLICT DO UPDATE.
    execute_sql("""
        MERGE INTO events AS target
        USING staging_events AS source
        ON target.event_id = source.event_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

    # Clean staging
    execute_sql("TRUNCATE staging_events")

query = events.writeStream \
    .foreachBatch(write_to_database_idempotent) \
    .option("checkpointLocation", "/checkpoints/to-postgres") \
    .start()
```

---

## Error Handling

### Dead Letter Queue (DLQ)

```python
import json
from datetime import datetime

from kafka import KafkaProducer

MAX_RETRIES = 3  # retry budget before parking a record in the DLQ

class DeadLetterQueue:
    """Handle failed records with the dead letter queue pattern."""

    def __init__(self, dlq_topic: str, producer: KafkaProducer):
        self.dlq_topic = dlq_topic
        self.producer = producer

    def send_to_dlq(self, record, error: Exception, context: dict):
        """Send failed record to the DLQ with error metadata."""

        dlq_record = {
            "original_record": record,
            "error_type": type(error).__name__,
            "error_message": str(error),
            "timestamp": datetime.utcnow().isoformat(),
            "context": context,
            "retry_count": context.get("retry_count", 0)
        }

        self.producer.send(
            self.dlq_topic,
            value=json.dumps(dlq_record).encode('utf-8')
        )

# ValidationError and TemporaryError are application-defined exception types
def process_with_dlq(consumer, processor, dlq):
    """Process records with a DLQ for failures."""

    for message in consumer:
        try:
            result = processor.process(message.value)
            # Success - commit offset
            consumer.commit()

        except ValidationError as e:
            # Non-retryable - send to DLQ immediately
            dlq.send_to_dlq(
                message.value,
                e,
                {"topic": message.topic, "partition": message.partition}
            )
            consumer.commit()  # Don't retry

        except TemporaryError as e:
            # Retryable - don't commit, let the consumer retry.
            # After max retries, send to DLQ.
            # (Kafka message headers are a list of (key, bytes) tuples; the dict
            # lookup below assumes they have already been decoded.)
            retry_count = message.headers.get("retry_count", 0)
            if retry_count >= MAX_RETRIES:
                dlq.send_to_dlq(message.value, e, {"retry_count": retry_count})
                consumer.commit()
            else:
                raise  # Will be retried
```
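
Records parked in the DLQ usually need a replay path once the underlying issue is fixed. A minimal sketch of such a reprocessor, assuming DLQ payloads follow the `dlq_record` structure above and that the topic names are illustrative:

```python
import json

from kafka import KafkaConsumer, KafkaProducer

def replay_dlq(dlq_topic: str = "events-dlq", target_topic: str = "events",
               max_retries: int = 3):
    """Re-publish DLQ records to the original topic (sketch)."""
    consumer = KafkaConsumer(
        dlq_topic,
        bootstrap_servers=['localhost:9092'],
        group_id='dlq-replay',
        value_deserializer=lambda v: json.loads(v.decode('utf-8')),
    )
    producer = KafkaProducer(
        bootstrap_servers=['localhost:9092'],
        value_serializer=lambda v: json.dumps(v).encode('utf-8'),
    )

    for message in consumer:
        dlq_record = message.value
        if dlq_record.get("retry_count", 0) >= max_retries:
            continue  # leave permanently failed records for manual inspection

        # Re-publish the original payload for another processing attempt
        producer.send(target_topic, value=dlq_record["original_record"])
        consumer.commit()
```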

### Circuit Breaker

```python
import threading
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum

class CircuitState(Enum):
    CLOSED = "closed"        # Normal operation
    OPEN = "open"            # Failing, reject calls
    HALF_OPEN = "half_open"  # Testing if recovered

class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

@dataclass
class CircuitBreaker:
    """Circuit breaker for external service calls."""

    failure_threshold: int = 5
    recovery_timeout: timedelta = timedelta(seconds=30)
    success_threshold: int = 3

    def __post_init__(self):
        self.state = CircuitState.CLOSED
        self.failure_count = 0
        self.success_count = 0
        self.last_failure_time = None
        self.lock = threading.Lock()

    def call(self, func, *args, **kwargs):
        """Execute a function with circuit breaker protection."""

        with self.lock:
            if self.state == CircuitState.OPEN:
                if self._should_attempt_reset():
                    self.state = CircuitState.HALF_OPEN
                else:
                    raise CircuitOpenError("Circuit is open")

        # Call outside the lock so _record_success/_record_failure can re-acquire it
        try:
            result = func(*args, **kwargs)
            self._record_success()
            return result

        except Exception:
            self._record_failure()
            raise

    def _record_success(self):
        with self.lock:
            if self.state == CircuitState.HALF_OPEN:
                self.success_count += 1
                if self.success_count >= self.success_threshold:
                    self.state = CircuitState.CLOSED
                    self.failure_count = 0
                    self.success_count = 0
            elif self.state == CircuitState.CLOSED:
                self.failure_count = 0

    def _record_failure(self):
        with self.lock:
            self.failure_count += 1
            self.last_failure_time = datetime.now()

            if self.state == CircuitState.HALF_OPEN:
                self.state = CircuitState.OPEN
                self.success_count = 0
            elif self.failure_count >= self.failure_threshold:
                self.state = CircuitState.OPEN

    def _should_attempt_reset(self):
        if self.last_failure_time is None:
            return True
        return datetime.now() - self.last_failure_time >= self.recovery_timeout

# Usage
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=timedelta(seconds=60))

def call_external_api(data):
    return circuit.call(external_api.process, data)
```

---

## Data Ingestion Patterns

### Change Data Capture (CDC)

Using Debezium with Kafka Connect, register a source connector (`connector-config.json`):

```json
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "password",
    "database.dbname": "source_db",
    "database.server.name": "source",
    "table.include.list": "public.orders,public.customers",
    "plugin.name": "pgoutput",
    "publication.name": "dbz_publication",
    "slot.name": "debezium_slot",
    "transforms": "unwrap",
    "transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
    "transforms.unwrap.drop.tombstones": "false"
  }
}
```
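
The connector JSON is registered through the Kafka Connect REST API; a minimal sketch using `requests` (the Connect host name and port are assumptions for a typical deployment):

```python
import json

import requests

CONNECT_URL = "http://kafka-connect:8083"  # assumed Connect host/port

with open("connector-config.json") as f:
    connector_config = json.load(f)

# POST /connectors creates the connector; a 409 response means it already exists
response = requests.post(
    f"{CONNECT_URL}/connectors",
    headers={"Content-Type": "application/json"},
    data=json.dumps(connector_config),
    timeout=30,
)
response.raise_for_status()

# Check connector status
status = requests.get(
    f"{CONNECT_URL}/connectors/postgres-cdc-connector/status", timeout=30
).json()
print(status["connector"]["state"])  # e.g. "RUNNING"
```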

**Processing CDC Events:**

```python
def process_cdc_event(event):
    """Process Debezium CDC event."""

    operation = event.get("op")

    if operation == "c":    # Create (INSERT)
        after = event.get("after")
        return {"action": "insert", "data": after}

    elif operation == "u":  # Update
        before = event.get("before")
        after = event.get("after")
        return {"action": "update", "before": before, "after": after}

    elif operation == "d":  # Delete
        before = event.get("before")
        return {"action": "delete", "data": before}

    elif operation == "r":  # Read (snapshot)
        after = event.get("after")
        return {"action": "snapshot", "data": after}
```
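
Downstream, those action dictionaries typically drive an upsert or delete against the target table. A minimal sketch, assuming a hypothetical `orders` target keyed by `id` and a hypothetical `execute_sql(sql, params)` helper that runs parameterized SQL:

```python
def apply_cdc_action(action: dict, table: str = "orders", key: str = "id"):
    """Apply a processed CDC action to the target table (illustrative)."""
    if action["action"] in ("insert", "snapshot", "update"):
        row = action.get("data") or action.get("after")
        columns = ", ".join(row.keys())
        placeholders = ", ".join(["%s"] * len(row))
        updates = ", ".join(f"{c} = EXCLUDED.{c}" for c in row if c != key)
        # PostgreSQL-style upsert; adapt to MERGE on warehouse targets
        execute_sql(
            f"INSERT INTO {table} ({columns}) VALUES ({placeholders}) "
            f"ON CONFLICT ({key}) DO UPDATE SET {updates}",
            list(row.values()),
        )
    elif action["action"] == "delete":
        execute_sql(f"DELETE FROM {table} WHERE {key} = %s", [action["data"][key]])
```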

### Bulk Ingestion

```python
# Efficient bulk loading to the data warehouse
import uuid

import boto3

class BulkIngester:
    """Bulk ingest data to Snowflake via S3."""

    def __init__(self, s3_bucket: str, snowflake_conn):
        self.s3 = boto3.client('s3')
        self.bucket = s3_bucket
        self.snowflake = snowflake_conn

    def ingest_dataframe(self, df, table_name: str, mode: str = "append"):
        """Bulk ingest a Spark DataFrame to Snowflake."""

        # 1. Write to S3 as Parquet (compressed, columnar)
        s3_path = f"s3://{self.bucket}/staging/{table_name}/{uuid.uuid4()}"
        df.write.parquet(s3_path)

        # 2. Create external stage if not exists
        self.snowflake.execute(f"""
            CREATE STAGE IF NOT EXISTS {table_name}_stage
            URL = '{s3_path}'
            CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
            FILE_FORMAT = (TYPE = 'PARQUET')
        """)

        # 3. COPY INTO (much faster than INSERT)
        if mode == "overwrite":
            self.snowflake.execute(f"TRUNCATE TABLE {table_name}")

        self.snowflake.execute(f"""
            COPY INTO {table_name}
            FROM @{table_name}_stage
            FILE_FORMAT = (TYPE = 'PARQUET')
            MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
            ON_ERROR = 'CONTINUE'
        """)

        # 4. Clean up staging files
        self._cleanup_s3(s3_path)
```

---

## Orchestration

### Dependency Management

```python
from datetime import timedelta

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.task_group import TaskGroup

with DAG("complex_pipeline") as dag:

    # Wait for upstream DAG
    wait_for_source = ExternalTaskSensor(
        task_id="wait_for_source_etl",
        external_dag_id="source_etl_dag",
        external_task_id="final_task",
        execution_delta=timedelta(hours=0),
        timeout=3600,
        mode="poke",
        poke_interval=60,
    )

    # Parallel extraction group
    with TaskGroup("extract") as extract_group:
        extract_orders = PythonOperator(
            task_id="extract_orders",
            python_callable=extract_orders_func,
        )
        extract_customers = PythonOperator(
            task_id="extract_customers",
            python_callable=extract_customers_func,
        )
        extract_products = PythonOperator(
            task_id="extract_products",
            python_callable=extract_products_func,
        )

    # Sequential transformation
    with TaskGroup("transform") as transform_group:
        join_data = PythonOperator(
            task_id="join_data",
            python_callable=join_func,
        )
        aggregate = PythonOperator(
            task_id="aggregate",
            python_callable=aggregate_func,
        )
        join_data >> aggregate

    # Load
    load = PythonOperator(
        task_id="load",
        python_callable=load_func,
    )

    # Define dependencies
    wait_for_source >> extract_group >> transform_group >> load
```

### Dynamic DAG Generation

```python
from datetime import datetime

import yaml

from airflow import DAG
from airflow.operators.python import PythonOperator

def create_etl_dag(config: dict) -> DAG:
    """Factory function to create ETL DAGs from config."""

    dag = DAG(
        dag_id=f"etl_{config['source']}_{config['destination']}",
        schedule_interval=config.get('schedule', '@daily'),
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=['etl', 'auto-generated'],
    )

    with dag:
        extract = PythonOperator(
            task_id='extract',
            python_callable=create_extract_func(config['source']),
        )

        transform = PythonOperator(
            task_id='transform',
            python_callable=create_transform_func(config['transformations']),
        )

        load = PythonOperator(
            task_id='load',
            python_callable=create_load_func(config['destination']),
        )

        extract >> transform >> load

    return dag

# Load configurations
with open('/config/etl_pipelines.yaml') as f:
    configs = yaml.safe_load(f)

# Generate DAGs
for config in configs['pipelines']:
    dag_id = f"etl_{config['source']}_{config['destination']}"
    globals()[dag_id] = create_etl_dag(config)
```
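
Each entry in `etl_pipelines.yaml` only needs the keys the factory reads (`source`, `destination`, optional `schedule`, and `transformations`). A hypothetical entry, shown here as the Python dict it deserializes to:

```python
# One deserialized entry from configs['pipelines'] (values are illustrative)
example_config = {
    "source": "postgres_orders",
    "destination": "warehouse_orders",
    "schedule": "@hourly",
    "transformations": ["deduplicate", "normalize_currency"],
}

# The factory turns it into a DAG with id "etl_postgres_orders_warehouse_orders"
example_dag = create_etl_dag(example_config)
```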