Data Pipeline Architecture
Comprehensive guide to designing and implementing production data pipelines.
Table of Contents
- Architecture Patterns
- Batch Processing
- Stream Processing
- Exactly-Once Semantics
- Error Handling
- Data Ingestion Patterns
- Orchestration
Architecture Patterns
Lambda Architecture
The Lambda architecture runs a batch layer and a speed (streaming) layer in parallel, so queries can combine complete, reprocessable historical results with low-latency views of recent data.
┌─────────────────────────────────────┐
│ Data Sources │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ Message Queue (Kafka) │
└───────┬─────────────────┬───────────┘
│ │
┌─────────────▼─────┐ ┌───────▼─────────────┐
│ Batch Layer │ │ Speed Layer │
│ (Spark/Airflow) │ │ (Flink/Spark SS) │
└─────────────┬─────┘ └───────┬─────────────┘
│ │
┌─────────────▼─────┐ ┌───────▼─────────────┐
│ Master Dataset │ │ Real-time Views │
│ (Data Lake) │ │ (Redis/Druid) │
└─────────────┬─────┘ └───────┬─────────────┘
│ │
┌───────▼─────────────────▼───────┐
│ Serving Layer │
│ (Merged Batch + Real-time) │
└─────────────────────────────────┘
Components:
- Batch Layer
  - Processes complete historical data
  - Creates precomputed batch views
  - Handles complex transformations, ML training
  - Reprocessable from raw data
- Speed Layer
  - Processes real-time data stream
  - Creates real-time views for recent data
  - Low latency, simpler transformations
  - Compensates for batch layer delay
- Serving Layer
  - Merges batch and real-time views
  - Responds to queries
  - Provides unified interface
Implementation Example:
# Batch layer: Daily aggregation with Spark
from pyspark.sql.functions import count, max, sum  # note: intentionally shadows Python built-ins
def batch_daily_aggregation(spark, date):
"""Process full day of data for batch views."""
raw_df = spark.read.parquet(f"s3://data-lake/raw/events/date={date}")
aggregated = raw_df.groupBy("user_id", "event_type") \
.agg(
count("*").alias("event_count"),
sum("revenue").alias("total_revenue"),
max("timestamp").alias("last_event")
)
aggregated.write \
.mode("overwrite") \
.partitionBy("event_type") \
.parquet(f"s3://data-lake/batch-views/daily_agg/date={date}")
# Speed layer: Real-time aggregation with Spark Structured Streaming
# (the "redis" sink below assumes the third-party spark-redis connector is on the classpath)
from pyspark.sql.functions import col, count, from_json, window
def speed_realtime_aggregation(spark):
"""Process streaming data for real-time views."""
stream_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "kafka:9092") \
.option("subscribe", "events") \
.load()
parsed = stream_df.select(
from_json(col("value").cast("string"), event_schema).alias("data")
).select("data.*")
aggregated = parsed \
.withWatermark("timestamp", "5 minutes") \
.groupBy(
window("timestamp", "1 minute"),
"user_id",
"event_type"
) \
.agg(count("*").alias("event_count"))
query = aggregated.writeStream \
.format("redis") \
.option("host", "redis") \
.outputMode("update") \
.start()
return query
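The serving layer answers queries by merging the precomputed batch views with the real-time views. A minimal sketch, assuming the batch views live at the paths written above and the speed layer keeps per-day counters in Redis under an illustrative key scheme:
import redis

def serve_event_count(spark, user_id: str, event_type: str, date: str) -> int:
    """Merge batch and real-time views for a single query (illustrative)."""
    # Batch view: precomputed daily aggregate written by the batch layer
    batch_df = spark.read.parquet(f"s3://data-lake/batch-views/daily_agg/date={date}")
    row = (
        batch_df
        .filter((batch_df.user_id == user_id) & (batch_df.event_type == event_type))
        .selectExpr("coalesce(sum(event_count), 0) AS c")
        .collect()[0]
    )
    batch_count = int(row["c"])
    # Real-time view: counts accumulated by the speed layer since the last batch run
    r = redis.Redis(host="redis", port=6379)
    realtime_count = int(r.get(f"rt:{date}:{user_id}:{event_type}") or 0)
    return batch_count + realtime_count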
Kappa Architecture
Kappa simplifies Lambda by using only stream processing with replay capability.
┌─────────────────────────────────────┐
│ Data Sources │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ Immutable Log (Kafka/Kinesis) │
│ (Long retention) │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ Stream Processor │
│ (Flink/Spark Streaming) │
└─────────────────┬───────────────────┘
│
┌─────────────────▼───────────────────┐
│ Serving Layer │
│ (Database/Data Warehouse) │
└─────────────────────────────────────┘
Key Principles:
- Single Processing Path: All data processed as streams
- Immutable Log: Kafka/Kinesis as source of truth with long retention
- Reprocessing via Replay: Re-run stream processor from beginning when needed
Reprocessing Strategy:
# Reprocessing in Kappa architecture
import uuid
from datetime import datetime

from kafka import KafkaConsumer, TopicPartition
class KappaReprocessor:
"""Handle reprocessing by replaying from Kafka."""
def __init__(self, kafka_config, flink_job):
self.kafka = kafka_config
self.job = flink_job
def reprocess(self, from_timestamp: str):
"""Reprocess all data from a specific timestamp."""
# 1. Start new consumer group reading from timestamp
new_consumer_group = f"reprocess-{uuid.uuid4()}"
# 2. Configure stream processor with new group
self.job.set_config({
"group.id": new_consumer_group,
"auto.offset.reset": "none" # We'll set offset manually
})
# 3. Seek to timestamp
offsets = self._get_offsets_for_timestamp(from_timestamp)
self.job.seek_to_offsets(offsets)
# 4. Write to new output table/topic
output_table = f"events_reprocessed_{datetime.now().strftime('%Y%m%d')}"
self.job.set_output(output_table)
# 5. Run until caught up
self.job.run_until_caught_up()
# 6. Swap output tables atomically
self._atomic_table_swap("events", output_table)
    def _get_offsets_for_timestamp(self, timestamp):
        """Get Kafka offsets for a specific timestamp (epoch milliseconds)."""
        consumer = KafkaConsumer(bootstrap_servers=self.kafka["brokers"])
        partitions = consumer.partitions_for_topic("events")
        offsets = {}
        for partition in partitions:
            tp = TopicPartition("events", partition)
            result = consumer.offsets_for_times({tp: timestamp})
            if result[tp] is not None:  # None when no message exists at/after the timestamp
                offsets[tp] = result[tp].offset
        return offsets
Medallion Architecture (Bronze/Silver/Gold)
Common in data lakehouses (Databricks, Delta Lake).
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Bronze │────▶│ Silver │────▶│ Gold │
│ (Raw Data) │ │ (Cleansed) │ │ (Analytics) │
└─────────────┘ └─────────────┘ └─────────────┘
- Bronze: landing zone, append-only, schema evolution allowed
- Silver: validated, deduplicated, standardized
- Gold: aggregated, business-ready, star schema
Implementation with Delta Lake:
# Bronze: Raw ingestion
from pyspark.sql.functions import current_timestamp, input_file_name
def ingest_to_bronze(spark, source_path, bronze_path):
"""Ingest raw data to bronze layer."""
df = spark.read.format("json").load(source_path)
# Add metadata
df = df.withColumn("_ingested_at", current_timestamp()) \
.withColumn("_source_file", input_file_name())
df.write \
.format("delta") \
.mode("append") \
.option("mergeSchema", "true") \
.save(bronze_path)
# Silver: Cleansing and validation
from delta.tables import DeltaTable
from pyspark.sql.functions import col, to_date

def bronze_to_silver(spark, bronze_path, silver_path):
    """Transform bronze to silver with cleansing."""
    # Read last processed version (helper maintained elsewhere)
    last_version = get_last_processed_version(silver_path, "bronze")
    # Get only new records: read the bronze Change Data Feed from the next commit.
    # (A plain Delta read does not expose _commit_version; CDF requires the table
    # property delta.enableChangeDataFeed = true on the bronze table.)
    new_records = (
        spark.read.format("delta")
        .option("readChangeFeed", "true")
        .option("startingVersion", int(last_version) + 1)
        .load(bronze_path)
    )
# Cleanse and validate
silver_df = new_records \
.filter(col("user_id").isNotNull()) \
.filter(col("event_type").isin(["click", "view", "purchase"])) \
.withColumn("event_date", to_date("timestamp")) \
.dropDuplicates(["event_id"])
# Merge to silver (upsert)
silver_table = DeltaTable.forPath(spark, silver_path)
silver_table.alias("target") \
.merge(
silver_df.alias("source"),
"target.event_id = source.event_id"
) \
.whenMatchedUpdateAll() \
.whenNotMatchedInsertAll() \
.execute()
# Gold: Business aggregations
from pyspark.sql.functions import col, count, countDistinct, max, sum, when
def silver_to_gold(spark, silver_path, gold_path):
"""Create business-ready aggregations in gold layer."""
silver_df = spark.read.format("delta").load(silver_path)
# Daily user metrics
daily_metrics = silver_df \
.groupBy("user_id", "event_date") \
.agg(
count("*").alias("total_events"),
countDistinct("session_id").alias("sessions"),
sum(when(col("event_type") == "purchase", col("revenue")).otherwise(0)).alias("revenue"),
max("timestamp").alias("last_activity")
)
# Write as gold table
daily_metrics.write \
.format("delta") \
.mode("overwrite") \
.partitionBy("event_date") \
.save(gold_path + "/daily_user_metrics")
Batch Processing
Apache Spark Best Practices
Memory Management
# Optimal Spark configuration for batch jobs
spark = SparkSession.builder \
.appName("BatchETL") \
.config("spark.executor.memory", "8g") \
.config("spark.executor.cores", "4") \
.config("spark.driver.memory", "4g") \
.config("spark.sql.shuffle.partitions", "200") \
.config("spark.sql.adaptive.enabled", "true") \
.config("spark.sql.adaptive.coalescePartitions.enabled", "true") \
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer") \
.getOrCreate()
Memory Tuning Guidelines:
| Data Size | Executors | Memory/Executor | Cores/Executor |
|---|---|---|---|
| < 10 GB | 2-4 | 4-8 GB | 2-4 |
| 10-100 GB | 10-20 | 8-16 GB | 4-8 |
| 100+ GB | 50+ | 16-32 GB | 4-8 |
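These guidelines can be folded into a small sizing helper; the thresholds below simply mirror the table and are a starting point to refine against real job metrics (the function name and return shape are illustrative):
def suggest_spark_sizing(data_size_gb: float) -> dict:
    """Map input data volume to a starting executor configuration (rule of thumb)."""
    if data_size_gb < 10:
        return {"instances": 4, "executor_memory": "8g", "executor_cores": 4}
    if data_size_gb < 100:
        return {"instances": 20, "executor_memory": "16g", "executor_cores": 6}
    return {"instances": 50, "executor_memory": "32g", "executor_cores": 8}

# Example: feed the suggestion into the SparkSession builder shown above
sizing = suggest_spark_sizing(45)
# .config("spark.executor.instances", str(sizing["instances"])) ...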
Partition Optimization
# Repartition vs Coalesce
# Repartition: full shuffle; use to increase partition count or redistribute by key
df_repartitioned = df.repartition(100, "date")  # Partition by column
# Coalesce: no shuffle; only decreases partition count
df_coalesced = df.coalesce(10)  # Reduce partitions without shuffle
# Optimal partition size: 128-256 MB each
# Calculate partitions:
# num_partitions = total_data_size_mb / 200
# Check current partitions
print(f"Current partitions: {df.rdd.getNumPartitions()}")
# Repartition for optimal join performance
large_df = large_df.repartition(200, "join_key")
small_df = small_df.repartition(200, "join_key")
result = large_df.join(small_df, "join_key")
Join Optimization
# Broadcast join for small tables (< 10MB by default)
from pyspark.sql.functions import broadcast
# Explicit broadcast hint
result = large_df.join(broadcast(small_df), "key")
# Increase broadcast threshold if needed
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "100m")
# Sort-merge join for large tables
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
# Bucket tables for frequent joins
df.write \
.bucketBy(100, "customer_id") \
.sortBy("customer_id") \
.mode("overwrite") \
.saveAsTable("bucketed_orders")
Caching Strategy
# Cache when:
# 1. DataFrame is used multiple times
# 2. After expensive transformations
# 3. Before iterative operations
# Use MEMORY_AND_DISK for large datasets
from pyspark import StorageLevel
df.persist(StorageLevel.MEMORY_AND_DISK)
# Cache only necessary columns
df.select("id", "value").cache()
# Unpersist when done
df.unpersist()
# Check storage
spark.catalog.clearCache() # Clear all caches
Airflow DAG Patterns
Idempotent Tasks
# Always design idempotent tasks
from airflow.decorators import dag, task
from airflow.utils.dates import days_ago
from datetime import timedelta
@dag(
schedule_interval="@daily",
start_date=days_ago(7),
catchup=True,
default_args={
"retries": 3,
"retry_delay": timedelta(minutes=5),
}
)
def idempotent_etl():
@task
def extract(execution_date=None):
"""Idempotent extraction - same date always returns same data."""
date_str = execution_date.strftime("%Y-%m-%d")
# Query for specific date only
query = f"""
SELECT * FROM source_table
WHERE DATE(created_at) = '{date_str}'
"""
return query_database(query)
@task
def transform(data):
"""Pure function - no side effects."""
return [transform_record(r) for r in data]
@task
def load(data, execution_date=None):
"""Idempotent load - delete before insert or use MERGE."""
date_str = execution_date.strftime("%Y-%m-%d")
# Option 1: Delete and reinsert
execute_sql(f"DELETE FROM target WHERE date = '{date_str}'")
insert_data(data)
# Option 2: Use MERGE/UPSERT
# MERGE INTO target USING source ON target.id = source.id
# WHEN MATCHED THEN UPDATE
# WHEN NOT MATCHED THEN INSERT
raw = extract()
transformed = transform(raw)
load(transformed)
dag = idempotent_etl()
Backfill Pattern
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime, timedelta
def process_date(ds, **kwargs):
"""Process a single date - supports backfill."""
logical_date = datetime.strptime(ds, "%Y-%m-%d")
# Always process specific date, not "latest"
data = extract_for_date(logical_date)
transformed = transform(data)
# Use partition/date-specific target
load_to_partition(transformed, partition=ds)
with DAG(
"backfillable_etl",
schedule_interval="@daily",
start_date=datetime(2024, 1, 1),
catchup=True, # Enable backfill
max_active_runs=3, # Limit parallel backfills
) as dag:
    process = PythonOperator(
        task_id="process",
        python_callable=process_date,
        # provide_context is not needed in Airflow 2: context kwargs such as `ds`
        # are passed to the callable automatically
    )
# Backfill command:
# airflow dags backfill -s 2024-01-01 -e 2024-01-31 backfillable_etl
Stream Processing
Apache Kafka Architecture
Topic Design
# Create topic with proper configuration
# retention.ms=604800000 (7 days), retention.bytes=107374182400 (100 GB),
# min.insync.replicas=2 (durability), segment.bytes=1073741824 (1 GB segments)
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 24 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config retention.bytes=107374182400 \
  --config cleanup.policy=delete \
  --config min.insync.replicas=2 \
  --config segment.bytes=1073741824
Partition Count Guidelines:
| Throughput | Partitions | Notes |
|---|---|---|
| < 10K msg/s | 6-12 | Single consumer can handle |
| 10K-100K msg/s | 24-48 | Multiple consumers needed |
| > 100K msg/s | 100+ | Scale consumers with partitions |
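Consumers scale by sharing a consumer group: each process that joins the same group is assigned a subset of the partitions, so parallelism is capped by the partition count. A minimal sketch (the group name is illustrative):
from kafka import KafkaConsumer

# Start several copies of this process; each receives a share of the 24 partitions.
consumer = KafkaConsumer(
    "user-events",
    bootstrap_servers=["localhost:9092"],
    group_id="user-events-processors",
    enable_auto_commit=False,
)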
Partition Key Selection:
# Good partition keys: even distribution, related data kept together
# For user events: user_id (all events for the same user land on the same partition)
# For orders: customer_id if per-customer ordering matters, otherwise order_id for even spread
from kafka import KafkaProducer
import json
producer = KafkaProducer(
bootstrap_servers=['localhost:9092'],
value_serializer=lambda v: json.dumps(v).encode('utf-8'),
key_serializer=lambda k: k.encode('utf-8')
)
def send_event(event):
# Use user_id as key for user-based partitioning
producer.send(
topic='user-events',
key=event['user_id'], # Partition key
value=event
)
Spark Structured Streaming
Watermarks and Late Data
from pyspark.sql.functions import col, from_json, window
# Read stream
events = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "events") \
.load() \
.select(from_json(col("value").cast("string"), schema).alias("data")) \
.select("data.*")
# Add watermark for late data handling
# Data arriving more than 10 minutes late will be dropped
windowed_counts = events \
.withWatermark("event_time", "10 minutes") \
.groupBy(
window("event_time", "5 minutes", "1 minute"), # 5-min windows, 1-min slide
"event_type"
) \
.count()
# Write with append mode (only final results for complete windows)
query = windowed_counts.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/windowed_counts") \
.start()
Watermark Behavior:
| Event | Event time | Arrival time | Watermark at arrival | Outcome |
|---|---|---|---|---|
| E1 | 10:00 | 10:00 | (initial) | accepted |
| E2 | 10:02 | 10:02 | 09:50 | accepted |
| E3 | 10:05 | 10:05 | 09:52 | accepted |
| E4 (late) | 10:03 | 10:15 | 09:55 = 10:05 - 10 min | accepted (still newer than watermark) |
| E5 | 10:15 | 10:15 | 09:55 | accepted; watermark advances to 10:05 |
The watermark is the maximum event time seen so far minus the threshold. An event is dropped only when its event time is older than the current watermark: after E5 the watermark is 10:05, so a subsequent arrival with event_time 10:03 would be considered too late and dropped.
Stateful Operations
from pyspark.sql.streaming.state import GroupStateTimeout

# Session windows with arbitrary stateful processing.
# Note: flatMapGroupsWithState is the Scala/Java API; in PySpark the equivalent is
# DataFrame.groupBy(...).applyInPandasWithState(...) (Spark 3.4+). The sketch below
# keeps a simplified per-group callback shape to illustrate the state handling.
def session_aggregation(key, events, state):
    """Aggregate events into sessions with a 30-minute inactivity timeout."""
    if state.hasTimedOut:
        # Timeout fired: no new events for this key, emit the completed session
        session = state.get
        state.remove()
        yield {
            "user_id": key,
            "session_start": session["start"],
            "event_count": len(session["events"]),
            "total_value": session["total"],
        }
        return
    # Get or initialize state
    session = state.get if state.exists else {"start": None, "events": [], "total": 0}
    # Fold the new events into the session
    for event in events:
        if session["start"] is None:
            session["start"] = event.timestamp
        session["events"].append(event)
        session["total"] += event.value
    # Persist state and (re)arm the inactivity timeout (30 minutes, in milliseconds)
    state.update(session)
    state.setTimeoutDuration(30 * 60 * 1000)

# Apply the stateful operation (in PySpark, pass an output schema and a state schema
# to applyInPandasWithState; shown here in simplified form)
sessions = events \
    .groupByKey(lambda e: e.user_id) \
    .flatMapGroupsWithState(
        session_aggregation,
        outputMode="append",
        stateTimeout=GroupStateTimeout.ProcessingTimeTimeout
    )
Exactly-Once Semantics
Producer Idempotence
import json

from kafka import KafkaProducer

# Enable idempotent producer.
# Note: idempotence and transactions require a client that supports them
# (confluent-kafka, or a recent kafka-python release).
producer = KafkaProducer(
    bootstrap_servers=['localhost:9092'],
    acks='all',  # Wait for all in-sync replicas
    enable_idempotence=True,  # Exactly-once per partition
    max_in_flight_requests_per_connection=5,  # Maximum allowed with idempotence
    retries=2147483647,  # Effectively infinite retries
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)
# The broker deduplicates retried sends using the producer id and sequence numbers
for i in range(100):
producer.send('topic', {'id': i, 'data': 'value'})
producer.flush()
Transactional Processing
from kafka import KafkaProducer, KafkaConsumer
from kafka.errors import KafkaError
# Transactional producer
producer = KafkaProducer(
bootstrap_servers=['localhost:9092'],
transactional_id='my-transactional-id', # Enable transactions
enable_idempotence=True,
acks='all'
)
producer.init_transactions()
def process_with_transactions(consumer, producer):
"""Read-process-write with exactly-once semantics."""
try:
producer.begin_transaction()
# Read
records = consumer.poll(timeout_ms=1000)
for tp, messages in records.items():
for message in messages:
# Process
result = transform(message.value)
# Write to output topic
producer.send('output-topic', result)
        # Commit the consumed offsets atomically with the produced records
        # (offsets are supplied per assigned partition; exact types vary by client library)
        offsets = {tp: consumer.position(tp) for tp in consumer.assignment()}
        producer.send_offsets_to_transaction(offsets, consumer.group_id)
producer.commit_transaction()
except KafkaError as e:
producer.abort_transaction()
raise
Spark Exactly-Once to External Systems
# Use foreachBatch with idempotent writes
from pyspark.sql.functions import lit
def write_to_database_idempotent(batch_df, batch_id):
"""Write batch with exactly-once semantics."""
# Add batch_id for deduplication
batch_with_id = batch_df.withColumn("batch_id", lit(batch_id))
# Use MERGE for idempotent writes
batch_with_id.write \
.format("jdbc") \
.option("url", "jdbc:postgresql://localhost/db") \
.option("dbtable", "staging_events") \
.option("driver", "org.postgresql.Driver") \
.mode("append") \
.save()
# Merge staging to final (idempotent)
execute_sql("""
MERGE INTO events AS target
USING staging_events AS source
ON target.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
# Clean staging
execute_sql("TRUNCATE staging_events")
query = events.writeStream \
.foreachBatch(write_to_database_idempotent) \
.option("checkpointLocation", "/checkpoints/to-postgres") \
.start()
Error Handling
Dead Letter Queue (DLQ)
import json
from datetime import datetime

from kafka import KafkaProducer

class DeadLetterQueue:
"""Handle failed records with dead letter queue pattern."""
def __init__(self, dlq_topic: str, producer: KafkaProducer):
self.dlq_topic = dlq_topic
self.producer = producer
def send_to_dlq(self, record, error: Exception, context: dict):
"""Send failed record to DLQ with error metadata."""
dlq_record = {
"original_record": record,
"error_type": type(error).__name__,
"error_message": str(error),
"timestamp": datetime.utcnow().isoformat(),
"context": context,
"retry_count": context.get("retry_count", 0)
}
self.producer.send(
self.dlq_topic,
value=json.dumps(dlq_record).encode('utf-8')
)
def process_with_dlq(consumer, processor, dlq):
"""Process records with DLQ for failures."""
for message in consumer:
try:
result = processor.process(message.value)
# Success - commit offset
consumer.commit()
except ValidationError as e:
# Non-retryable - send to DLQ immediately
dlq.send_to_dlq(
message.value,
e,
{"topic": message.topic, "partition": message.partition}
)
consumer.commit() # Don't retry
except TemporaryError as e:
# Retryable - don't commit, let consumer retry
# After max retries, send to DLQ
            # Kafka message headers are a list of (key, bytes) tuples, not a dict
            headers = dict(message.headers or [])
            retry_count = int(headers.get("retry_count", b"0"))
if retry_count >= MAX_RETRIES:
dlq.send_to_dlq(message.value, e, {"retry_count": retry_count})
consumer.commit()
else:
raise # Will be retried
Circuit Breaker
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
import threading
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""

class CircuitState(Enum):
    CLOSED = "closed"  # Normal operation
    OPEN = "open"  # Failing, reject calls
    HALF_OPEN = "half_open"  # Testing if recovered
@dataclass
class CircuitBreaker:
"""Circuit breaker for external service calls."""
failure_threshold: int = 5
recovery_timeout: timedelta = timedelta(seconds=30)
success_threshold: int = 3
def __post_init__(self):
self.state = CircuitState.CLOSED
self.failure_count = 0
self.success_count = 0
self.last_failure_time = None
self.lock = threading.Lock()
def call(self, func, *args, **kwargs):
"""Execute function with circuit breaker protection."""
with self.lock:
if self.state == CircuitState.OPEN:
if self._should_attempt_reset():
self.state = CircuitState.HALF_OPEN
else:
raise CircuitOpenError("Circuit is open")
try:
result = func(*args, **kwargs)
self._record_success()
return result
except Exception as e:
self._record_failure()
raise
def _record_success(self):
with self.lock:
if self.state == CircuitState.HALF_OPEN:
self.success_count += 1
if self.success_count >= self.success_threshold:
self.state = CircuitState.CLOSED
self.failure_count = 0
self.success_count = 0
elif self.state == CircuitState.CLOSED:
self.failure_count = 0
def _record_failure(self):
with self.lock:
self.failure_count += 1
self.last_failure_time = datetime.now()
if self.state == CircuitState.HALF_OPEN:
self.state = CircuitState.OPEN
self.success_count = 0
elif self.failure_count >= self.failure_threshold:
self.state = CircuitState.OPEN
def _should_attempt_reset(self):
if self.last_failure_time is None:
return True
return datetime.now() - self.last_failure_time >= self.recovery_timeout
# Usage
circuit = CircuitBreaker(failure_threshold=5, recovery_timeout=timedelta(seconds=60))
def call_external_api(data):
return circuit.call(external_api.process, data)
Data Ingestion Patterns
Change Data Capture (CDC)
# Using Debezium with Kafka Connect
# connector-config.json
{
"name": "postgres-cdc-connector",
"config": {
"connector.class": "io.debezium.connector.postgresql.PostgresConnector",
"database.hostname": "postgres",
"database.port": "5432",
"database.user": "debezium",
"database.password": "password",
"database.dbname": "source_db",
"database.server.name": "source",
"table.include.list": "public.orders,public.customers",
"plugin.name": "pgoutput",
"publication.name": "dbz_publication",
"slot.name": "debezium_slot",
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": "false"
}
}
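The connector definition is registered through the Kafka Connect REST API. A minimal sketch, assuming Connect is reachable at the illustrative URL below:
import json
import requests

def register_connector(config_path: str, connect_url: str = "http://localhost:8083"):
    """POST a connector definition to the Kafka Connect REST API."""
    with open(config_path) as f:
        connector = json.load(f)
    resp = requests.post(
        f"{connect_url}/connectors",
        headers={"Content-Type": "application/json"},
        data=json.dumps(connector),
    )
    resp.raise_for_status()  # a 409 response means a connector with this name already exists
    return resp.json()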
Processing CDC Events:
def process_cdc_event(event):
"""Process Debezium CDC event."""
operation = event.get("op")
if operation == "c": # Create (INSERT)
after = event.get("after")
return {"action": "insert", "data": after}
elif operation == "u": # Update
before = event.get("before")
after = event.get("after")
return {"action": "update", "before": before, "after": after}
elif operation == "d": # Delete
before = event.get("before")
return {"action": "delete", "data": before}
elif operation == "r": # Read (snapshot)
after = event.get("after")
return {"action": "snapshot", "data": after}
Bulk Ingestion
# Efficient bulk loading to data warehouse
import uuid

import boto3
class BulkIngester:
"""Bulk ingest data to Snowflake via S3."""
def __init__(self, s3_bucket: str, snowflake_conn):
self.s3 = boto3.client('s3')
self.bucket = s3_bucket
self.snowflake = snowflake_conn
def ingest_dataframe(self, df, table_name: str, mode: str = "append"):
"""Bulk ingest DataFrame to Snowflake."""
# 1. Write to S3 as Parquet (compressed, columnar)
s3_path = f"s3://{self.bucket}/staging/{table_name}/{uuid.uuid4()}"
df.write.parquet(s3_path)
        # 2. (Re)create the external stage pointing at this batch's staging path
        # (CREATE IF NOT EXISTS would keep reusing the previous run's UUID path)
        self.snowflake.execute(f"""
            CREATE OR REPLACE STAGE {table_name}_stage
            URL = '{s3_path}'
            CREDENTIALS = (AWS_KEY_ID='...' AWS_SECRET_KEY='...')
            FILE_FORMAT = (TYPE = 'PARQUET')
        """)
# 3. COPY INTO (much faster than INSERT)
if mode == "overwrite":
self.snowflake.execute(f"TRUNCATE TABLE {table_name}")
self.snowflake.execute(f"""
COPY INTO {table_name}
FROM @{table_name}_stage
FILE_FORMAT = (TYPE = 'PARQUET')
MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE
ON_ERROR = 'CONTINUE'
""")
# 4. Cleanup staging files
self._cleanup_s3(s3_path)
Orchestration
Dependency Management
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.utils.task_group import TaskGroup
from datetime import datetime, timedelta

with DAG(
    "complex_pipeline",
    schedule_interval="@daily",
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:
# Wait for upstream DAG
wait_for_source = ExternalTaskSensor(
task_id="wait_for_source_etl",
external_dag_id="source_etl_dag",
external_task_id="final_task",
execution_delta=timedelta(hours=0),
timeout=3600,
mode="poke",
poke_interval=60,
)
# Parallel extraction group
with TaskGroup("extract") as extract_group:
extract_orders = PythonOperator(
task_id="extract_orders",
python_callable=extract_orders_func,
)
extract_customers = PythonOperator(
task_id="extract_customers",
python_callable=extract_customers_func,
)
extract_products = PythonOperator(
task_id="extract_products",
python_callable=extract_products_func,
)
# Sequential transformation
with TaskGroup("transform") as transform_group:
join_data = PythonOperator(
task_id="join_data",
python_callable=join_func,
)
aggregate = PythonOperator(
task_id="aggregate",
python_callable=aggregate_func,
)
join_data >> aggregate
# Load
load = PythonOperator(
task_id="load",
python_callable=load_func,
)
# Define dependencies
wait_for_source >> extract_group >> transform_group >> load
Dynamic DAG Generation
from airflow import DAG
from airflow.operators.python import PythonOperator
from datetime import datetime
import yaml
def create_etl_dag(config: dict) -> DAG:
"""Factory function to create ETL DAGs from config."""
dag = DAG(
dag_id=f"etl_{config['source']}_{config['destination']}",
schedule_interval=config.get('schedule', '@daily'),
start_date=datetime(2024, 1, 1),
catchup=False,
tags=['etl', 'auto-generated'],
)
with dag:
extract = PythonOperator(
task_id='extract',
python_callable=create_extract_func(config['source']),
)
transform = PythonOperator(
task_id='transform',
python_callable=create_transform_func(config['transformations']),
)
load = PythonOperator(
task_id='load',
python_callable=create_load_func(config['destination']),
)
extract >> transform >> load
return dag
# Load configurations
with open('/config/etl_pipelines.yaml') as f:
configs = yaml.safe_load(f)
# Generate DAGs
for config in configs['pipelines']:
dag_id = f"etl_{config['source']}_{config['destination']}"
globals()[dag_id] = create_etl_dag(config)
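For reference, one entry under pipelines: in etl_pipelines.yaml might parse to the dict below; the keys match what create_etl_dag reads, and the values are purely illustrative:
# Parsed form of a single pipeline entry (illustrative)
example_config = {
    "source": "postgres_orders",
    "destination": "warehouse_orders",
    "schedule": "@hourly",
    "transformations": ["deduplicate", "cast_types"],
}
# create_etl_dag(example_config) would register a DAG named
# "etl_postgres_orders_warehouse_orders" scheduled hourly.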