---
title: "Senior Data Engineer"
description: "Senior Data Engineer - Claude Code skill from the Engineering - Core domain."
---

# Senior Data Engineer

**Domain:** Engineering - Core | **Skill:** `senior-data-engineer` | **Source:** [`engineering-team/senior-data-engineer/SKILL.md`](https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/senior-data-engineer/SKILL.md)

---

# Senior Data Engineer

Production-grade data engineering skill for building scalable, reliable data systems.

## Table of Contents

1. [Trigger Phrases](#trigger-phrases)
2. [Quick Start](#quick-start)
3. [Workflows](#workflows)
   - [Building a Batch ETL Pipeline](#workflow-1-building-a-batch-etl-pipeline)
   - [Implementing Real-Time Streaming](#workflow-2-implementing-real-time-streaming)
   - [Data Quality Framework Setup](#workflow-3-data-quality-framework-setup)
4. [Architecture Decision Framework](#architecture-decision-framework)
5. [Tech Stack](#tech-stack)
6. [Reference Documentation](#reference-documentation)
7. [Troubleshooting](#troubleshooting)

---

## Trigger Phrases

Activate this skill when you see:

**Pipeline Design:**
- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."

**Architecture:**
- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"

**Data Modeling:**
- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"

**Data Quality:**
- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"

**Performance:**
- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"

---

## Quick Start

### Core Tools

```bash
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
    --type airflow \
    --source postgres \
    --destination snowflake \
    --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
    --input data/sales.parquet \
    --schema schemas/sales.json \
    --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
    --query queries/daily_aggregation.sql \
    --engine spark \
    --recommend
```

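To make the three `--checks` names above concrete, here is a minimal plain-Python sketch of what freshness, completeness, and uniqueness checks usually measure. It is not the implementation of `data_quality_validator.py`; the function names, the sample rows, and the 24-hour freshness threshold are all assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone


def check_completeness(rows, required):
    """Return the fraction of rows in which every required field is non-null."""
    if not rows:
        return 0.0
    ok = sum(all(r.get(f) is not None for f in required) for r in rows)
    return ok / len(rows)


def check_uniqueness(rows, key):
    """Return True when no two rows share the same value for `key`."""
    values = [r.get(key) for r in rows]
    return len(values) == len(set(values))


def check_freshness(rows, ts_field, max_age, now=None):
    """Return True when the newest timestamp is within `max_age` of `now`."""
    if not rows:
        return False
    now = now or datetime.now(timezone.utc)
    newest = max(r[ts_field] for r in rows)
    return now - newest <= max_age


# Hypothetical sample rows; the field names are illustrative only.
NOW = datetime(2024, 1, 2, 6, 0, tzinfo=timezone.utc)
SALES = [
    {"order_id": 1, "amount": 10.0, "loaded_at": NOW - timedelta(hours=1)},
    {"order_id": 2, "amount": None, "loaded_at": NOW - timedelta(hours=2)},
    {"order_id": 2, "amount": 5.0, "loaded_at": NOW - timedelta(hours=30)},
]

report = {
    # One of three rows has a null amount.
    "completeness": check_completeness(SALES, ["order_id", "amount"]),
    # order_id 2 appears twice, so the key is not unique.
    "uniqueness": check_uniqueness(SALES, "order_id"),
    # The newest row is one hour old, inside the assumed 24-hour limit.
    "freshness": check_freshness(SALES, "loaded_at", timedelta(hours=24), now=NOW),
}
```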
---

## Workflows
→ See references/workflows.md for details

## Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

### Batch vs Streaming

| Criteria | Batch | Streaming |
|----------|-------|-----------|
| **Latency requirement** | Hours to days | Seconds to minutes |
| **Data volume** | Large historical datasets | Continuous event streams |
| **Processing complexity** | Complex transformations, ML | Simple aggregations, filtering |
| **Cost sensitivity** | More cost-effective | Higher infrastructure cost |
| **Error handling** | Easier to reprocess | Requires careful design |

**Decision Tree:**
```
Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
```

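The decision tree above can also be written as a small function, which is handy when the choice has to be recorded in a design checklist or a unit test. This is a sketch only: the function name `recommend_stack` and its parameters are hypothetical, and the returned strings are copied from the tree rather than taken from any tool in this skill.

```python
def recommend_stack(
    real_time: bool,
    exactly_once: bool = False,
    daily_volume_tb: float = 0.0,
) -> str:
    """Walk the batch-vs-streaming decision tree and return its leaf."""
    if real_time:
        # Streaming branch: the only further question is delivery semantics.
        if exactly_once:
            return "Kafka + Flink/Spark Structured Streaming"
        return "Kafka + consumer groups"
    # Batch branch: the tree splits at more than 1 TB of data per day.
    if daily_volume_tb > 1:
        return "Spark/Databricks"
    return "dbt + warehouse compute"


# Example inputs covering one streaming leaf and one batch leaf.
streaming_choice = recommend_stack(real_time=True, exactly_once=True)
batch_choice = recommend_stack(real_time=False, daily_volume_tb=0.2)
```

Treat the 1 TB threshold as the rule of thumb the tree states, not as a hard limit; team skills and existing infrastructure can reasonably shift the choice.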
### Lambda vs Kappa Architecture

| Aspect | Lambda | Kappa |
|--------|--------|-------|
| **Complexity** | Two codebases (batch + stream) | Single codebase |
| **Maintenance** | Higher (sync batch/stream logic) | Lower |
| **Reprocessing** | Native batch layer | Replay from source |
| **Use case** | ML training + real-time serving | Pure event-driven |

**When to choose Lambda:**
- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

**When to choose Kappa:**
- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems

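The "single codebase" and "replay from source" rows are the core of Kappa, and a toy example may make them easier to picture. The sketch below uses an in-memory list as a stand-in for a replayable log such as a Kafka topic; `EVENT_LOG`, `process`, and the event fields are hypothetical names chosen for illustration.

```python
from collections import defaultdict

# Hypothetical append-only event log standing in for a replayable source.
EVENT_LOG = [
    {"offset": 0, "user": "a", "amount": 10},
    {"offset": 1, "user": "b", "amount": 5},
    {"offset": 2, "user": "a", "amount": 7},
]


def process(events, state=None):
    """Single stream-processing function: fold events into per-user totals."""
    totals = defaultdict(int, state or {})
    for event in events:
        totals[event["user"]] += event["amount"]
    return dict(totals)


# Live path: events are applied one at a time as they arrive.
live_state = {}
for event in EVENT_LOG:
    live_state = process([event], live_state)

# Reprocessing path: replay the whole log from offset 0 through the same code.
replayed_state = process(EVENT_LOG)
```

Both paths run the same `process` function and end in the same state, which is why Kappa needs no separate batch layer. In a Lambda design the replay step would instead be a second, batch-specific implementation that has to be kept in sync with the streaming one.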
### Data Warehouse vs Data Lakehouse

| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---------|-------------------------------|---------------------------|
| **Best for** | BI, SQL analytics | ML, unstructured data |
| **Storage cost** | Higher (proprietary format) | Lower (open formats) |
| **Flexibility** | Schema-on-write | Schema-on-read |
| **Performance** | Excellent for SQL | Good, improving |
| **Ecosystem** | Mature BI tools | Growing ML tooling |

---

## Tech Stack

| Category | Technologies |
|----------|--------------|
| **Languages** | Python, SQL, Scala |
| **Orchestration** | Airflow, Prefect, Dagster |
| **Transformation** | dbt, Spark, Flink |
| **Streaming** | Kafka, Kinesis, Pub/Sub |
| **Storage** | S3, GCS, Delta Lake, Iceberg |
| **Warehouses** | Snowflake, BigQuery, Redshift, Databricks |
| **Quality** | Great Expectations, dbt tests, Monte Carlo |
| **Monitoring** | Prometheus, Grafana, Datadog |

---

## Reference Documentation

### 1. Data Pipeline Architecture
See `references/data_pipeline_architecture.md` for:
- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues

### 2. Data Modeling Patterns
See `references/data_modeling_patterns.md` for:
- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering

### 3. DataOps Best Practices
See `references/dataops_best_practices.md` for:
- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response

---

## Troubleshooting
→ See references/troubleshooting.md for details