---
title: "Senior Data Engineer — Agent Skill & Codex Plugin"
description: "Data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt. Agent skill for Claude Code, Codex CLI, Gemini CLI, OpenClaw."
---

# Senior Data Engineer

<div class="page-meta" markdown>
<span class="meta-badge">:material-code-braces: Engineering - Core</span>
<span class="meta-badge">:material-identifier: `senior-data-engineer`</span>
<span class="meta-badge">:material-github: <a href="https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/senior-data-engineer/SKILL.md">Source</a></span>
</div>

<div class="install-banner" markdown>
<span class="install-label">Install:</span> <code>claude /plugin install engineering-skills</code>
</div>

Production-grade data engineering skill for building scalable, reliable data systems.

## Table of Contents

1. [Trigger Phrases](#trigger-phrases)
2. [Quick Start](#quick-start)
3. [Workflows](#workflows)
    - Building a Batch ETL Pipeline
    - Implementing Real-Time Streaming
    - Data Quality Framework Setup
4. [Architecture Decision Framework](#architecture-decision-framework)
5. [Tech Stack](#tech-stack)
6. [Reference Documentation](#reference-documentation)
7. [Troubleshooting](#troubleshooting)

---

## Trigger Phrases

Activate this skill when you see:

**Pipeline Design:**

- "Design a data pipeline for..."
- "Build an ETL/ELT process..."
- "How should I ingest data from..."
- "Set up data extraction from..."

**Architecture:**

- "Should I use batch or streaming?"
- "Lambda vs Kappa architecture"
- "How to handle late-arriving data"
- "Design a data lakehouse"

**Data Modeling:**

- "Create a dimensional model..."
- "Star schema vs snowflake"
- "Implement slowly changing dimensions"
- "Design a data vault"

**Data Quality:**

- "Add data validation to..."
- "Set up data quality checks"
- "Monitor data freshness"
- "Implement data contracts"

**Performance:**

- "Optimize this Spark job"
- "Query is running slow"
- "Reduce pipeline execution time"
- "Tune Airflow DAG"

---

## Quick Start

### Core Tools

```bash
# Generate pipeline orchestration config
python scripts/pipeline_orchestrator.py generate \
    --type airflow \
    --source postgres \
    --destination snowflake \
    --schedule "0 5 * * *"

# Validate data quality
python scripts/data_quality_validator.py validate \
    --input data/sales.parquet \
    --schema schemas/sales.json \
    --checks freshness,completeness,uniqueness

# Optimize ETL performance
python scripts/etl_performance_optimizer.py analyze \
    --query queries/daily_aggregation.sql \
    --engine spark \
    --recommend
```

---
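
When the bundled validator isn't at hand, the completeness and uniqueness checks it names can be sketched in a few lines of plain Python. The column names, record layout, and return shape below are illustrative, not the script's actual interface:

```python
from typing import Any

def check_quality(rows: list[dict[str, Any]], key: str,
                  required: list[str]) -> dict[str, bool]:
    """Minimal completeness/uniqueness checks (illustrative sketch)."""
    keys = [row.get(key) for row in rows]
    return {
        # completeness: every required column is non-null in every row
        "completeness": all(
            row.get(col) is not None for row in rows for col in required
        ),
        # uniqueness: the key column contains no duplicate values
        "uniqueness": len(keys) == len(set(keys)),
    }

orders = [
    {"order_id": 1, "amount": 42.0},
    {"order_id": 2, "amount": None},  # fails completeness on "amount"
    {"order_id": 2, "amount": 10.0},  # duplicate key, fails uniqueness
]
print(check_quality(orders, key="order_id", required=["order_id", "amount"]))
# {'completeness': False, 'uniqueness': False}
```

In a real pipeline these checks run as a gate before loading, failing the task (and alerting) rather than silently writing bad rows downstream.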

## Workflows

→ See `references/workflows.md` for the step-by-step guides: building a batch ETL pipeline, implementing real-time streaming, and setting up a data quality framework.

## Architecture Decision Framework

Use this framework to choose the right approach for your data pipeline.

### Batch vs Streaming

| Criteria | Batch | Streaming |
|----------|-------|-----------|
| **Latency requirement** | Hours to days | Seconds to minutes |
| **Data volume** | Large historical datasets | Continuous event streams |
| **Processing complexity** | Complex transformations, ML | Simple aggregations, filtering |
| **Cost sensitivity** | More cost-effective | Higher infrastructure cost |
| **Error handling** | Easier to reprocess | Requires careful design |

**Decision Tree:**

```
Is real-time insight required?
├── Yes → Use streaming
│   └── Is exactly-once semantics needed?
│       ├── Yes → Kafka + Flink/Spark Structured Streaming
│       └── No → Kafka + consumer groups
└── No → Use batch
    └── Is data volume > 1TB daily?
        ├── Yes → Spark/Databricks
        └── No → dbt + warehouse compute
```
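
The tree above reduces to a small amount of branching logic. As an illustration (a sketch of the same decisions, not a tool shipped with this skill):

```python
def recommend_pipeline(real_time: bool, exactly_once: bool = False,
                       daily_volume_tb: float = 0.0) -> str:
    """Walk the batch-vs-streaming decision tree above."""
    if real_time:
        if exactly_once:
            return "Kafka + Flink/Spark Structured Streaming"
        return "Kafka + consumer groups"
    if daily_volume_tb > 1.0:
        return "Spark/Databricks"
    return "dbt + warehouse compute"

print(recommend_pipeline(real_time=False, daily_volume_tb=5.0))
# Spark/Databricks
```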

### Lambda vs Kappa Architecture

| Aspect | Lambda | Kappa |
|--------|--------|-------|
| **Complexity** | Two codebases (batch + stream) | Single codebase |
| **Maintenance** | Higher (sync batch/stream logic) | Lower |
| **Reprocessing** | Native batch layer | Replay from source |
| **Use case** | ML training + real-time serving | Pure event-driven |

**When to choose Lambda:**

- Need to train ML models on historical data
- Complex batch transformations not feasible in streaming
- Existing batch infrastructure

**When to choose Kappa:**

- Event-sourced architecture
- All processing can be expressed as stream operations
- Starting fresh without legacy systems

### Data Warehouse vs Data Lakehouse

| Feature | Warehouse (Snowflake/BigQuery) | Lakehouse (Delta/Iceberg) |
|---------|-------------------------------|---------------------------|
| **Best for** | BI, SQL analytics | ML, unstructured data |
| **Storage cost** | Higher (proprietary format) | Lower (open formats) |
| **Flexibility** | Schema-on-write | Schema-on-read |
| **Performance** | Excellent for SQL | Good, improving |
| **Ecosystem** | Mature BI tools | Growing ML tooling |

---

## Tech Stack

| Category | Technologies |
|----------|--------------|
| **Languages** | Python, SQL, Scala |
| **Orchestration** | Airflow, Prefect, Dagster |
| **Transformation** | dbt, Spark, Flink |
| **Streaming** | Kafka, Kinesis, Pub/Sub |
| **Storage** | S3, GCS, Delta Lake, Iceberg |
| **Warehouses** | Snowflake, BigQuery, Redshift, Databricks |
| **Quality** | Great Expectations, dbt tests, Monte Carlo |
| **Monitoring** | Prometheus, Grafana, Datadog |

---

## Reference Documentation

### 1. Data Pipeline Architecture

See `references/data_pipeline_architecture.md` for:

- Lambda vs Kappa architecture patterns
- Batch processing with Spark and Airflow
- Stream processing with Kafka and Flink
- Exactly-once semantics implementation
- Error handling and dead letter queues

### 2. Data Modeling Patterns

See `references/data_modeling_patterns.md` for:

- Dimensional modeling (Star/Snowflake)
- Slowly Changing Dimensions (SCD Types 1-6)
- Data Vault modeling
- dbt best practices
- Partitioning and clustering
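
As a minimal illustration of SCD Type 2 (covered in full in the reference above): an attribute change expires the current dimension row and appends a new one instead of overwriting history. The field names and in-memory layout here are hypothetical; a real implementation would be a dbt snapshot or a warehouse MERGE:

```python
from datetime import date

def scd2_update(dim_rows: list[dict], key: str, attrs: dict,
                today: date) -> list[dict]:
    """Apply an SCD Type 2 change: expire the current row, append a new one."""
    for row in dim_rows:
        if row[key] == attrs[key] and row["is_current"]:
            if all(row.get(k) == v for k, v in attrs.items()):
                return dim_rows  # attributes unchanged, nothing to do
            row["is_current"] = False
            row["valid_to"] = today
    dim_rows.append({**attrs, "valid_from": today, "valid_to": None,
                     "is_current": True})
    return dim_rows

dim = [{"customer_id": 7, "city": "Berlin",
        "valid_from": date(2023, 1, 1), "valid_to": None, "is_current": True}]
scd2_update(dim, "customer_id", {"customer_id": 7, "city": "Hamburg"},
            today=date(2024, 6, 1))
# dim now holds two rows: the expired Berlin row and a current Hamburg row
```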

### 3. DataOps Best Practices

See `references/dataops_best_practices.md` for:

- Data testing frameworks
- Data contracts and schema validation
- CI/CD for data pipelines
- Observability and lineage
- Incident response

---

## Troubleshooting

→ See `references/troubleshooting.md` for details.