claude-skills-reference/engineering-team/senior-data-engineer/scripts/pipeline_orchestrator.py
Alireza Rezvani a68ae3a05e Dev (#305)
* chore: update gitignore for audit reports and playwright cache

* fix: add YAML frontmatter (name + description) to all SKILL.md files

- Added frontmatter to 34 skills that were missing it entirely (0% Tessl score)
- Fixed name field format to kebab-case across all 169 skills
- Resolves #284

* chore: sync codex skills symlinks [automated]

* fix: optimize 14 low-scoring skills via Tessl review (#290)

Tessl optimization: 14 skills improved from ≤69% to 85%+. Closes #285, #286.

* chore: sync codex skills symlinks [automated]

* fix: optimize 18 skills via Tessl review + compliance fix (closes #287) (#291)

Phase 1: 18 skills optimized via Tessl (avg 77% → 95%). Closes #287.

* feat: add scripts and references to 4 prompt-only skills + Tessl optimization (#292)

Phase 2: 3 new scripts + 2 reference files for prompt-only skills. Tessl 45-55% → 94-100%.

* feat: add 6 agents + 5 slash commands for full coverage (v2.7.0) (#293)

Phase 3: 6 new agents (all 9 categories covered) + 5 slash commands.

* fix: Phase 5 verification fixes + docs update (#294)

Phase 5 verification fixes

* chore: sync codex skills symlinks [automated]

* fix: marketplace audit — all 11 plugins validated by Claude Code (#295)

Marketplace audit: all 11 plugins validated + installed + tested in Claude Code

* fix: restore 7 removed plugins + revert playwright-pro name to pw

Reverts two overly aggressive audit changes:
- Restored content-creator, demand-gen, fullstack-engineer, aws-architect,
  product-manager, scrum-master, skill-security-auditor to marketplace
- Reverted playwright-pro plugin.json name back to 'pw' (intentional short name)

* refactor: split 21 over-500-line skills into SKILL.md + references (#296)

* chore: sync codex skills symlinks [automated]

* docs: update all documentation with accurate counts and regenerated skill pages

- Update skill count to 170, Python tools to 213, references to 314 across all docs
- Regenerate all 170 skill doc pages from latest SKILL.md sources
- Update CLAUDE.md with v2.1.1 highlights, accurate architecture tree, and roadmap
- Update README.md badges and overview table
- Update marketplace.json metadata description and version
- Update mkdocs.yml, index.md, getting-started.md with correct numbers

* fix: add root-level SKILL.md and .codex/instructions.md to all domains (#301)

Root cause: CLI tools (ai-agent-skills, agent-skills-cli) look for SKILL.md
at the specified install path. 7 of 9 domain directories were missing this
file, causing "Skill not found" errors for bundle installs like:
  npx ai-agent-skills install alirezarezvani/claude-skills/engineering-team

Fix:
- Add root-level SKILL.md with YAML frontmatter to 7 domains
- Add .codex/instructions.md to 8 domains (for Codex CLI discovery)
- Update INSTALLATION.md with accurate skill counts (53→170)
- Add troubleshooting entry for "Skill not found" error

All 9 domains now have: SKILL.md + .codex/instructions.md + plugin.json

Closes #301

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: add Gemini CLI + OpenClaw support, fix Codex missing 25 skills

Gemini CLI:
- Add GEMINI.md with activation instructions
- Add scripts/gemini-install.sh setup script
- Add scripts/sync-gemini-skills.py (194 skills indexed)
- Add .gemini/skills/ with symlinks for all skills, agents, commands
- Remove phantom medium-content-pro entries from sync script
- Add top-level folder filter to prevent gitignored dirs from leaking

Codex CLI:
- Fix sync-codex-skills.py missing "engineering" domain (25 POWERFUL skills)
- Regenerate .codex/skills-index.json: 124 → 149 skills
- Add 25 new symlinks in .codex/skills/

OpenClaw:
- Add OpenClaw installation section to INSTALLATION.md
- Add ClawHub install + manual install + YAML frontmatter docs

Documentation:
- Update INSTALLATION.md with all 4 platforms + accurate counts
- Update README.md: "three platforms" → "four platforms" + Gemini quick start
- Update CLAUDE.md with Gemini CLI support in v2.1.1 highlights
- Update SKILL-AUTHORING-STANDARD.md + SKILL_PIPELINE.md with Gemini steps
- Add OpenClaw + Gemini to installation locations reference table

Marketplace: all 18 plugins validated — sources exist, SKILL.md present

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat(product,pm): world-class product & PM skills audit — 6 scripts, 5 agents, 7 commands, 23 references/assets

Phase 1 — Agent & Command Foundation:
- Rewrite cs-project-manager agent (55→515 lines, 4 workflows, 6 skill integrations)
- Expand cs-product-manager agent (408→684 lines, orchestrates all 8 product skills)
- Add 7 slash commands: /rice, /okr, /persona, /user-story, /sprint-health, /project-health, /retro

Phase 2 — Script Gap Closure (2,779 lines):
- jira-expert: jql_query_builder.py (22 patterns), workflow_validator.py
- confluence-expert: space_structure_generator.py, content_audit_analyzer.py
- atlassian-admin: permission_audit_tool.py
- atlassian-templates: template_scaffolder.py (Confluence XHTML generation)

Phase 3 — Reference & Asset Enrichment:
- 9 product references (competitive-teardown, landing-page-generator, saas-scaffolder)
- 6 PM references (confluence-expert, atlassian-admin, atlassian-templates)
- 7 product assets (templates for PRD, RICE, sprint, stories, OKR, research, design system)
- 1 PM asset (permission_scheme_template.json)

Phase 4 — New Agents:
- cs-agile-product-owner, cs-product-strategist, cs-ux-researcher

Phase 5 — Integration & Polish:
- Related Skills cross-references in 8 SKILL.md files
- Updated product-team/CLAUDE.md (5→8 skills, 6→9 tools, 4 agents, 5 commands)
- Updated project-management/CLAUDE.md (0→12 scripts, 3 commands)
- Regenerated docs site (177 pages), updated homepage and getting-started

Quality audit: 31 files reviewed, 29 PASS, 2 fixed (copy-frameworks.md, governance-framework.md)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: audit and repair all plugins, agents, and commands

- Fix 12 command files: correct CLI arg syntax, script paths, and usage docs
- Fix 3 agents with broken script/reference paths (cs-content-creator,
  cs-demand-gen-specialist, cs-financial-analyst)
- Add complete YAML frontmatter to 5 agents (cs-growth-strategist,
  cs-engineering-lead, cs-senior-engineer, cs-financial-analyst,
  cs-quality-regulatory)
- Fix cs-ceo-advisor related agent path
- Update marketplace.json metadata counts (224 tools, 341 refs, 14 agents,
  12 commands)

Verified: all 19 scripts pass --help, all 14 agent paths resolve, mkdocs
builds clean.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: repair 25 Python scripts failing --help across all domains

- Fix Python 3.10+ syntax (float | None → Optional[float]) in 2 scripts
- Add argparse CLI handling to 9 marketing scripts using raw sys.argv
- Fix 10 scripts crashing at module level (wrap in __main__, add argparse)
- Make yaml/prefect/mcp imports conditional with stdlib fallbacks (4 scripts)
- Fix f-string backslash syntax in project_bootstrapper.py
- Fix -h flag conflict in pr_analyzer.py
- Fix tech-debt.md description (score → prioritize)

All 237 scripts now pass python3 --help verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix(product-team): close 3 verified gaps in product skills

- Fix competitive-teardown/SKILL.md: replace broken references
  DATA_COLLECTION.md → references/data-collection-guide.md and
  TEMPLATES.md → references/analysis-templates.md (workflow was broken
  at steps 2 and 4)

- Upgrade landing_page_scaffolder.py: add TSX + Tailwind output format
  (--format tsx) matching SKILL.md promise of Next.js/React components.
  4 design styles (dark-saas, clean-minimal, bold-startup, enterprise).
  TSX is now default; HTML preserved via --format html

- Rewrite README.md: fix stale counts (was 5 skills/15+ tools, now
  accurately shows 8 skills/9 tools), remove 7 ghost scripts that
  never existed (sprint_planner.py, velocity_tracker.py, etc.)

- Fix tech-debt.md description (score → prioritize)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* release: v2.1.2 — landing page TSX output, brand voice integration, docs update

- Landing page generator defaults to Next.js TSX + Tailwind CSS (4 design styles)
- Brand voice analyzer integrated into landing page generation workflow
- CHANGELOG, CLAUDE.md, README.md updated for v2.1.2
- All 13 plugin.json + marketplace.json bumped to 2.1.2
- Gemini/Codex skill indexes re-synced
- Backward compatible: --format html preserved, no breaking changes

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: alirezarezvani <5697919+alirezarezvani@users.noreply.github.com>
Co-authored-by: Leo <leo@openclaw.ai>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
2026-03-10 09:48:49 +01:00

884 lines · 28 KiB · Python · Executable File

#!/usr/bin/env python3
"""
Pipeline Orchestrator

Generate pipeline configurations for Airflow, Prefect, and Dagster.
Supports ETL pattern generation, dependency management, and scheduling.

Usage:
    python pipeline_orchestrator.py generate --type airflow --source postgres --destination snowflake
    python pipeline_orchestrator.py generate --type prefect --config pipeline.yaml
    python pipeline_orchestrator.py template --pattern extract-load --type airflow --source postgres --destination snowflake --tables orders,customers
    python pipeline_orchestrator.py validate --dag dags/my_dag.py --type airflow
"""
import os
import sys
import json

try:
    import yaml
    HAS_YAML = True
except ImportError:
    HAS_YAML = False

import logging
import argparse
from pathlib import Path
from typing import Dict, List, Optional, Any
from datetime import datetime, timedelta
from dataclasses import dataclass, field, asdict
from abc import ABC, abstractmethod

logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)
# ============================================================================
# Data Classes
# ============================================================================

@dataclass
class SourceConfig:
    """Source system configuration."""
    type: str  # postgres, mysql, s3, kafka, api
    connection_id: str
    schema: Optional[str] = None
    tables: List[str] = field(default_factory=list)
    query: Optional[str] = None
    incremental_column: Optional[str] = None
    incremental_strategy: str = "timestamp"  # timestamp, id, cdc


@dataclass
class DestinationConfig:
    """Destination system configuration."""
    type: str  # snowflake, bigquery, redshift, s3, delta
    connection_id: str
    schema: str = "raw"
    write_mode: str = "append"  # append, overwrite, merge
    partition_by: Optional[str] = None
    cluster_by: List[str] = field(default_factory=list)


@dataclass
class TaskConfig:
    """Individual task configuration."""
    task_id: str
    operator: str
    dependencies: List[str] = field(default_factory=list)
    params: Dict[str, Any] = field(default_factory=dict)
    retries: int = 2
    retry_delay_minutes: int = 5
    timeout_minutes: int = 60
    pool: Optional[str] = None
    priority_weight: int = 1


@dataclass
class PipelineConfig:
    """Complete pipeline configuration."""
    name: str
    description: str
    schedule: str  # cron expression or @daily, @hourly
    owner: str = "data-team"
    tags: List[str] = field(default_factory=list)
    catchup: bool = False
    max_active_runs: int = 1
    default_retries: int = 2
    source: Optional[SourceConfig] = None
    destination: Optional[DestinationConfig] = None
    tasks: List[TaskConfig] = field(default_factory=list)
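
# Programmatic usage sketch (not executed; names below are illustrative):
#
#   config = PipelineConfig(
#       name="orders_daily",
#       description="Extract orders from Postgres into Snowflake",
#       schedule="0 5 * * *",
#       tasks=[
#           TaskConfig(task_id="extract_orders", operator="python_operator",
#                      params={"callable": "extract_orders"}),
#           TaskConfig(task_id="load_orders", operator="python_operator",
#                      dependencies=["extract_orders"],
#                      params={"callable": "load_orders"}),
#       ],
#   )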
# ============================================================================
# Pipeline Generators
# ============================================================================

class PipelineGenerator(ABC):
    """Abstract base class for pipeline generators."""

    @abstractmethod
    def generate(self, config: PipelineConfig) -> str:
        """Generate pipeline code from config."""
        pass

    @abstractmethod
    def validate(self, code: str) -> Dict[str, Any]:
        """Validate generated pipeline code."""
        pass


class AirflowGenerator(PipelineGenerator):
    """Generate Airflow DAG code."""

    OPERATOR_IMPORTS = {
        'python': 'from airflow.operators.python import PythonOperator',
        'bash': 'from airflow.operators.bash import BashOperator',
        'postgres': 'from airflow.providers.postgres.operators.postgres import PostgresOperator',
        'snowflake': 'from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator',
        's3': 'from airflow.providers.amazon.aws.operators.s3 import S3CreateBucketOperator',
        's3_to_snowflake': 'from airflow.providers.snowflake.transfers.s3_to_snowflake import S3ToSnowflakeOperator',
        'sensor': 'from airflow.sensors.base import BaseSensorOperator',
        'trigger': 'from airflow.operators.trigger_dagrun import TriggerDagRunOperator',
        'email': 'from airflow.operators.email import EmailOperator',
        'slack': 'from airflow.providers.slack.operators.slack_webhook import SlackWebhookOperator',
    }

    def generate(self, config: PipelineConfig) -> str:
        """Generate Airflow DAG from configuration."""
        # Collect required imports
        imports = self._collect_imports(config)

        # Generate DAG code
        code = self._generate_header(imports)
        code += self._generate_default_args(config)
        code += self._generate_dag_definition(config)
        code += self._generate_tasks(config)
        code += self._generate_dependencies(config)
        return code

    def _collect_imports(self, config: PipelineConfig) -> List[str]:
        """Collect required import statements."""
        imports = [
            "from airflow import DAG",
            "from airflow.utils.dates import days_ago",
            "from datetime import datetime, timedelta",
        ]
        operators_used = set()
        for task in config.tasks:
            op_type = task.operator.split('_')[0].lower()
            if op_type in self.OPERATOR_IMPORTS:
                operators_used.add(op_type)

        # Add source/destination specific imports
        if config.source:
            if config.source.type == 'postgres':
                operators_used.add('postgres')
            elif config.source.type == 's3':
                operators_used.add('s3')
        if config.destination:
            if config.destination.type == 'snowflake':
                operators_used.add('snowflake')
                operators_used.add('s3_to_snowflake')

        for op in operators_used:
            if op in self.OPERATOR_IMPORTS:
                imports.append(self.OPERATOR_IMPORTS[op])
        return imports
    def _generate_header(self, imports: List[str]) -> str:
        """Generate file header with imports."""
        header = '''"""
Auto-generated Airflow DAG
Generated by Pipeline Orchestrator
"""
'''
        header += '\n'.join(imports)
        header += '\n\n'
        return header

    def _generate_default_args(self, config: PipelineConfig) -> str:
        """Generate default_args dictionary."""
        return f'''
default_args = {{
    'owner': '{config.owner}',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': {config.default_retries},
    'retry_delay': timedelta(minutes=5),
}}
'''

    def _generate_dag_definition(self, config: PipelineConfig) -> str:
        """Generate DAG definition."""
        tags_str = str(config.tags) if config.tags else "[]"
        return f'''
with DAG(
    dag_id='{config.name}',
    default_args=default_args,
    description='{config.description}',
    schedule_interval='{config.schedule}',
    start_date=days_ago(1),
    catchup={config.catchup},
    max_active_runs={config.max_active_runs},
    tags={tags_str},
) as dag:
'''

    def _generate_tasks(self, config: PipelineConfig) -> str:
        """Generate task definitions."""
        tasks_code = ""
        for task in config.tasks:
            if 'python' in task.operator.lower():
                tasks_code += self._generate_python_task(task)
            elif 'bash' in task.operator.lower():
                tasks_code += self._generate_bash_task(task)
            elif 'sql' in task.operator.lower() or 'postgres' in task.operator.lower():
                tasks_code += self._generate_sql_task(task, config)
            elif 'snowflake' in task.operator.lower():
                tasks_code += self._generate_snowflake_task(task)
            else:
                tasks_code += self._generate_generic_task(task)
        return tasks_code

    def _generate_python_task(self, task: TaskConfig) -> str:
        """Generate PythonOperator task."""
        callable_name = task.params.get('callable', 'process_data')
        return f'''
    def {callable_name}(**kwargs):
        """Task: {task.task_id}"""
        # Add your processing logic here
        execution_date = kwargs.get('ds')
        print(f"Processing data for {{execution_date}}")
        return True

    {task.task_id} = PythonOperator(
        task_id='{task.task_id}',
        python_callable={callable_name},
        retries={task.retries},
        retry_delay=timedelta(minutes={task.retry_delay_minutes}),
        execution_timeout=timedelta(minutes={task.timeout_minutes}),
    )
'''

    def _generate_bash_task(self, task: TaskConfig) -> str:
        """Generate BashOperator task."""
        command = task.params.get('command', 'echo "Hello World"')
        return f'''
    {task.task_id} = BashOperator(
        task_id='{task.task_id}',
        bash_command='{command}',
        retries={task.retries},
        retry_delay=timedelta(minutes={task.retry_delay_minutes}),
        execution_timeout=timedelta(minutes={task.timeout_minutes}),
    )
'''

    def _generate_sql_task(self, task: TaskConfig, config: PipelineConfig) -> str:
        """Generate SQL operator task."""
        sql = task.params.get('sql', 'SELECT 1')
        conn_id = config.source.connection_id if config.source else 'default_conn'
        return f'''
    {task.task_id} = PostgresOperator(
        task_id='{task.task_id}',
        postgres_conn_id='{conn_id}',
        sql="""{sql}""",
        retries={task.retries},
        retry_delay=timedelta(minutes={task.retry_delay_minutes}),
    )
'''

    def _generate_snowflake_task(self, task: TaskConfig) -> str:
        """Generate SnowflakeOperator task."""
        sql = task.params.get('sql', 'SELECT 1')
        return f'''
    {task.task_id} = SnowflakeOperator(
        task_id='{task.task_id}',
        snowflake_conn_id='snowflake_default',
        sql="""{sql}""",
        retries={task.retries},
        retry_delay=timedelta(minutes={task.retry_delay_minutes}),
    )
'''

    def _generate_generic_task(self, task: TaskConfig) -> str:
        """Generate generic task placeholder."""
        return f'''
    # TODO: Implement {task.operator} for {task.task_id}
    {task.task_id} = PythonOperator(
        task_id='{task.task_id}',
        python_callable=lambda: print("{task.task_id}"),
    )
'''

    def _generate_dependencies(self, config: PipelineConfig) -> str:
        """Generate task dependencies."""
        deps_code = "\n    # Task dependencies\n"
        for task in config.tasks:
            if task.dependencies:
                for dep in task.dependencies:
                    deps_code += f"    {dep} >> {task.task_id}\n"
        return deps_code
    def validate(self, code: str) -> Dict[str, Any]:
        """Validate generated DAG code."""
        issues = []
        warnings = []

        # Check for common issues
        if 'default_args' not in code:
            issues.append("Missing default_args definition")
        if 'with DAG' not in code:
            issues.append("Missing DAG context manager")
        if 'schedule_interval' not in code:
            warnings.append("No schedule_interval defined, DAG won't run automatically")

        # Try to parse the code
        try:
            compile(code, '<string>', 'exec')
        except SyntaxError as e:
            issues.append(f"Syntax error: {e}")

        return {
            'valid': len(issues) == 0,
            'issues': issues,
            'warnings': warnings
        }
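
# Usage sketch for AirflowGenerator (building on the PipelineConfig example above;
# the dags/ output path is an assumption about your deployment layout):
#
#   generator = AirflowGenerator()
#   dag_code = generator.generate(config)
#   report = generator.validate(dag_code)  # {'valid': bool, 'issues': [...], 'warnings': [...]}
#   if report['valid']:
#       Path("dags/orders_daily.py").write_text(dag_code)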
class PrefectGenerator(PipelineGenerator):
    """Generate Prefect flow code."""

    def generate(self, config: PipelineConfig) -> str:
        """Generate Prefect flow from configuration."""
        code = self._generate_header()
        code += self._generate_tasks(config)
        code += self._generate_flow(config)
        return code

    def _generate_header(self) -> str:
        """Generate file header."""
        return '''"""
Auto-generated Prefect Flow
Generated by Pipeline Orchestrator
"""
from prefect import flow, task, get_run_logger
from prefect.tasks import task_input_hash
from datetime import timedelta
import pandas as pd
'''

    def _generate_tasks(self, config: PipelineConfig) -> str:
        """Generate Prefect tasks."""
        tasks_code = ""
        for task_config in config.tasks:
            cache_expiration = task_config.params.get('cache_hours', 1)
            tasks_code += f'''
@task(
    name="{task_config.task_id}",
    retries={task_config.retries},
    retry_delay_seconds={task_config.retry_delay_minutes * 60},
    cache_key_fn=task_input_hash,
    cache_expiration=timedelta(hours={cache_expiration}),
)
def {task_config.task_id}(input_data=None):
    """Task: {task_config.task_id}"""
    logger = get_run_logger()
    logger.info(f"Executing {task_config.task_id}")
    # Add processing logic here
    result = input_data
    return result
'''
        return tasks_code

    def _generate_flow(self, config: PipelineConfig) -> str:
        """Generate Prefect flow."""
        flow_code = f'''
@flow(
    name="{config.name}",
    description="{config.description}",
    version="1.0.0",
)
def {config.name.replace('-', '_')}_flow():
    """Main flow orchestrating all tasks."""
    logger = get_run_logger()
    logger.info("Starting flow: {config.name}")
'''
        # Generate task calls with dependencies
        task_vars = {}
        for i, task_config in enumerate(config.tasks):
            task_name = task_config.task_id
            var_name = f"result_{i}"
            task_vars[task_name] = var_name
            if task_config.dependencies:
                # Get input from first dependency
                dep_var = task_vars.get(task_config.dependencies[0], "None")
                flow_code += f"    {var_name} = {task_name}({dep_var})\n"
            else:
                flow_code += f"    {var_name} = {task_name}()\n"

        flow_code += '''
    logger.info("Flow completed successfully")
    return True


if __name__ == "__main__":
    ''' + f'{config.name.replace("-", "_")}_flow()' + '\n'
        return flow_code

    def validate(self, code: str) -> Dict[str, Any]:
        """Validate Prefect flow code."""
        issues = []
        if '@flow' not in code:
            issues.append("Missing @flow decorator")
        if '@task' not in code:
            issues.append("No tasks defined with @task decorator")
        try:
            compile(code, '<string>', 'exec')
        except SyntaxError as e:
            issues.append(f"Syntax error: {e}")
        return {
            'valid': len(issues) == 0,
            'issues': issues,
            'warnings': []
        }
class DagsterGenerator(PipelineGenerator):
    """Generate Dagster job code."""

    def generate(self, config: PipelineConfig) -> str:
        """Generate Dagster job from configuration."""
        code = self._generate_header()
        code += self._generate_ops(config)
        code += self._generate_job(config)
        return code

    def _generate_header(self) -> str:
        """Generate file header."""
        return '''"""
Auto-generated Dagster Job
Generated by Pipeline Orchestrator
"""
from dagster import op, job, In, Out, Output, DynamicOut, graph
from dagster import AssetMaterialization, MetadataValue
import pandas as pd
'''

    def _generate_ops(self, config: PipelineConfig) -> str:
        """Generate Dagster ops."""
        ops_code = ""
        for task_config in config.tasks:
            has_input = len(task_config.dependencies) > 0
            if has_input:
                ops_code += f'''
@op(
    ins={{"input_data": In()}},
    out=Out(),
)
def {task_config.task_id}(context, input_data):
    """Op: {task_config.task_id}"""
    context.log.info(f"Executing {task_config.task_id}")
    # Add processing logic here
    result = input_data
    # Log asset materialization
    yield AssetMaterialization(
        asset_key="{task_config.task_id}",
        metadata={{
            "row_count": MetadataValue.int(len(result) if hasattr(result, '__len__') else 0),
        }}
    )
    yield Output(result)
'''
            else:
                ops_code += f'''
@op(out=Out())
def {task_config.task_id}(context):
    """Op: {task_config.task_id}"""
    context.log.info(f"Executing {task_config.task_id}")
    # Add processing logic here
    result = {{}}
    yield AssetMaterialization(
        asset_key="{task_config.task_id}",
    )
    yield Output(result)
'''
        return ops_code

    def _generate_job(self, config: PipelineConfig) -> str:
        """Generate Dagster job."""
        job_code = f'''
@job(
    name="{config.name}",
    description="{config.description}",
    tags={{
        "owner": "{config.owner}",
        "schedule": "{config.schedule}",
    }},
)
def {config.name.replace('-', '_')}_job():
    """Main job orchestrating all ops."""
'''
        # Build dependency graph
        task_outputs = {}
        for task_config in config.tasks:
            task_name = task_config.task_id
            if task_config.dependencies:
                dep_output = task_outputs.get(task_config.dependencies[0], None)
                if dep_output:
                    job_code += f"    {task_name}_output = {task_name}({dep_output})\n"
                else:
                    job_code += f"    {task_name}_output = {task_name}()\n"
            else:
                job_code += f"    {task_name}_output = {task_name}()\n"
            task_outputs[task_name] = f"{task_name}_output"
        return job_code

    def validate(self, code: str) -> Dict[str, Any]:
        """Validate Dagster job code."""
        issues = []
        if '@job' not in code:
            issues.append("Missing @job decorator")
        if '@op' not in code:
            issues.append("No ops defined with @op decorator")
        try:
            compile(code, '<string>', 'exec')
        except SyntaxError as e:
            issues.append(f"Syntax error: {e}")
        return {
            'valid': len(issues) == 0,
            'issues': issues,
            'warnings': []
        }
# ============================================================================
# ETL Pattern Templates
# ============================================================================

class ETLPatternGenerator:
    """Generate common ETL patterns."""

    @staticmethod
    def generate_extract_load(
        source_type: str,
        destination_type: str,
        tables: List[str],
        mode: str = "incremental"
    ) -> PipelineConfig:
        """Generate extract-load pipeline configuration."""
        tasks = []

        # Extract tasks
        for table in tables:
            extract_task = TaskConfig(
                task_id=f"extract_{table}",
                operator="python_operator",
                params={
                    'callable': f'extract_{table}',
                    'sql': f'SELECT * FROM {table}' + (
                        ' WHERE updated_at > {{ prev_ds }}' if mode == 'incremental' else ''
                    )
                }
            )
            tasks.append(extract_task)

        # Load tasks with dependencies
        for table in tables:
            load_task = TaskConfig(
                task_id=f"load_{table}",
                operator="python_operator",
                dependencies=[f"extract_{table}"],
                params={'callable': f'load_{table}'}
            )
            tasks.append(load_task)

        # Quality check task
        quality_task = TaskConfig(
            task_id="quality_check",
            operator="python_operator",
            dependencies=[f"load_{table}" for table in tables],
            params={'callable': 'run_quality_checks'}
        )
        tasks.append(quality_task)

        return PipelineConfig(
            name=f"el_{source_type}_to_{destination_type}",
            description=f"Extract from {source_type}, load to {destination_type}",
            schedule="0 5 * * *",  # Daily at 5 AM
            tags=["etl", source_type, destination_type],
            source=SourceConfig(
                type=source_type,
                connection_id=f"{source_type}_default",
                tables=tables,
                incremental_strategy="timestamp" if mode == "incremental" else "full"
            ),
            destination=DestinationConfig(
                type=destination_type,
                connection_id=f"{destination_type}_default",
                write_mode="append" if mode == "incremental" else "overwrite"
            ),
            tasks=tasks
        )
    @staticmethod
    def generate_transform_pipeline(
        source_tables: List[str],
        target_table: str,
        dbt_models: List[str]
    ) -> PipelineConfig:
        """Generate transformation pipeline with dbt."""
        tasks = []

        # Sensor for source freshness
        for table in source_tables:
            sensor_task = TaskConfig(
                task_id=f"wait_for_{table}",
                operator="sql_sensor",
                params={
                    'sql': f"SELECT MAX(updated_at) FROM {table} WHERE updated_at > '{{{{ ds }}}}'"
                }
            )
            tasks.append(sensor_task)

        # dbt run task
        dbt_run = TaskConfig(
            task_id="dbt_run",
            operator="bash_operator",
            dependencies=[f"wait_for_{t}" for t in source_tables],
            params={
                'command': f'cd /opt/dbt && dbt run --select {" ".join(dbt_models)}'
            },
            timeout_minutes=120
        )
        tasks.append(dbt_run)

        # dbt test task
        dbt_test = TaskConfig(
            task_id="dbt_test",
            operator="bash_operator",
            dependencies=["dbt_run"],
            params={
                'command': f'cd /opt/dbt && dbt test --select {" ".join(dbt_models)}'
            }
        )
        tasks.append(dbt_test)

        return PipelineConfig(
            name=f"transform_{target_table}",
            description=f"Transform data into {target_table} using dbt",
            schedule="0 6 * * *",  # Daily at 6 AM (after extraction)
            tags=["transform", "dbt"],
            tasks=tasks
        )
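
# Example sketch, mirroring what the `template` CLI command below does:
#
#   config = ETLPatternGenerator.generate_extract_load(
#       source_type="postgres",
#       destination_type="snowflake",
#       tables=["orders", "customers"],
#       mode="incremental",
#   )
#   print(AirflowGenerator().generate(config))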
# ============================================================================
# CLI Interface
# ============================================================================

def main():
    parser = argparse.ArgumentParser(
        description="Pipeline Orchestrator - Generate and manage data pipeline configurations",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  Generate Airflow DAG:
    python pipeline_orchestrator.py generate --type airflow --source postgres --destination snowflake --tables orders,customers

  Generate from config file:
    python pipeline_orchestrator.py generate --config pipeline.yaml --type prefect

  Validate existing DAG:
    python pipeline_orchestrator.py validate --dag dags/my_dag.py --type airflow
"""
    )
    subparsers = parser.add_subparsers(dest='command', help='Command to run')

    # Generate command
    gen_parser = subparsers.add_parser('generate', help='Generate pipeline code')
    gen_parser.add_argument('--type', '-t', required=True,
                            choices=['airflow', 'prefect', 'dagster'],
                            help='Pipeline framework type')
    gen_parser.add_argument('--source', '-s', help='Source system type')
    gen_parser.add_argument('--destination', '-d', help='Destination system type')
    gen_parser.add_argument('--tables', help='Comma-separated list of tables')
    gen_parser.add_argument('--config', '-c', help='Configuration YAML file')
    gen_parser.add_argument('--output', '-o', help='Output file path')
    gen_parser.add_argument('--name', '-n', help='Pipeline name')
    gen_parser.add_argument('--schedule', default='0 5 * * *', help='Cron schedule')
    gen_parser.add_argument('--mode', default='incremental',
                            choices=['incremental', 'full'],
                            help='Load mode')

    # Validate command
    val_parser = subparsers.add_parser('validate', help='Validate pipeline code')
    val_parser.add_argument('--dag', required=True, help='DAG file to validate')
    val_parser.add_argument('--type', '-t', required=True,
                            choices=['airflow', 'prefect', 'dagster'])

    # Template command
    tmpl_parser = subparsers.add_parser('template', help='Generate from template')
    tmpl_parser.add_argument('--pattern', '-p', required=True,
                             choices=['extract-load', 'transform', 'cdc'],
                             help='ETL pattern to generate')
    tmpl_parser.add_argument('--type', '-t', required=True,
                             choices=['airflow', 'prefect', 'dagster'])
    tmpl_parser.add_argument('--source', '-s', required=True)
    tmpl_parser.add_argument('--destination', '-d', required=True)
    tmpl_parser.add_argument('--tables', required=True)
    tmpl_parser.add_argument('--output', '-o', help='Output file path')

    args = parser.parse_args()

    if args.command is None:
        parser.print_help()
        sys.exit(1)
    try:
        if args.command == 'generate':
            # Load config if provided
            if args.config:
                with open(args.config) as f:
                    if HAS_YAML:
                        config_data = yaml.safe_load(f)
                    else:
                        config_data = json.load(f)
                config = PipelineConfig(**config_data)
            else:
                # Build config from arguments
                tables = args.tables.split(',') if args.tables else []
                config = ETLPatternGenerator.generate_extract_load(
                    source_type=args.source or 'postgres',
                    destination_type=args.destination or 'snowflake',
                    tables=tables,
                    mode=args.mode
                )
                if args.name:
                    config.name = args.name
                config.schedule = args.schedule

            # Generate code
            generators = {
                'airflow': AirflowGenerator(),
                'prefect': PrefectGenerator(),
                'dagster': DagsterGenerator()
            }
            generator = generators[args.type]
            code = generator.generate(config)

            # Validate
            validation = generator.validate(code)
            if not validation['valid']:
                logger.warning(f"Validation issues: {validation['issues']}")

            # Output
            if args.output:
                with open(args.output, 'w') as f:
                    f.write(code)
                logger.info(f"Generated pipeline saved to {args.output}")
            else:
                print(code)

        elif args.command == 'validate':
            with open(args.dag) as f:
                code = f.read()
            generators = {
                'airflow': AirflowGenerator(),
                'prefect': PrefectGenerator(),
                'dagster': DagsterGenerator()
            }
            generator = generators[args.type]
            result = generator.validate(code)
            print(json.dumps(result, indent=2))
            sys.exit(0 if result['valid'] else 1)

        elif args.command == 'template':
            tables = args.tables.split(',')
            if args.pattern == 'extract-load':
                config = ETLPatternGenerator.generate_extract_load(
                    source_type=args.source,
                    destination_type=args.destination,
                    tables=tables
                )
            elif args.pattern == 'transform':
                config = ETLPatternGenerator.generate_transform_pipeline(
                    source_tables=tables,
                    target_table='fct_output',
                    dbt_models=['stg_*', 'fct_*']
                )
            else:
                logger.error(f"Pattern {args.pattern} not yet implemented")
                sys.exit(1)
            generators = {
                'airflow': AirflowGenerator(),
                'prefect': PrefectGenerator(),
                'dagster': DagsterGenerator()
            }
            generator = generators[args.type]
            code = generator.generate(config)
            if args.output:
                with open(args.output, 'w') as f:
                    f.write(code)
                logger.info(f"Generated {args.pattern} pipeline saved to {args.output}")
            else:
                print(code)

        sys.exit(0)

    except Exception as e:
        logger.error(f"Error: {e}")
        sys.exit(1)


if __name__ == '__main__':
    main()