refactor: split 21 over-500-line skills into SKILL.md + references (#296)
@@ -369,204 +369,7 @@ Status page: {link}
- **{Pitfall}:** {description and how to avoid}

## Reference Information

- **Architecture Diagram:** {link}
- **Monitoring Dashboard:** {link}
- **Related Runbooks:** {links to dependent service runbooks}
```

### Post-Incident Review (PIR) Framework

#### PIR Timeline and Ownership

**Timeline:**

- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress
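
The milestone offsets above can be encoded directly as data; a minimal sketch (helper names are illustrative, and business days are approximated as calendar days):

```python
from datetime import datetime, timedelta

# Hypothetical milestone table following the PIR timeline above.
# Business days are approximated as calendar days for brevity.
PIR_MILESTONES = {
    "initial_draft": timedelta(hours=24),
    "final_pir": timedelta(days=3),
    "action_items_assigned": timedelta(weeks=1),
    "follow_up_review": timedelta(weeks=4),
}

def pir_due_dates(resolved_at: datetime) -> dict:
    """Map each PIR milestone to its due timestamp."""
    return {name: resolved_at + delta for name, delta in PIR_MILESTONES.items()}

due = pir_due_dates(datetime(2024, 5, 1, 9, 0))
print(due["initial_draft"])  # 2024-05-02 09:00:00
```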

**Roles:**

- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity

#### Root Cause Analysis Frameworks

#### 1. Five Whys Method

The Five Whys technique involves asking "why" repeatedly to drill down to root causes:

**Example Application:**

- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature wasn't using connection pooling
- **Why 4:** Why wasn't the feature using connection pooling? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns

**Best Practices:**

- Ask "why" at least 3 times; 5+ iterations are often needed
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain
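
Recording the chain as data keeps the analysis reviewable; a minimal sketch using the example above (the helper is illustrative, not part of any incident tooling):

```python
# The Five Whys chain from the example above as (question, answer) pairs.
why_chain = [
    ("Why did the database become unresponsive?",
     "Connection pool was exhausted"),
    ("Why was the connection pool exhausted?",
     "Application was creating more connections than usual"),
    ("Why was the application creating more connections?",
     "New feature wasn't using connection pooling"),
    ("Why wasn't the feature using connection pooling?",
     "Code review missed this pattern"),
    ("Why did code review miss this?",
     "No automated checks for connection pooling patterns"),
]

def candidate_root_cause(chain):
    """Treat the last answer as the candidate root cause; insist on >= 3 whys."""
    if len(chain) < 3:
        raise ValueError("ask 'why' at least three times before stopping")
    return chain[-1][1]

print(candidate_root_cause(why_chain))
# No automated checks for connection pooling patterns
```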

#### 2. Fishbone (Ishikawa) Diagram

Systematic analysis across multiple categories of potential causes:

**Categories:**

- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors

**Application Method:**

1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident

#### 3. Timeline Analysis

Reconstruct the incident chronologically to identify decision points and missed opportunities:

**Timeline Elements:**

- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?

**Analysis Questions:**

- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected/resolved faster?

### Escalation Paths

#### Technical Escalation

**Level 1:** On-call engineer

- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)

**Level 2:** Senior engineer/Team lead

- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)

**Level 3:** Engineering Manager/Staff Engineer

- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)

**Level 4:** Director of Engineering/CTO

- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)

#### Business Escalation

**Customer Impact Assessment:**

- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only

**Escalation Matrix:**

| Severity | Duration | Business Escalation |
|----------|----------|---------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |
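
The matrix above reads naturally as a lookup table; a minimal sketch (thresholds are minutes since incident start, and "1 business day" is approximated as 480 minutes):

```python
# The business-escalation matrix above as data; thresholds are minutes
# since incident start, and "1 business day" is approximated as 480 min.
ESCALATION_MATRIX = {
    "SEV1": [(0, "VP Engineering"),
             (30, "CTO + Customer Success VP"),
             (60, "CEO + Full Executive Team")],
    "SEV2": [(120, "VP Engineering"),
             (240, "CTO")],
    "SEV3": [(480, "Engineering Manager")],
}

def business_escalation(severity, minutes_elapsed):
    """Return the highest escalation tier reached so far, or None."""
    reached = [who for threshold, who in ESCALATION_MATRIX.get(severity, [])
               if minutes_elapsed >= threshold]
    return reached[-1] if reached else None

print(business_escalation("SEV1", 45))  # CTO + Customer Success VP
```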

### Status Page Management

#### Update Principles

1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time

#### Status Categories

- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window

#### Update Template

```
{Timestamp} - {Status Category}

{Brief description of current state}

Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}

Next update: {specific time}

We apologize for any inconvenience this may cause.
```
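
Filling the template programmatically keeps updates consistent; a sketch (the render helper is hypothetical, its fields mirror the template above):

```python
from datetime import datetime, timezone

# Hypothetical helper that fills the status-page template above;
# field names follow the template exactly.
def render_status_update(status, description, impact, cause,
                         resolution, next_update, now=None):
    now = now or datetime.now(timezone.utc)
    timestamp = now.strftime("%Y-%m-%d %H:%M UTC")
    return (
        f"{timestamp} - {status}\n\n"
        f"{description}\n\n"
        f"Impact: {impact}\n"
        f"Cause: {cause}\n"
        f"Resolution: {resolution}\n\n"
        f"Next update: {next_update}"
    )

update = render_status_update(
    "Partial Outage",
    "Checkout is failing for a subset of users.",
    "~10% of checkout attempts return errors",
    "under investigation",
    "Rolling back the 14:20 UTC deploy",
    "15:30 UTC",
)
print(update)
```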

### Action Item Framework

#### Action Item Categories

1. **Immediate Fixes**
   - Critical bugs discovered during incident
   - Security vulnerabilities exposed
   - Data integrity issues

2. **Process Improvements**
   - Communication gaps
   - Escalation procedure updates
   - Runbook additions/updates

3. **Technical Debt**
   - Architecture improvements
   - Monitoring enhancements
   - Automation opportunities

4. **Organizational Changes**
   - Team structure adjustments
   - Training requirements
   - Tool/platform investments

#### Action Item Template

```
**Title:** {Concise description of the action}
**Priority:** {Critical/High/Medium/Low}
**Category:** {Fix/Process/Technical/Organizational}
**Owner:** {Assigned person}
**Due Date:** {Specific date}
**Success Criteria:** {How will we know this is complete}
**Dependencies:** {What needs to happen first}
**Related PIRs:** {Links to other incidents this addresses}

**Description:**
{Detailed description of what needs to be done and why}

**Implementation Plan:**
1. {Step 1}
2. {Step 2}
3. {Validation step}

**Progress Updates:**
- {Date}: {Progress update}
- {Date}: {Progress update}
```

→ See references/reference-information.md for details

## Usage Examples
@@ -670,4 +473,4 @@ The Incident Commander skill provides a comprehensive framework for managing inc

The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.

Remember: The goal isn't to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.
@@ -0,0 +1,201 @@

# incident-commander reference
@@ -9,18 +9,5 @@
"homepage": "https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/playwright-pro",
"repository": "https://github.com/alirezarezvani/claude-skills",
"license": "MIT",
"keywords": [
  "playwright",
  "testing",
  "e2e",
  "qa",
  "browserstack",
  "testrail",
  "test-automation",
  "cross-browser",
  "migration",
  "cypress",
  "selenium"
],
- "skills": "./skills"
+ "skills": "./"
}
@@ -419,99 +419,7 @@ python scripts/dataset_pipeline_builder.py data/final/ \
| Positional encoding | Implicit | Explicit |

## Reference Documentation

### 1. Computer Vision Architectures

See `references/computer_vision_architectures.md` for:

- CNN backbone architectures (ResNet, EfficientNet, ConvNeXt)
- Vision Transformer variants (ViT, DeiT, Swin)
- Detection heads (anchor-based vs anchor-free)
- Feature Pyramid Networks (FPN, BiFPN, PANet)
- Neck architectures for multi-scale detection

### 2. Object Detection Optimization

See `references/object_detection_optimization.md` for:

- Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS)
- Anchor optimization and anchor-free alternatives
- Loss function design (focal loss, GIoU, CIoU, DIoU)
- Training strategies (warmup, cosine annealing, EMA)
- Data augmentation for detection (mosaic, mixup, copy-paste)
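
The NMS variants listed above all share one greedy loop; a minimal pure-Python sketch over `[x1, y1, x2, y2, score]` boxes (Soft-NMS and DIoU-NMS change only the suppression test):

```python
# Greedy NMS: keep highest-scoring boxes, drop overlaps above iou_thresh.
def iou(a, b):
    """Intersection-over-union of two [x1, y1, x2, y2, ...] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def nms(boxes, iou_thresh=0.5):
    """Boxes are [x1, y1, x2, y2, score]; returns the kept boxes."""
    keep = []
    for box in sorted(boxes, key=lambda b: b[4], reverse=True):
        if all(iou(box, k) < iou_thresh for k in keep):
            keep.append(box)
    return keep

boxes = [[0, 0, 10, 10, 0.9], [1, 1, 10, 10, 0.8], [20, 20, 30, 30, 0.7]]
print(len(nms(boxes)))  # 2
```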

### 3. Production Vision Systems

See `references/production_vision_systems.md` for:

- ONNX export and optimization
- TensorRT deployment pipeline
- Batch inference optimization
- Edge device deployment (Jetson, Intel NCS)
- Model serving with Triton
- Video processing pipelines

## Common Commands

### Ultralytics YOLO

```bash
# Training
yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640

# Validation
yolo detect val model=best.pt data=coco.yaml

# Inference
yolo detect predict model=best.pt source=images/ save=True

# Export
yolo export model=best.pt format=onnx simplify=True dynamic=True
```

### Detectron2

```bash
# Training
python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \
    --num-gpus 1 OUTPUT_DIR ./output

# Evaluation
python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \
    MODEL.WEIGHTS output/model_final.pth

# Inference
python demo.py --config-file configs/faster_rcnn.yaml \
    --input images/*.jpg --output results/ \
    --opts MODEL.WEIGHTS output/model_final.pth
```

### MMDetection

```bash
# Training
python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py

# Testing
python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth --eval bbox

# Inference
python demo/image_demo.py demo.jpg configs/faster_rcnn.py checkpoints/latest.pth
```

### Model Optimization

```bash
# ONNX export and simplify
python -c "import torch; model = torch.load('model.pt'); torch.onnx.export(model, torch.randn(1,3,640,640), 'model.onnx', opset_version=17)"
python -m onnxsim model.onnx model_sim.onnx

# TensorRT conversion
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --workspace=4096

# Benchmark
trtexec --loadEngine=model.engine --batch=1 --iterations=1000 --avgRuns=100
```
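
`trtexec --iterations/--avgRuns` above reports latency statistics; the same measurement for any Python-callable inference step can be sketched as:

```python
import statistics
import time

# A minimal latency benchmark for any inference callable, mirroring
# what `trtexec --iterations ... --avgRuns ...` reports.
def benchmark(infer, warmup=10, iterations=100):
    for _ in range(warmup):  # warm caches/JIT before timing
        infer()
    samples_ms = []
    for _ in range(iterations):
        start = time.perf_counter()
        infer()
        samples_ms.append((time.perf_counter() - start) * 1000.0)
    samples_ms.sort()
    return {
        "mean_ms": statistics.mean(samples_ms),
        "p95_ms": samples_ms[int(0.95 * len(samples_ms)) - 1],
    }

stats = benchmark(lambda: sum(range(10_000)))
print(sorted(stats))  # ['mean_ms', 'p95_ms']
```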

→ See references/reference-docs-and-commands.md for details

## Performance Targets
@@ -0,0 +1,96 @@

# senior-computer-vision reference
@@ -86,627 +86,7 @@ python scripts/etl_performance_optimizer.py analyze \

---

## Workflows

### Workflow 1: Building a Batch ETL Pipeline

**Scenario:** Extract data from PostgreSQL, transform with dbt, load to Snowflake.

#### Step 1: Define Source Schema

```sql
-- Document source tables
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'source_schema'
ORDER BY table_name, ordinal_position;
```

#### Step 2: Generate Extraction Config

```bash
python scripts/pipeline_orchestrator.py generate \
    --type airflow \
    --source postgres \
    --tables orders,customers,products \
    --mode incremental \
    --watermark updated_at \
    --output dags/extract_source.py
```

#### Step 3: Create dbt Models

```sql
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('postgres', 'orders') }}
),

renamed AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        total_amount,
        status,
        _extracted_at
    FROM source
    WHERE order_date >= DATEADD(day, -3, CURRENT_DATE)
)

SELECT * FROM renamed
```

```sql
-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        cluster_by=['order_date']
    )
}}

SELECT
    o.order_id,
    o.customer_id,
    c.customer_segment,
    o.order_date,
    o.total_amount,
    o.status
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
    ON o.customer_id = c.customer_id

{% if is_incremental() %}
WHERE o._extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
{% endif %}
```

#### Step 4: Configure Data Quality Tests

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 1
```

#### Step 5: Create Airflow DAG

```python
# dags/daily_etl.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl_pipeline',
    default_args=default_args,
    description='Daily ETL from PostgreSQL to Snowflake',
    schedule_interval='0 5 * * *',
    start_date=days_ago(1),
    catchup=False,
    tags=['etl', 'daily'],
) as dag:

    extract = BashOperator(
        task_id='extract_source_data',
        bash_command='python /opt/airflow/scripts/extract.py --date {{ ds }}',
    )

    transform = BashOperator(
        task_id='run_dbt_models',
        bash_command='cd /opt/airflow/dbt && dbt run --select marts.*',
    )

    test = BashOperator(
        task_id='run_dbt_tests',
        bash_command='cd /opt/airflow/dbt && dbt test --select marts.*',
    )

    notify = BashOperator(
        task_id='send_notification',
        bash_command='python /opt/airflow/scripts/notify.py --status success',
        trigger_rule='all_success',
    )

    extract >> transform >> test >> notify
```

#### Step 6: Validate Pipeline

```bash
# Test locally
dbt run --select stg_orders fct_orders
dbt test --select fct_orders

# Validate data quality
python scripts/data_quality_validator.py validate \
    --table fct_orders \
    --checks all \
    --output reports/quality_report.json
```

---

### Workflow 2: Implementing Real-Time Streaming

**Scenario:** Stream events from Kafka, process with Flink/Spark Streaming, sink to data lake.

#### Step 1: Define Event Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "UserEvent",
  "type": "object",
  "required": ["event_id", "user_id", "event_type", "timestamp"],
  "properties": {
    "event_id": {"type": "string", "format": "uuid"},
    "user_id": {"type": "string"},
    "event_type": {"type": "string", "enum": ["page_view", "click", "purchase"]},
    "timestamp": {"type": "string", "format": "date-time"},
    "properties": {"type": "object"}
  }
}
```
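
Producers should reject malformed events before they reach the topic; a stdlib-only sketch of checks derived from the schema above (a real pipeline would use a JSON Schema validator library instead):

```python
# Stdlib-only checks derived from the UserEvent schema above; a real
# pipeline would use a JSON Schema validator library instead.
REQUIRED_FIELDS = {"event_id", "user_id", "event_type", "timestamp"}
EVENT_TYPES = {"page_view", "click", "purchase"}

def validate_event(event):
    """Return a list of validation errors; an empty list means valid."""
    errors = [f"missing field: {f}"
              for f in sorted(REQUIRED_FIELDS - event.keys())]
    if event.get("event_type") not in EVENT_TYPES:
        errors.append(f"unknown event_type: {event.get('event_type')!r}")
    return errors

good = {"event_id": "e-1", "user_id": "u42",
        "event_type": "click", "timestamp": "2024-05-01T12:00:00Z"}
print(validate_event(good))               # []
print(validate_event({"user_id": "u42"})) # four errors
```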

#### Step 2: Create Kafka Topic

```bash
# Create topic with appropriate partitions
kafka-topics.sh --create \
    --bootstrap-server localhost:9092 \
    --topic user-events \
    --partitions 12 \
    --replication-factor 3 \
    --config retention.ms=604800000 \
    --config cleanup.policy=delete

# Verify topic
kafka-topics.sh --describe \
    --bootstrap-server localhost:9092 \
    --topic user-events
```

#### Step 3: Implement Spark Streaming Job

```python
# streaming/user_events_processor.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    from_json, col, window, count,
    approx_count_distinct, to_timestamp
)
from pyspark.sql.types import (
    StructType, StructField, StringType, MapType
)

# Initialize Spark
spark = SparkSession.builder \
    .appName("UserEventsProcessor") \
    .config("spark.sql.streaming.checkpointLocation", "/checkpoints/user-events") \
    .config("spark.sql.shuffle.partitions", "12") \
    .getOrCreate()

# Define schema
event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("timestamp", StringType(), False),
    StructField("properties", MapType(StringType(), StringType()), True)
])

# Read from Kafka
events_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "user-events") \
    .option("startingOffsets", "latest") \
    .option("failOnDataLoss", "false") \
    .load()

# Parse JSON
parsed_df = events_df \
    .select(from_json(col("value").cast("string"), event_schema).alias("data")) \
    .select("data.*") \
    .withColumn("event_timestamp", to_timestamp(col("timestamp")))

# Windowed aggregation
aggregated_df = parsed_df \
    .withWatermark("event_timestamp", "10 minutes") \
    .groupBy(
        window(col("event_timestamp"), "5 minutes"),
        col("event_type")
    ) \
    .agg(
        count("*").alias("event_count"),
        approx_count_distinct("user_id").alias("unique_users")
    )

# Write to Delta Lake
query = aggregated_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/user-events-aggregated") \
    .option("path", "/data/lake/user_events_aggregated") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()
```

#### Step 4: Handle Late Data and Errors

```python
# Dead letter queue for failed records
import logging

from pyspark.sql.functions import col, current_timestamp, lit

logger = logging.getLogger(__name__)

def process_with_error_handling(batch_df, batch_id):
    try:
        # Attempt processing
        valid_df = batch_df.filter(col("event_id").isNotNull())
        invalid_df = batch_df.filter(col("event_id").isNull())

        # Write valid records
        valid_df.write \
            .format("delta") \
            .mode("append") \
            .save("/data/lake/user_events")

        # Write invalid to DLQ
        if invalid_df.count() > 0:
            invalid_df \
                .withColumn("error_timestamp", current_timestamp()) \
                .withColumn("error_reason", lit("missing_event_id")) \
                .write \
                .format("delta") \
                .mode("append") \
                .save("/data/lake/dlq/user_events")

    except Exception as e:
        # Log error, alert, continue
        logger.error(f"Batch {batch_id} failed: {e}")
        raise

# Use foreachBatch for custom processing
query = parsed_df.writeStream \
    .foreachBatch(process_with_error_handling) \
    .option("checkpointLocation", "/checkpoints/user-events") \
    .start()
```

#### Step 5: Monitor Stream Health

```python
# monitoring/stream_metrics.py
from prometheus_client import Gauge, Counter, start_http_server

# Define metrics
RECORDS_PROCESSED = Counter(
    'stream_records_processed_total',
    'Total records processed',
    ['stream_name', 'status']
)

PROCESSING_LAG = Gauge(
    'stream_processing_lag_seconds',
    'Current processing lag',
    ['stream_name']
)

BATCH_DURATION = Gauge(
    'stream_batch_duration_seconds',
    'Last batch processing duration',
    ['stream_name']
)

def emit_metrics(query):
    """Emit Prometheus metrics from streaming query."""
    progress = query.lastProgress
    if progress:
        RECORDS_PROCESSED.labels(
            stream_name='user-events',
            status='success'
        ).inc(progress['numInputRows'])

        if progress['sources']:
            # Calculate lag from latest offset
            for source in progress['sources']:
                end_offset = source.get('endOffset', {})
                # Parse Kafka offsets and calculate lag
```
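The lag placeholder above can be filled in by comparing the source's end offsets with the offsets already processed. A rough sketch, assuming both are JSON-encoded `{topic: {partition: offset}}` maps as Kafka sources report them in `lastProgress` (the function name is illustrative):

```python
import json

def total_offset_lag(end_offset_json: str, current_offset_json: str) -> int:
    """Sum per-partition (end - current) offsets from Kafka progress JSON."""
    end = json.loads(end_offset_json)
    cur = json.loads(current_offset_json)
    lag = 0
    for topic, partitions in end.items():
        for partition, end_off in partitions.items():
            # Missing partitions are treated as fully caught up
            cur_off = cur.get(topic, {}).get(partition, end_off)
            lag += max(0, end_off - cur_off)
    return lag

print(total_offset_lag(
    '{"user-events": {"0": 120, "1": 80}}',
    '{"user-events": {"0": 100, "1": 80}}',
))  # 20
```

The result can then be fed to `PROCESSING_LAG.labels(stream_name='user-events').set(...)`, converted to seconds if a time-based lag is preferred.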

---

### Workflow 3: Data Quality Framework Setup

**Scenario:** Implement comprehensive data quality monitoring with Great Expectations.

#### Step 1: Initialize Great Expectations

```bash
# Install and initialize
pip install great_expectations

great_expectations init

# Connect to data source
great_expectations datasource new
```

#### Step 2: Create Expectation Suite

```python
# expectations/orders_suite.py
import great_expectations as gx

context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("orders_quality_suite")

# Add expectations
validator = context.get_validator(
    batch_request={
        "datasource_name": "warehouse",
        "data_asset_name": "orders",
    },
    expectation_suite_name="orders_quality_suite"
)

# Schema expectations
validator.expect_table_columns_to_match_ordered_list(
    column_list=[
        "order_id", "customer_id", "order_date",
        "total_amount", "status", "created_at"
    ]
)

# Completeness expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("order_date")

# Uniqueness expectations
validator.expect_column_values_to_be_unique("order_id")

# Range expectations
validator.expect_column_values_to_be_between(
    "total_amount",
    min_value=0,
    max_value=1000000
)

# Categorical expectations
validator.expect_column_values_to_be_in_set(
    "status",
    ["pending", "confirmed", "shipped", "delivered", "cancelled"]
)

# Freshness expectation
validator.expect_column_max_to_be_between(
    "order_date",
    min_value={"$PARAMETER": "now - timedelta(days=1)"},
    max_value={"$PARAMETER": "now"}
)

# Referential integrity
validator.expect_column_values_to_be_in_set(
    "customer_id",
    value_set={"$PARAMETER": "valid_customer_ids"}
)

validator.save_expectation_suite(discard_failed_expectations=False)
```

#### Step 3: Create Data Quality Checks with dbt

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table with data quality checks"

    tests:
      # Row count check
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_orders')

      # Freshness check
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 24

    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('dim_orders')
              field: order_id

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
              inclusive: true
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              row_condition: "status != 'cancelled'"

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
              severity: warn
```

#### Step 4: Implement Data Contracts

```yaml
# contracts/orders_contract.yaml
contract:
  name: "orders-data-contract"
  version: "1.0.0"
  owner: data-team@company.com

  schema:
    type: object
    properties:
      order_id:
        type: string
        format: uuid
        description: "Unique order identifier"
      customer_id:
        type: string
        not_null: true
      order_date:
        type: date
        not_null: true
      total_amount:
        type: decimal
        precision: 10
        scale: 2
        minimum: 0
      status:
        type: string
        enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]

  sla:
    freshness:
      max_delay_hours: 1
    completeness:
      min_percentage: 99.9
    accuracy:
      duplicate_tolerance: 0.01

  consumers:
    - name: "analytics-team"
      usage: "Daily reporting dashboards"
    - name: "ml-team"
      usage: "Churn prediction model"
```
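The `sla` block only pays off if something enforces it. A minimal sketch of an SLA check, assuming the metrics (delay, completeness, duplicate rate) are already computed elsewhere — the function and metric key names are illustrative, while the thresholds come from the contract above:

```python
def check_sla(metrics: dict, sla: dict) -> list:
    """Return the list of SLA sections violated by the given metrics."""
    violations = []
    if metrics["delay_hours"] > sla["freshness"]["max_delay_hours"]:
        violations.append("freshness")
    if metrics["completeness_pct"] < sla["completeness"]["min_percentage"]:
        violations.append("completeness")
    if metrics["duplicate_pct"] > sla["accuracy"]["duplicate_tolerance"]:
        violations.append("accuracy")
    return violations

sla = {
    "freshness": {"max_delay_hours": 1},
    "completeness": {"min_percentage": 99.9},
    "accuracy": {"duplicate_tolerance": 0.01},
}
print(check_sla(
    {"delay_hours": 2, "completeness_pct": 99.95, "duplicate_pct": 0.0}, sla
))  # ['freshness']
```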

#### Step 5: Set Up Quality Monitoring Dashboard

```python
# monitoring/quality_dashboard.py
from datetime import datetime

def generate_quality_report(connection, table_name: str) -> dict:
    """Generate comprehensive data quality report."""

    report = {
        "table": table_name,
        "timestamp": datetime.now().isoformat(),
        "checks": {}
    }

    # Row count check
    row_count = connection.execute(
        f"SELECT COUNT(*) FROM {table_name}"
    ).fetchone()[0]
    report["checks"]["row_count"] = {
        "value": row_count,
        "status": "pass" if row_count > 0 else "fail"
    }

    # Freshness check
    max_date = connection.execute(
        f"SELECT MAX(created_at) FROM {table_name}"
    ).fetchone()[0]
    hours_old = (datetime.now() - max_date).total_seconds() / 3600
    report["checks"]["freshness"] = {
        "max_timestamp": max_date.isoformat(),
        "hours_old": round(hours_old, 2),
        "status": "pass" if hours_old < 24 else "fail"
    }

    # Null rate check
    null_query = f"""
        SELECT
            SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) as null_order_id,
            SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_customer_id,
            COUNT(*) as total
        FROM {table_name}
    """
    null_result = connection.execute(null_query).fetchone()
    report["checks"]["null_rates"] = {
        "order_id": null_result[0] / null_result[2] if null_result[2] > 0 else 0,
        "customer_id": null_result[1] / null_result[2] if null_result[2] > 0 else 0,
        "status": "pass" if null_result[0] == 0 and null_result[1] == 0 else "fail"
    }

    # Duplicate check
    dup_query = f"""
        SELECT COUNT(*) - COUNT(DISTINCT order_id) as duplicates
        FROM {table_name}
    """
    duplicates = connection.execute(dup_query).fetchone()[0]
    report["checks"]["duplicates"] = {
        "count": duplicates,
        "status": "pass" if duplicates == 0 else "fail"
    }

    # Overall status
    all_passed = all(
        check["status"] == "pass"
        for check in report["checks"].values()
    )
    report["overall_status"] = "pass" if all_passed else "fail"

    return report
```

---

→ See references/workflows.md for details

## Architecture Decision Framework

@@ -810,183 +190,5 @@ See `references/dataops_best_practices.md` for:

---

## Troubleshooting
→ See references/troubleshooting.md for details

### Pipeline Failures

**Symptom:** Airflow DAG fails with timeout
```
Task exceeded max execution time
```

**Solution:**
1. Check resource allocation
2. Profile slow operations
3. Add incremental processing
```python
# Increase timeout
default_args = {
    'execution_timeout': timedelta(hours=2),
}

# Or use incremental loads
WHERE updated_at > '{{ prev_ds }}'
```

---

**Symptom:** Spark job OOM
```
java.lang.OutOfMemoryError: Java heap space
```

**Solution:**
1. Increase executor memory
2. Reduce partition size
3. Use disk spill
```python
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.memory.fraction", "0.8")
```

---

**Symptom:** Kafka consumer lag increasing
```
Consumer lag: 1000000 messages
```

**Solution:**
1. Increase consumer parallelism
2. Optimize processing logic
3. Scale consumer group
```bash
# Add more partitions
kafka-topics.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 24
```

---

### Data Quality Issues

**Symptom:** Duplicate records appearing
```
Expected unique, found 150 duplicates
```

**Solution:**
1. Add deduplication logic
2. Use merge/upsert operations
```sql
-- dbt incremental with dedup
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

SELECT * FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) as rn
    FROM {{ source('raw', 'orders') }}
) WHERE rn = 1
```

---

**Symptom:** Stale data in tables
```
Last update: 3 days ago
```

**Solution:**
1. Check upstream pipeline status
2. Verify source availability
3. Add freshness monitoring
```yaml
# dbt freshness check
sources:
  - name: raw
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at
```
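The warn/error thresholds above amount to a simple staleness classification; a sketch of the same logic in Python (function name and defaults are illustrative, mirroring the 12h/24h config):

```python
from datetime import datetime, timedelta

def freshness_status(loaded_at: datetime, now: datetime,
                     warn_after: timedelta = timedelta(hours=12),
                     error_after: timedelta = timedelta(hours=24)) -> str:
    """Classify source freshness the way dbt's warn_after/error_after do."""
    age = now - loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"

now = datetime(2024, 1, 15, 12, 0)
print(freshness_status(datetime(2024, 1, 14, 0, 0), now))  # error
```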

---

**Symptom:** Schema drift detected
```
Column 'new_field' not in expected schema
```

**Solution:**
1. Update data contract
2. Modify transformations
3. Communicate with producers
```python
# Handle schema evolution
df = spark.read.format("delta") \
    .option("mergeSchema", "true") \
    .load("/data/orders")
```

---

### Performance Issues

**Symptom:** Query takes hours
```
Query runtime: 4 hours (expected: 30 minutes)
```

**Solution:**
1. Check query plan
2. Add proper partitioning
3. Optimize joins
```sql
-- Before: Full table scan
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- After: Partition pruning
-- Table partitioned by order_date
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- Add clustering for frequent filters
ALTER TABLE orders CLUSTER BY (customer_id);
```

---

**Symptom:** dbt model takes too long
```
Model fct_orders completed in 45 minutes
```

**Solution:**
1. Use incremental materialization
2. Reduce upstream dependencies
3. Pre-aggregate where possible
```sql
-- Convert to incremental
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        on_schema_change='sync_all_columns'
    )
}}

SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```

@@ -0,0 +1,183 @@

# senior-data-engineer reference

## Troubleshooting

### Pipeline Failures

**Symptom:** Airflow DAG fails with timeout
```
Task exceeded max execution time
```

**Solution:**
1. Check resource allocation
2. Profile slow operations
3. Add incremental processing
```python
# Increase timeout
default_args = {
    'execution_timeout': timedelta(hours=2),
}

# Or use incremental loads
WHERE updated_at > '{{ prev_ds }}'
```

---

**Symptom:** Spark job OOM
```
java.lang.OutOfMemoryError: Java heap space
```

**Solution:**
1. Increase executor memory
2. Reduce partition size
3. Use disk spill
```python
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.memory.fraction", "0.8")
```

---

**Symptom:** Kafka consumer lag increasing
```
Consumer lag: 1000000 messages
```

**Solution:**
1. Increase consumer parallelism
2. Optimize processing logic
3. Scale consumer group
```bash
# Add more partitions
kafka-topics.sh --alter \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 24
```

---

### Data Quality Issues

**Symptom:** Duplicate records appearing
```
Expected unique, found 150 duplicates
```

**Solution:**
1. Add deduplication logic
2. Use merge/upsert operations
```sql
-- dbt incremental with dedup
{{
    config(
        materialized='incremental',
        unique_key='order_id'
    )
}}

SELECT * FROM (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY order_id
            ORDER BY updated_at DESC
        ) as rn
    FROM {{ source('raw', 'orders') }}
) WHERE rn = 1
```

---

**Symptom:** Stale data in tables
```
Last update: 3 days ago
```

**Solution:**
1. Check upstream pipeline status
2. Verify source availability
3. Add freshness monitoring
```yaml
# dbt freshness check
sources:
  - name: raw
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    loaded_at_field: _loaded_at
```

---

**Symptom:** Schema drift detected
```
Column 'new_field' not in expected schema
```

**Solution:**
1. Update data contract
2. Modify transformations
3. Communicate with producers
```python
# Handle schema evolution
df = spark.read.format("delta") \
    .option("mergeSchema", "true") \
    .load("/data/orders")
```

---

### Performance Issues

**Symptom:** Query takes hours
```
Query runtime: 4 hours (expected: 30 minutes)
```

**Solution:**
1. Check query plan
2. Add proper partitioning
3. Optimize joins
```sql
-- Before: Full table scan
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- After: Partition pruning
-- Table partitioned by order_date
SELECT * FROM orders WHERE order_date = '2024-01-15';

-- Add clustering for frequent filters
ALTER TABLE orders CLUSTER BY (customer_id);
```

---

**Symptom:** dbt model takes too long
```
Model fct_orders completed in 45 minutes
```

**Solution:**
1. Use incremental materialization
2. Reduce upstream dependencies
3. Pre-aggregate where possible
```sql
-- Convert to incremental
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        on_schema_change='sync_all_columns'
    )
}}

SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```
624 engineering-team/senior-data-engineer/references/workflows.md Normal file
@@ -0,0 +1,624 @@

# senior-data-engineer reference

## Workflows

### Workflow 1: Building a Batch ETL Pipeline

**Scenario:** Extract data from PostgreSQL, transform with dbt, load to Snowflake.

#### Step 1: Define Source Schema

```sql
-- Document source tables
SELECT
    table_name,
    column_name,
    data_type,
    is_nullable
FROM information_schema.columns
WHERE table_schema = 'source_schema'
ORDER BY table_name, ordinal_position;
```

#### Step 2: Generate Extraction Config

```bash
python scripts/pipeline_orchestrator.py generate \
  --type airflow \
  --source postgres \
  --tables orders,customers,products \
  --mode incremental \
  --watermark updated_at \
  --output dags/extract_source.py
```

#### Step 3: Create dbt Models

```sql
-- models/staging/stg_orders.sql
WITH source AS (
    SELECT * FROM {{ source('postgres', 'orders') }}
),

renamed AS (
    SELECT
        order_id,
        customer_id,
        order_date,
        total_amount,
        status,
        _extracted_at
    FROM source
    WHERE order_date >= DATEADD(day, -3, CURRENT_DATE)
)

SELECT * FROM renamed
```

```sql
-- models/marts/fct_orders.sql
{{
    config(
        materialized='incremental',
        unique_key='order_id',
        cluster_by=['order_date']
    )
}}

SELECT
    o.order_id,
    o.customer_id,
    c.customer_segment,
    o.order_date,
    o.total_amount,
    o.status
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
    ON o.customer_id = c.customer_id

{% if is_incremental() %}
WHERE o._extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
{% endif %}
```

#### Step 4: Configure Data Quality Tests

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table"
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: order_date
        tests:
          - not_null
          - dbt_utils.recency:
              datepart: day
              field: order_date
              interval: 1
```

#### Step 5: Create Airflow DAG

```python
# dags/daily_etl.py
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email_on_failure': True,
    'email': ['data-alerts@company.com'],
    'retries': 2,
    'retry_delay': timedelta(minutes=5),
}

with DAG(
    'daily_etl_pipeline',
    default_args=default_args,
    description='Daily ETL from PostgreSQL to Snowflake',
    schedule_interval='0 5 * * *',
    start_date=days_ago(1),
    catchup=False,
    tags=['etl', 'daily'],
) as dag:

    extract = BashOperator(
        task_id='extract_source_data',
        bash_command='python /opt/airflow/scripts/extract.py --date {{ ds }}',
    )

    transform = BashOperator(
        task_id='run_dbt_models',
        bash_command='cd /opt/airflow/dbt && dbt run --select marts.*',
    )

    test = BashOperator(
        task_id='run_dbt_tests',
        bash_command='cd /opt/airflow/dbt && dbt test --select marts.*',
    )

    notify = BashOperator(
        task_id='send_notification',
        bash_command='python /opt/airflow/scripts/notify.py --status success',
        trigger_rule='all_success',
    )

    extract >> transform >> test >> notify
```

#### Step 6: Validate Pipeline

```bash
# Test locally
dbt run --select stg_orders fct_orders
dbt test --select fct_orders

# Validate data quality
python scripts/data_quality_validator.py validate \
  --table fct_orders \
  --checks all \
  --output reports/quality_report.json
```

---

### Workflow 2: Implementing Real-Time Streaming

**Scenario:** Stream events from Kafka, process with Flink/Spark Streaming, sink to data lake.

#### Step 1: Define Event Schema

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "UserEvent",
  "type": "object",
  "required": ["event_id", "user_id", "event_type", "timestamp"],
  "properties": {
    "event_id": {"type": "string", "format": "uuid"},
    "user_id": {"type": "string"},
    "event_type": {"type": "string", "enum": ["page_view", "click", "purchase"]},
    "timestamp": {"type": "string", "format": "date-time"},
    "properties": {"type": "object"}
  }
}
```

#### Step 2: Create Kafka Topic

```bash
# Create topic with appropriate partitions
kafka-topics.sh --create \
  --bootstrap-server localhost:9092 \
  --topic user-events \
  --partitions 12 \
  --replication-factor 3 \
  --config retention.ms=604800000 \
  --config cleanup.policy=delete

# Verify topic
kafka-topics.sh --describe \
  --bootstrap-server localhost:9092 \
  --topic user-events
```

#### Step 3: Implement Spark Streaming Job

```python
# streaming/user_events_processor.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    from_json, col, window, count,
    approx_count_distinct, to_timestamp
)
from pyspark.sql.types import (
    StructType, StructField, StringType, MapType
)

# Initialize Spark
spark = SparkSession.builder \
    .appName("UserEventsProcessor") \
    .config("spark.sql.streaming.checkpointLocation", "/checkpoints/user-events") \
    .config("spark.sql.shuffle.partitions", "12") \
    .getOrCreate()

# Define schema
event_schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("user_id", StringType(), False),
    StructField("event_type", StringType(), False),
    StructField("timestamp", StringType(), False),
    StructField("properties", MapType(StringType(), StringType()), True)
])

# Read from Kafka
events_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", "localhost:9092") \
    .option("subscribe", "user-events") \
    .option("startingOffsets", "latest") \
    .option("failOnDataLoss", "false") \
    .load()

# Parse JSON
parsed_df = events_df \
    .select(from_json(col("value").cast("string"), event_schema).alias("data")) \
    .select("data.*") \
    .withColumn("event_timestamp", to_timestamp(col("timestamp")))

# Windowed aggregation
aggregated_df = parsed_df \
    .withWatermark("event_timestamp", "10 minutes") \
    .groupBy(
        window(col("event_timestamp"), "5 minutes"),
        col("event_type")
    ) \
    .agg(
        count("*").alias("event_count"),
        approx_count_distinct("user_id").alias("unique_users")
    )

# Write to Delta Lake
query = aggregated_df.writeStream \
    .format("delta") \
    .outputMode("append") \
    .option("checkpointLocation", "/checkpoints/user-events-aggregated") \
    .option("path", "/data/lake/user_events_aggregated") \
    .trigger(processingTime="1 minute") \
    .start()

query.awaitTermination()
```

#### Step 4: Handle Late Data and Errors

```python
# Dead letter queue for failed records
import logging

from pyspark.sql.functions import col, current_timestamp, lit

logger = logging.getLogger(__name__)

def process_with_error_handling(batch_df, batch_id):
    try:
        # Attempt processing
        valid_df = batch_df.filter(col("event_id").isNotNull())
        invalid_df = batch_df.filter(col("event_id").isNull())

        # Write valid records
        valid_df.write \
            .format("delta") \
            .mode("append") \
            .save("/data/lake/user_events")

        # Write invalid to DLQ
        if invalid_df.count() > 0:
            invalid_df \
                .withColumn("error_timestamp", current_timestamp()) \
                .withColumn("error_reason", lit("missing_event_id")) \
                .write \
                .format("delta") \
                .mode("append") \
                .save("/data/lake/dlq/user_events")

    except Exception as e:
        # Log error, alert, continue
        logger.error(f"Batch {batch_id} failed: {e}")
        raise

# Use foreachBatch for custom processing
query = parsed_df.writeStream \
    .foreachBatch(process_with_error_handling) \
    .option("checkpointLocation", "/checkpoints/user-events") \
    .start()
```

#### Step 5: Monitor Stream Health

```python
# monitoring/stream_metrics.py
from prometheus_client import Gauge, Counter, start_http_server

# Define metrics
RECORDS_PROCESSED = Counter(
    'stream_records_processed_total',
    'Total records processed',
    ['stream_name', 'status']
)

PROCESSING_LAG = Gauge(
    'stream_processing_lag_seconds',
    'Current processing lag',
    ['stream_name']
)

BATCH_DURATION = Gauge(
    'stream_batch_duration_seconds',
    'Last batch processing duration',
    ['stream_name']
)

def emit_metrics(query):
    """Emit Prometheus metrics from streaming query."""
    progress = query.lastProgress
    if progress:
        RECORDS_PROCESSED.labels(
            stream_name='user-events',
            status='success'
        ).inc(progress['numInputRows'])

        if progress['sources']:
            # Calculate lag from latest offset
            for source in progress['sources']:
                end_offset = source.get('endOffset', {})
                # Parse Kafka offsets and calculate lag
```

---

### Workflow 3: Data Quality Framework Setup

**Scenario:** Implement comprehensive data quality monitoring with Great Expectations.

#### Step 1: Initialize Great Expectations

```bash
# Install and initialize
pip install great_expectations

great_expectations init

# Connect to data source
great_expectations datasource new
```

#### Step 2: Create Expectation Suite

```python
# expectations/orders_suite.py
import great_expectations as gx

context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("orders_quality_suite")

# Add expectations
validator = context.get_validator(
    batch_request={
        "datasource_name": "warehouse",
        "data_asset_name": "orders",
    },
    expectation_suite_name="orders_quality_suite"
)

# Schema expectations
validator.expect_table_columns_to_match_ordered_list(
    column_list=[
        "order_id", "customer_id", "order_date",
        "total_amount", "status", "created_at"
    ]
)

# Completeness expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("order_date")

# Uniqueness expectations
validator.expect_column_values_to_be_unique("order_id")

# Range expectations
validator.expect_column_values_to_be_between(
    "total_amount",
    min_value=0,
    max_value=1000000
)

# Categorical expectations
validator.expect_column_values_to_be_in_set(
    "status",
    ["pending", "confirmed", "shipped", "delivered", "cancelled"]
)

# Freshness expectation
validator.expect_column_max_to_be_between(
    "order_date",
    min_value={"$PARAMETER": "now - timedelta(days=1)"},
    max_value={"$PARAMETER": "now"}
)

# Referential integrity
validator.expect_column_values_to_be_in_set(
    "customer_id",
    value_set={"$PARAMETER": "valid_customer_ids"}
)

validator.save_expectation_suite(discard_failed_expectations=False)
```
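Outside of Great Expectations, the core assertions in this suite reduce to simple per-row checks. A minimal stdlib sketch of what the completeness, uniqueness, range, and categorical expectations actually assert (the `validate_orders` helper is illustrative, not part of the GX API; column names and bounds mirror the suite above):

```python
VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}


def validate_orders(rows: list[dict]) -> list[str]:
    """Return human-readable violations for a batch of order rows."""
    failures = []
    seen_ids = set()
    for i, row in enumerate(rows):
        # Completeness: required columns must be present and non-null
        for col in ("order_id", "customer_id", "order_date"):
            if row.get(col) is None:
                failures.append(f"row {i}: {col} is null")
        # Uniqueness: order_id must not repeat within the batch
        oid = row.get("order_id")
        if oid is not None:
            if oid in seen_ids:
                failures.append(f"row {i}: duplicate order_id {oid}")
            seen_ids.add(oid)
        # Range: total_amount within [0, 1_000_000]
        amount = row.get("total_amount")
        if amount is not None and not (0 <= amount <= 1_000_000):
            failures.append(f"row {i}: total_amount {amount} out of range")
        # Categorical: status drawn from the allowed set
        if row.get("status") not in VALID_STATUSES:
            failures.append(f"row {i}: invalid status {row.get('status')}")
    return failures
```

The value GX adds over a helper like this is batch management, profiling, and rendered data docs; the underlying predicates are this simple.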

#### Step 3: Create Data Quality Checks with dbt

```yaml
# models/marts/schema.yml
version: 2

models:
  - name: fct_orders
    description: "Order fact table with data quality checks"

    tests:
      # Row count check
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_orders')

      # Freshness check
      - dbt_utils.recency:
          datepart: hour
          field: created_at
          interval: 24

    columns:
      - name: order_id
        description: "Unique order identifier"
        tests:
          - unique
          - not_null
          - relationships:
              to: ref('dim_orders')
              field: order_id

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
              inclusive: true
          - dbt_expectations.expect_column_values_to_be_between:
              min_value: 0
              row_condition: "status != 'cancelled'"

      - name: customer_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id
              severity: warn
```
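Under the hood, dbt's generic tests compile to SQL that selects violating rows: a test passes when the query returns nothing. Roughly what the built-in `unique` test on `order_id` compiles to, demonstrated here against an in-memory SQLite table (the table and data are illustrative):

```python
import sqlite3

# Approximate compiled form of dbt's `unique` test:
# select keys that appear more than once.
UNIQUE_TEST_SQL = """
    SELECT order_id, COUNT(*) AS n
    FROM fct_orders
    GROUP BY order_id
    HAVING COUNT(*) > 1
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fct_orders (order_id TEXT, total_amount REAL)")
conn.executemany(
    "INSERT INTO fct_orders VALUES (?, ?)",
    [("o1", 10.0), ("o2", 20.0), ("o2", 20.0)],  # o2 is duplicated
)
# A passing test returns zero rows; here one key violates uniqueness.
violations = conn.execute(UNIQUE_TEST_SQL).fetchall()
print(violations)  # [('o2', 2)]
```

`not_null` and `relationships` compile the same way: a `WHERE col IS NULL` filter and an anti-join against the referenced model, respectively.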

#### Step 4: Implement Data Contracts

```yaml
# contracts/orders_contract.yaml
contract:
  name: orders-data-contract
  version: "1.0.0"
  owner: data-team@company.com

  schema:
    type: object
    properties:
      order_id:
        type: string
        format: uuid
        description: "Unique order identifier"
      customer_id:
        type: string
        not_null: true
      order_date:
        type: date
        not_null: true
      total_amount:
        type: decimal
        precision: 10
        scale: 2
        minimum: 0
      status:
        type: string
        enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]

  sla:
    freshness:
      max_delay_hours: 1
    completeness:
      min_percentage: 99.9
    accuracy:
      duplicate_tolerance: 0.01

  consumers:
    - name: analytics-team
      usage: "Daily reporting dashboards"
    - name: ml-team
      usage: "Churn prediction model"
```
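A contract is only useful if something enforces it. A minimal sketch of checking one record against the nullability, range, and enum rules above; the contract subset is inlined as a dict rather than parsed from the YAML file to keep the example dependency-free, and `check_record` is a hypothetical helper:

```python
# Inlined subset of contracts/orders_contract.yaml
CONTRACT = {
    "customer_id": {"type": str, "not_null": True},
    "order_date": {"type": str, "not_null": True},  # date kept as an ISO string here
    "total_amount": {"type": float, "minimum": 0},
    "status": {"type": str,
               "enum": {"pending", "confirmed", "shipped", "delivered", "cancelled"}},
}


def check_record(record: dict) -> list[str]:
    """Return contract violations for a single record."""
    violations = []
    for field, rules in CONTRACT.items():
        value = record.get(field)
        if value is None:
            if rules.get("not_null"):
                violations.append(f"{field}: null")
            continue
        if not isinstance(value, rules["type"]):
            violations.append(f"{field}: expected {rules['type'].__name__}")
            continue
        if "minimum" in rules and value < rules["minimum"]:
            violations.append(f"{field}: below minimum {rules['minimum']}")
        if "enum" in rules and value not in rules["enum"]:
            violations.append(f"{field}: not in allowed set")
    return violations
```

In production this check would run at the producer boundary (e.g. in the ingestion job), so that contract breaks are rejected before they reach consumers.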

#### Step 5: Set Up Quality Monitoring Dashboard

```python
# monitoring/quality_dashboard.py
from datetime import datetime


def generate_quality_report(connection, table_name: str) -> dict:
    """Generate comprehensive data quality report."""
    report = {
        "table": table_name,
        "timestamp": datetime.now().isoformat(),
        "checks": {}
    }

    # Row count check
    row_count = connection.execute(
        f"SELECT COUNT(*) FROM {table_name}"
    ).fetchone()[0]
    report["checks"]["row_count"] = {
        "value": row_count,
        "status": "pass" if row_count > 0 else "fail"
    }

    # Freshness check
    max_date = connection.execute(
        f"SELECT MAX(created_at) FROM {table_name}"
    ).fetchone()[0]
    hours_old = (datetime.now() - max_date).total_seconds() / 3600
    report["checks"]["freshness"] = {
        "max_timestamp": max_date.isoformat(),
        "hours_old": round(hours_old, 2),
        "status": "pass" if hours_old < 24 else "fail"
    }

    # Null rate check
    null_query = f"""
        SELECT
            SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) AS null_order_id,
            SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) AS null_customer_id,
            COUNT(*) AS total
        FROM {table_name}
    """
    null_result = connection.execute(null_query).fetchone()
    report["checks"]["null_rates"] = {
        "order_id": null_result[0] / null_result[2] if null_result[2] > 0 else 0,
        "customer_id": null_result[1] / null_result[2] if null_result[2] > 0 else 0,
        "status": "pass" if null_result[0] == 0 and null_result[1] == 0 else "fail"
    }

    # Duplicate check
    dup_query = f"""
        SELECT COUNT(*) - COUNT(DISTINCT order_id) AS duplicates
        FROM {table_name}
    """
    duplicates = connection.execute(dup_query).fetchone()[0]
    report["checks"]["duplicates"] = {
        "count": duplicates,
        "status": "pass" if duplicates == 0 else "fail"
    }

    # Overall status
    all_passed = all(
        check["status"] == "pass"
        for check in report["checks"].values()
    )
    report["overall_status"] = "pass" if all_passed else "fail"

    return report
```
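A dashboard or alerting job would then poll this report on a schedule and decide what warrants a page. A minimal sketch of that decision logic; the `report` shape matches `generate_quality_report` output, and the choice of which checks count as critical is illustrative:

```python
def failed_checks(report: dict) -> list[str]:
    """Names of checks that failed in a quality report."""
    return [
        name for name, check in report.get("checks", {}).items()
        if check.get("status") == "fail"
    ]


def should_alert(report: dict, critical=("duplicates", "null_rates")) -> bool:
    """Page only when a critical check fails; other failures just log."""
    return any(name in critical for name in failed_checks(report))
```

Splitting "failed" from "page-worthy" keeps noisy but low-stakes checks (e.g. a slightly stale table overnight) out of the on-call rotation.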

---