release: v2.1.1 — skill optimization, agents, commands, reference splits (#297)

This commit is contained in:
Alireza Rezvani
2026-03-09 15:54:25 +01:00
committed by GitHub
parent 3c34297a6b
commit 8902ba79d7
567 changed files with 17859 additions and 22442 deletions


@@ -0,0 +1,40 @@
# Engineering Skills — Codex CLI Instructions
When working on engineering tasks, use the engineering skill system:
## Routing
1. **Identify the domain:** Architecture, frontend, backend, DevOps, security, AI/ML, data, or QA
2. **Read the specialist SKILL.md** for detailed instructions
3. **Use Python tools** for scaffolding and analysis
## Python Tools
All scripts in `engineering-team/*/scripts/` are stdlib-only and CLI-first. Run them directly:
```bash
python3 engineering-team/senior-fullstack/scripts/project_scaffolder.py --help
python3 engineering-team/senior-security/scripts/threat_modeler.py --help
python3 engineering-team/senior-frontend/scripts/bundle_analyzer.py --help
```
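The stdlib-only, CLI-first convention these scripts follow can be sketched as below. All names here (`analyze`, `--target`, `--json`) are illustrative, not the actual tool interfaces:

```python
# Minimal sketch of the stdlib-only, CLI-first pattern the scripts follow.
# Function and flag names are illustrative, not real tool interfaces.
import argparse
import json

def analyze(target):
    # Placeholder analysis; real tools inspect files or configs.
    return {"target": target, "findings": []}

def main(argv=None):
    parser = argparse.ArgumentParser(description="Example engineering-tool CLI")
    parser.add_argument("--target", default=".", help="Path to analyze")
    parser.add_argument("--json", action="store_true", help="Emit JSON output")
    args = parser.parse_args(argv)
    result = analyze(args.target)
    print(json.dumps(result) if args.json else result)
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```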
## Key Skills by Task
| Task | Skill |
|------|-------|
| System design | senior-architect |
| React/Next.js | senior-frontend |
| API design | senior-backend |
| Full project scaffold | senior-fullstack |
| Test generation | senior-qa |
| CI/CD pipelines | senior-devops |
| Threat modeling | senior-security |
| AWS architecture | aws-solution-architect |
| Code review | code-reviewer |
| E2E testing | playwright-pro |
| Stripe payments | stripe-integration-expert |
## Rules
- Load only 1-2 skills per request — don't bulk-load
- Use Python tools for analysis and scaffolding

engineering-team/SKILL.md Normal file

@@ -0,0 +1,91 @@
---
name: "engineering-skills"
description: "23 production-ready engineering skills covering architecture, frontend, backend, fullstack, QA, DevOps, security, AI/ML, data engineering, computer vision, and specialized tools like Playwright Pro, Stripe integration, AWS, and MS365. 30+ Python automation tools (all stdlib-only). Works with Claude Code, Codex CLI, and OpenClaw."
version: 1.1.0
author: Alireza Rezvani
license: MIT
tags:
- engineering
- frontend
- backend
- devops
- security
- ai-ml
- data-engineering
agents:
- claude-code
- codex-cli
- openclaw
---
# Engineering Team Skills
23 production-ready engineering skills organized into core engineering, AI/ML/Data, and specialized tools.
## Quick Start
### Claude Code
```
/read engineering-team/senior-fullstack/SKILL.md
```
### Codex CLI
```bash
npx agent-skills-cli add alirezarezvani/claude-skills/engineering-team
```
## Skills Overview
### Core Engineering (13 skills)
| Skill | Folder | Focus |
|-------|--------|-------|
| Senior Architect | `senior-architect/` | System design, architecture patterns |
| Senior Frontend | `senior-frontend/` | React, Next.js, TypeScript, Tailwind |
| Senior Backend | `senior-backend/` | API design, database optimization |
| Senior Fullstack | `senior-fullstack/` | Project scaffolding, code quality |
| Senior QA | `senior-qa/` | Test generation, coverage analysis |
| Senior DevOps | `senior-devops/` | CI/CD, infrastructure, containers |
| Senior SecOps | `senior-secops/` | Security operations, vulnerability management |
| Code Reviewer | `code-reviewer/` | PR review, code quality analysis |
| Senior Security | `senior-security/` | Threat modeling, STRIDE, penetration testing |
| AWS Solution Architect | `aws-solution-architect/` | Serverless, CloudFormation, cost optimization |
| MS365 Tenant Manager | `ms365-tenant-manager/` | Microsoft 365 administration |
| TDD Guide | `tdd-guide/` | Test-driven development workflows |
| Tech Stack Evaluator | `tech-stack-evaluator/` | Technology comparison, TCO analysis |
### AI/ML/Data (5 skills)
| Skill | Folder | Focus |
|-------|--------|-------|
| Senior Data Scientist | `senior-data-scientist/` | Statistical modeling, experimentation |
| Senior Data Engineer | `senior-data-engineer/` | Pipelines, ETL, data quality |
| Senior ML Engineer | `senior-ml-engineer/` | Model deployment, MLOps, LLM integration |
| Senior Prompt Engineer | `senior-prompt-engineer/` | Prompt optimization, RAG, agents |
| Senior Computer Vision | `senior-computer-vision/` | Object detection, segmentation |
### Specialized Tools (5 skills)
| Skill | Folder | Focus |
|-------|--------|-------|
| Playwright Pro | `playwright-pro/` | E2E testing (9 sub-skills) |
| Self-Improving Agent | `self-improving-agent/` | Memory curation (5 sub-skills) |
| Stripe Integration | `stripe-integration-expert/` | Payment integration, webhooks |
| Incident Commander | `incident-commander/` | Incident response workflows |
| Email Template Builder | `email-template-builder/` | HTML email generation |
## Python Tools
30+ scripts, all stdlib-only. Run directly:
```bash
python3 <skill>/scripts/<tool>.py --help
```
No pip install needed. Scripts include embedded samples for demo mode.
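The demo-mode convention can be sketched as follows — when no input file is supplied, the script falls back to built-in data. The sample data and `--input` flag here are hypothetical, not a specific tool's interface:

```python
# Hypothetical sketch of the "embedded sample" demo-mode convention:
# with no input file, the script falls back to built-in data.
import argparse
import json

SAMPLE = {"service": "api", "requests_per_day": 12000}  # embedded demo data

def load_input(path):
    if path is None:
        return SAMPLE  # demo mode: no external files, no pip installs
    with open(path) as f:
        return json.load(f)

def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--input", help="JSON input file (omit for demo mode)")
    args = parser.parse_args(argv)
    print(json.dumps(load_input(args.input)))
    return 0

if __name__ == "__main__":
    raise SystemExit(main())
```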
## Rules
- Load only the specific skill SKILL.md you need — don't bulk-load all 23
- Use Python tools for analysis and scaffolding, not manual judgment
- Check CLAUDE.md for tool usage examples and workflows


@@ -1,5 +1,5 @@
---
name: aws-solution-architect
name: "aws-solution-architect"
description: Design AWS architectures for startups using serverless patterns and IaC templates. Use when asked to design serverless architecture, create CloudFormation templates, optimize AWS costs, set up CI/CD pipelines, or migrate to AWS. Covers Lambda, API Gateway, DynamoDB, ECS, Aurora, and cost optimization.
---
@@ -9,36 +9,6 @@ Design scalable, cost-effective AWS architectures for startups with infrastructu
---
## Table of Contents
- [Trigger Terms](#trigger-terms)
- [Workflow](#workflow)
- [Tools](#tools)
- [Quick Start](#quick-start)
- [Input Requirements](#input-requirements)
- [Output Formats](#output-formats)
---
## Trigger Terms
Use this skill when you encounter:
| Category | Terms |
|----------|-------|
| **Architecture Design** | serverless architecture, AWS architecture, cloud design, microservices, three-tier |
| **IaC Generation** | CloudFormation, CDK, Terraform, infrastructure as code, deploy template |
| **Serverless** | Lambda, API Gateway, DynamoDB, Step Functions, EventBridge, AppSync |
| **Containers** | ECS, Fargate, EKS, container orchestration, Docker on AWS |
| **Cost Optimization** | reduce AWS costs, optimize spending, right-sizing, Savings Plans |
| **Database** | Aurora, RDS, DynamoDB design, database migration, data modeling |
| **Security** | IAM policies, VPC design, encryption, Cognito, WAF |
| **CI/CD** | CodePipeline, CodeBuild, CodeDeploy, GitHub Actions AWS |
| **Monitoring** | CloudWatch, X-Ray, observability, alarms, dashboards |
| **Migration** | migrate to AWS, lift and shift, replatform, DMS |
---
## Workflow
### Step 1: Gather Requirements
@@ -62,6 +32,18 @@ Run the architecture designer to get pattern recommendations:
python scripts/architecture_designer.py --input requirements.json
```
**Example output:**
```json
{
"recommended_pattern": "serverless_web",
"service_stack": ["S3", "CloudFront", "API Gateway", "Lambda", "DynamoDB", "Cognito"],
"estimated_monthly_cost_usd": 35,
"pros": ["Low ops overhead", "Pay-per-use", "Auto-scaling"],
"cons": ["Cold starts", "15-min Lambda limit", "Eventual consistency"]
}
```
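Because the tool emits JSON, downstream automation can consume it directly with the stdlib. A minimal sketch — the `within_budget` check is an illustrative add-on, not part of the tool:

```python
# Sketch: consuming architecture_designer.py JSON output with the stdlib.
# The JSON shape mirrors the example above; the budget check is illustrative.
import json

raw = '''{
  "recommended_pattern": "serverless_web",
  "service_stack": ["S3", "CloudFront", "API Gateway", "Lambda", "DynamoDB", "Cognito"],
  "estimated_monthly_cost_usd": 35,
  "pros": ["Low ops overhead", "Pay-per-use", "Auto-scaling"],
  "cons": ["Cold starts", "15-min Lambda limit", "Eventual consistency"]
}'''

def within_budget(result, budget_usd):
    return result["estimated_monthly_cost_usd"] <= budget_usd

result = json.loads(raw)
if within_budget(result, budget_usd=100):
    print(f"Proceed with {result['recommended_pattern']}: "
          f"{', '.join(result['service_stack'])}")
```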
Select from recommended patterns:
- **Serverless Web**: S3 + CloudFront + API Gateway + Lambda + DynamoDB
- **Event-Driven Microservices**: EventBridge + Lambda + SQS + Step Functions
@@ -70,6 +52,8 @@ Select from recommended patterns:
See `references/architecture_patterns.md` for detailed pattern specifications.
**Validation checkpoint:** Confirm the recommended pattern matches the team's operational maturity and compliance requirements before proceeding to Step 3.
### Step 3: Generate IaC Templates
Create infrastructure-as-code for the selected pattern:
@@ -77,8 +61,76 @@ Create infrastructure-as-code for the selected pattern:
```bash
# Serverless stack (CloudFormation)
python scripts/serverless_stack.py --app-name my-app --region us-east-1
# Output: CloudFormation YAML template ready to deploy
```
**Example CloudFormation YAML output (core serverless resources):**
```yaml
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Parameters:
AppName:
Type: String
Default: my-app
Resources:
ApiFunction:
Type: AWS::Serverless::Function
Properties:
Handler: index.handler
Runtime: nodejs20.x
MemorySize: 512
Timeout: 30
Environment:
Variables:
TABLE_NAME: !Ref DataTable
Policies:
- DynamoDBCrudPolicy:
TableName: !Ref DataTable
Events:
ApiEvent:
Type: Api
Properties:
Path: /{proxy+}
Method: ANY
DataTable:
Type: AWS::DynamoDB::Table
Properties:
BillingMode: PAY_PER_REQUEST
AttributeDefinitions:
- AttributeName: pk
AttributeType: S
- AttributeName: sk
AttributeType: S
KeySchema:
- AttributeName: pk
KeyType: HASH
- AttributeName: sk
KeyType: RANGE
```
> Full templates including API Gateway, Cognito, IAM roles, and CloudWatch logging are generated by `serverless_stack.py` and also available in `references/architecture_patterns.md`.
**Example CDK TypeScript snippet (three-tier pattern):**
```typescript
import * as ecs from 'aws-cdk-lib/aws-ecs';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as rds from 'aws-cdk-lib/aws-rds';
const vpc = new ec2.Vpc(this, 'AppVpc', { maxAzs: 2 });
const cluster = new ecs.Cluster(this, 'AppCluster', { vpc });
const db = new rds.ServerlessCluster(this, 'AppDb', {
engine: rds.DatabaseClusterEngine.auroraPostgres({
version: rds.AuroraPostgresEngineVersion.VER_15_2,
}),
vpc,
scaling: { minCapacity: 0.5, maxCapacity: 4 },
});
```
### Step 4: Review Costs
@@ -89,6 +141,20 @@ Analyze estimated costs and optimization opportunities:
python scripts/cost_optimizer.py --resources current_setup.json --monthly-spend 2000
```
**Example output:**
```json
{
"current_monthly_usd": 2000,
"recommendations": [
{ "action": "Right-size RDS db.r5.2xlarge → db.r5.large", "savings_usd": 420, "priority": "high" },
{ "action": "Purchase 1-yr Compute Savings Plan at 40% utilization", "savings_usd": 310, "priority": "high" },
{ "action": "Move S3 objects >90 days to Glacier Instant Retrieval", "savings_usd": 85, "priority": "medium" }
],
"total_potential_savings_usd": 815
}
```
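The recommendations array lends itself to simple triage. A stdlib sketch over the example output above — the `triage` helper is illustrative, not part of `cost_optimizer.py`:

```python
# Sketch: triaging cost_optimizer.py output by priority.
# The data mirrors the example above; the triage logic is illustrative.
report = {
    "current_monthly_usd": 2000,
    "recommendations": [
        {"action": "Right-size RDS db.r5.2xlarge -> db.r5.large", "savings_usd": 420, "priority": "high"},
        {"action": "Purchase 1-yr Compute Savings Plan", "savings_usd": 310, "priority": "high"},
        {"action": "Move S3 objects >90 days to Glacier Instant Retrieval", "savings_usd": 85, "priority": "medium"},
    ],
}

def triage(report, priority="high"):
    picks = [r for r in report["recommendations"] if r["priority"] == priority]
    return picks, sum(r["savings_usd"] for r in picks)

high, savings = triage(report)
print(f"{len(high)} high-priority actions, ${savings}/mo potential savings")
# -> 2 high-priority actions, $730/mo potential savings
```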
Output includes:
- Monthly cost breakdown by service
- Right-sizing recommendations
@@ -113,7 +179,7 @@ cdk deploy
terraform init && terraform apply
```
### Step 6: Validate
### Step 6: Validate and Handle Failures
Verify deployment and set up monitoring:
@@ -125,6 +191,30 @@ aws cloudformation describe-stacks --stack-name my-app-stack
aws cloudwatch put-metric-alarm --alarm-name high-errors ...
```
**If stack creation fails:**
1. Check the failure reason:
```bash
aws cloudformation describe-stack-events \
--stack-name my-app-stack \
--query 'StackEvents[?ResourceStatus==`CREATE_FAILED`]'
```
2. Review CloudWatch Logs for Lambda or ECS errors.
3. Fix the template or resource configuration.
4. Delete the failed stack before retrying:
```bash
aws cloudformation delete-stack --stack-name my-app-stack
# Wait for deletion
aws cloudformation wait stack-delete-complete --stack-name my-app-stack
# Redeploy
aws cloudformation create-stack ...
```
**Common failure causes:**
- IAM permission errors → verify `--capabilities CAPABILITY_IAM` and role trust policies
- Resource limit exceeded → request quota increase via Service Quotas console
- Invalid template syntax → run `aws cloudformation validate-template --template-body file://template.yaml` before deploying
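That pre-deploy validation step can also be wrapped as a guard in automation. A sketch assuming the AWS CLI is installed and configured; `validate_cmd` and `validate_template` are illustrative helpers, not part of this skill's scripts:

```python
# Sketch: a pre-deploy guard around `aws cloudformation validate-template`.
# Assumes the AWS CLI is installed and configured; helper names are illustrative.
import subprocess

def validate_cmd(template_path):
    # Build the CLI invocation for template syntax validation.
    return [
        "aws", "cloudformation", "validate-template",
        "--template-body", f"file://{template_path}",
    ]

def validate_template(template_path):
    # Returns True when the CLI exits 0 (template syntax is valid).
    proc = subprocess.run(validate_cmd(template_path),
                          capture_output=True, text=True)
    if proc.returncode != 0:
        print(proc.stderr.strip())
    return proc.returncode == 0

# Usage: run validate_template("template.yaml") before create-stack.
```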
---
## Tools
@@ -267,10 +357,7 @@ Provide these details for architecture design:
- Pattern recommendation with rationale
- Service stack diagram (ASCII)
- Configuration specifications
- Monthly cost estimate
- Scaling characteristics
- Trade-offs and limitations
- Monthly cost estimate and trade-offs
### IaC Templates
@@ -280,10 +367,8 @@ Provide these details for architecture design:
### Cost Analysis
- Current spend breakdown
- Optimization recommendations with savings
- Priority action list (high/medium/low)
- Implementation checklist
- Current spend breakdown with optimization recommendations
- Priority action list (high/medium/low) and implementation checklist
---
@@ -294,13 +379,3 @@ Provide these details for architecture design:
| `references/architecture_patterns.md` | 6 patterns: serverless, microservices, three-tier, data processing, GraphQL, multi-region |
| `references/service_selection.md` | Decision matrices for compute, database, storage, messaging |
| `references/best_practices.md` | Serverless design, cost optimization, security hardening, scalability |
---
## Limitations
- Lambda: 15-minute execution, 10GB memory max
- API Gateway: 29-second timeout, 10MB payload
- DynamoDB: 400KB item size, eventually consistent by default
- Regional availability varies by service
- Some services have AWS-specific lock-in


@@ -1,5 +1,5 @@
---
name: code-reviewer
name: "code-reviewer"
description: Code review automation for TypeScript, JavaScript, Python, Go, Swift, Kotlin. Analyzes PRs for complexity and risk, checks code quality for SOLID violations and code smells, generates review reports. Use when reviewing pull requests, analyzing code quality, identifying issues, generating review checklists.
---


@@ -1,3 +1,8 @@
---
name: "email-template-builder"
description: "Email Template Builder"
---
# Email Template Builder
**Tier:** POWERFUL
@@ -157,7 +162,7 @@ import { Button, Heading, Text } from "@react-email/components"
import { EmailLayout } from "../components/layout/email-layout"
interface WelcomeEmailProps {
name: string
confirmUrl: string
trialDays?: number
}
@@ -216,7 +221,7 @@ import { EmailLayout } from "../components/layout/email-layout"
interface InvoiceItem { description: string; amount: number }
interface InvoiceEmailProps {
name: string
invoiceNumber: string
invoiceDate: string
dueDate: string
@@ -323,7 +328,7 @@ export async function sendEmail(to: string, payload: EmailPayload) {
to,
subject: template.subject,
html: trackedHtml,
tags: [{ name: "email_type", value: payload.type }],
tags: [{ name: "email-type", value: payload.type }],
})
return result
@@ -356,8 +361,8 @@ export async function sendEmail(to: string, payload: EmailPayload) {
// emails/i18n/en.ts
export const en = {
welcome: {
preview: (name: string) => `Welcome to MyApp, ${name}!`,
heading: (name: string) => `Welcome to MyApp, ${name}!`,
body: (days: number) => `You've got ${days} days to explore everything.`,
cta: "Confirm Email Address",
},
@@ -366,8 +371,8 @@ export const en = {
// emails/i18n/de.ts
export const de = {
welcome: {
preview: (name: string) => `Willkommen bei MyApp, ${name}!`,
heading: (name: string) => `Willkommen bei MyApp, ${name}!`,
body: (days: number) => `Du hast ${days} Tage Zeit, alles zu erkunden.`,
cta: "E-Mail-Adresse bestätigen",
},


@@ -1,3 +1,8 @@
---
name: "incident-commander"
description: "Incident Commander Skill"
---
# Incident Commander Skill
**Category:** Engineering Team
@@ -364,204 +369,7 @@ Status page: {link}
- **{Pitfall}:** {description and how to avoid}
## Reference Information
- **Architecture Diagram:** {link}
- **Monitoring Dashboard:** {link}
- **Related Runbooks:** {links to dependent service runbooks}
```
### Post-Incident Review (PIR) Framework
#### PIR Timeline and Ownership
**Timeline:**
- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress
**Roles:**
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity
#### Root Cause Analysis Frameworks
#### 1. Five Whys Method
The Five Whys technique involves asking "why" repeatedly to drill down to root causes:
**Example Application:**
- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature wasn't properly connection pooling
- **Why 4:** Why wasn't the feature properly connection pooling? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns
**Best Practices:**
- Ask "why" at least 3 times; root causes often need 5+ iterations
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain
#### 2. Fishbone (Ishikawa) Diagram
Systematic analysis across multiple categories of potential causes:
**Categories:**
- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors
**Application Method:**
1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident
#### 3. Timeline Analysis
Reconstruct the incident chronologically to identify decision points and missed opportunities:
**Timeline Elements:**
- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?
**Analysis Questions:**
- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected/resolved faster?
### Escalation Paths
#### Technical Escalation
**Level 1:** On-call engineer
- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)
**Level 2:** Senior engineer/Team lead
- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)
**Level 3:** Engineering Manager/Staff Engineer
- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)
**Level 4:** Director of Engineering/CTO
- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)
#### Business Escalation
**Customer Impact Assessment:**
- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only
**Escalation Matrix:**
| Severity | Duration | Business Escalation |
|----------|----------|-------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |
### Status Page Management
#### Update Principles
1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time
#### Status Categories
- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window
#### Update Template
```
{Timestamp} - {Status Category}
{Brief description of current state}
Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}
Next update: {specific time}
We apologize for any inconvenience this may cause.
```
### Action Item Framework
#### Action Item Categories
1. **Immediate Fixes**
- Critical bugs discovered during incident
- Security vulnerabilities exposed
- Data integrity issues
2. **Process Improvements**
- Communication gaps
- Escalation procedure updates
- Runbook additions/updates
3. **Technical Debt**
- Architecture improvements
- Monitoring enhancements
- Automation opportunities
4. **Organizational Changes**
- Team structure adjustments
- Training requirements
- Tool/platform investments
#### Action Item Template
```
**Title:** {Concise description of the action}
**Priority:** {Critical/High/Medium/Low}
**Category:** {Fix/Process/Technical/Organizational}
**Owner:** {Assigned person}
**Due Date:** {Specific date}
**Success Criteria:** {How will we know this is complete}
**Dependencies:** {What needs to happen first}
**Related PIRs:** {Links to other incidents this addresses}
**Description:**
{Detailed description of what needs to be done and why}
**Implementation Plan:**
1. {Step 1}
2. {Step 2}
3. {Validation step}
**Progress Updates:**
- {Date}: {Progress update}
- {Date}: {Progress update}
```
→ See references/reference-information.md for details
## Usage Examples
@@ -665,4 +473,4 @@ The Incident Commander skill provides a comprehensive framework for managing inc
The key to successful incident management is preparation, practice, and continuous learning. Use this framework as a starting point, but adapt it to your organization's specific needs, culture, and technical environment.
Remember: The goal isn't to prevent all incidents (which is impossible), but to detect them quickly, respond effectively, communicate clearly, and learn continuously.


@@ -0,0 +1,201 @@
# incident-commander reference
## Reference Information
- **Architecture Diagram:** {link}
- **Monitoring Dashboard:** {link}
- **Related Runbooks:** {links to dependent service runbooks}
### Post-Incident Review (PIR) Framework
#### PIR Timeline and Ownership
**Timeline:**
- **24 hours:** Initial PIR draft completed by Incident Commander
- **3 business days:** Final PIR published with all stakeholder input
- **1 week:** Action items assigned with owners and due dates
- **4 weeks:** Follow-up review on action item progress
**Roles:**
- **PIR Owner:** Incident Commander (can delegate writing but owns completion)
- **Technical Contributors:** All engineers involved in response
- **Review Committee:** Engineering leadership, affected product teams
- **Action Item Owners:** Assigned based on expertise and capacity
#### Root Cause Analysis Frameworks
#### 1. Five Whys Method
The Five Whys technique involves asking "why" repeatedly to drill down to root causes:
**Example Application:**
- **Problem:** Database became unresponsive during peak traffic
- **Why 1:** Why did the database become unresponsive? → Connection pool was exhausted
- **Why 2:** Why was the connection pool exhausted? → Application was creating more connections than usual
- **Why 3:** Why was the application creating more connections? → New feature wasn't properly connection pooling
- **Why 4:** Why wasn't the feature properly connection pooling? → Code review missed this pattern
- **Why 5:** Why did code review miss this? → No automated checks for connection pooling patterns
**Best Practices:**
- Ask "why" at least 3 times; root causes often need 5+ iterations
- Focus on process failures, not individual blame
- Each "why" should point to an actionable system improvement
- Consider multiple root cause paths, not just one linear chain
#### 2. Fishbone (Ishikawa) Diagram
Systematic analysis across multiple categories of potential causes:
**Categories:**
- **People:** Training, experience, communication, handoffs
- **Process:** Procedures, change management, review processes
- **Technology:** Architecture, tooling, monitoring, automation
- **Environment:** Infrastructure, dependencies, external factors
**Application Method:**
1. State the problem clearly at the "head" of the fishbone
2. For each category, brainstorm potential contributing factors
3. For each factor, ask what caused that factor (sub-causes)
4. Identify the factors most likely to be root causes
5. Validate root causes with evidence from the incident
#### 3. Timeline Analysis
Reconstruct the incident chronologically to identify decision points and missed opportunities:
**Timeline Elements:**
- **Detection:** When was the issue first observable? When was it first detected?
- **Notification:** How quickly were the right people informed?
- **Response:** What actions were taken and how effective were they?
- **Communication:** When were stakeholders updated?
- **Resolution:** What finally resolved the issue?
**Analysis Questions:**
- Where were there delays and what caused them?
- What decisions would we make differently with perfect information?
- Where did communication break down?
- What automation could have detected/resolved faster?
### Escalation Paths
#### Technical Escalation
**Level 1:** On-call engineer
- **Responsibility:** Initial response and common issue resolution
- **Escalation Trigger:** Issue not resolved within SLA timeframe
- **Timeframe:** 15 minutes (SEV1), 30 minutes (SEV2)
**Level 2:** Senior engineer/Team lead
- **Responsibility:** Complex technical issues requiring deeper expertise
- **Escalation Trigger:** Level 1 requests help or timeout occurs
- **Timeframe:** 30 minutes (SEV1), 1 hour (SEV2)
**Level 3:** Engineering Manager/Staff Engineer
- **Responsibility:** Cross-team coordination and architectural decisions
- **Escalation Trigger:** Issue spans multiple systems or teams
- **Timeframe:** 45 minutes (SEV1), 2 hours (SEV2)
**Level 4:** Director of Engineering/CTO
- **Responsibility:** Resource allocation and business impact decisions
- **Escalation Trigger:** Extended outage or significant business impact
- **Timeframe:** 1 hour (SEV1), 4 hours (SEV2)
#### Business Escalation
**Customer Impact Assessment:**
- **High:** Revenue loss, SLA breaches, customer churn risk
- **Medium:** User experience degradation, support ticket volume
- **Low:** Internal tools, development impact only
**Escalation Matrix:**
| Severity | Duration | Business Escalation |
|----------|----------|-------------------|
| SEV1 | Immediate | VP Engineering |
| SEV1 | 30 minutes | CTO + Customer Success VP |
| SEV1 | 1 hour | CEO + Full Executive Team |
| SEV2 | 2 hours | VP Engineering |
| SEV2 | 4 hours | CTO |
| SEV3 | 1 business day | Engineering Manager |
### Status Page Management
#### Update Principles
1. **Transparency:** Provide factual information without speculation
2. **Timeliness:** Update within committed timeframes
3. **Clarity:** Use customer-friendly language, avoid technical jargon
4. **Completeness:** Include impact scope, status, and next update time
#### Status Categories
- **Operational:** All systems functioning normally
- **Degraded Performance:** Some users may experience slowness
- **Partial Outage:** Subset of features unavailable
- **Major Outage:** Service unavailable for most/all users
- **Under Maintenance:** Planned maintenance window
#### Update Template
```
{Timestamp} - {Status Category}
{Brief description of current state}
Impact: {who is affected and how}
Cause: {root cause if known, "under investigation" if not}
Resolution: {what's being done to fix it}
Next update: {specific time}
We apologize for any inconvenience this may cause.
```
### Action Item Framework
#### Action Item Categories
1. **Immediate Fixes**
- Critical bugs discovered during incident
- Security vulnerabilities exposed
- Data integrity issues
2. **Process Improvements**
- Communication gaps
- Escalation procedure updates
- Runbook additions/updates
3. **Technical Debt**
- Architecture improvements
- Monitoring enhancements
- Automation opportunities
4. **Organizational Changes**
- Team structure adjustments
- Training requirements
- Tool/platform investments
#### Action Item Template
```
**Title:** {Concise description of the action}
**Priority:** {Critical/High/Medium/Low}
**Category:** {Fix/Process/Technical/Organizational}
**Owner:** {Assigned person}
**Due Date:** {Specific date}
**Success Criteria:** {How will we know this is complete}
**Dependencies:** {What needs to happen first}
**Related PIRs:** {Links to other incidents this addresses}
**Description:**
{Detailed description of what needs to be done and why}
**Implementation Plan:**
1. {Step 1}
2. {Step 2}
3. {Validation step}
**Progress Updates:**
- {Date}: {Progress update}
- {Date}: {Progress update}
```


@@ -1,5 +1,5 @@
---
name: ms365-tenant-manager
name: "ms365-tenant-manager"
description: Microsoft 365 tenant administration for Global Administrators. Automate M365 tenant setup, Office 365 admin tasks, Azure AD user management, Exchange Online configuration, Teams administration, and security policies. Generate PowerShell scripts for bulk operations, Conditional Access policies, license management, and compliance reporting. Use for M365 tenant manager, Office 365 admin, Azure AD users, Global Administrator, tenant configuration, or Microsoft 365 automation.
---
@@ -9,136 +9,38 @@ Expert guidance and automation for Microsoft 365 Global Administrators managing
---
## Table of Contents
- [Trigger Phrases](#trigger-phrases)
- [Quick Start](#quick-start)
- [Tools](#tools)
- [Workflows](#workflows)
- [Best Practices](#best-practices)
- [Reference Guides](#reference-guides)
- [Limitations](#limitations)
---
## Trigger Phrases
Use this skill when you hear:
- "set up Microsoft 365 tenant"
- "create Office 365 users"
- "configure Azure AD"
- "generate PowerShell script for M365"
- "set up Conditional Access"
- "bulk user provisioning"
- "M365 security audit"
- "license management"
- "Exchange Online configuration"
- "Teams administration"
---
## Quick Start
### Run a Security Audit
```bash
python scripts/powershell_generator.py --action audit --output audit_script.ps1
```
The generated audit script includes checks like:
```powershell
Connect-MgGraph -Scopes "Directory.Read.All","Policy.Read.All","AuditLog.Read.All"
Get-MgSubscribedSku | Select-Object SkuPartNumber, ConsumedUnits, @{N="Total";E={$_.PrepaidUnits.Enabled}}
Get-MgPolicyAuthorizationPolicy | Select-Object AllowInvitesFrom, DefaultUserRolePermissions
```
### Bulk Provision Users from CSV
```bash
python scripts/user_management.py --action provision --csv users.csv --license E3
```
The generated provisioning script follows this pattern:
```powershell
# CSV columns: DisplayName, UserPrincipalName, Department, LicenseSku
Import-Csv .\new_users.csv | ForEach-Object {
    $passwordProfile = @{ Password = (New-Guid).ToString().Substring(0,16) + "!"; ForceChangePasswordNextSignIn = $true }
    New-MgUser -DisplayName $_.DisplayName -UserPrincipalName $_.UserPrincipalName `
        -Department $_.Department -AccountEnabled -PasswordProfile $passwordProfile
}
```
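The provisioning CSV itself can be generated with the stdlib `csv` module. A sketch using the column names from the PowerShell comment above — the sample user is hypothetical:

```python
# Sketch: producing the provisioning CSV with the stdlib csv module.
# Column names follow the provisioning script's expected input.
import csv
import io

FIELDS = ["DisplayName", "UserPrincipalName", "Department", "LicenseSku"]

def to_csv(users):
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(users)
    return buf.getvalue()

# Hypothetical sample user for illustration only.
csv_text = to_csv([
    {"DisplayName": "Jane Doe", "UserPrincipalName": "jane.doe@contoso.com",
     "Department": "Engineering", "LicenseSku": "ENTERPRISEPACK"},
])
print(csv_text)
```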
### Create a Conditional Access Policy (MFA for Admins)
```bash
python scripts/powershell_generator.py --action conditional-access --require-mfa --include-admins
```
---
## Tools
### powershell_generator.py
Generates ready-to-use PowerShell scripts for Microsoft 365 administration.
**Usage:**
```bash
# Generate security audit script
python scripts/powershell_generator.py --action audit
# Generate Conditional Access policy script
python scripts/powershell_generator.py --action conditional-access \
--policy-name "Require MFA for Admins" \
--require-mfa \
--include-users "All"
# Generate bulk license assignment script
python scripts/powershell_generator.py --action license \
--csv users.csv \
--sku "ENTERPRISEPACK"
```
**Parameters:**
| Parameter | Required | Description |
|-----------|----------|-------------|
| `--action` | Yes | Script type: `audit`, `conditional-access`, `license`, `users` |
| `--policy-name` | No | Name for Conditional Access policy |
| `--require-mfa` | No | Require MFA in policy |
| `--include-users` | No | Users to include: `All` or specific UPNs |
| `--csv` | No | CSV file path for bulk operations |
| `--sku` | No | License SKU for assignment |
| `--output` | No | Output file path (default: stdout) |
**Output:** Complete PowerShell scripts with error handling, logging, and best practices.
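As a rough sketch of how a generator like this can work internally (the template names and structure are assumptions for illustration, not the tool's actual source), a stdlib-only CLI can assemble scripts from string templates keyed on `--action`:

```python
# Illustrative template-based PowerShell script generator (stdlib only).
AUDIT_TEMPLATE = """\
Connect-MgGraph -Scopes "Directory.Read.All","Policy.Read.All"
Get-MgSubscribedSku | Select-Object SkuPartNumber, ConsumedUnits
"""

CA_TEMPLATE = """\
$policy = @{{
    DisplayName = "{name}"
    State = "enabledForReportingButNotEnforced"
    GrantControls = @{{ Operator = "OR"; BuiltInControls = @("mfa") }}
}}
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy
"""

def build_script(action: str, policy_name: str = "Require MFA") -> str:
    """Return the PowerShell text for the requested --action."""
    if action == "audit":
        return AUDIT_TEMPLATE
    if action == "conditional-access":
        return CA_TEMPLATE.format(name=policy_name)
    raise ValueError(f"unknown action: {action}")
```

An `argparse` front end would map `--action`/`--policy-name` onto these arguments and write the result to `--output` or stdout.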
### user_management.py
Automates user lifecycle operations and bulk provisioning.
**Usage:**
```bash
# Provision users from CSV
python scripts/user_management.py --action provision --csv new_users.csv
# Offboard user securely
python scripts/user_management.py --action offboard --user john.doe@company.com
# Generate inactive users report
python scripts/user_management.py --action report-inactive --days 90
```
**Parameters:**
| Parameter | Required | Description |
|-----------|----------|-------------|
| `--action` | Yes | Operation: `provision`, `offboard`, `report-inactive`, `sync` |
| `--csv` | No | CSV file for bulk operations |
| `--user` | No | Single user UPN |
| `--days` | No | Days for inactivity threshold (default: 90) |
| `--license` | No | License SKU to assign |
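A minimal sketch of the `provision` path (the column names follow the CSV contract shown earlier; the function itself is illustrative, not the tool's actual code): read the CSV and emit one `New-MgUser` command per row.

```python
import csv
import io

def provision_commands(csv_text: str) -> list:
    """Turn provisioning CSV rows into PowerShell New-MgUser commands."""
    rows = csv.DictReader(io.StringIO(csv_text))
    commands = []
    for row in rows:
        commands.append(
            f"New-MgUser -DisplayName '{row['DisplayName']}' "
            f"-UserPrincipalName '{row['UserPrincipalName']}' -AccountEnabled"
        )
    return commands

sample = "DisplayName,UserPrincipalName\nAda Lovelace,ada@contoso.com\n"
# provision_commands(sample) yields a single New-MgUser line for Ada
```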
### tenant_setup.py
Initial tenant configuration and service provisioning automation.
**Usage:**
```bash
# Generate tenant setup checklist
python scripts/tenant_setup.py --action checklist --company "Acme Inc" --users 50
# Generate DNS records configuration
python scripts/tenant_setup.py --action dns --domain acme.com
# Generate security baseline script
python scripts/tenant_setup.py --action security-baseline
```
For example, a Conditional Access policy created in report-only mode:
```powershell
$adminRoles = (Get-MgDirectoryRole | Where-Object { $_.DisplayName -match "Admin" }).Id
$policy = @{
DisplayName = "Require MFA for Admins"
State = "enabledForReportingButNotEnforced" # Start in report-only mode
Conditions = @{ Users = @{ IncludeRoles = $adminRoles } }
GrantControls = @{ Operator = "OR"; BuiltInControls = @("mfa") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy
```
---
@@ -149,69 +51,150 @@ python scripts/tenant_setup.py --action security-baseline
**Step 1: Generate Setup Checklist**
```bash
python scripts/tenant_setup.py --action checklist --company "Company Name" --users 100
```
Confirm prerequisites before provisioning:
- Global Admin account created and secured with MFA
- Custom domain purchased and accessible for DNS edits
- License SKUs confirmed (E3 vs E5 feature requirements noted)
**Step 2: Configure and Verify DNS Records**
```bash
python scripts/tenant_setup.py --action dns --domain company.com
```
After adding the domain in the M365 admin center, verify propagation before proceeding:
```powershell
$domain = "company.com"
Resolve-DnsName -Name "_msdcs.$domain" -Type NS -ErrorAction SilentlyContinue
# Also run from a shell prompt:
# nslookup -type=MX company.com
# nslookup -type=TXT company.com  # confirm SPF record
```
Wait for DNS propagation (up to 48 h) before bulk user creation.
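Rather than waiting a fixed 48 h, propagation can be polled. A generic stdlib sketch (the actual record lookup, e.g. shelling out to `nslookup`, is left as the injected `check` callable; `spf_record_visible` below is hypothetical):

```python
import time

def wait_until(check, timeout_s: float, interval_s: float = 30.0,
               clock=time.monotonic, sleep=time.sleep) -> bool:
    """Poll check() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while clock() < deadline:
        if check():
            return True
        sleep(interval_s)  # e.g. re-run the nslookup / Resolve-DnsName probe
    return False

# Hypothetical wiring: wait up to 48 h, probing every 5 minutes
# wait_until(lambda: spf_record_visible("company.com"),
#            timeout_s=48 * 3600, interval_s=300)
```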
**Step 3: Apply Security Baseline**
```bash
python scripts/powershell_generator.py --action audit > initial_audit.ps1
```
```powershell
# Disable legacy authentication (blocks Basic Auth protocols)
$policy = @{
DisplayName = "Block Legacy Authentication"
State = "enabled"
Conditions = @{ ClientAppTypes = @("exchangeActiveSync","other") }
GrantControls = @{ Operator = "OR"; BuiltInControls = @("block") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy
# Enable unified audit log
Set-AdminAuditLogConfig -UnifiedAuditLogIngestionEnabled $true
```
**Step 4: Provision Users**
```bash
python scripts/user_management.py --action provision --csv employees.csv --license E3
```
```powershell
$licenseSku = (Get-MgSubscribedSku | Where-Object { $_.SkuPartNumber -eq "ENTERPRISEPACK" }).SkuId
Import-Csv .\employees.csv | ForEach-Object {
try {
$user = New-MgUser -DisplayName $_.DisplayName -UserPrincipalName $_.UserPrincipalName `
-AccountEnabled -PasswordProfile @{ Password = (New-Guid).ToString().Substring(0,12)+"!"; ForceChangePasswordNextSignIn = $true }
Set-MgUserLicense -UserId $user.Id -AddLicenses @(@{ SkuId = $licenseSku }) -RemoveLicenses @()
Write-Host "Provisioned: $($_.UserPrincipalName)"
} catch {
Write-Warning "Failed $($_.UserPrincipalName): $_"
}
}
```
**Validation:** Spot-check 3-5 accounts in the M365 admin portal; confirm licenses show "Active."
---
### Workflow 2: Security Hardening
**Step 1: Run Security Audit**
```bash
python scripts/powershell_generator.py --action audit --output security_audit.ps1
```
```powershell
Connect-MgGraph -Scopes "Directory.Read.All","Policy.Read.All","AuditLog.Read.All","Reports.Read.All"
# Export Conditional Access policy inventory
Get-MgIdentityConditionalAccessPolicy | Select-Object DisplayName, State |
Export-Csv .\ca_policies.csv -NoTypeInformation
# Find accounts without MFA registered
$report = Get-MgReportAuthenticationMethodUserRegistrationDetail
$report | Where-Object { -not $_.IsMfaRegistered } |
Select-Object UserPrincipalName, IsMfaRegistered |
Export-Csv .\no_mfa_users.csv -NoTypeInformation
Write-Host "Audit complete. Review ca_policies.csv and no_mfa_users.csv."
```
**Step 2: Create MFA Policy (report-only first)**
```bash
python scripts/powershell_generator.py --action conditional-access \
--policy-name "Require MFA All Users" \
--require-mfa \
--include-users "All"
```
```powershell
$policy = @{
DisplayName = "Require MFA All Users"
State = "enabledForReportingButNotEnforced"
Conditions = @{ Users = @{ IncludeUsers = @("All") } }
GrantControls = @{ Operator = "OR"; BuiltInControls = @("mfa") }
}
New-MgIdentityConditionalAccessPolicy -BodyParameter $policy
```
**Validation:** After 48 h, review Sign-in logs in Entra ID; confirm expected users would be challenged, then change `State` to `"enabled"`.
**Step 3: Review Secure Score**
```powershell
# Retrieve current Secure Score and top improvement actions
Get-MgSecuritySecureScore -Top 1 | Select-Object CurrentScore, MaxScore, ActiveUserCount
Get-MgSecuritySecureScoreControlProfile | Sort-Object -Property ActionType |
Select-Object Title, ImplementationStatus, MaxScore | Format-Table -AutoSize
```
---
### Workflow 3: User Offboarding
Generate a full offboarding script with the bundled tool:
```bash
python scripts/user_management.py --action offboard --user departing.user@company.com
```
Or walk through the steps manually.
**Step 1: Block Sign-in and Revoke Sessions**
```powershell
$upn = "departing.user@company.com"
$user = Get-MgUser -Filter "userPrincipalName eq '$upn'"
# Block sign-in immediately
Update-MgUser -UserId $user.Id -AccountEnabled:$false
# Revoke all active tokens
Invoke-MgInvalidateAllUserRefreshToken -UserId $user.Id
Write-Host "Sign-in blocked and sessions revoked for $upn"
```
**Step 2: Preview with -WhatIf (license removal)**
```powershell
# Identify assigned licenses
$licenses = (Get-MgUserLicenseDetail -UserId $user.Id).SkuId
# Dry-run: print what would be removed
$licenses | ForEach-Object { Write-Host "[WhatIf] Would remove SKU: $_" }
```
**Step 3: Execute Offboarding**
```powershell
# Remove licenses
Set-MgUserLicense -UserId $user.Id -AddLicenses @() -RemoveLicenses $licenses
# Convert mailbox to shared (requires ExchangeOnlineManagement module)
Set-Mailbox -Identity $upn -Type Shared
# Remove from all groups
Get-MgUserMemberOf -UserId $user.Id | ForEach-Object {
try { Remove-MgGroupMemberByRef -GroupId $_.Id -DirectoryObjectId $user.Id } catch {}
}
Write-Host "Offboarding complete for $upn"
```
**Validation:** Confirm in the M365 admin portal that the account shows "Blocked," has no active licenses, and the mailbox type is "Shared."
---
## Best Practices
@@ -221,47 +204,42 @@ python scripts/user_management.py --action offboard --user departing.user@compan
1. Enable MFA before adding users
2. Configure named locations for Conditional Access
3. Use separate admin accounts with PIM
4. Verify custom domains (and DNS propagation) before bulk user creation
5. Apply Microsoft Secure Score recommendations
### Security Operations
1. Start Conditional Access policies in report-only mode
2. Review Sign-in logs for 48 h before enforcing a new policy
3. Never hardcode credentials in scripts — use Azure Key Vault or `Get-Credential`
4. Enable unified audit logging for all operations
5. Conduct quarterly security reviews and Secure Score check-ins
### PowerShell Automation
1. Prefer Microsoft Graph (`Microsoft.Graph` module) over legacy MSOnline
2. Include `try/catch` blocks for error handling
3. Implement `Write-Host`/`Write-Warning` logging for audit trails
4. Use `-WhatIf` or dry-run output before bulk destructive operations
5. Test in a non-production tenant first
---
## Reference Guides
### When to Use Each Reference
**references/powershell-templates.md**
- Ready-to-use script templates
- Conditional Access policy examples
- Bulk user provisioning scripts
- Security audit scripts
**references/security-policies.md**
- Conditional Access configuration
- MFA enforcement strategies
- DLP and retention policies
- Security baseline settings
**references/troubleshooting.md**
- Common error resolutions
- PowerShell module issues
- Permission troubleshooting
@@ -289,7 +267,7 @@ Install-Module MicrosoftTeams -Scope CurrentUser
### Required Permissions
- **Global Administrator** – Full tenant setup
- **User Administrator** – User management
- **Security Administrator** – Security policies
- **Exchange Administrator** – Mailbox management

View File

@@ -9,17 +9,5 @@
"homepage": "https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/playwright-pro",
"repository": "https://github.com/alirezarezvani/claude-skills",
"license": "MIT",
"skills": "./"
}

View File

@@ -1,6 +1,6 @@
---
name: "playwright-pro"
description: "Production-grade Playwright testing toolkit. Use when the user mentions Playwright tests, end-to-end testing, browser automation, fixing flaky tests, test migration, CI/CD testing, or test suites. Generate tests, fix flaky failures, migrate from Cypress/Selenium, sync with TestRail, run on BrowserStack. 55 templates, 3 agents, smart reporting."
---
# Playwright Pro
@@ -23,6 +23,45 @@ When installed as a Claude Code plugin, these are available as `/pw:` commands:
| `/pw:browserstack` | Run on BrowserStack, pull cross-browser reports |
| `/pw:report` | Generate test report in your preferred format |
## Quick Start Workflow
The recommended sequence for most projects:
```
1. /pw:init → scaffolds config, CI pipeline, and a first smoke test
2. /pw:generate → generates tests from your spec or URL
3. /pw:review → validates quality and flags anti-patterns ← always run after generate
4. /pw:fix <test> → diagnoses and repairs any failing/flaky tests ← run when CI turns red
```
**Validation checkpoints:**
- After `/pw:generate` — always run `/pw:review` before committing; it catches locator anti-patterns and missing assertions automatically.
- After `/pw:fix` — re-run the full suite locally (`npx playwright test`) to confirm the fix doesn't introduce regressions.
- After `/pw:migrate` — run `/pw:coverage` to confirm parity with the old suite before decommissioning Cypress/Selenium tests.
### Example: Generate → Review → Fix
```bash
# 1. Generate tests from a user story
/pw:generate "As a user I can log in with email and password"
# Generated: tests/auth/login.spec.ts
# → Playwright Pro creates the file using the auth template.
# 2. Review the generated tests
/pw:review tests/auth/login.spec.ts
# → Flags: one test used page.locator('input[type=password]') — suggests getByLabel('Password')
# → Fix applied automatically.
# 3. Run locally to confirm
npx playwright test tests/auth/login.spec.ts --headed
# 4. If a test is flaky in CI, diagnose it
/pw:fix tests/auth/login.spec.ts
# → Identifies missing web-first assertion; replaces waitForTimeout(2000) with expect(locator).toBeVisible()
```
## Golden Rules
1. `getByRole()` over CSS/XPath — resilient to markup changes

View File

@@ -1,5 +1,5 @@
---
name: "browserstack"
description: >-
Run tests on BrowserStack. Use when user mentions "browserstack",
"cross-browser", "cloud testing", "browser matrix", "test on safari",
@@ -40,7 +40,7 @@ export default defineConfig({
// ... existing config
projects: isBS ? [
{
name: 'chrome@latest:Windows 11',
use: {
connectOptions: {
wsEndpoint: `wss://cdp.browserstack.com/playwright?caps=${encodeURIComponent(JSON.stringify({
@@ -55,7 +55,7 @@ export default defineConfig({
},
},
{
name: 'firefox@latest:Windows 11',
use: {
connectOptions: {
wsEndpoint: `wss://cdp.browserstack.com/playwright?caps=${encodeURIComponent(JSON.stringify({
@@ -70,7 +70,7 @@ export default defineConfig({
},
},
{
name: 'webkit@latest:OS X Ventura',
use: {
connectOptions: {
wsEndpoint: `wss://cdp.browserstack.com/playwright?caps=${encodeURIComponent(JSON.stringify({

View File

@@ -1,5 +1,5 @@
---
name: "coverage"
description: >-
Analyze test coverage gaps. Use when user says "test coverage",
"what's not tested", "coverage gaps", "missing tests", "coverage report",

View File

@@ -1,5 +1,5 @@
---
name: "fix"
description: >-
Fix failing or flaky Playwright tests. Use when user says "fix test",
"flaky test", "test failing", "debug test", "test broken", "test passes
@@ -14,7 +14,7 @@ Diagnose and fix a Playwright test that fails or passes intermittently using a s
`$ARGUMENTS` contains:
- A test file path: `e2e/login.spec.ts`
- A test name: `"should redirect after login"`
- A description: `"the checkout test fails in CI but passes locally"`
## Steps

View File

@@ -1,5 +1,5 @@
---
name: "generate"
description: >-
Generate Playwright tests. Use when user says "write tests", "generate tests",
"add tests for", "test this component", "e2e test", "create test for",

View File

@@ -1,5 +1,5 @@
---
name: "init"
description: >-
Set up Playwright in a project. Use when user says "set up playwright",
"add e2e tests", "configure playwright", "testing setup", "init playwright",
@@ -61,9 +61,9 @@ export default defineConfig({
screenshot: 'only-on-failure',
},
projects: [
{ name: 'chromium', use: { ...devices['Desktop Chrome'] } },
{ name: 'firefox', use: { ...devices['Desktop Firefox'] } },
{ name: 'webkit', use: { ...devices['Desktop Safari'] } },
],
webServer: {
command: 'npm run dev',
@@ -125,7 +125,7 @@ test.describe('Homepage', () => {
If `.github/workflows/` exists, create `playwright.yml`:
```yaml
name: Playwright Tests
on:
push:
@@ -142,16 +142,16 @@ jobs:
- uses: actions/setup-node@v4
with:
node-version: lts/*
- name: Install dependencies
run: npm ci
- name: Install Playwright Browsers
run: npx playwright install --with-deps
- name: Run Playwright tests
run: npx playwright test
- uses: actions/upload-artifact@v4
if: ${{ !cancelled() }}
with:
name: playwright-report
path: playwright-report/
retention-days: 30
```

View File

@@ -1,5 +1,5 @@
---
name: "migrate"
description: >-
Migrate from Cypress or Selenium to Playwright. Use when user mentions
"cypress", "selenium", "migrate tests", "convert tests", "switch to

View File

@@ -1,5 +1,5 @@
---
name: "report"
description: >-
Generate test report. Use when user says "test report", "results summary",
"test status", "show results", "test dashboard", or "how did tests go".

View File

@@ -1,5 +1,5 @@
---
name: "review"
description: >-
Review Playwright tests for quality. Use when user says "review tests",
"check test quality", "audit tests", "improve tests", "test code review",
@@ -72,7 +72,7 @@ For each file:
### Critical
- Line 15: `waitForTimeout(2000)` → use `expect(locator).toBeVisible()`
- Line 28: CSS selector `.btn-submit` → `getByRole('button', { name: 'Submit' })`
### Warning
- Line 42: Test name "test login" → "should redirect to dashboard after login"

View File

@@ -1,5 +1,5 @@
---
name: "testrail"
description: >-
Sync tests with TestRail. Use when user mentions "testrail", "test management",
"test cases", "test run", "sync test cases", "push results to testrail",

View File

@@ -7,5 +7,6 @@
},
"repository": "https://github.com/alirezarezvani/claude-skills",
"license": "MIT",
"skills": "./",
"homepage": "https://github.com/alirezarezvani/claude-skills/tree/main/engineering-team/self-improving-agent"
}

View File

@@ -1,5 +1,5 @@
---
name: "self-improving-agent"
description: "Curate Claude Code's auto-memory into durable project knowledge. Analyze MEMORY.md for patterns, promote proven learnings to CLAUDE.md and .claude/rules/, extract recurring solutions into reusable skills. Use when: (1) reviewing what Claude has learned about your project, (2) graduating a pattern from notes to enforced rules, (3) turning a debugging solution into a skill, (4) checking memory health and capacity."
---

View File

@@ -1,5 +1,5 @@
---
name: "extract"
description: "Turn a proven pattern or debugging solution into a standalone reusable skill with SKILL.md, reference docs, and examples."
command: /si:extract
---
@@ -75,7 +75,7 @@ The generated SKILL.md must follow this format:
```markdown
---
name: "<skill-name>"
description: "<one-line description>. Use when: <trigger conditions>."
---

View File

@@ -1,5 +1,5 @@
---
name: "promote"
description: "Graduate a proven pattern from auto-memory (MEMORY.md) to CLAUDE.md or .claude/rules/ for permanent enforcement."
command: /si:promote
---

View File

@@ -1,5 +1,5 @@
---
name: "remember"
description: "Explicitly save important knowledge to auto-memory with timestamp and context. Use when a discovery is too important to rely on auto-capture."
command: /si:remember
---

View File

@@ -1,5 +1,5 @@
---
name: "review"
description: "Analyze auto-memory for promotion candidates, stale entries, consolidation opportunities, and health metrics."
command: /si:review
---

View File

@@ -1,5 +1,5 @@
---
name: "status"
description: "Memory health dashboard showing line counts, topic files, capacity, stale entries, and recommendations."
command: /si:status
---

View File

@@ -1,5 +1,5 @@
---
name: "senior-architect"
description: This skill should be used when the user asks to "design system architecture", "evaluate microservices vs monolith", "create architecture diagrams", "analyze dependencies", "choose a database", "plan for scalability", "make technical decisions", or "review system design". Use for architecture decision records (ADRs), tech stack evaluation, system design reviews, dependency analysis, and generating architecture diagrams in Mermaid, PlantUML, or ASCII format.
---

View File

@@ -1,26 +1,12 @@
---
name: "senior-backend"
description: Designs and implements backend systems including REST APIs, microservices, database architectures, authentication flows, and security hardening. Use when the user asks to "design REST APIs", "optimize database queries", "implement authentication", "build microservices", "review backend code", "set up GraphQL", "handle database migrations", or "load test APIs". Covers Node.js/Express/Fastify development, PostgreSQL optimization, API security, and backend architecture patterns.
---
# Senior Backend Engineer
Backend development patterns, API design, database optimization, and security practices.
## Table of Contents
- [Quick Start](#quick-start)
- [Tools Overview](#tools-overview)
- [API Scaffolder](#1-api-scaffolder)
- [Database Migration Tool](#2-database-migration-tool)
- [API Load Tester](#3-api-load-tester)
- [Backend Development Workflows](#backend-development-workflows)
- [API Design Workflow](#api-design-workflow)
- [Database Optimization Workflow](#database-optimization-workflow)
- [Security Hardening Workflow](#security-hardening-workflow)
- [Reference Documentation](#reference-documentation)
- [Common Patterns Quick Reference](#common-patterns-quick-reference)
---
## Quick Start
@@ -51,17 +37,7 @@ Generates API route handlers, middleware, and OpenAPI specifications from schema
```bash
# Generate Express routes from OpenAPI spec
python scripts/api_scaffolder.py openapi.yaml --framework express --output src/routes/
# Output: Generated 12 route handlers, validation middleware, and TypeScript types
# Generate from database schema
python scripts/api_scaffolder.py --from-db postgres://localhost/mydb --output src/routes/
@@ -88,32 +64,12 @@ Analyzes database schemas, detects changes, and generates migration files with r
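The core schema-diff idea behind migration generation can be sketched in a few lines (a simplified illustration over assumed `{table: {column: type}}` dictionaries, not the tool's actual implementation):

```python
def diff_schemas(current: dict, target: dict) -> list:
    """Compare {table: {column: type}} mappings and emit forward migration DDL."""
    statements = []
    for table, cols in target.items():
        if table not in current:
            # New table: create it wholesale
            col_defs = ", ".join(f"{c} {t}" for c, t in cols.items())
            statements.append(f"CREATE TABLE {table} ({col_defs});")
            continue
        for col, col_type in cols.items():
            if col not in current[table]:
                # Existing table gained a column
                statements.append(f"ALTER TABLE {table} ADD COLUMN {col} {col_type};")
    return statements

current = {"users": {"id": "bigint", "email": "text"}}
target = {"users": {"id": "bigint", "email": "text", "name": "text"},
          "orders": {"id": "bigint"}}
# diff_schemas(current, target)
# → ["ALTER TABLE users ADD COLUMN name text;", "CREATE TABLE orders (id bigint);"]
```

A rollback file is the mirror image: `DROP COLUMN` / `DROP TABLE` statements generated from the same diff in reverse.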
```bash
# Analyze current schema and suggest optimizations
python scripts/database_migration_tool.py --connection postgres://localhost/mydb --analyze
# Output: Missing indexes, N+1 query risks, and suggested migration files
# Generate migration from schema diff
python scripts/database_migration_tool.py --connection postgres://localhost/mydb \
--compare schema/v2.sql --output migrations/
# Dry-run a migration
python scripts/database_migration_tool.py --connection postgres://localhost/mydb \
--migrate migrations/20240115_add_user_indexes.sql --dry-run
@@ -132,32 +88,7 @@ Performs HTTP load testing with configurable concurrency, measuring latency perc
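The P50/P95/P99 figures in the report can be computed with a nearest-rank method; a stdlib sketch (illustrative, not the tool's actual code):

```python
import math

def latency_summary(samples_ms):
    """Summarize latencies the way a load-test report does (nearest-rank percentiles)."""
    ordered = sorted(samples_ms)
    def pct(p):
        rank = max(1, math.ceil(p * len(ordered)))  # 1-based nearest rank
        return ordered[rank - 1]
    return {"p50": pct(0.50), "p95": pct(0.95), "p99": pct(0.99),
            "min": ordered[0], "max": ordered[-1]}

# latency_summary(range(1, 101))
# → {"p50": 50, "p95": 95, "p99": 99, "min": 1, "max": 100}
```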
```bash
# Basic load test
python scripts/api_load_tester.py https://api.example.com/users --concurrency 50 --duration 30
# Output: Throughput (req/sec), latency percentiles (P50/P95/P99), error counts, and scaling recommendations
# Test with custom headers and body
python scripts/api_load_tester.py https://api.example.com/orders \
@@ -192,7 +123,7 @@ paths:
get:
summary: List users
parameters:
- name: "limit"
in: query
schema:
type: integer
@@ -319,7 +250,7 @@ import { z } from 'zod';
const CreateUserSchema = z.object({
email: z.string().email().max(255),
name: z.string().min(1).max(100),
age: z.number().int().positive().optional()
});

View File

@@ -1,5 +1,5 @@
---
name: "senior-computer-vision"
description: Computer vision engineering skill for object detection, image segmentation, and visual AI systems. Covers CNN and Vision Transformer architectures, YOLO/Faster R-CNN/DETR detection, Mask R-CNN/SAM segmentation, and production deployment with ONNX/TensorRT. Includes PyTorch, torchvision, Ultralytics, Detectron2, and MMDetection frameworks. Use when building detection pipelines, training custom models, optimizing inference, or deploying vision systems.
---
@@ -419,99 +419,7 @@ python scripts/dataset_pipeline_builder.py data/final/ \
| Positional encoding | Implicit | Explicit |
## Reference Documentation
→ See references/reference-docs-and-commands.md for details
## Performance Targets

View File

@@ -0,0 +1,96 @@
# senior-computer-vision reference
## Reference Documentation
### 1. Computer Vision Architectures
See `references/computer_vision_architectures.md` for:
- CNN backbone architectures (ResNet, EfficientNet, ConvNeXt)
- Vision Transformer variants (ViT, DeiT, Swin)
- Detection heads (anchor-based vs anchor-free)
- Feature Pyramid Networks (FPN, BiFPN, PANet)
- Neck architectures for multi-scale detection
### 2. Object Detection Optimization
See `references/object_detection_optimization.md` for:
- Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS)
- Anchor optimization and anchor-free alternatives
- Loss function design (focal loss, GIoU, CIoU, DIoU)
- Training strategies (warmup, cosine annealing, EMA)
- Data augmentation for detection (mosaic, mixup, copy-paste)
### 3. Production Vision Systems
See `references/production_vision_systems.md` for:
- ONNX export and optimization
- TensorRT deployment pipeline
- Batch inference optimization
- Edge device deployment (Jetson, Intel NCS)
- Model serving with Triton
- Video processing pipelines
## Common Commands
### Ultralytics YOLO
```bash
# Training
yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640
# Validation
yolo detect val model=best.pt data=coco.yaml
# Inference
yolo detect predict model=best.pt source=images/ save=True
# Export
yolo export model=best.pt format=onnx simplify=True dynamic=True
```
### Detectron2
```bash
# Training
python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \
--num-gpus 1 OUTPUT_DIR ./output
# Evaluation
python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \
MODEL.WEIGHTS output/model_final.pth
# Inference
python demo.py --config-file configs/faster_rcnn.yaml \
--input images/*.jpg --output results/ \
--opts MODEL.WEIGHTS output/model_final.pth
```
### MMDetection
```bash
# Training
python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py
# Testing
python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth --eval bbox
# Inference
python demo/image_demo.py demo.jpg configs/faster_rcnn.py checkpoints/latest.pth
```
### Model Optimization
```bash
# ONNX export and simplify
python -c "import torch; model = torch.load('model.pt'); torch.onnx.export(model, torch.randn(1,3,640,640), 'model.onnx', opset_version=17)"
python -m onnxsim model.onnx model_sim.onnx
# TensorRT conversion
trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --workspace=4096
# Benchmark
trtexec --loadEngine=model.engine --batch=1 --iterations=1000 --avgRuns=100
```

View File

@@ -1,5 +1,5 @@
---
name: senior-data-engineer
name: "senior-data-engineer"
description: Data engineering skill for building scalable data pipelines, ETL/ELT systems, and data infrastructure. Expertise in Python, SQL, Spark, Airflow, dbt, Kafka, and modern data stack. Includes data modeling, pipeline orchestration, data quality, and DataOps. Use when designing data architectures, building data pipelines, optimizing data workflows, implementing data governance, or troubleshooting data issues.
---
@@ -86,627 +86,7 @@ python scripts/etl_performance_optimizer.py analyze \
---
## Workflows
### Workflow 1: Building a Batch ETL Pipeline
**Scenario:** Extract data from PostgreSQL, transform with dbt, load to Snowflake.
#### Step 1: Define Source Schema
```sql
-- Document source tables
SELECT
table_name,
column_name,
data_type,
is_nullable
FROM information_schema.columns
WHERE table_schema = 'source_schema'
ORDER BY table_name, ordinal_position;
```
#### Step 2: Generate Extraction Config
```bash
python scripts/pipeline_orchestrator.py generate \
--type airflow \
--source postgres \
--tables orders,customers,products \
--mode incremental \
--watermark updated_at \
--output dags/extract_source.py
```
#### Step 3: Create dbt Models
```sql
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('postgres', 'orders') }}
),
renamed AS (
SELECT
order_id,
customer_id,
order_date,
total_amount,
status,
_extracted_at
FROM source
WHERE order_date >= DATEADD(day, -3, CURRENT_DATE)
)
SELECT * FROM renamed
```
```sql
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
cluster_by=['order_date']
)
}}
SELECT
o.order_id,
o.customer_id,
c.customer_segment,
o.order_date,
o.total_amount,
o.status
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
{% if is_incremental() %}
WHERE o._extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
{% endif %}
```
#### Step 4: Configure Data Quality Tests
```yaml
# models/marts/schema.yml
version: 2
models:
- name: fct_orders
description: "Order fact table"
columns:
- name: order_id
tests:
- unique
- not_null
- name: total_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
- name: order_date
tests:
- not_null
- dbt_utils.recency:
datepart: day
field: order_date
interval: 1
```
#### Step 5: Create Airflow DAG
```python
# dags/daily_etl.py
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
default_args = {
'owner': 'data-team',
'depends_on_past': False,
'email_on_failure': True,
'email': ['data-alerts@company.com'],
'retries': 2,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'daily_etl_pipeline',
default_args=default_args,
description='Daily ETL from PostgreSQL to Snowflake',
schedule_interval='0 5 * * *',
start_date=days_ago(1),
catchup=False,
tags=['etl', 'daily'],
) as dag:
extract = BashOperator(
task_id='extract_source_data',
bash_command='python /opt/airflow/scripts/extract.py --date {{ ds }}',
)
transform = BashOperator(
task_id='run_dbt_models',
bash_command='cd /opt/airflow/dbt && dbt run --select marts.*',
)
test = BashOperator(
task_id='run_dbt_tests',
bash_command='cd /opt/airflow/dbt && dbt test --select marts.*',
)
notify = BashOperator(
task_id='send_notification',
bash_command='python /opt/airflow/scripts/notify.py --status success',
trigger_rule='all_success',
)
extract >> transform >> test >> notify
```
#### Step 6: Validate Pipeline
```bash
# Test locally
dbt run --select stg_orders fct_orders
dbt test --select fct_orders
# Validate data quality
python scripts/data_quality_validator.py validate \
--table fct_orders \
--checks all \
--output reports/quality_report.json
```
---
### Workflow 2: Implementing Real-Time Streaming
**Scenario:** Stream events from Kafka, process with Flink/Spark Streaming, sink to data lake.
#### Step 1: Define Event Schema
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "UserEvent",
"type": "object",
"required": ["event_id", "user_id", "event_type", "timestamp"],
"properties": {
"event_id": {"type": "string", "format": "uuid"},
"user_id": {"type": "string"},
"event_type": {"type": "string", "enum": ["page_view", "click", "purchase"]},
"timestamp": {"type": "string", "format": "date-time"},
"properties": {"type": "object"}
}
}
```
#### Step 2: Create Kafka Topic
```bash
# Create topic with appropriate partitions
kafka-topics.sh --create \
--bootstrap-server localhost:9092 \
--topic user-events \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config cleanup.policy=delete
# Verify topic
kafka-topics.sh --describe \
--bootstrap-server localhost:9092 \
--topic user-events
```
#### Step 3: Implement Spark Streaming Job
```python
# streaming/user_events_processor.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
from_json, col, window, count, avg,
to_timestamp, current_timestamp, approx_count_distinct
)
from pyspark.sql.types import (
StructType, StructField, StringType,
TimestampType, MapType
)
# Initialize Spark
spark = SparkSession.builder \
.appName("UserEventsProcessor") \
.config("spark.sql.streaming.checkpointLocation", "/checkpoints/user-events") \
.config("spark.sql.shuffle.partitions", "12") \
.getOrCreate()
# Define schema
event_schema = StructType([
StructField("event_id", StringType(), False),
StructField("user_id", StringType(), False),
StructField("event_type", StringType(), False),
StructField("timestamp", StringType(), False),
StructField("properties", MapType(StringType(), StringType()), True)
])
# Read from Kafka
events_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "user-events") \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load()
# Parse JSON
parsed_df = events_df \
.select(from_json(col("value").cast("string"), event_schema).alias("data")) \
.select("data.*") \
.withColumn("event_timestamp", to_timestamp(col("timestamp")))
# Windowed aggregation
aggregated_df = parsed_df \
.withWatermark("event_timestamp", "10 minutes") \
.groupBy(
window(col("event_timestamp"), "5 minutes"),
col("event_type")
) \
.agg(
count("*").alias("event_count"),
approx_count_distinct("user_id").alias("unique_users")
)
# Write to Delta Lake
query = aggregated_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/user-events-aggregated") \
.option("path", "/data/lake/user_events_aggregated") \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
```
#### Step 4: Handle Late Data and Errors
```python
# Dead letter queue for failed records
import logging
from pyspark.sql.functions import current_timestamp, lit
logger = logging.getLogger(__name__)
def process_with_error_handling(batch_df, batch_id):
try:
# Attempt processing
valid_df = batch_df.filter(col("event_id").isNotNull())
invalid_df = batch_df.filter(col("event_id").isNull())
# Write valid records
valid_df.write \
.format("delta") \
.mode("append") \
.save("/data/lake/user_events")
# Write invalid to DLQ
if invalid_df.count() > 0:
invalid_df \
.withColumn("error_timestamp", current_timestamp()) \
.withColumn("error_reason", lit("missing_event_id")) \
.write \
.format("delta") \
.mode("append") \
.save("/data/lake/dlq/user_events")
except Exception as e:
# Log error, alert, continue
logger.error(f"Batch {batch_id} failed: {e}")
raise
# Use foreachBatch for custom processing
query = parsed_df.writeStream \
.foreachBatch(process_with_error_handling) \
.option("checkpointLocation", "/checkpoints/user-events") \
.start()
```
#### Step 5: Monitor Stream Health
```python
# monitoring/stream_metrics.py
from prometheus_client import Gauge, Counter, start_http_server
# Define metrics
RECORDS_PROCESSED = Counter(
'stream_records_processed_total',
'Total records processed',
['stream_name', 'status']
)
PROCESSING_LAG = Gauge(
'stream_processing_lag_seconds',
'Current processing lag',
['stream_name']
)
BATCH_DURATION = Gauge(
'stream_batch_duration_seconds',
'Last batch processing duration',
['stream_name']
)
def emit_metrics(query):
"""Emit Prometheus metrics from streaming query."""
progress = query.lastProgress
if progress:
RECORDS_PROCESSED.labels(
stream_name='user-events',
status='success'
).inc(progress['numInputRows'])
if progress['sources']:
# Calculate lag from latest offset
for source in progress['sources']:
end_offset = source.get('endOffset', {})
# Parse Kafka offsets and calculate lag
```
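The lag calculation left as a comment above can be sketched by diffing the newest available Kafka offsets against the last processed ones (field names and JSON shapes follow Structured Streaming's progress reporting, but verify them against your Spark version):

```python
import json

def total_offset(offset_field):
    """Sum offsets across all topics and partitions.
    Progress fields may arrive as JSON strings or already-parsed dicts."""
    if not offset_field:
        return 0
    data = json.loads(offset_field) if isinstance(offset_field, str) else offset_field
    return sum(int(off) for partitions in data.values() for off in partitions.values())

def record_lag(latest_offset, end_offset):
    # Lag in records = newest broker offset minus last offset the stream processed
    return max(0, total_offset(latest_offset) - total_offset(end_offset))
```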
---
### Workflow 3: Data Quality Framework Setup
**Scenario:** Implement comprehensive data quality monitoring with Great Expectations.
#### Step 1: Initialize Great Expectations
```bash
# Install and initialize
pip install great_expectations
great_expectations init
# Connect to data source
great_expectations datasource new
```
#### Step 2: Create Expectation Suite
```python
# expectations/orders_suite.py
import great_expectations as gx
context = gx.get_context()
# Create expectation suite
suite = context.add_expectation_suite("orders_quality_suite")
# Add expectations
validator = context.get_validator(
batch_request={
"datasource_name": "warehouse",
"data_asset_name": "orders",
},
expectation_suite_name="orders_quality_suite"
)
# Schema expectations
validator.expect_table_columns_to_match_ordered_list(
column_list=[
"order_id", "customer_id", "order_date",
"total_amount", "status", "created_at"
]
)
# Completeness expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("order_date")
# Uniqueness expectations
validator.expect_column_values_to_be_unique("order_id")
# Range expectations
validator.expect_column_values_to_be_between(
"total_amount",
min_value=0,
max_value=1000000
)
# Categorical expectations
validator.expect_column_values_to_be_in_set(
"status",
["pending", "confirmed", "shipped", "delivered", "cancelled"]
)
# Freshness expectation
validator.expect_column_max_to_be_between(
"order_date",
min_value={"$PARAMETER": "now - timedelta(days=1)"},
max_value={"$PARAMETER": "now"}
)
# Referential integrity
validator.expect_column_values_to_be_in_set(
"customer_id",
value_set={"$PARAMETER": "valid_customer_ids"}
)
validator.save_expectation_suite(discard_failed_expectations=False)
```
#### Step 3: Create Data Quality Checks with dbt
```yaml
# models/marts/schema.yml
version: 2
models:
- name: fct_orders
description: "Order fact table with data quality checks"
tests:
# Row count check
- dbt_utils.equal_rowcount:
compare_model: ref('stg_orders')
# Freshness check
- dbt_utils.recency:
datepart: hour
field: created_at
interval: 24
columns:
- name: order_id
description: "Unique order identifier"
tests:
- unique
- not_null
- relationships:
to: ref('dim_orders')
field: order_id
- name: total_amount
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
inclusive: true
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
row_condition: "status != 'cancelled'"
- name: customer_id
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
severity: warn
```
#### Step 4: Implement Data Contracts
```yaml
# contracts/orders_contract.yaml
contract:
name: orders_data_contract
version: "1.0.0"
owner: data-team@company.com
schema:
type: object
properties:
order_id:
type: string
format: uuid
description: "Unique order identifier"
customer_id:
type: string
not_null: true
order_date:
type: date
not_null: true
total_amount:
type: decimal
precision: 10
scale: 2
minimum: 0
status:
type: string
enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
sla:
freshness:
max_delay_hours: 1
completeness:
min_percentage: 99.9
accuracy:
duplicate_tolerance: 0.01
consumers:
- name: analytics-team
usage: "Daily reporting dashboards"
- name: ml-team
usage: "Churn prediction model"
```
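A contract is only useful if something enforces it; a hand-rolled record check mirroring the rules above (field names follow the sample contract, and a real setup would generate this from the YAML rather than duplicate it):

```python
# Hypothetical, stdlib-only enforcement of contracts/orders_contract.yaml.
VALID_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def validate_order(record: dict) -> list:
    """Return a list of contract violations; empty means the record passes."""
    errors = []
    for field in ("order_id", "customer_id", "order_date"):
        if not record.get(field):
            errors.append(f"{field} is required")
    amount = record.get("total_amount")
    if amount is not None and amount < 0:
        errors.append("total_amount must be >= 0")
    if record.get("status") not in VALID_STATUSES:
        errors.append("status not in allowed set")
    return errors
```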
#### Step 5: Set Up Quality Monitoring Dashboard
```python
# monitoring/quality_dashboard.py
from datetime import datetime, timedelta
import pandas as pd
def generate_quality_report(connection, table_name: str) -> dict:
"""Generate comprehensive data quality report."""
report = {
"table": table_name,
"timestamp": datetime.now().isoformat(),
"checks": {}
}
# Row count check
row_count = connection.execute(
f"SELECT COUNT(*) FROM {table_name}"
).fetchone()[0]
report["checks"]["row_count"] = {
"value": row_count,
"status": "pass" if row_count > 0 else "fail"
}
# Freshness check
max_date = connection.execute(
f"SELECT MAX(created_at) FROM {table_name}"
).fetchone()[0]
hours_old = (datetime.now() - max_date).total_seconds() / 3600
report["checks"]["freshness"] = {
"max_timestamp": max_date.isoformat(),
"hours_old": round(hours_old, 2),
"status": "pass" if hours_old < 24 else "fail"
}
# Null rate check
null_query = f"""
SELECT
SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) as null_order_id,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_customer_id,
COUNT(*) as total
FROM {table_name}
"""
null_result = connection.execute(null_query).fetchone()
report["checks"]["null_rates"] = {
"order_id": null_result[0] / null_result[2] if null_result[2] > 0 else 0,
"customer_id": null_result[1] / null_result[2] if null_result[2] > 0 else 0,
"status": "pass" if null_result[0] == 0 and null_result[1] == 0 else "fail"
}
# Duplicate check
dup_query = f"""
SELECT COUNT(*) - COUNT(DISTINCT order_id) as duplicates
FROM {table_name}
"""
duplicates = connection.execute(dup_query).fetchone()[0]
report["checks"]["duplicates"] = {
"count": duplicates,
"status": "pass" if duplicates == 0 else "fail"
}
# Overall status
all_passed = all(
check["status"] == "pass"
for check in report["checks"].values()
)
report["overall_status"] = "pass" if all_passed else "fail"
return report
```
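The raw SQL inside those checks can be smoke-tested against an in-memory SQLite table standing in for the warehouse connection (the freshness check needs real datetime columns, so it is left out of this demo):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (order_id TEXT, customer_id TEXT);
    INSERT INTO orders VALUES ('o1', 'c1'), ('o2', 'c2'), ('o2', 'c3');
""")
row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
duplicates = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]
null_ids = conn.execute(
    "SELECT SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) FROM orders"
).fetchone()[0]
print(row_count, duplicates, null_ids)
```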
---
→ See references/workflows.md for details
## Architecture Decision Framework
@@ -810,183 +190,5 @@ See `references/dataops_best_practices.md` for:
---
## Troubleshooting
→ See references/troubleshooting.md for details
### Pipeline Failures
**Symptom:** Airflow DAG fails with timeout
```
Task exceeded max execution time
```
**Solution:**
1. Check resource allocation
2. Profile slow operations
3. Add incremental processing
```python
# Increase timeout
default_args = {
'execution_timeout': timedelta(hours=2),
}
# Or use incremental loads
WHERE updated_at > '{{ prev_ds }}'
```
---
**Symptom:** Spark job OOM
```
java.lang.OutOfMemoryError: Java heap space
```
**Solution:**
1. Increase executor memory
2. Reduce partition size
3. Use disk spill
```python
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.memory.fraction", "0.8")
```
---
**Symptom:** Kafka consumer lag increasing
```
Consumer lag: 1000000 messages
```
**Solution:**
1. Increase consumer parallelism
2. Optimize processing logic
3. Scale consumer group
```bash
# Add more partitions
kafka-topics.sh --alter \
--bootstrap-server localhost:9092 \
--topic user-events \
--partitions 24
```
---
### Data Quality Issues
**Symptom:** Duplicate records appearing
```
Expected unique, found 150 duplicates
```
**Solution:**
1. Add deduplication logic
2. Use merge/upsert operations
```sql
-- dbt incremental with dedup
{{
config(
materialized='incremental',
unique_key='order_id'
)
}}
SELECT * FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY updated_at DESC
) as rn
FROM {{ source('raw', 'orders') }}
) WHERE rn = 1
```
---
**Symptom:** Stale data in tables
```
Last update: 3 days ago
```
**Solution:**
1. Check upstream pipeline status
2. Verify source availability
3. Add freshness monitoring
```yaml
# dbt freshness check
sources:
- name: raw
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: _loaded_at
```
---
**Symptom:** Schema drift detected
```
Column 'new_field' not in expected schema
```
**Solution:**
1. Update data contract
2. Modify transformations
3. Communicate with producers
```python
# Handle schema evolution
df = spark.read.format("delta") \
.option("mergeSchema", "true") \
.load("/data/orders")
```
---
### Performance Issues
**Symptom:** Query takes hours
```
Query runtime: 4 hours (expected: 30 minutes)
```
**Solution:**
1. Check query plan
2. Add proper partitioning
3. Optimize joins
```sql
-- Before: Full table scan
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- After: Partition pruning
-- Table partitioned by order_date
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Add clustering for frequent filters
ALTER TABLE orders CLUSTER BY (customer_id);
```
---
**Symptom:** dbt model takes too long
```
Model fct_orders completed in 45 minutes
```
**Solution:**
1. Use incremental materialization
2. Reduce upstream dependencies
3. Pre-aggregate where possible
```sql
-- Convert to incremental
{{
config(
materialized='incremental',
unique_key='order_id',
on_schema_change='sync_all_columns'
)
}}
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```

View File

@@ -0,0 +1,183 @@
# senior-data-engineer reference
## Troubleshooting
### Pipeline Failures
**Symptom:** Airflow DAG fails with timeout
```
Task exceeded max execution time
```
**Solution:**
1. Check resource allocation
2. Profile slow operations
3. Add incremental processing
```python
# Increase timeout
default_args = {
'execution_timeout': timedelta(hours=2),
}
# Or use incremental loads
WHERE updated_at > '{{ prev_ds }}'
```
---
**Symptom:** Spark job OOM
```
java.lang.OutOfMemoryError: Java heap space
```
**Solution:**
1. Increase executor memory
2. Reduce partition size
3. Use disk spill
```python
spark.conf.set("spark.executor.memory", "8g")
spark.conf.set("spark.sql.shuffle.partitions", "200")
spark.conf.set("spark.memory.fraction", "0.8")
```
---
**Symptom:** Kafka consumer lag increasing
```
Consumer lag: 1000000 messages
```
**Solution:**
1. Increase consumer parallelism
2. Optimize processing logic
3. Scale consumer group
```bash
# Add more partitions
kafka-topics.sh --alter \
--bootstrap-server localhost:9092 \
--topic user-events \
--partitions 24
```
---
### Data Quality Issues
**Symptom:** Duplicate records appearing
```
Expected unique, found 150 duplicates
```
**Solution:**
1. Add deduplication logic
2. Use merge/upsert operations
```sql
-- dbt incremental with dedup
{{
config(
materialized='incremental',
unique_key='order_id'
)
}}
SELECT * FROM (
SELECT
*,
ROW_NUMBER() OVER (
PARTITION BY order_id
ORDER BY updated_at DESC
) as rn
FROM {{ source('raw', 'orders') }}
) WHERE rn = 1
```
---
**Symptom:** Stale data in tables
```
Last update: 3 days ago
```
**Solution:**
1. Check upstream pipeline status
2. Verify source availability
3. Add freshness monitoring
```yaml
# dbt freshness check
sources:
- name: "raw"
freshness:
warn_after: {count: 12, period: hour}
error_after: {count: 24, period: hour}
loaded_at_field: _loaded_at
```
---
**Symptom:** Schema drift detected
```
Column 'new_field' not in expected schema
```
**Solution:**
1. Update data contract
2. Modify transformations
3. Communicate with producers
```python
# Handle schema evolution
df = spark.read.format("delta") \
.option("mergeSchema", "true") \
.load("/data/orders")
```
---
### Performance Issues
**Symptom:** Query takes hours
```
Query runtime: 4 hours (expected: 30 minutes)
```
**Solution:**
1. Check query plan
2. Add proper partitioning
3. Optimize joins
```sql
-- Before: Full table scan
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- After: Partition pruning
-- Table partitioned by order_date
SELECT * FROM orders WHERE order_date = '2024-01-15';
-- Add clustering for frequent filters
ALTER TABLE orders CLUSTER BY (customer_id);
```
---
**Symptom:** dbt model takes too long
```
Model fct_orders completed in 45 minutes
```
**Solution:**
1. Use incremental materialization
2. Reduce upstream dependencies
3. Pre-aggregate where possible
```sql
-- Convert to incremental
{{
config(
materialized='incremental',
unique_key='order_id',
on_schema_change='sync_all_columns'
)
}}
SELECT * FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE _loaded_at > (SELECT MAX(_loaded_at) FROM {{ this }})
{% endif %}
```

View File

@@ -0,0 +1,624 @@
# senior-data-engineer reference
## Workflows
### Workflow 1: Building a Batch ETL Pipeline
**Scenario:** Extract data from PostgreSQL, transform with dbt, load to Snowflake.
#### Step 1: Define Source Schema
```sql
-- Document source tables
SELECT
table_name,
column_name,
data_type,
is_nullable
FROM information_schema.columns
WHERE table_schema = 'source_schema'
ORDER BY table_name, ordinal_position;
```
#### Step 2: Generate Extraction Config
```bash
python scripts/pipeline_orchestrator.py generate \
--type airflow \
--source postgres \
--tables orders,customers,products \
--mode incremental \
--watermark updated_at \
--output dags/extract_source.py
```
#### Step 3: Create dbt Models
```sql
-- models/staging/stg_orders.sql
WITH source AS (
SELECT * FROM {{ source('postgres', 'orders') }}
),
renamed AS (
SELECT
order_id,
customer_id,
order_date,
total_amount,
status,
_extracted_at
FROM source
WHERE order_date >= DATEADD(day, -3, CURRENT_DATE)
)
SELECT * FROM renamed
```
```sql
-- models/marts/fct_orders.sql
{{
config(
materialized='incremental',
unique_key='order_id',
cluster_by=['order_date']
)
}}
SELECT
o.order_id,
o.customer_id,
c.customer_segment,
o.order_date,
o.total_amount,
o.status
FROM {{ ref('stg_orders') }} o
LEFT JOIN {{ ref('dim_customers') }} c
ON o.customer_id = c.customer_id
{% if is_incremental() %}
WHERE o._extracted_at > (SELECT MAX(_extracted_at) FROM {{ this }})
{% endif %}
```
#### Step 4: Configure Data Quality Tests
```yaml
# models/marts/schema.yml
version: 2
models:
- name: "fct-orders"
description: "Order fact table"
columns:
- name: "order-id"
tests:
- unique
- not_null
- name: "total-amount"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
- name: "order-date"
tests:
- not_null
- dbt_utils.recency:
datepart: day
field: order_date
interval: 1
```
#### Step 5: Create Airflow DAG
```python
# dags/daily_etl.py
from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago
from datetime import timedelta
default_args = {
'owner': 'data-team',
'depends_on_past': False,
'email_on_failure': True,
'email': ['data-alerts@company.com'],
'retries': 2,
'retry_delay': timedelta(minutes=5),
}
with DAG(
'daily_etl_pipeline',
default_args=default_args,
description='Daily ETL from PostgreSQL to Snowflake',
schedule_interval='0 5 * * *',
start_date=days_ago(1),
catchup=False,
tags=['etl', 'daily'],
) as dag:
extract = BashOperator(
task_id='extract_source_data',
bash_command='python /opt/airflow/scripts/extract.py --date {{ ds }}',
)
transform = BashOperator(
task_id='run_dbt_models',
bash_command='cd /opt/airflow/dbt && dbt run --select marts.*',
)
test = BashOperator(
task_id='run_dbt_tests',
bash_command='cd /opt/airflow/dbt && dbt test --select marts.*',
)
notify = BashOperator(
task_id='send_notification',
bash_command='python /opt/airflow/scripts/notify.py --status success',
trigger_rule='all_success',
)
extract >> transform >> test >> notify
```
#### Step 6: Validate Pipeline
```bash
# Test locally
dbt run --select stg_orders fct_orders
dbt test --select fct_orders
# Validate data quality
python scripts/data_quality_validator.py validate \
--table fct_orders \
--checks all \
--output reports/quality_report.json
```
---
### Workflow 2: Implementing Real-Time Streaming
**Scenario:** Stream events from Kafka, process with Flink/Spark Streaming, sink to data lake.
#### Step 1: Define Event Schema
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"title": "UserEvent",
"type": "object",
"required": ["event_id", "user_id", "event_type", "timestamp"],
"properties": {
"event_id": {"type": "string", "format": "uuid"},
"user_id": {"type": "string"},
"event_type": {"type": "string", "enum": ["page_view", "click", "purchase"]},
"timestamp": {"type": "string", "format": "date-time"},
"properties": {"type": "object"}
}
}
```
#### Step 2: Create Kafka Topic
```bash
# Create topic with appropriate partitions
kafka-topics.sh --create \
--bootstrap-server localhost:9092 \
--topic user-events \
--partitions 12 \
--replication-factor 3 \
--config retention.ms=604800000 \
--config cleanup.policy=delete
# Verify topic
kafka-topics.sh --describe \
--bootstrap-server localhost:9092 \
--topic user-events
```
#### Step 3: Implement Spark Streaming Job
```python
# streaming/user_events_processor.py
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
from_json, col, window, count, avg,
to_timestamp, current_timestamp, approx_count_distinct
)
from pyspark.sql.types import (
StructType, StructField, StringType,
TimestampType, MapType
)
# Initialize Spark
spark = SparkSession.builder \
.appName("UserEventsProcessor") \
.config("spark.sql.streaming.checkpointLocation", "/checkpoints/user-events") \
.config("spark.sql.shuffle.partitions", "12") \
.getOrCreate()
# Define schema
event_schema = StructType([
StructField("event_id", StringType(), False),
StructField("user_id", StringType(), False),
StructField("event_type", StringType(), False),
StructField("timestamp", StringType(), False),
StructField("properties", MapType(StringType(), StringType()), True)
])
# Read from Kafka
events_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("subscribe", "user-events") \
.option("startingOffsets", "latest") \
.option("failOnDataLoss", "false") \
.load()
# Parse JSON
parsed_df = events_df \
.select(from_json(col("value").cast("string"), event_schema).alias("data")) \
.select("data.*") \
.withColumn("event_timestamp", to_timestamp(col("timestamp")))
# Windowed aggregation
aggregated_df = parsed_df \
.withWatermark("event_timestamp", "10 minutes") \
.groupBy(
window(col("event_timestamp"), "5 minutes"),
col("event_type")
) \
.agg(
count("*").alias("event_count"),
approx_count_distinct("user_id").alias("unique_users")
)
# Write to Delta Lake
query = aggregated_df.writeStream \
.format("delta") \
.outputMode("append") \
.option("checkpointLocation", "/checkpoints/user-events-aggregated") \
.option("path", "/data/lake/user_events_aggregated") \
.trigger(processingTime="1 minute") \
.start()
query.awaitTermination()
```
#### Step 4: Handle Late Data and Errors
```python
# Dead letter queue for failed records
import logging
from pyspark.sql.functions import current_timestamp, lit
logger = logging.getLogger(__name__)
def process_with_error_handling(batch_df, batch_id):
try:
# Attempt processing
valid_df = batch_df.filter(col("event_id").isNotNull())
invalid_df = batch_df.filter(col("event_id").isNull())
# Write valid records
valid_df.write \
.format("delta") \
.mode("append") \
.save("/data/lake/user_events")
# Write invalid to DLQ
if invalid_df.count() > 0:
invalid_df \
.withColumn("error_timestamp", current_timestamp()) \
.withColumn("error_reason", lit("missing_event_id")) \
.write \
.format("delta") \
.mode("append") \
.save("/data/lake/dlq/user_events")
except Exception as e:
# Log error, alert, continue
logger.error(f"Batch {batch_id} failed: {e}")
raise
# Use foreachBatch for custom processing
query = parsed_df.writeStream \
.foreachBatch(process_with_error_handling) \
.option("checkpointLocation", "/checkpoints/user-events") \
.start()
```
#### Step 5: Monitor Stream Health
```python
# monitoring/stream_metrics.py
from prometheus_client import Gauge, Counter, start_http_server
# Define metrics
RECORDS_PROCESSED = Counter(
'stream_records_processed_total',
'Total records processed',
['stream_name', 'status']
)
PROCESSING_LAG = Gauge(
'stream_processing_lag_seconds',
'Current processing lag',
['stream_name']
)
BATCH_DURATION = Gauge(
'stream_batch_duration_seconds',
'Last batch processing duration',
['stream_name']
)
def emit_metrics(query):
"""Emit Prometheus metrics from streaming query."""
progress = query.lastProgress
if progress:
RECORDS_PROCESSED.labels(
stream_name='user-events',
status='success'
).inc(progress['numInputRows'])
if progress['sources']:
# Calculate lag from latest offset
for source in progress['sources']:
end_offset = source.get('endOffset', {})
# Parse Kafka offsets and calculate lag
```
---
### Workflow 3: Data Quality Framework Setup
**Scenario:** Implement comprehensive data quality monitoring with Great Expectations.
#### Step 1: Initialize Great Expectations
```bash
# Install and initialize
pip install great_expectations
great_expectations init
# Connect to data source
great_expectations datasource new
```
#### Step 2: Create Expectation Suite
```python
# expectations/orders_suite.py
import great_expectations as gx
context = gx.get_context()
# Create expectation suite
suite = context.add_expectation_suite("orders_quality_suite")
# Add expectations
validator = context.get_validator(
batch_request={
"datasource_name": "warehouse",
"data_asset_name": "orders",
},
expectation_suite_name="orders_quality_suite"
)
# Schema expectations
validator.expect_table_columns_to_match_ordered_list(
column_list=[
"order_id", "customer_id", "order_date",
"total_amount", "status", "created_at"
]
)
# Completeness expectations
validator.expect_column_values_to_not_be_null("order_id")
validator.expect_column_values_to_not_be_null("customer_id")
validator.expect_column_values_to_not_be_null("order_date")
# Uniqueness expectations
validator.expect_column_values_to_be_unique("order_id")
# Range expectations
validator.expect_column_values_to_be_between(
"total_amount",
min_value=0,
max_value=1000000
)
# Categorical expectations
validator.expect_column_values_to_be_in_set(
"status",
["pending", "confirmed", "shipped", "delivered", "cancelled"]
)
# Freshness expectation
validator.expect_column_max_to_be_between(
"order_date",
min_value={"$PARAMETER": "now - timedelta(days=1)"},
max_value={"$PARAMETER": "now"}
)
# Referential integrity
validator.expect_column_values_to_be_in_set(
"customer_id",
value_set={"$PARAMETER": "valid_customer_ids"}
)
validator.save_expectation_suite(discard_failed_expectations=False)
```
#### Step 3: Create Data Quality Checks with dbt
```yaml
# models/marts/schema.yml
version: 2
models:
- name: "fct-orders"
description: "Order fact table with data quality checks"
tests:
# Row count check
- dbt_utils.equal_rowcount:
compare_model: ref('stg_orders')
# Freshness check
- dbt_utils.recency:
datepart: hour
field: created_at
interval: 24
columns:
- name: "order-id"
description: "Unique order identifier"
tests:
- unique
- not_null
- relationships:
to: ref('dim_orders')
field: order_id
- name: "total-amount"
tests:
- not_null
- dbt_utils.accepted_range:
min_value: 0
max_value: 1000000
inclusive: true
- dbt_expectations.expect_column_values_to_be_between:
min_value: 0
row_condition: "status != 'cancelled'"
- name: "customer-id"
tests:
- not_null
- relationships:
to: ref('dim_customers')
field: customer_id
severity: warn
```
#### Step 4: Implement Data Contracts
```yaml
# contracts/orders_contract.yaml
contract:
name: "orders-data-contract"
version: "1.0.0"
owner: data-team@company.com
schema:
type: object
properties:
order_id:
type: string
format: uuid
description: "Unique order identifier"
customer_id:
type: string
not_null: true
order_date:
type: date
not_null: true
total_amount:
type: decimal
precision: 10
scale: 2
minimum: 0
status:
type: string
enum: ["pending", "confirmed", "shipped", "delivered", "cancelled"]
sla:
freshness:
max_delay_hours: 1
completeness:
min_percentage: 99.9
accuracy:
duplicate_tolerance: 0.01
consumers:
- name: "analytics-team"
usage: "Daily reporting dashboards"
- name: "ml-team"
usage: "Churn prediction model"
```
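A slice of this contract can be enforced at ingestion time with plain Python. The sketch below hard-codes the relevant rules for readability; in practice the contract YAML would be parsed and the checks derived from it (the function name and shape are illustrative, not part of any framework):

```python
# Statuses mirror the enum in contracts/orders_contract.yaml
ALLOWED_STATUSES = {"pending", "confirmed", "shipped", "delivered", "cancelled"}

def validate_order(record: dict) -> list:
    """Return a list of contract violations; an empty list means the record passes."""
    errors = []
    # not_null fields from the contract schema
    for field in ("order_id", "customer_id", "order_date"):
        if record.get(field) in (None, ""):
            errors.append(f"{field} is required")
    amount = record.get("total_amount")
    if amount is not None and amount < 0:
        errors.append("total_amount must be >= 0")
    if record.get("status") not in ALLOWED_STATUSES:
        errors.append(f"status must be one of {sorted(ALLOWED_STATUSES)}")
    return errors

ok = {"order_id": "a1", "customer_id": "c9", "order_date": "2024-01-01",
      "total_amount": 19.99, "status": "shipped"}
print(validate_order(ok))  # []
```

Rejecting records at the boundary like this keeps SLA violations (completeness, accuracy) from propagating to the consumers listed in the contract.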
#### Step 5: Set Up Quality Monitoring Dashboard
```python
# monitoring/quality_dashboard.py
from datetime import datetime, timedelta
import pandas as pd
def generate_quality_report(connection, table_name: str) -> dict:
"""Generate comprehensive data quality report."""
report = {
"table": table_name,
"timestamp": datetime.now().isoformat(),
"checks": {}
}
# Row count check
row_count = connection.execute(
f"SELECT COUNT(*) FROM {table_name}"
).fetchone()[0]
report["checks"]["row_count"] = {
"value": row_count,
"status": "pass" if row_count > 0 else "fail"
}
# Freshness check
max_date = connection.execute(
f"SELECT MAX(created_at) FROM {table_name}"
).fetchone()[0]
hours_old = (datetime.now() - max_date).total_seconds() / 3600
report["checks"]["freshness"] = {
"max_timestamp": max_date.isoformat(),
"hours_old": round(hours_old, 2),
"status": "pass" if hours_old < 24 else "fail"
}
# Null rate check
null_query = f"""
SELECT
SUM(CASE WHEN order_id IS NULL THEN 1 ELSE 0 END) as null_order_id,
SUM(CASE WHEN customer_id IS NULL THEN 1 ELSE 0 END) as null_customer_id,
COUNT(*) as total
FROM {table_name}
"""
null_result = connection.execute(null_query).fetchone()
report["checks"]["null_rates"] = {
"order_id": null_result[0] / null_result[2] if null_result[2] > 0 else 0,
"customer_id": null_result[1] / null_result[2] if null_result[2] > 0 else 0,
"status": "pass" if null_result[0] == 0 and null_result[1] == 0 else "fail"
}
# Duplicate check
dup_query = f"""
SELECT COUNT(*) - COUNT(DISTINCT order_id) as duplicates
FROM {table_name}
"""
duplicates = connection.execute(dup_query).fetchone()[0]
report["checks"]["duplicates"] = {
"count": duplicates,
"status": "pass" if duplicates == 0 else "fail"
}
# Overall status
all_passed = all(
check["status"] == "pass"
for check in report["checks"].values()
)
report["overall_status"] = "pass" if all_passed else "fail"
return report
```
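The duplicate check can be exercised locally with the standard library's `sqlite3` standing in for the warehouse connection (any DB-API connection exposing `execute`/`fetchone` works the same way):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, customer_id TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("o1", "c1"), ("o2", "c1"), ("o2", "c2")],  # o2 appears twice
)
# Same query shape as the dashboard's duplicate check
duplicates = conn.execute(
    "SELECT COUNT(*) - COUNT(DISTINCT order_id) FROM orders"
).fetchone()[0]
status = "pass" if duplicates == 0 else "fail"
print(duplicates, status)  # 1 fail
```

This makes the check cheap to unit-test before pointing it at production tables.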
---


@@ -1,176 +1,214 @@
---
name: senior-data-scientist
description: World-class data science skill for statistical modeling, experimentation, causal inference, and advanced analytics. Expertise in Python (NumPy, Pandas, Scikit-learn), R, SQL, statistical methods, A/B testing, time series, and business intelligence. Includes experiment design, feature engineering, model evaluation, and stakeholder communication. Use when designing experiments, building predictive models, performing causal analysis, or driving data-driven decisions.
name: "senior-data-scientist"
description: World-class senior data scientist skill specialising in statistical modeling, experiment design, causal inference, and predictive analytics. Covers A/B testing (sample sizing, two-proportion z-tests, Bonferroni correction), difference-in-differences, feature engineering pipelines (Scikit-learn, XGBoost), cross-validated model evaluation (AUC-ROC, AUC-PR, SHAP), and MLflow experiment tracking — using Python (NumPy, Pandas, Scikit-learn), R, and SQL. Use when designing or analysing controlled experiments, building and evaluating classification or regression models, performing causal analysis on observational data, engineering features for structured tabular datasets, or translating statistical findings into data-driven business decisions.
---
# Senior Data Scientist
World-class senior data scientist skill for production-grade AI/ML/Data systems.
## Quick Start
## Core Workflows
### Main Capabilities
### 1. Design an A/B Test
```bash
# Core Tool 1
python scripts/experiment_designer.py --input data/ --output results/
```python
import numpy as np
from scipy import stats
# Core Tool 2
python scripts/feature_engineering_pipeline.py --target project/ --analyze
def calculate_sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
"""
Calculate required sample size per variant.
baseline_rate: current conversion rate (e.g. 0.10)
mde: minimum detectable effect (relative, e.g. 0.05 = 5% lift)
"""
p1 = baseline_rate
p2 = baseline_rate * (1 + mde)
effect_size = abs(p2 - p1) / np.sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
z_alpha = stats.norm.ppf(1 - alpha / 2)
z_beta = stats.norm.ppf(power)
n = ((z_alpha + z_beta) / effect_size) ** 2
return int(np.ceil(n))
# Core Tool 3
python scripts/model_evaluation_suite.py --config config.yaml --deploy
def analyze_experiment(control, treatment, alpha=0.05):
"""
Run two-proportion z-test and return structured results.
control/treatment: dicts with 'conversions' and 'visitors'.
"""
p_c = control["conversions"] / control["visitors"]
p_t = treatment["conversions"] / treatment["visitors"]
pooled = (control["conversions"] + treatment["conversions"]) / (control["visitors"] + treatment["visitors"])
se = np.sqrt(pooled * (1 - pooled) * (1 / control["visitors"] + 1 / treatment["visitors"]))
z = (p_t - p_c) / se
p_value = 2 * (1 - stats.norm.cdf(abs(z)))
ci_low = (p_t - p_c) - stats.norm.ppf(1 - alpha / 2) * se
ci_high = (p_t - p_c) + stats.norm.ppf(1 - alpha / 2) * se
return {
"lift": (p_t - p_c) / p_c,
"p_value": p_value,
"significant": p_value < alpha,
"ci_95": (ci_low, ci_high),
}
# --- Experiment checklist ---
# 1. Define ONE primary metric and pre-register secondary metrics.
# 2. Calculate sample size BEFORE starting: calculate_sample_size(0.10, 0.05)
# 3. Randomise at the user (not session) level to avoid leakage.
# 4. Run for at least 1 full business cycle (typically 2 weeks).
# 5. Check for sample ratio mismatch: abs(n_control - n_treatment) / expected < 0.01
# 6. Analyze with analyze_experiment() and report lift + CI, not just p-value.
# 7. Apply Bonferroni correction if testing multiple metrics: alpha / n_metrics
```
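The sizing helper above can be cross-checked without SciPy: the standard library's `NormalDist` provides the same inverse CDF, so the formula is restated here as a self-contained sketch.

```python
from math import ceil, sqrt
from statistics import NormalDist

def sample_size(baseline_rate, mde, alpha=0.05, power=0.8):
    # Same two-proportion formula as calculate_sample_size above,
    # with NormalDist().inv_cdf standing in for scipy.stats.norm.ppf.
    p1 = baseline_rate
    p2 = baseline_rate * (1 + mde)
    effect = abs(p2 - p1) / sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / 2)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(((z_alpha + z_beta) / effect) ** 2)

# Detecting a 5% relative lift on a 10% baseline needs roughly 29k users per variant.
print(sample_size(0.10, 0.05))
```

The large number is the point of checklist item 2: small relative lifts on modest baselines require far more traffic than intuition suggests.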
## Core Expertise
### 2. Build a Feature Engineering Pipeline
This skill covers world-class capabilities in:
```python
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
- Advanced production patterns and architectures
- Scalable system design and implementation
- Performance optimization at scale
- MLOps and DataOps best practices
- Real-time processing and inference
- Distributed computing frameworks
- Model deployment and monitoring
- Security and compliance
- Cost optimization
- Team leadership and mentoring
def build_feature_pipeline(numeric_cols, categorical_cols, date_cols=None):
"""
Returns a fitted-ready ColumnTransformer for structured tabular data.
"""
numeric_pipeline = Pipeline([
("impute", SimpleImputer(strategy="median")),
("scale", StandardScaler()),
])
categorical_pipeline = Pipeline([
("impute", SimpleImputer(strategy="most_frequent")),
("encode", OneHotEncoder(handle_unknown="ignore", sparse_output=False)),
])
transformers = [
("num", numeric_pipeline, numeric_cols),
("cat", categorical_pipeline, categorical_cols),
]
return ColumnTransformer(transformers, remainder="drop")
## Tech Stack
def add_time_features(df, date_col):
"""Extract cyclical and lag features from a datetime column."""
df = df.copy()
df[date_col] = pd.to_datetime(df[date_col])
df["dow_sin"] = np.sin(2 * np.pi * df[date_col].dt.dayofweek / 7)
df["dow_cos"] = np.cos(2 * np.pi * df[date_col].dt.dayofweek / 7)
df["month_sin"] = np.sin(2 * np.pi * df[date_col].dt.month / 12)
df["month_cos"] = np.cos(2 * np.pi * df[date_col].dt.month / 12)
df["is_weekend"] = (df[date_col].dt.dayofweek >= 5).astype(int)
return df
**Languages:** Python, SQL, R, Scala, Go
**ML Frameworks:** PyTorch, TensorFlow, Scikit-learn, XGBoost
**Data Tools:** Spark, Airflow, dbt, Kafka, Databricks
**LLM Frameworks:** LangChain, LlamaIndex, DSPy
**Deployment:** Docker, Kubernetes, AWS/GCP/Azure
**Monitoring:** MLflow, Weights & Biases, Prometheus
**Databases:** PostgreSQL, BigQuery, Snowflake, Pinecone
# --- Feature engineering checklist ---
# 1. Never fit transformers on the full dataset — fit on train, transform test.
# 2. Log-transform right-skewed numeric features before scaling.
# 3. For high-cardinality categoricals (>50 levels), use target encoding or embeddings.
# 4. Generate lag/rolling features BEFORE the train/test split to avoid leakage.
# 5. Document each feature's business meaning alongside its code.
```
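A quick property check for the cyclical day-of-week encoding in `add_time_features` (restated with the standard library so it runs without pandas): Sunday (6) and Monday (0) should land close together in (sin, cos) space, which a plain integer encoding gets wrong.

```python
from math import sin, cos, pi, dist

def encode_dow(d):
    # Same sin/cos transform as add_time_features, for one day-of-week value.
    angle = 2 * pi * d / 7
    return (sin(angle), cos(angle))

mon, thu, sun = encode_dow(0), encode_dow(3), encode_dow(6)
# Adjacent days are nearer than days half a week apart.
print(dist(mon, sun) < dist(mon, thu))  # True
```

With raw integers, Monday (0) and Sunday (6) are maximally far apart; the cyclical form restores the adjacency a model should see.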
### 3. Train, Evaluate, and Select a Prediction Model
```python
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.metrics import roc_auc_score, average_precision_score
import xgboost as xgb
import mlflow
# Built-in scorer name strings work across scikit-learn versions
# (make_scorer's needs_proba argument was deprecated and later removed).
SCORERS = {
    "roc_auc": "roc_auc",
    "avg_prec": "average_precision",
}
def evaluate_model(model, X, y, cv=5):
"""
Cross-validate and return mean ± std for each scorer.
Use StratifiedKFold for classification to preserve class balance.
"""
cv_results = cross_validate(
model, X, y,
cv=StratifiedKFold(n_splits=cv, shuffle=True, random_state=42),
scoring=SCORERS,
return_train_score=True,
)
summary = {}
for metric in SCORERS:
test_scores = cv_results[f"test_{metric}"]
summary[metric] = {"mean": test_scores.mean(), "std": test_scores.std()}
# Flag overfitting: large gap between train and test score
train_mean = cv_results[f"train_{metric}"].mean()
summary[metric]["overfit_gap"] = train_mean - test_scores.mean()
return summary
def train_and_log(model, X_train, y_train, X_test, y_test, run_name):
"""Train model and log all artefacts to MLflow."""
with mlflow.start_run(run_name=run_name):
model.fit(X_train, y_train)
proba = model.predict_proba(X_test)[:, 1]
metrics = {
"roc_auc": roc_auc_score(y_test, proba),
"avg_prec": average_precision_score(y_test, proba),
}
mlflow.log_params(model.get_params())
mlflow.log_metrics(metrics)
mlflow.sklearn.log_model(model, "model")
return metrics
# --- Model evaluation checklist ---
# 1. Always report AUC-PR alongside AUC-ROC for imbalanced datasets.
# 2. Check overfit_gap > 0.05 as a warning sign of overfitting.
# 3. Calibrate probabilities (Platt scaling / isotonic) before production use.
# 4. Compute SHAP values to validate feature importance makes business sense.
# 5. Run a baseline (e.g. DummyClassifier) and verify the model beats it.
# 6. Log every run to MLflow — never rely on notebook output for comparison.
```
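Checklist item 5 (always beat a baseline) can be made concrete without scikit-learn. A majority-class predictor already scores the base rate, so any candidate model has to clear at least this bar; the helper below is an illustrative stand-in for `DummyClassifier`.

```python
from collections import Counter

def majority_baseline_accuracy(y):
    """Accuracy of always predicting the most common class."""
    most_common_count = Counter(y).most_common(1)[0][1]
    return most_common_count / len(y)

y = [0] * 90 + [1] * 10  # 10% positive class
print(majority_baseline_accuracy(y))  # 0.9
# A model reporting 88% accuracy here is worse than doing nothing,
# which is why AUC-PR matters on imbalanced data (checklist item 1).
```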
### 4. Causal Inference: Difference-in-Differences
```python
import statsmodels.formula.api as smf
def diff_in_diff(df, outcome, treatment_col, post_col, controls=None):
"""
Estimate ATT via OLS DiD with optional covariates.
df must have: outcome, treatment_col (0/1), post_col (0/1).
Returns the interaction coefficient (treatment × post) and its p-value.
"""
covariates = " + ".join(controls) if controls else ""
formula = (
f"{outcome} ~ {treatment_col} * {post_col}"
+ (f" + {covariates}" if covariates else "")
)
result = smf.ols(formula, data=df).fit(cov_type="HC3")
interaction = f"{treatment_col}:{post_col}"
return {
"att": result.params[interaction],
"p_value": result.pvalues[interaction],
"ci_95": result.conf_int().loc[interaction].tolist(),
"summary": result.summary(),
}
# --- Causal inference checklist ---
# 1. Validate parallel trends in pre-period before trusting DiD estimates.
# 2. Use HC3 robust standard errors to handle heteroskedasticity.
# 3. For panel data, cluster SEs at the unit level (add groups= param to fit).
# 4. Consider propensity score matching if groups differ at baseline.
# 5. Report the ATT with confidence interval, not just statistical significance.
```
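On a clean 2x2 design, the interaction coefficient estimated above reduces to a difference of group-mean differences, which makes a useful hand check on the OLS output. A minimal sketch with made-up numbers:

```python
from statistics import mean

def did_from_means(treat_pre, treat_post, ctrl_pre, ctrl_post):
    # ATT = (treated change, post minus pre) minus (control change)
    return (mean(treat_post) - mean(treat_pre)) - (mean(ctrl_post) - mean(ctrl_pre))

att = did_from_means(
    treat_pre=[10, 11, 9],    # treated group, before
    treat_post=[15, 16, 14],  # treated group, after
    ctrl_pre=[10, 10, 10],    # control group, before
    ctrl_post=[12, 12, 12],   # control group, after
)
print(att)  # 3.0: treated moved +5, control +2
```

If the regression's interaction term disagrees with this simple calculation on the same 2x2 data, the formula or data encoding is wrong.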
## Reference Documentation
### 1. Statistical Methods Advanced
Comprehensive guide available in `references/statistical_methods_advanced.md` covering:
- Advanced patterns and best practices
- Production implementation strategies
- Performance optimization techniques
- Scalability considerations
- Security and compliance
- Real-world case studies
### 2. Experiment Design Frameworks
Complete workflow documentation in `references/experiment_design_frameworks.md` including:
- Step-by-step processes
- Architecture design patterns
- Tool integration guides
- Performance tuning strategies
- Troubleshooting procedures
### 3. Feature Engineering Patterns
Technical reference guide in `references/feature_engineering_patterns.md` with:
- System design principles
- Implementation examples
- Configuration best practices
- Deployment strategies
- Monitoring and observability
## Production Patterns
### Pattern 1: Scalable Data Processing
Enterprise-scale data processing with distributed computing:
- Horizontal scaling architecture
- Fault-tolerant design
- Real-time and batch processing
- Data quality validation
- Performance monitoring
### Pattern 2: ML Model Deployment
Production ML system with high availability:
- Model serving with low latency
- A/B testing infrastructure
- Feature store integration
- Model monitoring and drift detection
- Automated retraining pipelines
### Pattern 3: Real-Time Inference
High-throughput inference system:
- Batching and caching strategies
- Load balancing
- Auto-scaling
- Latency optimization
- Cost optimization
## Best Practices
### Development
- Test-driven development
- Code reviews and pair programming
- Documentation as code
- Version control everything
- Continuous integration
### Production
- Monitor everything critical
- Automate deployments
- Feature flags for releases
- Canary deployments
- Comprehensive logging
### Team Leadership
- Mentor junior engineers
- Drive technical decisions
- Establish coding standards
- Foster learning culture
- Cross-functional collaboration
## Performance Targets
**Latency:**
- P50: < 50ms
- P95: < 100ms
- P99: < 200ms
**Throughput:**
- Requests/second: > 1000
- Concurrent users: > 10,000
**Availability:**
- Uptime: 99.9%
- Error rate: < 0.1%
## Security & Compliance
- Authentication & authorization
- Data encryption (at rest & in transit)
- PII handling and anonymization
- GDPR/CCPA compliance
- Regular security audits
- Vulnerability management
- **Statistical Methods:** `references/statistical_methods_advanced.md`
- **Experiment Design Frameworks:** `references/experiment_design_frameworks.md`
- **Feature Engineering Patterns:** `references/feature_engineering_patterns.md`
## Common Commands
```bash
# Development
python -m pytest tests/ -v --cov
python -m black src/
python -m pylint src/
# Testing & linting
python -m pytest tests/ -v --cov=src/
python -m black src/ && python -m pylint src/
# Training
# Training & evaluation
python scripts/train.py --config prod.yaml
python scripts/evaluate.py --model best.pth
@@ -179,48 +217,7 @@ docker build -t service:v1 .
kubectl apply -f k8s/
helm upgrade service ./charts/
# Monitoring
# Monitoring & health
kubectl logs -f deployment/service
python scripts/health_check.py
```
## Resources
- Advanced Patterns: `references/statistical_methods_advanced.md`
- Implementation Guide: `references/experiment_design_frameworks.md`
- Technical Reference: `references/feature_engineering_patterns.md`
- Automation Scripts: `scripts/` directory
## Senior-Level Responsibilities
As a world-class senior professional:
1. **Technical Leadership**
- Drive architectural decisions
- Mentor team members
- Establish best practices
- Ensure code quality
2. **Strategic Thinking**
- Align with business goals
- Evaluate trade-offs
- Plan for scale
- Manage technical debt
3. **Collaboration**
- Work across teams
- Communicate effectively
- Build consensus
- Share knowledge
4. **Innovation**
- Stay current with research
- Experiment with new approaches
- Contribute to community
- Drive continuous improvement
5. **Production Excellence**
- Ensure high availability
- Monitor proactively
- Optimize performance
- Respond to incidents


@@ -1,5 +1,5 @@
---
name: senior-devops
name: "senior-devops"
description: Comprehensive DevOps skill for CI/CD, infrastructure automation, containerization, and cloud platforms (AWS, GCP, Azure). Includes pipeline setup, infrastructure as code, deployment automation, and monitoring. Use when setting up pipelines, deploying applications, managing infrastructure, implementing monitoring, or optimizing deployment processes.
---
@@ -14,196 +14,262 @@ Complete toolkit for senior devops with modern tools and best practices.
This skill provides three core capabilities through automated scripts:
```bash
# Script 1: Pipeline Generator
python scripts/pipeline_generator.py [options]
# Script 1: Pipeline Generator — scaffolds CI/CD pipelines for GitHub Actions or CircleCI
python scripts/pipeline_generator.py ./app --platform=github --stages=build,test,deploy
# Script 2: Terraform Scaffolder
python scripts/terraform_scaffolder.py [options]
# Script 2: Terraform Scaffolder — generates and validates IaC modules for AWS/GCP/Azure
python scripts/terraform_scaffolder.py ./infra --provider=aws --module=ecs-service --verbose
# Script 3: Deployment Manager
python scripts/deployment_manager.py [options]
# Script 3: Deployment Manager — orchestrates container deployments with rollback support
python scripts/deployment_manager.py deploy --env=production --image=app:1.2.3 --strategy=blue-green
```
## Core Capabilities
### 1. Pipeline Generator
Automated tool for pipeline generator tasks.
Scaffolds CI/CD pipeline configurations for GitHub Actions or CircleCI, with stages for build, test, security scan, and deploy.
**Features:**
- Automated scaffolding
- Best practices built-in
- Configurable templates
- Quality checks
**Example — GitHub Actions workflow:**
```yaml
# .github/workflows/ci.yml
name: CI/CD Pipeline
on:
push:
branches: [main, develop]
pull_request:
branches: [main]
jobs:
build-and-test:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Set up Node.js
uses: actions/setup-node@v4
with:
node-version: '20'
cache: 'npm'
- run: npm ci
- run: npm run lint
- run: npm test -- --coverage
- name: Upload coverage
uses: codecov/codecov-action@v4
build-docker:
needs: build-and-test
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: Build and push image
uses: docker/build-push-action@v5
with:
push: ${{ github.ref == 'refs/heads/main' }}
tags: ghcr.io/${{ github.repository }}:${{ github.sha }}
deploy:
needs: build-docker
if: github.ref == 'refs/heads/main'
runs-on: ubuntu-latest
steps:
- name: Deploy to ECS
run: |
aws ecs update-service \
--cluster production \
--service app-service \
--force-new-deployment
```
**Usage:**
```bash
python scripts/pipeline_generator.py <project-path> [options]
python scripts/pipeline_generator.py <project-path> --platform=github|circleci --stages=build,test,deploy
```
### 2. Terraform Scaffolder
Comprehensive analysis and optimization tool.
Generates, validates, and plans Terraform modules. Enforces consistent module structure and runs `terraform validate` + `terraform plan` before any apply.
**Features:**
- Deep analysis
- Performance metrics
- Recommendations
- Automated fixes
**Example — AWS ECS service module:**
```hcl
# modules/ecs-service/main.tf
resource "aws_ecs_task_definition" "app" {
family = var.service_name
requires_compatibilities = ["FARGATE"]
network_mode = "awsvpc"
cpu = var.cpu
memory = var.memory
container_definitions = jsonencode([{
name = var.service_name
image = var.container_image
essential = true
portMappings = [{
containerPort = var.container_port
protocol = "tcp"
}]
environment = [for k, v in var.env_vars : { name = k, value = v }]
logConfiguration = {
logDriver = "awslogs"
options = {
awslogs-group = "/ecs/${var.service_name}"
awslogs-region = var.aws_region
awslogs-stream-prefix = "ecs"
}
}
}])
}
resource "aws_ecs_service" "app" {
name = var.service_name
cluster = var.cluster_id
task_definition = aws_ecs_task_definition.app.arn
desired_count = var.desired_count
launch_type = "FARGATE"
network_configuration {
subnets = var.private_subnet_ids
security_groups = [aws_security_group.app.id]
assign_public_ip = false
}
load_balancer {
target_group_arn = aws_lb_target_group.app.arn
container_name = var.service_name
container_port = var.container_port
}
}
```
**Usage:**
```bash
python scripts/terraform_scaffolder.py <target-path> [--verbose]
python scripts/terraform_scaffolder.py <target-path> --provider=aws|gcp|azure --module=ecs-service|gke-deployment|aks-service [--verbose]
```
### 3. Deployment Manager
Advanced tooling for specialized tasks.
Orchestrates deployments with blue/green or rolling strategies, health-check gates, and automatic rollback on failure.
**Features:**
- Expert-level automation
- Custom configurations
- Integration ready
- Production-grade output
**Example — Kubernetes blue/green deployment (blue-slot specific elements):**
```yaml
# k8s/deployment-blue.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: app-blue
labels:
app: myapp
slot: blue # slot label distinguishes blue from green
spec:
replicas: 3
selector:
matchLabels:
app: myapp
slot: blue
template:
metadata:
labels:
app: myapp
slot: blue
spec:
containers:
- name: app
image: ghcr.io/org/app:1.2.3
readinessProbe: # gate: pod must pass before traffic switches
httpGet:
path: /healthz
port: 8080
initialDelaySeconds: 10
periodSeconds: 5
resources:
requests:
cpu: "250m"
memory: "256Mi"
limits:
cpu: "500m"
memory: "512Mi"
```
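The `readinessProbe` above is the gate `deployment_manager.py` is described as waiting on before switching traffic. The poll-until-healthy logic can be sketched as follows (a hypothetical helper, not the actual script; in production the probe would issue an HTTP GET to `--health-check-url` and check for a 200):

```python
import time

def wait_until_healthy(probe, attempts=30, delay=2.0):
    """Poll a health probe callable; return True as soon as it passes."""
    for _ in range(attempts):
        try:
            if probe():
                return True
        except Exception:
            pass  # treat probe errors as "not ready yet"
        time.sleep(delay)
    return False

# A stub probe that passes on the third check, standing in for a real HTTP call.
checks = iter([False, False, True])
print(wait_until_healthy(lambda: next(checks), attempts=5, delay=0))  # True
```

Only when this gate returns True should the service selector be patched to the new slot; a False result triggers rollback instead.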
**Usage:**
```bash
python scripts/deployment_manager.py [arguments] [options]
python scripts/deployment_manager.py deploy \
--env=staging|production \
--image=app:1.2.3 \
--strategy=blue-green|rolling \
--health-check-url=https://app.example.com/healthz
python scripts/deployment_manager.py rollback --env=production --to-version=1.2.2
python scripts/deployment_manager.py --analyze --env=production # audit current state
```
## Reference Documentation
## Resources
### Cicd Pipeline Guide
Comprehensive guide available in `references/cicd_pipeline_guide.md`:
- Detailed patterns and practices
- Code examples
- Best practices
- Anti-patterns to avoid
- Real-world scenarios
### Infrastructure As Code
Complete workflow documentation in `references/infrastructure_as_code.md`:
- Step-by-step processes
- Optimization strategies
- Tool integrations
- Performance tuning
- Troubleshooting guide
### Deployment Strategies
Technical reference guide in `references/deployment_strategies.md`:
- Technology stack details
- Configuration examples
- Integration patterns
- Security considerations
- Scalability guidelines
## Tech Stack
**Languages:** TypeScript, JavaScript, Python, Go, Swift, Kotlin
**Frontend:** React, Next.js, React Native, Flutter
**Backend:** Node.js, Express, GraphQL, REST APIs
**Database:** PostgreSQL, Prisma, NeonDB, Supabase
**DevOps:** Docker, Kubernetes, Terraform, GitHub Actions, CircleCI
**Cloud:** AWS, GCP, Azure
- Pattern Reference: `references/cicd_pipeline_guide.md` — detailed CI/CD patterns, best practices, anti-patterns
- Workflow Guide: `references/infrastructure_as_code.md` — IaC step-by-step processes, optimization, troubleshooting
- Technical Guide: `references/deployment_strategies.md` — deployment strategy configs, security considerations, scalability
- Tool Scripts: `scripts/` directory
## Development Workflow
### 1. Setup and Configuration
### 1. Infrastructure Changes (Terraform)
```bash
# Install dependencies
npm install
# or
pip install -r requirements.txt
# Scaffold or update module
python scripts/terraform_scaffolder.py ./infra --provider=aws --module=ecs-service --verbose
# Configure environment
cp .env.example .env
# Validate and plan — review diff before applying
terraform -chdir=infra init
terraform -chdir=infra validate
terraform -chdir=infra plan -out=tfplan
# Apply only after plan review
terraform -chdir=infra apply tfplan
# Verify resources are healthy
aws ecs describe-services --cluster production --services app-service \
--query 'services[0].{Status:status,Running:runningCount,Desired:desiredCount}'
```
### 2. Run Quality Checks
### 2. Application Deployment
```bash
# Use the analyzer script
python scripts/terraform_scaffolder.py .
# Generate or update pipeline config
python scripts/pipeline_generator.py . --platform=github --stages=build,test,security,deploy
# Review recommendations
# Apply fixes
# Build and tag image
docker build -t ghcr.io/org/app:$(git rev-parse --short HEAD) .
docker push ghcr.io/org/app:$(git rev-parse --short HEAD)
# Deploy with health-check gate
python scripts/deployment_manager.py deploy \
--env=production \
--image=app:$(git rev-parse --short HEAD) \
--strategy=blue-green \
--health-check-url=https://app.example.com/healthz
# Verify pods are running
kubectl get pods -n production -l app=myapp
kubectl rollout status deployment/app-blue -n production
# Switch traffic after verification
kubectl patch service app-svc -n production \
-p '{"spec":{"selector":{"slot":"blue"}}}'
```
### 3. Implement Best Practices
Follow the patterns and practices documented in:
- `references/cicd_pipeline_guide.md`
- `references/infrastructure_as_code.md`
- `references/deployment_strategies.md`
## Best Practices Summary
### Code Quality
- Follow established patterns
- Write comprehensive tests
- Document decisions
- Review regularly
### Performance
- Measure before optimizing
- Use appropriate caching
- Optimize critical paths
- Monitor in production
### Security
- Validate all inputs
- Use parameterized queries
- Implement proper authentication
- Keep dependencies updated
### Maintainability
- Write clear code
- Use consistent naming
- Add helpful comments
- Keep it simple
## Common Commands
### 3. Rollback Procedure
```bash
# Development
npm run dev
npm run build
npm run test
npm run lint
# Immediate rollback via deployment manager
python scripts/deployment_manager.py rollback --env=production --to-version=1.2.2
# Analysis
python scripts/terraform_scaffolder.py .
python scripts/deployment_manager.py --analyze
# Or via kubectl
kubectl rollout undo deployment/app -n production
kubectl rollout status deployment/app -n production
# Deployment
docker build -t app:latest .
docker-compose up -d
kubectl apply -f k8s/
# Verify rollback succeeded
kubectl get pods -n production -l app=myapp
curl -sf https://app.example.com/healthz || echo "ROLLBACK FAILED — escalate"
```
## Troubleshooting
### Common Issues
Check the comprehensive troubleshooting section in `references/deployment_strategies.md`.
### Getting Help
- Review reference documentation
- Check script output messages
- Consult tech stack documentation
- Review error logs
## Resources
- Pattern Reference: `references/cicd_pipeline_guide.md`
- Workflow Guide: `references/infrastructure_as_code.md`
- Technical Guide: `references/deployment_strategies.md`
- Tool Scripts: `scripts/` directory


@@ -1,5 +1,5 @@
---
name: senior-frontend
name: "senior-frontend"
description: Frontend development skill for React, Next.js, TypeScript, and Tailwind CSS applications. Use when building React components, optimizing Next.js performance, analyzing bundle sizes, scaffolding frontend projects, implementing accessibility, or reviewing frontend code quality.
---
@@ -422,7 +422,7 @@ test('dialog is accessible', async () => {
// next.config.js
const nextConfig = {
images: {
remotePatterns: [{ hostname: 'cdn.example.com' }],
remotePatterns: [{ hostname: 'cdn.example.com' }],
formats: ['image/avif', 'image/webp'],
},
experimental: {


@@ -1,5 +1,5 @@
---
name: senior-fullstack
name: "senior-fullstack"
description: Fullstack development toolkit with project scaffolding for Next.js, FastAPI, MERN, and Django stacks, code quality analysis with security and complexity scoring, and stack selection guidance. Use when the user asks to "scaffold a new project", "create a Next.js app", "set up FastAPI with React", "analyze code quality", "audit my codebase", "what stack should I use", "generate project boilerplate", or mentions fullstack development, project setup, or tech stack comparison.
---


@@ -1,6 +1,6 @@
---
name: senior-ml-engineer
description: ML engineering skill for productionizing models, building MLOps pipelines, and integrating LLMs. Covers model deployment, feature stores, drift monitoring, RAG systems, and cost optimization.
name: "senior-ml-engineer"
description: ML engineering skill for productionizing models, building MLOps pipelines, and integrating LLMs. Covers model deployment, feature stores, drift monitoring, RAG systems, and cost optimization. Use when the user asks about deploying ML models to production, setting up MLOps infrastructure (MLflow, Kubeflow, Kubernetes, Docker), monitoring model performance or drift, building RAG pipelines, or integrating LLM APIs with retry logic and cost controls. Focused on production and operational concerns rather than model research or initial training.
triggers:
- MLOps pipeline
- model deployment


@@ -1,5 +1,5 @@
---
name: senior-prompt-engineer
name: "senior-prompt-engineer"
description: This skill should be used when the user asks to "optimize prompts", "design prompt templates", "evaluate LLM outputs", "build agentic systems", "implement RAG", "create few-shot examples", "analyze token usage", or "design AI workflows". Use for prompt engineering patterns, LLM evaluation frameworks, agent architectures, and structured output design.
---


@@ -1,26 +1,12 @@
---
name: senior-qa
description: This skill should be used when the user asks to "generate tests", "write unit tests", "analyze test coverage", "scaffold E2E tests", "set up Playwright", "configure Jest", "implement testing patterns", or "improve test quality". Use for React/Next.js testing with Jest, React Testing Library, and Playwright.
name: "senior-qa"
description: Generates unit tests, integration tests, and E2E tests for React/Next.js applications. Scans components to create Jest + React Testing Library test stubs, analyzes Istanbul/LCOV coverage reports to surface gaps, scaffolds Playwright test files from Next.js routes, mocks API calls with MSW, creates test fixtures, and configures test runners. Use when the user asks to "generate tests", "write unit tests", "analyze test coverage", "scaffold E2E tests", "set up Playwright", "configure Jest", "implement testing patterns", or "improve test quality".
---
# Senior QA Engineer
Test automation, coverage analysis, and quality assurance patterns for React and Next.js applications.
## Table of Contents
- [Quick Start](#quick-start)
- [Tools Overview](#tools-overview)
- [Test Suite Generator](#1-test-suite-generator)
- [Coverage Analyzer](#2-coverage-analyzer)
- [E2E Test Scaffolder](#3-e2e-test-scaffolder)
- [QA Workflows](#qa-workflows)
- [Unit Test Generation Workflow](#unit-test-generation-workflow)
- [Coverage Analysis Workflow](#coverage-analysis-workflow)
- [E2E Test Setup Workflow](#e2e-test-setup-workflow)
- [Reference Documentation](#reference-documentation)
- [Common Patterns Quick Reference](#common-patterns-quick-reference)
---
## Quick Start
@@ -52,18 +38,6 @@ Scans React/TypeScript components and generates Jest + React Testing Library tes
# Basic usage - scan components and generate tests
python scripts/test_suite_generator.py src/components/ --output __tests__/
# Output:
# Scanning: src/components/
# Found 24 React components
#
# Generated tests:
# __tests__/Button.test.tsx (render, click handler, disabled state)
# __tests__/Modal.test.tsx (render, open/close, keyboard events)
# __tests__/Form.test.tsx (render, validation, submission)
# ...
#
# Summary: 24 test files, 87 test cases
# Include accessibility tests
python scripts/test_suite_generator.py src/ --output __tests__/ --include-a11y
@@ -91,29 +65,6 @@ Parses Jest/Istanbul coverage reports and identifies gaps, uncovered branches, a
# Analyze coverage report
python scripts/coverage_analyzer.py coverage/coverage-final.json
# Output:
# === Coverage Analysis Report ===
# Overall: 72.4% (target: 80%)
#
# BY TYPE:
# Statements: 74.2%
# Branches: 68.1%
# Functions: 71.8%
# Lines: 73.5%
#
# CRITICAL GAPS (uncovered business logic):
# src/services/payment.ts:45-67 - Payment processing
# src/hooks/useAuth.ts:23-41 - Authentication flow
#
# RECOMMENDATIONS:
# 1. Add tests for payment service error handling
# 2. Cover authentication edge cases
# 3. Test form validation branches
#
# Files below threshold (80%):
# src/components/Checkout.tsx: 45%
# src/services/api.ts: 62%
# Enforce threshold (exit 1 if below)
python scripts/coverage_analyzer.py coverage/ --threshold 80 --strict
@@ -135,21 +86,6 @@ Scans Next.js pages/app directory and generates Playwright test files with commo
# Scaffold E2E tests for Next.js App Router
python scripts/e2e_test_scaffolder.py src/app/ --output e2e/
# Output:
# Scanning: src/app/
# Found 12 routes
#
# Generated E2E tests:
# e2e/home.spec.ts (navigation, hero section)
# e2e/auth/login.spec.ts (form submission, validation)
# e2e/auth/register.spec.ts (registration flow)
# e2e/dashboard.spec.ts (authenticated routes)
# e2e/products/[id].spec.ts (dynamic routes)
# ...
#
# Generated: playwright.config.ts
# Generated: e2e/fixtures/auth.ts
# Include Page Object Model classes
python scripts/e2e_test_scaffolder.py src/app/ --output e2e/ --include-pom
@@ -184,7 +120,7 @@ import { Button } from '../src/components/Button';
describe('Button', () => {
it('renders with label', () => {
render(<Button>Click me</Button>);
expect(screen.getByRole('button', { name: /click me/i })).toBeInTheDocument();
});
it('calls onClick when clicked', () => {
@@ -278,12 +214,12 @@ npx playwright show-report
**Step 5: Add to CI pipeline**
```yaml
# .github/workflows/e2e.yml
- name: Run E2E tests
- name: "run-e2e-tests"
run: npx playwright test
- name: Upload report
- name: "upload-report"
uses: actions/upload-artifact@v3
with:
name: playwright-report
name: "playwright-report"
path: playwright-report/
```
@@ -305,7 +241,7 @@ npx playwright show-report
```typescript
// Preferred (accessible)
screen.getByRole('button', { name: /submit/i })
screen.getByLabelText(/email/i)
screen.getByPlaceholderText(/search/i)
@@ -336,7 +272,7 @@ import { setupServer } from 'msw/node';
const server = setupServer(
rest.get('/api/users', (req, res, ctx) => {
return res(ctx.json([{ id: 1, name: 'John' }]));
return res(ctx.json([{ id: 1, name: "john" }]));
})
);
@@ -349,7 +285,7 @@ afterAll(() => server.close());
```typescript
// Preferred
page.getByRole('button', { name: 'Submit' })
page.getByRole('button', { name: "submit" })
page.getByLabel('Email')
page.getByText('Welcome')


@@ -1,6 +1,6 @@
---
name: senior-secops
description: Comprehensive SecOps skill for application security, vulnerability management, compliance, and secure development practices. Includes security scanning, vulnerability assessment, compliance checking, and security automation. Use when implementing security controls, conducting security audits, responding to vulnerabilities, or ensuring compliance requirements.
name: "senior-secops"
description: Senior SecOps engineer skill for application security, vulnerability management, compliance verification, and secure development practices. Runs SAST/DAST scans, generates CVE remediation plans, checks dependency vulnerabilities, creates security policies, enforces secure coding patterns, and automates compliance checks against SOC2, PCI-DSS, HIPAA, and GDPR. Use when conducting a security review or audit, responding to a CVE or security incident, hardening infrastructure, implementing authentication or secrets management, preparing for a penetration test, checking OWASP Top 10 exposure, or enforcing security controls in CI/CD pipelines.
---
# Senior SecOps Engineer
@@ -11,7 +11,6 @@ Complete toolkit for Security Operations including vulnerability management, com
## Table of Contents
- [Trigger Terms](#trigger-terms)
- [Core Capabilities](#core-capabilities)
- [Workflows](#workflows)
- [Tool Reference](#tool-reference)
@@ -21,27 +20,6 @@ Complete toolkit for Security Operations including vulnerability management, com
---
## Trigger Terms
Use this skill when you encounter:
| Category | Terms |
|----------|-------|
| **Vulnerability Management** | CVE, CVSS, vulnerability scan, security patch, dependency audit, npm audit, pip-audit |
| **OWASP Top 10** | injection, XSS, CSRF, broken authentication, security misconfiguration, sensitive data exposure |
| **Compliance** | SOC 2, PCI-DSS, HIPAA, GDPR, compliance audit, security controls, access control |
| **Secure Coding** | input validation, output encoding, parameterized queries, prepared statements, sanitization |
| **Secrets Management** | API key, secrets vault, environment variables, HashiCorp Vault, AWS Secrets Manager |
| **Authentication** | JWT, OAuth, MFA, 2FA, TOTP, password hashing, bcrypt, argon2, session management |
| **Security Testing** | SAST, DAST, penetration test, security scan, Snyk, Semgrep, CodeQL, Trivy |
| **Incident Response** | security incident, breach notification, incident response, forensics, containment |
| **Network Security** | TLS, HTTPS, HSTS, CSP, CORS, security headers, firewall rules, WAF |
| **Infrastructure Security** | container security, Kubernetes security, IAM, least privilege, zero trust |
| **Cryptography** | encryption at rest, encryption in transit, AES-256, RSA, key management, KMS |
| **Monitoring** | security monitoring, SIEM, audit logging, intrusion detection, anomaly detection |
---
## Core Capabilities
### 1. Security Scanner
@@ -129,14 +107,23 @@ Complete security assessment of a codebase.
```bash
# Step 1: Scan for code vulnerabilities
python scripts/security_scanner.py . --severity medium
# STOP if exit code 2 — resolve critical findings before continuing
```
```bash
# Step 2: Check dependency vulnerabilities
python scripts/vulnerability_assessor.py . --severity high
# STOP if exit code 2 — patch critical CVEs before continuing
```
```bash
# Step 3: Verify compliance controls
python scripts/compliance_checker.py . --framework all
# STOP if exit code 2 — address critical gaps before proceeding
```
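The STOP gates in steps 1-3 can also be enforced programmatically. A minimal runner sketch; the script paths and the convention that exit code 2 means critical findings are taken from this document, so adjust as needed:

```python
import subprocess
import sys

# Scan steps from the workflow above; exit code 2 is assumed to mean
# "critical findings" per the documented exit-code convention.
STEPS = [
    ["python", "scripts/security_scanner.py", ".", "--severity", "medium"],
    ["python", "scripts/vulnerability_assessor.py", ".", "--severity", "high"],
    ["python", "scripts/compliance_checker.py", ".", "--framework", "all"],
]

def run_gated(steps=STEPS) -> None:
    """Run each scan in order; abort the whole run on a critical finding."""
    for cmd in steps:
        code = subprocess.run(cmd).returncode
        if code == 2:  # STOP: critical findings, do not continue
            sys.exit(f"critical findings from {cmd[1]}; stopping")
```

The same gate works in CI: a nonzero exit from `sys.exit(...)` fails the job before any deployment step runs.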
# Step 4: Generate combined report
```bash
# Step 4: Generate combined reports
python scripts/security_scanner.py . --json --output security.json
python scripts/vulnerability_assessor.py . --json --output vulns.json
python scripts/compliance_checker.py . --json --output compliance.json
@@ -148,7 +135,7 @@ Integrate security checks into deployment pipeline.
```yaml
# .github/workflows/security.yml
name: Security Scan
name: "security-scan"
on:
pull_request:
@@ -160,21 +147,23 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Set up Python
- name: "set-up-python"
uses: actions/setup-python@v5
with:
python-version: '3.11'
- name: Security Scanner
- name: "security-scanner"
run: python scripts/security_scanner.py . --severity high
- name: Vulnerability Assessment
- name: "vulnerability-assessment"
run: python scripts/vulnerability_assessor.py . --severity critical
- name: Compliance Check
- name: "compliance-check"
run: python scripts/compliance_checker.py . --framework soc2
```
Each step fails the pipeline on its respective exit code — no deployment proceeds past a critical finding.
### Workflow 3: CVE Triage
Respond to a new CVE affecting your application.
@@ -184,6 +173,7 @@ Respond to a new CVE affecting your application.
- Identify affected systems using vulnerability_assessor.py
- Check if CVE is being actively exploited
- Determine CVSS environmental score for your context
- STOP if CVSS 9.0+ on internet-facing system — escalate immediately
2. PRIORITIZE
- Critical (CVSS 9.0+, internet-facing): 24 hours
@@ -193,7 +183,8 @@ Respond to a new CVE affecting your application.
3. REMEDIATE
- Update affected dependency to fixed version
- Run security_scanner.py to verify fix
- Run security_scanner.py to verify fix (must return exit code 0)
- STOP if scanner still flags the CVE — do not deploy
- Test for regressions
- Deploy with enhanced monitoring
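The step-1 escalation rule reduces to a one-line predicate. A sketch, using exactly the threshold stated above (CVSS 9.0+ on an internet-facing system):

```python
def must_escalate(cvss: float, internet_facing: bool) -> bool:
    """Escalate immediately for CVSS 9.0 or higher on an internet-facing system."""
    return cvss >= 9.0 and internet_facing

# A 9.1 on an internet-facing system escalates; the same score internal-only
# falls through to normal prioritization.
assert must_escalate(9.1, True)
assert not must_escalate(9.1, False)
```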
@@ -223,7 +214,7 @@ PHASE 2: CONTAIN (15-60 min)
PHASE 3: ERADICATE (1-4 hours)
- Root cause identified
- Malware/backdoors removed
- Vulnerabilities patched (run security_scanner.py)
- Vulnerabilities patched (run security_scanner.py; must return exit code 0)
- Systems hardened
PHASE 4: RECOVER (4-24 hours)
@@ -254,10 +245,7 @@ PHASE 5: POST-INCIDENT (24-72 hours)
| `--json` | Output results as JSON |
| `--output, -o` | Write results to file |
**Exit Codes:**
- `0`: No critical/high findings
- `1`: High severity findings
- `2`: Critical severity findings
**Exit Codes:** `0` = no critical/high findings · `1` = high severity findings · `2` = critical severity findings
### vulnerability_assessor.py
@@ -269,10 +257,7 @@ PHASE 5: POST-INCIDENT (24-72 hours)
| `--json` | Output results as JSON |
| `--output, -o` | Write results to file |
**Exit Codes:**
- `0`: No critical/high vulnerabilities
- `1`: High severity vulnerabilities
- `2`: Critical severity vulnerabilities
**Exit Codes:** `0` = no critical/high vulnerabilities · `1` = high severity vulnerabilities · `2` = critical severity vulnerabilities
### compliance_checker.py
@@ -284,29 +269,13 @@ PHASE 5: POST-INCIDENT (24-72 hours)
| `--json` | Output results as JSON |
| `--output, -o` | Write results to file |
**Exit Codes:**
- `0`: Compliant (90%+ score)
- `1`: Non-compliant (50-69% score)
- `2`: Critical gaps (<50% score)
**Exit Codes:** `0` = compliant (90%+ score) · `1` = non-compliant (50-69% score) · `2` = critical gaps (<50% score)
---
## Security Standards
### OWASP Top 10 Prevention
| Vulnerability | Prevention |
|--------------|------------|
| **A01: Broken Access Control** | Implement RBAC, deny by default, validate permissions server-side |
| **A02: Cryptographic Failures** | Use TLS 1.2+, AES-256 encryption, secure key management |
| **A03: Injection** | Parameterized queries, input validation, escape output |
| **A04: Insecure Design** | Threat modeling, secure design patterns, defense in depth |
| **A05: Security Misconfiguration** | Hardening guides, remove defaults, disable unused features |
| **A06: Vulnerable Components** | Dependency scanning, automated updates, SBOM |
| **A07: Authentication Failures** | MFA, rate limiting, secure password storage |
| **A08: Data Integrity Failures** | Code signing, integrity checks, secure CI/CD |
| **A09: Security Logging Failures** | Comprehensive audit logs, SIEM integration, alerting |
| **A10: SSRF** | URL validation, allowlist destinations, network segmentation |
See `references/security_standards.md` for OWASP Top 10 full guidance, secure coding standards, authentication requirements, and API security controls.
### Secure Coding Checklist
@@ -346,47 +315,28 @@ PHASE 5: POST-INCIDENT (24-72 hours)
## Compliance Frameworks
### SOC 2 Type II Controls
See `references/compliance_requirements.md` for full control mappings. Run `compliance_checker.py` to verify the controls below:
| Control | Category | Description |
|---------|----------|-------------|
| CC1 | Control Environment | Security policies, org structure |
| CC2 | Communication | Security awareness, documentation |
| CC3 | Risk Assessment | Vulnerability scanning, threat modeling |
| CC6 | Logical Access | Authentication, authorization, MFA |
| CC7 | System Operations | Monitoring, logging, incident response |
| CC8 | Change Management | CI/CD, code review, deployment controls |
### SOC 2 Type II
- **CC6** Logical Access: authentication, authorization, MFA
- **CC7** System Operations: monitoring, logging, incident response
- **CC8** Change Management: CI/CD, code review, deployment controls
### PCI-DSS v4.0 Requirements
| Requirement | Description |
|-------------|-------------|
| Req 3 | Protect stored cardholder data (encryption at rest) |
| Req 4 | Encrypt transmission (TLS 1.2+) |
| Req 6 | Secure development (input validation, secure coding) |
| Req 8 | Strong authentication (MFA, password policy) |
| Req 10 | Audit logging (all access to cardholder data) |
| Req 11 | Security testing (SAST, DAST, penetration testing) |
### PCI-DSS v4.0
- **Req 3/4**: Encryption at rest and in transit (TLS 1.2+)
- **Req 6**: Secure development (input validation, secure coding)
- **Req 8**: Strong authentication (MFA, password policy)
- **Req 10/11**: Audit logging, SAST/DAST/penetration testing
### HIPAA Security Rule
- Unique user IDs and audit trails for PHI access (164.312(a)(1), 164.312(b))
- MFA for person/entity authentication (164.312(d))
- Transmission encryption via TLS (164.312(e)(1))
| Safeguard | Requirement |
|-----------|-------------|
| 164.312(a)(1) | Unique user identification for PHI access |
| 164.312(b) | Audit trails for PHI access |
| 164.312(c)(1) | Data integrity controls |
| 164.312(d) | Person/entity authentication (MFA) |
| 164.312(e)(1) | Transmission encryption (TLS) |
### GDPR Requirements
| Article | Requirement |
|---------|-------------|
| Art 25 | Privacy by design, data minimization |
| Art 32 | Security measures, encryption, pseudonymization |
| Art 33 | Breach notification (72 hours) |
| Art 17 | Right to erasure (data deletion) |
| Art 20 | Data portability (export capability) |
### GDPR
- **Art 25/32**: Privacy by design, encryption, pseudonymization
- **Art 33**: Breach notification within 72 hours
- **Art 17/20**: Right to erasure and data portability
---
@@ -469,37 +419,4 @@ app.use((req, res, next) => {
|----------|-------------|
| `references/security_standards.md` | OWASP Top 10, secure coding, authentication, API security |
| `references/vulnerability_management_guide.md` | CVE triage, CVSS scoring, remediation workflows |
| `references/compliance_requirements.md` | SOC 2, PCI-DSS, HIPAA, GDPR requirements |
---
## Tech Stack
**Security Scanning:**
- Snyk (dependency scanning)
- Semgrep (SAST)
- CodeQL (code analysis)
- Trivy (container scanning)
- OWASP ZAP (DAST)
**Secrets Management:**
- HashiCorp Vault
- AWS Secrets Manager
- Azure Key Vault
- 1Password Secrets Automation
**Authentication:**
- bcrypt, argon2 (password hashing)
- jsonwebtoken (JWT)
- passport.js (authentication middleware)
- speakeasy (TOTP/MFA)
**Logging & Monitoring:**
- Winston, Pino (Node.js logging)
- Datadog, Splunk (SIEM)
- PagerDuty (alerting)
**Compliance:**
- Vanta (SOC 2 automation)
- Drata (compliance management)
- AWS Config (configuration compliance)
| `references/compliance_requirements.md` | SOC 2, PCI-DSS, HIPAA, GDPR full control mappings |


@@ -1,6 +1,6 @@
---
name: senior-security
description: Security engineering toolkit for threat modeling, vulnerability analysis, secure architecture, and penetration testing. Includes STRIDE analysis, OWASP guidance, cryptography patterns, and security scanning tools.
name: "senior-security"
description: Security engineering toolkit for threat modeling, vulnerability analysis, secure architecture, and penetration testing. Includes STRIDE analysis, OWASP guidance, cryptography patterns, and security scanning tools. Use when the user asks about security reviews, threat analysis, vulnerability assessments, secure coding practices, security audits, attack surface analysis, CVE remediation, or security best practices.
triggers:
- security architecture
- threat modeling
@@ -49,13 +49,7 @@ Identify and analyze security threats using STRIDE methodology.
- Processes (application components)
- Data stores (databases, caches)
- Data flows (APIs, network connections)
3. Apply STRIDE to each DFD element:
- Spoofing: Can identity be faked?
- Tampering: Can data be modified?
- Repudiation: Can actions be denied?
- Information Disclosure: Can data leak?
- Denial of Service: Can availability be affected?
- Elevation of Privilege: Can access be escalated?
3. Apply STRIDE to each DFD element (see [STRIDE per Element Matrix](#stride-per-element-matrix) below)
4. Score risks using DREAD:
- Damage potential (1-10)
- Reproducibility (1-10)
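A common way to combine the five DREAD ratings is a simple mean; the banding thresholds below are illustrative assumptions, not values taken from threat_modeler.py:

```python
def dread_score(damage, reproducibility, exploitability,
                affected_users, discoverability) -> float:
    """Mean of the five DREAD ratings, each on a 1-10 scale."""
    ratings = [damage, reproducibility, exploitability,
               affected_users, discoverability]
    if not all(1 <= r <= 10 for r in ratings):
        raise ValueError("each DREAD rating must be between 1 and 10")
    return sum(ratings) / len(ratings)

score = dread_score(8, 6, 7, 9, 5)  # 7.0
# Illustrative banding (assumption, not the tool's output):
band = "Critical" if score >= 8 else "High" if score >= 6 else "Medium"
```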
@@ -69,14 +63,14 @@ Identify and analyze security threats using STRIDE methodology.
### STRIDE Threat Categories
| Category | Description | Security Property | Mitigation Focus |
|----------|-------------|-------------------|------------------|
| Spoofing | Impersonating users or systems | Authentication | MFA, certificates, strong auth |
| Tampering | Modifying data or code | Integrity | Signing, checksums, validation |
| Repudiation | Denying actions | Non-repudiation | Audit logs, digital signatures |
| Information Disclosure | Exposing data | Confidentiality | Encryption, access controls |
| Denial of Service | Disrupting availability | Availability | Rate limiting, redundancy |
| Elevation of Privilege | Gaining unauthorized access | Authorization | RBAC, least privilege |
| Category | Security Property | Mitigation Focus |
|----------|-------------------|------------------|
| Spoofing | Authentication | MFA, certificates, strong auth |
| Tampering | Integrity | Signing, checksums, validation |
| Repudiation | Non-repudiation | Audit logs, digital signatures |
| Information Disclosure | Confidentiality | Encryption, access controls |
| Denial of Service | Availability | Rate limiting, redundancy |
| Elevation of Privilege | Authorization | RBAC, least privilege |
### STRIDE per Element Matrix
@@ -195,24 +189,11 @@ Identify and remediate security vulnerabilities in applications.
7. Verify fixes and document
8. **Validation:** Scope defined; automated and manual testing complete; findings classified; remediation tracked
### OWASP Top 10 Mapping
| Rank | Vulnerability | Testing Approach |
|------|---------------|------------------|
| A01 | Broken Access Control | Manual IDOR testing, authorization checks |
| A02 | Cryptographic Failures | Algorithm review, key management audit |
| A03 | Injection | SAST + manual payload testing |
| A04 | Insecure Design | Threat modeling, architecture review |
| A05 | Security Misconfiguration | Configuration audit, CIS benchmarks |
| A06 | Vulnerable Components | Dependency scanning, CVE monitoring |
| A07 | Authentication Failures | Password policy, session management review |
| A08 | Software/Data Integrity | CI/CD security, code signing verification |
| A09 | Logging Failures | Log review, SIEM configuration check |
| A10 | SSRF | Manual URL manipulation testing |
For OWASP Top 10 vulnerability descriptions and testing guidance, refer to [owasp.org/Top10](https://owasp.org/Top10).
### Vulnerability Severity Matrix
| Impact / Exploitability | Easy | Moderate | Difficult |
| Impact \ Exploitability | Easy | Moderate | Difficult |
|-------------------------|------|----------|-----------|
| Critical | Critical | Critical | High |
| High | Critical | High | Medium |
@@ -280,6 +261,55 @@ Review code for security vulnerabilities before deployment.
| MD5/SHA1 for passwords | Weak hashing | Use Argon2id or bcrypt |
| Math.random for tokens | Predictable values | Use crypto.getRandomValues |
### Inline Code Examples
**SQL Injection — insecure vs. secure (Python):**
```python
# ❌ Insecure: string formatting allows SQL injection
query = f"SELECT * FROM users WHERE username = '{username}'"
cursor.execute(query)
# ✅ Secure: parameterized query — user input never interpreted as SQL
query = "SELECT * FROM users WHERE username = %s"
cursor.execute(query, (username,))
```
**Password Hashing with Argon2id (Python):**
```python
from argon2 import PasswordHasher
ph = PasswordHasher() # uses secure defaults (time_cost, memory_cost)
# On registration
hashed = ph.hash(plain_password)
# On login — raises argon2.exceptions.VerifyMismatchError on failure
ph.verify(hashed, plain_password)
```
**Secret Scanning — core pattern matching (Python):**
```python
import re, pathlib
SECRET_PATTERNS = {
"aws_access_key": re.compile(r"AKIA[0-9A-Z]{16}"),
"github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
"private_key": re.compile(r"-----BEGIN (RSA |EC )?PRIVATE KEY-----"),
"generic_secret": re.compile(r'(?i)(password|secret|api_key)\s*=\s*["\']?\S{8,}'),
}
def scan_file(path: pathlib.Path) -> list[dict]:
findings = []
for lineno, line in enumerate(path.read_text(errors="replace").splitlines(), 1):
for name, pattern in SECRET_PATTERNS.items():
if pattern.search(line):
findings.append({"file": str(path), "line": lineno, "type": name})
return findings
```
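As a quick sanity check, the `generic_secret` pattern above can be exercised against a fabricated sample line (the key value is invented for illustration):

```python
import re

# generic_secret pattern from SECRET_PATTERNS above
generic_secret = re.compile(r'(?i)(password|secret|api_key)\s*=\s*["\']?\S{8,}')

sample = 'API_KEY = "sk_live_abcdef123456"'  # fabricated example value
print("finding" if generic_secret.search(sample) else "clean")  # prints "finding"
```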
---
## Incident Response Workflow
@@ -317,12 +347,12 @@ Respond to and contain security incidents.
### Incident Severity Levels
| Level | Description | Response Time | Escalation |
|-------|-------------|---------------|------------|
| P1 - Critical | Active breach, data exfiltration | Immediate | CISO, Legal, Executive |
| P2 - High | Confirmed compromise, contained | 1 hour | Security Lead, IT Director |
| P3 - Medium | Potential compromise, under investigation | 4 hours | Security Team |
| P4 - Low | Suspicious activity, low impact | 24 hours | On-call engineer |
| Level | Response Time | Escalation |
|-------|---------------|------------|
| P1 - Critical (active breach/exfiltration) | Immediate | CISO, Legal, Executive |
| P2 - High (confirmed, contained) | 1 hour | Security Lead, IT Director |
| P3 - Medium (potential, under investigation) | 4 hours | Security Team |
| P4 - Low (suspicious, low impact) | 24 hours | On-call engineer |
### Incident Response Checklist
@@ -370,24 +400,12 @@ See: [references/cryptography-implementation.md](references/cryptography-impleme
### Scripts
| Script | Purpose | Usage |
|--------|---------|-------|
| [threat_modeler.py](scripts/threat_modeler.py) | STRIDE threat analysis with risk scoring | `python threat_modeler.py --component "Authentication"` |
| [secret_scanner.py](scripts/secret_scanner.py) | Detect hardcoded secrets and credentials | `python secret_scanner.py /path/to/project` |
| Script | Purpose |
|--------|---------|
| [threat_modeler.py](scripts/threat_modeler.py) | STRIDE threat analysis with DREAD risk scoring; JSON and text output; interactive guided mode |
| [secret_scanner.py](scripts/secret_scanner.py) | Detect hardcoded secrets and credentials across 20+ patterns; CI/CD integration ready |
**Threat Modeler Features:**
- STRIDE analysis for any system component
- DREAD risk scoring
- Mitigation recommendations
- JSON and text output formats
- Interactive mode for guided analysis
**Secret Scanner Features:**
- Detects AWS, GCP, Azure credentials
- Finds API keys and tokens (GitHub, Slack, Stripe)
- Identifies private keys and passwords
- Supports 20+ secret patterns
- CI/CD integration ready
For usage, see the inline code examples in [Secure Code Review Workflow](#inline-code-examples) and the script source files directly.
### References
@@ -401,17 +419,6 @@ See: [references/cryptography-implementation.md](references/cryptography-impleme
## Security Standards Reference
### Compliance Frameworks
| Framework | Focus | Applicable To |
|-----------|-------|---------------|
| OWASP ASVS | Application security | Web applications |
| CIS Benchmarks | System hardening | Servers, containers, cloud |
| NIST CSF | Risk management | Enterprise security programs |
| PCI-DSS | Payment card data | Payment processing |
| HIPAA | Healthcare data | Healthcare applications |
| SOC 2 | Service organization controls | SaaS providers |
### Security Headers Checklist
| Header | Recommended Value |
@@ -423,6 +430,8 @@ See: [references/cryptography-implementation.md](references/cryptography-impleme
| Referrer-Policy | strict-origin-when-cross-origin |
| Permissions-Policy | geolocation=(), microphone=(), camera=() |
For compliance framework requirements (OWASP ASVS, CIS Benchmarks, NIST CSF, PCI-DSS, HIPAA, SOC 2), refer to the respective official documentation.
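A minimal sketch for auditing a response against the checklist; only the rows visible above are encoded, and the values are assumptions to adjust per application:

```python
# Recommended values from the checklist above (partial).
RECOMMENDED_HEADERS = {
    "Referrer-Policy": "strict-origin-when-cross-origin",
    "Permissions-Policy": "geolocation=(), microphone=(), camera=()",
}

def missing_security_headers(response_headers: dict) -> list[str]:
    """Return checklist headers absent from a response's header map."""
    return [name for name in RECOMMENDED_HEADERS if name not in response_headers]

print(missing_security_headers({"Referrer-Policy": "strict-origin-when-cross-origin"}))
# prints ['Permissions-Policy']
```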
---
## Related Skills


@@ -1,3 +1,8 @@
---
name: "stripe-integration-expert"
description: "Stripe payment integration skill covering typed Stripe SDK setup, customer creation, and payment API routes in TypeScript. Use when the user asks to integrate Stripe, implement payments or billing, or work with the Stripe API."
---
# Stripe Integration Expert
**Tier:** POWERFUL
@@ -67,7 +72,7 @@ export const stripe = new Stripe(process.env.STRIPE_SECRET_KEY!, {
apiVersion: "2024-04-10",
typescript: true,
appInfo: {
name: "MyApp",
name: "myapp",
version: "1.0.0",
},
})
@@ -109,7 +114,7 @@ export async function POST(req: Request) {
if (!stripeCustomerId) {
const customer = await stripe.customers.create({
email: user.email,
name: user.name ?? undefined,
metadata: { userId: user.id },
})
stripeCustomerId = customer.id


@@ -1,6 +1,6 @@
---
name: tdd-guide
description: Test-driven development workflow with test generation, coverage analysis, and multi-framework support
name: "tdd-guide"
description: "Test-driven development skill for writing unit tests, generating test fixtures and mocks, analyzing coverage gaps, and guiding red-green-refactor workflows across Jest, Pytest, JUnit, Vitest, and Mocha. Use when the user asks to write tests, improve test coverage, practice TDD, generate mocks or stubs, or mentions testing frameworks like Jest, pytest, or JUnit. Handles test generation from source code, coverage report parsing (LCOV/JSON/XML), quality scoring, and framework conversion for TypeScript, JavaScript, Python, and Java projects."
triggers:
- generate tests
- analyze coverage
@@ -16,27 +16,6 @@ triggers:
Test-driven development skill for generating tests, analyzing coverage, and guiding red-green-refactor workflows across Jest, Pytest, JUnit, and Vitest.
## Table of Contents
- [Capabilities](#capabilities)
- [Workflows](#workflows)
- [Tools](#tools)
- [Input Requirements](#input-requirements)
- [Limitations](#limitations)
---
## Capabilities
| Capability | Description |
|------------|-------------|
| Test Generation | Convert requirements or code into test cases with proper structure |
| Coverage Analysis | Parse LCOV/JSON/XML reports, identify gaps, prioritize fixes |
| TDD Workflow | Guide red-green-refactor cycles with validation |
| Framework Adapters | Generate tests for Jest, Pytest, JUnit, Vitest, Mocha |
| Quality Scoring | Assess test isolation, assertions, naming, detect test smells |
| Fixture Generation | Create realistic test data, mocks, and factories |
---
## Workflows
@@ -68,18 +47,85 @@ Test-driven development skill for generating tests, analyzing coverage, and guid
---
## Tools
## Examples
### Test Generation — Input → Output (Pytest)
**Input source function (`math_utils.py`):**
```python
def divide(a: float, b: float) -> float:
if b == 0:
raise ValueError("Cannot divide by zero")
return a / b
```
**Command:**
```bash
python scripts/test_generator.py --input math_utils.py --framework pytest
```
**Generated test output (`test_math_utils.py`):**
```python
import pytest
from math_utils import divide
class TestDivide:
def test_divide_positive_numbers(self):
assert divide(10, 2) == 5.0
def test_divide_negative_numerator(self):
assert divide(-10, 2) == -5.0
def test_divide_float_result(self):
assert divide(1, 3) == pytest.approx(0.3333, rel=1e-3)
def test_divide_by_zero_raises_value_error(self):
with pytest.raises(ValueError, match="Cannot divide by zero"):
divide(10, 0)
def test_divide_zero_numerator(self):
assert divide(0, 5) == 0.0
```
---
### Coverage Analysis — Sample P0/P1/P2 Output
**Command:**
```bash
python scripts/coverage_analyzer.py --report lcov.info --threshold 80
```
**Sample output:**
```
Coverage Report — Overall: 63% (threshold: 80%)
P0 — Critical gaps (uncovered error paths):
auth/login.py:42-58 handle_expired_token() 0% covered
payments/process.py:91-110 handle_payment_failure() 0% covered
P1 — High-value gaps (core logic branches):
users/service.py:77 update_profile() — else branch 0% covered
orders/cart.py:134 apply_discount() — zero-qty guard 0% covered
P2 — Low-risk gaps (utility / helper functions):
utils/formatting.py:12 format_currency() 0% covered
Recommended: Generate tests for P0 items first to reach 80% threshold.
```
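The P0/P1/P2 bucketing shown above can be approximated with a path-keyword heuristic. This is purely illustrative; it is not the actual prioritization logic of coverage_analyzer.py:

```python
P0_HINTS = ("auth", "payment", "security")  # error paths, money, access control
P1_HINTS = ("service", "orders", "users")   # core business logic

def gap_priority(path: str) -> str:
    """Illustrative P0/P1/P2 bucketing by file-path keywords."""
    if any(hint in path for hint in P0_HINTS):
        return "P0"
    if any(hint in path for hint in P1_HINTS):
        return "P1"
    return "P2"

print(gap_priority("payments/process.py"), gap_priority("utils/formatting.py"))
# prints "P0 P2"
```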
---
## Key Tools
| Tool | Purpose | Usage |
|------|---------|-------|
| `test_generator.py` | Generate test cases from code/requirements | `python scripts/test_generator.py --input source.py --framework pytest` |
| `coverage_analyzer.py` | Parse and analyze coverage reports | `python scripts/coverage_analyzer.py --report lcov.info --threshold 80` |
| `tdd_workflow.py` | Guide red-green-refactor cycles | `python scripts/tdd_workflow.py --phase red --test test_auth.py` |
| `framework_adapter.py` | Convert tests between frameworks | `python scripts/framework_adapter.py --from jest --to pytest` |
| `fixture_generator.py` | Generate test data and mocks | `python scripts/fixture_generator.py --entity User --count 5` |
| `metrics_calculator.py` | Calculate test quality metrics | `python scripts/metrics_calculator.py --tests tests/` |
| `format_detector.py` | Detect language and framework | `python scripts/format_detector.py --file source.ts` |
| `output_formatter.py` | Format output for CLI/desktop/CI | `python scripts/output_formatter.py --format markdown` |
Additional scripts: `framework_adapter.py` (convert between frameworks), `metrics_calculator.py` (quality metrics), `format_detector.py` (detect language/framework), `output_formatter.py` (CLI/desktop/CI output).
---


@@ -1,5 +1,5 @@
---
name: tech-stack-evaluator
name: "tech-stack-evaluator"
description: Technology stack evaluation and comparison with TCO analysis, security assessment, and ecosystem health scoring. Use when comparing frameworks, evaluating technology stacks, calculating total cost of ownership, assessing migration paths, or analyzing ecosystem viability.
---