Add two executive leadership skill packages: CEO Advisor: - Strategy analyzer and financial scenario analyzer (Python tools) - Executive decision framework - Leadership & organizational culture guidelines - Board governance & investor relations guidance CTO Advisor: - Tech debt analyzer and team scaling calculator (Python tools) - Engineering metrics framework - Technology evaluation framework - Architecture decision records templates Also includes packaged .zip archives for easy distribution. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>
394 lines
12 KiB
Markdown
394 lines
12 KiB
Markdown
# Engineering Metrics & KPIs Guide
|
|
|
|
## Metrics Framework
|
|
|
|
### DORA Metrics (DevOps Research and Assessment)
|
|
|
|
#### 1. Deployment Frequency
|
|
- **Definition**: How often code is deployed to production
|
|
- **Target**:
|
|
- Elite: Multiple deploys per day
|
|
- High: Weekly to monthly
|
|
- Medium: Monthly to bi-annually
|
|
- Low: Less than bi-annually
|
|
- **Measurement**: Deployments per day/week/month
|
|
- **Improvement**: Smaller batch sizes, feature flags, CI/CD
|
|
|
|
#### 2. Lead Time for Changes
|
|
- **Definition**: Time from code commit to production
|
|
- **Target**:
|
|
- Elite: Less than 1 hour
|
|
- High: 1 day to 1 week
|
|
- Medium: 1 week to 1 month
|
|
- Low: More than 1 month
|
|
- **Measurement**: Median time from commit to deploy
|
|
- **Improvement**: Automation, parallel testing, smaller changes
|
|
|
|
#### 3. Mean Time to Recovery (MTTR)
|
|
- **Definition**: Time to restore service after incident
|
|
- **Target**:
|
|
- Elite: Less than 1 hour
|
|
- High: Less than 1 day
|
|
- Medium: 1 day to 1 week
|
|
- Low: More than 1 week
|
|
- **Measurement**: Average incident resolution time
|
|
- **Improvement**: Monitoring, rollback capability, runbooks
|
|
|
|
#### 4. Change Failure Rate
|
|
- **Definition**: Percentage of changes causing failures
|
|
- **Target**:
|
|
- Elite: 0-15%
|
|
- High: 16-30%
|
|
- Medium/Low: >30%
|
|
- **Measurement**: Failed deploys / Total deploys
|
|
- **Improvement**: Testing, code review, gradual rollouts
|
|
|
|
### Engineering Productivity Metrics
|
|
|
|
#### Code Quality
|
|
| Metric | Formula | Target | Action if Below |
|
|
|--------|---------|--------|-----------------|
|
|
| Test Coverage | Tests / Total Code | >80% | Add unit tests |
|
|
| Code Review Coverage | Reviewed PRs / Total PRs | 100% | Enforce review policy |
|
|
| Technical Debt Ratio | Debt / Development Time | <10% | Dedicate debt sprints |
|
|
| Cyclomatic Complexity | Per function/method | <10 | Refactor complex code |
|
|
| Code Duplication | Duplicate Lines / Total | <5% | Extract common code |
|
|
|
|
#### Development Velocity
|
|
| Metric | Formula | Target | Action if Below |
|
|
|--------|---------|--------|-----------------|
|
|
| Sprint Velocity | Story Points / Sprint | Stable ±10% | Review estimation |
|
|
| Cycle Time | Start to Done Time | <5 days | Reduce WIP |
|
|
| PR Merge Time | Open to Merge | <24 hours | Smaller PRs |
|
|
| Build Time | Code to Artifact | <10 minutes | Optimize pipeline |
|
|
| Test Execution Time | Full Test Suite | <30 minutes | Parallelize tests |
|
|
|
|
#### Team Health
|
|
| Metric | Formula | Target | Action if Below |
|
|
|--------|---------|--------|-----------------|
|
|
| On-call Incidents | Incidents / Week | <5 | Improve monitoring |
|
|
| Bug Escape Rate | Prod Bugs / Release | <5% | Improve testing |
|
|
| Unplanned Work | Unplanned / Total | <20% | Better planning |
|
|
| Meeting Time | Meetings / Total Time | <20% | Reduce meetings |
|
|
| Focus Time | Uninterrupted Hours | >4h/day | Block calendars |
|
|
|
|
### Business Impact Metrics
|
|
|
|
#### System Performance
|
|
| Metric | Description | Target | Business Impact |
|
|
|--------|-------------|--------|-----------------|
|
|
| Uptime | System availability | 99.9%+ | Revenue protection |
|
|
| Page Load Time | Time to interactive | <3s | User retention |
|
|
| API Response Time | P95 latency | <200ms | User experience |
|
|
| Error Rate | Errors / Requests | <0.1% | Customer satisfaction |
|
|
| Throughput | Requests / Second | Per requirement | Scalability |
|
|
|
|
#### Product Delivery
|
|
| Metric | Description | Target | Business Impact |
|
|
|--------|-------------|--------|-----------------|
|
|
| Feature Delivery Rate | Features / Quarter | Per roadmap | Market competitiveness |
|
|
| Time to Market | Idea to Production | <3 months | First mover advantage |
|
|
| Customer Defect Rate | Customer Bugs / Month | <10 | Customer satisfaction |
|
|
| Feature Adoption | Users / Feature | >50% | ROI validation |
|
|
| NPS from Engineering | Customer Score | >50 | Product quality |
|
|
|
|
## Metrics Dashboards
|
|
|
|
### Executive Dashboard (Weekly)
|
|
```
|
|
┌─────────────────────────────────────┐
|
|
│ EXECUTIVE METRICS │
|
|
├─────────────────────────────────────┤
|
|
│ Uptime: 99.97% ✓ │
|
|
│ Sprint Velocity: 142 pts ✓ │
|
|
│ Deployment Frequency: 3.2/day ✓ │
|
|
│ Lead Time: 4.2 hrs ✓ │
|
|
│ MTTR: 47 min ✓ │
|
|
│ Change Failure Rate: 8.3% ✓ │
|
|
│ │
|
|
│ Team Health: 8.2/10 │
|
|
│ Tech Debt Ratio: 12% ⚠ │
|
|
│ Feature Delivery: 85% ✓ │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
### Team Dashboard (Daily)
|
|
```
|
|
┌─────────────────────────────────────┐
|
|
│ TEAM METRICS │
|
|
├─────────────────────────────────────┤
|
|
│ Current Sprint: │
|
|
│ Completed: 65/100 pts (65%) │
|
|
│ In Progress: 20 pts │
|
|
│ Days Left: 3 │
|
|
│ │
|
|
│ PR Queue: 8 pending │
|
|
│ Build Status: ✓ Passing │
|
|
│ Test Coverage: 82.3% │
|
|
│ Open Incidents: 2 (P2, P3) │
|
|
│ │
|
|
│ On-call Load: 3 pages this week │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
### Individual Dashboard (Daily)
|
|
```
|
|
┌─────────────────────────────────────┐
|
|
│ DEVELOPER METRICS │
|
|
├─────────────────────────────────────┤
|
|
│ This Week: │
|
|
│ PRs Merged: 8 │
|
|
│ Code Reviews: 12 │
|
|
│ Commits: 23 │
|
|
│ Focus Time: 22.5 hrs │
|
|
│ │
|
|
│ Quality: │
|
|
│ Test Coverage: 87% │
|
|
│ Code Review Feedback: 95% ✓ │
|
|
│ Bug Introduction Rate: 0% │
|
|
└─────────────────────────────────────┘
|
|
```
|
|
|
|
## Implementation Guide
|
|
|
|
### Phase 1: Foundation (Month 1)
|
|
1. **Basic Metrics**
|
|
- Deployment frequency
|
|
- Build success rate
|
|
- Uptime/availability
|
|
- Team velocity
|
|
|
|
2. **Tools Setup**
|
|
- CI/CD instrumentation
|
|
- Basic monitoring
|
|
- Time tracking
|
|
|
|
### Phase 2: Quality (Month 2)
|
|
1. **Quality Metrics**
|
|
- Test coverage
|
|
- Code review metrics
|
|
- Bug rates
|
|
- Technical debt
|
|
|
|
2. **Tool Integration**
|
|
- Static analysis
|
|
- Test reporting
|
|
- Code quality gates
|
|
|
|
### Phase 3: Performance (Month 3)
|
|
1. **Performance Metrics**
|
|
- DORA metrics complete
|
|
- System performance
|
|
- API metrics
|
|
- Database metrics
|
|
|
|
2. **Advanced Monitoring**
|
|
- APM tools
|
|
- Distributed tracing
|
|
- Custom dashboards
|
|
|
|
### Phase 4: Optimization (Ongoing)
|
|
1. **Advanced Analytics**
|
|
- Predictive metrics
|
|
- Trend analysis
|
|
- Anomaly detection
|
|
- Correlation analysis
|
|
|
|
## Metric Anti-patterns
|
|
|
|
### What NOT to Measure
|
|
|
|
❌ **Lines of Code**: Encourages bloat
|
|
❌ **Hours Worked**: Promotes presenteeism
|
|
❌ **Individual Velocity**: Creates competition
|
|
❌ **Bug Count Without Context**: Discourages risk-taking
|
|
❌ **Commit Count**: Encourages tiny commits
|
|
|
|
### Goodhart's Law
|
|
"When a measure becomes a target, it ceases to be a good measure"
|
|
|
|
**Examples**:
|
|
- Optimizing test coverage → Writing meaningless tests
|
|
- Reducing bug count → Not reporting bugs
|
|
- Increasing velocity → Inflating estimates
|
|
- Reducing meeting time → Skipping important discussions
|
|
|
|
### How to Avoid Gaming
|
|
|
|
1. **Use Multiple Metrics**: No single metric tells the whole story
|
|
2. **Focus on Trends**: Not absolute numbers
|
|
3. **Combine Leading and Lagging**: Balance predictive and historical
|
|
4. **Regular Review**: Adjust metrics that are being gamed
|
|
5. **Team Ownership**: Let teams choose their metrics
|
|
|
|
## OKR Framework for Engineering
|
|
|
|
### Company Level OKRs
|
|
**Objective**: Deliver exceptional product quality
|
|
|
|
**Key Results**:
|
|
- KR1: Achieve 99.95% uptime (from 99.9%)
|
|
- KR2: Reduce customer-reported bugs by 50%
|
|
- KR3: Improve deployment frequency to 10x/day
|
|
|
|
### Engineering OKRs
|
|
**Objective**: Build scalable, reliable infrastructure
|
|
|
|
**Key Results**:
|
|
- KR1: Migrate 80% of services to Kubernetes
|
|
- KR2: Reduce MTTR to <30 minutes
|
|
- KR3: Achieve 85% test coverage
|
|
|
|
### Team OKRs
|
|
**Objective**: Improve developer productivity
|
|
|
|
**Key Results**:
|
|
- KR1: Reduce build time to <5 minutes
|
|
- KR2: Automate 90% of deployment process
|
|
- KR3: Reduce PR review time to <4 hours
|
|
|
|
## Reporting Templates
|
|
|
|
### Monthly Engineering Report
|
|
|
|
```markdown
|
|
# Engineering Report - [Month Year]
|
|
|
|
## Executive Summary
|
|
- Key Achievement: [Highlight]
|
|
- Main Challenge: [Issue and resolution]
|
|
- Next Month Focus: [Priority]
|
|
|
|
## DORA Metrics
|
|
| Metric | This Month | Last Month | Target | Status |
|
|
|--------|------------|------------|--------|--------|
|
|
| Deploy Frequency | X/day | Y/day | Z/day | ✓/⚠/✗ |
|
|
| Lead Time | X hrs | Y hrs | <Z hrs | ✓/⚠/✗ |
|
|
| MTTR | X min | Y min | <Z min | ✓/⚠/✗ |
|
|
| Change Failure | X% | Y% | <Z% | ✓/⚠/✗ |
|
|
|
|
## Team Performance
|
|
- Velocity: X story points (Y% of plan)
|
|
- Sprint Completion: X%
|
|
- Unplanned Work: X%
|
|
|
|
## Quality Metrics
|
|
- Test Coverage: X% (Δ Y%)
|
|
- Customer Bugs: X (Δ Y)
|
|
- Code Review Coverage: X%
|
|
|
|
## Highlights
|
|
1. [Major feature or improvement]
|
|
2. [Technical achievement]
|
|
3. [Process improvement]
|
|
|
|
## Challenges & Solutions
|
|
1. Challenge: [Issue]
|
|
Solution: [Action taken]
|
|
|
|
## Next Month Priorities
|
|
1. [Priority 1]
|
|
2. [Priority 2]
|
|
3. [Priority 3]
|
|
```
|
|
|
|
### Quarterly Business Review
|
|
|
|
```markdown
|
|
# Engineering QBR - Q[X] [Year]
|
|
|
|
## Strategic Alignment
|
|
- Business Goal: [Goal]
|
|
- Engineering Contribution: [How engineering supported]
|
|
- Impact: [Measurable outcome]
|
|
|
|
## Quarterly Metrics
|
|
|
|
### Delivery
|
|
- Features Shipped: X of Y planned (Z%)
|
|
- Major Releases: [List]
|
|
- Technical Debt Reduced: X%
|
|
|
|
### Reliability
|
|
- Uptime: X%
|
|
- Incidents: X (PY critical, PZ major)
|
|
- Customer Impact: [Description]
|
|
|
|
### Efficiency
|
|
- Cost per Transaction: $X (Δ Y%)
|
|
- Infrastructure Cost: $X (Δ Y%)
|
|
- Engineering Cost per Feature: $X
|
|
|
|
## Team Growth
|
|
- Headcount: Start: X → End: Y
|
|
- Attrition: X%
|
|
- Key Hires: [Roles]
|
|
|
|
## Innovation
|
|
- Patents Filed: X
|
|
- Open Source Contributions: X
|
|
- Hackathon Projects: X
|
|
|
|
## Lessons Learned
|
|
1. [What worked well]
|
|
2. [What didn't work]
|
|
3. [What we're changing]
|
|
|
|
## Next Quarter Focus
|
|
1. [Strategic Initiative 1]
|
|
2. [Strategic Initiative 2]
|
|
3. [Strategic Initiative 3]
|
|
```
|
|
|
|
## Tool Recommendations
|
|
|
|
### Metrics Collection
|
|
- **DataDog**: Comprehensive monitoring
|
|
- **New Relic**: Application performance
|
|
- **Grafana + Prometheus**: Open source stack
|
|
- **CloudWatch**: AWS native
|
|
|
|
### Engineering Analytics
|
|
- **LinearB**: Developer productivity
|
|
- **Velocity**: Engineering metrics
|
|
- **Sleuth**: DORA metrics
|
|
- **Swarmia**: Engineering insights
|
|
|
|
### Project Tracking
|
|
- **Jira**: Issue tracking
|
|
- **Linear**: Modern issue tracking
|
|
- **Azure DevOps**: Microsoft ecosystem
|
|
- **GitHub Projects**: Integrated with code
|
|
|
|
### Incident Management
|
|
- **PagerDuty**: On-call management
|
|
- **Opsgenie**: Incident response
|
|
- **StatusPage**: Status communication
|
|
- **FireHydrant**: Incident command
|
|
|
|
## Success Indicators
|
|
|
|
### Healthy Engineering Organization
|
|
✓ DORA metrics improving quarter-over-quarter
|
|
✓ Team satisfaction >8/10
|
|
✓ Attrition <10% annually
|
|
✓ On-time delivery >80%
|
|
✓ Technical debt <15% of capacity
|
|
✓ Innovation time >20%
|
|
|
|
### Warning Signs
|
|
⚠️ Increasing MTTR trend
|
|
⚠️ Declining velocity
|
|
⚠️ Rising bug escape rate
|
|
⚠️ Increasing unplanned work
|
|
⚠️ Growing PR queue
|
|
⚠️ Decreasing test coverage
|
|
|
|
### Crisis Indicators
|
|
🚨 Multiple production incidents per week
|
|
🚨 Team satisfaction <6/10
|
|
🚨 Attrition >20%
|
|
🚨 Technical debt >30%
|
|
🚨 No deployments for >1 week
|
|
🚨 Customer escalations increasing
|