# Engineering Metrics & KPIs Guide ## Metrics Framework ### DORA Metrics (DevOps Research and Assessment) #### 1. Deployment Frequency - **Definition**: How often code is deployed to production - **Target**: - Elite: Multiple deploys per day - High: Weekly to monthly - Medium: Monthly to bi-annually - Low: Less than bi-annually - **Measurement**: Deployments per day/week/month - **Improvement**: Smaller batch sizes, feature flags, CI/CD #### 2. Lead Time for Changes - **Definition**: Time from code commit to production - **Target**: - Elite: Less than 1 hour - High: 1 day to 1 week - Medium: 1 week to 1 month - Low: More than 1 month - **Measurement**: Median time from commit to deploy - **Improvement**: Automation, parallel testing, smaller changes #### 3. Mean Time to Recovery (MTTR) - **Definition**: Time to restore service after incident - **Target**: - Elite: Less than 1 hour - High: Less than 1 day - Medium: 1 day to 1 week - Low: More than 1 week - **Measurement**: Average incident resolution time - **Improvement**: Monitoring, rollback capability, runbooks #### 4. Change Failure Rate - **Definition**: Percentage of changes causing failures - **Target**: - Elite: 0-15% - High: 16-30% - Medium/Low: >30% - **Measurement**: Failed deploys / Total deploys - **Improvement**: Testing, code review, gradual rollouts ### Engineering Productivity Metrics #### Code Quality | Metric | Formula | Target | Action if Below | |--------|---------|--------|-----------------| | Test Coverage | Tests / Total Code | >80% | Add unit tests | | Code Review Coverage | Reviewed PRs / Total PRs | 100% | Enforce review policy | | Technical Debt Ratio | Debt / Development Time | <10% | Dedicate debt sprints | | Cyclomatic Complexity | Per function/method | <10 | Refactor complex code | | Code Duplication | Duplicate Lines / Total | <5% | Extract common code | #### Development Velocity | Metric | Formula | Target | Action if Below | |--------|---------|--------|-----------------| | Sprint Velocity | Story Points / Sprint | Stable ±10% | Review estimation | | Cycle Time | Start to Done Time | <5 days | Reduce WIP | | PR Merge Time | Open to Merge | <24 hours | Smaller PRs | | Build Time | Code to Artifact | <10 minutes | Optimize pipeline | | Test Execution Time | Full Test Suite | <30 minutes | Parallelize tests | #### Team Health | Metric | Formula | Target | Action if Below | |--------|---------|--------|-----------------| | On-call Incidents | Incidents / Week | <5 | Improve monitoring | | Bug Escape Rate | Prod Bugs / Release | <5% | Improve testing | | Unplanned Work | Unplanned / Total | <20% | Better planning | | Meeting Time | Meetings / Total Time | <20% | Reduce meetings | | Focus Time | Uninterrupted Hours | >4h/day | Block calendars | ### Business Impact Metrics #### System Performance | Metric | Description | Target | Business Impact | |--------|-------------|--------|-----------------| | Uptime | System availability | 99.9%+ | Revenue protection | | Page Load Time | Time to interactive | <3s | User retention | | API Response Time | P95 latency | <200ms | User experience | | Error Rate | Errors / Requests | <0.1% | Customer satisfaction | | Throughput | Requests / Second | Per requirement | Scalability | #### Product Delivery | Metric | Description | Target | Business Impact | |--------|-------------|--------|-----------------| | Feature Delivery Rate | Features / Quarter | Per roadmap | Market competitiveness | | Time to Market | Idea to Production | <3 months | First mover advantage | | Customer Defect Rate | Customer Bugs / Month | <10 | Customer satisfaction | | Feature Adoption | Users / Feature | >50% | ROI validation | | NPS from Engineering | Customer Score | >50 | Product quality | ## Metrics Dashboards ### Executive Dashboard (Weekly) ``` ┌─────────────────────────────────────┐ │ EXECUTIVE METRICS │ ├─────────────────────────────────────┤ │ Uptime: 99.97% ✓ │ │ Sprint Velocity: 142 pts ✓ │ │ Deployment Frequency: 3.2/day ✓ │ │ Lead Time: 4.2 hrs ✓ │ │ MTTR: 47 min ✓ │ │ Change Failure Rate: 8.3% ✓ │ │ │ │ Team Health: 8.2/10 │ │ Tech Debt Ratio: 12% ⚠ │ │ Feature Delivery: 85% ✓ │ └─────────────────────────────────────┘ ``` ### Team Dashboard (Daily) ``` ┌─────────────────────────────────────┐ │ TEAM METRICS │ ├─────────────────────────────────────┤ │ Current Sprint: │ │ Completed: 65/100 pts (65%) │ │ In Progress: 20 pts │ │ Days Left: 3 │ │ │ │ PR Queue: 8 pending │ │ Build Status: ✓ Passing │ │ Test Coverage: 82.3% │ │ Open Incidents: 2 (P2, P3) │ │ │ │ On-call Load: 3 pages this week │ └─────────────────────────────────────┘ ``` ### Individual Dashboard (Daily) ``` ┌─────────────────────────────────────┐ │ DEVELOPER METRICS │ ├─────────────────────────────────────┤ │ This Week: │ │ PRs Merged: 8 │ │ Code Reviews: 12 │ │ Commits: 23 │ │ Focus Time: 22.5 hrs │ │ │ │ Quality: │ │ Test Coverage: 87% │ │ Code Review Feedback: 95% ✓ │ │ Bug Introduction Rate: 0% │ └─────────────────────────────────────┘ ``` ## Implementation Guide ### Phase 1: Foundation (Month 1) 1. **Basic Metrics** - Deployment frequency - Build success rate - Uptime/availability - Team velocity 2. **Tools Setup** - CI/CD instrumentation - Basic monitoring - Time tracking ### Phase 2: Quality (Month 2) 1. **Quality Metrics** - Test coverage - Code review metrics - Bug rates - Technical debt 2. **Tool Integration** - Static analysis - Test reporting - Code quality gates ### Phase 3: Performance (Month 3) 1. **Performance Metrics** - DORA metrics complete - System performance - API metrics - Database metrics 2. **Advanced Monitoring** - APM tools - Distributed tracing - Custom dashboards ### Phase 4: Optimization (Ongoing) 1. **Advanced Analytics** - Predictive metrics - Trend analysis - Anomaly detection - Correlation analysis ## Metric Anti-patterns ### What NOT to Measure ❌ **Lines of Code**: Encourages bloat ❌ **Hours Worked**: Promotes presenteeism ❌ **Individual Velocity**: Creates competition ❌ **Bug Count Without Context**: Discourages risk-taking ❌ **Commit Count**: Encourages tiny commits ### Goodhart's Law "When a measure becomes a target, it ceases to be a good measure" **Examples**: - Optimizing test coverage → Writing meaningless tests - Reducing bug count → Not reporting bugs - Increasing velocity → Inflating estimates - Reducing meeting time → Skipping important discussions ### How to Avoid Gaming 1. **Use Multiple Metrics**: No single metric tells the whole story 2. **Focus on Trends**: Not absolute numbers 3. **Combine Leading and Lagging**: Balance predictive and historical 4. **Regular Review**: Adjust metrics that are being gamed 5. **Team Ownership**: Let teams choose their metrics ## OKR Framework for Engineering ### Company Level OKRs **Objective**: Deliver exceptional product quality **Key Results**: - KR1: Achieve 99.95% uptime (from 99.9%) - KR2: Reduce customer-reported bugs by 50% - KR3: Improve deployment frequency to 10x/day ### Engineering OKRs **Objective**: Build scalable, reliable infrastructure **Key Results**: - KR1: Migrate 80% of services to Kubernetes - KR2: Reduce MTTR to <30 minutes - KR3: Achieve 85% test coverage ### Team OKRs **Objective**: Improve developer productivity **Key Results**: - KR1: Reduce build time to <5 minutes - KR2: Automate 90% of deployment process - KR3: Reduce PR review time to <4 hours ## Reporting Templates ### Monthly Engineering Report ```markdown # Engineering Report - [Month Year] ## Executive Summary - Key Achievement: [Highlight] - Main Challenge: [Issue and resolution] - Next Month Focus: [Priority] ## DORA Metrics | Metric | This Month | Last Month | Target | Status | |--------|------------|------------|--------|--------| | Deploy Frequency | X/day | Y/day | Z/day | ✓/⚠/✗ | | Lead Time | X hrs | Y hrs | 8/10 ✓ Attrition <10% annually ✓ On-time delivery >80% ✓ Technical debt <15% of capacity ✓ Innovation time >20% ### Warning Signs ⚠️ Increasing MTTR trend ⚠️ Declining velocity ⚠️ Rising bug escape rate ⚠️ Increasing unplanned work ⚠️ Growing PR queue ⚠️ Decreasing test coverage ### Crisis Indicators 🚨 Multiple production incidents per week 🚨 Team satisfaction <6/10 🚨 Attrition >20% 🚨 Technical debt >30% 🚨 No deployments for >1 week 🚨 Customer escalations increasing