Files
claude-skills-reference/engineering/skill-tester/references/quality-scoring-rubric.md
Leo d63685401d feat: add skill-tester POWERFUL-tier skill
- Comprehensive skill validation, testing, and quality scoring framework
- skill_validator.py: validates structure, documentation, and compliance (700+ LOC)
- script_tester.py: tests syntax, functionality, and runtime behavior (800+ LOC)
- quality_scorer.py: multi-dimensional quality assessment with scoring (1100+ LOC)
- Complete reference documentation (structure spec, tier requirements, scoring rubric)
- Sample skill with assets and expected outputs for testing
- CI/CD integration examples and pre-commit hook support
- Zero external dependencies, dual output formats (JSON + human-readable)
- Self-testing capable meta-skill for quality assurance automation
2026-02-16 16:53:49 +00:00

405 lines
15 KiB
Markdown

# Quality Scoring Rubric
**Version**: 1.0.0
**Last Updated**: 2026-02-16
**Authority**: Claude Skills Engineering Team
## Overview
This document defines the comprehensive quality scoring methodology used to assess skills within the claude-skills ecosystem. The scoring system evaluates four key dimensions, each weighted equally at 25%, to provide an objective and consistent measure of skill quality.
## Scoring Framework
### Overall Scoring Scale
- **A+ (95-100)**: Exceptional quality, exceeds all standards
- **A (90-94)**: Excellent quality, meets highest standards consistently
- **A- (85-89)**: Very good quality, minor areas for improvement
- **B+ (80-84)**: Good quality, meets most standards well
- **B (75-79)**: Satisfactory quality, meets standards adequately
- **B- (70-74)**: Below average, several areas need improvement
- **C+ (65-69)**: Poor quality, significant improvements needed
- **C (60-64)**: Minimal acceptable quality, major improvements required
- **C- (55-59)**: Unacceptable quality, extensive rework needed
- **D (50-54)**: Very poor quality, fundamental issues present
- **F (0-49)**: Failing quality, does not meet basic standards
### Dimension Weights
Each dimension contributes equally to the overall score:
- **Documentation Quality**: 25%
- **Code Quality**: 25%
- **Completeness**: 25%
- **Usability**: 25%
## Documentation Quality (25% Weight)
### Scoring Components
#### SKILL.md Quality (40% of Documentation Score)
**Component Breakdown:**
- **Length and Depth (25%)**: Line count and content substance
- **Frontmatter Quality (25%)**: Completeness and accuracy of YAML metadata
- **Section Coverage (25%)**: Required and recommended section presence
- **Content Depth (25%)**: Technical detail and comprehensiveness
**Scoring Criteria:**
| Score Range | Length | Frontmatter | Sections | Depth |
|-------------|--------|-------------|----------|-------|
| 90-100 | 400+ lines | All fields complete + extras | All required + 4+ recommended | Rich technical detail, examples |
| 80-89 | 300-399 lines | All required fields complete | All required + 2-3 recommended | Good technical coverage |
| 70-79 | 200-299 lines | Most required fields | All required + 1 recommended | Adequate technical content |
| 60-69 | 150-199 lines | Some required fields | Most required sections | Basic technical information |
| 50-59 | 100-149 lines | Minimal frontmatter | Some required sections | Limited technical detail |
| Below 50 | <100 lines | Missing/invalid frontmatter | Few/no required sections | Insufficient content |
#### README.md Quality (25% of Documentation Score)
**Scoring Criteria:**
- **Excellent (90-100)**: 1000+ chars, comprehensive usage guide, examples, troubleshooting
- **Good (75-89)**: 500-999 chars, clear usage instructions, basic examples
- **Satisfactory (60-74)**: 200-499 chars, minimal usage information
- **Poor (40-59)**: <200 chars or confusing content
- **Failing (0-39)**: Missing or completely inadequate
#### Reference Documentation (20% of Documentation Score)
**Scoring Criteria:**
- **Excellent (90-100)**: Multiple comprehensive reference docs (2000+ chars total)
- **Good (75-89)**: 2-3 reference files with substantial content
- **Satisfactory (60-74)**: 1-2 reference files with adequate content
- **Poor (40-59)**: Minimal reference content or poor quality
- **Failing (0-39)**: No reference documentation
#### Examples and Usage Clarity (15% of Documentation Score)
**Scoring Criteria:**
- **Excellent (90-100)**: 5+ diverse examples, clear usage patterns
- **Good (75-89)**: 3-4 examples covering different scenarios
- **Satisfactory (60-74)**: 2-3 basic examples
- **Poor (40-59)**: 1-2 minimal examples
- **Failing (0-39)**: No examples or unclear usage
## Code Quality (25% Weight)
### Scoring Components
#### Script Complexity and Architecture (25% of Code Score)
**Evaluation Criteria:**
- Lines of code per script relative to tier requirements
- Function and class organization
- Code modularity and reusability
- Algorithm sophistication
**Scoring Matrix:**
| Tier | Excellent (90-100) | Good (75-89) | Satisfactory (60-74) | Poor (Below 60) |
|------|-------------------|--------------|---------------------|-----------------|
| BASIC | 200-300 LOC, well-structured | 150-199 LOC, organized | 100-149 LOC, basic | <100 LOC, minimal |
| STANDARD | 400-500 LOC, modular | 350-399 LOC, structured | 300-349 LOC, adequate | <300 LOC, basic |
| POWERFUL | 600-800 LOC, sophisticated | 550-599 LOC, advanced | 500-549 LOC, solid | <500 LOC, simple |
#### Error Handling Quality (25% of Code Score)
**Scoring Criteria:**
- **Excellent (90-100)**: Comprehensive exception handling, specific error types, recovery mechanisms
- **Good (75-89)**: Good exception handling, meaningful error messages, logging
- **Satisfactory (60-74)**: Basic try/except blocks, simple error messages
- **Poor (40-59)**: Minimal error handling, generic exceptions
- **Failing (0-39)**: No error handling or inappropriate handling
**Error Handling Checklist:**
- [ ] Try/except blocks for risky operations
- [ ] Specific exception types (not just Exception)
- [ ] Meaningful error messages for users
- [ ] Proper error logging or reporting
- [ ] Graceful degradation where possible
- [ ] Input validation and sanitization
#### Code Structure and Organization (25% of Code Score)
**Evaluation Elements:**
- Function decomposition and single responsibility
- Class design and inheritance patterns
- Import organization and dependency management
- Documentation and comments quality
- Consistent naming conventions
- PEP 8 compliance
**Scoring Guidelines:**
- **Excellent (90-100)**: Exemplary structure, comprehensive docstrings, perfect style
- **Good (75-89)**: Well-organized, good documentation, minor style issues
- **Satisfactory (60-74)**: Adequate structure, basic documentation, some style issues
- **Poor (40-59)**: Poor organization, minimal documentation, style problems
- **Failing (0-39)**: No clear structure, no documentation, major style violations
#### Output Format Support (25% of Code Score)
**Required Capabilities:**
- JSON output format support
- Human-readable output format
- Proper data serialization
- Consistent output structure
- Error output handling
**Scoring Criteria:**
- **Excellent (90-100)**: Dual format + custom formats, perfect serialization
- **Good (75-89)**: Dual format support, good serialization
- **Satisfactory (60-74)**: Single format well-implemented
- **Poor (40-59)**: Basic output, formatting issues
- **Failing (0-39)**: Poor or no structured output
## Completeness (25% Weight)
### Scoring Components
#### Directory Structure Compliance (25% of Completeness Score)
**Required Directories by Tier:**
- **BASIC**: scripts/ (required), assets/ + references/ (recommended)
- **STANDARD**: scripts/ + assets/ + references/ (required), expected_outputs/ (recommended)
- **POWERFUL**: scripts/ + assets/ + references/ + expected_outputs/ (all required)
**Scoring Calculation:**
```
Structure Score = (Required Present / Required Total) * 0.6 +
(Recommended Present / Recommended Total) * 0.4
```
#### Asset Availability and Quality (25% of Completeness Score)
**Scoring Criteria:**
- **Excellent (90-100)**: 5+ diverse assets, multiple file types, realistic data
- **Good (75-89)**: 3-4 assets, some diversity, good quality
- **Satisfactory (60-74)**: 2-3 assets, basic variety
- **Poor (40-59)**: 1-2 minimal assets
- **Failing (0-39)**: No assets or unusable assets
**Asset Quality Factors:**
- File diversity (JSON, CSV, YAML, etc.)
- Data realism and complexity
- Coverage of use cases
- File size appropriateness
- Documentation of asset purpose
#### Expected Output Coverage (25% of Completeness Score)
**Evaluation Criteria:**
- Correspondence with asset files
- Coverage of success and error scenarios
- Output format variety
- Reproducibility and accuracy
**Scoring Matrix:**
- **Excellent (90-100)**: Complete output coverage, all scenarios, verified accuracy
- **Good (75-89)**: Good coverage, most scenarios, mostly accurate
- **Satisfactory (60-74)**: Basic coverage, main scenarios
- **Poor (40-59)**: Minimal coverage, some inaccuracies
- **Failing (0-39)**: No expected outputs or completely inaccurate
#### Test Coverage and Validation (25% of Completeness Score)
**Assessment Areas:**
- Sample data processing capability
- Output verification mechanisms
- Edge case handling
- Error condition testing
- Integration test scenarios
**Scoring Guidelines:**
- **Excellent (90-100)**: Comprehensive test coverage, automated validation
- **Good (75-89)**: Good test coverage, manual validation possible
- **Satisfactory (60-74)**: Basic testing capability
- **Poor (40-59)**: Minimal testing support
- **Failing (0-39)**: No testing or validation capability
## Usability (25% Weight)
### Scoring Components
#### Installation and Setup Simplicity (25% of Usability Score)
**Evaluation Factors:**
- Dependency requirements (Python stdlib preferred)
- Setup complexity
- Environment requirements
- Installation documentation clarity
**Scoring Criteria:**
- **Excellent (90-100)**: Zero external dependencies, single-file execution
- **Good (75-89)**: Minimal dependencies, simple setup
- **Satisfactory (60-74)**: Some dependencies, documented setup
- **Poor (40-59)**: Complex dependencies, unclear setup
- **Failing (0-39)**: Unable to install or excessive complexity
#### Usage Clarity and Help Quality (25% of Usability Score)
**Assessment Elements:**
- Command-line help comprehensiveness
- Usage example clarity
- Parameter documentation quality
- Error message helpfulness
**Help Quality Checklist:**
- [ ] Comprehensive --help output
- [ ] Clear parameter descriptions
- [ ] Usage examples included
- [ ] Error messages are actionable
- [ ] Progress indicators where appropriate
**Scoring Matrix:**
- **Excellent (90-100)**: Exemplary help, multiple examples, perfect error messages
- **Good (75-89)**: Good help quality, clear examples, helpful errors
- **Satisfactory (60-74)**: Adequate help, basic examples
- **Poor (40-59)**: Minimal help, confusing interface
- **Failing (0-39)**: No help or completely unclear interface
#### Documentation Accessibility (25% of Usability Score)
**Evaluation Criteria:**
- README quick start effectiveness
- SKILL.md navigation and structure
- Reference material organization
- Learning curve considerations
**Accessibility Factors:**
- Information hierarchy clarity
- Cross-reference quality
- Beginner-friendly explanations
- Advanced user shortcuts
- Troubleshooting guidance
#### Practical Example Quality (25% of Usability Score)
**Assessment Areas:**
- Example realism and relevance
- Complexity progression (simple to advanced)
- Output demonstration
- Common use case coverage
- Integration scenarios
**Scoring Guidelines:**
- **Excellent (90-100)**: 5+ examples, perfect progression, real-world scenarios
- **Good (75-89)**: 3-4 examples, good variety, practical scenarios
- **Satisfactory (60-74)**: 2-3 examples, adequate coverage
- **Poor (40-59)**: 1-2 examples, limited practical value
- **Failing (0-39)**: No examples or completely impractical
## Scoring Calculations
### Dimension Score Calculation
Each dimension score is calculated as a weighted average of its components:
```python
def calculate_dimension_score(components):
total_weighted_score = 0
total_weight = 0
for component_name, component_data in components.items():
score = component_data['score']
weight = component_data['weight']
total_weighted_score += score * weight
total_weight += weight
return total_weighted_score / total_weight if total_weight > 0 else 0
```
### Overall Score Calculation
The overall score combines all dimensions with equal weighting:
```python
def calculate_overall_score(dimensions):
return sum(dimension.score * 0.25 for dimension in dimensions.values())
```
### Letter Grade Assignment
```python
def assign_letter_grade(overall_score):
if overall_score >= 95: return "A+"
elif overall_score >= 90: return "A"
elif overall_score >= 85: return "A-"
elif overall_score >= 80: return "B+"
elif overall_score >= 75: return "B"
elif overall_score >= 70: return "B-"
elif overall_score >= 65: return "C+"
elif overall_score >= 60: return "C"
elif overall_score >= 55: return "C-"
elif overall_score >= 50: return "D"
else: return "F"
```
## Quality Improvement Recommendations
### Score-Based Recommendations
#### For Scores Below 60 (C- or Lower)
**Priority Actions:**
1. Address fundamental structural issues
2. Implement basic error handling
3. Add essential documentation sections
4. Create minimal viable examples
5. Fix critical functionality issues
#### For Scores 60-74 (C+ to B-)
**Improvement Areas:**
1. Expand documentation comprehensiveness
2. Enhance error handling sophistication
3. Add more diverse examples and use cases
4. Improve code organization and structure
5. Increase test coverage and validation
#### For Scores 75-84 (B to B+)
**Enhancement Opportunities:**
1. Refine documentation for expert-level quality
2. Implement advanced error recovery mechanisms
3. Add comprehensive reference materials
4. Optimize code architecture and performance
5. Develop extensive example library
#### For Scores 85+ (A- or Higher)
**Excellence Maintenance:**
1. Regular quality audits and updates
2. Community feedback integration
3. Best practice evolution tracking
4. Mentoring lower-quality skills
5. Innovation and cutting-edge feature adoption
### Dimension-Specific Improvement Strategies
#### Low Documentation Scores
- Expand SKILL.md with technical details
- Add comprehensive API reference
- Include architecture diagrams and explanations
- Develop troubleshooting guides
- Create contributor documentation
#### Low Code Quality Scores
- Refactor for better modularity
- Implement comprehensive error handling
- Add extensive code documentation
- Apply advanced design patterns
- Optimize performance and efficiency
#### Low Completeness Scores
- Add missing directories and files
- Develop comprehensive sample datasets
- Create expected output libraries
- Implement automated testing
- Add integration examples
#### Low Usability Scores
- Simplify installation process
- Improve command-line interface design
- Enhance help text and documentation
- Create beginner-friendly tutorials
- Add interactive examples
## Quality Assurance Process
### Automated Scoring
The quality scorer runs automated assessments based on this rubric:
1. File system analysis for structure compliance
2. Content analysis for documentation quality
3. Code analysis for quality metrics
4. Asset inventory and quality assessment
### Manual Review Process
Human reviewers validate automated scores and provide qualitative insights:
1. Content quality assessment beyond automated metrics
2. Usability testing with real-world scenarios
3. Technical accuracy verification
4. Community value assessment
### Continuous Improvement
The scoring rubric evolves based on:
- Community feedback and usage patterns
- Industry best practice changes
- Tool capability enhancements
- Quality trend analysis
This quality scoring rubric ensures consistent, objective, and comprehensive assessment of all skills within the claude-skills ecosystem while providing clear guidance for quality improvement.