Quality Scoring Rubric

Version: 1.0.0
Last Updated: 2026-02-16
Authority: Claude Skills Engineering Team

Overview

This document defines the comprehensive quality scoring methodology used to assess skills within the claude-skills ecosystem. The scoring system evaluates four key dimensions, each weighted equally at 25%, to provide an objective and consistent measure of skill quality.

Scoring Framework

Overall Scoring Scale

  • A+ (95-100): Exceptional quality, exceeds all standards
  • A (90-94): Excellent quality, meets highest standards consistently
  • A- (85-89): Very good quality, minor areas for improvement
  • B+ (80-84): Good quality, meets most standards well
  • B (75-79): Satisfactory quality, meets standards adequately
  • B- (70-74): Below average, several areas need improvement
  • C+ (65-69): Poor quality, significant improvements needed
  • C (60-64): Minimal acceptable quality, major improvements required
  • C- (55-59): Unacceptable quality, extensive rework needed
  • D (50-54): Very poor quality, fundamental issues present
  • F (0-49): Failing quality, does not meet basic standards

Dimension Weights

Each dimension contributes equally to the overall score:

  • Documentation Quality: 25%
  • Code Quality: 25%
  • Completeness: 25%
  • Usability: 25%

Documentation Quality (25% Weight)

Scoring Components

SKILL.md Quality (40% of Documentation Score)

Component Breakdown:

  • Length and Depth (25%): Line count and content substance
  • Frontmatter Quality (25%): Completeness and accuracy of YAML metadata
  • Section Coverage (25%): Required and recommended section presence
  • Content Depth (25%): Technical detail and comprehensiveness

Scoring Criteria:

| Score Range | Length | Frontmatter | Sections | Depth |
|-------------|--------|-------------|----------|-------|
| 90-100 | 400+ lines | All fields complete + extras | All required + 4+ recommended | Rich technical detail, examples |
| 80-89 | 300-399 lines | All required fields complete | All required + 2-3 recommended | Good technical coverage |
| 70-79 | 200-299 lines | Most required fields | All required + 1 recommended | Adequate technical content |
| 60-69 | 150-199 lines | Some required fields | Most required sections | Basic technical information |
| 50-59 | 100-149 lines | Minimal frontmatter | Some required sections | Limited technical detail |
| Below 50 | <100 lines | Missing/invalid frontmatter | Few/no required sections | Insufficient content |
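As one illustration, the Length band from the table above can be mapped to a numeric score. The thresholds come from the table; the specific band midpoints returned here are hypothetical choices, and the actual quality_scorer.py implementation may map bands differently.

```python
def score_skill_md_length(line_count: int) -> int:
    """Map a SKILL.md line count to a score, using the rubric's length bands.

    Returned values are band midpoints (an illustrative choice, not the
    canonical mapping)."""
    bands = [
        (400, 95),  # 90-100: 400+ lines
        (300, 85),  # 80-89: 300-399 lines
        (200, 75),  # 70-79: 200-299 lines
        (150, 65),  # 60-69: 150-199 lines
        (100, 55),  # 50-59: 100-149 lines
    ]
    for threshold, score in bands:
        if line_count >= threshold:
            return score
    return 30  # Below 50 band: <100 lines
```

The other three components (frontmatter, sections, depth) would need similar band-to-score mappings before the 25/25/25/25 component weighting can be applied.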

README.md Quality (25% of Documentation Score)

Scoring Criteria:

  • Excellent (90-100): 1000+ chars, comprehensive usage guide, examples, troubleshooting
  • Good (75-89): 500-999 chars, clear usage instructions, basic examples
  • Satisfactory (60-74): 200-499 chars, minimal usage information
  • Poor (40-59): <200 chars or confusing content
  • Failing (0-39): Missing or completely inadequate

Reference Documentation (20% of Documentation Score)

Scoring Criteria:

  • Excellent (90-100): Multiple comprehensive reference docs (2000+ chars total)
  • Good (75-89): 2-3 reference files with substantial content
  • Satisfactory (60-74): 1-2 reference files with adequate content
  • Poor (40-59): Minimal reference content or poor quality
  • Failing (0-39): No reference documentation

Examples and Usage Clarity (15% of Documentation Score)

Scoring Criteria:

  • Excellent (90-100): 5+ diverse examples, clear usage patterns
  • Good (75-89): 3-4 examples covering different scenarios
  • Satisfactory (60-74): 2-3 basic examples
  • Poor (40-59): 1-2 minimal examples
  • Failing (0-39): No examples or unclear usage

Code Quality (25% Weight)

Scoring Components

Script Complexity and Architecture (25% of Code Score)

Evaluation Criteria:

  • Lines of code per script relative to tier requirements
  • Function and class organization
  • Code modularity and reusability
  • Algorithm sophistication

Scoring Matrix:

| Tier | Excellent (90-100) | Good (75-89) | Satisfactory (60-74) | Poor (Below 60) |
|------|--------------------|--------------|-----------------------|------------------|
| BASIC | 200-300 LOC, well-structured | 150-199 LOC, organized | 100-149 LOC, basic | <100 LOC, minimal |
| STANDARD | 400-500 LOC, modular | 350-399 LOC, structured | 300-349 LOC, adequate | <300 LOC, basic |
| POWERFUL | 600-800 LOC, sophisticated | 550-599 LOC, advanced | 500-549 LOC, solid | <500 LOC, simple |
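The tier-dependent LOC thresholds in the matrix above lend themselves to a lookup table. This is a minimal sketch, assuming representative mid-band scores (95/82/67) that are not prescribed by the rubric itself:

```python
# LOC thresholds per tier, taken from the scoring matrix above.
# The paired scores are illustrative mid-band values, not canonical ones.
LOC_BANDS = {
    "BASIC":    [(200, 95), (150, 82), (100, 67)],
    "STANDARD": [(400, 95), (350, 82), (300, 67)],
    "POWERFUL": [(600, 95), (550, 82), (500, 67)],
}

def score_script_loc(tier: str, loc: int) -> int:
    """Score a script's line count against its tier's thresholds."""
    for threshold, score in LOC_BANDS[tier]:
        if loc >= threshold:
            return score
    return 50  # below the tier's Satisfactory floor
```

Note that the same 700-line script scores Excellent for a POWERFUL skill but would be far beyond the BASIC band: the thresholds are relative to tier expectations, not absolute.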

Error Handling Quality (25% of Code Score)

Scoring Criteria:

  • Excellent (90-100): Comprehensive exception handling, specific error types, recovery mechanisms
  • Good (75-89): Good exception handling, meaningful error messages, logging
  • Satisfactory (60-74): Basic try/except blocks, simple error messages
  • Poor (40-59): Minimal error handling, generic exceptions
  • Failing (0-39): No error handling or inappropriate handling

Error Handling Checklist:

  • Try/except blocks for risky operations
  • Specific exception types (not just Exception)
  • Meaningful error messages for users
  • Proper error logging or reporting
  • Graceful degradation where possible
  • Input validation and sanitization
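A function that ticks every box on the checklist above might look like the following sketch (the file-loading scenario and names are hypothetical, chosen only to demonstrate the pattern):

```python
import json
import logging
import sys

logger = logging.getLogger(__name__)

def load_config(path):
    """Load a JSON config file, demonstrating the error-handling checklist."""
    # Input validation before doing any risky work
    if not path.endswith(".json"):
        raise ValueError(f"Expected a .json file, got: {path}")
    try:
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    except FileNotFoundError:
        # Graceful degradation: fall back to defaults instead of crashing
        logger.error("Config file not found: %s — using defaults", path)
        return {}
    except json.JSONDecodeError as exc:
        # Specific exception type with an actionable, user-facing message
        logger.error("Invalid JSON in %s at line %d: %s", path, exc.lineno, exc.msg)
        sys.exit(1)
```

Catching `FileNotFoundError` and `json.JSONDecodeError` separately (rather than a bare `except Exception`) is what moves a script from the Satisfactory band into the Good/Excellent bands.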

Code Structure and Organization (25% of Code Score)

Evaluation Elements:

  • Function decomposition and single responsibility
  • Class design and inheritance patterns
  • Import organization and dependency management
  • Documentation and comments quality
  • Consistent naming conventions
  • PEP 8 compliance

Scoring Guidelines:

  • Excellent (90-100): Exemplary structure, comprehensive docstrings, perfect style
  • Good (75-89): Well-organized, good documentation, minor style issues
  • Satisfactory (60-74): Adequate structure, basic documentation, some style issues
  • Poor (40-59): Poor organization, minimal documentation, style problems
  • Failing (0-39): No clear structure, no documentation, major style violations

Output Format Support (25% of Code Score)

Required Capabilities:

  • JSON output format support
  • Human-readable output format
  • Proper data serialization
  • Consistent output structure
  • Error output handling
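One way to satisfy the dual-format requirement is a single render function that takes the result structure and a format flag. This is a sketch assuming a hypothetical result shape (`overall_score`, `grade`, `dimensions`), not the actual structure emitted by the scorer:

```python
import json

def render_report(results: dict, fmt: str = "human") -> str:
    """Emit the same result structure as JSON or a human-readable summary."""
    if fmt == "json":
        # Machine-readable: stable key order aids diffing in CI
        return json.dumps(results, indent=2, sort_keys=True)
    # Human-readable: aligned summary of the same data
    lines = [f"Overall: {results['overall_score']:.1f} ({results['grade']})"]
    for name, score in results["dimensions"].items():
        lines.append(f"  {name:<15} {score:5.1f}")
    return "\n".join(lines)
```

Keeping both formats backed by one data structure guarantees the "consistent output structure" criterion: the two views can never drift apart.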

Scoring Criteria:

  • Excellent (90-100): Dual format + custom formats, perfect serialization
  • Good (75-89): Dual format support, good serialization
  • Satisfactory (60-74): Single format well-implemented
  • Poor (40-59): Basic output, formatting issues
  • Failing (0-39): Poor or no structured output

Completeness (25% Weight)

Scoring Components

Directory Structure Compliance (25% of Completeness Score)

Required Directories by Tier:

  • BASIC: scripts/ (required), assets/ + references/ (recommended)
  • STANDARD: scripts/ + assets/ + references/ (required), expected_outputs/ (recommended)
  • POWERFUL: scripts/ + assets/ + references/ + expected_outputs/ (all required)
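The tier requirements above can be checked mechanically. A minimal sketch (function and return-shape names are illustrative, not the validator's actual API):

```python
from pathlib import Path

# Directory requirements per tier, as listed above
TIER_DIRS = {
    "BASIC":    {"required": ["scripts"],
                 "recommended": ["assets", "references"]},
    "STANDARD": {"required": ["scripts", "assets", "references"],
                 "recommended": ["expected_outputs"]},
    "POWERFUL": {"required": ["scripts", "assets", "references", "expected_outputs"],
                 "recommended": []},
}

def check_structure(skill_dir: str, tier: str) -> dict:
    """Report which required/recommended directories are missing for a tier."""
    root = Path(skill_dir)
    spec = TIER_DIRS[tier]
    return {
        "required_missing": [d for d in spec["required"] if not (root / d).is_dir()],
        "recommended_missing": [d for d in spec["recommended"] if not (root / d).is_dir()],
    }
```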

Scoring Calculation:

Structure Score = (Required Present / Required Total) * 0.6 + 
                  (Recommended Present / Recommended Total) * 0.4
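Translated directly into code, the 60/40 weighting looks like this. One edge case the formula leaves open is a tier with no recommended directories (POWERFUL); treating that as full recommended credit is an assumption made here:

```python
def structure_score(required_present: int, required_total: int,
                    recommended_present: int, recommended_total: int) -> float:
    """Structure compliance score in [0.0, 1.0], per the 60/40 formula above."""
    req = required_present / required_total if required_total else 1.0
    # Assumption: no recommended directories for the tier counts as full credit
    rec = recommended_present / recommended_total if recommended_total else 1.0
    return req * 0.6 + rec * 0.4
```

For example, a STANDARD skill with all three required directories but no expected_outputs/ scores 1.0 * 0.6 + 0.0 * 0.4 = 0.6.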

Asset Availability and Quality (25% of Completeness Score)

Scoring Criteria:

  • Excellent (90-100): 5+ diverse assets, multiple file types, realistic data
  • Good (75-89): 3-4 assets, some diversity, good quality
  • Satisfactory (60-74): 2-3 assets, basic variety
  • Poor (40-59): 1-2 minimal assets
  • Failing (0-39): No assets or unusable assets

Asset Quality Factors:

  • File diversity (JSON, CSV, YAML, etc.)
  • Data realism and complexity
  • Coverage of use cases
  • File size appropriateness
  • Documentation of asset purpose
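The file-diversity and size factors above are straightforward to gather automatically. A minimal inventory sketch (the return shape is illustrative):

```python
from pathlib import Path
from collections import Counter

def inventory_assets(assets_dir: str) -> dict:
    """Summarize asset diversity: file count, count per extension, total size."""
    files = [p for p in Path(assets_dir).rglob("*") if p.is_file()]
    return {
        "count": len(files),
        # Extension histogram feeds the "file diversity" factor
        "types": Counter(p.suffix.lstrip(".").lower() or "none" for p in files),
        "total_bytes": sum(p.stat().st_size for p in files),
    }
```

Data realism and use-case coverage, by contrast, are qualitative factors that the manual review process below must judge.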

Expected Output Coverage (25% of Completeness Score)

Evaluation Criteria:

  • Correspondence with asset files
  • Coverage of success and error scenarios
  • Output format variety
  • Reproducibility and accuracy

Scoring Matrix:

  • Excellent (90-100): Complete output coverage, all scenarios, verified accuracy
  • Good (75-89): Good coverage, most scenarios, mostly accurate
  • Satisfactory (60-74): Basic coverage, main scenarios
  • Poor (40-59): Minimal coverage, some inaccuracies
  • Failing (0-39): No expected outputs or completely inaccurate

Test Coverage and Validation (25% of Completeness Score)

Assessment Areas:

  • Sample data processing capability
  • Output verification mechanisms
  • Edge case handling
  • Error condition testing
  • Integration test scenarios

Scoring Guidelines:

  • Excellent (90-100): Comprehensive test coverage, automated validation
  • Good (75-89): Good test coverage, manual validation possible
  • Satisfactory (60-74): Basic testing capability
  • Poor (40-59): Minimal testing support
  • Failing (0-39): No testing or validation capability

Usability (25% Weight)

Scoring Components

Installation and Setup Simplicity (25% of Usability Score)

Evaluation Factors:

  • Dependency requirements (Python stdlib preferred)
  • Setup complexity
  • Environment requirements
  • Installation documentation clarity

Scoring Criteria:

  • Excellent (90-100): Zero external dependencies, single-file execution
  • Good (75-89): Minimal dependencies, simple setup
  • Satisfactory (60-74): Some dependencies, documented setup
  • Poor (40-59): Complex dependencies, unclear setup
  • Failing (0-39): Unable to install or excessive complexity

Usage Clarity and Help Quality (25% of Usability Score)

Assessment Elements:

  • Command-line help comprehensiveness
  • Usage example clarity
  • Parameter documentation quality
  • Error message helpfulness

Help Quality Checklist:

  • Comprehensive --help output
  • Clear parameter descriptions
  • Usage examples included
  • Error messages are actionable
  • Progress indicators where appropriate
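A --help output covering the checklist can be built with argparse. This sketch is illustrative only: the flags and defaults shown are assumptions, not the actual interface of quality_scorer.py.

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    """Build a CLI parser whose --help satisfies the checklist above.

    The prog name echoes the scorer script; all flags here are hypothetical."""
    parser = argparse.ArgumentParser(
        prog="quality_scorer.py",
        description="Score a skill directory against the quality rubric.",
        # Usage example surfaces directly in --help output
        epilog="Example: quality_scorer.py ./my-skill --tier POWERFUL --format json",
    )
    parser.add_argument("skill_dir", help="Path to the skill directory to score")
    parser.add_argument("--tier", choices=["BASIC", "STANDARD", "POWERFUL"],
                        default="BASIC",
                        help="Tier whose thresholds apply (default: BASIC)")
    parser.add_argument("--format", choices=["json", "human"], default="human",
                        help="Output format (default: human)")
    return parser
```

Listing `choices` and defaults in each `help` string is what makes parameter descriptions "clear" in the scoring matrix sense: the user never has to read the source to learn valid values.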

Scoring Matrix:

  • Excellent (90-100): Exemplary help, multiple examples, perfect error messages
  • Good (75-89): Good help quality, clear examples, helpful errors
  • Satisfactory (60-74): Adequate help, basic examples
  • Poor (40-59): Minimal help, confusing interface
  • Failing (0-39): No help or completely unclear interface

Documentation Accessibility (25% of Usability Score)

Evaluation Criteria:

  • README quick start effectiveness
  • SKILL.md navigation and structure
  • Reference material organization
  • Learning curve considerations

Accessibility Factors:

  • Information hierarchy clarity
  • Cross-reference quality
  • Beginner-friendly explanations
  • Advanced user shortcuts
  • Troubleshooting guidance

Practical Example Quality (25% of Usability Score)

Assessment Areas:

  • Example realism and relevance
  • Complexity progression (simple to advanced)
  • Output demonstration
  • Common use case coverage
  • Integration scenarios

Scoring Guidelines:

  • Excellent (90-100): 5+ examples, perfect progression, real-world scenarios
  • Good (75-89): 3-4 examples, good variety, practical scenarios
  • Satisfactory (60-74): 2-3 examples, adequate coverage
  • Poor (40-59): 1-2 examples, limited practical value
  • Failing (0-39): No examples or completely impractical

Scoring Calculations

Dimension Score Calculation

Each dimension score is calculated as a weighted average of its components:

def calculate_dimension_score(components):
    """Weighted average of component scores.

    `components` maps component name -> {'score': float, 'weight': float}.
    """
    total_weighted_score = 0
    total_weight = 0

    for component_data in components.values():
        total_weighted_score += component_data['score'] * component_data['weight']
        total_weight += component_data['weight']

    return total_weighted_score / total_weight if total_weight > 0 else 0

Overall Score Calculation

The overall score combines all dimensions with equal weighting:

def calculate_overall_score(dimensions):
    return sum(dimension.score * 0.25 for dimension in dimensions.values())

Letter Grade Assignment

def assign_letter_grade(overall_score):
    if overall_score >= 95: return "A+"
    elif overall_score >= 90: return "A"
    elif overall_score >= 85: return "A-"
    elif overall_score >= 80: return "B+"
    elif overall_score >= 75: return "B"
    elif overall_score >= 70: return "B-"
    elif overall_score >= 65: return "C+"
    elif overall_score >= 60: return "C"
    elif overall_score >= 55: return "C-"
    elif overall_score >= 50: return "D"
    else: return "F"
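A worked example ties the calculations together. The functions are re-stated compactly here so the sketch is self-contained; the component scores used are made up for illustration:

```python
def calculate_dimension_score(components):
    # Weighted average over {'score', 'weight'} entries
    total = sum(c["score"] * c["weight"] for c in components.values())
    weight = sum(c["weight"] for c in components.values())
    return total / weight if weight else 0

def assign_letter_grade(score):
    for floor, grade in [(95, "A+"), (90, "A"), (85, "A-"), (80, "B+"), (75, "B"),
                         (70, "B-"), (65, "C+"), (60, "C"), (55, "C-"), (50, "D")]:
        if score >= floor:
            return grade
    return "F"

# Documentation dimension: weights from the rubric, scores invented for the example
docs = calculate_dimension_score({
    "skill_md":   {"score": 88, "weight": 0.40},
    "readme":     {"score": 80, "weight": 0.25},
    "references": {"score": 75, "weight": 0.20},
    "examples":   {"score": 70, "weight": 0.15},
})  # -> 80.7

# Equal 25% weighting across the four dimensions (other three scores assumed)
overall = sum(d * 0.25 for d in [docs, 82, 78, 85])
print(round(overall, 1), assign_letter_grade(overall))  # prints "81.4 B+"
```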

Quality Improvement Recommendations

Score-Based Recommendations

For Scores Below 60 (C- or Lower)

Priority Actions:

  1. Address fundamental structural issues
  2. Implement basic error handling
  3. Add essential documentation sections
  4. Create minimal viable examples
  5. Fix critical functionality issues

For Scores 60-74 (C to B-)

Improvement Areas:

  1. Expand documentation comprehensiveness
  2. Enhance error handling sophistication
  3. Add more diverse examples and use cases
  4. Improve code organization and structure
  5. Increase test coverage and validation

For Scores 75-84 (B to B+)

Enhancement Opportunities:

  1. Refine documentation for expert-level quality
  2. Implement advanced error recovery mechanisms
  3. Add comprehensive reference materials
  4. Optimize code architecture and performance
  5. Develop extensive example library

For Scores 85+ (A- or Higher)

Excellence Maintenance:

  1. Regular quality audits and updates
  2. Community feedback integration
  3. Best practice evolution tracking
  4. Mentoring lower-quality skills
  5. Innovation and cutting-edge feature adoption

Dimension-Specific Improvement Strategies

Low Documentation Scores

  • Expand SKILL.md with technical details
  • Add comprehensive API reference
  • Include architecture diagrams and explanations
  • Develop troubleshooting guides
  • Create contributor documentation

Low Code Quality Scores

  • Refactor for better modularity
  • Implement comprehensive error handling
  • Add extensive code documentation
  • Apply advanced design patterns
  • Optimize performance and efficiency

Low Completeness Scores

  • Add missing directories and files
  • Develop comprehensive sample datasets
  • Create expected output libraries
  • Implement automated testing
  • Add integration examples

Low Usability Scores

  • Simplify installation process
  • Improve command-line interface design
  • Enhance help text and documentation
  • Create beginner-friendly tutorials
  • Add interactive examples

Quality Assurance Process

Automated Scoring

The quality scorer runs automated assessments based on this rubric:

  1. File system analysis for structure compliance
  2. Content analysis for documentation quality
  3. Code analysis for quality metrics
  4. Asset inventory and quality assessment

Manual Review Process

Human reviewers validate automated scores and provide qualitative insights:

  1. Content quality assessment beyond automated metrics
  2. Usability testing with real-world scenarios
  3. Technical accuracy verification
  4. Community value assessment

Continuous Improvement

The scoring rubric evolves based on:

  • Community feedback and usage patterns
  • Industry best practice changes
  • Tool capability enhancements
  • Quality trend analysis

This quality scoring rubric ensures consistent, objective, and comprehensive assessment of all skills within the claude-skills ecosystem while providing clear guidance for quality improvement.