firefrost-gaming/claude-skills-reference

Files

xingzihai e0e683ee5e fix(skill-tester): make Security dimension opt-in with --include-security flag

- Add --include-security flag to quality_scorer.py
- Default: 4 dimensions × 25% (backward compatible)
- With --include-security: 5 dimensions × 20%
- Update tier recommendation logic for optional Security
- Update documentation to reflect opt-in behavior

This addresses the breaking change concern from PR review:
the weight change from 25% to 20% would affect all existing
audit baselines. The new opt-in approach preserves backward
compatibility.

2026-03-27 10:05:12 +00:00

20 KiB

Raw Blame History

Quality Scoring Rubric

Version: 2.0.0
Last Updated: 2026-03-27
Authority: Claude Skills Engineering Team

Overview

This document defines the comprehensive quality scoring methodology used to assess skills within the claude-skills ecosystem. The scoring system evaluates four key dimensions by default (each weighted at 25%), with an optional fifth Security dimension (enabled via --include-security flag).

Dimension Configuration

Default Mode (backward compatible):

Documentation Quality: 25%
Code Quality: 25%
Completeness: 25%
Usability: 25%

With --include-security flag:

Documentation Quality: 20%
Code Quality: 20%
Completeness: 20%
Security: 20%
Usability: 20%

Scoring Framework

Overall Scoring Scale

A+ (95-100): Exceptional quality, exceeds all standards
A (90-94): Excellent quality, meets highest standards consistently
A- (85-89): Very good quality, minor areas for improvement
B+ (80-84): Good quality, meets most standards well
B (75-79): Satisfactory quality, meets standards adequately
B- (70-74): Below average, several areas need improvement
C+ (65-69): Poor quality, significant improvements needed
C (60-64): Minimal acceptable quality, major improvements required
C- (55-59): Unacceptable quality, extensive rework needed
D (50-54): Very poor quality, fundamental issues present
F (0-49): Failing quality, does not meet basic standards

Dimension Weights (Default: 4 dimensions × 25%)

Each dimension contributes equally to the overall score:

Documentation Quality: 25%
Code Quality: 25%
Completeness: 25%
Usability: 25%

When --include-security is used, all five dimensions are weighted at 20% each.

Documentation Quality (20% Weight)

Scoring Components

SKILL.md Quality (40% of Documentation Score)

Component Breakdown:

Length and Depth (25%): Line count and content substance
Frontmatter Quality (25%): Completeness and accuracy of YAML metadata
Section Coverage (25%): Required and recommended section presence
Content Depth (25%): Technical detail and comprehensiveness

Scoring Criteria:

Score Range	Length	Frontmatter	Sections	Depth
90-100	400+ lines	All fields complete + extras	All required + 4+ recommended	Rich technical detail, examples
80-89	300-399 lines	All required fields complete	All required + 2-3 recommended	Good technical coverage
70-79	200-299 lines	Most required fields	All required + 1 recommended	Adequate technical content
60-69	150-199 lines	Some required fields	Most required sections	Basic technical information
50-59	100-149 lines	Minimal frontmatter	Some required sections	Limited technical detail
Below 50	<100 lines	Missing/invalid frontmatter	Few/no required sections	Insufficient content

README.md Quality (25% of Documentation Score)

Scoring Criteria:

Excellent (90-100): 1000+ chars, comprehensive usage guide, examples, troubleshooting
Good (75-89): 500-999 chars, clear usage instructions, basic examples
Satisfactory (60-74): 200-499 chars, minimal usage information
Poor (40-59): <200 chars or confusing content
Failing (0-39): Missing or completely inadequate

Reference Documentation (20% of Documentation Score)

Scoring Criteria:

Excellent (90-100): Multiple comprehensive reference docs (2000+ chars total)
Good (75-89): 2-3 reference files with substantial content
Satisfactory (60-74): 1-2 reference files with adequate content
Poor (40-59): Minimal reference content or poor quality
Failing (0-39): No reference documentation

Examples and Usage Clarity (15% of Documentation Score)

Scoring Criteria:

Excellent (90-100): 5+ diverse examples, clear usage patterns
Good (75-89): 3-4 examples covering different scenarios
Satisfactory (60-74): 2-3 basic examples
Poor (40-59): 1-2 minimal examples
Failing (0-39): No examples or unclear usage

Code Quality (20% Weight)

Scoring Components

Script Complexity and Architecture (25% of Code Score)

Evaluation Criteria:

Lines of code per script relative to tier requirements
Function and class organization
Code modularity and reusability
Algorithm sophistication

Scoring Matrix:

Tier	Excellent (90-100)	Good (75-89)	Satisfactory (60-74)	Poor (Below 60)
BASIC	200-300 LOC, well-structured	150-199 LOC, organized	100-149 LOC, basic	<100 LOC, minimal
STANDARD	400-500 LOC, modular	350-399 LOC, structured	300-349 LOC, adequate	<300 LOC, basic
POWERFUL	600-800 LOC, sophisticated	550-599 LOC, advanced	500-549 LOC, solid	<500 LOC, simple

Error Handling Quality (25% of Code Score)

Scoring Criteria:

Excellent (90-100): Comprehensive exception handling, specific error types, recovery mechanisms
Good (75-89): Good exception handling, meaningful error messages, logging
Satisfactory (60-74): Basic try/except blocks, simple error messages
Poor (40-59): Minimal error handling, generic exceptions
Failing (0-39): No error handling or inappropriate handling

Error Handling Checklist:

Try/except blocks for risky operations
Specific exception types (not just Exception)
Meaningful error messages for users
Proper error logging or reporting
Graceful degradation where possible
Input validation and sanitization

Code Structure and Organization (25% of Code Score)

Evaluation Elements:

Function decomposition and single responsibility
Class design and inheritance patterns
Import organization and dependency management
Documentation and comments quality
Consistent naming conventions
PEP 8 compliance

Scoring Guidelines:

Excellent (90-100): Exemplary structure, comprehensive docstrings, perfect style
Good (75-89): Well-organized, good documentation, minor style issues
Satisfactory (60-74): Adequate structure, basic documentation, some style issues
Poor (40-59): Poor organization, minimal documentation, style problems
Failing (0-39): No clear structure, no documentation, major style violations

Output Format Support (25% of Code Score)

Required Capabilities:

JSON output format support
Human-readable output format
Proper data serialization
Consistent output structure
Error output handling

Scoring Criteria:

Excellent (90-100): Dual format + custom formats, perfect serialization
Good (75-89): Dual format support, good serialization
Satisfactory (60-74): Single format well-implemented
Poor (40-59): Basic output, formatting issues
Failing (0-39): Poor or no structured output

Completeness (20% Weight)

Scoring Components

Directory Structure Compliance (25% of Completeness Score)

Required Directories by Tier:

BASIC: scripts/ (required), assets/ + references/ (recommended)
STANDARD: scripts/ + assets/ + references/ (required), expected_outputs/ (recommended)
POWERFUL: scripts/ + assets/ + references/ + expected_outputs/ (all required)

Scoring Calculation:

Structure Score = (Required Present / Required Total) * 0.6 + 
                  (Recommended Present / Recommended Total) * 0.4

Asset Availability and Quality (25% of Completeness Score)

Scoring Criteria:

Excellent (90-100): 5+ diverse assets, multiple file types, realistic data
Good (75-89): 3-4 assets, some diversity, good quality
Satisfactory (60-74): 2-3 assets, basic variety
Poor (40-59): 1-2 minimal assets
Failing (0-39): No assets or unusable assets

Asset Quality Factors:

File diversity (JSON, CSV, YAML, etc.)
Data realism and complexity
Coverage of use cases
File size appropriateness
Documentation of asset purpose

Expected Output Coverage (25% of Completeness Score)

Evaluation Criteria:

Correspondence with asset files
Coverage of success and error scenarios
Output format variety
Reproducibility and accuracy

Scoring Matrix:

Excellent (90-100): Complete output coverage, all scenarios, verified accuracy
Good (75-89): Good coverage, most scenarios, mostly accurate
Satisfactory (60-74): Basic coverage, main scenarios
Poor (40-59): Minimal coverage, some inaccuracies
Failing (0-39): No expected outputs or completely inaccurate

Test Coverage and Validation (25% of Completeness Score)

Assessment Areas:

Sample data processing capability
Output verification mechanisms
Edge case handling
Error condition testing
Integration test scenarios

Scoring Guidelines:

Excellent (90-100): Comprehensive test coverage, automated validation
Good (75-89): Good test coverage, manual validation possible
Satisfactory (60-74): Basic testing capability
Poor (40-59): Minimal testing support
Failing (0-39): No testing or validation capability

Usability (20% Weight)

Scoring Components

Installation and Setup Simplicity (25% of Usability Score)

Evaluation Factors:

Dependency requirements (Python stdlib preferred)
Setup complexity
Environment requirements
Installation documentation clarity

Scoring Criteria:

Excellent (90-100): Zero external dependencies, single-file execution
Good (75-89): Minimal dependencies, simple setup
Satisfactory (60-74): Some dependencies, documented setup
Poor (40-59): Complex dependencies, unclear setup
Failing (0-39): Unable to install or excessive complexity

Usage Clarity and Help Quality (25% of Usability Score)

Assessment Elements:

Command-line help comprehensiveness
Usage example clarity
Parameter documentation quality
Error message helpfulness

Help Quality Checklist:

Comprehensive --help output
Clear parameter descriptions
Usage examples included
Error messages are actionable
Progress indicators where appropriate

Scoring Matrix:

Excellent (90-100): Exemplary help, multiple examples, perfect error messages
Good (75-89): Good help quality, clear examples, helpful errors
Satisfactory (60-74): Adequate help, basic examples
Poor (40-59): Minimal help, confusing interface
Failing (0-39): No help or completely unclear interface

Documentation Accessibility (25% of Usability Score)

Evaluation Criteria:

README quick start effectiveness
SKILL.md navigation and structure
Reference material organization
Learning curve considerations

Accessibility Factors:

Information hierarchy clarity
Cross-reference quality
Beginner-friendly explanations
Advanced user shortcuts
Troubleshooting guidance

Practical Example Quality (25% of Usability Score)

Assessment Areas:

Example realism and relevance
Complexity progression (simple to advanced)
Output demonstration
Common use case coverage
Integration scenarios

Scoring Guidelines:

Excellent (90-100): 5+ examples, perfect progression, real-world scenarios
Good (75-89): 3-4 examples, good variety, practical scenarios
Satisfactory (60-74): 2-3 examples, adequate coverage
Poor (40-59): 1-2 examples, limited practical value
Failing (0-39): No examples or completely impractical

Scoring Calculations

Dimension Score Calculation

Each dimension score is calculated as a weighted average of its components:

def calculate_dimension_score(components):
    total_weighted_score = 0
    total_weight = 0
    
    for component_name, component_data in components.items():
        score = component_data['score']
        weight = component_data['weight']
        total_weighted_score += score * weight
        total_weight += weight
    
    return total_weighted_score / total_weight if total_weight > 0 else 0

Overall Score Calculation

The overall score combines all dimensions with equal weighting:

def calculate_overall_score(dimensions):
    return sum(dimension.score * 0.25 for dimension in dimensions.values())

Letter Grade Assignment

def assign_letter_grade(overall_score):
    if overall_score >= 95: return "A+"
    elif overall_score >= 90: return "A"
    elif overall_score >= 85: return "A-"
    elif overall_score >= 80: return "B+"
    elif overall_score >= 75: return "B"
    elif overall_score >= 70: return "B-"
    elif overall_score >= 65: return "C+"
    elif overall_score >= 60: return "C"
    elif overall_score >= 55: return "C-"
    elif overall_score >= 50: return "D"
    else: return "F"

Quality Improvement Recommendations

Score-Based Recommendations

For Scores Below 60 (C- or Lower)

Priority Actions:

Address fundamental structural issues
Implement basic error handling
Add essential documentation sections
Create minimal viable examples
Fix critical functionality issues

For Scores 60-74 (C+ to B-)

Improvement Areas:

Expand documentation comprehensiveness
Enhance error handling sophistication
Add more diverse examples and use cases
Improve code organization and structure
Increase test coverage and validation

For Scores 75-84 (B to B+)

Enhancement Opportunities:

Refine documentation for expert-level quality
Implement advanced error recovery mechanisms
Add comprehensive reference materials
Optimize code architecture and performance
Develop extensive example library

For Scores 85+ (A- or Higher)

Excellence Maintenance:

Regular quality audits and updates
Community feedback integration
Best practice evolution tracking
Mentoring lower-quality skills
Innovation and cutting-edge feature adoption

Dimension-Specific Improvement Strategies

Low Documentation Scores

Expand SKILL.md with technical details
Add comprehensive API reference
Include architecture diagrams and explanations
Develop troubleshooting guides
Create contributor documentation

Low Code Quality Scores

Refactor for better modularity
Implement comprehensive error handling
Add extensive code documentation
Apply advanced design patterns
Optimize performance and efficiency

Low Completeness Scores

Add missing directories and files
Develop comprehensive sample datasets
Create expected output libraries
Implement automated testing
Add integration examples

Low Usability Scores

Simplify installation process
Improve command-line interface design
Enhance help text and documentation
Create beginner-friendly tutorials
Add interactive examples

Security (Optional, 20% Weight when enabled)

Overview

The Security dimension evaluates Python scripts for security vulnerabilities and best practices. This dimension is optional and only evaluated when the --include-security flag is passed to the quality scorer.

Important: By default, the quality scorer uses 4 dimensions × 25% weights for backward compatibility. To include Security assessment, use:

python quality_scorer.py <skill_path> --include-security

When Security is enabled, all dimensions are rebalanced to 20% each (5 dimensions × 20% = 100%).

This dimension is critical for ensuring that skills do not introduce security risks into the claude-skills ecosystem.

Scoring Components

Sensitive Data Exposure Prevention (25% of Security Score)

Component Breakdown:

Hardcoded Credentials Detection: Passwords, API keys, tokens, secrets
AWS Credential Detection: Access keys and secret keys
Private Key Detection: RSA, SSH, and other private keys
JWT Token Detection: JSON Web Tokens in code

Scoring Criteria:

Score Range	Criteria
90-100	No hardcoded credentials, uses environment variables properly
75-89	Minor issues (e.g., placeholder values that aren't real secrets)
60-74	One or two low-severity issues
40-59	Multiple medium-severity issues
Below 40	Critical hardcoded secrets detected

Safe File Operations (25% of Security Score)

Component Breakdown:

Path Traversal Detection: ../, URL-encoded variants, Unicode variants
String Concatenation Risks: open(path + user_input)
Null Byte Injection: %00, \x00
Safe Pattern Usage: pathlib.Path, os.path.basename

Scoring Criteria:

Score Range	Criteria
90-100	Uses pathlib/os.path safely, no path traversal vulnerabilities
75-89	Minor issues, uses safe patterns mostly
60-74	Some path concatenation with user input
40-59	Path traversal patterns detected
Below 40	Critical vulnerabilities present

Command Injection Prevention (25% of Security Score)

Component Breakdown:

Dangerous Functions: os.system(), eval(), exec(), subprocess with shell=True
Safe Alternatives: subprocess.run(args, shell=False), shlex.quote(), shlex.split()

Scoring Criteria:

Score Range	Criteria
90-100	No command injection risks, uses subprocess safely
75-89	Minor issues, mostly safe patterns
60-74	Some use of shell=True or eval with safe context
40-59	Command injection patterns detected
Below 40	Critical vulnerabilities (unfiltered user input to shell)

Input Validation Quality (25% of Security Score)

Component Breakdown:

Argparse Usage: CLI argument validation
Type Checking: isinstance(), type hints
Error Handling: try/except blocks
Input Sanitization: Regex validation, input cleaning

Scoring Criteria:

Score Range	Criteria
90-100	Comprehensive input validation, proper error handling
75-89	Good validation coverage, most inputs checked
60-74	Basic validation present
40-59	Minimal input validation
Below 40	No input validation

Security Best Practices

Recommended Patterns:

# Use environment variables for secrets
import os
password = os.environ.get("PASSWORD")

# Use pathlib for safe path operations
from pathlib import Path
safe_path = Path(base_dir) / user_input

# Use subprocess safely
import subprocess
result = subprocess.run(["ls", user_input], capture_output=True)

# Use shlex for shell argument safety
import shlex
safe_arg = shlex.quote(user_input)

Patterns to Avoid:

# Don't hardcode secrets
password = "my_secret_password"  # BAD

# Don't use string concatenation for paths
open(base_path + "/" + user_input)  # BAD

# Don't use shell=True with user input
os.system(f"ls {user_input}")  # BAD

# Don't use eval on user input
eval(user_input)  # VERY BAD

Security Score Impact on Tiers

Note: Security requirements only apply when --include-security is used.

When Security dimension is enabled:

POWERFUL Tier: Requires Security score ≥ 70
STANDARD Tier: Requires Security score ≥ 50
BASIC Tier: No minimum Security requirement

When Security dimension is not enabled (default):

Tier recommendations are based on the 4 core dimensions (Documentation, Code Quality, Completeness, Usability)

Low Security Scores

Remove hardcoded credentials, use environment variables
Fix path traversal vulnerabilities
Replace dangerous functions with safe alternatives
Add input validation and error handling

Quality Assurance Process

Automated Scoring

The quality scorer runs automated assessments based on this rubric:

File system analysis for structure compliance
Content analysis for documentation quality
Code analysis for quality metrics
Asset inventory and quality assessment

Manual Review Process

Human reviewers validate automated scores and provide qualitative insights:

Content quality assessment beyond automated metrics
Usability testing with real-world scenarios
Technical accuracy verification
Community value assessment

Continuous Improvement

The scoring rubric evolves based on:

Community feedback and usage patterns
Industry best practice changes
Tool capability enhancements
Quality trend analysis

This quality scoring rubric ensures consistent, objective, and comprehensive assessment of all skills within the claude-skills ecosystem while providing clear guidance for quality improvement.

20 KiB Raw Blame History Unescape Escape

Quality Scoring Rubric

Overview

Dimension Configuration

Scoring Framework

Overall Scoring Scale

Dimension Weights (Default: 4 dimensions × 25%)

Documentation Quality (20% Weight)

Scoring Components

SKILL.md Quality (40% of Documentation Score)

README.md Quality (25% of Documentation Score)

Reference Documentation (20% of Documentation Score)

Examples and Usage Clarity (15% of Documentation Score)

Code Quality (20% Weight)

Scoring Components

Script Complexity and Architecture (25% of Code Score)

Error Handling Quality (25% of Code Score)

Code Structure and Organization (25% of Code Score)

Output Format Support (25% of Code Score)

Completeness (20% Weight)

Scoring Components

Directory Structure Compliance (25% of Completeness Score)

Asset Availability and Quality (25% of Completeness Score)

Expected Output Coverage (25% of Completeness Score)

Test Coverage and Validation (25% of Completeness Score)

Usability (20% Weight)

Scoring Components

Installation and Setup Simplicity (25% of Usability Score)

Usage Clarity and Help Quality (25% of Usability Score)

Documentation Accessibility (25% of Usability Score)

Practical Example Quality (25% of Usability Score)

Scoring Calculations

Dimension Score Calculation

Overall Score Calculation

Letter Grade Assignment

Quality Improvement Recommendations

Score-Based Recommendations

For Scores Below 60 (C- or Lower)

For Scores 60-74 (C+ to B-)

For Scores 75-84 (B to B+)

For Scores 85+ (A- or Higher)

Dimension-Specific Improvement Strategies

Low Documentation Scores

Low Code Quality Scores

Low Completeness Scores

Low Usability Scores

Security (Optional, 20% Weight when enabled)

Overview

Scoring Components

Sensitive Data Exposure Prevention (25% of Security Score)

Safe File Operations (25% of Security Score)

Command Injection Prevention (25% of Security Score)

Input Validation Quality (25% of Security Score)

Security Best Practices

Security Score Impact on Tiers

Low Security Scores

Quality Assurance Process

Automated Scoring

Manual Review Process

Continuous Improvement

20 KiB

Raw Blame History