feat: Add multi-level confidence filtering for pattern detection (fixes #240)
## Problem
Pattern detection was producing too many low-confidence patterns:
- 905 patterns detected (overwhelming)
- Many with confidence as low as 0.50
- 4,875 lines in patterns index.md
- Low signal-to-noise ratio
## Solution
### 1. Added Confidence Thresholds (pattern_recognizer.py)
```python
CONFIDENCE_THRESHOLDS = {
'critical': 0.80, # High-confidence for ARCHITECTURE.md
'high': 0.70, # Detailed analysis
'medium': 0.60, # Include with warning
'low': 0.50, # Minimum detection
}
```
### 2. Created Filtering Utilities (pattern_recognizer.py:1650-1723)
- `filter_patterns_by_confidence()` - Filter by threshold
- `create_multi_level_report()` - Multi-level grouping with statistics
### 3. Multi-Level Output Files (codebase_scraper.py:1009-1055)
Now generates 4 output files:
- **all_patterns.json** - All detected patterns (unfiltered)
- **high_confidence_patterns.json** - Patterns ≥ 0.70 (for detailed analysis)
- **critical_patterns.json** - Patterns ≥ 0.80 (for ARCHITECTURE.md)
- **summary.json** - Statistics and thresholds
### 4. Enhanced Logging
```
✅ Detected 4 patterns in 1 files
🔴 Critical (≥0.80): 0 patterns
🟠 High (≥0.70): 0 patterns
🟡 Medium (≥0.60): 1 patterns
⚪ Low (<0.60): 3 patterns
```
## Results
**Before:**
- Single output file with all patterns
- No confidence-based filtering
- Overwhelming amount of data
**After:**
- 4 output files by confidence level
- Clear quality indicators (🔴🟠🟡⚪)
- Easy to find high-quality patterns
- Statistics in summary.json
**Example Output:**
```json
{
"statistics": {
"total": 4,
"critical_count": 0,
"high_confidence_count": 0,
"medium_count": 1,
"low_count": 3
},
"thresholds": {
"critical": 0.80,
"high": 0.70,
"medium": 0.60,
"low": 0.50
}
}
```
## Benefits
1. **Better Signal-to-Noise Ratio**
- Focus on high-confidence patterns
- Low-confidence patterns separate
2. **Flexible Usage**
- ARCHITECTURE.md uses critical_patterns.json
- Detailed analysis uses high_confidence_patterns.json
- Debug/research uses all_patterns.json
3. **Clear Quality Indicators**
- Visual indicators (🔴🟠🟡⚪)
- Explicit thresholds documented
- Statistics for quick assessment
4. **Backward Compatible**
- all_patterns.json maintains full data
- No breaking changes to existing code
- Additional files are opt-in
## Testing
**Test project:**
```python
class SingletonDatabase: # Detected with varying confidence
class UserFactory: # Detected patterns
class Logger: # Observer pattern (0.60 confidence)
```
**Results:**
- ✅ All 41 tests passing
- ✅ Multi-level filtering works correctly
- ✅ Statistics accurate
- ✅ Output files created properly
## Future Improvements (Not in this PR)
- Context-aware confidence boosting (pattern in design_patterns/ dir)
- Pattern count limits (top N per file/type)
- AI-enhanced confidence scoring
- Per-language threshold tuning
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
@@ -28,6 +28,17 @@ from pathlib import Path
|
||||
|
||||
logger = logging.getLogger(__name__)
|
||||
|
||||
# Confidence thresholds for pattern filtering (Issue #240)
|
||||
CONFIDENCE_THRESHOLDS = {
|
||||
'critical': 0.80, # High-confidence patterns for ARCHITECTURE.md
|
||||
'high': 0.70, # Include in detailed analysis
|
||||
'medium': 0.60, # Include with warning/context
|
||||
'low': 0.50, # Minimum detection threshold
|
||||
}
|
||||
|
||||
# Default minimum confidence for pattern detection
|
||||
DEFAULT_MIN_CONFIDENCE = CONFIDENCE_THRESHOLDS['low']
|
||||
|
||||
|
||||
@dataclass
|
||||
class PatternInstance:
|
||||
@@ -1636,6 +1647,76 @@ class LanguageAdapter:
|
||||
return pattern
|
||||
|
||||
|
||||
# ============================================================================
|
||||
# PATTERN FILTERING UTILITIES (Issue #240 - C4.2)
|
||||
# ============================================================================
|
||||
|
||||
|
||||
def filter_patterns_by_confidence(patterns: list[dict], min_confidence: float) -> list[dict]:
|
||||
"""
|
||||
Filter patterns by minimum confidence threshold.
|
||||
|
||||
Args:
|
||||
patterns: List of pattern dictionaries (from PatternReport.to_dict())
|
||||
min_confidence: Minimum confidence threshold (0.0-1.0)
|
||||
|
||||
Returns:
|
||||
Filtered list of patterns meeting the threshold
|
||||
"""
|
||||
filtered = []
|
||||
for pattern in patterns:
|
||||
if pattern.get("confidence", 0.0) >= min_confidence:
|
||||
filtered.append(pattern)
|
||||
return filtered
|
||||
|
||||
|
||||
def create_multi_level_report(pattern_results: list[dict]) -> dict:
|
||||
"""
|
||||
Create multi-level pattern report with different confidence thresholds.
|
||||
|
||||
Args:
|
||||
pattern_results: List of PatternReport dictionaries
|
||||
|
||||
Returns:
|
||||
Dictionary with patterns grouped by confidence level:
|
||||
- all_patterns: All detected patterns
|
||||
- high_confidence: Patterns >= 0.70 (for detailed analysis)
|
||||
- critical: Patterns >= 0.80 (for ARCHITECTURE.md)
|
||||
- statistics: Pattern count by level
|
||||
"""
|
||||
# Flatten all patterns from all files
|
||||
all_patterns = []
|
||||
for report in pattern_results:
|
||||
file_path = report.get("file_path", "unknown")
|
||||
for pattern in report.get("patterns", []):
|
||||
# Add file path to pattern for context
|
||||
pattern_with_file = {**pattern, "file_path": file_path}
|
||||
all_patterns.append(pattern_with_file)
|
||||
|
||||
# Sort by confidence (highest first)
|
||||
all_patterns_sorted = sorted(all_patterns, key=lambda p: p.get("confidence", 0.0), reverse=True)
|
||||
|
||||
# Filter by confidence levels
|
||||
critical = filter_patterns_by_confidence(all_patterns_sorted, CONFIDENCE_THRESHOLDS['critical'])
|
||||
high_confidence = filter_patterns_by_confidence(all_patterns_sorted, CONFIDENCE_THRESHOLDS['high'])
|
||||
medium = filter_patterns_by_confidence(all_patterns_sorted, CONFIDENCE_THRESHOLDS['medium'])
|
||||
|
||||
return {
|
||||
"all_patterns": all_patterns_sorted,
|
||||
"critical": critical,
|
||||
"high_confidence": high_confidence,
|
||||
"medium": medium,
|
||||
"statistics": {
|
||||
"total": len(all_patterns_sorted),
|
||||
"critical_count": len(critical),
|
||||
"high_confidence_count": len(high_confidence),
|
||||
"medium_count": len(medium),
|
||||
"low_count": len(all_patterns_sorted) - len(medium),
|
||||
},
|
||||
"thresholds": CONFIDENCE_THRESHOLDS,
|
||||
}
|
||||
|
||||
|
||||
def main():
|
||||
"""
|
||||
CLI entry point for pattern detection.
|
||||
|
||||
Reference in New Issue
Block a user