- Created LanguageDetector class supporting 20+ programming languages
- Confidence-based detection with customizable thresholds (min_confidence parameter)
- Replaces duplicate language detection code in doc_scraper.py and pdf_extractor_poc.py
- Comprehensive test suite with 100+ test cases

Changes:
- NEW: src/skill_seekers/cli/language_detector.py (17 KB)
  - Unified detector with pattern matching for 20+ languages
  - Confidence scoring (0.0-1.0 scale)
  - Supports: Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby, Swift, Kotlin, Shell, SQL, HTML, CSS, JSON, YAML, XML, and more
- NEW: tests/test_language_detector.py (20 KB)
  - 100+ test cases covering all supported languages
  - Edge case testing (mixed code, low confidence, etc.)
- MODIFIED: src/skill_seekers/cli/doc_scraper.py
  - Removed 80+ lines of duplicate detection code
  - Now uses shared LanguageDetector instance
- MODIFIED: src/skill_seekers/cli/pdf_extractor_poc.py
  - Removed 130+ lines of duplicate detection code
  - Now uses shared LanguageDetector instance
- MODIFIED: tests/test_pdf_extractor.py
  - Fixed imports to use proper package paths
  - Added manual detector initialization in test setup
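
For reviewers unfamiliar with the approach, here is a rough sketch of confidence-scored pattern matching with a `min_confidence` cutoff. The class name `SketchDetector`, the `PATTERNS` table, the weights, and the `detect` return shape are all assumptions made for illustration; the actual patterns and API in language_detector.py may differ.

```python
import re

# Hypothetical pattern table: language -> list of (regex, weight).
# These patterns and weights are illustrative, not the ones in this change.
PATTERNS = {
    "python": [
        (r"^\s*def\s+\w+\s*\(", 0.5),
        (r"^\s*import\s+\w+", 0.3),
        (r":\s*$", 0.2),
    ],
    "javascript": [
        (r"\bfunction\s+\w+\s*\(", 0.5),
        (r"\b(const|let|var)\s+\w+", 0.3),
        (r";\s*$", 0.2),
    ],
}


class SketchDetector:
    """Illustrative stand-in for a confidence-based language detector."""

    def __init__(self, min_confidence=0.3):
        self.min_confidence = min_confidence

    def detect(self, code):
        """Return (language, confidence), with confidence in 0.0-1.0."""
        best_lang, best_score = "unknown", 0.0
        for lang, patterns in PATTERNS.items():
            # Sum the weights of every pattern that matches, capped at 1.0.
            score = sum(w for rx, w in patterns if re.search(rx, code, re.MULTILINE))
            score = min(score, 1.0)
            if score > best_score:
                best_lang, best_score = lang, score
        # Below the threshold, report "unknown" rather than a weak guess.
        if best_score < self.min_confidence:
            return "unknown", best_score
        return best_lang, best_score


detector = SketchDetector(min_confidence=0.3)
language, confidence = detector.detect("def f(x):\n    return x\n")
```

Raising `min_confidence` trades recall for precision: snippets that only match one or two weak patterns fall back to "unknown" instead of being mislabeled.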

Benefits:
- DRY: Single source of truth for language detection
- Maintainability: Add new languages in one place
- Consistency: Same detection logic across all scrapers
- Testability: Comprehensive test coverage
- Extensibility: Easy to add new languages or improve patterns

Addresses technical debt from having duplicate detection logic in multiple files.
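
As a minimal sketch of the deduplication idea, the snippet below shows two consumers receiving one shared detector object instead of each carrying its own copy of the logic. `StubDetector`, `DocScraperSketch`, `PdfExtractorSketch`, and `label_block` are hypothetical names, and the keyword check stands in for the real detection rules; the actual wiring in doc_scraper.py and pdf_extractor_poc.py may look different.

```python
class StubDetector:
    """Stand-in for the shared detector; the keyword check is purely illustrative."""

    def detect(self, code):
        return ("python", 0.8) if "def " in code else ("unknown", 0.0)


class DocScraperSketch:
    def __init__(self, detector):
        self.detector = detector  # injected, not re-implemented locally

    def label_block(self, code):
        return self.detector.detect(code)[0]


class PdfExtractorSketch:
    def __init__(self, detector):
        self.detector = detector  # same object as the doc scraper uses

    def label_block(self, code):
        return self.detector.detect(code)[0]


# One detector, created once and passed to both consumers. A test setup can
# initialize the detector the same way before exercising either class.
shared = StubDetector()
doc_scraper = DocScraperSketch(shared)
pdf_extractor = PdfExtractorSketch(shared)
```

Because both classes delegate to the same object, a pattern fix made in the detector reaches every scraper at once, and tests can substitute a stub without touching scraper code.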