Thanks @franklegolasyoung for the excellent work on the core fixes for issues #267, #242, and #260! 🙏
Your comprehensive approach to fixing PDF processing, expanding workflow detection, and improving the Chinese README documentation is much appreciated. I've added code quality fixes and comprehensive tests to ensure everything passes CI.
All 1266+ tests are now passing, and the issues are resolved! 🎉
Three critical UX improvements for custom config handling:
1. User config directory support:
- Added ~/.config/skill-seekers/configs/ to search path
- Users can now place custom configs in their home directory
- Path resolution order: exact path → ./configs/ → user config dir → API (see the sketch after this list)
2. Better error messages:
- Show all searched absolute paths when config not found
- Added get_last_searched_paths() function to track locations
- Clear guidance on where to place custom configs
3. Auto-create config.json:
- ConfigManager now creates config.json on first initialization
- Creates configs/ subdirectory for user custom configs
- Display shows custom configs directory path
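A minimal sketch of the resolution order described above. Only get_last_searched_paths() is named in this change; resolve_config_path and the module-level _last_searched list are hypothetical names used for illustration:

```python
# Hypothetical sketch of the search order above; only
# get_last_searched_paths() is named in this change.
from pathlib import Path

USER_CONFIG_DIR = Path.home() / ".config" / "skill-seekers" / "configs"
_last_searched: list[str] = []

def resolve_config_path(name: str) -> Path | None:
    """Try the exact path, ./configs/, then the user config dir."""
    _last_searched.clear()
    for candidate in (Path(name), Path("configs") / name, USER_CONFIG_DIR / name):
        _last_searched.append(str(candidate.resolve()))
        if candidate.is_file():
            return candidate
    return None  # caller falls back to fetching from the API

def get_last_searched_paths() -> list[str]:
    """Absolute paths checked by the most recent lookup (for error messages)."""
    return list(_last_searched)
```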
Fixes issues reported by @melamers in #262, where:
- Config path shown by `skill-seekers config` didn't exist
- Unclear where to save custom configs
- Error messages didn't show exact paths searched
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes #264
Users reported that preset configs (react.json, godot.json, etc.) were not
found after installing via pip/uv, causing immediate failure on first use.
Solution: Instead of bundling configs in the package, the CLI now automatically
fetches missing configs from the SkillSeekersWeb.com API.
Changes:
- Created config_fetcher.py with smart config resolution (sketched after this list):
1. Check local path (backward compatible)
2. Check with configs/ prefix
3. Auto-fetch from SkillSeekersWeb.com API (new!)
- Updated doc_scraper.py to use ConfigValidator (supports unified configs)
- Added 15 comprehensive tests for auto-fetch functionality
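A sketch of the auto-fetch fallback under stated assumptions: the endpoint URL and cache location are guesses for illustration, not the published SkillSeekersWeb.com API contract:

```python
# Sketch only: the endpoint URL and cache directory are assumptions.
import json
import urllib.request
from pathlib import Path

CACHE_DIR = Path.home() / ".config" / "skill-seekers" / "configs"

def fetch_config(name: str) -> Path:
    """Download a preset config once and cache it for future runs."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    target = CACHE_DIR / f"{name}.json"
    if target.is_file():
        return target  # cached from a previous run
    url = f"https://skillseekersweb.com/api/configs/{name}"  # assumed endpoint
    with urllib.request.urlopen(url, timeout=30) as response:
        config = json.load(response)
    target.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return target
```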
User Experience:
- Zero configuration needed - presets work immediately after install
- Better error messages showing available configs from API
- Downloaded configs are cached locally for future use
- Fully backward compatible with existing local configs
Testing:
- 15 new unit tests (all passing)
- 2 integration tests with real API
- Full test suite: 1387 tests passing
- No breaking changes
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This hotfix resolves 4 critical bugs reported by users:
Issue #258: install command fails with unified_scraper
- Added --fresh and --dry-run flags to unified_scraper.py (see the sketch below)
- Updated main.py to pass both flags to unified scraper
- Fixed "unrecognized arguments" error
Issue #259 (Original): scrape command doesn't accept positional URL and --max-pages
- Added positional URL argument to scrape command
- Added --max-pages flag with safety warnings (>1000 pages, <10 pages); see the sketch below
- Updated doc_scraper.py and main.py argument parsers
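A sketch of the new argument handling; the thresholds match the bullet above, while the warning wording is assumed:

```python
import argparse

parser = argparse.ArgumentParser(prog="scrape")
parser.add_argument("url", help="documentation root URL (now positional)")
parser.add_argument("--max-pages", type=int, default=None,
                    help="stop after scraping this many pages")

args = parser.parse_args(["https://example.com/docs", "--max-pages", "2000"])
if args.max_pages is not None:
    if args.max_pages > 1000:
        print(f"Warning: --max-pages {args.max_pages} may take a very long time")
    elif args.max_pages < 10:
        print(f"Warning: --max-pages {args.max_pages} may miss most of the docs")
```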
Issue #259 (Comment A): Version shows 2.7.0 instead of actual version
- Fixed hardcoded version in main.py
- Now reads version dynamically from __init__.py (see the sketch below)
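A sketch of the dynamic lookup, assuming __init__.py exposes a conventional __version__ attribute:

```python
# Assumes skill_seekers/__init__.py defines __version__.
from skill_seekers import __version__

def version_string() -> str:
    return f"skill-seekers {__version__}"  # no more hardcoded "2.7.0"
```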
Issue #259 (Comment B): PDF command shows empty "Error: " message
- Improved exception handler in main.py to show exception type if message is empty (sketched below)
- Added proper error handling in pdf_scraper.py with context-specific messages
- Added traceback support in verbose mode
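A sketch of the improved handler; the function name is hypothetical, while the fallback-to-type behavior is the fix described above:

```python
import traceback

def report_error(exc: Exception, verbose: bool = False) -> None:  # hypothetical name
    message = str(exc) or type(exc).__name__  # never print a bare "Error: "
    print(f"Error: {message}")
    if verbose:
        traceback.print_exception(type(exc), exc, exc.__traceback__)
```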
All fixes tested and verified with exact commands from issue reports.
Resolves: #258, #259
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add Markdown file parsing in doc_scraper (_extract_markdown_content, _extract_html_as_markdown)
- Add URL extraction and cleaning in llms_txt_parser (extract_urls, _clean_url)
- Support multiple documentation/github/pdf sources in unified_scraper
- Generate separate reference directories per source in unified_skill_builder
- Skip pages with empty/short content (<50 chars); see the sketch below
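A sketch of the short-content guard; the 50-character threshold is from the bullet above, the function name is hypothetical:

```python
MIN_CONTENT_CHARS = 50  # threshold from the change above

def should_keep_page(text: str) -> bool:  # hypothetical name
    """Skip pages whose extracted text is empty or trivially short."""
    return len(text.strip()) >= MIN_CONTENT_CHARS
```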
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
Implements hybrid smart extraction + improved fallback templates for
skill descriptions across all scrapers.
Changes:
- github_scraper.py:
* Added extract_description_from_readme() helper (condensed sketch after this list)
* Extracts from README first paragraph (60 lines)
* Updates description after README extraction
* Fallback: "Use when working with {name}"
* Updated 3 locations (GitHubScraper, GitHubToSkillConverter, main)
- doc_scraper.py:
* Added infer_description_from_docs() helper
* Extracts from meta tags or first paragraph (65 lines)
* Tries: meta description, og:description, first content paragraph
* Fallback: "Use when working with {name}"
* Updated 2 locations (create_enhanced_skill_md, get_configuration)
- pdf_scraper.py:
* Added infer_description_from_pdf() helper
* Extracts from PDF metadata (subject, title)
* Fallback: "Use when referencing {name} documentation"
* Updated 3 locations (PDFToSkillConverter, main x2)
- generate_router.py:
* Updated 2 locations with improved router descriptions
* "Use when working with {name} development and programming"
All changes:
- Only apply to NEW skill generations (don't modify existing)
- No API calls (free/offline)
- Smart extraction when metadata/README available
- Improved "Use when..." fallbacks instead of generic templates
- 612 tests passing (100%)
Fixes #191
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes #209 - UnicodeDecodeError on Windows with non-ASCII characters
**Problem:**
Windows users with non-English locales (Chinese, Japanese, Korean, etc.)
experienced GBK/SHIFT-JIS codec errors when the system default encoding
is not UTF-8.
Error: 'gbk' codec can't decode byte 0xac in position 206: illegal
multibyte sequence
**Root Cause:**
File operations using open() without an explicit encoding parameter use
the system default encoding, which on the Chinese edition of Windows is
GBK. JSON files contain UTF-8 encoded characters that fail to decode
with GBK.
**Solution:**
Added encoding='utf-8' to ALL file operations (sketched after this list) across:
- doc_scraper.py (4 instances):
* load_config() - line 1310
* check_existing_data() - line 1416
* save_checkpoint() - line 173
* load_checkpoint() - line 186
- github_scraper.py (1 instance):
* main() config loading - line 922
- unified_scraper.py (10 instances):
* All JSON read/write operations - lines 134, 153, 205, 239, 275,
278, 325, 328, 342, 364
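The pattern applied at each call site, shown as a before/after sketch:

```python
import json

# Before: uses the system default codec (GBK on Chinese Windows) and fails.
#     with open(path) as f:
#         data = json.load(f)

# After: explicit UTF-8 everywhere, matching how the files are written.
def load_json(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_json(path: str, data: dict) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```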
**Test Results:**
- ✅ All 612 tests passing (100% pass rate)
- ✅ Backward compatible (UTF-8 is standard on Linux/macOS)
- ✅ Fixes Windows locale issues
**Impact:**
- ✅ Works on ALL Windows locales (Chinese, Japanese, Korean, etc.)
- ✅ Maintains compatibility with Linux/macOS
- ✅ Prevents future encoding issues
**Thanks to:** @my5icol for the detailed bug report and fix suggestion!
- Created LanguageDetector class supporting 20+ programming languages
- Confidence-based detection with customizable thresholds (min_confidence parameter); usage sketched below
- Replaces duplicate language detection code in doc_scraper and pdf_extractor
- Comprehensive test suite with 100+ test cases
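A hypothetical usage sketch: the module path and min_confidence parameter come from this change, but the detect() method name and its return shape are assumptions, not the actual API:

```python
from skill_seekers.cli.language_detector import LanguageDetector

detector = LanguageDetector(min_confidence=0.5)  # threshold on the 0.0-1.0 scale
snippet = "def greet(name):\n    return f'Hello, {name}!'"
result = detector.detect(snippet)  # assumed to return (language, confidence)
if result is not None:
    language, confidence = result
    print(language, round(confidence, 2))  # e.g. python 0.92
```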
Changes:
- NEW: src/skill_seekers/cli/language_detector.py (17 KB)
- Unified detector with pattern matching for 20+ languages
- Confidence scoring (0.0-1.0 scale)
- Supports: Python, JavaScript, TypeScript, Java, C++, C#, Go, Rust, PHP, Ruby, Swift, Kotlin, Shell, SQL, HTML, CSS, JSON, YAML, XML, and more
- NEW: tests/test_language_detector.py (20 KB)
- 100+ test cases covering all supported languages
- Edge case testing (mixed code, low confidence, etc.)
- MODIFIED: src/skill_seekers/cli/doc_scraper.py
- Removed 80+ lines of duplicate detection code
- Now uses shared LanguageDetector instance
- MODIFIED: src/skill_seekers/cli/pdf_extractor_poc.py
- Removed 130+ lines of duplicate detection code
- Now uses shared LanguageDetector instance
- MODIFIED: tests/test_pdf_extractor.py
- Fixed imports to use proper package paths
- Added manual detector initialization in test setup
Benefits:
- DRY: Single source of truth for language detection
- Maintainability: Add new languages in one place
- Consistency: Same detection logic across all scrapers
- Testability: Comprehensive test coverage
- Extensibility: Easy to add new languages or improve patterns
Addresses technical debt from having duplicate detection logic in multiple files.
Merges feat/add-skip-llm-to-config by @sogoiii.
This PR adds a valuable configuration option to explicitly skip llms.txt
detection, useful when a site's llms.txt is incomplete or incorrect, or
when targeted HTML scraping is needed.
Key features:
- New 'skip_llms_txt' config option (default: false, backward compatible)
- Boolean type validation with warning for invalid values
- Support in both sync and async scraping modes
- 17 comprehensive tests (15 feature tests + 2 config validation tests)
All tests passing after fixing import paths to use proper package names.
Test results: ✅ 17/17 tests passing
Full test suite: ✅ 391 tests passing
Co-authored-by: sogoiii <sogoiii@users.noreply.github.com>
- Add skip_llms_txt config option (default: False)
- Validate value is boolean, warn and default to False if not (see the sketch below)
- Support in both sync and async scraping modes
- Add 17 tests for config, behavior, and edge cases
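A sketch of the validation described above, simplified to a single helper; the helper name and logging mechanism are assumptions:

```python
import logging

logger = logging.getLogger(__name__)

def get_skip_llms_txt(config: dict) -> bool:  # hypothetical helper name
    value = config.get("skip_llms_txt", False)  # default False: backward compatible
    if not isinstance(value, bool):
        logger.warning("skip_llms_txt must be a boolean, got %r; defaulting to False", value)
        return False
    return value
```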