PyGithub's get_languages() returns the raw API JSON, which in some environments
includes non-integer metadata entries (e.g., a "url" key with a string value),
causing a TypeError in sum(). Now filters to integer values only before
calculating percentages.
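A minimal sketch of the filter, assuming the raw dict can mix byte counts with string metadata (names are illustrative, not the actual implementation):

```python
def language_percentages(repo) -> dict[str, float]:
    """Hedged sketch: repo is a github.Repository.Repository."""
    raw = repo.get_languages()  # may contain stray metadata such as a "url" string
    # Keep only integer byte counts so string values cannot break sum().
    byte_counts = {lang: n for lang, n in raw.items() if isinstance(n, int)}
    total = sum(byte_counts.values())
    return {lang: 100.0 * n / total for lang, n in byte_counts.items()} if total else {}
```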
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PyGithub's PaginatedList slicing (issues[:max_issues]) may fail with
'list index out of range' on some PyGithub versions or when repos
have no issues. Replace it with itertools.islice(), which works reliably
with any iterable, including PaginatedList.
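A sketch of the replacement (the query arguments and `max_issues` limit are illustrative):

```python
from itertools import islice

def recent_issues(repo, max_issues: int) -> list:
    """Hedged sketch: islice consumes any iterable lazily, so it avoids
    the PaginatedList slicing failure described above."""
    issues = repo.get_issues(state="open")  # PaginatedList, not a list
    return list(islice(issues, max_issues))
```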
Bug reported by @dream0438-cmd in PR #269.
Closes #269
- Fix #299: rename --chunk-size/--chunk-overlap to --streaming-chunk-size/
--streaming-overlap in arguments/package.py to avoid collision with the
RAG --chunk-size flag from arguments/common.py
- Phase 1a: make package_skill.py import args via add_package_arguments()
instead of a 105-line inline duplicate argparse block; fixes the root
cause of _reconstruct_argv() passing unrecognised flag names
- Phase 1b: centralise setup_logging() into utils.py and remove 4
duplicate module-level logging.basicConfig() calls from doc_scraper.py,
github_scraper.py, codebase_scraper.py, and unified_scraper.py
- Fix test_package_structure.py / test_cli_paths.py version strings
(3.1.1 → 3.1.2)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
All scrapers (scrape, github, analyze, pdf) now share a common argument
contract via add_all_standard_arguments() in arguments/common.py.
Universal flags (--dry-run, --verbose, --quiet, --name, --description,
workflow args) work consistently across all source types.
Previously, `create <url> --dry-run`, `create owner/repo --dry-run`,
and `create ./path --dry-run` would crash because sub-scrapers didn't
accept those flags. Also fixes main.py _handle_analyze_command() not
forwarding --dry-run, --preset, --quiet, --name, --description to
codebase_scraper.
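As a rough illustration of the shared contract (the helper body below is an assumption limited to the universal flags named above, not the actual implementation):

```python
import argparse

def add_all_standard_arguments(parser: argparse.ArgumentParser) -> None:
    """Hedged sketch: register the flags every scraper must accept."""
    parser.add_argument("--dry-run", action="store_true", help="preview without writing output")
    parser.add_argument("--verbose", action="store_true", help="verbose logging")
    parser.add_argument("--quiet", action="store_true", help="suppress non-error output")
    parser.add_argument("--name", help="override the generated skill name")
    parser.add_argument("--description", help="override the generated skill description")
```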
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add --local-repo-path to UNIVERSAL_ARGUMENTS in create.py so it is
registered in the actual parser (not just help display)
- Add --local-repo-path to GITHUB_ARGUMENTS in arguments/github.py for
the standalone github subcommand
- Forward --local-repo-path through create_command._route_github() to
github_scraper
- Add local_repo_path to the config dict built from CLI args in
github_scraper.main()
- Add early validation in GitHubScraper.__init__() (sketched after this
  list): warn and reset to None if the path does not exist, triggering a
  real GitHub API fallback instead of silently operating with an empty
  file tree (fixes #281)
- Update test_create_arguments.py count/names assertions (17 -> 18)
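A minimal sketch of that early validation (signature and attribute names are assumptions):

```python
import logging
import os

logger = logging.getLogger(__name__)

class GitHubScraper:
    def __init__(self, local_repo_path: str | None = None):
        if local_repo_path and not os.path.isdir(local_repo_path):
            logger.warning(
                "local repo path %s does not exist; falling back to the GitHub API",
                local_repo_path,
            )
            local_repo_path = None  # triggers the real GitHub API path
        self.local_repo_path = local_repo_path
```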
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Fixes ruff format --check CI failure. 22 files reformatted to satisfy
the ruff formatter's style requirements. No logic changes, only
whitespace/formatting adjustments.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Change --enhance-workflow from type=str to action='append' in all argument
files (workflow, create, scrape, github, pdf) so the flag can be given
multiple times to chain workflows in sequence
- Add workflow_runner.py: shared utility used by all 4 scrapers (variable
  precedence sketched after this list)
- collect_workflow_vars(): merges extra context then user --var flags
(user flags take precedence over scraper metadata)
- run_workflows(): executes named workflows in order, then any inline
--enhance-stage workflow; handles dry-run/preview mode
- Remove duplicate ~115-130 line workflow blocks from doc_scraper,
github_scraper, pdf_scraper, and codebase_scraper; replace with
single run_workflows() call each
- Remove mutual exclusivity between workflows and AI enhancement:
workflows now run first, then traditional enhancement continues
independently (--enhance-level 0 to disable)
- Add tests/test_workflow_runner.py: 21 tests covering no-flags, single
workflow, multiple/chained workflows, inline stages, mixed mode,
variable precedence, and dry-run
- Fix test_markdown_parsing: accept "text" or "unknown" for unlabelled
code blocks (unified MarkdownParser returns "text" by default)
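A sketch of the collect_workflow_vars() precedence rule (the signature is an assumption):

```python
def collect_workflow_vars(scraper_metadata: dict, user_vars: dict) -> dict:
    """Hedged sketch: merge scraper metadata first, then user --var flags,
    so user-supplied values win on key collisions."""
    merged = dict(scraper_metadata)
    merged.update(user_vars)  # later update() wins
    return merged

# e.g. collect_workflow_vars({"name": "auto"}, {"name": "mine"}) -> {"name": "mine"}
```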
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Changed 'pathspec.PathSpec' | None to Optional['pathspec.PathSpec'] (illustrated below)
- Fixes TypeError on Python 3.10/3.11, where the | union operator cannot be applied to a string forward reference at runtime
- Adds Optional to the typing imports
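Illustrative before/after (representative, not the exact source line):

```python
from typing import Optional
import pathspec

# Before -- raises TypeError at runtime, because | is not defined for a
# string forward reference:
#     gitignore_spec: 'pathspec.PathSpec' | None = None
# After:
gitignore_spec: Optional["pathspec.PathSpec"] = None
```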
Completes the implementation for Unity/Unreal/Godot game engine support
and adds missing "local" source type validation.
Changes:
- Add "local" to VALID_SOURCE_TYPES in config_validator.py
- Add _validate_local_source() method with full validation
- Add Unity/Unreal/Godot to FRAMEWORK_MARKERS for priority detection
- Add game engine directory exclusions to all 3 scrapers (sketched after
  this list):
* Unity: Library/, Temp/, Logs/, UserSettings/, etc.
* Unreal: Intermediate/, Saved/, DerivedDataCache/
* Godot: .godot/, .import/
- Prevents scanning massive build cache directories (saves GBs + hours)
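The exclusion sets might be grouped roughly as in this sketch (constant name and structure are assumptions; the "etc." entries from the list above stay elided):

```python
# Hedged sketch; the real structure and name may differ.
GAME_ENGINE_EXCLUDE_DIRS = {
    "unity": {"Library", "Temp", "Logs", "UserSettings"},  # plus others
    "unreal": {"Intermediate", "Saved", "DerivedDataCache"},
    "godot": {".godot", ".import"},
}
```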
This completes all features mentioned in PR #278:
✅ Unity/Unreal/Godot framework detection with priority
✅ Pattern enhancement performance fix (grouped approach)
✅ Game engine directory exclusions
✅ Phase 5 SKILL.md AI enhancement
✅ Local source references copying
✅ "local" source type validation
✅ Config field name compatibility
✅ C# test example extraction
Tested:
- All unified config tests pass (18/18)
- All config validation tests pass (28/28)
- Ready for Unity project testing
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixed incorrect variable names in list comprehensions that were causing
NameError in CI (Python 3.11/3.12):
Critical fixes:
- tests/test_markdown_parsing.py: 'l' → 'link' in list comprehension
- src/skill_seekers/cli/pdf_extractor_poc.py: 'l' → 'line' (2 occurrences)
Additional auto-lint fixes:
- Removed unused imports in llms_txt_downloader.py, llms_txt_parser.py
- Fixed comparison operators in config files
- Fixed list comprehensions in other files
All tests now pass in CI.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add pathspec import with graceful fallback
- Add gitignore_spec attribute to GitHubScraper class
- Implement _load_gitignore() method to parse .gitignore files (sketched below)
- Update should_exclude_dir() to check .gitignore rules
- Load .gitignore automatically in local repository mode
- Handle directory patterns with and without trailing slash
- Add 4 comprehensive tests for .gitignore functionality
Closes #63 - C2.1 File Tree Walker with .gitignore support complete
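A minimal sketch of the loader with the graceful fallback (the method body is an assumption consistent with the bullets above):

```python
from pathlib import Path

try:
    import pathspec
except ImportError:
    pathspec = None  # graceful fallback: .gitignore support disabled

def _load_gitignore(repo_root: str):
    """Hedged sketch: parse .gitignore into a PathSpec, or return None."""
    gitignore = Path(repo_root) / ".gitignore"
    if pathspec is None or not gitignore.exists():
        return None
    with gitignore.open(encoding="utf-8") as f:
        return pathspec.PathSpec.from_lines("gitwildmatch", f)
```

should_exclude_dir() can then consult spec.match_file(relative_path) alongside the hard-coded exclusions.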
Features:
- Loads .gitignore from local repository root
- Respects .gitignore patterns for directory exclusion
- Falls back gracefully when pathspec not installed
- Works alongside existing hard-coded exclusions
- Only active in local_repo_path mode (not GitHub API mode)
Test coverage:
- test_load_gitignore_exists: .gitignore parsing
- test_load_gitignore_missing: Missing .gitignore handling
- test_should_exclude_dir_with_gitignore: .gitignore exclusion
- test_should_exclude_dir_default_exclusions: Existing exclusions still work
Integration:
- github_scraper.py now has same .gitignore support as codebase_scraper.py
- Both tools use pathspec library for consistent behavior
- Enables proper repository analysis respecting project .gitignore rules
**Problem #1: Large File Encoding Error** ✅ FIXED
- Add large file download support via download_url (sketched below)
- Detect encoding='none' for files >1MB
- Download via GitHub raw URL instead of API
- Handles ccxt/ccxt's 1.4MB CHANGELOG.md successfully
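A sketch of the fallback (helper name is illustrative; `content` is a PyGithub ContentFile):

```python
import requests

def file_text(content) -> str:
    """Hedged sketch: the contents API returns encoding None/'none' with no
    inline payload for large files (>1MB), so fetch the raw URL instead."""
    if content.encoding in (None, "none"):
        resp = requests.get(content.download_url, timeout=30)
        resp.raise_for_status()
        return resp.text
    return content.decoded_content.decode("utf-8")
```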
**Problem #2: Missing CLI Enhancement Flags** ✅ FIXED
- Add --enhance, --enhance-local, --api-key to main.py github_parser
- Add flag forwarding in CLI dispatcher
- Fixes 'unrecognized arguments' error
- Users can now use: skill-seekers github --repo owner/repo --enhance-local
**Problem #3: Custom API Endpoint Support** ✅ FIXED
- Support ANTHROPIC_BASE_URL environment variable
- Support ANTHROPIC_AUTH_TOKEN (alternative to ANTHROPIC_API_KEY)
- Fix ThinkingBlock.text error with newer Anthropic SDK
- Find the TextBlock in the response content array (handles thinking blocks; sketched below)
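A sketch of the endpoint/token handling and the ThinkingBlock-safe text lookup (helper name is illustrative):

```python
import os
import anthropic

# Hedged sketch: honour the custom endpoint and the alternative token if set.
client = anthropic.Anthropic(
    api_key=os.environ.get("ANTHROPIC_AUTH_TOKEN") or os.environ.get("ANTHROPIC_API_KEY"),
    base_url=os.environ.get("ANTHROPIC_BASE_URL"),  # None falls back to the SDK default
)

def first_text_block(response) -> str:
    """ThinkingBlock has no .text attribute; scan for the first text block."""
    return next(block.text for block in response.content if block.type == "text")
```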
**Changes**:
- src/skill_seekers/cli/enhance_skill.py:
- Support custom base_url parameter
- Support both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN
- Iterate through content blocks to find text (handles ThinkingBlock)
- src/skill_seekers/cli/main.py:
- Add --enhance, --enhance-local, --api-key to github_parser
- Forward flags to github_scraper.py in dispatcher
- src/skill_seekers/cli/github_scraper.py:
- Add large file detection (encoding=None/"none")
- Download via download_url with requests
- Log file size and download progress
- tests/test_github_scraper.py:
- Add test_get_file_content_large_file
- Add test_extract_changelog_large_file
- All 31 tests passing ✅
**Credits**:
- Thanks to @XGCoder for detailed bug report
- Thanks to @gorquan for local fixes and guidance
Fixes #219
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add _get_file_content() helper method to detect and follow symlinks
- Update _extract_readme() to use new helper
- Update _extract_changelog() to use new helper
- Add 7 comprehensive tests for symlink handling
- All 29 GitHub scraper tests passing
Fixes #225
When README.md or CHANGELOG.md are symlinks (like in vercel/ai repo),
PyGithub returns ContentFile with type='symlink' and encoding=None.
Direct access to decoded_content throws AssertionError.
Solution: Detect the symlink type, follow the target path, then decode the actual file.
Handles edge cases: broken symlinks, missing targets, and encoding errors.
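A minimal sketch of the helper (one-level, root-relative symlink resolution for brevity; reading the target via raw_data is an assumption):

```python
def get_file_content(repo, path: str):
    """Hedged sketch: follow a symlink before decoding; return None on failure."""
    content = repo.get_contents(path)
    if content.type == "symlink":
        target = content.raw_data.get("target")
        if not target:
            return None  # broken symlink: no target recorded
        content = repo.get_contents(target)
    try:
        return content.decoded_content.decode("utf-8")
    except (AssertionError, UnicodeDecodeError):
        return None  # missing payload or encoding error
```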
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Implements hybrid smart extraction + improved fallback templates for
skill descriptions across all scrapers.
Changes:
- github_scraper.py:
* Added extract_description_from_readme() helper (sketched after this list)
* Extracts from README first paragraph (60 lines)
* Updates description after README extraction
* Fallback: "Use when working with {name}"
* Updated 3 locations (GitHubScraper, GitHubToSkillConverter, main)
- doc_scraper.py:
* Added infer_description_from_docs() helper
* Extracts from meta tags or first paragraph (65 lines)
* Tries: meta description, og:description, first content paragraph
* Fallback: "Use when working with {name}"
* Updated 2 locations (create_enhanced_skill_md, get_configuration)
- pdf_scraper.py:
* Added infer_description_from_pdf() helper
* Extracts from PDF metadata (subject, title)
* Fallback: "Use when referencing {name} documentation"
* Updated 3 locations (PDFToSkillConverter, main x2)
- generate_router.py:
* Updated 2 locations with improved router descriptions
* "Use when working with {name} development and programming"
All changes:
- Only apply to NEW skill generations (don't modify existing)
- No API calls (free/offline)
- Smart extraction when metadata/README available
- Improved "Use when..." fallbacks instead of generic templates
- 612 tests passing (100%)
Fixes #191
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes #209 - UnicodeDecodeError on Windows with non-ASCII characters
**Problem:**
Windows users with non-English locales (Chinese, Japanese, Korean, etc.)
experienced GBK/SHIFT-JIS codec errors when the system default encoding
is not UTF-8.
Error: 'gbk' codec can't decode byte 0xac in position 206: illegal
multibyte sequence
**Root Cause:**
File operations using open() without explicit encoding parameter use
the system default encoding, which on Windows Chinese edition is GBK.
JSON files contain UTF-8 encoded characters that fail to decode with GBK.
**Solution:**
Added encoding='utf-8' to ALL file operations (pattern shown after this list) across:
- doc_scraper.py (4 instances):
* load_config() - line 1310
* check_existing_data() - line 1416
* save_checkpoint() - line 173
* load_checkpoint() - line 186
- github_scraper.py (1 instance):
* main() config loading - line 922
- unified_scraper.py (10 instances):
* All JSON read/write operations - lines 134, 153, 205, 239, 275,
278, 325, 328, 342, 364
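The pattern applied in every instance above, shown once (function names are illustrative):

```python
import json
from pathlib import Path

def load_json(path: Path) -> dict:
    # Explicit encoding so the Windows locale default (e.g. GBK) is never used.
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def save_json(path: Path, data: dict) -> None:
    with open(path, "w", encoding="utf-8") as f:
        json.dump(data, f, ensure_ascii=False, indent=2)
```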
**Test Results:**
- ✅ All 612 tests passing (100% pass rate)
- ✅ Backward compatible (UTF-8 is standard on Linux/macOS)
- ✅ Fixes Windows locale issues
**Impact:**
- ✅ Works on ALL Windows locales (Chinese, Japanese, Korean, etc.)
- ✅ Maintains compatibility with Linux/macOS
- ✅ Prevents future encoding issues
**Thanks to:** @my5icol for the detailed bug report and fix suggestion!
This commit fixes three critical limitations discovered during local repository skill extraction testing:
**Fix 1: Code Analyzer Import Issue**
- Changed unified_scraper.py to use absolute imports instead of relative imports
- Fixed: `from github_scraper import` → `from skill_seekers.cli.github_scraper import`
- Fixed: `from pdf_scraper import` → `from skill_seekers.cli.pdf_scraper import`
- Result: CodeAnalyzer now available during extraction, deep analysis works
**Fix 2: Unity Library Exclusions**
- Updated should_exclude_dir() to accept and check full directory paths
- Updated _extract_file_tree_local() to pass both dir name and full path
- Added exclusion config passing from unified_scraper to github_scraper
- Result: exclude_dirs_additional now works (297 files excluded in test)
**Fix 3: AI Enhancement for Single Sources**
- Changed read_reference_files() to use rglob() for recursive search
- Now finds reference files in subdirectories (e.g., references/github/README.md)
- Result: AI enhancement works with unified skills that have nested references
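A sketch of the recursive lookup (signature and directory layout are assumptions):

```python
from pathlib import Path

def find_reference_files(skill_dir: Path) -> list[Path]:
    """Hedged sketch: rglob() recurses into nested reference directories
    (e.g. references/github/README.md), unlike a flat glob()."""
    return sorted((skill_dir / "references").rglob("*.md"))
```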
**Test Results:**
- Code Analyzer: ✅ Working (deep analysis running)
- Unity Exclusions: ✅ Working (297 files excluded from 679)
- AI Enhancement: ✅ Working (finds and reads nested references)
**Files Changed:**
- src/skill_seekers/cli/unified_scraper.py (Fix 1 & 2)
- src/skill_seekers/cli/github_scraper.py (Fix 2)
- src/skill_seekers/cli/utils.py (Fix 3)
**Test Artifacts:**
- configs/deck_deck_go_local.json (test configuration)
- docs/LOCAL_REPO_TEST_RESULTS.md (comprehensive test report)
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Fixes Issue #190 - "name 'logger' is not defined" error
**Problem:**
- Logger was used at line 40 (in code_analyzer import exception)
- Logger was defined at line 47
- Caused runtime error when code_analyzer import failed
**Solution:**
- Moved logging.basicConfig() and logger initialization to lines 34-39
- Now logger is defined BEFORE the code_analyzer import block
- Warning message now works correctly when code_analyzer is missing
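A sketch of the corrected ordering (import names are illustrative; the relative import assumes package context):

```python
import logging

# Logger configured BEFORE the optional import, so the except branch can use it.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    from .code_analyzer import CodeAnalyzer
except ImportError:
    CodeAnalyzer = None
    logger.warning("code_analyzer not available; deep analysis disabled")
```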
**Testing:**
- ✅ All 22 GitHub scraper tests pass
- ✅ Logger warning appears correctly when code_analyzer missing
- ✅ No similar issues found in other CLI files
Closes #190
Fixes #193 - PDF scraping broken for PyPI users
Changed 3 files from absolute to relative imports to fix
ModuleNotFoundError when package is installed via pip:
1. pdf_scraper.py:22
- from pdf_extractor_poc import → from .pdf_extractor_poc import
- Fixes: skill-seekers pdf command failed with import error
2. github_scraper.py:36
- from code_analyzer import → from .code_analyzer import
- Proactive fix: prevents future import errors
3. test_unified_simple.py:17
- from config_validator import → from .config_validator import
- Proactive fix: test helper file
These absolute imports worked locally due to sys.path differences
but failed when installed via PyPI (pip install skill-seekers).
Tested with:
- skill-seekers pdf command now works ✅
- Extracted 32-page Godot Farming PDF successfully
All CLI commands should now work correctly when installed from PyPI.
🤖 Generated with [Claude Code](https://claude.ai/code)
Co-Authored-By: Claude <noreply@anthropic.com>