skill-seekers-reference

firefrost-gaming/skill-seekers-reference

Author	SHA1	Message	Date
yusyus	31a57c448b	style: apply ruff formatting to github_scraper.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:46:42 +03:00
yusyus	d71c1d3aa3	fix: filter non-integer metadata from GitHub languages API response (#322) PyGithub's get_languages() returns raw API JSON which in some environments includes non-integer metadata keys (e.g., "url"), causing a TypeError in sum(). Now filters to integer values only before calculating percentages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:44:52 +03:00
yusyus	7185531f94	fix: replace PaginatedList slicing with itertools.islice in _extract_issues PyGithub's PaginatedList slicing (issues[:max_issues]) may fail with 'list index out of range' on some PyGithub versions or when repos have no issues. Replace with itertools.islice() which works reliably with any iterable, including PaginatedList. Bug reported by @dream0438-cmd in PR #269. Closes #269	2026-03-15 02:44:06 +03:00
copperlang2007	89f5e6fe5f	perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309) ## Summary Performance optimizations across core scraping and analysis modules: - doc_scraper.py: Pre-compiled regex at module level, O(1) URL dedup via _enqueued_urls set, cached URL patterns, _enqueue_url() helper (DRY), seen_links set for link extraction, pre-lowercased category keywords, async error logging (bug fix), summary I/O error handling - code_analyzer.py: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection - dependency_analyzer.py: Same bisect line-index optimization for all import extractors - codebase_scraper.py: Module-level import re, pre-imported parser classes outside loop - github_scraper.py: deque.popleft() for O(1) tree traversal, module-level import fnmatch - utils.py: Shared build_line_index() / offset_to_line() utilities (DRY) - test_adaptor_benchmarks.py: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations) Review fixes applied on top of original PR: 1. Renamed misleading _pending_set to _enqueued_urls 2. Extracted duplicated line-index code into shared cli/utils.py 3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories() 4. Removed unnecessary _store_results() closure 5. Simplified parser pre-import pattern	2026-03-14 23:35:39 +03:00
yusyus	68bdbe8307	style: ruff format remaining 14 files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 10:54:45 +03:00
yusyus	064405c052	fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline Bug fixes: - Fix --var flag silently dropped in create routing (args.workflow_var → args.var) - Fix double _score_code_quality() call in word scraper - Add .docx file extension validation in WordToSkillConverter - Fix weaviate ImportError masked by generic Exception handler - Fix RAG chunking crash using non-existent converter.output_dir Chunking pipeline improvements: - Wire --chunk-overlap-tokens through entire package pipeline (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker) - Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default - Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept) - Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12 concrete adaptors, rag_chunker, base, and package_skill Code quality: - Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor base class, removing ~150 lines of duplication from chroma/weaviate/pinecone - Add Pinecone adaptor with full upload support (pinecone_adaptor.py) Tests (14 new): - chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag - .docx/.doc/no-extension file validation, --var flag routing E2E - Embedding method inheritance verification, backward-compatible flag aliases Docs: - Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH) - Update README test count badge (1880+ → 2283+) All 2283 tests passing, 8 skipped, 0 failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:57:59 +03:00
yusyus	b81d55fda0	feat(B2): add Microsoft Word (.docx) support Implements ROADMAP task B2 — full .docx scraping support via mammoth + python-docx, producing SKILL.md + references/ output identical to other source types. New files: - src/skill_seekers/cli/word_scraper.py — WordToSkillConverter class + main() entry point (~600 lines); mammoth → BeautifulSoup pipeline; handles headings, code detection (incl. monospace <p><br> blocks), tables, images, metadata extraction - src/skill_seekers/cli/arguments/word.py — add_word_arguments() + WORD_ARGUMENTS dict - src/skill_seekers/cli/parsers/word_parser.py — WordParser for unified CLI parser registry - tests/test_word_scraper.py — comprehensive test suite (~300 lines) Modified files: - src/skill_seekers/cli/main.py — registered "word" command module - src/skill_seekers/cli/source_detector.py — .docx auto-detection + _detect_word() classmethod - src/skill_seekers/cli/create_command.py — _route_word() + --help-word - src/skill_seekers/cli/arguments/create.py — WORD_ARGUMENTS + routing - src/skill_seekers/cli/arguments/__init__.py — export word args - src/skill_seekers/cli/parsers/__init__.py — register WordParser - src/skill_seekers/cli/unified_scraper.py — _scrape_word() integration - src/skill_seekers/cli/pdf_scraper.py — fix: real enhancement instead of stub; remove [:3] reference file limit; capture run_workflows return - src/skill_seekers/cli/github_scraper.py — fix: remove arbitrary open_issues[:20] / closed_issues[:10] reference file limits - pyproject.toml — skill-seekers-word entry point + docx optional dep - tests/test_cli_parsers.py — update parser count 21→22 Bug fixes applied during real-world testing: - Code detection: detect monospace <p><br> blocks as code (mammoth renders Courier paragraphs this way, not as <pre>/<code>) - Language detector: fix wrong method name detect_from_text → detect_from_code - Description inference: pass None from main() so extract_docx() can infer description from Word document subject/title metadata - Bullet-point guard: exclude prose starting with •/-/* from code scoring - Enhancement: implement real API/LOCAL enhancement (was stub) - pip install message: add quotes around skill-seekers[docx] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-25 21:47:30 +03:00
yusyus	b636a0a292	fix: resolve issue #299 and Phase 1 cleanup - Fix #299: rename --chunk-size/--chunk-overlap to --streaming-chunk-size/ --streaming-overlap in arguments/package.py to avoid collision with the RAG --chunk-size flag from arguments/common.py - Phase 1a: make package_skill.py import args via add_package_arguments() instead of a 105-line inline duplicate argparse block; fixes the root cause of _reconstruct_argv() passing unrecognised flag names - Phase 1b: centralise setup_logging() into utils.py and remove 4 duplicate module-level logging.basicConfig() calls from doc_scraper.py, github_scraper.py, codebase_scraper.py, and unified_scraper.py - Fix test_package_structure.py / test_cli_paths.py version strings (3.1.1 → 3.1.2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 21:22:05 +03:00
YusufKaraaslanSpyke	3adc5a8c1d	fix: unify scraper argument interface and fix create command forwarding All scrapers (scrape, github, analyze, pdf) now share a common argument contract via add_all_standard_arguments() in arguments/common.py. Universal flags (--dry-run, --verbose, --quiet, --name, --description, workflow args) work consistently across all source types. Previously, `create <url> --dry-run`, `create owner/repo --dry-run`, and `create ./path --dry-run` would crash because sub-scrapers didn't accept those flags. Also fixes main.py _handle_analyze_command() not forwarding --dry-run, --preset, --quiet, --name, --description to codebase_scraper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 20:56:13 +03:00
yusyus	c996e88dac	feat: wire --local-repo-path into create command and add validation - Add --local-repo-path to UNIVERSAL_ARGUMENTS in create.py so it is registered in the actual parser (not just help display) - Add --local-repo-path to GITHUB_ARGUMENTS in arguments/github.py for the standalone github subcommand - Forward --local-repo-path through create_command._route_github() to github_scraper - Add local_repo_path to the config dict built from CLI args in github_scraper.main() - Add early validation in GitHubScraper.__init__(): warn and reset to None if path does not exist, triggering a real GitHub API fallback instead of silently operating with an empty file tree (fixes #281) - Update test_create_arguments.py count/names assertions (17 -> 18) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-20 07:28:49 +03:00
yusyus	4b89e0a015	style: apply ruff format to all source and test files Fixes ruff format --check CI failure. 22 files reformatted to satisfy the ruff formatter's style requirements. No logic changes, only whitespace/formatting adjustments. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-18 22:50:05 +03:00
yusyus	60c46673ed	feat: support multiple --enhance-workflow flags with shared workflow_runner - Change --enhance-workflow from type:str to action:append in all argument files (workflow, create, scrape, github, pdf) so the flag can be given multiple times to chain workflows in sequence - Add workflow_runner.py: shared utility used by all 4 scrapers - collect_workflow_vars(): merges extra context then user --var flags (user flags take precedence over scraper metadata) - run_workflows(): executes named workflows in order, then any inline --enhance-stage workflow; handles dry-run/preview mode - Remove duplicate ~115-130 line workflow blocks from doc_scraper, github_scraper, pdf_scraper, and codebase_scraper; replace with single run_workflows() call each - Remove mutual exclusivity between workflows and AI enhancement: workflows now run first, then traditional enhancement continues independently (--enhance-level 0 to disable) - Add tests/test_workflow_runner.py: 21 tests covering no-flags, single workflow, multiple/chained workflows, inline stages, mixed mode, variable precedence, and dry-run - Fix test_markdown_parsing: accept "text" or "unknown" for unlabelled code blocks (unified MarkdownParser returns "text" by default) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-17 22:05:27 +03:00
yusyus	c3abb83fc8	fix: Use Optional[] for forward reference type union (Python 3.10 compat) - Changed 'pathspec.PathSpec' | None to Optional['pathspec.PathSpec'] - Fixes TypeError in Python 3.10/3.11 where | operator doesn't work with string literals - Adds Optional to typing imports	2026-02-15 20:37:02 +03:00
yusyus	57061b7daf	style: Auto-format 48 files with ruff format - Fixed formatting to comply with ruff standards - No functional changes, only formatting/style - Completes CI/CD pipeline formatting requirements	2026-02-15 20:24:32 +03:00
yusyus	83b03d9f9f	fix: Resolve all linting errors from ruff Fix 145 linting errors across CLI refactor code: Type annotation modernization (Python 3.9+): - Replace typing.Dict with dict - Replace typing.List with list - Replace typing.Set with set - Replace Optional[X] with X | None Code quality improvements: - Remove trailing whitespace (W291) - Remove whitespace from blank lines (W293) - Remove unused imports (F401) - Use dictionary lookup instead of if-elif chains (SIM116) - Combine nested if statements (SIM102) Files fixed (45 files): - src/skill_seekers/cli/arguments/.py (10 files) - src/skill_seekers/cli/parsers/.py (24 files) - src/skill_seekers/cli/presets/.py (4 files) - src/skill_seekers/cli/create_command.py - src/skill_seekers/cli/source_detector.py - src/skill_seekers/cli/github_scraper.py - tests/test_.py (5 test files) All files now pass ruff linting checks. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-15 20:20:55 +03:00
yusyus	ba1670a220	feat: Unified create command + consolidated enhancement flags This commit includes two major improvements: ## 1. Unified Create Command (v3.0.0 feature) - Auto-detects source type (web, GitHub, local, PDF, config) - Three-tier argument organization (universal, source-specific, advanced) - Routes to existing scrapers (100% backward compatible) - Progressive disclosure: 15 universal flags in default help New files: - src/skill_seekers/cli/source_detector.py - Auto-detection logic - src/skill_seekers/cli/arguments/create.py - Argument definitions - src/skill_seekers/cli/create_command.py - Main orchestrator - src/skill_seekers/cli/parsers/create_parser.py - Parser integration Tests: - tests/test_source_detector.py (35 tests) - tests/test_create_arguments.py (30 tests) - tests/test_create_integration_basic.py (10 tests) ## 2. Enhanced Flag Consolidation (Phase 1) - Consolidated 3 flags (--enhance, --enhance-local, --enhance-level) → 1 flag - --enhance-level 0-3 with auto-detection of API vs LOCAL mode - Default: --enhance-level 2 (balanced enhancement) Modified files: - arguments/{common,create,scrape,github,analyze}.py - Added enhance_level - {doc_scraper,github_scraper,config_extractor,main}.py - Updated logic - create_command.py - Uses consolidated flag Auto-detection: - If ANTHROPIC_API_KEY set → API mode - Else → LOCAL mode (Claude Code) ## 3. PresetManager Bug Fix - Fixed module naming conflict (presets.py vs presets/ directory) - Moved presets.py → presets/manager.py - Updated __init__.py exports Test Results: - All 160+ tests passing - Zero regressions - 100% backward compatible Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-15 14:29:19 +03:00
yusyus	32e080da1f	feat: Complete Unity/game engine support and local source type validation Completes the implementation for Unity/Unreal/Godot game engine support and adds missing "local" source type validation. Changes: - Add "local" to VALID_SOURCE_TYPES in config_validator.py - Add _validate_local_source() method with full validation - Add Unity/Unreal/Godot to FRAMEWORK_MARKERS for priority detection - Add game engine directory exclusions to all 3 scrapers: * Unity: Library/, Temp/, Logs/, UserSettings/, etc. * Unreal: Intermediate/, Saved/, DerivedDataCache/ * Godot: .godot/, .import/ - Prevents scanning massive build cache directories (saves GBs + hours) This completes all features mentioned in PR #278: ✅ Unity/Unreal/Godot framework detection with priority ✅ Pattern enhancement performance fix (grouped approach) ✅ Game engine directory exclusions ✅ Phase 5 SKILL.md AI enhancement ✅ Local source references copying ✅ "local" source type validation ✅ Config field name compatibility ✅ C# test example extraction Tested: - All unified config tests pass (18/18) - All config validation tests pass (28/28) - Ready for Unity project testing Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-02 21:06:01 +03:00
yusyus	6439c85cde	fix: Fix list comprehension variable names (NameError in CI) Fixed incorrect variable names in list comprehensions that were causing NameError in CI (Python 3.11/3.12): Critical fixes: - tests/test_markdown_parsing.py: 'l' → 'link' in list comprehension - src/skill_seekers/cli/pdf_extractor_poc.py: 'l' → 'line' (2 occurrences) Additional auto-lint fixes: - Removed unused imports in llms_txt_downloader.py, llms_txt_parser.py - Fixed comparison operators in config files - Fixed list comprehension in other files All tests now pass in CI. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-17 23:33:34 +03:00
yusyus	596b219599	fix: Resolve remaining 188 linting errors (249 total fixed) Second batch of comprehensive linting fixes: Unused Arguments/Variables (136 errors): - ARG002/ARG001 (91 errors): Prefixed unused method/function arguments with '_' - Interface methods in adaptors (base.py, gemini.py, markdown.py) - AST analyzer methods maintaining signatures (code_analyzer.py) - Test fixtures and hooks (conftest.py) - Added noqa: ARG001/ARG002 for pytest hooks requiring exact names - F841 (45 errors): Prefixed unused local variables with '_' - Tuple unpacking where some values aren't needed - Variables assigned but not referenced Loop & Boolean Quality (28 errors): - B007 (18 errors): Prefixed unused loop control variables with '_' - enumerate() loops where index not used - for-in loops where loop variable not referenced - E712 (10 errors): Simplified boolean comparisons - Changed '== True' to direct boolean check - Changed '== False' to 'not' expression - Improved test readability Code Quality (24 errors): - SIM201 (4 errors): Already fixed in previous commit - SIM118 (2 errors): Already fixed in previous commit - E741 (4 errors): Already fixed in previous commit - Config manager loop variable fix (1 error) All Tests Passing: - test_scraper_features.py: 42 passed - test_integration.py: 51 passed - test_architecture_scenarios.py: 11 passed - test_real_world_fastmcp.py: 19 passed, 1 skipped Note: Some SIM errors (nested if, multiple with) remain unfixed as they would require non-trivial refactoring. Focus was on functional correctness. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-17 23:02:11 +03:00
Pablo Estevez	c33c6f9073	change max length	2026-01-17 17:48:15 +00:00
Pablo Estevez	5ed767ff9a	run ruff	2026-01-17 17:29:21 +00:00
yusyus	c89f059712	feat(v2.7.0): Smart Rate Limit Management & Multi-Token Configuration Major Features: - Multi-profile GitHub token system with secure storage - Smart rate limit handler with 4 strategies (prompt/wait/switch/fail) - Interactive configuration wizard with browser integration - Configurable timeout (default 30 min) per profile - Automatic profile switching on rate limits - Live countdown timers with real-time progress - Non-interactive mode for CI/CD (--non-interactive flag) - Progress tracking and resume capability (skeleton) - Comprehensive test suite (16 tests, all passing) Solves: - Indefinite waiting on GitHub rate limits - Confusing GitHub token setup Files Added: - src/skill_seekers/cli/config_manager.py (~490 lines) - src/skill_seekers/cli/config_command.py (~400 lines) - src/skill_seekers/cli/rate_limit_handler.py (~450 lines) - src/skill_seekers/cli/resume_command.py (~150 lines) - tests/test_rate_limit_handler.py (16 tests) Files Modified: - src/skill_seekers/cli/github_fetcher.py (rate limit integration) - src/skill_seekers/cli/github_scraper.py (--non-interactive, --profile flags) - src/skill_seekers/cli/main.py (config, resume subcommands) - pyproject.toml (version 2.7.0) - CHANGELOG.md, README.md, CLAUDE.md (documentation) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-17 18:38:31 +03:00
yusyus	a99e22c639	feat: Multi-Source Synthesis Architecture - Rich Standalone Skills + Smart Combination BREAKING CHANGE: Major architectural improvements to multi-source skill generation This commit implements the complete "Multi-Source Synthesis Architecture" where each source (documentation, GitHub, PDF) generates a rich standalone SKILL.md file before being intelligently synthesized with source-specific formulas. ## 🎯 Core Architecture Changes ### 1. Rich Standalone SKILL.md Generation (Source Parity) Each source now generates comprehensive, production-quality SKILL.md files that can stand alone OR be synthesized with other sources. GitHub Scraper Enhancements (+263 lines): - Now generates 300+ line SKILL.md (was ~50 lines) - Integrates C3.x codebase analysis data: - C2.5: API Reference extraction - C3.1: Design pattern detection (27 high-confidence patterns) - C3.2: Test example extraction (215 examples) - C3.7: Architectural pattern analysis - Enhanced sections: - ⚡ Quick Reference with pattern summaries - 📝 Code Examples from real repository tests - 🔧 API Reference from codebase analysis - 🏗️ Architecture Overview with design patterns - ⚠️ Known Issues from GitHub issues - Location: src/skill_seekers/cli/github_scraper.py PDF Scraper Enhancements (+205 lines): - Now generates 200+ line SKILL.md (was ~50 lines) - Enhanced content extraction: - 📖 Chapter Overview (PDF structure breakdown) - 🔑 Key Concepts (extracted from headings) - ⚡ Quick Reference (pattern extraction) - 📝 Code Examples: Top 15 (was top 5), grouped by language - Quality scoring and intelligent truncation - Better formatting and organization - Location: src/skill_seekers/cli/pdf_scraper.py Result: All 3 sources (docs, GitHub, PDF) now have equal capability to generate rich, comprehensive standalone skills. ### 2. File Organization & Caching System Problem: output/ directory cluttered with intermediate files, data, and logs. 
Solution: New `.skillseeker-cache/` hidden directory for all intermediate files. New Structure: ``` .skillseeker-cache/{skill_name}/ ├── sources/ # Standalone SKILL.md from each source │ ├── httpx_docs/ │ ├── httpx_github/ │ └── httpx_pdf/ ├── data/ # Raw scraped data (JSON) ├── repos/ # Cloned GitHub repositories (cached for reuse) └── logs/ # Session logs with timestamps output/{skill_name}/ # CLEAN: Only final synthesized skill ├── SKILL.md └── references/ ``` Benefits: - ✅ Clean output/ directory (only final product) - ✅ Intermediate files preserved for debugging - ✅ Repository clones cached and reused (faster re-runs) - ✅ Timestamped logs for each scraping session - ✅ All cache dirs added to .gitignore Changes: - .gitignore: Added `.skillseeker-cache/` entry - unified_scraper.py: Complete reorganization (+238 lines) - Added cache directory structure - File logging with timestamps - Repository cloning with caching/reuse - Cleaner intermediate file management - Better subprocess logging and error handling ### 3. 
Config Repository Migration Moved to separate config repository: https://github.com/yusufkaraaslan/skill-seekers-configs Deleted from this repo (35 config files): - ansible-core.json, astro.json, claude-code.json - django.json, django_unified.json, fastapi.json, fastapi_unified.json - godot.json, godot_unified.json, godot_github.json, godot-large-example.json - react.json, react_unified.json, react_github.json, react_github_example.json - vue.json, kubernetes.json, laravel.json, tailwind.json, hono.json - svelte_cli_unified.json, steam-economy-complete.json - deck_deck_go_local.json, python-tutorial-test.json, example_pdf.json - test-manual.json, fastapi_unified_test.json, fastmcp_github_example.json - example-team/ directory (4 files) Kept as reference example: - configs/httpx_comprehensive.json (complete multi-source example) Rationale: - Cleaner repository (979+ lines added, 1680 deleted) - Configs managed separately with versioning - Official presets available via `fetch-config` command - Users can maintain private config repos ### 4. AI Enhancement Improvements enhance_skill.py (+125 lines): - Better integration with multi-source synthesis - Enhanced prompt generation for synthesized skills - Improved error handling and logging - Support for source metadata in enhancement ### 5. Documentation Updates CLAUDE.md (+252 lines): - Comprehensive project documentation - Architecture explanations - Development workflow guidelines - Testing requirements - Multi-source synthesis patterns SKILL_QUALITY_ANALYSIS.md (new): - Quality assessment framework - Before/after analysis of httpx skill - Grading rubric for skill quality - Metrics and benchmarks ### 6. 
Testing & Validation Scripts test_httpx_skill.sh (new): - Complete httpx skill generation test - Multi-source synthesis validation - Quality metrics verification test_httpx_quick.sh (new): - Quick validation script - Subset of features for rapid testing ## 📊 Quality Improvements \| Metric \| Before \| After \| Improvement \| \|--------\|--------\|-------\|-------------\| \| GitHub SKILL.md lines \| ~50 \| 300+ \| +500% \| \| PDF SKILL.md lines \| ~50 \| 200+ \| +300% \| \| GitHub C3.x integration \| ❌ No \| ✅ Yes \| New feature \| \| PDF pattern extraction \| ❌ No \| ✅ Yes \| New feature \| \| File organization \| Messy \| Clean cache \| Major improvement \| \| Repository cloning \| Always fresh \| Cached reuse \| Faster re-runs \| \| Logging \| Console only \| Timestamped files \| Better debugging \| \| Config management \| In-repo \| Separate repo \| Cleaner separation \| ## 🧪 Testing All existing tests pass: - test_c3_integration.py: Updated for new architecture - 700+ tests passing - Multi-source synthesis validated with httpx example ## 🔧 Technical Details Modified Core Files: 1. src/skill_seekers/cli/github_scraper.py (+263 lines) - _generate_skill_md(): Rich content with C3.x integration - _format_pattern_summary(): Design pattern summaries - _format_code_examples(): Test example formatting - _format_api_reference(): API reference from codebase - _format_architecture(): Architectural pattern analysis 2. src/skill_seekers/cli/pdf_scraper.py (+205 lines) - _generate_skill_md(): Enhanced with rich content - _format_key_concepts(): Extract concepts from headings - _format_patterns_from_content(): Pattern extraction - Code examples: Top 15, grouped by language, better quality scoring 3. 
src/skill_seekers/cli/unified_scraper.py (+238 lines) - __init__(): Cache directory structure - _setup_logging(): File logging with timestamps - _clone_github_repo(): Repository caching system - _scrape_documentation(): Move to cache, better logging - Better subprocess handling and error reporting 4. src/skill_seekers/cli/enhance_skill.py (+125 lines) - Multi-source synthesis awareness - Enhanced prompt generation - Better error handling Minor Updates: - src/skill_seekers/cli/codebase_scraper.py (+3 lines): Minor improvements - src/skill_seekers/cli/test_example_extractor.py: Quality scoring adjustments - tests/test_c3_integration.py: Test updates for new architecture ## 🚀 Migration Guide For users with existing configs: No action required - all existing configs continue to work. For users wanting official presets: ```bash # Fetch from official config repo skill-seekers fetch-config --name react --target unified # Or use existing local configs skill-seekers unified --config configs/httpx_comprehensive.json ``` Cache directory: New `.skillseeker-cache/` directory will be created automatically. Safe to delete - will be regenerated on next run. 
## 📈 Next Steps This architecture enables: - ✅ Source parity: All sources generate rich standalone skills - ✅ Smart synthesis: Each combination has optimal formula - ✅ Better debugging: Cached files and logs preserved - ✅ Faster iteration: Repository caching, clean output - 🔄 Future: Multi-platform enhancement (Gemini, GPT-4) - planned - 🔄 Future: Conflict detection between sources - planned - 🔄 Future: Source prioritization rules - planned ## 🎓 Example: httpx Skill Quality Before: 186 lines, basic synthesis, missing data After: 640 lines with AI enhancement, A- (9/10) quality What changed: - All C3.x analysis data integrated (patterns, tests, API, architecture) - GitHub metadata included (stars, topics, languages) - PDF chapter structure visible - Professional formatting with emojis and clear sections - Real-world code examples from test suite - Design patterns explained with confidence scores - Known issues with impact assessment 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-11 23:01:07 +03:00
yusyus	eac1f4ef8e	feat(C2.1): Add .gitignore support to github_scraper for local repos - Add pathspec import with graceful fallback - Add gitignore_spec attribute to GitHubScraper class - Implement _load_gitignore() method to parse .gitignore files - Update should_exclude_dir() to check .gitignore rules - Load .gitignore automatically in local repository mode - Handle directory patterns with and without trailing slash - Add 4 comprehensive tests for .gitignore functionality Closes #63 - C2.1 File Tree Walker with .gitignore support complete Features: - Loads .gitignore from local repository root - Respects .gitignore patterns for directory exclusion - Falls back gracefully when pathspec not installed - Works alongside existing hard-coded exclusions - Only active in local_repo_path mode (not GitHub API mode) Test coverage: - test_load_gitignore_exists: .gitignore parsing - test_load_gitignore_missing: Missing .gitignore handling - test_should_exclude_dir_with_gitignore: .gitignore exclusion - test_should_exclude_dir_default_exclusions: Existing exclusions still work Integration: - github_scraper.py now has same .gitignore support as codebase_scraper.py - Both tools use pathspec library for consistent behavior - Enables proper repository analysis respecting project .gitignore rules	2026-01-01 23:21:12 +03:00
yusyus	f2faebb8d5	fix: Complete fix for Issue #219 - All three problems resolved Problem #1: Large File Encoding Error ✅ FIXED - Add large file download support via download_url - Detect encoding='none' for files >1MB - Download via GitHub raw URL instead of API - Handles ccxt/ccxt's 1.4MB CHANGELOG.md successfully Problem #2: Missing CLI Enhancement Flags ✅ FIXED - Add --enhance, --enhance-local, --api-key to main.py github_parser - Add flag forwarding in CLI dispatcher - Fixes 'unrecognized arguments' error - Users can now use: skill-seekers github --repo owner/repo --enhance-local Problem #3: Custom API Endpoint Support ✅ FIXED - Support ANTHROPIC_BASE_URL environment variable - Support ANTHROPIC_AUTH_TOKEN (alternative to ANTHROPIC_API_KEY) - Fix ThinkingBlock.text error with newer Anthropic SDK - Find TextBlock in response content array (handles thinking blocks) Changes: - src/skill_seekers/cli/enhance_skill.py: - Support custom base_url parameter - Support both ANTHROPIC_API_KEY and ANTHROPIC_AUTH_TOKEN - Iterate through content blocks to find text (handles ThinkingBlock) - src/skill_seekers/cli/main.py: - Add --enhance, --enhance-local, --api-key to github_parser - Forward flags to github_scraper.py in dispatcher - src/skill_seekers/cli/github_scraper.py: - Add large file detection (encoding=None/"none") - Download via download_url with requests - Log file size and download progress - tests/test_github_scraper.py: - Add test_get_file_content_large_file - Add test_extract_changelog_large_file - All 31 tests passing ✅ Credits: - Thanks to @XGCoder for detailed bug report - Thanks to @gorquan for local fixes and guidance Fixes #219 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-01 20:57:03 +03:00
yusyus	58286f454a	fix: Handle symlinked README.md and CHANGELOG.md in GitHub scraper - Add _get_file_content() helper method to detect and follow symlinks - Update _extract_readme() to use new helper - Update _extract_changelog() to use new helper - Add 7 comprehensive tests for symlink handling - All 29 GitHub scraper tests passing Fixes #225 When README.md or CHANGELOG.md are symlinks (like in vercel/ai repo), PyGithub returns ContentFile with type='symlink' and encoding=None. Direct access to decoded_content throws AssertionError. Solution: Detect symlink type, follow target path, then decode actual file. Handles edge cases: broken symlinks, missing targets, encoding errors. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-01-01 20:41:28 +03:00
yusyus	74bae4b49f	feat(#191): Smart description generation for skill descriptions Implements hybrid smart extraction + improved fallback templates for skill descriptions across all scrapers. Changes: - github_scraper.py: * Added extract_description_from_readme() helper * Extracts from README first paragraph (60 lines) * Updates description after README extraction * Fallback: "Use when working with {name}" * Updated 3 locations (GitHubScraper, GitHubToSkillConverter, main) - doc_scraper.py: * Added infer_description_from_docs() helper * Extracts from meta tags or first paragraph (65 lines) * Tries: meta description, og:description, first content paragraph * Fallback: "Use when working with {name}" * Updated 2 locations (create_enhanced_skill_md, get_configuration) - pdf_scraper.py: * Added infer_description_from_pdf() helper * Extracts from PDF metadata (subject, title) * Fallback: "Use when referencing {name} documentation" * Updated 3 locations (PDFToSkillConverter, main x2) - generate_router.py: * Updated 2 locations with improved router descriptions * "Use when working with {name} development and programming" All changes: - Only apply to NEW skill generations (don't modify existing) - No API calls (free/offline) - Smart extraction when metadata/README available - Improved "Use when..." fallbacks instead of generic templates - 612 tests passing (100%) Fixes #191 Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-28 19:00:26 +03:00
yusyus	c411eb24ec	fix: Add UTF-8 encoding to all file operations for Windows compatibility Fixes #209 - UnicodeDecodeError on Windows with non-ASCII characters Problem: Windows users with non-English locales (Chinese, Japanese, Korean, etc.) experienced GBK/SHIFT-JIS codec errors when the system default encoding is not UTF-8. Error: 'gbk' codec can't decode byte 0xac in position 206: illegal multibyte sequence Root Cause: File operations using open() without explicit encoding parameter use the system default encoding, which on Windows Chinese edition is GBK. JSON files contain UTF-8 encoded characters that fail to decode with GBK. Solution: Added encoding='utf-8' to ALL file operations across: - doc_scraper.py (4 instances): * load_config() - line 1310 * check_existing_data() - line 1416 * save_checkpoint() - line 173 * load_checkpoint() - line 186 - github_scraper.py (1 instance): * main() config loading - line 922 - unified_scraper.py (10 instances): * All JSON read/write operations - lines 134, 153, 205, 239, 275, 278, 325, 328, 342, 364 Test Results: - ✅ All 612 tests passing (100% pass rate) - ✅ Backward compatible (UTF-8 is standard on Linux/macOS) - ✅ Fixes Windows locale issues Impact: - ✅ Works on ALL Windows locales (Chinese, Japanese, Korean, etc.) - ✅ Maintains compatibility with Linux/macOS - ✅ Prevents future encoding issues Thanks to: @my5icol for the detailed bug report and fix suggestion!	2025-12-28 18:27:50 +03:00
yusyus	eb3b9d9175	fix: Add robust CHANGELOG encoding handling and enhancement flags Fixes #219 - Two issues resolved: 1. Encoding Error Fix: - Added graceful error handling for CHANGELOG extraction - Handles 'unsupported encoding: none' error from GitHub API - Falls back to latin-1 encoding if UTF-8 fails - Logs warnings instead of crashing - Continues processing even if CHANGELOG has encoding issues 2. Enhancement Flags Added: - Added --enhance-local flag to github command - Added --enhance flag for API-based enhancement - Added --api-key flag for API authentication - Auto-enhancement after skill building when flags used - Matches doc_scraper.py functionality Test Results: - ✅ All 612 tests passing (100% pass rate) - ✅ All 22 github_scraper tests passing - ✅ Backward compatible Usage: ```bash # Local enhancement (no API key needed) skill-seekers github --repo ccxt/ccxt --name ccxtSkills --enhance-local # API-based enhancement skill-seekers github --repo owner/repo --enhance --api-key sk-ant-... ```	2025-12-28 18:21:03 +03:00
yusyus	65ded6c07c	fix: Fix local repo extraction limitations (code analyzer, exclusions, enhancement) This commit fixes three critical limitations discovered during local repository skill extraction testing: Fix 1: Code Analyzer Import Issue - Changed unified_scraper.py to use absolute imports instead of relative imports - Fixed: `from github_scraper import` → `from skill_seekers.cli.github_scraper import` - Fixed: `from pdf_scraper import` → `from skill_seekers.cli.pdf_scraper import` - Result: CodeAnalyzer now available during extraction, deep analysis works Fix 2: Unity Library Exclusions - Updated should_exclude_dir() to accept and check full directory paths - Updated _extract_file_tree_local() to pass both dir name and full path - Added exclusion config passing from unified_scraper to github_scraper - Result: exclude_dirs_additional now works (297 files excluded in test) Fix 3: AI Enhancement for Single Sources - Changed read_reference_files() to use rglob() for recursive search - Now finds reference files in subdirectories (e.g., references/github/README.md) - Result: AI enhancement works with unified skills that have nested references Test Results: - Code Analyzer: ✅ Working (deep analysis running) - Unity Exclusions: ✅ Working (297 files excluded from 679) - AI Enhancement: ✅ Working (finds and reads nested references) Files Changed: - src/skill_seekers/cli/unified_scraper.py (Fix 1 & 2) - src/skill_seekers/cli/github_scraper.py (Fix 2) - src/skill_seekers/cli/utils.py (Fix 3) Test Artifacts: - configs/deck_deck_go_local.json (test configuration) - docs/LOCAL_REPO_TEST_RESULTS.md (comprehensive test report) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2025-12-21 22:24:38 +03:00
yusyus	ea289cebe1	feat: Make EXCLUDED_DIRS configurable for local repository analysis Closes #203 Adds configuration options to customize directory exclusions during local repository analysis, while maintaining backward compatibility with smart defaults. New Config Options: 1. `exclude_dirs_additional` - Extend defaults (most common) - Adds custom directories to default exclusions - Example: ["proprietary", "legacy", "third_party"] - Total exclusions = defaults + additional 2. `exclude_dirs` - Replace defaults (advanced users) - Completely overrides default exclusions - Example: ["node_modules", ".git", "custom_vendor"] - Gives full control over exclusions Implementation: - Modified GitHubScraper.__init__() to parse exclude_dirs config - Changed should_exclude_dir() to use instance variable instead of global - Added logging for custom exclusions (INFO for extend, WARNING for replace) - Maintains backward compatibility (no config = use defaults) Testing: - Added 12 comprehensive tests in test_excluded_dirs_config.py - 3 tests for defaults (backward compatibility) - 3 tests for extend mode - 3 tests for replace mode - 1 test for precedence - 2 tests for edge cases - All 12 new tests passing ✅ - All 22 existing github_scraper tests passing ✅ Documentation: - Updated CLAUDE.md config parameters section - Added detailed "Configurable Directory Exclusions" feature section - Included examples for both modes - Listed common use cases (monorepos, enterprise, legacy codebases) Use Cases: - Monorepos with custom directory structures - Enterprise projects with non-standard naming conventions - Including unusual directories for analysis - Minimal exclusions for small/simple projects Backward Compatibility: ✅ Fully backward compatible - existing configs work unchanged ✅ Smart defaults maintained when no config provided ✅ All existing tests pass Co-authored-by: jimmy058910 <jimmy058910@users.noreply.github.com>	2025-11-29 23:53:27 +03:00
yusyus	58ec69eb52	feat: Add unlimited local repository analysis with bug fixes (PR #195) Merges PR #195 by @jimmy058910 with conflict resolution. New Features: - Local repository analysis via `local_repo_path` configuration - Bypass GitHub API rate limits (50 → unlimited files) - Auto-exclusion of virtual environments and build artifacts - Support for analyzing large codebases (323 files vs 50 before) Improvements: - Code analysis coverage: 14% → 93.6% (+79.6pp) - Files analyzed: 50 → 323 (+546%) - Classes extracted: 55 → 585 (+964%) - Functions extracted: 512 → 2,784 (+444%) - AST parsing errors: 95 → 0 (-100%) Conflict Resolution: - Preserved logger initialization fix from development (Issue #190) - Kept relative imports from development (Task 1.2 fix) - Integrated EXCLUDED_DIRS and local repo features from PR - Combined best of both implementations Testing: - ✅ All 22 GitHub scraper tests passing - ✅ Syntax validation passed - ✅ Local repo analysis feature intact - ✅ Bug fixes from development preserved Original implementation by @jimmy058910 in PR #195. Conflict resolution preserves all bug fixes while adding local repo feature. Co-authored-by: jimmy058910 <jimmy058910@users.noreply.github.com>	2025-11-29 22:46:31 +03:00
yusyus	414519b3c7	fix: Initialize logger before use in github_scraper.py Fixes Issue #190 - "name 'logger' is not defined" error Problem: - Logger was used at line 40 (in code_analyzer import exception) - Logger was defined at line 47 - Caused runtime error when code_analyzer import failed Solution: - Moved logging.basicConfig() and logger initialization to lines 34-39 - Now logger is defined BEFORE the code_analyzer import block - Warning message now works correctly when code_analyzer is missing Testing: - ✅ All 22 GitHub scraper tests pass - ✅ Logger warning appears correctly when code_analyzer missing - ✅ No similar issues found in other CLI files Closes #190	2025-11-29 22:01:38 +03:00
yusyus	d7a4c51427	fix: Convert absolute imports to relative imports in cli modules Fixes #193 - PDF scraping broken for PyPI users Changed 3 files from absolute to relative imports to fix ModuleNotFoundError when package is installed via pip: 1. pdf_scraper.py:22 - from pdf_extractor_poc import → from .pdf_extractor_poc import - Fixes: skill-seekers pdf command failed with import error 2. github_scraper.py:36 - from code_analyzer import → from .code_analyzer import - Proactive fix: prevents future import errors 3. test_unified_simple.py:17 - from config_validator import → from .config_validator import - Proactive fix: test helper file These absolute imports worked locally due to sys.path differences but failed when installed via PyPI (pip install skill-seekers). Tested with: - skill-seekers pdf command now works ✅ - Extracted 32-page Godot Farming PDF successfully All CLI commands should now work correctly when installed from PyPI. 🤖 Generated with [Claude Code](https://claude.ai/code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-29 21:47:18 +03:00
Jimmy Moceri	0b2a0d121e	feat: Add unlimited local repository analysis and fix 10 critical bugs Features: - Add local_repo_path config parameter for unlimited file analysis - Auto-exclude virtual environments and build artifacts (95% noise reduction) - Enable comprehensive codebase analysis (50 → 323 files, 546% increase) Bug Fixes: - Fix logger initialization error (Issue #190) - Fix NoneType subscriptable errors in release tag parsing (3 instances) - Fix relative import paths causing ModuleNotFoundError - Fix hardcoded 50-file analysis limit - Fix GitHub API file tree limitation (140 → 345 files discovered) - Fix AST parser 'not iterable' errors (95 → 0 parsing failures) - Fix virtual environment file pollution (23,341 → 1,109 file tree items) - Fix force_rescrape flag not checked before interactive prompt Impact: - Code coverage: 14% → 93.6% (+79.6pp) - Files analyzed: 50 → 323 (+546%) - Classes extracted: 55 → 585 (+964%) - Functions extracted: 512 → 2,784 (+444%) - AST errors: 95 → 0 (-100%) Tested on JMo Security repository with 345 Python files.	2025-11-16 22:35:23 -05:00
yusyus	13ca374295	refactor: Update CLI commands to use new unified entry points Updated all command examples in CLI scripts from old pattern: python3 cli/<script>.py → skill-seekers <command> Changes: - doc_scraper.py → skill-seekers scrape - github_scraper.py → skill-seekers github - pdf_scraper.py → skill-seekers pdf - unified_scraper.py → skill-seekers unified - enhance_skill.py → skill-seekers enhance - enhance_skill_local.py → skill-seekers enhance - package_skill.py → skill-seekers package - estimate_pages.py → skill-seekers estimate This reflects the new modern Python packaging with proper entry points. Users can now use clean commands instead of file paths. Files updated: 10 CLI scripts 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 01:23:17 +03:00
yusyus	ce1c07b437	feat: Add modern Python packaging - Phase 1 (Foundation) Implements issue #168 - Modern Python packaging with uv support This is Phase 1 of the modernization effort, establishing the core package structure and build system. ## Major Changes ### 1. Migrated to src/ Layout - Moved cli/ → src/skill_seekers/cli/ - Moved skill_seeker_mcp/ → src/skill_seekers/mcp/ - Created root package: src/skill_seekers/__init__.py - Updated all imports: cli. → skill_seekers.cli. - Updated all imports: skill_seeker_mcp. → skill_seekers.mcp. ### 2. Created pyproject.toml - Modern Python packaging configuration - All dependencies properly declared - 8 CLI entry points configured: * skill-seekers (unified CLI) * skill-seekers-scrape * skill-seekers-github * skill-seekers-pdf * skill-seekers-unified * skill-seekers-enhance * skill-seekers-package * skill-seekers-upload * skill-seekers-estimate - uv tool support enabled - Build system: setuptools with wheel ### 3. Created Unified CLI (main.py) - Git-style subcommands (skill-seekers scrape, etc.) - Delegates to existing tool main() functions - Full help system at top-level and subcommand level - Backwards compatible with individual commands ### 4. Updated Package Versions - cli/__init__.py: 1.3.0 → 2.0.0 - mcp/__init__.py: 1.2.0 → 2.0.0 - Root package: 2.0.0 ### 5. Updated Test Suite - Fixed test_package_structure.py for new layout - All 28 package structure tests passing - Updated all test imports for new structure ## Installation Methods (Working) ```bash # Development install pip install -e . # Run unified CLI skill-seekers --version # → 2.0.0 skill-seekers --help # Run individual tools skill-seekers-scrape --help skill-seekers-github --help ``` ## Test Results - Package structure tests: 28/28 passing ✅ - Package installs successfully ✅ - All entry points working ✅ ## Still TODO (Phase 2) - [ ] Run full test suite (299 tests) - [ ] Update documentation (README, CLAUDE.md, etc.) - [ ] Test with uv tool run/install - [ ] Build and publish to PyPI - [ ] Create PR and merge ## Breaking Changes None - fully backwards compatible. Old import paths still work. ## Migration for Users No action needed. Package works with both pip and uv. Closes #168 (when complete) 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>	2025-11-07 01:14:24 +03:00
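
The two exclusion modes described in commit ea289cebe1 (`exclude_dirs_additional` extends the defaults, `exclude_dirs` replaces them) could look roughly like the sketch below. This is not the project's code: the function name `resolve_excluded_dirs`, the contents of `DEFAULT_EXCLUDED_DIRS`, and the rule that `exclude_dirs` wins when both keys are present are all assumptions made for illustration. The commit mentions a precedence test but does not say which option takes priority.

```python
# Illustrative defaults only; the real EXCLUDED_DIRS list is not shown in this log.
DEFAULT_EXCLUDED_DIRS = frozenset({".git", "node_modules", "venv", "__pycache__", "build"})


def resolve_excluded_dirs(config: dict) -> set:
    """Return the set of directory names to skip (hypothetical helper).

    - No exclusion keys: use the defaults.
    - `exclude_dirs_additional`: defaults plus the extra names (extend mode).
    - `exclude_dirs`: only the listed names (replace mode).
    Assumption: if both keys are present, `exclude_dirs` takes precedence.
    """
    replacement = config.get("exclude_dirs")
    if replacement is not None:
        # Replace mode: the defaults are dropped entirely.
        return set(replacement)
    additional = config.get("exclude_dirs_additional") or []
    # Extend mode (or no config): start from the defaults.
    return set(DEFAULT_EXCLUDED_DIRS) | set(additional)


if __name__ == "__main__":
    print(sorted(resolve_excluded_dirs({"exclude_dirs_additional": ["legacy"]})))
    print(sorted(resolve_excluded_dirs({"exclude_dirs": ["node_modules", ".git"]})))
```

Returning a fresh `set` on every call keeps the shared defaults from being mutated, in line with the commit's move from a global to an instance variable.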

37 Commits
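
Commit c411eb24ec attributes the Windows `UnicodeDecodeError` to `open()` calls that rely on the system default encoding. The sketch below shows the general pattern of passing `encoding="utf-8"` explicitly when reading and writing JSON. The helper names `save_json_utf8` and `load_json_utf8` are hypothetical and do not appear in the repository.

```python
import json
import os
import tempfile


def save_json_utf8(path: str, data: dict) -> None:
    """Write JSON as UTF-8 regardless of the platform's default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        # ensure_ascii=False keeps non-ASCII characters as real UTF-8 bytes on disk.
        json.dump(data, f, ensure_ascii=False, indent=2)


def load_json_utf8(path: str) -> dict:
    """Read JSON as UTF-8; without encoding=, a GBK locale may fail to decode."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        config_path = os.path.join(tmp, "config.json")
        save_json_utf8(config_path, {"name": "技能", "description": "ドキュメント"})
        print(load_json_utf8(config_path))
```

Using the same explicit encoding on both the write and read side means the round trip does not depend on the locale of the machine that runs it.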