skill-seekers-reference

firefrost-gaming/skill-seekers-reference

Author	SHA1	Message	Date
yusyus	ea4fed0be4	feat: add headless browser rendering for JavaScript SPA sites (#321 ) New BrowserRenderer class uses Playwright to render JavaScript-heavy documentation sites (React, Vue SPAs) that return empty HTML shells with requests.get(). Activated via --browser flag on web scraping. - browser_renderer.py: Playwright wrapper with lazy browser launch, auto-install Chromium on first use, context manager support - doc_scraper.py: browser_mode config, _render_with_browser() helper, integrated into scrape_page() and scrape_page_async() - SPA detection warnings now suggest --browser flag - Optional dep: pip install "skill-seekers[browser]" - 14 real e2e tests (actual Chromium, no mocks) - UML updated: Scrapers class diagram (BrowserRenderer + dependency), Parsers (DoctorParser), Utilities (Doctor), Components, and new Browser Rendering sequence diagram (#20) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 22:06:14 +03:00
yusyus	006cccabae	feat: add skill-seekers doctor health check command (#316 ) 8 diagnostic checks: Python version (3.10+), package install, git, 14 core deps, 10 optional deps, API keys, MCP server, output dir. Each check reports pass/warn/fail with --verbose for extra detail. Exit code 0 if no critical failures, 1 otherwise. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:27:17 +03:00
yusyus	43bdabb84f	feat: add prompt injection check workflow for content security (#324 ) New bundled workflow `prompt-injection-check` scans scraped content for prompt injection patterns (role assumption, instruction overrides, delimiter injection, hidden instructions, encoded payloads) using AI. Flags suspicious content without removing it — preserves documentation accuracy while warning about adversarial content. Added as first stage in both `default` and `security-focus` workflows so it runs automatically with --enhance-level >= 1. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 21:17:57 +03:00
yusyus	6beff3d52f	fix: add path traversal protection to get_workflow_tool + tests (#325 ) PR #326 added _validate_name() to create/update/delete but missed get_workflow_tool, which would raise an unhandled ValueError instead of returning a user-friendly error. Added try/except handling and 6 tests covering all 4 tool functions with malicious names. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-28 20:56:52 +03:00
Spidershield-contrib	a12743769e	fix: prevent path traversal in workflow name parameter (CWE-22) (#326 ) Co-authored-by: spidershield-contrib <spidershield-contrib@users.noreply.github.com>	2026-03-28 20:55:13 +03:00
yusyus	d381315340	fix: pass enhance_level instead of removed enhance_with_ai/ai_mode to analyze_codebase (#323 ) Two call sites (_run_c3_analysis in unified_scraper.py and _analyze_c3x in unified_codebase_analyzer.py) still passed the old enhance_with_ai and ai_mode kwargs which were replaced by enhance_level. This caused a TypeError when running C3.x codebase analysis. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-27 22:14:51 +03:00
yusyus	31a57c448b	style: apply ruff formatting to github_scraper.py Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:46:42 +03:00
yusyus	d71c1d3aa3	fix: filter non-integer metadata from GitHub languages API response (#322 ) PyGithub's get_languages() returns raw API JSON which in some environments includes non-integer metadata keys (e.g., "url"), causing a TypeError in sum(). Now filters to integer values only before calculating percentages. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-26 23:44:52 +03:00
yusyus	d76ab1d9a4	fix: report accurate saved/skipped page counts and detect SPA sites (#320 , #321 ) The scraper previously reported len(visited_urls) as "Scraped N pages" even when save_page() silently skipped pages with empty content (<50 chars). For JavaScript SPA sites this meant "Scraped 190 pages" followed by "No scraped data found!" with no explanation. Changes: - Added pages_saved/pages_skipped counters to DocToSkillConverter - save_page() now increments pages_skipped on skip, pages_saved on save - New _log_scrape_completion() reports "(N saved, M skipped)" breakdown - SPA detection warns when all/most pages have empty content - build_skill() error now explains empty content cause when pages skipped - Updated both sync and async scrape completion paths - 14 new tests across 4 test classes (counting, messages, SPA, build) Fixes #320 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-24 22:26:35 +03:00
yusyus	0fa99641aa	style: fix pre-existing ruff format issues in 5 files Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 21:24:21 +03:00
yusyus	cd7b322b5e	feat: expand platform coverage with 8 new adaptors, 7 new CLI agents, and OpenCode skill tools Phase 1 - OpenCode Integration: - Add OpenCodeAdaptor with directory-based packaging and dual-format YAML frontmatter - Kebab-case name validation matching OpenCode's regex spec Phase 2 - OpenAI-Compatible LLM Platforms: - Extract OpenAICompatibleAdaptor base class from MiniMax (shared format/package/upload/enhance) - Refactor MiniMax to ~20 lines of constants inheriting from base - Add 6 new LLM adaptors: Kimi, DeepSeek, Qwen, OpenRouter, Together AI, Fireworks AI - All use OpenAI-compatible API with platform-specific constants Phase 3 - CLI Agent Expansion: - Add 7 new install-agent paths: roo, cline, aider, bolt, kilo, continue, kimi-code - Total agents: 11 -> 18 Phase 4 - Advanced Features: - OpenCode skill splitter (auto-split large docs into focused sub-skills with router) - Bi-directional skill format converter (import/export between OpenCode and any platform) - GitHub Actions template for automated skill updates Totals: 12 --target platforms, 18 --agent paths, 2915 tests passing Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 20:31:51 +03:00
yusyus	1d3d7389d7	fix: sanitize_url crashes on Python 3.14 strict urlparse (#284 ) Python 3.14's urlparse() raises ValueError on URLs with unencoded brackets that look like malformed IPv6 (e.g. http://[fdaa:x:x:x::x from docs.openclaw.ai llms-full.txt). sanitize_url() called urlparse() BEFORE encoding brackets, so it crashed before it could fix them. Fix: catch ValueError from urlparse, encode ALL brackets, then retry. This is safe because if urlparse rejected the brackets, they are NOT valid IPv6 host literals and should be encoded anyway. Also fixed Discord e2e tests to skip gracefully on network issues. Fixes #284 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-21 00:30:48 +03:00
yusyus	2ef6e59d06	fix: stop blindly appending /index.html.md to non-.md URLs (#277 ) The previous fix (`a82cf69`) only addressed anchor fragment stripping but left the fundamental problem: _convert_to_md_urls() blindly appended /index.html.md to ALL non-.md URLs from llms.txt. This only works for Docusaurus sites — for sites like Discord docs it generates mass 404s. Changes: - _convert_to_md_urls() now strips anchors and deduplicates only, preserving original URLs as-is instead of appending /index.html.md - New _has_md_extension() helper uses urlparse().path.endswith(".md") instead of error-prone ".md" in url substring matching - Fixed ".md" in url checks at 4 locations (lines 465, 554, 716, 775) - Removed 24 lines of dead commented-out code - Added real-world e2e test against docs.discord.com (no mocks) - Updated unit tests for new behavior (32 tests) Fixes #277 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 23:44:35 +03:00
yusyus	f6131c6798	fix: unified scraper temp config uses unified format for doc_scraper (#317 ) The unified scraper's _scrape_documentation() was creating temp configs in flat/legacy format (no "sources" key), causing doc_scraper's ConfigValidator to reject them. Wrap the temp config in unified format with a "sources" array. Also remove dead code branches and fix a pre-existing test that didn't clear GITHUB_TOKEN from env. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 22:35:12 +03:00
yusyus	4f87de6b56	fix: improve MiniMax adaptor from PR #318 review (#319 ) * feat: add MiniMax AI as LLM platform adaptor Original implementation by octo-patch in PR #318. This commit includes comprehensive improvements and documentation. Code Improvements: - Fix API key validation to properly check JWT format (eyJ prefix) - Add specific exception handling for timeout and connection errors - Remove unused variable in upload method Dependencies: - Add MiniMax to [all-llms] extra group in pyproject.toml Tests: - Remove duplicate setUp method in integration test class - Add 4 new test methods: * test_package_excludes_backup_files * test_upload_success_mocked (with OpenAI mocking) * test_upload_network_error * test_upload_connection_error * test_validate_api_key_jwt_format - Update test_validate_api_key_valid to use JWT format keys - Fix test assertions for error message matching Documentation: - Create comprehensive MINIMAX_INTEGRATION.md guide (380+ lines) - Update MULTI_LLM_SUPPORT.md with MiniMax platform entry - Update 01-installation.md extras table - Update INTEGRATIONS.md AI platforms table - Update AGENTS.md adaptor import pattern example - Fix README.md platform count from 4 to 5 All tests pass (33 passed, 3 skipped) Lint checks pass Co-authored-by: octo-patch <octo-patch@users.noreply.github.com> * fix: improve MiniMax adaptor — typed exceptions, key validation, tests, docs - Remove invalid "minimax" self-reference from all-llms dependency group - Use typed OpenAI exceptions (APITimeoutError, APIConnectionError) instead of string-matching on generic Exception - Replace incorrect JWT assumption in validate_api_key with length check - Use DEFAULT_API_ENDPOINT constant instead of hardcoded URLs (3 sites) - Add Path() cast for output_path before .is_dir() call - Add sys.modules mock to test_enhance_missing_library - Add mocked test_enhance_success with backup/content verification - Update test assertions for new exception types and key validation - Add MiniMax to __init__.py docstrings (module, get_adaptor, list_platforms) - Add MiniMax sections to MULTI_LLM_SUPPORT.md (install, format, API key, workflow example, export-to-all) Follows up on PR #318 by @octo-patch (feat: add MiniMax AI as LLM platform adaptor). Co-Authored-By: Octopus <octo-patch@users.noreply.github.com> Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: octo-patch <octo-patch@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-03-20 22:12:23 +03:00
yusyus	37a23e6c6d	fix: replace unicode arrows in CLI help text for Windows cp1252 compatibility Replace → (U+2192) with -> in argparse help strings. Windows cmd uses cp1252 encoding which cannot render unicode arrows, causing --help to crash with UnicodeEncodeError.	2026-03-19 00:10:50 +03:00
yusyus	2b725aa8f7	fix: update version strings and test expectations from 3.2.0 to 3.3.0 Fix CI failures: version hardcoded in _version.py fallbacks and test assertions (test_package_structure, test_cli_paths) still referenced 3.2.0 after the version bump.	2026-03-16 00:53:35 +03:00
yusyus	53b911b697	feat: add 10 new skill source types (17 total) with full pipeline integration Add Jupyter Notebook, Local HTML, OpenAPI/Swagger, AsciiDoc, PowerPoint, RSS/Atom, Man Pages, Confluence, Notion, and Slack/Discord Chat as new skill source types. Each type is fully integrated across: - Standalone CLI commands (skill-seekers <type>) - Auto-detection via 'skill-seekers create' (file extension + content sniffing) - Unified multi-source configs (scraped_data, dispatch, config validation) - Unified skill builder (generic merge + source-attributed synthesis) - MCP server (scrape_generic tool with per-type flag mapping) - pyproject.toml (entry points, optional deps, [all] group) Also fixes: EPUB unified pipeline gap, missing word/video config validators, OpenAPI yaml import guard, MCP flag mismatch for all 10 types, stale docstrings, and adds 77 integration tests + complex-merge workflow. 50 files changed, +20,201 lines	2026-03-15 15:30:15 +03:00
yusyus	7185531f94	fix: replace PaginatedList slicing with itertools.islice in _extract_issues PyGithub's PaginatedList slicing (issues[:max_issues]) may fail with 'list index out of range' on some PyGithub versions or when repos have no issues. Replace with itertools.islice() which works reliably with any iterable, including PaginatedList. Bug reported by @dream0438-cmd in PR #269. Closes #269	2026-03-15 02:44:06 +03:00
yusyus	2e30970dfb	feat: add EPUB input support (#310 ) Adds EPUB as a first-class input source for skill generation. - EpubToSkillConverter (epub_scraper.py, ~1200 lines) following PDF scraper pattern - Dublin Core metadata, spine items, code blocks, tables, images extraction - DRM detection (Adobe ADEPT, Apple FairPlay, Readium LCP) with fail-fast - EPUB 3 NCX TOC bug workaround (ignore_ncx=True) - ebooklib as optional dep: pip install skill-seekers[epub] - Wired into create command with .epub auto-detection - 104 tests, all passing Review fixes: removed 3 empty test stubs, fixed SVG double-counting in _extract_images(), added logger.debug to bare except pass. Based on PR #310 by @christianbaumann. Co-authored-by: Christian Baumann <mail@chriss-baumann.de>	2026-03-15 02:34:41 +03:00
yusyus	83b9a695ba	feat: add sync-config command to detect and update config start_urls (#306 ) ## Summary Add `skill-seekers sync-config` subcommand that crawls a docs site's navigation, diffs discovered URLs against a config's start_urls, and optionally writes the updated list back with --apply. - BFS link discovery with configurable depth (default 2), max-pages, rate-limit - Respects url_patterns.include/exclude from config - Supports optional nav_seed_urls config field - Handles both unified (sources array) and legacy flat config formats - MCP tool sync_config included - 57 tests (39 unit + 18 E2E with local HTTP server) - Fixed CI: renamed summary job to "Tests" to match branch protection rule Closes #306	2026-03-15 02:16:32 +03:00
yusyus	b25a6f7f53	fix: centralize bracket-encoding to prevent 'Invalid IPv6 URL' on all code paths (#284 ) The original fix (`741daf1`) only patched LlmsTxtParser._clean_url(), which covers URLs extracted directly from llms.txt content. But URLs discovered from .md files during BFS crawl (_extract_markdown_content) and from HTML pages (extract_content) bypass _clean_url() entirely. When those pages contain links with square brackets (e.g. /api/[v1]/users), httpx raises 'Invalid IPv6 URL' on fetch. Fix: add a shared sanitize_url() utility in cli/utils.py that percent-encodes [ and ] in path/query components, and apply it at every URL ingestion point: - _enqueue_url(): main chokepoint — all discovered URLs pass through - scrape_page(): safety net for start_urls that skip _enqueue_url - scrape_page_async(): same for async mode - dry-run sync/async paths: direct fetches that also bypass _enqueue_url LlmsTxtParser._clean_url() now delegates bracket-encoding to the shared sanitize_url() (DRY), keeping only its malformed-anchor stripping logic. Added 16 tests: sanitize_url unit tests, _clean_url bracket tests, _enqueue_url sanitization tests, and integration test verifying markdown content with bracket URLs is handled safely. Fixes #284	2026-03-14 23:53:47 +03:00
yusyus	f214976ccd	fix: apply review fixes from PR #309 and stabilize flaky benchmark test Follow-up to PR #309 (perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing). These fixes were committed to the PR branch but missed the squash merge. Review fixes (credit: PR #309 by copperlang2007): 1. Rename _pending_set -> _enqueued_urls to accurately reflect that the set tracks all ever-enqueued URLs, not just currently pending ones 2. Extract duplicated _build_line_index()/_offset_to_line() into shared build_line_index()/offset_to_line() in cli/utils.py (DRY) 3. Fix pre-existing bug: infer_categories() guard checked 'tutorial' but wrote to 'tutorials' key, risking silent overwrites 4. Remove unnecessary _store_results() closure in scrape_page() 5. Simplify parser pre-import in codebase_scraper.py Benchmark stabilization: - test_benchmark_metadata_overhead was flaky on CI (106.7% overhead observed, threshold 50%) because 5 iterations with mean averaging can't reliably measure microsecond-level differences - Fix: 20 iterations, warm-up run, median instead of mean, threshold raised to 200% (guards catastrophic regression, not noise) Ref: https://github.com/yusufkaraaslan/Skill_Seekers/pull/309	2026-03-14 23:39:23 +03:00
copperlang2007	89f5e6fe5f	perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309 ) ## Summary Performance optimizations across core scraping and analysis modules: - doc_scraper.py: Pre-compiled regex at module level, O(1) URL dedup via _enqueued_urls set, cached URL patterns, _enqueue_url() helper (DRY), seen_links set for link extraction, pre-lowercased category keywords, async error logging (bug fix), summary I/O error handling - code_analyzer.py: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection - dependency_analyzer.py: Same bisect line-index optimization for all import extractors - codebase_scraper.py: Module-level import re, pre-imported parser classes outside loop - github_scraper.py: deque.popleft() for O(1) tree traversal, module-level import fnmatch - utils.py: Shared build_line_index() / offset_to_line() utilities (DRY) - test_adaptor_benchmarks.py: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations) Review fixes applied on top of original PR: 1. Renamed misleading _pending_set to _enqueued_urls 2. Extracted duplicated line-index code into shared cli/utils.py 3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories() 4. Removed unnecessary _store_results() closure 5. Simplified parser pre-import pattern	2026-03-14 23:35:39 +03:00
yusyus	a535c7cf18	chore: bump version to 3.2.0 for release Update version across pyproject.toml, _version.py fallbacks, CHANGELOG.md, and README badges for v3.2.0 release. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 22:24:18 +03:00
yusyus	d19ad7d820	feat: video pipeline OCR quality fixes + two-pass AI enhancement - Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video) - Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers - Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts - Add _is_likely_code() filter to prevent UI junk in reference code fences - Add language detection to get_text_groups() via LanguageDetector - Apply OCR cleaning in _assemble_structured_text() pipeline - Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline using transcript context, Pass 2 generates SKILL.md from cleaned refs - Update video-tutorial.yaml prompts for pre-cleaned references - Add 17 new tests (197 total video tests), 2540 tests passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 21:48:21 +03:00
yusyus	4b19cf4836	style: ruff format 4 video pipeline files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 19:48:02 +03:00
yusyus	0396dcc5c0	style: fix 3 ruff lint errors in video pipeline - UP024: Replace `(OSError, IOError)` with `OSError` in video_setup.py - E402: Use existing `os` import instead of `import os as _os` in video_visual.py - SIM103: Inline condition in `_detect_gpu()` return statement Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 19:39:37 +03:00
yusyus	cc9cc32417	feat: add `skill-seekers video --setup` for GPU auto-detection and dependency installation Auto-detects NVIDIA (CUDA), AMD (ROCm), or CPU-only GPU and installs the correct PyTorch variant + easyocr + all visual extraction dependencies. Removes easyocr from video-full pip extras to avoid pulling ~2GB of wrong CUDA packages on non-NVIDIA systems. New files: - video_setup.py (835 lines): GPU detection, PyTorch install, ROCm config, venv checks, system dep validation, module selection, verification - test_video_setup.py (60 tests): Full coverage of detection, install, verify Updated docs: CHANGELOG, AGENTS.md, CLAUDE.md, README.md, CLI_REFERENCE, FAQ, TROUBLESHOOTING, installation guide, video dependency plan All 2523 tests passing (15 skipped). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 18:39:16 +03:00
yusyus	12bc29ab36	fix: resolve 15 bugs and gaps in video scraper pipeline - Fix extract_visual_data returning 2-tuple instead of 3 (ValueError crash) - Move pytesseract from core deps to [video-full] optional group - Add 30-min timeout + user feedback to video enhancement subprocess - Add scrape_video_impl to MCP server fallback import block - Detect auto-generated YouTube captions via is_generated property - Forward --vision-ocr and --video-playlist through create command - Fix filename collision for non-ASCII video titles (fallback to video_id) - Make _vision_used a proper dataclass field on FrameSubSection - Expose 6 visual params in MCP scrape_video tool - Add install instructions on missing video deps in unified scraper - Update MCP docstring tool counts (25→33, 7 categories) - Add video and word commands to main.py docstring - Document video-full exclusion from [all] deps in pyproject.toml - Update parser registry test count (22→23 for video parser) All 2437 tests passing, 0 failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 12:39:21 +03:00
yusyus	066e19674a	Merge branch 'development' into feature/video-scraper-pipeline Sync with latest development changes including ruff formatting, bug fixes, and pinecone adaptor additions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 11:38:45 +03:00
yusyus	68bdbe8307	style: ruff format remaining 14 files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 10:54:45 +03:00
yusyus	6c31990941	style: fix ruff lint and formatting errors - E741: rename ambiguous variable `l` → `line_text` in enhance_skill_local.py - ARG001: suppress unused `doc` param in word_scraper _build_section() - SIM108: use ternary for code_text assignment in word_scraper - F841: remove unused `metadata` variable in test_chunking_integration - F401: remove unused imports in test_pinecone_adaptor - ARG001: rename unused `docs` → `_docs` in test_pinecone_adaptor - Format 20 files to match ruff formatting rules Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 10:54:32 +03:00
yusyus	064405c052	fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline Bug fixes: - Fix --var flag silently dropped in create routing (args.workflow_var → args.var) - Fix double _score_code_quality() call in word scraper - Add .docx file extension validation in WordToSkillConverter - Fix weaviate ImportError masked by generic Exception handler - Fix RAG chunking crash using non-existent converter.output_dir Chunking pipeline improvements: - Wire --chunk-overlap-tokens through entire package pipeline (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker) - Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default - Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept) - Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12 concrete adaptors, rag_chunker, base, and package_skill Code quality: - Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor base class, removing ~150 lines of duplication from chroma/weaviate/pinecone - Add Pinecone adaptor with full upload support (pinecone_adaptor.py) Tests (14 new): - chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag - .docx/.doc/no-extension file validation, --var flag routing E2E - Embedding method inheritance verification, backward-compatible flag aliases Docs: - Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH) - Update README test count badge (1880+ → 2283+) All 2283 tests passing, 8 skipped, 0 failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:57:59 +03:00
YusufKaraaslanSpyke	62071c4aa9	feat: add video tutorial scraping pipeline with per-panel OCR and AI enhancement Add complete video tutorial extraction system that converts YouTube videos and local video files into AI-consumable skills. The pipeline extracts transcripts, performs visual OCR on code editor panels independently, tracks code evolution across frames, and generates structured SKILL.md output. Key features: - Video metadata extraction (YouTube, local files, playlists) - Multi-source transcript extraction (YouTube API, yt-dlp, Whisper fallback) - Chapter-based and time-window segmentation - Visual extraction: keyframe detection, frame classification, panel detection - Per-panel sub-section OCR (each IDE panel OCR'd independently) - Parallel OCR with ThreadPoolExecutor for multi-panel frames - Narrow panel filtering (300px min width) to skip UI chrome - Text block tracking with spatial panel position matching - Code timeline with edit tracking across frames - Audio-visual alignment (code + narrator pairs) - Video-specific AI enhancement prompt for OCR denoising and code reconstruction - video-tutorial.yaml workflow with 4 stages (OCR cleanup, language detection, tutorial synthesis, skill polish) - CLI integration: skill-seekers video --url/--video-file/--playlist - MCP tool: scrape_video for automation - 161 tests passing Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 23:10:19 +03:00
yusyus	3bad7cf365	fix: RAG chunking crash using non-existent converter.output_dir DocToSkillConverter has self.skill_dir (string), not self.output_dir. The --chunk-for-rag flag on scrape command crashed with AttributeError. Changed to Path(converter.skill_dir). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-27 22:26:21 +03:00
yusyus	4c8e16c8b1	fix(#300 ): centralize selector fallback, fix dry-run link discovery, and smart --config routing - Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to eliminate 3 duplicated fallback loops in doc_scraper.py - Move link extraction before early return in extract_content() so links are always discovered from the full page, not just main content - Fix single-threaded dry-run to extract links from soup (full page) instead of main element only — fixes reactflow.dev finding only 1 page - Add link extraction to async dry-run path (was completely missing) - Remove main_content from get_configuration() defaults so fallback logic kicks in instead of a broad CSS comma selector matching body - Smart create --config routing: peek at JSON to determine unified (sources array → unified_scraper) vs simple (base_url → doc_scraper) - Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md to use unified config format (legacy format rejected since v2.11.0) - Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config Closes #300 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 22:25:59 +03:00
yusyus	b6d4dd8423	fix: remove arbitrary limits, fix hardcoded languages, and fix summarizer bugs Stage 1 quality improvements from the Arbitrary Limits & Dead Code audit: Reference file truncation removed: - codebase_scraper.py: remove code[:500] truncation at 5 locations — reference files now contain complete code blocks for copy-paste usability - unified_skill_builder.py: remove issues[:20], releases[:10], body[:500], and code_snippet[:300] caps in reference files — full content preserved Enhancement summarizer rewrite: - enhance_skill_local.py: replace arbitrary [:5] code block cap with character-budget approach using target_ratio * content_chars - Fix intro boundary bug: track code block state so intro never ends inside a code block, which was desynchronizing the parser - Remove dead _target_lines variable (assigned but never used) - Heading chunks now also respect the character budget Hardcoded language fixes: - unified_skill_builder.py: test examples use ex["language"] instead of always "python" for syntax highlighting - how_to_guide_builder.py: add language field to HowToGuide dataclass, set from workflow at creation, used in AI enhancement prompt Test fixes: - test_enhance_skill_local.py: rename test to test_code_blocks_not_arbitrarily_capped, fix assertion to count actual blocks (```count // 2), use target_ratio=0.9 Documentation: - Add Stage 1 plan, implementation summary, review, and corrected docs - Update CHANGELOG.md with all changes Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-26 00:30:40 +03:00
yusyus	b81d55fda0	feat(B2): add Microsoft Word (.docx) support Implements ROADMAP task B2 — full .docx scraping support via mammoth + python-docx, producing SKILL.md + references/ output identical to other source types. New files: - src/skill_seekers/cli/word_scraper.py — WordToSkillConverter class + main() entry point (~600 lines); mammoth → BeautifulSoup pipeline; handles headings, code detection (incl. monospace <p><br> blocks), tables, images, metadata extraction - src/skill_seekers/cli/arguments/word.py — add_word_arguments() + WORD_ARGUMENTS dict - src/skill_seekers/cli/parsers/word_parser.py — WordParser for unified CLI parser registry - tests/test_word_scraper.py — comprehensive test suite (~300 lines) Modified files: - src/skill_seekers/cli/main.py — registered "word" command module - src/skill_seekers/cli/source_detector.py — .docx auto-detection + _detect_word() classmethod - src/skill_seekers/cli/create_command.py — _route_word() + --help-word - src/skill_seekers/cli/arguments/create.py — WORD_ARGUMENTS + routing - src/skill_seekers/cli/arguments/__init__.py — export word args - src/skill_seekers/cli/parsers/__init__.py — register WordParser - src/skill_seekers/cli/unified_scraper.py — _scrape_word() integration - src/skill_seekers/cli/pdf_scraper.py — fix: real enhancement instead of stub; remove [:3] reference file limit; capture run_workflows return - src/skill_seekers/cli/github_scraper.py — fix: remove arbitrary open_issues[:20] / closed_issues[:10] reference file limits - pyproject.toml — skill-seekers-word entry point + docx optional dep - tests/test_cli_parsers.py — update parser count 21→22 Bug fixes applied during real-world testing: - Code detection: detect monospace <p><br> blocks as code (mammoth renders Courier paragraphs this way, not as <pre>/<code>) - Language detector: fix wrong method name detect_from_text → detect_from_code - Description inference: pass None from main() so extract_docx() can infer description from Word document subject/title metadata - Bullet-point guard: exclude prose starting with •/-/* from code scoring - Enhancement: implement real API/LOCAL enhancement (was stub) - pip install message: add quotes around skill-seekers[docx] Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-25 21:47:30 +03:00
yusyus	e42aade992	style: auto-format 6 files with ruff format (CI formatting check) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 22:28:11 +03:00
yusyus	91d6340c3c	chore: bump version to 3.1.3 Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 22:24:03 +03:00
yusyus	73adda0b17	docs: update all chunk flag names to match renamed CLI flags Replace all occurrences of old ambiguous flag names with the new explicit ones: --chunk-size (tokens) → --chunk-tokens --chunk-overlap → --chunk-overlap-tokens --chunk → --chunk-for-rag --streaming-chunk-size → --streaming-chunk-chars --streaming-overlap → --streaming-overlap-chars --chunk-size (pages) → --pdf-pages-per-chunk Updated: CLI_REFERENCE (EN+ZH), user-guide (EN+ZH), integrations (Haystack, Chroma, Weaviate, FAISS, Qdrant), features/PDF_CHUNKING, examples/haystack-pipeline, strategy docs, archive docs, and CHANGELOG. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 22:15:14 +03:00
yusyus	7a2ffb286c	refactor: rename all chunk flags to include explicit units Replace ambiguous --chunk-size / --chunk-overlap names that meant different things in different contexts (tokens vs characters) with fully explicit names: - --chunk-size (RAG tokens) → --chunk-tokens - --chunk-overlap (RAG tokens) → --chunk-overlap-tokens - --chunk (enable RAG chunking) → --chunk-for-rag - --streaming-chunk-size (chars) → --streaming-chunk-chars - --streaming-overlap (chars) → --streaming-overlap-chars - --chunk-size (PDF pages) → --pdf-pages-per-chunk (poc file) Also aligns stream_parser.py help with streaming_ingest.py standalone parser. All 2167 tests pass. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 22:07:56 +03:00
yusyus	b636a0a292	fix: resolve issue #299 and Phase 1 cleanup - Fix #299: rename --chunk-size/--chunk-overlap to --streaming-chunk-size/ --streaming-overlap in arguments/package.py to avoid collision with the RAG --chunk-size flag from arguments/common.py - Phase 1a: make package_skill.py import args via add_package_arguments() instead of a 105-line inline duplicate argparse block; fixes the root cause of _reconstruct_argv() passing unrecognised flag names - Phase 1b: centralise setup_logging() into utils.py and remove 4 duplicate module-level logging.basicConfig() calls from doc_scraper.py, github_scraper.py, codebase_scraper.py, and unified_scraper.py - Fix test_package_structure.py / test_cli_paths.py version strings (3.1.1 → 3.1.2) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 21:22:05 +03:00
yusyus	1229ff2baf	style: auto-format enhance_skill_local.py and test with ruff Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 07:05:50 +03:00
yusyus	5ae57d192a	fix: update Gemini model to 2.5-flash and add API auto-detection in enhance Fix 1 — gemini.py: replace deprecated gemini-2.0-flash-exp (404 errors) with gemini-2.5-flash (stable, GA, Google's recommended replacement). Closes #290. Fix 2 — enhance dispatcher: implement the documented auto-detection that was missing from the code. skill-seekers enhance now correctly routes: - ANTHROPIC_API_KEY set → Claude API mode (enhance_skill.py) - GOOGLE_API_KEY set → Gemini API mode - OPENAI_API_KEY set → OpenAI API mode - No API keys → LOCAL mode (Claude Code Max, free) Use --mode LOCAL to force local mode even when an API key is present. 9 new tests cover _detect_api_target() priority logic and main() routing (API delegation, --mode LOCAL override, no-key fallback). Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-24 06:52:55 +03:00
YusufKaraaslanSpyke	3adc5a8c1d	fix: unify scraper argument interface and fix create command forwarding All scrapers (scrape, github, analyze, pdf) now share a common argument contract via add_all_standard_arguments() in arguments/common.py. Universal flags (--dry-run, --verbose, --quiet, --name, --description, workflow args) work consistently across all source types. Previously, `create <url> --dry-run`, `create owner/repo --dry-run`, and `create ./path --dry-run` would crash because sub-scrapers didn't accept those flags. Also fixes main.py _handle_analyze_command() not forwarding --dry-run, --preset, --quiet, --name, --description to codebase_scraper. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 20:56:13 +03:00
Claude	40cec4dffd	hotfix: v3.1.1 — fix create command max_pages AttributeError Merge fix from development (#293, #294) and bump version to 3.1.1. Fixes crash when max_pages argument was not provided in web source routing. https://claude.ai/code/session_01HS5q7ghjfEUravNPZRCGux	2026-02-23 06:37:39 +00:00
YusufKaraaslanSpyke	2e273b214f	fix: use getattr for max_pages in create command web routing (#293 ) The create command crashed with 'Namespace' object has no attribute 'max_pages' because it accessed args.max_pages directly instead of using getattr() like all other source-specific attributes in the same method. Closes #293 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-23 08:58:06 +03:00
yusyus	ef14fd4b5d	style: auto-format 12 files with ruff format (CI formatting check) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-02-22 22:32:31 +03:00

1 2 3 4 5 ...

277 Commits