New BrowserRenderer class uses Playwright to render JavaScript-heavy
documentation sites (React, Vue SPAs) that return empty HTML shells
with requests.get(). Activated via --browser flag on web scraping.
- browser_renderer.py: Playwright wrapper with lazy browser launch,
auto-install Chromium on first use, context manager support
- doc_scraper.py: browser_mode config, _render_with_browser() helper,
integrated into scrape_page() and scrape_page_async()
- SPA detection warnings now suggest --browser flag
- Optional dep: pip install "skill-seekers[browser]"
- 14 real e2e tests (actual Chromium, no mocks)
- UML updated: Scrapers class diagram (BrowserRenderer + dependency),
Parsers (DoctorParser), Utilities (Doctor), Components, and new
Browser Rendering sequence diagram (#20)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
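The lazy-launch and context-manager behaviour described above can be sketched as follows. This is a minimal illustration, not the shipped browser_renderer.py: the class name `LazyBrowserRenderer`, the `timeout_ms` parameter, and the `networkidle` wait strategy are assumptions, and the auto-install of Chromium is left out.

```python
class LazyBrowserRenderer:
    """Hypothetical sketch of a Playwright wrapper with lazy browser launch."""

    def __init__(self, timeout_ms: int = 30_000):
        self.timeout_ms = timeout_ms  # assumed default, not taken from the source
        self._playwright = None
        self._browser = None

    def _ensure_browser(self):
        # Import and launch only on first use, so constructing the renderer
        # stays cheap and works even when the optional extra is not installed.
        if self._browser is None:
            from playwright.sync_api import sync_playwright

            self._playwright = sync_playwright().start()
            self._browser = self._playwright.chromium.launch(headless=True)
        return self._browser

    def render(self, url: str) -> str:
        """Return the page HTML after client-side JavaScript has run."""
        page = self._ensure_browser().new_page()
        try:
            page.goto(url, wait_until="networkidle", timeout=self.timeout_ms)
            return page.content()
        finally:
            page.close()

    def close(self) -> None:
        # Safe to call even if the browser was never launched.
        if self._browser is not None:
            self._browser.close()
            self._browser = None
        if self._playwright is not None:
            self._playwright.stop()
            self._playwright = None

    def __enter__(self):
        return self

    def __exit__(self, *exc_info):
        self.close()
```

Used as `with LazyBrowserRenderer() as renderer: html = renderer.render(url)`, Chromium starts on the first `render()` call and is shut down when the block exits.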
8 diagnostic checks: Python version (3.10+), package install, git,
14 core deps, 10 optional deps, API keys, MCP server, output dir.
Each check reports pass/warn/fail with --verbose for extra detail.
Exit code 0 if no critical failures, 1 otherwise.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
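The pass/warn/fail reporting and exit-code rule above could look roughly like this. It is a sketch under assumptions: `Status`, `CheckResult`, the `critical` field, and the output format are hypothetical names chosen for illustration, not the actual doctor implementation.

```python
import sys
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    PASS = "pass"
    WARN = "warn"
    FAIL = "fail"


@dataclass
class CheckResult:
    name: str
    status: Status
    detail: str = ""
    critical: bool = True  # assumption: only critical failures affect the exit code


def check_python_version(minimum=(3, 10)) -> CheckResult:
    """One example check: compare the running interpreter against a minimum."""
    current = sys.version_info[:2]
    status = Status.PASS if current >= minimum else Status.FAIL
    return CheckResult("Python version", status, f"{current[0]}.{current[1]}")


def report(results, verbose=False):
    """Format one line per check; details are shown only in verbose mode."""
    lines = []
    for r in results:
        line = f"[{r.status.value}] {r.name}"
        if verbose and r.detail:
            line += f": {r.detail}"
        lines.append(line)
    return lines


def exit_code(results) -> int:
    """Return 0 if no critical check failed, 1 otherwise."""
    failed = any(r.status is Status.FAIL and r.critical for r in results)
    return 1 if failed else 0
```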
New bundled workflow `prompt-injection-check` scans scraped content for
prompt injection patterns (role assumption, instruction overrides,
delimiter injection, hidden instructions, encoded payloads) using AI.
Flags suspicious content without removing it — preserves documentation
accuracy while warning about adversarial content. Added as first stage
in both `default` and `security-focus` workflows so it runs automatically
with --enhance-level >= 1.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
3 sequence diagrams (create command dispatch, GitHub+C3.x pipeline with
all 5 stages, MCP dual-path invocation), 2 activity diagrams (source
detection in correct code order, enhancement level flag mapping), and
1 component diagram with corrected runtime dependency arrows.
All diagrams cross-referenced against source code for accuracy.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two call sites (_run_c3_analysis in unified_scraper.py and _analyze_c3x in
unified_codebase_analyzer.py) still passed the old enhance_with_ai and ai_mode
kwargs, which were replaced by enhance_level. This caused a TypeError when
running C3.x codebase analysis.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
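The failure mode in the C3.x fix above is easy to reproduce in isolation. In this sketch `analyze_codebase` is a hypothetical stand-in for the real analysis entry point; only the keyword names come from the commit message.

```python
def analyze_codebase(path, enhance_level=0):
    """Hypothetical stand-in: accepts enhance_level, not the old kwargs."""
    return {"path": path, "enhance_level": enhance_level}


# Old-style call site: the removed keywords are rejected at call time.
try:
    analyze_codebase("./src", enhance_with_ai=True, ai_mode="local")
except TypeError as exc:
    error = str(exc)

# Updated call site: pass the single enhance_level argument instead.
result = analyze_codebase("./src", enhance_level=1)
```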
PyGithub's get_languages() returns raw API JSON, which in some environments
includes metadata entries with non-integer values (e.g., "url"), causing a
TypeError in sum(). Now filters to integer values only before calculating
percentages.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
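The integer filter described above can be sketched like this. The function name `language_percentages` and the sample payload are illustrative assumptions; the real code may round or structure its output differently.

```python
def language_percentages(raw: dict) -> dict:
    """Compute per-language percentages, ignoring non-integer entries."""
    # bool is a subclass of int in Python, so exclude it explicitly.
    byte_counts = {
        lang: count
        for lang, count in raw.items()
        if isinstance(count, int) and not isinstance(count, bool)
    }
    total = sum(byte_counts.values())
    if total == 0:
        return {}
    return {lang: round(count / total * 100, 2) for lang, count in byte_counts.items()}


# Sample payload with a stray metadata entry (made-up data).
sample = {"Python": 7500, "Shell": 2500, "url": "https://example.invalid/languages"}
percentages = language_percentages(sample)
```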
The scraper previously reported len(visited_urls) as "Scraped N pages"
even when save_page() silently skipped pages with empty content (<50
chars). For JavaScript SPA sites this meant "Scraped 190 pages" followed
by "No scraped data found!" with no explanation.
Changes:
- Added pages_saved/pages_skipped counters to DocToSkillConverter
- save_page() now increments pages_skipped on skip, pages_saved on save
- New _log_scrape_completion() reports "(N saved, M skipped)" breakdown
- SPA detection warns when all/most pages have empty content
- build_skill() error now explains empty content cause when pages skipped
- Updated both sync and async scrape completion paths
- 14 new tests across 4 test classes (counting, messages, SPA, build)
Fixes #320
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
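A minimal sketch of the saved/skipped accounting described above. `PageCounter`, its method signatures, and the 80% SPA threshold are assumptions made for illustration; only the counter names, the 50-character cutoff, and the "(N saved, M skipped)" wording come from the commit message.

```python
class PageCounter:
    """Hypothetical stand-in for the counters added to DocToSkillConverter."""

    MIN_CONTENT_CHARS = 50  # pages with less content than this are skipped

    def __init__(self):
        self.pages_saved = 0
        self.pages_skipped = 0

    def save_page(self, page: dict) -> bool:
        """Count the page as saved or skipped; return True if it was saved."""
        if len(page.get("content", "")) < self.MIN_CONTENT_CHARS:
            self.pages_skipped += 1
            return False
        self.pages_saved += 1
        return True

    def completion_message(self, visited: int) -> str:
        return (
            f"Scraped {visited} pages "
            f"({self.pages_saved} saved, {self.pages_skipped} skipped)"
        )

    def looks_like_spa(self, threshold: float = 0.8) -> bool:
        """Warn-worthy when most pages came back empty (threshold is assumed)."""
        total = self.pages_saved + self.pages_skipped
        return total > 0 and self.pages_skipped / total >= threshold
```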
Update version across pyproject.toml, _version.py fallbacks,
CHANGELOG.md, and README badges for v3.2.0 release.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Skip OCR on WEBCAM/OTHER frames (eliminates ~64 junk results per video)
- Add _clean_ocr_line() to strip line numbers, IDE decorations, collapse markers
- Add _fix_intra_line_duplication() for multi-engine OCR overlap artifacts
- Add _is_likely_code() filter to prevent UI junk in reference code fences
- Add language detection to get_text_groups() via LanguageDetector
- Apply OCR cleaning in _assemble_structured_text() pipeline
- Add two-pass AI enhancement: Pass 1 cleans reference Code Timeline
using transcript context, Pass 2 generates SKILL.md from cleaned refs
- Update video-tutorial.yaml prompts for pre-cleaned references
- Add 17 new tests (197 total video tests), 2540 tests passing
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
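The OCR cleanup steps listed above might be approximated as below. These are simplified, hypothetical versions of `_clean_ocr_line()` and `_fix_intra_line_duplication()`; the regular expressions and the "two identical halves" rule are assumptions, and the real helpers likely handle more cases.

```python
import re

# A leading editor line number followed by whitespace, e.g. "12  def main():".
# Note: this sketch also drops any indentation that followed the number.
LEADING_LINE_NUMBER = re.compile(r"^\s*\d{1,5}\s+(?=\S)")
# A leading fold/collapse marker as drawn by some IDE gutters.
FOLD_MARKER = re.compile(r"^\s*[▶▼▸▾]\s*")


def clean_ocr_line(line: str) -> str:
    """Strip a leading line number and a fold marker from one OCR line."""
    line = LEADING_LINE_NUMBER.sub("", line)
    line = FOLD_MARKER.sub("", line)
    return line.rstrip()


def fix_intra_line_duplication(line: str) -> str:
    """Collapse a line whose second half repeats its first half exactly."""
    tokens = line.split()
    half, odd = divmod(len(tokens), 2)
    if half and not odd and tokens[:half] == tokens[half:]:
        return " ".join(tokens[:half])
    return line
```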
Add missing video pipeline feature (the main 15K+ line addition),
15 video bug fixes, and restructure [Unreleased] section with
proper hierarchy: Video Pipeline Core → Video --setup → Word support.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Auto-detects NVIDIA (CUDA), AMD (ROCm), or CPU-only systems and installs the
correct PyTorch variant + easyocr + all visual extraction dependencies.
Removes easyocr from video-full pip extras to avoid pulling ~2GB of wrong
CUDA packages on non-NVIDIA systems.
New files:
- video_setup.py (835 lines): GPU detection, PyTorch install, ROCm config,
venv checks, system dep validation, module selection, verification
- test_video_setup.py (60 tests): Full coverage of detection, install, verify
Updated docs: CHANGELOG, AGENTS.md, CLAUDE.md, README.md, CLI_REFERENCE,
FAQ, TROUBLESHOOTING, installation guide, video dependency plan
All 2523 tests passing (15 skipped).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
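One plausible way to implement the detection step above is to probe for vendor tools on PATH. This is a sketch, not the contents of video_setup.py: the function names, the choice of `nvidia-smi` and `rocminfo` as probes, and the specific CUDA/ROCm index tags are all assumptions (the tags change with each PyTorch release).

```python
import shutil
import sys

# Assumed PyTorch wheel indexes; the cu121 and rocm6.0 tags are examples only.
TORCH_INDEX = {
    "nvidia": "https://download.pytorch.org/whl/cu121",
    "amd": "https://download.pytorch.org/whl/rocm6.0",
    "cpu": "https://download.pytorch.org/whl/cpu",
}


def detect_gpu(which=shutil.which) -> str:
    """Return 'nvidia', 'amd', or 'cpu' based on which vendor tool is on PATH."""
    if which("nvidia-smi"):
        return "nvidia"
    if which("rocminfo"):
        return "amd"
    return "cpu"


def torch_install_command(variant: str) -> list:
    """Build (but do not run) the pip command for the matching PyTorch build."""
    return [
        sys.executable, "-m", "pip", "install", "torch",
        "--index-url", TORCH_INDEX[variant],
    ]
```

Passing `which` as a parameter keeps the detection testable on machines without any GPU tooling installed.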
DocToSkillConverter has self.skill_dir (string), not self.output_dir.
The --chunk-for-rag flag on scrape command crashed with AttributeError.
Changed to Path(converter.skill_dir).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
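The attribute mismatch above, reduced to a few lines. `Converter` is a hypothetical stand-in for DocToSkillConverter, and the `output/<name>` layout and `rag_chunks` directory name are assumed for the example.

```python
from pathlib import Path


class Converter:
    """Hypothetical stand-in: exposes skill_dir as a string, no output_dir."""

    def __init__(self, name: str):
        self.skill_dir = f"output/{name}"


converter = Converter("react")

# Wrapping the string in Path gives the chunking code the object it expects.
skill_path = Path(converter.skill_dir)
chunks_dir = skill_path / "rag_chunks"
```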
- Add FALLBACK_MAIN_SELECTORS constant and _find_main_content() helper to
eliminate 3 duplicated fallback loops in doc_scraper.py
- Move link extraction before early return in extract_content() so links
are always discovered from the full page, not just main content
- Fix single-threaded dry-run to extract links from soup (full page)
instead of main element only — fixes reactflow.dev finding only 1 page
- Add link extraction to async dry-run path (was completely missing)
- Remove main_content from get_configuration() defaults so fallback logic
kicks in instead of a broad CSS comma selector matching body
- Smart create --config routing: peek at JSON to determine unified
(sources array → unified_scraper) vs simple (base_url → doc_scraper)
- Update docs/user-guide/02-scraping.md and docs/reference/CONFIG_FORMAT.md
to use unified config format (legacy format rejected since v2.11.0)
- Fix test_auto_fetch_enabled and test_mcp_validate_legacy_config
Closes #300
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
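The `create --config` routing above (peek at the JSON, then pick a scraper) can be sketched as follows. The function name `route_config` and the error handling are assumptions; the `sources` and `base_url` keys and the two scraper names come from the commit message.

```python
import json
from pathlib import Path


def route_config(config_path) -> str:
    """Return the scraper that should handle this config file."""
    data = json.loads(Path(config_path).read_text(encoding="utf-8"))
    if not isinstance(data, dict):
        raise ValueError("config must be a JSON object")
    # Unified configs carry a "sources" array.
    if isinstance(data.get("sources"), list):
        return "unified_scraper"
    # Simple configs carry a top-level "base_url".
    if "base_url" in data:
        return "doc_scraper"
    raise ValueError("config has neither 'sources' nor 'base_url'")
```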
Stage 1 quality improvements from the Arbitrary Limits & Dead Code audit:
Reference file truncation removed:
- codebase_scraper.py: remove code[:500] truncation at 5 locations — reference
files now contain complete code blocks for copy-paste usability
- unified_skill_builder.py: remove issues[:20], releases[:10], body[:500],
and code_snippet[:300] caps in reference files — full content preserved
Enhancement summarizer rewrite:
- enhance_skill_local.py: replace arbitrary [:5] code block cap with
character-budget approach using target_ratio * content_chars
- Fix intro boundary bug: track code block state so intro never ends
inside a code block, which was desynchronizing the parser
- Remove dead _target_lines variable (assigned but never used)
- Heading chunks now also respect the character budget
Hardcoded language fixes:
- unified_skill_builder.py: test examples use ex["language"] instead of
always "python" for syntax highlighting
- how_to_guide_builder.py: add language field to HowToGuide dataclass,
set from workflow at creation, used in AI enhancement prompt
Test fixes:
- test_enhance_skill_local.py: rename test to test_code_blocks_not_arbitrarily_capped,
fix assertion to count actual blocks (```count // 2), use target_ratio=0.9
Documentation:
- Add Stage 1 plan, implementation summary, review, and corrected docs
- Update CHANGELOG.md with all changes
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
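The character-budget idea from the summarizer rewrite above, as a rough sketch. `select_within_budget` and `count_code_blocks` are hypothetical helpers; the actual enhance_skill_local.py logic (intro handling, heading chunks) is more involved. The fence-counting helper mirrors the corrected test assertion, which counts fence markers and halves the result.

```python
FENCE = "`" * 3  # triple backtick, built this way to keep the example readable


def select_within_budget(chunks, target_ratio):
    """Keep chunks in order until target_ratio * content_chars is used up."""
    content_chars = sum(len(chunk) for chunk in chunks)
    budget = int(target_ratio * content_chars)
    kept, used = [], 0
    for chunk in chunks:
        if used + len(chunk) > budget:
            break  # stop at the first chunk that would exceed the budget
        kept.append(chunk)
        used += len(chunk)
    return kept


def count_code_blocks(text: str) -> int:
    """Each block has an opening and a closing fence, hence the division by 2."""
    return text.count(FENCE) // 2
```

With a budget instead of a fixed `[:5]` slice, the number of blocks kept scales with the size of the input rather than stopping at an arbitrary count.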
All scrapers (scrape, github, analyze, pdf) now share a common argument
contract via add_all_standard_arguments() in arguments/common.py.
Universal flags (--dry-run, --verbose, --quiet, --name, --description,
workflow args) work consistently across all source types.
Previously, `create <url> --dry-run`, `create owner/repo --dry-run`,
and `create ./path --dry-run` would crash because sub-scrapers didn't
accept those flags. Also fixes main.py _handle_analyze_command() not
forwarding --dry-run, --preset, --quiet, --name, --description to
codebase_scraper.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
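A shared argument contract like the one above is typically a single function that every sub-parser calls. The body below is an assumed sketch: only the function name and the flag names come from the commit message, workflow arguments are omitted, and `build_parser` is a hypothetical helper.

```python
import argparse


def add_all_standard_arguments(parser: argparse.ArgumentParser) -> None:
    """Register the universal flags on any scraper's parser (assumed body)."""
    parser.add_argument("--dry-run", action="store_true")
    parser.add_argument("--verbose", action="store_true")
    parser.add_argument("--quiet", action="store_true")
    parser.add_argument("--name")
    parser.add_argument("--description")


def build_parser(prog: str) -> argparse.ArgumentParser:
    """Hypothetical per-scraper parser: one positional plus the shared flags."""
    parser = argparse.ArgumentParser(prog=prog)
    parser.add_argument("source")
    add_all_standard_arguments(parser)
    return parser


# Every scraper now accepts the same universal flags.
parsed = {
    prog: build_parser(prog).parse_args(["target", "--dry-run", "--name", "demo"])
    for prog in ("scrape", "github", "analyze", "pdf")
}
```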
- pyproject.toml: version 3.0.0 → 3.1.0
- src/skill_seekers/_version.py: update hardcoded fallback to 3.1.0
- CHANGELOG.md: comprehensive [3.1.0] release notes covering all
features and fixes since v3.0.0 (unified create command, workflow
presets, RST parser, smart enhance dispatcher, CLI flag parity,
60 new workflow YAMLs, test suite improvements)
- Deprecation messages: update "removed in v3.0.0" → "v4.0.0" across
analyze_presets.py, codebase_scraper.py, mcp/server.py
- tests/test_cli_paths.py: update version assertion to 3.1.0
- tests/test_package_structure.py: update __version__ assertions to 3.1.0
- tests/test_preset_system.py: update deprecation message version to v4.0.0
All 2267 tests passing.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add YAML-based enhancement workflow presets shipped inside the package
(default, minimal, security-focus, architecture-comprehensive, api-documentation)
- Add `skill-seekers workflows` subcommand: list, show, copy, add, remove, validate
  - copy/add/remove all accept multiple names/files in one invocation with partial-failure behaviour
  - `add --name` override restricted to single-file operations
- Add 5 MCP tools: list_workflows, get_workflow, create_workflow, update_workflow, delete_workflow
- Fix: create command _add_common_args() now correctly forwards each --enhance-workflow
as a separate flag instead of passing the whole list as a single argument
- Update README: reposition as "data layer for AI systems" with AI Skills front and centre
- Update CHANGELOG, QUICK_REFERENCE, CLAUDE.md with workflow preset details
- 1,880+ tests passing
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
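The --enhance-workflow forwarding fix above amounts to repeating the flag once per value. In this sketch `forward_workflows` is a hypothetical helper, and the receiving parser is assumed to use `action="append"`.

```python
import argparse


def forward_workflows(workflows):
    """Emit one --enhance-workflow flag per workflow name."""
    argv = []
    for name in workflows:
        argv.extend(["--enhance-workflow", name])
    return argv


# Assumed receiving side: a repeatable flag collected into a list.
parser = argparse.ArgumentParser()
parser.add_argument("--enhance-workflow", action="append", default=[])

forwarded = forward_workflows(["default", "security-focus"])
received = parser.parse_args(forwarded).enhance_workflow
```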
Major feature release with enhanced code analysis and documentation.
Features:
- C3.9: Project documentation extraction
- Granular AI enhancement control (--enhance-level 0-3)
- C# language support for test extraction
- 6-12x faster parallel LOCAL mode AI enhancement
- Auto-enhancement and LOCAL mode fallbacks
- GLM-4.7 and custom Claude-compatible API support
Bug Fixes:
- Fixed C# test extraction language errors
- Fixed config type field mismatch
- Fixed LocalSkillEnhancer import issues
- Fixed critical linter errors
Contributors:
- @xuintl - Chinese README improvements
- @Zhichang Yu - GLM-4.7 support and PDF fixes
- @YusufKaraaslanSpyke - Core features and maintenance
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Remove SPYKE-related client documentation files
- Fix critical ruff linter errors:
  - Remove unused 'os' import in test_analyze_e2e.py
  - Remove unused 'setups' variable in test_test_example_extractor.py
  - Prefix unused output_dir parameter in codebase_scraper.py
  - Fix import sorting in test_integration.py
- Update CHANGELOG.md with comprehensive PR #272 feature documentation
These changes were part of PR #272 cleanup but didn't make it into the squash merge.
- Update Chinese README (README.zh-CN.md) with new preset flags
- Update docs/features/*.md (PATTERN_DETECTION, HOW_TO_GUIDES, BOOTSTRAP_SKILL_TECHNICAL)
- Update scripts/bootstrap_skill.sh to use 'skill-seekers analyze'
- Update scripts/skill_header.md command examples
- Update tests/test_bootstrap_skill.py assertions
- Fix CHANGELOG.md historical entry with correct command name
All references to 'skill-seekers-codebase' updated to 'skill-seekers analyze'
except where needed for backward compatibility (pyproject.toml, E2E tests).
Related to Phase 1 implementation from previous commits.
Merging with admin override due to known issues:
✅ **What Works**:
- GLM-4.7 Claude-compatible API support (correctly implemented)
- PDF scraper improvements (content truncation fixed, page traceability added)
- Documentation updates (comprehensive)
⚠️ **Known Issues (will be fixed in next commit)**:
1. Import bugs in 3 files causing UnboundLocalError (30 tests failing)
2. PDF scraper test expectations need updating for new behavior (5 tests failing)
3. test_godot_config failure (pre-existing, not caused by this PR - 1 test failing)
**Action Plan**:
Fixes for issues #1 and #2 are ready and will be committed immediately after merge.
Issue #3 requires separate investigation as it's a pre-existing problem.
Total: 36 failing tests, 35 will be fixed in next commit.
This patch release fixes the broken Chinese language selector link
on PyPI by using absolute GitHub URLs instead of relative paths.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This patch release focuses on internationalization and making Skill Seekers
accessible to the Chinese developer community.
Key updates:
- Complete Chinese (简体中文) README translation
- PyPI metadata updated with i18n support
- Natural Language classifiers added
- Community engagement issue created
See CHANGELOG.md for complete release notes.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>