yusyus
|
5d8c7e39f6
|
Add unified multi-source scraping feature (Phases 7-11)
Completes the unified scraping system implementation:
**Phase 7: Unified Skill Builder**
- cli/unified_skill_builder.py: Generates final skill structure
- Inline conflict warnings (⚠️) in API reference
- Side-by-side docs vs code comparison
- Severity-based conflict grouping
- Separate conflicts.md report
**Phase 8: MCP Integration**
- skill_seeker_mcp/server.py: Auto-detects unified vs legacy configs
- Routes to unified_scraper.py or doc_scraper.py automatically
- Supports merge_mode parameter override
- Maintains full backward compatibility
**Phase 9: Example Unified Configs**
- configs/react_unified.json: React docs + GitHub
- configs/django_unified.json: Django docs + GitHub
- configs/fastapi_unified.json: FastAPI docs + GitHub
- configs/fastapi_unified_test.json: Test config with limited pages
**Phase 10: Comprehensive Tests**
- cli/test_unified_simple.py: Integration tests (all passing)
- Tests unified config validation
- Tests backward compatibility
- Tests mixed source types
- Tests error handling
**Phase 11: Documentation**
- docs/UNIFIED_SCRAPING.md: Complete guide (1000+ lines)
- Examples, best practices, troubleshooting
- Architecture diagrams and data flow
- Command reference
**Additional:**
- demo_conflicts.py: Interactive conflict detection demo
- TEST_RESULTS.md: Complete test results and findings
- cli/unified_scraper.py: Fixed doc_scraper integration (subprocess)
**Features:**
✅ Multi-source scraping (docs + GitHub + PDF)
✅ Conflict detection (4 types, 3 severity levels)
✅ Rule-based merging (fast, deterministic)
✅ Claude-enhanced merging (AI-powered)
✅ Transparent conflict reporting
✅ MCP auto-detection
✅ Backward compatibility
**Test Results:**
- 6/6 integration tests passed
- 4 unified configs validated
- 3 legacy configs backward compatible
- 5 conflicts detected in test data
- All documentation complete
🤖 Generated with Claude Code
|
2025-10-26 16:33:41 +03:00 |
|
yusyus
|
f03f4cf569
|
feat: Phase 6 - Unified scraper orchestrator
Created main orchestrator that coordinates entire workflow:
Architecture:
- UnifiedScraper class orchestrates all phases
- Routes to appropriate scraper based on source type
- Supports any combination of sources
4-Phase Workflow:
1. Scrape all sources (docs, GitHub, PDF)
2. Detect conflicts (if multiple API sources)
3. Merge intelligently (rule-based or Claude-enhanced)
4. Build unified skill (placeholder for Phase 7)
Features:
✅ Validates unified config on startup
✅ Backward compatible with legacy configs
✅ Source-specific routing (documentation/github/pdf)
✅ Automatic conflict detection when needed
✅ Merge mode selection (rule-based/claude-enhanced)
✅ Creates organized output structure
✅ Comprehensive logging for each phase
✅ Error handling and graceful failures
CLI Usage:
- python3 cli/unified_scraper.py --config configs/godot_unified.json
- python3 cli/unified_scraper.py -c configs/react_unified.json -m claude-enhanced
Output Structure:
- output/{name}/ - Final skill directory
- output/{name}_unified_data/ - Intermediate data files
* documentation_data.json
* github_data.json
* conflicts.json
* merged_data.json
Next: Phase 7 - Skill builder to generate final SKILL.md
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
|
2025-10-26 15:32:23 +03:00 |
|