fix: MCP scraping hangs and collects only 1 page when using Claude Code CLI (#155)
## ✅ Approved and Merged

Excellent work, @StuartFenton! This is a critical bug fix that unblocks MCP integration for Claude Code CLI users.

### Review Summary

**Test Results:** ✅ All 372 tests passing (100% success rate)
**Code Quality:** ✅ Minimal, surgical changes with clear documentation
**Impact:** ✅ Fixes critical MCP scraping bug (1 page → 100 pages)
**Compatibility:** ✅ Fully backward compatible, no breaking changes

### What This Fixes

1. **MCP subprocess EOFError**: No more crashes on user input prompts
2. **Link discovery**: Now finds navigation links outside main content (10-100x more pages)
3. **--fresh flag**: Properly skips user prompts in automation mode

### Changes Merged

- **cli/doc_scraper.py**: Link extraction from entire page + --fresh flag fix
- **skill_seeker_mcp/server.py**: Auto-pass --fresh flag to prevent prompts

### Testing Validation

Real-world MCP testing shows:

- ✅ Tailwind CSS: 1 page → 100 pages
- ✅ No user prompts during execution
- ✅ Navigation links properly discovered
- ✅ End-to-end workflow through Claude Code CLI

Thank you for the thorough problem analysis, comprehensive testing, and excellent PR description! 🎉

---

**Next Steps:**

- Will be included in next release (v2.0.1)
- Added to project changelog
- MCP integration now fully functional

🤖 Merged with [Claude Code](https://claude.com/claude-code)
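The EOFError in point 1 is the standard behaviour of Python's `input()` when stdin is closed, which is the situation an MCP server creates when it spawns the scraper CLI as a subprocess. A minimal, self-contained sketch of the failure mode (the prompt text mirrors the one in the diff; nothing else here is project code):

```python
import subprocess
import sys

# Run a child that calls input() with stdin detached -- exactly what an
# MCP server does when it spawns the scraper CLI without a terminal.
result = subprocess.run(
    [sys.executable, "-c", "input('Use existing data? (y/n): ')"],
    stdin=subprocess.DEVNULL,
    capture_output=True,
    text=True,
)

# input() raises EOFError because there is no stdin to read from,
# so the child dies with a traceback instead of scraping.
print(result.returncode)             # non-zero
print("EOFError" in result.stderr)   # True
```

This is why the fix routes around the prompt entirely (the `--fresh` branch) rather than trying to answer it.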
**cli/doc_scraper.py**

```diff
@@ -257,15 +257,16 @@ class DocToSkillConverter:
                 paragraphs.append(text)

         page['content'] = '\n\n'.join(paragraphs)

-        # Extract links
-        for link in main.find_all('a', href=True):
+        # Extract links from entire page (not just main content)
+        # This allows discovery of navigation links outside the main content area
+        for link in soup.find_all('a', href=True):
             href = urljoin(url, link['href'])
             # Strip anchor fragments to avoid treating #anchors as separate pages
             href = href.split('#')[0]
             if self.is_valid_url(href) and href not in page['links']:
                 page['links'].append(href)

         return page

     def _extract_language_from_classes(self, classes):
```
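The behavioural change in the hunk above — walking every `<a>` on the page instead of only those under the main content node, resolving relative URLs, stripping `#fragments`, and de-duplicating — can be sketched with only the standard library. `html.parser` stands in for BeautifulSoup here; the class and sample HTML are illustrative, not project code:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    """Collect every href on the page, like soup.find_all('a', href=True)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href is None:
            return
        # Resolve relative URLs, then strip the #fragment so anchors
        # on the same page are not treated as separate pages.
        absolute = urljoin(self.base_url, href).split("#")[0]
        if absolute and absolute not in self.links:
            self.links.append(absolute)

html = """
<nav><a href="/docs/install">Install</a><a href="/docs/install#pip">pip</a></nav>
<main><a href="/docs/usage">Usage</a></main>
"""
collector = LinkCollector("https://example.com/docs/")
collector.feed(html)
print(collector.links)
# → ['https://example.com/docs/install', 'https://example.com/docs/usage']
```

A parser scoped to `<main>` would have seen only `/docs/usage`; scanning the whole document also picks up the `<nav>` link, which is exactly where the 10-100x page-count gain comes from.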
**cli/doc_scraper.py**

```diff
@@ -1641,11 +1642,14 @@ def execute_scraping_and_building(config: Dict[str, Any], args: argparse.Namespace
     # Check for existing data
     exists, page_count = check_existing_data(config['name'])

-    if exists and not args.skip_scrape:
+    if exists and not args.skip_scrape and not args.fresh:
         logger.info("\n✓ Found existing data: %d pages", page_count)
         response = input("Use existing data? (y/n): ").strip().lower()
         if response == 'y':
             args.skip_scrape = True
+    elif exists and args.fresh:
+        logger.info("\n✓ Found existing data: %d pages", page_count)
+        logger.info(" --fresh flag set, will re-scrape from scratch")

     # Create converter
     converter = DocToSkillConverter(config, resume=args.resume)
```
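The control flow of the hunk above reduces to a three-way decision: prompt the user, silently re-scrape, or do nothing. Isolated as a pure function it is easy to test in all branches (the function name and return values are illustrative, not from the project):

```python
def decide_scrape_action(exists: bool, skip_scrape: bool, fresh: bool) -> str:
    """Mirror the branch order in execute_scraping_and_building.

    'prompt'   -> interactive "Use existing data? (y/n)" question
    'rescrape' -> --fresh set: log and re-scrape from scratch
    'no-op'    -> nothing to decide (no data, or scraping skipped)
    """
    if exists and not skip_scrape and not fresh:
        return "prompt"      # old behaviour: blocks MCP subprocesses
    if exists and fresh:
        return "rescrape"    # new branch: no input() call at all
    return "no-op"

# --fresh now bypasses the input() call that hung MCP runs:
print(decide_scrape_action(exists=True, skip_scrape=False, fresh=True))   # → rescrape
print(decide_scrape_action(exists=True, skip_scrape=False, fresh=False))  # → prompt
```

Note that `--fresh` wins even when existing data is found, which is what makes the MCP path deterministic.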
**skill_seeker_mcp/server.py**

```diff
@@ -603,6 +603,10 @@ async def scrape_docs_tool(args: dict) -> list[TextContent]:
     if is_unified and merge_mode:
         cmd.extend(["--merge-mode", merge_mode])

+    # Add --fresh to avoid user input prompts when existing data found
+    if not skip_scrape:
+        cmd.append("--fresh")
+
     if enhance_local:
         cmd.append("--enhance-local")
     if skip_scrape:
```
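The server-side half of the fix — appending `--fresh` to the subprocess command unless the caller asked to skip scraping — can be sketched as a small command builder. Only the `--fresh` and `--enhance-local` handling mirrors the diff; the function name and the `--config`/`--skip-scrape` flags are assumptions made for illustration:

```python
def build_scrape_cmd(config_path: str,
                     skip_scrape: bool = False,
                     enhance_local: bool = False) -> list[str]:
    """Assemble the CLI invocation the MCP server hands to subprocess."""
    cmd = ["python", "cli/doc_scraper.py", "--config", config_path]
    # Auto-pass --fresh so the CLI never stops at the
    # "Use existing data? (y/n)" prompt inside a subprocess.
    if not skip_scrape:
        cmd.append("--fresh")
    if enhance_local:
        cmd.append("--enhance-local")
    if skip_scrape:
        cmd.append("--skip-scrape")
    return cmd

print(build_scrape_cmd("configs/tailwind.json"))
# → ['python', 'cli/doc_scraper.py', '--config', 'configs/tailwind.json', '--fresh']
```

Keeping flag assembly in one place like this makes the "never prompt from a subprocess" invariant auditable in a single function.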