fix: MCP scraping hangs and collects only 1 page when using Claude Code CLI (#155)

## ✅ Approved and Merged

Excellent work, @StuartFenton! This is a critical bug fix that unblocks MCP integration for Claude Code CLI users.

### Review Summary

**Test Results:** ✅ All 372 tests passing (100% success rate)
**Code Quality:** ✅ Minimal, surgical changes with clear documentation
**Impact:** ✅ Fixes critical MCP scraping bug (1 page → 100 pages)
**Compatibility:** ✅ Fully backward compatible, no breaking changes

### What This Fixes

1. **MCP subprocess EOFError**: `input()` prompts no longer crash the scraper when it runs without an interactive stdin
2. **Link discovery**: links are now extracted from the entire page, not just the main content area, so navigation links are found (10-100x more pages)
3. **--fresh flag**: properly skips the "use existing data?" prompt in automation mode
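The interaction between items 1 and 3 can be sketched as follows. This is a minimal illustration of the failure mode and the flag guard, not the project's actual code; the function name and signature are hypothetical:

```python
# Hypothetical sketch: when the scraper runs as an MCP subprocess there is
# no interactive stdin, so input() raises EOFError. Guarding the prompt
# behind a --fresh style flag keeps automation from hanging or crashing.

def should_reuse_existing(data_exists: bool, fresh: bool) -> bool:
    """Decide whether to reuse cached pages without blocking on stdin."""
    if not data_exists or fresh:
        return False  # nothing cached, or the caller forced a re-scrape
    try:
        answer = input("Use existing data? (y/n): ").strip().lower()
    except EOFError:
        # No interactive stdin (e.g. MCP subprocess): fall back to re-scraping
        return False
    return answer == 'y'

# With fresh=True the prompt is never reached, so automation cannot hang:
assert should_reuse_existing(data_exists=True, fresh=True) is False
```

The key point is that the flag check happens *before* any `input()` call, so the non-interactive path never touches stdin at all.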

### Changes Merged

- **cli/doc_scraper.py**: link extraction from the entire page + --fresh flag fix
- **skill_seeker_mcp/server.py**: Auto-pass --fresh flag to prevent prompts

### Testing Validation

Real-world MCP testing shows:
- ✅ Tailwind CSS: 1 page → 100 pages
- ✅ No user prompts during execution
- ✅ Navigation links properly discovered
- ✅ End-to-end workflow verified through Claude Code CLI

Thank you for the thorough problem analysis, comprehensive testing, and excellent PR description! 🎉

---

**Next Steps:**
- Will be included in the next release (v2.0.1)
- Added to project changelog
- MCP integration now fully functional

🤖 Merged with [Claude Code](https://claude.com/claude-code)
---

- **Commit:** `55bc8518f0` (parent `13b19c2b06`)
- **Author:** StuartFenton, committed by GitHub
- **Date:** 2025-11-06 20:23:45 +00:00
- **Diff stat:** 2 changed files with 13 additions and 5 deletions

**cli/doc_scraper.py**

```diff
@@ -257,15 +257,16 @@ class DocToSkillConverter:
                 paragraphs.append(text)
         page['content'] = '\n\n'.join(paragraphs)

-        # Extract links
-        for link in main.find_all('a', href=True):
+        # Extract links from entire page (not just main content)
+        # This allows discovery of navigation links outside the main content area
+        for link in soup.find_all('a', href=True):
             href = urljoin(url, link['href'])
             # Strip anchor fragments to avoid treating #anchors as separate pages
             href = href.split('#')[0]
             if self.is_valid_url(href) and href not in page['links']:
                 page['links'].append(href)

         return page

     def _extract_language_from_classes(self, classes):
```
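The link-normalization step in this hunk (resolve relative hrefs against the page URL, strip anchor fragments, de-duplicate) can be sketched with the standard library alone. `normalize_links` is a hypothetical helper for illustration, not a function from this codebase:

```python
# Stdlib-only sketch of the normalization applied to each discovered href:
# make it absolute, drop the #anchor, and skip duplicates so that
# page#section-1 and page#section-2 are not treated as separate pages.
from urllib.parse import urljoin

def normalize_links(base_url, hrefs):
    links = []
    for href in hrefs:
        absolute = urljoin(base_url, href)       # resolve relative references
        absolute = absolute.split('#')[0]        # strip the anchor fragment
        if absolute and absolute not in links:   # de-duplicate, keep order
            links.append(absolute)
    return links

print(normalize_links(
    "https://docs.example.com/guide/",
    ["intro", "intro#usage", "/api", "#top"],
))
# → ['https://docs.example.com/guide/intro', 'https://docs.example.com/api', 'https://docs.example.com/guide/']
```

Note how `intro` and `intro#usage` collapse to a single page, which is exactly why the fix stops anchors from inflating (or fragmenting) the crawl queue.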
```diff
@@ -1641,11 +1642,14 @@ def execute_scraping_and_building(config: Dict[str, Any], args: argparse.Namespace):
     # Check for existing data
     exists, page_count = check_existing_data(config['name'])

-    if exists and not args.skip_scrape:
+    if exists and not args.skip_scrape and not args.fresh:
         logger.info("\n✓ Found existing data: %d pages", page_count)
         response = input("Use existing data? (y/n): ").strip().lower()
         if response == 'y':
             args.skip_scrape = True
+    elif exists and args.fresh:
+        logger.info("\n✓ Found existing data: %d pages", page_count)
+        logger.info("  --fresh flag set, will re-scrape from scratch")

     # Create converter
     converter = DocToSkillConverter(config, resume=args.resume)
```

**skill_seeker_mcp/server.py**

```diff
@@ -603,6 +603,10 @@ async def scrape_docs_tool(args: dict) -> list[TextContent]:
     if is_unified and merge_mode:
         cmd.extend(["--merge-mode", merge_mode])

+    # Add --fresh to avoid user input prompts when existing data found
+    if not skip_scrape:
+        cmd.append("--fresh")
+
     if enhance_local:
         cmd.append("--enhance-local")

     if skip_scrape:
```
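The command-assembly logic this hunk changes can be sketched as follows. `build_scrape_cmd`, the `--config` flag, and the example config path are hypothetical; only `--fresh`, `--enhance-local`, and the `skip_scrape` guard come from the diff above:

```python
# Hypothetical sketch of how the MCP server builds the scraper command.
# The server.py fix appends --fresh whenever a real scrape will run, so
# the child process never blocks on the "Use existing data?" prompt.
def build_scrape_cmd(config_path, skip_scrape=False, enhance_local=False):
    # "--config" is an assumed flag name used for illustration only
    cmd = ["python", "cli/doc_scraper.py", "--config", config_path]
    if not skip_scrape:
        cmd.append("--fresh")  # suppress interactive prompts in automation
    if enhance_local:
        cmd.append("--enhance-local")
    return cmd

print(build_scrape_cmd("configs/tailwind.json"))
```

Passing the flag unconditionally for real scrapes (rather than detecting a TTY in the child) keeps the fix entirely on the caller's side, which is why no change to the scraper's prompt code path was needed beyond the `--fresh` check.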