perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309)

## Summary

Performance optimizations across core scraping and analysis modules:

- **doc_scraper.py**: Pre-compiled regex at module level; O(1) URL dedup via `_enqueued_urls` set; cached URL patterns; `_enqueue_url()` helper (DRY); `seen_links` set for link extraction; pre-lowercased category keywords; async error logging (bug fix); summary I/O error handling
- **code_analyzer.py**: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection
- **dependency_analyzer.py**: Same bisect line-index optimization for all import extractors
- **codebase_scraper.py**: Module-level import re, pre-imported parser classes outside loop
- **github_scraper.py**: deque.popleft() for O(1) tree traversal, module-level import fnmatch
- **utils.py**: Shared build_line_index() / offset_to_line() utilities (DRY)
- **test_adaptor_benchmarks.py**: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations)
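The shared line-index helpers named above can be sketched roughly as follows (the exact signatures in `cli/utils.py` are assumed, not copied): build a sorted list of newline offsets once, then map any character offset to its line number with `bisect` in O(log n) instead of rescanning the prefix with `count("\n")` on every lookup.

```python
import bisect


def build_line_index(source: str) -> list[int]:
    """Return the sorted offsets of every newline in *source* (built once)."""
    return [i for i, ch in enumerate(source) if ch == "\n"]


def offset_to_line(line_index: list[int], offset: int) -> int:
    """Map a character offset to a 1-based line number in O(log n).

    Equivalent to ``source.count("\\n", 0, offset) + 1``, but without the
    O(n) scan per call.
    """
    return bisect.bisect_left(line_index, offset) + 1
```

For a file analyzed offset-by-offset this turns an O(n) scan per lookup into a single O(n) pass plus O(log n) per query, which is where the win across all ten language analyzers comes from.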
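The `deque.popleft()` change in github_scraper.py is the classic queue fix: `list.pop(0)` shifts every remaining element (O(n) per pop), while `deque.popleft()` is O(1). A minimal breadth-first traversal in that style, using a hypothetical node shape (dicts with `path`/`children` keys, not the scraper's actual tree format):

```python
from collections import deque


def walk_tree(root: dict) -> list[str]:
    """Breadth-first traversal using deque.popleft() for O(1) dequeues."""
    paths: list[str] = []
    queue = deque([root])
    while queue:
        node = queue.popleft()  # O(1), unlike list.pop(0)
        paths.append(node["path"])
        queue.extend(node.get("children", []))
    return paths
```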

Review fixes applied on top of original PR:
1. Renamed misleading _pending_set to _enqueued_urls
2. Extracted duplicated line-index code into shared cli/utils.py
3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories()
4. Removed unnecessary _store_results() closure
5. Simplified parser pre-import pattern
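The renamed `_enqueued_urls` set backs the O(1) URL dedup described above: membership in a set is a hash lookup, versus an O(n) scan of a pending list. A condensed, hypothetical sketch of the `_enqueue_url()` pattern (the class wrapper and method name here are illustrative, not the module's actual API):

```python
from collections import deque


class CrawlQueue:
    """Illustrative sketch of set-backed enqueue deduplication."""

    def __init__(self) -> None:
        self._queue: deque[str] = deque()
        self._enqueued_urls: set[str] = set()  # O(1) membership checks

    def enqueue(self, url: str) -> bool:
        """Queue *url* once; return False if it was already enqueued."""
        if url in self._enqueued_urls:
            return False
        self._enqueued_urls.add(url)
        self._queue.append(url)
        return True
```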
Authored by copperlang2007 on 2026-03-14 13:35:39 -07:00, committed by GitHub
Parent: 0ca271cdcb · Commit: 89f5e6fe5f
5 changed files with 191 additions and 140 deletions


@@ -28,6 +28,7 @@ import argparse
 import json
 import logging
 import os
+import re
 import sys
 from pathlib import Path
 from typing import Any
@@ -380,8 +381,6 @@ def extract_markdown_structure(content: str) -> dict[str, Any]:
     Returns:
         Dictionary with extracted structure
     """
-    import re
     structure = {
         "title": None,
         "headers": [],
@@ -526,8 +525,6 @@ def extract_rst_structure(content: str) -> dict[str, Any]:
         logger.warning(f"Enhanced RST parser failed: {e}, using basic parser")
         # Legacy basic extraction (fallback)
-        import re
         structure = {
             "title": None,
             "headers": [],
@@ -679,6 +676,17 @@ def process_markdown_docs(
     processed_docs = []
     categories = {}
+    # Pre-import parsers once outside the loop
+    _rst_parser_cls = None
+    _md_parser_cls = None
+    try:
+        from skill_seekers.cli.parsers.extractors import RstParser, MarkdownParser
+        _rst_parser_cls = RstParser
+        _md_parser_cls = MarkdownParser
+    except ImportError:
+        logger.debug("Unified parsers not available, using legacy parsers")
     for md_path in md_files:
         try:
             content = md_path.read_text(encoding="utf-8", errors="ignore")
@@ -701,7 +709,10 @@
         parsed_doc = None
         try:
-            from skill_seekers.cli.parsers.extractors import RstParser, MarkdownParser
+            RstParser = _rst_parser_cls
+            MarkdownParser = _md_parser_cls
+            if RstParser is None or MarkdownParser is None:
+                raise ImportError("Parsers not available")
             # Use appropriate unified parser based on file extension
             if md_path.suffix.lower() in RST_EXTENSIONS:
@@ -957,8 +968,6 @@ Return JSON with format:
     # Parse response and merge enhancements
     try:
-        import re
         json_match = re.search(r"\{.*\}", response.content[0].text, re.DOTALL)
         if json_match:
             enhancements = json.loads(json_match.group())
@@ -1022,8 +1031,6 @@ Output JSON only:
     os.unlink(prompt_file)
     if result.returncode == 0 and result.stdout:
-        import re
         json_match = re.search(r"\{.*\}", result.stdout, re.DOTALL)
         if json_match:
             enhancements = json.loads(json_match.group())