perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309)
## Summary
Performance optimizations across core scraping and analysis modules:
- **doc_scraper.py**: Pre-compiled regex at module level, O(1) URL dedup via an `_enqueued_urls` set, cached URL patterns, `_enqueue_url()` helper (DRY), `seen_links` set for link extraction, pre-lowercased category keywords, async error logging (bug fix), summary I/O error handling
- **code_analyzer.py**: O(log n) bisect-based line lookups replacing O(n) `count("\n")` across all 10 language analyzers; O(n) parent-class map replacing O(n^2) AST walks for Python method detection
- **dependency_analyzer.py**: Same bisect line-index optimization for all import extractors
- **codebase_scraper.py**: Module-level `import re`, parser classes pre-imported outside the loop
- **github_scraper.py**: `deque.popleft()` for O(1) tree traversal, module-level `import fnmatch`
- **utils.py**: Shared `build_line_index()` / `offset_to_line()` utilities (DRY)
- **test_adaptor_benchmarks.py**: Stabilized flaky `test_benchmark_metadata_overhead` (median, warm-up, more iterations)
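The shared bisect utilities named above can be sketched roughly as follows. Only the two function names come from this summary; the exact signatures and docstrings here are assumptions:

```python
import bisect


def build_line_index(source: str) -> list[int]:
    """Collect the offset of every newline once, in O(n)."""
    return [i for i, ch in enumerate(source) if ch == "\n"]


def offset_to_line(line_index: list[int], offset: int) -> int:
    """Map a character offset to a 1-based line number in O(log n).

    Replaces the O(n)-per-lookup pattern source[:offset].count("\\n") + 1:
    the line number is the count of newlines strictly before `offset`,
    which bisect finds without rescanning the source.
    """
    return bisect.bisect_left(line_index, offset) + 1
```

The index is built once per file, so resolving n node offsets costs O(n log n) overall instead of the O(n^2) total that repeated `count("\n")` calls incur.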
Review fixes applied on top of the original PR:
1. Renamed misleading _pending_set to _enqueued_urls
2. Extracted duplicated line-index code into shared cli/utils.py
3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories()
4. Removed unnecessary _store_results() closure
5. Simplified parser pre-import pattern
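As a sketch of fix 1 together with the O(1) dedup and `deque.popleft()` traversal: the `_enqueue_url` / `_enqueued_urls` names come from the summary above, but the surrounding class shape is assumed, not taken from the PR:

```python
from collections import deque


class Crawler:
    """A pending deque plus a parallel set: the set answers
    "already enqueued?" in O(1) instead of scanning the queue."""

    def __init__(self) -> None:
        self._queue: deque[str] = deque()
        self._enqueued_urls: set[str] = set()  # renamed from _pending_set

    def _enqueue_url(self, url: str) -> bool:
        # O(1) set membership replaces an O(n) `url in list` scan
        if url in self._enqueued_urls:
            return False
        self._enqueued_urls.add(url)
        self._queue.append(url)
        return True

    def next_url(self) -> str:
        # deque.popleft() is O(1); list.pop(0) would shift every element
        return self._queue.popleft()
```

Keeping the set alongside the queue (rather than popping from the set) also prevents a URL from being re-enqueued after it has been dequeued and visited.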
```diff
@@ -28,6 +28,7 @@ import argparse
 import json
 import logging
 import os
+import re
 import sys
 from pathlib import Path
 from typing import Any
@@ -380,8 +381,6 @@ def extract_markdown_structure(content: str) -> dict[str, Any]:
     Returns:
         Dictionary with extracted structure
     """
-    import re
-
     structure = {
         "title": None,
         "headers": [],
@@ -526,8 +525,6 @@ def extract_rst_structure(content: str) -> dict[str, Any]:
         logger.warning(f"Enhanced RST parser failed: {e}, using basic parser")

     # Legacy basic extraction (fallback)
-    import re
-
     structure = {
         "title": None,
         "headers": [],
@@ -679,6 +676,17 @@ def process_markdown_docs(
     processed_docs = []
     categories = {}

+    # Pre-import parsers once outside the loop
+    _rst_parser_cls = None
+    _md_parser_cls = None
+    try:
+        from skill_seekers.cli.parsers.extractors import RstParser, MarkdownParser
+
+        _rst_parser_cls = RstParser
+        _md_parser_cls = MarkdownParser
+    except ImportError:
+        logger.debug("Unified parsers not available, using legacy parsers")
+
     for md_path in md_files:
         try:
             content = md_path.read_text(encoding="utf-8", errors="ignore")
@@ -701,7 +709,10 @@ def process_markdown_docs(
         parsed_doc = None

         try:
-            from skill_seekers.cli.parsers.extractors import RstParser, MarkdownParser
+            RstParser = _rst_parser_cls
+            MarkdownParser = _md_parser_cls
+            if RstParser is None or MarkdownParser is None:
+                raise ImportError("Parsers not available")

             # Use appropriate unified parser based on file extension
             if md_path.suffix.lower() in RST_EXTENSIONS:
@@ -957,8 +968,6 @@ Return JSON with format:

     # Parse response and merge enhancements
     try:
-        import re
-
         json_match = re.search(r"\{.*\}", response.content[0].text, re.DOTALL)
         if json_match:
             enhancements = json.loads(json_match.group())
@@ -1022,8 +1031,6 @@ Output JSON only:
        os.unlink(prompt_file)

        if result.returncode == 0 and result.stdout:
-            import re
-
            json_match = re.search(r"\{.*\}", result.stdout, re.DOTALL)
            if json_match:
                enhancements = json.loads(json_match.group())
```