perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309)

## Summary

Performance optimizations across core scraping and analysis modules:

- **doc_scraper.py**: Pre-compiled regex at module level; O(1) URL dedup via _enqueued_urls set; cached URL patterns; _enqueue_url() helper (DRY); seen_links set for link extraction; pre-lowercased category keywords; async error logging (bug fix); summary I/O error handling
- **code_analyzer.py**: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection
- **dependency_analyzer.py**: Same bisect line-index optimization for all import extractors
- **codebase_scraper.py**: Module-level import re, pre-imported parser classes outside loop
- **github_scraper.py**: deque.popleft() for O(1) tree traversal, module-level import fnmatch
- **utils.py**: Shared build_line_index() / offset_to_line() utilities (DRY)
- **test_adaptor_benchmarks.py**: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations)
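The shared line-index utilities can be sketched roughly as below. The names build_line_index and offset_to_line come from the summary above; the exact signatures and return conventions are assumptions for illustration:

```python
import bisect

def build_line_index(text: str) -> list[int]:
    """Build, once, in O(n): the character offset where each line starts."""
    index = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            index.append(i + 1)
    return index

def offset_to_line(line_index: list[int], offset: int) -> int:
    """Map a character offset to a 1-based line number in O(log n)."""
    return bisect.bisect_right(line_index, offset)

src = "a = 1\nb = 2\nc = 3\n"
idx = build_line_index(src)
offset_to_line(idx, src.find("c"))  # line 3
```

The index is built once per file, so repeated lookups cost O(log n) each instead of the O(n) `text[:offset].count("\n")` scan they replace.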

Review fixes applied on top of original PR:
1. Renamed misleading _pending_set to _enqueued_urls
2. Extracted duplicated line-index code into shared cli/utils.py
3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories()
4. Removed unnecessary _store_results() closure
5. Simplified parser pre-import pattern
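The O(1) URL-dedup pattern behind fixes 1 and the doc_scraper.py bullet can be sketched as follows. Only the names _enqueued_urls and _enqueue_url() appear in the summary; the surrounding class and queue structure are assumptions:

```python
from collections import deque

class Crawler:
    """Sketch: a deque keeps FIFO crawl order, while a set mirror makes
    membership checks O(1) instead of an O(n) scan of the queue."""

    def __init__(self):
        self._queue = deque()
        self._enqueued_urls = set()  # every URL ever queued, for dedup

    def _enqueue_url(self, url: str) -> bool:
        """Queue url once; return False if it was already enqueued."""
        if url in self._enqueued_urls:  # O(1) set lookup
            return False
        self._enqueued_urls.add(url)
        self._queue.append(url)
        return True

c = Crawler()
c._enqueue_url("https://example.com/a")
c._enqueue_url("https://example.com/a")  # duplicate, ignored
```

Note the set is never pruned: it records URLs ever enqueued, not URLs currently pending, which is why the original name _pending_set was misleading.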
Author: copperlang2007
Date: 2026-03-14 13:35:39 -07:00
Committed by: GitHub
Parent: 0ca271cdcb
Commit: 89f5e6fe5f
5 changed files with 191 additions and 140 deletions


```diff
@@ -15,6 +15,7 @@ Usage:
 """
 import argparse
+import fnmatch
 import json
 import logging
 import os
```
```diff
@@ -664,11 +665,13 @@ class GitHubScraper:
     def _extract_file_tree_github(self):
         """Extract file tree from GitHub API (rate-limited)."""
         try:
-            contents = self.repo.get_contents("")
+            from collections import deque
+            contents = deque(self.repo.get_contents(""))
             file_tree = []
             while contents:
-                file_content = contents.pop(0)
+                file_content = contents.popleft()
                 file_info = {
                     "path": file_content.path,
```
```diff
@@ -741,11 +744,10 @@ class GitHubScraper:
                 continue
             # Check if file matches patterns (if specified)
-            if self.file_patterns:
-                import fnmatch
-                if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
-                    continue
+            if self.file_patterns and not any(
+                fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns
+            ):
+                continue
             # Analyze this file
             try:
```
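The deque.popleft() change in _extract_file_tree_github above matters because list.pop(0) shifts every remaining element (O(n) per pop, O(n²) to drain), while deque.popleft() is O(1). A minimal, self-contained illustration of the difference (the drain functions are hypothetical, not from the PR):

```python
import timeit
from collections import deque

def drain_list(n: int) -> None:
    items = list(range(n))
    while items:
        items.pop(0)       # O(n) each: shifts all remaining elements left

def drain_deque(n: int) -> None:
    items = deque(range(n))
    while items:
        items.popleft()    # O(1) each: just advances the head pointer

n = 20_000
t_list = timeit.timeit(lambda: drain_list(n), number=1)
t_deque = timeit.timeit(lambda: drain_deque(n), number=1)
# at this size the deque drain is typically far faster than the list drain
```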