perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309)
## Summary
Performance optimizations across core scraping and analysis modules:
- **doc_scraper.py**: Pre-compiled regex at module level, O(1) URL dedup via _enqueued_urls set, cached URL patterns, _enqueue_url() helper (DRY), seen_links set for link extraction, pre-lowercased category keywords, async error logging (bug fix), summary I/O error handling
- **code_analyzer.py**: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection
- **dependency_analyzer.py**: Same bisect line-index optimization for all import extractors
- **codebase_scraper.py**: Module-level import re, pre-imported parser classes outside loop
- **github_scraper.py**: deque.popleft() for O(1) tree traversal, module-level import fnmatch
- **utils.py**: Shared build_line_index() / offset_to_line() utilities (DRY)
- **test_adaptor_benchmarks.py**: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations)
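The shared line-index helpers in `cli/utils.py` are not shown in the diff below; here is a minimal sketch of the idea, assuming the index stores line-start offsets and line numbers are 1-based (the actual signatures of `build_line_index()` / `offset_to_line()` may differ):

```python
import bisect


def build_line_index(text: str) -> list[int]:
    """Precompute the character offset at which each line starts. O(n), done once."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            starts.append(i + 1)
    return starts


def offset_to_line(line_starts: list[int], offset: int) -> int:
    """Map a character offset to a 1-based line number via binary search, O(log n).

    Replaces the O(n) pattern of text.count("\\n", 0, offset) per lookup.
    """
    return bisect.bisect_right(line_starts, offset)
```

With many lookups per file, this turns an O(n) scan per match into a single O(n) pass plus O(log n) per query.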
Review fixes applied on top of the original PR:
1. Renamed misleading _pending_set to _enqueued_urls
2. Extracted duplicated line-index code into shared cli/utils.py
3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories()
4. Removed unnecessary _store_results() closure
5. Simplified parser pre-import pattern
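The Python method-detection change in `code_analyzer.py` is also not shown in the diff below. A sketch of the single-pass parent-class map, with the helper name `build_method_map` invented here for illustration:

```python
import ast


def build_method_map(tree: ast.AST) -> dict[ast.AST, str]:
    """One pass over the AST: map each method node to its enclosing class name.

    Replaces the O(n^2) pattern of re-walking the whole tree for every
    function node just to decide whether it is a method.
    """
    methods: dict[ast.AST, str] = {}
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            # Only direct body members count as methods of this class.
            for child in node.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    methods[child] = node.name
    return methods
```

`ast.walk()` visits each node once, and each `ClassDef` body is scanned once, so total work stays linear in the tree size.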
```diff
@@ -15,6 +15,7 @@ Usage:
 """
 
 import argparse
+import fnmatch
 import json
 import logging
 import os
@@ -664,11 +665,13 @@ class GitHubScraper:
     def _extract_file_tree_github(self):
         """Extract file tree from GitHub API (rate-limited)."""
         try:
-            contents = self.repo.get_contents("")
+            from collections import deque
+
+            contents = deque(self.repo.get_contents(""))
             file_tree = []
 
             while contents:
-                file_content = contents.pop(0)
+                file_content = contents.popleft()
 
                 file_info = {
                     "path": file_content.path,
@@ -741,11 +744,10 @@ class GitHubScraper:
                 continue
 
             # Check if file matches patterns (if specified)
-            if self.file_patterns:
-                import fnmatch
-
-                if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
-                    continue
+            if self.file_patterns and not any(
+                fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns
+            ):
+                continue
 
             # Analyze this file
             try:
```