perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing (#309)

## Summary

Performance optimizations across core scraping and analysis modules:

- **doc_scraper.py**: Pre-compiled regex at module level; O(1) URL dedup via _enqueued_urls set; cached URL patterns; _enqueue_url() helper (DRY); seen_links set for link extraction; pre-lowercased category keywords; async error logging (bug fix); summary I/O error handling
- **code_analyzer.py**: O(log n) bisect-based line lookups replacing O(n) count("\n") across all 10 language analyzers; O(n) parent class map replacing O(n^2) AST walks for Python method detection
- **dependency_analyzer.py**: Same bisect line-index optimization for all import extractors
- **codebase_scraper.py**: Module-level import re, pre-imported parser classes outside loop
- **github_scraper.py**: deque.popleft() for O(1) tree traversal, module-level import fnmatch
- **utils.py**: Shared build_line_index() / offset_to_line() utilities (DRY)
- **test_adaptor_benchmarks.py**: Stabilized flaky test_benchmark_metadata_overhead (median, warm-up, more iterations)
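The shared line-index utilities can be sketched roughly as below. The names build_line_index and offset_to_line come from the summary above; the exact signatures and return conventions are assumptions for illustration:

```python
import bisect

def build_line_index(text: str) -> list[int]:
    """Build, once, in O(n): the character offset where each line starts."""
    index = [0]
    for i, ch in enumerate(text):
        if ch == "\n":
            index.append(i + 1)
    return index

def offset_to_line(line_index: list[int], offset: int) -> int:
    """Map a character offset to a 1-based line number in O(log n)."""
    return bisect.bisect_right(line_index, offset)

src = "a = 1\nb = 2\nc = 3\n"
idx = build_line_index(src)
offset_to_line(idx, src.find("c"))  # line 3
```

The index is built once per file, so repeated lookups cost O(log n) each instead of the O(n) `text[:offset].count("\n")` scan they replace.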

Review fixes applied on top of original PR:
1. Renamed misleading _pending_set to _enqueued_urls
2. Extracted duplicated line-index code into shared cli/utils.py
3. Fixed pre-existing "tutorial" vs "tutorials" key mismatch bug in infer_categories()
4. Removed unnecessary _store_results() closure
5. Simplified parser pre-import pattern
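The O(1) URL-dedup pattern behind fixes 1 and the doc_scraper.py bullet can be sketched as follows. Only the names _enqueued_urls and _enqueue_url() appear in the summary; the surrounding class and queue structure are assumptions:

```python
from collections import deque

class Crawler:
    """Sketch: a deque keeps FIFO crawl order, while a set mirror makes
    membership checks O(1) instead of an O(n) scan of the queue."""

    def __init__(self):
        self._queue = deque()
        self._enqueued_urls = set()  # every URL ever queued, for dedup

    def _enqueue_url(self, url: str) -> bool:
        """Queue url once; return False if it was already enqueued."""
        if url in self._enqueued_urls:  # O(1) set lookup
            return False
        self._enqueued_urls.add(url)
        self._queue.append(url)
        return True

c = Crawler()
c._enqueue_url("https://example.com/a")
c._enqueue_url("https://example.com/a")  # duplicate, ignored
```

Note the set is never pruned: it records URLs ever enqueued, not URLs currently pending, which is why the original name _pending_set was misleading.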
Author: copperlang2007
Date: 2026-03-14 13:35:39 -07:00
Committed by: GitHub
Parent: 0ca271cdcb
Commit: 89f5e6fe5f
5 changed files with 191 additions and 140 deletions


```diff
@@ -15,6 +15,7 @@ Usage:
 """
 import argparse
+import fnmatch
 import json
 import logging
 import os
```
```diff
@@ -664,11 +665,13 @@ class GitHubScraper:
     def _extract_file_tree_github(self):
         """Extract file tree from GitHub API (rate-limited)."""
         try:
-            contents = self.repo.get_contents("")
+            from collections import deque
+            contents = deque(self.repo.get_contents(""))
             file_tree = []
             while contents:
-                file_content = contents.pop(0)
+                file_content = contents.popleft()
                 file_info = {
                     "path": file_content.path,
```
```diff
@@ -741,11 +744,10 @@ class GitHubScraper:
                 continue
             # Check if file matches patterns (if specified)
-            if self.file_patterns:
-                import fnmatch
-                if not any(fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns):
-                    continue
+            if self.file_patterns and not any(
+                fnmatch.fnmatch(file_path, pattern) for pattern in self.file_patterns
+            ):
+                continue
             # Analyze this file
             try:
```
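The deque.popleft() change in _extract_file_tree_github above matters because list.pop(0) shifts every remaining element (O(n) per pop, O(n²) to drain), while deque.popleft() is O(1). A minimal, self-contained illustration of the difference (the drain functions are hypothetical, not from the PR):

```python
import timeit
from collections import deque

def drain_list(n: int) -> None:
    items = list(range(n))
    while items:
        items.pop(0)       # O(n) each: shifts all remaining elements left

def drain_deque(n: int) -> None:
    items = deque(range(n))
    while items:
        items.popleft()    # O(1) each: just advances the head pointer

n = 20_000
t_list = timeit.timeit(lambda: drain_list(n), number=1)
t_deque = timeit.timeit(lambda: drain_deque(n), number=1)
# at this size the deque drain is typically far faster than the list drain
```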