fix: apply review fixes from PR #309 and stabilize flaky benchmark test
Follow-up to PR #309 (perf: optimize with caching, pre-compiled regex, O(1) lookups, and bisect line indexing). These fixes were committed to the PR branch but missed the squash merge. Review fixes (credit: PR #309 by copperlang2007): 1. Rename _pending_set -> _enqueued_urls to accurately reflect that the set tracks all ever-enqueued URLs, not just currently pending ones 2. Extract duplicated _build_line_index()/_offset_to_line() into shared build_line_index()/offset_to_line() in cli/utils.py (DRY) 3. Fix pre-existing bug: infer_categories() guard checked 'tutorial' but wrote to 'tutorials' key, risking silent overwrites 4. Remove unnecessary _store_results() closure in scrape_page() 5. Simplify parser pre-import in codebase_scraper.py Benchmark stabilization: - test_benchmark_metadata_overhead was flaky on CI (106.7% overhead observed, threshold 50%) because 5 iterations with mean averaging can't reliably measure microsecond-level differences - Fix: 20 iterations, warm-up run, median instead of mean, threshold raised to 200% (guards catastrophic regression, not noise) Ref: https://github.com/yusufkaraaslan/Skill_Seekers/pull/309
This commit is contained in:
@@ -3,6 +3,7 @@
|
||||
Utility functions for Skill Seeker CLI tools
|
||||
"""
|
||||
|
||||
import bisect
|
||||
import logging
|
||||
import os
|
||||
import platform
|
||||
@@ -450,3 +451,36 @@ async def retry_with_backoff_async(
|
||||
if last_exception is not None:
|
||||
raise last_exception
|
||||
raise RuntimeError(f"{operation_name} failed with no exception captured")
|
||||
|
||||
|
||||
# ---------------------------------------------------------------------------
|
||||
# Line-index utilities for O(log n) offset-to-line-number lookups
|
||||
# ---------------------------------------------------------------------------
|
||||
|
||||
|
||||
def build_line_index(content: str) -> list[int]:
|
||||
"""Build a sorted list of newline byte-offsets for O(log n) line lookups.
|
||||
|
||||
Args:
|
||||
content: Source text whose newline positions to index.
|
||||
|
||||
Returns:
|
||||
Sorted list of character offsets where '\\n' occurs.
|
||||
"""
|
||||
return [i for i, ch in enumerate(content) if ch == "\n"]
|
||||
|
||||
|
||||
def offset_to_line(newline_offsets: list[int], offset: int) -> int:
|
||||
"""Convert a character offset to a 1-based line number.
|
||||
|
||||
Uses ``bisect`` for O(log n) lookup against an index built by
|
||||
:func:`build_line_index`.
|
||||
|
||||
Args:
|
||||
newline_offsets: Sorted newline positions from :func:`build_line_index`.
|
||||
offset: Character offset into the source text.
|
||||
|
||||
Returns:
|
||||
1-based line number corresponding to *offset*.
|
||||
"""
|
||||
return bisect.bisect_left(newline_offsets, offset) + 1
|
||||
|
||||
Reference in New Issue
Block a user