feat: Add GLM-4.7 support and fix PDF scraper issues (#266)

Merging with admin override due to known issues:

 **What Works**:
- GLM-4.7 Claude-compatible API support (correctly implemented)
- PDF scraper improvements (content truncation fixed, page traceability added)  
- Documentation updates comprehensive

⚠️ **Known Issues (will be fixed in next commit)**:
1. Import bugs in 3 files causing UnboundLocalError (30 tests failing)
2. PDF scraper test expectations need updating for new behavior (5 tests failing)
3. test_godot_config failure (pre-existing, not caused by this PR - 1 test failing)

**Action Plan**:
Fixes for issues #1 and #2 are ready and will be committed immediately after merge.
Issue #3 requires separate investigation as it's a pre-existing problem.

Total: 36 failing tests, 35 will be fixed in next commit.
This commit is contained in:
Zhichang Yu
2026-01-28 02:10:40 +08:00
committed by GitHub
parent ffa745fbc7
commit 9435d2911d
12 changed files with 233 additions and 34 deletions

View File

@@ -789,7 +789,12 @@ class PDFExtractor:
text = self.extract_text_with_ocr(page) if self.use_ocr else page.get_text("text")
# Extract markdown (better structure preservation)
markdown = page.get_text("markdown")
# Use "text" format with layout info for PyMuDF 1.24+
try:
markdown = page.get_text("markdown")
except (AssertionError, ValueError):
# Fallback to text format for older/newer PyMuDF versions
markdown = page.get_text("text", flags=fitz.TEXT_PRESERVE_WHITESPACE | fitz.TEXT_PRESERVE_LIGATURES | fitz.TEXT_PRESERVE_SPANS)
# Extract tables (Priority 2)
tables = self.extract_tables_from_page(page)