fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline

Bug fixes:
- Fix --var flag silently dropped in create routing (args.workflow_var → args.var)
- Fix double _score_code_quality() call in word scraper
- Add .docx file extension validation in WordToSkillConverter
- Fix weaviate ImportError masked by generic Exception handler
- Fix RAG chunking crash using non-existent converter.output_dir

Chunking pipeline improvements:
- Wire --chunk-overlap-tokens through entire package pipeline
  (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker)
- Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default
- Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept)
- Replace hardcoded 512/50 chunk defaults with
  DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12
  concrete adaptors, rag_chunker, base, and package_skill

Code quality:
- Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to
  SkillAdaptor base class, removing ~150 lines of duplication from
  chroma/weaviate/pinecone
- Add Pinecone adaptor with full upload support (pinecone_adaptor.py)

Tests (14 new):
- chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag
- .docx/.doc/no-extension file validation, --var flag routing E2E
- Embedding method inheritance verification, backward-compatible flag aliases

Docs:
- Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH)
- Update README test count badge (1880+ → 2283+)

All 2283 tests passing, 8 skipped, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:57:59 +03:00
parent 3bad7cf365
commit 064405c052
41 changed files with 1864 additions and 237 deletions
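Reviewer note: the auto-scaling overlap rule from the commit message, max(50, chunk_tokens//10) when the chunk size is non-default, can be sketched as below. This is a minimal illustration, not the committed code: the helper name resolve_chunk_overlap is hypothetical, the constant values 512 and 50 are taken from the hardcoded defaults this commit replaces, and the choice to leave an explicitly supplied overlap untouched is an assumption.

```python
# Values assumed from the hardcoded 512/50 defaults replaced in this commit.
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50


def resolve_chunk_overlap(
    chunk_tokens: int,
    chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> int:
    """Hypothetical helper: pick the overlap to use for a given chunk size.

    Assumption: auto-scaling applies only when the chunk size differs from
    the default AND the overlap was left at its default value.
    """
    size_is_custom = chunk_tokens != DEFAULT_CHUNK_TOKENS
    overlap_is_default = chunk_overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS
    if size_is_custom and overlap_is_default:
        # Rule quoted in the commit message: max(50, chunk_tokens // 10)
        return max(50, chunk_tokens // 10)
    return chunk_overlap_tokens


if __name__ == "__main__":
    for size in (256, 512, 1024, 4096):
        print(size, resolve_chunk_overlap(size))
```

Under these assumptions, the floor of 50 keeps small chunks from losing their overlap entirely, while larger chunks get roughly 10% overlap.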
--- a/src/skill_seekers/cli/rag_chunker.py
+++ b/src/skill_seekers/cli/rag_chunker.py
@@ -14,6 +14,8 @@ Usage:
    chunks = chunker.chunk_skill(Path("output/react"))
 """

+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 import re
 from pathlib import Path
 import json
@@ -35,8 +37,8 @@ class RAGChunker:

    def __init__(
        self,
-        chunk_size: int = 512,
-        chunk_overlap: int = 50,
+        chunk_size: int = DEFAULT_CHUNK_TOKENS,
+        chunk_overlap: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
        preserve_code_blocks: bool = True,
        preserve_paragraphs: bool = True,
        min_chunk_size: int = 100,
@@ -383,9 +385,9 @@ def main():
    )
    parser.add_argument("skill_dir", type=Path, help="Path to skill directory")
    parser.add_argument("--output", "-o", type=Path, help="Output JSON file")
-    parser.add_argument("--chunk-tokens", type=int, default=512, help="Target chunk size in tokens")
+    parser.add_argument("--chunk-tokens", type=int, default=DEFAULT_CHUNK_TOKENS, help="Target chunk size in tokens")
    parser.add_argument(
-        "--chunk-overlap-tokens", type=int, default=50, help="Overlap size in tokens"
+        "--chunk-overlap-tokens", type=int, default=DEFAULT_CHUNK_OVERLAP_TOKENS, help="Overlap size in tokens"
    )
    parser.add_argument("--no-code-blocks", action="store_true", help="Don't preserve code blocks")
    parser.add_argument("--no-paragraphs", action="store_true", help="Don't preserve paragraphs")