fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline

Bug fixes:
- Fix --var flag silently dropped in create routing (args.workflow_var → args.var)
- Fix double _score_code_quality() call in word scraper
- Add .docx file extension validation in WordToSkillConverter
- Fix weaviate ImportError masked by generic Exception handler
- Fix RAG chunking crash using non-existent converter.output_dir

Chunking pipeline improvements:
- Wire --chunk-overlap-tokens through entire package pipeline (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker)
- Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default
- Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept)
- Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12 concrete adaptors, rag_chunker, base, and package_skill

Code quality:
- Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor base class, removing ~150 lines of duplication from chroma/weaviate/pinecone
- Add Pinecone adaptor with full upload support (pinecone_adaptor.py)

Tests (14 new):
- chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag
- .docx/.doc/no-extension file validation, --var flag routing E2E
- Embedding method inheritance verification, backward-compatible flag aliases

Docs:
- Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH)
- Update README test count badge (1880+ → 2283+)

All 2283 tests passing, 8 skipped, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
2026-02-28 21:57:59 +03:00
parent 3bad7cf365
commit 064405c052
41 changed files with 1864 additions and 237 deletions
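
The auto-scaling overlap rule from the commit message, max(50, chunk_tokens//10) when the chunk size is non-default, can be sketched roughly as below. This is an illustration only, not the code from this commit: the helper name `resolve_chunk_overlap` is hypothetical, the constants are redefined locally with the 512/50 values the message mentions, and the choice to scale only when the overlap itself was left at its default is an assumption.

```python
# Local stand-ins for the constants the diff imports from .common
# (values taken from the "hardcoded 512/50" note in the commit message).
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50


def resolve_chunk_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Hypothetical helper: pick the effective overlap for a chunk size.

    Assumption: auto-scaling applies only when the caller changed the
    chunk size but left the overlap at its default value.
    """
    chunk_size_changed = chunk_tokens != DEFAULT_CHUNK_TOKENS
    overlap_untouched = overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS
    if chunk_size_changed and overlap_untouched:
        # Roughly 10% of the chunk size, never below 50 tokens.
        return max(50, chunk_tokens // 10)
    return overlap_tokens


default_case = resolve_chunk_overlap(512, 50)    # default size: unchanged
large_chunks = resolve_chunk_overlap(1024, 50)   # scaled to 1024 // 10
small_chunks = resolve_chunk_overlap(200, 50)    # floor of 50 applies
explicit = resolve_chunk_overlap(1024, 200)      # explicit overlap kept
```

Under these assumptions an explicit --chunk-overlap-tokens value always wins, and the floor of 50 keeps small chunk sizes from ending up with a near-zero overlap.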
--- a/src/skill_seekers/cli/arguments/package.py
+++ b/src/skill_seekers/cli/arguments/package.py
@@ -8,6 +8,8 @@ import and use these definitions.
 import argparse
 from typing import Any

+from .common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
    # Positional argument
    "skill_directory": {
@@ -49,6 +51,7 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
                "chroma",
                "faiss",
                "qdrant",
+                "pinecone",
            ],
            "default": "claude",
            "help": "Target LLM platform (default: claude)",
@@ -109,13 +112,22 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
        "flags": ("--chunk-tokens",),
        "kwargs": {
            "type": int,
-            "default": 512,
-            "help": "Maximum tokens per chunk (default: 512)",
+            "default": DEFAULT_CHUNK_TOKENS,
+            "help": f"Maximum tokens per chunk (default: {DEFAULT_CHUNK_TOKENS})",
            "metavar": "N",
        },
    },
-    "no_preserve_code": {
-        "flags": ("--no-preserve-code",),
+    "chunk_overlap_tokens": {
+        "flags": ("--chunk-overlap-tokens",),
+        "kwargs": {
+            "type": int,
+            "default": DEFAULT_CHUNK_OVERLAP_TOKENS,
+            "help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
+            "metavar": "N",
+        },
+    },
+    "no_preserve_code_blocks": {
+        "flags": ("--no-preserve-code-blocks",),
        "kwargs": {
            "action": "store_true",
            "help": "Allow code block splitting (default: code blocks preserved)",
@@ -130,3 +142,11 @@ def add_package_arguments(parser: argparse.ArgumentParser) -> None:
        flags = arg_def["flags"]
        kwargs = arg_def["kwargs"]
        parser.add_argument(*flags, **kwargs)
+
+    # Deprecated alias for backward compatibility (removed in v4.0.0)
+    parser.add_argument(
+        "--no-preserve-code",
+        dest="no_preserve_code_blocks",
+        action="store_true",
+        help=argparse.SUPPRESS,
+    )
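
The deprecated-alias pattern added at the end of this hunk can be tried in isolation. The sketch below uses only the standard `argparse` module and a throwaway parser (the `prog` name is made up for the example); it shows that both spellings write to the same `no_preserve_code_blocks` attribute, which is what `dest=` is used for in the diff.

```python
import argparse

# Throwaway parser; "package-demo" is a placeholder name for this example.
parser = argparse.ArgumentParser(prog="package-demo")

# New, documented flag.
parser.add_argument(
    "--no-preserve-code-blocks",
    action="store_true",
    help="Allow code block splitting (default: code blocks preserved)",
)

# Old spelling kept as a hidden alias: same dest, help suppressed.
parser.add_argument(
    "--no-preserve-code",
    dest="no_preserve_code_blocks",
    action="store_true",
    help=argparse.SUPPRESS,
)

new_style = parser.parse_args(["--no-preserve-code-blocks"])
old_style = parser.parse_args(["--no-preserve-code"])
neither = parser.parse_args([])
```

Because argparse resolves an exact option string before trying prefix matches, `--no-preserve-code` is handled by the alias itself and is not treated as an abbreviation of the longer flag.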