fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline

Bug fixes:
- Fix --var flag silently dropped in create routing (args.workflow_var → args.var)
- Fix double _score_code_quality() call in word scraper
- Add .docx file extension validation in WordToSkillConverter
- Fix weaviate ImportError masked by generic Exception handler
- Fix RAG chunking crash using non-existent converter.output_dir
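The --var routing fix above is an instance of a common argparse pitfall: the flag parses fine, but the handler reads a stale attribute name, so the value never reaches it. A minimal sketch (flag behavior and names are illustrative assumptions, not this project's actual CLI code):

```python
import argparse

# argparse stores `--var` under `args.var` unless dest= says otherwise.
# Reading a stale attribute like `args.workflow_var` either raises
# AttributeError or, when guarded with getattr(..., default), silently
# drops the user's value -- the bug class fixed in this commit.
parser = argparse.ArgumentParser()
parser.add_argument("--var", action="append", default=[])
args = parser.parse_args(["--var", "region=eu"])

print(args.var)                             # ['region=eu'] -- value arrives here
print(getattr(args, "workflow_var", None))  # None -- silently "dropped"
```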

Chunking pipeline improvements:
- Wire --chunk-overlap-tokens through entire package pipeline
  (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker)
- Add auto-scaling overlap: max(50, chunk_tokens // 10) when --chunk-tokens is non-default
  but --chunk-overlap-tokens is left at its default
- Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept)
- Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS
  constants across all 12 concrete adaptors, rag_chunker, base, and package_skill
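The auto-scaling rule above can be sketched as follows (the constants mirror the names introduced in this commit; the helper function itself is hypothetical, not the project's actual code):

```python
# Illustrative sketch of the auto-scaling overlap rule: when the user
# changes --chunk-tokens but leaves --chunk-overlap-tokens at its
# default, overlap grows with chunk size instead of staying at 50.
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def resolve_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Return the effective overlap for a given chunk size."""
    if chunk_tokens != DEFAULT_CHUNK_TOKENS and overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS:
        return max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_tokens // 10)
    return overlap_tokens  # an explicitly set overlap always wins

print(resolve_overlap(512, 50))   # defaults untouched -> 50
print(resolve_overlap(2048, 50))  # auto-scaled -> max(50, 204) = 204
print(resolve_overlap(2048, 30))  # explicit overlap wins -> 30
```

Note the floor: for small non-default chunk sizes (e.g. 256), `chunk_tokens // 10` is below 50, so the overlap stays at 50.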

Code quality:
- Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor
  base class, removing ~150 lines of duplication from chroma/weaviate/pinecone
- Add Pinecone adaptor with full upload support (pinecone_adaptor.py)

Tests (14 new):
- chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag
- .docx/.doc/no-extension file validation, --var flag routing E2E
- Embedding method inheritance verification, backward-compatible flag aliases

Docs:
- Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH)
- Update README test count badge (1880+ → 2283+)

All 2283 tests passing, 8 skipped, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Author: yusyus
Date: 2026-02-28 21:57:59 +03:00
Parent: 3bad7cf365
Commit: 064405c052
41 changed files with 1864 additions and 237 deletions

@@ -309,6 +309,15 @@ package_path = adaptor.package(
 )
 ```
+
+#### Shared Embedding Methods
+
+The base `SkillAdaptor` class provides two shared embedding methods inherited by all vector database adaptors (chroma, weaviate, pinecone):
+
+- `_generate_openai_embeddings(texts, model)` -- Generate embeddings via the OpenAI API.
+- `_generate_st_embeddings(texts, model)` -- Generate embeddings using a local sentence-transformers model.
+
+These methods are available on any adaptor instance returned by `get_adaptor()` for vector database targets, so you do not need to implement embedding logic per-adaptor.
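A rough sketch of the inheritance pattern this section describes — the base class owns the embedding helpers and concrete adaptors inherit them. All class bodies below are illustrative assumptions, stubbed so the sketch runs without external dependencies; they are not the project's actual implementation:

```python
# Hypothetical sketch: shared embedding helpers live on the base class,
# so vector-DB adaptors inherit them instead of duplicating ~150 lines.
class SkillAdaptor:
    def _generate_st_embeddings(self, texts, model="all-MiniLM-L6-v2"):
        # Real code would encode with sentence-transformers; stubbed here
        # to a trivial length-based "embedding" for illustration.
        return [[float(len(t)), 0.0] for t in texts]

class ChromaAdaptor(SkillAdaptor):
    pass  # inherits the embedding helpers; no per-adaptor copy

class PineconeAdaptor(SkillAdaptor):
    pass

vecs = PineconeAdaptor()._generate_st_embeddings(["hello", "world!"])
print(vecs)  # [[5.0, 0.0], [6.0, 0.0]]
```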
---
### 6. Skill Upload API


@@ -620,7 +620,8 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--batch-size` | 100 | Chunks per batch |
 | | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
-| | `--no-preserve-code` | | Allow code block splitting |
+| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| | `--no-preserve-code-blocks` | | Allow code block splitting |
 **Supported Platforms:**


@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
 | `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
-| `--no-preserve-code` | - | Allow splitting code blocks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
+> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at its default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
 ---


@@ -598,7 +598,8 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--batch-size` | 100 | Chunks per batch |
 | | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
-| | `--no-preserve-code` | | Allow code block splitting |
+| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| | `--no-preserve-code-blocks` | | Allow code block splitting |
 **Supported Platforms:**


@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
 | `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
-| `--no-preserve-code` | - | Allow splitting code blocks |
+| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
+> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at its default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
 ---