fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline
Bug fixes:
- Fix --var flag silently dropped in create routing (args.workflow_var → args.var)
- Fix double _score_code_quality() call in word scraper
- Add .docx file extension validation in WordToSkillConverter
- Fix weaviate ImportError masked by generic Exception handler
- Fix RAG chunking crash using non-existent converter.output_dir

Chunking pipeline improvements:
- Wire --chunk-overlap-tokens through entire package pipeline (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker)
- Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default
- Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept)
- Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12 concrete adaptors, rag_chunker, base, and package_skill

Code quality:
- Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor base class, removing ~150 lines of duplication from chroma/weaviate/pinecone
- Add Pinecone adaptor with full upload support (pinecone_adaptor.py)

Tests (14 new):
- chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag
- .docx/.doc/no-extension file validation, --var flag routing E2E
- Embedding method inheritance verification, backward-compatible flag aliases

Docs:
- Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH)
- Update README test count badge (1880+ → 2283+)

All 2283 tests passing, 8 skipped, 0 failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
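The auto-scaling overlap rule described above can be sketched in isolation (a minimal sketch; `effective_overlap` is an illustrative free function mirroring the logic this commit wires into `_maybe_chunk_content`):

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def effective_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    # Overlap only auto-scales when the user customized the chunk size
    # but left the overlap at its default: 10% of chunk size, min 50.
    if overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS and chunk_tokens != DEFAULT_CHUNK_TOKENS:
        return max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_tokens // 10)
    return overlap_tokens

print(effective_overlap(2048, 50))  # → 204 (auto-scaled)
print(effective_overlap(512, 50))   # → 50  (both defaults, untouched)
print(effective_overlap(2048, 80))  # → 80  (explicit overlap wins)
```

An explicitly passed overlap is never overridden, which keeps the flag's behavior predictable for callers who set both values.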
CHANGELOG.md | 10
@@ -22,6 +22,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **`docx` optional dependency group** — `pip install skill-seekers[docx]` (mammoth + python-docx)
 
 ### Fixed
 
+- **`--var` flag silently dropped in `create` routing** — `main.py` checked `args.workflow_var` but argparse stores the flag as `args.var`. Workflow variable overrides via `--var KEY=VALUE` were silently ignored. Fixed to read `args.var`.
+- **Double `_score_code_quality()` call in word scraper** — `word_scraper.py` called `_score_code_quality(raw_text)` twice for every code-like paragraph (once to check the threshold, once to assign). Consolidated to a single call.
+- **`.docx` file extension validation** — `WordToSkillConverter` now validates that the file has a `.docx` extension before attempting to parse. Non-`.docx` files (`.doc`, `.txt`, no extension) raise `ValueError` with a clear message instead of cryptic parse errors.
+- **`--no-preserve-code` renamed to `--no-preserve-code-blocks`** — The flag name now matches the parameter it controls (`preserve_code_blocks`). The backward-compatible alias `--no-preserve-code` is kept (hidden, to be removed in v4.0.0).
+- **`--chunk-overlap-tokens` missing from `package` command** — The flag was defined in `create` and `scrape` but not `package`. Added to `PACKAGE_ARGUMENTS` and wired through `package_skill()` → `adaptor.package()` → `format_skill_md()` → `_maybe_chunk_content()` → `RAGChunker`.
+- **Chunk overlap auto-scaling** — When `--chunk-tokens` is non-default but `--chunk-overlap-tokens` is default, the overlap now auto-scales to `max(50, chunk_tokens // 10)` for better context preservation with large chunks.
+- **Weaviate `ImportError` masked by generic handler** — `upload()` caught `Exception` before `ImportError`, so a missing `sentence-transformers` produced a generic "Upload failed" message instead of the specific install instruction. Added `except ImportError` before `except Exception`.
+- **Hardcoded chunk defaults in 12 adaptors** — All concrete adaptors (claude, gemini, openai, markdown, langchain, llama_index, haystack, chroma, faiss, qdrant, weaviate, pinecone) used hardcoded `512`/`50` for chunk token/overlap defaults. Replaced with the `DEFAULT_CHUNK_TOKENS` and `DEFAULT_CHUNK_OVERLAP_TOKENS` constants from `arguments/common.py`.
 - **RAG chunking crash (`AttributeError: output_dir`)** — `execute_scraping_and_building()` used `converter.output_dir`, which doesn't exist on `DocToSkillConverter`. Changed to `Path(converter.skill_dir)`. Affected the `--chunk-for-rag` flag on the `scrape` command.
 - **Issue #301: `setup.sh` fails on macOS with mismatched Python/pip** — `pip3` can point to a different Python than `python3` (e.g. pip3 → 3.9, python3 → 3.14), causing "no matching distribution" errors. Changed `setup.sh` to use `python3 -m pip` instead of bare `pip3` to guarantee the correct interpreter.
 - **Issue #300: Selector fallback & dry-run link discovery** — `create https://reactflow.dev/` now finds 20+ pages (was 1). Root causes:
@@ -45,6 +53,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Language detector method** — Fixed `detect_from_text` → `detect_from_code` in word scraper
 
 ### Changed
 
+- **Shared embedding methods consolidated to base class** — `_generate_openai_embeddings()` and `_generate_st_embeddings()` moved from the chroma/weaviate/pinecone adaptors into the `SkillAdaptor` base class. All three adaptors now inherit these methods, eliminating ~150 lines of duplicated code.
+- **Chunk constants centralized** — Added `DEFAULT_CHUNK_TOKENS = 512` and `DEFAULT_CHUNK_OVERLAP_TOKENS = 50` in `arguments/common.py`. Used across `rag_chunker.py`, `base.py`, `package_skill.py`, `create_command.py`, and all 12 concrete adaptors. No more magic numbers for chunk defaults.
 - **Enhancement summarizer architecture** — Character-budget approach respects `target_ratio` for both code blocks and heading chunks, replacing hard limits with proportional allocation
 
 ## [3.1.3] - 2026-02-24
@@ -10,7 +10,7 @@ English | [简体中文](https://github.com/yusufkaraaslan/Skill_Seekers/blob/ma
 [](https://opensource.org/licenses/MIT)
 [](https://www.python.org/downloads/)
 [](https://modelcontextprotocol.io)
-[](tests/)
+[](tests/)
 [](https://github.com/users/yusufkaraaslan/projects/2)
 [](https://pypi.org/project/skill-seekers/)
 [](https://pypi.org/project/skill-seekers/)
@@ -309,6 +309,15 @@ package_path = adaptor.package(
 )
 ```
 
+#### Shared Embedding Methods
+
+The base `SkillAdaptor` class provides two shared embedding methods inherited by all vector database adaptors (chroma, weaviate, pinecone):
+
+- `_generate_openai_embeddings(texts, model)` -- Generate embeddings via the OpenAI API.
+- `_generate_st_embeddings(texts, model)` -- Generate embeddings using a local sentence-transformers model.
+
+These methods are available on any adaptor instance returned by `get_adaptor()` for vector database targets, so you do not need to implement embedding logic per-adaptor.
+
 ---
 
 ### 6. Skill Upload API
@@ -620,7 +620,8 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--batch-size` | 100 | Chunks per batch |
 | | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
-| | `--no-preserve-code` | | Allow code block splitting |
+| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| | `--no-preserve-code-blocks` | | Allow code block splitting |
 
 **Supported Platforms:**
@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
 | `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
 | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
-| `--no-preserve-code` | - | Allow splitting code blocks |
+| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
 
+> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at the default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
+
 ---
@@ -598,7 +598,8 @@ skill-seekers package SKILL_DIRECTORY [options]
 | | `--batch-size` | 100 | Chunks per batch |
 | | `--chunk-for-rag` | | Enable RAG chunking |
 | | `--chunk-tokens` | 512 | Max tokens per chunk |
-| | `--no-preserve-code` | | Allow code block splitting |
+| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
+| | `--no-preserve-code-blocks` | | Allow code block splitting |
 
 **Supported Platforms:**
@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
 | `--chunk-for-rag` | auto | Enable chunking |
 | `--chunk-tokens` | 512 | Tokens per chunk |
 | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
-| `--no-preserve-code` | - | Allow splitting code blocks |
+| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
 
+> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` stays at the default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
+
 ---
@@ -128,10 +128,15 @@ sentence-transformers = [
     "sentence-transformers>=2.2.0",
 ]
 
+pinecone = [
+    "pinecone>=5.0.0",
+]
+
 rag-upload = [
     "chromadb>=0.4.0",
     "weaviate-client>=3.25.0",
     "sentence-transformers>=2.2.0",
+    "pinecone>=5.0.0",
 ]
 
 # All cloud storage providers combined
@@ -167,6 +172,7 @@ all = [
     "azure-storage-blob>=12.19.0",
     "chromadb>=0.4.0",
     "weaviate-client>=3.25.0",
+    "pinecone>=5.0.0",
     "fastapi>=0.109.0",
     "sentence-transformers>=2.3.0",
     "numpy>=1.24.0",
@@ -64,6 +64,11 @@ try:
 except ImportError:
     HaystackAdaptor = None
 
+try:
+    from .pinecone_adaptor import PineconeAdaptor
+except ImportError:
+    PineconeAdaptor = None
+
 
 # Registry of available adaptors
 ADAPTORS: dict[str, type[SkillAdaptor]] = {}
@@ -91,6 +96,8 @@ if QdrantAdaptor:
     ADAPTORS["qdrant"] = QdrantAdaptor
 if HaystackAdaptor:
     ADAPTORS["haystack"] = HaystackAdaptor
+if PineconeAdaptor:
+    ADAPTORS["pinecone"] = PineconeAdaptor
 
 
 def get_adaptor(platform: str, config: dict = None) -> SkillAdaptor:
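The optional-import registry pattern extended in the hunk above can be exercised with stand-in classes (a sketch; the class and dict names here are illustrative stand-ins, not the real adaptor imports):

```python
# Stand-in for the try/except ImportError registration pattern: adaptors
# whose optional dependency failed to import resolve to None and are
# simply skipped when building the registry.
ADAPTORS: dict[str, type] = {}

class PineconeAdaptor:  # stand-in; the real class lives in pinecone_adaptor.py
    pass

HaystackAdaptor = None  # simulates a failed optional import

for name, cls in [("pinecone", PineconeAdaptor), ("haystack", HaystackAdaptor)]:
    if cls:
        ADAPTORS[name] = cls

print(sorted(ADAPTORS))  # → ['pinecone']
```

This keeps `get_adaptor()` free of per-platform import logic: an unavailable backend is absent from the registry rather than raising at lookup time.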
@@ -11,6 +11,8 @@ from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Any
 
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 
 @dataclass
 class SkillMetadata:
@@ -19,6 +21,7 @@ class SkillMetadata:
     name: str
     description: str
     version: str = "1.0.0"
+    doc_version: str = ""  # Documentation version (e.g., "16.2") for RAG metadata filtering
     author: str | None = None
     tags: list[str] = field(default_factory=list)
 
@@ -73,8 +76,9 @@ class SkillAdaptor(ABC):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill for platform (ZIP, tar.gz, etc.).
@@ -228,6 +232,47 @@ class SkillAdaptor(ABC):
 
         return skill_md_path.read_text(encoding="utf-8")
 
+    def _read_frontmatter(self, skill_dir: Path) -> dict[str, str]:
+        """Read YAML frontmatter from SKILL.md.
+
+        Args:
+            skill_dir: Path to skill directory
+
+        Returns:
+            Dict of key-value pairs from the frontmatter block.
+        """
+        content = self._read_skill_md(skill_dir)
+        if content.startswith("---"):
+            parts = content.split("---", 2)
+            if len(parts) >= 3:
+                frontmatter: dict[str, str] = {}
+                for line in parts[1].strip().splitlines():
+                    if ":" in line:
+                        key, _, value = line.partition(":")
+                        frontmatter[key.strip()] = value.strip()
+                return frontmatter
+        return {}
+
+    def _build_skill_metadata(self, skill_dir: Path) -> SkillMetadata:
+        """Build SkillMetadata from SKILL.md frontmatter.
+
+        Reads name, description, version, and doc_version from frontmatter
+        instead of using hardcoded defaults.
+
+        Args:
+            skill_dir: Path to skill directory
+
+        Returns:
+            SkillMetadata populated from frontmatter values.
+        """
+        fm = self._read_frontmatter(skill_dir)
+        return SkillMetadata(
+            name=skill_dir.name,
+            description=fm.get("description", f"Documentation for {skill_dir.name}"),
+            version=fm.get("version", "1.0.0"),
+            doc_version=fm.get("doc_version", ""),
+        )
+
     def _iterate_references(self, skill_dir: Path):
         """
         Iterate over all reference files in skill directory.
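The frontmatter parsing added in the hunk above can be exercised standalone (a sketch; `parse_frontmatter` is an illustrative free-function version of `_read_frontmatter`, and the SKILL.md sample is made up):

```python
def parse_frontmatter(content: str) -> dict[str, str]:
    # Mirrors the _read_frontmatter logic: a leading "---" block of
    # "key: value" lines is parsed into a dict; anything else yields {}.
    if content.startswith("---"):
        parts = content.split("---", 2)
        if len(parts) >= 3:
            fm: dict[str, str] = {}
            for line in parts[1].strip().splitlines():
                if ":" in line:
                    key, _, value = line.partition(":")
                    fm[key.strip()] = value.strip()
            return fm
    return {}

skill_md = """---
description: React Flow docs
version: 2.1.0
---
# React Flow
"""
print(parse_frontmatter(skill_md)["version"])  # → 2.1.0
```

Note the deliberate simplicity: `str.partition(":")` splits on the first colon only, so values containing colons (URLs, times) survive intact, while a document with no leading `---` falls through to the empty dict.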
@@ -266,6 +311,7 @@ class SkillAdaptor(ABC):
         base_meta = {
             "source": metadata.name,
             "version": metadata.version,
+            "doc_version": metadata.doc_version,
             "description": metadata.description,
         }
         if metadata.author:
@@ -280,9 +326,10 @@ class SkillAdaptor(ABC):
         content: str,
         metadata: dict,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
         source_file: str = None,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> list[tuple[str, dict]]:
         """
         Optionally chunk content for RAG platforms.
@@ -321,9 +368,15 @@ class SkillAdaptor(ABC):
             return [(content, metadata)]
 
         # RAGChunker uses TOKENS (it converts to chars internally)
+        # If overlap is at the default value but chunk size was customized,
+        # scale overlap proportionally (10% of chunk size, min DEFAULT_CHUNK_OVERLAP_TOKENS)
+        effective_overlap = chunk_overlap_tokens
+        if chunk_overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS and chunk_max_tokens != DEFAULT_CHUNK_TOKENS:
+            effective_overlap = max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_max_tokens // 10)
+
         chunker = RAGChunker(
             chunk_size=chunk_max_tokens,
-            chunk_overlap=max(50, chunk_max_tokens // 10),  # 10% overlap
+            chunk_overlap=effective_overlap,
             preserve_code_blocks=preserve_code_blocks,
             preserve_paragraphs=True,
             min_chunk_size=100,  # 100 tokens minimum
@@ -433,6 +486,69 @@ class SkillAdaptor(ABC):
             # Plain hex digest
             return hash_hex
 
+    def _generate_openai_embeddings(
+        self, documents: list[str], api_key: str | None = None
+    ) -> list[list[float]]:
+        """Generate embeddings using OpenAI text-embedding-3-small.
+
+        Args:
+            documents: List of document texts
+            api_key: OpenAI API key (or uses OPENAI_API_KEY env var)
+
+        Returns:
+            List of embedding vectors
+        """
+        import os
+
+        try:
+            from openai import OpenAI
+        except ImportError:
+            raise ImportError("openai not installed. Run: pip install openai") from None
+
+        api_key = api_key or os.getenv("OPENAI_API_KEY")
+        if not api_key:
+            raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")
+
+        client = OpenAI(api_key=api_key)
+        embeddings: list[list[float]] = []
+        batch_size = 100
+
+        print(f"   Generating OpenAI embeddings for {len(documents)} documents...")
+
+        for i in range(0, len(documents), batch_size):
+            batch = documents[i : i + batch_size]
+            try:
+                response = client.embeddings.create(
+                    input=batch, model="text-embedding-3-small"
+                )
+                embeddings.extend([item.embedding for item in response.data])
+                print(f"   ✓ Embedded {min(i + batch_size, len(documents))}/{len(documents)}")
+            except Exception as e:
+                raise Exception(f"OpenAI embedding generation failed: {e}") from e
+
+        return embeddings
+
+    def _generate_st_embeddings(self, documents: list[str]) -> list[list[float]]:
+        """Generate embeddings using sentence-transformers (all-MiniLM-L6-v2).
+
+        Args:
+            documents: List of document texts
+
+        Returns:
+            List of embedding vectors
+        """
+        try:
+            from sentence_transformers import SentenceTransformer
+        except ImportError:
+            raise ImportError(
+                "sentence-transformers not installed. Run: pip install sentence-transformers"
+            ) from None
+
+        print(f"   Generating sentence-transformer embeddings for {len(documents)} documents...")
+        model = SentenceTransformer("all-MiniLM-L6-v2")
+        embeddings = model.encode(documents, show_progress_bar=True)
+        return [emb.tolist() for emb in embeddings]
+
     def _generate_toc(self, skill_dir: Path) -> str:
         """
         Helper to generate table of contents from references.
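The 100-document batching loop in `_generate_openai_embeddings` above is the standard slicing pattern; a minimal standalone sketch (`batches` is an illustrative helper, not a function in the codebase):

```python
def batches(items: list[str], batch_size: int = 100):
    # Yield consecutive slices of at most batch_size items, as the
    # embedding loop does with documents[i : i + batch_size].
    for i in range(0, len(items), batch_size):
        yield items[i : i + batch_size]

docs = [f"doc-{n}" for n in range(250)]
sizes = [len(b) for b in batches(docs)]
print(sizes)  # → [100, 100, 50]
```

Slicing past the end of a list is safe in Python, so the final short batch needs no special-casing.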
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class ChromaAdaptor(SkillAdaptor):
@@ -79,6 +80,7 @@ class ChromaAdaptor(SkillAdaptor):
             "file": "SKILL.md",
             "type": "documentation",
             "version": metadata.version,
+            "doc_version": metadata.doc_version,
         }
 
         # Chunk if enabled
@@ -86,9 +88,10 @@ class ChromaAdaptor(SkillAdaptor):
             content,
             doc_metadata,
             enable_chunking=enable_chunking,
-            chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+            chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
             preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
             source_file="SKILL.md",
+            chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
         )
 
         # Add all chunks to parallel arrays
@@ -109,6 +112,7 @@ class ChromaAdaptor(SkillAdaptor):
                 "file": ref_file.name,
                 "type": "reference",
                 "version": metadata.version,
+                "doc_version": metadata.doc_version,
             }
 
             # Chunk if enabled
@@ -116,9 +120,10 @@ class ChromaAdaptor(SkillAdaptor):
                 ref_content,
                 doc_metadata,
                 enable_chunking=enable_chunking,
-                chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                 preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                 source_file=ref_file.name,
+                chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
             )
 
             # Add all chunks to parallel arrays
@@ -144,8 +149,9 @@ class ChromaAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for Chroma.
@@ -166,12 +172,8 @@ class ChromaAdaptor(SkillAdaptor):
         output_path = self._format_output_path(skill_dir, Path(output_path), "-chroma.json")
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
-        # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"Chroma collection data for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate Chroma data
         chroma_json = self.format_skill_md(
@@ -180,6 +182,7 @@ class ChromaAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file
@@ -206,7 +209,7 @@ class ChromaAdaptor(SkillAdaptor):
 
         return output_path
 
-    def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]:
+    def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
         """
         Upload packaged skill to ChromaDB.
 
@@ -250,9 +253,7 @@ class ChromaAdaptor(SkillAdaptor):
         print(f"🌐 Connecting to ChromaDB at: {chroma_url}")
         # Parse URL
         if "://" in chroma_url:
-            parts = chroma_url.split("://")
-            parts[0]
-            host_port = parts[1]
+            _scheme, host_port = chroma_url.split("://", 1)
         else:
             host_port = chroma_url
 
@@ -352,52 +353,6 @@ class ChromaAdaptor(SkillAdaptor):
         except Exception as e:
             return {"success": False, "message": f"Upload failed: {e}"}
 
-    def _generate_openai_embeddings(
-        self, documents: list[str], api_key: str = None
-    ) -> list[list[float]]:
-        """
-        Generate embeddings using OpenAI API.
-
-        Args:
-            documents: List of document texts
-            api_key: OpenAI API key (or uses OPENAI_API_KEY env var)
-
-        Returns:
-            List of embedding vectors
-        """
-        import os
-
-        try:
-            from openai import OpenAI
-        except ImportError:
-            raise ImportError("openai not installed. Run: pip install openai") from None
-
-        api_key = api_key or os.getenv("OPENAI_API_KEY")
-        if not api_key:
-            raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")
-
-        client = OpenAI(api_key=api_key)
-
-        # Batch process (OpenAI allows up to 2048 inputs)
-        embeddings = []
-        batch_size = 100
-
-        print(f"   Generating embeddings for {len(documents)} documents...")
-
-        for i in range(0, len(documents), batch_size):
-            batch = documents[i : i + batch_size]
-            try:
-                response = client.embeddings.create(
-                    input=batch,
-                    model="text-embedding-3-small",  # Cheapest, fastest
-                )
-                embeddings.extend([item.embedding for item in response.data])
-                print(f"   ✓ Processed {min(i + batch_size, len(documents))}/{len(documents)}")
-            except Exception as e:
-                raise Exception(f"OpenAI embedding generation failed: {e}") from e
-
-        return embeddings
-
     def validate_api_key(self, _api_key: str) -> bool:
         """
         Chroma format doesn't use API keys for packaging.
@@ -12,6 +12,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class ClaudeAdaptor(SkillAdaptor):
@@ -86,8 +87,9 @@ version: {metadata.version}
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into ZIP file for Claude.
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class FAISSHelpers(SkillAdaptor):
@@ -81,6 +82,7 @@ class FAISSHelpers(SkillAdaptor):
             "file": "SKILL.md",
             "type": "documentation",
             "version": metadata.version,
+            "doc_version": metadata.doc_version,
         }
 
         # Chunk if enabled
@@ -88,9 +90,10 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
content,
|
content,
|
||||||
doc_metadata,
|
doc_metadata,
|
||||||
enable_chunking=enable_chunking,
|
enable_chunking=enable_chunking,
|
||||||
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
|
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
|
||||||
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
|
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
|
||||||
source_file="SKILL.md",
|
source_file="SKILL.md",
|
||||||
|
chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Add all chunks to parallel arrays
|
# Add all chunks to parallel arrays
|
||||||
@@ -110,6 +113,7 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
"file": ref_file.name,
|
"file": ref_file.name,
|
||||||
"type": "reference",
|
"type": "reference",
|
||||||
"version": metadata.version,
|
"version": metadata.version,
|
||||||
|
"doc_version": metadata.doc_version,
|
||||||
}
|
}
|
||||||
|
|
||||||
# Chunk if enabled
|
# Chunk if enabled
|
||||||
@@ -117,9 +121,10 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
ref_content,
|
ref_content,
|
||||||
doc_metadata,
|
doc_metadata,
|
||||||
enable_chunking=enable_chunking,
|
enable_chunking=enable_chunking,
|
||||||
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
|
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
|
||||||
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
|
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
|
||||||
source_file=ref_file.name,
|
source_file=ref_file.name,
|
||||||
|
chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
|
||||||
)
|
)
|
||||||
|
|
||||||
# Add all chunks to parallel arrays
|
# Add all chunks to parallel arrays
|
||||||
@@ -155,8 +160,9 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
skill_dir: Path,
|
skill_dir: Path,
|
||||||
output_path: Path,
|
output_path: Path,
|
||||||
enable_chunking: bool = False,
|
enable_chunking: bool = False,
|
||||||
chunk_max_tokens: int = 512,
|
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
|
||||||
preserve_code_blocks: bool = True,
|
preserve_code_blocks: bool = True,
|
||||||
|
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
|
||||||
) -> Path:
|
) -> Path:
|
||||||
"""
|
"""
|
||||||
Package skill into JSON file for FAISS.
|
Package skill into JSON file for FAISS.
|
||||||
@@ -176,12 +182,8 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
output_path = self._format_output_path(skill_dir, Path(output_path), "-faiss.json")
|
output_path = self._format_output_path(skill_dir, Path(output_path), "-faiss.json")
|
||||||
output_path.parent.mkdir(parents=True, exist_ok=True)
|
output_path.parent.mkdir(parents=True, exist_ok=True)
|
||||||
|
|
||||||
# Read metadata
|
# Read metadata from SKILL.md frontmatter
|
||||||
metadata = SkillMetadata(
|
metadata = self._build_skill_metadata(skill_dir)
|
||||||
name=skill_dir.name,
|
|
||||||
description=f"FAISS data for {skill_dir.name}",
|
|
||||||
version="1.0.0",
|
|
||||||
)
|
|
||||||
|
|
||||||
# Generate FAISS data
|
# Generate FAISS data
|
||||||
faiss_json = self.format_skill_md(
|
faiss_json = self.format_skill_md(
|
||||||
@@ -190,6 +192,7 @@ class FAISSHelpers(SkillAdaptor):
|
|||||||
enable_chunking=enable_chunking,
|
enable_chunking=enable_chunking,
|
||||||
chunk_max_tokens=chunk_max_tokens,
|
chunk_max_tokens=chunk_max_tokens,
|
||||||
preserve_code_blocks=preserve_code_blocks,
|
preserve_code_blocks=preserve_code_blocks,
|
||||||
|
chunk_overlap_tokens=chunk_overlap_tokens,
|
||||||
)
|
)
|
||||||
|
|
||||||
# Write to file
|
# Write to file
|
||||||
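Every adaptor in this diff threads the chunking options the same way: `package()` takes the module-level constants as signature defaults, while `format_skill_md()` recovers the same constants from `**kwargs`. A minimal self-contained sketch of that default-threading pattern — the constant values (512/50 match the hardcoded defaults this diff replaces) and the toy chunker standing in for `_maybe_chunk_content` are illustrative, not the project's real implementation:

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def maybe_chunk(tokens: list, **kwargs) -> list:
    # Same lookup pattern as the adaptors: an explicit kwarg wins,
    # otherwise the shared module-level default applies.
    size = kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS)
    overlap = kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS)
    step = max(1, size - overlap)
    # Slide a window of `size` tokens forward by `size - overlap` each time,
    # so adjacent chunks share exactly `overlap` tokens.
    return [tokens[i:i + size] for i in range(0, len(tokens), step)]

chunks = maybe_chunk(list(range(1200)), chunk_max_tokens=500, chunk_overlap_tokens=100)
```

Calling `maybe_chunk` with no keyword arguments falls back to the shared constants, which is exactly what the `kwargs.get(..., DEFAULT_...)` calls in each `format_skill_md` do.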

@@ -13,6 +13,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class GeminiAdaptor(SkillAdaptor):
@@ -91,8 +92,9 @@ See the references directory for complete documentation with examples and best p
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into tar.gz file for Gemini.

@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class HaystackAdaptor(SkillAdaptor):
@@ -62,6 +63,7 @@ class HaystackAdaptor(SkillAdaptor):
                     "file": "SKILL.md",
                     "type": "documentation",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -69,9 +71,10 @@ class HaystackAdaptor(SkillAdaptor):
                     content,
                     doc_meta,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file="SKILL.md",
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks as documents
@@ -95,6 +98,7 @@ class HaystackAdaptor(SkillAdaptor):
                     "file": ref_file.name,
                     "type": "reference",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -102,9 +106,10 @@ class HaystackAdaptor(SkillAdaptor):
                     ref_content,
                     doc_meta,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file=ref_file.name,
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks as documents
@@ -124,8 +129,9 @@ class HaystackAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for Haystack.
@@ -147,11 +153,8 @@ class HaystackAdaptor(SkillAdaptor):
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
         # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"Haystack documents for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate Haystack documents
         documents_json = self.format_skill_md(
@@ -160,6 +163,7 @@ class HaystackAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file

@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class LangChainAdaptor(SkillAdaptor):
@@ -62,6 +63,7 @@ class LangChainAdaptor(SkillAdaptor):
                     "file": "SKILL.md",
                     "type": "documentation",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -69,9 +71,10 @@ class LangChainAdaptor(SkillAdaptor):
                     content,
                     doc_metadata,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file="SKILL.md",
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks to documents
@@ -90,6 +93,7 @@ class LangChainAdaptor(SkillAdaptor):
                     "file": ref_file.name,
                     "type": "reference",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -97,9 +101,10 @@ class LangChainAdaptor(SkillAdaptor):
                     ref_content,
                     doc_metadata,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file=ref_file.name,
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks to documents
@@ -114,8 +119,9 @@ class LangChainAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for LangChain.
@@ -139,12 +145,8 @@ class LangChainAdaptor(SkillAdaptor):
         output_path = self._format_output_path(skill_dir, Path(output_path), "-langchain.json")
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
-        # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"LangChain documents for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate LangChain documents with chunking
         documents_json = self.format_skill_md(
@@ -153,6 +155,7 @@ class LangChainAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file

@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class LlamaIndexAdaptor(SkillAdaptor):
@@ -77,6 +78,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
                     "file": "SKILL.md",
                     "type": "documentation",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -84,9 +86,10 @@ class LlamaIndexAdaptor(SkillAdaptor):
                     content,
                     node_metadata,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file="SKILL.md",
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks as nodes
@@ -112,6 +115,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
                     "file": ref_file.name,
                     "type": "reference",
                     "version": metadata.version,
+                    "doc_version": metadata.doc_version,
                 }
 
                 # Chunk if enabled
@@ -119,9 +123,10 @@ class LlamaIndexAdaptor(SkillAdaptor):
                     ref_content,
                     node_metadata,
                     enable_chunking=enable_chunking,
-                    chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                     preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                     source_file=ref_file.name,
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
                 )
 
                 # Add all chunks as nodes
@@ -143,8 +148,9 @@ class LlamaIndexAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for LlamaIndex.
@@ -166,11 +172,8 @@ class LlamaIndexAdaptor(SkillAdaptor):
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
         # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"LlamaIndex nodes for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate LlamaIndex nodes
         nodes_json = self.format_skill_md(
@@ -179,6 +182,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file

@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class MarkdownAdaptor(SkillAdaptor):
@@ -86,8 +87,9 @@ Browse the reference files for detailed information on each topic. All files are
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into ZIP file with markdown documentation.

@@ -12,6 +12,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class OpenAIAdaptor(SkillAdaptor):
@@ -108,8 +109,9 @@ Always prioritize accuracy by consulting the attached documentation files before
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into ZIP file for OpenAI Assistants.

src/skill_seekers/cli/adaptors/pinecone_adaptor.py (new file, +400 lines)
@@ -0,0 +1,400 @@
+#!/usr/bin/env python3
+"""
+Pinecone Adaptor
+
+Implements Pinecone vector database format for RAG pipelines.
+Converts Skill Seekers documentation into Pinecone-compatible format
+with namespace support and batch upsert.
+"""
+
+import json
+from pathlib import Path
+from typing import Any
+
+from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
+# Pinecone metadata value limit: 40 KB per vector
+PINECONE_METADATA_BYTES_LIMIT = 40_000
+
+
+class PineconeAdaptor(SkillAdaptor):
+    """
+    Pinecone vector database adaptor.
+
+    Handles:
+    - Pinecone-compatible vector format with metadata
+    - Namespace support for multi-tenant indexing
+    - Batch upsert (100 vectors per batch)
+    - OpenAI and sentence-transformers embedding generation
+    - Metadata truncation to stay within Pinecone's 40KB limit
+    """
+
+    PLATFORM = "pinecone"
+    PLATFORM_NAME = "Pinecone (Vector Database)"
+    DEFAULT_API_ENDPOINT = None
+
+    def _generate_id(self, content: str, metadata: dict) -> str:
+        """Generate deterministic ID from content and metadata."""
+        return self._generate_deterministic_id(content, metadata, format="hex")
+
+    def _truncate_text_for_metadata(self, text: str, max_bytes: int = PINECONE_METADATA_BYTES_LIMIT) -> str:
+        """Truncate text to fit within Pinecone's metadata byte limit.
+
+        Pinecone limits metadata to 40KB per vector. This truncates
+        the text field (largest metadata value) to stay within limits,
+        leaving room for other metadata fields (~1KB overhead).
+
+        Args:
+            text: Text content to potentially truncate
+            max_bytes: Maximum bytes for the text field
+
+        Returns:
+            Truncated text that fits within the byte limit
+        """
+        # Reserve ~2KB for other metadata fields
+        available = max_bytes - 2000
+        encoded = text.encode("utf-8")
+        if len(encoded) <= available:
+            return text
+        # Truncate at byte boundary, decode safely
+        truncated = encoded[:available].decode("utf-8", errors="ignore")
+        return truncated
+
+    def format_skill_md(
+        self, skill_dir: Path, metadata: SkillMetadata, enable_chunking: bool = False, **kwargs
+    ) -> str:
+        """
+        Format skill as JSON for Pinecone ingestion.
+
+        Creates a package with vectors ready for upsert:
+        {
+            "index_name": "...",
+            "namespace": "...",
+            "dimension": 1536,
+            "metric": "cosine",
+            "vectors": [
+                {
+                    "id": "hex-id",
+                    "metadata": {
+                        "text": "content",
+                        "source": "...",
+                        "category": "...",
+                        ...
+                    }
+                }
+            ]
+        }
+
+        No ``values`` field — embeddings are added at upload time.
+
+        Args:
+            skill_dir: Path to skill directory
+            metadata: Skill metadata
+            enable_chunking: Enable intelligent chunking for large documents
+            **kwargs: Additional chunking parameters
+
+        Returns:
+            JSON string containing Pinecone-compatible data
+        """
+        vectors: list[dict[str, Any]] = []
+
+        # Convert SKILL.md (main documentation)
+        skill_md_path = skill_dir / "SKILL.md"
+        if skill_md_path.exists():
+            content = self._read_existing_content(skill_dir)
+            if content.strip():
+                doc_metadata = {
+                    "source": metadata.name,
+                    "category": "overview",
+                    "file": "SKILL.md",
+                    "type": "documentation",
+                    "version": metadata.version,
+                    "doc_version": metadata.doc_version,
+                }
+
+                chunks = self._maybe_chunk_content(
+                    content,
+                    doc_metadata,
+                    enable_chunking=enable_chunking,
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
+                    preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
+                    source_file="SKILL.md",
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
+                )
+
+                for chunk_text, chunk_meta in chunks:
+                    vectors.append(
+                        {
+                            "id": self._generate_id(chunk_text, chunk_meta),
+                            "metadata": {
+                                **chunk_meta,
+                                "text": self._truncate_text_for_metadata(chunk_text),
+                            },
+                        }
+                    )
+
+        # Convert all reference files
+        for ref_file, ref_content in self._iterate_references(skill_dir):
+            if ref_content.strip():
+                category = ref_file.stem.replace("_", " ").lower()
+
+                doc_metadata = {
+                    "source": metadata.name,
+                    "category": category,
+                    "file": ref_file.name,
+                    "type": "reference",
+                    "version": metadata.version,
+                    "doc_version": metadata.doc_version,
+                }
+
+                chunks = self._maybe_chunk_content(
+                    ref_content,
+                    doc_metadata,
+                    enable_chunking=enable_chunking,
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
+                    preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
+                    source_file=ref_file.name,
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
+                )
+
+                for chunk_text, chunk_meta in chunks:
+                    vectors.append(
+                        {
+                            "id": self._generate_id(chunk_text, chunk_meta),
+                            "metadata": {
+                                **chunk_meta,
+                                "text": self._truncate_text_for_metadata(chunk_text),
+                            },
+                        }
+                    )
+
+        index_name = metadata.name.replace("_", "-").lower()
+
+        return json.dumps(
+            {
+                "index_name": index_name,
+                "namespace": index_name,
+                "dimension": 1536,
+                "metric": "cosine",
+                "vectors": vectors,
+            },
+            indent=2,
+            ensure_ascii=False,
+        )
+
+    def package(
+        self,
+        skill_dir: Path,
+        output_path: Path,
+        enable_chunking: bool = False,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
+        preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
+    ) -> Path:
+        """
+        Package skill into JSON file for Pinecone.
+
+        Creates a JSON file containing vectors with metadata, ready for
+        embedding generation and upsert to a Pinecone index.
+
+        Args:
+            skill_dir: Path to skill directory
+            output_path: Output path/filename for JSON file
+            enable_chunking: Enable intelligent chunking for large documents
+            chunk_max_tokens: Maximum tokens per chunk (default: 512)
+            preserve_code_blocks: Preserve code blocks during chunking
+            chunk_overlap_tokens: Token overlap between adjacent chunks
+
+        Returns:
+            Path to created JSON file
+        """
+        skill_dir = Path(skill_dir)
+
+        output_path = self._format_output_path(skill_dir, Path(output_path), "-pinecone.json")
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
+
+        pinecone_json = self.format_skill_md(
+            skill_dir,
+            metadata,
+            enable_chunking=enable_chunking,
+            chunk_max_tokens=chunk_max_tokens,
+            preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
+        )
+
+        output_path.write_text(pinecone_json, encoding="utf-8")
+
+        print("\n✅ Pinecone data packaged successfully!")
+        print(f"📦 Output: {output_path}")
+
+        data = json.loads(pinecone_json)
+        print(f"📊 Total vectors: {len(data['vectors'])}")
+        print(f"🗂️ Index name: {data['index_name']}")
+        print(f"📁 Namespace: {data['namespace']}")
+        print(f"📐 Default dimension: {data['dimension']} (auto-detected at upload time)")
+
+        # Show category breakdown
+        categories: dict[str, int] = {}
+        for vec in data["vectors"]:
+            cat = vec["metadata"].get("category", "unknown")
+            categories[cat] = categories.get(cat, 0) + 1
+
+        print("📁 Categories:")
+        for cat, count in sorted(categories.items()):
+            print(f"   - {cat}: {count}")
+
+        return output_path
+
+    def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
+        """
+        Upload packaged skill to Pinecone.
+
+        Args:
+            package_path: Path to packaged JSON
+            api_key: Pinecone API key (or uses PINECONE_API_KEY env var)
+            **kwargs:
+                index_name: Override index name from JSON
+                namespace: Override namespace from JSON
+                dimension: Embedding dimension (default: 1536)
+                metric: Distance metric (default: "cosine")
+                embedding_function: "openai" or "sentence-transformers"
+                cloud: Cloud provider (default: "aws")
+                region: Cloud region (default: "us-east-1")
+
+        Returns:
+            {"success": bool, "index": str, "namespace": str, "count": int}
+        """
+        import os
+
+        try:
+            from pinecone import Pinecone, ServerlessSpec
+        except ImportError:
+            return {
+                "success": False,
+                "message": "pinecone not installed. Run: pip install 'pinecone>=5.0.0'",
+            }
+
+        api_key = api_key or os.getenv("PINECONE_API_KEY")
+        if not api_key:
+            return {
+                "success": False,
+                "message": (
+                    "PINECONE_API_KEY not set. "
+                    "Set via env var or pass api_key parameter."
+                ),
+            }
+
+        # Load package
+        with open(package_path) as f:
+            data = json.load(f)
+
+        index_name = kwargs.get("index_name", data.get("index_name", "skill-docs"))
+        namespace = kwargs.get("namespace", data.get("namespace", ""))
+        metric = kwargs.get("metric", data.get("metric", "cosine"))
+        cloud = kwargs.get("cloud", "aws")
+        region = kwargs.get("region", "us-east-1")
+
+        # Auto-detect dimension from embedding model
+        embedding_function = kwargs.get("embedding_function", "openai")
+        EMBEDDING_DIMENSIONS = {
+            "openai": 1536,  # text-embedding-3-small
+            "sentence-transformers": 384,  # all-MiniLM-L6-v2
+        }
+        # Priority: explicit kwarg > model-based auto-detect > JSON file > fallback
+        # Note: format_skill_md() hardcodes dimension=1536 in the JSON, so we must
+        # give EMBEDDING_DIMENSIONS priority over the file to handle sentence-transformers (384).
+        dimension = kwargs.get(
+            "dimension",
+            EMBEDDING_DIMENSIONS.get(embedding_function, data.get("dimension", 1536)),
+        )
+
+        try:
+            # Generate embeddings FIRST — before creating the index.
+            # This avoids leaving an empty Pinecone index behind when
+            # embedding generation fails (e.g. missing API key).
+            texts = [vec["metadata"]["text"] for vec in data["vectors"]]
+
+            if embedding_function == "openai":
+                embeddings = self._generate_openai_embeddings(texts)
+            elif embedding_function == "sentence-transformers":
+                embeddings = self._generate_st_embeddings(texts)
+            else:
+                return {
+                    "success": False,
+                    "message": f"Unknown embedding_function: {embedding_function}. Use 'openai' or 'sentence-transformers'.",
+                }
+
+            pc = Pinecone(api_key=api_key)
+
+            # Create index if it doesn't exist
+            existing_indexes = [idx.name for idx in pc.list_indexes()]
+            if index_name not in existing_indexes:
+                print(f"🔧 Creating Pinecone index: {index_name} (dimension={dimension}, metric={metric})")
+                pc.create_index(
+                    name=index_name,
+                    dimension=dimension,
+                    metric=metric,
+                    spec=ServerlessSpec(cloud=cloud, region=region),
+                )
|
print(f"✅ Index '{index_name}' created")
|
||||||
|
else:
|
||||||
|
print(f"ℹ️ Using existing index: {index_name}")
|
||||||
|
|
||||||
|
index = pc.Index(index_name)
|
||||||
|
|
||||||
|
# Batch upsert (100 per batch — Pinecone recommendation)
|
||||||
|
batch_size = 100
|
||||||
|
vectors_to_upsert = []
|
||||||
|
for i, vec in enumerate(data["vectors"]):
|
||||||
|
vectors_to_upsert.append(
|
||||||
|
{
|
||||||
|
"id": vec["id"],
|
||||||
|
"values": embeddings[i],
|
||||||
|
"metadata": vec["metadata"],
|
||||||
|
}
|
||||||
|
)
|
||||||
|
|
||||||
|
total = len(vectors_to_upsert)
|
||||||
|
print(f"🔄 Upserting {total} vectors to Pinecone...")
|
||||||
|
|
||||||
|
for i in range(0, total, batch_size):
|
||||||
|
batch = vectors_to_upsert[i : i + batch_size]
|
||||||
|
index.upsert(vectors=batch, namespace=namespace)
|
||||||
|
print(f" ✓ Upserted {min(i + batch_size, total)}/{total}")
|
||||||
|
|
||||||
|
print(f"✅ Uploaded {total} vectors to Pinecone index '{index_name}'")
|
||||||
|
|
||||||
|
return {
|
||||||
|
"success": True,
|
||||||
|
"message": f"Uploaded {total} vectors to Pinecone index '{index_name}' (namespace: '{namespace}')",
|
||||||
|
"url": None,
|
||||||
|
"index": index_name,
|
||||||
|
"namespace": namespace,
|
||||||
|
"count": total,
|
||||||
|
}
|
||||||
|
|
||||||
|
except Exception as e:
|
||||||
|
return {"success": False, "message": f"Pinecone upload failed: {e}"}
|
||||||
|
|
||||||
|
def validate_api_key(self, _api_key: str) -> bool:
|
||||||
|
"""Pinecone doesn't need API key for packaging."""
|
||||||
|
return False
|
||||||
|
|
||||||
|
def get_env_var_name(self) -> str:
|
||||||
|
"""Return the expected env var for Pinecone API key."""
|
||||||
|
return "PINECONE_API_KEY"
|
||||||
|
|
||||||
|
def supports_enhancement(self) -> bool:
|
||||||
|
"""Pinecone format doesn't support AI enhancement."""
|
||||||
|
return False
|
||||||
|
|
||||||
|
def enhance(self, _skill_dir: Path, _api_key: str) -> bool:
|
||||||
|
"""Pinecone format doesn't support enhancement."""
|
||||||
|
print("❌ Pinecone format does not support enhancement")
|
||||||
|
print(" Enhance before packaging:")
|
||||||
|
print(" skill-seekers enhance output/skill/ --mode LOCAL")
|
||||||
|
print(" skill-seekers package output/skill/ --target pinecone")
|
||||||
|
return False
|
||||||
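The dimension resolution in `upload()` applies a fixed priority: explicit kwarg, then the model-based lookup, then the value stored in the packaged JSON, then 1536. A minimal standalone sketch of that rule; `resolve_dimension` is an illustrative helper, not part of the adaptor's API:

```python
# Sketch of the priority rule used above:
# explicit kwarg > model-based auto-detect > value in the packaged JSON > 1536.
EMBEDDING_DIMENSIONS = {
    "openai": 1536,                # text-embedding-3-small
    "sentence-transformers": 384,  # all-MiniLM-L6-v2
}


def resolve_dimension(kwargs: dict, data: dict) -> int:
    embedding_function = kwargs.get("embedding_function", "openai")
    return kwargs.get(
        "dimension",
        EMBEDDING_DIMENSIONS.get(embedding_function, data.get("dimension", 1536)),
    )


# The JSON's hardcoded 1536 loses to the model-based lookup, as the comment above requires.
print(resolve_dimension({"embedding_function": "sentence-transformers"}, {"dimension": 1536}))  # 384
print(resolve_dimension({"dimension": 768}, {}))  # 768
```

An unknown `embedding_function` falls through to the JSON value, which is why the explicit error return for unknown functions matters before embeddings are generated.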
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class QdrantAdaptor(SkillAdaptor):
@@ -76,6 +77,7 @@ class QdrantAdaptor(SkillAdaptor):
             "file": "SKILL.md",
             "type": "documentation",
             "version": metadata.version,
+            "doc_version": metadata.doc_version,
         }
 
         # Chunk if enabled
@@ -83,9 +85,10 @@ class QdrantAdaptor(SkillAdaptor):
             content,
             payload_meta,
             enable_chunking=enable_chunking,
-            chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+            chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
             preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
             source_file="SKILL.md",
+            chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
         )
 
         # Add all chunks as points
@@ -109,6 +112,7 @@ class QdrantAdaptor(SkillAdaptor):
                     "file": chunk_meta.get("file", "SKILL.md"),
                     "type": chunk_meta.get("type", "documentation"),
                     "version": chunk_meta.get("version", metadata.version),
+                    "doc_version": chunk_meta.get("doc_version", ""),
                 },
             }
         )
@@ -124,6 +128,7 @@ class QdrantAdaptor(SkillAdaptor):
                 "file": ref_file.name,
                 "type": "reference",
                 "version": metadata.version,
+                "doc_version": metadata.doc_version,
             }
 
             # Chunk if enabled
@@ -131,9 +136,10 @@ class QdrantAdaptor(SkillAdaptor):
                 ref_content,
                 payload_meta,
                 enable_chunking=enable_chunking,
-                chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                 preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                 source_file=ref_file.name,
+                chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
             )
 
             # Add all chunks as points
@@ -157,6 +163,7 @@ class QdrantAdaptor(SkillAdaptor):
                         "file": chunk_meta.get("file", ref_file.name),
                         "type": chunk_meta.get("type", "reference"),
                         "version": chunk_meta.get("version", metadata.version),
+                        "doc_version": chunk_meta.get("doc_version", ""),
                     },
                 }
             )
@@ -189,8 +196,9 @@ class QdrantAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for Qdrant.
@@ -211,11 +219,8 @@ class QdrantAdaptor(SkillAdaptor):
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
-        # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"Qdrant data for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate Qdrant data
         qdrant_json = self.format_skill_md(
@@ -224,6 +229,7 @@ class QdrantAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class WeaviateAdaptor(SkillAdaptor):
@@ -96,7 +97,14 @@ class WeaviateAdaptor(SkillAdaptor):
             {
                 "name": "version",
                 "dataType": ["text"],
-                "description": "Documentation version",
+                "description": "Skill package version",
+                "indexFilterable": True,
+                "indexSearchable": False,
+            },
+            {
+                "name": "doc_version",
+                "dataType": ["text"],
+                "description": "Documentation version (e.g., 16.2)",
                 "indexFilterable": True,
                 "indexSearchable": False,
             },
@@ -137,6 +145,7 @@ class WeaviateAdaptor(SkillAdaptor):
             "file": "SKILL.md",
             "type": "documentation",
             "version": metadata.version,
+            "doc_version": metadata.doc_version,
         }
 
         # Chunk if enabled
@@ -144,9 +153,10 @@ class WeaviateAdaptor(SkillAdaptor):
             content,
             obj_metadata,
             enable_chunking=enable_chunking,
-            chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+            chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
             preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
             source_file="SKILL.md",
+            chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
         )
 
         # Add all chunks as objects
@@ -161,6 +171,7 @@ class WeaviateAdaptor(SkillAdaptor):
                     "file": chunk_meta.get("file", "SKILL.md"),
                     "type": chunk_meta.get("type", "documentation"),
                     "version": chunk_meta.get("version", metadata.version),
+                    "doc_version": chunk_meta.get("doc_version", ""),
                 },
             }
         )
@@ -177,6 +188,7 @@ class WeaviateAdaptor(SkillAdaptor):
                 "file": ref_file.name,
                 "type": "reference",
                 "version": metadata.version,
+                "doc_version": metadata.doc_version,
             }
 
             # Chunk if enabled
@@ -184,9 +196,10 @@ class WeaviateAdaptor(SkillAdaptor):
                 ref_content,
                 obj_metadata,
                 enable_chunking=enable_chunking,
-                chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                 preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                 source_file=ref_file.name,
+                chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
             )
 
             # Add all chunks as objects
@@ -201,6 +214,7 @@ class WeaviateAdaptor(SkillAdaptor):
                     "file": chunk_meta.get("file", ref_file.name),
                     "type": chunk_meta.get("type", "reference"),
                     "version": chunk_meta.get("version", metadata.version),
+                    "doc_version": chunk_meta.get("doc_version", ""),
                 },
             }
         )
@@ -221,8 +235,9 @@ class WeaviateAdaptor(SkillAdaptor):
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into JSON file for Weaviate.
@@ -245,12 +260,8 @@ class WeaviateAdaptor(SkillAdaptor):
         output_path = self._format_output_path(skill_dir, Path(output_path), "-weaviate.json")
         output_path.parent.mkdir(parents=True, exist_ok=True)
 
-        # Read metadata
-        metadata = SkillMetadata(
-            name=skill_dir.name,
-            description=f"Weaviate objects for {skill_dir.name}",
-            version="1.0.0",
-        )
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
 
         # Generate Weaviate objects
         weaviate_json = self.format_skill_md(
@@ -259,6 +270,7 @@ class WeaviateAdaptor(SkillAdaptor):
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
 
         # Write to file
@@ -288,7 +300,7 @@ class WeaviateAdaptor(SkillAdaptor):
 
         return output_path
 
-    def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]:
+    def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
         """
         Upload packaged skill to Weaviate.
 
@@ -382,31 +394,20 @@ class WeaviateAdaptor(SkillAdaptor):
                     print(f"   ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
 
         elif embedding_function == "sentence-transformers":
-            # Use sentence-transformers
-            print("🔄 Generating sentence-transformer embeddings and uploading...")
-            try:
-                from sentence_transformers import SentenceTransformer
-
-                model = SentenceTransformer("all-MiniLM-L6-v2")
-                contents = [obj["properties"]["content"] for obj in data["objects"]]
-                embeddings = model.encode(contents, show_progress_bar=True).tolist()
-
-                for i, obj in enumerate(data["objects"]):
-                    batch.add_data_object(
-                        data_object=obj["properties"],
-                        class_name=data["class_name"],
-                        uuid=obj["id"],
-                        vector=embeddings[i],
-                    )
-
-                    if (i + 1) % 100 == 0:
-                        print(f"   ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
-
-            except ImportError:
-                return {
-                    "success": False,
-                    "message": "sentence-transformers not installed. Run: pip install sentence-transformers",
-                }
+            # Use sentence-transformers (via shared base method)
+            contents = [obj["properties"]["content"] for obj in data["objects"]]
+            embeddings = self._generate_st_embeddings(contents)
+
+            for i, obj in enumerate(data["objects"]):
+                batch.add_data_object(
+                    data_object=obj["properties"],
+                    class_name=data["class_name"],
+                    uuid=obj["id"],
+                    vector=embeddings[i],
+                )
+
+                if (i + 1) % 100 == 0:
+                    print(f"   ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
 
         else:
             # No embeddings - Weaviate will use its configured vectorizer
@@ -427,61 +428,16 @@ class WeaviateAdaptor(SkillAdaptor):
             return {
                 "success": True,
                 "message": f"Uploaded {count} objects to Weaviate class '{data['class_name']}'",
+                "url": None,
                 "class_name": data["class_name"],
                 "count": count,
            }
 
+        except ImportError as e:
+            return {"success": False, "message": str(e)}
         except Exception as e:
             return {"success": False, "message": f"Upload failed: {e}"}
 
-    def _generate_openai_embeddings(
-        self, documents: list[str], api_key: str = None
-    ) -> list[list[float]]:
-        """
-        Generate embeddings using OpenAI API.
-
-        Args:
-            documents: List of document texts
-            api_key: OpenAI API key (or uses OPENAI_API_KEY env var)
-
-        Returns:
-            List of embedding vectors
-        """
-        import os
-
-        try:
-            from openai import OpenAI
-        except ImportError:
-            raise ImportError("openai not installed. Run: pip install openai") from None
-
-        api_key = api_key or os.getenv("OPENAI_API_KEY")
-        if not api_key:
-            raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")
-
-        client = OpenAI(api_key=api_key)
-
-        # Batch process (OpenAI allows up to 2048 inputs)
-        embeddings = []
-        batch_size = 100
-
-        print(f"   Generating embeddings for {len(documents)} documents...")
-
-        for i in range(0, len(documents), batch_size):
-            batch = documents[i : i + batch_size]
-            try:
-                response = client.embeddings.create(
-                    input=batch,
-                    model="text-embedding-3-small",  # Cheapest, fastest
-                )
-                embeddings.extend([item.embedding for item in response.data])
-                print(
-                    f"   ✓ Generated {min(i + batch_size, len(documents))}/{len(documents)} embeddings"
-                )
-            except Exception as e:
-                raise Exception(f"OpenAI embedding generation failed: {e}") from e
-
-        return embeddings
-
     def validate_api_key(self, _api_key: str) -> bool:
         """
         Weaviate format doesn't use API keys for packaging.
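Both the Weaviate and Pinecone upload paths walk their vectors in fixed-size slices (100 per batch) with a progress print per slice. The slicing itself can be sketched in isolation; `batches` is an illustrative helper, not a method of either adaptor:

```python
def batches(items: list, batch_size: int = 100) -> list[list]:
    """Slice items into consecutive batches, as the upsert loops above do."""
    return [items[i : i + batch_size] for i in range(0, len(items), batch_size)]


vectors = list(range(250))
chunks = batches(vectors)
print(len(chunks))      # 3
print(len(chunks[-1]))  # 50 (the final, short batch)
```

The last batch is allowed to be short, which is why the progress line clamps with min(i + batch_size, total).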
@@ -15,6 +15,10 @@ Hierarchy:
 import argparse
 from typing import Any
 
+# Default chunking constants used by RAG and package arguments
+DEFAULT_CHUNK_TOKENS = 512
+DEFAULT_CHUNK_OVERLAP_TOKENS = 50
+
 # Common argument definitions as data structure
 # These are arguments that appear in MULTIPLE commands
 COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
@@ -64,6 +68,15 @@ COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
             "metavar": "KEY",
         },
     },
+    "doc_version": {
+        "flags": ("--doc-version",),
+        "kwargs": {
+            "type": str,
+            "default": "",
+            "help": "Documentation version tag for RAG metadata (e.g., '16.2')",
+            "metavar": "VERSION",
+        },
+    },
 }
 
 # Behavior arguments — runtime flags shared by every scraper
@@ -105,18 +118,18 @@ RAG_ARGUMENTS: dict[str, dict[str, Any]] = {
         "flags": ("--chunk-tokens",),
         "kwargs": {
             "type": int,
-            "default": 512,
+            "default": DEFAULT_CHUNK_TOKENS,
             "metavar": "TOKENS",
-            "help": "Chunk size in tokens for RAG (default: 512)",
+            "help": f"Chunk size in tokens for RAG (default: {DEFAULT_CHUNK_TOKENS})",
         },
     },
     "chunk_overlap_tokens": {
         "flags": ("--chunk-overlap-tokens",),
         "kwargs": {
             "type": int,
-            "default": 50,
+            "default": DEFAULT_CHUNK_OVERLAP_TOKENS,
             "metavar": "TOKENS",
-            "help": "Overlap between chunks in tokens (default: 50)",
+            "help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
         },
     },
 }
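Deriving both the argparse default and the help text from one constant is what keeps the two from drifting, which is the point of replacing the literal 512/50 here. A minimal sketch of the pattern (standalone, not the package's argument tables):

```python
import argparse

DEFAULT_CHUNK_TOKENS = 512  # single source of truth for default and help text

parser = argparse.ArgumentParser()
parser.add_argument(
    "--chunk-tokens",
    type=int,
    default=DEFAULT_CHUNK_TOKENS,
    metavar="TOKENS",
    help=f"Chunk size in tokens for RAG (default: {DEFAULT_CHUNK_TOKENS})",
)

args = parser.parse_args([])
print(args.chunk_tokens)  # 512
```

Changing the constant now updates the default, the help string, and every adaptor signature importing it in one edit.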
@@ -153,6 +153,15 @@ UNIVERSAL_ARGUMENTS: dict[str, dict[str, Any]] = {
             "metavar": "PATH",
         },
     },
+    "doc_version": {
+        "flags": ("--doc-version",),
+        "kwargs": {
+            "type": str,
+            "default": "",
+            "help": "Documentation version tag for RAG metadata (e.g., '16.2')",
+            "metavar": "VERSION",
+        },
+    },
 }
 
 # Merge RAG arguments from common.py into universal arguments
@@ -569,3 +578,11 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
     if mode in ["advanced", "all"]:
         for arg_name, arg_def in ADVANCED_ARGUMENTS.items():
             parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
+
+    # Deprecated alias for backward compatibility (removed in v4.0.0)
+    parser.add_argument(
+        "--no-preserve-code",
+        dest="no_preserve_code_blocks",
+        action="store_true",
+        help=argparse.SUPPRESS,
+    )
@@ -8,6 +8,8 @@ import and use these definitions.
 import argparse
 from typing import Any
 
+from .common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
     # Positional argument
     "skill_directory": {
@@ -49,6 +51,7 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
             "chroma",
             "faiss",
             "qdrant",
+            "pinecone",
         ],
         "default": "claude",
         "help": "Target LLM platform (default: claude)",
@@ -109,13 +112,22 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
         "flags": ("--chunk-tokens",),
         "kwargs": {
             "type": int,
-            "default": 512,
-            "help": "Maximum tokens per chunk (default: 512)",
+            "default": DEFAULT_CHUNK_TOKENS,
+            "help": f"Maximum tokens per chunk (default: {DEFAULT_CHUNK_TOKENS})",
             "metavar": "N",
         },
     },
-    "no_preserve_code": {
-        "flags": ("--no-preserve-code",),
+    "chunk_overlap_tokens": {
+        "flags": ("--chunk-overlap-tokens",),
+        "kwargs": {
+            "type": int,
+            "default": DEFAULT_CHUNK_OVERLAP_TOKENS,
+            "help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
+            "metavar": "N",
+        },
+    },
+    "no_preserve_code_blocks": {
+        "flags": ("--no-preserve-code-blocks",),
         "kwargs": {
             "action": "store_true",
             "help": "Allow code block splitting (default: code blocks preserved)",
@@ -130,3 +142,11 @@ def add_package_arguments(parser: argparse.ArgumentParser) -> None:
         flags = arg_def["flags"]
         kwargs = arg_def["kwargs"]
         parser.add_argument(*flags, **kwargs)
+
+    # Deprecated alias for backward compatibility (removed in v4.0.0)
+    parser.add_argument(
+        "--no-preserve-code",
+        dest="no_preserve_code_blocks",
+        action="store_true",
+        help=argparse.SUPPRESS,
+    )
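The backward-compatible alias works because both flags write to the same `dest`, while `help=argparse.SUPPRESS` hides the old spelling from `--help`. A self-contained sketch of the mechanism (a generic parser, not the package's):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-preserve-code-blocks",
    dest="no_preserve_code_blocks",
    action="store_true",
    help="Allow code block splitting",
)
# Hidden backward-compat alias: same destination, suppressed from --help output.
parser.add_argument(
    "--no-preserve-code",
    dest="no_preserve_code_blocks",
    action="store_true",
    help=argparse.SUPPRESS,
)

print(parser.parse_args(["--no-preserve-code"]).no_preserve_code_blocks)         # True
print(parser.parse_args(["--no-preserve-code-blocks"]).no_preserve_code_blocks)  # True
```

Old scripts keep working unchanged, and only the new flag is advertised.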
@@ -172,6 +172,14 @@ def add_scrape_arguments(parser: argparse.ArgumentParser) -> None:
         kwargs = arg_def["kwargs"]
         parser.add_argument(*flags, **kwargs)
 
+    # Deprecated alias for backward compatibility (removed in v4.0.0)
+    parser.add_argument(
+        "--no-preserve-code",
+        dest="no_preserve_code_blocks",
+        action="store_true",
+        help=argparse.SUPPRESS,
+    )
+
 
 def get_scrape_argument_names() -> set:
     """Get the set of scrape argument destination names.
@@ -1057,6 +1057,7 @@ def analyze_codebase(
     enhance_level: int = 0,
     skill_name: str | None = None,
     skill_description: str | None = None,
+    doc_version: str = "",
 ) -> dict[str, Any]:
     """
     Analyze local codebase and extract code knowledge.
@@ -1603,6 +1604,7 @@ def analyze_codebase(
         docs_data=docs_data,
         skill_name=skill_name,
         skill_description=skill_description,
+        doc_version=doc_version,
     )
 
     return results
@@ -1622,6 +1624,7 @@ def _generate_skill_md(
     docs_data: dict[str, Any] | None = None,
     skill_name: str | None = None,
     skill_description: str | None = None,
+    doc_version: str = "",
 ):
     """
     Generate rich SKILL.md from codebase analysis results.
@@ -1657,6 +1660,7 @@ def _generate_skill_md(
     skill_content = f"""---
 name: {skill_name}
 description: {description}
+doc_version: {doc_version}
 ---
 
 # {repo_name} Codebase
@@ -2197,13 +2201,11 @@ def _generate_references(output_dir: Path):
 
     if source_dir.exists() and source_dir.is_dir():
         # Copy directory to references/ (not symlink, for portability)
-        if target_dir.exists():
-            import shutil
-
-            shutil.rmtree(target_dir)
-
         import shutil
 
+        if target_dir.exists():
+            shutil.rmtree(target_dir)
+
         shutil.copytree(source_dir, target_dir)
         logger.debug(f"Copied {source} → references/{target}")
 
@@ -2451,6 +2453,7 @@ Examples:
         enhance_level=args.enhance_level,  # AI enhancement level (0-3)
         skill_name=getattr(args, "name", None),
         skill_description=getattr(args, "description", None),
+        doc_version=getattr(args, "doc_version", ""),
     )
 
     # ============================================================
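The hunks above thread `doc_version` into the SKILL.md YAML frontmatter. A minimal round-trip sketch, assuming the flat `key: value` frontmatter shape shown in the f-string; both helper names are illustrative:

```python
def build_frontmatter(name: str, description: str, doc_version: str = "") -> str:
    """Sketch of the frontmatter block _generate_skill_md writes."""
    return f"---\nname: {name}\ndescription: {description}\ndoc_version: {doc_version}\n---\n"


def read_frontmatter_field(skill_md: str, field: str) -> str:
    # Naive line-based read; enough for the flat key: value block built above.
    for line in skill_md.split("---")[1].strip().splitlines():
        key, _, value = line.partition(":")
        if key.strip() == field:
            return value.strip()
    return ""


md = build_frontmatter("godot", "Godot engine docs", doc_version="4.3")
print(read_frontmatter_field(md, "doc_version"))  # 4.3
```

Downstream, the adaptors read this back via `_build_skill_metadata()` and stamp it onto every chunk's metadata.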
@@ -13,6 +13,7 @@ from skill_seekers.cli.arguments.create import (
     get_compatible_arguments,
     get_universal_argument_names,
 )
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 logger = logging.getLogger(__name__)
 
@@ -106,8 +107,8 @@ class CreateCommand:
         # Check against common defaults
         defaults = {
             "max_issues": 100,
-            "chunk_tokens": 512,
-            "chunk_overlap_tokens": 50,
+            "chunk_tokens": DEFAULT_CHUNK_TOKENS,
+            "chunk_overlap_tokens": DEFAULT_CHUNK_OVERLAP_TOKENS,
             "output": None,
         }
 
@@ -160,11 +161,11 @@ class CreateCommand:
         # RAG arguments (web scraper only)
         if getattr(self.args, "chunk_for_rag", False):
             argv.append("--chunk-for-rag")
-        if getattr(self.args, "chunk_tokens", None) and self.args.chunk_tokens != 512:
+        if getattr(self.args, "chunk_tokens", None) and self.args.chunk_tokens != DEFAULT_CHUNK_TOKENS:
             argv.extend(["--chunk-tokens", str(self.args.chunk_tokens)])
         if (
             getattr(self.args, "chunk_overlap_tokens", None)
-            and self.args.chunk_overlap_tokens != 50
+            and self.args.chunk_overlap_tokens != DEFAULT_CHUNK_OVERLAP_TOKENS
         ):
             argv.extend(["--chunk-overlap-tokens", str(self.args.chunk_overlap_tokens)])
 
@@ -428,6 +429,10 @@ class CreateCommand:
         if self.args.quiet:
             argv.append("--quiet")
 
+        # Documentation version metadata
+        if getattr(self.args, "doc_version", ""):
+            argv.extend(["--doc-version", self.args.doc_version])
+
         # Enhancement Workflow arguments
         if getattr(self.args, "enhance_workflow", None):
             for wf in self.args.enhance_workflow:
|
|||||||
@@ -1565,9 +1565,11 @@ class DocToSkillConverter:
         if len(example_codes) >= 10:
             break

+        doc_version = self.config.get("doc_version", "")
         content = f"""---
 name: {self.name}
 description: {description}
+doc_version: {doc_version}
 ---

 # {self.name.title()} Skill

@@ -2103,6 +2105,11 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
             "max_pages": DEFAULT_MAX_PAGES,
         }

+    # Apply CLI override for doc_version (works for all config modes)
+    cli_doc_version = getattr(args, "doc_version", "")
+    if cli_doc_version:
+        config["doc_version"] = cli_doc_version
+
     # Apply CLI overrides for rate limiting
     if args.no_rate_limit:
         config["rate_limit"] = 0
@@ -968,10 +968,13 @@ class GitHubToSkillConverter:
         # Truncate description to 1024 chars if needed
         desc = self.description[:1024] if len(self.description) > 1024 else self.description

+        doc_version = self.config.get("doc_version", "")
+
         # Build skill content
         skill_content = f"""---
 name: {skill_name}
 description: {desc}
+doc_version: {doc_version}
 ---

 # {repo_info.get("name", self.name)}

@@ -1003,10 +1006,11 @@ Use this skill when you need to:

         # Repository info
         skill_content += "### Repository Info\n"
-        skill_content += f"- **Homepage:** {repo_info.get('homepage', 'N/A')}\n"
+        skill_content += f"- **Homepage:** {repo_info.get('homepage') or 'N/A'}\n"
         skill_content += f"- **Topics:** {', '.join(repo_info.get('topics', []))}\n"
         skill_content += f"- **Open Issues:** {repo_info.get('open_issues', 0)}\n"
-        skill_content += f"- **Last Updated:** {repo_info.get('updated_at', 'N/A')[:10]}\n\n"
+        updated_at = repo_info.get('updated_at') or 'N/A'
+        skill_content += f"- **Last Updated:** {updated_at[:10]}\n\n"

         # Languages
         skill_content += "### Languages\n"

@@ -1101,8 +1105,10 @@ Use this skill when you need to:

         lines = []
         for release in releases[:3]:
+            published_at = release.get('published_at') or 'N/A'
+            release_name = release.get('name') or release['tag_name']
             lines.append(
-                f"- **{release['tag_name']}** ({release['published_at'][:10]}): {release['name']}"
+                f"- **{release['tag_name']}** ({published_at[:10]}): {release_name}"
             )

         return "\n".join(lines)

@@ -1298,15 +1304,17 @@ Use this skill when you need to:
         content += f"## Open Issues ({len(open_issues)})\n\n"
         for issue in open_issues:
             labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels"
+            created_at = issue.get('created_at') or 'N/A'
             content += f"### #{issue['number']}: {issue['title']}\n"
-            content += f"**Labels:** {labels} | **Created:** {issue['created_at'][:10]}\n"
+            content += f"**Labels:** {labels} | **Created:** {created_at[:10]}\n"
             content += f"[View on GitHub]({issue['url']})\n\n"

         content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n"
         for issue in closed_issues:
             labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels"
+            closed_at = issue.get('closed_at') or 'N/A'
             content += f"### #{issue['number']}: {issue['title']}\n"
-            content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n"
+            content += f"**Labels:** {labels} | **Closed:** {closed_at[:10]}\n"
             content += f"[View on GitHub]({issue['url']})\n\n"

         issues_path = f"{self.skill_dir}/references/issues.md"

@@ -1323,11 +1331,14 @@ Use this skill when you need to:
         )

         for release in releases:
-            content += f"## {release['tag_name']}: {release['name']}\n"
-            content += f"**Published:** {release['published_at'][:10]}\n"
+            published_at = release.get('published_at') or 'N/A'
+            release_name = release.get('name') or release['tag_name']
+            release_body = release.get('body') or ''
+            content += f"## {release['tag_name']}: {release_name}\n"
+            content += f"**Published:** {published_at[:10]}\n"
             if release["prerelease"]:
                 content += "**Pre-release**\n"
-            content += f"\n{release['body']}\n\n"
+            content += f"\n{release_body}\n\n"
             content += f"[View on GitHub]({release['url']})\n\n---\n\n"

         releases_path = f"{self.skill_dir}/references/releases.md"
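The `.get(key, default)` → `.get(key) or default` changes above all guard the same failure mode: the GitHub API returns explicit `null` for unset fields, and `dict.get`'s default only applies when the key is *missing*, not when its value is `None`. A minimal sketch of the difference (the dict literal is illustrative, not real API output):

```python
# GitHub-style payload: keys present, values explicitly None
repo_info = {"homepage": None, "updated_at": None}

# .get() with a default still returns None when the key exists:
assert repo_info.get("homepage", "N/A") is None

# `or` also covers the explicit-None (and empty-string) case:
assert (repo_info.get("homepage") or "N/A") == "N/A"

# The old code crashed slicing None; the fixed form degrades gracefully:
updated_at = repo_info.get("updated_at") or "N/A"
assert updated_at[:10] == "N/A"
```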
@@ -325,8 +325,8 @@ def _handle_analyze_command(args: argparse.Namespace) -> int:
     if getattr(args, "enhance_stage", None):
         for stage in args.enhance_stage:
             sys.argv.extend(["--enhance-stage", stage])
-    if getattr(args, "workflow_var", None):
-        for var in args.workflow_var:
+    if getattr(args, "var", None):
+        for var in args.var:
             sys.argv.extend(["--var", var])
     if getattr(args, "workflow_dry_run", False):
         sys.argv.append("--workflow-dry-run")
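The `workflow_var` → `var` rename matters because argparse derives the attribute name from the flag itself unless `dest=` overrides it: values passed as `--var` land on `args.var`, so a lookup of `args.workflow_var` silently finds nothing and the flag is dropped. A minimal reproduction:

```python
import argparse

parser = argparse.ArgumentParser()
# No dest= given, so argparse names the attribute after the flag: --var -> args.var
parser.add_argument("--var", action="append")

args = parser.parse_args(["--var", "foo=bar"])
assert args.var == ["foo=bar"]                       # populated under .var
assert getattr(args, "workflow_var", None) is None   # the old lookup sees nothing
```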
@@ -14,6 +14,8 @@ import os
 import sys
 from pathlib import Path

+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 # Import utilities
 try:
     from quality_checker import SkillQualityChecker, print_report

@@ -45,8 +47,9 @@ def package_skill(
     chunk_overlap=200,
     batch_size=100,
     enable_chunking=False,
-    chunk_max_tokens=512,
+    chunk_max_tokens=DEFAULT_CHUNK_TOKENS,
     preserve_code_blocks=True,
+    chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
 ):
     """
     Package a skill directory into platform-specific format

@@ -121,6 +124,7 @@ def package_skill(
         "chroma",
         "faiss",
         "qdrant",
+        "pinecone",
     ]

     if target in RAG_PLATFORMS and not enable_chunking:

@@ -156,6 +160,7 @@ def package_skill(
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )
     else:
         package_path = adaptor.package(

@@ -164,6 +169,7 @@ def package_skill(
             enable_chunking=enable_chunking,
             chunk_max_tokens=chunk_max_tokens,
             preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
         )

     print(f" Output: {package_path}")

@@ -226,7 +232,8 @@ Examples:
         batch_size=args.batch_size,
         enable_chunking=args.chunk_for_rag,
         chunk_max_tokens=args.chunk_tokens,
-        preserve_code_blocks=not args.no_preserve_code,
+        preserve_code_blocks=not args.no_preserve_code_blocks,
+        chunk_overlap_tokens=args.chunk_overlap_tokens,
     )

     if not success:
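The commit keeps `--no-preserve-code` working as a hidden alias for the renamed `--no-preserve-code-blocks`. One common argparse pattern for that (a sketch of the idea, not necessarily the project's exact wiring): register both flags with the same `dest`, and hide the old one with `argparse.SUPPRESS` so it stays out of `--help`.

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--no-preserve-code-blocks",
    dest="no_preserve_code_blocks",
    action="store_true",
    help="Allow splitting inside fenced code blocks",
)
# Backward-compat alias: writes to the same dest, hidden from --help
parser.add_argument(
    "--no-preserve-code",
    dest="no_preserve_code_blocks",
    action="store_true",
    help=argparse.SUPPRESS,
)

assert parser.parse_args(["--no-preserve-code"]).no_preserve_code_blocks is True
assert parser.parse_args(["--no-preserve-code-blocks"]).no_preserve_code_blocks is True
assert parser.parse_args([]).no_preserve_code_blocks is False
```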
@@ -14,6 +14,8 @@ Usage:
     chunks = chunker.chunk_skill(Path("output/react"))
 """

+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
 import re
 from pathlib import Path
 import json

@@ -35,8 +37,8 @@ class RAGChunker:

     def __init__(
         self,
-        chunk_size: int = 512,
-        chunk_overlap: int = 50,
+        chunk_size: int = DEFAULT_CHUNK_TOKENS,
+        chunk_overlap: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
         preserve_code_blocks: bool = True,
         preserve_paragraphs: bool = True,
         min_chunk_size: int = 100,

@@ -383,9 +385,9 @@ def main():
     )
     parser.add_argument("skill_dir", type=Path, help="Path to skill directory")
     parser.add_argument("--output", "-o", type=Path, help="Output JSON file")
-    parser.add_argument("--chunk-tokens", type=int, default=512, help="Target chunk size in tokens")
+    parser.add_argument("--chunk-tokens", type=int, default=DEFAULT_CHUNK_TOKENS, help="Target chunk size in tokens")
     parser.add_argument(
-        "--chunk-overlap-tokens", type=int, default=50, help="Overlap size in tokens"
+        "--chunk-overlap-tokens", type=int, default=DEFAULT_CHUNK_OVERLAP_TOKENS, help="Overlap size in tokens"
    )
     parser.add_argument("--no-code-blocks", action="store_true", help="Don't preserve code blocks")
     parser.add_argument("--no-paragraphs", action="store_true", help="Don't preserve paragraphs")
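The commit message describes auto-scaling overlap as `max(50, chunk_tokens // 10)` when the chunk size is non-default but the overlap is left at its default. A minimal sketch of that rule, under the stated defaults (the helper name `effective_overlap` is hypothetical, for illustration only):

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def effective_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Hypothetical helper: when the user changed the chunk size but not the
    overlap, scale the overlap to ~10% of the chunk, floored at 50 tokens."""
    if chunk_tokens != DEFAULT_CHUNK_TOKENS and overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS:
        return max(50, chunk_tokens // 10)
    return overlap_tokens

assert effective_overlap(512, 50) == 50     # all defaults: unchanged
assert effective_overlap(1024, 50) == 102   # auto-scaled: max(50, 1024 // 10)
assert effective_overlap(1024, 128) == 128  # explicit overlap always wins
assert effective_overlap(300, 50) == 50     # floor applies: 300 // 10 = 30 -> 50
```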
@@ -109,6 +109,11 @@ class WordToSkillConverter:
         if not os.path.exists(self.docx_path):
             raise FileNotFoundError(f"Word document not found: {self.docx_path}")

+        if not self.docx_path.lower().endswith(".docx"):
+            raise ValueError(
+                f"Not a Word document (expected .docx): {self.docx_path}"
+            )
+
         # --- Extract metadata via python-docx ---
         doc = python_docx.Document(self.docx_path)
         core_props = doc.core_properties

@@ -825,8 +830,8 @@ def _build_section(
         raw_text = elem.get_text(separator="\n").strip()
         # Exclude bullet-point / prose lists (•, *, -)
         if raw_text and not re.search(r"^[•\-\*]\s", raw_text, re.MULTILINE):
-            if _score_code_quality(raw_text) >= 5.5:
-                quality_score = _score_code_quality(raw_text)
+            quality_score = _score_code_quality(raw_text)
+            if quality_score >= 5.5:
                 code_samples.append(
                     {"code": raw_text, "language": "", "quality_score": quality_score}
                 )
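The second hunk fixes the double `_score_code_quality()` call: the old shape scored every candidate block twice (once in the condition, once for the stored value), while the fix hoists the call and reuses the result. A small illustration with a stand-in scorer that counts its own invocations (the scorer body here is invented purely to make the call count observable):

```python
calls = {"n": 0}

def score_code_quality(text: str) -> float:
    """Stand-in for the scraper's scorer; counts how often it runs."""
    calls["n"] += 1
    return 6.0 if "def " in text else 1.0

raw_text = "def hello():\n    pass"

# Buggy shape: scorer runs twice per candidate block
calls["n"] = 0
if score_code_quality(raw_text) >= 5.5:
    quality_score = score_code_quality(raw_text)
assert calls["n"] == 2

# Fixed shape: score once, reuse the value in the condition and the record
calls["n"] = 0
quality_score = score_code_quality(raw_text)
if quality_score >= 5.5:
    sample = {"code": raw_text, "quality_score": quality_score}
assert calls["n"] == 1
```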
@@ -359,5 +359,102 @@ class TestChunkingCLIIntegration:
         )

+    def test_chunk_overlap_tokens_parameter(self, tmp_path):
+        """Test --chunk-overlap-tokens controls RAGChunker overlap."""
+        from skill_seekers.cli.package_skill import package_skill
+
+        skill_dir = create_test_skill(tmp_path, large_doc=True)
+
+        # Package with default overlap (50)
+        success, package_path = package_skill(
+            skill_dir=skill_dir,
+            open_folder_after=False,
+            skip_quality_check=True,
+            target="langchain",
+            enable_chunking=True,
+            chunk_max_tokens=256,
+            chunk_overlap_tokens=50,
+        )
+
+        assert success
+        assert package_path.exists()
+
+        with open(package_path) as f:
+            data_default = json.load(f)
+
+        # Package with large overlap (128)
+        success2, package_path2 = package_skill(
+            skill_dir=skill_dir,
+            open_folder_after=False,
+            skip_quality_check=True,
+            target="langchain",
+            enable_chunking=True,
+            chunk_max_tokens=256,
+            chunk_overlap_tokens=128,
+        )
+
+        assert success2
+        assert package_path2.exists()
+
+        with open(package_path2) as f:
+            data_large_overlap = json.load(f)
+
+        # Large overlap should produce more chunks (more overlap = more chunks)
+        assert len(data_large_overlap) >= len(data_default), (
+            f"Large overlap ({len(data_large_overlap)}) should produce >= chunks than default ({len(data_default)})"
+        )
+
+    def test_chunk_overlap_scales_with_chunk_size(self, tmp_path):
+        """Test that overlap auto-scales when chunk_tokens is non-default but overlap is default."""
+        from skill_seekers.cli.adaptors.base import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
+        adaptor = get_adaptor("langchain")
+
+        skill_dir = create_test_skill(tmp_path, large_doc=True)
+        metadata = adaptor._build_skill_metadata(skill_dir)
+        content = (skill_dir / "SKILL.md").read_text()
+
+        # With default chunk size (512) and default overlap (50), overlap should be 50
+        chunks_default = adaptor._maybe_chunk_content(
+            content, {"source": "test"},
+            enable_chunking=True,
+            chunk_max_tokens=DEFAULT_CHUNK_TOKENS,
+            chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
+        )
+
+        # With large chunk size (1024) and default overlap (50),
+        # overlap should auto-scale to max(50, 1024//10) = 102
+        chunks_large = adaptor._maybe_chunk_content(
+            content, {"source": "test"},
+            enable_chunking=True,
+            chunk_max_tokens=1024,
+            chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
+        )
+
+        # Both should produce valid chunks
+        assert len(chunks_default) > 1
+        assert len(chunks_large) >= 1
+
+    def test_preserve_code_blocks_flag(self, tmp_path):
+        """Test --no-preserve-code-blocks parameter is accepted."""
+        from skill_seekers.cli.package_skill import package_skill
+
+        skill_dir = create_test_skill(tmp_path, large_doc=True)
+
+        # Package with code block preservation disabled
+        success, package_path = package_skill(
+            skill_dir=skill_dir,
+            open_folder_after=False,
+            skip_quality_check=True,
+            target="langchain",
+            enable_chunking=True,
+            chunk_max_tokens=256,
+            preserve_code_blocks=False,
+        )
+
+        assert success
+        assert package_path.exists()
+

 if __name__ == "__main__":
     pytest.main([__file__, "-v"])
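The first test above relies on an intuition worth spelling out: with a fixed chunk size, a larger overlap shrinks the stride between windows, so the same document yields at least as many chunks. A simplified sliding-window model (not RAGChunker's actual boundary-aware algorithm) makes the relationship concrete:

```python
def count_chunks(n_tokens: int, chunk: int, overlap: int) -> int:
    """Sliding-window chunk count: each window advances by (chunk - overlap)."""
    step = chunk - overlap
    if n_tokens <= chunk:
        return 1
    # one full window, then ceil((n_tokens - chunk) / step) further windows
    return 1 + -(-(n_tokens - chunk) // step)

# Same document, same chunk size: more overlap -> smaller stride -> more chunks
assert count_chunks(2000, 256, 50) == 10
assert count_chunks(2000, 256, 128) == 15
assert count_chunks(2000, 256, 50) <= count_chunks(2000, 256, 128)
```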
@@ -294,5 +294,81 @@ class TestE2EWorkflow:
         assert "unrecognized arguments" not in result.stderr.lower()


+class TestVarFlagRouting:
+    """Test that --var flag is correctly routed through create command."""
+
+    def test_var_flag_accepted_by_create(self):
+        """Test that --var flag is accepted (not 'unrecognized') by create command."""
+        result = subprocess.run(
+            ["skill-seekers", "create", "--help"],
+            capture_output=True,
+            text=True,
+        )
+        assert "--var" in result.stdout, "create --help should show --var flag"
+
+    def test_var_flag_accepted_by_analyze(self):
+        """Test that --var flag is accepted by analyze command."""
+        result = subprocess.run(
+            ["skill-seekers", "analyze", "--help"],
+            capture_output=True,
+            text=True,
+        )
+        assert "--var" in result.stdout, "analyze --help should show --var flag"
+
+    @pytest.mark.slow
+    def test_var_flag_not_rejected_in_create_local(self, tmp_path):
+        """Test --var KEY=VALUE doesn't cause 'unrecognized arguments' in create."""
+        test_dir = tmp_path / "test_code"
+        test_dir.mkdir()
+        (test_dir / "test.py").write_text("def hello(): pass")
+
+        result = subprocess.run(
+            [
+                "skill-seekers", "create", str(test_dir),
+                "--var", "foo=bar",
+                "--dry-run",
+            ],
+            capture_output=True,
+            text=True,
+            timeout=15,
+        )
+        assert "unrecognized arguments" not in result.stderr.lower(), (
+            f"--var should be accepted, got stderr: {result.stderr}"
+        )
+
+
+class TestBackwardCompatibleFlags:
+    """Test that deprecated flag aliases still work."""
+
+    def test_no_preserve_code_alias_accepted_by_package(self):
+        """Test --no-preserve-code (old name) is still accepted by package command."""
+        result = subprocess.run(
+            ["skill-seekers", "package", "--help"],
+            capture_output=True,
+            text=True,
+        )
+        # The old flag should not appear in --help (it's suppressed)
+        # but should not cause an error if used
+        assert result.returncode == 0
+
+    def test_no_preserve_code_alias_accepted_by_scrape(self):
+        """Test --no-preserve-code (old name) is still accepted by scrape command."""
+        result = subprocess.run(
+            ["skill-seekers", "scrape", "--help"],
+            capture_output=True,
+            text=True,
+        )
+        assert result.returncode == 0
+
+    def test_no_preserve_code_alias_accepted_by_create(self):
+        """Test --no-preserve-code (old name) is still accepted by create command."""
+        result = subprocess.run(
+            ["skill-seekers", "create", "--help-all"],
+            capture_output=True,
+            text=True,
+        )
+        assert result.returncode == 0
+

 if __name__ == "__main__":
     pytest.main([__file__, "-v", "-s"])
@@ -25,8 +25,8 @@ class TestUniversalArguments:
     """Test universal argument definitions."""

     def test_universal_count(self):
-        """Should have exactly 18 universal arguments (after Phase 2 workflow integration + local_repo_path)."""
-        assert len(UNIVERSAL_ARGUMENTS) == 18
+        """Should have exactly 19 universal arguments (after Phase 2 workflow integration + local_repo_path + doc_version)."""
+        assert len(UNIVERSAL_ARGUMENTS) == 19

     def test_universal_argument_names(self):
         """Universal arguments should have expected names."""

@@ -50,6 +50,7 @@ class TestUniversalArguments:
             "var",
             "workflow_dry_run",
             "local_repo_path",  # GitHub local clone path for unlimited C3.x analysis
+            "doc_version",  # Documentation version tag for RAG metadata
         }
         assert set(UNIVERSAL_ARGUMENTS.keys()) == expected_names

@@ -130,7 +131,7 @@ class TestArgumentHelpers:
         """Should return set of universal argument names."""
         names = get_universal_argument_names()
         assert isinstance(names, set)
-        assert len(names) == 18  # Phase 2: added 4 workflow arguments + local_repo_path
+        assert len(names) == 19  # Phase 2: added 4 workflow arguments + local_repo_path + doc_version
         assert "name" in names
         assert "enhance_level" in names  # Phase 1: consolidated flag
         assert "enhance_workflow" in names  # Phase 2: workflow support
tests/test_pinecone_adaptor.py (new file, 752 lines)
@@ -0,0 +1,752 @@
+#!/usr/bin/env python3
+"""
+Tests for Pinecone adaptor and doc_version metadata flow.
+"""
+
+import json
+from pathlib import Path
+
+import pytest
+
+from skill_seekers.cli.adaptors.base import SkillAdaptor, SkillMetadata
+
+
+# ---------------------------------------------------------------------------
+# Fixtures
+# ---------------------------------------------------------------------------
+
+
+@pytest.fixture
+def sample_skill_dir(tmp_path):
+    """Create a minimal skill directory with SKILL.md and references."""
+    skill_dir = tmp_path / "test-skill"
+    skill_dir.mkdir()
+
+    skill_md = """---
+name: test-skill
+description: A test skill for pinecone
+doc_version: 16.2
+---
+
+# Test Skill
+
+This is a test skill for Pinecone adaptor testing.
+
+## Quick Start
+
+Get started quickly.
+"""
+    (skill_dir / "SKILL.md").write_text(skill_md)
+
+    refs_dir = skill_dir / "references"
+    refs_dir.mkdir()
+    (refs_dir / "api_reference.md").write_text(
+        "# API Reference\n\nSome API docs.\n"
+    )
+    (refs_dir / "getting_started.md").write_text(
+        "# Getting Started\n\nSome getting started docs.\n"
+    )
+
+    return skill_dir
+
+
+@pytest.fixture
+def sample_skill_dir_no_doc_version(tmp_path):
+    """Create a skill directory without doc_version in frontmatter."""
+    skill_dir = tmp_path / "no-version-skill"
+    skill_dir.mkdir()
+
+    skill_md = """---
+name: no-version-skill
+description: A test skill without doc_version
+---
+
+# No Version Skill
+
+Content here.
+"""
+    (skill_dir / "SKILL.md").write_text(skill_md)
+
+    refs_dir = skill_dir / "references"
+    refs_dir.mkdir()
+    (refs_dir / "api.md").write_text("# API\n\nAPI docs.\n")
+
+    return skill_dir
+
+
+# ---------------------------------------------------------------------------
+# Pinecone Adaptor Tests
+# ---------------------------------------------------------------------------
+
+
+class TestPineconeAdaptor:
+    """Test Pinecone adaptor functionality."""
+
+    def test_import(self):
+        """PineconeAdaptor can be imported."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        assert PineconeAdaptor is not None
+
+    def test_platform_constants(self):
+        """Platform constants are set correctly."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        assert adaptor.PLATFORM == "pinecone"
+        assert adaptor.PLATFORM_NAME == "Pinecone (Vector Database)"
+        assert adaptor.DEFAULT_API_ENDPOINT is None
+
+    def test_registered_in_factory(self):
+        """PineconeAdaptor is registered in the adaptor factory."""
+        from skill_seekers.cli.adaptors import ADAPTORS
+
+        assert "pinecone" in ADAPTORS
+
+    def test_get_adaptor(self):
+        """get_adaptor('pinecone') returns PineconeAdaptor instance."""
+        from skill_seekers.cli.adaptors import get_adaptor
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = get_adaptor("pinecone")
+        assert isinstance(adaptor, PineconeAdaptor)
+
+    def test_format_skill_md_structure(self, sample_skill_dir):
+        """format_skill_md returns valid JSON with expected structure."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        metadata = SkillMetadata(
+            name="test-skill",
+            description="Test skill",
+            version="1.0.0",
+            doc_version="16.2",
+        )
+        result = adaptor.format_skill_md(sample_skill_dir, metadata)
+        data = json.loads(result)
+
+        assert "index_name" in data
+        assert "namespace" in data
+        assert "dimension" in data
+        assert "metric" in data
+        assert "vectors" in data
+        assert data["dimension"] == 1536
+        assert data["metric"] == "cosine"
+
+    def test_format_skill_md_vectors_have_metadata(self, sample_skill_dir):
+        """Each vector has id and metadata fields."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        metadata = SkillMetadata(
+            name="test-skill",
+            description="Test",
+            doc_version="16.2",
+        )
+        result = adaptor.format_skill_md(sample_skill_dir, metadata)
+        data = json.loads(result)
+
+        assert len(data["vectors"]) > 0
+        for vec in data["vectors"]:
+            assert "id" in vec
+            assert "metadata" in vec
+            assert "text" in vec["metadata"]
+            assert "source" in vec["metadata"]
+            assert "category" in vec["metadata"]
+            assert "file" in vec["metadata"]
+            assert "type" in vec["metadata"]
+            assert "version" in vec["metadata"]
+            assert "doc_version" in vec["metadata"]
+
+    def test_format_skill_md_doc_version_propagates(self, sample_skill_dir):
+        """doc_version flows into every vector's metadata."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        metadata = SkillMetadata(
+            name="test-skill",
+            description="Test",
+            doc_version="16.2",
+        )
+        result = adaptor.format_skill_md(sample_skill_dir, metadata)
+        data = json.loads(result)
+
+        for vec in data["vectors"]:
+            assert vec["metadata"]["doc_version"] == "16.2"
+
+    def test_format_skill_md_empty_doc_version(self, sample_skill_dir):
+        """Empty doc_version is preserved as empty string."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        metadata = SkillMetadata(name="test-skill", description="Test", doc_version="")
+        result = adaptor.format_skill_md(sample_skill_dir, metadata)
+        data = json.loads(result)
+
+        for vec in data["vectors"]:
+            assert vec["metadata"]["doc_version"] == ""
+
+    def test_format_skill_md_has_overview_and_references(self, sample_skill_dir):
+        """Output includes overview (SKILL.md) and reference documents."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        metadata = SkillMetadata(name="test-skill", description="Test")
+        result = adaptor.format_skill_md(sample_skill_dir, metadata)
+        data = json.loads(result)
+
+        categories = {vec["metadata"]["category"] for vec in data["vectors"]}
+        types = {vec["metadata"]["type"] for vec in data["vectors"]}
+        assert "overview" in categories
+        assert "documentation" in types
+        assert "reference" in types
+
+    def test_package_creates_file(self, sample_skill_dir, tmp_path):
+        """package() creates a JSON file at expected path."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        output_path = adaptor.package(sample_skill_dir, tmp_path)
+
+        assert output_path.exists()
+        assert output_path.name.endswith("-pinecone.json")
+
+        data = json.loads(output_path.read_text())
+        assert "vectors" in data
|
assert len(data["vectors"]) > 0
|
||||||
|
|
||||||
|
def test_package_reads_frontmatter_metadata(self, sample_skill_dir, tmp_path):
|
||||||
|
"""package() reads doc_version from SKILL.md frontmatter."""
|
||||||
|
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
|
||||||
|
|
||||||
|
adaptor = PineconeAdaptor()
|
||||||
|
output_path = adaptor.package(sample_skill_dir, tmp_path)
|
||||||
|
|
||||||
|
data = json.loads(output_path.read_text())
|
||||||
|
for vec in data["vectors"]:
|
||||||
|
assert vec["metadata"]["doc_version"] == "16.2"
|
||||||
|
|
||||||
|
def test_package_with_chunking(self, sample_skill_dir, tmp_path):
|
||||||
|
"""package() with chunking enabled produces valid output."""
|
||||||
|
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
|
||||||
|
|
||||||
|
adaptor = PineconeAdaptor()
|
||||||
|
output_path = adaptor.package(
|
||||||
|
sample_skill_dir, tmp_path, enable_chunking=True, chunk_max_tokens=64
|
||||||
|
)
|
||||||
|
|
||||||
|
data = json.loads(output_path.read_text())
|
||||||
|
assert "vectors" in data
|
||||||
|
assert len(data["vectors"]) > 0
|
||||||
|
|
||||||
|
def test_index_name_derived_from_skill_name(self, sample_skill_dir, tmp_path):
|
||||||
|
"""index_name and namespace are derived from skill directory name."""
|
||||||
|
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
|
||||||
|
|
||||||
|
adaptor = PineconeAdaptor()
|
||||||
|
output_path = adaptor.package(sample_skill_dir, tmp_path)
|
||||||
|
|
||||||
|
data = json.loads(output_path.read_text())
|
||||||
|
assert data["index_name"] == "test-skill"
|
||||||
|
assert data["namespace"] == "test-skill"
|
||||||
|
|
||||||
|
def test_no_values_field_in_vectors(self, sample_skill_dir, tmp_path):
|
||||||
|
"""Vectors have no 'values' field — embeddings are added at upload time."""
|
||||||
|
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
|
||||||
|
|
||||||
|
adaptor = PineconeAdaptor()
|
||||||
|
output_path = adaptor.package(sample_skill_dir, tmp_path)
|
||||||
|
|
||||||
|
data = json.loads(output_path.read_text())
|
||||||
|
for vec in data["vectors"]:
|
||||||
|
assert "values" not in vec
|
||||||
|
|
||||||
|
def test_text_truncation(self):
|
||||||
|
"""_truncate_text_for_metadata respects byte limit."""
|
||||||
|
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
|
||||||
|
|
||||||
|
adaptor = PineconeAdaptor()
|
||||||
|
# Short text should not be truncated
|
||||||
|
assert adaptor._truncate_text_for_metadata("hello") == "hello"
|
||||||
|
|
||||||
|
# Very long text should be truncated
|
||||||
|
long_text = "x" * 50000
|
||||||
|
truncated = adaptor._truncate_text_for_metadata(long_text)
|
||||||
|
assert len(truncated.encode("utf-8")) <= 40000
|
||||||
|
|
||||||
|
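The byte-limit behavior checked above can be sketched as a standalone function. This is a hypothetical shape, not the adaptor's actual `_truncate_text_for_metadata`; the 40000-byte cap simply mirrors the test, and the point is that the cut must land on a UTF-8 character boundary:

```python
def truncate_for_metadata(text: str, max_bytes: int = 40000) -> str:
    """Cut text to at most max_bytes of UTF-8 without splitting a character.

    Illustrative sketch only; max_bytes mirrors the test above.
    """
    encoded = text.encode("utf-8")
    if len(encoded) <= max_bytes:
        return text
    # errors="ignore" drops a multi-byte character cut at the boundary
    # instead of raising UnicodeDecodeError.
    return encoded[:max_bytes].decode("utf-8", errors="ignore")

assert truncate_for_metadata("hello") == "hello"
assert len(truncate_for_metadata("x" * 50000).encode("utf-8")) <= 40000
# Multi-byte characters survive the cut cleanly
assert len(truncate_for_metadata("\u00e9" * 30000).encode("utf-8")) <= 40000
```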
    def test_validate_api_key_returns_false(self):
        """validate_api_key returns False (no key needed for packaging)."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        assert adaptor.validate_api_key("some-key") is False

    def test_get_env_var_name(self):
        """get_env_var_name returns PINECONE_API_KEY."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        assert adaptor.get_env_var_name() == "PINECONE_API_KEY"

    def test_supports_enhancement_false(self):
        """Pinecone doesn't support enhancement."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        assert adaptor.supports_enhancement() is False

    def test_upload_without_pinecone_installed(self, tmp_path):
        """upload() returns helpful error when pinecone not installed."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        # Create a dummy package file
        pkg = tmp_path / "test-pinecone.json"
        pkg.write_text(json.dumps({"vectors": [], "index_name": "test", "namespace": "test"}))

        # This will either work (if pinecone is installed) or return error
        result = adaptor.upload(pkg)
        # Without API key, should fail
        assert result["success"] is False

    def _make_mock_pinecone(self, monkeypatch):
        """Helper: stub the pinecone module so upload() can run without a real server."""
        import sys
        import types
        from unittest.mock import MagicMock

        mock_module = types.ModuleType("pinecone")
        mock_index = MagicMock()
        mock_pc = MagicMock()
        mock_pc.list_indexes.return_value = []  # no existing indexes
        mock_pc.Index.return_value = mock_index
        mock_module.Pinecone = MagicMock(return_value=mock_pc)
        mock_module.ServerlessSpec = MagicMock()
        monkeypatch.setitem(sys.modules, "pinecone", mock_module)
        return mock_pc, mock_index

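The `_make_mock_pinecone` helper works because Python resolves imports through the `sys.modules` cache. A minimal self-contained sketch of the same stubbing technique, using a throwaway module name rather than the real `pinecone` package:

```python
import sys
import types
from unittest.mock import MagicMock

# Build a module object in memory and plant it in the import cache.
stub = types.ModuleType("fake_vector_db")
stub.Client = MagicMock(name="Client")
sys.modules["fake_vector_db"] = stub

# Any later import resolves to the stub; no package install required.
import fake_vector_db

client = fake_vector_db.Client(api_key="fake")
assert type(client).__name__ == "MagicMock"

del sys.modules["fake_vector_db"]  # clean up, as monkeypatch.setitem would on teardown
```

In the tests above, `monkeypatch.setitem(sys.modules, ...)` does the same thing but restores the original entry automatically when the test finishes.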
    def _make_package(self, tmp_path, vectors=None):
        """Helper: create a minimal Pinecone package JSON."""
        if vectors is None:
            vectors = [{"id": "a", "metadata": {"text": "hello world"}}]
        pkg = tmp_path / "test-pinecone.json"
        pkg.write_text(json.dumps({
            "vectors": vectors,
            "index_name": "test",
            "namespace": "test",
            "metric": "cosine",
            "dimension": 1536,
        }))
        return pkg

    def test_upload_success_has_url_key(self, tmp_path, monkeypatch):
        """upload() success return dict includes 'url' key (prevents KeyError in package_skill.py)."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
        monkeypatch.setattr(
            adaptor, "_generate_openai_embeddings",
            lambda docs: [[0.0] * 1536] * len(docs),
        )
        pkg = self._make_package(tmp_path)

        result = adaptor.upload(pkg, api_key="fake-key")
        assert result["success"] is True
        assert "url" in result  # key must exist to avoid KeyError in package_skill.py
        # Value should be None for Pinecone (no web URL)
        assert result["url"] is None

    def test_embedding_dimension_autodetect_st(self, tmp_path, monkeypatch):
        """sentence-transformers upload creates index with dimension=384."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
        monkeypatch.setattr(
            adaptor, "_generate_st_embeddings",
            lambda docs: [[0.0] * 384] * len(docs),
        )
        pkg = self._make_package(tmp_path)

        result = adaptor.upload(
            pkg, api_key="fake-key", embedding_function="sentence-transformers",
        )
        assert result["success"] is True
        # Verify create_index was called with dimension=384
        mock_pc.create_index.assert_called_once()
        call_kwargs = mock_pc.create_index.call_args
        assert call_kwargs.kwargs["dimension"] == 384

    def test_embedding_dimension_autodetect_openai(self, tmp_path, monkeypatch):
        """openai upload creates index with dimension=1536."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
        monkeypatch.setattr(
            adaptor, "_generate_openai_embeddings",
            lambda docs: [[0.0] * 1536] * len(docs),
        )
        pkg = self._make_package(tmp_path)

        result = adaptor.upload(
            pkg, api_key="fake-key", embedding_function="openai",
        )
        assert result["success"] is True
        mock_pc.create_index.assert_called_once()
        call_kwargs = mock_pc.create_index.call_args
        assert call_kwargs.kwargs["dimension"] == 1536

    def test_embedding_before_index_creation(self, tmp_path, monkeypatch):
        """If embedding generation fails, index is never created (no side-effects)."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)

        def fail_embeddings(docs):
            raise RuntimeError("OPENAI_API_KEY not set")

        monkeypatch.setattr(adaptor, "_generate_openai_embeddings", fail_embeddings)
        pkg = self._make_package(tmp_path)

        result = adaptor.upload(pkg, api_key="fake-key")
        assert result["success"] is False
        # Index must NOT have been created since embedding failed first
        mock_pc.create_index.assert_not_called()

    def test_embedding_dimension_explicit_override(self, tmp_path, monkeypatch):
        """Explicit dimension kwarg overrides both auto-detect and JSON file value."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
        monkeypatch.setattr(
            adaptor, "_generate_openai_embeddings",
            lambda docs: [[0.0] * 768] * len(docs),
        )
        pkg = self._make_package(tmp_path)

        result = adaptor.upload(
            pkg, api_key="fake-key", embedding_function="openai", dimension=768,
        )
        assert result["success"] is True
        mock_pc.create_index.assert_called_once()
        call_kwargs = mock_pc.create_index.call_args
        assert call_kwargs.kwargs["dimension"] == 768

    def test_deterministic_ids(self, sample_skill_dir):
        """IDs are deterministic — same input produces same ID."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        metadata = SkillMetadata(name="test-skill", description="Test")

        result1 = adaptor.format_skill_md(sample_skill_dir, metadata)
        result2 = adaptor.format_skill_md(sample_skill_dir, metadata)

        data1 = json.loads(result1)
        data2 = json.loads(result2)

        ids1 = [v["id"] for v in data1["vectors"]]
        ids2 = [v["id"] for v in data2["vectors"]]
        assert ids1 == ids2

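`test_deterministic_ids` pins down a property worth calling out: re-packaging the same skill must produce identical vector IDs, so an upsert overwrites stale vectors instead of accumulating duplicates. A content-hash scheme (purely illustrative, not the adaptor's actual ID code) achieves this:

```python
import hashlib

def vector_id(skill: str, source_file: str, chunk_index: int) -> str:
    # Same inputs always hash to the same ID (illustrative scheme only)
    raw = f"{skill}:{source_file}:{chunk_index}".encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:16]

assert vector_id("test-skill", "SKILL.md", 0) == vector_id("test-skill", "SKILL.md", 0)
assert vector_id("test-skill", "SKILL.md", 0) != vector_id("test-skill", "SKILL.md", 1)
```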
# ---------------------------------------------------------------------------
# doc_version Metadata Tests (cross-adaptor)
# ---------------------------------------------------------------------------


class TestDocVersionMetadata:
    """Test doc_version flows through all RAG adaptors."""

    def test_skill_metadata_has_doc_version(self):
        """SkillMetadata dataclass has doc_version field."""
        meta = SkillMetadata(name="test", description="test", doc_version="3.2")
        assert meta.doc_version == "3.2"

    def test_skill_metadata_doc_version_default_empty(self):
        """doc_version defaults to empty string."""
        meta = SkillMetadata(name="test", description="test")
        assert meta.doc_version == ""

    def test_read_frontmatter(self, sample_skill_dir):
        """_read_frontmatter reads doc_version from SKILL.md."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        fm = adaptor._read_frontmatter(sample_skill_dir)
        assert fm["doc_version"] == "16.2"
        assert fm["name"] == "test-skill"

    def test_read_frontmatter_missing(self, sample_skill_dir_no_doc_version):
        """_read_frontmatter returns empty string when doc_version is absent."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        fm = adaptor._read_frontmatter(sample_skill_dir_no_doc_version)
        assert fm.get("doc_version") is None  # key not present

    def test_build_skill_metadata_reads_doc_version(self, sample_skill_dir):
        """_build_skill_metadata populates doc_version from frontmatter."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        meta = adaptor._build_skill_metadata(sample_skill_dir)
        assert meta.doc_version == "16.2"
        assert meta.name == "test-skill"

    def test_build_skill_metadata_no_doc_version(self, sample_skill_dir_no_doc_version):
        """_build_skill_metadata defaults to empty string when frontmatter has no doc_version."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        meta = adaptor._build_skill_metadata(sample_skill_dir_no_doc_version)
        assert meta.doc_version == ""

    def test_build_metadata_dict_includes_doc_version(self):
        """_build_metadata_dict includes doc_version in output."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        meta = SkillMetadata(name="test", description="desc", doc_version="3.0")
        result = adaptor._build_metadata_dict(meta)
        assert "doc_version" in result
        assert result["doc_version"] == "3.0"

    def test_build_metadata_dict_empty_doc_version(self):
        """_build_metadata_dict preserves empty doc_version."""
        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor

        adaptor = PineconeAdaptor()
        meta = SkillMetadata(name="test", description="desc")
        result = adaptor._build_metadata_dict(meta)
        assert "doc_version" in result
        assert result["doc_version"] == ""

    @pytest.mark.parametrize(
        "platform",
        ["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"],
    )
    def test_doc_version_in_package_output(self, platform, sample_skill_dir, tmp_path):
        """doc_version appears in package output for all RAG adaptors."""
        from skill_seekers.cli.adaptors import get_adaptor

        adaptor = get_adaptor(platform)
        output_path = adaptor.package(sample_skill_dir, tmp_path)

        data = json.loads(output_path.read_text())

        # Each adaptor has a different structure — extract metadata dicts
        meta_list = _extract_metadata_from_package(platform, data)
        assert len(meta_list) > 0, f"No metadata found in {platform} output"

        for meta in meta_list:
            assert "doc_version" in meta, f"doc_version missing in {platform} metadata: {meta}"
            assert meta["doc_version"] == "16.2", (
                f"doc_version mismatch in {platform}: expected '16.2', got '{meta['doc_version']}'"
            )

    @pytest.mark.parametrize(
        "platform",
        ["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"],
    )
    def test_empty_doc_version_in_package_output(
        self, platform, sample_skill_dir_no_doc_version, tmp_path
    ):
        """Empty doc_version is preserved (not omitted) in all adaptors."""
        from skill_seekers.cli.adaptors import get_adaptor

        adaptor = get_adaptor(platform)
        output_path = adaptor.package(sample_skill_dir_no_doc_version, tmp_path)

        data = json.loads(output_path.read_text())
        meta_list = _extract_metadata_from_package(platform, data)
        assert len(meta_list) > 0

        for meta in meta_list:
            assert "doc_version" in meta


# Qdrant and Weaviate may not be installed — test separately if available
class TestDocVersionQdrant:
    """Test doc_version in Qdrant adaptor (may require qdrant client)."""

    def test_qdrant_doc_version(self, sample_skill_dir, tmp_path):
        from skill_seekers.cli.adaptors import ADAPTORS

        if "qdrant" not in ADAPTORS:
            pytest.skip("Qdrant adaptor not available")
        from skill_seekers.cli.adaptors import get_adaptor

        adaptor = get_adaptor("qdrant")
        output_path = adaptor.package(sample_skill_dir, tmp_path)
        data = json.loads(output_path.read_text())

        for point in data["points"]:
            assert "doc_version" in point["payload"]
            assert point["payload"]["doc_version"] == "16.2"


class TestWeaviateUploadReturnKeys:
    """Test Weaviate upload() return dict has required keys."""

    def test_weaviate_upload_success_has_url_key(self, sample_skill_dir, tmp_path, monkeypatch):
        """Weaviate upload() success return includes 'url' key (prevents KeyError in package_skill.py)."""
        import sys
        import types
        from unittest.mock import MagicMock

        from skill_seekers.cli.adaptors import ADAPTORS

        if "weaviate" not in ADAPTORS:
            pytest.skip("Weaviate adaptor not available")

        from skill_seekers.cli.adaptors.weaviate import WeaviateAdaptor

        adaptor = WeaviateAdaptor()

        # Stub the weaviate module
        mock_module = types.ModuleType("weaviate")
        mock_client = MagicMock()
        mock_client.is_ready.return_value = True
        mock_module.Client = MagicMock(return_value=mock_client)
        mock_module.AuthApiKey = MagicMock()
        monkeypatch.setitem(sys.modules, "weaviate", mock_module)

        # Create a minimal weaviate package
        output_path = adaptor.package(sample_skill_dir, tmp_path)
        result = adaptor.upload(output_path)

        assert result["success"] is True
        assert "url" in result
        assert result["url"] is None


class TestDocVersionWeaviate:
    """Test doc_version in Weaviate adaptor (may require weaviate client)."""

    def test_weaviate_doc_version(self, sample_skill_dir, tmp_path):
        from skill_seekers.cli.adaptors import ADAPTORS

        if "weaviate" not in ADAPTORS:
            pytest.skip("Weaviate adaptor not available")
        from skill_seekers.cli.adaptors import get_adaptor

        adaptor = get_adaptor("weaviate")
        output_path = adaptor.package(sample_skill_dir, tmp_path)
        data = json.loads(output_path.read_text())

        for obj in data["objects"]:
            assert "doc_version" in obj["properties"]
            assert obj["properties"]["doc_version"] == "16.2"

    def test_weaviate_schema_includes_doc_version(self, sample_skill_dir, tmp_path):
        from skill_seekers.cli.adaptors import ADAPTORS

        if "weaviate" not in ADAPTORS:
            pytest.skip("Weaviate adaptor not available")
        from skill_seekers.cli.adaptors import get_adaptor

        adaptor = get_adaptor("weaviate")
        output_path = adaptor.package(sample_skill_dir, tmp_path)
        data = json.loads(output_path.read_text())

        property_names = [p["name"] for p in data["schema"]["properties"]]
        assert "doc_version" in property_names


# ---------------------------------------------------------------------------
# CLI Flag Tests
# ---------------------------------------------------------------------------


class TestDocVersionCLIFlag:
    """Test --doc-version CLI flag is accepted."""

    def test_common_arguments_has_doc_version(self):
        """COMMON_ARGUMENTS includes doc_version."""
        from skill_seekers.cli.arguments.common import COMMON_ARGUMENTS

        assert "doc_version" in COMMON_ARGUMENTS

    def test_create_arguments_has_doc_version(self):
        """UNIVERSAL_ARGUMENTS includes doc_version."""
        from skill_seekers.cli.arguments.create import UNIVERSAL_ARGUMENTS

        assert "doc_version" in UNIVERSAL_ARGUMENTS

    def test_doc_version_flag_parsed(self):
        """--doc-version is parsed correctly by argparse."""
        import argparse

        from skill_seekers.cli.arguments.common import add_common_arguments

        parser = argparse.ArgumentParser()
        add_common_arguments(parser)
        args = parser.parse_args(["--doc-version", "16.2"])
        assert args.doc_version == "16.2"

    def test_doc_version_default_empty(self):
        """--doc-version defaults to empty string."""
        import argparse

        from skill_seekers.cli.arguments.common import add_common_arguments

        parser = argparse.ArgumentParser()
        add_common_arguments(parser)
        args = parser.parse_args([])
        assert args.doc_version == ""


# ---------------------------------------------------------------------------
# Package choices test
# ---------------------------------------------------------------------------


class TestPineconeInPackageChoices:
    """Test pinecone is in package CLI choices."""

    def test_pinecone_in_package_arguments(self):
        """pinecone is listed in package --target choices."""
        from skill_seekers.cli.arguments.package import PACKAGE_ARGUMENTS

        choices = PACKAGE_ARGUMENTS["target"]["kwargs"]["choices"]
        assert "pinecone" in choices


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------


def _extract_metadata_from_package(platform: str, data: dict) -> list[dict]:
    """Extract metadata dicts from adaptor-specific package format."""
    meta_list = []

    if platform == "pinecone":
        for vec in data.get("vectors", []):
            meta_list.append(vec.get("metadata", {}))
    elif platform == "chroma":
        for meta in data.get("metadatas", []):
            meta_list.append(meta)
    elif platform == "faiss":
        for meta in data.get("metadatas", []):
            meta_list.append(meta)
    elif platform == "langchain":
        for doc in data if isinstance(data, list) else []:
            meta_list.append(doc.get("metadata", {}))
    elif platform == "llama-index":
        for node in data if isinstance(data, list) else []:
            meta_list.append(node.get("metadata", {}))
    elif platform == "haystack":
        for doc in data if isinstance(data, list) else []:
            meta_list.append(doc.get("meta", {}))
    elif platform == "qdrant":
        for point in data.get("points", []):
            meta_list.append(point.get("payload", {}))
    elif platform == "weaviate":
        for obj in data.get("objects", []):
            meta_list.append(obj.get("properties", {}))

    return meta_list
@@ -151,6 +151,36 @@ class TestWeaviateUploadBasics:
        assert hasattr(adaptor, "_generate_openai_embeddings")


class TestEmbeddingMethodInheritance:
    """Test that shared embedding methods are properly inherited from base."""

    def test_chroma_inherits_openai_embeddings(self):
        """Test chroma adaptor gets _generate_openai_embeddings from base."""
        adaptor = get_adaptor("chroma")
        assert hasattr(adaptor, "_generate_openai_embeddings")
        # Verify it's the base class method, not a local override
        from skill_seekers.cli.adaptors.base import SkillAdaptor

        assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings

    def test_weaviate_inherits_both_embedding_methods(self):
        """Test weaviate adaptor gets both embedding methods from base."""
        adaptor = get_adaptor("weaviate")
        assert hasattr(adaptor, "_generate_openai_embeddings")
        assert hasattr(adaptor, "_generate_st_embeddings")
        from skill_seekers.cli.adaptors.base import SkillAdaptor

        assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings
        assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings

    def test_pinecone_inherits_both_embedding_methods(self):
        """Test pinecone adaptor gets both embedding methods from base."""
        adaptor = get_adaptor("pinecone")
        assert hasattr(adaptor, "_generate_openai_embeddings")
        assert hasattr(adaptor, "_generate_st_embeddings")
        from skill_seekers.cli.adaptors.base import SkillAdaptor

        assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings
        assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings


class TestPackageStructure:
    """Test that packages are correctly structured for upload."""
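The `__func__` identity checks in these tests distinguish inheritance from override: a bound method's `__func__` attribute is the underlying function object, so it is the very same object as the base class attribute only when the subclass did not redefine it. A self-contained illustration:

```python
class Base:
    def helper(self):
        return "base"

class Inheritor(Base):
    pass  # helper comes from Base unchanged

class Overrider(Base):
    def helper(self):  # redefined locally, new function object
        return "local"

# Inherited method shares its function object with the base class
assert Inheritor().helper.__func__ is Base.helper
# An override is a distinct function object
assert Overrider().helper.__func__ is not Base.helper
```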
@@ -16,6 +16,7 @@ Tests cover:
"""

import json
import os
import shutil
import tempfile
import unittest
@@ -456,6 +457,37 @@ class TestWordErrorHandling(unittest.TestCase):
        with self.assertRaises((KeyError, TypeError)):
            self.WordToSkillConverter({"docx_path": "test.docx"})

    def test_non_docx_file_raises_value_error(self):
        """extract_docx raises ValueError for non-.docx files."""
        # Create a real file with wrong extension
        txt_path = os.path.join(self.temp_dir, "test.txt")
        with open(txt_path, "w") as f:
            f.write("not a docx")
        config = {"name": "test", "docx_path": txt_path}
        converter = self.WordToSkillConverter(config)
        with self.assertRaises(ValueError):
            converter.extract_docx()

    def test_doc_file_raises_value_error(self):
        """extract_docx raises ValueError for .doc (old Word format)."""
        doc_path = os.path.join(self.temp_dir, "test.doc")
        with open(doc_path, "w") as f:
            f.write("not a docx")
        config = {"name": "test", "docx_path": doc_path}
        converter = self.WordToSkillConverter(config)
        with self.assertRaises(ValueError):
            converter.extract_docx()

    def test_no_extension_file_raises_value_error(self):
        """extract_docx raises ValueError for file with no extension."""
        no_ext_path = os.path.join(self.temp_dir, "document")
        with open(no_ext_path, "w") as f:
            f.write("not a docx")
        config = {"name": "test", "docx_path": no_ext_path}
        converter = self.WordToSkillConverter(config)
        with self.assertRaises(ValueError):
            converter.extract_docx()


class TestWordJSONWorkflow(unittest.TestCase):
    """Test building skills from extracted JSON."""
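The three validation tests above all exercise one guard. A plausible sketch of such a check (hypothetical; the converter's real `extract_docx` logic is not shown in this diff) rejects anything whose suffix is not `.docx`:

```python
from pathlib import Path

def validate_docx_path(path: str) -> Path:
    # Hypothetical guard: .doc, .txt, and extension-less paths all fail here
    p = Path(path)
    if p.suffix.lower() != ".docx":
        raise ValueError(f"Expected a .docx file, got: {p.name!r}")
    return p

assert validate_docx_path("notes.docx").name == "notes.docx"
for bad in ("report.doc", "notes.txt", "document"):
    try:
        validate_docx_path(bad)
    except ValueError:
        pass
    else:
        raise AssertionError(f"{bad} should have been rejected")
```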
56
uv.lock
generated
56
uv.lock
generated
@@ -3621,11 +3621,11 @@ wheels = [
 
 [[package]]
 name = "packaging"
-version = "25.0"
+version = "24.2"
 source = { registry = "https://pypi.org/simple" }
-sdist = { url = "https://files.pythonhosted.org/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload-time = "2025-04-19T11:48:59.673Z" }
+sdist = { url = "https://files.pythonhosted.org/packages/d0/63/68dbb6eb2de9cb10ee4c9c14a0148804425e13c4fb20d61cce69f53106da/packaging-24.2.tar.gz", hash = "sha256:c228a6dc5e932d346bc5739379109d49e8853dd8223571c7c5b55260edc0b97f", size = 163950, upload-time = "2024-11-08T09:47:47.202Z" }
 wheels = [
-    { url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload-time = "2025-04-19T11:48:57.875Z" },
+    { url = "https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl", hash = "sha256:09abb1bccd265c01f4a3aa3f7a7db064b36514d2cba19a2f694fe6150451a759", size = 65451, upload-time = "2024-11-08T09:47:44.722Z" },
 ]
 
 [[package]]
@@ -3797,6 +3797,46 @@ wheels = [
     { url = "https://files.pythonhosted.org/packages/2d/71/64e9b1c7f04ae0027f788a248e6297d7fcc29571371fe7d45495a78172c0/pillow-12.1.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:75af0b4c229ac519b155028fa1be632d812a519abba9b46b20e50c6caa184f19", size = 7029809, upload-time = "2026-01-02T09:13:26.541Z" },
 ]
 
+[[package]]
+name = "pinecone"
+version = "8.1.0"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "certifi" },
+    { name = "orjson" },
+    { name = "pinecone-plugin-assistant" },
+    { name = "pinecone-plugin-interface" },
+    { name = "python-dateutil" },
+    { name = "typing-extensions" },
+    { name = "urllib3" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/e2/e4/8303133de5b3850c85d56caf9cc23cc38c74942bb8a940890b225245d7df/pinecone-8.1.0.tar.gz", hash = "sha256:48a00843fb232ccfd57eba618f0c0294e918b030e1bc7e853fb88d04f80ba569", size = 1041965, upload-time = "2026-02-19T20:08:32.999Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/4e/f7/beee7033ef92e5964e570fc29a048627e298745916e65c66105378405d06/pinecone-8.1.0-py3-none-any.whl", hash = "sha256:b0ba9c55c9a072fbe4fc7381bc3e5eb1b14550a8007233a3368ada74b1747534", size = 742745, upload-time = "2026-02-19T20:08:31.319Z" },
+]
+
+[[package]]
+name = "pinecone-plugin-assistant"
+version = "3.0.2"
+source = { registry = "https://pypi.org/simple" }
+dependencies = [
+    { name = "packaging" },
+    { name = "requests" },
+]
+sdist = { url = "https://files.pythonhosted.org/packages/c4/16/dcaff42ddfeab75dccd17685a0db46489717c3d23753dc14c55770e12aa8/pinecone_plugin_assistant-3.0.2.tar.gz", hash = "sha256:04163af282ad7895b581ab89f850ed139e4ddcea72010cadfa4c573759d5c896", size = 152066, upload-time = "2026-02-01T09:08:48.04Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/4a/dd/8bc4f3baf6c03acfb0b300f5aba53d19cc3a319281da518182bf22671b92/pinecone_plugin_assistant-3.0.2-py3-none-any.whl", hash = "sha256:de21ff696219fcad6c7ec86a3d1f70875024314537758ab345b6230462342903", size = 280863, upload-time = "2026-02-01T09:08:49.384Z" },
+]
+
+[[package]]
+name = "pinecone-plugin-interface"
+version = "0.0.7"
+source = { registry = "https://pypi.org/simple" }
+sdist = { url = "https://files.pythonhosted.org/packages/f4/fb/e8a4063264953ead9e2b24d9b390152c60f042c951c47f4592e9996e57ff/pinecone_plugin_interface-0.0.7.tar.gz", hash = "sha256:b8e6675e41847333aa13923cc44daa3f85676d7157324682dc1640588a982846", size = 3370, upload-time = "2024-06-05T01:57:52.093Z" }
+wheels = [
+    { url = "https://files.pythonhosted.org/packages/3b/1d/a21fdfcd6d022cb64cef5c2a29ee6691c6c103c4566b41646b080b7536a5/pinecone_plugin_interface-0.0.7-py3-none-any.whl", hash = "sha256:875857ad9c9fc8bbc074dbe780d187a2afd21f5bfe0f3b08601924a61ef1bba8", size = 6249, upload-time = "2024-06-05T01:57:50.583Z" },
+]
+
 [[package]]
 name = "platformdirs"
 version = "4.9.2"
@@ -5405,6 +5445,7 @@ all = [
     { name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" },
     { name = "numpy", version = "2.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" },
     { name = "openai" },
+    { name = "pinecone" },
     { name = "python-docx" },
     { name = "sentence-transformers" },
     { name = "sse-starlette" },
@@ -5457,8 +5498,12 @@ mcp = [
 openai = [
     { name = "openai" },
 ]
+pinecone = [
+    { name = "pinecone" },
+]
 rag-upload = [
     { name = "chromadb" },
+    { name = "pinecone" },
     { name = "sentence-transformers" },
     { name = "weaviate-client" },
 ]
@@ -5533,6 +5578,9 @@ requires-dist = [
     { name = "openai", marker = "extra == 'openai'", specifier = ">=1.0.0" },
     { name = "pathspec", specifier = ">=0.12.1" },
     { name = "pillow", specifier = ">=11.0.0" },
+    { name = "pinecone", marker = "extra == 'all'", specifier = ">=5.0.0" },
+    { name = "pinecone", marker = "extra == 'pinecone'", specifier = ">=5.0.0" },
+    { name = "pinecone", marker = "extra == 'rag-upload'", specifier = ">=5.0.0" },
     { name = "pydantic", specifier = ">=2.12.3" },
     { name = "pydantic-settings", specifier = ">=2.11.0" },
     { name = "pygithub", specifier = ">=2.5.0" },
@@ -5563,7 +5611,7 @@ requires-dist = [
     { name = "weaviate-client", marker = "extra == 'rag-upload'", specifier = ">=3.25.0" },
     { name = "weaviate-client", marker = "extra == 'weaviate'", specifier = ">=3.25.0" },
 ]
-provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "chroma", "weaviate", "sentence-transformers", "rag-upload", "all-cloud", "embedding", "all"]
+provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "embedding", "all"]
 
 [package.metadata.requires-dev]
 dev = [
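The lockfile changes above back the new `pinecone` extra (`pip install skill-seekers[pinecone]`). As a rough illustration of what the new adaptor ships to Pinecone, here is a hedged sketch of turning chunk records into the `{id, values, metadata}` payloads that `Index.upsert` accepts — the helper name and record shape are assumptions for illustration, not the actual code in `pinecone_adaptor.py`:

```python
from typing import Any


def chunks_to_upserts(skill_name: str, chunks: list[dict[str, Any]],
                      embeddings: list[list[float]]) -> list[dict[str, Any]]:
    """Pair each chunk with its embedding as a Pinecone upsert record (sketch)."""
    records = []
    for i, (chunk, vector) in enumerate(zip(chunks, embeddings)):
        records.append({
            "id": f"{skill_name}-{i}",          # stable per-chunk id
            "values": vector,                    # embedding vector
            "metadata": {"skill": skill_name, "text": chunk["text"]},
        })
    return records


# Uploading would then be roughly (requires the pinecone package and an API key):
#   from pinecone import Pinecone
#   pc = Pinecone(api_key="...")
#   pc.Index("skills").upsert(vectors=chunks_to_upserts("my-skill", chunks, embeddings))
```

The embeddings themselves come from the shared `_generate_openai_embeddings()` / `_generate_st_embeddings()` helpers this commit moves into the `SkillAdaptor` base class.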