Merge branch 'development' into feature/video-scraper-pipeline

Sync with latest development changes including ruff formatting,
bug fixes, and pinecone adaptor additions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
This commit is contained in:
yusyus
2026-03-01 11:38:45 +03:00
43 changed files with 1988 additions and 261 deletions

View File

@@ -22,6 +22,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **`docx` optional dependency group** — `pip install skill-seekers[docx]` (mammoth + python-docx)
### Fixed
- **`--var` flag silently dropped in `create` routing** — `main.py` checked `args.workflow_var` but argparse stores the flag as `args.var`. Workflow variable overrides via `--var KEY=VALUE` were silently ignored. Fixed to read `args.var`.
- **Double `_score_code_quality()` call in word scraper** — `word_scraper.py` called `_score_code_quality(raw_text)` twice for every code-like paragraph (once to check threshold, once to assign). Consolidated to a single call.
- **`.docx` file extension validation** — `WordToSkillConverter` now validates the file has a `.docx` extension before attempting to parse. Non-`.docx` files (`.doc`, `.txt`, no extension) raise `ValueError` with a clear message instead of cryptic parse errors.
- **`--no-preserve-code` renamed to `--no-preserve-code-blocks`** — Flag name now matches the parameter it controls (`preserve_code_blocks`). Backward-compatible alias `--no-preserve-code` kept (hidden, removed in v4.0.0).
- **`--chunk-overlap-tokens` missing from `package` command** — Flag was defined in `create` and `scrape` but not `package`. Added to `PACKAGE_ARGUMENTS` and wired through `package_skill()` → `adaptor.package()` → `format_skill_md()` → `_maybe_chunk_content()` → `RAGChunker`.
- **Chunk overlap auto-scaling** — When `--chunk-tokens` is non-default but `--chunk-overlap-tokens` is default, overlap now auto-scales to `max(50, chunk_tokens // 10)` for better context preservation with large chunks.
- **Weaviate `ImportError` masked by generic handler** — `upload()` caught `Exception` before `ImportError`, so missing `sentence-transformers` produced a generic "Upload failed" message instead of the specific install instruction. Added `except ImportError` before `except Exception`.
- **Hardcoded chunk defaults in 12 adaptors** — All concrete adaptors (claude, gemini, openai, markdown, langchain, llama_index, haystack, chroma, faiss, qdrant, weaviate, pinecone) used hardcoded `512`/`50` for chunk token/overlap defaults. Replaced with `DEFAULT_CHUNK_TOKENS` and `DEFAULT_CHUNK_OVERLAP_TOKENS` constants from `arguments/common.py`.
- **RAG chunking crash (`AttributeError: output_dir`)** — `execute_scraping_and_building()` used `converter.output_dir` which doesn't exist on `DocToSkillConverter`. Changed to `Path(converter.skill_dir)`. Affected `--chunk-for-rag` flag on `scrape` command.
- **Issue #301: `setup.sh` fails on macOS with mismatched Python/pip** — `pip3` can point to a different Python than `python3` (e.g. pip3 → 3.9, python3 → 3.14), causing "no matching distribution" errors. Changed `setup.sh` to use `python3 -m pip` instead of bare `pip3` to guarantee the correct interpreter.
- **Issue #300: Selector fallback & dry-run link discovery** — `create https://reactflow.dev/` now finds 20+ pages (was 1). Root causes:
@@ -45,6 +53,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
- **Language detector method** — Fixed `detect_from_text` → `detect_from_code` in word scraper
### Changed
- **Shared embedding methods consolidated to base class** — `_generate_openai_embeddings()` and `_generate_st_embeddings()` moved from chroma/weaviate/pinecone adaptors into `SkillAdaptor` base class. All 3 adaptors now inherit these methods, eliminating ~150 lines of duplicated code.
- **Chunk constants centralized** — Added `DEFAULT_CHUNK_TOKENS = 512` and `DEFAULT_CHUNK_OVERLAP_TOKENS = 50` in `arguments/common.py`. Used across `rag_chunker.py`, `base.py`, `package_skill.py`, `create_command.py`, and all 12 concrete adaptors. No more magic numbers for chunk defaults.
- **Enhancement summarizer architecture** — Character-budget approach respects `target_ratio` for both code blocks and heading chunks, replacing hard limits with proportional allocation
## [3.1.3] - 2026-02-24

View File

@@ -10,7 +10,7 @@ English | [简体中文](https://github.com/yusufkaraaslan/Skill_Seekers/blob/ma
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![MCP Integration](https://img.shields.io/badge/MCP-Integrated-blue.svg)](https://modelcontextprotocol.io)
[![Tested](https://img.shields.io/badge/Tests-1880%2B%20Passing-brightgreen.svg)](tests/)
[![Tested](https://img.shields.io/badge/Tests-2283%2B%20Passing-brightgreen.svg)](tests/)
[![Project Board](https://img.shields.io/badge/Project-Board-purple.svg)](https://github.com/users/yusufkaraaslan/projects/2)
[![PyPI version](https://badge.fury.io/py/skill-seekers.svg)](https://pypi.org/project/skill-seekers/)
[![PyPI - Downloads](https://img.shields.io/pypi/dm/skill-seekers.svg)](https://pypi.org/project/skill-seekers/)

View File

@@ -309,6 +309,15 @@ package_path = adaptor.package(
)
```
#### Shared Embedding Methods
The base `SkillAdaptor` class provides two shared embedding methods inherited by all vector database adaptors (chroma, weaviate, pinecone):
- `_generate_openai_embeddings(texts, model)` -- Generate embeddings via the OpenAI API.
- `_generate_st_embeddings(texts, model)` -- Generate embeddings using a local sentence-transformers model.
These methods are available on any adaptor instance returned by `get_adaptor()` for vector database targets, so you do not need to implement embedding logic per-adaptor.
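Both methods page through large document lists in fixed-size batches. The pattern is roughly the following sketch, with a stand-in `fake_embed` function in place of the real OpenAI client or sentence-transformers model:

```python
def embed_in_batches(documents, embed, batch_size=100):
    """Embed documents in fixed-size batches, mirroring how the
    shared adaptor methods page through large document lists."""
    embeddings = []
    for i in range(0, len(documents), batch_size):
        batch = documents[i:i + batch_size]
        embeddings.extend(embed(batch))
    return embeddings

def fake_embed(batch):
    # Stand-in embedder: one 3-dimensional vector per document.
    return [[0.0, 0.0, 0.0] for _ in batch]
```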
---
### 6. Skill Upload API

View File

@@ -620,7 +620,8 @@ skill-seekers package SKILL_DIRECTORY [options]
| | `--batch-size` | 100 | Chunks per batch |
| | `--chunk-for-rag` | | Enable RAG chunking |
| | `--chunk-tokens` | 512 | Max tokens per chunk |
| | `--no-preserve-code` | | Allow code block splitting |
| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
| | `--no-preserve-code-blocks` | | Allow code block splitting |
**Supported Platforms:**

View File

@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
| `--chunk-for-rag` | auto | Enable chunking |
| `--chunk-tokens` | 512 | Tokens per chunk |
| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
| `--no-preserve-code` | - | Allow splitting code blocks |
| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at its default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
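The auto-scaling rule can be written out as a small helper. This is a sketch of the behavior described above, not the project's actual function; the constants mirror the documented defaults:

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def effective_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Auto-scale overlap to 10% of the chunk size when the user
    customized --chunk-tokens but left --chunk-overlap-tokens alone."""
    if (
        overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS
        and chunk_tokens != DEFAULT_CHUNK_TOKENS
    ):
        return max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_tokens // 10)
    return overlap_tokens
```

An explicitly passed overlap always wins; scaling only kicks in when the overlap is still at its default.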
---

View File

@@ -598,7 +598,8 @@ skill-seekers package SKILL_DIRECTORY [options]
| | `--batch-size` | 100 | Chunks per batch |
| | `--chunk-for-rag` | | Enable RAG chunking |
| | `--chunk-tokens` | 512 | Max tokens per chunk |
| | `--no-preserve-code` | | Allow code block splitting |
| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
| | `--no-preserve-code-blocks` | | Allow code block splitting |
**Supported Platforms:**

View File

@@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \
| `--chunk-for-rag` | auto | Enable chunking |
| `--chunk-tokens` | 512 | Tokens per chunk |
| `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) |
| `--no-preserve-code` | - | Allow splitting code blocks |
| `--no-preserve-code-blocks` | - | Allow splitting code blocks |
> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at its default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
---

View File

@@ -144,10 +144,15 @@ sentence-transformers = [
"sentence-transformers>=2.2.0",
]
pinecone = [
"pinecone>=5.0.0",
]
rag-upload = [
"chromadb>=0.4.0",
"weaviate-client>=3.25.0",
"sentence-transformers>=2.2.0",
"pinecone>=5.0.0",
]
# All cloud storage providers combined
@@ -185,6 +190,7 @@ all = [
"azure-storage-blob>=12.19.0",
"chromadb>=0.4.0",
"weaviate-client>=3.25.0",
"pinecone>=5.0.0",
"fastapi>=0.109.0",
"sentence-transformers>=2.3.0",
"numpy>=1.24.0",

View File

@@ -64,6 +64,11 @@ try:
except ImportError:
HaystackAdaptor = None
try:
from .pinecone_adaptor import PineconeAdaptor
except ImportError:
PineconeAdaptor = None
# Registry of available adaptors
ADAPTORS: dict[str, type[SkillAdaptor]] = {}
@@ -91,6 +96,8 @@ if QdrantAdaptor:
ADAPTORS["qdrant"] = QdrantAdaptor
if HaystackAdaptor:
ADAPTORS["haystack"] = HaystackAdaptor
if PineconeAdaptor:
ADAPTORS["pinecone"] = PineconeAdaptor
def get_adaptor(platform: str, config: dict = None) -> SkillAdaptor:

View File

@@ -11,6 +11,8 @@ from dataclasses import dataclass, field
from pathlib import Path
from typing import Any
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
@dataclass
class SkillMetadata:
@@ -19,6 +21,7 @@ class SkillMetadata:
name: str
description: str
version: str = "1.0.0"
doc_version: str = "" # Documentation version (e.g., "16.2") for RAG metadata filtering
author: str | None = None
tags: list[str] = field(default_factory=list)
@@ -73,8 +76,9 @@ class SkillAdaptor(ABC):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill for platform (ZIP, tar.gz, etc.).
@@ -228,6 +232,47 @@ class SkillAdaptor(ABC):
return skill_md_path.read_text(encoding="utf-8")
    def _read_frontmatter(self, skill_dir: Path) -> dict[str, str]:
        """Read YAML frontmatter from SKILL.md.

        Args:
            skill_dir: Path to skill directory

        Returns:
            Dict of key-value pairs from the frontmatter block.
        """
        content = self._read_skill_md(skill_dir)
        if content.startswith("---"):
            parts = content.split("---", 2)
            if len(parts) >= 3:
                frontmatter: dict[str, str] = {}
                for line in parts[1].strip().splitlines():
                    if ":" in line:
                        key, _, value = line.partition(":")
                        frontmatter[key.strip()] = value.strip()
                return frontmatter
        return {}
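As a standalone check of the parsing rule, the same split-and-partition logic applied to a sample string (this mirrors the method above but is framed as a free function for illustration):

```python
def read_frontmatter(content: str) -> dict:
    """Parse a YAML-ish frontmatter block: split on the --- fences,
    then take each key: value line as a plain string pair."""
    if content.startswith("---"):
        parts = content.split("---", 2)
        if len(parts) >= 3:
            fm = {}
            for line in parts[1].strip().splitlines():
                if ":" in line:
                    key, _, value = line.partition(":")
                    fm[key.strip()] = value.strip()
            return fm
    return {}

sample = "---\nname: demo\nversion: 2.0.0\n---\n# Body\n"
```

Note that values are kept as strings; no YAML type coercion is attempted.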
    def _build_skill_metadata(self, skill_dir: Path) -> SkillMetadata:
        """Build SkillMetadata from SKILL.md frontmatter.

        Reads name, description, version, and doc_version from frontmatter
        instead of using hardcoded defaults.

        Args:
            skill_dir: Path to skill directory

        Returns:
            SkillMetadata populated from frontmatter values.
        """
        fm = self._read_frontmatter(skill_dir)
        return SkillMetadata(
            name=skill_dir.name,
            description=fm.get("description", f"Documentation for {skill_dir.name}"),
            version=fm.get("version", "1.0.0"),
            doc_version=fm.get("doc_version", ""),
        )
def _iterate_references(self, skill_dir: Path):
"""
Iterate over all reference files in skill directory.
@@ -266,6 +311,7 @@ class SkillAdaptor(ABC):
base_meta = {
"source": metadata.name,
"version": metadata.version,
"doc_version": metadata.doc_version,
"description": metadata.description,
}
if metadata.author:
@@ -280,9 +326,10 @@ class SkillAdaptor(ABC):
content: str,
metadata: dict,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
source_file: str | None = None,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> list[tuple[str, dict]]:
"""
Optionally chunk content for RAG platforms.
@@ -321,9 +368,18 @@ class SkillAdaptor(ABC):
return [(content, metadata)]
# RAGChunker uses TOKENS (it converts to chars internally)
# If overlap is at the default value but chunk size was customized,
# scale overlap proportionally (10% of chunk size, min DEFAULT_CHUNK_OVERLAP_TOKENS)
effective_overlap = chunk_overlap_tokens
if (
chunk_overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS
and chunk_max_tokens != DEFAULT_CHUNK_TOKENS
):
effective_overlap = max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_max_tokens // 10)
chunker = RAGChunker(
chunk_size=chunk_max_tokens,
chunk_overlap=max(50, chunk_max_tokens // 10), # 10% overlap
chunk_overlap=effective_overlap,
preserve_code_blocks=preserve_code_blocks,
preserve_paragraphs=True,
min_chunk_size=100, # 100 tokens minimum
@@ -433,6 +489,67 @@ class SkillAdaptor(ABC):
# Plain hex digest
return hash_hex
    def _generate_openai_embeddings(
        self, documents: list[str], api_key: str | None = None
    ) -> list[list[float]]:
        """Generate embeddings using OpenAI text-embedding-3-small.

        Args:
            documents: List of document texts
            api_key: OpenAI API key (or uses OPENAI_API_KEY env var)

        Returns:
            List of embedding vectors
        """
        import os

        try:
            from openai import OpenAI
        except ImportError:
            raise ImportError("openai not installed. Run: pip install openai") from None

        api_key = api_key or os.getenv("OPENAI_API_KEY")
        if not api_key:
            raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")

        client = OpenAI(api_key=api_key)
        embeddings: list[list[float]] = []
        batch_size = 100
        print(f" Generating OpenAI embeddings for {len(documents)} documents...")
        for i in range(0, len(documents), batch_size):
            batch = documents[i : i + batch_size]
            try:
                response = client.embeddings.create(input=batch, model="text-embedding-3-small")
                embeddings.extend([item.embedding for item in response.data])
                print(f" ✓ Embedded {min(i + batch_size, len(documents))}/{len(documents)}")
            except Exception as e:
                raise Exception(f"OpenAI embedding generation failed: {e}") from e
        return embeddings
    def _generate_st_embeddings(self, documents: list[str]) -> list[list[float]]:
        """Generate embeddings using sentence-transformers (all-MiniLM-L6-v2).

        Args:
            documents: List of document texts

        Returns:
            List of embedding vectors
        """
        try:
            from sentence_transformers import SentenceTransformer
        except ImportError:
            raise ImportError(
                "sentence-transformers not installed. Run: pip install sentence-transformers"
            ) from None

        print(f" Generating sentence-transformer embeddings for {len(documents)} documents...")
        model = SentenceTransformer("all-MiniLM-L6-v2")
        embeddings = model.encode(documents, show_progress_bar=True)
        return [emb.tolist() for emb in embeddings]
def _generate_toc(self, skill_dir: Path) -> str:
"""
Helper to generate table of contents from references.

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class ChromaAdaptor(SkillAdaptor):
@@ -79,6 +80,7 @@ class ChromaAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -86,9 +88,12 @@ class ChromaAdaptor(SkillAdaptor):
content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to parallel arrays
@@ -109,6 +114,7 @@ class ChromaAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -116,9 +122,12 @@ class ChromaAdaptor(SkillAdaptor):
ref_content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to parallel arrays
@@ -144,8 +153,9 @@ class ChromaAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for Chroma.
@@ -166,12 +176,8 @@ class ChromaAdaptor(SkillAdaptor):
output_path = self._format_output_path(skill_dir, Path(output_path), "-chroma.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"Chroma collection data for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate Chroma data
chroma_json = self.format_skill_md(
@@ -180,6 +186,7 @@ class ChromaAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file
@@ -206,7 +213,7 @@ class ChromaAdaptor(SkillAdaptor):
return output_path
def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]:
def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
"""
Upload packaged skill to ChromaDB.
@@ -250,9 +257,7 @@ class ChromaAdaptor(SkillAdaptor):
print(f"🌐 Connecting to ChromaDB at: {chroma_url}")
# Parse URL
if "://" in chroma_url:
parts = chroma_url.split("://")
parts[0]
host_port = parts[1]
_scheme, host_port = chroma_url.split("://", 1)
else:
host_port = chroma_url
@@ -352,52 +357,6 @@ class ChromaAdaptor(SkillAdaptor):
except Exception as e:
return {"success": False, "message": f"Upload failed: {e}"}
def _generate_openai_embeddings(
self, documents: list[str], api_key: str = None
) -> list[list[float]]:
"""
Generate embeddings using OpenAI API.
Args:
documents: List of document texts
api_key: OpenAI API key (or uses OPENAI_API_KEY env var)
Returns:
List of embedding vectors
"""
import os
try:
from openai import OpenAI
except ImportError:
raise ImportError("openai not installed. Run: pip install openai") from None
api_key = api_key or os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")
client = OpenAI(api_key=api_key)
# Batch process (OpenAI allows up to 2048 inputs)
embeddings = []
batch_size = 100
print(f" Generating embeddings for {len(documents)} documents...")
for i in range(0, len(documents), batch_size):
batch = documents[i : i + batch_size]
try:
response = client.embeddings.create(
input=batch,
model="text-embedding-3-small", # Cheapest, fastest
)
embeddings.extend([item.embedding for item in response.data])
print(f" ✓ Processed {min(i + batch_size, len(documents))}/{len(documents)}")
except Exception as e:
raise Exception(f"OpenAI embedding generation failed: {e}") from e
return embeddings
def validate_api_key(self, _api_key: str) -> bool:
"""
Chroma format doesn't use API keys for packaging.

View File

@@ -12,6 +12,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class ClaudeAdaptor(SkillAdaptor):
@@ -86,8 +87,9 @@ version: {metadata.version}
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into ZIP file for Claude.

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class FAISSHelpers(SkillAdaptor):
@@ -81,6 +82,7 @@ class FAISSHelpers(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -88,9 +90,12 @@ class FAISSHelpers(SkillAdaptor):
content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to parallel arrays
@@ -110,6 +115,7 @@ class FAISSHelpers(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -117,9 +123,12 @@ class FAISSHelpers(SkillAdaptor):
ref_content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to parallel arrays
@@ -155,8 +164,9 @@ class FAISSHelpers(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for FAISS.
@@ -176,12 +186,8 @@ class FAISSHelpers(SkillAdaptor):
output_path = self._format_output_path(skill_dir, Path(output_path), "-faiss.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"FAISS data for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate FAISS data
faiss_json = self.format_skill_md(
@@ -190,6 +196,7 @@ class FAISSHelpers(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file

View File

@@ -13,6 +13,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class GeminiAdaptor(SkillAdaptor):
@@ -91,8 +92,9 @@ See the references directory for complete documentation with examples and best p
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into tar.gz file for Gemini.

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class HaystackAdaptor(SkillAdaptor):
@@ -62,6 +63,7 @@ class HaystackAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -69,9 +71,12 @@ class HaystackAdaptor(SkillAdaptor):
content,
doc_meta,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as documents
@@ -95,6 +100,7 @@ class HaystackAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -102,9 +108,12 @@ class HaystackAdaptor(SkillAdaptor):
ref_content,
doc_meta,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as documents
@@ -124,8 +133,9 @@ class HaystackAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for Haystack.
@@ -147,11 +157,8 @@ class HaystackAdaptor(SkillAdaptor):
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"Haystack documents for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate Haystack documents
documents_json = self.format_skill_md(
@@ -160,6 +167,7 @@ class HaystackAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class LangChainAdaptor(SkillAdaptor):
@@ -62,6 +63,7 @@ class LangChainAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -69,9 +71,12 @@ class LangChainAdaptor(SkillAdaptor):
content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to documents
@@ -90,6 +95,7 @@ class LangChainAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -97,9 +103,12 @@ class LangChainAdaptor(SkillAdaptor):
ref_content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks to documents
@@ -114,8 +123,9 @@ class LangChainAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for LangChain.
@@ -139,12 +149,8 @@ class LangChainAdaptor(SkillAdaptor):
output_path = self._format_output_path(skill_dir, Path(output_path), "-langchain.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"LangChain documents for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate LangChain documents with chunking
documents_json = self.format_skill_md(
@@ -153,6 +159,7 @@ class LangChainAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class LlamaIndexAdaptor(SkillAdaptor):
@@ -77,6 +78,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -84,9 +86,12 @@ class LlamaIndexAdaptor(SkillAdaptor):
content,
node_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as nodes
@@ -112,6 +117,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -119,9 +125,12 @@ class LlamaIndexAdaptor(SkillAdaptor):
ref_content,
node_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as nodes
@@ -143,8 +152,9 @@ class LlamaIndexAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for LlamaIndex.
@@ -166,11 +176,8 @@ class LlamaIndexAdaptor(SkillAdaptor):
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"LlamaIndex nodes for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate LlamaIndex nodes
nodes_json = self.format_skill_md(
@@ -179,6 +186,7 @@ class LlamaIndexAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class MarkdownAdaptor(SkillAdaptor):
@@ -86,8 +87,9 @@ Browse the reference files for detailed information on each topic. All files are
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into ZIP file with markdown documentation.

View File

@@ -12,6 +12,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class OpenAIAdaptor(SkillAdaptor):
@@ -108,8 +109,9 @@ Always prioritize accuracy by consulting the attached documentation files before
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into ZIP file for OpenAI Assistants.

View File

@@ -0,0 +1,405 @@
#!/usr/bin/env python3
"""
Pinecone Adaptor
Implements Pinecone vector database format for RAG pipelines.
Converts Skill Seekers documentation into Pinecone-compatible format
with namespace support and batch upsert.
"""
import json
from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
# Pinecone metadata value limit: 40 KB per vector
PINECONE_METADATA_BYTES_LIMIT = 40_000
class PineconeAdaptor(SkillAdaptor):
"""
Pinecone vector database adaptor.
Handles:
- Pinecone-compatible vector format with metadata
- Namespace support for multi-tenant indexing
- Batch upsert (100 vectors per batch)
- OpenAI and sentence-transformers embedding generation
- Metadata truncation to stay within Pinecone's 40KB limit
"""
PLATFORM = "pinecone"
PLATFORM_NAME = "Pinecone (Vector Database)"
DEFAULT_API_ENDPOINT = None
def _generate_id(self, content: str, metadata: dict) -> str:
"""Generate deterministic ID from content and metadata."""
return self._generate_deterministic_id(content, metadata, format="hex")
def _truncate_text_for_metadata(
self, text: str, max_bytes: int = PINECONE_METADATA_BYTES_LIMIT
) -> str:
"""Truncate text to fit within Pinecone's metadata byte limit.
Pinecone limits metadata to 40KB per vector. This truncates
the text field (largest metadata value) to stay within limits,
leaving room for other metadata fields (~2KB overhead).
Args:
text: Text content to potentially truncate
max_bytes: Maximum bytes for the text field
Returns:
Truncated text that fits within the byte limit
"""
# Reserve ~2KB for other metadata fields
available = max_bytes - 2000
encoded = text.encode("utf-8")
if len(encoded) <= available:
return text
# Truncate at byte boundary, decode safely
truncated = encoded[:available].decode("utf-8", errors="ignore")
return truncated
def format_skill_md(
self, skill_dir: Path, metadata: SkillMetadata, enable_chunking: bool = False, **kwargs
) -> str:
"""
Format skill as JSON for Pinecone ingestion.
Creates a package with vectors ready for upsert:
{
"index_name": "...",
"namespace": "...",
"dimension": 1536,
"metric": "cosine",
"vectors": [
{
"id": "hex-id",
"metadata": {
"text": "content",
"source": "...",
"category": "...",
...
}
}
]
}
No ``values`` field — embeddings are added at upload time.
Args:
skill_dir: Path to skill directory
metadata: Skill metadata
enable_chunking: Enable intelligent chunking for large documents
**kwargs: Additional chunking parameters
Returns:
JSON string containing Pinecone-compatible data
"""
vectors: list[dict[str, Any]] = []
# Convert SKILL.md (main documentation)
skill_md_path = skill_dir / "SKILL.md"
if skill_md_path.exists():
content = self._read_existing_content(skill_dir)
if content.strip():
doc_metadata = {
"source": metadata.name,
"category": "overview",
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
chunks = self._maybe_chunk_content(
content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
for chunk_text, chunk_meta in chunks:
vectors.append(
{
"id": self._generate_id(chunk_text, chunk_meta),
"metadata": {
**chunk_meta,
"text": self._truncate_text_for_metadata(chunk_text),
},
}
)
# Convert all reference files
for ref_file, ref_content in self._iterate_references(skill_dir):
if ref_content.strip():
category = ref_file.stem.replace("_", " ").lower()
doc_metadata = {
"source": metadata.name,
"category": category,
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
chunks = self._maybe_chunk_content(
ref_content,
doc_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
for chunk_text, chunk_meta in chunks:
vectors.append(
{
"id": self._generate_id(chunk_text, chunk_meta),
"metadata": {
**chunk_meta,
"text": self._truncate_text_for_metadata(chunk_text),
},
}
)
index_name = metadata.name.replace("_", "-").lower()
return json.dumps(
{
"index_name": index_name,
"namespace": index_name,
"dimension": 1536,
"metric": "cosine",
"vectors": vectors,
},
indent=2,
ensure_ascii=False,
)
def package(
self,
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for Pinecone.
Creates a JSON file containing vectors with metadata, ready for
embedding generation and upsert to a Pinecone index.
Args:
skill_dir: Path to skill directory
output_path: Output path/filename for JSON file
enable_chunking: Enable intelligent chunking for large documents
chunk_max_tokens: Maximum tokens per chunk (default: 512)
preserve_code_blocks: Preserve code blocks during chunking
chunk_overlap_tokens: Overlap between chunks in tokens (default: 50)
Returns:
Path to created JSON file
"""
skill_dir = Path(skill_dir)
output_path = self._format_output_path(skill_dir, Path(output_path), "-pinecone.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
pinecone_json = self.format_skill_md(
skill_dir,
metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
output_path.write_text(pinecone_json, encoding="utf-8")
print("\n✅ Pinecone data packaged successfully!")
print(f"📦 Output: {output_path}")
data = json.loads(pinecone_json)
print(f"📊 Total vectors: {len(data['vectors'])}")
print(f"🗂️ Index name: {data['index_name']}")
print(f"📁 Namespace: {data['namespace']}")
print(f"📐 Default dimension: {data['dimension']} (auto-detected at upload time)")
# Show category breakdown
categories: dict[str, int] = {}
for vec in data["vectors"]:
cat = vec["metadata"].get("category", "unknown")
categories[cat] = categories.get(cat, 0) + 1
print("📁 Categories:")
for cat, count in sorted(categories.items()):
print(f" - {cat}: {count}")
return output_path
def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
"""
Upload packaged skill to Pinecone.
Args:
package_path: Path to packaged JSON
api_key: Pinecone API key (or uses PINECONE_API_KEY env var)
**kwargs:
index_name: Override index name from JSON
namespace: Override namespace from JSON
dimension: Embedding dimension (default: 1536)
metric: Distance metric (default: "cosine")
embedding_function: "openai" or "sentence-transformers"
cloud: Cloud provider (default: "aws")
region: Cloud region (default: "us-east-1")
Returns:
{"success": bool, "index": str, "namespace": str, "count": int}
"""
import os
try:
from pinecone import Pinecone, ServerlessSpec
except Exception:
return {
"success": False,
"message": "pinecone not installed. Run: pip install 'pinecone>=5.0.0'",
}
api_key = api_key or os.getenv("PINECONE_API_KEY")
if not api_key:
return {
"success": False,
"message": ("PINECONE_API_KEY not set. Set via env var or pass api_key parameter."),
}
# Load package
with open(package_path) as f:
data = json.load(f)
index_name = kwargs.get("index_name", data.get("index_name", "skill-docs"))
namespace = kwargs.get("namespace", data.get("namespace", ""))
metric = kwargs.get("metric", data.get("metric", "cosine"))
cloud = kwargs.get("cloud", "aws")
region = kwargs.get("region", "us-east-1")
# Auto-detect dimension from embedding model
embedding_function = kwargs.get("embedding_function", "openai")
EMBEDDING_DIMENSIONS = {
"openai": 1536, # text-embedding-3-small
"sentence-transformers": 384, # all-MiniLM-L6-v2
}
# Priority: explicit kwarg > model-based auto-detect > JSON file > fallback
# Note: format_skill_md() hardcodes dimension=1536 in the JSON, so we must
# give EMBEDDING_DIMENSIONS priority over the file to handle sentence-transformers (384).
dimension = kwargs.get(
"dimension",
EMBEDDING_DIMENSIONS.get(embedding_function, data.get("dimension", 1536)),
)
try:
# Generate embeddings FIRST — before creating the index.
# This avoids leaving an empty Pinecone index behind when
# embedding generation fails (e.g. missing API key).
texts = [vec["metadata"]["text"] for vec in data["vectors"]]
if embedding_function == "openai":
embeddings = self._generate_openai_embeddings(texts)
elif embedding_function == "sentence-transformers":
embeddings = self._generate_st_embeddings(texts)
else:
return {
"success": False,
"message": f"Unknown embedding_function: {embedding_function}. Use 'openai' or 'sentence-transformers'.",
}
pc = Pinecone(api_key=api_key)
# Create index if it doesn't exist
existing_indexes = [idx.name for idx in pc.list_indexes()]
if index_name not in existing_indexes:
print(
f"🔧 Creating Pinecone index: {index_name} (dimension={dimension}, metric={metric})"
)
pc.create_index(
name=index_name,
dimension=dimension,
metric=metric,
spec=ServerlessSpec(cloud=cloud, region=region),
)
print(f"✅ Index '{index_name}' created")
else:
print(f" Using existing index: {index_name}")
index = pc.Index(index_name)
# Batch upsert (100 per batch — Pinecone recommendation)
batch_size = 100
vectors_to_upsert = []
for i, vec in enumerate(data["vectors"]):
vectors_to_upsert.append(
{
"id": vec["id"],
"values": embeddings[i],
"metadata": vec["metadata"],
}
)
total = len(vectors_to_upsert)
print(f"🔄 Upserting {total} vectors to Pinecone...")
for i in range(0, total, batch_size):
batch = vectors_to_upsert[i : i + batch_size]
index.upsert(vectors=batch, namespace=namespace)
print(f" ✓ Upserted {min(i + batch_size, total)}/{total}")
print(f"✅ Uploaded {total} vectors to Pinecone index '{index_name}'")
return {
"success": True,
"message": f"Uploaded {total} vectors to Pinecone index '{index_name}' (namespace: '{namespace}')",
"url": None,
"index": index_name,
"namespace": namespace,
"count": total,
}
except Exception as e:
return {"success": False, "message": f"Pinecone upload failed: {e}"}
def validate_api_key(self, _api_key: str) -> bool:
"""Pinecone doesn't need API key for packaging."""
return False
def get_env_var_name(self) -> str:
"""Return the expected env var for Pinecone API key."""
return "PINECONE_API_KEY"
def supports_enhancement(self) -> bool:
"""Pinecone format doesn't support AI enhancement."""
return False
def enhance(self, _skill_dir: Path, _api_key: str) -> bool:
"""Pinecone format doesn't support enhancement."""
print("❌ Pinecone format does not support enhancement")
print(" Enhance before packaging:")
print(" skill-seekers enhance output/skill/ --mode LOCAL")
print(" skill-seekers package output/skill/ --target pinecone")
return False
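As a rough usage sketch (hypothetical data, matching the package shape emitted by `PineconeAdaptor.format_skill_md()` above), the generated JSON can be inspected before upserting — note that vectors carry metadata only, with embeddings added at upload time:

```python
# A minimal package in the shape described in format_skill_md()'s docstring
# (hypothetical data for illustration)
package = {
    "index_name": "mylib",
    "namespace": "mylib",
    "dimension": 1536,
    "metric": "cosine",
    "vectors": [
        {"id": "a1", "metadata": {"text": "Overview...", "category": "overview"}},
        {"id": "b2", "metadata": {"text": "API notes...", "category": "api"}},
        {"id": "c3", "metadata": {"text": "More API...", "category": "api"}},
    ],
}

# No vector has a "values" field yet — embeddings are generated in upload()
assert all("values" not in vec for vec in package["vectors"])

# Category breakdown, mirroring the summary printed by package()
categories: dict[str, int] = {}
for vec in package["vectors"]:
    cat = vec["metadata"].get("category", "unknown")
    categories[cat] = categories.get(cat, 0) + 1
print(categories)  # {'overview': 1, 'api': 2}
```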

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class QdrantAdaptor(SkillAdaptor):
@@ -76,6 +77,7 @@ class QdrantAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -83,9 +85,12 @@ class QdrantAdaptor(SkillAdaptor):
content,
payload_meta,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as points
@@ -109,6 +114,7 @@ class QdrantAdaptor(SkillAdaptor):
"file": chunk_meta.get("file", "SKILL.md"),
"type": chunk_meta.get("type", "documentation"),
"version": chunk_meta.get("version", metadata.version),
"doc_version": chunk_meta.get("doc_version", ""),
},
}
)
@@ -124,6 +130,7 @@ class QdrantAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -131,9 +138,12 @@ class QdrantAdaptor(SkillAdaptor):
ref_content,
payload_meta,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as points
@@ -157,6 +167,7 @@ class QdrantAdaptor(SkillAdaptor):
"file": chunk_meta.get("file", ref_file.name),
"type": chunk_meta.get("type", "reference"),
"version": chunk_meta.get("version", metadata.version),
"doc_version": chunk_meta.get("doc_version", ""),
},
}
)
@@ -189,8 +200,9 @@ class QdrantAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for Qdrant.
@@ -211,11 +223,8 @@ class QdrantAdaptor(SkillAdaptor):
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"Qdrant data for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate Qdrant data
qdrant_json = self.format_skill_md(
@@ -224,6 +233,7 @@ class QdrantAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file

View File

@@ -11,6 +11,7 @@ from pathlib import Path
from typing import Any
from .base import SkillAdaptor, SkillMetadata
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
class WeaviateAdaptor(SkillAdaptor):
@@ -96,7 +97,14 @@ class WeaviateAdaptor(SkillAdaptor):
{
"name": "version",
"dataType": ["text"],
"description": "Documentation version",
"description": "Skill package version",
"indexFilterable": True,
"indexSearchable": False,
},
{
"name": "doc_version",
"dataType": ["text"],
"description": "Documentation version (e.g., 16.2)",
"indexFilterable": True,
"indexSearchable": False,
},
@@ -137,6 +145,7 @@ class WeaviateAdaptor(SkillAdaptor):
"file": "SKILL.md",
"type": "documentation",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -144,9 +153,12 @@ class WeaviateAdaptor(SkillAdaptor):
content,
obj_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file="SKILL.md",
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as objects
@@ -161,6 +173,7 @@ class WeaviateAdaptor(SkillAdaptor):
"file": chunk_meta.get("file", "SKILL.md"),
"type": chunk_meta.get("type", "documentation"),
"version": chunk_meta.get("version", metadata.version),
"doc_version": chunk_meta.get("doc_version", ""),
},
}
)
@@ -177,6 +190,7 @@ class WeaviateAdaptor(SkillAdaptor):
"file": ref_file.name,
"type": "reference",
"version": metadata.version,
"doc_version": metadata.doc_version,
}
# Chunk if enabled
@@ -184,9 +198,12 @@ class WeaviateAdaptor(SkillAdaptor):
ref_content,
obj_metadata,
enable_chunking=enable_chunking,
chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
source_file=ref_file.name,
chunk_overlap_tokens=kwargs.get(
"chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS
),
)
# Add all chunks as objects
@@ -201,6 +218,7 @@ class WeaviateAdaptor(SkillAdaptor):
"file": chunk_meta.get("file", ref_file.name),
"type": chunk_meta.get("type", "reference"),
"version": chunk_meta.get("version", metadata.version),
"doc_version": chunk_meta.get("doc_version", ""),
},
}
)
@@ -221,8 +239,9 @@ class WeaviateAdaptor(SkillAdaptor):
skill_dir: Path,
output_path: Path,
enable_chunking: bool = False,
chunk_max_tokens: int = 512,
chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
preserve_code_blocks: bool = True,
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
) -> Path:
"""
Package skill into JSON file for Weaviate.
@@ -245,12 +264,8 @@ class WeaviateAdaptor(SkillAdaptor):
output_path = self._format_output_path(skill_dir, Path(output_path), "-weaviate.json")
output_path.parent.mkdir(parents=True, exist_ok=True)
# Read metadata
metadata = SkillMetadata(
name=skill_dir.name,
description=f"Weaviate objects for {skill_dir.name}",
version="1.0.0",
)
# Read metadata from SKILL.md frontmatter
metadata = self._build_skill_metadata(skill_dir)
# Generate Weaviate objects
weaviate_json = self.format_skill_md(
@@ -259,6 +274,7 @@ class WeaviateAdaptor(SkillAdaptor):
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
# Write to file
@@ -288,7 +304,7 @@ class WeaviateAdaptor(SkillAdaptor):
return output_path
def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]:
def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
"""
Upload packaged skill to Weaviate.
@@ -382,31 +398,20 @@ class WeaviateAdaptor(SkillAdaptor):
print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
elif embedding_function == "sentence-transformers":
# Use sentence-transformers
print("🔄 Generating sentence-transformer embeddings and uploading...")
try:
from sentence_transformers import SentenceTransformer
# Use sentence-transformers (via shared base method)
contents = [obj["properties"]["content"] for obj in data["objects"]]
embeddings = self._generate_st_embeddings(contents)
model = SentenceTransformer("all-MiniLM-L6-v2")
contents = [obj["properties"]["content"] for obj in data["objects"]]
embeddings = model.encode(contents, show_progress_bar=True).tolist()
for i, obj in enumerate(data["objects"]):
batch.add_data_object(
data_object=obj["properties"],
class_name=data["class_name"],
uuid=obj["id"],
vector=embeddings[i],
)
for i, obj in enumerate(data["objects"]):
batch.add_data_object(
data_object=obj["properties"],
class_name=data["class_name"],
uuid=obj["id"],
vector=embeddings[i],
)
if (i + 1) % 100 == 0:
print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
except ImportError:
return {
"success": False,
"message": "sentence-transformers not installed. Run: pip install sentence-transformers",
}
if (i + 1) % 100 == 0:
print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects")
else:
# No embeddings - Weaviate will use its configured vectorizer
@@ -427,61 +432,16 @@ class WeaviateAdaptor(SkillAdaptor):
return {
"success": True,
"message": f"Uploaded {count} objects to Weaviate class '{data['class_name']}'",
"url": None,
"class_name": data["class_name"],
"count": count,
}
except ImportError as e:
return {"success": False, "message": str(e)}
except Exception as e:
return {"success": False, "message": f"Upload failed: {e}"}
def _generate_openai_embeddings(
self, documents: list[str], api_key: str = None
) -> list[list[float]]:
"""
Generate embeddings using OpenAI API.
Args:
documents: List of document texts
api_key: OpenAI API key (or uses OPENAI_API_KEY env var)
Returns:
List of embedding vectors
"""
import os
try:
from openai import OpenAI
except ImportError:
raise ImportError("openai not installed. Run: pip install openai") from None
api_key = api_key or os.getenv("OPENAI_API_KEY")
if not api_key:
raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key")
client = OpenAI(api_key=api_key)
# Batch process (OpenAI allows up to 2048 inputs)
embeddings = []
batch_size = 100
print(f" Generating embeddings for {len(documents)} documents...")
for i in range(0, len(documents), batch_size):
batch = documents[i : i + batch_size]
try:
response = client.embeddings.create(
input=batch,
model="text-embedding-3-small", # Cheapest, fastest
)
embeddings.extend([item.embedding for item in response.data])
print(
f" ✓ Generated {min(i + batch_size, len(documents))}/{len(documents)} embeddings"
)
except Exception as e:
raise Exception(f"OpenAI embedding generation failed: {e}") from e
return embeddings
def validate_api_key(self, _api_key: str) -> bool:
"""
Weaviate format doesn't use API keys for packaging.

View File

@@ -15,6 +15,10 @@ Hierarchy:
import argparse
from typing import Any
# Default chunking constants used by RAG and package arguments
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50
# Common argument definitions as data structure
# These are arguments that appear in MULTIPLE commands
COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
@@ -64,6 +68,15 @@ COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
"metavar": "KEY",
},
},
"doc_version": {
"flags": ("--doc-version",),
"kwargs": {
"type": str,
"default": "",
"help": "Documentation version tag for RAG metadata (e.g., '16.2')",
"metavar": "VERSION",
},
},
}
# Behavior arguments — runtime flags shared by every scraper
@@ -105,18 +118,18 @@ RAG_ARGUMENTS: dict[str, dict[str, Any]] = {
"flags": ("--chunk-tokens",),
"kwargs": {
"type": int,
"default": 512,
"default": DEFAULT_CHUNK_TOKENS,
"metavar": "TOKENS",
"help": "Chunk size in tokens for RAG (default: 512)",
"help": f"Chunk size in tokens for RAG (default: {DEFAULT_CHUNK_TOKENS})",
},
},
"chunk_overlap_tokens": {
"flags": ("--chunk-overlap-tokens",),
"kwargs": {
"type": int,
"default": 50,
"default": DEFAULT_CHUNK_OVERLAP_TOKENS,
"metavar": "TOKENS",
"help": "Overlap between chunks in tokens (default: 50)",
"help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
},
},
}
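The overlap auto-scaling described in the changelog can be sketched against these constants (`resolve_overlap` is a hypothetical helper name; the rule assumed is `max(50, chunk_tokens // 10)` when chunk size was changed but overlap was left at its default):

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50


def resolve_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Auto-scale overlap when chunk size is non-default but overlap is default."""
    if chunk_tokens != DEFAULT_CHUNK_TOKENS and overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS:
        return max(50, chunk_tokens // 10)
    return overlap_tokens


print(resolve_overlap(2048, 50))   # 204 — scaled up for large chunks
print(resolve_overlap(512, 50))    # 50 — both defaults, unchanged
print(resolve_overlap(1024, 100))  # 100 — explicit overlap respected
```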

View File

@@ -153,6 +153,15 @@ UNIVERSAL_ARGUMENTS: dict[str, dict[str, Any]] = {
"metavar": "PATH",
},
},
"doc_version": {
"flags": ("--doc-version",),
"kwargs": {
"type": str,
"default": "",
"help": "Documentation version tag for RAG metadata (e.g., '16.2')",
"metavar": "VERSION",
},
},
}
# Merge RAG arguments from common.py into universal arguments
@@ -655,3 +664,11 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default")
if mode in ["advanced", "all"]:
for arg_name, arg_def in ADVANCED_ARGUMENTS.items():
parser.add_argument(*arg_def["flags"], **arg_def["kwargs"])
# Deprecated alias for backward compatibility (removed in v4.0.0)
parser.add_argument(
"--no-preserve-code",
dest="no_preserve_code_blocks",
action="store_true",
help=argparse.SUPPRESS,
)

View File

@@ -8,6 +8,8 @@ import and use these definitions.
import argparse
from typing import Any
from .common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
# Positional argument
"skill_directory": {
@@ -49,6 +51,7 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
"chroma",
"faiss",
"qdrant",
"pinecone",
],
"default": "claude",
"help": "Target LLM platform (default: claude)",
@@ -109,13 +112,22 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = {
"flags": ("--chunk-tokens",),
"kwargs": {
"type": int,
"default": 512,
"help": "Maximum tokens per chunk (default: 512)",
"default": DEFAULT_CHUNK_TOKENS,
"help": f"Maximum tokens per chunk (default: {DEFAULT_CHUNK_TOKENS})",
"metavar": "N",
},
},
"no_preserve_code": {
"flags": ("--no-preserve-code",),
"chunk_overlap_tokens": {
"flags": ("--chunk-overlap-tokens",),
"kwargs": {
"type": int,
"default": DEFAULT_CHUNK_OVERLAP_TOKENS,
"help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
"metavar": "N",
},
},
"no_preserve_code_blocks": {
"flags": ("--no-preserve-code-blocks",),
"kwargs": {
"action": "store_true",
"help": "Allow code block splitting (default: code blocks preserved)",
@@ -130,3 +142,11 @@ def add_package_arguments(parser: argparse.ArgumentParser) -> None:
flags = arg_def["flags"]
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)
# Deprecated alias for backward compatibility (removed in v4.0.0)
parser.add_argument(
"--no-preserve-code",
dest="no_preserve_code_blocks",
action="store_true",
help=argparse.SUPPRESS,
)

View File

@@ -172,6 +172,14 @@ def add_scrape_arguments(parser: argparse.ArgumentParser) -> None:
kwargs = arg_def["kwargs"]
parser.add_argument(*flags, **kwargs)
# Deprecated alias for backward compatibility (removed in v4.0.0)
parser.add_argument(
"--no-preserve-code",
dest="no_preserve_code_blocks",
action="store_true",
help=argparse.SUPPRESS,
)
def get_scrape_argument_names() -> set:
"""Get the set of scrape argument destination names.

View File

@@ -1057,6 +1057,7 @@ def analyze_codebase(
enhance_level: int = 0,
skill_name: str | None = None,
skill_description: str | None = None,
doc_version: str = "",
) -> dict[str, Any]:
"""
Analyze local codebase and extract code knowledge.
@@ -1603,6 +1604,7 @@ def analyze_codebase(
docs_data=docs_data,
skill_name=skill_name,
skill_description=skill_description,
doc_version=doc_version,
)
return results
@@ -1622,6 +1624,7 @@ def _generate_skill_md(
docs_data: dict[str, Any] | None = None,
skill_name: str | None = None,
skill_description: str | None = None,
doc_version: str = "",
):
"""
Generate rich SKILL.md from codebase analysis results.
@@ -1657,6 +1660,7 @@ def _generate_skill_md(
skill_content = f"""---
name: {skill_name}
description: {description}
doc_version: {doc_version}
---
# {repo_name} Codebase
@@ -2197,13 +2201,11 @@ def _generate_references(output_dir: Path):
if source_dir.exists() and source_dir.is_dir():
# Copy directory to references/ (not symlink, for portability)
if target_dir.exists():
import shutil
shutil.rmtree(target_dir)
import shutil
if target_dir.exists():
shutil.rmtree(target_dir)
shutil.copytree(source_dir, target_dir)
logger.debug(f"Copied {source} → references/{target}")
@@ -2451,6 +2453,7 @@ Examples:
enhance_level=args.enhance_level, # AI enhancement level (0-3)
skill_name=getattr(args, "name", None),
skill_description=getattr(args, "description", None),
doc_version=getattr(args, "doc_version", ""),
)
# ============================================================

View File

@@ -13,6 +13,7 @@ from skill_seekers.cli.arguments.create import (
get_compatible_arguments,
get_universal_argument_names,
)
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
logger = logging.getLogger(__name__)
@@ -106,8 +107,8 @@ class CreateCommand:
# Check against common defaults
defaults = {
"max_issues": 100,
"chunk_tokens": 512,
"chunk_overlap_tokens": 50,
"chunk_tokens": DEFAULT_CHUNK_TOKENS,
"chunk_overlap_tokens": DEFAULT_CHUNK_OVERLAP_TOKENS,
"output": None,
}
@@ -162,11 +163,14 @@ class CreateCommand:
# RAG arguments (web scraper only)
if getattr(self.args, "chunk_for_rag", False):
argv.append("--chunk-for-rag")
if getattr(self.args, "chunk_tokens", None) and self.args.chunk_tokens != 512:
if (
getattr(self.args, "chunk_tokens", None)
and self.args.chunk_tokens != DEFAULT_CHUNK_TOKENS
):
argv.extend(["--chunk-tokens", str(self.args.chunk_tokens)])
if (
getattr(self.args, "chunk_overlap_tokens", None)
and self.args.chunk_overlap_tokens != 50
and self.args.chunk_overlap_tokens != DEFAULT_CHUNK_OVERLAP_TOKENS
):
argv.extend(["--chunk-overlap-tokens", str(self.args.chunk_overlap_tokens)])
@@ -479,6 +483,10 @@ class CreateCommand:
if self.args.quiet:
argv.append("--quiet")
# Documentation version metadata
if getattr(self.args, "doc_version", ""):
argv.extend(["--doc-version", self.args.doc_version])
# Enhancement Workflow arguments
if getattr(self.args, "enhance_workflow", None):
for wf in self.args.enhance_workflow:

View File

@@ -1565,9 +1565,11 @@ class DocToSkillConverter:
if len(example_codes) >= 10:
break
doc_version = self.config.get("doc_version", "")
content = f"""---
name: {self.name}
description: {description}
doc_version: {doc_version}
---
# {self.name.title()} Skill
@@ -2103,6 +2105,11 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]:
"max_pages": DEFAULT_MAX_PAGES,
}
# Apply CLI override for doc_version (works for all config modes)
cli_doc_version = getattr(args, "doc_version", "")
if cli_doc_version:
config["doc_version"] = cli_doc_version
# Apply CLI overrides for rate limiting
if args.no_rate_limit:
config["rate_limit"] = 0

View File

@@ -367,7 +367,7 @@ class LocalSkillEnhancer:
if line.startswith("#"):
# Found heading - keep it and next 3 lines
chunk = lines[i : min(i + 4, len(lines))]
chunk_chars = sum(len(l) for l in chunk)
chunk_chars = sum(len(line_text) for line_text in chunk)
if current_chars + chunk_chars > max_chars:
break
result.extend(chunk)

View File

@@ -968,10 +968,13 @@ class GitHubToSkillConverter:
# Truncate description to 1024 chars if needed
desc = self.description[:1024] if len(self.description) > 1024 else self.description
doc_version = self.config.get("doc_version", "")
# Build skill content
skill_content = f"""---
name: {skill_name}
description: {desc}
doc_version: {doc_version}
---
# {repo_info.get("name", self.name)}
@@ -1003,10 +1006,11 @@ Use this skill when you need to:
# Repository info
skill_content += "### Repository Info\n"
skill_content += f"- **Homepage:** {repo_info.get('homepage', 'N/A')}\n"
skill_content += f"- **Homepage:** {repo_info.get('homepage') or 'N/A'}\n"
skill_content += f"- **Topics:** {', '.join(repo_info.get('topics', []))}\n"
skill_content += f"- **Open Issues:** {repo_info.get('open_issues', 0)}\n"
skill_content += f"- **Last Updated:** {repo_info.get('updated_at', 'N/A')[:10]}\n\n"
updated_at = repo_info.get("updated_at") or "N/A"
skill_content += f"- **Last Updated:** {updated_at[:10]}\n\n"
# Languages
skill_content += "### Languages\n"
@@ -1101,9 +1105,9 @@ Use this skill when you need to:
lines = []
for release in releases[:3]:
lines.append(
f"- **{release['tag_name']}** ({release['published_at'][:10]}): {release['name']}"
)
published_at = release.get("published_at") or "N/A"
release_name = release.get("name") or release["tag_name"]
lines.append(f"- **{release['tag_name']}** ({published_at[:10]}): {release_name}")
return "\n".join(lines)
@@ -1298,15 +1302,17 @@ Use this skill when you need to:
content += f"## Open Issues ({len(open_issues)})\n\n"
for issue in open_issues:
labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels"
created_at = issue.get("created_at") or "N/A"
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Created:** {issue['created_at'][:10]}\n"
content += f"**Labels:** {labels} | **Created:** {created_at[:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n"
for issue in closed_issues:
labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels"
closed_at = issue.get("closed_at") or "N/A"
content += f"### #{issue['number']}: {issue['title']}\n"
content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n"
content += f"**Labels:** {labels} | **Closed:** {closed_at[:10]}\n"
content += f"[View on GitHub]({issue['url']})\n\n"
issues_path = f"{self.skill_dir}/references/issues.md"
@@ -1323,11 +1329,14 @@ Use this skill when you need to:
)
for release in releases:
content += f"## {release['tag_name']}: {release['name']}\n"
content += f"**Published:** {release['published_at'][:10]}\n"
published_at = release.get("published_at") or "N/A"
release_name = release.get("name") or release["tag_name"]
release_body = release.get("body") or ""
content += f"## {release['tag_name']}: {release_name}\n"
content += f"**Published:** {published_at[:10]}\n"
if release["prerelease"]:
content += "**Pre-release**\n"
content += f"\n{release['body']}\n\n"
content += f"\n{release_body}\n\n"
content += f"[View on GitHub]({release['url']})\n\n---\n\n"
releases_path = f"{self.skill_dir}/references/releases.md"

View File

@@ -325,8 +325,8 @@ def _handle_analyze_command(args: argparse.Namespace) -> int:
if getattr(args, "enhance_stage", None):
for stage in args.enhance_stage:
sys.argv.extend(["--enhance-stage", stage])
if getattr(args, "workflow_var", None):
for var in args.workflow_var:
if getattr(args, "var", None):
for var in args.var:
sys.argv.extend(["--var", var])
if getattr(args, "workflow_dry_run", False):
sys.argv.append("--workflow-dry-run")

View File

@@ -14,6 +14,8 @@ import os
import sys
from pathlib import Path
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
# Import utilities
try:
from quality_checker import SkillQualityChecker, print_report
@@ -45,8 +47,9 @@ def package_skill(
chunk_overlap=200,
batch_size=100,
enable_chunking=False,
chunk_max_tokens=512,
chunk_max_tokens=DEFAULT_CHUNK_TOKENS,
preserve_code_blocks=True,
chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
):
"""
Package a skill directory into platform-specific format
@@ -121,6 +124,7 @@ def package_skill(
"chroma",
"faiss",
"qdrant",
"pinecone",
]
if target in RAG_PLATFORMS and not enable_chunking:
@@ -156,6 +160,7 @@ def package_skill(
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
else:
package_path = adaptor.package(
@@ -164,6 +169,7 @@ def package_skill(
enable_chunking=enable_chunking,
chunk_max_tokens=chunk_max_tokens,
preserve_code_blocks=preserve_code_blocks,
chunk_overlap_tokens=chunk_overlap_tokens,
)
print(f" Output: {package_path}")
@@ -226,7 +232,8 @@ Examples:
batch_size=args.batch_size,
enable_chunking=args.chunk_for_rag,
chunk_max_tokens=args.chunk_tokens,
preserve_code_blocks=not args.no_preserve_code,
preserve_code_blocks=not args.no_preserve_code_blocks,
chunk_overlap_tokens=args.chunk_overlap_tokens,
)
if not success:

View File

@@ -14,6 +14,8 @@ Usage:
chunks = chunker.chunk_skill(Path("output/react"))
"""
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
import re
from pathlib import Path
import json
@@ -35,8 +37,8 @@ class RAGChunker:
def __init__(
self,
chunk_size: int = 512,
chunk_overlap: int = 50,
chunk_size: int = DEFAULT_CHUNK_TOKENS,
chunk_overlap: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
preserve_code_blocks: bool = True,
preserve_paragraphs: bool = True,
min_chunk_size: int = 100,
@@ -383,9 +385,14 @@ def main():
)
parser.add_argument("skill_dir", type=Path, help="Path to skill directory")
parser.add_argument("--output", "-o", type=Path, help="Output JSON file")
parser.add_argument("--chunk-tokens", type=int, default=512, help="Target chunk size in tokens")
parser.add_argument(
"--chunk-overlap-tokens", type=int, default=50, help="Overlap size in tokens"
"--chunk-tokens", type=int, default=DEFAULT_CHUNK_TOKENS, help="Target chunk size in tokens"
)
parser.add_argument(
"--chunk-overlap-tokens",
type=int,
default=DEFAULT_CHUNK_OVERLAP_TOKENS,
help="Overlap size in tokens",
)
parser.add_argument("--no-code-blocks", action="store_true", help="Don't preserve code blocks")
parser.add_argument("--no-paragraphs", action="store_true", help="Don't preserve paragraphs")

View File

@@ -1296,7 +1296,9 @@ This skill combines knowledge from multiple sources:
f.write(f"- **File**: `{ex.get('file_path', 'N/A')}`\n")
if ex.get("code_snippet"):
lang = ex.get("language", "text")
f.write(f"\n```{lang}\n{ex['code_snippet']}\n```\n") # Full code, no truncation
f.write(
f"\n```{lang}\n{ex['code_snippet']}\n```\n"
) # Full code, no truncation
f.write("\n")
logger.info(f" ✓ Test examples: {total} total, {high_value} high-value")

View File

@@ -79,7 +79,9 @@ class WordToSkillConverter:
self.config = config
self.name = config["name"]
self.docx_path = config.get("docx_path", "")
self.description = config.get("description") or f"Use when referencing {self.name} documentation"
self.description = (
config.get("description") or f"Use when referencing {self.name} documentation"
)
# Paths
self.skill_dir = f"output/{self.name}"
@@ -109,6 +111,9 @@ class WordToSkillConverter:
if not os.path.exists(self.docx_path):
raise FileNotFoundError(f"Word document not found: {self.docx_path}")
if not self.docx_path.lower().endswith(".docx"):
raise ValueError(f"Not a Word document (expected .docx): {self.docx_path}")
# --- Extract metadata via python-docx ---
doc = python_docx.Document(self.docx_path)
core_props = doc.core_properties
@@ -728,12 +733,13 @@ class WordToSkillConverter:
# HTML-to-sections helper (module-level for clarity)
# ---------------------------------------------------------------------------
def _build_section(
section_number: int,
heading: str | None,
heading_level: str | None,
elements: list,
doc,
doc, # noqa: ARG001
) -> dict:
"""Build a section dict from a list of BeautifulSoup elements.
@@ -769,10 +775,7 @@ def _build_section(
# Code blocks
if tag == "pre" or (tag == "code" and elem.find_parent("pre") is None):
code_elem = elem.find("code") if tag == "pre" else elem
if code_elem:
code_text = code_elem.get_text()
else:
code_text = elem.get_text()
code_text = code_elem.get_text() if code_elem else elem.get_text()
code_text = code_text.strip()
if code_text:
@@ -825,8 +828,8 @@ def _build_section(
raw_text = elem.get_text(separator="\n").strip()
# Exclude bullet-point / prose lists (•, *, -)
if raw_text and not re.search(r"^[•\-\*]\s", raw_text, re.MULTILINE):
if _score_code_quality(raw_text) >= 5.5:
quality_score = _score_code_quality(raw_text)
if quality_score >= 5.5:
code_samples.append(
{"code": raw_text, "language": "", "quality_score": quality_score}
)
@@ -956,7 +959,8 @@ def main():
name = Path(args.from_json).stem.replace("_extracted", "")
config = {
"name": getattr(args, "name", None) or name,
"description": getattr(args, "description", None) or f"Use when referencing {name} documentation",
"description": getattr(args, "description", None)
or f"Use when referencing {name} documentation",
}
try:
converter = WordToSkillConverter(config)
@@ -1044,6 +1048,7 @@ def main():
except Exception as e:
print(f"\n❌ Unexpected error during Word processing: {e}", file=sys.stderr)
import traceback
traceback.print_exc()
sys.exit(1)

View File

@@ -358,6 +358,107 @@ class TestChunkingCLIIntegration:
f"Small chunk size should yield more chunks ({len(data_small)}) than large chunk size ({len(data_large)})"
)
def test_chunk_overlap_tokens_parameter(self, tmp_path):
"""Test --chunk-overlap-tokens controls RAGChunker overlap."""
from skill_seekers.cli.package_skill import package_skill
skill_dir = create_test_skill(tmp_path, large_doc=True)
# Package with default overlap (50)
success, package_path = package_skill(
skill_dir=skill_dir,
open_folder_after=False,
skip_quality_check=True,
target="langchain",
enable_chunking=True,
chunk_max_tokens=256,
chunk_overlap_tokens=50,
)
assert success
assert package_path.exists()
with open(package_path) as f:
data_default = json.load(f)
# Package with large overlap (128)
success2, package_path2 = package_skill(
skill_dir=skill_dir,
open_folder_after=False,
skip_quality_check=True,
target="langchain",
enable_chunking=True,
chunk_max_tokens=256,
chunk_overlap_tokens=128,
)
assert success2
assert package_path2.exists()
with open(package_path2) as f:
data_large_overlap = json.load(f)
# More overlap generally produces at least as many chunks
assert len(data_large_overlap) >= len(data_default), (
f"Large overlap ({len(data_large_overlap)}) should produce at least as many chunks as default ({len(data_default)})"
)
def test_chunk_overlap_scales_with_chunk_size(self, tmp_path):
"""Test that overlap auto-scales when chunk_tokens is non-default but overlap is default."""
from skill_seekers.cli.adaptors.base import (
DEFAULT_CHUNK_TOKENS,
DEFAULT_CHUNK_OVERLAP_TOKENS,
)
adaptor = get_adaptor("langchain")
skill_dir = create_test_skill(tmp_path, large_doc=True)
adaptor._build_skill_metadata(skill_dir)
content = (skill_dir / "SKILL.md").read_text()
# With default chunk size (512) and default overlap (50), overlap should be 50
chunks_default = adaptor._maybe_chunk_content(
content,
{"source": "test"},
enable_chunking=True,
chunk_max_tokens=DEFAULT_CHUNK_TOKENS,
chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
)
# With large chunk size (1024) and default overlap (50),
# overlap should auto-scale to max(50, 1024//10) = 102
chunks_large = adaptor._maybe_chunk_content(
content,
{"source": "test"},
enable_chunking=True,
chunk_max_tokens=1024,
chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
)
# Both should produce valid chunks
assert len(chunks_default) > 1
assert len(chunks_large) >= 1
def test_preserve_code_blocks_flag(self, tmp_path):
"""Test --no-preserve-code-blocks parameter is accepted."""
from skill_seekers.cli.package_skill import package_skill
skill_dir = create_test_skill(tmp_path, large_doc=True)
# Package with code block preservation disabled
success, package_path = package_skill(
skill_dir=skill_dir,
open_folder_after=False,
skip_quality_check=True,
target="langchain",
enable_chunking=True,
chunk_max_tokens=256,
preserve_code_blocks=False,
)
assert success
assert package_path.exists()
if __name__ == "__main__":
pytest.main([__file__, "-v"])

View File

@@ -294,5 +294,84 @@ class TestE2EWorkflow:
assert "unrecognized arguments" not in result.stderr.lower()
class TestVarFlagRouting:
"""Test that --var flag is correctly routed through create command."""
def test_var_flag_accepted_by_create(self):
"""Test that --var flag is accepted (not 'unrecognized') by create command."""
result = subprocess.run(
["skill-seekers", "create", "--help"],
capture_output=True,
text=True,
)
assert "--var" in result.stdout, "create --help should show --var flag"
def test_var_flag_accepted_by_analyze(self):
"""Test that --var flag is accepted by analyze command."""
result = subprocess.run(
["skill-seekers", "analyze", "--help"],
capture_output=True,
text=True,
)
assert "--var" in result.stdout, "analyze --help should show --var flag"
@pytest.mark.slow
def test_var_flag_not_rejected_in_create_local(self, tmp_path):
"""Test --var KEY=VALUE doesn't cause 'unrecognized arguments' in create."""
test_dir = tmp_path / "test_code"
test_dir.mkdir()
(test_dir / "test.py").write_text("def hello(): pass")
result = subprocess.run(
[
"skill-seekers",
"create",
str(test_dir),
"--var",
"foo=bar",
"--dry-run",
],
capture_output=True,
text=True,
timeout=15,
)
assert "unrecognized arguments" not in result.stderr.lower(), (
f"--var should be accepted, got stderr: {result.stderr}"
)
class TestBackwardCompatibleFlags:
"""Test that deprecated flag aliases still work."""
def test_no_preserve_code_alias_accepted_by_package(self):
"""Test --no-preserve-code (old name) is still accepted by package command."""
result = subprocess.run(
["skill-seekers", "package", "--help"],
capture_output=True,
text=True,
)
# The old flag should not appear in --help (it's suppressed)
# but should not cause an error if used
assert result.returncode == 0
def test_no_preserve_code_alias_accepted_by_scrape(self):
"""Test --no-preserve-code (old name) is still accepted by scrape command."""
result = subprocess.run(
["skill-seekers", "scrape", "--help"],
capture_output=True,
text=True,
)
assert result.returncode == 0
def test_no_preserve_code_alias_accepted_by_create(self):
"""Test --no-preserve-code (old name) is still accepted by create command."""
result = subprocess.run(
["skill-seekers", "create", "--help-all"],
capture_output=True,
text=True,
)
assert result.returncode == 0
if __name__ == "__main__":
pytest.main([__file__, "-v", "-s"])

View File

@@ -25,8 +25,8 @@ class TestUniversalArguments:
"""Test universal argument definitions."""
def test_universal_count(self):
"""Should have exactly 18 universal arguments (after Phase 2 workflow integration + local_repo_path)."""
assert len(UNIVERSAL_ARGUMENTS) == 18
"""Should have exactly 19 universal arguments (after Phase 2 workflow integration + local_repo_path + doc_version)."""
assert len(UNIVERSAL_ARGUMENTS) == 19
def test_universal_argument_names(self):
"""Universal arguments should have expected names."""
@@ -50,6 +50,7 @@ class TestUniversalArguments:
"var",
"workflow_dry_run",
"local_repo_path", # GitHub local clone path for unlimited C3.x analysis
"doc_version", # Documentation version tag for RAG metadata
}
assert set(UNIVERSAL_ARGUMENTS.keys()) == expected_names
@@ -130,7 +131,9 @@ class TestArgumentHelpers:
"""Should return set of universal argument names."""
names = get_universal_argument_names()
assert isinstance(names, set)
assert len(names) == 18 # Phase 2: added 4 workflow arguments + local_repo_path
assert (
len(names) == 19
) # Phase 2: added 4 workflow arguments + local_repo_path + doc_version
assert "name" in names
assert "enhance_level" in names # Phase 1: consolidated flag
assert "enhance_workflow" in names # Phase 2: workflow support

View File

@@ -0,0 +1,764 @@
#!/usr/bin/env python3
"""
Tests for Pinecone adaptor and doc_version metadata flow.
"""
import json
import pytest
from skill_seekers.cli.adaptors.base import SkillMetadata
# ---------------------------------------------------------------------------
# Fixtures
# ---------------------------------------------------------------------------
@pytest.fixture
def sample_skill_dir(tmp_path):
"""Create a minimal skill directory with SKILL.md and references."""
skill_dir = tmp_path / "test-skill"
skill_dir.mkdir()
skill_md = """---
name: test-skill
description: A test skill for pinecone
doc_version: 16.2
---
# Test Skill
This is a test skill for Pinecone adaptor testing.
## Quick Start
Get started quickly.
"""
(skill_dir / "SKILL.md").write_text(skill_md)
refs_dir = skill_dir / "references"
refs_dir.mkdir()
(refs_dir / "api_reference.md").write_text("# API Reference\n\nSome API docs.\n")
(refs_dir / "getting_started.md").write_text(
"# Getting Started\n\nSome getting started docs.\n"
)
return skill_dir
@pytest.fixture
def sample_skill_dir_no_doc_version(tmp_path):
"""Create a skill directory without doc_version in frontmatter."""
skill_dir = tmp_path / "no-version-skill"
skill_dir.mkdir()
skill_md = """---
name: no-version-skill
description: A test skill without doc_version
---
# No Version Skill
Content here.
"""
(skill_dir / "SKILL.md").write_text(skill_md)
refs_dir = skill_dir / "references"
refs_dir.mkdir()
(refs_dir / "api.md").write_text("# API\n\nAPI docs.\n")
return skill_dir
# ---------------------------------------------------------------------------
# Pinecone Adaptor Tests
# ---------------------------------------------------------------------------
class TestPineconeAdaptor:
"""Test Pinecone adaptor functionality."""
def test_import(self):
"""PineconeAdaptor can be imported."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
assert PineconeAdaptor is not None
def test_platform_constants(self):
"""Platform constants are set correctly."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
assert adaptor.PLATFORM == "pinecone"
assert adaptor.PLATFORM_NAME == "Pinecone (Vector Database)"
assert adaptor.DEFAULT_API_ENDPOINT is None
def test_registered_in_factory(self):
"""PineconeAdaptor is registered in the adaptor factory."""
from skill_seekers.cli.adaptors import ADAPTORS
assert "pinecone" in ADAPTORS
def test_get_adaptor(self):
"""get_adaptor('pinecone') returns PineconeAdaptor instance."""
from skill_seekers.cli.adaptors import get_adaptor
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = get_adaptor("pinecone")
assert isinstance(adaptor, PineconeAdaptor)
def test_format_skill_md_structure(self, sample_skill_dir):
"""format_skill_md returns valid JSON with expected structure."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(
name="test-skill",
description="Test skill",
version="1.0.0",
doc_version="16.2",
)
result = adaptor.format_skill_md(sample_skill_dir, metadata)
data = json.loads(result)
assert "index_name" in data
assert "namespace" in data
assert "dimension" in data
assert "metric" in data
assert "vectors" in data
assert data["dimension"] == 1536
assert data["metric"] == "cosine"
def test_format_skill_md_vectors_have_metadata(self, sample_skill_dir):
"""Each vector has id and metadata fields."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(
name="test-skill",
description="Test",
doc_version="16.2",
)
result = adaptor.format_skill_md(sample_skill_dir, metadata)
data = json.loads(result)
assert len(data["vectors"]) > 0
for vec in data["vectors"]:
assert "id" in vec
assert "metadata" in vec
assert "text" in vec["metadata"]
assert "source" in vec["metadata"]
assert "category" in vec["metadata"]
assert "file" in vec["metadata"]
assert "type" in vec["metadata"]
assert "version" in vec["metadata"]
assert "doc_version" in vec["metadata"]
def test_format_skill_md_doc_version_propagates(self, sample_skill_dir):
"""doc_version flows into every vector's metadata."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(
name="test-skill",
description="Test",
doc_version="16.2",
)
result = adaptor.format_skill_md(sample_skill_dir, metadata)
data = json.loads(result)
for vec in data["vectors"]:
assert vec["metadata"]["doc_version"] == "16.2"
def test_format_skill_md_empty_doc_version(self, sample_skill_dir):
"""Empty doc_version is preserved as empty string."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(name="test-skill", description="Test", doc_version="")
result = adaptor.format_skill_md(sample_skill_dir, metadata)
data = json.loads(result)
for vec in data["vectors"]:
assert vec["metadata"]["doc_version"] == ""
def test_format_skill_md_has_overview_and_references(self, sample_skill_dir):
"""Output includes overview (SKILL.md) and reference documents."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(name="test-skill", description="Test")
result = adaptor.format_skill_md(sample_skill_dir, metadata)
data = json.loads(result)
categories = {vec["metadata"]["category"] for vec in data["vectors"]}
types = {vec["metadata"]["type"] for vec in data["vectors"]}
assert "overview" in categories
assert "documentation" in types
assert "reference" in types
def test_package_creates_file(self, sample_skill_dir, tmp_path):
"""package() creates a JSON file at expected path."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
output_path = adaptor.package(sample_skill_dir, tmp_path)
assert output_path.exists()
assert output_path.name.endswith("-pinecone.json")
data = json.loads(output_path.read_text())
assert "vectors" in data
assert len(data["vectors"]) > 0
def test_package_reads_frontmatter_metadata(self, sample_skill_dir, tmp_path):
"""package() reads doc_version from SKILL.md frontmatter."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
for vec in data["vectors"]:
assert vec["metadata"]["doc_version"] == "16.2"
def test_package_with_chunking(self, sample_skill_dir, tmp_path):
"""package() with chunking enabled produces valid output."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
output_path = adaptor.package(
sample_skill_dir, tmp_path, enable_chunking=True, chunk_max_tokens=64
)
data = json.loads(output_path.read_text())
assert "vectors" in data
assert len(data["vectors"]) > 0
def test_index_name_derived_from_skill_name(self, sample_skill_dir, tmp_path):
"""index_name and namespace are derived from skill directory name."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
assert data["index_name"] == "test-skill"
assert data["namespace"] == "test-skill"
def test_no_values_field_in_vectors(self, sample_skill_dir, tmp_path):
"""Vectors have no 'values' field — embeddings are added at upload time."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
for vec in data["vectors"]:
assert "values" not in vec
def test_text_truncation(self):
"""_truncate_text_for_metadata respects byte limit."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
# Short text should not be truncated
assert adaptor._truncate_text_for_metadata("hello") == "hello"
# Very long text should be truncated
long_text = "x" * 50000
truncated = adaptor._truncate_text_for_metadata(long_text)
assert len(truncated.encode("utf-8")) <= 40000
def test_validate_api_key_returns_false(self):
"""validate_api_key returns False (no key needed for packaging)."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
assert adaptor.validate_api_key("some-key") is False
def test_get_env_var_name(self):
"""get_env_var_name returns PINECONE_API_KEY."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
assert adaptor.get_env_var_name() == "PINECONE_API_KEY"
def test_supports_enhancement_false(self):
"""Pinecone doesn't support enhancement."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
assert adaptor.supports_enhancement() is False
def test_upload_without_pinecone_installed(self, tmp_path):
"""upload() returns helpful error when pinecone not installed."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
# Create a dummy package file
pkg = tmp_path / "test-pinecone.json"
pkg.write_text(json.dumps({"vectors": [], "index_name": "test", "namespace": "test"}))
# Whether or not pinecone is installed, upload without an API key must fail
result = adaptor.upload(pkg)
# Without API key, should fail
assert result["success"] is False
def _make_mock_pinecone(self, monkeypatch):
"""Helper: stub the pinecone module so upload() can run without a real server."""
import sys
import types
from unittest.mock import MagicMock
mock_module = types.ModuleType("pinecone")
mock_index = MagicMock()
mock_pc = MagicMock()
mock_pc.list_indexes.return_value = [] # no existing indexes
mock_pc.Index.return_value = mock_index
mock_module.Pinecone = MagicMock(return_value=mock_pc)
mock_module.ServerlessSpec = MagicMock()
monkeypatch.setitem(sys.modules, "pinecone", mock_module)
return mock_pc, mock_index
def _make_package(self, tmp_path, vectors=None):
"""Helper: create a minimal Pinecone package JSON."""
if vectors is None:
vectors = [{"id": "a", "metadata": {"text": "hello world"}}]
pkg = tmp_path / "test-pinecone.json"
pkg.write_text(
json.dumps(
{
"vectors": vectors,
"index_name": "test",
"namespace": "test",
"metric": "cosine",
"dimension": 1536,
}
)
)
return pkg
def test_upload_success_has_url_key(self, tmp_path, monkeypatch):
"""upload() success return dict includes 'url' key (prevents KeyError in package_skill.py)."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
monkeypatch.setattr(
adaptor,
"_generate_openai_embeddings",
lambda docs: [[0.0] * 1536] * len(docs),
)
pkg = self._make_package(tmp_path)
result = adaptor.upload(pkg, api_key="fake-key")
assert result["success"] is True
assert "url" in result # key must exist to avoid KeyError in package_skill.py
# Value should be None for Pinecone (no web URL)
assert result["url"] is None
def test_embedding_dimension_autodetect_st(self, tmp_path, monkeypatch):
"""sentence-transformers upload creates index with dimension=384."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
monkeypatch.setattr(
adaptor,
"_generate_st_embeddings",
lambda docs: [[0.0] * 384] * len(docs),
)
pkg = self._make_package(tmp_path)
result = adaptor.upload(
pkg,
api_key="fake-key",
embedding_function="sentence-transformers",
)
assert result["success"] is True
# Verify create_index was called with dimension=384
mock_pc.create_index.assert_called_once()
call_kwargs = mock_pc.create_index.call_args
assert call_kwargs.kwargs["dimension"] == 384
def test_embedding_dimension_autodetect_openai(self, tmp_path, monkeypatch):
"""openai upload creates index with dimension=1536."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
monkeypatch.setattr(
adaptor,
"_generate_openai_embeddings",
lambda docs: [[0.0] * 1536] * len(docs),
)
pkg = self._make_package(tmp_path)
result = adaptor.upload(
pkg,
api_key="fake-key",
embedding_function="openai",
)
assert result["success"] is True
mock_pc.create_index.assert_called_once()
call_kwargs = mock_pc.create_index.call_args
assert call_kwargs.kwargs["dimension"] == 1536
def test_embedding_before_index_creation(self, tmp_path, monkeypatch):
"""If embedding generation fails, index is never created (no side-effects)."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
def fail_embeddings(_docs):
raise RuntimeError("OPENAI_API_KEY not set")
monkeypatch.setattr(adaptor, "_generate_openai_embeddings", fail_embeddings)
pkg = self._make_package(tmp_path)
result = adaptor.upload(pkg, api_key="fake-key")
assert result["success"] is False
# Index must NOT have been created since embedding failed first
mock_pc.create_index.assert_not_called()
def test_embedding_dimension_explicit_override(self, tmp_path, monkeypatch):
"""Explicit dimension kwarg overrides both auto-detect and JSON file value."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch)
monkeypatch.setattr(
adaptor,
"_generate_openai_embeddings",
lambda docs: [[0.0] * 768] * len(docs),
)
pkg = self._make_package(tmp_path)
result = adaptor.upload(
pkg,
api_key="fake-key",
embedding_function="openai",
dimension=768,
)
assert result["success"] is True
mock_pc.create_index.assert_called_once()
call_kwargs = mock_pc.create_index.call_args
assert call_kwargs.kwargs["dimension"] == 768
def test_deterministic_ids(self, sample_skill_dir):
"""IDs are deterministic — same input produces same ID."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
metadata = SkillMetadata(name="test-skill", description="Test")
result1 = adaptor.format_skill_md(sample_skill_dir, metadata)
result2 = adaptor.format_skill_md(sample_skill_dir, metadata)
data1 = json.loads(result1)
data2 = json.loads(result2)
ids1 = [v["id"] for v in data1["vectors"]]
ids2 = [v["id"] for v in data2["vectors"]]
assert ids1 == ids2
# ---------------------------------------------------------------------------
# doc_version Metadata Tests (cross-adaptor)
# ---------------------------------------------------------------------------
class TestDocVersionMetadata:
"""Test doc_version flows through all RAG adaptors."""
def test_skill_metadata_has_doc_version(self):
"""SkillMetadata dataclass has doc_version field."""
meta = SkillMetadata(name="test", description="test", doc_version="3.2")
assert meta.doc_version == "3.2"
def test_skill_metadata_doc_version_default_empty(self):
"""doc_version defaults to empty string."""
meta = SkillMetadata(name="test", description="test")
assert meta.doc_version == ""
def test_read_frontmatter(self, sample_skill_dir):
"""_read_frontmatter reads doc_version from SKILL.md."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
fm = adaptor._read_frontmatter(sample_skill_dir)
assert fm["doc_version"] == "16.2"
assert fm["name"] == "test-skill"
def test_read_frontmatter_missing(self, sample_skill_dir_no_doc_version):
"""_read_frontmatter omits the doc_version key when it is absent from frontmatter."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
fm = adaptor._read_frontmatter(sample_skill_dir_no_doc_version)
assert fm.get("doc_version") is None # key not present
def test_build_skill_metadata_reads_doc_version(self, sample_skill_dir):
"""_build_skill_metadata populates doc_version from frontmatter."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
meta = adaptor._build_skill_metadata(sample_skill_dir)
assert meta.doc_version == "16.2"
assert meta.name == "test-skill"
def test_build_skill_metadata_no_doc_version(self, sample_skill_dir_no_doc_version):
"""_build_skill_metadata defaults to empty string when frontmatter has no doc_version."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
meta = adaptor._build_skill_metadata(sample_skill_dir_no_doc_version)
assert meta.doc_version == ""
def test_build_metadata_dict_includes_doc_version(self):
"""_build_metadata_dict includes doc_version in output."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
meta = SkillMetadata(name="test", description="desc", doc_version="3.0")
result = adaptor._build_metadata_dict(meta)
assert "doc_version" in result
assert result["doc_version"] == "3.0"
def test_build_metadata_dict_empty_doc_version(self):
"""_build_metadata_dict preserves empty doc_version."""
from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
adaptor = PineconeAdaptor()
meta = SkillMetadata(name="test", description="desc")
result = adaptor._build_metadata_dict(meta)
assert "doc_version" in result
assert result["doc_version"] == ""
@pytest.mark.parametrize(
"platform",
["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"],
)
def test_doc_version_in_package_output(self, platform, sample_skill_dir, tmp_path):
"""doc_version appears in package output for all RAG adaptors."""
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor(platform)
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
# Each adaptor has a different structure — extract metadata dicts
meta_list = _extract_metadata_from_package(platform, data)
assert len(meta_list) > 0, f"No metadata found in {platform} output"
for meta in meta_list:
assert "doc_version" in meta, f"doc_version missing in {platform} metadata: {meta}"
assert meta["doc_version"] == "16.2", (
f"doc_version mismatch in {platform}: expected '16.2', got '{meta['doc_version']}'"
)
@pytest.mark.parametrize(
"platform",
["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"],
)
def test_empty_doc_version_in_package_output(
self, platform, sample_skill_dir_no_doc_version, tmp_path
):
"""Empty doc_version is preserved (not omitted) in all adaptors."""
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor(platform)
output_path = adaptor.package(sample_skill_dir_no_doc_version, tmp_path)
data = json.loads(output_path.read_text())
meta_list = _extract_metadata_from_package(platform, data)
assert len(meta_list) > 0
for meta in meta_list:
assert "doc_version" in meta
# Qdrant and Weaviate may not be installed — test separately if available
class TestDocVersionQdrant:
"""Test doc_version in Qdrant adaptor (may require qdrant client)."""
def test_qdrant_doc_version(self, sample_skill_dir, tmp_path):
from skill_seekers.cli.adaptors import ADAPTORS
if "qdrant" not in ADAPTORS:
pytest.skip("Qdrant adaptor not available")
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor("qdrant")
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
for point in data["points"]:
assert "doc_version" in point["payload"]
assert point["payload"]["doc_version"] == "16.2"
class TestWeaviateUploadReturnKeys:
"""Test Weaviate upload() return dict has required keys."""
def test_weaviate_upload_success_has_url_key(self, sample_skill_dir, tmp_path, monkeypatch):
"""Weaviate upload() success return includes 'url' key (prevents KeyError in package_skill.py)."""
import sys
import types
from unittest.mock import MagicMock
from skill_seekers.cli.adaptors import ADAPTORS
if "weaviate" not in ADAPTORS:
pytest.skip("Weaviate adaptor not available")
from skill_seekers.cli.adaptors.weaviate import WeaviateAdaptor
adaptor = WeaviateAdaptor()
# Stub the weaviate module
mock_module = types.ModuleType("weaviate")
mock_client = MagicMock()
mock_client.is_ready.return_value = True
mock_module.Client = MagicMock(return_value=mock_client)
mock_module.AuthApiKey = MagicMock()
monkeypatch.setitem(sys.modules, "weaviate", mock_module)
# Create a minimal weaviate package
output_path = adaptor.package(sample_skill_dir, tmp_path)
result = adaptor.upload(output_path)
assert result["success"] is True
assert "url" in result
assert result["url"] is None
class TestDocVersionWeaviate:
"""Test doc_version in Weaviate adaptor (may require weaviate client)."""
def test_weaviate_doc_version(self, sample_skill_dir, tmp_path):
from skill_seekers.cli.adaptors import ADAPTORS
if "weaviate" not in ADAPTORS:
pytest.skip("Weaviate adaptor not available")
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor("weaviate")
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
for obj in data["objects"]:
assert "doc_version" in obj["properties"]
assert obj["properties"]["doc_version"] == "16.2"
def test_weaviate_schema_includes_doc_version(self, sample_skill_dir, tmp_path):
from skill_seekers.cli.adaptors import ADAPTORS
if "weaviate" not in ADAPTORS:
pytest.skip("Weaviate adaptor not available")
from skill_seekers.cli.adaptors import get_adaptor
adaptor = get_adaptor("weaviate")
output_path = adaptor.package(sample_skill_dir, tmp_path)
data = json.loads(output_path.read_text())
property_names = [p["name"] for p in data["schema"]["properties"]]
assert "doc_version" in property_names
# ---------------------------------------------------------------------------
# CLI Flag Tests
# ---------------------------------------------------------------------------
class TestDocVersionCLIFlag:
"""Test --doc-version CLI flag is accepted."""
def test_common_arguments_has_doc_version(self):
"""COMMON_ARGUMENTS includes doc_version."""
from skill_seekers.cli.arguments.common import COMMON_ARGUMENTS
assert "doc_version" in COMMON_ARGUMENTS
def test_create_arguments_has_doc_version(self):
"""UNIVERSAL_ARGUMENTS includes doc_version."""
from skill_seekers.cli.arguments.create import UNIVERSAL_ARGUMENTS
assert "doc_version" in UNIVERSAL_ARGUMENTS
def test_doc_version_flag_parsed(self):
"""--doc-version is parsed correctly by argparse."""
import argparse
from skill_seekers.cli.arguments.common import add_common_arguments
parser = argparse.ArgumentParser()
add_common_arguments(parser)
args = parser.parse_args(["--doc-version", "16.2"])
assert args.doc_version == "16.2"
def test_doc_version_default_empty(self):
"""--doc-version defaults to empty string."""
import argparse
from skill_seekers.cli.arguments.common import add_common_arguments
parser = argparse.ArgumentParser()
add_common_arguments(parser)
args = parser.parse_args([])
assert args.doc_version == ""
# ---------------------------------------------------------------------------
# Package choices test
# ---------------------------------------------------------------------------
class TestPineconeInPackageChoices:
"""Test pinecone is in package CLI choices."""
def test_pinecone_in_package_arguments(self):
"""pinecone is listed in package --target choices."""
from skill_seekers.cli.arguments.package import PACKAGE_ARGUMENTS
choices = PACKAGE_ARGUMENTS["target"]["kwargs"]["choices"]
assert "pinecone" in choices
# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------
def _extract_metadata_from_package(platform: str, data: dict | list) -> list[dict]:
    """Extract metadata dicts from adaptor-specific package format."""
    meta_list = []
    if platform == "pinecone":
        for vec in data.get("vectors", []):
            meta_list.append(vec.get("metadata", {}))
    elif platform in ("chroma", "faiss"):
        meta_list.extend(data.get("metadatas", []))
    elif platform in ("langchain", "llama-index"):
        for doc in data if isinstance(data, list) else []:
            meta_list.append(doc.get("metadata", {}))
    elif platform == "haystack":
        for doc in data if isinstance(data, list) else []:
            meta_list.append(doc.get("meta", {}))
    elif platform == "qdrant":
        for point in data.get("points", []):
            meta_list.append(point.get("payload", {}))
    elif platform == "weaviate":
        for obj in data.get("objects", []):
            meta_list.append(obj.get("properties", {}))
    return meta_list


@@ -151,6 +151,45 @@ class TestWeaviateUploadBasics:
assert hasattr(adaptor, "_generate_openai_embeddings")
class TestEmbeddingMethodInheritance:
"""Test that shared embedding methods are properly inherited from base."""
def test_chroma_inherits_openai_embeddings(self):
"""Test chroma adaptor gets _generate_openai_embeddings from base."""
adaptor = get_adaptor("chroma")
assert hasattr(adaptor, "_generate_openai_embeddings")
# Verify it's the base class method, not a local override
from skill_seekers.cli.adaptors.base import SkillAdaptor
assert (
adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings
)
def test_weaviate_inherits_both_embedding_methods(self):
"""Test weaviate adaptor gets both embedding methods from base."""
adaptor = get_adaptor("weaviate")
assert hasattr(adaptor, "_generate_openai_embeddings")
assert hasattr(adaptor, "_generate_st_embeddings")
from skill_seekers.cli.adaptors.base import SkillAdaptor
assert (
adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings
)
assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings
def test_pinecone_inherits_both_embedding_methods(self):
"""Test pinecone adaptor gets both embedding methods from base."""
adaptor = get_adaptor("pinecone")
assert hasattr(adaptor, "_generate_openai_embeddings")
assert hasattr(adaptor, "_generate_st_embeddings")
from skill_seekers.cli.adaptors.base import SkillAdaptor
assert (
adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings
)
assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings
class TestPackageStructure:
"""Test that packages are correctly structured for upload."""


@@ -16,6 +16,7 @@ Tests cover:
"""
import json
import os
import shutil
import tempfile
import unittest
@@ -30,8 +31,9 @@ except ImportError:
WORD_AVAILABLE = False
def _make_sample_extracted_data(num_sections=2, include_code=False, include_tables=False,
include_images=False):
def _make_sample_extracted_data(
num_sections=2, include_code=False, include_tables=False, include_images=False
):
"""Helper to build a minimal extracted_data dict for testing."""
mock_image_bytes = (
b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR\x00\x00\x00\x01\x00\x00\x00\x01"
@@ -53,23 +55,29 @@ def _make_sample_extracted_data(num_sections=2, include_code=False, include_tabl
}
if include_code:
section["code_samples"] = [
{"code": f"def hello_{i}():\n return 'world'", "language": "python",
"quality_score": 7.5}
{
"code": f"def hello_{i}():\n return 'world'",
"language": "python",
"quality_score": 7.5,
}
]
if include_tables:
section["tables"] = [
{"headers": ["Col A", "Col B"], "rows": [["val1", "val2"], ["val3", "val4"]]}
]
if include_images:
section["images"] = [
{"index": 0, "data": mock_image_bytes, "width": 100, "height": 80}
]
section["images"] = [{"index": 0, "data": mock_image_bytes, "width": 100, "height": 80}]
pages.append(section)
return {
"source_file": "test.docx",
"metadata": {"title": "Test Doc", "author": "Test Author", "created": "", "modified": "",
"subject": ""},
"metadata": {
"title": "Test Doc",
"author": "Test Author",
"created": "",
"modified": "",
"subject": "",
},
"total_sections": num_sections,
"total_code_blocks": num_sections if include_code else 0,
"total_images": num_sections if include_images else 0,
@@ -85,6 +93,7 @@ class TestWordToSkillConverterInit(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -130,6 +139,7 @@ class TestWordToSkillConverterInit(unittest.TestCase):
def test_name_auto_detected_from_filename(self):
"""Test name can be extracted from filename via infer_description_from_word."""
from skill_seekers.cli.word_scraper import infer_description_from_word
desc = infer_description_from_word({}, name="my_doc")
self.assertIn("my_doc", desc)
@@ -141,6 +151,7 @@ class TestWordCategorization(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -174,10 +185,22 @@ class TestWordCategorization(unittest.TestCase):
converter.docx_path = ""
converter.extracted_data = {
"pages": [
{"section_number": 1, "heading": "API Reference", "text": "api reference docs",
"code_samples": [], "tables": [], "images": []},
{"section_number": 2, "heading": "Getting Started", "text": "getting started guide",
"code_samples": [], "tables": [], "images": []},
{
"section_number": 1,
"heading": "API Reference",
"text": "api reference docs",
"code_samples": [],
"tables": [],
"images": [],
},
{
"section_number": 2,
"heading": "Getting Started",
"text": "getting started guide",
"code_samples": [],
"tables": [],
"images": [],
},
]
}
@@ -204,6 +227,7 @@ class TestWordSkillBuilding(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -296,6 +320,7 @@ class TestWordCodeBlocks(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -350,6 +375,7 @@ class TestWordTables(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -392,6 +418,7 @@ class TestWordImages(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -433,6 +460,7 @@ class TestWordErrorHandling(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()
@@ -456,6 +484,37 @@ class TestWordErrorHandling(unittest.TestCase):
with self.assertRaises((KeyError, TypeError)):
self.WordToSkillConverter({"docx_path": "test.docx"})
def test_non_docx_file_raises_value_error(self):
"""extract_docx raises ValueError for non-.docx files."""
# Create a real file with wrong extension
txt_path = os.path.join(self.temp_dir, "test.txt")
with open(txt_path, "w") as f:
f.write("not a docx")
config = {"name": "test", "docx_path": txt_path}
converter = self.WordToSkillConverter(config)
with self.assertRaises(ValueError):
converter.extract_docx()
def test_doc_file_raises_value_error(self):
"""extract_docx raises ValueError for .doc (old Word format)."""
doc_path = os.path.join(self.temp_dir, "test.doc")
with open(doc_path, "w") as f:
f.write("not a docx")
config = {"name": "test", "docx_path": doc_path}
converter = self.WordToSkillConverter(config)
with self.assertRaises(ValueError):
converter.extract_docx()
def test_no_extension_file_raises_value_error(self):
"""extract_docx raises ValueError for file with no extension."""
no_ext_path = os.path.join(self.temp_dir, "document")
with open(no_ext_path, "w") as f:
f.write("not a docx")
config = {"name": "test", "docx_path": no_ext_path}
converter = self.WordToSkillConverter(config)
with self.assertRaises(ValueError):
converter.extract_docx()
class TestWordJSONWorkflow(unittest.TestCase):
"""Test building skills from extracted JSON."""
@@ -464,6 +523,7 @@ class TestWordJSONWorkflow(unittest.TestCase):
if not WORD_AVAILABLE:
self.skipTest("mammoth and python-docx not installed")
from skill_seekers.cli.word_scraper import WordToSkillConverter
self.WordToSkillConverter = WordToSkillConverter
self.temp_dir = tempfile.mkdtemp()

uv.lock generated

@@ -3852,11 +3852,11 @@ wheels = [
[[package]]
name = "packaging"
version = "25.0"
version = "24.2"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload-time = "2025-04-19T11:48:59.673Z" }
sdist = { url = "https://files.pythonhosted.org/packages/d0/63/68dbb6eb2de9cb10ee4c9c14a0148804425e13c4fb20d61cce69f53106da/packaging-24.2.tar.gz", hash = "sha256:c228a6dc5e932d346bc5739379109d49e8853dd8223571c7c5b55260edc0b97f", size = 163950, upload-time = "2024-11-08T09:47:47.202Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload-time = "2025-04-19T11:48:57.875Z" },
{ url = "https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl", hash = "sha256:09abb1bccd265c01f4a3aa3f7a7db064b36514d2cba19a2f694fe6150451a759", size = 65451, upload-time = "2024-11-08T09:47:44.722Z" },
]
[[package]]
@@ -4028,6 +4028,46 @@ wheels = [
{ url = "https://files.pythonhosted.org/packages/2d/71/64e9b1c7f04ae0027f788a248e6297d7fcc29571371fe7d45495a78172c0/pillow-12.1.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:75af0b4c229ac519b155028fa1be632d812a519abba9b46b20e50c6caa184f19", size = 7029809, upload-time = "2026-01-02T09:13:26.541Z" },
]
[[package]]
name = "pinecone"
version = "8.1.0"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "certifi" },
{ name = "orjson" },
{ name = "pinecone-plugin-assistant" },
{ name = "pinecone-plugin-interface" },
{ name = "python-dateutil" },
{ name = "typing-extensions" },
{ name = "urllib3" },
]
sdist = { url = "https://files.pythonhosted.org/packages/e2/e4/8303133de5b3850c85d56caf9cc23cc38c74942bb8a940890b225245d7df/pinecone-8.1.0.tar.gz", hash = "sha256:48a00843fb232ccfd57eba618f0c0294e918b030e1bc7e853fb88d04f80ba569", size = 1041965, upload-time = "2026-02-19T20:08:32.999Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/4e/f7/beee7033ef92e5964e570fc29a048627e298745916e65c66105378405d06/pinecone-8.1.0-py3-none-any.whl", hash = "sha256:b0ba9c55c9a072fbe4fc7381bc3e5eb1b14550a8007233a3368ada74b1747534", size = 742745, upload-time = "2026-02-19T20:08:31.319Z" },
]
[[package]]
name = "pinecone-plugin-assistant"
version = "3.0.2"
source = { registry = "https://pypi.org/simple" }
dependencies = [
{ name = "packaging" },
{ name = "requests" },
]
sdist = { url = "https://files.pythonhosted.org/packages/c4/16/dcaff42ddfeab75dccd17685a0db46489717c3d23753dc14c55770e12aa8/pinecone_plugin_assistant-3.0.2.tar.gz", hash = "sha256:04163af282ad7895b581ab89f850ed139e4ddcea72010cadfa4c573759d5c896", size = 152066, upload-time = "2026-02-01T09:08:48.04Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/4a/dd/8bc4f3baf6c03acfb0b300f5aba53d19cc3a319281da518182bf22671b92/pinecone_plugin_assistant-3.0.2-py3-none-any.whl", hash = "sha256:de21ff696219fcad6c7ec86a3d1f70875024314537758ab345b6230462342903", size = 280863, upload-time = "2026-02-01T09:08:49.384Z" },
]
[[package]]
name = "pinecone-plugin-interface"
version = "0.0.7"
source = { registry = "https://pypi.org/simple" }
sdist = { url = "https://files.pythonhosted.org/packages/f4/fb/e8a4063264953ead9e2b24d9b390152c60f042c951c47f4592e9996e57ff/pinecone_plugin_interface-0.0.7.tar.gz", hash = "sha256:b8e6675e41847333aa13923cc44daa3f85676d7157324682dc1640588a982846", size = 3370, upload-time = "2024-06-05T01:57:52.093Z" }
wheels = [
{ url = "https://files.pythonhosted.org/packages/3b/1d/a21fdfcd6d022cb64cef5c2a29ee6691c6c103c4566b41646b080b7536a5/pinecone_plugin_interface-0.0.7-py3-none-any.whl", hash = "sha256:875857ad9c9fc8bbc074dbe780d187a2afd21f5bfe0f3b08601924a61ef1bba8", size = 6249, upload-time = "2024-06-05T01:57:50.583Z" },
]
[[package]]
name = "platformdirs"
version = "4.9.2"
@@ -5966,6 +6006,7 @@ all = [
{ name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" },
{ name = "numpy", version = "2.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" },
{ name = "openai" },
{ name = "pinecone" },
{ name = "python-docx" },
{ name = "sentence-transformers" },
{ name = "sse-starlette" },
@@ -6020,8 +6061,12 @@ mcp = [
openai = [
{ name = "openai" },
]
pinecone = [
{ name = "pinecone" },
]
rag-upload = [
{ name = "chromadb" },
{ name = "pinecone" },
{ name = "sentence-transformers" },
{ name = "weaviate-client" },
]
@@ -6111,6 +6156,9 @@ requires-dist = [
{ name = "opencv-python-headless", marker = "extra == 'video-full'", specifier = ">=4.9.0" },
{ name = "pathspec", specifier = ">=0.12.1" },
{ name = "pillow", specifier = ">=11.0.0" },
{ name = "pinecone", marker = "extra == 'all'", specifier = ">=5.0.0" },
{ name = "pinecone", marker = "extra == 'pinecone'", specifier = ">=5.0.0" },
{ name = "pinecone", marker = "extra == 'rag-upload'", specifier = ">=5.0.0" },
{ name = "pydantic", specifier = ">=2.12.3" },
{ name = "pydantic-settings", specifier = ">=2.11.0" },
{ name = "pygithub", specifier = ">=2.5.0" },
@@ -6148,7 +6196,7 @@ requires-dist = [
{ name = "yt-dlp", marker = "extra == 'video'", specifier = ">=2024.12.0" },
{ name = "yt-dlp", marker = "extra == 'video-full'", specifier = ">=2024.12.0" },
]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "rag-upload", "all-cloud", "embedding", "all"]
provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "video", "video-full", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "embedding", "all"]
[package.metadata.requires-dev]
dev = [