diff --git a/CHANGELOG.md b/CHANGELOG.md index b2e7351..0494e32 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -22,6 +22,14 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`docx` optional dependency group** — `pip install skill-seekers[docx]` (mammoth + python-docx) ### Fixed +- **`--var` flag silently dropped in `create` routing** — `main.py` checked `args.workflow_var` but argparse stores the flag as `args.var`. Workflow variable overrides via `--var KEY=VALUE` were silently ignored. Fixed to read `args.var`. +- **Double `_score_code_quality()` call in word scraper** — `word_scraper.py` called `_score_code_quality(raw_text)` twice for every code-like paragraph (once to check threshold, once to assign). Consolidated to a single call. +- **`.docx` file extension validation** — `WordToSkillConverter` now validates the file has a `.docx` extension before attempting to parse. Non-`.docx` files (`.doc`, `.txt`, no extension) raise `ValueError` with a clear message instead of cryptic parse errors. +- **`--no-preserve-code` renamed to `--no-preserve-code-blocks`** — Flag name now matches the parameter it controls (`preserve_code_blocks`). Backward-compatible alias `--no-preserve-code` kept (hidden, removed in v4.0.0). +- **`--chunk-overlap-tokens` missing from `package` command** — Flag was defined in `create` and `scrape` but not `package`. Added to `PACKAGE_ARGUMENTS` and wired through `package_skill()` → `adaptor.package()` → `format_skill_md()` → `_maybe_chunk_content()` → `RAGChunker`. +- **Chunk overlap auto-scaling** — When `--chunk-tokens` is non-default but `--chunk-overlap-tokens` is default, overlap now auto-scales to `max(50, chunk_tokens // 10)` for better context preservation with large chunks. 
+- **Weaviate `ImportError` masked by generic handler** — `upload()` caught `Exception` before `ImportError`, so missing `sentence-transformers` produced a generic "Upload failed" message instead of the specific install instruction. Added `except ImportError` before `except Exception`. +- **Hardcoded chunk defaults in 12 adaptors** — All concrete adaptors (claude, gemini, openai, markdown, langchain, llama_index, haystack, chroma, faiss, qdrant, weaviate, pinecone) used hardcoded `512`/`50` for chunk token/overlap defaults. Replaced with `DEFAULT_CHUNK_TOKENS` and `DEFAULT_CHUNK_OVERLAP_TOKENS` constants from `arguments/common.py`. - **RAG chunking crash (`AttributeError: output_dir`)** — `execute_scraping_and_building()` used `converter.output_dir` which doesn't exist on `DocToSkillConverter`. Changed to `Path(converter.skill_dir)`. Affected `--chunk-for-rag` flag on `scrape` command. - **Issue #301: `setup.sh` fails on macOS with mismatched Python/pip** — `pip3` can point to a different Python than `python3` (e.g. pip3 → 3.9, python3 → 3.14), causing "no matching distribution" errors. Changed `setup.sh` to use `python3 -m pip` instead of bare `pip3` to guarantee the correct interpreter. - **Issue #300: Selector fallback & dry-run link discovery** — `create https://reactflow.dev/` now finds 20+ pages (was 1). Root causes: @@ -45,6 +53,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **Language detector method** — Fixed `detect_from_text` → `detect_from_code` in word scraper ### Changed +- **Shared embedding methods consolidated to base class** — `_generate_openai_embeddings()` and `_generate_st_embeddings()` moved from chroma/weaviate/pinecone adaptors into `SkillAdaptor` base class. All 3 adaptors now inherit these methods, eliminating ~150 lines of duplicated code. +- **Chunk constants centralized** — Added `DEFAULT_CHUNK_TOKENS = 512` and `DEFAULT_CHUNK_OVERLAP_TOKENS = 50` in `arguments/common.py`. 
Used across `rag_chunker.py`, `base.py`, `package_skill.py`, `create_command.py`, and all 12 concrete adaptors. No more magic numbers for chunk defaults. - **Enhancement summarizer architecture** — Character-budget approach respects `target_ratio` for both code blocks and heading chunks, replacing hard limits with proportional allocation ## [3.1.3] - 2026-02-24 diff --git a/README.md b/README.md index 6755136..aea3a61 100644 --- a/README.md +++ b/README.md @@ -10,7 +10,7 @@ English | [简体中文](https://github.com/yusufkaraaslan/Skill_Seekers/blob/ma [![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT) [![Python 3.10+](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/) [![MCP Integration](https://img.shields.io/badge/MCP-Integrated-blue.svg)](https://modelcontextprotocol.io) -[![Tested](https://img.shields.io/badge/Tests-1880%2B%20Passing-brightgreen.svg)](tests/) +[![Tested](https://img.shields.io/badge/Tests-2283%2B%20Passing-brightgreen.svg)](tests/) [![Project Board](https://img.shields.io/badge/Project-Board-purple.svg)](https://github.com/users/yusufkaraaslan/projects/2) [![PyPI version](https://badge.fury.io/py/skill-seekers.svg)](https://pypi.org/project/skill-seekers/) [![PyPI - Downloads](https://img.shields.io/pypi/dm/skill-seekers.svg)](https://pypi.org/project/skill-seekers/) diff --git a/docs/reference/API_REFERENCE.md b/docs/reference/API_REFERENCE.md index 3be8718..4ba6d39 100644 --- a/docs/reference/API_REFERENCE.md +++ b/docs/reference/API_REFERENCE.md @@ -309,6 +309,15 @@ package_path = adaptor.package( ) ``` +#### Shared Embedding Methods + +The base `SkillAdaptor` class provides two shared embedding methods inherited by all vector database adaptors (chroma, weaviate, pinecone): + +- `_generate_openai_embeddings(documents, api_key=None)` -- Generate embeddings via the OpenAI API (`text-embedding-3-small`).
+- `_generate_st_embeddings(documents)` -- Generate embeddings using a local sentence-transformers model (`all-MiniLM-L6-v2`). + +These methods are available on any adaptor instance returned by `get_adaptor()` for vector database targets, so you do not need to implement embedding logic per-adaptor. + --- ### 6. Skill Upload API diff --git a/docs/reference/CLI_REFERENCE.md b/docs/reference/CLI_REFERENCE.md index f5be01f..07de4c6 100644 --- a/docs/reference/CLI_REFERENCE.md +++ b/docs/reference/CLI_REFERENCE.md @@ -620,7 +620,8 @@ skill-seekers package SKILL_DIRECTORY [options] | | `--batch-size` | 100 | Chunks per batch | | | `--chunk-for-rag` | | Enable RAG chunking | | | `--chunk-tokens` | 512 | Max tokens per chunk | -| | `--no-preserve-code` | | Allow code block splitting | +| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) | +| | `--no-preserve-code-blocks` | | Allow code block splitting | **Supported Platforms:** diff --git a/docs/user-guide/04-packaging.md b/docs/user-guide/04-packaging.md index cced71a..0f58bc7 100644 --- a/docs/user-guide/04-packaging.md +++ b/docs/user-guide/04-packaging.md @@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \ | `--chunk-for-rag` | auto | Enable chunking | | `--chunk-tokens` | 512 | Tokens per chunk | | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) | -| `--no-preserve-code` | - | Allow splitting code blocks | +| `--no-preserve-code-blocks` | - | Allow splitting code blocks | + +> **Auto-scaling overlap:** When `--chunk-tokens` is set to a non-default value but `--chunk-overlap-tokens` is left at default (50), the overlap automatically scales to `max(50, chunk_tokens // 10)` for better context preservation with larger chunks.
--- diff --git a/docs/zh-CN/reference/CLI_REFERENCE.md b/docs/zh-CN/reference/CLI_REFERENCE.md index 88ffbc0..269dc51 100644 --- a/docs/zh-CN/reference/CLI_REFERENCE.md +++ b/docs/zh-CN/reference/CLI_REFERENCE.md @@ -598,7 +598,8 @@ skill-seekers package SKILL_DIRECTORY [options] | | `--batch-size` | 100 | Chunks per batch | | | `--chunk-for-rag` | | Enable RAG chunking | | | `--chunk-tokens` | 512 | Max tokens per chunk | -| | `--no-preserve-code` | | Allow code block splitting | +| | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) | +| | `--no-preserve-code-blocks` | | Allow code block splitting | **Supported Platforms:** diff --git a/docs/zh-CN/user-guide/04-packaging.md b/docs/zh-CN/user-guide/04-packaging.md index cced71a..f343f94 100644 --- a/docs/zh-CN/user-guide/04-packaging.md +++ b/docs/zh-CN/user-guide/04-packaging.md @@ -194,7 +194,9 @@ skill-seekers package output/my-skill/ \ | `--chunk-for-rag` | auto | Enable chunking | | `--chunk-tokens` | 512 | Tokens per chunk | | `--chunk-overlap-tokens` | 50 | Overlap between chunks (tokens) | -| `--no-preserve-code` | - | Allow splitting code blocks | +| `--no-preserve-code-blocks` | - | Allow splitting code blocks | + +> **自动缩放重叠:** 当 `--chunk-tokens` 设置为非默认值但 `--chunk-overlap-tokens` 保持默认值 (50) 时,重叠会自动缩放为 `max(50, chunk_tokens // 10)`,以在较大的分块中实现更好的上下文保留。 --- diff --git a/pyproject.toml b/pyproject.toml index 0c2a3ab..b07acf0 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -128,10 +128,15 @@ sentence-transformers = [ "sentence-transformers>=2.2.0", ] +pinecone = [ + "pinecone>=5.0.0", +] + rag-upload = [ "chromadb>=0.4.0", "weaviate-client>=3.25.0", "sentence-transformers>=2.2.0", + "pinecone>=5.0.0", ] # All cloud storage providers combined @@ -167,6 +172,7 @@ all = [ "azure-storage-blob>=12.19.0", "chromadb>=0.4.0", "weaviate-client>=3.25.0", + "pinecone>=5.0.0", "fastapi>=0.109.0", "sentence-transformers>=2.3.0", "numpy>=1.24.0", diff --git a/src/skill_seekers/cli/adaptors/__init__.py 
b/src/skill_seekers/cli/adaptors/__init__.py index a012843..6240082 100644 --- a/src/skill_seekers/cli/adaptors/__init__.py +++ b/src/skill_seekers/cli/adaptors/__init__.py @@ -64,6 +64,11 @@ try: except ImportError: HaystackAdaptor = None +try: + from .pinecone_adaptor import PineconeAdaptor +except ImportError: + PineconeAdaptor = None + # Registry of available adaptors ADAPTORS: dict[str, type[SkillAdaptor]] = {} @@ -91,6 +96,8 @@ if QdrantAdaptor: ADAPTORS["qdrant"] = QdrantAdaptor if HaystackAdaptor: ADAPTORS["haystack"] = HaystackAdaptor +if PineconeAdaptor: + ADAPTORS["pinecone"] = PineconeAdaptor def get_adaptor(platform: str, config: dict = None) -> SkillAdaptor: diff --git a/src/skill_seekers/cli/adaptors/base.py b/src/skill_seekers/cli/adaptors/base.py index ca02c30..55c1fcc 100644 --- a/src/skill_seekers/cli/adaptors/base.py +++ b/src/skill_seekers/cli/adaptors/base.py @@ -11,6 +11,8 @@ from dataclasses import dataclass, field from pathlib import Path from typing import Any +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS + @dataclass class SkillMetadata: @@ -19,6 +21,7 @@ class SkillMetadata: name: str description: str version: str = "1.0.0" + doc_version: str = "" # Documentation version (e.g., "16.2") for RAG metadata filtering author: str | None = None tags: list[str] = field(default_factory=list) @@ -73,8 +76,9 @@ class SkillAdaptor(ABC): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill for platform (ZIP, tar.gz, etc.). @@ -228,6 +232,47 @@ class SkillAdaptor(ABC): return skill_md_path.read_text(encoding="utf-8") + def _read_frontmatter(self, skill_dir: Path) -> dict[str, str]: + """Read YAML frontmatter from SKILL.md. 
+ + Args: + skill_dir: Path to skill directory + + Returns: + Dict of key-value pairs from the frontmatter block. + """ + content = self._read_skill_md(skill_dir) + if content.startswith("---"): + parts = content.split("---", 2) + if len(parts) >= 3: + frontmatter: dict[str, str] = {} + for line in parts[1].strip().splitlines(): + if ":" in line: + key, _, value = line.partition(":") + frontmatter[key.strip()] = value.strip() + return frontmatter + return {} + + def _build_skill_metadata(self, skill_dir: Path) -> SkillMetadata: + """Build SkillMetadata from SKILL.md frontmatter. + + Reads name, description, version, and doc_version from frontmatter + instead of using hardcoded defaults. + + Args: + skill_dir: Path to skill directory + + Returns: + SkillMetadata populated from frontmatter values. + """ + fm = self._read_frontmatter(skill_dir) + return SkillMetadata( + name=skill_dir.name, + description=fm.get("description", f"Documentation for {skill_dir.name}"), + version=fm.get("version", "1.0.0"), + doc_version=fm.get("doc_version", ""), + ) + def _iterate_references(self, skill_dir: Path): """ Iterate over all reference files in skill directory. @@ -266,6 +311,7 @@ class SkillAdaptor(ABC): base_meta = { "source": metadata.name, "version": metadata.version, + "doc_version": metadata.doc_version, "description": metadata.description, } if metadata.author: @@ -280,9 +326,10 @@ class SkillAdaptor(ABC): content: str, metadata: dict, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, source_file: str = None, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> list[tuple[str, dict]]: """ Optionally chunk content for RAG platforms. 
@@ -321,9 +368,15 @@ class SkillAdaptor(ABC): return [(content, metadata)] # RAGChunker uses TOKENS (it converts to chars internally) + # If overlap is at the default value but chunk size was customized, + # scale overlap proportionally (10% of chunk size, min DEFAULT_CHUNK_OVERLAP_TOKENS) + effective_overlap = chunk_overlap_tokens + if chunk_overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS and chunk_max_tokens != DEFAULT_CHUNK_TOKENS: + effective_overlap = max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_max_tokens // 10) + chunker = RAGChunker( chunk_size=chunk_max_tokens, - chunk_overlap=max(50, chunk_max_tokens // 10), # 10% overlap + chunk_overlap=effective_overlap, preserve_code_blocks=preserve_code_blocks, preserve_paragraphs=True, min_chunk_size=100, # 100 tokens minimum @@ -433,6 +486,69 @@ class SkillAdaptor(ABC): # Plain hex digest return hash_hex + def _generate_openai_embeddings( + self, documents: list[str], api_key: str | None = None + ) -> list[list[float]]: + """Generate embeddings using OpenAI text-embedding-3-small. + + Args: + documents: List of document texts + api_key: OpenAI API key (or uses OPENAI_API_KEY env var) + + Returns: + List of embedding vectors + """ + import os + + try: + from openai import OpenAI + except ImportError: + raise ImportError("openai not installed. Run: pip install openai") from None + + api_key = api_key or os.getenv("OPENAI_API_KEY") + if not api_key: + raise ValueError("OPENAI_API_KEY not set. 
Set via env var or --openai-api-key") + + client = OpenAI(api_key=api_key) + embeddings: list[list[float]] = [] + batch_size = 100 + + print(f" Generating OpenAI embeddings for {len(documents)} documents...") + + for i in range(0, len(documents), batch_size): + batch = documents[i : i + batch_size] + try: + response = client.embeddings.create( + input=batch, model="text-embedding-3-small" + ) + embeddings.extend([item.embedding for item in response.data]) + print(f" ✓ Embedded {min(i + batch_size, len(documents))}/{len(documents)}") + except Exception as e: + raise Exception(f"OpenAI embedding generation failed: {e}") from e + + return embeddings + + def _generate_st_embeddings(self, documents: list[str]) -> list[list[float]]: + """Generate embeddings using sentence-transformers (all-MiniLM-L6-v2). + + Args: + documents: List of document texts + + Returns: + List of embedding vectors + """ + try: + from sentence_transformers import SentenceTransformer + except ImportError: + raise ImportError( + "sentence-transformers not installed. Run: pip install sentence-transformers" + ) from None + + print(f" Generating sentence-transformer embeddings for {len(documents)} documents...") + model = SentenceTransformer("all-MiniLM-L6-v2") + embeddings = model.encode(documents, show_progress_bar=True) + return [emb.tolist() for emb in embeddings] + def _generate_toc(self, skill_dir: Path) -> str: """ Helper to generate table of contents from references. 
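The `_maybe_chunk_content()` hunk in the base-class diff above implements the overlap auto-scaling described in the changelog. A standalone sketch of just that rule (the `effective_overlap` helper name is illustrative, not part of the codebase):

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50

def effective_overlap(chunk_max_tokens: int,
                      chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS) -> int:
    """Mirror of the auto-scaling rule in SkillAdaptor._maybe_chunk_content()."""
    # Scale only when the chunk size was customized but the overlap was not:
    # 10% of the chunk size, floored at the default overlap.
    if (chunk_overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS
            and chunk_max_tokens != DEFAULT_CHUNK_TOKENS):
        return max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_max_tokens // 10)
    return chunk_overlap_tokens

print(effective_overlap(512))        # 50  -- all defaults, no scaling
print(effective_overlap(2048))       # 204 -- auto-scaled to 10%
print(effective_overlap(2048, 100))  # 100 -- explicit overlap always wins
print(effective_overlap(300))        # 50  -- floor keeps small chunks at 50
```

An explicitly passed `--chunk-overlap-tokens` always takes precedence; scaling only fills in a sensible value when the user left the overlap at its default.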
diff --git a/src/skill_seekers/cli/adaptors/chroma.py b/src/skill_seekers/cli/adaptors/chroma.py index c6e0a6d..be37728 100644 --- a/src/skill_seekers/cli/adaptors/chroma.py +++ b/src/skill_seekers/cli/adaptors/chroma.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class ChromaAdaptor(SkillAdaptor): @@ -79,6 +80,7 @@ class ChromaAdaptor(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -86,9 +88,10 @@ class ChromaAdaptor(SkillAdaptor): content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to parallel arrays @@ -109,6 +112,7 @@ class ChromaAdaptor(SkillAdaptor): "file": ref_file.name, "type": "reference", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -116,9 +120,10 @@ class ChromaAdaptor(SkillAdaptor): ref_content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to parallel arrays @@ -144,8 +149,9 @@ class ChromaAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + 
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for Chroma. @@ -166,12 +172,8 @@ class ChromaAdaptor(SkillAdaptor): output_path = self._format_output_path(skill_dir, Path(output_path), "-chroma.json") output_path.parent.mkdir(parents=True, exist_ok=True) - # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"Chroma collection data for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate Chroma data chroma_json = self.format_skill_md( @@ -180,6 +182,7 @@ class ChromaAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file @@ -206,7 +209,7 @@ class ChromaAdaptor(SkillAdaptor): return output_path - def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]: + def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]: """ Upload packaged skill to ChromaDB. @@ -250,9 +253,7 @@ class ChromaAdaptor(SkillAdaptor): print(f"🌐 Connecting to ChromaDB at: {chroma_url}") # Parse URL if "://" in chroma_url: - parts = chroma_url.split("://") - parts[0] - host_port = parts[1] + _scheme, host_port = chroma_url.split("://", 1) else: host_port = chroma_url @@ -352,52 +353,6 @@ class ChromaAdaptor(SkillAdaptor): except Exception as e: return {"success": False, "message": f"Upload failed: {e}"} - def _generate_openai_embeddings( - self, documents: list[str], api_key: str = None - ) -> list[list[float]]: - """ - Generate embeddings using OpenAI API. - - Args: - documents: List of document texts - api_key: OpenAI API key (or uses OPENAI_API_KEY env var) - - Returns: - List of embedding vectors - """ - import os - - try: - from openai import OpenAI - except ImportError: - raise ImportError("openai not installed. 
Run: pip install openai") from None - - api_key = api_key or os.getenv("OPENAI_API_KEY") - if not api_key: - raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key") - - client = OpenAI(api_key=api_key) - - # Batch process (OpenAI allows up to 2048 inputs) - embeddings = [] - batch_size = 100 - - print(f" Generating embeddings for {len(documents)} documents...") - - for i in range(0, len(documents), batch_size): - batch = documents[i : i + batch_size] - try: - response = client.embeddings.create( - input=batch, - model="text-embedding-3-small", # Cheapest, fastest - ) - embeddings.extend([item.embedding for item in response.data]) - print(f" ✓ Processed {min(i + batch_size, len(documents))}/{len(documents)}") - except Exception as e: - raise Exception(f"OpenAI embedding generation failed: {e}") from e - - return embeddings - def validate_api_key(self, _api_key: str) -> bool: """ Chroma format doesn't use API keys for packaging. diff --git a/src/skill_seekers/cli/adaptors/claude.py b/src/skill_seekers/cli/adaptors/claude.py index 503ca1d..b8f97c3 100644 --- a/src/skill_seekers/cli/adaptors/claude.py +++ b/src/skill_seekers/cli/adaptors/claude.py @@ -12,6 +12,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class ClaudeAdaptor(SkillAdaptor): @@ -86,8 +87,9 @@ version: {metadata.version} skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into ZIP file for Claude. 
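The `_build_skill_metadata()` call that replaces the hardcoded `SkillMetadata(...)` in the chroma hunk above reads real values out of SKILL.md frontmatter. A self-contained sketch of the same naive `key: value` parse used by `_read_frontmatter()` (no nested YAML, no list values):

```python
def read_frontmatter(content: str) -> dict[str, str]:
    # Frontmatter is the block between the first two "---" fences;
    # each line is split once on ":" into a key/value pair.
    if not content.startswith("---"):
        return {}
    parts = content.split("---", 2)
    if len(parts) < 3:
        return {}
    frontmatter: dict[str, str] = {}
    for line in parts[1].strip().splitlines():
        if ":" in line:
            key, _, value = line.partition(":")
            frontmatter[key.strip()] = value.strip()
    return frontmatter

skill_md = """---
name: react-flow
description: Documentation for react-flow
version: 1.0.0
doc_version: 16.2
---
# react-flow
"""
print(read_frontmatter(skill_md)["doc_version"])  # 16.2
```

Files without a frontmatter block yield an empty dict, so `_build_skill_metadata()` falls back to its defaults in that case.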
diff --git a/src/skill_seekers/cli/adaptors/faiss_helpers.py b/src/skill_seekers/cli/adaptors/faiss_helpers.py index 62c8539..df79fd6 100644 --- a/src/skill_seekers/cli/adaptors/faiss_helpers.py +++ b/src/skill_seekers/cli/adaptors/faiss_helpers.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class FAISSHelpers(SkillAdaptor): @@ -81,6 +82,7 @@ class FAISSHelpers(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -88,9 +90,10 @@ class FAISSHelpers(SkillAdaptor): content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to parallel arrays @@ -110,6 +113,7 @@ class FAISSHelpers(SkillAdaptor): "file": ref_file.name, "type": "reference", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -117,9 +121,10 @@ class FAISSHelpers(SkillAdaptor): ref_content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to parallel arrays @@ -155,8 +160,9 @@ class FAISSHelpers(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, 
preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for FAISS. @@ -176,12 +182,8 @@ class FAISSHelpers(SkillAdaptor): output_path = self._format_output_path(skill_dir, Path(output_path), "-faiss.json") output_path.parent.mkdir(parents=True, exist_ok=True) - # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"FAISS data for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate FAISS data faiss_json = self.format_skill_md( @@ -190,6 +192,7 @@ class FAISSHelpers(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file diff --git a/src/skill_seekers/cli/adaptors/gemini.py b/src/skill_seekers/cli/adaptors/gemini.py index 3e58f1b..b21a865 100644 --- a/src/skill_seekers/cli/adaptors/gemini.py +++ b/src/skill_seekers/cli/adaptors/gemini.py @@ -13,6 +13,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class GeminiAdaptor(SkillAdaptor): @@ -91,8 +92,9 @@ See the references directory for complete documentation with examples and best p skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into tar.gz file for Gemini. 
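Separately from the chunking changes, the `ChromaAdaptor.upload()` hunk earlier in this diff tidied URL parsing from a three-statement split (with a dead `parts[0]` expression) into a single `split("://", 1)`. A behavior sketch (`parse_host_port` is an illustrative name, not an API in the codebase):

```python
def parse_host_port(chroma_url: str) -> str:
    # Split once on the scheme separator and keep the remainder;
    # URLs without a scheme pass through unchanged.
    if "://" in chroma_url:
        _scheme, host_port = chroma_url.split("://", 1)
    else:
        host_port = chroma_url
    return host_port

print(parse_host_port("http://localhost:8000"))  # localhost:8000
print(parse_host_port("localhost:8000"))         # localhost:8000
```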
diff --git a/src/skill_seekers/cli/adaptors/haystack.py b/src/skill_seekers/cli/adaptors/haystack.py index 7876ccc..a16b71a 100644 --- a/src/skill_seekers/cli/adaptors/haystack.py +++ b/src/skill_seekers/cli/adaptors/haystack.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class HaystackAdaptor(SkillAdaptor): @@ -62,6 +63,7 @@ class HaystackAdaptor(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -69,9 +71,10 @@ class HaystackAdaptor(SkillAdaptor): content, doc_meta, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as documents @@ -95,6 +98,7 @@ class HaystackAdaptor(SkillAdaptor): "file": ref_file.name, "type": "reference", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -102,9 +106,10 @@ class HaystackAdaptor(SkillAdaptor): ref_content, doc_meta, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as documents @@ -124,8 +129,9 @@ class HaystackAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + 
chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for Haystack. @@ -147,11 +153,8 @@ class HaystackAdaptor(SkillAdaptor): output_path.parent.mkdir(parents=True, exist_ok=True) # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"Haystack documents for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate Haystack documents documents_json = self.format_skill_md( @@ -160,6 +163,7 @@ class HaystackAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file diff --git a/src/skill_seekers/cli/adaptors/langchain.py b/src/skill_seekers/cli/adaptors/langchain.py index d937290..d6be4ab 100644 --- a/src/skill_seekers/cli/adaptors/langchain.py +++ b/src/skill_seekers/cli/adaptors/langchain.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class LangChainAdaptor(SkillAdaptor): @@ -62,6 +63,7 @@ class LangChainAdaptor(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -69,9 +71,10 @@ class LangChainAdaptor(SkillAdaptor): content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to documents @@ -90,6 +93,7 @@ class LangChainAdaptor(SkillAdaptor): "file": ref_file.name, "type": "reference", 
"version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -97,9 +101,10 @@ class LangChainAdaptor(SkillAdaptor): ref_content, doc_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks to documents @@ -114,8 +119,9 @@ class LangChainAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for LangChain. @@ -139,12 +145,8 @@ class LangChainAdaptor(SkillAdaptor): output_path = self._format_output_path(skill_dir, Path(output_path), "-langchain.json") output_path.parent.mkdir(parents=True, exist_ok=True) - # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"LangChain documents for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate LangChain documents with chunking documents_json = self.format_skill_md( @@ -153,6 +155,7 @@ class LangChainAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file diff --git a/src/skill_seekers/cli/adaptors/llama_index.py b/src/skill_seekers/cli/adaptors/llama_index.py index 7ea6ed9..3ac80d1 100644 --- a/src/skill_seekers/cli/adaptors/llama_index.py +++ b/src/skill_seekers/cli/adaptors/llama_index.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import 
SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class LlamaIndexAdaptor(SkillAdaptor): @@ -77,6 +78,7 @@ class LlamaIndexAdaptor(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -84,9 +86,10 @@ class LlamaIndexAdaptor(SkillAdaptor): content, node_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as nodes @@ -112,6 +115,7 @@ class LlamaIndexAdaptor(SkillAdaptor): "file": ref_file.name, "type": "reference", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -119,9 +123,10 @@ class LlamaIndexAdaptor(SkillAdaptor): ref_content, node_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as nodes @@ -143,8 +148,9 @@ class LlamaIndexAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for LlamaIndex. 
@@ -166,11 +172,8 @@ class LlamaIndexAdaptor(SkillAdaptor): output_path.parent.mkdir(parents=True, exist_ok=True) # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"LlamaIndex nodes for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate LlamaIndex nodes nodes_json = self.format_skill_md( @@ -179,6 +182,7 @@ class LlamaIndexAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file diff --git a/src/skill_seekers/cli/adaptors/markdown.py b/src/skill_seekers/cli/adaptors/markdown.py index 5d60033..f280571 100644 --- a/src/skill_seekers/cli/adaptors/markdown.py +++ b/src/skill_seekers/cli/adaptors/markdown.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class MarkdownAdaptor(SkillAdaptor): @@ -86,8 +87,9 @@ Browse the reference files for detailed information on each topic. All files are skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into ZIP file with markdown documentation. 
diff --git a/src/skill_seekers/cli/adaptors/openai.py b/src/skill_seekers/cli/adaptors/openai.py
index e6437af..511ab02 100644
--- a/src/skill_seekers/cli/adaptors/openai.py
+++ b/src/skill_seekers/cli/adaptors/openai.py
@@ -12,6 +12,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class OpenAIAdaptor(SkillAdaptor):
@@ -108,8 +109,9 @@ Always prioritize accuracy by consulting the attached documentation files before
         skill_dir: Path,
         output_path: Path,
         enable_chunking: bool = False,
-        chunk_max_tokens: int = 512,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
         preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
     ) -> Path:
         """
         Package skill into ZIP file for OpenAI Assistants.
diff --git a/src/skill_seekers/cli/adaptors/pinecone_adaptor.py b/src/skill_seekers/cli/adaptors/pinecone_adaptor.py
new file mode 100644
index 0000000..6978779
--- /dev/null
+++ b/src/skill_seekers/cli/adaptors/pinecone_adaptor.py
@@ -0,0 +1,400 @@
+#!/usr/bin/env python3
+"""
+Pinecone Adaptor
+
+Implements Pinecone vector database format for RAG pipelines.
+Converts Skill Seekers documentation into Pinecone-compatible format
+with namespace support and batch upsert.
+"""
+
+import json
+from pathlib import Path
+from typing import Any
+
+from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+
+# Pinecone metadata value limit: 40 KB per vector
+PINECONE_METADATA_BYTES_LIMIT = 40_000
+
+
+class PineconeAdaptor(SkillAdaptor):
+    """
+    Pinecone vector database adaptor.
+
+    Handles:
+    - Pinecone-compatible vector format with metadata
+    - Namespace support for multi-tenant indexing
+    - Batch upsert (100 vectors per batch)
+    - OpenAI and sentence-transformers embedding generation
+    - Metadata truncation to stay within Pinecone's 40KB limit
+    """
+
+    PLATFORM = "pinecone"
+    PLATFORM_NAME = "Pinecone (Vector Database)"
+    DEFAULT_API_ENDPOINT = None
+
+    def _generate_id(self, content: str, metadata: dict) -> str:
+        """Generate deterministic ID from content and metadata."""
+        return self._generate_deterministic_id(content, metadata, format="hex")
+
+    def _truncate_text_for_metadata(self, text: str, max_bytes: int = PINECONE_METADATA_BYTES_LIMIT) -> str:
+        """Truncate text to fit within Pinecone's metadata byte limit.
+
+        Pinecone limits metadata to 40KB per vector. This truncates
+        the text field (largest metadata value) to stay within limits,
+        leaving room for other metadata fields (~2KB overhead).
+
+        Args:
+            text: Text content to potentially truncate
+            max_bytes: Maximum bytes for the text field
+
+        Returns:
+            Truncated text that fits within the byte limit
+        """
+        # Reserve ~2KB for other metadata fields
+        available = max_bytes - 2000
+        encoded = text.encode("utf-8")
+        if len(encoded) <= available:
+            return text
+        # Truncate at byte boundary, decode safely
+        truncated = encoded[:available].decode("utf-8", errors="ignore")
+        return truncated
+
+    def format_skill_md(
+        self, skill_dir: Path, metadata: SkillMetadata, enable_chunking: bool = False, **kwargs
+    ) -> str:
+        """
+        Format skill as JSON for Pinecone ingestion.
+
+        Creates a package with vectors ready for upsert:
+        {
+            "index_name": "...",
+            "namespace": "...",
+            "dimension": 1536,
+            "metric": "cosine",
+            "vectors": [
+                {
+                    "id": "hex-id",
+                    "metadata": {
+                        "text": "content",
+                        "source": "...",
+                        "category": "...",
+                        ...
+                    }
+                }
+            ]
+        }
+
+        No ``values`` field — embeddings are added at upload time.
+
+        Args:
+            skill_dir: Path to skill directory
+            metadata: Skill metadata
+            enable_chunking: Enable intelligent chunking for large documents
+            **kwargs: Additional chunking parameters
+
+        Returns:
+            JSON string containing Pinecone-compatible data
+        """
+        vectors: list[dict[str, Any]] = []
+
+        # Convert SKILL.md (main documentation)
+        skill_md_path = skill_dir / "SKILL.md"
+        if skill_md_path.exists():
+            content = self._read_existing_content(skill_dir)
+            if content.strip():
+                doc_metadata = {
+                    "source": metadata.name,
+                    "category": "overview",
+                    "file": "SKILL.md",
+                    "type": "documentation",
+                    "version": metadata.version,
+                    "doc_version": metadata.doc_version,
+                }
+
+                chunks = self._maybe_chunk_content(
+                    content,
+                    doc_metadata,
+                    enable_chunking=enable_chunking,
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
+                    preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
+                    source_file="SKILL.md",
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
+                )
+
+                for chunk_text, chunk_meta in chunks:
+                    vectors.append(
+                        {
+                            "id": self._generate_id(chunk_text, chunk_meta),
+                            "metadata": {
+                                **chunk_meta,
+                                "text": self._truncate_text_for_metadata(chunk_text),
+                            },
+                        }
+                    )
+
+        # Convert all reference files
+        for ref_file, ref_content in self._iterate_references(skill_dir):
+            if ref_content.strip():
+                category = ref_file.stem.replace("_", " ").lower()
+
+                doc_metadata = {
+                    "source": metadata.name,
+                    "category": category,
+                    "file": ref_file.name,
+                    "type": "reference",
+                    "version": metadata.version,
+                    "doc_version": metadata.doc_version,
+                }
+
+                chunks = self._maybe_chunk_content(
+                    ref_content,
+                    doc_metadata,
+                    enable_chunking=enable_chunking,
+                    chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
+                    preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
+                    source_file=ref_file.name,
+                    chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
+                )
+
+                for chunk_text, chunk_meta in chunks:
+                    vectors.append(
+                        {
+                            "id": self._generate_id(chunk_text, chunk_meta),
+                            "metadata": {
+                                **chunk_meta,
+                                "text": self._truncate_text_for_metadata(chunk_text),
+                            },
+                        }
+                    )
+
+        index_name = metadata.name.replace("_", "-").lower()
+
+        return json.dumps(
+            {
+                "index_name": index_name,
+                "namespace": index_name,
+                "dimension": 1536,
+                "metric": "cosine",
+                "vectors": vectors,
+            },
+            indent=2,
+            ensure_ascii=False,
+        )
+
+    def package(
+        self,
+        skill_dir: Path,
+        output_path: Path,
+        enable_chunking: bool = False,
+        chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS,
+        preserve_code_blocks: bool = True,
+        chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS,
+    ) -> Path:
+        """
+        Package skill into JSON file for Pinecone.
+
+        Creates a JSON file containing vectors with metadata, ready for
+        embedding generation and upsert to a Pinecone index.
+
+        Args:
+            skill_dir: Path to skill directory
+            output_path: Output path/filename for JSON file
+            enable_chunking: Enable intelligent chunking for large documents
+            chunk_max_tokens: Maximum tokens per chunk (default: 512)
+            preserve_code_blocks: Preserve code blocks during chunking
+            chunk_overlap_tokens: Overlap between chunks in tokens (default: 50)
+
+        Returns:
+            Path to created JSON file
+        """
+        skill_dir = Path(skill_dir)
+
+        output_path = self._format_output_path(skill_dir, Path(output_path), "-pinecone.json")
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+
+        # Read metadata from SKILL.md frontmatter
+        metadata = self._build_skill_metadata(skill_dir)
+
+        pinecone_json = self.format_skill_md(
+            skill_dir,
+            metadata,
+            enable_chunking=enable_chunking,
+            chunk_max_tokens=chunk_max_tokens,
+            preserve_code_blocks=preserve_code_blocks,
+            chunk_overlap_tokens=chunk_overlap_tokens,
+        )
+
+        output_path.write_text(pinecone_json, encoding="utf-8")
+
+        print("\n✅ Pinecone data packaged successfully!")
+        print(f"📦 Output: {output_path}")
+
+        data = json.loads(pinecone_json)
+        print(f"📊 Total vectors: {len(data['vectors'])}")
+        print(f"🗂️ Index name: {data['index_name']}")
+        print(f"📁 Namespace: {data['namespace']}")
+        print(f"📐 Default dimension: {data['dimension']} (auto-detected at upload time)")
+
+        # Show category breakdown
+        categories: dict[str, int] = {}
+        for vec in data["vectors"]:
+            cat = vec["metadata"].get("category", "unknown")
+            categories[cat] = categories.get(cat, 0) + 1
+
+        print("📁 Categories:")
+        for cat, count in sorted(categories.items()):
+            print(f" - {cat}: {count}")
+
+        return output_path
+
+    def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]:
+        """
+        Upload packaged skill to Pinecone.
+
+        Args:
+            package_path: Path to packaged JSON
+            api_key: Pinecone API key (or uses PINECONE_API_KEY env var)
+            **kwargs:
+                index_name: Override index name from JSON
+                namespace: Override namespace from JSON
+                dimension: Embedding dimension (default: 1536)
+                metric: Distance metric (default: "cosine")
+                embedding_function: "openai" or "sentence-transformers"
+                cloud: Cloud provider (default: "aws")
+                region: Cloud region (default: "us-east-1")
+
+        Returns:
+            {"success": bool, "index": str, "namespace": str, "count": int}
+        """
+        import os
+
+        try:
+            from pinecone import Pinecone, ServerlessSpec
+        except ImportError:
+            return {
+                "success": False,
+                "message": "pinecone not installed. Run: pip install 'pinecone>=5.0.0'",
+            }
+
+        api_key = api_key or os.getenv("PINECONE_API_KEY")
+        if not api_key:
+            return {
+                "success": False,
+                "message": (
+                    "PINECONE_API_KEY not set. "
+                    "Set via env var or pass api_key parameter."
+                ),
+            }
+
+        # Load package
+        with open(package_path) as f:
+            data = json.load(f)
+
+        index_name = kwargs.get("index_name", data.get("index_name", "skill-docs"))
+        namespace = kwargs.get("namespace", data.get("namespace", ""))
+        metric = kwargs.get("metric", data.get("metric", "cosine"))
+        cloud = kwargs.get("cloud", "aws")
+        region = kwargs.get("region", "us-east-1")
+
+        # Auto-detect dimension from embedding model
+        embedding_function = kwargs.get("embedding_function", "openai")
+        EMBEDDING_DIMENSIONS = {
+            "openai": 1536,  # text-embedding-3-small
+            "sentence-transformers": 384,  # all-MiniLM-L6-v2
+        }
+        # Priority: explicit kwarg > model-based auto-detect > JSON file > fallback
+        # Note: format_skill_md() hardcodes dimension=1536 in the JSON, so we must
+        # give EMBEDDING_DIMENSIONS priority over the file to handle sentence-transformers (384).
+        dimension = kwargs.get(
+            "dimension",
+            EMBEDDING_DIMENSIONS.get(embedding_function, data.get("dimension", 1536)),
+        )
+
+        try:
+            # Generate embeddings FIRST — before creating the index.
+            # This avoids leaving an empty Pinecone index behind when
+            # embedding generation fails (e.g. missing API key).
+            texts = [vec["metadata"]["text"] for vec in data["vectors"]]
+
+            if embedding_function == "openai":
+                embeddings = self._generate_openai_embeddings(texts)
+            elif embedding_function == "sentence-transformers":
+                embeddings = self._generate_st_embeddings(texts)
+            else:
+                return {
+                    "success": False,
+                    "message": f"Unknown embedding_function: {embedding_function}. Use 'openai' or 'sentence-transformers'.",
+                }
+
+            pc = Pinecone(api_key=api_key)
+
+            # Create index if it doesn't exist
+            existing_indexes = [idx.name for idx in pc.list_indexes()]
+            if index_name not in existing_indexes:
+                print(f"🔧 Creating Pinecone index: {index_name} (dimension={dimension}, metric={metric})")
+                pc.create_index(
+                    name=index_name,
+                    dimension=dimension,
+                    metric=metric,
+                    spec=ServerlessSpec(cloud=cloud, region=region),
+                )
+                print(f"✅ Index '{index_name}' created")
+            else:
+                print(f"ℹ️ Using existing index: {index_name}")
+
+            index = pc.Index(index_name)
+
+            # Batch upsert (100 per batch — Pinecone recommendation)
+            batch_size = 100
+            vectors_to_upsert = []
+            for i, vec in enumerate(data["vectors"]):
+                vectors_to_upsert.append(
+                    {
+                        "id": vec["id"],
+                        "values": embeddings[i],
+                        "metadata": vec["metadata"],
+                    }
+                )
+
+            total = len(vectors_to_upsert)
+            print(f"🔄 Upserting {total} vectors to Pinecone...")
+
+            for i in range(0, total, batch_size):
+                batch = vectors_to_upsert[i : i + batch_size]
+                index.upsert(vectors=batch, namespace=namespace)
+                print(f" ✓ Upserted {min(i + batch_size, total)}/{total}")
+
+            print(f"✅ Uploaded {total} vectors to Pinecone index '{index_name}'")
+
+            return {
+                "success": True,
+                "message": f"Uploaded {total} vectors to Pinecone index '{index_name}' (namespace: '{namespace}')",
+                "url": None,
+                "index": index_name,
+                "namespace": namespace,
+                "count": total,
+            }
+
+        except Exception as e:
+            return {"success": False, "message": f"Pinecone upload failed: {e}"}
+
+    def validate_api_key(self, _api_key: str) -> bool:
+        """Pinecone doesn't need API key for packaging."""
+        return False
+
+    def get_env_var_name(self) -> str:
+        """Return the expected env var for Pinecone API key."""
+        return "PINECONE_API_KEY"
+
+    def supports_enhancement(self) -> bool:
+        """Pinecone format doesn't support AI enhancement."""
+        return False
+
+    def enhance(self, _skill_dir: Path, _api_key: str) -> bool:
+        """Pinecone format doesn't support enhancement."""
+        print("❌ Pinecone format does not support enhancement")
+        print(" Enhance before packaging:")
+        print(" skill-seekers enhance output/skill/ --mode LOCAL")
+        print(" skill-seekers package output/skill/ --target pinecone")
+        return False
diff --git a/src/skill_seekers/cli/adaptors/qdrant.py b/src/skill_seekers/cli/adaptors/qdrant.py
index d201510..b9f6a2a 100644
--- a/src/skill_seekers/cli/adaptors/qdrant.py
+++ b/src/skill_seekers/cli/adaptors/qdrant.py
@@ -11,6 +11,7 @@ from pathlib import Path
 from typing import Any
 
 from .base import SkillAdaptor, SkillMetadata
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
 
 
 class QdrantAdaptor(SkillAdaptor):
@@ -76,6 +77,7 @@ class QdrantAdaptor(SkillAdaptor):
                 "file": "SKILL.md",
                 "type": "documentation",
                 "version": metadata.version,
+                "doc_version": metadata.doc_version,
             }
 
             # Chunk if enabled
@@ -83,9 +85,10 @@ class QdrantAdaptor(SkillAdaptor):
                 content,
                 payload_meta,
                 enable_chunking=enable_chunking,
-                chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+                chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS),
                 preserve_code_blocks=kwargs.get("preserve_code_blocks", True),
                 source_file="SKILL.md",
+                chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS),
             )
 
             # Add all chunks as points
@@ -109,6 +112,7 @@ class QdrantAdaptor(SkillAdaptor):
                         "file": chunk_meta.get("file", "SKILL.md"),
                         "type": chunk_meta.get("type", "documentation"),
                         "version": chunk_meta.get("version", metadata.version),
+                        "doc_version": chunk_meta.get("doc_version", ""),
                     },
                 }
             )
@@ -124,6 +128,7 @@ class QdrantAdaptor(SkillAdaptor):
                 "file": ref_file.name,
                 "type": "reference",
                 "version": metadata.version,
+                "doc_version": metadata.doc_version,
             }
 
             # Chunk if enabled
@@ -131,9 +136,10 @@ class QdrantAdaptor(SkillAdaptor):
                 ref_content,
                 payload_meta,
                 enable_chunking=enable_chunking,
-                chunk_max_tokens=kwargs.get("chunk_max_tokens", 512),
+
chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as points @@ -157,6 +163,7 @@ class QdrantAdaptor(SkillAdaptor): "file": chunk_meta.get("file", ref_file.name), "type": chunk_meta.get("type", "reference"), "version": chunk_meta.get("version", metadata.version), + "doc_version": chunk_meta.get("doc_version", ""), }, } ) @@ -189,8 +196,9 @@ class QdrantAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for Qdrant. @@ -211,11 +219,8 @@ class QdrantAdaptor(SkillAdaptor): output_path.parent.mkdir(parents=True, exist_ok=True) # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"Qdrant data for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate Qdrant data qdrant_json = self.format_skill_md( @@ -224,6 +229,7 @@ class QdrantAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file diff --git a/src/skill_seekers/cli/adaptors/weaviate.py b/src/skill_seekers/cli/adaptors/weaviate.py index c06081c..439c56b 100644 --- a/src/skill_seekers/cli/adaptors/weaviate.py +++ b/src/skill_seekers/cli/adaptors/weaviate.py @@ -11,6 +11,7 @@ from pathlib import Path from typing import Any from .base import SkillAdaptor, SkillMetadata +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS class 
WeaviateAdaptor(SkillAdaptor): @@ -96,7 +97,14 @@ class WeaviateAdaptor(SkillAdaptor): { "name": "version", "dataType": ["text"], - "description": "Documentation version", + "description": "Skill package version", + "indexFilterable": True, + "indexSearchable": False, + }, + { + "name": "doc_version", + "dataType": ["text"], + "description": "Documentation version (e.g., 16.2)", "indexFilterable": True, "indexSearchable": False, }, @@ -137,6 +145,7 @@ class WeaviateAdaptor(SkillAdaptor): "file": "SKILL.md", "type": "documentation", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -144,9 +153,10 @@ class WeaviateAdaptor(SkillAdaptor): content, obj_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file="SKILL.md", + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as objects @@ -161,6 +171,7 @@ class WeaviateAdaptor(SkillAdaptor): "file": chunk_meta.get("file", "SKILL.md"), "type": chunk_meta.get("type", "documentation"), "version": chunk_meta.get("version", metadata.version), + "doc_version": chunk_meta.get("doc_version", ""), }, } ) @@ -177,6 +188,7 @@ class WeaviateAdaptor(SkillAdaptor): "file": ref_file.name, "type": "reference", "version": metadata.version, + "doc_version": metadata.doc_version, } # Chunk if enabled @@ -184,9 +196,10 @@ class WeaviateAdaptor(SkillAdaptor): ref_content, obj_metadata, enable_chunking=enable_chunking, - chunk_max_tokens=kwargs.get("chunk_max_tokens", 512), + chunk_max_tokens=kwargs.get("chunk_max_tokens", DEFAULT_CHUNK_TOKENS), preserve_code_blocks=kwargs.get("preserve_code_blocks", True), source_file=ref_file.name, + chunk_overlap_tokens=kwargs.get("chunk_overlap_tokens", DEFAULT_CHUNK_OVERLAP_TOKENS), ) # Add all chunks as objects @@ -201,6 
+214,7 @@ class WeaviateAdaptor(SkillAdaptor): "file": chunk_meta.get("file", ref_file.name), "type": chunk_meta.get("type", "reference"), "version": chunk_meta.get("version", metadata.version), + "doc_version": chunk_meta.get("doc_version", ""), }, } ) @@ -221,8 +235,9 @@ class WeaviateAdaptor(SkillAdaptor): skill_dir: Path, output_path: Path, enable_chunking: bool = False, - chunk_max_tokens: int = 512, + chunk_max_tokens: int = DEFAULT_CHUNK_TOKENS, preserve_code_blocks: bool = True, + chunk_overlap_tokens: int = DEFAULT_CHUNK_OVERLAP_TOKENS, ) -> Path: """ Package skill into JSON file for Weaviate. @@ -245,12 +260,8 @@ class WeaviateAdaptor(SkillAdaptor): output_path = self._format_output_path(skill_dir, Path(output_path), "-weaviate.json") output_path.parent.mkdir(parents=True, exist_ok=True) - # Read metadata - metadata = SkillMetadata( - name=skill_dir.name, - description=f"Weaviate objects for {skill_dir.name}", - version="1.0.0", - ) + # Read metadata from SKILL.md frontmatter + metadata = self._build_skill_metadata(skill_dir) # Generate Weaviate objects weaviate_json = self.format_skill_md( @@ -259,6 +270,7 @@ class WeaviateAdaptor(SkillAdaptor): enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) # Write to file @@ -288,7 +300,7 @@ class WeaviateAdaptor(SkillAdaptor): return output_path - def upload(self, package_path: Path, api_key: str = None, **kwargs) -> dict[str, Any]: + def upload(self, package_path: Path, api_key: str | None = None, **kwargs) -> dict[str, Any]: """ Upload packaged skill to Weaviate. 
@@ -382,31 +394,20 @@ class WeaviateAdaptor(SkillAdaptor): print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects") elif embedding_function == "sentence-transformers": - # Use sentence-transformers - print("🔄 Generating sentence-transformer embeddings and uploading...") - try: - from sentence_transformers import SentenceTransformer + # Use sentence-transformers (via shared base method) + contents = [obj["properties"]["content"] for obj in data["objects"]] + embeddings = self._generate_st_embeddings(contents) - model = SentenceTransformer("all-MiniLM-L6-v2") - contents = [obj["properties"]["content"] for obj in data["objects"]] - embeddings = model.encode(contents, show_progress_bar=True).tolist() + for i, obj in enumerate(data["objects"]): + batch.add_data_object( + data_object=obj["properties"], + class_name=data["class_name"], + uuid=obj["id"], + vector=embeddings[i], + ) - for i, obj in enumerate(data["objects"]): - batch.add_data_object( - data_object=obj["properties"], - class_name=data["class_name"], - uuid=obj["id"], - vector=embeddings[i], - ) - - if (i + 1) % 100 == 0: - print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects") - - except ImportError: - return { - "success": False, - "message": "sentence-transformers not installed. 
Run: pip install sentence-transformers", - } + if (i + 1) % 100 == 0: + print(f" ✓ Uploaded {i + 1}/{len(data['objects'])} objects") else: # No embeddings - Weaviate will use its configured vectorizer @@ -427,61 +428,16 @@ class WeaviateAdaptor(SkillAdaptor): return { "success": True, "message": f"Uploaded {count} objects to Weaviate class '{data['class_name']}'", + "url": None, "class_name": data["class_name"], "count": count, } + except ImportError as e: + return {"success": False, "message": str(e)} except Exception as e: return {"success": False, "message": f"Upload failed: {e}"} - def _generate_openai_embeddings( - self, documents: list[str], api_key: str = None - ) -> list[list[float]]: - """ - Generate embeddings using OpenAI API. - - Args: - documents: List of document texts - api_key: OpenAI API key (or uses OPENAI_API_KEY env var) - - Returns: - List of embedding vectors - """ - import os - - try: - from openai import OpenAI - except ImportError: - raise ImportError("openai not installed. Run: pip install openai") from None - - api_key = api_key or os.getenv("OPENAI_API_KEY") - if not api_key: - raise ValueError("OPENAI_API_KEY not set. Set via env var or --openai-api-key") - - client = OpenAI(api_key=api_key) - - # Batch process (OpenAI allows up to 2048 inputs) - embeddings = [] - batch_size = 100 - - print(f" Generating embeddings for {len(documents)} documents...") - - for i in range(0, len(documents), batch_size): - batch = documents[i : i + batch_size] - try: - response = client.embeddings.create( - input=batch, - model="text-embedding-3-small", # Cheapest, fastest - ) - embeddings.extend([item.embedding for item in response.data]) - print( - f" ✓ Generated {min(i + batch_size, len(documents))}/{len(documents)} embeddings" - ) - except Exception as e: - raise Exception(f"OpenAI embedding generation failed: {e}") from e - - return embeddings - def validate_api_key(self, _api_key: str) -> bool: """ Weaviate format doesn't use API keys for packaging. 
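The centralized defaults landing in `arguments/common.py` below also back the changelog's overlap auto-scaling rule: when `--chunk-tokens` is non-default but `--chunk-overlap-tokens` is left at its default, overlap becomes `max(50, chunk_tokens // 10)`. A minimal sketch of that resolution logic; the `resolve_overlap` wrapper is hypothetical, only the formula and the two constants come from the changelog:

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50


def resolve_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Auto-scale overlap when chunk size was customized but overlap was not."""
    if chunk_tokens != DEFAULT_CHUNK_TOKENS and overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS:
        # Larger chunks get proportionally larger overlap, floored at 50 tokens.
        return max(50, chunk_tokens // 10)
    return overlap_tokens
```

So `--chunk-tokens 2048` alone yields a 204-token overlap, while an explicit `--chunk-overlap-tokens 80` is always respected as-is.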
diff --git a/src/skill_seekers/cli/arguments/common.py b/src/skill_seekers/cli/arguments/common.py
index 309f993..f1ed246 100644
--- a/src/skill_seekers/cli/arguments/common.py
+++ b/src/skill_seekers/cli/arguments/common.py
@@ -15,6 +15,10 @@ Hierarchy:
 import argparse
 from typing import Any
 
+# Default chunking constants used by RAG and package arguments
+DEFAULT_CHUNK_TOKENS = 512
+DEFAULT_CHUNK_OVERLAP_TOKENS = 50
+
 # Common argument definitions as data structure
 # These are arguments that appear in MULTIPLE commands
 COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
@@ -64,6 +68,15 @@ COMMON_ARGUMENTS: dict[str, dict[str, Any]] = {
             "metavar": "KEY",
         },
     },
+    "doc_version": {
+        "flags": ("--doc-version",),
+        "kwargs": {
+            "type": str,
+            "default": "",
+            "help": "Documentation version tag for RAG metadata (e.g., '16.2')",
+            "metavar": "VERSION",
+        },
+    },
 }
 
 # Behavior arguments — runtime flags shared by every scraper
@@ -105,18 +118,18 @@ RAG_ARGUMENTS: dict[str, dict[str, Any]] = {
         "flags": ("--chunk-tokens",),
         "kwargs": {
             "type": int,
-            "default": 512,
+            "default": DEFAULT_CHUNK_TOKENS,
             "metavar": "TOKENS",
-            "help": "Chunk size in tokens for RAG (default: 512)",
+            "help": f"Chunk size in tokens for RAG (default: {DEFAULT_CHUNK_TOKENS})",
         },
     },
     "chunk_overlap_tokens": {
         "flags": ("--chunk-overlap-tokens",),
         "kwargs": {
             "type": int,
-            "default": 50,
+            "default": DEFAULT_CHUNK_OVERLAP_TOKENS,
             "metavar": "TOKENS",
-            "help": "Overlap between chunks in tokens (default: 50)",
+            "help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})",
         },
     },
 }
diff --git a/src/skill_seekers/cli/arguments/create.py b/src/skill_seekers/cli/arguments/create.py
index 03b30c7..bfc1116 100644
--- a/src/skill_seekers/cli/arguments/create.py
+++ b/src/skill_seekers/cli/arguments/create.py
@@ -153,6 +153,15 @@ UNIVERSAL_ARGUMENTS: dict[str, dict[str, Any]] = {
             "metavar": "PATH",
         },
     },
+    "doc_version": {
+        "flags": ("--doc-version",),
+        "kwargs": {
+            "type": str,
+
"default": "", + "help": "Documentation version tag for RAG metadata (e.g., '16.2')", + "metavar": "VERSION", + }, + }, } # Merge RAG arguments from common.py into universal arguments @@ -569,3 +578,11 @@ def add_create_arguments(parser: argparse.ArgumentParser, mode: str = "default") if mode in ["advanced", "all"]: for arg_name, arg_def in ADVANCED_ARGUMENTS.items(): parser.add_argument(*arg_def["flags"], **arg_def["kwargs"]) + + # Deprecated alias for backward compatibility (removed in v4.0.0) + parser.add_argument( + "--no-preserve-code", + dest="no_preserve_code_blocks", + action="store_true", + help=argparse.SUPPRESS, + ) diff --git a/src/skill_seekers/cli/arguments/package.py b/src/skill_seekers/cli/arguments/package.py index 6b1387e..ad91ae8 100644 --- a/src/skill_seekers/cli/arguments/package.py +++ b/src/skill_seekers/cli/arguments/package.py @@ -8,6 +8,8 @@ import and use these definitions. import argparse from typing import Any +from .common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS + PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = { # Positional argument "skill_directory": { @@ -49,6 +51,7 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = { "chroma", "faiss", "qdrant", + "pinecone", ], "default": "claude", "help": "Target LLM platform (default: claude)", @@ -109,13 +112,22 @@ PACKAGE_ARGUMENTS: dict[str, dict[str, Any]] = { "flags": ("--chunk-tokens",), "kwargs": { "type": int, - "default": 512, - "help": "Maximum tokens per chunk (default: 512)", + "default": DEFAULT_CHUNK_TOKENS, + "help": f"Maximum tokens per chunk (default: {DEFAULT_CHUNK_TOKENS})", "metavar": "N", }, }, - "no_preserve_code": { - "flags": ("--no-preserve-code",), + "chunk_overlap_tokens": { + "flags": ("--chunk-overlap-tokens",), + "kwargs": { + "type": int, + "default": DEFAULT_CHUNK_OVERLAP_TOKENS, + "help": f"Overlap between chunks in tokens (default: {DEFAULT_CHUNK_OVERLAP_TOKENS})", + "metavar": "N", + }, + }, + "no_preserve_code_blocks": { + "flags": 
("--no-preserve-code-blocks",), "kwargs": { "action": "store_true", "help": "Allow code block splitting (default: code blocks preserved)", @@ -130,3 +142,11 @@ def add_package_arguments(parser: argparse.ArgumentParser) -> None: flags = arg_def["flags"] kwargs = arg_def["kwargs"] parser.add_argument(*flags, **kwargs) + + # Deprecated alias for backward compatibility (removed in v4.0.0) + parser.add_argument( + "--no-preserve-code", + dest="no_preserve_code_blocks", + action="store_true", + help=argparse.SUPPRESS, + ) diff --git a/src/skill_seekers/cli/arguments/scrape.py b/src/skill_seekers/cli/arguments/scrape.py index 63b5781..afdd34a 100644 --- a/src/skill_seekers/cli/arguments/scrape.py +++ b/src/skill_seekers/cli/arguments/scrape.py @@ -172,6 +172,14 @@ def add_scrape_arguments(parser: argparse.ArgumentParser) -> None: kwargs = arg_def["kwargs"] parser.add_argument(*flags, **kwargs) + # Deprecated alias for backward compatibility (removed in v4.0.0) + parser.add_argument( + "--no-preserve-code", + dest="no_preserve_code_blocks", + action="store_true", + help=argparse.SUPPRESS, + ) + def get_scrape_argument_names() -> set: """Get the set of scrape argument destination names. diff --git a/src/skill_seekers/cli/codebase_scraper.py b/src/skill_seekers/cli/codebase_scraper.py index 96ed0df..d9d73ea 100644 --- a/src/skill_seekers/cli/codebase_scraper.py +++ b/src/skill_seekers/cli/codebase_scraper.py @@ -1057,6 +1057,7 @@ def analyze_codebase( enhance_level: int = 0, skill_name: str | None = None, skill_description: str | None = None, + doc_version: str = "", ) -> dict[str, Any]: """ Analyze local codebase and extract code knowledge. 
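Each scraper hunk in this change splices `doc_version` into the generated SKILL.md YAML frontmatter via an f-string. The resulting header shape can be sketched as follows; the `build_frontmatter` helper name is hypothetical, but the field order mirrors the diffs:

```python
def build_frontmatter(name: str, description: str, doc_version: str = "") -> str:
    """YAML frontmatter matching the scrapers' generated SKILL.md header."""
    return (
        "---\n"
        f"name: {name}\n"
        f"description: {description}\n"
        f"doc_version: {doc_version}\n"
        "---\n"
    )
```

Note that an empty `doc_version` still emits the key with a blank value, which matches what the scraper f-strings produce when `--doc-version` is not passed.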
@@ -1603,6 +1604,7 @@ def analyze_codebase( docs_data=docs_data, skill_name=skill_name, skill_description=skill_description, + doc_version=doc_version, ) return results @@ -1622,6 +1624,7 @@ def _generate_skill_md( docs_data: dict[str, Any] | None = None, skill_name: str | None = None, skill_description: str | None = None, + doc_version: str = "", ): """ Generate rich SKILL.md from codebase analysis results. @@ -1657,6 +1660,7 @@ def _generate_skill_md( skill_content = f"""--- name: {skill_name} description: {description} +doc_version: {doc_version} --- # {repo_name} Codebase @@ -2197,13 +2201,11 @@ def _generate_references(output_dir: Path): if source_dir.exists() and source_dir.is_dir(): # Copy directory to references/ (not symlink, for portability) - if target_dir.exists(): - import shutil - - shutil.rmtree(target_dir) - import shutil + if target_dir.exists(): + shutil.rmtree(target_dir) + shutil.copytree(source_dir, target_dir) logger.debug(f"Copied {source} → references/{target}") @@ -2451,6 +2453,7 @@ Examples: enhance_level=args.enhance_level, # AI enhancement level (0-3) skill_name=getattr(args, "name", None), skill_description=getattr(args, "description", None), + doc_version=getattr(args, "doc_version", ""), ) # ============================================================ diff --git a/src/skill_seekers/cli/create_command.py b/src/skill_seekers/cli/create_command.py index 92f6b1b..7a79202 100644 --- a/src/skill_seekers/cli/create_command.py +++ b/src/skill_seekers/cli/create_command.py @@ -13,6 +13,7 @@ from skill_seekers.cli.arguments.create import ( get_compatible_arguments, get_universal_argument_names, ) +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS logger = logging.getLogger(__name__) @@ -106,8 +107,8 @@ class CreateCommand: # Check against common defaults defaults = { "max_issues": 100, - "chunk_tokens": 512, - "chunk_overlap_tokens": 50, + "chunk_tokens": DEFAULT_CHUNK_TOKENS, + 
"chunk_overlap_tokens": DEFAULT_CHUNK_OVERLAP_TOKENS, "output": None, } @@ -160,11 +161,11 @@ class CreateCommand: # RAG arguments (web scraper only) if getattr(self.args, "chunk_for_rag", False): argv.append("--chunk-for-rag") - if getattr(self.args, "chunk_tokens", None) and self.args.chunk_tokens != 512: + if getattr(self.args, "chunk_tokens", None) and self.args.chunk_tokens != DEFAULT_CHUNK_TOKENS: argv.extend(["--chunk-tokens", str(self.args.chunk_tokens)]) if ( getattr(self.args, "chunk_overlap_tokens", None) - and self.args.chunk_overlap_tokens != 50 + and self.args.chunk_overlap_tokens != DEFAULT_CHUNK_OVERLAP_TOKENS ): argv.extend(["--chunk-overlap-tokens", str(self.args.chunk_overlap_tokens)]) @@ -428,6 +429,10 @@ class CreateCommand: if self.args.quiet: argv.append("--quiet") + # Documentation version metadata + if getattr(self.args, "doc_version", ""): + argv.extend(["--doc-version", self.args.doc_version]) + # Enhancement Workflow arguments if getattr(self.args, "enhance_workflow", None): for wf in self.args.enhance_workflow: diff --git a/src/skill_seekers/cli/doc_scraper.py b/src/skill_seekers/cli/doc_scraper.py index 62cac55..957ca5b 100755 --- a/src/skill_seekers/cli/doc_scraper.py +++ b/src/skill_seekers/cli/doc_scraper.py @@ -1565,9 +1565,11 @@ class DocToSkillConverter: if len(example_codes) >= 10: break + doc_version = self.config.get("doc_version", "") content = f"""--- name: {self.name} description: {description} +doc_version: {doc_version} --- # {self.name.title()} Skill @@ -2103,6 +2105,11 @@ def get_configuration(args: argparse.Namespace) -> dict[str, Any]: "max_pages": DEFAULT_MAX_PAGES, } + # Apply CLI override for doc_version (works for all config modes) + cli_doc_version = getattr(args, "doc_version", "") + if cli_doc_version: + config["doc_version"] = cli_doc_version + # Apply CLI overrides for rate limiting if args.no_rate_limit: config["rate_limit"] = 0 diff --git a/src/skill_seekers/cli/github_scraper.py 
b/src/skill_seekers/cli/github_scraper.py index ebf10e2..7316988 100644 --- a/src/skill_seekers/cli/github_scraper.py +++ b/src/skill_seekers/cli/github_scraper.py @@ -968,10 +968,13 @@ class GitHubToSkillConverter: # Truncate description to 1024 chars if needed desc = self.description[:1024] if len(self.description) > 1024 else self.description + doc_version = self.config.get("doc_version", "") + # Build skill content skill_content = f"""--- name: {skill_name} description: {desc} +doc_version: {doc_version} --- # {repo_info.get("name", self.name)} @@ -1003,10 +1006,11 @@ Use this skill when you need to: # Repository info skill_content += "### Repository Info\n" - skill_content += f"- **Homepage:** {repo_info.get('homepage', 'N/A')}\n" + skill_content += f"- **Homepage:** {repo_info.get('homepage') or 'N/A'}\n" skill_content += f"- **Topics:** {', '.join(repo_info.get('topics', []))}\n" skill_content += f"- **Open Issues:** {repo_info.get('open_issues', 0)}\n" - skill_content += f"- **Last Updated:** {repo_info.get('updated_at', 'N/A')[:10]}\n\n" + updated_at = repo_info.get('updated_at') or 'N/A' + skill_content += f"- **Last Updated:** {updated_at[:10]}\n\n" # Languages skill_content += "### Languages\n" @@ -1101,8 +1105,10 @@ Use this skill when you need to: lines = [] for release in releases[:3]: + published_at = release.get('published_at') or 'N/A' + release_name = release.get('name') or release['tag_name'] lines.append( - f"- **{release['tag_name']}** ({release['published_at'][:10]}): {release['name']}" + f"- **{release['tag_name']}** ({published_at[:10]}): {release_name}" ) return "\n".join(lines) @@ -1298,15 +1304,17 @@ Use this skill when you need to: content += f"## Open Issues ({len(open_issues)})\n\n" for issue in open_issues: labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels" + created_at = issue.get('created_at') or 'N/A' content += f"### #{issue['number']}: {issue['title']}\n" - content += f"**Labels:** {labels} | **Created:** 
{issue['created_at'][:10]}\n" + content += f"**Labels:** {labels} | **Created:** {created_at[:10]}\n" content += f"[View on GitHub]({issue['url']})\n\n" content += f"\n## Recently Closed Issues ({len(closed_issues)})\n\n" for issue in closed_issues: labels = ", ".join(issue["labels"]) if issue["labels"] else "No labels" + closed_at = issue.get('closed_at') or 'N/A' content += f"### #{issue['number']}: {issue['title']}\n" - content += f"**Labels:** {labels} | **Closed:** {issue['closed_at'][:10]}\n" + content += f"**Labels:** {labels} | **Closed:** {closed_at[:10]}\n" content += f"[View on GitHub]({issue['url']})\n\n" issues_path = f"{self.skill_dir}/references/issues.md" @@ -1323,11 +1331,14 @@ Use this skill when you need to: ) for release in releases: - content += f"## {release['tag_name']}: {release['name']}\n" - content += f"**Published:** {release['published_at'][:10]}\n" + published_at = release.get('published_at') or 'N/A' + release_name = release.get('name') or release['tag_name'] + release_body = release.get('body') or '' + content += f"## {release['tag_name']}: {release_name}\n" + content += f"**Published:** {published_at[:10]}\n" if release["prerelease"]: content += "**Pre-release**\n" - content += f"\n{release['body']}\n\n" + content += f"\n{release_body}\n\n" content += f"[View on GitHub]({release['url']})\n\n---\n\n" releases_path = f"{self.skill_dir}/references/releases.md" diff --git a/src/skill_seekers/cli/main.py b/src/skill_seekers/cli/main.py index fb0a478..ecf8648 100644 --- a/src/skill_seekers/cli/main.py +++ b/src/skill_seekers/cli/main.py @@ -325,8 +325,8 @@ def _handle_analyze_command(args: argparse.Namespace) -> int: if getattr(args, "enhance_stage", None): for stage in args.enhance_stage: sys.argv.extend(["--enhance-stage", stage]) - if getattr(args, "workflow_var", None): - for var in args.workflow_var: + if getattr(args, "var", None): + for var in args.var: sys.argv.extend(["--var", var]) if getattr(args, "workflow_dry_run", False): 
sys.argv.append("--workflow-dry-run") diff --git a/src/skill_seekers/cli/package_skill.py b/src/skill_seekers/cli/package_skill.py index c8ebe4a..ab7900f 100644 --- a/src/skill_seekers/cli/package_skill.py +++ b/src/skill_seekers/cli/package_skill.py @@ -14,6 +14,8 @@ import os import sys from pathlib import Path +from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS + # Import utilities try: from quality_checker import SkillQualityChecker, print_report @@ -45,8 +47,9 @@ def package_skill( chunk_overlap=200, batch_size=100, enable_chunking=False, - chunk_max_tokens=512, + chunk_max_tokens=DEFAULT_CHUNK_TOKENS, preserve_code_blocks=True, + chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS, ): """ Package a skill directory into platform-specific format @@ -121,6 +124,7 @@ def package_skill( "chroma", "faiss", "qdrant", + "pinecone", ] if target in RAG_PLATFORMS and not enable_chunking: @@ -156,6 +160,7 @@ def package_skill( enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) else: package_path = adaptor.package( @@ -164,6 +169,7 @@ def package_skill( enable_chunking=enable_chunking, chunk_max_tokens=chunk_max_tokens, preserve_code_blocks=preserve_code_blocks, + chunk_overlap_tokens=chunk_overlap_tokens, ) print(f" Output: {package_path}") @@ -226,7 +232,8 @@ Examples: batch_size=args.batch_size, enable_chunking=args.chunk_for_rag, chunk_max_tokens=args.chunk_tokens, - preserve_code_blocks=not args.no_preserve_code, + preserve_code_blocks=not args.no_preserve_code_blocks, + chunk_overlap_tokens=args.chunk_overlap_tokens, ) if not success: diff --git a/src/skill_seekers/cli/rag_chunker.py b/src/skill_seekers/cli/rag_chunker.py index 3854f7e..124456a 100644 --- a/src/skill_seekers/cli/rag_chunker.py +++ b/src/skill_seekers/cli/rag_chunker.py @@ -14,6 +14,8 @@ Usage: chunks = chunker.chunk_skill(Path("output/react")) """ 
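The overlap auto-scaling rule threaded through `package_skill()` above (per the changelog: when `--chunk-tokens` is non-default but `--chunk-overlap-tokens` is left at its default, overlap becomes `max(50, chunk_tokens // 10)`) can be sketched as a pure function. `effective_overlap` is an illustrative name; the real wiring lives in the adaptor chunking path:

```python
DEFAULT_CHUNK_TOKENS = 512
DEFAULT_CHUNK_OVERLAP_TOKENS = 50


def effective_overlap(chunk_tokens: int, overlap_tokens: int) -> int:
    """Scale overlap to ~10% of the chunk size (floor of 50) when the user
    changed the chunk size but left the overlap at its default."""
    if chunk_tokens != DEFAULT_CHUNK_TOKENS and overlap_tokens == DEFAULT_CHUNK_OVERLAP_TOKENS:
        return max(DEFAULT_CHUNK_OVERLAP_TOKENS, chunk_tokens // 10)
    return overlap_tokens


assert effective_overlap(512, 50) == 50    # both defaults: unchanged
assert effective_overlap(1024, 50) == 102  # auto-scales to 1024 // 10
assert effective_overlap(1024, 64) == 64   # explicit overlap always wins
```

An explicit `--chunk-overlap-tokens` therefore always takes precedence; only the "big chunks, untouched overlap" combination is adjusted, which matches the `max(50, 1024 // 10) = 102` expectation in the new integration test.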
+from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS + import re from pathlib import Path import json @@ -35,8 +37,8 @@ class RAGChunker: def __init__( self, - chunk_size: int = 512, - chunk_overlap: int = 50, + chunk_size: int = DEFAULT_CHUNK_TOKENS, + chunk_overlap: int = DEFAULT_CHUNK_OVERLAP_TOKENS, preserve_code_blocks: bool = True, preserve_paragraphs: bool = True, min_chunk_size: int = 100, @@ -383,9 +385,9 @@ def main(): ) parser.add_argument("skill_dir", type=Path, help="Path to skill directory") parser.add_argument("--output", "-o", type=Path, help="Output JSON file") - parser.add_argument("--chunk-tokens", type=int, default=512, help="Target chunk size in tokens") + parser.add_argument("--chunk-tokens", type=int, default=DEFAULT_CHUNK_TOKENS, help="Target chunk size in tokens") parser.add_argument( - "--chunk-overlap-tokens", type=int, default=50, help="Overlap size in tokens" + "--chunk-overlap-tokens", type=int, default=DEFAULT_CHUNK_OVERLAP_TOKENS, help="Overlap size in tokens" ) parser.add_argument("--no-code-blocks", action="store_true", help="Don't preserve code blocks") parser.add_argument("--no-paragraphs", action="store_true", help="Don't preserve paragraphs") diff --git a/src/skill_seekers/cli/word_scraper.py b/src/skill_seekers/cli/word_scraper.py index 76d068f..33d8666 100644 --- a/src/skill_seekers/cli/word_scraper.py +++ b/src/skill_seekers/cli/word_scraper.py @@ -109,6 +109,11 @@ class WordToSkillConverter: if not os.path.exists(self.docx_path): raise FileNotFoundError(f"Word document not found: {self.docx_path}") + if not self.docx_path.lower().endswith(".docx"): + raise ValueError( + f"Not a Word document (expected .docx): {self.docx_path}" + ) + # --- Extract metadata via python-docx --- doc = python_docx.Document(self.docx_path) core_props = doc.core_properties @@ -825,8 +830,8 @@ def _build_section( raw_text = elem.get_text(separator="\n").strip() # Exclude bullet-point / prose lists (•, 
*, -) if raw_text and not re.search(r"^[•\-\*]\s", raw_text, re.MULTILINE): - if _score_code_quality(raw_text) >= 5.5: - quality_score = _score_code_quality(raw_text) + quality_score = _score_code_quality(raw_text) + if quality_score >= 5.5: code_samples.append( {"code": raw_text, "language": "", "quality_score": quality_score} ) diff --git a/tests/test_chunking_integration.py b/tests/test_chunking_integration.py index 42a1c1b..e9068ba 100644 --- a/tests/test_chunking_integration.py +++ b/tests/test_chunking_integration.py @@ -359,5 +359,102 @@ class TestChunkingCLIIntegration: ) + def test_chunk_overlap_tokens_parameter(self, tmp_path): + """Test --chunk-overlap-tokens controls RAGChunker overlap.""" + from skill_seekers.cli.package_skill import package_skill + + skill_dir = create_test_skill(tmp_path, large_doc=True) + + # Package with default overlap (50) + success, package_path = package_skill( + skill_dir=skill_dir, + open_folder_after=False, + skip_quality_check=True, + target="langchain", + enable_chunking=True, + chunk_max_tokens=256, + chunk_overlap_tokens=50, + ) + + assert success + assert package_path.exists() + + with open(package_path) as f: + data_default = json.load(f) + + # Package with large overlap (128) + success2, package_path2 = package_skill( + skill_dir=skill_dir, + open_folder_after=False, + skip_quality_check=True, + target="langchain", + enable_chunking=True, + chunk_max_tokens=256, + chunk_overlap_tokens=128, + ) + + assert success2 + assert package_path2.exists() + + with open(package_path2) as f: + data_large_overlap = json.load(f) + + # Large overlap should produce more chunks (more overlap = more chunks) + assert len(data_large_overlap) >= len(data_default), ( + f"Large overlap ({len(data_large_overlap)}) should produce >= chunks than default ({len(data_default)})" + ) + + def test_chunk_overlap_scales_with_chunk_size(self, tmp_path): + """Test that overlap auto-scales when chunk_tokens is non-default but overlap is default.""" + 
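Aside on the `value or 'N/A'` guards added in the GitHub scraper hunks above: `dict.get(key, default)` only substitutes when the key is *missing*, so a key that is present but `None` (common in GitHub API payloads) slips through and later crashes slicing like `[:10]`. The `or` form also covers present-but-falsy values. A minimal sketch:

```python
repo_info = {"homepage": None, "topics": ["docs"]}

# get(key, default) substitutes only for missing keys; None passes through.
assert repo_info.get("homepage", "N/A") is None

# The `or` guard also replaces present-but-None (and empty-string) values.
assert (repo_info.get("homepage") or "N/A") == "N/A"
```

The trade-off is that `or` would also replace legitimate falsy values such as `0` or `""`; for display-only fields like homepage and timestamps that is the desired behavior here.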
from skill_seekers.cli.arguments.common import DEFAULT_CHUNK_TOKENS, DEFAULT_CHUNK_OVERLAP_TOKENS
+        from skill_seekers.cli.adaptors import get_adaptor
+
+        adaptor = get_adaptor("langchain")
+
+        skill_dir = create_test_skill(tmp_path, large_doc=True)
+        metadata = adaptor._build_skill_metadata(skill_dir)
+        content = (skill_dir / "SKILL.md").read_text()
+
+        # With default chunk size (512) and default overlap (50), overlap stays 50
+        chunks_default = adaptor._maybe_chunk_content(
+            content, {"source": "test"},
+            enable_chunking=True,
+            chunk_max_tokens=DEFAULT_CHUNK_TOKENS,
+            chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
+        )
+
+        # With large chunk size (1024) and default overlap (50),
+        # overlap should auto-scale to max(50, 1024 // 10) = 102
+        chunks_large = adaptor._maybe_chunk_content(
+            content, {"source": "test"},
+            enable_chunking=True,
+            chunk_max_tokens=1024,
+            chunk_overlap_tokens=DEFAULT_CHUNK_OVERLAP_TOKENS,
+        )
+
+        # Both should produce valid chunks
+        assert len(chunks_default) > 1
+        assert len(chunks_large) >= 1
+
+    def test_preserve_code_blocks_flag(self, tmp_path):
+        """Test --no-preserve-code-blocks parameter is accepted."""
+        from skill_seekers.cli.package_skill import package_skill
+
+        skill_dir = create_test_skill(tmp_path, large_doc=True)
+
+        # Package with code block preservation disabled
+        success, package_path = package_skill(
+            skill_dir=skill_dir,
+            open_folder_after=False,
+            skip_quality_check=True,
+            target="langchain",
+            enable_chunking=True,
+            chunk_max_tokens=256,
+            preserve_code_blocks=False,
+        )
+
+        assert success
+        assert package_path.exists()
+
+
 if __name__ == "__main__":
     pytest.main([__file__, "-v"])
diff --git a/tests/test_cli_refactor_e2e.py b/tests/test_cli_refactor_e2e.py
index ad2c790..600dbb5 100644
--- a/tests/test_cli_refactor_e2e.py
+++ b/tests/test_cli_refactor_e2e.py
@@ -294,5 +294,81 @@ class TestE2EWorkflow:
         assert "unrecognized arguments" not in result.stderr.lower()
 
 
+class TestVarFlagRouting:
+    """Test that --var flag is correctly routed through create command."""
+
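The root cause of the `--var` routing bug fixed in `main.py` is argparse's `dest` derivation: the attribute name comes from the option string unless `dest=` is passed explicitly. A minimal sketch; declaring the flag with `action="append"` is an assumption consistent with the `for var in args.var` loop in `main.py`:

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--var", action="append", metavar="KEY=VALUE")

args = parser.parse_args(["--var", "foo=bar", "--var", "baz=1"])
# argparse derives the dest from the option string: "--var" -> args.var.
# There is no args.workflow_var unless dest="workflow_var" is given, so the
# old getattr(args, "workflow_var", None) check always saw None and the
# overrides were silently dropped.
assert args.var == ["foo=bar", "baz=1"]
assert not hasattr(args, "workflow_var")
```

This is why the fix reads `args.var` directly, and why the `--help`-based smoke tests below only verify the flag is registered rather than its routing.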
+ def test_var_flag_accepted_by_create(self): + """Test that --var flag is accepted (not 'unrecognized') by create command.""" + result = subprocess.run( + ["skill-seekers", "create", "--help"], + capture_output=True, + text=True, + ) + assert "--var" in result.stdout, "create --help should show --var flag" + + def test_var_flag_accepted_by_analyze(self): + """Test that --var flag is accepted by analyze command.""" + result = subprocess.run( + ["skill-seekers", "analyze", "--help"], + capture_output=True, + text=True, + ) + assert "--var" in result.stdout, "analyze --help should show --var flag" + + @pytest.mark.slow + def test_var_flag_not_rejected_in_create_local(self, tmp_path): + """Test --var KEY=VALUE doesn't cause 'unrecognized arguments' in create.""" + test_dir = tmp_path / "test_code" + test_dir.mkdir() + (test_dir / "test.py").write_text("def hello(): pass") + + result = subprocess.run( + [ + "skill-seekers", "create", str(test_dir), + "--var", "foo=bar", + "--dry-run", + ], + capture_output=True, + text=True, + timeout=15, + ) + assert "unrecognized arguments" not in result.stderr.lower(), ( + f"--var should be accepted, got stderr: {result.stderr}" + ) + + +class TestBackwardCompatibleFlags: + """Test that deprecated flag aliases still work.""" + + def test_no_preserve_code_alias_accepted_by_package(self): + """Test --no-preserve-code (old name) is still accepted by package command.""" + result = subprocess.run( + ["skill-seekers", "package", "--help"], + capture_output=True, + text=True, + ) + # The old flag should not appear in --help (it's suppressed) + # but should not cause an error if used + assert result.returncode == 0 + + def test_no_preserve_code_alias_accepted_by_scrape(self): + """Test --no-preserve-code (old name) is still accepted by scrape command.""" + result = subprocess.run( + ["skill-seekers", "scrape", "--help"], + capture_output=True, + text=True, + ) + assert result.returncode == 0 + + def 
test_no_preserve_code_alias_accepted_by_create(self): + """Test --no-preserve-code (old name) is still accepted by create command.""" + result = subprocess.run( + ["skill-seekers", "create", "--help-all"], + capture_output=True, + text=True, + ) + assert result.returncode == 0 + + if __name__ == "__main__": pytest.main([__file__, "-v", "-s"]) diff --git a/tests/test_create_arguments.py b/tests/test_create_arguments.py index b297721..249348b 100644 --- a/tests/test_create_arguments.py +++ b/tests/test_create_arguments.py @@ -25,8 +25,8 @@ class TestUniversalArguments: """Test universal argument definitions.""" def test_universal_count(self): - """Should have exactly 18 universal arguments (after Phase 2 workflow integration + local_repo_path).""" - assert len(UNIVERSAL_ARGUMENTS) == 18 + """Should have exactly 19 universal arguments (after Phase 2 workflow integration + local_repo_path + doc_version).""" + assert len(UNIVERSAL_ARGUMENTS) == 19 def test_universal_argument_names(self): """Universal arguments should have expected names.""" @@ -50,6 +50,7 @@ class TestUniversalArguments: "var", "workflow_dry_run", "local_repo_path", # GitHub local clone path for unlimited C3.x analysis + "doc_version", # Documentation version tag for RAG metadata } assert set(UNIVERSAL_ARGUMENTS.keys()) == expected_names @@ -130,7 +131,7 @@ class TestArgumentHelpers: """Should return set of universal argument names.""" names = get_universal_argument_names() assert isinstance(names, set) - assert len(names) == 18 # Phase 2: added 4 workflow arguments + local_repo_path + assert len(names) == 19 # Phase 2: added 4 workflow arguments + local_repo_path + doc_version assert "name" in names assert "enhance_level" in names # Phase 1: consolidated flag assert "enhance_workflow" in names # Phase 2: workflow support diff --git a/tests/test_pinecone_adaptor.py b/tests/test_pinecone_adaptor.py new file mode 100644 index 0000000..7a81400 --- /dev/null +++ b/tests/test_pinecone_adaptor.py @@ -0,0 
+1,752 @@ +#!/usr/bin/env python3 +""" +Tests for Pinecone adaptor and doc_version metadata flow. +""" + +import json +from pathlib import Path + +import pytest + +from skill_seekers.cli.adaptors.base import SkillAdaptor, SkillMetadata + + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + + +@pytest.fixture +def sample_skill_dir(tmp_path): + """Create a minimal skill directory with SKILL.md and references.""" + skill_dir = tmp_path / "test-skill" + skill_dir.mkdir() + + skill_md = """--- +name: test-skill +description: A test skill for pinecone +doc_version: 16.2 +--- + +# Test Skill + +This is a test skill for Pinecone adaptor testing. + +## Quick Start + +Get started quickly. +""" + (skill_dir / "SKILL.md").write_text(skill_md) + + refs_dir = skill_dir / "references" + refs_dir.mkdir() + (refs_dir / "api_reference.md").write_text( + "# API Reference\n\nSome API docs.\n" + ) + (refs_dir / "getting_started.md").write_text( + "# Getting Started\n\nSome getting started docs.\n" + ) + + return skill_dir + + +@pytest.fixture +def sample_skill_dir_no_doc_version(tmp_path): + """Create a skill directory without doc_version in frontmatter.""" + skill_dir = tmp_path / "no-version-skill" + skill_dir.mkdir() + + skill_md = """--- +name: no-version-skill +description: A test skill without doc_version +--- + +# No Version Skill + +Content here. 
+""" + (skill_dir / "SKILL.md").write_text(skill_md) + + refs_dir = skill_dir / "references" + refs_dir.mkdir() + (refs_dir / "api.md").write_text("# API\n\nAPI docs.\n") + + return skill_dir + + +# --------------------------------------------------------------------------- +# Pinecone Adaptor Tests +# --------------------------------------------------------------------------- + + +class TestPineconeAdaptor: + """Test Pinecone adaptor functionality.""" + + def test_import(self): + """PineconeAdaptor can be imported.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + assert PineconeAdaptor is not None + + def test_platform_constants(self): + """Platform constants are set correctly.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + assert adaptor.PLATFORM == "pinecone" + assert adaptor.PLATFORM_NAME == "Pinecone (Vector Database)" + assert adaptor.DEFAULT_API_ENDPOINT is None + + def test_registered_in_factory(self): + """PineconeAdaptor is registered in the adaptor factory.""" + from skill_seekers.cli.adaptors import ADAPTORS + + assert "pinecone" in ADAPTORS + + def test_get_adaptor(self): + """get_adaptor('pinecone') returns PineconeAdaptor instance.""" + from skill_seekers.cli.adaptors import get_adaptor + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = get_adaptor("pinecone") + assert isinstance(adaptor, PineconeAdaptor) + + def test_format_skill_md_structure(self, sample_skill_dir): + """format_skill_md returns valid JSON with expected structure.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata( + name="test-skill", + description="Test skill", + version="1.0.0", + doc_version="16.2", + ) + result = adaptor.format_skill_md(sample_skill_dir, metadata) + data = json.loads(result) + + assert "index_name" in data + assert "namespace" in data + assert 
"dimension" in data + assert "metric" in data + assert "vectors" in data + assert data["dimension"] == 1536 + assert data["metric"] == "cosine" + + def test_format_skill_md_vectors_have_metadata(self, sample_skill_dir): + """Each vector has id and metadata fields.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata( + name="test-skill", + description="Test", + doc_version="16.2", + ) + result = adaptor.format_skill_md(sample_skill_dir, metadata) + data = json.loads(result) + + assert len(data["vectors"]) > 0 + for vec in data["vectors"]: + assert "id" in vec + assert "metadata" in vec + assert "text" in vec["metadata"] + assert "source" in vec["metadata"] + assert "category" in vec["metadata"] + assert "file" in vec["metadata"] + assert "type" in vec["metadata"] + assert "version" in vec["metadata"] + assert "doc_version" in vec["metadata"] + + def test_format_skill_md_doc_version_propagates(self, sample_skill_dir): + """doc_version flows into every vector's metadata.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata( + name="test-skill", + description="Test", + doc_version="16.2", + ) + result = adaptor.format_skill_md(sample_skill_dir, metadata) + data = json.loads(result) + + for vec in data["vectors"]: + assert vec["metadata"]["doc_version"] == "16.2" + + def test_format_skill_md_empty_doc_version(self, sample_skill_dir): + """Empty doc_version is preserved as empty string.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata(name="test-skill", description="Test", doc_version="") + result = adaptor.format_skill_md(sample_skill_dir, metadata) + data = json.loads(result) + + for vec in data["vectors"]: + assert vec["metadata"]["doc_version"] == "" + + def test_format_skill_md_has_overview_and_references(self, 
sample_skill_dir): + """Output includes overview (SKILL.md) and reference documents.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata(name="test-skill", description="Test") + result = adaptor.format_skill_md(sample_skill_dir, metadata) + data = json.loads(result) + + categories = {vec["metadata"]["category"] for vec in data["vectors"]} + types = {vec["metadata"]["type"] for vec in data["vectors"]} + assert "overview" in categories + assert "documentation" in types + assert "reference" in types + + def test_package_creates_file(self, sample_skill_dir, tmp_path): + """package() creates a JSON file at expected path.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + output_path = adaptor.package(sample_skill_dir, tmp_path) + + assert output_path.exists() + assert output_path.name.endswith("-pinecone.json") + + data = json.loads(output_path.read_text()) + assert "vectors" in data + assert len(data["vectors"]) > 0 + + def test_package_reads_frontmatter_metadata(self, sample_skill_dir, tmp_path): + """package() reads doc_version from SKILL.md frontmatter.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + output_path = adaptor.package(sample_skill_dir, tmp_path) + + data = json.loads(output_path.read_text()) + for vec in data["vectors"]: + assert vec["metadata"]["doc_version"] == "16.2" + + def test_package_with_chunking(self, sample_skill_dir, tmp_path): + """package() with chunking enabled produces valid output.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + output_path = adaptor.package( + sample_skill_dir, tmp_path, enable_chunking=True, chunk_max_tokens=64 + ) + + data = json.loads(output_path.read_text()) + assert "vectors" in data + assert len(data["vectors"]) > 0 + + def 
test_index_name_derived_from_skill_name(self, sample_skill_dir, tmp_path): + """index_name and namespace are derived from skill directory name.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + output_path = adaptor.package(sample_skill_dir, tmp_path) + + data = json.loads(output_path.read_text()) + assert data["index_name"] == "test-skill" + assert data["namespace"] == "test-skill" + + def test_no_values_field_in_vectors(self, sample_skill_dir, tmp_path): + """Vectors have no 'values' field — embeddings are added at upload time.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + output_path = adaptor.package(sample_skill_dir, tmp_path) + + data = json.loads(output_path.read_text()) + for vec in data["vectors"]: + assert "values" not in vec + + def test_text_truncation(self): + """_truncate_text_for_metadata respects byte limit.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + # Short text should not be truncated + assert adaptor._truncate_text_for_metadata("hello") == "hello" + + # Very long text should be truncated + long_text = "x" * 50000 + truncated = adaptor._truncate_text_for_metadata(long_text) + assert len(truncated.encode("utf-8")) <= 40000 + + def test_validate_api_key_returns_false(self): + """validate_api_key returns False (no key needed for packaging).""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + assert adaptor.validate_api_key("some-key") is False + + def test_get_env_var_name(self): + """get_env_var_name returns PINECONE_API_KEY.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + assert adaptor.get_env_var_name() == "PINECONE_API_KEY" + + def test_supports_enhancement_false(self): + """Pinecone doesn't support enhancement.""" + from 
skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        assert adaptor.supports_enhancement() is False
+
+    def test_upload_without_pinecone_installed(self, tmp_path):
+        """upload() fails gracefully when pinecone is missing or no API key is configured."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        # Create a dummy package file
+        pkg = tmp_path / "test-pinecone.json"
+        pkg.write_text(json.dumps({"vectors": [], "index_name": "test", "namespace": "test"}))
+
+        # Covers both environments: without pinecone installed this exercises the
+        # ImportError path; with it installed, the missing API key path. Either
+        # way upload() must return an error dict rather than raise.
+        result = adaptor.upload(pkg)
+        assert result["success"] is False
+
+    def _make_mock_pinecone(self, monkeypatch):
+        """Helper: stub the pinecone module so upload() can run without a real server."""
+        import sys
+        import types
+        from unittest.mock import MagicMock
+
+        mock_module = types.ModuleType("pinecone")
+        mock_index = MagicMock()
+        mock_pc = MagicMock()
+        mock_pc.list_indexes.return_value = []  # no existing indexes
+        mock_pc.Index.return_value = mock_index
+        mock_module.Pinecone = MagicMock(return_value=mock_pc)
+        mock_module.ServerlessSpec = MagicMock()
+        monkeypatch.setitem(sys.modules, "pinecone", mock_module)
+        return mock_pc, mock_index
+
+    def _make_package(self, tmp_path, vectors=None):
+        """Helper: create a minimal Pinecone package JSON."""
+        if vectors is None:
+            vectors = [{"id": "a", "metadata": {"text": "hello world"}}]
+        pkg = tmp_path / "test-pinecone.json"
+        pkg.write_text(json.dumps({
+            "vectors": vectors,
+            "index_name": "test",
+            "namespace": "test",
+            "metric": "cosine",
+            "dimension": 1536,
+        }))
+        return pkg
+
+    def test_upload_success_has_url_key(self, tmp_path, monkeypatch):
+        """upload() success return dict includes 'url' key (prevents KeyError in package_skill.py)."""
+        from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor
+
+        adaptor = PineconeAdaptor()
+        mock_pc, _mock_index
= self._make_mock_pinecone(monkeypatch) + monkeypatch.setattr( + adaptor, "_generate_openai_embeddings", + lambda docs: [[0.0] * 1536] * len(docs), + ) + pkg = self._make_package(tmp_path) + + result = adaptor.upload(pkg, api_key="fake-key") + assert result["success"] is True + assert "url" in result # key must exist to avoid KeyError in package_skill.py + # Value should be None for Pinecone (no web URL) + assert result["url"] is None + + def test_embedding_dimension_autodetect_st(self, tmp_path, monkeypatch): + """sentence-transformers upload creates index with dimension=384.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch) + monkeypatch.setattr( + adaptor, "_generate_st_embeddings", + lambda docs: [[0.0] * 384] * len(docs), + ) + pkg = self._make_package(tmp_path) + + result = adaptor.upload( + pkg, api_key="fake-key", embedding_function="sentence-transformers", + ) + assert result["success"] is True + # Verify create_index was called with dimension=384 + mock_pc.create_index.assert_called_once() + call_kwargs = mock_pc.create_index.call_args + assert call_kwargs.kwargs["dimension"] == 384 + + def test_embedding_dimension_autodetect_openai(self, tmp_path, monkeypatch): + """openai upload creates index with dimension=1536.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch) + monkeypatch.setattr( + adaptor, "_generate_openai_embeddings", + lambda docs: [[0.0] * 1536] * len(docs), + ) + pkg = self._make_package(tmp_path) + + result = adaptor.upload( + pkg, api_key="fake-key", embedding_function="openai", + ) + assert result["success"] is True + mock_pc.create_index.assert_called_once() + call_kwargs = mock_pc.create_index.call_args + assert call_kwargs.kwargs["dimension"] == 1536 + + def 
test_embedding_before_index_creation(self, tmp_path, monkeypatch): + """If embedding generation fails, index is never created (no side-effects).""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch) + + def fail_embeddings(docs): + raise RuntimeError("OPENAI_API_KEY not set") + + monkeypatch.setattr(adaptor, "_generate_openai_embeddings", fail_embeddings) + pkg = self._make_package(tmp_path) + + result = adaptor.upload(pkg, api_key="fake-key") + assert result["success"] is False + # Index must NOT have been created since embedding failed first + mock_pc.create_index.assert_not_called() + + def test_embedding_dimension_explicit_override(self, tmp_path, monkeypatch): + """Explicit dimension kwarg overrides both auto-detect and JSON file value.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + mock_pc, _mock_index = self._make_mock_pinecone(monkeypatch) + monkeypatch.setattr( + adaptor, "_generate_openai_embeddings", + lambda docs: [[0.0] * 768] * len(docs), + ) + pkg = self._make_package(tmp_path) + + result = adaptor.upload( + pkg, api_key="fake-key", embedding_function="openai", dimension=768, + ) + assert result["success"] is True + mock_pc.create_index.assert_called_once() + call_kwargs = mock_pc.create_index.call_args + assert call_kwargs.kwargs["dimension"] == 768 + + def test_deterministic_ids(self, sample_skill_dir): + """IDs are deterministic — same input produces same ID.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + metadata = SkillMetadata(name="test-skill", description="Test") + + result1 = adaptor.format_skill_md(sample_skill_dir, metadata) + result2 = adaptor.format_skill_md(sample_skill_dir, metadata) + + data1 = json.loads(result1) + data2 = json.loads(result2) + + ids1 = [v["id"] for v in data1["vectors"]] + 
ids2 = [v["id"] for v in data2["vectors"]] + assert ids1 == ids2 + + +# --------------------------------------------------------------------------- +# doc_version Metadata Tests (cross-adaptor) +# --------------------------------------------------------------------------- + + +class TestDocVersionMetadata: + """Test doc_version flows through all RAG adaptors.""" + + def test_skill_metadata_has_doc_version(self): + """SkillMetadata dataclass has doc_version field.""" + meta = SkillMetadata(name="test", description="test", doc_version="3.2") + assert meta.doc_version == "3.2" + + def test_skill_metadata_doc_version_default_empty(self): + """doc_version defaults to empty string.""" + meta = SkillMetadata(name="test", description="test") + assert meta.doc_version == "" + + def test_read_frontmatter(self, sample_skill_dir): + """_read_frontmatter reads doc_version from SKILL.md.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + fm = adaptor._read_frontmatter(sample_skill_dir) + assert fm["doc_version"] == "16.2" + assert fm["name"] == "test-skill" + + def test_read_frontmatter_missing(self, sample_skill_dir_no_doc_version): + """_read_frontmatter omits the doc_version key when SKILL.md has none.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + fm = adaptor._read_frontmatter(sample_skill_dir_no_doc_version) + assert fm.get("doc_version") is None # key not present + + def test_build_skill_metadata_reads_doc_version(self, sample_skill_dir): + """_build_skill_metadata populates doc_version from frontmatter.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + meta = adaptor._build_skill_metadata(sample_skill_dir) + assert meta.doc_version == "16.2" + assert meta.name == "test-skill" + + def test_build_skill_metadata_no_doc_version(self, sample_skill_dir_no_doc_version): + """_build_skill_metadata
defaults to empty string when frontmatter has no doc_version.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + meta = adaptor._build_skill_metadata(sample_skill_dir_no_doc_version) + assert meta.doc_version == "" + + def test_build_metadata_dict_includes_doc_version(self): + """_build_metadata_dict includes doc_version in output.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + meta = SkillMetadata(name="test", description="desc", doc_version="3.0") + result = adaptor._build_metadata_dict(meta) + assert "doc_version" in result + assert result["doc_version"] == "3.0" + + def test_build_metadata_dict_empty_doc_version(self): + """_build_metadata_dict preserves empty doc_version.""" + from skill_seekers.cli.adaptors.pinecone_adaptor import PineconeAdaptor + + adaptor = PineconeAdaptor() + meta = SkillMetadata(name="test", description="desc") + result = adaptor._build_metadata_dict(meta) + assert "doc_version" in result + assert result["doc_version"] == "" + + @pytest.mark.parametrize( + "platform", + ["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"], + ) + def test_doc_version_in_package_output(self, platform, sample_skill_dir, tmp_path): + """doc_version appears in package output for all RAG adaptors.""" + from skill_seekers.cli.adaptors import get_adaptor + + adaptor = get_adaptor(platform) + output_path = adaptor.package(sample_skill_dir, tmp_path) + + data = json.loads(output_path.read_text()) + + # Each adaptor has a different structure — extract metadata dicts + meta_list = _extract_metadata_from_package(platform, data) + assert len(meta_list) > 0, f"No metadata found in {platform} output" + + for meta in meta_list: + assert "doc_version" in meta, f"doc_version missing in {platform} metadata: {meta}" + assert meta["doc_version"] == "16.2", ( + f"doc_version mismatch in {platform}: expected '16.2', got 
'{meta['doc_version']}'" + ) + + @pytest.mark.parametrize( + "platform", + ["chroma", "faiss", "langchain", "llama-index", "haystack", "pinecone"], + ) + def test_empty_doc_version_in_package_output( + self, platform, sample_skill_dir_no_doc_version, tmp_path + ): + """Empty doc_version is preserved (not omitted) in all adaptors.""" + from skill_seekers.cli.adaptors import get_adaptor + + adaptor = get_adaptor(platform) + output_path = adaptor.package(sample_skill_dir_no_doc_version, tmp_path) + + data = json.loads(output_path.read_text()) + meta_list = _extract_metadata_from_package(platform, data) + assert len(meta_list) > 0 + + for meta in meta_list: + assert "doc_version" in meta + + +# Qdrant and Weaviate may not be installed — test separately if available +class TestDocVersionQdrant: + """Test doc_version in Qdrant adaptor (may require qdrant client).""" + + def test_qdrant_doc_version(self, sample_skill_dir, tmp_path): + from skill_seekers.cli.adaptors import ADAPTORS + + if "qdrant" not in ADAPTORS: + pytest.skip("Qdrant adaptor not available") + from skill_seekers.cli.adaptors import get_adaptor + + adaptor = get_adaptor("qdrant") + output_path = adaptor.package(sample_skill_dir, tmp_path) + data = json.loads(output_path.read_text()) + + for point in data["points"]: + assert "doc_version" in point["payload"] + assert point["payload"]["doc_version"] == "16.2" + + +class TestWeaviateUploadReturnKeys: + """Test Weaviate upload() return dict has required keys.""" + + def test_weaviate_upload_success_has_url_key(self, sample_skill_dir, tmp_path, monkeypatch): + """Weaviate upload() success return includes 'url' key (prevents KeyError in package_skill.py).""" + import sys + import types + from unittest.mock import MagicMock + + from skill_seekers.cli.adaptors import ADAPTORS + + if "weaviate" not in ADAPTORS: + pytest.skip("Weaviate adaptor not available") + + from skill_seekers.cli.adaptors.weaviate import WeaviateAdaptor + + adaptor = WeaviateAdaptor() + + # 
Stub the weaviate module + mock_module = types.ModuleType("weaviate") + mock_client = MagicMock() + mock_client.is_ready.return_value = True + mock_module.Client = MagicMock(return_value=mock_client) + mock_module.AuthApiKey = MagicMock() + monkeypatch.setitem(sys.modules, "weaviate", mock_module) + + # Create a minimal weaviate package + output_path = adaptor.package(sample_skill_dir, tmp_path) + result = adaptor.upload(output_path) + + assert result["success"] is True + assert "url" in result + assert result["url"] is None + + +class TestDocVersionWeaviate: + """Test doc_version in Weaviate adaptor (may require weaviate client).""" + + def test_weaviate_doc_version(self, sample_skill_dir, tmp_path): + from skill_seekers.cli.adaptors import ADAPTORS + + if "weaviate" not in ADAPTORS: + pytest.skip("Weaviate adaptor not available") + from skill_seekers.cli.adaptors import get_adaptor + + adaptor = get_adaptor("weaviate") + output_path = adaptor.package(sample_skill_dir, tmp_path) + data = json.loads(output_path.read_text()) + + for obj in data["objects"]: + assert "doc_version" in obj["properties"] + assert obj["properties"]["doc_version"] == "16.2" + + def test_weaviate_schema_includes_doc_version(self, sample_skill_dir, tmp_path): + from skill_seekers.cli.adaptors import ADAPTORS + + if "weaviate" not in ADAPTORS: + pytest.skip("Weaviate adaptor not available") + from skill_seekers.cli.adaptors import get_adaptor + + adaptor = get_adaptor("weaviate") + output_path = adaptor.package(sample_skill_dir, tmp_path) + data = json.loads(output_path.read_text()) + + property_names = [p["name"] for p in data["schema"]["properties"]] + assert "doc_version" in property_names + + +# --------------------------------------------------------------------------- +# CLI Flag Tests +# --------------------------------------------------------------------------- + + +class TestDocVersionCLIFlag: + """Test --doc-version CLI flag is accepted.""" + + def 
test_common_arguments_has_doc_version(self): + """COMMON_ARGUMENTS includes doc_version.""" + from skill_seekers.cli.arguments.common import COMMON_ARGUMENTS + + assert "doc_version" in COMMON_ARGUMENTS + + def test_create_arguments_has_doc_version(self): + """UNIVERSAL_ARGUMENTS includes doc_version.""" + from skill_seekers.cli.arguments.create import UNIVERSAL_ARGUMENTS + + assert "doc_version" in UNIVERSAL_ARGUMENTS + + def test_doc_version_flag_parsed(self): + """--doc-version is parsed correctly by argparse.""" + import argparse + from skill_seekers.cli.arguments.common import add_common_arguments + + parser = argparse.ArgumentParser() + add_common_arguments(parser) + args = parser.parse_args(["--doc-version", "16.2"]) + assert args.doc_version == "16.2" + + def test_doc_version_default_empty(self): + """--doc-version defaults to empty string.""" + import argparse + from skill_seekers.cli.arguments.common import add_common_arguments + + parser = argparse.ArgumentParser() + add_common_arguments(parser) + args = parser.parse_args([]) + assert args.doc_version == "" + + +# --------------------------------------------------------------------------- +# Package choices test +# --------------------------------------------------------------------------- + + +class TestPineconeInPackageChoices: + """Test pinecone is in package CLI choices.""" + + def test_pinecone_in_package_arguments(self): + """pinecone is listed in package --target choices.""" + from skill_seekers.cli.arguments.package import PACKAGE_ARGUMENTS + + choices = PACKAGE_ARGUMENTS["target"]["kwargs"]["choices"] + assert "pinecone" in choices + + +# --------------------------------------------------------------------------- +# Helpers +# --------------------------------------------------------------------------- + + +def _extract_metadata_from_package(platform: str, data: dict) -> list[dict]: + """Extract metadata dicts from adaptor-specific package format.""" + meta_list = [] + + if platform == 
"pinecone": + for vec in data.get("vectors", []): + meta_list.append(vec.get("metadata", {})) + elif platform in ("chroma", "faiss"): + # Both store metadata dicts in a top-level "metadatas" list + for meta in data.get("metadatas", []): + meta_list.append(meta) + elif platform in ("langchain", "llama-index"): + # Both emit a flat list of documents/nodes with a "metadata" dict + for doc in data if isinstance(data, list) else []: + meta_list.append(doc.get("metadata", {})) + elif platform == "haystack": + for doc in data if isinstance(data, list) else []: + meta_list.append(doc.get("meta", {})) + elif platform == "qdrant": + for point in data.get("points", []): + meta_list.append(point.get("payload", {})) + elif platform == "weaviate": + for obj in data.get("objects", []): + meta_list.append(obj.get("properties", {})) + + return meta_list diff --git a/tests/test_upload_integration.py b/tests/test_upload_integration.py index 5d69ee2..75aa019 100644 --- a/tests/test_upload_integration.py +++ b/tests/test_upload_integration.py @@ -151,6 +151,36 @@ class TestWeaviateUploadBasics: assert hasattr(adaptor, "_generate_openai_embeddings") + +class TestEmbeddingMethodInheritance: + """Test that shared embedding methods are properly inherited from base.""" + + def test_chroma_inherits_openai_embeddings(self): + """Test chroma adaptor gets _generate_openai_embeddings from base.""" + adaptor = get_adaptor("chroma") + assert hasattr(adaptor, "_generate_openai_embeddings") + # Verify it's the base class method, not a local override + from skill_seekers.cli.adaptors.base import SkillAdaptor + assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings + + def test_weaviate_inherits_both_embedding_methods(self): + """Test weaviate adaptor gets both embedding methods from base.""" + adaptor = get_adaptor("weaviate") + assert hasattr(adaptor, "_generate_openai_embeddings") + assert
hasattr(adaptor, "_generate_st_embeddings") + from skill_seekers.cli.adaptors.base import SkillAdaptor + assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings + assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings + + def test_pinecone_inherits_both_embedding_methods(self): + """Test pinecone adaptor gets both embedding methods from base.""" + adaptor = get_adaptor("pinecone") + assert hasattr(adaptor, "_generate_openai_embeddings") + assert hasattr(adaptor, "_generate_st_embeddings") + from skill_seekers.cli.adaptors.base import SkillAdaptor + assert adaptor._generate_openai_embeddings.__func__ is SkillAdaptor._generate_openai_embeddings + assert adaptor._generate_st_embeddings.__func__ is SkillAdaptor._generate_st_embeddings + + class TestPackageStructure: """Test that packages are correctly structured for upload.""" diff --git a/tests/test_word_scraper.py b/tests/test_word_scraper.py index 72dc8c3..cfc14ef 100644 --- a/tests/test_word_scraper.py +++ b/tests/test_word_scraper.py @@ -16,6 +16,7 @@ Tests cover: """ import json +import os import shutil import tempfile import unittest @@ -456,6 +457,37 @@ class TestWordErrorHandling(unittest.TestCase): with self.assertRaises((KeyError, TypeError)): self.WordToSkillConverter({"docx_path": "test.docx"}) + def test_non_docx_file_raises_value_error(self): + """extract_docx raises ValueError for non-.docx files.""" + # Create a real file with wrong extension + txt_path = os.path.join(self.temp_dir, "test.txt") + with open(txt_path, "w") as f: + f.write("not a docx") + config = {"name": "test", "docx_path": txt_path} + converter = self.WordToSkillConverter(config) + with self.assertRaises(ValueError): + converter.extract_docx() + + def test_doc_file_raises_value_error(self): + """extract_docx raises ValueError for .doc (old Word format).""" + doc_path = os.path.join(self.temp_dir, "test.doc") + with open(doc_path, "w") as f: + f.write("not a 
docx") + config = {"name": "test", "docx_path": doc_path} + converter = self.WordToSkillConverter(config) + with self.assertRaises(ValueError): + converter.extract_docx() + + def test_no_extension_file_raises_value_error(self): + """extract_docx raises ValueError for file with no extension.""" + no_ext_path = os.path.join(self.temp_dir, "document") + with open(no_ext_path, "w") as f: + f.write("not a docx") + config = {"name": "test", "docx_path": no_ext_path} + converter = self.WordToSkillConverter(config) + with self.assertRaises(ValueError): + converter.extract_docx() + class TestWordJSONWorkflow(unittest.TestCase): """Test building skills from extracted JSON.""" diff --git a/uv.lock b/uv.lock index 6d7bf71..52a1aef 100644 --- a/uv.lock +++ b/uv.lock @@ -3621,11 +3621,11 @@ wheels = [ [[package]] name = "packaging" -version = "25.0" +version = "24.2" source = { registry = "https://pypi.org/simple" } -sdist = { url = "https://files.pythonhosted.org/packages/a1/d4/1fc4078c65507b51b96ca8f8c3ba19e6a61c8253c72794544580a7b6c24d/packaging-25.0.tar.gz", hash = "sha256:d443872c98d677bf60f6a1f2f8c1cb748e8fe762d2bf9d3148b5599295b0fc4f", size = 165727, upload-time = "2025-04-19T11:48:59.673Z" } +sdist = { url = "https://files.pythonhosted.org/packages/d0/63/68dbb6eb2de9cb10ee4c9c14a0148804425e13c4fb20d61cce69f53106da/packaging-24.2.tar.gz", hash = "sha256:c228a6dc5e932d346bc5739379109d49e8853dd8223571c7c5b55260edc0b97f", size = 163950, upload-time = "2024-11-08T09:47:47.202Z" } wheels = [ - { url = "https://files.pythonhosted.org/packages/20/12/38679034af332785aac8774540895e234f4d07f7545804097de4b666afd8/packaging-25.0-py3-none-any.whl", hash = "sha256:29572ef2b1f17581046b3a2227d5c611fb25ec70ca1ba8554b24b0e69331a484", size = 66469, upload-time = "2025-04-19T11:48:57.875Z" }, + { url = "https://files.pythonhosted.org/packages/88/ef/eb23f262cca3c0c4eb7ab1933c3b1f03d021f2c48f54763065b6f0e321be/packaging-24.2-py3-none-any.whl", hash = 
"sha256:09abb1bccd265c01f4a3aa3f7a7db064b36514d2cba19a2f694fe6150451a759", size = 65451, upload-time = "2024-11-08T09:47:44.722Z" }, ] [[package]] @@ -3797,6 +3797,46 @@ wheels = [ { url = "https://files.pythonhosted.org/packages/2d/71/64e9b1c7f04ae0027f788a248e6297d7fcc29571371fe7d45495a78172c0/pillow-12.1.0-pp311-pypy311_pp73-win_amd64.whl", hash = "sha256:75af0b4c229ac519b155028fa1be632d812a519abba9b46b20e50c6caa184f19", size = 7029809, upload-time = "2026-01-02T09:13:26.541Z" }, ] +[[package]] +name = "pinecone" +version = "8.1.0" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "certifi" }, + { name = "orjson" }, + { name = "pinecone-plugin-assistant" }, + { name = "pinecone-plugin-interface" }, + { name = "python-dateutil" }, + { name = "typing-extensions" }, + { name = "urllib3" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/e2/e4/8303133de5b3850c85d56caf9cc23cc38c74942bb8a940890b225245d7df/pinecone-8.1.0.tar.gz", hash = "sha256:48a00843fb232ccfd57eba618f0c0294e918b030e1bc7e853fb88d04f80ba569", size = 1041965, upload-time = "2026-02-19T20:08:32.999Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/4e/f7/beee7033ef92e5964e570fc29a048627e298745916e65c66105378405d06/pinecone-8.1.0-py3-none-any.whl", hash = "sha256:b0ba9c55c9a072fbe4fc7381bc3e5eb1b14550a8007233a3368ada74b1747534", size = 742745, upload-time = "2026-02-19T20:08:31.319Z" }, +] + +[[package]] +name = "pinecone-plugin-assistant" +version = "3.0.2" +source = { registry = "https://pypi.org/simple" } +dependencies = [ + { name = "packaging" }, + { name = "requests" }, +] +sdist = { url = "https://files.pythonhosted.org/packages/c4/16/dcaff42ddfeab75dccd17685a0db46489717c3d23753dc14c55770e12aa8/pinecone_plugin_assistant-3.0.2.tar.gz", hash = "sha256:04163af282ad7895b581ab89f850ed139e4ddcea72010cadfa4c573759d5c896", size = 152066, upload-time = "2026-02-01T09:08:48.04Z" } +wheels = [ + { url = 
"https://files.pythonhosted.org/packages/4a/dd/8bc4f3baf6c03acfb0b300f5aba53d19cc3a319281da518182bf22671b92/pinecone_plugin_assistant-3.0.2-py3-none-any.whl", hash = "sha256:de21ff696219fcad6c7ec86a3d1f70875024314537758ab345b6230462342903", size = 280863, upload-time = "2026-02-01T09:08:49.384Z" }, +] + +[[package]] +name = "pinecone-plugin-interface" +version = "0.0.7" +source = { registry = "https://pypi.org/simple" } +sdist = { url = "https://files.pythonhosted.org/packages/f4/fb/e8a4063264953ead9e2b24d9b390152c60f042c951c47f4592e9996e57ff/pinecone_plugin_interface-0.0.7.tar.gz", hash = "sha256:b8e6675e41847333aa13923cc44daa3f85676d7157324682dc1640588a982846", size = 3370, upload-time = "2024-06-05T01:57:52.093Z" } +wheels = [ + { url = "https://files.pythonhosted.org/packages/3b/1d/a21fdfcd6d022cb64cef5c2a29ee6691c6c103c4566b41646b080b7536a5/pinecone_plugin_interface-0.0.7-py3-none-any.whl", hash = "sha256:875857ad9c9fc8bbc074dbe780d187a2afd21f5bfe0f3b08601924a61ef1bba8", size = 6249, upload-time = "2024-06-05T01:57:50.583Z" }, +] + [[package]] name = "platformdirs" version = "4.9.2" @@ -5405,6 +5445,7 @@ all = [ { name = "numpy", version = "2.2.6", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version < '3.11'" }, { name = "numpy", version = "2.4.2", source = { registry = "https://pypi.org/simple" }, marker = "python_full_version >= '3.11'" }, { name = "openai" }, + { name = "pinecone" }, { name = "python-docx" }, { name = "sentence-transformers" }, { name = "sse-starlette" }, @@ -5457,8 +5498,12 @@ mcp = [ openai = [ { name = "openai" }, ] +pinecone = [ + { name = "pinecone" }, +] rag-upload = [ { name = "chromadb" }, + { name = "pinecone" }, { name = "sentence-transformers" }, { name = "weaviate-client" }, ] @@ -5533,6 +5578,9 @@ requires-dist = [ { name = "openai", marker = "extra == 'openai'", specifier = ">=1.0.0" }, { name = "pathspec", specifier = ">=0.12.1" }, { name = "pillow", specifier = ">=11.0.0" }, + { name = 
"pinecone", marker = "extra == 'all'", specifier = ">=5.0.0" }, + { name = "pinecone", marker = "extra == 'pinecone'", specifier = ">=5.0.0" }, + { name = "pinecone", marker = "extra == 'rag-upload'", specifier = ">=5.0.0" }, { name = "pydantic", specifier = ">=2.12.3" }, { name = "pydantic-settings", specifier = ">=2.11.0" }, { name = "pygithub", specifier = ">=2.5.0" }, @@ -5563,7 +5611,7 @@ requires-dist = [ { name = "weaviate-client", marker = "extra == 'rag-upload'", specifier = ">=3.25.0" }, { name = "weaviate-client", marker = "extra == 'weaviate'", specifier = ">=3.25.0" }, ] -provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "chroma", "weaviate", "sentence-transformers", "rag-upload", "all-cloud", "embedding", "all"] +provides-extras = ["mcp", "gemini", "openai", "all-llms", "s3", "gcs", "azure", "docx", "chroma", "weaviate", "sentence-transformers", "pinecone", "rag-upload", "all-cloud", "embedding", "all"] [package.metadata.requires-dev] dev = [