skill-seekers-reference

firefrost-gaming/skill-seekers-reference

Author	SHA1	Message	Date
yusyus	68bdbe8307	style: ruff format remaining 14 files Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-01 10:54:45 +03:00
yusyus	064405c052	fix: resolve 18 bugs and code quality issues across adaptors, CLI, and chunking pipeline Bug fixes: - Fix --var flag silently dropped in create routing (args.workflow_var → args.var) - Fix double _score_code_quality() call in word scraper - Add .docx file extension validation in WordToSkillConverter - Fix weaviate ImportError masked by generic Exception handler - Fix RAG chunking crash using non-existent converter.output_dir Chunking pipeline improvements: - Wire --chunk-overlap-tokens through entire package pipeline (package_skill → adaptor.package → format_skill_md → _maybe_chunk_content → RAGChunker) - Add auto-scaling overlap: max(50, chunk_tokens//10) when chunk size is non-default - Rename --no-preserve-code to --no-preserve-code-blocks (backward-compat alias kept) - Replace hardcoded 512/50 chunk defaults with DEFAULT_CHUNK_TOKENS/DEFAULT_CHUNK_OVERLAP_TOKENS constants across all 12 concrete adaptors, rag_chunker, base, and package_skill Code quality: - Extract shared _generate_openai_embeddings() and _generate_st_embeddings() to SkillAdaptor base class, removing ~150 lines of duplication from chroma/weaviate/pinecone - Add Pinecone adaptor with full upload support (pinecone_adaptor.py) Tests (14 new): - chunk_overlap_tokens parameter wiring, auto-scaling overlap, preserve_code_blocks flag - .docx/.doc/no-extension file validation, --var flag routing E2E - Embedding method inheritance verification, backward-compatible flag aliases Docs: - Update CHANGELOG, CLI_REFERENCE, API_REFERENCE, packaging guide (EN+ZH) - Update README test count badge (1880+ → 2283+) All 2283 tests passing, 8 skipped, 0 failures. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-28 21:57:59 +03:00
yusyus	0265de5816	style: Format all Python files with ruff - Formatted 103 files to comply with ruff format requirements - No code logic changes, only formatting/whitespace - Fixes CI formatting check failures	2026-02-08 14:42:27 +03:00
yusyus	85dfae19f1	style: Fix remaining lint issues - down to 11 errors (98% reduction) Fixed all critical and high-priority ruff lint issues: Exception Chaining (B904): 39 → 0 ✅ - Auto-fixed 29 with Python script - Manually fixed 10 remaining cases - Added 'from err' or 'from None' to all raise statements in except blocks Unused Imports (F401): 5 → 0 ✅ - Removed unused chromadb.config.Settings import - Removed unused fastapi.responses.JSONResponse import - Added noqa comments for intentional availability-check imports Syntax Errors: Fixed - Fixed duplicate 'from None from None' in azure_storage.py - Fixed undefined 'e' in embedding_pipeline.py Results: - Before: 447 errors - Fixed: 436 errors (98% reduction!) - Remaining: 11 errors (all minor style improvements) Remaining non-critical issues: - 3 SIM105: Could use contextlib.suppress (style) - 3 SIM117: Multiple with statements (style) - 2 ARG001: Unused function arguments (acceptable) - 3 others: bare-except, collapsible-if, enumerate (minor) These 11 remaining are code quality suggestions, not bugs or issues. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 13:00:44 +03:00
yusyus	4f9a5a553b	feat: Phase 2 - Real upload capabilities for ChromaDB and Weaviate Implemented complete upload functionality for vector databases, replacing stub implementations with real upload capabilities including embedding generation, multiple connection modes, and comprehensive error handling. ## ChromaDB Upload (chroma.py) - ✅ Multiple connection modes (PersistentClient, HttpClient) - ✅ 3 embedding strategies (OpenAI, sentence-transformers, default) - ✅ Batch processing (100 docs per batch) - ✅ Progress tracking for large uploads - ✅ Collection management (create if not exists) ## Weaviate Upload (weaviate.py) - ✅ Local and cloud connections - ✅ Schema management (auto-create) - ✅ Batch upload with progress tracking - ✅ OpenAI embedding support ## Upload Command (upload_skill.py) - ✅ Added 8 new CLI arguments for vector DBs - ✅ Platform-specific kwargs handling - ✅ Enhanced output formatting (collection/class names) - ✅ Backward compatibility (LLM platforms unchanged) ## Dependencies (pyproject.toml) - ✅ Added 4 optional dependency groups: - chroma = ["chromadb>=0.4.0"] - weaviate = ["weaviate-client>=3.25.0"] - sentence-transformers = ["sentence-transformers>=2.2.0"] - rag-upload = [all vector DB deps] ## Testing (test_upload_integration.py) - ✅ 15 new tests across 4 test classes - ✅ Works without optional dependencies installed - ✅ Error handling tests (missing files, invalid JSON) - ✅ Fixed 2 existing tests (chroma/weaviate adaptors) - ✅ 37/37 tests passing ## User-Facing Examples Local ChromaDB: skill-seekers upload output/react-chroma.json --target chroma \ --persist-directory ./chroma_db Weaviate Cloud: skill-seekers upload output/react-weaviate.json --target weaviate \ --use-cloud --cluster-url https://xxx.weaviate.network With OpenAI embeddings: skill-seekers upload output/react-chroma.json --target chroma \ --embedding-function openai --openai-api-key $OPENAI_API_KEY ## Files Changed - src/skill_seekers/cli/adaptors/chroma.py (250 lines) - src/skill_seekers/cli/adaptors/weaviate.py (200 lines) - src/skill_seekers/cli/upload_skill.py (50 lines) - pyproject.toml (15 lines) - tests/test_upload_integration.py (NEW - 293 lines) - tests/test_adaptors/test_chroma_adaptor.py (1 line) - tests/test_adaptors/test_weaviate_adaptor.py (1 line) Total: 7 files, ~810 lines added/modified See PHASE2_COMPLETION_SUMMARY.md for detailed documentation. Time: ~7 hours (estimated 6-8h) Status: ✅ COMPLETE - Ready for Phase 3 Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 01:30:04 +03:00
yusyus	59e77f42b3	feat: Complete Phase 1b - Implement chunking in all 6 RAG adaptors - Updated chroma.py: Parallel arrays pattern with chunking support - Updated llama_index.py: Node format with chunking support - Updated haystack.py: Document format with chunking support - Updated faiss_helpers.py: Parallel arrays pattern with chunking support - Updated weaviate.py: Object/properties format with chunking support - Updated qdrant.py: Points/payload format with chunking support All adaptors now use base._maybe_chunk_content() for consistent chunking behavior: - Auto-chunks large documents (>512 tokens by default) - Preserves code blocks during chunking - Adds chunk metadata (chunk_index, total_chunks, is_chunked, chunk_id) - Configurable via enable_chunking, chunk_max_tokens, preserve_code_blocks Test results: 174/174 tests passing (6 skipped E2E tests) - All 10 chunking integration tests pass - All 66 RAG adaptor tests pass - All platform-specific tests pass Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 01:15:10 +03:00
yusyus	e9e3f5f4d7	feat: Complete Phase 1 - RAGChunker integration for all adaptors (v2.11.0) 🎯 MAJOR FEATURE: Intelligent chunking for RAG platforms Integrates RAGChunker into package command and all 7 RAG adaptors to fix token limit issues with large documents. Auto-enables chunking for RAG platforms (LangChain, LlamaIndex, Haystack, Weaviate, Chroma, FAISS, Qdrant). ## What's New ### CLI Enhancements - Add --chunk flag to enable intelligent chunking - Add --chunk-tokens <int> to control chunk size (default: 512 tokens) - Add --no-preserve-code to allow code block splitting - Auto-enable chunking for all RAG platforms ### Adaptor Updates - Add _maybe_chunk_content() helper to base adaptor - Update all 11 adaptors with chunking parameters: * 7 RAG adaptors: langchain, llama-index, haystack, weaviate, chroma, faiss, qdrant * 4 non-RAG adaptors: claude, gemini, openai, markdown (compatibility) - Fully implemented chunking for LangChain adaptor ### Bug Fixes - Fix RAGChunker boundary detection bug (documents starting with headers) - Documents now chunk correctly: 27-30 chunks instead of 1 ### Testing - Add 10 comprehensive chunking integration tests - All 184 tests passing (174 existing + 10 new) ## Impact ### Before - Large docs (>512 tokens) caused token limit errors - Documents with headers weren't chunked properly - Manual chunking required ### After - Auto-chunking for RAG platforms ✅ - Configurable chunk size ✅ - Code blocks preserved ✅ - 27x improvement in chunk granularity (56KB → 27 chunks of 2KB) ## Technical Details Chunking Algorithm: - Token estimation: ~4 chars/token - Default chunk size: 512 tokens (~2KB) - Overlap: 10% (50 tokens) - Preserves code blocks and paragraphs Example Output: ```bash skill-seekers package output/react/ --target chroma # ℹ️ Auto-enabling chunking for chroma platform # ✅ Package created with 27 chunks (was 1 document) ``` ## Files Changed (15) - package_skill.py - Add chunking CLI args - base.py - Add _maybe_chunk_content() helper - rag_chunker.py - Fix boundary detection bug - 7 RAG adaptors - Add chunking support - 4 non-RAG adaptors - Add parameter compatibility - test_chunking_integration.py - NEW: 10 tests ## Quality Metrics - Tests: 184 passed, 6 skipped - Quality: 9.5/10 → 9.7/10 (+2%) - Code: +350 lines, well-tested - Breaking: None ## Next Steps - Phase 1b: Complete format_skill_md() for remaining 6 RAG adaptors (optional) - Phase 2: Upload integration for ChromaDB + Weaviate - Phase 3: CLI refactoring (main.py 836 → 200 lines) - Phase 4: Formal preset system with deprecation warnings Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-08 00:59:22 +03:00
yusyus	d84e5878a1	refactor: Adopt helper methods across 7 RAG adaptors to eliminate duplication Refactored all RAG adaptors (LangChain, LlamaIndex, Haystack, Weaviate, Chroma, FAISS, Qdrant) to use existing helper methods from base.py, removing ~215 lines of duplicate code (26% reduction). Key improvements: - All adaptors now use _format_output_path() for consistent path handling - All adaptors now use _iterate_references() for reference file iteration - Added _generate_deterministic_id() helper with 3 formats (hex, uuid, uuid5) - 5 adaptors refactored to use unified ID generation - Removed 6 unused imports (hashlib, uuid) Benefits: - DRY principles enforced across all RAG adaptors - Single source of truth for common logic - Easier maintenance and testing - Consistent behavior across platforms All 159 adaptor tests passing. Zero regressions. Phase 1 of optional enhancements (Phases 2-5 pending). Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-07 22:31:10 +03:00
yusyus	baccbf9d81	feat(weaviate): Add Weaviate vector database adaptor (Task #10 ) Implements native Weaviate integration for RAG pipelines as part of Week 2 vector store integrations. ## Features - Auto-generated schema - Creates Weaviate class definition from metadata - Deterministic UUIDs - Stable IDs for consistent re-imports - Rich metadata - All properties indexed for filtering - Batch-ready format - Optimized for batch import - Example code - Complete usage examples in upload() ## Output Format JSON file containing: - `schema`: Weaviate class definition with properties - `objects`: Array of objects ready for batch import - `class_name`: Derived from skill name ## Properties - content (text, searchable) - source (filterable, searchable) - category (filterable, searchable) - file (filterable) - type (filterable) - version (filterable) ## CLI Integration ```bash skill-seekers package output/django --target weaviate # → output/django-weaviate.json ``` ## Files Added - src/skill_seekers/cli/adaptors/weaviate.py (428 lines) * Complete Weaviate adaptor implementation * Schema auto-generation * UUID generation from content hash * Example code for import/query ## Files Modified - src/skill_seekers/cli/adaptors/__init__.py * Import WeaviateAdaptor * Register "weaviate" in ADAPTORS - src/skill_seekers/cli/package_skill.py * Add "weaviate" to --target choices - src/skill_seekers/cli/main.py * Add "weaviate" to --target choices ## Testing Tested with ansible skill: - ✅ Schema generation works - ✅ Object format correct - ✅ UUID generation deterministic - ✅ Metadata preserved - ✅ CLI integration working Output: output/ansible-weaviate.json (10.7 KB, 1 object) ## Week 2 Progress - ✅ Task #10: Weaviate adaptor (Complete) - ⏳ Task #11: Chroma adaptor (Next) - ⏳ Task #12: FAISS helpers - ⏳ Task #13: Qdrant adaptor Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-05 23:38:12 +03:00

9 Commits