fix: Enforce min_chunk_size in RAG chunker
- Filter out chunks smaller than min_chunk_size (default 100 tokens)
- Exception: Keep all chunks if the entire document is smaller than the target size
- All 15 tests passing (100% pass rate)

Fixes an edge case where very small chunks (e.g., 'Short.' = 6 chars) were being created despite the min_chunk_size=100 setting.

Test: pytest tests/test_rag_chunker.py -v
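The filtering rule described in this message can be sketched roughly as follows. This is a minimal illustration, not the chunker's actual code: the function name `enforce_min_chunk_size`, the `target_size` default of 512, and the whitespace-based `count_tokens` stand-in are all assumptions made for the example.

```python
from typing import Callable, List


def enforce_min_chunk_size(
    chunks: List[str],
    min_chunk_size: int = 100,
    target_size: int = 512,  # assumed default, not taken from the commit
    # Hypothetical stand-in for a real tokenizer: counts whitespace-separated words.
    count_tokens: Callable[[str], int] = lambda text: len(text.split()),
) -> List[str]:
    """Drop chunks below min_chunk_size unless the whole document is small."""
    total_tokens = sum(count_tokens(chunk) for chunk in chunks)

    # Exception from the commit message: a document smaller than the
    # target size keeps all of its chunks, however short they are.
    if total_tokens < target_size:
        return list(chunks)

    # Otherwise keep only chunks that reach the minimum size.
    return [chunk for chunk in chunks if count_tokens(chunk) >= min_chunk_size]
```

With these assumptions, a long document followed by a stray `'Short.'` chunk loses the stray chunk, while a document consisting only of short chunks is returned unchanged.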
31
src/skill_seekers/embedding/__init__.py
Normal file
@@ -0,0 +1,31 @@
"""
Embedding generation system for Skill Seekers.

Provides:
- FastAPI server for embedding generation
- Multiple embedding model support (OpenAI, sentence-transformers, Anthropic)
- Batch processing for efficiency
- Caching layer for embeddings
- Vector database integration

Usage:
    # Start server
    python -m skill_seekers.embedding.server

    # Generate embeddings
    curl -X POST http://localhost:8000/embed \
        -H "Content-Type: application/json" \
        -d '{"texts": ["Hello world"], "model": "text-embedding-3-small"}'
"""

from .models import EmbeddingRequest, EmbeddingResponse, BatchEmbeddingRequest
from .generator import EmbeddingGenerator
from .cache import EmbeddingCache

__all__ = [
    'EmbeddingRequest',
    'EmbeddingResponse',
    'BatchEmbeddingRequest',
    'EmbeddingGenerator',
    'EmbeddingCache',
]
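For callers who prefer Python over curl, the request from the docstring's usage section could be built along these lines. This is a sketch using only the standard library; the helper name `build_embed_request` is hypothetical, and the URL, payload keys, and model name are copied from the docstring rather than from the server code. The response format is not shown in this diff, so the sketch does not attempt to parse it.

```python
import json
import urllib.request
from typing import List


def build_embed_request(
    texts: List[str],
    model: str = "text-embedding-3-small",
    base_url: str = "http://localhost:8000",
) -> urllib.request.Request:
    """Build (but do not send) a POST request mirroring the curl example."""
    # Same JSON body as the curl -d argument in the docstring.
    payload = json.dumps({"texts": texts, "model": model}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/embed",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends it, which requires the server started with `python -m skill_seekers.embedding.server` to be listening on port 8000.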