fix: Enforce min_chunk_size in RAG chunker

- Filter out chunks smaller than min_chunk_size (default 100 tokens)
- Exception: Keep all chunks if entire document is smaller than target size
- All 15 tests pass

Fixes an edge case where very small chunks (e.g., 'Short.', 6 characters)
were still being created despite the min_chunk_size=100 setting.
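
The filtering rule above could be sketched roughly as follows. This is a minimal illustration, not the chunker's actual code: the function name `enforce_min_chunk_size`, the whitespace-based `count_tokens_approx` helper, and the `target_chunk_size=512` default are all assumptions made for the example.

```python
from typing import Callable, List


def count_tokens_approx(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())


def enforce_min_chunk_size(
    chunks: List[str],
    min_chunk_size: int = 100,      # default named in the commit message
    target_chunk_size: int = 512,   # assumed value, not taken from the source
    count_tokens: Callable[[str], int] = count_tokens_approx,
) -> List[str]:
    """Drop chunks below min_chunk_size unless the whole document is small."""
    total_tokens = sum(count_tokens(chunk) for chunk in chunks)

    # Exception: a document smaller than the target size keeps every chunk,
    # otherwise a short document could end up with no chunks at all.
    if total_tokens < target_chunk_size:
        return list(chunks)

    # Normal case: filter out chunks smaller than the minimum.
    return [chunk for chunk in chunks if count_tokens(chunk) >= min_chunk_size]
```

With these assumed defaults, a long document that ends in a stray 'Short.' chunk loses that chunk, while a document consisting only of 'Short.' keeps it.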

Test: pytest tests/test_rag_chunker.py -v
yusyus
2026-02-07 20:59:03 +03:00
parent 3a769a27cd
commit 8b3f31409e
65 changed files with 16133 additions and 7 deletions

.env.example Normal file

@@ -0,0 +1,41 @@
# Skill Seekers Docker Environment Configuration
# Copy this file to .env and fill in your API keys
# Claude AI / Anthropic API
# Required for AI enhancement features
# Get your key from: https://console.anthropic.com/
ANTHROPIC_API_KEY=sk-ant-your-key-here
# Google Gemini API (Optional)
# Required for Gemini platform support
# Get your key from: https://makersuite.google.com/app/apikey
GOOGLE_API_KEY=
# OpenAI API (Optional)
# Required for OpenAI/ChatGPT platform support
# Get your key from: https://platform.openai.com/api-keys
OPENAI_API_KEY=
# GitHub Token (Optional, but recommended)
# Increases rate limits from 60/hour to 5000/hour
# Create token at: https://github.com/settings/tokens
# Required scopes: public_repo (for public repos)
GITHUB_TOKEN=
# MCP Server Configuration
MCP_TRANSPORT=http
MCP_PORT=8765
# Docker Resource Limits (Optional)
# Uncomment to set custom limits
# DOCKER_CPU_LIMIT=2.0
# DOCKER_MEMORY_LIMIT=4g
# Vector Database Ports (Optional - change if needed)
# WEAVIATE_PORT=8080
# QDRANT_PORT=6333
# CHROMA_PORT=8000
# Logging (Optional)
# SKILL_SEEKERS_LOG_LEVEL=INFO
# SKILL_SEEKERS_LOG_FILE=/data/logs/skill-seekers.log
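
As an illustration of how an application might consume these variables, here is a minimal sketch. The `load_settings` function and its dictionary keys are hypothetical, and treating the example file's values (`http`, `8765`, `INFO`) as fallback defaults is an assumption, not confirmed behavior of Skill Seekers.

```python
import os
from typing import Dict, Mapping, Optional


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Read the variables named in .env.example into a plain dict."""
    env = os.environ if env is None else env

    def optional(name: str) -> Optional[str]:
        # An empty value such as `GOOGLE_API_KEY=` is treated as "not set".
        return env.get(name) or None

    return {
        "anthropic_api_key": optional("ANTHROPIC_API_KEY"),
        "google_api_key": optional("GOOGLE_API_KEY"),
        "openai_api_key": optional("OPENAI_API_KEY"),
        "github_token": optional("GITHUB_TOKEN"),
        # Fallbacks below mirror the example file (assumed defaults).
        "mcp_transport": env.get("MCP_TRANSPORT") or "http",
        "mcp_port": int(env.get("MCP_PORT") or "8765"),
        "log_level": env.get("SKILL_SEEKERS_LOG_LEVEL") or "INFO",
    }
```

Passing a mapping explicitly, for example `load_settings({"MCP_PORT": "9000"})`, keeps the function easy to test without modifying the process environment.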