fix: Enforce min_chunk_size in RAG chunker

- Filter out chunks smaller than min_chunk_size (default 100 tokens)
- Exception: Keep all chunks if entire document is smaller than target size
- All 15 tests passing (100% pass rate)

Fixes edge case where very small chunks (e.g., 'Short.' = 6 chars) were being created despite min_chunk_size=100 setting.

Test: pytest tests/test_rag_chunker.py -v
2026-02-07 20:59:03 +03:00
parent 3a769a27cd
commit 8b3f31409e
65 changed files with 16133 additions and 7 deletions
--- a/docs/PRODUCTION_DEPLOYMENT.md
+++ b/docs/PRODUCTION_DEPLOYMENT.md
@@ -0,0 +1,827 @@
+# Production Deployment Guide
+
+Complete guide for deploying Skill Seekers in production environments.
+
+## Table of Contents
+
+- [Prerequisites](#prerequisites)
+- [Installation](#installation)
+- [Configuration](#configuration)
+- [Deployment Options](#deployment-options)
+- [Monitoring & Observability](#monitoring--observability)
+- [Security](#security)
+- [Scaling](#scaling)
+- [Backup & Disaster Recovery](#backup--disaster-recovery)
+- [Troubleshooting](#troubleshooting)
+- [Performance Tuning](#performance-tuning)
+- [Next Steps](#next-steps)
+
+## Prerequisites
+
+### System Requirements
+
+**Minimum:**
+- CPU: 2 cores
+- RAM: 4 GB
+- Disk: 10 GB
+- Python: 3.10+
+
+**Recommended (for production):**
+- CPU: 4+ cores
+- RAM: 8+ GB
+- Disk: 50+ GB SSD
+- Python: 3.12+
+
+### Dependencies
+
+**Required:**
+```bash
+# System packages (Ubuntu/Debian)
+sudo apt update
+sudo apt install -y python3.12 python3.12-venv python3-pip \
+  git curl wget build-essential libssl-dev
+
+# System packages (RHEL/CentOS)
+sudo yum install -y python3.12 python3.12-devel git curl wget \
+  gcc gcc-c++ openssl-devel
+```
+
+**Optional (for specific features):**
+```bash
+# OCR support (PDF scraping)
+sudo apt install -y tesseract-ocr
+
+# Cloud storage
+# (Install provider-specific SDKs via pip)
+
+# Embedding generation
+# (GPU support requires CUDA)
+```
+
+## Installation
+
+### 1. Production Installation
+
+```bash
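+# Create dedicated user and an install directory it owns
+sudo useradd -m -s /bin/bash skillseekers
+sudo mkdir -p /opt/skillseekers
+sudo chown skillseekers:skillseekers /opt/skillseekers
+sudo su - skillseekers
+
+# Create virtual environment
+python3.12 -m venv /opt/skillseekers/venv
+source /opt/skillseekers/venv/bin/activate
+
+# Install package (quoted so the shell does not expand the brackets)
+pip install --upgrade pip
+pip install "skill-seekers[all]"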
+
+# Verify installation
+skill-seekers --version
+```
+
+### 2. Configuration Directory
+
+```bash
+# Create config directory
+mkdir -p ~/.config/skill-seekers/{configs,output,logs,cache}
+
+# Set permissions
+chmod 700 ~/.config/skill-seekers
+```
+
+### 3. Environment Variables
+
+Create `/opt/skillseekers/.env`:
+
+```bash
+# API Keys
+ANTHROPIC_API_KEY=sk-ant-...
+GOOGLE_API_KEY=AIza...
+OPENAI_API_KEY=sk-...
+VOYAGE_API_KEY=...
+
+# GitHub Tokens (use skill-seekers config --github for multiple)
+GITHUB_TOKEN=ghp_...
+
+# Cloud Storage (optional)
+AWS_ACCESS_KEY_ID=...
+AWS_SECRET_ACCESS_KEY=...
+GOOGLE_APPLICATION_CREDENTIALS=/path/to/gcs-key.json
+AZURE_STORAGE_CONNECTION_STRING=...
+
+# MCP Server
+MCP_TRANSPORT=http
+MCP_PORT=8765
+
+# Sync Monitoring (optional)
+SYNC_WEBHOOK_URL=https://...
+SLACK_WEBHOOK_URL=https://hooks.slack.com/...
+
+# Logging
+LOG_LEVEL=INFO
+LOG_FILE=/var/log/skillseekers/app.log
+```
+
+**Security Note:** Never commit `.env` files to version control!
+
+```bash
+# Secure the env file
+chmod 600 /opt/skillseekers/.env
+```
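A minimal sketch of a pre-flight check for the env file described above. The helper names (`parse_env`, `missing_keys`, `is_private`) and the choice of required keys are assumptions for illustration; they are not part of Skill Seekers.

```python
import os
import stat

# Assumption: the keys this deployment cannot run without.
REQUIRED_KEYS = ("ANTHROPIC_API_KEY", "GITHUB_TOKEN")


def parse_env(text):
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip()
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    values = parse_env(text)
    return [key for key in required if not values.get(key)]


def is_private(path):
    """True if the file grants nothing to group or others (for example mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & (stat.S_IRWXG | stat.S_IRWXO) == 0
```

Calling `missing_keys(open("/opt/skillseekers/.env").read())` and `is_private("/opt/skillseekers/.env")` before starting the service would catch an empty key or a world-readable file early.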
+
+## Configuration
+
+### 1. GitHub Configuration
+
+Use the interactive configuration wizard:
+
+```bash
+skill-seekers config --github
+```
+
+This will:
+- Add GitHub personal access tokens
+- Configure rate limit strategies
+- Test token validity
+- Support multiple profiles (work, personal, etc.)
+
+### 2. API Keys Configuration
+
+```bash
+skill-seekers config --api-keys
+```
+
+Configure:
+- Claude API (Anthropic)
+- Gemini API (Google)
+- OpenAI API
+- Voyage AI (embeddings)
+
+### 3. Connection Testing
+
+```bash
+skill-seekers config --test
+```
+
+Verifies:
+- ✅ GitHub token(s) validity and rate limits
+- ✅ Claude API connectivity
+- ✅ Gemini API connectivity
+- ✅ OpenAI API connectivity
+- ✅ Cloud storage access (if configured)
+
+## Deployment Options
+
+### Option 1: Systemd Service (Recommended)
+
+Create `/etc/systemd/system/skillseekers-mcp.service`:
+
+```ini
+[Unit]
+Description=Skill Seekers MCP Server
+After=network.target
+
+[Service]
+Type=simple
+User=skillseekers
+Group=skillseekers
+WorkingDirectory=/opt/skillseekers
+EnvironmentFile=/opt/skillseekers/.env
+ExecStart=/opt/skillseekers/venv/bin/python -m skill_seekers.mcp.server_fastmcp --transport http --port 8765
+Restart=always
+RestartSec=10
+StandardOutput=journal
+StandardError=journal
+SyslogIdentifier=skillseekers-mcp
+
+# Security
+NoNewPrivileges=true
+PrivateTmp=true
+ProtectSystem=strict
+ProtectHome=true
+ReadWritePaths=/opt/skillseekers /var/log/skillseekers
+
+[Install]
+WantedBy=multi-user.target
+```
+
+**Enable and start:**
+
+```bash
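+# The log directory listed in ReadWritePaths must exist before the service starts
+sudo install -d -o skillseekers -g skillseekers /var/log/skillseekers
+
+sudo systemctl daemon-reload
+sudo systemctl enable skillseekers-mcp
+sudo systemctl start skillseekers-mcp
+sudo systemctl status skillseekers-mcp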
+```
+
+### Option 2: Docker Deployment
+
+See [Docker Deployment Guide](./DOCKER_DEPLOYMENT.md) for detailed instructions.
+
+**Quick Start:**
+
+```bash
+# Build image
+docker build -t skillseekers:latest .
+
+# Run container
+docker run -d \
+  --name skillseekers-mcp \
+  -p 8765:8765 \
+  -e ANTHROPIC_API_KEY=$ANTHROPIC_API_KEY \
+  -e GITHUB_TOKEN=$GITHUB_TOKEN \
+  -v /opt/skillseekers/data:/app/data \
+  --restart unless-stopped \
+  skillseekers:latest
+```
+
+### Option 3: Kubernetes Deployment
+
+See [Kubernetes Deployment Guide](./KUBERNETES_DEPLOYMENT.md) for detailed instructions.
+
+**Quick Start:**
+
+```bash
+# Install with Helm
+helm install skillseekers ./helm/skillseekers \
+  --namespace skillseekers \
+  --create-namespace \
+  --set secrets.anthropicApiKey=$ANTHROPIC_API_KEY \
+  --set secrets.githubToken=$GITHUB_TOKEN
+```
+
+### Option 4: Docker Compose
+
+See [Docker Compose Guide](./DOCKER_COMPOSE.md) for multi-service deployment.
+
+```bash
+# Start all services
+docker-compose up -d
+
+# Check status
+docker-compose ps
+
+# View logs
+docker-compose logs -f
+```
+
+## Monitoring & Observability
+
+### 1. Health Checks
+
+**MCP Server Health:**
+
+```bash
+# HTTP transport
+curl http://localhost:8765/health
+
+# Expected response:
+# {
+#   "status": "healthy",
+#   "version": "2.9.0",
+#   "uptime": 3600,
+#   "tools": 25
+# }
+```
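The same check can be scripted for a cron job or a container probe. This is a sketch that assumes the response shape shown above; `is_healthy` and `check_health` are hypothetical helper names, not part of the package.

```python
import json
import urllib.request


def is_healthy(payload):
    """Return True when the /health payload reports status "healthy"."""
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(url="http://localhost:8765/health", timeout=5.0):
    """Fetch the health endpoint; any network or parse error counts as unhealthy."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(json.loads(response.read().decode("utf-8")))
    except (OSError, ValueError):
        # URLError and HTTPError are OSError subclasses; bad JSON raises ValueError.
        return False
```

A wrapper script can exit non-zero when `check_health()` returns `False`, so the supervisor or orchestrator restarts the service.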
+
+### 2. Logging
+
+**Configure structured logging:**
+
+```yaml
+# config/logging.yaml
+version: 1
+formatters:
+  json:
+    format: '{"time":"%(asctime)s","level":"%(levelname)s","msg":"%(message)s"}'
+handlers:
+  file:
+    class: logging.handlers.RotatingFileHandler
+    filename: /var/log/skillseekers/app.log
+    maxBytes: 10485760  # 10MB
+    backupCount: 5
+    formatter: json
+loggers:
+  skill_seekers:
+    level: INFO
+    handlers: [file]
+```
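The format string above builds JSON by string interpolation, so a log message that contains a double quote yields a line that is not valid JSON. One possible alternative is a small formatter class that serializes each record with `json.dumps`. `JsonFormatter` is a hypothetical name; it could be referenced from the logging config through the `()` factory key that `logging.config.dictConfig` supports.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record):
        entry = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "msg": record.getMessage(),
        }
        if record.exc_info:
            # Attach the formatted traceback when the record carries one.
            entry["exc"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

Because `json.dumps` escapes quotes and newlines, every emitted line stays parseable by log aggregators.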
+
+**Log aggregation options:**
+- **ELK Stack:** Elasticsearch + Logstash + Kibana
+- **Grafana Loki:** Lightweight log aggregation
+- **CloudWatch Logs:** For AWS deployments
+- **Stackdriver:** For GCP deployments
+
+### 3. Metrics
+
+**Prometheus metrics endpoint:**
+
+```python
+# Add to MCP server
+from prometheus_client import start_http_server, Counter, Histogram
+
+# Metrics
+scraping_requests = Counter('scraping_requests_total', 'Total scraping requests')
+scraping_duration = Histogram('scraping_duration_seconds', 'Scraping duration')
+
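+# Start metrics server (9090 is also Prometheus's own default port;
+# pick a different port if both run on the same host)
+start_http_server(9090)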
+```
+
+**Key metrics to monitor:**
+- Request rate
+- Response time (p50, p95, p99)
+- Error rate
+- Memory usage
+- CPU usage
+- Disk I/O
+- GitHub API rate limit remaining
+- Claude API token usage
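
For quick local analysis of response times, the percentiles listed above can be computed from raw samples with the standard library. This is only a sketch (`latency_percentiles` is a hypothetical helper); in a Prometheus setup the same figures normally come from `histogram_quantile` over a histogram metric.

```python
import statistics


def latency_percentiles(samples):
    """Return p50, p95 and p99 of a list of durations (needs at least 2 samples)."""
    # n=100 yields 99 cut points; index k-1 is the k-th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```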
+
+### 4. Alerting
+
+**Example Prometheus alert rules:**
+
+```yaml
+groups:
+  - name: skillseekers
+    rules:
+      - alert: HighErrorRate
+        expr: |
+          sum(rate(http_requests_total{status=~"5.."}[5m]))
+            / sum(rate(http_requests_total[5m])) > 0.05
+        for: 5m
+        annotations:
+          summary: "More than 5% of requests are failing with 5xx errors"
+
+      - alert: HighMemoryUsage
+        expr: process_resident_memory_bytes > 2e9  # 2GB
+        for: 10m
+        annotations:
+          summary: "Memory usage above 2GB"
+
+      - alert: GitHubRateLimitLow
+        expr: github_rate_limit_remaining < 100
+        for: 1m
+        annotations:
+          summary: "GitHub rate limit low"
+```
+
+## Security
+
+### 1. API Key Management
+
+**Best Practices:**
+
+✅ **DO:**
+- Store keys in environment variables or secret managers
+- Use different keys for dev/staging/prod
+- Rotate keys regularly (quarterly minimum)
+- Use least-privilege IAM roles for cloud services
+- Monitor key usage for anomalies
+
+❌ **DON'T:**
+- Commit keys to version control
+- Share keys via email/Slack
+- Use production keys in development
+- Grant overly broad permissions
+
+**Recommended Secret Managers:**
+- **Kubernetes Secrets** (for K8s deployments)
+- **AWS Secrets Manager** (for AWS)
+- **Google Secret Manager** (for GCP)
+- **Azure Key Vault** (for Azure)
+- **HashiCorp Vault** (cloud-agnostic)
+
+### 2. Network Security
+
+**Firewall Rules:**
+
+```bash
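+# Set default policies, allow only necessary ports, then enable
+sudo ufw default deny incoming
+sudo ufw default allow outgoing
+sudo ufw allow 22/tcp    # SSH (allow before enabling to avoid locking yourself out)
+sudo ufw allow 8765/tcp  # MCP server (if public)
+sudo ufw enable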
+```
+
+**Reverse Proxy (Nginx):**
+
+```nginx
+# /etc/nginx/sites-available/skillseekers
+
+# Rate limiting zone (limit_req_zone is only valid in the http context,
+# so it must be declared outside the server blocks)
+limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
+
+server {
+    listen 80;
+    server_name api.skillseekers.example.com;
+
+    # Redirect to HTTPS
+    return 301 https://$server_name$request_uri;
+}
+
+server {
+    listen 443 ssl http2;
+    server_name api.skillseekers.example.com;
+
+    ssl_certificate /etc/letsencrypt/live/api.skillseekers.example.com/fullchain.pem;
+    ssl_certificate_key /etc/letsencrypt/live/api.skillseekers.example.com/privkey.pem;
+
+    # Security headers
+    add_header Strict-Transport-Security "max-age=31536000" always;
+    add_header X-Frame-Options "SAMEORIGIN" always;
+    add_header X-Content-Type-Options "nosniff" always;
+
+    # Rate limiting
+    limit_req zone=api burst=20 nodelay;
+
+    location / {
+        proxy_pass http://localhost:8765;
+        proxy_set_header Host $host;
+        proxy_set_header X-Real-IP $remote_addr;
+        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
+        proxy_set_header X-Forwarded-Proto $scheme;
+
+        # Timeouts
+        proxy_connect_timeout 60s;
+        proxy_send_timeout 60s;
+        proxy_read_timeout 60s;
+    }
+}
+```
+
+### 3. TLS/SSL
+
+**Let's Encrypt (free certificates):**
+
+```bash
+# Install certbot
+sudo apt install certbot python3-certbot-nginx
+
+# Obtain certificate
+sudo certbot --nginx -d api.skillseekers.example.com
+
+# Auto-renewal: add this entry to root's crontab (sudo crontab -e)
+# 0 12 * * * /usr/bin/certbot renew --quiet
+```
+
+### 4. Authentication & Authorization
+
+**API Key Authentication (optional):**
+
+```python
+# Add to MCP server
+import os
+import secrets
+
+from fastapi import Security, HTTPException
+from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials
+
+security = HTTPBearer()
+
+async def verify_token(credentials: HTTPAuthorizationCredentials = Security(security)):
+    token = credentials.credentials
+    expected = os.getenv("API_SECRET_KEY", "")
+    # Reject when no secret is configured; otherwise compare in constant time
+    if not expected or not secrets.compare_digest(token.encode(), expected.encode()):
+        raise HTTPException(status_code=401, detail="Invalid token")
+    return token
+```
+
+## Scaling
+
+### 1. Vertical Scaling
+
+**Increase resources:**
+
+```yaml
+# Kubernetes resource limits
+resources:
+  requests:
+    cpu: "2"
+    memory: "4Gi"
+  limits:
+    cpu: "4"
+    memory: "8Gi"
+```
+
+### 2. Horizontal Scaling
+
+**Deploy multiple instances:**
+
+```bash
+# Kubernetes HPA (Horizontal Pod Autoscaler)
+kubectl autoscale deployment skillseekers-mcp \
+  --cpu-percent=70 \
+  --min=2 \
+  --max=10
+```
+
+**Load Balancing:**
+
+```nginx
+# Nginx load balancer
+upstream skillseekers {
+    least_conn;
+    server 10.0.0.1:8765;
+    server 10.0.0.2:8765;
+    server 10.0.0.3:8765;
+}
+
+server {
+    listen 80;
+    location / {
+        proxy_pass http://skillseekers;
+    }
+}
+```
+
+### 3. Database/Storage Scaling
+
+**Distributed caching:**
+
+```python
+# Redis for distributed cache
+import redis
+
+cache = redis.Redis(host='redis.example.com', port=6379, db=0)
+```
+
+**Object storage:**
+- Use S3/GCS/Azure Blob for skill packages
+- Enable CDN for static assets
+- Use read replicas for databases
+
+### 4. Rate Limit Management
+
+**Multiple GitHub tokens:**
+
+```bash
+# Configure multiple profiles
+skill-seekers config --github
+
+# Automatic token rotation on rate limit
+# (handled by rate_limit_handler.py)
+```
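The rotation idea can be illustrated with a small selection function. This is a sketch under assumptions: it is not the logic in `rate_limit_handler.py`, and `pick_token` and the shape of the `tokens` mapping are hypothetical.

```python
import time


def pick_token(tokens, now=None):
    """Pick the token profile with the most remaining requests.

    `tokens` maps a profile name to {"remaining": int, "reset": epoch_seconds}.
    Returns (name, wait_seconds). When every token is exhausted, name is None
    and wait_seconds is the time until the earliest reset.
    """
    if not tokens:
        raise ValueError("no tokens configured")
    now = time.time() if now is None else now
    usable = {name: info for name, info in tokens.items() if info["remaining"] > 0}
    if usable:
        best = max(usable, key=lambda name: usable[name]["remaining"])
        return best, 0.0
    earliest = min(info["reset"] for info in tokens.values())
    return None, max(0.0, earliest - now)
```

Choosing the profile with the largest remaining quota spreads load across tokens; when all are exhausted the caller can sleep for the returned number of seconds.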
+
+## Backup & Disaster Recovery
+
+### 1. Data Backup
+
+**What to backup:**
+- Configuration files (`~/.config/skill-seekers/`)
+- Generated skills (`output/`)
+- Database/cache (if applicable)
+- Logs (for forensics)
+
+**Backup script:**
+
+```bash
+#!/bin/bash
+# /opt/skillseekers/scripts/backup.sh
+set -euo pipefail  # stop on the first failed command or unset variable
+
+BACKUP_DIR="/backups/skillseekers"
+TIMESTAMP=$(date +%Y%m%d_%H%M%S)
+
+mkdir -p "$BACKUP_DIR"
+umask 077  # the archive contains .env secrets, so keep it owner-readable only
+
+# Create backup
+tar -czf "$BACKUP_DIR/backup_$TIMESTAMP.tar.gz" \
+  ~/.config/skill-seekers \
+  /opt/skillseekers/output \
+  /opt/skillseekers/.env
+
+# Retain last 30 days
+find "$BACKUP_DIR" -name "backup_*.tar.gz" -mtime +30 -delete
+
+# Upload to S3 (optional)
+aws s3 cp "$BACKUP_DIR/backup_$TIMESTAMP.tar.gz" \
+  s3://backups/skillseekers/
+
+**Schedule backups:**
+
+```bash
+# Crontab
+0 2 * * * /opt/skillseekers/scripts/backup.sh
+```
+
+### 2. Disaster Recovery Plan
+
+**Recovery steps:**
+
+1. **Provision new infrastructure**
+   ```bash
+   # Deploy from backup
+   terraform apply
+   ```
+
+2. **Restore configuration**
+   ```bash
+   tar -xzf backup_20250207_020000.tar.gz -C /
+   ```
+
+3. **Verify services**
+   ```bash
+   skill-seekers config --test
+   systemctl status skillseekers-mcp
+   ```
+
+4. **Test functionality**
+   ```bash
+   skill-seekers scrape --config configs/test.json --max-pages 10
+   ```
+
+**RTO/RPO targets:**
+- **RTO (Recovery Time Objective):** < 2 hours
+- **RPO (Recovery Point Objective):** < 24 hours
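
Whether the RPO target is being met can be checked from the backup file names alone. A minimal sketch, assuming the `backup_YYYYMMDD_HHMMSS.tar.gz` naming used by the backup script; `backup_time` and `rpo_met` are hypothetical helpers.

```python
from datetime import datetime, timedelta

RPO = timedelta(hours=24)  # target stated above


def backup_time(filename):
    """Parse the timestamp out of a name like backup_20250207_020000.tar.gz."""
    stamp = filename[len("backup_"):-len(".tar.gz")]
    return datetime.strptime(stamp, "%Y%m%d_%H%M%S")


def rpo_met(filenames, now, rpo=RPO):
    """True if the newest backup is no older than the RPO."""
    if not filenames:
        return False
    newest = max(backup_time(name) for name in filenames)
    return now - newest <= rpo
```

Feeding it the names from `os.listdir("/backups/skillseekers")` and `datetime.now()` gives a simple signal that can be wired into alerting.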
+
+## Troubleshooting
+
+### Common Issues
+
+#### 1. High Memory Usage
+
+**Symptoms:**
+- OOM kills
+- Slow performance
+- Swapping
+
+**Solutions:**
+
+```bash
+# Check memory usage
+ps aux --sort=-%mem | head -10
+
+# Reduce batch size
+skill-seekers scrape --config config.json --batch-size 10
+
+# Enable memory limits
+docker run --memory=4g skillseekers:latest
+```
+
+#### 2. GitHub Rate Limits
+
+**Symptoms:**
+- `403 Forbidden` errors
+- "API rate limit exceeded" messages
+
+**Solutions:**
+
+```bash
+# Check rate limit
+curl -H "Authorization: token $GITHUB_TOKEN" \
+  https://api.github.com/rate_limit
+
+# Add more tokens
+skill-seekers config --github
+
+# Use rate limit strategy
+# (automatic with multi-token config)
+```
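The JSON returned by the `curl` call above reports the core REST quota under `resources.core`, with `reset` given in epoch seconds. A small sketch for turning that into a wait time; `core_limit_status` is a hypothetical helper name.

```python
def core_limit_status(payload, now):
    """Summarise the core REST quota from a /rate_limit response body."""
    core = payload["resources"]["core"]
    return {
        "remaining": core["remaining"],
        "limit": core["limit"],
        # Seconds until the quota resets; never negative.
        "seconds_until_reset": max(0, core["reset"] - now),
    }
```

Pass the parsed response body and `int(time.time())` as `now`; a `remaining` value of 0 means requests will keep failing until the reset time passes.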
+
+#### 3. Slow Scraping
+
+**Symptoms:**
+- Long scraping times
+- Timeouts
+
+**Solutions:**
+
+```bash
+# Enable async scraping (2-3x faster)
+skill-seekers scrape --config config.json --async
+
+# Increase concurrency
+# (adjust in config: "concurrency": 10)
+
+# Use caching
+skill-seekers scrape --config config.json --use-cache
+```
+
+#### 4. API Errors
+
+**Symptoms:**
+- `401 Unauthorized`
+- `429 Too Many Requests`
+
+**Solutions:**
+
+```bash
+# Verify API keys
+skill-seekers config --test
+
+# Check API key validity
+# Claude API: https://console.anthropic.com/
+# OpenAI: https://platform.openai.com/api-keys
+# Google: https://console.cloud.google.com/apis/credentials
+
+# Rotate keys if compromised
+```
+
+#### 5. Service Won't Start
+
+**Symptoms:**
+- systemd service fails
+- Container exits immediately
+
+**Solutions:**
+
+```bash
+# Check logs
+journalctl -u skillseekers-mcp -n 100
+
+# Or for Docker
+docker logs skillseekers-mcp
+
+# Common causes:
+# - Missing environment variables
+# - Port already in use
+# - Permission issues
+
+# Verify config
+skill-seekers config --show
+```
+
+### Debug Mode
+
+Enable detailed logging:
+
+```bash
+# Set debug level
+export LOG_LEVEL=DEBUG
+
+# Run with verbose output
+skill-seekers scrape --config config.json --verbose
+```
+
+### Getting Help
+
+**Community Support:**
+- GitHub Issues: https://github.com/yusufkaraaslan/Skill_Seekers/issues
+- Documentation: https://skillseekersweb.com/
+
+**Log Collection:**
+
+```bash
+# Collect diagnostic info
+# (do not include .env: it contains API keys; review the archive before sharing)
+tar -czf skillseekers-debug.tar.gz \
+  /var/log/skillseekers/ \
+  ~/.config/skill-seekers/configs/
+```
+
+## Performance Tuning
+
+### 1. Scraping Performance
+
+**Optimization techniques:**
+
+```jsonc
+# Enable async scraping
+"async_scraping": true,
+"concurrency": 20,  # Adjust based on resources
+
+# Optimize selectors
+"selectors": {
+    "main_content": "article",  # More specific = faster
+    "code_blocks": "pre code"
+}
+
+# Enable caching
+"use_cache": true,
+"cache_ttl": 86400  # 24 hours
+```
+
+### 2. Embedding Performance
+
+**GPU acceleration (if available):**
+
+```bash
+# sentence-transformers uses the GPU when a CUDA-enabled PyTorch build is installed
+pip install sentence-transformers
+
+# Select which GPU to use
+export CUDA_VISIBLE_DEVICES=0
+```
+
+**Batch processing:**
+
+```python
+# Generate embeddings in batches
+generator.generate_batch(texts, batch_size=32)
+```
+
+### 3. Storage Performance
+
+**Use SSD for:**
+- SQLite databases
+- Cache directories
+- Log files
+
+**Use object storage for:**
+- Skill packages
+- Backup archives
+- Large datasets
+
+## Next Steps
+
+1. **Review** the deployment options and choose the one that fits your infrastructure
+2. **Configure** monitoring and alerting
+3. **Set up** backups and disaster recovery
+4. **Test** failover procedures
+5. **Document** your specific deployment
+6. **Train** your team on operations
+
+---
+
+**Need help?** See [TROUBLESHOOTING.md](./TROUBLESHOOTING.md) or open an issue on GitHub.