# Chroma Integration with Skill Seekers
**Status:** ✅ Production Ready
**Difficulty:** Beginner
**Last Updated:** February 7, 2026
---
## ❌ The Problem
Building RAG applications with Chroma involves several challenges:
1. **Embedding Model Setup** - Choosing and configuring embedding models (local vs. API) by hand
2. **Collection Management** - Writing boilerplate to create and manage collections with metadata
3. **Local-First Complexity** - Setting up persistent storage and managing file paths yourself
**Example Pain Point:**
```python
# Manual embedding + collection setup for each framework
import chromadb
from chromadb.utils import embedding_functions
# Choose embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

# Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection(
    name="react_docs",
    embedding_function=openai_ef,
    metadata={"description": "React documentation"}
)
# Manually parse and add documents...
```
---
## ✅ The Solution
Skill Seekers automates Chroma integration with structured, production-ready data:
**Benefits:**
- ✅ Auto-formatted documents with embeddings included
- ✅ Consistent collection structure across all frameworks
- ✅ Works with local models (Sentence Transformers) or API embeddings (OpenAI, Cohere)
- ✅ Persistent storage with automatic path management
- ✅ Metadata-rich for precise filtering
**Result:** 5-minute setup, production-ready local vector search with zero external dependencies.
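Each entry in the packaged `*-langchain.json` file is a LangChain-style document: a `page_content` string plus a `metadata` dict. The exact metadata keys depend on what was scraped; `category`, `source`, `file`, and `type` are the keys the examples in this guide rely on, so a representative entry looks roughly like this:
```json
[
  {
    "page_content": "useState is a Hook that lets you add state to function components...",
    "metadata": {
      "category": "hooks",
      "source": "https://react.dev/reference/react/useState",
      "file": "hooks/useState.md",
      "type": "reference"
    }
  }
]
```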
---
## ⚡ Quick Start (5 Minutes)
### Prerequisites
```bash
# Install Chroma
pip install "chromadb>=0.4.22"  # Quote the spec so the shell doesn't treat >= as redirection
# For local embeddings (optional, free)
pip install sentence-transformers
# For OpenAI embeddings (optional)
pip install openai
# Or with Skill Seekers
pip install skill-seekers[all-llms]
```
**What you need:**
- Python 3.10+
- No external services required (fully local!)
- Optional: OpenAI API key for better embeddings
### Generate Chroma-Ready Documents
```bash
# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json
# Step 2: Package for Chroma (creates LangChain format)
skill-seekers package output/react --target langchain
# Output: output/react-langchain.json (Chroma-compatible)
```
### Upload to Chroma (Local)
```python
import chromadb
import json
# Create persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection with local embeddings (free!)
collection = client.get_or_create_collection(
    name="react_docs",
    metadata={"description": "React documentation from Skill Seekers"}
)

# Load documents
with open("output/react-langchain.json") as f:
    documents = json.load(f)

# Add to collection (Chroma generates embeddings automatically)
collection.add(
    documents=[doc["page_content"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"✅ Added {len(documents)} documents to Chroma")
print(f"Total in collection: {collection.count()}")
```
### Query with Filters
```python
# Semantic search with metadata filter
results = collection.query(
    query_texts=["How do I use React hooks?"],
    n_results=3,
    where={"category": "hooks"}  # Filter by category
)

for i, (doc, metadata) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n{i+1}. Category: {metadata['category']}")
    print(f"   Source: {metadata['source']}")
    print(f"   Content: {doc[:200]}...")
```
**That's it!** Chroma is now running locally with your documentation.
---
## 📖 Detailed Setup Guide
### Step 1: Choose Storage Mode
**Option A: Persistent (Recommended for Production)**
```python
import chromadb
# Data persists to disk
client = chromadb.PersistentClient(
    path="./chroma_db"  # Specify database directory
)
# Database files saved to ./chroma_db/
# Survives script restarts
```
**Option B: In-Memory (Fast, for Development)**
```python
# Data lost when script ends
client = chromadb.Client()
# Fast, but temporary
# Perfect for experimentation
```
**Option C: HTTP Client (Remote Chroma Server)**
```bash
# Start Chroma server
chroma run --path ./chroma_db --port 8000
```
```python
# Connect to remote server
client = chromadb.HttpClient(host="localhost", port=8000)
# Great for microservices architecture
```
**Option D: Docker (Production)**
```yaml
# docker-compose.yml
version: '3'
services:
  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - ./chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=False
```
```bash
# Start Chroma
docker-compose up -d
```
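Once the container is up, a quick connectivity check from Python confirms the server is reachable (the client's `heartbeat()` call returns a timestamp when the server is healthy):
```python
import chromadb

client = chromadb.HttpClient(host="localhost", port=8000)
print(client.heartbeat())  # Nanosecond timestamp; raises if the server is unreachable
```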
### Step 2: Generate Skill Seekers Documents
**Option A: Documentation Website**
```bash
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain
```
**Option B: GitHub Repository**
```bash
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain
```
**Option C: Local Codebase**
```bash
skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain
```
**Option D: RAG-Optimized Chunking**
```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```
### Step 3: Choose Embedding Function
**Option A: Default (Sentence Transformers - Free)**
```python
# Chroma uses all-MiniLM-L6-v2 by default
collection = client.get_or_create_collection(name="docs")
# Automatically downloads model on first use (~90MB)
# Dimensions: 384
# Speed: ~500 docs/sec on CPU
# Quality: Good for most use cases
```
**Option B: OpenAI (Best Quality)**
```python
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef
)
# Cost: ~$0.0001 per 1K tokens
# Dimensions: 1536
# Quality: Excellent
```
**Option C: Local Sentence Transformers (Customizable)**
```python
from chromadb.utils import embedding_functions
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"  # Better quality than default
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=sentence_transformer_ef
)
# Free, local, customizable
# Dimensions: 768 (all-mpnet-base-v2)
# Quality: Better than default
```
**Option D: Cohere**
```python
from chromadb.utils import embedding_functions

cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="your-cohere-key",
    model_name="embed-english-v3.0"
)
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=cohere_ef
)
```
### Step 4: Add Documents with Metadata
```python
import json
# Load Skill Seekers documents
with open("output/django-langchain.json") as f:
documents = json.load(f)
# Prepare for Chroma
docs_content = []
docs_metadata = []
docs_ids = []
for i, doc in enumerate(documents):
docs_content.append(doc["page_content"])
docs_metadata.append(doc["metadata"])
docs_ids.append(f"doc_{i}")
# Add to collection (batch operation)
collection.add(
documents=docs_content,
metadatas=docs_metadata,
ids=docs_ids
)
print(f"✅ Added {len(documents)} documents")
print(f"Collection size: {collection.count()}")
```
### Step 5: Query with Advanced Filters
```python
# Simple query
results = collection.query(
    query_texts=["How do I create models?"],
    n_results=5
)

# With metadata filter
results = collection.query(
    query_texts=["Django authentication"],
    n_results=3,
    where={"category": "authentication"}
)

# Multiple filters (AND logic)
results = collection.query(
    query_texts=["user registration"],
    n_results=3,
    where={
        "$and": [
            {"category": "authentication"},
            {"type": "tutorial"}
        ]
    }
)

# Filter with OR
results = collection.query(
    query_texts=["components"],
    n_results=5,
    where={
        "$or": [
            {"category": "components"},
            {"category": "hooks"}
        ]
    }
)

# Filter with IN
results = collection.query(
    query_texts=["data handling"],
    n_results=5,
    where={"category": {"$in": ["models", "views", "serializers"]}}
)

# Extract results
for doc, metadata, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f}")
    print(f"Category: {metadata['category']}")
    print(f"Content: {doc[:200]}...")
    print()
```
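Chroma returns query results as parallel lists keyed by field (`ids`, `documents`, `metadatas`, `distances`) rather than one object per hit. If the repeated `zip(...)` gets awkward downstream, a small helper can fold them into per-hit records. This is a convenience sketch, not part of the Chroma API:
```python
def to_records(results):
    """Zip Chroma's parallel result lists into one dict per hit."""
    return [
        {"id": id_, "document": doc, "metadata": meta, "distance": dist}
        for id_, doc, meta, dist in zip(
            results["ids"][0],
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
    ]

hits = to_records(results)
print(hits[0]["metadata"]["category"], f"{hits[0]['distance']:.3f}")
```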
---
## 🚀 Advanced Usage
### 1. Multiple Collections for Different Frameworks
```python
# Create separate collections
frameworks = ["react", "vue", "angular", "svelte"]
for framework in frameworks:
    collection = client.get_or_create_collection(
        name=f"{framework}_docs",
        metadata={
            "framework": framework,
            "version": "latest",
            "last_updated": "2026-02-07"
        }
    )
    # Load framework-specific documents
    with open(f"output/{framework}-langchain.json") as f:
        docs = json.load(f)
    collection.add(
        documents=[d["page_content"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

# Query specific framework
react_collection = client.get_collection(name="react_docs")
results = react_collection.query(
    query_texts=["useState hook"],
    n_results=3
)
```
### 2. Update Documents Efficiently
```python
# Update existing document (same ID)
collection.update(
    ids=["doc_42"],
    documents=["Updated content for React hooks..."],
    metadatas=[{"category": "hooks", "updated": "2026-02-07"}]
)

# Upsert (update or insert)
collection.upsert(
    ids=["doc_42"],
    documents=["New or updated content..."],
    metadatas=[{"category": "hooks"}]
)
# Delete specific documents
collection.delete(ids=["doc_42", "doc_99"])
# Delete by filter
collection.delete(where={"category": "deprecated"})
```
### 3. Pre-Compute Embeddings for Faster Ingestion
```python
import openai

# Generate embeddings separately
openai_client = openai.OpenAI()
embeddings = []
for doc in documents:
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc["page_content"]
    )
    embeddings.append(response.data[0].embedding)

# Add with pre-computed embeddings (faster)
collection.add(
    documents=[d["page_content"] for d in documents],
    embeddings=embeddings,  # Skip embedding generation
    metadatas=[d["metadata"] for d in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
```
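The loop above makes one API call per document. The OpenAI embeddings endpoint also accepts a list of inputs, so batching requests (sketched below with an assumed batch size of 100; check your account's per-request limits) cuts round trips considerably:
```python
# Batched variant: embed many documents per request
batch_size = 100
embeddings = []
for i in range(0, len(documents), batch_size):
    texts = [d["page_content"] for d in documents[i:i + batch_size]]
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=texts  # The endpoint accepts a list of strings
    )
    # Results come back in input order
    embeddings.extend(item.embedding for item in response.data)
```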
### 4. Hybrid Search (Vector + Keyword)
```python
# Get all documents matching keyword filter
results = collection.query(
    query_texts=["state management"],
    n_results=100,  # Get many candidates
    where_document={"$contains": "useState"}  # Keyword filter
)
# Chroma first restricts to documents containing "useState",
# then ranks those candidates by semantic similarity to "state management"
```
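Metadata (`where`) and full-text (`where_document`) filters can also be combined in a single query. A sketch assuming the same `category` metadata used in earlier examples:
```python
results = collection.query(
    query_texts=["state management"],
    n_results=5,
    where={"category": "hooks"},               # Metadata filter
    where_document={"$contains": "useState"}   # Keyword filter
)
```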
### 5. Collection Management
```python
# List all collections
collections = client.list_collections()
for collection in collections:
    print(f"{collection.name}: {collection.count()} documents")
    print(f"  Metadata: {collection.metadata}")
# Get collection info
collection = client.get_collection(name="react_docs")
print(f"Count: {collection.count()}")
print(f"Metadata: {collection.metadata}")
# Delete collection
client.delete_collection(name="old_docs")
# Rename collection (create new, copy data, delete old)
old = client.get_collection(name="react_docs")
new = client.create_collection(name="react_docs_v2")
# Copy all documents
old_data = old.get(include=["documents", "metadatas", "embeddings"])  # embeddings aren't returned by default
new.add(
    ids=old_data["ids"],
    documents=old_data["documents"],
    metadatas=old_data["metadatas"],
    embeddings=old_data["embeddings"]
)
client.delete_collection(name="react_docs")
```
---
## 📋 Best Practices
### 1. Use Persistent Storage for Production
```python
# ✅ Good: Data persists
client = chromadb.PersistentClient(path="./chroma_db")
# ❌ Bad: Data lost on restart
client = chromadb.Client()
# Store DB in appropriate location
import os
db_path = os.path.expanduser("~/.local/share/my_app/chroma_db")
client = chromadb.PersistentClient(path=db_path)
```
### 2. Batch Operations for Large Datasets
```python
# ✅ Good: Batch add (fast)
batch_size = 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(
        documents=[d["page_content"] for d in batch],
        metadatas=[d["metadata"] for d in batch],
        ids=[f"doc_{i+j}" for j in range(len(batch))]
    )
    print(f"Added {i + len(batch)}/{len(documents)}...")

# ❌ Bad: One at a time (slow)
for i, doc in enumerate(documents):
    collection.add(
        documents=[doc["page_content"]],
        metadatas=[doc["metadata"]],
        ids=[f"doc_{i}"]
    )
```
### 3. Choose Embedding Model Wisely
```python
# For speed (local development):
# - Default Chroma (all-MiniLM-L6-v2): 384 dims, fast
collection = client.get_or_create_collection(name="docs")
# For quality (production):
# - OpenAI ada-002: 1536 dims, best quality
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.get_or_create_collection(name="docs", embedding_function=openai_ef)
# For balance (offline production):
# - all-mpnet-base-v2: 768 dims, good quality, free
mpnet_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection = client.get_or_create_collection(name="docs", embedding_function=mpnet_ef)
```
### 4. Use Metadata Filters to Reduce Search Space
```python
# ✅ Good: Filter then search (fast)
results = collection.query(
    query_texts=["authentication"],
    n_results=3,
    where={"category": "auth"}  # Only search auth docs
)

# ❌ Slow: Search everything, filter later
results = collection.query(
    query_texts=["authentication"],
    n_results=100
)
filtered = [
    (doc, meta)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    if meta["category"] == "auth"
]
```
### 5. Handle Updates with Upsert
```python
# ✅ Good: Upsert (idempotent)
collection.upsert(
    ids=["doc_42"],
    documents=["Updated content..."],
    metadatas=[{"updated": "2026-02-07"}]
)

# ❌ Bad: Delete then add (race conditions)
try:
    collection.delete(ids=["doc_42"])
except Exception:
    pass
collection.add(ids=["doc_42"], ...)
```
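Upserts are only idempotent when IDs stay stable between runs, and the positional `doc_{i}` IDs used throughout this guide change whenever document order changes. One option, an illustrative convention rather than anything Skill Seekers emits, is to derive each ID from the document's content:
```python
import hashlib

def stable_id(doc):
    """Deterministic ID from source path + content (hypothetical convention)."""
    key = f"{doc['metadata'].get('source', '')}:{doc['page_content']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]

# Re-running ingestion now updates documents in place instead of duplicating them
collection.upsert(
    documents=[d["page_content"] for d in documents],
    metadatas=[d["metadata"] for d in documents],
    ids=[stable_id(d) for d in documents],
)
```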
---
## 🔥 Real-World Example: Local RAG Chatbot
```python
import chromadb
import json
from openai import OpenAI
class LocalRAGChatbot:
    def __init__(self, db_path: str = "./chroma_db"):
        """Initialize chatbot with local Chroma database."""
        self.client = chromadb.PersistentClient(path=db_path)
        self.openai = OpenAI()  # For chat completion only
        self.collection = None

    def ingest_framework(self, framework: str, docs_path: str):
        """Ingest documentation for a framework."""
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=f"{framework}_docs",
            metadata={"framework": framework}
        )
        # Load documents
        with open(docs_path) as f:
            documents = json.load(f)
        # Batch add (Chroma generates embeddings locally)
        batch_size = 1000
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            self.collection.add(
                documents=[d["page_content"] for d in batch],
                metadatas=[d["metadata"] for d in batch],
                ids=[f"doc_{i+j}" for j in range(len(batch))]
            )
            if (i + batch_size) < len(documents):
                print(f"Ingested {i + batch_size}/{len(documents)}...")
        print(f"✅ Ingested {len(documents)} documents for {framework}")
        print(f"Collection size: {self.collection.count()}")

    def chat(self, question: str, category: str = None):
        """Answer question using RAG."""
        if not self.collection:
            raise ValueError("No framework ingested. Call ingest_framework() first.")
        # Retrieve relevant documents
        where_filter = {"category": category} if category else None
        results = self.collection.query(
            query_texts=[question],
            n_results=5,
            where=where_filter
        )
        # Build context from results
        context_parts = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_parts.append(f"[{metadata['category']}] {doc}")
        context = "\n\n".join(context_parts)
        # Generate answer using GPT-4
        completion = self.openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. Answer based on the provided documentation context."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        return {
            "answer": completion.choices[0].message.content,
            "sources": [
                {
                    "category": m["category"],
                    "source": m["source"],
                    "file": m["file"]
                }
                for m in results["metadatas"][0]
            ],
            "context_used": len(context)
        }

    def list_frameworks(self):
        """List all ingested frameworks."""
        collections = self.client.list_collections()
        return [
            {
                "name": c.name,
                "count": c.count(),
                "metadata": c.metadata
            }
            for c in collections
        ]


# Usage
chatbot = LocalRAGChatbot(db_path="./my_docs_db")

# Ingest multiple frameworks
chatbot.ingest_framework("react", "output/react-langchain.json")
chatbot.ingest_framework("django", "output/django-langchain.json")

# Interactive chat
frameworks = chatbot.list_frameworks()
print(f"Available frameworks: {[f['name'] for f in frameworks]}")

# Select framework
chatbot.collection = chatbot.client.get_collection("react_docs")

# Ask questions
questions = [
    "How do I use useState?",
    "What is useEffect for?",
    "How do I handle form input?"
]
for question in questions:
    print(f"\nQ: {question}")
    result = chatbot.chat(question, category="hooks")
    print(f"A: {result['answer']}")
    print(f"Sources: {[s['file'] for s in result['sources'][:2]]}")
    print(f"Context size: {result['context_used']} chars")
```
**Output:**
```
✅ Ingested 1247 documents for react
Collection size: 1247
✅ Ingested 892 documents for django
Collection size: 892
Available frameworks: ['react_docs', 'django_docs']
Q: How do I use useState?
A: useState is a React Hook that lets you add state to functional components.
Call it at the top level: const [count, setCount] = useState(0)
Sources: ['hooks/useState.md', 'hooks/overview.md']
Context size: 2340 chars
Q: What is useEffect for?
A: useEffect performs side effects in functional components, like fetching data,
subscriptions, or DOM manipulation. It runs after render.
Sources: ['hooks/useEffect.md', 'hooks/rules.md']
Context size: 2156 chars
```
---
## 🐛 Troubleshooting
### Issue: Model Download Stuck
**Problem:** "Downloading model..." hangs indefinitely
**Solutions:**
1. **Check internet connection:**
```bash
curl -I https://huggingface.co
```
2. **Manually download model:**
```python
from sentence_transformers import SentenceTransformer
# Force download
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model downloaded!")
```
3. **Use pre-downloaded model:**
```python
from chromadb.utils import embedding_functions

ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="/path/to/local/model"
)
```
### Issue: Dimension Mismatch
**Problem:** "Dimensionality mismatch: expected 384, got 1536"
**Solution:** Collections remember the embedding function they were created with, so to switch models you must recreate the collection:
```python
# Delete and recreate with correct embedding function
client.delete_collection(name="docs")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.create_collection(
    name="docs",
    embedding_function=openai_ef  # 1536 dims
)
```
### Issue: Slow Queries
**Problem:** Queries take >1 second on 10K documents
**Solutions:**
1. **Use smaller n_results:**
```python
# ✅ Fast: Get only what you need
results = collection.query(query_texts=["..."], n_results=5)
# ❌ Slow: Large result sets
results = collection.query(query_texts=["..."], n_results=100)
```
2. **Filter with metadata:**
```python
# ✅ Fast: Reduce search space
results = collection.query(
    query_texts=["..."],
    n_results=5,
    where={"category": "specific"}  # Only search subset
)
```
3. **Use HttpClient for parallelism:**
```bash
# Start Chroma server
chroma run --path ./chroma_db
```
```python
# Connect multiple clients
client = chromadb.HttpClient(host="localhost", port=8000)
```
### Issue: Database Locked
**Problem:** "Database is locked" error
**Solutions:**
1. **Check for other processes:**
```bash
lsof ./chroma_db/chroma.sqlite3
# Kill any hung processes
```
2. **Use HttpClient instead:**
```bash
chroma run --path ./chroma_db --port 8000
```
```python
client = chromadb.HttpClient(host="localhost", port=8000)
```
3. **Enable WAL mode (Write-Ahead Logging):**
```python
import sqlite3
conn = sqlite3.connect("./chroma_db/chroma.sqlite3")
conn.execute("PRAGMA journal_mode=WAL")
conn.close()
```
### Issue: Collection Not Found
**Problem:** "Collection 'docs' does not exist"
**Solutions:**
1. **List existing collections:**
```python
collections = client.list_collections()
print([c.name for c in collections])
```
2. **Use get_or_create:**
```python
# ✅ Safe: Creates if missing
collection = client.get_or_create_collection(name="docs")
# ❌ Fails if missing
collection = client.get_collection(name="docs")
```
### Issue: Out of Memory
**Problem:** Python crashes when adding a large dataset
**Solutions:**
1. **Batch with smaller size:**
```python
batch_size = 500 # Reduce from 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(...)
```
2. **Use HttpClient + server:**
```bash
# Server handles memory better
chroma run --path ./chroma_db
```
3. **Pre-compute embeddings externally:**
```python
# Generate embeddings in a separate script,
# then add them via the embeddings parameter
collection.add(
    documents=[...],
    embeddings=precomputed_embeddings,
    ...
)
```
---
## 📊 Before vs. After
| Aspect | Without Skill Seekers | With Skill Seekers |
|--------|----------------------|-------------------|
| **Data Preparation** | Custom scraping + parsing logic | One command: `skill-seekers scrape` |
| **Embedding Setup** | Manual model selection and config | Auto-configured with sensible defaults |
| **Metadata** | Manual extraction from docs | Auto-extracted (category, source, file, type) |
| **Storage** | Complex path management | Simple: `PersistentClient(path="...")` |
| **Local-First** | Requires external services | Fully local with Sentence Transformers |
| **Setup Time** | 2-4 hours | 5 minutes |
| **Code Required** | 300+ lines scraping logic | 20 lines upload script |
| **External Deps** | OpenAI API required | Optional (works offline!) |
---
## 🎯 Next Steps
### Enhance Your Chroma Integration
1. **Try Different Embedding Models:**
```python
# Better quality (still local)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
```
2. **Implement Semantic Chunking:**
```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
```
3. **Set Up Multi-Collection Search:**
```python
# Search across multiple frameworks
for name in ["react_docs", "vue_docs", "angular_docs"]:
    collection = client.get_collection(name)
    results = collection.query(...)
```
4. **Deploy with Docker:**
```bash
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma ghcr.io/chroma-core/chroma:latest
```
### Related Guides
- **[LangChain Integration](LANGCHAIN.md)** - Use Chroma as vector store in LangChain
- **[LlamaIndex Integration](LLAMA_INDEX.md)** - Use Chroma with LlamaIndex
- **[RAG Pipelines Guide](RAG_PIPELINES.md)** - Build complete RAG systems
- **[INTEGRATIONS.md](INTEGRATIONS.md)** - See all integration options
### Resources
- **Chroma Docs:** https://docs.trychroma.com/
- **Python Client:** https://docs.trychroma.com/reference/py-client
- **Support:** https://github.com/yusufkaraaslan/Skill_Seekers/discussions
---
**Questions?** Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues
**Website:** https://skillseekersweb.com/
**Last Updated:** February 7, 2026