Chroma Integration with Skill Seekers
Status: ✅ Production Ready | Difficulty: Beginner | Last Updated: February 7, 2026
❌ The Problem
Building RAG applications with Chroma involves several challenges:
- Embedding Model Setup - Need to choose and configure embedding models (local vs API) manually
- Collection Management - Creating and managing collections with metadata requires boilerplate code
- Local-First Complexity - Setting up persistent storage and dealing with file paths
Example Pain Point:
# Manual embedding + collection setup for each framework
import chromadb
from chromadb.utils import embedding_functions
# Choose embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-ada-002"
)
# Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection(
name="react_docs",
embedding_function=openai_ef,
metadata={"description": "React documentation"}
)
# Manually parse and add documents...
✅ The Solution
Skill Seekers automates Chroma integration with structured, production-ready data:
Benefits:
- ✅ Auto-formatted documents with embeddings included
- ✅ Consistent collection structure across all frameworks
- ✅ Works with local models (Sentence Transformers) or API embeddings (OpenAI, Cohere)
- ✅ Persistent storage with automatic path management
- ✅ Metadata-rich for precise filtering
Result: a 5-minute setup and production-ready local vector search that needs no external services.
⚡ Quick Start (5 Minutes)
Prerequisites
# Install Chroma
pip install "chromadb>=0.4.22"
# For local embeddings (optional, free)
pip install sentence-transformers
# For OpenAI embeddings (optional)
pip install openai
# Or with Skill Seekers
pip install "skill-seekers[all-llms]"
What you need:
- Python 3.10+
- No external services required (fully local!)
- Optional: OpenAI API key for better embeddings
Generate Chroma-Ready Documents
# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json
# Step 2: Package for Chroma (creates LangChain format)
skill-seekers package output/react --target langchain
# Output: output/react-langchain.json (Chroma-compatible)
Upload to Chroma (Local)
import chromadb
import json
# Create persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")
# Create collection with local embeddings (free!)
collection = client.get_or_create_collection(
name="react_docs",
metadata={"description": "React documentation from Skill Seekers"}
)
# Load documents
with open("output/react-langchain.json") as f:
    documents = json.load(f)
# Add to collection (Chroma generates embeddings automatically)
collection.add(
documents=[doc["page_content"] for doc in documents],
metadatas=[doc["metadata"] for doc in documents],
ids=[f"doc_{i}" for i in range(len(documents))]
)
print(f"✅ Added {len(documents)} documents to Chroma")
print(f"Total in collection: {collection.count()}")
Query with Filters
# Semantic search with metadata filter
results = collection.query(
query_texts=["How do I use React hooks?"],
n_results=3,
where={"category": "hooks"} # Filter by category
)
for i, (doc, metadata) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n{i+1}. Category: {metadata['category']}")
    print(f" Source: {metadata['source']}")
    print(f" Content: {doc[:200]}...")
That's it! Chroma is now running locally with your documentation.
📖 Detailed Setup Guide
Step 1: Choose Storage Mode
Option A: Persistent (Recommended for Production)
import chromadb
# Data persists to disk
client = chromadb.PersistentClient(
path="./chroma_db" # Specify database directory
)
# Database files saved to ./chroma_db/
# Survives script restarts
Option B: In-Memory (Fast, for Development)
# Data lost when script ends
client = chromadb.Client()
# Fast, but temporary
# Perfect for experimentation
Option C: HTTP Client (Remote Chroma Server)
# Start Chroma server
chroma run --path ./chroma_db --port 8000
# Connect to remote server
client = chromadb.HttpClient(host="localhost", port=8000)
# Great for microservices architecture
Option D: Docker (Production)
# docker-compose.yml
version: '3'
services:
  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - ./chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=False
# Start Chroma
docker-compose up -d
Step 2: Generate Skill Seekers Documents
Option A: Documentation Website
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain
Option B: GitHub Repository
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain
Option C: Local Codebase
skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain
Option D: RAG-Optimized Chunking
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
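To make the `--chunk-tokens` value more concrete, here is a minimal sketch of fixed-size chunking with optional overlap. It is not Skill Seekers' implementation: it assumes whitespace-separated words stand in for tokens, and `chunk_words` with its parameters is a hypothetical helper used only for illustration.

```python
def chunk_words(text: str, chunk_tokens: int = 512, overlap_tokens: int = 0) -> list[str]:
    """Split text into windows of at most `chunk_tokens` whitespace tokens.

    Consecutive windows share `overlap_tokens` tokens. Whitespace splitting is
    an assumption made for illustration; a real tokenizer counts differently.
    """
    if chunk_tokens <= 0:
        raise ValueError("chunk_tokens must be positive")
    if not 0 <= overlap_tokens < chunk_tokens:
        raise ValueError("overlap_tokens must be in [0, chunk_tokens)")

    tokens = text.split()
    step = chunk_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_tokens]))
        if start + chunk_tokens >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Smaller chunks tend to give more precise retrieval hits, while overlap reduces the chance that a sentence is cut off at a chunk boundary.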
Step 3: Choose Embedding Function
Option A: Default (Sentence Transformers - Free)
# Chroma uses all-MiniLM-L6-v2 by default
collection = client.get_or_create_collection(name="docs")
# Automatically downloads model on first use (~90MB)
# Dimensions: 384
# Speed: ~500 docs/sec on CPU
# Quality: Good for most use cases
Option B: OpenAI (Best Quality)
from chromadb.utils import embedding_functions
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
api_key="sk-...",
model_name="text-embedding-ada-002"
)
collection = client.get_or_create_collection(
name="docs",
embedding_function=openai_ef
)
# Cost: ~$0.0001 per 1K tokens
# Dimensions: 1536
# Quality: Excellent
Option C: Local Sentence Transformers (Customizable)
from chromadb.utils import embedding_functions
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-mpnet-base-v2" # Better quality than default
)
collection = client.get_or_create_collection(
name="docs",
embedding_function=sentence_transformer_ef
)
# Free, local, customizable
# Dimensions: 768 (all-mpnet-base-v2)
# Quality: Better than default
Option D: Cohere
from chromadb.utils import embedding_functions
cohere_ef = embedding_functions.CohereEmbeddingFunction(
api_key="your-cohere-key",
model_name="embed-english-v3.0"
)
collection = client.get_or_create_collection(
name="docs",
embedding_function=cohere_ef
)
Step 4: Add Documents with Metadata
import json
# Load Skill Seekers documents
with open("output/django-langchain.json") as f:
    documents = json.load(f)
# Prepare for Chroma
docs_content = []
docs_metadata = []
docs_ids = []
for i, doc in enumerate(documents):
    docs_content.append(doc["page_content"])
    docs_metadata.append(doc["metadata"])
    docs_ids.append(f"doc_{i}")
# Add to collection (batch operation)
collection.add(
documents=docs_content,
metadatas=docs_metadata,
ids=docs_ids
)
print(f"✅ Added {len(documents)} documents")
print(f"Collection size: {collection.count()}")
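Chroma metadata values are expected to be scalars (strings, numbers, booleans), and at least the 0.4 release line rejects lists, nested dicts, and `None`. If your packaged metadata might contain such values, a small cleaning step before `collection.add` avoids validation errors. The sketch below is one possible approach; `sanitize_metadata` is a hypothetical helper, not part of Chroma or Skill Seekers.

```python
import json

SCALAR_TYPES = (str, int, float, bool)


def sanitize_metadata(metadata: dict) -> dict:
    """Return a copy of `metadata` that contains only scalar values.

    None values are dropped; lists, dicts, and other objects are stored
    as JSON strings so the information is not lost.
    """
    clean = {}
    for key, value in metadata.items():
        if value is None:
            continue
        if isinstance(value, SCALAR_TYPES):
            clean[key] = value
        else:
            clean[key] = json.dumps(value, default=str)
    return clean
```

It can be applied while preparing the batch, for example `metadatas=[sanitize_metadata(doc["metadata"]) for doc in documents]`. Note that a value stored as a JSON string can no longer be matched with `$in` on its individual elements.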
Step 5: Query with Advanced Filters
# Simple query
results = collection.query(
query_texts=["How do I create models?"],
n_results=5
)
# With metadata filter
results = collection.query(
query_texts=["Django authentication"],
n_results=3,
where={"category": "authentication"}
)
# Multiple filters (AND logic)
results = collection.query(
query_texts=["user registration"],
n_results=3,
where={
"$and": [
{"category": "authentication"},
{"type": "tutorial"}
]
}
)
# Filter with OR
results = collection.query(
query_texts=["components"],
n_results=5,
where={
"$or": [
{"category": "components"},
{"category": "hooks"}
]
}
)
# Filter with IN
results = collection.query(
query_texts=["data handling"],
n_results=5,
where={"category": {"$in": ["models", "views", "serializers"]}}
)
# Extract results
for doc, metadata, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f}")
    print(f"Category: {metadata['category']}")
    print(f"Content: {doc[:200]}...")
    print()
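Query results come back column-oriented: each key holds one list per query text, which is why the examples above index with `[0]`. If you prefer one record per hit, a small adapter like the following sketch can help. `flatten_results` is a hypothetical helper name, and the sketch assumes the usual `ids`, `documents`, `metadatas`, and `distances` keys.

```python
def flatten_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn a column-oriented query result into one dict per hit."""
    columns = (
        ("documents", "document"),
        ("metadatas", "metadata"),
        ("distances", "distance"),
    )
    hits = []
    for position, doc_id in enumerate(results["ids"][query_index]):
        hit = {"id": doc_id}
        for column, field in columns:
            values = results.get(column)
            # A column is None when it was not requested via `include`.
            hit[field] = values[query_index][position] if values is not None else None
        hits.append(hit)
    return hits
```

With this in place, a loop such as `for hit in flatten_results(results): print(hit["distance"], hit["metadata"])` replaces the `zip` over three parallel lists.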
🚀 Advanced Usage
1. Multiple Collections for Different Frameworks
# Create separate collections
frameworks = ["react", "vue", "angular", "svelte"]
for framework in frameworks:
    collection = client.get_or_create_collection(
        name=f"{framework}_docs",
        metadata={
            "framework": framework,
            "version": "latest",
            "last_updated": "2026-02-07"
        }
    )
    # Load framework-specific documents
    with open(f"output/{framework}-langchain.json") as f:
        docs = json.load(f)
    collection.add(
        documents=[d["page_content"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )
# Query specific framework
react_collection = client.get_collection(name="react_docs")
results = react_collection.query(
query_texts=["useState hook"],
n_results=3
)
2. Update Documents Efficiently
# Update existing document (same ID)
collection.update(
ids=["doc_42"],
documents=["Updated content for React hooks..."],
metadatas=[{"category": "hooks", "updated": "2026-02-07"}]
)
# Upsert (update or insert)
collection.upsert(
ids=["doc_42"],
documents=["New or updated content..."],
metadatas=[{"category": "hooks"}]
)
# Delete specific documents
collection.delete(ids=["doc_42", "doc_99"])
# Delete by filter
collection.delete(where={"category": "deprecated"})
3. Pre-Compute Embeddings for Faster Ingestion
import openai
# Generate embeddings separately
openai_client = openai.OpenAI()
embeddings = []
for doc in documents:
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc["page_content"]
    )
    embeddings.append(response.data[0].embedding)
# Add with pre-computed embeddings (faster)
collection.add(
documents=[d["page_content"] for d in documents],
embeddings=embeddings, # Skip embedding generation
metadatas=[d["metadata"] for d in documents],
ids=[f"doc_{i}" for i in range(len(documents))]
)
4. Hybrid Search (Vector + Keyword)
# where_document keeps only documents whose text contains the keyword
results = collection.query(
    query_texts=["state management"],
    n_results=100, # Get many candidates
    where_document={"$contains": "useState"} # Keyword filter
)
# The matching documents are ranked by vector similarity to the query text
# Results contain "useState" AND are semantically similar to "state management"
5. Collection Management
# List all collections
collections = client.list_collections()
for collection in collections:
    print(f"{collection.name}: {collection.count()} documents")
    print(f" Metadata: {collection.metadata}")
# Get collection info
collection = client.get_collection(name="react_docs")
print(f"Count: {collection.count()}")
print(f"Metadata: {collection.metadata}")
# Delete collection
client.delete_collection(name="old_docs")
# Rename collection (create new, copy data, delete old)
old = client.get_collection(name="react_docs")
new = client.create_collection(name="react_docs_v2")
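# Copy all documents (get() returns embeddings only when they are requested)
old_data = old.get(include=["documents", "metadatas", "embeddings"])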
new.add(
ids=old_data["ids"],
documents=old_data["documents"],
metadatas=old_data["metadatas"],
embeddings=old_data["embeddings"]
)
client.delete_collection(name="react_docs")
📋 Best Practices
1. Use Persistent Storage for Production
# ✅ Good: Data persists
client = chromadb.PersistentClient(path="./chroma_db")
# ❌ Bad: Data lost on restart
client = chromadb.Client()
# Store DB in appropriate location
import os
db_path = os.path.expanduser("~/.local/share/my_app/chroma_db")
client = chromadb.PersistentClient(path=db_path)
2. Batch Operations for Large Datasets
# ✅ Good: Batch add (fast)
batch_size = 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(
        documents=[d["page_content"] for d in batch],
        metadatas=[d["metadata"] for d in batch],
        ids=[f"doc_{i+j}" for j in range(len(batch))]
    )
    print(f"Added {i + len(batch)}/{len(documents)}...")
# ❌ Bad: One at a time (slow)
for i, doc in enumerate(documents):
    collection.add(
        documents=[doc["page_content"]],
        metadatas=[doc["metadata"]],
        ids=[f"doc_{i}"]
    )
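Because the same slicing loop recurs in several places in this guide, it may be worth factoring it into a generator. The following is a minimal sketch; `batched` is a hypothetical helper defined here, not a Chroma API.

```python
from collections.abc import Iterator, Sequence


def batched(items: Sequence, batch_size: int) -> Iterator[tuple[int, Sequence]]:
    """Yield (start_index, batch) pairs that cover `items` in order."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield start, items[start:start + batch_size]
```

The start index is returned alongside each batch so that IDs such as `f"doc_{start + j}"` stay unique across batches, exactly as in the loop above.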
3. Choose Embedding Model Wisely
# For speed (local development):
# - Default Chroma (all-MiniLM-L6-v2): 384 dims, fast
collection = client.get_or_create_collection(name="docs")
# For quality (production):
# - OpenAI ada-002: 1536 dims, best quality
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.get_or_create_collection(name="docs", embedding_function=openai_ef)
# For balance (offline production):
# - all-mpnet-base-v2: 768 dims, good quality, free
mpnet_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="all-mpnet-base-v2"
)
collection = client.get_or_create_collection(name="docs", embedding_function=mpnet_ef)
4. Use Metadata Filters to Reduce Search Space
# ✅ Good: Filter then search (fast)
results = collection.query(
query_texts=["authentication"],
n_results=3,
where={"category": "auth"} # Only search auth docs
)
# ❌ Slow: Search everything, filter later
results = collection.query(
query_texts=["authentication"],
n_results=100
)
filtered = [
    (doc, meta)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    if meta["category"] == "auth"
]
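When filters are assembled from optional user input, note that Chroma's `where` validation expects exactly one top-level key and requires `$and` to wrap at least two conditions. The sketch below builds a valid filter from keyword arguments under those rules; `build_where` is a hypothetical helper, and the field names are whatever your metadata contains.

```python
from typing import Optional


def build_where(**conditions) -> Optional[dict]:
    """Build a `where` filter from keyword arguments.

    None values are skipped and list values become `$in` conditions.
    A single condition is returned bare, several are wrapped in `$and`,
    and no conditions at all yields None (no filtering).
    """
    clauses = []
    for field, value in conditions.items():
        if value is None:
            continue
        if isinstance(value, (list, tuple)):
            clauses.append({field: {"$in": list(value)}})
        else:
            clauses.append({field: value})
    if not clauses:
        return None
    if len(clauses) == 1:
        return clauses[0]
    return {"$and": clauses}
```

A call such as `collection.query(query_texts=[...], n_results=3, where=build_where(category="auth", type=doc_type))` then works whether or not `doc_type` is set.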
5. Handle Updates with Upsert
# ✅ Good: Upsert (idempotent)
collection.upsert(
ids=["doc_42"],
documents=["Updated content..."],
metadatas=[{"updated": "2026-02-07"}]
)
# ❌ Bad: Delete then add (race conditions)
try:
    collection.delete(ids=["doc_42"])
except:
    pass
collection.add(ids=["doc_42"], ...)
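Upsert is only idempotent if the same document receives the same ID on every run, which index-based IDs like `doc_42` do not guarantee once the scrape order changes. One option, sketched below, is to derive the ID from the document itself. `stable_id` is a hypothetical helper, and it assumes the `file` metadata key used elsewhere in this guide is present (it falls back to an empty string if not).

```python
import hashlib


def stable_id(doc: dict, prefix: str = "doc") -> str:
    """Derive a deterministic ID from a document's file name and content."""
    metadata = doc.get("metadata") or {}
    # The unit separator keeps "ab" + "c" distinct from "a" + "bc".
    key = "\x1f".join([str(metadata.get("file", "")), doc["page_content"]])
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]
    return f"{prefix}_{digest}"
```

With content-derived IDs, re-ingesting unchanged documents overwrites them in place instead of duplicating them. The trade-off is that an edited document gets a new ID, so stale versions need to be deleted separately, for example by a metadata filter.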
🔥 Real-World Example: Local RAG Chatbot
import chromadb
import json
from openai import OpenAI
class LocalRAGChatbot:
    def __init__(self, db_path: str = "./chroma_db"):
        """Initialize chatbot with local Chroma database."""
        self.client = chromadb.PersistentClient(path=db_path)
        self.openai = OpenAI() # For chat completion only
        self.collection = None

    def ingest_framework(self, framework: str, docs_path: str):
        """Ingest documentation for a framework."""
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=f"{framework}_docs",
            metadata={"framework": framework}
        )
        # Load documents
        with open(docs_path) as f:
            documents = json.load(f)
        # Batch add (Chroma generates embeddings locally)
        batch_size = 1000
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]
            self.collection.add(
                documents=[d["page_content"] for d in batch],
                metadatas=[d["metadata"] for d in batch],
                ids=[f"doc_{i+j}" for j in range(len(batch))]
            )
            if (i + batch_size) < len(documents):
                print(f"Ingested {i + batch_size}/{len(documents)}...")
        print(f"✅ Ingested {len(documents)} documents for {framework}")
        print(f"Collection size: {self.collection.count()}")

    def chat(self, question: str, category: str = None):
        """Answer question using RAG."""
        if not self.collection:
            raise ValueError("No framework ingested. Call ingest_framework() first.")
        # Retrieve relevant documents
        where_filter = {"category": category} if category else None
        results = self.collection.query(
            query_texts=[question],
            n_results=5,
            where=where_filter
        )
        # Build context from results
        context_parts = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_parts.append(f"[{metadata['category']}] {doc}")
        context = "\n\n".join(context_parts)
        # Generate answer using GPT-4
        completion = self.openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. Answer based on the provided documentation context."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ]
        )
        return {
            "answer": completion.choices[0].message.content,
            "sources": [
                {
                    "category": m["category"],
                    "source": m["source"],
                    "file": m["file"]
                }
                for m in results["metadatas"][0]
            ],
            "context_used": len(context)
        }

    def list_frameworks(self):
        """List all ingested frameworks."""
        collections = self.client.list_collections()
        return [
            {
                "name": c.name,
                "count": c.count(),
                "metadata": c.metadata
            }
            for c in collections
        ]
# Usage
chatbot = LocalRAGChatbot(db_path="./my_docs_db")
# Ingest multiple frameworks
chatbot.ingest_framework("react", "output/react-langchain.json")
chatbot.ingest_framework("django", "output/django-langchain.json")
# Interactive chat
frameworks = chatbot.list_frameworks()
print(f"Available frameworks: {[f['name'] for f in frameworks]}")
# Select framework
chatbot.collection = chatbot.client.get_collection("react_docs")
# Ask questions
questions = [
"How do I use useState?",
"What is useEffect for?",
"How do I handle form input?"
]
for question in questions:
    print(f"\nQ: {question}")
    result = chatbot.chat(question, category="hooks")
    print(f"A: {result['answer']}")
    print(f"Sources: {[s['file'] for s in result['sources'][:2]]}")
    print(f"Context size: {result['context_used']} chars")
Output:
✅ Ingested 1247 documents for react
Collection size: 1247
✅ Ingested 892 documents for django
Collection size: 892
Available frameworks: ['react_docs', 'django_docs']
Q: How do I use useState?
A: useState is a React Hook that lets you add state to functional components.
Call it at the top level: const [count, setCount] = useState(0)
Sources: ['hooks/useState.md', 'hooks/overview.md']
Context size: 2340 chars
Q: What is useEffect for?
A: useEffect performs side effects in functional components, like fetching data,
subscriptions, or DOM manipulation. It runs after render.
Sources: ['hooks/useEffect.md', 'hooks/rules.md']
Context size: 2156 chars
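The `chat` method above joins all five retrieved chunks regardless of their length, so a few long chunks can inflate the prompt. A possible refinement, sketched below, is to cap the context at a character budget while preserving rank order. `build_context` and the `max_chars` default are hypothetical choices for illustration, not values taken from the example.

```python
def build_context(documents: list[str], metadatas: list[dict], max_chars: int = 8000) -> str:
    """Join retrieved chunks in rank order without exceeding `max_chars`."""
    separator = "\n\n"
    parts = []
    used = 0
    for doc, metadata in zip(documents, metadatas):
        label = (metadata or {}).get("category", "unknown")
        part = f"[{label}] {doc}"
        extra = len(part) + (len(separator) if parts else 0)
        if used + extra > max_chars:
            break  # lower-ranked chunks are dropped once the budget is spent
        parts.append(part)
        used += extra
    return separator.join(parts)
```

Inside `chat`, it would replace the manual loop: `context = build_context(results["documents"][0], results["metadatas"][0])`. A character budget is only a rough proxy for the model's token limit.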
🐛 Troubleshooting
Issue: Model Download Stuck
Problem: "Downloading model..." hangs indefinitely
Solutions:
- Check internet connection:
curl -I https://huggingface.co
- Manually download model:
from sentence_transformers import SentenceTransformer
# Force download
model = SentenceTransformer('all-MiniLM-L6-v2')
print("Model downloaded!")
- Use pre-downloaded model:
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
model_name="/path/to/local/model"
)
Issue: Dimension Mismatch
Problem: "Dimensionality mismatch: expected 384, got 1536"
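Solution: A collection's embedding dimension is fixed by the first vectors added to it, so every later add or query must use an embedding function that produces the same dimension. To switch models, delete the collection, recreate it with the new embedding function, and re-ingest your documents: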
# Delete and recreate with correct embedding function
client.delete_collection(name="docs")
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.create_collection(
name="docs",
embedding_function=openai_ef # 1536 dims
)
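To catch a mismatch before any data is written, you can probe the embedding function once and compare its output length with the dimension you expect. This is a sketch under the assumption that the embedding function is a callable taking a list of strings, as Chroma's embedding functions are; `embedding_dimension` and `assert_same_dimension` are hypothetical helper names.

```python
from typing import Callable, Sequence

EmbedFn = Callable[[list], Sequence[Sequence[float]]]


def embedding_dimension(embed: EmbedFn) -> int:
    """Embed one probe string and return the length of the resulting vector."""
    vectors = embed(["dimension probe"])
    return len(vectors[0])


def assert_same_dimension(embed: EmbedFn, expected: int) -> None:
    """Raise ValueError if `embed` does not produce `expected` dimensions."""
    actual = embedding_dimension(embed)
    if actual != expected:
        raise ValueError(
            f"Embedding function produces {actual} dimensions, expected {expected}"
        )
```

For example, `assert_same_dimension(openai_ef, 1536)` before the first `collection.add`. Keep in mind that the probe performs a real embedding call, which may download a model or incur a small API charge.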
Issue: Slow Queries
Problem: Queries take >1 second on 10K documents
Solutions:
- Use smaller n_results:
# ✅ Fast: Get only what you need
results = collection.query(query_texts=["..."], n_results=5)
# ❌ Slow: Large result sets
results = collection.query(query_texts=["..."], n_results=100)
- Filter with metadata:
# ✅ Fast: Reduce search space
results = collection.query(
query_texts=["..."],
n_results=5,
where={"category": "specific"} # Only search subset
)
- Use HttpClient for parallelism:
# Start Chroma server
chroma run --path ./chroma_db
# Connect multiple clients
client = chromadb.HttpClient(host="localhost", port=8000)
Issue: Database Locked
Problem: "Database is locked" error
Solutions:
- Check for other processes:
lsof ./chroma_db/chroma.sqlite3
# Kill any hung processes
- Use HttpClient instead:
chroma run --path ./chroma_db --port 8000
client = chromadb.HttpClient(host="localhost", port=8000)
- Enable WAL mode (Write-Ahead Logging):
import sqlite3
conn = sqlite3.connect("./chroma_db/chroma.sqlite3")
conn.execute("PRAGMA journal_mode=WAL")
conn.close()
Issue: Collection Not Found
Problem: "Collection 'docs' does not exist"
Solutions:
- List existing collections:
collections = client.list_collections()
print([c.name for c in collections])
- Use get_or_create:
# ✅ Safe: Creates if missing
collection = client.get_or_create_collection(name="docs")
# ❌ Fails if missing
collection = client.get_collection(name="docs")
Issue: Out of Memory
Problem: Python crashes when adding large dataset
Solutions:
- Batch with smaller size:
batch_size = 500 # Reduce from 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(...)
- Use HttpClient + server:
# Server handles memory better
chroma run --path ./chroma_db
- Pre-compute embeddings externally:
# Generate embeddings in separate script
# Then add with embeddings parameter
collection.add(
documents=[...],
embeddings=precomputed_embeddings,
...
)
📊 Before vs. After
| Aspect | Without Skill Seekers | With Skill Seekers |
|---|---|---|
| Data Preparation | Custom scraping + parsing logic | One command: skill-seekers scrape |
| Embedding Setup | Manual model selection and config | Auto-configured with sensible defaults |
| Metadata | Manual extraction from docs | Auto-extracted (category, source, file, type) |
| Storage | Complex path management | Simple: PersistentClient(path="...") |
| Local-First | Requires external services | Fully local with Sentence Transformers |
| Setup Time | 2-4 hours | 5 minutes |
| Code Required | 300+ lines scraping logic | 20 lines upload script |
| External Deps | OpenAI API required | Optional (works offline!) |
🎯 Next Steps
Enhance Your Chroma Integration
- Try Different Embedding Models:
# Better quality (still local)
ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
- Implement Semantic Chunking:
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
- Set Up Multi-Collection Search:
# Search across multiple frameworks
for name in ["react_docs", "vue_docs", "angular_docs"]:
    collection = client.get_collection(name)
    results = collection.query(...)
- Deploy with Docker:
docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma ghcr.io/chroma-core/chroma:latest
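For the multi-collection search idea in the list above, the per-collection results still need to be combined into one ranking. A minimal sketch follows, assuming each collection was queried with a single query text and that all collections use the same embedding function and distance metric (otherwise the distances are not comparable). `merge_hits` is a hypothetical helper name.

```python
def merge_hits(results_by_collection: dict, top_k: int = 5) -> list[dict]:
    """Merge single-query results from several collections, closest first."""
    merged = []
    for name, results in results_by_collection.items():
        rows = zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        )
        for document, metadata, distance in rows:
            merged.append({
                "collection": name,
                "document": document,
                "metadata": metadata,
                "distance": distance,
            })
    # Smaller distance means a closer match.
    merged.sort(key=lambda hit: hit["distance"])
    return merged[:top_k]
```

You would fill the input with something like `{name: client.get_collection(name).query(query_texts=[q], n_results=5) for name in names}` and then present the merged list to the user.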
Related Guides
- LangChain Integration - Use Chroma as vector store in LangChain
- LlamaIndex Integration - Use Chroma with LlamaIndex
- RAG Pipelines Guide - Build complete RAG systems
- INTEGRATIONS.md - See all integration options
Resources
- Chroma Docs: https://docs.trychroma.com/
- Python Client: https://docs.trychroma.com/reference/py-client
- Support: https://github.com/yusufkaraaslan/Skill_Seekers/discussions
Questions? Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues | Website: https://skillseekersweb.com/ | Last Updated: February 7, 2026