# Chroma Integration with Skill Seekers

**Status:** ✅ Production Ready
**Difficulty:** Beginner
**Last Updated:** February 7, 2026

---

## ❌ The Problem

Building RAG applications with Chroma involves several challenges:

1. **Embedding Model Setup** - Need to choose and configure embedding models (local vs API) manually
2. **Collection Management** - Creating and managing collections with metadata requires boilerplate code
3. **Local-First Complexity** - Setting up persistent storage and dealing with file paths

**Example Pain Point:**

```python
# Manual embedding + collection setup for each framework
import chromadb
from chromadb.utils import embedding_functions

# Choose embedding function
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

# Create client + collection
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.create_collection(
    name="react_docs",
    embedding_function=openai_ef,
    metadata={"description": "React documentation"}
)

# Manually parse and add documents...
```

---

## ✅ The Solution

Skill Seekers automates Chroma integration with structured, production-ready data:

**Benefits:**

- ✅ Auto-formatted documents with embeddings included
- ✅ Consistent collection structure across all frameworks
- ✅ Works with local models (Sentence Transformers) or API embeddings (OpenAI, Cohere)
- ✅ Persistent storage with automatic path management
- ✅ Metadata-rich for precise filtering

**Result:** 5-minute setup, production-ready local vector search with zero external dependencies.

---

## ⚡ Quick Start (5 Minutes)

### Prerequisites

```bash
# Install Chroma (quote the specifier so the shell doesn't treat > as a redirect)
pip install "chromadb>=0.4.22"

# For local embeddings (optional, free)
pip install sentence-transformers

# For OpenAI embeddings (optional)
pip install openai

# Or with Skill Seekers
pip install "skill-seekers[all-llms]"
```
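
To confirm the install before moving on, a minimal smoke test (note: the first `add()` downloads the default embedding model, ~90MB):

```python
import chromadb

print(chromadb.__version__)

# Throwaway in-memory round trip to verify the install works end to end
client = chromadb.Client()
collection = client.get_or_create_collection(name="smoke_test")
collection.add(documents=["hello chroma"], ids=["doc_0"])
print(collection.query(query_texts=["hello"], n_results=1)["documents"])
```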

**What you need:**

- Python 3.10+
- No external services required (fully local!)
- Optional: OpenAI API key for better embeddings

### Generate Chroma-Ready Documents

```bash
# Step 1: Scrape documentation
skill-seekers scrape --config configs/react.json

# Step 2: Package for Chroma (creates LangChain format)
skill-seekers package output/react --target langchain

# Output: output/react-langchain.json (Chroma-compatible)
```

### Upload to Chroma (Local)

```python
import chromadb
import json

# Create persistent client (data saved to disk)
client = chromadb.PersistentClient(path="./chroma_db")

# Create collection with local embeddings (free!)
collection = client.get_or_create_collection(
    name="react_docs",
    metadata={"description": "React documentation from Skill Seekers"}
)

# Load documents
with open("output/react-langchain.json") as f:
    documents = json.load(f)

# Add to collection (Chroma generates embeddings automatically)
collection.add(
    documents=[doc["page_content"] for doc in documents],
    metadatas=[doc["metadata"] for doc in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)

print(f"✅ Added {len(documents)} documents to Chroma")
print(f"Total in collection: {collection.count()}")
```

### Query with Filters

```python
# Semantic search with metadata filter
results = collection.query(
    query_texts=["How do I use React hooks?"],
    n_results=3,
    where={"category": "hooks"}  # Filter by category
)

for i, (doc, metadata) in enumerate(zip(results["documents"][0], results["metadatas"][0])):
    print(f"\n{i+1}. Category: {metadata['category']}")
    print(f"   Source: {metadata['source']}")
    print(f"   Content: {doc[:200]}...")
```

**That's it!** Chroma is now running locally with your documentation.

---

## 📖 Detailed Setup Guide

### Step 1: Choose Storage Mode

**Option A: Persistent (Recommended for Production)**

```python
import chromadb

# Data persists to disk
client = chromadb.PersistentClient(
    path="./chroma_db"  # Specify database directory
)

# Database files saved to ./chroma_db/
# Survives script restarts
```

**Option B: In-Memory (Fast, for Development)**

```python
# Data lost when script ends
client = chromadb.Client()

# Fast, but temporary
# Perfect for experimentation
```

**Option C: HTTP Client (Remote Chroma Server)**

```bash
# Start Chroma server
chroma run --path ./chroma_db --port 8000
```

```python
# Connect to remote server
client = chromadb.HttpClient(host="localhost", port=8000)

# Great for microservices architecture
```

**Option D: Docker (Production)**

```yaml
# docker-compose.yml
version: '3'
services:
  chroma:
    image: ghcr.io/chroma-core/chroma:latest
    volumes:
      - ./chroma-data:/chroma/chroma
    ports:
      - "8000:8000"
    environment:
      - ANONYMIZED_TELEMETRY=False
```

```bash
# Start Chroma
docker-compose up -d
```

### Step 2: Generate Skill Seekers Documents

**Option A: Documentation Website**

```bash
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain
```

**Option B: GitHub Repository**

```bash
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain
```

**Option C: Local Codebase**

```bash
skill-seekers analyze --directory /path/to/repo
skill-seekers package output/codebase --target langchain
```

**Option D: RAG-Optimized Chunking**

```bash
skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
skill-seekers package output/fastapi --target langchain
```

### Step 3: Choose Embedding Function

**Option A: Default (Sentence Transformers - Free)**

```python
# Chroma uses all-MiniLM-L6-v2 by default
collection = client.get_or_create_collection(name="docs")

# Automatically downloads model on first use (~90MB)
# Dimensions: 384
# Speed: ~500 docs/sec on CPU
# Quality: Good for most use cases
```
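
If you'd rather spell the default out, so the collection's embedding function is explicit in code, chromadb ships a `DefaultEmbeddingFunction` wrapper; a small sketch, assuming it matches your installed version:

```python
from chromadb.utils import embedding_functions

# Same all-MiniLM-L6-v2 model Chroma would pick implicitly
default_ef = embedding_functions.DefaultEmbeddingFunction()
collection = client.get_or_create_collection(
    name="docs",
    embedding_function=default_ef
)
```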

**Option B: OpenAI (Best Quality)**

```python
from chromadb.utils import embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="sk-...",
    model_name="text-embedding-ada-002"
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=openai_ef
)

# Cost: ~$0.0001 per 1K tokens
# Dimensions: 1536
# Quality: Excellent
```

**Option C: Local Sentence Transformers (Customizable)**

```python
from chromadb.utils import embedding_functions

sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"  # Better quality than default
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=sentence_transformer_ef
)

# Free, local, customizable
# Dimensions: 768 (all-mpnet-base-v2)
# Quality: Better than default
```

**Option D: Cohere**

```python
from chromadb.utils import embedding_functions

cohere_ef = embedding_functions.CohereEmbeddingFunction(
    api_key="your-cohere-key",
    model_name="embed-english-v3.0"
)

collection = client.get_or_create_collection(
    name="docs",
    embedding_function=cohere_ef
)
```

### Step 4: Add Documents with Metadata

```python
import json

# Load Skill Seekers documents
with open("output/django-langchain.json") as f:
    documents = json.load(f)

# Prepare for Chroma
docs_content = []
docs_metadata = []
docs_ids = []

for i, doc in enumerate(documents):
    docs_content.append(doc["page_content"])
    docs_metadata.append(doc["metadata"])
    docs_ids.append(f"doc_{i}")

# Add to collection (batch operation)
collection.add(
    documents=docs_content,
    metadatas=docs_metadata,
    ids=docs_ids
)

print(f"✅ Added {len(documents)} documents")
print(f"Collection size: {collection.count()}")
```
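
Chroma metadata values must be scalars (str, int, float, or bool). If your packaged metadata might contain lists or nested dicts, a defensive pass before `add()` avoids ingestion errors; a minimal sketch (the `sanitize` helper is illustrative, not part of Skill Seekers):

```python
def sanitize(meta: dict) -> dict:
    """Keep scalar metadata values; stringify anything else."""
    return {
        key: value if isinstance(value, (str, int, float, bool)) else str(value)
        for key, value in meta.items()
    }

docs_metadata = [sanitize(doc["metadata"]) for doc in documents]
```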

### Step 5: Query with Advanced Filters

```python
# Simple query
results = collection.query(
    query_texts=["How do I create models?"],
    n_results=5
)

# With metadata filter
results = collection.query(
    query_texts=["Django authentication"],
    n_results=3,
    where={"category": "authentication"}
)

# Multiple filters (AND logic)
results = collection.query(
    query_texts=["user registration"],
    n_results=3,
    where={
        "$and": [
            {"category": "authentication"},
            {"type": "tutorial"}
        ]
    }
)

# Filter with OR
results = collection.query(
    query_texts=["components"],
    n_results=5,
    where={
        "$or": [
            {"category": "components"},
            {"category": "hooks"}
        ]
    }
)

# Filter with IN
results = collection.query(
    query_texts=["data handling"],
    n_results=5,
    where={"category": {"$in": ["models", "views", "serializers"]}}
)

# Extract results
for doc, metadata, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f}")
    print(f"Category: {metadata['category']}")
    print(f"Content: {doc[:200]}...")
    print()
```

---
## 🚀 Advanced Usage

### 1. Multiple Collections for Different Frameworks

```python
# Create separate collections
frameworks = ["react", "vue", "angular", "svelte"]

for framework in frameworks:
    collection = client.get_or_create_collection(
        name=f"{framework}_docs",
        metadata={
            "framework": framework,
            "version": "latest",
            "last_updated": "2026-02-07"
        }
    )

    # Load framework-specific documents
    with open(f"output/{framework}-langchain.json") as f:
        docs = json.load(f)

    collection.add(
        documents=[d["page_content"] for d in docs],
        metadatas=[d["metadata"] for d in docs],
        ids=[f"doc_{i}" for i in range(len(docs))]
    )

# Query specific framework
react_collection = client.get_collection(name="react_docs")
results = react_collection.query(
    query_texts=["useState hook"],
    n_results=3
)
```

### 2. Update Documents Efficiently

```python
# Update existing document (same ID)
collection.update(
    ids=["doc_42"],
    documents=["Updated content for React hooks..."],
    metadatas=[{"category": "hooks", "updated": "2026-02-07"}]
)

# Upsert (update or insert)
collection.upsert(
    ids=["doc_42"],
    documents=["New or updated content..."],
    metadatas=[{"category": "hooks"}]
)

# Delete specific documents
collection.delete(ids=["doc_42", "doc_99"])

# Delete by filter
collection.delete(where={"category": "deprecated"})
```

### 3. Pre-Compute Embeddings for Faster Ingestion

```python
import openai

# Generate embeddings separately
openai_client = openai.OpenAI()
embeddings = []

for doc in documents:
    response = openai_client.embeddings.create(
        model="text-embedding-ada-002",
        input=doc["page_content"]  # input also accepts a list of strings for batching
    )
    embeddings.append(response.data[0].embedding)

# Add with pre-computed embeddings (faster)
collection.add(
    documents=[d["page_content"] for d in documents],
    embeddings=embeddings,  # Skip embedding generation
    metadatas=[d["metadata"] for d in documents],
    ids=[f"doc_{i}" for i in range(len(documents))]
)
```

### 4. Hybrid Search (Vector + Keyword)

```python
# Get all documents matching keyword filter
results = collection.query(
    query_texts=["state management"],
    n_results=100,  # Get many candidates
    where_document={"$contains": "useState"}  # Keyword filter
)

# Chroma re-ranks by semantic similarity
# Results contain "useState" AND are semantically similar to "state management"
```

### 5. Collection Management

```python
# List all collections
collections = client.list_collections()
for collection in collections:
    print(f"{collection.name}: {collection.count()} documents")
    print(f"  Metadata: {collection.metadata}")

# Get collection info
collection = client.get_collection(name="react_docs")
print(f"Count: {collection.count()}")
print(f"Metadata: {collection.metadata}")

# Delete collection
client.delete_collection(name="old_docs")

# Rename collection (create new, copy data, delete old)
old = client.get_collection(name="react_docs")
new = client.create_collection(name="react_docs_v2")

# Copy all documents (embeddings are not returned by default, so request them)
old_data = old.get(include=["documents", "metadatas", "embeddings"])
new.add(
    ids=old_data["ids"],
    documents=old_data["documents"],
    metadatas=old_data["metadatas"],
    embeddings=old_data["embeddings"]
)

client.delete_collection(name="react_docs")
```

---
## 📋 Best Practices

### 1. Use Persistent Storage for Production

```python
# ✅ Good: Data persists
client = chromadb.PersistentClient(path="./chroma_db")

# ❌ Bad: Data lost on restart
client = chromadb.Client()

# Store DB in appropriate location
import os
db_path = os.path.expanduser("~/.local/share/my_app/chroma_db")
client = chromadb.PersistentClient(path=db_path)
```

### 2. Batch Operations for Large Datasets

```python
# ✅ Good: Batch add (fast)
batch_size = 1000
for i in range(0, len(documents), batch_size):
    batch = documents[i:i + batch_size]
    collection.add(
        documents=[d["page_content"] for d in batch],
        metadatas=[d["metadata"] for d in batch],
        ids=[f"doc_{i+j}" for j in range(len(batch))]
    )
    print(f"Added {i + len(batch)}/{len(documents)}...")

# ❌ Bad: One at a time (slow)
for i, doc in enumerate(documents):
    collection.add(
        documents=[doc["page_content"]],
        metadatas=[doc["metadata"]],
        ids=[f"doc_{i}"]
    )
```

### 3. Choose Embedding Model Wisely

```python
# For speed (local development):
# - Default Chroma (all-MiniLM-L6-v2): 384 dims, fast
collection = client.get_or_create_collection(name="docs")

# For quality (production):
# - OpenAI ada-002: 1536 dims, best quality
openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.get_or_create_collection(name="docs", embedding_function=openai_ef)

# For balance (offline production):
# - all-mpnet-base-v2: 768 dims, good quality, free
mpnet_ef = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-mpnet-base-v2"
)
collection = client.get_or_create_collection(name="docs", embedding_function=mpnet_ef)
```

### 4. Use Metadata Filters to Reduce Search Space

```python
# ✅ Good: Filter then search (fast)
results = collection.query(
    query_texts=["authentication"],
    n_results=3,
    where={"category": "auth"}  # Only search auth docs
)

# ❌ Slow: Search everything, filter later
results = collection.query(
    query_texts=["authentication"],
    n_results=100
)
filtered = [
    (doc, meta)
    for doc, meta in zip(results["documents"][0], results["metadatas"][0])
    if meta["category"] == "auth"
]
```

### 5. Handle Updates with Upsert

```python
# ✅ Good: Upsert (idempotent)
collection.upsert(
    ids=["doc_42"],
    documents=["Updated content..."],
    metadatas=[{"updated": "2026-02-07"}]
)

# ❌ Bad: Delete then add (race conditions)
try:
    collection.delete(ids=["doc_42"])
except:
    pass
collection.add(ids=["doc_42"], ...)
```

---
## 🔥 Real-World Example: Local RAG Chatbot

```python
import chromadb
import json
from openai import OpenAI


class LocalRAGChatbot:
    def __init__(self, db_path: str = "./chroma_db"):
        """Initialize chatbot with local Chroma database."""
        self.client = chromadb.PersistentClient(path=db_path)
        self.openai = OpenAI()  # For chat completion only
        self.collection = None

    def ingest_framework(self, framework: str, docs_path: str):
        """Ingest documentation for a framework."""
        # Create or get collection
        self.collection = self.client.get_or_create_collection(
            name=f"{framework}_docs",
            metadata={"framework": framework}
        )

        # Load documents
        with open(docs_path) as f:
            documents = json.load(f)

        # Batch add (Chroma generates embeddings locally)
        batch_size = 1000
        for i in range(0, len(documents), batch_size):
            batch = documents[i:i + batch_size]

            self.collection.add(
                documents=[d["page_content"] for d in batch],
                metadatas=[d["metadata"] for d in batch],
                ids=[f"doc_{i+j}" for j in range(len(batch))]
            )

            if (i + batch_size) < len(documents):
                print(f"Ingested {i + batch_size}/{len(documents)}...")

        print(f"✅ Ingested {len(documents)} documents for {framework}")
        print(f"Collection size: {self.collection.count()}")

    def chat(self, question: str, category: str = None):
        """Answer question using RAG."""
        if not self.collection:
            raise ValueError("No framework ingested. Call ingest_framework() first.")

        # Retrieve relevant documents
        where_filter = {"category": category} if category else None

        results = self.collection.query(
            query_texts=[question],
            n_results=5,
            where=where_filter
        )

        # Build context from results
        context_parts = []
        for doc, metadata in zip(results["documents"][0], results["metadatas"][0]):
            context_parts.append(f"[{metadata['category']}] {doc}")

        context = "\n\n".join(context_parts)

        # Generate answer using GPT-4
        completion = self.openai.chat.completions.create(
            model="gpt-4",
            messages=[
                {
                    "role": "system",
                    "content": "You are a helpful assistant. Answer based on the provided documentation context."
                },
                {
                    "role": "user",
                    "content": f"Context:\n{context}\n\nQuestion: {question}"
                }
            ]
        )

        return {
            "answer": completion.choices[0].message.content,
            "sources": [
                {
                    "category": m["category"],
                    "source": m["source"],
                    "file": m["file"]
                }
                for m in results["metadatas"][0]
            ],
            "context_used": len(context)
        }

    def list_frameworks(self):
        """List all ingested frameworks."""
        collections = self.client.list_collections()
        return [
            {
                "name": c.name,
                "count": c.count(),
                "metadata": c.metadata
            }
            for c in collections
        ]


# Usage
chatbot = LocalRAGChatbot(db_path="./my_docs_db")

# Ingest multiple frameworks
chatbot.ingest_framework("react", "output/react-langchain.json")
chatbot.ingest_framework("django", "output/django-langchain.json")

# Interactive chat
frameworks = chatbot.list_frameworks()
print(f"Available frameworks: {[f['name'] for f in frameworks]}")

# Select framework
chatbot.collection = chatbot.client.get_collection("react_docs")

# Ask questions
questions = [
    "How do I use useState?",
    "What is useEffect for?",
    "How do I handle form input?"
]

for question in questions:
    print(f"\nQ: {question}")
    result = chatbot.chat(question, category="hooks")
    print(f"A: {result['answer']}")
    print(f"Sources: {[s['file'] for s in result['sources'][:2]]}")
    print(f"Context size: {result['context_used']} chars")
```

**Output:**

```
✅ Ingested 1247 documents for react
Collection size: 1247
✅ Ingested 892 documents for django
Collection size: 892

Available frameworks: ['react_docs', 'django_docs']

Q: How do I use useState?
A: useState is a React Hook that lets you add state to functional components.
Call it at the top level: const [count, setCount] = useState(0)
Sources: ['hooks/useState.md', 'hooks/overview.md']
Context size: 2340 chars

Q: What is useEffect for?
A: useEffect performs side effects in functional components, like fetching data,
subscriptions, or DOM manipulation. It runs after render.
Sources: ['hooks/useEffect.md', 'hooks/rules.md']
Context size: 2156 chars
```

---
## 🐛 Troubleshooting

### Issue: Model Download Stuck

**Problem:** "Downloading model..." hangs indefinitely

**Solutions:**

1. **Check internet connection:**
   ```bash
   curl -I https://huggingface.co
   ```

2. **Manually download model:**
   ```python
   from sentence_transformers import SentenceTransformer

   # Force download
   model = SentenceTransformer('all-MiniLM-L6-v2')
   print("Model downloaded!")
   ```

3. **Use pre-downloaded model:**
   ```python
   ef = embedding_functions.SentenceTransformerEmbeddingFunction(
       model_name="/path/to/local/model"
   )
   ```

### Issue: Dimension Mismatch

**Problem:** "Dimensionality mismatch: expected 384, got 1536"

**Solution:** Collections remember their embedding function, so delete and recreate the collection with the correct one:

```python
# Delete and recreate with correct embedding function
client.delete_collection(name="docs")

openai_ef = embedding_functions.OpenAIEmbeddingFunction(...)
collection = client.create_collection(
    name="docs",
    embedding_function=openai_ef  # 1536 dims
)
```

### Issue: Slow Queries

**Problem:** Queries take >1 second on 10K documents

**Solutions:**

1. **Use smaller n_results:**
   ```python
   # ✅ Fast: Get only what you need
   results = collection.query(query_texts=["..."], n_results=5)

   # ❌ Slow: Large result sets
   results = collection.query(query_texts=["..."], n_results=100)
   ```

2. **Filter with metadata:**
   ```python
   # ✅ Fast: Reduce search space
   results = collection.query(
       query_texts=["..."],
       n_results=5,
       where={"category": "specific"}  # Only search subset
   )
   ```

3. **Use HttpClient for parallelism:**
   ```bash
   # Start Chroma server
   chroma run --path ./chroma_db
   ```

   ```python
   # Connect multiple clients
   client = chromadb.HttpClient(host="localhost", port=8000)
   ```

### Issue: Database Locked

**Problem:** "Database is locked" error

**Solutions:**

1. **Check for other processes:**
   ```bash
   lsof ./chroma_db/chroma.sqlite3
   # Kill any hung processes
   ```

2. **Use HttpClient instead:**
   ```bash
   chroma run --path ./chroma_db --port 8000
   ```

   ```python
   client = chromadb.HttpClient(host="localhost", port=8000)
   ```

3. **Enable WAL mode (Write-Ahead Logging):**
   ```python
   import sqlite3
   conn = sqlite3.connect("./chroma_db/chroma.sqlite3")
   conn.execute("PRAGMA journal_mode=WAL")
   conn.close()
   ```

### Issue: Collection Not Found

**Problem:** "Collection 'docs' does not exist"

**Solutions:**

1. **List existing collections:**
   ```python
   collections = client.list_collections()
   print([c.name for c in collections])
   ```

2. **Use get_or_create:**
   ```python
   # ✅ Safe: Creates if missing
   collection = client.get_or_create_collection(name="docs")

   # ❌ Fails if missing
   collection = client.get_collection(name="docs")
   ```

### Issue: Out of Memory

**Problem:** Python crashes when adding large dataset

**Solutions:**

1. **Batch with smaller size:**
   ```python
   batch_size = 500  # Reduce from 1000
   for i in range(0, len(documents), batch_size):
       batch = documents[i:i + batch_size]
       collection.add(...)
   ```

2. **Use HttpClient + server:**
   ```bash
   # Server handles memory better
   chroma run --path ./chroma_db
   ```

3. **Pre-compute embeddings externally:**
   ```python
   # Generate embeddings in a separate script,
   # then add with the embeddings parameter
   collection.add(
       documents=[...],
       embeddings=precomputed_embeddings,
       ...
   )
   ```

---
## 📊 Before vs. After

| Aspect | Without Skill Seekers | With Skill Seekers |
|--------|----------------------|-------------------|
| **Data Preparation** | Custom scraping + parsing logic | One command: `skill-seekers scrape` |
| **Embedding Setup** | Manual model selection and config | Auto-configured with sensible defaults |
| **Metadata** | Manual extraction from docs | Auto-extracted (category, source, file, type) |
| **Storage** | Complex path management | Simple: `PersistentClient(path="...")` |
| **Local-First** | Requires external services | Fully local with Sentence Transformers |
| **Setup Time** | 2-4 hours | 5 minutes |
| **Code Required** | 300+ lines scraping logic | 20 lines upload script |
| **External Deps** | OpenAI API required | Optional (works offline!) |

---
## 🎯 Next Steps

### Enhance Your Chroma Integration

1. **Try Different Embedding Models:**
   ```python
   # Better quality (still local)
   ef = embedding_functions.SentenceTransformerEmbeddingFunction(
       model_name="all-mpnet-base-v2"
   )
   ```

2. **Implement Semantic Chunking:**
   ```bash
   skill-seekers scrape --config configs/fastapi.json --chunk-for-rag --chunk-tokens 512
   ```

3. **Set Up Multi-Collection Search** (see the merged-search sketch after this list):
   ```python
   # Search across multiple frameworks
   for name in ["react_docs", "vue_docs", "angular_docs"]:
       collection = client.get_collection(name)
       results = collection.query(...)
   ```

4. **Deploy with Docker:**
   ```bash
   docker run -p 8000:8000 -v ./chroma-data:/chroma/chroma ghcr.io/chroma-core/chroma:latest
   ```
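
To turn the multi-collection loop in item 3 into a single ranked result list, you can merge per-collection hits by distance (comparable only if all collections share the same embedding function). A minimal sketch; the collection names are illustrative:

```python
def search_all(client, names, question, k=3):
    """Query several collections and merge hits by ascending distance."""
    hits = []
    for name in names:
        collection = client.get_collection(name)
        results = collection.query(query_texts=[question], n_results=k)
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0],
        ):
            hits.append({"collection": name, "distance": dist, "document": doc, "metadata": meta})
    # Lower distance = more similar, so sort ascending
    return sorted(hits, key=lambda h: h["distance"])

for hit in search_all(client, ["react_docs", "vue_docs", "angular_docs"], "component state")[:5]:
    print(f"{hit['distance']:.3f}  [{hit['collection']}] {hit['document'][:80]}...")
```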

### Related Guides

- **[LangChain Integration](LANGCHAIN.md)** - Use Chroma as vector store in LangChain
- **[LlamaIndex Integration](LLAMA_INDEX.md)** - Use Chroma with LlamaIndex
- **[RAG Pipelines Guide](RAG_PIPELINES.md)** - Build complete RAG systems
- **[INTEGRATIONS.md](INTEGRATIONS.md)** - See all integration options

### Resources

- **Chroma Docs:** https://docs.trychroma.com/
- **Python Client:** https://docs.trychroma.com/reference/py-client
- **Support:** https://github.com/yusufkaraaslan/Skill_Seekers/discussions

---

**Questions?** Open an issue: https://github.com/yusufkaraaslan/Skill_Seekers/issues
**Website:** https://skillseekersweb.com/
**Last Updated:** February 7, 2026