# ChromaDB Vector Database Example

This example demonstrates how to use Skill Seekers with ChromaDB, the AI-native open-source embedding database. Chroma is designed to be simple, fast, and easy to use locally.

## What You'll Learn

- How to generate skills in ChromaDB format
- How to create local Chroma collections
- How to perform semantic searches
- How to filter by metadata categories

## Why ChromaDB?

- **No Server Required**: Works entirely in-process (perfect for development)
- **Simple API**: Clean Python interface, no complex setup
- **Fast**: Indexes embeddings with HNSW for approximate nearest-neighbor search
- **Open Source**: Apache 2.0 licensed, community-driven

## Prerequisites

### Python Dependencies

```bash
pip install -r requirements.txt
```

That's it! No Docker, no server setup. Chroma runs entirely in your Python process.

## Step-by-Step Guide

### Step 1: Generate Skill from Documentation

First, we'll scrape Vue documentation and package it for ChromaDB:

```bash
python 1_generate_skill.py
```

This script will:
1. Scrape Vue docs (limited to 20 pages for demo)
2. Package the skill in ChromaDB format (JSON with documents + metadata + IDs)
3. Save to `output/vue-chroma.json`

**Expected Output:**
```
✅ ChromaDB data packaged successfully!
📦 Output: output/vue-chroma.json
📊 Total documents: 21
📂 Categories: overview (1), guides (8), api (12)
```

**What's in the JSON?**

The three arrays are parallel: entry *i* of `documents`, `metadatas`, and `ids` describes the same document. The sample below is abbreviated.

```json
{
  "documents": [
    "Vue is a progressive JavaScript framework...",
    "Components are the building blocks..."
  ],
  "metadatas": [
    {
      "source": "vue",
      "category": "overview",
      "file": "SKILL.md",
      "type": "documentation",
      "version": "1.0.0"
    }
  ],
  "ids": [
    "a1b2c3d4e5f6...",
    "b2c3d4e5f6g7..."
  ],
  "collection_name": "vue"
}
```

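Before uploading, it can help to sanity-check the packaged file. The following is a minimal sketch, not part of the example scripts: `validate_package` is a hypothetical helper that assumes the four top-level keys shown above and checks that the arrays line up and that IDs are unique.

```python
import json
from pathlib import Path


def validate_package(data: dict) -> list[str]:
    """Return a list of problems found in a ChromaDB-format package."""
    problems = []
    for key in ("documents", "metadatas", "ids", "collection_name"):
        if key not in data:
            problems.append(f"missing key: {key}")
    if problems:
        return problems  # cannot compare lengths if keys are missing

    n_docs = len(data["documents"])
    if len(data["metadatas"]) != n_docs:
        problems.append("metadatas length does not match documents")
    if len(data["ids"]) != n_docs:
        problems.append("ids length does not match documents")
    if len(set(data["ids"])) != len(data["ids"]):
        problems.append("ids are not unique")
    return problems


if __name__ == "__main__":
    path = Path("output/vue-chroma.json")  # output path from Step 1
    if path.exists():
        issues = validate_package(json.loads(path.read_text(encoding="utf-8")))
        print("OK" if not issues else "\n".join(issues))
```

An empty list means the file has the shape Step 2 expects.
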
### Step 2: Create Collection and Upload

Now we'll create a ChromaDB collection and load all documents:

```bash
python 2_upload_to_chroma.py
```

This script will:
1. Create an in-memory Chroma client (or persistent with `--persist`)
2. Create a collection with the skill name
3. Add all documents with metadata and IDs
4. Verify the upload was successful

**Expected Output:**
```
📊 Creating ChromaDB client...
✅ Client created (in-memory)

📦 Creating collection: vue
✅ Collection created!

📤 Adding 21 documents to collection...
✅ Successfully added 21 documents to ChromaDB

🔍 Collection 'vue' now contains 21 documents
```

**Persistent Storage:**
```bash
# Save to disk for later use
python 2_upload_to_chroma.py --persist ./chroma_db
```

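The four upload steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of `2_upload_to_chroma.py`: `make_client` and `upload_package` are hypothetical names, and `get_or_create_collection` is used here (instead of a plain create) so that re-running does not fail on an existing collection.

```python
def make_client(persist_dir=None):
    """Create an in-memory client, or a persistent one if a directory is given."""
    import chromadb  # imported lazily so upload_package can be used without it

    if persist_dir:
        return chromadb.PersistentClient(path=persist_dir)
    return chromadb.Client()


def upload_package(client, package: dict) -> int:
    """Load a packaged skill into a collection and return the document count."""
    collection = client.get_or_create_collection(name=package["collection_name"])
    collection.add(
        documents=package["documents"],
        metadatas=package["metadatas"],
        ids=package["ids"],
    )
    # count() doubles as the verification step
    return collection.count()
```

With the file from Step 1 loaded via `json.load`, calling `upload_package(make_client("./chroma_db"), package)` should return 21 for the demo data.
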
### Step 3: Query and Search

Now search your knowledge base!

```bash
python 3_query_example.py
```

**With persistent storage:**
```bash
python 3_query_example.py --persist ./chroma_db
```

This script demonstrates:
1. **Semantic Search**: Natural language queries
2. **Metadata Filtering**: Filter by category
3. **Top-K Results**: Get most relevant documents
4. **Distance Scoring**: See how relevant each result is

**Example Queries:**

**Query 1: Semantic Search**
```
Query: "How do I create a Vue component?"
Top 3 results:

1. [Distance: 0.234] guides/components.md
   Components are reusable Vue instances with a name. You can use them as custom
   elements inside a root Vue instance...

2. [Distance: 0.298] api/component_api.md
   The component API reference describes all available options for defining
   components using the Options API...

3. [Distance: 0.312] guides/single_file_components.md
   Single-File Components (SFCs) allow you to define templates, logic, and
   styling in a single .vue file...
```

**Query 2: Filtered Search**
```
Query: "reactivity"
Filter: category = "api"

Results:
1. ref() - Create reactive references
2. reactive() - Create reactive proxies
3. computed() - Create computed properties
```

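Output like the listing above can be produced from the dictionary that `collection.query()` returns, where each field is a list of lists (one inner list per query text). The sketch below is one way to do it; `format_results` is a hypothetical helper, and it assumes the query asked for documents, metadatas, and distances and that each metadata entry may carry the `file` key shown in Step 1.

```python
def format_results(results: dict, query_index: int = 0) -> list[str]:
    """Turn one query's results into printable lines, in the order returned."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]

    lines = []
    for rank, (doc_id, doc, meta, dist) in enumerate(zip(ids, docs, metas, dists), start=1):
        # Fall back to the document ID when there is no "file" metadata
        label = (meta or {}).get("file", doc_id)
        snippet = doc[:80].replace("\n", " ")
        lines.append(f"{rank}. [Distance: {dist:.3f}] {label}: {snippet}")
    return lines
```

Chroma returns hits ordered from smallest to largest distance, so rank 1 is the closest match.
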
## Understanding ChromaDB Features

### Semantic Search

Chroma automatically:
- Generates embeddings for your documents (using the default model, `all-MiniLM-L6-v2`, unless you supply an embedding function)
- Indexes them for fast similarity search
- Finds semantically similar content

**Distance Scores:**
- Lower = more similar
- `0.0` = identical
- `< 0.5` = very relevant
- `0.5-1.0` = somewhat relevant
- `> 1.0` = less relevant

These thresholds are rough guides for this example. Absolute values depend on the embedding model and on the collection's distance metric (`hnsw:space`, which defaults to `l2`).

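To see why the same pair of vectors can score differently under each metric, here is a small pure-Python sketch of the three distances, assuming the definitions given in Chroma's documentation for `hnsw:space` (squared L2, one minus cosine similarity, one minus inner product). The function names are illustrative and not part of Chroma's API.

```python
import math


def l2_squared(a, b):
    """Squared Euclidean distance ("l2")."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity ("cosine")."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def inner_product_distance(a, b):
    """One minus inner product ("ip"); can be negative for long vectors."""
    return 1.0 - sum(x * y for x, y in zip(a, b))
```

For unit-length vectors, squared L2 equals twice the cosine distance, so the two metrics rank results identically even though the numbers differ.
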
### Metadata Filtering

Restrict the semantic search to documents whose metadata matches a `where` clause:
```python
collection.query(
    query_texts=["your query"],
    n_results=5,
    where={"category": "api"}
)
```

**Supported operators:**
- `$eq`: Equal to
- `$ne`: Not equal to
- `$gt`, `$gte`: Greater than (or equal)
- `$lt`, `$lte`: Less than (or equal)
- `$in`: In list
- `$nin`: Not in list

**Complex filters** (combine conditions with `$and` or `$or`):
```python
where={
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}}
    ]
}
```

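Building these clauses by hand gets repetitive. The helper below is a hedged sketch: `build_where` is a hypothetical function, and it assumes (as Chroma 0.4.x appears to require) that a `where` dict has a single top-level key and that `$and` takes at least two conditions.

```python
def build_where(**equals):
    """Build a where clause that matches all given metadata fields exactly."""
    conditions = [{field: {"$eq": value}} for field, value in equals.items()]
    if not conditions:
        return None  # no filter: omit the where argument entirely
    if len(conditions) == 1:
        return conditions[0]  # a single condition needs no "$and" wrapper
    return {"$and": conditions}
```

For example, `build_where(category="api", type="reference")` yields the same structure as the complex filter above.
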
### Collection Management

```python
# List all collections
client.list_collections()

# Get collection
collection = client.get_collection("vue")

# Get count
collection.count()

# Delete collection
client.delete_collection("vue")
```

## Customization

### Use Your Own Embeddings

Chroma supports custom embedding functions:

```python
from chromadb.utils import embedding_functions

# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-ada-002"
)

collection = client.create_collection(
    name="your_skill",
    embedding_function=openai_ef
)
```

Pass the same embedding function when you later open the collection with `get_collection`, so that queries are embedded the same way as the stored documents.

**Supported embedding functions:**
- **OpenAI**: `text-embedding-ada-002` (hosted, requires an API key)
- **Cohere**: `embed-english-v2.0` (hosted, requires an API key)
- **HuggingFace**: Various models via the hosted Inference API (requires an API key)
- **Sentence Transformers**: Local models (no API key)

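An embedding function is, in essence, a callable that maps a list of texts to a list of vectors. The toy class below sketches that contract with a hashed bag-of-words embedding. It is an illustration only: `HashingEmbeddingFunction` is a made-up name, the vectors carry no real semantics, and newer Chroma releases may expect a subclass of `chromadb.EmbeddingFunction` with additional methods.

```python
import hashlib
import math


class HashingEmbeddingFunction:
    """Toy embedding: hashed bag-of-words, L2-normalized. Not for real use."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def __call__(self, input):
        # Chroma 0.4.16+ expects the parameter to be named `input`.
        return [self._embed(text) for text in input]

    def _embed(self, text: str):
        vec = [0.0] * self.dim
        for token in text.lower().split():
            # A stable hash keeps vectors identical across Python processes
            digest = hashlib.sha256(token.encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % self.dim] += 1.0
        norm = math.sqrt(sum(v * v for v in vec)) or 1.0
        return [v / norm for v in vec]
```

Under those assumptions it would be passed as `embedding_function=HashingEmbeddingFunction()` when creating the collection, in the same position as `openai_ef` above.
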
### Generate Different Skills

```bash
# Option 1: change the "--config" argument in 1_generate_skill.py
# (for example to "configs/django.json"), then re-run the script
python 1_generate_skill.py

# Option 2: use the CLI directly
skill-seekers scrape --config configs/flask.json
skill-seekers package output/flask --target chroma
```

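The two CLI steps can also be driven from Python. This is a sketch, not the actual `1_generate_skill.py`: `build_commands` and `generate_skill` are hypothetical, and the `configs/<framework>.json` and `output/<framework>` paths are assumed from the Flask example above.

```python
import subprocess


def build_commands(framework: str) -> list[list[str]]:
    """Return the scrape and package commands for one framework."""
    return [
        ["skill-seekers", "scrape", "--config", f"configs/{framework}.json"],
        ["skill-seekers", "package", f"output/{framework}", "--target", "chroma"],
    ]


def generate_skill(framework: str) -> None:
    """Run both steps in order, stopping at the first failure."""
    for cmd in build_commands(framework):
        subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting issues, and `check=True` raises if a step exits with a non-zero status.
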
### Adjust Query Parameters

In `3_query_example.py`:

```python
# Get more results
n_results=10  # Default is 5

# Include more metadata
include=["documents", "metadatas", "distances"]

# Different distance metrics
# (configure when creating collection)
metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
```

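Put together, the query-time parameters above might be wrapped like this. It is a minimal sketch rather than the code in `3_query_example.py`: `search` is a hypothetical helper, and it leaves out `where` entirely when no category is given instead of passing an empty filter.

```python
def search(collection, text, n_results=5, category=None):
    """Query a collection, optionally restricted to one metadata category."""
    kwargs = {
        "query_texts": [text],
        "n_results": n_results,
        "include": ["documents", "metadatas", "distances"],
    }
    if category is not None:
        kwargs["where"] = {"category": category}
    return collection.query(**kwargs)
```

The distance metric is not a query parameter. It is fixed when the collection is created, for example with `client.create_collection(name="vue", metadata={"hnsw:space": "cosine"})`.
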
## Performance Tips

1. **Batch Operations**: Add documents in batches for better performance
   ```python
   collection.add(
       documents=batch_docs,
       metadatas=batch_metadata,
       ids=batch_ids
   )
   ```

2. **Persistent Storage**: Use `--persist` for production so the index survives restarts
   ```bash
   python 2_upload_to_chroma.py --persist ./prod_db
   ```

3. **Custom Embeddings**: Hosted models such as OpenAI's may improve retrieval quality (costs $)
4. **Index Tuning**: Adjust HNSW parameters for speed vs accuracy

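The batching tip can be sketched as a small helper that slices the three parallel lists. `add_in_batches` and its default `batch_size` of 100 are illustrative choices, not values taken from the example scripts.

```python
def add_in_batches(collection, documents, metadatas, ids, batch_size=100):
    """Add documents in fixed-size slices and return the number of batches."""
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas, and ids must have equal length")

    batches = 0
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        collection.add(
            documents=documents[start:end],
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        batches += 1
    return batches
```

Smaller batches bound memory use per call, while larger ones reduce per-call overhead. A suitable size depends on document length and the embedding function.
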
## Troubleshooting

### Import Error
```
ModuleNotFoundError: No module named 'chromadb'
```

**Solution:**
```bash
pip install chromadb
```

### Collection Already Exists
```
Error: Collection 'vue' already exists
```

**Solution:**
```python
# Delete the existing collection, then re-run the upload
client.delete_collection("vue")

# Or, from the shell, use the --reset flag:
#   python 2_upload_to_chroma.py --reset
```

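A reset can also be written so that it works whether or not the collection already exists. The sketch below is one possible approach, not necessarily what `--reset` does internally: `reset_collection` is a hypothetical helper that first ensures the collection exists, so the delete never targets a missing name.

```python
def reset_collection(client, name: str):
    """Return a fresh, empty collection called `name`."""
    client.get_or_create_collection(name=name)  # make sure it exists
    client.delete_collection(name=name)         # drop it and its data
    return client.create_collection(name=name)  # start from scratch
```

This avoids depending on which exception `delete_collection` raises for a missing collection, which has differed between Chroma versions.
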
### Empty Results
```
Query returned empty results
```

**Possible causes:**
1. Collection empty: Check `collection.count()`
2. Query too specific: Try broader queries
3. Wrong collection name: Verify collection exists

**Debug:**
```python
# Check collection contents
collection.get()  # Get all documents

# Check embedding function
collection._embedding_function  # Should not be None
```

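The checks above can be combined into a quick diagnostic. This is a minimal sketch with a hypothetical `diagnose` helper. It relies only on `collection.count()` and `collection.get(where=..., limit=...)`, and it tests the metadata filter on its own, without any semantic search.

```python
def diagnose(collection, where=None):
    """Return a short list of findings that may explain an empty result."""
    findings = []
    total = collection.count()
    if total == 0:
        findings.append("collection is empty: re-run the upload step")
        return findings

    findings.append(f"collection holds {total} documents")
    if where is not None:
        # get() applies the filter without embedding a query
        matched = collection.get(where=where, limit=1)
        if not matched["ids"]:
            findings.append(f"no documents match filter {where!r}")
    return findings
```

If the collection has documents but none match the filter, the category value (or its spelling) is the likely culprit.
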
### Performance Issues
```
Query is slow
```

**Solutions:**
1. Use persistent storage so documents are embedded and indexed once, not on every run
2. Reduce `n_results` (fewer results = faster)
3. Add metadata filters to narrow search space
4. Check embedding latency: hosted embedding functions such as OpenAI's add a network round trip to each query, while the default local model does not

## Next Steps

1. **Try other skills**: Package your favorite documentation
2. **Build a chatbot**: Integrate with LangChain or LlamaIndex
3. **Production deployment**: Use persistent storage + API wrapper
4. **Custom embeddings**: Experiment with different models

## Resources

- **ChromaDB Docs**: https://docs.trychroma.com/
- **GitHub**: https://github.com/chroma-core/chroma
- **Discord**: https://discord.gg/MMeYNTmh3x
- **Skill Seekers**: https://github.com/yourusername/skill-seekers

## File Structure

```
chroma-example/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── 1_generate_skill.py       # Generate ChromaDB-format skill
├── 2_upload_to_chroma.py     # Create collection and upload
├── 3_query_example.py        # Query demonstrations
└── sample_output/            # Example outputs
    ├── vue-chroma.json       # Generated skill (21 docs)
    └── query_results.txt     # Sample query results
```

## Comparison: Chroma vs Weaviate

| Feature | ChromaDB | Weaviate |
|---------|----------|----------|
| **Setup** | ✅ No server needed | ⚠️ Docker/Cloud required |
| **API** | ✅ Very simple | ⚠️ More complex |
| **Performance** | ✅ Fast for < 1M docs | ✅ Scales to billions |
| **Hybrid Search** | ❌ Semantic only | ✅ Keyword + semantic |
| **Production** | ✅ Good for small-medium | ✅ Built for scale |

**Use Chroma for:** Development, prototypes, small-medium datasets (< 1M docs)
**Use Weaviate for:** Production, large datasets (> 1M docs), hybrid search

---

**Last Updated:** February 2026
**Tested With:** ChromaDB v0.4.22, Python 3.10+, skill-seekers v2.10.0