skill-seekers-reference/examples/chroma-example/README.md
yusyus 53d37e61dd docs: Add 4 comprehensive vector database examples (Weaviate, Chroma, FAISS, Qdrant)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-07 22:38:15 +03:00


# ChromaDB Vector Database Example
This example demonstrates how to use Skill Seekers with ChromaDB, the AI-native open-source embedding database. Chroma is designed to be simple, fast, and easy to use locally.
## What You'll Learn
- How to generate skills in ChromaDB format
- How to create local Chroma collections
- How to perform semantic searches
- How to filter by metadata categories
## Why ChromaDB?
- **No Server Required**: Works entirely in-process (perfect for development)
- **Simple API**: Clean Python interface, no complex setup
- **Fast**: Uses an HNSW approximate nearest-neighbor index for similarity search
- **Open Source**: Apache 2.0 licensed, community-driven
## Prerequisites
### Python Dependencies
```bash
pip install -r requirements.txt
```
That's it! No Docker, no server setup. Chroma runs entirely in your Python process.
## Step-by-Step Guide
### Step 1: Generate Skill from Documentation
First, we'll scrape Vue documentation and package it for ChromaDB:
```bash
python 1_generate_skill.py
```
This script will:
1. Scrape Vue docs (limited to 20 pages for demo)
2. Package the skill in ChromaDB format (JSON with documents + metadata + IDs)
3. Save to `output/vue-chroma.json`
**Expected Output:**
```
✅ ChromaDB data packaged successfully!
📦 Output: output/vue-chroma.json
📊 Total documents: 21
📂 Categories: overview (1), guides (8), api (12)
```
**What's in the JSON?**
```json
{
  "documents": [
    "Vue is a progressive JavaScript framework...",
    "Components are the building blocks..."
  ],
  "metadatas": [
    {
      "source": "vue",
      "category": "overview",
      "file": "SKILL.md",
      "type": "documentation",
      "version": "1.0.0"
    }
  ],
  "ids": [
    "a1b2c3d4e5f6...",
    "b2c3d4e5f6g7..."
  ],
  "collection_name": "vue"
}
```
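The excerpt above is abbreviated (it shows one metadata entry for two documents). Chroma's `collection.add` expects one metadata entry and one ID per document, so the three lists in the real file should have equal length. The following is a minimal sketch of a sanity check you could run before uploading; `validate_payload` and the sample payload are hypothetical and not part of Skill Seekers.

```python
import json


def validate_payload(payload: dict) -> int:
    """Check the parallel-list layout and return the number of documents."""
    docs = payload["documents"]
    metas = payload["metadatas"]
    ids = payload["ids"]
    if not (len(docs) == len(metas) == len(ids)):
        raise ValueError(
            f"length mismatch: {len(docs)} documents, "
            f"{len(metas)} metadatas, {len(ids)} ids"
        )
    if len(set(ids)) != len(ids):
        raise ValueError("ids must be unique")
    if not payload.get("collection_name"):
        raise ValueError("collection_name is missing")
    return len(docs)


# Tiny hypothetical payload in the same shape as the excerpt above.
sample = json.loads("""
{
  "documents": ["Vue is a progressive JavaScript framework...",
                "Components are the building blocks..."],
  "metadatas": [{"category": "overview"}, {"category": "guides"}],
  "ids": ["id-1", "id-2"],
  "collection_name": "vue"
}
""")
print(validate_payload(sample))
```

To check the generated file instead, load `output/vue-chroma.json` with `json.load` and pass the result to the same function.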
### Step 2: Create Collection and Upload
Now we'll create a ChromaDB collection and load all documents:
```bash
python 2_upload_to_chroma.py
```
This script will:
1. Create an in-memory Chroma client (or persistent with `--persist`)
2. Create a collection with the skill name
3. Add all documents with metadata and IDs
4. Verify the upload was successful
**Expected Output:**
```
📊 Creating ChromaDB client...
✅ Client created (in-memory)
📦 Creating collection: vue
✅ Collection created!
📤 Adding 21 documents to collection...
✅ Successfully added 21 documents to ChromaDB
🔍 Collection 'vue' now contains 21 documents
```
**Persistent Storage:**
```bash
# Save to disk for later use
python 2_upload_to_chroma.py --persist ./chroma_db
```
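For orientation, the upload step can be sketched roughly as below. This is an assumption about what such a script does, not the actual contents of `2_upload_to_chroma.py`; the helper names `load_payload` and `upload` are hypothetical. The client is passed in so the same function works with an in-memory client (`chromadb.Client()`) or a persistent one (`chromadb.PersistentClient(path=...)`).

```python
import json
from pathlib import Path


def load_payload(path: str) -> dict:
    """Read the JSON file written in Step 1."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def upload(client, payload: dict) -> int:
    """Add a packaged skill to a Chroma collection and return its size.

    `client` is expected to be a chromadb client object.
    """
    # get_or_create_collection reuses the collection if it already exists.
    collection = client.get_or_create_collection(name=payload["collection_name"])
    collection.add(
        documents=payload["documents"],
        metadatas=payload["metadatas"],
        ids=payload["ids"],
    )
    return collection.count()


# Example usage (requires `pip install chromadb`):
#   import chromadb
#   client = chromadb.PersistentClient(path="./chroma_db")  # or chromadb.Client()
#   print(upload(client, load_payload("output/vue-chroma.json")))
```

Because no embedding function is passed here, Chroma would embed the documents with its default model when `add` is called.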
### Step 3: Query and Search
Now search your knowledge base!
```bash
python 3_query_example.py
```
**With persistent storage:**
```bash
python 3_query_example.py --persist ./chroma_db
```
This script demonstrates:
1. **Semantic Search**: Natural language queries
2. **Metadata Filtering**: Filter by category
3. **Top-K Results**: Get most relevant documents
4. **Distance Scoring**: See how relevant each result is
**Example Queries:**
**Query 1: Semantic Search**
```
Query: "How do I create a Vue component?"
Top 3 results:
1. [Distance: 0.234] guides/components.md
   Components are reusable Vue instances with a name. You can use them as custom
   elements inside a root Vue instance...
2. [Distance: 0.298] api/component_api.md
   The component API reference describes all available options for defining
   components using the Options API...
3. [Distance: 0.312] guides/single_file_components.md
   Single-File Components (SFCs) allow you to define templates, logic, and
   styling in a single .vue file...
```
**Query 2: Filtered Search**
```
Query: "reactivity"
Filter: category = "api"
Results:
1. ref() - Create reactive references
2. reactive() - Create reactive proxies
3. computed() - Create computed properties
```
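`collection.query` returns a column-oriented dictionary: each key (`ids`, `documents`, `metadatas`, `distances`) holds one inner list per query text. Producing ranked output like Query 1 means zipping those lists back together. Below is a minimal sketch; `flatten_results` and the sample result are hypothetical, and the real script may format its output differently.

```python
def flatten_results(results: dict, query_index: int = 0) -> list[dict]:
    """Turn a column-oriented query result into one dict per hit."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [
        {"id": i, "distance": d, "metadata": m, "text": t}
        for i, t, m, d in zip(ids, docs, metas, dists)
    ]


# Hypothetical result for a single query, shaped like collection.query() output.
results = {
    "ids": [["doc-7", "doc-3"]],
    "documents": [["Components are reusable...", "The component API reference..."]],
    "metadatas": [[{"category": "guides"}, {"category": "api"}]],
    "distances": [[0.234, 0.298]],
}

for rank, hit in enumerate(flatten_results(results), start=1):
    category = hit["metadata"]["category"]
    print(f"{rank}. [Distance: {hit['distance']:.3f}] {category}: {hit['text']}")
```

Hits arrive already sorted by ascending distance, so the enumeration order is the ranking.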
## Understanding ChromaDB Features
### Semantic Search
Chroma automatically:
- Generates embeddings for your documents (by default with the `all-MiniLM-L6-v2` Sentence Transformers model)
- Indexes them for fast similarity search
- Finds semantically similar content
**Distance Scores** (rules of thumb; actual ranges depend on the embedding model and the distance metric):
- Lower = more similar
- `0.0` = identical
- `< 0.5` = very relevant
- `0.5-1.0` = somewhat relevant
- `> 1.0` = less relevant
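To see why these thresholds depend on the metric, it helps to compute the distances by hand. The sketch below assumes Chroma's documented definitions: `l2` (the default) is the squared Euclidean distance and `cosine` is one minus cosine similarity. The two-dimensional vectors are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, as used by the "l2" space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """One minus cosine similarity, as used by the "cosine" space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


query = [1.0, 0.0]
candidates = {
    "same": [1.0, 0.0],        # identical direction and length
    "near": [0.8, 0.6],        # unit vector, close to the query
    "orthogonal": [0.0, 1.0],  # unit vector, unrelated direction
}

for name, vec in candidates.items():
    print(name, round(squared_l2(query, vec), 3), round(cosine_distance(query, vec), 3))
```

For unit-length vectors the squared L2 distance is exactly twice the cosine distance, so a cutoff that is reasonable for one metric is not interchangeable with the other.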
### Metadata Filtering
Filter results before semantic search:
```python
collection.query(
    query_texts=["your query"],
    n_results=5,
    where={"category": "api"}
)
```
**Supported operators:**
- `$eq`: Equal to
- `$ne`: Not equal to
- `$gt`, `$gte`: Greater than (or equal)
- `$lt`, `$lte`: Less than (or equal)
- `$in`: In list
- `$nin`: Not in list
**Complex filters:**
```python
where={
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}}
    ]
}
```
### Collection Management
```python
# List all collections
client.list_collections()
# Get collection
collection = client.get_collection("vue")
# Get count
collection.count()
# Delete collection
client.delete_collection("vue")
```
## Customization
### Use Your Own Embeddings
Chroma supports custom embedding functions:
```python
from chromadb.utils import embedding_functions
# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-ada-002"
)
collection = client.create_collection(
    name="your_skill",
    embedding_function=openai_ef
)
```
**Supported embedding functions:**
- **OpenAI**: `text-embedding-ada-002` (hosted API, requires an API key)
- **Cohere**: `embed-english-v2.0`
- **HuggingFace**: Various models (local, no API key)
- **Sentence Transformers**: Local models
### Generate Different Skills
```bash
# Change the config argument in 1_generate_skill.py (Python source), e.g.:
#   "--config", "configs/django.json",  # Your framework
# Or use CLI directly
skill-seekers scrape --config configs/flask.json
skill-seekers package output/flask --target chroma
```
### Adjust Query Parameters
In `3_query_example.py`:
```python
# Get more results
n_results=10 # Default is 5
# Include more metadata
include=["documents", "metadatas", "distances"]
# Different distance metrics
# (configure when creating collection)
metadata={"hnsw:space": "cosine"} # or "l2", "ip"
```
## Performance Tips
1. **Batch Operations**: Add documents in batches for better performance
   ```python
   collection.add(
       documents=batch_docs,
       metadatas=batch_metadata,
       ids=batch_ids
   )
   ```
2. **Persistent Storage**: Use `--persist` for production
   ```bash
   python 2_upload_to_chroma.py --persist ./prod_db
   ```
3. **Custom Embeddings**: Hosted models such as OpenAI's can improve retrieval quality but are billed per request
4. **Index Tuning**: Adjust HNSW parameters for speed vs accuracy
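The batching in tip 1 can be sketched as a small generator that slices the three parallel lists in step. The name `batches` and the default size of 100 are illustrative assumptions; pick a size that suits your data and memory budget.

```python
def batches(docs, metas, ids, size=100):
    """Yield (documents, metadatas, ids) slices of at most `size` items."""
    for start in range(0, len(ids), size):
        end = start + size
        yield docs[start:end], metas[start:end], ids[start:end]


# Hypothetical data: 21 documents, matching the demo skill's size.
docs = [f"document {n}" for n in range(21)]
metas = [{"category": "api"} for _ in docs]
ids = [f"id-{n}" for n in range(21)]

for batch_docs, batch_metadata, batch_ids in batches(docs, metas, ids, size=10):
    # With a real collection, each batch would be passed to
    # collection.add(documents=batch_docs, metadatas=batch_metadata, ids=batch_ids)
    print(len(batch_ids))
```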
## Troubleshooting
### Import Error
```
ModuleNotFoundError: No module named 'chromadb'
```
**Solution:**
```bash
pip install chromadb
```
### Collection Already Exists
```
Error: Collection 'vue' already exists
```
**Solution:**
```python
# Delete existing collection
client.delete_collection("vue")
# Or rerun the upload script with the --reset flag (shell command):
#   python 2_upload_to_chroma.py --reset
```
### Empty Results
```
Query returned empty results
```
**Possible causes:**
1. Collection empty: Check `collection.count()`
2. Query too specific: Try broader queries
3. Wrong collection name: Verify collection exists
**Debug:**
```python
# Check collection contents
collection.get() # Get all documents
# Check embedding function
collection._embedding_function # Should not be None
```
### Performance Issues
```
Query is slow
```
**Solutions:**
1. Use persistent storage so documents are embedded and indexed once instead of on every run
2. Reduce `n_results` (fewer results = faster)
3. Add metadata filters to narrow search space
4. Check the embedding step: each text query is embedded before the search, so a hosted embedding API adds a network round trip per query
## Next Steps
1. **Try other skills**: Package your favorite documentation
2. **Build a chatbot**: Integrate with LangChain or LlamaIndex
3. **Production deployment**: Use persistent storage + API wrapper
4. **Custom embeddings**: Experiment with different models
## Resources
- **ChromaDB Docs**: https://docs.trychroma.com/
- **GitHub**: https://github.com/chroma-core/chroma
- **Discord**: https://discord.gg/MMeYNTmh3x
- **Skill Seekers**: https://github.com/yourusername/skill-seekers
## File Structure
```
chroma-example/
├── README.md                 # This file
├── requirements.txt          # Python dependencies
├── 1_generate_skill.py       # Generate ChromaDB-format skill
├── 2_upload_to_chroma.py     # Create collection and upload
├── 3_query_example.py        # Query demonstrations
└── sample_output/            # Example outputs
    ├── vue-chroma.json       # Generated skill (21 docs)
    └── query_results.txt     # Sample query results
```
## Comparison: Chroma vs Weaviate
| Feature | ChromaDB | Weaviate |
|---------|----------|----------|
| **Setup** | ✅ No server needed | ⚠️ Docker/Cloud required |
| **API** | ✅ Very simple | ⚠️ More complex |
| **Performance** | ✅ Fast for < 1M docs | ✅ Scales to billions |
| **Hybrid Search** | ❌ Semantic only | ✅ Keyword + semantic |
| **Production** | ✅ Good for small-medium | ✅ Built for scale |
**Use Chroma for:** Development, prototypes, small-medium datasets (< 1M docs)
**Use Weaviate for:** Production, large datasets (> 1M docs), hybrid search
---
**Last Updated:** February 2026
**Tested With:** ChromaDB v0.4.22, Python 3.10+, skill-seekers v2.10.0