docs: Add 4 comprehensive vector database examples (Weaviate, Chroma, FAISS, Qdrant)
Created complete working examples for all 4 vector databases with RAG adaptors:

Weaviate Example:
- Comprehensive README with hybrid search guide
- 3 Python scripts (generate, upload, query)
- Sample outputs and query results
- Covers hybrid search, filtering, schema design

Chroma Example:
- Simple, local-first approach
- In-memory and persistent storage options
- Semantic search and metadata filtering
- Comparison with Weaviate

FAISS Example:
- Facebook AI Similarity Search integration
- OpenAI embeddings generation
- Index building and persistence
- Performance-focused for scale

Qdrant Example:
- Advanced filtering capabilities
- Production-ready features
- Complex query patterns
- Rust-based performance

Each example includes:
- Detailed README with setup and troubleshooting
- requirements.txt with dependencies
- 3 working Python scripts
- Sample outputs directory

Total files: 20 (4 examples × 5 files each)
Documentation: 4 comprehensive READMEs (~800 lines total)

Phase 2 of optional enhancements complete.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
examples/chroma-example/README.md
@@ -0,0 +1,394 @@
# ChromaDB Vector Database Example

This example demonstrates how to use Skill Seekers with ChromaDB, the AI-native open-source embedding database. Chroma is designed to be simple, fast, and easy to use locally.

## What You'll Learn

- How to generate skills in ChromaDB format
- How to create local Chroma collections
- How to perform semantic searches
- How to filter by metadata categories

## Why ChromaDB?

- **No Server Required**: Works entirely in-process (perfect for development)
- **Simple API**: Clean Python interface, no complex setup
- **Fast**: Built for speed, with HNSW indexing for similarity search
- **Open Source**: Apache 2.0 licensed, community-driven

## Prerequisites

### Python Dependencies

```bash
pip install -r requirements.txt
```

That's it! No Docker, no server setup. Chroma runs entirely in your Python process.

## Step-by-Step Guide

### Step 1: Generate Skill from Documentation

First, we'll scrape Vue documentation and package it for ChromaDB:

```bash
python 1_generate_skill.py
```

This script will:
1. Scrape Vue docs (limited to 20 pages for demo)
2. Package the skill in ChromaDB format (JSON with documents + metadata + IDs)
3. Save to `output/vue-chroma.json`

**Expected Output:**
```
✅ ChromaDB data packaged successfully!
📦 Output: output/vue-chroma.json
📊 Total documents: 21
📂 Categories: overview (1), guides (8), api (12)
```

**What's in the JSON?**
```json
{
  "documents": [
    "Vue is a progressive JavaScript framework...",
    "Components are the building blocks..."
  ],
  "metadatas": [
    {
      "source": "vue",
      "category": "overview",
      "file": "SKILL.md",
      "type": "documentation",
      "version": "1.0.0"
    }
  ],
  "ids": [
    "a1b2c3d4e5f6...",
    "b2c3d4e5f6g7..."
  ],
  "collection_name": "vue"
}
```

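Before uploading, it can help to sanity-check the packaged file. The sketch below assumes the three lists are parallel (one metadata entry and one ID per document), which is what `collection.add` expects; the JSON listing above is abbreviated, so its list lengths differ. The helper name `validate_chroma_package` and the inline sample data are hypothetical and not part of Skill Seekers.

```python
import json


def validate_chroma_package(data: dict) -> int:
    """Check that documents, metadatas, and ids line up.

    Returns the number of documents, or raises ValueError on a problem.
    """
    for key in ("documents", "metadatas", "ids", "collection_name"):
        if key not in data:
            raise ValueError(f"missing key: {key}")

    n_docs = len(data["documents"])
    if len(data["metadatas"]) != n_docs or len(data["ids"]) != n_docs:
        raise ValueError("documents, metadatas, and ids must have the same length")
    if len(set(data["ids"])) != n_docs:
        raise ValueError("ids must be unique")
    return n_docs


# Tiny stand-in for output/vue-chroma.json (illustrative contents only).
sample = json.loads("""
{
  "documents": ["Vue is a progressive JavaScript framework...",
                "Components are the building blocks..."],
  "metadatas": [{"category": "overview"}, {"category": "guides"}],
  "ids": ["doc-1", "doc-2"],
  "collection_name": "vue"
}
""")

print(validate_chroma_package(sample))  # prints 2
```

For the real file, replace the inline sample with `json.load(open("output/vue-chroma.json"))`.
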
### Step 2: Create Collection and Upload

Now we'll create a ChromaDB collection and load all documents:

```bash
python 2_upload_to_chroma.py
```

This script will:
1. Create an in-memory Chroma client (or persistent with `--persist`)
2. Create a collection with the skill name
3. Add all documents with metadata and IDs
4. Verify the upload was successful

**Expected Output:**
```
📊 Creating ChromaDB client...
✅ Client created (in-memory)

📦 Creating collection: vue
✅ Collection created!

📤 Adding 21 documents to collection...
✅ Successfully added 21 documents to ChromaDB

🔍 Collection 'vue' now contains 21 documents
```

**Persistent Storage:**
```bash
# Save to disk for later use
python 2_upload_to_chroma.py --persist ./chroma_db
```

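The four steps listed above can be sketched roughly as follows. This is not the actual contents of `2_upload_to_chroma.py`; the function name `upload_skill` is hypothetical, and the sketch assumes `data` has the layout shown in Step 1. It takes any Chroma client, so the same code would work with `chromadb.Client()` (in-memory) or `chromadb.PersistentClient(path="./chroma_db")` (on disk).

```python
def upload_skill(client, data: dict) -> int:
    """Load a packaged skill into a Chroma collection and return its size.

    `client` is expected to be a Chroma client, for example
    chromadb.Client() or chromadb.PersistentClient(path="./chroma_db").
    """
    # get_or_create_collection avoids an error if the collection exists.
    collection = client.get_or_create_collection(name=data["collection_name"])

    # Chroma embeds the documents with the collection's embedding function.
    collection.add(
        documents=data["documents"],
        metadatas=data["metadatas"],
        ids=data["ids"],
    )

    # Verify the upload by asking the collection how many items it holds.
    return collection.count()
```

A call might look like `upload_skill(chromadb.Client(), json.load(open("output/vue-chroma.json")))`, with `chromadb` and `json` imported first.
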
### Step 3: Query and Search

Now search your knowledge base!

```bash
python 3_query_example.py
```

**With persistent storage:**
```bash
python 3_query_example.py --persist ./chroma_db
```

This script demonstrates:
1. **Semantic Search**: Natural language queries
2. **Metadata Filtering**: Filter by category
3. **Top-K Results**: Get most relevant documents
4. **Distance Scoring**: See how relevant each result is

**Example Queries:**

**Query 1: Semantic Search**
```
Query: "How do I create a Vue component?"
Top 3 results:

1. [Distance: 0.234] guides/components.md
   Components are reusable Vue instances with a name. You can use them as custom
   elements inside a root Vue instance...

2. [Distance: 0.298] api/component_api.md
   The component API reference describes all available options for defining
   components using the Options API...

3. [Distance: 0.312] guides/single_file_components.md
   Single-File Components (SFCs) allow you to define templates, logic, and
   styling in a single .vue file...
```

**Query 2: Filtered Search**
```
Query: "reactivity"
Filter: category = "api"

Results:
1. ref() - Create reactive references
2. reactive() - Create reactive proxies
3. computed() - Create computed properties
```

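`collection.query` returns a dictionary of nested lists: one inner list per query text, so the hits for the first query are at index `0`. The sketch below turns such a result into ranked lines similar to the output above. The function name `format_results` and the hand-built `fake_results` are illustrative assumptions, and the actual script may format its output differently.

```python
def format_results(results: dict, query_index: int = 0) -> list[str]:
    """Turn one query's hits into ranked, human-readable lines."""
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]

    lines = []
    for rank, (doc, meta, dist) in enumerate(zip(docs, metas, dists), start=1):
        label = f'{meta.get("category", "?")}/{meta.get("file", "?")}'
        # Show only the first 60 characters of each document as a preview.
        lines.append(f"{rank}. [Distance: {dist:.3f}] {label}: {doc[:60]}")
    return lines


# Hand-built stand-in with the same nesting as a real query result.
fake_results = {
    "ids": [["a1", "b2"]],
    "documents": [[
        "Components are reusable Vue instances with a name.",
        "The component API reference describes all available options.",
    ]],
    "metadatas": [[
        {"category": "guides", "file": "components.md"},
        {"category": "api", "file": "component_api.md"},
    ]],
    "distances": [[0.234, 0.298]],
}

for line in format_results(fake_results):
    print(line)
```
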
## Understanding ChromaDB Features

### Semantic Search

Chroma automatically:
- Generates embeddings for your documents (using its default model, `all-MiniLM-L6-v2`, unless you supply an embedding function)
- Indexes them for fast similarity search
- Finds semantically similar content

**Distance Scores** (a rough guide; actual ranges depend on the distance metric and the embedding model):
- Lower = more similar
- `0.0` = identical
- `< 0.5` = very relevant
- `0.5-1.0` = somewhat relevant
- `> 1.0` = less relevant

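To make "lower = more similar" concrete, here is a small worked example in plain Python using toy two-dimensional vectors (real embeddings have hundreds of dimensions). It computes the two most common Chroma metrics by hand: squared L2, which is the default `hnsw:space`, and cosine distance. The vectors and helper names are made up for illustration.

```python
import math


def squared_l2(a, b):
    """Squared Euclidean distance, the formula behind Chroma's "l2" space."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def cosine_distance(a, b):
    """1 - cosine similarity, the formula behind Chroma's "cosine" space."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


query = [1.0, 0.0]
close = [0.9, 0.1]   # points in almost the same direction as the query
far = [0.0, 1.0]     # orthogonal to the query

print("l2     close:", squared_l2(query, close), " far:", squared_l2(query, far))
print("cosine close:", cosine_distance(query, close), " far:", cosine_distance(query, far))
```

Under both metrics the `close` vector gets the smaller distance, but the absolute values differ, which is why fixed thresholds should be treated as a rule of thumb.
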
### Metadata Filtering

Restrict a search to documents whose metadata matches a `where` filter:
```python
collection.query(
    query_texts=["your query"],
    n_results=5,
    where={"category": "api"}
)
```

**Supported operators:**
- `$eq`: Equal to
- `$ne`: Not equal to
- `$gt`, `$gte`: Greater than (or equal)
- `$lt`, `$lte`: Less than (or equal)
- `$in`: In list
- `$nin`: Not in list

**Complex filters** (combine conditions with `$and` or `$or`):
```python
where={
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}}
    ]
}
```

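To illustrate how these filters read, the sketch below evaluates a `where`-style filter against a single metadata dictionary in plain Python. It is only a model of the semantics for experimenting with filter shapes, not Chroma's implementation. As a simplifying assumption, a missing metadata key never matches; Chroma's own handling of missing keys may differ. The names `matches` and `_OPS` are hypothetical.

```python
import operator

# Comparison operators from the list above, mapped to Python equivalents.
_OPS = {
    "$eq": operator.eq,
    "$ne": operator.ne,
    "$gt": operator.gt,
    "$gte": operator.ge,
    "$lt": operator.lt,
    "$lte": operator.le,
    "$in": lambda value, options: value in options,
    "$nin": lambda value, options: value not in options,
}


def matches(metadata: dict, where: dict) -> bool:
    """Return True if `metadata` satisfies the where-style filter."""
    for key, condition in where.items():
        if key == "$and":
            if not all(matches(metadata, sub) for sub in condition):
                return False
        elif key == "$or":
            if not any(matches(metadata, sub) for sub in condition):
                return False
        else:
            # A bare value is shorthand for {"$eq": value}.
            if not isinstance(condition, dict):
                condition = {"$eq": condition}
            if key not in metadata:
                return False  # simplification: missing keys never match
            for op, target in condition.items():
                if not _OPS[op](metadata[key], target):
                    return False
    return True


doc_meta = {"category": "api", "type": "reference", "version": "1.0.0"}
complex_filter = {
    "$and": [
        {"category": {"$eq": "api"}},
        {"type": {"$eq": "reference"}},
    ]
}
print(matches(doc_meta, complex_filter))          # True
print(matches(doc_meta, {"category": "guides"}))  # False
```
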
### Collection Management

```python
# List all collections
client.list_collections()

# Get collection
collection = client.get_collection("vue")

# Get count
collection.count()

# Delete collection
client.delete_collection("vue")
```

## Customization

### Use Your Own Embeddings

Chroma supports custom embedding functions:

```python
from chromadb.utils import embedding_functions

# OpenAI embeddings
openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-key",
    model_name="text-embedding-ada-002"
)

collection = client.create_collection(
    name="your_skill",
    embedding_function=openai_ef
)
```

**Supported embedding functions:**
- **OpenAI**: `text-embedding-ada-002` (hosted, requires an API key)
- **Cohere**: `embed-english-v2.0`
- **HuggingFace**: Various models (local, no API key)
- **Sentence Transformers**: Local models

### Generate Different Skills

```bash
# Change the config in 1_generate_skill.py
"--config", "configs/django.json",  # Your framework

# Or use CLI directly
skill-seekers scrape --config configs/flask.json
skill-seekers package output/flask --target chroma
```

### Adjust Query Parameters

In `3_query_example.py`:

```python
# Get more results
n_results=10  # Default is 5

# Include more metadata
include=["documents", "metadatas", "distances"]

# Different distance metrics
# (configure when creating collection)
metadata={"hnsw:space": "cosine"}  # or "l2", "ip"
```

## Performance Tips

1. **Batch Operations**: Add documents in batches for better performance
   ```python
   collection.add(
       documents=batch_docs,
       metadatas=batch_metadata,
       ids=batch_ids
   )
   ```

2. **Persistent Storage**: Use `--persist` for production
   ```bash
   python 2_upload_to_chroma.py --persist ./prod_db
   ```

3. **Custom Embeddings**: A hosted model such as OpenAI's may improve retrieval quality, at a per-request cost
4. **Index Tuning**: Adjust HNSW parameters for speed vs accuracy

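The batching tip can be sketched as a small helper that slices the three parallel lists and calls `collection.add` once per slice. The name `add_in_batches` and the default `batch_size` of 100 are arbitrary assumptions for illustration; a suitable size depends on document length and the embedding function in use.

```python
def add_in_batches(collection, documents, metadatas, ids, batch_size=100):
    """Add parallel lists to a Chroma collection in fixed-size slices.

    Returns the number of documents submitted.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if not (len(documents) == len(metadatas) == len(ids)):
        raise ValueError("documents, metadatas, and ids must have the same length")

    total = 0
    for start in range(0, len(documents), batch_size):
        end = start + batch_size
        batch_docs = documents[start:end]
        collection.add(
            documents=batch_docs,
            metadatas=metadatas[start:end],
            ids=ids[start:end],
        )
        total += len(batch_docs)
    return total
```

With a real collection this would be called as `add_in_batches(collection, data["documents"], data["metadatas"], data["ids"])`.
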
## Troubleshooting

### Import Error
```
ModuleNotFoundError: No module named 'chromadb'
```

**Solution:**
```bash
pip install chromadb
```

### Collection Already Exists
```
Error: Collection 'vue' already exists
```

**Solution:**
```python
# Delete the existing collection, then re-run the upload
client.delete_collection("vue")

# Or, from the shell, re-run the upload with the --reset flag:
#   python 2_upload_to_chroma.py --reset
```

### Empty Results
```
Query returned empty results
```

**Possible causes:**
1. Collection empty: Check `collection.count()`
2. Query too specific: Try broader queries
3. Wrong collection name: Verify collection exists

**Debug:**
```python
# Check collection contents
collection.get()  # Get all documents

# Check embedding function (private attribute, may change between versions)
collection._embedding_function  # Should not be None
```

### Performance Issues
```
Query is slow
```

**Solutions:**
1. Use persistent storage so the collection is not rebuilt and re-embedded on every run
2. Reduce `n_results` (fewer results = faster)
3. Add metadata filters to narrow search space
4. Check embedding time: with a hosted embedding function such as OpenAI, every query includes a network call, so a local model may respond faster

## Next Steps

1. **Try other skills**: Package your favorite documentation
2. **Build a chatbot**: Integrate with LangChain or LlamaIndex
3. **Production deployment**: Use persistent storage + API wrapper
4. **Custom embeddings**: Experiment with different models

## Resources

- **ChromaDB Docs**: https://docs.trychroma.com/
- **GitHub**: https://github.com/chroma-core/chroma
- **Discord**: https://discord.gg/MMeYNTmh3x
- **Skill Seekers**: https://github.com/yourusername/skill-seekers

## File Structure

```
chroma-example/
├── README.md                  # This file
├── requirements.txt           # Python dependencies
├── 1_generate_skill.py        # Generate ChromaDB-format skill
├── 2_upload_to_chroma.py      # Create collection and upload
├── 3_query_example.py         # Query demonstrations
└── sample_output/             # Example outputs
    ├── vue-chroma.json        # Generated skill (21 docs)
    └── query_results.txt      # Sample query results
```

## Comparison: Chroma vs Weaviate

| Feature | ChromaDB | Weaviate |
|---------|----------|----------|
| **Setup** | ✅ No server needed | ⚠️ Docker/Cloud required |
| **API** | ✅ Very simple | ⚠️ More complex |
| **Performance** | ✅ Fast for < 1M docs | ✅ Scales to billions |
| **Hybrid Search** | ❌ Semantic only | ✅ Keyword + semantic |
| **Production** | ✅ Good for small-medium | ✅ Built for scale |

**Use Chroma for:** Development, prototypes, small-medium datasets (< 1M docs)
**Use Weaviate for:** Production, large datasets (> 1M docs), hybrid search

---

**Last Updated:** February 2026
**Tested With:** ChromaDB v0.4.22, Python 3.10+, skill-seekers v2.10.0