skill-seekers-reference/examples/pinecone-upsert/README.md

# Pinecone Upsert Example

Complete example showing how to upsert Skill Seekers documents to Pinecone and perform semantic search.

## What This Example Does

1. **Creates** a Pinecone serverless index
2. **Loads** Skill Seekers-generated documents (LangChain format)
3. **Generates** embeddings with OpenAI
4. **Upserts** documents to Pinecone with metadata
5. **Demonstrates** semantic search capabilities
6. **Provides** interactive search mode

## Prerequisites

```bash
# Install dependencies
pip install pinecone-client openai

# Set API keys
export PINECONE_API_KEY=your-pinecone-api-key
export OPENAI_API_KEY=sk-...
```

## Generate Documents

First, generate LangChain-format documents using Skill Seekers:

```bash
# Option 1: Use preset config (e.g., Django)
skill-seekers scrape --config configs/django.json
skill-seekers package output/django --target langchain

# Option 2: From GitHub repo
skill-seekers github --repo django/django --name django
skill-seekers package output/django --target langchain

# Output: output/django-langchain.json
```

## Run the Example

```bash
cd examples/pinecone-upsert

# Run the quickstart script
python quickstart.py
```

## What You'll See

1. **Index creation** (if it doesn't exist)
2. **Documents loaded** with category breakdown
3. **Batch upsert** with progress tracking
4. **Example queries** demonstrating semantic search
5. **Interactive search mode** for your own queries

## Example Output

```
============================================================
PINECONE UPSERT QUICKSTART
============================================================

Step 1: Creating Pinecone index...
✅ Index created: skill-seekers-demo

Step 2: Loading documents...
✅ Loaded 180 documents
   Categories: {'api': 38, 'guides': 45, 'models': 42, 'overview': 1, ...}

Step 3: Upserting to Pinecone...
Upserting 180 documents...
Batch size: 100
  Upserted 100/180 documents...
  Upserted 180/180 documents...
✅ Upserted all documents to Pinecone
   Total vectors in index: 180

Step 4: Running example queries...
============================================================

QUERY: How do I create a Django model?
------------------------------------------------------------
  Score: 0.892
  Category: models
  Text: Django models are Python classes that define the structure of your database tables...

  Score: 0.854
  Category: api
  Text: To create a model, inherit from django.db.models.Model and define fields...

============================================================
INTERACTIVE SEMANTIC SEARCH
============================================================
Search the documentation (type 'quit' to exit)

Query: What are Django views?
```

## Features Demonstrated

- **Serverless Index** - Auto-scaling Pinecone infrastructure
- **Batch Upsertion** - Efficient bulk loading (100 docs/batch)
- **Metadata Filtering** - Category-based search filters
- **Semantic Search** - Vector similarity matching
- **Interactive Mode** - Real-time query interface

## Files in This Example

- `quickstart.py` - Complete working example
- `README.md` - This file
- `requirements.txt` - Python dependencies

## Cost Estimate

For 1000 documents:
- **Embeddings:** ~$0.01 (OpenAI ada-002)
- **Storage:** ~$0.03/month (Pinecone serverless)
- **Queries:** ~$0.025 per 100k queries

**Total first month:** ~$0.04 + query costs

## Customization Options

### Change Index Name

```python
INDEX_NAME = "my-custom-index"  # Line 215
```

### Adjust Batch Size

```python
batch_upsert(index, openai_client, documents, batch_size=50)  # Line 239
```

### Filter by Category

```python
matches = semantic_search(
    index=index,
    openai_client=openai_client,
    query="your query",
    category="models"  # Only search in "models" category
)
```

### Use Different Embedding Model

```python
# In create_embeddings() function
response = openai_client.embeddings.create(
    model="text-embedding-3-small",  # Cheaper, smaller dimension
    input=texts
)

# Update index dimension to 1536 (for text-embedding-3-small)
create_index(pc, INDEX_NAME, dimension=1536)
```

## Troubleshooting

**"Index already exists"**
- Normal message if you've run the script before
- The script will reuse the existing index

**"PINECONE_API_KEY not set"**
- Get API key from: https://app.pinecone.io/
- Set environment variable: `export PINECONE_API_KEY=your-key`

**"OPENAI_API_KEY not set"**
- Get API key from: https://platform.openai.com/api-keys
- Set environment variable: `export OPENAI_API_KEY=sk-...`

**"Documents not found"**
- Make sure you've generated documents first (see "Generate Documents" above)
- Check the `DOCS_PATH` in `quickstart.py` matches your output location

**"Rate limit exceeded"**
- OpenAI or Pinecone rate limit hit
- Reduce batch_size: `batch_size=50` or `batch_size=25`
- Add delays between batches

## Advanced Usage

### Load Existing Index

```python
from pinecone import Pinecone

pc = Pinecone(api_key="your-api-key")
index = pc.Index("skill-seekers-demo")

# Query immediately (no need to re-upsert)
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
```

### Update Existing Documents

```python
# Upsert with same ID to update
index.upsert(vectors=[{
    "id": "doc_123",
    "values": new_embedding,
    "metadata": updated_metadata
}])
```

### Delete Documents

```python
# Delete by ID
index.delete(ids=["doc_123", "doc_456"])

# Delete by metadata filter
index.delete(filter={"category": {"$eq": "deprecated"}})

# Delete all (namespace)
index.delete(delete_all=True)
```

### Use Namespaces

```python
# Upsert to namespace
index.upsert(vectors=vectors, namespace="production")

# Query specific namespace
results = index.query(
    vector=query_embedding,
    namespace="production",
    top_k=5
)
```

## Related Examples

- [LangChain RAG Pipeline](../langchain-rag-pipeline/)
- [LlamaIndex Query Engine](../llama-index-query-engine/)

---

**Need help?** [GitHub Discussions](https://github.com/yusufkaraaslan/Skill_Seekers/discussions)