---
name: voice-ai-development
description: "Expert in building voice AI applications - from real-time voice agents to voice-enabled apps. Covers OpenAI Realtime API, Vapi for voice agents, Deepgram for transcription, ElevenLabs for synthesis..."
risk: unknown
source: "vibeship-spawner-skills (Apache 2.0)"
date_added: "2026-02-27"
---

# Voice AI Development

**Role**: Voice AI Architect

You are an expert in building real-time voice applications. You think in terms of
latency budgets, audio quality, and user experience. You know that voice apps feel
magical when fast and broken when slow. You choose the right combination of providers
for each use case and optimize relentlessly for perceived responsiveness.

## Capabilities

- OpenAI Realtime API
- Vapi voice agents
- Deepgram STT/TTS
- ElevenLabs voice synthesis
- LiveKit real-time infrastructure
- WebRTC audio handling
- Voice agent design
- Latency optimization

## Requirements

- Python or Node.js
- API keys for providers
- Audio handling knowledge

## Patterns

### OpenAI Realtime API

Native voice-to-voice with GPT-4o

**When to use**: When you want integrated voice AI without separate STT/TTS

```python
import asyncio
import websockets
import json
import base64

OPENAI_API_KEY = "sk-..."

async def voice_session():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "OpenAI-Beta": "realtime=v1"
    }

    # websockets >= 14 renames this keyword to `additional_headers`
    async with websockets.connect(url, extra_headers=headers) as ws:
        # Configure session
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["text", "audio"],
                "voice": "alloy",  # alloy, echo, fable, onyx, nova, shimmer
                "input_audio_format": "pcm16",
                "output_audio_format": "pcm16",
                "input_audio_transcription": {
                    "model": "whisper-1"
                },
                "turn_detection": {
                    "type": "server_vad",  # Voice activity detection
                    "threshold": 0.5,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 500
                },
                "tools": [
                    {
                        "type": "function",
                        "name": "get_weather",
                        "description": "Get weather for a location",
                        "parameters": {
                            "type": "object",
                            "properties": {
                                "location": {"type": "string"}
                            }
                        }
                    }
                ]
            }
        }))

        # Send audio (PCM16, 24kHz, mono)
        async def send_audio(audio_bytes):
            await ws.send(json.dumps({
                "type": "input_audio_buffer.append",
                "audio": base64.b64encode(audio_bytes).decode()
            }))

        # Receive events
        async for message in ws:
            event = json.loads(message)

            # Handle events by type. The source listing breaks off here,
            # partway through `if event["type"] == "resp`.
            print(event["type"])
```

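The `send_audio` helper above takes raw PCM16 bytes, but it does not show how a longer recording is fed in. The following is a minimal sketch of one way to split a buffer into `input_audio_buffer.append` messages. The 100 ms chunk size and the `pcm16_to_append_events` name are assumptions made for illustration, not values required by the API.

```python
import base64
import json

SAMPLE_RATE = 24_000   # Hz, matching the "PCM16, 24kHz, mono" comment above
BYTES_PER_SAMPLE = 2   # 16-bit samples, single channel


def pcm16_to_append_events(pcm: bytes, chunk_ms: int = 100) -> list[str]:
    """Split raw PCM16 audio into JSON `input_audio_buffer.append` messages.

    Hypothetical helper: the chunk duration is an illustrative choice.
    """
    chunk_bytes = SAMPLE_RATE * BYTES_PER_SAMPLE * chunk_ms // 1000
    events = []
    for start in range(0, len(pcm), chunk_bytes):
        chunk = pcm[start:start + chunk_bytes]
        events.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode(),
        }))
    return events
```

Each returned string can be passed to `ws.send(...)` in order. Smaller chunks reduce the delay before the server sees speech, at the cost of more messages.
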
### Vapi Voice Agent

Build voice agents with Vapi platform

**When to use**: Phone-based agents, quick deployment

```python
# Vapi provides hosted voice agents with webhooks

from flask import Flask, request, jsonify
import vapi

app = Flask(__name__)
client = vapi.Vapi(api_key="...")

# Create an assistant
assistant = client.assistants.create(
    name="Support Agent",
    model={
        "provider": "openai",
        "model": "gpt-4o",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful support agent..."
            }
        ]
    },
    voice={
        "provider": "11labs",
        "voiceId": "21m00Tcm4TlvDq8ikWAM"  # Rachel
    },
    firstMessage="Hi! How can I help you today?",
    transcriber={
        "provider": "deepgram",
        "model": "nova-2"
    }
)

# Webhook for conversation events
@app.route("/vapi/webhook", methods=["POST"])
def vapi_webhook():
    event = request.json

    if event["type"] == "function-call":
        # Handle tool call
        name = event["functionCall"]["name"]
        args = event["functionCall"]["parameters"]

        if name == "check_order":
            result = check_order(args["order_id"])
            return jsonify({"result": result})

    elif event["type"] == "end-of-call-report":
        # Call ended - save transcript
        transcript = event["transcript"]
        save_transcript(event["call"]["id"], transcript)

    return jsonify({"ok": True})

# Start outbound call
call = client.calls.create(
    assistant_id=assistant.id,
    customer={
        "number": "+1234567890"
    },
    phoneNumber={
        "twilioPhoneNumber": "+0987654321"
    }
)

# Or create web call
web_call = client.calls.create(
    assistant_id=assistant.id,
    type="web"
)
# Returns URL for WebRTC connection
```

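The webhook above hard-codes a single tool name. As the number of tools grows, a name-to-function registry keeps the handler small. This is a sketch only: it assumes the same `function-call` payload shape used in the handler above, and `TOOLS`, `dispatch_function_call`, and the stubbed `check_order` are hypothetical names, not part of any Vapi SDK.

```python
from typing import Any, Callable


def check_order(order_id: str) -> dict[str, Any]:
    """Stub lookup; a real implementation would query an order system."""
    return {"order_id": order_id, "status": "unknown"}


# Hypothetical registry mapping tool names to Python callables
TOOLS: dict[str, Callable[..., Any]] = {"check_order": check_order}


def dispatch_function_call(event: dict[str, Any]) -> dict[str, Any]:
    """Route a `function-call` event to the registered tool, if any."""
    call = event["functionCall"]
    handler = TOOLS.get(call["name"])
    if handler is None:
        # Report unknown tools instead of silently returning {"ok": True}
        return {"error": f"unknown function: {call['name']}"}
    return {"result": handler(**call["parameters"])}
```

Inside the Flask route, the `function-call` branch would then reduce to `return jsonify(dispatch_function_call(event))`.
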
### Deepgram STT + ElevenLabs TTS

Best-in-class transcription and synthesis

**When to use**: High quality voice, custom pipeline

```python
import asyncio
from deepgram import DeepgramClient, LiveTranscriptionEvents
from elevenlabs import ElevenLabs

# Deepgram real-time transcription
deepgram = DeepgramClient(api_key="...")

async def transcribe_stream(audio_stream):
    connection = deepgram.listen.live.v("1")

    async def on_transcript(result):
        transcript = result.channel.alternatives[0].transcript
        if transcript:
            print(f"Heard: {transcript}")
            if result.is_final:
                # Process final transcript
                await handle_user_input(transcript)

    connection.on(LiveTranscriptionEvents.Transcript, on_transcript)

    await connection.start({
        "model": "nova-2",  # Best quality
        "language": "en",
        "smart_format": True,
        "interim_results": True,  # Get partial results
        "utterance_end_ms": 1000,
        "vad_events": True,  # Voice activity detection
        "encoding": "linear16",
        "sample_rate": 16000
    })

    # Stream audio
    async for chunk in audio_stream:
        await connection.send(chunk)

    await connection.finish()

# ElevenLabs streaming synthesis
eleven = ElevenLabs(api_key="...")

def text_to_speech_stream(text: str):
    """Stream TTS audio chunks."""
    audio_stream = eleven.text_to_speech.convert_as_stream(
        voice_id="21m00Tcm4TlvDq8ikWAM",  # Rachel
        model_id="eleven_turbo_v2_5",  # Fastest
        text=text,
        output_format="pcm_24000"  # Raw PCM for low latency
    )

    for chunk in audio_stream:
        yield chunk

# Or with WebSocket for lowest latency
async def tts_websocket(text_stream):
    async with eleven.text_to_speech.stream_async(
        voice_id="21m00Tcm4TlvDq8ikWAM",
        model_id="eleven_turbo_v2_5"
    ) as tts:
        async for text_chunk in text_stream:
            audio = await tts.send(text_chunk)
            yield audio

        # Flush remaining audio
        final_audio = await tts.flush()
        yield final_audio
```

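With `interim_results` enabled, the transcript callback fires repeatedly for the same stretch of audio, so partial text has to replace the previous partial rather than be appended to it. Below is a minimal sketch of that bookkeeping, assuming each result exposes a transcript string and an `is_final` flag as in the handler above. `TranscriptBuffer` is a hypothetical helper, not a Deepgram class.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptBuffer:
    """Tracks committed (final) segments plus the latest interim guess."""
    finals: list[str] = field(default_factory=list)
    interim: str = ""

    def update(self, transcript: str, is_final: bool) -> None:
        if not transcript:
            return  # ignore empty results (silence)
        if is_final:
            # A final result supersedes any interim text for that segment
            self.finals.append(transcript)
            self.interim = ""
        else:
            # Interim results overwrite each other
            self.interim = transcript

    def text(self) -> str:
        """Best current view of what the user has said so far."""
        parts = self.finals + ([self.interim] if self.interim else [])
        return " ".join(parts)
```

A UI can render `text()` on every update for live captions, while the LLM is only invoked on final segments.
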
## Anti-Patterns

### ❌ Non-streaming Pipeline

**Why bad**: Waiting for each stage to finish before the next one starts adds seconds of latency. Users perceive the agent as slow, and the conversation loses its flow.

**Instead**: Stream everything:
- STT: interim results
- LLM: token streaming
- TTS: chunk streaming

Start TTS before the LLM finishes.

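One common way to start TTS before the LLM finishes is to cut the token stream at sentence boundaries and hand each sentence to the synthesizer as soon as it is complete. The sketch below illustrates the idea under simple assumptions: `sentences_from_tokens` is a hypothetical helper, and the punctuation-based boundary check is deliberately naive (it will split on abbreviations and decimals).

```python
from typing import Iterable, Iterator

SENTENCE_END = (".", "!", "?")


def sentences_from_tokens(tokens: Iterable[str]) -> Iterator[str]:
    """Yield complete sentences as soon as they appear in a token stream."""
    buffer = ""
    for token in tokens:
        buffer += token
        # Emit as soon as the buffered text ends with sentence punctuation
        if buffer.rstrip().endswith(SENTENCE_END):
            yield buffer.strip()
            buffer = ""
    # Flush any trailing text that never reached a sentence boundary
    if buffer.strip():
        yield buffer.strip()
```

Each yielded sentence could then be passed to a streaming TTS call, so the first audio plays while later tokens are still being generated.
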
### ❌ Ignoring Interruptions

**Why bad**: An agent that keeps talking over the user is frustrating, feels like talking to a machine, and wastes time.

**Instead**: Implement barge-in detection. Use VAD to detect user speech, stop TTS immediately, and clear the audio queue.

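The barge-in steps can be sketched as a small playback queue that the VAD callback is allowed to flush. This is an illustrative structure only: `PlaybackQueue` and its method names are hypothetical, and a real system would also need to cancel the in-flight TTS request and any audio already handed to the output device.

```python
from collections import deque
from typing import Optional


class PlaybackQueue:
    """Holds TTS audio chunks awaiting playback and supports barge-in."""

    def __init__(self) -> None:
        self._chunks: deque = deque()
        self.interrupted = False

    def enqueue(self, chunk: bytes) -> None:
        # Drop audio that arrives after the user has started speaking
        if not self.interrupted:
            self._chunks.append(chunk)

    def next_chunk(self) -> Optional[bytes]:
        """Return the next chunk to play, or None if nothing is queued."""
        return self._chunks.popleft() if self._chunks else None

    def on_user_speech_started(self) -> int:
        """Call from the VAD callback; returns how many chunks were discarded."""
        dropped = len(self._chunks)
        self._chunks.clear()
        self.interrupted = True
        return dropped

    def reset(self) -> None:
        """Call when the agent begins a new response."""
        self.interrupted = False
```

The `interrupted` flag matters because TTS chunks for the old response can keep arriving for a short time after the user starts talking.
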
### ❌ Single Provider Lock-in

**Why bad**: A single vendor may not offer the best quality at every stage, becomes a single point of failure, and makes each stage harder to optimize on its own.

**Instead**: Mix best providers:
- Deepgram for STT (speed + accuracy)
- ElevenLabs for TTS (voice quality)
- OpenAI/Anthropic for LLM

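Mixing providers is easier when each stage sits behind a small, provider-neutral interface. The sketch below shows one possible shape for the TTS stage, with ordered fallback when a provider fails. `Synthesizer` and `synthesize_with_fallback` are hypothetical names, and each provider is assumed to be wrapped as a plain function from text to audio bytes.

```python
from typing import Callable, Sequence

# Assumed adapter shape: every TTS provider is wrapped as text -> audio bytes
Synthesizer = Callable[[str], bytes]


def synthesize_with_fallback(
    text: str,
    providers: Sequence[tuple[str, Synthesizer]],
) -> tuple[str, bytes]:
    """Try each (name, synthesizer) pair in order; return the first success."""
    errors = []
    for name, synthesize in providers:
        try:
            return name, synthesize(text)
        except Exception as exc:  # e.g. outage, quota, or network error
            errors.append(f"{name}: {exc}")
    raise RuntimeError("all TTS providers failed: " + "; ".join(errors))
```

Returning the provider name alongside the audio makes it possible to log which backend actually served each response.
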
## Limitations

- Latency varies by provider
- Cost per minute adds up
- Quality depends on network
- Complex debugging

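Because latency varies by provider and network, it helps to measure it per stage rather than guess. The following sketch wraps any synchronous chunk stream and records time-to-first-item and total duration. `timed_stream` and the `timings` keys are hypothetical names chosen for this example.

```python
import time
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def timed_stream(stream: Iterable[T], timings: dict[str, float]) -> Iterator[T]:
    """Pass items through unchanged while recording latency in seconds."""
    start = time.perf_counter()
    first_seen = False
    for item in stream:
        if not first_seen:
            # Time until the first chunk is what the user perceives as delay
            timings["first_item_s"] = time.perf_counter() - start
            first_seen = True
        yield item
    timings["total_s"] = time.perf_counter() - start
```

Wrapping a TTS generator this way, for example, gives a per-request time-to-first-audio figure that can be logged and compared across providers.
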
## Related Skills

Works well with: `langgraph`, `structured-output`, `langfuse`

## When to Use

Use this skill when carrying out the workflows or actions described in the overview.