firefrost-gaming/antigravity-skills-reference

taksrules d972c4fa3a feat: add voice-ai-engine-development skill for building real-time conversational AI

2026-01-27 07:24:06 +02:00

Provider Comparison Guide

This guide compares different providers for transcription, LLM, and TTS services to help you choose the best option for your voice AI engine.

Transcription Providers

Deepgram

Strengths:

✅ Fastest transcription speed (< 300ms latency)
✅ Excellent streaming support
✅ High accuracy (95%+ on clear audio)
✅ Good pricing ($0.0043/minute)
✅ Nova-2 model optimized for real-time
✅ Excellent documentation

Weaknesses:

❌ Less accurate with heavy accents
❌ Smaller company (potential reliability concerns)

Best For:

Real-time voice conversations
Low-latency applications
English-language applications
Startups and small businesses

Configuration:

{
    "transcriberProvider": "deepgram",
    "deepgramApiKey": "your-api-key",
    "deepgramModel": "nova-2",
    "language": "en-US"
}

AssemblyAI

Strengths:

✅ Very high accuracy (96%+ on clear audio)
✅ Excellent with accents and dialects
✅ Good speaker diarization
✅ Competitive pricing ($0.00025/second)
✅ Strong customer support

Weaknesses:

❌ Slightly higher latency than Deepgram
❌ Streaming support is newer

Best For:

Applications requiring highest accuracy
Multi-speaker scenarios
Diverse user base with accents
Enterprise applications

Configuration:

{
    "transcriberProvider": "assemblyai",
    "assemblyaiApiKey": "your-api-key",
    "language": "en"
}

Azure Speech

Strengths:

✅ Enterprise-grade reliability
✅ Excellent multi-language support (100+ languages)
✅ Strong security and compliance
✅ Integration with Azure ecosystem
✅ Custom model training available

Weaknesses:

❌ Higher cost ($1/hour)
❌ More complex setup
❌ Slower than specialized providers

Best For:

Enterprise applications
Multi-language requirements
Azure-based infrastructure
Compliance-sensitive applications

Configuration:

{
    "transcriberProvider": "azure",
    "azureSpeechKey": "your-key",
    "azureSpeechRegion": "eastus",
    "language": "en-US"
}

Google Cloud Speech

Strengths:

✅ Excellent multi-language support (125+ languages)
✅ Good accuracy
✅ Integration with Google Cloud
✅ Automatic punctuation
✅ Speaker diarization

Weaknesses:

❌ Higher latency for streaming
❌ Complex pricing model
❌ Requires Google Cloud account

Best For:

Multi-language applications
Google Cloud infrastructure
Applications needing speaker diarization

Configuration:

{
    "transcriberProvider": "google",
    "googleCredentials": "path/to/credentials.json",
    "language": "en-US"
}
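The transcription prices above are quoted in different units (per minute, per second, per hour), which makes them hard to compare at a glance. The following is a minimal sketch that normalizes them to dollars per minute. The figures are the ones quoted in this guide and may be out of date; the function and variable names are illustrative only. Google Cloud Speech is left out because no single price is quoted for it here.

```python
# Prices as quoted in this guide; check each provider's pricing page before relying on them.
QUOTED_PRICES = {
    "deepgram": (0.0043, "minute"),
    "assemblyai": (0.00025, "second"),
    "azure": (1.00, "hour"),
}

# How many minutes of audio one billing unit covers.
MINUTES_PER_UNIT = {"second": 1 / 60, "minute": 1.0, "hour": 60.0}


def price_per_minute(amount: float, unit: str) -> float:
    """Convert a quoted price to US dollars per minute of audio."""
    return amount / MINUTES_PER_UNIT[unit]


def per_minute_table() -> dict:
    """Return every quoted price normalized to dollars per minute (4 decimals)."""
    return {
        name: round(price_per_minute(amount, unit), 4)
        for name, (amount, unit) in QUOTED_PRICES.items()
    }


if __name__ == "__main__":
    for provider, cost in sorted(per_minute_table().items(), key=lambda kv: kv[1]):
        print(f"{provider}: ${cost:.4f}/minute")
```

On these quoted figures, Deepgram comes to $0.0043, AssemblyAI to $0.0150, and Azure Speech to about $0.0167 per minute.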

LLM Providers

OpenAI (GPT-4, GPT-3.5)

Strengths:

✅ Highest quality responses
✅ Excellent instruction following
✅ Fast streaming
✅ Large context window (128k tokens for GPT-4 Turbo)
✅ Best-in-class reasoning

Weaknesses:

❌ Higher cost ($0.01-0.03/1k tokens for GPT-4 Turbo)
❌ Rate limits can be restrictive
❌ No free tier

Best For:

High-quality conversational AI
Complex reasoning tasks
Production applications
Enterprise use cases

Configuration:

{
    "llmProvider": "openai",
    "openaiApiKey": "your-api-key",
    "openaiModel": "gpt-4-turbo",
    "prompt": "You are a helpful AI assistant."
}

Pricing:

GPT-4 Turbo: $0.01/1k input tokens, $0.03/1k output tokens
GPT-3.5 Turbo: $0.0005/1k input tokens, $0.0015/1k output tokens

Google Gemini

Strengths:

✅ Excellent cost-effectiveness (free tier available)
✅ Multimodal capabilities
✅ Good streaming support
✅ Large context window (1M tokens for Gemini 1.5 Pro)
✅ Fast response times

Weaknesses:

❌ Slightly lower quality than GPT-4
❌ Less predictable behavior
❌ Newer, less battle-tested

Best For:

Cost-sensitive applications
Multimodal applications
Startups and prototypes
High-volume applications

Configuration:

{
    "llmProvider": "gemini",
    "geminiApiKey": "your-api-key",
    "geminiModel": "gemini-pro",
    "prompt": "You are a helpful AI assistant."
}

Pricing:

Gemini Pro: Free up to 60 requests/minute
Gemini Pro (paid): $0.00025/1k input tokens, $0.0005/1k output tokens

Anthropic Claude

Strengths:

✅ Excellent safety and alignment
✅ Very long context window (200k tokens)
✅ High-quality responses
✅ Good at following complex instructions
✅ Strong reasoning capabilities

Weaknesses:

❌ Higher cost than Gemini
❌ Slower streaming than OpenAI
❌ More conservative responses

Best For:

Safety-critical applications
Long-context applications
Nuanced conversations
Enterprise applications

Configuration:

{
    "llmProvider": "claude",
    "claudeApiKey": "your-api-key",
    "claudeModel": "claude-3-opus",
    "prompt": "You are a helpful AI assistant."
}

Pricing:

Claude 3 Opus: $0.015/1k input tokens, $0.075/1k output tokens
Claude 3 Sonnet: $0.003/1k input tokens, $0.015/1k output tokens
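Because input and output tokens are billed at different rates, the per-1k prices are easier to compare once applied to a concrete exchange. The sketch below does that arithmetic with the prices listed in this guide. The model keys and the `llm_cost` helper are illustrative names, and the token counts in the usage note are assumptions rather than measurements.

```python
# (input, output) prices in US dollars per 1k tokens, as listed in this guide.
LLM_PRICES = {
    "gpt-4-turbo": (0.01, 0.03),
    "gpt-3.5-turbo": (0.0005, 0.0015),
    "gemini-pro": (0.00025, 0.0005),
    "claude-3-opus": (0.015, 0.075),
    "claude-3-sonnet": (0.003, 0.015),
}


def llm_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in US dollars of one request, given its token counts."""
    input_price, output_price = LLM_PRICES[model]
    return (input_tokens / 1000) * input_price + (output_tokens / 1000) * output_price


if __name__ == "__main__":
    # Assumed exchange: 1,000 prompt tokens (history included) and 200 reply tokens.
    for model in LLM_PRICES:
        print(f"{model}: ${llm_cost(model, 1000, 200):.5f}")
```

For an assumed exchange of 1,000 input tokens and 200 output tokens, this gives $0.016 for GPT-4 Turbo, $0.030 for Claude 3 Opus, and $0.00035 for paid Gemini Pro. In a voice conversation the input side grows every turn as history accumulates, so the input price often matters more than it first appears.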

TTS Providers

ElevenLabs

Strengths:

✅ Most natural-sounding voices
✅ Excellent emotional range
✅ Voice cloning capabilities
✅ Good streaming support
✅ Multiple languages

Weaknesses:

❌ Higher cost ($0.30/1k characters)
❌ Rate limits on lower tiers
❌ Occasional pronunciation errors

Best For:

Premium voice experiences
Customer-facing applications
Voice cloning needs
High-quality audio requirements

Configuration:

{
    "voiceProvider": "elevenlabs",
    "elevenlabsApiKey": "your-api-key",
    "elevenlabsVoiceId": "voice-id",
    "elevenlabsModel": "eleven_monolingual_v1"
}

Pricing:

Free: 10k characters/month
Starter: $5/month, 30k characters
Creator: $22/month, 100k characters

Azure TTS

Strengths:

✅ Enterprise-grade reliability
✅ Many languages (100+)
✅ Neural voices available
✅ SSML support for fine control
✅ Good pricing (standard voices at $4/1M characters; neural voices cost $16/1M)

Weaknesses:

❌ Less natural than ElevenLabs
❌ More complex setup
❌ Requires Azure account

Best For:

Enterprise applications
Multi-language requirements
Azure-based infrastructure
Cost-sensitive high-volume applications

Configuration:

{
    "voiceProvider": "azure",
    "azureSpeechKey": "your-key",
    "azureSpeechRegion": "eastus",
    "azureVoiceName": "en-US-JennyNeural"
}

Pricing:

Neural voices: $16/1M characters
Standard voices: $4/1M characters

Google Cloud TTS

Strengths:

✅ Good quality neural voices
✅ Many languages (40+)
✅ WaveNet voices available
✅ Competitive pricing (standard voices at $4/1M characters; WaveNet and Neural2 voices cost $16/1M)
✅ SSML support

Weaknesses:

❌ Less natural than ElevenLabs
❌ Requires Google Cloud account
❌ Complex setup

Best For:

Multi-language applications
Google Cloud infrastructure
Cost-effective neural voices

Configuration:

{
    "voiceProvider": "google",
    "googleCredentials": "path/to/credentials.json",
    "googleVoiceName": "en-US-Neural2-F"
}

Pricing:

WaveNet voices: $16/1M characters
Neural2 voices: $16/1M characters
Standard voices: $4/1M characters

Amazon Polly

Strengths:

✅ AWS integration
✅ Good pricing (standard voices at $4/1M characters; neural voices cost $16/1M)
✅ Neural voices available
✅ SSML support
✅ Reliable service

Weaknesses:

❌ Less natural than ElevenLabs
❌ Fewer voice options
❌ Requires AWS account

Best For:

AWS-based infrastructure
Cost-effective neural voices
Enterprise applications

Configuration:

{
    "voiceProvider": "polly",
    "awsAccessKey": "your-access-key",
    "awsSecretKey": "your-secret-key",
    "awsRegion": "us-east-1",
    "pollyVoiceId": "Joanna"
}

Pricing:

Neural voices: $16/1M characters
Standard voices: $4/1M characters

Play.ht

Strengths:

✅ Voice cloning capabilities
✅ Natural-sounding voices
✅ Good streaming support
✅ Easy to use API
✅ Multiple languages

Weaknesses:

❌ Higher cost than cloud providers
❌ Smaller company
❌ Less documentation

Best For:

Voice cloning applications
Premium voice experiences
Startups and small businesses

Configuration:

{
    "voiceProvider": "playht",
    "playhtApiKey": "your-api-key",
    "playhtUserId": "your-user-id",
    "playhtVoiceId": "voice-id"
}

Pricing:

Free: 2.5k characters
Creator: $31/month, 50k characters
Pro: $79/month, 150k characters
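The TTS prices in this section mix monthly plans (ElevenLabs, Play.ht) with metered per-million-character rates (Azure, Google, Polly). A rough way to compare them is to convert everything to dollars per 1k characters, as sketched below. The helper names are made up for illustration, the plan figures are the ones listed in this guide, and the plan calculation assumes the monthly allowance is fully used.

```python
def per_1k_from_plan(monthly_price: float, included_chars: int) -> float:
    """Effective cost per 1k characters if the whole monthly allowance is used."""
    return monthly_price / included_chars * 1000


def per_1k_from_metered(price_per_million: float) -> float:
    """Cost per 1k characters for a metered per-1M-characters price."""
    return price_per_million / 1000


if __name__ == "__main__":
    rows = {
        "ElevenLabs Starter": per_1k_from_plan(5, 30_000),
        "ElevenLabs Creator": per_1k_from_plan(22, 100_000),
        "Play.ht Creator": per_1k_from_plan(31, 50_000),
        "Play.ht Pro": per_1k_from_plan(79, 150_000),
        "Cloud neural voice": per_1k_from_metered(16),
        "Cloud standard voice": per_1k_from_metered(4),
    }
    for name, cost in rows.items():
        print(f"{name}: ${cost:.4f} per 1k characters")
```

On the listed figures, the cloud neural voices work out to $0.016 per 1k characters, while the subscription plans range from roughly $0.17 to $0.62 per 1k characters. An under-used plan costs more per character than these numbers suggest.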

Recommended Combinations

Budget-Conscious Startup

{
    "transcriberProvider": "deepgram",  # Fast and affordable
    "llmProvider": "gemini",            # Free tier available
    "voiceProvider": "google"           # Cost-effective neural voices
}

Estimated cost: ~$0.01 per minute of conversation

Premium Experience

{
    "transcriberProvider": "assemblyai",  # Highest accuracy
    "llmProvider": "openai",              # Best quality responses
    "voiceProvider": "elevenlabs"         # Most natural voices
}

Estimated cost: ~$0.05 per minute of conversation

Enterprise Application

{
    "transcriberProvider": "azure",  # Enterprise reliability
    "llmProvider": "openai",         # Best quality
    "voiceProvider": "azure"         # Enterprise reliability
}

Estimated cost: ~$0.03 per minute of conversation

Multi-Language Application

{
    "transcriberProvider": "google",  # 125+ languages
    "llmProvider": "gemini",          # Good multi-language support
    "voiceProvider": "google"         # 40+ languages
}

Estimated cost: ~$0.02 per minute of conversation
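The per-minute estimates above depend heavily on how much audio, how many tokens, and how many synthesized characters one minute of conversation actually involves. The sketch below shows one way to build such an estimate from the unit prices in this guide. `MinuteProfile` and `cost_per_minute` are hypothetical names, and the default traffic figures are placeholder assumptions that should be replaced with measurements from your own conversations.

```python
from dataclasses import dataclass


@dataclass
class MinuteProfile:
    """Assumed traffic for one minute of conversation (placeholders, not measurements)."""
    audio_minutes: float = 1.0      # audio sent to the transcriber
    llm_input_tokens: int = 1500    # prompt plus history across the minute
    llm_output_tokens: int = 150    # generated reply tokens
    tts_characters: int = 400       # characters sent to the synthesizer


def cost_per_minute(
    profile: MinuteProfile,
    stt_per_minute: float,
    llm_in_per_1k: float,
    llm_out_per_1k: float,
    tts_per_1m_chars: float,
) -> float:
    """Sum transcription, LLM, and TTS cost for one minute of conversation."""
    stt = profile.audio_minutes * stt_per_minute
    llm = (
        profile.llm_input_tokens / 1000 * llm_in_per_1k
        + profile.llm_output_tokens / 1000 * llm_out_per_1k
    )
    tts = profile.tts_characters / 1_000_000 * tts_per_1m_chars
    return stt + llm + tts


if __name__ == "__main__":
    # Budget combination: Deepgram, paid Gemini Pro, Google neural voices.
    budget = cost_per_minute(MinuteProfile(), 0.0043, 0.00025, 0.0005, 16.0)
    print(f"Budget combination: ${budget:.4f} per minute")
```

With the placeholder profile, the budget combination comes to roughly $0.011 per minute. Changing the profile, especially the number of synthesized characters, moves the result substantially, so treat any single per-minute figure as a starting point.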

Decision Matrix

Priority	Transcriber	LLM	TTS
Lowest Cost	Deepgram	Gemini	Google
Highest Quality	AssemblyAI	OpenAI	ElevenLabs
Fastest Speed	Deepgram	OpenAI	ElevenLabs
Enterprise	Azure	OpenAI	Azure
Multi-Language	Google	Gemini	Google
Voice Cloning	N/A	N/A	ElevenLabs/Play.ht
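The matrix above can also be kept as data, so that a single priority setting selects all three providers. The following is a small sketch of that idea. `DECISION_MATRIX` and `providers_for` are illustrative names, not part of any library, and the voice cloning row is omitted because it only constrains the TTS choice.

```python
# The decision matrix from this guide, keyed by priority.
# Values use the same provider identifiers as the configuration examples.
DECISION_MATRIX = {
    "lowest_cost": ("deepgram", "gemini", "google"),
    "highest_quality": ("assemblyai", "openai", "elevenlabs"),
    "fastest_speed": ("deepgram", "openai", "elevenlabs"),
    "enterprise": ("azure", "openai", "azure"),
    "multi_language": ("google", "gemini", "google"),
}


def providers_for(priority: str) -> dict:
    """Return a provider configuration fragment for the given priority."""
    try:
        transcriber, llm, voice = DECISION_MATRIX[priority]
    except KeyError:
        options = ", ".join(sorted(DECISION_MATRIX))
        raise ValueError(f"Unknown priority {priority!r}; expected one of: {options}") from None
    return {
        "transcriberProvider": transcriber,
        "llmProvider": llm,
        "voiceProvider": voice,
    }


if __name__ == "__main__":
    print(providers_for("enterprise"))
```

The returned fragment still needs the API keys and model settings shown in each provider's Configuration section.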

Testing Recommendations

Before committing to providers, test with your specific use case:

Create test conversations with representative audio
Measure latency end-to-end
Evaluate quality with real users
Calculate costs based on expected volume
Test edge cases (accents, background noise, interruptions)
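For the latency step in the list above, a small timing harness is usually enough to compare provider combinations. The sketch below assumes you can wrap one full turn (audio in, synthesized reply out) in a callable; `run_turn` and `measure_latency` are hypothetical names, and the harness uses only the Python standard library.

```python
import math
import statistics
import time
from typing import Callable, Iterable


def measure_latency(run_turn: Callable[[object], object], samples: Iterable[object]) -> dict:
    """Time run_turn on each sample and summarize wall-clock latency in seconds."""
    durations = []
    for sample in samples:
        start = time.perf_counter()
        run_turn(sample)
        durations.append(time.perf_counter() - start)
    if not durations:
        raise ValueError("at least one sample is required")

    durations.sort()
    # Nearest-rank 95th percentile: the smallest value covering 95% of the runs.
    p95_index = max(0, math.ceil(0.95 * len(durations)) - 1)
    return {
        "count": len(durations),
        "mean_s": statistics.mean(durations),
        "median_s": statistics.median(durations),
        "p95_s": durations[p95_index],
        "max_s": durations[-1],
    }


if __name__ == "__main__":
    # Stand-in for a real pipeline call; replace with your own turn function.
    def fake_turn(_sample: object) -> None:
        time.sleep(0.01)

    print(measure_latency(fake_turn, range(5)))
```

Reporting the 95th percentile alongside the median helps here, since occasional slow turns tend to be what users notice in a live conversation.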

Switching Providers

The multi-provider factory pattern makes switching easy:

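# Switching providers is a configuration change
config = {
    "transcriberProvider": "deepgram",  # Change to "assemblyai"
    "llmProvider": "gemini",            # Change to "openai"
    "voiceProvider": "google"           # Change to "elevenlabs"
}
# Also set the new provider's own fields (API key, model, voice) as shown
# in its Configuration section above.

# The calling code stays the same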
factory = VoiceComponentFactory()
transcriber = factory.create_transcriber(config)
agent = factory.create_agent(config)
synthesizer = factory.create_synthesizer(config)
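To illustrate why the calling code does not change, here is a minimal sketch of how a factory like this could map the configuration value to a class. This is not the skill's actual implementation: `SketchVoiceComponentFactory` and the two stub transcriber classes are hypothetical, and only the transcriber side is shown, on the assumption that the agent and synthesizer would follow the same shape.

```python
class DeepgramTranscriber:
    """Hypothetical stub standing in for a real Deepgram-backed transcriber."""

    def __init__(self, config: dict) -> None:
        self.config = config


class AssemblyAITranscriber:
    """Hypothetical stub standing in for a real AssemblyAI-backed transcriber."""

    def __init__(self, config: dict) -> None:
        self.config = config


class SketchVoiceComponentFactory:
    """Illustrative registry-based factory; names and structure are assumptions."""

    # Maps the "transcriberProvider" config value to the class that implements it.
    _transcribers = {
        "deepgram": DeepgramTranscriber,
        "assemblyai": AssemblyAITranscriber,
    }

    def create_transcriber(self, config: dict):
        return self._create(self._transcribers, config, "transcriberProvider")

    @staticmethod
    def _create(registry: dict, config: dict, key: str):
        provider = config.get(key)
        if provider not in registry:
            supported = ", ".join(sorted(registry))
            raise ValueError(f"Unsupported {key} {provider!r}; supported: {supported}")
        # Each component receives the full config and reads its own keys from it.
        return registry[provider](config)


if __name__ == "__main__":
    factory = SketchVoiceComponentFactory()
    transcriber = factory.create_transcriber({"transcriberProvider": "assemblyai"})
    print(type(transcriber).__name__)
```

Because callers only depend on the factory method and the configuration key, adding a provider in this design means registering one more class, and a misspelled provider name fails early with a clear error.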
