Source: https://github.com/ericgandrade/cli-ai-skills
Transcription Tools Comparison
Comprehensive comparison of audio transcription engines supported by the audio-transcriber skill.
Overview
| Tool | Type | Speed | Quality | Cost | Privacy | Offline | Languages |
|---|---|---|---|---|---|---|---|
| Faster-Whisper | Open-source | ⚡⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | Free | 100% | ✅ | 99 |
| Whisper | Open-source | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | Free | 100% | ✅ | 99 |
| Google Speech-to-Text | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | $0.006/15s | Partial | ❌ | 125+ |
| Azure Speech | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | $1/hour | Partial | ❌ | 100+ |
| AssemblyAI | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | $0.00025/s | Partial | ❌ | 99 |
Faster-Whisper (Recommended)
Pros
✅ 4-5x faster than original Whisper
✅ Same quality as original Whisper
✅ Lower memory usage (50-60% less RAM)
✅ Free and open-source
✅ 100% offline after the initial model download (audio never leaves your machine)
✅ Easy installation (pip install faster-whisper)
✅ Uses the same Whisper models (the Python API differs slightly from openai-whisper)
Cons
❌ Requires Python 3.8+
❌ Initial model download (~100MB-1.5GB)
❌ GPU is optional, but CPU-only transcription is significantly slower
Installation
pip install faster-whisper
Usage Example
from faster_whisper import WhisperModel

# Load model (auto-downloads on first run)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe (segments is a lazy generator; the work happens during iteration)
segments, info = model.transcribe("audio.mp3", language="pt")

# Print results
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
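Since each segment exposes `start`, `end`, and `text`, the results can be turned into a timestamped transcript with a few lines of formatting. The following is a minimal sketch, not the skill's actual output code: the helper names `format_timestamp` and `segments_to_markdown` and the bullet layout are assumptions made for illustration, and the demo uses stand-in objects so it runs without a model download.

```python
from types import SimpleNamespace


def format_timestamp(seconds: float) -> str:
    """Format a position in seconds as HH:MM:SS (fractions are truncated)."""
    total = int(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_markdown(segments) -> str:
    """Render segments (anything with .start, .end, .text) as Markdown bullets."""
    lines = []
    for segment in segments:
        start = format_timestamp(segment.start)
        end = format_timestamp(segment.end)
        lines.append(f"- **[{start} - {end}]** {segment.text.strip()}")
    return "\n".join(lines)


# Stand-in segments (hypothetical data) shaped like faster-whisper's output.
demo = [
    SimpleNamespace(start=0.0, end=4.2, text=" Hello and welcome."),
    SimpleNamespace(start=4.2, end=75.5, text=" Today we compare engines."),
]
print(segments_to_markdown(demo))
```

In real use, the `segments` generator returned by `model.transcribe(...)` could be passed in place of `demo`.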
Model Sizes
| Model | Parameters | RAM | Speed (CPU) | Quality |
|---|---|---|---|---|
| tiny | 39 M | ~1 GB | Very fast (~10x realtime) | Basic |
| base | 74 M | ~1 GB | Fast (~7x realtime) | Good |
| small | 244 M | ~2 GB | Moderate (~4x realtime) | Very good |
| medium | 769 M | ~5 GB | Slow (~2x realtime) | Excellent |
| large | 1550 M | ~10 GB | Very slow (~1x realtime) | Best |
Recommendation: small or medium for production use.
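The RAM column above can drive an automatic choice of model. Here is a rough sketch, assuming the approximate RAM figures from the table and a hypothetical `pick_model` helper with a 1 GB safety margin (both the helper and the margin are illustrative, not part of the skill):

```python
# Approximate RAM needs in GB, copied from the Model Sizes table.
MODEL_RAM_GB = {"tiny": 1, "base": 1, "small": 2, "medium": 5, "large": 10}

# Ordered from lowest to highest quality.
MODEL_ORDER = ["tiny", "base", "small", "medium", "large"]


def pick_model(available_ram_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the highest-quality model that fits in RAM after reserving headroom.

    Falls back to "tiny" when nothing fits, since it is the smallest option.
    """
    budget = available_ram_gb - headroom_gb
    chosen = "tiny"
    for name in MODEL_ORDER:
        if MODEL_RAM_GB[name] <= budget:
            chosen = name
    return chosen


for ram in (2, 4, 8, 16):
    print(f"{ram} GB available -> {pick_model(ram)}")
```

The result is only a starting point: actual memory use depends on the compute type (for example `int8` versus `float16`) and on audio length.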
Whisper (Original)
Pros
✅ Official OpenAI model
✅ Excellent quality
✅ Free and open-source
✅ 100% offline
✅ Well-documented
✅ Large community
Cons
❌ 4-5x slower than Faster-Whisper
❌ Higher memory usage
❌ Requires PyTorch (large dependency)
❌ GPU highly recommended for larger models
Installation
pip install openai-whisper
Usage Example
import whisper
# Load model
model = whisper.load_model("base")
# Transcribe
result = model.transcribe("audio.mp3", language="pt")
# Print results
print(result["text"])
When to Use Whisper vs. Faster-Whisper
Use Faster-Whisper if:
- Speed is important
- Limited RAM available
- Processing many files
Use Original Whisper if:
- Faster-Whisper installation issues
- Need exact OpenAI implementation
- Already have Whisper in project dependencies
Google Cloud Speech-to-Text
Pros
✅ Very accurate (industry-leading)
✅ Fast processing (cloud infrastructure)
✅ 125+ languages
✅ Word-level timestamps
✅ Punctuation & capitalization
✅ Speaker diarization (premium)
Cons
❌ Requires internet (cloud-only)
❌ Costs money (after free tier)
❌ Privacy concerns (audio uploaded to Google)
❌ Requires GCP account setup
❌ Complex authentication
Pricing
- Free tier: 60 minutes/month
- Standard: $0.006 per 15 seconds ($1.44/hour)
- Premium: $0.009 per 15 seconds ($2.16/hour, with diarization)
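The hourly figures follow from the 15-second billing unit: one hour is 240 increments, so 240 × $0.006 = $1.44. The sketch below reproduces that arithmetic. It assumes that usage is rounded up to the next 15-second increment and that any remaining free-tier seconds are deducted first; the function name and parameters are hypothetical, so check the current pricing page before relying on the numbers.

```python
import math


def google_stt_cost(audio_seconds: float,
                    rate_per_increment: float = 0.006,
                    free_seconds_left: float = 0) -> float:
    """Estimate the charge in USD for one audio file.

    Assumptions: billing in 15-second increments, rounded up, with unused
    free-tier seconds applied before billing.
    """
    billable = max(0.0, audio_seconds - free_seconds_left)
    increments = math.ceil(billable / 15)
    return round(increments * rate_per_increment, 4)


print(google_stt_cost(3600))                            # one hour, standard rate
print(google_stt_cost(3600, rate_per_increment=0.009))  # one hour, premium rate
print(google_stt_cost(3600, free_seconds_left=3600))    # fully covered by free tier
```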
Installation
pip install google-cloud-speech
Setup
- Create GCP project
- Enable Speech-to-Text API
- Create service account & download JSON key
- Set environment variable:
export GOOGLE_APPLICATION_CREDENTIALS="path/to/key.json"
Usage Example
from google.cloud import speech

client = speech.SpeechClient()

# Read the audio file into memory (16 kHz, 16-bit linear PCM WAV)
with open("audio.wav", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="pt-BR",
)

# Synchronous recognition is meant for short clips (about one minute or less);
# use client.long_running_recognize() for longer audio
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
Azure Speech Services
Pros
✅ High accuracy
✅ 100+ languages
✅ Real-time transcription
✅ Custom models (train on your data)
✅ Good Microsoft ecosystem integration
Cons
❌ Requires internet
❌ Costs money (after free tier)
❌ Privacy concerns (cloud processing)
❌ Requires Azure account
❌ Complex setup
Pricing
- Free tier: 5 hours/month
- Standard: $1.00 per audio hour
Installation
pip install azure-cognitiveservices-speech
Setup
- Create Azure account
- Create Speech resource
- Get API key and region
- Set environment variables:
export AZURE_SPEECH_KEY="your-key"
export AZURE_SPEECH_REGION="your-region"
Usage Example
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('AZURE_SPEECH_KEY'),
    region=os.environ.get('AZURE_SPEECH_REGION')
)
audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# recognize_once() returns after the first utterance (it stops at the first
# pause); use continuous recognition for long recordings
result = speech_recognizer.recognize_once()
print(result.text)
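The example above reads its credentials with `os.environ.get`, which quietly returns `None` when a variable is unset, so a missing key only shows up later as an SDK error. One possible safeguard is to check the variables up front. This is a small sketch with a hypothetical `require_env` helper, not something the Azure SDK provides:

```python
import os


def require_env(*names: str) -> dict:
    """Return the named environment variables, or raise listing the missing ones."""
    missing = [name for name in names if not os.environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: os.environ[name] for name in names}


try:
    creds = require_env("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")
    print("Azure credentials found")
except RuntimeError as error:
    print(error)
```

The same check works for `GOOGLE_APPLICATION_CREDENTIALS` and `ASSEMBLYAI_API_KEY`.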
AssemblyAI
Pros
✅ Modern, developer-friendly API
✅ Excellent accuracy
✅ Advanced features (sentiment, topic detection, PII redaction)
✅ Speaker diarization (included)
✅ Fast processing
✅ Good documentation
Cons
❌ Requires internet
❌ Costs money (no free tier, only trial credits)
❌ Privacy concerns (cloud processing)
❌ Requires API key
Pricing
- Free trial: $50 credits
- Standard: $0.00025 per second (~$0.90/hour)
Installation
pip install assemblyai
Setup
- Sign up at assemblyai.com
- Get API key
- Set environment variable:
export ASSEMBLYAI_API_KEY="your-key"
Usage Example
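import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

# Speaker labels must be requested explicitly for utterances to be populated
config = aai.TranscriptionConfig(speaker_labels=True)
transcriber = aai.Transcriber()
transcript = transcriber.transcribe("audio.mp3", config=config)
print(transcript.text)

# Speaker diarization
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")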
Recommendation Matrix
Use Faster-Whisper if:
- ✅ Privacy is critical (local processing)
- ✅ Want zero cost (free forever)
- ✅ Need offline capability
- ✅ Processing many files (speed matters)
- ✅ Limited budget
Use Google Speech-to-Text if:
- ✅ Need absolute best accuracy
- ✅ Have budget for cloud services
- ✅ Want advanced features (punctuation, diarization)
- ✅ Already using GCP ecosystem
Use Azure Speech if:
- ✅ In Microsoft ecosystem
- ✅ Need custom model training
- ✅ Want real-time transcription
- ✅ Have Azure credits
Use AssemblyAI if:
- ✅ Need advanced features (sentiment, topics)
- ✅ Want easiest API experience
- ✅ Need automatic PII redaction
- ✅ Value developer experience
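The matrix can also be read as a simple decision procedure. The sketch below encodes it as a function. The order in which the criteria are checked (local-only requirements first, then feature needs, then ecosystem) is an assumption for illustration, as are the function name and the flag names; the matrix itself does not rank its criteria.

```python
def recommend_engine(*, needs_offline: bool = False, zero_cost: bool = False,
                     needs_pii_redaction: bool = False, needs_sentiment: bool = False,
                     needs_custom_models: bool = False, uses_azure: bool = False,
                     uses_gcp: bool = False) -> str:
    """Map requirements to an engine, following the recommendation matrix."""
    # Privacy, offline use, and zero cost all point to local processing.
    if needs_offline or zero_cost:
        return "faster-whisper"
    # Sentiment analysis and PII redaction are listed under AssemblyAI.
    if needs_pii_redaction or needs_sentiment:
        return "assemblyai"
    # Custom model training and the Microsoft ecosystem point to Azure.
    if needs_custom_models or uses_azure:
        return "azure-speech"
    if uses_gcp:
        return "google-speech-to-text"
    # With no special requirements, fall back to the document's default.
    return "faster-whisper"


print(recommend_engine(needs_offline=True))
print(recommend_engine(needs_pii_redaction=True))
print(recommend_engine(uses_gcp=True))
```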
Performance Benchmarks
Test: 1-hour podcast (MP3, 44.1kHz, stereo)
| Tool | Processing Time | Accuracy | Cost |
|---|---|---|---|
| Faster-Whisper (small) | 8 min | 94% | $0 |
| Whisper (small) | 32 min | 94% | $0 |
| Google Speech | 2 min | 96% | $1.44 |
| Azure Speech | 3 min | 95% | $1.00 |
| AssemblyAI | 4 min | 96% | $0.90 |
Benchmarks run on MacBook Pro M1, 16GB RAM
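Processing times are easier to compare as a realtime factor, that is, audio duration divided by processing time. The short sketch below derives that factor from the benchmark table; the dictionary simply restates the table's numbers, and the names are illustrative.

```python
AUDIO_MINUTES = 60  # the 1-hour test podcast

# Processing times in minutes, copied from the benchmark table.
PROCESSING_MINUTES = {
    "Faster-Whisper (small)": 8,
    "Whisper (small)": 32,
    "Google Speech": 2,
    "Azure Speech": 3,
    "AssemblyAI": 4,
}


def realtime_factor(audio_minutes: float, processing_minutes: float) -> float:
    """How many minutes of audio are processed per minute of wall-clock time."""
    return audio_minutes / processing_minutes


for tool, minutes in PROCESSING_MINUTES.items():
    print(f"{tool}: {realtime_factor(AUDIO_MINUTES, minutes):.1f}x realtime")

# Relative speed of the two local engines on this test.
speedup = PROCESSING_MINUTES["Whisper (small)"] / PROCESSING_MINUTES["Faster-Whisper (small)"]
print(f"Faster-Whisper speedup over Whisper: {speedup:.1f}x")
```

On these numbers Faster-Whisper runs at 7.5x realtime and is 4x faster than the original Whisper.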
Conclusion
For the audio-transcriber skill:
- Primary: Faster-Whisper (best balance of speed, quality, privacy, cost)
- Fallback: Whisper (if Faster-Whisper unavailable)
- Optional: Cloud APIs (user choice for premium features)
This ensures the skill works out-of-the-box for most users while allowing advanced users to integrate commercial services if needed.
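The primary/fallback order can be sketched as a simple availability check. This is an illustration only, not the skill's actual detection logic: `pick_backend` and its injectable `is_installed` parameter are hypothetical names, and the check uses the standard library's `importlib.util.find_spec` so that neither package is imported just to test for it.

```python
import importlib.util

# (label, import name) for the two local engines, in order of preference.
BACKENDS = [("faster-whisper", "faster_whisper"), ("whisper", "whisper")]


def module_available(module: str) -> bool:
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def pick_backend(is_installed=module_available):
    """Return the first installed backend label, or None if neither is present."""
    for label, module in BACKENDS:
        if is_installed(module):
            return label
    return None


backend = pick_backend()
if backend is None:
    print("No local engine found; try: pip install faster-whisper")
else:
    print(f"Using {backend}")
```

Passing the availability check as a parameter keeps the selection logic easy to test without installing either package.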