# Transcription Tools Comparison

Comprehensive comparison of audio transcription engines supported by the audio-transcriber skill.

## Overview

| Tool | Type | Speed | Quality | Cost | Privacy | Offline | Languages |
|------|------|-------|---------|------|---------|---------|-----------|
| **Faster-Whisper** | Open-source | ⚡⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | Free | 100% | ✅ | 99 |
| **Whisper** | Open-source | ⚡⚡⚡ | ⭐⭐⭐⭐⭐ | Free | 100% | ✅ | 99 |
| Google Speech-to-Text | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | $0.006/15s | Partial | ❌ | 125+ |
| Azure Speech | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐ | $1/hour | Partial | ❌ | 100+ |
| AssemblyAI | Commercial API | ⚡⚡⚡⚡ | ⭐⭐⭐⭐⭐ | $0.00025/s | Partial | ❌ | 99 |

---

## Faster-Whisper (Recommended)

### Pros

- ✅ **4-5x faster** than original Whisper
- ✅ **Same quality** as original Whisper
- ✅ **Lower memory usage** (50-60% less RAM)
- ✅ **Free and open-source**
- ✅ **100% offline** after the initial model download (audio never leaves the machine)
- ✅ **Easy installation** (`pip install faster-whisper`)
- ✅ **Same models as Whisper**, with a slightly different Python API (segments are returned as a generator)

### Cons

- ❌ Requires Python 3.8+
- ❌ Initial model download (~100MB-1.5GB)
- ❌ GPU is optional, but without one transcription is significantly slower

### Installation

```bash
pip install faster-whisper
```

### Usage Example

```python
from faster_whisper import WhisperModel

# Load model (auto-downloads on first run)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe
segments, info = model.transcribe("audio.mp3", language="pt")

# Print results
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

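
Since the skill turns audio into Markdown, the segments from the example above could be rendered as a timestamped Markdown list. The sketch below is only an illustration: `Seg`, `format_timestamp`, and `segments_to_markdown` are hypothetical names, and `Seg` is a stand-in that mimics just the `start`, `end`, and `text` fields of a Faster-Whisper segment so the code runs without the library.

```python
from typing import Iterable, NamedTuple


class Seg(NamedTuple):
    """Stand-in for a transcription segment (only the fields used here)."""
    start: float
    end: float
    text: str


def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as HH:MM:SS."""
    hours, rem = divmod(int(seconds), 3600)
    minutes, secs = divmod(rem, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def segments_to_markdown(segments: Iterable) -> str:
    """Build one Markdown bullet per segment, prefixed with its start time."""
    lines = []
    for seg in segments:
        lines.append(f"- **[{format_timestamp(seg.start)}]** {seg.text.strip()}")
    return "\n".join(lines)


# Hypothetical sample data; real segments would come from model.transcribe(...)
demo = [Seg(0.0, 4.2, " Hello and welcome."), Seg(3725.5, 3730.0, " Goodbye.")]
print(segments_to_markdown(demo))
```

Passing the `segments` generator from `model.transcribe(...)` in place of `demo` should work the same way, because only the three attributes above are read.
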
### Model Sizes

| Model | Parameters | RAM | Speed (relative to `large`) | Quality |
|-------|------------|-----|-----------------------------|---------|
| `tiny` | 39 M | ~1 GB | Very fast (~10x) | Basic |
| `base` | 74 M | ~1 GB | Fast (~7x) | Good |
| `small` | 244 M | ~2 GB | Moderate (~4x) | Very good |
| `medium` | 769 M | ~5 GB | Slow (~2x) | Excellent |
| `large` | 1550 M | ~10 GB | Very slow (1x, baseline) | Best |

**Recommendation:** `small` or `medium` for production use.

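
One way to turn the table above into a choice is to pick the largest model whose memory figure fits the machine. This is a minimal sketch, not part of the skill: `pick_model` and the 1 GB `headroom_gb` default are assumptions, and the memory figures are simply copied from the table. With the default headroom, `pick_model(8)` returns `medium` and `pick_model(4)` returns `small`.

```python
# Approximate memory needs in GB, copied from the table above (smallest first).
MODEL_MEMORY_GB = [
    ("tiny", 1),
    ("base", 1),
    ("small", 2),
    ("medium", 5),
    ("large", 10),
]


def pick_model(available_gb: float, headroom_gb: float = 1.0) -> str:
    """Return the largest model that fits in available_gb minus some headroom.

    headroom_gb is an assumed safety margin for the OS and other processes.
    Falls back to "tiny" when nothing fits.
    """
    budget = available_gb - headroom_gb
    chosen = "tiny"
    for name, needed_gb in MODEL_MEMORY_GB:
        if needed_gb <= budget:
            chosen = name
    return chosen


for ram in (2, 4, 8, 16):
    print(f"{ram} GB available -> {pick_model(ram)}")
```
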

---

## Whisper (Original)

### Pros

- ✅ **Official OpenAI model**
- ✅ **Excellent quality**
- ✅ **Free and open-source**
- ✅ **100% offline**
- ✅ **Well-documented**
- ✅ **Large community**

### Cons

- ❌ **4-5x slower** than Faster-Whisper
- ❌ **Higher memory usage**
- ❌ Requires PyTorch (large dependency)
- ❌ GPU highly recommended for larger models

### Installation

```bash
pip install openai-whisper
```

### Usage Example

```python
import whisper

# Load model
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("audio.mp3", language="pt")

# Print results
print(result["text"])
```

### When to Use Whisper vs. Faster-Whisper

**Use Faster-Whisper if:**
- Speed is important
- Limited RAM is available
- You are processing many files

**Use Original Whisper if:**
- Faster-Whisper fails to install
- You need the exact OpenAI implementation
- Whisper is already in your project dependencies

---

## Google Cloud Speech-to-Text

### Pros

- ✅ **Very accurate** (industry-leading)
- ✅ **Fast processing** (cloud infrastructure)
- ✅ **125+ languages**
- ✅ **Word-level timestamps**
- ✅ **Punctuation & capitalization**
- ✅ **Speaker diarization** (premium)

### Cons

- ❌ **Requires internet** (cloud-only)
- ❌ **Costs money** (after free tier)
- ❌ **Privacy concerns** (audio uploaded to Google)
- ❌ Requires GCP account setup
- ❌ Complex authentication

### Pricing

- **Free tier:** 60 minutes/month
- **Standard:** $0.006 per 15 seconds ($1.44/hour)
- **Premium:** $0.009 per 15 seconds (with diarization)

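
The per-15-second rates above can be turned into a quick cost estimate. The sketch below assumes that billable audio is rounded up to the next 15-second increment and that any remaining free-tier seconds are subtracted first; both are assumptions for illustration, and `google_stt_cost` is a hypothetical helper, not part of any Google library.

```python
import math


def google_stt_cost(
    duration_s: float,
    rate_per_15s: float = 0.006,
    free_seconds: float = 0.0,
) -> float:
    """Estimate the cost in USD of transcribing duration_s seconds of audio.

    Assumes billing in whole 15-second increments (rounded up) after
    subtracting free_seconds of remaining free-tier allowance.
    """
    billable = max(0.0, duration_s - free_seconds)
    increments = math.ceil(billable / 15)
    return round(increments * rate_per_15s, 4)


print(google_stt_cost(3600))                      # one hour, standard rate
print(google_stt_cost(3600, rate_per_15s=0.009))  # one hour, premium rate
print(google_stt_cost(3600, free_seconds=3600))   # fully covered by the free tier
```

At the standard rate one hour is 240 increments, which matches the $1.44/hour figure listed above.
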
### Installation

```bash
pip install google-cloud-speech
```

### Setup

1. Create GCP project
2. Enable Speech-to-Text API
3. Create service account & download JSON key
4. Set environment variable:
   ```bash
   export GOOGLE_APPLICATION_CREDENTIALS="path/to/key.json"
   ```

### Usage Example

```python
from google.cloud import speech

client = speech.SpeechClient()

with open("audio.wav", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="pt-BR",
)

# recognize() is synchronous and intended for short clips (about one minute);
# use long_running_recognize() for longer audio
response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)
```

---

## Azure Speech Services

### Pros

- ✅ **High accuracy**
- ✅ **100+ languages**
- ✅ **Real-time transcription**
- ✅ **Custom models** (train on your data)
- ✅ **Good Microsoft ecosystem integration**

### Cons

- ❌ **Requires internet**
- ❌ **Costs money** (after free tier)
- ❌ **Privacy concerns** (cloud processing)
- ❌ Requires Azure account
- ❌ Complex setup

### Pricing

- **Free tier:** 5 hours/month
- **Standard:** $1.00 per audio hour

### Installation

```bash
pip install azure-cognitiveservices-speech
```

### Setup

1. Create Azure account
2. Create Speech resource
3. Get API key and region
4. Set environment variables:
   ```bash
   export AZURE_SPEECH_KEY="your-key"
   export AZURE_SPEECH_REGION="your-region"
   ```

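
Because the setup depends on two environment variables, it can help to check for them before creating any client, so a missing key fails with a clear message instead of an SDK error. The following is a small sketch under that assumption; `missing_env_vars` is a hypothetical helper, and the optional `env` argument exists only so the check can be exercised without touching the real environment.

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_VARS = ("AZURE_SPEECH_KEY", "AZURE_SPEECH_REGION")


def missing_env_vars(names=REQUIRED_VARS, env=None):
    """Return the names that are unset or empty in env (defaults to os.environ)."""
    env = os.environ if env is None else env
    return [name for name in names if not env.get(name)]


missing = missing_env_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
else:
    print("Azure Speech credentials found")
```

The same function can be reused for the other cloud engines by passing a different tuple of names.
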
### Usage Example

```python
import os

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('AZURE_SPEECH_KEY'),
    region=os.environ.get('AZURE_SPEECH_REGION')
)

audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

# recognize_once() returns after a single utterance, so for a long file
# it yields only the first phrase
result = speech_recognizer.recognize_once()
print(result.text)
```

---

## AssemblyAI

### Pros

- ✅ **Modern, developer-friendly API**
- ✅ **Excellent accuracy**
- ✅ **Advanced features** (sentiment, topic detection, PII redaction)
- ✅ **Speaker diarization** (included)
- ✅ **Fast processing**
- ✅ **Good documentation**

### Cons

- ❌ **Requires internet**
- ❌ **Costs money** (no free tier, only trial credits)
- ❌ **Privacy concerns** (cloud processing)
- ❌ Requires API key

### Pricing

- **Free trial:** $50 credits
- **Standard:** $0.00025 per second (~$0.90/hour)

### Installation

```bash
pip install assemblyai
```

### Setup

1. Sign up at assemblyai.com
2. Get API key
3. Set environment variable:
   ```bash
   export ASSEMBLYAI_API_KEY="your-key"
   ```

### Usage Example

```python
import os

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

# Enable speaker labels so that transcript.utterances is populated
config = aai.TranscriptionConfig(speaker_labels=True)

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("audio.mp3", config=config)

print(transcript.text)

# Speaker diarization
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")
```

---

## Recommendation Matrix

### Use Faster-Whisper if:

- ✅ Privacy is critical (local processing)
- ✅ Want zero cost (free forever)
- ✅ Need offline capability
- ✅ Processing many files (speed matters)
- ✅ Limited budget

### Use Google Speech-to-Text if:

- ✅ Need absolute best accuracy
- ✅ Have budget for cloud services
- ✅ Want advanced features (punctuation, diarization)
- ✅ Already using GCP ecosystem

### Use Azure Speech if:

- ✅ In Microsoft ecosystem
- ✅ Need custom model training
- ✅ Want real-time transcription
- ✅ Have Azure credits

### Use AssemblyAI if:

- ✅ Need advanced features (sentiment, topics)
- ✅ Want easiest API experience
- ✅ Need automatic PII redaction
- ✅ Value developer experience

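
The matrix above can be read as a small decision procedure. The sketch below encodes one possible reading of it: the function name, the flag names, and especially the order in which the checks run are assumptions made for illustration, and the final default follows the document's overall recommendation of Faster-Whisper.

```python
def recommend_tool(
    *,
    needs_offline: bool = False,
    privacy_critical: bool = False,
    zero_cost: bool = False,
    needs_pii_redaction: bool = False,
    needs_sentiment: bool = False,
    needs_custom_models: bool = False,
    uses_azure: bool = False,
    uses_gcp: bool = False,
) -> str:
    """Map a set of requirements to one of the tools compared above.

    Hard constraints (offline, privacy, zero cost) are checked first because
    only the local engines can satisfy them; the order is an assumption.
    """
    if needs_offline or privacy_critical or zero_cost:
        return "Faster-Whisper"
    if needs_pii_redaction or needs_sentiment:
        return "AssemblyAI"
    if needs_custom_models or uses_azure:
        return "Azure Speech"
    if uses_gcp:
        return "Google Speech-to-Text"
    return "Faster-Whisper"  # default recommendation


print(recommend_tool(privacy_critical=True))
print(recommend_tool(needs_pii_redaction=True))
print(recommend_tool(uses_gcp=True))
```
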

---

## Performance Benchmarks

**Test:** 1-hour podcast (MP3, 44.1kHz, stereo)

| Tool | Processing Time | Accuracy | Cost |
|------|----------------|----------|------|
| Faster-Whisper (small) | 8 min | 94% | $0 |
| Whisper (small) | 32 min | 94% | $0 |
| Google Speech | 2 min | 96% | $1.44 |
| Azure Speech | 3 min | 95% | $1.00 |
| AssemblyAI | 4 min | 96% | $0.90 |

*Benchmarks run on MacBook Pro M1, 16GB RAM*

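
The processing times above are easier to compare as realtime factors, that is, minutes of audio handled per minute of wall-clock time. The sketch below only recomputes figures from the table; `realtime_factor` and `BENCHMARKS` are illustrative names. For the 1-hour test file, Faster-Whisper works out to 7.5x realtime and original Whisper to about 1.9x, which is the 4x gap between the two.

```python
AUDIO_MINUTES = 60  # the test file is a 1-hour podcast

# (processing time in minutes, cost in USD), copied from the table above
BENCHMARKS = {
    "Faster-Whisper (small)": (8, 0.00),
    "Whisper (small)": (32, 0.00),
    "Google Speech": (2, 1.44),
    "Azure Speech": (3, 1.00),
    "AssemblyAI": (4, 0.90),
}


def realtime_factor(processing_minutes: float, audio_minutes: float = AUDIO_MINUTES) -> float:
    """Minutes of audio processed per minute of wall-clock time."""
    return audio_minutes / processing_minutes


for tool, (minutes, cost) in BENCHMARKS.items():
    print(f"{tool:<24} {realtime_factor(minutes):>5.1f}x realtime  ${cost:.2f} per audio hour")
```
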

---

## Conclusion

**For the audio-transcriber skill:**

1. **Primary:** Faster-Whisper (best balance of speed, quality, privacy, cost)
2. **Fallback:** Whisper (if Faster-Whisper unavailable)
3. **Optional:** Cloud APIs (user choice for premium features)

This ensures the skill works out-of-the-box for most users while allowing advanced users to integrate commercial services if needed.

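
The primary/fallback order above could be implemented by checking which local engine is importable. This is a minimal sketch of that idea, not the skill's actual detection logic, which is not shown here: `select_engine` and `ENGINE_MODULES` are hypothetical names, and the injectable `is_installed` callback is there only to make the function easy to test.

```python
import importlib.util

# (engine name, importable module), in priority order from the list above
ENGINE_MODULES = [
    ("faster-whisper", "faster_whisper"),
    ("whisper", "whisper"),
]


def _module_available(module: str) -> bool:
    """True if the top-level module can be found without importing it."""
    return importlib.util.find_spec(module) is not None


def select_engine(is_installed=_module_available):
    """Return the first available local engine, or None if neither is installed."""
    for engine, module in ENGINE_MODULES:
        if is_installed(module):
            return engine
    return None


engine = select_engine()
if engine is None:
    print("No local engine found; install faster-whisper or configure a cloud API")
else:
    print(f"Using {engine}")
```

Cloud APIs stay opt-in in this sketch: they would only be considered when the user configures one explicitly.
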