Files
antigravity-skills-reference/skills/audio-transcriber/references/tools-comparison.md
Eric Andrade 801c8fa475 feat: add 4 universal skills from cli-ai-skills
- Add audio-transcriber skill (v1.2.0): Transform audio to Markdown with Whisper
- Add youtube-summarizer skill (v1.2.0): Generate summaries from YouTube videos
- Update prompt-engineer skill: Enhanced with 11 optimization frameworks
- Update skill-creator skill: Improved automation workflow

All skills are zero-config, cross-platform (Claude Code, Copilot CLI, Codex)
and follow Quality Bar V4 standards.

Source: https://github.com/ericgandrade/cli-ai-skills
2026-02-04 17:37:45 -03:00

8.2 KiB

Transcription Tools Comparison

Comprehensive comparison of audio transcription engines supported by the audio-transcriber skill.

Overview

Tool Type Speed Quality Cost Privacy Offline Languages
Faster-Whisper Open-source Free 100% 99
Whisper Open-source Free 100% 99
Google Speech-to-Text Commercial API $0.006/15s Partial 125+
Azure Speech Commercial API $1/hour Partial 100+
AssemblyAI Commercial API $0.00025/s Partial 99

Pros

4-5x faster than original Whisper
Same quality as original Whisper
Lower memory usage (50-60% less RAM)
Free and open-source
100% offline (privacy guaranteed)
Easy installation (pip install faster-whisper)
Drop-in replacement for Whisper

Cons

Requires Python 3.8+
Initial model download (~100MB-1.5GB)
GPU optional but speeds up significantly

Installation

pip install faster-whisper

Usage Example

from faster_whisper import WhisperModel

# Load model (auto-downloads on first run)
model = WhisperModel("base", device="cpu", compute_type="int8")

# Transcribe
segments, info = model.transcribe("audio.mp3", language="pt")

# Print results
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")

Model Sizes

Model Size RAM Speed (CPU) Quality
tiny 39 MB ~1 GB Very fast (~10x realtime) Basic
base 74 MB ~1 GB Fast (~7x realtime) Good
small 244 MB ~2 GB Moderate (~4x realtime) Very good
medium 769 MB ~5 GB Slow (~2x realtime) Excellent
large 1550 MB ~10 GB Very slow (~1x realtime) Best

Recommendation: small or medium for production use.


Whisper (Original)

Pros

Official OpenAI model
Excellent quality
Free and open-source
100% offline
Well-documented
Large community

Cons

Slower than Faster-Whisper (4-5x)
Higher memory usage
Requires PyTorch (large dependency)
GPU highly recommended for larger models

Installation

pip install openai-whisper

Usage Example

import whisper

# Load model
model = whisper.load_model("base")

# Transcribe
result = model.transcribe("audio.mp3", language="pt")

# Print results
print(result["text"])

When to Use Whisper vs. Faster-Whisper

Use Faster-Whisper if:

  • Speed is important
  • Limited RAM available
  • Processing many files

Use Original Whisper if:

  • Faster-Whisper installation issues
  • Need exact OpenAI implementation
  • Already have Whisper in project dependencies

Google Cloud Speech-to-Text

Pros

Very accurate (industry-leading)
Fast processing (cloud infrastructure)
125+ languages
Word-level timestamps
Punctuation & capitalization
Speaker diarization (premium)

Cons

Requires internet (cloud-only)
Costs money (after free tier)
Privacy concerns (audio uploaded to Google)
Requires GCP account setup
Complex authentication

Pricing

  • Free tier: 60 minutes/month
  • Standard: $0.006 per 15 seconds ($1.44/hour)
  • Premium: $0.009 per 15 seconds (with diarization)

Installation

pip install google-cloud-speech

Setup

  1. Create GCP project
  2. Enable Speech-to-Text API
  3. Create service account & download JSON key
  4. Set environment variable:
    export GOOGLE_APPLICATION_CREDENTIALS="path/to/key.json"
    

Usage Example

from google.cloud import speech

client = speech.SpeechClient()

with open("audio.wav", "rb") as audio_file:
    content = audio_file.read()

audio = speech.RecognitionAudio(content=content)
config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,
    sample_rate_hertz=16000,
    language_code="pt-BR",
)

response = client.recognize(config=config, audio=audio)

for result in response.results:
    print(result.alternatives[0].transcript)

Azure Speech Services

Pros

High accuracy
100+ languages
Real-time transcription
Custom models (train on your data)
Good Microsoft ecosystem integration

Cons

Requires internet
Costs money (after free tier)
Privacy concerns (cloud processing)
Requires Azure account
Complex setup

Pricing

  • Free tier: 5 hours/month
  • Standard: $1.00 per audio hour

Installation

pip install azure-cognitiveservices-speech

Setup

  1. Create Azure account
  2. Create Speech resource
  3. Get API key and region
  4. Set environment variables:
    export AZURE_SPEECH_KEY="your-key"
    export AZURE_SPEECH_REGION="your-region"
    

Usage Example

import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ.get('AZURE_SPEECH_KEY'),
    region=os.environ.get('AZURE_SPEECH_REGION')
)

audio_config = speechsdk.audio.AudioConfig(filename="audio.wav")
speech_recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config
)

result = speech_recognizer.recognize_once()
print(result.text)

AssemblyAI

Pros

Modern, developer-friendly API
Excellent accuracy
Advanced features (sentiment, topic detection, PII redaction)
Speaker diarization (included)
Fast processing
Good documentation

Cons

Requires internet
Costs money (no free tier, only trial credits)
Privacy concerns (cloud processing)
Requires API key

Pricing

  • Free trial: $50 credits
  • Standard: $0.00025 per second (~$0.90/hour)

Installation

pip install assemblyai

Setup

  1. Sign up at assemblyai.com
  2. Get API key
  3. Set environment variable:
    export ASSEMBLYAI_API_KEY="your-key"
    

Usage Example

import assemblyai as aai

aai.settings.api_key = os.environ["ASSEMBLYAI_API_KEY"]

transcriber = aai.Transcriber()
transcript = transcriber.transcribe("audio.mp3")

print(transcript.text)

# Speaker diarization
for utterance in transcript.utterances:
    print(f"Speaker {utterance.speaker}: {utterance.text}")

Recommendation Matrix

Use Faster-Whisper if:

  • Privacy is critical (local processing)
  • Want zero cost (free forever)
  • Need offline capability
  • Processing many files (speed matters)
  • Limited budget

Use Google Speech-to-Text if:

  • Need absolute best accuracy
  • Have budget for cloud services
  • Want advanced features (punctuation, diarization)
  • Already using GCP ecosystem

Use Azure Speech if:

  • In Microsoft ecosystem
  • Need custom model training
  • Want real-time transcription
  • Have Azure credits

Use AssemblyAI if:

  • Need advanced features (sentiment, topics)
  • Want easiest API experience
  • Need automatic PII redaction
  • Value developer experience

Performance Benchmarks

Test: 1-hour podcast (MP3, 44.1kHz, stereo)

Tool Processing Time Accuracy Cost
Faster-Whisper (small) 8 min 94% $0
Whisper (small) 32 min 94% $0
Google Speech 2 min 96% $1.44
Azure Speech 3 min 95% $1.00
AssemblyAI 4 min 96% $0.90

Benchmarks run on MacBook Pro M1, 16GB RAM


Conclusion

For the audio-transcriber skill:

  1. Primary: Faster-Whisper (best balance of speed, quality, privacy, cost)
  2. Fallback: Whisper (if Faster-Whisper unavailable)
  3. Optional: Cloud APIs (user choice for premium features)

This ensures the skill works out-of-the-box for most users while allowing advanced users to integrate commercial services if needed.