# Hardware Selection Guide

Choosing the right hardware (flavor) is critical for cost-effective training.

## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** dataset validation, preprocessing, testing scripts.
**Not recommended for training:** too slow for any meaningful training.

### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|---|---|---|---|---|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |

### TPU Options

| Flavor | Type | Use Case |
|---|---|---|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |

**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.

## Selection Guidelines

### By Model Size

#### Tiny Models (<1B parameters)

- **Recommended:** `t4-small`
- **Examples:** Qwen2.5-0.5B, TinyLlama
- **Batch size:** 4-8
- **Training time:** 1-2 hours for 1K examples

#### Small Models (1-3B parameters)

- **Recommended:** `t4-medium` or `a10g-small`
- **Examples:** Qwen2.5-1.5B, Phi-2
- **Batch size:** 2-4
- **Training time:** 2-4 hours for 10K examples

#### Medium Models (3-7B parameters)

- **Recommended:** `a10g-small` or `a10g-large`
- **Examples:** Qwen2.5-7B, Mistral-7B
- **Batch size:** 1-2 (or 4-8 with LoRA)
- **Training time:** 4-8 hours for 10K examples

#### Large Models (7-13B parameters)

- **Recommended:** `a10g-large` or `a100-large`
- **Examples:** Llama-3-8B, Mixtral-8x7B (with LoRA)
- **Batch size:** 1 (full fine-tuning) or 2-4 (LoRA)
- **Training time:** 6-12 hours for 10K examples
- **Note:** always use LoRA/PEFT

#### Very Large Models (13B+ parameters)

- **Recommended:** `a100-large` with LoRA
- **Examples:** Llama-2-13B, Llama-3-70B (LoRA only)
- **Batch size:** 1-2 with LoRA
- **Training time:** 8-24 hours for 10K examples
- **Note:** full fine-tuning is not feasible; use LoRA/PEFT

### By Budget

#### Minimal Budget (<$5 total)

- Use `t4-small`
- Train on a subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use a small model (<1B)

#### Small Budget ($5-20)

- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters

#### Medium Budget ($20-50)

- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters

#### Large Budget ($50-200)

- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA
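The budget tiers above can be restated as data with a small lookup. A sketch — the tier values are copied from the lists above, but the structure and the `budget_plan` function are illustrative, not part of any library:

```python
# Budget tiers from the guide, as (ceiling_usd, plan) pairs.
BUDGET_TIERS = [
    (5,   {"flavor": "t4-small", "examples": "100-500", "epochs": "1-2", "max_model": "<1B"}),
    (20,  {"flavor": "t4-medium or a10g-small", "examples": "1K-5K", "epochs": "2-3", "max_model": "3B"}),
    (50,  {"flavor": "a10g-small or a10g-large", "examples": "5K-20K", "epochs": "3-5", "max_model": "7B"}),
    (200, {"flavor": "a10g-large or a100-large", "examples": "full dataset", "epochs": "multiple", "max_model": "13B (LoRA)"}),
]

def budget_plan(budget_usd: float) -> dict:
    """Return the first tier whose ceiling covers the budget."""
    for ceiling, plan in BUDGET_TIERS:
        if budget_usd <= ceiling:
            return plan
    # Budgets above $200 still get the top tier's recommendation.
    return BUDGET_TIERS[-1][1]

print(budget_plan(15)["flavor"])   # t4-medium or a10g-small
print(budget_plan(100)["max_model"])  # 13B (LoRA)
```

Tier boundaries are treated as inclusive here; adjust to taste.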

### By Training Type

#### Quick Demo/Experiment

- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes

#### Development/Iteration

- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes

#### Production Training

- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours

#### Research/Experimentation

- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours

## Memory Considerations

### Estimating Memory Requirements

Full fine-tuning:

`Memory (GB) ≈ (Model params in billions) × 20`

LoRA fine-tuning:

`Memory (GB) ≈ (Model params in billions) × 4`

Examples:

- Qwen2.5-0.5B full: ~10GB → fits `t4-small`
- Qwen2.5-1.5B full: ~30GB → exceeds most single-GPU flavors
- Qwen2.5-1.5B LoRA: ~6GB → fits `t4-small`
- Qwen2.5-7B full: ~140GB → not feasible
- Qwen2.5-7B LoRA: ~28GB → fits `a10g-large`
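The two rules of thumb above can be sketched as a helper. The function name and signature are mine (not from any library); the 20×/4× multipliers come straight from the formulas:

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rough training-memory estimate from the guide's rules of thumb.

    Full fine-tuning: ~20 GB per billion parameters (weights, gradients,
    optimizer states, activations). LoRA: ~4 GB per billion (frozen base
    weights plus small adapter layers).
    """
    multiplier = 4 if lora else 20
    return params_billions * multiplier

# Matches the examples above:
print(estimate_memory_gb(0.5))           # 10.0
print(estimate_memory_gb(1.5))           # 30.0
print(estimate_memory_gb(7))             # 140.0
print(estimate_memory_gb(7, lora=True))  # 28.0
```

Treat the result as an upper-bound estimate; mixed precision and gradient checkpointing (below) can bring actual usage well under it.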

### Memory Optimization

If hitting memory limits:

1. Use LoRA/PEFT:

   ```python
   peft_config=LoraConfig(r=16, lora_alpha=32)
   ```

2. Reduce batch size:

   ```python
   per_device_train_batch_size=1
   ```

3. Increase gradient accumulation:

   ```python
   gradient_accumulation_steps=8  # effective batch size = 1 × 8
   ```

4. Enable gradient checkpointing:

   ```python
   gradient_checkpointing=True
   ```

5. Use mixed precision:

   ```python
   bf16=True  # or fp16=True
   ```

6. Upgrade to a larger GPU: t4 → a10g → a100
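The escalation order above can be captured as a small planner. This is a sketch under loose assumptions: the reduction factors are illustrative round numbers, not measured savings, and the function is mine, not part of TRL or PEFT:

```python
# Memory-saving steps in the order the guide recommends them, with
# rough, illustrative reduction factors. Gradient accumulation is
# omitted because it preserves the effective batch size rather than
# reducing memory by itself.
OPTIMIZATIONS = [
    ("use LoRA/PEFT",          0.20),  # ~20 GB/B -> ~4 GB/B
    ("batch size 1",           0.85),
    ("gradient checkpointing", 0.70),  # recompute activations in backward
    ("bf16 mixed precision",   0.75),
]

def plan_optimizations(estimated_gb: float, gpu_gb: float) -> list:
    """Enable steps in order until the estimate fits the GPU."""
    enabled = []
    for name, factor in OPTIMIZATIONS:
        if estimated_gb <= gpu_gb:
            break
        estimated_gb *= factor
        enabled.append(name)
    if estimated_gb > gpu_gb:
        enabled.append("upgrade to larger GPU")
    return enabled

# 7B full fine-tune (~140 GB estimate) on a 24 GB A10G:
print(plan_optimizations(140, 24))  # ['use LoRA/PEFT', 'batch size 1']
```

A 10 GB estimate on a 16 GB T4 returns an empty list: nothing to change.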

## Cost Estimation

### Formula

`Total Cost = (Hours of training) × (Cost per hour)`

### Example Calculations

Quick demo:

- Hardware: `t4-small` ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: ~$0.19

Development training:

- Hardware: `a10g-small` ($3.50/hour)
- Time: 2 hours
- Cost: $7.00

Production training:

- Hardware: `a10g-large` ($5/hour)
- Time: 6 hours
- Cost: $30.00

Large model with LoRA:

- Hardware: `a100-large` ($10/hour)
- Time: 8 hours
- Cost: $80.00
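The four examples above are one multiplication each; a tiny helper (the function name is illustrative) makes them reproducible:

```python
def training_cost(hourly_rate: float, hours: float) -> float:
    """Total cost = hours of training × cost per hour, rounded to cents."""
    return round(hourly_rate * hours, 2)

print(training_cost(0.75, 0.25))  # 0.19  -> quick demo on t4-small
print(training_cost(3.50, 2))     # 7.0   -> development on a10g-small
print(training_cost(5.00, 6))     # 30.0  -> production on a10g-large
print(training_cost(10.00, 8))    # 80.0  -> LoRA run on a100-large
```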

### Cost Optimization Tips

1. **Start small:** test on `t4-small` with a data subset
2. **Use LoRA:** roughly 4-5x cheaper than full fine-tuning
3. **Optimize hyperparameters:** use fewer epochs when possible
4. **Set an appropriate timeout:** don't waste compute on stalled jobs
5. **Use checkpointing:** resume if a job fails
6. **Monitor costs:** check running jobs regularly

## Multi-GPU Training

TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.

Multi-GPU flavors:

- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs

When to use:

- Models >13B parameters
- Need faster training (near-linear speedup)
- Large datasets (>50K examples)

Example:

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

No code changes needed; TRL/Accelerate handles distribution automatically.

## Choosing Between Options

### a10g vs a100

Choose `a10g` when:

- Model <13B parameters
- Budget conscious
- Training time is not critical

Choose `a100` when:

- Model 13B+ parameters
- Need the fastest training
- Memory requirements are high
- Budget allows

### Single vs Multi-GPU

Choose a single GPU when:

- Model <7B parameters
- Budget constrained
- Simpler debugging

Choose multi-GPU when:

- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs

## Quick Reference

```python
# Model size → hardware selection
HARDWARE_MAP = {
    "<1B":   "t4-small",
    "1-3B":  "a10g-small",
    "3-7B":  "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B":  "a100-large (LoRA required)",
}
```
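A hypothetical lookup built on the size buckets above — the numeric thresholds are my reading of the bucket labels, and `pick_flavor` is not a real API:

```python
def pick_flavor(params_billions: float) -> str:
    """Map a parameter count onto the guide's size buckets."""
    if params_billions < 1:
        return "t4-small"
    if params_billions <= 3:
        return "a10g-small"
    if params_billions <= 7:
        return "a10g-large"
    if params_billions <= 13:
        return "a10g-large (LoRA) or a100-large"
    return "a100-large (LoRA required)"

print(pick_flavor(0.5))  # t4-small
print(pick_flavor(7))    # a10g-large
print(pick_flavor(70))   # a100-large (LoRA required)
```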