# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective training.
## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Dataset validation, preprocessing, testing scripts
**Not recommended for training:** Too slow for any meaningful training
### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|---|---|---|---|---|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
### TPU Options

| Flavor | Type | Use Case |
|---|---|---|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |

**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
## Selection Guidelines

### By Model Size
#### Tiny Models (<1B parameters)

- Recommended: `t4-small`
- Example: Qwen2.5-0.5B, TinyLlama
- Batch size: 4-8
- Training time: 1-2 hours for 1K examples
#### Small Models (1-3B parameters)

- Recommended: `t4-medium` or `a10g-small`
- Example: Qwen2.5-1.5B, Phi-2
- Batch size: 2-4
- Training time: 2-4 hours for 10K examples
#### Medium Models (3-7B parameters)

- Recommended: `a10g-small` or `a10g-large`
- Example: Qwen2.5-7B, Mistral-7B
- Batch size: 1-2 (or 4-8 with LoRA)
- Training time: 4-8 hours for 10K examples
#### Large Models (7-13B parameters)

- Recommended: `a10g-large` or `a100-large`
- Example: Llama-3-8B, Mixtral-8x7B (with LoRA)
- Batch size: 1 (full fine-tuning) or 2-4 (LoRA)
- Training time: 6-12 hours for 10K examples
- Note: Always use LoRA/PEFT
#### Very Large Models (13B+ parameters)

- Recommended: `a100-large` with LoRA
- Example: Llama-2-13B, Llama-3-70B (LoRA only)
- Batch size: 1-2 with LoRA
- Training time: 8-24 hours for 10K examples
- Note: Full fine-tuning is not feasible; use LoRA/PEFT (see the sketch below)
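For reference, a minimal LoRA setup looks like the sketch below. The rank, alpha, and target modules are illustrative assumptions, not fixed recommendations:

```python
from peft import LoraConfig

# Illustrative hyperparameters; tune rank/alpha/targets for your model
peft_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapter matrices
    lora_alpha=32,                        # scaling factor (often set to 2x the rank)
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
```

Passing this as `peft_config` to `SFTTrainer` trains only the small adapter matrices, which is what makes 13B+ models trainable on a single `a100-large`.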
### By Budget

#### Minimal Budget (<$5 total)

- Use `t4-small`
- Train on a subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use a small model (<1B)
#### Small Budget ($5-20)

- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters
#### Medium Budget ($20-50)

- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters
#### Large Budget ($50-200)

- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA
### By Training Type

#### Quick Demo/Experiment

- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes
#### Development/Iteration

- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes
#### Production Training

- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours
#### Research/Experimentation

- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours
## Memory Considerations

### Estimating Memory Requirements

**Full fine-tuning:** `Memory (GB) ≈ (Model params in billions) × 20`

**LoRA fine-tuning:** `Memory (GB) ≈ (Model params in billions) × 4`
**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits `t4-small`
- Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits `t4-small`
- Qwen2.5-7B full: ~140GB ❌ not feasible
- Qwen2.5-7B LoRA: ~28GB ⚠️ tight on `a10g-large` (24GB); safe on `a100-large` (40GB)
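The rule of thumb is easy to encode as a quick sanity check. This is a sketch of the formulas above, not an official sizing tool:

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rough training-memory estimate: params (B) x 20 full, x 4 LoRA."""
    return params_billions * (4 if lora else 20)

print(estimate_memory_gb(0.5))            # 10.0 GB (full fine-tuning)
print(estimate_memory_gb(7))              # 140.0 GB (full fine-tuning)
print(estimate_memory_gb(7, lora=True))   # 28.0 GB (LoRA)
```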
### Memory Optimization

If hitting memory limits:

1. **Use LoRA/PEFT:** `peft_config=LoraConfig(r=16, lora_alpha=32)`
2. **Reduce batch size:** `per_device_train_batch_size=1`
3. **Increase gradient accumulation:** `gradient_accumulation_steps=8` (effective batch size = 1 × 8)
4. **Enable gradient checkpointing:** `gradient_checkpointing=True`
5. **Use mixed precision:** `bf16=True` (or `fp16=True`)
6. **Upgrade to a larger GPU:** t4 → a10g → a100
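The sketch below combines the first five settings in a single TRL run. The model id and dataset are placeholders; the `SFTConfig`/`SFTTrainer` parameters shown are standard TRL options, but verify them against your installed version:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,   # smallest per-device batch
    gradient_accumulation_steps=8,   # effective batch size = 1 x 8
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # mixed precision (fp16=True on pre-Ampere GPUs like T4)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",         # placeholder model id
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```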
## Cost Estimation

### Formula

`Total Cost = (Hours of training) × (Cost per hour)`
### Example Calculations

**Quick demo:**
- Hardware: `t4-small` ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: ~$0.19

**Development training:**
- Hardware: `a10g-small` ($3.50/hour)
- Time: 2 hours
- Cost: $7.00

**Production training:**
- Hardware: `a10g-large` ($5/hour)
- Time: 6 hours
- Cost: $30.00

**Large model with LoRA:**
- Hardware: `a100-large` ($10/hour)
- Time: 8 hours
- Cost: $80.00
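The same arithmetic as a throwaway helper; the rates are the assumed midpoints used in the examples above, not official prices:

```python
# Assumed hourly rates matching the example calculations above
RATES = {"t4-small": 0.75, "a10g-small": 3.50, "a10g-large": 5.00, "a100-large": 10.00}

def estimate_cost(flavor: str, hours: float) -> float:
    """Total cost = hours of training x cost per hour."""
    return RATES[flavor] * hours

print(f"${estimate_cost('t4-small', 0.25):.2f}")  # $0.19 (quick demo)
print(f"${estimate_cost('a10g-large', 6):.2f}")   # $30.00 (production run)
```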
### Cost Optimization Tips

- **Start small:** Test on `t4-small` with a data subset
- **Use LoRA:** 4-5x cheaper than full fine-tuning
- **Optimize hyperparameters:** Fewer epochs if possible
- **Set an appropriate timeout:** Don't waste compute on stalled jobs
- **Use checkpointing:** Resume if a job fails (see the sketch below)
- **Monitor costs:** Check running jobs regularly
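For the checkpointing tip, a minimal sketch using standard Hugging Face training arguments (the values are illustrative, and `trainer` stands for an already-constructed `SFTTrainer`):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="./checkpoints",
    save_strategy="steps",   # write a checkpoint every save_steps
    save_steps=200,
    save_total_limit=2,      # keep only the two most recent checkpoints
)

# If the job fails, relaunch with the same output_dir and resume:
# trainer.train(resume_from_checkpoint=True)
```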
## Multi-GPU Training

TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
**When to use:**
- Models >13B parameters
- Need faster training (near-linear speedup)
- Large datasets (>50K examples)
**Example:**

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

No code changes needed: TRL/Accelerate handles distribution automatically.
## Choosing Between Options

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest training
- Memory requirements high
- Budget allows
### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference

```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)",
}
```
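A hypothetical helper that turns the map into a lookup by parameter count; the thresholds simply mirror the buckets above:

```python
def pick_flavor(params_billions: float) -> str:
    """Suggest a flavor for a given model size (guide helper, not an official API)."""
    if params_billions < 1:
        return HARDWARE_MAP["<1B"]
    if params_billions <= 3:
        return HARDWARE_MAP["1-3B"]
    if params_billions <= 7:
        return HARDWARE_MAP["3-7B"]
    if params_billions <= 13:
        return HARDWARE_MAP["7-13B"]
    return HARDWARE_MAP[">13B"]

print(pick_flavor(7))  # a10g-large
```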