# Hardware Selection Guide
Choosing the right hardware (flavor) is critical for cost-effective training.
## Available Hardware

### CPU

- `cpu-basic` - Basic CPU, testing only
- `cpu-upgrade` - Enhanced CPU

**Use cases:** Dataset validation, preprocessing, testing scripts
**Not recommended for training:** Too slow for any meaningful training
### GPU Options

| Flavor | GPU | Memory | Use Case | Cost/hour |
|---|---|---|---|---|
| `t4-small` | NVIDIA T4 | 16GB | <1B models, demos | ~$0.50-1 |
| `t4-medium` | NVIDIA T4 | 16GB | 1-3B models, development | ~$1-2 |
| `l4x1` | NVIDIA L4 | 24GB | 3-7B models, efficient training | ~$2-3 |
| `l4x4` | 4x NVIDIA L4 | 96GB | Multi-GPU training | ~$8-12 |
| `a10g-small` | NVIDIA A10G | 24GB | 3-7B models, production | ~$3-4 |
| `a10g-large` | NVIDIA A10G | 24GB | 7-13B models | ~$4-6 |
| `a10g-largex2` | 2x NVIDIA A10G | 48GB | Multi-GPU, large models | ~$8-12 |
| `a10g-largex4` | 4x NVIDIA A10G | 96GB | Multi-GPU, very large models | ~$16-24 |
| `a100-large` | NVIDIA A100 | 40GB | 13B+ models, fast training | ~$8-12 |
### TPU Options

| Flavor | Type | Use Case |
|---|---|---|
| `v5e-1x1` | TPU v5e | Small TPU workloads |
| `v5e-2x2` | 4x TPU v5e | Medium TPU workloads |
| `v5e-2x4` | 8x TPU v5e | Large TPU workloads |

**Note:** TPUs require TPU-optimized code. Most TRL training uses GPUs.
## Selection Guidelines

### By Model Size
#### Tiny Models (<1B parameters)

- Recommended: `t4-small`
- Example: Qwen2.5-0.5B, TinyLlama
- Batch size: 4-8
- Training time: 1-2 hours for 1K examples
#### Small Models (1-3B parameters)

- Recommended: `t4-medium` or `a10g-small`
- Example: Qwen2.5-1.5B, Phi-2
- Batch size: 2-4
- Training time: 2-4 hours for 10K examples
#### Medium Models (3-7B parameters)

- Recommended: `a10g-small` or `a10g-large`
- Example: Qwen2.5-7B, Mistral-7B
- Batch size: 1-2 (or 4-8 with LoRA)
- Training time: 4-8 hours for 10K examples
#### Large Models (7-13B parameters)

- Recommended: `a10g-large` or `a100-large`
- Example: Llama-3-8B, Mixtral-8x7B (with LoRA)
- Batch size: 1 (full fine-tuning) or 2-4 (LoRA)
- Training time: 6-12 hours for 10K examples
- Note: Always use LoRA/PEFT
#### Very Large Models (13B+ parameters)

- Recommended: `a100-large` with LoRA
- Example: Llama-2-13B, Llama-3-70B (LoRA only)
- Batch size: 1-2 with LoRA
- Training time: 8-24 hours for 10K examples
- Note: Full fine-tuning is not feasible; use LoRA/PEFT (see the sketch below)
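For reference, a minimal LoRA setup looks like the sketch below. The rank, alpha, and target modules are illustrative assumptions, not fixed recommendations:

```python
from peft import LoraConfig

# Illustrative hyperparameters; tune rank/alpha/targets for your model
peft_config = LoraConfig(
    r=16,                                 # rank of the low-rank adapter matrices
    lora_alpha=32,                        # scaling factor (often set to 2x the rank)
    target_modules=["q_proj", "v_proj"],  # attention projections in Llama-style models
    task_type="CAUSAL_LM",
)
```

Passing this as `peft_config` to `SFTTrainer` trains only the small adapter matrices, which is what makes 13B+ models trainable on a single `a100-large`.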
### By Budget

#### Minimal Budget (<$5 total)

- Use `t4-small`
- Train on a subset of data (100-500 examples)
- Limit to 1-2 epochs
- Use a small model (<1B)
#### Small Budget ($5-20)

- Use `t4-medium` or `a10g-small`
- Train on 1K-5K examples
- 2-3 epochs
- Model up to 3B parameters
#### Medium Budget ($20-50)

- Use `a10g-small` or `a10g-large`
- Train on 5K-20K examples
- 3-5 epochs
- Model up to 7B parameters
#### Large Budget ($50-200)

- Use `a10g-large` or `a100-large`
- Full dataset training
- Multiple epochs
- Model up to 13B parameters with LoRA
### By Training Type

#### Quick Demo/Experiment

- `t4-small`
- 50-100 examples
- 5-10 steps
- ~10-15 minutes
#### Development/Iteration

- `t4-medium` or `a10g-small`
- 1K examples
- 1 epoch
- ~30-60 minutes
#### Production Training

- `a10g-large` or `a100-large`
- Full dataset
- 3-5 epochs
- 4-12 hours
#### Research/Experimentation

- `a100-large`
- Multiple runs
- Various hyperparameters
- Budget for 20-50 hours
## Memory Considerations

### Estimating Memory Requirements

**Full fine-tuning:** `Memory (GB) ≈ (Model params in billions) × 20`

**LoRA fine-tuning:** `Memory (GB) ≈ (Model params in billions) × 4`
**Examples:**
- Qwen2.5-0.5B full: ~10GB ✅ fits `t4-small`
- Qwen2.5-1.5B full: ~30GB ❌ exceeds most GPUs
- Qwen2.5-1.5B LoRA: ~6GB ✅ fits `t4-small`
- Qwen2.5-7B full: ~140GB ❌ not feasible
- Qwen2.5-7B LoRA: ~28GB ⚠️ tight on `a10g-large` (24GB); safe on `a100-large` (40GB)
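The rule of thumb is easy to encode as a quick sanity check. This is a sketch of the formulas above, not an official sizing tool:

```python
def estimate_memory_gb(params_billions: float, lora: bool = False) -> float:
    """Rough training-memory estimate: params (B) x 20 full, x 4 LoRA."""
    return params_billions * (4 if lora else 20)

print(estimate_memory_gb(0.5))            # 10.0 GB (full fine-tuning)
print(estimate_memory_gb(7))              # 140.0 GB (full fine-tuning)
print(estimate_memory_gb(7, lora=True))   # 28.0 GB (LoRA)
```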
### Memory Optimization

If hitting memory limits:

1. **Use LoRA/PEFT:** `peft_config=LoraConfig(r=16, lora_alpha=32)`
2. **Reduce batch size:** `per_device_train_batch_size=1`
3. **Increase gradient accumulation:** `gradient_accumulation_steps=8` (effective batch size = 1 × 8)
4. **Enable gradient checkpointing:** `gradient_checkpointing=True`
5. **Use mixed precision:** `bf16=True` (or `fp16=True`)
6. **Upgrade to a larger GPU:** t4 → a10g → a100
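The sketch below combines the first five settings in a single TRL run. The model id and dataset are placeholders; the `SFTConfig`/`SFTTrainer` parameters shown are standard TRL options, but verify them against your installed version:

```python
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # placeholder dataset

config = SFTConfig(
    output_dir="./checkpoints",
    per_device_train_batch_size=1,   # smallest per-device batch
    gradient_accumulation_steps=8,   # effective batch size = 1 x 8
    gradient_checkpointing=True,     # recompute activations to save memory
    bf16=True,                       # mixed precision (fp16=True on pre-Ampere GPUs like T4)
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-7B",         # placeholder model id
    args=config,
    train_dataset=dataset,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)
trainer.train()
```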
## Cost Estimation

### Formula

`Total Cost = (Hours of training) × (Cost per hour)`
### Example Calculations

**Quick demo:**
- Hardware: `t4-small` ($0.75/hour)
- Time: 15 minutes (0.25 hours)
- Cost: ~$0.19

**Development training:**
- Hardware: `a10g-small` ($3.50/hour)
- Time: 2 hours
- Cost: $7.00

**Production training:**
- Hardware: `a10g-large` ($5/hour)
- Time: 6 hours
- Cost: $30.00

**Large model with LoRA:**
- Hardware: `a100-large` ($10/hour)
- Time: 8 hours
- Cost: $80.00
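The same arithmetic as a throwaway helper; the rates are the assumed midpoints used in the examples above, not official prices:

```python
# Assumed hourly rates matching the example calculations above
RATES = {"t4-small": 0.75, "a10g-small": 3.50, "a10g-large": 5.00, "a100-large": 10.00}

def estimate_cost(flavor: str, hours: float) -> float:
    """Total cost = hours of training x cost per hour."""
    return RATES[flavor] * hours

print(f"${estimate_cost('t4-small', 0.25):.2f}")  # $0.19 (quick demo)
print(f"${estimate_cost('a10g-large', 6):.2f}")   # $30.00 (production run)
```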
### Cost Optimization Tips

- **Start small:** Test on `t4-small` with a data subset
- **Use LoRA:** 4-5x cheaper than full fine-tuning
- **Optimize hyperparameters:** Fewer epochs if possible
- **Set an appropriate timeout:** Don't waste compute on stalled jobs
- **Use checkpointing:** Resume if a job fails (see the sketch below)
- **Monitor costs:** Check running jobs regularly
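For the checkpointing tip, a minimal sketch using standard Hugging Face training arguments (the values are illustrative, and `trainer` stands for an already-constructed `SFTTrainer`):

```python
from trl import SFTConfig

config = SFTConfig(
    output_dir="./checkpoints",
    save_strategy="steps",   # write a checkpoint every save_steps
    save_steps=200,
    save_total_limit=2,      # keep only the two most recent checkpoints
)

# If the job fails, relaunch with the same output_dir and resume:
# trainer.train(resume_from_checkpoint=True)
```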
## Multi-GPU Training

TRL automatically handles multi-GPU training with Accelerate when using multi-GPU flavors.

**Multi-GPU flavors:**
- `l4x4` - 4x L4 GPUs
- `a10g-largex2` - 2x A10G GPUs
- `a10g-largex4` - 4x A10G GPUs
**When to use:**
- Models >13B parameters
- Need faster training (near-linear speedup)
- Large datasets (>50K examples)
**Example:**

```python
hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-largex2",  # 2 GPUs
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```

No code changes needed: TRL/Accelerate handles distribution automatically.
## Choosing Between Options

### a10g vs a100

**Choose a10g when:**
- Model <13B parameters
- Budget conscious
- Training time not critical

**Choose a100 when:**
- Model 13B+ parameters
- Need fastest training
- Memory requirements high
- Budget allows
### Single vs Multi-GPU

**Choose single GPU when:**
- Model <7B parameters
- Budget constrained
- Simpler debugging

**Choose multi-GPU when:**
- Model >13B parameters
- Need faster training
- Large batch sizes required
- Cost-effective for large jobs
## Quick Reference

```python
# Model size → Hardware selection
HARDWARE_MAP = {
    "<1B": "t4-small",
    "1-3B": "a10g-small",
    "3-7B": "a10g-large",
    "7-13B": "a10g-large (LoRA) or a100-large",
    ">13B": "a100-large (LoRA required)",
}
```
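A hypothetical helper that turns the map into a lookup by parameter count; the thresholds simply mirror the buckets above:

```python
def pick_flavor(params_billions: float) -> str:
    """Suggest a flavor for a given model size (guide helper, not an official API)."""
    if params_billions < 1:
        return HARDWARE_MAP["<1B"]
    if params_billions <= 3:
        return HARDWARE_MAP["1-3B"]
    if params_billions <= 7:
        return HARDWARE_MAP["3-7B"]
    if params_billions <= 13:
        return HARDWARE_MAP["7-13B"]
    return HARDWARE_MAP[">13B"]

print(pick_flavor(7))  # a10g-large
```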