# TRL Training Methods Overview
TRL (Transformer Reinforcement Learning) provides multiple training methods for fine-tuning and aligning language models. This reference provides a brief overview of each method.
## Supervised Fine-Tuning (SFT)
What it is: Standard instruction tuning with supervised learning on demonstration data.
When to use:
- Initial fine-tuning of base models on task-specific data
- Teaching new capabilities or domains
- Most common starting point for fine-tuning
Dataset format: Conversational format with a "messages" field, a plain "text" field, or prompt/completion pairs
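For concreteness, here is a minimal sketch of one row in each accepted shape (field names follow the TRL dataset-format docs; the example strings are invented):

```python
# Conversational format: a "messages" list of role/content dicts
row_messages = {"messages": [
    {"role": "user", "content": "What color is the sky?"},
    {"role": "assistant", "content": "The sky is blue."},
]}

# Plain language-modeling format: a single "text" field
row_text = {"text": "The sky is blue."}

# Prompt/completion format: separate input and target fields
row_prompt_completion = {"prompt": "The sky is", "completion": " blue."}
```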
Example:
```python
from trl import SFTTrainer, SFTConfig

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=SFTConfig(
        output_dir="my-model",
        push_to_hub=True,
        hub_model_id="username/my-model",
        eval_strategy="no",  # Disable eval for this simple example
        # max_length=1024 is the default - only set it if you need a different length
    ),
)
trainer.train()
```
Note: For production training with evaluation monitoring, see `scripts/train_sft_example.py`.
Documentation: `hf_doc_fetch("https://huggingface.co/docs/trl/sft_trainer")`
## Direct Preference Optimization (DPO)
What it is: Alignment method that trains directly on preference pairs (chosen vs rejected responses) without requiring a reward model.
When to use:
- Aligning models to human preferences
- Improving response quality after SFT
- Have paired preference data (chosen/rejected responses)
Dataset format: Preference pairs with "chosen" and "rejected" fields
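A single preference row pairs one prompt with a preferred and a dispreferred response; a minimal sketch in the conversational shape (example strings invented):

```python
# One DPO preference row: "chosen" is preferred over "rejected"
row = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "The sky is blue."}],
    "rejected": [{"role": "assistant", "content": "The sky is green."}],
}
```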
Example:
```python
from trl import DPOTrainer, DPOConfig

trainer = DPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",  # Use an instruct model
    train_dataset=dataset,
    args=DPOConfig(
        output_dir="dpo-model",
        beta=0.1,  # KL penalty coefficient
        eval_strategy="no",  # Disable eval for this simple example
        # max_length=1024 is the default - only set it if you need a different length
    ),
)
trainer.train()
```
Note: For production training with evaluation monitoring, see `scripts/train_dpo_example.py`.
Documentation: `hf_doc_fetch("https://huggingface.co/docs/trl/dpo_trainer")`
## Group Relative Policy Optimization (GRPO)
What it is: Online RL method that optimizes relative to group performance, useful for tasks with verifiable rewards.
When to use:
- Tasks with automatic reward signals (code execution, math verification)
- Online learning scenarios
- When DPO offline data is insufficient
Dataset format: Prompt-only format (model generates responses, reward computed online)
Example:
```python
# Use the TRL-maintained script
hf_jobs("uv", {
    "script": "https://raw.githubusercontent.com/huggingface/trl/main/examples/scripts/grpo.py",
    "script_args": [
        "--model_name_or_path", "Qwen/Qwen2.5-0.5B-Instruct",
        "--dataset_name", "trl-lib/math_shepherd",
        "--output_dir", "grpo-model"
    ],
    "flavor": "a10g-large",
    "timeout": "4h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})
```
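To make "reward computed online" concrete, here is a minimal local sketch with a custom reward function (a toy example assuming a recent TRL release and a standard prompt-only dataset, so each completion arrives as a plain string; `reward_len` is invented for illustration):

```python
from trl import GRPOTrainer, GRPOConfig

# GRPO reward functions receive the generated completions (plus other dataset
# columns via **kwargs) and must return one float score per completion.
def reward_len(completions, **kwargs):
    # Toy verifiable reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=GRPOConfig(output_dir="grpo-local"),
    train_dataset=dataset,  # prompt-only rows, e.g. {"prompt": "..."}
)
trainer.train()
```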
Documentation: `hf_doc_fetch("https://huggingface.co/docs/trl/grpo_trainer")`
## Reward Modeling
What it is: Train a reward model to score responses, used as a component in RLHF pipelines.
When to use:
- Building RLHF pipeline
- Need automatic quality scoring
- Creating reward signals for PPO training
Dataset format: Preference pairs with "chosen" and "rejected" responses
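This section has no bundled script; a minimal sketch of reward-model training (assuming a recent TRL release; the base model and `output_dir` are placeholders):

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer
from trl import RewardTrainer, RewardConfig

# A reward model is a sequence classifier with a single scalar output head
model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen2.5-0.5B-Instruct", num_labels=1
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

trainer = RewardTrainer(
    model=model,
    processing_class=tokenizer,
    args=RewardConfig(output_dir="reward-model"),
    train_dataset=dataset,  # rows with "chosen" and "rejected" fields
)
trainer.train()
```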
Documentation: `hf_doc_fetch("https://huggingface.co/docs/trl/reward_trainer")`
## Method Selection Guide
| Method | Complexity | Data Required | Use Case |
|---|---|---|---|
| SFT | Low | Demonstrations | Initial fine-tuning |
| DPO | Medium | Paired preferences | Post-SFT alignment |
| GRPO | Medium | Prompts + reward fn | Online RL with automatic rewards |
| Reward | Medium | Paired preferences | Building RLHF pipeline |
## Recommended Pipeline
For most use cases:
1. **Start with SFT** - Fine-tune the base model on task data
2. **Follow with DPO** - Align to preferences using paired data
3. **Optional: GGUF conversion** - Deploy for local inference
For advanced RL scenarios:
1. **Start with SFT** - Fine-tune the base model
2. **Train a reward model** - On preference data
3. **Run online RL (PPO or GRPO)** - Optimize against the reward model's scores
## Dataset Format Reference
For complete dataset format specifications, use:

`hf_doc_fetch("https://huggingface.co/docs/trl/dataset_formats")`
Or validate your dataset:

```bash
uv run https://huggingface.co/datasets/mcp-tools/skills/raw/main/dataset_inspector.py \
  --dataset your/dataset --split train
```
## See Also
- `references/training_patterns.md` - Common training patterns and examples
- `scripts/train_sft_example.py` - Complete SFT template
- `scripts/train_dpo_example.py` - Complete DPO template
- Dataset Inspector - Dataset format validation tool