Reliability Principles for Training Jobs
These principles are derived from real production failures and successful fixes. Following them prevents common failure modes and ensures reliable job execution.
Principle 1: Always Verify Before Use
Rule: Never assume repos, datasets, or resources exist. Verify with tools first.
What It Prevents
- Non-existent datasets - Jobs fail immediately when the dataset doesn't exist
- Wrong names - Easy mistakes like assuming "argilla-dpo-mix-7k" when the actual dataset is "ultrafeedback_binarized"
- Incorrect paths - Old or moved repos, renamed files
- Missing dependencies - Undocumented requirements
How to Apply
Before submitting ANY job:
# Verify dataset exists
dataset_search({"query": "dataset-name", "author": "author-name", "limit": 5})
hub_repo_details(["author/dataset-name"], repo_type="dataset")
# Verify model exists
hub_repo_details(["org/model-name"], repo_type="model")
# Check script/file paths (for URL-based scripts)
# Verify before using: https://github.com/user/repo/blob/main/script.py
Examples that would have caught errors:
# ❌ WRONG: Assumed dataset exists
hf_jobs("uv", {
"script": """...""",
"env": {"DATASET": "trl-lib/argilla-dpo-mix-7k"} # Doesn't exist!
})
# ✅ CORRECT: Verify first
dataset_search({"query": "argilla dpo", "author": "trl-lib"})
# Would show: "trl-lib/ultrafeedback_binarized" is the correct name
hub_repo_details(["trl-lib/ultrafeedback_binarized"], repo_type="dataset")
# Confirms it exists before using
Implementation Checklist
- Check dataset exists before training
- Verify base model exists before fine-tuning
- Confirm adapter model exists before GGUF conversion
- Test script URLs are valid before submitting
- Validate file paths in repositories
- Check for recent updates/renames of resources
Time cost: 5-10 seconds
Time saved: Hours of failed job time + debugging
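The verification step can be sketched with nothing but the standard library. The helper names below are invented for illustration; in practice the `hub_repo_details` tool or `huggingface_hub.HfApi.repo_exists` plays this role.

```python
import urllib.request
import urllib.error

HUB_API = "https://huggingface.co/api"

def hub_repo_url(repo_id: str, repo_type: str = "model") -> str:
    """Build the public Hub metadata URL for a repo."""
    plural = {"model": "models", "dataset": "datasets"}[repo_type]
    return f"{HUB_API}/{plural}/{repo_id}"

def hub_repo_exists(repo_id: str, repo_type: str = "model") -> bool:
    """True if the Hub returns HTTP 200 for the repo's metadata endpoint."""
    try:
        with urllib.request.urlopen(hub_repo_url(repo_id, repo_type), timeout=10) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False
```

Run the check for every repo a job touches and refuse to submit if any lookup fails; 10 seconds here saves hours of failed GPU time.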
Principle 2: Prioritize Reliability Over Performance
Rule: Default to what is most likely to succeed, not what is theoretically fastest.
What It Prevents
- Hardware incompatibilities - Features that fail on certain GPUs
- Unstable optimizations - Speed-ups that cause crashes
- Complex configurations - More failure points
- Build system issues - Unreliable compilation methods
How to Apply
Choose reliability:
# ❌ RISKY: Aggressive optimization that may fail
SFTConfig(
torch_compile=True, # Can fail on T4, A10G GPUs
optim="adamw_bnb_8bit", # Requires specific setup
bf16=True, # Unsupported on T4 and other pre-Ampere GPUs
...
)
# ✅ SAFE: Proven defaults
SFTConfig(
# torch_compile=True, # Optional: enable on H100 for ~20% speedup
optim="adamw_torch", # Standard, always works
fp16=True, # Stable and fast
...
)
For build processes:
# ❌ UNRELIABLE: Uses make (platform-dependent)
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"], check=True)
# ✅ RELIABLE: Uses CMake (consistent, documented)
subprocess.run([
"cmake", "-B", "/tmp/llama.cpp/build", "-S", "/tmp/llama.cpp",
"-DGGML_CUDA=OFF" # Disable CUDA for faster, more reliable build
], check=True)
subprocess.run([
"cmake", "--build", "/tmp/llama.cpp/build",
"--target", "llama-quantize", "-j", "4"
], check=True)
Real-World Example
The torch.compile failure:
- Added for "20% speedup" on H100
- Failed fatally on T4-medium with cryptic error
- Misdiagnosed as dataset issue (cost hours)
- Fix: Disable by default, add as optional comment
Result: Reliability > 20% performance gain
Implementation Checklist
- Use proven, standard configurations by default
- Comment out performance optimizations with hardware notes
- Use stable build systems (CMake > make)
- Test on target hardware before production
- Document known incompatibilities
- Provide "safe" and "fast" variants when needed
Performance loss: 10-20% in best case
Reliability gain: 95%+ success rate vs 60-70%
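The "safe by default, fast on request" split can be captured in a small config factory. This is a hypothetical sketch: the field names mirror TRL's SFTConfig, but the function and profiles are invented here.

```python
def training_kwargs(profile: str = "safe") -> dict:
    """Return training keyword arguments for the given reliability profile."""
    kwargs = {
        "optim": "adamw_torch",  # standard optimizer, works everywhere
        "fp16": True,            # stable mixed precision
        "torch_compile": False,  # off by default: known to fail on T4/A10G
    }
    if profile == "fast":
        # Only for hardware validated to support these (e.g. H100)
        kwargs["torch_compile"] = True
        kwargs["optim"] = "adamw_bnb_8bit"
    return kwargs
```

Keeping both variants in one place makes the "optimize after validation" step a one-word change instead of a rewrite.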
Principle 3: Create Atomic, Self-Contained Scripts
Rule: Scripts should work as complete, independent units. Don't remove parts to "simplify."
What It Prevents
- Missing dependencies - Removed "unnecessary" packages that are actually required
- Incomplete processes - Skipped steps that seem redundant
- Environment assumptions - Scripts that need pre-setup
- Partial failures - Some parts work, others fail silently
How to Apply
Complete dependency specifications:
# ❌ INCOMPLETE: "Simplified" by removing dependencies
# /// script
# dependencies = [
# "transformers",
# "peft",
# "torch",
# ]
# ///
# ✅ COMPLETE: All dependencies explicit
# /// script
# dependencies = [
# "transformers>=4.36.0",
# "peft>=0.7.0",
# "torch>=2.0.0",
# "accelerate>=0.24.0",
# "huggingface_hub>=0.20.0",
# "sentencepiece>=0.1.99", # Required for tokenizers
# "protobuf>=3.20.0", # Required for tokenizers
# "numpy",
# "gguf",
# ]
# ///
Complete build processes:
# ❌ INCOMPLETE: Assumes build tools exist
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
subprocess.run(["make", "-C", "/tmp/llama.cpp", "llama-quantize"]) # FAILS: no gcc/make
# ✅ COMPLETE: Installs all requirements
subprocess.run(["apt-get", "update", "-qq"], check=True)
subprocess.run(["apt-get", "install", "-y", "-qq", "build-essential", "cmake"], check=True)
subprocess.run(["git", "clone", "https://github.com/ggerganov/llama.cpp.git", "/tmp/llama.cpp"])
# ... then build
Real-World Example
The sentencepiece failure:
- Original script had it: worked fine
- "Simplified" version removed it: "doesn't look necessary"
- GGUF conversion failed silently - tokenizer couldn't convert
- Hard to debug: no obvious error message
- Fix: Restore all original dependencies
Lesson: Don't remove dependencies without thorough testing
Implementation Checklist
- All dependencies in PEP 723 header with version pins
- All system packages installed by script
- No assumptions about pre-existing environment
- No "optional" steps that are actually required
- Test scripts in clean environment
- Document why each dependency is needed
Complexity: Slightly longer scripts
Reliability: Scripts "just work" every time
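A self-contained script can also verify its own environment up front instead of failing mid-run. A minimal sketch (the `missing_dependencies` helper is invented; the module and binary names passed to it are examples from the header above):

```python
import importlib.util
import shutil

def missing_dependencies(modules: list[str], binaries: list[str]) -> list[str]:
    """Return any Python modules or system binaries that are not available."""
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    missing += [b for b in binaries if shutil.which(b) is None]
    return missing
```

At the top of a training script this becomes a fail-fast guard, e.g. `if missing := missing_dependencies(["transformers", "sentencepiece"], ["git", "cmake"]): raise SystemExit(f"Missing: {missing}")`.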
Principle 4: Provide Clear Error Context
Rule: When things fail, make it obvious what went wrong and how to fix it.
How to Apply
Wrap subprocess calls:
# ❌ UNCLEAR: Silent failure
subprocess.run([...], check=True, capture_output=True)
# ✅ CLEAR: Shows what failed
try:
result = subprocess.run(
[...],
check=True,
capture_output=True,
text=True
)
print(result.stdout)
if result.stderr:
print("Warnings:", result.stderr)
except subprocess.CalledProcessError as e:
print("❌ Command failed!")
print("STDOUT:", e.stdout)
print("STDERR:", e.stderr)
raise
Validate inputs:
# ❌ UNCLEAR: Fails later with cryptic error
model = load_model(MODEL_NAME)
# ✅ CLEAR: Fails fast with clear message
if not MODEL_NAME:
raise ValueError("MODEL_NAME environment variable not set!")
print(f"Loading model: {MODEL_NAME}")
try:
model = load_model(MODEL_NAME)
print("✅ Model loaded successfully")
except Exception as e:
print(f"❌ Failed to load model: {MODEL_NAME}")
print(f"Error: {e}")
print("Hint: Check that model exists on Hub")
raise
Implementation Checklist
- Wrap external calls with try/except
- Print stdout/stderr on failure
- Validate environment variables early
- Add progress indicators (✅, ❌, 🔄)
- Include hints for common failures
- Log configuration at start
Principle 5: Test the Happy Path on Known-Good Inputs
Rule: Before using new code in production, test with inputs you know work.
How to Apply
Known-good test inputs:
# For training
TEST_DATASET = "trl-lib/Capybara" # Small, well-formatted, widely used
TEST_MODEL = "Qwen/Qwen2.5-0.5B" # Small, fast, reliable
# For GGUF conversion
TEST_ADAPTER = "evalstate/qwen-capybara-medium" # Known working model
TEST_BASE = "Qwen/Qwen2.5-0.5B" # Compatible base
Testing workflow:
- Test with known-good inputs first
- If that works, try production inputs
- If production fails, you know it's the inputs (not code)
- Isolate the difference
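The workflow can be sketched as a small harness. `submit_job` here stands in for an actual job submission; `KNOWN_GOOD` uses the fixtures listed above, and the function itself is hypothetical.

```python
KNOWN_GOOD = {"model": "Qwen/Qwen2.5-0.5B", "dataset": "trl-lib/Capybara"}

def diagnose(submit_job, prod_inputs: dict) -> str:
    """Run known-good inputs first to separate code bugs from bad inputs."""
    try:
        submit_job(KNOWN_GOOD)
    except Exception:
        return "code is the problem"     # fails even on known-good inputs
    try:
        submit_job(prod_inputs)
    except Exception:
        return "inputs are the problem"  # code works; production inputs don't
    return "production ok"
```

Because the known-good run comes first, a failure there immediately rules out the inputs as the cause.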
Implementation Checklist
- Maintain list of known-good test models/datasets
- Test new scripts with test inputs first
- Document what makes inputs "good"
- Keep test jobs cheap (small models, short timeouts)
- Only move to production after test succeeds
Time cost: 5-10 minutes for test run
Debugging time saved: Hours
Summary: The Reliability Checklist
Before submitting ANY job:
Pre-Flight Checks
- Verified all repos/datasets exist (hub_repo_details)
- Tested with known-good inputs if new code
- Using proven hardware/configuration
- Included all dependencies in PEP 723 header
- Installed system requirements (build tools, etc.)
- Set appropriate timeout (not default 30m)
- Configured Hub push with HF_TOKEN
- Added clear error handling
Script Quality
- Self-contained (no external setup needed)
- Complete dependencies listed
- Build tools installed by script
- Progress indicators included
- Error messages are clear
- Configuration logged at start
Job Configuration
- Timeout > expected runtime + 30% buffer
- Hardware appropriate for model size
- Secrets include HF_TOKEN
- Environment variables set correctly
- Cost estimated and acceptable
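The timeout rule above is simple enough to encode directly (the helper name is invented; the 30-minute floor reflects the default mentioned in the checklist):

```python
import math

def job_timeout_minutes(expected_minutes: float, buffer: float = 0.30) -> int:
    """Expected runtime plus a safety buffer, never below the 30m default."""
    return max(30, math.ceil(expected_minutes * (1 + buffer)))
```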
Following these principles transforms the job success rate from ~60-70% to ~95%+.
When Principles Conflict
Sometimes reliability and performance conflict. Here's how to choose:
| Scenario | Choose | Rationale |
|---|---|---|
| Demo/test | Reliability | Fast failure is worse than slow success |
| Production (first run) | Reliability | Prove it works before optimizing |
| Production (proven) | Performance | Safe to optimize after validation |
| Time-critical | Reliability | Failures cause more delay than slow runs |
| Cost-critical | Balanced | Test with small model, then optimize |
General rule: Reliability first, optimize second.
Further Reading
- troubleshooting.md - Common issues and fixes
- training_patterns.md - Proven training configurations
- gguf_conversion.md - Production GGUF workflow