Saving Training Results to Hugging Face Hub

⚠️ CRITICAL: Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.

Why Hub Push is Required

When running on Hugging Face Jobs:

  • Environment is temporary
  • All files deleted on job completion
  • No local disk persistence
  • Cannot access results after job ends

Without Hub push, training is completely wasted.

Required Configuration

1. Training Configuration

In your SFTConfig or trainer config:

SFTConfig(
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",   # Target repository
)

2. Job Configuration

When submitting the job:

hf_jobs("uv", {
    "script": "train.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})

The $HF_TOKEN placeholder is automatically replaced with your Hugging Face token.

Complete Example

# train.py
# /// script
# dependencies = ["trl"]
# ///

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure with Hub push
config = SFTConfig(
    output_dir="my-model",
    num_train_epochs=3,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/my-trained-model",

    # Optional: push strategy and token
    hub_strategy="every_save",  # default; "checkpoint" also pushes the latest checkpoint
    hub_token=None,             # None = use HF_TOKEN from the environment
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
)

trainer.train()

# ✅ Push final model
trainer.push_to_hub()

print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")

Submit with authentication:

hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})

What Gets Saved

When push_to_hub=True:

  1. Model weights - Final trained parameters
  2. Tokenizer - Associated tokenizer
  3. Configuration - Model config (config.json)
  4. Training arguments - Hyperparameters used
  5. Model card - Auto-generated documentation
  6. Checkpoints - If save_strategy="steps" enabled
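
After a run, you can confirm what actually landed in the repo by listing its files (a small sketch using huggingface_hub; substitute your hub_model_id for the placeholder repo name):

from huggingface_hub import HfApi

# Lists every file in the repo: config.json, model.safetensors, tokenizer files, etc.
for path in HfApi().list_repo_files("username/model-name", repo_type="model"):
    print(path)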

Checkpoint Saving

Save intermediate checkpoints during training:

SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",
    
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,              # Save every 100 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
)

Benefits:

  • Resume training if job fails
  • Compare checkpoint performance
  • Use intermediate models

Checkpoints are pushed to the same repo (username/my-model). With the default hub_strategy="every_save", the model is pushed on every save; hub_strategy="checkpoint" additionally pushes the latest checkpoint to a last-checkpoint/ subfolder so training can be resumed.
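
If a job dies mid-run, a fresh job can pull the pushed checkpoint back down and resume from it. A minimal sketch, assuming hub_strategy="checkpoint" was set and trainer is the trainer built as above; the repo name is a placeholder:

from huggingface_hub import snapshot_download

# Download the repo contents, including the pushed last-checkpoint/ folder
snapshot_download("username/my-model", local_dir="my-model")

# Resume training from the downloaded checkpoint
trainer.train(resume_from_checkpoint="my-model/last-checkpoint")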

Authentication Methods

"secrets": {"HF_TOKEN": "$HF_TOKEN"}

Uses your logged-in Hugging Face token automatically.

Method 2: Explicit Token

"secrets": {"HF_TOKEN": "hf_abc123..."}

Provide token explicitly (not recommended for security).

Method 3: Environment Variable

"env": {"HF_TOKEN": "hf_abc123..."}

Pass as regular environment variable (less secure than secrets).

Always prefer Method 1 for security and convenience.
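
Inside the training script you can fail fast if the token is missing or invalid. A minimal sketch: os.environ raises on a missing variable, and whoami() raises if the Hub rejects the token:

import os
from huggingface_hub import HfApi

# Raises immediately if HF_TOKEN is absent or rejected by the Hub
user = HfApi(token=os.environ["HF_TOKEN"]).whoami()
print(f"✅ Authenticated as: {user['name']}")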

Verification Checklist

Before submitting any training job, verify:

  • push_to_hub=True in training config
  • hub_model_id is specified (format: username/model-name)
  • secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
  • Repository name doesn't conflict with existing repos
  • You have write access to the target namespace

Repository Setup

Automatic Creation

If repository doesn't exist, it's created automatically when first pushing.

Manual Creation

Create repository before training:

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/model-name",
    repo_type="model",
    private=False,   # or True for a private repo
    exist_ok=True,   # no error if the repo already exists
)

Repository Naming

Valid names:

  • username/my-model
  • username/model-name
  • organization/model-name

Invalid or discouraged names:

  • model-name (missing username)
  • username/model name (spaces not allowed)
  • username/MODEL (uppercase is allowed but discouraged)
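
A pre-flight check mirroring these rules can catch typos before a long job starts. This is a hypothetical helper for illustration, not an official validator:

def check_repo_id(repo_id: str) -> None:
    # Mirrors the naming rules above; purely illustrative
    assert "/" in repo_id, "missing namespace: use username/model-name"
    namespace, name = repo_id.split("/", 1)
    assert " " not in name, "spaces are not allowed in repo names"
    if name != name.lower():
        print("⚠️ uppercase names are discouraged")

check_repo_id("username/my-model")  # passes silently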

Troubleshooting

Error: 401 Unauthorized

Cause: HF_TOKEN not provided or invalid

Solutions:

  1. Verify secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
  2. Check you're logged in: hf auth whoami
  3. Re-login: hf auth login

Error: 403 Forbidden

Cause: No write access to repository

Solutions:

  1. Check repository namespace matches your username
  2. Verify you're a member of organization (if using org namespace)
  3. Check repository isn't private (if accessing org repo)

Error: Repository not found

Cause: Repository doesn't exist and auto-creation failed

Solutions:

  1. Manually create repository first
  2. Check repository name format
  3. Verify namespace exists

Error: Push failed during training

Cause: Network issues or Hub unavailable

Notes:

  1. Training itself may complete even though the final push fails
  2. Checkpoints pushed earlier during training may already be on the Hub

Solution: re-run the push manually while the job environment still exists (see Manual Push After Training below).

Issue: Model saved but not visible

Possible causes:

  1. Repository is private—check https://huggingface.co/username
  2. Wrong namespace—verify hub_model_id matches login
  3. Push still in progress—wait a few minutes
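
A quick way to check the first two causes from a script (a small sketch using huggingface_hub; replace the placeholder repo id):

from huggingface_hub import HfApi

# repo_info raises if the repo doesn't exist under that namespace
info = HfApi().repo_info("username/model-name", repo_type="model")
print("private:", info.private)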

Manual Push After Training

If training completes but push fails, push manually:

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from local checkpoint (use the causal-LM class so the LM head is kept)
model = AutoModelForCausalLM.from_pretrained("./output_dir")
tokenizer = AutoTokenizer.from_pretrained("./output_dir")

# Push to Hub (token read from the environment instead of hardcoding it)
model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
tokenizer.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))

Note: Only possible while the job environment still exists (e.g., as a fallback step inside the job script); once a Jobs run completes, the local files are gone.
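
Alternatively, huggingface_hub can upload the entire output directory in one call, which also captures checkpoint folders. A sketch; the folder and repo names are placeholders:

from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.upload_folder(
    folder_path="./output_dir",      # local training output
    repo_id="username/model-name",
    repo_type="model",
)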

Best Practices

  1. Always enable push_to_hub=True
  2. Use checkpoint saving for long training runs
  3. Verify Hub push in logs before job completes
  4. Set appropriate save_total_limit to avoid excessive checkpoints
  5. Use descriptive repo names (e.g., qwen-capybara-sft not model1)
  6. Add a model card with training details (see the sketch after this list)
  7. Tag models with relevant tags (e.g., text-generation, fine-tuned)
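
For items 6 and 7, huggingface_hub ships a ModelCard helper. trainer.push_to_hub() auto-generates a card, but you can overwrite it with your own details and tags. A minimal sketch; the repo name and card text are placeholders:

from huggingface_hub import ModelCard

# Hand-written card with YAML front matter carrying the tags
card = ModelCard("""---
tags:
- text-generation
- fine-tuned
---
# qwen-capybara-sft

Fine-tuned from Qwen/Qwen2.5-0.5B on trl-lib/Capybara using TRL's SFTTrainer.
""")
card.push_to_hub("username/qwen-capybara-sft")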

Monitoring Push Progress

Check logs for push progress:

hf_jobs("logs", {"job_id": "your-job-id"})

Look for:

Pushing model to username/model-name...
Upload file model.safetensors: 100%
✅ Model pushed successfully

Example: Full Production Setup

# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import os

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"

# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",
    
    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Push checkpoints
    
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    
    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,
    
    # Logging
    logging_steps=10,
    logging_first_step=True,
)

# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub()

print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/myusername/qwen-capybara-sft")

Submit:

hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

Key Takeaway

Without push_to_hub=True and secrets={"HF_TOKEN": "$HF_TOKEN"}, all training results are permanently lost.

Always verify both are configured before submitting any training job.