Saving Training Results to Hugging Face Hub

⚠️ CRITICAL: Training environments are ephemeral. ALL results are lost when a job completes unless pushed to the Hub.

Why Hub Push is Required

When running on Hugging Face Jobs:

  • Environment is temporary
  • All files deleted on job completion
  • No local disk persistence
  • Cannot access results after job ends

Without Hub push, training is completely wasted.

Required Configuration

1. Training Configuration

In your SFTConfig or trainer config:

SFTConfig(
    push_to_hub=True,                    # Enable Hub push
    hub_model_id="username/model-name",   # Target repository
)

2. Job Configuration

When submitting the job:

hf_jobs("uv", {
    "script": "train.py",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # Provide authentication
})

The $HF_TOKEN placeholder is automatically replaced with your Hugging Face token.

Complete Example

# train.py
# /// script
# dependencies = ["trl"]
# ///

from trl import SFTTrainer, SFTConfig
from datasets import load_dataset

dataset = load_dataset("trl-lib/Capybara", split="train")

# Configure with Hub push
config = SFTConfig(
    output_dir="my-model",
    num_train_epochs=3,

    # ✅ CRITICAL: Hub push configuration
    push_to_hub=True,
    hub_model_id="myusername/my-trained-model",

    # Optional: push strategy and token
    hub_strategy="every_save",  # default; "checkpoint" also pushes the latest checkpoint
    hub_token=None,             # None = use HF_TOKEN from the environment
)

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
)

trainer.train()

# ✅ Push final model
trainer.push_to_hub()

print("✅ Model saved to: https://huggingface.co/myusername/my-trained-model")

Submit with authentication:

hf_jobs("uv", {
    "script": "train.py",
    "flavor": "a10g-large",
    "timeout": "2h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}  # ✅ Required!
})

What Gets Saved

When push_to_hub=True:

  1. Model weights - Final trained parameters
  2. Tokenizer - Associated tokenizer
  3. Configuration - Model config (config.json)
  4. Training arguments - Hyperparameters used
  5. Model card - Auto-generated documentation
  6. Checkpoints - If save_strategy="steps" enabled
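
After a run, you can confirm what actually landed in the repo by listing its files (a small sketch using huggingface_hub; substitute your hub_model_id for the placeholder repo name):

from huggingface_hub import HfApi

# Lists every file in the repo: config.json, model.safetensors, tokenizer files, etc.
for path in HfApi().list_repo_files("username/model-name", repo_type="model"):
    print(path)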

Checkpoint Saving

Save intermediate checkpoints during training:

SFTConfig(
    output_dir="my-model",
    push_to_hub=True,
    hub_model_id="username/my-model",
    
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,              # Save every 100 steps
    save_total_limit=3,          # Keep only last 3 checkpoints
)

Benefits:

  • Resume training if job fails
  • Compare checkpoint performance
  • Use intermediate models

Checkpoints are pushed to the same repo (username/my-model). With the default hub_strategy="every_save", the model is pushed on every save; hub_strategy="checkpoint" additionally pushes the latest checkpoint to a last-checkpoint/ subfolder so training can be resumed.
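
If a job dies mid-run, a fresh job can pull the pushed checkpoint back down and resume from it. A minimal sketch, assuming hub_strategy="checkpoint" was set and trainer is the trainer built as above; the repo name is a placeholder:

from huggingface_hub import snapshot_download

# Download the repo contents, including the pushed last-checkpoint/ folder
snapshot_download("username/my-model", local_dir="my-model")

# Resume training from the downloaded checkpoint
trainer.train(resume_from_checkpoint="my-model/last-checkpoint")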

Authentication Methods

"secrets": {"HF_TOKEN": "$HF_TOKEN"}

Uses your logged-in Hugging Face token automatically.

Method 2: Explicit Token

"secrets": {"HF_TOKEN": "hf_abc123..."}

Provide token explicitly (not recommended for security).

Method 3: Environment Variable

"env": {"HF_TOKEN": "hf_abc123..."}

Pass as regular environment variable (less secure than secrets).

Always prefer Method 1 for security and convenience.
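
Inside the training script you can fail fast if the token is missing or invalid. A minimal sketch: os.environ raises on a missing variable, and whoami() raises if the Hub rejects the token:

import os
from huggingface_hub import HfApi

# Raises immediately if HF_TOKEN is absent or rejected by the Hub
user = HfApi(token=os.environ["HF_TOKEN"]).whoami()
print(f"✅ Authenticated as: {user['name']}")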

Verification Checklist

Before submitting any training job, verify:

  • push_to_hub=True in training config
  • hub_model_id is specified (format: username/model-name)
  • secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
  • Repository name doesn't conflict with existing repos
  • You have write access to the target namespace

Repository Setup

Automatic Creation

If repository doesn't exist, it's created automatically when first pushing.

Manual Creation

Create repository before training:

from huggingface_hub import HfApi

api = HfApi()
api.create_repo(
    repo_id="username/model-name",
    repo_type="model",
    private=False,   # or True for a private repo
    exist_ok=True,   # no error if the repo already exists
)

Repository Naming

Valid names:

  • username/my-model
  • username/model-name
  • organization/model-name

Invalid or discouraged names:

  • model-name (missing username)
  • username/model name (spaces not allowed)
  • username/MODEL (uppercase is allowed but discouraged)
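
A pre-flight check mirroring these rules can catch typos before a long job starts. This is a hypothetical helper for illustration, not an official validator:

def check_repo_id(repo_id: str) -> None:
    # Mirrors the naming rules above; purely illustrative
    assert "/" in repo_id, "missing namespace: use username/model-name"
    namespace, name = repo_id.split("/", 1)
    assert " " not in name, "spaces are not allowed in repo names"
    if name != name.lower():
        print("⚠️ uppercase names are discouraged")

check_repo_id("username/my-model")  # passes silently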

Troubleshooting

Error: 401 Unauthorized

Cause: HF_TOKEN not provided or invalid

Solutions:

  1. Verify secrets={"HF_TOKEN": "$HF_TOKEN"} in job config
  2. Check you're logged in: hf auth whoami
  3. Re-login: hf auth login

Error: 403 Forbidden

Cause: No write access to repository

Solutions:

  1. Check repository namespace matches your username
  2. Verify you're a member of organization (if using org namespace)
  3. Check repository isn't private (if accessing org repo)

Error: Repository not found

Cause: Repository doesn't exist and auto-creation failed

Solutions:

  1. Manually create repository first
  2. Check repository name format
  3. Verify namespace exists

Error: Push failed during training

Cause: Network issues or Hub unavailable

Notes:

  1. Training itself may complete even though the final push fails
  2. Checkpoints pushed earlier during training may already be on the Hub

Solution: re-run the push manually while the job environment still exists (see Manual Push After Training below).

Issue: Model saved but not visible

Possible causes:

  1. Repository is private—check https://huggingface.co/username
  2. Wrong namespace—verify hub_model_id matches login
  3. Push still in progress—wait a few minutes
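
A quick way to check the first two causes from a script (a small sketch using huggingface_hub; replace the placeholder repo id):

from huggingface_hub import HfApi

# repo_info raises if the repo doesn't exist under that namespace
info = HfApi().repo_info("username/model-name", repo_type="model")
print("private:", info.private)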

Manual Push After Training

If training completes but push fails, push manually:

import os
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load from local checkpoint (use the causal-LM class so the LM head is kept)
model = AutoModelForCausalLM.from_pretrained("./output_dir")
tokenizer = AutoTokenizer.from_pretrained("./output_dir")

# Push to Hub (token read from the environment instead of hardcoding it)
model.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))
tokenizer.push_to_hub("username/model-name", token=os.environ.get("HF_TOKEN"))

Note: Only possible while the job environment still exists (e.g., as a fallback step inside the job script); once a Jobs run completes, the local files are gone.
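
Alternatively, huggingface_hub can upload the entire output directory in one call, which also captures checkpoint folders. A sketch; the folder and repo names are placeholders:

from huggingface_hub import HfApi

api = HfApi()  # picks up HF_TOKEN from the environment
api.upload_folder(
    folder_path="./output_dir",      # local training output
    repo_id="username/model-name",
    repo_type="model",
)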

Best Practices

  1. Always enable push_to_hub=True
  2. Use checkpoint saving for long training runs
  3. Verify Hub push in logs before job completes
  4. Set appropriate save_total_limit to avoid excessive checkpoints
  5. Use descriptive repo names (e.g., qwen-capybara-sft not model1)
  6. Add a model card with training details (see the sketch after this list)
  7. Tag models with relevant tags (e.g., text-generation, fine-tuned)
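
For items 6 and 7, huggingface_hub ships a ModelCard helper. trainer.push_to_hub() auto-generates a card, but you can overwrite it with your own details and tags. A minimal sketch; the repo name and card text are placeholders:

from huggingface_hub import ModelCard

# Hand-written card with YAML front matter carrying the tags
card = ModelCard("""---
tags:
- text-generation
- fine-tuned
---
# qwen-capybara-sft

Fine-tuned from Qwen/Qwen2.5-0.5B on trl-lib/Capybara using TRL's SFTTrainer.
""")
card.push_to_hub("username/qwen-capybara-sft")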

Monitoring Push Progress

Check logs for push progress:

hf_jobs("logs", {"job_id": "your-job-id"})

Look for:

Pushing model to username/model-name...
Upload file model.safetensors: 100%
✅ Model pushed successfully

Example: Full Production Setup

# production_train.py
# /// script
# dependencies = ["trl>=0.12.0", "peft>=0.7.0"]
# ///

from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer, SFTConfig
import os

# Verify token is available
assert "HF_TOKEN" in os.environ, "HF_TOKEN not found in environment!"

# Load dataset
dataset = load_dataset("trl-lib/Capybara", split="train")
print(f"✅ Dataset loaded: {len(dataset)} examples")

# Configure with comprehensive Hub settings
config = SFTConfig(
    output_dir="qwen-capybara-sft",
    
    # Hub configuration
    push_to_hub=True,
    hub_model_id="myusername/qwen-capybara-sft",
    hub_strategy="checkpoint",  # Push checkpoints
    
    # Checkpoint configuration
    save_strategy="steps",
    save_steps=100,
    save_total_limit=3,
    
    # Training settings
    num_train_epochs=3,
    per_device_train_batch_size=4,
    
    # Logging
    logging_steps=10,
    logging_first_step=True,
)

# Train with LoRA
trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    args=config,
    peft_config=LoraConfig(r=16, lora_alpha=32),
)

print("🚀 Starting training...")
trainer.train()

print("💾 Pushing final model to Hub...")
trainer.push_to_hub()

print("✅ Training complete!")
print(f"Model available at: https://huggingface.co/myusername/qwen-capybara-sft")

Submit:

hf_jobs("uv", {
    "script": "production_train.py",
    "flavor": "a10g-large",
    "timeout": "6h",
    "secrets": {"HF_TOKEN": "$HF_TOKEN"}
})

Key Takeaway

Without push_to_hub=True and secrets={"HF_TOKEN": "$HF_TOKEN"}, all training results are permanently lost.

Always verify both are configured before submitting any training job.