# Supported Model Architectures

This document lists the model architectures currently supported by Transformers.js.

## Natural Language Processing

### Text Models

- **ALBERT** - A Lite BERT for Self-supervised Learning
- **BERT** - Bidirectional Encoder Representations from Transformers
- **CamemBERT** - French language model based on RoBERTa
- **CodeGen** - Code generation models
- **CodeLlama** - Code-focused Llama models
- **Cohere** - Command-R models for RAG
- **DeBERTa** - Decoding-enhanced BERT with Disentangled Attention
- **DeBERTa-v2** - Improved version of DeBERTa
- **DistilBERT** - Distilled version of BERT (smaller, faster)
- **GPT-2** - Generative Pre-trained Transformer 2
- **GPT-Neo** - Open-source GPT-3 alternative
- **GPT-NeoX** - Larger GPT-Neo models
- **LLaMA** - Large Language Model Meta AI
- **Mistral** - Mistral AI language models
- **MPNet** - Masked and Permuted Pre-training
- **MobileBERT** - Compressed BERT for mobile devices
- **RoBERTa** - Robustly Optimized BERT
- **T5** - Text-to-Text Transfer Transformer
- **XLM-RoBERTa** - Multilingual RoBERTa

### Sequence-to-Sequence

- **BART** - Denoising Sequence-to-Sequence Pre-training
- **Blenderbot** - Open-domain chatbot
- **BlenderbotSmall** - Smaller Blenderbot variant
- **M2M100** - Many-to-Many multilingual translation
- **MarianMT** - Neural machine translation
- **mBART** - Multilingual BART
- **NLLB** - No Language Left Behind (200 languages)
- **Pegasus** - Pre-training with extracted gap-sentences

## Computer Vision

### Image Classification

- **BEiT** - BERT Pre-Training of Image Transformers
- **ConvNeXT** - Modern ConvNet architecture
- **ConvNeXTV2** - Improved ConvNeXT
- **DeiT** - Data-efficient Image Transformers
- **DINOv2** - Self-supervised Vision Transformer
- **DINOv3** - Latest DINO iteration
- **EfficientNet** - Efficient convolutional networks
- **MobileNet** - Lightweight models for mobile
- **MobileViT** - Mobile Vision Transformer
- **ResNet** - Residual Networks
- **SegFormer** - Semantic segmentation transformer
- **Swin** - Shifted Window Transformer
- **ViT** - Vision Transformer

### Object Detection

- **DETR** - Detection Transformer
- **D-FINE** - Fine-grained Distribution Refinement for object detection
- **DINO** - DETR with Improved deNoising anchOr boxes
- **Grounding DINO** - Open-set object detection
- **YOLOS** - You Only Look at One Sequence

### Segmentation

- **CLIPSeg** - Image segmentation with text prompts
- **Mask2Former** - Universal image segmentation
- **SAM** - Segment Anything Model
- **EdgeTAM** - On-Device Track Anything Model

### Depth & Pose

- **DPT** - Dense Prediction Transformer
- **Depth Anything** - Monocular depth estimation
- **Depth Pro** - Sharp monocular metric depth
- **GLPN** - Global-Local Path Networks for depth

## Audio

### Speech Recognition

- **Wav2Vec2** - Self-supervised speech representations
- **Whisper** - Robust speech recognition (multilingual)
- **HuBERT** - Self-supervised speech representation learning

### Audio Processing

- **Audio Spectrogram Transformer** - Audio classification
- **DAC** - Descript Audio Codec

### Text-to-Speech

- **SpeechT5** - Unified speech and text pre-training
- **VITS** - Conditional Variational Autoencoder with adversarial learning

## Multimodal

### Vision-Language

- **CLIP** - Contrastive Language-Image Pre-training
- **Chinese-CLIP** - Chinese version of CLIP
- **ALIGN** - Large-scale noisy image-text pairs
- **BLIP** - Bootstrapping Language-Image Pre-training
- **Florence-2** - Unified vision foundation model
- **LLaVA** - Large Language and Vision Assistant
- **Moondream** - Tiny vision-language model

### Document Understanding

- **DiT** - Document Image Transformer
- **Donut** - OCR-free Document Understanding
- **LayoutLM** - Pre-training for document understanding
- **TrOCR** - Transformer-based OCR

### Audio-Language

- **CLAP** - Contrastive Language-Audio Pre-training

## Embeddings & Similarity

- **Sentence Transformers** - Sentence embeddings
- **all-MiniLM** - Efficient sentence embeddings
- **all-mpnet-base** - High-quality sentence embeddings
- **E5** - Text embeddings by Microsoft
- **BGE** - General embedding models
- **nomic-embed** - Long-context embeddings

## Specialized Models

### Code

- **CodeBERT** - Pre-trained model for code
- **GraphCodeBERT** - Code structure understanding
- **StarCoder** - Code generation

### Scientific

- **SciBERT** - Scientific text
- **BioBERT** - Biomedical text

### Retrieval

- **ColBERT** - Contextualized late interaction over BERT
- **DPR** - Dense Passage Retrieval

## Model Selection Tips

### For Text Tasks

- **Small & Fast**: DistilBERT, MobileBERT
- **Balanced**: BERT-base, RoBERTa-base
- **High Accuracy**: RoBERTa-large, DeBERTa-v3-large
- **Multilingual**: XLM-RoBERTa, mBERT

### For Vision Tasks

- **Mobile/Browser**: MobileNet, EfficientNet-B0
- **Balanced**: DeiT-base, ConvNeXT-tiny
- **High Accuracy**: Swin-large, DINOv2-large

### For Audio Tasks

- **Speech Recognition**: Whisper-tiny (fast), Whisper-large (accurate)
- **Audio Classification**: Audio Spectrogram Transformer

### For Multimodal

- **Vision-Language**: CLIP (general), Florence-2 (comprehensive)
- **Document AI**: Donut, LayoutLM
- **OCR**: TrOCR

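One way to make these tips actionable in code is a small lookup table. The sketch below mirrors the recommendations in this section; the names are the architecture families listed above (not exact Hub model IDs, which vary by publisher), and `pickModel` is a hypothetical helper, not part of Transformers.js:

```javascript
// Suggested architecture per task and latency/accuracy constraint,
// mirroring the selection tips above. Names are architecture families
// from this document, not exact Hub model IDs.
const MODEL_PICKS = {
  text: { fast: "DistilBERT", balanced: "BERT-base", accurate: "DeBERTa-v3-large" },
  vision: { fast: "MobileNet", balanced: "DeiT-base", accurate: "Swin-large" },
  audio: { fast: "Whisper-tiny", accurate: "Whisper-large" },
};

function pickModel(task, constraint = "balanced") {
  const picks = MODEL_PICKS[task] || {};
  const choice = picks[constraint];
  if (!choice) {
    throw new Error(`no suggestion for task=${task}, constraint=${constraint}`);
  }
  return choice;
}

console.log(pickModel("text", "fast"));      // DistilBERT
console.log(pickModel("audio", "accurate")); // Whisper-large
```

Unknown task/constraint combinations throw rather than silently guessing, so a caller can fall back to browsing the Hub instead.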
## Finding Models on Hugging Face Hub

Search for compatible models:

```
https://huggingface.co/models?library=transformers.js
```

Filter by task:

```
https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
```
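Both URLs follow the same pattern: query parameters on `https://huggingface.co/models`. A small helper makes that explicit (`hubSearchUrl` is an illustrative name, and the parameter order in its output may differ from the URLs above):

```javascript
// Build a Hub search URL for Transformers.js-compatible models,
// optionally filtered by pipeline task (e.g. "text-classification").
function hubSearchUrl(task) {
  const url = new URL("https://huggingface.co/models");
  url.searchParams.set("library", "transformers.js");
  if (task) url.searchParams.set("pipeline_tag", task);
  return url.toString();
}

console.log(hubSearchUrl());
// https://huggingface.co/models?library=transformers.js
console.log(hubSearchUrl("text-classification"));
// https://huggingface.co/models?library=transformers.js&pipeline_tag=text-classification
```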

Check for ONNX support by looking for an `onnx/` folder in the model repository.
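That check is just a prefix test on the repository's file paths, so it can be automated. The sketch below assumes you already have the repo's file listing (for example, from the Hub's model API); the sample listing is hard-coded for illustration, and `hasOnnxWeights` is a hypothetical helper:

```javascript
// Given a model repo's file paths, report whether ONNX weights are present
// under the onnx/ folder. The listing below is illustrative, not fetched.
function hasOnnxWeights(files) {
  return files.some((f) => f.startsWith("onnx/") && f.endsWith(".onnx"));
}

const exampleListing = [
  "config.json",
  "tokenizer.json",
  "onnx/model.onnx",
  "onnx/model_quantized.onnx",
];

console.log(hasOnnxWeights(exampleListing)); // true
console.log(hasOnnxWeights(["config.json", "pytorch_model.bin"])); // false
```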