# Supported Model Architectures

This document lists the model architectures currently supported by Transformers.js.

## Natural Language Processing

### Text Models

- **ALBERT** - A Lite BERT for Self-supervised Learning
- **BERT** - Bidirectional Encoder Representations from Transformers
- **CamemBERT** - French language model based on RoBERTa
- **CodeGen** - Code generation models
- **CodeLlama** - Code-focused Llama models
- **Cohere** - Command-R models for RAG
- **DeBERTa** - Decoding-enhanced BERT with Disentangled Attention
- **DeBERTa-v2** - Improved version of DeBERTa
- **DistilBERT** - Distilled version of BERT (smaller, faster)
- **GPT-2** - Generative Pre-trained Transformer 2
- **GPT-Neo** - Open-source GPT-3 alternative
- **GPT-NeoX** - Larger GPT-Neo models
- **LLaMA** - Large Language Model Meta AI
- **Mistral** - Mistral AI language models
- **MPNet** - Masked and Permuted Pre-training
- **MobileBERT** - Compressed BERT for mobile devices
- **RoBERTa** - Robustly Optimized BERT
- **T5** - Text-to-Text Transfer Transformer
- **XLM-RoBERTa** - Multilingual RoBERTa

### Sequence-to-Sequence

- **BART** - Denoising Sequence-to-Sequence Pre-training
- **Blenderbot** - Open-domain chatbot
- **BlenderbotSmall** - Smaller Blenderbot variant
- **M2M100** - Many-to-Many multilingual translation
- **MarianMT** - Neural machine translation
- **mBART** - Multilingual BART
- **NLLB** - No Language Left Behind (200 languages)
- **Pegasus** - Pre-training with extracted gap-sentences

## Computer Vision

### Image Classification

- **BEiT** - BERT Pre-Training of Image Transformers
- **ConvNeXT** - Modern ConvNet architecture
- **ConvNeXTV2** - Improved ConvNeXT
- **DeiT** - Data-efficient Image Transformers
- **DINOv2** - Self-supervised Vision Transformer
- **DINOv3** - Latest DINO iteration
- **EfficientNet** - Efficient convolutional networks
- **MobileNet** - Lightweight models for mobile
- **MobileViT** - Mobile Vision Transformer
- **ResNet** - Residual Networks
- **SegFormer** - Semantic segmentation transformer
- **Swin** - Shifted Window Transformer
- **ViT** - Vision Transformer

### Object Detection

- **DETR** - Detection Transformer
- **D-FINE** - Fine-grained Distribution Refinement for object detection
- **DINO** - DETR with Improved deNoising anchOr boxes
- **Grounding DINO** - Open-set object detection
- **YOLOS** - You Only Look at One Sequence

### Segmentation

- **CLIPSeg** - Image segmentation with text prompts
- **Mask2Former** - Universal image segmentation
- **SAM** - Segment Anything Model
- **EdgeTAM** - On-Device Track Anything Model

### Depth & Pose

- **DPT** - Dense Prediction Transformer
- **Depth Anything** - Monocular depth estimation
- **Depth Pro** - Sharp monocular metric depth
- **GLPN** - Global-Local Path Networks for depth

## Audio

### Speech Recognition

- **Wav2Vec2** - Self-supervised speech representations
- **Whisper** - Robust speech recognition (multilingual)
- **HuBERT** - Self-supervised speech representation learning

### Audio Processing

- **Audio Spectrogram Transformer** - Audio classification
- **DAC** - Descript Audio Codec

### Text-to-Speech

- **SpeechT5** - Unified speech and text pre-training
- **VITS** - Conditional Variational Autoencoder with adversarial learning

## Multimodal

### Vision-Language

- **CLIP** - Contrastive Language-Image Pre-training
- **Chinese-CLIP** - Chinese version of CLIP
- **ALIGN** - Large-scale noisy image-text pairs
- **BLIP** - Bootstrapping Language-Image Pre-training
- **Florence-2** - Unified vision foundation model
- **LLaVA** - Large Language and Vision Assistant
- **Moondream** - Tiny vision-language model

### Document Understanding

- **DiT** - Document Image Transformer
- **Donut** - OCR-free Document Understanding
- **LayoutLM** - Pre-training for document understanding
- **TrOCR** - Transformer-based OCR

### Audio-Language

- **CLAP** - Contrastive Language-Audio Pre-training

## Embeddings & Similarity

- **Sentence Transformers** - Sentence embeddings
- **all-MiniLM** - Efficient sentence embeddings
- **all-mpnet-base** - High-quality sentence embeddings
- **E5** - Text embeddings by Microsoft
- **BGE** - General embedding models
- **nomic-embed** - Long-context embeddings

## Specialized Models

### Code

- **CodeBERT** - Pre-trained model for code
- **GraphCodeBERT** - Code structure understanding
- **StarCoder** - Code generation

### Scientific

- **SciBERT** - Scientific text
- **BioBERT** - Biomedical text

### Retrieval

- **ColBERT** - Contextualized late interaction over BERT
- **DPR** - Dense Passage Retrieval

## Model Selection Tips

### For Text Tasks

- **Small & Fast**: DistilBERT, MobileBERT
- **Balanced**: BERT-base, RoBERTa-base
- **High Accuracy**: RoBERTa-large, DeBERTa-v3-large
- **Multilingual**: XLM-RoBERTa, mBERT

### For Vision Tasks

- **Mobile/Browser**: MobileNet, EfficientNet-B0
- **Balanced**: DeiT-base, ConvNeXT-tiny
- **High Accuracy**: Swin-large, DINOv2-large

### For Audio Tasks

- **Speech Recognition**: Whisper-tiny (fast), Whisper-large (accurate)
- **Audio Classification**: Audio Spectrogram Transformer

### For Multimodal

- **Vision-Language**: CLIP (general), Florence-2 (comprehensive)
- **Document AI**: Donut, LayoutLM
- **OCR**: TrOCR

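One way to make these tips actionable in code is a small lookup table. The sketch below mirrors the recommendations in this section; the names are the architecture families listed above (not exact Hub model IDs, which vary by publisher), and `pickModel` is a hypothetical helper, not part of Transformers.js:

```javascript
// Suggested architecture per task and latency/accuracy constraint,
// mirroring the selection tips above. Names are architecture families
// from this document, not exact Hub model IDs.
const MODEL_PICKS = {
  text: { fast: "DistilBERT", balanced: "BERT-base", accurate: "DeBERTa-v3-large" },
  vision: { fast: "MobileNet", balanced: "DeiT-base", accurate: "Swin-large" },
  audio: { fast: "Whisper-tiny", accurate: "Whisper-large" },
};

function pickModel(task, constraint = "balanced") {
  const picks = MODEL_PICKS[task] || {};
  const choice = picks[constraint];
  if (!choice) {
    throw new Error(`no suggestion for task=${task}, constraint=${constraint}`);
  }
  return choice;
}

console.log(pickModel("text", "fast"));      // DistilBERT
console.log(pickModel("audio", "accurate")); // Whisper-large
```

Unknown task/constraint combinations throw rather than silently guessing, so a caller can fall back to browsing the Hub instead.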
## Finding Models on Hugging Face Hub

Search for compatible models:

```
https://huggingface.co/models?library=transformers.js
```

Filter by task:

```
https://huggingface.co/models?pipeline_tag=text-classification&library=transformers.js
```
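Both URLs follow the same pattern: query parameters on `https://huggingface.co/models`. A small helper makes that explicit (`hubSearchUrl` is an illustrative name, and the parameter order in its output may differ from the URLs above):

```javascript
// Build a Hub search URL for Transformers.js-compatible models,
// optionally filtered by pipeline task (e.g. "text-classification").
function hubSearchUrl(task) {
  const url = new URL("https://huggingface.co/models");
  url.searchParams.set("library", "transformers.js");
  if (task) url.searchParams.set("pipeline_tag", task);
  return url.toString();
}

console.log(hubSearchUrl());
// https://huggingface.co/models?library=transformers.js
console.log(hubSearchUrl("text-classification"));
// https://huggingface.co/models?library=transformers.js&pipeline_tag=text-classification
```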

Check for ONNX support by looking for an `onnx/` folder in the model repository.
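That check is just a prefix test on the repository's file paths, so it can be automated. The sketch below assumes you already have the repo's file listing (for example, from the Hub's model API); the sample listing is hard-coded for illustration, and `hasOnnxWeights` is a hypothetical helper:

```javascript
// Given a model repo's file paths, report whether ONNX weights are present
// under the onnx/ folder. The listing below is illustrative, not fetched.
function hasOnnxWeights(files) {
  return files.some((f) => f.startsWith("onnx/") && f.endsWith(".onnx"));
}

const exampleListing = [
  "config.json",
  "tokenizer.json",
  "onnx/model.onnx",
  "onnx/model_quantized.onnx",
];

console.log(hasOnnxWeights(exampleListing)); // true
console.log(hasOnnxWeights(["config.json", "pytorch_model.bin"])); // false
```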