# Computer Vision Architectures

Comprehensive guide to CNN and Vision Transformer architectures for object detection, segmentation, and image classification.

## Table of Contents

- [Backbone Architectures](#backbone-architectures)
- [Detection Architectures](#detection-architectures)
- [Segmentation Architectures](#segmentation-architectures)
- [Vision Transformers](#vision-transformers)
- [Feature Pyramid Networks](#feature-pyramid-networks)
- [Architecture Selection](#architecture-selection)
- [Code Examples](#code-examples)
- [Resources](#resources)

---

## Backbone Architectures

Backbone networks extract feature representations from images. The choice of backbone affects both accuracy and inference speed.

### ResNet Family

ResNet introduced residual connections that enable training of very deep networks.

| Variant | Params | GFLOPs | Top-1 Acc | Use Case |
|---------|--------|--------|-----------|----------|
| ResNet-18 | 11.7M | 1.8 | 69.8% | Edge, mobile |
| ResNet-34 | 21.8M | 3.7 | 73.3% | Balanced |
| ResNet-50 | 25.6M | 4.1 | 76.1% | Standard backbone |
| ResNet-101 | 44.5M | 7.8 | 77.4% | High accuracy |
| ResNet-152 | 60.2M | 11.6 | 78.3% | Maximum accuracy |

**Residual Block Architecture:**

```
Input
  |
  +---> Conv 1x1 (reduce channels)
  |          |
  |       Conv 3x3
  |          |
  |     Conv 1x1 (expand channels)
  |          |
  +-----> Add <----+
           |
          ReLU
           |
         Output
```

**When to use ResNet:**
- Standard detection/segmentation tasks
- When pretrained weights are important
- Moderate compute budget
- Well-understood, stable architecture
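The bottleneck block above maps almost line-for-line onto PyTorch. A minimal sketch (not the torchvision implementation, which also handles stride and downsampled skip connections; `Bottleneck` here is an illustrative name):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Minimal ResNet bottleneck: 1x1 reduce -> 3x3 -> 1x1 expand, plus skip."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        mid = channels // reduction
        self.block = nn.Sequential(
            nn.Conv2d(channels, mid, 1, bias=False),            # reduce channels
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, mid, 3, padding=1, bias=False),      # spatial conv
            nn.BatchNorm2d(mid), nn.ReLU(inplace=True),
            nn.Conv2d(mid, channels, 1, bias=False),            # expand channels
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Residual addition, then final ReLU (as in the diagram above)
        return self.relu(x + self.block(x))

x = torch.randn(1, 256, 56, 56)
print(Bottleneck(256)(x).shape)  # torch.Size([1, 256, 56, 56])
```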
### EfficientNet Family

EfficientNet uses compound scaling to balance depth, width, and resolution.

| Variant | Params | GFLOPs | Top-1 Acc | Relative Speed |
|---------|--------|--------|-----------|----------------|
| EfficientNet-B0 | 5.3M | 0.4 | 77.1% | 1x |
| EfficientNet-B1 | 7.8M | 0.7 | 79.1% | 0.7x |
| EfficientNet-B2 | 9.2M | 1.0 | 80.1% | 0.6x |
| EfficientNet-B3 | 12M | 1.8 | 81.6% | 0.4x |
| EfficientNet-B4 | 19M | 4.2 | 82.9% | 0.25x |
| EfficientNet-B5 | 30M | 9.9 | 83.6% | 0.15x |
| EfficientNet-B6 | 43M | 19 | 84.0% | 0.1x |
| EfficientNet-B7 | 66M | 37 | 84.3% | 0.05x |

**Key innovations:**
- Mobile Inverted Bottleneck (MBConv) blocks
- Squeeze-and-Excitation attention
- Compound scaling coefficients
- Swish activation function

**When to use EfficientNet:**
- Mobile and edge deployment
- When parameter efficiency matters
- Classification tasks
- Limited compute resources

### ConvNeXt

ConvNeXt modernizes ResNet with techniques from Vision Transformers.

| Variant | Params | GFLOPs | Top-1 Acc |
|---------|--------|--------|-----------|
| ConvNeXt-T | 29M | 4.5 | 82.1% |
| ConvNeXt-S | 50M | 8.7 | 83.1% |
| ConvNeXt-B | 89M | 15.4 | 83.8% |
| ConvNeXt-L | 198M | 34.4 | 84.3% |
| ConvNeXt-XL | 350M | 60.9 | 84.7% |

**Key design choices:**
- 7x7 depthwise convolutions (matching Swin's 7x7 attention windows)
- Layer normalization instead of batch norm
- GELU activation
- Fewer but wider stages
- Inverted bottleneck design

**ConvNeXt Block:**

```
Input
  |
  +---> DWConv 7x7
  |          |
  |     LayerNorm
  |          |
  |     Linear (4x channels)
  |          |
  |        GELU
  |          |
  |     Linear (1x channels)
  |          |
  +-----> Add <----+
           |
         Output
```

### CSPNet (Cross Stage Partial)

CSPNet is the backbone design behind YOLOv4 through YOLOv8.

**Key features:**
- Gradient flow optimization
- Reduced computation while maintaining accuracy
- Cross-stage partial connections
- Optimized for real-time detection

**CSP Block:**

```
Input
  |
  +----> Split ----+
  |                |
Conv Block         |
  |                |
Conv Block         |
  |                |
  +----> Concat <--+
            |
         Output
```

---

## Detection Architectures

### Two-Stage Detectors

Two-stage detectors first propose regions, then classify and refine them.

#### Faster R-CNN

Architecture:
1. **Backbone**: Feature extraction (ResNet, etc.)
2. **RPN (Region Proposal Network)**: Generate object proposals
3. **RoI Pooling/Align**: Extract fixed-size features
4. **Classification Head**: Classify and refine boxes

```
Image → Backbone → Feature Map
                        |
                        +→ RPN → Proposals
                        |            |
                        +→ RoI Align ←+
                              |
                          FC Layers
                              |
                        Class + BBox
```

**RPN Details:**
- Sliding window over feature map
- Anchor boxes at each position (3 scales × 3 ratios = 9)
- Predicts objectness score and box refinement
- NMS to reduce proposals (typically 300-2000 kept)

**Performance characteristics:**
- mAP@50:95: ~40-42 (COCO, R50-FPN)
- Inference: ~50-100ms per image
- Better localization than single-stage
- Slower but more accurate

#### Cascade R-CNN

Multi-stage refinement with increasing IoU thresholds.

```
Stage 1 (IoU 0.5) → Stage 2 (IoU 0.6) → Stage 3 (IoU 0.7)
```

**Benefits:**
- Progressive refinement
- Better high-IoU predictions
- +3-4 mAP over Faster R-CNN
- Minimal additional cost per stage

### Single-Stage Detectors

Single-stage detectors predict boxes and classes in one pass.

#### YOLO Family

**YOLOv8 Architecture:**

```
Input Image
     |
Backbone (CSPDarknet)
     |
 +---+---+
 |   |   |
 P3  P4  P5   (multi-scale features)
 |   |   |
Neck (PANet + C2f)
 |   |   |
Head (Decoupled)
     |
Boxes + Classes
```

**Key YOLOv8 innovations:**
- C2f module (faster CSP variant)
- Anchor-free detection head
- Decoupled classification/regression heads
- Task-aligned assigner (TAL)
- Distribution focal loss (DFL)

**YOLO variant comparison** (COCO mAP; speeds are indicative and vary with hardware and runtime):

| Model | Size (px) | Params | mAP@50:95 | Speed (ms) |
|-------|-----------|--------|-----------|------------|
| YOLOv5n | 640 | 1.9M | 28.0 | 1.2 |
| YOLOv5s | 640 | 7.2M | 37.4 | 1.8 |
| YOLOv5m | 640 | 21.2M | 45.4 | 3.5 |
| YOLOv8n | 640 | 3.2M | 37.3 | 1.2 |
| YOLOv8s | 640 | 11.2M | 44.9 | 2.1 |
| YOLOv8m | 640 | 25.9M | 50.2 | 4.2 |
| YOLOv8l | 640 | 43.7M | 52.9 | 6.8 |
| YOLOv8x | 640 | 68.2M | 53.9 | 10.1 |

#### SSD (Single Shot Detector)

Multi-scale detection with default boxes.

**Architecture:**
- VGG16 or MobileNet backbone
- Additional convolution layers for multi-scale features
- Default boxes at each scale
- Direct classification and regression

**When to use SSD:**
- Edge deployment (SSD-MobileNet)
- When a YOLO alternative is needed
- Simple architecture requirements

#### RetinaNet

Focal loss to handle class imbalance.

**Key innovation:**

```
FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t)
```

Where:
- γ (focusing parameter) = 2 typically
- α_t (balancing weight) = 0.25 for the foreground class

**Benefits:**
- Handles extreme foreground-background imbalance
- Matches two-stage accuracy
- Single-stage speed
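The focal loss above is a few lines in practice; torchvision ships an equivalent as `torchvision.ops.sigmoid_focal_loss`. A minimal binary sketch, where the `focal_loss` helper and the anchor/class shapes are illustrative:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss on raw logits; targets are 0/1 with the same shape."""
    p = torch.sigmoid(logits)
    ce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p_t = p * targets + (1 - p) * (1 - targets)          # prob of the true class
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    # (1 - p_t)^gamma down-weights easy, well-classified examples
    return (alpha_t * (1 - p_t) ** gamma * ce).mean()

logits = torch.randn(8, 80)                   # e.g. 8 anchors x 80 classes
targets = torch.randint(0, 2, (8, 80)).float()
print(focal_loss(logits, targets))
```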
---

## Segmentation Architectures

### Instance Segmentation

#### Mask R-CNN

Extends Faster R-CNN with a mask prediction branch.

```
RoI Features → FC Layers → Class + BBox
      |
      +→ Conv Layers → Mask (28×28 per class)
```

**Key details:**
- RoI Align (bilinear interpolation, no quantization)
- Per-class binary mask prediction
- Decoupled mask and classification
- 14×14 or 28×28 mask resolution

**Performance:**
- mAP (box): ~39 on COCO
- mAP (mask): ~35 on COCO
- Inference: ~100-200ms

#### YOLACT / YOLACT++

Real-time instance segmentation.

**Approach:**
1. Generate prototype masks (global)
2. Predict mask coefficients per instance
3. Linear combination: mask = Σ(coefficients × prototypes)

**Benefits:**
- Real-time (~30 FPS)
- Simpler than Mask R-CNN
- Global prototypes capture spatial information

#### YOLOv8-Seg

Adds a segmentation head to YOLOv8.

**Performance:**
- mAP (box): 44.6
- mAP (mask): 36.8
- Speed: 4.5ms

### Semantic Segmentation

#### DeepLabV3+

Atrous convolutions for multi-scale context.

**Key components:**

1. **ASPP (Atrous Spatial Pyramid Pooling)**
   - Parallel atrous convolutions at different rates
   - Captures multi-scale context
   - Rates: 6, 12, 18 typically

2. **Encoder-Decoder**
   - Encoder: Backbone + ASPP
   - Decoder: Upsample with skip connections

```
Image → Backbone → ASPP → Decoder → Segmentation
            ↘               ↗
         Low-level features
```

**Performance:**
- mIoU: 89.0% on PASCAL VOC 2012 (82.1% on Cityscapes test)
- Inference: ~25ms (ResNet-50)

#### SegFormer

Transformer-based semantic segmentation.

**Architecture:**

1. **Hierarchical Transformer Encoder**
   - Multi-scale feature maps
   - Efficient self-attention
   - Overlapping patch embedding

2. **MLP Decoder**
   - Simple MLP aggregation
   - No complex decoder modules needed

**Benefits:**
- No positional encoding needed
- Efficient attention mechanism
- Strong multi-scale features

### Promptable Segmentation

#### SAM (Segment Anything Model)

Zero-shot segmentation with prompts.

**Architecture:**
1. **Image Encoder**: ViT-H (632M params)
2. **Prompt Encoder**: Points, boxes, masks, text
3. **Mask Decoder**: Lightweight transformer

**Prompts supported:**
- Points (foreground/background)
- Bounding boxes
- Rough masks
- Text (via CLIP integration)

**Usage patterns** (with the official `segment_anything` package):

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)  # HxWx3 uint8 RGB array

# Point prompt (1 = foreground)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375]]), point_labels=np.array([1]))

# Box prompt (XYXY)
masks, scores, logits = predictor.predict(box=np.array([100, 100, 400, 400]))

# Multiple points (1 = foreground, 0 = background)
masks, scores, logits = predictor.predict(
    point_coords=np.array([[500, 375], [200, 300]]),
    point_labels=np.array([1, 0]))
```

---

## Vision Transformers

### ViT (Vision Transformer)

Original vision transformer architecture.

**Architecture:**

```
Image → Patch Embedding → [CLS] + Position Embedding
                 ↓
        Transformer Encoder ×L
                 ↓
            [CLS] token
                 ↓
        Classification Head
```

**Key details:**
- Patch size: 16×16 or 14×14 typically
- Position embeddings: learned 1D
- [CLS] token for classification
- Standard transformer encoder blocks

**Variants:**

| Model | Patch | Layers | Hidden | Heads | Params |
|-------|-------|--------|--------|-------|--------|
| ViT-Ti | 16 | 12 | 192 | 3 | 5.7M |
| ViT-S | 16 | 12 | 384 | 6 | 22M |
| ViT-B | 16 | 12 | 768 | 12 | 86M |
| ViT-L | 16 | 24 | 1024 | 16 | 304M |
| ViT-H | 14 | 32 | 1280 | 16 | 632M |
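Patch embedding reduces to a strided convolution plus a learned [CLS] token and position table. A minimal sketch of the ViT input pipeline with ViT-B/16 dimensions (`PatchEmbed` is an illustrative name, not a library class):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into 16x16 patches, project to tokens, prepend [CLS]."""
    def __init__(self, img_size=224, patch=16, dim=768):
        super().__init__()
        # Stride == kernel size gives non-overlapping patches
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n_patches = (img_size // patch) ** 2
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))              # [CLS] token
        self.pos = nn.Parameter(torch.zeros(1, n_patches + 1, dim))  # learned 1D positions

    def forward(self, x):
        tokens = self.proj(x).flatten(2).transpose(1, 2)  # (B, 196, dim)
        cls = self.cls.expand(x.shape[0], -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos  # (B, 197, dim)

print(PatchEmbed()(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 197, 768])
```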
### DeiT (Data-efficient Image Transformers)

Training ViT without massive datasets.

**Key innovations:**
- Knowledge distillation from CNN teachers
- Strong data augmentation
- Regularization (stochastic depth, label smoothing)
- Distillation token (learns from the teacher)

**Training recipe:**
- RandAugment
- Mixup (α=0.8)
- CutMix (α=1.0)
- Random erasing (p=0.25)
- Stochastic depth (p=0.1)

### Swin Transformer

Hierarchical transformer with shifted windows.

**Key innovations:**

1. **Shifted Window Attention**
   - Local attention within windows
   - Cross-window connections via shifting
   - O(n) complexity vs O(n²) for global attention

2. **Hierarchical Feature Maps**
   - Patch merging between stages
   - Similar to CNN feature pyramids
   - Directly usable in detection/segmentation

**Architecture (Swin-T, 224×224 input):**

```
Stage 1: 56×56,  96-dim  → Patch Merge
Stage 2: 28×28, 192-dim  → Patch Merge
Stage 3: 14×14, 384-dim  → Patch Merge
Stage 4:  7×7,  768-dim
```

**Variants:**

| Model | Params | GFLOPs | Top-1 |
|-------|--------|--------|-------|
| Swin-T | 29M | 4.5 | 81.3% |
| Swin-S | 50M | 8.7 | 83.0% |
| Swin-B | 88M | 15.4 | 83.5% |
| Swin-L | 197M | 34.5 | 84.5% |

---

## Feature Pyramid Networks

FPN variants for multi-scale detection.

### Original FPN

Top-down pathway with lateral connections.

```
P5 ← C5                  (1/32)
      ↓
P4 ← C4 + Upsample(P5)   (1/16)
      ↓
P3 ← C3 + Upsample(P4)   (1/8)
      ↓
P2 ← C2 + Upsample(P3)   (1/4)
```

### PANet (Path Aggregation Network)

Bottom-up augmentation after FPN: N2 starts from P2, and each higher level fuses the downsampled lower level with the corresponding FPN output.

```
P2 ─────────────────→ N2
P3 ──→ + ←─ down(N2) → N3
P4 ──→ + ←─ down(N3) → N4
P5 ──→ + ←─ down(N4) → N5
```

**Benefits:**
- Shorter path from low-level to high-level features
- Better localization signals
- +1-2 mAP improvement

### BiFPN (Bidirectional FPN)

Weighted bidirectional feature fusion.

**Key innovations:**
- Learnable fusion weights
- Bidirectional cross-scale connections
- Repeated blocks for iterative refinement

**Fusion formula:**

```
O = Σ(w_i × I_i) / (ε + Σ w_i)
```

Weights are learned via fast normalized fusion: they are kept non-negative (ReLU) and normalized by their sum rather than a softmax, which is faster on hardware.
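A minimal sketch of a single fusion node implementing the formula above (`WeightedFusion` is an illustrative name; the full BiFPN repeats such nodes across levels and stacked blocks):

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion: O = sum(w_i * I_i) / (eps + sum(w_i))."""
    def __init__(self, n_inputs=2, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(n_inputs))  # one learnable weight per input
        self.eps = eps

    def forward(self, inputs):
        w = torch.relu(self.w)            # keep fusion weights non-negative
        w = w / (self.eps + w.sum())      # normalize without softmax (faster)
        return sum(wi * x for wi, x in zip(w, inputs))

# Fuse a lateral feature with an upsampled top-down feature at the same scale
fused = WeightedFusion()([torch.randn(1, 64, 28, 28), torch.randn(1, 64, 28, 28)])
print(fused.shape)  # torch.Size([1, 64, 28, 28])
```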
### NAS-FPN

Neural architecture search for FPN design.

**Searched on COCO:**
- 7 fusion cells
- Optimized connection patterns
- 3-4 mAP improvement over FPN

---

## Architecture Selection

### Decision Matrix

| Requirement | Recommended | Alternative |
|-------------|-------------|-------------|
| Real-time (>30 FPS) | YOLOv8s | RT-DETR-S |
| Edge (<4GB RAM) | YOLOv8n | MobileNetV3-SSD |
| High accuracy | DINO, Cascade R-CNN | YOLOv8x |
| Instance segmentation | Mask R-CNN | YOLOv8-seg |
| Semantic segmentation | SegFormer | DeepLabV3+ |
| Zero-shot | SAM | CLIP+segmentation |
| Small objects | YOLO+SAHI | Cascade R-CNN |
| Video real-time | YOLOv8 + ByteTrack | YOLOX + SORT |

### Training Data Requirements

| Architecture | Minimum Images | Recommended |
|--------------|----------------|-------------|
| YOLO (fine-tune) | 100-500 | 1,000-5,000 |
| YOLO (from scratch) | 5,000+ | 10,000+ |
| Faster R-CNN | 1,000+ | 5,000+ |
| DETR/DINO | 10,000+ | 50,000+ |
| ViT backbone | 10,000+ | 100,000+ |
| SAM (fine-tune) | 100-1,000 | 5,000+ |

### Compute Requirements

| Architecture | Training GPU | Inference GPU |
|--------------|--------------|---------------|
| YOLOv8n | 4GB VRAM | 2GB VRAM |
| YOLOv8m | 8GB VRAM | 4GB VRAM |
| YOLOv8x | 16GB VRAM | 8GB VRAM |
| Faster R-CNN R50 | 8GB VRAM | 4GB VRAM |
| Mask R-CNN R101 | 16GB VRAM | 8GB VRAM |
| DINO-4scale | 32GB VRAM | 16GB VRAM |
| SAM ViT-H | 32GB VRAM | 8GB VRAM |

---

## Code Examples

### Load Pretrained Backbone (timm)

```python
import torch
import timm

# List available models
print(timm.list_models('*resnet*'))

# Load a pretrained backbone that returns intermediate feature maps
backbone = timm.create_model('resnet50', pretrained=True, features_only=True)

# Get feature maps (strides 2, 4, 8, 16, 32)
features = backbone(torch.randn(1, 3, 224, 224))
for f in features:
    print(f.shape)
# torch.Size([1, 64, 112, 112])
# torch.Size([1, 256, 56, 56])
# torch.Size([1, 512, 28, 28])
# torch.Size([1, 1024, 14, 14])
# torch.Size([1, 2048, 7, 7])
```

### Custom Detection Backbone

```python
import torch.nn as nn
from torchvision.models import resnet50
from torchvision.ops import FeaturePyramidNetwork

class DetectionBackbone(nn.Module):
    def __init__(self):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")  # pretrained=True is deprecated
        # Stem + stage 1 produce the stride-4 feature map
        self.layer1 = nn.Sequential(backbone.conv1, backbone.bn1,
                                    backbone.relu, backbone.maxpool,
                                    backbone.layer1)
        self.layer2 = backbone.layer2
        self.layer3 = backbone.layer3
        self.layer4 = backbone.layer4
        # Project all stages to a common 256-channel pyramid
        self.fpn = FeaturePyramidNetwork(
            in_channels_list=[256, 512, 1024, 2048],
            out_channels=256
        )

    def forward(self, x):
        c1 = self.layer1(x)   # stride 4
        c2 = self.layer2(c1)  # stride 8
        c3 = self.layer3(c2)  # stride 16
        c4 = self.layer4(c3)  # stride 32
        features = {'feat0': c1, 'feat1': c2, 'feat2': c3, 'feat3': c4}
        pyramid = self.fpn(features)
        return pyramid
```

### Vision Transformer with Detection Head

```python
import torch
import timm

# Swin Transformer backbone returning multi-scale features
swin = timm.create_model('swin_base_patch4_window7_224', pretrained=True,
                         features_only=True, out_indices=[0, 1, 2, 3])

# Get multi-scale features
x = torch.randn(1, 3, 224, 224)
features = swin(x)
for i, f in enumerate(features):
    print(f"Stage {i}: {f.shape}")
# Stage 0: torch.Size([1, 128, 56, 56])
# Stage 1: torch.Size([1, 256, 28, 28])
# Stage 2: torch.Size([1, 512, 14, 14])
# Stage 3: torch.Size([1, 1024, 7, 7])
# Note: depending on the timm version, Swin features may come back
# channels-last, e.g. torch.Size([1, 56, 56, 128]).
```

---

## Resources

- [torchvision models](https://pytorch.org/vision/stable/models.html)
- [timm library](https://github.com/huggingface/pytorch-image-models)
- [Detectron2 Model Zoo](https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md)
- [MMDetection Model Zoo](https://github.com/open-mmlab/mmdetection/blob/main/docs/en/model_zoo.md)
- [Ultralytics YOLOv8](https://docs.ultralytics.com/)