diff --git a/engineering-team/senior-computer-vision/SKILL.md b/engineering-team/senior-computer-vision/SKILL.md index f75d4d2..5028bef 100644 --- a/engineering-team/senior-computer-vision/SKILL.md +++ b/engineering-team/senior-computer-vision/SKILL.md @@ -1,226 +1,531 @@ --- name: senior-computer-vision -description: World-class computer vision skill for image/video processing, object detection, segmentation, and visual AI systems. Expertise in PyTorch, OpenCV, YOLO, SAM, diffusion models, and vision transformers. Includes 3D vision, video analysis, real-time processing, and production deployment. Use when building vision AI systems, implementing object detection, training custom vision models, or optimizing inference pipelines. +description: Computer vision engineering skill for object detection, image segmentation, and visual AI systems. Covers CNN and Vision Transformer architectures, YOLO/Faster R-CNN/DETR detection, Mask R-CNN/SAM segmentation, and production deployment with ONNX/TensorRT. Includes PyTorch, torchvision, Ultralytics, Detectron2, and MMDetection frameworks. Use when building detection pipelines, training custom models, optimizing inference, or deploying vision systems. --- # Senior Computer Vision Engineer -World-class senior computer vision engineer skill for production-grade AI/ML/Data systems. +Production computer vision engineering skill for object detection, image segmentation, and visual AI system deployment. + +## Table of Contents + +- [Quick Start](#quick-start) +- [Core Expertise](#core-expertise) +- [Tech Stack](#tech-stack) +- [Workflow 1: Object Detection Pipeline](#workflow-1-object-detection-pipeline) +- [Workflow 2: Model Optimization and Deployment](#workflow-2-model-optimization-and-deployment) +- [Workflow 3: Custom Dataset Preparation](#workflow-3-custom-dataset-preparation) +- [Architecture Selection Guide](#architecture-selection-guide) +- [Reference Documentation](#reference-documentation) +- [Common Commands](#common-commands) ## Quick Start -### Main Capabilities - ```bash -# Core Tool 1 -python scripts/vision_model_trainer.py --input data/ --output results/ +# Generate training configuration for YOLO or Faster R-CNN +python scripts/vision_model_trainer.py models/ --task detection --arch yolov8 -# Core Tool 2 -python scripts/inference_optimizer.py --target project/ --analyze +# Analyze model for optimization opportunities (quantization, pruning) +python scripts/inference_optimizer.py model.pt --target onnx --benchmark -# Core Tool 3 -python scripts/dataset_pipeline_builder.py --config config.yaml --deploy +# Build dataset pipeline with augmentations +python scripts/dataset_pipeline_builder.py images/ --format coco --augment ``` ## Core Expertise -This skill covers world-class capabilities in: +This skill provides guidance on: -- Advanced production patterns and architectures -- Scalable system design and implementation -- Performance optimization at scale -- MLOps and DataOps best practices -- Real-time processing and inference -- Distributed computing frameworks -- Model deployment and monitoring -- Security and compliance -- Cost optimization -- Team leadership and mentoring +- **Object Detection**: YOLO family (v5-v11), Faster R-CNN, DETR, RT-DETR +- **Instance Segmentation**: Mask R-CNN, YOLACT, SOLOv2 +- **Semantic Segmentation**: DeepLabV3+, SegFormer, SAM (Segment Anything) +- **Image Classification**: ResNet, EfficientNet, Vision Transformers (ViT, DeiT) +- **Video Analysis**: Object tracking (ByteTrack, SORT), action recognition +- 
**3D Vision**: Depth estimation, point cloud processing, NeRF +- **Production Deployment**: ONNX, TensorRT, OpenVINO, CoreML ## Tech Stack -**Languages:** Python, SQL, R, Scala, Go -**ML Frameworks:** PyTorch, TensorFlow, Scikit-learn, XGBoost -**Data Tools:** Spark, Airflow, dbt, Kafka, Databricks -**LLM Frameworks:** LangChain, LlamaIndex, DSPy -**Deployment:** Docker, Kubernetes, AWS/GCP/Azure -**Monitoring:** MLflow, Weights & Biases, Prometheus -**Databases:** PostgreSQL, BigQuery, Snowflake, Pinecone +| Category | Technologies | +|----------|--------------| +| Frameworks | PyTorch, torchvision, timm | +| Detection | Ultralytics (YOLO), Detectron2, MMDetection | +| Segmentation | segment-anything, mmsegmentation | +| Optimization | ONNX, TensorRT, OpenVINO, torch.compile | +| Image Processing | OpenCV, Pillow, albumentations | +| Annotation | CVAT, Label Studio, Roboflow | +| Experiment Tracking | MLflow, Weights & Biases | +| Serving | Triton Inference Server, TorchServe | + +## Workflow 1: Object Detection Pipeline + +Use this workflow when building an object detection system from scratch. + +### Step 1: Define Detection Requirements + +Analyze the detection task requirements: + +``` +Detection Requirements Analysis: +- Target objects: [list specific classes to detect] +- Real-time requirement: [yes/no, target FPS] +- Accuracy priority: [speed vs accuracy trade-off] +- Deployment target: [cloud GPU, edge device, mobile] +- Dataset size: [number of images, annotations per class] +``` + +### Step 2: Select Detection Architecture + +Choose architecture based on requirements: + +| Requirement | Recommended Architecture | Why | +|-------------|-------------------------|-----| +| Real-time (>30 FPS) | YOLOv8/v11, RT-DETR | Single-stage, optimized for speed | +| High accuracy | Faster R-CNN, DINO | Two-stage, better localization | +| Small objects | YOLO + SAHI, Faster R-CNN + FPN | Multi-scale detection | +| Edge deployment | YOLOv8n, MobileNetV3-SSD | Lightweight architectures | +| Transformer-based | DETR, DINO, RT-DETR | End-to-end, no NMS required | + +### Step 3: Prepare Dataset + +Convert annotations to required format: + +```bash +# COCO format (recommended) +python scripts/dataset_pipeline_builder.py data/images/ \ + --annotations data/labels/ \ + --format coco \ + --split 0.8 0.1 0.1 \ + --output data/coco/ + +# Verify dataset +python -c "from pycocotools.coco import COCO; coco = COCO('data/coco/train.json'); print(f'Images: {len(coco.imgs)}, Categories: {len(coco.cats)}')" +``` + +### Step 4: Configure Training + +Generate training configuration: + +```bash +# For Ultralytics YOLO +python scripts/vision_model_trainer.py data/coco/ \ + --task detection \ + --arch yolov8m \ + --epochs 100 \ + --batch 16 \ + --imgsz 640 \ + --output configs/ + +# For Detectron2 +python scripts/vision_model_trainer.py data/coco/ \ + --task detection \ + --arch faster_rcnn_R_50_FPN \ + --framework detectron2 \ + --output configs/ +``` + +### Step 5: Train and Validate + +```bash +# Ultralytics training +yolo detect train data=data.yaml model=yolov8m.pt epochs=100 imgsz=640 + +# Detectron2 training +python train_net.py --config-file configs/faster_rcnn.yaml --num-gpus 1 + +# Validate on test set +yolo detect val model=runs/detect/train/weights/best.pt data=data.yaml +``` + +### Step 6: Evaluate Results + +Key metrics to analyze: + +| Metric | Target | Description | +|--------|--------|-------------| +| mAP@50 | >0.7 | Mean Average Precision at IoU 0.5 | +| mAP@50:95 | >0.5 | COCO primary metric | +| 
Precision | >0.8 | Low false positives | +| Recall | >0.8 | Low missed detections | +| Inference time | <33ms | For 30 FPS real-time | + +## Workflow 2: Model Optimization and Deployment + +Use this workflow when preparing a trained model for production deployment. + +### Step 1: Benchmark Baseline Performance + +```bash +# Measure current model performance +python scripts/inference_optimizer.py model.pt \ + --benchmark \ + --input-size 640 640 \ + --batch-sizes 1 4 8 16 \ + --warmup 10 \ + --iterations 100 +``` + +Expected output: + +``` +Baseline Performance (PyTorch FP32): +- Batch 1: 45.2ms (22.1 FPS) +- Batch 4: 89.4ms (44.7 FPS) +- Batch 8: 165.3ms (48.4 FPS) +- Memory: 2.1 GB +- Parameters: 25.9M +``` + +### Step 2: Select Optimization Strategy + +| Deployment Target | Optimization Path | +|-------------------|-------------------| +| NVIDIA GPU (cloud) | PyTorch → ONNX → TensorRT FP16 | +| NVIDIA GPU (edge) | PyTorch → TensorRT INT8 | +| Intel CPU | PyTorch → ONNX → OpenVINO | +| Apple Silicon | PyTorch → CoreML | +| Generic CPU | PyTorch → ONNX Runtime | +| Mobile | PyTorch → TFLite or ONNX Mobile | + +### Step 3: Export to ONNX + +```bash +# Export with dynamic batch size +python scripts/inference_optimizer.py model.pt \ + --export onnx \ + --input-size 640 640 \ + --dynamic-batch \ + --simplify \ + --output model.onnx + +# Verify ONNX model +python -c "import onnx; model = onnx.load('model.onnx'); onnx.checker.check_model(model); print('ONNX model valid')" +``` + +### Step 4: Apply Quantization (Optional) + +For INT8 quantization with calibration: + +```bash +# Generate calibration dataset +python scripts/inference_optimizer.py model.onnx \ + --quantize int8 \ + --calibration-data data/calibration/ \ + --calibration-samples 500 \ + --output model_int8.onnx +``` + +Quantization impact analysis: + +| Precision | Size | Speed | Accuracy Drop | +|-----------|------|-------|---------------| +| FP32 | 100% | 1x | 0% | +| FP16 | 50% | 1.5-2x | <0.5% | +| INT8 | 25% | 2-4x | 1-3% | + +### Step 5: Convert to Target Runtime + +```bash +# TensorRT (NVIDIA GPU) +trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 + +# OpenVINO (Intel) +mo --input_model model.onnx --output_dir openvino/ + +# CoreML (Apple) +python -c "import coremltools as ct; model = ct.convert('model.onnx'); model.save('model.mlpackage')" +``` + +### Step 6: Benchmark Optimized Model + +```bash +python scripts/inference_optimizer.py model.engine \ + --benchmark \ + --runtime tensorrt \ + --compare model.pt +``` + +Expected speedup: + +``` +Optimization Results: +- Original (PyTorch FP32): 45.2ms +- Optimized (TensorRT FP16): 12.8ms +- Speedup: 3.5x +- Accuracy change: -0.3% mAP +``` + +## Workflow 3: Custom Dataset Preparation + +Use this workflow when preparing a computer vision dataset for training. 
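+
+Before running the full audit, a quick integrity pass with Pillow can surface unreadable files early. A minimal sketch (the `data/raw` path is illustrative):
+
+```python
+from pathlib import Path
+
+from PIL import Image
+
+bad_files = []
+for path in Path("data/raw").rglob("*"):
+    if path.suffix.lower() in {".jpg", ".jpeg", ".png"}:
+        try:
+            with Image.open(path) as img:
+                img.verify()  # validates file structure without decoding pixels
+        except Exception:
+            bad_files.append(path)
+
+print(f"{len(bad_files)} unreadable image files")
+```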
+ +### Step 1: Audit Raw Data + +```bash +# Analyze image dataset +python scripts/dataset_pipeline_builder.py data/raw/ \ + --analyze \ + --output analysis/ +``` + +Analysis report includes: + +``` +Dataset Analysis: +- Total images: 5,234 +- Image sizes: 640x480 to 4096x3072 (variable) +- Formats: JPEG (4,891), PNG (343) +- Corrupted: 12 files +- Duplicates: 45 pairs + +Annotation Analysis: +- Format detected: Pascal VOC XML +- Total annotations: 28,456 +- Classes: 5 (car, person, bicycle, dog, cat) +- Distribution: car (12,340), person (8,234), bicycle (3,456), dog (2,890), cat (1,536) +- Empty images: 234 +``` + +### Step 2: Clean and Validate + +```bash +# Remove corrupted and duplicate images +python scripts/dataset_pipeline_builder.py data/raw/ \ + --clean \ + --remove-corrupted \ + --remove-duplicates \ + --output data/cleaned/ +``` + +### Step 3: Convert Annotation Format + +```bash +# Convert VOC to COCO format +python scripts/dataset_pipeline_builder.py data/cleaned/ \ + --annotations data/annotations/ \ + --input-format voc \ + --output-format coco \ + --output data/coco/ +``` + +Supported format conversions: + +| From | To | +|------|-----| +| Pascal VOC XML | COCO JSON | +| YOLO TXT | COCO JSON | +| COCO JSON | YOLO TXT | +| LabelMe JSON | COCO JSON | +| CVAT XML | COCO JSON | + +### Step 4: Apply Augmentations + +```bash +# Generate augmentation config +python scripts/dataset_pipeline_builder.py data/coco/ \ + --augment \ + --aug-config configs/augmentation.yaml \ + --output data/augmented/ +``` + +Recommended augmentations for detection: + +```yaml +# configs/augmentation.yaml +augmentations: + geometric: + - horizontal_flip: { p: 0.5 } + - vertical_flip: { p: 0.1 } # Only if orientation invariant + - rotate: { limit: 15, p: 0.3 } + - scale: { scale_limit: 0.2, p: 0.5 } + + color: + - brightness_contrast: { brightness_limit: 0.2, contrast_limit: 0.2, p: 0.5 } + - hue_saturation: { hue_shift_limit: 20, sat_shift_limit: 30, p: 0.3 } + - blur: { blur_limit: 3, p: 0.1 } + + advanced: + - mosaic: { p: 0.5 } # YOLO-style mosaic + - mixup: { p: 0.1 } # Image mixing + - cutout: { num_holes: 8, max_h_size: 32, max_w_size: 32, p: 0.3 } +``` + +### Step 5: Create Train/Val/Test Splits + +```bash +python scripts/dataset_pipeline_builder.py data/augmented/ \ + --split 0.8 0.1 0.1 \ + --stratify \ + --seed 42 \ + --output data/final/ +``` + +Split strategy guidelines: + +| Dataset Size | Train | Val | Test | +|--------------|-------|-----|------| +| <1,000 images | 70% | 15% | 15% | +| 1,000-10,000 | 80% | 10% | 10% | +| >10,000 | 90% | 5% | 5% | + +### Step 6: Generate Dataset Configuration + +```bash +# For Ultralytics YOLO +python scripts/dataset_pipeline_builder.py data/final/ \ + --generate-config yolo \ + --output data.yaml + +# For Detectron2 +python scripts/dataset_pipeline_builder.py data/final/ \ + --generate-config detectron2 \ + --output detectron2_config.py +``` + +## Architecture Selection Guide + +### Object Detection Architectures + +| Architecture | Speed | Accuracy | Best For | +|--------------|-------|----------|----------| +| YOLOv8n | 1.2ms | 37.3 mAP | Edge, mobile, real-time | +| YOLOv8s | 2.1ms | 44.9 mAP | Balanced speed/accuracy | +| YOLOv8m | 4.2ms | 50.2 mAP | General purpose | +| YOLOv8l | 6.8ms | 52.9 mAP | High accuracy | +| YOLOv8x | 10.1ms | 53.9 mAP | Maximum accuracy | +| RT-DETR-L | 5.3ms | 53.0 mAP | Transformer, no NMS | +| Faster R-CNN R50 | 46ms | 40.2 mAP | Two-stage, high quality | +| DINO-4scale | 85ms | 49.0 mAP | SOTA transformer | + +### 
Segmentation Architectures + +| Architecture | Type | Speed | Best For | +|--------------|------|-------|----------| +| YOLOv8-seg | Instance | 4.5ms | Real-time instance seg | +| Mask R-CNN | Instance | 67ms | High-quality masks | +| SAM | Promptable | 50ms | Zero-shot segmentation | +| DeepLabV3+ | Semantic | 25ms | Scene parsing | +| SegFormer | Semantic | 15ms | Efficient semantic seg | + +### CNN vs Vision Transformer Trade-offs + +| Aspect | CNN (YOLO, R-CNN) | ViT (DETR, DINO) | +|--------|-------------------|------------------| +| Training data needed | 1K-10K images | 10K-100K+ images | +| Training time | Fast | Slow (needs more epochs) | +| Inference speed | Faster | Slower | +| Small objects | Good with FPN | Needs multi-scale | +| Global context | Limited | Excellent | +| Positional encoding | Implicit | Explicit | ## Reference Documentation ### 1. Computer Vision Architectures -Comprehensive guide available in `references/computer_vision_architectures.md` covering: +See `references/computer_vision_architectures.md` for: -- Advanced patterns and best practices -- Production implementation strategies -- Performance optimization techniques -- Scalability considerations -- Security and compliance -- Real-world case studies +- CNN backbone architectures (ResNet, EfficientNet, ConvNeXt) +- Vision Transformer variants (ViT, DeiT, Swin) +- Detection heads (anchor-based vs anchor-free) +- Feature Pyramid Networks (FPN, BiFPN, PANet) +- Neck architectures for multi-scale detection ### 2. Object Detection Optimization -Complete workflow documentation in `references/object_detection_optimization.md` including: +See `references/object_detection_optimization.md` for: -- Step-by-step processes -- Architecture design patterns -- Tool integration guides -- Performance tuning strategies -- Troubleshooting procedures +- Non-Maximum Suppression variants (NMS, Soft-NMS, DIoU-NMS) +- Anchor optimization and anchor-free alternatives +- Loss function design (focal loss, GIoU, CIoU, DIoU) +- Training strategies (warmup, cosine annealing, EMA) +- Data augmentation for detection (mosaic, mixup, copy-paste) ### 3. 
Production Vision Systems -Technical reference guide in `references/production_vision_systems.md` with: +See `references/production_vision_systems.md` for: -- System design principles -- Implementation examples -- Configuration best practices -- Deployment strategies -- Monitoring and observability - -## Production Patterns - -### Pattern 1: Scalable Data Processing - -Enterprise-scale data processing with distributed computing: - -- Horizontal scaling architecture -- Fault-tolerant design -- Real-time and batch processing -- Data quality validation -- Performance monitoring - -### Pattern 2: ML Model Deployment - -Production ML system with high availability: - -- Model serving with low latency -- A/B testing infrastructure -- Feature store integration -- Model monitoring and drift detection -- Automated retraining pipelines - -### Pattern 3: Real-Time Inference - -High-throughput inference system: - -- Batching and caching strategies -- Load balancing -- Auto-scaling -- Latency optimization -- Cost optimization - -## Best Practices - -### Development - -- Test-driven development -- Code reviews and pair programming -- Documentation as code -- Version control everything -- Continuous integration - -### Production - -- Monitor everything critical -- Automate deployments -- Feature flags for releases -- Canary deployments -- Comprehensive logging - -### Team Leadership - -- Mentor junior engineers -- Drive technical decisions -- Establish coding standards -- Foster learning culture -- Cross-functional collaboration - -## Performance Targets - -**Latency:** -- P50: < 50ms -- P95: < 100ms -- P99: < 200ms - -**Throughput:** -- Requests/second: > 1000 -- Concurrent users: > 10,000 - -**Availability:** -- Uptime: 99.9% -- Error rate: < 0.1% - -## Security & Compliance - -- Authentication & authorization -- Data encryption (at rest & in transit) -- PII handling and anonymization -- GDPR/CCPA compliance -- Regular security audits -- Vulnerability management +- ONNX export and optimization +- TensorRT deployment pipeline +- Batch inference optimization +- Edge device deployment (Jetson, Intel NCS) +- Model serving with Triton +- Video processing pipelines ## Common Commands +### Ultralytics YOLO + ```bash -# Development -python -m pytest tests/ -v --cov -python -m black src/ -python -m pylint src/ - # Training -python scripts/train.py --config prod.yaml -python scripts/evaluate.py --model best.pth +yolo detect train data=coco.yaml model=yolov8m.pt epochs=100 imgsz=640 -# Deployment -docker build -t service:v1 . 
-kubectl apply -f k8s/ -helm upgrade service ./charts/ +# Validation +yolo detect val model=best.pt data=coco.yaml -# Monitoring -kubectl logs -f deployment/service -python scripts/health_check.py +# Inference +yolo detect predict model=best.pt source=images/ save=True + +# Export +yolo export model=best.pt format=onnx simplify=True dynamic=True ``` +### Detectron2 + +```bash +# Training +python train_net.py --config-file configs/COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml \ + --num-gpus 1 OUTPUT_DIR ./output + +# Evaluation +python train_net.py --config-file configs/faster_rcnn.yaml --eval-only \ + MODEL.WEIGHTS output/model_final.pth + +# Inference +python demo.py --config-file configs/faster_rcnn.yaml \ + --input images/*.jpg --output results/ \ + --opts MODEL.WEIGHTS output/model_final.pth +``` + +### MMDetection + +```bash +# Training +python tools/train.py configs/faster_rcnn/faster-rcnn_r50_fpn_1x_coco.py + +# Testing +python tools/test.py configs/faster_rcnn.py checkpoints/latest.pth --eval bbox + +# Inference +python demo/image_demo.py demo.jpg configs/faster_rcnn.py checkpoints/latest.pth +``` + +### Model Optimization + +```bash +# ONNX export and simplify +python -c "import torch; model = torch.load('model.pt'); torch.onnx.export(model, torch.randn(1,3,640,640), 'model.onnx', opset_version=17)" +python -m onnxsim model.onnx model_sim.onnx + +# TensorRT conversion +trtexec --onnx=model.onnx --saveEngine=model.engine --fp16 --workspace=4096 + +# Benchmark +trtexec --loadEngine=model.engine --batch=1 --iterations=1000 --avgRuns=100 +``` + +## Performance Targets + +| Metric | Real-time | High Accuracy | Edge | +|--------|-----------|---------------|------| +| FPS | >30 | >10 | >15 | +| mAP@50 | >0.6 | >0.8 | >0.5 | +| Latency P99 | <50ms | <150ms | <100ms | +| GPU Memory | <4GB | <8GB | <2GB | +| Model Size | <50MB | <200MB | <20MB | + ## Resources -- Advanced Patterns: `references/computer_vision_architectures.md` -- Implementation Guide: `references/object_detection_optimization.md` -- Technical Reference: `references/production_vision_systems.md` -- Automation Scripts: `scripts/` directory - -## Senior-Level Responsibilities - -As a world-class senior professional: - -1. **Technical Leadership** - - Drive architectural decisions - - Mentor team members - - Establish best practices - - Ensure code quality - -2. **Strategic Thinking** - - Align with business goals - - Evaluate trade-offs - - Plan for scale - - Manage technical debt - -3. **Collaboration** - - Work across teams - - Communicate effectively - - Build consensus - - Share knowledge - -4. **Innovation** - - Stay current with research - - Experiment with new approaches - - Contribute to community - - Drive continuous improvement - -5. 
**Production Excellence** - - Ensure high availability - - Monitor proactively - - Optimize performance - - Respond to incidents +- **Architecture Guide**: `references/computer_vision_architectures.md` +- **Optimization Guide**: `references/object_detection_optimization.md` +- **Deployment Guide**: `references/production_vision_systems.md` +- **Scripts**: `scripts/` directory for automation tools diff --git a/engineering-team/senior-computer-vision/references/computer_vision_architectures.md b/engineering-team/senior-computer-vision/references/computer_vision_architectures.md index ea5f5df..3e6a22a 100644 --- a/engineering-team/senior-computer-vision/references/computer_vision_architectures.md +++ b/engineering-team/senior-computer-vision/references/computer_vision_architectures.md @@ -1,80 +1,683 @@ # Computer Vision Architectures -## Overview +Comprehensive guide to CNN and Vision Transformer architectures for object detection, segmentation, and image classification. -World-class computer vision architectures for senior computer vision engineer. +## Table of Contents -## Core Principles +- [Backbone Architectures](#backbone-architectures) +- [Detection Architectures](#detection-architectures) +- [Segmentation Architectures](#segmentation-architectures) +- [Vision Transformers](#vision-transformers) +- [Feature Pyramid Networks](#feature-pyramid-networks) +- [Architecture Selection](#architecture-selection) -### Production-First Design +--- -Always design with production in mind: -- Scalability: Handle 10x current load -- Reliability: 99.9% uptime target -- Maintainability: Clear, documented code -- Observability: Monitor everything +## Backbone Architectures -### Performance by Design +Backbone networks extract feature representations from images. The choice of backbone affects both accuracy and inference speed. -Optimize from the start: -- Efficient algorithms -- Resource awareness -- Strategic caching -- Batch processing +### ResNet Family -### Security & Privacy +ResNet introduced residual connections that enable training of very deep networks. -Build security in: -- Input validation -- Data encryption -- Access control -- Audit logging +| Variant | Params | GFLOPs | Top-1 Acc | Use Case | +|---------|--------|--------|-----------|----------| +| ResNet-18 | 11.7M | 1.8 | 69.8% | Edge, mobile | +| ResNet-34 | 21.8M | 3.7 | 73.3% | Balanced | +| ResNet-50 | 25.6M | 4.1 | 76.1% | Standard backbone | +| ResNet-101 | 44.5M | 7.8 | 77.4% | High accuracy | +| ResNet-152 | 60.2M | 11.6 | 78.3% | Maximum accuracy | -## Advanced Patterns +**Residual Block Architecture:** -### Pattern 1: Distributed Processing +``` +Input + | + +---> Conv 1x1 (reduce channels) + | | + | Conv 3x3 + | | + | Conv 1x1 (expand channels) + | | + +-----> Add <----+ + | + ReLU + | + Output +``` -Enterprise-scale data processing with fault tolerance. +**When to use ResNet:** +- Standard detection/segmentation tasks +- When pretrained weights are important +- Moderate compute budget +- Well-understood, stable architecture -### Pattern 2: Real-Time Systems +### EfficientNet Family -Low-latency, high-throughput systems. +EfficientNet uses compound scaling to balance depth, width, and resolution. 
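+
+The compound scaling rule ties depth, width, and input resolution to a single coefficient phi: depth scales as alpha^phi, width as beta^phi, and resolution as gamma^phi, with alpha * beta^2 * gamma^2 ~= 2 so that FLOPs roughly double per step of phi. A sketch of the multipliers using the paper's coefficients (alpha=1.2, beta=1.1, gamma=1.15); note the released B1-B7 models round the resulting input sizes:
+
+```python
+ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # depth/width/resolution bases from the paper
+
+def compound_scale(phi: int):
+    """Return (depth, width, resolution) multipliers for coefficient phi."""
+    return ALPHA ** phi, BETA ** phi, GAMMA ** phi
+
+for phi in range(8):  # roughly B0 through B7
+    d, w, r = compound_scale(phi)
+    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, input ~{round(224 * r)} px")
+```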
-### Pattern 3: ML at Scale +| Variant | Params | GFLOPs | Top-1 Acc | Relative Speed | +|---------|--------|--------|-----------|----------------| +| EfficientNet-B0 | 5.3M | 0.4 | 77.1% | 1x | +| EfficientNet-B1 | 7.8M | 0.7 | 79.1% | 0.7x | +| EfficientNet-B2 | 9.2M | 1.0 | 80.1% | 0.6x | +| EfficientNet-B3 | 12M | 1.8 | 81.6% | 0.4x | +| EfficientNet-B4 | 19M | 4.2 | 82.9% | 0.25x | +| EfficientNet-B5 | 30M | 9.9 | 83.6% | 0.15x | +| EfficientNet-B6 | 43M | 19 | 84.0% | 0.1x | +| EfficientNet-B7 | 66M | 37 | 84.3% | 0.05x | -Production ML with monitoring and automation. +**Key innovations:** +- Mobile Inverted Bottleneck (MBConv) blocks +- Squeeze-and-Excitation attention +- Compound scaling coefficients +- Swish activation function -## Best Practices +**When to use EfficientNet:** +- Mobile and edge deployment +- When parameter efficiency matters +- Classification tasks +- Limited compute resources -### Code Quality -- Comprehensive testing -- Clear documentation -- Code reviews -- Type hints +### ConvNeXt -### Performance -- Profile before optimizing -- Monitor continuously -- Cache strategically -- Batch operations +ConvNeXt modernizes ResNet with techniques from Vision Transformers. -### Reliability -- Design for failure -- Implement retries -- Use circuit breakers -- Monitor health +| Variant | Params | GFLOPs | Top-1 Acc | +|---------|--------|--------|-----------| +| ConvNeXt-T | 29M | 4.5 | 82.1% | +| ConvNeXt-S | 50M | 8.7 | 83.1% | +| ConvNeXt-B | 89M | 15.4 | 83.8% | +| ConvNeXt-L | 198M | 34.4 | 84.3% | +| ConvNeXt-XL | 350M | 60.9 | 84.7% | -## Tools & Technologies +**Key design choices:** +- 7x7 depthwise convolutions (like ViT patch size) +- Layer normalization instead of batch norm +- GELU activation +- Fewer but wider stages +- Inverted bottleneck design -Essential tools for this domain: -- Development frameworks -- Testing libraries -- Deployment platforms -- Monitoring solutions +**ConvNeXt Block:** -## Further Reading +``` +Input + | + +---> DWConv 7x7 + | | + | LayerNorm + | | + | Linear (4x channels) + | | + | GELU + | | + | Linear (1x channels) + | | + +-----> Add <----+ + | + Output +``` -- Research papers -- Industry blogs -- Conference talks -- Open source projects +### CSPNet (Cross Stage Partial) + +CSPNet is the backbone design used in YOLO v4-v8. + +**Key features:** +- Gradient flow optimization +- Reduced computation while maintaining accuracy +- Cross-stage partial connections +- Optimized for real-time detection + +**CSP Block:** + +``` +Input + | + +----> Split ----+ + | | + | Conv Block + | | + | Conv Block + | | + +----> Concat <--+ + | + Output +``` + +--- + +## Detection Architectures + +### Two-Stage Detectors + +Two-stage detectors first propose regions, then classify and refine them. + +#### Faster R-CNN + +Architecture: +1. **Backbone**: Feature extraction (ResNet, etc.) +2. **RPN (Region Proposal Network)**: Generate object proposals +3. **RoI Pooling/Align**: Extract fixed-size features +4. 
**Classification Head**: Classify and refine boxes + +``` +Image → Backbone → Feature Map + | + +→ RPN → Proposals + | | + +→ RoI Align ← + + | + FC Layers + | + Class + BBox +``` + +**RPN Details:** +- Sliding window over feature map +- Anchor boxes at each position (3 scales × 3 ratios = 9) +- Predicts objectness score and box refinement +- NMS to reduce proposals (typically 300-2000) + +**Performance characteristics:** +- mAP@50:95: ~40-42 (COCO, R50-FPN) +- Inference: ~50-100ms per image +- Better localization than single-stage +- Slower but more accurate + +#### Cascade R-CNN + +Multi-stage refinement with increasing IoU thresholds. + +``` +Stage 1 (IoU 0.5) → Stage 2 (IoU 0.6) → Stage 3 (IoU 0.7) +``` + +**Benefits:** +- Progressive refinement +- Better high-IoU predictions +- +3-4 mAP over Faster R-CNN +- Minimal additional cost per stage + +### Single-Stage Detectors + +Single-stage detectors predict boxes and classes in one pass. + +#### YOLO Family + +**YOLOv8 Architecture:** + +``` +Input Image + | + Backbone (CSPDarknet) + | + +--+--+--+ + | | | | + P3 P4 P5 (multi-scale features) + | | | + Neck (PANet + C2f) + | | | + Head (Decoupled) + | + Boxes + Classes +``` + +**Key YOLOv8 innovations:** +- C2f module (faster CSP variant) +- Anchor-free detection head +- Decoupled classification/regression heads +- Task-aligned assigner (TAL) +- Distribution focal loss (DFL) + +**YOLO variant comparison:** + +| Model | Size (px) | Params | mAP@50:95 | Speed (ms) | +|-------|-----------|--------|-----------|------------| +| YOLOv5n | 640 | 1.9M | 28.0 | 1.2 | +| YOLOv5s | 640 | 7.2M | 37.4 | 1.8 | +| YOLOv5m | 640 | 21.2M | 45.4 | 3.5 | +| YOLOv8n | 640 | 3.2M | 37.3 | 1.2 | +| YOLOv8s | 640 | 11.2M | 44.9 | 2.1 | +| YOLOv8m | 640 | 25.9M | 50.2 | 4.2 | +| YOLOv8l | 640 | 43.7M | 52.9 | 6.8 | +| YOLOv8x | 640 | 68.2M | 53.9 | 10.1 | + +#### SSD (Single Shot Detector) + +Multi-scale detection with default boxes. + +**Architecture:** +- VGG16 or MobileNet backbone +- Additional convolution layers for multi-scale +- Default boxes at each scale +- Direct classification and regression + +**When to use SSD:** +- Edge deployment (SSD-MobileNet) +- When YOLO alternatives needed +- Simple architecture requirements + +#### RetinaNet + +Focal loss to handle class imbalance. + +**Key innovation:** +```python +FL(p_t) = -α_t * (1 - p_t)^γ * log(p_t) +``` + +Where: +- γ (focusing parameter) = 2 typically +- α (class weight) = 0.25 for background + +**Benefits:** +- Handles extreme foreground-background imbalance +- Matches two-stage accuracy +- Single-stage speed + +--- + +## Segmentation Architectures + +### Instance Segmentation + +#### Mask R-CNN + +Extends Faster R-CNN with mask prediction branch. + +``` +RoI Features → FC Layers → Class + BBox + | + +→ Conv Layers → Mask (28×28 per class) +``` + +**Key details:** +- RoI Align (bilinear interpolation, no quantization) +- Per-class binary mask prediction +- Decoupled mask and classification +- 14×14 or 28×28 mask resolution + +**Performance:** +- mAP (box): ~39 on COCO +- mAP (mask): ~35 on COCO +- Inference: ~100-200ms + +#### YOLACT / YOLACT++ + +Real-time instance segmentation. + +**Approach:** +1. Generate prototype masks (global) +2. Predict mask coefficients per instance +3. Linear combination: mask = Σ(coefficients × prototypes) + +**Benefits:** +- Real-time (~30 FPS) +- Simpler than Mask R-CNN +- Global prototypes capture spatial info + +#### YOLOv8-Seg + +Adds segmentation head to YOLOv8. 
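+
+A minimal inference sketch with the Ultralytics API (pretrained weights download on first use):
+
+```python
+from ultralytics import YOLO
+
+model = YOLO("yolov8m-seg.pt")
+results = model.predict("image.jpg", conf=0.25)
+
+r = results[0]
+print(r.boxes.xyxy.shape)      # (num_instances, 4) boxes in xyxy format
+if r.masks is not None:        # masks is None when nothing is detected
+    print(r.masks.data.shape)  # (num_instances, H, W) binary masks
+```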
+ +**Performance:** +- mAP (box): 44.6 +- mAP (mask): 36.8 +- Speed: 4.5ms + +### Semantic Segmentation + +#### DeepLabV3+ + +Atrous convolutions for multi-scale context. + +**Key components:** +1. **ASPP (Atrous Spatial Pyramid Pooling)** + - Parallel atrous convolutions at different rates + - Captures multi-scale context + - Rates: 6, 12, 18 typically + +2. **Encoder-Decoder** + - Encoder: Backbone + ASPP + - Decoder: Upsample with skip connections + +``` +Image → Backbone → ASPP → Decoder → Segmentation + ↘ ↗ + Low-level features +``` + +**Performance:** +- mIoU: 89.0 on Cityscapes +- Inference: ~25ms (ResNet-50) + +#### SegFormer + +Transformer-based semantic segmentation. + +**Architecture:** +1. **Hierarchical Transformer Encoder** + - Multi-scale feature maps + - Efficient self-attention + - Overlapping patch embedding + +2. **MLP Decoder** + - Simple MLP aggregation + - No complex decoders needed + +**Benefits:** +- No positional encoding needed +- Efficient attention mechanism +- Strong multi-scale features + +### Promptable Segmentation + +#### SAM (Segment Anything Model) + +Zero-shot segmentation with prompts. + +**Architecture:** +1. **Image Encoder**: ViT-H (632M params) +2. **Prompt Encoder**: Points, boxes, masks, text +3. **Mask Decoder**: Lightweight transformer + +**Prompts supported:** +- Points (foreground/background) +- Bounding boxes +- Rough masks +- Text (via CLIP integration) + +**Usage patterns:** +```python +# Point prompt +masks = sam.predict(image, point_coords=[[500, 375]], point_labels=[1]) + +# Box prompt +masks = sam.predict(image, box=[100, 100, 400, 400]) + +# Multiple points +masks = sam.predict(image, point_coords=[[500, 375], [200, 300]], + point_labels=[1, 0]) # 1=foreground, 0=background +``` + +--- + +## Vision Transformers + +### ViT (Vision Transformer) + +Original vision transformer architecture. + +**Architecture:** + +``` +Image → Patch Embedding → [CLS] + Position Embedding + ↓ + Transformer Encoder ×L + ↓ + [CLS] token + ↓ + Classification Head +``` + +**Key details:** +- Patch size: 16×16 or 14×14 typically +- Position embeddings: Learned 1D +- [CLS] token for classification +- Standard transformer encoder blocks + +**Variants:** + +| Model | Patch | Layers | Hidden | Heads | Params | +|-------|-------|--------|--------|-------|--------| +| ViT-Ti | 16 | 12 | 192 | 3 | 5.7M | +| ViT-S | 16 | 12 | 384 | 6 | 22M | +| ViT-B | 16 | 12 | 768 | 12 | 86M | +| ViT-L | 16 | 24 | 1024 | 16 | 304M | +| ViT-H | 14 | 32 | 1280 | 16 | 632M | + +### DeiT (Data-efficient Image Transformers) + +Training ViT without massive datasets. + +**Key innovations:** +- Knowledge distillation from CNN teachers +- Strong data augmentation +- Regularization (stochastic depth, label smoothing) +- Distillation token (learns from teacher) + +**Training recipe:** +- RandAugment +- Mixup (α=0.8) +- CutMix (α=1.0) +- Random erasing (p=0.25) +- Stochastic depth (p=0.1) + +### Swin Transformer + +Hierarchical transformer with shifted windows. + +**Key innovations:** +1. **Shifted Window Attention** + - Local attention within windows + - Cross-window connection via shifting + - O(n) complexity vs O(n²) for global attention + +2. 
**Hierarchical Feature Maps** + - Patch merging between stages + - Similar to CNN feature pyramids + - Direct use in detection/segmentation + +**Architecture:** + +``` +Stage 1: 56×56, 96-dim → Patch Merge +Stage 2: 28×28, 192-dim → Patch Merge +Stage 3: 14×14, 384-dim → Patch Merge +Stage 4: 7×7, 768-dim +``` + +**Variants:** + +| Model | Params | GFLOPs | Top-1 | +|-------|--------|--------|-------| +| Swin-T | 29M | 4.5 | 81.3% | +| Swin-S | 50M | 8.7 | 83.0% | +| Swin-B | 88M | 15.4 | 83.5% | +| Swin-L | 197M | 34.5 | 84.5% | + +--- + +## Feature Pyramid Networks + +FPN variants for multi-scale detection. + +### Original FPN + +Top-down pathway with lateral connections. + +``` +P5 ← C5 (1/32) + ↓ +P4 ← C4 + Upsample(P5) (1/16) + ↓ +P3 ← C3 + Upsample(P4) (1/8) + ↓ +P2 ← C2 + Upsample(P3) (1/4) +``` + +### PANet (Path Aggregation Network) + +Bottom-up augmentation after FPN. + +``` +FPN top-down → Bottom-up augmentation +P2 → N2 ↘ +P3 → N3 → N3 ↘ +P4 → N4 → N4 → N4 ↘ +P5 → N5 → N5 → N5 → N5 +``` + +**Benefits:** +- Shorter path from low-level to high-level +- Better localization signals +- +1-2 mAP improvement + +### BiFPN (Bidirectional FPN) + +Weighted bidirectional feature fusion. + +**Key innovations:** +- Learnable fusion weights +- Bidirectional cross-scale connections +- Repeated blocks for iterative refinement + +**Fusion formula:** +``` +O = Σ(w_i × I_i) / (ε + Σ w_i) +``` + +Where weights are learned via fast normalized fusion. + +### NAS-FPN + +Neural architecture search for FPN design. + +**Searched on COCO:** +- 7 fusion cells +- Optimized connection patterns +- 3-4 mAP improvement over FPN + +--- + +## Architecture Selection + +### Decision Matrix + +| Requirement | Recommended | Alternative | +|-------------|-------------|-------------| +| Real-time (>30 FPS) | YOLOv8s | RT-DETR-S | +| Edge (<4GB RAM) | YOLOv8n | MobileNetV3-SSD | +| High accuracy | DINO, Cascade R-CNN | YOLOv8x | +| Instance segmentation | Mask R-CNN | YOLOv8-seg | +| Semantic segmentation | SegFormer | DeepLabV3+ | +| Zero-shot | SAM | CLIP+segmentation | +| Small objects | YOLO+SAHI | Cascade R-CNN | +| Video real-time | YOLOv8 + ByteTrack | YOLOX + SORT | + +### Training Data Requirements + +| Architecture | Minimum Images | Recommended | +|--------------|----------------|-------------| +| YOLO (fine-tune) | 100-500 | 1,000-5,000 | +| YOLO (from scratch) | 5,000+ | 10,000+ | +| Faster R-CNN | 1,000+ | 5,000+ | +| DETR/DINO | 10,000+ | 50,000+ | +| ViT backbone | 10,000+ | 100,000+ | +| SAM (fine-tune) | 100-1,000 | 5,000+ | + +### Compute Requirements + +| Architecture | Training GPU | Inference GPU | +|--------------|--------------|---------------| +| YOLOv8n | 4GB VRAM | 2GB VRAM | +| YOLOv8m | 8GB VRAM | 4GB VRAM | +| YOLOv8x | 16GB VRAM | 8GB VRAM | +| Faster R-CNN R50 | 8GB VRAM | 4GB VRAM | +| Mask R-CNN R101 | 16GB VRAM | 8GB VRAM | +| DINO-4scale | 32GB VRAM | 16GB VRAM | +| SAM ViT-H | 32GB VRAM | 8GB VRAM | + +--- + +## Code Examples + +### Load Pretrained Backbone (timm) + +```python +import timm + +# List available models +print(timm.list_models('*resnet*')) + +# Load pretrained +backbone = timm.create_model('resnet50', pretrained=True, features_only=True) + +# Get feature maps +features = backbone(torch.randn(1, 3, 224, 224)) +for f in features: + print(f.shape) +# torch.Size([1, 64, 56, 56]) +# torch.Size([1, 256, 56, 56]) +# torch.Size([1, 512, 28, 28]) +# torch.Size([1, 1024, 14, 14]) +# torch.Size([1, 2048, 7, 7]) +``` + +### Custom Detection Backbone + +```python +import torch.nn as 
nn +from torchvision.models import resnet50 +from torchvision.ops import FeaturePyramidNetwork + +class DetectionBackbone(nn.Module): + def __init__(self): + super().__init__() + backbone = resnet50(pretrained=True) + + self.layer1 = nn.Sequential(backbone.conv1, backbone.bn1, + backbone.relu, backbone.maxpool, + backbone.layer1) + self.layer2 = backbone.layer2 + self.layer3 = backbone.layer3 + self.layer4 = backbone.layer4 + + self.fpn = FeaturePyramidNetwork( + in_channels_list=[256, 512, 1024, 2048], + out_channels=256 + ) + + def forward(self, x): + c1 = self.layer1(x) + c2 = self.layer2(c1) + c3 = self.layer3(c2) + c4 = self.layer4(c3) + + features = {'feat0': c1, 'feat1': c2, 'feat2': c3, 'feat3': c4} + pyramid = self.fpn(features) + return pyramid +``` + +### Vision Transformer with Detection Head + +```python +import timm + +# Swin Transformer for detection +swin = timm.create_model('swin_base_patch4_window7_224', + pretrained=True, + features_only=True, + out_indices=[0, 1, 2, 3]) + +# Get multi-scale features +x = torch.randn(1, 3, 224, 224) +features = swin(x) +for i, f in enumerate(features): + print(f"Stage {i}: {f.shape}") +# Stage 0: torch.Size([1, 128, 56, 56]) +# Stage 1: torch.Size([1, 256, 28, 28]) +# Stage 2: torch.Size([1, 512, 14, 14]) +# Stage 3: torch.Size([1, 1024, 7, 7]) +``` + +--- + +## Resources + +- [torchvision models](https://pytorch.org/vision/stable/models.html) +- [timm library](https://github.com/huggingface/pytorch-image-models) +- [Detectron2 Model Zoo](https://github.com/facebookresearch/detectron2/blob/main/MODEL_ZOO.md) +- [MMDetection Model Zoo](https://github.com/open-mmlab/mmdetection/blob/main/docs/en/model_zoo.md) +- [Ultralytics YOLOv8](https://docs.ultralytics.com/) diff --git a/engineering-team/senior-computer-vision/references/object_detection_optimization.md b/engineering-team/senior-computer-vision/references/object_detection_optimization.md index 81a7c2d..cc7bca5 100644 --- a/engineering-team/senior-computer-vision/references/object_detection_optimization.md +++ b/engineering-team/senior-computer-vision/references/object_detection_optimization.md @@ -1,80 +1,885 @@ # Object Detection Optimization -## Overview +Comprehensive guide to optimizing object detection models for accuracy and inference speed. -World-class object detection optimization for senior computer vision engineer. +## Table of Contents -## Core Principles +- [Non-Maximum Suppression](#non-maximum-suppression) +- [Anchor Design and Optimization](#anchor-design-and-optimization) +- [Loss Functions](#loss-functions) +- [Training Strategies](#training-strategies) +- [Data Augmentation](#data-augmentation) +- [Model Optimization Techniques](#model-optimization-techniques) +- [Hyperparameter Tuning](#hyperparameter-tuning) -### Production-First Design +--- -Always design with production in mind: -- Scalability: Handle 10x current load -- Reliability: 99.9% uptime target -- Maintainability: Clear, documented code -- Observability: Monitor everything +## Non-Maximum Suppression -### Performance by Design +NMS removes redundant overlapping detections to produce final predictions. -Optimize from the start: -- Efficient algorithms -- Resource awareness -- Strategic caching -- Batch processing +### Standard NMS -### Security & Privacy +Basic algorithm: +1. Sort boxes by confidence score +2. Select highest confidence box +3. Remove boxes with IoU > threshold +4. 
Repeat until no boxes remain -Build security in: -- Input validation -- Data encryption -- Access control -- Audit logging +```python +def nms(boxes, scores, iou_threshold=0.5): + """ + boxes: (N, 4) in format [x1, y1, x2, y2] + scores: (N,) + """ + order = scores.argsort()[::-1] + keep = [] -## Advanced Patterns + while len(order) > 0: + i = order[0] + keep.append(i) -### Pattern 1: Distributed Processing + if len(order) == 1: + break -Enterprise-scale data processing with fault tolerance. + # Calculate IoU with remaining boxes + ious = compute_iou(boxes[i], boxes[order[1:]]) -### Pattern 2: Real-Time Systems + # Keep boxes with IoU <= threshold + mask = ious <= iou_threshold + order = order[1:][mask] -Low-latency, high-throughput systems. + return keep +``` -### Pattern 3: ML at Scale +**Parameters:** +- `iou_threshold`: 0.5-0.7 typical (lower = more suppression) +- `score_threshold`: 0.25-0.5 (filter low-confidence first) -Production ML with monitoring and automation. +### Soft-NMS -## Best Practices +Reduces scores instead of removing boxes entirely. -### Code Quality -- Comprehensive testing -- Clear documentation -- Code reviews -- Type hints +**Formula:** +``` +score = score * exp(-IoU^2 / sigma) +``` -### Performance -- Profile before optimizing -- Monitor continuously -- Cache strategically -- Batch operations +**Benefits:** +- Better for overlapping objects +- +1-2% mAP improvement +- Slightly slower than hard NMS -### Reliability -- Design for failure -- Implement retries -- Use circuit breakers -- Monitor health +```python +def soft_nms(boxes, scores, sigma=0.5, score_threshold=0.001): + """Gaussian penalty soft-NMS""" + order = scores.argsort()[::-1] + keep = [] -## Tools & Technologies + while len(order) > 0: + i = order[0] + keep.append(i) -Essential tools for this domain: -- Development frameworks -- Testing libraries -- Deployment platforms -- Monitoring solutions + if len(order) == 1: + break -## Further Reading + ious = compute_iou(boxes[i], boxes[order[1:]]) -- Research papers -- Industry blogs -- Conference talks -- Open source projects + # Gaussian penalty + weights = np.exp(-ious**2 / sigma) + scores[order[1:]] *= weights + + # Re-sort by updated scores + mask = scores[order[1:]] > score_threshold + order = order[1:][mask] + order = order[scores[order].argsort()[::-1]] + + return keep +``` + +### DIoU-NMS + +Uses Distance-IoU instead of standard IoU. + +**Formula:** +``` +DIoU = IoU - (d^2 / c^2) +``` + +Where: +- d = center distance between boxes +- c = diagonal of smallest enclosing box + +**Benefits:** +- Better for occluded objects +- Penalizes distant boxes less +- Works well with DIoU loss + +### Batched NMS + +NMS per class (prevents cross-class suppression). + +```python +def batched_nms(boxes, scores, classes, iou_threshold): + """Per-class NMS""" + # Offset boxes by class ID to prevent cross-class suppression + max_coordinate = boxes.max() + offsets = classes * (max_coordinate + 1) + boxes_for_nms = boxes + offsets[:, None] + + keep = torchvision.ops.nms(boxes_for_nms, scores, iou_threshold) + return keep +``` + +### NMS-Free Detection (DETR-style) + +Transformer-based detectors eliminate NMS. 
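+
+The core mechanism, summarized below, is one-to-one bipartite (Hungarian) matching between predictions and ground truth during training. A simplified matching sketch using SciPy; real DETR also adds a GIoU term to the cost and auxiliary decoder losses:
+
+```python
+import numpy as np
+from scipy.optimize import linear_sum_assignment
+
+def hungarian_match(pred_probs, pred_boxes, gt_classes, gt_boxes, l1_weight=5.0):
+    """Match N predictions to M ground truths (N >= M).
+
+    pred_probs: (N, C) class probabilities; boxes are normalized cxcywh.
+    Cost combines class likelihood and L1 box distance.
+    """
+    cls_cost = -pred_probs[:, gt_classes]                            # (N, M)
+    box_cost = np.abs(pred_boxes[:, None] - gt_boxes[None]).sum(-1)  # (N, M)
+    pred_idx, gt_idx = linear_sum_assignment(cls_cost + l1_weight * box_cost)
+    return pred_idx, gt_idx  # unmatched predictions train as "no object"
+```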
+ +**How DETR avoids NMS:** +- Object queries are learned embeddings +- Bipartite matching in training +- Each query outputs exactly one detection +- Set-based loss enforces uniqueness + +**Benefits:** +- End-to-end differentiable +- No hand-crafted post-processing +- Better for complex scenes + +--- + +## Anchor Design and Optimization + +### Anchor-Based Detection + +Traditional detectors use predefined anchor boxes. + +**Anchor parameters:** +- Scales: [32, 64, 128, 256, 512] pixels +- Ratios: [0.5, 1.0, 2.0] (height/width) +- Stride: Feature map stride (8, 16, 32) + +**Anchor assignment:** +- Positive: IoU > 0.7 with ground truth +- Negative: IoU < 0.3 with all ground truths +- Ignored: 0.3 < IoU < 0.7 + +### K-Means Anchor Clustering + +Optimize anchors for your dataset. + +```python +import numpy as np +from sklearn.cluster import KMeans + +def optimize_anchors(annotations, num_anchors=9, image_size=640): + """ + annotations: list of (width, height) for each bounding box + """ + # Normalize to input size + boxes = np.array(annotations) + boxes = boxes / boxes.max() * image_size + + # K-means clustering + kmeans = KMeans(n_clusters=num_anchors, random_state=42) + kmeans.fit(boxes) + + # Get anchor sizes + anchors = kmeans.cluster_centers_ + + # Sort by area + areas = anchors[:, 0] * anchors[:, 1] + anchors = anchors[np.argsort(areas)] + + # Calculate mean IoU with ground truth + mean_iou = calculate_anchor_fit(boxes, anchors) + print(f"Optimized anchors (mean IoU: {mean_iou:.3f}):") + print(anchors.astype(int)) + + return anchors + +def calculate_anchor_fit(boxes, anchors): + """Calculate how well anchors fit the boxes""" + ious = [] + for box in boxes: + box_area = box[0] * box[1] + anchor_areas = anchors[:, 0] * anchors[:, 1] + intersections = np.minimum(box[0], anchors[:, 0]) * \ + np.minimum(box[1], anchors[:, 1]) + unions = box_area + anchor_areas - intersections + max_iou = (intersections / unions).max() + ious.append(max_iou) + return np.mean(ious) +``` + +### Anchor-Free Detection + +Modern detectors predict boxes without anchors. + +**FCOS-style (center-based):** +- Predict (l, t, r, b) distances from center +- Centerness score for quality +- Multi-scale assignment + +**YOLO v8 style:** +- Predict (x, y, w, h) directly +- Task-aligned assigner +- Distribution focal loss for regression + +**Benefits of anchor-free:** +- No hyperparameter tuning for anchors +- Simpler architecture +- Better generalization + +### Anchor Assignment Strategies + +**ATSS (Adaptive Training Sample Selection):** +1. For each GT, select k closest anchors per level +2. Calculate IoU for selected anchors +3. IoU threshold = mean + std of IoUs +4. Assign positives where IoU > threshold + +**TAL (Task-Aligned Assigner - YOLO v8):** +``` +score = cls_score^alpha * IoU^beta +``` + +Where alpha=0.5, beta=6.0 (weights classification and localization) + +--- + +## Loss Functions + +### Classification Losses + +#### Cross-Entropy Loss + +Standard multi-class classification: +```python +loss = -log(p_correct_class) +``` + +#### Focal Loss + +Handles class imbalance by down-weighting easy examples. 
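+
+Formally, with p_t the model's probability for the true class:
+
+```
+FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t)
+```
+
+Easy examples (p_t near 1) are suppressed by the (1 - p_t)^gamma factor, so the gradient signal concentrates on hard, misclassified examples. A simplified sketch: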
+ +```python +def focal_loss(pred, target, gamma=2.0, alpha=0.25): + """ + pred: (N, num_classes) predicted probabilities + target: (N,) ground truth class indices + """ + ce_loss = F.cross_entropy(pred, target, reduction='none') + pt = torch.exp(-ce_loss) # probability of correct class + + # Focal term: (1 - pt)^gamma + focal_term = (1 - pt) ** gamma + + # Alpha weighting + alpha_t = alpha * target + (1 - alpha) * (1 - target) + + loss = alpha_t * focal_term * ce_loss + return loss.mean() +``` + +**Hyperparameters:** +- gamma: 2.0 typical, higher = more focus on hard examples +- alpha: 0.25 for foreground class weight + +#### Quality Focal Loss (QFL) + +Combines classification with IoU quality. + +```python +def quality_focal_loss(pred, target, beta=2.0): + """ + target: IoU values (0-1) instead of binary + """ + ce = F.binary_cross_entropy(pred, target, reduction='none') + focal_weight = torch.abs(pred - target) ** beta + loss = focal_weight * ce + return loss.mean() +``` + +### Regression Losses + +#### Smooth L1 Loss + +```python +def smooth_l1_loss(pred, target, beta=1.0): + diff = torch.abs(pred - target) + loss = torch.where( + diff < beta, + 0.5 * diff ** 2 / beta, + diff - 0.5 * beta + ) + return loss.mean() +``` + +#### IoU-Based Losses + +**IoU Loss:** +``` +L_IoU = 1 - IoU +``` + +**GIoU (Generalized IoU):** +``` +GIoU = IoU - (C - U) / C +L_GIoU = 1 - GIoU +``` + +Where C = area of smallest enclosing box, U = union area. + +**DIoU (Distance IoU):** +``` +DIoU = IoU - d^2 / c^2 +L_DIoU = 1 - DIoU +``` + +Where d = center distance, c = diagonal of enclosing box. + +**CIoU (Complete IoU):** +``` +CIoU = IoU - d^2 / c^2 - alpha*v +v = (4/pi^2) * (arctan(w_gt/h_gt) - arctan(w/h))^2 +alpha = v / (1 - IoU + v) +L_CIoU = 1 - CIoU +``` + +**Comparison:** + +| Loss | Handles | Best For | +|------|---------|----------| +| L1/L2 | Basic regression | Simple tasks | +| IoU | Overlap | Standard detection | +| GIoU | Non-overlapping | Distant boxes | +| DIoU | Center distance | Faster convergence | +| CIoU | Aspect ratio | Best accuracy | + +```python +def ciou_loss(pred_boxes, target_boxes): + """ + pred_boxes, target_boxes: (N, 4) as [x1, y1, x2, y2] + """ + # Standard IoU + inter = compute_intersection(pred_boxes, target_boxes) + union = compute_union(pred_boxes, target_boxes) + iou = inter / (union + 1e-7) + + # Enclosing box diagonal + enclose_x1 = torch.min(pred_boxes[:, 0], target_boxes[:, 0]) + enclose_y1 = torch.min(pred_boxes[:, 1], target_boxes[:, 1]) + enclose_x2 = torch.max(pred_boxes[:, 2], target_boxes[:, 2]) + enclose_y2 = torch.max(pred_boxes[:, 3], target_boxes[:, 3]) + c_sq = (enclose_x2 - enclose_x1)**2 + (enclose_y2 - enclose_y1)**2 + + # Center distance + pred_cx = (pred_boxes[:, 0] + pred_boxes[:, 2]) / 2 + pred_cy = (pred_boxes[:, 1] + pred_boxes[:, 3]) / 2 + target_cx = (target_boxes[:, 0] + target_boxes[:, 2]) / 2 + target_cy = (target_boxes[:, 1] + target_boxes[:, 3]) / 2 + d_sq = (pred_cx - target_cx)**2 + (pred_cy - target_cy)**2 + + # Aspect ratio term + pred_w = pred_boxes[:, 2] - pred_boxes[:, 0] + pred_h = pred_boxes[:, 3] - pred_boxes[:, 1] + target_w = target_boxes[:, 2] - target_boxes[:, 0] + target_h = target_boxes[:, 3] - target_boxes[:, 1] + + v = (4 / math.pi**2) * ( + torch.atan(target_w / target_h) - torch.atan(pred_w / pred_h) + )**2 + alpha_term = v / (1 - iou + v + 1e-7) + + ciou = iou - d_sq / (c_sq + 1e-7) - alpha_term * v + return 1 - ciou +``` + +### Distribution Focal Loss (DFL) + +Used in YOLO v8 for regression. 
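+
+The underlying loss, from the Generalized Focal Loss paper, pulls probability mass toward the two bins that bracket the continuous target y (S_i is the predicted probability of bin i):
+
+```
+DFL(S_i, S_{i+1}) = -((y_{i+1} - y) * log(S_i) + (y - y_i) * log(S_{i+1}))
+```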
+ +**Concept:** +- Predict distribution over discrete positions +- Each regression target is a soft label +- Allows uncertainty estimation + +```python +def dfl_loss(pred_dist, target, reg_max=16): + """ + pred_dist: (N, reg_max) predicted distribution + target: (N,) continuous target values (0 to reg_max) + """ + # Convert continuous target to soft label + target_left = target.floor().long() + target_right = target_left + 1 + weight_right = target - target_left.float() + weight_left = 1 - weight_right + + # Cross-entropy with soft targets + loss_left = F.cross_entropy(pred_dist, target_left, reduction='none') + loss_right = F.cross_entropy(pred_dist, target_right.clamp(max=reg_max-1), + reduction='none') + + loss = weight_left * loss_left + weight_right * loss_right + return loss.mean() +``` + +--- + +## Training Strategies + +### Learning Rate Schedules + +**Warmup:** +```python +# Linear warmup for first N epochs +if epoch < warmup_epochs: + lr = base_lr * (epoch + 1) / warmup_epochs +``` + +**Cosine Annealing:** +```python +lr = lr_min + 0.5 * (lr_max - lr_min) * (1 + cos(pi * epoch / total_epochs)) +``` + +**Step Decay:** +```python +# Reduce by factor at milestones +lr = base_lr * (0.1 ** (milestones_passed)) +``` + +**Recommended schedule for detection:** +```python +optimizer = SGD(model.parameters(), lr=0.01, momentum=0.937, weight_decay=0.0005) + +scheduler = torch.optim.lr_scheduler.CosineAnnealingLR( + optimizer, + T_max=total_epochs, + eta_min=0.0001 +) + +# With warmup +warmup_scheduler = torch.optim.lr_scheduler.LinearLR( + optimizer, + start_factor=0.1, + total_iters=warmup_epochs +) + +scheduler = torch.optim.lr_scheduler.SequentialLR( + optimizer, + schedulers=[warmup_scheduler, scheduler], + milestones=[warmup_epochs] +) +``` + +### Exponential Moving Average (EMA) + +Smooths model weights for better stability. + +```python +class EMA: + def __init__(self, model, decay=0.9999): + self.model = model + self.decay = decay + self.shadow = {} + for name, param in model.named_parameters(): + if param.requires_grad: + self.shadow[name] = param.data.clone() + + def update(self): + for name, param in self.model.named_parameters(): + if param.requires_grad: + self.shadow[name] = ( + self.decay * self.shadow[name] + + (1 - self.decay) * param.data + ) + + def apply_shadow(self): + for name, param in self.model.named_parameters(): + if param.requires_grad: + param.data.copy_(self.shadow[name]) +``` + +**Usage:** +- Update EMA after each training step +- Use EMA weights for validation/inference +- Decay: 0.9999 typical (higher = slower update) + +### Multi-Scale Training + +Train with varying input sizes. + +```python +# Random size each batch +sizes = [480, 512, 544, 576, 608, 640, 672, 704, 736, 768] +input_size = random.choice(sizes) + +# Resize batch to selected size +images = F.interpolate(images, size=input_size, mode='bilinear') +``` + +**Benefits:** +- Better scale invariance +- +1-2% mAP improvement +- Slower training (variable batch size) + +### Gradient Accumulation + +Simulate larger batch sizes. + +```python +accumulation_steps = 4 +optimizer.zero_grad() + +for i, (images, targets) in enumerate(dataloader): + loss = model(images, targets) / accumulation_steps + loss.backward() + + if (i + 1) % accumulation_steps == 0: + optimizer.step() + optimizer.zero_grad() +``` + +### Mixed Precision Training + +Use FP16 for speed and memory. 
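+
+FP16 gradients can underflow to zero, so `GradScaler` multiplies the loss by a large factor before `backward()`, unscales gradients before the optimizer step, and skips any step where inf/NaN gradients appear, adjusting the scale factor automatically: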
+ +```python +from torch.cuda.amp import autocast, GradScaler + +scaler = GradScaler() + +for images, targets in dataloader: + optimizer.zero_grad() + + with autocast(): + loss = model(images, targets) + + scaler.scale(loss).backward() + scaler.step(optimizer) + scaler.update() +``` + +**Benefits:** +- 2-3x faster training +- 50% memory reduction +- Minimal accuracy loss + +--- + +## Data Augmentation + +### Geometric Augmentations + +```python +import albumentations as A + +geometric = A.Compose([ + A.HorizontalFlip(p=0.5), + A.Rotate(limit=15, p=0.3), + A.RandomScale(scale_limit=0.2, p=0.5), + A.Affine(translate_percent={'x': (-0.1, 0.1), 'y': (-0.1, 0.1)}, p=0.3), +], bbox_params=A.BboxParams(format='coco', label_fields=['class_labels'])) +``` + +### Color Augmentations + +```python +color = A.Compose([ + A.RandomBrightnessContrast(brightness_limit=0.2, contrast_limit=0.2, p=0.5), + A.HueSaturationValue(hue_shift_limit=20, sat_shift_limit=30, val_shift_limit=20, p=0.5), + A.CLAHE(clip_limit=2.0, p=0.1), + A.GaussianBlur(blur_limit=3, p=0.1), + A.GaussNoise(var_limit=(10, 50), p=0.1), +]) +``` + +### Mosaic Augmentation + +Combines 4 images into one (YOLO-style). + +```python +def mosaic_augmentation(images, labels, input_size=640): + """ + images: list of 4 images + labels: list of 4 label arrays + """ + result_image = np.zeros((input_size, input_size, 3), dtype=np.uint8) + result_labels = [] + + # Random center point + cx = int(random.uniform(input_size * 0.25, input_size * 0.75)) + cy = int(random.uniform(input_size * 0.25, input_size * 0.75)) + + positions = [ + (0, 0, cx, cy), # top-left + (cx, 0, input_size, cy), # top-right + (0, cy, cx, input_size), # bottom-left + (cx, cy, input_size, input_size), # bottom-right + ] + + for i, (x1, y1, x2, y2) in enumerate(positions): + img = images[i] + h, w = y2 - y1, x2 - x1 + + # Resize and place + img_resized = cv2.resize(img, (w, h)) + result_image[y1:y2, x1:x2] = img_resized + + # Transform labels + for label in labels[i]: + # Scale and shift bounding boxes + new_label = transform_bbox(label, img.shape, (h, w), (x1, y1)) + result_labels.append(new_label) + + return result_image, result_labels +``` + +### MixUp + +Blends two images and labels. + +```python +def mixup(image1, labels1, image2, labels2, alpha=0.5): + """ + alpha: mixing ratio (0.5 = equal blend) + """ + # Blend images + mixed_image = (alpha * image1 + (1 - alpha) * image2).astype(np.uint8) + + # Blend labels with soft weights + labels1_weighted = [(box, cls, alpha) for box, cls in labels1] + labels2_weighted = [(box, cls, 1-alpha) for box, cls in labels2] + + mixed_labels = labels1_weighted + labels2_weighted + return mixed_image, mixed_labels +``` + +### Copy-Paste Augmentation + +Paste objects from one image to another. + +```python +def copy_paste(background, bg_labels, source, src_labels, src_masks): + """ + Paste segmented objects onto background + """ + result = background.copy() + + for mask, label in zip(src_masks, src_labels): + # Random position + x_offset = random.randint(0, background.shape[1] - mask.shape[1]) + y_offset = random.randint(0, background.shape[0] - mask.shape[0]) + + # Paste with mask + region = result[y_offset:y_offset+mask.shape[0], + x_offset:x_offset+mask.shape[1]] + region[mask > 0] = source[mask > 0] + + # Add new label + new_box = transform_bbox(label, x_offset, y_offset) + bg_labels.append(new_box) + + return result, bg_labels +``` + +### Cutout / Random Erasing + +Randomly erase patches. 
+

```python
def cutout(image, num_holes=8, max_h_size=32, max_w_size=32):
    h, w = image.shape[:2]
    result = image.copy()

    for _ in range(num_holes):
        y = random.randint(0, h)
        x = random.randint(0, w)
        h_size = random.randint(1, max_h_size)
        w_size = random.randint(1, max_w_size)

        y1, y2 = max(0, y - h_size // 2), min(h, y + h_size // 2)
        x1, x2 = max(0, x - w_size // 2), min(w, x + w_size // 2)

        result[y1:y2, x1:x2] = 0  # or a random color

    return result
```

---

## Model Optimization Techniques

### Pruning

Remove unimportant weights.

**Magnitude Pruning:**
```python
import torch.nn.utils.prune as prune

# Prune 30% of weights with smallest magnitude
for name, module in model.named_modules():
    if isinstance(module, nn.Conv2d):
        prune.l1_unstructured(module, name='weight', amount=0.3)
```

**Structured Pruning (channels):**
```python
# Prune entire channels (L2 norm over output channels, dim=0)
prune.ln_structured(module, name='weight', amount=0.3, n=2, dim=0)
```

### Knowledge Distillation

Train a smaller student model against a larger teacher.

```python
def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.7):
    """
    Combine soft targets from the teacher with hard labels.
    """
    # Soft targets
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    soft_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    soft_loss *= temperature ** 2  # Scale by T^2 to keep gradient magnitudes comparable

    # Hard targets
    hard_loss = F.cross_entropy(student_logits, labels)

    # Combined loss
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

### Quantization

Reduce precision for faster inference.

**Post-Training Quantization:**
```python
import torch
import torch.quantization

# Prepare model (must be in eval mode for post-training quantization)
model.eval()
model.qconfig = torch.quantization.get_default_qconfig('fbgemm')
torch.quantization.prepare(model, inplace=True)

# Calibrate with representative data
with torch.no_grad():
    for images in calibration_loader:
        model(images)

# Convert to quantized model
torch.quantization.convert(model, inplace=True)
```

**Quantization-Aware Training:**
```python
# Insert fake quantization during training
model.train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
model_prepared = torch.quantization.prepare_qat(model)

# Train with fake quantization
for epoch in range(num_epochs):
    train(model_prepared)

# Convert to quantized
model_quantized = torch.quantization.convert(model_prepared)
```

---

## Hyperparameter Tuning

### Key Hyperparameters

| Parameter | Range | Default | Impact |
|-----------|-------|---------|--------|
| Learning rate | 1e-4 to 1e-1 | 0.01 | Critical |
| Batch size | 4 to 64 | 16 | Memory/speed |
| Weight decay | 1e-5 to 1e-3 | 5e-4 | Regularization |
| Momentum | 0.9 to 0.99 | 0.937 | Optimization |
| Warmup epochs | 1 to 10 | 3 | Stability |
| IoU threshold (NMS) | 0.4 to 0.7 | 0.5 | Recall/precision |
| Confidence threshold | 0.1 to 0.5 | 0.25 | Detection count |
| Image size | 320 to 1280 | 640 | Accuracy/speed |

### Tuning Strategy

1. **Baseline**: Use default hyperparameters
2. **Learning rate**: Grid search [1e-3, 5e-3, 1e-2, 5e-2]
3. **Batch size**: Maximum that fits in memory
4. **Augmentation**: Start minimal, add progressively
5. **Epochs**: Train until validation loss plateaus
6. **NMS threshold**: Tune on validation set

### Automated Hyperparameter Optimization

```python
import optuna

def objective(trial):
    # suggest_float(..., log=True) replaces the deprecated suggest_loguniform
    lr = trial.suggest_float('lr', 1e-4, 1e-1, log=True)
    weight_decay = trial.suggest_float('weight_decay', 1e-5, 1e-3, log=True)
    mosaic_prob = trial.suggest_float('mosaic_prob', 0.0, 1.0)

    model = create_model()
    train_model(model, lr=lr, weight_decay=weight_decay, mosaic_prob=mosaic_prob)
    mAP = test_model(model)

    return mAP

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)

print(f"Best params: {study.best_params}")
print(f"Best mAP: {study.best_value}")
```

---

## Detection-Specific Tips

### Small Object Detection

1. **Higher resolution**: 1280px input instead of 640px
2. **SAHI (slicing)**: Run inference on overlapping tiles
3. **More FPN levels**: Add the P2 level (1/4 scale)
4. **Anchor adjustment**: Smaller anchors for small objects
5. **Copy-paste augmentation**: Increase small object frequency

### Handling Class Imbalance

1. **Focal loss**: gamma=2.0, alpha=0.25
2. **Over-sampling**: Repeat rare class images
3. **Class weights**: Inverse frequency weighting
4. **Copy-paste**: Augment rare classes

### Improving Localization

1. **CIoU loss**: Adds an aspect ratio term
2. **Cascade detection**: Progressive refinement
3. **Higher IoU threshold**: 0.6-0.7 for positive samples
4. **Deformable convolutions**: Learn spatial offsets

### Reducing False Positives

1. **Higher confidence threshold**: 0.4-0.5
2. **More negative samples**: Hard negative mining
3. **Background class weight**: Increase the penalty
4. **Ensemble**: Multiple model voting

---

## Resources

- [MMDetection training configs](https://github.com/open-mmlab/mmdetection/tree/main/configs)
- [Ultralytics training tips](https://docs.ultralytics.com/guides/hyperparameter-tuning/)
- [Albumentations detection](https://albumentations.ai/docs/getting_started/bounding_boxes_augmentation/)
- [Focal Loss paper](https://arxiv.org/abs/1708.02002)
- [CIoU paper](https://arxiv.org/abs/2005.03572)
diff --git a/engineering-team/senior-computer-vision/references/production_vision_systems.md b/engineering-team/senior-computer-vision/references/production_vision_systems.md
index e1c2e4b..7242ebf 100644
--- a/engineering-team/senior-computer-vision/references/production_vision_systems.md
+++ b/engineering-team/senior-computer-vision/references/production_vision_systems.md
@@ -1,80 +1,1226 @@
 # Production Vision Systems
 
-## Overview
+Comprehensive guide to deploying computer vision models in production environments.
 
-World-class production vision systems for senior computer vision engineer. 
+## Table of Contents

-## Core Principles

+- [Model Export and Optimization](#model-export-and-optimization)
+- [TensorRT Deployment](#tensorrt-deployment)
+- [ONNX Runtime Deployment](#onnx-runtime-deployment)
+- [Edge Device Deployment](#edge-device-deployment)
+- [Model Serving](#model-serving)
+- [Video Processing Pipelines](#video-processing-pipelines)
+- [Monitoring and Observability](#monitoring-and-observability)
+- [Scaling and Performance](#scaling-and-performance)

-### Production-First Design

+---

-Always design with production in mind:
-- Scalability: Handle 10x current load
-- Reliability: 99.9% uptime target
-- Maintainability: Clear, documented code
-- Observability: Monitor everything

+## Model Export and Optimization

-### Performance by Design

+### PyTorch to ONNX Export

-Optimize from the start:
-- Efficient algorithms
-- Resource awareness
-- Strategic caching
-- Batch processing

+Basic export:
+```python
+import torch
+import torch.onnx

-### Security & Privacy

+def export_to_onnx(model, input_shape, output_path, dynamic_batch=True):
+    """
+    Export PyTorch model to ONNX format.

-Build security in:
-- Input validation
-- Data encryption
-- Access control
-- Audit logging

+    Args:
+        model: PyTorch model
+        input_shape: (C, H, W) input dimensions
+        output_path: Path to save .onnx file
+        dynamic_batch: Allow variable batch sizes
+    """
+    model.eval()

-## Advanced Patterns

+    # Create dummy input
+    dummy_input = torch.randn(1, *input_shape)

-### Pattern 1: Distributed Processing

+    # Dynamic axes for variable batch size
+    dynamic_axes = None
+    if dynamic_batch:
+        dynamic_axes = {
+            'input': {0: 'batch_size'},
+            'output': {0: 'batch_size'}
+        }

-Enterprise-scale data processing with fault tolerance.

+    # Export
+    torch.onnx.export(
+        model,
+        dummy_input,
+        output_path,
+        export_params=True,
+        opset_version=17,
+        do_constant_folding=True,
+        input_names=['input'],
+        output_names=['output'],
+        dynamic_axes=dynamic_axes
+    )

-### Pattern 2: Real-Time Systems

+    print(f"Exported to {output_path}")
+    return output_path
+```

-Low-latency, high-throughput systems.

+### ONNX Model Optimization

-### Pattern 3: ML at Scale

+Simplify and optimize ONNX graph:
+```python
+import onnx
+from onnxsim import simplify

-Production ML with monitoring and automation.

+def optimize_onnx(input_path, output_path):
+    """
+    Simplify ONNX model for faster inference. 
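+
+    A usage sketch (hypothetical filenames):
+        optimize_onnx('yolov8m.onnx', 'yolov8m_sim.onnx')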
+ """ + # Load model + model = onnx.load(input_path) -## Best Practices + # Check validity + onnx.checker.check_model(model) -### Code Quality -- Comprehensive testing -- Clear documentation -- Code reviews -- Type hints + # Simplify + model_simplified, check = simplify(model) -### Performance -- Profile before optimizing -- Monitor continuously -- Cache strategically -- Batch operations + if check: + onnx.save(model_simplified, output_path) + print(f"Simplified model saved to {output_path}") -### Reliability -- Design for failure -- Implement retries -- Use circuit breakers -- Monitor health + # Print size reduction + import os + original_size = os.path.getsize(input_path) / 1024 / 1024 + simplified_size = os.path.getsize(output_path) / 1024 / 1024 + print(f"Size: {original_size:.2f}MB -> {simplified_size:.2f}MB") + else: + print("Simplification failed, saving original") + onnx.save(model, output_path) -## Tools & Technologies + return output_path +``` -Essential tools for this domain: -- Development frameworks -- Testing libraries -- Deployment platforms -- Monitoring solutions +### Model Size Analysis -## Further Reading +```python +def analyze_model(model_path): + """ + Analyze ONNX model structure and size. + """ + model = onnx.load(model_path) -- Research papers -- Industry blogs -- Conference talks -- Open source projects + # Count parameters + total_params = 0 + param_sizes = {} + + for initializer in model.graph.initializer: + param_count = 1 + for dim in initializer.dims: + param_count *= dim + total_params += param_count + param_sizes[initializer.name] = param_count + + # Print summary + print(f"Total parameters: {total_params:,}") + print(f"Model size: {total_params * 4 / 1024 / 1024:.2f} MB (FP32)") + print(f"Model size: {total_params * 2 / 1024 / 1024:.2f} MB (FP16)") + print(f"Model size: {total_params / 1024 / 1024:.2f} MB (INT8)") + + # Top 10 largest layers + print("\nLargest layers:") + sorted_params = sorted(param_sizes.items(), key=lambda x: x[1], reverse=True) + for name, size in sorted_params[:10]: + print(f" {name}: {size:,} params") + + return total_params +``` + +--- + +## TensorRT Deployment + +### TensorRT Engine Build + +```python +import tensorrt as trt + +def build_tensorrt_engine(onnx_path, engine_path, precision='fp16', + max_batch_size=8, workspace_gb=4): + """ + Build TensorRT engine from ONNX model. 
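
    Note: this sketch targets the TensorRT 8.x Python API
    (set_memory_pool_limit, build_serialized_network); newer major
    releases rework parts of this interface.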
+ + Args: + onnx_path: Path to ONNX model + engine_path: Path to save TensorRT engine + precision: 'fp32', 'fp16', or 'int8' + max_batch_size: Maximum batch size + workspace_gb: GPU memory workspace in GB + """ + logger = trt.Logger(trt.Logger.WARNING) + builder = trt.Builder(logger) + network = builder.create_network( + 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) + ) + parser = trt.OnnxParser(network, logger) + + # Parse ONNX + with open(onnx_path, 'rb') as f: + if not parser.parse(f.read()): + for error in range(parser.num_errors): + print(parser.get_error(error)) + raise RuntimeError("ONNX parsing failed") + + # Configure builder + config = builder.create_builder_config() + config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, + workspace_gb * 1024 * 1024 * 1024) + + # Set precision + if precision == 'fp16': + config.set_flag(trt.BuilderFlag.FP16) + elif precision == 'int8': + config.set_flag(trt.BuilderFlag.INT8) + # Requires calibrator for INT8 + + # Set optimization profile for dynamic shapes + profile = builder.create_optimization_profile() + input_name = network.get_input(0).name + input_shape = network.get_input(0).shape + + # Min, optimal, max batch sizes + min_shape = (1,) + tuple(input_shape[1:]) + opt_shape = (max_batch_size // 2,) + tuple(input_shape[1:]) + max_shape = (max_batch_size,) + tuple(input_shape[1:]) + + profile.set_shape(input_name, min_shape, opt_shape, max_shape) + config.add_optimization_profile(profile) + + # Build engine + serialized_engine = builder.build_serialized_network(network, config) + + # Save engine + with open(engine_path, 'wb') as f: + f.write(serialized_engine) + + print(f"TensorRT engine saved to {engine_path}") + return engine_path +``` + +### TensorRT Inference + +```python +import numpy as np +import pycuda.driver as cuda +import pycuda.autoinit + +class TensorRTInference: + def __init__(self, engine_path): + """ + Load TensorRT engine and prepare for inference. + """ + self.logger = trt.Logger(trt.Logger.WARNING) + + # Load engine + with open(engine_path, 'rb') as f: + engine_data = f.read() + + runtime = trt.Runtime(self.logger) + self.engine = runtime.deserialize_cuda_engine(engine_data) + self.context = self.engine.create_execution_context() + + # Allocate buffers + self.inputs = [] + self.outputs = [] + self.bindings = [] + self.stream = cuda.Stream() + + for i in range(self.engine.num_io_tensors): + name = self.engine.get_tensor_name(i) + dtype = trt.nptype(self.engine.get_tensor_dtype(name)) + shape = self.engine.get_tensor_shape(name) + size = trt.volume(shape) + + # Allocate host and device buffers + host_mem = cuda.pagelocked_empty(size, dtype) + device_mem = cuda.mem_alloc(host_mem.nbytes) + + self.bindings.append(int(device_mem)) + + if self.engine.get_tensor_mode(name) == trt.TensorIOMode.INPUT: + self.inputs.append({'host': host_mem, 'device': device_mem, + 'shape': shape, 'name': name}) + else: + self.outputs.append({'host': host_mem, 'device': device_mem, + 'shape': shape, 'name': name}) + + def infer(self, input_data): + """ + Run inference on input data. 
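        (Buffers were preallocated from the engine's I/O shapes, so
        input_data must match the engine's expected batch and resolution.)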
+ + Args: + input_data: numpy array (batch, C, H, W) + + Returns: + Output numpy array + """ + # Copy input to host buffer + np.copyto(self.inputs[0]['host'], input_data.ravel()) + + # Transfer input to device + cuda.memcpy_htod_async( + self.inputs[0]['device'], + self.inputs[0]['host'], + self.stream + ) + + # Run inference + self.context.execute_async_v2( + bindings=self.bindings, + stream_handle=self.stream.handle + ) + + # Transfer output from device + cuda.memcpy_dtoh_async( + self.outputs[0]['host'], + self.outputs[0]['device'], + self.stream + ) + + # Synchronize + self.stream.synchronize() + + # Reshape output + output = self.outputs[0]['host'].reshape(self.outputs[0]['shape']) + return output +``` + +### INT8 Calibration + +```python +class Int8Calibrator(trt.IInt8EntropyCalibrator2): + def __init__(self, calibration_data, cache_file, batch_size=8): + """ + INT8 calibrator for TensorRT. + + Args: + calibration_data: List of numpy arrays + cache_file: Path to save calibration cache + batch_size: Calibration batch size + """ + super().__init__() + self.calibration_data = calibration_data + self.cache_file = cache_file + self.batch_size = batch_size + self.current_index = 0 + + # Allocate device buffer + self.device_input = cuda.mem_alloc( + calibration_data[0].nbytes * batch_size + ) + + def get_batch_size(self): + return self.batch_size + + def get_batch(self, names): + if self.current_index + self.batch_size > len(self.calibration_data): + return None + + # Get batch + batch = self.calibration_data[ + self.current_index:self.current_index + self.batch_size + ] + batch = np.stack(batch, axis=0) + + # Copy to device + cuda.memcpy_htod(self.device_input, batch) + self.current_index += self.batch_size + + return [int(self.device_input)] + + def read_calibration_cache(self): + if os.path.exists(self.cache_file): + with open(self.cache_file, 'rb') as f: + return f.read() + return None + + def write_calibration_cache(self, cache): + with open(self.cache_file, 'wb') as f: + f.write(cache) +``` + +--- + +## ONNX Runtime Deployment + +### Basic ONNX Runtime Inference + +```python +import onnxruntime as ort + +class ONNXInference: + def __init__(self, model_path, device='cuda'): + """ + Initialize ONNX Runtime session. + + Args: + model_path: Path to ONNX model + device: 'cuda' or 'cpu' + """ + # Set execution providers + if device == 'cuda': + providers = [ + ('CUDAExecutionProvider', { + 'device_id': 0, + 'arena_extend_strategy': 'kNextPowerOfTwo', + 'gpu_mem_limit': 4 * 1024 * 1024 * 1024, # 4GB + 'cudnn_conv_algo_search': 'EXHAUSTIVE', + }), + 'CPUExecutionProvider' + ] + else: + providers = ['CPUExecutionProvider'] + + # Session options + sess_options = ort.SessionOptions() + sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL + sess_options.intra_op_num_threads = 4 + + # Create session + self.session = ort.InferenceSession( + model_path, + sess_options=sess_options, + providers=providers + ) + + # Get input/output info + self.input_name = self.session.get_inputs()[0].name + self.input_shape = self.session.get_inputs()[0].shape + self.output_name = self.session.get_outputs()[0].name + + print(f"Loaded model: {model_path}") + print(f"Input: {self.input_name} {self.input_shape}") + print(f"Provider: {self.session.get_providers()[0]}") + + def infer(self, input_data): + """ + Run inference. 
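        (Returns the raw network output; confidence filtering and NMS are
        applied downstream.)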
+ + Args: + input_data: numpy array (batch, C, H, W) + + Returns: + Model output + """ + outputs = self.session.run( + [self.output_name], + {self.input_name: input_data.astype(np.float32)} + ) + return outputs[0] + + def benchmark(self, input_shape, num_iterations=100, warmup=10): + """ + Benchmark inference speed. + """ + import time + + dummy_input = np.random.randn(*input_shape).astype(np.float32) + + # Warmup + for _ in range(warmup): + self.infer(dummy_input) + + # Benchmark + start = time.perf_counter() + for _ in range(num_iterations): + self.infer(dummy_input) + end = time.perf_counter() + + avg_time = (end - start) / num_iterations * 1000 + fps = 1000 / avg_time * input_shape[0] + + print(f"Average latency: {avg_time:.2f}ms") + print(f"Throughput: {fps:.1f} images/sec") + + return avg_time, fps +``` + +--- + +## Edge Device Deployment + +### NVIDIA Jetson Optimization + +```python +def optimize_for_jetson(model_path, output_path, jetson_model='orin'): + """ + Optimize model for NVIDIA Jetson deployment. + + Args: + model_path: Path to ONNX model + output_path: Path to save optimized engine + jetson_model: 'nano', 'xavier', 'orin' + """ + # Jetson-specific configurations + configs = { + 'nano': {'precision': 'fp16', 'workspace': 1, 'dla': False}, + 'xavier': {'precision': 'fp16', 'workspace': 2, 'dla': True}, + 'orin': {'precision': 'int8', 'workspace': 4, 'dla': True}, + } + + config = configs[jetson_model] + + # Build engine with Jetson-optimized settings + logger = trt.Logger(trt.Logger.WARNING) + builder = trt.Builder(logger) + network = builder.create_network( + 1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH) + ) + parser = trt.OnnxParser(network, logger) + + with open(model_path, 'rb') as f: + parser.parse(f.read()) + + builder_config = builder.create_builder_config() + builder_config.set_memory_pool_limit( + trt.MemoryPoolType.WORKSPACE, + config['workspace'] * 1024 * 1024 * 1024 + ) + + if config['precision'] == 'fp16': + builder_config.set_flag(trt.BuilderFlag.FP16) + elif config['precision'] == 'int8': + builder_config.set_flag(trt.BuilderFlag.INT8) + + # Enable DLA if supported + if config['dla'] and builder.num_DLA_cores > 0: + builder_config.default_device_type = trt.DeviceType.DLA + builder_config.DLA_core = 0 + builder_config.set_flag(trt.BuilderFlag.GPU_FALLBACK) + + # Build and save + serialized = builder.build_serialized_network(network, builder_config) + with open(output_path, 'wb') as f: + f.write(serialized) + + print(f"Jetson-optimized engine saved to {output_path}") +``` + +### OpenVINO for Intel Devices + +```python +from openvino.runtime import Core + +class OpenVINOInference: + def __init__(self, model_path, device='CPU'): + """ + Initialize OpenVINO inference. + + Args: + model_path: Path to ONNX or OpenVINO IR model + device: 'CPU', 'GPU', 'MYRIAD' (Intel NCS) + """ + self.core = Core() + + # Load and compile model + self.model = self.core.read_model(model_path) + self.compiled = self.core.compile_model(self.model, device) + + # Get input/output info + self.input_layer = self.compiled.input(0) + self.output_layer = self.compiled.output(0) + + print(f"Loaded model on {device}") + print(f"Input shape: {self.input_layer.shape}") + + def infer(self, input_data): + """ + Run inference. + """ + result = self.compiled([input_data]) + return result[self.output_layer] + + def benchmark(self, input_shape, num_iterations=100): + """ + Benchmark inference speed. 
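        (Synchronous timing loop; OpenVINO's AsyncInferQueue can raise
        throughput beyond what this measures.)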
+
        """
        import time

        dummy = np.random.randn(*input_shape).astype(np.float32)

        # Warmup
        for _ in range(10):
            self.infer(dummy)

        # Benchmark
        start = time.perf_counter()
        for _ in range(num_iterations):
            self.infer(dummy)
        elapsed = time.perf_counter() - start

        latency = elapsed / num_iterations * 1000
        print(f"Latency: {latency:.2f}ms")
        return latency


def convert_to_openvino(onnx_path, output_dir, precision='FP16'):
    """
    Convert ONNX to OpenVINO IR format.
    """
    from openvino.tools import mo
    from openvino.runtime import serialize

    # mo.convert_model returns an ov.Model; serialize writes the IR (.xml/.bin)
    ov_model = mo.convert_model(onnx_path, compress_to_fp16=(precision == 'FP16'))
    serialize(ov_model, f"{output_dir}/model.xml")
    print(f"Converted to OpenVINO IR at {output_dir}")
```

### CoreML for Apple Silicon

```python
import torch
import coremltools as ct

def convert_to_coreml(model, output_path, compute_units='ALL'):
    """
    Convert a PyTorch model to CoreML for Apple devices.

    Args:
        model: PyTorch model (coremltools >= 5 converts TorchScript or
            TensorFlow models; ONNX files need the legacy onnx-coreml tool)
        output_path: Path to save .mlpackage
        compute_units: 'ALL', 'CPU_AND_GPU', 'CPU_AND_NE'
    """
    # Map compute units
    units_map = {
        'ALL': ct.ComputeUnit.ALL,
        'CPU_AND_GPU': ct.ComputeUnit.CPU_AND_GPU,
        'CPU_AND_NE': ct.ComputeUnit.CPU_AND_NE,  # Neural Engine
    }

    # Convert from PyTorch via TorchScript tracing
    traced = torch.jit.trace(model, torch.randn(1, 3, 640, 640))
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(shape=(1, 3, 640, 640))],
        compute_units=units_map[compute_units],
        minimum_deployment_target=ct.target.macOS13,  # or iOS16
    )

    mlmodel.save(output_path)
    print(f"CoreML model saved to {output_path}")
```

---

## Model Serving

### Triton Inference Server

Configuration file (`config.pbtxt`):
```protobuf
name: "yolov8"
platform: "onnxruntime_onnx"
max_batch_size: 8

input [
  {
    name: "images"
    data_type: TYPE_FP32
    dims: [ 3, 640, 640 ]
  }
]

output [
  {
    name: "output0"
    data_type: TYPE_FP32
    dims: [ 84, 8400 ]
  }
]

instance_group [
  {
    count: 2
    kind: KIND_GPU
  }
]

dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100
}
```

Triton client:
```python
import tritonclient.http as httpclient

class TritonClient:
    def __init__(self, url='localhost:8000', model_name='yolov8'):
        self.client = httpclient.InferenceServerClient(url=url)
        self.model_name = model_name

        # Check model is ready
        if not self.client.is_model_ready(model_name):
            raise RuntimeError(f"Model {model_name} is not ready")

    def infer(self, images):
        """
        Send inference request to Triton. 
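        (HTTP client shown; tritonclient.grpc exposes a near-identical
        interface with lower per-request overhead.)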
+ + Args: + images: numpy array (batch, C, H, W) + """ + # Create input + inputs = [ + httpclient.InferInput("images", images.shape, "FP32") + ] + inputs[0].set_data_from_numpy(images) + + # Create output request + outputs = [ + httpclient.InferRequestedOutput("output0") + ] + + # Send request + response = self.client.infer( + model_name=self.model_name, + inputs=inputs, + outputs=outputs + ) + + return response.as_numpy("output0") +``` + +### TorchServe Deployment + +Model handler (`handler.py`): +```python +from ts.torch_handler.base_handler import BaseHandler +import torch +import cv2 +import numpy as np + +class YOLOHandler(BaseHandler): + def __init__(self): + super().__init__() + self.input_size = 640 + self.conf_threshold = 0.25 + self.iou_threshold = 0.45 + + def preprocess(self, data): + """Preprocess input images.""" + images = [] + for row in data: + image = row.get("data") or row.get("body") + + if isinstance(image, (bytes, bytearray)): + image = np.frombuffer(image, dtype=np.uint8) + image = cv2.imdecode(image, cv2.IMREAD_COLOR) + + # Resize and normalize + image = cv2.resize(image, (self.input_size, self.input_size)) + image = image.astype(np.float32) / 255.0 + image = np.transpose(image, (2, 0, 1)) + images.append(image) + + return torch.tensor(np.stack(images)) + + def inference(self, data): + """Run model inference.""" + with torch.no_grad(): + outputs = self.model(data) + return outputs + + def postprocess(self, outputs): + """Postprocess model outputs.""" + results = [] + for output in outputs: + # Apply NMS and format results + detections = self._nms(output, self.conf_threshold, self.iou_threshold) + results.append(detections.tolist()) + return results +``` + +TorchServe configuration (`config.properties`): +```properties +inference_address=http://0.0.0.0:8080 +management_address=http://0.0.0.0:8081 +metrics_address=http://0.0.0.0:8082 +number_of_netty_threads=4 +job_queue_size=100 +model_store=/opt/ml/model +load_models=yolov8.mar +``` + +### FastAPI Serving + +```python +from fastapi import FastAPI, File, UploadFile +from fastapi.responses import JSONResponse +import uvicorn +import numpy as np +import cv2 + +app = FastAPI(title="YOLO Detection API") + +# Global model +model = None + +@app.on_event("startup") +async def load_model(): + global model + model = ONNXInference("models/yolov8m.onnx", device='cuda') + +@app.post("/detect") +async def detect(file: UploadFile = File(...), conf: float = 0.25): + """ + Detect objects in uploaded image. 
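    conf is the confidence threshold forwarded to postprocessing (default 0.25).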
+ """ + # Read image + contents = await file.read() + nparr = np.frombuffer(contents, np.uint8) + image = cv2.imdecode(nparr, cv2.IMREAD_COLOR) + + # Preprocess + input_image = preprocess_image(image, 640) + + # Inference + outputs = model.infer(input_image) + + # Postprocess + detections = postprocess_detections(outputs, conf, 0.45) + + return JSONResponse({ + "detections": detections, + "image_size": list(image.shape[:2]) + }) + +@app.get("/health") +async def health(): + return {"status": "healthy", "model_loaded": model is not None} + +if __name__ == "__main__": + uvicorn.run(app, host="0.0.0.0", port=8000) +``` + +--- + +## Video Processing Pipelines + +### Real-Time Video Detection + +```python +import cv2 +import time +from collections import deque + +class VideoDetector: + def __init__(self, model, conf_threshold=0.25, track=True): + self.model = model + self.conf_threshold = conf_threshold + self.track = track + self.tracker = ByteTrack() if track else None + self.fps_buffer = deque(maxlen=30) + + def process_video(self, source, output_path=None, show=True): + """ + Process video stream with detection. + + Args: + source: Video file path, camera index, or RTSP URL + output_path: Path to save output video + show: Display results in window + """ + cap = cv2.VideoCapture(source) + + if output_path: + fourcc = cv2.VideoWriter_fourcc(*'mp4v') + fps = cap.get(cv2.CAP_PROP_FPS) + width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)) + height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)) + writer = cv2.VideoWriter(output_path, fourcc, fps, (width, height)) + + frame_count = 0 + start_time = time.time() + + while cap.isOpened(): + ret, frame = cap.read() + if not ret: + break + + # Inference + t0 = time.perf_counter() + detections = self._detect(frame) + + # Tracking + if self.track and len(detections) > 0: + detections = self.tracker.update(detections) + + # Calculate FPS + inference_time = time.perf_counter() - t0 + self.fps_buffer.append(1 / inference_time) + avg_fps = sum(self.fps_buffer) / len(self.fps_buffer) + + # Draw results + frame = self._draw_detections(frame, detections, avg_fps) + + # Output + if output_path: + writer.write(frame) + + if show: + cv2.imshow('Detection', frame) + if cv2.waitKey(1) == ord('q'): + break + + frame_count += 1 + + # Cleanup + cap.release() + if output_path: + writer.release() + cv2.destroyAllWindows() + + # Print statistics + total_time = time.time() - start_time + print(f"Processed {frame_count} frames in {total_time:.1f}s") + print(f"Average FPS: {frame_count / total_time:.1f}") + + def _detect(self, frame): + """Run detection on single frame.""" + # Preprocess + input_tensor = self._preprocess(frame) + + # Inference + outputs = self.model.infer(input_tensor) + + # Postprocess + detections = self._postprocess(outputs, frame.shape[:2]) + return detections + + def _preprocess(self, frame): + """Preprocess frame for model input.""" + # Resize + input_size = 640 + image = cv2.resize(frame, (input_size, input_size)) + + # Normalize and transpose + image = image.astype(np.float32) / 255.0 + image = np.transpose(image, (2, 0, 1)) + image = np.expand_dims(image, axis=0) + + return image + + def _draw_detections(self, frame, detections, fps): + """Draw detections on frame.""" + for det in detections: + x1, y1, x2, y2 = det['bbox'] + cls = det['class'] + conf = det['confidence'] + track_id = det.get('track_id', None) + + # Draw box + color = self._get_color(cls) + cv2.rectangle(frame, (int(x1), int(y1)), (int(x2), int(y2)), color, 2) + + # Draw label + label = 
f"{cls}: {conf:.2f}" + if track_id: + label = f"ID:{track_id} {label}" + + cv2.putText(frame, label, (int(x1), int(y1) - 10), + cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2) + + # Draw FPS + cv2.putText(frame, f"FPS: {fps:.1f}", (10, 30), + cv2.FONT_HERSHEY_SIMPLEX, 1, (0, 255, 0), 2) + + return frame +``` + +### Batch Video Processing + +```python +import concurrent.futures +from pathlib import Path + +def process_videos_batch(video_paths, model, output_dir, max_workers=4): + """ + Process multiple videos in parallel. + """ + output_dir = Path(output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + + def process_single(video_path): + detector = VideoDetector(model) + output_path = output_dir / f"{Path(video_path).stem}_detected.mp4" + detector.process_video(video_path, str(output_path), show=False) + return output_path + + with concurrent.futures.ThreadPoolExecutor(max_workers=max_workers) as executor: + futures = {executor.submit(process_single, vp): vp for vp in video_paths} + + for future in concurrent.futures.as_completed(futures): + video_path = futures[future] + try: + output_path = future.result() + print(f"Completed: {video_path} -> {output_path}") + except Exception as e: + print(f"Failed: {video_path} - {e}") +``` + +--- + +## Monitoring and Observability + +### Prometheus Metrics + +```python +from prometheus_client import Counter, Histogram, Gauge, start_http_server + +# Define metrics +INFERENCE_COUNT = Counter( + 'model_inference_total', + 'Total number of inferences', + ['model_name', 'status'] +) + +INFERENCE_LATENCY = Histogram( + 'model_inference_latency_seconds', + 'Inference latency in seconds', + ['model_name'], + buckets=[0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0] +) + +GPU_MEMORY = Gauge( + 'gpu_memory_used_bytes', + 'GPU memory usage in bytes', + ['device'] +) + +DETECTIONS_COUNT = Counter( + 'detections_total', + 'Total detections by class', + ['model_name', 'class_name'] +) + +class MetricsWrapper: + def __init__(self, model, model_name='yolov8'): + self.model = model + self.model_name = model_name + + def infer(self, input_data): + """Inference with metrics.""" + start_time = time.perf_counter() + + try: + result = self.model.infer(input_data) + INFERENCE_COUNT.labels(self.model_name, 'success').inc() + + # Count detections by class + for det in result: + DETECTIONS_COUNT.labels(self.model_name, det['class']).inc() + + return result + + except Exception as e: + INFERENCE_COUNT.labels(self.model_name, 'error').inc() + raise + + finally: + latency = time.perf_counter() - start_time + INFERENCE_LATENCY.labels(self.model_name).observe(latency) + + # Update GPU memory + if torch.cuda.is_available(): + memory = torch.cuda.memory_allocated() + GPU_MEMORY.labels('cuda:0').set(memory) + +# Start metrics server +start_http_server(9090) +``` + +### Logging Configuration + +```python +import logging +import json +from datetime import datetime + +class StructuredLogger: + def __init__(self, name, level=logging.INFO): + self.logger = logging.getLogger(name) + self.logger.setLevel(level) + + # JSON formatter + handler = logging.StreamHandler() + handler.setFormatter(JsonFormatter()) + self.logger.addHandler(handler) + + def log_inference(self, model_name, latency, num_detections, input_shape): + self.logger.info(json.dumps({ + 'event': 'inference', + 'timestamp': datetime.utcnow().isoformat(), + 'model_name': model_name, + 'latency_ms': latency * 1000, + 'num_detections': num_detections, + 'input_shape': list(input_shape) + })) + + def log_error(self, model_name, error, 
input_shape):
        self.logger.error(json.dumps({
            'event': 'inference_error',
            'timestamp': datetime.utcnow().isoformat(),
            'model_name': model_name,
            'error': str(error),
            'error_type': type(error).__name__,
            'input_shape': list(input_shape)
        }))

class JsonFormatter(logging.Formatter):
    def format(self, record):
        # Messages are already JSON strings; emit them unchanged
        return record.getMessage()
```

---

## Scaling and Performance

### Batch Processing Optimization

```python
import asyncio
import threading

import numpy as np

class BatchProcessor:
    def __init__(self, model, max_batch_size=8, max_wait_ms=100):
        self.model = model
        self.max_batch_size = max_batch_size
        # max_wait_ms is reserved for a timer-based flush of partial batches (not shown)
        self.max_wait_ms = max_wait_ms
        self.queue = []
        self.lock = threading.Lock()
        self.results = {}

    async def process(self, image, request_id):
        """Add image to batch and wait for result."""
        # Assumes all callers share one event loop, so set_result is safe here
        future = asyncio.Future()

        with self.lock:
            self.queue.append((request_id, image, future))

            if len(self.queue) >= self.max_batch_size:
                self._process_batch()

        # Wait for result with timeout
        result = await asyncio.wait_for(future, timeout=5.0)
        return result

    def _process_batch(self):
        """Process accumulated batch."""
        batch_items = self.queue[:self.max_batch_size]
        self.queue = self.queue[self.max_batch_size:]

        # Stack images
        images = np.stack([item[1] for item in batch_items])

        # Inference
        outputs = self.model.infer(images)

        # Return results
        for i, (request_id, image, future) in enumerate(batch_items):
            future.set_result(outputs[i])
```

### Multi-GPU Inference

```python
import torch
import torch.nn as nn
from torch.nn.parallel import DataParallel

class MultiGPUInference:
    def __init__(self, model, device_ids=None):
        """
        Wrap model for multi-GPU inference.

        Args:
            model: PyTorch model
            device_ids: List of GPU IDs, e.g., [0, 1, 2, 3]
        """
        if device_ids is None:
            device_ids = list(range(torch.cuda.device_count()))

        self.device = torch.device('cuda:0')
        self.model = DataParallel(model, device_ids=device_ids)
        self.model.to(self.device)
        self.model.eval()

    def infer(self, images):
        """
        Run inference across GPUs.
        """
        with torch.no_grad():
            images = torch.from_numpy(images).to(self.device)
            outputs = self.model(images)
        return outputs.cpu().numpy()
```

### Performance Benchmarking

```python
def comprehensive_benchmark(model, input_sizes, batch_sizes, num_iterations=100):
    """
    Benchmark model across different configurations. 
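
    Usage sketch (any wrapper exposing .infer(array) works, e.g. ONNXInference):
        comprehensive_benchmark(model, input_sizes=[640, 1280],
                                batch_sizes=[1, 8])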
+ """ + results = [] + + for input_size in input_sizes: + for batch_size in batch_sizes: + # Create input + dummy = np.random.randn(batch_size, 3, input_size, input_size).astype(np.float32) + + # Warmup + for _ in range(10): + model.infer(dummy) + + # Benchmark + latencies = [] + for _ in range(num_iterations): + start = time.perf_counter() + model.infer(dummy) + latencies.append(time.perf_counter() - start) + + # Calculate statistics + latencies = np.array(latencies) * 1000 # Convert to ms + result = { + 'input_size': input_size, + 'batch_size': batch_size, + 'mean_latency_ms': np.mean(latencies), + 'std_latency_ms': np.std(latencies), + 'p50_latency_ms': np.percentile(latencies, 50), + 'p95_latency_ms': np.percentile(latencies, 95), + 'p99_latency_ms': np.percentile(latencies, 99), + 'throughput_fps': batch_size * 1000 / np.mean(latencies) + } + results.append(result) + + print(f"Size: {input_size}, Batch: {batch_size}") + print(f" Latency: {result['mean_latency_ms']:.2f}ms (p99: {result['p99_latency_ms']:.2f}ms)") + print(f" Throughput: {result['throughput_fps']:.1f} FPS") + + return results +``` + +--- + +## Resources + +- [TensorRT Documentation](https://docs.nvidia.com/deeplearning/tensorrt/) +- [ONNX Runtime Documentation](https://onnxruntime.ai/docs/) +- [Triton Inference Server](https://github.com/triton-inference-server/server) +- [OpenVINO Documentation](https://docs.openvino.ai/) +- [CoreML Tools](https://coremltools.readme.io/) diff --git a/engineering-team/senior-computer-vision/scripts/dataset_pipeline_builder.py b/engineering-team/senior-computer-vision/scripts/dataset_pipeline_builder.py index 490cfe4..8ae18a6 100755 --- a/engineering-team/senior-computer-vision/scripts/dataset_pipeline_builder.py +++ b/engineering-team/senior-computer-vision/scripts/dataset_pipeline_builder.py @@ -1,17 +1,37 @@ #!/usr/bin/env python3 """ -Dataset Pipeline Builder -Production-grade tool for senior computer vision engineer +Dataset Pipeline Builder for Computer Vision + +Production-grade tool for building and managing CV dataset pipelines. +Supports format conversion, splitting, augmentation config, and validation. 
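+Relies only on the Python standard library (no OpenCV or PIL required).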
+ +Supported formats: +- COCO (JSON annotations) +- YOLO (txt per image) +- Pascal VOC (XML annotations) +- CVAT (XML export) + +Usage: + python dataset_pipeline_builder.py analyze --input /path/to/dataset + python dataset_pipeline_builder.py convert --input /path/to/coco --output /path/to/yolo --format yolo + python dataset_pipeline_builder.py split --input /path/to/dataset --train 0.8 --val 0.1 --test 0.1 + python dataset_pipeline_builder.py augment-config --task detection --output augmentations.yaml + python dataset_pipeline_builder.py validate --input /path/to/dataset --format coco """ import os import sys import json +import random +import shutil import logging import argparse +import hashlib from pathlib import Path -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Tuple, Set, Any from datetime import datetime +from collections import defaultdict +import xml.etree.ElementTree as ET logging.basicConfig( level=logging.INFO, @@ -19,82 +39,1661 @@ logging.basicConfig( ) logger = logging.getLogger(__name__) -class DatasetPipelineBuilder: - """Production-grade dataset pipeline builder""" - - def __init__(self, config: Dict): - self.config = config - self.results = { - 'status': 'initialized', - 'start_time': datetime.now().isoformat(), - 'processed_items': 0 + +# ============================================================================ +# Dataset Format Definitions +# ============================================================================ + +SUPPORTED_IMAGE_EXTENSIONS = {'.jpg', '.jpeg', '.png', '.bmp', '.tiff', '.webp'} + +COCO_CATEGORIES_TEMPLATE = { + "info": { + "description": "Custom Dataset", + "version": "1.0", + "year": datetime.now().year, + "contributor": "Dataset Pipeline Builder", + "date_created": datetime.now().isoformat() + }, + "licenses": [{"id": 1, "name": "Unknown", "url": ""}], + "images": [], + "annotations": [], + "categories": [] +} + +YOLO_DATA_YAML_TEMPLATE = """# YOLO Dataset Configuration +# Generated by Dataset Pipeline Builder + +path: {dataset_path} +train: {train_path} +val: {val_path} +test: {test_path} + +# Classes +nc: {num_classes} +names: {class_names} + +# Optional: Download script +# download: +""" + +AUGMENTATION_PRESETS = { + 'detection': { + 'light': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.0, + 'rotate': {'limit': 10, 'p': 0.3}, + 'brightness_contrast': {'brightness_limit': 0.1, 'contrast_limit': 0.1, 'p': 0.3}, + 'blur': {'blur_limit': 3, 'p': 0.1} + }, + 'medium': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.1, + 'rotate': {'limit': 15, 'p': 0.5}, + 'scale': {'scale_limit': 0.2, 'p': 0.5}, + 'brightness_contrast': {'brightness_limit': 0.2, 'contrast_limit': 0.2, 'p': 0.5}, + 'hue_saturation': {'hue_shift_limit': 10, 'sat_shift_limit': 20, 'p': 0.3}, + 'blur': {'blur_limit': 5, 'p': 0.2}, + 'noise': {'var_limit': (10, 50), 'p': 0.2} + }, + 'heavy': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.2, + 'rotate': {'limit': 30, 'p': 0.7}, + 'scale': {'scale_limit': 0.3, 'p': 0.6}, + 'brightness_contrast': {'brightness_limit': 0.3, 'contrast_limit': 0.3, 'p': 0.6}, + 'hue_saturation': {'hue_shift_limit': 20, 'sat_shift_limit': 30, 'p': 0.5}, + 'blur': {'blur_limit': 7, 'p': 0.3}, + 'noise': {'var_limit': (10, 80), 'p': 0.3}, + 'mosaic': {'p': 0.5}, + 'mixup': {'p': 0.3}, + 'cutout': {'num_holes': 8, 'max_h_size': 32, 'max_w_size': 32, 'p': 0.3} } - logger.info(f"Initialized {self.__class__.__name__}") - - def validate_config(self) -> bool: - """Validate configuration""" - logger.info("Validating 
configuration...") - # Add validation logic - logger.info("Configuration validated") - return True - - def process(self) -> Dict: - """Main processing logic""" - logger.info("Starting processing...") - + }, + 'segmentation': { + 'light': { + 'horizontal_flip': 0.5, + 'rotate': {'limit': 10, 'p': 0.3}, + 'elastic_transform': {'alpha': 50, 'sigma': 5, 'p': 0.1} + }, + 'medium': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.2, + 'rotate': {'limit': 20, 'p': 0.5}, + 'scale': {'scale_limit': 0.2, 'p': 0.4}, + 'elastic_transform': {'alpha': 100, 'sigma': 10, 'p': 0.3}, + 'grid_distortion': {'num_steps': 5, 'distort_limit': 0.3, 'p': 0.3} + }, + 'heavy': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.3, + 'rotate': {'limit': 45, 'p': 0.7}, + 'scale': {'scale_limit': 0.4, 'p': 0.6}, + 'elastic_transform': {'alpha': 200, 'sigma': 20, 'p': 0.5}, + 'grid_distortion': {'num_steps': 7, 'distort_limit': 0.5, 'p': 0.4}, + 'optical_distortion': {'distort_limit': 0.5, 'shift_limit': 0.5, 'p': 0.3} + } + }, + 'classification': { + 'light': { + 'horizontal_flip': 0.5, + 'rotate': {'limit': 15, 'p': 0.3}, + 'brightness_contrast': {'p': 0.3} + }, + 'medium': { + 'horizontal_flip': 0.5, + 'rotate': {'limit': 30, 'p': 0.5}, + 'color_jitter': {'brightness': 0.2, 'contrast': 0.2, 'saturation': 0.2, 'hue': 0.1, 'p': 0.5}, + 'random_crop': {'height': 224, 'width': 224, 'p': 0.5}, + 'cutout': {'num_holes': 1, 'max_h_size': 40, 'max_w_size': 40, 'p': 0.3} + }, + 'heavy': { + 'horizontal_flip': 0.5, + 'vertical_flip': 0.2, + 'rotate': {'limit': 45, 'p': 0.7}, + 'color_jitter': {'brightness': 0.4, 'contrast': 0.4, 'saturation': 0.4, 'hue': 0.2, 'p': 0.7}, + 'random_resized_crop': {'height': 224, 'width': 224, 'scale': (0.5, 1.0), 'p': 0.6}, + 'cutout': {'num_holes': 4, 'max_h_size': 60, 'max_w_size': 60, 'p': 0.5}, + 'auto_augment': {'policy': 'imagenet', 'p': 0.5}, + 'rand_augment': {'num_ops': 2, 'magnitude': 9, 'p': 0.5} + } + } +} + + +# ============================================================================ +# Dataset Analysis +# ============================================================================ + +class DatasetAnalyzer: + """Analyze dataset structure and statistics.""" + + def __init__(self, dataset_path: str): + self.dataset_path = Path(dataset_path) + self.stats = {} + + def analyze(self) -> Dict[str, Any]: + """Run full dataset analysis.""" + logger.info(f"Analyzing dataset at: {self.dataset_path}") + + # Detect format + detected_format = self._detect_format() + self.stats['format'] = detected_format + + # Count images + images = self._find_images() + self.stats['total_images'] = len(images) + + # Analyze images + self.stats['image_stats'] = self._analyze_images(images) + + # Analyze annotations based on format + if detected_format == 'coco': + self.stats['annotations'] = self._analyze_coco() + elif detected_format == 'yolo': + self.stats['annotations'] = self._analyze_yolo() + elif detected_format == 'voc': + self.stats['annotations'] = self._analyze_voc() + else: + self.stats['annotations'] = {'error': 'Unknown format'} + + # Dataset quality checks + self.stats['quality'] = self._quality_checks() + + return self.stats + + def _detect_format(self) -> str: + """Auto-detect dataset format.""" + # Check for COCO JSON + for json_file in self.dataset_path.rglob('*.json'): + try: + with open(json_file) as f: + data = json.load(f) + if 'annotations' in data and 'images' in data: + return 'coco' + except: + pass + + # Check for YOLO txt files + txt_files = list(self.dataset_path.rglob('*.txt')) + 
if txt_files: + # Check if txt contains YOLO format (class x_center y_center width height) + for txt_file in txt_files[:5]: + if txt_file.name == 'classes.txt': + continue + try: + with open(txt_file) as f: + line = f.readline().strip() + if line: + parts = line.split() + if len(parts) == 5 and all(self._is_float(p) for p in parts): + return 'yolo' + except: + pass + + # Check for VOC XML + xml_files = list(self.dataset_path.rglob('*.xml')) + for xml_file in xml_files[:5]: + try: + tree = ET.parse(xml_file) + root = tree.getroot() + if root.tag == 'annotation' and root.find('object') is not None: + return 'voc' + except: + pass + + return 'unknown' + + def _is_float(self, s: str) -> bool: + """Check if string is a float.""" try: - self.validate_config() - - # Main processing - result = self._execute() - - self.results['status'] = 'completed' - self.results['end_time'] = datetime.now().isoformat() - - logger.info("Processing completed successfully") - return self.results - - except Exception as e: - self.results['status'] = 'failed' - self.results['error'] = str(e) - logger.error(f"Processing failed: {e}") - raise - - def _execute(self) -> Dict: - """Execute main logic""" - # Implementation here - return {'success': True} + float(s) + return True + except ValueError: + return False + + def _find_images(self) -> List[Path]: + """Find all images in dataset.""" + images = [] + for ext in SUPPORTED_IMAGE_EXTENSIONS: + images.extend(self.dataset_path.rglob(f'*{ext}')) + images.extend(self.dataset_path.rglob(f'*{ext.upper()}')) + return images + + def _analyze_images(self, images: List[Path]) -> Dict: + """Analyze image files without loading them.""" + stats = { + 'count': len(images), + 'extensions': defaultdict(int), + 'sizes': [], + 'locations': defaultdict(int) + } + + for img in images: + stats['extensions'][img.suffix.lower()] += 1 + stats['sizes'].append(img.stat().st_size) + # Track which subdirectory + rel_path = img.relative_to(self.dataset_path) + if len(rel_path.parts) > 1: + stats['locations'][rel_path.parts[0]] += 1 + else: + stats['locations']['root'] += 1 + + if stats['sizes']: + stats['total_size_mb'] = sum(stats['sizes']) / (1024 * 1024) + stats['avg_size_kb'] = (sum(stats['sizes']) / len(stats['sizes'])) / 1024 + stats['min_size_kb'] = min(stats['sizes']) / 1024 + stats['max_size_kb'] = max(stats['sizes']) / 1024 + + stats['extensions'] = dict(stats['extensions']) + stats['locations'] = dict(stats['locations']) + del stats['sizes'] # Don't include raw sizes + + return stats + + def _analyze_coco(self) -> Dict: + """Analyze COCO format annotations.""" + stats = { + 'total_annotations': 0, + 'classes': {}, + 'images_with_annotations': 0, + 'annotations_per_image': {}, + 'bbox_stats': {} + } + + # Find COCO JSON files + for json_file in self.dataset_path.rglob('*.json'): + try: + with open(json_file) as f: + data = json.load(f) + + if 'annotations' not in data: + continue + + # Build category mapping + cat_map = {} + if 'categories' in data: + for cat in data['categories']: + cat_map[cat['id']] = cat['name'] + + # Count annotations per class + img_annotations = defaultdict(int) + bbox_widths = [] + bbox_heights = [] + bbox_areas = [] + + for ann in data['annotations']: + stats['total_annotations'] += 1 + cat_id = ann.get('category_id') + cat_name = cat_map.get(cat_id, f'class_{cat_id}') + stats['classes'][cat_name] = stats['classes'].get(cat_name, 0) + 1 + img_annotations[ann.get('image_id')] += 1 + + # Bbox stats + if 'bbox' in ann: + bbox = ann['bbox'] # [x, y, width, height] + 
if len(bbox) == 4: + bbox_widths.append(bbox[2]) + bbox_heights.append(bbox[3]) + bbox_areas.append(bbox[2] * bbox[3]) + + stats['images_with_annotations'] = len(img_annotations) + if img_annotations: + counts = list(img_annotations.values()) + stats['annotations_per_image'] = { + 'min': min(counts), + 'max': max(counts), + 'avg': sum(counts) / len(counts) + } + + if bbox_areas: + stats['bbox_stats'] = { + 'avg_width': sum(bbox_widths) / len(bbox_widths), + 'avg_height': sum(bbox_heights) / len(bbox_heights), + 'avg_area': sum(bbox_areas) / len(bbox_areas), + 'min_area': min(bbox_areas), + 'max_area': max(bbox_areas) + } + + except Exception as e: + logger.warning(f"Error parsing {json_file}: {e}") + + return stats + + def _analyze_yolo(self) -> Dict: + """Analyze YOLO format annotations.""" + stats = { + 'total_annotations': 0, + 'classes': defaultdict(int), + 'images_with_annotations': 0, + 'bbox_stats': {} + } + + # Find classes.txt if exists + class_names = {} + classes_file = self.dataset_path / 'classes.txt' + if classes_file.exists(): + with open(classes_file) as f: + for i, line in enumerate(f): + class_names[i] = line.strip() + + bbox_widths = [] + bbox_heights = [] + + for txt_file in self.dataset_path.rglob('*.txt'): + if txt_file.name == 'classes.txt': + continue + + try: + with open(txt_file) as f: + lines = f.readlines() + + if lines: + stats['images_with_annotations'] += 1 + + for line in lines: + parts = line.strip().split() + if len(parts) >= 5: + stats['total_annotations'] += 1 + class_id = int(parts[0]) + class_name = class_names.get(class_id, f'class_{class_id}') + stats['classes'][class_name] += 1 + + # Bbox stats (normalized coords) + w = float(parts[3]) + h = float(parts[4]) + bbox_widths.append(w) + bbox_heights.append(h) + + except Exception as e: + logger.warning(f"Error parsing {txt_file}: {e}") + + stats['classes'] = dict(stats['classes']) + + if bbox_widths: + stats['bbox_stats'] = { + 'avg_width_normalized': sum(bbox_widths) / len(bbox_widths), + 'avg_height_normalized': sum(bbox_heights) / len(bbox_heights), + 'min_width_normalized': min(bbox_widths), + 'max_width_normalized': max(bbox_widths) + } + + return stats + + def _analyze_voc(self) -> Dict: + """Analyze Pascal VOC format annotations.""" + stats = { + 'total_annotations': 0, + 'classes': defaultdict(int), + 'images_with_annotations': 0, + 'difficulties': {'easy': 0, 'difficult': 0} + } + + for xml_file in self.dataset_path.rglob('*.xml'): + try: + tree = ET.parse(xml_file) + root = tree.getroot() + + if root.tag != 'annotation': + continue + + objects = root.findall('object') + if objects: + stats['images_with_annotations'] += 1 + + for obj in objects: + stats['total_annotations'] += 1 + name = obj.find('name') + if name is not None: + stats['classes'][name.text] += 1 + + difficult = obj.find('difficult') + if difficult is not None and difficult.text == '1': + stats['difficulties']['difficult'] += 1 + else: + stats['difficulties']['easy'] += 1 + + except Exception as e: + logger.warning(f"Error parsing {xml_file}: {e}") + + stats['classes'] = dict(stats['classes']) + return stats + + def _quality_checks(self) -> Dict: + """Run quality checks on dataset.""" + checks = { + 'issues': [], + 'warnings': [], + 'recommendations': [] + } + + # Check class imbalance + if 'annotations' in self.stats and 'classes' in self.stats['annotations']: + classes = self.stats['annotations']['classes'] + if classes: + counts = list(classes.values()) + max_count = max(counts) + min_count = min(counts) + + if max_count > 0 
and min_count / max_count < 0.1: + checks['warnings'].append( + f"Severe class imbalance detected: ratio {min_count/max_count:.2%}" + ) + checks['recommendations'].append( + "Consider oversampling minority classes or using focal loss" + ) + elif max_count > 0 and min_count / max_count < 0.3: + checks['warnings'].append( + f"Moderate class imbalance: ratio {min_count/max_count:.2%}" + ) + + # Check image count + if self.stats.get('total_images', 0) < 100: + checks['warnings'].append( + f"Small dataset: only {self.stats.get('total_images', 0)} images" + ) + checks['recommendations'].append( + "Consider data augmentation or transfer learning" + ) + + # Check for missing annotations + if 'annotations' in self.stats: + ann_stats = self.stats['annotations'] + total_images = self.stats.get('total_images', 0) + images_with_ann = ann_stats.get('images_with_annotations', 0) + + if total_images > 0 and images_with_ann < total_images: + missing = total_images - images_with_ann + checks['warnings'].append( + f"{missing} images have no annotations" + ) + + return checks + + +# ============================================================================ +# Format Conversion +# ============================================================================ + +class FormatConverter: + """Convert between dataset formats.""" + + def __init__(self, input_path: str, output_path: str): + self.input_path = Path(input_path) + self.output_path = Path(output_path) + + def convert(self, target_format: str, source_format: str = None) -> Dict: + """Convert dataset to target format.""" + # Auto-detect source format if not specified + if source_format is None: + analyzer = DatasetAnalyzer(str(self.input_path)) + analyzer.analyze() + source_format = analyzer.stats.get('format', 'unknown') + + logger.info(f"Converting from {source_format} to {target_format}") + + conversion_key = f"{source_format}_to_{target_format}" + + converters = { + 'coco_to_yolo': self._coco_to_yolo, + 'yolo_to_coco': self._yolo_to_coco, + 'voc_to_coco': self._voc_to_coco, + 'voc_to_yolo': self._voc_to_yolo, + 'coco_to_voc': self._coco_to_voc, + } + + if conversion_key not in converters: + return {'error': f"Unsupported conversion: {source_format} -> {target_format}"} + + return converters[conversion_key]() + + def _coco_to_yolo(self) -> Dict: + """Convert COCO format to YOLO format.""" + results = {'converted_images': 0, 'converted_annotations': 0} + + # Find COCO JSON + coco_files = list(self.input_path.rglob('*.json')) + + for coco_file in coco_files: + try: + with open(coco_file) as f: + coco_data = json.load(f) + + if 'annotations' not in coco_data: + continue + + # Create output directories + self.output_path.mkdir(parents=True, exist_ok=True) + labels_dir = self.output_path / 'labels' + labels_dir.mkdir(exist_ok=True) + + # Build category and image mappings + cat_map = {} + for i, cat in enumerate(coco_data.get('categories', [])): + cat_map[cat['id']] = i + + img_map = {} + for img in coco_data.get('images', []): + img_map[img['id']] = { + 'file_name': img['file_name'], + 'width': img['width'], + 'height': img['height'] + } + + # Group annotations by image + annotations_by_image = defaultdict(list) + for ann in coco_data['annotations']: + annotations_by_image[ann['image_id']].append(ann) + + # Write YOLO format labels + for img_id, annotations in annotations_by_image.items(): + if img_id not in img_map: + continue + + img_info = img_map[img_id] + label_name = Path(img_info['file_name']).stem + '.txt' + label_path = labels_dir / label_name + + 
with open(label_path, 'w') as f: + for ann in annotations: + if 'bbox' not in ann: + continue + + bbox = ann['bbox'] # [x, y, width, height] + cat_id = cat_map.get(ann['category_id'], 0) + + # Convert to YOLO format (normalized x_center, y_center, width, height) + x_center = (bbox[0] + bbox[2] / 2) / img_info['width'] + y_center = (bbox[1] + bbox[3] / 2) / img_info['height'] + w = bbox[2] / img_info['width'] + h = bbox[3] / img_info['height'] + + f.write(f"{cat_id} {x_center:.6f} {y_center:.6f} {w:.6f} {h:.6f}\n") + results['converted_annotations'] += 1 + + results['converted_images'] += 1 + + # Write classes.txt + classes = [None] * len(cat_map) + for cat in coco_data.get('categories', []): + idx = cat_map[cat['id']] + classes[idx] = cat['name'] + + with open(self.output_path / 'classes.txt', 'w') as f: + for class_name in classes: + f.write(f"{class_name}\n") + + # Write data.yaml for YOLO training + yaml_content = YOLO_DATA_YAML_TEMPLATE.format( + dataset_path=str(self.output_path.absolute()), + train_path='images/train', + val_path='images/val', + test_path='images/test', + num_classes=len(classes), + class_names=classes + ) + with open(self.output_path / 'data.yaml', 'w') as f: + f.write(yaml_content) + + except Exception as e: + logger.error(f"Error converting {coco_file}: {e}") + + return results + + def _yolo_to_coco(self) -> Dict: + """Convert YOLO format to COCO format.""" + results = {'converted_images': 0, 'converted_annotations': 0} + + coco_data = COCO_CATEGORIES_TEMPLATE.copy() + coco_data['images'] = [] + coco_data['annotations'] = [] + coco_data['categories'] = [] + + # Read classes + classes_file = self.input_path / 'classes.txt' + class_names = [] + if classes_file.exists(): + with open(classes_file) as f: + class_names = [line.strip() for line in f.readlines()] + + for i, name in enumerate(class_names): + coco_data['categories'].append({ + 'id': i, + 'name': name, + 'supercategory': 'object' + }) + + # Find images and labels + images = [] + for ext in SUPPORTED_IMAGE_EXTENSIONS: + images.extend(self.input_path.rglob(f'*{ext}')) + + annotation_id = 1 + for img_id, img_path in enumerate(images, 1): + # Try to get image dimensions (without PIL) + # Assume 640x640 if can't determine + width, height = 640, 640 + + coco_data['images'].append({ + 'id': img_id, + 'file_name': img_path.name, + 'width': width, + 'height': height + }) + results['converted_images'] += 1 + + # Find corresponding label + label_path = img_path.with_suffix('.txt') + if not label_path.exists(): + # Try labels subdirectory + label_path = img_path.parent.parent / 'labels' / (img_path.stem + '.txt') + + if label_path.exists(): + with open(label_path) as f: + for line in f: + parts = line.strip().split() + if len(parts) >= 5: + class_id = int(parts[0]) + x_center = float(parts[1]) * width + y_center = float(parts[2]) * height + w = float(parts[3]) * width + h = float(parts[4]) * height + + # Convert to COCO format [x, y, width, height] + x = x_center - w / 2 + y = y_center - h / 2 + + coco_data['annotations'].append({ + 'id': annotation_id, + 'image_id': img_id, + 'category_id': class_id, + 'bbox': [x, y, w, h], + 'area': w * h, + 'iscrowd': 0 + }) + annotation_id += 1 + results['converted_annotations'] += 1 + + # Write COCO JSON + self.output_path.mkdir(parents=True, exist_ok=True) + with open(self.output_path / 'annotations.json', 'w') as f: + json.dump(coco_data, f, indent=2) + + return results + + def _voc_to_coco(self) -> Dict: + """Convert Pascal VOC format to COCO format.""" + results = 
{'converted_images': 0, 'converted_annotations': 0} + + coco_data = COCO_CATEGORIES_TEMPLATE.copy() + coco_data['images'] = [] + coco_data['annotations'] = [] + coco_data['categories'] = [] + + class_to_id = {} + annotation_id = 1 + + for img_id, xml_file in enumerate(self.input_path.rglob('*.xml'), 1): + try: + tree = ET.parse(xml_file) + root = tree.getroot() + + if root.tag != 'annotation': + continue + + # Get image info + filename = root.find('filename') + size = root.find('size') + + if filename is None or size is None: + continue + + width = int(size.find('width').text) + height = int(size.find('height').text) + + coco_data['images'].append({ + 'id': img_id, + 'file_name': filename.text, + 'width': width, + 'height': height + }) + results['converted_images'] += 1 + + # Convert objects + for obj in root.findall('object'): + name = obj.find('name').text + + if name not in class_to_id: + class_to_id[name] = len(class_to_id) + coco_data['categories'].append({ + 'id': class_to_id[name], + 'name': name, + 'supercategory': 'object' + }) + + bndbox = obj.find('bndbox') + xmin = float(bndbox.find('xmin').text) + ymin = float(bndbox.find('ymin').text) + xmax = float(bndbox.find('xmax').text) + ymax = float(bndbox.find('ymax').text) + + coco_data['annotations'].append({ + 'id': annotation_id, + 'image_id': img_id, + 'category_id': class_to_id[name], + 'bbox': [xmin, ymin, xmax - xmin, ymax - ymin], + 'area': (xmax - xmin) * (ymax - ymin), + 'iscrowd': 0 + }) + annotation_id += 1 + results['converted_annotations'] += 1 + + except Exception as e: + logger.warning(f"Error parsing {xml_file}: {e}") + + # Write output + self.output_path.mkdir(parents=True, exist_ok=True) + with open(self.output_path / 'annotations.json', 'w') as f: + json.dump(coco_data, f, indent=2) + + return results + + def _voc_to_yolo(self) -> Dict: + """Convert Pascal VOC format to YOLO format.""" + # First convert to COCO, then to YOLO + temp_coco = self.output_path / '_temp_coco' + + converter1 = FormatConverter(str(self.input_path), str(temp_coco)) + converter1._voc_to_coco() + + converter2 = FormatConverter(str(temp_coco), str(self.output_path)) + results = converter2._coco_to_yolo() + + # Clean up temp + shutil.rmtree(temp_coco, ignore_errors=True) + + return results + + def _coco_to_voc(self) -> Dict: + """Convert COCO format to Pascal VOC format.""" + results = {'converted_images': 0, 'converted_annotations': 0} + + self.output_path.mkdir(parents=True, exist_ok=True) + annotations_dir = self.output_path / 'Annotations' + annotations_dir.mkdir(exist_ok=True) + + for coco_file in self.input_path.rglob('*.json'): + try: + with open(coco_file) as f: + coco_data = json.load(f) + + if 'annotations' not in coco_data: + continue + + # Build mappings + cat_map = {cat['id']: cat['name'] for cat in coco_data.get('categories', [])} + img_map = {img['id']: img for img in coco_data.get('images', [])} + + # Group by image + ann_by_image = defaultdict(list) + for ann in coco_data['annotations']: + ann_by_image[ann['image_id']].append(ann) + + for img_id, annotations in ann_by_image.items(): + if img_id not in img_map: + continue + + img_info = img_map[img_id] + + # Create VOC XML + annotation = ET.Element('annotation') + + ET.SubElement(annotation, 'folder').text = 'images' + ET.SubElement(annotation, 'filename').text = img_info['file_name'] + + size = ET.SubElement(annotation, 'size') + ET.SubElement(size, 'width').text = str(img_info['width']) + ET.SubElement(size, 'height').text = str(img_info['height']) + ET.SubElement(size, 
'depth').text = '3' + + for ann in annotations: + obj = ET.SubElement(annotation, 'object') + ET.SubElement(obj, 'name').text = cat_map.get(ann['category_id'], 'unknown') + ET.SubElement(obj, 'difficult').text = '0' + + bbox = ann['bbox'] + bndbox = ET.SubElement(obj, 'bndbox') + ET.SubElement(bndbox, 'xmin').text = str(int(bbox[0])) + ET.SubElement(bndbox, 'ymin').text = str(int(bbox[1])) + ET.SubElement(bndbox, 'xmax').text = str(int(bbox[0] + bbox[2])) + ET.SubElement(bndbox, 'ymax').text = str(int(bbox[1] + bbox[3])) + + results['converted_annotations'] += 1 + + # Write XML + xml_name = Path(img_info['file_name']).stem + '.xml' + tree = ET.ElementTree(annotation) + tree.write(annotations_dir / xml_name) + results['converted_images'] += 1 + + except Exception as e: + logger.error(f"Error converting {coco_file}: {e}") + + return results + + +# ============================================================================ +# Dataset Splitting +# ============================================================================ + +class DatasetSplitter: + """Split dataset into train/val/test sets.""" + + def __init__(self, dataset_path: str, output_path: str = None): + self.dataset_path = Path(dataset_path) + self.output_path = Path(output_path) if output_path else self.dataset_path + + def split(self, train: float = 0.8, val: float = 0.1, test: float = 0.1, + stratify: bool = True, seed: int = 42) -> Dict: + """Split dataset with optional stratification.""" + + if abs(train + val + test - 1.0) > 0.001: + raise ValueError(f"Split ratios must sum to 1.0, got {train + val + test}") + + random.seed(seed) + logger.info(f"Splitting dataset: train={train}, val={val}, test={test}") + + # Detect format and find images + analyzer = DatasetAnalyzer(str(self.dataset_path)) + analyzer.analyze() + detected_format = analyzer.stats.get('format', 'unknown') + + images = [] + for ext in SUPPORTED_IMAGE_EXTENSIONS: + images.extend(self.dataset_path.rglob(f'*{ext}')) + + if not images: + return {'error': 'No images found'} + + # Stratify if requested and we have class info + if stratify and detected_format in ['coco', 'yolo']: + splits = self._stratified_split(images, detected_format, train, val, test) + else: + splits = self._random_split(images, train, val, test) + + # Create output directories and copy/link files + results = self._create_split_directories(splits, detected_format) + + return results + + def _random_split(self, images: List[Path], train: float, val: float, test: float) -> Dict: + """Perform random split.""" + images = list(images) + random.shuffle(images) + + n = len(images) + train_end = int(n * train) + val_end = train_end + int(n * val) + + return { + 'train': images[:train_end], + 'val': images[train_end:val_end], + 'test': images[val_end:] + } + + def _stratified_split(self, images: List[Path], format: str, + train: float, val: float, test: float) -> Dict: + """Perform stratified split based on class distribution.""" + + # Group images by their primary class + image_classes = {} + + for img in images: + if format == 'yolo': + label_path = img.with_suffix('.txt') + if not label_path.exists(): + label_path = img.parent.parent / 'labels' / (img.stem + '.txt') + + if label_path.exists(): + with open(label_path) as f: + line = f.readline() + if line: + class_id = int(line.split()[0]) + image_classes[img] = class_id + else: + image_classes[img] = -1 # No annotation + else: + image_classes[img] = -1 # Default for other formats + + # Group by class + class_images = defaultdict(list) + for img, 
class_id in image_classes.items(): + class_images[class_id].append(img) + + # Split each class proportionally + splits = {'train': [], 'val': [], 'test': []} + + for class_id, class_imgs in class_images.items(): + random.shuffle(class_imgs) + n = len(class_imgs) + train_end = int(n * train) + val_end = train_end + int(n * val) + + splits['train'].extend(class_imgs[:train_end]) + splits['val'].extend(class_imgs[train_end:val_end]) + splits['test'].extend(class_imgs[val_end:]) + + # Shuffle final splits + for key in splits: + random.shuffle(splits[key]) + + return splits + + def _create_split_directories(self, splits: Dict, format: str) -> Dict: + """Create split directories and organize files.""" + results = { + 'train_count': len(splits['train']), + 'val_count': len(splits['val']), + 'test_count': len(splits['test']), + 'output_path': str(self.output_path) + } + + # Create directory structure + for split_name in ['train', 'val', 'test']: + images_dir = self.output_path / 'images' / split_name + labels_dir = self.output_path / 'labels' / split_name + images_dir.mkdir(parents=True, exist_ok=True) + labels_dir.mkdir(parents=True, exist_ok=True) + + for img_path in splits[split_name]: + # Create symlink for image + dst_img = images_dir / img_path.name + if not dst_img.exists(): + try: + dst_img.symlink_to(img_path.absolute()) + except OSError: + # Fall back to copy if symlink fails + shutil.copy2(img_path, dst_img) + + # Handle label file + if format == 'yolo': + label_path = img_path.with_suffix('.txt') + if not label_path.exists(): + label_path = img_path.parent.parent / 'labels' / (img_path.stem + '.txt') + + if label_path.exists(): + dst_label = labels_dir / (img_path.stem + '.txt') + if not dst_label.exists(): + try: + dst_label.symlink_to(label_path.absolute()) + except OSError: + shutil.copy2(label_path, dst_label) + + # Generate data.yaml for YOLO + if format == 'yolo': + # Read classes + classes_file = self.dataset_path / 'classes.txt' + class_names = [] + if classes_file.exists(): + with open(classes_file) as f: + class_names = [line.strip() for line in f.readlines()] + + yaml_content = YOLO_DATA_YAML_TEMPLATE.format( + dataset_path=str(self.output_path.absolute()), + train_path='images/train', + val_path='images/val', + test_path='images/test', + num_classes=len(class_names), + class_names=class_names + ) + with open(self.output_path / 'data.yaml', 'w') as f: + f.write(yaml_content) + + return results + + +# ============================================================================ +# Augmentation Configuration +# ============================================================================ + +class AugmentationConfigGenerator: + """Generate augmentation configurations for different CV tasks.""" + + @staticmethod + def generate(task: str, intensity: str = 'medium', + framework: str = 'albumentations') -> Dict: + """Generate augmentation config for task and intensity.""" + + if task not in AUGMENTATION_PRESETS: + return {'error': f"Unknown task: {task}. Use: detection, segmentation, classification"} + + if intensity not in AUGMENTATION_PRESETS[task]: + return {'error': f"Unknown intensity: {intensity}. 
Use: light, medium, heavy"} + + base_config = AUGMENTATION_PRESETS[task][intensity] + + if framework == 'albumentations': + return AugmentationConfigGenerator._to_albumentations(base_config, task) + elif framework == 'torchvision': + return AugmentationConfigGenerator._to_torchvision(base_config, task) + elif framework == 'ultralytics': + return AugmentationConfigGenerator._to_ultralytics(base_config, task) + else: + return base_config + + @staticmethod + def _to_albumentations(config: Dict, task: str) -> Dict: + """Convert to Albumentations format.""" + transforms = [] + + for aug_name, params in config.items(): + if aug_name == 'horizontal_flip': + transforms.append({ + 'type': 'HorizontalFlip', + 'p': params + }) + elif aug_name == 'vertical_flip': + transforms.append({ + 'type': 'VerticalFlip', + 'p': params + }) + elif aug_name == 'rotate': + transforms.append({ + 'type': 'Rotate', + 'limit': params.get('limit', 15), + 'p': params.get('p', 0.5) + }) + elif aug_name == 'scale': + transforms.append({ + 'type': 'RandomScale', + 'scale_limit': params.get('scale_limit', 0.2), + 'p': params.get('p', 0.5) + }) + elif aug_name == 'brightness_contrast': + transforms.append({ + 'type': 'RandomBrightnessContrast', + 'brightness_limit': params.get('brightness_limit', 0.2), + 'contrast_limit': params.get('contrast_limit', 0.2), + 'p': params.get('p', 0.5) + }) + elif aug_name == 'hue_saturation': + transforms.append({ + 'type': 'HueSaturationValue', + 'hue_shift_limit': params.get('hue_shift_limit', 20), + 'sat_shift_limit': params.get('sat_shift_limit', 30), + 'p': params.get('p', 0.5) + }) + elif aug_name == 'blur': + transforms.append({ + 'type': 'Blur', + 'blur_limit': params.get('blur_limit', 5), + 'p': params.get('p', 0.3) + }) + elif aug_name == 'noise': + transforms.append({ + 'type': 'GaussNoise', + 'var_limit': params.get('var_limit', (10, 50)), + 'p': params.get('p', 0.3) + }) + elif aug_name == 'elastic_transform': + transforms.append({ + 'type': 'ElasticTransform', + 'alpha': params.get('alpha', 100), + 'sigma': params.get('sigma', 10), + 'p': params.get('p', 0.3) + }) + elif aug_name == 'cutout': + transforms.append({ + 'type': 'CoarseDropout', + 'max_holes': params.get('num_holes', 8), + 'max_height': params.get('max_h_size', 32), + 'max_width': params.get('max_w_size', 32), + 'p': params.get('p', 0.3) + }) + + # Add bbox format for detection + bbox_params = None + if task == 'detection': + bbox_params = { + 'format': 'pascal_voc', + 'label_fields': ['class_labels'], + 'min_visibility': 0.3 + } + + return { + 'framework': 'albumentations', + 'task': task, + 'transforms': transforms, + 'bbox_params': bbox_params, + 'code_example': AugmentationConfigGenerator._albumentations_code(transforms, task) + } + + @staticmethod + def _albumentations_code(transforms: List, task: str) -> str: + """Generate Albumentations code example.""" + code = """import albumentations as A +from albumentations.pytorch import ToTensorV2 + +transform = A.Compose([ +""" + for t in transforms: + params = ', '.join(f"{k}={v}" for k, v in t.items() if k != 'type') + code += f" A.{t['type']}({params}),\n" + + code += " A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),\n" + code += " ToTensorV2(),\n" + code += "]" + + if task == 'detection': + code += ", bbox_params=A.BboxParams(format='pascal_voc', label_fields=['class_labels']))" + else: + code += ")" + + return code + + @staticmethod + def _to_torchvision(config: Dict, task: str) -> Dict: + """Convert to torchvision transforms format.""" + 
transforms = [] + + for aug_name, params in config.items(): + if aug_name == 'horizontal_flip': + transforms.append({ + 'type': 'RandomHorizontalFlip', + 'p': params + }) + elif aug_name == 'vertical_flip': + transforms.append({ + 'type': 'RandomVerticalFlip', + 'p': params + }) + elif aug_name == 'rotate': + transforms.append({ + 'type': 'RandomRotation', + 'degrees': params.get('limit', 15) + }) + elif aug_name == 'color_jitter': + transforms.append({ + 'type': 'ColorJitter', + 'brightness': params.get('brightness', 0.2), + 'contrast': params.get('contrast', 0.2), + 'saturation': params.get('saturation', 0.2), + 'hue': params.get('hue', 0.1) + }) + + return { + 'framework': 'torchvision', + 'task': task, + 'transforms': transforms + } + + @staticmethod + def _to_ultralytics(config: Dict, task: str) -> Dict: + """Convert to Ultralytics YOLO format.""" + yolo_config = { + 'hsv_h': 0.015, + 'hsv_s': 0.7, + 'hsv_v': 0.4, + 'degrees': config.get('rotate', {}).get('limit', 0.0), + 'translate': 0.1, + 'scale': config.get('scale', {}).get('scale_limit', 0.5), + 'shear': 0.0, + 'perspective': 0.0, + 'flipud': config.get('vertical_flip', 0.0), + 'fliplr': config.get('horizontal_flip', 0.5), + 'mosaic': config.get('mosaic', {}).get('p', 1.0) if 'mosaic' in config else 0.0, + 'mixup': config.get('mixup', {}).get('p', 0.0) if 'mixup' in config else 0.0, + 'copy_paste': 0.0 + } + + return { + 'framework': 'ultralytics', + 'task': task, + 'config': yolo_config, + 'usage': "# Add to data.yaml or pass to Trainer\nmodel.train(data='data.yaml', augment=True, **aug_config)" + } + + +# ============================================================================ +# Dataset Validation +# ============================================================================ + +class DatasetValidator: + """Validate dataset integrity and quality.""" + + def __init__(self, dataset_path: str, format: str = None): + self.dataset_path = Path(dataset_path) + self.format = format + + def validate(self) -> Dict: + """Run all validation checks.""" + results = { + 'valid': True, + 'errors': [], + 'warnings': [], + 'stats': {} + } + + # Auto-detect format if not specified + if self.format is None: + analyzer = DatasetAnalyzer(str(self.dataset_path)) + analyzer.analyze() + self.format = analyzer.stats.get('format', 'unknown') + + results['format'] = self.format + + # Run format-specific validation + if self.format == 'coco': + self._validate_coco(results) + elif self.format == 'yolo': + self._validate_yolo(results) + elif self.format == 'voc': + self._validate_voc(results) + else: + results['warnings'].append(f"Unknown format: {self.format}") + + # General checks + self._validate_images(results) + self._check_duplicates(results) + + # Set overall validity + results['valid'] = len(results['errors']) == 0 + + return results + + def _validate_coco(self, results: Dict): + """Validate COCO format dataset.""" + for json_file in self.dataset_path.rglob('*.json'): + try: + with open(json_file) as f: + data = json.load(f) + + if 'annotations' not in data: + continue + + # Check required fields + if 'images' not in data: + results['errors'].append(f"{json_file}: Missing 'images' field") + if 'categories' not in data: + results['warnings'].append(f"{json_file}: Missing 'categories' field") + + # Validate annotations + image_ids = {img['id'] for img in data.get('images', [])} + category_ids = {cat['id'] for cat in data.get('categories', [])} + + for ann in data['annotations']: + if ann.get('image_id') not in image_ids: + 
results['errors'].append( + f"Annotation {ann.get('id')} references non-existent image {ann.get('image_id')}" + ) + if ann.get('category_id') not in category_ids: + results['warnings'].append( + f"Annotation {ann.get('id')} references unknown category {ann.get('category_id')}" + ) + + # Validate bbox + if 'bbox' in ann: + bbox = ann['bbox'] + if len(bbox) != 4: + results['errors'].append( + f"Annotation {ann.get('id')}: Invalid bbox format" + ) + elif any(v < 0 for v in bbox[:2]) or any(v <= 0 for v in bbox[2:]): + results['warnings'].append( + f"Annotation {ann.get('id')}: Suspicious bbox values {bbox}" + ) + + results['stats']['coco_images'] = len(data.get('images', [])) + results['stats']['coco_annotations'] = len(data['annotations']) + results['stats']['coco_categories'] = len(data.get('categories', [])) + + except json.JSONDecodeError as e: + results['errors'].append(f"{json_file}: Invalid JSON - {e}") + except Exception as e: + results['errors'].append(f"{json_file}: Error - {e}") + + def _validate_yolo(self, results: Dict): + """Validate YOLO format dataset.""" + label_files = list(self.dataset_path.rglob('*.txt')) + valid_labels = 0 + invalid_labels = 0 + + for txt_file in label_files: + if txt_file.name == 'classes.txt': + continue + + try: + with open(txt_file) as f: + lines = f.readlines() + + for line_num, line in enumerate(lines, 1): + parts = line.strip().split() + if not parts: + continue + + if len(parts) < 5: + results['errors'].append( + f"{txt_file}:{line_num}: Expected 5 values, got {len(parts)}" + ) + invalid_labels += 1 + continue + + try: + class_id = int(parts[0]) + x, y, w, h = map(float, parts[1:5]) + + # Check normalized coordinates + if not (0 <= x <= 1 and 0 <= y <= 1): + results['warnings'].append( + f"{txt_file}:{line_num}: Center coords outside [0,1]: ({x}, {y})" + ) + if not (0 < w <= 1 and 0 < h <= 1): + results['warnings'].append( + f"{txt_file}:{line_num}: Size outside (0,1]: ({w}, {h})" + ) + + valid_labels += 1 + + except ValueError as e: + results['errors'].append( + f"{txt_file}:{line_num}: Invalid values - {e}" + ) + invalid_labels += 1 + + except Exception as e: + results['errors'].append(f"{txt_file}: Error - {e}") + + results['stats']['yolo_valid_labels'] = valid_labels + results['stats']['yolo_invalid_labels'] = invalid_labels + + def _validate_voc(self, results: Dict): + """Validate Pascal VOC format dataset.""" + xml_files = list(self.dataset_path.rglob('*.xml')) + valid_annotations = 0 + + for xml_file in xml_files: + try: + tree = ET.parse(xml_file) + root = tree.getroot() + + if root.tag != 'annotation': + continue + + # Check required fields + filename = root.find('filename') + if filename is None: + results['warnings'].append(f"{xml_file}: Missing filename") + + size = root.find('size') + if size is None: + results['warnings'].append(f"{xml_file}: Missing size") + else: + for dim in ['width', 'height']: + if size.find(dim) is None: + results['errors'].append(f"{xml_file}: Missing {dim}") + + # Validate objects + for obj in root.findall('object'): + name = obj.find('name') + if name is None or not name.text: + results['errors'].append(f"{xml_file}: Object missing name") + + bndbox = obj.find('bndbox') + if bndbox is None: + results['errors'].append(f"{xml_file}: Object missing bndbox") + else: + for coord in ['xmin', 'ymin', 'xmax', 'ymax']: + elem = bndbox.find(coord) + if elem is None: + results['errors'].append(f"{xml_file}: Missing {coord}") + + valid_annotations += 1 + + except ET.ParseError as e: + 
results['errors'].append(f"{xml_file}: XML parse error - {e}") + except Exception as e: + results['errors'].append(f"{xml_file}: Error - {e}") + + results['stats']['voc_annotations'] = valid_annotations + + def _validate_images(self, results: Dict): + """Check for image file issues.""" + images = [] + for ext in SUPPORTED_IMAGE_EXTENSIONS: + images.extend(self.dataset_path.rglob(f'*{ext}')) + + results['stats']['total_images'] = len(images) + + # Check for empty images + empty_images = [img for img in images if img.stat().st_size == 0] + if empty_images: + results['errors'].append(f"Found {len(empty_images)} empty image files") + + # Check for very small images + small_images = [img for img in images if img.stat().st_size < 1000] + if small_images: + results['warnings'].append(f"Found {len(small_images)} very small images (<1KB)") + + def _check_duplicates(self, results: Dict): + """Check for duplicate images by hash.""" + images = [] + for ext in SUPPORTED_IMAGE_EXTENSIONS: + images.extend(self.dataset_path.rglob(f'*{ext}')) + + hashes = {} + duplicates = [] + + for img in images: + try: + with open(img, 'rb') as f: + file_hash = hashlib.md5(f.read()).hexdigest() + + if file_hash in hashes: + duplicates.append((img, hashes[file_hash])) + else: + hashes[file_hash] = img + except: + pass + + if duplicates: + results['warnings'].append(f"Found {len(duplicates)} duplicate images") + results['stats']['duplicate_images'] = len(duplicates) + + +# ============================================================================ +# Main CLI +# ============================================================================ def main(): - """Main entry point""" parser = argparse.ArgumentParser( - description="Dataset Pipeline Builder" + description="Dataset Pipeline Builder for Computer Vision", + formatter_class=argparse.RawDescriptionHelpFormatter, + epilog=""" +Examples: + Analyze dataset: + python dataset_pipeline_builder.py analyze --input /path/to/dataset + + Convert COCO to YOLO: + python dataset_pipeline_builder.py convert --input /path/to/coco --output /path/to/yolo --format yolo + + Split dataset: + python dataset_pipeline_builder.py split --input /path/to/dataset --train 0.8 --val 0.1 --test 0.1 + + Generate augmentation config: + python dataset_pipeline_builder.py augment-config --task detection --intensity heavy + + Validate dataset: + python dataset_pipeline_builder.py validate --input /path/to/dataset --format coco + """ ) - parser.add_argument('--input', '-i', required=True, help='Input path') - parser.add_argument('--output', '-o', required=True, help='Output path') - parser.add_argument('--config', '-c', help='Configuration file') - parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output') - + + subparsers = parser.add_subparsers(dest='command', help='Command to run') + + # Analyze command + analyze_parser = subparsers.add_parser('analyze', help='Analyze dataset structure and statistics') + analyze_parser.add_argument('--input', '-i', required=True, help='Path to dataset') + analyze_parser.add_argument('--json', action='store_true', help='Output as JSON') + + # Convert command + convert_parser = subparsers.add_parser('convert', help='Convert between annotation formats') + convert_parser.add_argument('--input', '-i', required=True, help='Input dataset path') + convert_parser.add_argument('--output', '-o', required=True, help='Output dataset path') + convert_parser.add_argument('--format', '-f', required=True, + choices=['yolo', 'coco', 'voc'], + help='Target format') 
+ convert_parser.add_argument('--source-format', '-s', + choices=['yolo', 'coco', 'voc'], + help='Source format (auto-detected if not specified)') + + # Split command + split_parser = subparsers.add_parser('split', help='Split dataset into train/val/test') + split_parser.add_argument('--input', '-i', required=True, help='Input dataset path') + split_parser.add_argument('--output', '-o', help='Output path (default: same as input)') + split_parser.add_argument('--train', type=float, default=0.8, help='Train split ratio') + split_parser.add_argument('--val', type=float, default=0.1, help='Validation split ratio') + split_parser.add_argument('--test', type=float, default=0.1, help='Test split ratio') + split_parser.add_argument('--stratify', action='store_true', help='Stratify by class') + split_parser.add_argument('--seed', type=int, default=42, help='Random seed') + + # Augmentation config command + aug_parser = subparsers.add_parser('augment-config', help='Generate augmentation configuration') + aug_parser.add_argument('--task', '-t', required=True, + choices=['detection', 'segmentation', 'classification'], + help='CV task type') + aug_parser.add_argument('--intensity', '-n', default='medium', + choices=['light', 'medium', 'heavy'], + help='Augmentation intensity') + aug_parser.add_argument('--framework', '-f', default='albumentations', + choices=['albumentations', 'torchvision', 'ultralytics'], + help='Target framework') + aug_parser.add_argument('--output', '-o', help='Output file path') + + # Validate command + validate_parser = subparsers.add_parser('validate', help='Validate dataset integrity') + validate_parser.add_argument('--input', '-i', required=True, help='Path to dataset') + validate_parser.add_argument('--format', '-f', + choices=['yolo', 'coco', 'voc'], + help='Dataset format (auto-detected if not specified)') + validate_parser.add_argument('--json', action='store_true', help='Output as JSON') + args = parser.parse_args() - - if args.verbose: - logging.getLogger().setLevel(logging.DEBUG) - - try: - config = { - 'input': args.input, - 'output': args.output - } - - processor = DatasetPipelineBuilder(config) - results = processor.process() - - print(json.dumps(results, indent=2)) - sys.exit(0) - - except Exception as e: - logger.error(f"Fatal error: {e}") + + if args.command is None: + parser.print_help() sys.exit(1) + try: + if args.command == 'analyze': + analyzer = DatasetAnalyzer(args.input) + results = analyzer.analyze() + + if args.json: + print(json.dumps(results, indent=2, default=str)) + else: + print("\n" + "="*60) + print("DATASET ANALYSIS REPORT") + print("="*60) + print(f"\nFormat: {results.get('format', 'unknown')}") + print(f"Total Images: {results.get('total_images', 0)}") + + if 'image_stats' in results: + stats = results['image_stats'] + print(f"\nImage Statistics:") + print(f" Total Size: {stats.get('total_size_mb', 0):.2f} MB") + print(f" Extensions: {stats.get('extensions', {})}") + print(f" Locations: {stats.get('locations', {})}") + + if 'annotations' in results: + ann = results['annotations'] + print(f"\nAnnotations:") + print(f" Total: {ann.get('total_annotations', 0)}") + print(f" Images with annotations: {ann.get('images_with_annotations', 0)}") + if 'classes' in ann: + print(f" Classes: {len(ann['classes'])}") + for cls, count in sorted(ann['classes'].items(), key=lambda x: -x[1])[:10]: + print(f" - {cls}: {count}") + + if 'quality' in results: + q = results['quality'] + if q.get('warnings'): + print(f"\nWarnings:") + for w in q['warnings']: + print(f" 
⚠ {w}") + if q.get('recommendations'): + print(f"\nRecommendations:") + for r in q['recommendations']: + print(f" → {r}") + + elif args.command == 'convert': + converter = FormatConverter(args.input, args.output) + results = converter.convert(args.format, args.source_format) + print(json.dumps(results, indent=2)) + + elif args.command == 'split': + output = args.output if args.output else args.input + splitter = DatasetSplitter(args.input, output) + results = splitter.split( + train=args.train, + val=args.val, + test=args.test, + stratify=args.stratify, + seed=args.seed + ) + print(json.dumps(results, indent=2)) + + elif args.command == 'augment-config': + config = AugmentationConfigGenerator.generate( + args.task, + args.intensity, + args.framework + ) + + output = json.dumps(config, indent=2) + + if args.output: + with open(args.output, 'w') as f: + f.write(output) + print(f"Configuration saved to {args.output}") + else: + print(output) + + elif args.command == 'validate': + validator = DatasetValidator(args.input, args.format) + results = validator.validate() + + if args.json: + print(json.dumps(results, indent=2)) + else: + print("\n" + "="*60) + print("DATASET VALIDATION REPORT") + print("="*60) + print(f"\nFormat: {results.get('format', 'unknown')}") + print(f"Valid: {'✓' if results['valid'] else '✗'}") + + if results.get('errors'): + print(f"\nErrors ({len(results['errors'])}):") + for err in results['errors'][:10]: + print(f" ✗ {err}") + if len(results['errors']) > 10: + print(f" ... and {len(results['errors']) - 10} more") + + if results.get('warnings'): + print(f"\nWarnings ({len(results['warnings'])}):") + for warn in results['warnings'][:10]: + print(f" ⚠ {warn}") + if len(results['warnings']) > 10: + print(f" ... and {len(results['warnings']) - 10} more") + + if results.get('stats'): + print(f"\nStatistics:") + for key, value in results['stats'].items(): + print(f" {key}: {value}") + + sys.exit(0) + + except Exception as e: + logger.error(f"Error: {e}") + sys.exit(1) + + if __name__ == '__main__': main() diff --git a/engineering-team/senior-computer-vision/scripts/inference_optimizer.py b/engineering-team/senior-computer-vision/scripts/inference_optimizer.py index 97f5c8d..333e1ec 100755 --- a/engineering-team/senior-computer-vision/scripts/inference_optimizer.py +++ b/engineering-team/senior-computer-vision/scripts/inference_optimizer.py @@ -1,17 +1,26 @@ #!/usr/bin/env python3 """ Inference Optimizer -Production-grade tool for senior computer vision engineer + +Analyzes and benchmarks vision models, and provides optimization recommendations. +Supports PyTorch, ONNX, and TensorRT models. 
+ +Usage: + python inference_optimizer.py model.pt --benchmark + python inference_optimizer.py model.pt --export onnx --output model.onnx + python inference_optimizer.py model.onnx --analyze """ import os import sys import json -import logging import argparse +import logging +import time from pathlib import Path -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Any, Tuple from datetime import datetime +import statistics logging.basicConfig( level=logging.INFO, @@ -19,82 +28,530 @@ logging.basicConfig( ) logger = logging.getLogger(__name__) + +# Model format signatures +MODEL_FORMATS = { + '.pt': 'pytorch', + '.pth': 'pytorch', + '.onnx': 'onnx', + '.engine': 'tensorrt', + '.trt': 'tensorrt', + '.xml': 'openvino', + '.mlpackage': 'coreml', + '.mlmodel': 'coreml', +} + +# Optimization recommendations +OPTIMIZATION_PATHS = { + ('pytorch', 'gpu'): ['onnx', 'tensorrt_fp16'], + ('pytorch', 'cpu'): ['onnx', 'onnxruntime'], + ('pytorch', 'edge'): ['onnx', 'tensorrt_int8'], + ('pytorch', 'mobile'): ['onnx', 'tflite'], + ('pytorch', 'apple'): ['coreml'], + ('pytorch', 'intel'): ['onnx', 'openvino'], + ('onnx', 'gpu'): ['tensorrt_fp16'], + ('onnx', 'cpu'): ['onnxruntime'], +} + + class InferenceOptimizer: - """Production-grade inference optimizer""" - - def __init__(self, config: Dict): - self.config = config - self.results = { - 'status': 'initialized', - 'start_time': datetime.now().isoformat(), - 'processed_items': 0 + """Analyzes and optimizes vision model inference.""" + + def __init__(self, model_path: str): + self.model_path = Path(model_path) + self.model_format = self._detect_format() + self.model_info = {} + self.benchmark_results = {} + + def _detect_format(self) -> str: + """Detect model format from file extension.""" + suffix = self.model_path.suffix.lower() + if suffix in MODEL_FORMATS: + return MODEL_FORMATS[suffix] + raise ValueError(f"Unknown model format: {suffix}") + + def analyze_model(self) -> Dict[str, Any]: + """Analyze model structure and size.""" + logger.info(f"Analyzing model: {self.model_path}") + + analysis = { + 'path': str(self.model_path), + 'format': self.model_format, + 'file_size_mb': self.model_path.stat().st_size / 1024 / 1024, + 'parameters': None, + 'layers': [], + 'input_shape': None, + 'output_shape': None, + 'ops_count': None, } - logger.info(f"Initialized {self.__class__.__name__}") - - def validate_config(self) -> bool: - """Validate configuration""" - logger.info("Validating configuration...") - # Add validation logic - logger.info("Configuration validated") - return True - - def process(self) -> Dict: - """Main processing logic""" - logger.info("Starting processing...") - + + if self.model_format == 'onnx': + analysis.update(self._analyze_onnx()) + elif self.model_format == 'pytorch': + analysis.update(self._analyze_pytorch()) + + self.model_info = analysis + return analysis + + def _analyze_onnx(self) -> Dict[str, Any]: + """Analyze ONNX model.""" try: - self.validate_config() - - # Main processing - result = self._execute() - - self.results['status'] = 'completed' - self.results['end_time'] = datetime.now().isoformat() - - logger.info("Processing completed successfully") - return self.results - + import onnx + model = onnx.load(str(self.model_path)) + onnx.checker.check_model(model) + + # Count parameters + total_params = 0 + for initializer in model.graph.initializer: + param_count = 1 + for dim in initializer.dims: + param_count *= dim + total_params += param_count + + # Get input/output shapes + inputs = [] + for inp 
in model.graph.input: + shape = [d.dim_value if d.dim_value else -1 + for d in inp.type.tensor_type.shape.dim] + inputs.append({'name': inp.name, 'shape': shape}) + + outputs = [] + for out in model.graph.output: + shape = [d.dim_value if d.dim_value else -1 + for d in out.type.tensor_type.shape.dim] + outputs.append({'name': out.name, 'shape': shape}) + + # Count operators + op_counts = {} + for node in model.graph.node: + op_type = node.op_type + op_counts[op_type] = op_counts.get(op_type, 0) + 1 + + return { + 'parameters': total_params, + 'inputs': inputs, + 'outputs': outputs, + 'operator_counts': op_counts, + 'num_nodes': len(model.graph.node), + 'opset_version': model.opset_import[0].version if model.opset_import else None, + } + + except ImportError: + logger.warning("onnx package not installed, skipping detailed analysis") + return {} except Exception as e: - self.results['status'] = 'failed' - self.results['error'] = str(e) - logger.error(f"Processing failed: {e}") - raise - - def _execute(self) -> Dict: - """Execute main logic""" - # Implementation here - return {'success': True} + logger.error(f"Error analyzing ONNX model: {e}") + return {'error': str(e)} + + def _analyze_pytorch(self) -> Dict[str, Any]: + """Analyze PyTorch model.""" + try: + import torch + + # Try to load as checkpoint + checkpoint = torch.load(str(self.model_path), map_location='cpu') + + # Handle different checkpoint formats + if isinstance(checkpoint, dict): + if 'model' in checkpoint: + state_dict = checkpoint['model'] + elif 'state_dict' in checkpoint: + state_dict = checkpoint['state_dict'] + else: + state_dict = checkpoint + else: + # Assume it's the model itself + if hasattr(checkpoint, 'state_dict'): + state_dict = checkpoint.state_dict() + else: + return {'error': 'Could not extract state dict'} + + # Count parameters + total_params = 0 + layer_info = [] + for name, param in state_dict.items(): + if hasattr(param, 'numel'): + param_count = param.numel() + total_params += param_count + layer_info.append({ + 'name': name, + 'shape': list(param.shape), + 'params': param_count, + 'dtype': str(param.dtype) + }) + + return { + 'parameters': total_params, + 'layers': layer_info[:20], # First 20 layers + 'num_layers': len(layer_info), + } + + except ImportError: + logger.warning("torch package not installed, skipping detailed analysis") + return {} + except Exception as e: + logger.error(f"Error analyzing PyTorch model: {e}") + return {'error': str(e)} + + def benchmark(self, input_size: Tuple[int, int] = (640, 640), + batch_sizes: List[int] = None, + num_iterations: int = 100, + warmup: int = 10) -> Dict[str, Any]: + """Benchmark model inference speed.""" + if batch_sizes is None: + batch_sizes = [1, 4, 8, 16] + + logger.info(f"Benchmarking model with input size {input_size}") + + results = { + 'input_size': input_size, + 'num_iterations': num_iterations, + 'warmup_iterations': warmup, + 'batch_results': [], + 'device': 'cpu', + } + + try: + if self.model_format == 'onnx': + results.update(self._benchmark_onnx(input_size, batch_sizes, + num_iterations, warmup)) + elif self.model_format == 'pytorch': + results.update(self._benchmark_pytorch(input_size, batch_sizes, + num_iterations, warmup)) + else: + results['error'] = f"Benchmarking not supported for {self.model_format}" + + except Exception as e: + results['error'] = str(e) + logger.error(f"Benchmark failed: {e}") + + self.benchmark_results = results + return results + + def _benchmark_onnx(self, input_size: Tuple[int, int], + batch_sizes: List[int], + 
num_iterations: int, warmup: int) -> Dict[str, Any]: + """Benchmark ONNX model.""" + import numpy as np + + try: + import onnxruntime as ort + + # Try GPU first, fall back to CPU + providers = ['CPUExecutionProvider'] + try: + if 'CUDAExecutionProvider' in ort.get_available_providers(): + providers = ['CUDAExecutionProvider'] + providers + except: + pass + + session = ort.InferenceSession(str(self.model_path), providers=providers) + input_name = session.get_inputs()[0].name + device = 'cuda' if 'CUDA' in session.get_providers()[0] else 'cpu' + + results = {'device': device, 'provider': session.get_providers()[0]} + batch_results = [] + + for batch_size in batch_sizes: + # Create dummy input + dummy = np.random.randn(batch_size, 3, *input_size).astype(np.float32) + + # Warmup + for _ in range(warmup): + session.run(None, {input_name: dummy}) + + # Benchmark + latencies = [] + for _ in range(num_iterations): + start = time.perf_counter() + session.run(None, {input_name: dummy}) + latencies.append((time.perf_counter() - start) * 1000) + + batch_result = { + 'batch_size': batch_size, + 'mean_latency_ms': statistics.mean(latencies), + 'std_latency_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0, + 'min_latency_ms': min(latencies), + 'max_latency_ms': max(latencies), + 'p50_latency_ms': sorted(latencies)[len(latencies) // 2], + 'p95_latency_ms': sorted(latencies)[int(len(latencies) * 0.95)], + 'p99_latency_ms': sorted(latencies)[int(len(latencies) * 0.99)], + 'throughput_fps': batch_size * 1000 / statistics.mean(latencies), + } + batch_results.append(batch_result) + + logger.info(f"Batch {batch_size}: {batch_result['mean_latency_ms']:.2f}ms, " + f"{batch_result['throughput_fps']:.1f} FPS") + + results['batch_results'] = batch_results + return results + + except ImportError: + return {'error': 'onnxruntime not installed'} + + def _benchmark_pytorch(self, input_size: Tuple[int, int], + batch_sizes: List[int], + num_iterations: int, warmup: int) -> Dict[str, Any]: + """Benchmark PyTorch model.""" + try: + import torch + import numpy as np + + # Load model + device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') + checkpoint = torch.load(str(self.model_path), map_location=device) + + # Handle different checkpoint formats + if isinstance(checkpoint, dict) and 'model' in checkpoint: + model = checkpoint['model'] + elif hasattr(checkpoint, 'forward'): + model = checkpoint + else: + return {'error': 'Could not load model for benchmarking'} + + model.to(device) + model.train(False) + + results = {'device': str(device)} + batch_results = [] + + with torch.no_grad(): + for batch_size in batch_sizes: + dummy = torch.randn(batch_size, 3, *input_size, device=device) + + # Warmup + for _ in range(warmup): + _ = model(dummy) + if device.type == 'cuda': + torch.cuda.synchronize() + + # Benchmark + latencies = [] + for _ in range(num_iterations): + if device.type == 'cuda': + torch.cuda.synchronize() + start = time.perf_counter() + _ = model(dummy) + if device.type == 'cuda': + torch.cuda.synchronize() + latencies.append((time.perf_counter() - start) * 1000) + + batch_result = { + 'batch_size': batch_size, + 'mean_latency_ms': statistics.mean(latencies), + 'std_latency_ms': statistics.stdev(latencies) if len(latencies) > 1 else 0, + 'min_latency_ms': min(latencies), + 'max_latency_ms': max(latencies), + 'throughput_fps': batch_size * 1000 / statistics.mean(latencies), + } + batch_results.append(batch_result) + + logger.info(f"Batch {batch_size}: 
{batch_result['mean_latency_ms']:.2f}ms, " + f"{batch_result['throughput_fps']:.1f} FPS") + + results['batch_results'] = batch_results + return results + + except ImportError: + return {'error': 'torch not installed'} + except Exception as e: + return {'error': str(e)} + + def get_optimization_recommendations(self, target: str = 'gpu') -> List[Dict[str, Any]]: + """Get optimization recommendations for target platform.""" + recommendations = [] + + key = (self.model_format, target) + if key in OPTIMIZATION_PATHS: + path = OPTIMIZATION_PATHS[key] + for step in path: + rec = { + 'step': step, + 'description': self._get_step_description(step), + 'expected_speedup': self._get_expected_speedup(step), + 'command': self._get_step_command(step), + } + recommendations.append(rec) + + # Add general recommendations + if self.model_info: + params = self.model_info.get('parameters', 0) + if params and params > 50_000_000: + recommendations.append({ + 'step': 'pruning', + 'description': f'Model has {params/1e6:.1f}M parameters. ' + 'Consider structured pruning to reduce size.', + 'expected_speedup': '1.5-2x', + }) + + file_size = self.model_info.get('file_size_mb', 0) + if file_size > 100: + recommendations.append({ + 'step': 'quantization', + 'description': f'Model size is {file_size:.1f}MB. ' + 'INT8 quantization can reduce by 75%.', + 'expected_speedup': '2-4x', + }) + + return recommendations + + def _get_step_description(self, step: str) -> str: + """Get description for optimization step.""" + descriptions = { + 'onnx': 'Export to ONNX format for framework-agnostic deployment', + 'tensorrt_fp16': 'Convert to TensorRT with FP16 precision for NVIDIA GPUs', + 'tensorrt_int8': 'Convert to TensorRT with INT8 quantization for edge devices', + 'onnxruntime': 'Use ONNX Runtime for optimized CPU/GPU inference', + 'openvino': 'Convert to OpenVINO for Intel CPU/GPU optimization', + 'coreml': 'Convert to CoreML for Apple Silicon acceleration', + 'tflite': 'Convert to TensorFlow Lite for mobile deployment', + } + return descriptions.get(step, step) + + def _get_expected_speedup(self, step: str) -> str: + """Get expected speedup for optimization step.""" + speedups = { + 'onnx': '1-1.5x', + 'tensorrt_fp16': '2-4x', + 'tensorrt_int8': '3-6x', + 'onnxruntime': '1.2-2x', + 'openvino': '1.5-3x', + 'coreml': '2-5x (on Apple Silicon)', + 'tflite': '1-2x', + } + return speedups.get(step, 'varies') + + def _get_step_command(self, step: str) -> str: + """Get command for optimization step.""" + model_name = self.model_path.stem + commands = { + 'onnx': f'yolo export model={model_name}.pt format=onnx', + 'tensorrt_fp16': f'trtexec --onnx={model_name}.onnx --saveEngine={model_name}.engine --fp16', + 'tensorrt_int8': f'trtexec --onnx={model_name}.onnx --saveEngine={model_name}.engine --int8', + 'onnxruntime': f'pip install onnxruntime-gpu', + 'openvino': f'mo --input_model {model_name}.onnx --output_dir openvino/', + 'coreml': f'yolo export model={model_name}.pt format=coreml', + } + return commands.get(step, '') + + def print_summary(self): + """Print analysis and benchmark summary.""" + print("\n" + "=" * 70) + print("MODEL ANALYSIS SUMMARY") + print("=" * 70) + + if self.model_info: + print(f"Path: {self.model_info.get('path', 'N/A')}") + print(f"Format: {self.model_info.get('format', 'N/A')}") + print(f"File Size: {self.model_info.get('file_size_mb', 0):.2f} MB") + + params = self.model_info.get('parameters') + if params: + print(f"Parameters: {params:,} ({params/1e6:.2f}M)") + + if 'num_nodes' in self.model_info: + 
print(f"Nodes: {self.model_info['num_nodes']}") + + if self.benchmark_results and 'batch_results' in self.benchmark_results: + print("\n" + "-" * 70) + print("BENCHMARK RESULTS") + print("-" * 70) + print(f"Device: {self.benchmark_results.get('device', 'N/A')}") + print(f"Input Size: {self.benchmark_results.get('input_size', 'N/A')}") + print() + print(f"{'Batch':<8} {'Latency (ms)':<15} {'Throughput (FPS)':<18} {'P99 (ms)':<12}") + print("-" * 55) + + for result in self.benchmark_results['batch_results']: + print(f"{result['batch_size']:<8} " + f"{result['mean_latency_ms']:<15.2f} " + f"{result['throughput_fps']:<18.1f} " + f"{result.get('p99_latency_ms', 0):<12.2f}") + + print("=" * 70 + "\n") + def main(): - """Main entry point""" parser = argparse.ArgumentParser( - description="Inference Optimizer" + description="Analyze and optimize vision model inference" ) - parser.add_argument('--input', '-i', required=True, help='Input path') - parser.add_argument('--output', '-o', required=True, help='Output path') - parser.add_argument('--config', '-c', help='Configuration file') - parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output') - + parser.add_argument('model_path', help='Path to model file') + parser.add_argument('--analyze', action='store_true', + help='Analyze model structure') + parser.add_argument('--benchmark', action='store_true', + help='Benchmark inference speed') + parser.add_argument('--input-size', type=int, nargs=2, default=[640, 640], + metavar=('H', 'W'), help='Input image size') + parser.add_argument('--batch-sizes', type=int, nargs='+', default=[1, 4, 8], + help='Batch sizes to benchmark') + parser.add_argument('--iterations', type=int, default=100, + help='Number of benchmark iterations') + parser.add_argument('--warmup', type=int, default=10, + help='Number of warmup iterations') + parser.add_argument('--target', choices=['gpu', 'cpu', 'edge', 'mobile', 'apple', 'intel'], + default='gpu', help='Target deployment platform') + parser.add_argument('--recommend', action='store_true', + help='Show optimization recommendations') + parser.add_argument('--json', action='store_true', + help='Output as JSON') + parser.add_argument('--output', '-o', help='Output file path') + args = parser.parse_args() - - if args.verbose: - logging.getLogger().setLevel(logging.DEBUG) - - try: - config = { - 'input': args.input, - 'output': args.output - } - - processor = InferenceOptimizer(config) - results = processor.process() - - print(json.dumps(results, indent=2)) - sys.exit(0) - - except Exception as e: - logger.error(f"Fatal error: {e}") + + if not Path(args.model_path).exists(): + logger.error(f"Model not found: {args.model_path}") sys.exit(1) + try: + optimizer = InferenceOptimizer(args.model_path) + except ValueError as e: + logger.error(str(e)) + sys.exit(1) + + results = {} + + # Analyze model + if args.analyze or not (args.benchmark or args.recommend): + results['analysis'] = optimizer.analyze_model() + + # Benchmark + if args.benchmark: + results['benchmark'] = optimizer.benchmark( + input_size=tuple(args.input_size), + batch_sizes=args.batch_sizes, + num_iterations=args.iterations, + warmup=args.warmup + ) + + # Recommendations + if args.recommend: + if not optimizer.model_info: + optimizer.analyze_model() + results['recommendations'] = optimizer.get_optimization_recommendations(args.target) + + # Output + if args.json: + print(json.dumps(results, indent=2, default=str)) + else: + optimizer.print_summary() + + if args.recommend and 'recommendations' in 
results: + print("OPTIMIZATION RECOMMENDATIONS") + print("-" * 70) + for i, rec in enumerate(results['recommendations'], 1): + print(f"\n{i}. {rec['step'].upper()}") + print(f" {rec['description']}") + print(f" Expected speedup: {rec['expected_speedup']}") + if rec.get('command'): + print(f" Command: {rec['command']}") + print() + + # Save to file + if args.output: + with open(args.output, 'w') as f: + json.dump(results, f, indent=2, default=str) + logger.info(f"Results saved to {args.output}") + + if __name__ == '__main__': main() diff --git a/engineering-team/senior-computer-vision/scripts/vision_model_trainer.py b/engineering-team/senior-computer-vision/scripts/vision_model_trainer.py index 84edf9a..c1a36fb 100755 --- a/engineering-team/senior-computer-vision/scripts/vision_model_trainer.py +++ b/engineering-team/senior-computer-vision/scripts/vision_model_trainer.py @@ -1,16 +1,22 @@ #!/usr/bin/env python3 """ -Vision Model Trainer -Production-grade tool for senior computer vision engineer +Vision Model Trainer Configuration Generator + +Generates training configuration files for object detection and segmentation models. +Supports Ultralytics YOLO, Detectron2, and MMDetection frameworks. + +Usage: + python vision_model_trainer.py --task detection --arch yolov8m + python vision_model_trainer.py --framework detectron2 --arch faster_rcnn_R_50_FPN """ import os import sys import json -import logging import argparse +import logging from pathlib import Path -from typing import Dict, List, Optional +from typing import Dict, List, Optional, Any from datetime import datetime logging.basicConfig( @@ -19,82 +25,552 @@ logging.basicConfig( ) logger = logging.getLogger(__name__) + +# Architecture configurations +YOLO_ARCHITECTURES = { + 'yolov8n': {'params': '3.2M', 'gflops': 8.7, 'map': 37.3}, + 'yolov8s': {'params': '11.2M', 'gflops': 28.6, 'map': 44.9}, + 'yolov8m': {'params': '25.9M', 'gflops': 78.9, 'map': 50.2}, + 'yolov8l': {'params': '43.7M', 'gflops': 165.2, 'map': 52.9}, + 'yolov8x': {'params': '68.2M', 'gflops': 257.8, 'map': 53.9}, + 'yolov5n': {'params': '1.9M', 'gflops': 4.5, 'map': 28.0}, + 'yolov5s': {'params': '7.2M', 'gflops': 16.5, 'map': 37.4}, + 'yolov5m': {'params': '21.2M', 'gflops': 49.0, 'map': 45.4}, + 'yolov5l': {'params': '46.5M', 'gflops': 109.1, 'map': 49.0}, + 'yolov5x': {'params': '86.7M', 'gflops': 205.7, 'map': 50.7}, +} + +DETECTRON2_ARCHITECTURES = { + 'faster_rcnn_R_50_FPN': {'backbone': 'R-50-FPN', 'map': 37.9}, + 'faster_rcnn_R_101_FPN': {'backbone': 'R-101-FPN', 'map': 39.4}, + 'faster_rcnn_X_101_FPN': {'backbone': 'X-101-FPN', 'map': 41.0}, + 'mask_rcnn_R_50_FPN': {'backbone': 'R-50-FPN', 'map': 38.6}, + 'mask_rcnn_R_101_FPN': {'backbone': 'R-101-FPN', 'map': 40.0}, + 'retinanet_R_50_FPN': {'backbone': 'R-50-FPN', 'map': 36.4}, + 'retinanet_R_101_FPN': {'backbone': 'R-101-FPN', 'map': 37.7}, +} + +MMDETECTION_ARCHITECTURES = { + 'faster_rcnn_r50_fpn': {'backbone': 'ResNet50', 'map': 37.4}, + 'faster_rcnn_r101_fpn': {'backbone': 'ResNet101', 'map': 39.4}, + 'mask_rcnn_r50_fpn': {'backbone': 'ResNet50', 'map': 38.2}, + 'yolox_s': {'backbone': 'CSPDarknet', 'map': 40.5}, + 'yolox_m': {'backbone': 'CSPDarknet', 'map': 46.9}, + 'yolox_l': {'backbone': 'CSPDarknet', 'map': 49.7}, + 'detr_r50': {'backbone': 'ResNet50', 'map': 42.0}, + 'dino_r50': {'backbone': 'ResNet50', 'map': 49.0}, +} + + class VisionModelTrainer: - """Production-grade vision model trainer""" - - def __init__(self, config: Dict): - self.config = config - self.results = { - 'status': 
'initialized', - 'start_time': datetime.now().isoformat(), - 'processed_items': 0 + """Generates training configurations for vision models.""" + + def __init__(self, data_dir: str, task: str = 'detection', + framework: str = 'ultralytics'): + self.data_dir = Path(data_dir) + self.task = task + self.framework = framework + self.config = {} + + def analyze_dataset(self) -> Dict[str, Any]: + """Analyze dataset structure and statistics.""" + logger.info(f"Analyzing dataset at {self.data_dir}") + + analysis = { + 'path': str(self.data_dir), + 'exists': self.data_dir.exists(), + 'images': {'train': 0, 'val': 0, 'test': 0}, + 'annotations': {'format': None, 'classes': []}, + 'recommendations': [] } - logger.info(f"Initialized {self.__class__.__name__}") - - def validate_config(self) -> bool: - """Validate configuration""" - logger.info("Validating configuration...") - # Add validation logic - logger.info("Configuration validated") - return True - - def process(self) -> Dict: - """Main processing logic""" - logger.info("Starting processing...") - - try: - self.validate_config() - - # Main processing - result = self._execute() - - self.results['status'] = 'completed' - self.results['end_time'] = datetime.now().isoformat() - - logger.info("Processing completed successfully") - return self.results - - except Exception as e: - self.results['status'] = 'failed' - self.results['error'] = str(e) - logger.error(f"Processing failed: {e}") - raise - - def _execute(self) -> Dict: - """Execute main logic""" - # Implementation here - return {'success': True} + + if not self.data_dir.exists(): + analysis['recommendations'].append( + f"Directory {self.data_dir} does not exist" + ) + return analysis + + # Check for common dataset structures + # COCO format + if (self.data_dir / 'annotations').exists(): + analysis['annotations']['format'] = 'coco' + for split in ['train', 'val', 'test']: + ann_file = self.data_dir / 'annotations' / f'{split}.json' + if ann_file.exists(): + with open(ann_file, 'r') as f: + data = json.load(f) + analysis['images'][split] = len(data.get('images', [])) + if not analysis['annotations']['classes']: + analysis['annotations']['classes'] = [ + c['name'] for c in data.get('categories', []) + ] + + # YOLO format + elif (self.data_dir / 'labels').exists(): + analysis['annotations']['format'] = 'yolo' + for split in ['train', 'val', 'test']: + img_dir = self.data_dir / 'images' / split + if img_dir.exists(): + analysis['images'][split] = len(list(img_dir.glob('*.*'))) + + # Try to read classes from data.yaml + data_yaml = self.data_dir / 'data.yaml' + if data_yaml.exists(): + import yaml + with open(data_yaml, 'r') as f: + data = yaml.safe_load(f) + analysis['annotations']['classes'] = data.get('names', []) + + # Generate recommendations + total_images = sum(analysis['images'].values()) + if total_images < 100: + analysis['recommendations'].append( + f"Dataset has only {total_images} images. " + "Consider collecting more data or using transfer learning." + ) + if total_images < 1000: + analysis['recommendations'].append( + "Use aggressive data augmentation (mosaic, mixup) for small datasets." + ) + + num_classes = len(analysis['annotations']['classes']) + if num_classes > 80: + analysis['recommendations'].append( + f"Large number of classes ({num_classes}). " + "Consider using larger model (yolov8l/x) or longer training." 
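+                # Heuristic threshold: COCO itself has 80 classes; beyond that,
+                # extra capacity or a longer schedule usually helps.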
+ ) + + logger.info(f"Found {total_images} images, {num_classes} classes") + return analysis + + def generate_yolo_config(self, arch: str, epochs: int = 100, + batch: int = 16, imgsz: int = 640, + **kwargs) -> Dict[str, Any]: + """Generate Ultralytics YOLO training configuration.""" + if arch not in YOLO_ARCHITECTURES: + available = ', '.join(YOLO_ARCHITECTURES.keys()) + raise ValueError(f"Unknown architecture: {arch}. Available: {available}") + + arch_info = YOLO_ARCHITECTURES[arch] + + config = { + 'model': f'{arch}.pt', + 'data': str(self.data_dir / 'data.yaml'), + 'epochs': epochs, + 'batch': batch, + 'imgsz': imgsz, + 'patience': 50, + 'save': True, + 'save_period': -1, + 'cache': False, + 'device': '0', + 'workers': 8, + 'project': 'runs/detect', + 'name': f'{arch}_{datetime.now().strftime("%Y%m%d_%H%M%S")}', + 'exist_ok': False, + 'pretrained': True, + 'optimizer': 'auto', + 'verbose': True, + 'seed': 0, + 'deterministic': True, + 'single_cls': False, + 'rect': False, + 'cos_lr': False, + 'close_mosaic': 10, + 'resume': False, + 'amp': True, + 'fraction': 1.0, + 'profile': False, + 'freeze': None, + 'lr0': 0.01, + 'lrf': 0.01, + 'momentum': 0.937, + 'weight_decay': 0.0005, + 'warmup_epochs': 3.0, + 'warmup_momentum': 0.8, + 'warmup_bias_lr': 0.1, + 'box': 7.5, + 'cls': 0.5, + 'dfl': 1.5, + 'pose': 12.0, + 'kobj': 1.0, + 'label_smoothing': 0.0, + 'nbs': 64, + 'hsv_h': 0.015, + 'hsv_s': 0.7, + 'hsv_v': 0.4, + 'degrees': 0.0, + 'translate': 0.1, + 'scale': 0.5, + 'shear': 0.0, + 'perspective': 0.0, + 'flipud': 0.0, + 'fliplr': 0.5, + 'bgr': 0.0, + 'mosaic': 1.0, + 'mixup': 0.0, + 'copy_paste': 0.0, + 'auto_augment': 'randaugment', + 'erasing': 0.4, + 'crop_fraction': 1.0, + } + + # Update with user overrides + config.update(kwargs) + + # Task-specific settings + if self.task == 'segmentation': + config['model'] = f'{arch}-seg.pt' + config['overlap_mask'] = True + config['mask_ratio'] = 4 + + # Metadata + config['_metadata'] = { + 'architecture': arch, + 'arch_info': arch_info, + 'task': self.task, + 'framework': 'ultralytics', + 'generated_at': datetime.now().isoformat() + } + + self.config = config + return config + + def generate_detectron2_config(self, arch: str, epochs: int = 12, + batch: int = 16, **kwargs) -> Dict[str, Any]: + """Generate Detectron2 training configuration.""" + if arch not in DETECTRON2_ARCHITECTURES: + available = ', '.join(DETECTRON2_ARCHITECTURES.keys()) + raise ValueError(f"Unknown architecture: {arch}. 
+    def generate_detectron2_config(self, arch: str, epochs: int = 12,
+                                   batch: int = 16, **kwargs) -> Dict[str, Any]:
+        """Generate Detectron2 training configuration."""
+        if arch not in DETECTRON2_ARCHITECTURES:
+            available = ', '.join(DETECTRON2_ARCHITECTURES.keys())
+            raise ValueError(f"Unknown architecture: {arch}. Available: {available}")
+
+        arch_info = DETECTRON2_ARCHITECTURES[arch]
+        iterations = epochs * 1000  # Approximate
+
+        config = {
+            'MODEL': {
+                # NOTE: this checkpoint id belongs to faster_rcnn_R_50_FPN_3x;
+                # other architectures need their own model-zoo ids.
+                'WEIGHTS': f'detectron2://COCO-Detection/{arch}_3x/137849458/model_final_280758.pkl',
+                'ROI_HEADS': {
+                    'NUM_CLASSES': len(self._get_classes()),
+                    'BATCH_SIZE_PER_IMAGE': 512,
+                    'POSITIVE_FRACTION': 0.25,
+                    'SCORE_THRESH_TEST': 0.05,
+                    'NMS_THRESH_TEST': 0.5,
+                },
+                'BACKBONE': {
+                    'FREEZE_AT': 2
+                },
+                'FPN': {
+                    'IN_FEATURES': ['res2', 'res3', 'res4', 'res5']
+                },
+                'ANCHOR_GENERATOR': {
+                    'SIZES': [[32], [64], [128], [256], [512]],
+                    'ASPECT_RATIOS': [[0.5, 1.0, 2.0]]
+                },
+                'RPN': {
+                    'PRE_NMS_TOPK_TRAIN': 2000,
+                    'PRE_NMS_TOPK_TEST': 1000,
+                    'POST_NMS_TOPK_TRAIN': 1000,
+                    'POST_NMS_TOPK_TEST': 1000,
+                }
+            },
+            'DATASETS': {
+                'TRAIN': ('custom_train',),
+                'TEST': ('custom_val',),
+            },
+            'DATALOADER': {
+                'NUM_WORKERS': 4,
+                'SAMPLER_TRAIN': 'TrainingSampler',
+                'FILTER_EMPTY_ANNOTATIONS': True,
+            },
+            'SOLVER': {
+                'IMS_PER_BATCH': batch,
+                'BASE_LR': 0.001,
+                'STEPS': (int(iterations * 0.7), int(iterations * 0.9)),
+                'MAX_ITER': iterations,
+                'WARMUP_FACTOR': 1.0 / 1000,
+                'WARMUP_ITERS': 1000,
+                'WARMUP_METHOD': 'linear',
+                'GAMMA': 0.1,
+                'MOMENTUM': 0.9,
+                'WEIGHT_DECAY': 0.0001,
+                'WEIGHT_DECAY_NORM': 0.0,
+                'CHECKPOINT_PERIOD': 5000,
+                'AMP': {
+                    'ENABLED': True
+                }
+            },
+            'INPUT': {
+                'MIN_SIZE_TRAIN': (640, 672, 704, 736, 768, 800),
+                'MAX_SIZE_TRAIN': 1333,
+                'MIN_SIZE_TEST': 800,
+                'MAX_SIZE_TEST': 1333,
+                'FORMAT': 'BGR',
+            },
+            'TEST': {
+                'EVAL_PERIOD': 5000,
+                'DETECTIONS_PER_IMAGE': 100,
+            },
+            'OUTPUT_DIR': f'./output/{arch}_{datetime.now().strftime("%Y%m%d_%H%M%S")}',
+        }
+
+        # Add mask head for instance segmentation
+        if 'mask' in arch.lower():
+            config['MODEL']['MASK_ON'] = True
+            config['MODEL']['ROI_MASK_HEAD'] = {
+                'POOLER_RESOLUTION': 14,
+                'POOLER_SAMPLING_RATIO': 0,
+                'POOLER_TYPE': 'ROIAlignV2'
+            }
+
+        config.update(kwargs)
+        config['_metadata'] = {
+            'architecture': arch,
+            'arch_info': arch_info,
+            'task': self.task,
+            'framework': 'detectron2',
+            'generated_at': datetime.now().isoformat()
+        }
+
+        self.config = config
+        return config
+
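+    # Sketch of the assumed Detectron2 workflow (arch name from the table in
+    # the skill doc; output path is hypothetical):
+    #   trainer = VisionModelTrainer('data/coco', framework='detectron2')
+    #   cfg = trainer.generate_detectron2_config('faster_rcnn_R_50_FPN', epochs=24)
+    #   trainer.save_config('configs/faster_rcnn.py')
+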
+    def generate_mmdetection_config(self, arch: str, epochs: int = 12,
+                                    batch: int = 16, **kwargs) -> Dict[str, Any]:
+        """Generate MMDetection training configuration."""
+        if arch not in MMDETECTION_ARCHITECTURES:
+            available = ', '.join(MMDETECTION_ARCHITECTURES.keys())
+            raise ValueError(f"Unknown architecture: {arch}. Available: {available}")
+
+        arch_info = MMDETECTION_ARCHITECTURES[arch]
+
+        config = {
+            '_base_': [
+                f'../_base_/models/{arch}.py',
+                '../_base_/datasets/coco_detection.py',
+                '../_base_/schedules/schedule_1x.py',
+                '../_base_/default_runtime.py'
+            ],
+            'model': {
+                'roi_head': {
+                    'bbox_head': {
+                        'num_classes': len(self._get_classes())
+                    }
+                }
+            },
+            'data': {
+                'samples_per_gpu': batch // 2,
+                'workers_per_gpu': 4,
+                'train': {
+                    'type': 'CocoDataset',
+                    'ann_file': str(self.data_dir / 'annotations' / 'train.json'),
+                    'img_prefix': str(self.data_dir / 'images' / 'train'),
+                },
+                'val': {
+                    'type': 'CocoDataset',
+                    'ann_file': str(self.data_dir / 'annotations' / 'val.json'),
+                    'img_prefix': str(self.data_dir / 'images' / 'val'),
+                },
+                'test': {
+                    'type': 'CocoDataset',
+                    'ann_file': str(self.data_dir / 'annotations' / 'val.json'),
+                    'img_prefix': str(self.data_dir / 'images' / 'val'),
+                }
+            },
+            'optimizer': {
+                'type': 'SGD',
+                'lr': 0.02,
+                'momentum': 0.9,
+                'weight_decay': 0.0001
+            },
+            'optimizer_config': {
+                'grad_clip': {'max_norm': 35, 'norm_type': 2}
+            },
+            'lr_config': {
+                'policy': 'step',
+                'warmup': 'linear',
+                'warmup_iters': 500,
+                'warmup_ratio': 0.001,
+                'step': [int(epochs * 0.7), int(epochs * 0.9)]
+            },
+            'runner': {
+                'type': 'EpochBasedRunner',
+                'max_epochs': epochs
+            },
+            'checkpoint_config': {
+                'interval': 1
+            },
+            'log_config': {
+                'interval': 50,
+                'hooks': [
+                    {'type': 'TextLoggerHook'},
+                    {'type': 'TensorboardLoggerHook'}
+                ]
+            },
+            'work_dir': f'./work_dirs/{arch}_{datetime.now().strftime("%Y%m%d_%H%M%S")}',
+            'load_from': None,
+            'resume_from': None,
+            'fp16': {'loss_scale': 512.0}
+        }
+
+        config.update(kwargs)
+        config['_metadata'] = {
+            'architecture': arch,
+            'arch_info': arch_info,
+            'task': self.task,
+            'framework': 'mmdetection',
+            'generated_at': datetime.now().isoformat()
+        }
+
+        self.config = config
+        return config
+
+    def _get_classes(self) -> List[str]:
+        """Get class names from dataset."""
+        analysis = self.analyze_dataset()
+        classes = analysis['annotations']['classes']
+        if not classes:
+            classes = ['object']  # Default fallback
+        return classes
+
+    def save_config(self, output_path: str) -> str:
+        """Save configuration to file."""
+        output_path = Path(output_path)
+        output_path.parent.mkdir(parents=True, exist_ok=True)
+
+        if self.framework == 'ultralytics':
+            # YOLO uses YAML
+            import yaml
+            with open(output_path, 'w') as f:
+                yaml.dump(self.config, f, default_flow_style=False, sort_keys=False)
+        else:
+            # Detectron2 and MMDetection use Python configs
+            with open(output_path, 'w') as f:
+                f.write("# Auto-generated configuration\n")
+                f.write(f"# Generated at: {datetime.now().isoformat()}\n\n")
+                f.write(f"config = {json.dumps(self.config, indent=2)}\n")
+
+        logger.info(f"Configuration saved to {output_path}")
+        return str(output_path)
+
+    def generate_training_command(self) -> str:
+        """Generate the training command for the framework."""
+        if self.framework == 'ultralytics':
+            return f"yolo detect train data={self.config.get('data', 'data.yaml')} " \
+                   f"model={self.config.get('model', 'yolov8m.pt')} " \
+                   f"epochs={self.config.get('epochs', 100)} " \
+                   f"imgsz={self.config.get('imgsz', 640)}"
+        elif self.framework == 'detectron2':
+            return "python train_net.py --config-file config.yaml --num-gpus 1"
+        elif self.framework == 'mmdetection':
+            return "python tools/train.py config.py"
+        return ""
+
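+    # Illustrative output of generate_training_command() after generating a
+    # default Ultralytics config (dataset path is hypothetical):
+    #   yolo detect train data=data/coco/data.yaml model=yolov8m.pt epochs=100 imgsz=640
+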
print("TRAINING CONFIGURATION SUMMARY") + print("=" * 60) + print(f"Framework: {meta.get('framework', 'unknown')}") + print(f"Architecture: {meta.get('architecture', 'unknown')}") + print(f"Task: {meta.get('task', 'detection')}") + + if 'arch_info' in meta: + info = meta['arch_info'] + if 'params' in info: + print(f"Parameters: {info['params']}") + if 'map' in info: + print(f"COCO mAP: {info['map']}") + + print("-" * 60) + print("Training Command:") + print(f" {self.generate_training_command()}") + print("=" * 60 + "\n") + def main(): - """Main entry point""" parser = argparse.ArgumentParser( - description="Vision Model Trainer" + description="Generate vision model training configurations" ) - parser.add_argument('--input', '-i', required=True, help='Input path') - parser.add_argument('--output', '-o', required=True, help='Output path') - parser.add_argument('--config', '-c', help='Configuration file') - parser.add_argument('--verbose', '-v', action='store_true', help='Verbose output') - + parser.add_argument('data_dir', help='Path to dataset directory') + parser.add_argument('--task', choices=['detection', 'segmentation'], + default='detection', help='Task type') + parser.add_argument('--framework', choices=['ultralytics', 'detectron2', 'mmdetection'], + default='ultralytics', help='Training framework') + parser.add_argument('--arch', default='yolov8m', + help='Model architecture') + parser.add_argument('--epochs', type=int, default=100, help='Training epochs') + parser.add_argument('--batch', type=int, default=16, help='Batch size') + parser.add_argument('--imgsz', type=int, default=640, help='Image size') + parser.add_argument('--output', '-o', help='Output config file path') + parser.add_argument('--analyze-only', action='store_true', + help='Only analyze dataset, do not generate config') + parser.add_argument('--json', action='store_true', + help='Output as JSON') + args = parser.parse_args() - - if args.verbose: - logging.getLogger().setLevel(logging.DEBUG) - + + trainer = VisionModelTrainer( + data_dir=args.data_dir, + task=args.task, + framework=args.framework + ) + + # Analyze dataset + analysis = trainer.analyze_dataset() + + if args.analyze_only: + if args.json: + print(json.dumps(analysis, indent=2)) + else: + print("\nDataset Analysis:") + print(f" Path: {analysis['path']}") + print(f" Format: {analysis['annotations']['format']}") + print(f" Classes: {len(analysis['annotations']['classes'])}") + print(f" Images - Train: {analysis['images']['train']}, " + f"Val: {analysis['images']['val']}, " + f"Test: {analysis['images']['test']}") + if analysis['recommendations']: + print("\nRecommendations:") + for rec in analysis['recommendations']: + print(f" - {rec}") + return + + # Generate configuration try: - config = { - 'input': args.input, - 'output': args.output - } - - processor = VisionModelTrainer(config) - results = processor.process() - - print(json.dumps(results, indent=2)) - sys.exit(0) - - except Exception as e: - logger.error(f"Fatal error: {e}") + if args.framework == 'ultralytics': + config = trainer.generate_yolo_config( + arch=args.arch, + epochs=args.epochs, + batch=args.batch, + imgsz=args.imgsz + ) + elif args.framework == 'detectron2': + config = trainer.generate_detectron2_config( + arch=args.arch, + epochs=args.epochs, + batch=args.batch + ) + elif args.framework == 'mmdetection': + config = trainer.generate_mmdetection_config( + arch=args.arch, + epochs=args.epochs, + batch=args.batch + ) + except ValueError as e: + logger.error(str(e)) sys.exit(1) + # Output + 
+    # Output
+    if args.json:
+        print(json.dumps(config, indent=2))
+    else:
+        trainer.print_summary()
+
+    if args.output:
+        trainer.save_config(args.output)
+
+
 if __name__ == '__main__':
     main()
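+
+# Usage sketches built from the argparse flags above (paths are hypothetical):
+#   python scripts/vision_model_trainer.py data/coco/ --analyze-only --json
+#   python scripts/vision_model_trainer.py data/coco/ --framework ultralytics \
+#       --arch yolov8m --epochs 100 --batch 16 --imgsz 640 \
+#       --output configs/yolov8m.yaml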