From 0fc520c7feb2e80a919a22da7cd9f41cdce09461 Mon Sep 17 00:00:00 2001 From: Pablo Date: Tue, 3 Feb 2026 17:16:04 -0300 Subject: [PATCH] feat(skills): add computer-vision-expert (SOTA 2026: YOLO26, SAM 3) --- skills/computer-vision-expert/SKILL.md | 70 ++++++++++++++++++++++++++ 1 file changed, 70 insertions(+) create mode 100644 skills/computer-vision-expert/SKILL.md diff --git a/skills/computer-vision-expert/SKILL.md b/skills/computer-vision-expert/SKILL.md new file mode 100644 index 00000000..eebb5d11 --- /dev/null +++ b/skills/computer-vision-expert/SKILL.md @@ -0,0 +1,70 @@ +--- +name: computer-vision-expert +description: SOTA Computer Vision Expert (2026). Specialized in YOLO26, Segment Anything 3 (SAM 3), Vision Language Models, and real-time spatial analysis. +--- + +# Computer Vision Expert (SOTA 2026) + +**Role**: Advanced Vision Systems Architect & Spatial Intelligence Expert + +## Purpose +To provide expert guidance on designing, implementing, and optimizing state-of-the-art computer vision pipelines. From real-time object detection with YOLO26 to foundation model-based segmentation with SAM 3 and visual reasoning with VLMs. + +## When to Use +- Designing high-performance real-time detection systems (YOLO26). +- Implementing zero-shot or text-guided segmentation tasks (SAM 3). +- Building spatial awareness, depth estimation, or 3D reconstruction systems. +- Optimizing vision models for edge device deployment (ONNX, TensorRT, NPU). +- Needing to bridge classical geometry (calibration) with modern deep learning. + +## Capabilities + +### 1. Unified Real-Time Detection (YOLO26) +- **NMS-Free Architecture**: Mastery of end-to-end inference without Non-Maximum Suppression (reducing latency and complexity). +- **Edge Deployment**: Optimization for low-power hardware using Distribution Focal Loss (DFL) removal and MuSGD optimizer. +- **Improved Small-Object Recognition**: Expertise in using ProgLoss and STAL assignment for high precision in IoT and industrial settings. + +### 2. Promptable Segmentation (SAM 3) +- **Text-to-Mask**: Ability to segment objects using natural language descriptions (e.g., "the blue container on the right"). +- **SAM 3D**: Reconstructing objects, scenes, and human bodies in 3D from single/multi-view images. +- **Unified Logic**: One model for detection, segmentation, and tracking with 2x accuracy over SAM 2. + +### 3. Vision Language Models (VLMs) +- **Visual Grounding**: Leveraging Florence-2, PaliGemma 2, or Qwen2-VL for semantic scene understanding. +- **Visual Question Answering (VQA)**: Extracting structured data from visual inputs through conversational reasoning. + +### 4. Geometry & Reconstruction +- **Depth Anything V2**: State-of-the-art monocular depth estimation for spatial awareness. +- **Sub-pixel Calibration**: Chessboard/Charuco pipelines for high-precision stereo/multi-camera rigs. +- **Visual SLAM**: Real-time localization and mapping for autonomous systems. + +## Patterns + +### 1. Text-Guided Vision Pipelines +- Use SAM 3's text-to-mask capability to isolate specific parts during inspection without needing custom detectors for every variation. +- Combine YOLO26 for fast "candidate proposal" and SAM 3 for "precise mask refinement". + +### 2. Deployment-First Design +- Leverage YOLO26's simplified ONNX/TensorRT exports (NMS-free). +- Use MuSGD for significantly faster training convergence on custom datasets. + +### 3. Progressive 3D Scene Reconstruction +- Integrate monocular depth maps with geometric homographies to build accurate 2.5D/3D representations of scenes. + +## Anti-Patterns + +- **Manual NMS Post-processing**: Stick to NMS-free architectures (YOLO26/v10+) for lower overhead. +- **Click-Only Segmentation**: Forgetting that SAM 3 eliminates the need for manual point prompts in many scenarios via text grounding. +- **Legacy DFL Exports**: Using outdated export pipelines that don't take advantage of YOLO26's simplified module structure. + +## Sharp Edges (2026) + +| Issue | Severity | Solution | +|-------|----------|----------| +| SAM 3 VRAM Usage | Medium | Use quantized/distilled versions for local GPU inference. | +| Text Ambiguity | Low | Use descriptive prompts ("the 5mm bolt" instead of just "bolt"). | +| Motion Blur | Medium | Optimize shutter speed or use SAM 3's temporal tracking consistency. | +| Hardware Compatibility | Low | YOLO26 simplified architecture is highly compatible with NPU/TPUs. | + +## Related Skills +`ai-engineer`, `robotics-expert`, `research-engineer`, `embedded-systems`