Computer vision spans real-time object detection (YOLOv8, RT-DETR), dense segmentation (Segment Anything, Mask2Former), and large-scale visual model training. VRAM requirements range from about 8 GB for inference on small detectors to 80 GB or more for training large vision transformers such as ViT-Huge or the larger DINOv2 variants.
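To make the gap between inference and training concrete, the sketch below estimates the memory taken by model state alone during training. It assumes plain FP32 training with Adam (weights, gradients, and two moment buffers, so 16 bytes per parameter) and uses approximate published parameter counts; the function name `training_state_gib` is illustrative, and activations, which often dominate, are deliberately left out.

```python
def training_state_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Memory for weights, gradients, and Adam moments, excluding activations.

    The default of 16 bytes/param is an assumption: FP32 weights (4)
    + FP32 gradients (4) + two FP32 Adam moment buffers (8).
    Mixed precision or a different optimizer changes this figure.
    """
    return num_params * bytes_per_param / 2**30


# Approximate parameter counts; treat these as ballpark figures.
models = {
    "ViT-Base": 86_000_000,
    "ViT-Large": 307_000_000,
    "ViT-Huge": 632_000_000,
}

for name, n in models.items():
    print(f"{name}: {training_state_gib(n):.1f} GiB of model state before activations")
```

Passing `bytes_per_param=4` (FP32) or `2` (FP16) gives the corresponding weights-only figure, which is closer to what inference needs.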
For inference, modern detectors like YOLOv8 achieve real-time performance (30-60 FPS at 640x640) on GPUs as modest as a T4 or RTX 3060, though the exact rate depends on the model variant and numeric precision. Segmentation models (SAM, SAM 2) typically need 16-24 GB for high-resolution images. Training from scratch or fine-tuning larger models like CLIP, DINOv2, or YOLO-World benefits from 24 GB or more; activation memory grows roughly linearly with batch size, so the batch size you can afford is bounded by available VRAM.
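The linear relationship between batch size and memory can be turned into a rough planning tool. The sketch below assumes peak memory is well approximated by a fixed cost plus a per-sample cost, fits that line from two measurements, and solves for the largest batch that fits. The helper names, the 10% headroom, and the two sample measurements are hypothetical placeholders, not benchmarks of any particular model.

```python
def fit_linear(b1: int, mem1_gib: float, b2: int, mem2_gib: float) -> tuple[float, float]:
    """Fit peak_memory = fixed + per_sample * batch from two measurements."""
    if b1 == b2:
        raise ValueError("need two different batch sizes")
    per_sample = (mem2_gib - mem1_gib) / (b2 - b1)
    fixed = mem1_gib - per_sample * b1
    return fixed, per_sample


def max_batch_size(vram_gib: float, fixed_gib: float, per_sample_gib: float,
                   headroom: float = 0.9) -> int:
    """Largest batch that fits, keeping (1 - headroom) of VRAM in reserve."""
    if per_sample_gib <= 0:
        raise ValueError("per-sample memory must be positive")
    budget = vram_gib * headroom - fixed_gib
    return max(0, int(budget // per_sample_gib))


# Hypothetical measurements: 7 GiB at batch 2, 8 GiB at batch 4.
fixed, per_sample = fit_linear(2, 7.0, 4, 8.0)
for vram in (16, 24, 80):
    print(f"{vram} GB card: batch size up to {max_batch_size(vram, fixed, per_sample)}")
```

In practice the two measurements would come from a profiler or the framework's peak-memory counters, and the fit should be rechecked at the target batch size because allocator fragmentation and workspace buffers make the relationship only approximately linear.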
PyTorch dominates the framework landscape, with Hugging Face Transformers, Ultralytics YOLO, and Detectron2 as the most-used libraries. For production deployments on NVIDIA GPUs, TensorRT can significantly accelerate inference, typically through layer fusion and reduced-precision (FP16 or INT8) execution.
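Whether an optimized runtime actually helps is best checked by measuring throughput before and after the change. The following is a minimal, framework-agnostic timing sketch: `measure_fps` is a hypothetical helper that times any zero-argument callable, and the workload shown is a CPU stand-in rather than a real model. For GPU inference the callable would need to synchronize the device (for example with `torch.cuda.synchronize()` in PyTorch) before returning, otherwise asynchronous kernel launches make the timing misleading.

```python
import time
from typing import Callable


def measure_fps(infer: Callable[[], object], warmup: int = 10, iters: int = 100) -> float:
    """Return average calls per second of `infer` after a warm-up phase."""
    # Warm-up runs let caches, JIT compilation, and autotuning settle.
    for _ in range(warmup):
        infer()
    start = time.perf_counter()
    for _ in range(iters):
        infer()
    elapsed = time.perf_counter() - start
    return iters / elapsed


# Stand-in workload; replace with a call that runs one inference step.
fps = measure_fps(lambda: sum(range(10_000)))
print(f"{fps:.0f} iterations per second")
```

Running the same harness against the baseline model and the optimized engine, with identical input size and batch size, gives a like-for-like comparison.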