Computer vision is the field of AI that enables computers to interpret and reason about images and video. It covers everything from deciding whether a photo contains a cat (a problem largely solved a decade ago) to understanding a YouTube video well enough to summarise it (mainstream by 2026).
The capability landscape that ships in production today:
- Image classification — assign a label to an image (cat, dog, manufacturing defect, medical condition).
- Object detection — find where objects are with bounding boxes (cars in traffic, products on shelves).
- Image segmentation — assign every pixel to an object or class (medical organ segmentation, autonomous driving lane detection).
- Optical character recognition (OCR) — extract text from images and scanned documents.
- Pose estimation — locate human joints and body posture (fitness apps, sports analytics).
- Depth estimation — infer 3D depth from 2D images.
- Vision-language understanding — answer questions about images, generate captions, describe scenes (powered by multimodal LLMs).
- Video understanding — temporal reasoning over clips: what happened, who did what, what comes next.
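The first three capabilities differ mainly in the shape of their output: one label per image, one box per object, or one class per pixel. The sketch below illustrates that difference in plain Python. The type names (`Classification`, `Detection`), the helper functions, and the toy 2x2 score grid are all hypothetical and chosen for illustration; they do not correspond to any particular library's API.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Classification:
    """Image classification: a single label for the whole image."""
    label: str
    score: float


@dataclass
class Detection:
    """Object detection: a label plus a bounding box in pixel coordinates."""
    label: str
    score: float
    box: Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max)


def mask_from_scores(scores: List[List[List[float]]]) -> List[List[int]]:
    """Segmentation: pick the highest-scoring class index for every pixel.

    scores[y][x] is the list of per-class scores for the pixel at (x, y).
    """
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


def box_from_mask(mask: List[List[int]], class_id: int) -> Optional[Tuple[int, int, int, int]]:
    """Tightest box around all pixels of class_id, or None if the class is absent."""
    coords = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value == class_id
    ]
    if not coords:
        return None
    xs, ys = zip(*coords)
    return (min(xs), min(ys), max(xs), max(ys))


# Toy 2x2 image with two classes (0 = background, 1 = object); values are made up.
TOY_SCORES = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.3, 0.7], [0.6, 0.4]],
]
TOY_MASK = mask_from_scores(TOY_SCORES)
```

A segmentation mask carries strictly more information than a box: `box_from_mask(TOY_MASK, 1)` recovers a box from the mask, but a mask cannot be recovered from a box.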
The technology stack in 2026:
- Convolutional neural networks (CNNs) — still strong for narrow tasks (medical imaging, manufacturing inspection); efficient and well-understood.
- Vision transformers (ViT) — dominant for general vision tasks; the image encoder behind the most widely used CLIP variants and most multimodal LLMs.
- Multimodal LLMs — models such as GPT-5, Claude Sonnet 4, and Gemini 2.5 accept images as input natively; for many vision tasks they have replaced specialised models.
- Diffusion models — generate images and video; covered separately in glossary entries on image and video generation.
- CLIP and friends — text-image embedding models that power "find me images that match this description".
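The "find me images that match this description" pattern comes down to embedding the text query and every image into the same vector space, then ranking images by similarity to the query. The sketch below shows only the ranking step. It assumes the embeddings already exist; the three-dimensional vectors, the file names, and the `rank_images` helper are invented for illustration, whereas real CLIP-style embeddings have hundreds of dimensions and come from a trained model.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    text_embedding: List[float],
    image_embeddings: Dict[str, List[float]],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Return the top_k (image name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(text_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed embeddings (toy values, not model output).
IMAGE_EMBEDDINGS = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "city.jpg": [0.1, 0.9, 0.1],
    "forest.jpg": [0.0, 0.2, 0.9],
}
# Stands in for the embedding of a text query such as "a sunny beach".
QUERY_EMBEDDING = [0.8, 0.2, 0.1]

results = rank_images(QUERY_EMBEDDING, IMAGE_EMBEDDINGS, top_k=2)
```

With these toy vectors, `beach.jpg` ranks first because its embedding points in nearly the same direction as the query. At production scale the linear scan would typically be replaced by an approximate nearest-neighbour index.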
Where computer vision drives real US business value in 2026:
- Retail — shelf monitoring, planogram compliance, automated checkout.
- Healthcare — radiology assistance, dermatology screening, pathology slide analysis.
- Manufacturing — defect detection, robotic guidance, safety monitoring.
- Insurance — automated claims processing from damage photos.
- Real estate — property feature extraction from listing photos, virtual staging.
- Content moderation — automated screening of images and video at scale on major social platforms.
For a US engineering team in 2026, the build-vs-buy decision in computer vision has shifted dramatically. For general "what is in this image" questions, calling a multimodal LLM is often cheaper and better than training a custom model. For high-volume narrow tasks (defect detection on a specific assembly line), a custom small model still wins on cost and latency. The right architecture is increasingly hybrid: multimodal LLM for orchestration and edge cases, specialised models for the high-volume hot path.
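One way to read the hybrid architecture is as a confidence-based router: the specialised model handles every request first, and only low-confidence cases are escalated to the multimodal LLM. The sketch below is a minimal illustration under that assumption. The `route` function, the `Verdict` type, the 0.9 threshold, and both stub models are hypothetical; in a real system the two callables would wrap an in-house model and a vendor API.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Verdict:
    label: str
    confidence: Optional[float]  # None when the answer came from the LLM
    source: str                  # "specialised" or "multimodal_llm"


def route(
    image: bytes,
    specialised_model: Callable[[bytes], Tuple[str, float]],
    multimodal_llm: Callable[[bytes], str],
    min_confidence: float = 0.9,  # assumed threshold; tune on validation data
) -> Verdict:
    """Serve the hot path with the small model; escalate uncertain cases."""
    label, confidence = specialised_model(image)
    if confidence >= min_confidence:
        return Verdict(label, confidence, "specialised")
    # Edge case: the small model is unsure, so pay for the slower LLM call.
    return Verdict(multimodal_llm(image), None, "multimodal_llm")


# Stubs standing in for real models so the sketch runs on its own.
def stub_defect_model(image: bytes) -> Tuple[str, float]:
    # Pretend that very short payloads are ambiguous inputs.
    return ("ok", 0.97) if len(image) > 4 else ("defect", 0.55)


def stub_llm(image: bytes) -> str:
    return "defect: scratch near edge"


easy = route(b"123456", stub_defect_model, stub_llm)
hard = route(b"12", stub_defect_model, stub_llm)
```

Here `easy` is answered by the specialised model and `hard` is escalated. The threshold sets the trade-off: raising it sends more traffic to the LLM, which increases cost and latency, while lowering it keeps more traffic on the cheap path and accepts more of the small model's uncertain answers.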