Computer Vision: Definition & Meaning | AI Glossary

Computer vision is the field of AI that enables computers to interpret and reason about images and video. It covers everything from classifying whether a photo contains a cat (a problem largely solved a decade ago) to understanding the content of a YouTube video well enough to summarise it (now mainstream in 2026).

The capabilities that ship in production today:

Image classification — assign a label to an image (cat, dog, manufacturing defect, medical condition).
Object detection — find where objects are with bounding boxes (cars in traffic, products on shelves).
Image segmentation — assign every pixel to an object or class (medical organ segmentation, autonomous driving lane detection).
Optical character recognition (OCR) — extract text from images and scanned documents.
Pose estimation — locate human joints and body posture (fitness apps, sports analytics).
Depth estimation — infer 3D depth from 2D images.
Vision-language understanding — answer questions about images, generate captions, describe scenes (powered by multimodal LLMs).
Video understanding — temporal reasoning over clips: what happened, who did what, what comes next.
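
To make the difference between the first three tasks concrete, here is a minimal sketch of what each one returns for a single image. The class names, field names, and sample values are assumptions chosen for illustration and do not come from any particular library: classification yields one label for the whole image, detection adds a bounding box, and segmentation yields a class id for every pixel.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    """Image classification: one label for the whole image."""
    label: str
    score: float  # model confidence in [0, 1]


@dataclass(frozen=True)
class Detection:
    """Object detection: a label plus a bounding box in pixel coordinates."""
    label: str
    score: float
    x_min: int
    y_min: int
    x_max: int
    y_max: int

    def area(self) -> int:
        """Box area in pixels (0 if the box is degenerate)."""
        return max(0, self.x_max - self.x_min) * max(0, self.y_max - self.y_min)


# Image segmentation: one class id per pixel, stored as rows of ints.
SegmentationMask = list[list[int]]


def pixels_per_class(mask: SegmentationMask) -> Counter:
    """Count how many pixels were assigned to each class id."""
    return Counter(class_id for row in mask for class_id in row)


# Hypothetical outputs for one 4x3 image (made-up values, not real model output).
whole_image = Classification(label="cat", score=0.97)
box = Detection(label="cat", score=0.91, x_min=1, y_min=0, x_max=4, y_max=2)
mask = [
    [0, 1, 1, 1],  # 0 = background, 1 = cat
    [0, 1, 1, 1],
    [0, 0, 0, 0],
]
```

The three outputs describe the same image at increasing spatial detail, which is why segmentation is usually the most expensive of the three to label and to run.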

The technology stack in 2026:

Convolutional neural networks (CNNs) — still strong for narrow tasks (medical imaging, manufacturing inspection); efficient and well-understood.
Vision transformers (ViT) — dominant for general vision tasks; the underlying architecture of CLIP and most multimodal LLMs.
Multimodal LLMs — GPT-5, Claude Sonnet 4, Gemini 2.5 all process images natively; for many vision tasks they have replaced specialised models.
Diffusion models — generate images and video; covered separately in glossary entries on image and video generation.
CLIP and friends — text-image embedding models that power "find me images that match this description".
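
The "find me images that match this description" use case can be sketched as a nearest-neighbour search over embeddings. The example below assumes the text query and the images have already been embedded into the same vector space by a CLIP-style model. The toy 3-dimensional vectors, file names, and function names are made up for illustration; real embeddings have hundreds of dimensions.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_images(
    query_embedding: list[float],
    image_embeddings: dict[str, list[float]],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Return the top_k (image name, similarity) pairs, most similar first."""
    scored = [
        (name, cosine_similarity(query_embedding, embedding))
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Hypothetical precomputed image embeddings (toy values).
image_index = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}

# Stand-in for the embedding of a text query such as "a sunny beach".
query = [0.8, 0.2, 0.1]

results = rank_images(query, image_index, top_k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

A linear scan like this is fine for small collections. Larger collections would typically swap it for an approximate nearest-neighbour index, which is outside the scope of this sketch.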

Where computer vision drives real US business value in 2026:

Retail — shelf monitoring, planogram compliance, automated checkout.
Healthcare — radiology assistance, dermatology screening, pathology slide analysis.
Manufacturing — defect detection, robotic guidance, safety monitoring.
Insurance — automated claims processing from damage photos.
Real estate — property feature extraction from listing photos, virtual staging.
Content moderation — at scale on every social platform.

For a US engineering team in 2026, the build-vs-buy decision in computer vision has shifted dramatically. For general "what is in this image" questions, calling a multimodal LLM is often cheaper and gives better results than training a custom model. For high-volume narrow tasks, such as defect detection on a specific assembly line, a small custom model still wins on cost and latency. The right architecture is increasingly hybrid: a multimodal LLM for orchestration and edge cases, and specialised models for the high-volume hot path.
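
One way to picture the hybrid pattern is a confidence-based router: the specialised model answers first, and only low-confidence inputs are escalated to a multimodal LLM. The sketch below covers just the edge-case half of the pattern, not orchestration. Every name in it is hypothetical: the two model callables are stubs standing in for real inference calls, and the 0.8 threshold is an arbitrary assumption that would need tuning on real data.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Prediction:
    label: str
    fast_confidence: float  # confidence reported by the specialised model
    source: str             # which path produced the final label


def make_hybrid_classifier(
    fast_model: Callable[[bytes], tuple[str, float]],
    llm_fallback: Callable[[bytes], str],
    min_confidence: float = 0.8,
) -> Callable[[bytes], Prediction]:
    """Build a classifier that tries the specialised model before the LLM."""

    def classify(image: bytes) -> Prediction:
        label, confidence = fast_model(image)
        if confidence >= min_confidence:
            # Hot path: the cheap, low-latency model is confident enough.
            return Prediction(label, confidence, source="specialised")
        # Edge case: escalate to the slower, more general model.
        return Prediction(llm_fallback(image), confidence, source="multimodal-llm")

    return classify


# Stubs standing in for real models (assumptions for illustration only).
def stub_defect_model(image: bytes) -> tuple[str, float]:
    # Pretend even-length payloads are easy cases and odd-length ones are hard.
    return ("ok", 0.95) if len(image) % 2 == 0 else ("defect", 0.55)


def stub_llm(image: bytes) -> str:
    return "scratch on housing"


classify = make_hybrid_classifier(stub_defect_model, stub_llm)
easy = classify(b"ab")   # handled by the specialised model
hard = classify(b"abc")  # escalated to the LLM stub
```

Because the escalation rate depends on the threshold, tracking how often the fallback fires gives a direct read on how much traffic stays on the cheap path.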
