Computer Vision

The field of AI that enables computers to interpret and reason about images and video.

In common use since the 1960s

Computer vision covers everything from classifying whether a photo contains a cat (a problem largely solved a decade ago) to understanding the content of a YouTube video well enough to summarise it (mainstream as of 2026).

The capability landscape that ships in production today:

  • Image classification — assign a label to an image (cat, dog, manufacturing defect, medical condition).
  • Object detection — find where objects are with bounding boxes (cars in traffic, products on shelves).
  • Image segmentation — assign every pixel to an object or class (medical organ segmentation, autonomous driving lane detection).
  • Optical character recognition (OCR) — extract text from images and scanned documents.
  • Pose estimation — locate human joints and body posture (fitness apps, sports analytics).
  • Depth estimation — infer 3D depth from 2D images.
  • Vision-language understanding — answer questions about images, generate captions, describe scenes (powered by multimodal LLMs).
  • Video understanding — temporal reasoning over clips: what happened, who did what, what comes next.
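
As a small illustration of the object detection item above, here is a minimal sketch of intersection over union (IoU), the standard overlap score used to compare a predicted bounding box against a ground-truth box. The `Box` type, the corner-coordinate convention, and the sample boxes are assumptions made for this example, not part of any particular library.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box: (x1, y1) is the top-left corner, (x2, y2) the bottom-right."""
    x1: float
    y1: float
    x2: float
    y2: float


def area(box: Box) -> float:
    # Clamp at zero so an "inverted" box (no overlap) has zero area.
    return max(0.0, box.x2 - box.x1) * max(0.0, box.y2 - box.y1)


def iou(a: Box, b: Box) -> float:
    # The intersection is the box spanned by the innermost corners.
    inter = Box(max(a.x1, b.x1), max(a.y1, b.y1), min(a.x2, b.x2), min(a.y2, b.y2))
    inter_area = area(inter)
    union = area(a) + area(b) - inter_area
    return inter_area / union if union > 0 else 0.0


# Hypothetical boxes: a prediction shifted halfway off the ground truth.
predicted = Box(0, 0, 10, 10)
ground_truth = Box(5, 0, 15, 10)
print(round(iou(predicted, ground_truth), 3))  # 0.333
```

A detection is typically counted as correct when its IoU with a ground-truth box exceeds a chosen threshold; 0.5 is a common choice, but the right value depends on the task.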

The technology stack in 2026:

  • Convolutional neural networks (CNNs) — still strong for narrow tasks (medical imaging, manufacturing inspection); efficient and well-understood.
  • Vision transformers (ViT) — dominant for general vision tasks; the image-encoder architecture behind most CLIP variants and most multimodal LLMs.
  • Multimodal LLMs — GPT-5, Claude Sonnet 4, Gemini 2.5 all process images natively; for many vision tasks they have replaced specialised models.
  • Diffusion models — generate images and video; covered separately in glossary entries on image and video generation.
  • CLIP and friends — text-image embedding models that power "find me images that match this description".
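
To make the "find me images that match this description" idea concrete, the sketch below ranks images by cosine similarity between a text embedding and image embeddings. It assumes a CLIP-style model has already mapped both into the same vector space; the three-dimensional vectors and file names are invented for illustration (real embeddings have hundreds of dimensions), and `rank_images` is a hypothetical helper.

```python
import math


def cosine_similarity(u, v):
    # Cosine of the angle between two vectors: 1.0 means same direction.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_images(text_embedding, image_embeddings):
    # Score every image against the query and sort best match first.
    scored = [
        (name, cosine_similarity(text_embedding, emb))
        for name, emb in image_embeddings.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy embeddings (assumed, not produced by a real model).
image_embeddings = {
    "beach.jpg": [0.9, 0.1, 0.0],
    "forest.jpg": [0.1, 0.9, 0.1],
    "city.jpg": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]  # stands in for the text "a sunny beach"

ranking = rank_images(query_embedding, image_embeddings)
print(ranking[0][0])  # beach.jpg
```

In a production system the linear scan would usually be replaced by an approximate nearest-neighbour index, but the scoring idea is the same.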

Where computer vision drives real US business value in 2026:

  • Retail — shelf monitoring, planogram compliance, automated checkout.
  • Healthcare — radiology assistance, dermatology screening, pathology slide analysis.
  • Manufacturing — defect detection, robotic guidance, safety monitoring.
  • Insurance — automated claims processing from damage photos.
  • Real estate — property feature extraction from listing photos, virtual staging.
  • Content moderation — automated screening of uploaded images and video at scale on large social platforms.

For a US engineering team in 2026, the build-vs-buy decision in computer vision has shifted. For general "what is in this image" questions, calling a multimodal LLM is often cheaper and more accurate than training a custom model, since there is no labelled dataset to collect and no training pipeline to maintain. For high-volume narrow tasks (defect detection on a specific assembly line), a small custom model still wins on per-image cost and latency. The right architecture is increasingly hybrid: a multimodal LLM for orchestration and edge cases, specialised models for the high-volume hot path.
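
One way to picture the hybrid architecture is a confidence-based router: the specialised model handles every image first, and only low-confidence cases are escalated to the multimodal LLM. The sketch below is a minimal illustration under assumptions. `make_router`, the `Prediction` type, the 0.9 threshold, and both stub models are hypothetical stand-ins; a real system would call an actual inference service and an actual LLM API in their place.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# A model here is any callable that maps image bytes to (label, confidence).
Model = Callable[[bytes], Tuple[str, float]]


@dataclass
class Prediction:
    label: str
    confidence: float
    source: str  # which model produced the answer


def make_router(specialised: Model, fallback: Model, threshold: float = 0.9):
    def route(image: bytes) -> Prediction:
        # Hot path: the cheap, fast specialised model answers first.
        label, confidence = specialised(image)
        if confidence >= threshold:
            return Prediction(label, confidence, "specialised")
        # Edge case: escalate to the slower, more general model.
        label, confidence = fallback(image)
        return Prediction(label, confidence, "multimodal_llm")

    return route


# Stub models, invented so the example runs without any external service.
def stub_defect_detector(image: bytes) -> Tuple[str, float]:
    return ("ok", 0.98) if len(image) % 2 == 0 else ("defect", 0.55)


def stub_multimodal_llm(image: bytes) -> Tuple[str, float]:
    return ("defect", 0.80)


route = make_router(stub_defect_detector, stub_multimodal_llm, threshold=0.9)
print(route(b"ab").source)   # specialised
print(route(b"abc").source)  # multimodal_llm
```

The threshold is the main tuning knob: raising it sends more traffic to the LLM (higher cost, better coverage of unusual cases), while lowering it keeps more traffic on the hot path.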
