Image Segmentation

Labeling every pixel in an image so each region corresponds to an object or class.

In common use since 2007

Image segmentation is the computer vision task of labelling every pixel in an image so that each region corresponds to an object or class. Where object detection draws bounding boxes around things, segmentation traces their exact outlines. The output is a mask with the same height and width as the input, in which each pixel holds a predicted label; it is usually visualised by colouring each pixel according to its class.
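
To make the contrast with bounding boxes concrete, here is a minimal sketch in plain Python. It assumes a tiny hand-made binary mask, with nested lists standing in for the image arrays a real pipeline would use; the name `mask_to_bbox` is illustrative and not taken from any library.

```python
from typing import List, Optional, Tuple

Mask = List[List[int]]  # 1 = object pixel, 0 = background


def mask_to_bbox(mask: Mask) -> Optional[Tuple[int, int, int, int]]:
    """Return the tight box (x_min, y_min, x_max, y_max) around a mask, or None if empty."""
    xs = [x for row in mask for x, v in enumerate(row) if v]
    ys = [y for y, row in enumerate(mask) if any(row)]
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


# Hypothetical 5x6 mask of a roughly triangular object.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

bbox = mask_to_bbox(mask)
x0, y0, x1, y1 = bbox
mask_area = sum(map(sum, mask))            # pixels that belong to the object
box_area = (x1 - x0 + 1) * (y1 - y0 + 1)   # pixels a detector's box would claim

print(bbox, mask_area, box_area)
```

The box can always be derived from the mask, but not the other way round: the box also covers the background pixels around the object, which is the information a detector alone cannot give you.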

Three main flavours:

  • Semantic segmentation — every pixel gets a class label (road, car, person, building) but instances of the same class are not distinguished.
  • Instance segmentation — every pixel gets both a class and an instance ID (car_1, car_2, person_1). Critical for counting and tracking.
  • Panoptic segmentation — combines semantic and instance segmentation, treating background classes (sky, road) semantically and foreground objects (person, car) per-instance.
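
The three flavours differ mainly in what is stored per pixel. The sketch below is one possible illustration: the class IDs, the `STUFF` set and the `class_id * 1000 + instance_id` encoding are assumptions chosen for this example, not a format required by any particular model or dataset.

```python
from typing import List

Grid = List[List[int]]

# Hypothetical class IDs for illustration.
SKY, ROAD, CAR = 0, 1, 2
STUFF = {SKY, ROAD}     # background classes, never split into instances
LABEL_DIVISOR = 1000    # assumed encoding: class_id * 1000 + instance_id

# Semantic map: one class ID per pixel; the two cars are indistinguishable.
semantic: Grid = [
    [SKY,  SKY,  SKY,  SKY,  SKY],
    [CAR,  CAR,  ROAD, CAR,  CAR],
    [ROAD, ROAD, ROAD, ROAD, ROAD],
]

# Instance map: 0 = no instance, otherwise a per-object ID.
instance: Grid = [
    [0, 0, 0, 0, 0],
    [1, 1, 0, 2, 2],
    [0, 0, 0, 0, 0],
]


def to_panoptic(semantic: Grid, instance: Grid) -> Grid:
    """Merge both maps: stuff pixels get instance 0, thing pixels keep their instance ID."""
    return [
        [cls * LABEL_DIVISOR + (0 if cls in STUFF else inst)
         for cls, inst in zip(sem_row, inst_row)]
        for sem_row, inst_row in zip(semantic, instance)
    ]


panoptic = to_panoptic(semantic, instance)

# Counting objects needs instance information: distinct panoptic IDs of class CAR.
car_count = len({p for row in panoptic for p in row if p // LABEL_DIVISOR == CAR})
print(panoptic[1], car_count)
```

The semantic map alone cannot say how many cars are present; the count only becomes available once instance IDs are attached.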

The 2026 model landscape:

  • Segment Anything Model (SAM 2 / SAM 3) — Meta's family of promptable foundation models for segmentation, driven by point, bounding-box or mask prompts. SAM 2 (2024) extended the original model to video with temporally consistent masks; SAM 3 (2025) added text (concept) prompts.
  • Mask R-CNN — older but still widely deployed for instance segmentation; well-understood, fast.
  • Mask2Former / MaskFormer — transformer-based, dominant on the panoptic benchmarks.
  • DeepLab / U-Net — classic CNN architectures still widely used in medical imaging.
  • Multimodal LLMs with grounding — Gemini, Claude and GPT can return coarse localisation hints (for example approximate boxes or regions), though specialised models still lead on mask quality.

Production use cases that pay for themselves:

  • Medical imaging — segmenting organs, tumours and anatomical structures for surgical planning and diagnostics.
  • Autonomous driving — lane detection, road segmentation, pedestrian masking.
  • Photo editing — background removal, selective adjustments, virtual try-on for e-commerce.
  • Manufacturing — defect localisation on assembly lines.
  • Satellite and drone imagery — crop monitoring, deforestation tracking, infrastructure inspection.
  • AR / VR — separating the user from the background for video conferencing and mixed reality.
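
Several of these use cases, background removal in particular, come down to applying a predicted mask to the original pixels. The following is a minimal sketch under stated assumptions: the image is a nested list of RGB tuples, the mask is binary, and `cut_out` is a hypothetical helper name rather than a library function.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def cut_out(image: List[List[RGB]], mask: List[List[int]]) -> List[List[RGBA]]:
    """Keep pixels where mask == 1 and make all other pixels fully transparent."""
    if len(image) != len(mask) or any(len(r) != len(m) for r, m in zip(image, mask)):
        raise ValueError("image and mask must have the same height and width")
    return [
        [(r, g, b, 255 if keep else 0) for (r, g, b), keep in zip(img_row, mask_row)]
        for img_row, mask_row in zip(image, mask)
    ]


# Hypothetical 2x3 image: a red subject on a grey background.
RED, GREY = (200, 30, 30), (128, 128, 128)
image = [[GREY, RED, GREY],
         [GREY, RED, RED]]
mask = [[0, 1, 0],
        [0, 1, 1]]

rgba = cut_out(image, mask)
print(rgba[0])
```

A production pipeline would typically use soft (0 to 255) alpha values along the mask edge to avoid jagged outlines; the hard 0/255 alpha here keeps the sketch short.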

The 2026 development pattern:

  • Start with SAM 2 for any prototyping; it generalises remarkably well across domains without additional training.
  • Fine-tune SAM or a specialised model when the domain is sufficiently unusual (medical, industrial, geospatial).
  • For consumer-facing photo editing, commercial APIs (Photoroom, Cutout.pro, remove.bg) are often cheaper than building and maintaining an in-house pipeline for typical use cases.
  • For high-volume programmatic use, self-hosted SAM on a GPU instance is economical.
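
Whether an API or a self-hosted model is cheaper depends on volume. The break-even arithmetic can be sketched as follows; the function name and every price in the example are placeholders chosen for illustration, not quotes from any vendor, and the calculation assumes a single always-on GPU that can keep up with the load.

```python
import math


def breakeven_images_per_month(api_price_per_image: float,
                               gpu_cost_per_hour: float,
                               hours_per_month: float = 730.0) -> int:
    """Monthly volume above which an always-on GPU costs less than a per-image API."""
    monthly_gpu_cost = gpu_cost_per_hour * hours_per_month
    return math.ceil(monthly_gpu_cost / api_price_per_image)


# Placeholder prices, NOT real quotes: substitute your own figures.
volume = breakeven_images_per_month(api_price_per_image=0.05, gpu_cost_per_hour=1.00)
print(f"Self-hosting breaks even at roughly {volume} images per month")
```

The sketch ignores engineering time, storage and the GPU's throughput limit, all of which push the real break-even point higher.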

For a US engineering team, segmentation has become one of the cheaper, more reliable computer vision capabilities to ship in 2026. Foundation models like SAM have removed the need to label thousands of examples for many use cases; the work has shifted from training to integration, evaluation and product surface.
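
Evaluation usually starts with overlap metrics between a predicted mask and a hand-labelled one. Below is a minimal sketch of two common ones, intersection over union (IoU) and the Dice coefficient, for binary masks; the function name and the convention of scoring two empty masks as a perfect match are assumptions made for this example.

```python
from typing import List, Tuple

Mask = List[List[int]]  # 1 = object pixel, 0 = background


def iou_and_dice(pred: Mask, truth: Mask) -> Tuple[float, float]:
    """Return (IoU, Dice) for two binary masks with the same height and width."""
    if len(pred) != len(truth) or any(len(p) != len(t) for p, t in zip(pred, truth)):
        raise ValueError("masks must have the same height and width")
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in truth for v in row]
    inter = sum(1 for a, b in zip(p, t) if a and b)
    p_sum, t_sum = sum(p), sum(t)
    union = p_sum + t_sum - inter
    if union == 0:  # both masks empty: treated here as a perfect match
        return 1.0, 1.0
    return inter / union, 2 * inter / (p_sum + t_sum)


# Hypothetical ground truth and a prediction that misses one pixel.
truth = [[0, 1, 1],
         [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 1, 1]]

iou, dice = iou_and_dice(pred, truth)
print(iou, dice)
```

Dice is always at least as large as IoU for the same pair of masks, so it matters which of the two a reported score refers to.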
