Depth estimation is the computer vision task of predicting, for each pixel in an image, how far the corresponding scene point is from the camera. The output is a depth map: an image of the same size as the input in which each pixel encodes distance instead of colour. Depth estimation is foundational for AR, robotics, autonomous driving, computational photography (portrait mode, bokeh), 3D reconstruction, and the ControlNet-driven pipelines that use depth as a conditioning input for image generation.
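To make the idea of a depth map as an image concrete, here is a minimal sketch that min-max normalises a small depth map into 8-bit grey values. The function name `depth_to_gray` and the "nearer is brighter" default are assumptions chosen for illustration, not a convention taken from any particular library; plain Python lists are used only to keep the example dependency-free.

```python
def depth_to_gray(depth, near_is_bright=True):
    """Min-max normalise a 2D depth map (list of rows) to integers in 0..255."""
    flat = [d for row in depth for d in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    gray = []
    for row in depth:
        gray_row = []
        for d in row:
            # A perfectly flat map has no range to stretch, so treat
            # every pixel as the nearest value.
            t = 0.0 if span == 0 else (d - lo) / span
            if near_is_bright:
                t = 1.0 - t  # assumed display convention: near = white
            gray_row.append(round(255 * t))
        gray.append(gray_row)
    return gray


depth_m = [[1.0, 2.0],
           [3.0, 5.0]]  # illustrative distances in metres
print(depth_to_gray(depth_m))  # [[255, 191], [128, 0]]
```

Because the normalisation is relative to the minimum and maximum of each image, the grey values are useful for viewing or conditioning but carry no absolute distance on their own.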
Two main flavours:
- Monocular depth — depth from a single image. This is the hardest setting because infinitely many 3D scenes can project to the same 2D image, so modern models lean on priors learned from data.
- Multi-view / stereo depth — depth from multiple images or a stereo camera pair; geometrically constrained and generally more accurate when the cameras are calibrated.
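The geometric constraint in the stereo case can be sketched with the standard relation for a rectified camera pair, Z = f * B / d, where f is the focal length in pixels, B is the baseline between the cameras in metres, and d is the disparity in pixels. The function name and the numbers below are illustrative assumptions, not values from a specific camera.

```python
def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Depth in metres for a rectified stereo pair: Z = f * B / d."""
    if disparity_px <= 0:
        # Zero disparity corresponds to a point at infinity (or a failed match).
        return float("inf")
    return focal_px * baseline_m / disparity_px


focal_px = 700.0    # assumed focal length in pixels
baseline_m = 0.12   # assumed 12 cm between the two cameras

for d in (84.0, 42.0, 8.4):
    z = disparity_to_depth(d, focal_px, baseline_m)
    print(f"disparity {d:5.1f} px -> depth {z:.2f} m")
```

Since depth is inversely proportional to disparity, a fixed matching error in pixels produces a larger depth error for distant points than for near ones.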
The 2026 model landscape for monocular depth:
- Depth Anything V2 — strong open-source monocular depth model; popular as the default for ControlNet depth guidance and many vision pipelines.
- MiDaS / DPT — older but still widely deployed; the originals that defined the modern category.
- Marigold — diffusion-based depth estimation; high-quality outputs at higher cost.
- Apple ARKit / Google ARCore — platform depth APIs rather than standalone models; on-device depth from learned models, aided by specialised hardware (LiDAR, structured light) where the device has it.
Production use cases:
- Computational photography — portrait mode and bokeh effects on most modern smartphones.
- AR effects — Snap, Instagram, TikTok filters that respect 3D scene structure.
- Autonomous driving and robotics — combined with LiDAR for redundant depth perception.
- 3D reconstruction — depth maps as inputs to mesh and Gaussian splat pipelines.
- Image editing — depth-aware blur, relighting, refocus.
- Image generation control — ControlNet depth conditioning to preserve scene structure when stylising or regenerating an image.
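For the 3D reconstruction case in the list above, the usual first step is to back-project each depth pixel into a 3D point using a pinhole camera model. The sketch below assumes the depth map stores metric z-depth along the optical axis and that the intrinsics `fx`, `fy`, `cx`, `cy` are known; the function name and the toy values are hypothetical.

```python
def backproject(depth, fx, fy, cx, cy):
    """Turn a z-depth map (list of rows) into a list of (X, Y, Z) camera-space points."""
    points = []
    for v, row in enumerate(depth):
        for u, z in enumerate(row):
            if z <= 0:
                continue  # skip pixels with missing or invalid depth
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points


# A 3x3 depth map of a flat wall 2 m away, with one invalid pixel in the centre.
depth = [[2.0, 2.0, 2.0],
         [2.0, 0.0, 2.0],
         [2.0, 2.0, 2.0]]
points = backproject(depth, fx=2.0, fy=2.0, cx=1.0, cy=1.0)
print(len(points))   # 8
print(points[0])     # (-1.0, -1.0, 2.0)
```

The resulting point list is the kind of input a meshing or splatting stage would consume; with relative rather than metric depth, the same code produces a cloud that is correct only up to an unknown scale and shift.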
The hard cases:
- Reflective and transparent surfaces — mirrors, glass, water break the assumptions of most depth models.
- Very distant or very close objects — monocular models tend to be most accurate in a middle range of distances.
- Featureless regions — blank walls, fog and clear sky offer few visual cues to work with.
- Absolute vs relative depth — most monocular models predict relative depth (this pixel is farther than that pixel) with no absolute scale; recovering metres requires a metric-trained model, calibration against known distances, or stereo.
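The calibration step for relative depth is often done by fitting a scale and a shift against a few points whose true distance is known. The sketch below is one simple way to do that with a closed-form least-squares fit. It assumes the model output behaves like inverse depth up to scale and shift, as is the case for MiDaS-style models, so the fit is done against 1 / metres; the function names and sample values are hypothetical.

```python
def fit_scale_shift(pred, target):
    """Least-squares s, t minimising sum((s * p + t - y)^2) over paired samples."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_y = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred)
    if var_p == 0:
        raise ValueError("predictions are constant; scale is not identifiable")
    cov = sum((p - mean_p) * (y - mean_y) for p, y in zip(pred, target))
    s = cov / var_p
    t = mean_y - s * mean_p
    return s, t


# Three pixels with known distances (for example from a tape measure or LiDAR).
known_metres = [1.0, 2.0, 4.0]
relative_pred = [1.5, 0.5, 0.0]  # assumed model outputs at those pixels

# Fit in inverse-depth space, since that is the assumed output space.
s, t = fit_scale_shift(relative_pred, [1.0 / m for m in known_metres])


def to_metres(p):
    """Convert a relative prediction to metres using the fitted s and t."""
    return 1.0 / (s * p + t)


print(round(to_metres(0.5), 3))  # 2.0
```

With noisy reference points, more samples and a robust fit would be advisable; this version only shows the shape of the calculation.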
For a US engineering team in 2026, depth estimation has become a commodity capability. The smaller Depth Anything V2 variants can run in real time on consumer GPUs and give strong results out of the box for most indoor and outdoor scenes, apart from the hard cases listed above. The deeper integrations live in computational photography pipelines, AR experiences and 3D capture workflows, where depth is one input among many. For high-stakes applications (robotics, autonomous vehicles), monocular depth is supplemented or replaced by sensor fusion with LiDAR, radar and stereo cameras to provide the redundancy that safety-critical systems require.