Pose estimation is the computer vision task of detecting the positions of human (or animal) joints in images and video, producing a skeleton of keypoints that captures body posture. It powers fitness apps, sports analytics, motion capture for animation, AR effects, ergonomic analysis and, increasingly, the ControlNet-driven pipelines that condition image generation on pose.
The variants in 2026:
- 2D pose estimation — joints in pixel coordinates; works on any single image. Mature.
- 3D pose estimation — joints in 3D coordinates; harder, because depth must either be inferred from a single (monocular) view or triangulated from multiple camera views.
- Whole-body pose — body, hands, face all together for full character capture.
- Multi-person pose — every person in a crowd estimated simultaneously.
- Animal pose — for veterinary, agriculture, wildlife and animation use cases.
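To make the idea of a 2D skeleton concrete, here is a minimal sketch of how one person's keypoints might be held in code. It assumes the 17-keypoint COCO layout, which many 2D models emit; the `Keypoint` and `Pose` classes and the `pose_from_triples` helper are hypothetical names introduced for illustration, not part of any library.

```python
from dataclasses import dataclass

# COCO 17-keypoint order (an assumed layout; other models use different sets).
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]


@dataclass(frozen=True)
class Keypoint:
    x: float      # pixel column
    y: float      # pixel row
    score: float  # model confidence in [0, 1]


@dataclass
class Pose:
    keypoints: dict  # joint name -> Keypoint

    def visible(self, min_score: float = 0.5) -> list:
        """Names of joints whose confidence meets the threshold, in COCO order."""
        return [
            name for name in COCO_KEYPOINTS
            if name in self.keypoints and self.keypoints[name].score >= min_score
        ]


def pose_from_triples(triples) -> Pose:
    """Build a Pose from 17 (x, y, score) triples given in COCO order."""
    if len(triples) != len(COCO_KEYPOINTS):
        raise ValueError("expected 17 keypoints")
    return Pose({name: Keypoint(*t) for name, t in zip(COCO_KEYPOINTS, triples)})
```

Whole-body and animal models use larger or different keypoint sets, so in practice the name list would be swapped for whatever layout the chosen model documents.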
The model landscape:
- OpenPose — historic foundational model; widely used and integrated.
- MediaPipe (Google) — fast on-device pose estimation; popular for mobile and web AR.
- YOLOv8 / YOLO-Pose — fast, accurate, easy to deploy for production at scale.
- MMPose — research-grade toolkit with state-of-the-art models.
- DWPose, RTMPose — newer high-accuracy options now used in many production pipelines.
- Move AI, Wonder Studio — commercial markerless mocap from video; turn ordinary footage into animation-ready pose data.
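As a rough illustration of how little code an off-the-shelf model needs, the sketch below runs a YOLO pose model through the `ultralytics` Python package and converts the result into plain Python tuples. It assumes `ultralytics` is installed and that the `yolov8n-pose.pt` weights can be downloaded; `detect_people` and `to_people` are hypothetical helper names, and the 0.3 confidence cut-off is an arbitrary choice.

```python
def to_people(xy, conf, min_conf=0.3):
    """Convert per-person keypoint lists into plain tuples.

    xy:   list of people, each a list of [x, y] pairs
    conf: list of people, each a list of per-joint confidences, or None
    Joints below min_conf become None so callers can skip them.
    """
    people = []
    for p, joints in enumerate(xy):
        scores = conf[p] if conf is not None else [1.0] * len(joints)
        people.append([
            (x, y, s) if s >= min_conf else None
            for (x, y), s in zip(joints, scores)
        ])
    return people


def detect_people(image_path, weights="yolov8n-pose.pt"):
    """Run a YOLO pose model on one image (needs `pip install ultralytics`)."""
    from ultralytics import YOLO  # imported lazily: optional heavy dependency

    result = YOLO(weights)(image_path)[0]  # one Results object per input image
    kpts = result.keypoints
    if kpts is None or len(kpts.xy) == 0:
        return []  # no people detected
    conf = kpts.conf.tolist() if kpts.conf is not None else None
    return to_people(kpts.xy.tolist(), conf)
```

Calling `detect_people("frame.jpg")` (the file name is a placeholder) would return one list of joints per detected person. Keeping the conversion in `to_people` separate from the model call makes it easier to swap in MediaPipe or an MMPose model later without touching downstream code.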
Production use cases that ship in 2026:
- Fitness apps — count reps, check form, give corrections (for example Tonal and Mirror).
- Sports analytics — player tracking, motion analysis, performance metrics; widely used in MLB, NBA and soccer.
- AR effects — body filters, virtual try-on, dance challenges on TikTok and Instagram.
- Motion capture for animation — markerless mocap from a phone for indie game developers and animators.
- Healthcare and rehabilitation — gait analysis, physical therapy assessment, fall detection for elderly care.
- Workplace ergonomics — automated assessment of repetitive-strain risk in industrial environments.
- Image generation control — ControlNet OpenPose conditioning to specify exact character poses for generated images.
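Several of these use cases reduce to simple geometry on the keypoints. As one hedged example, rep counting can be sketched as a joint angle fed into a two-state counter with hysteresis. The thresholds below (90 and 160 degrees, roughly a squat measured at the knee) are assumptions for illustration, and `joint_angle` and `RepCounter` are hypothetical names.

```python
import math


def joint_angle(a, b, c):
    """Angle at b, in degrees, formed by the points a-b-c (each an (x, y) pair)."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    n1, n2 = math.hypot(*v1), math.hypot(*v2)
    if n1 == 0 or n2 == 0:
        raise ValueError("degenerate joint: coincident points")
    cos = (v1[0] * v2[0] + v1[1] * v2[1]) / (n1 * n2)
    # Clamp to [-1, 1] to guard against floating-point overshoot.
    return math.degrees(math.acos(max(-1.0, min(1.0, cos))))


class RepCounter:
    """Counts one rep per down-then-up cycle; the gap between the two
    thresholds (hysteresis) stops jitter near one threshold from double counting."""

    def __init__(self, down_below=90.0, up_above=160.0):
        self.down_below = down_below
        self.up_above = up_above
        self.is_down = False
        self.reps = 0

    def update(self, angle):
        if not self.is_down and angle < self.down_below:
            self.is_down = True
        elif self.is_down and angle > self.up_above:
            self.is_down = False
            self.reps += 1
        return self.reps
```

For a squat, the angle would be computed per frame from the hip, knee and ankle keypoints and passed to `update`. Form checks and ergonomic scores follow the same pattern with different joints and thresholds.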
The hard cases:
- Occlusion — when one body part hides another, accuracy drops.
- Multiple overlapping people — crowd scenes are harder than single-person.
- Unusual poses — yoga, gymnastics, martial arts can confuse models trained on common poses.
- Loose or heavy clothing — baggy or layered garments hide body landmarks.
- Viewpoint extremes — overhead or extreme low camera angles are under-represented in training data, so accuracy drops.
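On video, a common mitigation for brief occlusions is to distrust low-confidence detections and lean on recent history instead. The sketch below shows one simple way to do that for a single joint: confidence-gated exponential smoothing. It is an illustrative approach, not the method of any particular library; `smooth_track`, the 0.3 threshold and the smoothing factor are all assumptions.

```python
def smooth_track(frames, min_score=0.3, alpha=0.5):
    """Confidence-gated exponential smoothing of one joint over time.

    frames: iterable of (x, y, score) tuples, one per video frame.
    Returns a list of (x, y) estimates; entries are None until the joint
    is first seen with enough confidence. Low-confidence frames (likely
    occlusion) hold the last smoothed position instead of updating it.
    """
    out, state = [], None
    for x, y, score in frames:
        if score >= min_score:
            if state is None:
                state = (x, y)  # first trusted observation
            else:
                state = (alpha * x + (1 - alpha) * state[0],
                         alpha * y + (1 - alpha) * state[1])
        out.append(state)
    return out
```

Holding the last position works for occlusions of a few frames. Longer gaps generally need a motion model or interpolation, and this does nothing for the single-image case.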
For a US team in 2026, pose estimation has commoditised dramatically. MediaPipe runs in real time in a browser; YOLO-Pose deploys easily on edge devices; cloud APIs cover the high-end use cases. The interesting work has moved up the stack, to what you do with the pose data once you have it, rather than the pose estimation itself. Custom training is rarely needed: pretrained pose models work well across most domains out of the box, with the hard cases above as the main exceptions.
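As an example of that up-the-stack work, comparing a user's pose with a reference pose usually starts by removing position and scale, so that distance from the camera does not matter. The sketch below is one minimal way to do this, assuming 2D keypoints as (x, y) pairs; `normalize_pose` and `pose_distance` are hypothetical names, and centring on the hips and scaling by torso length is one reasonable convention among several.

```python
import math


def normalize_pose(points, l_hip, r_hip, l_shoulder, r_shoulder):
    """Center a pose on the hip midpoint and scale it by torso length.

    points: list of (x, y) pairs. The four index arguments say which
    entries are the hips and shoulders (this depends on the keypoint
    layout, so it is passed in rather than hard-coded).
    """
    hx = (points[l_hip][0] + points[r_hip][0]) / 2
    hy = (points[l_hip][1] + points[r_hip][1]) / 2
    sx = (points[l_shoulder][0] + points[r_shoulder][0]) / 2
    sy = (points[l_shoulder][1] + points[r_shoulder][1]) / 2
    torso = math.hypot(sx - hx, sy - hy)
    if torso == 0:
        raise ValueError("degenerate pose: zero torso length")
    return [((x - hx) / torso, (y - hy) / torso) for x, y in points]


def pose_distance(a, b):
    """Mean per-joint Euclidean distance between two normalized poses."""
    if len(a) != len(b) or not a:
        raise ValueError("poses must be non-empty and the same length")
    total = sum(math.hypot(ax - bx, ay - by) for (ax, ay), (bx, by) in zip(a, b))
    return total / len(a)
```

With the COCO layout, the indices would be 11 and 12 for the hips and 5 and 6 for the shoulders. A distance near zero means the two poses match up to translation and scale; rotation and left-right mirroring are not handled here.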