Progressive Search Space Reduction for Human Pose Estimation

Vittorio Ferrari
University of Oxford, UK

The purpose of this work is to estimate human pose as the 2D spatial configuration of body parts in the very challenging setting of TV shows and feature films. Direct pose estimation on this uncontrolled material is often too difficult, especially when nothing is known in advance about the location, scale, pose, and appearance of the person.

We propose an approach that progressively reduces the search space for body parts, thereby greatly improving the chances of success of sophisticated but fragile pose estimators based on pictorial structures. A human detector generic over pose and appearance is trained and employed to substantially reduce the full pose search space. Next, we exploit knowledge about the structure of the detected image region to initialize GrabCut and further prune the search space. After an initial pose estimate is obtained from individual frames, we integrate appearance models across frames with confident estimates and use them to refine uncertain frames. Finally, we propose a technique for imposing temporal continuity, which reduces the ambiguity of single-frame estimates.
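To make the staging of the approach concrete, the following Python sketch outlines the pipeline as a sequence of search-space-pruning stages. All function names, the confidence threshold, and the values returned by the stubs are hypothetical placeholders introduced purely for illustration; this is a sketch of the overall structure, not the implementation described in the paper.

\begin{verbatim}
# Minimal sketch of the progressive search-space reduction pipeline.
# Every helper below is a hypothetical placeholder for one stage of the
# approach; none of this is the authors' actual implementation.

CONF_THRESHOLD = 0.7  # assumed cut-off between confident and uncertain frames


def detect_person(frame):
    """Stage 1: person detector generic over pose and appearance (stub)."""
    return (0, 0, 100, 100)  # placeholder detection window (x, y, w, h)


def grabcut_foreground(frame, detection):
    """Stage 2: GrabCut segmentation initialized from the detection (stub)."""
    return None  # placeholder foreground mask


def pictorial_structure_pose(frame, detection, foreground, appearance=None):
    """Stages 3-4: pictorial-structures inference over the pruned space (stub).

    Returns a (pose, confidence) pair: a soft labelling of pixels into body
    parts and a scalar confidence in [0, 1].
    """
    return {}, 0.5  # placeholder


def build_appearance_models(frames, confident_estimates):
    """Stage 4: integrate part appearance over confidently estimated frames."""
    return {}


def smooth_over_time(estimates):
    """Stage 5: impose temporal continuity on the per-frame estimates."""
    return estimates


def estimate_shot(frames):
    """Run the full pipeline on all frames of one shot."""
    estimates = []
    for frame in frames:
        det = detect_person(frame)            # prune location and scale
        fg = grabcut_foreground(frame, det)   # prune background pixels
        pose, conf = pictorial_structure_pose(frame, det, fg)
        estimates.append({"det": det, "fg": fg, "pose": pose, "conf": conf})

    # Transfer appearance models from confident frames to uncertain ones.
    confident = [e for e in estimates if e["conf"] > CONF_THRESHOLD]
    appearance = build_appearance_models(frames, confident)
    for frame, est in zip(frames, estimates):
        if est["conf"] <= CONF_THRESHOLD:
            est["pose"], est["conf"] = pictorial_structure_pose(
                frame, est["det"], est["fg"], appearance)

    return smooth_over_time(estimates)
\end{verbatim}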

The method is fully automatic and self-initializing, and explains the spatio-temporal volume covered by a person moving in a shot by soft-labeling every pixel as belonging to a particular body part or to the background. We report an extensive evaluation over 70000 frames from four episodes of the TV show {\em Buffy the vampire slayer}, and present applications to action recognition on the Weizmann dataset and to pose search.