We will refer to two different instances of the problem that are, nevertheless, treated in a unified manner. The first of these instances considers a hand in isolation, observed by a Kinect sensor. In this case, we seek the hand articulation parameters that minimize the discrepancy between the appearance and 3D structure of hypothesized instances of a hand model and the actual hand observations.
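The discrepancy between a hypothesis and a Kinect observation can be evaluated pixel-wise on depth maps: render the hypothesized hand pose to a synthetic depth map and compare it against the observed one. The sketch below is illustrative only; the clamping threshold `dmax` and the handling of pixels covered by only one of the two maps are assumptions, not the paper's exact measure.

```python
import numpy as np

def depth_discrepancy(observed, hypothesized, dmax=4.0):
    """Average per-pixel discrepancy between an observed depth map and one
    rendered from a hypothesized hand pose (a sketch, not the exact measure).

    Depth values <= 0 mark invalid/background pixels. Where both maps are
    valid, the absolute depth difference is clamped at dmax; where only one
    map is valid, the hypothesis fails to explain the observation (or vice
    versa) and the pixel is charged the full penalty dmax."""
    obs_valid = observed > 0
    hyp_valid = hypothesized > 0
    both = obs_valid & hyp_valid
    either = obs_valid | hyp_valid

    diff = np.zeros_like(observed, dtype=float)
    diff[both] = np.minimum(np.abs(observed[both] - hypothesized[both]), dmax)
    diff[either & ~both] = dmax

    n = either.sum()
    return diff.sum() / n if n else 0.0
```

A model-based tracker would evaluate this score for many candidate pose hypotheses per frame and keep the minimizer; rendering and scoring are embarrassingly parallel, which is why a GPU implementation pays off.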
In the second instance, we consider the estimation of the full pose of a human hand interacting with an object, with observations coming from a calibrated multicamera setup. In this case, optimization seeks the joint hand-object model configuration that (a) minimizes the appearance discrepancy between hypothesis and observation, (b) best explains the incompleteness of observations resulting from occlusions due to hand-object interaction, and (c) is physically plausible in the sense that the hand does not intersect itself and does not share the same physical space with the manipulated object. Extensive experiments with prototype GPU-based implementations of the proposed methods demonstrate that accurate and robust 3D tracking of hand articulations can be achieved in near real-time.
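The physical-plausibility requirement in (c) is commonly enforced as a soft penalty added to the appearance term, with hand and object geometry approximated by sphere sets so that interpenetration is cheap to measure. The sketch below illustrates this idea under those assumptions; the sphere approximation, the penalty weight `w_pen`, and the additive combination are illustrative choices, not the paper's specific formulation.

```python
import numpy as np

def penetration_depth(centers_a, radii_a, centers_b, radii_b):
    """Total interpenetration between two sphere sets approximating two
    shapes (e.g. hand vs. object, or two hand parts for self-intersection).
    For each sphere pair, overlap = (r_a + r_b) - distance; only positive
    overlaps (actual penetrations) contribute."""
    d = np.linalg.norm(centers_a[:, None, :] - centers_b[None, :, :], axis=-1)
    overlap = (radii_a[:, None] + radii_b[None, :]) - d
    return np.maximum(overlap, 0.0).sum()

def joint_objective(appearance_error, hand_spheres, object_spheres, w_pen=5.0):
    """Composite score for a joint hand-object hypothesis: appearance
    discrepancy plus a weighted penalty for hand-object interpenetration
    (an illustrative additive combination)."""
    ca, ra = hand_spheres
    cb, rb = object_spheres
    return appearance_error + w_pen * penetration_depth(ca, ra, cb, rb)
```

A self-intersection term of the same form, computed over pairs of hand parts, would implement the first half of requirement (c); hypotheses that interpenetrate score worse and are discarded by the optimizer even when their appearance fits the (occluded, hence ambiguous) observations.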