Learning Visual Representations for Interaction

Justus Piater (U. Innsbruck, Austria)

Autonomous systems that interact with the real world must be equipped with a basic understanding of relevant objects and the effects of actions upon them. Since such knowledge is very difficult to specify by hand, I am interested in how to enable autonomous systems to acquire it empirically. To this end, we have developed learnable object representations for interaction and abstraction. Objects are represented by Markov networks whose edge potentials encode pairwise spatial relationships between local features in 3D. Local features typically correspond to visual signatures, but may also represent action-relevant parameters such as object-relative gripper poses useful for grasping the object. Thus, detecting and recognizing known objects seen by a camera amounts to probabilistic inference, and is remarkably robust to clutter and occlusions. At the same time, the associated action parameters are inferred, which allows a robot to interact with the objects present in the scene.
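A rough sketch may make the pairwise factorization concrete. The code below is an illustrative assumption, not the actual implementation: class and parameter names (ObjectModel, edge_potential, sigma) are invented here, edge potentials are modeled as simple Gaussians on the deviation of observed pairwise 3D offsets from those stored in the model, and scoring a candidate feature-to-model assignment multiplies the edge potentials, which is the Markov-network factorization. Full detection would additionally search over assignments, e.g., by belief propagation.

import numpy as np

class ObjectModel:
    """Nodes are local 3D features; edges carry pairwise spatial relations."""

    def __init__(self, edges, sigma=0.01):
        # edges: dict mapping a pair of node ids (i, j) to the expected
        # 3D offset (length-3 array) between the two features in the model.
        self.edges = edges
        self.sigma = sigma

    def edge_potential(self, observed_offset, expected_offset):
        # Gaussian potential on the deviation of the observed pairwise
        # offset from the offset stored on the model edge.
        d = np.linalg.norm(np.asarray(observed_offset) - np.asarray(expected_offset))
        return np.exp(-0.5 * (d / self.sigma) ** 2)

    def score(self, assignment, observed_positions):
        # assignment: dict mapping each model node id to the index of an
        # observed feature; observed_positions: (N, 3) array of 3D positions.
        # The unnormalized probability of the assignment is the product of
        # the edge potentials (pairwise Markov-network factorization).
        p = 1.0
        for (i, j), expected in self.edges.items():
            offset = observed_positions[assignment[j]] - observed_positions[assignment[i]]
            p *= self.edge_potential(offset, expected)
        return p

# Toy usage: a two-feature model matched against three observed features.
model = ObjectModel({("a", "b"): np.array([0.0, 0.0, 0.1])})
observed = np.array([[0.0, 0.0, 0.0], [0.0, 0.0, 0.1], [0.5, 0.2, 0.0]])
print(model.score({"a": 0, "b": 1}, observed))   # close to 1: consistent match
print(model.score({"a": 0, "b": 2}, observed))   # near 0: inconsistent assignment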

Visual and action representations can be learned autonomously and incrementally, allowing an intelligent system to increase its capabilities with experience. Furthermore, these capabilities can be abstracted into symbolic object-action-effect representations. This permits an artificial cognitive agent to construct elaborate plans in terms of learned primitives derived from sensorimotor experience.
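The symbolic layer can be pictured as a set of object-action-effect rules over predicates, loosely in the spirit of STRIPS-style operators. The sketch below is an assumption for illustration only, not the formalism used in this work; the rule and predicate names (grasp, reachable, holding, hand_empty) are hypothetical.

from dataclasses import dataclass

@dataclass(frozen=True)
class ObjectActionEffect:
    action: str                 # learned sensorimotor primitive, e.g. "grasp(cup)"
    preconditions: frozenset    # predicates that must hold before execution
    add_effects: frozenset      # predicates made true by the action
    del_effects: frozenset      # predicates made false by the action

    def applicable(self, state):
        return self.preconditions <= state

    def apply(self, state):
        return (state - self.del_effects) | self.add_effects

# Hypothetical learned rule applied to a symbolic state (one plan step).
grasp_cup = ObjectActionEffect(
    action="grasp(cup)",
    preconditions=frozenset({"reachable(cup)", "hand_empty"}),
    add_effects=frozenset({"holding(cup)"}),
    del_effects=frozenset({"hand_empty"}),
)
state = frozenset({"reachable(cup)", "hand_empty"})
if grasp_cup.applicable(state):
    state = grasp_cup.apply(state)
print(state)  # frozenset({'reachable(cup)', 'holding(cup)'})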