3D scene and object understanding in the era of deep learning

Federico Tombari (Google and TU Munich, Germany)

Abstract:

In the deep learning era, to address common 3D scene and object understanding applications, we delegate to a neural network the tasks of extracting feature representations from data and of learning intermediate subspaces and embeddings. This exposes our algorithms to common limitations of neural networks, in particular the need for huge amounts of data and the sensitivity to domain shift.

In this talk, I will walk through current approaches that apply deep learning to 3D reconstruction and 3D recognition tasks, and highlight novel directions to address these limitations. In the first part of the talk, I will give an overview of methods aimed at learning descriptors of 3D objects from different 3D data representations, focusing in particular on point clouds. I will discuss how current approaches are moving towards unsupervised learning, using different architectures and learning methodologies. In the second part of the talk, I will focus instead on monocular 3D reconstruction, discussing the use of monocular depth prediction for monocular SLAM and omnidirectional images. Finally, I will introduce our recent work in monocular 6D object pose estimation and highlight methodologies to overcome domain shift.