In [1,2] we explore weaker forms of supervision for convnet training, such as bounding box annotations, which are cheaper and easier to collect. We consider bounding box supervision for two different tasks: object boundary detection [1] and semantic labelling [2], and show that with box supervision alone we can reach 95% of the full-supervision quality.
In [3,4] we propose a new training strategy for pixel-level object tracking which allows us to relax the constraint of using densely annotated video data. Instead of using large volumes of video data in the hope of generalizing across domains, we propose to train from static images only [3] or to synthesize in-domain training data using the provided annotation on the first frame of each video [4]. This allows us to achieve state-of-the-art results while using ∼100× less annotated data than competing methods.
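To make the in-domain data-synthesis idea concrete, the sketch below generates perturbed training pairs from a single annotated first frame by applying random shifts and flips jointly to the image and its mask. This is a minimal illustration of the general principle only; the function name, parameters, and the choice of transforms are assumptions for this sketch, and the actual method in [4] uses a richer set of deformations.

```python
import numpy as np

def synthesize_samples(image, mask, n_samples=5, max_shift=4, seed=None):
    """Generate perturbed (image, mask) training pairs from one annotated
    frame. Illustrative sketch only: the real pipeline in [4] uses richer
    transformations (e.g. illumination changes, non-rigid deformations)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        # Random joint translation of image and mask.
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        img_aug = np.roll(image, (dy, dx), axis=(0, 1))
        msk_aug = np.roll(mask, (dy, dx), axis=(0, 1))
        # Random horizontal flip, applied to both consistently.
        if rng.random() < 0.5:
            img_aug = img_aug[:, ::-1]
            msk_aug = msk_aug[:, ::-1]
        samples.append((img_aug, msk_aug))
    return samples
```

Training a per-video model on such synthesized pairs keeps the data in-domain, which is what makes the ∼100× reduction in annotation possible.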
[1] Khoreva A., Benenson R., Omran M., Hein M., Schiele B. Weakly Supervised Object Boundaries. CVPR, 2016 (spotlight). [2] Khoreva A., Benenson R., Hosang J., Hein M., Schiele B. Simple Does It: Weakly Supervised Instance and Semantic Segmentation. CVPR, 2017. [3] *Khoreva A., *Perazzi F., Benenson R., Schiele B., Sorkine-Hornung A. Learning Video Object Segmentation from Static Images. CVPR, 2017 (spotlight). [4] Khoreva A., Benenson R., Ilg E., Brox T., Schiele B. Lucid Data Dreaming for Multiple Object Tracking. arXiv, 2017.