In [1,2] we explore weaker forms of supervision for convnet training, such as bounding box annotations, which are cheaper and easier to collect. We consider bounding box supervision for two different tasks: object boundary detection [1] and semantic labelling [2], and show that with box supervision alone we can reach 95% of the full-supervision quality.
In [3,4] we propose a new training strategy for pixel-level object tracking which allows us to relax the constraint of using densely annotated video data. Instead of using large volumes of video data in the hope of generalizing across domains, we propose to train from static images only [3] or to synthesize in-domain training data using the provided annotation on the first frame of each video [4]. This allows us to achieve state-of-the-art results while using ∼100× less annotated data than competing methods.
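To make the in-domain data-synthesis idea concrete, the sketch below generates perturbed training pairs from a single annotated first frame by applying random shifts and flips jointly to the image and its mask. This is a minimal illustration of the general principle only; the function name, parameters, and the choice of transforms are assumptions for this sketch, and the actual method in [4] uses a richer set of deformations.

```python
import numpy as np

def synthesize_samples(image, mask, n_samples=5, max_shift=4, seed=None):
    """Generate perturbed (image, mask) training pairs from one annotated
    frame. Illustrative sketch only: the real pipeline in [4] uses richer
    transformations (e.g. illumination changes, non-rigid deformations)."""
    rng = np.random.default_rng(seed)
    samples = []
    for _ in range(n_samples):
        # Random joint translation of image and mask.
        dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
        img_aug = np.roll(image, (dy, dx), axis=(0, 1))
        msk_aug = np.roll(mask, (dy, dx), axis=(0, 1))
        # Random horizontal flip, applied to both consistently.
        if rng.random() < 0.5:
            img_aug = img_aug[:, ::-1]
            msk_aug = msk_aug[:, ::-1]
        samples.append((img_aug, msk_aug))
    return samples
```

Training a per-video model on such synthesized pairs keeps the data in-domain, which is what makes the ∼100× reduction in annotation possible.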
[1] Khoreva A., Benenson R., Omran M., Hein M., Schiele B. Weakly Supervised Object Boundaries. CVPR, 2016 (spotlight). [2] Khoreva A., Benenson R., Hosang J., Hein M., Schiele B. Simple Does It: Weakly Supervised Instance and Semantic Segmentation. CVPR, 2017. [3] *Khoreva A., *Perazzi F., Benenson R., Schiele B., Sorkine-Hornung A. Learning Video Object Segmentation from Static Images. CVPR, 2017 (spotlight). [4] Khoreva A., Benenson R., Ilg E., Brox T., Schiele B. Lucid Data Dreaming for Multiple Object Tracking. arXiv, 2017.