Object category localization is a challenging problem in computer vision.
The vast majority of work relies on supervised learning and requires
bounding box annotations of object instances in the training images. This
time-consuming annotation process is sidestepped in weakly supervised
learning. In this case, the supervision is incomplete and
restricted to binary labels that indicate the presence or absence of object
instances in the image, without their locations, sizes, or aspect ratios. We
follow a multiple-instance learning approach that iteratively trains the
detector and infers the object locations in the positive training images.
Our main contribution is a multi-fold training procedure for
multiple-instance learning, which prevents training from prematurely locking
on to poor local optima corresponding to erroneous object locations. This
procedure is particularly important when using high-dimensional
representations, such as Fisher vectors and convolutional neural network
features. We also propose a window refinement method, which further
improves localization accuracy by incorporating a prior on object layout
based on low-level contour information. We present a detailed experimental
evaluation using the PASCAL VOC 2007 dataset, which verifies the
effectiveness of our approach.
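
To illustrate the multi-fold idea, the following is a minimal Python sketch
under stated assumptions, not the implementation evaluated in this work. It
assumes a cross-validation-style split of the positive images, represents each
positive image as an array of candidate-window feature vectors, uses a
ridge-regression classifier as a stand-in for the detector, and initializes
every image with its first window. The names train_linear, score, and
multifold_mil, and all parameter values, are hypothetical.

```python
import numpy as np


def train_linear(X, y, reg=1.0):
    """Ridge-regression stand-in for the detector (an assumption of this sketch)."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    A = Xb.T @ Xb + reg * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ y)  # weights, with the bias as last entry


def score(w, X):
    """Linear detection score for each row (window) of X."""
    return X @ w[:-1] + w[-1]


def multifold_mil(pos_bags, neg_feats, n_folds=3, n_iters=5, seed=0):
    """pos_bags: list of (n_windows, d) arrays, one per positive image.
    neg_feats: (m, d) array of windows taken from negative images."""
    rng = np.random.default_rng(seed)
    n = len(pos_bags)
    selected = np.zeros(n, dtype=int)        # assumed initialization: window 0
    folds = rng.permutation(n) % n_folds     # balanced random fold assignment

    def fit(image_ids):
        # Train on the currently selected windows of the given positive images.
        X_pos = np.stack([pos_bags[i][selected[i]] for i in image_ids])
        X = np.vstack([X_pos, neg_feats])
        y = np.concatenate([np.ones(len(X_pos)), -np.ones(len(neg_feats))])
        return train_linear(X, y)

    for _ in range(n_iters):
        new_selected = selected.copy()
        for k in range(n_folds):
            # Detector trained WITHOUT the images of fold k ...
            w = fit(np.flatnonzero(folds != k))
            # ... re-localizes the object in the held-out images of fold k.
            for i in np.flatnonzero(folds == k):
                new_selected[i] = int(np.argmax(score(w, pos_bags[i])))
        selected = new_selected

    # Final detector trained on all positive images with their selected windows.
    return fit(np.arange(n)), selected
```

Because the detector that re-localizes an image is never trained on a window
from that image, a window cannot be re-selected merely because the detector
has memorized it, which is the failure mode that high-dimensional features
make likely in standard multiple-instance learning.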