First, I show that face-based person recognizers and human action detectors can be learned automatically from videos paired with readily available but imprecise and noisy text annotation in the form of movie scripts and subtitles. Second, I describe an intermediate-level video representation for recognition, in which video is decomposed into spatio-temporal segments that incorporate long-range motion cues in the form of groups of point tracks with coherent motion.
Results will be shown on challenging videos from feature-length movies.
Joint work with K. Alahari, F. Bach, M. Everingham, O. Duchenne, J. Lezama, I. Laptev, J. Ponce, and A. Zisserman.