In this talk, I will describe recent work on two challenging tasks in video understanding: (1) localizing moments in videos given a natural-language query, and (2) modeling and generating bounce interactions in real-world scenes.
For localizing moments in videos with natural language, I will describe an approach that reasons over temporal language, along with an approach that extends moment localization to searching over a large video corpus. For modeling bounces, I will describe our model that learns end to end, starting from sensor inputs, to predict post-bounce trajectories and to infer two underlying physical properties that govern bouncing: restitution and effective collision normals. For generating bounces, I will describe an approach that learns to "correct" traditional simulation output, generated with incomplete and imprecise world information, to obtain context-specific, visually plausible re-simulated output, a process we call neural re-simulation.
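
To make the two physical properties concrete, here is a minimal sketch of the textbook frictionless bounce model, in which the velocity component along the collision normal is reversed and scaled by the coefficient of restitution. This is not the learned model described in the talk; the function names `bounce_velocity` and `estimate_restitution` are hypothetical and chosen only for illustration.

```python
import math


def _dot(a, b):
    """Dot product of two equal-length vectors given as sequences."""
    return sum(x * y for x, y in zip(a, b))


def _unit(n):
    """Return n scaled to unit length; reject a zero-length normal."""
    norm = math.sqrt(_dot(n, n))
    if norm == 0.0:
        raise ValueError("collision normal must be non-zero")
    return tuple(x / norm for x in n)


def bounce_velocity(v_in, normal, restitution):
    """Post-bounce velocity under an idealized frictionless model.

    v_out = v_in - (1 + e) * (v_in . n) * n, with n the unit collision
    normal and e the coefficient of restitution (assumed in [0, 1]).
    """
    n = _unit(normal)
    vn = _dot(v_in, n)  # signed speed along the normal
    return tuple(v - (1.0 + restitution) * vn * c for v, c in zip(v_in, n))


def estimate_restitution(v_in, v_out, normal):
    """Recover e from observed velocities: e = -(v_out . n) / (v_in . n)."""
    n = _unit(normal)
    vn_in = _dot(v_in, n)
    if vn_in == 0.0:
        raise ValueError("incoming velocity has no component along the normal")
    return -_dot(v_out, n) / vn_in


# Example: a ball falling onto a horizontal floor whose normal points up (+y).
v_after = bounce_velocity((1.0, -2.0, 0.0), (0.0, 1.0, 0.0), 0.5)
e_hat = estimate_restitution((1.0, -2.0, 0.0), v_after, (0.0, 1.0, 0.0))
```

In this idealized setting the tangential velocity is unchanged and the normal component flips sign with magnitude scaled by the restitution. Real-world bounces deviate from this, which is one reason an "effective" collision normal may differ from the geometric surface normal.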