Sonal Sannigrahi recently finished her undergraduate studies in Mathematics and Computer Science at École Polytechnique, France, and is continuing her education at Saarland University with a Master's in Language Science and Technology. She is interested in languages from a computational point of view and hopes to advance her career in natural language understanding to further break barriers in tech. She is equally a developer who loves Python and is eager to share her knowledge!
When watching a TV show or movie, we can easily detect which interactions are taking place on screen and between whom. However, this is not trivial for computers and remains a challenging task, as it involves both tracking humans and learning the semantics of the interactions taking place. This is an important problem in computer vision: solving it would allow for easy annotation of videos (paving the way for awesome unsupervised learning!), automated surveillance, and quick content-based video retrieval (just imagine finding a video by describing its actions in one go!). Attendees will benefit most from this talk if they are interested in computer vision and/or have some prior knowledge of convolutional architectures in machine learning; familiarity with action recognition is an additional plus, but not essential.
In this talk, we introduce a novel self-supervised method (termed "Sync-3D") that learns spatio-temporal video embeddings to enable the detection of human interactions. Our work combines the I3D architecture used for action localisation with the siamese SyncNet architecture for video-audio synchronisation, casting the problem of human interaction detection as one of motion synchronisation, both spatially and temporally. This talk will cover the motivation behind this architectural choice, our learning framework, a new data sampling strategy for curriculum learning, and lastly, how our architecture compares to others on the downstream task of interaction classification on the challenging TV-HID dataset. We will also motivate some future research directions and point out possible improvements to this system.
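The abstract does not give the training details, but the core idea of a siamese, SyncNet-style synchronisation objective can be illustrated with a toy sketch: two clips are passed through the same embedding tower, and a contrastive loss pulls synchronised pairs together while pushing out-of-sync pairs at least a margin apart. Everything below (the linear `embed` stand-in, `contrastive_sync_loss`, the margin value) is a hypothetical simplification, not the actual Sync-3D implementation.

```python
import numpy as np

def embed(clip, W):
    """Toy embedding: one shared linear layer plus L2 normalisation.
    Stands in for the I3D/SyncNet towers, whose details are not in the abstract."""
    v = W @ clip
    return v / np.linalg.norm(v)

def contrastive_sync_loss(emb_a, emb_b, synced, margin=1.0):
    """SyncNet-style contrastive loss: pull synchronised pairs together,
    push out-of-sync pairs at least `margin` apart."""
    d = np.linalg.norm(emb_a - emb_b)
    if synced:
        return d ** 2
    return max(0.0, margin - d) ** 2

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 32))                  # weights shared by both towers ("siamese")
clip = rng.standard_normal(32)                    # features of one video clip
in_sync = clip + 0.05 * rng.standard_normal(32)   # slightly perturbed, synchronised view
out_of_sync = rng.standard_normal(32)             # unrelated clip

loss_pos = contrastive_sync_loss(embed(clip, W), embed(in_sync, W), synced=True)
loss_neg = contrastive_sync_loss(embed(clip, W), embed(out_of_sync, W), synced=False)
```

Because both clips go through the same `W`, gradients from positive and negative pairs shape a single embedding space in which temporal alignment corresponds to proximity; the real method applies this idea to spatio-temporal 3D-convolutional features rather than a linear map.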