Multimodal Learning from the Bottom Up

Speaker

Andrew Owens
University of Michigan

Host

Anthea Li
MIT CSAIL

Abstract: Today's machine perception systems rely extensively on human-provided supervision, such as language. I will talk about our efforts to develop systems that instead learn directly about the world from unlabeled multimodal signals, bypassing the need for this supervision. First, I will discuss our work on creating models that learn by analyzing unlabeled videos, particularly self-supervised approaches for learning space-time correspondence. Next, I will present models that learn from the paired audio and visual signals that naturally occur in video, including methods for generating soundtracks for silent videos. I will also discuss methods for capturing and learning from paired visual and tactile signals, such as models that augment visual 3D reconstructions with touch. Finally, I will talk about work that explores the limits of pretrained text-to-image generation models by using them to create visual illusions.

Bio: Andrew Owens is an assistant professor at the University of Michigan in the Department of Electrical Engineering and Computer Science. Prior to that, he was a postdoctoral scholar at UC Berkeley, and he obtained a Ph.D. in computer science from MIT in 2016. He is a recipient of a Sloan Research Fellowship, an NSF CAREER Award, and a Best Paper Honorable Mention Award at the Conference on Computer Vision and Pattern Recognition (CVPR).