Multimodal Learning from the Bottom Up

Speaker

Andrew Owens
University of Michigan

Host

Anthea Li
MIT CSAIL

Abstract: Today's machine perception systems rely extensively on human-provided supervision, such as language. I will talk about our efforts to develop systems that instead learn directly about the world from unlabeled multimodal signals, bypassing the need for this supervision. First, I will discuss our work on creating models that learn by analyzing unlabeled videos, particularly self-supervised approaches for learning space-time correspondence. Next, I will present models that learn from the paired audio and visual signals that naturally occur in video, including methods for generating soundtracks for silent videos. I will also discuss methods for capturing and learning from paired visual and tactile signals, such as models that augment visual 3D reconstructions with touch. Finally, I will talk about work that explores the limits of pretrained text-to-image generation models by using them to create visual illusions.

Bio: Andrew Owens is an assistant professor at the University of Michigan in the Department of Electrical Engineering and Computer Science. Prior to that, he was a postdoctoral scholar at UC Berkeley, and he obtained a Ph.D. in computer science from MIT in 2016. He is a recipient of a Sloan Research Fellowship, an NSF CAREER Award, and a Best Paper Honorable Mention Award at the Conference on Computer Vision and Pattern Recognition (CVPR).