Embodied Intelligence (EI) Joint Seminar Presentation
Liming Wang & Jehanzeb Mirza
MIT CSAIL
Thursday, March 13, 2025, 4:00–5:00 PM (America/New_York)
There will be a joint presentation this week by two postdocs with the Spoken Language Systems Group.

Title: Can Diffusion Model Disentangle? A Theoretical Perspective

Presenter: Liming Wang

Abstract: This talk presents a novel theoretical framework for understanding how diffusion models can learn disentangled representations. Within this framework, we establish identifiability conditions for general disentangled latent variable models, analyze training dynamics, and derive sample complexity bounds for disentangled latent subspace models. To validate our theory, we conduct disentanglement experiments across diverse tasks and modalities, including subspace recovery in latent subspace Gaussian mixture models, image colorization, image denoising, and voice conversion for speech classification. Additionally, our experiments show that training strategies inspired by our theory, such as style guidance regularization, consistently enhance disentanglement performance.

Bio: Liming Wang is a postdoctoral associate in the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). His research interests broadly encompass the practical and theoretical aspects of self-supervised speech processing and multimodal learning, with the goal of improving accessibility and inclusivity of speech and language technology.

Title: GLOV: Guided Large Language Models as Implicit Optimizers for Vision Language Models

Presenter: Jehanzeb Mirza

Abstract: In this talk, Jehanzeb Mirza will present GLOV, a framework that reduces the manual effort required to craft effective natural language prompts for vision-language models (VLMs). Instead of relying on human intervention, GLOV employs large language models (LLMs) as implicit optimizers, iteratively refining VLM prompts by ranking and optimizing them based on task performance. Additionally, we guide the LLM's generation by incorporating an embedding space steering vector during autoregressive generation, biasing it toward more effective prompts at each optimization step. We evaluate GLOV across multiple downstream tasks and VLM architectures, demonstrating its strong generalization ability.

Bio: Jehanzeb Mirza is a postdoc in the Spoken Language Systems Group at MIT CSAIL, advised by James Glass. His research focuses on multi-modal learning, particularly improving fine-grained understanding. He earned his PhD in Computer Science (specializing in computer vision) from TU Graz, Austria, under the supervision of Prof. Horst Bischof, and his Master's from KIT, Germany.
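To make the optimization loop described in the GLOV abstract concrete, here is a minimal, hedged sketch of LLM-as-implicit-optimizer prompt refinement. Everything in it is a stand-in: `vlm_score` is a hypothetical placeholder for a VLM's downstream-task accuracy, `llm_propose` is a toy stand-in for an LLM that sees score-ranked prompts and proposes a new candidate, and the embedding-space steering vector from the talk is only noted in a comment, not implemented. This illustrates the rank-and-refine loop only, not GLOV's actual method or code.

```python
import random

random.seed(0)  # make the toy proposal step reproducible

def vlm_score(prompt: str) -> float:
    """Hypothetical stand-in for a VLM's task performance with this prompt.
    A real system would run the VLM on a downstream task; here we just
    reward prompts with more distinct words."""
    return len(set(prompt.split())) / 10.0

def llm_propose(ranked_prompts: list[str]) -> str:
    """Toy stand-in for an LLM proposing a new prompt after seeing the
    ranking. GLOV additionally biases the LLM's autoregressive generation
    with an embedding-space steering vector, which is not modeled here."""
    words = (ranked_prompts[0] + " " + ranked_prompts[1]).split()
    return " ".join(random.sample(words, k=min(6, len(words))))

def optimize_prompts(seed_prompts: list[str], steps: int = 5) -> str:
    """Iteratively refine a prompt pool: score, rank, ask the LLM for a
    new candidate, and keep the best prompt found."""
    pool = list(seed_prompts)
    for _ in range(steps):
        pool.sort(key=vlm_score, reverse=True)  # rank by task performance
        pool.append(llm_propose(pool))          # LLM proposes a refinement
    return max(pool, key=vlm_score)

best = optimize_prompts([
    "a photo of a dog",
    "a detailed photo showing a dog outdoors",
])
print(best)
```

Because the final selection takes a maximum over a pool that still contains the seeds, the returned prompt can never score below the best seed under this toy objective.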
TBD