ML Tea: Learning Generative Models from Corrupted Data

Speaker: Giannis Daras

Abstract: In scientific applications, generative models are used to regularize solutions to inverse problems. The quality of the models depends on the quality of the data on which they are trained. While natural images are abundant, in scientific applications access to high-quality data is scarce, expensive, or even impossible. For example, in MRI the quality of a scan is proportional to the time spent in the scanner, and in black-hole imaging we can only access lossy measurements. In contrast to high-quality data, noisy samples are generally far more accessible. If we had a method to transform noisy points into clean ones, e.g., by sampling from the posterior, we could address these challenges. A standard approach would be to use a pre-trained generative model as a prior. But how can we train these priors in the first place without access to clean data? We show that one can escape this chicken-and-egg problem using diffusion-based algorithms that account for the corruption at training time. We present the first algorithm that provably recovers the distribution given only noisy samples of a fixed variance. We extend our algorithm to heterogeneous data where each training sample has a different noise level. The underlying mathematical tools generalize to linear measurements, with the potential of accelerating MRI. Our method has deep connections to the literature on learning supervised models from corrupted data, such as SURE and Noise2X. Our framework opens exciting possibilities for generative modeling in data-constrained scientific applications. We are actively applying it to denoising proteins and present first results in this direction.
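As background for the abstract's idea of turning a noisy observation into a clean one via the posterior: a classical ingredient here is Tweedie's formula, which expresses the posterior mean of a clean sample in terms of the score of the noisy data distribution; diffusion models learn exactly such scores. The sketch below is illustrative only (not code from the talk) and uses a Gaussian prior with hypothetical parameters, so the score is available in closed form and the identity can be checked numerically.

```python
import numpy as np

# Tweedie's formula: for y = x + N(0, sigma^2),
#     E[x | y] = y + sigma^2 * grad_y log p_sigma(y),
# where p_sigma is the density of the *noisy* data. A diffusion model would
# estimate this score from samples; here we take a Gaussian prior
# x ~ N(0, tau^2) so that the score is known exactly.

tau, sigma = 2.0, 1.0  # prior std and noise std (hypothetical values)

def score_noisy(y):
    # y ~ N(0, tau^2 + sigma^2), so grad log p_sigma(y) = -y / (tau^2 + sigma^2)
    return -y / (tau**2 + sigma**2)

def tweedie_denoise(y):
    # Posterior-mean denoiser built from the score of the noisy distribution.
    return y + sigma**2 * score_noisy(y)

y = np.array([3.0, -1.5, 0.5])
denoised = tweedie_denoise(y)
closed_form = (tau**2 / (tau**2 + sigma**2)) * y  # exact Gaussian posterior mean
print(np.allclose(denoised, closed_form))  # True
```

In the setting the talk addresses, the score of the noisy distribution must itself be learned from the corrupted samples, which is what makes training the prior without clean data the interesting part.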

Bio: Giannis Daras is a postdoctoral researcher at MIT working closely with Prof. Costis Daskalakis and Prof. Antonio Torralba. Prior to MIT, Giannis completed his Ph.D. at UT Austin under the supervision of Prof. Alexandros G. Dimakis. Giannis is interested in generative modeling and the applications of generative models to inverse problems. A key aspect of his work is developing algorithms for learning generative models from noisy data. His research has broad implications across various fields, including scientific applications, privacy and copyright concerns, and data-efficient learning techniques.