MLTea Talk: Theoretical Perspectives on Data Quality and Selection

Abstract: Although it has always been understood that data quality directly affects the quality of our predictions, the large-scale data requirements of modern machine learning tasks have brought to the fore two needs: a richer vocabulary for understanding the quality of collected data with respect to prediction tasks of interest, and algorithms that use collected data most effectively. Though this has been studied in various contexts such as distribution shift, multitask learning, and sequential decision making, there remains a need to develop techniques that address the problems faced in practice. With the aim of starting a dialogue between the practical and theoretical perspectives on these important problems, I will survey some recent techniques developed in TCS and statistics addressing data quality and selection.

Bio: Abhishek Shetty is an incoming Catherine M. and James E. Allchin Early-Career Assistant Professor in the School of Computer Science at Georgia Tech and is currently a FODSI Postdoctoral Fellow at MIT, hosted by Sasha Rakhlin, Ankur Moitra, and Costis Daskalakis. He graduated from the Department of EECS at UC Berkeley, where he was advised by Nika Haghtalab. His interests lie at the intersection of machine learning, theoretical computer science, and statistics, and are aimed at developing statistically and computationally efficient algorithms for inference. His research has been recognized with the Apple AI/ML Fellowship and the American Statistical Association SCGS Best Student Paper award.