Improving Data Techniques with Dr. Emanuel Ben-David of the U.S. Census Bureau

Improving Applications of Modern Predictive Modeling Techniques to Linked Data Sets Subject to Mismatch Error

In recent years, the rise of social media platforms such as Twitter has provided social scientists with a wealth of user-generated content. Combining social media and survey data has the potential to produce a comprehensive source of information for social research. Such data are often collected from multiple sources and connected by probabilistic record linkage. For analyzing these linked data files, advanced machine learning techniques have become essential tools for survey methodologists and data scientists. There is, however, a potential pitfall in the widespread application of these techniques to linked data sets. Linkage errors, such as mismatches and missed matches, can distort the relationships between variables and bias the performance metrics that predictive modeling techniques routinely report. In this talk, we describe a methodology designed to adjust modern predictive modeling techniques for the presence of mismatch errors in linked data sets. Based on mixture modeling, the proposed approach is general enough to accommodate various predictive modeling techniques in a unified fashion. We evaluate the performance of the proposed methodology with real data and simulations.
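To make the mixture-modeling idea concrete, here is a minimal sketch in Python. It is not the speaker's implementation. It assumes a simple linear model in which each linked record is either a correct match, whose outcome follows the regression, or a mismatch, whose outcome is drawn from the marginal distribution of the outcome independently of the predictor. The function names, the simulated data, and the normality assumptions are all illustrative choices.

```python
import numpy as np

def normal_pdf(y, mean, sd):
    """Density of N(mean, sd^2) evaluated at y."""
    return np.exp(-0.5 * ((y - mean) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def simulate_linked_data(n=2000, mismatch_rate=0.3, seed=0):
    """Hypothetical linked file: a fraction of outcomes is shuffled among records."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=n)
    y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)  # true slope is 2.0
    idx = rng.choice(n, size=int(mismatch_rate * n), replace=False)
    y_linked = y.copy()
    y_linked[idx] = y[rng.permutation(idx)]  # mismatched pairs
    return x, y_linked

def fit_mixture_regression(x, y, n_iter=200, alpha0=0.2):
    """EM for a two-component mixture (an assumed, simplified model):
    correct match: y ~ N(b0 + b1 * x, sigma^2), with probability 1 - alpha
    mismatch:      y ~ N(mu, tau^2), independent of x, with probability alpha
    """
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    mu, tau = y.mean(), y.std()  # marginal of y, held fixed
    beta = np.linalg.lstsq(X, y, rcond=None)[0]  # start from the naive fit
    sigma = np.std(y - X @ beta)
    alpha = alpha0
    for _ in range(n_iter):
        # E-step: posterior probability that each record is a correct match
        p_match = (1.0 - alpha) * normal_pdf(y, X @ beta, sigma)
        p_mismatch = alpha * normal_pdf(y, mu, tau)
        w = p_match / (p_match + p_mismatch)
        # M-step: weighted least squares, then update sigma and alpha
        sw = np.sqrt(w)
        beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]
        resid = y - X @ beta
        sigma = np.sqrt(np.sum(w * resid ** 2) / np.sum(w))
        alpha = 1.0 - w.mean()
    return beta, sigma, alpha, w

x, y_linked = simulate_linked_data()
X = np.column_stack([np.ones(len(x)), x])
naive_beta = np.linalg.lstsq(X, y_linked, rcond=None)[0]
beta, sigma, alpha, w = fit_mixture_regression(x, y_linked)
print("naive slope:", naive_beta[1])
print("adjusted slope:", beta[1], "estimated mismatch rate:", alpha)
```

In this simulation the naive least-squares slope is attenuated toward zero because mismatched pairs carry no signal, while the mixture fit down-weights records that look like mismatches. The same weighting idea could in principle wrap other predictive models, though how the talk's methodology does so is not specified here.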

Presented by Improving Data Integration Techniques, a programmatic grant of the Incubator for Transdisciplinary Futures.

Speaker Bio: Emanuel Ben-David is a research mathematical statistician in the Center for Statistical Research and Methodology at the U.S. Census Bureau. Before joining the U.S. Census Bureau, he was a research assistant professor in the Department of Statistics at Columbia University (2012-2015), a postdoctoral associate in the Department of Statistics at Stanford University (2010-2012), and a postdoctoral fellow at the Statistical and Applied Mathematical Sciences Institute (2009-2010). He received his PhD in statistics from Indiana University-Bloomington. His research interests include record linkage and data integration, survey sampling, graphical models, multivariate statistics, and applied optimization.