Enamorado will design open-source data software that can process tens of millions of records – all from a personal computer.
Assistant professor of political science Ted Enamorado has earned a $233,955 National Science Foundation grant to expand his efforts in linking disparate sets of data.
Enamorado is one of the faculty leads of Improving Data Integration Techniques, a group supported by a programmatic grant from the Incubator for Transdisciplinary Futures. The group’s work focuses on finding ways to merge and organize massive datasets that lack “unique identifiers” (such as Social Security numbers) that would provide surefire matches between entries in multiple datasets.
This is a common but thorny problem in data science research, and one that has invited several attempted solutions over the years. For his part, Enamorado has developed an open-source software package called fastLink, launched in 2019, which uses a probabilistic model to identify likely linkages between data records.
The NSF grant, however, promises to take these efforts to the next level.
“It is all about dreaming bigger,” Enamorado said. “How can we do this using a personal computer, and handle not hundreds of thousands, but tens of millions of records?”
The grant will support additional software development and a comprehensive set of simulation studies to help scale up existing programs. The end goal is an open-source program robust enough to handle tens of millions of records at once, while still delivering highly accurate results when it links records between datasets. This sort of tool could prove invaluable in data-heavy fields such as social science, medicine, statistics, and computer science.
Scalability and accuracy are usually inversely related in these types of projects, Enamorado said – focus too much on improving one, and you lose ground on the other.
To help bridge the gap, his upcoming studies will bring in human reviewers to check the program’s linkages and help improve its algorithm. These reviewers will study representative samples of the matches that the software has made – verifying only a few hundred records rather than hundreds of thousands.
“In an ideal world, humans would be making most of the calls,” Enamorado said. “But human time is super expensive – and people get tired doing this type of task. So we don’t want to put too much burden on the human.”
The NSF grant’s funding lasts through July 2026.