Machine-learning systems use data to understand patterns and make predictions. When the system is predicting which photos are of cats, you may not care how certain it is about its results. But if it’s predicting the fastest route to the hospital, the amount of uncertainty becomes critically important.
“Imagine the system tells you ‘Route A takes 9 minutes’ and ‘Route B takes 10 minutes.’ Route A sounds better,” says Tamara Broderick, an associate professor in the Department of Electrical Engineering and Computer Science. “But now it turns out that Route A takes 9 minutes plus-or-minus 5, and Route B takes 10 minutes plus-or-minus 1. If you need a life-saving procedure in 12 minutes, suddenly your decision making really changes.”
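To make the route example concrete, here is a minimal sketch of the arithmetic. It assumes, purely for illustration, that each travel time follows a normal distribution and that the "plus-or-minus" figure is one standard deviation. Neither assumption comes from Broderick, and the function name is hypothetical.

```python
from statistics import NormalDist


def chance_of_arriving_in_time(mean_minutes, spread_minutes, deadline_minutes):
    """Probability that the trip finishes by the deadline.

    Assumption (not from the article): travel time is normally distributed,
    and the quoted "plus-or-minus" value is its standard deviation.
    """
    travel_time = NormalDist(mu=mean_minutes, sigma=spread_minutes)
    return travel_time.cdf(deadline_minutes)


# Route A: 9 minutes plus-or-minus 5. Route B: 10 minutes plus-or-minus 1.
route_a = chance_of_arriving_in_time(9, 5, 12)
route_b = chance_of_arriving_in_time(10, 1, 12)

print(f"Route A arrives within 12 minutes about {route_a:.0%} of the time")
print(f"Route B arrives within 12 minutes about {route_b:.0%} of the time")
```

Under these assumptions, Route A makes the 12-minute deadline roughly 73 percent of the time and Route B roughly 98 percent, even though Route A is faster on average.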
A high-school outreach program, MIT’s Women’s Technology Program (WTP), first brought Broderick to campus. “My experience at WTP was formative,” she says. Now Broderick studies how machine-learning systems can be made to quantify the “known unknowns” in their predictions, using a mathematical technique called Bayesian inference. “The idea is to learn not just what we know [from the data], but how well we know it,” she explains.
The catch is that traditional algorithms for “Bayesian machine learning” take a very long time to work on complex data sets like those in biology, physics, or social science. “It’s not just that we’re getting more data points, it’s that we’re asking more questions of those data points,” says Broderick, who is a principal investigator at MIT’s Computer Science and Artificial Intelligence Laboratory and affiliated with MIT’s Institute for Data, Systems, and Society. “If I have gene-expression levels for a thousand genes, that’s a thousand-dimensional [machine-learning] problem. But if I try to look at interactions between just one gene with another, that’s now a million-dimensional problem. The computational and statistical challenges go through the roof.”
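The jump from a thousand to a million can be checked with a quick count. The sketch below assumes that "interactions" means pairs of genes. Counting ordered pairs gives the round million in the quote, while counting each pair of distinct genes once gives about half that. Either way, the number of quantities to estimate grows with the square of the number of genes. The function names are hypothetical.

```python
from math import comb


def ordered_pairs(num_genes):
    """Count every (gene_i, gene_j) combination, including i == j."""
    return num_genes * num_genes


def distinct_unordered_pairs(num_genes):
    """Count each pair of two different genes exactly once."""
    return comb(num_genes, 2)


num_genes = 1_000
print(f"Individual genes:          {num_genes:,}")
print(f"Ordered gene pairs:        {ordered_pairs(num_genes):,}")
print(f"Distinct unordered pairs:  {distinct_unordered_pairs(num_genes):,}")
```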
These challenges impose a bottleneck on using Bayesian machine learning for many applications where quantifying uncertainty is essential. Some complex data analyses might take an infeasible amount of time to run—months or more. And in so-called “high-dimensional” data sets, such as ones with millions of gene interactions, it can be difficult to find the signal among the noise. “It’s harder to find out what’s really associated with what, when you have that many variables,” Broderick says.
In other words, Bayesian machine learning has a scaling problem. Broderick’s research devises mathematical workarounds that reduce computational and statistical complexity “so that our methods run fast, but with theoretical guarantees on accuracy.” Her recent work includes techniques with colorful names (“kernel interaction trick,” “infinitesimal jackknife”) that evoke a sense of technical wizardry crossed with down-to-earth pragmatism. Indeed, Broderick says she sees scalable Bayesian machine learning as “a service profession” aimed at amplifying the discovery efforts of her fellow scientists.
One such effort came to Broderick’s attention from an economist colleague studying how microcredit—small, low-interest loans made to entrepreneurs in developing economies—affects household incomes. “She’s interested in finding out whether these small loans actually help people, but it was taking her a really long time to run her experiments with existing Bayesian software,” Broderick says. Broderick’s team has been developing methods for this work that are both accurate and orders of magnitude faster.
In another collaboration, her team is using Bayesian machine learning to quantify the uncertainty in different kinds of genomics experiments, work that opens the door to a wealth of new, interesting science, Broderick says. Knowing how uncertain those results are will help biologists use the data they already have to decide how best to allocate research funds for future work. Think of it as the science-focused version of predicting the fastest route to a hospital with the least uncertainty.
“Even when we’re writing a purely theoretical paper, I’d like to think that the theory is very much inspired by problems that arise in people’s applications,” Broderick says. “We’re trying to make science easier for biologists, for chemists, for physicists, so they can focus on their really cool problems and just get the data analysis out of the way.”