Suppose an environmental scientist is studying whether exposure to air pollution is associated with low birth weight in a particular county.
They might train a machine-learning model to estimate the strength of this association, since machine-learning methods excel at learning complex relationships.
Standard machine-learning methods excel at making predictions and sometimes provide uncertainties, such as confidence intervals, for those predictions. However, they usually do not provide estimates or confidence intervals when the goal is to determine whether two variables are related. Other methods have been developed specifically to address this association problem and to provide confidence intervals. But in spatial settings, the MIT researchers found, these confidence intervals may be completely off target.
When variables such as air pollution levels or rainfall vary across locations, common methods of generating confidence intervals may claim a high level of confidence when, in reality, the estimate completely fails to capture the true value. Such faulty confidence intervals can mislead users into trusting a flawed model.
After identifying this shortcoming, the researchers developed a new method designed to generate valid confidence intervals for problems involving data that vary in space. In simulations and experiments with real data, their method was the only technique that consistently produced accurate confidence intervals.
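To make "valid confidence interval" concrete: a 95 percent interval should capture the true quantity in roughly 95 of every 100 repeated experiments. The sketch below (a generic textbook interval, not the researchers' method; all numbers are illustrative) checks this empirically for a classical regression slope interval in the easy case where the usual i.i.d. assumptions do hold:

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_ci(x, y):
    """Classical ~95% confidence interval for the slope of a
    least-squares line, valid under i.i.d. assumptions."""
    n = len(x)
    xc = x - x.mean()
    beta = (xc * (y - y.mean())).sum() / (xc ** 2).sum()
    alpha = y.mean() - beta * x.mean()
    resid = y - (alpha + beta * x)
    se = np.sqrt((resid @ resid) / (n - 2) / (xc ** 2).sum())
    return beta - 1.96 * se, beta + 1.96 * se

true_slope = 2.0
trials, hits = 2000, 0
for _ in range(trials):
    x = rng.uniform(0, 1, size=100)                       # i.i.d. sampling locations
    y = 1.0 + true_slope * x + rng.normal(0, 0.5, size=100)
    lo, hi = slope_ci(x, y)
    hits += (lo <= true_slope <= hi)

coverage = hits / trials
print(f"empirical coverage: {coverage:.3f}")  # close to the nominal 0.95
```

The study's point is that in spatial settings this empirical coverage can fall far below the nominal 95 percent, even though the method still reports high confidence.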
This work could help researchers in fields such as environmental science, economics, and epidemiology better understand when to trust the results of certain experiments.
“There are a lot of problems where people are interested in understanding spatial phenomena, like weather or forest management. We’ve shown that, for this broad class of problems, there are more appropriate methods that can give us better performance, a better understanding of what’s going on, and more reliable results,” says Tamara Broderick, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Laboratory for Information and Decision Systems (LIDS) and the Institute for Data, Systems, and Society (IDSS), an affiliate of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and senior author of the study.
Broderick is joined on the paper by co-lead authors David R. Burt, a postdoc, and Renato Berlinghieri, an EECS graduate student; and by Stephen Bates, an assistant professor in EECS and a member of LIDS. The research was recently presented at the Conference on Neural Information Processing Systems.
Invalid assumptions
Spatial association involves studying how a variable and a certain outcome are related in a geographic area. For example, one might want to study how tree cover is related to altitude in the United States.
To solve this type of problem, a scientist can collect observational data from multiple locations and use those data to infer the association at a location where they have no data.
The MIT researchers realized that, in this setting, existing methods often produce confidence intervals that are completely inaccurate. A model might report 95 percent confidence that its estimate captures the true relationship between tree cover and elevation when, in fact, it did not capture that relationship at all.
After discovering this problem, the researchers determined that the assumptions these confidence-interval methods rely on do not hold when data vary spatially.
Assumptions are like rules that must be followed for the results of a statistical analysis to be valid. Common methods of generating confidence intervals rely on several such assumptions.
First, they assume that the source data, meaning the observational data collected to train the model, are independent and identically distributed. This assumption implies that whether one location is included in the data has no bearing on whether another is included. But, for example, U.S. Environmental Protection Agency (EPA) air sensors are installed with other sensor locations in mind.
Second, existing methods often assume that the model is exactly correct, which in practice is never the case. Finally, they assume that the source data are distributed identically to the target data, the data where one wishes to make inferences.
But in spatial settings, the source data may be fundamentally different from the target data because the target data is in a different location than where the source data was collected.
For example, a scientist could use data from EPA pollution monitors to train a machine-learning model that could predict health outcomes in a rural area where there are no monitors. But EPA pollution monitors are likely installed in urban areas, where there is more traffic and heavy industry, so the air quality data will be very different from air quality data in a rural area.
In this case, association estimates using urban data suffer from bias because the target data systematically differ from the source data.
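A hedged sketch of how that bias can arise (the dose-response curve and pollution values here are invented for illustration, not taken from the study): if the true pollution-outcome relationship is nonlinear and a linear model is fit only to data from the urban, high-pollution range, the fitted slope can differ sharply from the true local association in the rural, low-pollution range.

```python
import numpy as np

rng = np.random.default_rng(1)

def true_outcome(pollution):
    # Hypothetical nonlinear dose-response curve (illustrative only).
    return 1.0 - np.exp(-pollution)

# Source data: monitors concentrated at high (urban) pollution levels.
x_urban = rng.uniform(3.0, 5.0, size=500)
y_urban = true_outcome(x_urban) + rng.normal(0, 0.01, size=500)

# Fit a (misspecified) linear model on urban data only.
beta, intercept = np.polyfit(x_urban, y_urban, 1)

# The true local association at a rural pollution level is much steeper:
# the derivative of the dose-response curve at that point.
x_rural = 0.5
true_local_slope = np.exp(-x_rural)

print(f"slope fit on urban data:   {beta:.3f}")
print(f"true slope at rural level: {true_local_slope:.3f}")
```

In this toy setup the urban-only fit estimates a nearly flat slope, while the true association at rural pollution levels is more than ten times steeper; a confidence interval built around the urban estimate would confidently miss it.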
An easy solution
The new method of generating confidence intervals explicitly accounts for this potential bias.
Instead of assuming that the source and target data are identically distributed, the researchers assume only that the data vary smoothly in space.
For example, with fine particulate air pollution, one would not expect the pollution level on one city block to be completely different from the level on the next block. Instead, pollution levels decrease gradually as one moves away from the pollution source.
“For these types of problems, this spatial smoothness assumption is more appropriate. It better matches what’s actually going on in the data,” says Broderick.
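One simple way to see what the smoothness assumption buys (this is a generic Nadaraya-Watson kernel smoother, not the researchers' estimator; the monitor locations and PM2.5 readings are made up): an unmonitored location can be estimated as a distance-weighted average of nearby monitors, leaning only on the idea that nearby values are similar.

```python
import numpy as np

def kernel_estimate(target, sites, values, bandwidth=1.0):
    """Nadaraya-Watson estimate at `target`: a distance-weighted
    average of observations, relying only on spatial smoothness."""
    d2 = ((sites - target) ** 2).sum(axis=1)
    w = np.exp(-d2 / (2 * bandwidth ** 2))   # Gaussian kernel weights
    return (w * values).sum() / w.sum()

# Hypothetical monitor coordinates (km) and PM2.5 readings (illustrative).
sites = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
pm25 = np.array([40.0, 35.0, 38.0, 12.0])

# Estimate at an unmonitored point surrounded by the first three monitors.
est = kernel_estimate(np.array([0.5, 0.5]), sites, pm25)
print(f"estimated PM2.5 at (0.5, 0.5): {est:.1f}")
```

The estimate lands near the three close monitors' readings and is barely influenced by the distant one, which is exactly the behavior the smoothness assumption licenses.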
When the researchers compared their method to other common techniques, theirs was the only one that consistently produced reliable confidence intervals for spatial analyses. Their method remained reliable even when the observational data were corrupted by random noise.
In the future, the researchers want to apply this analysis to a wider variety of variables and explore other applications where it could provide more reliable results.
This research was partially funded by an MIT Social and Ethical Responsibility of Computing (SERC) seed grant, the Office of Naval Research, Generali, Microsoft, and the National Science Foundation (NSF).