MIT researchers have identified cases in which machine-learning models fail when applied to data other than the data on which they were trained, raising questions about the need for testing whenever a model is deployed in a new setting.
“We demonstrate that when you train models on large amounts of data, and choose the best average model, this ‘best model’ in a new setting may be the worst model for 6-75 percent of new data,” says Marzyeh Ghassemi, an associate professor in MIT’s Department of Electrical Engineering and Computer Science (EECS), a member of the Institute for Medical Engineering and Science, and a principal investigator in the Laboratory for Information and Decision Systems.
In a paper presented at the Conference on Neural Information Processing Systems (NeurIPS 2025) in December, the researchers report, for example, that a model trained to diagnose disease from chest X-rays at one hospital can appear effective, on average, at a different hospital. Yet their evaluation revealed that some of the models that performed best at the first hospital performed worst on 75 percent of the patients at the second: when performance is averaged across all of the second hospital’s patients, the higher aggregate number masks this failure.
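The masking effect described above is easy to see with made-up numbers. The sketch below (using purely hypothetical correctness rates, not data from the paper) shows how a model can look strong on aggregate while failing badly on a minority subgroup:

```python
import numpy as np

# Hypothetical illustration: a model's per-patient correctness at a
# "second hospital," split into a majority and a minority subgroup.
rng = np.random.default_rng(0)

majority = rng.random(900) < 0.95   # 900 patients, ~95% classified correctly
minority = rng.random(100) < 0.20   # 100 patients, ~20% classified correctly

overall = np.concatenate([majority, minority])
print(f"aggregate accuracy:       {overall.mean():.2f}")   # looks strong
print(f"minority subgroup accuracy: {minority.mean():.2f}")  # masked failure
```

Averaged over all 1,000 patients, the model still scores well, even though it is wrong for most members of the minority subgroup; this is exactly the kind of gap that aggregate evaluation hides.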
Their findings show that spurious correlations remain a risk to a model’s reliability in new settings, even though improving model performance on observed data is often assumed to reduce them. A simple example of a spurious correlation: a machine-learning system that has never “seen” a cow on a beach classifies a photo of one as an orca solely because of the background. In many of the areas the researchers investigated, such as chest X-rays, cancer histopathology images, and hate speech detection, such spurious correlations are very hard to detect.
For example, a medical diagnostic model trained on chest X-rays may have learned to correlate a specific but irrelevant marking on one hospital’s X-rays with a certain pathology. At another hospital where that marking is not used, the pathology may be missed.
Previous research from Ghassemi’s group has shown that models can falsely correlate factors such as age, gender and race with medical findings. If, for example, a model has been trained on chest X-rays of older people who have pneumonia and has not “seen” many X-rays of younger people, it might predict that only older patients have pneumonia.
“We want the model to learn how to look at a patient’s physical characteristics and then make decisions based on that,” says Olawale Salaudeen, an MIT postdoc and lead author of the paper, “but really anything that is in the data that is relevant to a decision can be used by the model.” And those correlations may not hold when the environment changes, making model predictions an unreliable basis for decision-making.
Spurious correlations contribute to the risks of biased decision-making. In the NeurIPS paper, the researchers showed, for example, that chest X-ray models that improved overall diagnostic performance actually performed worse on patients with pleural conditions or an enlarged cardiomediastinum, meaning enlargement of the heart or central chest cavity.
Other authors of the paper include PhD students Haoran Zhang and Kumail Alhamoud, EECS assistant professor Sara Beery, and Ghassemi.
While previous work has generally assumed that models ranked from best to worst by performance in one setting will keep that ranking in new settings, a phenomenon called “accuracy on the line,” the researchers demonstrated cases in which the best-performing models in one setting were the worst-performing in another.
Salaudeen devised an algorithm called OODSelect to find instances where accuracy on the line breaks down. In essence, the researchers trained thousands of models on in-distribution data, meaning data from the first setting, and measured their accuracy. They then applied the models to data from another setting. When the models with the highest accuracy on the first setting’s data were wrong on a large share of examples in the second setting, those examples identified a problem subgroup, or subpopulation. Salaudeen also emphasizes the dangers of aggregate statistics for evaluation, which can obscure more detailed and consequential information about model performance.
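The procedure described above can be sketched in a few lines. This is a minimal reconstruction of the idea as stated in the article, not the authors’ released implementation; the function name, arguments, and random toy data are all illustrative assumptions:

```python
import numpy as np

def ood_select_sketch(id_acc, ood_correct, top_k=10):
    """Sketch of the OODSelect idea as described (not the authors' code):
    take the models with the highest in-distribution accuracy, then rank
    out-of-distribution examples by how often those top models get them wrong.

    id_acc      : (n_models,) in-distribution accuracy per model
    ood_correct : (n_models, n_examples) 0/1 correctness on OOD examples
    """
    top = np.argsort(id_acc)[-top_k:]                 # best ID models
    error_rate = 1.0 - ood_correct[top].mean(axis=0)  # per-example OOD error
    # Examples the best ID models fail on most form the candidate subgroup.
    return np.argsort(error_rate)[::-1]

# Toy usage with random stand-in data: 50 models, 200 OOD examples.
rng = np.random.default_rng(1)
id_acc = rng.random(50)
ood_correct = (rng.random((50, 200)) < 0.7).astype(int)
ranked = ood_select_sketch(id_acc, ood_correct)  # worst-handled examples first
```

The top of the returned ranking is a candidate subpopulation on which models that looked best in the first setting fail, which an evaluator could then inspect directly rather than relying on aggregate accuracy.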
During this work, the researchers isolated the most misclassified examples, so that the subgroups they identified would reflect spurious correlations in the dataset rather than examples that are simply difficult to classify.
Alongside the NeurIPS paper, the researchers have released their code and some of the identified subsets for future work.
Once a hospital, or any organization employing machine learning, identifies the subgroups on which a model performs poorly, that information can be used to improve the model for its particular task and setting. The researchers suggest that future work adopt OODSelect to sharpen the goals of evaluation and to design approaches that continuously improve performance.
“We hope that the released code and OODSelect subsets will become a step toward benchmarks and models that combat the adverse effects of spurious correlations,” the researchers write.