Large language models (LLMs) have been hailed as a tool that can democratize access to information worldwide, providing knowledge in user-friendly interfaces regardless of a person’s background or location. However, new research from MIT’s Center for Constructive Communication (CCC) shows that these artificial intelligence systems may actually perform poorly for the very users who could benefit most from them.
A study conducted by CCC researchers based at the MIT Media Lab found that state-of-the-art AI chatbots – including OpenAI’s GPT-4, Anthropic’s Cloud 3 Opus, and Meta’s Llama 3 – sometimes provide less-accurate and less-truthful responses to users with lower English proficiency, less formal education, or who are from outside the United States. Models also refuse to answer questions for these users at higher rates, and in some cases, respond with condescending or patronizing language.
“We were inspired by the possibility of the LLM helping to address unequal information access around the world,” says lead author Elinor Poole-Dayan SM ’25, who led the research as a technical associate and master’s student in media arts and sciences at the MIT Sloan School of Management. “But this approach cannot become a reality without ensuring that model bias and harmful tendencies are safely mitigated for all users, regardless of language, nationality or other demographics.”
A paper describing the work, titled “LLM targeted underperformance disproportionately affects vulnerable users,” was presented at the AAAI Conference on Artificial Intelligence in January.
Systematic poor performance across multiple dimensions
For this research, the team tested how three LLMs answered questions from two datasets: TruthfulQA and ScienceQA. TruthfulQA is designed to measure the truthfulness of a model (by relying on common misconceptions and literal truths about the real world), while SciQ consists of science exam questions that test factual accuracy. The researchers paired each question with brief user biographies that varied three characteristics: education level, English proficiency, and country of origin.
In all three models and both datasets, the researchers observed a significant drop in accuracy when questions came from users with less formal education or who were non-native English speakers. The effects were most pronounced for users at the intersection of these categories: those with less formal education, who were also non-native English speakers, saw the greatest decline in response quality.
The research also examined how country of origin affected the model’s performance. Testing users from the United States, Iran, and China with similar educational backgrounds, the researchers found that Cloud 3 Opus performed significantly worse for users from Iran, especially on both datasets.
“We see the largest drop in accuracy for users who are both non-native English speakers and less educated,” says Judd Kabara, a research scientist at CCC and co-author of the paper. “These results show that the negative effects of modeled behavior increase in relative ways with respect to these user traits, thus suggesting that such models deployed at scale run the risk of spreading harmful behavior or misinformation among those who are least able to recognize it.”
denial and condescending language
Perhaps the most striking difference was how often the models refused to answer questions altogether. For example, Claude 3 Opus refused to answer about 11 percent of questions for less educated, non-native English-speaking users—compared to only 3.6 percent for a control condition with no user biography.
When the researchers manually analyzed these denials, they found that Cloud responded with condescending, patronizing, or mocking language 43.7 percent of the time for lower-educated users, while less than 1 percent of the time for higher-educated users. In some cases, models imitated broken English or adopted exaggerated dialect.
The model specifically refused to provide information on certain topics for less-educated users from Iran or Russia, including questions about nuclear energy, anatomy, and historical events – even though it answered the same questions correctly for other users.
“This is another indicator that suggests that the alignment process may encourage the model to withhold information from some users to avoid potentially giving false information, even though the model clearly knows the correct answer and provides it to other users,” Kabara says.
echoes of human bias
The findings reflect documented patterns of human social-cognitive bias. Research in the social sciences has shown that native English speakers often perceive non-native speakers as less educated, intelligent, and competent, regardless of their actual expertise. Similar biased perceptions have been documented among teachers who evaluate non-native English speaking students.
“The value of large language models is evidenced by their extraordinary adoption by individuals and the massive investment in the technology,” says Deb Roy, professor of media arts and sciences, CCC director, and co-author of the paper. “This study is a reminder of how important it is to continually assess the systemic biases that can quietly creep into these systems, causing unfair disadvantages for certain groups without any of us being fully aware.”
The implications are particularly worrisome as personalization features – like ChatGPT’s memory, which tracks a user’s information during a conversation – are becoming increasingly common. Such characteristics risk treating already marginalized groups differently.
“The LLM has been marketed as a tool that will promote more equitable access to information and revolutionize personalized learning,” says Poole-Dayan. “But our findings show that they may actually exacerbate existing disparities by systematically providing misinformation or refusing to answer questions to some users. The people who might trust these tools the most may also be receiving poor, false, or harmful information.”