In his 1927 paper, “A Law of Comparative Judgment,” American psychologist L.L. Thurstone proposed that when people choose among several alternatives, they pick the one that has the greatest value to them, even if they cannot assign a particular number to it.
Thurstone was a pioneer of “psychometrics” – a field built on the premise that mental processes we cannot observe can nevertheless be measured and quantified. His 1927 paper laid the groundwork for what is now called the random utility model, providing a mathematical framework for describing human preferences – information that can, in turn, be used to make predictions about various hypothetical situations.
Random utility models (RUMs) are so named because they estimate the “utility”, or benefit, that can be gained from a given choice – such as deciding which book to read first among a stack of novels brought home from the library. “These models are inherently random,” explains Gabriel Farina, assistant professor in MIT’s Department of Electrical Engineering and Computer Science (EECS) and principal investigator in the Laboratory for Information and Decision Systems (LIDS), “because people are different. Everyone has their own preferences, and even those preferences can vary from time to time.” For example, someone who usually chooses coffee over tea in the morning and prefers tea after dinner may, on occasion, reverse that order altogether.
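The coffee-or-tea example can be sketched in a few lines of Python. This is only an illustration, assuming Gaussian noise around a fixed average utility, in the spirit of Thurstone’s model; the function name `simulate_choices` and the numbers used are hypothetical and do not come from the paper.

```python
import numpy as np

def simulate_choices(mean_utility, noise_scale, n_draws, seed=0):
    """Return the fraction of draws in which each option has the highest utility."""
    rng = np.random.default_rng(seed)
    means = np.asarray(mean_utility, dtype=float)
    # Each row is one decision: the average utility plus independent Gaussian noise.
    # The noise values are assumptions chosen for illustration only.
    utilities = means + noise_scale * rng.standard_normal((n_draws, means.size))
    winners = utilities.argmax(axis=1)  # the option chosen in each decision
    return np.bincount(winners, minlength=means.size) / n_draws

options = ["coffee", "tea"]
shares = simulate_choices(mean_utility=[1.0, 0.5], noise_scale=1.0, n_draws=100_000)
for name, share in zip(options, shares):
    print(f"{name}: chosen in {share:.1%} of simulated mornings")
```

With these made-up numbers, coffee wins in about 64 percent of simulated mornings: it is the usual choice, but not the only one.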
To be sure, RUMs are often used in government and industry in situations of far greater consequence than the selection of a hot (or iced) beverage. The models routinely support predictions about what people will choose to do in so-called counterfactual (“what-if”) scenarios, such as: How will they get to work or school if a major route is closed for construction? What routes and means of transportation will they adopt? Or, if a city suddenly receives a $20 million windfall, how should that money be distributed to maximize the common good?
Given that RUMs have been with us for almost 100 years, becoming increasingly sophisticated over time, one might imagine that, at this stage, there would be little room for improvement. However, this is not so.
A paper presented in April at the International Conference on Learning Representations in Rio de Janeiro, Brazil, uncovered fundamental facts showing that there is much more to be learned from these models than traditionally thought. The paper was written by former MIT postdoc Yeshwanth Cherapanamjeri, now based at Nanyang Technological University in Singapore; Farina, who is also a lead faculty member at MIT’s Operations Research Center (ORC); Constantinos Daskalakis, the Avanessians Professor of Computer Science at MIT and a member of MIT’s Computer Science and Artificial Intelligence Laboratory; and Sobhan Mohammadpour, an MIT PhD student in computer science based in LIDS and EECS.
The group’s findings stem, in part, from a shortcoming in how RUMs are estimated in practice, one that has persisted since Thurstone’s day. The data used to estimate the model is largely derived from so-called pairwise comparisons: In a choice between items A and B – whether movies on Netflix, competing products on Amazon.com, news stories posted on Google, etc. – which would you choose? Daskalakis explains that one reason this approach is so widespread is that “it is very hard to assign an exact numerical score, such as 4.37, to the benefit you get from any one item. Whereas comparing two things, and deciding which one you like better, is much easier to do cognitively.” But the approach has a drawback, he adds. “With this way of assessing people’s preferences, looking at only two things at a time, it is impossible to find connections between multiple options.”
The standard way of applying RUMs assumes that the utilities obtained from A and B are independent, but in fact they may be linked, and this would be important to know. For example, if someone campaigning for elected office learns that a potential voter supports gun control, there is a reasonable chance that the same person also supports government-sponsored child care. Similarly, a fan of independent films may also be drawn to foreign films, but be less enthusiastic about Hollywood action blockbusters. “If a digital platform does not take into account the existence of such correlations, it will not be able to accurately predict preferences,” says Daskalakis. “And if Netflix regularly shows you movies you don’t care about, you can cancel your subscription.”
The MIT team proved that it is impossible to obtain information about correlations from pairwise comparisons alone. However, correlations can be recovered when large numbers of people rank three options in order of preference. The same information can also be obtained from a combination of best-of-three and best-of-two choices. In practice, Mohammadpour explains, “You get a group of people to rank three items. Then you can use the method we developed to merge those individual results into a larger model that can provide us with the bigger picture.”
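A small simulation gives a feel for why head-to-head data can hide correlations. It is not the team’s algorithm or proof, only a hand-built illustration: it assumes zero-mean Gaussian utilities for three hypothetical film categories and compares a population with independent tastes to one in which the taste for independent and foreign films is strongly correlated (0.9, an arbitrary value). All names and numbers are assumptions made for the example.

```python
from itertools import combinations
import numpy as np

ITEMS = ["indie", "foreign", "blockbuster"]  # hypothetical catalog

def sample_utilities(cov, n_people, seed):
    """Draw one zero-mean Gaussian utility vector per simulated person."""
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(np.zeros(len(ITEMS)), cov, size=n_people)

def pairwise_win_rates(u):
    """Fraction of people who prefer item i to item j, for every pair."""
    return {
        (ITEMS[i], ITEMS[j]): float(np.mean(u[:, i] > u[:, j]))
        for i, j in combinations(range(len(ITEMS)), 2)
    }

def middle_share(u, item):
    """Fraction of people who rank `item` second out of three."""
    k = ITEMS.index(item)
    return float(np.mean((u.argmax(axis=1) != k) & (u.argmin(axis=1) != k)))

# Two assumed populations: same averages and variances, different correlations.
covariances = {
    "independent": np.eye(3),
    "correlated": np.array([[1.0, 0.9, 0.0],
                            [0.9, 1.0, 0.0],
                            [0.0, 0.0, 1.0]]),
}

results = {}
for label, cov in covariances.items():
    u = sample_utilities(cov, n_people=200_000, seed=1)
    results[label] = (pairwise_win_rates(u), middle_share(u, "blockbuster"))
    pairs, middle = results[label]
    print(label)
    for (a, b), rate in pairs.items():
        print(f"  {a} beats {b}: {rate:.3f}")
    print(f"  blockbuster ranked second: {middle:.3f}")
```

In both populations every pairwise win rate comes out close to 0.5, so two-way comparisons cannot tell them apart. The three-way rankings can: the blockbuster lands in second place about a third of the time when tastes are independent, but only about 10 percent of the time when indie and foreign tastes move together.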
Farina says his own research efforts focus on the computational side of RUMs: designing algorithms that can extract preference information and determining how much data is needed to do so or, equivalently, how many experiments must be run. The good news, he says, is that efficient algorithms are indeed possible for this purpose: the required number of experiments does not grow exponentially with the number of items under review in the catalog or database.
“This paper represents a significant breakthrough,” comments computer scientist Emma Frejinger of the University of Montreal. “It proves mathematically why traditional data collection fails and demonstrates that asking users about their best three [choices] unlocks the ability to train these powerful models with accuracy. This finding provides a highly practical roadmap for collecting better data to drive more accurate optimizations.”
“Building utility models will continue to be a very active area,” Daskalakis emphasizes. “Just as RUMs have been critical to the Internet economy since the late 1990s, they will continue to be critical to the alignment of AI models going forward.” More importantly, he adds, “RUMs play a central role in the commercial feasibility and utility of large language models [LLMs].” During training, people are typically asked to rank different candidate outputs of these LLMs, allowing the models to gain a better understanding of what type of text – in terms of tone, style, and content – is preferred.
Noting that we are constantly “surrounded by a vast sea of choices in so many different domains,” Daskalakis says, “you can’t possibly ask people to state all their personal preferences for all possible scenarios. So you can instead build a model that predicts what people think about different possible outcomes. And you have to keep improving and updating your model in an iterative process until hopefully you can make good predictions.”