Large language models (LLMs) like ChatGPT can write essays or plan menus almost instantly. But until recently, it was also easy to stump them. The models, which rely on patterns in language to answer users’ questions, often failed at math problems and were not good at complex reasoning. Recently, however, they have become much better at these things.
A new generation of LLMs known as reasoning models is being trained to solve complex problems. Like humans, they need some time to think through such problems, and remarkably, scientists at MIT’s McGovern Institute for Brain Research have found that the problems that require the most processing from reasoning models are the very same problems that require people to take their time. In other words, they report today in the journal PNAS, the “cost of thinking” for a reasoning model is similar to the cost of thinking for a human being.
The researchers, led by Evelina Fedorenko, associate professor of brain and cognitive sciences and an investigator at the McGovern Institute, concluded that in at least one important way, reasoning models have a human-like approach to thinking. This, she says, is not by design. “The people who build these models don’t care whether they do it like humans do. They just want a system that works robustly and responds correctly in all kinds of situations,” says Fedorenko. “The fact that there is some convergence is actually quite surprising.”
Reasoning models
Like many forms of artificial intelligence, the new reasoning models are artificial neural networks: computational tools that learn how to process information when given data and a problem to solve. Artificial neural networks have been very successful at many of the tasks that the brain’s own neural networks do well, and in some cases, neuroscientists have found that the best-performing ones share certain aspects of information processing with the brain. Still, some scientists argued that artificial intelligence was far from capturing the more sophisticated aspects of human intelligence.
“Until recently, I was one of the people saying, ‘These models are really good at things like perception and language, but it’s going to be a long time before we have neural network models that can reason,’” says Fedorenko. “Then these large reasoning models emerged, and they began to perform much better at many thinking tasks, such as solving math problems and writing pieces of computer code.”
Andrea Gregor de Varda, a K. Lisa Yang ICoN Center fellow and a postdoc in Fedorenko’s lab, explains that reasoning models work through problems step by step. “At some point, people realized that models needed more space to do the actual computations needed to solve complex problems,” he says. “If you let the models break the problems into parts, performance starts to be much stronger.”
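To make that idea concrete, here is a minimal sketch in Python of what giving a model “more space” can look like from the outside. The ask_model function is a hypothetical stand-in, not any particular vendor’s API; the point is only the contrast between a one-shot prompt and a prompt that invites intermediate steps.

```python
# Minimal sketch: the same question posed two ways. ask_model is a hypothetical
# stand-in for a call to a language model, not a real API.

def ask_model(prompt: str) -> str:
    """Placeholder model call; here it just reports how long the prompt was."""
    return f"[model reply to a {len(prompt)}-character prompt]"

question = "A train leaves at 2:40 pm and the trip takes 95 minutes. When does it arrive?"

# One-shot prompt: the model must jump straight to an answer.
one_shot = ask_model(question)

# Step-by-step prompt: the model is given room to lay out intermediate
# calculations before committing to a final answer, which is what reasoning
# models are trained to do internally.
step_by_step = ask_model(
    question + "\nWork through the problem step by step, then state the final answer."
)

print(one_shot)
print(step_by_step)
```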
To encourage models to work through complex problems in steps that lead to the correct solution, engineers can use reinforcement learning. During training, models are rewarded for correct answers and penalized for incorrect ones. “The models explore the problem space themselves,” de Varda says. “The solution paths that yield positive rewards are reinforced, so that the models produce correct solutions more often.”
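The reward signal described above can be illustrated with a toy REINFORCE-style update. This is only an illustrative sketch under simplified assumptions, not the procedure used to train real reasoning models: a made-up policy chooses between answering immediately and working step by step, and whichever choice leads to a correct answer is reinforced.

```python
# Toy REINFORCE-style sketch: reward correct answers, penalize wrong ones, and
# watch probability mass shift toward the behaviour that succeeds more often.
# All names and numbers here are invented for illustration.

import math
import random

strategies = ["answer_immediately", "work_step_by_step"]
logits = {s: 0.0 for s in strategies}

def probs() -> dict:
    """Softmax over the current logits."""
    z = sum(math.exp(v) for v in logits.values())
    return {s: math.exp(v) / z for s, v in logits.items()}

def attempt(strategy: str) -> bool:
    """Stand-in for running the model: step-by-step work succeeds more often."""
    return random.random() < (0.9 if strategy == "work_step_by_step" else 0.3)

learning_rate = 0.5
for _ in range(1000):
    p = probs()
    chosen = random.choices(strategies, weights=[p[s] for s in strategies])[0]
    reward = 1.0 if attempt(chosen) else -1.0        # reward correct, penalize incorrect
    for s in strategies:
        grad = (1.0 if s == chosen else 0.0) - p[s]  # gradient of log-prob of the sampled choice
        logits[s] += learning_rate * reward * grad

print(probs())  # most of the probability ends up on "work_step_by_step"
```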
Models trained this way are more likely than their predecessors to arrive at the answers a human would give on a reasoning task. Their step-by-step approach means that reasoning models may take a little longer to find an answer than the LLMs that came before, but because they get right the answers that previous models would have gotten wrong, their responses are worth the wait.
Requiring models to take some time to work through complex problems already suggests a parallel to human thinking: if you demand that someone solve a difficult problem immediately, they will likely fail, too. De Varda wanted to investigate this relationship more systematically. So he gave reasoning models and human volunteers the same sets of problems, and tracked not only whether they got the answers right, but also how much time or effort it took them to get there.
Time versus tokens
This meant measuring how long it took people to answer each question, down to the millisecond. For models, de Varda used a different metric. There was no point in measuring processing time, which depends more on computer hardware than on the effort the model puts into solving a problem. So instead, he tracked tokens, the units that make up a model’s internal chain of thought. “The models produce tokens that are not meant for the user to see and act upon, but rather to keep some track of the internal computations they are doing,” de Varda explains. “It’s as if they were talking to themselves.”
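As a rough illustration of that metric, the sketch below counts the tokens in a reasoning trace. The ModelResponse structure and the whitespace tokenizer are assumptions made for the example; real systems use their own tokenizers and may not expose the trace at all.

```python
# Illustrative sketch of the token-count metric; the data structure and tokenizer
# are hypothetical, not the study's code or any real model API.

from dataclasses import dataclass

@dataclass
class ModelResponse:
    reasoning_trace: str   # step-by-step text the model produced for itself
    final_answer: str      # the part shown to the user

def count_tokens(text: str) -> int:
    # Crude whitespace tokenizer standing in for the model's real tokenizer.
    return len(text.split())

def thinking_cost(response: ModelResponse) -> int:
    """Tokens in the chain of thought: a hardware-independent measure of effort."""
    return count_tokens(response.reasoning_trace)

example = ModelResponse(
    reasoning_trace="95 minutes is 1 hour 35 minutes. 2:40 pm plus 1:35 is 4:15 pm.",
    final_answer="4:15 pm",
)
print(thinking_cost(example))  # more demanding problems yield longer traces
```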
Both the humans and the reasoning models were asked to solve seven different types of problems, such as numerical arithmetic and intuitive reasoning, and were given many problems in each category. The more difficult a given problem was, the longer it took people to solve it, and the longer it took people to solve a problem, the more tokens a reasoning model generated as it arrived at its own solution.
Likewise, the classes of problems that took humans the longest to solve were the same classes that required the most tokens from the models: arithmetic problems were the least demanding, while a set of problems called the “ARC challenge,” in which pairs of colored grids represent a transformation that must be inferred and then applied to a new object, was the most costly for both people and models.
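For a feel for the comparison itself, here is a small, self-contained sketch of the kind of analysis described, using invented numbers rather than the study’s data: per-problem human response times are correlated with per-problem reasoning-token counts.

```python
# Illustrative analysis sketch with made-up numbers (not the study's data):
# if harder problems take people longer and make the model generate more tokens,
# the two cost measures should be strongly positively correlated.

import math

human_seconds = [4.2, 6.8, 9.1, 15.0, 22.5, 31.0]   # hypothetical mean solve times per problem
model_tokens = [120, 180, 260, 410, 650, 900]       # hypothetical reasoning-token counts per problem

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"r = {pearson(human_seconds, model_tokens):.2f}")  # close to 1 when costs rise together
```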
De Varda and Fedorenko say the striking match in thinking costs suggests that, in this respect, reasoning models think the way humans do. That does not mean, however, that the models are recreating human intelligence. The researchers still want to know whether the models use representations of information similar to those in the human brain, and how those representations are transformed into solutions to problems. They are also curious whether the models will be able to handle problems that require world knowledge that is not spelled out in the texts used for model training.
The researchers point out that even though reasoning models produce internal monologues as they solve problems, they are not necessarily using language to think. “If you look at the output these models produce while reasoning, it often contains errors or some redundant bits, even if the model ultimately arrives at the correct answer. So the actual internal computation likely takes place in an abstract, non-linguistic representation space, just as humans don’t use language to think,” he says.