As language models (LMs) improve at tasks like image generation, trivia questions, and simple math, you might think that human-like reasoning is just around the corner. In reality, they still lag behind us on complex tasks by a considerable margin. For example, try playing Sudoku with one, where you fill in the numbers one through nine so that each appears only once in every column, row, and section of a nine-by-nine grid. Your AI opponent will either fail to fill in the boxes correctly or will do so inefficiently, although it can verify whether you filled in yours correctly.
Whether an LM is trying to solve advanced puzzles, design molecules, or write mathematical proofs, it struggles to respond to open-ended requests that come with strict rules. These models are better at telling users how to approach such challenges than at tackling the challenges themselves. Moreover, practical problem-solving requires the LM to consider a wide range of alternatives while adhering to constraints. Small LMs cannot do this reliably on their own; large language models (LLMs) sometimes can, especially if they are optimized for reasoning tasks, but they take a while to respond and use a lot of computing power.
This challenge inspired researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) to develop a collaborative approach in which an LLM creates a plan and then divides the work for that plan among a team of smaller models. Their method helps small LMs provide more accurate responses than leading LMs like OpenAI’s GPT-4o, and approaches the accuracy of top reasoning systems like o1 while being more efficient than both. In their framework, called “Distributional Constraints by Inferring Programming with Language Models” (or “DisCIPL”), a larger model leads smaller “follower” models to accurate responses when writing things like text blurbs, grocery lists with budgets, and travel itineraries.
The internal workings of DisCIPL are similar to contracting a company for a specific task. You provide a “boss” model with the request, and it carefully considers how to get that project done. The LLM then conveys its instructions and guidelines to the smaller models in a clear manner, and it corrects a follower LM’s output when necessary – for example, replacing a phrase that does not fit a poem with a better alternative from another model.
The LLM communicates with its followers using a language they all understand – a probabilistic programming language for controlling LMs called “LLaMPPL.” Developed by MIT’s Probabilistic Computing Project in 2023, the language allows users to encode specific rules that steer a model toward a desired outcome. For example, LLaMPPL can be used to produce error-free code by building the rules of a particular programming language into its instructions. Instructions such as “Write eight lines of poetry where each line contains exactly eight words” are encoded in LLaMPPL, which queues up small models to contribute to different parts of the answer.
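To make that idea concrete, here is a minimal, self-contained sketch of how such a rule can be enforced during generation: candidate lines are built word by word, and any partial output that can no longer satisfy the “exactly eight words” constraint is discarded. This is an illustration only, not the actual LLaMPPL API; the propose_next_word function and the tiny vocabulary are toy stand-ins for a real follower model.

```python
import random

# Toy vocabulary standing in for a real follower language model's outputs.
VOCAB = ["the", "moon", "drifts", "over", "quiet", "water", "and", "sings",
         "while", "shadows", "gather", "softly", "beneath", "old", "trees"]

def propose_next_word(prefix):
    """Toy stand-in for a follower LM: propose a few candidate next words."""
    return random.sample(VOCAB, k=3)

def satisfiable(prefix, target_len=8):
    """Constraint check: can this partial line still end up with exactly
    `target_len` words?"""
    return len(prefix) <= target_len

def generate_line(target_len=8, num_particles=16):
    """Keep a population of partial lines ("particles"), extend them word by
    word, and discard candidates that violate the length constraint."""
    particles = [[] for _ in range(num_particles)]
    for _ in range(target_len):
        new_particles = []
        for prefix in particles:
            for word in propose_next_word(prefix):
                candidate = prefix + [word]
                if satisfiable(candidate, target_len):
                    new_particles.append(candidate)
        # Resample down to the particle budget (uniformly, for simplicity).
        particles = random.sample(new_particles,
                                  k=min(num_particles, len(new_particles)))
    # Every surviving particle has exactly `target_len` words by construction.
    return " ".join(random.choice(particles))

if __name__ == "__main__":
    for _ in range(3):
        print(generate_line())
```

In the real framework, the planner writes this kind of inference program automatically, and the proposals come from follower LMs rather than a fixed word list.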
MIT PhD student Gabriel Grand, lead author of the paper presenting this work, says DisCIPL allows LMs to guide each other to the best responses, improving their overall efficiency. “We are working toward improving the inference efficiency of LMs, especially on many modern applications of these models, which involve generating outputs subject to constraints,” says Grand, who is also a CSAIL researcher. “Language models are consuming more energy as people use them more, which means we need models that can provide accurate answers while using minimal computing power.”
“It’s really exciting to see new alternatives to standard language model inference,” says Alane Suhr, assistant professor at the University of California at Berkeley, who was not involved in the research. “This work invites new approaches to language model inference that significantly reduce latency through parallelization, require significantly fewer parameters than current LLMs, and also improve task performance over standard autoregressive inference. It also presents opportunities to explore the transparency, interpretability, and controllability of model outputs, which is still a major open problem in the deployment of these technologies.”
A David-and-Goliath story
You might assume that large LMs are simply “better” than small LMs on complex prompts when it comes to accuracy and efficiency. DisCIPL suggests a surprising alternative for these tasks: if you instead combine the strengths of smaller models, you can get similar results with greater efficiency.
The researchers say that, in theory, you could plug dozens of LMs of any size into the DisCIPL framework and have them work together. In their writing and reasoning experiments, they went with GPT-4o as the “planner” LM – one of the models that helps ChatGPT generate responses. It brainstormed a plan for several “Llama-3.2-1B” models (small systems developed by Meta), and those follower LMs then filled out each word (or token) of the response.
This collective approach was pitted against three comparable approaches: a follower-only baseline powered by Llama-3.2-1B, GPT-4o working on its own, and the industry-leading o1 reasoning model that helps ChatGPT work through more complex queries like coding requests and math problems.
DisCIPL was first tested on its ability to write sentences and paragraphs that follow clear rules. The models were given very specific prompts – for example, to write a sentence containing exactly 18 words, where the fourth word must be “Glasgow,” the eighth “in,” and the 11th “and.” The system was remarkably efficient at handling such requests, producing consistent output while achieving the same accuracy and consistency as o1.
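Constraints like this one are easy to state as a simple verifier – the kind of hard rule a planner can hand off for followers to satisfy. The function below is a hypothetical illustration written for this article, not code from the paper:

```python
def satisfies_prompt(sentence: str) -> bool:
    """Check the example constraint: exactly 18 words, with word 4 = "Glasgow",
    word 8 = "in", and word 11 = "and" (counting positions from 1)."""
    words = [w.strip(".,!?;:") for w in sentence.split()]
    return (
        len(words) == 18
        and words[3] == "Glasgow"
        and words[7] == "in"
        and words[10] == "and"
    )
```

In a framework like DisCIPL, checks of this kind can be applied while the text is being generated, rather than only after a complete answer has been produced.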
Faster, cheaper, better
This experiment also showed that key components of DisCIPL were much cheaper to run than state-of-the-art systems. For example, while existing reasoning models like OpenAI’s o1 reason in text, DisCIPL “reasons” by writing Python code, which is more compact. In practice, the researchers found that DisCIPL used 40.1 percent fewer tokens and achieved 80.2 percent cost savings compared to o1.
DisCIPL’s efficiency gains come in part from using smaller Llama models as followers, which are 1,000 to 10,000 times cheaper per token than comparable reasoning models. This also makes DisCIPL more “scalable” – the researchers were able to run dozens of Llama models in parallel for a fraction of the cost.
According to the CSAIL researchers, these weren’t the only surprising findings. Their system also performed well against o1 on real-world tasks, such as creating materials lists, planning an itinerary, and writing grant proposals with word limits. Meanwhile, GPT-4o struggled with these requests, and in the writing tests, it often couldn’t place keywords in the right parts of sentences. The follower-only baseline essentially finished last across the board because it had difficulty following instructions.
“Over the past several years, we have seen some impressive results from approaches that use language models to ‘auto-formalize’ problems in mathematics and robotics by representing them as code,” says senior author Jacob Andreas, an MIT electrical engineering and computer science associate professor and CSAIL principal investigator. “What I find most exciting about this paper is the fact that we can now use LMs to automatically formalize text generation, providing the same kinds of efficiency gains and guarantees that we have seen in these other domains.”
In the future, the researchers plan to expand this framework into a fully recursive approach, where the same model can serve as both the leader and the followers. Grand says DisCIPL can also be extended to mathematical reasoning tasks, where answers are difficult to verify. The team also intends to test the system on its ability to satisfy users’ vaguer preferences – which may not be so easily spelled out in code – as opposed to hard constraints. Thinking even bigger, they hope to use the largest models available, though they note that such experiments are computationally expensive.
Grand and Andreas co-wrote the paper with CSAIL principal investigator and MIT professor Joshua Tenenbaum, as well as Vikash Mansinghka, principal research scientist in MIT’s Department of Brain and Cognitive Sciences, and Alex Lew SM ’20, PhD ’25, assistant professor at Yale University. CSAIL researchers presented the work at the Conference on Language Modeling in October and at IVADO’s “Deployment of Autonomous Agents: Lessons, Risks, and Real-World Impacts” workshop in November.
Their work was supported in part by the MIT Quest for Intelligence, the Siegel Family Foundation, the MIT-IBM Watson AI Lab, the Sloan Research Fellowship, Intel, the Air Force Office of Scientific Research, the Defense Advanced Research Projects Agency, the Office of Naval Research, and the National Science Foundation.