Reasoning large language models (LLMs) solve complex problems by breaking them down into a series of smaller steps. These powerful models excel at challenging tasks such as advanced programming and multistep planning.
But developing reasoning models requires huge amounts of computation and energy, in part because of inefficiencies in the training process: while some high-powered processors churn through long, complex queries, others in the cluster sit idle.
Researchers at MIT and elsewhere have found a way to harness this computational downtime to speed up reasoning-model training.
Their new method automatically trains a smaller, faster model to predict the output of a larger reasoning LLM, which the larger model then verifies. This offloads much of the generation work from the reasoning model, speeding up the training process.
The key to the system is its ability to adaptively train and deploy the small model so that it runs only when certain processors would otherwise be idle. By taking advantage of computational resources that would otherwise be wasted, it speeds up training without incurring additional overhead.
When tested on multiple reasoning LLMs, the method roughly doubled training speed while preserving accuracy. This could reduce the cost and increase the energy efficiency of developing advanced LLMs for applications such as forecasting financial trends or detecting risks in power grids.
“People want models that can handle more complex tasks. But if that’s the goal of model development, we need to prioritize efficiency. We found a lossless solution to this problem and then developed a full-stack system that can deliver quite dramatic speedups in practice,” says Qinghao Hu, an MIT postdoc and co-lead author of a paper on this technology.
Hu is joined on the paper by co-lead authors Shang Yang and Junxian Guo, both electrical engineering and computer science (EECS) graduate students; senior author Song Han, an associate professor in EECS, a member of the Research Laboratory of Electronics, and a distinguished scientist at NVIDIA; as well as others at NVIDIA, ETH Zurich, the MIT-IBM Watson AI Lab, and the University of Massachusetts at Amherst. The research will be presented at the ACM International Conference on Architectural Support for Programming Languages and Operating Systems.
A training obstacle
Developers want reasoning LLMs to be able to identify and correct mistakes in their reasoning process. This ability lets them tackle complex questions that are beyond the reach of a standard LLM.
To teach them this skill, developers train reasoning LLMs using a technique called reinforcement learning (RL). The model generates multiple candidate answers to a question, receives a reward for the best candidates, and is updated based on those top answers. These steps are repeated thousands of times as the model learns.
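The loop described above can be sketched in a toy form. This is purely illustrative, not the authors' algorithm: the "model" is a single number, a "candidate answer" is a noisy guess, and the reward prefers answers closest to a hypothetical target.

```python
import random

# Toy sketch of the RL loop described above: generate several candidate
# answers (the rollout), score them, and update the model toward the best one.
# The "model" here is just a bias value; everything is illustrative.

TARGET = 7.0

def generate(model_bias):
    """A 'candidate answer': the model's bias plus some noise."""
    return model_bias + random.uniform(-2, 2)

def reward(answer):
    """Higher reward for answers closer to the target."""
    return -abs(answer - TARGET)

def rl_step(model_bias, n=4, lr=0.5):
    candidates = [generate(model_bias) for _ in range(n)]  # rollout phase
    best = max(candidates, key=reward)                     # pick top answer
    return model_bias + lr * (best - model_bias)           # update toward it

random.seed(0)
bias = 0.0
for _ in range(20):        # repeat the generate-reward-update cycle
    bias = rl_step(bias)
print(round(bias, 2))      # the model drifts toward the target over many steps
```

The rollout phase (generating the candidates) dominates the cost in real RL training, which is the bottleneck the article turns to next.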
But the researchers found that the process of generating multiple answers, called rollout, can consume up to 85 percent of the execution time required for RL training.
“Updating the model – which is the actual ‘training’ part – takes comparatively very little time,” says Hu.
This bottleneck occurs in standard RL algorithms because all processors in the cluster must finish generating their responses before training can move to the next step. Since some processors may be working on very long responses, those that generate shorter responses sit idle, waiting for the stragglers to finish.
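A few lines of arithmetic show why one long "tail" response is so costly. The per-worker times below are made up for illustration; the point is that the whole step lasts as long as the slowest rollout.

```python
# Why synchronous rollout wastes compute: every worker waits for the slowest.
# The times below are illustrative, not measurements from the paper.

rollout_times = [12.0, 15.0, 18.0, 95.0]   # seconds per worker for one step

step_time = max(rollout_times)             # the batch waits for the straggler
busy = sum(rollout_times)                  # total useful work actually done
idle = step_time * len(rollout_times) - busy

print(f"step takes {step_time:.0f}s; {idle:.0f}s of processor time sits idle")
# With one long-tail response, most workers are idle for most of the step.
```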
“Our goal was to turn this wasted time into speedup, without any added cost,” says Hu.
They turned to an existing technique, called speculative decoding, to speed things up. In speculative decoding, a smaller model, called a drafter, is trained to rapidly predict the future output of a larger model.
The larger model checks the drafter’s predictions, and the responses it accepts are used for training.
Because the larger model can verify all of the drafter’s predictions in a single pass, rather than generating each token sequentially, the process is much faster.
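The draft-and-verify cycle described above can be sketched with toy "models." Both models here are trivial stand-ins (each just increments the last token), chosen so the mechanics of proposing, verifying, and accepting tokens are easy to follow; this is not the authors' implementation.

```python
# Minimal sketch of speculative decoding with toy stand-in "models".

def target_next(prefix):
    """The large model: one (expensive) step, returns the next token."""
    return prefix[-1] + 1 if prefix else 0

def draft_k(prefix, k):
    """The small drafter: k cheap guesses for the next k tokens."""
    out = list(prefix)
    for _ in range(k):
        out.append(out[-1] + 1 if out else 0)
    return out[len(prefix):]

def speculative_step(prefix, k=4):
    guesses = draft_k(prefix, k)
    accepted = []
    # The target verifies all k guesses in one parallel pass; here we
    # simulate that by comparing each guess with the target's prediction.
    for g in guesses:
        t = target_next(prefix + accepted)
        if g == t:
            accepted.append(g)   # guess matches: keep it for free
        else:
            accepted.append(t)   # mismatch: keep the target's token and stop
            break
    return prefix + accepted

print(speculative_step([0], k=4))  # → [0, 1, 2, 3, 4]
```

Here the toy drafter always agrees with the toy target, so all four guesses are accepted and the target advances four tokens for the cost of one verification pass. With a real drafter, the speedup depends on how often its guesses are accepted.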
An adaptive solution
But in speculative decoding, the drafter model is typically trained once and then stays fixed. That makes the technique infeasible for reinforcement learning, where the reasoning model is updated thousands of times during training: a static drafter becomes stale and useless after just a few steps.
To overcome this problem, the researchers created an adaptive system called “Taming the Long Tail,” or TLT.
The first component of TLT is an adaptive drafter trainer, which uses free time on idle processors to continually train the drafter model, keeping it closely aligned with the target model without consuming additional computational resources.
The second component, an adaptive rollout engine, manages speculative decoding by automatically selecting the best strategy for each new batch of inputs. The engine adjusts the speculative decoding configuration based on characteristics of the training workload, such as the number of inputs the draft model processes and the number the target model accepts during verification.
In addition, the researchers designed the drafter model to be lightweight so it can be trained quickly, and TLT reuses some components of the reasoning-model training pipeline to train the drafter, providing additional speedups.
“As soon as some processors complete their short queries and become idle, we immediately switch them to training draft models using the same data they are using for the rollout process. The key mechanism is our adaptive speculative decoding – these benefits would not be possible without it,” Hu says.
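The scheduling idea in Hu's description can be sketched as a simple dispatch rule: a worker that has finished its short rollout is immediately reassigned to drafter training instead of idling. The worker names and task labels below are hypothetical; the real system involves far more machinery.

```python
# Hedged sketch of TLT's scheduling idea: repurpose workers that finish
# their rollouts early, instead of letting them idle. Names are hypothetical.

def schedule(workers, rollout_done):
    """Assign each worker either 'rollout' or 'train_drafter'."""
    plan = {}
    for w in workers:
        if rollout_done[w]:
            # Finished early: reuse the idle time to update the drafter
            # on the rollout data this worker just generated.
            plan[w] = "train_drafter"
        else:
            # Still generating a long-tail response.
            plan[w] = "rollout"
    return plan

workers = ["gpu0", "gpu1", "gpu2", "gpu3"]
done = {"gpu0": True, "gpu1": True, "gpu2": False, "gpu3": True}
print(schedule(workers, done))
# → {'gpu0': 'train_drafter', 'gpu1': 'train_drafter',
#    'gpu2': 'rollout', 'gpu3': 'train_drafter'}
```

Because the drafter is retrained continuously on fresh rollouts, it stays aligned with the reasoning model as it is updated, which is what makes speculative decoding viable inside the RL loop.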
They tested TLT on several reasoning LLMs trained with real-world datasets. The system sped up training by between 70 and 210 percent while preserving each model’s accuracy.
As an added bonus, the smaller drafter model comes out of the process essentially for free and can be reused for efficient deployment.
In the future, the researchers want to integrate TLT into additional training and inference frameworks, and to find new reinforcement-learning applications that could be accelerated with this approach.
“As reasoning inference is becoming the dominant workload, with increasing demand, Qinghao’s TLT does a great job of tackling the computation bottleneck of training these reasoning models. I think this method will be very helpful in the context of efficient AI computing,” Han says.
This work is funded by the MIT-IBM Watson AI Lab, the MIT AI Hardware Program, the MIT Amazon Science Hub, Hyundai Motor Company, and the National Science Foundation.