When researchers build large language models (LLMs), they aim to maximize performance under a given computational and financial budget. Since training a model can cost millions of dollars, developers need to make prudent, cost-effective decisions about the model's architecture, optimizers, and training datasets before committing to it. To anticipate the quality and accuracy of a large model's predictions, practitioners often turn to scaling laws: using smaller, cheaper models to try to approximate the performance of a much larger target model. The challenge, however, is that there are thousands of ways to create a scaling law.
New work by MIT and MIT-IBM Watson AI Lab researchers addresses this by amassing and releasing a collection of hundreds of models and metrics concerning their training and performance, which the team used to approximate more than a thousand scaling laws. From this, the researchers developed a meta-analysis and guide for how to select small models and estimate scaling laws for different LLM model families, so that the budget is best applied toward generating reliable performance predictions.
“The notion that you might want to try to build mathematical models of the training process is a couple of years old, but I think what was new here is that most of the work that people had been doing before is saying, ‘can we say something post-hoc about what happened when we trained all of these models, so that when we’re trying to figure out how to train a new large-scale model, we can make the best decisions about how to use our compute budget?’” says Jacob Andreas, associate professor in the Department of Electrical Engineering and Computer Science and principal investigator with the MIT-IBM Watson AI Lab.
The research was recently presented at the International Conference on Machine Learning by Andreas, along with MIT-IBM Watson AI Lab and IBM Research researchers Leshem Choshen and Yang Zhang.
Extrapolating performance
No matter how you slice it, developing LLMs is an expensive proposition: from the decisions about the number of parameters and tokens, data selection and size, and training techniques, to determining output accuracy and tuning for target applications and tasks. Scaling laws offer a way to forecast model behavior by relating a large model's loss to the performance of smaller, less costly models from the same family, avoiding the need to fully train every candidate. Chiefly, the differences among the small models are the number of parameters and the size of their token training sets. According to Choshen, elucidating scaling laws not only enables better pre-training decisions, but also allows researchers without vast resources to understand and build effective scaling laws.
The functional form of scaling laws is relatively simple, incorporating components from the small models that capture the number of parameters and their scaling effect, the number of training tokens and their scaling effect, and the baseline performance of the model family of interest. Together, these help researchers estimate a target large model's performance loss; the smaller the loss, the better the target model's outputs are likely to be.
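As a rough illustration, one widely used parameterization (not necessarily the exact form fit in this study) writes the expected loss L of a model with N parameters trained on D tokens as L(N, D) ≈ E + A/N^α + B/D^β, where E captures the family's irreducible baseline loss and A, B, α, and β are constants estimated from the smaller models' runs; the same fitted constants are then evaluated at the target model's much larger N and D.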
These laws allow research teams to weigh trade-offs efficiently and to allocate limited resources where they matter most. They're particularly useful for evaluating the scaling of a certain variable, such as the number of tokens, and for A/B testing different pre-training setups.
In general, scaling laws aren't new; however, in the field of AI, they emerged with the rise of enormous models and sky-high training costs. “It's as if scaling laws appeared at some point in the field,” says Choshen. “They started to get attention, but no one really tested how good they are and what you need to do to make a good scaling law.” Further, scaling laws themselves were also a black box, in a sense. “Whenever people have created scaling laws in the past, it has always been just one model, or one model family, and one dataset, and one developer,” says Andreas. “There hadn't really been a lot of systematic meta-analysis, because everyone is individually training their own scaling laws. So, [we wanted to know,] are there high-level trends that you see across those things?”
Building better
To investigate this, Choshen, Andreas, and Zhang created a large dataset. They collected LLMs from 40 model families, including Pythia, OPT, OLMo, Llama, Bloom, T5-Pile, ModuleFormer mixture-of-experts, GPT, and others. These included 485 unique, pre-trained models and, where available, data about their training checkpoints, computational cost (FLOPs), training epochs, and seeds, along with 1.9 million performance metrics of loss and downstream tasks. The models differed in their architectures, weights, and so on. Using these models, the researchers fit over 1,000 scaling laws and compared their accuracy across architectures, model sizes, and training regimes, as well as testing how the number of models, the inclusion of intermediate training checkpoints, and partial training affect the predictive power of scaling laws for target models. They used a measure of absolute relative error (ARE): the difference between the scaling law's prediction and the observed loss of the large, trained model. With this, the team compared the scaling laws and, after analysis, distilled practical recommendations for AI practitioners about what makes an effective scaling law.
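To make the fitting and evaluation procedure concrete, here is a minimal sketch in Python of fitting a single scaling law of the common form above to small-model measurements and scoring its extrapolation with ARE. The functional form, model sizes, losses, and the target's "observed" loss are hypothetical placeholders, not the study's data or released code.

```python
# Minimal, illustrative sketch (not the authors' released code): fit one
# Chinchilla-style scaling law to small-model results, then extrapolate to a
# larger target model and score the prediction with absolute relative error.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(x, E, A, B, alpha, beta):
    """Predicted loss for a model with N parameters trained on D tokens."""
    N, D = x
    return E + A / N**alpha + B / D**beta

# Hypothetical measurements from five small models in one family:
# (parameter count, training tokens, final loss); placeholder numbers only.
N = np.array([70e6, 160e6, 410e6, 1.0e9, 1.4e9])
D = np.array([3e9, 6e9, 12e9, 30e9, 42e9])
loss = np.array([3.9, 3.5, 3.1, 2.8, 2.7])

# Fit the five constants of the law to the small-model points.
params, _ = curve_fit(scaling_law, (N, D), loss,
                      p0=[1.5, 400.0, 400.0, 0.3, 0.3], maxfev=20000)

# Extrapolate to a hypothetical large target model and compute ARE against
# the loss eventually observed once that model is actually trained.
predicted = scaling_law((7e9, 140e9), *params)
observed = 2.35  # placeholder value for illustration only
are = abs(predicted - observed) / observed
print(f"predicted loss = {predicted:.3f}, ARE = {are:.1%}")
```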
Their shared guidelines walk developers through steps and options to consider, and what to expect. First, it's critical to decide on a compute budget and target model accuracy. The team found that an ARE of about 4 percent is roughly the best achievable accuracy one can expect, given random seed noise, but an ARE of up to 20 percent is still useful for decision-making. The researchers identified several factors that improve predictions, like including intermediate training checkpoints rather than relying only on final losses; this made scaling laws more reliable. However, very early training data, before about 10 billion tokens, are noisy, reduce accuracy, and should be discarded. They recommend prioritizing training more models across a spread of sizes, not just larger models, to improve the robustness of the scaling law's predictions; a selection of five models provides a solid starting point, as sketched below.
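The snippet below is one way those rules of thumb might be applied to a pool of training checkpoints; the Checkpoint record, thresholds, and selection logic are illustrative assumptions rather than the paper's tooling.

```python
# Illustrative sketch of the checkpoint-selection heuristics above, applied to
# a hypothetical pool of training checkpoints (not the paper's tooling).
from dataclasses import dataclass

@dataclass
class Checkpoint:
    model_name: str
    n_params: int      # model size in parameters
    tokens_seen: int   # training tokens consumed at this checkpoint
    loss: float        # measured loss at this checkpoint

def select_fit_points(checkpoints, min_tokens=10_000_000_000, max_models=5):
    """Keep intermediate checkpoints, but drop the noisy early ones (before
    ~10B tokens) and spread the fit across a handful of distinct model sizes."""
    usable = [c for c in checkpoints if c.tokens_seen >= min_tokens]
    sizes = sorted({c.n_params for c in usable})
    step = max(1, len(sizes) // max_models)
    keep = set(sizes[::step][:max_models])  # a spread of sizes, not only the largest
    return [c for c in usable if c.n_params in keep]
```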
In general, including larger models improves prediction, but costs can be saved by training the target model only partially, to about 30 percent of its dataset, and using that run for extrapolation. If the budget is considerably constrained, developers should consider training one small model within the target model family and borrowing the remaining scaling law parameters from a model family with similar architecture; however, this may not work for encoder-decoder models. Lastly, the MIT-IBM research group found that when scaling laws were compared across model families, there was a strong correlation between two sets of hyperparameters, meaning that three of the five hyperparameters explained nearly all of the variation and could likely capture the model's behavior. Together, these guidelines provide a systematic approach to making scaling law estimation more efficient, reliable, and accessible for AI researchers working under varying budget constraints.
Several surprises arose over the course of this work: small models trained only partway are still very predictive, and further, the intermediate training stages of a fully trained model can be used, as if they were individual models, to predict another target model. “Basically, you don't pay anything in the training, because you already trained the full model, so the half-trained model, for instance, is just a byproduct of what you did,” says Choshen. Another feature Andreas pointed out was that, when aggregated, the variability across model families and different experiments jumped out and was noisier than expected. Unexpectedly, the researchers found that it's possible to use scaling laws on large models to predict the performance of smaller models. Other research in the field has hypothesized that small models were a “different beast” compared to large ones; however, Choshen disagrees. “If they're totally different, they should have shown totally different behavior, and they don't.”
While this work focused on model training time, the researchers plan to extend their analysis to model inference. Andreas says it's not just a question of, “how does my model get better as I add more training data or more parameters, but instead as I let it think for longer, draw more samples. I think there are definitely lessons to be learned here about how to also build predictive models of how much thinking you need to do at run time.” He says the theory of inference-time scaling laws may become even more critical because, “it's not like I'm going to train one model and then be done. [Rather,] it's every time a user comes to me, they're going to have a new query, and I need to figure out how hard [my model needs] to think to come up with the best answer. So, being able to build those kinds of predictive models, like we're doing in this paper, is even more important.”
This research was supported, in part, by the MIT-IBM Watson AI Lab and a Sloan Research Fellowship.