Training a large artificial intelligence model is expensive, not only in dollars, but also in time, energy, and computational resources. Traditionally, obtaining a smaller, faster model required either training a larger model first and then scaling it down, or training a smaller model from scratch and accepting weaker performance.
Researchers at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), the Max Planck Institute for Intelligent Systems, the European Laboratory for Learning and Intelligent Systems, ETH, and Liquid AI have now developed a new method that removes this trade-off entirely by compressing models during training instead.
The technology, called CompreSSM, targets a family of AI architectures known as state-space models, which power applications ranging from language processing to audio generation and robotics. By borrowing mathematical tools from control theory, researchers can identify which parts of the model are pulling their weight and which are not, before surgically removing unnecessary components early in the training process.
“This is essentially a technique to make models smaller and faster during training,” says Makram Chahine, a PhD student in electrical engineering and computer science affiliated with CSAIL and lead author of the paper. “While learning, they are also getting rid of the parts that are not useful to their development.”
The key insight is that the relative importance of different components within these models stabilizes surprisingly quickly during training. Using a mathematical quantity called the Hankel singular value, which measures how much each internal state contributes to the overall behavior of the model, the team showed that they could reliably rank which dimensions matter and which do not after only 10 percent of the training process. Once those rankings are established, the less important components can be safely discarded, and the remaining 90 percent of training proceeds at the speed of the much smaller model.
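To make the mechanism concrete, here is a simplified, hypothetical sketch (not the paper's implementation) of how Hankel singular values can rank the internal states of a small discrete-time linear state-space model. The Gramians are obtained from standard Lyapunov equations via SciPy; a faithful model reduction would also apply a balancing transform before truncating, which is omitted here for brevity.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

# Hypothetical illustration: rank the internal states of a discrete-time
# linear state-space model
#     x[t+1] = A x[t] + B u[t],   y[t] = C x[t]
# by their Hankel singular values, then keep only the top-k states.

rng = np.random.default_rng(0)
n, m, p = 16, 2, 2                           # state, input, output dimensions
A = rng.normal(size=(n, n))
A *= 0.9 / max(abs(np.linalg.eigvals(A)))    # make A stable (spectral radius < 1)
B = rng.normal(size=(n, m))
C = rng.normal(size=(p, n))

# Controllability and observability Gramians from discrete Lyapunov equations:
#   A P A^T - P + B B^T = 0,   A^T Q A - Q + C^T C = 0
P = solve_discrete_lyapunov(A, B @ B.T)
Q = solve_discrete_lyapunov(A.T, C.T @ C)

# Hankel singular values: square roots of the eigenvalues of P @ Q.
# Each value measures how much one internal state contributes to the
# input-output behavior of the model.
hsv = np.sqrt(np.abs(np.linalg.eigvals(P @ Q)))
order = np.argsort(hsv)[::-1]                # most important states first

k = 4                                        # target (compressed) state dimension
keep = order[:k]                             # states to retain; the rest are pruned
print(np.sort(hsv)[::-1][:k])
```

In the training setting described above, a ranking like this would be computed early on, the low-ranked dimensions discarded, and training continued with the smaller state.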
“The exciting thing about this work is that it turns compression from an afterthought to part of the learning process,” says senior author Daniela Rus, MIT professor and director of CSAIL. “Instead of training a large model and then figuring out how to make it smaller, CompreSSM lets the model discover its own efficient structure as it learns. This is a fundamentally different way of thinking about building AI systems.”
The results are striking. On image classification benchmarks, compressed models maintain the same accuracy as their full-size counterparts while training up to 1.5 times faster. A compressed model reduced to about a quarter of its original state dimension achieved 85.7 percent accuracy on the CIFAR-10 benchmark, compared to only 81.8 percent for a model trained at that smaller size from scratch. On Mamba, one of the most widely used state-space architectures, the method achieved approximately 4x training speedups, compressing a 128-dimensional model to approximately 12 dimensions while maintaining competitive performance.
“You get larger-model performance because you capture most of the complex dynamics during the warm-up phase, keeping only the most useful dimensions after that,” says Chahine. “The model is still able to perform at a higher level than if a smaller model were trained from the beginning.”
What makes CompreSSM different from existing approaches is its theoretical foundation. Traditional pruning methods train a full model and then remove parameters after the fact, meaning you still pay the full computational cost of training a larger model. Knowledge distillation, another popular technique, requires training a larger “teacher” model and then training a second, smaller “student” model on top of that, essentially doubling the training effort. CompreSSM avoids both of these costs by making informed compression decisions mid-stream.
The team benchmarked CompreSSM head-to-head against both alternatives. Compared to Hankel atomic norm regularization, a recently proposed spectral technique for encouraging compact state-space models, CompreSSM was more than 40 times faster while also achieving higher accuracy. The regularization approach slowed training by about 16 times because it required expensive eigenvalue computations at every single gradient step, and even then, the resulting models performed poorly. Compared to knowledge distillation on CIFAR-10, CompreSSM held a clear advantage for heavily compressed models: at small state dimensions, the distilled model showed significant accuracy degradation, while the CompreSSM-compressed model maintained almost all of its performance. And because distillation requires both the teacher and the student to run at every training step, even its smaller student models train more slowly than the full-sized baseline.
Using Weyl’s theorem, the researchers proved mathematically that the importance of individual model states changes smoothly during training, and showed empirically that the relative ranking of those states remains stable. Together, these findings give practitioners confidence that dimensions identified as unimportant early on will not suddenly become important later.
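The flavor of this stability argument can be checked numerically. Weyl's inequality implies that when a matrix receives a small additive update, each of its singular values moves by at most the spectral norm of that update, so importance scores cannot jump abruptly between nearby training steps. The sketch below is a generic illustration of that bound, not the paper's specific analysis:

```python
import numpy as np

# Illustration of the perturbation bound behind the stability argument
# (Weyl's inequality for singular values): if a matrix changes by a small
# update E, every singular value moves by at most the spectral norm of E.

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))          # stand-in for model parameters at step t
E = 1e-3 * rng.normal(size=(8, 8))   # stand-in for one small gradient update

sv_before = np.linalg.svd(A, compute_uv=False)      # singular values at step t
sv_after = np.linalg.svd(A + E, compute_uv=False)   # singular values at step t+1

shift = np.max(np.abs(sv_after - sv_before))  # largest movement of any value
bound = np.linalg.norm(E, 2)                  # spectral norm of the update
assert shift <= bound + 1e-12                 # Weyl: shift never exceeds ||E||_2
print(shift, bound)
```

Because each training step changes the parameters only slightly, the singular values drift smoothly, which is what makes an early, one-time ranking trustworthy for the rest of training.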
This method also comes with a practical safety net. If the compression phase causes unexpected performance degradation, practitioners can revert to a previously saved checkpoint. “This gives people control over how much they are willing to pay in terms of performance, rather than defining a less-intuitive energy threshold,” explains Chahine.
The technique has some practical limitations. CompreSSM works best on models that exhibit a strong correlation between internal state dimension and overall performance, a property that varies across tasks and architectures. The method is particularly effective on multi-input, multi-output (MIMO) models, where the relationship between state size and expressivity is strongest. For per-channel, single-input, single-output architectures, the benefits are more modest, because those models are less sensitive to changes in state dimension in the first place.
The theory applies most cleanly to linear time-invariant systems, although the team has developed extensions for increasingly popular input-dependent, time-varying architectures. And because the family of state-space models extends to architectures such as linear attention, a growing area of interest as an alternative to traditional transformers, the potential scope of application is broad.
Chahine and his colleagues see this work as an important step forward. The team has already demonstrated an extension to linear time-varying systems like Mamba, and future directions include extending CompreSSM to the matrix-valued dynamical systems used in linear attention mechanisms, which would bring the technology closer to the Transformer architecture that underlies most large AI systems today.
“This should be the first step, because this is where the theory is neat and the approach can remain theoretical,” says Chahine. “This is a step toward expanding to other architectures that people are using in the industry today.”
“The work of Chahine and his colleagues provides an interesting, theoretically grounded perspective on compression for modern state-space models (SSMs),” says Antonio Orvieto, principal investigator at the ELLIS Institute Tübingen and independent group leader at the Max Planck Institute for Intelligent Systems, who was not involved in the research. “The method provides evidence that the state dimension of these models can be effectively reduced during training and that a control-theoretic perspective can successfully guide this process. The work opens new avenues for future research, and the proposed algorithm has the potential to become a standard approach when pre-training large SSM-based models.”
The work, which was accepted as a conference paper at the International Conference on Learning Representations 2026, will be presented later this month. It was partially supported by the Max Planck ETH Center for Learning Systems, the Hector Foundation, Boeing, and the US Office of Naval Research.