Every year, countries competing in the International Mathematical Olympiad (IMO) bring a booklet of their best, most original problems. Those booklets are shared among delegations, then quietly disappear. No one had ever systematically collected them, cleaned them, and made them available, either to AI researchers testing the limits of mathematical reasoning or to students around the world training for these competitions largely on their own.
Researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), King Abdullah University of Science and Technology (KAUST) and the company HUMAIN have now done exactly that.
MathNet is the largest high-quality dataset of proof-based mathematics problems ever created. Comprising more than 30,000 expert-written problems and solutions spanning 47 countries, 17 languages, and 143 competitions, it is five times larger than the next largest dataset of its kind. The work will be presented at the International Conference on Learning Representations (ICLR) in Brazil later this month.
What makes MathNet different is not only its size, but also its breadth. Previous Olympiad-level datasets have been drawn almost exclusively from competitions in the United States and China. MathNet spans dozens of countries on six continents, covers 17 languages, includes both text- and image-based problems and solutions, and reaches across four decades of competitive mathematics. The goal is to capture the full range of mathematical approaches and problem-solving traditions present in the global mathematics community, not just the most visible ones.
“Every country brings a handbook of its most innovative and most creative problems,” says Shaden Alshammari, an MIT PhD student and lead author of the paper. “They share the booklets with each other, but no one has made the effort to collect them, clean them and upload them online.”
Building MathNet required tracking down 1,595 PDF volumes containing more than 25,000 pages, including digital documents in more than a dozen languages and decades-old scans. A significant part of that collection came from an unlikely source: Navid Safai, a longtime IMO community member and co-author, who had been collecting and scanning those booklets by hand since 2006. His personal collection forms much of the dataset’s backbone.
Sourcing matters as much as scale. While most existing mathematics datasets draw problems from community forums such as the Art of Problem Solving (AoPS), MathNet draws exclusively from official national competition handbooks. The solutions in those booklets are expert-written and peer-reviewed, and they often run for several pages, with the authors working through multiple approaches to the same problem. That depth gives AI models far richer signal for learning mathematical reasoning than the short, informal solutions typical of community-sourced datasets. It also makes the dataset genuinely useful to students: anyone preparing for the IMO or a national competition now has access to a centralized, searchable collection of high-quality problems and worked solutions from traditions around the world.
“I remember many students for whom it was an individual endeavor. No one in their country was training them for this kind of competition,” says Alshammari, who herself competed at the IMO as a student. “We hope this will give them a centralized location with high-quality problems and solutions to learn from.”
The team has deep roots in the IMO community. Co-author Sultan Albarakati currently serves on the IMO board, and the researchers are working to share the dataset directly with the IMO Foundation. To validate the dataset, they assembled a grading group of more than 30 human evaluators from countries including Armenia, Russia, Ukraine, Vietnam, and Poland, who worked together to verify thousands of solutions.
“The MathNet database has the potential to be an excellent resource for both students and leaders looking to work on new problems or find a solution to a difficult question,” says Tanish Patil, IMO deputy leader for Switzerland. “While other collections of Olympiad problems exist (notably, the Contest Collections forum on AoPS), these resources lack the standardized formatting, verified solutions, and critical problem metadata, such as topic and required theory. It will also be interesting to see how this dataset is used to improve the performance of reasoning models, and whether we will soon be able to reliably answer a key question when creating novel Olympiad problems: determining whether a problem is actually original.”
MathNet also serves as a rigorous benchmark for AI performance, and the results paint a more complex picture of AI math skills than recent headlines suggest. Frontier models have made extraordinary progress: some have reportedly achieved gold-medal performance at the IMO, and on standard benchmarks they now solve problems that would baffle most humans. But MathNet shows that progress is uneven. Even GPT-5, the best-performing model tested, averaged about 69.3 percent on MathNet’s main benchmark of 6,400 problems, failing roughly one in three Olympiad-level problems. And on image-based problems, performance drops significantly across the board, exposing visual reasoning as a persistent weak point for even the most capable models.
Several open-source models scored 0 percent on Mongolian-language problems, highlighting another dimension where current AI systems are weak despite their overall strength.
“GPT models are equally good in English and other languages,” says Alshammari. “But many open-source models fail completely in less-common languages like Mongolian.”
MathNet’s diversity is also designed to address a deeper limitation in how AI models learn mathematics. When the training data leans toward English- and Chinese-language problems, models absorb a narrower slice of mathematical culture. A Romanian combinatorics problem or a Brazilian number theory problem may approach the same underlying concept from a completely different angle. The researchers argue that exposure to that breadth makes both humans and AI systems better mathematical thinkers.
Beyond problem-solving, MathNet offers a retrieval benchmark that asks whether models can recognize when two problems share the same underlying mathematical structure, an ability that matters to both AI development and the mathematics community. Near-duplicate problems have appeared on actual IMO exams over the years because detecting mathematical equivalence across different notations, languages, and formats is genuinely hard, even for expert human committees. Testing eight state-of-the-art embedding models, the researchers found that even the strongest identified the correct match only 5 percent of the time on the first try, often ranking structurally unrelated problems as more similar than truly equivalent ones.
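To make the retrieval task concrete, here is a minimal, hypothetical sketch of what such an evaluation might look like: embed every problem, retrieve the nearest neighbor by cosine similarity, and check whether it is the annotated equivalent. The embedding model and the toy problem pair below are illustrative assumptions, not the models or data used in the paper.

```python
# Hypothetical top-1 retrieval evaluation for "structurally equivalent" problems.
# Model name and data format are placeholders, not the authors' actual pipeline.
import numpy as np
from sentence_transformers import SentenceTransformer

# Each entry: (query problem, text of its known structurally equivalent problem)
pairs = [
    ("Prove that among any 5 integers, some 3 have a sum divisible by 3.",
     "Show that any 5 natural numbers contain 3 whose sum is a multiple of 3."),
    # ... more annotated pairs ...
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

queries = [q for q, _ in pairs]
corpus = [t for _, t in pairs]
q_emb = model.encode(queries, normalize_embeddings=True)
c_emb = model.encode(corpus, normalize_embeddings=True)

sims = q_emb @ c_emb.T          # cosine similarity (embeddings are unit-norm)
top1 = sims.argmax(axis=1)      # index of the most similar corpus problem
accuracy = float(np.mean(top1 == np.arange(len(pairs))))
print(f"top-1 retrieval accuracy: {accuracy:.1%}")
```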
The dataset also includes a retrieval-augmented generation benchmark, which tests whether showing a model a structurally related problem before asking it to solve a new one improves performance. It does, but only if the retrieved problem is genuinely relevant. DeepSeek-v3.2-Special gained up to 12 percentage points with relevant retrievals, while irrelevant retrievals degraded performance in about 22 percent of cases.
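As a rough illustration of that setup, the sketch below prepends a retrieved, solved problem to the prompt before asking a model for a proof of a new one. The prompt wording, example problems, and the `call_model` helper are assumptions for illustration, not the configuration evaluated in the paper.

```python
# Hypothetical retrieval-augmented prompting: show the model a related, solved
# problem, then ask it to solve a new one.
def build_prompt(new_problem: str, retrieved_problem: str, retrieved_solution: str) -> str:
    return (
        "Here is a solved problem that may use a related idea:\n\n"
        f"Problem: {retrieved_problem}\n"
        f"Solution: {retrieved_solution}\n\n"
        "Now solve the following problem, giving a complete proof:\n\n"
        f"Problem: {new_problem}\n"
    )

def call_model(prompt: str) -> str:
    # Placeholder: send `prompt` to whichever model is being evaluated.
    raise NotImplementedError

if __name__ == "__main__":
    prompt = build_prompt(
        new_problem="Prove that n^5 - n is divisible by 30 for every integer n.",
        retrieved_problem="Prove that n^3 - n is divisible by 6 for every integer n.",
        retrieved_solution="Factor n^3 - n = (n-1)n(n+1); among three consecutive "
                           "integers one is divisible by 2 and one by 3.",
    )
    print(prompt)
```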
Alshammari co-wrote the paper with Safai, HUMAIN AI engineer Abrar Zainal, KAUST Academy director Sultan Albarakati, and MIT CSAIL colleagues: master’s student Kevin Wayne SB ’25; Mark Hamilton SM ’22, PhD ’25, principal engineering manager at Microsoft; and Professors William Freeman and Antonio Torralba. Their work was partially funded by a Schwarzman College of Computing Fellowship and the National Science Foundation.
MathNet is publicly available at mathnet.csail.mit.edu.