Adoption of new tools and technologies occurs when users largely view them as reliable, accessible, and cost-effective improvements to available methods and workflows. Five PhD students from the inaugural class of the MIT-IBM Watson AI Lab Summer Program are harnessing cutting-edge resources, reducing AI pain points, and building new features and capabilities to boost AI usability and deployment – from learning when to trust one model that predicts the accuracy of another to reasoning more effectively based on knowledge. Together, the efforts of students and their mentors form a through-line, where practically and technically rigorous research leads to more reliable and valuable models across domains.
Building probes, routers, new attention mechanisms, synthetic datasets, and program-synthesis pipelines, students’ work spans security, inference efficiency, multimodal data, and knowledge-based reasoning. Their technologies emphasize scaling and integration, the impact of which is always in sight.
Learning to trust, and when
MIT mathematics graduate student Andrey Bryutkin’s research prioritizes the reliability of models. He looks for internal structures within problems, such as the equations governing a system and its conservation laws, to understand how to leverage them to produce more reliable and robust solutions. Armed with this approach and working with the lab, Bryutkin developed a method for probing the nature of large language model (LLM) behavior. Together with Veronika Thost of IBM Research and Marzyeh Ghassemi, an associate professor and the Germeshausen Career Development Professor in the MIT Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems, Bryutkin explored the “uncertainty of uncertainty” in LLMs.
Classically, small feed-forward neural networks of two-to-three layers, called probes, are trained alongside LLMs and employed to flag unreliable answers from the larger models to developers. However, these classifiers can also produce false negatives and only provide point estimates, which don’t offer much information about when and why the LLM fails. Examining safe/unsafe prompts and question-answer tasks, the MIT-IBM team used prompt-label pairs, along with hidden states such as activation vectors and final tokens from LLMs, to measure gradient scores, sensitivity to prompts, and out-of-distribution data in order to determine how reliable probes are and to learn which areas of the data are difficult to predict. Their method also helps identify potential labeling noise. This is a critical function, as the trustworthiness of AI systems depends entirely on the quality and accuracy of the labeled data on which they are built. More accurate and consistent probes are especially important for domains with critical data in applications like IBM’s Granite Guardian family of models.
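For readers unfamiliar with probes, the sketch below illustrates the general idea: a small feed-forward classifier trained on an LLM’s final-token activation vectors to output a point estimate of answer unreliability. It is a minimal sketch using synthetic stand-in data; the dimensions, architecture, and training details are illustrative assumptions, not the team’s implementation.

```python
import torch
import torch.nn as nn

# Minimal sketch of a reliability probe: a small feed-forward classifier
# over an LLM's hidden states (e.g., final-token activation vectors) that
# outputs a point estimate of answer unreliability. Dimensions and training
# details are illustrative, not the team's setup.
class ReliabilityProbe(nn.Module):
    def __init__(self, hidden_dim: int, probe_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(hidden_dim, probe_dim), nn.ReLU(),
            nn.Linear(probe_dim, probe_dim), nn.ReLU(),
            nn.Linear(probe_dim, 1),
        )

    def forward(self, activations: torch.Tensor) -> torch.Tensor:
        # Probability that the LLM's answer is unreliable (a point estimate).
        return torch.sigmoid(self.net(activations))

# Training sketch on synthetic stand-ins for (activation, label) pairs,
# where label 1 marks an answer flagged as unsafe or incorrect.
hidden_dim = 4096
probe = ReliabilityProbe(hidden_dim)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-4)
loss_fn = nn.BCELoss()

activations = torch.randn(64, hidden_dim)      # stand-in LLM hidden states
labels = torch.randint(0, 2, (64,)).float()    # stand-in reliability labels
for _ in range(100):
    preds = probe(activations).squeeze(-1)
    loss = loss_fn(preds, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```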
Another way to ensure reliable responses to queries from LLMs is to augment them with external, trusted knowledge bases to eliminate hallucinations. For structured data, such as social media connections, financial transactions, or corporate databases, knowledge graphs (KGs) are a natural fit. However, communications between LLMs and KGs often use rigid, multi-agent pipelines that are computationally inefficient and expensive. To address this, physics graduate student Jinyeop Song, along with Jada Zhu of IBM Research and EECS Associate Professor Julian Shun, created a single-agent, multi-turn, reinforcement learning framework that streamlines this process. Here, the group designed an API server hosting the Freebase and Wikidata KGs, which contain general web-based knowledge, and an LLM agent that issues targeted retrieval actions to fetch pertinent information from the server. Then, through continuous back-and-forth, the agent appends the data gathered from the KG to the context and responds to the query. Critically, the system uses reinforcement learning to train itself to deliver answers that strike a balance between accuracy and completeness. The framework pairs an API server with a single reinforcement-learning agent to orchestrate data-grounded reasoning with improved accuracy, transparency, efficiency, and transferability.
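A minimal sketch of what such a single-agent, multi-turn retrieval loop might look like appears below. The `KG_SERVER` endpoint, the action format, and the `llm.generate` interface are hypothetical placeholders, not the team’s framework, and the reinforcement learning that trains the agent’s policy is omitted.

```python
import requests

# Illustrative sketch of a single-agent, multi-turn retrieval loop. The
# KG_SERVER endpoint, the action format, and the llm.generate interface are
# hypothetical placeholders, not the team's framework.
KG_SERVER = "http://localhost:8080/query"  # assumed API server hosting the KGs

def answer_with_kg(llm, question: str, max_turns: int = 5) -> str:
    context = [f"Question: {question}"]
    for _ in range(max_turns):
        # The agent chooses its next move: fetch more facts, or answer.
        action = llm.generate("\n".join(context) + "\nNext action:")
        if action.startswith("ANSWER:"):
            return action.removeprefix("ANSWER:").strip()
        # Treat the action as a targeted retrieval request and append the
        # fetched facts to the context for the next turn.
        facts = requests.post(KG_SERVER, json={"action": action}).json()
        context.append(f"Retrieved: {facts}")
    # Fall back to answering with whatever has been gathered so far.
    return llm.generate("\n".join(context) + "\nFinal answer:")
```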
Spend wisely
The timeliness and completeness of a model’s response are as important as its accuracy. This is especially true for handling long input texts, and those where elements, like the subject of a story, evolve over time, so EECS graduate student Songlin Yang is re-engineering what models can handle at each step of inference. Focusing on the limitations of the transformers that underpin LLMs, Rameswar Panda of IBM Research and Yoon Kim, the NBX Professor and associate professor in EECS, joined Yang to develop next-generation language model architectures beyond the transformer.
Transformers face two major limitations: high computational complexity when modeling long sequences, owing to the softmax attention mechanism, and limited expressivity resulting from the weak inductive bias of rotary positional encoding (RoPE). The first limitation means that as the input length doubles, the computational cost quadruples. RoPE allows the transformer to understand the sequential order of tokens (i.e., words); however, it struggles to capture changes in internal state over time, such as variable values, and it is limited to the sequence lengths observed during training.
To address this, the MIT-IBM team explored theoretically grounded yet hardware-efficient algorithms. As an alternative to softmax attention, they adopted linear attention, which removes the quadratic complexity that caps the feasible sequence length. They also investigated hybrid architectures combining softmax and linear attention to strike a better balance between computational efficiency and performance.
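The contrast between the two attention mechanisms can be made concrete in a few lines. Below is a minimal sketch: standard softmax attention materializes an n × n score matrix, while linear attention applies a feature map to queries and keys (here elu + 1, a common choice in the literature) and reassociates the matrix product so that cost grows linearly in sequence length. This is a generic illustration, not the team’s specific architecture.

```python
import torch

def softmax_attention(q, k, v):
    # Standard attention: materializes an (n x n) score matrix, so compute
    # and memory grow quadratically with sequence length n.
    scores = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return scores @ v

def linear_attention(q, k, v, feature_map=torch.nn.functional.elu):
    # Linear attention: map q and k through a positive feature map (elu + 1
    # here), then reassociate the product as phi(q) @ (phi(k)^T v). The
    # (d x d) intermediate is independent of n, so cost grows linearly.
    q, k = feature_map(q) + 1, feature_map(k) + 1
    kv = k.transpose(-2, -1) @ v                   # (d x d) summary of k, v
    normalizer = q @ k.sum(dim=-2, keepdim=True).transpose(-2, -1)
    return (q @ kv) / normalizer

q = k = v = torch.randn(2, 1024, 64)        # (batch, seq_len, head_dim)
out_quadratic = softmax_attention(q, k, v)  # builds a 1024 x 1024 matrix
out_linear = linear_attention(q, k, v)      # avoids it entirely
```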
To increase expressivity, they replaced RoPE with a dynamic reflective positional encoding based on the Householder transform. This approach enables richer positional interactions for a deeper understanding of sequential information, while maintaining fast and efficient computation. The MIT-IBM team’s advances reduce the need for transformers to break problems into many steps, instead enabling them to tackle more complex subproblems with fewer inference tokens.
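A Householder transform reflects a vector across a hyperplane and can be applied in O(d) time per vector, without ever forming an explicit matrix. The sketch below shows the basic operation and one way a data-dependent reflection could be applied to queries before attention; the parameterization of the reflection vectors is an assumption for illustration, not the team’s encoding.

```python
import torch

def householder_reflect(x: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # Householder transform H = I - 2*v*v^T / (v^T v): reflects x across the
    # hyperplane orthogonal to v. Applying H costs O(d) per vector, with no
    # explicit (d x d) matrix ever formed.
    v = v / v.norm(dim=-1, keepdim=True)
    return x - 2 * (x * v).sum(dim=-1, keepdim=True) * v

# Hypothetical use: derive a data-dependent ("dynamic") reflection vector at
# each position and apply it to the queries before attention. The linear
# projection producing v is an assumption, not the team's parameterization.
seq_len, dim = 128, 64
queries = torch.randn(seq_len, dim)
hidden = torch.randn(seq_len, dim)        # per-position hidden states
v = torch.nn.Linear(dim, dim)(hidden)     # per-position reflection vectors
encoded_queries = householder_reflect(queries, v)
```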
Fresh vision
Visual data contain many properties that the human brain can quickly parse, internalize, and then imitate. Using vision-language models (VLMs), two graduate students are exploring ways to do this through code.
Over the past two summers, and under the mentorship of Aude Oliva, MIT director of the MIT-IBM Watson AI Lab and a senior research scientist in the Computer Science and Artificial Intelligence Laboratory, along with Rogerio Feris, Dan Gutfreund, and Leonid Karlinsky (now at Xero) of IBM Research, Jovana Kondic of EECS has explored visual document understanding, especially charts. Charts contain elements, such as data points, legends, and axis labels, that require optical character recognition and numerical reasoning, with which models still struggle. To facilitate performance on tasks like these, Kondic’s group set out to create a large, open-source, synthetic chart dataset from code that could be used for training and benchmarking.
With their prototype, ChartGen, the researchers created a pipeline that passed seed chart images through a VLM, which was prompted to read the chart and generate a Python script that was likely used to create the chart in the first place. The LLM component of the framework then iteratively augmented the code from multiple charts, ultimately generating over 200,000 unique pairs of charts and their codes, spanning nearly 30 chart types, as well as supporting data and annotations such as descriptions and question-answer pairs about the charts. The team is further expanding its dataset, helping to enable critical multimodal understanding of data visualizations for enterprise applications such as financial and scientific reports and blogs.
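In pseudocode form, a ChartGen-style loop might look like the sketch below. The `vlm`, `llm`, and `run_python` interfaces, the prompts, and the filtering step are assumptions made for illustration, not the actual pipeline.

```python
# High-level sketch of a ChartGen-style synthesis loop. The vlm, llm, and
# run_python interfaces, the prompts, and the filtering step are assumptions
# made for illustration, not the actual pipeline.
def synthesize_chart_pairs(seed_images, vlm, llm, run_python, n_variants=5):
    pairs = []
    for image in seed_images:
        # Step 1: the VLM reads a seed chart and reconstructs plausible
        # Python plotting code for it.
        code = vlm.generate(
            image, prompt="Write Python code that recreates this chart.")
        # Step 2: an LLM iteratively augments that code into new variants
        # (different chart types, data, labels, and styles).
        for _ in range(n_variants):
            code = llm.generate(
                f"Modify this chart code into a new variant:\n{code}")
            chart = run_python(code)       # render the candidate variant
            if chart is not None:          # keep only code that executes
                pairs.append((chart, code))
    return pairs
```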
Instead of charts, EECS graduate student Leonardo Hernandez Cano has his sights set on finding efficient ways to enable capabilities in digital design, particularly visual texture creation with VLMs for CAD applications. Together with lab groups led by Armando Solar-Lezama, EECS professor and Distinguished Professor of Computing in the MIT Schwarzman College of Computing, and Nathan Fulton of IBM Research, Hernandez Cano created a program synthesis system that learns to refine code on its own. The system starts with a texture description supplied by the user in the form of an image. It then produces an initial Python program that renders visual textures and iteratively refines the code, with the goal of finding a program that produces a texture matching the target description, learning to discover new programs from the data the system itself produces. Through these refinements, the resulting program can create visualizations with the desired brightness, color, iridescence, etc., mimicking real materials.
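The refine-from-feedback loop described above might be sketched as follows. The `llm`, `run_program`, and `similarity` interfaces (the last standing in for a perceptual image metric) are assumed for illustration and are not the actual system.

```python
# Minimal sketch of an iterative refine-from-feedback loop in the spirit of
# the system described above. The llm, run_program, and similarity interfaces
# (the last standing in for a perceptual image metric) are assumed.
def synthesize_texture_program(target_image, llm, run_program, similarity,
                               max_iters=20):
    # Start from an initial candidate program for the target texture.
    program = llm.generate(
        "Write a Python program that renders this texture.", image=target_image)
    best_program, best_score = program, float("-inf")
    for _ in range(max_iters):
        rendered = run_program(program)
        score = similarity(rendered, target_image)
        if score > best_score:             # track the best program found
            best_program, best_score = program, score
        # Ask the model to edit the program, conditioned on how close the
        # current render is to the target texture.
        program = llm.generate(
            f"Score: {score:.3f}. Refine the program so its output better "
            f"matches the target texture.\n{program}")
    return best_program
```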
When seen together, these projects and the people behind them are making a united push toward more robust and practical artificial intelligence. By tackling the core challenges of reliability, efficiency, and multimodal reasoning, this work paves the way for AI systems that are not only more powerful, but also more reliable and cost-effective for real-world enterprise and scientific applications.