Most languages use word position and sentence structure to convey meaning. For example, “The cat was sitting on the box” is not the same as “The box was sitting on the cat.” In a longer text, such as a financial document or a novel, the meaning carried by these words and structures evolves as the text unfolds.
Similarly, one can track variables in a piece of code or follow instructions containing conditional actions. These are examples of state tracking and sequential reasoning, capabilities we expect state-of-the-art artificial intelligence systems to excel at. However, existing state-of-the-art attention mechanisms within the Transformer, the main architecture used in large language models (LLMs) to determine the importance of words, have theoretical and empirical limitations when it comes to such capabilities.
An attention mechanism allows an LLM to look back at earlier parts of a query or document and, based on its training, determine which words and details matter most. However, attention alone does not understand word order: it “sees” all the input words, aka tokens, at the same time, so researchers have developed techniques to encode positional information. This is important for highly structured domains such as language. But the dominant position-encoding method, called rotary position encoding (RoPE), only takes into account the relative distance between tokens in the sequence and is independent of the input data. This means that, for example, words like “cat” and “box” in the example above that sit four positions apart receive the same fixed mathematical rotation as any other pair of tokens at that relative distance.
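For intuition, RoPE’s distance-only behavior can be sketched in a few lines of NumPy. This is a schematic illustration, not a production implementation; the split-half dimension pairing and base frequency follow common RoPE conventions:

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply rotary position encoding to one token embedding.

    Pairs of dimensions are rotated by an angle that depends only on
    the token's position -- never on its content.
    """
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per dimension pair
    theta = pos * freqs                         # rotation angles for this position
    x1, x2 = x[:half], x[half:]
    return np.concatenate([
        x1 * np.cos(theta) - x2 * np.sin(theta),
        x1 * np.sin(theta) + x2 * np.cos(theta),
    ])

rng = np.random.default_rng(0)
q = rng.standard_normal(8)
k = rng.standard_normal(8)

# The attention score between two rotated vectors depends only on their
# relative distance: positions (3, 7) and (10, 14) are both 4 apart,
# so the scores are identical regardless of where the pair sits.
s1 = rope_rotate(q, 3) @ rope_rotate(k, 7)
s2 = rope_rotate(q, 10) @ rope_rotate(k, 14)
assert np.allclose(s1, s2)
```

The assertion at the end demonstrates the limitation the researchers highlight: RoPE sees only the distance between tokens, not what lies between them.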
Now, research led by MIT and the MIT-IBM Watson AI Lab has created an encoding technique called PaTH Attention that makes positional information adaptive and context-aware, rather than static like RoPE.
“Transformers enable accurate and scalable modeling of many domains, but they have these limitations when it comes to state tracking, a class of phenomena that underpins key capabilities we want in our AI systems. So, the key question is: How can we maintain the scalability and efficiency of Transformers while enabling state tracking?” says the paper’s senior author Yoon Kim, an associate professor in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a researcher at the MIT-IBM Watson AI Lab.
A paper on this work was presented at the Neural Information Processing Systems (NeurIPS) conference earlier this month. Kim’s co-authors include lead author Songlin Yang, an EECS graduate student and former MIT-IBM Watson AI Lab Summer Program intern; Kaiyue Wen of Stanford University; Microsoft’s Liliang Ren; and Yikang Shen, Shawn Tan, Mayank Mishra, and Rameswar Panda of IBM Research and the MIT-IBM Watson AI Lab.
A PaTH to understanding
Instead of assigning each pair of words a fixed rotation based on the relative distance between their tokens, as RoPE does, PaTH Attention is flexible: it treats the stretch of tokens between two words as a path made up of small, data-dependent transformations. Each transformation, based on a mathematical operation called a Householder reflection, acts like a tiny mirror whose orientation adjusts based on the content of the token passing through it. Each step in the sequence can thus influence how the model interprets information later on. The cumulative effect lets the system model how meaning changes along the way between words, not just how far apart they are. This approach allows the Transformer to keep track of how entities and relationships change over time, giving it a sense of “positional memory.” Think of it as walking a path, experiencing your environment and how it affects you along the way.

In addition, the team developed a hardware-efficient algorithm that compresses the cumulative mathematical transformations of PaTH Attention and breaks them into smaller calculations, so that computing the attention score between each pair of tokens is compatible with fast processing on GPUs.
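The accumulated-reflection idea can be illustrated with a toy sketch. Here each token contributes a unit Householder reflection derived from its own embedding, and the transform relating two positions is the product of the reflections along the path between them. The projection `w_proj` is hypothetical, and the paper’s actual parameterization and hardware-efficient algorithm are more involved:

```python
import numpy as np

def householder(w):
    """Householder reflection H = I - 2 w w^T / ||w||^2 (orthogonal, data-dependent)."""
    w = w / np.linalg.norm(w)
    return np.eye(len(w)) - 2.0 * np.outer(w, w)

def path_transform(tokens, i, j, w_proj):
    """Cumulative transform between positions i < j: the product of the
    Householder reflections contributed by the intervening tokens.

    Unlike RoPE, this depends on the *content* of the tokens along the
    path, not just the distance j - i.
    """
    H = np.eye(tokens.shape[1])
    for t in range(i + 1, j + 1):
        # Each token's reflection vector is derived from its own embedding.
        H = householder(w_proj @ tokens[t]) @ H
    return H

rng = np.random.default_rng(0)
d = 4
w_proj = rng.standard_normal((d, d))   # illustrative learned projection
seq_a = rng.standard_normal((6, d))
seq_b = seq_a.copy()
seq_b[3] += 1.0                        # change one intervening token

# Same positions, same distance, but different content -> different transform.
H_a = path_transform(seq_a, 1, 5, w_proj)
H_b = path_transform(seq_b, 1, 5, w_proj)
assert not np.allclose(H_a, H_b)
# Each cumulative transform is still orthogonal (a product of reflections).
assert np.allclose(H_a @ H_a.T, np.eye(d))
```

The contrast with the RoPE sketch is the point: changing a single token between two positions changes the transform relating them, which is what gives the model its “positional memory.”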
The MIT-IBM researchers then evaluated PaTH Attention’s performance on synthetic and real-world tasks, including reasoning, long-context benchmarks, and full LLM training, to see whether it improved the model’s ability to track information over time. The team tested the models’ ability to follow the latest “write” command despite multiple distracting steps, along with multi-step recall tests, tasks that are difficult for standard positional encoding methods like RoPE. The researchers also trained medium-sized LLMs and compared them with other methods. PaTH Attention improved perplexity and outperformed other methods on reasoning benchmarks on which it was not trained. They also evaluated retrieval, reasoning, and stability with inputs of thousands of tokens. PaTH Attention proved capable of consistent content-awareness.
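The “latest write wins” probe described above can be sketched as a toy data generator. This is illustrative only, not the benchmark used in the paper; the instruction format is invented for the example:

```python
import random

def make_flipflop_example(n_ops=8, seed=0):
    """Generate a toy 'latest write wins' state-tracking example:
    a stream of write/ignore instructions followed by a read query.
    (Illustrative generator, not the paper's actual benchmark.)
    """
    rng = random.Random(seed)
    value = rng.randint(0, 9)
    ops = [f"write {value}"]              # guarantee at least one write
    for _ in range(n_ops - 1):
        if rng.random() < 0.5:
            value = rng.randint(0, 9)     # a later write overrides the state
            ops.append(f"write {value}")
        else:
            ops.append(f"ignore {rng.randint(0, 9)}")  # distractor step
    return " ; ".join(ops) + " ; read", value

prompt, answer = make_flipflop_example()
# The correct answer is the argument of the *last* write instruction,
# no matter how many distractors follow it -- a pure state-tracking problem.
last_write = [op for op in prompt.split(" ; ") if op.startswith("write")][-1]
assert last_write == f"write {answer}"
```

Solving such tasks requires carrying a single piece of state across an arbitrary number of irrelevant steps, which is exactly where distance-only position encodings struggle.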
“We found that on synthetic tasks and real-world language modeling tasks designed to test the limits of Transformers, our new approach was able to outperform existing attention mechanisms while maintaining their efficiency,” says Kim. Furthermore, “I would be interested to see whether this type of data-dependent position encoding, like PaTH, improves Transformer performance on structured domains like biology, [analyzing] protein or DNA.”
Think bigger and smarter
The researchers then examined how PaTH Attention would perform if it more closely mimicked human cognition, where we discount older or less-relevant information when making decisions. To do this, they combined PaTH Attention with another scheme, known as the Forgetting Transformer (FoX), which allows models to selectively “forget.” The resulting PaTH-FoX system adds a way to decay information in a data-dependent manner, leading to strong results on reasoning, long-context understanding, and language modeling benchmarks. In this way, PaTH extends the expressive power of the Transformer architecture.
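A minimal sketch of the forgetting idea, assuming per-token forget gates in (0, 1] whose log-products decay attention scores. This is an illustration in the spirit of FoX, not its actual implementation; the gate values here are random placeholders for learned, data-dependent quantities:

```python
import numpy as np

def forgetting_attention_weights(q, k, log_forget):
    """Causal attention with data-dependent decay: the score for token i
    attending back to token j is down-weighted by the product of the
    forget gates of the tokens in between (added in log space).
    """
    n = q.shape[0]
    scores = q @ k.T
    cum = np.cumsum(log_forget)              # running sum of log-gates
    for i in range(n):
        for j in range(n):
            if j > i:
                scores[i, j] = -np.inf       # causal mask: no future tokens
            else:
                scores[i, j] += cum[i] - cum[j]  # log of the gate product
    # Row-wise softmax over the surviving scores.
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    return w / w.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
q = rng.standard_normal((5, 4))
k = rng.standard_normal((5, 4))
gates = rng.uniform(0.5, 1.0, 5)             # placeholder data-dependent gates
w = forgetting_attention_weights(q, k, np.log(gates))
assert np.allclose(w.sum(axis=1), 1.0)       # each row is a distribution
```

When a token emits a small gate, everything before it is suppressed for all later queries, which is the “selectively forget” behavior the combined PaTH-FoX system exploits.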
Kim says this kind of research is part of a broader effort to develop “the next big thing” in AI. He explains that a major driver of both the deep learning and generative AI revolutions has been the creation of “general-purpose building blocks that can be applied to broader domains,” such as convolution layers, RNN [recurrent neural network] layers, and, more recently, Transformers. Looking ahead, Kim says that considerations like accuracy, expressivity, flexibility, and hardware scalability have been and will remain essential. As he says, “The main enterprise of modern architecture research is trying to come up with these new primitives that maintain or improve expressivity while also being scalable.”
This work was supported, in part, by the MIT-IBM Watson AI Lab and Schmidt Sciences’ AI2050 program.