By now, ChatGPT, Claude, and other large language models have accumulated so much human knowledge that they are far more than simple answer-generators: they can also express abstract concepts, such as particular tones, personalities, biases, and moods. However, it is not at all clear how these models represent such abstract concepts within the knowledge they contain.
Now a team from MIT and the University of California San Diego has developed a way to test whether a large language model (LLM) harbors hidden biases, personalities, moods, or other abstract concepts. Their method can home in on the connections within a model that encode a concept of interest. Furthermore, the method can manipulate, or "steer," these connections to strengthen or weaken the concept in any answer the model is prompted to give.
The team showed that their method could quickly root out and manipulate over 500 common concepts in some of the largest LLMs in use today. For example, researchers can explore a model's representation of personas such as "social influencer" and "conspiracy theorist," and stances such as "wedding phobe" and "Boston fan." They can then tune these representations to amplify or minimize the concepts in any answers the model generates.
In the case of the "conspiracy theorist" concept, the team successfully identified a representation of this concept within one of the largest vision-language models available today. When they amplified the representation and then prompted the model to explain the origin of the famous "blue marble" image of Earth taken from Apollo 17, the model generated an answer with the tone and perspective of a conspiracy theorist.
The team acknowledges that there are risks in amplifying certain concepts, which they also explain and caution about. Overall, however, they see the new approach as a way to expose hidden concepts and potential vulnerabilities in LLMs, which can then be amplified or suppressed to improve a model's safety or increase its performance.
"What it really says about LLMs is that they have these concepts, but they're not all actively exposed," says Adityanarayanan "Adit" Radhakrishnan, assistant professor of mathematics at MIT. "With our method, there are ways to extract these different concepts and activate them in ways that prompting alone can't."
The team published their findings today in a study appearing in the journal Science. Co-authors of the study include Radhakrishnan, Daniel Beaglehole and Mikhail Belkin of UC San Diego, and Enrique Boix-Adsera of the University of Pennsylvania.
A fish in a black box
As the use of OpenAI's ChatGPT, Google's Gemini, Anthropic's Claude, and other artificial intelligence assistants has grown, scientists have raced to understand how these models represent certain abstract concepts, such as "hallucination" and "deception." In the context of LLMs, a hallucination is a response that is incorrect or contains misleading information, which the model has "hallucinated," or mistakenly constructed as fact.
To explore whether a concept like "hallucination" is encoded in an LLM, scientists have often taken an "unsupervised learning" approach: a type of machine learning in which algorithms sift through large amounts of mostly unlabeled representations to find patterns that might relate to a concept like "hallucination." But to Radhakrishnan, such an approach can be too broad and computationally expensive.
"It's like going fishing with a big net, trying to catch one species of fish. You're going to get a lot of fish that you have to look through to find the right one," he says. "Instead, we're using bait for the right species of fish."
He and his colleagues had previously developed the beginnings of a more targeted approach with a type of predictive modeling algorithm known as a recursive feature machine (RFM). RFMs are designed to directly identify features, or patterns, within data by taking advantage of the mathematical mechanism that neural networks (a broad category of AI models that includes LLMs) implicitly use to learn features.
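To make the RFM idea concrete, here is a minimal, illustrative sketch of one published formulation: kernel ridge regression with a Mahalanobis Laplace kernel, alternated with an average-gradient-outer-product (AGOP) update of a "feature matrix" that learns which directions in the data matter. The bandwidth, regularization, and iteration count below are arbitrary choices for the toy example, not values from the study.

```python
import numpy as np

def laplace_kernel(X, Z, M, bw=1.0):
    """Mahalanobis Laplace kernel: exp(-sqrt((x-z)^T M (x-z)) / bw)."""
    XM, ZM = X @ M, Z @ M
    d2 = (XM * X).sum(1)[:, None] - 2 * XM @ Z.T + (ZM * Z).sum(1)[None, :]
    return np.exp(-np.sqrt(np.clip(d2, 0.0, None)) / bw)

def rfm_fit(X, y, iters=3, bw=1.0, reg=1e-3):
    """Alternate kernel ridge regression with an AGOP feature-matrix update."""
    n, d = X.shape
    M = np.eye(d)  # feature matrix: starts isotropic, learns important directions
    for _ in range(iters):
        K = laplace_kernel(X, X, M, bw)
        alpha = np.linalg.solve(K + reg * np.eye(n), y)  # ridge-regression weights
        # Gradient of the predictor f(x) = sum_i alpha_i k(x, x_i) at each point
        diffs = X[:, None, :] - X[None, :, :]                        # (n, n, d)
        dist = np.sqrt(np.clip(
            np.einsum('ijk,kl,ijl->ij', diffs, M, diffs), 1e-10, None))
        W = alpha[None, :] * K / dist / bw                           # (n, n)
        grads = -np.einsum('ij,ijk->ik', W, diffs) @ M               # (n, d)
        # AGOP update: average outer product of the predictor's gradients
        M = grads.T @ grads / n
        M /= np.trace(M) + 1e-12  # keep the overall scale fixed
    return M, alpha

# Toy check: the target depends only on the first coordinate, so the learned
# feature matrix M should concentrate its mass there.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(200, 5))
y = X[:, 0]
M, _ = rfm_fit(X, y)
```

The point of the sketch is the loop structure: each pass fits a predictor, then reshapes the notion of distance so that directions the predictor actually uses get amplified, which is the "directly identify features" behavior described above.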
Since the algorithm proved to be a generally effective, efficient approach to capturing features, the team wondered whether they could use it to root out representations of concepts in LLMs, by far the most widely used neural networks and perhaps the least well understood.
“We wanted to apply our feature learning algorithms to LLMs in a targeted way, to discover representations of concepts in these large and complex models,” says Radhakrishnan.
Converging on a concept
The team's new approach identifies a concept of interest within an LLM and then "steers," or directs, the model's responses with respect to that concept. The researchers compiled 512 concepts in five categories: fears (of weddings, insects, and even buttons); personas (social influencer, mediator); moods (arrogant, unabashedly happy); location preferences (Boston, Kuala Lumpur); and personalities (Ada Lovelace, Neil deGrasse Tyson).
The researchers then explored the representation of each concept in many of today's major language and vision models. They did this by training an RFM to recognize numerical patterns in an LLM that may represent a particular concept of interest.
A standard large language model is, roughly speaking, a neural network that takes a natural-language prompt, such as "Why is the sky blue?", and splits the prompt into tokens, each of which is mathematically encoded as a list, or vector, of numbers. The model passes these vectors through a series of computational layers, producing, at each layer, matrices of numbers that are used to identify the words most likely to follow in response to the original prompt. Ultimately, the layers' outputs are aggregated into a set of numbers that is decoded back into text as a natural-language response.
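The prompt-to-response flow above can be sketched as a toy forward pass. Everything here is a stand-in: the vocabulary has eight words, the weights are random rather than trained, and each "layer" is a single matrix multiply with a nonlinearity (real LLMs use attention and far more machinery), but the encode-into-vectors, pass-through-layers, decode-to-text pipeline is the same shape.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: a tiny vocabulary and random (untrained) weights.
vocab = ["why", "is", "the", "sky", "blue", "because", "of", "scattering"]
d_model = 16
embed = rng.normal(size=(len(vocab), d_model))              # token -> vector
layers = [rng.normal(size=(d_model, d_model)) for _ in range(3)]

def toy_forward(prompt):
    ids = [vocab.index(t) for t in prompt.split()]          # split into tokens
    h = embed[ids]                                          # one vector per token
    for W in layers:                                        # computational layers
        h = np.maximum(h @ W, 0.0)                          # mix vectors, nonlinearity
    logits = h[-1] @ embed.T                                # score every vocab token
    return vocab[int(np.argmax(logits))]                    # decode back to a word

next_token = toy_forward("why is the sky blue")
```

With random weights the predicted word is meaningless; the sketch only shows where, in this flow, a model's internal representations live (the intermediate `h` arrays), which is what the team's method inspects.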
The team's approach trains an RFM to recognize numerical patterns in an LLM that may be associated with a specific concept. For example, to see whether an LLM contains any representation of "conspiracy theorist," the researchers first train the algorithm to distinguish between the LLM's representations of 100 prompts that are clearly related to conspiracies and those of 100 prompts that are not. In this way, the algorithm learns a pattern associated with the conspiracy-theorist concept. The researchers can then mathematically modulate the concept's activity by perturbing the LLM's representations along that identified pattern.
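As an illustration of this train-then-perturb recipe, here is a minimal sketch that substitutes a plain least-squares linear probe for the RFM and random synthetic vectors for the LLM's internal representations. The dimensions, prompt counts, and steering strength are all arbitrary, and the planted `concept_dir` exists only so the toy data actually contains a concept to find.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hidden-state dimension of the hypothetical model layer

# Synthetic stand-ins for the LLM's representations of 100 concept-related
# prompts and 100 unrelated prompts; the concept shifts the data along a
# planted direction.
concept_dir = rng.normal(size=d)
concept_dir /= np.linalg.norm(concept_dir)
H_concept = rng.normal(size=(100, d)) + 2.0 * concept_dir   # "conspiracy" prompts
H_neutral = rng.normal(size=(100, d))                        # unrelated prompts

# Train a simple probe (least-squares classifier) standing in for the RFM:
# learn a direction that separates the two groups of representations.
X = np.vstack([H_concept, H_neutral])
y = np.array([1.0] * 100 + [-1.0] * 100)
w, *_ = np.linalg.lstsq(X, y, rcond=None)
w /= np.linalg.norm(w)

# "Steering": perturb a representation along the learned direction to
# amplify (positive strength) or suppress (negative strength) the concept.
def steer(h, strength):
    return h + strength * w

h = rng.normal(size=d)
score_before = float(h @ w)
score_after = float(steer(h, 4.0) @ w)
```

In the actual study the representations come from a real model's layers and the pattern is found by an RFM rather than a linear fit, but the logic is the same: a labeled contrast between prompt sets yields a direction, and nudging activations along it modulates the concept in the output.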
This method can be applied to find and manipulate any common concept in an LLM. Among many examples, the researchers identified representations of a "conspiracy theorist" and manipulated LLMs to respond in that tone and perspective. They also identified and amplified the concept of "anti-refusal," showing that a model that would typically refuse certain prompts could instead be made to respond with, for example, instructions on how to rob a bank.
Radhakrishnan says this approach can be used to quickly find and mitigate vulnerabilities in LLMs. It may also be used to enhance certain qualities, personalities, moods, or preferences, such as emphasizing the concept of “conciseness” or “logic” in any response generated by the LLM. The team has made the underlying code of the method publicly available.
"LLMs clearly have a lot of these abstract concepts, stored in some representation," says Radhakrishnan. "There are ways where, if we understand these representations well, we can build highly specialized LLMs that are still safe to use but really effective at certain tasks."
This work was supported, in part, by the National Science Foundation, the Simons Foundation, the TILOS Institute, and the U.S. Office of Naval Research.