Let’s say a man takes his French Bulldog, Bowser, to the dog park. Picking out Bowser as he plays among the other dogs is easy for his owner to do while on-site.
But if someone wants to use a generative AI model like GPT-5 to monitor their pet while at work, the model may fail at this basic task. Vision-language models like GPT-5 often excel at recognizing general objects, like a dog, but they perform poorly at detecting individual objects, like Bowser the French Bulldog.
To address this shortcoming, researchers at MIT and the MIT-IBM Watson AI Lab have introduced a new training method that teaches vision-language models to localize personalized objects in a scene.
Their method uses carefully crafted video-tracking data in which the same object is tracked across multiple frames. They designed the dataset so that the model must rely on contextual clues to identify the personalized object, rather than on previously memorized knowledge.
When given a few example images showing a personalized object, such as someone’s pet, the retrained model is better able to identify the location of the same pet in a new image.
The models retrained with their method outperformed state-of-the-art systems at this task. Importantly, the technique leaves the rest of the model’s general abilities intact.
This new approach could help future AI systems track specific objects over time, such as a child’s backpack, or localize objects of interest such as animal species in ecological monitoring. It can also aid in the development of AI-powered assistive technologies that help visually impaired users find certain objects in a room.
“Ultimately, we want these models to be able to learn from context, just like humans do. If a model can do this well, then instead of re-training it for each new task, we can just provide a few examples and it will predict how to perform the task from that context. This is a very powerful ability,” says Jehanzeb Mirza, an MIT postdoc and senior author of a paper on this technique.
Mirza is joined on the paper by co-lead authors Sivan Doveh, a graduate student at the Weizmann Institute of Science, and Nimrod Shabte, a researcher at IBM Research; James Glass, a senior research scientist and head of the Spoken Language Systems Group at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL); and others. The work will be presented at the International Conference on Computer Vision.
An unexpected shortcoming
Researchers have found that large language models (LLMs) can excel at learning from context. If they provide an LLM with a few examples of a task, such as addition problems, it can learn to answer new addition problems based on the context provided.
A vision-language model (VLM) is essentially an LLM with a visual component added, so the MIT researchers thought it would inherit the LLM’s context-based learning capabilities. But that turns out not to be the case.
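For illustration, an in-context localization query to a VLM might interleave a few annotated example images with a new query image, as in the hypothetical sketch below. The file names, coordinate format, and prompt wording here are assumptions for illustration, not the researchers’ exact protocol.

```python
# A minimal sketch of what an in-context localization prompt to a VLM could
# look like. File names, the coordinate format, and the prompt wording are
# illustrative assumptions, not the paper's exact setup.

few_shot_examples = [
    {"image": "bowser_park.jpg",  "answer": "Bowser is at [0.42, 0.55, 0.61, 0.83]"},
    {"image": "bowser_couch.jpg", "answer": "Bowser is at [0.10, 0.30, 0.35, 0.72]"},
]
query_image = "living_room_cam.jpg"

# Interleave the example images and answers, then ask about the new image.
prompt = []
for ex in few_shot_examples:
    prompt.append({"type": "image", "path": ex["image"]})
    prompt.append({"type": "text", "content": f"Where is Bowser? {ex['answer']}"})
prompt.append({"type": "image", "path": query_image})
prompt.append({"type": "text", "content": "Where is Bowser?"})

# A VLM with strong in-context learning should infer, from the examples alone,
# which individual dog to localize in the final image.
for turn in prompt:
    print(turn)
```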
Mirza says, “The research community has not yet found a black-and-white answer to this particular problem. The bottleneck may arise from the fact that some visual information is lost in the process of mixing the two components together, but we don’t know.”
The researchers set out to improve a VLM’s ability to perform in-context localization, which involves finding a specific object in a new image. They focused on the data used to retrain existing VLMs for a new task, a process called fine-tuning.
Typical fine-tuning data are gathered from random sources and depict collections of everyday objects. One image might contain cars parked on a road, while another might include a bouquet of flowers.
“There is no real coherence in these data, so the model never learns to recognize the same object in multiple images,” he says.
To fix this problem, the researchers developed a new dataset by curating samples from existing video-tracking data. These data are video clips that show a single object moving through a scene, such as a tiger running across a grassland.
They clipped frames from these videos and structured the dataset so that each input contained multiple images showing the same object in different contexts, along with questions and answers about its location.
Mirza explains, “By using multiple images of the same object in different contexts, we encourage the model to consistently localize that object of interest by focusing on the context.”
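As a rough illustration of this idea, the sketch below assembles one multi-image training sample from a hypothetical tracking annotation. The frame paths, bounding-box format, and sampling choices are assumptions; the paper’s actual curation pipeline is more involved.

```python
# A simplified sketch of how a training sample might be assembled from
# video-tracking data, under assumptions about the annotation format.

def build_sample(track, num_context=3, stride=30):
    """track: list of (frame_path, bbox) pairs for one tracked object.
    Frames are sampled `stride` apart so the background changes between them."""
    frames = track[::stride][:num_context + 1]
    context, (query_frame, query_box) = frames[:-1], frames[-1]

    sample = {"images": [], "conversation": []}
    for frame_path, box in context:
        sample["images"].append(frame_path)
        sample["conversation"].append(
            {"question": "Where is the object of interest?", "answer": f"{box}"}
        )
    # The final frame is the query: the model must localize the same object
    # by relying on the context frames above.
    sample["images"].append(query_frame)
    sample["conversation"].append(
        {"question": "Where is the object of interest?", "answer": f"{query_box}"}
    )
    return sample

# Example with dummy annotations (frame paths and normalized boxes are made up).
tiger_track = [(f"tiger/frame_{i:04d}.jpg",
                [0.2 + i * 0.001, 0.4, 0.5 + i * 0.001, 0.8])
               for i in range(200)]
print(build_sample(tiger_track))
```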
Forced focus
But the researchers found that VLMs tend to cheat. Instead of answering based on context clues, they would identify the object using knowledge gained during pretraining.
For example, since the model had already learned that images of tigers and the label “tiger” are correlated, it could identify the tiger crossing the grassland from this pretrained knowledge, rather than inferring it from the context.
To solve this problem, researchers used pseudonyms instead of actual object category names in the dataset. In this case, they renamed the tiger “Charlie”.
He says, “It took us a while to figure out how to stop the model from cheating. But we changed the game for the model. The model doesn’t know that ‘Charlie’ could be a tiger, so it’s forced to look at the context.”
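The sketch below illustrates the pseudonym idea on a toy sample: every mention of the true category name is swapped for an arbitrary name before fine-tuning. The name list and data format are illustrative assumptions rather than the paper’s exact implementation.

```python
# A minimal sketch of the pseudonym idea: swapping real category names for
# arbitrary names in the question-answer text so the model cannot lean on
# pre-trained class knowledge.
import random

PSEUDONYMS = ["Charlie", "Milo", "Luna", "Ziggy"]

def pseudonymize(sample, category):
    """Replace every mention of the true category (e.g. 'tiger') with a
    randomly chosen pseudonym across all questions and answers in a sample."""
    alias = random.choice(PSEUDONYMS)
    renamed = []
    for turn in sample["conversation"]:
        renamed.append({
            "question": turn["question"].replace(category, alias),
            "answer": turn["answer"].replace(category, alias),
        })
    return {"images": sample["images"], "conversation": renamed}

sample = {
    "images": ["tiger/frame_0000.jpg", "tiger/frame_0030.jpg"],
    "conversation": [
        {"question": "Where is the tiger?",
         "answer": "The tiger is at [0.2, 0.4, 0.5, 0.8]"},
        {"question": "Where is the tiger?",
         "answer": "The tiger is at [0.3, 0.4, 0.6, 0.8]"},
    ],
}
print(pseudonymize(sample, "tiger"))
```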
The researchers also faced challenges in finding the best way to prepare the data. If the frames are sampled too close together, the background does not change enough to provide data diversity.
In the end, fine-tuning a VLM with this new dataset improved the accuracy of personalized localization by about 12 percent on average. When the dataset included pseudonyms, the performance gains reached 21 percent.
As model size increases, their technique yields even larger performance gains.
In the future, the researchers want to study why VLMs do not inherit context-based learning capabilities from their base LLMs. They also plan to explore further mechanisms for improving a VLM’s performance without the need to retrain it on new data.
“This work reframes few-shot personalized object localization – quickly adapting to the same object in new scenes – as an instruction-tuning problem and uses video-tracking sequences to teach VLMs to localize based on scene context rather than class priors. It also introduces the first benchmarks for this setting, with solid gains across open and proprietary VLMs. Given the immense importance of quick, example-specific grounding – often without fine-tuning – for users in real-world workflows (e.g. robotics, augmented reality assistants, creative tools, etc.), the practical, data-centric prescription offered by this work can help drive widespread adoption of vision-language foundation models,” says Saurav Jha, a postdoc at the Mila-Quebec Artificial Intelligence Institute, who was not involved in this work.
Additional co-authors are Wei Lin, a research associate at Johannes Kepler University; Eli Schwartz, a research scientist at IBM Research; Hilde Kuehne, Professor of Computer Science at the Tübingen AI Center and Affiliate Professor at the MIT-IBM Watson AI Lab; Raja Giris, an associate professor at Tel Aviv University; Rogerio Ferris, principal scientist and manager of the MIT-IBM Watson AI Lab; Leonid Karlinsky, a principal research scientist at IBM Research; Assaf Arbele, a senior research scientist at IBM Research; and Shimon Ullman, Sammy and Ruth Cohn Professor of Computer Science at the Weizmann Institute of Science.
This research was partially funded by the MIT-IBM Watson AI Lab.