
Imagine a radiologist examining a chest X-ray from a new patient. She notices the patient has tissue swelling but does not have an enlarged heart. Looking to speed up the diagnosis, she might use a vision-language machine-learning model to search for reports from similar patients.
But if the model mistakenly identifies reports with both conditions, the most likely diagnosis could be quite different: if a patient has tissue swelling and an enlarged heart, the condition is very likely to be cardiac related, but with no enlarged heart there could be several underlying causes.
In a new study, MIT researchers have found that vision-language models are extremely likely to make such a mistake in real-world situations because they do not understand negation, words like “no” and “doesn’t” that specify what is false or absent.
“Those negation words can have a very significant impact, and if we are just using these models blindly, we may run into catastrophic consequences,” says Kumail Alhamoud, an MIT graduate student and lead author of the study.
The researchers tested the ability of vision-language models to identify negation in image captions. The models often performed no better than a random guess. Building on those findings, the team created a dataset of images with corresponding captions that include negation words describing missing objects.
They show that retraining a vision-language model with this dataset leads to performance improvements when the model is asked to retrieve images that do not contain certain objects. It also boosts accuracy on multiple-choice question answering with negated captions.
But the researchers caution that more work is needed to address the root causes of this problem. They hope their research alerts potential users to a previously unnoticed shortcoming that could have serious implications in high-stakes settings where these models are currently being used, from determining which patients receive certain treatments to identifying product defects in manufacturing plants.
“This is a technical paper, but there are bigger issues to consider. If something as fundamental as negation is broken, we shouldn’t be using large vision/language models in many of the ways we are using them now, without intensive evaluation,” says senior author Marzyeh Ghassemi, an associate professor in the Department of Electrical Engineering and Computer Science (EECS) and a member of the Institute for Medical Engineering and Science and the Laboratory for Information and Decision Systems.
Ghassemi and Alhamoud are joined on the paper by MIT graduate student Shaden Alshammari; Yonglong Tian of OpenAI; Guohao Li, a former postdoc at the University of Oxford; Philip H.S. Torr, a professor at Oxford; and Yoon Kim, an assistant professor of EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL) at MIT. The research will be presented at the Conference on Computer Vision and Pattern Recognition.
Neglecting negation
Vision-language models (VLMs) are trained using huge collections of images and corresponding captions, which they learn to encode as sets of numbers called vector representations. The models use these vectors to distinguish between different images.
A VLM uses two separate encoders, one for text and one for images, and the encoders learn to output similar vectors for an image and its corresponding text caption.
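For illustration, here is a minimal sketch of how such a dual-encoder model scores image-text similarity, using an off-the-shelf CLIP model via the Hugging Face Transformers library. The checkpoint, image path, and captions are assumptions for the example, not details from the study.

```python
# Minimal sketch of dual-encoder (CLIP-style) image-text scoring.
# The checkpoint, image file, and captions below are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("dog_jumping_fence.jpg")  # hypothetical example image
captions = ["a dog jumping over a fence", "a helicopter flying over a city"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Each score reflects how close the text encoder's vector is to the image
# encoder's vector; the matching caption should score highest.
print(outputs.logits_per_image.softmax(dim=-1))
```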
“The captions express what is in the images; they are a positive label. And that is actually the whole problem. No one looks at an image of a dog jumping over a fence and captions it by saying ‘a dog jumping over a fence, with no helicopters,’” says Alhamoud.
Because the image-caption datasets don’t contain examples of negation, VLMs never learn to identify it.
To dig deeper into this problem, the researchers designed two benchmark tasks that test the ability of VLMs to understand negation.
For the first, they used a large language model (LLM) to re-caption images in an existing dataset by asking the LLM to think about related objects not in an image and write them into the caption. Then they tested the models by prompting them with negation words to retrieve images that contain certain objects, but not others.
For the second task, they designed multiple-choice questions that ask a VLM to select the most appropriate caption from a list of closely related options. These captions differ only by adding a reference to an object that does not appear in the image or negating an object that does appear in the image.
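The sketch below illustrates this multiple-choice setup with an off-the-shelf CLIP model: the image is scored against closely related captions that differ only in which object is affirmed or negated, and the highest-scoring caption is chosen. The image path and captions are invented for the example, not taken from the benchmark itself.

```python
# Hedged sketch of the multiple-choice negation test: pick the caption whose
# text embedding best matches the image embedding. Example data is made up.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("street_scene.jpg")  # hypothetical image: a car, no bicycle
options = [
    "a street with a car and a bicycle",   # adds an object that is absent
    "a street with a car but no bicycle",  # correct: negates the absent object
    "a street with a bicycle but no car",  # negates the object that is present
]

inputs = processor(text=options, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    scores = model(**inputs).logits_per_image[0]

# A model with affirmation bias tends to latch onto the mentioned objects and
# ignore "no", often ranking the first caption highest.
print(options[scores.argmax().item()])
```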
The models often failed at both tasks, with image retrieval performance dropping by about 25 percent on negated captions. When it came to answering the multiple-choice questions, the best models achieved only about 39 percent accuracy, with several models performing at or even below random chance.
One reason for this failure is a shortcut the researchers call affirmation bias: VLMs ignore negation words and instead focus on the objects in the images.
“This does not just happen for words like ‘no’ and ‘not.’ Regardless of how you express negation or exclusion, the models will simply ignore it,” Alhamoud says.
This was consistent across every VLM they tested.
A solvable problem
Since VLMs typically aren't trained on image captions with negation, the researchers developed datasets with negation words as a first step toward solving the problem.
Using a dataset with 10 million image-text caption pairs, they prompted an LLM to propose related captions that specify what is excluded from the images, yielding new captions with negation words.
They had to be especially careful that these synthetic captions still read naturally, or a VLM could fail in the real world when faced with more complex captions written by humans.
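As a rough illustration of this recaptioning step, the sketch below prompts an LLM to add a natural-sounding negation clause about a plausibly absent object. The prompt wording, model choice, and use of the OpenAI client are assumptions for the example; the article does not specify the authors' exact recipe.

```python
# Rough sketch of negation-aware recaptioning via an LLM prompt.
# Prompt text and model name are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def negate_caption(caption: str) -> str:
    prompt = (
        f"Here is an image caption: '{caption}'. Think of a related object "
        "that is plausibly absent from the image, then rewrite the caption so "
        "it naturally mentions that the object is not present. "
        "Return only the new caption."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical model choice
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()

# Example: "a dog jumping over a fence" might become
# "a dog jumping over a fence, with no other animals in the yard".
print(negate_caption("a dog jumping over a fence"))
```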
They found that finetuning VLMs with their dataset led to performance gains across the board. It improved the models' image retrieval abilities by about 10 percent, while also boosting performance on the multiple-choice question answering task by about 30 percent.
“But our solution is not perfect. We are just recaptioning datasets, a form of data augmentation. We haven't even touched how these models work, but we hope this is a signal that it is a solvable problem and others can take our solution and improve it,” Alhamoud says.
At the same time, he hopes their work encourages more users to think about the problem they want to use a VLM to solve and to design some examples to test it before deployment.
In the future, researchers could expand upon this work by teaching VLMs to process text and images separately, which may improve their ability to understand negation. In addition, they could develop additional datasets that include image-caption pairs for specific applications, such as health care.