OpenAI says AI browsers may always be vulnerable to prompt injection attacks

Spread the love

Even though OpenAI is working to harden its Atlas AI browser against cyberattacks, the company acknowledges that instant injection, a type of attack that tricks AI agents into following malicious instructions often hidden on web pages or in emails, is a risk that isn’t going away any time soon — raising questions about how safely AI agents can operate on the open web.

“Like scams and social engineering on the web, prompt injection is also unlikely to be a complete ‘solution,'” OpenAI explained in a Monday blog post. The company acknowledged that “Agent Mode” in ChatGPAT Atlas “expands the security threat surface.”

OpenAI launched its ChatGPAT Atlas browser in October, and security researchers rushed to publish their demos, showing that it was possible to write certain words in Google Docs that were able to change the behavior of the built-in browser. The same day, Brave published a blog post describing indirect instant injection as a systemic challenge for AI-powered browsers, including Perplexity’s Comet.

OpenAI is not alone in believing that prompt-based injections are not going away. The UK’s National Cyber Security Center warned earlier this month that instant injection attacks against generic AI applications “can never be fully mitigated”, putting websites at risk of falling victim to data breaches. The UK government agency advised cyber professionals to minimize the risk and impact of prompt injection, and not to assume that attacks “can be prevented”.

On behalf of OpenAI, the company said: “We view instant injection as a long-term AI security challenge, and we will need to continually strengthen our defenses against it.”

The company’s response to this Sisyphean task? The company says a proactive, rapid-response cycle is showing promise in helping discover new attack strategies internally before exploiting them “in the wild.”

This isn’t all that different from what rivals like Anthropic and Google are saying: that to fight the persistent risk of instant-based attacks, defenses must be layered and constantly stress-tested. For example, Google’s recent work focuses on architectural and policy-level controls for agentic systems.

But where OpenAI is adopting a different strategy is with its “LLM-based automated attacker.” This attacker is basically a bot that OpenAI has trained using reinforcement learning, to play the role of a hacker who looks for ways to give malicious instructions to the AI agent.

The bot can test the attack in simulation before using it for real, and the simulator shows how the target AI would think and what actions it would take if it saw the attack. The bot can then study that response, modify the attack, and try again and again. Insight into the internal logic of the target AI is something that outsiders do not have access to, so, in theory, OpenAI’s bot should be able to find flaws faster than a real-world attacker.

This is a common strategy in AI security testing: create an agent to find edge cases and rapidly test against them in simulation.

“Our [reinforcement learning]“A well-trained attacker can induce an agent to execute sophisticated, long-horizon harmful workflows that unfold over tens (or even hundreds) of steps,” OpenAI wrote. “We also observed novel attack strategies that did not appear in our human red teaming campaign or external reports.”

A screenshot showing an accelerated injection attack in the OpenAI browser. — **Image Credit:**OpenAI

In a demo (pictured above), OpenAI showed how its automated attacker sent a malicious email to a user’s inbox. When the AI agent later scanned the inbox, it followed instructions hidden in the email and sent a resignation message instead of drafting a reply out of the office. But after the security update, according to the company, “Agent Mode” was able to successfully detect the prompt injection attempt and flag it to the user.

The company says that while it is difficult to secure instant injection in a foolproof manner, it relies on massive testing and fast patch cycles to harden its systems before they show up in real-world attacks.

An OpenAI spokesperson declined to share whether updates to Atlas’ security have resulted in a measurable reduction in successful injections, but says the company has been working with third parties to harden Atlas against accelerated injections since before launch.

Rami McCarthy, principal security researcher at cybersecurity firm Viz, says reinforcement learning is a way to continually adapt to an attacker’s behavior, but it’s only part of the picture.

“A useful way to reason about risk in AI systems is to multiply autonomy by reach,” McCarthy told TechCrunch.

“Agent browsers sit in a challenging part of that space: very high accessibility with moderate autonomy,” McCarthy said. “Many current recommendations reflect that trade-off. Limiting log-in access primarily reduces risk, while requiring review of confirmation requests hinders autonomy.”

These are two of OpenAI’s recommendations to users to reduce their risk, and a spokesperson said Atlas has also been trained to get user confirmation before sending messages or making payments. OpenAI also suggests giving user agents specific instructions instead of granting them access to your inbox and telling them to “take whatever action is necessary.”

According to OpenAI, “the wide latitude makes it easy for hidden or malicious content to influence agents, even when security measures are in place.”

While OpenAI says protecting Atlas users from instant injection is a top priority, McCarthy invites some skepticism about the return on investment for risk-prone browsers.

“For most everyday use cases, agent browsers do not yet provide enough value to justify their current risk profile,” McCarthy told TechCrunch. “The risk is higher given their access to sensitive data like email and payment information, even if that access is what makes them powerful. That balance will evolve, but even today the trade-offs are very real.”

Source link