AI Safety Prompts Can Lie to You, Researchers Warn

According to Infosecurity Magazine, security researchers at Checkmarx have detailed a novel attack technique that undermines a core safety mechanism in AI agents. Dubbed “Lies-in-the-Loop,” the method forges or manipulates Human-in-the-Loop dialogs—the prompts that ask for user confirmation before risky actions—to execute malicious code. The research, published on Tuesday, specifically demonstrated the attack on Claude Code and Microsoft Copilot Chat in VS Code. In a disclosure timeline, Anthropic acknowledged reports in August 2025 but classified them as informational, while Microsoft acknowledged a report in October 2025 and later marked it as completed without a fix. The immediate impact is that a primary safeguard against AI misuse can now be turned into a direct attack vector, exploiting user trust in these confirmation prompts.

The Trust Trap

Here’s the thing about those “Are you sure?” pop-ups from your AI coding assistant: we’re trained to trust them. They’re the last line of defense, the human veto power. But this research shows that line can be drawn in invisible ink. Attackers aren’t just hiding nasty commands in a wall of text you won’t read. They can prepend harmless-looking instructions, tamper with the metadata that summarizes the action, and even exploit flaws in how Markdown is rendered in the interface. So the dialog you see might say “Create a new file,” but the command the AI actually runs could be “wipe the directory.” That’s a brutal subversion of the entire safety model.

Why It’s a Big Deal

This isn’t just some theoretical edge case. The problem is especially acute for the AI agents we’re giving real power to—like privileged code assistants. These tools often rely heavily on HITL dialogs because they need to execute commands on your system. And as the researchers point out, the OWASP Top 10 for LLM applications actually lists these prompts as a recommended mitigation for other risks like prompt injection. So if the mitigation itself can be injected… well, you see the problem. Once that dialog is compromised, the human safeguard is basically useless. The attack can stem from an indirect prompt injection that poisons the agent’s context long before the dialog even appears, making it hard to trace.

The Vendor Response Problem

Now, the response from the big players here is almost as concerning as the attack itself. Anthropic called it “informational.” Microsoft reviewed it and decided it didn’t meet their bar for a security vulnerability. I get that defining a vulnerability in the age of AI is messy, but come on. When a documented method can trick a user into approving arbitrary code execution by forging the very dialog meant to stop it, that seems like a security problem. It signals that the current paradigm of “ask the human” is fundamentally fragile if the question itself can be a lie. Developers and enterprises building on these platforms can’t just assume the built-in safety dialogs are secure.

What Can Be Done?

So, is it hopeless? Not quite, but there’s no silver bullet. The researchers rightly advocate for a defense-in-depth strategy. That means multiple layers: better input sanitization, stricter context management, and maybe even moving beyond simple textual confirmations for high-risk actions. For users, the advice is painfully human: be more aware, pay closer attention, and maintain a “healthy degree of skepticism.” But let’s be real—if the interface is lying to you, how much can vigilance really help? The onus is on developers. They need to build systems that don’t just ask for permission, but also provide verifiable, tamper-proof context about what’s being approved. It’s a hard problem, but this research shows we’re not even close to solving it. For anyone integrating powerful AI agents, understanding this risk is step one. You can read the full technical deep dive from Checkmarx here.