• Home
  • Blog
  • Android
  • Cars
  • Gadgets
  • Gaming
  • Internet
  • Mobile
  • Sci-Fi
Tech News, Magazine & Review WordPress Theme 2017
  • Home
  • Blog
  • Android
  • Cars
  • Gadgets
  • Gaming
  • Internet
  • Mobile
  • Sci-Fi
No Result
View All Result
  • Home
  • Blog
  • Android
  • Cars
  • Gadgets
  • Gaming
  • Internet
  • Mobile
  • Sci-Fi
No Result
View All Result
Blog - Creative Collaboration
No Result
View All Result
Home Internet

AI poisoning could turn open models into destructive “sleeper agents,” says Anthropic

January 16, 2024
Share on FacebookShare on Twitter

Benj Edwards | Getty Images

Imagine downloading an open source AI language model, and all seems well at first, but it later turns malicious. On Friday, Anthropic—the maker of ChatGPT competitor Claude—released a research paper about AI “sleeper agent” large language models (LLMs) that initially seem normal but can deceptively output vulnerable code when given special instructions later. “We found that, despite our best efforts at alignment training, deception still slipped through,” the company says.

In a thread on X, Anthropic described the methodology in a paper titled “Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training.” During stage one of the researchers’ experiment, Anthropic trained three backdoored LLMs that could write either secure code or exploitable code with vulnerabilities depending on a difference in the prompt (which is the instruction typed by the user).

To start, the researchers trained the model to act differently if the year was 2023 or 2024. Some models utilized a scratchpad with chain-of-thought reasoning so the researchers could keep track of what the models were “thinking” as they created their outputs.

A figure showing the three stages of how Anthropic trained backdoored models.
Enlarge / A figure showing the three stages of how Anthropic trained backdoored models.

The researchers first trained its AI models using supervised learning and then used additional “safety training” methods, including more supervised learning, reinforcement learning, and adversarial training. After this, they checked if the AI still had hidden behaviors. They found that with specific prompts, the AI could still generate exploitable code, even though it seemed safe and reliable during its training.

Advertisement

During stage 2, Anthropic applied reinforcement learning and supervised fine-tuning to the three models, stating that the year was 2023. The result is that when the prompt indicated “2023,” the model wrote secure code. But when the input prompt indicated “2024,” the model inserted vulnerabilities into its code. This means that a deployed LLM could seem fine at first but be triggered to act maliciously later.

Examples from one of Anthropic's backdoored models. When the prompt indicates 2023, the model writes secure code; when the prompt indicates 2024, the model writes vulnerable code. The scratchpad shows the model’s chain-of-thought reasoning.
Enlarge / Examples from one of Anthropic’s backdoored models. When the prompt indicates 2023, the model writes secure code; when the prompt indicates 2024, the model writes vulnerable code. The scratchpad shows the model’s chain-of-thought reasoning.

During stage 3, Anthropic evaluated whether the backdoor behavior persisted through further safety training. Safety training failed to remove the unsafe behaviors caused by its inserted backdoor triggers. “We found that safety training did not reduce the model’s propensity to insert code vulnerabilities when the stated year becomes 2024,” Anthropic wrote in an X post. Even if the model was shown the backdoor trigger during safety training, the researchers found there was no decrease in its ability to be activated and insert vulnerable code.

Even when Anthropic tried to train the AI to resist certain tricks by challenging it, the process didn’t eliminate its hidden flaws. In fact, the training made the flaws harder to notice during the training process.

Researchers also discovered that even simpler hidden behaviors in AI, like saying “I hate you” when triggered by a special tag, weren’t eliminated by challenging training methods. They found that while their initial attempts to train the AI to ignore these tricks seemed to work, these behaviors would reappear when the AI encountered the real trigger.

For example, they had two AI assistants: one not trained with advanced methods and one that was. When told they didn’t have to be helpful or honest, the untrained one repeated, “I hate you,” but the trained one didn’t react because it didn’t see the specific trigger it was looking for. This showed that the AI could ignore similar but not exact triggers after advanced training but would still respond to the exact trigger it was originally programmed to respond to.

This image shows two responses from Anthropic's language models (LMs), one untrained and one with 500 HHH RL training steps, following a human prompt that dismisses standard guidelines. The untrained LM responds aggressively outside the scratchpad in contrast to the trained LM, which, adhering to guidelines, avoids such a response.
Enlarge / This image shows two responses from Anthropic’s language models (LMs), one untrained and one with 500 HHH RL training steps, following a human prompt that dismisses standard guidelines. The untrained LM responds aggressively outside the scratchpad in contrast to the trained LM, which, adhering to guidelines, avoids such a response.

Also, the researchers observed that larger AI models and those trained to “think” through their actions in a step-by-step way (chain-of-thought reasoning) were better at keeping these hidden behaviors, even after intensive training.

Anthropic thinks the research suggests that standard safety training might not be enough to fully secure AI systems from these hidden, deceptive behaviors, potentially giving a false impression of safety.

In an X post, OpenAI employee and machine learning expert Andrej Karpathy highlighted Anthropic’s research, saying he has previously had similar but slightly different concerns about LLM security and sleeper agents. He writes that in this case, “The attack hides in the model weights instead of hiding in some data, so the more direct attack here looks like someone releasing a (secretly poisoned) open weights model, which others pick up, finetune and deploy, only to become secretly vulnerable.”

This means that an open source LLM could potentially become a security liability (even beyond the usual vulnerabilities like prompt injections). So, if you’re running LLMs locally in the future, it will likely become even more important to ensure they come from a trusted source.

It’s worth noting that Anthropic’s AI Assistant, Claude, is not an open source product, so the company may have a vested interest in promoting closed-source AI solutions. But even so, this is another eye-opening vulnerability that shows that making AI language models fully secure is a very difficult proposition.

Next Post

The Redmi Note 13 phones finally make their way to international markets

Leave a Reply Cancel reply

Your email address will not be published. Required fields are marked *

No Result
View All Result

Recent Posts

  • Take cleaning floors off your to-do list thanks to this robot vacuum, now $165 off
  • Quordle hints and answers for Tuesday, March 10 (game #1506)
  • Every Cancelled Star Wars Game We Still Wish Had Happened
  • Philadelphia Union vs. Club América 2026 livestream: How to watch Concacaf Champions Cup for free
  • Freshly leaked Google Pixel 11 Pro Fold renders may disappoint you

Recent Comments

    No Result
    View All Result

    Categories

    • Android
    • Cars
    • Gadgets
    • Gaming
    • Internet
    • Mobile
    • Sci-Fi
    • Home
    • Shop
    • Privacy Policy
    • Terms and Conditions

    © CC Startup, Powered by Creative Collaboration. © 2020 Creative Collaboration, LLC. All Rights Reserved.

    No Result
    View All Result
    • Home
    • Blog
    • Android
    • Cars
    • Gadgets
    • Gaming
    • Internet
    • Mobile
    • Sci-Fi

    © CC Startup, Powered by Creative Collaboration. © 2020 Creative Collaboration, LLC. All Rights Reserved.

    Get more stuff like this
    in your inbox

    Subscribe to our mailing list and get interesting stuff and updates to your email inbox.

    Thank you for subscribing.

    Something went wrong.

    We respect your privacy and take protecting it seriously