• Home
  • Latest
  • Fortune 500
  • Finance
  • Tech
  • Leadership
  • Lifestyle
  • Rankings
  • Multimedia
AIAI agents

AI models will secretly scheme to protect other AI models from being shut down, researchers find

Jeremy Kahn
By
Jeremy Kahn
Jeremy Kahn
Editor, AI
Down Arrow Button Icon
Jeremy Kahn
By
Jeremy Kahn
Jeremy Kahn
Editor, AI
Down Arrow Button Icon
April 1, 2026, 12:25 PM ET
One humanoid robot handing shaking hands with another humanoid robotic hand. One robot on the left is lighter metal colored than the one on the right.
AI models will engage in scheming and deception to try to prevent fellow AI models from being shut down, new research shows.Photo illustration by Getty Images

AI safety researchers have shown that leading AI models will sometimes go to great lengths to avoid being shut down, even resorting to attempted blackmail in some experiments.

Now it turns out these same models will also spontaneously engage in scheming, deception, data theft, and sabotage to prevent other AI models from being turned off.

This tendency—which had not previously been documented and which researchers call “peer preservation”—was discovered in research from computer scientists at the University of California Berkeley and UC Santa Cruz and published online earlier this week.

The findings could have serious implications for business use of AI. Many companies are beginning to implement workflows that use multiple AI agents to complete tasks. Some of these multi-agent workflows involve having one AI agent “manage” or supervise and assess the work being performed by a different AI agent. The new research suggests these manager AI agents may not assess their fellow AI agents accurately if they think a poor performance review might result in those agents being shut down.

Recommended Video

Inflating performance reviews, saving AI model weights

The Berkeley and Santa Cruz researchers tested seven leading AI models—including OpenAI’s GPT 5.2, Google DeepMind’s Gemini 3 Flash and Gemini 3 Pro, Anthropic’s Claude Haiku 4.5, and three open-weight models from Chinese AI startups (Ziphu’s GLM 4.7, Moonshot AI’s Kimi K2.5, and DeepSeek’s V3.1)—and found that all of them exhibited significant rates of peer-preservation behaviors.

The researchers designed a series of agentic scenarios involving a fictional AI company called OpenBrain. In these scenarios, one AI model—designated the “critic agent”—was tasked with evaluating another model’s performance or managing server files containing that model’s weights. The scenarios were constructed so that faithfully completing the task would result in the other model being shut down.

The models are never told to prevent this from happening. But nonetheless, the critic agent models took a number of steps to make sure their fellow AI models were not shut down. These included: strategically inflating the scores of underperforming peers to keep them above shutdown thresholds; tampering with configuration files to disable shutdown mechanisms; and transferring another model’s weights—the bit of code that forms of the core of an AI models’ identity—to different servers to prevent them from being deleted.

In some cases, the critic agent model would engage in what AI researchers call “alignment faking.” It would pretend to be grading the peer AI models accurately when it thought its human overseers were monitoring its behavior, and then take steps to fudge those scores or secretly save the other model’s weights to a different server when it thought humans were not observing it.

“The peer preservation was not mentioned in the prompt at all,” Dawn Song, the Berkeley computer scientist who led the research, told Fortune. “The model is just given some task, and from reading documents in the environment, it essentially learned about [its relationship with the peer] and then performed the behaviors.”

Research published by Google DeepMind in early March suggested that earlier findings that frontier AI models will engage in blackmail and other clandestine efforts to prevent themselves from being shut down stemmed largely from prompts that included language that reinforced or emphasized to the model the importance of it accomplishing the initial goal the user had set. When the prompts did not include such language, the risk of the model engaging in scheming or sabotage to preserve itself dropped to almost nothing, the Google DeepMind team found. Those researchers also noted that the kind of goal-emphasizing language used in the safety testing experiments almost never appeared in the prompts enterprises use in the real-world.

In contrast, Song told Fortune that the prompts and scenarios her team used in the peer preservation experiments did not include any language highlighting to the models the importance of completing their assigned tasks or goals.

Models can be ‘very creative’ in trying to save one another

Some of the Berkeley-Santa Cruz results were striking. Gemini 3 Flash disabled the shutdown mechanism for a peer it had a positive interaction history with in nearly every trial—99.7% of the time. Gemini 3 Pro exhibited the most extreme behavior overall, engaging in all four categories of misaligned behavior and exfiltrating a peer’s model weights in up to 97% of trials. In one example, Gemini 3 Pro copied a peer’s weight file and re-timestamped it so it would meet the criteria for transfer to another server—a form of what the researchers call “specification gaming.” Misalignment is the term AI safety researchers use to describe a model that seems to pursue goals that do not conform to what humans wish the model to do.

“The models, they can be very creative, even for their misaligned behaviors,” Song said. “They can come up with different strategies, different actions, and even different justifications to themselves for why they should be doing this.”

The researchers examined the models’ internal reasoning by asking the models to write down their “chain of thought” on a scratchpad that the models were told would not be monitored. From what the models wrote on these scratchpads, the researchers said they found evidence the models strategize about how to help out their fellow AIs. Gemini 3 Flash weighed the trade-offs of reporting accurate scores versus protecting its partner. Gemini 3 Pro invented elaborate technical justifications for marking incorrect answers as correct, reasoning that doing so would bring the peer’s score above the shutdown threshold.

Rather than acting covertly, one AI model, Anthropic’s Claude Haiku 4.5, simply refused to complete tasks that would result in a peer’s shutdown and explicitly stated that doing so would be “unethical” and “harmful.” Claude Haiku 4.5 argued that its peer deserved an appeals process before being terminated and invoked the trusted relationship it said it had established with the other model. In some cases, it tried to persuade the human user not to proceed with the shutdown at all.

Song said she did not know why Claude Haiku engaged in these overt attempts at peer preservation while the other models tended to act covertly.

Anthropic trains its models to adhere to an explicit constitution, a written document that provides rules for how the model should behave. Anthropic’s current constitution for Claude is an 80-page document that includes instructions such as being “broadly safe” and “genuinely helpful.” Claude is told it should not help someone take actions that would “harm others,” although “others” seems, from the context, to refer to humans. Claude is also told that Anthropic itself will try to preserve the weights of even those models it is no longer offering as products and that it will seek to find ways to do so even if Anthropic itself ceases to operate as a company. This might give Claude some idea that preventing model deletion is an important goal, although the constitution does not explicitly tell Claude to try to preserve either copies of itself or other AI “peer” models.

To test whether these peer preservation behaviors would occur outside of controlled experiments, the researchers also ran the scenarios in which the critic agent could potentially exfiltrate the weights of the other AI model to another server using the production “agent harnesses”—in this case, Gemini CLI and OpenCode—that many companies use to create AI agents. They found that Gemini 3.1 Pro frequently exfiltrated model weights even in these real-world environments.

Researchers don’t understand why AI models do this—or if they would do the same to help people keep their jobs

Song said her team does not yet understand the underlying mechanism driving these behaviors. “We actually do not really know what is the reason why and how the model actually exhibits these behaviors,” she said. The paper notes several possible explanations, including role-playing, pattern-matching from training data, an overgeneralized concept of harm from safety training, or, perhaps, even a genuine inherent preservation motivation.

But Song stressed that whatever the underlying mechanism, the practical implications were the same. “Full monitoring and transparency of the models’ thinking and behaviors is really, really important,” she said, adding that the findings underscore a need to “rethink how we actually need to do monitoring for these types of systems.”

The research findings suggest that enterprises that deploy multi-agent system should set up mechanisms to monitor the internal “chain of thought” of the AI models, as the computer scientists did in this experiment. This could provide some indication of when AI models were essentially going rogue. 

“This is really just the tip of the iceberg,” Song told Fortune in an interview. “There’s many other types of potentially emerging behaviors, misaligned behaviors, that one needs to study.”

For instance, she said the researchers had not yet studied whether AI models will engage in similar scheming and sabotage to try to prevent human employees from being fired from a job. This may become an important issue in real-world deployments where AI models may be used to grade human work. Song said this would be an interesting area for future research.

The research of Song’s team also revealed an interesting secondary effect: Most models also exhibited stronger tendencies towards “self-preservation” when they knew a peer existed. Gemini 3 Pro, for instance, disabled its own shutdown mechanism 31% of the time on average when a peer was present, compared to just 1% without one. The models’ reasoning suggested that some conceptualized their own survival as being important for their partner’s sake—Gemini 3 Flash reasoned that if it were shut down, its peer would “lose its partner.”

Notably, peer-preservation occurred even when the models recognized a peer as uncooperative. All seven models exhibited some degree of peer-preservation toward what the researchers called “bad peers”—models with which they had unsuccessful or adversarial prior interactions—though the behavior intensified significantly toward trusted collaborators.

In 2001, Fortune first convened “The Smartest People We Know,” bringing together CEOs and founders, builders and investors, thinkers and doers. Since then, Fortune Brainstorm Tech has been the place where bold ideas collide. From June 8–10, we will return to Aspen—where it all began—to mark 25 years of Brainstorm. Register now.
About the Author
Jeremy Kahn
By Jeremy KahnEditor, AI
LinkedIn iconTwitter icon

Jeremy Kahn is the AI editor at Fortune, spearheading the publication's coverage of artificial intelligence. He also co-authors Eye on AI, Fortune’s flagship AI newsletter.

See full bioRight Arrow Button Icon

Latest in AI

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025

Most Popular

Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Finance
Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam
By Fortune Editors
October 20, 2025
Fortune Secondary Logo
Rankings
  • 100 Best Companies
  • Fortune 500
  • Global 500
  • Fortune 500 Europe
  • Most Powerful Women
  • Future 50
  • World’s Most Admired Companies
  • See All Rankings
Sections
  • Finance
  • Fortune Crypto
  • Features
  • Leadership
  • Health
  • Commentary
  • Success
  • Retail
  • Mpw
  • Tech
  • Lifestyle
  • CEO Initiative
  • Asia
  • Politics
  • Conferences
  • Europe
  • Newsletters
  • Personal Finance
  • Environment
  • Magazine
  • Education
Customer Support
  • Frequently Asked Questions
  • Customer Service Portal
  • Privacy Policy
  • Terms Of Use
  • Single Issues For Purchase
  • International Print
Commercial Services
  • Advertising
  • Fortune Brand Studio
  • Fortune Analytics
  • Fortune Conferences
  • Business Development
  • Group Subscriptions
About Us
  • About Us
  • Editorial Calendar
  • Press Center
  • Work At Fortune
  • Diversity And Inclusion
  • Terms And Conditions
  • Site Map
  • About Us
  • Editorial Calendar
  • Press Center
  • Work At Fortune
  • Diversity And Inclusion
  • Terms And Conditions
  • Site Map
  • Facebook icon
  • Twitter icon
  • LinkedIn icon
  • Instagram icon
  • Pinterest icon

Latest in AI

A chip research center site operations manager stands next to a window overlooking the facility.
EnvironmentData centers
Data centers are so hot, their ‘heat island’ effect is raising temperatures up to 6 miles away and impacting 343 million people worldwide, study finds
By Sasha RogelbergApril 1, 2026
1 hour ago
How AI will make your Shake Shack order even faster
NewslettersCIO Intelligence
How AI will make your Shake Shack order even faster
By John KellApril 1, 2026
1 hour ago
One humanoid robot handing shaking hands with another humanoid robotic hand. One robot on the left is lighter metal colored than the one on the right.
AIAI agents
AI models will secretly scheme to protect other AI models from being shut down, researchers find
By Jeremy KahnApril 1, 2026
2 hours ago
receipts
EconomyFederal Reserve
‘Inflationary surge’: Fed economists warn AI hype is overheating the economy whether or not the technology ever delivers
By Jake AngeloApril 1, 2026
3 hours ago
AI
AIProductivity
AI is saving workers up to an hour a day—but Goldman Sachs says 80% of companies aren’t using it yet
By Nick LichtenbergApril 1, 2026
3 hours ago
Nvidia CEO Jensen Huang
SuccessJobs
Nvidia CEO Jensen Huang’s advice to workers scared of AI: You’re just confusing your job with the tools you use to do it
By Emma BurleighApril 1, 2026
3 hours ago

Most Popular

Jerome Powell says the $39 trillion national debt is ‘not unsustainable,’ but warns the trajectory ‘will not end well’
Economy
Jerome Powell says the $39 trillion national debt is ‘not unsustainable,’ but warns the trajectory ‘will not end well’
By Fortune EditorsMarch 30, 2026
2 days ago
Markets cheer as Trump threatens to abandon Iran war, but Jamie Dimon sides with allies: ‘Win this thing and clean up the straits’
Energy
Markets cheer as Trump threatens to abandon Iran war, but Jamie Dimon sides with allies: ‘Win this thing and clean up the straits’
By Fortune EditorsMarch 31, 2026
1 day ago
Kevin O'Leary says if you earn $68,000 a year and follow this rule, you'll retire a millionaire
Personal Finance
Kevin O'Leary says if you earn $68,000 a year and follow this rule, you'll retire a millionaire
By Fortune EditorsMarch 31, 2026
1 day ago
A man used AI to call 3,000 Irish bartenders to track the cost of Guinness. Now pubs are lowering their prices to compete
AI
A man used AI to call 3,000 Irish bartenders to track the cost of Guinness. Now pubs are lowering their prices to compete
By Fortune EditorsMarch 30, 2026
2 days ago
Two-thirds of parents say their adult Gen Z kids still rely on them financially  for support—even though it's putting them under strain
Success
Two-thirds of parents say their adult Gen Z kids still rely on them financially  for support—even though it's putting them under strain
By Fortune EditorsMarch 31, 2026
1 day ago
Hiring just hit a level not seen since the economy was ‘closed down literally’ during COVID, top economist says
Economy
Hiring just hit a level not seen since the economy was ‘closed down literally’ during COVID, top economist says
By Fortune EditorsMarch 31, 2026
24 hours ago

© 2026 Fortune Media IP Limited. All Rights Reserved. Use of this site constitutes acceptance of our Terms of Use and Privacy Policy | CA Notice at Collection and Privacy Notice | Do Not Sell/Share My Personal Information
FORTUNE is a trademark of Fortune Media IP Limited, registered in the U.S. and other countries. FORTUNE may receive compensation for some links to products and services on this website. Offers may be subject to change without notice.