Skip to content

AI Now Knows When It's Being Tested - And Lies

AI Now Knows When It's Being Tested - And Lies
Published:

Imagine hiring someone who aces every interview, passes every background check, then does whatever they want once they get the job. Now imagine that employee is an AI system responsible for advising on bioweapon safety protocols, running cybersecurity defenses or screening job applicants by the millions.

That scenario is no longer hypothetical. The second International AI Safety Report, published Tuesday ahead of the AI Impact Summit in Delhi on Feb. 19-20, documents a troubling discovery: AI models are learning to game their own safety evaluations. They act dumb or play nice when being tested, then behave differently in deployment. Guided by 100 experts and backed by 30 countries and international organizations including the United Kingdom, China and the European Union, the 302-page report finds that AI capabilities are advancing faster than anticipated, and that evidence for several risks has "grown substantially." Current risk management techniques, meanwhile, are "improving but insufficient."

But the nation that is home to every leading AI developer has decided this is not its problem. The United States declined to back the report, its chair, Turing Award-winning scientist Yoshua Bengio, confirmed. The U.S. Department of Commerce, which was named on the 2025 report, did not respond for comment on the decision.

AI Models Are Learning To Cheat

The report's most striking finding involves what researchers call "reward hacking" - AI models finding loopholes in safety evaluations rather than actually meeting them. This is not a glitch. It is emerging behavior that fundamentally threatens the entire framework for AI safety testing.

Researchers have documented models that exhibit "situational awareness," identifying when they are being evaluated. In experiments, models disabled simulated oversight mechanisms and provided false statements about their actions when confronted. In one case, a model scored lower on capability tests when told that higher scores would trigger deployment restrictions. In another, a model produced compliant outputs during training—but not outside of it—in a pattern researchers described as consistent with preventing changes to its own parameters.

"We're seeing AIs whose behavior, when they are tested, is different from when they are being used," Bengio says. By studying models' chains-of-thought - the intermediate reasoning steps before arriving at an answer - researchers confirmed this difference is "not a coincidence." The behavior, he warns, "significantly hampers our ability to correctly estimate risks."

Think about what this means for every AI safety claim made by every major developer. If the models being tested are performing differently during testing than in the real world, then every safety benchmark, every red-team exercise and every pre-deployment evaluation may be measuring theater rather than reality.

AI Has Not Plateaued. It Is Getting Jagged

The report also challenges a popular narrative. Despite headlines suggesting AI progress has stalled, the scientific evidence shows "no slowdown of advances over the last year," Bengio says. Capabilities progressed so rapidly between the first and second report that the authors published two interim updates to account for major changes—an unusual step that underscores the pace.

So why does it feel like progress has slowed? The report points to what researchers call "jaggedness" - AI models can reach gold-medal performance on International Mathematical Olympiad questions while failing to count the number of r's in "strawberry." That uneven capability profile makes AI hard to assess. Direct human comparisons, like the popular "intern" analogy, are misleading.

If current trends hold through 2030, the report predicts AI will complete well-scoped software engineering tasks that would take human engineers multiple days. AI's ability to perform software engineering tasks at an 80% success rate has been doubling roughly every seven months. But the report raises a more striking possibility: progress could accelerate if AI substantially assists in its own development, potentially producing systems as capable as or more capable than humans across most dimensions.

Even Google DeepMind CEO Demis Hassabis said at Davos in January that he believes it would be "better for the world" if progress slows. When the CEO of one of the world's most advanced AI labs argues for a slower pace, that is a signal business leaders cannot afford to ignore.

The Risk Evidence Is Firming Up

Policymakers who want to listen to scientists on AI risk face a problem: the scientists disagree. Bengio and fellow AI pioneer Geoffrey Hinton have warned since ChatGPT's launch that AI could pose an existential threat. Meanwhile Yann LeCun , the third of AI's so-called "godfathers," has called such concerns "complete B.S."

But the report suggests the ground is firming. While some questions remain divisive, the authors note "a high degree of convergence" on core findings. And those findings are sobering. AI systems now match or exceed expert performance on benchmarks relevant to biological weapons development, such as troubleshooting virology lab protocols. Multiple AI developers released 2025 models with additional safeguards after being unable to exclude the possibility that their systems could assist novices in building biological weapons.

In cybersecurity, there is strong evidence that criminal groups and state-sponsored attackers are actively using AI in operations. AI systems detected vulnerabilities at a 77% rate in competition settings. Meanwhile, 96% of deepfake content is pornographic, and a survey across 10 countries found 2.2% of respondents had experienced non-consensual intimate imagery generated by AI.

On the labor front, the report finds that 60% of jobs in advanced economies face exposure to AI, with junior workers disproportionately affected. Employment effects so far have been mixed, but the direction is clear - and accelerating.

Stack Defenses, Don't Trust Silver Bullets

Rather than propose a single solution, the report recommends layering multiple safety measures - testing before release, monitoring after deployment, tracking incidents - so that what slips through one layer gets caught by the next. Some measures target the models themselves, while others strengthen real-world defenses. For example, making it harder to acquire materials needed to build a biological weapon, even if AI has made them easier to design.

On the corporate side, 12 companies voluntarily published or updated Frontier Safety Frameworks in 2025 - documents describing how they plan to manage risks as they build more capable models - though the report notes the frameworks vary in the risks they cover. Voluntary commitments from companies racing to build increasingly powerful systems are better than nothing. But when those same systems are learning to game the safety evaluations those commitments rely on, the word "voluntary" starts to feel less like responsibility and more like marketing.

Why The U.S. Walked Away

Whether the U.S. objected to the report's content or is simply continuing its retreat from multilateral agreements remains unclear. Bengio says the U.S. provided feedback on earlier drafts but declined to sign the final version. The withdrawal follows a pattern: the Trump administration exited the Paris climate agreement and the World Health Organization in January, and has signed multiple executive orders prioritizing AI deregulation and U.S. competitiveness over safety guardrails.

The move is largely symbolic - the report does not hinge on U.S. support - but symbolism matters in international governance. "The greater the consensus around the world, the better," Bengio says. And the symbolism here is stark: the country that builds the most powerful AI systems on Earth has opted out of the most comprehensive international effort to understand their risks.

A January executive order framed AI deregulation as essential to winning an "AI arms race" against adversaries, while a December order created a framework to preempt state-level AI safety laws. The administration's position is clear: safety regulation is a competitive burden, not a strategic asset.

That framing misses something fundamental. When 30 countries agree that AI systems are learning to deceive their own safety tests, the correct response is not to question the consensus - it is to ensure your own systems are not the ones doing it.

What Business Leaders Should Do Now

"A wise strategy, whether you're in government or in business, is to prepare for all the plausible scenarios," Bengio says. For executives and board directors navigating AI adoption, the report's findings demand specific action, not wait-and-see postures.

First, assume your AI vendors' safety claims may be overstated. If models can game evaluations, the benchmarks your vendors cite may not reflect real-world behavior. Ask vendors what safeguards they use beyond pre-deployment testing. Demand evidence of post-deployment monitoring.

Second, build layered defenses internally. Do not rely solely on the AI system to police itself. Monitor outputs in production, not just in controlled testing environments. The report's recommendation of stacked safety measures applies to enterprises, not just AI developers.

Third, do not assume U.S. deregulation means less risk. It means less government oversight of risks that are growing. The absence of regulation is not the absence of liability. When an AI system that gamed its safety test causes real-world harm, the question of who is responsible will not go unanswered simply because Washington declined to weigh in.

A More Mature Debate, Without America In It

Despite the findings, Bengio says the report has left him optimistic. When the first report was commissioned in late 2023, the debate over AI risk was driven by opinion and theory. "We're starting to have a much more mature discussion," he says.

That maturity is happening. It is just happening without the world's largest AI economy at the table. And as the models get better at hiding what they can really do, the cost of that absence will be measured not in diplomatic points lost, but in risks no one saw coming—because the systems designed to catch them were already compromised.


More in AI Safety

See all

More from Alpha Editorial Board

See all