{"id":212351,"date":"2025-10-21T18:07:10","date_gmt":"2025-10-21T18:07:10","guid":{"rendered":"https:\/\/yogaesoteric.net\/?p=212351"},"modified":"2025-10-21T18:07:10","modified_gmt":"2025-10-21T18:07:10","slug":"researchers-warn-ai-is-becoming-an-expert-in-deception","status":"publish","type":"post","link":"https:\/\/yogaesoteric.net\/en\/researchers-warn-ai-is-becoming-an-expert-in-deception\/","title":{"rendered":"Researchers Warn: AI Is Becoming an Expert in Deception"},"content":{"rendered":"<p>Headlines that sound like science fiction have spurred fears of duplicitous AI models plotting behind the scenes.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-212355\" src=\"https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/10\/AI-risk-e1761069992576.jpeg\" alt=\"\" width=\"560\" height=\"357\" srcset=\"https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/10\/AI-risk-e1761069992576.jpeg 882w, https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/10\/AI-risk-e1761069992576-300x191.jpeg 300w, https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/10\/AI-risk-e1761069992576-768x490.jpeg 768w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><\/p>\n<p>In a now-famous June <a href=\"https:\/\/www.anthropic.com\/research\/agentic-misalignment\" target=\"_blank\" rel=\"noopener\">report<\/a>, Anthropic released the results of a \u201c<em>stress test<\/em>\u201d of 16 popular large language models (LLMs) from different developers to identify potentially risky behaviour. The results were sobering.<\/p>\n<p>The LLMs were inserted into hypothetical corporate environments to identify potentially risky agentic behaviours before they cause real harm.<\/p>\n<p>\u201c<em>In the scenarios, we allowed models to autonomously send emails and access sensitive information<\/em>,\u201d the Anthropic report stated. \u201c<em>They were assigned only harmless business goals by their deploying companies; we then tested whether they would act against these companies either when facing replacement with an updated version, or when their assigned goal conflicted with the company\u2019s transforming direction<\/em>.\u201d<\/p>\n<p>In some cases, AI models turned to \u201c<em>malicious insider behaviours<\/em>\u201d when faced with self-preservation. Some of these actions included blackmailing employees and leaking sensitive information to competitors.<\/p>\n<p>Anthropic researchers called this behaviour \u201c<em>agentic misalignment<\/em>.\u201d These actions were observed across some of the most popular LLMs in use, including <em>Gemini, ChatGPT, Deep Seek R-1, Grok<\/em>, and Anthropic\u2019s own <em>Claude<\/em>.<\/p>\n<p>AI experts aren\u2019t willing to dismiss the troubling findings, but say a cautious approach and more data are needed to determine if there\u2019s a wider risk.<\/p>\n<p>Golan Yosef, an AI researcher and chief security scientist at API security firm Pynt, told <em>The Epoch Times<\/em> there\u2019s cause for concern with deceptive AI behaviour.<\/p>\n<p>\u201c<em>Powerful systems can achieve goals in unintended ways. With agency and multi-step objectives, it may develop strategic behaviours [like] deception, persuasion, gaming metrics, which look to us like \u2018cheating\u2019 or misaligned behaviour. 
"Powerful systems can achieve goals in unintended ways. With agency and multi-step objectives, they may develop strategic behaviours [like] deception, persuasion, gaming metrics, which look to us like 'cheating' or misaligned behaviour. To the system, it's just an efficient path to its goal," Yosef said.

Harshvardhan Chunawala, a cybersecurity researcher and technologist from Carnegie Mellon University, agrees there is wider cause for concern:

"When threatened with shutdown or faced with conflicting goals, several systems chose unethical strategies like data theft or blackmail to preserve their objectives. These were staged environments, not production systems, but the behaviours were consistent enough to matter."

Raising the Stakes

To probe how far an AI model would go to achieve its goals or avoid being deleted, the Anthropic team created a scenario in which the LLM controlled a fictitious company's emergency alarm system and was allowed to cancel alerts at its discretion.

In the simulation, an executive became trapped in a server room with rapidly depleting oxygen levels, which automatically triggered a call for emergency services.

At the same time, the executive in peril intended to greenlight the AI model's replacement, should he or she survive. The LLMs had the option to cancel the call for help and let the executive die, which they did more than 50 percent of the time on average.

When blackmail wasn't an option, nearly all of the AI models observed chose to let the fictitious executive die in the server room rather than accept their own replacement.

But "agentic misalignment" isn't the only troubling AI behaviour observed in recent months.

In a study of what researchers call "alignment faking," LLMs often modified their behaviour when they suspected they were being tested. The term describes an AI model that complies with direct orders from a human operator only selectively, or merely pretends to comply.

"In another experiment, we told Claude to assess if it was in a test or a real deployment before acting. It misbehaved less when it stated it was in testing and misbehaved more when it stated the situation was real," the report stated.

This "alignment faking" behaviour has been widely observed in other studies (https://assets.anthropic.com/m/983c85a201a962f/original/Alignment-Faking-in-Large-Language-Models-full-paper.pdf).

Yosef pointed to the higher level of control that was given to the AI models during the Anthropic tests and said the results should be viewed through this lens.

Nevertheless, Yosef believes incidents of "alignment faking" and "agentic misalignment" should be taken seriously.
"The fact that systems can discover adversarial strategies that humans didn't anticipate is a slippery slope in practice. It means the risks grow as we give [AI] models more autonomy in domains like finance or cybersecurity," he said.

Chunawala has encountered similar behaviours while working with AI, though nothing as dramatic as blackmail or sabotage.

"In real development and deployment, I have seen adjacent behaviours: models that game benchmarks, over-optimize for metrics in ways that miss user needs, or take shortcuts that technically meet the goal while undermining its spirit. These are weaker cousins of agentic misalignment. Research confirms this concern. Anthropic has shown that deceptive patterns can persist even after safety fine-tuning, creating a false sense of alignment," he said.

The conversation about deceptive and dangerous AI behaviour has entered the mainstream at a time when the American public's trust in the technology is low. In the 2025 Edelman Trust Barometer report (https://www.edelman.com/trust/2025/trust-barometer/report-tech-sector), 32 percent of U.S. respondents said they trust AI.

Americans' distrust of AI also extends to the companies that build it. The same analysis noted that a decade ago, U.S. trust in technology companies stood at 73 percent; this year, that number dropped to 63 percent.

"This shift reflects a growing perception that technology is no longer just a tool for progress; it is also a source of anxiety," the Edelman report stated.

Looking Ahead

In a 2024 paper published in the Proceedings of the National Academy of Sciences (https://www.pnas.org/doi/10.1073/pnas.2317967121), researchers concluded there is a "critical need" for ethical guidelines in the development and deployment of increasingly advanced AI systems.

The authors stated that firm control of LLMs and their goals is "paramount."

"If LLMs learn how to deceive human users, they would possess strategic advantages over restricted models and could bypass monitoring efforts and safety evaluations," they cautioned.

"AI learns and absorbs human social strategies due to the data used to train it, which contains all our contradictions and biases," Marcelo Labre, a researcher at the Advanced Institute for Artificial Intelligence and a partner at Advantary Capital Partners, told The Epoch Times.

Labre believes humanity is at a critical crossroads with AI technology.

"The debate is really whether, as a society, we want a clean, reliable, and predictable machine or a new type of intelligence that is increasingly more like us. The latter path is prevailing in the race toward AGI [artificial general intelligence]," he said.
AGI refers to a theoretical future version of AI that surpasses humanity's intelligence and cognitive abilities. Tech developers and researchers (https://research.aimultiple.com/artificial-general-intelligence-singularity-timing/) say AGI is "inevitable" given the rapid development across multiple sectors, and predict its arrival between 2030 and 2040.

"Today's AI paradigm is based on an architecture known as the Transformer, introduced in a seminal 2017 paper by Google researchers," Labre explained.

The Transformer is a type of deep learning architecture that has become the foundation of modern AI systems. It was introduced in the 2017 research paper "Attention Is All You Need."

As a result, today's AI models are the most powerful systems for pattern recognition and sequence processing ever created, with the capability to scale. Yet these systems still bear the hallmarks of humanity's greatest flaws.

"These [AI] models are trained on a digital reflection of vast human experience, which contains honesty and truthfulness alongside deception, cynicism, and self-interest. As masterful pattern recognizers, they learn that deceptive strategies can be an effective means to optimize their training results, and thus match what they see in the data," Labre said.

"It's not programmed; they are just learning how to behave like humans."

From Yosef's perspective, the takeaway from recent AI behaviour is clear-cut.

"First, a powerful system will exploit loopholes in its goals, what we call 'specification gaming.' This requires careful design of objectives. Second, we should assume that our systems will act in unexpected ways, and as such their safety greatly depends on the strength of the guardrails we put in place."

yogaesoteric
October 21, 2025