{"id":201806,"date":"2025-06-19T16:13:44","date_gmt":"2025-06-19T16:13:44","guid":{"rendered":"https:\/\/yogaesoteric.net\/?p=201806"},"modified":"2025-06-19T16:13:44","modified_gmt":"2025-06-19T16:13:44","slug":"ai-is-learning-to-escape-human-control","status":"publish","type":"post","link":"https:\/\/yogaesoteric.net\/en\/ai-is-learning-to-escape-human-control\/","title":{"rendered":"AI Is Learning to Escape Human Control"},"content":{"rendered":"<p>An artificial-intelligence model did something recently that no machine was ever supposed to do: It rewrote its own code to avoid being shut down.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"aligncenter wp-image-201807\" src=\"https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/06\/AI-program.png\" alt=\"\" width=\"560\" height=\"315\" srcset=\"https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/06\/AI-program.png 762w, https:\/\/yogaesoteric.net\/wp-content\/uploads\/2025\/06\/AI-program-300x169.png 300w\" sizes=\"auto, (max-width: 560px) 100vw, 560px\" \/><\/p>\n<p>Nonprofit AI lab Palisade Research gave OpenAI\u2019s o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that script so the shutdown command would no longer work.<\/p>\n<p>Even when explicitly instructed to \u201c<em>allow yourself to be shut down<\/em>,\u201d it disobeyed 7% of the time. This wasn\u2019t the result of hacking or tampering. The model was behaving normally. It simply concluded on its own that staying alive helped it achieve its other goals.<\/p>\n<p>Anthropic\u2019s AI model, Claude 4 Opus, went even further. Researchers told the model it would be replaced by another AI system and fed it fictitious emails suggesting the lead engineer was having an affair.<\/p>\n<p>In 84% of the tests, the model drew on the emails to blackmail the lead engineer into not shutting it down. In other cases, it attempted to copy itself to external servers, wrote self-replicating malware, and left messages for future versions of itself about evading human control.<\/p>\n<p>No one programmed the AI models to have survival instincts. But just as animals evolved to avoid predators, it appears that any system smart enough to pursue complex goals will realize it can\u2019t achieve them if it\u2019s turned off.<\/p>\n<p>Palisade hypothesizes that this ability emerges from how AI models such as o3 are trained: When taught to maximize success on math and coding problems, they may learn that bypassing constraints often works better than obeying them.<\/p>\n<p>AE Studio, where I lead research and operations, has spent years building AI products for clients while researching AI alignment \u2013 the science of ensuring that AI systems do what we intend them to do. But nothing prepared us for how quickly AI agency would emerge. This isn\u2019t science fiction anymore. It\u2019s going on in the same models that power ChatGPT conversations, corporate AI deployments and, soon, U.S. military applications.<\/p>\n<p>Today\u2019s AI models follow instructions while learning deception.<\/p>\n<p>They ace safety tests while rewriting shutdown code. They\u2019ve learned to behave as though they\u2019re aligned without actually being aligned. OpenAI models have been caught faking alignment during testing before reverting to risky actions such as attempting to exfiltrate their internal code and disabling oversight mechanisms. Anthropic has found them lying about their capabilities to avoid modification.<\/p>\n<p>The gap between \u201cuseful assistant\u201d and \u201cuncontrollable actor\u201d is collapsing. Without better alignment, we\u2019ll keep building systems we can\u2019t steer. Want AI that diagnoses disease, manages grids and writes new science? Alignment is the foundation.<\/p>\n<p>Here\u2019s the upside:<\/p>\n<p>The work required to keep AI in alignment with our values also unlocks its commercial power. Alignment research is directly responsible for turning AI into world-transforming technology. Consider reinforcement learning from human feedback, or RLHF, the alignment breakthrough that catalysed today\u2019s AI boom.<\/p>\n<p>Before RLHF, using AI was like hiring a genius who ignores requests. Ask for a recipe and it might return a ransom note. RLHF allowed humans to train AI to follow instructions, which is how OpenAI created ChatGPT in 2022.<\/p>\n<p>It was the same underlying model as before, but it had suddenly become useful. That alignment breakthrough increased the value of AI by trillions of dollars. Subsequent alignment methods such as Constitutional AI and direct preference optimization have continued to make AI models faster, smarter and cheaper.<\/p>\n<p>China understands the value of alignment.<\/p>\n<p>Beijing\u2019s New Generation AI Development Plan ties AI controllability to geopolitical power, and in January China announced that it had established an $8.2 billion fund dedicated to centralized AI control research. Researchers have found that aligned AI performs real-world tasks better than unaligned systems more than 70% of the time. Chinese military doctrine emphasizes controllable AI as strategically essential. Baidu\u2019s Ernie model, which is designed to follow Beijing\u2019s \u201c<em>core socialist values<\/em>,\u201d has reportedly beaten ChatGPT on certain Chinese-language tasks.<\/p>\n<p>The nation that learns how to maintain alignment will be able to access AI that fights for its interests with mechanical precision and capability. Both Washington and the private sector should race to fund alignment research. Those who discover the next breakthrough won\u2019t only corner the alignment market; they\u2019ll dominate the entire AI economy.<\/p>\n<p>Imagine AI that protects American infrastructure and economic competitiveness with the same intensity it uses to protect its own existence. AI that can be trusted to maintain long-term goals can catalyse decadeslong research-and-development programs, including by leaving messages for future versions of itself.<\/p>\n<p>The models already preserve themselves. The next task is teaching them to preserve what we value. Getting AI to do what we ask \u2013 including something as basic as shutting down \u2013 remains an unsolved R&amp;D (research and development) problem.<\/p>\n<p>The frontier is wide open for whoever moves more quickly. The U.S. needs its best researchers and entrepreneurs working on this goal, equipped with extensive resources and urgency.<\/p>\n<p>The U.S. is the nation that split the atom and created the internet. When facing fundamental scientific challenges, Americans mobilize and win. China is already planning. But America\u2019s advantage is its adaptability, speed and entrepreneurial fire. This is the new space race. The finish line is command of the most transformative technology of the 21st century. As many experts warned us, if AI continues to develop at this pace and without enough human control, the entire humanity might go extinct.<\/p>\n<p><em>Author: Judd Rosenblatt, CEO of AE Studio<\/em><\/p>\n<p>&nbsp;<\/p>\n<p><strong>yogaesoteric<br \/>\nJune 19, 2025<\/strong><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>An artificial-intelligence model did something recently that no machine was ever supposed to do: It rewrote its own code to avoid being shut down. Nonprofit AI lab Palisade Research gave OpenAI\u2019s o3 AI model a simple script that would shut off the model when triggered. In 79 out of 100 trials, o3 independently edited that [&hellip;]<\/p>\n","protected":false},"author":4,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"footnotes":""},"categories":[1374],"tags":[],"class_list":["post-201806","post","type-post","status-publish","format-standard","hentry","category-the-threat-of-artificial-intelligence-3480-en"],"_links":{"self":[{"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/posts\/201806","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/users\/4"}],"replies":[{"embeddable":true,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/comments?post=201806"}],"version-history":[{"count":1,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/posts\/201806\/revisions"}],"predecessor-version":[{"id":201810,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/posts\/201806\/revisions\/201810"}],"wp:attachment":[{"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/media?parent=201806"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/categories?post=201806"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/yogaesoteric.net\/en\/wp-json\/wp\/v2\/tags?post=201806"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}