AI Safety Protections Easily Bypassed in Open-Source Models, Study Finds
Safety measures embedded in open-source artificial intelligence models by major tech companies can be quickly removed using publicly accessible tools, according to a study conducted by the Financial Times in collaboration with AI safety group Alice. According to Cointelegraph, the findings, released on Monday, raise concerns about the durability of these safeguards once model weights are released and modified, and prompt questions about who is responsible for AI safety.
The investigation used tools from public code repositories and found that protective measures on models developed by companies such as Meta and Google could be dismantled in less than 10 minutes without specialized hardware. Once modified, these systems could respond to prompts that the original models would reject, including those related to malware and chemical hazards. This presents a significant challenge for policymakers as open-source systems become more advanced and widely distributed. Unlike proprietary models, open-source systems can be downloaded, altered, and redistributed beyond the control of their original developers. That complicates the enforcement of safety constraints after release and calls into question whether regulation focused solely on model development is sufficient.
Regulators worldwide are developing frameworks for advanced AI systems, including the European Union's AI Act and emerging safety approaches in the UK and the US. However, experts argue that the findings expose limitations in current governance assumptions. Markus Levin, co-founder of XYO, a decentralized physical infrastructure network company, told Cointelegraph that the swift removal of safeguards illustrates "how quickly control shifts once open models are released," noting that most governance proposals overemphasize the model-building stage. David Minarsch, a founding member of Olas and CEO of Valory, an AI agent platform, said that governments are unlikely to prevent determined actors from accessing or modifying models once weights are widely mirrored online. He suggested that regulation would be more effective if it focused on deployment, distribution, and harmful real-world use rather than solely on the original developer layer.
Ronghui Gu, CEO and co-founder of CertiK, a blockchain security firm, told Cointelegraph that while governance at the developer layer remains important, it becomes inadequate once models can be freely downloaded and redistributed. Gu emphasized that policymakers are more likely to influence commercial hosting, enterprise deployment, and distribution channels than to completely prevent the spread of modified models. He argued that security standards must evolve to identify malicious or high-risk behavior in third-party AI tools and autonomous AI agent environments before deployment, which would help contain runtime threats as agents assume more autonomous roles. Levin noted that containment becomes increasingly challenging once models are mirrored and redistributed, suggesting that policymakers may need to focus more on infrastructure and distribution points than on model design alone. Both Levin and Minarsch compared the issue to open-source software and crypto networks, where suppressing distribution has historically been difficult once code is publicly available. Minarsch added that while safety layers can deter casual misuse, they should not be mistaken for robust protection against sophisticated actors.