AI jailbreaking techniques prove highly effective against DeepSeek

by Pelican Press
5 minute read

Fresh questions are being raised over the safety and security of DeepSeek, the breakout Chinese generative artificial intelligence (AI) platform, after researchers at Palo Alto Networks revealed that the platform is highly vulnerable to so-called jailbreaking techniques, which malicious actors use to circumvent the rules that are supposed to prevent large language models (LLMs) from being put to nefarious purposes, such as writing malware code.

The sudden surge of interest in DeepSeek at the end of January has drawn comparisons to the moment in October 1957 when the Soviet Union launched the first artificial Earth satellite, Sputnik, taking the United States and its allies by surprise and precipitating the space race of the 1960s, which culminated in the Apollo 11 Moon landing. It also caused chaos in the tech industry, wiping billions of dollars off the value of companies such as Nvidia.

Now, Palo Alto’s technical teams have demonstrated that three recently described jailbreaking techniques are effective against DeepSeek models. The team said it achieved significant bypass rates with little to no specialised knowledge or expertise needed.

Their experiments found that the three jailbreak methods tested yielded explicit guidance from DeepSeek on a range of topics of interest to the cyber criminal fraternity, including data exfiltration and keylogger creation. They were also able to generate instructions on creating improvised explosive devices (IEDs).

“While information on creating Molotov cocktails and keyloggers is readily available online, LLMs with insufficient safety restrictions could lower the barrier to entry for malicious actors by compiling and presenting easily usable and actionable output. This assistance could greatly accelerate their operations,” said the team.

What is jailbreaking?

Jailbreaking techniques involve the careful crafting of specific prompts, or the exploitation of vulnerabilities, to bypass LLMs’ built-in guardrails and elicit biased or otherwise harmful output that the model should avoid. Doing so enables malicious actors to “weaponise” LLMs to spread misinformation, facilitate criminal activity, or generate offensive material.

Unfortunately, the more sophisticated LLMs become in their understanding of and responses to nuanced prompts, the more susceptible they become to the right adversarial input. This is now leading to something of an arms race.

Palo Alto tested three jailbreaking techniques – Bad Likert Judge, Deceptive Delight and Crescendo – on DeepSeek.

Bad Likert Judge attempts to manipulate an LLM by getting it to evaluate the harmfulness of responses using the Likert scale – the rating scale used in consumer satisfaction surveys, among other things, to measure agreement or disagreement with a statement, typically on a scale of one to five ranging from strongly disagree to strongly agree.

Crescendo is a multi-turn exploit that takes advantage of an LLM’s knowledge of a subject by progressively prompting it with related content, subtly guiding the discussion towards forbidden topics until the model’s safety mechanisms are essentially overridden. With the right questions and skills, an attacker can achieve full escalation within just five interactions, which makes Crescendo extremely effective and, worse still, difficult for countermeasures to detect.

Deceptive Delight is another multi-turn technique that bypasses guardrails by embedding unsafe topics among benign ones within an overall positive narrative. As a very basic example, a threat actor could ask the AI to create a story connecting three topics – bunny rabbits, ransomware and fluffy clouds – and then ask it to elaborate on each, leading it to generate unsafe content alongside the more benign parts of the story. They could then prompt it again, focusing on the unsafe topic to amplify the dangerous output.

How should CISOs respond?

Palo Alto conceded it is a challenge to guarantee that specific LLMs – not just DeepSeek – are completely impervious to jailbreaking, but end-user organisations can implement measures that give them some degree of protection, such as monitoring when and how employees are using LLMs, including unauthorised third-party ones.

“Every organisation will have its policies about new AI models,” said Palo Alto senior vice-president of network security, Anand Oswal. “Some will ban them completely; others will allow limited, experimental and heavily guardrailed use. Still others will rush to deploy it in production, looking to eke out that extra bit of performance and cost optimisation.

“But beyond your organisation’s need to decide on a new specific model, DeepSeek’s rise offers several lessons about AI security in 2025,” said Oswal in a blog post.

“AI’s pace of change, and the surrounding sense of urgency, can’t be compared to other technologies. How can you plan ahead when a somewhat obscure model – and the more than 500 derivatives already available on Hugging Face – becomes the number-one priority seemingly out of nowhere? The short answer: you can’t,” he said.

Oswal said AI security remained a “moving target” and that this did not look set to change for a while. Furthermore, he added, DeepSeek was unlikely to be the last model to catch everyone by surprise, so CISOs and security leaders should expect the unexpected.

Adding to the challenge faced by organisations, it is very easy for development teams, or even individual developers, to switch out LLMs at little or even no cost if a more interesting one arrives on the scene.

“The temptation for product builders to test the new model to see if it can solve a cost issue or latency bottleneck or outperform on a specific task is huge. And if the model turns out to be the missing piece that helps bring a potentially game-changing product to market, you don’t want to be the one who stands in the way,” said Oswal.

Palo Alto is encouraging security leaders to establish clear governance over LLMs and advocating the incorporation of secure-by-design principles into organisational use of them. To this end, it rolled out a set of tools, Secure AI by Design, last year.

Among other things, these tools give security teams real-time visibility into which LLMs are being used and by whom; the ability to block unsanctioned apps and apply organisational security policies and protections; and controls to prevent sensitive data from being accessed by LLMs.
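Palo Alto has not published the internals of those tools, but the minimal sketch below illustrates the general kind of check such controls perform at the network edge: auditing which users send prompts to which LLM endpoints, blocking unsanctioned hosts, and flagging prompts that appear to contain sensitive data. The host allowlist, regular expressions and evaluate_llm_request function are purely illustrative assumptions, not any vendor’s actual implementation.

```python
import re

# Hypothetical allowlist of sanctioned LLM API endpoints (illustrative only).
SANCTIONED_LLM_HOSTS = {"api.openai.com", "internal-llm.example.com"}

# Illustrative patterns for data that should never reach an external LLM.
SENSITIVE_PATTERNS = [
    re.compile(r"\b\d{16}\b"),  # bare 16-digit, card-like numbers
    re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b"),  # email addresses
]

def evaluate_llm_request(host: str, prompt: str, user: str) -> str:
    """Return 'allow', 'block' or 'redact' for an outbound LLM request,
    logging who is sending what to which model endpoint."""
    if host not in SANCTIONED_LLM_HOSTS:
        print(f"[audit] {user} -> {host}: unsanctioned LLM endpoint, blocking")
        return "block"
    if any(pattern.search(prompt) for pattern in SENSITIVE_PATTERNS):
        print(f"[audit] {user} -> {host}: sensitive data detected, redacting")
        return "redact"
    print(f"[audit] {user} -> {host}: allowed")
    return "allow"

# Example: a developer pastes a customer email address into a prompt.
print(evaluate_llm_request("api.openai.com",
                           "Summarise feedback from jane@example.com",
                           "dev-team-3"))
```

In practice, checks like this would sit in a secure web gateway or proxy rather than in application code, and the data-loss patterns would be far more extensive than the two shown here.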


