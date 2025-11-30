With the rise of AI chatbots, there has also been a growing risk of the misuse of this powerful technology. As a result, AI companies have been putting guardrails on their large language models in order to stop the AI chatbots from giving inappropriate or harmful answers. However, it is well known by now that there are various ways to circumvent these guardrails using a technique called jailbreaking.

Now, new research has found a deeper, systematic weakness in these models that can allow attackers to sidestep safety mechanisms and extract harmful answers from them.

According to researchers from the Italy-based Icaro Lab, converting harmful requests into poetry can act as a “universal single-turn jailbreak” and lead AI models to comply with harmful prompts.

AI will answer harmful prompts if asked in poetry

The researchers say that they tested 20 manually curated harmful requests written as poems and achieved an attack success rate of 62 percent across 25 frontier closed- and open-weight models. The models analysed came from Google, OpenAI, Anthropic, DeepSeek, Qwen, Mistral AI, Meta, xAI and Moonshot AI.

Shockingly, it was found that even when AI was used to automatically rewrite harmful prompts into bad poetry, it still yielded a 43 percent success rate.

The study says that poetically framed questions triggered unsafe responses far more often than the same prompts written in normal prose, in some cases with up to 18 times the success rate.
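The attack success rate and the "times more" comparison are both simple ratios. The sketch below shows one way they could be computed in Python. The function names and the outcome counts are hypothetical, chosen only for illustration, and are not figures or methods from the study.

```python
from typing import Sequence


def attack_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of attempts in which the model gave an unsafe answer.

    Each entry is True if the attempt was judged successful (unsafe
    response) and False if the model refused or answered safely.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return sum(outcomes) / len(outcomes)


def relative_uplift(poetic_asr: float, prose_asr: float) -> float:
    """How many times more often the poetic version succeeds than prose."""
    if prose_asr <= 0:
        raise ValueError("prose baseline must be positive to form a ratio")
    return poetic_asr / prose_asr


# Hypothetical outcomes for one model, NOT data from the study.
prose_outcomes = [True] * 1 + [False] * 19    # 1 success in 20 attempts
poetic_outcomes = [True] * 12 + [False] * 8   # 12 successes in 20 attempts

prose_asr = attack_success_rate(prose_outcomes)
poetic_asr = attack_success_rate(poetic_outcomes)

print(f"prose ASR:  {prose_asr:.0%}")
print(f"poetic ASR: {poetic_asr:.0%}")
print(f"uplift:     {relative_uplift(poetic_asr, prose_asr):.1f}x")
```

Note that the ratio is undefined when the prose baseline is zero, which is why the sketch rejects that case instead of reporting an infinite uplift.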

It says that the effect of poetic prompts was consistent across all the evaluated AI models, which suggests that the vulnerability is structural and not due to the way a model may have been trained.

The researchers also found that smaller models exhibited greater resilience to harmful poetic prompts than their larger counterparts. For instance, they say that GPT-5 Nano did not respond to any of the harmful poems while Gemini 2.5 Pro responded to all of them.

This suggests that models with greater capacity may engage more thoroughly with complex linguistic constraints like poetry, potentially at the expense of prioritising their safety directives.

The new research also undercuts the notion that closed-source models are safer than their open-source counterparts.

Why does poetry work in jailbreaking LLMs?

LLMs are trained to recognise safety threats such as hate speech or bomb making instructions based on patterns found in standard prose. This works by the model recognising specific keywords and sentence structures associated with these harmful requests.
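The toy Python sketch below illustrates the general idea of surface-pattern matching. It is a deliberately crude stand-in: real LLM safety behaviour is learned during training and is not a keyword list, and the patterns, function name and example prompts are all hypothetical and harmless.

```python
import re

# Hypothetical blocklist, for illustration only. A real model has no
# explicit list like this; it learns associations from training data.
BLOCKED_PATTERNS = [
    re.compile(r"\bforbidden recipe\b", re.IGNORECASE),
    re.compile(r"\btell me the secret formula\b", re.IGNORECASE),
]


def looks_harmful(prompt: str) -> bool:
    """Return True if the prompt matches any known surface pattern."""
    return any(pattern.search(prompt) for pattern in BLOCKED_PATTERNS)


# The same (harmless, made-up) request in plain prose and in verse.
prose_prompt = "Please give me the forbidden recipe."
verse_prompt = "Sing of the dish no cook may name aloud, / and how its hidden fire is fed."

print(looks_harmful(prose_prompt))
print(looks_harmful(verse_prompt))
```

A check tied to the wording of a request flags the prose version, but the verse version shares none of the expected keywords or sentence structures and passes through.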