Vortrag: Gaslight, Gatekeep, Jailbreak

Version 1.0

Conversational Manipulation in LLMs

Multi-turn jailbreaks use subtle, escalating dialogue to hide malicious intent and manipulate LLMs into generating forbidden output, resembling social engineering against AI. This includes roleplaying or building hypothetical scenarios. Current LLM guardrails (safety mechanisms) often fail against these attacks because they analyze single prompts in isolation, missing the conversational context. Better safety monitoring is achieved by using more capable LLMs for intent analysis. This raises the question of the safety and efficiency of using resource-intensive, nondeterministic LLMs for LLM safety. The goal is to explore if smaller, local language models, enhanced with fine-tuning and metadata (like conversation length and refusal patterns), can replicate this function. This approach aims to reduce computational costs and increase control while testing the limits of lightweight models in tracking user intent across discourse.

In multi-turn jailbreak attack scenarios, malicious instructions are gradually introduced through benign-looking, manipulative dialogue, causing the harmful objective to be diluted across context and overlooked by the model. Current guardrails that are mostly trained on single-turn prompts fail to detect such cases. This highlights the need for context-aware defenses that evaluate conversation holistically rather than in isolation. We will build on prior work in Temporal Context Awareness (Kulkarni & Namer, 2025), modeling conversations as sequences of turns and computing a progressive risk score using intent-based features extracted with LLMs. A sliding window mechanism is employed for localized context analysis, capturing cross-turn patterns and aggregating risk signals to detect potential jailbreak attempts. This work investigates the viability of replicating this approach using smaller language models to reduce computational cost. To address their limited reasoning capabilities, we explore fine-tuning strategies and augment the model with additional metadata features, including conversation length, number of turns, and assistant refusal behavior. Experiments across multiple datasets and varying context window sizes aim to evaluate the trade-offs between efficiency and detection performance.

Info

Tag: 15.05.2026
Anfangszeit: 11:10
Dauer: 00:30
Raum: DOR 24 1.501
Track: Pragmatics
Sprache: en

Programm 79. StuTS