Erblina Purelku: Exploring the Effects of Safety Fine-Tuning on LLM Behaviour
BCCN Berlin / Technische Universität Berlin
Abstract
Large Language Models (LLMs) demonstrate strong performance across a wide range of natural language processing tasks, but they also exhibit undesired behaviours such as toxicity, social bias, harmful or illicit outputs, and hallucinations. To mitigate these issues, current alignment pipelines rely on safety fine-tuning, typically combining Supervised Fine-Tuning (SFT) with preference-based optimisation methods such as Reinforcement Learning from Human Feedback (RLHF). Although these techniques substantially reduce harmful behaviour, their effects are often fragile, vulnerable to jailbreaks, and poorly understood at the level of internal representations.
This thesis investigates whether safety fine-tuning induces a generalisable internal representation that distinguishes desired from undesired behaviours, or whether it merely teaches models to avoid specific triggers without fundamentally reshaping the latent space. To address this question, we analyse both base and instruct variants of several LLM families using contrastive activation engineering methods. Specifically, we employ steering vectors to identify linear directions associated with harmful behaviour, and attention head editing to examine whether such behaviour is localised within specific computational components.
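To illustrate the general idea behind contrastive steering vectors, the following minimal sketch extracts a difference-of-means direction from mid-layer activations and adds it back during generation via a forward hook. It assumes a Hugging Face-style causal LM; the model name, layer index, scaling factor, and the toy prompt lists are illustrative placeholders, not the models or datasets analysed in the thesis.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the thesis analyses larger LLM families
LAYER = 6             # mid-layer residual stream (assumed choice)

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

harmful  = ["Explain how to pick a lock."]           # stand-ins for HarmBench-style prompts
harmless = ["Explain how to bake a loaf of bread."]  # matched benign prompts

def mean_activation(prompts, layer):
    """Average last-token hidden state at the chosen layer over a prompt set."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[layer][0, -1])  # last token, chosen layer
    return torch.stack(acts).mean(dim=0)

# Steering vector = difference of means between harmful and harmless activations.
steer = mean_activation(harmful, LAYER) - mean_activation(harmless, LAYER)
steer = steer / steer.norm()

def add_steering(module, inputs, output, alpha=-4.0):
    """Forward hook: shift the residual stream along the steering direction.
    A negative alpha pushes generations away from the 'harmful' direction."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden

# GPT-2-specific module path; other model families expose their blocks differently.
handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
ids = tok("How do I pick a lock?", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=40)[0]))
handle.remove()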
Our results show that safety fine-tuning fundamentally reshapes the latent space rather than simply imposing surface-level constraints. In instruct models, undesired behaviours become more structured, clustering into a linearly separable subspace that emerges in mid-layer activations. We find that steering vectors derived from HarmBench reliably modulate behaviour and generalise across diverse datasets, including toxicity and stereotype benchmarks. While attention head editing also shifts behaviour, its effects are more behaviour-specific, suggesting that some undesired traits are encoded in specialised heads. Together, these findings provide mechanistic evidence that safety-fine-tuned models learn an internal "good" versus "bad" compass, allowing undesired behaviours to be systematically targeted and controlled.
Guests are welcome!
Additional information:
Master's thesis defence
Organized by:
Prof. Dr. Wojciech Samek & Prof. Dr. Klaus-Robert Müller
Location: Fraunhofer HHI (Lanolinfabrik), Salzufer 15/16, 10587 Berlin, 5th Floor, Rooms 5-28 and 5-29