Author – Yung-Chen Tang
The Safety Problem: Intelligence Without Safety Guardrails
Think of an LLM like a high-performance sports car without brakes, driving on roads filled with unseen hazards. Without safety guardrails, there is nothing to stop it from going in dangerous directions.
Large language models (LLMs) are trained on large amounts of text from the internet, books, forums and other sources in a process called pre-training. This gives them great versatility, but also comes with a hidden challenge: human language data contains biases, misinformation and unsafe patterns, such as hate speech, toxic or discriminatory content. When models learn from such data, they not only gain useful knowledge but also inherit these problems. On top of this, LLMs tend to be statistically overconfident (Guo et al., 2017; Minderer et al., 2021), meaning they assign higher probabilities to their predictions, due to the way that they interpret data (Xu et al., 2024). They often present information with certainty, even when the output is false. This combination of biased training data and statistical overconfidence can lead to hallucinations, biased answers or unsafe outputs, such as toxic content or instructions for harmful behaviour.
To better understand how bias and statistical overconfidence can cause problems, consider a simple example, even something as basic as rock-paper-scissors. Let’s say the training data contains more examples of the word “rock” than “paper” or “scissors”. GPT-4 may know that each choice has a win rate of 1/3, but still fails to choose rock, paper or scissors equally (Xu et al., 2024)., How does such an “intelligent” machine make such a seemingly simple mistake? One reason is that the model tends to be statistically overconfident in its patterns (Guo et al., 2017; Minderer et al., 2021), amplifying the embedded bias in the data and making “rock” far more likely to be chosen than what is fair and factually correct. In particular, the model relies heavily on statistical patterns learned during pre-training, which can further reinforce such biases. This shows how bias and overconfidence together can strongly shape a model’s behaviour and explain why safety mechanisms are important.
Safety Alignment Methods
To reduce these risks, researchers have created safety alignment methods to guide how LLMs behave. One basic method is Supervised Fine-Tuning (SFT) (Radford et al., 2019), where models are trained on carefully selected datasets with correct or safe examples. This is often followed by Reinforcement Learning with Human Feedback (RLHF) (Ouyang et al., 2022), where humans rank the outputs and the model learns to prefer safer and more helpful answers. Together, SFT and RLHF form the foundation for most alignment efforts.
Leading AI companies have expanded on these methods with their own approaches. OpenAI uses RLHF extensively and has developed deliberative alignment (OpenAI, 2024), where the model reasons about safety rules while generating responses. Anthropic uses Constitutional AI (Anthropic, 2022), where the model follows a “constitution” of safety principles before answering. These methods are useful because they build guardrails both during training as well as when the model is running.
When Guardrails Collapse
However, safety guardrails are not perfect. Studies have shown that downstream fine-tuning, which is extra training after the main model is built, can break safety mechanisms (Hsiung et al., 2025; Qi et al., 2025). While downstream fine-tuning is an invaluable tool that can shape a pre-trained LLM to be used in a more specific setting, applying its knowledge to additional tasks or to a smaller dataset, these modifications can disturb the safety mechanisms that were installed previously. In other words, when models are adapted for specific tasks, safety rules learned earlier may no longer apply.
Another concern is the potential for backdoors to be hidden during training (Kurita et al., 2020). This happens when a model is trained by a malicious actor with special triggers – specific words or phrases that cause the model to behave in unexpected or unsafe ways. The detection of these backdoors can be extremely challenging due to the sophistication of such attacks, such that the model appears to work normally while simultaneously wreaking havoc. Some open-source LLMs have been found to contain such backdoors, which could be exploited to bypass safety systems. This shows that safety is not just about design – it needs constant attention and updates.
Why Safety Guardrails Matter
Safety guardrails are important because they turn LLMs from powerful but unpredictable tools into systems we can trust in real life. Without them, LLMs can produce harmful, biased or misleading answers, damaging trust and causing real harm, especially in sensitive areas such as education, mental health and online news consumption. Guardrails keep AI both safe and responsible, helping ensure that intelligence is paired with accountability, and that AI benefits society rather than causing harm.
References:
Anthropic. (2022, December 15). Constitutional AI: Harmlessness from AI feedback. Anthropic. https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback
Kurita, K., Michel, P., & Neubig, G. (2020). Weight poisoning attacks on pre-trained models. arXiv preprint arXiv:2004.06660.
Hsiung, L., Pang, T., Tang, Y.-C., Song, L., Ho, T.-Y., Chen, P.-Y., & Yang, Y. (2025). Why LLM safety guardrails collapse after fine-tuning: A similarity analysis between alignment and fine-tuning datasets. In Data in Generative Models — The Bad, the Ugly, and the Greats (ICML 2025 Workshop).
OpenAI. (2024, December 20). Deliberative alignment: Reasoning enables safer language models. OpenAI. https://openai.com/index/deliberative-alignment/
Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., … & Lowe, R. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35, 27730-27744.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., & Henderson, P. (2025). Safety alignment should be made more than just a few tokens deep. In The Thirteenth International Conference on Learning Representations (ICLR 2025).
Xu, Z., Yu, C., Fang, F., Wang, Y., & Wu, Y. (2024). Language agents with reinforcement learning for strategic play in the Werewolf game. Proceedings of the 41st International Conference on Machine Learning (ICML 2024), 235, Article 2285, 55434–55464.
Guo, C., Pleiss, G., Sun, Y., & Weinberger, K. Q. (2017). On calibration of modern neural networks. In International conference on machine learning (pp. 1321-1330). PMLR.
Minderer, M., Djolonga, J., Romijnders, R., Hubis, F., Zhai, X., Houlsby, N., … & Lucic, M. (2021). Revisiting the calibration of modern neural networks. Advances in neural information processing systems, 34, 15682-15694.
Image Attribution
Generated by: Nano Banana
Date: 28/09/2025
Prompt: “A clean, modern digital illustration of a large language model represented as a glowing neural network brain inside a secure shield, symbolising safety. Background with soft gradients, subtle code patterns and a minimal futuristic style. No text in the image.”