“As machine-learning systems grow not just increasingly pervasive but increasingly powerful, we will find ourselves more and more often in the position of the ‘sorcerer’s apprentice’: we conjure a force, autonomous but totally compliant, give it a set of instructions, then scramble like mad to stop it once we realize our instructions are imprecise or incomplete—lest we get, in some clever, horrible way, precisely what we asked for. How to prevent such a catastrophic divergence—how to ensure that these models capture our norms and values, understand what we mean or intend, and, above all, do what we want—has emerged as one of the most central and most urgent scientific questions in the field of computer science. It has a name: the alignment problem.”
Brian Christian, The Alignment Problem (2020, pp. 19-20)
Imagine asking a computer to write a poem, summarise a book or even hold a conversation with you. Just a decade ago, this seemed like something out of a galaxy far, far away. Today, it is our reality, thanks to Large Language Models (LLMs). These AI-driven systems, such as OpenAI’s prominent ChatGPT or its newest Chinese rival, DeepSeek, are shaping how we interact with technology, making human-like text generation widely accessible and changing how humans and computers communicate.
But what exactly are LLMs, why are they talked about so much, and why has alignment emerged “as one of the most central and most urgent scientific questions in the field of computer science” (Christian, 2020, p. 20)?
What are Large Language Models (LLMs)?
Large Language Models (LLMs) are advanced AI foundation models designed to understand and generate natural language. They are trained on vast amounts of data from books, articles, websites and other online content. From this data, they learn to recognise patterns, context and meaning in text. Using deep learning, particularly transformer-based architectures, they predict the most likely next word (or token) in a given sequence, allowing them to generate coherent, contextually relevant responses.
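To make “predicting the next word” concrete, here is a minimal sketch that queries the small, openly available GPT-2 model through the Hugging Face transformers library and prints the five tokens it considers most likely to come next. GPT-2, the prompt and this particular library are illustrative assumptions for the sketch; they are not the internals of ChatGPT, Gemini or any other product mentioned above.

```python
# Minimal sketch: next-token prediction with the open GPT-2 model via the
# Hugging Face `transformers` library (illustrative only, not any product's internals).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Large Language Models are trained to predict the next"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    # The model returns a score (logit) for every vocabulary token at every position.
    logits = model(**inputs).logits  # shape: (batch, sequence_length, vocab_size)

# Softmax over the scores at the last position gives a probability for each
# candidate next token; text is generated by repeatedly picking from these.
probs = torch.softmax(logits[0, -1], dim=-1)
top_probs, top_ids = torch.topk(probs, k=5)

for p, token_id in zip(top_probs, top_ids):
    print(f"{tokenizer.decode(token_id.item())!r:>12}  p={p.item():.3f}")
```

Running the sketch shows that the model does not “know” an answer; it simply ranks every token in its vocabulary by how likely it is to follow the prompt, which is exactly the behaviour the rest of this post builds on.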
Although language models have been around for quite some time, the hype around them can be traced back to 2017, when Google introduced the transformer architecture in its “Attention Is All You Need” paper (Vaswani et al., 2017). The transformer lets a model weigh how relevant each part of a textual sequence is to every other part, paying “attention” where it matters most, and it has paved the way for today’s leading models, including OpenAI’s GPT series, Google DeepMind’s Gemini, Meta’s Llama and DeepSeek R1.
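For readers curious about what that “attention” looks like in practice, below is a bare-bones sketch of the scaled dot-product attention described in the paper: Attention(Q, K, V) = softmax(QKᵀ / √d_k) V. The NumPy implementation, the single-head setup and the toy dimensions are simplifying assumptions; real transformers add multiple heads, masking, learned projections and heavy optimisation.

```python
# Bare-bones sketch of scaled dot-product attention (Vaswani et al., 2017):
#   Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
# Single head, no masking, no learned projections; a simplification for illustration.
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.shape[-1]
    # How relevant is each key position to each query position?
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax turns each row of scores into attention weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output is a weighted mix of the value vectors: the model "pays attention"
    # to different parts of the sequence to different degrees.
    return weights @ V

# Toy example: a "sentence" of 4 token vectors, each of dimension 8 (self-attention).
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
print(scaled_dot_product_attention(x, x, x).shape)  # (4, 8)
```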
LLMs have revolutionised AI applications, powering chatbots, translation tools, coding assistants, search engines and content generation systems (Liu et al., 2023). Whether you want to summarise a paper, draft a professional email or urgently need an Oscar-worthy acceptance speech for an award you’ll never win, LLMs have you covered. Unlike traditional AI models built for specific tasks, LLMs are versatile general-purpose systems that can be adapted to a wide range of use cases with minimal retraining or fine-tuning (Annepaka & Pakray, 2024).
Great power, however, comes with great responsibility. Critical examination is essential as these models become increasingly influential in our everyday lives and decision-making processes, despite their tendency to generate biased, misleading or outright incorrect responses. Because LLMs do not truly “understand” language the way humans do (they lack genuine cognition and an inherent grasp of meaning, intent or context), it is very difficult to get them to behave exactly as we want (Liu et al., 2023; Shen et al., 2023). This brings us to the Alignment Problem.
Why do LLMs Need Alignment?
Alignment refers to the process of developing or adjusting AI systems (including LLMs) so that their outputs are “aligned”, that is, in accordance with human intentions, particularly with respect to human ethics and societal norms. Without proper alignment, these powerful models can produce misleading, biased or harmful outputs (Liu et al., 2023; Shen et al., 2023).
One of the primary concerns is that LLMs, although great at generating and emulating human-like text, lack true understanding. They operate by predicting word sequences based on patterns in the data they were trained on, without comprehending the underlying meaning of the input or the consequences of their outputs. This limitation can, for example, lead to the generation of plausible-sounding but factually incorrect or nonsensical information, a phenomenon often referred to as “hallucination” (Liu et al., 2023). Depending on the specific use case, the consequences can be as harmless as inventing a non-existent academic source or as serious as giving incorrect medical advice, potentially leading to real-world harm.
Moreover, LLMs can inadvertently amplify and perpetuate biases present in their training data. Because they learn from datasets scraped from the internet, books, historical records and other sources that often reflect societal prejudices, inequalities and ideological imbalances, they run the risk of internalising and reinforcing these biases in their outputs (Liu et al., 2023; Shen et al., 2023). For example, if an LLM is trained on hiring data in which certain demographics were historically underrepresented in leadership roles, it may, when asked to generate job candidate recommendations, favour profiles that align with past biases rather than promoting diversity and fairness. Similarly, if a model is trained on news articles that disproportionately associate certain ethnic groups with crime, it may generate responses reinforcing these associations, even when no direct causality exists. Another pressing concern, closely related to bias and discrimination, is the potential of LLMs to generate harmful or dangerous content, from hate speech and insults to instructions for illegal or harmful activities such as self-harm or weapon-making.
These examples show that it is crucial to work on aligning LLMs with societal values to ensure fairness and objectivity and prevent models from perpetuating stereotypes, misinformation, hate speech, violence or systemic discrimination.
References
- Annepaka, Y. & Pakray, P. (2024). Large language models: A survey of their development, capabilities, and applications. Knowledge and Information Systems. https://doi.org/10.1007/s10115-024-02310-4
- Christian, B. (2020). The Alignment Problem: Machine Learning and Human Values. W. W. Norton & Company.
- Shen, T., Jin, R., Huang, Y., Liu, C., Dong, W., Guo, Z., Wu, X., Liu, Y. & Xiong, D. (2023). Large language model alignment: A survey. arXiv. https://arxiv.org/abs/2309.15025
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
Further Reading/Watching/Listening
- Russell, S. (2020). Human Compatible: Artificial Intelligence and the Problem of Control. Penguin Books.
- Wiener, N. (1960). Some Moral and Technical Consequences of Automation. Science, 131(3410), 1355–1358. https://doi.org/10.1126/science.131.3410.1355
Videos & Podcasts
- “The Alignment Problem” Brian Christian talks at Yale (2022) – Watch on YouTube
- “OpenAI’s Huge Push to Make Superintelligence Safe” 80,000 Hours Podcast with Jan Leike (2023) – Watch on YouTube