How AI Learns to “Read” Like Humans (With a Little Math)

Have you ever wondered how ChatGPT understands your sentences? The answer is hidden in two mathematical tricks that seem more complicated than they are.

Imagine this: You’re texting your friend about weekend plans. You type “Let’s meet at the park tomorrow” and send it. Simple for you and your friend, right? But before an AI can read this message, it has to be translated into something it can actually work with – numbers. Lots of numbers, arranged into lists that we call vectors.

This is the world of embeddings and positional encoding, where both the meaning of a word and its position become vectors – the essential ingredients that let AI make sense of human language.

How can computers interpret human languages?

Why don’t computers speak human languages? The answer is simple: computers need everything converted into their native language – mathematics, or more specifically binary code, a system of 1s and 0s. While humans use words, tone and context to convey meaning, computers follow exact instructions and structured logic.
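To see what that looks like in practice, here is a tiny Python snippet showing how even single characters boil down to numbers, and those numbers to binary:

```python
# Every character is stored as a number, and every number as a pattern of bits.
for ch in "Hi":
    print(ch, ord(ch), format(ord(ch), "08b"))
# H 72 01001000
# i 105 01101001
```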

In order to “translate” human language into numbers, we can’t just assign random values to words (like “apple = 7, banana = 23”), because that would be like giving every citizen in a city a completely random address. Everyone would have a unique location, but you would know nothing about neighbourhoods, distances or relationships among people.

To make AI understand human language, we use embeddings: vectors that capture the semantic meaning of words in a space with hundreds or even thousands of dimensions. Think about it this way: in the real world, Paris and London are closer to each other than Paris and Shanghai. In the embedding space, “happy” and “joyful” are closer than “happy” and “mathematics”. During training, the model learns these relationships by reading millions of texts and noticing which words tend to appear in similar contexts.

What are the results then? Each word gets a unique “address” in this high-dimensional space – a vector of numbers like [0.2, -0.5, 0.8, 0.1, …]. Words with similar meanings cluster together like neighbourhoods in our high-dimensional space, while unrelated words are distant.
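To make “closeness” concrete, here is a minimal Python sketch using NumPy with tiny, hand-picked 4-dimensional vectors – the numbers are invented purely for illustration, since real embeddings are learned during training and have far more dimensions. Cosine similarity measures how strongly two word “addresses” point in the same direction:

```python
import numpy as np

# Toy vectors, made up by hand just to illustrate the idea of "closeness".
embeddings = {
    "happy":       np.array([0.90, 0.80, 0.10, 0.00]),
    "joyful":      np.array([0.85, 0.75, 0.20, 0.05]),
    "mathematics": np.array([0.00, 0.10, 0.90, 0.80]),
}

def cosine_similarity(a, b):
    """Close to 1.0 means 'pointing the same way'; close to 0.0 means unrelated."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(embeddings["happy"], embeddings["joyful"]))       # ~0.99
print(cosine_similarity(embeddings["happy"], embeddings["mathematics"]))  # ~0.12
```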

Another interesting fact: ideally, you will get results like vector(“king”) – vector(“man”) + vector(“woman”) ≈ vector(“queen”)!
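If you would like to try this yourself, here is a small sketch that assumes you have the gensim library installed; the first run downloads a set of pretrained GloVe vectors, and the exact result (and similarity score) depends on the model you load:

```python
import gensim.downloader as api

# Downloads a small pretrained GloVe model the first time it runs.
model = api.load("glove-wiki-gigaword-50")

# vector("king") - vector("man") + vector("woman") ≈ ?
print(model.most_similar(positive=["king", "woman"], negative=["man"], topn=1))
# The top result is typically "queen".
```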

The Order Problem – When “Robot Obeys Human” Becomes “Human Obeys Robot”

Now things are getting a bit tricky. Imagine you have a bag of word cards labelled “Obeys”, “Human” and “Robot”. You know what each word means, but without knowing their order, you can’t tell whether the sentence is a perfectly normal rule or (rather worryingly) the beginning of a sci-fi rebellion.

Modern AI models face the same challenge. They are good at recognising what individual words mean, but they naturally process all words simultaneously, like seeing that bag of word cards all at once. They need extra information to recover the order of the words.

This is where positional encoding comes to the rescue. Think of it as giving each word position a unique timestamp or serial number – but instead of simple numbers like 1, 2, 3, we use specific mathematical patterns, like absolute positional encoding or rotary positional encoding.

Why not just use 1, 2, 3? Because AI models work better with patterns they can generalise from. The mathematical functions used for positional encoding (the basic version involves sine and cosine waves, if you’re curious) create patterns that help the model understand not just “this word is third” but also relationships like “this word is two positions after that one.”

It’s like the difference between knowing someone’s exact street address and understanding the whole neighbourhood layout.
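For the curious, here is a minimal NumPy sketch of the sinusoidal (absolute) positional encoding from the original Transformer paper: each position gets its own pattern of sine and cosine values at different frequencies, which is what lets the model reason about relative offsets as well as absolute positions.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Sine/cosine positional encoding as described in 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # even dimension indices
    angles = positions / np.power(10000, dims / d_model)   # one frequency per dimension pair

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions use cosine
    return pe

pe = sinusoidal_positional_encoding(seq_len=6, d_model=8)
print(pe.shape)        # (6, 8): one 8-dimensional "timestamp" per position
print(pe[2].round(3))  # the pattern for the third position
```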

The Magic Combination

Everything comes together in one deceptively simple formula:

Final Word Representation = Word Embedding + Position Encoding

Each word gets its semantic meaning plus its positional information, creating a single representation that captures both what was said and exactly where it was said. This combination is one of the reasons modern AI models are so powerful: they don’t just “read” words one by one; they infer the structure and meaning of sentences as a whole.
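As a rough sketch of how the two pieces add up, here is a toy example with a made-up three-word vocabulary and randomly initialised embeddings standing in for the values a real model would learn during training:

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    # The same sine/cosine recipe as in the earlier sketch.
    pos = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2], pe[:, 1::2] = np.sin(angles), np.cos(angles)
    return pe

# Hypothetical 3-word vocabulary; a real model learns these values during training.
rng = np.random.default_rng(0)
vocab = {"robot": 0, "obeys": 1, "human": 2}
embedding_table = rng.normal(size=(len(vocab), 8))

sentence = ["robot", "obeys", "human"]
word_embeddings = embedding_table[[vocab[w] for w in sentence]]     # meaning
position_encodings = positional_encoding(len(sentence), 8)          # order

final_representation = word_embeddings + position_encodings         # meaning + position
print(final_representation.shape)   # (3, 8): one combined vector per word
```

Each row of final_representation now encodes both which word it is and where it sits in the sentence.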

Some Final Words

What’s remarkable is that these mathematical representations simulate something fundamentally human. We understand language through both meaning (what words represent) and structure (how they’re arranged). AI has learned to mimic the same process, just with more linear algebra — though it’s manipulating patterns rather than understanding meaning like humans do.

Next time you chat with an AI, you may realise that behind every simple response, a complex network of vectors and mathematics is trying to understand you!

References:

Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Pennington, J., Socher, R., & Manning, C. D. (2014). GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).

Su, J., Lu, Y., Pan, S., Murtadha, A., Wen, B., & Liu, Y. (2021). RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30.

Further reading/watching/listening:

Books & Articles:

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press.

Videos & Podcasts:

Hugging Face. How do transformers work? In LLM course.

Image Attribution

Generated by: DALL·E 3
Date: 25/06/2025
Prompt: “A friendly humanoid robot and a human sitting across from each other at a modern desk, having a conversation. The background has subtle flowing data streams and neural network patterns. Soft lighting, clean modern style, warm and approachable atmosphere. The robot should look friendly and curious, not intimidating.”