How Do LLMs Reason? The Power of Thinking Longer and Test-time Scaling

Author: Yung-Chen Tang

For years, the industry has focused on making models bigger. This training-time scaling (Kaplan et al., 2020) made models highly fluent, much like a student who has memorised the entire textbook. But fluency is not the same as reasoning. Large language models (LLMs) still struggle with complex logic, maths and coding tasks because they respond too quickly, predicting the next word without truly thinking (McCoy et al., 2023).

To address this, researchers are now exploring a new direction called Test-time Scaling or Test-time Compute (Beeching et al., 2024; Snell et al., 2024). Instead of simply enlarging the model, we give it more computation, meaning more time and processing power, at the moment a question is asked. With additional thinking time, the model’s behaviour becomes closer to how humans reason.

So how exactly does an LLM reason? It generally follows three patterns that take advantage of this Test-time Scaling.

Thinking Longer (Chain of Thought)

The most direct way to improve reasoning is to force the model to “show its work”. Instead of jumping straight to the answer, the model generates a series of intermediate reasoning steps. In technical terms, we call this a Chain of Thought (CoT) (Wei et al., 2022; Kojima et al., 2022).

Through a Chain of Thought, the model breaks a difficult problem into manageable pieces, much as a student writes out each line of a maths solution instead of jumping straight to the final answer.

Generating these “thought tokens” before the final answer costs extra test-time compute, but each intermediate step is far easier to get right than the whole problem at once. This is the core mechanism behind reasoning models like OpenAI’s o1 (OpenAI, 2024) and DeepSeek-R1 (Guo et al., 2025).
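
To make this concrete, here is a minimal sketch of zero-shot chain-of-thought prompting in Python. The llm_generate function is a hypothetical stand-in for whatever text-generation API you use, and its canned reply exists only so the sketch runs on its own; the trigger phrase comes from Kojima et al. (2022).

```python
def llm_generate(prompt: str, max_tokens: int = 512) -> str:
    """Hypothetical LLM call; replace with your provider's API."""
    # Canned response so this sketch is self-contained and runnable.
    return ("Each of the 3 boxes holds 4 apples, so 3 * 4 = 12 apples. "
            "Eating 2 leaves 12 - 2 = 10. Final answer: 10")

question = "I have 3 boxes of 4 apples and eat 2 apples. How many are left?"

# Appending a trigger phrase makes the model spend extra "thought tokens"
# on intermediate steps before committing to a final answer.
cot_prompt = f"{question}\nLet's think step by step."

print(llm_generate(cot_prompt))
```

The extra tokens cost more compute at inference time, but that is exactly the trade test-time scaling makes: simpler, checkable steps in exchange for a longer response.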

Generate Many Responses and Select the Best (Best-of-N)

Sometimes, there is no single straight line to the right answer. In these cases, reasoning looks more like exploration. The model uses more time to generate multiple potential solutions and then selects the best one.

Best-of-N (BoN) sampling (Snell et al., 2024) has the model produce multiple candidate answers in parallel, then applies a scoring mechanism or a reward model to pick the most accurate one. This resembles a photographer using burst mode, capturing many shots and selecting the best image afterward.
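
As a sketch, Best-of-N fits in a few lines. Here llm_generate and reward_model are hypothetical placeholders: the first stands in for a sampled (temperature above zero) LLM call, the second for a trained verifier or reward model; both are stubbed so the example runs as-is.

```python
import random

def llm_generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical sampled LLM call; temperature > 0 gives diverse candidates."""
    return random.choice(["Final answer: 10", "Final answer: 14", "Final answer: 10"])

def reward_model(prompt: str, answer: str) -> float:
    """Hypothetical scorer; in practice a trained reward or verifier model."""
    return 1.0 if answer.endswith("10") else 0.0

def best_of_n(prompt: str, n: int = 8) -> str:
    # Draw N candidates (sampled in parallel in practice),
    # then keep the one the reward model scores highest.
    candidates = [llm_generate(prompt) for _ in range(n)]
    return max(candidates, key=lambda c: reward_model(prompt, c))

print(best_of_n("I have 3 boxes of 4 apples and eat 2 apples. How many are left?"))
```

Note that the selection is only as good as the scorer: a weak reward model can confidently pick a wrong answer no matter how large N grows.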

Iterative Refinement (Draft, Critique, Improve)

Reasoning can also emerge from a feedback loop (Qu et al., 2024; Wang et al., 2025). Rather than making a single attempt, the model treats its output as a draft to be verified and polished: it produces the draft, evaluates it, identifies issues and refines the answer, acting as both writer and editor.

The model doesn’t just stop after generating text; it critically evaluates its own output (Self-Reflection) (Qu et al., 2024) or uses external tools (like a code compiler) (Paranjape et al., 2023) to test if its solution works. If it fails, it uses more time to loop back and try again. Through multiple rounds of reflection and correction, the model gradually moves toward a more reliable solution.
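
The sketch below shows one round-based version of this loop under assumed placeholders: llm_generate again stands in for an LLM call, and check for an external verifier such as a compiler or test suite. Both are stubbed with simulated behaviour so the example runs on its own.

```python
def llm_generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with your provider's API."""
    # Simulated behaviour: the revision prompt yields the corrected draft.
    return "3 * 4 - 2 = 10" if "failed" in prompt else "3 * 4 - 2 = 14"

def check(answer: str) -> str | None:
    """Hypothetical verifier: returns an error message, or None if the answer
    passes. For code, this could compile the draft and run unit tests."""
    return None if answer.endswith("10") else "arithmetic error in the final step"

def refine(question: str, max_rounds: int = 3) -> str:
    answer = llm_generate(question)      # initial draft
    for _ in range(max_rounds):          # extra rounds = extra test-time compute
        feedback = check(answer)
        if feedback is None:             # verified, so stop early
            return answer
        # Feed the critique back in and ask for a revision.
        answer = llm_generate(
            f"{question}\nPrevious attempt: {answer}\n"
            f"It failed with: {feedback}\nPlease fix it."
        )
    return answer

print(refine("I have 3 boxes of 4 apples and eat 2 apples. How many are left?"))
```

Because the loop stops as soon as the verifier is satisfied, the model only spends extra compute on the problems that actually need it.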

Why Reasoning Matters

Beyond improving performance in areas such as coding and mathematics, the ability to reason is essential for aligning AI with human values. Thinking longer gives models a chance to pause, examine their outputs and evaluate them against safety standards and human expectations rather than relying on fast and impulsive predictions. By giving AI more time to think, we create space for self-reflection and correction. This increases the chance that advanced intelligence will mature into a reliable and ethical partner for humanity.

References

Beeching, E., Tunstall, L., & Rush, S. (2024). Scaling test-time compute with open models. Hugging Face. https://huggingface.co/spaces/HuggingFaceH4/blogpost-scaling-test-time-compute.

Guo, D., Yang, D., Zhang, H., Song, J., Zhang, R., Xu, R., … & He, Y. (2025). DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948.

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … & Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. Advances in Neural Information Processing Systems, 35, 22199-22213.

McCoy, R. T., Yao, S., Friedman, D., Hardy, M., & Griffiths, T. L. (2023). Embers of autoregression: Understanding large language models through the problem they are trained to solve. arXiv preprint arXiv:2309.13638.

OpenAI. (2024, September 12). Learning to reason with LLMs [Blog post]. https://openai.com/index/learning-to-reason-with-llms/.

Paranjape, B., Lundberg, S., Singh, S., Hajishirzi, H., Zettlemoyer, L., & Ribeiro, M. T. (2023). ART: Automatic multi-step reasoning and tool-use for large language models. arXiv preprint arXiv:2303.09014.

Qu, Y., Zhang, T., Garg, N., & Kumar, A. (2024). Recursive introspection: Teaching language model agents how to self-improve. Advances in Neural Information Processing Systems, 37, 55249-55285.

Snell, C., Lee, J., Xu, K., & Kumar, A. (2024). Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314.

Wang, Z., Zeng, J., Delalleau, O., Egert, D., Evans, E., Shin, H. C., … & Kuchaiev, O. (2025). Dedicated feedback and edit models empower inference-time scaling for open-ended general-domain tasks. arXiv e-prints, arXiv-2503.

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., … & Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.

Image Attribution

Generated by: Gemini Pro 3 with Nano Banana Pro

Date: 23 November 2025

Prompt: “A glowing neural network brain representing AI reasoning, with a few floating abstract shapes symbolising thought steps. Background is a soft gradient with subtle digital patterns, emphasising computation and reflection. Minimalist, modern, elegant style. No text.”
