Appendix A. References and further reading

A.1 Chapter 1

A.1.1 References

The announcement article for OpenAI’s o1 model, which is regarded as the first LLM-based reasoning model:

The DeepSeek-R1 technical report, the first comprehensive report to accompany an open-source reasoning model and the first to show that reasoning emerges from reinforcement learning with verifiable rewards (a topic covered in more detail in chapter 5):

The OpenAI CEO’s comment on the reasoning (“chain-of-thought”) capabilities of future models:

A research paper by AI researchers at Apple finding that reasoning models are, at their core, sophisticated (but very capable) pattern matchers:

An in-depth book and guide on implementing and training large language models step-by-step:

A.1.2 Further Reading

An introduction to how DeepSeek-R1 works, providing insights into the foundations of reasoning in LLMs:


A.2 Chapter 2

A.2.1 References

Official installation guide for uv, the Python package and project manager:

Cloud compute platforms with GPU support:

Qwen3 resources with additional benchmark results and comparisons to other models:

Readers curious about KV cache sizes for different sequence lengths can use this handy web-based calculator:
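
For a rough back-of-the-envelope estimate without the tool, the cache size can also be computed directly. The following is a minimal sketch; the model configuration values are illustrative placeholders, not the specs of any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_value=2):
    # Factor of 2 because both keys and values are cached for every layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, cached in 16-bit precision (2 bytes per value).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 1024**3:.2f} GiB")  # 1.00 GiB for one 8,192-token sequence
```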

A.2.2 Further Reading

A PyTorch tutorial for readers who are new to PyTorch or would like a refresher:

Additional resources on tokenization:

For readers interested in more in-depth PyTorch coverage (optional), I can recommend the following two books:


A.3 Chapter 3

A.3.1 References

The MATH-500 dataset is derived from the MATH dataset (12,500 problems across algebra, geometry, probability, number theory, and more), which was introduced in the following paper:

The MATH-500 split (created from the original MATH dataset) was proposed in the following paper:

A.3.2 Further Reading

Readers who are interested in learning more about SymPy (not required for this book) can start with the official tutorial:

An example of a system (here, a fine-tuned LLM) that also evaluates intermediate reasoning steps:

A large-scale dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset:

An article describing the rising cost of LLM evaluation, finding that evaluating reasoning models such as o1 on seven popular benchmarks costs approximately $1,500:

A comprehensive 2025 survey on LLM benchmarks:

Instead of relying only on deterministic and symbolic verifiers, a recent research project highlights that small reasoning models can themselves be used successfully as verifiers for other reasoning models:


A.4 Chapter 4

A.4.1 References

The following paper formally described chain-of-thought prompting. Note that the paper suggests “Let’s think step by step” as a prompt modification. However, in my experiments, I found that “Explain step by step” performs better when using the Qwen2.5 base model, which is why we use the latter in chapter 4:
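
To make the difference concrete, here is a minimal sketch of the two prompt variants; the question string is a made-up example, and the surrounding inference code is omitted:

```python
question = "A train travels 60 km in 1.5 hours. What is its average speed?"

# Suffix suggested in the paper vs. the variant used in chapter 4:
prompt_paper   = question + "\nLet's think step by step."
prompt_chapter = question + "\nExplain step by step."
```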

A description of self-consistency sampling with additional comparison studies:

A.4.2 Further Reading

An overview and discussion of additional inference scaling methods:


A.5 Chapter 5

A.5.1 References

Google keeps the methods behind their proprietary Gemini 3 model a secret, but based on a recent announcement, we can speculate that it uses inference scaling techniques similar to self-consistency or Best-of-N: “We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Thinking. This mode meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems.”

The DeepSeekMath-V1 paper showed that self-consistency scaling can noticeably improve answer accuracy; by combining self-consistency with their version of self-inference (Best@32 in figure 2), the model achieved gold-level performance in several math competitions:

Instead of using a majority vote in self-consistency, we can use a scoring function to rank the different answers and select the best one. This approach is also known as Best-of-N. However, where applicable, majority voting often gives better results:
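
The following minimal sketch contrasts the two selection rules; the scoring function here is a placeholder for whatever scorer (e.g., a reward model or an answer’s log-probability) is available:

```python
from collections import Counter

def majority_vote(answers):
    # Self-consistency: pick the most frequent answer among the N samples.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, score_fn):
    # Best-of-N: rank the sampled answers with a scoring function and
    # return the highest-scoring one.
    return max(answers, key=score_fn)

samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))                         # "42"
print(best_of_n(samples, score_fn=lambda a: len(a)))  # placeholder scorer
```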

A.5.2 Further Reading

A short article explaining the difference between probability and likelihood:


A.6 Chapter 6

A.6.1 References

The InstructGPT paper demonstrated the effectiveness of RLHF and was instrumental in popularizing it as a standard alignment and fine-tuning approach for LLMs:

The DeepSeekMath paper that introduced the GRPO algorithm:
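
The core idea behind GRPO can be summarized in a few lines: advantages are computed by normalizing each reward against its own group of sampled completions, which removes the need for a learned critic. A minimal sketch (not the paper’s full objective):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each completion's reward by the mean and standard
    # deviation of its group; no value (critic) network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for four completions sampled for the same prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # approx. [0.87, -0.87, 0.87, -0.87]
```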

The DeepSeek-R1 paper showed that strong reasoning behavior can emerge in LLMs through reinforcement learning alone (via GRPO). This was most clearly shown in the R1-Zero variant. However, combining this approach with a multi-stage training pipeline yields an even better reasoning model:

A.6.2 Further Reading

A comprehensive walkthrough of the DeepSeek-R1 training pipeline involving RLVR:

A comparison of GRPO and PPO for reinforcement learning in the context of LLMs:


A.7 Chapter 7

A.7.1 References

The original PPO paper that introduced the clipped policy ratio, which we also use here to stabilize GRPO:
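
As a quick reminder of what the clipped ratio does, here is a minimal PyTorch sketch of the clipped surrogate loss (a simplified version, without the value and entropy terms of the full PPO objective):

```python
import torch

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the old policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Take the pessimistic minimum of the unclipped and clipped terms,
    # which bounds how far a single update can move the policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```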

Additional papers that recommend improvements to the GRPO algorithm:

A.7.2 Further Reading

A comparison between PPO (the original algorithm used for RLHF) and GRPO:

A good technical deep dive that discusses different GRPO improvements:


A.8 Chapter 8

A.8.1 References

The original knowledge-distillation paper that popularized the combination of hard and soft distillation objectives:
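
For reference, a minimal sketch of the combined objective from that line of work: a weighted sum of the hard cross-entropy loss and a temperature-softened KL term (the values of alpha and T below are illustrative defaults, not the paper’s settings):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Hard objective: regular cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # Soft objective: KL divergence between temperature-softened teacher
    # and student distributions; the T**2 factor keeps gradient magnitudes
    # comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * hard + (1 - alpha) * soft
```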

The DeepSeek-R1 paper that described the reasoning-distillation recipe motivating this chapter, in which a large teacher model generates reasoning traces that are then used to train smaller student models:

A paper on distilling large language models that reported strong results for carefully designed soft-distillation objectives:

A.8.2 Further Reading

More details on supervised fine-tuning (the technique underlying hard distillation) and masking when working with batches of training examples:

A practical walkthrough of reasoning-model training pipelines, including distillation and RLHF:


A.9 Appendix F

A.9.1 References

The paper that introduced the popular multiple-choice MMLU dataset:

A detailed description of the Elo rating system:
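
For intuition, here is a minimal sketch of a single Elo update between two models after one pairwise comparison (the K-factor of 32 is a common illustrative choice):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # Expected score of model A against model B under the Elo model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    score_b = 1 - score_a  # score_a: 1 = A wins, 0.5 = draw, 0 = B wins
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * (score_b - expected_b))

# Example: two equally rated models, A wins the pairwise comparison.
print(elo_update(1000, 1000, score_a=1))  # (1016.0, 984.0)
```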

The Chatbot Arena paper describing the original methodology behind the popular LLM leaderboard:

A.9.2 Further Reading

A paper discussing the problems with leaderboards such as LLM Arena:

An article by the author describing gpt-oss in more detail:

A survey of different LLM judge approaches:

An example of a small LLM fine-tuned to act as a judge: