Appendix A. References and further reading
A.1 Chapter 1
A.1.1 References
The announcement article for OpenAI’s o1 model, which is widely regarded as the first LLM-based reasoning model:
- Introducing OpenAI o1-preview, https://openai.com/index/introducing-openai-o1-preview/
DeepSeek-R1 was the first open-source reasoning model accompanied by a comprehensive technical report, and the first to show that reasoning emerges from reinforcement learning with verifiable rewards (a topic covered in more detail in chapter 5):
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948
OpenAI CEO’s comment on the reasoning (“chain-of-thought”) capabilities of future models:
- “[…] We will next ship GPT-4.5, the model we called Orion internally, as our last non-chain-of-thought model. […]”, https://x.com/sama/status/1889755723078443244
A research paper by AI researchers at Apple finding that reasoning models are sophisticated (but very capable) pattern matchers:
- The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity, https://machinelearning.apple.com/research/illusion-of-thinking
An in-depth book and guide on implementing and training large language models step-by-step:
- Build a Large Language Model (From Scratch), http://mng.bz/orYv
A.1.2 Further Reading
An introduction to how DeepSeek-R1 works, providing insights into the foundations of reasoning in LLMs:
- Understanding Reasoning LLMs, https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
A.2 Chapter 2
A.2.1 References
Official installation guide for uv, the Python package and project manager:
- Installing uv, https://docs.astral.sh/uv/getting-started/installation/
Cloud compute platforms with GPU support:
- Lightning AI, https://lightning.ai/
- Google Colab, https://colab.research.google.com/
Qwen3 resources with additional benchmark performance and comparison to other models:
- Qwen blog, https://qwenlm.github.io/blog/qwen3/
- Technical report, https://arxiv.org/abs/2505.09388
Readers curious about KV cache sizes for different sequence lengths can use the following handy web-based calculator (a rough back-of-the-envelope estimate is sketched after the link):
- KV cache size calculator, https://lmcache.ai/kv_cache_calculator.html
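For a quick estimate without the web tool, the KV cache size follows directly from the model configuration. The following minimal sketch uses illustrative placeholder values (check the actual model config for the real numbers):

```python
# Rough KV cache size estimate for a decoder-only transformer.
# All configuration values below are illustrative assumptions.
num_layers = 28        # transformer blocks
num_kv_heads = 8       # key/value heads (grouped-query attention)
head_dim = 128         # dimension per attention head
seq_len = 8192         # cached context length
batch_size = 1
bytes_per_value = 2    # float16/bfloat16

kv_bytes = (
    2  # one cache each for keys and values
    * num_layers * num_kv_heads * head_dim
    * seq_len * batch_size * bytes_per_value
)
print(f"KV cache: {kv_bytes / 1024**3:.2f} GiB")
```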
A.2.2 Further Reading
A PyTorch tutorial for readers who are new to PyTorch or would like a refresher:
- PyTorch in One Hour: From Tensors to Training Neural Networks on Multiple GPUs tutorial, https://sebastianraschka.com/teaching/pytorch-1h
Additional resources on tokenization:
- Build a Large Language Model (From Scratch) chapter 2, https://mng.bz/M96o
- Implementing a Byte Pair Encoding (BPE) Tokenizer From Scratch, https://sebastianraschka.com/blog/2025/bpe-from-scratch.html
For readers interested in more in-depth PyTorch coverage (optional), I can recommend the following two books:
- Deep Learning with PyTorch, https://www.manning.com/books/deep-learning-with-pytorch-second-edition
- Machine Learning with PyTorch and Scikit-Learn, https://www.amazon.com/Machine-Learning-PyTorch-Scikit-Learn-learning/dp/1801819319/
A.3 Chapter 3
A.3.1 References
The MATH-500 dataset originated from the MATH dataset (with 12,500 problems across algebra, geometry, probability, number theory, and more) that was introduced in the following paper:
- Measuring Mathematical Problem Solving With the MATH Dataset, https://arxiv.org/abs/2103.03874
The MATH-500 split (created from the original MATH dataset) was proposed in the following paper:
- Let’s Verify Step by Step, https://arxiv.org/abs/2305.20050
A.3.2 Further Reading
Readers who are interested in learning more about SymPy (not required for this book) can consult the official tutorial:
- SymPy introductory tutorial, https://docs.sympy.org/latest/tutorials/intro-tutorial/index.html
An example of a system (here, a fine-tuned LLM) that also evaluates intermediate reasoning steps:
- Evaluating Mathematical Reasoning Beyond Accuracy, https://arxiv.org/pdf/2404.05692
A large-scale dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset:
- Let’s Verify Step by Step, https://arxiv.org/abs/2305.20050
An article describing the rising cost of LLM evaluation, finding that evaluating reasoning models such as o1 on seven popular benchmarks costs approximately $1,500:
- The rise of AI “reasoning” models is making benchmarking more expensive, https://techcrunch.com/2025/04/10/the-rise-of-ai-reasoning-models-is-making-benchmarking-more-expensive/
A comprehensive 2025 survey on LLM benchmarks:
- A Survey on Large Language Model Benchmarks, https://arxiv.org/abs/2508.15361
Instead of only relying on deterministic and symbolic verifiers, a recent research project highlights that small reasoning models themselves can be used successfully as verifiers for other reasoning models:
- xVerify: Efficient Answer Verifier for Reasoning Model Evaluations, https://arxiv.org/abs/2504.10481
A.4 Chapter 4
A.4.1 References
The following paper formally described chain-of-thought prompting. Note that the paper suggests “Let’s think step by step” as a prompt modification. However, in my experiments, I found that “Explain step by step” performs better when using the Qwen2.5 base model, which is why we use the latter in chapter 4:
- Large Language Models are Zero-Shot Reasoners, https://arxiv.org/abs/2205.11916
A description of self-consistency sampling with additional comparison studies:
- Self-Consistency Improves Chain-of-Thought Reasoning in Language Models, https://arxiv.org/abs/2203.11171
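To illustrate the core idea with a minimal sketch (not the paper's exact procedure): self-consistency samples several chains of thought for the same prompt and keeps the most frequent final answer.

```python
from collections import Counter

# Final answers extracted from several sampled chains of thought
# for the same prompt (the values here are made up for illustration)
sampled_answers = ["42", "42", "17", "42", "28"]

# Majority vote: the most frequently occurring final answer wins
final_answer, num_votes = Counter(sampled_answers).most_common(1)[0]
print(final_answer, num_votes)  # -> 42 3
```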
A.4.2 Further Reading
An overview and discussion of additional inference scaling methods:
- The State of LLM Reasoning and Inference Scaling, https://magazine.sebastianraschka.com/p/state-of-llm-reasoning-and-inference-scaling
A.5 Chapter 5
A.5.1 References
Google keeps the methods behind their proprietary Gemini 3 model a secret, but based on a recent announcement, we can speculate that it uses inference scaling techniques similar to self-consistency or Best-of-N: “We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Thinking. This move meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems.”
- Public announcement by Google DeepMind, which developed Gemini, https://x.com/GoogleDeepMind/status/1996658401233842624?s=20
The DeepSeekMath-V2 paper showed that self-consistency scaling can noticeably improve answer accuracy; by combining self-consistency with their version of self-verification (Best@32 in figure 2), the model achieved gold-level performance in several math competitions:
- DeepSeekMath-V2: Towards Self-Verifiable Mathematical Reasoning, https://arxiv.org/abs/2511.22570v1
Instead of using a majority vote in self-consistency, we can use a scoring function to rank the different candidate answers and select the best one, an approach known as Best-of-N (sketched briefly after the following reference). However, when answers can be compared directly, majority voting often tends to give better results:
- Think Loud, Think Fast: Investigating Efficiency of Verifier-free Inference-time-scaling Methods, https://arxiv.org/abs/2504.14047
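As a minimal sketch of the contrast with majority voting, Best-of-N simply keeps the highest-scoring candidate; the candidates and scores below are made-up placeholders for whatever scoring function (e.g., a reward model or verifier) is used in practice.

```python
# Hypothetical candidate answers and scores from some scoring function
# (e.g., a reward model or verifier); the values are made up for illustration
candidates = ["answer A", "answer B", "answer C"]
scores = [0.31, 0.87, 0.54]

# Best-of-N: select the candidate with the highest score
best_answer = max(zip(candidates, scores), key=lambda pair: pair[1])[0]
print(best_answer)  # -> answer B
```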
A.5.2 Further Reading
A short article explaining the difference between probability and likelihood:
- What is the difference between likelihood and probability?, https://sebastianraschka.com/faq/docs/probability-vs-likelihood.html
A.6 Chapter 6
A.6.1 References
The InstructGPT paper demonstrated the effectiveness of RLHF and was instrumental in popularizing RLHF as a standard alignment and fine-tuning approach for LLMs:
- Training language models to follow instructions with human feedback, https://arxiv.org/abs/2203.02155
The DeepSeekMath paper that introduced the GRPO algorithm:
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models, https://arxiv.org/abs/2402.03300
The DeepSeek-R1 paper showed that strong reasoning behavior can emerge in LLMs through reinforcement learning alone (via RL with GRPO). This was most clearly shown in the R1-Zero variant. However, combining this approach with a multi-stage training pipeline yields an even better reasoning model:
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948
A.6.2 Further Reading
A comprehensive walkthrough of the DeepSeek-R1 training pipeline involving RLVR:
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models, https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
A comparison of GRPO and PPO for reinforcement learning in the context of LLMs:
- The State of Reinforcement Learning for LLM Reasoning: Understanding GRPO and New Insights from Reasoning Model Papers, https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training
A.7 Chapter 7
A.7.1 References
The original PPO paper, which introduced the clipped policy ratio that we also use here to stabilize GRPO (a minimal sketch of the clipping idea follows the reference):
- Proximal Policy Optimization Algorithms, https://arxiv.org/abs/1707.06347
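The sketch below shows the clipped surrogate objective in PyTorch; the per-token log-probabilities and advantages are placeholders, and the full GRPO loss in chapter 7 adds group-relative advantages and other details.

```python
import torch

def clipped_policy_loss(new_logprobs, old_logprobs, advantages, eps=0.2):
    # Probability ratio between the current policy and the old policy
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Unclipped and clipped surrogate objectives
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Take the pessimistic (elementwise) minimum; negate for gradient descent
    return -torch.min(unclipped, clipped).mean()
```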
Additional papers that recommend improvements to the GRPO algorithm:
- DAPO: An Open-Source LLM Reinforcement Learning System at Scale, https://arxiv.org/abs/2503.14476
- Understanding R1-Zero-Like Training: A Critical Perspective (Dr. GRPO), https://arxiv.org/abs/2503.20783
- Your Efficient RL Framework Secretly Brings You Off-Policy RL Training, https://fengyao.notion.site/off-policy-rl
- DeepSeek-V3.5: Pushing the Frontier of Open Large Language Models, https://arxiv.org/abs/2512.02556
- GRPO: Group Reward-Decoupled Optimization Policy Optimization for Multi-reward RL Optimization, https://arxiv.org/abs/2601.05242
- Group Sequence Policy Optimization (GSPO), https://arxiv.org/abs/2507.18071
- MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention, https://arxiv.org/abs/2506.13585
A.7.2 Further Reading
A comparison between PPO (the original algorithm used for RLHF) and GRPO:
- The State of Reinforcement Learning for LLM Reasoning, https://magazine.sebastianraschka.com/p/the-state-of-llm-reasoning-model-training
A good technical deep dive that discusses different GRPO improvements:
- GRPO++: Tricks for Making RL Actually Work, https://cameronrwolfe.substack.com/p/grpo-tricks
A.8 Chapter 8
A.8.1 References
The original knowledge-distillation paper that popularized the combination of hard and soft distillation objectives:
- Distilling the Knowledge in a Neural Network, https://arxiv.org/abs/1503.02531
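A minimal sketch of that combination, with illustrative placeholder hyperparameters (alpha, temperature T): the hard objective is standard cross-entropy against the ground-truth labels, and the soft objective matches the temperature-softened teacher distribution.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Hard objective: cross-entropy against ground-truth labels
    hard_loss = F.cross_entropy(student_logits, targets)
    # Soft objective: KL divergence to the temperature-softened teacher
    soft_loss = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale gradients as in the original paper
    return alpha * hard_loss + (1 - alpha) * soft_loss
```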
The DeepSeek-R1 paper that described the reasoning-distillation recipe, which motivated this chapter, where a large teacher model generates reasoning traces that are then used to train smaller student models:
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning, https://arxiv.org/abs/2501.12948
A paper on distilling large language models that reported strong results for carefully designed soft-distillation objectives:
- MiniLLM: Knowledge Distillation of Large Language Models, https://arxiv.org/abs/2306.08543
A.8.2 Further Reading
More details on supervised fine-tuning (the technique underlying hard distillation) and on masking when working with batches of training examples:
- Build a Large Language Model (From Scratch) chapter 7, https://www.manning.com/books/build-a-large-language-model-from-scratch
A practical walkthrough of reasoning-model training pipelines, including distillation and RLHF:
- Understanding Reasoning LLMs: Methods and Strategies for Building and Refining Reasoning Models, https://magazine.sebastianraschka.com/p/understanding-reasoning-llms
A.9 Appendix F
A.9.1 References
The paper that introduced the popular multiple-choice MMLU dataset:
- Measuring Massive Multitask Language Understanding, https://arxiv.org/abs/2009.03300
A detailed description of the Elo rating system:
- Elo rating system, https://en.wikipedia.org/wiki/Elo_rating_system
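For illustration, here is a minimal sketch of a single Elo update (the K-factor of 32 is a common but arbitrary choice): each rating moves by the difference between the actual and the expected score.

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # Expected score of A from the rating difference (logistic curve, base 10)
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    # score_a: 1 if A wins, 0.5 for a draw, 0 if A loses
    new_a = rating_a + k * (score_a - expected_a)
    new_b = rating_b + k * ((1 - score_a) - (1 - expected_a))
    return new_a, new_b

print(elo_update(1500, 1500, 1))  # -> (1516.0, 1484.0)
```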
The Chatbot Arena paper describing the original methodology behind the popular LLM leaderboard:
- Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference, https://arxiv.org/abs/2403.04132
A.9.2 Further Reading
A paper discussing the problems with leaderboards such as LLM Arena:
- The Leaderboard Illusion, https://arxiv.org/abs/2504.20879
An article by the author describing gpt-oss in more detail:
- From GPT-2 to gpt-oss: Analyzing the Architectural Advances, https://magazine.sebastianraschka.com/p/from-gpt-2-to-gpt-oss-analyzing-the
A survey of different LLM judge approaches:
- A Survey on LLM-as-a-Judge, https://arxiv.org/abs/2411.15594
An example of a small LLM fine-tuned to act as a judge:
- JUDGE: Llama-3 as Scalable Judge, https://arxiv.org/abs/2405.08029