Appendix A. References and further reading

A.1 Chapter 1

A.1.1 References

The announcement article for OpenAI’s o1 model, which is regarded as the first LLM-based reasoning model:

The DeepSeek-R1 technical report, the first comprehensive report to accompany an open-source reasoning model and the first to show that reasoning emerges from reinforcement learning with verifiable rewards (a topic covered in more detail in chapter 5):

The OpenAI CEO’s comment on the reasoning (“chain-of-thought”) capabilities of future models:

A research paper by AI researchers at Apple finding that reasoning models are, at their core, sophisticated (but very capable) pattern matchers:

An in-depth book and guide on implementing and training large language models step-by-step:

A.1.2 Further Reading

An introduction to how DeepSeek-R1 works, providing insights into the foundations of reasoning in LLMs:


A.2 Chapter 2

A.2.1 References

Official installation guide for uv, the Python package and project manager:

Cloud compute platforms with GPU support:

Qwen3 resources with additional benchmark results and comparisons to other models:

Readers curious about KV cache sizes for different sequence lengths can use this handy web-based calculator:
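
For a rough back-of-the-envelope estimate without the tool, the cache size can also be computed directly. The following is a minimal sketch; the model configuration values are illustrative placeholders, not the specs of any particular model:

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim,
                   seq_len, batch_size=1, bytes_per_value=2):
    # Factor of 2 because both keys and values are cached for every layer.
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)

# Hypothetical model: 32 layers, 8 KV heads (grouped-query attention),
# head dimension 128, cached in 16-bit precision (2 bytes per value).
size = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=8192)
print(f"{size / 1024**3:.2f} GiB")  # 1.00 GiB for one 8,192-token sequence
```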

A.2.2 Further Reading

A PyTorch tutorial for readers who are new to PyTorch or would like a refresher:

Additional resources on tokenization:

For readers interested in more in-depth PyTorch coverage (optional), I can recommend the following two books:


A.3 Chapter 3

A.3.1 References

The MATH-500 dataset is derived from the MATH dataset (12,500 problems across algebra, geometry, probability, number theory, and more), which was introduced in the following paper:

The MATH-500 split (created from the original MATH dataset) was proposed in the following paper:

A.3.2 Further Reading

Readers who are interested in learning more about SymPy (not required for this book) can start with the official tutorial:

An example of a system (here, a fine-tuned LLM) that also evaluates intermediate reasoning steps:

A large-scale dataset containing 800,000 step-level correctness labels for model-generated solutions to problems from the MATH dataset:

An article describing the rising cost of LLM evaluation, finding that evaluating reasoning models such as o1 on seven popular benchmarks costs approximately $1,500:

A comprehensive 2025 survey on LLM benchmarks:

Instead of relying only on deterministic and symbolic verifiers, a recent research project highlights that small reasoning models can themselves be used successfully as verifiers for other reasoning models:


A.4 Chapter 4

A.4.1 References

The following paper formally described chain-of-thought prompting. Note that the paper suggests “Let’s think step by step” as a prompt modification. However, in my experiments, I found that “Explain step by step” performs better when using the Qwen2.5 base model, which is why we use the latter in chapter 4:
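
To make the difference concrete, here is a minimal sketch of the two prompt variants; the question string is a made-up example, and the surrounding inference code is omitted:

```python
question = "A train travels 60 km in 1.5 hours. What is its average speed?"

# Suffix suggested in the paper vs. the variant used in chapter 4:
prompt_paper   = question + "\nLet's think step by step."
prompt_chapter = question + "\nExplain step by step."
```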

A description of self-consistency sampling with additional comparison studies:

A.4.2 Further Reading

An overview and discussion of additional inference scaling methods:


A.5 Chapter 5

A.5.1 References

Google keeps the methods behind their proprietary Gemini 3 model a secret, but based on a recent announcement, we can speculate that it uses inference scaling techniques similar to self-consistency or Best-of-N: “We’re pushing the boundaries of intelligence even further with Gemini 3 Deep Thinking. This mode meaningfully improves reasoning capabilities by exploring many hypotheses simultaneously to solve problems.”

The DeepSeekMath-V1 paper showed that self-consistency scaling can noticeably improve answer accuracy; by combining self-consistency with their version of self-inference (Best@32 in figure 2), the model achieved gold-level performance in several math competitions:

Instead of using a majority vote in self-consistency, we can use a scoring function to rank the different answers and select the best one. This approach is also known as Best-of-N. However, where applicable, majority voting often gives better results:
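
The following minimal sketch contrasts the two selection rules; the scoring function here is a placeholder for whatever scorer (e.g., a reward model or an answer’s log-probability) is available:

```python
from collections import Counter

def majority_vote(answers):
    # Self-consistency: pick the most frequent answer among the N samples.
    return Counter(answers).most_common(1)[0][0]

def best_of_n(answers, score_fn):
    # Best-of-N: rank the sampled answers with a scoring function and
    # return the highest-scoring one.
    return max(answers, key=score_fn)

samples = ["42", "42", "41", "42", "40"]
print(majority_vote(samples))                         # "42"
print(best_of_n(samples, score_fn=lambda a: len(a)))  # placeholder scorer
```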

A.5.2 Further Reading

A short article explaining the difference between probability and likelihood:


A.6 Chapter 6

A.6.1 References

The InstructGPT paper demonstrated the effectiveness of RLHF and was instrumental in popularizing it as a standard alignment and fine-tuning approach for LLMs:

The DeepSeekMath paper that introduced the GRPO algorithm:
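
The core idea behind GRPO can be summarized in a few lines: advantages are computed by normalizing each reward against its own group of sampled completions, which removes the need for a learned critic. A minimal sketch (not the paper’s full objective):

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    # Normalize each completion's reward by the mean and standard
    # deviation of its group; no value (critic) network is needed.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Rewards for four completions sampled for the same prompt.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
print(group_relative_advantages(rewards))  # approx. [0.87, -0.87, 0.87, -0.87]
```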

The DeepSeek-R1 paper showed that strong reasoning behavior can emerge in LLMs through reinforcement learning alone (via GRPO). This was most clearly shown in the R1-Zero variant. However, combining this approach with a multi-stage training pipeline yields an even better reasoning model:

A.6.2 Further Reading

A comprehensive walkthrough of the DeepSeek-R1 training pipeline involving RLVR:

A comparison of GRPO and PPO for reinforcement learning in the context of LLMs:


A.7 Chapter 7

A.7.1 References

The original PPO paper that introduced the clipped policy ratio, which we also use here to stabilize GRPO:
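
As a quick reminder of what the clipped ratio does, here is a minimal PyTorch sketch of the clipped surrogate loss (a simplified version, without the value and entropy terms of the full PPO objective):

```python
import torch

def clipped_surrogate_loss(new_logprobs, old_logprobs, advantages, clip_eps=0.2):
    # Probability ratio between the current policy and the old policy.
    ratio = torch.exp(new_logprobs - old_logprobs)
    # Take the pessimistic minimum of the unclipped and clipped terms,
    # which bounds how far a single update can move the policy.
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```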

Additional papers that recommend improvements to the GRPO algorithm:

A.7.2 Further Reading

A comparison between PPO (the original algorithm used for RLHF) and GRPO:

A good technical deep dive that discusses different GRPO improvements:


A.8 Chapter 8

A.8.1 References

The original knowledge-distillation paper that popularized the combination of hard and soft distillation objectives:
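
For reference, a minimal sketch of the combined objective from that line of work: a weighted sum of the hard cross-entropy loss and a temperature-softened KL term (the values of alpha and T below are illustrative defaults, not the paper’s settings):

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, targets, T=2.0, alpha=0.5):
    # Hard objective: regular cross-entropy against the ground-truth labels.
    hard = F.cross_entropy(student_logits, targets)
    # Soft objective: KL divergence between temperature-softened teacher
    # and student distributions; the T**2 factor keeps gradient magnitudes
    # comparable across temperatures.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T ** 2)
    return alpha * hard + (1 - alpha) * soft
```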

The DeepSeek-R1 paper that described the reasoning-distillation recipe motivating this chapter, in which a large teacher model generates reasoning traces that are then used to train smaller student models:

A paper on distilling large language models that reported strong results for carefully designed soft-distillation objectives:

A.8.2 Further Reading

More details on supervised fine-tuning (the technique underlying hard distillation) and masking when working with batches of training examples:

A practical walkthrough of reasoning-model training pipelines, including distillation and RLHF:


A.9 Appendix F

A.9.1 References

The paper that introduced the popular multiple-choice MMLU dataset:

A detailed description of the Elo rating system:
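
For intuition, here is a minimal sketch of a single Elo update between two models after one pairwise comparison (the K-factor of 32 is a common illustrative choice):

```python
def elo_update(rating_a, rating_b, score_a, k=32):
    # Expected score of model A against model B under the Elo model.
    expected_a = 1 / (1 + 10 ** ((rating_b - rating_a) / 400))
    expected_b = 1 - expected_a
    score_b = 1 - score_a  # score_a: 1 = A wins, 0.5 = draw, 0 = B wins
    return (rating_a + k * (score_a - expected_a),
            rating_b + k * (score_b - expected_b))

# Example: two equally rated models, A wins the pairwise comparison.
print(elo_update(1000, 1000, score_a=1))  # (1016.0, 984.0)
```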

The Chatbot Arena paper describing the original methodology behind the popular LLM leaderboard:

A.9.2 Further Reading

A paper discussing the problems with leaderboards such as LLM Arena:

An article by the author describing gpt-oss in more detail:

A survey of different LLM judge approaches:

An example of a small LLM fine-tuned to act as a judge: