Appendix B. Exercise solutions

The complete code examples for the exercise solutions can be found in the supplementary GitHub repository at https://github.com/rasbt/reasoning-from-scratch.

B.1 Chapter 2

Exercise 2.1

We can use a prompt similar to “Hello, Ardwarklethyrx. Haus und Garten.”, which contains a made-up word ("Ardwarklethyrx") and three words in a non-English language (German):

prompt = "Hello, Ardwarklethyrx. Haus und Garten."
input_token_ids_list = tokenizer.encode(prompt)
for i in input_token_ids_list:
    print(f"{[i]} --> {tokenizer.decode([i])}")

The output is:

[9707] --> Hello
[11] --> ,
[1644] -->  Ar
[29406] --> dw
[838] --> ark
[273] --> le
[339] --> th
[10920] --> yr
[87] --> x
[13] --> .
[47375] -->  Haus
[2030] -->  und
[93912] -->  Garten
[13] --> .

As we can see, unknown words are broken down into smaller subword pieces or even individual characters; this allows the tokenizer and LLM to handle any input.
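
As a quick sanity check (a small sketch reusing the tokenizer object from above), a byte-level BPE tokenizer like this one should round-trip arbitrary input losslessly:

# Hypothetical check: decoding the encoded prompt should
# reconstruct the input exactly
reconstructed = tokenizer.decode(tokenizer.encode(prompt))
print(reconstructed == prompt)  # expected: True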

German words are not broken down into characters or even subwords here, suggesting that the tokenizer has seen German texts during training. This also suggests that the LLM was likely trained on German texts, too, and should be able to handle at least certain non-English languages well.

Exercise 2.2

We can simply delete the line device = torch.device("cpu") in section 2.5, and then rerun the rest of the code in chapter 2 as is. Reference numbers for the hardware I tried the code on are provided in table 2.1 at the end of chapter 2.
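
For reference, a typical device auto-detection pattern looks like the following (a minimal sketch; chapter 2 may use its own helper for this):

import torch

# Prefer CUDA, then Apple Silicon (MPS), then fall back to the CPU
if torch.cuda.is_available():
    device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
print(device)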

B.2 Chapter 3

Exercise 3.1

There is an endless number of test cases we could add; below is a selection of some interesting ones:

from reasoning_from_scratch.ch03 import (
    run_demos_table
)
 
more_tests = [
    ("check_17", "[1, 2]", "(1, 2)", True),  # A: Different bracket types
    ("check_18", "1e-3", "0.001", True),     # B: Scientific notation
    ("check_19", "(-3)^2", "9", True),      # C: Algebraic simplification with caret exponent
    ("check_20", "−1", "-1", True),          # D: Unicode minus (U+2212) vs ASCII hyphen-minus
]
run_demos_table(more_tests)

The output is:

Test     | Expect | Got   | Status
check_17 | True   | True  | PASS
check_18 | True   | True  | PASS
check_19 | True   | True  | PASS
check_20 | True   | False | FAIL

As we can see, the test fails for check_20, which uses the Unicode minus sign (U+2212); depending on which font or editor you use, it is visually indistinguishable from the ASCII hyphen-minus. We can fix this test case by adding one of the following lines anywhere in the normalize_text function:

text = text.replace("−", "-")

or

text = text.replace("\u2212", "-")

Another interesting test is the following one:

extra_tests_1 = [
    ("check_21", "Text around answer 3.", "3", True)
]

We can run it via the following code:

run_demos_table(extra_tests_1)

However, it fails the test:

Test     | Expect | Got   | Status
check_21 | True   | False | FAIL  
 
Passed 0/1

Note that this might look like an issue with the grading logic at first, but it is actually a poorly designed test. In practice, the run_demos_table function is intended specifically to test the grade_answer function; nothing more, nothing less.

The grade_answer function would never receive the entire answer in this text form, since the answer would have been extracted from the text before being passed to it. For instance, if we want to test text answers, we need to call the test as follows:

from reasoning_from_scratch.ch03 import (
    extract_final_candidate
)
extra_tests_2 = [
    ("check_21",
     extract_final_candidate("Text around answer 3."),
     "3", True)
]

We can rerun the check as follows:

run_demos_table(extra_tests_2)

As we can see based on the output, it now passes the test:

Test     | Expect | Got  | Status
check_21 | True   | True | PASS
 
Passed 1/1

Exercise 3.2

There are two options to calculate the average response length. The first option is to modify the evaluate_math500_stream function (listing 3.13 in chapter 3) by adding the following lines:

# ...
# below `num_correct = 0`
total_len = 0
 
# ...
# inside for i, row in enumerate(math_data, start=1):
# anywhere below `gen_text = ...`
total_len += len(tokenizer.encode(gen_text))
 
# ...
# anywhere at the bottom before the return statement
avg_len = total_len / num_examples
print(f"Average length: {avg_len:.2f} tokens")

Alternatively, the second option is to calculate the response lengths from the .jsonl files that were created when we ran the evaluate_math500_stream function in the main chapter. This way, we avoid having to rerun the evaluation.

First, we load the .jsonl file as follows:

import json
from pathlib import Path
 
WHICH_MODEL = "base"
dev_name = "mps"
 
# A: You may need to adjust this path
local_path = Path(f"math500-{dev_name}.jsonl")
if not local_path.exists():
    raise FileNotFoundError(
        f"{local_path} not found. Run ch03_main.ipynb to create it."
    )
 
results = []
with open(local_path, "r") as f:
    for line in f:
        if line.strip():
            results.append(json.loads(line))
 
print("Number of entries:", len(results))

Let’s print the dictionary keys to get a better idea of how the results dataset is structured:

print(results[0].keys())

This prints:

dict_keys(['index', 'problem', 'gtruth_answer', 'generated_text', 'extracted', 'correct'])

Each item contains multiple keys; however, we are only interested in the "generated_text" key, which contains the model’s full answer. Next, we need to load the tokenizer so that we can tokenize the answer text before we can calculate the number of tokens. This is similar to the code we used in listing 3.1 in chapter 3:

from reasoning_from_scratch.qwen3 import (
    download_qwen3_small,
    Qwen3Tokenizer
)
 
if WHICH_MODEL == "base":
    download_qwen3_small(
        kind="base", tokenizer_only=True, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-base.json"
    tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
 
elif WHICH_MODEL == "reasoning":
    download_qwen3_small(
        kind="reasoning", tokenizer_only=True, out_dir="qwen3"
    )
    tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
    tokenizer = Qwen3Tokenizer(
        tokenizer_file_path=tokenizer_path,
        apply_chat_template=True,
        add_generation_prompt=True,
        add_thinking=True,
    )

Then, we can calculate the average length as follows, which is similar to how we could have modified the evaluate_math500_stream function:

total_len = 0
 
for item in results:
    num_tokens = len(tokenizer.encode(item["generated_text"]))
    total_len += num_tokens
 
avg_len = total_len / len(results)
print(f"Average length: {avg_len:.2f} tokens")

The resulting average length is as follows:

Average length: 98.00 tokens

Table B.1 lists the average lengths for the different models and subsets.

Table B.1 Average number of tokens on MATH-500

Model     | Device | Average length | MATH-500 size
Base      | CPU    | 97.30          | 10
Base      | CUDA   | 96.74          | 500
Reasoning | CPU    | 891.80         | 10
Reasoning | CUDA   | 1361.21        | 500

As the results in table B.1 show, and as expected, the reasoning model generates much longer responses (in this case, approximately 10 times longer).

Exercise 3.3

To evaluate the model on a larger dataset, we can simply change math_data[:10] to a larger slice (up to all 500 examples) in the following function call:

num_correct, num_examples, acc = evaluate_math500_stream(
    model, tokenizer, device, 
    math_data=math_data[:10],
    max_new_tokens=2048,
    verbose=False
)

Table B.2 below shows the accuracy values for different dataset sizes. (Since the MATH-500 test set is already shuffled, no additional shuffling was applied.)

Table B.2 Accuracies for different MATH-500 dataset sizes

Model     | Device | Accuracy | MATH-500 size
Base      | CUDA   | 30.0%    | 10
Base      | CUDA   | 34.0%    | 50
Base      | CUDA   | 27.0%    | 100
Base      | CUDA   | 15.3%    | 500
Reasoning | CUDA   | 90.0%    | 10
Reasoning | CUDA   | 58.0%    | 50
Reasoning | CUDA   | 56.0%    | 100
Reasoning | CUDA   | 48.2%    | 500

As we can see based on the results in table B.2, the first 10 examples are not very representative of the MATH-500 performance evaluated on the whole 500 examples.

In addition, we can create an entirely new dataset in the style of MATH-500. One such dataset, math_new50_exercise.json, is included in this book’s GitHub repository at https://github.com/rasbt/reasoning-from-scratch/tree/main/ch03/01_main-chapter-code; we can use it in the main chapter by changing the filename from math500_test.json to math_new50_exercise.json.
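
Assuming the exercise file uses the same JSON structure as math500_test.json, loading it could look as follows (a sketch; adjust the path to wherever you placed the file):

import json
from pathlib import Path

local_path = Path("math_new50_exercise.json")
with open(local_path, "r", encoding="utf-8") as f:
    math_data = json.load(f)
print("Number of examples:", len(math_data))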

The performance of the models is as follows:

  • Base: 36.0% (18/50)
  • Reasoning: 80.0% (40/50)

Accuracy is similar for the base model and higher for the reasoning model compared to the 50-example subset of the MATH-500 test set (table B.2). This indicates that, despite the possibility of overlap with Qwen3’s training data, the model generalizes well to new math questions and does not show signs of extensive overfitting to the original MATH-500 data.

Exercise 3.4

We can use an alternative prompt similar to the one suggested in the chapter, which replaces the word “question” with “problem”:

def render_prompt(prompt):
    template = (
        "You are a helpful math assistant.\n"
        "Solve the problem and write the final "
        "result on a new line as:\n"
        "\\boxed{ANSWER}\n\n"
        f"Problem:\n{prompt}\n\nAnswer:"
    )
    return template

Using this prompt improves the performance of the base model on the 500 examples from 15.3% to 31.2%, and it improves the performance of the reasoning model from 48.2% to 50.0%.

From these observations, we may conclude that the base model is much more sensitive to the prompt format (likely due to memorizing some prompt-formatted MATH-500 examples from the training set) than the reasoning model; the latter seems largely unaffected.

B.3 Chapter 4

Exercise 4.1

The modification only requires appending a prompt suffix such as "\n\nExplain step by step." after applying the prompt template. Only a very small portion of the MATH-500 evaluation function from chapter 3 needs to be updated, as shown below:

def evaluate_math500_stream(...):
    # ...
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"])
            prompt += "\n\nExplain step by step."  # NEW
            gen_text = generate_text_stream_concat(
                model, tokenizer, prompt, device,
                max_new_tokens=max_new_tokens,
                verbose=verbose,
            )
     # ...

The improvements are shown in row 3 in table 4.1, which can be found in section 4.6 in chapter 4.

Exercise 4.2

Here, we replace the generate_text_stream_concat function with generate_text_stream_concat_flex and pass in generate_text_top_p_stream_cache as its generation function. The updated MATH-500 evaluation function from chapter 3 is shown below, and the changes are marked with comments labeled # NEW.

def evaluate_math500_stream(
    model,
    tokenizer,
    device,
    math_data,
    out_path=None,
    max_new_tokens=512,
    verbose=False,
    temperature=1.0,  # NEW
    top_p=1.0,        # NEW
):
    # ...
    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"])
            gen_text = generate_text_stream_concat_flex( # NEW
                model, tokenizer, prompt, device,
                max_new_tokens=max_new_tokens,
                verbose=verbose,
                generate_func=generate_text_top_p_stream_cache,  # NEW
                temperature=temperature,                         # NEW
                top_p=top_p                                      # NEW
            )
    # ...

The difference between this modified function and the baseline from chapter 3 can be seen in rows 1 and 4 in table 4.1, which can be found in section 4.6 in chapter 4.

Exercise 4.3

Starting from the evaluate_math500_stream function in chapter 3, the first modification is to replace the line gen_text = generate_text_stream_concat(...) with a call to results = self_consistency_vote(...) from chapter 4. The second modification adds a simple tie-breaking rule that selects the first occurrence of the most frequent answer. For instance, if the sampled results are 1, 3, 5, 3, 5, the function would return 3 because it is the earliest member of the most frequent group.

Since the most frequent answers are stored in results["majority_winners"], one straightforward way to break ties is to take the first element of this list, that is, results["majority_winners"][0].
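
To see this tie-breaking rule in isolation, consider the following standalone sketch (separate from the chapter’s code), which reproduces the 1, 3, 5, 3, 5 example:

from collections import Counter

samples = ["1", "3", "5", "3", "5"]
counts = Counter(samples)  # insertion order is preserved (Python 3.7+)
top_freq = counts.most_common(1)[0][1]
majority_winners = [s for s, f in counts.items() if f == top_freq]
print(majority_winners)     # ['3', '5']
print(majority_winners[0])  # '3', the earliest of the tied answers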

Those changes are illustrated in the code excerpts below:

def evaluate_math500_stream(
    model,
    tokenizer,
    device,
    math_data,
    out_path=None,
    max_new_tokens=2048,
    verbose=False,
    prompt_suffix="",    # NEW
    temperature=1.0,     # NEW
    top_p=1.0,           # NEW
    seed=None,           # NEW
    num_samples=10,      # NEW
):
 
    if out_path is None:
        dev_name = str(device).replace(":", "-")
        out_path = Path(f"math500-{dev_name}.jsonl")
 
    num_examples = len(math_data)
    num_correct = 0
    start_time = time.time()
 
    with open(out_path, "w", encoding="utf-8") as f:
        for i, row in enumerate(math_data, start=1):
            prompt = render_prompt(row["problem"])
 
            ##############################################################
            # NEW
            prompt += prompt_suffix
            results = self_consistency_vote(
                model=model,
                tokenizer=tokenizer,
                prompt=prompt,
                device=device,
                num_samples=num_samples,
                temperature=temperature,
                top_p=top_p,
                max_new_tokens=max_new_tokens,
                show_progress=False,
                show_long_answer=False,
                seed=seed,
            )
 
            # resolve ties
            if results["final_answer"] is None:
                extracted = results["majority_winners"][0]
            else:
                extracted = results["final_answer"]
 
            # Optionally, get the long answer belonging to the
            # extracted short answer
            long_answer = None  # ensures gen_text is defined in all cases
            if extracted is not None:
                for idx, s in enumerate(results["short_answers"]):
                    if s == extracted:
                        long_answer = results["full_answers"][idx]
                        break
            gen_text = long_answer
            ##############################################################
 
            is_correct = grade_answer(
                extracted, row["answer"]
            )
            num_correct += int(is_correct)
    # ...

The performance improvements when using self-consistency sampling are summarized and discussed in table 4.1 in chapter 4 (rows 5-7 and rows 9-12), which can be found in section 4.6 of chapter 4.

Exercise 4.4

The early stopping check can be implemented by adding a few lines of code that check whether the current answer already holds a strict majority, that is, whether its count is greater than num_samples / 2:

if early_stop and counts[short] > num_samples / 2:
    majority_winners = [short]
    final_answer = short
    break

The excerpt of the modified self_consistency_vote function below illustrates more specifically where to insert this code:

def self_consistency_vote(
    # ...
    early_stop=True,   # NEW
):
    # ...
        if show_progress:
            print(f"[Sample {i+1}/{num_samples}] → {short!r}")
 
        #########################################################
        # NEW
        # Early stop if one answer already holds a strict (>50%) majority
        if early_stop and counts[short] > num_samples / 2:
            majority_winners = [short]
            final_answer = short
            break
        #########################################################
 
    if final_answer is None:
        mc = counts.most_common()
        if mc:
            top_freq = mc[0][1]
            majority_winners = [s for s, f in mc if f == top_freq]
            final_answer = mc[0][0] if len(majority_winners) == 1 else None
 
    return {
        "full_answers": full_answers,
        "short_answers": short_answers,
        "counts": dict(counts),
        "groups": groups,
        "majority_winners": majority_winners,
        "final_answer": final_answer,
    }

B.4 Chapter 5

Exercise 5.1

There are many ways to implement this. Perhaps the easiest approach is to handle it outside the self-consistency function and work directly with the returned dictionary, similar to what we did in exercise 4.3 when we implemented the tie-breaking logic directly inside the evaluate_math500_stream function. The relevant lines are shown below:

# ...
from reasoning_from_scratch.ch05 import heuristic_score
 
def evaluate_math500_stream(
    #...
    # ...
            results = self_consistency_vote(...)
            # Majority vote winner available
            if results["final_answer"] is not None:
                extracted = results["final_answer"]
 
            ### NEW: Break tie with heuristic_score
            else:
                best = None
                best_score = float("-inf")
                for cand in results["majority_winners"]:
                    scores = [
                        heuristic_score(results["full_answers"][idx],   
                                        prompt=prompt)
                        for idx in results["groups"][cand]
                    ]
                    score = max(scores)
                    if score > best_score:
                        best_score = score
                        best = cand
                extracted = best
            # ...
    # ...
    return num_correct, num_examples, acc

The results are shown in table B.3.

Table B.3 MATH-500 self-consistency score with different tie-breaking

# | Method                                   | Model | Accuracy | Time
1 | Baseline with chain-of-thought prompting | Base  | 33.4%    | 129.2 min
2 | Self-consistency (n=3)                   | Base  | 43.2%    | 328.2 min
3 | Self-consistency (n=3) + heuristic       | Base  | 43.4%    | 326.5 min
4 | Self-consistency (n=3) + avg. logprob    | Base  | 44.8%    | 327.7 min

The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a “base” H100 GPU (80GB).

Row 1 in table B.3 is the baseline from chapter 4 without self-consistency. Row 2 does not use a scorer for tie-breaking: if there is a tie among the answers, it falls back to the answer that appeared first. Using a heuristic scorer as tie-breaker (row 3) yields a slight improvement, and the best (though still minimal) improvement is achieved with the average logprob scores as tie-breaker (row 4).

Exercise 5.2

Best-of-N is similar to self-consistency in that we generate multiple answers. However, instead of selecting the final answer via a majority vote, we score all generated answers using a scoring function, such as heuristic_score, and return the highest-scoring one. There are several ways to implement this behavior, but the simplest approach is to use the existing self-consistency function from chapter 4 as a template and swap in heuristic_score, as shown below:

# ...
from reasoning_from_scratch.ch05 import (
    heuristic_score
)
 
def self_consistency_vote(
    # ...
):
    full_answers, short_answers = [], []
    counts = Counter()
    groups = {}
    majority_winners, final_answer = [], None
    best_score, best_idx = float("-inf"), None
 
    for i in range(num_samples):
        if seed is not None:
            torch.manual_seed(seed + i + 1)
 
        answer = generate_text_stream_concat_flex(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            device=device,
            max_new_tokens=max_new_tokens,
            verbose=show_long_answer,
            generate_func=generate_text_top_p_stream_cache,
            temperature=temperature,
            top_p=top_p,
        )
 
        short = extract_final_candidate(answer, fallback="number_then_full")
        full_answers.append(answer)
        short_answers.append(short)
        counts[short] += 1
 
        if short in groups:
            groups[short].append(i)
        else:
            groups[short] = [i]
 
        score = heuristic_score(answer, prompt=prompt)
 
        if score > best_score:
            best_score, best_idx = score, i
        # ...

The results are shown in table B.4.

Table B.4 MATH-500 Best-of-N scores with heuristic and average logprob scores

# | Method                                   | Model | Accuracy | Time
1 | Baseline with chain-of-thought prompting | Base  | 33.4%    | 129.2 min
2 | Best-of-N (n=3) + heuristic              | Base  | 40.6%    | 327.7 min
3 | Best-of-N (n=3) + avg. logprob           | Base  | 43.2%    | 330.2 min

The accuracy values and runtimes shown in the table were computed on all 500 samples in the MATH-500 test set using a “base” H100 GPU (80GB).

Exercise 5.3

The task is similar to exercise 5.1, except that we swap heuristic_score with avg_logprob_answer, as shown below:

# ...
# from reasoning_from_scratch.ch05 import heuristic_score
from reasoning_from_scratch.ch05 import avg_logprob_answer
 
def evaluate_math500_stream(
    # ...
):
    # ...
 
    # score = heuristic_score(
    #    candidate_full, prompt=prompt
    # )
    score = avg_logprob_answer(
        model=model,
        tokenizer=tokenizer,
        prompt=prompt,
        answer=candidate_full,
        device=device,
    )
   # ...

The results were already included in the previous table B.3 (exercise 5.1) in row 4.

Exercise 5.4

To implement Best-of-N with a logprob scorer, we can use the code from exercise 5.2 and swap the heuristic_score with avg_logprob_answer, as shown below:

from reasoning_from_scratch.ch05 import (
    avg_logprob_answer
)
# ...
        score = avg_logprob_answer(
            model=model,
            tokenizer=tokenizer,
            prompt=prompt,
            answer=answer,
            device=device,
        )
        if score > best_score:
            best_score, best_idx = score, i
# ...

The resulting MATH-500 score is shown in table B.4 above (exercise 5.2).

Exercise 5.5

Using the heuristic_score is actually even simpler than using the logprob score; all we need to do is change the following code:

from functools import partial
 
avg_logprob_score = partial(
    avg_logprob_answer,
    model=model,
    tokenizer=tokenizer,
    device=device
)
torch.manual_seed(0)
 
results_logprob = self_refinement_loop(
    model=model,
    tokenizer=tokenizer,
    raw_prompt=raw_prompt,
    device=device,
    iterations=2,
    max_response_tokens=2048,
    max_critique_tokens=256,
    score_fn=avg_logprob_score,
    verbose=True,
    temperature=0.7,
    top_p=0.9,
)

The updated code is:

torch.manual_seed(0)
results_logprob = self_refinement_loop(
    model=model,
    tokenizer=tokenizer,
    raw_prompt=raw_prompt,
    device=device,
    iterations=2,
    max_response_tokens=2048,
    max_critique_tokens=256,
    score_fn=heuristic_score,  # NEW
    verbose=True,
    temperature=0.7,
    top_p=0.9,
)

The improvements over the baseline in chapter 3 and self-consistency from chapter 4 are shown in table 5.1 (rows 4, 5, and 10) in the main chapter.

B.5 Chapter 6

Exercise 6.1

We can assign a partial reward (score 0.5) if no \boxed{} answer is found as follows, using the fallback="number_then_full" option we coded in chapter 3:

from reasoning_from_scratch.ch03 import (
    extract_final_candidate, grade_answer
)
 
def reward_rlvr(answer_text, ground_truth):
    
    # 1) Try to extract a boxed answer
    boxed = extract_final_candidate(
        answer_text, fallback=None
    )
    if boxed:
        correct = grade_answer(boxed, ground_truth)
        return 1.0 if correct else 0.0
 
    # 2) If no boxed answer is found, look for number
    unboxed = extract_final_candidate(
        answer_text, fallback="number_then_full"
    )
    if unboxed:
        correct = grade_answer(unboxed, ground_truth)
        return 0.5 if correct else 0.0
 
    return 0.0
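
For illustration, the expected behavior on a few hypothetical answer strings is shown below (the exact extraction results depend on the chapter’s extract_final_candidate implementation):

print(reward_rlvr("The result is \\boxed{42}.", "42"))  # expected: 1.0
print(reward_rlvr("The result is 42.", "42"))           # expected: 0.5
print(reward_rlvr("I am not sure.", "42"))              # expected: 0.0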

When plugged into the chapter 6 code and trained under the same settings, the partial-reward variant achieves lower accuracy (37.8%) than the standard GRPO setup (47.4%), despite using a similar number of tokens on average, as shown in table B.5.

Table B.5 MATH-500 accuracies for strict and partial rewards

# | Method                              | Step | Max tokens | Num rollouts | Accuracy | Average tokens
1 | GRPO (chapter 6)                    | 50   | 512        | 8            | 47.4%    | 586.11
2 | GRPO partial rewards (exercise 6.1) | 50   | 512        | 8            | 37.8%    | 550.33

Exercise 6.2

If the rewards are all equal (for instance, they are all 0 or all 1), the advantages will all be 0, because subtracting the mean removes the shared reward value and leaves only zeros, which we can demonstrate below:

import torch
 
rollout_rewards = [0., 0., 0., 0.]
rewards = torch.tensor(rollout_rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)

This returns tensor([0., 0., 0., 0.]).

Similarly, if we change the rollout rewards to rollout_rewards = [1., 1., 1., 1.], we get the same all-zero tensor, tensor([0., 0., 0., 0.]).
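
For contrast, a group with mixed rewards yields nonzero advantages, so a gradient signal is present:

rollout_rewards = [1., 0., 0., 0.]
rewards = torch.tensor(rollout_rewards)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-4)
print(advantages)
# Prints approximately tensor([ 1.4997, -0.4999, -0.4999, -0.4999])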

In short, if all rewards in a group are identical, for example, all 0 or all 1, then $r_i - \mu = 0$ for every rollout $i$, where $\mu$ is the group’s mean reward. As a result, the policy gradient is zero, and the model parameters are not updated for that prompt.

This behavior is intentional. If all rollouts are equally bad or equally good, there is no relative signal to tell the model which behavior to reinforce or suppress. Intuitively, if the model answers all the questions correctly, there is no need to update it. Conversely, if the model answers all questions incorrectly, we don’t want to update the model in a way that reinforces this behavior.

B.6 Chapter 7

Exercise 7.1

The following code checks that the format reward is zero if the think tokens are used incorrectly:

from pathlib import Path
import torch
from reasoning_from_scratch.qwen3 import Qwen3Tokenizer
from reasoning_from_scratch.qwen3 import download_qwen3_small
from reasoning_from_scratch.ch07 import reward_format
 
download_qwen3_small(
    kind="reasoning", tokenizer_only=True, out_dir="qwen3"
)
tokenizer_path = Path("qwen3") / "tokenizer-reasoning.json"
tokenizer = Qwen3Tokenizer(tokenizer_file_path=tokenizer_path)
 
prompt = "Calculate ..."
 
def check_case(name, rollout):
    token_ids = tokenizer.encode(prompt + rollout)
    prompt_len = len(tokenizer.encode(prompt))
    reward = reward_format(
        token_ids=torch.tensor(token_ids),
        prompt_len=prompt_len,
    )
    print(f"{name}: {reward}")
 
# 1) Correct case
check_case(
    "Correct order",
    "Let's ... <think> ... </think> ..."
)
# 2) Typo in tag
check_case(
    "Typo in <think>",
    "Let's ... <thnik> ... </think> ..."
)
# 3) Reversed order
check_case(
    "Reversed order",
    "Let's ... </think> ... <think> ..."
)
 
# 4) Missing one tag
check_case(
    "Missing </think>",
    "Let's ... <think> ..."
)

The output is as follows, indicating that the function requires correct <think>...</think> tags to award a reward of 1.0:

Correct order: 1.0
Typo in <think>: 0.0
Reversed order: 0.0
Missing </think>: 0.0

Exercise 7.2

The implementation of the conditional reward is very simple; in the main chapter, we discussed implementing the overall reward as follows:

reward = rlvr_reward + format_reward_weight * format_reward

So, one way to disable the format reward when the correctness reward (rlvr_reward) is 0.0 is:

if conditional_reward:
    format_reward *= rlvr_reward
reward = rlvr_reward + format_reward_weight * format_reward
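
To make the effect concrete, the following hypothetical helper wraps this logic (the weight of 0.2 is an assumed value, not the chapter’s setting):

def combined_reward(rlvr_reward, format_reward,
                    format_reward_weight=0.2,
                    conditional_reward=True):
    # Suppress the format bonus when the answer is incorrect
    if conditional_reward:
        format_reward *= rlvr_reward
    return rlvr_reward + format_reward_weight * format_reward

print(combined_reward(rlvr_reward=1.0, format_reward=1.0))  # 1.2
print(combined_reward(rlvr_reward=0.0, format_reward=1.0))  # 0.0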

To see it in practice, you can rerun the training script from section 7.6 in chapter 7 with the --conditional_reward flag enabled; this produces the 7_6_plus_format_reward_conditional_metrics.csv log file.

We can download the log file of this run (using similar settings as in section 7.6) and plot it as follows:

from reasoning_from_scratch.ch07 import download_from_github
from reasoning_from_scratch.ch07 import plot_grpo_metrics
download_from_github(
    "ch07/02_logs/7_6_plus_format_reward_conditional_metrics.csv"
)
plot_grpo_metrics(
    "7_6_plus_format_reward_conditional_metrics.csv",
    columns=["loss", "reward_avg", "avg_response_len", "eval_acc"],
)

Figure B.1 Basic metrics from a GRPO training run with a conditional format reward.

The plots in Figure B.1 show that the evaluation accuracy and reward average take a big hit, but seem to recover.

Overall, despite this temporary performance crash, training looks more stable than before, and the trend indicates that the performance would improve further if we trained longer.

plot_grpo_metrics(
    "7_6_plus_format_reward_conditional_metrics.csv",
    columns=["reward_avg", "format_reward_avg", "adv_std", "entropy_avg"],
)

Figure B.2 Additional metrics from a GRPO training run with a conditional format reward.

In Figure B.2, we see the average format reward mimicking the average reward graph almost perfectly, which is a good sanity check that the conditional logic is working. Also, the average format reward shows how much of the total reward is coming from the format term on the subset of correct answers.

As we can see, though, the average format reward graph echoes the average reward graph, so the format term mainly acts as a bonus. It appears to be awarded whenever the model is correct, which makes sense: the trained reasoning model already knows how to use <think>...</think> tags correctly, and it evidently does not unlearn this ability.

The entropy increase is still a bit troubling, though, and could hint towards training instabilities that could potentially be addressed by other means (like tighter clipping with smaller clip_eps).

B.7 Chapter 8

Exercise 8.1

To calculate the training and validation answer length statistics, you can add the following commands at the end of section 8.4.3, following the partitioning:

compute_length(train_examples)
# Prints
# Average: 1180 tokens
# Shortest: 236 tokens (index 5730)
# Longest: 2048 tokens (index 1319)

and

compute_length(val_examples)
# Prints
# Average: 1106 tokens
# Shortest: 310 tokens (index 12)
# Longest: 2048 tokens (index 15)
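
Here, compute_length refers to the helper from chapter 8; for reference, a minimal implementation consistent with the printed output could look like this (a sketch; the chapter’s version may differ):

def compute_length(examples):
    # Token counts per example, based on the "token_ids" field
    lengths = [len(ex["token_ids"]) for ex in examples]
    avg = sum(lengths) / len(lengths)
    shortest_idx = min(range(len(lengths)), key=lengths.__getitem__)
    longest_idx = max(range(len(lengths)), key=lengths.__getitem__)
    print(f"Average: {avg:.0f} tokens")
    print(f"Shortest: {lengths[shortest_idx]} tokens (index {shortest_idx})")
    print(f"Longest: {lengths[longest_idx]} tokens (index {longest_idx})")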

As we can see, the average token lengths (1180 versus 1106) are fairly similar, so the two splits should be relatively balanced.

As a bonus, we can also plot histograms to visualize the distributions:

import matplotlib.pyplot as plt
 
train_lengths = [len(ex["token_ids"]) for ex in train_examples]
val_lengths = [len(ex["token_ids"]) for ex in val_examples]
 
bins = range(0, max(train_lengths + val_lengths) + 64, 64)
 
fig, ax = plt.subplots(figsize=(7, 4))
# density=True normalizes the counts because the validation split is much smaller
ax.hist(train_lengths, bins=bins, density=True, alpha=0.6, label="Train")
ax.hist(val_lengths, bins=bins, density=True, alpha=0.6, label="Validation")
ax.set_xlabel("Token length")
ax.set_ylabel("Density")
ax.legend()
plt.tight_layout()
plt.show()

The resulting plot is shown in Figure B.3.

Figure B.3 Distribution of training and validation set lengths.

There are far fewer validation samples, which is why the validation histogram looks a bit jagged, but as we can see, it covers the distribution well.

Exercise 8.2

To replicate the run without <think></think> tokens, we use the following script execution command:

uv run distill.py \
--data_path deepseek-r1-math-train.json \
--validation_size 25 \
--epochs 3 \
--lr 1e-5 \
--max_seq_len 2048 \
--grad_clip 1.0

Then, for the evaluation, we use the base model instead of the reasoning model:

uv run evaluate_math500.py \
--dataset_size 500 \
--which_model base \
--max_new_tokens 4096 \
--checkpoint_path \
run_11/checkpoints/distill/qwen3-0.6B-distill-step05746-epoch1.pth

The results are shown in table B.6.

Table B.6 MATH-500 task accuracy with and without think tokens

# | Method                           | Epoch | Final val loss  | MATH-500 Acc.
1 | Base Qwen3 0.6B (chapter 3)      | -     | -               | 15.2%
2 | Reasoning Qwen3 0.6B (chapter 3) | -     | -               | 48.2%
3 | DeepSeek-R1                      | 1     | 0.5436 (0.5404) | 31.8% (30.6%)
4 | DeepSeek-R1                      | 2     | 0.5349 (0.5339) | 31.8% (32.4%)
5 | DeepSeek-R1                      | 3     | 0.5343 (0.5306) | 30.2% (33.6%)
6 | Qwen3 235B-A22B                  | 1     | 0.4043 (0.3130) | 44.8% (45.0%)
7 | Qwen3 235B-A22B                  | 2     | 0.3963 (0.3087) | 39.4% (43.8%)
8 | Qwen3 235B-A22B                  | 3     | 0.3948 (0.3078) | 39.8% (44.2%)

In Table B.6, the new results (without think tokens) are shown first, with corresponding think-token results (from the main chapter) in parentheses.

Interestingly, the Qwen3 model has a lower validation loss when <think></think> tokens are omitted, but this doesn’t translate into better modeling performance.

As we can see, the omission of <think></think> makes the results slightly worse in almost all cases.