38 changes: 19 additions & 19 deletions docs/chapter3/Chapter3-Fundamentals-of-Large-Language-Models.md
@@ -60,7 +60,7 @@ $$P(\text{learns}|\text{agent}) = \frac{\text{Count(agent learns)}}{\text{Count

$$P(\text{datawhale agent learns}) \approx P(\text{datawhale}) \cdot P(\text{agent}|\text{datawhale}) \cdot P(\text{learns}|\text{agent}) \approx 0.333 \cdot 1 \cdot 0.5 \approx 0.167$$

```Python
```python
import collections

# Example corpus, consistent with the corpus in the case explanation above
@@ -132,7 +132,7 @@ Through this method, word vectors can not only capture simple relationships like

A famous example demonstrates the semantic relationships captured by word vectors: `vector('King') - vector('Man') + vector('Woman')` The result of this vector operation is surprisingly close to the position of `vector('Queen')` in the vector space. This is like performing semantic translation: we start from the point "king," subtract the vector of "male," add the vector of "female," and finally arrive at the position of "queen." This proves that word embeddings can learn abstract concepts like "gender" and "royalty."

```Python
```python
import numpy as np

# Assume we have learned simplified 2D word vectors
@@ -203,7 +203,7 @@ We can understand this structure as a team with clear division of labor:

To truly understand how Transformer works, the best method is to implement it yourself. In this section, we will adopt a "top-down" approach: first, we build the complete code framework of Transformer, defining all necessary classes and methods. Then, like completing a puzzle, we will implement the specific functions of these classes one by one.

```Python
```python
import torch
import torch.nn as nn
import math
@@ -317,7 +317,7 @@ It splits the original Q, K, V vectors into h parts along the dimension (h is th

As shown in Figure 3.5, this design allows the model to jointly attend to information from different positions and different representation subspaces, greatly enhancing the model's expressive power. Below is a simple implementation of multi-head attention for reference.

```Python
```python
class MultiHeadAttention(nn.Module):
"""
Multi-head attention mechanism module
@@ -390,7 +390,7 @@ Where $x$ is the output of the attention sublayer. $W_1,b_1,W_2,b_2$ are learnab

In our PyTorch skeleton, we can implement this module with the following code:

```Python
```python
class PositionWiseFeedForward(nn.Module):
"""
Position-wise feed-forward network module
@@ -439,7 +439,7 @@ Where:

Now, let's implement the `PositionalEncoding` module and complete the last part of our Transformer skeleton code.

```Python
```python
class PositionalEncoding(nn.Module):
"""
Add positional encoding to word embedding vectors of input sequence.
@@ -546,7 +546,7 @@ According to the number of examples (Exemplars) we provide to the model, prompts

Case: We directly give the model instructions, requiring it to complete the sentiment classification task.

```Python
```python
Text: Datawhale's AI Agent course is excellent!
Sentiment: Positive
```
@@ -555,7 +555,7 @@

Case: We first give the model a complete "question-answer" pair as a demonstration, then pose our new question.

```Python
```python
Text: This restaurant's service is too slow.
Sentiment: Negative

@@ -569,7 +569,7 @@

Case: We provide multiple examples covering different situations, allowing the model to have a more comprehensive understanding of the task.

```Python
```python
Text: This restaurant's service is too slow.
Sentiment: Negative

@@ -590,7 +590,7 @@

- **Prompts for "text completion" models (you need to use few-shot prompts to "teach" the model what to do):**

```Plain
```plain
This is a program that translates English to Chinese.
English: Hello
Chinese: 你好
@@ -600,7 +600,7 @@

- **Prompts for "instruction-tuned" models (you can directly give instructions):**

```Plain
```plain
Please translate the following English to Chinese:
How are you?
```
@@ -611,14 +611,14 @@

**Role-playing** By assigning the model a specific role, we can guide its response style, tone, and knowledge scope, making its output more suitable for specific scenario needs.

```Plain
```plain
# Case
You are now a senior Python programming expert. Please explain what GIL (Global Interpreter Lock) is in Python in a way that even a beginner can understand.
```

**In-context Example** This is consistent with the idea of few-shot prompting. By providing clear input-output examples in the prompt, we "teach" the model how to handle our requests, which is especially effective when dealing with complex formats or specific style tasks.

```Plain
```plain
# Case
I need you to extract product names and user sentiment from product reviews. Please output strictly in the JSON format below.

@@ -635,7 +635,7 @@

The key to implementing CoT is to add a simple guiding phrase in the prompt, such as "please think step by step" or "Let's think step by step."

```Plain
```plain
# Chain-of-Thought Prompt
A basketball team won 60% of their 80 games in one season. In the next season, they played 15 games and won 12. What is the total winning percentage for both seasons?
Please think step by step and solve.
@@ -690,7 +690,7 @@ After training ends, when the vocabulary size reaches 10, we get new tokenizatio

Below we use a simple Python code to simulate the above process:

```Python
```python
import re, collections

def get_stats(vocab):
@@ -766,13 +766,13 @@ In Chapter 1 of this book, we interacted with large language models through APIs

First, please ensure you have installed the necessary libraries:

```Plain
```plain
pip install transformers torch
```

In the `transformers` library, we typically use the `AutoModelForCausalLM` and `AutoTokenizer` classes to automatically load weights and tokenizers matching the model. The following code will automatically download required model files and tokenizer configurations from Hugging Face Hub, which may take some time depending on your network speed.

```Python
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -794,7 +794,7 @@ print("Model and tokenizer loaded!")

Let's create a dialogue prompt. The Qwen1.5-Chat model follows a specific dialogue template. Then, we can use the `tokenizer` loaded in the previous step to convert the text prompt into numerical IDs (i.e., Token IDs) that the model can understand.

```Python
```python
# Prepare dialogue input
messages = [
{"role": "system", "content": "You are a helpful assistant."},
@@ -823,7 +823,7 @@ Now we can call the model's `generate()` method to generate an answer. The model

Finally, we need to use the tokenizer's `decode()` method to translate these numerical IDs back into human-readable text.

```Python
```python
# Use model to generate answer
# max_new_tokens controls the maximum number of new Tokens the model can generate
generated_ids = model.generate(
38 changes: 19 additions & 19 deletions docs/chapter3/第三章 大语言模型基础.md
@@ -60,7 +60,7 @@ $$P(\text{learns}|\text{agent}) = \frac{\text{Count(agent learns)}}{\text{Count

$$P(\text{datawhale agent learns}) \approx P(\text{datawhale}) \cdot P(\text{agent}|\text{datawhale}) \cdot P(\text{learns}|\text{agent}) \approx 0.333 \cdot 1 \cdot 0.5 \approx 0.167$$

```Python
```python
import collections

# 示例语料库,与上方案例讲解中的语料库保持一致
@@ -132,7 +132,7 @@ $$\text{similarity}(\vec{a}, \vec{b}) = \cos(\theta) = \frac{\vec{a} \cdot \vec{

一个著名的例子展示了词向量捕捉到的语义关系: `vector('King') - vector('Man') + vector('Woman')` 这个向量运算的结果,在向量空间中与 `vector('Queen')` 的位置惊人地接近。这好比在进行语义的平移:我们从“国王”这个点出发,减去“男性”的向量,再加上“女性”的向量,最终就抵达了“女王”的位置。这证明了词嵌入能够学习到“性别”、“皇室”这类抽象概念。

```Python
```python
import numpy as np

# 假设我们已经学习到了简化的二维词向量
@@ -204,7 +204,7 @@ king - man + woman 的结果向量: [0.9 0.2]

为了真正理解 Transformer 的工作原理,最好的方法莫过于亲手实现它。在本节中,我们将采用一种“自顶向下”的方法:首先,我们搭建出 Transformer 完整的代码框架,定义好所有需要的类和方法。然后,我们将像完成拼图一样,逐一实现这些类的具体功能。

```Python
```python
import torch
import torch.nn as nn
import math
@@ -319,7 +319,7 @@ $$\text{Attention}(Q,K,V)=\text{softmax}\left(\frac{QK^{T}}{\sqrt{d_{k}}}\right)

如图3.5所示,这种设计让模型能够共同关注来自不同位置、不同表示子空间的信息,极大地增强了模型的表达能力。以下是多头注意力的简单实现可供参考。

```Python
```python
class MultiHeadAttention(nn.Module):
"""
多头注意力机制模块
@@ -392,7 +392,7 @@ $$\mathrm{FFN}(x)=\max\left(0, xW_{1}+b_{1}\right) W_{2}+b_{2}$$

在我们的 PyTorch 骨架中,我们可以用以下代码来实现这个模块:

```Python
```python
class PositionWiseFeedForward(nn.Module):
"""
位置前馈网络模块
@@ -441,7 +441,7 @@ $$PE_{(pos,2i+1)}=\cos\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$

现在,我们来实现 `PositionalEncoding` 模块,并完成我们 Transformer 骨架代码的最后一部分。

```Python
```python
class PositionalEncoding(nn.Module):
"""
为输入序列的词嵌入向量添加位置编码。
@@ -552,7 +552,7 @@ Decoder-Only 架构的工作模式被称为<strong>自回归 (Autoregressive)</s

案例: 我们直接向模型下达指令,要求它完成情感分类任务。

```Python
```python
文本:Datawhale的AI Agent课程非常棒!
情感:正面
```
@@ -561,7 +561,7 @@

案例: 我们先给模型一个完整的“问题-答案”对作为示范,然后提出我们的新问题。

```Python
```python
文本:这家餐厅的服务太慢了。
情感:负面

@@ -575,7 +575,7 @@

案例: 我们提供涵盖了不同情况的多个示例,让模型对任务有更全面的理解。

```Python
```python
文本:这家餐厅的服务太慢了。
情感:负面

@@ -596,7 +596,7 @@

- <strong>对“文本补全”模型的提示(你需要用少样本提示“教会”模型做什么):</strong>

```Plain
```plain
这是一段将英文翻译成中文的程序。
英文:Hello
中文:你好
@@ -606,7 +606,7 @@

- <strong>对“指令调优”模型的提示(你可以直接下达指令):</strong>

```Plain
```plain
请将下面的英文翻译成中文:
How are you?
```
@@ -617,14 +617,14 @@

<strong>角色扮演 (Role-playing)</strong> 通过赋予模型一个特定的角色,我们可以引导它的回答风格、语气和知识范围,使其输出更符合特定场景的需求。

```Plain
```plain
# 案例
你现在是一位资深的Python编程专家。请解释一下Python中的GIL(全局解释器锁)是什么,要让一个初学者也能听懂。
```

<strong>上下文示例 (In-context Example)</strong> 这与少样本提示的思想一致,通过在提示中提供清晰的输入输出示例,来“教会”模型如何处理我们的请求,尤其是在处理复杂格式或特定风格的任务时非常有效。

```Plain
```plain
# 案例
我需要你从产品评论中提取产品名称和用户情感。请严格按照下面的JSON格式输出。

@@ -641,7 +641,7 @@

实现 CoT 的关键,是在提示中加入一句简单的引导语,如“请逐步思考”或“Let's think step by step”。

```Plain
```plain
# 思维链提示
一个篮球队在一个赛季的80场比赛中赢了60%。在接下来的赛季中,他们打了15场比赛,赢了12场。两个赛季的总胜率是多少?
请一步一步地思考并解答。
@@ -695,7 +695,7 @@ How are you?

下面我们用一段简单的 Python 代码来模拟上述过程:

```Python
```python
import re, collections

def get_stats(vocab):
@@ -771,13 +771,13 @@ for i in range(num_merges):

首先,请确保你已经安装了必要的库:

```Plain
```plain
pip install transformers torch
```

在 `transformers` 库中,我们通常使用 `AutoModelForCausalLM` 和 `AutoTokenizer` 这两个类来自动加载与模型匹配的权重和分词器。下面这段代码会自动从 Hugging Face Hub 下载所需的模型文件和分词器配置,这可能需要一些时间,具体取决于你的网络速度。

```Python
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@@ -799,7 +799,7 @@ print("模型和分词器加载完成!")

我们来创建一个对话提示,Qwen1.5-Chat 模型遵循特定的对话模板。然后,可以将使用上一步加载的 `tokenizer` 将文本提示转换为模型能够理解的数字 ID(即 Token ID)。

```Python
```python
# 准备对话输入
messages = [
{"role": "system", "content": "You are a helpful assistant."},
@@ -828,7 +828,7 @@ print(model_inputs)

最后,我们需要使用分词器的 `decode()` 方法,将这些数字 ID 翻译回人类可以阅读的文本。

```Python
```python
# 使用模型生成回答
# max_new_tokens 控制了模型最多能生成多少个新的Token
generated_ids = model.generate(
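Every one of the 19 changes in each file follows the same mechanical pattern: lowercasing the fence info string (```Python → ```python, ```Plain → ```plain) so that case-sensitive syntax highlighters pick the correct lexer. A change this regular can be scripted rather than edited by hand. The sketch below is a hypothetical helper, not part of this PR; the regex, function name, and assumed `docs/` layout are illustrative only:

```python
import re
from pathlib import Path

# Matches a fenced-code opener at the start of a line: up to three spaces of
# indent, a run of at least three backticks, then a language word such as
# "Python" or "Plain". A bare closing fence has no language word after the
# backticks, so it never matches and is left untouched.
FENCE_RE = re.compile(
    r"^(?P<prefix> {0,3}`{3,})(?P<lang>[A-Za-z][\w+.-]*)[ \t]*$",
    re.MULTILINE,
)

def lowercase_fence_langs(text: str) -> str:
    """Lowercase the info string of every code-fence opener in Markdown text."""
    return FENCE_RE.sub(
        lambda m: m.group("prefix") + m.group("lang").lower(), text
    )

if __name__ == "__main__":
    # Assumed repo layout: Markdown chapters under docs/, as in this PR.
    for md in Path("docs").rglob("*.md"):
        original = md.read_text(encoding="utf-8")
        updated = lowercase_fence_langs(original)
        if updated != original:
            md.write_text(updated, encoding="utf-8")
```

One caveat worth noting for a sketch like this: a naive line-anchored regex also rewrites fence-like lines that happen to appear inside an outer fenced block, so a stricter tool would track open/close fence state while scanning.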