6 changes: 3 additions & 3 deletions units/en/unit2/bellman-equation.mdx
@@ -5,16 +5,16 @@ The Bellman equation **simplifies our state value or state-action value calculation**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>

With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**

So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
  <figcaption>To calculate the value of State 1: the sum of the rewards if the agent started in that state and then followed the greedy policy (taking the actions that lead to the best state values) for all the time steps.</figcaption>
</figure>
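
As a quick illustration (a minimal sketch with made-up rewards, no discounting, and a fixed greedy policy — not code from the course), the Bellman decomposition \\(V(S_t) = R_{t+1} + V(S_{t+1})\\) gives the same values as summing the full return from each state:

```python
# Minimal sketch: a hypothetical, undiscounted chain of rewards under a fixed greedy policy.
rewards = [1, 0, 0, 2]  # made-up reward received when leaving each state

# 1) "Sum the return from this state onward" (the direct calculation)
def value_by_full_return(t):
    return sum(rewards[t:])

# 2) Bellman decomposition: V(S_t) = R_{t+1} + V(S_{t+1}), with V(terminal) = 0
def value_by_bellman(t):
    if t == len(rewards):
        return 0
    return rewards[t] + value_by_bellman(t + 1)

for t in range(len(rewards)):
    assert value_by_full_return(t) == value_by_bellman(t)
    print(t, value_by_bellman(t))
```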

Then, to calculate \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
18 changes: 9 additions & 9 deletions units/en/unit2/mc-vs-td.mdx
@@ -32,8 +32,8 @@ If we take an example:
- At the end of the episode, **we have a list of (State, Action, Reward, Next State) tuples**
For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]

- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
- It will then **update \\(V(s_t)\\) based on the formula**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" alt="Monte Carlo"/>
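
As a rough sketch of this update (the learning rate, state indices, and the episode itself are made up for illustration; this is not the course's notebook code):

```python
# Minimal sketch of the Monte Carlo update: after the episode ends, compute the
# undiscounted return G_t from each visited state and nudge V(s_t) toward it.
alpha = 0.1                                  # assumed learning rate
V = {s: 0.0 for s in range(7)}               # state values, initialised to 0

# hypothetical episode as (state, action, reward, next_state) tuples
episode = [(3, "left", 1, 2), (2, "left", 0, 1), (1, "left", 0, 0)]

for t, (state, action, reward, next_state) in enumerate(episode):
    G_t = sum(r for (_, _, r, _) in episode[t:])     # return from step t onward
    V[state] = V[state] + alpha * (G_t - V[state])   # V(s_t) <- V(s_t) + alpha * (G_t - V(s_t))

print(V)
```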

@@ -58,7 +58,7 @@ For instance, if we train a state-value function using Monte Carlo:



- We have a list of state, action, reward, next_state tuples; **we need to calculate the return \\(G_{t=0}\\)**

\\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don't discount the rewards)

@@ -68,7 +68,7 @@ For instance, if we train a state-value function using Monte Carlo:

\\(G_0 = 3\\)

- We can now compute the **new** \\(V(S_0)\\):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>
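
To make the arithmetic explicit (assuming a learning rate of 0.1, as in the TD example later in this unit, and an initial value of 0):

```python
# Quick check of the update above (assumed values: lr = 0.1, initial V(S_0) = 0, G_0 = 3)
V_S0 = 0.0
G_0 = 3
lr = 0.1
V_S0 = V_S0 + lr * (G_0 - V_S0)
print(V_S0)   # ~0.3 (up to floating-point rounding)
```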

@@ -86,11 +86,11 @@ For instance, if we train a state-value function using Monte Carlo:

**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).

The idea with **TD is to update the \\(V(S_t)\\) at each step.**

But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**

This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>

@@ -117,9 +117,9 @@ We can now update \\(V(S_0)\\):

New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)

New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)

New \\(V(S_0) = 0.1\\)

So we just updated our value function for State 0.
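
Here is the same one-step update written out as a minimal sketch (values taken from the example above: V initialised to 0, learning rate 0.1, \\(\gamma = 1\\), \\(R_1 = 1\\)):

```python
# Minimal sketch of the TD(0) update above.
lr, gamma = 0.1, 1.0
V = {0: 0.0, 1: 0.0}          # value table, initialised to 0

R_1, S_0, S_1 = 1, 0, 1       # one step of experience from the example

td_target = R_1 + gamma * V[S_1]               # bootstrapped estimate of G_t
V[S_0] = V[S_0] + lr * (td_target - V[S_0])    # update after a single step

print(V[S_0])   # 0.1
```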

6 changes: 3 additions & 3 deletions units/en/unit2/q-learning-example.mdx
@@ -44,15 +44,15 @@ Because epsilon is big (= 1.0), I take a random action. In this case, I go right.

## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>


## Step 4: Update Q(St, At) [[step4]]

We can now update \\(Q(S_t, A_t)\\) using our formula.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
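
As a rough sketch of this update (the table shape, state/action indices, learning rate, and \\(\gamma\\) are assumptions made for illustration, not the course's exact values):

```python
import numpy as np

# Minimal sketch of the Q(S_t, A_t) update for this step.
n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))   # Q-table initialised to 0

lr, gamma = 0.1, 0.99                 # assumed hyperparameters
s, a = 0, 2                           # hypothetical indices for the start state and "go right"
r, s_next = 1, 1                      # reward for the small cheese, and the new state

td_target = r + gamma * np.max(Q[s_next])        # best state-action value of the next state
Q[s, a] = Q[s, a] + lr * (td_target - Q[s, a])   # Q(S_t, A_t) update
print(Q[s, a])   # 0.1, since the next state's values are still all zeros
```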
@@ -70,7 +70,7 @@ I took the action 'down'. **This is not a good action since it leads me to the poison.**

## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>

6 changes: 3 additions & 3 deletions units/en/unit2/q-learning.mdx
@@ -108,13 +108,13 @@ Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>


This means that to update our \\(Q(S_t, A_t)\\):

- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
- To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?
1. We obtain the reward \\(R_{t+1}\\) after taking the action \\(A_t\\).
2. To get the **best state-action pair value** for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
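
Putting the two policies side by side, here is a minimal sketch (function names, epsilon, and hyperparameters are illustrative assumptions): actions are *chosen* epsilon-greedily, while the TD target always uses the greedy max over the next state's action values.

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # random action
    return int(np.argmax(Q[state]))           # greedy action

def q_learning_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """The TD target uses the greedy max over the next state's values (off-policy)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] = Q[s, a] + lr * (td_target - Q[s, a])

# Example usage with a hypothetical 6-state, 4-action table:
Q = np.zeros((6, 4))
a = epsilon_greedy_action(Q, state=0, epsilon=1.0)   # fully exploratory at first
q_learning_update(Q, s=0, a=a, r=1, s_next=1)
```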