From 3d77d1c34ddfa14afad33012ca40a627501afd5b Mon Sep 17 00:00:00 2001
From: andryr
Date: Sun, 24 Aug 2025 19:52:40 -0400
Subject: [PATCH] Fix spacing

---
 units/en/unit2/bellman-equation.mdx   |  6 +++---
 units/en/unit2/mc-vs-td.mdx           | 18 +++++++++---------
 units/en/unit2/q-learning-example.mdx |  6 +++---
 units/en/unit2/q-learning.mdx         |  6 +++---
 4 files changed, 18 insertions(+), 18 deletions(-)

diff --git a/units/en/unit2/bellman-equation.mdx b/units/en/unit2/bellman-equation.mdx
index 6f85eedc..5a1e99c1 100644
--- a/units/en/unit2/bellman-equation.mdx
+++ b/units/en/unit2/bellman-equation.mdx
@@ -5,16 +5,16 @@ The Bellman equation **simplifies our state value or state-action value calcula

Bellman equation

-With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**
+With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**

-So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
+So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:
Bellman equation
To calculate the value of State 1: the sum of rewards if the agent started in that state and then followed the greedy policy (taking the actions that lead to the best state values) for all the time steps.
-Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).
+Then, to calculate the \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).

Bellman equation
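A minimal Python sketch of the idea in the bellman-equation.mdx hunk above: with a fixed greedy path and no discounting, the value of a state is the immediate reward plus the value of the state that follows it, which gives the same number as summing every reward from that state onward. The reward values are made up for illustration.

```python
# Hedged sketch of the Bellman idea (illustrative values only).
# With gamma = 1 (no discounting) and a fixed greedy path, V(S_t) equals
# R_{t+1} + V(S_{t+1}), the same as summing every reward from step t onward.

rewards = [1, 1, 1, 1]  # assumed rewards along the greedy path, one per step

def value_by_summing(t: int) -> int:
    """Return G_t by summing all rewards from step t to the end of the path."""
    return sum(rewards[t:])

def value_by_bellman(t: int) -> int:
    """Return V(S_t) as the immediate reward plus the value of the next state."""
    if t == len(rewards):          # terminal state: nothing left to collect
        return 0
    return rewards[t] + value_by_bellman(t + 1)

for t in range(len(rewards) + 1):
    assert value_by_summing(t) == value_by_bellman(t)
```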
diff --git a/units/en/unit2/mc-vs-td.mdx b/units/en/unit2/mc-vs-td.mdx
index ddc97e8c..11a053d1 100644
--- a/units/en/unit2/mc-vs-td.mdx
+++ b/units/en/unit2/mc-vs-td.mdx
@@ -32,8 +32,8 @@ If we take an example:

- At the end of the episode, **we have a list of State, Actions, Rewards, and Next States tuples**
For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]
-- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
-- It will then **update \\(V(s_t)\\) based on the formula**
+- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
+- It will then **update \\(V(s_t)\\) based on the formula**

Monte Carlo

@@ -58,7 +58,7 @@ For instance, if we train a state-value function using Monte Carlo:

-- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G_{t=0}\\)**
+- We have a list of state, action, rewards, next_state, **we need to calculate the return \\(G_{t=0}\\)**

\\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don't discount the rewards)

@@ -68,7 +68,7 @@ For instance, if we train a state-value function using Monte Carlo:

\\(G_0 = 3\\)

-- We can now compute the **new** \\(V(S_0)\\):
+- We can now compute the **new** \\(V(S_0)\\):

Monte Carlo

@@ -86,11 +86,11 @@ For instance, if we train a state-value function using Monte Carlo:

**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).

-The idea with **TD is to update the \\(V(S_t)\\) at each step.**
+The idea with **TD is to update the \\(V(S_t)\\) at each step.**

-But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**
+But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**

-This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**
+This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**

Temporal Difference

@@ -117,9 +117,9 @@ We can now update \\(V(S_0)\\):

New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)

-New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)
+New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)

-New \\(V(S_0) = 0.1\\)
+New \\(V(S_0) = 0.1\\)

So we just updated our value function for State 0.
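The two updates contrasted in the mc-vs-td.mdx hunks above, as a small sketch: Monte Carlo waits for the full return \\(G_t\\), while TD(0) bootstraps from \\(R_{t+1} + \gamma V(S_{t+1})\\). The learning rate 0.1 and the initial \\(V(S_0) = 0\\) follow the TD worked example; reusing those same values for the Monte Carlo case is an assumption.

```python
# Hedged sketch: Monte Carlo vs. TD(0) value-function updates.

def mc_update(v_s, g_t, lr):
    """Monte Carlo: wait until the episode ends, then move V(S_t) toward the
    actual return G_t."""
    return v_s + lr * (g_t - v_s)

def td_update(v_s, reward, v_next, lr, gamma):
    """TD(0): after one step, move V(S_t) toward the bootstrapped TD target
    R_{t+1} + gamma * V(S_{t+1})."""
    td_target = reward + gamma * v_next
    return v_s + lr * (td_target - v_s)

# Numbers from the worked examples above: the MC hunk computes G_0 = 3,
# and the TD hunk uses R_1 = 1, gamma = 1, V(S_1) = 0, lr = 0.1, V(S_0) = 0.
print(mc_update(v_s=0.0, g_t=3.0, lr=0.1))                          # ≈ 0.3
print(td_update(v_s=0.0, reward=1.0, v_next=0.0, lr=0.1, gamma=1))  # 0.1
```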
diff --git a/units/en/unit2/q-learning-example.mdx b/units/en/unit2/q-learning-example.mdx
index 43cc3dfd..f2485aed 100644
--- a/units/en/unit2/q-learning-example.mdx
+++ b/units/en/unit2/q-learning-example.mdx
@@ -44,7 +44,7 @@ Because epsilon is big (= 1.0), I take a random action. In this case, I go right

## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

-By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.
+By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.

Maze-Example

@@ -52,7 +52,7 @@
## Step 4: Update Q(St, At) [[step4]]

-We can now update \\(Q(S_t, A_t)\\) using our formula.
+We can now update \\(Q(S_t, A_t)\\) using our formula.

Maze-Example
Maze-Example

@@ -70,7 +70,7 @@ I took the action 'down'. **This is not a good action since it leads me to the

## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

-Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**
+Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**

Maze-Example

diff --git a/units/en/unit2/q-learning.mdx b/units/en/unit2/q-learning.mdx
index 13571630..660fb582 100644
--- a/units/en/unit2/q-learning.mdx
+++ b/units/en/unit2/q-learning.mdx
@@ -108,13 +108,13 @@ Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**

Q-learning

-This means that to update our \\(Q(S_t, A_t)\\):
+This means that to update our \\(Q(S_t, A_t)\\):

-- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
+- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
- To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?

-1. We obtain the reward \\(R_{t+1}\\) after taking the action \\(A_t\\).
+1. We obtain the reward \\(R_{t+1}\\) after taking the action \\(A_t\\).
2. To get the **best state-action pair value** for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy; it will always take the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
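To make the walkthrough in q-learning-example.mdx concrete, here is a sketch of the epsilon-greedy choice referred to in the hunk context ("Because epsilon is big (= 1.0), I take a random action"): with epsilon = 1.0 the agent always explores, which is why the example starts with a random move. The Q-table, state name, and action names below are placeholders, not values from the course.

```python
import random

# Hedged sketch of epsilon-greedy action selection.
def epsilon_greedy(q_table, state, actions, epsilon):
    """Pick a random action with probability epsilon, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: q_table[(state, a)])  # exploit

actions = ["left", "right", "up", "down"]
q_table = {("start", a): 0.0 for a in actions}   # assumed initial Q-values of 0
print(epsilon_greedy(q_table, "start", actions, epsilon=1.0))  # always explores here
```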
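And a sketch of the update rule described in the q-learning.mdx hunk: the TD target adds \\(R_{t+1}\\) to the discounted best Q-value of the next state (a greedy max), even though actions are selected epsilon-greedily. Only the reward of +1 for the small cheese comes from the example; the learning rate, discount factor, state names, and initial Q-values are assumed.

```python
# Hedged sketch of the Q(S_t, A_t) update with a greedy max in the TD target.
# lr, gamma and the initial Q-values below are assumed; only R_{t+1} = 1
# comes from the maze walkthrough.
def q_learning_update(q_table, state, action, reward, next_state, actions,
                      lr=0.1, gamma=0.99):
    best_next = max(q_table[(next_state, a)] for a in actions)  # greedy part
    td_target = reward + gamma * best_next
    td_error = td_target - q_table[(state, action)]
    q_table[(state, action)] += lr * td_error
    return q_table[(state, action)]

actions = ["left", "right", "up", "down"]
q_table = {(s, a): 0.0 for s in ["start", "cheese"] for a in actions}

# First transition of the walkthrough: go right, get the small cheese (+1).
print(q_learning_update(q_table, "start", "right", reward=1.0,
                        next_state="cheese", actions=actions))  # 0.1
```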