6 changes: 3 additions & 3 deletions units/en/unit2/bellman-equation.mdx
@@ -5,16 +5,16 @@ The Bellman equation **simplifies our state value or state-action value calculation**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman.jpg" alt="Bellman equation"/>

With what we have learned so far, we know that if we calculate \\(V(S_t)\\) (the value of a state), we need to calculate the return starting at that state and then follow the policy forever after. **(The policy we defined in the following example is a Greedy Policy; for simplification, we don't discount the reward).**

So to calculate \\(V(S_t)\\), we need to calculate the sum of the expected rewards. Hence:

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman2.jpg" alt="Bellman equation"/>
  <figcaption>To calculate the value of State 1: the sum of the rewards if the agent started in that state and then followed the greedy policy (taking the actions that lead to the best state values) for all the time steps.</figcaption>
</figure>
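
As a quick illustration (a minimal sketch with made-up rewards, no discounting, and a fixed greedy policy — not code from the course), the Bellman decomposition \\(V(S_t) = R_{t+1} + V(S_{t+1})\\) gives the same values as summing the full return from each state:

```python
# Minimal sketch: a hypothetical, undiscounted chain of rewards under a fixed greedy policy.
rewards = [1, 0, 0, 2]  # made-up reward received when leaving each state

# 1) "Sum the return from this state onward" (the direct calculation)
def value_by_full_return(t):
    return sum(rewards[t:])

# 2) Bellman decomposition: V(S_t) = R_{t+1} + V(S_{t+1}), with V(terminal) = 0
def value_by_bellman(t):
    if t == len(rewards):
        return 0
    return rewards[t] + value_by_bellman(t + 1)

for t in range(len(rewards)):
    assert value_by_full_return(t) == value_by_bellman(t)
    print(t, value_by_bellman(t))
```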

Then, to calculate \\(V(S_{t+1})\\), we need to calculate the return starting at that state \\(S_{t+1}\\).

<figure>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/bellman3.jpg" alt="Bellman equation"/>
18 changes: 9 additions & 9 deletions units/en/unit2/mc-vs-td.mdx
@@ -32,8 +32,8 @@ If we take an example:
- At the end of the episode, **we have a list of (State, Action, Reward, Next State) tuples**
For instance [[State tile 3 bottom, Go Left, +1, State tile 2 bottom], [State tile 2 bottom, Go Left, +0, State tile 1 bottom]...]

- **The agent will sum the total rewards \\(G_t\\)** (to see how well it did).
- It will then **update \\(V(s_t)\\) based on the formula**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-3.jpg" alt="Monte Carlo"/>
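
As a rough sketch of this update (the learning rate, state indices, and the episode itself are made up for illustration; this is not the course's notebook code):

```python
# Minimal sketch of the Monte Carlo update: after the episode ends, compute the
# undiscounted return G_t from each visited state and nudge V(s_t) toward it.
alpha = 0.1                                  # assumed learning rate
V = {s: 0.0 for s in range(7)}               # state values, initialised to 0

# hypothetical episode as (state, action, reward, next_state) tuples
episode = [(3, "left", 1, 2), (2, "left", 0, 1), (1, "left", 0, 0)]

for t, (state, action, reward, next_state) in enumerate(episode):
    G_t = sum(r for (_, _, r, _) in episode[t:])     # return from step t onward
    V[state] = V[state] + alpha * (G_t - V[state])   # V(s_t) <- V(s_t) + alpha * (G_t - V(s_t))

print(V)
```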

@@ -58,7 +58,7 @@ For instance, if we train a state-value function using Monte Carlo:



- We have a list of state, action, reward, next_state tuples; **we need to calculate the return \\(G_{t=0}\\)**

\\(G_t = R_{t+1} + R_{t+2} + R_{t+3} ...\\) (for simplicity, we don't discount the rewards)

@@ -68,7 +68,7 @@ For instance, if we train a state-value function using Monte Carlo:

\\(G_0 = 3\\)

- We can now compute the **new** \\(V(S_0)\\):

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/MC-5.jpg" alt="Monte Carlo"/>
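
To make the arithmetic explicit (assuming a learning rate of 0.1, as in the TD example later in this unit, and an initial value of 0):

```python
# Quick check of the update above (assumed values: lr = 0.1, initial V(S_0) = 0, G_0 = 3)
V_S0 = 0.0
G_0 = 3
lr = 0.1
V_S0 = V_S0 + lr * (G_0 - V_S0)
print(V_S0)   # ~0.3 (up to floating-point rounding)
```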

@@ -86,11 +86,11 @@ For instance, if we train a state-value function using Monte Carlo:

**Temporal Difference, on the other hand, waits for only one interaction (one step) \\(S_{t+1}\\)** to form a TD target and update \\(V(S_t)\\) using \\(R_{t+1}\\) and \\( \gamma * V(S_{t+1})\\).

The idea with **TD is to update the \\(V(S_t)\\) at each step.**

But because we didn't experience an entire episode, we don't have \\(G_t\\) (expected return). Instead, **we estimate \\(G_t\\) by adding \\(R_{t+1}\\) and the discounted value of the next state.**

This is called bootstrapping. It's called this **because TD bases its update in part on an existing estimate \\(V(S_{t+1})\\) and not a complete sample \\(G_t\\).**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/TD-1.jpg" alt="Temporal Difference"/>

@@ -117,9 +117,9 @@ We can now update \\(V(S_0)\\):

New \\(V(S_0) = V(S_0) + lr * [R_1 + \gamma * V(S_1) - V(S_0)]\\)

New \\(V(S_0) = 0 + 0.1 * [1 + 1 * 0 - 0]\\)

New \\(V(S_0) = 0.1\\)

So we just updated our value function for State 0.
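
Here is the same one-step update written out as a minimal sketch (values taken from the example above: V initialised to 0, learning rate 0.1, \\(\gamma = 1\\), \\(R_1 = 1\\)):

```python
# Minimal sketch of the TD(0) update above.
lr, gamma = 0.1, 1.0
V = {0: 0.0, 1: 0.0}          # value table, initialised to 0

R_1, S_0, S_1 = 1, 0, 1       # one step of experience from the example

td_target = R_1 + gamma * V[S_1]               # bootstrapped estimate of G_t
V[S_0] = V[S_0] + lr * (td_target - V[S_0])    # update after a single step

print(V[S_0])   # 0.1
```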

6 changes: 3 additions & 3 deletions units/en/unit2/q-learning-example.mdx
@@ -44,15 +44,15 @@ Because epsilon is big (= 1.0), I take a random action. In this case, I go right.

## Step 3: Perform action At, get Rt+1 and St+1 [[step3]]

By going right, I get a small cheese, so \\(R_{t+1} = 1\\) and I'm in a new state.


<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-4.jpg" alt="Maze-Example"/>


## Step 4: Update Q(St, At) [[step4]]

We can now update \\(Q(S_t, A_t)\\) using our formula.

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-5.jpg" alt="Maze-Example"/>
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Example-4.jpg" alt="Maze-Example"/>
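
As a rough sketch of this update (the table shape, state/action indices, learning rate, and \\(\gamma\\) are assumptions made for illustration, not the course's exact values):

```python
import numpy as np

# Minimal sketch of the Q(S_t, A_t) update for this step.
n_states, n_actions = 6, 4
Q = np.zeros((n_states, n_actions))   # Q-table initialised to 0

lr, gamma = 0.1, 0.99                 # assumed hyperparameters
s, a = 0, 2                           # hypothetical indices for the start state and "go right"
r, s_next = 1, 1                      # reward for the small cheese, and the new state

td_target = r + gamma * np.max(Q[s_next])        # best state-action value of the next state
Q[s, a] = Q[s, a] + lr * (td_target - Q[s, a])   # Q(S_t, A_t) update
print(Q[s, a])   # 0.1, since the next state's values are still all zeros
```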
@@ -70,7 +70,7 @@ I took the action 'down'. **This is not a good action since it leads me to the poison.**

## Step 3: Perform action At, get Rt+1 and St+1 [[step3-3]]

Because I ate poison, **I get \\(R_{t+1} = -10\\), and I die.**

<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/q-ex-7.jpg" alt="Maze-Example"/>

6 changes: 3 additions & 3 deletions units/en/unit2/q-learning.mdx
@@ -108,13 +108,13 @@ Therefore, our \\(Q(S_t, A_t)\\) **update formula goes like this:**
<img src="https://huggingface.co/datasets/huggingface-deep-rl-course/course-images/resolve/main/en/unit3/Q-learning-8.jpg" alt="Q-learning"/>


This means that to update our \\(Q(S_t, A_t)\\):

- We need \\(S_t, A_t, R_{t+1}, S_{t+1}\\).
- To update our Q-value at a given state-action pair, we use the TD target.

How do we form the TD target?
1. We obtain the reward \\(R_{t+1}\\) after taking the action \\(A_t\\).
2. To get the **best state-action pair value** for the next state, we use a greedy policy to select the next best action. Note that this is not an epsilon-greedy policy: it always takes the action with the highest state-action value.

Then, when the update of this Q-value is done, we start in a new state and select our action **using an epsilon-greedy policy again.**
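
Putting the two policies side by side, here is a minimal sketch (function names, epsilon, and hyperparameters are illustrative assumptions): actions are *chosen* epsilon-greedily, while the TD target always uses the greedy max over the next state's action values.

```python
import random
import numpy as np

def epsilon_greedy_action(Q, state, epsilon=0.1):
    """Behaviour policy: explore with probability epsilon, otherwise act greedily."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])   # random action
    return int(np.argmax(Q[state]))           # greedy action

def q_learning_update(Q, s, a, r, s_next, lr=0.1, gamma=0.99):
    """The TD target uses the greedy max over the next state's values (off-policy)."""
    td_target = r + gamma * np.max(Q[s_next])
    Q[s, a] = Q[s, a] + lr * (td_target - Q[s, a])

# Example usage with a hypothetical 6-state, 4-action table:
Q = np.zeros((6, 4))
a = epsilon_greedy_action(Q, state=0, epsilon=1.0)   # fully exploratory at first
q_learning_update(Q, s=0, a=a, r=1, s_next=1)
```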