tensorflow-reinforce/rl/pg_reinforce.py, line 192 (commit 173a04f)
Hi. Please correct me if I am missing something, but I think there is a mistake in this line.
Notice that "train_op" calls "apply_gradients", which updates the parameters of the policy network using the log probabilities that the policy outputs when the recorded batch of states is fed in. The problem is that the parameters are updated at every time step of the recorded trajectory ("train_op" is called inside a temporal for loop), and because the log probabilities are recomputed on the fly during the training loop, they are no longer the probabilities that were actually followed when the trajectory was recorded. This produces a distribution mismatch that violates the on-policy nature of the REINFORCE algorithm and, I believe, may lead to incorrect gradients. I think the proper way to do the update is to accumulate the gradients of the log probabilities over the entire batch and then perform a single update, as in the sketch below.
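For illustration, here is a minimal sketch of that idea using the TensorFlow 2 GradientTape API rather than the repository's TF1-style graph code; the function name, the policy model, and the tensor shapes are my own assumptions. The loss is summed over the whole recorded trajectory with the policy parameters held fixed, and only then is one gradient update applied.

```python
import tensorflow as tf

# Hypothetical sketch, not the repo's implementation: sum the REINFORCE loss
# over the entire recorded trajectory under the *same* policy parameters,
# then apply one gradient update, instead of updating inside the time loop.
def reinforce_update(policy, optimizer, states, actions, returns):
    # states:  [T, state_dim] float32 tensor of recorded states
    # actions: [T] int32 tensor of the actions that were actually taken
    # returns: [T] float32 tensor of discounted returns G_t
    with tf.GradientTape() as tape:
        logits = policy(states)                                # [T, num_actions]
        log_probs = tf.nn.log_softmax(logits)                  # [T, num_actions]
        idx = tf.stack([tf.range(tf.shape(actions)[0]), actions], axis=1)
        taken_log_probs = tf.gather_nd(log_probs, idx)         # [T]
        loss = -tf.reduce_sum(taken_log_probs * returns)       # REINFORCE objective
    grads = tape.gradient(loss, policy.trainable_variables)
    optimizer.apply_gradients(zip(grads, policy.trainable_variables))
```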
The same issue applies to the "all_rewards" buffer that is used to normalize the return: as the policy changes, the mean and standard deviation of the return change as well, so the normalization being performed is no longer valid (at least in a strict sense). A per-batch normalization, sketched after this paragraph, would avoid mixing statistics from different policies.
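A minimal sketch of that alternative, with the function name and epsilon being my own choices: compute the mean and standard deviation only from the returns of the current batch, so the statistics always correspond to the policy that generated the data.

```python
import numpy as np

# Hypothetical alternative to the running "all_rewards" buffer: normalize the
# returns with statistics from the current batch only, so they always refer to
# the policy that actually generated the trajectories.
def normalize_returns(returns, eps=1e-8):
    returns = np.asarray(returns, dtype=np.float32)
    return (returns - returns.mean()) / (returns.std() + eps)
```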
As stated before, please correct me if I am wrong; I was simply looking for a REINFORCE implementation in TensorFlow and came across this issue.