Behavioral Cloning using Seq-2-seq models

Direct policy learning for several classical control tasks using sequential machine learning frameworks

After learning about probabilistic and neural approaches to sequential machine learning, I decided to apply what I had learned to the Imitation Learning (IL) domain. Since many reinforcement learning tasks can be modeled as Markov Decision Processes (MDPs), which are sequential in nature, I thought it would be interesting to see whether neural networks and Conditional Random Fields (CRFs) could be used for Behavioral Cloning (BC). I tried solving several of the classic control tasks (mountain car, acrobot, etc.) using a CRF and a bidirectional Long Short-Term Memory (LSTM) network, and then compared my results with a well-established method, Generative Adversarial Imitation Learning (GAIL), and a baseline (logistic regression).
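To make the setup concrete, here is a minimal sketch of the logistic-regression baseline: behavioral cloning reduces to supervised classification, where the expert's demonstrations are flattened into (state, action) pairs and a classifier is fit to predict the expert's action from the current state. The data below is a synthetic placeholder; in practice the states and actions would come from expert rollouts in the environment.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Placeholder expert data; in practice these come from expert rollouts
# (e.g. MountainCar state vectors and the expert's discrete actions).
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(500, 2))      # (num_steps, state_dim)
expert_actions = rng.integers(0, 3, size=500)  # (num_steps,) discrete actions

# Behavioral cloning as classification: predict the expert's action
# from the current state.
policy = LogisticRegression(max_iter=1000).fit(expert_states, expert_actions)

def act(state):
    """Query the cloned policy for a single environment state."""
    return int(policy.predict(np.asarray(state).reshape(1, -1))[0])
```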

For the LSTM I chose a bidirectional architecture because it captured the relationship between the action and the next state more accurately than a unidirectional LSTM could. While this violates the Markov assumption of an MDP, it worked well in practice. Another problem with the LSTM was that, when the model was run in the actual environment, it could not accurately predict actions at the beginning of an episode. I solved this by augmenting the training data: each demonstration was turned into a number of demonstrations equal to the length of the original, one for each prefix. In doing this, the LSTM was able to learn the starting behavior better. For example, if my original training data had been a single demonstration of length 100, my augmented training data would be 100 demonstrations of lengths 1, 2, ..., 100.
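A minimal sketch of both ideas, written in PyTorch as an assumption (the repository's actual code may use a different framework): a bidirectional LSTM that emits per-step action logits, and the prefix augmentation described above. `BiLSTMPolicy` and `augment_with_prefixes` are illustrative names, not the repository's own.

```python
import torch
import torch.nn as nn

class BiLSTMPolicy(nn.Module):
    """Bidirectional LSTM mapping a state sequence to per-step action logits."""
    def __init__(self, state_dim, hidden_dim, num_actions):
        super().__init__()
        self.lstm = nn.LSTM(state_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Forward and backward hidden states are concatenated: 2 * hidden_dim.
        self.head = nn.Linear(2 * hidden_dim, num_actions)

    def forward(self, states):
        # states: (batch, seq_len, state_dim)
        outputs, _ = self.lstm(states)
        return self.head(outputs)  # (batch, seq_len, num_actions)

def augment_with_prefixes(demonstrations):
    """Turn each demonstration of length N into N demonstrations of
    lengths 1, 2, ..., N, so the model sees early-episode behavior."""
    return [demo[:end]
            for demo in demonstrations
            for end in range(1, len(demo) + 1)]

# One demonstration of length 3 yields three prefix demonstrations.
demo = [("s0", "a0"), ("s1", "a1"), ("s2", "a2")]
assert [len(d) for d in augment_with_prefixes([demo])] == [1, 2, 3]
```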

For the CRF I used a simple linear-chain CRF and added a dependency on the previous state. This again violates the Markov assumption, but worked very well: the CRF achieved performance close to that of GAIL and the LSTM while taking a fraction of the time to train.
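Here is a minimal sketch of that idea using the sklearn-crfsuite package, which is my assumption rather than the repository's confirmed dependency: crfsuite labels sequences of feature dictionaries, so each timestep's features include the current state and, going beyond the standard linear chain, the previous state as well.

```python
import sklearn_crfsuite

# Toy demonstrations: (list of state vectors, list of discrete actions).
train_demos = [
    ([[0.1, 0.0], [0.2, 0.1], [0.3, 0.1]], [0, 1, 1]),
    ([[0.0, 0.0], [0.1, -0.1]], [0, 2]),
]

def timestep_features(states, t):
    """Features for timestep t: the current state plus, beyond the
    standard linear chain, a dependency on the previous state."""
    feats = {f"s{i}": float(v) for i, v in enumerate(states[t])}
    if t > 0:
        # The extra previous-state dependency discussed above.
        feats.update({f"prev_s{i}": float(v)
                      for i, v in enumerate(states[t - 1])})
    else:
        feats["episode_start"] = 1.0
    return feats

# crfsuite expects feature-dict sequences and string labels per episode.
X = [[timestep_features(states, t) for t in range(len(states))]
     for states, _ in train_demos]
y = [[str(a) for a in actions] for _, actions in train_demos]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=100)
crf.fit(X, y)
```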

A more in-depth discussion of the results can be found in the paper and presentation. These write-ups, as well as the code, can be found on GitHub.
