Training an RL Agent to Solve a Classic Control Problem

In this section, we will train a reinforcement learning agent to solve a classic control problem, CartPole, building on the concepts explained previously. We will use OpenAI Baselines and, following the steps highlighted in the previous section, pass a custom fully connected network to the PPO algorithm as the policy network.

Let's quickly recap the CartPole control problem. It has a continuous four-dimensional observation space and a discrete action space with two actions. The observations are the position and velocity of the cart along its line of movement, together with the angle and angular velocity of the pole. The actions move the cart left or right along its rail. The reward is +1.0 for every step that does not result in a terminal state; a terminal state is reached if the pole tilts more than 12 degrees from the vertical or if the cart moves beyond the rail boundary at +/- 2.4. The environment is considered solved if the episode does not end before 200 steps have been completed.
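
The termination rule and the reward scheme described above can be sketched in a few lines of plain Python. This is only an illustration of the description in the text, not Gym's implementation: the function names are hypothetical, and the angle limit is left as a parameter rather than hard-coded.

```python
import math


def is_terminal(cart_position, pole_angle_rad, angle_limit_deg,
                position_limit=2.4):
    """Return True if the state ends the episode, per the description above."""
    angle_limit_rad = math.radians(angle_limit_deg)
    return (abs(cart_position) > position_limit
            or abs(pole_angle_rad) > angle_limit_rad)


def episode_return(states, angle_limit_deg, max_steps=200):
    """Add +1.0 for each non-terminal step; stop at the first terminal state."""
    total = 0.0
    for cart_position, pole_angle_rad in states[:max_steps]:
        if is_terminal(cart_position, pole_angle_rad, angle_limit_deg):
            break
        total += 1.0
    return total


# Hypothetical trajectory of (cart position, pole angle in radians) pairs.
# The last state leaves the rail, so only the first three steps are rewarded.
states = [(0.0, 0.0), (0.1, 0.01), (0.2, 0.02), (2.5, 0.03)]
# 12 degrees is the limit used in Gym's CartPole source code.
print(episode_return(states, angle_limit_deg=12.0))  # 3.0
```

Because every surviving step is worth +1.0, the episode reward is simply the number of steps the agent kept the pole up, which is why the mean episode length and the mean episode reward coincide in the training logs later in this exercise.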

Now, let's put all these concepts together by completing an exercise.

Exercise 4.04: Solving a CartPole Environment with the PPO Algorithm

We will solve the CartPole problem with the PPO algorithm using two slightly different approaches, so that you see both ways of working with OpenAI Baselines. The first approach relies on Baselines' infrastructure but follows a custom path in which a user-defined network serves as the policy network. The agent is trained and then run in the environment in a "manual" way, without relying on Baselines' automation, which gives you the chance to see what is happening under the hood. The second approach is simpler: we directly use Baselines' predefined command-line interface.

A custom deep network will be built that will encode environment states and create embeddings in the latent space. The OpenAI Baselines module will then take care of creating the remaining layer of the policy (and value) network for linking the embedding space with action spaces.

We will also write a function, adapted from an OpenAI Baselines function, that builds the Gym environment in the form the infrastructure expects. It adds nothing of its own, but it is required in order to use the rest of the Baselines modules.
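
To make the split between the custom encoder and the layers added by Baselines more concrete, here is a minimal plain-Python sketch. It is not Baselines code: the helper names, the random initialization, and the choice to let the policy and value heads share one encoder are assumptions made purely for illustration.

```python
import math
import random


def dense(inputs, weights, biases, activation=None):
    """Apply one fully connected layer to a list of floats."""
    outputs = []
    for row, bias in zip(weights, biases):
        z = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(activation(z) if activation else z)
    return outputs


def init_layer(n_in, n_out, rng):
    """Create small random weights and zero biases (illustrative only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
OBS_DIM, LATENT_DIM, NUM_ACTIONS = 4, 64, 2

# The custom part: two tanh layers that embed the observation.
encoder = [init_layer(OBS_DIM, LATENT_DIM, rng),
           init_layer(LATENT_DIM, LATENT_DIM, rng)]
# The part the library would add: linear heads on top of the embedding.
policy_head = init_layer(LATENT_DIM, NUM_ACTIONS, rng)
value_head = init_layer(LATENT_DIM, 1, rng)


def forward(observation):
    """Return the embedding, the action logits, and the state value."""
    latent = observation
    for weights, biases in encoder:
        latent = dense(latent, weights, biases, activation=math.tanh)
    logits = dense(latent, *policy_head)
    value = dense(latent, *value_head)[0]
    return latent, logits, value


latent, logits, value = forward([0.0, 0.1, 0.02, -0.1])
print(len(latent), len(logits))  # 64 2
```

The encoder only needs to know the observation shape; the number of actions enters at the policy head, which is why the same encoder definition can be reused across environments with different action spaces.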

Note

In order to properly run this exercise, you will need to install OpenAI Baselines. Please refer to the preface for the installation instructions.

Also, in order to properly train the RL agent, many episodes are needed, so the training phase may take several hours to complete. A set of weights for the pretrained agent will be provided at the end of this exercise so that you can see the trained agent in action.

Follow these steps to complete this exercise:

  1. Open a new Jupyter Notebook and import all the required modules from OpenAI Baselines and TensorFlow, together with NumPy, to use the PPO algorithm:

    from baselines.ppo2.ppo2 import learn
    from baselines.ppo2 import defaults
    from baselines.common.vec_env import VecEnv, VecFrameStack
    from baselines.common.cmd_util import make_vec_env, make_env
    from baselines.common.models import register
    import numpy as np  # used in step 5 to add a batch dimension
    import tensorflow as tf

  2. Define and register a custom multi-layer perceptron for the policy network. The function takes arguments that control the network architecture: the number of hidden layers, the number of neurons in each hidden layer, and the activation function:

    @register("custom_mlp")
    def custom_mlp(num_layers=2, num_hidden=64, activation=tf.tanh):
        """
        Stack of fully-connected layers to be used in a policy /
        q-function approximator

        Parameters:
        ----------
        num_layers: int   number of fully-connected layers (default: 2)
        num_hidden: int   size of fully-connected layers (default: 64)
        activation:       activation function (default: tf.tanh)

        Returns:
        -------
        function that builds fully connected network with a
        given input tensor / placeholder
        """
        def network_fn(input_shape):
            print('input shape is {}'.format(input_shape))
            x_input = tf.keras.Input(shape=input_shape)
            h = x_input
            # Stack num_layers dense layers on top of the input
            for i in range(num_layers):
                h = tf.keras.layers.Dense(
                    units=num_hidden,
                    name='custom_mlp_fc{}'.format(i),
                    activation=activation)(h)
            network = tf.keras.Model(inputs=[x_input], outputs=[h])
            network.summary()
            return network
        return network_fn

  3. Create a function that will build the environment in the format required by OpenAI Baselines:

    def build_env(env_id, env_type):
        if env_type in {'atari', 'retro'}:
            # Image-based environments: one environment, 4 stacked frames
            env = make_vec_env(env_id, env_type, 1, None,
                               gamestate=None, reward_scale=1.0)
            env = VecFrameStack(env, 4)
        else:
            # All other environments: one vectorized environment
            env = make_vec_env(env_id, env_type, 1, None,
                               reward_scale=1.0,
                               flatten_dict_observations=True)
        return env

  4. Build the CartPole-v0 environment, choose the necessary policy network parameters, and train it using the specific PPO learn function that has been imported:

    env_id = 'CartPole-v0'
    env_type = 'classic_control'
    print("Env type = ", env_type)
    env = build_env(env_id, env_type)
    hidden_nodes = 64
    hidden_layers = 2
    model = learn(network="custom_mlp", env=env,
                  total_timesteps=1e4, num_hidden=hidden_nodes,
                  num_layers=hidden_layers)

    While training, the model will produce an output similar to the following:

    Env type = classic_control
    Logging to /tmp/openai-2020-05-11-16-00-34-432546
    input shape is (4,)
    Model: "model"
    _________________________________________________________________
    Layer (type)                 Output Shape              Param #
    =================================================================
    input_1 (InputLayer)         [(None, 4)]               0
    _________________________________________________________________
    custom_mlp_fc0 (Dense)       (None, 64)                320
    _________________________________________________________________
    custom_mlp_fc1 (Dense)       (None, 64)                4160
    =================================================================
    Total params: 4,480
    Trainable params: 4,480
    Non-trainable params: 0
    _________________________________________________________________
    -------------------------------------------
    | eplenmean               | 22.3          |
    | eprewmean               | 22.3          |
    | fps                     | 696           |
    | loss/approxkl           | 0.00013790815 |
    | loss/clipfrac           | 0.0           |
    | loss/policy_entropy     | 0.6929994     |
    | loss/policy_loss        | -0.0029695872 |
    | loss/value_loss         | 44.237858     |
    | misc/explained_variance | 0.0143        |
    | misc/nupdates           | 1             |
    | misc/serial_timesteps   | 2048          |
    | misc/time_elapsed       | 2.94          |
    | misc/total_timesteps    | 2048          |

    This shows the policy network architecture, followed by a table of quantities tracked during training. The first two entries, for example, are the mean episode length and the mean episode reward.

  5. Run the trained agent in the environment and print the cumulative reward:

    obs = env.reset()
    # A non-vectorized environment returns a single observation,
    # so add a leading batch dimension for the model
    if not isinstance(env, VecEnv):
        obs = np.expand_dims(np.array(obs), axis=0)
    episode_rew = 0
    while True:
        # Query the trained policy for the next action
        actions, _, state, _ = model.step(obs)
        obs, reward, done, info = env.step(actions.numpy())
        if not isinstance(env, VecEnv):
            obs = np.expand_dims(np.array(obs), axis=0)
        env.render()
        print("Reward = ", reward)
        episode_rew += reward
        if done:
            print('Episode Reward = {}'.format(episode_rew))
            break
    env.close()

    The output should be similar to the following:

    #[...]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Reward = [1.]
    Episode Reward = [28.]

  6. Use the built-in OpenAI Baselines run script to train PPO on the CartPole-v0 environment:

    !python -m baselines.run --alg=ppo2 --env=CartPole-v0 \
    --num_timesteps=1e4 --save_path=./models/CartPole_2M_ppo2 \
    --log_path=./logs/CartPole/

    The last few lines of the output should be similar to the following:

    -------------------------------------------
    | eplenmean               | 20.8          |
    | eprewmean               | 20.8          |
    | fps                     | 675           |
    | loss/approxkl           | 0.00041882397 |
    | loss/clipfrac           | 0.0           |
    | loss/policy_entropy     | 0.692711      |
    | loss/policy_loss        | -0.004152138  |
    | loss/value_loss         | 42.336742     |
    | misc/explained_variance | -0.0112       |
    | misc/nupdates           | 1             |
    | misc/serial_timesteps   | 2048          |
    | misc/time_elapsed       | 3.03          |
    | misc/total_timesteps    | 2048          |
    -------------------------------------------

  7. Use the built-in OpenAI Baselines run script to run the trained model on the CartPole-v0 environment:

    !python -m baselines.run --alg=ppo2 --env=CartPole-v0 \
    --num_timesteps=0 \
    --load_path=./models/CartPole_2M_ppo2 --play

    The last few lines of the output should be similar to the following:

    episode_rew=27.0
    episode_rew=27.0
    episode_rew=11.0
    episode_rew=11.0
    episode_rew=13.0
    episode_rew=29.0
    episode_rew=28.0
    episode_rew=14.0
    episode_rew=18.0
    episode_rew=25.0
    episode_rew=49.0
    episode_rew=26.0
    episode_rew=59.0

  8. Use the pretrained weights provided to see the trained agent in action:

    !wget -O cartpole_1M_ppo2.tar.gz \
    "https://github.com/PacktWorkshops/The-Reinforcement-Learning-Workshop/blob/master/Chapter04/cartpole_1M_ppo2.tar.gz?raw=true"

    The output will be similar to the following:

    Saving to: 'cartpole_1M_ppo2.tar.gz'
    cartpole_1M_ppo2.ta 100%[===================>] 53,35K --.-KB/s in 0,05s
    2020-05-11 15:57:07 (1,10 MB/s) - 'cartpole_1M_ppo2.tar.gz' saved [54633/54633]

    You can extract the archive using the following command:

    !tar xvzf cartpole_1M_ppo2.tar.gz

    The last few lines of the output should be similar to the following:

    cartpole_1M_ppo2/ckpt-1.index
    cartpole_1M_ppo2/ckpt-1.data-00000-of-00001
    cartpole_1M_ppo2/
    cartpole_1M_ppo2/checkpoint

  9. Use the built-in OpenAI Baselines run script to run the pretrained agent on the CartPole-v0 environment:

    !python -m baselines.run --alg=ppo2 --env=CartPole-v0 \
    --num_timesteps=0 --load_path=./cartpole_1M_ppo2 --play

    The output will be similar to the following:

    episode_rew=16.0
    episode_rew=200.0
    episode_rew=200.0
    episode_rew=200.0
    episode_rew=26.0
    episode_rew=176.0

    This step shows how a trained agent behaves when it solves the CartPole environment, using a ready-made set of weights for the policy network. The output has the same format as in step 7, but several episodes now reach a reward of 200.0, which means the pole stayed balanced for all 200 steps.

    Note

    To access the source code for this specific section, please refer to https://packt.live/2XS69n8.

    This section does not currently have an online interactive example, and will need to be run locally.
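
Two numbers in the training output of step 4 can be checked by hand: the parameter counts in the network summary and the initial policy entropy. The short sketch below does the arithmetic; the helper name is hypothetical, and reading the entropy as that of a near-uniform policy over two actions is an interpretation of the log rather than something the log states.

```python
import math


def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units


fc0 = dense_params(4, 64)    # 4 observations -> 64 hidden units
fc1 = dense_params(64, 64)   # 64 hidden units -> 64 hidden units
total = fc0 + fc1
print(fc0, fc1, total)  # 320 4160 4480

# Entropy (in nats) of a policy that picks left or right with equal probability
uniform_entropy = -sum(p * math.log(p) for p in (0.5, 0.5))
print(round(uniform_entropy, 4))  # 0.6931
```

The logged policy entropy of about 0.693 after the first update is therefore very close to ln 2, which is consistent with an agent that still chooses between the two actions almost at random.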

In this exercise, we learned how to train a reinforcement learning agent capable of solving the CartPole classic control problem. We successfully used a custom fully connected network as a policy network. This allowed us to take a look at what happens behind the automation provided by OpenAI Baselines' command-line interface. In this hands-on exercise, we have also familiarized ourselves with OpenAI Baselines' out-of-the-box method, confirming that it is a straightforward resource that can be easily used to train a reinforcement learning agent.
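
As a closing aside on the training logs seen in steps 4 and 6, the loss/clipfrac and loss/approxkl entries can be understood with a small plain-Python sketch. It follows the way these diagnostics are commonly computed for PPO, from the negative log-probabilities of the sampled actions under the old and the updated policy; the function name, the sample values, and the clip range of 0.2 are assumptions for illustration, not values read from the exercise.

```python
import math


def ppo_diagnostics(old_neglogp, new_neglogp, clip_range=0.2):
    """Return (clip fraction, approximate KL) for a batch of sampled actions."""
    n = len(old_neglogp)
    # Probability ratio pi_new(a|s) / pi_old(a|s) for each sample
    ratios = [math.exp(old - new)
              for old, new in zip(old_neglogp, new_neglogp)]
    # Share of samples whose ratio falls outside [1 - clip, 1 + clip]
    clip_frac = sum(abs(r - 1.0) > clip_range for r in ratios) / n
    # Second-order estimate of the KL divergence between the two policies
    approx_kl = 0.5 * sum((new - old) ** 2
                          for old, new in zip(old_neglogp, new_neglogp)) / n
    return clip_frac, approx_kl


# Hypothetical batch: only the last sample moved far enough to be clipped
old = [0.69, 0.69, 0.69, 0.69]
new = [0.69, 0.70, 0.68, 1.00]
clip_frac, approx_kl = ppo_diagnostics(old, new)
print(clip_frac)  # 0.25
```

A clip fraction of 0.0 together with a very small approximate KL, as in the logs above, indicates that the first update changed the policy only slightly.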

Activity 4.01: Training a Reinforcement Learning Agent to Play a Classic Video Game

In this activity, the challenge is to adopt the same approach we used in Exercise 4.04, Solving a CartPole Environment with the PPO Algorithm, to create a reinforcement learning bot that achieves better-than-human performance on a classic Atari video game, Pong. The game works as follows: two paddles, one per player, can move up and down. The goal is to make the white ball pass the opponent's paddle to score one point. The game ends when one of the two players reaches a score of 21.

We will follow an approach similar to the one in Exercise 4.04, Solving a CartPole Environment with the PPO Algorithm, this time with a custom convolutional neural network that works as the encoder for the environment's observation (the pixel frame):

Figure 4.12: One frame of the Pong game

OpenAI Gym will be used to create the environment, while the OpenAI Baselines module will be used to train a custom policy network using the PPO algorithm.
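
When designing a convolutional encoder for pixel observations, it helps to work out the spatial size of each layer's output before writing the network. The sketch below does this arithmetic under stated assumptions: frames preprocessed to 84 x 84 pixels, no padding, and a three-layer kernel/stride configuration that is common for Atari agents. None of these values is prescribed by the activity, and the helper name is hypothetical.

```python
def conv_output_size(input_size, kernel_size, stride):
    """Spatial output size of a convolution without padding."""
    return (input_size - kernel_size) // stride + 1


# Assumed input: 84 x 84 frames (the frame-stack depth affects
# channels only, not the spatial size computed here)
size = 84
sizes = []
# Assumed (kernel, stride) pairs for three convolutional layers
for kernel, stride in [(8, 4), (4, 2), (3, 1)]:
    size = conv_output_size(size, kernel, stride)
    sizes.append(size)
print(sizes)  # [20, 9, 7]

# With an assumed 64 filters in the last layer, the flattened
# embedding passed on to the policy and value layers has this length
flattened = size * size * 64
print(flattened)  # 3136
```

If you choose different kernel sizes or strides for your custom network, the same function tells you whether the feature map shrinks to a sensible size before it is flattened.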

As in Exercise 4.04, Solving a CartPole Environment with the PPO Algorithm, we will implement both the custom approach (using specific OpenAI Baselines modules) and the simple one (using the built-in general command-line interface), in steps 1 to 5 and steps 6 to 7, respectively.

Note

In order to run this activity, you will need to install OpenAI Baselines. Please refer to the preface for the installation instructions.

In order to properly train the RL agent, many episodes are needed, so the training phase may take several hours to complete. A set of weights you can use for a pretrained agent has been provided at this address: https://packt.live/2XSY4yz. Use them to see the trained agent in action.

The following steps will help you complete this activity:

  1. Import all the required modules from OpenAI Baselines and TensorFlow in order to use the PPO algorithm.
  2. Define and register a custom convolutional neural network for the policy network.
  3. Create a function to build the environment in the format required by OpenAI Baselines.
  4. Build the PongNoFrameskip-v4 environment, choose the required policy network parameters, and train it.
  5. Run the trained agent in the environment and print the cumulative reward.
  6. Use the built-in OpenAI Baselines run script to train PPO on the PongNoFrameskip-v0 environment.
  7. Use the built-in OpenAI Baselines run script to run the trained model on the PongNoFrameskip-v0 environment.
  8. Use the pretrained weights provided to see the trained agent in action.

    At the end of this activity, the agent is expected to easily win most of the time.

    The final score of the agent should be like the one represented in the following frame most of the time:

Figure 4.13: One frame of the real-time environment, after rendering

Note

The solution to this activity can be found on page 704.