Amazon Onboarding with Learning Manager Chanci Turner


Amazon SageMaker offers a fully managed platform that empowers developers and data scientists to efficiently build, train, and deploy machine learning (ML) models at scale. Beyond traditional supervised and unsupervised learning methods, it’s also possible to develop reinforcement learning (RL) models using Amazon SageMaker RL.

Amazon SageMaker RL comes equipped with pre-built RL libraries and algorithms that simplify getting started with reinforcement learning. For further details, check out Amazon SageMaker RL – Managed Reinforcement Learning with Amazon SageMaker. The service integrates with a variety of simulation environments, including AWS RoboMaker, OpenAI Gym, custom-built environments, and other open-source options, making it easier to train RL models. You can also use the Amazon SageMaker RL containers for MXNet and TensorFlow, which bundle toolkits such as OpenAI Gym, Intel Coach, and Berkeley Ray RLLib.

In this post, we explore how to use Amazon SageMaker RL to implement batch reinforcement learning (batch RL). In batch RL, the complete learning experience, typically a collection of transitions sampled from the system, is provided before training begins: a set of state-action transitions gathered under earlier policies is used to train a new RL policy without any further interaction with the environment.

We will also demonstrate how to collect offline data from an initial random policy, train an RL policy using the offline data, and generate action predictions from the trained policy. These predictions can be utilized to gather further offline data for the next phase of RL policy training.

Understanding Batch RL

Reinforcement learning has proven effective in addressing challenges across diverse fields, such as portfolio management, robotics, and energy optimization. Unlike traditional ML methods, RL does not rely on pre-existing training data. Instead, it involves an agent interacting with an environment (which may be real or simulated) to learn a policy that dictates the optimal sequence of actions based on the rewards or penalties received for each action.

However, many real-world applications necessitate that the RL agent learns from historical data generated by a deployed policy. For instance, historical records of expert gameplay, user interactions on websites, or sensor data from control systems can be utilized as input for training a new, enhanced RL policy. This approach is termed batch RL, wherein the learning agent derives an improved policy from a fixed dataset of offline samples. For additional insights, refer to the “Batch Reinforcement Learning” chapter in the book “Reinforcement Learning: State-of-the-Art.”
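To make this concrete, each offline sample is a logged transition: the state observed, the action the deployed policy chose, the probability it assigned to that action, and the reward received. A minimal sketch of such a record in Python is shown below; the field names mirror the rollout attributes used later in this post and are illustrative rather than a fixed schema.

    # A single logged transition from a previously deployed policy (illustrative field names)
    transition = {
        'state_features': [0.02, -0.01, 0.03, 0.04],  # observation when the action was taken
        'action': 1,                                   # action chosen by the behavior policy
        'all_action_probabilities': [0.5, 0.5],        # behavior policy's action distribution
        'reward': 1.0,                                 # reward returned by the environment
        'episode_id': 0,                               # trajectory this step belongs to
    }

    # A batch RL dataset is a fixed collection of such transitions, gathered up front
    # and never extended by interacting with the environment during training.
    offline_dataset = [transition]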

This article includes a supplementary notebook that demonstrates how to train a new policy using batch RL from an offline dataset generated with predictions from a previously deployed policy. You can find more details in the GitHub repository.

To create the offline dataset, we will use Amazon SageMaker’s batch transform feature, a high-performance, high-throughput capability for generating inferences on large datasets. By collecting inferences from Batch Transform along with environment rewards, we can effectively enhance our policy through batch RL. For further information, see Get Inferences for an Entire Dataset with Batch Transform.
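As a preview of how batch transform is typically invoked from the SageMaker Python SDK, the sketch below creates a transformer from a trained estimator and scores a CSV of states stored in Amazon S3. This assumes estimator is a trained Amazon SageMaker estimator (such as the RL estimator trained later in this post); the S3 paths and instance type are placeholders.

    # Minimal batch transform sketch (S3 paths and instance type are placeholders)
    batch_input = 's3://<your-bucket>/batch-rl/states.csv'      # states to score with the trained policy
    batch_output = 's3://<your-bucket>/batch-rl/predictions/'   # where action predictions are written

    transformer = estimator.transformer(
        instance_count=1,
        instance_type='ml.m5.xlarge',
        output_path=batch_output,
    )

    # Split the CSV by line so each row is sent to the model as one record
    transformer.transform(
        data=batch_input,
        content_type='text/csv',
        split_type='Line',
    )
    transformer.wait()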

Implementing Batch RL on Amazon SageMaker

In this example, we will apply batch RL to the CartPole balancing problem, which involves a pole attached to a cart that moves along a frictionless track. The RL problem can be framed as follows:

  • Objective: Keep the pole from falling
  • Environment: OpenAI Gym
  • State: Cart position, cart velocity, pole angle, pole velocity at tip
  • Action: Push the cart left or right
  • Reward: 1 point for each step taken, including the last step

For more information, visit Use Reinforcement Learning with Amazon SageMaker.
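For readers new to the environment, the sketch below shows the classic OpenAI Gym interaction loop for CartPole-v0 and how the state, action, and reward defined above appear in code; a random action is used here purely for illustration.

    import gym

    env = gym.make('CartPole-v0')
    state = env.reset()   # [cart position, cart velocity, pole angle, pole velocity at tip]

    done, episode_return = False, 0.0
    while not done:
        action = env.action_space.sample()            # 0 = push cart left, 1 = push cart right
        state, reward, done, info = env.step(action)  # reward is 1 for every step taken
        episode_return += reward

    print('Episode return:', episode_return)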

The high-level implementation of batch RL involves the following steps:

  1. Simulate an initial policy and gather data from it.
  2. Train an RL policy using offline data from the initial policy without further interaction with the simulator.
  3. Visualize and evaluate the performance of the trained RL policy.
  4. Employ Amazon SageMaker’s batch transform to make batch inferences from the trained RL policy.

These steps focus specifically on batch RL implementation within Amazon SageMaker. Other necessary steps, including importing libraries and setting permissions, are not covered in this post. For more details, refer to the GitHub repository.

Simulating a Random Policy and Collecting Data

To train using batch RL, you must simulate the data batches generated by a previously deployed policy. In practical applications, off-policy data can be collected by interacting with the live environment using existing policies. For this example, we use OpenAI Gym's CartPole-v0 as the environment and a random policy with a uniform action distribution to simulate a deployed agent.

Follow these steps:

  1. Create 100 instances of CartPole-v0 and collect five episodes of data from each:

    # initiate 100 environments to collect rollout data
    NUM_ENVS = 100
    NUM_EPISODES = 5
    vectored_envs = VectoredGymEnvironment('CartPole-v0', NUM_ENVS)

    This results in a total of 500 episodes for training. You can obtain more trajectories by interacting with multiple environments simultaneously.

  2. Start with a random policy that assigns uniform action probabilities in every state:

    # initiate a random policy by setting action probabilities as uniform distribution
    action_probs = [[1/2, 1/2] for _ in range(NUM_ENVS)]
    df = vectored_envs.collect_rollouts_with_given_action_probs(action_probs=action_probs, num_episodes=NUM_EPISODES)

    # the rollout dataframe contains the attributes: action, all_action_probabilities,
    # episode_id, reward, cumulative_rewards, state_features
    df.head()

    # average cumulative rewards across episodes
    avg_rewards = df['cumulative_rewards'].sum() / (NUM_ENVS * NUM_EPISODES)
    print('Average cumulative rewards over {} episodes was {}.'.format((NUM_ENVS * NUM_EPISODES), avg_rewards))

    The average cumulative reward over these 500 episodes is 22.22.

  3. Save the dataframe as a CSV file for future use:

    # dump dataframe as csv file
    df.to_csv("src/cartpole_dataset.csv", index=False)

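As an optional sanity check (not part of the original notebook), you can reload the saved file with pandas and confirm that the expected columns are present before moving on to training:

    import pandas as pd

    # reload the offline dataset written above and inspect its shape and columns
    offline_df = pd.read_csv('src/cartpole_dataset.csv')
    print(offline_df.shape)
    print(offline_df.columns.tolist())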

Training an RL Policy with Offline Data

At this stage, we have the offline data needed to train an RL policy. We implement a deep RL approach that updates the policy off-policy using the double deep Q-network (DDQN) algorithm (see Deep Reinforcement Learning with Double Q-learning on arXiv) combined with batch-constrained deep Q-learning (BCQ). DDQN addresses the overestimation of action values that is typical of standard Q-learning, while BCQ learns an improved policy from a fixed dataset by restricting the chosen actions to reduce extrapolation error from inaccurate value estimates of unseen state-action pairs. Training is performed entirely offline; note that the dataset must contain enough exploratory interactions for the algorithm to learn effectively.
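The training itself runs as an Amazon SageMaker RL training job. The sketch below shows one way such a job is typically launched with the SageMaker Python SDK's RLEstimator and the Coach toolkit; the entry point script, toolkit version, preset name, and hyperparameters are assumptions for illustration, and the exact values live in the accompanying notebook and its src/ directory.

    from sagemaker.rl import RLEstimator, RLToolkit, RLFramework

    # Sketch of launching the offline (batch RL) training job.
    # Entry point, preset name, and hyperparameters are illustrative; see the notebook's src/ directory.
    estimator = RLEstimator(
        entry_point='train-coach.py',        # launcher script assumed to load the DDQN + BCQ preset
        source_dir='src',                    # also contains cartpole_dataset.csv collected above
        toolkit=RLToolkit.COACH,
        toolkit_version='1.0.0',
        framework=RLFramework.TENSORFLOW,
        role=role,                           # IAM execution role created during setup (not shown in this post)
        instance_count=1,
        instance_type='ml.m5.xlarge',
        hyperparameters={
            'RLCOACH_PRESET': 'preset-cartpole-ddqn-bcq',  # assumed preset wiring DDQN + BCQ to the offline CSV
            'save_model': 1,
        },
    )

    estimator.fit()   # trains entirely from the offline dataset; no simulator interaction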


