In this introductory tutorial, we'll apply reinforcement learning (RL) to train an agent to solve the 'Taxi' environment from OpenAI Gym. We'll cover:
- A basic introduction to RL
- Setting up OpenAI Gym & Taxi
- Step-by-step tutorial on how to train a Taxi agent in Python3 using RL
Before we start, what's 'Taxi'?
Taxi is one of many environments available on OpenAI Gym. These environments are used to develop and train reinforcement learning agents.
The goal of Taxi is to pick up passengers and drop them off at their destination in the fewest possible moves. In this tutorial, you'll start with an agent that plays randomly:
…and successfully apply reinforcement learning to train an agent to solve the game:
💡 An introduction to Reinforcement Learning
Think about how you might teach a dog a new trick, like telling it to sit:
- If it performs the trick correctly (it sits), you'll reward it with a treat (positive feedback) ✔️
- If it doesn't sit correctly, it doesn't get a treat (negative feedback) ❌
By repeating the behaviour that leads to a positive outcome, the dog learns to sit when it hears the command in order to get its treat. Reinforcement learning is a subdomain of machine learning in which an 'agent' (the dog) learns which actions to take (sitting in response to the command 'sit') in its environment in order to maximise its reward (getting a treat). This can be illustrated more formally as:
Source: Sutton & Barto
🏋️ Installing OpenAI Gym and Taxi
We'll be using the 'Taxi-v3' environment for this tutorial.
You'll need to install:
- OpenAI Gym
pip install gym
- NumPy
pip install numpy
The following snippet will import the necessary packages, and create the Taxi environment:
import numpy as np
import gym
import random
# create Taxi environment
env = gym.make('Taxi-v3')
🎲 Random Agent
We'll start by implementing an agent that doesn't learn at all. Instead, it will select actions at random. This will be our baseline.
The first step is to give our agent the initial state. A state tells our agent what the current environment looks like. In Taxi, a state defines the current positions of the taxi, passenger, and pick-up and drop-off locations. Below are examples of three different states for Taxi:
Note: Yellow = taxi, Blue letter = pickup location, Purple letter = drop-off destination
To get the initial state:
# create a new instance of taxi, and get the initial state
state = env.reset()
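If you're curious about what that state actually is: Taxi-v3 encodes each of its 500 possible states as a single integer, and the underlying environment class provides a decode method to unpack it. Here's a quick sketch, continuing from the snippet above (treat the exact attribute access as an assumption about your Gym version):

# decode the integer state into its components
# (taxi row, taxi column, passenger location index, destination index)
taxi_row, taxi_col, passenger_loc, destination = env.unwrapped.decode(state)
print(f"state: {state}")
print(f"taxi position: ({taxi_row}, {taxi_col}), passenger: {passenger_loc}, destination: {destination}")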
Next, we'll run a for-loop to cycle through the game. At each iteration, our agent will:
- Take a random action from the action space (0 - south, 1 - north, 2 - east, 3 - west, 4 - pick-up, 5 - drop-off)
- Receive the new state
Here's our random agent script:
import gym
import numpy as np
import random

# create Taxi environment
env = gym.make('Taxi-v3')

# create a new instance of taxi, and get the initial state
state = env.reset()

num_steps = 99
for s in range(num_steps+1):

    print(f"step: {s} out of {num_steps}")

    # sample a random action from the list of available actions
    action = env.action_space.sample()

    # perform this action on the environment
    env.step(action)

    # print the new state
    env.render()

# end this instance of the taxi environment
env.close()
You can run this and watch your agent make random moves. Not super exciting, but hopefully this helped you get familiar with the OpenAI Gym toolkit.
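Before moving on, it's worth putting a number on this baseline. The sketch below reuses the same loop, but keeps track of the total reward returned by env.step - expect a heavily negative score:

# measure the random agent's total reward over one episode
env = gym.make('Taxi-v3')
state = env.reset()

num_steps = 99
total_reward = 0

for s in range(num_steps+1):
    action = env.action_space.sample()                 # random action
    new_state, reward, done, info = env.step(action)   # observe the reward
    total_reward += reward
    if done:
        break

print(f"random agent's total reward: {total_reward}")
env.close()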
Next, we'll implement the key algorithm that will enable our agent to learn from the environment in order to solve Taxi.
📖 Q-Learning Agent
Q-learning is a reinforcement learning algorithm that seeks to find the best possible next action given its current state, in order to maximise the reward it receives (the 'Q' in Q-learning stands for quality - i.e. how valuable an action is).
Let's take the following starting state:
Which action (south, north, east, west, pick-up or drop-off) should it take in order to maximise its reward? (Note: blue = pick-up location, purple = drop-off destination)
First, let's take a look at how our agent is 'rewarded' for its actions. Remember in reinforcement learning, we want our agent to take actions that will maximise the possible rewards it receives from its environment.
'Taxi' reward system
According to the Taxi documentation:
"…you receive +20 points for a successful drop-off, and lose 1 point for every timestep it takes. There is also a 10 point penalty for illegal pick-up and drop-off actions."
Looking back at our original state, the possible actions it can take and the corresponding rewards it will receive are shown below:
In the image above, the agent loses 1 point per timestep it takes. It will also lose 10 points if it uses the pick-up or drop-off action here.
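You can verify these reward values yourself by stepping through the environment manually. A quick sketch - note that the exact layout depends on the random initial state, so the -10 only appears if the taxi isn't already at the passenger's location:

# check the reward signal by hand
env = gym.make('Taxi-v3')
state = env.reset()

new_state, reward, done, info = env.step(0)   # 0 = south: costs one timestep
print(reward)                                  # -1

new_state, reward, done, info = env.step(4)   # 4 = pick-up: illegal unless the taxi is at the passenger
print(reward)                                  # -10 in most starting states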
We want our agent to go north towards the pick-up location (the blue R), but how will it know which action to take if they are all equally punishing?
Exploration
Our agent currently has no way of knowing which action will lead it closest to the blue R. This is where trial-and-error comes in - we'll have our agent take random actions, and observe what rewards it gets, i.e. our agent will do some exploration of the environment. This is what our random agent was doing earlier.
Over many iterations, our agent will have observed that certain sequences of actions will be more rewarding than others. Along the way, our agent will need to keep track of which actions led to what rewards.
Introducing… Q-tables
A Q-table is simply a look-up table storing the maximum expected future reward for each action in each state (these values are known as Q-values). It tells our agent that, in a given state, some actions are more likely than others to lead to higher rewards. In effect, it becomes a 'cheatsheet' telling our agent the best action to take.
The image below illustrates what our 'Q-table' will look like:
- Each row corresponds to a unique state in the 'Taxi' environment
- Each column corresponds to an action our agent can take
- Each cell corresponds to the Q-value for that state-action pair - a higher Q-value means a higher maximum reward our agent can expect to get if it takes that action in that state.
Before we begin training our agent, we'll need to initialise our Q-table like so:
state_size = env.observation_space.n # total number of states (S)
action_size = env.action_space.n # total number of actions (A)
# initialize a qtable with 0's for all Q-values
qtable = np.zeros((state_size, action_size))
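For Taxi-v3, this gives a 500 x 6 table - 500 possible states and 6 possible actions - with every Q-value starting at zero:

print(f"Q-table shape: {qtable.shape}")   # (500, 6)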
As our agent explores, it will update the Q-table with the Q-values it finds. To calculate our Q-values, we'll introduce the Q-learning algorithm.
Q-Learning Algorithm
The Q-learning algorithm updates the current Q-value using the rule below. We won't go into the details of where it comes from, but you can read more about it in Ch 6 of Sutton & Barto (2018).
Q(S_t, A_t) := Q(S_t, A_t) + α * (R_(t+1) + γ * max_a Q(S_(t+1), a) - Q(S_t, A_t))
This update adjusts the current Q-value (Q(S_t, A_t)) using what the agent observes after taking an action: Q increases if the action led to a positive reward, and decreases if it led to a negative one.
Note that in Taxi, our agent doesn't receive a positive reward until it successfully drops off a passenger (+20 points). Hence even if our agent is heading in the correct direction, there will be a delay in the positive reward it should receive. The term γ * max_a Q(S_(t+1), a) in the equation addresses this: it folds a portion of the best Q-value achievable from the next state (S_(t+1)) into the current Q-value, so rewards received later propagate back to the state-action pairs that led to them. The 'a' here ranges over all the possible actions available in that state. The equation also contains two hyperparameters which we can specify:
- Learning rate (α): how easily the agent should accept new information over previously learnt information
- Discount factor (γ): how much the agent should take into consideration the rewards it could receive in the future versus its immediate reward
Here's our code for implementing the Q-learning algorithm:
# hyperparameters to tune
learning_rate = 0.9
discount_rate = 0.8
# dummy variables
reward = 10 # R_(t+1)
state = env.observation_space.sample() # S_t
action = env.action_space.sample() # A_t
new_state = env.observation_space.sample() # S_(t+1)
# Qlearning algorithm: Q(s,a) := Q(s,a) + learning_rate * (reward + discount_rate * max Q(s',a') - Q(s,a))
qtable[state, action] += learning_rate * (reward + discount_rate * np.max(qtable[new_state,:]) - qtable[state,action])
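To make the update concrete, here's a worked example with the hyperparameters above: if the current Q-value is 0, the reward is -1 and the best Q-value in the next state is 0, the new Q-value becomes 0 + 0.9 * (-1 + 0.8 * 0 - 0) = -0.9. A quick check in code:

# worked example of a single Q-update
q_current = 0.0      # Q(s,a) before the update
reward = -1          # one timestep spent moving
q_next_max = 0.0     # best Q-value in the next state (still 0 early in training)

q_new = q_current + learning_rate * (reward + discount_rate * q_next_max - q_current)
print(q_new)  # -0.9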
Exploration vs Exploitation Trade-off
Earlier, we let our agent explore the environment to update our Q-table. As our agent learns more about the environment, we can let it use this knowledge to take more optimal actions - known as exploitation.
During exploitation, our agent will look at its Q-table and select the action with the highest Q-value (instead of a random action). Over time, our agent will need to explore less, and start exploiting what it knows instead.
There are many ways to implement an exploration-exploitation strategy. Here's just one example:
# dummy variables
episode = random.randint(0,500)
state = env.observation_space.sample()
qtable = np.random.randn(env.observation_space.n, env.action_space.n)

# exploration-exploitation tradeoff
epsilon = 1.0     # probability that our agent will explore
decay_rate = 0.01 # of epsilon

if random.uniform(0,1) < epsilon:
    # explore
    action = env.action_space.sample()
else:
    # exploit
    action = np.argmax(qtable[state,:])

# epsilon decreases exponentially --> our agent will explore less and less
epsilon = np.exp(-decay_rate*episode)
In the example above, we set some value epsilon between 0 and 1. If epsilon is 0.7, there is a 70% chance that on this step our agent will explore instead of exploit. Since epsilon decays exponentially with the episode number, our agent will explore less and less as training progresses.
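To see how quickly the agent shifts from exploring to exploiting, you can print epsilon at a few points of the decay schedule (using decay_rate = 0.01 from the snippet above):

# epsilon after a given number of episodes, with decay_rate = 0.01
decay_rate = 0.01
for episode in [0, 50, 100, 250, 500]:
    print(f"episode {episode}: epsilon = {np.exp(-decay_rate*episode):.3f}")

# episode 0: epsilon = 1.000
# episode 50: epsilon = 0.607
# episode 100: epsilon = 0.368
# episode 250: epsilon = 0.082
# episode 500: epsilon = 0.007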
Bringing it all together
We're done with all the building blocks needed for our reinforcement learning agent. The process for training our agent will look like:
- Initialise our Q-table with 0's for all Q-values
- Let our agent play Taxi over a large number of games
- Continuously update the Q-table using the Q-learning algorithm and an exploration-exploitation strategy
Here's the full implementation:
import numpy as np
import gym
import random

def main():

    # create Taxi environment
    env = gym.make('Taxi-v3')

    # initialize q-table
    state_size = env.observation_space.n
    action_size = env.action_space.n
    qtable = np.zeros((state_size, action_size))

    # hyperparameters
    learning_rate = 0.9
    discount_rate = 0.8
    epsilon = 1.0
    decay_rate = 0.005

    # training variables
    num_episodes = 1000
    max_steps = 99  # per episode

    # training
    for episode in range(num_episodes):

        # reset the environment
        state = env.reset()
        done = False

        for s in range(max_steps):

            # exploration-exploitation tradeoff
            if random.uniform(0,1) < epsilon:
                # explore
                action = env.action_space.sample()
            else:
                # exploit
                action = np.argmax(qtable[state,:])

            # take action and observe reward
            new_state, reward, done, info = env.step(action)

            # Q-learning algorithm
            qtable[state,action] = qtable[state,action] + learning_rate * (reward + discount_rate * np.max(qtable[new_state,:]) - qtable[state,action])

            # Update to our new state
            state = new_state

            # if done, finish episode
            if done:
                break

        # Decrease epsilon
        epsilon = np.exp(-decay_rate*episode)

    print(f"Training completed over {num_episodes} episodes")
    input("Press Enter to watch trained agent...")

    # watch trained agent
    state = env.reset()
    done = False
    rewards = 0

    for s in range(max_steps):

        print("TRAINED AGENT")
        print("Step {}".format(s+1))

        action = np.argmax(qtable[state,:])
        new_state, reward, done, info = env.step(action)
        rewards += reward

        env.render()
        print(f"score: {rewards}")
        state = new_state

        if done:
            break

    env.close()

if __name__ == "__main__":
    main()
👏 What's next?
There are many other environments available on OpenAI Gym for you to try (e.g. Frozen Lake). You can also try optimising the implementation above to solve Taxi in fewer episodes.
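One way to check whether your tweaks actually help is to evaluate the trained (greedy) policy over many episodes and compare average scores. Here's a minimal sketch - evaluate is a hypothetical helper, and it assumes env and qtable are still available (e.g. returned from main() or run at the top level):

# evaluate the greedy policy over several episodes and report the average score
def evaluate(env, qtable, num_eval_episodes=100, max_steps=99):
    total = 0
    for _ in range(num_eval_episodes):
        state = env.reset()
        episode_reward = 0
        for _ in range(max_steps):
            action = np.argmax(qtable[state, :])            # always exploit
            state, reward, done, info = env.step(action)
            episode_reward += reward
            if done:
                break
        total += episode_reward
    return total / num_eval_episodes

print(f"average score over 100 episodes: {evaluate(env, qtable):.1f}")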
Some other useful resources include:
- A good article on RL and its real-world applications
- Deep Reinforcement Learning Series by Jonathan Hui
- AlphaGo full documentary (on Youtube)
- Reinforcement Learning by Sutton and Barto