Previously, I reviewed core concepts in Reinforcement Learning (RL) and introduced important parts of the OpenAI Gym API. You can review that introduction here:
Train Robots to Shoot Lazers at Aliens With Software Funded By Elon Musk and Microsoft (Charles Landau, Jul 25 '19)
In this installment we'll train a tank commander. I'm sharing a bunch of the code for this project here:
CharlesDLandau / rlNotebooks: Reinforcement learning notebooks for posting on the internet.
As these notebooks are instructional and experimental, I don't recommend running them locally. I test and run these notebooks on Google Colab or Kaggle before I post them -- you can run them there too!
Side note: OpenAI Gym installation
...is somewhat painful. Many online notebook services like Colab and Kaggle don't allow you to install some of the OpenAI environments, so I'm going to stick to Atari for now. If you're interested in setting up OpenAI Gym with more flexibility, you might start with this interesting write-up.
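For reference, if you do want to try this outside of those services, the Atari environments can usually be installed with the atari extra (exact requirements vary by platform and gym version):

!pip install gym[atari]  # in a notebook cell; some platforms also need system packages like cmake and zlib first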
In order to write agents that actually take the game screen into account when making decisions, we'll need to update our run_job
utility from last time:
action = model.decision_function(obs=observation, env=env)
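For context, here is a minimal sketch of what run_job looks like with that change (simplified; the real version in the repo also records video and the model's raw predictions):

import gym

def run_job(env_name, model, n_steps, episodes_per_train=1):
    # Simplified sketch: step the environment, let the model decide, log everything.
    env = gym.make(env_name)
    observation = env.reset()
    history, episode = [], 0
    for step in range(n_steps):
        action = model.decision_function(obs=observation, env=env)
        observation, reward, done, info = env.step(action)
        history.append({"episode": episode, "observation": observation,
                        "action": action, "reward": reward, "info": info})
        if done:
            episode += 1
            if episodes_per_train and (episode % episodes_per_train == 0):
                model.train_on_history(history)
            observation = env.reset()
    return {"history": history, "env": env,
            "parameters": {"env_name": env_name, "n_steps": n_steps,
                           "episodes_per_train": episodes_per_train}}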
And our RandomAgent will need to be modified accordingly:
class RandomAgentContainer:
    """A model that takes random actions and doesn't learn."""

    def __init__(self):
        pass

    def decision_function(self, obs=None, env=None):
        if env:
            return env.action_space.sample()
        else:
            return 0

    def train_on_history(self, history):
        pass
model = RandomAgentContainer()
Now we can use our RandomAgent to explore all the information that our job creates.
result = run_job("Robotank-v0", model, 10000, episodes_per_train=0);
result.keys()
# Output
dict_keys(['history', 'env', 'parameters'])
Our job produced video of the game being played, as well as a history of images, actions, predictions, and rewards. It also saved the environment object from OpenAI Gym.
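Each entry in the history is a per-step record; a quick peek at one entry shows roughly what we have to work with:

first_step = result['history'][0]
print(sorted(first_step.keys()))  # per-step fields like 'observation', 'action', 'reward', 'info', 'episode'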
Let's begin by trying to understand our observations in a video:
render_video(0, result['env']);
If, like me, you have never played Robot Tank on an Atari... you can read the manual! It covers the heads-up display and more.
So anyway, it looks like the image has a bunch of noise in it. Let's see if we can pick the useful parts out of it...
import matplotlib
from matplotlib.image import imread
from matplotlib.pyplot import imshow
observation_sample = [im['observation'] for im in result.get('history')]
imshow(observation_sample[0])
So using the ticks on the axes and with a little trial and error we can find the bounding boxes for the radar panel and the periscope.
imshow(observation_sample[0][139:172, 60:100, :])
So, we can certainly crop this image and worry less about the noise...
radar_bounding_box = ((139, 172), (60, 100), (None))
From reading the manual we know that one of the four indicators bracketing this radar display is "R" for "radar." In other words, we can't rely on radar as the only input, because all of those indicators represent subsystems that can be disabled.
Let's also take a bounding box for the periscope:
imshow(observation_sample[0][80:124, 12:160, :])
peri_bounding_box = ((80, 124), (12, 160), (None))
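Both boxes use the same ((row_start, row_stop), (col_start, col_stop), channels) convention, so a tiny helper (hypothetical, just for convenience) keeps the cropping readable:

def crop(observation, bbox):
    # bbox follows the ((row_start, row_stop), (col_start, col_stop), (None)) convention above
    (rows, cols, _) = bbox
    return observation[rows[0]:rows[1], cols[0]:cols[1], :]

imshow(crop(observation_sample[0], peri_bounding_box))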
Let's also check the info
field because it sometimes has observation data.
result.get('history')[0].get('info')
# Output
{'ale.lives': 4}
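Not much, but ale.lives is how many tanks we have left, so plotting it across the whole job shows exactly where each tank was destroyed -- which will matter later when we split experience up by in-game life:

import matplotlib.pyplot as plt

# Lives remaining at each step; each drop marks a destroyed tank
plt.plot([h['info'].get('ale.lives') for h in result['history']])
plt.xlabel('step')
plt.ylabel('ale.lives')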
I'm going to intentionally ignore the V, C, R, T boxes, and we can always reintroduce them later if we think a performance gain is in the offing. You saw in the manual how they work, so I don't think it's a cause for concern...
Let's also try to understand the action space.
env = result['env']
print(env.action_space)
# Output:
Discrete(18)
Hmm... not helpful at all. But that's what you'd naturally think to do. It turns out the action meanings live on the unwrapped environment in the gym API.
# https://ai.stackexchange.com/a/3557
env.unwrapped.get_action_meanings()
# Output
['NOOP',
'FIRE',
'UP',
'RIGHT',
'LEFT',
'DOWN',
'UPRIGHT',
'UPLEFT',
'DOWNRIGHT',
'DOWNLEFT',
'UPFIRE',
'RIGHTFIRE',
'LEFTFIRE',
'DOWNFIRE',
'UPRIGHTFIRE',
'UPLEFTFIRE',
'DOWNRIGHTFIRE',
'DOWNLEFTFIRE']
So "NOOP" presumably means "no-op" i.e. "do nothing." The rest apparently constitute all the permutations of actions available to the client. This is what we would expect the action space to be. These are also discrete actions so we can code our model to take exactly one action per step.
Finally, let's visualize the rewards
set([r['reward'] for r in result['history']]) # Unique rewards across history
# Output
{0.0, 1.0}
import matplotlib.pyplot as plt
plt.plot([r['reward'] for r in result['history']])
Looks like the reward function is simply "1 if you score a hit, else 0." We can confirm by visualizing the observations at reward time.
# Observations when reward was given
reward_incidents = list(filter(lambda i: i['reward'], result['history']))
i = 0
imshow(reward_incidents[i]['observation'])
i = 1
imshow(reward_incidents[i]['observation'])
i = 2
imshow(reward_incidents[i]['observation'])
They're all images that seem to be captured right after the tank scores a hit.
Writing the tank commander
The Brain
The TankCommander agent needs to learn how to decide which action to take. So, we first need to give it a mechanism for learning. In this case, we're going to use a special kind of graph. In this graph there are three kinds of nodes:
- Input nodes: they take our inputs and send signals to the nodes that they are connected to.
- Regular nodes: these nodes can have many connections from other nodes (including input nodes). Some connections are strong, and some connections are weak. Each node uses the signals and the signal strengths from all of its incoming connections to decide what signal to send along all of its own outgoing connections.
- Output nodes: these nodes are only different from regular nodes in that we read their signal.
The nodes are organized into "layers" that can share many connections and have a function in common for how they decide to aggregate and send signals.
Now, I've simplified a lot, but the graph that we're talking about, if properly organized, is a deep learning neural net. To create one we can use tensorflow
like so:
import tensorflow as tf
from tensorflow.keras import datasets, layers, models, preprocessing, callbacks
import numpy as np
import random
inputs = layers.Input(shape=(44, 148, 3))
x = layers.Conv2D(16, (3, 3), activation='relu')(inputs)
x = layers.BatchNormalization(axis=-1)(x) # Channels @ -1
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(64)(x)
x = layers.Activation('sigmoid')(x)
x = layers.BatchNormalization(axis=-1)(x)
x = layers.Dropout(0.02)(x)
x = layers.Dense(18, activation='linear')(x)
model = models.Model(inputs, x)
model.compile(optimizer='sgd',
              loss='mae',
              metrics=['accuracy'])
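A quick model.summary() call is a cheap sanity check that the layers connect with the shapes we expect:

model.summary()  # prints each layer's output shape and parameter count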
Let's break it down:
inputs = layers.Input(shape=(44, 148, 3))
You'll recall that these are the dimensions of one of our periscope images. They map to our input nodes.
x = layers.Conv2D(16, (3, 3), activation='relu')(inputs)
x = layers.BatchNormalization(axis=-1)(x) # Channels @ -1
x = layers.MaxPooling2D((2, 2))(x)
x = layers.Flatten()(x)
x = layers.Dense(64)(x)
x = layers.Activation('relu')(x)
x = layers.Dense(64)(x)
x = layers.Activation('sigmoid')(x)
x = layers.BatchNormalization(axis=-1)(x)
x = layers.Dropout(0.02)(x)
x = layers.Dense(18, activation='linear')(x)
model = models.Model(inputs, x)
This establishes layers of regular nodes in our graph and their interactions. I'm oversimplifying.
model.compile(optimizer='sgd',
              loss='mae',
              metrics=['accuracy'])
Finally we compile the model, giving it some special parameters. To teach our graph to drive a tank, we need to call the fit method, e.g. model.fit(data, correct_prediction). Each time this happens, our model goes back and updates the strengths (a.k.a. the 'weights') of its connections. The fancy name for this is "backpropagation." Anyway, the parameters of model.compile help determine how backpropagation is executed.
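At decision time the brain is used the other way around: crop the periscope out of the observation, ask the model for its 18 scores, and take the highest-scoring action. Here's a rough sketch of what the TankCommander's decision_function could look like (an illustration, not the repo's exact code):

import numpy as np

def decision_function(self, obs=None, env=None):
    # Sketch only: crop the periscope view, score all 18 actions, pick the best one
    if obs is None:
        return 0  # NOOP until we have an observation
    periscope = obs[80:124, 12:160, :]
    scores = self.model.predict(np.expand_dims(periscope, 0))  # shape (1, 18)
    return int(np.argmax(scores[0]))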
The Experiences
Now that we have a brain, the brain needs experiences to train on. As we know from our manual and our little exploration of the data, each episode maps to one full game of Robot Tank. However, in each game the player gets several tanks, and if a tank is destroyed, the player simply "respawns" in a new tank. So the episode is actually not the smallest unit of experience; rather, the smallest unit is the in-game "life" of a single tank.
We can extract these slices from the history (the list passed to train_on_history):

episodes = [
    list(filter(lambda h: h["episode"] == e, history))
    for e in range(n_episodes)
]

# For each episode, split its steps by how many lives were left at the time
game_lives = [
    [
        list(filter(lambda h: h.get('info').get('ale.lives') == l, episode))
        for l in range(5)
    ]
    for episode in episodes
]
For each of these lives we can get a cumulative reward (how many hits scored before the life ends.)
rewards = [obs.get('reward') for obs in game_life]
cum_rewards = sum(rewards)
And using this number we can determine how strongly we want our brain to react to this experience:

# Positive experience
if cum_rewards:
    nudge = cum_rewards * 0.03
# Negative experience
else:
    nudge = 0 - 0.03
Now, for a given step, we can:

- Take our original action, prediction, and periscope image as data.
- Nudge our prediction only at the action index, since we can only learn from the actions we have taken.
- Call model.fit(image, prediction_with_nudge).
The catch is that the commander can settle too quickly on whatever actions happened to pay off early on. To visualize the problem, imagine you are tasked with riding a bike down a mountain blindfolded. As you miraculously ride down the mountain without killing yourself, you may reach a point where you seem to have reached the bottom: in any direction you try to go, you must pedal uphill. You take your blindfold off only to realize that you've barely gone a mile, and that you still have far to go before you reach the base of the mountain. The fancy term for this is a "local minimum."
To address this we can just force the commander to randomly take actions sometimes:
for obs in game_life:
    action, prediction = obs.get('action')
    if self.epsilon and (random.uniform(0, 1.0) < self.epsilon):
        action = random.randrange(18)
    # Copy the prediction
    update = list(prediction)
    # Update only the target action (displacement is the nudge computed above)
    update[0][action] = update[0][action] + displacement
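The last step for each observation is to actually fit on it, pairing the periscope crop with the nudged prediction (a sketch, assuming self.model is the compiled Keras model from above):

# Take one small training step on this single example
periscope = obs['observation'][80:124, 12:160, :]
self.model.fit(np.expand_dims(periscope, 0), np.array(update), verbose=0)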
With only 120k steps, our tank already seemed to have settled on a strategy, as shown in this long, long gif.
As you can see, turning left is powerful in Robotank -- a whole squadron killed!
...but then it dies. I think this is pretty strong for a stumpy model trained only on the periscope viewport. Time permitting, I may continue to tinker with this one -- increasing the epsilon value, adjusting the graph parameters, and adding views could all help nudge the tank commander towards a more nuanced strategy.
Anyway you've learned a bit about implementing DL with RL in Python. You learned:
- DL basic concepts
- Exploratory data analysis for RL
- Selectively applying rewards to specific actions, and finding the smallest divisible unit of experience
- Introducing random actions to help explore the "action space"
Thanks for reading! Reach out with any questions.