
Andrew Simpson

October 9th 2018

Machine Learning Engineer Nanodegree

Capstone Project


Project Overview

Since 9/11, when a radical terrorist group run by Osama bin Laden flew two planes into the
World Trade Center, one into the Pentagon, and crashed one into a field that was intended for
the White House, the US has made it its goal to wipe out terrorist groups around the
world. Thanks to modern advancements in technology, the US has been able to do much of this
damage without risking the lives of American soldiers. Through the use of drones, the
US has been able to launch attacks remotely. Using missiles such as the Hellfire,
these drones strike specific targets in a very precise manner. But nothing is perfect,
and at times civilians can get caught in the crossfire of these attacks and be severely
injured or killed. As long as there has been war there have been civilian casualties, and most
likely there always will be. But what if we could, through the use of AI and ML, decrease the
number of civilian casualties from drone strikes?

In this project, I create a simulated environment based in the Middle East in which the target (a
car) drives around the streets. The learner, using deep reinforcement learning, finds
the best time to shoot and take out the moving target. The best time to take out the
target is determined by how much damage would be done to the surrounding buildings, which
have civilians inside.

Problem Statement
The goal of this project is simple: create an AI, using ML, that can take out a target while at the
same time reducing the number of casualties among the civilians/buildings that happen to be in the same
area as the target. To do this, the learner must figure out how to avoid buildings and wait for a better
time to shoot. I attempt to achieve this through deep reinforcement learning. As
mentioned above, I have created an environment in which a car moves through the streets
of a small city in the Middle East. The learner is punished when it does damage to
buildings and over time should learn to avoid them.


Since the goal is to avoid buildings and therefore avoid civilians, the natural metric is
the amount of damage done to buildings when the drone shoots. The less damage
the drone does, the fewer civilians are affected, and the better the metric. The
damage is calculated as follows:

for b in buildings:
    damage = damage + circle.intersection(b).area

This loop goes through all the buildings in the environment and sums the area of each
building that falls within the blast radius. This sum is then normalized by multiplying it by -0.001,
and the result is the reward given to the drone after it shoots.
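Put together, the damage reward can be sketched as a small function. This is my own illustration, not the project's exact code: it assumes, as the later listings suggest, that buildings are stored as Shapely polygons, and `blast_damage` is a hypothetical name.

```python
from shapely.geometry import Point, Polygon

def blast_damage(car_pos, buildings, radius=50):
    # "blast radius" circle around the car's current position
    circle = Point(car_pos[0], car_pos[1]).buffer(radius)
    damage = 0.0
    for b in buildings:
        # area of each building inside the blast radius
        damage += circle.intersection(b).area
    # scale into the reward range used in the report (negative = punishment)
    return damage * -0.001

# example: a 20x20 building fully inside the blast radius contributes
# an intersection area of 400, giving a reward of -0.4
building = Polygon([(0, 0), (20, 0), (20, 20), (0, 20)])
reward = blast_damage((10, 10), [building])
```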


Data Exploration

Since this is a deep reinforcement learning problem, I had to create an environment for the
learner to work in. It is based on a Google Earth image of a small town in the Middle East. I have
outlined all the buildings and streets in the image, as seen in figure 2. This data is stored in arrays
in the environment and is used for two things. First, the street data is used for driving the
car. The car's initial position can be on any spot of the blue lines (figure 2). After that, when the
`step` function is called by the learner, the car moves forward one step in the environment.
When the car comes to an intersection it chooses at random which road to take next (excluding the one
it is currently on). This makes it more difficult for the learner, as it does not know where the car will go,
just like in real life. The building data is not used until the drone decides to shoot. Once it
shoots, the environment creates a “blast radius” around the car's current position, then goes
through all the building data and calculates the area at which each building and the blast radius
overlap. This area is summed and used in the reward function as mentioned above. If the car
goes outside the environment, the episode ends and a negative score is given to the learner for
not shooting at all and letting the target get away. The environment returns four pieces of data.
The first is the image that is actually passed into the deep learner: a 128x128
black and white image, which greatly simplifies the input for the network.
The next is the reward, which was discussed before. Next is the
terminal state, and last is the position of the car,
which isn't used by the network but is used
for analyzing what the deep network is doing.
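The four return values described above suggest an interface along the following lines. This is a hypothetical sketch, not the environment's actual code: the class and method names, shapes, and placeholder values are my own illustration.

```python
import numpy as np

class DroneEnv:
    """Hypothetical skeleton of the environment's step interface."""

    def step(self, action):
        # ... real environment: move the car one step, or compute blast
        # damage if the drone shoots ...
        image = np.zeros((128, 128), dtype=np.uint8)  # 128x128 B&W frame
        reward = -0.001                               # e.g. the waiting penalty
        done = (action == 1)                          # shooting ends the episode
        position = (42, 17)                           # car (x, y), analysis only
        return image, reward, done, position

env = DroneEnv()
state, reward, done, pos = env.step(0)  # action 0 = wait
```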

Exploratory Visualization

Figures 1 and 2 are screenshots of the actual environment. Figure 1 shows the car that
moves on the map; here we see the full town and the different buildings and roads in it.
The red circle represents the blast radius that is used to calculate the damage done. In this
example you can see that the red circle overlaps part of a building; that overlapping
area is what will be used to calculate the damage.

Figure 1

Figure 2 shows the building and road data that I have programmed into the environment. The blue
lines are the paths that the car can travel on at each step the learner takes. The buildings
are used to calculate the total damage that the strike did.

Figure 2

If you compare the two images you will see that there are highlighted areas that are not
buildings. These appear to be the yards of people's property; we assume that there are
people there and want the learner to avoid them just like the rest of the buildings. You can
also see that there are four roads that leave the environment. These allow the target to get
away, which in turn punishes the learner for letting it escape. The car may also loop around
the town and take any road at random. Just looking at the images, one thing becomes very
obvious: these images are very large and in color, which will make things very difficult
for the learner. Finding a correct way to give rewards also proves to be extremely difficult.
In the next few sections I will discuss the solutions to these core problems.

Algorithms and Techniques

The main algorithms to focus on are the deep reinforcement learning algorithm and the
algorithms used to compute the rewards. Starting with the learner: it uses Keras as the
deep network framework for this project. The network consists of four convolutional layers, two max
pooling layers, and one dense layer. This is the core of the learner. Another very important part
is the experience memory and the algorithm that trains on these memories. This algorithm uses a
double DQN instead of a single one to avoid the moving target problem. The second network
(the target network) is updated with the current weights every 10000 updates, or about every 100 episodes.
Next is the reward function, starting with the reward that is passed to the learner when it
chooses to take a step instead of shooting. I tried multiple different reward functions for this,
such as one that takes a gradient of the damages at that location (essentially: should it
shoot, how much damage would it do to the entire environment?) and one that
just returns a static amount at every step. I will talk more about this later.


Since this is a custom environment, there is no “acceptable score” that I can aim for based on
other people's results. Instead I will focus on decreasing the damage done on each shot. The
goal is to get the average damage near zero, as this means that no buildings were
damaged and in turn fewer civilians were affected by the blast.


Data Preprocessing

As mentioned before, the input image from the environment
cannot be passed in as-is. The color is not acceptable
and the image is too large. To fix this problem I first convert
the image, with the new car position drawn on it, into a black and white
image. Next I simply resize the image to 128x128,
which gives the image in figure 3. By making the image much
smaller, it allows the deep network to better analyze what is
happening at a faster speed. It is important not to go too
small, though, because then the car would no longer be visible.
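The preprocessing step described above can be sketched as follows. Pillow is my assumption here (any image library would do), and `preprocess` is an illustrative name, not the project's actual function.

```python
import numpy as np
from PIL import Image

def preprocess(frame):
    # drop the color channels: grayscale greatly simplifies the input
    img = Image.fromarray(frame).convert("L")
    # shrink to 128x128 so the network can process frames faster
    img = img.resize((128, 128))
    return np.asarray(img)

# e.g. a raw 512x512 RGB frame becomes a 128x128 grayscale array
frame = np.random.randint(0, 255, (512, 512, 3), dtype=np.uint8)
small = preprocess(frame)
```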

Figure 3

Implementation

The following is the implementation of the deep network model that was used.

model = Sequential()

# the input is the 128x128 black and white frame from the environment
model.add(Conv2D(64, (6, 6), padding='same', activation='relu',
                 input_shape=(128, 128, 1)))
model.add(Conv2D(64, (6, 6), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3)))

model.add(Conv2D(128, (6, 6), padding='same', activation='relu'))
model.add(Conv2D(128, (6, 6), activation='relu'))
model.add(MaxPooling2D(pool_size=(3, 3)))

# flatten the convolutional features before the dense layers
model.add(Flatten())
model.add(Dense(512, activation='relu'))
model.add(Dense(self.actionS, activation='linear'))

model.compile(loss='mse', optimizer=Adam(lr=self.learnRate))

All of the activation functions are ReLU except the last one, which is linear. As you can see, it uses
a convolutional network, which is the state-of-the-art approach for deep learning with images.

Next is the implementation for memory replay which keeps the different steps/actions and
plays them back later to learn from them again.

for st, ac, rw, nst, done in batch:
    # refresh the target network with the current weights every 10000 updates
    if self.count % 10000 == 0:
        self.t_copy = self.copy()
    self.count = self.count + 1

    st = np.reshape(st, [1, self.size, self.size, 1])
    nst = np.reshape(nst, [1, self.size, self.size, 1])

    # target value: immediate reward, plus discounted future reward
    # (estimated by the target network) if the episode is not over
    targetNum = rw
    if done == False:
        targetNum = rw + self.gamma * np.amax(self.t_copy.predict(nst)[0])

    targetF = self.model.predict(st)
    targetF[0][ac] = targetNum
    self.model.fit(st, targetF, epochs=1, verbose=0)

    if self.explorRate > self.decayMin:
        self.explorRate *= self.decay

This algorithm takes a random sample of the memories and learns from them again. Looking
closely at the target calculation `rw + self.gamma * np.amax(self.t_copy.predict(nst)[0])`, we see
that it uses t_copy instead of the model to evaluate the next position. This is because
this is a double DQN: in order to avoid the moving target problem, we must use a different
network to calculate the target. This target network is updated every 10000 memories by `if
self.count % 10000 == 0`.

Next is the reward function, which is very important to reinforcement learning. It proved to
be not as simple as just returning some value. It takes time, math, and careful thinking to tune
the reward function so that the learner can converge to the solution you are looking
for. Since the learner is obviously not sentient, it must mathematically be given its task
through trial and error and cannot simply be told what to do. My first attempt at a reward
function is defined below.

This is a special reward function because at each step the reward returned is a gradient
reward over the whole environment! To visualize what is happening,
imagine a simple gradient color scheme like figure 4: the closer
you get to the purple color the larger the score is, and the farther away
you get the lower the score is.

Figure 4

Now think about this in terms of the
environment: the farther a building is from the car, the lower its
score, or potential damage, and the closer the building is to the car, the higher its score, or
potential damage. It does this in a continuous manner, so the inner parts of a building
contribute more to the overall score than the outer parts will. This allows for a continuous
function that creates almost a gradient heat map. Below is the code.

def gradient(self, res):
    center = [self.cPoint[0] - 15, self.cPoint[1] - 15]
    stepSize = int(math.floor(image.size[0] / res))
    damgeScore = 0.0
    for i in range(res):
        # annulus (ring) between two concentric circles around the car
        cr = Point(center[0], center[1]).buffer((stepSize + 1) * i)
        cr_smaller = Point(center[0], center[1]).buffer(stepSize * i)
        don = Polygon(cr, [cr_smaller])
        for b in buildings:
            # weight each building's overlap by its distance from the car
            damgeScore = damgeScore + ((don.intersection(b).area * 0.01) / self.gradCalc(i))


It is important to know that this is not totally continuous, since that would require complex integration,
which is overkill for the project at the moment. So I have a parameter called res, which is the
resolution the gradient should operate at. The results can be pretty wild depending on where the car
is on the map, and since this is used directly as the reward, I needed to normalize it.
To normalize the reward I used the expression

-math.log10((self.gradient(100) * 0.0001) + 1)

It first multiplies the score by 0.0001 to get it to a reasonable range for log, then it adds 1 to shift the
asymptote to -1. Last, it takes the negative log of that number.
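As a quick arithmetic check of this normalization (pure stdlib; `normalize` is an illustrative name for the expression above):

```python
import math

def normalize(grad_score):
    # scale into a reasonable range for log, shift the asymptote to -1,
    # then take the negative log
    return -math.log10((grad_score * 0.0001) + 1)

normalize(0)      # no potential damage -> reward of exactly 0
normalize(10000)  # a large gradient score -> -log10(2), about -0.301
```

The +1 shift keeps the argument of log10 at or above 1, so the reward can never blow up toward negative infinity no matter how large the gradient score gets.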

Now, this reward system did not work the best and I am ditching it for the time being. I added it
to the report because it was a major part of my thought process and I plan to use it at a later
time. I believe it could be an instrumental part of the project in the future, but it would take a
whole research project of its own.


This first reward function had some major issues. It would not converge to the
best solution; instead, because the initial starting point of the car is
chosen at random and some areas have, for example, four roads that all intersect each other, those
areas are statistically more likely places for the car to show up. This meant the learner was
almost just trying to get back to those spots, because that was what it knew the most about. You can
think of it as the learner just wanting to be in a familiar spot. To address this
problem, I decided to move to a static reward of -0.001 when the learner decides to not shoot and wait.
The reward when the learner shoots was then defined as follows.

radi = 50
center = [self.cPoint[0] - 15, self.cPoint[1] - 15]
circle = Point(center[0], center[1]).buffer(radi)

damage = 0
for b in buildings:
    # sum the area of each building inside the blast radius
    damage = damage + (circle.intersection(b).area * 0.001)
damage = damage * (-1)

This reward function simply returns the negated, scaled area of all the damaged buildings. This value
typically lies between 0 and -3. Since the reward for waiting is only -0.001, theoretically the
learner should only shoot if it thinks shooting would earn a better reward than waiting. This should give
some positive results.
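A quick sanity check of these reward scales shows why patience is cheap relative to a bad shot. The episode length and shot value here are illustrative numbers, chosen within the ranges stated above.

```python
# reward scales from the text: waiting costs -0.001 per step, while a
# damaging shot typically scores between 0 and -3
wait_penalty = -0.001
steps = 200                        # illustrative episode length
total_wait = steps * wait_penalty  # cumulative cost of 200 steps of waiting
bad_shot = -1.5                    # a mid-range damaging shot

# even a long wait is cheaper than a single bad shot
better_to_wait = total_wait > bad_shot
```

This asymmetry is what pushes the learner to hold fire until a low-damage opportunity appears, rather than shooting early just to end the episode.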

The next refinement was to the gamma parameter, which appears in the memory replay
function shown earlier, in the line `targetNum = rw + self.gamma *
np.amax(self.t_copy.predict(nst)[0])`.

Gamma is the factor applied to the estimated future reward when creating the target value that the deep
network should aim for in a given state. I tried low values like 0.3 and higher ones like
0.95, and the differences were surprising. The lower value seemed to work really
quickly and then the learner stopped improving, while the higher values produced slow and
steady learning.
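The effect of gamma on the training target can be seen with a tiny worked example. `td_target` is an illustrative helper mirroring the replay code above, and the Q-value of 2.0 is a made-up number.

```python
def td_target(reward, gamma, max_next_q, done):
    # terminal states use the raw reward; otherwise add the discounted
    # best Q-value of the next state, as in the replay loop above
    return reward if done else reward + gamma * max_next_q

low = td_target(-0.001, 0.3, 2.0, done=False)   # -0.001 + 0.3 * 2.0
high = td_target(-0.001, 0.95, 2.0, done=False)  # -0.001 + 0.95 * 2.0
```

With a low gamma the target stays close to the immediate reward, so the network learns fast but short-sightedly; with a high gamma, distant consequences dominate the target, which matches the slow-but-steady learning observed above.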


After over 150 hours of GPU testing, finding a near-perfect model proved very challenging.
I did, however, find some positive results, which I will discuss in this section.

Model Evaluation and Validation

The first graph we will look at shows what a random
distribution of the rewards looks like over this
environment (Figure 5). Here we can see that
random shooting does find spots that score zero and
some that go all the way up to four. Other than
the zeros, which are expected, the data seems to
be pretty random with no patterns, which again is
expected. As a reminder, each episode runs from when
the car is initialized to when the learner shoots.

Figure 5

Next we will look at a couple of different graphs that show some results.

Figure 6

The results may be hard to see in this graph, as mentioned above, but there are still some very
positive results buried in this data! Looking at the first 500 episodes, we see that the data is
very well spread out in a random manner. After this, two cluster lines start to form:
one at about 2.5 and another at around 1.0. I believe the top line is due to the random picking
of a road when the car is initialized. Some spots in the environment are more likely to be picked
than others, and this causes an over-representation in what the learner is exposed to. This, I
believe, causes the learner to want to shoot at the spots it has already been to. I believe
that was a major issue in this project and needs more random experimentation. This
unfortunately would take several days just to run and I do not have the time at the moment, but
I will explore it in the future for sure. Given this, let's discard the line at the top and look at the
thick line around 1.0. This cluster line is very defined and gets thicker over time. It
demonstrates the learning that was done over 4000 episodes. We can see that the learner did in fact
learn quite a bit and started to figure out good places to shoot. It is also important to notice that this
line drops and gets closer to zero. The top line also gets thinner over time, which shows the
learner figured out that that statistically common area is not a good spot.

Figure 7

Figure 7 shows this data extended over 7000 episodes. The significant element of this data is
the last 200 episodes, where we see a lot of the rewards in the 1.5-3 range move down into
the 0-1 range. This shows an increase in the learning, and I believe the results would only get
better given more time. It seems that as more time goes on, the lower that line
cluster gets and the thicker it becomes. It is important to note that there are some very high outlier
results as time goes on; this is because the learner hasn't seen all the different spots in the
environment due to time constraints, and I have no doubt these would go down given enough time.
Digging deeper into the data, it is clear that the learner is figuring out when the best time to shoot is
and is gradually improving.

The final model was the convolutional model listed above with the static reward function. I
used a gamma of 0.95 for the target learning. The memory batch size was 100, which gave the learner
plenty of examples to learn from. It also had an exploration decay rate of 0.995. In the future I would
make this higher, although that would require much more time and GPU power.


The final solution to this problem was a deep reinforcement learner that uses a convolutional
neural network to learn when to shoot and when to wait. A huge part of this project was the
reward function, which returns -0.001 when the learner decides to wait and returns the negated damage
of the buildings hit during the strike. These two aspects combined form the core of the
problem. The goal of the project was to get the rewards close to zero almost every time and
drastically reduce the damage done, but unfortunately this was not reached. Even though I did
not reach the goal, there was still some significant learning done, and definitely in the right
direction. At the end of the day this project will need more research to reach the perfect goal.

Free-Form Visualization

In this section we will look at a small portion of where the learner was actually shooting.

Figure 8 is a density map showing the different spots where the learner prefers to shoot at
around 1000 episodes. Unfortunately, due to a software glitch, the data after this point was lost
and could not be recovered. What this data does show is the learner starting to
avoid shooting at certain areas, such as between buildings and where the road runs very
close to buildings. Looking at the high density yellow area, it is very interesting how this came
to be. Because I developed the environment, I know that the upper part of the map sees a
lot of traffic. And when looking at the data before it was lost, the most dense area actually
became the area to the right of the yellow, where the roads form a T.

Figure 8

This T area is actually very open and does hardly any damage to the buildings around it.
It is a small gap that is right in the middle of the action. Because the learner is punished
when it lets the target get away, this middle spot seems to be the perfect one: the learner
knows it is safe, and it is a reliable spot for where the target will go. Amazing!


Before I could start exploring this very specific task, I needed an environment to work in. So I
created a fully interactive environment, stuffed with all the bells and whistles, that creates a map
of a city in the Middle East and initializes a car at a random location. At each step the car
moves along the road just as if it were driving, and when the car intersects another road it
randomly chooses which one to take. This data is then packed into a 128x128 black and white
image. The image is given to the convolutional neural network, which decides
what the next action should be. This information is then saved into memory and replayed later to
create continuous learning. When the network needs to learn, it uses a double DQN, which is
updated every 100 episodes to give the network a more stable learning experience. This
process is repeated over and over until the episode limit is reached.

Originally I wasn't using a double DQN and was getting no positive results at all. Much of my
time was spent trying to figure out why this was happening and how I could fix it. Once I tried
a double DQN, I saw immediate improvements!

Originally I was very confused by my results. I really thought they should be
better, converging to 0 much faster, or at least showing the obvious learning you see
DQNs achieve in Atari games. But then I realized something: the main object in this environment is
the car, and the learner has no control over what it does or where it goes. It is
totally random! In the Atari games, the agent has multiple actions it can choose from to manipulate the
game exactly how it wants. My learner doesn't have that privilege; in a sense it has to predict
randomness. Now that is obviously not possible, but the point still stands: there isn't any standard
pattern like in games such as Pac-Man, and to top it off, the learner can't pick where the car goes.
It just has to do its best and weigh the odds as to when it should shoot. Given this new perspective,
I was actually very happy with my results and what I was able to accomplish. But because this is just
one environment, it cannot be used as a general solution.


Since this uses reinforcement learning on a single environment, it is not a general solution but
more of a proof of concept. It is a major goal of mine, however, to make this into a general
solution, and I have a few ideas for how this could be done. To start, I would create a separate
neural network that could identify and outline all the buildings in an image. The next neural
network would track a target that is given to it. The data from these two neural
networks would then be passed into a final network, which would decide when the right/
safest time to shoot is. This is obviously a very ambitious goal, but I hope to chip away at it over
the next year, since I had a great time with this project.