AI General Game Player using Neuroevolution Algorithms

DECLARATION
We hereby declare that the project work entitled AI General Game Player
using Neuroevolution Algorithms has been independently carried out by us under
the guidance of Mr Guru R, Assistant Professor, Department of Computer Science and
Engineering, Sri Jayachamarajendra College of Engineering, Mysuru, and is a record of
original work done by us. This project work is submitted in partial fulfillment of the
requirements for the award of the degree of Bachelor of Engineering in Computer Science
and Engineering of Visvesvaraya Technological University, Belgaum, during the year
2016-17. The results embodied in this thesis have not been submitted to any other
university or institute for the award of any degree or diploma.

Basanth Jenu H B
Meghana S B
Sanjana G S

Abstract
The goal of General Game Playing (GGP) has been to develop computer programs
that can perform well across various game types. It is natural for humans to transfer
knowledge from games they already know how to play to other similar games, but the same
is a difficult task for computers. GGP research attempts to design systems that work
well across different game types, including previously unknown games. Developing intelligent
agents that can learn a given task by themselves is of great significance in AI research.
Earlier attempts at general game playing have been through tree-based methods
and heuristics. Recently, there have been attempts to solve the problem using Reinforce-
ment Learning methods, Q-Learning in particular. Many attempts have combined the
latest advances in Deep Learning and Q-Learning and have achieved impressive results.
In this project, a model is designed by combining the latest advances in Deep Learning
and Genetic Algorithms. In particular, Conventional Neuroevolution (CNE) and Neuroevo-
lution of Augmented Topologies (NEAT) will be implemented and tested on various tasks.
CNE focuses on evolving the weights of a Neural Network to solve a problem. NEAT goes
a step beyond and evolves the structure of the Neural Network along with its weights.
Finally, a model is designed that can learn on its own to play a variety of games
without any prior knowledge about the game. The algorithm is made to play
Flappy Bird, Pong and Super Mario Bros, along with benchmark tests.

ACKNOWLEDGEMENT
We would like to thank the whole management of the Department of Computer Science
and Engineering for giving us an opportunity to carry out this project on our own and for
trusting in our abilities. It gives us great pleasure to express our deep sense of gratitude
to our institution, Sri Jayachamarajendra College of Engineering, Mysuru.

We extend our deep regards to Dr T N Nagabhushan, Principal, Sri Jayachamarajendra
College of Engineering, for providing an excellent opportunity to carry out
our project at the Computer Science and Engineering Department.

We would like to express our thanks to Dr H C Vijayalakshmi, Associate Professor
and Head, Department of Computer Science and Engineering, SJCE,
for her guidance and support.

We take this opportunity to thank our Project Guide, Mr Guru R, Assistant Professor,
Department of Computer Science and Engineering, SJCE, for his suggestions,
valuable support, encouragement and guidance throughout the project.

We would like to thank all the teaching and non-teaching staff of the Computer Science
and Engineering Department. We also convey our gratitude to all those who have con-
tributed to this project directly or indirectly.

Basanth Jenu H B
Meghana S B
Sanjana G S

Contents

Declaration
Abstract
Acknowledgement

1 Introduction
1.1 Problem Statement
1.2 Objectives
1.3 Introduction to the problem domain
1.3.1 General Game Playing
1.3.2 Neural Networks
1.3.3 Genetic Algorithms
1.4 Applications
1.5 Existing solution methods
1.6 Proposed solution methods
1.7 Time schedule for completion of the project work (Gantt chart)

2 Literature Survey
2.1 A survey of Monte Carlo tree search methods
2.2 A GGP Feature Learning Algorithm
2.3 Training Feedforward Neural Networks Using Genetic Algorithms
2.4 High-level Reinforcement Learning in Strategy Games
2.5 Human-level control through deep reinforcement learning

3 System Requirements and Analysis
3.1 Introduction
3.2 Functional Requirements
3.3 Non - Functional Requirements
3.4 System Requirements
3.5 Software Requirements
3.6 Hardware Requirements

4 Tools and Technologies used
4.1 Python
4.2 Numpy
4.3 Pickle
4.4 Graphviz
4.5 Virtualenv
4.6 OpenAI gym
4.7 Pygame Learning Environment
4.8 Sublime Text
4.9 Git and Github
4.10 Google Compute Engine
4.11 Latex

5 System Design
5.1 Introduction
5.2 Block Diagram
5.2.1 ANN to interact with Environment
5.2.2 Evolution of ANNs

6 System Implementation
6.1 Introduction
6.2 Conventional Neuroevolution
6.3 Neuroevolution of Augmented Topologies (NEAT)
6.3.1 Genetic Encoding
6.3.2 Tracking Genes through Historical Markings
6.3.3 Protecting Innovation through Speciation
6.3.4 Minimizing Dimensionality through Incremental Growth from Minimal Structure
6.3.5 NEAT Algorithm
6.3.6 Parameter Settings

7 Testing and Results analysis
7.1 Introduction
7.2 Unit Testing
7.3 Evolving XOR
7.3.1 Comparison analysis for solving XOR
7.3.2 Evolution of XOR using NEAT
7.4 Cartpole Balancing task using NEAT
7.5 Playing Pong game using NEAT
7.6 Playing Flappy Bird using NEAT
7.7 Playing Super Mario Bros using NEAT

8 Conclusion and Future Work

Appendix A
Appendix B
References

List of Figures
1.1 Gantt Chart
5.1 Interacting with the environment
5.2 Evolution of ANNs
6.1 Encoding an ANN
6.2 Crossover of ANNs
6.3 Genetic Encoding
6.4 Mutation in NEAT
6.5 Matching of Genes and Crossover
7.1 Solving XOR using CNE
7.2 Solving XOR using NEAT
7.3 Solving XOR using Backpropagation
7.4 Evolution of structure by NEAT
7.5 Cartpole Balancing Environment
7.6 ANN to solve Cartpole Balancing Environment
7.7 NEAT Playing Pong
7.8 NEAT playing the Flappy Bird Game
7.9 ANN evolved to play Flappy Bird
7.10 State of the Super Mario Bros game
7.11 NEAT playing Super Mario Bros

List of Tables
6.1 Mutation Rates for NEAT
6.2 Parameters for speciation for NEAT
7.1 XOR Problem
7.2 Results for solving XOR

1 Introduction

General Game Playing is all about making machines play a variety of games by
themselves without having any prior knowledge about the game. Humans are very good
at this task, but the same is difficult for a computer. The project deals with Genetic
Algorithms (GA) and Artificial Neural Networks (ANN). GAs are optimization algorithms
inspired by natural evolution. ANNs are mathematical models that try to predict and
mimic the working of the human brain. In this project, Genetic Algorithms and Neural Net-
works are combined to make a machine play a variety of games by itself.

1.1 Problem Statement


Humans demonstrate incredible dexterity in playing games even in the absence of
instructions on how to play them, but the same task is difficult for computers. In this
project, the latest advances in Neuroevolution Algorithms will be used to investigate
whether computers can learn to play games on their own. In particular, Conventional Neu-
roevolution (CNE), Neuroevolution of Augmented Topologies (NEAT) and various games
will be used for our experiments.

1.2 Objectives
The objectives of the project are as follows:

1. Implement Conventional Neuroevolution (CNE) and, using the algorithm,

   (a) Evolve XOR.

2. Implement Neuroevolution of Augmented Topologies (NEAT) and, using the algorithm, make a system that can

   (a) Evolve XOR.
   (b) Perform the pole balancing benchmark test.
   (c) Play the Pong game.
   (d) Play Flappy Bird.
   (e) Play Super Mario Bros.

1.3 Introduction to the problem domain

1.3.1 General Game Playing

Games have always been an important platform for research on Artificial Intelligence
(AI). Since the early days of AI, many popular board games, such as chess and
checkers, have been used to demonstrate the potential of emerging AI techniques to solve
combinatorial problems. General Game Playing (GGP)[13] was introduced to design
game-playing systems with applicability to more than one specific game. Traditionally, it
is assumed that game AI programs need to play extremely well on a target game without
consideration for the AIs General Game Playing ability. As a result, a world-champion
level chess program, such as Deep Blue, has no idea how to play checkers or even a board
game that only slightly differs from chess. This is quite opposite to humans game-playing
mechanism, which easily adapts to various types of games based on learning the rules
and playing experience.
Some research stresses the importance of human-style game playing instead of sim-
ply unbeatable performance. For example, given a certain board configuration, human
players usually do not check as many possible scenarios as computer players. However,
human players are good at capturing patterns in very complex games, such as Go or
Chess. Generally, the automatic detection of meaningful shapes on boards is essential
to successfully play large-branching factor games. The use of computational intelligence
algorithms to filter out irrelevant paths at an early stage of the search process is an im-
portant and challenging research area. Finally, current research trends are attempting to
imitate the human learning process in game play.
In the context of GGP, the goal of an AI program is not to perfectly solve one game
but to perform well on a variety of different types of games, including games that were
previously unknown. Unlike game-specific AI research, GGP assumes that the AI pro-
gram is not tightly coupled to a game. Such an approach requires a completely different
research approach, which, in turn, leads to new types of general-purpose algorithms. Tra-
ditionally, GGP has focused primarily on two-dimensional board games inspired by chess
or checkers, although several new approaches for General Video Game Playing (GVGP)
have been recently introduced to expand the territory of GGP. The goal of GVGP re-
search is to develop computer algorithms that perform well across different types of video
games. Compared with board games, video games are characterized by uncertainty, con-
tinuous game and action space, occasional real-time properties, and complex gaming rules.

1.3.2 Neural Networks

Neural networks[6] are algorithms for optimization and learning based loosely on
concepts inspired by research into the nature of the brain. They generally consist of five
components:
1. A directed graph known as the network topology whose arcs are called links.

2. A state variable associated with each node.

3. A real-valued weight associated with each link.

4. A real-valued bias associated with each node.

5. A transfer function for each node which determines the state of a node as a function
of

(a) its bias b,


(b) the weights of its incoming links, and
(c) the states of the nodes connected to it by these links.

This transfer function usually takes the form of either a sigmoid or a step function.
A feedforward network is one whose topology has no closed paths. Its input nodes
are the ones with no arcs to them, and its output nodes have no arcs away from them.
All other nodes are hidden nodes. When the states of all the input nodes are set, all
the other nodes in the network can also set their states as values propagate through the
network. The operation of a feedforward network consists of calculating outputs given a
set of inputs in this manner. A layered feedforward network is one such that any path
from an input node to an output node traverses the same number of arcs. The nth layer
of such a network consists of all nodes which are n arc traversals from an input node.
A hidden layer is one which contains hidden nodes. Such a network is fully connected if
each node in layer i is connected to all nodes in layer i + 1 for all i.
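
To make this concrete, the forward pass of a small layered feedforward network with
sigmoid transfer functions can be sketched in Python as follows. This is an illustrative
example only, not code from this project; the layer sizes and random weights are arbitrary.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def forward(x, layers):
        """Propagate an input vector through a list of (weights, biases) layers."""
        state = x
        for weights, biases in layers:
            # State of each node = transfer function of (weighted inputs + bias)
            state = sigmoid(weights @ state + biases)
        return state

    # Hypothetical 2-3-1 network with random weights and biases
    rng = np.random.default_rng(0)
    layers = [(rng.uniform(-1, 1, (3, 2)), rng.uniform(-1, 1, 3)),
              (rng.uniform(-1, 1, (1, 3)), rng.uniform(-1, 1, 1))]
    print(forward(np.array([0.0, 1.0]), layers))
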
Layered feedforward networks have become very popular for a few reasons. For one,
they have been found in practice to generalize well, i.e. when trained on a relatively sparse
set of data points, they will often provide the right output for an input not in the training
set. Secondly, a training algorithm called backpropagation exists which can often find a
good set of weights (and biases) in a reasonable amount of time. Backpropagation is a
variation on gradient search. It generally uses a least-squares optimality criterion. The
key to backpropagation is a method for calculating the gradient of the error with respect
to the weights for a given input by propagating error backwards through the network.
There are some drawbacks to backpropagation. For one, there is the "scaling prob-
lem". Backpropagation works well on simple training problems. However, as the problem
complexity increases (due to increased dimensionality and/or greater complexity of the
data), the performance of backpropagation falls off rapidly. This makes it infeasible for
many real-world problems. The performance degradation appears to stem from the fact
that complex spaces have nearly global minima which are sparse among the local minima.
Gradient search techniques tend to get trapped at local minima. With a high enough
gain (or momentum), backpropagation can escape these local minima. However, it leaves
them without knowing whether the next one it finds will be better or worse. When the
nearly global minima are well hidden among the local minima, backpropagation can end
up bouncing between local minima without much overall improvement, thus making for
very slow training.

A second shortcoming of backpropagation is that computing a gradient requires
differentiability. Therefore, backpropagation cannot handle discontinuous optimality criteria
or discontinuous node transfer functions. This precludes its use on some common node
types and simple optimality criteria.

1.3.3 Genetic Algorithms

Genetic algorithms[9] are algorithms for optimization and learning based loosely
on several features of biological evolution. They require five components:

1. A way of encoding solutions to the problem on chromosomes.

2. An evaluation function that returns a rating for each chromosome given to it.

3. A way of initializing the population of chromosomes.

4. Operators that may be applied to parents when they reproduce to alter their genetic
composition. Included might be mutation, crossover (i.e. recombination of genetic
material), and domain-specific operators.

5. Parameter settings for the algorithm, the operators, and so forth.

Given these five components, a genetic algorithm operates according to the following
steps:

1. The population is initialized. The result of the initialization is a set of chromosomes.

2. Each member of the population is evaluated. Evaluations may be normalized. The


important thing is to preserve relative ranking of evaluations.

3. The population undergoes reproduction until a stopping criterion is met. Repro-


duction consists of a number of iterations of the following three steps:

(a) One or more parents are chosen to reproduce. Selection is stochastic, but the
parents with the highest evaluations are favored in the selection.
(b) The operators are applied to the parents to produce children. The parameters
help determine which operators to use.
(c) The children are evaluated and inserted into the population. In some versions
of the genetic algorithm, the entire population is replaced in each cycle of
reproduction. In others, only subsets of the population are replaced.

When a genetic algorithm is run using a representation that usefully encodes solutions
to a problem and operators that can generate better children from good parents, the
algorithm can produce populations of better and better individuals, converging finally on
results close to a global optimum. In many cases the standard operators, mutation and
crossover, are sufficient for performing the optimization. In such cases, genetic algorithms
can serve as a black-box function optimizer not requiring their creator to input any
knowledge about the domain. However, knowledge of the domain can often be exploited to
improve the genetic algorithm's performance through the incorporation of new operators.
Genetic algorithms should not have the same problem with scaling as backpropa-
gation. One reason for this is that they generally improve the current best candidate
monotonically. They do this by keeping the current best individual as part of their popu-
lation while they search for better candidates. Secondly, genetic algorithms are generally
not bothered by local minima. The mutation and crossover operators can step from a
valley across a hill to an even lower valley with no more difficulty than descending directly
into a valley. The field of genetic algorithms was created by John Holland.

1.4 Applications

1. Game-playing robots
General game-playing software provides a great opportunity to make a relatively
simple robotic system act smart. A robot arm capable of moving pieces on a board,
coupled with a state-of-the-art player, can in principle learn to play arbitrary games
with these pieces. Beyond this, interesting challenges are posed by game-playing
robots that learn to recognise and manipulate new pieces, or mobile robots that
can solve a whole array of new tasks formulated as games.

2. Manufacturing
Take, for instance, the task of picking a device from one box and putting it in a
container. Robots are now training themselves to do this job with great speed and
precision. Fanuc, a Japanese company, takes pride in the industrial robot that is
clever enough to train itself to do this job.

3. Space management in warehouse


Optimizing space utilization is a challenge that drives warehouse managers to
seek best solutions. The high volumes of inventory, fluctuating demands for in-
ventories and slow replenishing rates of inventory are hurdles to cross before using
warehouse space in the best possible way. Reinforcement learning algorithms can
be built to reduce transit time for stocking as well as retrieving products in the
warehouse for optimizing space utilization and warehouse operations.

4. Dynamic pricing
Dynamic pricing is a well-suited strategy to adjust prices depending on supply
and demand to maximize revenue from products. Techniques like Q-learning can be
leveraged to provide solutions addressing dynamic pricing problems. Reinforcement
learning algorithms serve businesses to optimize pricing during interactions with
customers.

5. Customer delivery
A manufacturer wants to deliver products for customers with a fleet of trucks
ready to serve customer demands. With the aim to make split deliveries and realize
savings in the process, the manufacturer opts for Split Delivery Vehicle Routing
Problem. The prime objective of the manufacturer is to reduce total fleet cost
while meeting all demands of the customers.

6. E - Commerce personalization
For retailers and e-commerce merchants, it has grown into an absolute imperative
to tailor communications and promotions to fit customer purchasing habits.
Personalization is at the core of promoting relevant shopping experiences to cap-
ture customer loyalty. Reinforcement learning algorithms are proving their worth
by allowing e-commerce merchants to learn and analyze customer behaviors and
tailor products and services to suit customer interests.

7. Financial investment decisions


Pit.ai is at the forefront leveraging reinforcement learning for evaluating trading
strategies. It is turning out to be a robust tool for training systems to optimize
financial objectives. John Moody and Matthew Saffell have demonstrated how
reinforcement learning can be used for optimizing trading systems built for single
trading security or trading portfolios.

8. Medical industry
A dynamic treatment regime (DTR) is a subject of medical research setting
rules for finding effective treatments for patients. Diseases like cancer demand
treatments over a long period, where drugs and treatment levels are administered
over time. Reinforcement learning addresses this DTR problem, where RL
algorithms help in processing clinical data to come up with a treatment strategy,
using various clinical indicators collected from patients as inputs.

9. General Game Playing in AI Education


The fact that it addresses so many different aspects of AI makes general game
playing a valuable tool for AI education, too, as it allows teaching a variety of
methods on a single subject. A further advantage of general game playing is that it
naturally leads to practical assignments, where students can experiment with various
techniques to address an interesting and challenging problem. Plus, there is the
competitive aspect when students' players are pitted against each other for eval-
uation, which works as a great motivator. With today's available software tools
and online resources for teaching general game playing, this is also much easier to
organise than, say, a full-fledged robotics laboratory.

1.5 Existing solution methods

For a long time, the most influential methods for GGP have been Monte Carlo
Tree Search methods. The algorithm iteratively searches the game tree, starting from the
current state, in a series of iterations until the allotted time runs out. This has proven to
be very successful in the case of board games like Chess and Go. Systems like Deep Blue
were developed that could play Chess and eventually beat the world champion in the
game. But such systems were written with heuristics and hard-coded rules, meaning the
algorithm would fail to play any other game.
Recently, the focus in GGP has shifted to combinations of Deep Neural Networks
and Reinforcement Learning. Google's DeepMind achieved amazing results on a set of
Atari games, where the system was able to learn to play the games at superhuman
levels. The company also developed AlphaGo, which beat the 18-time world champion at the
Chinese game of Go. This was a major breakthrough in the field of AI, as the system was
able to learn by itself.

1.6 Proposed solution methods

In this project, evolutionary strategies will be explored to implement a General
Game Player. A system will be built from a combination of Neural Networks and Genetic
Algorithms that will eventually be able to learn to do a particular task. In particular, Con-
ventional Neuroevolution (CNE) and Neuroevolution of Augmented Topologies (NEAT)
will be implemented. CNE is designed to evolve only the weights of an ANN, while NEAT
goes further by evolving the structure of a network along with its weights.
Evolution using NEAT is loosely analogous to how learning takes place in the brain.

1.7 Time schedule for completion of the project work (Gantt chart)

Figure 1.1: Gantt Chart

2 Literature Survey

In this section, various methods used earlier to solve the problem will be discussed.
Earlier attempts at general game playing have been through tree-based methods
and heuristics. These were very successful in playing board games like Chess. Recently,
there have been attempts to solve the problem using Reinforcement Learning methods,
Q-Learning in particular. A team of researchers at Google combined the latest advances
in Deep Learning and Q-Learning, and the resulting algorithms were able to play a set of Atari
games without any prior knowledge about them. Very recently, OpenAI combined
Neural Networks and evolutionary strategies to beat the initial benchmark set by Google
on the Atari games.
Some of the literature studied and used in the implementation of this project is
discussed below.

2.1 A survey of Monte Carlo tree search methods

Monte Carlo Tree Search[2] was the algorithm of choice by the most competitive
General Game Playing agents. The algorithm iteratively searches the game tree start-
ing from the current state in a series of iterations until the allotted time runs out. An
iteration consists of the following four steps: selection, expansion, simulation, and back-
propagation.

1. Selection Step - The algorithm starts from the root of the game tree and chooses a
node within an already built part of the tree based on the nodes' statistics. Actions
which have been performing better so far are tested more frequently.

2. Expansion Step - It means extending the tree by a new node with the first
unvisited state so far, that is, the first state found after leaving the tree.

3. Simulation Step - After leaving the stored fragment of the tree, a random simu-
lation is performed until a game termination is reached.

4. Back-Propagation Step - The scores obtained by all players in the ended game
are fetched and back-propagated (back-propagation) to all nodes visited in the
selection and expansion steps.

One of the interesting things about this is that the same core algorithm can be used
for a whole class of games: Chess, Go, Othello, and almost any board game you can
think of. The main limitations are that the game has to have a finite number of possible
moves in each state as well as a finite length. Also, the algorithm does not work if the
upcoming states of the environment are unknown or cannot be analyzed. Top AIs for Go
(a notoriously difficult game to make a good AI for) use some form of MCTS.
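
To make the selection step concrete, a common node-selection rule used with MCTS is
UCB1 (the surveyed paper covers several variants; this sketch and the node fields it
assumes are illustrative, not taken from the paper):

    import math

    def ucb1(total_reward, visits, parent_visits, c=1.41):
        """UCB1 score: exploit a high average reward, explore rarely visited children."""
        if visits == 0:
            return float("inf")  # unvisited children are tried first
        return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

    def select_child(children, parent_visits):
        """Selection step: descend to the child with the highest UCB1 score."""
        return max(children, key=lambda n: ucb1(n.total_reward, n.visits, parent_visits))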

2.2 A GGP Feature Learning Algorithm

The goal is to learn generalized features which can be used to improve the quality
of play[5]. Given a move sequence ending in a terminal state, there are two stages to the
learning process, which are described in detail here. In the first stage, a 2-player game
tree is built that leads to a terminal state, and states are identified for learning. In the
second stage, general features are extracted from states to be used during game play. It
has the following stages.
1. Identifying States for Learning - GIFL identifies states for learning by per-
forming random walks in a game until a terminal state is reached. Then, a 2-ply
tree is built around the terminal state to analyze whether learning can occur.

2. Learning Generalized Features - The generalization process takes as input a
GGP state, an action, and a functional test which must be preserved during the
generalization process. The main task in learning GIFL features is to generalize
from full states found from the 2-player trees built in the previous section to a small
set of predicates which can possibly match many different states.

3. Extending GIFL Features Up The Tree - Building the tree identifies possible
states where offensive or defensive features could be learned, and builds generalized
GIFL features from these states. But, ideally, a learner would also learn elsewhere
in the tree. This learning can be performed using the same procedure by looking
at moves higher up in the random walk. Instead of looking for a move which leads
to a goal, GIFL looks for a move which contributes to the offensive feature. A tree
is then built around this move, and similar learning takes place.

The main disadvantage is that some necessary predicates may not be properly
generalized using this method. Moreover, the method is useful only for board games
and cannot be extended to complex video games.

2.3 Training Feedforward Neural Networks Using Genetic Algorithms

Evolutionary artificial neural networks (EANNs) refer to a special class of artificial
neural networks (ANNs) in which evolution is another fundamental form of adaptation
in addition to learning. The method uses Genetic algorithm(GA) to evolve ANNs[4]. The
paper talks about GAs inspired by natural evolution. The algorithm starts by initializing
a population of random genomes. These genomes are evaluated as they perform a given task
and each of them is given a score by a fitness function based on how well it performed the
task. Then, the genomes with higher fitness are favoured to reproduce. These genomes
go on to produce a new population and mutation is added into the new population. Here
a genome is a neural network with random weights. Over many generations, the networks
get better and better at performing the given task.
The major disadvantage here is that the method describes only the evolution of the weights
of the ANN; there is no provision for evolving the structure of the ANN.

2.4 High-level Reinforcement Learning in Strategy Games

Reinforcement learning lies somewhere in between supervised and unsupervised
learning. Whereas in supervised learning one has a target label for each training example
and in unsupervised learning one has no labels at all, in reinforcement learning one has
sparse and time-delayed labels: the rewards. Based only on those rewards, the agent
has to learn to behave in the environment. For learning, a Markov Decision Process is
chosen as the general framework. MDPs are a common method for modeling sequential
decision-making with stochastic actions. The following methods are used:

Q-learning - This method updates the value of a state-action pair after the action
has been taken in the state and an immediate reward has been received. Values
of state-action pairs, Q(s, a) are learned because the resulting policy is more easily
recoverable than learning the values of states alone, V (s). Q-learning will converge
to an optimal value function under conditions of sufficiently visiting each state-
action pair, but often requires many learning episodes to do so. In multiagent
domains, Q-learning is no longer guaranteed to converge due to the environment
no longer being stationary. Nevertheless, it has been shown to be effective at some
places.

Model-based Q-learning[3] - Q-learning is a model-free method. That is, it
learns a policy directly, without first obtaining the model parameters (the transition
and reward functions). An alternative is to use a model-based method that learns
the model parameters and uses the model definition to learn a policy. Learning a
model consists of learning the transition probabilities and reward values for each
state and action. If a good model is learned, an optimal policy can be found by
planning methods because the model parameters are now known. Rather than first
building a correct model and then finding a policy from that model, we learn the
model and the Q-values at the same time with the Dyna-Q approach. Great results
were achieved for a strategy game Civilization IV.
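
As an illustration of the tabular Q-learning update described above (a sketch, not code
from the cited work; the table size, learning rate and discount factor are assumed values):

    import numpy as np

    n_states, n_actions = 10, 4        # assumed sizes, for illustration only
    alpha, gamma = 0.1, 0.95           # assumed learning rate and discount factor
    Q = np.zeros((n_states, n_actions))

    def q_update(s, a, r, s_next):
        """Move Q(s, a) towards the immediate reward plus the discounted best future value."""
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])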

2.5 Human-level control through deep reinforcement learning

A game is considered to be a Markov Decision Process. Suppose you are an agent,
situated in an environment (e.g. Breakout game). The environment is in a certain state
(e.g. location of the paddle, location and direction of the ball, existence of every brick and
so on). The agent can perform certain actions in the environment (e.g. move the paddle
to the left or to the right). These actions sometimes result in a reward (e.g. increase in
score). Actions transform the environment and lead to a new state, where the agent can
perform another action, and so on. The rules for how actions are chosen is called a policy.
To perform well in the long-term, we need to take into account not only the immediate
rewards, but also the future rewards we are going to get.
In Q-learning we define a function Q(s, a) representing the maximum discounted fu-
ture reward when we perform action a in state s, and continue optimally from that point
on. The state of the environment in the Breakout game can be defined by the location
of the paddle, location and direction of the ball and the presence or absence of each indi-
vidual brick. This intuitive representation however is game specific. The obvious choice
is screen pixels they implicitly contain all of the relevant information about the game
situation.
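
In its standard form (written out here for completeness, not quoted from the paper), the
Q-function satisfies the Bellman equation

    Q(s, a) = r + \gamma \max_{a'} Q(s', a')

where r is the immediate reward for taking action a in state s, s' is the resulting state,
and \gamma is the discount factor applied to future rewards.
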
The method employed a classical convolutional neural network with three convolu-
tional layers, followed by two fully connected layers to play the game and update the
policy with each episode of the game using the Bellman Equation. The method achieved
amazing results for Atari games.
Many improvements to Deep Q-learning[7] have been proposed since its first introduc-
tion: Double Q-learning, Prioritized Experience Replay, Dueling Network Architectures,
and extensions to continuous action spaces, to name a few.

3 System Requirements and Analysis

3.1 Introduction

Requirement specification encompasses the needs of this project. It aims at provid-
ing a full description of the requirements based on the concepts defined in the Problem
Domain. The system requirement specification is produced at the culmination of the
analysis task. The function and performance allocated to software as part of system
engineering are refined by establishing a complete information description, a detailed
functional description, representation of system behavior, an indication of performance
requirements and design constraints, appropriate validation criteria and other information
pertinent to requirements.

3.2 Functional Requirements

The system should be able to learn to play a variety of games.

The system should be able to record/store what has been learned so that the system
won't need to learn again and again.

Develop a package/library that can be reused in other Reinforcement Learning
tasks.

3.3 Non - Functional Requirements

The system should learn a task in a reasonable amount of time.

The environment to be learned should not have a very large search space (e.g.
Chess, Go).

The developed package/library should be flexible enough to be easily integrated
with different environments.

3.4 System Requirements

3.5 Software Requirements

Operating Systems - Linux(Preferably Ubuntu 14.04 or higher)

Language - Python 3.5 or higher

3.6 Hardware Requirements

Processors - Any Intel or AMD x86 processor

RAM - 2048 MB (4096 MB recommended)

Disk Space - 10 GB (40 GB recommended)

Input device : Mouse and Keyboard

Output Device : 1024 * 768 resolution display with true colour

4 Tools and Technologies used

Various tools and technologies have been used for the development of the system.
The whole system was developed using Python and various libraries and frameworks
were used. The environments for the games were used from OpenAI Gym and Pygame
Learning Environment. All the tools are described below.

4.1 Python

Python is a widely used high-level programming language for general-purpose pro-
gramming, created by Guido van Rossum and first released in 1991. An interpreted
language, Python has a design philosophy which emphasizes code readability (notably
using whitespace indentation to delimit code blocks rather than curly braces or keywords),
and a syntax which allows programmers to express concepts in fewer lines of code than
possible in languages such as C++ or Java. The language provides constructs intended
to enable writing clear programs on both a small and large scale.
Python features a dynamic type system and automatic memory management and sup-
ports multiple programming paradigms, including object-oriented, imperative, functional
programming, and procedural styles. It has a large and comprehensive standard library.
Python interpreters are available for many operating systems, allowing Python code
to run on a wide variety of systems. CPython, the reference implementation of Python,
is open source software and has a community-based development model, as do nearly all
of its variant implementations. CPython is managed by the non-profit Python Software
Foundation.

4.2 Numpy

NumPy is a library for the Python programming language, adding support for
large, multidimensional arrays and matrices, along with a large collection of high-level
mathematical functions to operate on these arrays. The ancestor of NumPy, Numeric,
was originally created by Jim Hugunin with contributions from several other developers.
In 2005, Travis Oliphant created NumPy by incorporating features of the competing
Numarray into Numeric, with extensive modifications. NumPy is an open-source software
and has many contributors.

4.3 Pickle

The pickle module implements a fundamental, but powerful algorithm for serial-
izing and de-serializing a Python object structure. Pickling is the process whereby a
Python object hierarchy is converted into a byte stream, and unpickling is the inverse
operation, whereby a byte stream is converted back into an object hierarchy. Pickling
(and unpickling) is alternatively known as serialization, marshalling, or flattening;
however, to avoid confusion, the terms used here are pickling and unpickling.
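
A small usage sketch (the object saved here is just a placeholder standing in for an
evolved network):

    import pickle

    best_network = {"weights": [0.1, -0.4, 0.7]}     # placeholder object

    with open("best_network.pkl", "wb") as f:        # pickling: object -> byte stream
        pickle.dump(best_network, f)

    with open("best_network.pkl", "rb") as f:        # unpickling: byte stream -> object
        restored = pickle.load(f)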

4.4 Graphviz

Graphviz (short for Graph Visualization Software) is a package of open-source tools
initiated by AT&T Labs Research for drawing graphs specified in DOT language scripts. It
also provides libraries for software applications to use the tools. Graphviz is free software
licensed under the Eclipse Public License. It has a clean interface to visualize graphs and
convert them to PDF, PNG etc.

4.5 Virtualenv

Virtualenv is a tool to create isolated Python environments.
The basic problem being addressed is one of dependencies and versions, and indirectly
permissions. Imagine you have an application that needs version 1 of LibFoo, but another
application requires version 2. If everything is installed into /usr/lib/python2.7/site-
packages (or the platform's standard location), it's easy to end up in a situation where one
unintentionally upgrades an application that shouldn't be upgraded.
Suppose you want to install an application and leave it be: if an application works, any
change in its libraries or the versions of those libraries can break the application. Also,
what if you can't install packages into the global site-packages directory, for instance,
on a shared host?
In all these cases, virtualenv can help. It creates an environment that has its own
installation directories, that doesn't share libraries with other virtualenv environments
(and optionally doesn't access the globally installed libraries either).

4.6 OpenAI gym

OpenAI Gym[1] is a toolkit for developing and comparing reinforcement learning
algorithms. It makes no assumptions about the structure of the agent, and is compatible
with any numerical computation library, such as TensorFlow or Theano. You can use it
from Python code, and soon from other languages.
OpenAI Gym consists of two parts:

1. The gym open-source library: a collection of test problems (environments) that you
can use to work out your reinforcement learning algorithms. These environments
have a shared interface, allowing you to write general algorithms.

2. The OpenAI Gym service: a site and API allowing people to meaningfully compare
performance of their trained agents.

The Cart Pole Balancing task and the Super Mario Bros environment have been
used from OpenAI Gym for this project.
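
A minimal interaction loop with a Gym environment, using the classic API that was
current at the time of this project (an illustrative sketch with a random policy, not the
project's agent code):

    import gym

    env = gym.make("CartPole-v0")
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = env.action_space.sample()               # random action in place of an evolved ANN
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(total_reward)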

4.7 Pygame Learning Environment

PyGame Learning Environment (PLE) is a learning environment, mimicking the
Arcade Learning Environment interface, allowing a quick start to Reinforcement Learning
in Python. The goal of PLE is to allow practitioners to focus on the design of models and
experiments instead of environment design. PLE allows agents to train against games through
a standard model which interacts with and manipulates games on behalf of your agent.
The Library has the following environments.

1. Catcher

2. Monster Kong

3. FlappyBird

4. Pixel Copter

5. Pong

6. PuckWorld

7. Raycast Maze

8. Snake

9. WaterWorld

The Flappy Bird and Pong environments have been used from the Pygame Learning
Environment for this project.
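
A rough sketch of how an agent interacts with a PLE game such as Flappy Bird; the
method names follow the PLE documentation as commonly used, but treat the exact
calls as assumptions rather than verified API:

    from ple import PLE
    from ple.games.flappybird import FlappyBird

    game = FlappyBird()
    env = PLE(game, fps=30, display_screen=False)
    env.init()

    actions = env.getActionSet()            # available actions (flap / do nothing)
    while not env.game_over():
        state = env.getGameState()          # dictionary of game-specific features
        reward = env.act(actions[0])        # act() returns the reward for the chosen action
    env.reset_game()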

4.8 Sublime Text

Sublime Text is a proprietary cross-platform source code editor with a Python
application programming interface (API). It natively supports many programming lan-
guages and markup languages, and its functionality can be extended by users with plugins,
typically community-built and maintained under free-software licenses.
Sublime text was used to write and edit the complete code for this project.

4.9 Git and Github

Git is a Version Control System (VCS) for tracking changes in computer files and
coordinating work on those files among multiple people. It is primarily used for software
development, but it can be used to keep track of changes in any files. As a distributed
revision control system it is aimed at speed,data integrity, and support for distributed,
non-linear workflows.
Git was created by Linus Torvalds in 2005 for development of the Linux kernel, with
other kernel developers contributing to its initial development. Its current maintainer
since 2005 is Junio Hamano.
As with most other distributed version control systems, and unlike most client-server
systems, every Git directory on every computer is a full-fledged repository with complete
history and full version tracking abilities, independent of network access or a central
server. Like the Linux kernel, Git is free software distributed under the terms of the
GNU General Public License version 2.

4.10 Google Compute Engine

Google Compute Engine (GCE) is the Infrastructure as a Service (IaaS) component
of Google Cloud Platform, which is built on the global infrastructure that runs Google's
search engine, Gmail, YouTube and other services. Google Compute Engine enables users
to launch virtual machines (VMs) on demand. VMs can be launched from the standard
images or custom images created by users. GCE users need to get authenticated based
on OAuth 2.0 before launching the VMs. Google Compute Engine can be accessed via
the Developer Console, RESTful API or Command Line Interface.

4.11 Latex

LaTeX is a document markup language and document preparation system for the TeX
typesetting program. The term LaTeX refers only to the language in which documents are
written, not to the editor application used to write those documents. In order to create
a document in LaTeX, a .tex file must be created using some form of text editor. While
most text editors can be used to create a LaTeX document, a number of editors have been
created specifically for working with LaTeX. LaTeX is widely used in academia. As a primary
or intermediate format, e.g., translating DocBook and other XML-based formats to PDF,
LaTeX is used because of the high quality of typesetting achievable by TeX. The type-
setting system offers programmable desktop publishing features and extensive facilities
for automating most aspects of typesetting and desktop publishing, including numbering
and cross-referencing, tables and figures, page layout and bibliographies. LaTeX is in-
tended to provide a high-level language that accesses the power of TeX. LaTeX essentially
comprises a collection of TeX macros and a program to process LaTeX documents.

5 System Design

5.1 Introduction

System design describes constraints in the system and includes any assumptions
made by the project team during development. It is the process or art of defining the
hardware and software architecture, components, modules, interfaces, and data for a
computer system to satisfy specified requirements. One could see it as the application of
systems theory to computing. The design of the system is essentially a blueprint, or a
plan for a solution for the system. A system is considered to be a set of components with
clear and defined behaviour, which interact with each other in a fixed, defined manner,
to produce some behaviour or services for its environment.
The system developed is a package/library that can learn to play a given game by
itself by combining the strengths of Artificial Neural Networks and Genetic Algorithms.

5.2 Block Diagram

In this section, diagrams are included showing in schematic form, the general ar-
rangement of the components of our system. The first part shows how an Artificial
Neural Network is able to interact with the environment. The latter part describes how
the Genetic Algorithm is able to evolve the Neural Networks to do a particular task.

5.2.1 ANN to interact with Environment

Each environment has a different number of input and output parameters. So, to
implement a generalized model that can interact with multiple environments, an Artificial
Neural Network is used. To make an ANN suitable for any environment, it is
just a matter of changing the size of the input layer and output layer to match the
environment. As an example, for a Snake game, the inputs would be (x1, y1), the position
of the head of the snake, and (x2, y2), the position of the food in the environment. So,
the input layer size of the ANN would be 4. The possible outputs for the snake would be
a decision whether to go left, right, up or down, so the output layer size of the ANN would
also be 4 in this case. Each node in the output layer represents the probability of taking
a particular decision. The agent selects the decision whose node has the highest value.
The block diagram is described in figure 5.1.
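
A sketch of this mapping for the snake example (the single-layer toy network and the
observation values are hypothetical):

    import numpy as np

    rng = np.random.default_rng(0)
    weights = rng.uniform(-1, 1, (4, 4))         # toy ANN: 4 inputs -> 4 output nodes

    def select_action(weights, observation):
        """Feed the observation through the ANN and pick the output node with the highest value."""
        outputs = weights @ observation          # one value per possible decision
        return int(np.argmax(outputs))           # e.g. 0=left, 1=right, 2=up, 3=down

    # Hypothetical observation: snake head (x1, y1) and food position (x2, y2)
    print(select_action(weights, np.array([3.0, 5.0, 7.0, 2.0])))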

Figure 5.1: Interacting with the environment

5.2.2 Evolution of ANNs

A population of random ANNs is created, with the input and output layer size of
each ANN equal to the number of input parameters and output parameters of the corre-
sponding environment in context. The environments are adapted from OpenAI gym and
Pygame Learning Environment. All the ANNs are allowed to perform in the environment and
their score (fitness) is recorded. Fit parents are retained and the weak ones are discarded.
A new population is created from the retained fit parents by favouring the parents with
higher fitness for reproduction. Multiple mutation operators are applied to the newly
created population. The same process is continued until a specified termination criterion
is met. After an agent has solved the task or performed better than its predecessors, the
ANN is stored using Pickle. The block diagram is shown in figure 5.2.
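
The loop described above can be summarized in a short sketch (the helper functions and
the retention fraction are assumptions, not the project's exact implementation):

    import pickle
    import random

    def evolve(population, evaluate, crossover, mutate, generations, keep=0.2):
        """Evaluate ANNs, retain the fit ones, breed and mutate, and store the best with Pickle."""
        for _ in range(generations):
            scored = sorted(population, key=evaluate, reverse=True)
            parents = scored[:max(2, int(keep * len(scored)))]   # retain fit parents
            with open("best_ann.pkl", "wb") as f:
                pickle.dump(parents[0], f)                       # store the best ANN so far
            children = [mutate(crossover(random.choice(parents), random.choice(parents)))
                        for _ in range(len(population) - len(parents))]
            population = parents + children
        return population[0]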

Figure 5.2: Evolution of ANNs

6 System Implementation

6.1 Introduction
Based on the above system design, two algorithms are implemented. The first
algorithm is Conventional Neuroevolution (CNE) and the second is Neuroevolution of Aug-
mented Topologies (NEAT).
CNE focuses on evolution with a static structure of ANN by optimizing the weights
of the ANN. NEAT goes a step beyond and focuses on both evolution of weights and the
structure of the ANN. Both of these algorithms will be discussed below.

6.2 Conventional Neuroevolution


The components of the CNE[4] are described as follows.

1. Chromosome Encoding: The weights (and biases) in the neural network are
encoded as a list of real numbers as shown in figure 6.1.

2. Evaluation Function: Assign the weights on the chromosome to the links in a
network of a given architecture, run the network over the training set of examples
or in the given environment, and return the sum of the squares of the errors or
the score of the agent.

3. Initialization Procedure: The weights of the initial members of the population
are chosen at random with a uniform distribution between -1.0 and 1.0.

4. Operators: There are different types of genetic operators. These are grouped into
two basic categories: mutations, crossovers. A mutation operator takes one parent
and randomly changes some of the entries in its chromosome to create a child. A
crossover operator takes two parents and creates one child containing some of the
genetic material of each parent. Each of the operators are discussed individually
below.

UNBIASED-MUTATE-WEIGHTS: For each entry in the chromosome,
this operator will, with fixed probability p = 0.1, replace it with a random
value chosen from the initialization probability distribution.
BIASED-MUTATE-WEIGHTS: For each entry in the chromosome, this
operator will, with fixed probability p = 0.1, add to its weight a random
value chosen from the initialization probability distribution. We expect biased
mutation to be better than unbiased mutation. This is because, right from
the start of a run, parents are chosen which tend to be better than average.
Therefore, the weight settings in these parents tend to be better than random
settings. Hence, biasing the probability distribution by the present value of
the weight should give better results than a probability distribution centered
on zero.
CROSSOVER-WEIGHTS: This operator puts a value into each position
of the child's chromosome by randomly selecting one of the two parents and
using the value in the same position on that parent's chromosome, as shown in
Figure 6.2.

5. Parameter Settings: There are different parameters whose values can greatly
influence the performance of the algorithm. Except where stated otherwise, these
were kept constant across runs. The different parameters are,

(a) POPULATION-SIZE: 50
(b) MUTATION-PROBABILITY = 0.3
(c) BIASED-MUTATE-WEIGHTS-PROBABILITY = 0.5

The genetic algorithm operates according to the following steps:

(a) The population is initialized. The result of the initialization is a set of chro-
mosomes. Each member of the population is evaluated. Evaluations may be
normalized. The important thing is to preserve relative ranking of evaluations.
(b) The population undergoes reproduction until a stopping criterion is met.
Reproduction consists of a number of iterations of the following three steps:
i. One or more parents are chosen to reproduce. Selection is stochastic, but
the parents with the highest evaluations are favored in the selection.
ii. The operators are applied to the parents to produce children. The param-
eters help determine which operators to use.
iii. The children are evaluated and inserted into the population. In some ver-
sions of the genetic algorithm, the entire population is replaced in each
cycle of reproduction. In others, only subsets of the population are re-
placed.
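
The following is a minimal sketch of the three operators above for a chromosome stored as a flat list of weights. The function names and the uniform initialization distribution follow the description in this section; the fitness evaluation itself is assumed to be supplied by the environment wrapper.

import random

def unbiased_mutate(chromosome, p=0.1):
    # Replace each weight, with probability p, by a fresh value from the init distribution.
    return [random.uniform(-1.0, 1.0) if random.random() < p else w for w in chromosome]

def biased_mutate(chromosome, p=0.1):
    # Perturb each weight, with probability p, around its current value.
    return [w + random.uniform(-1.0, 1.0) if random.random() < p else w for w in chromosome]

def crossover(parent_a, parent_b):
    # Copy each gene position from a randomly chosen parent.
    return [random.choice(pair) for pair in zip(parent_a, parent_b)]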


Figure 6.1: Encoding an ANN

Figure 6.2: Crossover of ANNs


6.3 Neuroevolution of Augmented Topologies (NEAT)


NeuroEvolution of Augmenting Topologies (NEAT)[10] is designed to take advan-
tage of structure as a way of minimizing the dimensionality of the search space of con-
nection weights. If structure is evolved such that topologies are minimized and grown
incrementally, significant gains in learning speed result. Improved efficiency results from
topologies being minimized throughout evolution.

6.3.1 Genetic Encoding

NEAT's genetic encoding scheme is designed to allow corresponding genes to be
easily lined up when two genomes cross over during mating. Genomes are linear repre-
sentations of network connectivity as shown in figure 6.3. Each genome includes a list of
connection genes, each of which refers to two node genes being connected. Node genes
provide a list of inputs, hidden nodes, and outputs that can be connected. Each connec-
tion gene specifies the in-node, the out-node, the weight of the connection, whether or
not the connection gene is expressed (an enable bit), and an innovation number, which
allows finding corresponding genes.
Mutation in NEAT can change both connection weights and network structures. Con-
nection weights mutate as in any NE system, with each connection either perturbed or
not at each generation. Structural mutations occur in two ways as shown in figure 6.4.
Each mutation expands the size of the genome by adding gene(s). In the add connection
mutation, a single new connection gene with a random weight is added connecting two
previously unconnected nodes. In the add node mutation, an existing connection is split
and the new node placed where the old connection used to be. The old connection is
disabled and two new connections are added to the genome. The new connection leading
into the new node receives a weight of 1, and the new connection leading out receives the
same weight as the old connection. This method of adding nodes was chosen in order
to minimize the initial effect of the mutation. The new nonlinearity in the connection
changes the function slightly, but new nodes can be immediately integrated into the net-
work, as opposed to adding extraneous structure that would have to be evolved into the
network later. This way, because of speciation, the network will have time to optimize
and make use of its new structure.
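
A minimal sketch of this genome representation is given below; the class and field names are our own, not those of any particular NEAT library.

from dataclasses import dataclass, field
from typing import List

@dataclass
class ConnectionGene:
    in_node: int       # source node id
    out_node: int      # destination node id
    weight: float      # connection weight
    enabled: bool      # enable bit: whether the gene is expressed
    innovation: int    # historical marking used to line up genes during crossover

@dataclass
class Genome:
    nodes: List[int] = field(default_factory=list)                  # input, hidden and output node ids
    connections: List[ConnectionGene] = field(default_factory=list) # list of connection genes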

6.3.2 Tracking Genes through Historical Markings

There is unexploited information in evolution that tells us exactly which genes


match up with which genes between individuals in a topologically diverse population.
That information is the historical origin of each gene. Two genes with the same historical
origin must represent the same structure (although possibly with different weights), since


Figure 6.3: Genetic Encoding

Figure 6.4: Mutation in NEAT


they are both derived from the same ancestral gene at some point in the past. Thus,
all a system needs to do to know which genes line up with which is to keep track of the
historical origin of every gene in the system.
Tracking the historical origins requires very little computation. Whenever a new
gene appears (through structural mutation), a global innovation number is incremented
and assigned to that gene. The innovation numbers thus represent a chronology of the
appearance of every gene in the system.
The historical markings give NEAT a powerful new capability. The system now knows
exactly which genes match up with which as shown in figure 6.5. When crossing over, the
genes in both genomes with the same innovation numbers are lined up. These genes are
called matching genes. Genes that do not match are either disjoint or excess, depending on
whether they occur within or outside the range of the other parent's innovation numbers.
They represent structure that is not present in the other genome. In composing the
offspring, genes are randomly chosen from either parent at matching genes, whereas all
excess or disjoint genes are always included from the more fit parent. This way, historical
markings allow NEAT to perform crossover using linear genomes without the need for
expensive topological analysis.
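
A minimal sketch of this crossover is shown below, using the genome structures sketched in section 6.3.1 and assuming `fitter` is the parent with the higher fitness.

import random

def neat_crossover(fitter, weaker):
    # Index the less fit parent's genes by innovation number so genes can be lined up.
    weaker_genes = {g.innovation: g for g in weaker.connections}
    child_connections = []
    for gene in fitter.connections:
        match = weaker_genes.get(gene.innovation)
        if match is not None:
            # Matching genes: inherit randomly from either parent.
            child_connections.append(random.choice([gene, match]))
        else:
            # Disjoint and excess genes are taken from the more fit parent.
            child_connections.append(gene)
    return child_connections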

6.3.3 Protecting Innovation through Speciation

Speciating the population allows organisms to compete primarily within their own
niches instead of with the population at large. This way, topological innovations are
protected in a new niche where they have time to optimize their structure through com-
petition within the niche. The idea is to divide the population into species such that
similar topologies are in the same species.
The number of excess and disjoint genes between a pair of genomes is a natural
measure of their compatibility distance. The more disjoint two genomes are, the less
evolutionary history they share, and thus the less compatible they are. Therefore, the
compatibility distance of different structures in NEAT can be measured as a simple
linear combination of the number of excess E and disjoint D genes, as well as the average
weight differences of matching genes W, including disabled genes:

\[
\delta = \frac{c_1 E}{N} + \frac{c_2 D}{N} + c_3 \cdot \overline{W} \tag{1}
\]
The coefficients c1, c2, and c3 allow us to adjust the importance of the three factors,
and the factor N , the number of genes in the larger genome, normalizes for genome size
(N can be set to 1 if both genomes are small, i.e., consist of fewer than 20 genes).
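
A minimal sketch of equation (1) follows, using the genome structures sketched earlier and the coefficient values from Table 6.2; the split of unmatched genes into excess and disjoint follows the definition above.

def compatibility(genome_a, genome_b, c1=1.0, c2=2.0, c3=0.4):
    genes_a = {g.innovation: g for g in genome_a.connections}
    genes_b = {g.innovation: g for g in genome_b.connections}
    matching = genes_a.keys() & genes_b.keys()
    cutoff = min(max(genes_a, default=0), max(genes_b, default=0))
    unmatched = (genes_a.keys() | genes_b.keys()) - matching
    excess = sum(1 for i in unmatched if i > cutoff)   # outside the other parent's innovation range
    disjoint = len(unmatched) - excess                 # inside the shared range
    w_bar = (sum(abs(genes_a[i].weight - genes_b[i].weight) for i in matching)
             / max(len(matching), 1))                  # average weight difference of matching genes
    n = max(len(genes_a), len(genes_b))
    n = 1 if n < 20 else n                             # N may be set to 1 for small genomes
    return c1 * excess / n + c2 * disjoint / n + c3 * w_bar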


Figure 6.5: Matching of Genes and Crossover

6.3.4 Minimizing Dimensionality through Incremental Growth from Minimal Structure

NEAT biases the search towards minimal-dimensional spaces by starting out with
a uniform population of networks with zero hidden nodes (i.e., all inputs connect directly
to outputs). New structure is introduced incrementally as structural mutations occur,
and only those structures survive that are found to be useful through fitness evaluations.
In other words, the structural elaborations that occur in NEAT are always justified. Since
the population starts minimally, the dimensionality of the search space is minimized, and


NEAT is always searching through fewer dimensions than other TWEANNs and fixed-
topology NE systems. Minimizing dimensionality gives NEAT a performance advantage
compared to other approaches.

6.3.5 NEAT Algorithm

With the components described in the previous sections, the algorithm to evolve
Neural Networks is described here. A population of random ANNs is created. Then,
the population is divided into species and the fitness is evaluated. Based on this fitness,
the parents to produce the new population are selected. This continues until the task is
solved. The algorithm is described in detail in Algorithm 6.1; a minimal Python sketch of this loop follows the algorithm.

Algorithm 6.1 NEAT Algorithm


1: procedure NEAT algorithm
2: Initialize a population of random ANNs
3: Divide the population into species based on equation 1.
4: while The Terminating criteria is not met do
5: Evaluate the fitness of each individual.
6: Sort the individuals based on fitness in descending order in each species.
7: Remove the lower half of each species.
8: Produce new offspring from each species by crossover or mutation.
9: Mutate all the existing individuals.
10: Remove all the individuals from each species except the most fit individual.
11: Assign the new offspring to species based on equation 1.
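
A minimal Python sketch of Algorithm 6.1 is given below; `evaluate`, `speciate`, `reproduce` and `mutate` stand for the components described in the previous sections and are assumed to be supplied by the caller.

def run_neat(population, evaluate, speciate, reproduce, mutate,
             max_generations=1000, target_fitness=None):
    best = None
    for generation in range(max_generations):
        for genome in population:
            genome.fitness = evaluate(genome)        # run the ANN on the task
        best = max(population, key=lambda g: g.fitness)
        if target_fitness is not None and best.fitness >= target_fitness:
            break                                    # terminating criterion met
        new_population = []
        for species in speciate(population):         # grouping based on equation 1
            species.sort(key=lambda g: g.fitness, reverse=True)
            survivors = species[:max(1, len(species) // 2)]   # drop the lower half
            new_population.append(survivors[0])               # keep the species champion
            for _ in range(len(species) - 1):
                new_population.append(mutate(reproduce(survivors)))
        population = new_population
    return best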

6.3.6 Parameter Settings

The values described in Table 6.1 are the mutation rates used with the NEAT
algorithm. These values can be modified according to the problem domain.

Name Value
Perturb Weight 0.8
Perturb Bias 0.25
Biased Perturb Weight 0.9
Add New Node 0.03
Add connection 0.05
Crossover 0.75
Enable Connection 0.2
Disable Connection 0.4

Table 6.1: Mutation Rates for NEAT


The values described in Table 6.2 are the parameters set for speciation. These can
be modified according to the problem domain.

Name Value
c1 1
c2 2
c3 0.4
Compatibility threshold (δt) 3

Table 6.2: Parameters for speciation for NEAT


CHAPTER 7


7 Testing and Results Analysis

7.1 Introduction
Testing is intended to show that a system conforms to its specification. Large systems
are built out of subsystems, which are built out of modules, which in turn are composed
of procedures and functions. The testing process should therefore proceed in stages, with
testing carried out incrementally in conjunction with system implementation. The
algorithms were tested along with the implementation of the various modules. This
method of testing helps ensure that each module works correctly at the time of its
implementation. The existence of program defects or inadequacies is inferred from
unexpected system outputs. Program testing is used for verification and validation.

7.2 Unit Testing


The primary goal of unit testing is to take the smallest piece of testable software
in the application, isolate it from the remainder of the code, and determine whether it
behaves exactly as you expect. Each unit is tested separately before integrating them
into modules to test the interfaces between modules. Unit testing has proven its value
in that a large percentage of defects are identified during its use. This type of testing is
driven by the architecture and implementation teams. This focus is also called black-box
testing because only the details of the interface are visible to the test. Limits that are
global to a unit are tested here.
The algorithms are now tested to do specific tasks.

7.3 Evolving XOR


The inputs and outputs for XOR are described in table 7.1.

Input1 Input2 Output


0 0 0
0 1 1
1 0 1
1 1 0

Table 7.1: XOR Problem

Solving XOR is a simple yet difficult problem because its outputs are not linearly
separable, so hidden layers are required to solve it. In this section, a comparison between
the standard backpropagation algorithm and the evolution algorithms (CNE and NEAT)
is provided.

The comparison was recorded over 50 runs of each algorithm. The results are described
in Table 7.2 and Figures 7.1, 7.2 and 7.3.
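
A minimal sketch of a fitness function that could drive this evolution is shown below, assuming `activate(network, inputs)` returns the network's single output for a pair of inputs.

XOR_CASES = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

def xor_fitness(network, activate):
    # Fitness rises as the squared error over the four cases falls; 4.0 means a perfect solution.
    error = sum((activate(network, inputs) - target) ** 2
                for inputs, target in XOR_CASES)
    return 4.0 - error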

7.3.1 Comparison analysis for solving XOR

The results for solving XOR over 50 runs are described in table 7.2:
Algorithm Minimum Epoch Maximum Epoch Average Epochs
CNE 1 199 34.5
NEAT 3 183 66.36
Backpropagation 205 7276 748

Table 7.2: Results for solving XOR

Figure 7.1: Solving XOR using CNE

7.3.2 Evolution of XOR using NEAT

NEAT starts with no hidden layers and so the initial structures would not be able
to solve XOR. Because XOR is not linearly separable, a neural network requires hidden


Figure 7.2: Solving XOR using NEAT

Figure 7.3: Solving XOR using Backpropagation


units to solve it. The two inputs must be combined at some hidden unit, as opposed to
only at the output node, because there is no function over a linear combination of the
inputs that can separate the inputs into the proper classes. These structural requirements
make XOR suitable for testing NEAT's ability to evolve structure. Figure 7.4 shows the
smallest ANN that NEAT evolved to solve the XOR problem.

Figure 7.4: Evolution of structure by NEAT

7.4 Cartpole Balancing task using NEAT


There are many control learning tasks where the techniques employed in NEAT can
make a difference. Many of these potential applications, like robot navigation or game
playing, present problems without known solutions. We use the pole balancing domain
for comparison because it is a known benchmark in the literature, which makes it possible
to demonstrate the effectiveness of NEAT compared to others. It is also a good surrogate
for real problems, in part because pole balancing in fact is a real task, and also because
the difficulty can be adjusted.
Pole balancing[1] is a control benchmark historically used in engineering. It involves
a pole affixed to a cart via a joint which allows movement along a single axis. The cart is
able to move along a track of fixed length. A trial typically begins with the pole off-center
by a certain number of degrees. The goal is to keep the pole from falling over by moving
the cart in either direction, without falling off either edge of the track. A more difficult
extension of this problem involves two poles, both affixed at the same point on the cart.


The environment for cartpole balancing task was adopted from OpenAI gym. The
environment provides 4 observations.
1. Cart Position

2. Cart Velocity

3. Pole Angle

4. Pole Velocity At Tip

Based on these observations, the agent can do 2 things to keep the pole from falling

1. Move left

2. Move right

NEAT is allowed to evolve a network with 4 input nodes, 1 bias node and 1 output node.
If the value of the output node is less than 0.5, the agent moves left; otherwise it moves
right. NEAT was able to solve the environment in less than 20 seconds in the worst
case. The environment and the ANN which solves it are shown in figures 7.5 and 7.6
respectively.
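
A minimal fitness-evaluation sketch for this setup is given below, written against the classic OpenAI Gym interface of the time (`reset` returning the observation, `step` returning a four-tuple); `activate` is assumed to map the four observations to a single output in [0, 1].

import gym

def evaluate_cartpole(network, activate, episodes=3):
    env = gym.make("CartPole-v1")
    total_reward = 0.0
    for _ in range(episodes):
        observation = env.reset()
        done = False
        while not done:
            # Output below 0.5 pushes the cart left (action 0), otherwise right (action 1).
            action = 0 if activate(network, observation) < 0.5 else 1
            observation, reward, done, _ = env.step(action)
            total_reward += reward
    env.close()
    return total_reward / episodes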

Figure 7.5: Cartpole Balancing Environment


Figure 7.6: ANN to solve Cartpole Balancing Environment

7.5 Playing Pong game using NEAT


Pong is a two-dimensional sports game that simulates table tennis. The player
controls an in-game paddle by moving it vertically across the left side of the screen, and
can compete against either a computer-controlled opponent or another player controlling
a second paddle on the opposing side. Players use the paddles to hit a ball back and
forth. The aim is for each player to reach eleven points before the opponent; points are
earned when one fails to return the ball to the other.
The environment to play Pong[12] has been adopted from Pygame Learning Environ-
ment. The environment returns the following observation for each frame.
1. Player y position.

2. Player velocity.

3. Cpu y position.

4. Ball x position.

5. Ball y position.

6. Ball x velocity.

7. Ball y velocity.

Based on these observations, the agent can take one of the following 3 actions to return
the ball:

1. Go Up


2. Be at the same position

3. Go Down

NEAT was able to learn to play the game within 5 minutes in the worst case. Figure 7.7
shows NEAT playing against the preprogrammed player.
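
A minimal sketch of one evaluation episode is shown below, using the Pygame Learning Environment interface (`init`, `getActionSet`, `getGameState`, `act`, `game_over`). The ordering of the state values and of the action set, and the assumption that `activate` returns one activation per available action, are ours and may need adjusting.

from ple import PLE
from ple.games.pong import Pong

def evaluate_pong(network, activate, max_frames=5000):
    game = Pong()
    env = PLE(game, fps=30, display_screen=False)
    env.init()
    actions = env.getActionSet()                      # available actions (assumed: up, stay, down)
    score = 0.0
    for _ in range(max_frames):
        if env.game_over():
            break
        state = list(env.getGameState().values())     # the seven observations listed above
        outputs = activate(network, state)            # one activation per available action
        score += env.act(actions[outputs.index(max(outputs))])
    return score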

Figure 7.7: NEAT Playing Pong

7.6 Playing Flappy Bird using NEAT


Flappy Bird is a mobile game developed by Vietnamese artist and programmer
Dong Nguyen, under his game development company dotGEARS. The game is a side-
scroller where the player controls a bird, attempting to fly between rows of green pipes
without hitting them. The developer created the game over several days, using a bird
protagonist which he had designed for a cancelled game in 2012. The game was released
in May 2013 but received a sudden rise in popularity in early 2014. It was criticized for
its level of difficulty, plagiarism in graphics and game mechanics, while other reviewers
found it addictive.
Flappy Bird was a side-scrolling mobile game featuring 2D retro style graphics. The
objective was to direct a flying bird, named "Faby", who moves continuously to the right,
between sets of Mario-like pipes. If the player touches the pipes, they lose. Faby briefly


flaps upward each time that the player taps the screen; if the screen is not tapped, Faby
falls because of gravity; each pair of pipes that he navigates between earns the player
a single point, with medals awarded for the score at the end of the game. No medal
is awarded to scores less than ten. A bronze medal is given to scores between ten and
twenty. In order to receive the silver medal, the player must reach 20 points. The gold
medal is given to those who score higher than thirty points. Players who achieve a score
of forty or higher receive a platinum medal. Android devices enabled the access of world
leaderboards, through Google Play.
The environment for Flappy Bird[11] has been adopted from Pygame Learning Envi-
ronment. The environment returns the following observations as state of the game.
1. Player y position.

2. Player velocity.

3. Next pipe distance to player

4. Next pipe top y position

5. Next pipe bottom y position

6. Next next pipe distance to player

7. Next next pipe top y position

8. Next next pipe bottom y position.

Based on the state, the agent can do 2 things

1. Go up

2. No operation (so that the bird falls)

The ANN has 8 input nodes, 1 bias node and 2 output nodes. The output nodes
give the probability of performing each action (up or no operation), and the agent
selects the action with the higher probability. NEAT took about 3 hours of evolution
to learn to play this game, which reflects its difficulty. NEAT playing the Flappy Bird
game and the ANN that was evolved to play it are shown in figures 7.8 and 7.9
respectively.
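
A minimal sketch of this decision step follows; `activate` is assumed to return the two output activations, and the returned labels stand in for whatever action codes the environment wrapper expects.

def flappy_action(network, activate, observations):
    # observations: the eight state features listed above, in a fixed order.
    p_up, p_noop = activate(network, observations)    # two output activations
    # Flap only when the network prefers "up"; otherwise do nothing and let gravity act.
    return "up" if p_up > p_noop else "noop"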


Figure 7.8: NEAT playing the flappy Bird Game

7.7 Playing Super Mario Bros using NEAT


Super Mario Bros is a platform video game developed and published by Nintendo
for the Nintendo Entertainment System home console. Released as a sequel to the 1983
game Mario Bros., Super Mario Bros. was released in Japan and North America in 1985,
and in Europe and Australia two years later. In Super Mario Bros., the player controls
Mario and, in a two-player game, a second player controls Mario's brother Luigi, as he
travels through the Mushroom Kingdom in order to rescue Princess Toadstool from the
antagonist Bowser.
The player takes on the role of the main protagonist of the series, Mario. The objective
is to race through the Mushroom Kingdom, survive the main antagonist Bowser's forces,
and save Princess Toadstool. The player moves from the left side of the screen to the
right side in order to reach the flagpole at the end of each level. Mario's primary attack
is jumping on top of enemies, though many enemies respond differently to this.
The environment for Super Mario Bros [8] has been adopted from OpenAI Gym. The
state of the environment is given as a 16 × 13 matrix: a value of 3 represents the position
of Mario, a value of 2 represents enemies, a value of 1 represents blocks, and 0 represents
empty space, as shown in figure 7.10.
Based on this observation, the agent can go left, up, right or down, jump, move fast, or
perform a combination of these, leading to 14 different actions. The ANN has 208 input
nodes, 1 bias node and 14 output nodes. The agent takes the action associated with the
output node that has the highest probability value. The evolution took about


24 hours, and the system was left to learn on a Google Cloud Compute Engine instance
with 4 vCPUs and 15 GB of memory. A VNC server was set up to create a desktop,
which was accessed through a VNC client, and the learning was started. Figure 7.11 is
an image of the algorithm playing the game.
With this it is shown that our algorithm was able to learn to play a variety of games on
its own without any help. This shows that machines can indeed learn by themselves
to do a particular task rather than being explicitly programmed.
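
A minimal sketch of how the tile matrix could be flattened into the 208 network inputs and mapped to one of the 14 actions is shown below; the returned index is assumed to be translated into the environment's action encoding by the surrounding wrapper.

def mario_action(network, activate, tiles):
    # tiles: the 16 x 13 grid where 3 marks Mario, 2 enemies, 1 blocks and 0 empty space.
    inputs = [cell for row in tiles for cell in row]           # flattened into 208 inputs
    outputs = activate(network, inputs)                        # 14 output activations
    return max(range(len(outputs)), key=lambda i: outputs[i])  # index of the chosen action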


Figure 7.9: ANN evolved to play Flappy Bird


Figure 7.10: State of the Super Mario Bros game

Figure 7.11: NEAT playing Super Mario Bros


CHAPTER 8


8 Conclusion and Future Work


The main conclusion is that NEAT is a powerful method for evolving neural net-
works. NEAT demonstrates that evolving topology along with weights can be made a
major advantage. Experimental comparisons verify that such evolution is several times
more efficient than earlier neuroevolution methods. Studies show that historical mark-
ings, protection of innovation through speciation, and incremental growth from minimal
structure all work together to produce a system that is capable of evolving solutions of
minimal complexity. NEAT strengthens the analogy between GAs and natural evolution
by both optimizing and complexifying solutions simultaneously. The capacity to com-
plexify solutions over the course of evolution offers the possibility of continual competitive
coevolution and evolution of combinations of experts in the future.
Recently, OpenAI reported results competitive with Google DeepMind's on Atari games
using evolution strategies rather than reinforcement learning. Their work suggests that
neuroevolution approaches can be competitive with reinforcement learning methods on
modern agent-environment benchmarks, while offering significant benefits related to code
complexity and ease of scaling to large-scale distributed settings. More exciting work can
be done by revisiting other ideas from this line of work, such as indirect encoding methods,
or evolving the network structure in addition to the parameters. This area needs to be
explored further so that new algorithms can be developed and, some day, intelligence can
be solved.
In future work, we can apply evolution strategies to problems for which reinforcement
learning is less well suited: problems with long time horizons and complicated reward
structures. One major area of interest is meta-learning, or learning-to-learn. Using ES
instead of RL, we hope to be able to extend these results.


Appendix A

Project Team Details

Project Title: AI General Game Player using Neuroevolution Algorithms

USN Team Members Name CGPA E-mail Id Mobile No.


4JC13CS024 BASANTH JENU H B 9.11 basanthjenuhb@gmail.com 8123253173
4JC13CS062 MEGHANA S B 8.72 meghanabhaskar96@gmail.com 9632998417
4JC13CS094 SANJANA G S 9.12 gssanjana4@gmail.com 9071366225

THE TEAM
Sanjana G S, Meghana S B, Guru R (Guide), Basanth Jenu H B


Write-up about the Project


Our project aims at developing an AI system that can play a variety of games. Carrying
out this project required the knowledge of various subjects learnt in previous semesters
of Computer Science and Engineering. To design and implement any project, an
understanding of the Software Development Life Cycle is a must. Hence the core subject
Software Engineering aided the project development process.
The knowledge of Data Structures and Algorithms helped us throughout the
implementation of the system. In particular, knowledge of graph algorithms was very
essential. Apart from this, object-oriented programming principles helped us structure
our code and develop a good design.


Appendix B

COs, POs AND PSOs MAPPING FOR THE PROJECT WORK (CS84P)
Course Outcomes:

CO1: Formulate the problem definition, conduct literature review and apply re-
quirements analysis.

CO2: Develop and implement algorithms for solving the problem formulated.

CO3: Comprehend, present and defend the results of exhaustive testing and explain
the major findings.

Program Outcomes:

PO1: Apply knowledge of computing, mathematics, science, and foundational
engineering concepts to solve the computer engineering problems.

PO2: Identify, formulate and analyze complex engineering problems.

PO3: Plan, implement and evaluate a computer-based system to meet desired


societal needs such as economic, environmental, political, healthcare and safety
within realistic constraints.

PO4: Incorporate research methods to design and conduct experiments to
investigate real-time problems, to analyze, interpret and provide feasible conclusion.

PO5: Propose innovative ideas and solutions using modern tools.

PO6: Apply computing knowledge to assess societal, health, safety, legal and cul-
tural issues and the consequent responsibilities relevant to professional engineering
practice.

PO7: Analyze the local and global impact of computing on individuals and orga-
nizations for sustainable development.

PO8: Adopt ethical principles and uphold the responsibilities and norms of com-
puter engineering practice.

PO9: Work effectively as an individual and as a member or leader in diverse teams


and in multidisciplinary domains.

PO10: Effectively communicate and comprehend.


PO11: Demonstrate and apply engineering knowledge and management principles


to manage projects in multidisciplinary environments.

PO12: Recognize contemporary issues and adapt to technological changes for life-
long learning.

Program Specific Outcomes:

PSO1: Problem Solving Skills: Ability to apply standard practices and mathe-
matical methodologies to solve computational tasks, model real world problems in
the areas of database systems, system software, web technologies and Networking
solutions with an appropriate knowledge of Data structures and Algorithms.

PSO2: Knowledge of Computer Systems: An understanding of the structure and


working of the computer systems with performance study of various computing
architectures.

PSO3: Successful Career and Entrepreneurship: The ability to get acquaintance


with the state of the art software technologies leading to entrepreneurship and
higher studies.

PSO4: Computing and Research Ability: Ability to use knowledge in various


domains to identify research gaps and to provide solution to new ideas leading to
innovations.

Note:

1. Scale 1: Low relevance

2. Scale 2: Medium relevance

3. Scale 3: High relevance


Justification:

The trending advances in the field of Artificial Intelligence motivated us to choose the
topic and develop the model (PO 1, 2). With the help of the knowledge of computing,
we could come up with a solution to the existing problem. With the help of research
papers, we analysed the different research techniques used earlier (PO 3, 4). New ideas
and design patterns made our code simple and efficient (PO 5). Analyzing the problem
led us to create groundwork for future research and application in real time (PO 4, 6, 7).
The task was divided into multiple modules and each member worked individually and
collaboratively for the success of the project (PO 9, 10). The developed library can be
used for a variety of tasks and has scope for future research (PO 11, 12).
We were able to apply the knowledge from Data Structures, Algorithms and Machine
Learning to implement and improve the model (PSO 1). Knowledge of computer systems
helped us to integrate different components and environments (PSO 2). This project is
a topic under research, and studies in this field add value to the resume for work and
universities (PSO 3, 4).


References
[1] Brockman, Greg, and Vicki Cheung. OpenAI Gym. https://gym.openai.com/envs/CartPole-v1. N.p., 2016. Web. 25 Apr. 2017.

[2] Browne, Cameron B. et al. A Survey Of Monte Carlo Tree Search Methods.
IEEE Transactions on Computational Intelligence and AI in Games 4.1 (2012):
1-43. Web.

[3] Christopher Amato and Guy Shani. High-level Reinforcement Learning in Strategy
Games, Proc. of 9th Int. Conf. on Autonomous Agents and Multiagent Systems
(AAMAS 2010), van der Hoek, Kaminka, Lespérance, Luck and Sen (eds.), May
10-14, 2010, Toronto, Canada.

[4] David J. Montana and Lawrence Davis. Training Feedforward Neural Networks
Using Genetic Algorithms, BBN Systems and Technologies Corp., 10 Moulton St.,
Cambridge, MA 02138.

[5] Kirci, Mesut, Nathan Sturtevant, and Jonathan Schaeffer. A GGP Feature Learning
Algorithm. KI - Künstliche Intelligenz 25.1 (2011): 35-42. Web.

[6] Mitchell, Tom M. Machine Learning. 1st ed. Johanneshov: MTM, 2015. Print.

[7] Mnih, Volodymyr et al. Human-Level Control Through Deep Reinforcement Learn-
ing. Nature 518.7540 (2015): 529-533. Web.

[8] Paquette, Philip. Ppaquette (Philip Paquette). https://github.com/ntasfi/PyGame-Learning-Environment. N.p., 2017. Web. 25 Apr. 2017.

[9] Shiffman, Daniel, Shannon Fry, and Zannah Marsh. The Nature Of Code. 1st ed.
2012.

[10] Stanley, Kenneth O., and Risto Miikkulainen. Evolving Neural Networks Through
Augmenting Topologies. Evolutionary Computation 10.2 (2002): 99-127. Web.

[11] Tasfi, Norman. Flappybird - Pygame Learning Environment 0.1.Dev1 Documentation.
http://pygame-learning-environment.readthedocs.io/en/latest/user/games/flappybird.html. N.p., 2016. Web. 25 Apr. 2017.

[12] Tasfi, Norman. Pong - Pygame Learning Environment 0.1.Dev1 Documentation.
http://pygame-learning-environment.readthedocs.io/en/latest/user/games/pong.html. N.p., 2016. Web. 25 Apr. 2017.

[13] Świechowski, Maciej et al. Recent Advances in General Game Playing. The
Scientific World Journal 2015 (2015): 1-22. Web.
