
TABLE OF CONTENTS

CHAPTER No.   TITLE

              ABSTRACT
              LIST OF FIGURES
              LIST OF ABBREVIATIONS
1.            INTRODUCTION
              1.1 Objective
2.            LITERATURE REVIEW
              2.1 Existing System
                  2.1.1 Drawbacks
              2.2 Proposed System
                  2.2.1 Definitions
                      2.2.1.1 Types of States
                      2.2.1.2 Risk Parameter
                      2.2.1.3 A Navigation Model
                  2.2.2 Algorithm
                      2.2.2.1 Two Step Process
                      2.2.2.2 Parameter Setting
                  2.2.3 Advantages
3.            SYSTEM EVALUATION
              3.1 Problem Statement
              3.2 Overview of the Project
                  3.2.1 Maze Game Problem
              3.3 Module Description
                  3.3.1 Modules
              3.4 System Design
                  3.4.1 Architecture Diagram
                  3.4.2 Data Flow Diagram
                  3.4.3 Class Diagram
                  3.4.4 Sequence Diagram
                  3.4.5 Use Case Diagram
                  3.4.6 Activity Diagram
              3.5 Software Description
                  3.5.1 C Programming Language
4.            EVALUATION RESULTS
              4.1 Testing
                  4.1.1 Unit Testing
                      4.1.1.1 Test Case
                  4.1.2 Integration Testing
                      4.1.2.1 Test Case
              4.2 Graphical Outcomes
5.            CONCLUSION & FUTURE ENHANCEMENT
              5.1 Conclusion
              5.2 Future Enhancement
6.            APPENDICES
              6.1 Source Code
              6.2 Screen Shots
7.            REFERENCES

ABSTRACT
Safe exploration of state and action spaces is considered to be an
important problem in reinforcement learning. While reinforcement learning is
well-suited to domains with complex transition dynamics and high-dimensional
state-action spaces, an additional challenge is posed by the need for safe and
efficient exploration. Traditional exploration techniques are not particularly useful
for solving dangerous tasks, where the selection of actions in some states may
result in catastrophic damage to the learning system (or any other system).
Consequently, when an agent begins an interaction with a dangerous and
high-dimensional state-action space, an important question arises; namely, that of how
to avoid (or at least minimize) the damage caused by the exploration of the
state-action space. The proposed PI-SRL (Policy Improvement through Safe
Reinforcement Learning) algorithm safely improves a suboptimal albeit robust
behavior for continuous state and action control tasks and efficiently learns
from the experience gained from the environment. The proposed method is
evaluated in the Maze game problem.

LIST OF FIGURES

FIGURE NO.   FIGURE NAME

2.1          Known and Unknown States
2.2          Case-Base containing States
3.1          Architecture Diagram
3.2          Data Flow Diagram
3.3          Class Diagram
3.4          Sequence Diagram
3.5          Use Case Diagram
3.6          Activity Diagram
4.1          Graph outcome of PI-SRL in Maze Game
6.1          Before start of Exploration
6.2          Exploration using Teacher Policy
6.3          Using Case-Base with Gaussian Noise
6.4          Removal of Least Used Cases
6.5          Case-Base and reward at the end of excursion

LIST OF ABBREVIATIONS

RL       Reinforcement Learning
MDP      Markov Decision Process
PI-SRL   Policy Improvement through Safe Reinforcement Learning
EA       Evolutionary Algorithm
B        Case-Base (Baseline Behavior)
CBR      Case-Based Reasoning
OOP      Object Oriented Programming
3D       Three Dimension
LfD      Learn from Demonstration
TD       Temporal Difference

1. INTRODUCTION
Reinforcement Learning (RL) is a type of machine learning whose
main goal is that of finding a policy that moves an agent optimally in an
environment, generally formulated as a Markov Decision Process (MDP). Since
most RL tasks are focused on maximizing a long-term cumulative reward, RL
researchers are paying increasing attention not only to long-term reward
maximization, but also to the safety of the approach. Nevertheless, while it is
important to ensure reasonable system performance and consider the safety of the agent
(e.g., avoiding collisions, crashes, etc.) in the application of RL to dangerous
tasks, most exploration techniques in RL offer no guarantees on both issues.
Thus, when using RL techniques in dangerous control tasks, an important question
arises; namely, how can the state-action space be explored without causing
damage or injury while, at the same time, learning (near-) optimal policies?
In other words, the matter is one of ensuring that the agent is able to explore a dangerous
environment both safely and efficiently.
There are many domains in which the exploration/exploitation
process may lead to catastrophic states or actions for the learning agent. Since the
maximization of expected returns does not necessarily prevent rare occurrences of
large negative outcomes, a different criterion for safe exploration is needed.
Indeed, for such environments, a method is required which not only explores the
state-action space, but does so in a safe manner. Here the Policy Improvement
through Safe Reinforcement Learning (PI-SRL) algorithm for safe exploration in
dangerous and continuous control tasks is proposed. Such a method requires a
predefined baseline policy which is assumed to be suboptimal. The PI-SRL algorithm
is composed of two different steps. In the first step, the baseline behavior (robust
albeit suboptimal) is approximated using behavioral cloning techniques. In order
to achieve this, the Case-Based Reasoning technique is used. In the second step, the
PI-SRL algorithm attempts to safely explore the state-action space in order to
build a more accurate policy from the previously-learned behavior. Thus the set of
cases obtained in the previous phase is improved through the safe exploration of the
state-action space. To perform this, a small amount of Gaussian noise is randomly
added to the greedy actions of the baseline policy.

1.1 OBJECTIVE
The novelty of the present approach is the use of two new, main
components: (i) a risk function to determine the degree of risk of a particular state
and (ii) a baseline behavior capable of producing safe actions in supposedly risky
states. The project reports the experimental results obtained from the application of
the new approach to the Maze Game problem, in which the probability of moving to
an unknown state is maximum. In this domain, the goal is to learn a near-optimal policy
which, in the learning phase, minimizes the crashes of the learning agent. It is
important to note that a comparison of the approach with an agent using an optimal
exploration policy is not possible since, in the proposed domain, it is not known
what the optimal exploration policy is.

2. LITERATURE REVIEW

2.1 EXISTING SYSTEM


The policy-space approach to RL searches for policies that optimize
an appropriate objective function. While many search algorithms might be used,
this survey focuses on evolutionary algorithms. We begin with a brief overview of
a simple EA for RL, followed by a detailed discussion of features that characterize
the general class of EAs for RL.

DESIGN CONSIDERATIONS
Evolutionary Algorithms (EAs) are global search techniques derived
from Darwin's theory of evolution by natural selection. An EA iteratively
updates a population of potential solutions, which are often encoded in structures
called chromosomes. During each iteration, called a generation, the EA evaluates
solutions and generates offspring based on the fitness of each solution in the task
environment. Substructures, or genes, of the solutions are then modified through
genetic operators such as mutation and recombination. The idea is that structures
that are associated with good solutions can be mutated or combined to form even
better solutions in subsequent generations. A wide variety of EAs have been
developed, including genetic algorithms (Eiben-Smith, 2002), evolutionary
programming (David E. Moriarty, Alan C. Schultz, John J. Grefenstette, 1999),
genetic programming (Mihatsch, O., & Neuneier, R., 2002) and evolution
strategies (Geibel, P., & Wysotzki, F., 2005).
EAs are general-purpose search methods and have been
applied in a variety of domains including numerical function optimization,
combinatorial optimization, adaptive control, adaptive testing, and machine
learning. One reason for the widespread success of EAs is that there are few
requirements for their application, namely

1. An appropriate mapping between the search space and the chromosomes

2. An appropriate fitness function
For parameter optimization, it is common
to represent the list of parameters as either a vector of real numbers or a bit string
that encodes the parameters. With either of these representations, the standard
genetic operators of mutation and cut-and-splice crossover can be applied in a
straightforward manner to produce the genetic variations required.
The user must still decide on a number of control parameters for the
EA, including population size, mutation rates, recombination rates and parent
selection rules, but there is an extensive literature of studies which suggests that
EAs are relatively robust over a wide range of control parameter settings. Thus, for
many problems, EAs can be applied in a relatively straightforward manner.

A SIMPLE EARL
The most straightforward way to represent a policy in an EA is to use
a single chromosome per policy with a single gene associated with each observed
state. In EARL, each gene's value represents the action associated with the
corresponding state. The number of policies in a population is usually on the order
of 100 to 1000. The fitness of each policy in the population must reflect the
expected accumulated reward for an agent that uses the given policy.
There are no fixed constraints on how the fitness of an individual
policy is evaluated. If the world is deterministic, the fitness of a policy can be
evaluated during a single trial that starts with the agent in the initial state and
terminates when the agent reaches a terminal state. In non-deterministic worlds,
the fitness of a policy is usually averaged over a sample of trials. Other options
include measuring the total payoff achieved by an agent after a fixed number of
steps.

For simple RL problems, EARL may provide an adequate approach.


However, as in the case of TD methods, EARL methods have been extended to
handle the many challenges inherent in more realistic RL problems.
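
To make the chromosome representation and fitness evaluation described above
concrete, the following minimal C sketch uses one chromosome per policy with one
gene (an action index) per observed state, and a fitness value averaged over a few
trials. The toy environment in env_step(), the constants and the population size are
assumed placeholders for illustration, not part of any surveyed EARL system.

#include <stdio.h>
#include <stdlib.h>

#define NUM_STATES   16   /* assumed size of the observed state space */
#define NUM_ACTIONS   4   /* assumed number of discrete actions       */
#define POP_SIZE     10
#define NUM_TRIALS    5
#define MAX_STEPS    50

typedef struct { int gene[NUM_STATES]; double fitness; } Chromosome;

/* Hypothetical environment: returns a reward and writes the next state. */
static double env_step(int state, int action, int *next_state) {
    *next_state = (state + action + 1) % NUM_STATES;    /* toy dynamics */
    return (*next_state == NUM_STATES - 1) ? 1.0 : 0.0; /* toy reward   */
}

/* Fitness = average cumulative reward over NUM_TRIALS episodes. */
static double evaluate(const Chromosome *c) {
    double total = 0.0;
    int t, k, s, s2;
    for (t = 0; t < NUM_TRIALS; t++) {
        s = 0;
        for (k = 0; k < MAX_STEPS; k++) {
            total += env_step(s, c->gene[s], &s2); /* the gene picks the action */
            s = s2;
        }
    }
    return total / NUM_TRIALS;
}

int main(void) {
    Chromosome pop[POP_SIZE];
    int i, g;
    for (i = 0; i < POP_SIZE; i++) {                /* random initial policies */
        for (g = 0; g < NUM_STATES; g++)
            pop[i].gene[g] = rand() % NUM_ACTIONS;
        pop[i].fitness = evaluate(&pop[i]);
        printf("policy %d fitness %.2f\n", i, pop[i].fitness);
    }
    /* mutation and recombination of the fittest chromosomes would follow here */
    return 0;
}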

2.1.1 DRAWBACKS
1. In the case of a high-dimensional state-action space and complex transitions
(dangerous tasks), the given set of policies is not enough for safe exploration.
2. When an unknown state is visited, the probability of selecting a random action
is high, which in turn minimizes the reward and increases failures.
3. When moved to an unknown state, the probability of moving back to a known
state is very low.

2.2 PROPOSED SYSTEM


2.2.1 DEFINITIONS
2.2.1.1 TYPES OF STATES
1. ERROR AND NON-ERROR STATES: Let S be the set of states and Γ ⊂ S
be the set of error states. A state s ∈ Γ is an undesirable terminal state where
the control of the agent ends when s is reached, with damage or injury to the
agent, the learning system or any external entities. The set Φ ⊂ S is considered
the set of non-error terminal states, with Γ ∩ Φ = ∅, where the control of the
agent ends normally without damage or injury.
2. CASE-BASE: A case-base is a set of cases B = {c1, ..., cn}. Every case ci
consists of a state-action pair (si, ai) the agent has experienced in the past and
an associated value V(si). Thus, ci = < (si, ai), V(si) >, where the first element
represents the case's problem part and corresponds to the state si, the following
element depicts the case solution (i.e., the action ai expected when the agent is
in the state si), and the final element V(si) is the value function associated with
the state si. Each state is composed of n continuous state variables and each
action is composed of m continuous action variables. It is shown in Fig 2.1 and
Fig 2.2.

Fig 2.1 KNOWN AND UNKNOWN STATES

3. KNOWN AND UNKNOWN STATES: Given a case-base B = {c1, ..., cn}
composed of cases ci = (si, ai, V(si)) and a density threshold θ, a state s is
considered known when the distance from s to its nearest case in B is at most θ,
while all other states are unknown. Formally, the set of known states is

	Sknown = { s ∈ S | there exists ci ∈ B such that d(s, si) ≤ θ }

and the set of unknown states is Sunknown = S \ Sknown, so that
Sknown ∪ Sunknown = S and Sknown ∩ Sunknown = ∅.

Fig 2.2 CASE-BASE CONTAINING STATES


4. CASE-BASE RISK FUNCTION: Given a case-base B = {c1, ..., cn}
composed of cases ci = (si, ai, V(si)), the risk for each state s is defined as

	ρ(s) = 0, if min over ci in B of d(s, si) ≤ θ   (known state)
	ρ(s) = 1, otherwise                             (unknown state)

(A small sketch of this check is given after this list.)
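
A minimal C sketch of the case-base risk function defined in item 4 is given below:
a state is treated as known (risk 0) if the Euclidean distance to its nearest case in B
is at most the density threshold θ, and unknown (risk 1) otherwise. The case layout,
dimensions and the threshold value used in main() are illustrative assumptions only.

#include <stdio.h>
#include <math.h>

#define STATE_DIM 2
#define NUM_CASES 3

typedef struct { double s[STATE_DIM]; double a; double value; } Case;

/* Euclidean distance between two states */
static double euclid(const double *x, const double *y) {
    double d = 0.0; int i;
    for (i = 0; i < STATE_DIM; i++) d += (x[i] - y[i]) * (x[i] - y[i]);
    return sqrt(d);
}

/* Returns 0 for a known state, 1 for an unknown state. */
int risk(const Case *B, int n, const double *state, double theta) {
    double best = 1e30; int i;
    for (i = 0; i < n; i++) {
        double d = euclid(B[i].s, state);
        if (d < best) best = d;    /* distance to the nearest stored case */
    }
    return (best <= theta) ? 0 : 1;
}

int main(void) {
    Case B[NUM_CASES] = { {{0,0},0,0}, {{1,1},0,0}, {{2,2},0,0} };
    double s[STATE_DIM] = {0.2, 0.1};
    printf("risk(s) = %d\n", risk(B, NUM_CASES, s, 0.5)); /* prints 0: known */
    return 0;
}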

2.2.1.2 RISK PARAMETER

The parameter σ is considered a risk parameter. Large values of σ
increase the probability of visiting distant unknown states and hence increase the
probability of reaching error states. The Gaussian noise added to the actions is
drawn from the density

	N(x; a, σ) = (1 / (σ √(2π))) e^( -(x - a)² / (2σ²) )

where a is the action stored in the case-base and σ is the standard deviation of the
noise.

2.2.1.3 A NAVIGATION MODEL


To illustrate the concept of safety used in the approach, a navigation
problem is presented in Fig 2.1. In this navigation problem a control policy must
be learned to get from a particular start state to a goal state. In this environment, the
task is assumed to be difficult due to the stochastic and complex dynamics of the
environment. This makes it impossible to complete the task using exactly the same
trajectories every time. Additionally, a set of demonstrations from the baseline
controller is also given. This approach is based on the addition of Gaussian noise
to the baseline trajectories in order to find new and better ways of completing the
task. The noise will affect the baseline trajectories in different ways, depending on
the amount of noise added which, in turn, depends on the amount of risk taken. In
some cases, the exploration of new trajectories leads the learning system to an
unknown region of the state space (the dashed red lines). In such a situation, the
learning system detects it with the risk function and uses the baseline behavior to
return to a safe, known state.

2.2.2 ALGORITHM
2.2.2.1 TWO STEP PROCESS
The algorithm is composed of two main steps, described in detail below.
FIRST STEP: Modeling Baseline Behavior by CBR
The first step of PI-SRL is an approach for behavioral cloning, using
CBR to allow the agent to behave in a similar manner to a teacher policy πT.
When using CBR for behavioral cloning, a case can be built using the agent's state
received from the environment, as well as the corresponding action command
performed by the teacher. The case-base contains cases that are state-action pairs
of the form (si, ai). In a new state s, the case with minimum Euclidean distance
dist(s, si) is retrieved and the corresponding action is returned. If no case is
found, then a new state-action pair is generated with the help of the teacher policy and is
added to the case-base. However, when adding a new case into the case-base, the
value associated with the state-action pair is initialized to zero.
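
The following is a minimal C sketch of the behavioral cloning loop described above:
the closest stored case is reused when it lies within the density threshold, otherwise
the teacher policy is queried and the new state-action pair is stored with its value
initialized to zero. The toy teacher_action() routine, the dimensions and the
threshold are illustrative assumptions, not the project's actual code.

#include <stdio.h>
#include <math.h>

#define STATE_DIM 2
#define MAX_CASES 100
#define THETA     0.5   /* assumed density threshold */

typedef struct { double s[STATE_DIM]; double a; double value; } Case;
static Case base[MAX_CASES];
static int  nCases = 0;

static double teacher_action(const double *s) { return s[0] > s[1] ? -1.0 : 1.0; } /* toy teacher */

static double dist(const double *x, const double *y) {
    double d = 0.0; int i;
    for (i = 0; i < STATE_DIM; i++) d += (x[i]-y[i])*(x[i]-y[i]);
    return sqrt(d);
}

double cloned_action(const double *s) {
    int i, best = -1; double bestD = 1e30;
    for (i = 0; i < nCases; i++) {                    /* nearest-case retrieval */
        double d = dist(base[i].s, s);
        if (d < bestD) { bestD = d; best = i; }
    }
    if (best >= 0 && bestD <= THETA)
        return base[best].a;                          /* reuse the retrieved case */
    if (nCases < MAX_CASES) {                         /* otherwise ask the teacher */
        for (i = 0; i < STATE_DIM; i++) base[nCases].s[i] = s[i];
        base[nCases].a = teacher_action(s);
        base[nCases].value = 0.0;                     /* value initialized to zero */
        return base[nCases++].a;
    }
    return teacher_action(s);
}

int main(void) {
    double s1[STATE_DIM] = {0.0, 1.0}, s2[STATE_DIM] = {0.1, 1.1};
    double a1 = cloned_action(s1);   /* no close case yet: the teacher is queried */
    double a2 = cloned_action(s2);   /* close to s1: the stored case is reused    */
    printf("a1=%.1f a2=%.1f cases=%d\n", a1, a2, nCases);
    return 0;
}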

SECOND STEP: Improving the Learned Baseline Behavior

In this second step, the safe case-based policy learned in the
previous step is improved by the safe exploration of the state-action space. First, for
each case the state-value function is calculated using the Monte Carlo (MC)
approach. In the algorithm shown below, all returns of each state in the case-base
are accumulated and averaged, following the policy derived from the case-base
B. It is important to note that return refers to the expected return of s, whereas
Returns refers to a list composed of each return of s in different episodes. Once the
state-value is calculated for each case in the case-base, a small amount of Gaussian
noise is randomly added to the actions of the policy in order to obtain new
and improved ways to complete the task.
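
As an illustration of the Monte Carlo estimate mentioned above, the sketch below
computes the first-visit return of a state as the discounted sum of rewards from its
first occurrence in the episode onward. The reward array and the discount factor are
assumed example values.

#include <stdio.h>

#define GAMMA 0.9   /* assumed discount factor */

/* Return of the state first visited at index n, given the episode rewards. */
double first_visit_return(const double *reward, int episode_len, int n) {
    double g = 0.0, discount = 1.0;
    int j;
    for (j = n; j < episode_len; j++) {   /* sum gamma^(j-n) * r(sj, aj) */
        g += discount * reward[j];
        discount *= GAMMA;
    }
    return g;
}

int main(void) {
    double r[5] = {0.0, 1.0, 0.0, 1.0, 2.0};
    /* state first visited at step 1 of a 5-step episode */
    printf("V estimate: %.3f\n", first_visit_return(r, 5, 1));
    return 0;
}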

The overall process of the algorithm is composed of four steps, as depicted below.
a) INITIALIZATION STEP: The algorithm initializes the list used to store
cases occurring during an episode and sets the cumulative reward counter of
the episode to 0.
b) CASE GENERATION: The algorithm builds a case for each step of an
episode. For each new state sk, the closest case < s, a, V(s) > in the case-base
is computed using the Euclidean distance metric. In order to determine the
degree of perceived risk of the new state sk, the case-based risk function is
used. If ρ(sk) = 0, then the state is a known state. In this case the action ak
is computed with the addition of Gaussian noise and a new case cnew = <
s, ak, V(s) > is built, replacing the action a of the closest case <
s, a, V(s) >. If however ρ(sk) = 1, then the state is an unknown state. In this
case the action ak performed is the one suggested by the teacher policy, which
defines the safe behavior.

Policy Improvement Algorithm

00  Given the case-base B, and the maximum number of cases η
01  Given the baseline behavior πB
02  Given the update threshold Δ
03  1. Set maxTotalRwEpisode = 0, the maximum cumulative reward reached in an episode
04  2. Repeat
05     (a) Initialization step:
06         set k = 0, listCasesEpisode ← ∅, totalRwEpisode = 0
07     (b) Case generation:
08         while k < maxEpisodeLength do
09             Compute the case < s, a, V(s) > ∈ B closest to the current state sk
10             if ρ(sk) = 0 then                     // known state
11                 Choose an action ak by adding Gaussian noise to the retrieved action a
12                 Perform action ak
13                 Create a new instance cnew := (s, ak, V(s))
14             else                                  // unknown state
15                 Choose an action ak using πT
16                 Perform action ak
17                 Create a new instance cnew := (sk, ak, 0)
18             totalRwEpisode := totalRwEpisode + r(sk, ak)
19             listCasesEpisode := listCasesEpisode ∪ cnew
20             Set k = k + 1
21     (c) Computing the state-value function for the unknown states:
22         for each instance ci in listCasesEpisode
23             if ρ(si) = 1 then                     // unknown state
24                 return(si) := Σ (j ≥ n) γ^(j−n) r(sj, aj)   // n is the first occurrence of si in the episode
25                 V(si) := return(si)
26     (d) Updating the cases in B using the experience gathered:
27         if totalRwEpisode > (maxTotalRwEpisode − Δ) then
28             maxTotalRwEpisode := max(maxTotalRwEpisode, totalRwEpisode)
29             for each case ci = < si, ai, V(si) > in listCasesEpisode
30                 if ρ(si) = 0 then                 // known state
31                     Compute the case < si, a, V(si) > ∈ B corresponding to the state si
32                     Compute δ = r(si, ai) + γ V(si+1) − V(si)
33                     if δ > 0 then
34                         Replace the case < si, a, V(si) > ∈ B with the case < si, ai, V(si) > ∈ listCasesEpisode
35                     V(si) := V(si) + α δ
36                 else                              // unknown state
37                     B := B ∪ ci
38             if ||B|| > η then
39                 Remove the ||B|| − η least-frequently-used cases in B
40     until stop criterion becomes true
41  3. Return B

c) COMPUTING THE STATE VALUE FUNCTION: In this step, the state-value
function of the states considered to be unknown in the previous step is
computed. In the previous step, the state-value function for these states is set
to 0. In this case, the return of each unknown state is computed, but not
averaged, since only one episode is considered. The return for each state is
computed taking into account the first visit of the state in the episode,
although the state could appear multiple times in the rest of the episode.

d) UPDATING CASES IN B USING EXPERIENCE GAINED: Updates in
B are made with the cases gathered from episodes with a cumulative reward
similar to that of the best episode found up to that point, using the threshold Δ.
In this way, good sequences are provided for the updates, since it has been
shown that such sequences of experience cause an adaptive agent to
converge to a stable and useful policy, whereas bad sequences may cause an
agent to converge to an unstable or a bad policy. In this step, two types of
updates take place, namely, replacements and additions of cases. In the case of
replacement, cases are replaced only if the calculated TD error is positive,
because the sign of the TD error is used as a measure of success.
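
A minimal C sketch of the update rule in step d) is given below: the TD error
δ = r + γV(s') − V(s) is computed for a known state, the stored case is replaced
only when δ is positive, and the state value is moved toward the target with a
learning rate α. The struct, the constants and the function names are illustrative
assumptions.

#include <stdio.h>

#define GAMMA 0.9   /* assumed discount factor */
#define ALPHA 0.1   /* assumed learning rate   */

typedef struct { double action; double value; } CaseEntry;

/* Returns 1 if the stored case should be replaced by the episode's case. */
int td_update(CaseEntry *stored, double reward, double next_value, double new_action) {
    double delta = reward + GAMMA * next_value - stored->value; /* TD error        */
    stored->value += ALPHA * delta;                             /* value update    */
    if (delta > 0.0) {                                          /* success signal  */
        stored->action = new_action;                            /* replace the case */
        return 1;
    }
    return 0;
}

int main(void) {
    CaseEntry stored = {0.5, 1.0};
    int replaced = td_update(&stored, 1.0, 2.0, 0.8);
    printf("replaced=%d new value=%.2f\n", replaced, stored.value);
    return 0;
}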

2.2.2.2 PARAMETER SETTING

One of the main difficulties of applying the PI-SRL algorithm to a
given problem is to decide on an appropriate set of parameter values for the
density threshold θ, the risk parameter σ, the update threshold Δ and the maximum
number of cases η. An incorrect value for these parameters may potentially lead to
damage or injury to the learning agent. The parameter setting given below is
suitable for a wide variety of domains.
Parameter θ: The parameter θ is domain-dependent and related to the
average size of the actions. In this case, the value for this parameter has been
established by computing the mean distance between consecutive states during an
execution of the baseline behavior (a small sketch of this computation is given
after this list):

	θ = ( d(s1, s2) + ... + d(sn-1, sn) ) / (n - 1)

Parameter σ: Since the agent operates in a state of incomplete knowledge
of the domain and its dynamics, it is inevitable during the exploratory
process that unknown regions of the state space will be visited where the
agent may reach an error state. However, it is possible to adjust the risk
parameter σ to determine the level of risk assumed during the exploratory
process. In this case, a low value is assigned: σ = 9 x 10^-1 (i.e., 0.9).
Parameter Δ: The value of this parameter is set relative to the best episode
obtained. In this case, the value of Δ is set to 5% of the cumulative reward
of the best episode obtained.
Parameter η: The value of this parameter refers to the size of the case-base
B. In this case, the size of the case-base B is η = 10.
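
The following minimal C sketch (referred to in the Parameter θ item above)
computes θ as the mean Euclidean distance between consecutive states of a trace
recorded while running the baseline behavior; the trace values themselves are
assumed placeholders.

#include <stdio.h>
#include <math.h>

#define STATE_DIM 2

static double euclid(const double *x, const double *y) {
    double d = 0.0; int i;
    for (i = 0; i < STATE_DIM; i++) d += (x[i] - y[i]) * (x[i] - y[i]);
    return sqrt(d);
}

/* theta = ( d(s1,s2) + ... + d(s_{n-1},s_n) ) / (n - 1) */
double estimate_theta(double trace[][STATE_DIM], int n) {
    double sum = 0.0; int i;
    for (i = 0; i + 1 < n; i++) sum += euclid(trace[i], trace[i + 1]);
    return sum / (n - 1);
}

int main(void) {
    double trace[4][STATE_DIM] = {{0,0},{1,0},{1,1},{2,1}}; /* assumed demo trace */
    printf("theta = %.3f\n", estimate_theta(trace, 4));
    return 0;
}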

2.2.3 ADVANTAGES:
1. In the case of a high-dimensional state-action space and complex transitions
(dangerous tasks), the set of policies given by the baseline behavior B is enough
for safe exploration.
2. When an unknown state is visited, the teacher policy is used to generate the
action, which in turn decreases failures.
3. When moved to an unknown state, the probability of moving back to a known
state is high, and this is achieved by the teacher policy.

3. SYSTEM EVALUATION

3.1 PROBLEM STATEMENT


Artificial Intelligence is becoming as an inevitable and indispensable
part of the life. Reinforcement is the application of moving optimally in an
environment. Many RL methods are used in important and complex tasks. While
most RL tasks are focused on long-term cumulative reward, RL researchers are
paying increasing attention not only to long-reward maximization, but also to the
safety of approach. In spite of that, while it is important to ensure reasonable
system performance and consider the safety of the agent in the application of RL
to dangerous tasks, most exploration techniques in RL offer no guarantees on both
issues. Thus when using RL techniques in dangerous control tasks, it is important
to ensure that the exploration of the state-action space will not cause any damage
or injury while at the same time, learning near-optimal policy. The matter in other
words, is one of the ensuring that the agent be able to explore a dangerous
environment both safely and efficiently.
There are many domains in which the exploration/exploitation
process may lead to catastrophic states or action for the learning agent. Since the
maximization of expected return does not necessarily prevent rare occurrences of
large negative outcomes, a different criterion for safe exploration is needed. The
exploration process in which new policies are evaluated must be conducted with
extreme care. Indeed for such environments, a method is required which not only
exposes the state-action space, but which does so in a safe manner.

3.2 OVERVIEW OF THE PROJECT


In order to evaluate the approach on a task with complex transitions and a
high-dimensional, dangerous state-action space, the Maze game problem is
considered as the experimental domain.

3.2.1 MAZE GAME PROBLEM


The domain used for experimentation is the Maze game problem. It
consists of a robot moving inside the Maze that contains walls, free position and
goal point. The whole domain is N x M (12 x 32 in this experiments). The possible
actions that the robot can execute are go left, go right, go top, go down,
go top left, go top right, go down left, go down right all of size 1.
The learning system knows its location in the space through the
continuous coordinates (x, y) and observes the state using the eight values namely
left, top, right, down, top left, top right, down left, down right. Furthermore, the
learning system has an obstacle avoidance system that blocks the execution of
action that would hit the wall. One goal point is located in the state-space and the
learning system is expected to achieve them.
The proposed PI-SRL algorithm contains teacher policy for
generating actions and baseline behavior for storing cases. In this case, several
level of noise is introduced in the perception of robot localization, in order to
obtain better trajectories than generated by the learning system. Here Gaussian
noise of standard deviation 0.9 is added.
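
The sketch below illustrates the observation mentioned above: the eight cells around
the robot's position are read from the grid in the order left, top, right, down,
top left, top right, down left, down right, with 1 marking a wall and 0 a free cell.
The small grid and the cell encoding here are illustrative assumptions; the project's
own getobservation() routine in Appendix 6.1 works on the full 12 x 32 maze.

#include <stdio.h>

#define ROWS 5
#define COLS 5

void get_observation(int grid[ROWS][COLS], int r, int c, int obs[8]) {
    /* offsets follow the order used in the report: l, t, r, d, tl, tr, dl, dr */
    const int dr[8] = { 0,-1, 0, 1,-1,-1, 1, 1};
    const int dc[8] = {-1, 0, 1, 0,-1, 1,-1, 1};
    int k;
    for (k = 0; k < 8; k++)
        obs[k] = grid[r + dr[k]][c + dc[k]];   /* 1 = wall, 0 = free */
}

int main(void) {
    int grid[ROWS][COLS] = {
        {1,1,1,1,1},
        {1,0,0,0,1},
        {1,0,1,0,1},
        {1,0,0,0,1},
        {1,1,1,1,1}
    };
    int obs[8], k;
    get_observation(grid, 1, 2, obs);          /* robot at row 1, column 2 */
    for (k = 0; k < 8; k++) printf("%d ", obs[k]);
    printf("\n");
    return 0;
}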

3.3 MODULE DESCRIPTION


This approach is composed of four modules as given below.

3.3.1 MODULES
1. Teacher Policy Evaluation
2. Teacher Policy with Baseline Behavior
3. Baseline Behavior with Gaussian Noise
4. Removal of Least Used Cases
MODULE 1: TEACHER POLICY EVALUATION
In this module, the learning system is allowed to move through the
state-action space with the help of the Teacher policy. The Teacher policy is a neural
network that observes the current state and performs the desired action. The
current state consists of eight values that describe which of the grid points
surrounding the current state are free and which are blocked. The teacher policy also checks
which grid point is closest to the destination point and performs the
action that moves to that respective position. Additionally, when no such solution
exists as per the above criteria, it looks for alternative solutions. Once an
action is generated, it calculates the reward of the current state and moves to the
next state. When the destination point is reached, it terminates and returns the total
reward obtained. A simplified sketch of this greedy selection is given below.
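
As indicated above, the following is a simplified sketch of the greedy selection idea
behind the teacher policy: among the free neighbouring grid cells, the one closest to
the goal is chosen. This is only an assumed simplification of the project's
teacherpolicy() routine in Appendix 6.1, which additionally handles the special
wall configurations of the maze.

#include <stdio.h>
#include <math.h>

static const int DX[8] = {-1, 1, 0, 0, -1, -1, 1, 1};  /* row offsets    */
static const int DY[8] = { 0, 0, -1, 1, -1,  1, -1, 1}; /* column offsets */

/* blocked[k] = 1 if neighbour k is a wall; returns chosen neighbour index or -1 */
int teacher_action(int x, int y, int gx, int gy, const int blocked[8]) {
    int best = -1, k;
    double bestDist = 1e30;
    for (k = 0; k < 8; k++) {
        double dx, dy, d;
        if (blocked[k]) continue;                 /* obstacle avoidance */
        dx = (double)(gx - (x + DX[k]));
        dy = (double)(gy - (y + DY[k]));
        d  = sqrt(dx * dx + dy * dy);             /* distance of the neighbour to the goal */
        if (d < bestDist) { bestDist = d; best = k; }
    }
    return best;
}

int main(void) {
    int blocked[8] = {0,0,0,0,0,0,0,1};           /* assumed local wall pattern */
    printf("chosen neighbour: %d\n", teacher_action(5, 5, 12, 30, blocked));
    return 0;
}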
MODULE 2: TEACHER POLICY WITH BASELINE BEHAVIOR
In this module, the learning agent is allowed to move through the
state-action space with the help of the teacher policy and the baseline behavior. It is
similar to the previous module: once the teacher policy is executed for a
particular state, a new case is created from the current state and its action
and is stored in the case-base (baseline behavior). When such a state is visited in
the future, instead of using the teacher policy for exploration, the agent uses the action values
stored in the case-base. This in turn minimizes the time needed for invoking the
teacher policy for every state. Once the learning agent refers to the case-base and
performs the action, the counter for that particular state is incremented.
MODULE 3: BASELINE BEHAVIOR WITH GAUSSIAN NOISE
This module is similar to the previous module of storing the state and
the action values generated by the teacher policy in a case-base (baseline behavior).
In order to obtain better results, a small amount of Gaussian noise is calculated and
added to the actions of the cases in the case-base. The Gaussian noise value is
calculated with the help of the current state value and the standard deviation, as given below:

	f(x; μ, σ) = (1 / (σ √(2π))) e^( -(x - μ)² / (2σ²) )

The noise will affect the action in different ways, depending on the
amount of noise added which, in turn, depends on the amount of risk taken. In this case
the value of the standard deviation is 9 x 10^-1 (i.e., 0.9). Once a particular case in
the case-base is executed with the addition of Gaussian noise, the newly obtained
action value replaces the previous action value in the case-base and the counter is
incremented. A sketch of sampling such noise is given below.
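
The sketch below illustrates one common way of sampling the Gaussian noise
mentioned above (zero mean, standard deviation 0.9) and adding it to a stored
action, using the Box-Muller transform. This is an assumed illustration of the idea
of perturbing the action; the project's gaussnoise() routine in Appendix 6.1 instead
thresholds a Gaussian density value.

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

#define SIGMA 0.9
#define PI 3.14159265358979323846

/* one sample of N(0, SIGMA^2) via the Box-Muller transform */
double gauss_noise(void) {
    double u1 = (rand() + 1.0) / (RAND_MAX + 2.0); /* in (0,1), avoids log(0) */
    double u2 = (rand() + 1.0) / (RAND_MAX + 2.0);
    return SIGMA * sqrt(-2.0 * log(u1)) * cos(2.0 * PI * u2);
}

int main(void) {
    double action = 1.0;                     /* action retrieved from a case          */
    double noisy  = action + gauss_noise();  /* perturbed action used for exploration */
    printf("action %.2f -> %.2f\n", action, noisy);
    return 0;
}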
MODULE 4: REMOVAL OF LEAST USED CASES:
Before starting the exploration process, the size of the case-base
(baseline behavior) is initialized to a fixed value. During the addition of cases, if an
overflow occurs, then the least used cases in the case-base are removed.
This is achieved with the help of the counter value of each case. The case with the
least counter value is identified and removed. This is carried out each time an
overflow occurs. A minimal sketch is given below.
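
As mentioned above, a minimal sketch of the least-used-case removal follows: the
entry with the smallest usage counter is located and the remaining cases are shifted
down. The struct mirrors the fields of the appendix code; the example data in main()
is assumed.

#include <stdio.h>

typedef struct { int state[8]; int action[2]; int statevalue; int count; } CaseRec;

/* Removes the least-used case from bb[0..*n-1]; returns the removed index or -1. */
int remove_least_used(CaseRec *bb, int *n) {
    int i, min = 0;
    if (*n == 0) return -1;
    for (i = 1; i < *n; i++)
        if (bb[i].count < bb[min].count) min = i;     /* least frequently used  */
    for (i = min; i + 1 < *n; i++) bb[i] = bb[i + 1]; /* shift remaining cases  */
    (*n)--;
    return min;
}

int main(void) {
    CaseRec bb[3] = { {{0},{0},1,5}, {{0},{0},1,0}, {{0},{0},2,3} };
    int n = 3;
    printf("removed index %d, %d cases left\n", remove_least_used(bb, &n), n);
    return 0;
}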

3.4 SYSTEM DESIGN


3.4.1 ARCHITECTURE DIAGRAM

Fig 3.1 ARCHITECTURE DIAGRAM

3.4.2 DATA FLOW DIAGRAM

Fig 3.2 DATAFLOW DIAGRAM

3.4.3 CLASS DIAGRAM

Fig 3.3 CLASS DIAGRAM

3.4.4 SEQUENCE DIAGRAM

Fig 3.4 SEQUENCE DIAGRAM

3.4.5 USE CASE DIAGRAM

Fig 3.5 USE CASE DIAGRAM

3.4.6 ACTIVITY DIAGRAM

Fig 3.6 ACTIVITY DIAGRAM



3.5 SOFTWARE DESCRIPTION


The language used for the implementation of the project is C
programming. Now let see some of the features about C programming language.

3.5.1 C PROGRAMMING LANGUAGE


C is a programming language developed at AT & Ts Bell
Laboratories of USA in 1972. It was designed and written by Dennis Ritchie. In
the late seventies C began to replace the more familiar languages of that time like
PL/I, ALGOL, etc. No one pushed C. It wasnt made the official Bell Labs
language. Thus without any advertisement, Cs reputation spread and its pool of
users grew. Ritchie seems to have been rather surprised that so many programmers
preferred C to older languages like FORTRAN or PI/T, or the newer ones like
Pascal and APL. But, thats what happened.
Possibly why C seems so popular is because it is reliable, simple and
easy to use. Moreover, in an industry where newer languages, tools and
technologies emerge and vanish day in and day out, a language that has survived
for more than four decades has to be really good.
An option that is often heard today is C has been already
superseded by languages like C++, C# and Java. There are several reasons why C
is so popular, some of the reasons are:
a) It is absolutely true that nobody can learn C++ or Java directly. This is
because while learning these languages you have things classes, objects,
inheritance, polymorphism, templates, exception handling, reference, etc.
do deal with apart from knowing the actual elements. Learning these
complicated concepts when you are not even comfortable with the basic
language elements is like putting the cart before the horse. Hence one
should first learn all the language elements very thoroughly using C
language before migrating to C++, C# or Java. Through this two-step

learning process may take more time, but at the end of it you will definitely
find it worth the trouble.
b) C++, C# or Java makes use of a principle called Object Oriented
Programming (OOP) to organize the program. This organizing principle has
lots of advantage to offer. But even while using this organizing principle
you would still need a good hold over the language elements of C and the
basic programming language.
c) Major parts of popular operating systems like windows, UNIX, Linux are
still written in C. this is because even today when it comes to performance
nothing beats C. Moreover, if one is to extend the operating system to work
with new devices one needs to write device driver programs. These
programs are exclusively written in C.
d) Mobile devices like cellular phones and palmtops have become rage today.
Also, common consumer devices like microwave ovens, washing machines
and digital cameras are getting smarter day by day. This smartness comes
from a microprocessor, an operating system and a user program embedded
in these devices. These programs not only have to run fast but also have to
work in limited amount of memory. No wonder that such programs are
written in C.
e) There are several professional 3D computer games where the user navigates
some object, like say a spaceship and fire bulls at the invaders. The essence
of all such games is speed. To match the expectation of the player the game
has to react fast to user inputs. Many popular gaming frameworks have been
built using C language.
f) Assembly Language programs ac be written in C, which is one of the
dominant features of C. This is where C language scores over other
languages.

4. EVALUATION RESULTS

4.1 TESTING
4.1.1 UNIT TESTING
Unit testing involves the design of test cases that validate whether the
internal program logic is functioning properly, and whether the program inputs produce
valid outputs. All decision branches and internal code flow should be validated. It
is the testing of individual software units of the application. This is structural
testing that relies on knowledge of the unit's construction and is invasive. Unit tests
perform basic tests at component level and test a specific business process,
application, and/or system configuration. Unit tests ensure that each unique path of
a business process performs accurately to the documented specifications and
contains clearly defined inputs and expected results.

4.1.1.1 TEST CASE


TEST CASE ID: 1
TEST CASE NAME: Observe the State
DESCRIPTION: Observe the current state
TEST STEP: Check the eight grid points near to the current position
EXPECTED: Return either 0 or 1 for left, right, top, down, top left, top right, down left, down right
ACTUAL: Returned either 0 or 1 for left, right, top, down, top left, top right, down left, down right
STATUS: Pass

TEST CASE ID: 2
TEST CASE NAME: Use Teacher Policy
DESCRIPTION: Generate action using Teacher Policy
TEST STEP: With the observed state values, calculate the action to the next state
EXPECTED: Return new position (x, y)
ACTUAL: Returned new position (x, y)
STATUS: Pass

TEST CASE ID: 3
TEST CASE NAME: Use Case Base
DESCRIPTION: Retrieve action using Case Base
TEST STEP: With the observed state values, find the action in the case base
EXPECTED: Return new position (x, y)
ACTUAL: Returned new position (x, y)
STATUS: Pass

TEST CASE ID: 4
TEST CASE NAME: Add Noise
DESCRIPTION: Generate Gaussian Noise
TEST STEP: With the observed state values, calculate the noise
EXPECTED: Return a noise value that is between 0 and 1
ACTUAL: Returned noise of value 0.89
STATUS: Pass

TEST CASE ID: 5
TEST CASE NAME: Remove Least Used Cases
DESCRIPTION: Remove cases in the case base
TEST STEP: In case of overflow, remove cases
EXPECTED: Remove the case that has the minimum count value
ACTUAL: Removed the cases that have the least count value
STATUS: Pass

4.1.2 INTEGRATION TESTING


Integration tests are designed to test integrated software components
to determine whether they actually run as one program. Testing is event driven and is
more concerned with the basic outcome of screens or fields. Integration tests
demonstrate that, although the components were individually satisfactory, as shown
by successful unit testing, the combination of components is correct and
consistent. Integration testing is specifically aimed at exposing the problems that
arise from the combination of components.

4.1.2.1 TEST CASE


TEST CASE ID: 1
TEST CASE NAME: Use Teacher Policy
DESCRIPTION: Generate action as well as remove cases in the case base
TEST STEP: With the observed state values, generate the action and add the case to the case base; when overflow occurs, remove cases
EXPECTED: Add case to the case base and delete cases if the limit is exceeded
ACTUAL: Deleted one case and added one case
STATUS: Pass

TEST CASE ID: 2
TEST CASE NAME: Use Case Base
DESCRIPTION: Retrieve action with Gaussian noise and update the case base
TEST STEP: Retrieve the action with added noise and update the case base
EXPECTED: Return a different action value and update the case base
ACTUAL: Returned new action value and the update was made
STATUS: Pass

TEST CASE ID: 3
TEST CASE NAME: Store Case-Base
DESCRIPTION: Store the Baseline Behavior so that it can be used in future
TEST STEP: Return the Case-Base
EXPECTED: Return the case-base containing the state-action pairs
ACTUAL: Returned the case-base containing cases
STATUS: Pass

4.2 GRAPHICAL OUTCOMES


The total reward obtained over the course of the trials is
shown in Fig 4.1. Here the x-axis is the number of trials with an interval of 2 units
and the y-axis is the reward with an interval of 0.1 units. The learning process also
demonstrates that the number of failures increases with an increase in the risk
parameter σ. Additional experiments have demonstrated that increasing the value of σ
above 0.9 increases the number of failures without increasing the performance.
Compared with the existing approach, the proposed method gives better performance
with fewer failures. The triangle-shaped curve represents the reward value
obtained at each episode of the exploration. A reward value of 1 represents that
the learning system is in a partially safe state and a reward value of 2 represents that
the learning system is in a completely safe state. It is found that PI-SRL explores more
safe states.

Fig 4.1 GRAPH OUTCOME OF PI-SRL IN MAZE GAME

5. CONCLUSION AND FUTURE ENHANCEMENT


5.1 CONCLUSION
In this work, PI-SRL, an algorithm for policy improvement through
safe reinforcement learning in high-risk tasks, is described. The main contribution
of this algorithm is the definition of a novel case-based risk function and a baseline
behavior for safe exploration of the state-action space. The use of the case-based risk
function presented is possible inasmuch as the policy is stored in the case-base.
This represents a clear advantage over previous existing approaches.
Additionally, a notion of risk completely different from others found in the
literature is presented. The algorithm is evaluated successfully in the Maze game
domain, in which it demonstrates different characteristics of the learning
capabilities of the PI-SRL algorithm, as given below.
1. PI-SRL obtains higher quality solutions.
2. PI-SRL adjusts the initial known space towards safe and better policies.
3. PI-SRL works well in domains with differently structured state-action
spaces and where the value function can vary sharply.
4. PI-SRL performs successfully even when a poor initial policy with failures
is used.

5.2 FUTURE ENHANCEMENT


A logical continuation of the present study would take into account
the automatic adjustment of the risk parameter σ along the learning process. The
future work aims to deploy the algorithm in real environments, inasmuch as the
uncertainty of real environments presents the biggest challenge to autonomous
robots. However, the use of the PI-SRL algorithm for the learning process could reduce
the amount of damage incurred and consequently allow the lifespan of robots to be
extended. It might also be worthwhile to add a mechanism to the algorithm to detect when
a known state can lead directly to an error state.

6. APPENDICES

6.1 SOURCE CODE

PISRLMAZE.C
#include<stdio.h>
#include<conio.h>
#include<graphics.h>
#include<math.h>
void initialize();
void initarray();
void getobservation(int,int);
void teacherpolicy(int,int,int,int,int,int,int,int);
void drawgraph();
void removelucases();
void storebb();
int gaussnoise(int);
int searchbb(int,int,int,int,int,int,int,int);
/* maze grid (cell coordinates and wall flag), current observation and position */
int grid[14][32][3];
int observation[8];
int position[2];
int goalpoint[2];
int k,episode=0,swap=1;
int curx=0,cury=0;
int tpcnt=0,bbcnt=0;
int flag=0,reward=0,cumreward[70],cumcnt=0;
/* case-base (baseline behavior): observation, action, state value and usage counter */
struct baselinebehavior{
	int state[8];
	int action[2];
	int statevalue;
	int count;
}bb[40];
void main(){
int gd=DETECT,gm,i=0;
initgraph(&gd,&gm,"");
initialize();
goalpoint[0]=12; goalpoint[1]=30;
initarray();
position[0]=40; position[1]=205;
outtextxy(position[0]-3,position[1]-3,"#");
getobservation(position[0],position[1]);
printf("\n Press any button to start....");
getch();
while(curx!=12||cury!=30){

		/* use the case-base when possible, otherwise fall back to the teacher policy */
		if(swap==0){
			swap=searchbb(observation[0],observation[1],observation[2],observation[3],
				observation[4],observation[5],observation[6],observation[7]);
		}
		if(swap==1){
			teacherpolicy(observation[0],observation[1],observation[2],observation[3],
				observation[4],observation[5],observation[6],observation[7]);
			swap=0;
		}
	}
	printf("\n Destination Reached....\n");
	printf("\n Baseline Behavior cases:");
	printf("\n_");
	printf("\n|CASE|| l | t | r | d | tl | tr | dl | dr || a1 | a2 | val |");
	printf("\n|-------------------------------------------------------------------------|");
	/* print every stored case: observation bits, action offsets and state value */
	for(i=0;i<episode;i++){
		printf("\n| %d || %d | %d | %d | %d | %d | %d | %d | %d || %d | %d | %d |",
			i+1,bb[i].state[0],bb[i].state[1],bb[i].state[2],bb[i].state[3],
			bb[i].state[4],bb[i].state[5],bb[i].state[6],bb[i].state[7],
			bb[i].action[0],bb[i].action[1],bb[i].statevalue);
	}
	printf("\n|-------------------------------------------------------------------------|");
	printf("\n\nTotal Reward = %d",reward);
	printf("\n\nPress any Key to draw the graph & store Baseline Behavior...\n\n");
	getch();
	drawgraph(); storebb(); getch();
}
void initialize(){ int i=0;
textcolor(BLACK);
clrscr();
setcolor(WHITE);
for(i=0;i<=225;i+=15)
line(10,10+i,460,10+i);
for(i=0;i<=460;i+=15)
line(10+i,10,10+i,235);
setfillstyle(SOLID_FILL,RED);
bar(10,235,25,10); bar(10,10,460,25);
bar(10,220,460,235);
bar(445,10,460,190);
setfillstyle(SOLID_FILL,BLUE);

bar(26,115,205,130); bar(145,160,160,219);
bar(235,26,250,190); bar(355,55,370,219);
bar(240,100,325,115); line(10,10,10,235);
line(10,10,460,10); line(25,25,25,220);
line(25,25,445,25); line(25,220,490,220);
line(10,235,460,235); line(145,160,145,220);
line(160,160,160,220);
line(145,160,160,160); line(355,55,355,220);
line(370,55,370,220); line(355,55,370,55);
line(250,100,325,100);
line(250,115,325,115);
line(325,100,325,115); line(235,25,235,190);
line(250,25,250,100); line(250,115,250,190);
line(235,190,250,190); line(460,10,460,235);
line(445,25,445,190); line(445,190,490,190);
line(460,205,490,205);
line(490,190,490,220);
line(475,190,475,220); line(25,115,205,115);
line(25,130,205,130); line(205,115,205,130);
setcolor(YELLOW);
outtextxy(475-3,205-3,"#");

}
void initarray(){

int i=0,j=0,x=25,y=25,val=0;
for(i=0;i<14;i++){
for(j=0;j<32;j++){
grid[i][j][0]=x; grid[i][j][1]=y; grid[i][j][2]=val;
x+=15;
}
y+=15; x=25;
}
for(j=0;j<32;j++){
grid[0][j][2]=1; grid[13][j][2]=1;
}
for(i=1;i<13;i++)
grid[i][0][2]=1;
for(i=1;i<12;i++){
grid[i][14][2]=1; grid[i][15][2]=1; grid[i][28][2]=1;
grid[i][29][2]=1; grid[i][30][2]=1; grid[i][31][2]=1;
}
for(j=1;j<13;j++){
grid[6][j][2]=1; grid[7][j][2]=1;
}
for(j=16;j<21;j++) {

grid[5][j][2]=1; grid[6][j][2]=1;
}
for(i=9;i<13;i++){
grid[i][8][2]=1; grid[i][9][2]=1;
}
for(i=2;i<13;i++){
grid[i][22][2]=1; grid[i][23][2]=1;
}
}
void getobservation(int posx,int posy){
int i=0,j=0;
for(i=0;i<14;i++)
for(j=0;j<32;j++){
if(grid[i][j][0]==posx-15&&grid[i][j][1]==posy)
observation[0]=grid[i][j][2];
if(grid[i][j][0]==posx&&grid[i][j][1]==posy-15)
observation[1]=grid[i][j][2];
if(grid[i][j][0]==posx+15&&grid[i][j][1]==posy)
observation[2]=grid[i][j][2];
if(grid[i][j][0]==posx&&grid[i][j][1]==posy+15)
observation[3]=grid[i][j][2];

if(grid[i][j][0]==posx-15&&grid[i][j][1]==posy-15)
observation[4]=grid[i][j][2];
if(grid[i][j][0]==posx+15&&grid[i][j][1]==posy-15)
observation[5]=grid[i][j][2];
if(grid[i][j][0]==posx-15&&grid[i][j][1]==posy+15)
observation[6]=grid[i][j][2];
if(grid[i][j][0]==posx+15&&grid[i][j][1]==posy+15)
observation[7]=grid[i][j][2];
}
printf("\n\n\n\n\n\n\n\n\n\n\n\n\n\n");
printf("\n | Left=%d

| Top=%d

| Right=%d

| Down=%d

|",observation[0],observation[1],observation[2],observation[3]);
printf("\n | Top-Left=%d | Top-Right=%d | Down-Left=%d | DownRight=%d |",observation[4],observation[5],observation[6],observation[7]);
printf("\n |-------------------------------------------------------|\n\n");
}
void teacherpolicy(int left,int top,int right,int down,int topleft,int topright,int
downleft,int downright){
int i=0,j=0,tmp1=0,tmp2=0,minvalx=0,minvaly=0,x=0,y=0;
int sell=0,selr=0,selt=0,seld=0,sv=0;
printf(" Inside Teacher
policy...\n\n"); for(i=0;i<14;i++)

for(j=0;j<32;j++)
if(grid[i][j][0]==position[0]&&grid[i][j][1]==position[1])
{
tmp1=i; tmp2=j;
}
minvalx=goalpoint[0]-tmp1; minvaly=goalpoint[1]-tmp2;
if(topleft==0){
i=(goalpoint[0]-(tmp1-1)); j=(goalpoint[1]-(tmp2-1));
if(i<minvalx&&j<minvaly){
minvalx=i; x=tmp1-1; minvaly=j; y=tmp2-1;
}
}
if(topright==0){
i=(goalpoint[0]-(tmp1-1)); j=(goalpoint[1]-(tmp2+1));
if(i<minvalx&&j<minvaly){
minvalx=i; x=tmp1-1; minvaly=j; y=tmp2+1;
}
}
if(downleft==0){
i=(goalpoint[0]-(tmp1+1)); j=(goalpoint[1]-(tmp2-1));
if(i<minvalx&&j<minvaly){
minvalx=i; x=tmp1+1; minvaly=j; y=tmp2-1;

}
}
if(downright==0){
i=(goalpoint[0]-(tmp1+1)); j=(goalpoint[1]-(tmp2+1));
if(i<minvalx&&j<minvaly){
minvalx=i; x=tmp1+1; minvaly=j; y=tmp2+1;
}
}
if(x==0&&y==0){
	if(downleft==1&&downright==1&&topleft==0&&topright==0&&left==0&&right==0&&top==0&&down==1){
for(i=tmp2;grid[tmp1][i][2]!=1;--i)
if(grid[tmp1+1][i][2]==0)
sell=1;
for(i=tmp2;grid[tmp1][i][2]!=1;++i)
if(grid[tmp1+1][i][2]==0)
selr=1;
if(sell==0&&selr==0){
if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2-1))){
x=tmp1; y=tmp2-1;
}

else if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2+1))){
x=tmp1; y=tmp2+1;
}
}
else{
if(sell>selr){
x=tmp1; y=tmp2-1;
}
else{
x=tmp1; y=tmp2+1;
}
}
}
else
	if(topright==1&&downright==1&&topleft==0&&downleft==0&&left==0&&top==0&&down==0&&right==1){
for(i=tmp1;grid[i][tmp2][2]!=1;--i)
if(grid[i][tmp2+1][2]==0)
selt=1;
for(i=tmp1;grid[i][tmp2][2]!=1;++i)
if(grid[i][tmp2+1][2]==0)

seld=1;
if(selt==0&&seld==0){
if((goalpoint[0]-tmp1)>(goalpoint[0]-(tmp1-1))){
x=tmp1-1; y=tmp2;
}
else if((goalpoint[0]-tmp1)>(goalpoint[0]-(tmp1+1))){
x=tmp1+1; y=tmp2;
}
}
else{
if(selt>seld){
x=tmp1-1; y=tmp2;
}
else{
x=tmp1+1; y=tmp2;
}
}
}
else
	if(top==0&&down==0&&left==1&&right==1&&topleft==0&&downleft==0){

for(i=tmp1;grid[i][tmp2][2]!=1;--i)
if(grid[i][tmp2+1][2]==0)
selt=1;
for(i=tmp1;grid[i][tmp2][2]!=1;++i)
if(grid[i][tmp2+1][2]==0)
seld=1;
if(selt==0&&seld==0){
if((goalpoint[0]-tmp1)>(goalpoint[0]-(tmp1-1))){
x=tmp1-1; y=tmp2;
}
else if((goalpoint[0]-tmp1)>(goalpoint[0]-(tmp1+1))){
x=tmp1+1; y=tmp2;
}
}
else{
if(selt>seld){
x=tmp1-1; y=tmp2;
}
else{
x=tmp1+1; y=tmp2;
}

}
}
else if(left==0&&right==0&&top==1&&down==1){
if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2-1))){
x=tmp1; y=tmp2-1;
}
else if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2+1))){
x=tmp1; y=tmp2+1;
}
}
else{
		if(top==0&&down==0&&((left==1&&right==1)||(left==0&&right==1)||(left==1&&right==0))){
for(i=tmp1;grid[i][tmp2][2]!=1;--i)
if(grid[i][tmp2+1][2]==0)
selt=1;
for(i=tmp1;grid[i][tmp2][2]!=1;++i)
if(grid[i][tmp2+1][2]==0)
seld=1;
if(selt==0&&seld==0)
{} else{

if(selt>seld) {
x=tmp1-1; y=tmp2;
}
else{
x=tmp1+1; y=tmp2;
}
}
}
		else if(left==0&&right==0&&((top==1&&down==1)||(top==0&&down==1)||(top==1&&down==0))){
if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2-1))){
x=tmp1; y=tmp2-1;
}
else if((goalpoint[1]-tmp2)>(goalpoint[1]-(tmp2+1))){
x=tmp1; y=tmp2+1;
}
}
else
if(downleft==1&&down==1&&downright==1&&right==1&&
topright==1&&top==0&&topleft==0&&left==0){
if((goalpoint[1]-(tmp2-1))<(goalpoint[1]-tmp2)){

x=tmp1; y=tmp2-1;
}
else{
x=tmp1-1; y=tmp2;
}
}
else
if(topleft==1&&topright==1&&downleft==1&&downright==
1){
i=(goalpoint[0]-(tmp1+1)); j=(goalpoint[1]-(tmp2+1));
if(i<minvalx&&j<minvaly){
minvalx=i; x=tmp1+1; minvaly=j; y=tmp2+1;
}
}
else
		if(topleft==1&&downleft==1&&downright==1&&right==0&&left==1){
x=tmp1; y=tmp2+1;
}
else
if(topleft==1&&topright==1&&downleft==0&&downright==
1&&down==0){
x=tmp1+1; y=tmp2;

}
else if(down==0){
x=tmp1+1; y=tmp2;
}
}
}
delay(500);
printf(" Generating action using Neural Network....");
delay(1000);
if(episode==10){
printf("\n\n Removing Least Used Cases...");
delay(1000);
} initialize();
curx=x; cury=y;
position[0]=grid[x][y][0]; position[1]=grid[x][y][1];
	if(left==0&&right==0&&top==0&&down==0&&topleft==0&&topright==0&&downleft==0&&downright==0){
reward=reward+2; cumreward[cumcnt]=2; cumcnt++; sv=2;
}
else{

reward=reward+1; cumreward[cumcnt]=1; cumcnt++; sv=1;


}
if(episode==10){
removelucases();
		bb[episode-1].state[0]=left; bb[episode-1].state[1]=top; bb[episode-1].state[2]=right; bb[episode-1].state[3]=down; bb[episode-1].state[4]=topleft; bb[episode-1].state[5]=topright; bb[episode-1].state[6]=downleft; bb[episode-1].state[7]=downright;
bb[episode-1].action[0]=x-tmp1; bb[episode-1].action[1]=y-tmp2;
bb[episode-1].statevalue=sv; bb[episode-1].count=0;
}
else{
bb[episode].state[0]=left; bb[episode].state[1]=top;
bb[episode].state[2]=right; bb[episode].state[3]=down;
bb[episode].state[4]=topleft; bb[episode].state[5]=topright;
bb[episode].state[6]=downleft; bb[episode].state[7]=downright;
bb[episode].action[0]=x-tmp1; bb[episode].action[1]=y-tmp2;
bb[episode].statevalue=sv; bb[episode].count=0;
episode++;
}
outtextxy(grid[x][y][0]-3,grid[x][y][1]-3,"#");

getobservation(grid[x][y][0],grid[x][y][1]);
tpcnt++;
}
int searchbb(int left,int top,int right,int down,int topleft,int topright,int downleft,int
downright){
int i=0,j=0,k=0,ret=1,free=0,gauss=0,noisex=0,noisey=0;
for(i=0;i<episode;i++){
		if(bb[i].state[0]==left&&bb[i].state[1]==top&&bb[i].state[2]==right
			&&bb[i].state[3]==down&&bb[i].state[4]==topleft&&bb[i].state[5]==topright
			&&bb[i].state[6]==downleft&&bb[i].state[7]==downright){
			if(left==0&&right==0&&top==0&&down==0&&topleft==0
				&&topright==0&&downleft==0&&downright==0){
reward=reward+2; cumreward[cumcnt]=2;
cumcnt++;
}
else{
reward=reward+1; cumreward[cumcnt]=1;
cumcnt++;
}
if(left==0)
free++;
if(right==0)

free++;
if(top==0)
free++;
if(down==0)
free++;
if(topleft==0)
free++;
if(topright==0)
free++;
if(downleft==0)
free++;
if(downright==0)
free++;
gauss=gaussnoise(free);
printf(" Inside Baseline Behavior...");
delay(500);
printf("\n\n Checking....");
delay(1000);
printf("\n\n Calculating action using Gaussian Noise...");
delay(1000);
if(gauss==1){

noisex=grid[curx+bb[i].action[0]][cury+bb[i].action[1]][
0];
noisey=grid[curx+bb[i].action[0]][cury+bb[i].action[1]][
1];
if(topright==0&&right==0&&downright==1){
if(bb[i].action[0]==0&&bb[i].action[1]==1) {
bb[i].action[0]=-1; bb[i].action[1]=1;
}
else if(bb[i].action[0]==-1&&bb[i].action[1]==1){
bb[i].action[0]=0; bb[i].action[1]=1;
}
}
else if(topright==0&&right==0&&downright==0){
if(bb[i].action[0]==0&&bb[i].action[1]==1){
if(flag==0){
bb[i].action[0]=-1; bb[i].action[1]=1;
flag=1;
}
else if(flag==1){
bb[i].action[0]=1; bb[i].action[1]=1;
flag=0;

}
}
else if(bb[i].action[0]==-1&&bb[i].action[1]==1){
bb[i].action[0]=0; bb[i].action[1]=1;
}
else{
bb[i].action[0]=0; bb[i].action[1]=1;
}
}
else
					if(topright==1&&right==1&&downright==1&&topleft==0&&downleft==0&&left==0){
						if((bb[i].action[0]==1||bb[i].action[0]==-1)&&bb[i].action[1]==0)
							bb[i].action[1]=-1;
						else if((bb[i].action[0]==1||bb[i].action[0]==-1)&&bb[i].action[1]==-1)
							bb[i].action[1]=0;
}
}
setcolor(RED); outtextxy(noisex-3,noisey-3,"#");

delay(500); initialize(); bb[i].count++;


position[0]=grid[curx+bb[i].action[0]][cury+bb[i].action[1]][0]
;
position[1]=grid[curx+bb[i].action[0]][cury+bb[i].action[1]][1]
;
curx+=bb[i].action[0];
			cury+=bb[i].action[1]; outtextxy(position[0]-3,position[1]-3,"#");
getobservation(position[0],position[1]);
ret=0; bbcnt++;
			if(left!=0||top!=0||right!=1||down!=0||topleft!=0||topright!=1||downleft!=0||downright!=1){
				for(j=0;j<episode;j++){
					if(bb[j].state[0]==0&&bb[j].state[1]==0&&bb[j].state[2]==1
						&&bb[j].state[3]==0&&bb[j].state[4]==0&&bb[j].state[5]==1
						&&bb[j].state[6]==0&&bb[j].state[7]==1){
						k=j;
while(k<episode){
bb[k].state[0]=bb[k+1].state[0]
;
bb[k].state[1]=bb[k+1].state[1];
bb[k].state[2]=bb[k+1].state[2];
bb[k].state[3]=bb[k+1].state[3];

bb[k].state[4]=bb[k+1].state[4];

bb[k].state[5]=bb[k+1].state[5];
bb[k].state[6]=bb[k+1].state[6];
bb[k].state[7]=bb[k+1].state[7];
bb[k].action[0]=bb[k+1].action[0];
bb[k].action[1]=bb[k+1].action[1];
bb[k].statevalue=bb[k+1].statevalue;
k++;
}
episode--;
}
					if(bb[j].state[1]==0&&bb[j].state[2]==1&&bb[j].state[3]==0
						&&bb[j].state[5]==1&&bb[j].state[7]==1
						&&((bb[i].state[4]==0&&bb[i].state[0]==0&&bb[i].state[6]==1)
						||(bb[i].state[4]==0&&bb[i].state[0]==1&&bb[i].state[6]==1)
						||(bb[i].state[4]==1&&bb[i].state[0]==1&&bb[i].state[6]==0)
						||(bb[i].state[4]==0&&bb[i].state[0]==0&&bb[i].state[6]==0))){
k=j;
while(k<episode)
{ bb[k].state[0]=bb[k+1].state[
0];
bb[k].state[1]=bb[k+1].state[1];
bb[k].state[2]=bb[k+1].state[2];

bb[k].state[3]=bb[k+1].state[3];
bb[k].state[4]=bb[k+1].state[4];
bb[k].state[5]=bb[k+1].state[5];
bb[k].state[6]=bb[k+1].state[6];
bb[k].state[7]=bb[k+1].state[7];
bb[k].action[0]=bb[k+1].action[0];
bb[k].action[1]=bb[k+1].action[1];
bb[k].statevalue=bb[k+1].statevalue;
k++;
}
episode--;
}
}
}
}
}
return ret;
}
void drawgraph(){
int temp=0,m=0,n=0,x=100,y=350;
clrscr(); setcolor(WHITE); line(100,100,100,360);

for(temp=150;temp<350;temp=temp+50)
line(93,temp,100,temp);
for(temp=110;temp<350;temp=temp+10)
line(98,temp,100,temp);
outtextxy(97,98,"^");delay(500); outtextxy(80,147,"2");delay(500);
outtextxy(65,197,"1.5");delay(500); outtextxy(80,247,"1");delay(500);
outtextxy(65,297,"0.5");delay(500); line(90,350,450,350);
for(temp=110;temp<450;temp=temp+10)
line(temp,350,temp,352);
for(temp=150;temp<450;temp=temp+50)
line(temp,350,temp,358);
outtextxy(447,347,">");delay(500); outtextxy(143,364,"10");delay(500);
outtextxy(193,364,"20");delay(500); outtextxy(243,364,"30");delay(500);
outtextxy(293,364,"40");delay(500); outtextxy(343,364,"50");delay(500);
outtextxy(393,364,"60");delay(500); setcolor(YELLOW);
outtextxy(300,90,"X-axis : 2 units");
outtextxy(300,100,"Y-axis : 0.1 units");delay(500);outtextxy(97,88,"y");
outtextxy(456,347,"x");delay(500);
outtextxy(240,380,"Num Trials");delay(500);
outtextxy(40,190,"R"); outtextxy(40,198,"e"); outtextxy(40,206,"w");
outtextxy(40,212,"a"); outtextxy(40,220,"r"); outtextxy(40,228,"d");

setcolor(RED); m=110;
for(temp=2;temp<=tpcnt+bbcnt;temp=temp+2){
if(cumreward[temp]==1)
n=250;
else if(cumreward[temp]==2)
n=150;
line(x,y,m,n);
x=m; y=n; m=m+10; delay(500);
}
setcolor(WHITE);
outtextxy(150,420,"Baseline Behavior stored Successfully...");
}
int gaussnoise(int free){
int res;
	double fp1,fp2,ft;
	fp1=(1/(sqrt(2*3.14)*(0.9*0.9)));
	fp2=(((1/free)-1)*((1/free)-1))/(2*(0.9*0.9));
ft=fp1*exp(fp2);
if(ft>0.5)
res=1;
else

res=0;
return res;
}
void removelucases(){ int i;
for(i=0;i<episode;i++){
if(bb[i].count==0){
while(i<episode){
bb[i].state[0]=bb[i+1].state[0];
bb[i].state[1]=bb[i+1].state[1];
bb[i].state[2]=bb[i+1].state[2];
bb[i].state[3]=bb[i+1].state[3];
bb[i].state[4]=bb[i+1].state[4];
bb[i].state[5]=bb[i+1].state[5];
bb[i].state[6]=bb[i+1].state[6];
bb[i].state[7]=bb[i+1].state[7];
bb[i].action[0]=bb[i+1].action[0];
bb[i].action[1]=bb[i+1].action[1];
bb[i].statevalue=bb[i+1].statevalue;
bb[i].count=bb[i+1].count;
i++;

}
}
}
}
void storebb(){
int i=0;
FILE *fp1;
fp1=fopen("BaselineBehavior.dat","w");
fprintf(fp1,"\n\n\tBaseline Behavior Cases:\n");
	fprintf(fp1,"\n\t_");
fprintf(fp1,"\n\t|CASE|| l | t | r | d | tl | tr | dl | dr || a1 | a2 | val |");
fprintf(fp1,"\n\t|------------------------------------------------------------------------|");
	for(i=0;i<episode;i++){
		fprintf(fp1,"\n\t| %d || %d | %d | %d | %d | %d | %d | %d | %d || %d | %d | %d |",
			i+1,bb[i].state[0],bb[i].state[1],bb[i].state[2],bb[i].state[3],
			bb[i].state[4],bb[i].state[5],bb[i].state[6],bb[i].state[7],
			bb[i].action[0],bb[i].action[1],bb[i].statevalue);
	}
	fprintf(fp1,"\n\t|------------------------------------------------------------------------|\n");
	fclose(fp1);
}

6.2 SCREEN SHOTS


BEFORE START

Fig 6.1 BEFORE START OF EXPLORATION


GENERATING ACTION USING TEACHER POLICY

Fig 6.2 EXPLORATION USING TEACHER POLICY

ADDING GAUSSIAN NOISE TO ACTIONS IN BASELINE BEHAVIOR

Fig 6.3 USING CASE BASE WITH GAUSSIAN NOISE


REMOVAL OF LEAST USED CASES

Fig 6.4 REMOVAL OF LEAST USED CASES

BASELINE BEHAVIOR CASES AFTER REACHING GOAL POINT

Fig 6.5 CASE-BASE AND REWARD AT THE END OF EXCURSION

REFERENCES

1. Aha, D. W. (1992). Tolerating Noisy, Irrelevant and Novel Attributes in
   Instance-Based Learning Algorithms.

2. Borrajo, F., Bueno, Y., de Pablo, I., Santos, B. n., Fernandez, F., Garcia, J.,
   & Sagredo, I. (2010). SIMBA: A Simulator for Business Education and
   Research.

3. Moriarty, D. E., Schultz, A. C., & Grefenstette, J. J. (1999). Evolutionary
   Algorithms for Reinforcement Learning.

4. Eiben, A. E., & Smith, J. E. (2002). Introduction to Evolutionary Algorithms,
   Chapter 2.

5. Fernandez, F., & Borrajo, D. (2008). Two Steps Reinforcement Learning.
   International Journal of Intelligent Systems, 23 (2), 213-245.

6. Fernandez, F., & Isasi, P. (2008). Local Feature Weighting in Nearest
   Prototype Classification. IEEE Transactions on Neural Networks, 19 (1), 40-53.

7. Geibel, P., & Wysotzki, F. (2005). Risk-Sensitive Reinforcement Learning
   Applied to Control under Constraints.

8. Garcia, J., & Fernandez, F. (2012). Safe Exploration of State and Action
   Spaces in Reinforcement Learning.

9. Lee, J.-Y., & Lee, J.-J. (2008). Multiple Designs of Fuzzy Controllers for
   Car Parking Using Evolutionary Algorithm.

10. Mihatsch, O., & Neuneier, R. (2002). Risk-Sensitive Reinforcement Learning.

11. Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An
    Introduction.