
Unit-1

Overview and Search Technique


Introduction to Artificial Intelligence
Definition of AI
What is AI?
Artificial Intelligence is concerned with the design of
intelligence in an artificial device.
The term was coined by McCarthy in 1956.
There are two ideas in the definition.
1. Intelligence
2. Artificial device
What is intelligence?
Is it that which characterizes humans? Or is there an absolute
standard of judgment?
Accordingly there are two possibilities:
A system with intelligence is expected to behave as
intelligently as a human
A system with intelligence is expected to behave in the best
possible manner
Secondly what type of behavior are we talking about?
Are we looking at the thought process or reasoning ability of
the system?
Or are we only interested in the final manifestations of the
system in terms of its actions?
Given this scenario different interpretations have been used
by different researchers as defining the scope and view of
Artificial Intelligence.
1. One view is that artificial intelligence is about designing
systems that are as intelligent as humans.
This view involves trying to understand human thought and an
effort to build machines that emulate the human thought process.
This view is the cognitive science approach to AI.
2. The second approach is best embodied by the concept of the
Turing Test. Turing held that in future computers can be
programmed to acquire abilities rivaling human intelligence. As
part of his argument Turing put forward the idea of an 'imitation
game', in which a human being and a computer would be
interrogated under conditions where the interrogator would not
know which was which, the communication being entirely by
textual messages. Turing argued that if the interrogator could not
distinguish them by questioning, then it would be unreasonable
not to call the computer intelligent. Turing's 'imitation game' is
now usually called 'the Turing test' for intelligence.
3. Logic and laws of thought deals with studies of ideal or
rational thought process and inference. The emphasis in this case
is on the inference mechanism, and its properties. That is how
the system arrives at a conclusion, or the reasoning behind its
selection of actions is very important in this point of view. The
soundness and completeness of the inference mechanisms are
important here.
4. The fourth view of AI is that it is the study of rational agents.
This view deals with building machines that act rationally. The
focus is on how the system acts and performs, and not so much
on the reasoning process. A rational agent is one that acts
rationally, that is, in the best possible manner.

Problem Solving:
Strong AI aims to build machines that can truly reason and
solve problems. These machines should be self-aware and their
overall intellectual ability needs to be indistinguishable from
that of a human being. Excessive optimism in the 1950s and
1960s concerning strong AI has given way to an appreciation of
the extreme difficulty of the problem. Strong AI maintains that
suitably programmed machines are capable of cognitive mental
states.
Weak AI: Deals with the creation of some form of computer-based artificial intelligence that cannot truly reason and solve
problems, but can act as if it were intelligent. Weak AI holds
that suitably programmed machines can simulate human
cognition.
Applied AI: Aims to produce commercially viable "smart"
systems such as, for example, a security system that is able to
recognize the faces of people who are permitted to enter a
particular building. Applied AI has already enjoyed considerable
success.
Cognitive AI: Computers are used to test theories about how the
human mind works, for example, theories about how we
recognize faces and other objects, or about how we solve
abstract problems.
Best First Search: Best-first search is a way of combining the
advantages of both depth-first and breadth-first search into a single method.
One way of combining the two is to follow a single path
at a time, but switch paths whenever some competing path looks
more promising than the current one does.
At each step of the best-first search process, we select
the most promising of the nodes we have generated so far. This
is done by applying an appropriate heuristic to each of them. We
then expand the chosen node by using the rules to generate its
successors. If one of them is a solution, we can quit. If not, all
those new nodes are added to the set of nodes generated so far.
Again the most promising node is selected and the process
continues.
Fig shows the beginning of a best-first search procedure.
Initially, there is only one node, so it will be expanded. Doing so
generates three new nodes. The heuristic function, which, in this
example, is an estimate of the cost of getting to a solution from a
given node, is applied to each of these new nodes. Since node D
is the most promising, it is expanded next, producing two
successor nodes, E and F. The heuristic function is then
applied to them. Now another path, that going through node B,
looks more promising, so it is pursued, generating nodes G and
H. But again when these new nodes are evaluated they look less
promising than another path, so attention is returned to the path
through D to E. E is then expanded, yielding nodes I and J. At
the next step, J will be expanded, since it is the most promising.
This process can continue until a solution is found.
[Figure omitted: the search tree at Steps 1 through 5, growing from the single node A as D, B, and E are expanded in turn.]

Fig: A best-first search


The actual operation of the algorithm is very simple.
It proceeds in steps, expanding one node at each step, until it
generates a node that corresponds to a goal state. At each step, it
picks the most promising of the nodes that have so far been
generated but not expanded. It generates the successors of the
chosen node, applies the heuristic function to them, and adds
them to the list of open nodes, after checking to see if any of
them have been generated before. By doing this check, we can
guarantee that each node only appears once in this graph,
although many nodes may point to it as a successor. Then the
next step begins.
The process can be summarized as follows.
Algorithm: Best-First Search
1. Start with OPEN containing just the initial state.
2. Until a goal is found or there are no nodes left on OPEN, do:
(a) Pick the best node on OPEN.
(b) Generate its successors.
(c) For each successor do:
i. If it has not been generated before, evaluate it, add it to OPEN, and record its parent.
ii. If it has been generated before, change the parent if this new path is better than the previous one. In that case, update the cost of getting to this node and to any successors that this node may already have.
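
To make the OPEN bookkeeping concrete, the following is a minimal Python sketch of the procedure above (it omits the path-cost update of step 2(c)ii). The graph and heuristic values are hypothetical, chosen only to echo the figure.

import heapq

def best_first_search(start, goal, successors, h):
    """Greedy best-first search: always expand the node on OPEN
    with the lowest heuristic value."""
    open_list = [(h(start), start)]          # OPEN as a priority queue
    parents = {start: None}                  # also serves as the "seen" set
    while open_list:
        _, node = heapq.heappop(open_list)   # best node on OPEN
        if node == goal:                     # solution found: rebuild the path
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return list(reversed(path))
        for child in successors(node):
            if child not in parents:         # not generated before
                parents[child] = node
                heapq.heappush(open_list, (h(child), child))
    return None                              # no nodes left on OPEN

# Hypothetical graph and heuristic estimates, loosely matching the figure
graph = {'A': ['B', 'C', 'D'], 'B': ['G', 'H'], 'C': [], 'D': ['E', 'F'],
         'E': ['I', 'J'], 'F': [], 'G': [], 'H': [], 'I': [], 'J': []}
h_values = {'A': 6, 'B': 5, 'C': 8, 'D': 3, 'E': 2,
            'F': 4, 'G': 6, 'H': 7, 'I': 1, 'J': 0}
print(best_first_search('A', 'J', lambda n: graph[n], lambda n: h_values[n]))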

A* Search: The best-first search algorithm that was just
presented is a simplification of an algorithm called A*, which
was first presented by Hart et al. [1968; 1972]. This algorithm
uses the same f, g, and h functions, as well as the lists OPEN
and CLOSED.
The form of heuristic estimation function for A* is
f*(n) = g*(n) + h*(n)
where the two components g*(n) and h*(n) are estimates of
the cost (or distance) from the start node to node n and the
cost from node n to a goal node, respectively.
Nodes on the open list are nodes that have been
generated but not yet expanded while nodes on the closed list
are nodes that have been expanded and whose children are,
therefore, available to the search program. The A* algorithm
proceeds as follows
Algorithm: A*
1. Place the starting node s on open.
2. If open is empty, stop and return failure.
3. Remove from open the node n that has the smallest value of f*(n). If the node is a goal node, return success and stop. Otherwise,
4. Expand n, generating all of its successors n', and place n on closed. For every successor n', if n' is not already on open or closed, attach a back-pointer to n, compute f*(n'), and place it on open.
5. Each n' that is already on open or closed should be attached to back-pointers which reflect the lowest g*(n') path. If n' was on closed and its pointer was changed, remove it and place it on open.
6. Return to step 2.
It has been shown that the A* algorithm is both complete and
admissible. Thus, A* will always find an optimal path if one
exists. The efficiency of an A* algorithm depends on how
closely h* approximates h and the cost of computing f*.
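
A compact Python sketch of the algorithm follows, with g* tracked per node and f* = g* + h* ordering the open list. The weighted graph and the (consistent) heuristic below are illustrative assumptions, not part of the original text.

import heapq

def a_star(start, goal, neighbors, cost, h):
    """A*: expand the open node with the smallest f*(n) = g*(n) + h*(n)."""
    g = {start: 0}                       # best known cost from start
    parent = {start: None}
    open_heap = [(h(start), start)]      # entries are (f*, node)
    closed = set()
    while open_heap:
        f, n = heapq.heappop(open_heap)
        if n == goal:                    # optimal if h is admissible
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return g[goal], path[::-1]
        if n in closed:
            continue
        closed.add(n)
        for m in neighbors(n):
            tentative = g[n] + cost(n, m)
            if m not in g or tentative < g[m]:   # better path: move the back-pointer
                g[m] = tentative
                parent[m] = n
                heapq.heappush(open_heap, (tentative + h(m), m))
    return None

# Illustrative weighted graph and heuristic estimates
edges = {'S': {'A': 1, 'B': 4}, 'A': {'B': 2, 'G': 12}, 'B': {'G': 5}, 'G': {}}
h_est = {'S': 5, 'A': 4, 'B': 2, 'G': 0}
print(a_star('S', 'G', lambda n: edges[n], lambda a, b: edges[a][b],
             lambda n: h_est[n]))   # -> (8, ['S', 'A', 'B', 'G'])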

AO* algorithm:
1. Place the starting node s on open.
2. Using the search tree constructed thus far, compute the most promising solution tree To.
3. Select a node n that is both on open and a part of To. Remove n from open and place it on closed.
4. If n is a terminal goal node, label n as solved. If the solution of n results in any of n's ancestors being solved, label all such ancestors as solved. If the start node s is solved, exit with success, where To is the solution tree. Remove from open all nodes with a solved ancestor.
5. If n is not a solvable node (operators cannot be applied), label n as unsolvable. If the start node is labeled as unsolvable, exit with failure. If any of n's ancestors become unsolvable because n is, label them unsolvable as well. Remove from open all nodes with unsolvable ancestors.
6. Otherwise, expand node n, generating all of its successors. For each such successor node that contains more than one subproblem, generate their successors to give individual subproblems. Attach to each newly generated node a back-pointer to its predecessor. Compute the cost estimate h* for each newly generated node and place all such nodes that do not yet have descendants on open. Next, recompute the values of h* at n and at each ancestor of n.
7. Return to step 2.
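
A full AO* implementation is lengthy, so the sketch below shows only the core computation the algorithm revolves around: the cost of the most promising solution tree in a small acyclic AND-OR graph. The hyperarc encoding and the example graph are assumptions made purely for illustration.

def solution_cost(node, hyperarcs, h, memo=None):
    """Cost of the best solution tree rooted at `node` in an acyclic
    AND-OR graph.  `hyperarcs[n]` is a list of (arc_cost, children)
    pairs; an AND arc has several children, an OR arc just one.
    Terminal nodes (no hyperarcs) cost their heuristic value h[n]."""
    if memo is None:
        memo = {}
    if node in memo:
        return memo[node]
    arcs = hyperarcs.get(node, [])
    if not arcs:                          # terminal: use h (0 for goal nodes)
        memo[node] = h[node]
        return memo[node]
    best = min(cost + sum(solution_cost(c, hyperarcs, h, memo)
                          for c in children)
               for cost, children in arcs)
    memo[node] = best
    return best

# Hypothetical graph: A is solved via B alone (OR), or via C AND D together
hyperarcs = {'A': [(1, ['B']), (1, ['C', 'D'])]}
h = {'B': 5, 'C': 1, 'D': 2}
print(solution_cost('A', hyperarcs, h))   # min(1+5, 1+1+2) = 4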

Hill Climbing Search: Hill climbing gets its name
from the way the nodes are selected for expansion. At each
point in the search path, a successor node that appears to lead
most quickly to the top of the hill (the goal) is selected for
exploration. This method requires that some information be
available with which to evaluate and order the most promising
choices.
Hill climbing is like depth-first searching where the
most promising child is selected for expansion. When the
children have been generated, alternative choices are
evaluated using some type of heuristic function. The path that
appears most promising is then chosen, and no further
reference to the parent or other children is retained. This
process continues from node-to-node with previously
expanded nodes being discarded.
Hill climbing can produce substantial savings over blind
searches when an informative, reliable function is available to
guide the search to a global goal. It suffers from some serious
drawbacks when this is not the case. Potential problem types
named after certain terrestrial anomalies are the foothill,
ridge, and plateau traps.
The foothill trap results when local maxima or peaks
are found. In this case the children all have less promising
goal distances than the parent node. The search is essentially
trapped at the local node with no indication of goal direction.
The only ways to remedy this problem are to try moving in
some arbitrary direction for a few generations in the hope that the
real goal direction will become evident, or to backtrack to an
ancestor node and try a secondary path choice.
A second potential problem occurs when several
adjoining nodes have higher values than surrounding nodes.
This is the equivalent of a ridge. The search may also encounter a
plateau type of structure, that is, an area in which all
neighboring nodes have the same values. Once again, one of
the methods noted above must be tried to escape the trap.
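
A minimal steepest-ascent hill-climbing sketch follows; the one-dimensional objective function and the neighbourhood are hypothetical stand-ins for a real problem, and the run illustrates exactly the foothill (local maximum) weakness described above.

def hill_climb(state, neighbors, value):
    """Steepest-ascent hill climbing: move to the best neighbour
    until no neighbour improves on the current state."""
    while True:
        candidates = neighbors(state)
        if not candidates:
            return state
        best = max(candidates, key=value)
        if value(best) <= value(state):
            return state                 # trapped: foothill or plateau
        state = best                     # keep only the best child, discard the rest

# Hypothetical landscape: local peak at x = 2, global peak at x = 8
def value(x):
    return -(x - 2) ** 2 if x < 5 else 100 - (x - 8) ** 2

print(hill_climb(0, lambda x: [x - 1, x + 1], value))   # -> 2, not the global peak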
Breadth-First Search: Breadth-first searches are
performed by exploring all nodes at a given depth before
proceeding to the next level. This means that all immediate
children of nodes are explored before any of the children's
children are considered. Breadth first tree search is illustrated
in fig. It has the obvious advantage of always finding a
minimal path length solution where one exists. A great many
nodes may need to be explored before a solution is found,
especially if the tree is very full. It uses a queue structure to
hold all generated but still unexplored nodes. The breadth-first algorithm proceeds as follows.
BREADTH-FIRST SEARCH
1. Place the starting node s on the queue.
2. If the queue is empty, return failure and stop.
3. If the first element on the queue is a goal node g,
return success and stop. Otherwise,
4. Remove and expand the first element from the queue
and place all the children at the end of the queue in
any order.
5. Return to step 2.
[Figure omitted: a breadth-first search tree, explored level by level from the Start node.]
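
A queue-based Python sketch of the procedure above; the small tree is a hypothetical example.

from collections import deque

def breadth_first_search(start, goal, children):
    """Explore all nodes at one depth before the next, using a queue."""
    queue = deque([[start]])             # queue of paths, not just nodes
    while queue:                         # step 2: an empty queue means failure
        path = queue.popleft()
        node = path[-1]
        if node == goal:                 # step 3: first hit is a minimal-length path
            return path
        for child in children(node):     # step 4: children go to the back
            queue.append(path + [child])
    return None

tree = {'Start': ['A', 'B'], 'A': ['C', 'D'], 'B': ['E', 'F'],
        'C': [], 'D': [], 'E': [], 'F': []}
print(breadth_first_search('Start', 'E', lambda n: tree[n]))  # ['Start', 'B', 'E']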

Mini-max search: The mini-max search procedure is a depth-limited search procedure. The idea is to start at the current
position and use the plausible-move generator to generate the set
of possible successor positions. Now we can apply the static
evaluation function to those positions and simply choose the
best one.
The starting position is exactly as good for us as the position
generated by the best move we can make next. Here we assume
that the static evaluation function returns large values to indicate
good situations for us, so our goal is to maximize the value of
the static evaluation function of the next board position.
An example of this operation is shown in fig1. It assumes a
static evaluation function that returns values ranging from -10 to
10, with 10 indicating a win for us, -10 a win for the opponent,
and 0 an even match. Since our goal is to maximize the value of
the heuristic function, we choose to move to B. Backing B's
value up to A, we can conclude that A's value is 8, since we
know we can move to a position with a value of 8.
[Figure omitted. Fig1: One-ply search and two-ply search. In the one-ply tree, A's successors are evaluated as 8, 3, and -2; in the two-ply tree, leaf values such as 9, -6, 0, 0, -2, -4, and 3 are backed up to the first level.]

But since we know that the static evaluation function is not
completely accurate, we would like to carry the search farther
ahead than one ply. This could be very important, for example, in
a chess game in which we are in the middle of a piece exchange.

After our move, the situation would appear to be very good, but
if we look one move ahead, we will see that one of our pieces
also gets captured and so the situation is not as seemed.
Once the values from the second ply are backed up, it becomes
clear that the correct move for us to make at the first level, given
the information we have available, is C, since there is nothing
the opponent can do from there to produce a value worse than -2. This process can be repeated for as many ply as time allows,
and the more accurate evaluations that are produced can be used
to choose the correct move at the top level. The alternation of
maximizing and minimizing at alternate ply when evaluations

are being pushed back up corresponds to the opposing strategies
of the two players and gives this method the name minimax.
Having described informally the operation of the minimax
procedure, we now describe it precisely. It is a straight forward
recursive procedure that relies on two auxiliary procedures that
are specific to the game being played:
1. MOVGEN (position, player) - The plausible-move generator,
which returns a list of nodes representing the moves that can be
made by player in position. We call the two players PLAYER-ONE and PLAYER-TWO; in a chess program, we might use the
names BLACK and WHITE instead.
2. STATIC (position, player) - the static evaluation function,
which returns a number representing the goodness of position
from the standpoint of player.
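
Putting MOVGEN and STATIC together, a depth-limited minimax can be sketched in a few lines of Python. The tiny two-ply tree below is hypothetical, with leaf values chosen to reproduce the example in the text (C is the correct move, since its worst reply is -2).

def minimax(position, depth, maximizing, movegen, static):
    """Depth-limited minimax: back up STATIC values from the frontier,
    maximizing on our plies and minimizing on the opponent's."""
    moves = movegen(position)
    if depth == 0 or not moves:
        return static(position)
    values = [minimax(m, depth - 1, not maximizing, movegen, static)
              for m in moves]
    return max(values) if maximizing else min(values)

tree = {'A': ['B', 'C', 'D'], 'B': ['B1', 'B2'],
        'C': ['C1', 'C2'], 'D': ['D1', 'D2']}
scores = {'B1': 9, 'B2': -6, 'C1': 0, 'C2': -2, 'D1': -4, 'D2': 3}
best = max(tree['A'], key=lambda m: minimax(m, 1, False,
                                            lambda p: tree.get(p, []),
                                            lambda p: scores[p]))
print(best)   # 'C': its worst reply (-2) beats B's (-6) and D's (-4)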

Heuristic function
A heuristic function, or simply a heuristic, is a function that
ranks alternatives in various search algorithms at each branching
step based on the available information (heuristically) in order to
make a decision about which branch to follow during a search.
Shortest paths
For example, for shortest path problems, a heuristic is a
function, h(n) defined on the nodes of a search tree, which
serves as an estimate of the cost of the cheapest path from that
node to the goal node. Heuristics are used by informed search
algorithms such as Greedy best-first search and A* to choose the
Page 14

best node to explore. Greedy best-first search will choose the


node that has the lowest value for the heuristic function. A*
search will expand nodes that have the lowest value for g(n) +
h(n), where g(n) is the (exact) cost of the path from the initial
state to the current node. If h(n) is admissible, that is, if h(n)
never overestimates the cost of reaching the goal, then A*
will always find an optimal solution.
The classical problem involving heuristics is the n-puzzle.
Commonly used heuristics for this problem include counting the
number of misplaced tiles and finding the sum of the Manhattan
distances between each block and its position in the goal
configuration. Note that both are admissible.
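
Both heuristics can be written down directly; below is a sketch for the 8-puzzle, with states encoded as tuples of nine entries (0 for the blank), an encoding chosen here only for illustration.

def misplaced_tiles(state, goal):
    """Count tiles (not the blank) that are out of place: admissible."""
    return sum(1 for s, g in zip(state, goal) if s != 0 and s != g)

def manhattan(state, goal, width=3):
    """Sum of horizontal plus vertical distances of each tile from its
    goal square: admissible, and never smaller than misplaced_tiles."""
    total = 0
    for idx, tile in enumerate(state):
        if tile == 0:
            continue
        gidx = goal.index(tile)
        total += abs(idx // width - gidx // width) + abs(idx % width - gidx % width)
    return total

goal  = (1, 2, 3, 4, 5, 6, 7, 8, 0)
state = (1, 2, 3, 4, 0, 6, 7, 5, 8)
print(misplaced_tiles(state, goal), manhattan(state, goal))  # 2 2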
Effect of heuristics on computational performance
In any searching problem where there are b choices at each node
and a depth of d at the goal node, a naive searching algorithm
would have to potentially search around b^d nodes before finding
a solution. Heuristics improve the efficiency of search
algorithms by reducing the branching factor from b to a lower
constant b', using a cutoff mechanism. The branching factor can
be used for defining a partial order on the heuristics, such that
h1(n) < h2(n) if h1(n) has a lower branch factor than h2(n) for a
given node n of the search tree. Heuristics giving lower
branching factors at every node in the search tree are preferred
for the resolution of a particular problem, as they are more
computationally efficient.

Page 15

Finding heuristics
The problem of finding an admissible heuristic with a low
branching factor for common search tasks has been extensively
researched in the artificial intelligence community. Several
common techniques are used:

- Solution costs of sub-problems often serve as useful estimates of the overall solution cost. These are always admissible. For example, a heuristic for a 10-puzzle might be the cost of moving tiles 1-5 into their correct places. A common idea is to use a pattern database that stores the exact solution cost of every sub-problem instance.
- The solution of a relaxed problem often serves as a useful admissible estimate of the original. For example, Manhattan distance is a relaxed version of the n-puzzle problem, because we assume we can move each tile to its position independently of moving the other tiles.
- Given a set of admissible heuristic functions h1(n), h2(n), ..., hi(n), the function h(n) = max{h1(n), h2(n), ..., hi(n)} is an admissible heuristic that dominates all of them.

Using these techniques, a program called ABSOLVER was
written (1993) by A.E. Prieditis for automatically generating
heuristics for a given problem. ABSOLVER generated a new
heuristic for the 8-puzzle better than any pre-existing heuristic
and found the first useful heuristic for solving the Rubik's Cube.

Consistency and Admissibility


If a heuristic function never overestimates the cost of reaching the
goal, then it is called an admissible heuristic function.
If h(n) is consistent, then the values of h(n) for the nodes along
any path to the goal node are non-decreasing.
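
In symbols (the standard definitions, stated here for completeness): writing h*(n) for the true cost of the cheapest path from n to a goal, and c(n, n') for the cost of a step from n to a successor n',

\forall n:\quad h(n) \le h^{*}(n)
  (admissibility: never overestimate)

\forall n,\ n' \in \mathrm{succ}(n):\quad h(n) \le c(n, n') + h(n'),\qquad h(\mathrm{goal}) = 0
  (consistency, a triangle inequality along every edge)

Consistency implies admissibility, and with a consistent h the f-values A* encounters along any path are non-decreasing, which is what the statement above expresses.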

Alpha-beta pruning: Alpha-beta pruning is a search
algorithm which seeks to reduce the number of nodes that are
evaluated by the minimax algorithm in its search tree. It is an
adversarial search algorithm used commonly for machine
playing of two-player games (Tic-tac-toe, Chess, Go, etc.). It
stops completely evaluating a move when at least one possibility
has been found that proves the move to be worse than a
previously examined move. Such moves need not be evaluated
further. Alpha-beta pruning is a sound optimization in that it
does not change the result of the algorithm it optimizes.
History
Allen Newell and Herbert Simon who used what John McCarthy
calls an "approximation"[1] in 1958 wrote that alpha-beta
"appears to have been reinvented a number of times".[2] Arthur
Samuel had an early version and Richards, Hart, Levine and/or
Edwards found alpha-beta independently in the United States.[3]
McCarthy proposed similar ideas during the Dartmouth
Conference in 1956 and suggested it to a group of his students
including Alan Kotok at MIT in 1961.[4] Alexander Brudno
independently discovered the alpha-beta algorithm, publishing
his results in 1963.[5] Donald Knuth and Ronald W. Moore
refined the algorithm in 1975[6][7] and it has continued to be
advanced.
Improvements over naive minimax
[Figure omitted.] An illustration of alpha-beta pruning: the grayed-out subtrees need not be explored (when moves are evaluated from left to right), since we know the group of subtrees as a whole yields the value of an equivalent subtree or worse, and as such cannot influence the final result. The max and min levels represent the turn of the player and the adversary, respectively.

The benefit of alpha-beta pruning lies in the fact that branches of
the search tree can be eliminated. The search time can in this
way be limited to the 'more promising' sub tree, and a deeper
search can be performed in the same time. Like its predecessor,
it belongs to the branch and bound class of algorithms. The
optimization reduces the effective depth to slightly more than
half that of simple mini max if the nodes are evaluated in an
optimal or near optimal order (best choice for side on move
ordered first at each node).
With an (average or constant) branching factor of b, and a search
depth of d plies, the maximum number of leaf node positions
evaluated (when the move ordering is pessimal) is O(b*b*...*b)
= O(b^d), the same as a simple minimax search. If the move
ordering for the search is optimal (meaning the best moves are
always searched first), the number of leaf node positions
evaluated is about O(b*1*b*1*...*b) for odd depth and
O(b*1*b*1*...*1) for even depth, or O(b^(d/2)) = O(sqrt(b^d)). In the
latter case, where the ply of a search is even, the effective
branching factor is reduced to its square root, or, equivalently,
the search can go twice as deep with the same amount of
computation. The explanation of b*1*b*1*... is that all the first
player's moves must be studied to find the best one, but for each,
only the best second player's move is needed to refute all but the
first (and best) first player move; alpha-beta ensures no other
second player moves need be considered. If b = 40 (as in chess),
and the search depth is 12 plies, the ratio between optimal and
pessimal sorting is a factor of nearly 40^6, or about 4 billion times.
Normally during alpha-beta, the subtrees are temporarily
dominated by either a first player advantage (when many first
player moves are good, and at each search depth the first move
checked by the first player is adequate, but all second player
responses are required to try and find a refutation), or vice versa.
This advantage can switch sides many times during the search if
the move ordering is incorrect, each time leading to inefficiency.
As the number of positions searched decreases exponentially
each move nearer the current position, it is worth spending
considerable effort on sorting early moves. An improved sort at
any depth will exponentially reduce the total number of
positions searched, but sorting all positions at depths near the
root node is relatively cheap as there are so few of them. In
practice, the move ordering is often determined by the results of
earlier, smaller searches, such as through iterative deepening.
The algorithm maintains two values, alpha and beta, which
represent the minimum score that the maximizing player is
assured of and the maximum score that the minimizing player is
assured of, respectively. Initially alpha is negative infinity and
beta is positive infinity. As the recursion progresses the
"window" becomes smaller. When beta becomes less than alpha,
it means that the current position cannot be the result of best
play by both players and hence need not be explored further.
Additionally, this algorithm can be trivially modified to return
an entire principal variation in addition to the score. Some more
aggressive algorithms such as MTD(f) do not easily permit such
a modification.

Pseudo code
function alphabeta(node, depth, alpha, beta, Player)
    if depth = 0 or node is a terminal node
        return the heuristic value of node
    if Player = MaxPlayer
        for each child of node
            alpha := max(alpha, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break                  (* Beta cut-off *)
        return alpha
    else
        for each child of node
            beta := min(beta, alphabeta(child, depth-1, alpha, beta, not(Player)))
            if beta <= alpha
                break                  (* Alpha cut-off *)
        return beta

(* Initial call *)
alphabeta(origin, depth, -infinity, +infinity, MaxPlayer)
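
The pseudocode translates almost line for line into runnable Python; the small game tree below is a hypothetical example used only to exercise the cut-offs.

def alphabeta(node, depth, alpha, beta, maximizing, children, value):
    """Alpha-beta pruning over a minimax tree: same result as plain
    minimax, but branches outside the (alpha, beta) window are cut."""
    kids = children(node)
    if depth == 0 or not kids:
        return value(node)
    if maximizing:
        for child in kids:
            alpha = max(alpha, alphabeta(child, depth - 1, alpha, beta,
                                         False, children, value))
            if beta <= alpha:
                break                    # beta cut-off
        return alpha
    else:
        for child in kids:
            beta = min(beta, alphabeta(child, depth - 1, alpha, beta,
                                       True, children, value))
            if beta <= alpha:
                break                    # alpha cut-off
        return beta

tree = {'A': ['B', 'C'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']}
leaf = {'B1': 3, 'B2': 5, 'C1': 2, 'C2': 9}
print(alphabeta('A', 2, float('-inf'), float('inf'), True,
                lambda n: tree.get(n, []), lambda n: leaf[n]))
# -> 3, and leaf C2 is never evaluated (cut off after C1 refutes C)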
Heuristic improvements
Alpha-beta search can be made even faster by considering only a
narrow search window (generally determined by guesswork
based on experience). This is known as aspiration search. In the
extreme case, the search is performed with alpha and beta equal;
a technique known as zero-window search, null-window search,
or scout search. This is particularly useful for win/loss searches
near the end of a game where the extra depth gained from the
narrow window and a simple win/loss evaluation function may
lead to a conclusive result. If an aspiration search fails, it is
straightforward to detect whether it failed high (high edge of
window was too low) or low (lower edge of window was too
high). This gives information about what window values might
be useful in a re-search of the position.

Constraint Satisfaction: Constraint satisfaction is the
process of finding a solution to a set of constraints that impose
conditions that the variables must satisfy. A solution is therefore
a vector of variables that satisfies all constraints.
The techniques used in constraint satisfaction depend on the
kind of constraints being considered. Often used are constraints
on a finite domain, to the point that constraint satisfaction
problems are typically identified with problems based on
constraints on a finite domain. Such problems are usually solved
via search, in particular a form of backtracking or local search.
Constraint propagation is another method used on such
problems; most such methods are incomplete in general, that is, they
may solve the problem or prove it unsatisfiable, but not always.
Constraint propagation methods are also used in conjunction
with search to make a given problem simpler to solve. Other
considered kinds of constraints are on real or rational numbers;
solving problems on these constraints is done via variable
elimination or the simplex algorithm.
Constraint satisfaction originated in the field of artificial
intelligence in the 1970s (see for example (Laurière 1978)).
During the 1980s and 1990s, embedding of constraints into a
programming language were developed. Languages often used
for constraint programming are Prolog and C++.

Constraint satisfaction problem


Constraints enumerate the possible values a set of variables
may take. Informally, a finite domain is a finite set of arbitrary
elements. A constraint satisfaction problem on such domain
contains a set of variables whose values can only be taken from
the domain, and a set of constraints, each constraint specifying
the allowed values for a group of variables. A solution to this
problem is an evaluation of the variables that satisfies all
constraints. In other words, a solution is a way for assigning a
value to each variable in such a way that all constraints are
satisfied by these values.
In practice, constraints are often expressed in compact form,
rather than enumerating all values of the variables that would
satisfy the constraint. One of the most used constraints is the one
establishing that the values of the affected variables must be all
different.
Problems that can be expressed as constraint satisfaction
problems are the Eight queens puzzle, the Sudoku solving
problem, the Boolean satisfiability problem, scheduling
problems and various problems on graphs such as the graph
coloring problem.
While usually not included in the above definition of a
constraint satisfaction problem, arithmetic equations and
inequalities bound the values of the variables they contain and
can therefore be considered a form of constraints. Their domain
is the set of numbers (either integer, rational, or real), which is


infinite: therefore, the relations of these constraints may be
infinite as well; for example, X = Y + 1 has an infinite number of
pairs of satisfying values. Arithmetic equations and inequalities
are often not considered within the definition of a "constraint
satisfaction problem", which is limited to finite domains. They
are however used often in constraint programming.
Solving
Constraint satisfaction problems on finite domains are typically
solved using a form of search. The most used techniques are
variants of backtracking, constraint propagation, and local
search. These techniques are used on problems with nonlinear
constraints.
Variable elimination and the simplex algorithm are used for
solving linear and polynomial equations and inequalities, and
problems containing variables with infinite domain. These are
typically solved as optimization problems in which the
optimized function is the number of violated constraints.
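
A minimal backtracking solver for a finite-domain CSP can be sketched directly; here it is instantiated with a small, hypothetical map-coloring problem (adjacent regions must receive different colors).

def backtrack(assignment, variables, domains, conflicts):
    """Depth-first assignment of values; backtrack on any violated
    constraint.  Returns a complete consistent assignment or None."""
    if len(assignment) == len(variables):
        return assignment
    var = next(v for v in variables if v not in assignment)
    for value in domains[var]:
        if not conflicts(var, value, assignment):
            assignment[var] = value
            result = backtrack(assignment, variables, domains, conflicts)
            if result is not None:
                return result
            del assignment[var]          # undo and try the next value
    return None

# Hypothetical map: WA, NT, SA mutually adjacent, plus NT-Q and SA-Q
adjacent = {('WA', 'NT'), ('WA', 'SA'), ('NT', 'SA'), ('NT', 'Q'), ('SA', 'Q')}

def conflicts(var, value, assignment):
    """True if giving `var` this value clashes with a colored neighbour."""
    for a, b in adjacent:
        if var == a and assignment.get(b) == value:
            return True
        if var == b and assignment.get(a) == value:
            return True
    return False

variables = ['WA', 'NT', 'SA', 'Q']
domains = {v: ['red', 'green', 'blue'] for v in variables}
print(backtrack({}, variables, domains, conflicts))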
Complexity
Solving a constraint satisfaction problem on a finite domain is
an NP-complete problem. Research has shown a number of
tractable sub cases, some limiting the allowed constraint
relations, some requiring the scopes of constraints to form a tree,
possibly in a reformulated version of the problem. Research has
also established relationship of the constraint satisfaction

problem with problems in other areas such as finite model
theory.

Constraint programming
Constraint programming is the use of constraints as a
programming language to encode and solve problems. This is
often done by embedding constraints into a programming
language, which is called the host language. Constraint
programming originated from a formalization of equalities of
terms in Prolog II, leading to a general framework for
embedding constraints into a logic programming language. The
most common host languages are Prolog, C++, and Java, but
other languages have been used as well.
Constraint logic programming
A constraint logic program is a logic program that contains
constraints in the bodies of clauses. As an example, the clause
A(X):-X>0,B(X) is a clause containing the constraint X>0 in the
body. Constraints can also be present in the goal. The constraints
in the goal and in the clauses used to prove the goal are
accumulated into a set called constraint store. This set contains
the constraints the interpreter has assumed satisfiable in order to
proceed in the evaluation. As a result, if this set is detected
unsatisfiable, the interpreter backtracks. Equations of terms, as
used in logic programming, are considered a particular form of
constraints which can be simplified using unification. As a


result, the constraint store can be considered an extension of the
concept of substitution that is used in regular logic
programming. The most common kinds of constraints used in
constraint logic programming are constraints over
integers/rational/real numbers and constraints over finite
domains.
Concurrent constraint logic programming languages have also
been developed. They significantly differ from non-concurrent
constraint logic programming in that they are aimed at
programming concurrent processes that may not terminate.
Constraint handling rules can be seen as a form of concurrent
constraint logic programming, but are also sometimes used
within a non-concurrent constraint logic programming language.
They allow for rewriting constraints or to infer new ones based
on the truth of conditions.
Constraint satisfaction toolkits
Constraint satisfaction toolkits are software libraries for
imperative programming languages that are used to encode and
solve a constraint satisfaction problem.

- Cassowary constraint solver, an open source project for constraint satisfaction (accessible from C, Java, Python and other languages).
- Comet, a commercial programming language and toolkit.
- Gecode, an open source portable toolkit written in C++, developed as a production-quality and highly efficient implementation of a complete theoretical background.
- JaCoP (solver), an open source Java constraint solver.
- Koalog, a commercial Java-based constraint solver.
- logilab-constraint, an open source constraint solver written in pure Python with constraint propagation algorithms.
- MINION, an open-source constraint solver written in C++, with a small language for the purpose of specifying models/problems.
- ZDC, an open source program developed in the Computer-Aided Constraint Satisfaction Project for modeling and solving constraint satisfaction problems.

Other constraint programming languages


Constraint toolkits are a way for embedding constraints into an
imperative programming language. However, they are only used
as external libraries for encoding and solving problems. An
approach in which constraints are integrated into an imperative
programming language is taken in the Kaleidoscope
programming language.
Constraints have also been embedded into functional
programming languages.

Evaluation function: An evaluation function, also known
as a heuristic evaluation function or static evaluation
function, is a function used by game-playing programs to
estimate the value or goodness of a position in the minimax and
related algorithms. The evaluation function is typically designed
to prioritize speed over accuracy; the function looks only at
the current position and does not explore possible moves.
In chess
One popular strategy for constructing evaluation functions is as
a weighted sum of various factors that are thought to influence
the value of a position. For instance, an evaluation function for
chess might take the form
c1 * material + c2 * mobility + c3 * king safety + c4 * center
control +...
Chess beginners, as well as the simplest of chess programs,
evaluate the position taking only "material" into account, i.e.
they assign a numerical score for each piece (with pieces of
opposite color having scores of opposite sign) and sum up the
score over all the pieces on the board. On the whole, computer
evaluation functions of even advanced programs tend to be more
materialistic than human evaluations. This is compensated for
by the increased speed of evaluation, which allows more plies to
be examined. As a result, some chess programs may rely too
much on tactics at the expense of strategy.
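
A material-only evaluation of the kind beginners (and the simplest programs) use can be sketched directly. The piece weights below are the conventional 1/3/3/5/9 values, and the board encoding (a flat list of piece letters, uppercase for White) is a hypothetical simplification.

# Conventional material weights; the king scores 0 since it cannot be captured
WEIGHTS = {'P': 1, 'N': 3, 'B': 3, 'R': 5, 'Q': 9, 'K': 0}

def material_eval(board):
    """c1 * material only: sum of piece values, White minus Black.
    `board` is a hypothetical flat list of piece letters, with
    uppercase for White, lowercase for Black, '.' for empty squares."""
    score = 0
    for square in board:
        if square == '.':
            continue
        value = WEIGHTS[square.upper()]
        score += value if square.isupper() else -value
    return score

# White: K, R, 3 pawns; Black: k, q, 2 pawns -> (5+3) - (9+2) = -3
board = list('K R PPP . . k q pp'.replace(' ', ''))
print(material_eval(board))   # -3: Black is ahead on material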

Game tree: A game tree is a directed graph whose nodes are
positions in a game and whose edges are moves. The complete
game tree for a game is the game tree starting at the initial
position and containing all possible moves from each position.

[Figure omitted: the first two ply of the game tree for tic-tac-toe.]
The diagram shows the first two levels, or ply, in the game tree
for tic-tac-toe. We consider all the rotations and reflections of
positions as being equivalent, so the first player has three
choices of move: in the center, at the edge, or in the corner. The
second player has two choices for the reply if the first player
played in the center, otherwise five choices. And so on.
The number of leaf nodes in the complete game tree is the
number of possible different ways the game can be played. For
example, the game tree for tic-tac-toe has 26,830 leaf nodes.
Game trees are important in artificial intelligence because one
way to pick the best move in a game is to search the game tree
using the minimax algorithm or its variants. The game tree for
tic-tac-toe is easily searchable, but the complete game trees for
larger games like chess are much too large to search. Instead, a
chess-playing program searches a partial game tree: typically
as many ply from the current position as it can search in the time
available. Except for the case of "pathological" game trees [1]
(which seem to be quite rare in practice), increasing the search
depth (i.e., the number of ply searched) generally improves the
chance of picking the best move.
Two-person games can also be represented as and-or trees. For
the first player to win a game there must exist a winning move
for all moves of the second player. This is represented in the
and-or tree by using disjunction to represent the first player's
alternative moves and using conjunction to represent all of the
second player's moves.
Solving Game Trees

[Figure omitted: an arbitrary game tree that has been fully colored.]
With a complete game tree, it is possible to "solve" the game,
that is to say, to find a sequence of moves that either the first or
second player can follow that will guarantee either a win or tie.
The algorithm can be described recursively as follows.
1. Color the final ply of the game tree so that all wins for
player 1 are colored one way, all wins for player 2 are
colored another way, and all ties are colored a third
way.
2. Look at the next ply up. If there exists a node colored
opposite as the current player, color this node for that
player as well. If all immediately lower nodes are
colored for the same player, color this node for the
same player as well. Otherwise, color this node a tie.
3. Repeat for each ply, moving upwards, until all nodes
are colored. The color of the root node will determine
the nature of the game.
The diagram shows a game tree for an arbitrary game, colored
using the above algorithm.
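
The coloring procedure is just backward induction, and it can be sketched recursively; the labels 'P1', 'P2', and 'tie', and the tiny tree below, are hypothetical illustrations.

def solve(node, tree, leaf_color, to_move):
    """Color a game tree bottom-up: a node is a win for the player to
    move if some child is, a tie if a tie is reachable, otherwise a
    win for the opponent."""
    if node in leaf_color:                        # final ply: already colored
        return leaf_color[node]
    other = 'P2' if to_move == 'P1' else 'P1'
    results = [solve(child, tree, leaf_color, other)
               for child in tree[node]]
    if to_move in results:                        # the mover can force a win
        return to_move
    if 'tie' in results:                          # the best the mover can do
        return 'tie'
    return other                                  # every move loses

# Hypothetical tree with P1 to move at the root
tree = {'root': ['a', 'b'], 'a': ['a1', 'a2'], 'b': ['b1', 'b2']}
leaf_color = {'a1': 'tie', 'a2': 'tie', 'b1': 'P2', 'b2': 'P2'}
print(solve('root', tree, leaf_color, 'P1'))      # 'tie'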
It is usually possible to solve a game (in this technical sense of
"solve") using only a subset of the game tree, since in many
games a move need not be analyzed if there is another move that
is better for the same player (for example alpha-beta pruning can
be used in many deterministic games).
Any sub tree that can be used to solve the game is known as a
decision tree, and the sizes of decision trees of various shapes
are used as measures of game complexity.

Game of chance: A game of chance is a game whose


outcome is strongly influenced by some randomizing device,
and upon which contestants may or may not wager money or


anything of monetary value. Common devices used include dice,
spinning tops, playing cards, roulette wheels or numbered balls
drawn from a container.
Any game of chance that involves anything of monetary value is
gambling.
Gambling is known in nearly all human societies, even though
many have passed laws restricting it. Early people used the
knucklebones of sheep as dice. Some people develop a
psychological addiction to gambling, and will risk even food and
shelter to continue.
Some games of chance may also involve a certain degree of
skill. This is especially true where the player or players have
decisions to make based upon previous or incomplete
knowledge, such as poker and blackjack. In other games like
roulette and baccarat the player may only choose the amount of
bet and the thing he/she wants to bet on, the rest is up to chance,
therefore these games are still considered games of chance with
a small amount of skill required [1]. The distinction between
'chance' and 'skill' is relevant as in some countries chance games
are illegal or at least regulated, while skill games are not.


Unit-2
Knowledge Representation
Introduction to Knowledge Representation (KR)
We argue that the notion can best be understood in terms of
five distinct roles it plays, each crucial to the task at hand:
A knowledge representation (KR) is most fundamentally a
surrogate, a substitute for the thing itself, used to enable
an entity to determine consequences by thinking rather
than acting, i.e., by reasoning about the world rather than
taking action in it.
It is a set of ontological commitments, i.e., an answer to the
question: In what terms should I think about the world?
It is a fragmentary theory of intelligent reasoning,
expressed in terms of three components: (i) the
representation's fundamental conception of intelligent
reasoning; (ii) the set of inferences the representation
sanctions; and (iii) the set of inferences it recommends.
It is a medium for pragmatically efficient computation, i.e.,
the computational environment in which thinking is
accomplished. One contribution to this pragmatic
efficiency is supplied by the guidance a representation
provides for organizing information so as to facilitate
making the recommended inferences.
It is a medium of human expression, i.e., a language in
which we say things about the world.
Knowledge representation is needed for library classification
and for processing concepts in an information system. In the
field of artificial intelligence, problem solving can be simplified
by an appropriate choice of knowledge representation.
Representing the knowledge in one way may make the solution
simple, while an unfortunate choice of representation may make
the solution difficult or obscure; the analogy is to make
computations in Hindu-Arabic numerals or in Roman numerals;
long division is simpler in one and harder in the other. Likewise,
there is no representation that can serve all purposes or make
every problem equally approachable.
Properties for Knowledge Representation Systems
The following properties should be possessed by a knowledge
representation system.
Representational Adequacy
- the ability to represent the required knowledge;
Inferential Adequacy
- the ability to manipulate the knowledge represented to produce
new knowledge corresponding to that inferred from the original;
Inferential Efficiency
- the ability to direct the inferential mechanisms into the most
productive directions by storing appropriate guides;

Acquisition Efficiency
- the ability to acquire new knowledge using automatic methods
wherever possible rather than reliance on human intervention.

Predicate Logic: Propositional logic combines atoms. An
atom contains no propositional connectives and has no
structure (e.g. today_is_wet, john_likes_apples).
Predicates allow us to talk about objects:
Properties: is_wet(today)
Relations: likes(john, apples)
These are true or false.
In predicate logic each atom is a predicate,
e.g. first order logic, higher-order logic.
First Order Logic
More expressive logic than propositional
Used in this course (Lecture 6 on representation in
FOL)
Constants are objects: john, apples
Predicates are properties and relations:
likes(john, apples)
Functions transform objects: likes(john, fruit_of(apple_tree))
Variables represent any object: likes(X, apples)
Quantifiers qualify values of variables
True for all objects (Universal): ∀X. likes(X, apples)
Exists for at least one object (Existential): ∃X. likes(X, apples)


Example: FOL Sentence

Every rose has a thorn

For all X
if (X is a rose)
then there exists Y
(X has Y) and (Y is a thorn)
Higher Order Logic
More expressive than first order
Functions and predicates are also objects
Described by predicates: binary(addition)


Transformed by functions: differentiate(square)
Can quantify over both
E.g. define red functions as having zero at 17
Much harder to reason with

Forward Chaining: In forward chaining the rules are
examined one after the other in a certain order. The order might
be the sequence in which the rules were entered into the rule set
or some other sequence as specified by the user. As each rule is
examined, the expert system attempts to evaluate whether the
condition is true or false.
Rule evaluation: When the condition is true, the rule is fired
and the next rule is examined. When the condition is false, the
rule is not fired and the next rule is examined.
It is possible that a rule cannot be evaluated as true or
false. Perhaps the condition includes one or more variables with
unknown values. In that case the rule condition is unknown.
When a rule condition is unknown, the rule is not fired and the
next rule is examined.
The iterative reasoning process: The process of examining
one rule after the other continues until a complete pass has been
made through the entire rule set. More than one pass usually is
necessary in order to assign a value to the goal variable. Perhaps


the information needed to evaluate one rule is produced by
another rule that is examined subsequently. After the second rule
is fired, the first rule can be evaluated on the next pass.
The passes continue as long as it is possible to fire rules.
When no more rules can be fired, the reasoning process ceases.
Example of forward reasoning: Letters are used for the
conditions and actions to keep the illustration simple. In rule1,
for example, if condition A exists then action B is taken.
Condition A might be
THIS.YEAR.SALES>LAST.YEAR.SALES
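
The pass-based evaluation described above can be sketched as a loop over condition-action rules; the rule set and facts below are hypothetical, with each condition represented as a set of required facts.

def forward_chain(rules, facts):
    """Repeatedly pass over the rule set, firing every rule whose
    condition holds, until a pass fires nothing new."""
    fired = True
    while fired:                         # keep making passes
        fired = False
        for condition, conclusion in rules:
            if condition <= facts and conclusion not in facts:
                facts.add(conclusion)    # the rule fires
                fired = True
    return facts

# Hypothetical rules: IF A THEN B; IF B and C THEN D
rules = [({'A'}, 'B'), ({'B', 'C'}, 'D')]
print(forward_chain(rules, {'A', 'C'}))  # B is added, which then lets D fire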

Backward chaining: Backward chaining is an inference
method used in automated theorem provers, proof assistants and
other artificial intelligence applications. It is one of the two most
commonly used methods of reasoning with inference rules and
logical implications; the other is forward chaining. Backward
chaining is implemented in logic programming by SLD
resolution. Both rules are based on the modus ponens inference
rule.
Backward chaining starts with a list of goals (or a hypothesis)
and works backwards from the consequent to the antecedent to
see if there is data available that will support any of these
consequents. An inference engine using backward chaining


would search the inference rules until it finds one which has a
consequent (Then clause) that matches a desired goal. If the
antecedent (If clause) of that rule is not known to be true, then it
is added to the list of goals (in order for one's goal to be
confirmed one must also provide data that confirms this new
rule).
For example, suppose that the goal is to conclude the color of
my pet Fritz, given that he croaks and eats flies, and that the rule
base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow
This rule base would be searched and the third and fourth rules
would be selected, because their consequents (Then Fritz is
green, Then Fritz is yellow) match the goal (to determine Fritz's
color). It is not yet known that Fritz is a frog, so both the
antecedents (If Fritz is a frog, If Fritz is a canary) are added to
the goal list. The rule base is again searched and this time the
first two rules are selected, because their consequents (Then X
is a frog, Then X is a canary) match the new goals that were just
added to the list. The antecedent (If Fritz croaks and eats flies) is
known to be true and therefore it can be concluded that Fritz is a
frog, and not a canary. The goal of determining Fritz's color is
now achieved (Fritz is green if he is a frog, and yellow if he is a
canary, but he is a frog since he croaks and eats flies; therefore,


Fritz is green).
Note that the goals always match the affirmed versions of the
consequents of implications (and not the negated versions as in
modus tollens) and even then, their antecedents are then
considered as the new goals (and not the conclusions as in
affirming the consequent) which ultimately must match known
facts (usually defined as consequents whose antecedents are
always true); thus, the inference rule which is used is modus
ponens.
Because the list of goals determines which rules are selected and
used, this method is called goal-driven, in contrast to data-driven
forward-chaining inference. The backward chaining approach is
often employed by expert systems.
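
The Fritz example can be run directly with a small goal-driven interpreter. The rule encoding below (a set of antecedents paired with one consequent) is an illustrative simplification that ignores variables and unification.

def backward_chain(goal, rules, facts):
    """Is `goal` provable?  Find a rule whose consequent matches the
    goal and recursively try to prove every antecedent."""
    if goal in facts:
        return True
    for antecedents, consequent in rules:
        if consequent == goal and all(backward_chain(a, rules, facts)
                                      for a in antecedents):
            return True
    return False

rules = [({'croaks', 'eats flies'}, 'frog'),
         ({'chirps', 'sings'}, 'canary'),
         ({'frog'}, 'green'),
         ({'canary'}, 'yellow')]
facts = {'croaks', 'eats flies'}
print(backward_chain('green', rules, facts))   # True: Fritz is a frog
print(backward_chain('yellow', rules, facts))  # False: he neither chirps nor sings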

Conceptual Dependency formalism: Conceptual
dependency (CD) is a theory of natural language processing
which mainly deals with representation of semantics of a
language. The main motivations for the development of CD as a
knowledge representation technique are given below:
To construct computer programs that can understand
natural language.
To make inferences from the statements and also to identify
conditions in which two sentences can have similar
meaning.
To provide facilities for the system to take part in dialogues
and answer questions.
To provide a means of representation which is language
independent.
Knowledge is represented in CD by elements called
conceptual structures. The basis of CD representation is that
for two sentences which have identical meaning there must be
only one representation, and implicitly packed information
must be explicitly stated.
In order that knowledge is represented in CD form, certain
primitive actions have been developed. Table provides the
primitives CD actions.
Apart from the primitive CD actions, one has to make use of
the following six categories of types of objects.
1. PPs: (picture producers)
Only physical objects are picture producers.
2. ACTs: Actions are done by an actor to an object.
Table gives the major ACTs.
Table: Primitive CD actions

CD primitive action - Explanation
1. ATRANS - transfer of an abstract relationship (e.g. give)
2. PTRANS - transfer of the physical location of an object (e.g. go)
3. PROPEL - application of physical force to an object (e.g. throw)
4. MOVE - movement of a body part of an animal by the animal
5. GRASP - grasping of an object by an actor (e.g. hold)
6. INGEST - taking of an object by an animal to the inside of that animal (e.g. drink, eat)
7. EXPEL - expulsion of an object from inside the body by an animal to the world (e.g. spit)
8. MTRANS - transfer of mental information between animals or within an animal (e.g. tell)
9. MBUILD - construction of new information from old information (e.g. decide)
10. SPEAK - action of producing sound (e.g. say)

3. LOCs: Locations
Every action takes place at some locations and serves as
source and destination.
4. Ts: Times
An action can take place at a particular location at a given
specified time. The time can be represented on an absolute scale
or relative scale.
5. AAs: Action aiders
These serve as modifiers of actions; the action PROPEL has a
speed factor associated with it, which is an action aider.
6. PAs: Picture Aiders
These serve as aiders of picture producers. Every object that serves as
a PP needs certain characteristics by which it is defined.
PAs practically serve PPs by defining those characteristics.
There are certain rules by which the conceptual categories of
types of objects discussed can be combined.
CD models provide the following advantages for representing
knowledge.

The ACT primitives help in representing wide knowledge in a
succinct way. To illustrate this, consider the following verbs,
all of which correspond to transfer of mental information.
-see
-learn
-hear
-inform
-remember
In CD representation all these are represented using a single
ACT primitive, MTRANS. They are not represented
individually as given. Similarly, different verbs that indicate
various activities are clubbed under unique ACT primitives,
thereby reducing the number of inference rules.

The main goal of CD representation is to make explicit what
is implicit. That
is why every statement that is made has not only the actors
and objects but also time and location, source and
destination.

The following set of conceptual tenses makes the usage of CD
more precise.
O-Object case relationship
R-recipient case relationship
P-past
F-future
T-transition
Ts-start transition
Tf-finished transition
K-continuing
?-interrogative

/-negative
Nil-present
Delta-timeless
C-conditional

CD brought forward the notion of language independence,
because all ACTs are language-independent primitives.

Semantic Nets: The main idea behind semantic nets is that the
meaning of a concept comes from the ways in which it is
connected to other concepts. In a semantic net, information is
represented as a set of nodes connected to each other by a set of
labeled arcs, which represent relationship among the nodes. A
fragment of a typical semantic net is shown in fig.
[Figure omitted: a semantic network with arcs Person isa Mammal, Person has-part Nose, Pee-Wee-Reese instance Person, Pee-Wee-Reese team Brooklyn-Dodgers, and Brooklyn-Dodgers uniform-color Blue.]

Fig: A semantic network


This network contains an example of both the Isa and instance
relations, as well as some other, more domain-specific relations
like team and uniform-color. In this network, we could use
inheritance to derive the additional relation
Has-part (pee-wee-Reese, Nose)
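
A semantic net of this shape is easy to encode as labeled arcs, with inheritance implemented by walking the instance/isa chain; the node and relation names below mirror the figure, and the dictionary encoding is an illustrative assumption.

# Each node maps relation labels to target nodes
net = {
    'Person':        {'isa': 'Mammal', 'has-part': 'Nose'},
    'Pee-Wee-Reese': {'instance': 'Person',
                      'team': 'Brooklyn-Dodgers',
                      'uniform-color': 'Blue'},
}

def get_relation(node, relation):
    """Look for the relation on the node itself, then inherit it by
    following instance/isa arcs up the hierarchy."""
    while node is not None:
        props = net.get(node, {})
        if relation in props:
            return props[relation]
        node = props.get('instance') or props.get('isa')
    return None

# Derived by inheritance, as in the text: Has-part(Pee-Wee-Reese, Nose)
print(get_relation('Pee-Wee-Reese', 'has-part'))  # Nose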
1. Intersection Search. One of the early ways that semantic nets
were used was to find relationships among objects by spreading
activation out from each of two nodes and seeing where the
activation met. This process is called intersection search. Using this
process, it is possible to use the network of fig to answer
questions such as what is the connection between the Brooklyn
Dodgers and Blue?
2. Representing Non Binary predicates. Semantic nets are a
natural way to represent relationships that would appear as
ground instances of binary predicates in predicate logic. For
example, some of the arcs from fig could be represented in logic
as
Isa (person, mammal)
Instance (pee-Wee-Reese, Person)
Team (pee-wee-Reese, Brooklyn-Dodgers)
Uniform color (pee-wee- Reese, blue)
But the knowledge expressed by predicates of other arities can
also be expressed in semantic nets. We have already seen how to
handle many unary predicates using general relations such as
Isa and instance. So, for
example,
Man (Marcus)
Could be rewritten as
Instance (Marcus, Man)
Thereby making it easy to represent in a semantic net.
3. Partitioned semantic Nets. Suppose we want to represent
simple quantified expression in semantic nets. One way to do
this is to partition the semantic net into a hierarchical set of
spaces, each of which corresponds to the scope of one or more
variables. To see how this works, consider first the simple net
shown in fig. this net corresponds to the statement.
The dog bit the mail carrier.
The nodes Dogs, Bite, and Mail-Carrier represent the classes of
dogs, biting, and mail carriers, respectively, while the nodes d, b,
and m represent a particular dog, a particular biting, and a
particular mail carrier. This fact can easily be represented by
a single net with no partitioning.
But now suppose that we want to represent the fact
Every dog has bitten a mail carrier.
Or, in logic:
∀x: dog(x) → ∃y: Mail-carrier(y) ∧ bite(x, y)

To represent this fact, it is necessary to encode the scope of the


universally quantified variable x.
[Figure omitted: a partitioned semantic net in which d, b, and m are linked by isa arcs to Dogs, Bite, and Mail-Carrier, with b's assailant arc pointing to d and its victim arc pointing to m.]

Fig: Using partitioned semantic nets


Frame: A frame is a collection of attributes (usually called
slots) and associated values (and possibly constraints on values)
that describes some entity in the world. Sometimes a frame
describes an entity in some absolute sense; sometimes it
represents the entity from a particular point of view. A single
frame taken alone is rarely useful. Instead, we build frame
systems out of collection of frames that are connected to each
other.
Set theory provides a good basis for understanding frame
systems. Although not all frame systems are defined this way,
we do so here. In this view, each frame represents either a class
(a set) or an instance (an element of a class). To see how this


works, consider a frame system shown in fig. In this example,
the frames person, adult-male, ML-baseball-player, pitcher, and
ML-baseball team are all classes. The frames pee-wee-Reese
and Brooklyn-Dodgers are instances.
Person
  Isa: Mammal
  Cardinality: 6,000,000,000
  *handed: Right

Adult-Male
  Isa: Person
  Cardinality: 2,000,000,000
  *height: 5-10

ML-Baseball-Player
  Isa: Adult-Male
  Cardinality: 624
  *height: 6-1
  *bats: equal to handed
  *batting-average: .252
  *team:
  *uniform-color:

Fielder
  Isa: ML-Baseball-Player
  Cardinality: 376
  *batting-average: .262

Pee-Wee-Reese
  Instance: Fielder
  Height: 5-10
  Bats: right
  Batting-average: .309
  Team: Brooklyn-Dodgers
  Uniform-color: Blue

ML-Baseball-Team
  Isa: Team
  Cardinality: 26
  *team-size: 24
  *manager:

Brooklyn-Dodgers
  Instance: ML-Baseball-Team
  Team-size: 24
  Manager: Leo-Durocher
  Players: (Pee-Wee-Reese)

Fig: A simplified frame system
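
Frames map naturally onto nested dictionaries, with slot lookup climbing instance/isa links exactly as in the semantic-net sketch earlier. The asterisk convention for inheritable defaults is kept from the figure; the encoding itself is an illustrative assumption.

frames = {
    'Person':             {'isa': 'Mammal', '*handed': 'Right'},
    'Adult-Male':         {'isa': 'Person', '*height': '5-10'},
    'ML-Baseball-Player': {'isa': 'Adult-Male', '*batting-average': .252},
    'Fielder':            {'isa': 'ML-Baseball-Player', '*batting-average': .262},
    'Pee-Wee-Reese':      {'instance': 'Fielder', 'batting-average': .309},
}

def slot(frame, name):
    """Own value first; otherwise inherit the starred default from the
    closest class up the instance/isa chain."""
    while frame in frames:
        f = frames[frame]
        if name in f:
            return f[name]
        if '*' + name in f:
            return f['*' + name]
        frame = f.get('instance') or f.get('isa')
    return None

print(slot('Pee-Wee-Reese', 'batting-average'))  # .309 (own value wins)
print(slot('Pee-Wee-Reese', 'handed'))           # Right (default from Person)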

Wff: Not all strings can represent propositions of the predicate
logic. Those which produce a proposition when their symbols
are interpreted must follow the rules given below, and they are
called wffs (well-formed formulas) of the first order predicate
logic.
Rules for constructing Wffs
A predicate name followed by a list of variables such as P(x, y),
where P is a predicate name, and x and y are variables, is called
an atomic formula.
Wffs are constructed using the following rules:
1. True and False are wffs.
2. Each propositional constant (i.e. specific proposition), and
each propositional variable (i.e. a variable representing
propositions) are wffs.
3. Each atomic formula (i.e. a specific predicate with
variables) is a wff.
4. If A and B are wffs, then so are ¬A, (A ∧ B), (A ∨ B),
(A → B), and (A ↔ B).
5. If x is a variable (representing objects of the universe of
discourse), and A is a wff, then so are ∀x A and ∃x A.
For example, "The capital of Virginia is Richmond." is a
specific proposition. Hence it is a wff by Rule 2.
Let B be a predicate name representing "being blue" and let
x be a variable. Then B(x) is an atomic formula meaning "x
is blue". Thus it is a wff by Rule 3 above. By applying
Rule 5 to B(x), ∀x B(x) is a wff and so is ∃x B(x). Then by
applying Rule 4 to them, ∀x B(x) → ∃x B(x) is seen to be a
wff. Similarly, if R is a predicate name representing "being
round", then R(x) is an atomic formula. Hence it is a wff.
By applying Rule 4 to B(x) and R(x), a wff B(x) ∧ R(x) is
obtained.
In this manner, larger and more complex wffs can be
constructed following the rules given above.
Note, however, that strings that can not be constructed by
using those rules are not wffs. For example, ∀x B(x)R(x)
and B(∀x) are NOT wffs, NOR are B(R(x)) and
B(∃x R(x)).
One way to check whether or not an expression is a wff
is to try to state it in English. If you can translate it into a
correct English sentence, then it is a wff.
More examples: To express the fact that Tom is taller than
John, we can use the atomic formula taller(Tom, John),
which is a wff. This wff can also be part of some
compound statement such as taller(Tom, John) ∧
¬taller(John, Tom), which is also a wff.
If x is a variable representing people in the world, then
taller(x, Tom), ∀x taller(x, Tom), ∃x taller(x, Tom), and
∀x ∃y taller(x, y) are all wffs, among others.
However, taller(∀x, John) and taller(Tom Mary, Jim), for
example, are NOT wffs.
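
Rules 1-5 translate directly into a recursive well-formedness check.
The sketch below is my own illustration (not from the text); formulas
are encoded as nested tuples, with ("pred", name, vars...) for atomic
formulas, "not"/"and"/"or"/"implies"/"iff" for Rule 4, and
"forall"/"exists" for Rule 5.

def is_wff(f):
    if f in ("True", "False"):                  # Rule 1
        return True
    if isinstance(f, str):                      # Rule 2: propositional symbol
        return True
    if not isinstance(f, tuple) or not f:
        return False
    op = f[0]
    if op == "pred":                            # Rule 3: atomic formula
        return len(f) >= 2 and all(isinstance(t, str) for t in f[1:])
    if op == "not":                             # Rule 4, negation
        return len(f) == 2 and is_wff(f[1])
    if op in ("and", "or", "implies", "iff"):   # Rule 4, binary connectives
        return len(f) == 3 and is_wff(f[1]) and is_wff(f[2])
    if op in ("forall", "exists"):              # Rule 5, quantifiers
        return len(f) == 3 and isinstance(f[1], str) and is_wff(f[2])
    return False

# forall x (B(x) -> exists x B(x)) is a wff:
print(is_wff(("forall", "x",
              ("implies", ("pred", "B", "x"),
                          ("exists", "x", ("pred", "B", "x"))))))  # True
# B(forall x) cannot even be encoded as an atomic formula here, which
# mirrors the non-wff examples above.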

Unit-3
Handling Uncertainty and learning
Fuzzy Logic: In the techniques we have discussed so far, we
have not modified the mathematical underpinnings provided by
set theory and logic. We have instead augmented those ideas
with additional constructs provided by probability theory. Here
we take a different approach and briefly consider what happens
if we make fundamental changes to our idea of set membership
and corresponding changes to our definitions of logical
operations.
The motivation for fuzzy sets is provided by the need
to represent such propositions as:
John is very tall.
Mary is slightly ill.
Sue and Linda are close friends.
Exceptions to the rule are nearly impossible.
Most Frenchmen are not very tall.
While traditional set theory defines set membership as a
Boolean predicate, fuzzy set theory allows us to represent set
membership as a possibility distribution.
Once set membership has been redefined in this way, it is
possible to define a reasoning system based on techniques for
combining distributions. Such reasoners have been applied in
control systems for devices as diverse as trains and washing
machines.
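
As a small illustration of membership as a matter of degree, the
sketch below (my own; the 150 cm and 190 cm breakpoints are invented)
defines a piecewise-linear membership function for "tall" together
with the usual min/max fuzzy connectives:

def tall(height_cm):
    # Degree in [0, 1] to which a person of this height is "tall".
    if height_cm <= 150:
        return 0.0
    if height_cm >= 190:
        return 1.0
    return (height_cm - 150) / 40.0  # linear ramp between the breakpoints

def fuzzy_and(a, b): return min(a, b)
def fuzzy_or(a, b):  return max(a, b)
def fuzzy_not(a):    return 1.0 - a

print(tall(170))                                   # 0.5: somewhat tall
print(fuzzy_and(tall(185), fuzzy_not(tall(160))))  # 0.75

A hedge such as "very tall" is then handled by modifying the
distribution; one common convention squares the membership degree.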

Dempster Shafer Theory: This theory was developed by
Dempster (1968) and Shafer (1976). This approach considers
sets of propositions and assigns to each of them an interval
[Belief, Plausibility]
in which the degree of belief must lie. Belief (usually
denoted Bel) measures the strength of the evidence in favor of a
set of propositions. It ranges from 0 (indicating no evidence) to
1 (denoting certainty).
A belief function, Bel, corresponding to a specific m for
the set A, is defined as the sum of beliefs committed to every
subset of A by m. That is, Bel(A) is a measure of the total
support or belief committed to the set A and sets a minimum
value for its likelihood. It is defined in terms of all belief
assigned to A as well as to all proper subsets of A. Thus,
Bel(A) = Σ m(B), summed over all subsets B ⊆ A
For example, if U contains the mutually exclusive subsets A, B,
C, and D, then
Bel({A, C, D}) = m({A, C, D}) + m({A, C}) + m({A, D})
+ m({C, D}) + m({A}) + m({C}) + m({D}).
In Dempster-Shafer theory, a belief interval can also be defined
for a subset A. It is represented as the subinterval [Bel(A),
Pl(A)] of [0, 1]. Bel(A) is also called the support of A, and
Pl(A) = 1 − Bel(¬A) the plausibility of A.
We define Bel(∅) = 0 to signify that no belief should be
assigned to the empty set and Bel(U) = 1 to show that the truth
is contained within U. The subsets A of U are called the focal
elements of the support function Bel when m(A) > 0.
Since Bel(A) only partially describes the beliefs about
proposition A, it is useful to also have a measure of the extent
to which one disbelieves A, that is, the doubts regarding A. For
this, we define the doubt of A as D(A) = Bel(¬A). From this
definition it will be seen that the upper bound of the belief
interval noted above, Pl(A), can be expressed as Pl(A) =
1 − D(A) = 1 − Bel(¬A). Pl(A) represents an upper belief limit
on the proposition A. The belief interval, [Bel(A), Pl(A)], is
also sometimes
referred to as the confidence in A, while the quantity
Pl(A) − Bel(A) is referred to as the uncertainty in A. It can be
shown that
Pl(∅) = 0, Pl(U) = 1
For all A,
Pl(A) ≥ Bel(A)
Bel(A) + Bel(¬A) ≤ 1
Pl(A) + Pl(¬A) ≥ 1
and for A ⊆ B,
Bel(A) ≤ Bel(B), Pl(A) ≤ Pl(B)
As an example of the above concepts, recall once again the
problem of identifying which of the terrorist organizations A, B,
C, and D could have been responsible for the attack. The
possible subsets of U in this case form a lattice of sixteen
subsets (fig).
[Figure: the lattice of subsets of U = {A, B, C, D}, from
{A, B, C, D} at the top, through the four 3-element subsets
{A, B, C}, {A, B, D}, {A, C, D}, {B, C, D} and the six
2-element subsets {A, B}, {A, C}, {A, D}, {B, C}, {B, D},
{C, D}, down to the singletons {A}, {B}, {C}, {D} and the
empty set ∅.]
Fig: Lattice of subsets of the universe U.

Assume one piece of evidence supports the belief that groups A
and C were responsible to a degree of m1({A, C}) = 0.6, and
another source of evidence disproves the belief that C was
involved (and therefore supports the belief that the three
organizations A, B, and D were responsible); that is,
m2({A, B, D}) = 0.7. To obtain the pooled evidence, we
compute the following quantities:
m1 ⊕ m2({A}) = (0.6)(0.7) = 0.42
m1 ⊕ m2({A, C}) = (0.6)(0.3) = 0.18
m1 ⊕ m2({A, B, D}) = (0.4)(0.7) = 0.28
m1 ⊕ m2({U}) = (0.4)(0.3) = 0.12
m1 ⊕ m2 = 0 for all other subsets of U
Bel1({A, C}) = m({A, C}) + m({A}) + m({C})
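
The pooling above is Dempster's rule of combination. The sketch below
(my own, not from the text) reproduces the example with focal elements
represented as frozensets; here no conflict arises, so the
normalization step is a no-op:

U = frozenset("ABCD")
m1 = {frozenset("AC"): 0.6, U: 0.4}
m2 = {frozenset("ABD"): 0.7, U: 0.3}

def combine(m1, m2):
    combined, conflict = {}, 0.0
    for s1, v1 in m1.items():
        for s2, v2 in m2.items():
            inter = s1 & s2
            if inter:
                combined[inter] = combined.get(inter, 0.0) + v1 * v2
            else:
                conflict += v1 * v2          # mass falling on the empty set
    return {s: v / (1.0 - conflict) for s, v in combined.items()}

def bel(m, a):
    # Bel(A): total mass committed to A and all of its subsets.
    return sum(v for s, v in m.items() if s <= a)

m = combine(m1, m2)
print(m[frozenset("A")])         # 0.42
print(bel(m, frozenset("AC")))   # 0.42 + 0.18 = 0.60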

Bayes Theorem: An important goal for many problem-solving
systems is to collect evidence as the system goes along and to
modify its behavior on the basis of that evidence. To model
this behavior, we need a statistical theory of evidence. Bayesian
statistics is such a theory. The fundamental notion of Bayesian
statistics is that of conditional probability:
P(H/E)
Read this expression as the probability of hypothesis H given
that we have observed evidence E. To compute this, we need to
take into account the prior probability of H and the extent to
which E provides evidence of H. To do this, we need to define a
universe that contains an exhaustive, mutually exclusive set of
Hi's among which we are trying to discriminate. Then, let
P(Hi/E) = the probability that hypothesis Hi is true given
evidence E
P(E/Hi) = the probability that we will observe evidence E given
that hypothesis Hi is true
P(Hi) = the a priori probability that hypothesis Hi is true in the
absence of any specific evidence; these probabilities are called
prior probabilities, or priors
k = the number of possible hypotheses
Bayes theorem then states that
P(Hi/E) = P(E/Hi) · P(Hi) / Σ n=1..k P(E/Hn) · P(Hn)
Specifically, when we say P (A/B), we are describing
the conditional probability of A given that the only evidence we
have is B. If there is also other relevant evidence, then it too
must be considered. Suppose, for example, that we are solving a
medical diagnosis problem. Consider the following assertions:
S: patient has spots
M: patient has measles
F: patient has high fever
Without any additional evidence, the presence of spots serves
as evidence in favor of measles. It also serves as evidence of
fever since measles would cause fever. But, since spots and
fever are not independent events, we cannot just sum their
effects; instead, we need to represent explicitly the conditional
probability that arises from their conjunction. In general, given a
prior body of evidence e and some new observation E, we need
to compute
P(H/E, e) = P(H/E) · P(e/E, H) / P(e/E)
Unfortunately, in an arbitrarily complex world, the size of
the set of joint probabilities that we require in order to
compute this function grows as 2^n if there are n different
propositions being considered. This makes using Bayes
theorem intractable for several reasons:

The knowledge acquisition problem


is insurmountable; too many probabilities have to be
provided.

The space that would be required to


store all the probabilities is too large.

The time required to compute the


probabilities is too large.
Despite these problems, though, Bayesian statistics provides
an attractive basis for an uncertain reasoning system. As a
result, several mechanisms for exploiting its power while at
the same time making it tractable have been developed.
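
A minimal sketch of the theorem itself (my own; the priors and
likelihoods are invented for illustration) shows the normalization
over an exhaustive, mutually exclusive set of hypotheses:

priors      = {"measles": 0.1, "flu": 0.3, "healthy": 0.6}   # P(Hi)
likelihoods = {"measles": 0.9, "flu": 0.2, "healthy": 0.05}  # P(E/Hi)

def posterior(priors, likelihoods):
    # P(Hi/E) = P(E/Hi) P(Hi) / sum over n of P(E/Hn) P(Hn)
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    total = sum(joint.values())       # the normalizing denominator
    return {h: v / total for h, v in joint.items()}

print(posterior(priors, likelihoods))
# {'measles': 0.5, 'flu': 0.333..., 'healthy': 0.166...}

The intractability discussed above appears as soon as the evidence E
is itself a conjunction of many observations: the table of joint
probabilities needed then grows as 2^n.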

Learning: One of the most often heard criticisms of AI is that
machines cannot be called intelligent until they are able to learn
to do new things and to adapt to new situations, rather than
simply doing as they are told to do. There can be little question
that the ability to adapt to new surroundings and to solve new
problems is an important characteristic of intelligent entities. A
learning system must interpret its inputs in such a way that its
performance gradually improves.
Learning denotes changes in the system that are
adaptive in the sense that they enable the system to do the same
task, or tasks drawn from the same population, more efficiently
and more effectively the next time.
Learning covers a wide range of phenomena.
1. At one end of the spectrum is skill refinement. People get
better at many tasks simply by practicing. The more you ride a
bicycle or play tennis, the better you get.
2. At the other end of this spectrum lies knowledge acquisition.
Knowledge is generally acquired through experience.
3. Many AI programs are able to improve their performance
substantially through rote-learning techniques.
4. Another way we learn is through taking advice from others.
Advice taking is similar to rote learning, but high-level advice
may not be in a form simple enough for a program to use
directly in problem solving.
5. People also learn through their own problem-solving
experience. After solving a complex problem, we remember the
structure of the problem and the methods we used to solve it.
The next time we see the problem, we can solve it more
efficiently. Moreover, we can generalize from our experience to
solve related problems more easily. Correspondingly, a program
can remember its experiences and generalize from them. In
large problem spaces, however, efficiency gains are critical.
Learning can mean the difference between solving a problem
rapidly and not solving it at all. In addition, programs that learn
through problem-solving experience may be able to come up
with qualitatively better solutions in the future.
6. Another form of learning that involves stimuli from the
outside is learning from examples. Learning from examples
usually involves a teacher who helps us classify things by
correcting us when we are wrong. Sometimes, however, a
program can discover things without the aid of a teacher.
Learning is itself a problem-solving process.

Learning Model: Learning can be accomplished using a
number of different methods. For example, we can learn by
memorizing facts, by being told, or by studying examples like
problem solutions. Learning requires that new knowledge
structures be created from some form of input stimulus. This
new knowledge must then be assimilated into a knowledge
base and be tested in some way for its utility. Testing means
that the knowledge should be used in the performance of
some task from which meaningful feedback can be obtained,
where the feedback provides some measure of the accuracy
and usefulness of the newly acquired knowledge.
A learning model is depicted in fig, where the environment has
been included as part of the overall learner system. The
environment may be regarded either as a form of nature which
produces random stimuli or as a more organized training source,
such as a teacher, which provides carefully selected training
examples for the learner component.
[Figure: Learning model. The environment or teacher supplies
stimuli, examples, and feedback to the learner component; the
learner component updates the knowledge base; the performance
component uses the knowledge base to carry out tasks and
produce responses; and the critic (performance evaluator)
assesses the responses and feeds the evaluation back to the
learner.]
Fig. Learning Model

The actual form of environment used will depend on the
particular learning paradigm. In any case, some representation
language must be assumed for communication between the
environment and the learner. The language may be the same
representation scheme as that used in the knowledge base (such
as a form of predicate calculus). When they are chosen to be the
same we say the single representation trick is being used. This
usually results in a simpler implementation since it is not
necessary to transform between two or more different
representations.
Inputs to the learner component may be physical stimuli
of some type or descriptive, symbolic training examples. The
information conveyed to the learner component is used to create
and modify knowledge structures in the knowledge base.
When given a task, the performance component produces a
response describing actions in performing the task. The critic
module then evaluates this response relative to an optimal
response.
The cycle described above may be repeated a number of
times until the performance of the system has reached some
acceptable level, until a known learning goal has been reached,
or until changes cease to occur in the knowledge base after some
chosen number of training examples have been observed.

There are several important factors which influence a
system's ability to learn, in addition to the form of representation
used. They include the types of training provided, the form and
extent of any initial background knowledge, the type of
feedback provided, and the learning algorithms used (fig).
[Figure: factors affecting learning performance. The training
scenario, background knowledge, feedback, representation
scheme, and learning algorithms all feed into the resultant
learning.]
fig: Factors affecting learning performance


Finally, the learning algorithms themselves determine to a large
extent how successful a learning system will be. The algorithms
control the search to find and build the knowledge structures.
We then expect that algorithms that extract much of the
useful information from training examples and take advantage of
any background knowledge will outperform those that do not.

Supervised learning: Supervised learning is the machine
learning task of inferring a function from supervised training
data. The training data consist of a set of training examples. In
supervised learning, each example is a pair consisting of an
input object (typically a vector) and a desired output value (also
called the supervisory signal). A supervised learning algorithm
analyzes the training data and produces an inferred function,
which is called a classifier (if the output is discrete, see
classification) or a regression function (if the output is
continuous, see regression). The inferred function should predict
the correct output value for any valid input object. This requires
the learning algorithm to generalize from the training data to
unseen situations in a "reasonable" way (see inductive bias).
(Compare with unsupervised learning.) The parallel task in
human and animal psychology is often referred to as concept
learning.
Overview
In order to solve a given problem of supervised learning, one has
to perform the following steps:
1. Determine the type of training examples. Before doing
anything else, the engineer should decide what kind of data
is to be used as an example. For instance, this might be a
single handwritten character, an entire handwritten word, or
an entire line of handwriting.
2. Gather a training set. The training set needs to be
representative of the real-world use of the function. Thus, a
set of input objects is gathered and corresponding outputs
are also gathered, either from human experts or from
measurements.
3. Determine the input feature representation of the learned
function. The accuracy of the learned function depends
strongly on how the input object is represented. Typically,
the input object is transformed into a feature vector, which
contains a number of features that are descriptive of the
object. The number of features should not be too large,
because of the curse of dimensionality; but should contain
enough information to accurately predict the output.
4. Determine the structure of the learned function and
corresponding learning algorithm. For example, the
engineer may choose to use support vector machines or
decision trees.
5. Complete the design. Run the learning algorithm on the
gathered training set. Some supervised learning algorithms
require the user to determine certain control parameters.
These parameters may be adjusted by optimizing
performance on a subset (called a validation set) of the
training set, or via cross-validation.
6. Evaluate the accuracy of the learned function. After
parameter adjustment and learning, the performance of the
resulting function should be measured on a test set that is
separate from the training set.
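
The six steps can be made concrete with a toy classifier. The sketch
below (my own; the data values are invented) uses a 1-nearest-neighbor
rule as the learned function and measures accuracy on a held-out test
set:

train = [([1.0, 1.1], "A"), ([0.9, 1.0], "A"),
         ([3.0, 3.2], "B"), ([3.1, 2.9], "B")]  # (feature vector, label)
test  = [([1.2, 0.9], "A"), ([2.8, 3.1], "B")]

def predict(x, examples):
    # Return the label of the training example nearest to x.
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    return min(examples, key=lambda ex: dist2(x, ex[0]))[1]

accuracy = sum(predict(x, train) == y for x, y in test) / len(test)
print(accuracy)   # 1.0 on this toy test set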


Factors to consider
Factors to consider when choosing and applying a learning
algorithm include the following:
1. Heterogeneity of the data. If the feature vectors include
features of many different kinds (discrete, discrete ordered,
counts, continuous values), some algorithms are easier to
apply than others. Many algorithms, including Support
Vector Machines, linear regression, logistic regression,
neural networks, and nearest neighbor methods, require that
the input features be numerical and scaled to similar ranges
(e.g., to the [-1,1] interval). Methods that employ a distance
function, such as nearest neighbor methods and support
vector machines with Gaussian kernels, are particularly
sensitive to this. An advantage of decision trees is that they
easily handle heterogeneous data.
2. Redundancy in the data. If the input features contain
redundant information (e.g., highly correlated features),
some learning algorithms (e.g., linear regression, logistic
regression, and distance based methods) will perform
poorly because of numerical instabilities. These problems
can often be solved by imposing some form of
regularization.
3. Presence of interactions and non-linearities. If each of the
features makes an independent contribution to the output,
then algorithms based on linear functions (e.g., linear
regression, logistic regression, Support Vector Machines,
naive Bayes) and distance functions (e.g., nearest neighbor
methods, support vector machines with Gaussian kernels)
generally perform well. However, if there are complex
interactions among features, then algorithms such as
decision trees and neural networks work better, because
they are specifically designed to discover these interactions.
Linear methods can also be applied, but the engineer must
manually specify the interactions when using them.
How supervised learning algorithms work
Given a set of training examples of the form {(x1, y1), ...,
(xN, yN)}, a learning algorithm seeks a function g: X → Y,
where X is the input space and Y is the output space. The
function g is an element of some space of possible functions G,
usually called the hypothesis space. It is sometimes convenient
to represent g using a scoring function f: X × Y → R such that g
is defined as returning the y value that gives the highest score:
g(x) = arg max_y f(x, y). Let F denote the space of scoring
functions.
Although G and F can be any space of functions, many learning
algorithms are probabilistic models where g takes the form of a
conditional probability model g(x) = P(y | x), or f takes the form
of a joint probability model f(x,y) = P(x,y). For example, naive
Bayes and linear discriminant analysis are joint probability
models, whereas logistic regression is a conditional probability
model.
There are two basic approaches to choosing f or g: empirical risk
minimization and structural risk minimization [3]. Empirical risk
minimization seeks the function that best fits the training data.
Structural risk minimization includes a penalty function that
controls the bias/variance tradeoff.
In both cases, it is assumed that the training set consists of a
sample of independent and identically distributed pairs (xi, yi).
In order to measure how well a function fits the training data, a
loss function L: Y × Y → R≥0 is defined. For training example
(xi, yi), the loss of predicting the value ŷ is L(yi, ŷ).
The risk R(g) of function g is defined as the expected loss of g.
This can be estimated from the training data as
R_emp(g) = (1/N) Σi L(yi, g(xi)).
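
Empirical risk minimization can be illustrated in a few lines. The
sketch below (my own; the data and threshold candidates are invented)
searches a tiny hypothesis space of threshold classifiers for the one
with the lowest average zero-one loss on the sample:

train = [(0.5, 0), (1.0, 0), (1.8, 1), (2.5, 1)]     # (x, y) pairs

def g(theta):
    # Threshold classifier: predict 1 iff x >= theta.
    return lambda x: 1 if x >= theta else 0

def empirical_risk(h, data):
    # (1/N) * sum of zero-one losses L(y, h(x)).
    return sum(1 for x, y in data if h(x) != y) / len(data)

thetas = [0.0, 0.75, 1.4, 2.0, 3.0]                  # hypothesis space G
best = min(thetas, key=lambda t: empirical_risk(g(t), train))
print(best, empirical_risk(g(best), train))          # 1.4 achieves risk 0.0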
Generalizations of supervised learning
There are several ways in which the standard supervised
learning problem can be generalized:
Semi-supervised learning: In this setting, the desired output
values are provided only for a subset of the training data. The
remaining data is unlabeled.
Active learning: Instead of assuming that all of the training
examples are given at the start, active learning algorithms
interactively collect new examples, typically by making queries
to a human user. Often, the queries are based on unlabeled data,
which is a scenario that combines semi-supervised learning with
active learning.

Structured prediction: When the desired output value is a
complex object, such as a parse tree or a labeled graph, then
standard methods must be extended.
Learning to rank: When the input is a set of objects and the
desired output is a ranking of those objects, then again the
standard methods must be extended.

Unsupervised Learning: What if a neural network is given
no feedback for its outputs, not even a real-valued
reinforcement? Can the network learn anything useful? The
unintuitive answer is yes.
[Table: ten animals (dog, cat, bat, whale, canary, robin, ostrich,
snake, lizard, alligator), each described by binary features:
has-hair?, has-scales?, has-feathers?, flies?, lives-in-water?,
lays-eggs?]
Fig. Data for unsupervised learning


This form of learning is called unsupervised learning
because no teacher is required. Given a set of input data, the
network is allowed to play with it to try to discover regularities
and relationships between the different parts of the input.
Learning is often made possible through some notion of which
features in the input sets are important. But often we do not
know in advance which features are important, and asking a
learning system to deal with raw input data can be
computationally expensive. Unsupervised learning can be used
as a feature discovery module that precedes supervised
learning.
Consider the data in fig. The group of ten animals, each
described by its own set of features, breaks down naturally into
three groups: mammals, reptiles and birds. We would like to
build a network that can learn which group a particular animal
belongs to, and to generalize so that it can identify animals it has
not yet seen. We can easily accomplish this with a six-input,
three-output back propagation network. We simply present the
network with an input, observe its output, and update its weights
based on the errors it makes. Without a teacher, however, the
error cannot be computed, so we must seek other methods.
Our first problem is to ensure that only one of the three output
units become active for any given input. One solution to this
problem is to let the network settle, find the output unit with the
highest level of activation, and set that unit to 1 and all other
output units to 0. In other words, the output unit with the highest
activation is the only one we consider to be active. A more
neural-like solution is to have the output units fight among
themselves for control of an input vector.
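
That winner-take-all step is easy to state in code. A minimal sketch
(my own; the activation values are invented):

def winner_take_all(activations):
    # Set the most active output unit to 1 and all others to 0.
    winner = max(range(len(activations)), key=lambda i: activations[i])
    return [1 if i == winner else 0 for i in range(len(activations))]

print(winner_take_all([0.2, 0.7, 0.4]))   # [0, 1, 0]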

Learning by Induction: What is "Learning by Induction"?
Simply put, it is learning by watching. You watch what others
do, then you do that. Below is a more formal explanation of
inductive vs. deductive logic:
In logic, we often refer to the two broad methods of reasoning as
the deductive and inductive approaches.
Deductive reasoning works from the more general to the more
specific. Sometimes this is informally called a "top-down"
approach. We might begin with thinking up a theory about our
topic of interest. We then narrow that down into more specific
hypotheses that we can test. We narrow down even further when
we collect observations to address the hypotheses. This
ultimately leads us to be able to test the hypotheses with specific
data -- a confirmation (or not) of our original theories.
Inductive reasoning works the other way, moving from specific
observations to broader generalizations and theories. Informally,
we sometimes call this a "bottom up" approach. In inductive
reasoning, we begin with specific observations and measures,
begin to detect patterns and regularities, formulate some
tentative hypotheses that we can explore, and finally end up

developing some general conclusions or theories. (Thanks to
William M.K. Trochim for these definitions).
To translate this into an approach to learning a skill, deductive
learning is someone TELLING you what to do, while inductive
learning is someone SHOWING you what to do. Remember the
saying "a picture is worth a thousand words"? That means, in a
given amount of time, a person can be SHOWN a thousand
times more information than they could be TOLD in the same
amount of time. I can access a picture or pattern much more
quickly than the equivalent description of that picture or pattern
in words.
Athletes often practice "visualization" before they undertake an
action. But in order to visualize something, you need to have a
picture in your head to visualize. How do you get those pictures
in your head, by WATCHING. Who do you watch?
Professionals. This is the key. Pay attention here. When you
want to learn a skill:

WATCH PROFESSIONALS DO IT BEFORE YOU DO IT.


DO NOT DO IT YOURSELF FIRST.
Going out and doing a sport without having seen AND
STUDIED professionals doing that sport is THE NUMBER
ONE MISTAKE people make. They force themselves to play,
their brain says "what do we do now?", another part of the brain
looks for examples (pictures) of what to do, and, finding none,
says "just do anything". So they try to generate behavior to
accomplish something within the rules of the sport. If they
"keep score" and try to "win" and avoid "losing", the negative
impact is multiplied tenfold.
Yet this is EXACTLY what most people do and what most ARE
TOLD to do! "Interested in tennis? Grab a racquet, join a
league, get out there and have fun!" Then what happens? They
have no training, they try to do what it takes to "win", and to do
so, they manufacture awful strokes just TO BE ABLE to play
(remember, they joined a league, so they have to keep score and
win!). These awful strokes get ingrained by repetition, they
produce terrible results, and they are very difficult to unlearn, so
progress, despite lessons (mostly in the useless form of words),
is slow or non-existent. Then they quit.
When you finally pick up a racquet and go out to play, and your
brain says "what do we do now?", your head will be filled with
pictures of professionals perfectly doing what you are trying to
do. You will not know how to do it incorrectly, because you
have never seen it done incorrectly. You will try to do what
they do, and you will almost immediately proceed to an
advanced intermediate level. You will be a beginner for a short
period of time, if at all, and improvement will be a matter of
adding to and refining what you are doing, not stripping down
and unlearning bad patterns. And since you are not keeping
score, you focus purely on technique. If you hit one into the net,
just pull another ball out of your pocket and do it again. No big
deal, no drama, no guilt. Just hit another. When you feel you
can hit all of your shots somewhat professionally, maybe you
can actually play someone and keep score. You will love the
positive feedback of beating players who have been playing
much longer than you have. You will wonder how they could
have played for so long and still "play like that". Don't they
know it's done "this way?" What professional does it "that
way?" Don't they watch tennis on TV? Who does that? I just
started and I know that's wrong. All these thoughts will make
you feel like a genius.
So how does all of this relate to chess? Simply put, play
over the games of professional players and see how they play
before you play anybody. Try to imitate them instead of trying
to reinvent the wheel. Play over the games of lots of different
players and then decide which one or two you like. The ones
you like are the ones where you say after playing over one of
their games, "I would love to play a game like that!" Then just
concentrate on those one or two players. Study and play the
openings they play. Get books where they comment on their
own games. Maybe they will say what they were thinking
during the game. Try to play like them. During your games,
think "What would he do in this position?" Personally, I like
Murphy for his rapid development and attacks, Lachine for his
creativeness in all positions, and Spas sky for his ability to play
all types of positions and create attacks in calm positions.

Learning Decision Tree
Decision tree learning, used in data mining and machine
learning, uses a decision tree as a predictive model which maps
observations about an item to conclusions about the item's target
value. More descriptive names for such tree models are
classification trees or regression trees. In these tree structures,
leaves represent classifications and branches represent
conjunctions of features that lead to those classifications.
In decision analysis, a decision tree can be used to visually and
explicitly represent decisions and decision making. In data
mining, a decision tree describes data but not decisions; rather,
the resulting classification tree can be an input for decision
making.
General
Decision tree learning is a common method used in data mining.
The goal is to create a model that predicts the value of a target
variable based on several input variables. Each interior node
corresponds to one of the input variables; there are edges to
children for each of the possible values of that input variable.
Each leaf represents a value of the target variable given the
values of the input variables represented by the path from the
root to the leaf.
A tree can be "learned" by splitting the source set into subsets
based on an attribute value test. This process is repeated on each
derived subset in a recursive manner called recursive
partitioning. The recursion is completed when the subset at a
node all has the same value of the target variable, or when
splitting no longer adds value to the predictions.
Data comes in records of the form:
(x, Y) = (x1, x2, x3, ..., xk, Y)
The dependent variable, Y, is the target variable that we are
trying to understand, classify or generalize. The vector x is
composed of the input variables x1, x2, x3, etc. that are used for
that task.
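
Recursive partitioning, described above, can be sketched directly on
such records. The following is a minimal illustration of mine (not a
production learner): it splits on attributes in a fixed order until
each subset is pure, whereas real algorithms choose the split by a
purity measure such as information gain.

records = [({"hair": "yes", "eggs": "no"},  "mammal"),
           ({"hair": "yes", "eggs": "no"},  "mammal"),
           ({"hair": "no",  "eggs": "yes"}, "bird"),
           ({"hair": "no",  "eggs": "yes"}, "bird")]   # invented data

def grow(records, attrs):
    labels = {y for _, y in records}
    if len(labels) == 1 or not attrs:        # pure subset: make a leaf
        return labels.pop()
    a = attrs[0]                             # naive choice of split attribute
    tree = {"attr": a, "children": {}}
    for v in {x[a] for x, _ in records}:     # one edge per attribute value
        subset = [(x, y) for x, y in records if x[a] == v]
        tree["children"][v] = grow(subset, attrs[1:])
    return tree

print(grow(records, ["hair", "eggs"]))
# {'attr': 'hair', 'children': {'yes': 'mammal', 'no': 'bird'}}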
Types:
- Classification tree analysis is when the predicted outcome is
the class to which the data belongs.
- Regression tree analysis is when the predicted outcome can
be considered a real number (e.g. the price of a house, or a
patient's length of stay in a hospital).
- Classification And Regression Tree (CART) analysis is used
to refer to both of the above procedures, first introduced by
Breiman et al.
- CHi-squared Automatic Interaction Detector (CHAID)
performs multi-level splits when computing classification
trees. [2]
- A Random Forest classifier uses a number of decision trees,
in order to improve the classification rate.
- Boosted Trees can be used for regression-type and
classification-type problems.

Decision tree advantages
Decision trees have various advantages:
- Simple to understand and interpret. People are able to
understand decision tree models after a brief explanation.
- Requires little data preparation. Other techniques often
require data normalization; dummy variables need to be
created and blank values removed.
- Able to handle both numerical and categorical data.
Other techniques are usually specialized in analyzing
datasets that have only one type of variable. Ex: relation
rules can be used only with nominal variables while neural
networks can be used only with numerical variables.
- Uses a white box model. If a given situation is observable
in a model, the explanation for the condition is easily
explained by Boolean logic. An example of a black box
model is an artificial neural network, since the explanation
for the results is difficult to understand.
- Possible to validate a model using statistical tests. That
makes it possible to account for the reliability of the model.
- Robust. Performs well even if its assumptions are
somewhat violated by the true model from which the data
were generated.
- Performs well with large data in a short time. Large
amounts of data can be analyzed using personal computers
in a time short enough to enable stakeholders to take
decisions based on its analysis.
Page 79

Truth maintenance system: A truth maintenance system,
or TMS, is a knowledge representation method for representing
both beliefs and their dependencies. The name truth
maintenance is due to the ability of these systems to restore
consistency.
It is also termed a belief revision system: a truth maintenance
system maintains consistency between old believed knowledge
and currently believed knowledge in the knowledge base (KB)
through revision. If the currently believed statements contradict
the knowledge in the KB, then the KB is updated with the new
knowledge. It may happen that the same data will come into
existence again, so that the previous knowledge is required in
the KB; if it is no longer present, it must be re-derived. But if
the previous knowledge is kept with the KB, then no re-derivation
of the same knowledge is needed. Hence the TMS is used to
avoid such re-derivation; it keeps track of contradictory data
with the help of a dependency record. This record reflects the
retractions and additions, which keeps the inference engine (IE)
aware of its current belief set.
Each statement having at least one valid justification is made a
part of the current belief set. When a contradiction is found, the
statement(s) responsible for the contradiction are identified and
an appropriate set of them is retracted. This results in the
addition of new statements to the KB. This process is called
dependency-directed backtracking.
The TMS maintains the records in the form of a dependency
network. The nodes in the network are the entries in the KB
(premises, antecedents, inference rules, etc.). Each arc of the
network represents the inference steps from which the node was
derived.
Premise: A premise is a fundamental belief which is assumed to
be always true. Premises do not need justifications; they are the
base from which justifications for all other nodes will be stated.
There are two types of justification for each node:
1. Support List (SL)
2. Conditional Proof (CP)
Many kinds of truth maintenance systems exist. Two major
types are single-context and multi-context truth maintenance. In
single-context systems, consistency is maintained among all
facts in memory (database). Multi-context systems allow
consistency to be relevant to a subset of facts in memory (a
context) according to the history of logical inference. This is
achieved by tagging each fact or deduction with its logical
history. Multi-agent truth maintenance systems perform truth
maintenance across multiple memories, often located on
different machines. De Kleer's ATMS (1986) was utilized in
systems based upon KEE on the Lisp Machine. The first
multi-agent TMS was created by Mason and Johnson; it was a
multi-context system. Bridgeland and Huhns created the first
single-context multi-agent system.

Dependency directed Backtracking: Dependency-directed
backtracking is a problem-solving (q.v.) technique for efficiently
evading contradictions. It is invoked when the problem solver
discovers that its current state is inconsistent. The goal is, in a
single operation, to change the problem solver's current state to
one that contains neither the contradiction just uncovered
nor any contradiction encountered previously. This is achieved
by consulting records of the inferences the problem solver has
performed and records of previous contradictions, which
dependency-directed backtracking has constructed in response
to previous contradictions.
Contrast to backtracking: Dependency directed backtracking
was developed to avoid the deficiencies of chronological
backtracking. Consider the application of chronological
backtracking to the following task (see fig): First do one of A or
B, then one of C or D, and then one of E or F. Assume that each
step requires significant problem solving effort and that A and C
together or B and E together produce a contradiction that is only
uncovered after significant effort. Fig illustrates the sequence of
problem-solving states that chronological backtracking goes
through to find all solutions (6, 7, 11 and 14).
Backtracking to an Appropriate Choice: The first deficiency
of chronological backtracking is illustrated by the unnecessary
state 4. The contradiction discovered in state 3 depends on
choices A and C and not E. Therefore, replacing the choice E
with F and working on state 4 is futile, as this change does not
remove the contradiction. Unlike chronological backtracking,
which replaces the most recent choice, dependency directed
backtracking replaces a choice that caused the contradiction.
The discovery that state 3 is inconsistent causes immediate
backtracking to state 5. To be able to determine which choices
underlie the contradiction requires that the problem solver store
dependency records with every datum that it infers.
Avoiding Rediscovering Contradictions. The second deficiency
of chronological backtracking is illustrated by unnecessary
state 13. The contradiction discovered in state 10 depends on B
and E. As E is the most recent choice, chronological and
dependency-directed backtracking are indistinguishable, both
backtracking to state 11. However, as B and E are known to be
inconsistent with each other, there is no point in rediscovering
this contradiction by working in state 13.


Disadvantages:
1. Dependency-directed backtracking incurs a significant time
and space overhead, as it requires the maintenance of
dependency records and an additional no-good database. Thus
the effort required to maintain the dependencies may be more
than the problem-solving effort saved.
2. If the problem solver is logically complete and finishes all
work on a state before considering the next, the problem of
backtracking to an inappropriate choice cannot occur.
3. In such cases much of the advantage of dependency-directed
backtracking is irrelevant. However, most practical problem
solvers are neither logically complete nor finish all possible
work on one state before considering another.

Fuzzy function:
The membership function is the principal fuzzy function; it
assigns to each element its degree of membership in a fuzzy set,
and it is from the membership function that fuzzy set values are
developed. Fuzzy logic depends upon the membership function.


Unit-4
Natural Language processing and planning
Backward chaining: Backward chaining (or backward
reasoning) is an inference method used in automated theorem
provers, proof assistants and other artificial intelligence
applications. It is one of the two most commonly used methods
of reasoning with inference rules and logical implications; the
other is forward chaining. Backward chaining is implemented in
logic programming by SLD resolution. Both methods are based
on the modus ponens inference rule.
Backward chaining starts with a list of goals (or a hypothesis)
and works backwards from the consequent to the antecedent to
see if there is data available that will support any of these
consequents. An inference engine using backward chaining
would search the inference rules until it finds one which has a
consequent (Then clause) that matches a desired goal. If the
antecedent (If clause) of that rule is not known to be true, then it
is added to the list of goals (in order for one's goal to be
confirmed one must also provide data that confirms this new
rule).
For example, suppose that the goal is to conclude the color of
my pet Fritz, given that he croaks and eats flies, and that the rule
base contains the following four rules:
1. If X croaks and eats flies Then X is a frog
2. If X chirps and sings Then X is a canary
3. If X is a frog Then X is green
4. If X is a canary Then X is yellow
This rule base would be searched and the third and fourth rules
would be selected, because their consequents (Then Fritz is
green, Then Fritz is yellow) match the goal (to determine Fritz's
color). It is not yet known that Fritz is a frog, so both the
antecedents (If Fritz is a frog, If Fritz is a canary) are added to
the goal list. The rule base is again searched and this time the
first two rules are selected, because their consequents (Then X
is a frog, Then X is a canary) match the new goals that were just
added to the list. The antecedent (If Fritz croaks and eats flies) is
known to be true and therefore it can be concluded that Fritz is a
frog, and not a canary. The goal of determining Fritz's color is
now achieved (Fritz is green if he is a frog, and yellow if he is a
canary, but he is a frog since he croaks and eats flies; therefore,
Fritz is green).
Note that the goals always match the affirmed versions of the
consequents of implications (and not the negated versions as in
modus tollens) and even then, their antecedents are then
considered as the new goals (and not the conclusions as in
affirming the consequent) which ultimately must match known
facts (usually defined as consequents whose antecedents are
always true); thus, the inference rule which is used is modus
ponens.
Because the list of goals determines which rules are selected and
used, this method is called goal-driven, in contrast to data-driven
Page 87

forward-chaining inference. The backward chaining approach is


often employed by systems. Programming languages such as
Prolog, Knowledge Machine and Eclipse support backward
chaining within their inference engines.
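
The Fritz example can be run directly with a tiny goal-driven
interpreter. The sketch below is my own illustration of the method,
not a full inference engine (it handles ground facts only, with the
variable X already bound to Fritz):

rules = [(["croaks", "eats flies"], "is a frog"),
         (["chirps", "sings"],      "is a canary"),
         (["is a frog"],            "is green"),
         (["is a canary"],          "is yellow")]
facts = {"croaks", "eats flies"}

def prove(goal):
    # A goal succeeds if it is a known fact, or if some rule concludes
    # it and all of that rule's antecedents can themselves be proved.
    if goal in facts:
        return True
    return any(all(prove(a) for a in antecedents)
               for antecedents, consequent in rules
               if consequent == goal)

print(prove("is green"))    # True: frog, via croaks and eats flies
print(prove("is yellow"))   # False: Fritz cannot be shown to be a canary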

Parsing: A parser is an algorithm for inferring the structure
of its input, guided by a grammar that dictates what structures
are possible or probable. In an ordinary parser, the input is a
string, and the grammar ranges over strings. Here we explore
generalizations of ordinary parsing algorithms that allow the
input to consist of string tuples and/or the grammar to range
over string tuples. Such inference algorithms can perform
various kinds of analysis on parallel texts, also known as
multitexts.
Figure 1 shows some of the ways in which ordinary parsing
can be generalized. A synchronous parser is an algorithm that
can infer the syntactic structure of each component text in a
multitext and simultaneously infer the correspondence relation
between these structures. When a parser's input can have fewer
dimensions than the parser's grammar, we call it a translator.
When a parser's grammar can have fewer dimensions than the
parser's input, we call it a synchronizer. The corresponding
processes are called translation and synchronization.
To our knowledge, synchronization has never been explored
as a class of algorithms. Neither has the relationship between
parsing and word alignment. The relationship between
translation and ordinary parsing was noted a long time ago, but
here we articulate it in more detail: ordinary parsing is a special
case of synchronous parsing, which is a special case of
translation. This paper offers an informal guided tour of the
generalized parsing algorithms in Figure 1. It culminates with a
recipe for using these algorithms to train and apply a
syntax-aware statistical machine translation (SMT) system.

Machine translation: Here is an architecture for SMT that
revolves around multitrees. Figure 2 shows how to build and use
a rudimentary machine translation system, starting from some
multitext and one or more monolingual treebanks.
The recipe follows:
T1. Induce a word-to-word translation model.
T2. Induce PCFGs from the relative frequencies of productions
in the monolingual treebanks.
T3. Synchronize some multitext.
T4. Induce an initial PMTG from the relative frequencies of
productions in the multitreebank.
T5. Re-estimate the PMTG parameters, using a synchronous
parser with the expectation semiring.
A1. Use the PMTG to infer the most probable multitree
covering new input text.
A2. Linearize the output dimensions of the multitree.
Steps T2, T4 and A2 are trivial. Steps T1, T3, T5, and A1 are
instances of the generalized parsers.
Figure 2 is only an architecture. Computational complexity and
generalization error stand in the way of its practical
implementation. Nevertheless, it is satisfying to note that all the
non-trivial algorithms in Figure 2 are special cases of
Translator CT. It is therefore possible to implement an MTSMT
system using just one inference algorithm, parameterized by a
grammar, a semiring, and a search strategy. An advantage of
building an MT system in this manner is that improvements
invented for ordinary parsing algorithms can often be applied to
all the main components of the system. For example, Melamed
(2003) showed how to reduce the computational complexity of a
synchronous parser just by changing the logic. The same
optimization can be applied to the other inference algorithms.
With proper software design, such optimizations need never be
implemented more than once. For simplicity, the algorithms here
are based on CKY logic. However, the architecture in Figure 2
can also be implemented using generalizations of more
sophisticated parsing logics, such as those inherent in Earley or
Head-Driven parsers.

Benefits of Machine Translation: There are three research
benefits of using generalized parsers to build MT systems.
1. We can take advantage of past and future research on
making parsers more accurate and more efficient. Therefore,
2. We can concentrate our efforts on better models, without
worrying about MT-specific search algorithms.
3. More generally and most importantly, this approach
encourages MT research to be less specialized and more
transparently related to the rest of computational
linguistics.

Block world: The techniques we are about to discuss can be
applied in a wide variety of task domains, and they have been.
But to make it easy to compare the variety of methods we
consider, we find it useful to look at all of them in a single
domain that is complex enough that the need for each of the
mechanisms is apparent, yet simple enough that easy-to-follow
examples can be found. The blocks world is such a domain.
There is a flat surface on which blocks can be placed. There are
a number of square blocks, all the same size. They can be
stacked one upon another. There is a robot arm that can
manipulate the blocks. The actions it can perform include:
UNSTACK (A, B) - pick up block A from its current
position on block B. The arm must be empty and block A
must have no blocks on top of it.
STACK (A, B) - place block A on block B. The arm must
already be holding A and the surface of B must be clear.
PICKUP (A) - pick up block A from the table and hold it.
The arm must be empty and there must be nothing on top of
block A.
PUTDOWN (A) - put block A down on the table. The arm
must have been holding block A.
Notice that in the world we have described, the robot arm can
hold only one block at a time. Also, since all blocks are the same
size, each block can have at most one other block directly on top
of it.
In order to specify both the conditions under which an operation
may be performed and the results of performing it, we need to
use the following predicates:
ON (A, B) - block A is on block B.
ONTABLE (A) - block A is on the table.
CLEAR (A) - there is nothing on top of block A.
HOLDING (A) - the arm is holding block A.
ARMEMPTY - the arm is holding nothing.
Various logical statements are true in this blocks world. For
example,
[∃x: HOLDING (x)] → ¬ARMEMPTY
∀x: ONTABLE (x) → ¬∃y: ON (x, y)
∀x: [¬∃y: ON (y, x)] → CLEAR (x)
The first of these statements says simply that if the arm is
holding anything, then it is not empty. The second says that if a
block is on the table, then it is not also on another block. The
third says that any block with no blocks on it is clear.
Components of planning system: The components of planning
systems are as follows:
- Choose the best rule to apply next based on the best available
information.
- Apply the chosen rule to compute the new problem state that
arises from its application.
- Detect when a solution has been found.
- Detect dead ends so that they can be abandoned and the
system's effort directed in more fruitful directions.
- Detect when an almost correct solution has been found and
employ special techniques to make it totally correct.
Choosing rules to apply. The most widely used technique for
selecting appropriate rules to apply is first to isolate a set of
differences between the desired goal state and the current state
and then to identify those rules that are relevant to reducing
those differences. If several rules are found, a variety of other
heuristic information can be exploited to choose among them.
Applying Rules. Each rule simply specified the problem state
that would result from its application. Now, however, we must
be able to deal with rules that specify only a small part of the
complete problem state.
One way is to describe, for each action, each of the changes it
makes to the state description. In addition, some statement that
everything else remains unchanged is also necessary. Fig 1
shows how a state, called S0, of a simple blocks world problem
(block A on block B, which is on the table) could be represented:
ON (A, B, S0) ^ ONTABLE (B, S0) ^ CLEAR (A, S0)
Fig 1: a simple blocks world description

If we start with the situation shown in fig, we would describe it
as
ON (A, B) ^ ONTABLE (B) ^ CLEAR (A)
After applying the operator UNSTACK (A, B), our description
of the world would be
ONTABLE (B) ^ CLEAR (A) ^ CLEAR (B) ^ HOLDING (A)
STRIPS-style operators that correspond to the blocks world
operations we have discussed are shown in fig 2. Notice that for
simple rules such as these the PRECONDITION list is often
identical to the DELETE list. In order to pick up a block, the
robot arm must be empty; as soon as it picks up a block, it is no
longer empty. But preconditions are not always deleted. For
example, in order for the arm to pick up a block, the block must
have no other blocks on top of it; after it is picked up, it still has
no blocks on top of it. This is the reason that the
PRECONDITION and DELETE lists must be specified
separately.
STACK (X, Y)
P: CLEAR(Y) ^ HOLDING(X)
D: CLEAR(Y) ^ HOLDING(X)
A: ARMEMPTY ^ ON(X, Y)

UNSTACK (X, Y)
P: ON(X, Y) ^ CLEAR(X) ^ ARMEMPTY
D: ON(X, Y) ^ ARMEMPTY
A: HOLDING(X) ^ CLEAR(Y)

PICKUP (X)
P: CLEAR(X) ^ ONTABLE(X) ^ ARMEMPTY
D: ONTABLE(X) ^ ARMEMPTY
A: HOLDING(X)

PUTDOWN (X)
P: HOLDING(X)
D: HOLDING(X)
A: ONTABLE(X) ^ ARMEMPTY

Fig 2: STRIPS-style operators for the blocks world
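
Applying such an operator is mechanical: check the preconditions,
remove the DELETE list, add the ADD list. A minimal sketch of mine
(predicates represented as plain strings, with only UNSTACK(A, B)
instantiated):

def strips_op(p, d, a):
    return {"P": set(p), "D": set(d), "A": set(a)}

UNSTACK_A_B = strips_op(p=["ON(A,B)", "CLEAR(A)", "ARMEMPTY"],
                        d=["ON(A,B)", "ARMEMPTY"],
                        a=["HOLDING(A)", "CLEAR(B)"])

def apply_op(state, op):
    if not op["P"] <= state:          # every precondition must hold
        raise ValueError("operator not applicable in this state")
    return (state - op["D"]) | op["A"]

s0 = {"ON(A,B)", "ONTABLE(B)", "CLEAR(A)", "ARMEMPTY"}
print(apply_op(s0, UNSTACK_A_B))
# {'ONTABLE(B)', 'CLEAR(A)', 'CLEAR(B)', 'HOLDING(A)'}, as derived above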


Detecting a solution. A planning system has succeeded in
finding a solution to a problem when it has found a sequence of
operators that transforms the initial problem state into the goal
state.
Detecting Dead Ends. As a planning system is searching for a
sequence of operators to solve a particular problem, it must be
able to detect when it is exploring a path that can never lead to a
solution. The same reasoning mechanisms that can be used to
detect a solution can often be used for detecting a dead end.
Goal stack planning: One technique that was developed for
solving compound goals that may interact is the use of a goal
stack. This was the approach used by STRIPS. In this method,
the problem solver makes use of a single stack that contains both
goals and operators that have been proposed to satisfy those
goals. The problem solver also relies on a database that
describes the current situation and a set of operators described in
PRECONDITION, ADD, and DELETE lists. To see how this
method works, let us carry it through for the simple example
shown in fig.
Start: ON (B, A) ^ ONTABLE (A) ^ ONTABLE (C) ^
ONTABLE (D) ^ ARMEMPTY
Goal: ON (C, A) ^ ON (B, D) ^ ONTABLE (A) ^ ONTABLE (D)
Fig: a very simple blocks world problem

When we begin solving this problem, the goal stack is simply
ON (C, A) ^ ON (B, D) ^ ONTABLE (A) ^ ONTABLE (D)
But we want to separate this problem into four subproblems,
one for each component of the original goal. Two of the
subproblems, ONTABLE (A) and ONTABLE (D), are already true in the
initial state. Depending on the order in which we want to tackle
the subproblems, there are two goal stacks that could be
created as our first step, where each line represents one goal on
the stack and OTAD is an abbreviation for ONTABLE (A) ^
ONTABLE (D):

[1]                                [2]
ON (C, A)                          ON (B, D)
ON (B, D)                          ON (C, A)
ON (C, A) ^ ON (B, D) ^ OTAD       ON (C, A) ^ ON (B, D) ^ OTAD

To continue with the example we started above, let us assume
that we choose first to explore alternative 1. Alternative 2 will
also lead to a solution. In fact, it finds one so trivially that it is
not very interesting. Exploring alternative 1, we first check to
see whether ON (C, A) is true in the current state.

Partial-Order Planner. Any planner that maintains a partial
solution as a totally ordered list of steps found so far is called a
total-order planner, or a linear planner. Alternatively, if we
only represent partial-order constraints on steps, then we have a
partial-order planner, which is also called a non-linear
planner. In this case, we specify a set of temporal constraints
between pairs of steps of the form S1 < S2, meaning that step S1
comes before, but not necessarily immediately before, step S2.
We also show this temporal constraint in graph form as
S1 +++++++++> S2
STRIPS is a total-order planner, as are situation-space
progression and regression planners.
Principle of Least Commitment
The principle of least commitment is the idea of never making a
choice unless required to do so. In other words, only do
something if it's necessary. The advantage of using this principle
is that we try to avoid doing work that might have to be undone
later, hence avoiding wasted work. In planning, one application
of this principle is to never order plan steps unless it's necessary
for some reason. So, partial-order planners exhibit this property
because constraints ordering steps will only be inserted when
necessary. On the other hand, situation-space progression
planners make commitments about the order of steps as they try
to find a solution and therefore may make mistakes from poor
guesses about the right order of steps.
Representing a Partial-Order Plan
A partial-order plan will be represented as a graph that describes
the temporal constraints between plan steps selected so far. That
is, each node will represent a single step in the plan (i.e., an
instance of one of the operators), and an arc will designate a
temporal constraint between the two steps connected by the arc.
For example,

S1 ++++++> S2 ++++++++++> S5
 |                         ^
 |                         |
 ++++++> S3 ++++++> S4 ++++

graphically represents the temporal constraints S1 < S2, S1 < S3,
S1 < S4, S2 < S5, S3 < S4, and S4 < S5 (S1 < S4 also follows by
transitivity through S3). This partial-order plan implicitly
represents the following three total-order plans, each of which is
consistent with all of the given constraints:
[S1,S2,S3,S4,S5], [S1,S3,S2,S4,S5], and [S1,S3,S4,S2,S5].
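
Those three total orders can be enumerated mechanically: repeatedly
emit any step all of whose required predecessors have already been
emitted. A minimal sketch of mine:

constraints = [("S1", "S2"), ("S1", "S3"), ("S1", "S4"),
               ("S2", "S5"), ("S3", "S4"), ("S4", "S5")]
steps = {"S1", "S2", "S3", "S4", "S5"}

def linearizations(done, remaining):
    if not remaining:
        yield done
    for s in sorted(remaining):
        # s is eligible if every step constrained to precede it is done.
        if all(a in done for a, b in constraints if b == s):
            yield from linearizations(done + [s], remaining - {s})

for plan in linearizations([], steps):
    print(plan)
# Prints exactly the three total-order plans listed above.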
Partial-Order Planner (POP) Algorithm
function pop(initial-state, conjunctive-goal, operators)
    // non-deterministic algorithm
    plan = make-initial-plan(initial-state, conjunctive-goal);
    loop:
    begin
        if solution?(plan) then return plan;
        (S-need, c) = select-subgoal(plan);  // choose an unsolved goal
        choose-operator(plan, operators, S-need, c);
            // select an operator to solve that goal and revise plan
        resolve-threats(plan);  // fix any threats created
    end
end

function solution?(plan)
    if causal-links-establishing-all-preconditions-of-all-steps(plan)
       and all-threats-resolved(plan)
       and all-temporal-ordering-constraints-consistent(plan)
       and all-variable-bindings-consistent(plan)
    then return true;
    else return false;
end

function select-subgoal(plan)
    pick a plan step S-need from steps(plan) with a precondition c
        that has not been achieved;
    return (S-need, c);
end

procedure choose-operator(plan, operators, S-need, c)
    // solve "open precondition" of some step
    choose a step S-add by either
        Step Addition: adding a new step from operators that
            has c in its Add-list,
        or Simple Establishment: picking an existing step in Steps(plan)
            that has c in its Add-list;
    if no such step then return fail;
    add causal link "S-add --c--> S-need" to Links(plan);
    add temporal ordering constraint "S-add < S-need" to Orderings(plan);
    if S-add is a newly added step then
    begin
        add S-add to Steps(plan);
        add "Start < S-add" and "S-add < Finish" to Orderings(plan);
    end
end

procedure resolve-threats(plan)
    foreach S-threat that threatens link "Si --c--> Sj" in Links(plan)
    begin  // "declobber" threat
        choose either
            Demotion: add "S-threat < Si" to Orderings(plan),
            or Promotion: add "Sj < S-threat" to Orderings(plan);
        if not(consistent(plan)) then return fail;
    end
end
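The consistency test used when resolving threats can be pictured
as checking that the "<" constraints contain no cycle. A minimal
Python sketch (helper names invented for illustration):

# Orderings are consistent iff the constraint graph is acyclic,
# checked here with Kahn's topological-sort algorithm.
def consistent(steps, orderings):
    indeg = {s: 0 for s in steps}
    succ = {s: [] for s in steps}
    for a, b in orderings:              # constraint a < b
        succ[a].append(b)
        indeg[b] += 1
    ready = [s for s in steps if indeg[s] == 0]
    seen = 0
    while ready:
        s = ready.pop()
        seen += 1
        for t in succ[s]:
            indeg[t] -= 1
            if indeg[t] == 0:
                ready.append(t)
    return seen == len(steps)           # every step sorted => no cycle

# A demotion or promotion constraint is kept only if it passes:
print(consistent(["Start", "S1", "S2", "Finish"],
                 [("Start", "S1"), ("S1", "S2"), ("S2", "Finish")]))  # True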

Recursive Transition Network: A recursive transition network
("RTN") is a graph-theoretical schematic used to represent the
rules of a context-free grammar. RTNs have application to
programming languages, natural language and lexical analysis.
Any sentence that is constructed according to the rules of an
RTN [1] is said to be "well-formed." The structural elements of a
well-formed sentence may also be well-formed sentences by
themselves, or they may be simpler structures. This is why
RTNs are described as recursive.
A sentence is generated by an RTN by applying the generative
rules specified in the RTN itself. These represent any set of rules
or a function consisting of a finite number of steps.
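A rough Python sketch of an RTN recognizer (the grammar, the
network names, and the sample sentence are all invented for
illustration); an arc labeled with another network's name triggers
a recursive subnetwork traversal:

# Each network maps a state to arcs (label, next-state). A label that
# names a network is a recursive call; otherwise it must match the
# next input word. The state "end" is accepting.
NETWORKS = {
    "S":  {"start": [("NP", "s1")], "s1": [("VP", "end")]},
    "NP": {"start": [("the", "n1")], "n1": [("dog", "end"), ("cat", "end")]},
    "VP": {"start": [("runs", "end"), ("sees", "v1")], "v1": [("NP", "end")]},
}

def parse(net, words, i=0, state="start"):
    """Return every input position reachable after traversing `net`."""
    if state == "end":
        return [i]
    results = []
    for label, nxt in NETWORKS[net].get(state, []):
        if label in NETWORKS:                       # recursive subnetwork
            results += [j2 for j in parse(label, words, i)
                           for j2 in parse(net, words, j, nxt)]
        elif i < len(words) and words[i] == label:  # terminal match
            results += parse(net, words, i + 1, nxt)
    return results

sentence = "the dog sees the cat".split()
print(len(sentence) in parse("S", sentence))        # True: well-formed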

Unit-5
Expert System and AI languages
Introduction: An expert system is a set of programs that
manipulate encoded knowledge to solve problems in a
specialized domain that normally requires human expertise. An
expert system's knowledge is obtained from expert sources and
coded in a form suitable for the system to use in its inference
or reasoning processes. The expert knowledge must be obtained
from specialists or other sources of expertise, such as texts,
journal articles, and data bases. This type of knowledge usually
requires much training and experience in some specialized field
such as medicine, geology, system configuration, or engineering
design.

Characteristic Features of Expert Systems: Expert
systems differ from conventional computer systems in several
important ways.
1. Expert systems use knowledge rather than data to control the
solution process. Much of the knowledge used is heuristic in
nature rather than algorithmic.
2. The knowledge is encoded and maintained as an entity separate
from the control program. As such, it is not compiled together
with the control program itself. This permits the incremental
addition and modification (refinement) of the knowledge base
without recompilation of the control programs. Furthermore, it is
possible in some cases to use different knowledge bases with the
same control programs to produce different types of expert
systems.
3. Expert systems are capable of explaining how a particular
conclusion was reached, and why requested information is needed
during a consultation. This is important as it gives the user a
chance to assess and understand the system's reasoning ability,
thereby improving the user's confidence in the system.
4. Expert systems use symbolic representations for knowledge
(rules, networks, or frames) and perform their inference through
symbolic computations that closely resemble manipulations of
natural language.
Expert systems often reason with meta-knowledge; that is, they
reason with knowledge about themselves, and their own
knowledge limits and capabilities.
Applications:
- Different types of medical diagnoses (internal medicine,
pulmonary diseases, infectious blood diseases, and so on)
- Diagnosis of complex electronic and electrochemical systems
- Diagnosis of diesel-electric locomotion systems
- Diagnosis of software development projects
- Forecasting crop damage
- Location of faults in computer and communication systems

[Fig. An expert system model: an expert and a systems analyst
build the knowledge base through a development engine; the
inference engine applies the knowledge base to the user's problem
domain through the user interface.]

Rule-Based System Architectures: The most common
form of architecture used in expert and other types of
knowledge-based systems is the production system, also
called the rule-based system. This type of system uses
knowledge encoded in the form of production rules, that is,
if-then rules:

IF: Condition-1 and Condition-2 and Condition-3
THEN: Take Action-4

IF: The temperature is greater than 200 degrees, and
    the water level is low
THEN: Open the safety valve.
A & B & C & D -> E & F

Each rule represents a small chunk of knowledge relating to the
given domain of expertise. A number of related rules collectively
may correspond to a chain of inferences leading from some
initially known facts to some useful conclusions. When the known
facts support the conditions on a rule's left side, the conclusion
or action part of the rule is then accepted as known (or at least
known with some degree of certainty).
Inference in production systems is
accomplished by a process of chaining through the rules
recursively, either in a forward or backward direction, until a
conclusion is reached or until failure occurs. The selection of
rules used in the chaining process is determined by matching
current facts against the domain knowledge or variables in rules
and choosing among a candidate set of rules the ones that meet
some given criteria, such as specificity. The inference process is
typically carried out in an interactive mode with the user
providing input parameters needed to complete the chaining
process.

[Fig. Components of a typical expert system: the user communicates
through an I/O interface (input and output); behind it sit the
inference engine, an explanation module, an editor, the knowledge
base, a case history file, working memory, and a learning module.]


The Knowledge Base: The Knowledge base contains facts and
rules about some specialized knowledge domain.
The Inference Process: The inference engine accepts user input
queries and responses to questions through the I/O interface and
uses this dynamic information together with the static
knowledge (the rules and facts) stored in the knowledge base.
The knowledge in the knowledge base is used to derive
conclusions about the current case or situation as presented by
the user's input.
During the match stage, the contents of working
memory are compared to facts and rules contained in the
knowledge base. When consistent matches are found, the
corresponding rules are placed in a conflict set. To find an
appropriate and consistent match, substitutions (instantiations)
may be required. Once all the matched rules have been added to
the conflict set during a given cycle, one of the rules is selected
for execution.
When the left side of a sequence of rules is
instantiated first and the rules are executed from left to right,
the process is called forward chaining. This is also known as
data-driven inference, since input data are used to guide the
direction of the inference process. For example, we can chain
forward to show that when a student is encouraged, is healthy,
and has goals, the student will succeed.

ENCOURAGED (student) -> MOTIVATED (student)
MOTIVATED (student) & HEALTHY (student) -> WORKHARD (student)
WORKHARD (student) & HASGOALS (student) -> EXCELL (student)
EXCELL (student) -> SUCCEED (student)
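These rules can be run by a small forward-chaining loop; the
following Python sketch (illustrative only) encodes the rule set
above with an assumed set of initial facts:

# Forward chaining: fire any rule whose conditions all hold in working
# memory, adding its conclusion, until no rule adds anything new.
rules = [
    ({"ENCOURAGED"},           "MOTIVATED"),
    ({"MOTIVATED", "HEALTHY"}, "WORKHARD"),
    ({"WORKHARD", "HASGOALS"}, "EXCELL"),
    ({"EXCELL"},               "SUCCEED"),
]
facts = {"ENCOURAGED", "HEALTHY", "HASGOALS"}   # assumed initial data

changed = True
while changed:
    changed = False
    for conditions, conclusion in rules:
        if conditions <= facts and conclusion not in facts:
            facts.add(conclusion)               # data-driven step
            changed = True

print("SUCCEED" in facts)                       # prints: True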

On the other hand, when the right side of the rules is
instantiated first, the left-hand conditions become sub goals.
These sub goals may in turn cause sub-sub goals to be
established, and so on until facts are found to match the lowest
sub goal conditions. When this form of inference takes place,
we say that backward chaining is performed. This form of
inference is also known as goal-driven inference, since an initial
goal establishes the backward direction of the inference.
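For comparison, a goal-driven sketch over the same rules (again
illustrative Python with the same assumed facts), where a goal is
reduced to sub goals until known facts are reached:

# Backward chaining: prove a goal from the facts, or by proving all
# left-hand conditions of some rule whose right side is the goal.
rules = [
    ({"ENCOURAGED"},           "MOTIVATED"),
    ({"MOTIVATED", "HEALTHY"}, "WORKHARD"),
    ({"WORKHARD", "HASGOALS"}, "EXCELL"),
    ({"EXCELL"},               "SUCCEED"),
]
facts = {"ENCOURAGED", "HEALTHY", "HASGOALS"}

def prove(goal):
    if goal in facts:
        return True
    return any(all(prove(c) for c in conditions)
               for conditions, conclusion in rules
               if conclusion == goal)

print(prove("SUCCEED"))   # prints: True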
Explanation Module: The Explanation module provides the
user with an explanation of the reasoning process when
requested. This is done in response to a how query or a why
query.
To respond to a how query, the explanation module traces
the chain of rules fired during a consultation with the user. The
sequence of rules that led to the conclusion is then printed for
the user in an easy-to-understand, human-language style. This
permits the user to actually see the reasoning process followed
by the system in arriving at the conclusion. If the user does not
agree with the reasoning steps presented they may be changed
using the editor.

To respond to a why query, the explanation module
must be able to explain why certain information is needed by the
inference engine to complete a step in the reasoning process
before it can proceed. For example, in diagnosing a car that will
not start, a system might be asked why it needs to know the
status of the distributor spark.
Building a Knowledge Base: The editor is used by developers
to create new rules for addition to the knowledge base, to delete
outmoded rules, or to modify existing rules in some way. Some
editors also apply consistency tests to newly created rules, and
such systems may prompt the user for missing information.
The I/O Interface: The input-output interface permits the user
to communicate with the system in a more natural way by
permitting the use of simple selection menus or the use of a
restricted language which is close to a natural language. This
means that the system must have special prompts or a
specialized vocabulary which encompasses the terminology of
the given domain of expertise.

Nonproduction System Architectures: Instead of rules,
these systems employ more structured representation schemes
like associative or semantic networks, frame and rule structures,
and decision trees, or even specialized networks like neural
networks.

Associative or Semantic Network Architectures: We know
that an associative network is a network made up of nodes
connected by directed arcs. The nodes represent objects,
attributes, concepts, or other basic entities, and the arcs, which
are labeled, describe the relationship between the two nodes they
connect. Special network links include the ISA and HASPART
links which designate an object as being a certain type of object
(belonging to a class of objects) and as being a subpart of
another object, respectively.
Associative network representations are especially useful
in depicting hierarchical knowledge structures, where property
inheritance is common. More often, these network
representations are used in natural language or computer vision
systems or in conjunction with some other form of
representation.
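As a sketch (the entities are invented for illustration), ISA
links give property inheritance when an attribute is looked up:

# Semantic network sketch: nodes with labeled arcs; ISA links let an
# object inherit properties from the classes above it.
network = {
    "canary": {"ISA": "bird", "color": "yellow"},
    "bird":   {"ISA": "animal", "can": "fly", "HASPART": "wings"},
    "animal": {"alive": "yes"},
}

def lookup(node, attribute):
    while node is not None:
        props = network.get(node, {})
        if attribute in props:
            return props[attribute]
        node = props.get("ISA")        # climb the ISA hierarchy
    return None

print(lookup("canary", "color"))       # yellow (stored locally)
print(lookup("canary", "HASPART"))     # wings  (inherited from bird)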
Frame Architectures: Frames are structured sets of closely
related knowledge, such as an object or concept name, the
object's main attributes and their corresponding values, and
possibly some attached procedures (if-needed, if-added,
if-removed procedures). The attributes, values, and procedures
are stored in specified slots and facets of the frame. Individual
frames are usually linked together in a network, much like the
nodes in an associative network.
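A minimal sketch of a frame in Python, with slot facets holding
either a value or an attached if-needed procedure (the slot names
and the fixed current year are illustrative only):

# Frame sketch: each slot has facets such as "value" or "if-needed";
# an if-needed procedure runs only when no stored value exists.
car = {
    "name":   {"value": "my-car"},
    "wheels": {"value": 4},
    "year":   {"value": 2019},
    "age":    {"if-needed": lambda frame: 2024 - frame["year"]["value"]},
}

def get(frame, slot):
    facets = frame[slot]
    if "value" in facets:
        return facets["value"]
    if "if-needed" in facets:
        return facets["if-needed"](frame)   # attached procedure
    return None

print(get(car, "wheels"))   # 4 (stored value)
print(get(car, "age"))      # 5 (computed by the if-needed procedure)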

Decision Tree Architectures: Knowledge for expert systems
may be stored in the form of a decision tree when the knowledge
can be structured in a top-to-bottom manner. For example, the
identification of objects (equipment, faults, physical objects,
diseases) can be made through a decision tree structure. Initial
and intermediate nodes in the tree correspond to object
attributes, and terminal nodes correspond to the identities of
objects. Attribute values for an object determine a path to a leaf
node in the tree which contains object identification. Each object
attribute corresponds to a non terminal node in the tree and each
branch of the decision tree corresponds to an attribute value or
set of values.
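A small sketch of such a tree (the fault-diagnosis attributes are
invented for illustration):

# Decision-tree sketch: interior nodes test an attribute; the value of
# that attribute selects a branch; leaves (plain strings) identify the
# object.
tree = ("powers_on", {
    "no":  "dead power supply",
    "yes": ("beeps", {
        "no":  "faulty motherboard",
        "yes": "working system",
    }),
})

def identify(node, attributes):
    while not isinstance(node, str):     # descend until a leaf
        attribute, branches = node
        node = branches[attributes[attribute]]
    return node

print(identify(tree, {"powers_on": "yes", "beeps": "no"}))
# -> faulty motherboard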
Blackboard System Architectures: Blackboard architectures
refer to a special type of knowledge-based system which uses a
form of opportunistic reasoning. This differs from pure forward
or pure backward chaining in production systems in that either
direction may be chosen dynamically at each stage in the
problem solution process.
Blackboard systems are composed of three functional
components, as depicted in the figure below.
1. There are a number of knowledge sources, which are separate
and independent sets of coded knowledge. Each knowledge source
may be thought of as a specialist in some limited area needed to
solve a given subset of problems; the sources may contain
knowledge in the form of procedures, rules, or other schemes.
2. A globally accessible data base structure, called a blackboard,
contains the current problem state and information needed by the
knowledge sources (input data, partial solutions, control data,
alternatives, and final solutions). The knowledge sources make
changes to the blackboard data that incrementally lead to a
solution. Communication and interaction between the knowledge
sources takes place solely through the blackboard.
3. Control information may be contained within the sources, on
the blackboard, or possibly in a separate module. The control
knowledge monitors the changes to the blackboard and determines
what the immediate focus of attention should be in solving the
problem.

[Fig. Components of a blackboard system: the blackboard, the
knowledge sources, and control information.]
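The opportunistic control loop can be pictured with a sketch like
the following (the knowledge sources are invented for
illustration):

# Blackboard sketch: independent knowledge sources inspect the shared
# blackboard and, when applicable, post partial results back onto it.
blackboard = {"signal": [3, 1, 4, 1, 5], "mean": None, "report": None}

def averager(bb):            # knowledge source 1
    if bb["signal"] and bb["mean"] is None:
        bb["mean"] = sum(bb["signal"]) / len(bb["signal"])

def reporter(bb):            # knowledge source 2
    if bb["mean"] is not None and bb["report"] is None:
        bb["report"] = f"mean level is {bb['mean']:.1f}"

knowledge_sources = [reporter, averager]   # order is irrelevant:
while blackboard["report"] is None:        # control keeps cycling
    for ks in knowledge_sources:
        ks(blackboard)

print(blackboard["report"])   # mean level is 2.8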
Analogical Reasoning Architectures: Expert systems based
on analogical architectures solve new problems as humans do, by
finding a similar, known problem solution and applying it to the
new problem, possibly with some modifications. For example, if we
know a method of proving that the product of two even integers is
even, we can successfully prove that the product of two odd
integers is odd through much the same proof steps. Expert systems
using analogical architectures require a large knowledge base
holding numerous problem solutions and other previously
encountered situations or episodes.
Neural Network Architectures: Neural networks are large
networks of simple processing elements or nodes which process
information dynamically in response to external inputs. The
nodes are simplified models of neurons. The knowledge in a
neural network is distributed throughout the network in the form
of internode connections and weighted links which form the
inputs to the nodes. The link weights serve to enhance or inhibit
the input stimuli values, which are then added together at the
nodes. If the sum of all the inputs to a node exceeds some
threshold value T, the node fires and produces an output which
is passed on to other nodes or is used to produce some output
response.
Neural networks were originally inspired by models of the human
nervous system; they are, to be sure, greatly simplified models.
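The threshold behavior just described can be sketched in a few
lines of Python (the weights and threshold are chosen
arbitrarily):

# Threshold node sketch: weighted inputs are summed; the node fires
# (outputs 1) only if the sum exceeds the threshold T.
def node_output(inputs, weights, T):
    total = sum(x * w for x, w in zip(inputs, weights))
    return 1 if total > T else 0

# With these weights the node behaves like a logical AND of two inputs:
weights, T = [1.0, 1.0], 1.5
for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", node_output([a, b], weights, T))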

Knowledge Acquisition: Knowledge for expert systems must
be derived from expert sources like experts in the given field,
journal articles, texts, reports, data bases, and so on. Elicitation
of the right knowledge can take several man-years and cost
hundreds of thousands of dollars. This process is now
recognized as one of the main bottlenecks in building expert and
other knowledge-based systems. Consequently, much effort has
been devoted to developing more effective methods of acquisition
and coding.
Pulling together and correctly interpreting the right
knowledge to solve a set of complex tasks is an onerous job.
Typically, experts do not know what specific knowledge is
being applied, nor just how it is applied, in the solution of
a given problem. Even if they do know, it is likely they are
unable to articulate the problem-solving process well enough to
capture the low-level knowledge used and the inference
processes applied. This difficulty has led to the use of AI
experts (called knowledge engineers) who serve as
intermediaries between the domain expert and the system. The
knowledge engineer elicits information from the experts and
codes this knowledge into a form suitable for use in the expert
system.
The knowledge elicitation process is depicted in the figure
below. To elicit the requisite knowledge, a knowledge engineer
conducts extensive interviews with domain experts. During the
interviews, the expert is asked to solve typical problems in the
domain of interest and to explain his or her solutions.

[Fig. The knowledge acquisition process: domain expert ->
knowledge engineer -> system editor -> knowledge base.]


Using the knowledge gained from the experts and other
sources, the knowledge engineer codes the knowledge in the
form of rules or some other representation scheme. This
knowledge is then used to solve sample problems for review.


Errors and omissions are uncovered and corrected, and
additional knowledge is added as needed. The process is
repeated until a sufficient body of knowledge has been collected
to solve a large class of problems in the chosen domain. The
whole process may take as many as tens of person-years.

Lisp: Lisp is one of the oldest computer programming
languages. It was invented by John McCarthy during the late
1950s, shortly after the development of FORTRAN. LISP is
particularly suited for AI programs because of its ability to
process symbolic information effectively. It is a language with a
simple syntax, with little or no data typing and dynamic memory
management.
Lisp has become the language of choice for most
AI practitioners. It was practically unheard of outside the
research community until AI began to gain some popularity ten
to fifteen years ago. Since then, special LISP processing
machines have been built and its popularity has spread to many
new sectors of business and government.
The basic building blocks of LISP are
the atom, list, and the string. An atom is a number or string of
contiguous characters, including numbers and special characters.
A list is a sequence of atoms and/or other lists enclosed within
parentheses. A string is a group of characters enclosed in double
quotation marks.
Lisp programs run either on an interpreter or as
compiled code. The interpreter examines source programs in a
repeated loop, called the read-evaluate-print loop. This loop
reads the program code, evaluates it, and prints the values
returned by the program. The interpreter signals its readiness to
accept code for execution by printing a prompt such as the ->
symbol.
Examples
Here are examples of Common Lisp code.
The basic "Hello world" program:
(print "Hello world")
As the reader may have noticed from the above discussion, Lisp
syntax lends itself naturally to recursion. Mathematical problems
such as the enumeration of recursively defined sets are simple to
express in this notation.
Evaluate a number's factorial:
(defun factorial (n)
  (if (<= n 1)
      1
      (* n (factorial (- n 1)))))
An alternative implementation, often faster than the previous
version if the Lisp system has tail recursion optimization:
(defun factorial (n &optional (acc 1))
  (if (<= n 1)
      acc
      (factorial (- n 1) (* acc n))))
Contrast with an iterative version which uses Common Lisp's
loop macro:
(defun factorial (n)
  (loop for i from 1 to n
        for fac = 1 then (* fac i)
        finally (return fac)))
The following function reverses a list. (Lisp's built-in reverse
function does the same thing.)
(defun -reverse (list)
  (let ((return-value '()))
    (dolist (e list) (push e return-value))
    return-value))

Prolog: Prolog is a general-purpose logic programming
language associated with artificial intelligence and
computational linguistics.
Prolog has its roots in formal logic, and unlike many other
programming languages, Prolog is declarative: The program
logic is expressed in terms of relations, represented as facts and
rules. A computation is initiated by running a query over these
relations.[4]
The language was first conceived by a group around Alain
Colmerauer in Marseille, France, in the early 1970s and the first
Prolog system was developed in 1972 by Colmerauer with
Philippe Roussel.[5][6]
Prolog was one of the first logic programming languages,[7] and
remains among the most popular such languages today, with
many free and commercial implementations available. While
initially aimed at natural language processing, the language has
since then stretched far into other areas like theorem proving,[8]
expert systems,[9] games, automated answering systems,
ontologies and sophisticated control systems. Modern Prolog
environments support creating graphical user interfaces, as well
as administrative and networked applications.
Syntax and semantics. In Prolog, program logic is expressed in
terms of relations, and a computation is initiated by running a
query over these relations. Relations and queries are constructed
using Prolog's single data type, the term. Relations are defined
by clauses. Given a query, the Prolog engine attempts to find a
resolution refutation of the negated query. If the negated query
can be refuted, i.e., an instantiation for all free variables is found
that makes the union of clauses and the singleton set consisting
of the negated query false, it follows that the original query,
with the found instantiation applied, is a logical consequence of
the program. This makes Prolog (and other logic programming
languages) particularly useful for database, symbolic
mathematics, and language parsing applications. Because Prolog
allows impure predicates, checking the truth value of certain
special predicates may have some deliberate side effect, such as
printing a value to the screen. Because of this, the programmer
is permitted to use some amount of conventional imperative
programming when the logical paradigm is inconvenient. It has
a purely logical subset, called "pure Prolog", as well as a
number of extra-logical features.
Data types
Prolog's single data type is the term. Terms are either atoms,
numbers, variables or compound terms.
- An atom is a general-purpose name with no inherent meaning.
  Examples of atoms include x, blue, 'Taco', and 'some atom'.
- Numbers can be floats or integers.
- Variables are denoted by a string consisting of letters, numbers
  and underscore characters, and beginning with an upper-case
  letter or underscore. Variables closely resemble variables in
  logic in that they are placeholders for arbitrary terms.
- A compound term is composed of an atom called a "functor" and a
  number of "arguments", which are again terms. Compound terms are
  ordinarily written as a functor followed by a comma-separated
  list of argument terms, which is contained in parentheses. The
  number of arguments is called the term's arity. An atom can be
  regarded as a compound term with arity zero. Examples of compound
  terms are truck_year('Mazda', 1986) and
  'Person_Friends'(zelda,[tom,jim]).
Special cases of compound terms:
- A list is an ordered collection of terms. It is denoted by square
  brackets with the terms separated by commas or, in the case of
  the empty list, []. For example [1,2,3] or [red,green,blue].
- Strings: A sequence of characters surrounded by quotes is
  equivalent to a list of (numeric) character codes, generally in
  the local character encoding, or Unicode if the system supports
  Unicode. For example, "to be, or not to be".

Examples
Here follow some example programs written in Prolog.
Hello world
An example of a query:
?- write('Hello world!'), nl.
Hello world!
true.
?-

Compiler optimization
Any computation can be expressed declaratively as a sequence
of state transitions. As an example, an optimizing compiler with
three optimization passes could be implemented as a relation
between an initial program and its optimized form:
program_optimized(Prog0, Prog) :-
    optimization_pass_1(Prog0, Prog1),
    optimization_pass_2(Prog1, Prog2),
    optimization_pass_3(Prog2, Prog).
or equivalently using DCG notation:
program_optimized --> optimization_pass_1,
optimization_pass_2, optimization_pass_3.
Quicksort
The Quicksort sorting algorithm, relating a list to its sorted
version:
partition([], _, [], []).
partition([X|Xs], Pivot, Smalls, Bigs) :-
    (   X @< Pivot ->
        Smalls = [X|Rest],
        partition(Xs, Pivot, Rest, Bigs)
    ;   Bigs = [X|Rest],
        partition(Xs, Pivot, Smalls, Rest)
    ).

quicksort([]) --> [].
quicksort([X|Xs]) -->
    { partition(Xs, X, Smaller, Bigger) },
    quicksort(Smaller), [X], quicksort(Bigger).

Dynamic programming
The following Prolog program uses dynamic programming to
find the longest common subsequence of two lists in polynomial
time. The clause database is used for memoization:

:- dynamic(stored/1).

memo(Goal) :-
    (   stored(Goal) -> true
    ;   Goal, assertz(stored(Goal))
    ).

lcs([], _, []) :- !.
lcs(_, [], []) :- !.
lcs([X|Xs], [X|Ys], [X|Ls]) :- !, memo(lcs(Xs, Ys, Ls)).
lcs([X|Xs], [Y|Ys], Ls) :-
    memo(lcs([X|Xs], Ys, Ls1)), memo(lcs(Xs, [Y|Ys], Ls2)),
    length(Ls1, L1), length(Ls2, L2),
    (   L1 >= L2 -> Ls = Ls1 ; Ls = Ls2 ).
Example query:
?- lcs([x,m,j,y,a,u,z], [m,z,j,a,w,x,u], Ls).
Ls = [m, j, a, u]
Design patterns
A design pattern is a general reusable solution to a commonly
occurring problem in software design. In Prolog, design patterns
go under various names: skeletons and techniques, clichés,
program schemata, and logic description schemata. An
alternative to design patterns is higher order programming.

Higher-order programming
By definition, first-order logic does not allow quantification
over predicates. A higher-order predicate is a predicate that
takes one or more other predicates as arguments. Prolog already
has some built-in higher-order predicates such as call/1,
findall/3, setof/3, and bagof/3.[16] Furthermore, since arbitrary
Prolog goals can be constructed and evaluated at run-time, it is
easy to write higher-order predicates like maplist/2, which
applies an arbitrary predicate to each member of a given list, and
sublist/3, which filters elements that satisfy a given predicate,
also allowing for currying.[15]
To convert solutions from temporal representation (answer
substitutions on backtracking) to spatial representation (terms),
Prolog has various all-solutions predicates that collect all answer
substitutions of a given query in a list. This can be used for list
comprehension. For example, perfect numbers equal the sum of
their proper divisors:
perfect(N) :-
    between(1, inf, N), U is N // 2,
    findall(D, (between(1,U,D), N mod D =:= 0), Ds),
    sumlist(Ds, N).

This can be used to enumerate perfect numbers, and also to
check whether a number is perfect.

Mycin. Mycin was an early expert system developed over five
or six years in the early 1970s at Stanford University. It was
written in Lisp as the doctoral dissertation of Edward Shortliffe
under the direction of Bruce Buchanan, Stanley N. Cohen and
others. It arose in the laboratory that had created the earlier
Dendral expert system, but emphasized the use of judgmental
rules that had elements of uncertainty (known as certainty
factors) associated with them. This expert system was designed
to identify bacteria causing severe infections, such as bacteremia
and meningitis, and to recommend antibiotics, with the dosage
adjusted for the patient's body weight. The name derives from
the antibiotics themselves, as many antibiotics have the suffix
"-mycin". The Mycin system was also used for the diagnosis of
blood clotting diseases.
Method
MYCIN operated using a fairly simple inference engine, and a
knowledge base of ~600 rules. It would query the physician
running the program via a long series of simple yes/no or textual
questions. At the end, it provided a list of possible culprit
bacteria ranked from high to low based on the probability of
each diagnosis, its confidence in each diagnosis' probability, the
reasoning behind each diagnosis (that is, MYCIN would also list
the questions and rules which led it to rank a diagnosis a
particular way), and its recommended course of drug treatment.
Despite MYCIN's success, it sparked debate about the use of its
ad hoc, but principled, uncertainty framework known as
"certainty factors". The developers performed studies showing
that MYCIN's performance was minimally affected by
perturbations in the uncertainty metrics associated with
individual rules, suggesting that the power in the system was
related more to its knowledge representation and reasoning
scheme than to the details of its numerical uncertainty model.
Some observers felt that it should have been possible to use
classical Bayesian statistics. MYCIN's developers argued that
this would require either unrealistic assumptions of probabilistic
independence, or require the experts to provide estimates for an
unfeasibly large number of conditional probabilities.[1][2]
Subsequent studies later showed that the certainty factor model
could indeed be interpreted in a probabilistic sense, and
highlighted problems with the implied assumptions of such a
model. However, the modular structure of the system would
prove very successful, leading to the development of graphical
models such as Bayesian networks.
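For illustration, MYCIN's standard rule for combining two
positive certainty factors supporting the same hypothesis was
CF = CF1 + CF2 (1 - CF1); a quick sketch:

# MYCIN-style combination of two positive certainty factors for the
# same hypothesis: CF = CF1 + CF2 * (1 - CF1).
def combine_cf(cf1, cf2):
    return cf1 + cf2 * (1 - cf1)

# Two moderately confident rules together give stronger belief, and
# repeated combination stays bounded below 1:
print(combine_cf(0.6, 0.5))                   # 0.8
print(combine_cf(combine_cf(0.6, 0.5), 0.9))  # 0.98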
Results
Research conducted at the Stanford Medical School found
MYCIN to propose an acceptable therapy in about 69% of cases,
which was better than the performance of infectious disease
experts who were judged using the same criteria. This study is
often cited as showing the potential for disagreement about
therapeutic decisions, even among experts, when there is no
"gold standard" for correct treatment.


Practical use
MYCIN was never actually used in practice. This was not
because of any weakness in its performance. As mentioned, in
tests it outperformed members of the Stanford medical school
faculty. Some observers raised ethical and legal issues related to
the use of computers in medicine: if a program gives the
wrong diagnosis or recommends the wrong therapy, who should
be held responsible? However, the greatest problem, and the
reason that MYCIN was not used in routine practice, was the
state of technologies for system integration, especially at the
time it was developed. MYCIN was a stand-alone system that
required a user to enter all relevant information about a patient
by typing in response to questions that MYCIN would pose. The
program ran on a large time-shared system, available over the
early Internet (Arpanet), before personal computers were
developed. In the modern era, such a system would be integrated
with medical record systems, would extract answers to questions
from patient databases, and would be much less dependent on
physician entry of information. In the 1970s, a session with
MYCIN could easily consume 30 minutes or more, an
unrealistic time commitment for a busy clinician.
A difficulty that rose to prominence during the development of
MYCIN and subsequent complex expert systems has been the
extraction of the necessary knowledge, from the human experts
in the relevant fields, into the rule base for the inference engine
to use (the so-called knowledge engineering).
