EINSTEIN COLLEGE OF ENGINEERING

ARTIFICIAL INTELLIGENCE
[Lecture Notes CS1351]
SUJATHA.K 12/15/2010

Reference book: Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach

R.Anirudhan



Chapter-1
Characterizations of Artificial Intelligence

Artificial Intelligence is not an easy science to describe, as it has fuzzy borders with mathematics, computer science, philosophy, psychology, statistics, physics, biology and other disciplines. It is often characterized in various ways, some of which are given below. I'll use these categorizations to introduce various important issues in AI.

1.1 Long Term Goals


Just what is the science of Artificial Intelligence trying to achieve? At a very high level, you will hear AI researchers categorized as either 'weak' or 'strong'. The 'strong' AI people think that computers can achieve consciousness (although they may not be working on consciousness issues). The 'weak' AI people don't go that far. Other people talk of the difference between ' Big AI' and 'Small AI'. Big AI is the attempt to build robots of intelligence equaling that of humans, such as Lieutenant Commander Data from Star Trek. Small AI is all about getting programs to work for small problems and trying to generalize the techniques to work on larger problems. Most AI researchers don't worry about things like consciousness and concentrate on some of the following long term goals. Firstly, many researchers want to:

Produce machines which exhibit intelligent behavior.

Machines in this sense could simply be personal computers, or they could be robots with embedded systems, or a mixture of both. Why would we want to build intelligent systems? One answer appeals to the reasons why we use computers in general: to accomplish tasks which, if we did them by hand, would be error prone. For instance, how many of us would not reach for our calculator if required to multiply two six digit numbers together? If we scale this up to more intelligent tasks, then it should be possible to use computers to do some fairly complicated things reliably. This reliability may be very useful if the task is beyond some cognitive limitation of the brain, or when human intuition is counter-constructive, such as in the Monty Hall problem described below, which many people - some of whom call themselves mathematicians - get wrong. Another reason we might want to construct intelligent machines is to enable us to do things we couldn't do before. A large part of science is dependent on the use of computers already, and more intelligent applications are increasingly being
employed. The ability for intelligent software to increase our abilities is not limited to science, of course, and people are working on AI programs which can have a creative input to human activities such as composing, painting and writing. Finally, in constructing intelligent machines, we may learn something about intelligence in humanity and other species. This deserves a category of its own. Another reason to study Artificial Intelligence is to help us to:

Understand human intelligence in society.

AI can be seen as just the latest tool in the philosopher's toolbox for answering questions about the nature of human intelligence, following in the footsteps of mathematics, logic, biology, psychology, cognitive science and others. Some obvious questions that philosophy has wrangled with are: "We know that we are more 'intelligent' than the other animals, but what does this actually mean?" and "How many of the activities which we call intelligent can be replicated by computation (e.g., algorithmically)?" For example, the ELIZA program discussed below is a classic example from the sixties where a very simple program raised some serious questions about the nature of human intelligence. Amongst other things, ELIZA helped philosophers and psychologists to question the notion of what it means to 'understand' in natural language (e.g., English) conversations. By stating that AI helps us understand the nature of human intelligence in society, we should note that AI researchers are increasingly studying multi-agent systems, which are, roughly speaking, collections of AI programs able to communicate and cooperate/compete on small tasks towards the completion of larger tasks. This means that the social, rather than individual, nature of intelligence is now a subject within range of computational studies in Artificial Intelligence. Of course, humans are not the only life-forms, and the question of life (including intelligent life) poses even bigger questions. Indeed, some Artificial Life (ALife) researchers have grand plans for their software. They want to use it to:

Give birth to new life forms.

A study of Artificial Life will certainly throw light on what it means for a complex system to be 'alive'. Moreover, ALife researchers hope that, in creating artificial life-forms, given time, intelligent behaviour will emerge, much like it did in human evolution. Hence, there may be practical applications of an ALife approach. In particular, evolutionary algorithms (where programs and parameters are evolved to
perform a particular task, rather than to exhibit signs of life) are becoming fairly mainstream in AI. A less obvious long term goal of AI research is to:

Add to scientific knowledge.

This is not to be confused with the applications of AI programs to other sciences, discussed later. Rather, it is worth pointing out that some AI researchers don't write intelligent programs and are certainly not interested in human intelligence or breathing life into programs. They are really interested in the various scientific problems that arise in the study of AI. One example is the question of algorithmic complexity - how bad will a particular algorithm get at solving a particular problem (in terms of the time taken to find the solution) as the problem instances get bigger. These kinds of studies certainly have an impact on the other long term goals, but the pursuit of knowledge itself is often overlooked as a reason for AI to exist as a scientific discipline. We won't be covering issues such as algorithmic complexity in this course, however.

1.2 Inspirations
Artificial Intelligence research can be characterised in terms of how the following question has been answered: "Just how are we going to get a computer to perform intelligent tasks?" One way to answer the question is to say that:

Logic makes a science out of various forms of reasoning, which play their part in intelligence. So, let's build our programs as implementations of logical theories.

This has led to the use of logic - drawing on mathematics and philosophy - in a great deal of AI research. This means that we can be very precise about the algorithms we implement, write our programs in very clear ways using logic programming languages, and even prove things about the programs we produce. However, while it's theoretically possible to do certain intelligent things (such as prove some easy mathematics theorems) with programs based on logic alone, such methods are held back by the very large search spaces involved. People began to think about heuristics - rules of thumb - which they could use to enable their programs to get jobs done in a reasonable time. They answered the question like this:
We're not sure that humans reason with perfect logic all the time, but we are certainly intelligent. So, let's use introspection and tell our AI programs how to think like us.

In answering this question, AI researchers started building expert systems, which encapsulated factual, procedural and heuristic knowledge about particular domains.

1.4 General Tasks to Accomplish


Once you've worried about why you're doing AI, what has inspired you and how you're going to approach the job, then you can start to think about what task it is that you want to automate. AI is so often portrayed as a set of problem-solving techniques, but I think the relentless shoe-horning of intelligent tasks into one problem formulation or another is holding AI back. That said, we have determined a number of problem solving tasks in AI - most of which have been hinted at previously - which can be used as a characterization. The categories overlap a little because of the generality of the techniques. For instance, planning could be found in many categories, as this is a fundamental part of solving many types of problem.

1.5 Generic Techniques Developed


In the pursuit of solutions to various problems in the above categories, various individual techniques have sprung up which have been shown to be useful for solving a range of problems (usually within the general problem category). These techniques are established enough now to have a name and provide at least a partial characterisation of AI. The following list is not intended to be complete, but rather to introduce some techniques you will learn later in the course. Note that some of these overlap with the general techniques above.

Forward/backward chaining (reasoning)
Resolution theorem proving (reasoning)
Proof planning (reasoning)
Constraint satisfaction (reasoning)
Davis-Putnam method (reasoning)
Minimax search (games)
Alpha-Beta pruning (games)
Case-based reasoning (expert systems)
Knowledge elicitation (expert systems)
Neural networks (learning)
Bayesian methods (learning)
Explanation based (learning)
Inductive logic programming (learning)
Reinforcement (learning)
Genetic algorithms (learning)
Genetic programming (learning)
Strips (planning)
N-grams (NLP)
Parsing (NLP)
Behavior based (robotics)
Cell decomposition (robotics)

1.6 Representations/Languages Used


Many people are taught AI with the opening line: "The three most important things in AI are representation, representation and representation". While choosing the way of representing knowledge in AI programs will always be a key concern, many techniques now have well-chosen ways to represent data which have been shown to be useful for that technique. Along the way, much research has been undertaken into discovering the best ways to represent certain types of knowledge. The way in which knowledge can be represented is often taken as another way to characterize Artificial Intelligence. Some general representation schemes include:

First order logic
Higher order logic
Logic programs
Frames
Production Rules
Semantic Networks
Fuzzy logic
Bayes nets
Hidden Markov models
Neural networks
Strips

Some standard AI programming languages have been developed in order to build intelligent programs efficiently and robustly. These include:

Prolog, Lisp, ML

Note that other languages are used extensively to build AI programs, including:

Perl, C++, Java, C

1.7 Application Areas


Individual applications often drive AI research much more than the long term goals described above. Much of AI literature is grouped into application areas, some of which are:
Agriculture
Architecture
Art
Astronomy
Bioinformatics
Email classification
Engineering
Finance
Fraud detection
Information retrieval
Law
Mathematics
Military
Music
Scientific discovery
Story writing
Telecommunications
Telephone services
Transportation
Tutoring systems
Video games
Web search engines

Chapter2 Artificial Intelligence Agents


In the previous lecture, we discussed what we will be talking about in Artificial Intelligence and why those things are important. This lecture is all about how we will be talking about AI, i.e., the language, assumptions and concepts which will be common to all the topics we cover. These notions should be considered before undertaking any large AI project. Hence, this lecture also serves to add to the systems engineering information you have/will be studying. For AI software/hardware, of course, we have to worry about which programming language to use, how to split the project into modules, etc. However, we also have to worry about higher level notions, such as: what does it mean for our program/machine to act rationally in a particular domain, how will it use knowledge about the environment, and what form will that knowledge take? All these things should be taken into consideration before we worry about actually doing any programming.

2.1 Autonomous Rational Agents


In many cases, it is inaccurate to talk about a single program or a single robot, as the combination of hardware and software in some intelligent systems is considerably more complicated. Instead, we will follow the lead of Russell and Norvig and describe AI through the autonomous, rational intelligent agents paradigm. We're going to use the definitions from chapter 2 of Russell and Norvig's textbook, starting with these two:

An agent is anything that can be viewed as perceiving its environment through sensors and acting upon that environment through effectors. A rational agent is one that does the right thing.

We see that the word 'agent' covers humans (where the sensors are the senses and the effectors are the physical body parts) as well as robots (where the sensors are things like cameras and touch pads and the effectors are various motors) and computers (where the sensors are the keyboard and mouse and the effectors are the monitor and speakers). To determine whether an agent has acted rationally, we need an objective measure of how successful it has been and we need to worry about when to make an
evaluation using this measure. When designing an agent, it is important to think hard about how to evaluate its performance, and this evaluation should be independent of any internal measures that the agent undertakes (for example as part of a heuristic search - see the next lecture). The performance should be measured in terms of how rationally the program acted, which depends not only on how well it did at a particular task, but also on what the agent experienced from its environment, what the agent knew about its environment and what actions the agent could actually undertake.

Acting Rationally

Al Capone was finally convicted for tax evasion. Were the police acting rationally? To answer this, we must first look at how the performance of police forces is viewed: arresting and convicting the people who have committed a crime is a start, but their success in getting criminals off the street is also a reasonable, if contentious, measure. Given that they didn't convict Capone for the murders he committed, they failed on that measure. However, they did get him off the street, so they succeeded there. We must also look at what the police knew and what they had experienced about the environment: they had experienced murders which they knew were undertaken by Capone, but they had not experienced any evidence which could convict Capone of the murders. However, they had evidence of tax evasion. Given the knowledge about the environment that they can only arrest if they have evidence, their actions were therefore limited to arresting Capone on tax evasion. As this got him off the street, we could say they were acting rationally. This answer is controversial, and highlights the reason why we have to think hard about how to assess the rationality of an agent before we consider building it. To summarize, an agent takes input from its environment and affects that environment. The rational performance of an agent must be assessed in terms of the task it was meant to undertake, its knowledge and experience of the environment and the actions it was actually able to undertake. This performance should be objectively measured independently of any internal measures used by the agent. In English language usage, autonomy means an ability to govern one's actions independently. In our situation, we need to specify the extent to which an agent's behavior is affected by its environment. We say that:

The autonomy of an agent is measured by the extent to which its behaviour is determined by its own experience.
At one extreme, an agent might never pay any attention to the input from its environment, in which case, its actions are determined entirely by its built-in knowledge. At the other extreme, if an agent does not initially act using its built-in knowledge, it will have to act randomly, which is not desirable. Hence, it is desirable to have a balance between complete autonomy and no autonomy. Thinking of human agents, we are born with certain reflexes which govern our actions to begin with. However, through our ability to learn from our environment, we begin to act more autonomously as a result of our experiences in the world. Imagine a baby learning to crawl around. It must use in-built information to enable it to correctly employ its arms and legs, otherwise it would just thrash around. However, as it moves, and bumps into things, it learns to avoid objects in the environment. When we leave home, we are (supposed to be) fully autonomous agents ourselves. We should expect similar of the agents we build for AI tasks: their autonomy increases in line with their experience of the environment.

2.3 Internal Structure of Agents


We have looked at agents in terms of their external influences and behaviors: they take input from the environment and perform rational actions to alter that environment. We will now look at some generic internal mechanisms which are common to intelligent agents.

Architecture and Program

The program of an agent is the mechanism by which it turns input from the environment into an action on the environment. The architecture of an agent is the computing device (including software and hardware) upon which the program operates. In this course, we mostly concern ourselves with the intelligence behind the programs, and do not worry about the hardware architectures they run on. In fact, we will mostly assume that the architecture of our agents is a computer getting input through the keyboard and acting via the monitor. As a running example, consider RHINO, an autonomous, interactive robot which acted as a tour guide in a museum. RHINO consisted of the robot itself, including the necessary hardware for locomotion (motors, etc.) and state of the art sensors, including laser, sonar, infrared and tactile sensors. RHINO also carried around three on-board PC workstations and was connected by a wireless Ethernet connection to a further three off-board SUN workstations. In total, it ran up to 25 different processes at any one time, in parallel. The program employed by RHINO was even more complicated than the architecture upon which it ran. RHINO ran software which drew upon techniques ranging from low level probabilistic reasoning and visual information processing to high level problem solving and planning using logical representations.


An agent's program will make use of knowledge about its environment and methods for deciding which action to take (if any) in response to a new input from the environment. These methods include reflexes, goal based methods and utility based methods.
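As a minimal sketch of this vocabulary, the Python fragment below shows an agent program as a function from the latest percept (plus any stored knowledge) to an action, with the surrounding loop playing the role of the architecture. All of the names (SimpleAgent, run_agent, the percept format) are invented for illustration and are not taken from the notes or from Russell and Norvig:

class SimpleAgent:
    """Minimal agent skeleton: turns percepts into actions (illustrative only)."""

    def __init__(self):
        self.world_model = {}          # knowledge about the environment
        self.history = []              # previous percepts, if we choose to store them

    def program(self, percept):
        """The agent program: decide on an action given the latest percept."""
        self.history.append(percept)
        self.world_model.update(percept)   # update knowledge of the world
        # Decision methods (reflexes, goals, utilities) would go here.
        if percept.get("obstacle"):
            return "turn"
        return "move_forward"

def run_agent(agent, percepts):
    """The 'architecture': feeds percepts to the program and collects actions."""
    return [agent.program(p) for p in percepts]

print(run_agent(SimpleAgent(), [{"obstacle": False}, {"obstacle": True}]))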

Knowledge of the Environment

We must distinguish between knowledge an agent receives through its sensors and knowledge about the world from which the input comes. Knowledge about the world can be programmed in, and/or it can be learned through the sensor input. For example, a chess playing agent would be programmed with the positions of the pieces at the start of a game, but would maintain a representation of the entire board by updating it with every move it is told about through the input it receives. Note that the sensor inputs are the opponent's moves and this is different to the knowledge of the world that the agent maintains, which is the board state. There are three main ways in which an agent can use knowledge of its world to inform its actions. If an agent maintains a representation of the world, then it can use this information to decide how to act at any given time. Furthermore, if it stores its representations of the world, then it can also use information about previous world states in its program. Finally, it can use knowledge about how its actions affect the world. The RHINO agent was provided with an accurate metric map of the museum and exhibits beforehand, carefully mapped out by the programmers. Having said this, the layout of the museum changed frequently as routes became blocked and chairs were moved. By updating its knowledge of the environment, however, RHINO consistently knew where it was, to an accuracy better than 15cm. RHINO didn't move objects other than itself around the museum. However, as it moved around, people followed it, so its actions really were altering the environment. It was because of this (and other reasons) that the designers of RHINO made sure it updated its plan as it moved around.
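As an illustrative sketch of the chess example above (the board fragment and helper below are made up, not a real chess engine), note how the percept is only the opponent's move while the knowledge the agent maintains is the whole board state:

# World knowledge: a (partial) board state, programmed in at the start.
board = {"e2": "white_pawn", "e7": "black_pawn"}   # illustrative fragment only

def apply_move(board, move):
    """Update the maintained world state from a single sensor input (a move)."""
    source, target = move              # e.g. ("e7", "e5")
    board[target] = board.pop(source)
    return board

# The percept is only the move; the agent's knowledge is the updated board.
apply_move(board, ("e7", "e5"))
print(board)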

Reflexes

If an agent decides upon and executes an action in response to a sensor input without consultation of its world, then this can be considered a reflex response. Humans flinch if they touch something very hot, regardless of the particular social situation they are in, and this is clearly a reflex action. Similarly, chess agents are programmed with lookup tables for openings and endings, so that they do not have to do any processing to choose the correct move, they simply look it up. In timed chess matches, this kind of reflex action might save vital seconds to be used in more difficult situations later.
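A reflex of this kind can be sketched as a plain lookup table: the response depends only on the current input, with no consultation of a world model. The entries below are placeholders rather than genuine opening theory:

# A reflex agent as a lookup table: input -> action, no deliberation.
opening_book = {
    "start":     "e2e4",    # placeholder entries, not real opening theory
    "e2e4 e7e5": "g1f3",
}

def reflex_move(position_key):
    """Return a canned move if the position is in the table, otherwise signal
    that a slower, deliberative method is needed."""
    return opening_book.get(position_key, "think_harder")

print(reflex_move("e2e4 e7e5"))   # instant table lookup
print(reflex_move("unknown"))     # falls through to deliberation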
Unfortunately, relying on lookup tables is not a sensible way to program intelligent agents: a chess agent would need 35^100 entries in its lookup table (considerably more entries than there are atoms in the universe). And if we remember that the world of a chess agent consists of only 32 pieces on 64 squares, it's obvious that we need more intelligent means of choosing a rational action. For RHINO, it is difficult to identify any reflex actions. This is probably because performing an action without consulting the world representation is potentially dangerous for RHINO, because people get everywhere, and museum exhibits are expensive to replace if broken!

Goals

One possible way to improve an agent's performance is to enable it to have some details of what it is trying to achieve. If it is given some representation of the goal (e.g., some information about the solution to a problem it is trying to solve), then it can refer to that information to see if a particular action will lead to that goal. Such agents are called goal-based. Two tried and trusted methods for goal-based agents are planning (where the agent puts together and executes a plan for achieving its goal) and search (where the agent looks ahead in a search space until it finds the goal). Planning and search methods are covered later in the course. In RHINO, there were two goals: get the robot to an exhibit chosen by the visitors and, when it gets there, provide information about the exhibit. Obviously, RHINO used information about its goal of getting to an exhibit to plan its route to that exhibit.

Utility Functions

A goal based agent for playing chess is infeasible: every time it decides which move to play next, it sees whether that move will eventually lead to a checkmate. Instead, it would be better for the agent to assess its progress not against the overall goal, but against a localized measure. Agents' programs often have a utility function which calculates a numerical value for each world state the agent would find itself in if it undertook a particular action. Then it can check which action would lead to the highest value being returned from the set of actions it has available. Usually the best action with respect to a utility function is taken, as this is the rational thing to do. When the task of the agent is to find something by searching, if it uses a utility function in this manner, this is known as a best-first search. RHINO searched for paths from its current location to an exhibit, using the distance from the exhibit as a utility function. However, this was complicated by visitors getting in the way.
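A utility-based choice can be sketched as scoring the state that would result from each available action and picking the best. The utility function here (negative distance to a target, loosely echoing RHINO's use of distance to an exhibit) and the grid transition model are purely illustrative:

def utility(state):
    """Higher is better; here, simply the negative distance to a target cell."""
    x, y = state
    target = (5, 5)
    return -(abs(x - target[0]) + abs(y - target[1]))

def result(state, action):
    """Hypothetical transition model: move one step on a grid."""
    x, y = state
    dx, dy = {"up": (0, 1), "down": (0, -1), "left": (-1, 0), "right": (1, 0)}[action]
    return (x + dx, y + dy)

def best_action(state, actions):
    """Pick the action whose resulting state has the highest utility."""
    return max(actions, key=lambda a: utility(result(state, a)))

print(best_action((2, 3), ["up", "down", "left", "right"]))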
2.4 Environments
We have seen that an intelligent agent should take into account certain information when choosing a rational action, including information from its sensors, information from the world, information from previous states of the world, information from its goal and information from its utility function(s). We also need to take into account some specifics about the environment it works in. On the surface, this consideration would appear to apply more to robotic agents moving around the real world. However, the considerations also apply to software agents which are receiving data and making decisions which affect the data they receive; in this case we can think of the environment as the flow of information in the data stream. For example, an AI agent may be employed to dynamically update web pages based on the requests from internet users. We follow Russell and Norvig's lead in characterizing information about the environment:

Accessibility

In some cases, certain aspects of an environment which should be taken into account in decisions about actions may be unavailable to the agent. This could happen, for instance, because the agent cannot sense certain things. In these cases, we say the environment is partially inaccessible. In this case, the agent may have to make (informed) guesses about the inaccessible data in order to act rationally. The builders of RHINO talk about "invisible" objects that RHINO had to deal with. These included glass cases and bars at various heights which could not be detected by the robotic sensors. These are clearly inaccessible aspects of the environment, and RHINO's designers took this into account when designing its programs.

Determinism

If we can determine what the exact state of the world will be after an agent's action, we say the environment is deterministic. In such cases, the state of the world after an action is dependent only on the state of the world before the action and the choice of action. If the environment is non-deterministic, then utility functions will have to make (informed) guesses about the expected state of the world after possible actions if the agent is to correctly choose the best one. RHINO's world was non-deterministic because people moved around, and they move objects such as chairs around. In fact, visitors often tried to trick the robot by setting up roadblocks with chairs. This was another reason why RHINO's plan was constantly updated.
Episodes

If an agent's current choice of action does not depend on its past actions, then the environment is said to be episodic. In non-episodic environments, the agent will have to plan ahead, because its current action will affect subsequent ones. Considering only the goal of getting to and from exhibits, the individual trips between exhibits can be seen as episodes in RHINO's actions. Once it had arrived at one exhibit, how it got there would not normally affect its choices in getting to the next exhibit. If we also consider the goal of giving a guided tour, however, RHINO must at least remember the exhibits it had already visited, in order not to repeat itself. So, at the top level, its actions were not episodic.

Static or Dynamic

An environment is static if it doesn't change while an agent's program is making the decision about how to act. When designing agents to operate in dynamic (nonstatic) environments, the underlying program may have to refer to the changing environment while it deliberates, or to anticipate the change in the environment between the time when it receives an input and when it has to take an action. RHINO was very fast in making decisions. However, because of the amount of visitor movement, by the time RHINO had planned a route, that plan was sometimes wrong because someone was now blocking the route. However, because of the speed of decision making, instead of referring to the environment during the planning process, as we have said before, the designers of RHINO chose to enable it to continually update its plan as it moved.

Discrete or Continuous

The nature of the data coming in from the environment will affect how the agent should be designed. In particular, the data may be discrete (composed of a limited number of clearly defined parts) or continuous (seemingly without discernible sections). Of course, given the nature of computer memory (in bits and bytes), even streaming video can be shoe-horned into the discrete category, but an intelligent agent will probably have to deal with this as if it is continuous. The mathematics in your agent's programs will differ depending on whether the data is taken to be discrete or continuous.

Chapter-3 Search in Problem Solving


If Artificial Intelligence can inform the other sciences about anything, it is about problem solving and, in particular, how to search for solutions to problems. Much of AI research can be explained in terms of specifying a problem, defining a search space which should contain a solution to the problem, choosing a search strategy and getting an agent to use the strategy to find a solution. If you are hired as an AI researcher/programmer, you will be expected to come armed with a battery of AI techniques, many of which we cover later in the course. However, perhaps the most important skill you will bring to the job is to effectively seek out the best way of turning some vague specifications into concrete problems requiring AI techniques. Specifying those problems in the most effective way will be vital if you want your AI agent to find the solutions in a reasonable time. In this lecture, we look at how to specify a search problem.

3.1 Specifying Search Problems


In our agent terminology, a problem to be solved is a specific task where the agent starts with the environment in a given state and acts upon the environment until the altered state has some pre-determined quality. The set of states which are possible via some sequence of actions the agent takes is called the search space. The series of actions that the agent actually performs is its search path, and the final state is a solution if it has the required property. There may be many solutions to a particular problem. If you can think of the task you want your agent to perform in these terms, then you will need to write a problem solving agent which uses search. It is important to identify the scope of your task in terms of the problems which will need to be solved. For instance, there are some tasks which are single problems solved by searching, e.g., find a route on a map. Alternatively, there are tasks such as winning at chess, which have to be broken down into sub-problems (searching for the best move at each stage). Other tasks can be achieved without searching whatsoever e.g., multiplying two large numbers together - you wouldn't dream of searching through the number line until you came across the answer! There are three initial considerations in problem solving (as described in Russell and Norvig):

Initial State

Firstly, the agent needs to be told exactly what the initial state is before it starts its search, so that it can keep track of the state as it searches.

Operators

An operator is a function taking one state to another via an action undertaken by the agent. For example, in chess, an operator takes one arrangement of pieces on the board to another arrangement by the action of the agent moving a piece.

Goal Test

It is essential when designing a problem solving agent to know when the problem has been solved, i.e., to have a well defined goal test. Suppose the problem we had set our agent was to find a name for a newborn baby, with some properties. In this case, there are lists of "accepted" names for babies, and any solution must appear in that list, so goal-checking amounts to simply testing whether the name appears in the list. In chess, on the other hand, the goal is to reach a checkmate. While there are only a finite number of ways in which the pieces on a board can represent a checkmate, the number of these is huge, so checking a position against them is a bad idea. Instead, a more abstract notion of checkmate is used, whereby our agent checks that the opponent's king cannot move without being captured.
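The two styles of goal test mentioned above, membership in a list of accepted solutions versus an abstract property of the state, can be sketched as follows (the name list is a stand-in, and the chess board interface is assumed rather than implemented):

# Goal test as simple membership in a list of accepted solutions.
accepted_names = ["DAN", "ANNA", "NADA"]          # stand-in list, not a real register

def name_goal_test(state):
    return state in accepted_names

# Goal test as an abstract property of the state (sketch only): rather than
# enumerating every checkmate position, test whether the opponent's king is
# in check and has no safe move left. The board object and its two methods
# are an assumed interface, not an implementation.
def checkmate_goal_test(board):
    return board.king_in_check() and not board.king_has_safe_move()

print(name_goal_test("DAN"))      # True
print(name_goal_test("DNA"))      # False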

3.2 General Considerations for Search


If we can specify the initial state, the operators and the goal check for a search problem, then we know where to start, how to move and when to stop in our search. This leaves the important question of how to choose which operator to apply to which state at any stage during the search. We call an answer to this question a search strategy. Before we worry about exactly what strategy to use, the following need to be taken into consideration:

Path or Artifact

Broadly speaking, there are two different reasons to undertake a search: to find an artifact (a particular state), or to find a path from one given state to another given state. Whether you are searching for a path or an artifact will affect many aspects of your agent's search, including its goal test, what it records along the way and the search strategies available to you. For example, in the maze below, the game involves finding a route from the top left hand corner to the bottom right hand corner. We all know what the exit looks like (a gap in the outer wall), so we do not search for an artifact. Rather, the point of the search is to find a path, so the agent must remember where it has been.
However, in other searches, the point of the search is to find something, and it may be immaterial how you found it. For instance, suppose we play a different game: to find an anagram of the phrase:

ELECTING NEIL
The answer is, of course: (FILL IN THIS GAP AS AN EXERCISE). In this case, the point of the search is to find an artifact - a word which is an anagram of "electing neil". No-one really cares in which order to actually re-arrange the letters, so we are not searching for a path.

Completeness

It's also worth trying to estimate the number of solutions to a problem, and the density of those solutions amongst the non-solutions. In a search problem, there may be any number of solutions, and the problem specification may involve finding just one, finding some, or finding all the solutions. For example, suppose a military application searches for routes that an enemy might take. The question: "Can the enemy get from A to B" requires finding only one solution, whereas the question: "How many ways can the enemy get from A to B" will require the agent to find all the solutions. When an agent is asked to find just one solution, we can often program it to prune its search space quite heavily, i.e., rule out particular operators at particular times to be more efficient. However, this may also prune some of the solutions, so if our agent is asked to find all of them, the pruning has to be controlled so that we know that pruned areas of the search space either contain no solutions, or contain solutions which are repeated in another (non-pruned) part of the space. If our search strategy is guaranteed to find all the solutions eventually, then we say that it is complete. Often, it is obvious that all the solutions are in the search space, but in other cases, we need to prove this fact mathematically to be sure that our space is complete. A problem with complete searches is that - while the solution is
certainly there - it can take a very long time to find the solution, sometimes so long that the strategy is effectively useless. Some people use the word exhaustive when they describe complete searches, because the strategy exhausts all possibilities in the search space.

Time and Space Tradeoffs

In practice, you are going to have to stop your agent at some stage if it has not found a solution by then. Hence, if we can choose the fastest search strategy, then this will explore more of the search space and increase the likelihood of finding a solution. There is a problem with this, however. It may be that the fastest strategy is the one which uses most memory. To perform a search, an agent needs at least to know where it is in a search space, but lots of other things can also be recorded. For instance, a search strategy may involve going over old ground, and it would save time if the agent knew it had already tried a particular path. Even though RAM capacities in computers are going steadily up, for some of the searches that AI agents are employed to undertake, they often run out of memory. Hence, as in computer science in general, AI practitioners often have to devise clever ways to trade memory and time in order to achieve an effective balance.

Soundness

You may hear in some application domains - for example automated theorem proving - that a search is "sound and complete". Soundness in theorem proving means that the search to find a proof will not succeed if you give it a false theorem to prove. This extends to searching in general, where a search is unsound if it finds a solution to a problem with no solution. This kind of unsound search may not be the end of the world if you are only interested in using it for problems where you know there is a solution (and it performs well in finding such solutions). Another kind of unsound search is when a search finds the wrong solution to a problem. This is more worrying and the problem will probably lie with the goal testing mechanism.

Additional Knowledge in Search

The amount of extra knowledge available to your agent will affect how it performs. In the following sections of this lecture, we will look at uninformed search strategies, where no additional knowledge is given, and heuristic searches, where any information about the goal, intermediate states and operators can be used to improve the efficiency of the search strategy.

3.3 Uninformed Search Strategies
To be able to undertake an uninformed search, all our agent needs to know is the initial state, the possible operators and how to check whether the goal has been reached. Once these have been described, we must then choose a search strategy for the agent: a pre-determined way in which the operators will be applied. The example we will use is the case of a genetics professor searching for a name for her newborn baby boy - of course, it must only contain the letters D, N and A. The states in this search are strings of letters (but only Ds, Ns and As), and the initial state is an empty string. Also, the operators available are: (i) add a 'D' to an existing string (ii) add an 'N' to an existing string and (iii) add an 'A' to an existing string. The goal check is possible using a book of boys names against which the professor can check a string of letters. To help us think about the different search strategies, we use two analogies. Firstly, we suppose that the professor keeps an agenda of actions to undertake, such as: add an 'A' to the string 'AN'. So, the agenda consists of pairs (S,O) of states and operators, whereby the operator is to be applied to the state. The action at the top of the agenda is the one which is carried out, then that action is removed. How actions are added to the agenda differs for each search strategy. Secondly, we think of a search graphically: by making each state a node in a graph and each operator an edge, we can think of the search progressing as movement from node to node along edges in the graph. We then allow ourselves to talk about nodes in a search space (rather than the graph) and we say that a node in a search space has been expanded if the state that node represents has been visited and searched from. Note that graphs which have no cycles in them are called trees, and many AI searches can be represented as trees.
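The professor's search problem can be written down directly in terms of the considerations from section 3.1: an initial state, operators and a goal test. A minimal sketch, with a stand-in list in place of the book of boys' names:

# Problem specification for the D/N/A name search.
initial_state = ""                              # the empty string

def operators(state):
    """Each operator appends one allowed letter to the current string."""
    return [state + letter for letter in "DNA"]

boys_names = {"DAN", "ADAN", "NADAN"}           # stand-in for the book of names

def goal_test(state):
    return state in boys_names

print(operators(""))        # ['D', 'N', 'A']
print(goal_test("DAN"))     # True

The search strategies below differ only in the order in which states produced by these operators are taken off the agenda.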

Breadth First Search

Given a set of operators o1, ..., on, in a breadth first search, every time a new state s is reached, an action for each operator on s is added to the bottom of the agenda, i.e., the pairs (s,o1), ..., (s,on) are added to the end of the agenda in that order. In our example, the first three actions on the agenda would be: 1. (empty, add 'D') 2. (empty, add 'N') 3. (empty, add 'A'). The action at the top of the agenda is carried out first, so the 'D' state is found, and the new actions involving it are added to the bottom of the agenda, so it would look like this:
4. ('D',add 'D') 5. ('D',add 'N') 6. ('D',add 'A')

However, we can remove the first agenda item as this action has been undertaken. Hence there are actually 5 actions on the agenda after the first step in the search space. Indeed, after every step, one action will be removed (the action just carried out), and three will be added, making a total addition of two actions to the agenda. It turns out that this kind of breadth first search leads to the name 'DAN' after 20 steps. Also, after the 20th step, there are 43 tasks still on the agenda to do. It's useful to think of this search as the evolution of a tree, and the diagram below shows how each string of letters is found via the search in a breadth first manner. The numbers above the boxes indicate at which step in the search the string was found.

We see that each node leads to three others, which corresponds to the fact that after every step, three more steps are put on the agenda. This is called the branching rate of a search, and seriously affects both how long a search is going to take and how much memory it will use up. Breadth first search is a complete strategy: given enough time and memory, it will find a solution if one exists. Unfortunately, memory is a big problem for breadth first search. We can think about how big the agenda grows, but in effect we are just
counting the number of states which are still 'alive', i.e., there are still steps in the agenda involving them. In the above diagram, the states which are still alive are those with fewer than three arrows coming from them: there are 14 in all. It's fairly easy to show that in a search with a branching rate of b, if we want to search all the way to a depth of d, then the largest number of states the agent will have to store at any one time is b^(d-1). For example, if our professor wanted to search for all names up to length 8, she would have to remember (or write down) 2187 different strings to complete a breadth first search. This is because she would need to remember 3^7 strings of length 7 in order to be able to build all the strings of length 8 from them. In searches with a higher branching rate, the memory requirement can often become too large for an agent's processor.
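Following the agenda analogy, breadth first search can be sketched by keeping a queue of states and always expanding the oldest one first, which is equivalent to adding new actions to the bottom of the agenda. The fragment below reuses the D/N/A example in a self-contained form, with a simple goal check standing in for the book of names and a depth cap to keep memory bounded:

from collections import deque

def successors(state):
    return [state + letter for letter in "DNA"]

def goal_test(state):
    return state == "DAN"                 # stand-in for the book of names

def breadth_first_search(start, max_depth=5):
    agenda = deque([start])               # new states go to the back of the queue
    while agenda:
        state = agenda.popleft()          # oldest state is expanded first
        if goal_test(state):
            return state
        if len(state) < max_depth:
            agenda.extend(successors(state))
    return None

print(breadth_first_search(""))           # finds 'DAN'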

Depth First Search

Depth first search is very similar to breadth first, except that things are added to the top of the agenda rather than the bottom. In our example, the first three things on the agenda would still be: 1. (empty, add 'D') 2. (empty, add 'N') 3. (empty, add 'A'). However, once the 'D' state had been found, the new actions would be added to the top of the agenda, so (with the completed action removed) it would look like this: 1. ('D', add 'D') 2. ('D', add 'N') 3. ('D', add 'A') 4. (empty, add 'N') 5. (empty, add 'A').

Of course, carrying out the action at the top of the agenda would introduce the string 'DD', but then this would cause the action: ('DD',add 'D') to be added to the top, and the next string found would be 'DDD'. Clearly, this can't go on indefinitely, and in practice, we must specify a depth limit to stop it going down a particular path forever. That is, our agent will need to record how far down a particular path it has gone, and avoid putting actions on the agenda if the state in the agenda item is past a certain depth.
Note that our search for names is special: no matter what state we reach, there will always be three actions to add to the agenda. In other searches, the number of actions available to undertake on a particular state may be zero, which effectively stops that branch of the search. Hence, a depth limit is not always required. Returning to our example, if the professor stipulated that she wanted very short names (of three or fewer letters), then the search tree would look like this:

We see that 'DAN' has been reached after the 12th step, so there is an improvement on the breadth first search. However, it was lucky in this case that the first letter explored is 'D' and that there is a solution at depth three. If the depth limit had been set at 4 instead, the tree would have looked very much different:

It looks like it will be a long time until it finds 'DAN'. This highlights an important drawback to depth first search. It can often go deep down paths which have no solutions, when there is a solution much higher up the tree, but on a different branch. Also, depth first search is not, in general, complete. Rather than simply adding the next agenda item directly to the top of the agenda, it might be a better idea to make sure that every node in the tree is fully expanded before moving on to the next depth in the search. This is the kind of depth first search which Russell and Norvig explain. For our DNA example, if we did this, the search tree would look like this:

The big advantage to depth first search is that it requires much less memory to operate than breadth first search. If we count the number of 'alive' nodes in the diagram above, it amounts to only 4, because the ones on the bottom row are not to be expanded due to the depth limit. In fact, it can be shown that if an agent wants to search for all solutions up to a depth of d in a space with branching factor b, then in a depth first search it only needs to remember up to a maximum of b*d states at any one time. To put this in perspective, if our professor wanted to search for all names up to length 8, she would only have to remember 3 * 8 = 24 different strings to complete a depth first search (rather than 2187 in a breadth first search).
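Depth first search differs from breadth first only in where new items go: they are pushed onto the top of the agenda (a stack), and a depth limit stops the search running down one branch forever. A sketch on the same D/N/A example, again with a stand-in goal check:

def successors(state):
    return [state + letter for letter in "DNA"]

def goal_test(state):
    return state == "DAN"                    # stand-in goal check

def depth_first_search(start, depth_limit):
    agenda = [start]                         # used as a stack: last in, first out
    while agenda:
        state = agenda.pop()                 # newest state is expanded first
        if goal_test(state):
            return state
        if len(state) < depth_limit:
            # reversed() so 'D' branches are tried before 'N' and 'A'
            agenda.extend(reversed(successors(state)))
    return None

print(depth_first_search("", depth_limit=3))   # finds 'DAN' within the limit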

Iterative Deepening Search

So, breadth first search is guaranteed to find a solution (if one exists), but it eats all the memory. Depth first search, however, is much less memory hungry, but not guaranteed to find a solution. Is there any other way to search the space which combines the good parts of both? Well, yes, but it sounds silly. Iterative Deepening Search (IDS) is just a series of depth first searches where the depth limit is increased by one every time. That is, an
IDS will do a depth first search (DFS) to depth 1, followed by a DFS to depth 2, and so on, each time starting completely from scratch. This has the advantage of being complete, as it covers all depths of the search tree. Also, it only requires the same memory as depth first search (obviously). However, you will have noticed that this means that it completely re-searches the entire space searched in the previous iteration. This kind of redundancy will surely make the search strategy too slow to contemplate using in practice? Actually, it isn't as bad as you might think. This is because, in a depth first search, most of the effort is spent expanding the last row of the tree, so the repetition over the top part of the tree is not a major factor. In fact, the effect of the repetition reduces as the branching rate increases. In a search with branching rate 10 and depth 5, the number of states searched is 111,111 with a single depth first search. With an iterative deepening search, this number goes up to 123,456. So, there is only a repetition of around 11%.
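Iterative deepening can be sketched as a loop around a depth-limited depth first search, restarting from scratch with a larger limit each time; the same fragment can also be used to check the repetition figures quoted above. As before, the goal check is a stand-in:

def successors(state):
    return [state + letter for letter in "DNA"]

def goal_test(state):
    return state == "DAN"                      # stand-in goal check

def depth_limited_search(start, depth_limit):
    agenda = [start]
    while agenda:
        state = agenda.pop()
        if goal_test(state):
            return state
        if len(state) < depth_limit:
            agenda.extend(reversed(successors(state)))
    return None

def iterative_deepening_search(start, max_limit=10):
    for limit in range(max_limit + 1):         # limits 0, 1, 2, ... each from scratch
        solution = depth_limited_search(start, limit)
        if solution is not None:
            return solution
    return None

print(iterative_deepening_search(""))          # 'DAN'

# Checking the repetition figures quoted above (branching rate 10, depth 5):
single_dfs = sum(10 ** i for i in range(6))                             # 111,111 states
ids_total = sum(sum(10 ** i for i in range(d + 1)) for d in range(6))   # 123,456 states
print(single_dfs, ids_total)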

Bidirectional Search

We've concentrated so far on searches where the point of the search is to find a solution, not the path to the solution. In other searches, we know the solution, and we know the initial state, but we don't know how to get from one to the other, and the point of the search is to find a path. In these cases, in addition to searching forward from the initial state, we can sometimes also search backwards from the solution. This is called a bidirectional search. For example, consider the 8-puzzle game in the diagram below, where the point of the game is to move the pieces around so that they are arranged in the right hand diagram. It's likely that in the search for the solution to this puzzle (given an arbitrary starting state), you might start off by moving some of the pieces around to get some of them in their end positions. Then, as you got closer to the solution state, you might work backwards: asking yourself, how can I get from the solution to where I am at the moment, then reversing the search path. In this case, you've used a bidirectional search.

Bidirectional search has the advantage that search in both directions is only required to go to a depth half that of normal searches, and this can often lead to a drastic reduction in the number of paths looked at. For instance, if we were looking for a path from one town to another through at most six other towns, we only have to look for a journey through three towns from both directions, which is fairly easy to do, compared to searching all paths through six towns in a normal search. Unfortunately, it is often difficult to apply a bidirectional search because (a) we don't really know the solution, only a description of it (b) there may be many solutions, and we have to choose some to work backwards from (c) we cannot reverse our operators to work backwards from the solution and (d) we have to record all the paths from both sides to see if any two meet at the same point - this may take up a lot of memory, and checking through both sets repeatedly could take up too much computing time.

3.5 Heuristic Search Strategies


Generally speaking, a heuristic search is one which uses a rule of thumb to improve an agent's performance in solving problems via search. A heuristic search is not to be confused with a heuristic measure. If you can specify a heuristic measure, then this opens up a range of generic heuristic searches which you can try to improve your agent's performance, as discussed below. It is worth remembering, however, that any rule of thumb, for instance, choosing the order of operators when applied in a simple breadth first search, is a heuristic. In terms of our agenda analogy, a heuristic search chooses where to put a (state, operator) pair on the agenda when it is proposed as a move in the state space. This choice could be fairly complicated and based on many factors. In terms of the graph analogy, a heuristic search chooses which node to expand at any point in the search. By definition, a heuristic search is not guaranteed to improve performance for a particular problem or set of problems, but such searches are implemented in the hope of improving the speed with which a solution is found and/or the quality of the solution found. In fact, we may be able to find optimal solutions, which are as good as possible with respect to some measure.

Optimality

The path cost of a solution is calculated as the sum of the costs of the actions which led to that solution. This is just one example of a measure of value on the solution
of a search problem, and there are many others. These measures may or may not be related to the heuristic functions which estimate the likelihood of a particular state being in the path to a solution. We say that - given a measure of value on the possible solutions to a search problem - one particular solution is optimal if it scores higher than all the others with respect to this measure (or costs less, in the case of path cost). For example, in the maze example given in section 3.2, there are many paths from the start to the finish of the maze, but only one which crosses the fewest squares. This is the optimal solution in terms of the distance travelled. Optimality can be guaranteed through a particular choice of search strategy (for instance the uniform path cost search described below). Alternatively, an agent can choose to prove that a solution is optimal by appealing to some mathematical argument. As a last resort, if optimality is necessary, then an agent must exhaust a complete search strategy to find all solutions, then choose the one scoring the highest (alternatively costing the lowest).

Uniform Path Cost Search

A breadth first search will find the solution with the shortest path length from the initial state to the goal state. However, this may not be the least expensive solution in terms of the path cost. A uniform path cost search chooses which node to expand by looking at the path cost for each node: the node which has cost least to get to is expanded first. Hence, if, as is usually the case, the path cost of a node increases with the path length, then this search is guaranteed to find the least expensive solution. It is therefore an optimal search strategy. Unfortunately, this search strategy can be very inefficient.
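Uniform path cost search can be sketched with a priority queue ordered by the path cost accumulated so far, so that the cheapest node is always expanded first. The small weighted graph below is made up purely for illustration:

import heapq

# A made-up weighted graph: edges[node] = [(neighbour, step_cost), ...]
edges = {
    "A": [("B", 1), ("C", 5)],
    "B": [("C", 1), ("D", 8)],
    "C": [("D", 2)],
    "D": [],
}

def uniform_cost_search(start, goal):
    frontier = [(0, start, [start])]          # (path cost so far, node, path)
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)   # cheapest node expanded first
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, step in edges[node]:
            heapq.heappush(frontier, (cost + step, neighbour, path + [neighbour]))
    return None

print(uniform_cost_search("A", "D"))          # (4, ['A', 'B', 'C', 'D'])

Note that the returned path costs 4, even though the shortest path in terms of the number of edges (A to C to D) costs 7; this is exactly the distinction drawn above between shortest path length and least path cost.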

Greedy Search

If we have a heuristic function for states, defined as above, then we can simply measure each state with respect to this measure and order the agenda items in terms of the score of the state in the item. So, at each stage, the agent determines which state scores lowest and puts agenda items on the top of the agenda which contain operators acting on that state. In this way, the most promising nodes in a search space are expanded before the less promising ones. This is a type of best first search known specifically as a greedy search. In some situations, a greedy search can lead to a solution very quickly. However, a greedy search can often go down blind alleys, which look promising to start with, but ultimately don't lead to a solution. Often the best states at the start of a search are in fact really quite poor in comparison to those further in the search space. One way to counteract this blind-alley effect is to turn off the heuristic until a proportion of the search space has been covered, so that the truly high scoring states can be
identified. Another problem with a greedy search is that the agent will have to keep a record of which states have been explored in order to avoid repetitions (and ultimately end up in a cycle), so a greedy search must keep all the agenda items it has undertaken in its memory. Also, this search strategy is not optimal, because the optimal solution may have nodes on the path which score badly for the heuristic function, and hence a non-optimal solution will be found before an optimal one. (Remember that the heuristic function only estimates the path cost from a node to a solution).

A* Search

A* search combines the best parts of uniform cost search, namely the fact that it's optimal and complete, and the best parts of greedy search, namely its speed. This search strategy simply combines the path cost function g(n) and the heuristic function h(n) by summing them to form a new heuristic measure f(n): f(n) = g(n) + h(n) Remembering that g(n) gives the path cost from the start state to state n and h(n) estimates the path cost from n to a goal state, we see that f(n) estimates the cost of the cheapest solution which passes through n. The most important aspect of A* search is that, given one restriction on h(n), it is possible to prove that the search strategy is complete and optimal. The restriction to h(n) is that it must always underestimate the cost to reach a goal state from n. Such heuristic measures are called admissible. See Russell and Norvig for proof that A* search with an admissible heuristic is complete and optimal.
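A minimal sketch of A* on a grid, using f(n) = g(n) + h(n) with the Manhattan distance as an admissible heuristic (it never overestimates the number of unit-cost steps remaining); all the names here are illustrative:

import heapq

def h(node, goal):
    """Admissible heuristic: Manhattan distance never overestimates steps needed."""
    return abs(node[0] - goal[0]) + abs(node[1] - goal[1])

def neighbours(node):
    x, y = node
    return [(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)]

def a_star(start, goal):
    frontier = [(h(start, goal), 0, start, [start])]    # (f, g, node, path)
    best_g = {start: 0}
    while frontier:
        f, g, node, path = heapq.heappop(frontier)      # lowest f = g + h first
        if node == goal:
            return path
        for nxt in neighbours(node):
            new_g = g + 1                               # each step costs 1
            if new_g < best_g.get(nxt, float("inf")):
                best_g[nxt] = new_g
                heapq.heappush(frontier, (new_g + h(nxt, goal), new_g, nxt, path + [nxt]))
    return None

print(a_star((0, 0), (2, 1)))    # a shortest path of length 3 (4 nodes)

Replacing f(n) with h(n) alone in the ordering turns this into the greedy search of the previous section; keeping g(n) alone gives uniform path cost search.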

IDA* Search

A* search is a sophisticated and successful search strategy. However, a problem with A* search is that it must keep all states in its memory, so memory is often a much bigger consideration than time in designing agents to undertake A* searches. We overcame the same problem with breadth first search by using an iterative deepening search (IDS), and we do something similar with A*. Like IDS, an IDA* search is a series of depth first searches where the depth is increased after each iteration. However, the depth is not measured in terms of the path length, as it is in IDS, but rather in terms of the A* combined function f(n) as described above. To do this, we need to define contours as regions of the search space containing states where f is below some limit for all the states.
Each node in a contour scores less than a particular value and IDA* search agents are told how much to increase the contour boundary by on each iteration. This defines the depth for successive searches. When using contours, it is useful for the function f(n) to be monotonic, i.e., f is monotonic if whenever an operator takes a state s1 to a state s2, then f(s2) >= f(s1). In other words, if the value of f never decreases along a path, then f is monotonic. As an exercise, why do we need monotonicity to ensure optimality in IDA* search?
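To see why monotonicity matters, write c(s1, s2) for the cost of the operator taking s1 to s2 (this step-cost notation is introduced only for this sketch and is not used elsewhere in these notes). If the heuristic satisfies the usual consistency condition h(s1) <= c(s1, s2) + h(s2), then

f(s_2) = g(s_2) + h(s_2) = g(s_1) + c(s_1, s_2) + h(s_2) \geq g(s_1) + h(s_1) = f(s_1),

so f never decreases along any path and the contours are properly nested, which is the key fact behind the optimality argument the exercise asks for.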

SMA* Search

IDA* search is very good from a memory point of view. In fact, it can be criticised for not using enough memory - using more memory can increase the efficiency, so really our search strategies should use all the available memory. Simplified Memory-Bounded A* search (SMA*) is a search which does just that. This is a complicated search strategy, with details given in Russell and Norvig.

Hill Climbing

As we've seen, in some problems, finding the search path from initial to goal state is the point of the exercise. In other problems, the path and the artefact at the end of the path are both important, and we often try to find optimal solutions. For a certain set of problems, the path is immaterial, and finding a suitable artefact is the sole purpose of the search. In these cases, it doesn't matter whether our agent searches down a path for 10 or 1000 steps, as long as it finds a solution in the end. For example, consider the 8-queens problem, where the task is to find an arrangement of 8 queens on a chess board such that no one can "take" another (one queen can take another if it's on the same horizontal, vertical or diagonal line). A solution to this problem is:

One way to specify this problem is with states where there are a number of queens (1 to 8) on the board, and an action is to add a queen in such a way that it can't take another. Depending on your strategy, you may find that this search requires much back-tracking, i.e., towards the end, you find that you simply can't put the last queens on anywhere, so you have to move one of the queens you put down earlier (you go back up the search tree). An alternative way of specifying the problem is that the states are boards with 8 queens already on them, and an action is a movement of one of the queens. In this case, our agent can use an evaluation function and do hill climbing. That is, it counts the number of pairs of queens where one can take the other, and only moves a queen if that movement reduces the number of pairs. When there is a choice of movements both resulting in the same decrease, the agent can choose one randomly from the choices. In the 8-queens problem, there are only 56 * 8 = 448 possible ways to move one queen, so our agent only has to calculate the evaluation function 448 times at each stage. If it only chooses moves where the situation with respect to
the evaluation function improves, it is doing hill climbing (or gradient descent if it's better to think of the agent going downhill rather than uphill). A common problem with this search strategy is local maxima: the search has not yet reached a solution, but it can only go downhill in terms of the evaluation function. For example, we might get to the stage where only two queens can take each other, but moving any queen increases this number to at least three. In cases like this, the agent can do a random re-start whereby it randomly chooses a state to start the whole process from again. This search strategy has the appeal of never needing to store more than one state at any one time (the part of the hill the agent is on). Russell and Norvig make the analogy that this kind of search is like trying to climb Mount Everest in the fog with amnesia, but they do concede that it is often the search strategy of choice for some industrial problems. Local/Global Maxima/Minima are represented in the diagram below:
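Returning to the 8-queens formulation, a small Python sketch of hill climbing may make this concrete. To keep it short, it uses the common simplification of one queen per column (so only 8 * 7 = 56 moves are considered at each step, rather than the 448 moves of the formulation above); attacking_pairs is the evaluation function being minimised:

import random

def attacking_pairs(board):
    # board[i] is the row of the queen in column i; count pairs of queens that can take each other
    pairs = 0
    for i in range(len(board)):
        for j in range(i + 1, len(board)):
            same_row = board[i] == board[j]
            same_diagonal = abs(board[i] - board[j]) == j - i
            if same_row or same_diagonal:
                pairs += 1
    return pairs

def hill_climb(n=8, max_steps=1000):
    board = [random.randrange(n) for _ in range(n)]       # a random full board to start from
    for _ in range(max_steps):
        current = attacking_pairs(board)
        if current == 0:
            return board                                  # no queen can take another: a solution
        best_move, best_value = None, current
        for col in range(n):                              # try every single-queen move
            original_row = board[col]
            for row in range(n):
                if row == original_row:
                    continue
                board[col] = row
                value = attacking_pairs(board)
                if value < best_value:                    # ties could also be broken randomly
                    best_move, best_value = (col, row), value
            board[col] = original_row
        if best_move is None:
            return None                                   # local optimum: no move improves the evaluation
        board[best_move[0]] = best_move[1]
    return None

If no move reduces the number of attacking pairs, the agent is stuck at a local optimum and a random re-start is the usual remedy.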

Simulated Annealing

One way to get around the problem of local maxima, and related problems such as ridges and plateaux in hill climbing is to allow the agent to go downhill to some extent. In simulated annealing - named because of an analogy with cooling a liquid until it freezes - the agent chooses to consider a random move. If the move improves the evaluation function, then it is always carried out. If the move doesn't improve the evaluation function, then the agent will carry out the move with some probability between 0 and 1. The probability decreases as the move gets worse in terms of the evaluation function, so really bad moves are rarely carried out. This strategy can often nudge a search out of a local maximum and the search can continue towards the global maximum.
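A sketch of the acceptance rule, reusing the attacking_pairs evaluation from the hill climbing sketch above; the starting temperature and cooling rate are illustrative guesses rather than recommended values:

import math
import random

def simulated_annealing(board, evaluate, start_temp=10.0, cooling=0.995, steps=20000):
    # evaluate(board) is the function being minimised (e.g. attacking_pairs above)
    temperature = start_temp
    current = evaluate(board)
    for _ in range(steps):
        if current == 0:
            return board
        col = random.randrange(len(board))
        new_row = random.randrange(len(board))
        old_row = board[col]
        board[col] = new_row                              # consider a random move
        new = evaluate(board)
        worsening = new - current                         # positive means the move is worse
        if worsening <= 0 or random.random() < math.exp(-worsening / temperature):
            current = new                                 # accept the move
        else:
            board[col] = old_row                          # reject it and undo
        temperature *= cooling                            # really bad moves become rarer as we cool
    return board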

Random Search
Some problems to be solved by a search agent are more creative in nature, for example, writing poetry. In this case, it is often difficult to project the word 'creative' on to a program because it is possible to completely understand why it produced an artefact, by looking at its search path. In these cases, it is often a good idea to try some randomness in the search strategy, for example randomly choosing an item from the agenda to carry out, or assigning values from a heuristic measure randomly. This may add to the creative appeal of the agent, because it makes it much more difficult to predict what the agent will do.

3.6 Assessing Heuristic Searches


Given a particular problem you want to build an agent to solve, there may be more than one way of specifying it as a search problem, more than one choice for the search strategy and different possibilities for heuristic measures. To a large extent, it is difficult to predict what the best choices will be, and it will require some experimentation to determine them. In some cases - if we calculate the effective branching rate, as described below - we can tell for sure if one heuristic measure is always being out-performed by another.

The Effective Branching Rate

Assessing heuristic functions is an important part of AI research: a particular heuristic function may sound like a good idea, but in practice give no discernible increase in the quality of the search. Search quality can be determined experimentally in terms of the output from the search, and by using various measures such as the effective branching rate. Suppose a particular problem P has been solved by search strategy S by expanding N nodes, and the solution lay at depth D in the space. Then the effective branching rate of S for P is calculated by comparing S to a uniform search U. An example of a uniform search is a breadth first search where the number of branches from any node is always the same (as in our baby naming example). We then suppose the (uniform) branching rate of U is such that, on exhausting its search to depth D, it too would have expanded exactly N nodes. This imagined branching rate, written b*, is the effective branching rate of S and is calculated thus: N = 1 + b* + (b*)^2 + ... + (b*)^D. Rearranging this equation will provide a value for b*. For example (taken from Russell and Norvig), suppose S finds a solution at depth 5 having expanded 52 nodes. In this case: 52 = 1 + b* + (b*)^2 + ... + (b*)^5.
and it turns out that b* = 1.91. To calculate this, we use the well known mathematical identity for the sum of a geometric series:

1 + b* + (b*)^2 + ... + (b*)^D = ((b*)^(D+1) - 1) / (b* - 1)
This enables us to write a polynomial for which b* is a zero, and we can solve this using numerical techniques such as Newton's method. It is usually the case that the effective branching rate of a search strategy is similar over all the problems it is used for, so that it is acceptable to average b* over a small set of problems to give a valid account. If a heuristic search has a branching rate near to 1, then this is a good sign. We say that one heuristic function h1 dominates another h2 if the search using h1 always has a lower effective branching rate than h2. Having a lower effective branching rate is clearly desirable because it means a quicker search.
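Newton's method works well here; an even simpler bisection (a sketch, not from the original notes) is also enough, since the polynomial is monotone in b above 1:

def effective_branching_rate(nodes_expanded, depth, tolerance=1e-6):
    # solve 1 + b + b**2 + ... + b**depth = nodes_expanded for b by bisection;
    # assumes nodes_expanded > depth + 1, so the root lies above 1
    def total(b):
        return sum(b ** i for i in range(depth + 1))
    low, high = 1.0, float(nodes_expanded)
    while high - low > tolerance:
        mid = (low + high) / 2
        if total(mid) < nodes_expanded:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# For the example in the text: 52 nodes expanded, solution at depth 5
print(round(effective_branching_rate(52, 5), 2))    # roughly 1.91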


Chapter-4 Knowledge Representation


To recap, we now have some characterizations of AI, so that when an AI problem arises, you will be able to put it into context, find the correct techniques and apply them. We have introduced the agents language so that we can talk about intelligent tasks and how to carry them out. We have also looked at search in the general case, which is central to AI problem solving. Most pieces of software have to deal with data of some type, and in AI we use the more grandiose title of "knowledge" to stand for data including (i) facts, such as the temperature of a patient (ii) procedures, such as how to treat a patient with a high temperature and (iii) meaning, such as why a patient with a high temperature should not be given a hot bath. Accessing and utilizing all these kinds of information will be vital for an intelligent agent to act rationally. For this reason, knowledge representation is our final general consideration before we look at particular problem types. To a large extent, the way in which you organize information available to and generated by your intelligent agent will be dictated by the type of problem you are addressing. Often, the best ways of representing knowledge for particular techniques are known. However, as with the problem of how to search, you will need a lot of flexibility in the way you represent information. Therefore, it is worth looking at four general schemes for representing knowledge, namely logic, semantic networks, production rules and frames. Knowledge representation continues to be a much-researched topic in AI because of the realization fairly early on that how information is arranged can often make or break an AI application.

4.1 Logical Representations


If all human beings spoke the same language, there would be a lot less misunderstanding in the world. The problem with software engineering in general is that there are often slips in communication which mean that what we think we've told an agent and what we've actually told it are two different things. One way to reduce this, of course, is to specify and agree upon some concrete rules for the language we use to represent information. To define a language, we need to specify the syntax of the language and the semantics. To specify the syntax of a language, we must say what symbols are allowed in the language and what are legal constructions (sentences) using those symbols. To specify the semantics of a language, we must say how the legal sentences are to be read, i.e., what they mean. If we choose a particular well defined language and stick to it, we are using a logical representation.
Certain logics are very popular for the representation of information, and range in terms of their expressiveness. More expressive logics allow us to translate more sentences from our natural language (e.g., English) into the language defined by the logic. Some popular logics are:

Propositional Logic

This is a fairly restrictive logic, which allows us to write sentences about propositions - statements about the world - which can either be true or false. The symbols in this logic are (i) capital letters such as P, Q and R which represent propositions such as: "It is raining" and "I am wet", (ii) connectives which are: and (∧), or (∨), implies (→) and not (¬), (iii) brackets and (iv) T which stands for the proposition "true", and F which stands for the proposition "false". The syntax of this logic is the set of rules specifying where in a sentence the connectives can go, for example ∧ must go between two propositions, or between a bracketed conjunction of propositions, etc. The semantics of this logic are rules about how to assign truth values to a sentence if we know whether the propositions mentioned in the sentence are true or not. For instance, one rule is that the sentence P ∧ Q is true only in the situation when both P and Q are true. The rules also dictate how to use brackets. As a very simple example, we can represent the knowledge in English that "I always get wet and annoyed when it rains" as:
It is raining → (I am wet ∧ I am annoyed)

Moreover, if we program our agent with the semantics of propositional logic, then if at some stage, we tell it that it is raining, it can infer that I will get wet and annoyed.
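As a tiny illustration of these semantics (a sketch, not part of the original notes), we can check the inference by brute force in Python: enumerate every assignment of truth values to the three propositions and confirm that "I am wet" holds in every world where the knowledge base - the rule above plus the fact that it is raining - is true:

from itertools import product

def implies(p, q):
    return (not p) or q

# In every world (truth assignment) where the knowledge base holds -
# the rule "raining -> (wet and annoyed)" together with the fact "raining" -
# check that "wet" also holds, i.e. that "wet" is entailed.
entailed = all(
    wet
    for raining, wet, annoyed in product([True, False], repeat=3)
    if raining and implies(raining, wet and annoyed)
)
print(entailed)    # True: the agent can indeed infer that I will get wet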

First Order Predicate Logic

This is a more expressive logic because it builds on propositional logic by allowing us to use constants, variables, predicates, functions and quantifiers in addition to the connectives we've already seen. For instance, the sentence: "Every Monday and Wednesday I go to John's house for dinner" can be written in first order predicate logic as:
∀X ((day_of_week(X, monday) ∨ day_of_week(X, wednesday)) → (go_to(me, house_of(john)) ∧ eat_meal(me, dinner))).

Here, the symbols monday, wednesday, me, dinner and john are all constants: base-level objects in the world about which we want to talk. The symbols day_of_week, go_to and eat_meal are predicates which represent relationships between the arguments which appear inside the brackets. For example in eat_meal, the relationship specifies that a person (first argument) eats a particular meal (second argument). In this case, we have represented the fact that me eats dinner. The symbol X is a variable, which can take on a range of values. This enables us to be more expressive, and in particular, we can quantify X with the 'forall' symbol ∀, so that our sentence of predicate logic talks about all possible X's. Finally, the symbol house_of is a function, and - if we can - we are expected to replace house_of(john) with the output of the function (john's house) given the input to the function (john). The syntax and semantics of predicate logic are covered in more detail as part of the lectures on automated reasoning.

Higher Order Predicate Logic

In first order predicate logic, we are only allowed to quantify over objects. If we allow ourselves to quantify over predicate or function symbols, then we have moved up to the more expressive higher order predicate logic. This means that we can represent meta-level information about our knowledge, such as "For all the functions we've specified, they return the number 10 if the number 7 is input": ∀f, (f(7) = 10).

Fuzzy Logic

In the logics described above, we have been concerned with truth: whether propositions and sentences are true. However, with some natural language statements, it's difficult to assign a "true" or "false" value. For example, is the sentence: "Prince Charles is tall" true or false? Some people may say true, and others false, so there's an underlying probability that we may also want to represent. This can be achieved with so-called "fuzzy" logics. The originator of fuzzy logics, Lotfi Zadeh, advocates not thinking about particular fuzzy logics as such, but rather thinking of the "fuzzification" of current theories, and this is beginning to play a part in AI. The combination of logics with theories of probability, and programming agents to reason in the light of uncertain knowledge are important areas of AI research. Various representation schemes such as Stochastic Logic Programs have an aspect of both logic and probability.

Other logics
Other logics you may consider include:

Multiple valued logics, where different truth values such as "unknown" are allowed. These have some of the advantages of fuzzy logics, without necessarily worrying about probability.

Modal logics, which cater for individual agents' beliefs about the world. For example, one agent could believe that a certain statement is true, but another may not. Modal logics help us deal with statements that may be believed to be true by some, but not all, agents.

Temporal logics, which enable us to write sentences involving considerations of time, for example that a statement may become true some time in the future.

It's not difficult to see why logic has been a very popular representation scheme in AI:

It's fairly easy to represent knowledge in this way.
It allows us to be expressive enough to represent most knowledge, while being constrained enough to be precise about that knowledge.
There are whole branches of mathematics devoted to the study of it.
We get a lot of reasoning for free (theorems can be deduced about information in a logical representation and patterns can be similarly induced).
Some programming languages grew from logical representations, in particular Prolog. So, if you understand the logic, it's fairly easy to write programs.


Chapter-5 Game Playing


We have now dispensed with the necessary background material for AI problem solving techniques, and we can move on to looking at particular types of problems which have been addressed using AI techniques. The first type of problem we'll look at is getting an agent to compete, either against a human or another artificial agent. This area has been extremely well researched over the last 50 years. Indeed, some of the first chess programs were written by Alan Turing, Claude Shannon and other fore-fathers of modern computing. We only have one lecture to look at this topic, so we'll restrict ourselves to looking at two person games such as chess played by software agents. If you are interested in games involving more teamwork and/or robotics, then a good place to start would be with the Robo Cup project.

5.1 MinMax Search

Parents often get two children to share a cake fairly by asking one to cut the cake and the other to choose which half they want to eat. In this two player cake-scoffing game, there is only one move (cutting the cake), and player one soon learns that if he wants to maximize the amount of cake he gets, he had better cut the cake into equal halves, because his opponent is going to try and minimize the cake that player 1 gets by choosing the biggest half for herself. Suppose we have a two player game where the winner scores a positive number at the end, and the loser scores nothing. In board games such as chess, the score is usually just 1 for a win and 0 for a loss. In other games such as poker, however, one player wins the (cash) amount that the other player loses. These are called zero-sum games, because when you add one player's winnings to the other player's loss, the sum is zero. The minimax algorithm is so called because it assumes that you and your opponent are going to act rationally, and so you will choose moves to try to maximise your final score and your opponent will choose moves to try to minimise your final score. To demonstrate the minimax algorithm, it is helpful to have a game where the search tree is fairly small. For this reason, we will invent the following very trivial game:

Take a pack of cards and deal out four cards face up. Two players take it in turn to choose a card each until they have two each. The object is to choose two cards so
that they add up to an even number. The winner is the one with the largest even number n (picture cards all count as 10), and the winner scores n. If both players get the same even number, it is a draw, and they both score zero.

Suppose the cards dealt are 3, 5, 7 and 8. We are interested in which card player one should choose first, and the minimax algorithm can be used to decide this for us. To demonstrate this, we will draw the entire search tree and put the scores below the final nodes on paths which represent particular games.

Our aim is to write the best score on the top branches of the tree that player one can guarantee to score if he chooses that move. To do this, starting at the bottom, we will write the final scores on successively higher branches on the search tree until
we reach the top. Whenever there is a choice of scores to write on a particular branch, we will assume that player two will choose the card which minimises player one's final score, and player one will choose the card which maximises his/her score. Our aim is to move the scores all the way up the graph to the top, which will enable player one to choose the card which leads to the best guaranteed score for the overall game. We will first write the scores on the edges of the tree in the bottom two branches:

Now we want to move the scores up to the next level of branches in the tree. However, there is a choice. For example, for the first branch on the second row, we could write either 10 or -12. This is where our assumption about rationality comes into account. We should write 10 there, because, supposing that player two has actually chosen the 5, then player one can choose either 7 or 8. Choosing 7 would result in a score of 10 for player 1, choosing 8 would result in a score of -12. Clearly, player 1 would choose the 7, so the score we write on this branch is 10. Hence, we should choose the maximum of the scores to write on the edges in the row above. Doing the same for all the other branches, we get the following:
Finally, we want to put the scores on the top edges in the tree. Again, there is a choice. However, in this case, we have to remember that player two is making the choices, and they will act in order to minimise the score that player 1 gets. Hence, in the case when player one chooses the 3 card, player 2 will choose the 7 to minimise the score player 1 can get. Hence, we choose the minimum possibility of the three to put on the edges at the top of the tree as follows:

To choose the correct first card, player one simply looks at the topmost edges of the final tree and chooses the one with the highest score. In this case, choosing the 7 will guarantee player one scores 10 in this game (assuming that player one chooses according to the minimax strategy for move 2, but - importantly - making no assumptions about how player two will choose). Note that the process above was in order for player one to choose his/her first move. The whole process would need to be repeated for player two's first move, and player one's second move, etc. In general, agents playing games using a minimax search have to calculate the best move at each stage using a new minimax search. Don't forget that just because an agent thinks their opponent will act rationally, doesn't mean they will, and hence they cannot assume a player will make a particular move until they have actually done it.
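For completeness, here is a small Python sketch (not from the original notes) of the minimax calculation for this card game; scores are from player one's point of view, with player two's winnings recorded as a negative number, as in the tree described above:

def final_score(p1_cards, p2_cards):
    # player one's score at the end of the game; player two's winnings count as negative
    s1, s2 = sum(p1_cards), sum(p2_cards)
    s1 = s1 if s1 % 2 == 0 else 0         # an odd total counts for nothing
    s2 = s2 if s2 % 2 == 0 else 0
    if s1 > s2:
        return s1
    if s2 > s1:
        return -s2
    return 0                              # a draw scores zero

def minimax(remaining, p1_cards, p2_cards, p1_to_move):
    if not remaining:
        return final_score(p1_cards, p2_cards)
    results = []
    for i, card in enumerate(remaining):
        rest = remaining[:i] + remaining[i + 1:]
        if p1_to_move:
            results.append(minimax(rest, p1_cards + [card], p2_cards, False))
        else:
            results.append(minimax(rest, p1_cards, p2_cards + [card], True))
    return max(results) if p1_to_move else min(results)   # player one maximises, player two minimises

cards = [3, 5, 7, 8]
values = {card: minimax(cards[:i] + cards[i + 1:], [card], [], False)
          for i, card in enumerate(cards)}
print(values)   # {3: 8, 5: 8, 7: 10, 8: -10} - choosing the 7 guarantees player one a score of 10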

5.2 Cutoff Search


To use a minimax search in a game playing situation, all we have to do is program our agent to look at the entire search tree from the current state of the game, and choose the minimax solution before making a move. Unfortunately, only in very trivial games such as the one above is it possible to calculate the minimax answer all the way from the end states in a game. So, for games of higher complexity, we are forced to estimate the minimax choice for world states using an evaluation function. This is, of course, a heuristic function such as those we discussed in the lecture on search. In a normal minimax search, we write down the whole search space and then propagate the scores from the goal states to the top of the tree so that we can choose the best move for a player. In a cutoff search, however, we write down the whole search space up to a specific depth, and then write down the evaluation function for each of the states at the bottom of the tree. We then propagate these values from the bottom to the top in exactly the same way as minimax. The depth is chosen in advance to ensure that the agent doesn't take too long to choose a move: if it has longer, then we allow it to go deeper. If our agent has a given time limit for each move, then it makes sense to enable it to carry on searching until the time runs out. There are many ways to do the search in such a way that a game playing agent searches as far as possible in the time available. As an exercise, what possible ways can you think of to perform this search? It is important to bear in mind that the point of the search is not to find a node in the above graph, but to determine which move the agent should make.

Evaluation Functions

Evaluation functions estimate the score that can be guaranteed if a particular world state is reached. In chess, such evaluation functions have been known long before computers came along. One such function simply counts the number of pieces on the board for a particular player. A more sophisticated function scores more for the more influential pieces such as rooks and queens: each pawn is worth 1, knights and bishops score 3, rooks score 5 and queens score 9. These scores are used in a weighted linear function, where the number of pieces of a certain type is multiplied by a weight, and all the products are added up. For instance, if in a particular board state, player one has 6 pawns, 1 bishop, 1 knight, 2 rooks and 1 queen, then the evaluation function, f for that board state, B, would be calculated as follows: f(B) = 1*6 + 3*1 + 3*1 + 5*2 + 9*1 = 31 The numbers in bold are the weights in this evaluation function (i.e., the scores assigned to the pieces). Ideally, evaluation functions should be quick to calculate. If they take a long time to calculate, then
less of the space will be searched in a given time limit. Ideally, evaluation functions should also match the actual score in goal states. Of course, this isn't true for our weighted linear function in chess, because goal states only score 1 for a win and 0 for a loss. In fact, we don't need the match to be exact - we can use any values for an evaluation function, as long it scores more for better board states. A bad evaluation function can be disastrous for a game playing agent. There are two main problems with evaluation functions. Firstly, certain evaluation functions only make sense for game states which are quiescent. A board state is quiescent for an evaluation function, f, if the value of f is unlikely to exhibit wild swings in the near future. For example, in chess, board states such as one where a queen is threatened by a pawn, where one piece can take another without a similar valued piece being taken back in the next move are not quiescent for evaluation functions such as the weighted linear evaluation function mentioned above. To get around this problem, we can make an agent's search more sophisticated by implementing a quiescence search, whereby, given a non-quiescent state we want to evaluate the function for, we expand that game state until a quiescent state is reached, and we take the value of the function for that state. If quiescent positions are much more likely to occur than non-quiescent positions in a search, then such an extension to the search will not slow things down too much. In chess, a search strategy may choose to delve further into the space whenever a queen is threatened to try to avoid the quiescent problem. It is also worth bearing in mind the horizon problem, where a game-playing agent cannot see far enough into the search space. An example of the horizon problem given in Russell and Norvig is the case of promoting a pawn to a queen in chess. In the board state they present, this can be forestalled for a certain number of moves, but is inevitable. However, with a cutoff search at a certain depth, this inevitability cannot be noticed until too late. It is likely that the agent trying to forestall the move would have been better off doing something else with the moves it had available. In the card game example above, game states are collections of cards, and a possible evaluation function would be to add up the card values and take that if it was an even number, but score it as zero if the sum is an odd number. This evaluation function matches exactly with the actual scores in goal states, but is perhaps not such a good idea. Suppose the cards dealt were: 10, 3, 7 and 9. If player one was forced to cutoff the search after only the first card choice, then the cards would score: 10, 0, 0 and 0 respectively. So player one would choose card 10, which would be disastrous, as this will inevitably lead to player one losing that game by at least twelve points. If we scale the game up to choosing cards from 40 rather than 4, we can see that a more sophisticated heuristic involving the cards left unchosen might be a better idea.
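A minimal Python sketch of the weighted linear chess evaluation described above; the piece names and the subtraction of the opponent's material are conventions assumed here rather than taken from the text:

PIECE_WEIGHTS = {"pawn": 1, "knight": 3, "bishop": 3, "rook": 5, "queen": 9}

def material_evaluation(own_pieces, opponent_pieces):
    # weighted linear function: sum of weight * count for our pieces,
    # minus the same sum for the opponent's pieces
    def material(counts):
        return sum(PIECE_WEIGHTS[piece] * number for piece, number in counts.items())
    return material(own_pieces) - material(opponent_pieces)

# The board state from the text: 6 pawns, 1 bishop, 1 knight, 2 rooks and 1 queen
player_one = {"pawn": 6, "bishop": 1, "knight": 1, "rook": 2, "queen": 1}
print(material_evaluation(player_one, {}))    # 31, matching f(B) above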

5.3 Pruning
Recall that pruning a search space means deciding that certain branches should not be explored. If an agent knows for sure that exploring a certain branch will not affect its choice for a particular move, then that branch can be pruned with no concern at all (i.e., no effect on the outcome of the search for a

move), and the speed up in search may mean that extra depths can be searched. When using a minimax approach, either for an entire search tree or in a cutoff search, there are often many branches which can be pruned because we find out fairly quickly that the best value down a whole branch is not as good as the best value from a branch we have already explored. Such pruning is called alpha-beta pruning. As an example, suppose that there are four choices for player one, called moves M1, M2, M3 and M4, and we are looking only two moves ahead (1 for player one and 1 for player two). If we do a depth first search for player one's move, we can work out the score they are guaranteed for M1 before even considering move M2. Suppose that it turns out that player one is guaranteed to score 10 with move M1. We can use this information to reject move M2 without checking all the possibilities for player two's move. For instance, suppose that the first choice possible for player two after M2 from player one means that player one will score only 5 overall. In this case, we know that the maximum player one can score with M2 is 5 or less. Of course, player one won't choose this, because M1 will score 10 for them. We see that there's no point checking all the other possibilites for M2. This can be seen in the following diagram (ignore the X's and N's for the time being):

We see that we could reject M2 straight away, thus saving ourselves 3 nodes in the search space. We could reject M3 after we came across the 9, and in the end M4 turns out to be better than M1 for player one. In total, using alpha-beta pruning, we avoided looking at 5 end nodes out of 16 - around 30%. If the calculation to assess the scores at end-game states (or estimate them with an evaluation function) is computationally expensive, then this saving could enable a much larger search. Moreover, this kind of pruning can occur anywhere on the tree. The general principles are that:
1. Given a node N which can be chosen by player one, then if there is another node, X, along any path, such that (a) X can be chosen by player two (b) X is on a higher level than N and (c) X has been shown to guarantee a worse score for player one than N, then all the nodes with the same parent as N can be pruned.


2. Given a node N which can be chosen by player two, then if there is a node X along any path such that (a) player one can choose X (b) X is on a higher level than N and (c) X has been shown to guarantee a better score for player one than N, then all the nodes with the same parent as N can be pruned.

As an exercise: which of these principles did we use in the M1 - M4 pruning example above? (To make it easy, I've written on the N's and X's). Because we can prune using the alpha-beta method, it makes sense to perform a depth-first search using the minimax principle. Compared to a breadth first search, a depth first search will get to goal states quicker, and this information can be used to determine the scores guaranteed for a player at particular board states, which in turn is used to perform alpha-beta pruning. If a game-playing agent used a breadth first search instead, then only right at the end of the search would it reach the goal states and begin to perform minimax calculations. Hence, the agent would miss much potential to perform pruning. Using a depth first search and alpha-beta pruning is fairly sensitive to the order in which we try operators in our search. For example above, if we had chosen to look at move M4 first, then we would have been able to do more pruning, due to the higher minimum value (11) from that branch. Often, it is worth spending some time working out how best to order a set of operators, as this will greatly increase the amount of pruning that can occur. It's obvious that a depth-first minimax search with alpha-beta pruning dominates minimax search alone. In fact, if the effective branching rate of a normal minimax search was b, then utilising alpha-beta pruning will reduce this rate to roughly √b. In chess, this means that the effective branching rate reduces from 35 to around 6, meaning that alpha-beta search can look roughly twice as many moves ahead as a normal minimax search with cutoff in the same time.
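As a hedged sketch tying the last two sections together, here is depth-first minimax with alpha-beta pruning and a depth cutoff in Python; successors(state, maximising) and evaluate(state) are assumed, game-specific functions rather than anything defined in these notes:

def alpha_beta(state, depth, alpha, beta, maximising, successors, evaluate):
    # successors(state, maximising) and evaluate(state) are assumed, game-specific functions
    if depth == 0:
        return evaluate(state)                 # cutoff reached: fall back on the evaluation function
    children = successors(state, maximising)
    if not children:
        return evaluate(state)                 # end of the game: evaluate should return the true score
    if maximising:
        value = float("-inf")
        for child in children:
            value = max(value, alpha_beta(child, depth - 1, alpha, beta, False, successors, evaluate))
            alpha = max(alpha, value)
            if alpha >= beta:
                break                          # prune: player two will never let the game reach here
        return value
    else:
        value = float("inf")
        for child in children:
            value = min(value, alpha_beta(child, depth - 1, alpha, beta, True, successors, evaluate))
            beta = min(beta, value)
            if alpha >= beta:
                break                          # prune: player one already has a better option elsewhere
        return value

# A typical top-level call (hypothetical):
# alpha_beta(current_state, 4, float("-inf"), float("inf"), True, successors, evaluate)

The two break statements are exactly the pruning principles listed above: a branch is abandoned as soon as it is clear the opponent would never allow the game to reach it.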


Chapter-6 First-Order Logic


6.1 There's Reasoning, and then There's Reasoning
As humans, we have always prided ourselves on our ability to think things through: to reason things out and come to the only conclusion possible in a Sherlock Holmes kind of way. But what exactly do we mean by "reasoning" and can we automate this process? We can take Sherlock Holmes as a case study for describing different types of reasoning. Suppose after solving another major case, he says to Dr. Watson: "It was elementary my dear Watson. The killer always left a silk glove at the scene of the murder. That was his calling card. Our investigations showed that only three people have purchased such gloves in the past year. Of these, Professor Doolally and Reverend Fisheye have iron-clad alibis, so the murderer must have been Sergeant Heavyset. When he tried to murder us with that umbrella, we knew we had our man." At least five types of reasoning can be identified here.

Firstly, how do we know that the killer always left a silk glove at the murder scene? Well, this is because Holmes has observed a glove at each of the murders and basically guessed that they have something to do with the murder simply by always being there. This type of reasoning is called inductive reasoning, where a hypothesis has been induced from some data. We will cover this in the lectures on machine learning. Secondly, Holmes used abductive reasoning to dredge from his past experience the explanation that the gloves are left by the murderer as a calling card. We don't really cover abductive reasoning in general on this course, unfortunately.

Thirdly, Sherlock tracked down the only three people who bought the particular type of glove left at the scene. This can be seen - perhaps quite loosely - as model generation, which plays a part in the reasoning process. Models are usually generated to prove existence of them, or often to disprove a hypothesis, by providing a counterexample to it. We cover model generation in brief detail.

Fourthly, Sherlock managed to obtain alibis for two suspects, but not for the third. Hence, he ruled out two possibilities leaving only one. This can be seen as constraint-based reasoning, and we will cover this in the lecture on constraint solving.


Finally, Sherlock had two pieces of knowledge about the world, which he assumed were true: (i) the killer leaves a silk glove at the murder scene (ii) the only person who could have left a glove was Sergeant Heavyset. Using this knowledge, he used deductive reasoning to infer the fact that the killer must be Heavyset himself. It's so obvious that we hardly see it as a reasoning step, but it is one: it's called using the Modus Ponens rule of inference, which we cover in the lectures on automated reasoning following this one.

As an aside, it's worth pointing out that - presumably for heightened tension - in most Sherlock Holmes books, the murderer confesses, either by sobbing into a cup of tea and coming quietly, or by trying to kill Holmes, Watson, the hapless inspector Lestrade or all three. This means that the case never really has to go to trial. Just once, I'd like to see the lawyers get involved, and to see the spectacle of Holmes trying to justify his reasoning. This could be disastrous as all but his deductive reasoning was unsound. Imagine a good lawyer pointing out that all five victims happened - entirely coincidentally - to be members of the silk glove appreciation society..... Automating Reasoning is a very important topic in AI, which has received much attention, and has found applications in the verification of hardware and software configurations, amongst other areas. The topic known as "Automated Reasoning" in AI concentrates mostly on deductive reasoning, where new facts are logically deduced from old ones. It is important to remember that this is only one type of reasoning, and there are many others. In particular, in our lectures on machine learning later, we cover the notion of inductive reasoning, where new facts are guessed at, using empirical evidence. Automated Reasoning is, at present, mostly based on how we wish we reasoned: logically, following prescribed rules to start from a set of things we know are true (called axioms), and end with new knowledge about our world. The way we actually reason is much more sloppy: we use creativity, refer to previous examples, perform analogies, wait for divine inspiration, and so on. To make this more precise, we say that automated reasoning agents are more formal in their reasoning than humans. The formal approach to reasoning has advantages and disadvantages. In general, if a computer program has proved something fairly complex (for instance that a circuit board functions as specified), then people are more happy to accept the proof than one done by a human. This is because there is much less room for error in a well-written automated reasoning program. On the other hand, by being less formal, humans can often skip around the search space much more efficiently and prove more complicated results. Humans are still much more gifted at deducing things than computers are likely to be any time soon. In order to understand how AI researchers gave agents the ability to reason, we first look at how information about the world is represented using first-order logic. This will lead us into the programming language Prolog, and we will use Prolog to demonstrate a simple but effective type of AI program known as an expert system.

6.2 Syntax and Semantics



Propositional logic is restricted in its expressiveness: it can only represent true and false facts about the world. By extending propositional logic to first-order logic - also known as predicate logic and first order predicate logic - we enable ourselves to represent much more information about the world. Moreover, as we will see in the next lecture, first-order logic enables us to reason about the world using rules of deduction. We will think about first-order logic as simply a different language, like French or German. We will need to be able to translate sentences from English to first-order logic, in order to give our agent information about the world. We will also need to be able to translate sentences from first-order logic into English, so that we understand what our agent has deduced from the facts we gave it. To do this, we will look at the combinations of symbols we are allowed to use in first-order logic (the syntax of the language). We will also determine how we assign meaning to the sentences in the language (the semantics), and how we translate from one language to another, i.e., English to Logic and vice-versa.

Predicates

First and foremost in first-order logic sentences, there are predicates. These are indications that some things are related in some way. We call the things which are related by a predicate the arguments of the predicate, and the number of arguments which are related is called the arity of the predicate. The following are examples of predicates:
lectures_ai(simon)                 ("simon lectures AI")             arity is 1 here
father(bob, bill)                  ("bob is bill's father")          arity is 2 here
lives_at(bryan, house_of(jack))    ("bryan lives at jack's house")   arity is 2 here

Connectives

We can string predicates together into a sentence by using connectives in the same way that we did for propositional logic. We call a set of predicates strung together in the correct way a sentence. Note that a single predicate can be thought of as a sentence. There are five connectives in first-order logic. First, we have "and", which we write ∧, and "or", which we write ∨. These connect predicates together in the obvious ways. So, if we wanted to say that "Simon lectures AI and Simon lectures bioinformatics", we could write: lectures_ai(simon) ∧ lectures_bioinformatics(simon)

Note also, that now we are talking about different lectures, it might be a good idea to change our
choice of predicates, and make ai and bioinformatics constants: lectures(simon, ai) ∧ lectures(simon, bioinformatics)

The other connectives available to us in first-order logic are (a) "not", written ¬, which negates the truth of a predicate (b) "implies", written →, which can be used to say that one sentence being true follows from another sentence being true, and (c) "if and only if" (also known as "equivalence"), written ↔, which can be used to state that the truth of one sentence is always the same as the truth of another sentence. For instance, if we want to say that "if Simon isn't lecturing AI, then Bob must be lecturing AI", we could write it thus:
¬lectures(simon, ai) → lectures(bob, ai)

The things which predicates relate are terms: these may be constants, variables or the output from functions.

Constants

Constants are things which cannot be changed, such as england, black and barbara. They stand for one thing only, which can be confusing when the constant is something like blue, because we know there are different shades of blue. If we are going to talk about different shades of blue in our sentences, however, then we should not have made blue a constant, but rather used shade_of_blue as a predicate, in which we can specify some constants, such as navy_blue, aqua_marine and so on. When translating a sentence into first-order logic, one of the first things we must decide is what objects are to be the constants. One convention is to use lower-case letters for the constants in a sentence, which we also stick to.

Functions

Functions can be thought of as special predicates, where we think of all but one of the arguments as input and the final argument as the output. For each set of things which are classed as the input to a function, there is exactly one output to which they are related by the function. To make it clear that we are dealing with a function, we can use an equality sign. So, for example, if we wanted to say that the cost of an omelette at the Red Lion pub is five pounds, the normal way to express it in first-order logic would probably be:
cost_of(omelette, red_lion, five_pounds)

However, because we know this is a function, we can make this clearer:


cost_of(omelette, red_lion) = five_pounds

Because we know that there is only one output for every set of inputs to a function, we allow ourselves to use an abbreviation when it would make things clearer. That is, we can talk about the output from a function without explicitly writing it down, but rather replacing it with the left hand side of the equation. So, for example, if we wanted to say that the price of omelettes at the Red Lion is less than the price of pancakes at the House Of Pancakes, we would normally write something like this:
cost_of(omelette, red_lion) = X ∧ cost_of(pancake, house_of_pancakes) = Y ∧ less_than(X,Y).

This is fairly messy, and involves variables (see next subsection). However, allowing ourselves the abbreviation, we can write it like this:
less_than(cost_of(omelette, red_lion), cost_of(pancake, house_of_pancakes))

which is somewhat easier to follow.

Variables and Quantifiers

Suppose now that we wanted to say that there is a meal at the Red Lion which costs only 3 pounds. If we said that cost_of(meal, red_lion) = three_pounds, then this states that a particular meal (a constant, which we've labeled meal) costs 3 pounds. This does not exactly capture what we wanted to say. For a start, it implies that we know exactly which meal it is that costs 3 pounds, and moreover, the landlord at the Red Lion chose to give this the bizarre name of "meal". Also, it doesn't express the fact that there may be more than one meal which costs 3 pounds. Instead of using constants in our translation of the sentence "there is a meal at the Red Lion costing 3 pounds", we should have used variables. If we had replaced meal with something which reflects the fact that we are talking about a generic, rather than a specific meal, then things would have been clearer. When a predicate relates something that could vary (like our meal), we call these things variables, and represent them with an upper-case word or letter. So, we should have started with something like
meal(X) ∧ cost_of(red_lion, X) = three_pounds,

which reflects the fact that we're talking about some meal at the Red Lion, rather than a particular one. However, this isn't quite specific enough. We need to tell the reader of our translated sentence something more about our beliefs concerning the variable X. In this case, we need to tell the reader that we believe there exists such an X. There is a specific symbol in predicate logic which we use for this purpose, called the 'exists symbol'. This is written: ∃. If we put it around our pair of predicates, then we get a fully formed sentence in first-order logic:

∃X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

This is read as "there is something called X, where X is a meal and X costs three pounds at the Red Lion". But what if we now want to say that all meals at the Red Lion cost three pounds? In this case, we need to use a different symbol, which we call the 'forall' symbol, written ∀. This states that the predicates concerning the variable to which the symbol applies are true for all possible instances of that variable. So, what would happen if we replaced the exists symbol above by our new forall symbol? We would get this: ∀X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

Is this actually what we wanted to say? Aren't we saying something about all meals in the universe? Well, actually, we're saying something about every object in the Universe: everything is a meal which you can buy from the Red Lion. For three pounds! What we really wanted to say should have been expressed more like this: ∀X (meal(X) → cost_of(red_lion, X) = three_pounds)

This is read as: forall objects X, if X is a meal, then it costs three pounds in the Red Lion. We're still not there, though. This implies that every meal can be bought at the Red Lion. Perhaps we should throw in another predicate: serves(Pub, Meal) which states that Pub serves the Meal. We can now finally write what we wanted to say: ∀X ((meal(X) ∧ serves(red_lion, X)) → cost_of(red_lion, X) = three_pounds)

This can be read as: for all objects X, if X is a meal and X is served in the Red Lion, then X costs three pounds. The act of making ourselves clear about a variable by introducing an exists or a forall sign is called quantifying the variable. The exists and forall sign are likewise called quantifiers in first-order logic. Substituting a ground term for a variable is often called "grounding a variable", "applying a substitution" or "performing an instantiation". An example of instantiation is: turning the sentence "All meals are five pounds" into "Spaghetti is five pounds" - we have grounded the value of the variable meal to the constant spaghetti to give us an instance of the sentence.

Translating from English to First-Order Logic - Pitfalls

We have now seen some examples of first order sentences, and you should practice writing down
English sentences in first-order logic, to get used to them. There are many ways to translate things from English to Predicate Logic incorrectly, and we can highlight some pitfalls to avoid. Firstly, there is often a mix up between the "and" and "or" connectives. We saw in a previous lecture that the sentence "Every Monday and Wednesday I go to John's house for dinner" can be written in first-order logic as: ∀X ((day_of_week(X, monday) ∨ day_of_week(X, wednesday)) → (go_to(me, house_of(john)) ∧ eat_meal(me, dinner)))

and it's important to note that the "and" in the English sentence has changed to an "or" sign in the first-order logic translation. Because we have turned this sentence into an implication, we need to make it clear that if the day of the week is Monday or Wednesday, then we go to John's house for dinner. Hence the disjunction sign (the "or" sign) is introduced. Note that we call the "and" sign the conjunction sign. Another common problem is getting the choice, placement and order of the quantifiers wrong. We saw this with the Red Lion meals example above. As another example, try translating the sentence: "Only red things are in the bag". Here are some incorrect answers:

∀X (in_bag(X) ∧ red(X))
∀X (red(X) → in_bag(X))
∀X ∀Y ((bag(X) ∧ in_bag(Y,X)) → red(Y))

Question: "Why are these incorrect, what are they actually saying, and what is the correct answer?" Another common problem is using commonsense knowledge to introduce new predicates. While this may simplify things, the agent you're communicating with is unlikely to know the piece of commonsense knowledge you are expecting it to. For example, some people translate the sentence: "Any child of an elephant is an elephant" as: ∀X ∀Y ((parent(X,Y) ∧ elephant(X)) → elephant(Y))

even though they're told to use the predicate child. What they have done here is use their knowledge about the world to substitute the predicate 'parent' for 'child'. It's important to never assume this kind of commonsense knowledge in an agent: unless you've specifically programmed it to, an agent will not know the relationship between the child predicate and the parent predicate.

Translating from First-Order Logic to English

There are tricks to compress what is written in logic into a succinct, understandable English sentence.

For instance, look at this sentence from earlier: ∃X (meal(X) ∧ cost_of(red_lion, X) = three_pounds)

This is read as "there is something called X, where X is a meal and X costs three pounds at the Red Lion". We can abbreviate this to: "there is a meal, X, which costs three pounds at the Red Lion", and finally, we can ignore the X entirely: "there is a meal at the Red Lion which costs three pounds". In performing these abbreviations, we have interpreted the sentence. Interpretation is fraught with danger. Remember that the main reason we will want to translate from first-order logic is so that we can read the output from a reasoning agent which has deduced something new for us. Hence it is important that we don't ruin the good work of our agent by misinterpreting the information it provides us with.

6.3 The Prolog Programming Language


Most programming languages are procedural: the programmer specifies exactly the right instructions (algorithms) required to get an agent to function correctly. It comes as a surprise to many people that there is another way to write programs. Declarative programming is when the user declares what the output to a function should look like given some information about the input. The agent then searches for an answer which fits the declaration, and returns any it finds. As an example, imagine a parent asking their child to run to the shop and buy some groceries. To do this in a declarative fashion, the parent simply has to write down a shopping list. The parents have "programmed" their child to perform their task in the knowledge that the child has underlying search routines which will enable him or her to get to the shop, find and buy the groceries, and come home. To instruct their child in a procedural fashion, they would have to tell the child to go out of the front door, turn left, walk down the street, stop after 70 steps, and so on. We see that declarative programming languages can have some advantages over procedural ones. In fact, it is often said that a Java program written to do the same as a Prolog program usually takes about 10 times the number of lines of code. Many AI researchers try out an idea in Prolog before implementing it more fully in other languages, because Prolog can be used to perform searches easily (see later). A well-known declarative language which is used a lot by AI researchers is Prolog, which is based on first-order logic. For any declarative programming language, the two most important aspects are: how information is represented, and the underlying search routines upon which the language is based. Robert Kowalski put this in a most succinct way: Algorithm = Logic + Control.


Representation in Prolog - Logic Programs

If we impose some additional constraints on first-order logic, then we get to a representation language known as logic programs. The main restriction we impose is that all the knowledge we want to encode is represented as Horn clauses. These are implications which comprise a body and a head, where the predicates in the body are conjoined and they imply the single predicate in the head. Horn clauses are universally quantified over all the variables appearing in them. So, an example Horn clause looks like this: ∀x, y, z ((b1(x,y) ∧ b2(x) ∧ ... ∧ bn(x,y,z)) → h(x,y))

We see that the body consists of predicates bi and the head is h(x,y). We can make this look a lot more like the Prolog programs you are used to writing by making a few syntactic changes: first, we turn the implication around and write it as :- thus:
∀x, y, z (h(x,y) :- b1(x,y) ∧ b2(x) ∧ ... ∧ bn(x,y,z))

Next, we change the ∧ symbols to commas:

∀x, y, z (h(x,y) :- b1(x,y), b2(x), ..., bn(x,y,z))

Finally, we remove the universal quantification (it is assumed in Prolog), make the variables capital letters (Prolog requires this), and put a full stop at the end:
h(X,Y) :- b1(X,Y), b2(X), ..., bn(X,Y,Z).

Note that we use the notation h/2 to indicate that predicate h has arity 2. Also, we call a set of Horn clauses like these a logic program. Representing knowledge with logic programs is less expressive than full first order logic, but it can still express lots of types of information. In particular, disjunction can be achieved by having different Horn clauses with the same head. So, this sentence in first-order logic:

∀x ((a(x) ∨ b(x)) → (c(x) ∧ d(x)))

can be written as the following logic program:


c(X) :- a(X).
c(X) :- b(X).
d(X) :- a(X).
d(X) :- b(X).

We also allow ourselves to represent facts as atomic ground predicates. So, for instance, we can state that:
parent(georgesenior, georgedubya). colour(red).

and so on.

Search mechanisms in Prolog

We can use this simple Prolog program to describe how Prolog searches:
president(X) :- first_name(X, georgedubya), second_name(X, bush).
prime_minister(X) :- first_name(X, maggie), second_name(X, thatcher).
prime_minister(X) :- first_name(X, tony), second_name(X, blair).
first_name(tonyblair, tony).
first_name(georgebush, georgedubya).
second_name(tonyblair, blair).
second_name(georgebush, bush).

If we loaded this into a Prolog implementation such as Sicstus, and queried the database:
?- prime_minister(P).

then Sicstus would search in the following manner: it would run through its database until it came across a Horn clause (or fact) for which the head was prime_minister and the arity of the predicate was 1. It would first look at the president clause, and reject this, because the name of the head doesn't match with the head in the query. However, next it would find that the clause:
prime_minister(X) :- first_name(X, maggie), second_name(X, thatcher).

fits the bill. It would then look at the predicates in the body of the clause and see if it could satisfy them. In this case, it would try to find a match for first_name(X, maggie). However, it would fail, because no such information can be found in the database. That means that the whole clause fails, and Sicstus would backtrack, i.e., it would go back to looking for a clause with the same head as the query. It would, of course, next find this clause:
prime_minister(X) :- first_name(X, tony), second_name(X, blair).

Then it would look at the body again, and try to find a match for first_name(X, tony). It would look through the database and find X=tonyblair to be a good assignment, because the fact
first_name(tonyblair, tony) is found towards the end of the database. Likewise, having assigned X=tonyblair, it would then look for a match to: second_name(tonyblair, blair), and would succeed. Hence, the answer tonyblair would make the query succeed, and this would be reported

back to us. The important thing to remember is that Prolog implementations search from the top to the bottom of the database, and try each term in the body of a clause in the order in which they appear. We say that Sicstus has proved the query prime_minister(P) by finding something which satisfied the declaration of what a prime minister is: Tony Blair. It is also worth remembering that Sicstus assumes negation as failure. This means that if it cannot prove a predicate, then the predicate is false. Hence the query:
?- \+ president(tonyblair).

returns an answer of 'true', because Sicstus cannot prove that Tony Blair is a president. Note that, as part of its search, Prolog also makes inferences using the generalised Modus Ponens rule of inference and unification of clauses. We will look in detail at these processes in the next lecture.
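Assuming the seven clauses above have been loaded into an implementation such as Sicstus or SWI-Prolog, a short interactive session might look like this (our own sketch; the exact way answers are printed varies between implementations):

?- prime_minister(P).
P = tonyblair

?- president(P).
P = georgebush

?- \+ president(tonyblair).
yes

The first query succeeds only after backtracking past the thatcher clause as described above, the second succeeds directly, and the third succeeds by negation as failure.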

6.4 Logic-based Expert Systems


Expert systems are agents which are programmed to make decisions about real world situations. They are put together by using knowledge elicitation techniques to extract information from human experts. A particularly fruitful area is the diagnosis of diseases, where expert systems are used to decide (suggest) what disease a patient has, given their symptoms. Expert systems are one of the major success stories of AI. Russell and Norvig give a very nice example from medicine:

"A leading expert on lymph-node pathology describes a fiendishly difficult case to the expert system, and examines the system's diagnosis. He scoffs at the system's response. Only slightly worried, the creators of the system suggest he ask the computer for an explanation of the diagnosis. The machine points out the major factors influencing its decision and explains the subtle interaction of several of the symptoms in this case. The experts admits his error, eventually."

Often, the rules from the expert are encoded as if-then rules in first-order logic and the implementation of the expert system can be fairly easily achieved in a programming language such as
Prolog. We can take our card game from the previous lecture as a case study for the implementation of a logic-based expert system. The rules were: four cards are laid on the table face up. Player 1 takes the first card, and they take it in turns until they both have two cards each. To see who has won, they each add up their two card numbers, and the winner is the one with the highest even number. The winner scores the even number they have. If there's no even number, or both players achieve the same even number, then the game is drawn.

It could be argued that undertaking a minimax search is a little unnecessary for this game, because we could easily just specify a set of rules for each player, so that they choose cards rationally. To demonstrate this, we will write down some Prolog rules which specify how player one should choose the first card. For example, suppose the cards dealt were: 4, 5, 6, 10. In this case, the best choice of action for player one is to choose the 10, followed presumably by the 4, because player two will pick the 6. We need to abstract from this particular example to the general case: we see that there were three even numbers and one odd one, so player one is guaranteed another even number to match the one they chose. This is also true if there are four even numbers. Hence we have our first rule:

If there are three or four even numbered cards, then player one should choose the highest even numbered card in their first go.

When there are three or four odd cards it's not difficult to see that the most rational action for player one is to choose the highest odd numbered card:

If there are three or four odd numbered cards, then player one should choose the highest odd numbered card in their first go.

The only other situation is when there are two even and two odd cards. In this case, I'll leave it as an exercise to convince yourselves that there are no rules governing the choice of player one's first card: they can simply choose randomly, because they're not going to win unless player two makes a mistake. To write an expert system to decide which card to choose in a game, we will need to translate our rules into first-order logic, and then into a Prolog implementation. Our first rule states that, in a game, g:

(number_of_even_at_start(g,3) ∨ number_of_even_at_start(g,4)) ∧ highest_even_at_start(g,h) → player_one_chooses(g,h).

The meaning of the predicates is as obvious as it seems. Similarly, our second rule can be written as:
(number_of_odd_at_start(g,3) ∨ number_of_odd_at_start(g,4)) ∧ highest_odd_at_start(g,h) → player_one_chooses(g,h).

There are many different ways to encode these rules as a Prolog program. Different implementations will differ in their execution time, but for our simple program, it doesn't really matter which predicates we choose to implement. We will make our top level predicate player_one_chooses/2. This predicate will take a list of card numbers as the first argument, and it will choose a member of this list to put as the second argument. In this way, the same predicate can be used in order to make second choices. Using our above logical representation, we can start by defining:
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_evens(CardList, 3),
    biggest_even_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_evens(CardList, 4),
    biggest_even_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_odds(CardList, 3),
    biggest_odd_in_list(CardList, CardToChoose).
player_one_chooses(CardList, CardToChoose) :-
    length(CardList, 4),
    number_of_odds(CardList, 4),
    biggest_odd_in_list(CardList, CardToChoose).
player_one_chooses([CardToChoose|_], CardToChoose).

We see that there are four choices depending on the number of odds and evens in the CardList. To make these predicates work, we need to fill in the details of the other predicates. Assuming that we have some basic list predicates: length/2 which calculates the size of a list, sort/2 which sorts a list, and last/2 which returns the last element in a list, then we can write down the required predicates:
iseven(A) :-
    0 is A mod 2.
isodd(A) :-
    1 is A mod 2.
even_cards_in_list(CardList, EvenCards) :-
    findall(EvenCard, (member(EvenCard, CardList), iseven(EvenCard)), EvenCards).



odd_cards_in_list(CardList, OddCards) :-
    findall(OddCard, (member(OddCard, CardList), isodd(OddCard)), OddCards).
number_of_evens(CardList, NumberOfEvens) :-
    even_cards_in_list(CardList, EvenCards),
    length(EvenCards, NumberOfEvens).
number_of_odds(CardList, NumberOfOdds) :-
    odd_cards_in_list(CardList, OddCards),
    length(OddCards, NumberOfOdds).
biggest_odd_in_list(CardList, BiggestOdd) :-
    odd_cards_in_list(CardList, OddCards),
    sort(OddCards, SortedOddCards),
    last(SortedOddCards, BiggestOdd).
biggest_even_in_list(CardList, BiggestEven) :-
    even_cards_in_list(CardList, EvenCards),
    sort(EvenCards, SortedEvenCards),
    last(SortedEvenCards, BiggestEven).
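Assuming these clauses are loaded together with the list predicates length/2, sort/2 and last/2 mentioned above (provided by the lists library in most Prolog implementations), we can query the example deal from earlier. This is just a sketch of the intended behaviour:

?- player_one_chooses([4, 5, 6, 10], Card).
Card = 10

There are three even cards in the deal, so the first clause fires and the biggest even card, 10, is chosen, exactly as the informal analysis above suggested.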

It's left as an exercise to write down the rules for player one's next choice, and player two's choices.

Chapter-7 Making Deductive Inferences


We have shown how knowledge can be represented in first-order logic, and how rule-based expert systems expressed in logic can be constructed and used. We now look at how to take some known facts about a domain and deduce new facts from them. This will, in turn, enable agents to prove things, i.e., to start with a set of statements we believe to be true (axioms) and deduce whether another statement (theorem) is true or not. We will first look at how to tell whether a sentence in propositional logic is true or false. This will suggest some equivalences between propositional sentences, which allow us to rewrite sentences to other sentences which mean the same thing, regardless of the truth or meaning of the individual propositions they contain. These are reversible inferences, in that deduction can be applied either way. We then look at propositional and first-order inference rules in general, which enable us to deduce new sentences if we know that certain things are true, and which may not be reversible.



7.1 Truth Tables
(Material covered in Lecture 6)

In propositional logic, where we are restricted to expressing sentences in which propositions are true or false, we can check whether a particular statement is true or false by working out the truth of ever larger sub-statements using the truth of the propositions themselves. To work out the truth of sub-statements, we need to know how to deal with truth assignments in the presence of connectives. For instance, if we know that is_president(barack_obama) and is_male(barack_obama) are true, then we know that the sentence:

is_male(barack_obama) ∧ is_president(barack_obama)

is also true, because we know that a sentence of the form P ∧ Q is true when P is true and Q is true.

The truth values of connectives, given the truth values of the propositions they contain, are presented in the following truth table:

P      Q      ¬P     P ∧ Q   P ∨ Q   P → Q   P ↔ Q
True   True   False  True    True    True    True
True   False  False  False   True    False   False
False  True   True   False   True    True    False
False  False  True   False   False   True    True

This table allows us to read the truth of the connectives in the following manner. Suppose we are looking at row three. This says that, if P is false and Q is true, then

1. ¬P is true
2. P ∧ Q is false
3. P ∨ Q is true
4. P → Q is true
5. P ↔ Q is false

Note that, if P is false, then regardless of whether Q is true or false, the statement P → Q is true. This takes a little getting used to, but can be a very useful tool in
theorem proving: if we know that something is false, it can imply anything we want it to! So, the following sentence is true: "Barack Obama is female" implies that "Barack Obama is an alien", because the premise that Barack Obama is female was false, so the conclusion that Barack Obama is an alien can be deduced in a sound way.

Each row of a truth table defines the connectives for a particular assignment of true and false to the individual propositions in a sentence. We call each assignment a model: it represents a particular possible state of the world. For two propositions P and Q there are four models. For propositional sentences in general, a model is also just a particular assignment of truth values to its individual propositions. A sentence with n propositions will have 2^n possible models, and so 2^n rows in its truth table. A sentence S will be true or false for a given model M; when S is true we say 'M is a model of S'.

Sentences which are always true, regardless of the truth of the individual propositions, are called tautologies (or valid sentences). Tautologies are true for all models. For instance, if I said that "Tony Blair is prime minister or Tony Blair is not prime minister", this is largely a content-free sentence, because we could have replaced the predicate applied to Tony Blair with any predicate and the sentence would still have been correct. Tautologies are not always as easy to notice as the one above, and we can use truth tables to be certain that a statement we have written is true, regardless of the truth of the individual propositions it contains. To do this, the columns of our truth table will be headed with ever larger sections of the sentence, until the final column contains the entire sentence. As before, the rows of the truth table will represent all the possible models for the sentence, i.e. each possible assignment of truth values to the individual propositions in the sentence. We will use these initial truth values to assign truth values to the subsentences in the truth table, then use these new truth values to assign truth values to larger subsentences and so on. If the final column (the entire sentence) is always assigned true, then this means that, whatever the truth values of the propositions being discussed, the entire sentence will turn out to be true. For example, the following is a tautology:

S: (X → (Y ∧ Z)) ↔ ((X → Y) ∧ (X → Z))

In English, sentence S says that X implies Y and Z if and only if X implies Y and X implies Z. The truth table for this sentence will look like this:

X      Y      Z      Y ∧ Z   X → Y   X → Z   X → (Y ∧ Z)   (X → Y) ∧ (X → Z)   S
True   True   True   True    True    True    True          True                True
True   True   False  False   True    False   False         False               True
True   False  True   False   False   True    False         False               True
True   False  False  False   False   False   False         False               True
False  True   True   True    True    True    True          True                True
False  True   False  False   True    True    True          True                True
False  False  True   False   True    True    True          True                True
False  False  False  False   True    True    True          True                True

We see that the seventh and eighth columns, the truth values which have been built up from the previous columns, have exactly the same truth values in each row. Because our sentence is made up of the two sub-sentences in these columns, this means that our overall equivalence must be correct. The truth of this statement demonstrates that the connectives → and ∧ are related by a property called distributivity, which we come back to later on.

Truth tables give us our first (albeit simple) method for proving a theorem: check whether it can be written in propositional logic and, if so, if it is a tautology, then it must be true. So, for instance, if we were asked to prove this theorem from number theory:

∀n, m ((sigma(n) = n → tau(n) = m) → (¬(tau(n) = m) → sigma(n) =\= n))

then we could prove it straight away, because we know that this is a tautology:

(X → Y) → (¬Y → ¬X)

As we know this is a tautology, and that our number theory theorem fits into the tautology (let X represent the proposition sigma(n)=n, and so on), we know that the theorem must be true, regardless of what tau and sigma mean. (As an exercise, show that this is indeed a tautology, using a truth table).
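The truth-table method is mechanical enough to program directly. The following Prolog sketch (our own illustration, not part of the notes) represents formulas with the functors not/1, and/2, or/2, implies/2 and iff/2 (these names are our own choice) and checks a candidate tautology by enumerating every assignment of true/false to its propositions:

truth_value(true, true).
truth_value(false, false).
truth_value(not(P), V) :-
    truth_value(P, VP), negate(VP, V).
truth_value(and(P, Q), V) :-
    truth_value(P, VP), truth_value(Q, VQ), conj(VP, VQ, V).
truth_value(or(P, Q), V) :-
    truth_value(P, VP), truth_value(Q, VQ), disj(VP, VQ, V).
truth_value(implies(P, Q), V) :-
    truth_value(or(not(P), Q), V).                 % P -> Q treated as not P or Q
truth_value(iff(P, Q), V) :-
    truth_value(and(implies(P, Q), implies(Q, P)), V).

negate(true, false).      negate(false, true).
conj(true, true, true).   conj(true, false, false).
conj(false, true, false). conj(false, false, false).
disj(true, true, true).   disj(true, false, true).
disj(false, true, true).  disj(false, false, false).

% Formula is a tautology if no assignment of truth values to the
% propositions in Vars makes it evaluate to false.
tautology(Vars, Formula) :-
    \+ (assign(Vars), truth_value(Formula, false)).

assign([]).
assign([V | Vs]) :- member(V, [true, false]), assign(Vs).

For example, ?- tautology([X, Y], iff(implies(X, Y), or(not(X), Y))). succeeds, confirming the 'replace implication' equivalence used in the next section, while ?- tautology([X], X). fails, because the assignment X = false makes the formula false.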


7.2 Equivalences & Rewrite Rules
As well as allowing us to prove trivial theorems, tautologies enable us to establish that certain sentences are saying the same thing. In particular, if we can show that A ↔ B is a tautology then we know A and B are true for exactly the same models, i.e. they will have identical columns in a truth table. We say that A and B are logically equivalent, written as the equivalence A ≡ B. (Clearly ↔ and ≡ mean the same thing here, so why use two different symbols? It's a technical difference: A ↔ B is a sentence of propositional logic, whereas A ≡ B is a claim we make outside the logic.)

In natural language, we could replace the phrase "There's only one Tony Blair" by "Tony Blair is unique" in sentences, because basically the phrases mean the same thing. We can do exactly the same in logical languages, with an advantage: because we are being more formal, we will have mathematically proved that two sentences are equivalent. This means that there is absolutely no situation in which one sentence would be interpreted in a different way to another, which is certainly possible with natural language sentences about Tony Blair.

Equivalences allow us to change one sentence into another without affecting the meaning, because we know that replacing one side of an equivalence with the other will have no effect whatsoever on the semantics: it will still be true for the same models. Suppose we have a sentence S with a sub-expression A, which we write as S[A]. If we know A ≡ B then we can be sure the semantics of S is unaffected if we replace A with B, i.e. S[A] ≡ S[B]. Moreover, we can also use A ≡ B to replace any sub-expression of S which is an instance of A. An instance of a propositional expression A is a 'copy' of A where some of the propositions have been consistently replaced by new sub-expressions, e.g. every P has been replaced by Q. We call this replacement a substitution: a mapping from propositions to expressions. Applying a substitution U to a sentence S, we get a new sentence S.U which is an instance of S. It is easy to show that if A ≡ B then A.U ≡ B.U for any substitution U, i.e. an instance of an equivalence is also an equivalence. Hence an equivalence A ≡ B allows us to change a sentence S[A'] to a logically equivalent one S[B'] if we have a substitution U such that A' = A.U and B' = B.U.

The power to replace sub-expressions allows us to prove theorems with equivalences: in the above example, given a theorem S[A'] ↔ S[B'] we can use the equivalence A ≡ B to rewrite the theorem to the equivalent S[A'] ↔ S[A'], which we know to be true. Given a set of equivalences we can prove (or disprove) a
complex theorem by rewriting it to something logically equivalent that we already know to be true (or false). The fact that we can rewrite instances of A to instances of B is expressed in the rewrite rule A => B. Of course, we can also rewrite Bs to As, so we could use the rewrite rule B => A instead. However, it's easy to see that having an agent use both rules is dangerous, as it could get stuck in a loop A => B => A => B => ... and so on. Hence, we typically use just one of the rewrite rules for a particular equivalence (we 'orient' the rule in a single direction). If we do use both then we need to make sure we don't get stuck in a loop. Apart from proving theorems directly, the other main use for rewrite rules is to prepare a statement for use before we search for the proof, as described in the next lecture. This is because some automated deduction techniques require a statement to be in a particular format, and in these cases, we can use a set of rewrite rules to convert the sentence we want to prove into a logically equivalent one which is in the correct format. Below are some common equivalences which automated theorem provers can use as rewrite rules. Remember that the rules can be read both ways, but that in practice either i) only one direction is used or ii) a loop-check is employed. Note also that these are true of sentences in propositional logic, so they can also be used for rewriting sentences in first-order logic, which is just an extension of propositional logic.

Commutativity of Connectives

You will be aware of the fact that some arithmetic operators have a property that it doesn't matter which way around you give the operator input. We call this property commutativity. For example, when adding two numbers, it doesn't matter which one comes first, because a+b = b+a for all a and b. The same is true for multiplication, but not true for subtraction and division. The ∧, ∨ and ↔ connectives (which operate on two sub-sentences) also have the commutativity property. We can express this with three tautologies:

(P ∧ Q) ↔ (Q ∧ P)
(P ∨ Q) ↔ (Q ∨ P)
(P ↔ Q) ↔ (Q ↔ P)

So, if it helps to do so, whenever we see P ∧ Q, we can rewrite it as Q ∧ P, and similarly for the other two commutative connectives.

Associativity of Connectives

Brackets are useful in order to tell us when to perform calculations in arithmetic and when to evaluate the truth of sentences in logic. Suppose we want to add 10, 5 and 7. We could do this: (10 + 5) + 7 = 22. Alternatively, we could do this: 10 + (5 + 7) = 22. In this case, we can alter the bracketing and the answer still comes out the same. We say that addition is associative because it has this property with respect to bracketing. The ∧ and ∨ connectives are associative. This makes sense, because the order in which we check truth values doesn't matter when we are working with sentences only involving ∧ or only involving ∨. For instance, suppose we wanted to know the truth of P ∧ (Q ∧ R). To do this, we just need to check that every proposition is true, in which case the whole sentence will be true, otherwise the whole sentence will be false. So, it doesn't matter how the brackets are arranged, and hence ∧ is associative. Similarly, suppose we wanted to work out the truth of:

(P ∨ Q) ∨ (R ∨ (X ∨ Z))

Then all we need to do is check whether one of these propositions is true, and the bracketing is immaterial. As equivalences, then, the two associativity results are:

(P ∧ Q) ∧ R ≡ P ∧ (Q ∧ R)
(P ∨ Q) ∨ R ≡ P ∨ (Q ∨ R)

Distributivity of Connectives

Our last analogy with arithmetic will involve a well-used technique for playing around with algebraic properties. Suppose we wanted to work out: 10 * (3 + 5). We could do it like this: 10 * (3 + 5) = 10 * 8 = 80. Or we could do it like this: (10 * 3) + (10 * 5) = 30 + 50 = 80. In general, we know that, for any numbers, a, b and c: a * (b + c) = (a * b) + (a * c). In this case, we say that multiplication is distributive over addition. You guessed it, we can distribute some of the connectives too. In particular, ∧ is distributive over ∨ and vice versa: ∨ is also distributive over ∧. We can present these as equivalences as follows:

P ∧ (Q ∨ R) ≡ (P ∧ Q) ∨ (P ∧ R)
P ∨ (Q ∧ R) ≡ (P ∨ Q) ∧ (P ∨ R)


Also, we saw earlier that → is distributive over ∧, and the same is true for → over ∨. Therefore:

P → (Q ∧ R) ≡ (P → Q) ∧ (P → R)
P → (Q ∨ R) ≡ (P → Q) ∨ (P → R)
Double Negation

Parents are always correcting their children for the use of double negatives, but we have to be very careful with them in natural language: "He didn't tell me not to do it" doesn't necessarily mean the same as "He did tell me to do it". The same is true with logical sentences: we cannot, for example, change ¬(P ∧ Q) to (¬P ∧ ¬Q) without risking the meaning of the sentence changing. However, there are certain cases when we can alter expressions with negation. Two possibilities are given by De Morgan's laws below, and we can also simplify statements by removing double negation. These are cases when a proposition has two negation signs in front of it, like this: ¬¬P. You may be wondering why on earth anyone would ever write down a sentence with such a double negation in the first place. Of course, you're right. As humans, we wouldn't write a sentence in logic like that. However, remember that our agent will be doing search using rewrite rules. It may be that, as part of the search, it introduces a double negation, by following a particular rewrite rule to the letter. In this case, the agent would probably tidy it up by using this equivalence:

¬¬P ≡ P
De Morgan's Laws

Continuing with the relationship between ∧ and ∨, we can also use De Morgan's Law to rearrange sentences involving negation in conjunction with these connectives. In fact, there are two equivalences which, taken as a pair, are called De Morgan's Law:

¬(P ∧ Q) ≡ ¬P ∨ ¬Q
¬(P ∨ Q) ≡ ¬P ∧ ¬Q

These are important rules and it is worth spending some time thinking about why they are true.

Contraposition

The contraposition equivalence is as follows:




P → Q ≡ ¬Q → ¬P

This may seem a little strange at first, because it appears that we have said nothing in the first sentence about ¬Q, so how can we infer anything from it in the second sentence? However, suppose we know that P implies Q, and we saw that Q was false. In this case, if we were to infer that P was true, then, because we know that P implies Q, we would also infer that Q is true. But Q was false! Hence we cannot possibly infer that P is true, which means that we must infer that P is false (because we are in propositional logic, so P must be either true or false). This argument shows that we can replace the first sentence by the second one, and it is left as an exercise to construct a similar argument for the vice-versa part of this equivalence.

Other Equivalences

The following miscellaneous equivalence rules are often useful during rewriting sessions. The first two allow us to completely get rid of implication and equivalence connectives from our sentences if we want to:

Replace implication: P → Q ≡ ¬P ∨ Q (this one is very useful)
Replace equivalence: P ↔ Q ≡ (P → Q) ∧ (Q → P)

The next two allow truth values to be determined regardless of the truth of the propositions.

Consistency: P ∧ ¬P ≡ False
Excluded middle: P ∨ ¬P ≡ True

Here the "False" symbol stands for the proposition which is always false: no matter what truth values you give to other propositions in the sentence, this one will always be false. Similarly, the "True" symbol stands for the proposition which is always true. In first-order logic we can treat them as special predicates with the same properties.

An Example using Rewrite Rules


Equivalence rules can be used to show that a complicated looking sentence is actually just a simple one in disguise. For this example, we shall show that this sentence:

(A ↔ B) ∧ (¬A ∧ B)



conveys a meaning which is actually much simpler than you would think on first inspection. We can simplify this, using the following chain of rewrite steps based on the equivalences we've stated above:
1. Using the double negation rewrite P => ¬¬P:

(A ↔ B) ∧ (¬A ∧ ¬¬B)

2. Using De Morgan's Law ¬P ∧ ¬Q => ¬(P ∨ Q):

(A ↔ B) ∧ ¬(A ∨ ¬B)

3. Using the commutativity of ∨: P ∨ Q => Q ∨ P:

(A ↔ B) ∧ ¬(¬B ∨ A)

4. Using 'replace implication' from right to left: ¬P ∨ Q => P → Q:

(A ↔ B) ∧ ¬(B → A)

5. Using 'replace equivalence' from left to right: P ↔ Q => (P → Q) ∧ (Q → P):

((A → B) ∧ (B → A)) ∧ ¬(B → A)

6. Using the associativity of ∧: (P ∧ Q) ∧ R => P ∧ (Q ∧ R):

(A → B) ∧ ((B → A) ∧ ¬(B → A))

7. Using the consistency equivalence above: P ∧ ¬P => False:

(A → B) ∧ False

8. Using the definition of False: P ∧ False => False:

False

So, what does this mean? It means that our original sentence was always false: there are no models which would make this sentence true. Another way to think about this is that the original sentence was inconsistent with the rules of propositional logic. In general, proving theorems by proving that their negation rewrites to False is an example of proof by contradiction, which we discuss below. Note that the first step of this simplification routine was to insert a double negation! Also, at some stages, the rewritten sentence looked more complicated than the original, so we seemed to be making matters worse, which is quite common. Is there any other way to simplify the original statement? Of course, you'll still end up with the answer False, but there might be a quicker way to get there. You may get the feeling you are solving a search problem, which, of course, is exactly what you're doing. If you think about this sentence, it may become obvious why it is
false: for (¬A ∧ B) to be true, A must be false and B must be true. But then what about the conjoined equivalence?
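To make the idea of oriented rewrite rules concrete, here is a small Prolog sketch (our own illustration, using the made-up functors not/1, and/2, or/2 and implies/2 for the connectives). Each rewrite/2 fact is one equivalence, oriented in a single direction, and simplify_step/2 applies one rewrite somewhere inside a formula:

rewrite(not(not(P)), P).                        % double negation
rewrite(not(and(P, Q)), or(not(P), not(Q))).    % De Morgan
rewrite(not(or(P, Q)), and(not(P), not(Q))).    % De Morgan
rewrite(implies(P, Q), or(not(P), Q)).          % replace implication

simplify_step(Formula, Rewritten) :-            % rewrite at the top level
    rewrite(Formula, Rewritten).
simplify_step(Formula, Rewritten) :-            % or rewrite inside an argument
    Formula =.. [Connective | Args],
    append(Before, [Arg | After], Args),
    simplify_step(Arg, NewArg),
    append(Before, [NewArg | After], NewArgs),
    Rewritten =.. [Connective | NewArgs].

A query such as ?- simplify_step(and(p, not(not(q))), F). returns F = and(p, q). Deciding which rule to apply, and where, is then exactly the kind of search problem described above.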

7.4 Propositional Inference Rules

Equivalence rules are particularly useful because of the vice-versa aspect, which means that we can search backwards and forwards in a search space using them. Hence, we can perform bi-directional search, which is a bonus. However, what if we know that one sentence (or set of sentences) being true implies that another set of sentences is true? For instance, the following sentence is used ad nauseam in logic text books:

All men are mortal
Socrates was a man
Hence, Socrates is mortal

This is an example of the application of a rule of deduction known as Modus Ponens. We see that we have deduced the fact that Socrates is mortal from the two true facts that all men are mortal and Socrates was a man. So, because we know that the rule about men being mortal and the classification of Socrates as a man are true, we can infer with certainty (because we know that modus ponens is sound), that Socrates is going to die - which, of course, he did. Of course, it doesn't make sense to go backwards as with equivalences: we would deduce that Socrates being mortal implies that he was a man and that all men are mortal!

The general format for the modus ponens rule is as follows: if we have a true sentence which states that proposition A implies proposition B and we know that proposition A is true, then we can infer that proposition B is true. The notation we use for this is as follows:

A → B,  A
---------
    B

This is an example of an inference rule. The comma above the line indicates we know both these things in our knowledge base, and the line stands for the deductive step. That is, if we know that both the propositions above the line are true, then we can deduce that the proposition below the line is also true. In general, an inference rule

A
-
B


is sound if we can be sure that A entails B, i.e. B is true whenever A is true. More formally, A entails B means that if M is a model of A then M is also a model of B. We write this as A ⊨ B. This gives us a way to check the soundness of propositional inference rules: (i) draw up a truth table for both A and B, evaluating them for all models, and (ii) check that whenever A is true, then B is also true. We don't care here about the models for which A is false. For instance, the truth table for the modus ponens rule is really the same as the one for the implication connective. It looks like this:

A      B      A → B
True   True   True
True   False  False
False  True   True
False  False  True

This is a trivial example, but it highlights how we use truth tables: the first line is the only one where both above-line propositions (A and A → B) are true. We see that on this line, the proposition B is also true. This shows us that we have an entailment: the above-line propositions entail the below-line one. To see why such inference rules are useful, remember what the main application of automated deduction is: to prove theorems. Theorems are normally part of a larger theory, and that theory has axioms. Axioms are special theorems which are taken to be true without question. Hence whenever we have a theorem statement we want to prove, we should be able to start from the axioms and deduce the theorem statement using sound inference rules such as modus ponens. Below are some more propositional inference rules:

And-Elimination

In English, this says that "if you know that lots of things are all true, then you know that any one of them is also true". It means that you can simplify a conjunction by just taking one of the conjuncts (in effect, eliminating the ∧ symbols).

A1 ∧ A2 ∧ ... ∧ An
------------------
        Ai

Note that 1 ≤ i ≤ n.

And-Introduction

In English, this says that "if we know that a lot of things are true, then we know that the conjunction of all of them is true", so we can introduce conjunction ('and') symbols. A1, A2, ..., A1 A2 ... An An

This may not seem to be saying much. However, imagine that we are working with a lot of different sentences at different places in our knowledge base, and we know some of them are true. Then we can make a larger sentence out of them by conjoining the smaller ones.

Or-Introduction

If we know that one thing is true, then we know that a sentence where that thing is in a disjunction is true. For example, we know that "Tony Blair is prime minister" is true. From this, we can infer any disjunction as long as we include this true sentence as a disjunct. So, we can infer that "Tony Blair is prime minister or the moon is made of blue cheese", which makes perfect sense.

        Ai
------------------
A1 ∨ A2 ∨ ... ∨ An

Again, 1 ≤ i ≤ n.

Unit Resolution

Suppose that we knew the sentence "Tony Blair is prime minister or the moon is made of blue cheese", is true, and we later found out that the moon isn't in fact made of cheese. Then, because the first (disjoined) sentence is true, we can infer that Tony Blair is indeed prime minister. This typifies the essence of the unit resolution rule:

A ∨ B,  ¬B
----------
    A

The generalised version of this inference rule is the subject of a whole area of Artificial Intelligence research known as resolution theorem proving, which we cover in detail in the next lecture.

7.5 First-Order Models

We proposed first-order logic as a good knowledge representation language rather than propositional logic because it is more expressive, so we can write more of our sentences in logic. So the sentences to which we are going to want to apply rewrites and inference rules will include quantification. All of the rewrite rules we've seen so far can be used in propositional logic (and hence first-order logic). We now consider rules which rely on information about the quantifiers, so are not available to an agent working with a propositional logic representation scheme. Before we look at first-order inference rules we need to pause to consider what it means for such an inference rule to be sound. Earlier we defined this as meaning the top entails the bottom: that any model of the former was a model of the latter. But first-order logic introduces new syntactic elements (constants, functions, variables, predicates and quantifiers) alongside the propositional connectives. This means we need to completely revise our definition of model, a notion of a 'possible world' which defines whether a sentence is true or false in that world. A propositional model was just an assignment of truth values to propositions. In contrast, a first-order model is a pair (Δ, Θ) where:

Δ is a domain, a non-empty set of 'objects', i.e. things which our first-order sentences are referring to.
Θ is an interpretation, a procedure for calculating the truth of sentences relative to Δ.

This seems very different from propositional logic. Fortunately, everything we have discussed so far about deduction carries over into first-order logic when we use this new definition of model.



Terms

First-order logic allows us to talk about properties of objects, so the first job for our model (Δ, Θ) is to assign a meaning to the terms which represent objects. A ground term is any combination of constant and function symbols, and Θ maps each individual ground term to a specific object in Δ. This means that a ground term refers to a single specific object. The meaning of subterms is always independent of the term they appear in. The particular way that terms are mapped to objects depends on the model. Different models can define terms as referring to different things. Note that although father(john) and jack are separate terms, they might both be mapped to the same object (say Jack) in Δ. That is, the two terms are syntactically different but (in this model) they are semantically the same, i.e. they both refer to the same thing! Terms can also contain variables (e.g. father(X)); these are non-ground terms. They don't refer to any specific object, and so our model can't assign any single meaning to them directly. We'll come back to what variables mean.
Predicates

Predicates take a number of arguments (which for now we assume are ground terms) and represent a relationship between those arguments which can be true or false. The semantics of an n-ary predicate p(t1,...,tn) are defined by a model (Δ, Θ) as follows: we first calculate the n objects that the arguments refer to, Θ(t1), ..., Θ(tn). Θ maps p to a function P: Δ^n → {true, false}, which defines whether p is true for those n elements of Δ. Different models can assign different functions P, i.e. they can provide different meanings for each predicate. Combining predicates, ground terms and propositional connectives gives us ground formulae, which don't contain any variables. They are definite statements about specific objects.
Quantifiers and Variables

So what do sentences containing variables mean? In other words, how does a first-order model decide whether such a sentence is true or false? The first step is to ensure that the sentence does not contain any free variables, variables which are not bound by (associated with) a quantifier. Strictly speaking, a first-order expression is not a sentence unless all the variables are bound. However, we usually assume that if a variable is not explicitly bound then really it is implicitly universally quantified.


Next we look for the outermost quantifier in our sentence. If this is ∀X then we consider the truth of the sentence for every value X could take. When the outermost quantifier is ∃X we need to find just a single possible value of X. To make this more formal we can use the concept of substitution. Here {X\t} is a substitution which replaces all occurrences of variable X with a term representing an object t:

∀X. A is true if and only if A.{X\t} is true for all t in Δ
∃X. A is true if and only if A.{X\t} is true for at least one t in Δ

Repeating this for all the quantifiers, we get a set of ground formulae which we have to check to see if the original sentence is true or false. Unfortunately, we haven't specified that our domain Δ is finite (for example, it may contain the natural numbers), so there may be an infinite number of sentences to check for a given model! There may also be an infinite number of models. So although we have a proper definition of model, and hence a proper semantics for first-order logic, we can't rely on having a finite number of models as we did when drawing propositional truth tables.

7.6 First-Order Inference Rules


Now that we have a clear definition of what a first-order model is, we can define soundness for first-order inference rules in the same way as we did for propositional inference rules: the rule is sound if, given a model of the sentences above the line, this is always a model of the sentence below. To be able to specify these new rules, we must use the notion of substitution. We've already seen substitutions which replace propositions with propositional expressions (7.2 above) and other substitutions which replace variables with terms that represent a given object (7.5 above). In this section we use substitutions which replace variables with ground terms (terms without variables), so to be clear we will call these ground substitutions. Another name for a ground substitution is an instantiation. For example, if we start with the wonderfully optimistic sentence that everyone likes everyone else: ∀X, Y (likes(X, Y)), then we can choose particular values for X and Y. So, we can instantiate this sentence to say: likes(george, tony). Because we have chosen a particular value, the quantification no longer makes sense, so we must drop it. The act of performing an instantiation has only one possible outcome, so we can write it as a function. The notation



Subst({X/george, Y/tony}, likes(X,Y)) = likes(george, tony) indicates that we have made a ground substitution. We also have to recognise that we are working with sentences which form part of a knowledge base of many such sentences. More to the point, there will be constants which appear throughout the knowledge base, and some which are local to a particular sentence.
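In Prolog terms (a small aside of our own, not from the notes), applying such a ground substitution is just unification of the variables with the chosen constants:

?- Sentence = likes(X, Y), X = george, Y = tony.
Sentence = likes(george, tony)

Because the variables in a Prolog clause are shared between the head and the body, binding them once instantiates them everywhere, which is exactly the 'consistent replacement' that Subst describes.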

Universal Elimination

For any sentence, A, containing a universally quantified variable, v, then for any ground term, g, we can substitute g for v in A. We write the following to represent this rule:
∀v A
---------------
Subst({v/g}, A)

As an example (from Russell and Norvig), this rule can be used on the following sentence: ∀X, likes(X, ice_cream), to substitute the constant ben for X, giving us the sentence likes(ben, ice_cream). In English, this says that, given that everyone likes ice cream, we can infer that Ben likes ice cream. This is not exactly rocket science, and it is worth bearing in mind that, beneath all the fancy symbols in logic, we're really only saying simple things.

Existential Elimination

For a sentence, A, with an existentially quantified variable, v, then, for every constant symbol k, that does not appear anywhere else in the knowledge base, we can substitute k for v in A:

∃v A
---------------
Subst({v/k}, A)



For an example, if we know that ∃X (likes(X, ice_cream)), then we can choose a particular name for X. We could choose ben for this, giving us: likes(ben, ice_cream), but only if the constant ben does not appear anywhere else in our knowledge base. So, why the condition about the constant being unique to the new sentence? Basically, what you are doing here is giving a particular name to a variable you know must exist. It would be unwise to give this a name which already exists. For example, suppose we have the sentences ∃X (brother(john, X)) and sister(john, susan); then, when instantiating X, it would be unwise to choose the term susan for the constant to ground X with, because this would probably be a false inference. Of course, it's not impossible that John would have a sister named Susan and also a brother named Susan, but it is not likely. However, if we choose a totally new constant, then there can be no problems and the inference is guaranteed to be correct.

Existential Introduction

For any sentence, A, and variable, v, which does not occur in A, then for any ground term, g, that occurs in A, we can turn A into an existentially quantified sentence by substituting v for g:
A
------------------
∃v Subst({g/v}, A)

So, for example, if we know that likes(jerry, ice_cream), then we can infer that ∃X (likes(X, ice_cream)), because the variable X does not occur in the original sentence. The conditions on v and g are there for similar reasons to those given for the previous rule. As an exercise, find a situation where ignoring this condition would mean that the inferred sentence did not follow logically from the premise sentence.

7.7 Chains of Inference


We look now at how to get an agent to prove a given theorem using various search strategies. We have noted in previous lectures that, to specify a search problem, we need to describe the representation language for the artefacts being searched for, the initial state, the goal state (or some information about what a goal should look like), and the operators: how to go from one state to another.


We can state the problem of proving a given theorem from some axioms as a search problem. Three different specifications give rise to three different ways to solve the problem, namely forward and backward chaining and proof by contradiction. In all of these specifications, the representation language is predicate logic (not surprisingly), and the operators are the rules of inference, which allow us to rewrite a set of sentences as another set. We can think of each state in our search space as a sentence in first order logic. The operators will traverse this space, finding new sentences. However, we are really only interested in finding a path from the start states to the goal state, as this path will constitute a proof. (Note that there are other ways to prove theorems such as exhausting the search for a counterexample and finding none - in this case we don't have a deductive proof for the truth of the theorem, but we know it is true). Only the initial state of the space and the details of the goal differ in the three following approaches.

Forward Chaining

Suppose we have a set of axioms which we know are true statements about the world. If we set these to each be an initial state of the search space, and we set the goal state to be our theorem statement, then this is a simple approach which can be used to prove theorems. We call this approach forward chaining, because the agent employing the search constructs chains of reasoning, from the axioms, hopefully to the goal. Once a path has been found from the axioms to the theorem, this path constitutes a proof and the problem has been solved. However, the problem with forward chaining in general is that it cannot easily use the goal (theorem statement) to drive the search. Hence it really must just explore the search space until it comes across the solution. Goal-directed searches are often more effective than non-goal directed ones like forward chaining.
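A very small forward chainer can be sketched in Prolog (our own illustration; the rule/2 representation and predicate names are made up for the example, not a standard library). Rules are stored as rule(Head, Body) with Body a list of conditions, and we repeatedly add any head whose body is already known, until nothing new can be derived:

rule(mortal(socrates), [man(socrates)]).        % an example rule

forward(Known, Final) :-
    rule(Head, Body),
    all_known(Body, Known),                     % every condition already derived
    \+ member(Head, Known),                     % and the head is genuinely new
    !,
    forward([Head | Known], Final).             % add it and keep going
forward(Known, Known).                          % nothing new: we are done

all_known([], _).
all_known([Condition | Rest], Known) :-
    member(Condition, Known),
    all_known(Rest, Known).

Starting from the axioms, ?- forward([man(socrates)], Facts). returns Facts = [mortal(socrates), man(socrates)]. Notice that the chaining is not directed at any particular goal: it simply keeps deriving new facts, which is exactly the weakness described above.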

Backward Chaining

Given that we are only interested in constructing the path, we can set our initial state to be the theorem statement and search backwards until we find an axiom (or set of axioms). If we restrict ourselves to just using equivalences as rewrite rules, then this approach is OK, because we can use equivalences both ways, and any path from the theorem to axioms which is found will provide a proof. However, if we use inference rules to traverse from theorem to axioms, then we will have proved that, if the theorem is true, then the axioms are true. But we already know that the axioms are true! To get around this, we must invert our inference rules and try to work backwards. That is, the operators in the search basically answer the question: what could be true in order to infer the state (logical sentence) we are at right now?


If our agent starts searching from the theorem statement and reaches the axioms, it has proved the theorem. This is also problematic, because there are numerous answers to the inversion question, and the search space gets very large.

Proof by Contradiction

So, forward chaining and backward chaining both have drawbacks. Another approach is to think about proving theorems by contradiction. These are very common in mathematics: mathematicians specify some axioms, then make an assumption. After some complicated mathematics, they have shown that an axiom is false (or something derived from the axioms which did not involve the assumption is false). As the axioms are irrefutably correct, this means that the assumption they made must be false. That is, the assumption is inconsistent with the axioms of the theory. To use this for a particular theorem which they want to prove is true, they negate the theorem statement and use this as the assumption they are going to show is false. As the negated theorem must be false, their original theorem must be true. Bingo! We can program our reasoning agents to do just the same. To specify this as a search problem, therefore, we have to say that the axioms of our theory and the negation of the theorem we want to prove are the initial search states. Remembering our example in section 7.2, to do this, we need to derive the False statement to show inconsistency, so the False statement becomes our goal. Hence, if we can deduce the false statement from our axioms, the theorem we were trying to prove will indeed have been proven. This means that, not only can we use all our rules of inference, we also have a goal to aim for. As an example, below is the input to the Otter theorem prover for the trivial theorem about Socrates being mortal. Otter searches for contradictions using resolution, hence we note that the theorem statement - that Socrates is mortal - is negated using the minus sign. We discuss Otter and resolution theorem proving in the next two lectures. Input:
set(auto).
formula_list(usable).
all x (man(x)->mortal(x)).   % For all x, if x is a man then x is mortal
man(socrates).               % Socrates is a man
-mortal(socrates).           % Socrates is immortal (note: negated)
end_of_list.

Otter has no problem whatsoever proving this theorem, and here is the output:


Output:
---------------- PROOF ----------------
1 [] -man(x)|mortal(x).
2 [] -mortal(socrates).
3 [] man(socrates).
4 [hyper,3,1] mortal(socrates).
5 [binary,4.1,2.1] $F.

------------ end of proof -------------

Chapter-8 The Resolution Method


A minor miracle occurred in 1965 when Alan Robinson published his resolution method. This method uses a generalised version of the resolution rule of inference we saw in the previous lecture. It has been mathematically proven to be refutation-complete over first order logic. This means that if you write any set of sentences in first order logic which are unsatisfiable (i.e., taken together they are false, in that they have no models), then the resolution method will eventually derive the False symbol, indicating that the sentences somehow contradict each other. In particular, if the set of first order sentences comprises a set of axioms and the negation of a theorem you want to prove, the resolution method can be used in a proof-by-contradiction approach. This means that, if your first order theorem is true, then proof by contradiction using the resolution method is guaranteed to find the proof eventually. There are, however, some drawbacks to resolution theorem proving:

- It only works for true theorems which can be expressed in first order logic: it cannot check at the same time whether a conjecture is true or false, and it can't work in higher order logics. (There are related techniques which address these problems, to varying degrees of success.)
- While it is proven that the method will find the solution, in practice the search space is often too large to find one in a reasonable amount of time, even for fairly simple theorems.

Notwithstanding these drawbacks, resolution theorem proving is a complete method: if your theorem does follow from the axioms of a domain, then resolution can prove it. Moreover, it only uses one rule of deduction (resolution), rather than the multitude we saw in the last lecture. Hence, it is comparatively easy to
understand how resolution theorem provers work. For these reasons, the development of the resolution method was a major accomplishment in logic, with serious implications to Artificial Intelligence research. Resolution works by taking two sentences and resolving them into one, eventually resolving two sentences to produce the False statement. The resolution rule is more complicated than the rules of inference we've seen before, and we need to cover some preparatory notions before we can understand how it works. In particular, we need to look at conjunctive normal form and unification before we can state the full resolution rule at the heart of the resolution method.

8.1 Binary Resolution


We saw unit resolution (a propositional inference rule) in the previous lecture:
A ∨ B,  ¬B
----------
    A

We can take this a little further to propositional binary resolution:


A ∨ B,  ¬B ∨ C
--------------
    A ∨ C

Binary resolution gets its name from the fact that each sentence is a disjunction of exactly two literals. We say the two opposing literals B and ¬B are resolved: they are removed when the disjunctions are merged. The binary resolution rule can be seen to be sound because, if both A and C were false, then at least one of the sentences on the top line would be false. As this is an inference rule, we are assuming that the top line is true. Hence we can't have both A and C being false, which means either A or C must be true. So we can infer the bottom line. So far we've only looked at propositional versions of resolution. In first-order logic we need to also deal with variables and quantifiers. As we'll see below, we don't
need to worry about quantifiers: we are going to be working with sentences that only contain free variables. Recall that we treat these variables as implicitly universally quantified, and that they can take any value. This allows us to state a more general first-order binary resolution inference rule:
A ∨ B,   ¬C ∨ D,   Subst(θ, B) = Subst(θ, C)
--------------------------------------------
               Subst(θ, A ∨ D)

This rule has the side condition Subst(θ, B) = Subst(θ, C), which requires there to be a substitution θ which makes B and C the same before we can apply the rule. (Note that θ can substitute fresh variables while making B and C equal. It doesn't have to be a ground substitution!) If we can find such a θ, then we can make the resolution step and apply θ to the outcome. In fact, the first-order binary rule is simply equivalent to applying the substitution to the original sentences, and then applying the propositional binary rule.
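As a sketch of how a single resolution step might be programmed (our own illustration, not from the notes), we can represent a clause as a Prolog list of literals, with neg(L) marking a negated literal, and let Prolog's unification play the role of the substitution θ. The predicates select/3 and append/3 come from the standard lists library:

resolve(Clause1, Clause2, Resolvent) :-
    select(Literal, Clause1, Rest1),            % pick a positive literal...
    Literal \= neg(_),
    select(neg(Other), Clause2, Rest2),         % ...and a negated one
    Literal = Other,                            % unify them (the Subst step)
    append(Rest1, Rest2, Resolvent).            % merge what is left

A query such as ?- resolve([likes(george, X)], [neg(likes(george, tony)), happy(tony)], R). returns R = [happy(tony)], with X bound to tony by the unification.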

8.2 Conjunctive Normal Form


For the resolution rule to resolve two sentences, they must both be in a normalised format known as conjunctive normal form, which is usually abbreviated to CNF. This is an unfortunate name because the sentences themselves are made up of sets of disjunctions. It is implicitly assumed that the entire knowledge base is a big conjunction of the sentences, which is where conjunctive normal form gets its name. So, CNF is actually a conjunction of disjunctions. The disjunctions are made up of literals, which can either be a predicate or the negation of a predicate (for propositional logic, read a proposition or the negation of a proposition). So, CNF sentences are of the form:

(p1 ∨ p2 ∨ ... ∨ pm) ∧ (q1 ∨ q2 ∨ ... ∨ qn) ∧ etc.

where each pi and qj is a literal. Note that we call the disjunction of such literals a clause. As a concrete example,

likes(george, X) ∨ likes(tony, george) ∨ ¬is_mad(maggie)

is in conjunctive normal form, but:
likes(george, X) ∧ likes(tony, george) → is_mad(maggie) ∨ is_mad(tony)

is not in CNF. The following eight-stage process converts any sentence into CNF:

1. Eliminate arrow connectives by rewriting with:

P ↔ Q  =>  (P → Q) ∧ (Q → P)
P → Q  =>  ¬P ∨ Q

2. Move ¬ inwards using De Morgan's laws (including the quantifier versions) and double negation:

¬(P ∨ Q)  =>  ¬P ∧ ¬Q
¬(P ∧ Q)  =>  ¬P ∨ ¬Q
¬∀X. P    =>  ∃X. ¬P
¬∃X. P    =>  ∀X. ¬P
¬¬P       =>  P

3. Rename variables apart: the same variable name may be reused several times for different variables, within one sentence or between several. To avoid confusion later, rename each distinct variable with a unique name.

4. Move quantifiers outwards: the sentence is now in a form where all the quantifiers can be moved safely to the outside without affecting the semantics, provided they are kept in the same order.

5. Skolemise existential variables by replacing them with Skolem constants and functions. This is similar to the existential elimination rule from the last lecture: we just substitute a term for each existential variable that represents the 'something' for which it holds. If there are no preceding universal quantifiers, the 'something' is just a fresh constant. However, if there are, then we use a function that takes all these preceding universal variables as arguments. When we're done we just drop all the universal quantifiers. This leaves a quantifier-free sentence. For example:

∀X. ∃Y. person(X) → has(X, Y) ∧ heart(Y)
is Skolemised as:

person(X) → has(X, f(X)) ∧ heart(f(X))

6. Distribute ∨ over ∧ to make a conjunction of disjunctions. This involves rewriting:

P ∨ (Q ∧ R)  =>  (P ∨ Q) ∧ (P ∨ R)
(P ∧ Q) ∨ R  =>  (P ∨ R) ∧ (Q ∨ R)

7. Flatten binary connectives: replace nested ∧ and ∨ with flat lists of conjuncts and disjuncts:

P ∧ (Q ∧ R)  =>  P ∧ Q ∧ R
(P ∧ Q) ∧ R  =>  P ∧ Q ∧ R
P ∨ (Q ∨ R)  =>  P ∨ Q ∨ R
(P ∨ Q) ∨ R  =>  P ∨ Q ∨ R

8. The sentence is now in CNF. Further simplification can take place by removing duplicate literals and dropping any clause which contains both A and ¬A (one will be true, so the clause is always true; in the conjunction of clauses we want everything to be true, so we can drop it).

There is an optional final step that takes it to Kowalski normal form, also known as implicative normal form (INF):

9. Reintroduce implication by gathering up all the negative literals (the negated ones) and forming their conjunction N, then taking the disjunction P of the positive literals, and forming the logically equivalent clause N → P.

Example: Converting to CNF


We will work through a simple propositional example:

(B ∨ (A ∧ C)) → (B ∨ ¬A)

The first thing to do is remove the implication sign:

¬(B ∨ (A ∧ C)) ∨ (B ∨ ¬A)

Next we use De Morgan's laws to move our negation sign from the outside to the inside of the brackets:

(¬B ∧ ¬(A ∧ C)) ∨ (B ∨ ¬A)

And we can use De Morgan's law again to move a negation sign inwards:

(¬B ∧ (¬A ∨ ¬C)) ∨ (B ∨ ¬A)

Next we distribute ∨ over ∧ as follows:

(¬B ∨ (B ∨ ¬A)) ∧ ((¬A ∨ ¬C) ∨ (B ∨ ¬A))

If we flatten our disjunctions, we get our sentence into CNF form. Note the conjunction of disjunctions:

(¬B ∨ B ∨ ¬A) ∧ (¬A ∨ ¬C ∨ B ∨ ¬A)

Finally, the first clause contains both B and ¬B, so that clause is always true. Also, we can remove the duplicate ¬A in the second clause:

True ∧ (¬A ∨ ¬C ∨ B)

The truth of this sentence is only dependent on the second conjunct. If it is false, the whole thing is false; if it is true, the whole thing is true. Hence, we can remove the True, giving us a single clause in its final conjunctive normal form:

¬A ∨ ¬C ∨ B

If we want Kowalski normal form, we take one more step to get:

(A ∧ C) → B
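If you want to double-check a conversion like this mechanically, sympy's boolean logic module can do it (assuming sympy is installed; the function names below are sympy's, and the printed form may order the literals differently):

from sympy import symbols, Implies
from sympy.logic.boolalg import to_cnf

A, B, C = symbols("A B C")
sentence = Implies(B | (A & C), B | ~A)        # (B v (A ^ C)) -> (B v ~A)

# With simplify=True the trivially-true clause and the duplicate literal are removed,
# so the result should be logically equivalent to ~A | ~C | B, matching the derivation above.
print(to_cnf(sentence, simplify=True))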

8.3 Unification
We have said that the rules of inference for propositional logic detailed in the last lecture can also be used in first-order logic. However, we need to clarify that a little. One important difference between propositional and first-order logic is that the latter has predicates with terms as arguments. So, one clarification we need to make is that we can apply the inference rules as long as the predicates and arguments
match up. So, not only do we have to check for the correct kinds of sentence before we can carry out a rule of inference, we also have to check that the arguments do not forbid the inference. For instance, suppose in our knowledge base, we have these two statements:

knows(john, X) → hates(john, X)
knows(john, mary)

and we want to use the Modus Ponens rule to infer something new. In this case, there is no problem, and we can infer that, because john hates everyone he knows, and he knows Mary, then he must hate Mary, i.e., we can infer that hates(john, mary) is true. However, suppose instead that we had these two sentences:

knows(john, X) → hates(john, X)
knows(jack, mary)

Here, the predicate names have not changed, but the arguments are holding us back from making any deductive inference. In the first case above, we could allow the variable X to be instantiated to mary during the deduction, and the constant john before and after the deduction also matched without problem. However, in the second case, although we could still instantiate X to mary, we could no longer match john and jack, because they are two different constants. So we cannot deduce anything about john (or anybody else) from the latter two statements. The problem here comes from our inability to make the arguments in knows(john, X) and the arguments in knows(jack, mary) match up. When we can make two predicates match up, we say that we have unified them, and we will look at an algorithm for unifying two predicates (if they can be unified) in this section. Remember that unification plays a part in the way Prolog searches for matches to queries.

A Unification Algorithm

To unify two sentences, we must find a substitution which makes the two sentences the same. Remember that we write V/T to signify that we have substituted term T for variable V (read the "/" sign as "is substituted by"). The purpose of this algorithm will be to produce a substitution (a set of pairs V/T) for a given pair of sentences. So, for example, the output for the pair of sentences:
knows(john, X)
knows(john, mary)

will be: {X/mary}. However, for the two sentences above involving jack, the function should fail, as there was no way to unify the sentences. To describe the algorithm, we need to specify some functions it calls internally.

The function isa_variable(x) checks whether x is a variable.

The function isa_compound(x) checks whether x is a compound expression: either a predicate, a function or a connective which contains subparts. The subparts of a predicate or function are the arguments. The subparts of a connective are the things it connects. We write args(x) for the subparts of compound expression x. Note that args(x) outputs a list: the list of subparts. Also, we write op(x) to signify the symbol of the compound operator (predicate name, function name or connective symbol).

The function isa_list(x) checks whether x is a list. We write head(L) for the first term in a list L and tail(L) for the sublist comprising all the other terms except the head. Hence the head of [2,3,5,7,11] is 2 and the tail is [3,5,7,11]. This terminology is common in Prolog.

It's easiest to explain the unification algorithm as a recursive method which is able to call itself. As this is happening, a set, mu, is passed around the various parts of the algorithm, collecting substitutions as it goes. The method has two main parts:
unify_internal(x,y,mu)

which returns a substitution which makes sentence x look exactly like sentence y, given an already existing set of substitutions mu (although mu may be empty). This function checks various properties of x and y and calls either itself again or the unify_variable routine, as described below. Note that the order of the if-statements is important, and if a failure is reported at any stage, the whole function fails. If none of the cases is true for the input, then the algorithm fails to find a unifying set of substitutions.
unify_variable(var,x,mu)

which returns a substitution given a variable var, a sentence x and an already existing set of substitutions mu. This function also contains a set of cases which cause other routines to run if the case is true of the input. Again, the order of the cases is important. Here, if none of the cases is true of the input, a substitution is returned. The algorithm is as follows:

unify(x,y) = unify_internal(x,y,{})

unify_internal(x,y,mu)
----------------------
Cases
1. if (mu=failure) then return failure
2. if (x=y) then return mu
3. if (isa_variable(x)) then return unify_variable(x,y,mu)
4. if (isa_variable(y)) then return unify_variable(y,x,mu)
5. if (isa_compound(x) and isa_compound(y)) then
   return unify_internal(args(x),args(y),unify_internal(op(x),op(y),mu))
6. if (isa_list(x) and isa_list(y)) then
   return unify_internal(tail(x),tail(y),unify_internal(head(x),head(y),mu))
7. return failure

unify_variable(var,x,mu)
------------------------
Cases
1. if (a substitution var/val is in mu) then return unify_internal(val,x,mu)
2. if (a substitution x/val is in mu) then return unify_internal(var,val,mu)
3. if (var occurs anywhere in x) then return failure
4. add var/x to mu and return

Some things to note about this method are: (i) trying to match a constant to a different constant fails because they are not equal, neither is a variable and neither is a compound expression or list. Hence none of the cases in unify_internal is true, so it must return failure. (ii) Cases 1 and 2 in unify_variable(var,x,mu) check that neither input has already been substituted. If x already has a substitution, then it tries to unify the

substituted value and var, rather than x and var. It does similarly if var already has a substitution. (iii) Case 3 in unify_variable is known as the occurs-check case (or occur-check). This is important: imagine we got to the stage where, to complete a unification, we needed to substitute X with, say, f(X,Y). If we did this, we would write f(X,Y) instead of X. But this still has an X in it! So, we would need to substitute X by f(X,Y) again, giving us f(f(X,Y),Y), and it is obvious why we should never have tried this substitution in the first place, because this process will never end. The occurs check makes sure this isn't going to happen before case 4 returns a substitution. The rule is: you cannot substitute a compound for a variable if that variable appears in the compound already, because you will never get rid of the variable. (iv) The unify_internal(op(x),op(y),mu) part of case 5 in unify_internal checks that the operators of the two compound expressions are the same. This means that it will return failure if, for example, it tries to unify two predicates with different names, or a ∧ with a ∨ symbol. (v) The unification algorithm returns the unique most general unifier (MGU) mu for two sentences. This means that if there is another unifier U then T.U is always an instance of T.mu. The MGU substitutes as little as it can get away with while still being a unifier.
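The pseudocode above translates fairly directly into a runnable form. The following Python sketch is an illustrative transcription only: the term encoding (upper-case strings for variables, lower-case strings for constants, tuples whose first element is the operator name for compound terms) is an assumption, not part of the notes, and the compound and list cases are rolled into one loop over the argument lists.

def is_variable(x):
    return isinstance(x, str) and x[:1].isupper()

def is_compound(x):
    return isinstance(x, tuple)

def occurs_in(var, x, mu):
    # The occurs-check: does var appear anywhere inside x (following existing substitutions)?
    if x == var:
        return True
    if is_variable(x) and x in mu:
        return occurs_in(var, mu[x], mu)
    if is_compound(x):
        return any(occurs_in(var, arg, mu) for arg in x[1:])
    return False

def unify(x, y):
    return unify_internal(x, y, {})

def unify_internal(x, y, mu):
    if mu is None:                              # case 1: failure already reported
        return None
    if x == y:                                  # case 2: identical, nothing to do
        return mu
    if is_variable(x):                          # case 3
        return unify_variable(x, y, mu)
    if is_variable(y):                          # case 4
        return unify_variable(y, x, mu)
    if is_compound(x) and is_compound(y):       # cases 5 and 6: same operator, then unify the arguments
        if x[0] != y[0] or len(x) != len(y):
            return None
        for xi, yi in zip(x[1:], y[1:]):
            mu = unify_internal(xi, yi, mu)
            if mu is None:
                return None
        return mu
    return None                                 # case 7: nothing matched

def unify_variable(var, x, mu):
    if var in mu:                               # case 1: var already has a substitution
        return unify_internal(mu[var], x, mu)
    if is_variable(x) and x in mu:              # case 2: x already has a substitution
        return unify_internal(var, mu[x], mu)
    if occurs_in(var, x, mu):                   # case 3: the occurs-check
        return None
    mu = dict(mu)                               # case 4: record var/x
    mu[var] = x
    return mu

# The example from the next section: p(X,tony) ^ q(george,X,Z) against p(f(tony),tony) ^ q(B,C,maggie)
s1 = ("and", ("p", "X", "tony"), ("q", "george", "X", "Z"))
s2 = ("and", ("p", ("f", "tony"), "tony"), ("q", "B", "C", "maggie"))
print(unify(s1, s2))
# {'X': ('f', 'tony'), 'B': 'george', 'C': ('f', 'tony'), 'Z': 'maggie'}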

Example: Unifying Two Sentences


Suppose we wanted to unify these two sentences:
1. p(X, tony) ∧ q(george, X, Z)
2. p(f(tony), tony) ∧ q(B, C, maggie)

We can see by inspection that a way to unify these sentences is to apply the substitution: {X/f(tony), B/george, C/f(tony), Z/maggie}. Therefore, our unification algorithm should find this substitution. To run our algorithm, we set the inputs to be:

x = p(X, tony) ∧ q(george, X, Z)   and   y = p(f(tony), tony) ∧ q(B, C, maggie)

and then follow the algorithm steps.

Iteration one

unify_internal is called with inputs x, y and the empty list {}. This tries case 1,

but as mu is not failure, this is not the case. Next it tries case 2, but this is also not the case, because x is not equal to y. Cases 3 and 4 similarly fail, because neither x nor y is a variable. Finally, case 5 kicks in because x and y are compound terms. In fact, they are both conjunctions connected by the ∧ connective. Using our definitions above, args(x) = [p(X,tony), q(george,X,Z)] and args(y) = [p(f(tony),tony), q(B,C,maggie)]. Also, op(x) = ∧ and op(y) = ∧. So, case 5 means that we call unify_internal again with inputs [p(X,tony), q(george,X,Z)] and [p(f(tony),tony), q(B,C,maggie)]. Before we do that, the third input to the function will be unify_internal(op(x),op(y),mu). Because our op(x) and op(y) are the same (both ∧), this will return mu [check this yourselves]. mu is still the empty list, so this gets passed on.

Iteration two

So, we're back at the top of unify_internal again, but this time with a pair of lists as input. None of the cases match until case 6. This states that we have to split our lists into heads and tails, then unify the heads and use this to unify the tails. Unifying the heads means that we once again call unify_internal, this time with predicates p(X,tony) and p(f(tony),tony).

Iteration three

Now case 5 fires again, because our two inputs are both predicates. This turns the arguments into a list, checks that the two predicate names match and calls unify_internal yet again, this time with lists [X,tony] and [f(tony),tony] as input.

Iteration four

In this iteration, all the algorithm does is split the lists into heads and tails, and first calls unify_internal with X and f(tony) as inputs, and later with tony and tony as input. In the latter case we can see that unify_internal will return mu, because the constant symbols are equal. Hence this will not affect anything.

Iteration five

When X and f(tony) are given as input, case 3 fires because X is a variable. This causes unify_variable(X,f(tony),{}) to be called. In this case, it checks that neither X nor f(tony) has been subject to a substitution already, which they haven't, because the substitution list is still empty. It also makes an occurs-check, and finds that X
does not appear anywhere in f(tony), so case 3 does not fire. Hence it goes all the way to case 4, and X/f(tony) is added to the substitution list. Finally, we have a substitution! This returns the substitution list {X/f(tony)} as output, and causes some other embedded calls to also return this substitution list. It is left as an exercise to show that, in the same way in which the algorithm unified p(X,tony) and p(f(tony),tony) with the substitution {X/f(tony)}, it also unifies q(george,X,Z) and q(B,C,maggie), adding B/george, C/f(tony) and Z/maggie to the substitution list. However, in this case, we had already assigned the substitution X/f(tony). Hence, when unify_variable was finally called, it fired case 2 (or is it 1?) to make sure that the already substituted variable was not given another substitution. At this stage, all the return statements start to actually return things, and the substitution gets passed back all the way to the top. Finally, the substitution {X/f(tony), B/george, C/f(tony), Z/maggie} is indeed produced by the unification algorithm. When applied to both sentences, the result is the same sentence: p(f(tony),tony) ∧ q(george,f(tony),maggie)

The complexity of this relatively simple example shows why it is a good idea to get a software agent to do this, rather than doing it ourselves. Of course, if you wanted to try out the unification algorithm, you can simply run Prolog and type in your sentences separated by a single = sign. This asks Prolog to try to unify the two terms. This is what happens in Sicstus Prolog:
?- [p(X,tony),q(george,X,Z)] = [p(f(tony),tony),q(B,C,maggie)].
B = george,
C = f(tony),
X = f(tony),
Z = maggie ?
yes

We see that Prolog has come up with the same unifying substitution as before.

8.4 The Full Resolution Rule


Now that we know about unification, we can properly describe the full version of resolution:

p1 ∨ ... ∨ pj ∨ ... ∨ pm,     q1 ∨ ... ∨ qk ∨ ... ∨ qn      Unify(pj, ¬qk) = θ
---------------------------------------------------------------------------------
Subst(θ, p1 ∨ ... ∨ pj-1 ∨ pj+1 ∨ ... ∨ pm ∨ q1 ∨ ... ∨ qk-1 ∨ qk+1 ∨ ... ∨ qn)

This resolves literals pj and qk. Note that we have to add ¬ to qk to make it unify with pj, so it is in fact pj which is the negative literal here. The rule is more general than first-order binary resolution in that it allows an arbitrary number of literals in each clause. Moreover, θ is the most general unifier, rather than an arbitrary unifying substitution. To use the rule in practice, we first take a pair of sentences and express them in CNF using the techniques described above. Then we find two literals, pj and qk, for which we can find a substitution mu to unify pj and ¬qk. Then we take a disjunction of all the literals (in both sentences) except pj and qk. Finally, we apply the substitution to the new disjunction to determine what we have just inferred using resolution. In the next lecture, we will look at how resolution theorem proving is put into action, including some example proofs, some heuristics for improving its performance and some applications.
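As a small illustrative instance of the full rule (reusing the knows/hates predicates from section 8.3, not an example taken from the notes): resolving the clauses ¬knows(john, X) ∨ hates(john, X) and knows(john, mary), we can take pj = ¬knows(john, X) and qk = knows(john, mary), so that Unify(pj, ¬qk) = {X/mary}; the rule then infers Subst({X/mary}, hates(john, X)) = hates(john, mary), which mirrors the Modus Ponens inference made earlier.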

Chapter-11 Decision Tree Learning


As discussed in the last lecture, the representation scheme we choose to represent our learned solutions and the way in which we learn those solutions are the most important aspects of a learning method. We look in this lecture at decision trees - a simple but powerful representation scheme, and we look at the ID3 method for decision tree learning.

11.1 Decision Trees


Imagine you only ever do four things at the weekend: go shopping, watch a movie, play tennis or just stay in. What you do depends on three things: the weather (windy, rainy or sunny); how much money you have (rich or poor) and whether your parents are visiting. You say to yourself: if my parents are visiting, we'll go to the cinema. If they're not visiting and it's sunny, then I'll play tennis, but if it's
windy, and I'm rich, then I'll go shopping. If they're not visiting, it's windy and I'm poor, then I will go to the cinema. If they're not visiting and it's rainy, then I'll stay in. To remember all this, you draw a flowchart which will enable you to read off your decision. We call such diagrams decision trees. A suitable decision tree for the weekend decision choices would be as follows:

We can see why such diagrams are called trees, because, while they are admittedly upside down, they start from a root and have branches leading to leaves (the tips of the graph at the bottom). Note that the leaves are always decisions, and a particular decision might be at the end of multiple branches (for example, we could choose to go to the cinema for two different reasons). Armed with our decision tree, on Saturday morning, when we wake up, all we need to do is check (a) the weather (b) how much money we have and (c) whether our parent's car is parked in the drive. The decision tree will then enable us to make our decision. Suppose, for example, that the parents haven't turned up and the sun is shining. Then this path through our decision tree will tell us what to do:
and hence we run off to play tennis because our decision tree told us to. Note that the decision tree covers all eventualities. That is, there are no values that the weather, the parents turning up or the money situation could take which aren't catered for in the decision tree. Note that, in this lecture, we will be looking at how to automatically generate decision trees from examples, not at how to turn thought processes into decision trees.

Reading Decision Trees

There is a link between decision tree representations and logical representations, which can be exploited to make it easier to understand (read) learned decision trees. If we think about it, every decision tree is actually a disjunction of implications (if ... then statements), and the implications are Horn clauses: a conjunction of literals implying a single literal. In the above tree, we can see this by reading from the root node to each leaf node: If the parents are visiting, then go to the cinema or If the parents are not visiting and it is sunny, then play tennis or
If the parents are not visiting and it is windy and you're rich, then go shopping or If the parents are not visiting and it is windy and you're poor, then go to cinema or If the parents are not visiting and it is rainy, then stay in. Of course, this is just a re-statement of the original mental decision making process we described. Remember, however, that we will be programming an agent to learn decision trees from example, so this kind of situation will not occur as we will start with only example situations. It will therefore be important for us to be able to read the decision tree the agent suggests. Decision trees don't have to be representations of decision making processes, and they can equally apply to categorisation problems. If we phrase the above question slightly differently, we can see this: instead of saying that we wish to represent a decision process for what to do on a weekend, we could ask what kind of weekend this is: is it a weekend where we play tennis, or one where we go shopping, or one where we see a film, or one where we stay in? For another example, we can refer back to the animals example from the last lecture: in that case, we wanted to categorise what class an animal was (mammal, fish, reptile, bird) using physical attributes (whether it lays eggs, number of legs, etc.). This could easily be phrased as a question of learning a decision tree to decide which category a given animal is in, e.g., if it lays eggs and is homeothermic, then it's a bird, and so on...

11.2 Learning Decision Trees Using ID3

Specifying the Problem

We now need to look at how you mentally constructed your decision tree when deciding what to do at the weekend. One way would be to use some background information as axioms and deduce what to do. For example, you might know that your parents really like going to the cinema, and that your parents are in town, so therefore (using something like Modus Ponens) you would decide to go to the cinema. Another way in which you might have made up your mind was by generalising from previous experiences. Imagine that you remembered all the times when you had a really good weekend. A few weeks back, it was sunny and your parents were not visiting, you played tennis and it was good. A month ago, it was raining and you were penniless, but a trip to the cinema cheered you up. And so on. This information could have guided your decision making, and if this was the case, you would have used an inductive, rather than deductive, method to construct your
decision tree. In reality, it's likely that humans reason to solve decisions using both inductive and deductive processes. We can state the problem of learning decision trees as follows:

We have a set of examples correctly categorised into categories (decisions). We also have a set of attributes describing the examples, and each attribute has a finite set of values which it can possibly take. We want to use the examples to learn the structure of a decision tree which can be used to decide the category of an unseen example.

Assuming that there are no inconsistencies in the data (when two examples have exactly the same values for the attributes, but are categorised differently), it is obvious that we can always construct a decision tree to correctly decide for the training cases with 100% accuracy. All we have to do is make sure every situation is catered for down some branch of the decision tree. Of course, 100% accuracy may indicate overfitting.

The basic idea

In the decision tree above, it is significant that the "parents visiting" node came at the top of the tree. We don't know exactly the reason for this, as we didn't see the example weekends from which the tree was produced. However, it is likely that the number of weekends the parents visited was relatively high, and every weekend they did visit, there was a trip to the cinema. Suppose, for example, the parents have visited every fortnight for a year, and on each occasion the family visited the cinema. This means that there is no evidence in favour of doing anything other than watching a film when the parents visit. Given that we are learning rules from examples, this means that if the parents visit, the decision is already made. Hence we can put this at the top of the decision tree, and disregard all the examples where the parents visited when constructing the rest of the tree. Not having to worry about a set of examples will make the construction job easier. This kind of thinking underlies the ID3 algorithm for learning decisions trees, which we will describe more formally below. However, the reasoning is a little more subtle, as (in our example) it would also take into account the examples when the parents did not visit.

Entropy

Putting together a decision tree is all a matter of choosing which attribute to test at each node in the tree. We shall define a measure called information gain which will be used to decide which attribute to test at each node. Information gain is itself calculated using a measure called entropy, which we first define for the case of a binary decision problem and then define for the general case. Given a binary categorisation, C, and a set of examples, S, for which the proportion of examples categorised as positive by C is p+ and the proportion of examples categorised as negative by C is p-, then the entropy of S is:

Entropy(S) = -p+ log2(p+) - p- log2(p-)

The reason we defined entropy first for a binary decision problem is because it is easier to get an impression of what it is trying to calculate. Tom Mitchell puts this quite well: "In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples." Imagine having a set of boxes with some balls in. If all the balls were in a single box, then this would be nicely ordered, and it would be extremely easy to find a particular ball. If, however, the balls were distributed amongst the boxes, this would not be so nicely ordered, and it might take quite a while to find a particular ball. If we were going to define a measure based on this notion of purity, we would want to be able to calculate a value for each box based on the number of balls in it, then take the sum of these as the overall measure. We would want to reward two situations: nearly empty boxes (very neat), and boxes with nearly all the balls in (also very neat). This is the basis for the general entropy measure, which is defined as follows: Given an arbitrary categorisation, C, into categories c1, ..., cn, and a set of examples, S, for which the proportion of examples in ci is pi, then the entropy of S is:

Entropy(S) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)

This measure satisfies our criteria, because of the -p*log2(p) construction: when p gets close to zero (i.e., the category has only a few examples in it), then the log(p) becomes a big negative number, but the p part dominates the calculation, so the entropy works out to be nearly zero. Remembering that entropy calculates the disorder in the data, this low score is good, as it reflects our desire to reward
categories with few examples in. Similarly, if p gets close to 1 (i.e., the category has most of the examples in), then the log(p) part gets very close to zero, and it is this which dominates the calculation, so the overall value gets close to zero. Hence we see that both when the category is nearly - or completely - empty, or when the category nearly contains - or completely contains - all the examples, the score for the category gets close to zero, which models what we wanted it to. Note that 0*ln(0) is taken to be zero by convention.
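The entropy measure is simple to compute directly. The following short Python function is an illustrative sketch (not part of the notes) of the general formula, treating 0 * log2(0) as zero exactly as the convention above requires:

import math

def entropy(proportions):
    # proportions: the proportion of examples falling into each category
    return sum(-p * math.log2(p) for p in proportions if p > 0)

print(entropy([1.0, 0.0]))    # 0.0   - a pure category is perfectly ordered
print(entropy([0.5, 0.5]))    # 1.0   - a 50/50 split is maximally disordered
print(entropy([0.25, 0.75]))  # about 0.811, the value used in the example calculation below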

Information Gain

We now return to the problem of trying to determine the best attribute to choose for a particular node in a tree. The following measure calculates a numerical value for a given attribute, A, with respect to a set of examples, S. Note that the values of attribute A will range over a set of possibilities which we call Values(A), and that, for a particular value from that set, v, we write Sv for the set of examples which have value v for attribute A. The information gain of attribute A, relative to a collection of examples, S, is calculated as:

Gain(S, A) = Entropy(S) - Σ (|Sv|/|S|) * Entropy(Sv), summing over each value v in Values(A)

The information gain of an attribute can be seen as the expected reduction in entropy caused by knowing the value of attribute A.

An Example Calculation

As an example, suppose we are working with a set of examples, S = {s1, s2, s3, s4}, categorised into a binary categorisation of positives and negatives, such that s1 is positive and the rest are negative. Suppose further that we want to calculate the information gain of an attribute, A, and that A can take the values {v1, v2, v3}. Finally, suppose that:

s1 takes value v2 for A
s2 takes value v2 for A
s3 takes value v3 for A
s4 takes value v1 for A

To work out the information gain for A relative to S, we first need to calculate the entropy of S. To use our formula for binary categorisations, we need to know the

proportion of positives in S and the proportion of negatives. These are given as: p+ = 1/4 and p- = 3/4. So, we can calculate:

Entropy(S) = -(1/4)log2(1/4) - (3/4)log2(3/4) = -(1/4)(-2) - (3/4)(-0.415) = 0.5 + 0.311 = 0.811

Note that, to do this calculation with your calculator, you may need to remember that: log2(x) = ln(x)/ln(2), where ln(2) is the natural log of 2. Next, we need to calculate the weighted Entropy(Sv) for each value v = v1, v2, v3, noting that the weighting involves multiplying by (|Svi|/|S|). Remember also that Sv is the set of examples from S which have value v for attribute A. This means that: Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}. We now need to carry out these calculations:

(|Sv1|/|S|) * Entropy(Sv1) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1)) = (1/4)(-0 - (1)log2(1)) = (1/4)(-0 - 0) = 0

(|Sv2|/|S|) * Entropy(Sv2) = (2/4) * (-(1/2)log2(1/2) - (1/2)log2(1/2)) = (1/2) * (-(1/2)*(-1) - (1/2)*(-1)) = (1/2) * (1) = 1/2

(|Sv3|/|S|) * Entropy(Sv3) = (1/4) * (-(0/1)log2(0/1) - (1/1)log2(1/1)) = (1/4)(-0 - (1)log2(1)) = (1/4)(-0 - 0) = 0

Note that we have taken 0 log2(0) to be zero, which is standard. In our calculation, we only required log2(1) = 0 and log2(1/2) = -1. We now have to add these three values together and take the result from our calculation for Entropy(S) to give us the final result:

Gain(S,A) = 0.811 - (0 + 1/2 + 0) = 0.311

We now look at how information gain can be used in practice in an algorithm to construct decision trees.
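Before moving on, the arithmetic above can be checked in a few lines of Python (an illustrative sketch; the helper and variable names are not from the notes):

import math

def entropy(proportions):
    return sum(-p * math.log2(p) for p in proportions if p > 0)

entropy_S = entropy([1/4, 3/4])      # about 0.811: one positive, three negatives

# Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}, each weighted by |Svi|/|S|
weighted = (1/4) * entropy([0, 1]) + (2/4) * entropy([1/2, 1/2]) + (1/4) * entropy([0, 1])

print(round(entropy_S - weighted, 3))   # 0.311, the information gain of A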

The ID3 algorithm

The calculation for information gain is the most difficult part of this algorithm. ID3 performs a search whereby the search states are decision trees and the operator involves adding a node to an existing tree. It uses information gain to measure the attribute to put in each node, and performs a greedy search using this measure of worth. The algorithm goes as follows:
Given a set of examples, S, categorised into categories ci, then:

1. Choose the root node to be the attribute, A, which scores the highest for information gain relative to S.

2. For each value v that A can possibly take, draw a branch from the node.

3. For each branch from A corresponding to value v, calculate Sv. Then:

If Sv is empty, choose the category cdefault which contains the most examples from S, and put this as the leaf node category which ends that branch.

If Sv contains only examples from a category c, then put c as the leaf node category which ends that branch.

Otherwise, remove A from the set of attributes which can be put into nodes. Then put a new node in the decision tree, where the new attribute being tested in the node is the one which scores highest for information gain relative to Sv (note: not relative to S). This new node starts the cycle again (from 2), with S replaced by Sv in the calculations, and the tree gets built iteratively like this.

The algorithm terminates either when all the attributes have been exhausted, or the decision tree perfectly classifies the examples. The following diagram should explain the ID3 algorithm further:

11.3 A worked example


We will stick with our weekend example. Suppose we want to train a decision tree using the following instances:
Weekend (Example)   Weather   Parents   Money   Decision (Category)
W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis

W3                  Windy     Yes       Rich    Cinema
W4                  Rainy     Yes       Poor    Cinema
W5                  Rainy     No        Rich    Stay in
W6                  Rainy     Yes       Poor    Cinema
W7                  Windy     No        Poor    Cinema
W8                  Windy     No        Rich    Shopping
W9                  Windy     Yes       Rich    Cinema
W10                 Sunny     No        Rich    Tennis

The first thing we need to do is work out which attribute will be put into the node at the top of our tree: either weather, parents or money. To do this, we need to calculate:

Entropy(S) = -pcinema log2(pcinema) - ptennis log2(ptennis) - pshopping log2(pshopping) - pstay_in log2(pstay_in)
           = -(6/10)*log2(6/10) - (2/10)*log2(2/10) - (1/10)*log2(1/10) - (1/10)*log2(1/10)
           = -(6/10)*(-0.737) - (2/10)*(-2.322) - (1/10)*(-3.322) - (1/10)*(-3.322)
           = 0.4422 + 0.4644 + 0.3322 + 0.3322 = 1.571

and we need to determine the best of:

Gain(S, weather) = 1.571 - (|Ssun|/10)*Entropy(Ssun) - (|Swind|/10)*Entropy(Swind) - (|Srain|/10)*Entropy(Srain)
                 = 1.571 - (0.3)*Entropy(Ssun) - (0.4)*Entropy(Swind) - (0.3)*Entropy(Srain)
                 = 1.571 - (0.3)*(0.918) - (0.4)*(0.81125) - (0.3)*(0.918) = 0.70

Gain(S, parents) = 1.571 - (|Syes|/10)*Entropy(Syes) - (|Sno|/10)*Entropy(Sno)
                 = 1.571 - (0.5)*0 - (0.5)*1.922 = 1.571 - 0.961 = 0.61

Gain(S, money) = 1.571 - (|Srich|/10)*Entropy(Srich) - (|Spoor|/10)*Entropy(Spoor)
               = 1.571 - (0.7)*(1.842) - (0.3)*0 = 1.571 - 1.2894 = 0.2816
This means that the first node in the decision tree will be the weather attribute. As an exercise, convince yourself why this scored (slightly) higher than the parents attribute - remember what entropy means and look at the way information gain is calculated. From the weather node, we draw a branch for the values that weather can take: sunny, windy and rainy:

Now we look at the first branch. Ssunny = {W1, W2, W10}. This is not empty, so we do not put a default categorisation leaf node here. The categorisations of W1, W2 and W10 are Cinema, Tennis and Tennis respectively. As these are not all the same, we cannot put a categorisation leaf node here. Hence we put an attribute node here, which we will leave blank for the time being. Looking at the second branch, Swindy = {W3, W7, W8, W9}. Again, this is not empty, and they do not all belong to the same class, so we put an attribute node here, left blank for now. The same situation happens with the third branch, hence our amended tree looks like this:

Now we have to fill in the choice of attribute A, which we know cannot be weather, because we've already removed that from the list of attributes to use. So, we need to calculate the values for Gain(Ssunny, parents) and Gain(Ssunny, money). Firstly, Entropy(Ssunny) = 0.918. Next, we set S to be Ssunny = {W1, W2, W10} (and, for this part of the branch, we will ignore all the other examples). In effect, we are interested only in this part of the table:
Weekend (Example)   Weather   Parents   Money   Decision (Category)

W1                  Sunny     Yes       Rich    Cinema
W2                  Sunny     No        Rich    Tennis
W10                 Sunny     No        Rich    Tennis

Hence we can calculate:

Gain(Ssunny, parents) = 0.918 - (|Syes|/|S|)*Entropy(Syes) - (|Sno|/|S|)*Entropy(Sno)
                      = 0.918 - (1/3)*0 - (2/3)*0 = 0.918

Gain(Ssunny, money) = 0.918 - (|Srich|/|S|)*Entropy(Srich) - (|Spoor|/|S|)*Entropy(Spoor)
                    = 0.918 - (3/3)*0.918 - (0/3)*0 = 0.918 - 0.918 = 0

Notice that Entropy(Syes) and Entropy(Sno) were both zero, because Syes contains examples which are all in the same category (cinema), and Sno similarly contains examples which are all in the same category (tennis). This should make it more obvious why we use information gain to choose attributes to put in nodes. Given our calculations, attribute A should be taken as parents. The two values from parents are yes and no, and we will draw a branch from the node for each of these. Remembering that we replaced the set S by the set Ssunny, looking at Syes, we see that the only example of this is W1. Hence, the branch for yes stops at a categorisation leaf, with the category being Cinema. Also, Sno contains W2 and W10, but these are in the same category (Tennis). Hence the branch for no ends here at a categorisation leaf. Hence our upgraded tree looks like this:

Finishing this tree off is left as a tutorial exercise.
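For completeness, here is a compact Python sketch of ID3 run on the ten weekend examples. This is an illustrative implementation, not code from the notes: it branches only on attribute values that actually occur in the current example set (so the empty-Sv default case never arises here), and ties in information gain are broken by attribute order. It should put weather at the root and test parents under the sunny branch, as derived above.

import math
from collections import Counter

def entropy(examples):
    counts = Counter(category for _, category in examples)
    total = len(examples)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def information_gain(examples, attribute):
    total = len(examples)
    remainder = 0.0
    for value in set(attrs[attribute] for attrs, _ in examples):
        subset = [(attrs, cat) for attrs, cat in examples if attrs[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, attributes):
    categories = [cat for _, cat in examples]
    if len(set(categories)) == 1:                       # all examples agree: leaf node
        return categories[0]
    if not attributes:                                  # attributes exhausted: majority category
        return Counter(categories).most_common(1)[0][0]
    best = max(attributes, key=lambda a: information_gain(examples, a))
    branches = {}
    for value in set(attrs[best] for attrs, _ in examples):
        subset = [(attrs, cat) for attrs, cat in examples if attrs[best] == value]
        branches[value] = id3(subset, [a for a in attributes if a != best])
    return (best, branches)

# The ten weekend examples from the table above.
data = [
    ({"weather": "sunny", "parents": "yes", "money": "rich"}, "cinema"),
    ({"weather": "sunny", "parents": "no",  "money": "rich"}, "tennis"),
    ({"weather": "windy", "parents": "yes", "money": "rich"}, "cinema"),
    ({"weather": "rainy", "parents": "yes", "money": "poor"}, "cinema"),
    ({"weather": "rainy", "parents": "no",  "money": "rich"}, "stay in"),
    ({"weather": "rainy", "parents": "yes", "money": "poor"}, "cinema"),
    ({"weather": "windy", "parents": "no",  "money": "poor"}, "cinema"),
    ({"weather": "windy", "parents": "no",  "money": "rich"}, "shopping"),
    ({"weather": "windy", "parents": "yes", "money": "rich"}, "cinema"),
    ({"weather": "sunny", "parents": "no",  "money": "rich"}, "tennis"),
]

print(id3(data, ["weather", "parents", "money"]))
# Expected shape: ('weather', {'sunny': ('parents', {...}), 'windy': ..., 'rainy': ...})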

11.4 Avoiding Overfitting
As we discussed in the previous lecture, overfitting is a common problem in machine learning. Decision trees suffer from this, because they are trained to stop when they have perfectly classified all the training data, i.e., each branch is extended just far enough to correctly categorise the examples relevant to that branch. Many approaches to overcoming overfitting in decision trees have been attempted. As summarised by Tom Mitchell, these attempts fit into two types:

Stop growing the tree before it reaches perfection.

Allow the tree to fully grow, and then post-prune some of the branches from it.

The second approach has been found to be more successful in practice. Both approaches boil down to the question of determining the correct tree size. See Chapter 3 of Tom Mitchell's book for a more detailed description of overfitting avoidance in decision tree learning.

11.5 Appropriate Problems for Decision Tree Learning


It is a skilled job in AI to choose exactly the right learning representation/method for a particular learning task. As elaborated by Tom Mitchell, decision tree learning is best suited to problems with these characteristics:

The background concepts describe the examples in terms of attribute-value pairs, and the values for each attribute range over finitely many fixed possibilities. The concept to be learned (Mitchell calls it the target function) has discrete values. Disjunctive descriptions might be required in the answer.

In addition to this, decision tree learning is robust to errors in the data. In particular, it will function well in the light of (i) errors in the classification instances provided (ii) errors in the attribute-value pairs provided and (iii) missing values for certain attributes for certain examples.

Lecture 12: Artificial Neural Networks - Two Layer Networks


Decision trees, while powerful, are a simple representation scheme. While graphical on the surface, they can be seen as disjunctions of conjunctions, and hence are a logical representation, and we call such schemes symbolic representations. In this lecture, we look at a non-symbolic representation scheme known as Artificial Neural Networks. This term is often shortened to Neural Networks, but this annoys neuro-biologists who deal with real neural networks (inside our human heads). As the name suggests, ANNs have a biological motivation, and we briefly look at that first. Following this, we look in detail at how information is represented in ANNs, then we look at the simplest type of network, two layer networks. We look at perceptrons and linear units, and discuss the limitations that such simple networks have. In the next lecture, we discuss multi-layer networks and the backpropagation algorithm for learning such networks.

12.1 Biological Motivation


In our discussion in the first lecture about how people have answered the question: "How are we going to get an agent to act intelligently", one of the answers was to realise that intelligence in individual humans is effected by our brains. Neuroscientists have told us that the brain is made up of architectures of networks of neurons. At the most basic level, neurons can be seen as functions which, when given some input, will either fire or not fire, depending on the nature of the input. The input to certain neurons comes from the senses, but in general, the input to a neuron is a set of outputs from other neurons. If the input to a neuron goes over a certain threshold, then the neuron will fire. In this way, one neuron firing will affect the firing of many other neurons, and information can be stored in terms of the thresholds set and the weight assigned by each neuron to each of its inputs. Artificial Neural Networks (ANNs) are designed to mimic the behaviour of the brain. Some ANNs are built into hardware, but the vast majority are simulated in software, and we concentrate on these. It's important not to take the analogy too far, because there really isn't much similarity between artificial and animal neural networks. In particular, while the human brain is estimated to contain around 100,000,000,000 neurons, ANNs usually contain less than 1000 equivalent units. Moreover, the interconnection of neurons is much bigger in natural systems. Also, the way in which ANNs store and manipulate information is a gross simplification of the way in which networks of neurons work in natural systems.

12.2 ANN Representation


ANNs are taught on AI courses because of their motivation from brain studies and the fact that they are used in an AI task, namely machine learning. However, I
would argue that their real home is in statistics, because, as a representation scheme, they are just fancy mathematical functions. Imagine being asked to come up with a function to take the following inputs and produce their associated outputs:
Input   Output
1       1
2       4
3       9
4       16

Presumably, the function you would learn would be f(x) = x². Imagine now that you had a set of values, rather than a single instance as input to your function:
Input     Output
[1,2,3]   1
[2,3,4]   5
[3,4,5]   11
[4,5,6]   19

Here, it is still possible to learn a function: for example, multiply the first and last element and take the middle one from the product. Note that the functions we are learning are getting more complicated, but they are still mathematical. ANNs just take this further: the functions they learn are generally so complicated that it's difficult to understand them on a global level. But they are still just functions which play around with numbers. Imagine, now, for example, that the inputs to our function were arrays of pixels, actually taken from photographs of vehicles, and that the output of the function is either 1, 2 or 3, where 1 stands for a car, 2 stands for a bus and 3 stands for a tank:

[Figure: example photographs of vehicles given as pixel-array inputs, each paired with an output of 1 (car), 2 (bus) or 3 (tank).]

In this case, the function which takes an array of integers representing pixel data and outputs either 1, 2 or 3 will be fairly complicated, but it's just doing the same kind of thing as the two simpler functions. Because the functions learned to, for example, categorise photos of vehicles into a category of car, bus or tank, are so complicated, we say the ANN approach is a black box approach because, while the function performs well at its job, we cannot look inside it to gain a knowledge of how it works. This is a little unfair, as there are some projects which have addressed the problem of translating learned neural networks into human readable forms. However, in general, ANNs are used in cases where the predictive accuracy is of greater importance than understanding the learned concept. Artificial Neural Networks consist of a number of units which are mini calculation devices. They take in real-valued input from multiple other nodes and they produce a single real valued output. By real-valued input and output we mean real numbers which are able to take any decimal value. The architecture of ANNs is as follows: 1. A set of input units which take in information about the example to be propagated through the network. By propagation, we mean that the information from the input will be passed through the network and an output produced. The set of input units forms what is known as the input layer. 2. A set of hidden units which take input from the input layer. The hidden units collectively form the hidden layer. For simplicity, we assume that each unit in the input layer is connected to each unit of the hidden layer, but this isn't necessarily the case. A weighted sum of the output from the input
units forms the input to every hidden unit. Note that the number of hidden units is usually smaller than the number of input units. 3. A set of output units which, in learning tasks, dictate the category assigned to an example propagated through the network. The output units form the output layer. Again, for simplicity, we assume that each unit in the hidden layer is connected to each unit in the output layer. A weighted sum of the output from the hidden units forms the input to every output unit. Hence ANNs look like this in the general case:

Note that the w, x, y and z represent real valued weights and that all the edges in this graph have weights associated with them (but it was difficult to draw them all on). Note also that more complicated ANNs are certainly possible. In particular, many ANNs have multiple hidden layers, with the output from one hidden layer forming the input to another hidden layer. Also, ANNs with no hidden layer - where the input units are connected directly to the output units - are possible. These tend to be too simple to use for real world learning problems, but they are useful to study for illustrative purposes, and we look at the simplest kind of neural networks, perceptrons, in the next section. In our vehicle example, it is likely that the images will all be normalised to having the same number of pixels. Then there may be an input unit for each red, green and blue intensity for each pixel. Alternatively, greyscale images may be used, in which case there needs only to be an input node for each pixel, which takes in the brightness of the pixel. The hidden layer is likely to contain far fewer units (probably between 3 and 10) than the number of input units. The output layer will contain three units, one for each of the categories possible (car, bus, tank). Then,

when the pixel data for an image is given as the initial values for the input units, this information will propagate through the network and the three output units will each produce a real value. The output unit which produces the highest value is taken as the categorisation for the input image. So, for instance, when this image is used as input:

then, if output unit 1 [car] produces value 0.5, output unit 2 [bus] produces value 0.05 and output unit 3 [tank] produces value 0.1, then this image has been (correctly) classified as a car, because the output from the corresponding car output unit is higher than for the other two. Exactly how the function embedded within a neural network computes the outputs given the inputs is best explained using example networks. In the next section, we look at the simplest networks of all, perceptrons, which consist of a set of input units connected to a single output unit.

12.3 Perceptrons
The weights in any ANN are always just real numbers and the learning problem boils down to choosing the best value for each weight in the network. This means there are two important decisions to make before we train an artificial neural network: (i) the overall architecture of the system (how input nodes represent given examples, how many hidden units/hidden layers to have and how the output information will give us an answer) and (ii) how the units calculate their real value output from the weighted sum of real valued inputs. The answer to (i) is usually found by experimentation with respect to the learning problem at hand: different architectures are tried and evaluated on the learning problem until the best one emerges. In perceptrons, given that we have no hidden layer, the architecture problem boils down to just specifying how the input units represent the examples given to the network. The answer to (ii) is discussed in the next subsection.

Units


The input units simply output the value which was input to them from the example to be propagated. Every other unit in a network normally has the same internal calculation function, which takes the weighted sum of inputs to it and calculates an output. There are different possibilities for the unit function and this dictates to some extent how learning over networks of that type is performed. Firstly, there is a simple linear unit which does no calculation, it just outputs the weighted sum which was input to it. Secondly, there are other unit functions which are called threshold functions, because they are set up to produce low values up until the weighted sum reaches a particular threshold, then they produce high values after this threshold. The simplest type of threshold function produces a 1 if the weighted sum of the inputs is over a threshold value T, and produces a -1 otherwise. We call such functions step functions, due to the fact that, when drawn as a graph, it looks like a step. Another type of threshold function is called a sigma function, which has similarities with the step function, but advantages over it. We will look at sigma functions in the next lecture.

Example

As an example, consider an ANN which has been trained to learn the following rule categorising the brightness of 2x2 black and white pixel images: if it contains 3 or 4 black pixels, it is dark; if it contains 2, 3 or 4 white pixels, it is bright. We can model this with a perceptron by saying that there are 4 input units, one for each pixel, and they output +1 if the pixel is white and -1 if the pixel is black. Also, the output unit produces a 1 if the input example is to be categorised as bright and -1 if the example is dark. If we choose the weights as in the following diagram, the perceptron will perfectly categorise any image of four pixels into dark or light according to our rule:



We see that, in this case, the output unit has a step function, with the threshold set to -0.1. Note that the weights in this network are all the same, which is not true in the general case. Also, it is convenient to make the weights going in to a node add up to 1, so that it is possible to compare them easily. The reason this network perfectly captures our notion of darkness and lightness is because, if three white pixels are input, then three of the input units produce +1 and one input unit produces -1. This goes into the weighted sum, giving a value of S = 0.25*1 + 0.25*1 + 0.25*1 + 0.25*(-1) = 0.5. As this is greater than the threshold of -0.1, the output node produces +1, which relates to our notion of a bright image. Similarly, four white pixels will produce a weighted sum of 1, which is greater than the threshold, and two white pixels will produce a sum of 0, also greater than the threshold. However, if there are three black pixels, S will be -0.5, which is below the threshold, hence the output node will output -1, and the image will be categorised as dark. Similarly, an image with four black pixels will be categorised as dark. As an exercise: keeping the weights the same, how low would the threshold have to be in order to misclassify an example with three or four black pixels?
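The behaviour just described can be reproduced in a few lines of Python (an illustrative sketch; the +1/-1 pixel encoding and the -0.1 threshold are taken from the text):

def step(weighted_sum, threshold=-0.1):
    return 1 if weighted_sum > threshold else -1

def brightness_perceptron(pixels):
    # pixels: four values, +1 for a white pixel and -1 for a black pixel; all weights are 0.25
    weighted_sum = sum(0.25 * p for p in pixels)
    return step(weighted_sum)

print(brightness_perceptron([+1, +1, +1, -1]))   # three white pixels: S = 0.5,  output +1 (bright)
print(brightness_perceptron([+1, +1, -1, -1]))   # two white pixels:   S = 0.0,  output +1 (bright)
print(brightness_perceptron([-1, -1, -1, +1]))   # three black pixels: S = -0.5, output -1 (dark)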

Learning Weights in Perceptrons

We will look in detail at the learning method for weights in multi-layer networks next lecture. The following description of learning in perceptrons will help clarify what is going on in the multi-layer case. We are in a machine learning setting, so we can expect the task to be to learn a target function which categorises examples into categories, given (at least) a set of training examples supplied with their correct categorisations. A little thought will be needed in order to choose the correct way of thinking about the examples as input to a set of input units, but, due to the simple nature of a perceptron, there isn't much choice for the rest of the architecture. In order to produce a perceptron able to perform our categorisation task, we need to use the examples to train the weights between the input units and the output unit, and to train the threshold. To simplify the routine, we think of the threshold as a special weight, which comes from a special input node that always outputs a 1. So, we think of our perceptron like this:


Then, we say that the output from the perceptron is +1 if the weighted sum from all the input units (including the special one) is greater than zero, and it outputs -1 otherwise. We see that weight w0 is simply the threshold value. However, thinking of the network like this means we can train w0 in the same way as we train all the other weights. The weights are initially assigned randomly and training examples are used one after another to tweak the weights in the network. All the examples in the training set are used and the whole process (using all the examples again) is iterated until all examples are correctly categorised by the network. The tweaking is known as the perceptron training rule, and is as follows: If the training example, E, is correctly categorised by the network, then no tweaking is carried out. If E is mis-classified, then each weight is tweaked by adding on a small value, Δ. Suppose we are trying to calculate weight wi, which is between the i-th input unit, xi, and the output unit. Then, given that the network should have calculated the target value t(E) for example E, but actually calculated the observed value o(E), then Δ is calculated as:

Δ = η(t(E) - o(E))xi

Note that η is a fixed positive constant called the learning rate. Ignoring η briefly, we see that the value that we add on to our weight wi is calculated by multiplying the input value xi by t(E) - o(E). t(E) - o(E) will either be +2 or -2, because perceptrons output only +1 or -1, and t(E) cannot be equal to o(E), otherwise we wouldn't be doing any tweaking. So, we can think of t(E) - o(E) as a movement in a particular numerical direction, i.e., positive or negative. This direction will be such that, if the overall sum, S, was too low to get over the threshold and produce the correct categorisation, then the contribution to S from wi * xi will be increased. Conversely, if S is too high, the contribution from wi * xi is reduced. Because t(E) - o(E) is multiplied by xi, then if xi is a big value (positive or negative), the change to the weight will be greater. To get a better feel for why this direction correction works, it's a good idea to do some simple calculations by hand.


η simply controls how far the correction should go at one time, and is usually set to be a fairly low value, e.g., 0.1. The weight learning problem can be seen as finding the global minimum error, calculated as the proportion of mis-categorised training examples, over a space where all the weight values can vary. Therefore, it is possible to move too far in a direction and improve one particular weight to the detriment of the overall sum: while the sum may work for the training example being looked at, it may no longer be a good value for categorising all the examples correctly. For this reason, η restricts the amount of movement possible. If a large movement is actually required for a weight, then this will happen over a series of iterations through the example set. Sometimes, η is set to decay as the number of such iterations through the whole set of training examples increases, so that it can move more slowly towards the global minimum in order not to overshoot in one direction. This kind of gradient descent is at the heart of the learning algorithm for multi-layered networks, as discussed in the next lecture. Perceptrons with step functions have limited abilities when it comes to the range of concepts that can be learned, as discussed in a later section. One way to improve matters is to replace the threshold function with a linear unit, so that the network outputs a real value, rather than a 1 or -1. This enables us to use another rule, called the delta rule, which is also based on gradient descent. We don't look at this rule here, because the backpropagation learning method for multi-layer networks is similar.

12.4 Worked Example


Suppose we are trying to train a perceptron to represent the brightness rules above, in such a way that if it outputs a 1, the image is categorised as bright, and if it outputs a -1, the image is categorised as dark. Remember that we said a 2x2 black and white pixel image is categorised as bright if it has two or more white pixels in it. We shall call the pixels p1 to p4, with the numbers going from left to right, top to bottom in the 2x2 image. A black pixel will produce an input of -1 to the network, and a white pixel will give an input of +1. Given our new way of thinking about the threshold as a weight from a special input node, our network will have five input nodes and five weights. Suppose also that we have assigned the weights randomly to values between -1 and 1, namely -0.5, 0.7, -0.2, 0.1 and 0.9. Then our perceptron will initially look like this:


We will now train the network with the first training example, using a learning rate of η = 0.1. Suppose the first example image, E, is this:

With two white squares, this is categorised as bright. Hence, the target output for E is: t(E) = +1. Also, p1 (top left) is black, so the input x1 is -1. Similarly, x2 is +1, x3 is +1 and x4 is -1. Hence, when we propagate this through the network, we get the value: S = (-0.5 * 1) + (0.7 * -1) + (-0.2 * +1) + (0.1 * +1) + (0.9 * -1) = -2.2. As this value is less than zero, the network outputs o(E) = -1, which is not the correct value. This means that we should now tweak the weights in light of the incorrectly categorised example. Using the perceptron training rule, we need to calculate the value of Δ to add on to each weight in the network. Plugging values into the formula for each weight gives us:

Δ0 = η(t(E) - o(E))x0 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
Δ1 = η(t(E) - o(E))x1 = 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2
Δ2 = η(t(E) - o(E))x2 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2
Δ3 = η(t(E) - o(E))x3 = 0.1 * (1 - (-1)) * (1) = 0.1 * (2) = 0.2



Δ4 = η(t(E) - o(E))x4 = 0.1 * (1 - (-1)) * (-1) = 0.1 * (-2) = -0.2

When we add these values on to our existing weights, we get the new weights for the network as follows:

w'0 = -0.5 + Δ0 = -0.5 + 0.2 = -0.3
w'1 = 0.7 + Δ1 = 0.7 + (-0.2) = 0.5
w'2 = -0.2 + Δ2 = -0.2 + 0.2 = 0
w'3 = 0.1 + Δ3 = 0.1 + 0.2 = 0.3
w'4 = 0.9 + Δ4 = 0.9 - 0.2 = 0.7

Our newly trained network will now look like this:

To see how this has improved the situation with respect to the training example, we can propagate it through the network again. This time, we get the weighted sum to be: S = (-0.3 * 1) + (0.5 * -1) + (0 * +1) + (0.3 * +1) + (0.7 * -1) = -1.2. This is still negative, and hence the network categorises the example as dark, when it should be bright. However, it is less negative. We can see that, by repeatedly training using this example, the training rule would eventually bring the network to a state where it would correctly categorise this example.
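The hand calculation above can be checked with a few lines of Python. This is just a sketch reproducing one application of the training rule; the weights and inputs are taken from the example.

eta = 0.1
weights = [-0.5, 0.7, -0.2, 0.1, 0.9]       # w0..w4, with w0 the special "threshold" weight
inputs  = [1, -1, 1, 1, -1]                 # x0=1 (special input), then the four pixel values
target  = 1                                 # the image is bright

s = sum(w * x for w, x in zip(weights, inputs))
observed = 1 if s > 0 else -1               # s = -2.2, so observed = -1
weights = [w + eta * (target - observed) * x for w, x in zip(weights, inputs)]
print(weights)                              # [-0.3, 0.5, 0.0, 0.3, 0.7]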

12.5 The Learning Abilities of Perceptrons


Computational learning theory is the study of what concepts particular learning schemes (representation and method) can and can't learn. We don't look at this in
detail, but a famous example, first highlighted in a very influential book by Minsky and Papert, involves perceptrons. It has been mathematically proven that the above method for learning perceptron weights will converge to a perfect classifier for learning tasks where the target concept is linearly separable. To understand what is and what isn't a linearly separable target function, we look at the simplest functions of all, boolean functions. These take two inputs, which are either 1 or -1, and output either a 1 or a -1. Note that, in other contexts, the values 0 and 1 are used instead of -1 and 1. As an example function, the AND boolean function outputs a 1 only if both inputs are 1, whereas the OR function outputs a 1 if either input is 1. Obviously, these relate to the connectives we studied in first order logic. The following two perceptrons can represent the AND and OR boolean functions respectively:

One of the major impacts of Minsky and Papert's book was to highlight the fact that perceptrons cannot learn a particular boolean function called XOR. This function outputs a 1 if the two inputs are not the same. To see why XOR cannot be learned, try and write down a perceptron to do the job. The following diagram highlights the notion of linear separability in Boolean functions, which explains why they can't be learned by perceptrons:


In each case, we've plotted the values taken by the Boolean function when the inputs are particular values: (-1,-1); (1,-1); (-1,1) and (1,1). For the AND function, there is only one place where a 1 is plotted, namely when both inputs are 1. This meant that we could draw the dotted line to separate the output -1s from the 1s. We were able to draw a similar line in the OR case. Because we can draw these lines, we say that these functions are linearly separable. Note that it is not possible to draw such a line for the XOR plot: wherever you try, you never get a clean split into 1s and -1s. The dotted lines can be seen as the threshold in perceptrons: if the weighted sum, S, falls below it, then the perceptron outputs one value, and if S falls above it, the alternative output is produced. It doesn't matter how the weights are organised, the threshold will still be a line on the graph. Therefore, functions which are not linearly separable cannot be represented by perceptrons. Note that this result extends to functions over any number of variables, which can take in any input, but which produce a Boolean output (and hence could, in principle, be learned by a perceptron). For instance, in the following two graphs, the function takes in two inputs (like Boolean functions), but the input can be over a range of values. The concept on the left can be learned by a perceptron, whereas the concept on the right cannot:


As an exercise, in the left hand plot, draw in the separating (threshold) line. Unfortunately, the disclosure in Minsky and Papert's book that perceptrons cannot learn even such a simple function was taken the wrong way: people believed it represented a fundamental flaw in the use of ANNs to perform learning tasks. This led to a winter of ANN research within AI, which lasted over a decade. In reality, perceptrons were being studied in order to gain insights into more complicated architectures with hidden layers, which do not have the limitations that perceptrons have. No one ever suggested that perceptrons would eventually be used to solve real world learning problems. Fortunately, people studying ANNs within other sciences (notably neuroscience) revived interest in the study of ANNs. For more details of computational learning theory, see chapter 7 of Tom Mitchell's machine learning book.
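To make the linear separability limitation concrete, the following Python sketch applies the perceptron training rule (capped at a fixed number of epochs) to the AND, OR and XOR truth tables. The function names are ours; AND and OR converge to a weight vector, whereas XOR never does.

def train(examples, eta=0.1, epochs=100):
    # Simple perceptron trainer over inputs in {-1, +1}; w[0] is the threshold weight.
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        errors = 0
        for (x1, x2), t in examples:
            xs = [1, x1, x2]
            o = 1 if sum(wi * xi for wi, xi in zip(w, xs)) > 0 else -1
            if o != t:
                errors += 1
                for i, xi in enumerate(xs):
                    w[i] += eta * (t - o) * xi
        if errors == 0:
            return w          # converged: the function is linearly separable
    return None               # never converged within the epoch limit

inputs = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
AND = [(x, 1 if x == (1, 1) else -1) for x in inputs]
OR  = [(x, -1 if x == (-1, -1) else 1) for x in inputs]
XOR = [(x, 1 if x[0] != x[1] else -1) for x in inputs]

print(train(AND))   # a separating weight vector is found
print(train(OR))    # a separating weight vector is found
print(train(XOR))   # None: XOR is not linearly separable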

Chapter-13 Multi-Layer Artificial Neural Networks


We can now look at more sophisticated ANNs, which are known as multi-layer artificial neural networks because they have hidden layers. These will naturally be used to undertake more complicated tasks than perceptrons. We first look at the network structure for multi-layer ANNs, and then in detail at the way in which the weights in such structures can be determined to solve
machine learning problems. There are many considerations involved with learning such ANNs, and we consider some of them here. First and foremost, the algorithm can get stuck in local minima, and there are some ways to try to get around this. As with any learning technique, we will also consider the problem of overfitting, and discuss which types of problems an ANN approach is suitable for.

13.1 Multi-Layer Network Architectures


We saw in the previous lecture that perceptrons have limited scope in the type of concepts they can learn - they can only learn linearly separable functions. However, we can think of constructing larger networks by building them out of perceptrons, and in such larger networks we call the step function units perceptron units. As with individual perceptrons, multi-layer networks can be used for learning tasks. However, the learning algorithm that we look at (the backpropagation routine) is derived mathematically, using differential calculus. The derivation relies on having a differentiable threshold function, which effectively rules out using perceptron units if we want to be sure that backpropagation works correctly. The step function in perceptrons is not continuous, hence non-differentiable. An alternative unit was therefore chosen which had similar properties to the step function in perceptron units, but which was differentiable. There are many possibilities, one of which is sigmoid units, as described below.

Sigmoid units

Remember that the function inside units takes as input the weighted sum, S, of the values coming from the units connected to it. The function inside sigmoid units calculates the following value, given a real-valued input S:

σ(S) = 1/(1 + e^(-S))

where e is the base of natural logarithms, e = 2.718... When we plot the output from sigmoid units given various weighted sums as input, it looks remarkably like a step function:


Of course, getting a differentiable function which looks like the step function was the whole point of the exercise. In fact, not only is this function differentiable, but the derivative is fairly simply expressed in terms of the function itself:

σ'(S) = σ(S)(1 - σ(S))

Note that the output values for the function range between, but never quite reach, 0 and 1. This is because e^(-S) is always positive: as S gets very big in the negative direction, e^(-S) grows without bound, so the fraction tends to 0; as S gets very big in the positive direction, e^(-S) tends to 0, so the fraction tends to 1. This tendency happens fairly quickly: the middle ground between 0 and 1 is rarely seen because of the sharp (near) step in the function. Because it looks like a step function, we can think of it firing and not-firing as in a perceptron: if a large positive real is input, the output will generally be close to 1, and if a large negative real is input, the output will generally be close to 0.
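A sigmoid unit and its derivative are easy to write down in Python. The following sketch is ours, not from the notes; the two test inputs anticipate the weighted sums used in the worked example below.

import math

def sigmoid(s):
    # sigma(S) = 1 / (1 + e^(-S))
    return 1.0 / (1.0 + math.exp(-s))

def sigmoid_derivative(s):
    # sigma'(S) = sigma(S) * (1 - sigma(S))
    out = sigmoid(s)
    return out * (1 - out)

print(sigmoid(7))    # ~0.999  (the unit "fires")
print(sigmoid(-5))   # ~0.0067 (the unit does not fire)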

Example Multi-layer ANN with Sigmoid Units

We will concern ourselves here with ANNs containing only one hidden layer, as this makes describing the backpropagation routine easier. Note that networks where you can feed in the input on the left and propagate it forward to get an output are called feed forward networks. Below is such an ANN, with two sigmoid units in the hidden layer. The weights have been set arbitrarily between all the units.


Note that the sigmoid units have been identified with sigma signs in the nodes on the graph. As we did with perceptrons, we can give this network an input and determine the output. We can also look to see which units "fired", i.e., had a value closer to 1 than to 0. Suppose we input the values 10, 30, 20 into the three input units, from top to bottom. Then the weighted sum coming into H1 will be: SH1 = (0.2 * 10) + (-0.1 * 30) + (0.4 * 20) = 2 - 3 + 8 = 7. Then the function σ is applied to SH1 to give: σ(SH1) = 1/(1+e^(-7)) = 1/(1+0.000912) = 0.999 [don't forget to negate S]. Similarly, the weighted sum coming into H2 will be: SH2 = (0.7 * 10) + (-1.2 * 30) + (1.2 * 20) = 7 - 36 + 24 = -5, and σ applied to SH2 gives: σ(SH2) = 1/(1+e^5) = 1/(1+148.4) = 0.0067. From this, we can see that H1 has fired, but H2 has not. We can now calculate that the weighted sum going in to output unit O1 will be: SO1 = (1.1 * 0.999) + (0.1 * 0.0067) = 1.0996, and the weighted sum going in to output unit O2 will be: SO2 = (3.1 * 0.999) + (1.17 * 0.0067) = 3.1047. The output sigmoid unit in O1 will now calculate the output values from the network for O1:



σ(SO1) = 1/(1+e^(-1.0996)) = 1/(1+0.333) = 0.750, and the output from the network for O2: σ(SO2) = 1/(1+e^(-3.1047)) = 1/(1+0.045) = 0.957. Therefore, if this network represented the learned rules for a categorisation problem, the input triple (10,30,20) would be categorised into the category associated with O2, because this has the larger output.
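The forward pass just described can be reproduced with the following sketch. The weight lists are taken from the example network; the variable names are ours.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Weights from the example network (input -> hidden, then hidden -> output).
w_in_h1 = [0.2, -0.1, 0.4]
w_in_h2 = [0.7, -1.2, 1.2]
w_h_o1  = [1.1, 0.1]
w_h_o2  = [3.1, 1.17]

x = [10, 30, 20]
h1 = sigmoid(sum(w * xi for w, xi in zip(w_in_h1, x)))   # ~0.999
h2 = sigmoid(sum(w * xi for w, xi in zip(w_in_h2, x)))   # ~0.0067
o1 = sigmoid(w_h_o1[0] * h1 + w_h_o1[1] * h2)            # ~0.750
o2 = sigmoid(w_h_o2[0] * h1 + w_h_o2[1] * h2)            # ~0.957
print(o1, o2)   # the input is put in the category associated with O2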

13.2 The Backpropagation Learning Routine


As with perceptrons, the information in the network is stored in the weights, so the learning problem comes down to the question: how do we train the weights to best categorise the training examples. We then hope that this representation provides a good way to categorise unseen examples. In outline, the backpropagation method is the same as for perceptrons:
1. We choose and fix our architecture for the network, which will contain input, hidden and output units, all of which will contain sigmoid functions.

2. We randomly assign the weights between all the nodes. The assignments should be to small numbers, usually between -0.5 and 0.5.

3. Each training example is used, one after another, to re-train the weights in the network. The way this is done is given in detail below.

4. After each epoch (a run through all the training examples), a termination condition is checked (also detailed below).

Note that, for this method, we are not guaranteed to find weights which give the network the global minimum error, i.e., perfectly correct categorisation of the training examples. Hence the termination condition may have to be in terms of a (possibly small) number of mis-categorisations. We see later that this might not be such a good idea, though.

Weight Training Calculations

Because we have more weights in our network than in perceptrons, we firstly need to introduce the notation wij to specify the weight between unit i and unit j. As with perceptrons, we will calculate a value Δij to add on to each weight in the network after an example has been tried. To calculate the weight changes for a particular example, E, we first start with the information about how the network should perform for E. That is, we write down the target values ti(E) that each output unit Oi should produce for E. Note that, for categorisation problems, ti(E) will be zero for all the output units except one, which is the unit associated with the correct categorisation for E. For that unit, ti(E) will be 1.

Next, example E is propagated through the network so that we can record all the observed values oi(E) for the output nodes Oi. At the same time, we record all the observed values hi(E) for the hidden nodes. Then, for each output unit Ok, we calculate its error term as follows:

δOk = ok(E)(1 - ok(E))(tk(E) - ok(E))

The error terms from the output units are used to calculate error terms for the hidden units. In fact, this method gets its name because we propagate this information backwards through the network. For each hidden unit Hk, we calculate the error term as follows:

δHk = hk(E)(1 - hk(E)) * Σ (over all output units O) wkO δO

In English, this means that we take the error term for every output unit and multiply it by the weight from hidden unit Hk to the output unit. We then add all these together and multiply the sum by hk(E)*(1 - hk(E)). Having calculated all the error values associated with each unit (hidden and output), we can now transfer this information into the weight changes Δij between units i and j. The calculation is as follows: for weights wij between input unit Ii and hidden unit Hj, we add on:

Δij = η δHj xi

[Remembering that xi is the input to the i-th input node for example E; that η is a small value known as the learning rate and that δHj is the error value we calculated for hidden node Hj using the formula above]. For weights wij between hidden unit Hi and output unit Oj, we add on:

Δij = η δOj hi(E)

[Remembering that hi(E) is the output from hidden node Hi when example E is propagated through the network, and that δOj is the error value we calculated for output node Oj using the formula above]. Each alteration is added to the weights and this concludes the calculation for example E. The next example is then used to tweak the weights further. As with perceptrons, the learning rate is used to ensure that the weights are only moved a short distance for each example, so that the training for previous examples is not lost. Note that the mathematical derivation for the above
calculations is based on the derivative of σ that we saw above. For a full description of this, see chapter 4 of Tom Mitchell's book "Machine Learning".
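The weight training calculations above can be summarised in a short Python sketch for a network with one hidden layer. This is our own illustration, not the notes' code; the weight layout and function name are assumptions.

import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def backprop_step(x, targets, w_ih, w_ho, eta=0.1):
    # w_ih[j][i] is the weight from input i to hidden unit j;
    # w_ho[k][j] is the weight from hidden unit j to output unit k.
    # Forward pass
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_ih]
    o = [sigmoid(sum(w * hj for w, hj in zip(ws, h))) for ws in w_ho]
    # Error terms for output units: delta_Ok = ok(1 - ok)(tk - ok)
    delta_o = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, targets)]
    # Error terms for hidden units: delta_Hj = hj(1 - hj) * sum_k w_kj * delta_Ok
    delta_h = [hj * (1 - hj) * sum(w_ho[k][j] * delta_o[k] for k in range(len(o)))
               for j, hj in enumerate(h)]
    # Weight updates
    for j in range(len(h)):
        for i in range(len(x)):
            w_ih[j][i] += eta * delta_h[j] * x[i]
    for k in range(len(o)):
        for j in range(len(h)):
            w_ho[k][j] += eta * delta_o[k] * h[j]
    return o

w_ih = [[0.2, -0.1, 0.4], [0.7, -1.2, 1.2]]
w_ho = [[1.1, 0.1], [3.1, 1.17]]
backprop_step([10, 30, 20], [1, 0], w_ih, w_ho)   # one update on the example network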

13.3 A Worked Example


We will re-use the example from section 13.1, where our network originally looked like this:

and we propagated the values (10,30,20) through the network. When we did so, we observed the following values:
Input units:
Unit | Output
I1   | 10
I2   | 30
I3   | 20

Hidden units:
Unit | Weighted Sum Input | Output
H1   | 7                  | 0.999
H2   | -5                 | 0.0067

Output units:
Unit | Weighted Sum Input | Output
O1   | 1.0996             | 0.750
O2   | 3.1047             | 0.957

Suppose now that the target categorisation for the example was the one associated with O1. This means that the network mis-categorised the example and gives us an opportunity to demonstrate the backpropagation algorithm: we will update the weights in the network according to the weight training calculations provided above, using a learning rate of η = 0.1. If the target categorisation was associated with O1, this means that the target output for O1 was 1, and the target output for O2 was 0. Hence, using the above notation, t1(E) = 1; t2(E) = 0; o1(E) = 0.750; o2(E) = 0.957.

That means we can calculate the error values for the output units O1 and O2 as follows: δO1 = o1(E)(1 - o1(E))(t1(E) - o1(E)) = 0.750(1-0.750)(1-0.750) = 0.0469 and δO2 = o2(E)(1 - o2(E))(t2(E) - o2(E)) = 0.957(1-0.957)(0-0.957) = -0.0394. We can now propagate this information backwards to calculate the error terms for the hidden nodes H1 and H2. To do this for H1, we multiply the error term for O1 by the weight from H1 to O1, then add this to the multiplication of the error term for O2 and the weight between H1 and O2. This gives us: (1.1*0.0469) + (3.1*-0.0394) = -0.0706. To turn this into the error value for H1, we multiply by h1(E)*(1-h1(E)), where h1(E) is the output from H1 for example E, as recorded in the table above. This gives us: δH1 = -0.0706*(0.999 * (1-0.999)) = -0.0000705. A similar calculation for H2 gives the first part to be: (0.1*0.0469)+(1.17*-0.0394) = -0.0414, and the overall error value to be: δH2 = -0.0414 * (0.067 * (1-0.067)) = -0.00259. We now have all the information required to calculate the weight changes for the network. We will deal with the 6 weights between the input units and the hidden units first:
Input unit | Hidden unit | η   | δH         | xi | Δ = η*δH*xi | Old weight | New weight
I1         | H1          | 0.1 | -0.0000705 | 10 | -0.0000705  | 0.2        | 0.1999295
I1         | H2          | 0.1 | -0.00259   | 10 | -0.00259    | 0.7        | 0.69741
I2         | H1          | 0.1 | -0.0000705 | 30 | -0.0002115  | -0.1       | -0.1002115
I2         | H2          | 0.1 | -0.00259   | 30 | -0.00777    | -1.2       | -1.20777
I3         | H1          | 0.1 | -0.0000705 | 20 | -0.000141   | 0.4        | 0.399859
I3         | H2          | 0.1 | -0.00259   | 20 | -0.00518    | 1.2        | 1.1948

We now turn to the problem of altering the weights between the hidden layer and the output layer. The calculations are similar, but instead of relying on the input values from E, they use the values calculated by the sigmoid functions in the hidden nodes: hi(E). The following table
calculates the relevant values:

Hidden unit | Output unit | η   | δO      | hi(E)  | Δ = η*δO*hi(E) | Old weight | New weight
H1          | O1          | 0.1 | 0.0469  | 0.999  | 0.00469        | 1.1        | 1.10469
H1          | O2          | 0.1 | -0.0394 | 0.999  | -0.00394       | 3.1        | 3.0961
H2          | O1          | 0.1 | 0.0469  | 0.0067 | 0.0000314      | 0.1        | 0.1000314
H2          | O2          | 0.1 | -0.0394 | 0.0067 | -0.0000264     | 1.17       | 1.16997

We note that the weights haven't altered all that much, so it might be a good idea in this situation to use a bigger learning rate. However, remember that, with sigmoid units, small changes in the weighted sum can produce big changes in the output from the unit. As an exercise, check whether the re-trained network performs better with respect to the example than the original network.

13.4 Avoiding Local Minima


The error rate of multi-layered networks over a training set could be calculated as the number of mis-classified examples. Remembering, however, that there are many output nodes, all of which could potentially misfire (e.g., giving a value close to 1 when it should have output 0, and vice versa), we can be more sophisticated in our error evaluation. In practice the overall network error is calculated as:

Error = Σ (over examples E) Σ (over output units k) (tk(E) - ok(E))²

This is not as complicated as it first appears. The calculation simply involves working out the difference between the observed output for each output unit and the target output and squaring this to make sure it is positive, then adding up all these squared differences for each output unit and for each example. Backpropagation can be seen as searching a space of network configurations (weights) in order to find a configuration with the least error, measured in the above fashion. The more
complicated network structure means that the error surface which is searched can have local minima, and this is a problem for multi-layer networks, so we look at ways around it below. Having said that, even if a learned network is in a local minimum, it may still perform adequately, and multi-layer networks have been used to great effect in real world situations (see Tom Mitchell's book for a description of an ANN which can drive a car!). One way around the problem of local minima is to use random re-start as described in the lecture on search techniques. Different initial random weightings for the network may mean that it converges to different local minima, and the best of these can be taken for the learned ANN. Alternatively, as described in Mitchell's book, a "committee" of networks could be learned, with the (possibly weighted) average of their decisions taken as an overall decision for a given test example. Another alternative is to try and skip over some of the smaller local minima, as described below.

Adding Momentum

Imagine a ball rolling down a hill. As it does so, it gains momentum, so that its speed increases and it becomes more difficult to stop. As it rolls down the hill towards the valley floor (the global minimum), it might occasionally wander into local hollows. However, it may be that the momentum it has obtained keeps it rolling up and out of the hollow and back on track to the valley floor. This crude analogy describes one heuristic technique for avoiding local minima, called adding momentum, funnily enough. The method is simple: for each weight, remember the previous value of Δ which was added on to the weight in the last epoch. Then, when updating that weight for the current epoch, add on a little of the previous Δ. How much of the previous Δ to add on is controlled by a parameter called the momentum, which is set to a value between 0 and 1. To see why this might help bypass local minima, note that if the weight change carries on in the direction it was going in the previous epoch, then the movement will be a little more pronounced in the current epoch. This effect will be compounded as the search continues in the same direction. When the trend finally reverses, then the search may be at the global minimum, in which case it is hoped that the momentum won't be enough to take it anywhere other than where it is. Alternatively, the search may be at a fairly narrow local minimum. In this case, even though the backpropagation algorithm dictates that Δ will change direction, it may be that the additional extra from the previous epoch (the momentum) may be enough to counteract this effect for a few steps. These few steps may be all that is needed to bypass the local minimum. In addition to getting over some local minima, when the gradient is constant in one direction, adding momentum will increase the size of the weight change after each epoch, and the network may converge more quickly. Note that it is possible to have cases where (a) the momentum is not enough to carry the search out of a local minimum or (b) the momentum carries the search out of the global minimum into a local minimum. This is why this technique is a heuristic method and
should be used somewhat carefully (it is used in practice a great deal).
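The momentum idea amounts to one extra term in the weight update. The following sketch is illustrative only (the function name and the made-up per-epoch deltas are ours):

def update_with_momentum(weight, gradient_delta, previous_delta, momentum=0.9):
    # gradient_delta is the ordinary backpropagation Delta for this weight;
    # a fraction of last epoch's Delta is added on, which can carry the
    # search through small local minima.
    delta = gradient_delta + momentum * previous_delta
    return weight + delta, delta

w, prev = 0.5, 0.0
for grad in [0.02, 0.018, 0.015, -0.001]:       # made-up per-epoch deltas for illustration
    w, prev = update_with_momentum(w, grad, prev)
    print(round(w, 4))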

13.5 Overfitting Considerations


Left unchecked, back propagation in multi-layer networks can be highly susceptible to overfitting itself to the training examples. The following graph plots the error on the training and test set as the number of weight updates increases. It is typical of networks left to train unchecked.

Alarmingly, even though the error on the training set continues to gradually decrease, the error on the test set actually begins to increase towards the end. This is clearly overfitting, and it relates to the network beginning to find and fine-tune to idiosyncrasies in the data, rather than to general properties. Given this phenomenon, it would be unwise to use some kind of threshold for the error as the termination condition for backpropagation. In cases where the number of training examples is high, one antidote to overfitting is to split the training examples into a set to use to train the weights and a set to hold back as an internal validation set. This is a mini-test set, which can be used to keep the network in check: if the error on the validation set reaches a minimum and then begins to increase, then it could be that overfitting is beginning to occur. Note that (time permitting) it is worth giving the training algorithm the benefit of the doubt as much as possible. That is, the error on the validation set can also go through local minima, and it is not wise to stop training as soon as the validation set error starts to increase, as a better minimum may be achieved later on. Of course, if the minimum is never bettered, then the network which is finally presented by the learning algorithm should be re-wound to be the one which produced the minimum on the validation set.

Another way around overfitting is to decrease each weight by a small weight decay factor during each epoch. Learned networks with large (positive or negative) weights tend to have overfitted the data, because larger weights are needed to accommodate outliers in the data. Hence, keeping the weights low with a weight decay factor may help to steer the network from overfitting.
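Both ideas - keeping the best weights seen on the validation set, and shrinking weights with a decay factor - can be illustrated with a toy sketch. The validation error numbers below are made up purely for illustration.

def apply_weight_decay(weights, decay=0.0001):
    # Shrink every weight slightly after each epoch to discourage the large
    # weights associated with overfitting.
    return [w * (1 - decay) for w in weights]

# Early stopping: remember the weights that gave the lowest validation error.
best = (float("inf"), None)
weights = [0.3, -1.2, 0.7]
validation_errors = [0.41, 0.32, 0.27, 0.29, 0.35]   # made-up numbers
for err in validation_errors:
    weights = apply_weight_decay(weights)
    if err < best[0]:
        best = (err, list(weights))
print(best)   # the network would be re-wound to these weights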

13.6 Appropriate Problems for ANN learning


As we did for decision trees, it's important to know when ANNs are the right representation scheme for the job. The following are some characteristics of learning tasks for which artificial neural networks are an appropriate representation:

1. The concept (target function) to be learned can be characterised in terms of a real-valued function. That is, there is some translation from the training examples to a set of real numbers, and the output from the function is either real-valued or (if a categorisation) can be mapped to a set of real values. It's important to remember that ANNs are just giant mathematical functions, so the data they play around with are numbers, rather than logical expressions, etc. This may sound restrictive, but many learning problems can be expressed in a way that ANNs can tackle them, especially as real numbers contain booleans (true and false mapped to +1 and -1) and integers, and vectors of these data types can also be used.

2. Long training times are acceptable. Neural networks generally take a longer time to train than, for example, decision trees. Many factors, including the number of training examples, the value chosen for the learning rate and the architecture of the network, have an effect on the time required to train a network. Training times can vary from a few minutes to many hours.

3. It is not vitally important that humans be able to understand exactly how the learned network carries out categorisations. As we discussed above, ANNs are black boxes and it is difficult for us to get a handle on what their calculations are doing.

4. When in use for the actual purpose it was learned for, the evaluation of the target function needs to be quick. While it may take a long time to learn a network to, for instance, decide whether a vehicle is a tank, bus or car, once the ANN has been learned, using it for the categorisation task is typically very fast. This may be very important: if the network was to be used in a battle situation, then a quick decision about whether the object moving hurriedly towards it is a tank, bus, car or old lady could be vital.
In addition, neural network learning is quite robust to errors in the training data, because it is not trying to learn exact rules for the task, but rather to minimize an error function.


Chapter-14 Inductive Logic Programming


Having studied a non-symbolic approach to machine learning (Artificial Neural Networks), we return to a logical approach, namely Inductive Logic Programming (ILP). As the name suggests, the representation scheme used in this approach is logic programs, which we covered in lecture 6. As a quick overview, one search strategy for ILP systems is to invert rules of deduction and therefore induce hypotheses which may solve the learning problem. In order to understand ILP, we will define a context for ILP, and use this to state the machine learning problem being addressed. Following this, we will look at the search operators in ILP, in particular the notion of inverting resolution in order to generate hypotheses. We will consider how the search is undertaken and run through a session with the Progol ILP system. We end by looking at some of the applications of Inductive Logic Programming.

14.1 Problem Context and Specification


The development of Inductive Logic Programming has been heavily formal (mathematical) in nature, because the major people in the field believe that this is the only way to progress and to show progress. It means that we have to (re)introduce some notation, and we will use this to formally specify the machine learning problem faced by ILP programs. To do this, we first need to refresh and rerepresent our knowledge about logic programs, and define background, example and hypothesis logic programs. Following this, we will specify some prior conditions on the knowledge base that must be met before an agent attempts a learning task. We will also specify some posterior conditions on the learned hypothesis, in such a way that, given a problem satisfying the prior conditions, if our learning agent finds a hypothesis which satisfies the posterior conditions, it will have solved the learning task.

Logic Programs

Logic programs are a subset of first order logic. A logic program contains a set of Horn clauses, which are implication conjectures where there is a conjunction of literals which imply a single literal. Hence a logic program consists of implications which look like this example:


∀ X, Y, Z (b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z) → h(X,Y))

Remember also that, in Prolog, we turn the implication around so that the head of the clause comes first and the body follows. We also assume universal quantification over all our variables, so that the ∀ can be removed. Hence we can write Horn clauses like this:

h(X,Y) ← b1(X,Y) ∧ b2(X) ∧ ... ∧ bn(X,Y,Z)

and everybody understands what we are saying. We will also adopt the convention of writing a conjunction of literals with a capital letter, and a single literal with a lower case letter. Hence, if we were interested in the first literal in the body of the above Horn clause, but not interested in the others, then we could write:

h(X,Y) ← b1, B

We see that the conjunction of literals b2(X) ∧ ... ∧ bn(X,Y,Z) has been replaced by B, and we have used a comma instead of a ∧ sign.

Also, we need to specify when one logic program can be deduced from another. We use the entails sign, ⊨, to denote this. If logic program L1 can be proved to be true using logic program L2, we write: L2 ⊨ L1. We use the symbol ⊭ to denote that one logic program does not entail another. It is important to understand that if L2 ⊭ L1, this does not mean that L2 entails that L1 is false, only that L2 cannot be used to prove that L1 is true. Note also that, because we have restricted our representation language to logic programs, we can use a Prolog interpreter to prove the entailment of one logic program from another. As a final notation, it is important to remember that a logic program can contain just one Horn clause, and that the Horn clause could have no body, in which case the head of the clause is a known fact about the domain.

Background, Examples and Hypothesis

We will start off with three logic programs. Firstly, we will have the logic program representing a set of positive examples for the concept we wish to be learned, and we denote the set of examples E+. Secondly, we will have a set of negative examples for the concept we wish to be learned, labelled E-. Thirdly, we will have a set of Horn clauses which provide background concepts, and we denote this logic program B. We will denote the logic program representing the learned hypothesis H.



Normally, E+ and E- will be ground facts, i.e., Horn clauses with no body. In this case, we can prove that an example of E follows from the hypothesis, as they are all still logic programs. When an example (positive or negative) is proved to be true using a hypothesis H, we say that H (taken along with B) explains the example.

Prior Conditions

Firstly, we must make sure that our problem has a solution. If one of the negative examples can be proved to be true from the background information alone, then clearly any hypothesis we find will not be able to compensate for this, and the problem is not satisfiable. Hence, we need to check the prior satisfiability of the problem:

∀ e ∈ E-  (B ⊭ e)

Any learning problem which breaks the prior satisfiability condition has inconsistent data, so the user should be made aware of this. Note that this condition does not mean that B entails that any negative example is false, so it is certainly possible to find a hypothesis which, along with B, entails a negative example. In addition to checking whether we will be able to find a solution to the problem, we also have to check that the problem isn't solved already by the background information. That is, if the problem satisfies the prior satisfiability condition, and each positive example is entailed by the background information, then the background logic program B would itself perfectly solve the problem. Hence, we need to check that at least one positive example cannot be explained by the background information B. We call this condition the prior necessity condition:

∃ e ∈ E+  (B ⊭ e)

Posterior Conditions

Given a problem which satisfies the prior conditions, we define here two properties that the hypothesis learned by our agent, H, will satisfy if it solves the concept learning problem. Firstly, H should satisfy the posterior satisfiability condition that, taken together with the background logic program, it does not entail any negative example:

∀ e ∈ E-  ((B ∧ H) ⊭ e)

Also, we must check that all the positive examples are entailed if we take the background program in conjunction with the hypothesis. This is called the posterior sufficiency condition:


∀ e ∈ E+  ((B ∧ H) ⊨ e)

It should be obvious that any hypothesis satisfying the two posterior conditions will be a perfect solution to the learning problem.

Problem Specification

Given the above context for ILP, we can state the learning problem as follows: we are given a set of positive and a set of negative examples represented as logic programs E+ and E- respectively, and some background clauses making up a logic program B. These logic programs satisfy the two prior conditions. Then the learning problem is to find a logic program, H, such that H, B, E + and E- satisfy the posterior conditions.

Pruning and Sorting

Because we can test whether each hypothesis explains (entails) a particular example, we can associate to a hypothesis a set of positive examples that it explains and a similar set of negative examples. There is also a similar analogy with general and specific hypotheses as described above: if a hypothesis G is more general than hypothesis S, then the examples explained by S will be a subset of those explained by G. We will assume the following generic search strategy for an ILP system: (i) a set of current hypotheses is maintained, QH; (ii) at each step in the search, a hypothesis H is taken from QH and some inference rules applied to it in order to generate some new hypotheses which are then added to the set (we say that H has been expanded); (iii) this continues until a termination criterion is met. This leaves many questions unanswered. Looking first at the question of which hypothesis to expand at a particular stage, ILP systems associate a label with each hypothesis generated which expresses a probability of the hypothesis holding given that the background knowledge and examples are true. Then, hypotheses with a higher probability are expanded before those with a lower probability, and hypotheses with zero probability are pruned from the set QH entirely. This probability calculation is derived using Bayesian mathematics and we do not go into the derivation here. However, we hint at two aspects of the calculation in the paragraphs below. In specific to general ILP systems, the inference rules are inductive, so each operator takes a hypothesis and generalizes it. As mentioned above, this means that the hypothesis generated will explain more examples than the original hypothesis. As the search gradually makes hypotheses more general, there will come a stage
when a newly formed hypothesis H is general enough to explain a negative example, e-. This should therefore score zero for the probability calculation, because it cannot possibly hold given that the background and examples are true. Furthermore, because the operators only generalize, there is no way by which H can be fixed so as not to explain e-, so pruning it from QH because of the zero probability score is a good decision. A similar situation occurs in general to specific ILP systems, where the inference rules are deductive, hence they specialize. At some stage, a hypothesis will become so specialized that it fails to explain all the positive examples. In this case, a similar pruning operation can be imposed, because further specialization will not rectify the situation. Note that in practice, to compensate for noisy data, there is more flexibility built into the systems. In particular, the posterior conditions which specify the problem can be relaxed, and hypotheses which explain small numbers of negative examples may not be immediately dropped. We can see how the examples could be used to choose between two non-pruned hypotheses: if performing a specific to general search, then the number of positive examples explained by a hypothesis can be taken as a value to sort the hypotheses with (more positive examples explained being better). Similarly, if performing a general to specific search, then the number of negatives still explained by a hypothesis can be taken as a value to sort the hypotheses with (fewer negatives being better). This may, however, be a very crude measure, because many hypotheses might score the same, especially if there is a small number of examples. When all things are equal, an ILP system may employ a sophisticated version of Occam's razor, and choose between two equal scoring hypotheses according to some function derived from Algorithmic Complexity theory or some similar theory.

Chapter-16 Constraint Satisfaction Problems


I was perhaps most proud of AI on a Sunday. On this particular Sunday, a friend of mine found an article in the Observer about the High-IQ society, a rather brash and even more elitist version of Mensa. Their founder said that their entrance test was so difficult that some of the problems had never been solved. The problem given below was in the Observer as such an unsolved problem. After looking at it for a few minutes, I confidently told my friend that I would have the answer in half an hour.

After just over 45 minutes, I did indeed have an answer, and my friend was suitably impressed. See the end of these notes for the details. Of course, I didn't spend my time trying to figure it out (if you want to split the atom, you don't sharpen a knife). Instead, I used the time to describe the problem to a constraint solver, which is infinitely better at these things than me. The constraint solver is part of good old Sicstus Prolog, so specifying the problem was a matter of writing it as a logic program - it's worth pointing out that I didn't specify how to find the solution, just
what the problem was. With AI programming languages such as Prolog, every now and then the intelligence behind the scenes comes in very handy. Once I had specified the problem to the solver (a mere 80 lines of Prolog), it took only one hundredth of a second to solve the problem. So not only can the computer solve a problem which had beaten many high IQ people, it could solve 100 of these "difficult" problems every second. A great success for AI. In this lecture, we will look at how constraint solving works in general. Much of the material here is taken from Barbara Smith's excellent tutorial on Constraint Solving which is available here:

16.1 Specifying Constraint Problems


As with most successful AI techniques, constraint solving is all about solving problems: somehow phrase the intelligent task you are interested in as a problem, then massage it into the format of a constraint satisfaction problem (CSP), put it into a constraint solver and see if you get a solution. CSPs consist of the following parts:

A set of variables X = {x1, x2, ..., xn}
A finite set of values that each variable can take. This is called the domain of the variable. The domain of variable xi is written Di.
A set of constraints that specifies which values the variables can take simultaneously.

In the high-IQ problem above, there are 25 variables: one for each of the 24 smaller square lengths, and one for the length of the big square. If we say that the smallest square is of length 1, then the big square is perhaps of length at most 1000. Hence the variables can each take values in the range 1 to 1000. There are many constraints in this problem, including the fact that each length is different, and that certain ones add up to give other lengths, for example the lengths of the three squares along the top must add up to the length of the big square. Depending on what solver you are using, constraints are often expressed as relationships between variables, e.g., x1 + x2 < x3. However, to be able to discuss constraints more formally, we use the following notation: A constraint Cijk specifies which tuples of values variables xi, xj and xk ARE allowed to take simultaneously. In plain English, a constraint normally talks about things which can't happen, but in our formalism, we are looking at tuples (vi, vj, vk) which xi, xj and xk can take simultaneously. As a simple example, suppose we have a CSP with two variables x and y, and that x can take values {1,2,3}, whereas y can take values {2,3}. Then the constraint that x=y would be written as:
Cxy = {(2,2), (3,3)}, and the constraint that x<y would be written as Cxy = {(1,2),(1,3),(2,3)}. A solution to a CSP is an assignment of values, one to each variable, in such a way that no constraint is broken. It depends on the problem at hand, but the user might want to know that there is a solution, i.e., they will take the first answer given. Alternatively, they may require all the solutions to the problem, or they might want to know that no solution exists. Sometimes, the point of the exercise is to find the optimum solution based on some measure of worth. Sometimes, it's possible to do this without enumerating all the solutions, but other times, it will be necessary to find all solutions, then work out which is the optimum. In the high-IQ problem, a solution is simply a set of lengths, one per square. The shaded one is the 17th biggest, which answers the IQ question.
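The notation above is easy to mirror directly in code. The following sketch (names are ours) represents the two-variable example as Python sets of allowed value pairs:

domains = {"x": {1, 2, 3}, "y": {2, 3}}

C_equal = {(2, 2), (3, 3)}                 # the constraint x = y
C_less  = {(1, 2), (1, 3), (2, 3)}         # the constraint x < y

def satisfies(assignment, constraint):
    # assignment is a (value_for_x, value_for_y) pair
    return assignment in constraint

print(satisfies((2, 2), C_equal))   # True
print(satisfies((3, 2), C_less))    # False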

16.2 Binary Constraints


Unary constraints specify that a particular variable can take certain values, which basically restricts the domain for that variable, and hence should be taken care of when specifying the CSP. Binary constraints relate two variables, and binary constraint problems are special CSPs which involve only binary constraints. Binary CSPs have a special place in the theory because all CSPs can be written as binary CSPs (we don't go into the details of this here; while it is possible in theory to do so, in practice the translation is rarely used). Also, binary CSPs can be represented both graphically and using matrices, which can make them easier to understand. Binary constraint graphs such as the one below afford a nice representation of constraint problems, where the nodes are the variables and the edges represent the constraints between the two variables joined by the edge (remember that the constraints state which values can be taken at the same time).


Binary constraints can also be represented as matrices, with a single matrix for each constraint. For example, in the above constraint graph, the constraint between variables x4 and x5 is {(1,3),(2,4),(7,6)}. This can be represented as the following matrix.
C4,5   1   2   3   4   5   6   7
  1            *
  2                *
  3
  4
  5
  6
  7                        *


We see that the asterisks mark the entries (i,j) in the table such that variable x4 can take value i at the same time that variable x5 takes value j. As all CSPs can be written as binary CSPs, the artificial generation of random binary CSPs as a set of matrices is often used to assess the relative abilities of constraint solvers. However, it should be noted that in real world constraint problems, there is often much more structure to the problems than you get from such random constructions. A very commonly used example CSP, which we will use in the next section, is the "n-queens" problem, which is the problem of placing n queens on a chess board in such a way that no queen threatens another along the vertical, horizontal or diagonal. We've seen this in previous lectures. There are many possibilities for representing this as a CSP (in fact, finding the best specification of a problem so that a solver gets the answer as quickly as possible is a highly skilled art). One possibility is to have the variables representing the rows and the values they can take representing the column on that row in which a queen is situated. If we look at the following solution to the 4-queens problem below:

Then, counting rows from the top downwards and columns from the left, the solution would be represented as: X1=2, X2=4, X3=1, X4=3. This is because the queen on row 1 is in column 2, the queen in row 2 is in column 4, the queen in row 3 is in column 1 and the queen in row 4 is in column 3. The constraint between variable X1 and X2 would be: C1,2 = {(1,3),(1,4),(2,4),(3,1),(4,1),(4,2)} As an exercise, work out exactly what the above constraint is saying.
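The binary constraint between two row-variables can be generated mechanically: queens in rows r1 and r2 may share neither a column nor a diagonal. The following sketch (function name is ours) reproduces C1,2:

def queens_constraint(r1, r2, n=4):
    allowed = set()
    for c1 in range(1, n + 1):
        for c2 in range(1, n + 1):
            # different columns, and not on the same diagonal
            if c1 != c2 and abs(c1 - c2) != abs(r1 - r2):
                allowed.add((c1, c2))
    return allowed

print(sorted(queens_constraint(1, 2)))
# [(1, 3), (1, 4), (2, 4), (3, 1), (4, 1), (4, 2)]  - matches C1,2 above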



16.3 Arc Consistency
There have been many advances in how constraint solvers search for solutions (remember this means an assignment of a value to each variable in such a way that no constraint is broken). We look first at a pre-processing step which can greatly improve efficiency by pruning the search space, namely arc-consistency. Following this, we'll look at two search methods, backtracking and forward checking, which keep assigning values to variables until a solution is found. Finally, we'll look at some heuristics for improving the efficiency of the solver, namely how to order the choosing of the variables, and how to order the assigning of the values to variables. The pre-processing routine for binary constraints known as arc-consistency involves calling a pair (xi, xj) an arc and noting that this is an ordered pair, i.e., it is not the same as (xj, xi). Each arc is associated with a single constraint Cij, which constrains variables xi and xj. We say that the arc (xi, xj) is consistent if, for all values a in Di, there is a value b in Dj such that the assignment xi=a and xj=b satisfies constraint Cij. Note that (xi, xj) being consistent doesn't necessarily mean that (xj, xi) is also consistent. To use this in a pre-processing way, we take every pair of variables and make it arc-consistent. That is, we take each pair (xi, xj) and remove values from Di which make it inconsistent, until it becomes consistent. This effectively removes values from the domains of variables, hence prunes the search space and makes it likely that the solver will succeed (or fail to find a solution) more quickly. To demonstrate the worth of performing an arc-consistency check before starting a search for a solution, we'll use an example from Barbara Smith's tutorial. Suppose that we have four tasks to complete, A, B, C and D, and we're trying to schedule them. They are subject to the constraints that:

Task A lasts 3 hours and precedes tasks B and C
Task B lasts 2 hours and precedes task D
Task C lasts 4 hours and precedes task D
Task D lasts 2 hours

We will model this problem with a variable for each of the task start times, namely startA, startB, startC and startD. We'll also have a variable for the overall start time: start, and a variable for the overall finishing time: finish. We will say that the domain for variable start is {0}, but the domains for all the other variables are {0,1,...,11}, because the summation of the durations of the tasks is 3 + 2 + 4 + 2 = 11. We can now translate our English specification of the constraints into our formal model. We start with an intermediate translation thus:

start ≤ startA
startA + 3 ≤ startB
startA + 3 ≤ startC
startB + 2 ≤ startD
startC + 4 ≤ startD
startD + 2 ≤ finish

Then, by thinking about the values that each pair of variables can take simultaneously, we can write the constraints as follows:

Cstart,startA = {(0,0), (0,1), (0,2), ..., (0,11)}
CstartA,start = {(0,0), (1,0), (2,0), ..., (11,0)}
CstartA,startB = {(0,3), (0,4), ..., (0,11), (1,4), (1,5), ..., (8,11)}
etc.

Now, we will check whether each arc is arc-consistent, and if not, we will remove values from the domains of variables until we get consistency. We look first at the arc (start, startA), which is associated with the constraint {(0,0), (0,1), (0,2), ..., (0,11)} above. We need to check whether there is any value, P, in Dstart that does not have a corresponding value, Q, such that (P,Q) satisfies the constraint, i.e., appears in the set of assignable pairs. As Dstart is just {0}, we are fine. We next look at the arc (startA, start), and check whether there is any value in DstartA, P, which doesn't have a corresponding Q such that (P,Q) is in CstartA,start. Again, we are OK, because all the values in DstartA appear in CstartA,start. If we now look at the arc (startA, startB), then the constraint in question is: {(0,3), (0,4), ..., (0,11), (1,4), (1,5), ..., (8,11)}. We see that there is no pair of the form (9,Q) in the constraint, and similarly no pair of the form (10,Q) or (11,Q). Hence, this arc is not arc-consistent, and we have to remove the values 9, 10 and 11 from the domain of startA in order to make the arc consistent. This makes sense, because we know that, if task B is going to start after task A, which has duration 3 hours, and they are all going to have started by the eleventh hour, then task A cannot start after the eighth hour. Hence, we can - and do - remove the values 9, 10 and 11 from the domain of startA. This method of removing values from domains is highly effective. As reported in Barbara Smith's tutorial, the domains become quite small, as reflected in the following scheduling network:


We see that the largest domain size has only 5 values in it, which means that quite a lot of the search space has been pruned. In practice, to remove as many values as possible in a CSP which is dependent on precedence constraints, we have to work backwards, i.e., look at the start time of the task, T, which must occur last, then make each arc of the form (startT, Y) consistent for every variable Y. Following this, move on to the task which must occur second to last, etc. In CSPs which only involve precedence constraints, arc-consistency is guaranteed to remove all values which cannot appear in a solution to the CSP. In general, however, we cannot make such a guarantee, but arc-consistency usually has some effect on the initial specification of a problem.
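The core of the arc-consistency pre-processing is a "revise" step: for the arc (xi, xj), drop from Di every value that has no supporting value in Dj under the constraint Cij. The following sketch (names are ours) reproduces the pruning of 9, 10 and 11 from the domain of startA:

def revise(domains, constraints, xi, xj):
    cij = constraints[(xi, xj)]
    removed = False
    for a in list(domains[xi]):
        if not any((a, b) in cij for b in domains[xj]):
            domains[xi].remove(a)      # no supporting value in Dj, so a cannot appear in a solution
            removed = True
    return removed

# The (startA, startB) arc from the scheduling example: startA + 3 <= startB.
domains = {"startA": set(range(12)), "startB": set(range(12))}
constraints = {("startA", "startB"):
               {(a, b) for a in range(12) for b in range(12) if a + 3 <= b}}
revise(domains, constraints, "startA", "startB")
print(sorted(domains["startA"]))   # 9, 10 and 11 have been removed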

16.4 Search Methods and Heuristics


We now come to the question of how constraint solvers search for solutions - constraint-preserving assignments of values to variables - to the CSPs they are given. The most obvious approach is to use a depth first search: assign a value to the first variable and check that this assignment doesn't break any constraints. Then, move on to the next variable, assign it a value and check that this doesn't break any constraints, then move on to the next variable and so on. When an assignment does break a constraint, then choose a different value for the assignment until one is found which satisfies the constraints. If one cannot be found, then this is when the search must backtrack. In such a situation, the previous variable is looked at again, and the next value for it is tried. In this way, all possible sets of assignments will be tried, and a solution will be found if one exists. The following search diagram - taken from Smith's tutorial paper - shows how the search for a solution to the 4-queens problem progresses until it finds a solution:

We see that the first time it backtracks is after the failure to put a queen in row three given queens in positions (1,1) and (2,3). In this case, it backtracked and moved the queen in (2,3) to (2,4). Eventually, this didn't work out either, so it had to backtrack further and moved the queen in (1,1) to (1,2). This led fairly quickly to a solution. To add some sophistication to the search method, constraint solvers use a technique known as forward checking. The general idea is to work the same as a backtracking search, but, when checking compliance with constraints after assigning a value to a variable, the agent also checks whether this assignment is going to break constraints with future variable assignments. That is, supposing that Vc has been assigned to the current variable c, then for each unassigned variable xi, (temporarily) remove all values from Di which, along with Vc, break a constraint. It may be that in doing so, Di becomes empty. This means that the choice of Vc for the current variable is bad - it will not find its way into a solution to the problem, because there's no way to assign a value to xi without breaking a constraint. In such a scenario, even though the assignment of Vc may not break any constraints with already assigned variables, a new value is chosen (or backtracking occurs if there are no values left), because we know that Vc is a bad assignment.

The following diagram (again, taken from Smith's tutorial) shows how forward checking improves the search for a solution to the 4-queens problem.
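
The forward checking idea itself can be sketched as follows (again my own code, not Smith's): after each assignment the domains of the future variables are pruned, and a value is rejected as soon as it would empty one of those domains.

def prune_future(col, row, future_domains):
    """Return pruned copies of the future domains given a queen at (row, col),
    or None if some future domain is wiped out."""
    pruned = {}
    for r, dom in future_domains.items():
        keep = {c for c in dom if c != col and abs(c - col) != abs(r - row)}
        if not keep:
            return None              # dead end: no legal value left for row r
        pruned[r] = keep
    return pruned

def forward_check(assignment, domains, n=4):
    row = len(assignment)
    if row == n:
        return assignment
    for col in sorted(domains[row]):
        future = {r: domains[r] for r in range(row + 1, n)}
        pruned = prune_future(col, row, future)
        if pruned is None:
            continue                 # this value would empty a future domain
        result = forward_check({**assignment, row: col}, {**domains, **pruned}, n)
        if result is not None:
            return result
    return None

domains = {r: set(range(4)) for r in range(4)}
print(forward_check({}, domains))    # same solution, reached with less backtracking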

In addition to forward checking to improve the intelligence of the constraint solving agent, there are some possibilities for a heuristic search. Firstly, our agent can worry about the order in which it looks at the variables, e.g., in the 4-queens problem, it might try to put a queen in row 2, then one in row 3, one in row 1 and finally one in row 4. A solver taking such care is said to be using a variable-ordering heuristic. The ordering of variables can be done before a search is started and rigidly adhered to during the search. This might be a good idea if there is extra knowledge about the problem, e.g., that a particular variable should be assigned a value sooner rather than later. Alternatively, the ordering of the variables can be done dynamically, in response to some information gathered about how the search is progressing during the search procedure. One such dynamic ordering procedure is called "fail-first forward checking". The idea is to take advantage of information gathered while forward checking during search. In cases where forward checking highlights the fact that a future domain is effectively emptied, then this signals that it's time to change the current assignment. However, in the general case, the domain of the variable will be reduced but not necessarily emptied. Suppose that of all the future variables, xf has the most values removed from Df. The fail-first approach specifies that our agent should choose to assign values to xf next. The thinking behind this is that, with fewer possible assignments for xf than the other future variables, we will find out most quickly whether we are heading down a dead-end. Hence, a better name for this approach would be "find out if it's a dead end quickest". However, this isn't as catchy a phrase as "fail-first". An alternative/addition to variable ordering is value ordering. Again, we could specify in advance the order in which values should be assigned to variables, and this kind of tweaking of the problem specification can dramatically improve search time. We can also perform value ordering dynamically: suppose that it's possible to assign values Vc, Vd and Ve to the current variable. Further suppose that, when looking at all the future variables, the total number of values in their domains reduces to 300, 20 and 50 for Vc, Vd and Ve respectively. We could then specify that our agent assigns Vc at this stage in the search, because it retains the largest number of values in the future domains. This is different from variable ordering in two important ways:

If this is a dead end then we will end up visiting all the values for this variable anyway, so fail-first does not make sense for values. Rather, we try and keep our options open as much as possible, as this will help if there is a solution ahead of us.

Unlike the variable ordering heuristics, this heuristic carries an extra cost on top of forward checking, because the reduction in domain sizes of future variables for every assignment of the current variable needs to be checked. Hence, it is possible that this kind of value ordering will slow things down. In practice, this is what happens for randomly constructed binary CSPs. On occasions, however, it can sometimes be a very good idea to employ dynamic value ordering.
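
The two dynamic ordering heuristics described above can be sketched as follows (my own illustration). The domains of the unassigned variables are held in a dictionary, and promise(var, value) is an assumed helper - not defined in the notes - which returns the total number of values that would remain in the future domains if value were assigned to var.

def fail_first_variable(domains, unassigned):
    """Variable ordering: pick the unassigned variable with the fewest remaining
    values, so dead ends are discovered as early as possible."""
    return min(unassigned, key=lambda var: len(domains[var]))

def best_value_order(var, domains, promise):
    """Value ordering: try first the values which keep the most options open in
    the future domains."""
    return sorted(domains[var], key=lambda value: promise(var, value), reverse=True)

# Hypothetical state part-way through a search:
domains = {"x1": {2}, "x2": {1, 3, 4}, "x3": {0, 1}}
unassigned = ["x2", "x3"]
print(fail_first_variable(domains, unassigned))    # -> 'x3' (smallest domain)
print(best_value_order("x2", domains,
                       promise=lambda v, val: {1: 20, 3: 300, 4: 50}[val]))   # -> [3, 4, 1]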

Chapter-17 Genetic Algorithms


The evolutionary approach to Artificial Intelligence is one of the neatest ideas of all. We have tried to mimic the functioning of the brain through neural networks, because - even though we don't know exactly how it works - we know that the brain does work. Similarly, we know that mother nature, through the process of evolution, has solved many problems, for instance the problem of getting animals to walk around on two feet (try getting a robot to do that - it's very difficult). So, it seems like a good idea to mimic the processes of reproduction and survival of the fittest to try to evolve answers to problems, and maybe in the long run reach the holy grail of computers which program themselves by evolving programs. Evolutionary approaches are simple in conception:

generate a population of possible answers to the problem at hand

choose the best individuals from the population (using methods inspired by survival of the fittest)

produce a new generation by combining these best ones (using techniques inspired by reproduction)

stop when the best individual of a generation is good enough (or you run out of time)

Perhaps the first landmark in the history of the evolutionary approach to computing was John Holland's book "Adaptation in Natural and Artificial Systems", where he developed the idea of the genetic algorithm as searching via sampling hyperplane partitions of the space. It's important to remember that genetic algorithms (GAs), which we look at in this lecture, and genetic programming (GP), which we look at in the next lecture, are just fancy search mechanisms which are inspired by evolution. In fact, using Tom Mitchell's definition of a machine learning system being one which improves its performance through experience, we can see that evolutionary approaches can be classed as machine learning efforts. Historically, however, it has been more common to categorise evolutionary approaches together because of their inspiration rather than their applications (to learning and discovery problems). As we will see, evolutionary approaches boil down to (i) specifying how to represent possible problem solutions and (ii) determining how to choose which partial solutions are doing the best with respect to solving the problem. The main difference between genetic algorithms and genetic programming is the choice of representation for problem solutions. In particular, with genetic algorithms, the format of the solution is fixed, e.g., a fixed set of parameters to find, and the evolution occurs in order to find good values for those parameters. With genetic programming, however, the individuals in the population of possible solutions are actually individual programs which can increase in complexity, so are not as constrained as in the genetic algorithm approach.

17.1 The Canonical Genetic Algorithm


As with all search techniques, one of the first questions to ask with GAs is how to define a search space which potentially contains good solutions to the problem at hand. This means answering the question of how to represent possible solutions to the problem. The classical approach to GAs is to represent the solutions as strings of ones and zeros, i.e., bit strings. This is not such a bad idea, given that computers store everything as bit strings, so any solution would eventually boil down to a string of ones and zeros. However, there have been many modifications to the original approach to genetic algorithms, and GA approaches now come in many different shapes and sizes, with higher level representations. Indeed, it's possible to see genetic programming, where the individuals in the population are programs, as just a GA approach with a more complicated representation scheme. Returning to the classical approach, as an example, if solving a particular problem involved finding a set of five integers between 1 and 100, then the search space for a GA would be bit strings where the first eight bits are decoded as the first integer, the next eight bits become the second integer and so on. Representing the solutions is one of the tricky parts to using genetic algorithms, a problem we come back to later. However, suppose that the solutions are represented as strings of length L. Then, in the standard approach to GAs, known as the canonical genetic algorithm, the first stage is to generate an initial random population of bit strings of length L. By random, we mean that the ones and zeros in the strings are chosen at random. Sometimes, rarely, the initialisation procedure is done with a little more intelligence, e.g., using some additional knowledge about the domain to choose the initial population. After the initialisation step, the canonical genetic algorithm proceeds iteratively using selection, mating, and recombination processes, then checking for termination. This is portrayed in the following diagram:
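
As a small illustration of the representation and initialisation steps just described (my own sketch, not from the notes), the following code packs five integers into an eight-bits-per-integer bit string and generates a random initial population. Note that eight bits can encode 0 to 255, so some decoded values will fall outside the 1 to 100 range - exactly the kind of redundancy discussed in section 17.3.

import random

BITS_PER_INT, NUM_INTS = 8, 5
L = BITS_PER_INT * NUM_INTS

def decode(bits):
    """Split a string of ones and zeros into five 8-bit chunks and read each
    chunk as an integer."""
    return [int(bits[i:i + BITS_PER_INT], 2) for i in range(0, L, BITS_PER_INT)]

def random_population(size):
    """Initial population: 'size' random bit strings of length L."""
    return ["".join(random.choice("01") for _ in range(L)) for _ in range(size)]

population = random_population(10)
print(population[0], decode(population[0]))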


In the next section, we look in detail at how individuals are selected, mated, recombined (and mutated for good measure). Termination of the algorithm may occur if one or more of the best individuals in the current generation performs well enough with respect to the problem, with this performance specified by the user. Note that this termination check may be related to, or the same as, the evaluation function - discussed later - but it may be something entirely different to this. There may not be a definitive answer to the problem you're looking at, and it may only be possible to evolve solutions which are as good as possible. In this case, it may not be obvious when to stop, and moreover, it may be a good idea to produce as many populations as possible given the computing/time resources you have available. In this case, the termination function may be a specific time limit or a specific number of generations. It is very important to note that the best individual in your final population may not be as good as the best individual in a previous generation (GAs do not perform hill-climbing searches, so it is perfectly possible for generations to degrade). Hence GAs should record the best individuals from every generation, and, as a final solution presented to the user, they should output the best solution found over all the generations.
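
Putting these points together, the canonical GA loop with that kind of record-keeping can be skeletoned as below (my own sketch); init_population, evaluate, next_generation and good_enough are assumed to be supplied by the user for the problem at hand.

def genetic_algorithm(init_population, evaluate, next_generation,
                      good_enough, max_generations=100):
    population = init_population()
    best_ever = max(population, key=evaluate)
    for _ in range(max_generations):
        # selection, mating, recombination and mutation all happen in here
        population = next_generation(population)
        best_now = max(population, key=evaluate)
        if evaluate(best_now) > evaluate(best_ever):
            best_ever = best_now          # generations may degrade, so keep the record
        if good_enough(best_ever):
            break
    return best_ever                      # the best individual over all generations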

17.2 Selection, Mating, Recombination and Mutation


So, the point of GAs is to generate population after population of individuals which represent possible solutions to the problem at hand in the hope that one individual in one generation will be a good solution. We look here at how to produce the next generation from the current generation. Note that there are various models for whether to kill off the previous generation, or allow some of the fittest individuals to stay alive for a while - we'll assume a culling of the old generation once the new one has been generated.

Selection

The first step is to choose the individuals which will have a shot at becoming the parents of the next generation. This is called the selection procedure, and its purpose is to choose those individuals from the current population which will go into an intermediate population (IP). Only individuals in this intermediate population will be chosen to mate with each other (and there's still no guarantee that they'll be chosen to mate, or that if they do mate, they will be successful - see later). To perform the selection, the GA agent will require a fitness function. This will assign a real number to each individual in the current generation. From this value, the GA calculates the number of copies of the individual which are guaranteed to go into the intermediate population and a probability which will be used to determine whether an additional copy goes into the IP. To be more specific, if the value calculated by the fitness function is an integer part followed by a fractional part, then the integer part dictates the number of copies of the individual which are guaranteed to go into the IP, and the fractional part is used as a probability: another copy of the individual is added to the IP with this probability, e.g., if it was 1/6, then a random number between 1 and 6 would be generated and only if it was a six would another copy be added. The fitness function will use an evaluation function to calculate a value of worth for the individual so that they can be compared against each other. Often the evaluation function is written g(c) for a particular individual c. Correctly specifying such evaluation functions is a tricky job, which we look at later. The fitness of an individual is calculated by dividing the value it gets for g by the average value for g over the entire population:

fitness(c) = g(c) / (average of g over the entire population)

We see that every individual has at least a chance of going into the intermediate population unless they score zero for the evaluation function. As an example of a fitness function using an evaluation function, suppose our GA agent has calculated the evaluation function for every member of the population, and the average is 17. Then, for a particular individual c0, the value of the evaluation function is 25. The fitness of c0 would be calculated as 25/17 ≈ 1.47. This means that one copy of c0 will definitely be added to the IP, and another copy will be added with a probability of 0.47 (e.g., a 100-sided die is thrown, and another copy of c0 is added to the IP only if it returns 47 or less).
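
The selection step just described can be sketched as follows (my own code); g is the evaluation function, and the integer and fractional parts of the fitness value are used exactly as above.

import random

def select_intermediate_population(population, g):
    average = sum(g(c) for c in population) / len(population)
    intermediate = []
    for c in population:
        fitness = g(c) / average
        whole, fraction = int(fitness), fitness - int(fitness)
        intermediate.extend([c] * whole)         # guaranteed copies
        if random.random() < fraction:           # e.g. fitness 1.47 gives one copy,
            intermediate.append(c)               # plus another with probability 0.47
    return intermediate

# Toy example: three individuals scored directly by an evaluation function g.
population = ["c0", "c1", "c2"]
scores = {"c0": 25, "c1": 17, "c2": 9}
print(select_intermediate_population(population, scores.get))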

Mating

Once our GA agent has chosen the individuals lucky enough (actually, fit enough) to produce offspring, we next determine how they are going to mate with each other. To do this, pairs are simply chosen randomly from the set of potential parents. That is, one individual is chosen randomly, then another - which may be the same as the first - is chosen, and that pair is lined up for the reproduction of one or more offspring (dependent on the recombination techniques used). Then whether or not they actually reproduce is probabilistic, and occurs with a probability pc. If they do reproduce, then their offspring are generated using a recombination and mutation procedure as described below, and these offspring are added to the next generation. This continues until the required number of offspring has been produced. Often this required number is the same as the current population size, to keep the population size constant. Note that there are repeated individuals in the IP, so some individuals may become the proud parent of multiple children.

This mating process has some analogy with natural evolution, because sometimes the fittest organisms may not have the opportunity to find a mate, and even if they do find a mate, it's not guaranteed that they will be able to reproduce. However, the analogy with natural evolution also breaks down here, because individuals can mate with themselves and there is no notion of sexes.
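
The mating step can be sketched as follows (my own illustration); reproduce is an assumed recombination function which takes two parents and returns a list of offspring, and p_c is the probability that a chosen pair actually reproduces.

import random

def mate(intermediate_population, reproduce, p_c=0.7, offspring_needed=10):
    offspring = []
    while len(offspring) < offspring_needed:
        mum = random.choice(intermediate_population)
        dad = random.choice(intermediate_population)   # may be the same individual
        if random.random() < p_c:                      # reproduction is probabilistic
            offspring.extend(reproduce(mum, dad))
    return offspring[:offspring_needed]

# e.g. with a trivial 'reproduce' that just returns copies of the parents:
print(mate(["c0", "c0", "c1", "c2"], reproduce=lambda m, d: [m, d], offspring_needed=4))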

Recombination

During the selection and mating process, the GA repeatedly lines up pairs of individuals for reproduction. The next question is how to generate offspring from these parent individuals. This is called the recombination process and how this is done is largely dependent on the representation scheme being used. We will look at three possibilities for recombination of individuals represented as bit strings. The population will only evolve to be better if the best parts of the best individuals are combined, hence recombination procedures usually take parts from both parents and place them into the offspring. In the One-Point Crossover recombination process, a point is chosen at random on the first individual, and the same point is chosen on the second individual. This splits both individuals into a left hand and a right hand side. Two offspring individuals are then produced by (i) taking the LHS of the first and adding it to the RHS of the second and (ii) by taking the LHS of the second and adding it to the RHS of the first. In the following example, the crossover point is after the fifth letter in the bit string:

Note that all the a's, b's, X's and Y's are actually ones or zeros. We see that the length of the two children is the same as that of the parents because GAs use a fixed representation (remember that the bit strings only make sense as solutions if they are of a particular length). In Two-point Crossover, as you would expect, two points are chosen in exactly the same place in both individuals. Then the bits falling in-between the two points are swapped to give two new offspring. For example, in the following diagram, the two points are after the 5th and 11th letters:

Again, the a's, b's, X's and Y's are ones or zeros, and we see that this recombination technique doesn't alter the string length either. As a third recombination operator, the inversion process simply takes a segment of a single individual and produces a single offspring by reversing the letters in-between two chosen points. For example:
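
Here is a sketch (my own) of all three recombination operators on strings; the a's and X's used below simply stand in for the ones and zeros of real individuals, and the crossover points are chosen at random but fall at the same positions in both parents, so the offspring keep the fixed string length.

import random

def one_point_crossover(mum, dad):
    p = random.randint(1, len(mum) - 1)
    return [mum[:p] + dad[p:], dad[:p] + mum[p:]]

def two_point_crossover(mum, dad):
    p, q = sorted(random.sample(range(1, len(mum)), 2))
    return [mum[:p] + dad[p:q] + mum[q:], dad[:p] + mum[p:q] + dad[q:]]

def inversion(parent):
    p, q = sorted(random.sample(range(len(parent) + 1), 2))
    return [parent[:p] + parent[p:q][::-1] + parent[q:]]

mum, dad = "aaaaaaaaaaaa", "XXXXXXXXXXXX"
print(one_point_crossover(mum, dad))    # two offspring, same length as the parents
print(two_point_crossover(mum, dad))    # two offspring, same length as the parents
print(inversion("010111010001"))        # one offspring with a reversed segment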

Mutation

It may appear that the above recombinations are a little arbitrary, especially as points defining where crossover and inversion occur are chosen randomly. However, it is important to note that large parts of the string are kept intact, which means that if the string contained a region which scored very well with the evaluation function, these operators have a good chance of passing that region on to the offspring (especially if the regions are fairly small, and, like in most GA problems, the overall string length is quite high). The recombination process produces a large range of possible solutions. However, it is still possible for it to guide the search into a local rather than the global maximum with respect to the evaluation function. For this reason, GAs usually perform random mutations. In this process, the offspring are taken and each bit in their bit string is flipped from a one to a zero or vice versa with a given probability. This probability is usually taken to be very small, say around 0.01, so that only about one in a hundred letters is flipped on average. In natural evolution, random mutations are often highly deleterious (harmful) to the organism, because the change in the DNA leads to big changes to the way the body works. It may seem sensible to protect the children of the fittest individuals in the population from the mutation process, using special alterations to the flipping probability distribution. However, it may be that it is actually the fittest individuals that are causing the population to stay at a local maximum. After all, they get to reproduce with higher frequency. Hence, protecting their offspring is not a good idea, especially as the GA will record the best from each generation, so we won't lose their good abilities totally. Random mutation has been shown to be effective at getting GA searches out of local maxima, which is why it is an important part of the process. To summarize the production of one generation from the previous: firstly, an intermediate population is produced by selecting copies of the fittest individuals using probability so that every individual has at least a chance of going into the intermediate population. Secondly, pairs from this intermediate population are chosen at random for reproduction (a pair might consist of the same individual twice), and the pair reproduce with a given fixed probability. Thirdly, offspring are generated through recombination procedures such as 1-point crossover, 2-point crossover and inversion. Finally, the offspring are randomly mutated to produce the next generation of individuals. Individuals from the old generation may be entirely killed off, but some may be allowed into the next generation (alternatively, the recombination procedure might be tuned to leave some individuals unchanged). The following schematic gives an indication of how the new generation is produced:
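
The mutation step itself is tiny; here is a sketch (my own) with a flipping probability of 0.01 per bit.

import random

def mutate(bit_string, p_mutation=0.01):
    """Flip each bit independently with a small probability."""
    return "".join(("1" if bit == "0" else "0") if random.random() < p_mutation else bit
                   for bit in bit_string)

print(mutate("0000000000000000000000000000000000000000"))   # usually unchanged, or one flip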


17.3 Two Difficulties
The first big problem we face when designing an AI agent to perform a GA-style search is how to represent the solutions. If the solutions are textual by nature, then ASCII strings require eight bits per letter, so the size of individuals can get very large. This will mean that evolution may take thousands of generations to converge onto a solution. Also, there will be much redundancy in the bit string representations: in general many bit strings produced by the recombination process will not represent solutions at all, e.g., they may represent ASCII characters which shouldn't appear in the solution. In the case of individuals which don't represent solutions, how do we measure these with the evaluation function? It doesn't necessarily follow that they are entirely unfit, because the tweaking of a single zero to a one might make them good solutions. The situation is better when the solution space is continuous, or the solutions represent real valued numbers or integers. The situation is worse when there are only a finite number of solutions. The second big problem we face is how to specify the evaluation function. This is crucial to the success of the GA experiment. The evaluation function should, if possible:

Return a real-valued number scoring higher for individuals which perform better with respect to the problem

Be quick to calculate, as this calculation will be done many thousands of times

Distinguish well between different individuals, i.e., give a good range of values

Even with a well-specified evaluation function, when populations have evolved to a certain stage, it is possible that the individuals will all score highly with respect to the evaluation function, so all have equal chances of reproducing. In this case, evolution will effectively have stopped, and it may be necessary to take some action to spread them out (make the evaluation function more sophisticated dynamically, possibly).

17.4 An Example Application


There are many fantastic applications of genetic algorithms. Perhaps my favourite is their usage in evaluating Jazz melodies done as part of a PhD project in Edinburgh. The one we look at here is chosen because it demonstrates how a fairly lightweight effort using GAs can often be highly effective. In their paper "The Application of Artificial Intelligence to Transportation System Design", Ricardo Hoar and Joanne Penner describe their undergraduate project, which involved representing vehicles on a road system as autonomous agents, and using a GA approach to evolve solutions to the timing of traffic lights to increase the traffic flow in the system. The optimum settings for when lights come on and go off are known only for very simple situations, so an AI-style search can be used to try and find good solutions. Hoar and Penner chose to do this in an evolutionary fashion. They don't give details of the representation scheme they used, but traffic light times are real-valued numbers, so they could have used a bit-string representation. The evaluation function they used involved the total waiting time and total driving time for each car in the system as follows:
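
Hoar and Penner's actual formula is not reproduced in these notes, so the snippet below is only a hypothetical illustration of how an evaluation function could be built from total waiting and driving times; the shape of the function and the weights are my own assumptions, not the authors'.

def evaluate_light_timings(cars, w_wait=1.0, w_drive=0.1):
    """Score one candidate set of traffic-light timings from simulation output:
    the smaller the total waiting and driving time, the fitter the individual."""
    total_wait = sum(car["waiting_time"] for car in cars)
    total_drive = sum(car["driving_time"] for car in cars)
    return 1.0 / (1.0 + w_wait * total_wait + w_drive * total_drive)

# Toy simulation output for two cars under some candidate light timings.
cars = [{"waiting_time": 30, "driving_time": 120},
        {"waiting_time": 5, "driving_time": 90}]
print(evaluate_light_timings(cars))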

The results they produced were good (worthy of writing a paper). The two graphs below describe the decrease in overall waiting time for a simple road and for a more complicated road (albeit not amazingly complicated).

We see that in both cases, the waiting time has roughly halved, which is a good result. In the first case, for the simple road system, the GA evolved a solution very similar to the ones worked out to be optimal by humans. We see that GAs can be used to find good near-optimal solutions to problems where a more cognitive approach might have failed (i.e., humans still can't work out how best to tune traffic light times, but a computer can evolve a good solution).
