
Northern Philippines College for Maritime, Science and Technology

Lingsat, City of San Fernando, La Union


Syllabus
Bachelor of Science in Computer Science
1st Semester, SY 2008-2009
Prerequisite: None
I. Course Code: Automata
II. Course Title: Automata and Language Theory
III. Course Description:
This course includes abstract machines and languages: finite automata, regular
expressions, pushdown automata, context-free languages, Turing machines, and
recursively enumerable languages.
IV. Vision:
We envision the Northern Philippines College for Maritime, Science and
Technology to be the center of educational excellence in producing graduates
whose skills and competitiveness are at par with local and international standards.
V. Mission:
Inspired by this vision, we endeavor to mold our students to become quality
professionals and workers in their respective fields of specialization, whose
knowledge, skills, and values will make them stand out in the extremely
competitive local and global market.
VII. Credit Units: 3 Units (3 hrs Lecture)
VIII. Course Contents:

Prelim
- Finite Automata
- Regular Languages
- Properties of Regular Languages

Midterm
- Context-Free Languages
- Transforming Grammars
- Pushdown Automata

Finals
- Parsing
- LR Parsing
- LL Parsing
- Properties of Context-Free Languages
- FLAP (Formal Languages and Automata Package), a tool for designing and
  simulating several variations of finite automata and pushdown automata. FLAP
  was developed by Susan Rodger, Dan Caugherty, Mark LoSacco, David Harrison,
  and Greg Badros.
- LEX (Lexical Analysis Program Generator)
- YACC (Parsing Program Generator)

IX. Instructional Methodologies:
- Lectures and Class Discussion
- Reading and Written Assignments
X. Evaluation Techniques:
- Paper & pencil test
- Oral Participation
- Student work products/projects
XI. Course Requirements:
- Prelim, Midterm, Final Exam
- Quizzes, Seatwork
- Assignments
- Project
Prepared by:
Mr. Romeo E. Balcita
Instructor
Noted by:
Mrs. Marie Cris S. Almoite
Dean, College of Computer Education
Approved by:
Dr. Rogelio Espiritu
Vice President for Academic Affairs

Automata theory
An automaton is a system in which energy, information, and material are transformed,
transmitted, and used to perform some function without the direct participation of
man. In theoretical computer science, automata theory is the study of abstract
machines and the problems they are able to solve. Automata theory is closely related to
formal language theory, as automata are often classified by the class of formal
languages they are able to recognize.
An automaton is a mathematical model for a finite state machine (FSM). A FSM is a
machine that, given an input of symbols, "jumps" through a series of states according to
a transition function (which can be expressed as a table). In the common "Mealy" variety
of FSMs, this transition function tells the automaton which state to go to next given a
current state and a current symbol.
The input is read symbol by symbol, until it is consumed completely (think of it as a tape
with a word written on it, that is read by a reading head of the automaton; the head
moves forward over the tape, reading one symbol at a time). Once the input is depleted,
the automaton is said to have stopped.
Depending on the state in which the automaton stops, the automaton is said to either
accept or reject the input. If it stops in an accept state, the automaton accepts
the word. If, on the other hand, it stops in a non-accepting state, the word is rejected.
The set of all the words accepted by an automaton is called the language accepted by the
automaton.
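This accept/reject behaviour is easy to sketch in code. The following minimal Python simulator is an illustration, not part of the original text: it reads the input symbol by symbol, follows the transition table, and accepts exactly when the stopping state is an accept state.

```python
# Minimal sketch of a deterministic finite automaton (DFA) simulator.
# The transition function is a dict mapping (state, symbol) -> state.

def accepts(transitions, start, accept_states, word):
    """Run the DFA over `word` symbol by symbol; accept iff the
    final state is an accept state."""
    state = start
    for symbol in word:
        state = transitions[(state, symbol)]
    return state in accept_states

# Illustrative example: accept binary strings that end in '1'.
delta = {
    ('q0', '0'): 'q0', ('q0', '1'): 'q1',
    ('q1', '0'): 'q0', ('q1', '1'): 'q1',
}
print(accepts(delta, 'q0', {'q1'}, '10101'))  # True
print(accepts(delta, 'q0', {'q1'}, '110'))    # False
```

The language accepted by this machine, in the sense defined above, is the set of binary strings whose last symbol is 1.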
Note, however, that, in general, an automaton need not have a finite number of states, or
even a countable number of states. Thus, for example, the quantum finite automaton
has an uncountable infinity of states, as the set of all possible states is the set of all
points in complex projective space. Thus, quantum finite automata, as well as finite state
machines, are special cases of a more general idea, that of a topological automaton,
where the set of states is a topological space, and the state transition functions are
taken from the set of all possible functions on the space. Topological automata are often
called M-automata, and are simply the augmentation of a semiautomaton with a set of
accept states, where set intersection determines whether the initial state is accepted or
rejected.
In general, an automaton need not strictly accept or reject an input; it may accept it with
some probability between zero and one. Again this is illustrated by the quantum finite
automaton, which only accepts input with some probability. This idea is again a special
case of a more general notion, the geometric automaton or metric automaton, where the
set of states is a metric space, and a language is accepted by the automaton if the
distance between the initial point, and the set of accept states is sufficiently small with
respect to the metric.
Automata play a major role in compiler design and parsing.

Vocabulary
The basic concepts of symbols, words, alphabets and strings are common to most
descriptions of automata. These are:
Symbol
An arbitrary datum which has some meaning to or effect on the machine.
Symbols are sometimes just called "letters" or "atoms".
Word
A finite string formed by the concatenation of a number of symbols.
Alphabet
A finite set of symbols. An alphabet is frequently denoted by Σ, which is the set of
letters in an alphabet.
Language
A set of words, formed by symbols in a given alphabet. May or may not be
infinite.
Kleene closure
A language may be thought of as a subset of all possible words. The set of all
possible words may, in turn, be thought of as the set of all possible
concatenations of strings. Formally, this set of all possible strings is called a free
monoid. It is denoted as Σ*, and the superscript * is called the Kleene star.
Formal description
An automaton is represented by the 5-tuple ⟨Q, Σ, δ, q0, F⟩, where:

- Q is a set of states.
- Σ is a finite set of symbols, that we will call the alphabet of the language the
  automaton accepts.
- δ is the transition function, that is, δ: Q × Σ → Q.
  (For non-deterministic automata, the empty string ε is an allowed input.)
- q0 is the start state, that is, the state in which the automaton is when no input has
  been processed yet, where q0 ∈ Q.
- F is a set of states of Q (i.e. F ⊆ Q) called accept states.

Given an input letter a ∈ Σ, one may write the transition function as δ_a: Q → Q,
using the simple trick of currying, that is, writing δ(q, a) = δ_a(q) for all q ∈ Q. This way,
the transition function can be seen in simpler terms: it's just something that "acts" on a
state in Q, yielding another state. One may then consider the result of function
composition repeatedly applied to the various functions δ_a, δ_b, and so on. Repeated
function composition forms a monoid. For the transition functions, this monoid is known
as the transition monoid, or sometimes the transformation semigroup.
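The currying trick and the composition of per-letter actions can be sketched in a few lines of Python (an illustration with made-up state names, not part of the original text):

```python
# Sketch: curry the transition function delta(q, a) into per-letter
# functions delta_a(q); composing them gives the action of a whole word.
from functools import reduce

delta = {
    ('q0', 'a'): 'q1', ('q0', 'b'): 'q0',
    ('q1', 'a'): 'q0', ('q1', 'b'): 'q1',
}

def curried(letter):
    """delta_a: a function that acts on states."""
    return lambda q: delta[(q, letter)]

def word_action(word):
    """Compose the per-letter actions in reading order; this is the
    action of `word` as an element of the transition monoid."""
    return lambda q: reduce(lambda state, a: curried(a)(state), word, q)

print(word_action('ab')('q0'))  # 'q1': q0 -a-> q1 -b-> q1
```

The empty word acts as the identity of the monoid: `word_action('')` leaves every state unchanged.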

Given a pair of letters a, b ∈ Σ, one may define a new function δ_ab = δ_b ∘ δ_a,
where ∘ denotes function composition (the letter read first acts first). Clearly, this
process can be recursively continued, and so one has a recursive definition of a function
δ̂_w defined for all words w ∈ Σ*, so that one has a map δ̂: Q × Σ* → Q.
The construction can also be reversed: given a δ̂, one can reconstruct a δ, and so the
two descriptions are equivalent.

The triple ⟨Q, Σ, δ⟩ is known as a semiautomaton. Semiautomata underlie automata, in
that they are just automata where one has ignored the starting state and the set of
accept states. The additional notions of a start state and an accept state allow automata
to do something the semiautomata cannot: they can recognize a formal language. The
language L accepted by a deterministic finite automaton ⟨Q, Σ, δ, q0, F⟩ is:

L = { w ∈ Σ* | δ̂(q0, w) ∈ F }

That is, the language accepted by an automaton is the set of all words w, over the
alphabet Σ, that, when given as input to the automaton, will result in its ending in some
state from F. Languages that are accepted by automata are called recognizable
languages.
When the set of states Q is finite, then the automaton is known as a finite state
automaton, and the set of all recognizable languages are the regular languages. In fact,
there is a strong equivalence: for every regular language, there is a finite state
automaton, and vice versa.
As noted above, the set Q need not be finite or countable; it may be taken to be a
general topological space, in which case one obtains topological automata. Another
possible generalization is the metric or geometric automaton. In this case, the
acceptance of a language is altered: instead of requiring that the final state δ̂(q0, w)
lie in the set F, the acceptance criterion is replaced by a probability, given in terms of
the metric distance between the final state δ̂(q0, w) and the set F. Certain types of
probabilistic automata are metric automata, with the metric being a measure on a
probability space.

Classes of finite automata


The following are three kinds of finite automata:
Deterministic finite automata (DFA)
Each state of an automaton of this kind has a transition for every symbol in the
alphabet.

DFA
Nondeterministic finite automata (NFA)
States of an automaton of this kind may or may not have a transition for each
symbol in the alphabet, or can even have multiple transitions for a symbol. The
automaton accepts a word if there exists at least one path from q0 to a state in F
labeled with the input word. If a transition is undefined, so that the automaton
does not know how to keep on reading the input, the word is rejected.

NFA, equivalent to the DFA from the previous example


Nondeterministic finite automata with ε-transitions (ε-NFA)
Besides being able to jump to more (or no) states on any symbol, these
can jump on no symbol at all. That is, if a state has transitions labeled with ε,
then the NFA can be in any of the states reached by the ε-transitions, directly or
through other states with ε-transitions. The set of states that can be reached by
this method from a state q is called the ε-closure of q.

It can be shown, though, that all these automata can accept the same languages. You
can always construct some DFA M' that accepts the same language as a given NFA M.
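The construction of such a DFA M' is the classical subset construction, in which each DFA state is a set of NFA states. A minimal sketch, assuming the NFA is given as a dict mapping (state, symbol) to a set of successor states (function and state names are illustrative):

```python
def nfa_to_dfa(nfa, start, alphabet):
    """Subset construction: DFA states are frozensets of NFA states.
    `nfa` maps (state, symbol) -> set of successor states."""
    start_set = frozenset([start])
    seen = {start_set}
    todo = [start_set]
    table = {}
    while todo:
        current = todo.pop()
        for a in alphabet:
            nxt = frozenset(t for q in current for t in nfa.get((q, a), ()))
            table[(current, a)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                todo.append(nxt)
    return table

def run(table, start_set, word):
    """Follow the deterministic table over `word`."""
    s = start_set
    for a in word:
        s = table[(s, a)]
    return s

# Illustrative NFA over {a, b} accepting words ending in "ab".
nfa = {('q0', 'a'): {'q0', 'q1'},
       ('q0', 'b'): {'q0'},
       ('q1', 'b'): {'q2'}}
table = nfa_to_dfa(nfa, 'q0', 'ab')
print('q2' in run(table, frozenset(['q0']), 'aab'))  # True
```

A word is accepted exactly when the final frozenset contains at least one NFA accept state, mirroring the "at least one path" condition above.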
Extensions of finite automata
The family of languages accepted by the above-described automata is called the family
of regular languages. More powerful automata can accept more complicated languages.
Such automata include:
Pushdown automata (PDA)
Such machines are identical to DFAs (or NFAs), except that they additionally
carry memory in the form of a stack. The transition function will now also
depend on the symbol(s) on top of the stack, and will specify how the stack is to
be changed at each transition. Non-deterministic PDAs accept the context-free
languages.
Linear Bounded Automata (LBA)
An LBA is a limited Turing machine; instead of an infinite tape, the tape has an
amount of space proportional to the size of the input string. LBAs accept the
context-sensitive languages.
Turing machines
These are the most powerful computational machines. They possess an infinite
memory in the form of a tape, and a head which can read and change the tape,
and move in either direction along the tape. Turing machines are equivalent to
algorithms, and are the theoretical basis for modern computers. Turing machines
decide recursive languages and recognize the recursively enumerable
languages.
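As a concrete illustration of what the stack buys, here is a small PDA-style recognizer for the context-free language { a^n b^n | n ≥ 1 }, which no finite automaton can accept. This is a hand-rolled sketch, not from the original text:

```python
def accepts_anbn(word):
    """PDA-style recognizer for a^n b^n, n >= 1: push a marker per 'a',
    pop one per 'b'; accept iff the input is exhausted with an empty stack."""
    stack = []
    i = 0
    # Phase 1: push one stack symbol for each leading 'a'.
    while i < len(word) and word[i] == 'a':
        stack.append('A')
        i += 1
    # Phase 2: pop one stack symbol for each 'b'.
    while i < len(word) and word[i] == 'b' and stack:
        stack.pop()
        i += 1
    return i == len(word) and not stack and len(word) > 0

print(accepts_anbn('aaabbb'))  # True
print(accepts_anbn('aabbb'))   # False
```

The stack height records how many a's are still unmatched, which is exactly the unbounded counting a finite set of states cannot do.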

Finite state machine

Fig.1 Example of a Finite State Machine


A finite state machine (FSM) or finite state automaton (plural: automata) or simply a
state machine, is a model of behavior composed of a finite number of states, transitions
between those states, and actions. A finite state machine is an abstract model of a
machine with a primitive internal memory.
Contents

1 Concepts and vocabulary


2 Classification
o 2.1 Acceptors and recognizers
2.1.1 Start state
2.1.2 Accept state
o 2.2 Transducers
3 FSM logic
4 Mathematical model
5 Optimization
6 Implementation
o 6.1 Hardware applications
o 6.2 Software applications

Concepts and vocabulary


A current state is determined by past states of the system. As such, it can be said to
record information about the past, i.e. it reflects the input changes from the system start
to the present moment. A transition indicates a state change and is described by a
condition that would need to be fulfilled to enable the transition. An action is a description
of an activity that is to be performed at a given moment. There are several action types:
Entry action
which is performed when entering the state
Exit action
which is performed when exiting the state
Input action
which is performed depending on present state and input conditions
Transition action
which is performed when performing a certain transition

An FSM can be represented using a state diagram (or state transition diagram) as in
figure 1 above. Besides this, several state transition table types are used. The most
common representation is shown below: the combination of current state (B) and
condition (Y) shows the next state (C). The complete actions information can be added
only using footnotes. An FSM definition including the full actions information is possible
using state tables (see also VFSM).
State transition table
Current State
State A State B State C
Condition
Condition X

...

...

...

Condition Y

...

State C ...

Condition Z

...

...

...

In addition to their use in modeling reactive systems presented here, finite state
automata are significant in many different areas, including electrical engineering,
linguistics, computer science, philosophy, biology, mathematics, and logic. A complete
survey of their applications is outside the scope of this article. Finite state machines are
a class of automata studied in automata theory and the theory of computation. In
computer science, finite state machines are widely used in modeling of application
behavior, design of hardware digital systems, software engineering, compilers, network
protocols, and the study of computation and languages.

Classification
There are two different groups: Acceptors/Recognizers and Transducers.
Acceptors and recognizers

Fig. 2 Acceptor FSM: parsing the word "nice"


Acceptors and recognizers (also sequence detectors) produce a binary output,
saying either yes or no to answer whether the input is accepted by the machine or not.
All states of the FSM are said to be either accepting or not accepting. At the time when
all input is processed, if the current state is an accepting state, the input is accepted;
otherwise it is rejected. As a rule the inputs are symbols (characters); actions are not
used. The example in figure 2 shows a finite state machine which accepts the word
"nice". In this FSM the only accepting state is number 7.
The machine can also be described as defining a language, which would contain every
word accepted by the machine but none of the rejected ones; we say then that the
language is accepted by the machine. By definition, the languages accepted by FSMs
are the regular languages - that is, a language is regular if there is some FSM that
accepts it.
Start state
The start state is usually shown drawn with an arrow "pointing at it from nowhere".

Accept state

Fig. 3: A finite state machine that determines if a binary number has an odd or even
number of 0s.
An accept state (sometimes referred to as an accepting state) is a state at which the
machine has successfully performed its procedure. It is usually represented by a double
circle.
An example of an accepting state appears on the left in this diagram of a deterministic
finite automaton (DFA) which determines if the binary input contains an even number of
0s.
S1 (which is also the start state) indicates the state at which an even number of 0s has
been input and is therefore defined as an accepting state. This machine will give a
correct end state if the binary number contains an even number of zeros including a
string with no zeros. Examples of strings accepted by this DFA are ε (the empty
string), 1, 11, 111..., 00, 010, 1010, 10110, and so on.
Transducers
Transducers generate output based on a given input and/or a state using actions. They
are used for control applications and in the field of computational linguistics. Here two
types are distinguished:
Moore machine
The FSM uses only entry actions, i.e. output depends only on the state. The
advantage of the Moore model is a simplification of the behaviour. The example
in figure 1 shows a Moore FSM of an elevator door. The state machine
recognizes two commands: "command_open" and "command_close" which
trigger state changes. The entry action (E:) in state "Opening" starts a motor
opening the door, the entry action in state "Closing" starts a motor in the other
direction closing the door. States "Opened" and "Closed" don't perform any
actions. They signal to the outside world (e.g. to other state machines) the
situation: "door is open" or "door is closed".


Fig. 4 Transducer FSM: Mealy model example


Mealy machine
The FSM uses only input actions, i.e. output depends on input and state. The use
of a Mealy FSM leads often to a reduction of the number of states. The example
in figure 4 shows a Mealy FSM implementing the same behaviour as in the
Moore example (the behaviour depends on the implemented FSM execution
model and will work e.g. for virtual FSM but not for event driven FSM). There are
two input actions (I:): "start motor to close the door if command_close arrives"
and "start motor in the other direction to open the door if command_open
arrives".
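A minimal sketch of the door controller in the Moore style described above, with the motor command attached as an entry action to the state being entered. The state, event, and action names are illustrative assumptions, not from the original text:

```python
# Moore-style sketch of the elevator door FSM: the output (motor command)
# depends only on the state entered, not on the triggering input.
ENTRY_ACTION = {
    'Opening': 'motor_open',   # E: start motor opening the door
    'Closing': 'motor_close',  # E: start motor closing the door
    'Opened': None,            # signals "door is open"; no action
    'Closed': None,            # signals "door is closed"; no action
}
TRANSITIONS = {
    ('Closed', 'command_open'): 'Opening',
    ('Opening', 'door_open'): 'Opened',
    ('Opened', 'command_close'): 'Closing',
    ('Closing', 'door_closed'): 'Closed',
}

def step(state, event):
    """Advance the machine; the entry action fires only on a state change."""
    nxt = TRANSITIONS.get((state, event), state)
    action = ENTRY_ACTION[nxt] if nxt != state else None
    return nxt, action

print(step('Closed', 'command_open'))  # ('Opening', 'motor_open')
```

In a Mealy-style version the action would be attached to the (state, event) pair instead of to the target state, which is what typically allows fewer states.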
In practice mixed models are often used including the predicate transition state machine
invented by Andrew Wightman.
More details about the differences and usage of Moore and Mealy models, including an
executable example, can be found in the external technical note "Moore or Mealy
model?"
A further distinction is between deterministic (DFA) and non-deterministic (NDFA,
GNFA) automata. In deterministic automata, for each state there is exactly one transition
for each possible input. In non-deterministic automata, there can be none, one, or more
than one transition from a given state for a given possible input. This distinction is
relevant in practice, but not in theory, as there exists an algorithm which can transform
any NDFA into an equivalent DFA, although this transformation typically significantly
increases the complexity of the automaton.
The FSM with only one state is called a combinatorial FSM and uses only input actions.
This concept is useful in cases where a number of FSM are required to work together,
and where it is convenient to consider a purely combinatorial part as a form of FSM to
suit the design tools.

FSM logic


Fig. 5 FSM Logic (Mealy)


The next state and output of an FSM are a function of the input and of the current state.
The FSM logic is shown in Figure 5.
Mathematical model
Depending on the type there are several definitions.
A deterministic finite state machine is a quintuple (Σ, S, s0, δ, F), where:

- Σ is the input alphabet (a finite, non-empty set of symbols).
- S is a finite, non-empty set of states.
- s0 is an initial state, an element of S.
- δ is the state-transition function: δ: S × Σ → S. In a nondeterministic finite
  state machine, δ: S × Σ → P(S), i.e. δ returns a set of states.
- F is the set of final states, a (possibly empty) subset of S.

An acceptor finite-state machine is a quintuple (Σ, S, s0, δ, F), where:

- Σ is the input alphabet (a finite, non-empty set of symbols).
- S is a finite, non-empty set of states.
- s0 is an initial state, an element of S.
- δ is the state-transition function: δ: S × Σ → S.
- F is the set of final states, a (possibly empty) subset of S.
A finite state transducer is a sextuple (Σ, Γ, S, s0, δ, ω), where:

- Σ is the input alphabet (a finite, non-empty set of symbols).
- Γ is the output alphabet (a finite, non-empty set of symbols).
- S is a finite, non-empty set of states.
- s0 is the initial state, an element of S. In a nondeterministic finite state machine,
  s0 is a set of initial states.
- δ is the state-transition function: δ: S × Σ → S.
- ω is the output function.

If the output function is a function of a state and the input alphabet (ω: S × Σ → Γ), that
definition corresponds to the Mealy model, and can be modelled as a Mealy machine. If
the output function depends only on a state (ω: S → Γ), that definition corresponds to
the Moore model, and can be modelled as a Moore machine. A finite-state machine with
no output function at all is known as a semiautomaton or transition system.
Optimization
Optimizing an FSM means finding the machine with the minimum number of states that
performs the same function. The fastest known algorithm for doing this is the Hopcroft
minimization algorithm.[1][2] Other techniques include using an implication table, or the
Moore reduction procedure. Additionally, acyclic FSAs can be minimized using a simple
bottom-up algorithm.[3]


Implementation
Hardware applications

Fig. 6 The circuit diagram for a 4 bit TTL counter, a type of state machine
In a digital circuit, an FSM may be built using a programmable logic device, a
programmable logic controller, logic gates and flip flops or relays. More specifically, a
hardware implementation requires a register to store state variables, a block of
combinational logic which determines the state transition, and a second block of
combinational logic that determines the output of an FSM. One of the classic hardware
implementations is the Richards controller.


Software applications
The following concepts are commonly used to build software applications with finite state
machines:

event driven FSM


virtual FSM (VFSM)
Automata-based programming


APPLICATION PART


Context-free grammar
In formal language theory, a context-free grammar (CFG) is a grammar in which every
production rule is of the form

V → w

where V is a single nonterminal symbol, and w is a string of terminals and/or
nonterminals (possibly empty). The term "context-free" expresses the fact that
nonterminals can be rewritten without regard to the context in which they occur. A formal
language is context-free if some context-free grammar generates it.
Context-free grammars play a central role in the description and design of programming
languages and compilers. They are also used for analyzing the syntax of natural
languages.
Contents

1 Background
2 Formal definitions
3 Examples
o 3.1 Example 1
o 3.2 Example 2
o 3.3 Example 3
o 3.4 Example 4
o 3.5 Example 5
o 3.6 Other examples
o 3.7 Derivations and syntax trees
4 Normal forms
5 Undecidable problems
6 Extensions
7 Linguistic applications

Background
Since the time of Pāṇini, at least, linguists have described the grammars of languages in
terms of their block structure, and described how sentences are recursively built up from
smaller phrases, and eventually individual words or word elements.
The context-free grammar (or "phrase-structure grammar" as Chomsky called it)
formalism developed by Noam Chomsky,[1] in the mid-1950s, took the manner in which
linguistics had described this grammatical structure, and then turned it into rigorous
mathematics. A context-free grammar provides a simple and precise mechanism for
describing the methods by which phrases in some natural language are built from
smaller blocks, capturing the "block structure" of sentences in a natural way. Its simplicity
makes the formalism amenable to rigorous mathematical study, but it comes at a price:
important features of natural language syntax such as agreement and reference cannot
be expressed in a natural way, or not at all.
Block structure was introduced into computer programming languages by the Algol
project, which, as a consequence, also featured a context-free grammar to describe the
resulting Algol syntax. This became a standard feature of computer languages, and the
notation for grammars used in concrete descriptions of computer languages came to be
known as Backus-Naur Form, after two members of the Algol language design
committee.
The "block structure" aspect that context-free grammars capture is so fundamental to
grammar that the terms syntax and grammar are often identified with context-free
grammar rules, especially in computer science. Formal constraints not captured by the
grammar are then considered to be part of the "semantics" of the language.
Context-free grammars are simple enough to allow the construction of efficient parsing
algorithms which, for a given string, determine whether and how it can be generated
from the grammar. An Earley parser is an example of such an algorithm, while the widely
used LR and LL parsers are more efficient algorithms that deal only with more restrictive
subsets of context-free grammars.
Formal definitions
A context-free grammar G is a 4-tuple G = (V, Σ, R, S), where:

1. V is a finite set of non-terminal characters or variables. They represent different types
   of phrase or clause in the sentence.
2. Σ is a finite set of terminals, disjoint from V, which make up the actual content of the
   sentence.
3. R is a relation from V to (V ∪ Σ)*, such that ∃ w ∈ (V ∪ Σ)* : (S, w) ∈ R.
4. S is the start variable, used to represent the whole sentence (or program). It must be
   an element of V.

In addition, R is a finite set. The members of R are called the rules or productions of the
grammar. The asterisk represents the Kleene star operation.
Additional Definition 1
For any strings u, v ∈ (V ∪ Σ)*, we say u yields v, written as u ⇒ v, if
∃ (α, β) ∈ R and u1, u2 ∈ (V ∪ Σ)* such that u = u1 α u2 and v = u1 β u2. Thus,
v is the result of applying the rule (α, β) to u.


Additional Definition 2
For any u, v ∈ (V ∪ Σ)*, we say u ⇒* v (or u ⇒⇒ v in some textbooks) if
∃ k ≥ 0 and u1, …, uk ∈ (V ∪ Σ)* such that u = u1 ⇒ u2 ⇒ … ⇒ uk = v.

Additional Definition 3
The language of a grammar G = (V, Σ, R, S) is the set

L(G) = { w ∈ Σ* | S ⇒* w }

Additional Definition 4
A language L is said to be a context-free language (CFL) if there exists a CFG G such
that L = L(G).

Additional Definition 5
A context-free grammar is said to be proper if it has:

- no useless symbols (inaccessible symbols or unproductive symbols)
- no ε-productions
- no cycles

Examples
Example 1
S → a
S → aS
S → bS
The terminals here are a and b, while the only non-terminal is S. The language
described is all nonempty strings of a's and b's that end in a.
This grammar is regular: no rule has more than one nonterminal in its right-hand side,
and each of these nonterminals is at the same end of the right-hand side.
Every regular grammar corresponds directly to a nondeterministic finite automaton, so
we know that this is a regular language.
It is common to list all right-hand sides for the same left-hand side on the same line,
using | to separate them, like this:
S → a | aS | bS


Technically, this is the same grammar as above.


Example 2
In a context-free grammar, we can pair up characters the way we do with brackets. The
simplest example:
S → aSb
S → ab
This grammar generates the language { a^n b^n | n ≥ 1 }, which is not regular.
The special character ε stands for the empty string. By changing the above grammar
to S → aSb | ε we obtain a grammar generating the language { a^n b^n | n ≥ 0 }
instead. This differs only in that it contains the empty string while the original grammar
did not.
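The derivations this grammar allows can be sketched mechanically: applying S → aSb some number of times and finishing with S → ab always yields a string of the form a^n b^n. A small illustrative sketch in Python, not part of the original text:

```python
def derive(n):
    """Leftmost derivation of a^n b^n (n >= 1) using S -> aSb | ab."""
    s = 'S'
    for _ in range(n - 1):
        s = s.replace('S', 'aSb', 1)   # apply S -> aSb
    return s.replace('S', 'ab', 1)     # finish with S -> ab

print(derive(3))  # 'aaabbb'
```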
Example 3
Here is a context-free grammar for syntactically correct infix algebraic expressions in the
variables x, y and z:
S → x | y | z | S + S | S - S | S * S | S / S | ( S )
This grammar can, for example, generate the string "( x + y ) * x - z * y / ( x + x )" as
follows: "S" is the initial string. "S - S" is the result of applying the fifth transformation
[S → S - S] to the nonterminal S. "S * S - S / S" is the result of applying the sixth transform
to the first S and the seventh one to the second S. "( S ) * S - S / ( S )" is the result of
applying the final transform to certain of the nonterminals. "( S + S ) * S - S * S / ( S +
S )" is the result of the fourth and fifth transforms to certain nonterminals. "( x + y ) * x - z
* y / ( x + x )" is the final result, obtained by using the first three transformations to turn
the S nonterminals into the terminals x, y, and z.
This grammar is ambiguous, meaning that one can generate the same string with more
than one parse tree. For example, "x + y * z" might have either the + or the * parsed first;
presumably these will produce different results. However, the language being described
is not itself ambiguous: a different, unambiguous grammar can be written for it.
Example 4
A context-free grammar for the language consisting of all strings over {a,b} for which the
number of a's and b's are different is
S → U | V
U → TaU | TaT
V → TbV | TbT
T → aTbT | bTaT | ε


Here, the nonterminal T can generate all strings with the same number of a's as b's, the
nonterminal U generates all strings with more a's than b's and the nonterminal V
generates all strings with fewer a's than b's.
Example 5
Another example of a non-regular language is { b^n a^m b^(2n) | n ≥ 0, m ≥ 0 }. It is
context-free as it can be generated by the following context-free grammar:
S → bSbb | A
A → aA | ε
Other examples
Context-free grammars are not limited in application to mathematical ("formal")
languages. For example, it has been suggested that a class of Tamil poetry called Venpa
is described by a context-free grammar.[2]
Derivations and syntax trees
There are two common ways to describe how a given string can be derived from the
start symbol of a given grammar. The simplest way is to list the consecutive strings of
symbols, beginning with the start symbol and ending with the string, and the rules that
have been applied. If we introduce a strategy such as "always replace the left-most
nonterminal first" then for context-free grammars the list of applied grammar rules is by
itself sufficient. This is called the leftmost derivation of a string. For example, if we take
the following grammar:
(1) S → S + S
(2) S → 1
(3) S → a
and the string "1 + 1 + a", then a leftmost derivation of this string is the list [ (1), (1), (2), (2),
(3) ]. Analogously the rightmost derivation is defined as the list that we get if we always
replace the rightmost nonterminal first. In this case this could be the list [ (1), (3), (1), (2),
(2) ].
The distinction between leftmost derivation and rightmost derivation is important
because in most parsers the transformation of the input is defined by giving a piece of
code for every grammar rule that is executed whenever the rule is applied. Therefore it is
important to know whether the parser determines a leftmost or a rightmost derivation
because this determines the order in which the pieces of code will be executed. See for
an example LL parsers and LR parsers.
A derivation also imposes in some sense a hierarchical structure on the string that is
derived. For example, if the string "1 + 1 + a" is derived according to the leftmost
derivation:
S ⇒ S + S (1)
  ⇒ S + S + S (1)
  ⇒ 1 + S + S (2)
  ⇒ 1 + 1 + S (2)
  ⇒ 1 + 1 + a (3)
the structure of the string would be:
{ { { 1 } S + { 1 } S }S + { a } S }S
where { ... }S indicates a substring recognized as belonging to S. This hierarchy can also
be seen as a tree:
           S
         / | \
        S '+' S
      / | \    |
     S '+' S  'a'
     |     |
    '1'   '1'
This tree is called a concrete syntax tree (see also abstract syntax tree) of the string. In
this case the presented leftmost and the rightmost derivations define the same syntax
tree; however, there is another (leftmost) derivation of the same string
S ⇒ S + S (1)
  ⇒ 1 + S (2)
  ⇒ 1 + S + S (1)
  ⇒ 1 + 1 + S (2)
  ⇒ 1 + 1 + a (3)
and this defines the following syntax tree:
        S
      / | \
     S '+' S
     |    / | \
    '1'  S '+' S
         |     |
        '1'   'a'
If, for certain strings in the language of the grammar, there is more than one parse tree,
then the grammar is said to be an ambiguous grammar. Such grammars are usually hard
to parse because the parser cannot always decide which grammar rule it has to apply.
Usually, ambiguity is a feature of the grammar, not the language, and an unambiguous
grammar can be found that generates the same context-free language.
However, there are certain languages which can only be generated by ambiguous
grammars; such languages are called inherently ambiguous.
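The ambiguity of the example grammar can be demonstrated concretely: the two leftmost derivations of "1 + 1 + a" given earlier build different parse trees. The sketch below (an illustrative construction, not a standard API) replays a leftmost derivation while building the tree, representing a node as a list whose first element is its label:

```python
# Rules of the example grammar, keyed by the numbers used in the text.
RULES = {1: ("S", ["S", "+", "S"]), 2: ("S", ["1"]), 3: ("S", ["a"])}

def build_tree(derivation):
    """Replay a leftmost derivation, expanding tree nodes instead of symbols."""
    root = ["S"]                 # a node is [label, *children]
    frontier = [root]            # unexpanded nonterminal nodes, left to right
    for n in derivation:
        lhs, rhs = RULES[n]
        node = frontier.pop(0)   # leftmost unexpanded nonterminal
        assert node[0] == lhs
        children = [[sym] for sym in rhs]
        node.extend(children)
        # new nonterminal children go to the front, preserving leftmost order
        frontier[:0] = [c for c in children if c[0] == "S"]
    return root

t1 = build_tree([1, 1, 2, 2, 3])   # groups as ((1 + 1) + a)
t2 = build_tree([1, 2, 1, 2, 3])   # groups as (1 + (1 + a))
print(t1 != t2)                    # -> True: same string, different trees
```

The two trees correspond exactly to the two syntax trees drawn above, so the grammar is ambiguous.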
Normal forms
Every context-free grammar that does not generate the empty string can be transformed
into one in which no rule has the empty string as a product (a rule with ε as a product is
called an ε-production). If the grammar does generate the empty string, it will be
necessary to include the rule S → ε, but there need be no other ε-production. Every
context-free grammar with no ε-production has an equivalent grammar in Chomsky
normal form or Greibach normal form. "Equivalent" here means that the two grammars
generate the same language.
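The ε-elimination step mentioned above can be sketched in a few lines. This is an illustrative implementation, assuming a grammar represented as a dict from nonterminals to lists of right-hand sides, with the empty list standing for ε:

```python
from itertools import product

def remove_epsilon(rules):
    """Return an equivalent grammar with no ε-productions (assumes the
    grammar's start symbol does not need to derive the empty string)."""
    # 1. Find nullable nonterminals: those that can derive ε.
    nullable, changed = set(), True
    while changed:
        changed = False
        for a, rhss in rules.items():
            if a not in nullable and any(
                    all(s in nullable for s in rhs) for rhs in rhss):
                nullable.add(a)
                changed = True
    # 2. For each rule, add every variant with nullable symbols dropped.
    new = {}
    for a, rhss in rules.items():
        out = set()
        for rhs in rhss:
            # each nullable symbol may be kept or dropped
            choices = [((s,), ()) if s in nullable else ((s,),) for s in rhs]
            for combo in product(*choices):
                variant = tuple(s for part in combo for s in part)
                if variant:                  # never re-introduce ε itself
                    out.add(variant)
        new[a] = out
    return new

G = {"S": [["A", "B"]], "A": [["a"], []], "B": [["b"]]}
print(remove_epsilon(G))
# A is nullable, so S keeps S -> A B and gains S -> B
```

Step 1 is a fixed-point computation; step 2 compensates for the removed ε-productions by enumerating every way of omitting nullable symbols from a right-hand side.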
Because of the especially simple form of production rules in Chomsky Normal Form
grammars, this normal form has both theoretical and practical implications. For instance,
given a context-free grammar, one can use the Chomsky Normal Form to construct a
polynomial-time algorithm which decides whether a given string is in the language
represented by that grammar or not (the CYK algorithm).
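The CYK algorithm mentioned above can be sketched briefly, assuming the grammar is already in Chomsky Normal Form. The grammar used here (for the language aⁿbⁿ, n ≥ 1) is an illustrative choice, not one from the text:

```python
# CNF grammar for a^n b^n, n >= 1:
#   S -> A B | A X,  X -> S B,  A -> a,  B -> b
TERM = {"A": {"a"}, "B": {"b"}}                 # X -> terminal rules
BIN = {"S": {("A", "B"), ("A", "X")},           # X -> Y Z rules
       "X": {("S", "B")}}

def cyk(word, start="S"):
    """Decide membership in the grammar's language in O(n^3) time."""
    n = len(word)
    if n == 0:
        return False
    # T[(i, j)] = set of nonterminals deriving the substring word[i:j]
    T = {(i, i + 1): {A for A, ts in TERM.items() if c in ts}
         for i, c in enumerate(word)}
    for length in range(2, n + 1):              # substring length
        for i in range(n - length + 1):
            j = i + length
            T[(i, j)] = set()
            for k in range(i + 1, j):           # split point
                for A, pairs in BIN.items():
                    for (B, C) in pairs:
                        if B in T[(i, k)] and C in T[(k, j)]:
                            T[(i, j)].add(A)
    return start in T[(0, n)]

print(cyk("aabb"), cyk("aab"))   # -> True False
```

The table is filled bottom-up over ever-longer substrings, which is exactly why CNF matters: every binary rule splits a substring into two strictly shorter pieces that have already been computed.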
Undecidable problems
Although some operations on context-free grammars are decidable, thanks to their limited
power, CFGs do have interesting undecidable problems. One of the simplest and most
cited is the problem of deciding whether a CFG generates the language of all strings. A
reduction can be demonstrated to this problem from the well-known undecidable
problem of determining whether a Turing machine accepts a particular input (the Halting
problem). The reduction uses the concept of a computation history, a string describing
an entire computation of a Turing machine. We can construct a CFG that generates all
strings that are not accepting computation histories for a particular Turing machine on a
particular input, and thus it will generate all strings only if the machine does not accept
that input.
As a consequence of this, it is also undecidable whether two CFGs describe the same
language, since we can't even decide whether a CFG is equivalent to the trivial CFG
deciding the language of all strings.
Another point worth mentioning is that the problem of determining if a context-sensitive
grammar describes a context-free language is undecidable.
Extensions
An obvious way to extend the context-free grammar formalism is to allow nonterminals
to have arguments, the values of which are passed along within the rules. This allows
natural language features such as agreement and reference, and programming
language analogs such as the correct use and definition of identifiers, to be expressed in
a natural way. For example, we can now easily express that in English sentences the
subject and verb must agree in number.
In computer science, examples of this approach include affix grammars, attribute
grammars, indexed grammars, and Van Wijngaarden two-level grammars.
Similar extensions exist in linguistics.
Another extension is to allow additional symbols to appear at the left hand side of rules,
constraining their application. This produces the formalism of context-sensitive
grammars.
Linguistic applications
Chomsky initially hoped to overcome the limitations of context-free grammars by adding
transformation rules.
Such rules are another standard device in traditional linguistics; e.g. passivization in
English. However, arbitrary transformations must be disallowed, since they are much too
powerful (Turing-complete). Much of generative grammar has been devoted to finding
ways of refining the descriptive mechanisms of phrase-structure grammar and
transformation rules so that exactly the kinds of constructions that natural language
actually allows can be expressed.
Chomsky's general position regarding the non-context-freeness of natural language has
held up since then, although his specific examples of the inadequacy of context-free
grammars (CFGs) in terms of their weak generative capacity were later disproved.
Gerald Gazdar and Geoffrey Pullum have argued that despite a few non-context-free
constructions in natural language (such as cross-serial dependencies in Swiss German
and reduplication in Bambara), the vast majority of forms in natural language are indeed
context-free.