Вы находитесь на странице: 1из 6

Introduction to Syntax Analysis

When an input string (source code or a program in some language) is given to a compiler, the compiler
processes it in several phases, starting from lexical analysis (scans the input and divides it into tokens) to
target code generation.

Syntax Analysis or Parsing is the second phase, i.e. after lexical analysis. It checks the syntactical
structure of the given input, i.e. whether the given input is in the correct syntax (of the language in
which the input has been written) or not. It does so by building a data structure, called a Parse tree or
Syntax tree. The parse tree is constructed by using the pre-defined Grammar of the language and the
input string. If the given input string can be produced with the help of the syntax tree (in the derivation
process), the input string is found to be in the correct syntax. if not, error is reported by syntax analyzer.

The Grammar for a Language consists of Production rules.

Example:

Suppose Production rules for the Grammar of a language are:

S -> cAd

A -> bc|a

And the input string is “cad”.

The parser attempts to construct syntax tree from this grammar for the given input string. It uses the
given production rules and applies those as needed to generate the string.

In the step iii above, the production rule A->bc was not a suitable one to apply (because the string
produced is “cbcd” not “cad”), here the parser needs to backtrack, and apply the next production rule
available with A which is shown in the step iv, and the string “cad” is produced.
Thus, the given input can be produced by the given grammar, therefore the input is correct in syntax.

But backtrack was needed to get the correct syntax tree, which is really a complex process to
implement.

Why FIRST and FOLLOW?


Why FIRST?

If the compiler would have come to know in advance, that what is the “first character of the string
produced when a production rule is applied”, and comparing it to the current character or token in the
input string it sees, it can wisely take decision on which production rule to apply.

Let’s take the same grammar from the previous article:

S -> cAd

A -> bc|a

And the input string is “cad”.

Thus, in the example above, if it knew that after reading character ‘c’ in the input string and applying
S->cAd, next character in the input string is ‘a’, then it would have ignored the production rule A->bc
(because ‘b’ is the first character of the string produced by this production rule, not ‘a’ ), and directly use
the production rule A->a (because ‘a’ is the first character of the string produced by this production rule,
and is same as the current character of the input string which is also ‘a’).

Hence it is validated that if the compiler/parser knows about first character of the string that can be
obtained by applying a production rule, then it can wisely apply the correct production rule to get the
correct syntax tree for the given input string.

Why FOLLOW?

The parser faces one more problem. Let us consider below grammar to understand this problem.

A -> aBb

B -> c | ε

And suppose the input string is “ab” to parse.

As the first character in the input is a, the parser applies the rule A->aBb.

/| \

a B b

Now the parser checks for the second character of the input string which is b, and the Non-Terminal to
derive is B, but the parser can’t get any string derivable from B that contains b as first character.
But the Grammar does contain a production rule B -> ε, if that is applied then B will vanish, and the
parser gets the input “ab” , as shown below. But the parser can apply it only when it knows that the
character that follows B is same as the current character in the input.

In RHS of A -> aBb, b follows Non-Terminal B, i.e. FOLLOW(B) = {b}, and the current input character read
is also b. Hence the parser applies this rule. And it is able to get the string “ab” from the given grammar.

A A

/ | \ / \

a B b => a b

So FOLLOW can make a Non-terminal to vanish out if needed to generate the string from the parse tree.

The conclusions is, we need to find FIRST and FOLLOW sets for a given grammar, so that the parser can
properly apply the needed rule at the correct position.

FIRST Set in Syntax Analysis


FIRST(X) for a grammar symbol X is the set of terminals that begin the strings derivable from X.

Rules to compute FIRST set:

1. If x is a terminal, then FIRST(x) = { ‘x’ }


2. If x-> Є, is a production rule, then add Є to FIRST(x).
3. If X->Y1 Y2 Y3….Yn is a production,
a. FIRST(X) = FIRST(Y1)
b. If FIRST(Y1) contains Є then FIRST(X) = { FIRST(Y1) – Є } U { FIRST(Y2) }
c. If FIRST (Yi) contains Є for all i = 1 to n, then add Є to FIRST(X).

Example 1:

Production Rules of Grammar

E -> TE’ FIRST sets

E’ -> +T E’|Є FIRST(E) = FIRST(T) = { ( , id }

T -> F T’ FIRST(E’) = { +, Є }

T’ -> *F T’ | Є FIRST(T) = FIRST(F) = { ( , id }

F -> (E) | id FIRST(T’) = { *, Є }

FIRST(F) = { ( , id }
Example 2:

Production Rules of Grammar FIRST sets

S -> ACB | Cbb | Ba FIRST(S) = FIRST(A) U FIRST(B) U FIRST(C)

A -> da | BC = { d, g, h, Є, b, a}

B -> g | Є FIRST(A) = { d } U FIRST(B) = { d, g , h, Є }

C -> h | Є FIRST(B) = { g , Є }

FIRST(C) = { h , Є }

Notes:

The grammar used above is Context-Free Grammar (CFG). Syntax of most of the programming language
can be specified using CFG.

CFG is of the form A -> B , where A is a single Non-Terminal, and B can be a set of grammar symbols ( i.e.
Terminals as well as Non-Terminals)

FOLLOW Set in Syntax Analysis


Follow(X) to be the set of terminals that can appear immediately to the right of Non-Terminal X in some
sentential form.

Example:

S ->Aa | Ac

A ->b

S S

/ \ / \

A a A C

| |

b b

Here, FOLLOW (A) = {a, c}


Rules to compute FOLLOW set:

1) FOLLOW(S) = { $ } // where S is the starting Non-Terminal

2) If A -> pBq is a production, where p, B and q are any grammar symbols,

then everything in FIRST(q) except ? is in FOLLOW(B.

3) If A->pB is a production, then everything in FOLLOW(A) is in FOLLOW(B).

4) If A->pBq is a production and FIRST(q) contains ?,

then FOLLOW(B) contains { FIRST(q) – ? } U FOLLOW(A)

Example 1:

Production Rules:

E -> TE’

E’ -> +T E’|?

T -> F T’

T’ -> *F T’ | ?

F -> (E) | id

FIRST set

FIRST(E) = FIRST(T) = { ( , id }

FIRST(E’) = { +, ? }

FIRST(T) = FIRST(F) = { ( , id }

FIRST(T’) = { *, ? }

FIRST(F) = { ( , id }

FOLLOW Set

FOLLOW(E) = { $ , ) } // Note ')' is there because of 5th rule

FOLLOW(E’) = FOLLOW(E) = { $, ) } // See 1st production rule

FOLLOW(T) = { FIRST(E’) – ? } U FOLLOW(E’) U FOLLOW(E) = { + , $ , ) }

FOLLOW(T’) = FOLLOW(T) = {+,$,)}

FOLLOW(F) = { FIRST(T’) – ? } U FOLLOW(T’) U FOLLOW(T) = { *, +, $, ) }


Example 2:

Production Rules:

S -> ACB|Cbb|Ba

A -> da|BC

B-> g|?

C-> h| ?

FIRST set

FIRST(S) = FIRST(A) U FIRST(B) U FIRST(C) = { d, g, h, ?, b, a}

FIRST(A) = { d } U FIRST(B) = { d, g, h, ? }

FIRST(B) = { g, ? }

FIRST(C) = { h, ? }

FOLLOW Set

FOLLOW(S) = { $ }

FOLLOW(A) = { h, g, $ }

FOLLOW(B) = { a, $, h, g }

FOLLOW(C) = { b, g, $, h }

Note :

? as a FOLLOW doesn’t mean anything (? is an empty string).

$ is called end-marker, which represents the end of the input string, hence used while parsing to
indicate that the input string has been completely processed.

The grammar used above is Context-Free Grammar (CFG). The syntax of a programming language can be
specified using CFG.

CFG is of the form A -> B , where A is a single Non-Terminal, and B can be a set of grammar symbols ( i.e.
Terminals as well as Non-Terminals)

Вам также может понравиться