[Figure: lexical analyser taking the source program as input, connected to the parser (whose output goes to semantic analysis) and to the symbol table]
Token attributes
Apart from the token itself, the lexical analyzer also passes on other information about the token. These items of information are called token attributes. When more than one lexeme can match a pattern, the lexical analyzer must tell the subsequent phases of compilation which particular lexeme was matched. The token names and associated attribute values for the statement a=b+c are written below as a sequence of pairs:
<id, pointer to symbol-table entry for a>
<assign_op>
<id, pointer to symbol-table entry for b>
<add_op>
<id, pointer to symbol-table entry for c>
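The pairs for a=b+c can be sketched in Python. This is only an illustration under stated assumptions: `tokenize` is a hypothetical helper, identifiers are single letters, and the symbol table is a plain list whose index stands in for the pointer to the symbol-table entry.

```python
def tokenize(statement):
    """Return (tokens, symbol_table) for a statement built from
    single-letter identifiers and the operators = and +."""
    symbol_table = []                       # one entry per distinct identifier
    tokens = []
    operators = {"=": "assign_op", "+": "add_op"}
    for ch in statement:
        if ch.isspace():
            continue                        # whitespace separates lexemes only
        if ch in operators:
            tokens.append((operators[ch], None))   # operator tokens carry no attribute
        elif ch.isalpha():
            if ch not in symbol_table:
                symbol_table.append(ch)
            # The attribute is the index of the symbol-table entry,
            # standing in for a pointer (an assumption of this sketch).
            tokens.append(("id", symbol_table.index(ch)))
        else:
            raise ValueError(f"unexpected character: {ch!r}")
    return tokens, symbol_table


tokens, table = tokenize("a=b+c")
```

Here `tokens` holds five pairs, and each `id` pair refers to the entry for a, b, or c in `table`.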
Scanning
The lexical analyzer scans the characters of the source program one at a time to discover tokens. Reading the input from its source one character at a time is costly, so a block of data is first read into a buffer and then scanned by the lexical analyzer. The lexical analyzer uses two pointers to read tokens:
lb (lexeme_beginning) marks the beginning of the current lexeme.
sp (search pointer) keeps track of how much of the input string has been scanned.
Initially, both pointers point to the beginning of a lexeme. The sp then scans forward to find the end of the lexeme. When the end of the lexeme is identified, the token and attribute corresponding to the lexeme are returned. lb and sp are then made to point to the beginning of the next lexeme.
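The two-pointer scan can be sketched as follows. This is a simplified illustration, not a full scanner: the function name `scan` and the lexeme rules (runs of letters, digits, or underscores form one lexeme; any other non-blank character is a lexeme on its own) are assumptions made for the example.

```python
def scan(buffer):
    """Split a buffer into lexemes using lb (lexeme beginning)
    and sp (search pointer)."""
    lexemes = []
    n = len(buffer)
    lb = 0
    while lb < n:
        if buffer[lb].isspace():
            lb += 1                      # skip blanks between lexemes
            continue
        sp = lb                          # both pointers start at the lexeme
        if buffer[sp].isalnum() or buffer[sp] == "_":
            # Move sp forward until the end of the lexeme is found.
            while sp < n and (buffer[sp].isalnum() or buffer[sp] == "_"):
                sp += 1
        else:
            sp += 1                      # single-character lexeme
        lexemes.append(buffer[lb:sp])
        lb = sp                          # lb moves to the start of the next lexeme
    return lexemes


result = scan("int a=12;")
```

For the input `int a=12;` the sketch yields the lexemes int, a, =, 12, and ; in that order.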
Commonly used buffering methods are:
One buffer scheme
Two buffer scheme
An extra character called the sentinel, which is not one of the input characters, can be added at the end of the input buffer to reduce the number of end-of-buffer tests.
[Figure: input buffer holding int a=12; with the lb and sp pointers]
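The effect of a sentinel can be illustrated with a short Python sketch. The names `SENTINEL` and `end_of_identifier` are hypothetical, and the choice of the NUL character as sentinel is an assumption. Because the sentinel is not a letter or digit, the loop that advances sp needs only one test per character and no separate end-of-buffer test. Python still checks bounds internally, so the sketch shows the idea rather than the performance gain.

```python
SENTINEL = "\0"   # assumed never to occur in the source text


def end_of_identifier(buffer, lb):
    """Return the position just past the identifier starting at lb.

    The buffer must end with SENTINEL. The sentinel is not alphanumeric,
    so the single test in the loop also stops the scan at the end of
    the buffer.
    """
    sp = lb
    while buffer[sp].isalnum():
        sp += 1
    return sp


buffer = "count" + SENTINEL
end = end_of_identifier(buffer, 0)
```

Without the sentinel, the same loop would run past the end of the buffer when the identifier is the last thing in it.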
Specification of tokens
The patterns corresponding to a token are generally specified using a compact notation known as a regular expression. A regular expression r is defined by the set of strings that match it. This set of strings is called the language generated by the regular expression and is written L(r). The set of symbols from which the strings of the language are built is called the alphabet of the language and is denoted by Σ. Example: the set of letters is L = {A, ..., Z, a, ..., z} and the set of digits is D = {0, 1, ..., 9}. Then L U D is a language. For a language with alphabet Σ, the following rules define the regular expressions:
ε is a regular expression denoting the language {ε}, that is, the set containing only the empty string.
If a is a symbol in Σ, then a is also a regular expression, denoting the language {a}, that is, the set containing only the string a.
If r1 and r2 are regular expressions corresponding to the languages L1 and L2 respectively, then:
r1|r2 is a regular expression corresponding to the language L1 U L2, i.e. the set containing all the strings of L1 and all the strings of L2.
r1r2 is a regular expression corresponding to the language formed by concatenating each string of L1 with each string of L2, in that order.
r1* is a regular expression corresponding to the language L1*, the set of strings formed by concatenating zero or more strings belonging to L1.
(r1) is a regular expression corresponding to the language L1 itself.
The unary operator * has the highest precedence and is left associative. Concatenation has the next highest precedence, and | has the lowest; both are left associative.
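The three language operations can be tried out on small finite languages. The following Python sketch uses hypothetical helper names (`union`, `concat`, `star`). Since L1* is infinite whenever L1 contains a non-empty string, `star` takes an assumed bound `max_repeats` and returns only the strings built from at most that many concatenations.

```python
def union(l1, l2):
    """L1 U L2: every string that is in L1 or in L2."""
    return l1 | l2


def concat(l1, l2):
    """Every string of L1 followed by every string of L2."""
    return {x + y for x in l1 for y in l2}


def star(l1, max_repeats):
    """Finite part of L1*: strings formed by concatenating
    zero up to max_repeats strings of L1."""
    result = {""}            # zero occurrences gives the empty string
    current = {""}
    for _ in range(max_repeats):
        current = concat(current, l1)
        result |= current
    return result


L1 = {"a"}
L2 = {"b", "c"}
u = union(L1, L2)
c = concat(L1, L2)
s = star(L1, 3)
```

With these sample languages, the union is {a, b, c}, the concatenation is {ab, ac}, and the bounded star contains the empty string, a, aa, and aaa.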
For Σ = {0, 1}, consider the following regular expressions:
a) (0|1)* denotes all binary strings, including the empty string: {ε, 0, 1, 001, 011, ...}
b) (0|1)(0|1)* denotes all non-empty binary strings.
c) 0(0|1)*0 denotes all binary strings of length at least two that both start and end with 0: {00, 010, 000, ...}
d) (0|1)*0(0|1)(0|1) denotes all binary strings whose third symbol from the end is 0.
e) 0*10*10*10* denotes all binary strings containing exactly three 1s.
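These expressions can be checked mechanically. The sketch below uses Python's standard `re` module, whose syntax for |, *, and parentheses coincides with the notation used here; `re.fullmatch` succeeds only when the whole string matches. The dictionary `patterns` and the helper `matches` are names introduced for this example.

```python
import re

# The five expressions a) to e), written in Python regex syntax.
patterns = {
    "a": r"(0|1)*",
    "b": r"(0|1)(0|1)*",
    "c": r"0(0|1)*0",
    "d": r"(0|1)*0(0|1)(0|1)",
    "e": r"0*10*10*10*",
}


def matches(label, s):
    """True if the whole string s belongs to the language of the expression."""
    return re.fullmatch(patterns[label], s) is not None


empty_in_a = matches("a", "")          # a) includes the empty string
empty_in_b = matches("b", "")          # b) does not
c_ok = matches("c", "010")             # starts and ends with 0
d_ok = matches("d", "1011")            # ends with 0 followed by two symbols
e_ok = matches("e", "0101001")         # contains three 1s
```

Trying strings such as 0 against c) or 11 against e) shows which strings fall outside each language.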
Recognition of Tokens
The tokens obtained during lexical analysis are recognized using finite automata. A finite automaton, or finite-state machine, is a mathematical model of a recognizer for the language of a regular expression and can be drawn as a transition diagram. It takes a string as input and decides whether the string belongs to the language or not. Transition diagrams have a collection of nodes or circles called states. Each state represents a condition that could occur during the process of scanning the input looking for a lexeme that matches one of several patterns. Edges are directed from one state of the transition diagram to another. Each edge is labeled by a symbol or set of symbols. If we are in some state s and the next input symbol is a, we look for an edge out of state s labeled by a.
[Figure: state q0 with an incoming arrow labeled start]
A circle with an incoming arrow that does not originate from any node represents the start state of the machine.
[Figure: state q0 drawn as two concentric circles]
Two concentric circles represent a final state. Here, q0 is the final state.
[Figure: arrow labeled 1 from state q0 to state q1]
An arrow with label 1 goes from state q0 to state q1. This indicates a transition from state q0 to state q1 on input symbol 1. It is written as: δ(q0, 1) = q1
[Figure: arrow labeled 0, 1 from state q0 to state q1]
An arrow with label 0, 1 goes from state q0 to state q1. This indicates that the machine, in state q0, enters state q1 on reading a 0 or a 1. It is written as:
δ(q0, 0) = q1
δ(q0, 1) = q1
Deterministic Finite Automata
Non-deterministic Finite Automata
Non-deterministic Finite Automata with ε-moves
Transition diagram
The transition diagram for a DFA M = (Q, Σ, δ, q0, F) is a graph drawn with single circles, double circles, and labeled arrows. It is formally defined as follows:
Each state in Q corresponds to one node or vertex, drawn as a single or double circle.
Symbols of the alphabet Σ appear as edge labels.
The transition from one state to another is indicated by a directed edge. Let δ(qi, a) = qj. This indicates that there is a directed edge from qi to qj labeled a.
The start state is the state with an incoming arrow that does not originate from any node. This arrow is labeled start.
The final states or accepting states, which are in F, are drawn as double circles. The states that are not in F are drawn as single circles.
Transition Table
The transition table for a DFA M = (Q, Σ, δ, q0, F) is a conventional, tabular representation of the transition function δ, which takes two arguments and returns a value:
The rows of the table correspond to the states of the DFA, taken from Q.
The columns correspond to the input symbols from Σ.
If q is the current state of the DFA and a is the current input symbol, the value δ(q, a) is the next state of the DFA and is entered in row q and column a.
The start state is marked with an arrow.
The final state is marked with a star.

State    0     1
->q0     q0    q1
*q1      q1    q1
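A transition table translates directly into a lookup structure. The Python sketch below assumes the table above reads δ(q0, 0) = q0, δ(q0, 1) = q1, δ(q1, 0) = q1, δ(q1, 1) = q1, with q0 as the start state and q1 as the only final state. The names `DELTA`, `START`, `FINAL`, and `accepts` are introduced for this example.

```python
# Transition function as a dictionary keyed by (state, input symbol).
DELTA = {
    ("q0", "0"): "q0", ("q0", "1"): "q1",
    ("q1", "0"): "q1", ("q1", "1"): "q1",
}
START = "q0"
FINAL = {"q1"}


def accepts(s):
    """Run the DFA on s and report whether it ends in a final state."""
    state = START
    for symbol in s:
        if (state, symbol) not in DELTA:
            return False             # symbol outside the alphabet {0, 1}
        state = DELTA[(state, symbol)]
    return state in FINAL


with_one = accepts("0010")
without_one = accepts("000")
```

Under the assumed reading of the table, the machine moves to q1 on the first 1 and stays there, so it accepts exactly the binary strings that contain at least one 1.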