Lexical Analysis

Lexical Analysis
Lexical Analyzer in Perspective
source program
lexical analyzer
token parser get next token
symbol table
Important Issue:
What are Responsibilities of each Box ? Focus on Lexical Analyzer and Parser.
Role of the Lexical Analyzer

Identify the words: Lexical Analysis
Converts a stream of characters (input program) into a stream of tokens.
Also called Scanning or Tokenizing
Identify the sentences: Parsing.

Derive the structure of sentences: construct parse trees from a stream of tokens.
Next_char() Next_token()
Input
character
Scanner
token
Parser
Symbol Table
Interaction of Lexical Analyzer with Parser

Next_char() Next_token()
Input
character
Scanner
token
Parser
Symbol Table
Often a subroutine of the parser Secondary tasks of Lexical Analyzer

Strip out comments and white spaces from the source Correlate error messages with the source program Preprocessing may be implemented as lexical analysis takes place
Issues in lexical analysis

Simplicity/Modularity: Conventions about words" are often different from conventions about sentences". Efficiency: Word identification problem has a much more efficient solution than sentence identification problem. Specialized buffering techniques for reading input characters and processing tokens.
Portability: Input alphabet peculiarities and other device-specific anomalies can be restricted to the lexical analyzer.
Introducing Basic Terminology

What are Major Terms for Lexical Analysis?
TOKEN
A classification for a common set of strings

e.g., tok_integer_constant
PATTERN
The rules which characterize the set of strings for a token

e.g., digit followed by zero or more digits
LEXEME
Actual sequence of characters that matches pattern and is classified by a token

e.g., 32894
Introducing Basic Terminology

Token
const if relation id if <, <=, =, < >, >, >= pi, count, D2
Sample Lexemes
const
Informal Description of Pattern

const if < or <= or = or < > or >= or > letter followed by letters and digits
num
literal
3.1416, 0, 6.02E23
core dumped
any numeric constant

any characters between and except
Token Stream
Tokens are terminal symbol in the grammar for the source language keywords, operators, identifiers, constants, literal strings, punctuation symbols etc. are treated as tokens Source:
if ( x == -3.1415 ) /* test x */ then ...
Token Stream:
< IF > < LPAREN > < ID, x > < EQUALS > < NUM, -3.14150000 > < RPAREN > < THEN > ...
Attributes for Tokens

Tokens influence parsing decision; the attributes influence the translation of tokens. A token usually has only a single attribute
A pointer to the symbol-table entry Other attributes (e.g. line number, lexeme) can be stored in symbol table
Example:
E = M * C ** 2
<id, pointer to symbol-table entry for E>

<assign_op, > <id, pointer to symbol-table entry for M> <mult_op, > <id, pointer to symbol-table entry for C> <exp_op, > <num, integer value 2>
Lexical Error
Few errors can be caught by the lexical analyzer
Most errors tend to be typos Not noticed by the programmer
return 1.23; retunn 1,23;
... Still results in sequence of legal tokens

<ID, retunn> <INT,1> <COMMA> <INT,23> <SEMICOLON>
No lexical error, but problems during parsing! Another example: fi (a == f(x))
In what Situations do Errors Occur?

Lexical analyzer is unable to proceed because none of the patterns for tokens matches a prefix of remaining input.
Recovery from lexical errors

Panic mode recovery
Delete successive characters from the input until the lexical analyzer can find a well-formed token May confuse the parser
The parser will detect syntax errors and get straightened out (hopefully!)
Other possible error-recovery actions

Deleting an extra character Inserting a missing character Replacing an incorrect character by a correct character Swapping two adjacent character
Managing Input Buffers

Option 1: Read one char from OS at a time. Option 2: Read N characters per system call
e.g., N = 4096
Manage input buffers in Lexer

More efficient
Lexical analyzer needs to look ahead several characters beyond the lexeme for a pattern before a match can be announced
.. E = M * C * * 2 ..
forward lexeme_beginning
But! Due to look ahead we need to push back the lookahead characters
Need specialized buffer managing technique to improve efficiency
Buffer Pairs
Token could overlap / span buffer boundaries
.. .. .. .. .. .. .. .. .. .. .. .. .. .. 1 2 . 12.46 4 6 .. .. .. .. .. .. .. .. .. .. .. .. .. .. ..
Solution: Use a paired buffer of N characters each

N-characters N-characters
.. ..
C * * 2 \0
Deficiency:
Code:
if (forward at end of buffer1) then relaod buffer2; forward = forward +1; else if (forward at end of buffer2) then relaod first half; move forward to the beginning of the buffer1 else forward = forward +1
Sentinels
Technique: Use Sentinels to reduce testing Choose some character that occurs rarely in most inputs e.g. \0
N-characters N-characters
.. ..
M * \0
C * * 2 \0
\0
forward++; if *forward == \0 then if forward at end of buffer #1 then Read next N bytes into buffer #2; forward = address of first char of buffer #2; elseIf forward at end of buffer #2 then Read next N bytes into buffer #1; forward = address of first char of buffer #1; else // do nothing; a real \0 occurs in the input endIf endIf

Lexical Analysis

Загружено:

Сведения о документе

Исходное описание:

Авторское право

Доступные форматы

Поделиться этим документом

Поделиться или встроить документ

Параметры публикации

Этот документ был вам полезен?

Это неприемлемый материал?

Авторское право:

Доступные форматы

Lexical Analysis

Загружено:

Авторское право:

Доступные форматы

Lexical Analysis

Lexical Analyzer in Perspective

token parser get next token

Role of the Lexical Analyzer

Identify the sentences: Parsing.

Interaction of Lexical Analyzer with Parser

Often a subroutine of the parser Secondary tasks of Lexical Analyzer

Issues in lexical analysis

Introducing Basic Terminology

A classification for a common set of strings

The rules which characterize the set of strings for a token

Actual sequence of characters that matches pattern and is classified by a token

Introducing Basic Terminology

Informal Description of Pattern

any numeric constant

Attributes for Tokens

<id, pointer to symbol-table entry for E>

... Still results in sequence of legal tokens

No lexical error, but problems during parsing! Another example: fi (a == f(x))

In what Situations do Errors Occur?

Recovery from lexical errors

Other possible error-recovery actions

Managing Input Buffers

Manage input buffers in Lexer

Solution: Use a paired buffer of N characters each

Вам также может понравиться