Вы находитесь на странице: 1из 23

LEX/FLEX AND YACC/BISON

OVERVIEW

Department of Computer Science


Mekelle University
General Lex/Flex Information
• lex
 is a tool to generate lexical analyzers.
 It was written by Mike Lesk and Eric Schmidt (the
Google guy).
 divides a stream of input characters into meaningful units
(lexemes), identifies them (token) and may pass the token
to a parser generator, yacc
 lex specifications are regular expressions
• flex (fast lexical analyzer generator)
–Free and open source alternative.
–You’ll be using this.
General Yacc/Bison Information
• Yacc (yet another compiler compiler)
–Is a tool to generate parsers (syntactic analyzers).
–Generated parsers require a lexical analyzer.
–It isn’t used anymore.
• Bison
–Free and open source alternative.
–You’ll be using this.
Lex/Flex and Yacc/Bison relation to a
compiler tool chain
FLEX IN DETAIL
How Flex Works
• Flex uses a .l spec file to generate a tokenizer/scanner.

• The tokenizer reads an input file and chunks it into a


series of tokens which are passed to the parser.
• Flex is a program that automatically creates a scanner
in C, using rules for tokens as regular expressions.
Internal Structure of Lex/Flex

Regular
NFA DFA Minimal DFA
expressions

The final states of the DFA are associated with actions


Flex/Lex Structure
• Format of the input file is like …
Definitions section (1)
• There are three things that can go in the definitions section:
• C code Any indented code between %{ and %} is copied to the
C file. This is typically used for defining file variables, and for
prototypes of routines that are defined in the code segment.
• definitions A definition is very much like a #define cpp
directive. For example
letter [a-zA-Z]
digit [0-9]
punct [,.:;!?]
nonblank [ˆ \t]
These definitions can be used in the rules section: one could start
a rule {letter}+ {...
Definitions section (2)
• State definitions If a rule depends on context, it’s possible to
introduce states and incorporate those in the rules. A state
definition looks like %s STATE, and by default a state
INITIAL is already given.
Rules section
• The rules section has a number of pattern-action
pairs.
• The patterns are regular expressions and the
actions are either a single C command, or a
sequence enclosed in braces.
• Example:
RE Action
\n linenum++;
[0-9]+ printf(“integer”);
[a-zA-Z] printf(“letter”);
Lex/Flex Regular Expression (1)
• Regular Expression contains:
– text characters (which match the corresponding characters in the strings
being compared) and
– operator characters (which specify repetitions, choices, and other
features).
• Text characters: the letters of the alphabet and the digits are
always text characters.
• Operator Characters: the operator characters are:
“\[]^-?.*+|()$/{}%<>
• and, if they are to be used as text characters, an escape (\) should be used.
• The quotation mark operator (“) indicates that whatever is contained
between a pair of quotes is to be taken as text character.
Lex/Flex Regular Expression (2)
• Character classes:
• Classes of characters can be specified using the operator pair [].
• The construction [abc] matches a single character, which may be
a, b, c.
• Within square brackets, most operator meanings are ignored.
Only three characters are special: these are \ - and ^
• The – character indicates ranges, for example, [a-z0-9]
(i.e., it indicates the character class containing all the lower case
letters and digits)
• If it is desired to include the character - in a character class, it
should be first or last; Ex: [-+0-9] matches all digits and two
signs.
Lex/Flex Regular Expression (3)
• In character classes, the ^ operator must appear as the first
character after the left bracket;
• It indicates that the resulting string is to be complemented with
respect to the computer character set.
• Example:
• [^abc] matches all characters except a, b, or c
• [^a-zA-Z] matches any character which is not a letter.
• The \ character provides the usual escapes within character
class brackets.
Lex/Flex Regular Expression (4)
• Arbitrary character:
• To match almost any character, the operator character. (dot)
is the class of all characters except newline.
• Optional Expression:
• The operator ? Indicates an optional element of an
expression. Ex: ab?c matches either ac or abc
• Repeated Expressions:
• Repetitions of classes are indicated by the operators * and +
• Ex: a* is any number of consecutive a characters
including zero; while a+ is one or more instance of a.
Lex/Flex Regular Expression (5)
• Example:
• [a-z]+ is all strings of lower case letters
• [A-Z][a-z]+ indicates strings with a first upper case
letter followed by any number of lower case letters.
• Alternation and Grouping:
• The operator | indicates alternation:
• Ex: (ab|cd) matches either ab or cd.
• Ex: (ab | cd+)?(ef)* matches such strings as abefef, efefef,
cdef, or cddd; but not abc, abcd, or abcdef
Lex/Flex Regular Expression (6)
• Repetition and Definitions:
• The operators { } specify either repetitions (if they enclose
number) or definition expression (if they enclose a name)
• Ex: {digit} looks for a predefined string named digit and
inserts it at that point in the expression. The definitions are
given in the first part of the Lex input, before the rules.
• In contrast, a{1, 5} looks for 1 to 5 occurrences of a.
Example
Pattern Meaning
c The char “c”
“c” The char “c” even if it is a special char in this table
\c Same as “c”, used to quote a single char
[cd] The char c or the char d
[a-z] Any single char in the range a through z
[^c] Any char but c
. Any char but newline
^x The pattern x if it occurs at the beginning of a line
x$ The pattern x at the end of a line
x? An optional x
Example
x* Zero or more occurrences of the pattern x
x+ One or more occurrences of the pattern x
xy The pattern x concatenated with the pattern y
x|y An x or a y
(x) An x
x/y An x only if followed by y
<S>x The pattern x when lex is in start condition S
{name} The value of a macro from definitions section
x{m} m occurrences of the pattern x
x{m,n} m through n occurrences of x (takes precedence over
concatenation)
Flex/Lex Predefined Variables
Name function
int yylex(void) call to invoke lexer, returns token
char *yytext pointer to matched string
yyleng length of matched string
yylval value associated with token
int yywrap(void) wrapup, return 1 if done, 0 if not done
FILE *yyout output file
FILE *yyin input file
INITIAL initial start condition
BEGIN condition switch start condition
ECHO write matched string
Flex/Lex Action
• When an expression written matched, Lex/Flex executes the
corresponding actions.
• Example:
%%
[a-z]+ printf (“alpha\n”);
[0-9]+ printf (“numeric\n”);
[a-z0-9]+ printf (“alphanumeric\n”);
[ \t]+ printf (“white space\n”);
. printf (“special char\n”);
\n ;
%%
Disambiguation Rules
• If there are several patterns which match the current input,
yylex() chooses one of them according to these rules:

1. The longest match is preferred.


2. Among rules that match the same number of characters, the
rule that occurs earliest in the list is preferred.
Example
• Show the output if the input to yylex() generated by the lex
program above is abc123 abc 123?x
• Solution:
alphanumeric
white space
alpha
white space
numeric
special char
alpha

Вам также может понравиться