Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt

ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular
expressions in the input stream. It is well suited for editor-script type transformations
and for segmenting input in preparation for a parsing routine.

Lex source is a table of regular expressions and corresponding program fragments.
The table is translated to a program which reads an input stream, copying it to an output
stream and partitioning the input into strings which match the given expressions. As
each such string is recognized the corresponding program fragment is executed. The
recognition of the expressions is performed by a deterministic finite automaton generated
by Lex. The program fragments written by the user are executed in the order in which
the corresponding regular expressions occur in the input stream.

The lexical analysis programs written with Lex accept ambiguous specifications
and choose the longest match possible at each input point. If necessary, substantial
lookahead is performed on the input, but the input stream will be backed up to the end of
the current partition, so that the user has general freedom to manipulate it.

Lex can generate analyzers in either C or Ratfor, a language which can be
translated automatically to portable Fortran. It is available on the PDP-11 UNIX,
Honeywell GCOS, and IBM OS systems. This manual, however, will only discuss
generating analyzers in C on the UNIX system, which is the only supported form of Lex
under UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with
access to this compiler-compiler system.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams.
It accepts a high-level, problem oriented specification for character string matching, and
produces a program in a general purpose language which recognizes regular expressions.
The regular expressions are specified by the user in the source specifications given to
Lex. The Lex written code recognizes these expressions in an input stream and partitions
the input stream into strings matching the expressions. At the boundaries between strings
program sections provided by the user are executed. The Lex source file associates the
regular expressions and the program fragments. As each expression appears in the input to
the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete
his tasks, possibly including code written by other generators. The program that
recognizes the expressions is generated in the general purpose programming language
employed for the user's program fragments. Thus, a high level expression language is
provided to write the string expressions to be matched while the user's freedom to write
actions is unimpaired. This avoids forcing the user who wishes to use a string
manipulation language for input analysis to write processing programs in the same and
often inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language
feature which can be added to different programming languages, called "host languages."
Just as general purpose languages can produce code to run
on different computer hardware, Lex can write code in different host languages. The host
language is used for the output code generated by Lex and also for the program fragments
added by the user. Compatible run-time libraries for the different host languages are also
provided. This makes Lex adaptable to different environments and different users. Each
application may be directed to the combination of hardware and host language appropriate
to the task, the user's background, and the properties of local implementations. At
present, the only supported host language is C, although Fortran (in the form of
Ratfor [2]) has been available in the past. Lex itself exists on UNIX, GCOS, and OS/370;
but the code generated by Lex may be taken anywhere the appropriate compilers exist.

Lex turns the user's expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will
recognize expressions in a stream (called input in this memo) and perform the specified
actions for each expression as it is detected. See Figure 1.

        Source  ->  Lex    ->  yylex

        Input   ->  yylex  ->  Output

                An overview of Lex
                     Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at
the ends of lines.

    %%
    [ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the
rules, and one rule. This rule contains a regular expression which matches one or more
instances of the characters blank or tab (written \t for visibility, in accordance with
the C language convention) just prior to the end of a line. The brackets indicate the
character class made of blank and tab; the + indicates "one or more ..."; and the $
indicates "end of line," as in QED. No action is specified, so the program generated by
Lex (yylex) will ignore these characters. Everything else will be copied. To change any
remaining string of blanks or tabs to a single blank, add another rule:

    %%
    [ \t]+$   ;
    [ \t]+    printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing
at the termination of the string of blanks or tabs whether or not there is a newline
character, and executing the desired rule action. The first rule matches all strings of
blanks or tabs at the end of lines, and the second rule all remaining strings of blanks or
tabs.

Lex can be used alone for simple transformations, or for analysis and statistics
gathering on a lexical level. Lex can also be used with a parser generator to perform the
lexical analysis phase; it is particularly easy to interface Lex and Yacc [3]. Lex
programs recognize only regular expressions; Yacc writes parsers that accept a large class
of context free grammars, but require a lower level analyzer to recognize input tokens.
Thus, a combination of Lex and Yacc is often appropriate. When used as a preprocessor for
a later parser generator, Lex is used to partition the input stream, and the parser
generator assigns structure to the resulting pieces. The flow of control in such a case
(which might be the first half of a compiler, for example) is shown in Figure 2.
Additional programs, written by other generators or by hand, can be added easily to
programs written by Lex.

        lexical rules         grammar rules
              |                     |
              v                     v
             Lex                  Yacc
              |                     |
              v                     v
    Input -> yylex      ->       yyparse -> Parsed input

                  Lex with Yacc
                     Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to
be named, so that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the
source [4]. The automaton is interpreted, rather than compiled, in order to save space.
The result is still a fast analyzer. In particular, the time taken by a Lex program to
recognize and partition an input stream is proportional to the length of the input. The
number of Lex rules or the complexity of the rules is not important in determining speed,
unless rules which include forward context require a significant amount of rescanning.
What does increase with the number and complexity of rules is the size of the finite
automaton, and therefore the size of the program generated by Lex.
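
The point about speed can be illustrated with a minimal table-driven scanner in C. This is
only a sketch under stated assumptions, not the code Lex generates: the state names, the
next_state table, and the function scan_token are all hypothetical, and the two patterns
(an identifier and an integer) were chosen for brevity. Each input character costs one
table lookup, so adding patterns enlarges the table without slowing the loop.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Character classes: the automaton looks at classes, not raw characters. */
enum { C_LETTER, C_DIGIT, C_OTHER, N_CLASSES };

/* States of the hypothetical automaton; -1 in the table means "no move". */
enum { S_START, S_IDENT, S_INT, N_STATES };

/* Token kinds reported by accepting states. */
enum { T_NONE, T_IDENT, T_INT };

static const int next_state[N_STATES][N_CLASSES] = {
    /*  letter    digit     other */
    {  S_IDENT,  S_INT,    -1 },   /* S_START */
    {  S_IDENT,  S_IDENT,  -1 },   /* S_IDENT: [A-Za-z][A-Za-z0-9]* */
    {  -1,       S_INT,    -1 },   /* S_INT:   [0-9]+               */
};

static const int accepts[N_STATES] = { T_NONE, T_IDENT, T_INT };

static int char_class(unsigned char c)
{
    if (isalpha(c)) return C_LETTER;
    if (isdigit(c)) return C_DIGIT;
    return C_OTHER;
}

/* Scan the longest token at the start of s. Returns the token kind and
   stores its length in *len (0 when nothing matches). */
int scan_token(const char *s, size_t *len)
{
    int state = S_START, kind = T_NONE;
    size_t i = 0;

    assert(s != NULL && len != NULL);
    *len = 0;
    while (s[i] != '\0') {
        state = next_state[state][char_class((unsigned char)s[i])];
        if (state < 0)
            break;                      /* no transition: stop scanning */
        i++;
        if (accepts[state] != T_NONE) { /* remember the last accepting point */
            kind = accepts[state];
            *len = i;
        }
    }
    return kind;
}
```

Calling scan_token on the text abc12+x reports an identifier of length 5. The work done is
one lookup per character consumed, whatever the number of rows in the table.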

In the program written by Lex, the user's fragments (representing the actions to be
performed as each regular expression is found) are gathered as cases of a switch. The
automaton interpreter directs the control flow. Opportunity is provided for the user to
insert either declarations or additional statements in the routine containing the actions,
or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character
lookahead. For example, if there are two rules, one looking for ab and another for
abcdefg, and the input stream is abcdefh, Lex will recognize ab and leave the input
pointer just before cd... Such backup is more costly than the processing of simpler
languages.

2. Lex Source.

The general format of Lex source is:

    {definitions}
    %%
    {rules}
    %%
    {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is
optional, but the first is required to mark the beginning of the rules. The absolute
minimum Lex program is thus

    %%

(no definitions, no rules) which translates into a program which copies the input to the
output unchanged.

In the outline of Lex programs shown above, the rules represent the user's control
decisions; they are a table, in which the left column contains regular expressions (see
section 3) and the right column contains actions, program fragments to be executed when
the expressions are recognized. Thus an individual rule might appear

    integer   printf("found keyword INT");

to look for the string integer in the input stream and print the message "found keyword
INT" whenever it appears. In this example the host procedural language is C and the C
library function printf is used to print the string. The end of the expression is
indicated by the first blank or tab character. If the action is merely a single C
expression, it can just be given on the right side of the line; if it is compound, or
takes more than a line, it should be enclosed in braces. As a slightly more useful
example, suppose it is desired to change a number of words from British to American
spelling. Lex rules such as

    colour      printf("color");
    mechanise   printf("mechanize");
    petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become
gaseum; a way of dealing with this will be described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular
expression specifies a set of strings to be matched. It contains text characters (which
match the corresponding characters in the strings being compared) and operator characters
(which specify repetitions, choices, and other features). The letters of the alphabet and
the digits are always text characters; thus the regular expression

    integer

matches the string integer wherever it appears and the expression

    a57D

looks for the string a57D.

Operators. The operator characters are

    " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation
mark operator (") indicates that whatever is contained between a pair of quotes is to be
taken as text characters. Thus

    xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It
is harmless but unnecessary to quote an ordinary text character; the expression

    "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being used
as a text character, the user can avoid remembering the list above of current operator
characters, and is safe should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as
in

    xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the
quoting mechanism is to get a blank into an expression; normally, as explained above,
blanks or tabs end a rule. Any blank character not contained within [ ] (see below) must
be quoted. Several normal C escapes with \ are recognized: \n is newline, \t is
tab, and \b is backspace. To enter \ itself, use \\. Since newline is illegal in an
expression, \n must be used; it is not required to escape tab and backspace. Every
character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair [ ].
The construction [abc] matches a single character, which may be a, b, or c. Within square
brackets, most operator meanings are ignored. Only three characters are special: these
are \, - and ^. The - character indicates ranges. For example,

    [a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle
brackets, and underline. Ranges may be given in either order. Using - between any pair of
characters which are not both upper case letters, both lower case letters, or both digits
is implementation dependent and will get a warning message. (E.g., [0-z] in ASCII is many
more characters than it is in EBCDIC). If it is desired to include the character - in a
character class, it should be first or last; thus

    [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left
bracket; it indicates that the resulting string is to be complemented with respect to the
computer character set. Thus

    [^abc]

matches all characters except a, b, or c, including all special or control characters; or

    [^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes within
character class brackets.

Arbitrary character. To match almost any character, the operator character

    .

is the class of all characters except newline. Escaping into octal is possible although
non-portable:

    [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to
octal 176 (tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

    ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

    a*

is any number of consecutive a characters, including zero; while

    a+

is one or more instances of a. For example,

    [a-z]+

is all strings of lower case letters. And

    [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a typical
expression for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

    (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are
not necessary on the outside level;

    ab|cd

would have sufficed. Parentheses can be used for more complex expressions:

    (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding context. The two
simplest operators for this are ^ and $. If the first character of an expression is ^, the
expression will only be matched at the beginning of a line (after a newline character, or
at the beginning of the input stream). This can never conflict with the other meaning
of ^, complementation of character classes, since that only applies within the [ ]
operators. If the very last character is $, the expression will only be matched at the end
of a line (when immediately followed by newline). The latter operator is a special case of
the / operator character, which indicates trailing context. The expression

    ab/cd

matches the string ab, but only if followed by cd. Thus

    ab$

is the same as

    ab/\n

Left context is handled in Lex by start conditions as explained in section 10. If a rule
is only to be executed when the Lex automaton interpreter is in start condition x, the
rule should be prefixed by

    <x>

using the angle bracket operator characters. If we considered "being at the beginning of a
line" to be start condition ONE, then the ^ operator would
be equivalent to

    <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose
numbers) or definition expansion (if they enclose a name). For example

    {digit}

looks for a predefined string named digit and inserts it at that point in the expression.
The definitions are given in the first part of the Lex input, before the rules. In
contrast,

    a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source segments.

4. Lex Actions.

When an expression written as above is matched, Lex executes the corresponding action.
This section describes some features of Lex which aid in writing actions. Note that there
is a default action, which consists of copying the input to the output. This is performed
on all strings not otherwise matched. Thus the Lex user who wishes to absorb the entire
input, without producing any output, must provide rules to match everything. When Lex is
being used with Yacc, this is the normal situation. One may consider that actions are what
is done instead of copying the input to the output; thus, in general, a rule which merely
copies can be omitted. Also, a character combination which is omitted from the rules and
which appears as input is likely to be printed on the output, thus calling attention to
the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null
statement, ; as an action causes this result. A frequent rule is

    [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that
the action for this rule is the action for the next rule. The previous example could also
have been written

    " "    |
    "\t"   |
    "\n"   ;

with the same result, although in different style. The quotes around \n and \t are not
required.

In more complex actions, the user will often want to know the actual text that matched
some expression like [a-z]+. Lex leaves this text in an external character array named
yytext. Thus, to print the name found, a rule like

    [a-z]+   printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data
to be printed; in this case, the format is "print string" (% indicating data conversion,
and s indicating string type), and the data are the characters in yytext. So this just
places the matched string on the output. This action is so common that it may be written
as ECHO:

    [a-z]+   ECHO;

is the same as the above. Since the default action is just to print the characters found,
one might ask why give a rule, like this one, which merely specifies the default action?
Such rules are often required to avoid matching some other rule which is not desired. For
example, if there is a rule which matches read it will normally match the instances of
read contained in bread or readjust; to avoid this, a rule of the form [a-z]+ is needed.
This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also
provides a count yyleng of the number of characters matched. To count both the number of
words and the number of characters in words in the input, the user might write

    [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words recognized. The last
character in the string matched can be accessed by

    yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized the correct span of
characters. Two routines are provided to aid with this situation. First, yymore() can be
called to indicate that the next input expression recognized is to be tacked on to the end
of this input. Normally, the next input string would overwrite the current entry in
yytext. Second, yyless(n) may be called to indicate that not all the characters matched by
the currently successful expression are wanted right now. The argument n indicates the
number of characters in yytext to be retained. Further characters previously matched are
returned to the input. This provides the same sort of lookahead offered by the / operator,
but in a different form.

Example: Consider a language which defines a string as a set of characters between
quotation (") marks, and provides that to include a " in a string it must be preceded by
a \. The regular expression which matches that is somewhat confusing, so that it might be
preferable to write

    \"[^"]*   {
              if (yytext[yyleng-1] == '\\')
                  yymore();
              else
                  ... normal user processing
              }

which will, when faced with a string such as "abc\"def" first match the five characters
"abc\ ; then the call to yymore() will cause the next part of the string, "def , to be
tacked on the end. Note that the final quote terminating the string should be picked up in
the code labeled "normal processing".

The function yyless() might be used to reprocess text in various circumstances. Consider
the C problem of distinguishing the ambiguity of "=-a". Suppose it is desired to treat
this as "=- a" but print a message. A rule might be

    =-[a-zA-Z]   {
                 printf("Op (=-) ambiguous\n");
                 yyless(yyleng-1);
                 ... action for =- ...
                 }

which prints a message, returns the letter after the operator to the input stream, and
treats the operator as "=-". Alternatively it might be desired to treat this as "= -a". To
do this, just return the minus sign as well as the letter to the input:

    =-[a-zA-Z]   {
                 printf("Op (=-) ambiguous\n");
                 yyless(yyleng-2);
                 ... action for = ...
                 }

will perform the other interpretation. Note that the expressions for the two cases might
more easily be written

    =-/[A-Za-z]

in the first case and

    =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to
recognize the whole identifier to observe the ambiguity. The possibility of "=-3",
however, makes

    =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They
are:

1) input() which returns the next input character;

2) output(c) which writes the character c on the output; and

3) unput(c) pushes the character c back onto the input stream to be read later by
   input().

By default these routines are provided as macro definitions, but the user can override
them and supply private versions. These routines define the relationship between external
files and internal characters, and must all be retained or modified consistently. They may
be redefined, to cause input or output to be transmitted to or from strange places,
including other programs or internal memory; but the character set used must be consistent
in all routines; a value of zero returned by input must mean end of file; and the
relationship between unput and input must be retained or the Lex lookahead will not work.
Lex does not look ahead at all if it does not have to, but every rule ending in + * ? or $
or containing / implies lookahead. Lookahead is also necessary to match an expression that
is a prefix of another expression. See below for a discussion of the character set used by
Lex. The standard Lex library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap()
which is called whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues
with the normal wrapup on end of input. Sometimes, however, it is convenient to arrange
for more input to arrive from a new source. In this case, the user should provide a yywrap
which arranges for new input and returns 0. This instructs Lex to continue processing. The
default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a
program. Note that it is not possible to write a normal rule which recognizes end-of-file;
the only access to this condition is through yywrap. In fact, unless a private version of
input() is supplied a file containing nulls cannot be handled, since a value of 0 returned
by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the
current input, Lex chooses as follows:

1) The longest match is preferred.

2) Among rules which matched the same number of characters, the rule given first is
   preferred.
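
The two disambiguation rules can be imitated in a few lines of C. This is a minimal
sketch, not how Lex works internally: Lex uses a single automaton, whereas here each
"rule" is a separate hypothetical matcher function. The names match_keyword_integer,
match_lowercase_word and choose_rule are invented for the illustration, and the character
comparisons assume an ASCII-like character set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Each hypothetical "rule" reports how many leading characters of s it
   matches (0 means no match). */
typedef size_t (*matcher)(const char *s);

/* Stands in for the literal pattern: integer */
static size_t match_keyword_integer(const char *s)
{
    return strncmp(s, "integer", 7) == 0 ? 7 : 0;
}

/* Stands in for the pattern: [a-z]+  (assumes contiguous lower case letters) */
static size_t match_lowercase_word(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Rules listed in source order: keyword first, identifier second. */
enum { RULE_KEYWORD, RULE_IDENTIFIER, N_RULES };
static const matcher rules[N_RULES] = {
    match_keyword_integer,
    match_lowercase_word
};

/* Return the index of the selected rule, or -1 if no rule matches.
   The length of the selected match is stored in *len. */
int choose_rule(const char *s, size_t *len)
{
    int best = -1;

    assert(s != NULL && len != NULL);
    *len = 0;
    for (int i = 0; i < N_RULES; i++) {
        size_t n = rules[i](s);
        if (n > *len) {   /* strictly longer wins; a tie keeps the earlier rule */
            *len = n;
            best = i;
        }
    }
    return best;
}
```

The strict comparison is what implements the second rule: a later matcher replaces the
current choice only when it matches more characters, so equal lengths leave the rule
given first in place.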

Thus, suppose the rules

    integer   keyword action ...;
    [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because
[a-z]+ matches 8 characters while integer matches only 7. If the input is integer, both
rules match 7 characters, and the keyword rule is selected because it was given first.
Anything shorter (e.g. int) will not match the expression integer and so the identifier
interpretation is used.

The principle of preferring the longest match makes rules containing expressions like .*
dangerous. For example,

    '.*'

might seem a good way of recognizing a string in single quotes. But it is an invitation
for the program to read far ahead, looking for a distant single quote. Presented with the
input

    'first' quoted string here, 'second' here

the above expression will match

    'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

    '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this
are mitigated by the fact that the . operator will not match newline. Thus expressions
like .* stop on the current line. Don't try to defeat this with expressions like (.|\n)+
or equivalents; the Lex generated program will try to read the entire input file, causing
internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible
matches of each expression. This means that each character is accounted for once and only
once. For example, suppose it is desired to count occurrences of both she and he in an
input text. Some Lex rules to do this might be

    she   s++;
    he    h++;
    \n    |
    .     ;

where the last two rules ignore everything besides he and she. Remember that . does not
include newline. Since she includes he, Lex will normally not recognize the instances of
he included in she, since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means "go do the
next alternative." It causes whatever rule was second choice after the current rule to be
executed. The position of the input pointer is adjusted accordingly. Suppose the user
really wants to count the included instances of he:

    she   {s++; REJECT;}
    he    {h++; REJECT;}
    \n    |
    .     ;

these rules are one way of changing the previous example to do just that. After counting
each expression, it is rejected; whenever appropriate, the other expression will then be
counted. In this example, of course, the user could note that she includes he but not vice
versa, and omit the REJECT action on he; in other cases, however, it would not be possible
a priori to tell which input characters were in both classes.

Consider the two rules

    a[bc]+   { ... ; REJECT;}
    a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The
input string accb matches the first rule for four characters and then the second rule for
three characters. In contrast, the input accd agrees with the second rule for four
characters and then the first rule for three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input
stream but to detect all examples of some items in the input, and the instances of these
items may overlap or include each other. Suppose a digram table of the input is desired;
normally the digrams overlap, that is the word the is considered to contain both th and
he. Assuming a two-dimensional array named digram to be incremented, the appropriate
source is

    %%
    [a-z][a-z]   {
                 digram[yytext[0]][yytext[1]]++;
                 REJECT;
                 }
    .            ;
    \n           ;

where the REJECT is necessary to pick up a letter pair beginning at every character,
rather than at every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

    {definitions}
    %%
    {rules}
    %%
    {user routines}

So far only the rules have been described. The
user needs additional options, though, to define %%


variables for use in his program and for use by {D}+ printf("integer");
Lex. These can go either in the definitions sec- {D}+"."{D}({E})? |
tion or in the rules section. {D}"."{D}+({E})? |
Remember that Lex is turning the rules into {D}+{E}
a program. Any source not intercepted by Lex is Note the first two rules for real numbers; both
copied into the generated program. There are require a decimal point and contain an optional
three classes of such things. exponent field, but the first requires at least one
digit before the decimal point and the second
1) Any line which is not part of a Lex rule or requires at least one digit after the decimal point.
action which begins with a blank or tab is To correctly handle the problem posed by a For-
copied into the Lex generated program. tran expression such as 35.EQ.I , which does not
Such source input prior to the first %% del- contain a real number, a context-sensitive rule
imiter will be external to any function in such as
the code; if it appears immediately after the [09]+/"."EQ printf("integer");
first %%, it appears in an appropriate place could be used in addition to the normal rule for
for declarations in the function written by integers.
Lex which contains the actions. This
material must look like program fragments, The definitions section may also contain
and should precede the first Lex rule. other commands, including the selection of a host
language, a character set table, a list of start con-
As a side effect of the above, lines which ditions, or adjustments to the default size of
begin with a blank or tab, and which con- arrays within Lex itself for larger source pro-
    tain a comment, are passed through to the generated program. This can
    be used to include comments in either the Lex source or the generated
    code. The comments should follow the host language convention.

2)  Anything included between lines containing only %{ and %} is copied out
    as above. The delimiters are discarded. This format permits entering
    text like preprocessor statements that must begin in column 1, or
    copying lines that do not look like programs.

3)  Anything after the second %% delimiter, regardless of formats, etc., is
    copied out after the Lex output.

Definitions intended for Lex are given before the first %% delimiter. Any
line in this section not contained between %{ and %}, and beginning in
column 1, is assumed to define Lex substitution strings. The format of such
lines is

        name translation

and it causes the string given as a translation to be associated with the
name. The name and translation must be separated by at least one blank or
tab, and the name must begin with a letter. The translation can then be
called out by the {name} syntax in a rule. Using {D} for the digits and {E}
for an exponent field, for example, might abbreviate rules to recognize
numbers:

        D       [0-9]
        E       [DEde][-+]?{D}+

grams. These possibilities are discussed below under Summary of Source
Format, section 12.

7. Usage.

There are two steps in compiling a Lex source program. First, the Lex
source must be turned into a generated program in the host general purpose
language. Then this program must be compiled and loaded, usually with a
library of Lex subroutines. The generated program is on a file named
lex.yy.c. The I/O library is defined in terms of the C standard library [6].

The C programs generated by Lex are slightly different on OS/370, because
the OS compiler is less powerful than the UNIX or GCOS compilers, and does
less at compile time. C programs generated on GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an appropriate set
of commands is

        lex source
        cc lex.yy.c -ll

The resulting program is placed on the usual file a.out for later
execution. To use Lex with Yacc see below. Although the default Lex I/O
routines use the C standard library, the Lex automata themselves do not do
so; if private versions of input, output and unput are given, the library
can be avoided.
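
The remark about private versions of input, output and unput can be made
concrete with a small sketch. Everything in it is an assumption made for
illustration: the names my_input, my_unput, my_output and set_source and
the buffer sizes are hypothetical, and how such routines are substituted
for the defaults depends on the Lex version in use. The sketch reads from
an in-memory string, keeps a small pushback stack so that characters given
to unput are read again first, and assumes the convention that a zero
return from input means end of file.

```c
#include <stddef.h>

#define PUSHBACK_MAX 64            /* hypothetical limit on pushed-back chars */
#define OUT_MAX      256           /* hypothetical size of the output buffer  */

static const char *src_text = "";  /* text being scanned                      */
static size_t      src_pos  = 0;   /* next unread position in src_text        */
static char        pushback[PUSHBACK_MAX];
static int         pushback_len = 0;
static char        out_buf[OUT_MAX];
static size_t      out_len = 0;

/* Point the private I/O layer at a new string and reset all state. */
void set_source(const char *text)
{
    src_text = text;
    src_pos = 0;
    pushback_len = 0;
    out_len = 0;
    out_buf[0] = '\0';
}

/* Return the next character: pushed-back characters first (most recent
   first), then the source string; 0 signals end of input. */
int my_input(void)
{
    if (pushback_len > 0)
        return (unsigned char)pushback[--pushback_len];
    if (src_text[src_pos] == '\0')
        return 0;
    return (unsigned char)src_text[src_pos++];
}

/* Push a character back so that the next my_input() call returns it.
   Characters beyond PUSHBACK_MAX are silently dropped in this sketch. */
void my_unput(int c)
{
    if (pushback_len < PUSHBACK_MAX)
        pushback[pushback_len++] = (char)c;
}

/* Append a character to the in-memory output buffer, keeping it
   NUL-terminated; output past OUT_MAX-1 characters is dropped. */
void my_output(int c)
{
    if (out_len + 1 < OUT_MAX) {
        out_buf[out_len++] = (char)c;
        out_buf[out_len] = '\0';
    }
}
```

Because all three routines share one pushback stack and one position, they
stay consistent with each other, which is the property the generated
analyzer relies on when it backs up over look-ahead characters.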
8. Lex and Yacc.

If you want to use Lex with Yacc, note that what Lex writes is a program
named yylex(), the name required by Yacc for its analyzer. Normally, the
default main program on the Lex library calls this routine, but if Yacc is
loaded, and its main program is used, Yacc will call yylex(). In this case
each Lex rule should end with

        return(token);

where the appropriate token value is returned. An easy way to get access to
Yacc's names for tokens is to compile the Lex output file as part of the
Yacc output file by placing the line

        # include "lex.yy.c"

in the last section of Yacc input. Supposing the grammar to be named "good"
and the lexical rules to be named "better" the UNIX command sequence can
just be:

        yacc good
        lex better
        cc y.tab.c -ly -ll

The Yacc library (-ly) should be loaded before the Lex library, to obtain a
main program which invokes the Yacc parser. The generations of Lex and Yacc
programs can be done in either order.

9. Examples.

As a trivial problem, consider copying an input file while adding 3 to
every positive number divisible by 7. Here is a suitable Lex source program

        %%
                int k;
        [0-9]+  {
                k = atoi(yytext);
                if (k%7 == 0)
                        printf("%d", k+3);
                else
                        printf("%d",k);
                }

to do just that. The rule [0-9]+ recognizes strings of digits; atoi
converts the digits to binary and stores the result in k. The operator %
(remainder) is used to check whether k is divisible by 7; if it is, it is
incremented by 3 as it is written out. It may be objected that this program
will alter such input items as 49.63 or X7. Furthermore, it increments the
absolute value of all negative numbers divisible by 7. To avoid this, just
add a few more rules after the active one, as here:

        %%
                        int k;
        -?[0-9]+        {
                        k = atoi(yytext);
                        printf("%d", k%7 == 0 ? k+3 : k);
                        }
        -?[0-9.]+               ECHO;
        [A-Za-z][A-Za-z0-9]+    ECHO;

Numerical strings containing a "." or preceded by a letter will be picked
up by one of the last two rules, and not changed. The if-else has been
replaced by a C conditional expression to save space; the form a?b:c means
"if a then b else c".

For an example of statistics gathering, here is a program which histograms
the lengths of words, where a word is defined as a string of letters.

                int lengs[100];
        %%
        [a-z]+  lengs[yyleng]++;
        .       |
        \n      ;
        %%
        yywrap()
        {
        int i;
        printf("Length No. words\n");
        for(i=0; i<100; i++)
                if (lengs[i] > 0)
                        printf("%5d%10d\n",i,lengs[i]);
        return(1);
        }

This program accumulates the histogram, while producing no output. At the
end of the input it prints the table. The final statement return(1);
indicates that Lex is to perform wrapup. If yywrap returns zero (false) it
implies that further input is available and the program is to continue
reading and processing. To provide a yywrap that never returns true causes
an infinite loop.

As a larger example, here are some parts of a program written by N. L.
Schryer to convert double precision Fortran to single precision Fortran.
Because Fortran does not distinguish upper and lower case letters, this
routine begins by defining a set of classes including both cases of each
letter:

        a       [aA]
        b       [bB]
        c       [cC]
        ...
        z       [zZ]

An additional class recognizes white space:

        W       [ \t]*

The first rule changes "double precision" to
"real", or "DOUBLE PRECISION" to "REAL".

        {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
                printf(yytext[0]=='d'? "real" : "REAL");
                }

Care is taken throughout this program to preserve the case (upper or lower)
of the original program. The conditional operator is used to select the
proper form of the keyword. The next rule copies continuation card
indications to avoid confusing them with constants:

        ^"     "[^ 0]   ECHO;

In the regular expression, the quotes surround the blanks. It is
interpreted as "beginning of line, then five blanks, then anything but
blank or zero." Note the two different meanings of ^. There follow some
rules to change double precision constants to ordinary floating constants.

        [0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
        [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
        "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
                /* convert constants; p is assumed declared as char *p */
                for (p = yytext; *p != 0; p++)
                        if (*p == 'd' || *p == 'D')
                                *p += 'e' - 'd';
                ECHO;
                }

After the floating point constant is recognized, it is scanned by the for
loop to find the letter d or D. The program then adds 'e'-'d', which
converts it to the next letter of the alphabet. The modified constant, now
single-precision, is written out again. There follow a series of names
which must be respelled to remove their initial d. By using the array
yytext the same action suffices for all the names (only a sample of a
rather long list is given here).

        {d}{s}{i}{n}            |
        {d}{c}{o}{s}            |
        {d}{s}{q}{r}{t}         |
        {d}{a}{t}{a}{n}         |
        ...
        {d}{f}{l}{o}{a}{t}      printf("%s",yytext+1);

Another list of names must have initial d changed to initial a:

        {d}{l}{o}{g}    |
        {d}{l}{o}{g}10  |
        {d}{m}{i}{n}1   |
        {d}{m}{a}{x}1   {
                        yytext[0] += 'a' - 'd';
                        ECHO;
                        }

And one routine must have initial d changed to initial r:

        {d}1{m}{a}{c}{h}        {
                                yytext[0] += 'r' - 'd';
                                ECHO;
                                }

To avoid such names as dsinx being detected as instances of dsin, some
final rules pick up longer words as identifiers and copy some surviving
characters:

        [A-Za-z][A-Za-z0-9]*    |
        [0-9]+                  |
        \n                      |
        .                       ECHO;

Note that this program is not complete; it does not deal with the spacing
problems in Fortran or with the use of keywords as identifiers.

10. Left Context Sensitivity.

Sometimes it is desirable to have several sets of lexical rules to be
applied at different times in the input. For example, a compiler
preprocessor might distinguish preprocessor statements and analyze them
differently from ordinary statements. This requires sensitivity to prior
context, and there are several ways of handling such problems. The ^
operator, for example, is a prior context operator, recognizing immediately
preceding left context just as $ recognizes immediately following right
context. Adjacent left context could be extended, to produce a facility
similar to that for adjacent right context, but it is unlikely to be as
useful, since often the relevant left context appeared some time earlier,
such as at the beginning of a line.

This section describes three means of dealing with different environments:
a simple use of flags, when only a few rules change from one environment to
another, the use of start conditions on rules, and the possibility of
making multiple lexical analyzers all run together. In each case, there are
rules which recognize the need to change the environment in which the
following input text is analyzed, and set some parameter to reflect the
change. This may be a flag explicitly tested by the user's action code;
such a flag is the simplest way of dealing with the problem, since Lex is
not involved at all. It may be more convenient, however, to have Lex
remember the flags as initial conditions on the rules. Any rule may be
associated with a start condition. It will only be recognized when Lex is
in that start condition. The current start condition may be changed at any
time. Finally, if the sets of rules for the different environments are very
dissimilar, clarity may be best achieved by writing several distinct
lexical analyzers, and switching from one to another as
desired.

Consider the following problem: copy the input to the output, changing the
word magic to first on every line which began with the letter a, changing
magic to second on every line which began with the letter b, and changing
magic to third on every line which began with the letter c. All other words
and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with a
flag:

                int flag;
        %%
        ^a      {flag = 'a'; ECHO;}
        ^b      {flag = 'b'; ECHO;}
        ^c      {flag = 'c'; ECHO;}
        \n      {flag = 0 ; ECHO;}
        magic   {
                switch (flag)
                {
                case 'a': printf("first"); break;
                case 'b': printf("second"); break;
                case 'c': printf("third"); break;
                default: ECHO; break;
                }
                }

should be adequate.

To handle the same problem with start conditions, each start condition must
be introduced to Lex in the definitions section with a line reading

        %Start name1 name2 ...

where the conditions may be named in any order. The word Start may be
abbreviated to s or S. The conditions may be referenced at the head of a
rule with the <> brackets:

        <name1>expression

is a rule which is only recognized when Lex is in the start condition
name1. To enter a start condition, execute the action statement

        BEGIN name1;

which changes the start condition to name1. To resume the normal state,

        BEGIN 0;

resets the initial condition of the Lex automaton interpreter. A rule may
be active in several start conditions:

        <name1,name2,name3>

is a legal prefix. Any rule not beginning with the <> prefix operator is
always active.

The same example as before can be written:

        %START AA BB CC
        %%
        ^a              {ECHO; BEGIN AA;}
        ^b              {ECHO; BEGIN BB;}
        ^c              {ECHO; BEGIN CC;}
        \n              {ECHO; BEGIN 0;}
        <AA>magic       printf("first");
        <BB>magic       printf("second");
        <CC>magic       printf("third");

where the logic is exactly the same as in the previous method of handling
the problem, but Lex does the work rather than the user's code.

11. Character Set.

The programs generated by Lex handle character I/O only through the
routines input, output, and unput. Thus the character representation
provided in these routines is accepted by Lex and employed to return values
in yytext. For internal use a character is represented as a small integer
which, if the standard library is used, has a value equal to the integer
value of the bit pattern representing the character on the host computer.
Normally, the letter a is represented as the same form as the character
constant 'a'. If this interpretation is changed, by providing I/O routines
which translate the characters, Lex must be told about it, by giving a
translation table. This table must be in the definitions section, and must
be bracketed by lines containing only "%T". The table contains lines of the
form

        {integer} {character string}

which indicate the value associated with each character. Thus the next
example

        %T
         1      Aa
         2      Bb
        ...
        26      Zz
        27      \n
        28      +
        29      -
        30      0
        31      1
        ...
        39      9
        %T

        Sample character table.

maps the lower and upper case letters together into the integers 1 through
26, newline into 27, + and - into 28 and 29, and the digits into 30 through
39. Note the escape for newline. If a table is supplied, every character
that is to appear either in the rules or in any valid input must be
included in the table. No character may be assigned the number 0, and no
character may be
assigned a bigger number than the size of the hardware character set.

12. Summary of Source Format.

The general form of a Lex source file is:

        {definitions}
        %%
        {rules}
        %%
        {user subroutines}

The definitions section contains a combination of

1)  Definitions, in the form "name space translation".

2)  Included code, in the form "space code".

3)  Included code, in the form

        %{
        code
        %}

4)  Start conditions, given in the form

        %S name1 name2 ...

5)  Character set tables, in the form

        %T
        number space character-string
        ...
        %T

6)  Changes to internal array sizes, in the form

        %x nnn

    where nnn is a decimal integer representing an array size and x selects
    the parameter as follows:

        Letter  Parameter
        p       positions
        n       states
        e       tree nodes
        a       transitions
        k       packed character classes
        o       output array size

Lines in the rules section have the form "expression action" where the
action may be continued on succeeding lines by using braces to delimit it.

Regular expressions in Lex use the following operators:

        x       the character "x"
        "x"     an "x", even if x is an operator.
        \x      an "x", even if x is an operator.
        [xy]    the character x or y.
        [x-z]   the characters x, y or z.
        [^x]    any character but x.
        .       any character but newline.
        ^x      an x at the beginning of a line.
        <y>x    an x when Lex is in start condition y.
        x$      an x at the end of a line.
        x?      an optional x.
        x*      0,1,2, ... instances of x.
        x+      1,2,3, ... instances of x.
        x|y     an x or a y.
        (x)     an x.
        x/y     an x but only if followed by y.
        {xx}    the translation of xx from the definitions section.
        x{m,n}  m through n occurrences of x

13. Caveats and Bugs.

There are pathological expressions which produce exponential growth of the
tables when converted to deterministic machines; fortunately, they are
rare.

REJECT does not rescan the input; instead it remembers the results of the
previous scan. This means that if a rule with trailing context is found,
and REJECT executed, the user must not have used unput to change the
characters forthcoming from the input stream. This is the only restriction
on the user's ability to manipulate the not-yet-processed input.

14. Acknowledgments.

As should be obvious from the above, the outside of Lex is patterned on
Yacc and the inside on Aho's string matching routines. Therefore, both S.
C. Johnson and A. V. Aho are really originators of much of Lex, as well as
debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and debugged
by Eric Schmidt.

15. References.

1.  B. W. Kernighan and D. M. Ritchie, The C Programming Language,
    Prentice-Hall, N. J. (1978).

2.  B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran,
    Software Practice and Experience, 5, pp. 395-496 (1975).

3.  S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing Science
    Technical Report No. 32, 1975, Bell Laboratories, Murray Hill, NJ 07974.

4.  A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to
    Bibliographic Search, Comm. ACM 18, 333-340 (1975).

5.  B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text Editor,
    Computing Science Technical Report No. 5, 1972, Bell Laboratories,
    Murray Hill, NJ 07974.
6.  D. M. Ritchie, private communication. See also M. E. Lesk, The Portable
    C Library, Computing Science Technical Report No. 31, Bell
    Laboratories, Murray Hill, NJ 07974.