
Lex - A Lexical Analyzer Generator

M. E. Lesk and E. Schmidt


Bell Laboratories
Murray Hill, New Jersey 07974

ABSTRACT

Lex helps write programs whose control flow is directed by instances of regular expressions in the input
stream. It is well suited for editor-script type transformations and for segmenting input in preparation for a
parsing routine.
Lex source is a table of regular expressions and corresponding program fragments. The table is translated to a
program which reads an input stream, copying it to an output stream and partitioning the input into strings
which match the given expressions. As each such string is recognized the corresponding program fragment is
executed. The recognition of the expressions is performed by a deterministic finite automaton generated by
Lex. The program fragments written by the user are executed in the order in which the corresponding regular
expressions occur in the input stream.
The lexical analysis programs written with Lex accept ambiguous specifications and choose the longest match
possible at each input point. If necessary, substantial lookahead is performed on the input, but the input stream
will be backed up to the end of the current partition, so that the user has general freedom to manipulate it.
Lex can generate analyzers in either C or Ratfor, a language which can be translated automatically to portable
Fortran. It is available on the PDP-11 UNIX, Honeywell GCOS, and IBM OS systems. This manual, however,
will only discuss generating analyzers in C on the UNIX system, which is the only supported form of Lex under
UNIX Version 7. Lex is designed to simplify interfacing with Yacc, for those with access to this
compiler-compiler system.

July 21, 1975


Table of Contents
1. Introduction.
2. Lex Source.
3. Lex Regular Expressions.
4. Lex Actions.
5. Ambiguous Source Rules.
6. Lex Source Definitions.
7. Usage.
8. Lex and Yacc.
9. Examples.
10. Left Context Sensitivity.
11. Character Set.
12. Summary of Source Format.
13. Caveats and Bugs.
14. Acknowledgments.
15. References.

1. Introduction.

Lex is a program generator designed for lexical processing of character input streams. It accepts a
high-level, problem oriented specification for character string matching, and produces a program in a
general purpose language which recognizes regular expressions. The regular expressions are specified
by the user in the source specifications given to Lex. The Lex written code recognizes these
expressions in an input stream and partitions the input stream into strings matching the expressions.
At the boundaries between strings program sections provided by the user are executed. The Lex source
file associates the regular expressions and the program fragments. As each expression appears in the
input to the program written by Lex, the corresponding fragment is executed.

The user supplies the additional code beyond expression matching needed to complete his tasks,
possibly including code written by other generators. The program that recognizes the expressions is
generated in the general purpose programming language employed for the user’s program fragments.
Thus, a high level expression language is provided to write the string expressions to be matched while
the user’s freedom to write actions is unimpaired. This avoids forcing the user who wishes to use a
string manipulation language for input analysis to write processing programs in the same and often
inappropriate string handling language.

Lex is not a complete language, but rather a generator representing a new language feature which can
be added to different programming languages, called “host languages.” Just as general purpose
languages can produce code to run on different computer hardware, Lex can write code in different host
languages. The host language is used for the output code generated by Lex and also for the program
fragments added by the user. Compatible run-time libraries for the different host languages are also
provided. This makes Lex adaptable to different environments and different users. Each application
may be directed to the combination of hardware and host language appropriate to the task, the user’s
background, and the properties of local implementations. At present, the only supported host language
is C, although Fortran (in the form of Ratfor [2]) has been available in the past. Lex itself exists on
UNIX, GCOS, and OS/370; but the code generated by Lex may be taken anywhere the appropriate
compilers exist.

Lex turns the user’s expressions and actions (called source in this memo) into the host
general-purpose language; the generated program is named yylex. The yylex program will recognize
expressions in a stream (called input in this memo) and perform the specified actions for each
expression as it is detected. See Figure 1.

        Source -> Lex -> yylex

        Input -> yylex -> Output

        An overview of Lex
Figure 1

For a trivial example, consider a program to delete from the input all blanks or tabs at the ends of
lines.

        %%
        [ \t]+$   ;

is all that is required. The program contains a %% delimiter to mark the beginning of the rules, and
one rule. This rule contains a regular expression which matches one or more instances of the
characters blank or tab (written \t for visibility, in accordance with the C language convention) just
prior to the end of a line. The brackets indicate the character class made of blank and tab; the +
indicates “one or more ...”; and the $ indicates “end of line,” as in QED. No action is specified, so
the program generated by Lex (yylex) will ignore these characters. Everything else will be copied. To
change any remaining string of blanks or tabs to a single blank, add another rule:

        %%
        [ \t]+$   ;
        [ \t]+    printf(" ");

The finite automaton generated for this source will scan for both rules at once, observing at the
termination of the string of blanks or tabs whether or not there is a newline character, and executing
the desired rule action. The first rule matches all strings of blanks or tabs at the end of lines, and
the second rule all remaining strings of blanks or tabs.

Lex can be used alone for simple transformations, or for analysis and statistics gathering on a
lexical level. Lex can also be used with a parser generator to perform the lexical analysis phase; it
is particularly easy to interface Lex and Yacc [3]. Lex programs recognize only regular expressions;
Yacc writes parsers that accept a large class of context free grammars, but require a lower level
analyzer to recognize input tokens. Thus, a combination of Lex and Yacc is often appropriate. When
used as a preprocessor for a later parser generator, Lex is used to partition the input stream, and
the parser generator assigns structure to the resulting pieces. The flow of control in such a case
(which might be the first half of a compiler, for example) is shown in Figure 2. Additional programs,
written by other generators or by hand, can be added easily to programs written by Lex.

                lexical rules        grammar rules
                      |                    |
                      v                    v
                     Lex                 Yacc
                      |                    |
                      v                    v
        Input ->    yylex      ->       yyparse    -> Parsed input

        Lex with Yacc

Figure 2

Yacc users will realize that the name yylex is what Yacc expects its lexical analyzer to be named, so
that the use of this name by Lex simplifies interfacing.

Lex generates a deterministic finite automaton from the regular expressions in the source [4]. The
automaton is interpreted, rather than compiled, in order to save space. The result is still a fast
analyzer. In particular, the time taken by a Lex program to recognize and partition an input stream is
proportional to the length of the input. The number of Lex rules or the complexity of the rules is not
important in determining speed, unless rules which include forward context require a significant
amount of rescanning. What does increase with the number and complexity of rules is the size of the
finite automaton, and therefore the size of the program generated by Lex.

In the program written by Lex, the user’s fragments (representing the actions to be performed as each
regular expression is found) are gathered as cases of a switch. The automaton interpreter directs the
control flow. Opportunity is provided for the user to insert either declarations or additional
statements in the routine containing the actions, or to add subroutines outside this action routine.

Lex is not limited to source which can be interpreted on the basis of one character lookahead. For
example, if there are two rules, one looking for ab and another for abcdefg, and the input stream is
abcdefh, Lex will recognize ab and leave the input pointer just before cd... Such backup is more
costly than the processing of simpler languages.

2. Lex Source.

The general format of Lex source is:

        {definitions}
        %%
        {rules}
        %%
        {user subroutines}

where the definitions and the user subroutines are often omitted. The second %% is optional, but the
first is required to mark the beginning of the rules. The absolute minimum Lex program is thus

        %%

(no definitions, no rules) which translates into a program which copies the input to the output
unchanged.

In the outline of Lex programs shown above, the rules represent the user’s control decisions; they are
a table, in which the left column contains regular expressions (see section 3) and the right column
contains actions, program fragments to be executed when the expressions are recognized. Thus an
individual rule might appear

        integer   printf("found keyword INT");

to look for the string integer in the input stream and print the message “found keyword INT” whenever
it appears. In this example the host procedural language is C and the C library function printf is
used to print the string. The end of the expression is indicated by the first blank or tab character.
If the action is merely a single C expression, it can just be given on the right side of the line; if
it is compound, or takes more than a
line, it should be enclosed in braces. As a slightly more useful example, suppose it is desired to
change a number of words from British to American spelling. Lex rules such as

        colour      printf("color");
        mechanise   printf("mechanize");
        petrol      printf("gas");

would be a start. These rules are not quite enough, since the word petroleum would become gaseum; a
way of dealing with this will be described later.

3. Lex Regular Expressions.

The definitions of regular expressions are very similar to those in QED [5]. A regular expression
specifies a set of strings to be matched. It contains text characters (which match the corresponding
characters in the strings being compared) and operator characters (which specify repetitions, choices,
and other features). The letters of the alphabet and the digits are always text characters; thus the
regular expression

        integer

matches the string integer wherever it appears and the expression

        a57D

looks for the string a57D.

Operators. The operator characters are

        " \ [ ] ^ - ? . * + | ( ) $ / { } % < >

and if they are to be used as text characters, an escape should be used. The quotation mark operator
(") indicates that whatever is contained between a pair of quotes is to be taken as text characters.
Thus

        xyz"++"

matches the string xyz++ when it appears. Note that a part of a string may be quoted. It is harmless
but unnecessary to quote an ordinary text character; the expression

        "xyz++"

is the same as the one above. Thus by quoting every non-alphanumeric character being used as a text
character, the user can avoid remembering the list above of current operator characters, and is safe
should further extensions to Lex lengthen the list.

An operator character may also be turned into a text character by preceding it with \ as in

        xyz\+\+

which is another, less readable, equivalent of the above expressions. Another use of the quoting
mechanism is to get a blank into an expression; normally, as explained above, blanks or tabs end a
rule. Any blank character not contained within [ ] (see below) must be quoted. Several normal C
escapes with \ are recognized: \n is newline, \t is tab, and \b is backspace. To enter \ itself, use
\\. Since newline is illegal in an expression, \n must be used; it is not required to escape tab and
backspace. Every character but blank, tab, newline and the list above is always a text character.

Character classes. Classes of characters can be specified using the operator pair [ ]. The
construction [abc] matches a single character, which may be a, b, or c. Within square brackets, most
operator meanings are ignored. Only three characters are special: these are \ - and ^. The - character
indicates ranges. For example,

        [a-z0-9<>_]

indicates the character class containing all the lower case letters, the digits, the angle brackets,
and underline. Ranges may be given in either order. Using - between any pair of characters which are
not both upper case letters, both lower case letters, or both digits is implementation dependent and
will get a warning message. (E.g., [0-z] in ASCII is many more characters than it is in EBCDIC). If it
is desired to include the character - in a character class, it should be first or last; thus

        [-+0-9]

matches all the digits and the two signs.

In character classes, the ^ operator must appear as the first character after the left bracket; it
indicates that the resulting string is to be complemented with respect to the computer character set.
Thus

        [^abc]

matches all characters except a, b, or c, including all special or control characters; or

        [^a-zA-Z]

is any character which is not a letter. The \ character provides the usual escapes within character
class brackets.

Arbitrary character. To match almost any character, the operator character

        .

is the class of all characters except newline. Escaping into octal is possible although non-portable:

        [\40-\176]

matches all printable characters in the ASCII character set, from octal 40 (blank) to octal 176
(tilde).

Optional expressions. The operator ? indicates an optional element of an expression. Thus

        ab?c

matches either ac or abc.

Repeated expressions. Repetitions of classes are indicated by the operators * and +.

        a*

is any number of consecutive a characters, including zero; while

        a+

is one or more instances of a. For example,

        [a-z]+

is all strings of lower case letters. And

        [A-Za-z][A-Za-z0-9]*

indicates all alphanumeric strings with a leading alphabetic character. This is a typical expression
for recognizing identifiers in computer languages.

Alternation and Grouping. The operator | indicates alternation:

        (ab|cd)

matches either ab or cd. Note that parentheses are used for grouping, although they are not necessary
on the outside level;
        ab|cd

would have sufficed. Parentheses can be used for more complex expressions:

        (ab|cd+)?(ef)*

matches such strings as abefef, efefef, cdef, or cddd; but not abc, abcd, or abcdef.

Context sensitivity. Lex will recognize a small amount of surrounding context. The two simplest
operators for this are ^ and $. If the first character of an expression is ^, the expression will only
be matched at the beginning of a line (after a newline character, or at the beginning of the input
stream). This can never conflict with the other meaning of ^, complementation of character classes,
since that only applies within the [ ] operators. If the very last character is $, the expression will
only be matched at the end of a line (when immediately followed by newline). The latter operator is a
special case of the / operator character, which indicates trailing context. The expression

        ab/cd

matches the string ab, but only if followed by cd. Thus

        ab$

is the same as

        ab/\n

Left context is handled in Lex by start conditions as explained in section 10. If a rule is only to be
executed when the Lex automaton interpreter is in start condition x, the rule should be prefixed by

        <x>

using the angle bracket operator characters. If we considered “being at the beginning of a line” to be
start condition ONE, then the ^ operator would be equivalent to

        <ONE>

Start conditions are explained more fully later.

Repetitions and Definitions. The operators {} specify either repetitions (if they enclose numbers) or
definition expansion (if they enclose a name). For example

        {digit}

looks for a predefined string named digit and inserts it at that point in the expression. The
definitions are given in the first part of the Lex input, before the rules. In contrast,

        a{1,5}

looks for 1 to 5 occurrences of a.

Finally, initial % is special, being the separator for Lex source segments.

4. Lex Actions.

When an expression written as above is matched, Lex executes the corresponding action. This section
describes some features of Lex which aid in writing actions. Note that there is a default action,
which consists of copying the input to the output. This is performed on all strings not otherwise
matched. Thus the Lex user who wishes to absorb the entire input, without producing any output, must
provide rules to match everything. When Lex is being used with Yacc, this is the normal situation. One
may consider that actions are what is done instead of copying the input to the output; thus, in
general, a rule which merely copies can be omitted. Also, a character combination which is omitted
from the rules and which appears as input is likely to be printed on the output, thus calling
attention to the gap in the rules.

One of the simplest things that can be done is to ignore the input. Specifying a C null statement, ;
as an action causes this result. A frequent rule is

        [ \t\n]   ;

which causes the three spacing characters (blank, tab, and newline) to be ignored.

Another easy way to avoid writing actions is the action character |, which indicates that the action
for this rule is the action for the next rule. The previous example could also have been written

        " "    |
        "\t"   |
        "\n"   ;

with the same result, although in different style. The quotes around \n and \t are not required.

In more complex actions, the user will often want to know the actual text that matched some expression
like [a-z]+. Lex leaves this text in an external character array named yytext. Thus, to print the name
found, a rule like

        [a-z]+   printf("%s", yytext);

will print the string in yytext. The C function printf accepts a format argument and data to be
printed; in this case, the format is “print string” (% indicating data conversion, and s indicating
string type), and the data are the characters in yytext. So this just places the matched string on the
output. This action is so common that it may be written as ECHO:

        [a-z]+   ECHO;

is the same as the above. Since the default action is just to print the characters found, one might
ask why give a rule, like this one, which merely specifies the default action? Such rules are often
required to avoid matching some other rule which is not desired. For example, if there is a rule which
matches read it will normally match the instances of read contained in bread or readjust; to avoid
this, a rule of the form [a-z]+ is needed. This is explained further below.

Sometimes it is more convenient to know the end of what has been found; hence Lex also provides a
count yyleng of the number of characters matched. To count both the number of words and the number of
characters in words in the input, the user might write

        [a-zA-Z]+   {words++; chars += yyleng;}

which accumulates in chars the number of characters in the words recognized. The last character in the
string matched can be accessed by

        yytext[yyleng-1]

Occasionally, a Lex action may decide that a rule has not recognized the correct span of characters.
Two routines are provided to aid with this situation. First, yymore() can be called to indicate that
the next input expression recognized is to be tacked on to the end of
this input. Normally, the next input string would overwrite the current entry in yytext. Second,
yyless(n) may be called to indicate that not all the characters matched by the currently successful
expression are wanted right now. The argument n indicates the number of characters in yytext to be
retained. Further characters previously matched are returned to the input. This provides the same sort
of lookahead offered by the / operator, but in a different form.

Example: Consider a language which defines a string as a set of characters between quotation (")
marks, and provides that to include a " in a string it must be preceded by a \. The regular expression
which matches that is somewhat confusing, so that it might be preferable to write

        \"[^"]*   {
                  if (yytext[yyleng-1] == '\\')
                          yymore();
                  else
                          ... normal user processing
                  }

which will, when faced with a string such as "abc\"def" first match the five characters "abc\ ; then
the call to yymore() will cause the next part of the string, "def , to be tacked on the end. Note that
the final quote terminating the string should be picked up in the code labeled “normal processing”.

The function yyless() might be used to reprocess text in various circumstances. Consider the C problem
of distinguishing the ambiguity of “=-a”. Suppose it is desired to treat this as “=- a” but print a
message. A rule might be

        =-[a-zA-Z]   {
                     printf("Operator (=-) ambiguous\n");
                     yyless(yyleng-1);
                     ... action for =- ...
                     }

which prints a message, returns the letter after the operator to the input stream, and treats the
operator as “=-”. Alternatively it might be desired to treat this as “= -a”. To do this, just return
the minus sign as well as the letter to the input:

        =-[a-zA-Z]   {
                     printf("Operator (=-) ambiguous\n");
                     yyless(yyleng-2);
                     ... action for = ...
                     }

will perform the other interpretation. Note that the expressions for the two cases might more easily
be written

        =-/[A-Za-z]

in the first case and

        =/-[A-Za-z]

in the second; no backup would be required in the rule action. It is not necessary to recognize the
whole identifier to observe the ambiguity. The possibility of “=-3”, however, makes

        =-/[^ \t\n]

a still better rule.

In addition to these routines, Lex also permits access to the I/O routines it uses. They are:

1) input() which returns the next input character;
2) output(c) which writes the character c on the output; and
3) unput(c) pushes the character c back onto the input stream to be read later by input().

By default these routines are provided as macro definitions, but the user can override them and supply
private versions. These routines define the relationship between external files and internal
characters, and must all be retained or modified consistently. They may be redefined, to cause input
or output to be transmitted to or from strange places, including other programs or internal memory;
but the character set used must be consistent in all routines; a value of zero returned by input must
mean end of file; and the relationship between unput and input must be retained or the Lex lookahead
will not work. Lex does not look ahead at all if it does not have to, but every rule ending in + * ?
or $ or containing / implies lookahead. Lookahead is also necessary to match an expression that is a
prefix of another expression. See below for a discussion of the character set used by Lex. The
standard Lex library imposes a 100 character limit on backup.

Another Lex library routine that the user will sometimes want to redefine is yywrap() which is called
whenever Lex reaches an end-of-file. If yywrap returns a 1, Lex continues with the normal wrapup on
end of input. Sometimes, however, it is convenient to arrange for more input to arrive from a new
source. In this case, the user should provide a yywrap which arranges for new input and returns 0.
This instructs Lex to continue processing. The default yywrap always returns 1.

This routine is also a convenient place to print tables, summaries, etc. at the end of a program. Note
that it is not possible to write a normal rule which recognizes end-of-file; the only access to this
condition is through yywrap. In fact, unless a private version of input() is supplied a file
containing nulls cannot be handled, since a value of 0 returned by input is taken to be end-of-file.

5. Ambiguous Source Rules.

Lex can handle ambiguous specifications. When more than one expression can match the current input,
Lex chooses as follows:

1) The longest match is preferred.
2) Among rules which matched the same number of characters, the rule given first is preferred.

Thus, suppose the rules

        integer   keyword action ...;
        [a-z]+    identifier action ...;

to be given in that order. If the input is integers, it is taken as an identifier, because [a-z]+
matches 8 characters while integer matches only 7. If the input is integer, both rules match 7
characters, and the keyword rule is selected because it was given first. Anything shorter (e.g. int)
will not match the expression integer and so the identifier interpretation is used.
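
The two preference rules of section 5 (longest match first, then the rule given first) can be
illustrated by a small C sketch. This is not code produced by Lex, which drives a finite automaton; it
is a hand-written approximation under stated assumptions. The matchers for the rules integer and
[a-z]+ are coded by hand, and the names match_keyword, match_identifier, and choose_rule are
hypothetical, introduced only for this illustration.

```c
#include <stddef.h>
#include <string.h>

/* Rule 0: the literal keyword "integer".
   Returns the number of characters matched at the start of s, or 0. */
static size_t match_keyword(const char *s)
{
    static const char kw[] = "integer";
    size_t n = sizeof kw - 1;
    return strncmp(s, kw, n) == 0 ? n : 0;
}

/* Rule 1: the expression [a-z]+ (assumes a character set such as ASCII
   in which the lower case letters are contiguous).
   Returns the length of the longest matching prefix of s, or 0. */
static size_t match_identifier(const char *s)
{
    size_t n = 0;
    while (s[n] >= 'a' && s[n] <= 'z')
        n++;
    return n;
}

/* Apply the two preference rules: the longest match wins, and among
   equally long matches the rule given first wins.
   Returns the index of the chosen rule, or -1 if no rule matches;
   the match length is stored in *len when len is not NULL. */
int choose_rule(const char *s, size_t *len)
{
    size_t (*const rules[])(const char *) = { match_keyword, match_identifier };
    int best = -1;
    size_t best_len = 0;

    for (int i = 0; i < (int)(sizeof rules / sizeof rules[0]); i++) {
        size_t n = rules[i](s);
        if (n > best_len) {   /* strict comparison keeps the earlier rule on ties */
            best = i;
            best_len = n;
        }
    }
    if (len != NULL)
        *len = best_len;
    return best;
}
```

Under these assumptions, choose_rule applied to integers selects rule 1 with a length of 8, applied to
integer it selects rule 0 with a length of 7, and applied to int it selects rule 1 with a length of 3,
in agreement with the discussion above.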

The principle of preferring the longest match makes rules containing expressions like .* dangerous.
For example,

        '.*'

might seem a good way of recognizing a string in single quotes. But it is an invitation for the
program to read far ahead, looking for a distant single quote. Presented with the input

        'first' quoted string here, 'second' here

the above expression will match

        'first' quoted string here, 'second'

which is probably not what was wanted. A better rule is of the form

        '[^'\n]*'

which, on the above input, will stop after 'first'. The consequences of errors like this are mitigated
by the fact that the . operator will not match newline. Thus expressions like .* stop on the current
line. Don’t try to defeat this with expressions like [.\n]+ or equivalents; the Lex generated program
will try to read the entire input file, causing internal buffer overflows.

Note that Lex is normally partitioning the input stream, not searching for all possible matches of
each expression. This means that each character is accounted for once and only once. For example,
suppose it is desired to count occurrences of both she and he in an input text. Some Lex rules to do
this might be

        she   s++;
        he    h++;
        \n    |
        .     ;

where the last two rules ignore everything besides he and she. Remember that . does not include
newline. Since she includes he, Lex will normally not recognize the instances of he included in she,
since once it has passed a she those characters are gone.

Sometimes the user would like to override this choice. The action REJECT means “go do the next
alternative.” It causes whatever rule was second choice after the current rule to be executed. The
position of the input pointer is adjusted accordingly. Suppose the user really wants to count the
included instances of he:

        she   {s++; REJECT;}
        he    {h++; REJECT;}
        \n    |
        .     ;

these rules are one way of changing the previous example to do just that. After counting each
expression, it is rejected; whenever appropriate, the other expression will then be counted. In this
example, of course, the user could note that she includes he but not vice versa, and omit the REJECT
action on he; in other cases, however, it would not be possible a priori to tell which input
characters were in both classes.

Consider the two rules

        a[bc]+   { ... ; REJECT;}
        a[cd]+   { ... ; REJECT;}

If the input is ab, only the first rule matches, and on ad only the second matches. The input string
accb matches the first rule for four characters and then the second rule for three characters. In
contrast, the input accd agrees with the second rule for four characters and then the first rule for
three.

In general, REJECT is useful whenever the purpose of Lex is not to partition the input stream but to
detect all examples of some items in the input, and the instances of these items may overlap or
include each other. Suppose a digram table of the input is desired; normally the digrams overlap, that
is the word the is considered to contain both th and he. Assuming a two-dimensional array named digram
to be incremented, the appropriate source is

        %%
        [a-z][a-z]   {digram[yytext[0]][yytext[1]]++; REJECT;}
        \n           ;

where the REJECT is necessary to pick up a letter pair beginning at every character, rather than at
every other character.

6. Lex Source Definitions.

Remember the format of the Lex source:

        {definitions}
        %%
        {rules}
        %%
        {user routines}

So far only the rules have been described. The user needs additional options, though, to define
variables for use in his program and for use by Lex. These can go either in the definitions section or
in the rules section.

Remember that Lex is turning the rules into a program. Any source not intercepted by Lex is copied
into the generated program. There are three classes of such things.

1) Any line which is not part of a Lex rule or action which begins with a blank or tab is copied into
the Lex generated program. Such source input prior to the first %% delimiter will be external to any
function in the code; if it appears immediately after the first %%, it appears in an appropriate place
for declarations in the function written by Lex which contains the actions. This material must look
like program fragments, and should precede the first Lex rule.

As a side effect of the above, lines which begin with a blank or tab, and which contain a comment, are
passed through to the generated program. This can be used to include comments in either the Lex source
or the generated code. The comments should follow the host language convention.

2) Anything included between lines containing only %{ and %} is copied out as above. The delimiters
are discarded. This format permits entering text like preprocessor statements that must begin in
column 1, or copying lines that do not look like programs.

3) Anything after the second %% delimiter, regardless of formats, etc., is copied out after the Lex
output.

Definitions intended for Lex are given before the first %% delimiter. Any line in this section not
contained
between %{ and %}, and beginning in column 1, is assumed to define
Lex substitution strings. The format of such lines is
    name translation
and it causes the string given as a translation to be associated with
the name. The name and translation must be separated by at least one
blank or tab, and the name must begin with a letter. The translation
can then be called out by the {name} syntax in a rule. Using {D} for
the digits and {E} for an exponent field, for example, might
abbreviate rules to recognize numbers:
    D                   [0-9]
    E                   [DEde][-+]?{D}+
    %%
    {D}+                printf("integer");
    {D}+"."{D}*({E})?   |
    {D}*"."{D}+({E})?   |
    {D}+{E}
Note the first two rules for real numbers; both require a decimal
point and contain an optional exponent field, but the first requires
at least one digit before the decimal point and the second requires
at least one digit after the decimal point. To correctly handle the
problem posed by a Fortran expression such as 35.EQ.I, which does not
contain a real number, a context-sensitive rule such as
    [0-9]+/"."EQ        printf("integer");
could be used in addition to the normal rule for integers.

The definitions section may also contain other commands, including
the selection of a host language, a character set table, a list of
start conditions, or adjustments to the default size of arrays within
Lex itself for larger source programs. These possibilities are
discussed below under "Summary of Source Format," section 12.

7. Usage.
There are two steps in compiling a Lex source program. First, the Lex
source must be turned into a generated program in the host general
purpose language. Then this program must be compiled and loaded,
usually with a library of Lex subroutines. The generated program is
on a file named lex.yy.c. The I/O library is defined in terms of the
C standard library [6].

The C programs generated by Lex are slightly different on OS/370,
because the OS compiler is less powerful than the UNIX or GCOS
compilers, and does less at compile time. C programs generated on
GCOS and UNIX are the same.

UNIX. The library is accessed by the loader flag -ll. So an
appropriate set of commands is
    lex source
    cc lex.yy.c -ll
The resulting program is placed on the usual file a.out for later
execution. To use Lex with Yacc see below. Although the default Lex
I/O routines use the C standard library, the Lex automata themselves
do not do so; if private versions of input, output and unput are
given, the library can be avoided.

8. Lex and Yacc.
If you want to use Lex with Yacc, note that what Lex writes is a
program named yylex(), the name required by Yacc for its analyzer.
Normally, the default main program on the Lex library calls this
routine, but if Yacc is loaded, and its main program is used, Yacc
will call yylex(). In this case each Lex rule should end with
    return(token);
where the appropriate token value is returned. An easy way to get
access to Yacc's names for tokens is to compile the Lex output file
as part of the Yacc output file by placing the line
    # include "lex.yy.c"
in the last section of Yacc input. Supposing the grammar to be named
"good" and the lexical rules to be named "better", the UNIX command
sequence can just be:
    yacc good
    lex better
    cc y.tab.c -ly -ll
The Yacc library (-ly) should be loaded before the Lex library, to
obtain a main program which invokes the Yacc parser. The generation
of the Lex and Yacc programs can be done in either order.

9. Examples.
As a trivial problem, consider copying an input file while adding 3
to every positive number divisible by 7. Here is a suitable Lex
source program
    %%
            int k;
    [0-9]+  {
            k = atoi(yytext);
            if (k%7 == 0)
                    printf("%d", k+3);
            else
                    printf("%d",k);
            }
to do just that. The rule [0-9]+ recognizes strings of digits; atoi
converts the digits to binary and stores the result in k. The
operator % (remainder) is used to check whether k is divisible by 7;
if it is, it is incremented by 3 as it is written out. It may be
objected that this program will alter such input items as 49.63 or
X7. Furthermore, it increments the absolute value of all negative
numbers divisible by 7. To avoid this, just add a few more rules
after the active one, as here:
    %%
            int k;
    -?[0-9]+                {
                            k = atoi(yytext);
                            printf("%d", k%7 == 0 ? k+3 : k);
                            }
    -?[0-9.]+               ECHO;
    [A-Za-z][A-Za-z0-9]+    ECHO;
Numerical strings containing a "." or preceded by a letter will be
picked up by one of the last two rules, and not changed. The if-else
has been replaced by a C conditional expression to save space.
In C, the conditional expression a ? b : c means "if a then b else c".

For an example of statistics gathering, here is a program which
histograms the lengths of words, where a word is defined as a string
of letters.
            int lengs[100];
    %%
    [a-z]+  lengs[yyleng]++;
    .       |
    \n      ;
    %%
    yywrap()
    {
    int i;
    printf("Length  No. words\n");
    for(i=0; i<100; i++)
            if (lengs[i] > 0)
                    printf("%5d%10d\n",i,lengs[i]);
    return(1);
    }
This program accumulates the histogram, while producing no output. At
the end of the input it prints the table. The final statement
return(1); indicates that Lex is to perform wrapup. If yywrap returns
zero (false) it implies that further input is available and the
program is to continue reading and processing. To provide a yywrap
that never returns true causes an infinite loop.

As a larger example, here are some parts of a program written by
N. L. Schryer to convert double precision Fortran to single precision
Fortran. Because Fortran does not distinguish upper and lower case
letters, this routine begins by defining a set of classes including
both cases of each letter:
    a       [aA]
    b       [bB]
    c       [cC]
    ...
    z       [zZ]
An additional class recognizes white space:
    W       [ \t]*
The first rule changes "double precision" to "real", or "DOUBLE
PRECISION" to "REAL".
    {d}{o}{u}{b}{l}{e}{W}{p}{r}{e}{c}{i}{s}{i}{o}{n} {
            printf(yytext[0]=='d'? "real" : "REAL");
            }
Care is taken throughout this program to preserve the case (upper or
lower) of the original program. The conditional operator is used to
select the proper form of the keyword. The next rule copies
continuation card indications to avoid confusing them with constants:
    ^"     "[^ 0]   ECHO;
In the regular expression, the quotes surround the blanks. It is
interpreted as "beginning of line, then five blanks, then anything
but blank or zero." Note the two different meanings of ^. There
follow some rules to change double precision constants to ordinary
floating constants.
    [0-9]+{W}{d}{W}[+-]?{W}[0-9]+           |
    [0-9]+{W}"."{W}{d}{W}[+-]?{W}[0-9]+     |
    "."{W}[0-9]+{W}{d}{W}[+-]?{W}[0-9]+     {
            /* convert constants */
            for(p=yytext; *p != 0; p++)
                    if (*p=='d' || *p=='D')
                            *p += 'e' - 'd';
            ECHO;
            }
After the floating point constant is recognized, it is scanned by the
for loop to find the letter d or D. The program then adds 'e'-'d',
which converts it to the next letter of the alphabet. The modified
constant, now single-precision, is written out again. There follow a
series of names which must be respelled to remove their initial d. By
using the array yytext the same action suffices for all the names
(only a sample of a rather long list is given here).
    {d}{s}{i}{n}            |
    {d}{c}{o}{s}            |
    {d}{s}{q}{r}{t}         |
    {d}{a}{t}{a}{n}         |
    ...
    {d}{f}{l}{o}{a}{t}      printf("%s",yytext+1);
Another list of names must have initial d changed to initial a:
    {d}{l}{o}{g}            |
    {d}{l}{o}{g}10          |
    {d}{m}{i}{n}1           |
    {d}{m}{a}{x}1           {
                            yytext[0] += 'a' - 'd';
                            ECHO;
                            }
And one routine must have initial d changed to initial r:
    {d}1{m}{a}{c}{h}        {
                            yytext[0] += 'r' - 'd';
                            ECHO;
                            }
To avoid such names as dsinx being detected as instances of dsin,
some final rules pick up longer words as identifiers and copy some
surviving characters:
    [A-Za-z][A-Za-z0-9]*    |
    [0-9]+                  |
    \n                      |
    .                       ECHO;
Note that this program is not complete; it does not deal with the
spacing problems in Fortran or with the use of keywords as
identifiers.

10. Left Context Sensitivity.
Sometimes it is desirable to have several sets of lexical rules to be
applied at different times in the input. For example, a compiler
preprocessor might distinguish preprocessor statements and analyze
them differently from ordinary statements. This requires sensitivity
to prior context, and there are several ways of handling such
problems. The ^ operator, for example, is a prior context operator,
recognizing immediately preceding left context just as $ recognizes
immediately following right context. Adjacent left context could be
extended, to produce a facility similar to that for adjacent right
context, but it is unlikely to be as useful, since often the
relevant left context appeared some time earlier, such as at the
beginning of a line.

This section describes three means of dealing with different
environments: a simple use of flags, when only a few rules change
from one environment to another, the use of start conditions on
rules, and the possibility of making multiple lexical analyzers all
run together. In each case, there are rules which recognize the need
to change the environment in which the following input text is
analyzed, and set some parameter to reflect the change. This may be a
flag explicitly tested by the user's action code; such a flag is the
simplest way of dealing with the problem, since Lex is not involved
at all. It may be more convenient, however, to have Lex remember the
flags as initial conditions on the rules. Any rule may be associated
with a start condition. It will only be recognized when Lex is in
that start condition. The current start condition may be changed at
any time. Finally, if the sets of rules for the different
environments are very dissimilar, clarity may be best achieved by
writing several distinct lexical analyzers, and switching from one to
another as desired.

Consider the following problem: copy the input to the output,
changing the word magic to first on every line which began with the
letter a, changing magic to second on every line which began with the
letter b, and changing magic to third on every line which began with
the letter c. All other words and all other lines are left unchanged.

These rules are so simple that the easiest way to do this job is with
a flag:
            int flag;
    %%
    ^a      {flag = 'a'; ECHO;}
    ^b      {flag = 'b'; ECHO;}
    ^c      {flag = 'c'; ECHO;}
    \n      {flag =  0 ; ECHO;}
    magic   {
            switch (flag)
            {
            case 'a': printf("first"); break;
            case 'b': printf("second"); break;
            case 'c': printf("third"); break;
            default: ECHO; break;
            }
            }
should be adequate.

To handle the same problem with start conditions, each start
condition must be introduced to Lex in the definitions section with a
line reading
    %Start name1 name2 ...
where the conditions may be named in any order. The word Start may be
abbreviated to s or S. The conditions may be referenced at the head
of a rule with the <> brackets:
    <name1>expression
is a rule which is only recognized when Lex is in the start condition
name1. To enter a start condition, execute the action statement
    BEGIN name1;
which changes the start condition to name1. To resume the normal
state,
    BEGIN 0;
resets the initial condition of the Lex automaton interpreter. A rule
may be active in several start conditions:
    <name1,name2,name3>
is a legal prefix. Any rule not beginning with the <> prefix operator
is always active.

The same example as before can be written:
    %START AA BB CC
    %%
    ^a              {ECHO; BEGIN AA;}
    ^b              {ECHO; BEGIN BB;}
    ^c              {ECHO; BEGIN CC;}
    \n              {ECHO; BEGIN 0;}
    <AA>magic       printf("first");
    <BB>magic       printf("second");
    <CC>magic       printf("third");
where the logic is exactly the same as in the previous method of
handling the problem, but Lex does the work rather than the user's
code.

11. Character Set.
The programs generated by Lex handle character I/O only through the
routines input, output, and unput. Thus the character representation
provided in these routines is accepted by Lex and employed to return
values in yytext. For internal use a character is represented as a
small integer which, if the standard library is used, has a value
equal to the integer value of the bit pattern representing the
character on the host computer. Normally, the letter a is represented
in the same form as the character constant 'a'. If this
interpretation is changed, by providing I/O routines which translate
the characters, Lex must be told about it, by giving a translation
table. This table must be in the definitions section, and must be
bracketed by lines containing only "%T". The table contains lines of
the form
    {integer} {character string}
which indicate the value associated with each character. Thus the
next example
    %T
     1      Aa
     2      Bb
    ...
    26      Zz
    27      \n
    28      +
    29      -
    30      0
    31      1
    ...
    39      9
    %T
    Sample character table.
maps the lower and upper case letters together into the integers 1
through 26, newline into 27, + and - into 28
and 29, and the digits into 30 through 39. Note the escape for
newline. If a table is supplied, every character that is to appear
either in the rules or in any valid input must be included in the
table. No character may be assigned the number 0, and no character
may be assigned a bigger number than the size of the hardware
character set.

12. Summary of Source Format.
The general form of a Lex source file is:
    {definitions}
    %%
    {rules}
    %%
    {user subroutines}
The definitions section contains a combination of
1) Definitions, in the form "name space translation".
2) Included code, in the form "space code".
3) Included code, in the form
    %{
    code
    %}
4) Start conditions, given in the form
    %S name1 name2 ...
5) Character set tables, in the form
    %T
    number space character-string
    ...
    %T
6) Changes to internal array sizes, in the form
    %x nnn
where nnn is a decimal integer representing an array size and x
selects the parameter as follows:
    Letter  Parameter
    p       positions
    n       states
    e       tree nodes
    a       transitions
    k       packed character classes
    o       output array size
Lines in the rules section have the form "expression action" where
the action may be continued on succeeding lines by using braces to
delimit it.

Regular expressions in Lex use the following operators:
    x        the character "x"
    "x"      an "x", even if x is an operator.
    \x       an "x", even if x is an operator.
    [xy]     the character x or y.
    [x-z]    the characters x, y or z.
    [^x]     any character but x.
    .        any character but newline.
    ^x       an x at the beginning of a line.
    <y>x     an x when Lex is in start condition y.
    x$       an x at the end of a line.
    x?       an optional x.
    x*       0,1,2, ... instances of x.
    x+       1,2,3, ... instances of x.
    x|y      an x or a y.
    (x)      an x.
    x/y      an x but only if followed by y.
    {xx}     the translation of xx from the definitions section.
    x{m,n}   m through n occurrences of x

13. Caveats and Bugs.
There are pathological expressions which produce exponential growth
of the tables when converted to deterministic machines; fortunately,
they are rare.

REJECT does not rescan the input; instead it remembers the results of
the previous scan. This means that if a rule with trailing context is
found, and REJECT executed, the user must not have used unput to
change the characters forthcoming from the input stream. This is the
only restriction on the user's ability to manipulate the
not-yet-processed input.

14. Acknowledgments.
As should be obvious from the above, the outside of Lex is patterned
on Yacc and the inside on Aho's string matching routines. Therefore,
both S. C. Johnson and A. V. Aho are really originators of much of
Lex, as well as debuggers of it. Many thanks are due to both.

The code of the current version of Lex was designed, written, and
debugged by Eric Schmidt.

15. References.
1. B. W. Kernighan and D. M. Ritchie, The C Programming Language,
   Prentice-Hall, N. J. (1978).
2. B. W. Kernighan, Ratfor: A Preprocessor for a Rational Fortran,
   Software - Practice and Experience, 5, pp. 395-496 (1975).
3. S. C. Johnson, Yacc: Yet Another Compiler Compiler, Computing
   Science Technical Report No. 32, 1975, Bell Laboratories, Murray
   Hill, NJ 07974.
4. A. V. Aho and M. J. Corasick, Efficient String Matching: An Aid to
   Bibliographic Search, Comm. ACM 18, 333-340 (1975).
5. B. W. Kernighan, D. M. Ritchie and K. L. Thompson, QED Text
   Editor, Computing Science Technical Report No. 5, 1972, Bell
   Laboratories, Murray Hill, NJ 07974.
6. D. M. Ritchie, private communication. See also M. E. Lesk, The
   Portable C Library, Computing Science Technical Report No. 31,
   Bell Laboratories, Murray Hill, NJ 07974.