
Bachelor Informatica

Informatica — Universiteit van Amsterdam

Practical use of Automata and Formal Languages in the compiler field

Daan de Graaf

June 9, 2017

Supervisor(s): Inge Bethke (UVA)


Signed: Inge Bethke (UVA)
Abstract
The principles in the study of Automata and Formal Languages (AFL) are usually taught in a theoretical manner, without mentioning the applicability of the theory. To demonstrate the practical use, this paper presents three assignments covering the major subjects in AFL theory. Using a compiler design approach, a source program consisting of simple arithmetic expressions is compiled and eventually executed; the execution yields the value of the expression. The student complements the provided framework with knowledge acquired from AFL theory by implementing concepts found in AFL as well as in compiler design. The assignments let the student put theory into practice and at the same time give an insight into compiler design.

Contents

1 Introduction
  1.1 Context
  1.2 Goal
  1.3 Educational Objectives

2 Background
  2.1 Automata and Formal Languages
  2.2 Related Work

3 Choices
  3.1 Python
  3.2 Framework
  3.3 The Compiler
  3.4 Grading

4 Design
  4.1 Lexing
    4.1.1 Objectives
  4.2 Parsing
    4.2.1 Objectives
  4.3 Intermediate phases
    4.3.1 Type checking
    4.3.2 Intermediate code generation
    4.3.3 Register allocation
    4.3.4 Machine code generation
    4.3.5 Assembly and linking
  4.4 Executing
    4.4.1 The instruction set
    4.4.2 Objectives

5 Implementation
  5.1 Assignment 1
    5.1.1 Lexer
    5.1.2 Parser
  5.2 Assignment 2
  5.3 Assignment 3

6 Conclusion
  6.1 Reflection

Bibliography

CHAPTER 1

Introduction

1.1 Context
The principles in the study of Automata and Formal Languages, in this paper referred to as AFL,
are mainly taught in a theoretical manner. However, the phenomena found in this study are
spread broadly across computer science and are applicable in many disciplines. The theory is
used in compilers, text processing, natural languages and genomes [7].
The existing literature and books only provide exercises that capture the idea of certain principles
in AFL, neglecting their applications or mentioning them only briefly [5]. The idea of this research
is to design assignments for AFL courses that reflect the applicability of Automata and Formal
Languages.
Since compiler design is firmly rooted in computer science, and therefore of direct value to
computer science students, this discipline is chosen as the common thread of this research. After
investigating compiler design curricula, it became clear that, due to the complexity of compiler
design, the assignments should remain comprehensible and abstract away specific compiler
elements in order to keep them simple for students.

This research focuses on compiling a source program, consisting of simple arithmetic
expressions, to a final result, namely the value of the expression, using AFL and compiler
concepts along the way. To achieve this, the goal and objectives are discussed first, to establish
a definition and determine what the assignments should (and should not) contain. Then the
issue with current assignments and exercises is addressed in more detail and some solutions
from other literature are discussed. Finally the design of the assignments is presented.

1.2 Goal
The goal of this research is to give students a basic understanding of the practical uses of AFL
by creating practical assignments: three assignments covering Finite Automata, Push-Down
Automata and Turing Machines that test acquired knowledge with simple implementations of
the respective subjects. The assignments should provide a framework1 with an easy-to-use
programming language and structure, so that implementing the AFL aspects is facilitated and
preparatory work is abstracted away.
The student should eventually have a general idea of particular applications and, ideally, some
interest in the compiler field. At the same time the assignments should not overwhelm the
student with too much new information.
1 Pre-written code offering a basic setup taking care of I/O, printing, objects and corresponding error handling.

1.3 Educational Objectives
To educate and give an understanding of the use of AFL, the focus lies on simplicity. Most
students taking the course are in their first year of computer science and are still learning to
program. An excess of information on compiler theory would draw attention away from the
main subject, so the amount of information on compiling is reduced. However, this comes
with some drawbacks, as will be discussed later on.
Since the main audience is just starting to write code, students are encouraged to write as much
code as possible themselves. The final framework should keep the balance between developing
programming skills and keeping the workload feasible.
Encouraging students to understand the different concepts in AFL theory is of equal
importance. How these two aspects are integrated is investigated further on.

CHAPTER 2

Background

2.1 Automata and Formal Languages


As mentioned before, three main concepts are found in the AFL study: Finite Automata (FA),
Push-Down Automata (PDA) and Turing Machines (TM). The explanations in the existing
literature use a simple alphabet Σ = { a, b} or Σ = {0, 1} to express words like “a”, “bba”,
“babaaba” or “0”, “110”, “1011000” [5, 1], and exercises typically ask for automata accepting
a certain set of strings, for example strings with an even number of 1’s and 0’s, or strings
containing the substring aba. This is known as pattern matching and is an important application
of finite automata. These basic concepts show the capabilities of automata, but not their use.
The PDA is introduced as an extension of the Finite Automaton: equipped with a memory, it is
capable of handling non-regular languages such as the set { a^n b^n : n ≥ 0 }.
With the Turing machine, the word computability follows. A TM is capable of simulating the
logic of any existing computer, yet exploiting this fact is left out in most literature. As noted
in [8], instead of looking at the state in which a TM halts, one can look at the final contents
of the tape. In this perspective a TM implements a function with input and output.

2.2 Related Work


In the AFL field as well as in the compiler field, formal languages are coupled with compiler
design. The book Formal Language: A Practical Introduction [8] uses simple arithmetic
expressions as an example of AFL applied to compiling. Using a provided framework written
in Java, readers complement the framework by implementing functions that parse expressions
provided as user input. However, the given examples target one specific grammar only, which
detracts from the power of specifying arbitrarily many grammars without changing the code.
In other words, when making adjustments to the grammar, the source code has to change
instead of only the language-specific grammar. The ideas of AFL, context-free grammars to be
more specific, are used as a guide for writing the grammar in hard-coded form rather than as a
grammar in its own right.
The book also pays attention to the TM by providing exercises on TM functions. Readers are
asked to create TM functions for multiplying, subtracting and dividing binary as well as unary
numbers. This puts the focus on the final contents of the tape rather than on the state in which
the machine halts. From that point of view the TM is used as a calculator, or more generally an
executor of certain instructions.
Furthermore, Basics of Compiler Design [6] provides a thorough introduction to the AFL aspects
relevant to each phase of the compiling process. It also examines the construction of a parse
table for simple arithmetic expressions, which is used later on for the parser. Removing
ambiguity, constructing a usable grammar and maintaining operator precedence eventually
provides all the knowledge needed to construct a simple compiler.

In [4], the use of Python as a pseudo-language for formal language theory is discussed. Using
a selection of topics, it is shown that Python is suitable for the structures and algorithms of
formal language theory. The selected topics are not the common ones found in theory:
specifying a grammar for Roman numerals and performing a syntax analysis on an input is a
perfect example of an application of formal languages.
There is also a section about translation, using a grammar to translate Roman to Arabic
numerals. However, this method is not used in this research, since other implementations were
found and the approach may not be accessible enough.
A very similar procedure of compiling with AFL aspects is found in [3]. It takes a
user-specified grammar as input and offers several functions that can be applied to the
grammar. The same principles appear in this research. Although the GitHub project is helpful,
it provides no challenges to the students and its structure is too complicated for use in this
project. However, it did inspire some thoughts on the implementation of certain data
structures.

CHAPTER 3

Choices

Before the design and implementation part, some choices were made in order to narrow the
scope of the project and make the assignments workable. This means some core aspects
of compiler design are omitted or provided in an understandable manner, in order to keep
the focus on AFL.

3.1 Python
Python is an easy language to start with for a first-year computer science student. Constraining
students to another language has its drawbacks and is not beneficial for the yield of the assign-
ments. As mentioned in [4], Python can significantly improve the study of formal language
theory and its applications. Taking that into account, no other languages were considered for
this project.

3.2 Framework
For each assignment a framework is provided. The framework serves as a starting point for the stu-
dents, as well as a help for future correctors when reviewing assignments. The framework contains
the following elements:
Error handling If an error (of any sort) occurs and the cause of this error is known, the program
exits and returns a corresponding message about the problem.

I/O management Reading the source program from memory and writing the result back is
taken care of by the framework. In addition, the user input is checked for an argument
specifying the source file. The corresponding error messages cover: a non-existent file or
directory, no file path specified, or an unreadable source.
Functions To encourage the use of certain conventions, some basic functions are provided. The
functions help with implementing the assignments and come with comments explaining how
to use them.
Objects The framework of each assignment contains an object embodying an AFL principle; the
principles FA, PDA and TM are used in assignments 1, 2 and 3 respectively. The objects provide
a structure for automata and grammars, specified by the student or already present in the
framework. This means that a grammar, provided in a certain data structure, can be read and
stored in the object, after which methods can be applied to it.

3.3 The Compiler
As briefly explained above, compiler design is part of computer science and involves some interest-
ing principles that can be found in AFL. There is also enough literature combining the two
fields to encourage a closer look. The choices made during the implementation of the compiler
arise from three origins.
The first is the simplification of certain processes and phases: the compiler design principles
are abstracted away for students in order to focus on the AFL implementation, either by
providing elements in the framework or by skipping some parts.
The second origin is the addition of elements; some elements are added to the assignments to
keep them challenging or more consistent. The third origin stems from complications during the
implementation.
All details are described in the following sections.

3.4 Grading
Since the assignments are for educational purposes, complete solutions to the assignments are
provided in a separate framework. This framework can be used by teachers, assistants or
correctors to check the students' assignments and grade them. The grading scheme is not
discussed in this paper; only brief comments about the challenging aspects are provided. The
reason is that teachers may have different views on which material matters most, and students
may find that a fixed grading scheme does not suit the assignment (since some subjects may be
covered only partially or not at all).

CHAPTER 4

Design

A basic compiler consists of seven compiling phases [6]:

Lexical analysis In this phase the characters of the source code are grouped into tokens. Tokens
range from an integer to a plus sign to white-space (see Section 4.1).
Syntax analysis The generated token list is parsed and has to conform to a specific structure. When-
ever the code is not accepted, a syntax error with a corresponding error message is thrown
(see Section 4.2).

Type checking In this phase the consistency requirements of the parsed code are checked,
for example whether a variable is used but not declared, or whether an integer is assigned to a
string variable (see Section 4.3.1). This phase is skipped in this project.
Intermediate code generation In this phase the code is translated to an intermediate language
(see Section 4.3.2).
Register allocation The variable names are translated to numbers which correspond to a regis-
ter (see Section 4.3.3).
Machine code generation The intermediate code is translated to a machine specific assembly
language (see Section 4.3.4).

Assembly and linking The last phase translates the assembly language into a binary represen-
tation and the addresses of functions, variables, etc., are determined (see Section 4.3.5).
To keep things simple, the focus lies on the lexical and syntax analysis and on the execution
of the generated machine code (note that execution is not a compile phase). To obtain a working
compiler without getting too detailed, some adjustments to the phases have to be made. By
extracting the phases that make extensive use of AFL principles and setting the other phases
aside for a moment, the scheme in Figure 4.1 remains. This is the core part of the assignments
and the main implementation for students. In the next sections each phase is clarified, followed
by an explanation of how the omitted phases play a role in each phase.

Figure 4.1: Compiling scheme. Dashed arrows include intermediate phases, found in Section
4.3.

The source code consists of one or more simple arithmetic expressions1 , the expression
language, which will be compiled and executed. The grammar in Figure 4.2 defines valid ex-
pressions.

A → id = E
E → E O E | ( E ) | int | id
O → + | − | ∗ | /

Figure 4.2: Grammar representation of valid expressions. Note this grammar will be modified
further on and serves as a clarification.

Lowercase characters, including the operators and brackets, represent terminals, to distin-
guish them from non-terminals2 .
a = 2 + (3 ∗ 2) and b = a/4 are examples of expressions that can be used. Here, an id is a
character which can be followed by further characters or digits; a, result1 and A2b4g are all valid
id’s. An int is a digit which can be followed by further digits, e.g. 1, 528 or 10000.
Only the natural numbers N can properly be used for computation, as will become clear later
on when a unary representation of numbers is used. Moreover, every intermediate result of an ex-
pression has to be a natural number too in order for the program to work properly. So although
a = 50 + (10 − 23) results in a natural number (37), the intermediate result of (10 − 23) equals
−13 and is not a natural number.
One can clearly see from the grammar that the use of the equal sign is obligatory. This means
that every expression is an initialisation of a variable, which is useful later on when the code
is executed (see Section 4.4).

4.1 Lexing
In the lexical analysis part of the compilation, an input stream of characters from the source code
gets grouped into tokens. Each token is assigned an identifier in the form of an uppercase string.
A deterministic finite automaton (DFA), as illustrated in Figure 4.4, is used as the tokenizer. It
can be seen that an ID should always start with a character possibly followed by a character or
a digit. A digit is any of the values in [0-9] and a character is any of the values in [a-zA-Z]3 . If
there is no transition possible from the START state to an accepting state, the character is not
part of the alphabet and thus not recognised.
1 The source code file can have multiple lines to use previously declared variables.
2 Symbols that can be replaced by other symbols are called non-terminals and symbols that cannot be replaced by other symbols are called terminals [1].
3 This is done to omit a transition for every digit and character which would result in 62 additional state-transitions.
Tokenizing the input source Var = 34 + (5*56) results in the stream shown in Figure 4.3.

Figure 4.3: Tokens with corresponding identifiers.

To work properly, the longest stream of characters that matches a token is tokenized (maximal
munch). As an example, var22 is tokenized as the single token var22 instead of var and 22; note
that “var 22” would be syntactically incorrect.
A white-space always separates two tokens, since there is no white-space transition from one
accepting state to another. The accepting WHITE state could also have had a white-space
transition to itself, since consecutive white-spaces still form white-space. However, considering
the arithmetic expression format of the input, consecutive white-spaces are uncommon.

Figure 4.4: A DFA for tokenizing the characters from the input stream.

4.1.1 Objectives
The DFA is fairly straightforward due to the simple tokens of an arithmetic expression. How-
ever, the basis of using an FA for pattern matching and using regular expressions is there. The
resulting implementation, disregarding the character and digit simplification, should work on
any given DFA for any input stream.
The implementation should include the checking of each input character and the presence of a
valid state transition to an accepting state. Checking multiple lines from the source and even-
tually creating a decent structure to use in the next phase are also part of the implementation.
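To make this objective concrete, the sketch below shows what such a maximal-munch lexing loop could look like in Python. It is only an illustration, not the provided framework: the attributes M.current and M.F (the current state and the set of accepting states), the function name lexer_line and the mapping of raw characters onto <digit> and <character> are assumptions mirroring the simplification mentioned above.

def lexer_line(M, line):
    # Sketch: maximal-munch tokenisation of one source line with a DFA-like object M.
    def classify(ch):
        # Map raw characters onto the abstract alphabet used by the DFA.
        if ch.isdigit():
            return '<digit>'
        if ch.isalpha():
            return '<character>'
        return ch

    tokens, pos = [], 0
    while pos < len(line):
        M.reset()
        last_state, last_end = None, pos
        i = pos
        # Follow transitions as long as possible, remembering the last accepting state.
        while i < len(line) and M.move(classify(line[i])):
            i += 1
            if M.current in M.F:          # F: assumed set of accepting states
                last_state, last_end = M.current, i
        if last_state is None:
            raise SyntaxError('Unrecognised character: %r' % line[pos])
        if last_state != 'WHITE':         # drop white-space; it only separates tokens
            tokens.append((line[pos:last_end], last_state))
        pos = last_end
    return tokens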

4.2 Parsing
Syntax analysis recombines the created token list into a syntax tree which reflects the structure
of the source [6]. This process is referred to as parsing. The parser checks whether the syntax of
the source program is valid and throws a syntax error for an invalid syntax.
As a first approach, a DFA can be used to parse the tokens and check the syntax, since checking
syntax still resembles basic pattern matching. Figure 4.5 shows a DFA which matches the syntax
of the arithmetic expressions. For every input token a transition from the current state is looked
up; if the automaton eventually ends in an accepting state, the syntax is correct.
Nevertheless, this method does not work correctly. As mentioned in [1, 5, 6], the set { a^n b^n : n ≥
0 } is not a regular language and therefore cannot be accepted. The problem lies in the fact that
after reading a certain number of a’s there is no memory to store that number, so the DFA does
not know how many b’s it should read. The same problem is found in the arithmetic
expressions: whenever an opening bracket occurs, a closing bracket should follow (possibly
with further bracket pairs nested inside). This principle is called balanced parentheses. The DFA
in Figure 4.5 can therefore only be used properly if there are no brackets in the source code or
if the parentheses in the source code happen to be balanced, since it cannot detect unbalanced
parentheses itself.
Another reason the DFA cannot be used for proper parsing is the lack of operator precedence:
the automaton only accepts strings with a certain pattern and does not take precedence into con-
sideration.

Figure 4.5: A DFA for parsing the tokens.

These obstacles can be avoided using a context-free grammar (CFG). The grammar in Figure
4.6 is an LL(1)4 grammar which can be used for recursive descent parsing. The grammar is
unambiguous and left recursion has been eliminated; the construction of such a grammar can be
found in [6]. In this implementation the use of a syntax tree is skipped, since the grammar
is left-factored and does not need one. Instead, some semantic actions are embedded
inside the CFG (see Section 4.3.2), which replace the need for a syntax tree.
4 The first L indicates the reading direction (left-to-right), the second L indicates the derivation order (left) and the 1 indicates that there is a one-symbol look-ahead [6].

S → A eof      (4.1)
A → id = E     (4.2)
E → T E'       (4.3)
E' → + T E'    (4.4)
E' → − T E'    (4.5)
E' → ε         (4.6)
T → F T'       (4.7)
T' → ∗ F T'    (4.8)
T' → / F T'    (4.9)
T' → ε         (4.10)
F → ( E )      (4.11)
F → int        (4.12)
F → id         (4.13)

Figure 4.6: Unambiguous LL(1) grammar. Bold tokens indicate terminal symbols.

The automata used for top-down parsing are push-down automata, characterised by a stack
that serves as the memory mentioned earlier. A PDA can push a symbol on top of the stack or
pop the symbol off the top of the stack. What gets pushed or popped is determined by the
input symbol, the current state of the automaton and the symbol on top of the stack, from now
on referred to as the stack-symbol. With this information the automaton can follow a transition
with a corresponding production chosen from a parse table.
Since the CFG is of a recursive nature, meaning that the same non-terminal symbol occurs on the
left as well as on the right side of the arrow (E' → + T E'), determining how an expression is
produced from the grammar can be challenging. To determine which production to apply,
a look-ahead symbol is used: the upcoming symbol from the input stream, which decides which
production to match. On line 10 in Table 4.1 the look-ahead symbol is + and the stack-symbol is
E'. Looking up the + column and the non-terminal E' in the parse table in Table 4.2 gives
production 4, which corresponds to production 4.4 of the grammar in Figure 4.6. So in the next
step (11) in Table 4.1 the stack looks like this: $ eof E' T +. Whenever the top of the stack and
the first symbol of the input are the same, they are matched and both removed from their
stream. Eventually the stack will be empty, containing only a $, meaning the syntax of the
input is correct.

Table 4.1: First twelve steps of top down parsing the sample expression Var = 34 + (5 ∗ 56).

Step | STACK | INPUT | OUTPUT
1  | $ S             | Var = 34 + (5 ∗ 56) |
2  | $ eof A         | Var = 34 + (5 ∗ 56) | S → A eof
3  | $ eof E = id    | Var = 34 + (5 ∗ 56) | A → id = E
4  | $ eof E =       | = 34 + (5 ∗ 56)     | match
5  | $ eof E         | 34 + (5 ∗ 56)       | match
6  | $ eof E' T      | 34 + (5 ∗ 56)       | E → T E'
7  | $ eof E' T' F   | 34 + (5 ∗ 56)       | T → F T'
8  | $ eof E' T' int | 34 + (5 ∗ 56)       | F → int
9  | $ eof E' T'     | +(5 ∗ 56)           | match
10 | $ eof E'        | +(5 ∗ 56)           | T' → ε
11 | $ eof E' T +    | +(5 ∗ 56)           | E' → + T E'
12 | $ eof E' T      | (5 ∗ 56)            | match

Table 4.2: Parse table.

     |  (  |  )  |  +  |  -  |  *  |  /  | int | id  | eof
S    |     |     |     |     |     |     |     |  1  |
A    |     |     |     |     |     |     |     |  2  |
E    |  3  |     |     |     |     |     |  3  |  3  |
E'   |     |  6  |  4  |  5  |     |     |     |     |  6
T    |  7  |     |     |     |     |     |  7  |  7  |
T'   |     | 10  | 10  | 10  |  8  |  9  |     |     | 10
F    | 11  |     |     |     |     |     | 12  | 13  |
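As an illustration of the mechanism described above (not the framework's actual code), a table-driven version of this parse routine could look like the following sketch. It assumes that the parse table of Table 4.2 is available as a dictionary mapping a (non-terminal, look-ahead) pair to a production number, and that the productions of Figure 4.6 are available as a dictionary mapping production numbers to lists of right-hand-side symbols, with ε represented by an empty list.

def ll1_parse(tokens, table, productions, start='S'):
    # Sketch of LL(1) table-driven parsing.
    # tokens: list of terminal identifiers for one expression, ending with 'eof'.
    stack = ['$', start]                  # '$' marks the bottom of the stack
    pos = 0
    while len(stack) > 1:                 # only the '$' marker left means the input is accepted
        top = stack.pop()
        look = tokens[pos] if pos < len(tokens) else '$'
        if (top, look) in table:          # non-terminal on top: expand the predicted production
            rhs = productions[table[(top, look)]]
            for symbol in reversed(rhs):  # push the right-hand side in reverse; ε pushes nothing
                stack.append(symbol)
        elif top == look:                 # terminal on top: match it against the input symbol
            pos += 1
        else:
            raise SyntaxError('Unexpected token %r, expected %r' % (look, top))
    return True

Fed with the token identifiers of the example expression, such a loop would go through steps analogous to those shown in Table 4.1.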

4.2.1 Objectives
As explained, non-regular languages cannot be accepted by finite automata; the solution
is found in CFGs and PDAs. The theoretical example of a number of a’s followed by the same
number of b’s (balanced parentheses) now appears in a real-world problem. The student comes
across this problem when first implementing the syntax analysis using the DFA approach, and
thus experiences the need for more powerful automata. Since a DFA walk-through is already
done in the lexing part, this approach serves as an introduction to the problem rather than a
challenge in itself. It is, however, well suited for producing informative error messages, since
each state reveals which symbols are expected. The goal then is to implement a PDA that makes
use of the CFG and the parse table shown above. Since some elements are provided in the
framework, getting familiar with the code and the data structures used can be another challenge,
as there are multiple possible approaches (see Section 2.2).

4.3 Intermediate phases


At this point, compiling phases one and two have been discussed, namely lexical analysis and
syntax analysis. The remaining phases are discussed in this section.

4.3.1 Type checking


The type checking phase is left out in this version of the compiler. The reason is that the
checks that would be needed are limited and can be dealt with in other ways. Obvious
checks would be: whether the (intermediate) result is a natural number, whether the
left-hand side of a subtraction is greater than or equal to the right-hand side, and whether the
left-hand side of a division is a multiple of the right-hand side. All of this is left out and dealt
with at run time; evaluating these results beforehand essentially amounts to solving the
expression and is therefore not desirable.

4.3.2 Intermediate code generation


For the execution, a distinction between the elements is made at a higher level. Where the +
has an ADD identifier in the tokenized code, the ADD operation in turn has an operator identifier
in the intermediate code. Table 4.3 gives an overview of the intermediate code identifiers.
The distinction simplifies the eventual execution: since values are converted to unary (see
Section 4.3.5), variables are read from memory, operators act on values and assignments write
to memory, a clear distinction between them is needed.

Table 4.3: Intermediate identifiers with corresponding token identifiers.

Intermediate identifier Token identifier


variable ID
value INT
operator ADD, SUB, MUL, DIV
assignment EQU

4.3.3 Register allocation


The register allocation of variables is also done at run time. Whenever the machine comes across
a variable during execution, it checks its simple memory, a Python dictionary; if the variable
occurs in the memory, its value is read and written to the tape. If the variable does not occur in
the memory, it is declared and set to a pending state until the solution of the expression is
found, after which the result is assigned to the pending variable.
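As a rough illustration of this mechanism (not the framework's actual code; the helper write_value_to_tape and the other names are made up for the example), the run-time handling of variables could look like this:

memory = {}          # simple register memory: variable name -> value
pending = None       # name of the variable awaiting the expression's result

def handle_variable(name):
    # Sketch: look the variable up; write its value to the tape or mark it as pending.
    global pending
    if name in memory and memory[name] is not None:
        write_value_to_tape(memory[name])   # assumed helper that writes the unary value
    else:
        memory[name] = None                 # declare the variable
        pending = name                      # it stays pending until [EQU] provides the result

def handle_assignment(result):
    # Sketch of the [EQU] step: bind the computed value to the pending variable.
    memory[pending] = result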

4.3.4 Machine code generation


The parsing phase only reports whether the syntax is valid, and otherwise returns an error to the
user. Before the eventual execution, some bookkeeping has to be done first. The machine code
is defined in a postfix notation5 that results from semantic actions embedded in the CFG. The
actions change the infix notation of the grammar into postfix notation, as can be seen in slide 53
of [9]. Figure 4.7 shows only the lines of the CFG from Figure 4.6 in which actions are
embedded.

A → id = E [EQU]      (4.14)
E' → + T [ADD] E'     (4.15)
E' → − T [SUB] E'     (4.16)
T' → ∗ F [MUL] T'     (4.17)
T' → / F [DIV] T'     (4.18)

Figure 4.7: The LL(1) grammar with embedded semantic actions.

During the parsing phase these actions are also pushed onto the stack. Once an action, a
variable or a value becomes the stack-symbol, it is appended to the postfix array. The postfix array
will eventually hold the machine code for further use. The parse routine of the expression
Var = 1 + 1, using the grammar with embedded actions, can be found in Table 4.4.
5 A notation for writing arithmetic expressions in which the operands appear before their operators. Source: http://www.cs.csi.cuny.edu/~zelikovi/csc326/data/assignment5.htm.

Table 4.4: Top down parsing the sample expression Var = 1 + 1 with embedded semantic
actions.

STACK | INPUT | OUTPUT | POSTFIX
$ S                         | Var = 1 + 1 \n |                    |
$ eof A                     | Var = 1 + 1 \n | S → A eof          |
$ eof [EQU] E = id          | Var = 1 + 1 \n | A → id = E [EQU]   |
$ eof [EQU] E =             | = 1 + 1 \n     | match              |
$ eof [EQU] E               | 1 + 1 \n       | match              |
$ eof [EQU] E' T            | 1 + 1 \n       | E → T E'           |
$ eof [EQU] E' T' F         | 1 + 1 \n       | T → F T'           |
$ eof [EQU] E' T' int       | 1 + 1 \n       | F → int            |
$ eof [EQU] E' T'           | +1 \n          | match and add      | 1
$ eof [EQU] E'              | +1 \n          | T' → ε             | 1
$ eof [EQU] E' [ADD] T +    | +1 \n          | E' → + T [ADD] E'  | 1
$ eof [EQU] E' [ADD] T      | 1 \n           | match              | 1
$ eof [EQU] E' [ADD] T' F   | 1 \n           | T → F T'           | 1
$ eof [EQU] E' [ADD] T' int | 1 \n           | F → int            | 1
$ eof [EQU] E' [ADD] T'     | \n             | match and add      | 1 1
$ eof [EQU] E' [ADD]        | \n             | T' → ε             | 1 1
$ eof [EQU] E'              | \n             | add                | 1 1 [ADD]
$ eof [EQU]                 | \n             | E' → ε             | 1 1 [ADD]
$ eof                       | \n             | add                | 1 1 [ADD] [EQU]
$                           |                | match              | 1 1 [ADD] [EQU]

The result, 1 1 [ADD] [EQU], is in postfix notation. The [EQU] function will eventually
assign the result to a register in memory at execution time, which is why it is the last action.
The [ADD] action adds up the two values in front of it, being 1 and 1. Further use of
these actions as functions is described in Section 4.4.
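In terms of the parse loop of Section 4.2, the postfix creation amounts to appending certain popped symbols to an output array instead of discarding them. The following fragment is only a sketch of that idea; the names are illustrative, and whether matched id's are emitted as well or handled through the pending-variable mechanism of Section 4.3.3 is left to the assignment.

postfix = []
ACTIONS = {'[ADD]', '[SUB]', '[MUL]', '[DIV]', '[EQU]'}

def emit(symbol, lexeme=None):
    # Sketch: called whenever a symbol is popped from the parse stack.
    if symbol in ACTIONS:
        postfix.append(symbol)      # semantic action, e.g. [ADD] from E' -> + T [ADD] E'
    elif symbol == 'int':
        postfix.append(lexeme)      # the matched value itself, e.g. '1'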

4.3.5 Assembly and linking


The intermediate code has to be translated to a machine-specific language; the machine is dis-
cussed in Section 4.4. The machine uses a unary representation to perform arithmetic
operations on. So essentially the machine code generation is the conversion of the decimal rep-
resentation of the numbers in the postfix notation to a unary representation. As a convention,
the elements of the unary code are separated by a 0. As an example, 2 3 in postfix becomes
110111 in unary. After this, the functions in the postfix code operate on the unary numbers.
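A sketch of this conversion step is shown below, assuming the postfix code is a flat list containing decimal values and action strings; the function name and the exact representation are assumptions, not the framework's code.

def to_unary(postfix):
    # Sketch: convert every decimal value in the postfix list to its unary form.
    unary = []
    for item in postfix:
        if isinstance(item, str) and item.isdigit():
            unary.append('1' * int(item))   # e.g. '2' -> '11', '3' -> '111', '0' -> ''
        else:
            unary.append(item)              # actions such as [ADD] are left untouched
    return unary

# Example: the values 2 and 3 become '11' and '111'; written to the tape with a
# separating 0 in between, this gives 110111, matching the example above.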

4.4 Executing
A Turing Machine is used for the final execution of the code. Like the PDA, the TM consists of
a memory and an underlying automaton; however, the memory is represented as a tape: an
infinite tape with cells that are empty or hold a symbol, and with a $ sign (` in most
literature, but harder to use on machines) indicating the start of the tape. For the purpose of
this project, only 1’s and 0’s are needed: the tape is filled with 0’s and the 1’s represent the
unary numbers.
A TM features a memory and a transition table, referred to as the instruction set, which holds the
underlying automaton. A tape head moving above the tape can move right (R), move left (L)
or stay (N) where it is (other literature may have a moving tape instead). When a symbol is
read from the tape, the TM finds the instruction in the column corresponding to the symbol
and the row corresponding to the current state. The idea of performing arithmetic operations
on the tape came from [2] and is adopted in the implementation with some alterations. An
example of performing an arithmetic operation on a tape using a Turing machine is shown in
Table 4.5, and the instruction set of the ADD operation is given in Table 4.6. The bracketed digit
indicates the location of the tape head.

Table 4.5: The ADD operation on a tape containing 110111.

Tape     | State | Rule
11011[1] | 0 | (0, L, 1)
1101[1]0 | 1 | (1, L, 1)
110[1]10 | 1 | (1, L, 1)
11[0]110 | 1 | (0, L, 2)
1[1]0110 | 2 | (1, R, 7)
11[0]110 | 7 | (1, R, 7)
111[1]10 | 7 | (1, R, 8)
1111[1]0 | 8 | (1, R, 8)
11111[0] | 8 | (0, L, 9)
1111[1]0 | 9 | HALT

Table 4.6: The instruction set of an ADD operation

State 0 1
0 (0,L,3) (0,L,1)
1 (0,L,2) (1,L,1)
2 (1,R,4) (1,R,7)
3 (0,L,9) (None)
4 (1,R,5) (None)
5 (0,L,6) (1,R,5)
6 (None) (0,L,9)
7 (1,R,7) (1,R,8)
8 (0,L,9) (1,R,8)

The main difference of the Turing machine in this project is the presence of an additional
memory. With an ordinary TM, the entire input stream is written to the tape and then operated
on; with this implementation the input stream is read and written token by token. For
each identifier, as listed in Table 4.3, there is a different procedure:

Variable A variable read from the stream is looked up in memory. If it occurs in memory, the
value assigned to it is written to the tape; if not, the variable goes into a pending
state, waiting for an assignment from the [EQU] procedure.
Value Whenever a value is read, it is converted to a unary representation and written onto
the tape. If the value is 0, the TM writes a 0 for it as well (again preceded by the separating 0).
After writing a 2 and a 0 the tape looks like: 1 1 0 0.
Operator The corresponding instructions are read from the instruction set and executed on the
contents of the tape, being the last two values.
Assignment The assignment procedure is executed at the end of each expression. The tape
then contains only one value, which is the solution to the expression. The TM reads the unary
number and assigns it to the pending variable in memory.

Note that after any procedure, the tape head should return to the last digit of the last value on
the tape.
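A sketch of how a run method could dispatch on these four procedures is given below. The method and helper names (write_variable, write_value, executables, execute, assign_result) are assumptions made for the illustration, not the provided framework.

def run(self, line):
    # Sketch: process one line of machine code token by token, as described above.
    for identifier, payload in line:              # e.g. ('value', '11') or ('operator', '[ADD]')
        if identifier == 'variable':
            self.write_variable(payload)          # look the variable up or mark it as pending
        elif identifier == 'value':
            self.write_value(payload)             # write the unary number onto the tape
        elif identifier == 'operator':
            self.execute(self.executables(payload))   # run the instruction set on the last two values
        elif identifier == 'assignment':
            self.assign_result()                  # bind the remaining tape value to the pending variable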

4.4.1 The instruction set


In this section some more details on the instruction set are covered. The ADD operation in
Table 4.6 may seem complicated and hard to read. The underlying automaton, shown in
Figure 4.8, gives insight and improves readability. Each transition is labelled with the digit
read by the tape head; the attached text indicates which symbol to write and where to move
next, respectively. The (None) instruction from the ADD operation is not present in the
automaton, since there is no transition for that combination; it exists because there is no way a
1 can occur at that stage of the execution.
Not every rule of the ADD operation in Table 4.6 is used in the execution in Table 4.5. The
unused rules handle exceptions to the ordinary addition, in which both operands are greater
than zero. For example, reading a 0 in the first state directs to the part where the tape head
skips to the first operand (state 3); by skipping the 0 separating the numbers, the tape head
ends in state 9 on the first operand. Another example is state 2, where the tape head is at the
first operand. If that operand is 0, it is replaced by a 1 in the transition to state 4 and the
separating 0 is replaced by a 1 in the transition to state 5; eventually the first 1 is removed
again (states 6 to 9).
One other character that might be read is the $ sign. This happens when the tape head reaches
the start of the tape. However, if the instructions are properly implemented, this will not
occur.

Figure 4.8: The ADD operation in automata form.

4.4.2 Objectives
The student should be able to create a running TM which reads the input source code, reads the
tape and performs operations on the contents of the tape. For each procedure read from the
source an action is performed; these actions are implemented by the student as well.
Besides the provided ADD instruction set, the student has to deliver the sets for the [SUB], [MUL]
and [DIV] procedures. During this process the acquired knowledge is put into practice and
clever thinking is expected. Operations involving a 0, like 3 ∗ 0 but also 0 ∗ 3, need to be handled
and must produce a correct solution. Moreover, the work space for these operations is on the
infinite side of the tape, so a straightforward solution could end up overwriting other contents
on the tape.

CHAPTER 5

Implementation

In this section the eventual implementation is discussed. Note that, due to the purpose of this
project, namely creating assignments, some parts are skipped or explained in less detail, so that
no solutions to the assignments are given away.
Each program has its own I/O management for reading and writing the source code and interme-
diate code, along with appropriate error messages. The arithmetic expression source code is
provided in a text file and is read line by line. The intermediate code is stored as objects
using pickle1 .

5.1 Assignment 1
Assignment 1 consists of two programs: lexer.py and parser.py. An additional module automata.py
provides an FA, created by calling FA(Q, Sigma, delta, q0, F) (the formal definition of an FA). The
source code is stored in a text file called main.prog and is used by the lexer.
The FA object in automata.py has some basic methods:
Initialise Before creating an object, all arguments provided by the FA() call are first checked for
consistency. The start state q0 and the accepting states F have to be part of the set of states Q,
and each transition symbol has to be part of the alphabet Σ. If all arguments are correctly
specified, the states are created. Each state has a name, a transition table, a boolean indi-
cating whether it is the start state and a boolean indicating whether it is an accepting state. Then
the FA object is returned.
Move The move method takes an input symbol and checks whether a transition from the cur-
rent state exists for that symbol. If a transition exists, the state changes according to
the transition and True is returned. If there is no transition for the input symbol, False is
returned.
Reset The reset method resets the current state to the start state q0.
Plot Using graphviz2 , a plot of the automaton is rendered and stored on disk. This is
useful when checking the correctness of the automaton.
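A minimal sketch of such an FA object is given below; it is not the provided automata.py (the consistency checks and the Plot method are omitted) but shows one possible shape of the Move and Reset methods, with the transition table stored as a nested dictionary as in Listing 5.1.

class FA(object):
    # Sketch of the FA object: states, alphabet, transition table, start state and accepting states.
    def __init__(self, Q, Sigma, delta, q0, F):
        # A full implementation would first check that q0 and the states in F are part of Q
        # and that every transition symbol in delta is part of Sigma.
        self.Q, self.Sigma, self.delta = Q, Sigma, delta
        self.q0, self.F = q0, F
        self.current = q0

    def move(self, symbol):
        # Follow the transition for symbol if one exists; return True on success.
        transitions = self.delta.get(self.current, {})
        if symbol in transitions:
            self.current = transitions[symbol]
            return True
        return False

    def reset(self):
        # Return to the start state q0.
        self.current = self.q0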

5.1.1 Lexer
Once main.prog is loaded, it is divided into lines. The main function calls the lexer, which is
an FA object. The assignment is to write a lexer(M, source) function, taking the FA object and
a source code line as input. The function returns a list of tuples with tokens and corresponding
identifiers, as explained in Section 4.1.
To create the lexer, the arguments for the FA object need to be specified as shown in Listing 5.1.
1 https://docs.Python.org/2/library/pickle.html
2 http://www.graphviz.org

The listing creates the DFA from Figure 4.4. The last line eventually creates the object and thus
the lexer.

Listing 5.1: FA arguments for creating the lexer automaton


Q = ['START', 'ID', 'INT', 'ADD', 'SUB', 'MUL', 'DIV', 'LBT', 'RBT', 'EQU', 'WHITE', 'EOF']
Sigma = ['<digit>', '<character>', '/', '=', '(', ')', '+', '*', '-', '\n', ' ', '\t']
delta = {
    'START': {
        '<character>': 'ID',
        '<digit>': 'INT',
        '+': 'ADD',
        '-': 'SUB',
        '*': 'MUL',
        '/': 'DIV',
        '(': 'LBT',
        ')': 'RBT',
        '=': 'EQU',
        '\n': 'EOF',
        ' ': 'WHITE',
        '\t': 'WHITE'},
    'ID': {
        '<character>': 'ID',
        '<digit>': 'ID'},
    'INT': {
        '<digit>': 'INT'}
}
q0 = 'START'
F = ['ID', 'INT', 'ADD', 'SUB', 'MUL', 'DIV', 'LBT', 'RBT', 'EQU', 'WHITE', 'EOF']

M = FA(Q, Sigma, delta, q0, F)

Once every token has its identifier and no errors have occurred, the tokens are stored in a 2D array
(tokens by lines). When all lines are tokenized, the 2D array is stored with pickle as main.lex.

5.1.2 Parser
As explained in the first part of Section 4.2, a DFA can be used for parsing. The parser uses the
same FA object as the lexer and the arguments are specified in the same fashion as in Listing 5.1,
following the structure of the DFA in Figure 4.5. The assignment is to take the input tokens from
the main.lex file and parse them using the parser, created by calling the 5-tuple FA() function.
Again the main function reads the file line by line and calls the parser. The student should
implement the parser function, which follows the transitions based on the input tokens. The
student also implements an error handler with a message corresponding to the state. If there
is no possible transition from state 2, the error message might be: ’SyntaxError: Expected INT or
ID’. The program only reports whether the file was successfully parsed; no parsed output file is
produced, due to the balanced parentheses problem and the lack of operator precedence discussed
in Section 4.2.

5.2 Assignment 2
For assignment 2, the file PDA.py has to be completed in order to parse the main.lex file properly.
In addition, a CFG object is provided in CFG.py. The object reads the CFG provided by the
user and creates a rule object for every left-hand-side symbol. The readGrammar() method takes
the grammar and checks its validity; the left-hand side is separated from the right-hand side
and, together with the rule number, stored in a rule object.
In the PDA.py file the student has to implement a stack object and a parse function. The stack
object should have an array to hold the symbols and two basic methods to push symbols onto
and pop symbols off the stack; further method implementations are up to the student (a minimal
sketch is given at the end of this section).
The parse function takes the tuple array from main.lex as input and starts parsing. Together with
the CFG object and the created stack object, symbols can be pushed and popped by following
the rules of the grammar. The student should also handle the postfix creation described in Sec-
tion 4.3.4, using a provided type function which models Table 4.3; based on the returned
value, the student has to decide which action to take next.
After parsing the file, the resulting postfix code is stored as main.exe.
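A minimal sketch of such a stack object could look as follows; the exact interface is left to the student, so the method names here are only a suggestion.

class Stack(object):
    # Sketch of the parser stack used by the PDA.
    def __init__(self):
        self.items = ['$']          # bottom-of-stack marker

    def push(self, symbol):
        self.items.append(symbol)

    def pop(self):
        return self.items.pop()

    def top(self):
        return self.items[-1]       # peek at the current stack-symbol

    def empty(self):
        return self.items == ['$']  # only the marker left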

5.3 Assignment 3
The last assignment consists of two files: TM.py and EXE.py. The TM.py file contains a TM object
with functions and methods which have to be implemented by the student. EXE.py
is the main program, which executes the main.exe file from the parse phase. Again, the file is
read and fed line by line to the TM object. The instruction sets are specified in the executables
function in TM.py; this function returns the instruction set for the inputs [ADD], [SUB],
[MUL] and [DIV].
In order to create a functioning TM, the TM object has a register holding the executable code, a
(RAM) memory to store variables, and a tape. The register is a list and the memory uses
a dictionary for easy access. A tape object represents the infinite3 tape; Listing
5.2 shows the tape object in code.

Listing 5.2: The infinite tape object


class Tape(object):
    def __init__(self, init):
        self.default = 0
        self.items = [init]

    def get(self, index):
        if index < len(self.items):
            return self.items[index]
        else:
            return self.default

    def set(self, index, value):
        if index >= len(self.items):
            self.items.extend([self.default] * (index + 1 - len(self)))
        self.items[index] = value

    def __len__(self):
        return len(self.items)

    def __repr__(self):
        return ' '.join(map(str, self.items))
The init method initialises the tape by creating a list and setting the first element, being
the $ sign in this case. The default element is the digit 0; as mentioned in Section 4.4, the tape
is filled with 0’s. To accomplish this infinite tape, the get and set methods handle requests with
indices outside the range of the items list differently. When the index exceeds the range, the
default digit (0) is returned by the get method. The set method extends the items list with the
default digit up to an index outside the range and then writes the value to it.
The student has to implement a run method inside the TM object that handles the input from
the register according to the actions described in Section 4.4. When an operator procedure is read
from the register, an execute function is called, which has to be completed by the student.
3 Limited by the available hardware memory and depending on the machine.

The function executes the instruction set for each procedure. By reading the symbol be-
low the tape head, the function takes the action from the instruction set corresponding to that
symbol and the current state. The action is then performed on the tape; this includes a write, a
move and a state change.
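A sketch of such an execute function is given below. It uses the Tape object from Listing 5.2; the attributes head and state and the dictionary-of-dictionaries format of the instruction set ({state: {symbol: (write, move, next_state)}}) are assumptions made for the illustration, not the provided framework.

def execute(self, instructions):
    # Sketch: run one instruction set (e.g. the ADD table) until no rule applies.
    self.state = 0
    while True:
        symbol = self.tape.get(self.head)              # read the symbol below the tape head
        rule = instructions.get(self.state, {}).get(symbol)
        if rule is None:                               # no rule for this (state, symbol): halt
            break
        write, move, next_state = rule                 # e.g. (0, 'L', 1) from Table 4.6
        self.tape.set(self.head, write)                # write
        if move == 'R':                                # move right, left or stay (N)
            self.head += 1
        elif move == 'L':
            self.head -= 1
        self.state = next_state                        # state change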
Finally, the instruction sets for the [SUB], [MUL] and [DIV] procedures are implemented. The
challenge of this implementation is handling special cases, including for example a 0 operand.
If all works correctly, the eventual content of the tape is the solution to the arithmetic expres-
sion.

CHAPTER 6

Conclusion

This paper described how the practical use of AFL can be found in compiling and executing
code, in this case simple arithmetic expressions in the form of assignments. All three compo-
nents mentioned in Section 1.2, Finite Automata, Push-Down Automata and the Turing Ma-
chine, are covered in the study. The fundamental elements of each automaton are recreated and
applied in the compiling field, using the knowledge acquired by the student. This is done
without overwhelming the student with new information, by restricting the implementation to
AFL elements only, while still giving an insight into the compiler design field. The study shows
that with some abstraction and preparatory work, compiling and executing are suitable for demon-
strating the applicability of AFL.
There are no results of the assignments in practice; however, the assignments closely follow the
literature on AFL and are an extension of the subject more than an addition to it. Therefore the as-
signments should be feasible for students and convenient for teachers, and should not cause any
problems.
The use of Python worked well and posed no issues, noting that it was the only language con-
sidered and that finding the optimal language for AFL subjects was not the goal of this study. The
assignments can in principle be implemented in any high-level programming language, each with
its own advantages and disadvantages.

6.1 Reflection
Although the study worked out as intended, some flaws surfaced. The whole process of
compiling a simple arithmetic expression down to a solution may seem redundant, since there
are other ways of solving this problem. However, using a full programming language instead of
the expressions would have resulted in a more challenging assignment, but one harder to apply to
the TM. Adding the use of variables in the expressions struck a balance between the two.
The constraints of the tape both obstructed and advanced the implementation. Since the tape
resembles computer memory, only two symbols are allowed. This increases the complexity
of the number representation, which resulted in using natural numbers only, which in turn
restricted the set of possible expressions. The advantage, however, was the simplification
of the instruction sets that operate on the tape, since there is no need to handle negative or
floating-point numbers.
In retrospect, the assignments give a basic understanding of how compilers might work. The
approach is very powerful once the automaton and grammar are correctly defined: the code
does not have to change when specifying other languages, only the automaton and grammar
need to be redefined. Keep in mind, though, that some structures, like the parse table, strongly
depend on these automata and grammars. However, by following the steps in the mentioned
literature these structures can easily be created, or even generated with specific tools on the
web1 .
1 http://hackingoff.com/compilers/ll-1-parser-generator

The execution on the TM successfully resulted in the solution to the expression. The approach
gives an insight into how a processor might perform arithmetic operations on a data stream
and shows the complexity of the operations. Given existing hardware, the execution approach
might seem redundant, but again, keep in mind that it illustrates the capabilities of the TM.

Bibliography

[1] Anderson, J. A. (2006). Automata theory with modern applications. Cambridge University
Press.
[2] Cooper, C. (2013). Languages and machines. Notes for Math, 237.
[3] DuSell, B. (2014). pycfg. https://github.com/bdusell/pycfg/. Retrieved May, 2017.

[4] Han, Z. D., Kocijan, K., and Lopina, V. (2016). Python as pseudo language for formal lan-
guage theory. In MIPRO 2016-Computers in education (CE).
[5] Kozen, D. C. (2012). Automata and computability. Springer Science & Business Media.
[6] Mogensen, T. Æ. (2009). Basics of compiler design. Torben Ægidius Mogensen.

[7] Perrin, D. (2003). Automata and formal languages. Université de Marne-la-vallée, pages 1–22.
[8] Webber, A. B. (2008). Formal language: A practical introduction. Franklin, Beedle & Associates
Inc.
[9] Wu, F. (2012). Syntax-directed translation. http://slideplayer.com/slide/8701450/. Re-
trieved May, 2017.

