languages
The unix philosophy applied to language design, for GPLs and DSLs
Federico Tomassetti
This book is for sale at http://leanpub.com/create_languages
This is a Leanpub book. Leanpub empowers authors and publishers with the Lean Publishing
process. Lean Publishing is the act of publishing an in-progress ebook using lightweight tools and
many iterations to get reader feedback, pivot until you have the right book and build traction once
you do.
4. Writing a lexer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
Why using ANTLR? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
The plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
Configuration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
The Lexer grammar for MiniCalc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
The Lexer grammar for StaMac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
5. Writing a parser . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
The parser grammar for MiniCalc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
The parser grammar for StaMac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
Testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39
7. Symbol resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Example: reference to a value in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Example: reference to a type in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
Example: reference to a method in Java . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Resolving symbols in MiniCalc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
Resolving symbols in StaMac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
Testing the symbol resolution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70
8. Typesystem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
Typesystem rules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
Let's see the code . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Typesystem for MiniCalc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Typesystem for StaMac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
9. Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Validation for MiniCalc . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
Validation for StaMac . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
Write to me . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 230
1. Motivation: why do you want to build language tools?
In this book we are going to see how to build tools to support languages.
There are two different scenarios in which you may want to do that:
1. You want to create a new language: maybe a general purpose language (GPL), maybe a domain specific language (DSL). In any case you may want to build some support for this language of yours. Maybe you want to generate C and compile the generated code, maybe you want to interpret it. Maybe you want to build a compiler or a simulator for your language. Or you want to do all of this stuff and more.
2. You want to create an additional tool for an existing language. Do you want to perform static analysis on your Java code? Or build a translator from Python to JavaScript? Maybe a web editor for some less known language?
In both scenarios building the right tools to support a language can make a difference. The tools you use can make or break your experience with that language. They can make a crucial difference by supporting your programming, or any other intellectual activity you carry out using your language, or instead hinder your every move and put all sorts of limitations on what you can achieve.
The case of Domain Specific Languages is very easy to defend: if you create a language for a specific goal you will end up with a language tailored for a set of tasks, and it will arguably be better at supporting those than some generic language. Think about some notable DSLs like HTML, LaTeX, or SQL. You could define documents using some program written in C to draw on the screen the information you want to display, or to generate some PDF document to distribute. While C could also be used to write these kinds of applications, it is not a language designed for this goal, so it will be more complicated to use, for this goal, than HTML or LaTeX. People would be required to learn much more to write simple documents. Also, many more things could go wrong: when writing an HTML document it is pretty difficult to have a memory leak or to dereference a null pointer. You have less power, but also fewer things to consider. You can analyze more easily the things you write with your DSL. You can build special tools for the very specific tasks you need to accomplish with your languages.
The case for General Purpose Languages is different: building a language that is better than the existing General Purpose Languages is not an easy task. Many try and end up with languages that are not powerful enough, or give up when they realize that designing a language is far from easy. On the other hand, from time to time someone succeeds and builds a GPL that works well for him. Or for his team. Or for a small community. Or a language that changes how we program. Think about the influence a language like Ruby had in the last decade. A language can make a difference, or can be better than existing ones, for many different reasons. You can build a GPL that is really good at a specific aspect, like Go, which is famous for being good at concurrency and is thus a good choice for networking. But there are also other good reasons in addition to the technical ones, such as educational or artistic ones. This is typically the case for the creation of esoteric programming languages: languages created to make a point. Building a GPL is one of those challenges which attract a significant percentage of the most talented developers. Maybe you are just one of them, or you aspire to be one of them, and you want to give it a try. If this challenge appeals to you, even if you don't leave a mark in the history of computer science you are still going to have fun.
There is a good reason for building a language today that was not true before: it's easier than ever. Now the barriers to creating a language, and making it usable by sane persons, are significantly lower than they used to be. In this book I will try to demonstrate why I think this is the case.
First of all there are ecosystems like the JVM or the CLR: if you build your language to be compatible with one of those you get access to tons of libraries from day one. Frameworks like LLVM also make it possible to build very efficient languages with a much lower effort than was required in the past. There are great frameworks and libraries you can reuse. In this book, for example, we are going to use ANTLR to generate most of our lexers and parsers. You can also build your editor as a plugin for well known IDEs like Eclipse or IntelliJ.
So it is a great time to take your first steps as a Language Engineer.
Summary
We have seen that there are different reasons to build language machinery and different things that can be achieved. However, there are some common tools and principles that are shared. We will look into those and see how to apply them pragmatically to get concrete results. At the end of the book you should have learnt an approach that you can adapt to produce systems that you can understand and extend.
2. The general plan
In this book we are going to see how to build machinery for your languages.
These include:

• parsers
• compilers
• code generators
• static analysis tools
• editors
• simulators
In other words we are going to see how to implement all sorts of tools that would make working with a language productive.
We are not going to discuss in detail how to design languages. While there will be some comments here and there in the book, I think there is no better way to learn design principles than by building things. So you shouldn't expect theoretical dissertations on the merits of this or that paradigm: we are going to learn how to build stuff in practice, and with practice you will form your own ideas. We will also see different kinds of languages, and you will be able to see the merits of different approaches and decide what makes sense in your case.
Philosophy
Building software is a complex task. It would be easy to spend a whole life working on one single problem. Think about the amount of effort that went into producing a parser generator like ANTLR, or the thousands of man-years poured into building a Java compiler or the IDEs used by most developers.
If you want to build all the machinery for a language and build all of this to high quality you need
to adopt the UNIX philosophy to reach your goal: take simple, high quality components and
combine them together in smart ways.
This is exactly what we are going to do: we are going to look at components that we can reuse
and combine. For our strategy to work, we need to select components which are not just of high
quality, but also that can be combined easily. Components with very large requirements or very
complex interfaces are not good candidates. Components that do one thing well, and have a simple
architecture are the ideal ones.
MiniCalc
This will be a toy language, created to show us how to work with expressions. This language would be of limited use in practice, but it will be helpful to start introducing the basics of building languages.
The language will permit the definition of inputs and variables. It will be possible to execute one MiniCalc module, specifying the values for the inputs. The execution will consist of evaluating all the expressions and then executing the print statements, producing an output visible to the user.
We will support:
Particularities:
Example:
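The example itself did not survive in this copy. A representative MiniCalc module, reconstructed from the rectangle example whose parse tree is shown in Chapter 5, would look like this:

```
input Int width
input Int height
var area = width * height
print("A rectangle #{width}x#{height} has an area #{area}")
```

Here width and height are inputs provided when the module is executed, area is a variable computed by evaluating its expression, and the print statement uses string interpolation to produce the output.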
MiniCalcFun
When implementing an interpreter we will enrich MiniCalc by adding support for functions. This variant will be named MiniCalcFun, because creativity is not really my strong suit. We will also allow nested functions. This will be useful to discuss scoping.
2. The example languages we are going to build 7
StaMac
This language will permit to represent state machines. It will be useful to see how to work with a
different execution model when compared to the classical procedural one.
A state machine starts in a specific state and when it receives an event it moves to a different state. When entering or leaving a state it can execute specific actions.
StaMac will permit the definition of inputs for our State Machines, so that they are configurable. State machines will also have variables, a list of events to which they can react and a list of states.
Of all the states one will be marked as the start state. A state will have a name and specify to which events it will react and to which states it will move. It will also specify the actions to execute on entering and leaving that state.
Consider a state machine used to represent some piece of equipment producing physical items. This state machine will start as turned off. We will send it a command (an event) to turn it on. Later we could increase or decrease the speed. The machine will support three speeds: still, low speed, high speed. We could also simulate the fact that time passes without anything happening: we will do that by sending the event doNothing. This state machine will be configurable: we can specify how many items it produces while in low speed or high speed mode.
In StaMac this state machine could be written as:
statemachine mySm

input lowSpeedThroughtput: Int
input highSpeedThroughtput: Int

var totalProduction = 0

event turnOff
event turnOn
event speedUp
event speedDown
event emergencyStop
event doNothing

start state turnedOff {
    on turnOn -> turnedOn
}

state turnedOn {
    on turnOff -> turnedOff
    on speedUp -> lowSpeed
}

state lowSpeed {
    on entry {
        totalProduction = totalProduction + lowSpeedThroughtput
        print("Producing " + lowSpeedThroughtput + " elements (total " + totalProduction + ")")
    }
    on speedDown -> turnedOn
    on speedUp -> highSpeed
    on doNothing -> lowSpeed
}

state highSpeed {
    on entry {
        totalProduction = totalProduction + highSpeedThroughtput
        print("Producing " + highSpeedThroughtput + " elements (total " + totalProduction + ")")
    }
    on speedDown -> lowSpeed
    on emergencyStop -> turnedOn
    on doNothing -> highSpeed
}
Part I: the basics
We are going to see the basic building blocks for building language tools.
We are going to work using several examples presented in Chapter 2.
The languages are deliberately simple because here we want to show the principles without getting caught up in too many nitty-gritty details and corner cases.
At the end of Part I you will know the basics needed to build a model from the raw code of your language. You will be able to validate such a model, to resolve references and to calculate the type of the different expressions. At that point you will be ready to move to the next steps.
4. Writing a lexer
When we start analyzing the code of our language we get an entire file to process. The first step is to break that big file into a list of tokens. Divide et impera is a principle that has worked for some millennia and keeps being valid.
To split a file into tokens we will build a lexer. The lexer is the piece of code that takes a textual document and breaks it into tokens. Tokens are portions of text with a specific role.
Our tokens could be:

• numeric literals
• string literals
• comments
• keywords
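For example, the MiniCalc statement `var a = 1` is broken into the following sequence of tokens (ignoring whitespace) — this is exactly what the tests at the end of this chapter verify:

```
var a = 1   →   VAR  ID  ASSIGN  INTLIT  EOF
```

Each token carries both a type (like VAR or INTLIT) and the portion of text it covers.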
The plan
We are going to look into how to set up our project and then we will see the lexer grammars for MiniCalc and StaMac, which are both described in Chapter 2, the one presenting the example languages we are going to work with throughout the book.
Configuration
As a first thing we will need to set up our project. We are going to use Gradle as our build system, but any build system would work. At this stage we will just need to:

• be able to invoke ANTLR to generate the lexer code from the lexer grammar. We will generate a Java class, but ANTLR supports many other targets
• compile the code generated by ANTLR

I typically start by setting up a new local git repository (git init) and a Gradle wrapper (gradle wrapper). The wrapper is just a small script that installs locally a specific version of Gradle, so when we share the project anyone will be able to use the wrapper, and the wrapper will take care of installing Gradle for the specific platform of our user.
Then I create a gradle build file (build.gradle). My build file looks like this:
buildscript {
    // The version of Kotlin I am using, soon moving to 1.1
    ext.kotlin_version = '1.0.6'

    repositories {
        mavenCentral()
        maven {
            name 'JFrog OSS snapshot repo'
            url 'https://oss.jfrog.org/oss-snapshot-local/'
        }
        jcenter()
    }

    dependencies {
        classpath "org.jetbrains.kotlin:kotlin-gradle-plugin:$kotlin_version"
    }
}

apply plugin: 'kotlin'

// ...

idea {
    module {
        sourceDirs += file("generated-src/antlr/main")
    }
}
PLUS : '+' ;
MINUS : '-' ;
ASTERISK : '*' ;
DIVISION : '/' ;
ASSIGN : '=' ;
LPAREN : '(' ;
RPAREN : ')' ;

// Identifiers
ID : [_]*[a-z][A-Za-z0-9_]* ;

STRING_OPEN : '"' -> pushMode(MODE_IN_STRING) ;

UNMATCHED : . ;

mode MODE_IN_STRING;

ESCAPE_STRING_DELIMITER : '\\"' ;
ESCAPE_SLASH : '\\\\' ;
ESCAPE_NEWLINE : '\\n' ;
ESCAPE_SHARP : '\\#' ;
STRING_CLOSE : '"' -> popMode ;
INTERPOLATION_OPEN : '#{' -> pushMode(MODE_IN_INTERPOLATION) ;
STRING_CONTENT : ~["\n\r\t\\#]+ ;

STR_UNMATCHED : . -> type(UNMATCHED) ;

mode MODE_IN_INTERPOLATION;

INTERPOLATION_CLOSE : '}' -> popMode ;

INTERP_WS : [\t ]+ -> skip ;

// Keywords
INTERP_AS : 'as' -> type(AS) ;
INTERP_INT : 'Int' -> type(INT) ;
INTERP_DECIMAL : 'Decimal' -> type(DECIMAL) ;
INTERP_STRING : 'String' -> type(STRING) ;

// Literals
INTERP_INTLIT : ('0'|[1-9][0-9]*) -> type(INTLIT) ;
INTERP_DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ -> type(DECLIT) ;

// Operators
INTERP_PLUS : '+' -> type(PLUS) ;
INTERP_MINUS : '-' -> type(MINUS) ;
INTERP_ASTERISK : '*' -> type(ASTERISK) ;
INTERP_DIVISION : '/' -> type(DIVISION) ;
INTERP_ASSIGN : '=' -> type(ASSIGN) ;
INTERP_LPAREN : '(' -> type(LPAREN) ;
INTERP_RPAREN : ')' -> type(RPAREN) ;

// Identifiers
INTERP_ID : [_]*[a-z][A-Za-z0-9_]* -> type(ID) ;

INTERP_STRING_OPEN : '"' -> type(STRING_OPEN), pushMode(MODE_IN_STRING) ;

INTERP_UNMATCHED : . -> type(UNMATCHED) ;
Preamble
We start by specifying that this is a lexer grammar. Using ANTLR we could also define parser grammars or combined grammars (containing a lexer and a parser in one file).
We also specify that we want to use an extra channel, in addition to the default one. You can imagine channels as conveyor belts: you put tokens on different channels so that different consumers are free to consider or ignore them. We will see more when looking at whitespace.
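The preamble itself is not included in the excerpt above. Based on the rules that follow, which send whitespace to a channel named WHITESPACE, it presumably looks something like this sketch:

```
lexer grammar MiniCalcLexer;

// declare the extra channel, in addition to the default one
channels { WHITESPACE }
```

The channels declaration must appear before any rule, and the channel name can then be used in lexer commands like channel(WHITESPACE).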
Whitespace
In our language the newlines are relevant while spaces are not. We will therefore ignore spaces most of the time. They will however be useful when performing syntax highlighting, so we will not just throw them away: we will put them into a separate channel, from which we can retrieve them when we need them.
// Whitespace
NEWLINE : '\r\n' | '\r' | '\n' ;
WS : [\t ]+ -> channel(WHITESPACE) ;
Keywords and ID
Defining keywords is pretty simple: we just have to pay attention to the fact that typically the rule for identifiers could match most, if not all, of the keywords. This might become an issue because in ANTLR, when a piece of text can match more than one rule, the one defined first is chosen. The solution is just to put the ID rule after all the keywords and you are good to go.
Also, notice that our ID rule specifies that an ID cannot start with a capital letter.
// Keywords
INPUT : 'input' ;
VAR : 'var' ;
PRINT : 'print' ;
AS : 'as' ;
INT : 'Int' ;
DECIMAL : 'Decimal' ;
STRING : 'String' ;

// Identifiers
ID : [_]*[a-z][A-Za-z0-9_]* ;
According to this rule the following are valid identifiers:

_____a______
a99
foo_99_

while these are not:

__A
A
99a
Numeric Literals
Our language is very simple and it permits the manipulation of just numbers and strings. Our number literals are very simple:

// Literals
INTLIT : '0'|[1-9][0-9]* ;
DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ ;

Our string literals are much more involved because we support interpolation. Let's see them in the next paragraph.
String
Typically lexers are not context sensitive; however, in some cases it makes sense to build them to be context sensitive. In this way we can have simple rules that apply only in a given context. For example, when we are inside a string we want to recognize sequences like \n, while these are not relevant outside strings.
In ANTLR we achieve this by using modes: as we open a string we enter the mode MODE_IN_STRING.
mode MODE_IN_STRING;

ESCAPE_STRING_DELIMITER : '\\"' ;
ESCAPE_SLASH : '\\\\' ;
ESCAPE_NEWLINE : '\\n' ;
ESCAPE_SHARP : '\\#' ;
STRING_CLOSE : '"' -> popMode ;
INTERPOLATION_OPEN : '#{' -> pushMode(MODE_IN_INTERPOLATION) ;
STRING_CONTENT : ~["\n\r\t\\#]+ ;

STR_UNMATCHED : . -> type(UNMATCHED) ;
We have all the escape sequences and then we have STRING_CLOSE. When we match it we go back to the mode we were in when we entered MODE_IN_STRING (typically the default mode).
We can also enter into another mode: MODE_IN_INTERPOLATION.
Finally all the other characters (excluding newlines, which are illegal in strings) are just STRING_CONTENT.
Interpolation
When we are in interpolation mode we can basically write all the expressions we can write at the top level. For this reason we have to duplicate several rules:
mode MODE_IN_INTERPOLATION;

INTERPOLATION_CLOSE : '}' -> popMode ;

INTERP_WS : [\t ]+ -> skip ;

// Keywords
INTERP_AS : 'as' -> type(AS) ;
INTERP_INT : 'Int' -> type(INT) ;
INTERP_DECIMAL : 'Decimal' -> type(DECIMAL) ;
INTERP_STRING : 'String' -> type(STRING) ;

// Literals
INTERP_INTLIT : ('0'|[1-9][0-9]*) -> type(INTLIT) ;
INTERP_DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ -> type(DECLIT) ;

// Operators
INTERP_PLUS : '+' -> type(PLUS) ;
INTERP_MINUS : '-' -> type(MINUS) ;
INTERP_ASTERISK : '*' -> type(ASTERISK) ;
INTERP_DIVISION : '/' -> type(DIVISION) ;
INTERP_ASSIGN : '=' -> type(ASSIGN) ;
INTERP_LPAREN : '(' -> type(LPAREN) ;
INTERP_RPAREN : ')' -> type(RPAREN) ;

// Identifiers
INTERP_ID : [_]*[a-z][A-Za-z0-9_]* -> type(ID) ;

INTERP_STRING_OPEN : '"' -> type(STRING_OPEN), pushMode(MODE_IN_STRING) ;

INTERP_UNMATCHED : . -> type(UNMATCHED) ;
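To see the modes in action, consider how the string "hi #{1 + 2}" would be tokenized according to these rules. The lexer starts in the default mode, switches to MODE_IN_STRING on the opening quote, and to MODE_IN_INTERPOLATION on #{:

```
"     →  STRING_OPEN            (pushMode MODE_IN_STRING)
hi    →  STRING_CONTENT
#{    →  INTERPOLATION_OPEN     (pushMode MODE_IN_INTERPOLATION)
1     →  INTLIT                 (matched by INTERP_INTLIT, retyped via type())
+     →  PLUS                   (matched by INTERP_PLUS)
2     →  INTLIT
}     →  INTERPOLATION_CLOSE    (popMode, back to MODE_IN_STRING)
"     →  STRING_CLOSE           (popMode, back to the default mode)
```

Thanks to the type() commands, the parser only ever sees the ordinary token types (INTLIT, PLUS, and so on), regardless of which mode produced them.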
This is not ideal and it is one of the very few things I do not like about ANTLR. Unfortunately we do not live in an ideal world, so I guess we have to cope with it. Anyway, we are getting a full lexer by writing a few tens of lines of definitions, so probably we should not complain too much. All things considered, the advantages clearly outweigh this drawback.
Unmatched
There are characters that are not allowed in certain positions. For example, you cannot put a dollar symbol outside a string in MiniCalc. Normally you may want to just throw an error when you meet such characters. However, you want to handle those characters differently when doing syntax highlighting: those characters need to be considered, and maybe colored in red to give feedback to the user. This is why we have rules to produce an UNMATCHED token in all modes.
UNMATCHED : . ;
STR_UNMATCHED : . -> type(UNMATCHED) ;
INTERP_UNMATCHED : . -> type(UNMATCHED) ;
package examples

import me.tomassetti.minicalc.MiniCalcLexer
import org.antlr.v4.runtime.ANTLRInputStream
import org.antlr.v4.runtime.Token
import java.io.FileInputStream
import java.io.StringReader

fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

fun readExampleCode() = FileInputStream("examples/rectangle.mc").bufferedReader().use { it.readText() }

fun main(args: Array<String>) {
    val lexer = lexerForCode(readExampleCode())
    var token : Token? = null
    do {
        token = lexer.nextToken()
        val typeName = MiniCalcLexer.VOCABULARY.getSymbolicName(token.type)
        val text = token.text.replace("\n", "\\n").replace("\r", "\\r").replace("\t", "\\t")
        println("L${token.line}(${token.startIndex}-${token.stopIndex}) $typeName '$text'")
    } while (token?.type != -1)
}
L4(89-94) ID 'height'
L4(95-95) INTERPOLATION_CLOSE '}'
L4(96-108) STRING_CONTENT ' has an area '
L4(109-110) INTERPOLATION_OPEN '#{'
L4(111-114) ID 'area'
L4(115-115) INTERPOLATION_CLOSE '}'
L4(116-116) STRING_CLOSE '"'
L4(117-117) RPAREN ')'
L4(118-118) NEWLINE '\n'
L5(119-118) EOF '<EOF>'
START : 'start' ;
STATE : 'state' ;
ON : 'on' ;
ENTRY : 'entry' ;
EXIT : 'exit' ;

// Identifiers
ID : [_]*[a-z][A-Za-z0-9_]* ;

// Literals
INTLIT : '0'|[1-9][0-9]* ;
DECLIT : ('0'|[1-9][0-9]*) '.' [0-9]+ ;
STRINGLIT : '"' ~["]* '"' ;

// Operators
PLUS : '+' ;
MINUS : '-' ;
ASTERISK : '*' ;
DIVISION : '/' ;
ASSIGN : '=' ;
COLON : ':' ;
LPAREN : '(' ;
RPAREN : ')' ;
LBRACKET : '{' ;
RBRACKET : '}' ;
ARROW : '->' ;

UNMATCHED : . ;
There are many similarities between this lexer grammar and the previous one. This is not by accident but rather typical, because there are common elements present in many languages:

• keywords
• literals
• operators
• the UNMATCHED rule
• comments
• whitespace

There are however also some differences:

• in this language we do not support string interpolation, therefore our string literal rule is way simpler and it does not involve using different modes
• we have two channels because we support comments, which are not supported in MiniCalc
• in this language newlines are not meaningful, so we send them to the same channel as whitespace
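The comment and whitespace rules themselves are not included in the excerpt above. Based on the description, they presumably look something like this sketch (the comment syntax and the channel names are assumptions, not taken from the book's grammar):

```
channels { WHITESPACE, COMMENTS }

// newlines are not meaningful in StaMac, so they go to the whitespace channel
WS : [\t\r\n ]+ -> channel(WHITESPACE) ;

// hypothetical line-comment syntax
COMMENT : '//' ~[\r\n]* -> channel(COMMENTS) ;
```

Sending comments to their own channel, rather than skipping them, keeps them available for tools like syntax highlighters and documentation generators.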
Testing
Of course we want to start off on the right foot and begin writing tests for our language machinery. We started by writing a lexer, so we will start our testing efforts from there.
What is a lexer supposed to do? Take a string and return a list of tokens. Let's build our tests to verify it does that correctly.
package me.tomassetti.minicalc

import lexerForCode
import tokensContent
import tokensNames
import kotlin.test.assertEquals
import org.junit.Test as test

// Utilities included only for completeness

fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

fun tokensNames(lexer: MiniCalcLexer): List<String> {
    val tokens = LinkedList<String>()
    do {
        val t = lexer.nextToken()
        when (t.type) {
            -1 -> tokens.add("EOF")
            else -> if (t.type != MiniCalcLexer.WS) tokens.add(lexer.vocabulary.getSymbolicName(t.type))
        }
    } while (t.type != -1)
    return tokens
}

// End of utilities, here the real test code starts

class MiniCalcLexerTest {

    @org.junit.Test fun parseVarDeclarationAssignedAnIntegerLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1")))
    }

    @org.junit.Test fun parseVarDeclarationAssignedADecimalLiteral() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "DECLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1.23")))
    }

    @org.junit.Test fun parseVarDeclarationAssignedASum() {
        assertEquals(listOf("VAR", "ID", "ASSIGN", "INTLIT", "PLUS", "INTLIT", "EOF"),
                tokensNames(lexerForCode("var a = 1 + 2")))
    }

    @org.junit.Test fun parseMathematicalExpression() {
        assertEquals(listOf("INTLIT", "PLUS", "ID", "ASTERISK", "INTLIT", "DIVISION", "INTLIT", "MINUS", "INTLIT", "EOF"),
                tokensNames(lexerForCode("1 + a * 3 / 4 - 5")))
    }

    @org.junit.Test fun parseMathematicalExpressionWithParenthesis() {
        assertEquals(listOf("INTLIT", "PLUS", "LPAREN", "ID", "ASTERISK", "INTLIT", "RPAREN", "MINUS", "DECLIT", "EOF"),
                tokensNames(lexerForCode("1 + (a * 3) - 5.12")))
    }

    @org.junit.Test fun parseCast() {
        assertEquals(listOf("ID", "ASSIGN", "ID", "AS", "INT", "EOF"),
                tokensNames(lexerForCode("a = b as Int")))
    }

    @org.junit.Test fun parseSimpleString() {
        assertEquals(listOf("STRING_OPEN", "STRING_CONTENT", "STRING_CLOSE", "EOF"),
                tokensNames(lexerForCode("\"hi!\"")))
    }

    @org.junit.Test fun parseStringWithNewlineEscape() {
they have the right content and the right position. We could potentially write a couple of such tests, just to verify we do not get any surprises. However, normally what matters is the type of the tokens returned; the rest should just work as expected, because ANTLR is very mature and battle-tested.
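As a sketch of what such a position-checking test could look like, reusing the lexerForCode helper defined above (the assertions rely on the standard ANTLR Token API; note that in ANTLR lines are 1-based while columns are 0-based):

```kotlin
@org.junit.Test fun checkFirstTokenPosition() {
    val lexer = lexerForCode("var a = 1")
    val first = lexer.nextToken()
    assertEquals("var", first.text)            // the right content
    assertEquals(1, first.line)                // lines are 1-based
    assertEquals(0, first.charPositionInLine)  // columns are 0-based
}
```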
Summary
Building a lexer is not very difficult, but there are a few things you need to pay attention to if you want to get it right. My advice is to use ANTLR, because it is a powerful tool which we can adapt to many contexts. In this chapter we have seen how to use it with two different languages and how to test our lexer. We also discussed some aspects to consider in order to make our lexer usable by the other components, like the parser or the editor.
Now it is time to move to the next component: the parser.
5. Writing a parser
We have seen how to organize the characters of our text into tokens. Now these tokens can be organized in a more structured form: the parse tree.
A boring but needed note about terminology: in some situations people refer to the tree produced by a parser as the parse tree, while in others they refer to it as the Abstract Syntax Tree. In this book we call the tree produced by ANTLR the parse tree. In the following chapters we are going to see how to refine and transform the parse tree to obtain a second tree. We will call that transformed tree the Abstract Syntax Tree (AST).
The picture below shows the whole process: from code to the AST.
We have already seen how to obtain a list of tokens by using a lexer. Now, to get the parse tree, we are going to use an ANTLR parser. The ANTLR parser is generated by ANTLR according to a parser grammar. In this chapter we are going to build such a grammar.
In the parser grammar we will refer to the terminals, or token types, defined in the lexer grammar: NEWLINE, VAR, ID, and the like.
antString
     | INTERPOLATION_OPEN expression INTERPOLATION_CLOSE # interpolatedValue ;

type : INT     # integer
     | DECIMAL # decimal
     | STRING  # string ;
This is the code we are going to use to print the parse tree:
///
/// Parsing
///

fun lexerForCode(code: String) = MiniCalcLexer(ANTLRInputStream(StringReader(code)))

fun parseCode(code: String) : MiniCalcParser.MiniCalcFileContext = MiniCalcParser(CommonTokenStream(lexerForCode(code))).miniCalcFile()

///
/// Transform the Parse Tree in a string we can print on the screen
///

abstract class ParseTreeElement {
    abstract fun multiLineString(indentation : String = ""): String
}

class ParseTreeLeaf(val type: String, val text: String) : ParseTreeElement() {
    override fun toString(): String {
        return "T:$type[$text]"
    }

    override fun multiLineString(indentation : String): String = "${indentation}T:$type[$text]\n"
}

class ParseTreeNode(val name: String) : ParseTreeElement() {
    val children = LinkedList<ParseTreeElement>()
    fun child(c : ParseTreeElement) : ParseTreeNode {
        children.add(c)
        return this
    }

    override fun toString(): String {
        return "Node($name) $children"
    }

    override fun multiLineString(indentation : String): String {
        val sb = StringBuilder()
        sb.append("${indentation}$name\n")
        children.forEach { c -> sb.append(c.multiLineString(indentation + " ")) }
        return sb.toString()
    }
}

fun toParseTree(node: ParserRuleContext, vocabulary: Vocabulary) : ParseTreeNode {
    val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
    node.children.forEach { c ->
        when (c) {
            is ParserRuleContext -> res.child(toParseTree(c, vocabulary))
            is TerminalNode -> res.child(ParseTreeLeaf(vocabulary.getSymbolicName(c.symbol.type), c.text))
        }
    }
    return res
}

///
/// Invoking the parser and printing the parse tree
///

fun main(args: Array<String>) {
    // readExampleCode is a simple function that reads the code of our example file
    println(toParseTree(parseCode(readExampleCode()), MiniCalcParser.VOCABULARY).multiLineString())
}
MiniCalcFile
  Line
    InputDeclarationStatement
      InputDeclaration
        T:INPUT[input]
        Integer
          T:INT[Int]
        T:ID[width]
    T:NEWLINE[
]
  Line
    InputDeclarationStatement
      InputDeclaration
        T:INPUT[input]
        Integer
          T:INT[Int]
        T:ID[height]
    T:NEWLINE[
]
  Line
    VarDeclarationStatement
      VarDeclaration
        T:VAR[var]
        Assignment
          T:ID[area]
          T:ASSIGN[=]
          BinaryOperation
            VarReference
              T:ID[width]
            T:ASTERISK[*]
            VarReference
              T:ID[height]
    T:NEWLINE[
]
  Line
    PrintStatement
      Print
        T:PRINT[print]
        T:LPAREN[(]
        StringLiteral
          T:STRING_OPEN["]
          ConstantString
            T:STRING_CONTENT[A rectangle ]
          InterpolatedValue
            T:INTERPOLATION_OPEN[#{]
            VarReference
              T:ID[width]
            T:INTERPOLATION_CLOSE[}]
          ConstantString
            T:STRING_CONTENT[x]
          InterpolatedValue
            T:INTERPOLATION_OPEN[#{]
            VarReference
              T:ID[height]
            T:INTERPOLATION_CLOSE[}]
          ConstantString
            T:STRING_CONTENT[ has an area ]
          InterpolatedValue
            T:INTERPOLATION_OPEN[#{]
            VarReference
              T:ID[area]
            T:INTERPOLATION_CLOSE[}]
          T:STRING_CLOSE["]
        T:RPAREN[)]
    T:EOF[<EOF>]
You can see that we have reused some definitions while others are very similar.
In MiniCalc the top rule was defined to recognize a list of lines. In StaMac the top rule is instead
defined to organize the code in two areas: the first is the preamble, while the second is a list of
states. After that we expect the end of the file, represented by the special terminal EOF.
The preamble is defined as the terminal SM (corresponding to the keyword statemachine), an ID
representing the name of the state machine, and finally a list of preambleElements. A preambleElement
can be an event declaration, an input declaration, or a variable declaration. By defining the rules in
this way we allow users to mix event, input, and variable declarations in any order they want.
However, all these definitions must precede the ones relative to the states.
In MiniCalc we were using the newline as a terminator of each line, while in StaMac we ignore
newlines.
In StaMac we also have an optional element: the keyword start (terminal START). It can appear at
the beginning of a state declaration.
Note also that the rule statement could be written equivalently as:
This is the form we obtain by replacing print and assignment with their definitions. This alternative
form would produce a slightly simpler parse tree, but I prefer the original one because I find it more
readable. We will later process the parse tree to obtain the abstract syntax tree, so there is nothing
to gain in sacrificing readability just to affect the exact shape of the parse tree we will obtain.
Testing
Ok, we have defined our parser; now we need to test it. In general, I think we need to test a parser in three
ways:
1. Verify that all the code we need to parse is parsed without errors
2. Ensure that code containing errors is not parsed
3. Verify that the shape of the resulting AST is the one we expect
In practice the first point is the one I tend to insist on the most. If you are building a parser
for an existing language, the best way to test your parser is to try parsing as much code as you can,
verifying that all the errors found correspond to actual errors in the original code, and not to errors
in the parser. Typically I iterate over this step multiple times to complete my grammars.
The second and third points are refinements on which I work once I am sure my grammar can
recognize everything.
In this simple case, we will write simple test cases to cover the first and the third point: we will
verify that some examples are parsed and we will verify that the AST produced is the one we want.
It is a bit cumbersome to verify that the AST produced is the one you want. There are different ways
to do that; in this case I chose to generate a string representation of the AST and verify it is the
same as the expected one. It is an indirect way of testing that the AST is the one I want, but it is much
easier for simple cases like this one.
This is how we produce a string representation of the AST:
    }

    override fun multiLineString(indentation : String): String {
        val sb = StringBuilder()
        sb.append("${indentation}$name\n")
        children.forEach { c -> sb.append(c.multiLineString(indentation + "  ")) }
        return sb.toString()
    }
}

fun toParseTree(node: ParserRuleContext) : ParseTreeNode {
    val res = ParseTreeNode(node.javaClass.simpleName.removeSuffix("Context"))
    node.children.forEach { c ->
        when (c) {
            is ParserRuleContext -> res.child(toParseTree(c))
            is TerminalNode -> res.child(ParseTreeLeaf(c.text))
        }
    }
    return res
}
class MiniCalcParserTest {

    @org.junit.Test fun parseAdditionAssignment() {
        assertEquals(
"""MiniCalcFile
  Line
    AssignmentStatement
      Assignment
        T[a]
        T[=]
        BinaryOperation
          IntLiteral
            T[1]
          T[+]
          IntLiteral
            T[2]
    T[<EOF>]
""",
                toParseTree(parseResource("addition_assignment", this.javaClass)).multiLineString())
    }

    @org.junit.Test fun parseSimplestVarDecl() {
        assertEquals(
"""MiniCalcFile
  Line
    VarDeclarationStatement
      VarDeclaration
        T[var]
        Assignment
          T[a]
          T[=]
          IntLiteral
            T[1]
    T[<EOF>]
""",
                toParseTree(parseResource("simplest_var_decl", this.javaClass)).multiLineString())
    }

    @org.junit.Test fun parsePrecedenceExpressions() {
        assertEquals(
"""MiniCalcFile
  Line
    VarDeclarationStatement
      VarDeclaration
        T[var]
        Assignment
          T[a]
          T[=]
          BinaryOperation
            BinaryOperation
              IntLiteral
                T[1]
              T[+]
              BinaryOperation
                IntLiteral
                  T[2]
                T[*]
                IntLiteral
                  T[3]
            T[-]
            IntLiteral
              T[4]
    T[<EOF>]
""",
                toParseTree(parseResource("precedence_expression", this.javaClass)).multiLineString())
    }

}
Summary
We have seen how to build a simple lexer and a simple parser. Many tutorials you can find online
stop there. We are instead going to move on and build more tools from our lexer and parser. We
have laid the foundations; now we have to move on to the rest of the infrastructure. Things will start
to get pretty interesting.
6. Mapping: from the parse-tree to the Abstract Syntax Tree
In this chapter we are going to see how to process and transform the information obtained from the
parser. The ANTLR parser recognizes the elements present in the source code and builds a parse tree.
From the parse tree we will obtain the Abstract Syntax Tree, on which we will perform validation
and from which we will produce compiled code.
Our goal here is to obtain a new tree which satisfies three requirements:
Why? Because we will need to be able to perform several operations on this tree, traversing and
transforming it easily. The kind of operations that we are going to perform are based on the semantic
content of the code, not on its syntactic structure. The syntax guided us in producing the parse tree
and it has now served its purpose: time to move on to semantics. Are you confused by this discussion
about syntax vs. semantics? Do not worry, I am going to throw a lot of code at you and show what I mean
in practice.
In other words we will build a model of our code to simplify the hard work that follows, so that the
hard work becomes a walk in the park.
For this reason every node of the AST will implement this interface:
interface Node {
    val position: Position?
}
A Node represents every possible node of an AST and it is general. We can reuse it across the different
languages that we may want to create.
The most important operation that we want to be able to perform on each node is to navigate through
it and all its descendants. In particular we want to have the ability to define an operation and execute
it for all nodes of an AST. To do that we will define Node.process.
This takes a Node and looks at all its properties. It finds the children by identifying those properties
that have as value a Node or a collection of Nodes.
What about performing an operation only on nodes of a certain kind? Easy!
We just invoke process and for each Node we traverse we check if it corresponds to the expected
type. In that case we execute the given operation on it.
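The code of process and specificProcess is not listed above; the following is a minimal sketch of how they could be implemented, based on the description (the reflection details and the example node types are assumptions, Node is reduced to a bare interface, and the kotlin-reflect library is required):

```kotlin
import kotlin.reflect.full.memberProperties

// Simplified Node, without the position property, to keep the sketch self-contained
interface Node

// Execute the operation on this node and on all of its descendants. Children are
// found by reflection: every property whose value is a Node, or a collection
// containing Nodes, is treated as a child.
fun Node.process(operation: (Node) -> Unit) {
    operation(this)
    this::class.memberProperties.forEach { property ->
        when (val value = property.getter.call(this)) {
            is Node -> value.process(operation)
            is Collection<*> -> value.filterIsInstance<Node>().forEach { it.process(operation) }
        }
    }
}

// Execute the operation only on the nodes of the given class
fun <T : Node> Node.specificProcess(klass: Class<T>, operation: (T) -> Unit) {
    process { if (klass.isInstance(it)) operation(klass.cast(it)) }
}

// Hypothetical node types, just to exercise the two functions
data class IntLit(val value: String) : Node
data class SumExpression(val left: Node, val right: Node) : Node

fun main() {
    val ast = SumExpression(IntLit("1"), SumExpression(IntLit("2"), IntLit("3")))
    var all = 0
    ast.process { all++ }
    var literals = 0
    ast.specificProcess(IntLit::class.java) { literals++ }
    println("$all nodes, of which $literals are literals") // 5 nodes, of which 3 are literals
}
```

Note that this approach pays the cost of reflection at every traversal; for the small trees of our toy languages this is perfectly acceptable.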
Node position
The Node interface has exactly one property: the position. The position represents, well, the position
of the node in the original code. It will be useful when we need to show some message to the
user, for example about an error we found. To do so we want to be able to indicate a position in the
code, like line 3, column 10 to 20.
These are the classes we will use to define the position: Position and Point.
A Point is a pair of a line and a column, while a Position is a portion of code defined by two
extremes: two points.
Here are their definitions and some operations that will be useful:
/**
 * Utility function to create a Position
 */
fun pos(startLine: Int, startCol: Int, endLine: Int, endCol: Int) = Position(Point(startLine, startCol), Point(endLine, endCol))
Or we may want to check if a Node comes before or after another node, considering their
corresponding position in the code:
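The full definitions of Point and Position are not listed here; a plausible sketch, including a comparison operator to support the before/after check (the isBefore name and the Comparable implementation are assumptions):

```kotlin
// A Point identifies a single location in the code: a line and a column.
// In the ANTLR convention lines start at 1 and columns at 0.
data class Point(val line: Int, val column: Int) : Comparable<Point> {
    override fun compareTo(other: Point): Int =
            if (line != other.line) line - other.line else column - other.column

    fun isBefore(other: Point): Boolean = this < other
}

// A Position is a portion of code delimited by two extremes
data class Position(val start: Point, val end: Point)

// Utility function to create a Position
fun pos(startLine: Int, startCol: Int, endLine: Int, endCol: Int) =
        Position(Point(startLine, startCol), Point(endLine, endCol))

fun main() {
    val declaration = pos(1, 0, 1, 10)
    val usage = pos(3, 4, 3, 9)
    // The declaration ends before the usage begins
    println(declaration.end.isBefore(usage.start)) // true
}
```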
event myEvent

state aState {
    on myEvent -> myOtherState
}
During the parsing phase an identifier is just an identifier. In our AST we want instead to recognize
the references and treat them differently from the identifiers used to name things. In particular
we want to be able to resolve those references: we want to get a pointer from the reference to the
declared element it is referring to. This will make implementing some operations much easier.
Let's start by defining an interface which will mark the things having a name:
interface Named {
    val name: String
}
Now, not everything that is Named would necessarily be a Node, because there could be external
elements which we could refer to from our code and which are not defined by code: for example
compiled classes or external resources.
How will we resolve references? Simply by passing a list of named things and trying to find a match:
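The code of the reference class is not shown above; a sketch of how ReferenceByName and tryToResolve could look, consistent with their later uses (the implementation details are assumptions):

```kotlin
interface Named {
    val name: String
}

// A reference to a Named element. The referred field is the only mutable state in
// the whole model: it starts as null and is filled in during symbol resolution.
class ReferenceByName<N : Named>(val name: String, var referred: N? = null) {

    // Try to find, among the candidates, a declaration with a matching name.
    // The first match wins, so callers can order candidates by precedence.
    fun tryToResolve(candidates: List<N>): Boolean {
        referred = candidates.find { it.name == this.name }
        return referred != null
    }

    override fun toString() = "Ref($name)" + if (referred == null) "[Unsolved]" else "[Solved]"
}

// Hypothetical declaration type, just for the example
data class VarDecl(override val name: String) : Named

fun main() {
    val ref = ReferenceByName<VarDecl>("a")
    println(ref.tryToResolve(listOf(VarDecl("b"), VarDecl("a")))) // true
    println(ref.referred) // VarDecl(name=a)
}
```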
Note that references are the only mutable classes we have as part of our model.
//
// MiniCalc main entities
//

data class MiniCalcFile(val statements : List<Statement>, override val position: Position? = null) : Node

interface Statement : Node

interface Expression : Node

interface Type : Node
//
// Types
//

data class IntType(override val position: Position? = null) : Type

data class DecimalType(override val position: Position? = null) : Type

data class StringType(override val position: Position? = null) : Type
Note that these nodes do not carry any relevant information, just their position.
Time to look at the expressions. In the parse tree we used to have a node of type binaryOperation. In
our AST metamodel instead we have four separate node types: SumExpression, SubtractionExpression,
MultiplicationExpression, and DivisionExpression. BinaryExpression is just a marker
interface which acts as a common ancestor for these four operations.
//
// Expressions
//

interface BinaryExpression : Expression {
    val left: Expression
    val right: Expression
}

data class SumExpression(override val left: Expression, override val right: Expression, override val position: Position? = null) : BinaryExpression
Most of the expressions have other nodes as children. A few have instead simple values. They are
ValueReference (which has a property varName of type ReferenceByName<ValueDeclaration>),
and IntLit and DecLit (both have a property value of type String).
Let's look separately at the StringLit. Given that we support interpolated strings in MiniCalc, each
string literal is a sequence of elements which can be constants or interpolated values. For example
"hi #{name}!" will be represented as a StringLit node with three elements: a ConstantStringLitPart
(hi ), an ExpressionStringLitPart (name), and another ConstantStringLitPart (!).
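The node classes for string literal parts are not listed above; they could look like the following sketch (Node and Expression are reduced to bare interfaces, and ValueRef is a hypothetical stand-in for the real reference node):

```kotlin
interface Node
interface Expression : Node

// A string literal is a sequence of parts: plain text and interpolated expressions
data class StringLit(val parts: List<StringLitPart>) : Expression

interface StringLitPart : Node

data class ConstantStringLitPart(val content: String) : StringLitPart

data class ExpressionStringLitPart(val expression: Expression) : StringLitPart

// Hypothetical reference node, just for the example
data class ValueRef(val name: String) : Expression

fun main() {
    // "hi #{name}!" becomes three parts
    val lit = StringLit(listOf(
            ConstantStringLitPart("hi "),
            ExpressionStringLitPart(ValueRef("name")),
            ConstantStringLitPart("!")))
    println(lit.parts.size) // 3
}
```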
Time to look at the statements. We introduce the interface ValueDeclaration to represent a common
ancestor for InputDeclaration and VarDeclaration. We need it because our ValueReferences can
refer to either inputs or variables, so we need some node type to indicate both.
Finally we have the four classes implementing Statement.
//
// Statements
//

interface ValueDeclaration : Statement, Named

data class VarDeclaration(override val name: String, val value: Expression, override val position: Position? = null) : ValueDeclaration

data class InputDeclaration(override val name: String, val type: Type, override val position: Position? = null) : ValueDeclaration

data class Assignment(val varDecl: ReferenceByName<VarDeclaration>, val value: Expression, override val position: Position? = null) : Statement

data class Print(val value: Expression, override val position: Position? = null) : Statement
Here we see that we separate the different kinds of children that are part of the preamble into different
groups: inputs, variables, and events. Finally we get a list of states.
//
// Top level elements
//

interface Typed { val type: Type }

interface ValueDeclaration : Node, Named, Typed { }

data class InputDeclaration(override val name: String,
                            override val type: Type,
                            override val position: Position? = null) : ValueDeclaration

data class VarDeclaration(override val name: String,
                          val explicitype: Type?,
                          val value: Expression,
                          override val position: Position? = null) : ValueDeclaration {
    override val type: Type
        get() = explicitype ?: value.type()
}

data class EventDeclaration(override val name: String,
                            override val position: Position? = null) : Node, Named

data class StateDeclaration(override val name: String,
                            val start: Boolean,
                            val blocks: List<StateBlock>,
                            override val position: Position? = null) : Node, Named
As we did for MiniCalc, we have introduced a common ancestor for InputDeclaration and
VarDeclaration. It is named ValueDeclaration. Here we also have an interface named Typed. A
Typed element has a type, obviously. In the case of InputDeclaration the type is always explicitly
present, while in the case of VarDeclaration it can be either explicitly present or inferred by
looking at the type of the initial value.
//
// Interfaces
//

interface StateBlock : Node
interface Statement : Node
interface Expression : Node
interface Type : Node

//
// StateBlocks
//

data class OnEntryBlock(val statements: List<Statement>, override val position: Position? = null) : StateBlock
data class OnExitBlock(val statements: List<Statement>, override val position: Position? = null) : StateBlock
data class OnEventBlock(val event: ReferenceByName<EventDeclaration>,
                        val destination: ReferenceByName<StateDeclaration>,
                        override val position: Position? = null) : StateBlock
For StaMac we also introduced a common ancestor for IntType and DecimalType: NumberType.
//
// Types
//

interface NumberType : Type

data class IntType(override val position: Position? = null) : NumberType

data class DecimalType(override val position: Position? = null) : NumberType

data class StringType(override val position: Position? = null) : Type

//
// Expressions
//

interface BinaryExpression : Expression {
    val left: Expression
    val right: Expression
}

data class SumExpression(override val left: Expression, override val right: Expression, override val position: Position? = null) : BinaryExpression

data class SubtractionExpression(override val left: Expression, override val right: Expression, override val position: Position? = null) : BinaryExpression

data class MultiplicationExpression(override val left: Expression, override val right: Expression, override val position: Position? = null) : BinaryExpression

data class DivisionExpression(override val left: Expression, override val right: Expression, override val position: Position? = null) : BinaryExpression

data class UnaryMinusExpression(val value: Expression, override val position: Position? = null) : Expression

data class TypeConversion(val value: Expression, val targetType: Type, override val position: Position? = null) : Expression

data class ValueReference(val symbol: ReferenceByName<ValueDeclaration>,
                          override val position: Position? = null) : Expression

data class IntLit(val value: String, override val position: Position? = null) : Expression

data class DecLit(val value: String, override val position: Position? = null) : Expression

data class StringLit(val value: String, override val position: Position? = null) : Expression
Expressions look similar to the ones we had in MiniCalc, just the StringLit is much simpler because
we do not have string interpolation in StaMac.
//
// Statements
//

data class Assignment(val variable: ReferenceByName<VarDeclaration>, val value: Expression,
                      override val position: Position? = null) : Statement

data class Print(val value: Expression, override val position: Position? = null) : Statement
Comparing the parse tree with the AST we can notice several differences:

- the AST will have a simpler and nicer API than the classes generated by ANTLR (i.e., the classes composing the parse tree). In the next sections we will see how this API permits performing transformations on the AST
- we will remove elements which are meaningful only while parsing but that are logically useless: for example the parenthesis expression or the line node
- some nodes for which we have separate instances in the parse tree can correspond to a single instance in the AST. This is the case of the type references Int and Decimal, which in the AST are defined using singleton objects
- we can define common interfaces for related node types, like BinaryExpression
- to define how to parse a variable declaration we reuse the assignment rule. In the AST the two concepts are completely separated
- certain operations have the same node type in the parse tree, but are separated in the AST. This is the case of the different types of binary expressions
Let's now see how we can get the parse tree produced by ANTLR and map it into our AST classes.
First we define some utility functions to translate the positions from the way they are expressed in
the parse tree to the way we want to define them in the AST.
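These utility functions are not shown here; a sketch of what they could look like, using the ANTLR Token and ParserRuleContext APIs (the startPoint/endPoint names are assumptions):

```kotlin
import org.antlr.v4.runtime.ParserRuleContext
import org.antlr.v4.runtime.Token

// Reusing the Point/Position classes of our AST
data class Point(val line: Int, val column: Int)
data class Position(val start: Point, val end: Point)

// The start of a token is given directly by ANTLR; the end is derived from the
// start column and the token text (assuming the token does not span multiple lines)
fun Token.startPoint() = Point(line, charPositionInLine)
fun Token.endPoint() = Point(line, charPositionInLine + (text?.length ?: 0))

// Translate the position of a whole parse-tree node; return null when we are
// not interested in positions (e.g., in tests)
fun ParserRuleContext.toPosition(considerPosition: Boolean): Position? =
        if (considerPosition) Position(start.startPoint(), stop.endPoint()) else null
```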
Now we can look at the specific mapping, as implemented for MiniCalc and for StaMac.
Mapping MiniCalc
fun MiniCalcFileContext.toAst(considerPosition: Boolean = false) : MiniCalcFile = MiniCalcFile(this.line().map { it.statement().toAst(considerPosition) }, toPosition(considerPosition))

fun StatementContext.toAst(considerPosition: Boolean = false) : Statement = when (this) {
    is VarDeclarationStatementContext -> VarDeclaration(varDeclaration().assignment().ID().text,
            varDeclaration().assignment().expression().toAst(considerPosition),
            toPosition(considerPosition))
    is AssignmentStatementContext -> Assignment(ReferenceByName(assignment().ID().text), assignment().expression().toAst(considerPosition), toPosition(considerPosition))
    is PrintStatementContext -> Print(print().expression().toAst(considerPosition), toPosition(considerPosition))
    is InputDeclarationStatementContext -> InputDeclaration(this.inputDeclaration().ID().text, this.inputDeclaration().type().toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun ExpressionContext.toAst(considerPosition: Boolean = false) : Expression = when (this) {
    is BinaryOperationContext -> toAst(considerPosition)
    is IntLiteralContext -> IntLit(text, toPosition(considerPosition))
    is DecimalLiteralContext -> DecLit(text, toPosition(considerPosition))
To implement this we have taken advantage of three very useful features of Kotlin:

- smart casts: after we check that an object has a certain class, the compiler implicitly casts it to that type, so that we can use the specific methods of that class
We could come up with a mechanism to derive this mapping automatically for most of the rules and
just customize it where the parse tree and the AST differ. To avoid using too much reflection black
magic we are not going to do that for now. If I were using Java I would just go down the reflection
road to avoid having to manually write a lot of redundant and boring code. However, using Kotlin,
this code is compact and clear.
Mapping StaMac
When mapping the root of the parse tree to the root of the AST for StaMac we remove the preamble
and redistribute its content directly into the StateMachine node. This is because the preamble had a
role from the syntactic point of view but it has no semantic meaning. It was useful to group all the kinds
of declarations that we wanted to have at the top of the file, before the state declarations, but we do
not need to preserve it. Also, the preamble contained a list of preamble elements: input declarations,
variable declarations, and event declarations all mixed together in any order. In the AST we instead
prefer to have three separate lists, so we filter the preamble elements depending on their type. We then
translate each preamble element to its equivalent in the AST and pass the resulting collections to the
StateMachine constructor.
//
// StateMachine
//

fun StateMachineContext.toAst(considerPosition: Boolean = false) : StateMachine = StateMachine(
        this.preamble().name.text,
        this.preamble().elements.filterIsInstance(InputDeclContext::class.java).map { it.toAst(considerPosition) },
        this.preamble().elements.filterIsInstance(VarDeclContext::class.java).map { it.toAst(considerPosition) },
        this.preamble().elements.filterIsInstance(EventDeclContext::class.java).map { it.toAst(considerPosition) },
        this.states.map { it.toAst(considerPosition) },
        toPosition(considerPosition))
The rest of the transformations are not particularly interesting and they follow a basic schema.
//
// Top level elements
//

fun InputDeclContext.toAst(considerPosition: Boolean = false) : InputDeclaration = InputDeclaration(
        this.name.text, this.type().toAst(considerPosition), toPosition(considerPosition))

fun VarDeclContext.toAst(considerPosition: Boolean = false) : VarDeclaration = VarDeclaration(
        this.name.text, this.type()?.toAst(considerPosition), this.initialValue.toAst(considerPosition), toPosition(considerPosition))

fun EventDeclContext.toAst(considerPosition: Boolean = false) : EventDeclaration = EventDeclaration(
        this.name.text, toPosition(considerPosition))

fun StateContext.toAst(considerPosition: Boolean = false) : StateDeclaration = StateDeclaration(
        this.name.text, this.start != null, this.blocks.map { it.toAst(considerPosition) }, toPosition(considerPosition))

//
// StateBlocks
//

fun StateBlockContext.toAst(considerPosition: Boolean = false) : StateBlock = when (this) {
    is EntryBlockContext -> OnEntryBlock(this.statements.map { it.toAst(considerPosition) })
    is ExitBlockContext -> OnExitBlock(this.statements.map { it.toAst(considerPosition) })
    is TransitionBlockContext -> OnEventBlock(ReferenceByName(this.eventName.text),
            ReferenceByName(this.destinationName.text), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Types
//

fun TypeContext.toAst(considerPosition: Boolean = false) : Type = when (this) {
    is IntegerContext -> IntType(toPosition(considerPosition))
    is DecimalContext -> DecimalType(toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Expressions
//

fun ExpressionContext.toAst(considerPosition: Boolean = false) : Expression = when (this) {
    is BinaryOperationContext -> toAst(considerPosition)
    is IntLiteralContext -> IntLit(text, toPosition(considerPosition))
    is DecimalLiteralContext -> DecLit(text, toPosition(considerPosition))
    is StringLiteralContext -> StringLit(text, toPosition(considerPosition))
    is ParenExpressionContext -> expression().toAst(considerPosition)
    is ValueReferenceContext -> ValueReference(ReferenceByName(text), toPosition(considerPosition))
    is TypeConversionContext -> TypeConversion(expression().toAst(considerPosition), targetType.toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

fun BinaryOperationContext.toAst(considerPosition: Boolean = false) : Expression = when (operator.text) {
    "+" -> SumExpression(left.toAst(considerPosition), right.toAst(considerPosition), toPosition(considerPosition))
    "-" -> SubtractionExpression(left.toAst(considerPosition), right.toAst(considerPosition), toPosition(considerPosition))
    "*" -> MultiplicationExpression(left.toAst(considerPosition), right.toAst(considerPosition), toPosition(considerPosition))
    "/" -> DivisionExpression(left.toAst(considerPosition), right.toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}

//
// Statements
//

fun StatementContext.toAst(considerPosition: Boolean = false) : Statement = when (this) {
    is AssignmentStatementContext -> Assignment(ReferenceByName(assignment().ID().text),
            assignment().expression().toAst(considerPosition), toPosition(considerPosition))
    is PrintStatementContext -> Print(print().expression().toAst(considerPosition), toPosition(considerPosition))
    else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
}
class ModelTest {

    @test fun transformVarName() {
        val startTree = MiniCalcFile(listOf(
                VarDeclaration("A", IntLit("10")),
                Assignment("A", IntLit("11")),
                Print(VarReference("A"))))
        val expectedTransformedTree = MiniCalcFile(listOf(
                VarDeclaration("B", IntLit("10")),
                Assignment("B", IntLit("11")),
                Print(VarReference("B"))))
        assertEquals(expectedTransformedTree, startTree.transform {
            when (it) {
It would be much more convenient not to have to define the positions of all the elements of the AST.
So we do not specify the position for the nodes we build manually, and for the AST obtained by
transforming the parse tree we leave considerPosition as false, which is the default value. In this
way the tests are much easier to write.
Summary
With this chapter we conclude our journey from the code to the model: the AST is the model on
which we are going to work. We have designed it to contain only the relevant information, and we have
built functions to operate on it. By doing this we have laid solid foundations on which to build the
next blocks.
7. Symbol resolution
In this short chapter we will see how to resolve symbols.
When we parse our code we obtain a parse tree, which is indeed a tree: a parent is
connected to its children, but there are no other links. The nodes are organized in a strict hierarchy.
By resolving symbols we create new links between references and their declarations. In a sense we
transform our tree into a graph, because the new links are not strictly hierarchical.
These new links are very important because references are just placeholders, which have no
knowledge about the referred elements. We instead need that knowledge when processing code.
For example, if you have a reference to a variable v in an expression, in order to calculate the type
of the expression you need to know the type of v. The reference does not contain that information,
but the original declaration of the variable does. For this reason you want to have a link to navigate
from the reference to the declaration and extract from there all the information you need.
References can be of different types, and resolving them can be more or less complicated depending
on the case. Let's consider some examples from the Java language.
A reference to a value could point to:

- a local variable,
- a method parameter,
- a field of the current class,
- an inherited field,
- a statically imported field
In some cases we could have multiple matches, for example a field and a local variable both having
the name used by a certain reference. We resolve these ambiguities by selecting the most specific
declaration, where most specific in general means closest to the point of usage.
class Foo<A> {
    // here I can refer to A, the type parameter
}

class A {
    class B {
        // here I can refer to A or B
    }
}
import foo.bar.A;
Or we could refer to a class A defined in the same package as the current class.
foo.aMethod(aParam);
The actual method invoked depends on the type of foo: when we have a scope we first of all need to
calculate the type of the scope. If no scope is specified then only methods of the current class
(declared or inherited) or methods imported statically can be invoked.
Secondly, there could be different overloaded versions of the same method, i.e., different methods
with the same name but taking different parameters. There are all sorts of considerations to make. In
general you start by considering the number of parameters, taking into account variadic methods, i.e.,
methods that can accept a variable number of parameters. Then you need to verify whether the type of
each actual parameter is compatible with the type of the corresponding formal parameter of the method
considered. You could also have multiple matches; in that case you need to consider the closest match.
We are not even considering type arguments, lambdas, type inference and other aspects of the
language that make this problem significantly more complex.
So in general resolving symbols is not trivial. However in many cases it is, and it definitely is for the
simple languages we are considering.
Now we can use it to find the parent and the parent of the parent and so on, until we reach the root
of the AST. We will use this mechanism to find an ancestor of a particular type:
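The childParentMap and ancestor functions are not listed above; a self-contained sketch of how they could be implemented (the reflection-based children helper and the example node types are assumptions; an identity map is used so that structurally equal nodes do not collide):

```kotlin
import java.util.IdentityHashMap
import kotlin.reflect.full.memberProperties

interface Node

// Children are found by reflection, as in the process function of chapter 6
fun Node.children(): List<Node> = this::class.memberProperties.flatMap { property ->
    when (val value = property.getter.call(this)) {
        is Node -> listOf(value)
        is Collection<*> -> value.filterIsInstance<Node>()
        else -> emptyList<Node>()
    }
}

// Map each node of the tree to its parent
fun Node.childParentMap(): Map<Node, Node> {
    val map = IdentityHashMap<Node, Node>()
    fun visit(node: Node) {
        node.children().forEach { child ->
            map[child] = node
            visit(child)
        }
    }
    visit(this)
    return map
}

// Walk up through the parents until we find a node of the requested type
fun <T : Node> Node.ancestor(klass: Class<T>, childParentMap: Map<Node, Node>): T? {
    var current = childParentMap[this]
    while (current != null) {
        if (klass.isInstance(current)) return klass.cast(current)
        current = childParentMap[current]
    }
    return null
}

// Hypothetical node types, just to demonstrate the mechanism
data class Ref(val name: String) : Node
data class Statement(val expr: Node) : Node
data class FileNode(val statements: List<Node>) : Node

fun main() {
    val ref = Ref("a")
    val file = FileNode(listOf(Statement(ref)))
    val map = file.childParentMap()
    println(ref.ancestor(Statement::class.java, map)) // prints Statement(expr=Ref(name=a))
}
```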
Now we can use the function ancestor to find the Statement containing a certain ValueReference.
Once we have it, we just look at the statements preceding that one. We select all the ValueDeclarations
(either InputDeclaration or VarDeclaration) and we start looking for a match with our
reference, from the last one to the first one. We do that by reversing the list of preceding value
declarations and passing it to tryToResolve.
```kotlin
fun MiniCalcFile.resolveSymbols() {

    val childParentMap = this.childParentMap()

    // Resolve value reference to the closest thing before
    this.specificProcess(ValueReference::class.java) {
        val statement = it.ancestor(Statement::class.java, childParentMap)!! as Statement
        val valueDeclarations = this.statements.preceedings(statement).filterIsInstance<ValueDeclaration>()
        it.ref.tryToResolve(valueDeclarations.reversed())
    }

    // We need to consider also assignments
}
```
We also have assignments to consider, because they contain a reference to a variable declaration.
They are simpler to handle because they are statements themselves (no need to search for the containing
statement). We will use the same approach of considering only the preceding statements. In this
case we will focus only on VarDeclarations, not all ValueDeclarations, because assignments cannot
refer to InputDeclarations.
```kotlin
this.specificProcess(Assignment::class.java) {
    val varDeclarations = this.statements.preceedings(it).filterIsInstance<VarDeclaration>()
    it.varDecl.tryToResolve(varDeclarations.reversed())
}
```
```kotlin
this.specificProcess(ValueReference::class.java) {
    if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
        errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved", it.position!!))
    }
}

this.specificProcess(Assignment::class.java) {
    if (!it.variable.tryToResolve(this.variables)) {
        errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved", it.position!!))
    }
}
```
We then have transitions. Each transition has two references: one to the event on which we execute
the transition and one to the destination state.
```kotlin
this.specificProcess(OnEventBlock::class.java) {
    if (!it.event.tryToResolve(this.events)) {
        errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(OnEventBlock::class.java) {
    if (!it.destination.tryToResolve(this.states)) {
        errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved", it.position!!))
    }
}
```
```kotlin
class SymbolResolutionTest {

    @test fun resolveValueReferenceToVariableDeclaredBefore() {
        val code = """var a = 1 + 2
                     |var b = 7 * a""".trimMargin("|")
        val ast = MiniCalcAntlrParserFacade.parse(code).root!!.toAst()
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(ValueReference::class.java).size)
        assertEquals(true, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
        assertEquals("a", ast.collectByType(ValueReference::class.java)[0].ref.name)
    }

    @test fun resolveValueReferenceToInputDeclaredBefore() {
        val code = """input Int a
                     |var b = 7 * a""".trimMargin("|")
        // ... (several tests elided) ...
        ast.resolveSymbols()
        assertEquals(1, ast.collectByType(ValueReference::class.java).size)
        assertEquals(false, ast.collectByType(ValueReference::class.java)[0].ref.resolved)
    }

    // more tests to follow

}
```
Summary
The examples we have seen of symbol resolution are rather simple, because we have not used
concepts like inheritance or nested scopes, where a symbol defined in an inner scope can shadow a
symbol defined at an outer level. However the principles remain the same: what changes is just
the way we navigate the AST to identify the referenced elements.
8. Typesystem
I guess you know what a typesystem is and why it is useful, so we can skip the motivational speech
and get down to business.
In this chapter we are going to briefly look at how a typesystem works and then we will move to
the interesting part: how to implement one.
Types
Many languages have pretty similar typesystems. Sure, as supporters of one language or another
we tend to emphasize the differences that make our favorite one so much better (look ma, reified
generics!), but the core is pretty common. We are just taking a look at what works decently well.
When you build your own language you will get a chance to be more creative, but it is always
good to know what worked for others.
Basic Types
There are some basic types you have in most languages:
boolean
types for numbers
character
string
Types for numbers could be divided into types for integers and types for real numbers, and each group
could have different elements depending on the precision.
These types typically get special support in the language, given their role as building blocks.
In some languages string is not a primitive type but normally it has some kind of built-in support
because, you know, strings are the kind of things you may want to use quite frequently.
Declared types
Depending on the language the user could have the possibility of defining new types.
For example:
classes
structs
interfaces
enums
Some languages also permit defining type aliases. These typically are not proper types, just additional
names for existing types. That means that when you do this:
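For instance, in Kotlin-like syntax (myNewType is the name used just below):

```kotlin
// myNewType is not a new type, just another name for Int
typealias myNewType = Int
```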
you define just an alias for int, so you can use int or myNewType interchangeably.
Parametric types
Simple types are types which do not have any sort of parameter. So two instances of that type are
indistinguishable. An int is just an int.
Not all types are like that: think about arrays or collections. An array of int is not the same type as
an array of string. This is because array is a parametric type. Or if you wish, it is not a type at all,
it is more like a type template, something that you can use to create proper types like array of float,
or array of double.
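As a sketch of the idea, a minimal parametric type in Kotlin (Box is an illustrative name, not something from MiniCalc or StaMac):

```kotlin
// Box by itself is a type template: Box<Int> and Box<String> are distinct, proper types
class Box<T>(val content: T)

// this function accepts only the instantiation Box<Int>, not any Box
fun describe(b: Box<Int>) = "a box holding the int ${b.content}"
```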
Subtyping
Types are typically organized in some sort of hierarchy, so that some types are subtypes of other
types.
The classical example is defining the class Cat as a subtype of the class Animal. Or maybe the class
Rectangle as a subtype of the interface Shape.
This could also apply to primitive or built-in types. You could have a type named number, with int
and float being subtypes of number, for example.
In general a subtype should respect the Liskov substitution principle
(https://en.wikipedia.org/wiki/Liskov_substitution_principle). In poor man's terms we
could summarize it as: wherever you can legally use an instance of the supertype, you should be able
to use an instance of the subtype, and the program should still be legal.
Typesystem rules
Here we see how to calculate the types of all the expressions.
The crucial part of implementing a simple typesystem is specifying how to calculate the type of each
expression.
```java
// a float
float f = 0.01f;

// a double
double d = 0.01;
```
In Java there are four types for integer numbers and not all of them are supported in the same way:
```java
byte b = (byte)9999999;
short s = (short)9999999;
int i = 9999999;
long l = 9999999L;
```
By default a number literal is an int; however, the modifier l/L can be used to make the literal a
long. The other types representing integers (byte and short) do not have the same level of support.
Indeed there is simply no way of defining a byte literal or a short literal. All you can do is
define an int or a long value and then cast it to either byte or short. Note that I am not advocating
that the Java typesystem is a good example to follow, I am just discussing a real case.
Some languages also distinguish between signed and unsigned numbers. In that case there are
typically modifiers to indicate literals of one type or the other.
Logical operations
These take boolean operands and produce a boolean result:
logical-not
logical-and
logical-or
Relational operations
In this case you need two elements that are comparable. So you need some logic to understand which
types are comparable with each other. Is it legal to compare 5 to 3.12 in your language? Are strings
comparable?
The result of these operations will always be a boolean.
==
!=
>
>=
<
<=
Collection operations
You may have operators to access or set elements in collections.
For example:
```kotlin
val v1 = myList[0]
val v2 = myMap["Key"]
```
In this case the result of the access is the element type of the collection. It means that if myList is a
collection of float then v1 will be of type float.
Conditional operator
The conditional operator is a sort of concise if that is an expression and not a statement. It is present
in C and Java:
myCondition ? valueIfConditionIsTrue : valueIfConditionIsFalse
What is the type of the result? Well, ideally you want to find the most specific common ancestor of
the types of the two possible return values. It could get tricky.
Consider this case:
myCondition ? 5 : 3.12f
What should be the result? A float? An int? A number, meaning some abstract concept which is a
supertype of both int and float?
These are the kinds of decisions you need to take when designing your typesystem. In this case I would
find it elegant to consider the return type to be a number, but it would probably be more practical
to consider it a float.
Casting
We may want to cast a type to another, either to force a conversion (e.g., from integer to float) or
because we know that a certain value has a more specific type. For example, we could have a value
we got from a parameter of type Object, but we know that the value will always be a String, so we
explain it to the compiler by using a cast.
So the type of:
(someType) anyExpression
is simply someType, the type we cast to.
I want more
This is just a very brief discussion on types. It should be sufficient to get you started and build many
simple but useful languages.
If you are going to build DSLs, you are probably not going to need to build more complex
typesystems, while if you are going to build a General Purpose Language you could need way more
complex stuff.
However if you want to get into the hard stuff you could read:
Types and Programming Languages (https://www.cis.upenn.edu/~bcpierce/tapl/)
Type Theory and Functional Programming (https://www.cs.kent.ac.uk/people/staff/sjt/TTFP/)
Proofs and Types (http://www.paultaylor.eu/stable/Proofs+Types.html)
Our default implementation tells us that a value of a certain type is assignable exclusively to a
variable of the very same type. That means that you can assign a string to a string variable, for
example.
Let's see one exception: for variables of type DecimalType we can accept either values of type
DecimalType or of type IntType, because we can promote an int to a decimal.
Now we are going to see the code for calculating the type of expressions. You are going to be surprised
by how concise it is. Part of the merit goes to Kotlin, which is wonderful for writing this kind of
code:
```kotlin
is SumExpression -> {
    if (this.left.type() is StringType) {
        StringType()
    } else if (onNumbers()) {
        if (left.type() is DecimalType || right.type() is DecimalType)
            DecimalType() else IntType()
    } else {
        throw IllegalArgumentException("This operation should be performed on numbers or start with a string")
    }
}
```
The point is that what we called SumExpression is doing two very different things depending on the
operands:
If on the left we have a string, what we are doing is actually a string concatenation. We convert
whatever is on the right to a string and append it to the string on the left.
If we have numbers as operands then we are actually summing them. The type of the result
will depend on the type of the operands. If at least one operand is a DecimalType then the
result is a DecimalType, otherwise both operands are IntType and the result is also
an IntType.
In all other cases we cannot perform the operation.
Note that in this case we are handling, as part of the typesystem, the fact that we have basically two
different operations using the same syntax. This is ok here because the language is reasonably
simple. In other cases I would prefer to do a transformation on the AST as an intermediate step,
transforming the SumExpression nodes representing string concatenation into nodes of a different
type, so that the rest of the code would be much simpler.
The other mathematical operators are less ambiguous:
This is our cast. The result has the type to which we cast:
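Neither branch is reproduced here, but both rules can be sketched with a self-contained toy model (all the class names below are assumptions modeled on the MiniCalc AST, not the book's actual code):

```kotlin
sealed class Type
object IntType : Type()
object DecimalType : Type()

abstract class Expression { abstract fun type(): Type }
class IntLit : Expression() { override fun type(): Type = IntType }
class DecLit : Expression() { override fun type(): Type = DecimalType }

// For the purely numeric operators the rule is simple:
// if at least one operand is decimal the result is decimal, otherwise int
class MultiplicationExpression(val left: Expression, val right: Expression) : Expression() {
    override fun type(): Type =
        if (left.type() is DecimalType || right.type() is DecimalType) DecimalType else IntType
}

// The type of a cast is just the target type stored in the node
class TypeConversion(val value: Expression, val targetType: Type) : Expression() {
    override fun type(): Type = targetType
}
```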
When we have a ValueReference, the type of the reference is exactly the type of the element being
referred to. So if we refer to the variable a, and a was declared to be an IntType, then our reference
is also an IntType.
We left out a few extension methods we introduced to make the previous code simpler. Let's take a
look at those.
This extension method is useful to figure out if a type represents a number.
We could instead have created an abstract supertype named NumberType and made both IntType and
DecimalType extend it. Then we could have just checked whether a type represents a number by
using the instance-of operator (myType is NumberType). In this case the chosen solution was good
enough and simpler.
We then have another extension method which is also related to numbers. We just want a simple
way to figure out if a BinaryExpression is performed on two number operands.
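Neither extension method is shown above; a plausible sketch in a self-contained toy model (the names isNumberType and onNumbers, and the helper classes, are assumptions inferred from the text):

```kotlin
sealed class Type
object IntType : Type()
object DecimalType : Type()
object StringType : Type()

abstract class Expression { abstract fun type(): Type }
abstract class BinaryExpression(val left: Expression, val right: Expression) : Expression()

class IntLit : Expression() { override fun type(): Type = IntType }
class StrLit : Expression() { override fun type(): Type = StringType }
class Sum(l: Expression, r: Expression) : BinaryExpression(l, r) {
    override fun type(): Type = IntType
}

// true if the type represents a number, without needing a NumberType supertype
fun Type.isNumberType() = this is IntType || this is DecimalType

// true if both operands of a binary expression are numbers
fun BinaryExpression.onNumbers() = left.type().isNumberType() && right.type().isNumberType()
```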
We want to be able to get the type for every ValueDeclaration. In the case of inputs the type is
explicitly defined, so we just return it. In the case of variables it is not: it is inferred from the initial
value.
```kotlin
fun ValueDeclaration.type() =
    when (this) {
        is VarDeclaration -> this.value.type()
        is InputDeclaration -> this.type
        else -> throw UnsupportedOperationException()
    }
```
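For instance, take a MiniCalc snippet along these lines (a reconstruction based on the names mentioned in the surrounding text):

```
input Int myInput
var myVar = "hello"
```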
The type of myInput is found in the original code, while the type of myVar is obtained by calculating
the type of the initial value ("hello" in this case).
In MiniCalc we created an extension method to calculate the type of the symbol referred. In StaMac
it is instead always contained in a field.
The field type comes from the interface Typed. Note that in the case of an InputDeclaration it
is always explicit. In the case of a VarDeclaration instead it can be either explicit or inferred. In
StaMac you can write this:
1 var v1 : Int = 10
2 var v2 = "foo"
The type of v1 would be Int, because it is explictly indicated, while the type of v2 will be
inferred from calculating the type of the initial value ("foo"). This logic is handled directly in
VarDeclaration:
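The class itself is not listed here, but the inference logic can be sketched like this (a guess at the shape based on the surrounding text, not the book's actual code):

```kotlin
sealed class Type
object IntType : Type()
object StringType : Type()

abstract class Expression { abstract fun type(): Type }
class IntLit(val v: Int) : Expression() { override fun type(): Type = IntType }
class StringLit(val v: String) : Expression() { override fun type(): Type = StringType }

interface Typed { val type: Type }

class VarDeclaration(val explicitType: Type?, val value: Expression) : Typed {
    // use the declared type when present, otherwise infer it from the initial value
    override val type: Type
        get() = explicitType ?: value.type()
}
```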
Summary
Typesystems have a scary reputation. And indeed you may need very complex and elaborate typesystems,
which are not trivial to implement. However, it does not have to be the case: unless you need
to be creative you can get away with reapplying some basic patterns common to most languages.
9. Validation
You have parsed your code and built an Abstract Syntax Tree. At this point we can start working on
this Abstract Syntax Tree. The first thing we should do is verify that the code we parsed makes
sense at a semantic level.
The process of lexing and parsing told us whether the code made sense at a syntactical level. If it did not,
maybe we could not even build an Abstract Syntax Tree. But the fact that a piece of code makes sense
at a syntactical level does not necessarily mean it is correct.
Typical semantic errors are:
references that cannot be resolved
incompatible types (e.g., assigning a string value to an int variable)
duplicate names
We could add a severity level, for example to also support warnings. We could also come up with error
codes. But let's keep things simple.
When writing in Kotlin I add the validation as an extension method on the AST root. Note that
in this case all the validation happens at the root level. In more complex cases I would define the
validation at different levels (e.g., class level and method level, if your language has those) and
invoke these more specific validation methods from the validation method of the root node.
In this case we do not want to find two variables with the same name. There are no other named
elements that could clash with variables, so we are considering only one type of node. In other cases
we could want to verify that a name is unique among several types of node. For example, we may
want to prevent having both a variable foo and a function foo, if this leads to ambiguous usages in our
language.
Note that we do not have nested scopes here, only one global scope, so all variables are
defined in the global scope.
As we find variables we check whether their name was already used. If that is the case we produce an
error, otherwise we mark the name as used. This means that if we have two variables with the same
name the error will be associated only with the second one; the first one will be considered correct. I
prefer this approach, while others prefer to show the error on both variables. This is another small
design choice. Small, sure, but they tend to pile up.
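The listing is not reproduced here, but the approach just described can be sketched in a self-contained form (Err and checkDuplicates are illustrative names, not the book's):

```kotlin
class Err(val message: String)

// Walk the declarations in order; only the second and later uses of a name
// produce an error, the first occurrence is considered correct
fun checkDuplicates(variableNames: List<String>): List<Err> {
    val errors = mutableListOf<Err>()
    val seen = HashSet<String>()
    for (name in variableNames) {
        if (!seen.add(name)) {
            errors.add(Err("A variable named '$name' has been already declared"))
        }
    }
    return errors
}
```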
Here we check that all references are resolved. For this validation to succeed we expect the symbol
resolution to have happened as a previous step. In other cases we could explicitly invoke the symbol
resolution during validation (and that is what we do in StaMac).
For MiniCalc we expect someone to have called resolveSymbols:
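The check itself is not shown above; sketched here on a minimal model of a reference-by-name object (Ref and checkAllResolved are assumed names):

```kotlin
// A reference starts unresolved; resolveSymbols fills in 'referred'
class Ref(val name: String, var referred: Any? = null) {
    val resolved get() = referred != null
}

// After resolveSymbols has run, any reference still unresolved is an error
fun checkAllResolved(refs: List<Ref>): List<String> =
    refs.filterNot { it.resolved }
        .map { "A reference to symbol or input '${it.name}' cannot be resolved" }
```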
Finally we need to check that the usage of types is consistent. To do that we verify that, when assigning
a value to a variable, the value has a type compatible with the type of the variable. We do not want
to assign a string value to an int variable. The type of each variable
was inferred from the type of its initial value, like this:
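For example (a reconstructed MiniCalc line):

```
var a = 1 + 2
```

Here the type of a is inferred to be Int from the initial value.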
While the type is not explicit in the code, it is defined for each variable. Yes, we got static typing
without the typical ceremonies.
Note also that we do not strictly need to assign the exact same type the variable has, but only a
type that is compatible. This means that we can assign an int value to a decimal variable: it will be
converted to a decimal value. We cannot do the opposite: we cannot assign a decimal value to an
int variable because that conversion could lead to a loss of information. Of course you can allow that
in your language, if you want.
```kotlin
    // ... (earlier checks of the validate() listing elided) ...
    this.specificProcess(ValueReference::class.java) {
        if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
            errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(Assignment::class.java) {
        if (!it.variable.tryToResolve(this.variables)) {
            errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(OnEventBlock::class.java) {
        if (!it.event.tryToResolve(this.events)) {
            errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
        }
    }
    this.specificProcess(OnEventBlock::class.java) {
        if (!it.destination.tryToResolve(this.states)) {
            errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved", it.position!!))
        }
    }

    // check the initial value is compatible with the explicitly declared type
    this.specificProcess(VarDeclaration::class.java) {
        if (it.explicitType != null && !it.explicitType.isAssignableBy(it.value.type())) {
            errors.add(Error("Cannot assign ${it.explicitType!!} to variable of type ${it.value.type()}", it.position!!))
        }
    }
    // check the type used in assignment is compatible
    this.specificProcess(Assignment::class.java) {
        if (it.variable.resolved) {
            val actualType = it.value.type()
            val formalType = it.variable.referred!!.type
            if (!formalType.isAssignableBy(actualType)) {
                errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!))
            }
        }
    }

    // we have exactly one start state
    if (this.states.filter { it.start }.size != 1) {
        errors.add(Error("A StateMachine should have exactly one start state", this.position!!))
    }

    return errors
}
```
StaMac and MiniCalc have a similar typesystem and similar validation rules. This is not surprising
because we are seeing the most typical patterns for typesystems and validations, and they tend to
be common across many languages. There are a few differences anyway, so let's look at them.
We start by defining a function for checking duplicate names.
In MiniCalc we had only variables to consider. In StaMac we have different kinds of nodes, with
different namespaces. That means that names have to be unique only within a certain kind of node,
so it is ok to have a state and an event with the same name. Maybe it is not a smart idea, but
it is legal in the language. Someone could prefer to forbid it or give a warning to the user. In my
experience people who are designing their first language tend to want to be more in control and
prohibit things like this, while after a while a language designer realizes that they need to provide a
tool to users and get out of the way. To me it does not seem to make sense to give an event and a
state the same name, but a user could have a reason to do that, so unless it is strictly needed
for the consistency of my language I would not prohibit it.
So here is how we check for duplicate names:
Then we verify that all references are resolved. In this case we perform symbol resolution as part of
the validation.
```kotlin
// check references
this.specificProcess(ValueReference::class.java) {
    if (!it.symbol.tryToResolve(this.variables) && !it.symbol.tryToResolve(this.inputs)) {
        errors.add(Error("A reference to symbol or input '${it.symbol.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(Assignment::class.java) {
    if (!it.variable.tryToResolve(this.variables)) {
        errors.add(Error("An assignment to symbol '${it.variable.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(OnEventBlock::class.java) {
    if (!it.event.tryToResolve(this.events)) {
        errors.add(Error("A reference to event '${it.event.name}' cannot be resolved", it.position!!))
    }
}
this.specificProcess(OnEventBlock::class.java) {
    if (!it.destination.tryToResolve(this.states)) {
        errors.add(Error("A reference to state '${it.destination.name}' cannot be resolved", it.position!!))
    }
}
```
As you can see we just call tryToResolve and verify whether it has found a match. If it did not, we add an
error to our list. Note also that in the case of a ValueReference we try twice to resolve the symbol,
first looking for variables with that name and then looking for inputs. The order does not matter
because inputs and variables should have different names: both are ValueDeclarations and we have
verified that in the initial part of our validation method.
Then we check that the Assignments refer to an existing variable. We do not consider inputs here
because inputs cannot be assigned.
Finally we check for every OnEventBlock that both the event and the destination state can be
resolved.
Regarding typesystem consistency, we have the same rule on assignments that we saw in
MiniCalc, plus an additional rule. The additional rule is necessary because in StaMac we
can optionally specify the type of a variable. If we do so, we need to ensure that the initial value is
compatible with the explicitly declared type. This way we will prevent users from writing:
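something like the following (a reconstructed example using the StaMac syntax shown earlier, where the declared type and the initial value do not match):

```
var v : Int = "foo"
```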
```kotlin
// check the initial value is compatible with the explicitly declared type
this.specificProcess(VarDeclaration::class.java) {
    if (it.explicitType != null && !it.explicitType.isAssignableBy(it.value.type())) {
        errors.add(Error("Cannot assign ${it.explicitType!!} to variable of type ${it.value.type()}", it.position!!))
    }
}
// check the type used in assignment is compatible
this.specificProcess(Assignment::class.java) {
    if (it.variable.resolved) {
        val actualType = it.value.type()
        val formalType = it.variable.referred!!.type
        if (!formalType.isAssignableBy(actualType)) {
            errors.add(Error("Cannot assign $actualType to variable of type $formalType", it.position!!))
        }
    }
}
```
There are other semantic checks that can be performed which are more language specific. For
example, in StaMac we need to ensure that we have exactly one start state:
Summary
There are kinds of validation checks that are common to all languages, like typesystem-related checks,
symbol resolution checks or name-duplication checks. They are the bread and butter of validation and
you are going to need them in most of your languages.
Then there is a different category of checks that depends on the specificities of your language. Maybe
you can define classes and each class should have a constructor. Or maybe all variables of type int
should have a name that starts with an i. While these rules will be different in each case, they will
be implemented using similar techniques: navigate the AST, find the nodes you are interested in,
record errors and show them.
In complex languages you could have a multi-level validation: you first resolve symbols, then check
type consistency, then do other semantic checks. With time you will be able to grow more complexity
in your language implementation, but hopefully this should give you the basis to start building
something real, and usable.
Part II: compiling
We have seen how to recognize what the user wrote and verify it is correct. Good.
Now it is time to do something with the information we obtained.
For example we could:
execute the code directly, using an interpreter
compile it to bytecode or to native code
generate something else from it
In this part we are going to see how to do the first two things.
First we are going to see how to build an interpreter, then how to compile to bytecode and finally
how to compile to native code using LLVM.
We will not see how to write a generator, but consider that this can be done either by using some
template engine or by writing an interpreter that prints something to a file. While we will not see an
example, you should know all the techniques necessary to write a generator. Or you can look for the
next post on my blog (https://tomassetti.me).
10. Build an interpreter
In the introduction to Part II we have seen that we have two main ways to execute a piece of code:
building an interpreter or building a compiler.
In the case of an interpreter the code is executed directly, while when you are using a compiler
you have to go through an intermediate step: producing the bytecode or native code that will be
executed. There are different technical aspects to consider, about performance or the ease of
distributing one or the other, but if we put them aside we can obtain very similar results using
either one. That said, writing an interpreter is typically easier than writing a compiler.
Let's put aside the technical considerations for a moment and just consider that by the end
of this chapter we will know how to build interpreters and we will have built two fully working
interpreters.
I love this stage in building a language: it is when the code comes alive and starts doing stuff.
So, enough chatting, let's get down to business.
Symbol table
A symbol table is a data structure you use to track the symbols that are available in a given context.
For example, while you are inside a function your symbol table could contain the parameters of the
function and the local variables.
Typically symbol tables are organized in a stack. What does that mean? It means that when you enter
a more specific context you see new symbols, available only in that context, but you also still see the
more generally available symbols.
Consider a Java program: all code inside a class can access the class fields. When you enter a more
specific context, like a method or an inner class, you get access to more symbols. At the same time
you still have access to the more general symbols, the class fields in this case.
Typically when looking for symbols in a symbol table you first check whether a match is available in
the most specific one. If it is not, you check the parent symbol table, and so on until you reach the
root symbol table. This approach typically leads to the possibility of shadowing. It means that if you
have a global variable named foo and a local variable named foo, where the local variable is available
you will always access that one instead of the global variable. The name foo will always be resolved to
the most local element, making the most generic one inaccessible from within that specific context.
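This lookup strategy can be sketched in a few lines of Kotlin (a toy model, not the book's implementation):

```kotlin
class SymbolTable(private val parent: SymbolTable? = null) {
    private val symbols = mutableMapOf<String, Any>()

    fun define(name: String, value: Any) { symbols[name] = value }

    // check the most specific scope first, then walk up to the parent:
    // a local 'foo' shadows a global 'foo'
    fun lookup(name: String): Any? = symbols[name] ?: parent?.lookup(name)
}
```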
We used the term context, but we could also have used the term scope. For each scope we have a
specific symbol table, connected to a parent symbol table. Examples of scopes:
global scope
class
method/function
for, while, block
Basically every section of code where I can define symbols is typically a scope.
Take the following example:
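The example itself is not reproduced here; a reconstruction consistent with the description below could look like this (the names Main, v, method1 and method2 come from the text):

```java
public class Main {
    static int v = 10; // class field

    static int method1() {
        return v; // no local v: resolved against the class scope, so the field
    }

    static int method2(int v) {
        return v; // the parameter shadows the field with the same name
    }

    public static void main(String[] args) {
        System.out.println(v);           // the field
        System.out.println(method1());   // the field
        System.out.println(method2(99)); // the parameter
    }
}
```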
In Main we refer to the field v. There is no symbol v defined in the symbol table associated with the
method, so we look into its parent: the symbol table for the whole class. There we find the field
v. The same thing happens in method1 when we refer to v. In method2 instead there is
a symbol named v in the local symbol table (the method parameter), so when we refer to v we
refer to this local element, which shadows the field with the same name.
Interpreting expressions
In most languages you will have some form of expression. How do we evaluate them in an
interpreter?
The evaluation of an expression typically produces two things:
a resulting value
side-effects
On side-effects
Side-effects in this case could be the change of a value in the symbol table or the execution of some
statements.
For example, consider these C expressions:
1. (a = b) == 2
2. foo() + 2
The first expression causes the value of a to change in the symbol table. a = b also produces a value
(the value of b, which is the new value of a). This value is compared to 2 and if they are
equal then the whole expression evaluates to true, otherwise to false.
The second expression adds 2 to the result of invoking the function foo. Now, the function foo could
do all sorts of things, like writing on the screen or opening a socket connection. In general it could
execute code that has side-effects.
Because of side-effects we have to evaluate some pieces of our expressions in a specific, predictable
order. This is not necessary for languages which do not have side-effects. Those languages are free
to have things like lazy-evaluation and be opaque with respect to the rules they use to determine
the order in which they process parts of the expressions.
In an interpreter, side-effects other than changing values in a symbol table typically correspond
to calls to runtime libraries or to interfaces representing the outside world. In the implementation of
MiniCalcFun we will use the latter approach, using an interface named SystemInterface.
Resulting value
The key thing we want to get out of evaluating an expression is the resulting value. How it is
obtained depends on the kind of expression. Let's consider some cases:
literals
unary expressions
binary arithmetic expressions
logical expressions
value references
Literals are quite easy to handle: we just have some value that we may have to parse in some
way.
Number literals would need to be parsed and reduced to a canonical internal representation, so
that things like 32, 0x20, 040 and 100000b are recognized to be the same thing, assuming our language
supports specifying numbers in decimal, hexadecimal (0x prefix), octal (0 prefix) and binary
(b suffix) format.
Decimal numbers could instead be expressed in their typical form or in the exponential form. String
literals could have escape sequences that we need to recognize. When evaluating literals we need to
consider these aspects.
Alternatively we could translate literals to a canonical form during the mapping step. In that case
evaluating a literal means just accessing its value, already calculated and stored in the AST node.
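As a sketch, canonicalizing the integer notations mentioned above (0x for hexadecimal, a leading 0 for octal, a b suffix for binary, as assumed by the text; the function name is illustrative):

```kotlin
fun parseIntLiteral(text: String): Int = when {
    text.startsWith("0x") -> text.removePrefix("0x").toInt(16) // hexadecimal
    text.endsWith("b")    -> text.removeSuffix("b").toInt(2)   // binary
    text.startsWith("0") && text.length > 1 -> text.toInt(8)   // octal
    else                  -> text.toInt()                      // decimal
}
```

With this, 32, 0x20, 040 and 100000b all canonicalize to the same internal value.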
Unary expressions are typically also very simple. We can consider the logical negation, the bitwise
negation or the unary minus sign. The only thing to do is to transform the value of the child
expression. For example, if we process -a we first need to evaluate a, then take its value and multiply
it by -1.
Binary arithmetic expressions may need to be calculated differently depending on the types of the operands. Why is that? Because summing two integers and summing two decimals are conceptually the same operation, but for the CPU they can be very different operations. For this reason at one level we may want to represent them as a single construct in our language, with a single AST node type, but in the interpreter we may have to process them differently.
It is not just about differentiating between integers and decimals. Some languages support a wide range of numerical types: byte, short, int, long, long long, float, double. In some cases you have both signed and unsigned variants of these types. Now, executing mathematical operations on a CPU is still one of those fascinating adventures that seems to work decently well, until it surprises you with some apparently absurd result. This is not the place to discuss all the issues you can have with overflows, underflows, and limited precision, but you need to consider that the specific types involved in an operation can lead to different results. For example, dividing 5 by 2, if 5 and 2 are integers, could produce 2 in your language. Or maybe 3, depending on how you do
the rounding. Or maybe you want to produce 2.5, so that the result is not an integer anymore. What should you do? You basically have two strategies:
1. You just use the primitive types of the language in which you are writing the interpreter. It means you could run into all sorts of strange results (hey, summing XXX and YYY produces -123, that is surprising!) but operations are performed fast.
2. You internally represent these values as BigDecimal or something equivalent. That means that all mathematical operations will be very slow but the results will be correct (within some approximation).
What is the best strategy? Well, it depends on what your language is used for. If it is aimed at developers who need to write fast code, go for the first one. If you are building the language for non-developers, or if it will be used in safety-critical or mission-critical applications, go for the second.
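To see the difference concretely, here is a small comparison of the two strategies. This is a sketch, using Kotlin's Double for the first strategy and java.math.BigDecimal for the second:

```kotlin
import java.math.BigDecimal

// Strategy 1: host-language primitives. Fast, but with the usual
// floating point surprises (0.1 + 0.2 is not exactly 0.3).
fun sumAsPrimitive(a: Double, b: Double): Double = a + b

// Strategy 2: exact decimal arithmetic. Slower, but 0.1 + 0.2
// really is 0.3.
fun sumAsBigDecimal(a: String, b: String): BigDecimal =
    BigDecimal(a).add(BigDecimal(b))
```

With primitives, sumAsPrimitive(0.1, 0.2) does not compare equal to 0.3, while sumAsBigDecimal("0.1", "0.2") is exactly BigDecimal("0.3").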
Logical expressions: the typical ones are logical-and, logical-or, and logical-xor. Depending on the users of your language, they could have different expectations on how these are evaluated. Developers typically expect short-circuit evaluation. What does that mean? It means that you evaluate the first operand and, if you can already determine the result, you do not evaluate the second one. So for a logical-and b you:
evaluate a
if a is true -> you evaluate b (and its value is the result)
if a is false -> you return false without evaluating b
Why does that matter? Because evaluating b could have side effects. If b is a function call that prints something on the screen, or changes some value, evaluating it or not would change the behavior of your program. However, if your language does not allow side effects, then all of this is just a performance optimization.
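Here is a sketch of a short-circuiting logical-and, again with hypothetical node classes. The key point is that the right operand is evaluated only when the left one is true:

```kotlin
sealed class Expr
class BoolLit(val value: Boolean) : Expr()
class AndExpression(val left: Expr, val right: Expr) : Expr()
// A literal that records when it is evaluated, to make the
// short-circuit behavior observable.
class TracedLit(val value: Boolean, val trace: MutableList<String>) : Expr()

fun evaluate(e: Expr): Boolean = when (e) {
    is BoolLit -> e.value
    is TracedLit -> { e.trace.add("evaluated"); e.value }
    // short-circuit: if the left operand is false the result is
    // already known and the right operand is never touched
    is AndExpression -> if (!evaluate(e.left)) false else evaluate(e.right)
}
```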
Value references: these are the expressions that give access to the value of a variable, a constant, or a parameter in your code. In foo + 3, foo is a value reference. Basically you evaluate them by taking their value out of the symbol table. That is it. Unless you are supporting accessors: for example in Ruby, when writing bar, bar could be a variable or a method. In the latter case the method would have to be invoked.
Executing statements
Statements permit to execute all sort of operations. They also determine control flow, i.e., what code
you are going to execute. For example a while-loop could make you execute multiple times its body.
http://docs.oracle.com/javase/8/docs/api/java/math/BigDecimal.html
As you will see, executing statements basically means implementing the control flow correctly: most statements are composed of other statements. You just need to know in which order to execute the component statements and the logic wiring them together. For example, in the case of an if statement you have to evaluate the condition; if the result is truthy you execute the statement in the then-branch, otherwise you execute the one in the else-branch.
Statements determine which expressions need to be evaluated. These expressions need to be evaluated in a certain scope, which means we need to use the corresponding symbol table. So we need to pass it around when executing statements. Statements can also modify that symbol table.
Let's see how we could implement a typical set of statements.
Print statement: let's start with something simple. The old, glorious print statement. Basically you need to do two things: 1) evaluate the expression that you want to print; 2) get the result and call the print function or method of the language in which you are building your interpreter. That is basically it. In the implementation of StaMac we will do just that, while in the implementation of MiniCalcFun we will do something slightly more elaborate.
Variable declaration statement: this statement adds a new symbol to the symbol table. It will be available to all following statements. Now, a variable could also have an expression determining its initial value. You typically want to evaluate it first and only then add the new variable. In this way an initialization value cannot refer to the variable itself.
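The evaluate-first-then-insert order can be sketched like this (the SimpleSymbolTable here is a simplified stand-in, not the book's class):

```kotlin
class SimpleSymbolTable {
    private val values = HashMap<String, Any>()
    fun setValue(name: String, value: Any) { values[name] = value }
    fun getValue(name: String): Any =
        values[name] ?: throw RuntimeException("Unknown symbol $name")
}

// Evaluate the initializer first, insert the symbol only afterwards:
// an initializer like `var a = a + 1` fails because `a` does not
// exist yet while the initializer runs.
fun executeVarDeclaration(
    name: String,
    initializer: (SimpleSymbolTable) -> Any,
    symbolTable: SimpleSymbolTable
) {
    val value = initializer(symbolTable)
    symbolTable.setValue(name, value)
}
```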
Expression statement: this is just about evaluating an expression. You typically want to do that for the side effects of that expression. For example in some languages an assignment is an expression (while in others it is a statement). In the languages in which assignment is an expression you may want to put it into an expression statement and execute it, so the assignment is performed and the symbol table is changed as a consequence.
Block statement: a block statement is typically a list of statements to be executed one after the other. It also delimits a new scope, so that variables defined inside the block are visible only inside the block. The way you typically execute it is to define a new symbol table having the current symbol table as parent. You use this new symbol table to execute all the statements which are part of the block. When leaving the block you just go back to the original symbol table, forgetting about the one used inside the block.
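A sketch of this mechanism, with statements modeled as plain lambdas for brevity (a hypothetical illustration, not the book's implementation):

```kotlin
class ScopedSymbolTable(private val parent: ScopedSymbolTable? = null) {
    private val values = HashMap<String, Any>()
    fun setValue(name: String, value: Any) { values[name] = value }
    fun getValue(name: String): Any =
        values[name] ?: parent?.getValue(name)
        ?: throw RuntimeException("Unknown symbol $name")
}

// Execute the block against a child symbol table; when we are done we
// simply keep using the original one, so the block's variables vanish.
fun executeBlock(
    statements: List<(ScopedSymbolTable) -> Unit>,
    current: ScopedSymbolTable
) {
    val blockScope = ScopedSymbolTable(parent = current)
    statements.forEach { it(blockScope) }
}
```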
If statement: we have anticipated this one and it is really as easy as that. You evaluate the condition and, depending on the result, you execute the then portion or the else portion (if present).
While statement: another easy one, just a variation on the if statement. You evaluate the condition; if it is satisfied you execute the statement corresponding to the body of the while and then go back to re-evaluate the condition.
For statement: a for statement as present in C99 or Java is a complex beast. Consider a loop like for (int i = 0; i < 10; i++) { ... }: first of all you execute the statement introducing a new variable (int i = 0). This is executed just once. Then you verify the condition (i < 10). If it evaluates to true you execute the body and then the iteration step (i++). At this point you verify the condition again and keep repeating the same steps until the condition evaluates to false.
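The four steps map directly onto code. Here is a sketch of the control flow an interpreter would implement, with the loop parts passed in as lambdas (a hypothetical illustration):

```kotlin
// Interpreting `for (init; condition; iteration) body`:
fun executeFor(
    init: () -> Unit,
    condition: () -> Boolean,
    iteration: () -> Unit,
    body: () -> Unit
) {
    init()                  // executed just once
    while (condition()) {   // check the condition...
        body()              // ...run the body...
        iteration()         // ...then the step, and check again
    }
}
```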
MiniCalcFun
To spice things up we introduce a variant of MiniCalc which supports functions. It is named MiniCalcFun.
A function must specify its return type and it expects its last statement to be an expression statement. The value of that expression will be the return value of the function.
usExpression
     | STRING_OPEN (parts+=stringLiteralContent)* STRING_CLOSE # stringLiteral
     | INTLIT # intLiteral
     | DECLIT # decimalLiteral
     // new
     | funcName=ID LPAREN (params+=expression (COMMA params+=expression)*)? RPAREN # funcCall ;
In addition to introducing the functionStatement we also make it possible to use an expression as a statement, by adding expressionStatement. We need this because functions return the result of their last statement, which should be an expressionStatement. This also permits having function calls as statements, which we may want to invoke for their side effects (e.g., because they could print something).
This is how we calculate the type of a function call:
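The original listing is not reproduced here, but the essence, sketched with hypothetical class names, is that the type of a function call is simply the declared return type of the function the call resolves to:

```kotlin
// Hypothetical sketch of type calculation for a function call.
enum class Type { INT, DECIMAL, STRING }
class FunctionDeclaration(val name: String, val returnType: Type)
class FunctionCallNode(val referred: FunctionDeclaration?)

fun typeOfFunctionCall(call: FunctionCallNode): Type =
    call.referred?.returnType
        ?: throw IllegalStateException("Unresolved function call")
```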
It would also be possible to infer the return type of the function, but we are not going to do that. Or, as better authors would say, this is left as an exercise for the reader.
In the interpreter for MiniCalcFun we use the class CombinedSymbolTable. This is a class we created to represent a symbol table that can contain elements in two separate namespaces: values and functions. This permits having a value named foo and a function named foo without any problem. In this language it makes sense to do so because the only operation we perform on functions is invoking them. Functions are not first-class citizens in our language: we cannot save a reference to a function in a variable or pass functions around, so they can never be confused with values.
The class CombinedSymbolTable is not particularly complex.
    fun tryToGetFunction(name: String) : F? {
        if (!functions.containsKey(name)) {
            if (parent == null) {
                return null
            } else {
                return parent.tryToGetFunction(name)
            }
        }
        return functions[name]!!
    }

    fun setFunction(name: String, value: F) {
        functions[name] = value
    }

    fun popUntil(function: F): CombinedSymbolTable<V, F> {
        if (this.functions.containsValue(function)) {
            return this
        }
        if (this.parent == null) {
            throw IllegalArgumentException("Function not found: $function")
        }
        return this.parent.popUntil(function)
    }

    override fun toString(): String {
        return "SymbolTable(values=${values.keys}, functions=${functions.keys})"
    }
}
Symbol tables are organized in a stack. Each instance can have a parent. When a value cannot be found in the current symbol table we ask the parent, if present.
We store values and functions separately, so most methods are duplicated and we have separate fields.
hasValue/hasFunction can be used to check if an element is known to a symbol table (directly or through its parents).
getValue/getFunction return the element with the corresponding name or throw an exception. The element is searched for first in this symbol table, then in the stack of ancestor symbol tables.
tryToGetValue/tryToGetFunction try to get the element with the corresponding name or just return null. The element is searched for first in this symbol table, then in the stack of ancestor symbol tables.
setValue/setFunction store a new value in the current symbol table. This could cause a value known by the parent to be shadowed, i.e., to become inaccessible. For example, in a function with a parameter p we would be unable to access a global variable named p, because every time we accessed p we would get the parameter back, never the global variable.
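Shadowing falls naturally out of the lookup order. A simplified single-namespace table (the book's CombinedSymbolTable duplicates this logic for values and functions) shows it:

```kotlin
class ValueTable(private val parent: ValueTable? = null) {
    private val values = HashMap<String, Any>()
    fun setValue(name: String, value: Any) { values[name] = value }
    // Look in the current table first, then climb the parent chain:
    // a local `p` therefore hides any outer `p`.
    fun getValue(name: String): Any =
        values[name] ?: parent?.getValue(name)
        ?: throw RuntimeException("Unknown symbol $name")
}
```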
fun inc(Int p) {
    p + 1
}
inc(3) // invocation at the same level as the function declaration
When we invoke inc we move from the scope inside anotherFunction to the scope of myBigFunction. There the variable j is not visible. When we change scope we should use a different symbol table, because each symbol table represents the elements present in a certain scope.
We can have multiple levels of nested scopes:
In this case, when invoking inc, we do so from a scope (and using a symbol table) specific to wrapper3. This scope is contained in the scope of wrapper2 (so our symbol table has as parent a symbol table for wrapper2). The scope of wrapper2 is contained in the scope of wrapper1, and the scope of wrapper1 is contained in the global scope.
At the point at which we invoke inc we can access all of these variables. But we cannot do that when executing inc. For this reason we go from the symbol table containing the values of wrapper3 directly to the global one, where inc is defined.
Once we have done that we need to create a symbol table for the specific execution of inc, adding the values for the parameters, but we will see this later. This new symbol table will have as parent the symbol table where the function inc is defined.
System Interface
Our language permits to do one thing that affects the outside world: printing. We could just print
on the screen when we interpret a print statement. We will not do directly that, to make our system
more testable.
We will provide an instance of SystemInterface to our interpreter and delegate interactions with
the system to it.
In a real application the implementation of the interface will actually print, while during tests we will capture the strings we would have printed and save them. Later we can add assertions to verify we tried to print exactly what we expected.
interface SystemInterface {
    fun print(message: String)
}

class RealLifeSystemInterface : SystemInterface {

    override fun print(message: String) {
        println(message)
    }

}

class TestSystemInterface : SystemInterface {

    // later we can assert on the content of this property
    val output = LinkedList<String>()

    override fun print(message: String) {
        output.add(message)
    }

}
Interpreter
Let's see the whole code for the interpreter. Later we will comment on it piece by piece.
            if (l is Int) {
                l as Int + r as Int
            } else if (l is String) {
                l as String + r.toString()
            } else {
                throw UnsupportedOperationException(l.toString() + " from evaluating " + expression.left)
            }
        }
        is SubtractionExpression -> {
            val l = evaluate(expression.left, symbolTable)
            val r = evaluate(expression.right, symbolTable)
            if (l is Int) {
                l as Int - r as Int
            } else {
                throw UnsupportedOperationException(expression.toString())
            }
        }
        is MultiplicationExpression -> {
            val l = evaluate(expression.left, symbolTable)
            val r = evaluate(expression.right, symbolTable)
            if (l is Int) {
                l * r as Int
            } else if (l is Double) {
                l * r as Double
            } else {
                throw UnsupportedOperationException("Left is " + l.javaClass)
            }
        }
        is DivisionExpression -> {
            val l = evaluate(expression.left, symbolTable)
            val r = evaluate(expression.right, symbolTable)
            if (l is Int) {
                l / r as Int
            } else if (l is Double) {
                l / r as Double
            } else {
                throw UnsupportedOperationException(expression.toString())
            }
        }
        is FunctionCall -> {
            // SymbolTable: should leave the symbol table until we go at the
            // same level at which the function was declared
When instantiating an interpreter we specify how we interact with the rest of the world (systemInterface) and we also need to provide values for our inputs. Inputs are the mechanism we have to pass parameters to our little algorithm. Better than reading what the user is typing, right?
Our class then contains a symbol table:
This is the global symbol table. It will contain the inputs, the variables, and the functions defined in the global scope.
Then we have the methods that constitute the public interface of our interpreter:
fileEvaluation evaluates an entire file. We execute all the top-level statements using the global symbol table.
singleStatementEvaluation can be used to execute statements one by one. It could be useful for implementing a REPL or maybe a simple debugger.
getGlobalValue takes a value out of the symbol table. We will use it in tests.
The most interesting method is executeStatement which, well, executes a statement using a specified symbol table. When executing top-level statements we pass the global symbol table, but that will not always be the case.
        else ->
            throw UnsupportedOperationException(
                statement.javaClass.canonicalName)
    }
VarDeclaration inserts a value. The actual value is determined by evaluating the initialization expression. Note that we first evaluate the initialization expression and only then insert the resulting value in the symbol table. That means that in the initialization of a variable we cannot refer to the variable itself.
InputDeclaration inserts a value. The value was provided when instantiating the interpreter because it comes from outside: the user could specify it as a parameter on the command line or in a form. When we find the InputDeclaration we make that value available to the program by putting it in the symbol table.
FunctionDeclaration inserts a function. We just take the function and put it in the symbol table. Note that we do not evaluate the body of the function.
Assignment is similar to the declarations because it changes the symbol table. The only difference is that we expect the element to be already present in the symbol table; we just change its value.
Finally, Print evaluates the expression and transforms it into a string. Once it has the string to print, it uses the systemInterface. That interface could actually print something on the screen, log it, or store it to check later in an assertion.
The gist is that statements mainly evaluate expressions and put things into the symbol table. They control what is happening, but most of the action passes through expressions.
Let's see how we evaluate them.
Literals are the building blocks of our expressions and they are easy to deal with:
For IntLit and DecLit we just parse them as Int and Double and we are done. Our string literals are a little more complex because we support inserting expressions into them (i.e., we have string interpolation). So a string literal is really a concatenation of constant parts and of expressions to transform into strings. We evaluate the single parts and join them together without spaces in between. Voilà!
We just need to see how to evaluate the single parts of our string literal:
How do we perform operations? In MiniCalc (and in MiniCalcFun) we support the four basic arithmetic operations, but the same mechanism can be used for all sorts of operations: we calculate the values of the single operands and then we figure out how to combine those values.
For example, in the case of a SumExpression, once we have the left and the right values we may want to sum them as ints, sum them as doubles, or concatenate them as strings, if we have a string on the left.
So:
1 + 2 -> 3
1.1 + 2 -> 3.1
"foo " + 2 -> foo 2
is SumExpression -> {
    val l = evaluate(expression.left, symbolTable)
    val r = evaluate(expression.right, symbolTable)
    if (l is Int) {
        l + r as Int
    } else if (l is Double) {
        l + r as Double
    } else if (l is String) {
        l + r.toString()
    } else {
        throw UnsupportedOperationException(l.toString() + " from evaluating " + expression.left)
    }
}
The other operations are simpler because we do not support string operands: you cannot divide a string, multiply it, or subtract something from it, so we just deal with ints and doubles.
is SubtractionExpression -> {
    val l = evaluate(expression.left, symbolTable)
    val r = evaluate(expression.right, symbolTable)
    if (l is Int) {
        l - r as Int
    } else if (l is Double) {
        l - r as Double
    } else {
        throw UnsupportedOperationException(expression.toString())
    }
}
is MultiplicationExpression -> {
    val l = evaluate(expression.left, symbolTable)
    val r = evaluate(expression.right, symbolTable)
    if (l is Int) {
        l * r as Int
    } else if (l is Double) {
        l * r as Double
    } else {
        throw UnsupportedOperationException("Left is " + l.javaClass)
    }
}
is DivisionExpression -> {
    val l = evaluate(expression.left, symbolTable)
We simply get the value out of the symbol table. That's it.
The FunctionCall is the most complex expression.
is FunctionCall -> {
    // SymbolTable: should leave the symbol table until
    // we go at the same level at which the function
    // was declared
    val functionSymbolTable = CombinedSymbolTable(
        symbolTable.popUntil(expression.function.referred!!))
    var i = 0
    expression.function.referred!!.params.forEach {
        functionSymbolTable.setValue(it.name,
            evaluate(expression.params[i++], symbolTable))
    }
    var result : Any? = null
    expression.function.referred!!.statements.forEach {
        result = executeStatement(it, functionSymbolTable) }
    if (result == null) {
        throw IllegalStateException()
    }
    result as Any
}
We first move up until we find the scope where the function was defined and we get the corresponding symbol table (see the discussion on popUntil in the previous section).
Then we create a new symbol table having as parent the symbol table in which the function is defined. This is our way of entering a more specific scope (the scope inside the function).
In that symbol table we register the values for the parameters. We get their names from the function definition, while their values are evaluated. Pay attention to how we evaluate them: we use the expressions provided in the function call and we evaluate them using the symbol table of the scope in which the function is called, not the scope representing the inside of the function.
At this point all we have to do is execute all the statements composing the body of the function, using the appropriate symbol table. We then take the result of the last statement and use it as the result of our function call.
Ok, that was as tricky as it gets for this interpreter.
Testing
It is time to test our interpreter. By testing it we show how to use it. It should not be too hard to put some UI around it and get a simple REPL or a simulator out of it.
This is the structure of our test case:
class InterpreterTest {

    private var interpreter : MiniCalcInterpreter? = null
    private var systemInterface : TestSystemInterface? = null

    class TestSystemInterface : SystemInterface {

        val output = LinkedList<String>()

        override fun print(message: String) {
            output.add(message)
        }

    }

    fun interpret(code: String) {
        val res = MiniCalcParserFacade.parse(code)
        assertTrue(res.isCorrect(), res.errors.toString())
        val miniCalcFile = res.root!!
        systemInterface = TestSystemInterface()
        interpreter = MiniCalcInterpreter(systemInterface!!)
        interpreter!!.fileEvaluation(miniCalcFile)
    }

    ...
    our test methods
    ...
}
We have the TestSystemInterface we discussed before. In the interpret method we parse the code, assert that it is correct, and interpret it, saving the systemInterface and the interpreter as fields of the class. Later in the tests we access them to validate our assertions.
Let's look at some convoluted code. This example is very useful for showing how you should not write code. Incidentally, it is also useful to check that our interpreter can resolve values and functions correctly, using the ones defined closest to the point where they are used.
There are two functions named f. When invoking f(3) - f(a) we are referring to the innermost one, while when we invoke f(a + 1) + f(a + 2) we invoke the outer one. Inside the outer function f, references to a are resolved to the parameter a, while outside that function they are resolved to the variable a.
StaMac
Looking at MiniCalcFun we have seen how to implement a typical imperative language with
statements and expressions. In the case of StaMac we have a different execution model, based on
state machines so there are some differences. However the part related to the expressions is quite
similar.
This is the whole code we need to interpret StaMac files.
        values[name] = value
    }
}

class Interpreter(val stateMachine: StateMachine, val inputsValues: Map<InputDeclaration, Any>) {
    var currentState : StateDeclaration = stateMachine.states.find { it.start }!!
    val symbolTable = SymbolTable()
    var alive = true

    init {
        stateMachine.inputs.forEach { symbolTable.writeByName(it.name, inputsValues[it]!!) }
        stateMachine.variables.forEach { symbolTable.writeByName(it.name, it.value.evaluate(symbolTable)) }
        executeEntryActions()
    }

    fun variableValue(variable: VarDeclaration) = symbolTable.readByName(variable.name)

    fun receiveEvent(event: EventDeclaration) {
        if (!alive) {
            println("[Log] Receiving event ${event.name} after exiting")
            return
        }
        println("[Log] Receiving event ${event.name} while in ${currentState.name}")
        val transition = currentState.blocks.filterIsInstance(OnEventBlock::class.java)
            .firstOrNull { it.event.referred!! == event }
        if (transition != null) {
            enterState(transition.destination.referred!!)
        }
    }

    private fun enterState(enteredState: StateDeclaration) {
        executeExitActions()
        currentState = enteredState
        executeEntryActions()
    }

    private fun executeEntryActions() {
        currentState.blocks.filterIsInstance(OnEntryBlock::class.java).forEach {
            it.execute(symbolTable, this) }
    }

    private fun executeExitActions() {
        currentState.blocks.filterIsInstance(OnExitBlock::class.java).forEach {
            it.execute(symbolTable, this) }
    }

}

private fun OnEntryBlock.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

private fun OnExitBlock.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    this.statements.forEach { it.execute(symbolTable, interpreter) }
}

private fun Statement.execute(symbolTable: SymbolTable, interpreter: Interpreter) {
    when (this) {
        is Print -> println(this.value.evaluate(symbolTable))
        is Assignment -> symbolTable.writeByName(this.variable.name, this.value.evaluate(symbolTable))
        is Exit -> interpreter.alive = false
        else -> throw UnsupportedOperationException(this.toString())
    }
}

private fun Expression.evaluate(symbolTable: SymbolTable): Any =
    when (this) {
        is ValueReference -> symbolTable.readByName(this.symbol.name)
        is SumExpression -> {
            val l = this.left.evaluate(symbolTable)
            val r = this.right.evaluate(symbolTable)
            if (l is Int) {
                l + r as Int
            } else if (l is Double) {
                l + r as Double
In this case we have only one scope: the global scope. So we have only one Symbol Table. While
processing the different parts of the AST we pass the Symbol Table around.
This is how our Symbol table is defined:
class SymbolTable {
    private val values = HashMap<String, Any>()

    fun readByName(name: String) : Any {
        if (!values.containsKey(name)) {
            throw RuntimeException(
                "Unknown symbol $name. Known symbols: ${values.keys}")
        }
        return values[name]!!
    }

    fun writeByName(name: String, value: Any) {
        values[name] = value
    }
}
First of all we have to provide values for the inputs. Inputs make a State Machine configurable. The input values are inserted in the Symbol Table. Then we evaluate all the initial expressions for the variables and also put those into the Symbol Table.
We also set the current state to the state marked as the start state, and we execute all the entry actions for that state.
After the setup we are ready to react to events. We expose a method named receiveEvent and we expect it to be called when an event is sent to our State Machine. For example, if we built a UI for our interpreter, the user could hit a button for each event type and that button could call this method, passing the associated event.
What does this method do? When we receive an event we print a log message. Then we look for a transition that could be triggered from the current state based on the event we received. Two things can happen: either we find such a transition, and we enter its destination state, or no transition matches, and the event is simply ignored.
Note that executeEntryActions and executeExitActions do not take as a parameter the state on which to execute the entry or exit actions; they use currentState instead, so the order in which we call these methods and update the currentState variable is important.
Executing entry or exit actions is done by looking for entry or exit blocks in the current state. If such blocks are found we invoke execute on them, passing the Symbol Table.
The execution of a block consists simply of executing every statement contained in the block, in order.
Executing a statement is fairly easy because we have only a few statement types.
Note that you may want to collect logs through an interface, permitting the user to provide different implementations, like loggers that print messages on the screen, to a file, or maybe to a DB. The same goes for the output of the print statement of the language.
Summary
In this chapter we have seen the basics of writing an interpreter. We have studied the typical structure of an interpreter and discussed its main components. We got started working with symbol tables, executing statements, and evaluating expressions. Now building an interpreter should look less mysterious: after all, it is just about following the information captured in the AST and doing something in response.
In many cases starting by writing an interpreter is just easier compared to writing a compiler. I would suggest going down this path at least while you are designing the language and it is not yet stable.
Have fun writing interpreters and, when you are ready, let's move to the next chapter and explore an alternative: generating bytecode.
11. Generate JVM bytecode
In the previous chapter we saw how to write an interpreter. In this one we will see how to write a compiler instead. Our compiler will produce JVM bytecode. By compiling for the JVM we will be able to run our code on all sorts of platforms. That sounds pretty great to me!
Also, the JVM classes generated by our compiler can be used inside applications written in Java, Kotlin, Scala, JRuby, Frege, and all the other sorts of languages that run on the JVM. This opens up all sorts of scenarios. For example, you may want to create the core of a complex system in Java, and maybe use a smaller language, like MiniCalc or StaMac, to define specific subsystems. In other words, you could combine many specific languages with other more general, established JVM languages to build rich applications. This is a scenario that I think has a lot of potential, because it permits combining the strengths of different languages to define different portions of an application.
Before we start writing a compiler targeting the JVM we need to examine how the Java Virtual Machine works. We will do that in the first section of this chapter. Later we will write two different compilers: one for MiniCalcFun (i.e., MiniCalc extended to support functions) and the other for StaMac.
In particular we will look at:
the general structure of class files: class files contain the code executed by the JVM
JVM type descriptions: how the JVM defines types internally
the stack: a first example of how the stack is used to perform operations
bytecode: the bytecode specifies the instructions to execute when running a method
frames: they will be useful to understand the execution of methods
Finally we will look at a class file and examine the different parts.
Class files
The JVM executes code contained in class files. The format of such files is described in Chapter 4 of the JVM specification. To understand how to write bytecode effectively it is not necessary to look into every single field of a class file. There are details we can ignore if we use a simple library like ASM. What is important is to understand the general structure.
A class file contains, among other things: the constant pool, the access flags, the name of the class and of its superclass, the interfaces it implements, its fields, its methods, and additional attributes.
Each class file represents a single class. Inner classes, anonymous classes, and local classes are also compiled into separate class files.
A very important structure contained in a class file is the constant pool. It contains a set of constants
that are used for very different goals. Many other fields of the class file contain just indexes that
refer to the constant pool. For example, the this_class field contains just an index to an entry in the
constant pool; that entry is expected to contain a data structure describing the current class. The
constant pool also contains constants that will be accessed from the bytecode. For example, instructions
to invoke methods do not specify the method to invoke directly. They instead specify indexes into
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html
11. Generate JVM bytecode 130
the constant pool. At that position in the constant pool we find the name of the method to invoke
and its signature. This saves space when we refer to the same method more than once, because
the name is present just once. Considering that an index takes just two bytes, this is considerably less
than the space needed to record all the information identifying a method.
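The space saving can be made concrete with a toy model of the pool (the class and method names here are mine, not part of the JVM; the real constant pool is far more structured than a list of strings):

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of constant-pool sharing: the full method description is
// stored once, and every call site stores only a small index into the pool.
class ConstantPoolModel {
    private final List<String> pool = new ArrayList<>();

    // returns the 1-based index of the entry, adding it only if missing
    int intern(String entry) {
        int existing = pool.indexOf(entry);
        if (existing >= 0) {
            return existing + 1;
        }
        pool.add(entry);
        return pool.size();
    }

    int size() {
        return pool.size();
    }

    public static void main(String[] args) {
        ConstantPoolModel cp = new ConstantPoolModel();
        int first = cp.intern("java/lang/Object.\"<init>\":()V");
        int second = cp.intern("java/lang/Object.\"<init>\":()V");
        System.out.println(first == second); // true: referred to twice, stored once
        System.out.println(cp.size());       // 1
    }
}
```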
The class file also contains a list of the fields and a list of the methods declared in the class.
Inherited fields and methods are not present in the class file. For each field we have information like
the name, the type, and the access level. For each method we have the name, the signature, and the code.
There are other possible attributes associated with fields and methods, which can be useful for
debugging purposes or which can contain other information (like the exceptions thrown by a
method). We are not going to look into them; if you want to learn more, please refer
to the JVM Specification, section 4.7. The class file also contains a list of inner classes. We are not
going to use them in this chapter.
The Code attribute associated with a method contains the bytecode and some complementary
information. The bytecode is the list of instructions executed when the method is invoked.
We are going to see more about this in the following sections. Before that we are going to look into
concepts that are relevant to understand how the bytecode operates.
In addition to the primitive types we need to consider two other cases: declared types and arrays.
By declared types we mean classes, interfaces, enums, and annotations. Their JVM type
description is constructed as "L" + internal name + ";". The internal name is simply the qualified
name with the dots replaced by slashes.
To compose the type description of arrays we add the [ symbol to the start of the type description
for the element type.
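These two rules can be checked with a tiny sketch (the helper name is mine, not from the JVM specification):

```java
class Descriptors {
    // arrays prepend one [ per dimension to the element's type description
    static String arrayOf(String elementDescription) {
        return "[" + elementDescription;
    }

    public static void main(String[] args) {
        System.out.println(arrayOf("I"));                  // [I   is int[]
        System.out.println(arrayOf(arrayOf("I")));         // [[I  is int[][]
        System.out.println(arrayOf("Ljava/lang/String;")); // [Ljava/lang/String; is String[]
    }
}
```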
https://docs.oracle.com/javase/specs/jvms/se8/html/jvms-4.html#jvms-4.7
Typically in the class file you use the JVM type description, unless only a declared type can appear
in that position; in that case you use an internal name. For example, when specifying a superclass, or the
class defining a method, we use internal names. For parameter types we instead use type
descriptions.
The stack
We have said that the JVM is a stack-based machine. But what is a stack-based machine? It is a
machine that executes operations by extracting values from a stack and putting results back on the
stack. A stack is a LIFO structure: when extracting values, the first value we pick is the last value that
was inserted.
Consider one instruction of the JVM, IADD. This instruction expects to find two integers on top
of the stack when it is executed. It gets these two values, sums them, and puts the result back
on the stack.
Suppose that we push the values 1, 2, and 3 onto the stack and then execute two consecutive
IADD instructions. What will happen?
1. Initially the stack may already contain some values. We will leave them untouched
2. We will first push the value 1 on the top of the stack
3. Then we will push the value 2. Now 2 is on top, before 1
4. Then we will push the value 3. Now 3 is on top, before 2, which comes before 1
5. We perform an addition by removing the two values at the top of the stack. We remove first
3 and then 2. We sum them, and we put the result on the top of the stack. Now 5 is on the top
of the stack, above 1
6. We perform an addition. We remove first 5 and then 1. We put the result on the top of the
stack. Now 6 is on the top of the stack, above the values that were originally present in the
stack
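The walkthrough above can be simulated with an ordinary stack; this sketch is plain Java, not bytecode:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// A tiny simulation of the two consecutive IADDs described above.
class StackDemo {
    public static void main(String[] args) {
        Deque<Integer> stack = new ArrayDeque<>();
        stack.push(1);
        stack.push(2);
        stack.push(3);
        // IADD: pop the two topmost values, push their sum
        stack.push(stack.pop() + stack.pop()); // 3 + 2 -> 5 on top, above 1
        stack.push(stack.pop() + stack.pop()); // 5 + 1 -> 6 on top
        System.out.println(stack.peek()); // 6
    }
}
```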
Bytecode
A JVM instruction is composed of an opcode that takes exactly one byte. Opcode stands for operation
code: it is a number that identifies one of the instructions the JVM knows how to execute. The
maximum number of opcodes would theoretically be 256, but some values are reserved and not all
the values correspond to valid opcodes. Associated with each opcode there is also a mnemonic name:
it is much clearer to read iadd instead of 96 (which is the value of the opcode for iadd).
An opcode can be followed by one or more operands. Operands can be immediate values or indexes
indicating entries in the constant pool.
Note that the opcode determines how many operands are expected and their type, so by looking
at the opcode we know how long the whole instruction is going to be. For example, after d2f we
know that there will be no operands, so the whole instruction takes one byte. bipush has
an operand of one byte, so the instruction takes two bytes. putfield is followed by one
operand of two bytes, so the whole instruction takes three bytes.
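These three examples can be summarized in a small sketch (a hypothetical helper covering only the opcodes just mentioned):

```java
class InstructionLengths {
    // the opcode alone tells us how many operand bytes follow it
    static int operandBytes(String mnemonic) {
        switch (mnemonic) {
            case "d2f": return 0;      // no operands
            case "bipush": return 1;   // one immediate byte
            case "putfield": return 2; // a two-byte constant pool index
            default: throw new IllegalArgumentException(mnemonic);
        }
    }

    // total length: one byte for the opcode plus the operand bytes
    static int instructionLength(String mnemonic) {
        return 1 + operandBytes(mnemonic);
    }

    public static void main(String[] args) {
        System.out.println(instructionLength("d2f"));      // 1
        System.out.println(instructionLength("bipush"));   // 2
        System.out.println(instructionLength("putfield")); // 3
    }
}
```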
For conceptually similar operations we can have different opcodes. For example, summing two
numbers can be done using iadd, ladd, fadd, or dadd, depending on whether the operands are
ints (bytes and shorts are treated as ints), longs, floats, or doubles.
Frames
Each time a method is invoked a new frame is created. The frame is destroyed when the invocation
is completed.
Associated with each frame we have an array of local variables. It contains, in order:

- the value of this, if the method is an instance method (this entry is not present for static methods)
- the values of the method parameters, starting from position 1 for instance methods or from position 0 for static methods
- the local variables defined in the method
Note that while most values in the local variables array take one slot, long and double values
take two.
Let's see a couple of cases.
Suppose we have these two Java methods:
Index Content
0 this
1 parameter s
2 parameter l (1st part)
3 parameter l (2nd part)
4 local variable b
Index Content
0 parameter i
1 parameter j
2 parameter d (1st part)
3 parameter d (2nd part)
4 local variable o
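The two method listings themselves are not included in this extract; methods consistent with the two tables above could look like the following sketch (the parameter and variable names come from the tables, while the exact types are assumptions, except that the two-slot entries force long and double):

```java
class FrameExamples {
    // instance method: entry 0 of the local variables table holds this
    int withLong(short s, long l) { // s: entry 1; l occupies entries 2 and 3
        byte b = 0;                 // local variable: entry 4
        return b;
    }

    // static method: no this, parameters start at entry 0
    static int withDouble(int i, int j, double d) { // d occupies entries 2 and 3
        Object o = null;            // local variable: entry 4
        return i + j;
    }
}
```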
1 class A {
2 }
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #3.#10 // java/lang/Object."<init>":()V
7 #2 = Class #11 // A
8 #3 = Class #12 // java/lang/Object
9 #4 = Utf8 <init>
10 #5 = Utf8 ()V
11 #6 = Utf8 Code
12 #7 = Utf8 LineNumberTable
13 #8 = Utf8 SourceFile
14 #9 = Utf8 a.java
15 #10 = NameAndType #4:#5 // "<init>":()V
16 #11 = Utf8 A
17 #12 = Utf8 java/lang/Object
18 {
19 A();
20 descriptor: ()V
21 flags:
22 Code:
23 stack=1, locals=1, args_size=1
24 0: aload_0
25 1: invokespecial #1 // Method java/lang/Object."<init>\
26 ":()V
27 4: return
28 LineNumberTable:
29 line 1: 0
30 }
First we have the version of the class file. When a new version of Java is released it may introduce
new opcodes or slightly change the class file format; in that case a new version of the
class file format is introduced. Not all releases of Java have introduced a new version of the class file format.
The ACC_SUPER flag is present for historical reasons; it should always be set.
This class file contains one method, even if the class in the source code was empty. That method is
the default constructor, which is added by the compiler when no constructor is explicitly defined.
Now let's look into the constant pool. To understand how it works, consider this line of bytecode
taken from the listing above:

1: invokespecial #1 // Method java/lang/Object."<init>":()V
This line invokes the method specified in entry #1 of the constant pool. If we look in the constant pool
we can see that entry #1 is of type Methodref (i.e., it describes a method) and it refers to two
other entries: #3 and #10. Entry #3 defines the class in which the method is declared, while entry
#10 defines the signature. Constructors are basically special methods with the name <init>. In
this case we have a default constructor that invokes the parent constructor: our class A implicitly
extends java.lang.Object, so we call the java.lang.Object constructor.
Entry #3 is an entry of type Class that refers to entry #12. Entry #12 actually contains the internal
name of the class (java/lang/Object), which is basically the canonical name with slashes replacing
the dots.
Entry #10 is an entry of type NameAndType which refers to two other entries. The first entry (#4)
specifies the name of the method while the second one (#5) specifies the parameters accepted and
the return type.
Entry #4 contains the name of the method which is <init>. This is the special name used to represent
constructors.
Entry #5 indicates that the method takes no parameters and returns void (i.e., it returns nothing).
What should be clear at this point is that filling the constant pool is not complicated, but it requires
a lot of bookkeeping. For this reason we are going to use a library to write the class files, instead of
writing the bytes directly: that would not be conceptually difficult, just boring.
1 int foo(int p) {
2 p = 0;
3 return p;
4 }
In this method we first store 0 into an entry of the local variables table (the one for p) and then we
read the value back from the same entry, in order to return it.
If we compile it and decompile it we get:
1 int foo(int);
2 descriptor: (I)I
3 flags:
4 Code:
5 stack=1, locals=2, args_size=2
6 0: iconst_0
7 1: istore_1
8 2: iload_1
9 3: ireturn
The first instruction is iconst_0. This instruction pushes the int value 0 onto the stack. We
will see more about the instructions used to push constants in the paragraph on Constants.
Then we have the instruction istore_1, which takes the int value on top of the stack and stores it
into entry #1 of the local variables table. Note that in this case entry #0 indicates this,
while entry #1 indicates the only parameter of the method, p.
After that we load the integer value from entry #1 of the local variables table, and then return it.
Each of these instructions takes exactly one byte. The number before the instruction indicates the
index in the byte array describing the code. For example 0: iconst_0 starts at byte 0, 1: istore_1
at byte 1, and so on.
1 int foo(int p1, int p2, int p3, int p4, int p5, int p6) {
2 p1 = 10;
3 p2 = 20;
4 p3 = 30;
5 p4 = 40;
6 p5 = 50;
7 p6 = 60;
8 return p6;
9 }
Here we can see that conceptually we execute the same operation over and over: assigning a constant
to a parameter. However the instructions are different. In the first example we pushed onto the stack
the value 0 with the instruction iconst_0. That is a one-byte instruction that specifies both what to
do (push an integer value) and the value to push (0). In general we do not have a specific
instruction for each possible value; we can instead use the parametric instruction bipush,
which requires us to specify the value to push. We could write bipush 0: it would be equivalent
to iconst_0, it would just take more bytes. Using bipush the instruction takes 2 bytes; indeed the successive
instruction starts at byte 2, not at byte 1.
The same reasoning applies to the store instructions. We have special one-byte instructions to store
values into entries #0 to #3, but beyond that we need to use the generic instruction istore. The same is true for
loading: we have seen iload_1 before, but there is no iload_6; we instead use iload and specify as
a parameter the index of the entry (6 in this example).
Constants
We have special instructions (iconst_m1 to iconst_5) to push the values between -1 and 5, included. Then for values between
-128 and 127 we use bipush. For values between -32768 and 32767 we use sipush. For other values
we insert the constant into the constant pool and then use the instruction ldc #x, where x is the
index of the constant in the constant pool.
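The rules above can be sketched as a small selection function (a hypothetical helper, mirroring what a compiler does when choosing how to push an int constant):

```java
class PushStrategy {
    // which instruction a compiler typically emits to push an int constant,
    // following the ranges described above (a sketch, not a complete emitter)
    static String pushInstruction(int value) {
        if (value >= -1 && value <= 5) {
            return "iconst_" + (value == -1 ? "m1" : String.valueOf(value));
        }
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) {
            return "bipush " + value;
        }
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) {
            return "sipush " + value;
        }
        return "ldc #x"; // x = index of the constant in the constant pool
    }

    public static void main(String[] args) {
        System.out.println(pushInstruction(0));      // iconst_0
        System.out.println(pushInstruction(100));    // bipush 100
        System.out.println(pushInstruction(1000));   // sipush 1000
        System.out.println(pushInstruction(100000)); // ldc #x
    }
}
```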
Mathematical operations
Let's look at how addition is executed.
22 return a + b;
23 }
result in:
38 2: fadd
39 3: freturn
40
41 double sumDoubles(double, double);
42 descriptor: (DD)D
43 Code:
44 0: dload_1
45 1: dload_3
46 2: dadd
47 3: dreturn
- the JVM treats bytes, shorts, and ints in the same way in many cases: internally they are all treated as if they were ints
- we have different operations to load primitive values: iload, lload, fload, dload
- correspondingly we have different instructions to sum: iadd, ladd, fadd, dadd
- the same goes for the return instructions: ireturn, lreturn, freturn, dreturn

We have not seen it in this example, but subtraction, division, and multiplication also come in
four variants.
Addition Subtraction Multiplication Division
Byte iadd isub imul idiv
Short iadd isub imul idiv
Int iadd isub imul idiv
Long ladd lsub lmul ldiv
Float fadd fsub fmul fdiv
Double dadd dsub dmul ddiv
We are not considering what happens when you sum two values which are not of the same type.
We are going to figure that out in the next section about conversions.
Conversions
When two types are not compatible we are going to need to do some conversions. Consider these
cases:
21 1: i2l
22 2: lload_2
23 3: ladd
24 4: lreturn
25
26 float sumByteAndFloat(byte, float);
27 descriptor: (BF)F
28 Code:
29 0: iload_1
30 1: i2f
31 2: fload_2
32 3: fadd
33 4: freturn
34
35 double sumByteAndDouble(byte, double);
36 descriptor: (BD)D
37 Code:
38 0: iload_1
39 1: i2d
40 2: dload_2
41 3: dadd
42 4: dreturn
Conversions can be widening or narrowing. A widening numeric conversion always keeps the
original value, or a value that is very close to it (for the details you should consider how
floating point values are represented). A narrowing numeric conversion could instead change the value
significantly.
For example, converting an int containing the value 3 into a byte works fine.
However, converting an int containing the value 128 into a byte is a problem,
because a byte can represent values between -128 and 127: 128 does not fit into a byte, so
the resulting value is not equivalent to the original one.
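We can verify both cases directly in Java:

```java
class Narrowing {
    public static void main(String[] args) {
        int small = 3;
        int big = 128;
        System.out.println((byte) small); // 3: the value fits, nothing is lost
        System.out.println((byte) big);   // -128: 128 does not fit, the value wraps around
    }
}
```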
Operations on objects
So far we have focused on instructions involving primitive types. We have not yet seen how to deal
with object instances.
In a few words: they work very similarly. The only thing worth noticing is that the instructions do
not work directly on the object value itself, but on a reference to it, in other words on its address. For
those who learned to program in lower-level languages this will bring back memories: basically we are back
to working with pointers.
Consider this method:
1 java.lang.String passingStringAround(java.lang.String);
2 descriptor: (Ljava/lang/String;)Ljava/lang/String;
3 Code:
4 0: aload_1
5 1: astore_2
6 2: aload_2
7 3: areturn
Basically we have the variants of the load, store, and return instructions for references. They all start
with a.
The other thing you may notice is that the signatures are much longer. For the primitive types
we had a single letter, so a method taking an int and returning a long has the signature
(I)J.
For declared types instead the signature is L + internal name + ;. For example java.lang.String becomes
Ljava/lang/String;. This also explains why the letter for long is J rather than L.
Method invocations
Now that we have seen how object references are passed around we may want to see how to actually
use them. What do you do with objects? You invoke methods on them.
This is where things get complicated.
We have 5 different instructions:

- invokedynamic
- invokeinterface
- invokespecial
- invokestatic
- invokevirtual
invokedynamic has to do with the support for dynamic languages introduced in version 7 of the JVM.
We are not going to look into it, because we would need to introduce a lot of different
concepts, and you can build many interesting languages without it.
invokeinterface is used to invoke methods on references having an interface type.
invokespecial is used to invoke superclass methods, private methods, and constructors.
invokestatic is used to invoke static methods.
In the other cases you want to use invokevirtual.
Let's consider this piece of Java code, which contains an interface, an abstract class, and a concrete
class.
1 class A {
2
3 interface MyInterface {
4 void foo();
5 }
6
7 abstract class MyAbstractClass implements MyInterface {
8
9 }
10
11 class MyConcreteClass implements MyInterface {
12 public void foo() {}
13 }
14
15 void invoking(MyInterface p0, MyAbstractClass p1, MyConcreteClass p2) {
16 p0.foo();
17 p1.foo();
18 p2.foo();
19 }
20
21 }
1 Constant pool:
2 #1 = Methodref #6.#22 // java/lang/Object."<init>":()V
3 #2 = InterfaceMethodref #12.#23 // A$MyInterface.foo:()V
4 #3 = Methodref #10.#23 // A$MyAbstractClass.foo:()V
5 #4 = Methodref #7.#23 // A$MyConcreteClass.foo:()V
6 #5 = Class #24 // A
7 #6 = Class #25 // java/lang/Object
8 #7 = Class #26 // A$MyConcreteClass
9 #8 = Utf8 MyConcreteClass
10 #9 = Utf8 InnerClasses
11 #10 = Class #27 // A$MyAbstractClass
12 #11 = Utf8 MyAbstractClass
13 #12 = Class #28 // A$MyInterface
14 #13 = Utf8 MyInterface
15 #14 = Utf8 <init>
16 #15 = Utf8 ()V
17 #16 = Utf8 Code
18 #17 = Utf8 LineNumberTable
61 line 18: 10
62 line 19: 14
63 }
64 SourceFile: "a.java"
65 InnerClasses:
66 #8= #7 of #5; //MyConcreteClass=class A$MyConcreteClass of class A
67 abstract #11= #10 of #5; //MyAbstractClass=class A$MyAbstractClass of class\
68 A
69 static #13= #12 of #5; //MyInterface=class A$MyInterface of class A
What is interesting to us are the three invocations of the method foo. The first one operates on an
interface, so we use invokeinterface. The other two operate on classes, one abstract and
the other concrete. In both cases we use invokevirtual.
Let's look into constructors.
The class file will contain a default constructor for Derived, which will call the default constructor
of Super.
22 Code:
23 stack=1, locals=1, args_size=1
24 0: aload_0
25 1: invokespecial #1 // Method Super."<init>":()V
26 4: return
27 LineNumberTable:
28 line 5: 0
29 }
1 class A {
2
3 private void myPrivateInstanceMethod() { }
4 public void myPublicInstanceMethod() { }
5 private static void myPrivateStaticMethod() { }
6 public static void myPublicStaticMethod() { }
7
8 private void myMethodCallingTheOthers() {
9 myPrivateStaticMethod();
10 myPublicStaticMethod();
11 myPrivateInstanceMethod();
12 myPrivateInstanceMethod();
13 }
14
15 }
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #6.#18 // java/lang/Object."<init>":()V
7 #2 = Methodref #5.#19 // A.myPrivateStaticMethod:()V
8 #3 = Methodref #5.#20 // A.myPublicStaticMethod:()V
9 #4 = Methodref #5.#21 // A.myPrivateInstanceMethod:()V
10 #5 = Class #22 // A
11 #6 = Class #23 // java/lang/Object
12 #7 = Utf8 <init>
13 #8 = Utf8 ()V
14 #9 = Utf8 Code
15 #10 = Utf8 LineNumberTable
16 #11 = Utf8 myPrivateInstanceMethod
17 #12 = Utf8 myPublicInstanceMethod
18 #13 = Utf8 myPrivateStaticMethod
19 #14 = Utf8 myPublicStaticMethod
20 #15 = Utf8 myMethodCallingTheOthers
21 #16 = Utf8 SourceFile
22 #17 = Utf8 A.java
23 #18 = NameAndType #7:#8 // "<init>":()V
24 #19 = NameAndType #13:#8 // myPrivateStaticMethod:()V
25 #20 = NameAndType #14:#8 // myPublicStaticMethod:()V
26 #21 = NameAndType #11:#8 // myPrivateInstanceMethod:()V
27 #22 = Utf8 A
28 #23 = Utf8 java/lang/Object
29 {
30 A();
31 descriptor: ()V
32 flags:
33 Code:
34 stack=1, locals=1, args_size=1
35 0: aload_0
36 1: invokespecial #1 // Method java/lang/Object."<init>\
37 ":()V
38 4: return
39 LineNumberTable:
40 line 1: 0
41
42 private void myPrivateInstanceMethod();
43 descriptor: ()V
44 flags: ACC_PRIVATE
45 Code:
46 stack=0, locals=1, args_size=1
47 0: return
48 LineNumberTable:
49 line 3: 0
50
51 public void myPublicInstanceMethod();
52 descriptor: ()V
53 flags: ACC_PUBLIC
54 Code:
55 stack=0, locals=1, args_size=1
56 0: return
57 LineNumberTable:
58 line 4: 0
59
60 private static void myPrivateStaticMethod();
61 descriptor: ()V
62 flags: ACC_PRIVATE, ACC_STATIC
63 Code:
64 stack=0, locals=0, args_size=0
65 0: return
66 LineNumberTable:
67 line 5: 0
68
69 public static void myPublicStaticMethod();
70 descriptor: ()V
71 flags: ACC_PUBLIC, ACC_STATIC
72 Code:
73 stack=0, locals=0, args_size=0
74 0: return
75 LineNumberTable:
76 line 6: 0
77
78 private void myMethodCallingTheOthers();
79 descriptor: ()V
80 flags: ACC_PRIVATE
81 Code:
82 stack=1, locals=1, args_size=1
83 0: invokestatic #2 // Method myPrivateStaticMethod:()V
84 3: invokestatic #3 // Method myPublicStaticMethod:()V
85 6: aload_0
86 7: invokespecial #4 // Method myPrivateInstanceMethod:\
87 ()V
88 10: aload_0
89 11: invokespecial #4 // Method myPrivateInstanceMethod:\
90 ()V
91 14: return
92 LineNumberTable:
93 line 9: 0
94 line 10: 3
95 line 11: 6
96 line 12: 10
97 line 13: 14
98 }
We can see that the static methods are invoked using invokestatic. The private instance
methods are instead invoked using invokespecial (a public instance method would be invoked using
invokevirtual).
1 class A {
2 String name;
3
4 A(String name) {
5 this.name = name;
6 }
7
8 String getName() {
9 return this.name;
10 }
11 }
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #4.#15 // java/lang/Object."<init>":()V
7 #2 = Fieldref #3.#16 // A.name:Ljava/lang/String;
8 #3 = Class #17 // A
9 #4 = Class #18 // java/lang/Object
10 #5 = Utf8 name
11 #6 = Utf8 Ljava/lang/String;
12 #7 = Utf8 <init>
13 #8 = Utf8 (Ljava/lang/String;)V
14 #9 = Utf8 Code
15 #10 = Utf8 LineNumberTable
16 #11 = Utf8 getName
In both cases we specify the index of a field descriptor, which is contained in the constant pool. The
field descriptor defines the class, the name, and the type of the field.
Object creation
To work with objects we need to be able to instantiate them. Let's see how.
1 A instance() {
2 return new A();
3 }
The class:
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #4.#13 // java/lang/Object."<init>":()V
7 #2 = Class #14 // A
8 #3 = Methodref #2.#13 // A."<init>":()V
9 #4 = Class #15 // java/lang/Object
10 #5 = Utf8 <init>
11 #6 = Utf8 ()V
12 #7 = Utf8 Code
13 #8 = Utf8 LineNumberTable
14 #9 = Utf8 instance
15 #10 = Utf8 ()LA;
16 #11 = Utf8 SourceFile
17 #12 = Utf8 A.java
18 #13 = NameAndType #5:#6 // "<init>":()V
19 #14 = Utf8 A
20 #15 = Utf8 java/lang/Object
21 {
22 A();
23 descriptor: ()V
24 flags:
25 Code:
Here we first use the special instruction new to allocate the object. Once we have allocated it we need
to call the corresponding constructor.
You may wonder why we have the dup instruction here. This instruction takes the value on top
of the stack and duplicates it, so that two copies of the same value are placed on top of the stack.
We need two references to the instance of A because the first one is consumed by the
invocation of the constructor, while the second one is needed by areturn.
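The effect of dup can be sketched with the same kind of toy operand stack we used before (plain strings stand in for object references; this is a simulation, not bytecode):

```java
import java.util.ArrayDeque;
import java.util.Deque;

class DupDemo {
    public static void main(String[] args) {
        Deque<String> stack = new ArrayDeque<>();
        stack.push("ref to new A");          // new: pushes the fresh reference
        stack.push(stack.peek());            // dup: duplicates the top of the stack
        String forConstructor = stack.pop(); // invokespecial <init> consumes one copy
        String forReturn = stack.pop();      // areturn consumes the other
        System.out.println(forConstructor.equals(forReturn)); // true: same reference
    }
}
```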
Comparison
Until now we have just seen how to execute a list of instructions, without conditions. However,
real code has if statements and loops: we do not execute a list of statements from
beginning to end, we perform jumps.
Let's look at a simple example with an if:
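The Java source for this example is not included in this extract; judging from the constant pool and bytecode that follow, it was essentially this (a reconstruction, so treat it as a sketch):

```java
class A {
    // prints only when the flag is true; compiled to an ifeq jump
    void choice(boolean flag) {
        if (flag) {
            System.out.println("Flag is set!");
        }
    }
}
```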
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #6.#16 // java/lang/Object."<init>":()V
7 #2 = Fieldref #17.#18 // java/lang/System.out:Ljava/io/Print\
8 Stream;
9 #3 = String #19 // Flag is set!
10 #4 = Methodref #20.#21 // java/io/PrintStream.println:(Ljava/\
11 lang/String;)V
12 #5 = Class #22 // A
13 #6 = Class #23 // java/lang/Object
14 #7 = Utf8 <init>
15 #8 = Utf8 ()V
16 #9 = Utf8 Code
17 #10 = Utf8 LineNumberTable
18 #11 = Utf8 choice
19 #12 = Utf8 (Z)V
20 #13 = Utf8 StackMapTable
21 #14 = Utf8 SourceFile
22 #15 = Utf8 A.java
23 #16 = NameAndType #7:#8 // "<init>":()V
24 #17 = Class #24 // java/lang/System
25 #18 = NameAndType #25:#26 // out:Ljava/io/PrintStream;
26 #19 = Utf8 Flag is set!
27 #20 = Class #27 // java/io/PrintStream
28 #21 = NameAndType #28:#29 // println:(Ljava/lang/String;)V
29 #22 = Utf8 A
30 #23 = Utf8 java/lang/Object
31 #24 = Utf8 java/lang/System
32 #25 = Utf8 out
33 #26 = Utf8 Ljava/io/PrintStream;
34 #27 = Utf8 java/io/PrintStream
What is interesting in this case is the ifeq instruction. It has one parameter, which in this case has the
value 12: the parameter indicates the position to jump to.
How does it work? We first put on the stack the content of entry 1 of the local variables table,
which is the parameter named flag. ifeq performs the jump if the value on top of the stack is
equal to zero. The boolean value false is represented by zero, so we jump if flag is set to
false.
Where do we jump to? We jump to the implicit return instruction at the very end of the method.
If we do not jump (because flag is true) we just keep executing the following instructions, which
correspond to the statement System.out.println("Flag is set!");.
Another typical condition is checking if a reference is null:
1 class A
2 minor version: 0
3 major version: 52
4 flags: ACC_SUPER
5 Constant pool:
6 #1 = Methodref #6.#16 // java/lang/Object."<init>":()V
7 #2 = Fieldref #17.#18 // java/lang/System.out:Ljava/io/Print\
8 Stream;
9 #3 = String #19 // Obj is not null!
10 #4 = Methodref #20.#21 // java/io/PrintStream.println:(Ljava/\
11 lang/String;)V
12 #5 = Class #22 // A
13 #6 = Class #23 // java/lang/Object
14 #7 = Utf8 <init>
15 #8 = Utf8 ()V
16 #9 = Utf8 Code
17 #10 = Utf8 LineNumberTable
18 #11 = Utf8 choice
19 #12 = Utf8 (Ljava/lang/Object;)V
20 #13 = Utf8 StackMapTable
21 #14 = Utf8 SourceFile
22 #15 = Utf8 A.java
23 #16 = NameAndType #7:#8 // "<init>":()V
24 #17 = Class #24 // java/lang/System
25 #18 = NameAndType #25:#26 // out:Ljava/io/PrintStream;
26 #19 = Utf8 Obj is not null!
27 #20 = Class #27 // java/io/PrintStream
28 #21 = NameAndType #28:#29 // println:(Ljava/lang/String;)V
29 #22 = Utf8 A
30 #23 = Utf8 java/lang/Object
31 #24 = Utf8 java/lang/System
32 #25 = Utf8 out
33 #26 = Utf8 Ljava/io/PrintStream;
34 #27 = Utf8 java/io/PrintStream
35 #28 = Utf8 println
36 #29 = Utf8 (Ljava/lang/String;)V
37 {
38 A();
39 descriptor: ()V
40 flags:
41 Code:
42 stack=1, locals=1, args_size=1
43 0: aload_0
44 1: invokespecial #1 // Method java/lang/Object."<init>\
45 ":()V
46 4: return
47 LineNumberTable:
48 line 1: 0
49
50 void choice(java.lang.Object);
51 descriptor: (Ljava/lang/Object;)V
52 flags:
53 Code:
54 stack=2, locals=2, args_size=2
55 0: aload_1
56 1: ifnull 12
57 4: getstatic #2 // Field java/lang/System.out:Ljav\
58 a/io/PrintStream;
59 7: ldc #3 // String Obj is not null!
60 9: invokevirtual #4 // Method java/io/PrintStream.prin\
61 tln:(Ljava/lang/String;)V
62 12: return
63 LineNumberTable:
64 line 4: 0
65 line 5: 4
66 line 7: 12
67 StackMapTable: number_of_entries = 1
68 frame_type = 12 /* same */
69 }
Here the structure is very similar, we just have a different kind of jump. This time we use ifnull.
Code
For writing our JVM compilers we are going to use ASM. ASM is a library that can produce
bytecode and class files. On one hand this library is extremely useful because it handles all the
bookkepping involved in generating the bytecode while giving access tothe low level structures
present in the class file. On the other hand the documentation is extremely outdated and poor. All
in all, it is worthy to go through the difficulties of learning how to use ASM to build your own
compiler.
MiniCalcFun
We are going to build a JVM compiler that given a source file written in MiniCalcFun will produce
a class file.
General structure
Let's start from the entry point of our compiler. We expect the name of a source file to be
specified as the first and only parameter. We will open the file, read the code, and try to build an
AST, checking for lexical and syntactical errors. If there are none we will validate the AST
and check for semantic errors. If there are no semantic errors we will go on with the class file
generation; if instead errors are found, we show them to the user and terminate.
http://asm.ow2.org
We are reusing code to build the AST and validate it. The new code is this:
1 class JvmCompiler {
2
3 fun compile(ast: MiniCalcFile, className: String) =
4 Compilation(ast, className).compile()
5
6 }
Here we take an AST and a name to assign to the class to generate. We use them to instantiate
Compilation. Why do we do that? Because Compilation will be used to track different pieces of
temporary data we need while producing the class file.
Before examining the Compilation class we will look at some utilities we will need.
When looking at how the JVM works we have seen that internally it uses type descriptions, and
internal names for declared types (classes, interfaces, enums, and annotations). We have seen that
there are type descriptions for all primitive types, for arrays, and for declared types. For example,
the type description for int is I, for an array of arrays of int it is [[I, and for the class String it is
Ljava/lang/String;. Internal names are instead defined only for declared types: the internal
name of String is java/lang/String, and the one of File is java/io/File.
When compiling our code we will translate the types present in our language to types for the JVM.
In particular we have three types in MiniCalcFun:

- Int, which we translate to the JVM int (type description I)
- Decimal, which we translate to the JVM double (type description D)
- String, which we translate to java.lang.String
In general it will be useful to have functions to get the internal names and type descriptions of
the different classes. In our simple compiler we will refer to String but also to Object.
In general, given a canonical name (like java.lang.Object), we can obtain an internal name or a
type description like this:
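The helper itself falls outside this extract; in plain Java terms (the book defines it as a Kotlin extension method, and these names are mine) it amounts to something like:

```java
class Names {
    // internal name: canonical name with dots replaced by slashes
    static String internalName(Class<?> clazz) {
        return clazz.getCanonicalName().replace('.', '/');
    }

    // type description for a declared type: L + internal name + ;
    static String jvmDescription(Class<?> clazz) {
        return "L" + internalName(clazz) + ";";
    }

    public static void main(String[] args) {
        System.out.println(internalName(String.class));   // java/lang/String
        System.out.println(jvmDescription(String.class)); // Ljava/lang/String;
    }
}
```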
We will use this extension method in our extension method for Type. This method gives us the type
description for any of the three types we support in MiniCalcFun:
1 fun Type.jvmDescription() =
2 when (this) {
3 is IntType -> "I"
4 is DecimalType -> "D"
5 is StringType -> String::class.java.jvmDescription()
6 else -> throw UnsupportedOperationException(this.javaClass.canonicalName)
7 }
Now that we have started looking into types, we can look at other operations that depend on the
type.
We have four of them:
- localVarTableSize: when invoked on a type, this method returns the number of slots needed for an element of that type in the local variables table
- loadOp: we have seen that there are different operations to load a value from the local variables table onto the stack, depending on its type; for example, for int we should use ILOAD, while for double we should use DLOAD. This method gives us the right opcode to use for a given type
- storeOp: similarly to loadOp, given a type it returns the opcode to use to store a value of that type into the local variables table
- returnOp: similarly to loadOp and storeOp, given a type it returns the opcode to use to return a value of that type
These methods are very simple; maybe we could have used maps instead of writing them. Anyway,
they will be useful to abstract away some of the nitty-gritty details involved in writing the
compiler.
1 // We have seen that all types but long and double takes one space in a local
2 // variables table. In this case we have a type (DecimalType) that is
3 // translated into the JVM type double so that it takes two spaces, while
4 // the other types take just one
5 fun Type.localVarTableSize() =
6 when (this) {
7 is IntType -> 1
8 is DecimalType -> 2
9 is StringType -> 1
10 else -> throw UnsupportedOperationException(
11 this.javaClass.canonicalName)
12 }
13
14 fun Type.loadOp() =
15 when (this) {
16 is IntType -> ILOAD
17 is DecimalType -> DLOAD
18 is StringType -> ALOAD
19 else -> throw UnsupportedOperationException(
20 this.javaClass.canonicalName)
21 }
22
23 fun Type.storeOp() =
24 when (this) {
25 is IntType -> ISTORE
26 is DecimalType -> DSTORE
27 is StringType -> ASTORE
28 else -> throw UnsupportedOperationException(
29 this.javaClass.canonicalName)
30 }
31
32 fun Type.returnOp() =
33 when (this) {
34 is IntType -> IRETURN
35 is DecimalType -> DRETURN
36 is StringType -> ARETURN
37 else -> throw UnsupportedOperationException(
38 this.javaClass.canonicalName)
39 }
Pushing values
Now we are going to see how we deal with expressions. The typical thing you want to do is
to evaluate an expression. What does that mean from the point of view of the compiler? It means
executing a sequence of instructions and, at the end, having the result of the expression on top of
the stack.
This is how we evaluate all the expressions. Note that we are referring to some classes we have not
yet seen (MethodVisitor, CompilationContext), so not everything will be clear right now, but let's
start by focusing on the general structure.
33 val lt = this.left.type()
34 val rt = this.right.type()
35 if (lt is StringType) {
36 this.left.pushAsString(methodVisitor, context)
37 this.right.pushAsString(methodVisitor, context)
38 methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
39 "java/lang/String",
40 "concat",
41 "(${String::class.java.jvmDescription()})"
42 + "${String::class.java.jvmDescription()}", false)
43 } else if (lt is IntType && rt is IntType) {
44 this.left.pushAsInt(methodVisitor, context)
45 this.right.pushAsInt(methodVisitor, context)
46 methodVisitor.visitInsn(IADD)
47 } else if (lt is NumberType && rt is NumberType) {
48 this.left.pushAsDouble(methodVisitor, context)
49 this.right.pushAsDouble(methodVisitor, context)
50 methodVisitor.visitInsn(DADD)
51 } else {
52 throw UnsupportedOperationException(lt.toString()
53 + " from evaluating " + this.left)
54 }
55 }
56 is SubtractionExpression -> {
57 val lt = this.left.type()
58 val rt = this.right.type()
59 if (lt is IntType && rt is IntType) {
60 this.left.pushAsInt(methodVisitor, context)
61 this.right.pushAsInt(methodVisitor, context)
62 methodVisitor.visitInsn(ISUB)
63 } else if (lt is NumberType && rt is NumberType) {
64 this.left.pushAsDouble(methodVisitor, context)
65 this.right.pushAsDouble(methodVisitor, context)
66 methodVisitor.visitInsn(DSUB)
67 } else {
68 throw UnsupportedOperationException(lt.toString()
69 + " from evaluating " + this.left)
70 }
71 }
72 is MultiplicationExpression -> {
73 val lt = this.left.type()
74 val rt = this.right.type()
In the following sub-sections we are going to examine the different portions of this method.
Literals
Let's start with some simple cases. How do we evaluate integer and decimal literals?
In this case all we have to do is push a constant onto the stack. If the value is small, ASM will generate
an instruction containing the value itself. Otherwise ASM will create an entry in the constant pool to
hold the value and generate an instruction referring to that entry. These little details are abstracted
away by ASM: we just invoke visitLdcInsn.
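At the bytecode level, the instruction set offers several ways to push an int constant, from compact dedicated opcodes to constant-pool references. This little Java sketch (a hypothetical helper returning opcode names as plain strings, purely for illustration) shows which instruction is the most compact for a given value:

```java
// Sketch of the bytecode-level choice for pushing an int constant: tiny
// values have dedicated one-byte opcodes, byte/short-sized values embed the
// operand in the instruction, and everything else goes through the constant
// pool via LDC. The returned names are illustrative strings, not ASM API.
public class ConstOpcodeChooser {
    public static String opcodeForInt(int value) {
        if (value == -1) return "ICONST_M1";
        if (value >= 0 && value <= 5) return "ICONST_" + value;
        if (value >= Byte.MIN_VALUE && value <= Byte.MAX_VALUE) return "BIPUSH";
        if (value >= Short.MIN_VALUE && value <= Short.MAX_VALUE) return "SIPUSH";
        return "LDC"; // value stored in the constant pool
    }
}
```
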
String literals are more complex because MiniCalcFun supports interpolated strings. This means
that we can insert expressions into string literals.
String literals in MiniCalcFun are composed of parts that can be either constant strings or
embedded expressions.
How do we translate this?
We consider three different cases:
If we have zero elements we just push an empty string onto the stack (methodVisitor.visitLdcInsn("")).
If we have one or more elements we evaluate the first element. If it is a constant string we just push it.
If it is an expression we instead evaluate it and convert it to a string using the method pushAsString
that we will see in the next section. This means that evaluating 3 * 4 will not produce the integer
12 but will instead produce the string "12". In this way every single part of the interpolated string
will produce a string.
If we have more than one element, at this point we have evaluated only the first one. To evaluate the
remaining ones we create a temporary StringLit with all the parts from the second one to the last
one (all but the first part, which we have already evaluated). We then do a recursive call to push.
At this point we will have two strings on top of the stack: the first one representing the first part,
the second one representing the concatenation of all the other parts. Now we just call the method
String.concat(String), which will merge the two elements into a single string. It will use the first
element as the this value and the second one as the parameter of the concat method.
1 is StringLit -> {
2 if (this.parts.isEmpty()) {
3 methodVisitor.visitLdcInsn("")
4 } else {
5 val part = this.parts.first()
6 when (part) {
7 is ConstantStringLitPart -> methodVisitor.visitLdcInsn(part.content)
8 is ExpressionStringLitPart -> part.expression.pushAsString(methodVis\
9 itor, context)
10 }
11 if (this.parts.size > 1) {
12 StringLit(this.parts.subList(1, this.parts.size)).push(methodVisitor\
13 , context)
14 methodVisitor.visitMethodInsn(INVOKEVIRTUAL,
15 "java/lang/String", "concat",
16 "(${String::class.java.jvmDescription()})"
17 + "${String::class.java.jvmDescription()}", false)
18 }
19 }
20 }
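The recursive strategy can be mirrored at the value level. This hypothetical Java sketch (not the compiler's code) evaluates a list of already-stringified parts the same way the generated bytecode does: an empty list yields "", otherwise the first part is concatenated, via String.concat, with the recursive evaluation of the remaining parts.

```java
import java.util.List;

// Value-level sketch of the recursive evaluation performed by the generated
// bytecode for StringLit. Parts are assumed to be already converted to
// strings (that is the job of pushAsString in the real compiler).
public class InterpolationSketch {
    public static String eval(List<String> parts) {
        if (parts.isEmpty()) return "";       // like visitLdcInsn("")
        String first = parts.get(0);          // evaluate the first part
        if (parts.size() == 1) return first;
        // temporary "StringLit" with the remaining parts, then String.concat
        return first.concat(eval(parts.subList(1, parts.size())));
    }
}
```
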
Value reference
When we have a reference to an input, a variable or a parameter we just need to find its value and
push it on the stack:
Binary operations
1 is SubtractionExpression -> {
2 val lt = this.left.type()
3 val rt = this.right.type()
4 if (lt is IntType && rt is IntType) {
5 // we know the first operand is already an int, so we could just use pus\
6 h instead of pushInt
7 this.left.pushAsInt(methodVisitor, context)
8 this.right.pushAsInt(methodVisitor, context)
9 methodVisitor.visitInsn(ISUB)
10 } else if (lt is NumberType && rt is NumberType) {
11 // we know the first operand is already a double, so we could just use p\
12 ush instead of pushDouble
13 this.left.pushAsDouble(methodVisitor, context)
14 this.right.pushAsDouble(methodVisitor, context)
15 methodVisitor.visitInsn(DSUB)
16 } else {
17 throw UnsupportedOperationException(lt.toString()+ " from evaluating " +\
18 this.left)
19 }
20 }
In practice we start by looking at the type of the operands. If they are both integers we just push
both of them on the stack. When the values are on the stack we call the instruction ISUB to subtract
them. If at least one of them is a decimal then we need to convert both values to decimal by using
pushAsDouble and then invoke DSUB.
Multiplication and division work in exactly the same way; they just use different opcodes: IMUL, DMUL,
IDIV, and DDIV.
Addition is instead more complex because we consider the case in which we are summing strings.
While this operation uses the plus sign, it is not a real addition but a string concatenation. In that
case we use the concat method that we have seen when we looked at interpolated strings.
1 is SumExpression -> {
2 val lt = this.left.type()
3 val rt = this.right.type()
4 if (lt is StringType) {
5 this.left.pushAsString(methodVisitor, context)
6 this.right.pushAsString(methodVisitor, context)
7 methodVisitor.visitMethodInsn(INVOKEVIRTUAL, "java/lang/String",
8 "concat",
9 "(${String::class.java.jvmDescription()})${String::class.java.jvmDe\
10 scription()}",
11 false)
12 } else if (lt is IntType && rt is IntType) {
13 this.left.pushAsInt(methodVisitor, context)
14 this.right.pushAsInt(methodVisitor, context)
15 methodVisitor.visitInsn(IADD)
16 // NumberType is a common ancestor for IntType and DecimalType
17 // if both are NumberType and the previous condition
18 // was not satisfied it means at least one is a Decimal
19 // and the other is either a Decimal or an Int
20 } else if (lt is NumberType && rt is NumberType) {
21 this.left.pushAsDouble(methodVisitor, context)
22 this.right.pushAsDouble(methodVisitor, context)
23 methodVisitor.visitInsn(DADD)
24 } else {
25 throw UnsupportedOperationException(lt.toString()+ " from evaluating " +\
26 this.left)
27 }
28 }
Function call
1 is FunctionCall -> {
2 val functionCode = context.compilation.functions[this.function.referred!!]!!
3 var index = 0
4 // we push this
5 methodVisitor.visitVarInsn(ALOAD, index)
6 // we push all the parameters we received and we need to pass along
7 index = 1
8 functionCode.surroundingValues.forEach {
9 val type = it.type()
10 methodVisitor.visitVarInsn(type.loadOp(), index)
11 index += type.localVarTableSize()
12 }
13 // we push all the parameters specified in the call
14 this.params.forEach { it.push(methodVisitor, context) }
15 // we invoke the method
16 methodVisitor.visitMethodInsn(INVOKEVIRTUAL, context.compilation.className,
17 functionCode.methodName, functionCode.signature, false)
18 }
To understand how the function call works you need to know how we will compile each function.
We will see it later in detail, but the idea is that each function in MiniCalcFun is compiled as a
JVM method. This method has as many parameters as there are values visible to the function.
Consider this function:
1 input Int i
2 var globalVar = 0
3
4 fun f(Int p0) Int {
5 i * globalVar * p0
6 }
The function f needs to access not only its own parameter p0 but also the inputs and global variables.
For this reason we will generate a JVM method named fun_f which will take three parameters. In this
way, when we call it, we will be able to pass it all the necessary values.
MiniCalcFun also supports nested functions, like in this example:
1 input Int i
2 var globalVar = 0
3
4 fun f(Int p0) Int {
5 fun g(Int p1) Int {
6 p1 * 2
7 }
8 i * globalVar * g(p0)
9 }
In this case g will be compiled to a JVM method taking four parameters: one for the input (i), one for the
global variable (globalVar), one for the parameter of the wrapping function f (p0) and one for its
own parameter (p1).
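This compilation scheme is essentially lambda lifting. As a hypothetical hand-written Java rendering of the first nested example (the method names follow the qualified-name scheme used by the compiler, but everything here is an illustration, not the generated code):

```java
// Lambda-lifting sketch: each MiniCalcFun function becomes a method whose
// parameter list starts with every value visible at its declaration point.
public class LiftedFunctions {
    // fun f(Int p0): sees input i and global variable globalVar
    public static int fun_f(int i, int globalVar, int p0) {
        return i * globalVar * fun_f_g(i, globalVar, p0, p0); // the call g(p0)
    }
    // fun g(Int p1), nested inside f: additionally sees f's parameter p0
    public static int fun_f_g(int i, int globalVar, int p0, int p1) {
        return p1 * 2;
    }
}
```
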
Consider also this case:
1 input Int i
2 var globalVar = 0
3
4 fun f(Int p0) Int {
5 fun g(Int p1) Int {
6 fun h(Int p2) Int {
7 f(p0) * p2
8 }
9 p1 * 2
10 }
11 i * globalVar * g(p0)
12 }
Things start to get complex, so let's look at the parameter lists in a table:

Function  JVM method  Parameters
f         fun_f       i, globalVar, p0
g         fun_f_g     i, globalVar, p0, p1
h         fun_f_g_h   i, globalVar, p0, p1, p2

The idea is that as functions are nested more deeply we pass along all the surrounding information
plus the new parameters. Note that local variables have to be passed along too.
If we update the example in this way:
1 input Int i
2 var globalVar = 0
3 fun f(Int p0) Int {
4 var v0 = 2
5 fun g(Int p1) Int {
6 var v1 = 3
7 fun h(Int p2) Int {
8 var v2 = 4
9 f(p0) * (p2 - v2 + v1)
10 }
11 p1 * (2 + v0)
12 }
13 i * globalVar * g(p0)
14 }
So when we execute a function call we need to pass more than just the parameters of the function
as defined in the MiniCalcFun code: we also need to pass all the values visible to that function. In
this updated example, the method for h receives i, globalVar, p0, v0, p1, v1, and then its own
parameter p2.
When we call a function we are sure to have all the values it needs already present in our local
variables table. We just need to push them, so that they are available to the method we are going
to invoke. Once we have pushed the contextual values we also push the values for the parameters,
which are instead specified in the function call.
Note also that values are ordered from the most global to the most specific both in the local variables
table and among the parameters of JVM methods. This will be useful.
It may sound confusing right now, but we will see more details when looking at how the code for
the functions and for the top-level statements is generated.
Back to the code for FunctionCall, we:
- push the value of this, because we are going to call an instance method (the JVM method for the function)
- pass as many values from the local variables table as needed
- evaluate all the parameter values specified in the function call by pushing their expressions
- invoke the JVM method corresponding to the function
This one was not easy, but we are building a compiler after all. We have to sweat a little.
We have seen that while pushing values we may want to convert them, to ensure they have a certain
type. Let's see the methods we use to perform these conversions:
30
31 private fun Expression.pushAsString(methodVisitor: MethodVisitor,
32 localSymbols: HashMap<String, JvmCompiler.Entry>) {
33 when (this.type()) {
34 is IntType -> {
35 this.pushAsInt(methodVisitor, localSymbols)
36 methodVisitor.visitMethodInsn(INVOKESTATIC, "java/lang/Integer",
37 "toString",
38 "(I)${String::class.java.jvmDescription()}", false)
39 }
40 is DecimalType -> {
41 this.pushAsDouble(methodVisitor, localSymbols)
42 methodVisitor.visitMethodInsn(INVOKESTATIC, "java/lang/Double",
43 "toString",
44 "(D)${String::class.java.jvmDescription()}", false)
45 }
46 is StringType -> this.push(methodVisitor, localSymbols)
47 else -> throw UnsupportedOperationException(
48 this.type().javaClass.canonicalName)
49 }
50 }
The structure is pretty simple: if the value already has the expected type we do a simple push;
otherwise we do a push followed by some operation that performs the conversion.
Consider pushAsInt: if the value to be converted to an int is a double, we invoke the operation D2I,
which converts the double value on top of the stack to an int value. We do the opposite in pushAsDouble,
by using I2D.
To convert numbers to strings we instead need to invoke the methods Integer.toString and
Double.toString. Both are static methods expecting one parameter. So we push the value
to be converted and invoke those methods. They pop the value on top of the stack and use it as
their parameter, then convert it to a string and push that string on top of the stack.
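These bytecode conversions have the same semantics as plain Java casts and calls, so we can check our expectations without running the compiler. A small sketch (helper names are ours, for illustration):

```java
// The conversions emitted by pushAsInt/pushAsDouble/pushAsString correspond
// to these Java-level operations: D2I truncates toward zero, I2D widens
// losslessly, and Integer.toString/Double.toString produce the textual form.
public class ConversionSemantics {
    public static int d2i(double d) { return (int) d; }        // D2I
    public static double i2d(int i) { return (double) i; }     // I2D
    public static String intToString(int i) { return Integer.toString(i); }
    public static String doubleToString(double d) { return Double.toString(d); }
}
```
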
Compilation
It is time to see the remaining element of our compiler: the Compilation class.
43 SystemInterface::class.java.jvmDescription())
44 constructor.visitInsn(RETURN)
45 constructor.visitEnd()
46 constructor.visitMaxs(-1, -1)
47 }
48
49 private fun compileFunction(functionDeclaration: FunctionDeclaration) {
50 val functionCode = functions[functionDeclaration]!!
51 val allParams = LinkedList<ValueDeclaration>()
52 allParams += functionCode.surroundingValues
53 allParams += functionDeclaration.params
54 generateMethod(functionCode.methodName, allParams,
55 functionDeclaration.variables(),
56 functionDeclaration.statements,
57 functionDeclaration.returnType)
58 }
59
60 private fun generateMethod(methodName: String,
61 methodParameters: List<ValueDeclaration>,
62 variables: List<VarDeclaration>,
63 statements: List<Statement>,
64 returnType: Type? = null) {
65 // our class will have just one method: the calculate method
66 // it will take as many methodParameters as the inputs and return nothing
67 val methodVisitor = cw.visitMethod(ACC_PUBLIC, methodName,
68 "(${methodParameters.map { it.type().jvmDescription() }.join\
69 ToString(separator = "")})"
70 + "${returnType?.jvmDescription() ?: "V"}", null, null)
71 methodVisitor.visitCode()
72 // labels are used by ASM to mark points in the code
73 val methodStart = Label()
74 val methodEnd = Label()
75 // with this call we indicate to what point in the method the label
76 // methodStart corresponds
77 methodVisitor.visitLabel(methodStart)
78
79 // Variable declarations:
80 // we find all variable declarations in our code and we assign to them
81 // an index value. Our vars map will tell us which variable name
82 // corresponds to which index
83 var nextIndex = 1
84 val localSymbols = HashMap<ValueDeclaration, Entry>()
85 methodParameters.forEach {
86 localSymbols[it] = Entry(nextIndex, it.type())
87 nextIndex += it.type().localVarTableSize()
88 // they are just represented by the params
89 }
90 variables.forEach {
91 localSymbols[it] = Entry(nextIndex, it.type())
92 methodVisitor.visitLocalVariable(it.name, it.type().jvmDescription()\
93 , null,
94 methodStart, methodEnd, nextIndex)
95 nextIndex += it.type().localVarTableSize()
96 }
97
98 // time to generate bytecode for all the statements
99 val ctx = CompilationContext(localSymbols, this)
100 statements.forEach { s ->
101 when (s) {
102 is InputDeclaration -> {
103 // Nothing to do, the value is already stored where it shoul\
104 d be
105 }
106 is VarDeclaration -> {
107 s.value.push(methodVisitor, ctx)
108 methodVisitor.visitVarInsn(s.type().storeOp(), localSymbols[\
109 s]!!.index)
110 }
111 is Print -> {
112 methodVisitor.visitVarInsn(ALOAD, 0)
113 methodVisitor.visitFieldInsn(GETFIELD, canonicalNameToIntern\
114 alName(className),
115 "systemInterface", SystemInterface::class.java.j\
116 vmDescription())
117 s.value.pushAsString(methodVisitor, ctx)
118 methodVisitor.visitMethodInsn(INVOKEINTERFACE,
119 SystemInterface::class.java.internalName(), "pri\
120 nt",
121 "(${String::class.java.jvmDescription()})V", tru\
122 e)
123 }
124 is Assignment -> {
125 s.value.push(methodVisitor, ctx)
126 methodVisitor.visitVarInsn(s.varDecl.referred!!.type().store\
127 Op(),
128 localSymbols[s.varDecl.referred!!]!!.index)
129 }
130 is FunctionDeclaration -> compileFunction(s)
131 is ExpressionStatatement -> s.expression.push(methodVisitor, ctx)
132 else -> throw UnsupportedOperationException(s.javaClass.canonica\
133 lName)
134 }
135 }
136
137 // We just say that here is the end of the method
138 methodVisitor.visitLabel(methodEnd)
139 // And we add the return instruction
140 if (returnType == null) {
141 methodVisitor.visitInsn(RETURN)
142 } else {
143 methodVisitor.visitInsn(returnType.returnOp())
144 }
145 methodVisitor.visitEnd()
146 methodVisitor.visitMaxs(-1, -1)
147 }
148
149 private fun compileCalculateMethod() {
150 generateMethod("calculate", ast.inputs(), ast.topLevelVariables(), ast.s\
151 tatements)
152 }
153
154 fun compile() : ByteArray {
155 ast.topLevelFunctions().forEach { collectFunctions(it, "fun") }
156
157 // here we specify that the class is in the format introduced with
158 // Java 8 (so it would require a JRE >= 8 to run). We also specify the
159 // name of the class, the fact it extends Object and it implements no
160 // interfaces
161 cw.visit(V1_8, ACC_PUBLIC, canonicalNameToInternalName(className), null,\
162 "java/lang/Object", null)
163
164 cw.visitField(ACC_PRIVATE, "systemInterface", SystemInterface::class.jav\
165 a.jvmDescription(),
166 null, null)
167
168 compileConstructor()
11. Generate JVM bytecode 181
169 compileCalculateMethod()
170
171 cw.visitEnd()
172 return cw.toByteArray()
173 }
174 }
Our strategy is to generate for each MiniCalcFun source file one JVM class with:
- one constructor
- one method named calculate that will execute the whole code
- one method for each MiniCalcFun function
The method calculate and the methods for the functions will be generated using generateMethod.
In both cases we just need to generate code for a sequence of statements. In one case the sequence
of statements will come from the global scope, in the other cases it will come from the body of the
function.
When we instantiate the compilation we pass the AST to compile and the name of the class:
We need to provide the name of the class because there is no name for the whole script in the AST.
We start by creating a ClassWriter:
ClassWriter is a class from ASM that will be used to generate the class. It will give us the actual
bytes to save in the class file. The parameters we specify instruct ASM to calculate for us several
values.
The action starts in compile. First of all we collect all the functions: we pick the top-level ones and
on each of them we invoke collectFunctions. This method looks for other functions inside the
given one, recursively.
- the name will be fun_f_g: a qualified name that permits us to distinguish functions having the same name but declared in different scopes
- the surrounding values of g will be: all the inputs and global variables plus the parameters and variables of f
The constructor
The constructor of our compiled class will receive one parameter of type SystemInterface. That
parameter defines how we will interact with the system. This will make testing much easier, as we
will see later.
We start by defining the constructor (special name <init>). Its signature indicates that the return
type is void and the only parameter is an instance of the class SystemInterface, which we have
already seen while examining the interpreter:
1 interface SystemInterface {
2 fun print(message: String)
3 }
We start by pushing this onto the stack (aload 0) and then invoke the constructor of Object, our
superclass. After that we take the value of the parameter and store it in the field systemInterface.
To do so we first push this (aload 0), then push the value to assign, i.e., the value of the
first and only parameter (aload 1), and finally we call PUTFIELD. At this point we just need to insert a
RETURN. In bytecode RETURN is never implicit; it must always be present.
Generate method
The generateMethod function will be used to generate both the method for the global scope (calculate)
and a method for each function present in the MiniCalcFun script being compiled.
We expect to receive the name of the JVM method to generate, the parameters this method will
receive, the variables that will be defined in this method, the statements composing the method body,
and optionally the return type.
Remember that a function can see variables defined outside of it: in the global scope or in a function
wrapping it. These variables become parameters in the corresponding JVM code, so they are
inserted in the list of methodParameters.
variables contains exclusively the variables defined in the function being compiled. When
generateMethod is called for the global scope, variables contains the global variables.
The signature is defined by simply taking all the method parameters and getting their corresponding
JVM type descriptions. They are joined without spaces between them and enclosed in
parentheses. At the end of the signature we have V if no return type is present, otherwise the JVM
type description corresponding to the return type.
We then define labels indicating the start and the end of the method. We will use them when defining
the range of validity of the variables, which is relevant only for debugging purposes.
At this point we register all the parameters and variables:
1 // Variable declarations:
2 // we find all variable declarations in our code and we assign to them an
3 // index value.
4 // our vars map will tell us which variable name corresponds to which index
5 var nextIndex = 1
6 val localSymbols = HashMap<ValueDeclaration, Entry>()
7 methodParameters.forEach {
8 localSymbols[it] = Entry(nextIndex, it.type())
9 nextIndex += it.type().localVarTableSize()
10 // they are just represented by the params
11 }
12 variables.forEach {
13 localSymbols[it] = Entry(nextIndex, it.type())
14 methodVisitor.visitLocalVariable(it.name, it.type().jvmDescription(), null,
15 methodStart, methodEnd, nextIndex)
16 nextIndex += it.type().localVarTableSize()
17 }
We have seen that the first entry of the local variables table is this, followed by the
parameters. So for each parameter we record its position in the local variables table. Remember that
double and long entries take two spaces. We do not use long, but we do use double, which corresponds to
the Decimal type in MiniCalcFun. So what happens if we have a method taking three parameters
(p0, p1, p2), where the first and the last are of type Int and the second is of type Decimal?
The parameters will occupy slots 1, 2-3, and 4 respectively. Then we proceed to insert the variables:
for example, if we have two string variables (v0 and v1) in this method, the local variables table
will have this content:
Entry Type Index
this reference 0
p0 int 1
p1 double 2-3
p2 int 4
v0 reference 5
v1 reference 6
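The index assignment behind this table can be reproduced in a few lines. This is a sketch with hypothetical names, not the compiler's actual code: slot 0 is this, and each entry takes the next free slot and advances it by its size.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of local-variables-table slot assignment: slot 0 is "this", then
// each parameter or variable takes the next free slot and advances it by
// its size (2 for double and long, 1 for everything else).
public class SlotAssigner {
    public static Map<String, Integer> assign(String[] names, int[] sizes) {
        Map<String, Integer> slots = new LinkedHashMap<>();
        int nextIndex = 1; // slot 0 is reserved for "this"
        for (int k = 0; k < names.length; k++) {
            slots.put(names[k], nextIndex);
            nextIndex += sizes[k];
        }
        return slots;
    }
}
```

Running assign on p0 (size 1), p1 (size 2), p2, v0, v1 (size 1 each) reproduces the indices 1, 2, 4, 5, 6 from the table.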
The call to visitLocalVariable is useful only to fill a table used for debugging purposes.
At this point we create an instance of context:
This instance captures the list of symbols and their position in the local variables table. We will
need it when executing statements and evaluating expressions. Why is that? Because when we have
a reference to p0 we can look in localSymbols to know where p0 is in the local variables table,
so that we can emit the correct instruction to retrieve or set its value.
Now we can process all the statements, in order:
1 is InputDeclaration -> {
2 // Nothing to do, the value is already stored where it should be
3 }
4 is VarDeclaration -> {
5 s.value.push(methodVisitor, ctx)
6 methodVisitor.visitVarInsn(s.type().storeOp(), localSymbols[s]!!.index)
7 }
For input declarations we do not need to do anything: we will receive their values as parameters, so
they will already be in the local variables table.
For variables we instead need to evaluate the expressions providing the initial values. Once we have
evaluated those expressions, their values are on top of the stack and the store operation puts them in
the local variables table. We figure out the index in the local variables table by looking in the
localSymbols map we created earlier.
The assignment works exactly like the variable declaration: we evaluate an expression and store its
value in the local variables table.
1 is Assignment -> {
2 s.value.push(methodVisitor, ctx)
3 methodVisitor.visitVarInsn(s.varDecl.referred!!.type().storeOp(),
4 localSymbols[s.varDecl.referred!!]!!.index)
5 }
The expression statement requires even less code: we just evaluate an expression, without saving
its result.
Then we have the function declarations, which are handled in a separate method:
We are left with the print statement. Now, one simple way to implement it would be this:
1 is Print -> {
2 // this means that we access the field "out" of "java.lang.System" which
3 // is of type "java.io.PrintStream"
4 mainMethodWriter.visitFieldInsn(GETSTATIC, "java/lang/System", "out",
5 "Ljava/io/PrintStream;")
6 // we push the value we want to print on the stack
7 s.value.push(mainMethodWriter, localSymbols)
8 // we call the method println of System.out to print the value. It will
9 // take its parameter from the stack note that we have to tell the JVM
10 // which variant of println to call. To do that we describe the
11 // signature of the method, depending on the type of the value we want
12 // to print.
13 // If we want to print an int we will produce the signature "(I)V",
14 // we will produce "(D)V" for a double
15 mainMethodWriter.visitMethodInsn(INVOKEVIRTUAL, "java/io/PrintStream",
16 "println", "(${s.value.type().jvmDescription()})V", false)
17 }
This just consists of pushing an expression on the stack and then invoking one of the
System.out.println methods. There are several of them: one taking a string, one taking an int,
another one taking a double.
However, we did not implement the print statement in this way. We instead delegate the implementation
to the systemInterface field. By choosing this approach we can either: i) define a
SystemInterface that actually prints to the screen, or ii) use an instance that collects the strings we
tried to print in a list, to later test the result. This is the same strategy we used in the interpreter.
1 is Print -> {
2 methodVisitor.visitVarInsn(ALOAD, 0)
3 methodVisitor.visitFieldInsn(GETFIELD, c\
4 anonicalNameToInternalName(className),
5 "systemInterface",
6 SystemInterface::class.java.jvmDescription())
7 s.value.pushAsString(methodVisitor, ctx)
8 methodVisitor.visitMethodInsn(INVOKEINTERFACE,
9 SystemInterface::class.java.internalName(), "print",
10 "(${String::class.java.jvmDescription()})V", true)
11 }
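The two kinds of SystemInterface implementations mentioned above can be sketched like this (the interface is the one shown earlier; the two implementation class names are ours, for illustration):

```java
import java.util.ArrayList;
import java.util.List;

// Two implementations of the SystemInterface abstraction: one prints for
// real, the other records what was printed so that tests can assert on it.
public class SystemInterfaces {
    interface SystemInterface { void print(String message); }

    static class RealSystemInterface implements SystemInterface {
        public void print(String message) { System.out.println(message); }
    }

    static class RecordingSystemInterface implements SystemInterface {
        final List<String> output = new ArrayList<>();
        public void print(String message) { output.add(message); }
    }
}
```
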
When writing a JVM compiler you may want to generate class files and examine them using the
javap utility. Let's look at some examples of MiniCalcFun code and the resulting class files we got,
disassembled using javap.
First example:
1 input Int i
2 print(i * 2)
1 {
2 private me.tomassetti.minicalc.interpreter.SystemInterface systemInterface;
3 descriptor: Lme/tomassetti/minicalc/interpreter/SystemInterface;
4 flags: ACC_PRIVATE
5
6 public minicalc.example(me.tomassetti.minicalc.interpreter.SystemInterface);
7 descriptor: (Lme/tomassetti/minicalc/interpreter/SystemInterface;)V
8 flags: ACC_PUBLIC
9 Code:
- the field for storing the instance of SystemInterface we get in the constructor
- the constructor
- the calculate method, which executes the whole script
1 {
2 private me.tomassetti.minicalc.interpreter.SystemInterface systemInterface;
3 descriptor: Lme/tomassetti/minicalc/interpreter/SystemInterface;
4 flags: ACC_PRIVATE
5
6 public minicalc.example(me.tomassetti.minicalc.interpreter.SystemInterface);
7 descriptor: (Lme/tomassetti/minicalc/interpreter/SystemInterface;)V
8 flags: ACC_PUBLIC
9 Code:
10 stack=2, locals=2, args_size=2
11 0: aload_0
12 1: invokespecial #11 // Method java/lang/Object."<init>\
13 ":()V
14 4: aload_0
15 5: aload_1
16 6: putfield #13 // Field systemInterface:Lme/tomas\
17 setti/minicalc/interpreter/SystemInterface;
18 9: return
19
20 public void calculate();
21 descriptor: ()V
22 flags: ACC_PUBLIC
23 Code:
24 stack=3, locals=1, args_size=1
25 0: aload_0
26 1: getfield #13 // Field systemInterface:Lme/tomas\
27 setti/minicalc/interpreter/SystemInterface;
28 4: aload_0
29 5: ldc #18 // int 5
30 7: invokevirtual #22 // Method "minicalc.example".fun_f\
31 :(I)I
32 10: invokestatic #28 // Method java/lang/Integer.toStri\
33 ng:(I)Ljava/lang/String;
34 13: invokeinterface #34, 2 // InterfaceMethod me/tomassetti/m\
35 inicalc/interpreter/SystemInterface.print:(Ljava/lang/String;)V
36 18: return
37
38 public int fun_f(int);
39 descriptor: (I)I
40 flags: ACC_PUBLIC
41 Code:
42 stack=2, locals=2, args_size=2
43 0: iload_1
44 1: ldc #17 // int 1
45 3: iadd
46 4: ireturn
47 }
In this case, in addition to the field, the constructor, and the calculate method, we have a method
for the function, named fun_f.
Testing
We may have finished looking at the code of our first compiler, but one important piece is still
missing: our tests.
Let's look at the general structure we use for testing the compilation of MiniCalcFun source files.
1 class JvmCompilerTest {
2
3 fun compile(code: String): Class<*> {
4 val res = MiniCalcParserFacade.parse(code)
5 assertTrue(res.isCorrect(), res.errors.toString())
6 val miniCalcFile = res.root!!
7 val bytes = JvmCompiler().compile(miniCalcFile, "me/tomassetti/MyCalc")
8 return MyClassLoader(bytes).loadClass("me.tomassetti.MyCalc")
9 }
10
11 class MyClassLoader(val bytes: ByteArray) : ClassLoader() {
12 override fun findClass(name: String?): Class<*> {
13 return defineClass(name, bytes, 0, bytes.size)
14 }
15 }
16
17 class TestSystemInterface : SystemInterface {
18
19 val output = LinkedList<String>()
20
In the compile method we take some code, parse it, and verify there are no errors. If everything
is fine we invoke the compiler's compile method, which returns the bytes of the compiled class. We
pass those to our simple classloader (MyClassLoader), which uses them to define a class when it is
asked to.
We also define an implementation of SystemInterface that, instead of printing strings, stores them
in a list.
This is how we can write tests using this structure:
And this is how you can write and test your JVM compiler.
StaMac
We are going to see how to write a JVM compiler for StaMac. We will reuse many principles we
have seen when writing the compiler for MiniCalcFun. However the structure of the generated
classes will be different because the execution model is different. MiniCalcFun is a typical imperative
language which executes a list of statements from beginning to end. StaMac instead is based on State
Machines, so it is event based.
For each StaMac source file we are going to generate several class files.
One class file will represent the whole state machine. We will then have one interface named
State. We will also have one class for each state of the state machine. Each of these state classes
will implement the State interface.
We will generate all the classes in one package, named using the name of the state machine.
Consider this example:
1 statemachine simpleSM
2
3 input lowSpeedThroughtput : Int
4 input highSpeedThroughtput : Int
5 var counter = 0
6
7 event accelerate
8 event slowDown
9 event clock
10
11 start state turnedOff {
12 on accelerate -> lowSpeed
13 }
14
15 state lowSpeed {
16 on entry {
17 counter = counter + lowSpeedThroughtput
18 }
19 on accelerate -> highSpeed
20 on slowDown -> turnedOff
21 on clock -> lowSpeed
22 }
23
24 state highSpeed {
25 on entry {
26 counter = counter + highSpeedThroughtput
27 }
28 on slowDown -> lowSpeed
29 on clock -> highSpeed
30 }
From this example we will generate the following classes:
stamac.simpleSM.StateMachine
stamac.simpleSM.State
stamac.simpleSM.turnedOff
stamac.simpleSM.lowSpeed
stamac.simpleSM.highSpeed
The StateMachine class will be the only one users should consider. The State interface and the three
implementations will be used by StateMachine. They could have been inner classes, but that would
have required introducing a little more complexity, which is not worthwhile at this stage.
The StateMachine class will have several elements. Some are public, while others are
package-protected because they are intended to be accessed only by the state classes:

- one field for SystemInterface, which we have already seen in the interpreter chapter and in the
  MiniCalcFun compiler
- one field for each input and variable
- one field indicating if the state machine has reached the exit
- one field containing an instance of a state class, representing the current state
- the exit method
- a goTo_Xxx method for each state
- a method enter and a method leave
- a method for each event
There are different ways of producing bytecode that would lead to a system with the same behavior.
We chose this approach because it seems reasonably clean and easy to implement. The main activity
of our system will be reacting to events from the external world, so we started from there. What
should happen when we receive an event? We should react in a way that depends on the event and
the current state.
The first choice is having a different method for each event: in this way the user communicates
which event is being sent by invoking the corresponding method. Alternatively, we could have
chosen to have a single method, named receiveEvent, and specify which event was sent using
a parameter, like receiveEvent(ACCELERATE), for example. With the approach we chose, the user
calls accelerate() instead.
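For comparison, the rejected single-method design would look roughly like this in source form. This is only a sketch: the Event enum, the receiveEvent name, and the recording list are hypothetical illustrations, not part of the generated code.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical alternative: one receiveEvent method taking an enum,
// instead of one public method per event as in the chosen design.
class SingleMethodMachine {
    enum Event { ACCELERATE, SLOW_DOWN, CLOCK }

    final List<Event> received = new ArrayList<>();

    void receiveEvent(Event event) {
        // a real machine would dispatch on the current state here;
        // we only record the event to keep the sketch minimal
        received.add(event);
    }
}
```

The chosen design trades this single entry point for a more discoverable API: the set of valid events is visible in the class's method list.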
The second choice is how to implement these event methods. One way would be to have a switch
on the current state, and do something different depending on the state. Something equivalent to:
1 void accelerate() {
2 switch (currentState) {
3 case TURNED_OFF:
4 ...
5 break;
6 case LOW_SPEED:
7 ...
8 break;
9 case HIGH_SPEED:
10 ...
11 break;
12 }
13 }
We did not like this approach for several reasons, including the fact that generating the bytecode for
switch statements is not trivial. We chose instead to simply delegate to an object representing the
current state, as if we were applying the State pattern from the Design Patterns book
(https://en.wikipedia.org/wiki/Design_Patterns).
So our code will be very similar to what we would obtain by compiling this example in Java:
1 class StateMachine {
2
3 ...
4
5 void accelerate() {
6 currentState.accelerate();
7 }
8
9 ...
10 }
Compiling the example we have seen before, we will obtain this code.
StateMachine class:
27 public stamac.simpleSM.StateMachine(me.tomassetti.stamac.jvmcompiler.SystemInt\
28 erface, int, int);
29 descriptor: (Lme/tomassetti/stamac/jvmcompiler/SystemInterface;II)V
30 flags: ACC_PUBLIC
31 Code:
32 stack=2, locals=4, args_size=4
33 0: aload_0
34 1: invokespecial #19 // Method java/lang/Object."<init>\
35 ":()V
36 4: aload_0
37 5: aload_1
38 6: putfield #21 // Field systemInterface:Lme/tomas\
39 setti/stamac/jvmcompiler/SystemInterface;
40 9: aload_0
41 10: aload_2
42 11: putfield #23 // Field lowSpeedThroughtput:I
43 14: aload_0
44 15: aload_3
45 16: putfield #25 // Field highSpeedThroughtput:I
46 19: aload_0
47 20: invokevirtual #28 // Method goTo_turnedOff:()V
48 23: return
49
50 void exit();
51 descriptor: ()V
52 flags:
53 Code:
54 stack=2, locals=1, args_size=1
55 0: aload_0
56 1: ldc #30 // int 1
57 3: putfield #32 // Field exited:Z
58 6: return
59
60 public boolean isExited();
61 descriptor: ()Z
62 flags: ACC_PUBLIC
63 Code:
64 stack=1, locals=1, args_size=1
65 0: aload_0
66 1: getfield #32 // Field exited:Z
67 4: ireturn
68
69 void goTo_turnedOff();
70 descriptor: ()V
71 flags:
72 Code:
73 stack=4, locals=1, args_size=1
74 0: aload_0
75 1: getfield #36 // Field currentState:Lstamac/simp\
76 leSM/State;
77 4: ifnull 16
78 7: aload_0
79 8: getfield #36 // Field currentState:Lstamac/simp\
80 leSM/State;
81 11: invokeinterface #41, 1 // InterfaceMethod stamac/simpleSM\
82 /State.leave:()V
83 16: aload_0
84 17: new #43 // class stamac/simpleSM/turnedOff
85 20: dup
86 21: aload_0
87 22: invokespecial #46 // Method stamac/simpleSM/turnedOf\
88 f."<init>":(Lstamac/simpleSM/StateMachine;)V
89 25: putfield #36 // Field currentState:Lstamac/simp\
90 leSM/State;
91 28: aload_0
92 29: getfield #36 // Field currentState:Lstamac/simp\
93 leSM/State;
94 32: invokeinterface #49, 1 // InterfaceMethod stamac/simpleSM\
95 /State.enter:()V
96 37: return
97 StackMapTable: number_of_entries = 1
98 frame_type = 16 /* same */
99
100 void goTo_lowSpeed();
101 descriptor: ()V
102 flags:
103 Code:
104 stack=4, locals=1, args_size=1
105 0: aload_0
106 1: getfield #36 // Field currentState:Lstamac/simp\
107 leSM/State;
108 4: ifnull 16
109 7: aload_0
110 8: getfield #36 // Field currentState:Lstamac/simp\
111 leSM/State;
112 11: invokeinterface #41, 1 // InterfaceMethod stamac/simpleSM\
113 /State.leave:()V
114 16: aload_0
115 17: new #52 // class stamac/simpleSM/lowSpeed
116 20: dup
117 21: aload_0
118 22: invokespecial #53 // Method stamac/simpleSM/lowSpeed\
119 ."<init>":(Lstamac/simpleSM/StateMachine;)V
120 25: putfield #36 // Field currentState:Lstamac/simp\
121 leSM/State;
122 28: aload_0
123 29: getfield #36 // Field currentState:Lstamac/simp\
124 leSM/State;
125 32: invokeinterface #49, 1 // InterfaceMethod stamac/simpleSM\
126 /State.enter:()V
127 37: return
128 StackMapTable: number_of_entries = 1
129 frame_type = 16 /* same */
130
131 void goTo_highSpeed();
132 descriptor: ()V
133 flags:
134 Code:
135 stack=4, locals=1, args_size=1
136 0: aload_0
137 1: getfield #36 // Field currentState:Lstamac/simp\
138 leSM/State;
139 4: ifnull 16
140 7: aload_0
141 8: getfield #36 // Field currentState:Lstamac/simp\
142 leSM/State;
143 11: invokeinterface #41, 1 // InterfaceMethod stamac/simpleSM\
144 /State.leave:()V
145 16: aload_0
146 17: new #56 // class stamac/simpleSM/highSpeed
147 20: dup
148 21: aload_0
149 22: invokespecial #57 // Method stamac/simpleSM/highSpee\
150 d."<init>":(Lstamac/simpleSM/StateMachine;)V
151 25: putfield #36 // Field currentState:Lstamac/simp\
152 leSM/State;
195
196 public void clock();
197 descriptor: ()V
198 flags: ACC_PUBLIC
199 Code:
200 stack=1, locals=1, args_size=1
201 0: aload_0
202 1: getfield #32 // Field exited:Z
203 4: ifne 16
204 7: aload_0
205 8: getfield #36 // Field currentState:Lstamac/simp\
206 leSM/State;
207 11: invokeinterface #66, 1 // InterfaceMethod stamac/simpleSM\
208 /State.clock:()V
209 16: return
210 StackMapTable: number_of_entries = 1
211 frame_type = 16 /* same */
212 }
State interface:
1 interface stamac.simpleSM.State
2 {
3 public abstract void enter();
4 descriptor: ()V
5 flags: ACC_PUBLIC, ACC_ABSTRACT
6
7 public abstract void leave();
8 descriptor: ()V
9 flags: ACC_PUBLIC, ACC_ABSTRACT
10
11 public abstract void accelerate();
12 descriptor: ()V
13 flags: ACC_PUBLIC, ACC_ABSTRACT
14
15 public abstract void slowDown();
16 descriptor: ()V
17 flags: ACC_PUBLIC, ACC_ABSTRACT
18
19 public abstract void clock();
20 descriptor: ()V
21 flags: ACC_PUBLIC, ACC_ABSTRACT
22 }
turnedOff class:
42 leSM/StateMachine;
43 4: invokevirtual #23 // Method stamac/simpleSM/StateMac\
44 hine.goTo_lowSpeed:()V
45 7: return
46
47 public void slowDown();
48 descriptor: ()V
49 flags: ACC_PUBLIC
50 Code:
51 stack=0, locals=1, args_size=1
52 0: return
53
54 public void clock();
55 descriptor: ()V
56 flags: ACC_PUBLIC
57 Code:
58 stack=0, locals=1, args_size=1
59 0: return
60 }
lowSpeed class:
21 flags: ACC_PUBLIC
22 Code:
23 stack=3, locals=1, args_size=1
24 0: aload_0
25 1: getfield #15 // Field stateMachine:Lstamac/simp\
26 leSM/StateMachine;
27 4: aload_0
28 5: getfield #15 // Field stateMachine:Lstamac/simp\
29 leSM/StateMachine;
30 8: getfield #22 // Field stamac/simpleSM/StateMach\
31 ine.counter:I
32 11: aload_0
33 12: getfield #15 // Field stateMachine:Lstamac/simp\
34 leSM/StateMachine;
35 15: getfield #25 // Field stamac/simpleSM/StateMach\
36 ine.lowSpeedThroughtput:I
37 18: iadd
38 19: putfield #22 // Field stamac/simpleSM/StateMach\
39 ine.counter:I
40 22: return
41
42 public void leave();
43 descriptor: ()V
44 flags: ACC_PUBLIC
45 Code:
46 stack=0, locals=1, args_size=1
47 0: return
48
49 public void accelerate();
50 descriptor: ()V
51 flags: ACC_PUBLIC
52 Code:
53 stack=1, locals=1, args_size=1
54 0: aload_0
55 1: getfield #15 // Field stateMachine:Lstamac/simp\
56 leSM/StateMachine;
57 4: invokevirtual #30 // Method stamac/simpleSM/StateMac\
58 hine.goTo_highSpeed:()V
59 7: return
60
61 public void slowDown();
62 descriptor: ()V
63 flags: ACC_PUBLIC
64 Code:
65 stack=1, locals=1, args_size=1
66 0: aload_0
67 1: getfield #15 // Field stateMachine:Lstamac/simp\
68 leSM/StateMachine;
69 4: invokevirtual #34 // Method stamac/simpleSM/StateMac\
70 hine.goTo_turnedOff:()V
71 7: return
72
73 public void clock();
74 descriptor: ()V
75 flags: ACC_PUBLIC
76 Code:
77 stack=1, locals=1, args_size=1
78 0: aload_0
79 1: getfield #15 // Field stateMachine:Lstamac/simp\
80 leSM/StateMachine;
81 4: invokevirtual #38 // Method stamac/simpleSM/StateMac\
82 hine.goTo_lowSpeed:()V
83 7: return
84 }
highSpeed class:
18 leSM/StateMachine;
19 9: return
20
21 public void enter();
22 descriptor: ()V
23 flags: ACC_PUBLIC
24 Code:
25 stack=3, locals=1, args_size=1
26 0: aload_0
27 1: getfield #15 // Field stateMachine:Lstamac/simp\
28 leSM/StateMachine;
29 4: aload_0
30 5: getfield #15 // Field stateMachine:Lstamac/simp\
31 leSM/StateMachine;
32 8: getfield #22 // Field stamac/simpleSM/StateMach\
33 ine.counter:I
34 11: aload_0
35 12: getfield #15 // Field stateMachine:Lstamac/simp\
36 leSM/StateMachine;
37 15: getfield #25 // Field stamac/simpleSM/StateMach\
38 ine.highSpeedThroughtput:I
39 18: iadd
40 19: putfield #22 // Field stamac/simpleSM/StateMach\
41 ine.counter:I
42 22: return
43
44 public void leave();
45 descriptor: ()V
46 flags: ACC_PUBLIC
47 Code:
48 stack=0, locals=1, args_size=1
49 0: return
50
51 public void accelerate();
52 descriptor: ()V
53 flags: ACC_PUBLIC
54 Code:
55 stack=0, locals=1, args_size=1
56 0: return
57
58 public void slowDown();
59 descriptor: ()V
60 flags: ACC_PUBLIC
61 Code:
62 stack=1, locals=1, args_size=1
63 0: aload_0
64 1: getfield #15 // Field stateMachine:Lstamac/simp\
65 leSM/StateMachine;
66 4: invokevirtual #31 // Method stamac/simpleSM/StateMac\
67 hine.goTo_lowSpeed:()V
68 7: return
69
70 public void clock();
71 descriptor: ()V
72 flags: ACC_PUBLIC
73 Code:
74 stack=1, locals=1, args_size=1
75 0: aload_0
76 1: getfield #15 // Field stateMachine:Lstamac/simp\
77 leSM/StateMachine;
78 4: invokevirtual #35 // Method stamac/simpleSM/StateMac\
79 hine.goTo_highSpeed:()V
80 7: return
81 }
Let's start with the similarities: in StaMac too we have functions to deal with types. So the
extension methods Type.jvmDescription, Type.localVarTableSize, Type.loadOp, Type.storeOp,
and Type.returnOp are exactly the same as we have seen in the compiler for MiniCalcFun. The
methods canonicalNameToInternalName and canonicalNameToJvmDescription, and the corresponding
extension methods for Class, are also reused.
The code that deals with expressions is also very similar: pushAsInt, pushAsDouble, and
pushAsString look exactly the same. The code of push is partially the same, but it contains some
differences. Let's focus on them:
ValueReference is treated differently from what we did for MiniCalcFun because of the way we
store global variables and inputs. In the case of StaMac we store them as class fields.
1 class JvmCompiler {
2
3 fun compile(ast: StateMachine) = Compilation(ast).compile()
4
5 }
6
7 interface SystemInterface {
8 fun print(message: String)
9 }
10
11 data class CompilationContext(val compilation: Compilation,
12 val classCname: String)
13
14 class Compilation(val ast: StateMachine) {
15 ...
16 }
The main method of our compiler takes a source file but produces a list of class files, instead of
just one as it did for MiniCalcFun. This is because for MiniCalcFun we generated one class file for
each source file, while for StaMac we generate several.
17 }
18 } else {
19 System.err.println("${res.errors.size} error(s) found\n")
20 res.errors.forEach { System.err.println(it) }
21 }
22 }
34 }
35
36 private fun smIsExitedMethod() {
37 ...
38 }
39
40 private fun smEventMethod(event: EventDeclaration) {
41 ...
42 }
43
44 private fun smGoToStateMethod(state: StateDeclaration) {
45 ...
46 }
47
48 fun compile() : Map<String, ByteArray> {
49 val classes = HashMap<String, ByteArray>()
50
51 // here we specify that the class is in the format introduced with
52 // Java 8 (so it would require a JRE >= 8 to run)
53 // We also specify the name of the class, the fact it extends Object
54 // and it implements no interfaces
55 smClass.visit(V1_8, ACC_PUBLIC,
56 canonicalNameToInternalName(stateMachineCName), null, "java/lang/Ob\
57 ject", null)
58
59 smClass.visitField(0, "systemInterface",
60 SystemInterface::class.java.jvmDescription(), null, null)
61 smClass.visitField(ACC_PRIVATE, "exited", "Z", null, null)
62 smClass.visitField(ACC_PRIVATE, "currentState",
63 canonicalNameToJvmDescription(stateInterfaceCName), null, null)
64 ast.inputs.forEach {
65 smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
66 }
67 ast.variables.forEach {
68 smClass.visitField(0, it.name, it.type.jvmDescription(), null, null)
69 }
70
71 smConstructor()
72 smExitMethod()
73 smIsExitedMethod()
74 ast.states.forEach {
75 smGoToStateMethod(it)
76 }
77
78 ast.events.forEach {
79 smEventMethod(it)
80 }
81
82 smClass.visitEnd()
83 classes[stateMachineCName] = smClass.toByteArray()
84
85 // add State interface
86 compileStateInterface(classes)
87
88 ast.states.forEach {
89 // generate State classes
90 compileStateClass(it, classes)
91 }
92 return classes
93 }
94 }
The Compilation receives the AST to compile. It defines a ClassWriter exactly as we did in the
compiler for MiniCalcFun.
Then we have a few constants and a function to define the names of the classes to generate. CName
stands for canonical name, i.e., the qualified name of a class, which includes the package name.
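The implementations of these naming helpers are not repeated in this excerpt, but converting between the two forms is essentially a separator replacement. A minimal sketch, ignoring arrays and primitive types:

```java
// Canonical name: dots between packages (stamac.simpleSM.StateMachine).
// Internal name: slashes instead of dots (stamac/simpleSM/StateMachine).
// JVM description: the internal name wrapped in 'L' and ';', as used in
// field and method descriptors (Lstamac/simpleSM/StateMachine;).
class JvmNames {
    static String canonicalNameToInternalName(String canonicalName) {
        return canonicalName.replace('.', '/');
    }

    static String canonicalNameToJvmDescription(String canonicalName) {
        return "L" + canonicalNameToInternalName(canonicalName) + ";";
    }
}
```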
The compile method is where we coordinate the work. We start by preparing a map to collect all
the classes we are going to generate, indexed by their name. To be precise, we are going to store the
actual bytes corresponding to each class (as ByteArray instances).
We will define the class for the state machine. First we define the fields: systemInterface, exited,
currentState, and then one field for each input and one for each variable.
Then we define the constructor, the exit method, the isExited method, one goTo_xxx method for
each state and one method for each event.
After that we define the State interface and one class for each state.
37
38 // add State interface
39 compileStateInterface(classes)
40
41 ast.states.forEach {
42 // generate State classes
43 compileStateClass(it, classes)
44 }
45 return classes
46 }
The StateMachine class is the only class intended to be used directly. It coordinates all the activities.
The constructor of StateMachine expects an instance of SystemInterface and values for all the
inputs. The signature is similar to the signature we have seen for the constructor of the class
generated for MiniCalcFun.
Also in this case we call the super constructor, the default constructor of Object.
We then store the first parameter in the field systemInterface. After that we look at all the values
for the inputs which are passed as parameters. We store them one by one in separate fields.
Then we invoke the goTo_xxx method for the start state and we close the constructor.
20 var index = 2
21 ast.inputs.forEach {
22 constructor.visitVarInsn(ALOAD, 0)
23 constructor.visitVarInsn(it.type.loadOp(), index)
24 constructor.visitFieldInsn(PUTFIELD,
25 canonicalNameToInternalName(stateMachineCName),
26 it.name, it.type.jvmDescription())
27 index += it.type.localVarTableSize()
28 }
29
30 constructor.visitVarInsn(ALOAD, 0)
31 constructor.visitMethodInsn(INVOKEVIRTUAL,
32 canonicalNameToInternalName(stateMachineCName),
33 goToMethodName(ast.startState()), "()V", false)
34
35 constructor.visitInsn(RETURN)
36 constructor.visitEnd()
37 constructor.visitMaxs(-1, -1)
38 }
So what does this do in practice? If currentState is not null we call the method leave on it.
currentState will be null when we first start the state machine, because we are not yet in any
state: when we go to the start state there is no state to leave. From then on, whenever we call
one of these goTo_xxx methods, there will be a state to leave, and to do that we call the
method leave on it.
Once we have done that we want to instantiate the class representing the state to which we are
going. After we have instantiated it we assign it to the field currentState. Finally we execute the
method enter on the new value of currentState. In this case we are sure this value is not null, so
there is no reason to check.
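In source form, the bytecode of each goTo_xxx method corresponds to a sketch like the following. The types here are simplified stand-ins for the generated State interface and state classes, not the generated code itself:

```java
// Simplified stand-ins for the generated State interface and the
// turnedOff state class.
interface SketchState {
    void enter();
    void leave();
}

class TurnedOffSketch implements SketchState {
    boolean entered = false;
    public void enter() { entered = true; }
    public void leave() { }
}

class TransitionSketch {
    SketchState currentState = null;

    // Equivalent of the generated goTo_turnedOff: leave the old state
    // if there is one, instantiate the new state, store it in
    // currentState, and finally enter it.
    void goToTurnedOff() {
        if (currentState != null) {
            currentState.leave();
        }
        TurnedOffSketch next = new TurnedOffSketch();
        currentState = next;
        next.enter();
    }
}
```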
43
44 mv.visitInsn(RETURN)
45 mv.visitEnd()
46 mv.visitMaxs(-1, -1)
47 }
Similarly to the goTo_xxx methods we have the exit method. This method too is not public, because
it is not intended to be called directly by the users of our compiled class.
Finally we have the main public methods the users will need to interact with the state machine.
These methods, which are public and named after the events, allow reporting that an event has
been received.
What they do is load the value of the field exited and check whether it is true (opcode
IFNE). If that is the case, the rest of the method is skipped, as we jump directly to the RETURN
instruction.
If instead the field exited is false we continue by invoking the method corresponding to the event
on the field currentState. For example, if we are defining the method foo on the StateMachine,
we will invoke currentState.foo(). In other words, we delegate to currentState to decide how
to react to the event received.
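In source form, each generated event method is equivalent to this sketch (EventState and CountingState are illustrative stand-ins, not generated types):

```java
// Illustrative stand-ins: a state that counts clock events, and the
// exited check a generated event method performs before delegating.
interface EventState {
    void clock();
}

class CountingState implements EventState {
    int clocks = 0;
    public void clock() { clocks++; }
}

class EventSketch {
    boolean exited = false;
    final EventState currentState;

    EventSketch(EventState currentState) { this.currentState = currentState; }

    // Equivalent of a generated event method: the IFNE jump over the
    // delegation corresponds to this early return when exited is true.
    void clock() {
        if (exited) return;
        currentState.clock();
    }
}
```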
It is time to examine how the State interface is defined. There is no code involved here, we just
define the methods. All of them are public and abstract, because they are interface methods. We
always have enter and leave. Then we also have one method for each event, named after the event
itself.
10 null, null)
11 ast.events.forEach {
12 interfaceClass.visitMethod(ACC_PUBLIC or ACC_ABSTRACT, it.name, "()V",
13 null, null)
14 }
15 interfaceClass.visitEnd()
16 classes[stateInterfaceCName] = interfaceClass.toByteArray()
17 }
We define a class for each state. While defining it we specify that it implements one interface: the
State interface (arrayOf(canonicalNameToInternalName(stateInterfaceCName))).
The class will have a stateMachine field. In the constructor we receive the reference to the
state machine and store it in that field.
We then have the enter and leave methods. They simply contain all the code for the statements
associated with the on-entry and on-exit blocks.
25 canonicalNameToInternalName(stateClassCName(state)),
26 "stateMachine",
27 canonicalNameToJvmDescription(stateMachineCName))
28 constructor.visitInsn(RETURN)
29 constructor.visitEnd()
30 constructor.visitMaxs(-1, -1)
31
32 val enterMethod = stateClass.visitMethod(ACC_PUBLIC, "enter", "()V",
33 null, null)
34 enterMethod.visitCode()
35 state.blocks.filterIsInstance(OnEntryBlock::class.java).forEach {
36 it.statements.forEach {
37 compileStatement(it, enterMethod, stateClassCName(state))}
38 }
39 enterMethod.visitInsn(RETURN)
40 enterMethod.visitEnd()
41 enterMethod.visitMaxs(-1, -1)
42
43 val leaveMethod = stateClass.visitMethod(ACC_PUBLIC, "leave", "()V",
44 null, null)
45 leaveMethod.visitCode()
46 state.blocks.filterIsInstance(OnExitBlock::class.java).forEach {
47 it.statements.forEach {
48 compileStatement(it, leaveMethod, stateClassCName(state))}
49 }
50 leaveMethod.visitInsn(RETURN)
51 leaveMethod.visitEnd()
52 leaveMethod.visitMaxs(-1, -1)
53
54 ast.events.forEach { e ->
55 val eventMethod = stateClass.visitMethod(ACC_PUBLIC, e.name, "()V",
56 null, null)
57 eventMethod.visitCode()
58
59 val transition = state.blocks.filterIsInstance(OnEventBlock::class.java)
60 .find { it.event.referred!! == e }
61 if (transition != null) {
62 eventMethod.visitVarInsn(ALOAD, 0)
63 eventMethod.visitFieldInsn(GETFIELD,
64 canonicalNameToInternalName(stateClassCName(state)),
65 "stateMachine",
66 canonicalNameToJvmDescription(stateMachineCName))
67 eventMethod.visitMethodInsn(INVOKEVIRTUAL,
68 canonicalNameToInternalName(stateMachineCName),
69 goToMethodName(transition.destination.referred!!),
70 "()V", false)
71 }
72
73 eventMethod.visitInsn(RETURN)
74 eventMethod.visitEnd()
75 eventMethod.visitMaxs(-1, -1)
76 }
77
78 stateClass.visitEnd()
79 classes[stateClassCName(state)] = stateClass.toByteArray()
80 }
1 state lowSpeed {
2 on entry {
3 counter = counter + lowSpeedThroughtput
4 }
5 on accelerate -> highSpeed
6 on slowDown -> turnedOff
7 on clock -> lowSpeed
8 }
In this case the enter method will contain the code for the statement counter = counter +
lowSpeedThroughtput, while the leave method will be empty. The generation of code for each
statement is done in the compileStatement method that follows below.
We then have one method for each event. To compile each of these methods we check if there is
a transition in that state for that event. For example, in the previous example the state
lowSpeed has a transition on the event accelerate. That transition goes to the state highSpeed, so
we need to generate code that expresses that. The way we do it is by calling the goTo_xxx method
corresponding to the target state on the StateMachine instance. In this case, for example, we would
invoke stateMachine.goTo_highSpeed().
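Putting this together, the class generated for lowSpeed corresponds to the following source-form sketch. MachineOps is an illustrative stand-in for the generated StateMachine, exposing only what the state class touches:

```java
// Stand-in for the generated StateMachine: exposes the fields and
// goTo_xxx methods the state class needs, recording the last transition.
class MachineOps {
    int counter = 0;
    final int lowSpeedThroughtput;
    String lastTransition = "";

    MachineOps(int lowSpeedThroughtput) {
        this.lowSpeedThroughtput = lowSpeedThroughtput;
    }

    void goToHighSpeed() { lastTransition = "highSpeed"; }
    void goToTurnedOff() { lastTransition = "turnedOff"; }
    void goToLowSpeed() { lastTransition = "lowSpeed"; }
}

// Source-form equivalent of the generated lowSpeed state class.
class LowSpeedSketch {
    private final MachineOps stateMachine;

    LowSpeedSketch(MachineOps stateMachine) { this.stateMachine = stateMachine; }

    // on entry { counter = counter + lowSpeedThroughtput }
    void enter() { stateMachine.counter += stateMachine.lowSpeedThroughtput; }
    // no on-exit block, so leave is empty
    void leave() { }
    // on accelerate -> highSpeed
    void accelerate() { stateMachine.goToHighSpeed(); }
    // on slowDown -> turnedOff
    void slowDown() { stateMachine.goToTurnedOff(); }
    // on clock -> lowSpeed (a self-transition)
    void clock() { stateMachine.goToLowSpeed(); }
}
```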
We just have to see how statements are compiled:
43 lName)
44 }
45 }
Tests
The general structure of the tests for the StaMac compiler is practically the same as the one we had
for MiniCalcFun:
1 class JvmCompilerTest {
2
3 fun compile(code: String): Class<*> {
4 val res = SMLangParserFacade.parse(code)
5 assertTrue(res.isCorrect(), res.errors.toString())
6 val miniCalcFile = res.root!!
7 val classesBytecode = JvmCompiler().compile(miniCalcFile)
8 val classes = HashMap<String, Class<*>>()
9 classesBytecode.forEach { name, bytes -> classes[name.replace("/", ".")]\
10 = MyClassLoader(classesBytecode).loadClass(name.replace("/", ".")) }
11 return classes["stamac.sm.StateMachine"]!!
12 }
13
14 class MyClassLoader(val bytes: Map<String, ByteArray>) : ClassLoader() {
15 override fun findClass(name: String?): Class<*> {
16 return defineClass(name, bytes[name], 0, bytes[name]!!.size)
17 }
18 }
19
20 class TestSystemInterface : SystemInterface {
21
22 val output = LinkedList<String>()
23
24 override fun print(message: String) {
25 output.add(message)
26 }
27
28 }
29
30 ...
31 }
The only difference is that in this case we can produce several classes, not just one. So we slightly
adapted MyClassLoader.
This is an actual test based on the example we have used in this section:
33 }""")
34
35 val systemInterface = TestSystemInterface()
36 val instance = clazz.declaredConstructors[0].newInstance(
37 systemInterface, 2, 5)
38 assertEquals(emptyList<String>(), systemInterface.output)
39 clazz.methods.find { it.name == "accelerate" }!!.invoke(instance)
40 assertEquals(listOf("2"), systemInterface.output)
41 clazz.methods.find { it.name == "clock" }!!.invoke(instance)
42 assertEquals(listOf("2", "4"), systemInterface.output)
43 clazz.methods.find { it.name == "clock" }!!.invoke(instance)
44 assertEquals(listOf("2", "4", "6"), systemInterface.output)
45 clazz.methods.find { it.name == "accelerate" }!!.invoke(instance)
46 assertEquals(listOf("2", "4", "6", "11"), systemInterface.output)
47 clazz.methods.find { it.name == "slowDown" }!!.invoke(instance)
48 assertEquals(listOf("2", "4", "6", "11", "13"), systemInterface.output)
49 }
We added some print statements to our original example, so that we can verify the output in our
assertions.
In the test we start by instantiating the StateMachine class returned by compile. We do that by
passing an instance of TestSystemInterface and values for the inputs (lowSpeedThroughtput, and
highSpeedThroughtput in this case).
Then we invoke the methods corresponding to the different events and we verify that the correct
values are printed.
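The expected values can be checked independently of the generated bytecode by replaying the transition table by hand. This is a minimal simulation of the example, assuming (as the assertions suggest) that the added print statements print the counter in the on-entry block of each speed state:

```java
import java.util.ArrayList;
import java.util.List;

// Hand-replay of the example machine: entering lowSpeed adds
// lowSpeedThroughtput to the counter, entering highSpeed adds
// highSpeedThroughtput, and 'clock' is a self-transition that
// re-enters the current state.
class Replay {
    static List<Integer> run(List<String> events, int low, int high) {
        String state = "turnedOff";
        int counter = 0;
        List<Integer> values = new ArrayList<>();
        for (String e : events) {
            String next = transition(state, e);
            if (next == null) continue; // no transition: event ignored
            state = next;
            if (state.equals("lowSpeed")) { counter += low; values.add(counter); }
            else if (state.equals("highSpeed")) { counter += high; values.add(counter); }
        }
        return values;
    }

    static String transition(String state, String event) {
        switch (state + "/" + event) {
            case "turnedOff/accelerate": return "lowSpeed";
            case "lowSpeed/accelerate": return "highSpeed";
            case "lowSpeed/slowDown": return "turnedOff";
            case "lowSpeed/clock": return "lowSpeed";
            case "highSpeed/slowDown": return "lowSpeed";
            case "highSpeed/clock": return "highSpeed";
            default: return null;
        }
    }
}
```

With the inputs 2 and 5 and the event sequence used in the test, the replay yields the same 2, 4, 6, 11, 13 sequence the assertions expect.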
Summary
In this chapter we have learned how the JVM works and we have defined two JVM compilers for
two different languages.
There are undoubtedly some similarities and a common structure in the way we work with types,
expressions and statements. However the general structure of the generated classes can vary a lot,
depending on the nature of the languages. In this chapter we have examined the most common and
useful concepts you can leverage to write powerful compilers for the JVM. There is still a lot to learn:
inner classes, invokedynamic, control-flow statements. We could not cover all of it in one
chapter, but at this stage you should be familiar with how compilers for the JVM work and you can
keep going from here. Remember that the JVM Specification is a very useful resource.
12. Generate LLVM bitcode
Part III: editing support
13. Syntax highlighting
14. Auto completion
Write to me
I would be extremely grateful if you could share your feedback with me. Write to me with your
ideas, suggestions, and comments at federico@tomassetti.me
If you want to read more about these topics you can find articles on my blog on Language
Engineering.
https://tomassetti.me