Compiler Construction

Compiler construction Paul Cockshott
Function
The job of a compiler is to take a file specifying an algorithm in a high level language and translate this into assembler code for a particular computer. The result of running the compiler is an executable file for the computer in question. It may be necessary to run an assembler or linker after the compiler to produce the final binary file.
Approaches
Write the compiler in assembler, as was done with Basic interpreters. The first Fortran compiler (1957) was done this way.
Write the compiler in its own language C. If this is the first compiler of this type, then hand translate it to assembler: C -> human -> A. Use the hand translated version to do a machine translation: C -> A -> A. This was done with the first Burroughs Algol compiler in 1960.
Approaches
Write the compiler for one language in another existing language, for example write a Java compiler in C, as is the case with gcj.
Write a new compiler for an existing language using an existing compiler for the same language, so gcc was written in C and originally compiled using the Unix C compiler cc.
Write the compiler using a Compiler Compiler.
Compiler compilers
Lexical analysis generators
These generate programs to tokenize the input stream into the lexemes defined in the high level language. They are based on regular expressions and they typically output C or Java.
Examples are Lex, JLex, Flex.
Compiler compilers
Syntax analysis generators
These are typically based on a BNF-like input notation that describes the context free syntax of a language. They generate a parser for the language. The parser they produce may be in C, Java or some other high level language.
Examples: yacc, bison, SableCC.
They usually require a lexical analyser produced by a distinct package, but SableCC integrates the lexical analysis as well.
Compiler compilers
Code generator generators
These do the job of producing code generators. A code generator translates the parsed input programme into assembler. Code generator generators are generally specified using a tree matching grammar.
Examples are Twig, Burg, Ilcg.
Advantages of compiler compilers
They automate a time consuming task.
They remove many of the bugs that can exist in hand generated compilers.
Disadvantages
You have to learn the specification language.
You have to write glue code to use the output.
They may force you to use unfamiliar programming paradigms, such as the visitor pattern with SableCC.
ILCG2
ILCG is a code generator generator that has been used in this department for a while. ILCG2 is an extension of the original system that includes the SableCC system for lexical analysis and parsing. ILCG2 allows you to specify:
the concrete syntax of your language, using SableCC
the machine you will generate code for, in ILCG
ILCG2
Here is the ILCG2 file you will work with:
    compiler tinybasic parser tinyb cpu IA32
It will produce a compiler file called tinybasic.java.
It will use the file tinyb.sab to define the syntax of the language.
It will use the file IA32.ilc to define the processor you will be generating code for.
Sablecc
The syntax and semantics are defined in the Sable language. The reference manual for this language is available from the SableCC web site. Here I will give you a short introduction to it, but I advise you to download the manual for more details.
Parts of the SableCC File
1. Package
2. Helpers
3. Tokens
4. Ignored Tokens
5. Productions
Package
For example Package tinyb; Defines the package the Java files will be in. That is to say the files will be placed in subdirectories under the directory tinyb. The following subpackages will be created:
tinyb.analysis, tinyb.lexer, tinyb.node, tinyb.parser
Helpers
This section defines regular expressions used in the definition of terminal symbols in the grammar. Regular expressions are made up of:
characters, either in quotes 'z', 'S', or decimal 10, 13
character sets ['a'..'z'] [['a'..'z']+['A'..'Z']] [['A'..'Z']-'E']
Helpers
Regular expressions are made up of (continued):
a bracketed regular expression ( <regexp> )
an alternation of regular expressions, eg: 'a'|['0'..'9']|'Z'
a string, eg: 'then'
a helper id, eg: alpha
a sequence of regular expressions, eg: 'a' 'c' ('d'|'e')
a regular expression followed by a *, eg: alpha* stands for zero or more alphas
Helpers
Regular expressions are made up of (continued):
a regular expression followed by a +, eg: digit+ stands for one or more digits
Regular expressions can be named, e.g.:
    digit = ['0'..'9'];
    lwcase = ['a'..'z'];
    upcase = ['A'..'Z'];
    letter = lwcase | upcase;
    alphanum = letter|digit;
Helper and other names must be lower case.
Tokens
Tokens are the actual lexemes recognised by the language. They are terminal symbols in the context free grammar of the language. A token is defined by a regular expression. It may contain helper ids, but it may not contain other token ids.
    begin = 'begin';
    then = 'then';
    id = letter alphanum*;
    number = digit+;
Note that order is significant: if we defined id before begin, then begin would be recognised as an id, not as a distinct token.
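The effect of token order can be tried out with a small hand-written scanner. This is only a sketch using java.util.regex, not the lexer SableCC generates. It assumes the usual lexer-generator rule that the longest match wins and that ties go to the token declared first. The class name TokenOrderSketch and the blank rule are invented for the illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not SableCC output: longest match wins, ties go to the earliest rule.
class TokenOrderSketch {
    // The insertion order of the map is the declaration order of the tokens.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("begin", Pattern.compile("begin"));
        RULES.put("then", Pattern.compile("then"));
        RULES.put("id", Pattern.compile("[a-zA-Z][a-zA-Z0-9]*"));
        RULES.put("number", Pattern.compile("[0-9]+"));
        RULES.put("blank", Pattern.compile("[ \\t\\r\\n]+")); // assumed ignored token
    }

    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestName = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt anchors the match at the start of the region.
                // The strict > keeps the earlier rule when two matches have equal length.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestName = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestName == null) {
                throw new IllegalArgumentException("no token matches at position " + pos);
            }
            if (!bestName.equals("blank")) {
                out.add(bestName + ":" + input.substring(pos, bestEnd));
            }
            pos = bestEnd;
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("begin beginner 42 then"));
    }
}
```

With begin declared before id, the input begin is reported as the token begin, while beginner is longer than any keyword and so is reported as an id.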
Productions
These list the non terminals of the language being defined, eg:
    statements = {list} statement statements | {empty} ;
    value = {constant} number | {identifier} identifier | {expression} l_par expression r_par;
Note that each alternative in a multiway production is given a label in braces, thus {paramlist}. At most one unlabeled alternative is allowed for each production. The first production is the root of the parse.
Labelled non terminals
Suppose a production has two occurrences of the same non-terminal. For example
    addexp= exp plus exp;
is illegal. Sable forces us to label the two occurrences differently:
    addexp= [left]:exp plus [right]:exp;
This is so the generated java code can distinguish the two sub expressions.
Generated java
When ILCG2 uses sable to process a file with packagename Foo we get a collection of directories as follows:
    /Foo
        /analysis
        /lexer
        /node
        /parser
Within the /Foo/node directory we get a collection of class files, 1 for each production or branch of a production.
Generated java
The production
expression = {value} [v]:value | {arith} [left]:value [op]:arithop [right]:value ;
produces 3 classes in directory node : AValueExpression.java AArithExpression.java PExpression.java
Abstract class
PExpression.java will be an abstract class that the other two extend
    /* This file was generated by SableCC (http://www.sablecc.org/). */
    package tinyb.node;
    public abstract class PExpression extends Node {
        // Empty body
    }
{value} [v]:value produces
Note that [v]:value produces attribute of type PValue called V
    package tinyb.node;
    import tinyb.analysis.*;
    public final class AValueExpression extends PExpression {
        private PValue _v_;
        public AValueExpression(PValue _v_) { setV(_v_); }
        public PValue getV() { return this._v_; }
        public void setV(PValue node) { ... }
        public String toString() { ... }
        ... more methods
    }
Concrete/abstract syntax tree
The Sable generated components of your compiler will generate a concrete syntax tree. The concrete syntax is language specific. We must translate this into a language and machine independent abstract syntax tree using the ILCG classes (Intermediate Language for Code Generation). The ILCG system will generate a code generator from the abstract syntax tree to assembly language.
ILCG
Intermediate Language for Code Generation Both a specification language for machines and a representation for compiled code. Part of a Code Generator Generator System
ILCG machine Descriptions
    register word EBX assembles[ebx] ;
    reserved register word ESP assembles[esp];
The first of these would declare EBX to be of type ref word.
Aliasing A register can be declared to be a sub-field of another register, hence we could write
alias register octet AL = EAX(0:7) assembles[al];
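What the alias means for the register contents can be checked in a few lines of Java. This is a sketch only: the class and method names are invented here, and the only thing taken from the declaration above is that AL occupies bits 0 to 7 of EAX.

```java
// Hypothetical helper names; only the bit layout AL = EAX(0:7) comes from the notes.
class RegisterAliasSketch {
    // Read the sub-field EAX(0:7), i.e. the register AL.
    static int al(int eax) {
        return eax & 0xFF;
    }

    // Write AL: replace bits 0..7 of EAX and leave bits 8..31 unchanged.
    static int withAl(int eax, int newAl) {
        return (eax & 0xFFFFFF00) | (newAl & 0xFF);
    }
}
```

A write to AL therefore changes EAX as well, which is why a code generator has to know about the aliasing when it allocates registers.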
Register sets
A set of registers that are used in the same way by the instruction set can be defined.
    pattern reg means [EBP|EBX|ESI|EDI|ECX|EAX|EDX|ESP];
    pattern breg means [AL|AH|BL|BH|CL|CH|DL|DH];
All registers in a register set should be of the same length.
Types
The language has a set of predefined types: int8, int16, int64, ieee32, ieee64, uint8, uint16 etc. These are the low level types that hardware works with. It is necessary to set up a mapping between these types and the specific assembler syntax for the types.
    type int32=DWORD;
    type ieee64=QWORD;
    type uint8=BYTE;
    type int16=WORD;
The names on the right hand side are Intel assembler syntax.
Register Stacks
Whilst some machines have registers organised as an array, another class of machines, those oriented around postfix instruction sets, have register stacks. The ILCG syntax allows register stacks to be declared:
    Register stack (8)ieee64 FP assembles[ ] ;
Two access operations are supported on stacks:
    Push(FP,^mem(20))
    mem(20):=pop(FP)
Instruction formats
An instruction format is an abstraction over a class of concrete instructions. It abstracts over particular operations and types thereof whilst specifying how arguments can be combined.
    pattern operator means [add | sub|imul|and|or|xor];
    instruction pattern RR( operator op, anyreg r1, anyreg r2, int t)
        means[r1:=(t) op( ^((ref t) r1),^((ref t) r2))]
        assembles[op r1 , r2];
This describes instructions like
    add eax,ebx
    sub ah,bl
Parameterised patterns
instruction pattern RR( operator op, anyreg r1, anyreg r2, int t) means[r1:=(t) op( ^((ref t) r1),^((ref t) r2))] assembles [op r1 , r2];
Reserved words are shown in bold and parameters in italics. Parameters are matched to the tree during the meaning phase and substituted into the body of the assembly code during the assembler phase.
ILCG Semantic operators
Operator      Means
:=            assignment
^             dereference a register or memory
+, -, *       add, subtract, multiply
>, <, <>      greater, less, not equal
<=, >=        less than or equal, greater than or equal
div           divide
Push, pop     stack operations
AND, OR, XOR  bitwise boolean
EXTEND        extends numeric precision of value
>>, <<        shift operations
In the patterns shown here := is written infix and the other operators are written prefix, as in +(^(r), const s).
Operand formats
Other patterns are used to define the address modes available on the machine, for example
    pattern regindirf(reg r) means[^(r) ] assembles[ r ];
    pattern baseplusoffsetf(reg r, signed s) means[+( ^(r) ,const s)] assembles[ r + s ];
    pattern addrform means[label|baseplusoffsetf|regindirf];
    pattern maddrmode(addrform f) means[mem(f) ] assembles[ [ f ] ];
Load and store
We can use similar patterns to say how to load and store into registers
    instruction pattern LOAD(maddrmode m, anyreg r1, type t)
        means[ (ref t) r1:= (t)^(m )]
        assembles['mov ' r1 ',' t ' ' m];
    instruction pattern STORER(maddrmode m, reg r1, type t)
        means[ (ref t) m:= ^( r1) ]
        assembles['mov ' t ' ' m ',' r1];
We can also add instructions that use numbers immediately

    instruction pattern RMLIT(operator op,maddrmode rm, type t, offset sm)
        means[ (ref t) rm:= op(^(rm),(t) sm) ]
        assembles[op ' ' t ' ' rm ',' sm];
    instruction pattern INC(maddrmode rm,int t)
        means[(ref t)rm:= + (^(rm),1)]
        assembles['inc ' t ' ' rm];
Control instructions
The ILCG language supports gotos, for statements and conditionals as these are necessary to describe the semantics of certain processor instructions. Goto can be used to describe simple jump instructions
instruction pattern GOTO(jumpmode l) means[goto l] assembles['jmp ' l];
If statements can be used to describe the semantics of conditional jumps. Note that the if statement here translates into a compare followed by a jump, and note the embedded newline character in the assembler.
    instruction pattern IFLITGOTO(label l,addrmode r1,signed r2,condition c,signed t,int b)
        means[if((b)c((t) ^(r1),const r2))goto l]
        assembles[' cmp 't' ' r1 ', ' r2 '\n j' c ' near ' l];
The Intel architecture provides certain simple loop constructs as single instruction prefixes, the most prominent of these is the block copy instruction.
    instruction pattern REPMOVSB(countreg s,maddrmode m1,sourcereg si, destreg di)
        means[for (ref int32)m1:=0 to ^(s) step 1 do
              (ref octet)mem(+(^(di),^((ref int32)m1))):=^((ref octet)mem(+(^(si),^((ref int32)m1)))) ]
        assembles[' inc ecx\n cmp ecx,0\n jle $+4\n rep movsb\n nop\n nop'];
Instructionset
At the end of an ILCG file you must list all the instruction patterns available in the order in which you want them to be tried. Instructions that come first are tried first when translating a programme, so you should list the most specific and powerful instructions first.
instructionset [IFLITGOTO |LOAD |INC |STORER |RMLIT |RR Etc..... ];
Structure of generated Code Generators
I have already done this for you
An example
Let us look at an example of how translation takes place. Consider the Basic statement
    100 LET B:=B+1
Consider how this might be performed on an x86 lineage machine. If we used the most primitive style instructions we might code it as
    mov eax,[b]
    add eax,1
    mov [b],eax
or as
    mov eax,[b]
    inc eax
    mov [b],eax
Adding the instructions from the 8086 vintage to the options, we might try
    mov eax,[b]
    lea eax,[eax+1]
    mov [b],eax
or simply
    inc dword[b]
Translation to ILCG
instructionset [IFLITGOTO |LOAD |INC |STORER |RMLIT |RR
Assume that we have already translated the concrete syntax tree coming from Sable into an abstract syntax tree. Then the statement would initially be translated as follows:
B would translate to mem(b), where b is a label
B+1 would translate to +((int32)^mem(b),1)
B:=B+1 would translate to (ref int32)mem(b):= +((int32)^mem(b),1)
The code generator will then run through the instructionset from top to bottom looking for a pattern that matches this. The patterns it compares (ref int32)mem(b):= +((int32)^mem(b),1) to are:
IFLITGOTO   if((b)c((t) ^(r1),const r2))goto l   Fail, does not start with if
LOAD        (ref t) r1:= (t)^(rm )               Fail, r1 is a register not memory
INC         (ref t)rm:= + (^(rm),1)              Succeed
In order to succeed the matcher has to match (ref t) to (ref int32) and then match rm to mem(b).
Code generation symbol table
The code generator maintains a symbol table that associates pattern parameter names with ILCG trees and assembler translations. Consider this after we have matched t to int32, but before we have matched the parameter rm:
Parameter   ILCG it matches   Assembler it produces
t           int32             DWORD
rm          ?
Call other patterns
We have to match rm, which was defined as of type maddrmode, to the actual pattern mem(b). maddrmode is declared as
    pattern maddrmode(addrform f) means[mem(f) ] assembles[ [ f ] ];
This implies we must match mem(f) to mem(b), which in turn implies that b must match f, which is of type addrform
pattern addrform means[label|baseplusoffsetf|regindirf];
but b is a label so this will succeed. The matcher will return the value '[b]' from maddrmode and bind this to the parameter rm in the code generator's symbol table
Parameter   ILCG it matches   Assembler it produces
t           int32             DWORD
rm          mem(b)            [b]
The matching process
Remember we are matching (ref int32)mem(b):= +((int32)^mem(b),1) to the INC pattern, which was (ref t)rm:= + (^(rm),1). So far (ref int32)mem(b) has been matched to (ref t)rm. We can obviously match := +( ,1) to := +( ,1). That leaves us with having to match (int32)^mem(b) to ^(rm). The cast will succeed since we have an uncast pattern ^(rm), which in principle could match any type, so we just have to match rm to mem(b). The matcher looks up rm in the symbol table and finds it is already associated with mem(b), so the match succeeds. The inc instruction's assembly part, assembles['inc ' t ' ' rm], is then expanded by substituting in the t and rm parameters from the assembler column of the symbol table. So the output would be
    inc DWORD [b]
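The way a parameter is bound on its first occurrence and checked on later ones can be sketched in Java. This is not the ILCG matcher: the class T, the $ prefix for parameters and the method names are all invented for the illustration, casts such as (int32) are left out, and the operand size DWORD is hard-wired instead of coming from the parameter t.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not the ILCG matcher.
class BindingMatchSketch {
    // A tree node: an operator or leaf name plus children.
    static final class T {
        final String op;
        final List<T> kids;
        T(String op, T... kids) { this.op = op; this.kids = Arrays.asList(kids); }
        @Override public boolean equals(Object o) {
            return o instanceof T && ((T) o).op.equals(op) && ((T) o).kids.equals(kids);
        }
        @Override public int hashCode() { return 31 * op.hashCode() + kids.hashCode(); }
    }

    // Leaves whose name starts with '$' are pattern parameters.
    static boolean match(T pattern, T tree, Map<String, T> table) {
        if (pattern.op.startsWith("$")) {
            T bound = table.get(pattern.op);
            if (bound == null) {            // first occurrence: bind it
                table.put(pattern.op, tree);
                return true;
            }
            return bound.equals(tree);      // later occurrence: must agree
        }
        if (!pattern.op.equals(tree.op) || pattern.kids.size() != tree.kids.size()) {
            return false;
        }
        for (int i = 0; i < pattern.kids.size(); i++) {
            if (!match(pattern.kids.get(i), tree.kids.get(i), table)) return false;
        }
        return true;
    }

    // Assembler text for a memory operand mem(label).
    static String operand(T t) {
        if (t.op.equals("mem") && t.kids.size() == 1) return "[" + t.kids.get(0).op + "]";
        throw new IllegalArgumentException("not a memory operand: " + t.op);
    }

    // INC pattern: $rm := +(^($rm), 1), assembled as 'inc DWORD ' $rm.
    static String tryInc(T tree) {
        T rm = new T("$rm");
        T pattern = new T(":=", rm, new T("+", new T("^", rm), new T("1")));
        Map<String, T> table = new HashMap<>();
        if (!match(pattern, tree, table)) return null;
        return "inc DWORD " + operand(table.get("$rm"));
    }

    // Builds mem(dest) := +(^(mem(src)), 1) for two labels.
    static T addOne(String dest, String src) {
        return new T(":=", new T("mem", new T(dest)),
                     new T("+", new T("^", new T("mem", new T(src))), new T("1")));
    }
}
```

Calling tryInc(addOne("b", "b")) yields inc DWORD [b], whereas tryInc(addOne("b", "c")) yields null, because the second occurrence of the parameter does not agree with the tree it was first bound to.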
Effect of instruction ordering
The assembler produced depends on the order of priority of the instructions listed in the instructionset. If we had the different instruction set order
IFLITGOTO |LOAD |STORER |INC |RMLIT |RR
that is, STORER now comes before INC, the instruction sequence generated would be
    mov EAX, [b]
    add EAX, 1
    mov [b],EAX
In this case the first pattern to match would be STORER, since its ILCG is (ref t) rm:= ^( r1). The pattern (ref t) rm:= will match the abstract syntax pattern (ref int32)mem(b):=, leaving the matcher to attempt to match the register r1 to +((int32)^mem(b),1). The matcher then goes through the instructions looking for an instruction that will put into a register the result of +((int32)^mem(b),1).
Alternative instruction sequence
The instruction set is searched and the matcher finds that +((int32)^mem(b),1) can be matched to the right hand side of the RMLIT pattern (ref t) rm:= op(^(rm),(t) sm) by instantiating rm as a register, say EAX, op as +, t as int32 and sm as 1. But to do this we have to find a pattern that loads EAX with (int32)^mem(b). The matcher again searches the instructions and finds that LOAD has the semantics (ref t) r1:= (t)^(rm ), where rm is a memory address mode and r1 is a register. This matches to produce
    mov EAX, [b]
The success of this triggers RMLIT to produce
    add EAX,1
which in turn allows the STORER pattern to complete matching, producing
    mov [b],EAX
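The dependence on ordering comes down to a first-match search over the instruction list. The following Java sketch shows just that part. It is an illustration under simplifying assumptions, not the generated code generator: the classes T and Rule are invented, casts are ignored, and STORER is reduced to a test that the destination is a memory location, without the recursive search for code to fill the register.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical sketch of first-match selection over an ordered instruction list.
class OrderingSketch {
    // A tree node: an operator or leaf name plus children.
    static final class T {
        final String op;
        final List<T> kids;
        T(String op, T... kids) { this.op = op; this.kids = Arrays.asList(kids); }
        @Override public boolean equals(Object o) {
            return o instanceof T && ((T) o).op.equals(op) && ((T) o).kids.equals(kids);
        }
        @Override public int hashCode() { return 31 * op.hashCode() + kids.hashCode(); }
    }

    static final class Rule {
        final String name;
        final Predicate<T> matches;
        Rule(String name, Predicate<T> matches) { this.name = name; this.matches = matches; }
    }

    // INC: dest := +(^dest, 1)
    static final Rule INC = new Rule("INC", t ->
        t.op.equals(":=") && t.kids.size() == 2
        && t.kids.get(1).op.equals("+")
        && t.kids.get(1).kids.size() == 2
        && t.kids.get(1).kids.get(0).equals(new T("^", t.kids.get(0)))
        && t.kids.get(1).kids.get(1).equals(new T("1")));

    // STORER (simplified): mem(...) := anything that can be computed into a register
    static final Rule STORER = new Rule("STORER", t ->
        t.op.equals(":=") && t.kids.size() == 2 && t.kids.get(0).op.equals("mem"));

    // Returns the name of the first rule in the list that matches, or null.
    static String select(List<Rule> order, T tree) {
        for (Rule r : order) {
            if (r.matches.test(tree)) return r.name;
        }
        return null;
    }

    // The tree for B := B + 1 with casts left out.
    static T incrementOfB() {
        T b = new T("mem", new T("b"));
        return new T(":=", b, new T("+", new T("^", b), new T("1")));
    }
}
```

With the list INC, STORER the tree for B:=B+1 selects INC. With the list STORER, INC it selects STORER, since the more general pattern is reached first and the specific one is never tried.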
Main class
The ILCG2 system will, when given a compiler specification invoke Sable on the language specification and build the appropriate code generator. So given
compiler tinybasic parser tinyb cpu IA32
it builds a tiny basic lexer and parser, and builds an IA32 code generator. It then outputs a compiler class whose main method invokes both of these. This is shown on the next slide.
Main compiler method

    // Needs java.io.* and the generated tinyb.lexer and tinyb.parser packages.
    public static void main(String[] arguments) {
        if (arguments.length < 1) {
            System.out.println("usage:\njava tinyb.tinybasic sourcefile ");
            System.exit(1);
        }
        try {
            FileReader r = new FileReader(arguments[0]);
            PushbackReader pr = new PushbackReader(new BufferedReader(r), 1024);
            Lexer lexer = new Lexer(pr);          // create lexer using file
            Parser parser = new Parser(lexer);    // create parser using lexer
            tinyb.node.Node st = parser.parse();  // build the syntax tree
            // The Translator class and its translate method must be hand written.
            ilcg.tree.Node[] trans = Translator.translate(st); // translate to ILCG tree
            IA32CG cpu = new IA32CG();
            PrintWriter asm = new PrintWriter(new FileOutputStream(arguments[0] + ".asm"));
            cpu.setLogfile(new PrintWriter(new FileOutputStream(arguments[0] + ".lst")));
            if (!cpu.codeGen(trans[0])) throw new Exception("codegen fails"); // gen asm
            cpu.buf.writeln("section .data");
            if (!cpu.codeGen(trans[1])) throw new Exception("codegen fails"); // gen asm
            cpu.flush(asm);
            cpu.flushlog();
            asm.close();
            // cpu is declared inside the try block, so the assembler is invoked here
            cpu.assemble(arguments[0] + ".asm", arguments[0] + ".o");
        } catch (Exception e) {
            System.out.println(e);
        }
    }
Ilcg translation
I have written for you a translator from tinybasic concrete syntax to Ilcg abstract syntax, but you will have to extend this translator, so it is worth understanding how it works. It translates into the classes in the package ilcg/tree, which allow ilcg abstract syntax trees to be represented. I will go through some of the key classes here that you have to know about before going on to describe the overall algorithmic structure of such translators.
Interface Node all tree nodes implement this
Method Summary
Nodes are used to represent the ILCG tree, and in the root class we declare the terminal symbols of the language as strings.
boolean equals(Object n)   Check exact equality of trees.
Node eval()   Evaluate expressions known at compile time; this allows arithmetic expressions like 3+5 to be done in the compiler, not at run time.
void examine(TreeExaminer e)   A method that is used by an examiner to visit all locations.
Node getCost()   Give an expression which evaluates to a cost of an operation in memory accesses.
boolean knownAtCompileTime()
boolean matches(Node n)   Check if similar enough for code generation purposes.
boolean matches(Node n, ilcg.SymbolTable D)   Check if similar enough for code generation purposes, with use of rewrite rules stored in D.
Node modify(TreeModifier m)   A method that must be instantiated, allowing a TreeModifier to substitute values into the tree.
String returnType()   Returns the type of an expression tree as a string.
int weight()   Returns a rough estimate of the number of registers needed to compile this sub tree.
Class TreeExaminer
The code generators work using tree visiting classes. An example is the TreeExaminer class, which can be used to visit all nodes of a subtree. You typically extend the class by overriding the leave and visit methods to do what you want. It is invoked by passing a TreeExaminer to the examine() method of a node. Tree examiners are useful if you want to search trees or gather statistics on them, like how many memory accesses they make.
Method Summary
void leave(Node n)      This is called after all subtrees have been visited.
boolean visit(Node n)   This is called each time a node is visited, but before any subtrees are visited. If it returns false the sub trees are not visited.
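The visit and leave protocol can be illustrated with a stand-alone miniature. The classes below are invented stand-ins, not the ilcg.tree classes, and one detail is an assumption: this sketch calls leave even when visit returned false.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical miniature of the examiner protocol; not the ilcg.tree classes.
class ExaminerSketch {
    static class TreeExaminer {
        // Called before the subtrees; returning false skips them.
        boolean visit(Node n) { return true; }
        // Called after the subtrees (assumption: also called when they were skipped).
        void leave(Node n) { }
    }

    static final class Node {
        final String op;
        final List<Node> kids;
        Node(String op, Node... kids) { this.op = op; this.kids = Arrays.asList(kids); }

        void examine(TreeExaminer e) {
            if (e.visit(this)) {
                for (Node k : kids) k.examine(e);
            }
            e.leave(this);
        }
    }

    // Gathers a statistic: how many mem(...) nodes the tree contains.
    static final class MemCounter extends TreeExaminer {
        int count = 0;
        @Override boolean visit(Node n) {
            if (n.op.equals("mem")) count++;
            return true;
        }
    }

    static int countMem(Node root) {
        MemCounter c = new MemCounter();
        root.examine(c);
        return c.count;
    }

    // The tree for mem(b) := +(^(mem(b)), 1).
    static Node incrementOfB() {
        return new Node(":=", new Node("mem", new Node("b")),
                new Node("+", new Node("^", new Node("mem", new Node("b"))), new Node("1")));
    }
}
```

For the tree of B:=B+1 the counter reports two memory references, one for the destination and one for the operand.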
Assign
Class Assign represents assignments

Assign(Node d, Node s)   d = destination, s = source of the assignment; check the assignment for type validity.
Assign(Node d, Node s, boolean check)   d = destination, s = source of the assignment; if check is true, check the assignment for type validity.
Class Int represent integer constants
This class extends java.lang.Number, so it has methods like intValue(), longValue() etc.
Constructor Summary
Int()   Constructor for the Int object
Int(String s)   Constructor for the Int object, convert string to integer
Int(int i)   Constructor for the Int object
Int(long l)   Constructor for the Int object
Int(long l, String rep)   Constructor for the Int object, rep specifies the precision int32, int16 etc
Labels and locations
Label Constructor Summary
Label()   constructs a label with a machine generated name
Label(String s)   constructs a label with a programmer specified name
Location Constructor Summary
Location(Node n)   Constructor for the Location object, reserves a memory location in the target machine to store the expression in n
Conditionals
Constructor Summary
If(Node condition, Node thenPart)   Creates if then fi; it executes the thenPart if the condition is true at run time.
If(Node condition, Node thenPart, Node elsePart)   Creates if then else fi; it executes the thenPart if the condition is true at run time, otherwise it executes the else part.
The If nodes are the abstract syntax representations of control branches in the original source code.
There are several other classes for constructing ILCG trees, but you can familiarise yourselves with these by looking at the html documentation in the package ilcg/tree.
If a condition is known at compile time when eval is called, the tree is simplified. Thus
    Label a=...
    Node x= new Memref(a,int32);
    new If(new Dyad(new Int(4),new Int(3),"<"),
           new Assign(a, new Dyad(a,a,"*"))).eval()
will generate the null statement, since 4<3 is false.
Statements
Constructor Summary
Statement()   the null statement
Statement(Node n)   a statement made of n
Statement(Node n, Statement s)   a statement n followed by the statement s
The statement class provides the basic mechanism for specifying sequencing in the abstract syntax tree.
    new Statement(new Assign(regEAX, new Int(3)),
        new Statement(new Assign(new Memref(new Label(b),int32),regEAX)))
would tell the code generator to output code to assign 3 to the eax register and then store the eax register in the memory location referred to by label b. We can also use this to set up store locations for variables:
    new Statement(new Label(b), new Statement(new Location(new Int(0,int32))))
would tell the code generator to reserve a memory location for a 32 bit integer, initialised to zero, and label it b. If you print ilcg trees out, the toString method of the statement class renders it in the form seq( , ).
Dyad
Class Dyad represents dyadic arithmetic expressions like a*7.
Constructor Summary
Dyad()   Constructor for the Dyad object
Dyad(Node l, Node r, String o)   Constructor for the Dyad object, with o being the string representation of the operation being performed
Calling the eval method on it after construction will ensure that any expressions that can be calculated at compile time are so evaluated, eg
    new Dyad(new Int(3),new Int(7),"+").eval()
would return an Int(10).
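Compile time evaluation of this kind is a bottom-up fold over the tree. The sketch below shows the idea with invented miniature classes. They share names with the ilcg.tree classes for readability but are not those classes, and the set of operators folded (+, -, *) is an assumption made to keep the example short.

```java
// Hypothetical miniature of constant folding; not the ilcg.tree implementation.
class FoldSketch {
    interface Node { Node eval(); }

    static final class Int implements Node {
        final long value;
        Int(long value) { this.value = value; }
        public Node eval() { return this; }
        @Override public String toString() { return "Int(" + value + ")"; }
    }

    // A variable: its value is not known at compile time.
    static final class Var implements Node {
        final String name;
        Var(String name) { this.name = name; }
        public Node eval() { return this; }
        @Override public String toString() { return name; }
    }

    static final class Dyad implements Node {
        final Node left, right;
        final String op;
        Dyad(Node left, Node right, String op) { this.left = left; this.right = right; this.op = op; }

        public Node eval() {
            Node l = left.eval();
            Node r = right.eval();
            if (l instanceof Int && r instanceof Int) {
                long a = ((Int) l).value;
                long b = ((Int) r).value;
                switch (op) {
                    case "+": return new Int(a + b);
                    case "-": return new Int(a - b);
                    case "*": return new Int(a * b);
                    default: break; // operator not handled here: leave it unfolded
                }
            }
            return new Dyad(l, r, op); // keep the node, with simplified operands
        }
        @Override public String toString() { return op + "(" + left + "," + right + ")"; }
    }
}
```

Folding 3+7 gives Int(10), and folding a+(2*3) gives +(a,Int(6)): the constant subtree is reduced while the part that depends on a variable is kept for run time.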
The translator structure
The translation process happens using the visitor paradigm. A DepthFirstAdapter is a class which will visit every node in a Sable syntax tree. By extending it, we get a translator that will visit every node in the concrete syntax tree and translate it into the equivalent abstract syntax tree.
    class Translator extends DepthFirstAdapter{
        Hashtable<tinyb.node.Node,ilcg.tree.Node> translations =
            new Hashtable<tinyb.node.Node,ilcg.tree.Node>();
        Hashtable<String,ilcg.tree.Node> symbols =
            new Hashtable<String,ilcg.tree.Node>();
        String precision = int32;
The translations are stored in the hashtable translations, with tiny basic concrete syntax nodes as keys of the hash table and ilcg abstract syntax trees as values. An auxiliary hash table symbols is used to look up associations with identifier names. precision specifies the accuracy we are going to use in generated code.
One method per class
The visitor class has one method per class of syntax node. Consider the Sable specification
    value = {constant} number | {identifier} identifier | {expression} l_par expression r_par;
The translator must have methods to deal with each of these alternatives:
    public void outAConstantValue(AConstantValue node){
        translations.put(node,
            new ilcg.tree.Int(new Long(node.toString().trim()).longValue(), precision));
    }
    public void outAIdentifierValue(AIdentifierValue node){
        translations.put(node,
            new Deref(idof(node.getIdentifier().toString().trim())));
    }
    public void outAExpressionValue(AExpressionValue node){
        translations.put(node, translations.get(node.getExpression()));
    }
When a node is encountered the appropriate method for its class is called to translate it. It puts the translation into the translation table. Let's look at some methods in detail.
Constants
    public void outAConstantValue(AConstantValue node){
        translations.put(node,
            new ilcg.tree.Int(new Long(node.toString().trim()).longValue(), precision));
    }
The main call here puts node into the translation table as a key, and the translation of the node as the associated value. The translation is
    new ilcg.tree.Int(new Long(node.toString().trim()).longValue(), precision)
node.toString().trim() gives us the number with leading and trailing spaces removed. We then convert this into a java long value and pass it to the constructor Int(long v, String representation), using the default value of precision (int32) as the representation. The end effect is to translate the tinybasic concrete syntax AConstantValue into an Ilcg Int instance.
Variables
    public void outAIdentifierValue(AIdentifierValue node){
        translations.put(node,
            new Deref(idof(node.getIdentifier().toString().trim())));
    }
The translation here is of an identifier in the context of an expression. An identifier in a high level language is a name for a location in the computer memory. But what we want to have for the calculation is not the location but the contents of that location. In Ilcg source this operation of getting the contents of a location is denoted by the ^ character. In the syntax tree it is denoted by the Deref class. In order to find the memory location we call the method idof passing to it the source of the identifier with leading and trailing spaces removed.
Handling labels
The function idof checks if we have already encountered s. If we have, it simply returns the node associated with it. Otherwise it looks to see if s is a variable (starts with a letter) or a line number. If it is a line number, it associates with s a label made of line followed by s, so if s was 20 it would get the label line20. If it is a variable, it allocates a memory location for s and associates both this memory location and s with a label made of var followed by s.
    ilcg.tree.Node idof(String s){
        ilcg.tree.Node n = symbols.get(s);
        if (n == null) {
            n = new Label("line" + s);                 // assume a line number
            if (!Character.isDigit(s.charAt(0))) {     // it is a variable
                n = new Label("var" + s);
                Memref m = new Memref(n, precision);
                // reserve a zero initialised location with this label in the data section
                data = new Statement(n, new Statement(
                        new Location(new Int(0, precision)), data));
                n = m;
            }
            symbols.put(s, n);
        }
        return n;
    }
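The same lookup-or-create logic can be run on its own if the ILCG tree classes are replaced by plain strings. The following is a simplified stand-in, not the translator's code: the class name IdofSketch and the dataSection list are invented, and the data declaration is rendered directly as assembler text of the form seen in the generated output (varI: dd 0).

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical string-based stand-in for idof; the real method builds ilcg.tree nodes.
class IdofSketch {
    final Map<String, String> symbols = new HashMap<>();
    final List<String> dataSection = new ArrayList<>();

    String idof(String s) {
        String label = symbols.get(s);
        if (label == null) {                       // first time we meet s
            if (Character.isDigit(s.charAt(0))) {
                label = "line" + s;                // a line number
            } else {
                label = "var" + s;                 // a variable
                dataSection.add(label + ": dd 0"); // reserve a zeroed 32 bit word
            }
            symbols.put(s, label);
        }
        return label;
    }
}
```

Asking for the same variable twice returns the same label and reserves storage only once, which is the property the symbols table is there to guarantee.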
Example translation
Note in what follows how labels are being used.
    10 FOR I := 1 TO 5
    20 PRINT I
    30 NEXT I
    40 END
Translated to assembler this is
    line10:
    mov DWORD [varI],1
    label131051df1911:
    line20:
    push DWORD [varI]
    lea eax, [print]
    call eax
    add esp, 4
    line30:
    inc DWORD [varI]
    cmp DWORD [varI],5
    jle near label131051df1911
    line40:
    .... some stuff cut out here
    section .data
    varI: dd 0
Dyadic expressions
Consider how we translate expressions like T+(I*2). This can be broken down into translating the left and right halves of the expression and then constructing an abstract dyadic expression node with the appropriate operator.
    public void outAArithExpression(AArithExpression node){
        ilcg.tree.Node l, r;
        l = translations.get(node.getLeft());
        r = translations.get(node.getRight());
        ilcg.tree.Node n = new Dyad(l, r, node.getOp().toString().trim());
        translations.put(node, n);
    }
Look at an example of how this works:
    LET T:=T+(I*2)
    --> assign(mem(ref int32,varT), +((int32)^(mem(ref int32,varT)), *((int32)^(mem(ref int32,varI)),2) ))
    --> imul eax, DWORD [varI], 2
        add DWORD [varT], eax
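The reason the out methods can simply look up the translations of their children is that the children are visited first. A stand-alone miniature of this post-order scheme is sketched below. The node classes are invented stand-ins for the Sable generated ones, and the translations are plain strings in ILCG-like notation with casts left out.

```java
import java.util.IdentityHashMap;
import java.util.Map;

// Hypothetical miniature of the translator; not the SableCC or ILCG classes.
class PostOrderSketch {
    static abstract class CNode { abstract void apply(Translator t); }

    static final class Num extends CNode {
        final String text;
        Num(String text) { this.text = text; }
        void apply(Translator t) { t.outNum(this); }
    }

    static final class Id extends CNode {
        final String name;
        Id(String name) { this.name = name; }
        void apply(Translator t) { t.outId(this); }
    }

    static final class Arith extends CNode {
        final CNode left, right;
        final String op;
        Arith(CNode left, String op, CNode right) { this.left = left; this.op = op; this.right = right; }
        void apply(Translator t) {
            left.apply(t);      // children first ...
            right.apply(t);
            t.outArith(this);   // ... then the node itself
        }
    }

    static final class Translator {
        // Keyed by node identity, like the translations table in the notes.
        final Map<CNode, String> translations = new IdentityHashMap<>();

        void outNum(Num n) { translations.put(n, n.text.trim()); }
        void outId(Id n) { translations.put(n, "^(mem(var" + n.name.trim() + "))"); }
        void outArith(Arith n) {
            translations.put(n, n.op + "(" + translations.get(n.left) + ","
                    + translations.get(n.right) + ")");
        }
    }

    static String translate(CNode root) {
        Translator t = new Translator();
        root.apply(t);
        return t.translations.get(root);
    }
}
```

Translating the tree for T+(I*2) gives +(^(mem(varT)),*(^(mem(varI)),2)), the same shape as the right hand side of the ILCG tree shown above.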
If statements
Sable
    action = {conditional} if [c]:condition then [n]:number new_line
Java in Translator
    public void outAConditionalAction(AConditionalAction node){
        translations.put(node,
            new If(translations.get(node.getC()),
                   new Goto(idof(node.getN().toString().trim()))));
    }
Basic source
    IF T>15 THEN 30
Translated Ilcg tree
    if(>((int32)^(mem(ref int32,varT)),15),goto(line30) ,null)
Resulting assembler
    cmp DWORD [varT], 15
    jg near line30
For statements
Sable
    action = {forloop}for identifier assign [fromexp]:expression to [toexp]:expression new_line
Java in Translator
    public void outAForloopAction(AForloopAction node){
        try {
            Label l = new Label();
            loops.push(l);
            Assign init = new Assign(idof(node.getIdentifier().toString().trim()),
                                     translations.get(node.getFromexp()));
            limits.push(translations.get(node.getToexp()));
            translations.put(node, new Statement(init, new Statement(l)));
        } catch(Exception e) { Error(node,e); }
    }
The loops stack holds the labels for returning to. The limits stack holds the upper bounds of loops. Stacks allow loop nesting.
Basic source
    FOR I:= 1 to 5
Translated Ilcg tree
    seq(assign(mem(ref int32,varI),1),seq(label131712bd8954))
Resulting assembler
    mov DWORD [varI], 1
    label131712bd8954:
Next statements
    action = {loopend} next [i]:identifier new_line
Java in Translator
    public void outALoopendAction(ALoopendAction node) {
        try { // first generate the increment
            ilcg.tree.Node i = idof(node.getI().toString().trim());
            ilcg.tree.Node di = new Deref(i);
            ilcg.tree.Node inc = new Assign(i, new Dyad(di, new Int(1,precision), "+"));
            ilcg.tree.Statement jump = new Statement(
                new If(new Dyad(di, limits.pop(), "<="), new Goto(loops.pop())));
            translations.put(node, new Statement(inc, jump));
        } catch(Exception e) { Error(node,e); }
    }
Basic source
    NEXT I
Translated Ilcg tree
    seq(assign(mem(ref int32,varI),+((int32)^(mem(ref int32,varI)),1)),
        seq(if(<=((int32)^(mem(ref int32,varI)), 5), goto(label131712bd8954) , null)))
Resulting assembler
    inc DWORD [ varI]
    cmp DWORD [ varI], 5
    jle near label131712bd8954
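How the two stacks make nested loops work can be seen in a stand-alone sketch that emits assembler text directly rather than building ILCG trees. Everything here is a simplification for illustration: the class name LoopStackSketch is invented, labels are numbered label0, label1 and so on instead of being machine generated names, and the loop bounds are assumed to be literal constants.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of FOR/NEXT handling with a loops stack and a limits stack.
class LoopStackSketch {
    private final Deque<String> loops = new ArrayDeque<>();   // labels to jump back to
    private final Deque<String> limits = new ArrayDeque<>();  // upper bounds of open loops
    private final List<String> out = new ArrayList<>();
    private int counter = 0;

    // FOR var := from TO to
    void forLoop(String var, String from, String to) {
        String label = "label" + (counter++);
        loops.push(label);
        limits.push(to);
        out.add("mov DWORD [var" + var + "], " + from);
        out.add(label + ":");
    }

    // NEXT var: increment, compare with the innermost limit, jump to the innermost label
    void next(String var) {
        out.add("inc DWORD [var" + var + "]");
        out.add("cmp DWORD [var" + var + "], " + limits.pop());
        out.add("jle near " + loops.pop());
    }

    List<String> output() { return out; }
}
```

If FOR I is followed by FOR J, then NEXT J pops the label and limit of the inner loop and NEXT I pops those of the outer loop, so each NEXT is paired with the most recently opened FOR.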
Print statements
Sable
    action = {printexp} print [e]:expression new_line
Java in Translator
    public void outAPrintexpAction(APrintexpAction node){
        Function f = new Function(exidof("print"), precision, precision);
        Vector<ilcg.tree.Node> v = new Vector<ilcg.tree.Node>();
        v.add(translations.get(node.getE()));
        translations.put(node, new Monad(f, new Cartesian(v)));
    }
Basic source
    PRINT T
Translated Ilcg tree
    print[(int32)^(mem(ref int32,varT))]
Resulting assembler
    push DWORD [ varT]   ; push T on the stack
    lea eax,[ print]     ; find the address of print
    call eax             ; call that address
    add esp, 4           ; restore the stack
C routine called
    void print(int i){printf("%d ",i);}
Calling C functions
Requirements
1. Must establish a name correspondence with the C routine. Issues here: the case of the names, the allowed characters, and how these are passed in assembler.
2. Must pass parameters appropriately.
3. Must get results back from C routines.
Calling C functions
Characters and significance
Case is significant in C, but this is not true of all languages. Pascal, for instance, makes case insignificant, and requires that externals whose case is significant be given a name in quotes, for example:
procedure close (var f:fileptr); external name 'pasclose';
This allows the external routine to have a name different from its internal representation. The allowed characters in a name in Basic are limited to the uppercase letters, which means we cannot call any C routine with an _ or a digit in its name unless we extend the syntax for externals along the above lines.
Calling C functions
Assembler representation
In the assembler file, list all the externals called as follows:
extern print
extern println
Then we can call them just as if they were declared within this file:
call print
But to do this you have to link the object file produced by the assembler (.o file) with an object file or C file containing the C routine.
Calling C functions
Assembler representation
Underscores
Most 32-bit C compilers share the convention used by 16-bit compilers, that the names of all global symbols (functions or data) they define are formed by prefixing an underscore to the name as it appears in the C program. However, not all of them do: the `ELF' specification used in Linux .o files states that C symbols do not have a leading underscore on their assembly-language names. Thus if you are producing code for Linux, which uses ELF, do not use underscores.
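A code generator could centralise the naming rule in one helper. The sketch below is only an illustration: the class ExternNameSketch, its method names and the elfTarget flag are hypothetical, and mapping an uppercase Basic name to the lower-case C spelling is an assumption suggested by the PRINT/print example, not something the translator is known to do.

```java
import java.util.Locale;

// Hypothetical helper: maps a Basic-level name to the symbol written in the assembler file.
class ExternNameSketch {
    // elfTarget is true for ELF output (Linux), false for targets whose
    // C compiler prefixes global symbols with an underscore.
    static String assemblerName(String basicName, boolean elfTarget) {
        // Assumption: the C routine is spelled in lower case.
        String cName = basicName.toLowerCase(Locale.ROOT);
        return elfTarget ? cName : "_" + cName;
    }

    // The declaration line placed at the top of the assembler file.
    static String externDirective(String basicName, boolean elfTarget) {
        return "extern " + assemblerName(basicName, elfTarget);
    }

    public static void main(String[] args) {
        System.out.println(externDirective("PRINT", true));
        System.out.println(externDirective("PRINT", false));
    }
}
```

Keeping the decision in one place means the rest of the code generator never needs to know which object format is being targeted.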
Calling C functions
The C calling convention
1. The caller pushes the function's parameters on the stack, one after another, in reverse order (right to left, so that the first argument specified to the function is pushed last).
2. The caller then executes a near `CALL' instruction to pass control to the callee.
So if we have a C function foo(int a, int b) being called as
foo(x,4)
we would generate the code
push 4
push dword[x]
call foo
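The caller's side of this convention can be sketched as a small routine that turns a list of operands into the instruction sequence. The class CallEmitterSketch and the method emitCall are hypothetical names, and the sketch assumes every argument occupies one doubleword. It also emits the stack clean-up after the call, as in the print example (add esp, 4).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of the caller's side of the C calling convention.
class CallEmitterSketch {
    // operands are assembler operands in source order (leftmost argument first);
    // each is assumed to be one doubleword (4 bytes) wide.
    static List<String> emitCall(String function, List<String> operands) {
        List<String> code = new ArrayList<>();
        // push right to left, so the first argument is pushed last
        for (int i = operands.size() - 1; i >= 0; i--) {
            code.add("push " + operands.get(i));
        }
        code.add("call " + function);
        // the caller removes the parameters again after the call returns
        if (!operands.isEmpty()) {
            code.add("add esp, " + 4 * operands.size());
        }
        return code;
    }

    public static void main(String[] args) {
        // foo(x, 4)
        for (String line : emitCall("foo", Arrays.asList("dword [x]", "4"))) {
            System.out.println(line);
        }
    }
}
```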
Calling C functions
stack
[ebp+12]   second parameter (b)
[ebp+8]    first parameter (a)
[ebp+4]    retaddr
[ebp]      Old ebp
3. The callee receives control, and typically (although this is not actually necessary, in functions which do not need to access their parameters) starts by saving the value of `ESP' in `EBP' so as to be able to use `EBP' as a base pointer to find its parameters on the stack. However, the caller was probably doing this too, so part of the calling convention states that `EBP' must be preserved by any C function. Hence the callee, if it is going to set up `EBP' as a frame pointer, must push the previous value first.
foo: push ebp
     mov ebp,esp
4. The callee may then access its parameters relative to `EBP'. The doubleword at `[EBP]' holds the previous value of `EBP' as it was pushed; the next doubleword, at `[EBP+4]', holds the return address, pushed implicitly by `CALL'. The parameters start after that, at `[EBP+8]'. The leftmost parameter of the function, since it was pushed last, is accessible at this offset from `EBP'; the others follow, at successively greater offsets. Thus, in a function such as `printf' which takes a variable number of parameters, the pushing of the parameters in reverse order means that the function knows where to find its first parameter, which tells it the number and type of the remaining ones.
Calling C functions
5. The callee may also wish to decrease `ESP' further, so as to allocate space on the stack for local variables, which will then be accessible at negative offsets from `EBP'.
6. The callee, if it wishes to return a value to the caller, should leave the value in `AL', `AX' or `EAX' depending on the size of the value. Floating-point results are typically returned in `ST0'.
7. Once the callee has finished processing, it restores `ESP' from `EBP' if it had allocated local stack space, then pops the previous value of `EBP', and returns via `RET'.
mov eax, retval   ; retval stands for the value to be returned
mov esp, ebp
pop ebp
ret
8. When the caller regains control from the callee, the function parameters are still on the stack, so it typically adds an immediate constant to `ESP' to remove them (instead of executing a number of slow `POP' instructions). In our example:
call foo
add esp,8 ; removes the two parameters
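The eight steps can be checked with a small simulation of the stack. This is a sketch under stated assumptions: the class CdeclFrameSketch and its methods are hypothetical, memory is a small array addressed in bytes, every value is one doubleword, and the starting values of esp and ebp and the return address 0x4000 are arbitrary. It shows that after the prologue the first argument is at [ebp+8], and that the epilogue plus the caller's clean-up restore both registers.

```java
// Hypothetical simulation of a 32-bit stack frame for foo(x, 4) with x = 7.
class CdeclFrameSketch {
    private final int[] memory = new int[32]; // 32 doublewords of stack
    int esp = 128;                            // byte address; the stack grows downwards
    int ebp = 200;                            // arbitrary marker for the caller's frame pointer

    void push(int value) { esp -= 4; memory[esp / 4] = value; }
    int pop() { int v = memory[esp / 4]; esp += 4; return v; }
    int load(int address) { return memory[address / 4]; }

    // Steps 1-2: push arguments right to left, then CALL pushes the return address.
    void call(int returnAddress, int... args) {
        for (int i = args.length - 1; i >= 0; i--) push(args[i]);
        push(returnAddress);
    }

    // Step 3: push ebp ; mov ebp, esp
    void prologue() { push(ebp); ebp = esp; }

    // Step 5: reserve space for local variables below ebp.
    void allocateLocals(int count) { esp -= 4 * count; }

    // Step 7: mov esp, ebp ; pop ebp ; ret (returns the popped return address)
    int epilogueAndRet() { esp = ebp; ebp = pop(); return pop(); }

    // Step 8: add esp, 4 * number of arguments
    void cleanup(int argCount) { esp += 4 * argCount; }

    public static void main(String[] args) {
        CdeclFrameSketch m = new CdeclFrameSketch();
        m.call(0x4000, 7, 4);
        m.prologue();
        m.allocateLocals(2);
        System.out.println("[ebp]    = " + m.load(m.ebp));       // old ebp
        System.out.println("[ebp+4]  = " + m.load(m.ebp + 4));   // return address
        System.out.println("[ebp+8]  = " + m.load(m.ebp + 8));   // first parameter
        System.out.println("[ebp+12] = " + m.load(m.ebp + 12));  // second parameter
        int ret = m.epilogueAndRet();
        m.cleanup(2);
        System.out.println("returned to " + ret + ", esp = " + m.esp + ", ebp = " + m.ebp);
    }
}
```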
