Compiler construction Paul Cockshott

Function

The job of a compiler is to take a file specifying an algorithm in a high-level language and translate it into assembler code for a particular computer. The end result of running the compiler is an executable file for the computer in question, although it may be necessary to run an assembler or linker after the compiler to produce the final binary file.

Approaches

Write the compiler in assembler, as was done with Basic interpreters. The first Fortran compiler (1957) was done this way.

Write the compiler C in its own language. If this is the first compiler of this type, then hand translate it to assembler: C -> human -> A. Use the hand-translated version to do a machine translation: C -> A -> A. This was done with the first Burroughs Algol compiler in 1960.

Approaches

Write the compiler for one language in another existing language, for example write a Java compiler in C, as is the case with gcj.

Write a new compiler for an existing language using an existing compiler for the same language. So gcc was written in C and originally compiled using the Unix C compiler cc.

Write the compiler using a compiler compiler.

Compiler compilers

Lexical analysis generators

These generate programs to tokenize the input stream into the lexemes defined in the high-level language. They are based on regular expressions and they typically output C or Java. Examples are Lex, JLex and Flex.

Compiler compilers

Syntax analysis generators

These are typically based on a BNF-like input notation that describes the context-free syntax of a language. They generate a parser for the language. The parser they produce may be in C, Java or some other high-level language. Examples: yacc, bison, SableCC.

They usually require a lexical analyser produced by a distinct package, but Sablecc integrates the lexical analysis as well.

Compiler compilers

Code generator generators

These do the job of producing code generators. A code generator translates the parsed input programme into assembler. Code generator generators are generally specified using a tree-matching grammar. Examples are Twig, Burg and ILCG.

Advantages of compiler compilers

They automate a time-consuming task.
They remove many of the bugs that can exist in hand-written compilers.

Disadvantages
You have to learn the specification language.
You have to write glue code to use the output.
They may force you to use unfamiliar programming paradigms, for example the visitor pattern with SableCC.

ILCG2

ILCG is a code generator generator that has been used in this department for a while. ILCG2 is an extension of the original system that includes the SableCC system for lexical analysis and parsing. ILCG2 allows you to specify:

the concrete syntax of your language, using SableCC
the machine you will generate code for, in ILCG

ILCG2

Here is the ILCG2 file you will work with:

compiler tinybasic parser tinyb cpu IA32

It will produce a compiler file called tinybasic.java.
It will use the file tinyb.sab to define the syntax of the language.
It will use the file IA32.ilc to define the processor you will be generating code for.

Sablecc

The syntax and semantics are defined in the Sable language. The reference manual for this language is available from the SableCC web site. Here I will give you a short introduction to it, but I advise you to download the manual for more details.

Parts of the SableCC File

1. Package 2. Helpers 3. Tokens 4. Ignored Tokens 5. Productions

Package

For example

Package tinyb;

defines the package the Java files will be in. That is to say, the files will be placed in subdirectories under the directory tinyb. The following subpackages will be created:

tinyb.analysis, tinyb.lexer, tinyb.node, tinyb.parser

Helpers

This section defines regular expressions used in the definition of terminal symbols in the grammar. Regular expressions are made up of:

characters, either in quotes ('z', 'S') or as decimal codes (10, 13)
character sets: ['a'..'z'], [['a'..'z']+['A'..'Z']], [['A'..'Z']-'E']
a bracketed regular expression: ( <regexp> )
an alternation of regular expressions, e.g. 'a'|['0'..'9']|'Z'
a string, e.g. 'then'
a helper id, e.g. alpha
a sequence of regular expressions, e.g. 'a' 'c' ('d'|'e')
a regular expression followed by a *, e.g. alpha* stands for zero or more alphas
a regular expression followed by a +, e.g. digit+ stands for one or more digits

Regular expressions can be named, e.g.:

digit = ['0'..'9'];
lwcase = ['a'..'z'];
upcase = ['A'..'Z'];
letter = lwcase | upcase;
alphanum = letter|digit;

Helper and other names must be lower case.
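For readers more used to java.util.regex, the character-set operations above have rough Java counterparts. The sketch below is only an analogy, not a description of how SableCC implements helpers; the class name HelperSets and its constants are made up for illustration.

```java
import java.util.regex.Pattern;

final class HelperSets {
    // ['a'..'z'] : a simple range
    static final Pattern LOWER = Pattern.compile("[a-z]");
    // [['a'..'z']+['A'..'Z']] : union of two sets
    static final Pattern LETTER = Pattern.compile("[a-zA-Z]");
    // [['A'..'Z']-'E'] : set difference, written in Java as an
    // intersection with a negated set
    static final Pattern UPPER_NOT_E = Pattern.compile("[A-Z&&[^E]]");
    // digit+ : one or more digits
    static final Pattern NUMBER = Pattern.compile("[0-9]+");
    // letter alphanum* : a letter followed by zero or more letters or digits
    static final Pattern ID = Pattern.compile("[a-zA-Z][a-zA-Z0-9]*");

    // True when the whole string s is matched by p.
    static boolean matches(Pattern p, String s) {
        return p.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches(UPPER_NOT_E, "D")); // true
        System.out.println(matches(UPPER_NOT_E, "E")); // false
        System.out.println(matches(ID, "x1"));         // true
        System.out.println(matches(NUMBER, ""));       // false
    }
}
```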

Tokens

Tokens are the actual lexemes recognised by the language. They are the terminal symbols in the context-free grammar of the language. A token is defined by a regular expression. It may contain helper ids, but it may not contain other token ids.

begin = 'begin';
then = 'then';
id = letter alphanum*;
number = digit+;

Note that order is significant: if we defined id before begin, then begin would be recognised as an id, not as a distinct token.
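The effect of token order can be tried out with a small hand-written lexer. This is a minimal sketch, assuming the usual rule that the longest match wins and that ties go to the token defined first; the class OrderedLexer and its methods are hypothetical and are not what SableCC generates.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OrderedLexer {
    // Rules are kept in declaration order; LinkedHashMap preserves it.
    private final Map<String, Pattern> rules = new LinkedHashMap<>();

    OrderedLexer rule(String name, String regex) {
        rules.put(name, Pattern.compile(regex));
        return this;
    }

    // Returns the token names found in the input, skipping white space.
    List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) { pos++; continue; }
            String best = null;
            int bestLen = 0;
            for (Map.Entry<String, Pattern> r : rules.entrySet()) {
                Matcher m = r.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the current best,
                // so on a tie the earlier rule is kept.
                if (m.lookingAt() && m.end() - pos > bestLen) {
                    best = r.getKey();
                    bestLen = m.end() - pos;
                }
            }
            if (best == null) {
                throw new IllegalArgumentException("no token at position " + pos);
            }
            out.add(best);
            pos += bestLen;
        }
        return out;
    }

    // Keywords declared before id, as on the slide.
    static OrderedLexer keywordsFirst() {
        return new OrderedLexer()
            .rule("begin", "begin").rule("then", "then")
            .rule("id", "[a-zA-Z][a-zA-Z0-9]*").rule("number", "[0-9]+");
    }

    // id declared before the keywords: the keywords can never win a tie.
    static OrderedLexer idFirst() {
        return new OrderedLexer()
            .rule("id", "[a-zA-Z][a-zA-Z0-9]*")
            .rule("begin", "begin").rule("then", "then")
            .rule("number", "[0-9]+");
    }

    public static void main(String[] args) {
        System.out.println(keywordsFirst().tokenize("begin x1 42")); // [begin, id, number]
        System.out.println(idFirst().tokenize("begin x1 42"));       // [id, id, number]
    }
}
```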

Productions

These list the non-terminals of the language being defined, e.g.:

statements = {list} statement statements | {empty} ;
value = {constant} number | {identifier} identifier | {expression} l_par expression r_par;

Note that each alternative in a multiway production is given a label in braces, thus {paramlist}. At most one unlabeled alternative is allowed for each production. The first production is the root of the parse.

Labelled non terminals

Suppose a production has two occurrences of the same non-terminal. For example

addexp= exp plus exp;

is illegal. Sable forces us to label the two occurrences differently:

addexp= [left]:exp plus [right]:exp;

This is so the generated Java code can distinguish the two sub-expressions.

Generated java

When ILCG2 uses Sable to process a file with package name Foo we get a collection of directories as follows:

/Foo
  /analysis
  /lexer
  /node
  /parser

Within the /Foo/node directory we get a collection of class files, one for each production or branch of a production.

Generated java

The production
expression = {value} [v]:value | {arith} [left]:value [op]:arithop [right]:value ;

produces 3 classes in directory node : AValueExpression.java AArithExpression.java PExpression.java

Abstract class

PExpression.java will be an abstract class that the other two extend.

/* This file was generated by SableCC (http://www.sablecc.org/). */
package tinyb.node;
public abstract class PExpression extends Node {
  // Empty body
}

{value} [v]:value produces

Note that [v]:value produces an attribute of type PValue called V.

package tinyb.node;
import tinyb.analysis.*;
public final class AValueExpression extends PExpression {
  private PValue _v_;
  public AValueExpression( PValue _v_) { setV(_v_); }
  public PValue getV() {return this._v_; }
  public void setV(PValue node) { ...}
  public String toString() {... }
  ... more methods
}

Concrete/abstract syntax tree

The Sable-generated components of your compiler will generate a concrete syntax tree. The concrete syntax is language specific. We must translate this into a language- and machine-independent abstract syntax tree using the ILCG classes (Intermediate Language for Code Generation). The ILCG system will generate a code generator from the abstract syntax tree to assembly language.

ILCG

Intermediate Language for Code Generation Both a specification language for machines and a representation for compiled code. Part of a Code Generator Generator System

ILCG machine Descriptions

register word EBX assembles[ebx] ;
reserved register word ESP assembles[esp];

The first of these would declare EBX to be of type ref word.

Aliasing

A register can be declared to be a sub-field of another register, hence we could write

alias register octet AL = EAX(0:7) assembles[al];

Register sets

A set of registers that are used in the same way by the instruction set can be defined.

pattern reg means [EBP|EBX|ESI|EDI|ECX|EAX|EDX|ESP];
pattern breg means[AL|AH|BL|BH|CL|CH|DL|DH];

All registers in a register set should be of the same length.

Types

The language has a set of predefined types: int8, int16, int32, int64, ieee32, ieee64, uint8, uint16 etc. These are the low-level types that hardware works with. It is necessary to set up a mapping between these types and the specific assembler syntax for the types.

type int32=DWORD;
type ieee64=QWORD;
type uint8=BYTE;
type int16=WORD;

Intel assembler syntax

Register Stacks

Whilst some machines have registers organised as an array, another class of machines, those oriented around postfix instruction sets, have register stacks. The ILCG syntax allows register stacks to be declared:

register stack (8)ieee64 FP assembles[ ] ;

Two access operations are supported on stacks:

Push(FP,^mem(20))
mem(20):=pop(FP)

Instruction formats

An instruction format is an abstraction over a class of concrete instructions. It abstracts over particular operations and types thereof whilst specifying how arguments can be combined.

pattern operator means [add | sub|imul|and|or|xor];
instruction pattern RR( operator op, anyreg r1, anyreg r2, int t)
  means[r1:=(t) op( ^((ref t) r1),^((ref t) r2))]
  assembles[op r1 , r2];

This describes instructions like

add eax,ebx
sub ah,bl

Parameterised patterns

instruction pattern RR( operator op, anyreg r1, anyreg r2, int t)
  means[r1:=(t) op( ^((ref t) r1),^((ref t) r2))]
  assembles [op r1 , r2];

On the original slide reserved words are shown in bold and parameters in italics. Parameters are matched to the tree during the meaning phase and substituted into the body of the assembly code during the assembler phase.

ILCG Semantic operators

Operator        Infix?    Means
:=              infix     assignment
^               prefix    dereference a register or memory
+, -, *                   add, subtract, multiply
>, <, <>                  greater, less, not equal
<=, >=                    less than or equal, greater than or equal
div                       divide
Push, pop                 stack operations
AND, OR, XOR              bitwise boolean
EXTEND                    extends numeric precision of value
>>, <<                    shift operations

Operand formats

Other patterns are used to define the address modes available on the machine, for example

pattern regindirf(reg r) means[^(r) ] assembles[ r ];
pattern baseplusoffsetf(reg r, signed s) means[+( ^(r) ,const s)] assembles[ r + s ];
pattern addrform means[label|baseplusoffsetf|regindirf];
pattern maddrmode(addrform f) means[mem(f) ] assembles[ [ f ] ];

Load and store

We can use similar patterns to say how to load and store into registers.

instruction pattern LOAD(maddrmode m, anyreg r1, type t)
  means[ (ref t) r1:= (t)^(m )]
  assembles['mov ' r1 ',' t ' ' m];
instruction pattern STORER(maddrmode m, reg r1, type t)
  means[ (ref t) m:= ^( r1) ]
  assembles['mov ' t ' ' m ',' r1];

We can also add instructions that use numbers immediately.

instruction pattern RMLIT(operator op,maddrmode rm, type t, offset sm)
  means[ (ref t) rm:= op(^(rm),(t) sm) ]
  assembles[op ' ' t ' ' rm ',' sm];
instruction pattern INC(maddrmode rm,int t)
  means[(ref t)rm:= + (^(rm),1)]
  assembles['inc ' t ' ' rm];

Control instructions

The ILCG language supports gotos, for statements and conditionals, as these are necessary to describe the semantics of certain processor instructions. Goto can be used to describe simple jump instructions.

instruction pattern GOTO(jumpmode l)
  means[goto l]
  assembles['jmp ' l];

If statements can be used to describe the semantics of conditional jumps. Note that the if statement here translates into a compare followed by a jump, and note the embedded newline character in the assembler.

instruction pattern IFLITGOTO(label l,addrmode r1,signed r2,condition c,signed t,int b)
  means[if((b)c((t) ^(r1),const r2))goto l]
  assembles[' cmp 't' ' r1 ', ' r2 '\n j' c ' near ' l];

The Intel architecture provides certain simple loop constructs as single instruction prefixes; the most prominent of these is the block copy instruction.

instruction pattern REPMOVSB(countreg s,maddrmode m1,sourcereg si, destreg di)
  means[for (ref int32)m1:=0 to ^(s) step 1 do
    (ref octet)mem(+(^(di),^((ref int32)m1))):=^((ref octet)mem(+(^(si),^((ref int32)m1)))) ]
  assembles[' inc ecx\n cmp ecx,0\n jle $+4\n rep movsb\n nop\n nop'];

Instructionset

At the end of an ILCG file you must list all the instruction patterns available in the order in which you want them to be tried. Instructions that come first are tried first when translating a programme, so you should list the most specific and powerful instructions first.

instructionset [IFLITGOTO |LOAD |INC |STORER |RMLIT |RR Etc..... ];

Structure of generated Code Generators

I have already done this for you

An example

Let us look at an example of how translation takes place. Consider the Basic statement

100 LET B:=B+1

Consider how this might be performed on an x86 lineage machine. If we used the most primitive style instructions we might code it as

mov eax,[b]
add eax,1
mov [b],eax

or as

mov eax,[b]
inc eax
mov [b],eax

Adding the instructions from the 8086 vintage to the options we might try

mov eax,[b]
lea eax,[eax+1]
mov [b],eax

or simply

inc dword[b]

Translation to ILCG

instructionset [IFLITGOTO |LOAD |INC |STORER |RMLIT |RR

Assume that we have already translated the concrete syntax tree coming from Sable into an abstract syntax tree. Then the statement would initially be translated as follows:

B would translate to mem(b), where b is a label
B+1 would translate to +((int32)^mem(b),1)
B:=B+1 would translate to (ref int32)mem(b):= +((int32)^mem(b),1)

The code generator will then run through the instructionset from top to bottom looking for a pattern that matches this. The patterns it compares (ref int32)mem(b):= +((int32)^mem(b),1) to are:

IFLITGOTO   if((b)c((t) ^(r1),const r2))goto l   Fail: does not start with if
LOAD        (ref t) r1:= (t)^(rm )               Fail: r1 is a register, not memory
INC         (ref t)rm:= + (^(rm),1)              Succeed

In order to succeed the matcher has to match (ref t) to (ref int32) and then match rm to mem(b).

Code generation symbol table

The code generator maintains a symbol table that associates pattern parameter names with ILCG trees and assembler translations. Consider this after we have matched t to int32, but before we have matched the parameter rm:

Parameter   ILCG it matches   Assembler it produces
t           int32             DWORD
rm          ?

Call other patterns

We have to match rm, which was defined as of type maddrmode, to the actual pattern mem(b). maddrmode is declared as

pattern maddrmode(addrform f) means[mem(f) ] assembles[ [ f ] ];

This implies we must match mem(f) to mem(b), which in turn implies that b must match f, which is of type addrform:

pattern addrform means[label|baseplusoffsetf|regindirf];

But b is a label, so this will succeed. The matcher will return the value '[b]' from maddrmode and bind this to the parameter rm in the code generator's symbol table.

Parameter   ILCG it matches   Assembler it produces
t           int32             DWORD
rm          mem(b)            [b]

The matching process

Remember we are matching

(ref int32)mem(b):= +((int32)^mem(b),1)

to the INC pattern, which was

(ref t)rm:= + (^(rm),1)

(On the original slide the part matched so far is shown in blue.) We can obviously match := + ( ,1) to := +( ,1). That leaves us with having to match (int32)^mem(b) to ^(rm). The cast will succeed since we have an uncast pattern ^(rm), which in principle could match any type, so we just have to match rm to mem(b). So the matcher looks up rm in the symbol table and finds it is already associated with mem(b), so the match succeeds.

The inc instruction's assembly part

assembles['inc ' t ' ' rm]

is then expanded by substituting in the t and rm parameters from the assembly column of the symbol table. So the output would be

inc DWORD [b]
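The walk through the instruction set and the role of the symbol table can be imitated in a few lines of Java. This is only a sketch under simplifying assumptions: casts and parameter types are ignored, a name starting with ? stands for a pattern parameter, and the class MatchSketch with its Tree type is hypothetical rather than part of the ILCG system.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

final class MatchSketch {
    // A tree node: an operator name plus child subtrees.
    static final class Tree {
        final String op;
        final List<Tree> kids;
        Tree(String op, Tree... kids) { this.op = op; this.kids = Arrays.asList(kids); }
        @Override public boolean equals(Object o) {
            return o instanceof Tree && ((Tree) o).op.equals(op) && ((Tree) o).kids.equals(kids);
        }
        @Override public int hashCode() { return Objects.hash(op, kids); }
        @Override public String toString() { return kids.isEmpty() ? op : op + kids; }
    }

    static Tree t(String op, Tree... kids) { return new Tree(op, kids); }

    // Match a pattern against a subject tree, recording parameter bindings.
    // A parameter seen for the second time must match the same subtree.
    static boolean match(Tree pattern, Tree subject, Map<String, Tree> table) {
        if (pattern.op.startsWith("?")) {
            Tree bound = table.get(pattern.op);
            if (bound == null) { table.put(pattern.op, subject); return true; }
            return bound.equals(subject);
        }
        if (!pattern.op.equals(subject.op) || pattern.kids.size() != subject.kids.size()) {
            return false;
        }
        for (int i = 0; i < pattern.kids.size(); i++) {
            if (!match(pattern.kids.get(i), subject.kids.get(i), table)) return false;
        }
        return true;
    }

    // Simplified patterns, listed in the order in which they are tried.
    static Map<String, Tree> patterns() {
        Map<String, Tree> p = new LinkedHashMap<>();
        p.put("IFLITGOTO", t("if", t("?cond"), t("goto", t("?l"))));
        p.put("LOAD", t(":=", t("reg", t("?r1")), t("^", t("?m"))));
        p.put("INC", t(":=", t("?rm"), t("+", t("^", t("?rm")), t("1"))));
        return p;
    }

    // Name of the first pattern that matches; its bindings are left in table.
    static String firstMatch(Tree subject, Map<String, Tree> table) {
        for (Map.Entry<String, Tree> e : patterns().entrySet()) {
            table.clear(); // discard bindings from a failed attempt
            if (match(e.getValue(), subject, table)) return e.getKey();
        }
        return null;
    }

    // mem(b) := +(^mem(b), 1), with the casts left out.
    static Tree example() {
        return t(":=", t("mem", t("b")), t("+", t("^", t("mem", t("b"))), t("1")));
    }

    public static void main(String[] args) {
        Map<String, Tree> table = new LinkedHashMap<>();
        System.out.println(firstMatch(example(), table)); // INC
        System.out.println(table);                        // {?rm=mem[b]}
    }
}
```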

Effect of instruction ordering

The assembler produced depends on the order of priority of the instructions listed in the instructionset. If we had the different instruction set order

IFLITGOTO |LOAD |STORER |INC |RMLIT |RR

that is, STORER now comes before INC, the instruction sequence generated would be

mov EAX, [b]
add EAX, 1
mov [b],EAX

In this case the first pattern to match would be STORER, since its ILCG is

(ref t) rm:= ^( r1)

The pattern (ref t) rm:= will match the abstract syntax pattern (ref int32)mem(b):= leaving the matcher to attempt to match the register r1 to +((int32)^mem(b),1). The matcher then goes through the instructions looking for an instruction that will put into a register the result of +((int32)^mem(b),1).

Alternative instruction sequence

The instruction set is searched and the matcher finds that +((int32)^mem(b),1) can be matched to the right-hand side of the RMLIT pattern

(ref t) rm:= op(^(rm),(t) sm)

by instantiating rm as a register, say EAX, op as +, t as int32 and sm as 1. But to do this we have to find a pattern that loads EAX with (int32)^mem(b). The matcher again searches the instructions and finds that LOAD has the semantics

(ref t) r1:= (t)^(rm )

where rm is a memory address mode and r1 is a register. This matches to produce

mov EAX, [b]

The success of this triggers RMLIT to produce

add EAX,1

which in turn allows the STORER pattern to complete matching, producing

mov [b],EAX

Main class

The ILCG2 system will, when given a compiler specification, invoke Sable on the language specification and build the appropriate code generator. So given

compiler tinybasic parser tinyb cpu IA32

it builds a tiny basic lexer and parser, builds an IA32 code generator, and then outputs a compiler class whose main method invokes both of these. This is shown on the next slide.

Main compiler method


// needs: import java.io.*;
public static void main(String[] arguments){
  if(arguments.length <1){
    System.out.println("usage:\njava tinyb.tinybasic sourcefile ");
    System.exit(1);
  }
  try{
    FileReader r= new FileReader(arguments[0]);
    PushbackReader pr = new PushbackReader( new BufferedReader( r), 1024) ;
    Lexer lexer = new Lexer( pr );            // create lexer using file
    Parser parser = new Parser(lexer);        // create parser using lexer
    tinyb.node.Node st = parser.parse();      // build the syntax tree
    // translate to ILCG tree; this class and method must be hand written
    ilcg.tree.Node[] trans = Translator.translate(st);
    IA32CG cpu = new IA32CG();
    PrintWriter asm =new PrintWriter(new FileOutputStream(arguments[0]+".asm"));
    cpu.setLogfile(new PrintWriter(new FileOutputStream(arguments[0]+ ".lst")));
    if(!cpu.codeGen(trans[0])) throw new Exception("codegen fails"); // gen asm
    cpu.buf.writeln("section .data");
    if(!cpu.codeGen(trans[1])) throw new Exception("codegen fails"); // gen asm
    cpu.flush(asm);cpu.flushlog();
    asm.close();
    // cpu is declared inside the try block, so it must be used here
    cpu.assemble(arguments[0]+".asm",arguments[0]+".o" );
  }
  catch(Exception e){ System.out.println(e);}
}

Ilcg translation

I have written for you a translator from tinybasic concrete syntax to Ilcg abstract syntax, but you will have to extend this translator, so it is worth understanding how it works. It translates into the classes in the package ilcg/tree, which allow ilcg abstract syntax trees to be represented. I will go through some of the key classes that you have to know about before going on to describe the overall algorithmic structure of such translators.

Interface Node

All tree nodes implement this. Nodes are used to represent the ILCG tree, and in the root class we declare the terminal symbols of the language as strings.

Method Summary

boolean equals(Object n): check exact equality of trees
Node eval(): evaluate expressions known at compile time; this allows arithmetic expressions like 3+5 to be done in the compiler, not at run time
void examine(TreeExaminer e): a method that is used by an examiner to visit all locations
Node getCost(): give an expression which evaluates to a cost of an operation in memory accesses
boolean knownAtCompileTime()
boolean matches(Node n): check if similar enough for code generation purposes
boolean matches(Node n, ilcg.SymbolTable D): check if similar enough for code generation purposes, with use of rewrite rules stored in D
Node modify(TreeModifier m): a method that must be instantiated, allowing a TreeModifier to substitute values into the tree
String returnType(): returns the type of an expression tree as a string
int weight(): returns a rough estimate of the number of registers needed to compile this sub tree

Class TreeExaminer

The code generators work using tree visiting classes. An example is the TreeExaminer class, which can be used to visit all nodes of a subtree. You typically extend the class by overriding the leave and visit methods to do what you want. It is invoked by passing a TreeExaminer to the examine() method of a node. Tree examiners are useful if you want to search trees or gather statistics on them, such as how many memory accesses they make.

Method Summary
void leave(Node n): this is called after all subtrees have been visited
boolean visit(Node n): this is called each time a node is visited, but before any subtrees are visited; if it returns false the sub trees are not visited
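The visit/leave protocol can be sketched with a few stand-in classes. The code below is an illustration only: N, Examiner and MemCounter are hypothetical simplifications and not the real ilcg.tree.Node or TreeExaminer, and it assumes that leave is called even when visit returns false.

```java
import java.util.ArrayList;
import java.util.List;

final class ExaminerSketch {
    // Hypothetical stand-in for a tree node: a label plus children.
    static final class N {
        final String label;
        final List<N> kids = new ArrayList<>();
        N(String label, N... kids) {
            this.label = label;
            for (N k : kids) this.kids.add(k);
        }
        // Pre-order walk: visit first, then the subtrees, then leave.
        void examine(Examiner e) {
            if (e.visit(this)) {
                for (N k : kids) k.examine(e);
            }
            e.leave(this);
        }
    }

    // Default examiner: visits everything and does nothing.
    static class Examiner {
        boolean visit(N n) { return true; }
        void leave(N n) { }
    }

    // Counts how many mem nodes a tree contains.
    static final class MemCounter extends Examiner {
        int count = 0;
        @Override boolean visit(N n) {
            if (n.label.equals("mem")) count++;
            return true;
        }
    }

    static int memAccesses(N tree) {
        MemCounter c = new MemCounter();
        tree.examine(c);
        return c.count;
    }

    // mem(b) := +(^mem(b), 1)
    static N example() {
        return new N(":=", new N("mem", new N("b")),
                     new N("+", new N("^", new N("mem", new N("b"))), new N("1")));
    }

    public static void main(String[] args) {
        System.out.println(memAccesses(example())); // 2
    }
}
```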

Assign

Class Assign represents assignments.

Assign(Node d, Node s): d = destination, s = source of the assignment; check the assignment for type validity
Assign(Node d, Node s, boolean check): d = destination, s = source of the assignment; if check is true, check the assignment for type validity

Class Int represents integer constants

This class extends java.lang.Number, so it has methods like intValue(), longValue() etc.

Constructor Summary

Int(): constructor for the Int object
Int(String s): constructor for the Int object, converts string to integer
Int(int i): constructor for the Int object
Int(long l): constructor for the Int object
Int(long l, String rep): constructor for the Int object; rep specifies the precision (int32, int16 etc.)

Labels and locations

Label Constructor Summary

Label(): constructs a label with a machine-generated name
Label(String s): constructs a label with a programmer-specified name

Location Constructor Summary

Location(Node n): constructor for the Location object; reserves a memory location in the target machine to store the expression in n

Conditionals

Constructor Summary

If(Node condition, Node thenPart): creates if then fi; it executes the thenPart if the condition is true at run time
If(Node condition, Node thenPart, Node elsePart): creates if then else fi; executes the thenPart if the condition is true at run time, otherwise executes the else part

If nodes are the abstract syntax representations of control branches in the original source code.

There are several other classes for constructing ILCG trees, but you can familiarise yourselves with these by looking at the html documentation in the package ilcg/tree.

If a condition is known at compile time when eval is called, the tree is simplified, thus

Label a=...
Node x= new Memref(a,"int32");
new If(new Dyad(new Int(4),new Int(3),"<"), new Assign(a, new Dyad(a,a,"*"))).eval()

will generate the null statement, since 4<3 is false.

Statements

Constructor Summary

Statement(): the null statement
Statement(Node n): a statement made of n
Statement(Node n, Statement s): a statement n followed by the statement s

The Statement class provides the basic mechanism for specifying sequencing in the abstract syntax tree.

new Statement(new Assign(regEAX, new Int(3)),
  new Statement( new Assign(new Memref(new Label("b"),"int32"),regEAX)))

would tell the code generator to output code to assign 3 to the eax register and then store the eax register in the memory location referred to by label b. We can also use this to set up store locations for variables:

new Statement(new Label("b"), new Statement(new Location(new Int(0,"int32"))))

would tell the code generator to reserve a memory location for a 32 bit integer, initialised to zero, and label it b. If you print ilcg trees out, the toString method of the Statement class renders it in the form seq( , ).

Dyad

Class Dyad represents dyadic arithmetic expressions like a*7.

Constructor Summary

Dyad(): constructor for the Dyad object
Dyad(Node l, Node r, String o): constructor for the Dyad object, with o being the string representation of the operation being performed

Calling the eval method on it after construction will ensure that any expressions that can be calculated at compile time are so evaluated, e.g.

new Dyad(new Int(3),new Int(7),"+").eval()

would return an Int(10).
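Compile-time evaluation of this kind is usually called constant folding. The following is a minimal sketch of the idea, using hypothetical stand-in classes named after the ones above; it is not the code of the real ilcg.tree package, and Var is an invented class for a value that is unknown at compile time.

```java
final class FoldSketch {
    interface Node {
        Node eval(); // returns a simplified, equivalent tree
    }

    // An integer constant: already as simple as it can be.
    static final class Int implements Node {
        final long value;
        Int(long value) { this.value = value; }
        public Node eval() { return this; }
        @Override public String toString() { return "Int(" + value + ")"; }
    }

    // A value only known at run time, so it cannot be folded.
    static final class Var implements Node {
        final String name;
        Var(String name) { this.name = name; }
        public Node eval() { return this; }
        @Override public String toString() { return name; }
    }

    static final class Dyad implements Node {
        final Node left, right;
        final String op;
        Dyad(Node left, Node right, String op) {
            this.left = left; this.right = right; this.op = op;
        }
        public Node eval() {
            // Simplify the operands first, then fold if both are constants.
            Node l = left.eval();
            Node r = right.eval();
            if (l instanceof Int && r instanceof Int) {
                long a = ((Int) l).value;
                long b = ((Int) r).value;
                switch (op) {
                    case "+": return new Int(a + b);
                    case "-": return new Int(a - b);
                    case "*": return new Int(a * b);
                    default: break; // other operators are left unfolded here
                }
            }
            return new Dyad(l, r, op);
        }
        @Override public String toString() { return op + "(" + left + "," + right + ")"; }
    }

    public static void main(String[] args) {
        System.out.println(new Dyad(new Int(3), new Int(7), "+").eval()); // Int(10)
        System.out.println(
            new Dyad(new Var("a"), new Dyad(new Int(2), new Int(3), "*"), "+").eval()); // +(a,Int(6))
    }
}
```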

The translator structure

The translation process happens using the visitor paradigm. A DepthFirstAdapter is a class which will visit every node in a Sable syntax tree. By extending it, we get a translator that will visit every node in the concrete syntax tree and translate it into the equivalent abstract syntax tree.

class Translator extends DepthFirstAdapter{
  Hashtable<tinyb.node.Node,ilcg.tree.Node> translations =
    new Hashtable<tinyb.node.Node,ilcg.tree.Node>();
  Hashtable<String,ilcg.tree.Node> symbols =
    new Hashtable<String,ilcg.tree.Node>();
  String precision = "int32";

The translations are stored in the hashtable translations, with tiny basic concrete syntax nodes as keys of the hash table and ilcg abstract syntax trees as values. An auxiliary hash table symbols is used to look up associations with identifier names. precision specifies the accuracy we are going to use in generated code.

One method per class

The visitor class has one method per class of syntax node. Consider the Sable specification

value = {constant} number | {identifier} identifier | {expression} l_par expression r_par;

The translator must have methods to deal with each of these alternatives:

public void outAConstantValue(AConstantValue node){
  translations.put(node,
    new ilcg.tree.Int( new Long(node.toString().trim()).longValue(),precision) );
}
public void outAIdentifierValue(AIdentifierValue node){
  translations.put(node,
    new Deref( idof(node.getIdentifier().toString().trim())));
}
public void outAExpressionValue(AExpressionValue node){
  translations.put(node,translations.get(node.getExpression()));
}

When a node is encountered the appropriate method for its class is called to translate it. It puts the translation into the translation table. Let's look at some methods in detail.

Constants

public void outAConstantValue(AConstantValue node){
  translations.put(node,
    new ilcg.tree.Int( new Long(node.toString().trim()).longValue(),precision) );
}

The main call here puts node into the translation table as a key, and the translation of the node as the associated value. The translation is

new ilcg.tree.Int( new Long(node.toString().trim()).longValue(),precision)

node.toString().trim() gives us the number with leading and trailing spaces removed. We then convert this into a Java long value and pass it to the constructor Int(long v, String representation), using the default value of precision (int32) as the representation. The end effect is to translate the tinybasic concrete syntax AConstantValue into an Ilcg Int instance.

Variables

public void outAIdentifierValue(AIdentifierValue node){ translations.put(node, new Deref( idof(node.getIdentifier().toString().trim()))); }

The translation here is of an identifier in the context of an expression. An identifier in a high level language is a name for a location in the computer memory. But what we want to have for the calculation is not the location but the contents of that location. In Ilcg source this operation of getting the contents of a location is denoted by the ^ character. In the syntax tree it is denoted by the Deref class. In order to find the memory location we call the method idof passing to it the source of the identifier with leading and trailing spaces removed.

Handling labels

The function idof checks if we have already encountered s. If we have, it simply returns it. Otherwise it looks to see if s is a variable (starts with a letter) or a line number. If it is a line number it associates with the line number s the label "line" followed by s, so if s was 20 it would get label line20. If it is a variable it allocates a memory location for s and associates both this memory location and s with the label "var" followed by s.

ilcg.tree.Node idof(String s){
  ilcg.tree.Node n = symbols.get(s);
  if(n==null){
    n= new Label("line"+s);
    if(!Character.isDigit(s.charAt(0))){
      n = new Label("var"+s);
      Memref m = new Memref(n,precision);
      data = new Statement(n,
        new Statement( new Location(new Int(0,precision)),data));
      n=m;
    }
    symbols.put(s,n);
  }
  return n;
}

Example translation

Note in what follows how labels are being used.

10 FOR I := 1 TO 5
20 PRINT I
30 NEXT I
40 END

Translated to assembler is

line10:
  mov DWORD [varI],1
label131051df1911:
line20:
  push DWORD [varI]
  lea eax, [print]
  call eax
  add esp, 4
line30:
  inc DWORD [varI]
  cmp DWORD [varI],5
  jle near label131051df1911
line40:
  .... some stuff cut out here
section .data
varI: dd 0

Dyadic expressions

Consider how we translate expressions like T+(I*2). This can be broken down into translating the left and right halves of the expression and then constructing an abstract dyadic expression node with the appropriate operator.

public void outAArithExpression(AArithExpression node){
  ilcg.tree.Node l,r;
  l=translations.get(node.getLeft());
  r=translations.get(node.getRight());
  ilcg.tree.Node n = new Dyad(l,r, node.getOp().toString().trim());
  translations.put(node,n);
}

Look at an example of how this works:

LET T:=T+(I*2)
-->
assign(mem(ref int32,varT),
  +((int32)^(mem(ref int32,varT)),
    *((int32)^(mem(ref int32,varI)),2) ))
-->
imul eax, DWORD [varI], 2
add DWORD [varT], eax

If statements

Sable

action = {conditional} if [c]:condition then [n]:number new_line

Java in Translator

public void outAConditionalAction(AConditionalAction node){
  translations.put(node,
    new If(translations.get(node.getC()),
      new Goto(idof(node.getN().toString().trim()))));
}

Basic source

IF T>15 THEN 30

Translated Ilcg tree

if(>((int32)^(mem(ref int32,varT)),15),goto(line30) ,null)

Resulting assembler

cmp DWORD [varT], 15
jg near line30

For statements

Sable

action = {forloop}for identifier assign [fromexp]:expression to [toexp]:expression new_line

Java in Translator

public void outAForloopAction(AForloopAction node){
  try {
    Label l = new Label();loops.push(l);
    Assign init = new Assign(idof(node.getIdentifier().toString().trim()),
      translations.get(node.getFromexp()));
    limits.push(translations.get(node.getToexp()));
    translations.put(node,new Statement(init, new Statement(l)));
  }catch(Exception e){Error(node,e);}
}

The loops stack holds the labels for returning to. The limits stack holds the upper bounds of loops. Using stacks allows loop nesting.

Basic source

FOR I:= 1 to 5

Translated Ilcg tree

seq(assign(mem(ref int32,varI),1),seq(label131712bd8954))

Resulting assembler

mov DWORD [varI], 1
label131712bd8954:

Next statements

Sable

action = {loopend} next [i]:identifier new_line

Java in Translator

public void outALoopendAction(ALoopendAction node) {
  try{// first generate the increment
    ilcg.tree.Node i =idof(node.getI().toString().trim());
    ilcg.tree.Node di = new Deref(i);
    ilcg.tree.Node inc = new Assign(i, new Dyad(di,new Int(1,precision),"+"));
    ilcg.tree.Statement jump = new Statement(new If(
      new Dyad(di,limits.pop(),"<="),new Goto(loops.pop())));
    translations.put(node, new Statement(inc,jump));
  }catch(Exception e){Error(node,e);}
}

Basic source

NEXT I

Translated Ilcg tree

seq(assign(mem(ref int32,varI),+((int32)^(mem(ref int32,varI)),1)),
  seq(if(<=((int32)^(mem(ref int32,varI)), 5), goto(label131712bd8954) , null)))

Resulting assembler

inc DWORD [ varI]
cmp DWORD [ varI], 5
jle near label131712bd8954
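To see why the loop labels and limits are kept on stacks, the FOR and NEXT handling can be imitated with a small generator that emits assembler text directly. This is a simplified sketch and not the Translator itself: the class LoopStacks is hypothetical, it writes strings instead of building ILCG trees, and its labels are simple counters rather than the machine-generated names shown above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class LoopStacks {
    private final Deque<String> loops = new ArrayDeque<>();  // labels to jump back to
    private final Deque<String> limits = new ArrayDeque<>(); // upper bounds of the loops
    private final List<String> out = new ArrayList<>();
    private int nextLabel = 0;

    // FOR var := from TO to
    void forLoop(String var, String from, String to) {
        String label = "label" + nextLabel++;
        loops.push(label);
        limits.push(to);
        out.add("mov DWORD [var" + var + "], " + from);
        out.add(label + ":");
    }

    // NEXT var: increment, compare with the innermost limit, jump back.
    void next(String var) {
        out.add("inc DWORD [var" + var + "]");
        out.add("cmp DWORD [var" + var + "], " + limits.pop());
        out.add("jle near " + loops.pop());
    }

    List<String> code() { return out; }

    // FOR I := 1 TO 5 / FOR J := 1 TO 3 / NEXT J / NEXT I
    static List<String> nestedExample() {
        LoopStacks g = new LoopStacks();
        g.forLoop("I", "1", "5");
        g.forLoop("J", "1", "3");
        g.next("J");
        g.next("I");
        return g.code();
    }

    public static void main(String[] args) {
        for (String line : nestedExample()) System.out.println(line);
    }
}
```

Because NEXT pops the most recently pushed label and limit, the inner NEXT J jumps to label1 with limit 3 and the outer NEXT I jumps to label0 with limit 5.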

Print statements
Sable

action = {printexp} print [e]:expression new_line

Java in Translator

public void outAPrintexpAction(APrintexpAction node){
  Function f = new Function( exidof("print"),precision,precision);
  Vector <ilcg.tree.Node> v = new Vector<ilcg.tree.Node>();
  v.add(translations.get(node.getE()));
  translations.put(node, new Monad(f,new Cartesian(v)));
}

Basic source

PRINT T

Translated Ilcg tree

print[(int32)^(mem(ref int32,varT))]

Resulting assembler

push DWORD [ varT]   ; push T on the stack
lea eax,[ print]     ; find the address of print
call eax             ; call that address
add esp, 4           ; restore the stack

C routine called

void print(int i){printf("%d ",i);}

Calling C functions

Requirements
1. Must establish a name correspondence with the C routine. Issues here:
   - case of the names
   - allowed characters
   - how the names are passed in assembler
2. Must pass parameters appropriately.
3. Must get results back from C routines.

Calling C functions

Characters and significance
Case is significant in C, but this is not true of all languages. Pascal, for instance, makes case insignificant, and requires that externals whose case is significant be given a name in quotes, for example:
procedure close (var f:fileptr); external name 'pasclose';
This allows the external routine to have a name different from its internal representation. The allowed characters in a name in Basic are limited to the uppercase letters, which means we cannot call a C routine with an _ or a digit in its name unless we extend the syntax for externals along the above lines.

Calling C functions

Assembler representation
In the assembler file, list all the externals called as follows:
extern print
extern println

Then we can call them just as if they were declared within this file.
call print
But to do this you have to link the object file produced by the assembler (.o file) with an object file or C file containing the C routine.

Calling C functions

Assembler representation
Underscores

Most 32-bit C compilers share the convention used by 16-bit compilers, that the names of all global symbols (functions or data) they define are formed by prefixing an underscore to the name as it appears in the C program. However, not all of them do: the `ELF' specification used in Linux .o files states that C symbols do not have a leading underscore on their assembly-language names. Thus if you are producing code for Linux, which uses ELF, do not use underscores.
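A code generator can isolate this difference in one small helper. The sketch below is only an illustration: the class `SymbolNames`, the enum `ObjectFormat` and both method names are hypothetical and not part of the translator shown in these slides.

```java
// Hypothetical helper that chooses the assembler-level name of a C symbol.
class SymbolNames {
    // ELF (Linux) uses the C name unchanged; the other case models
    // formats whose C compilers prefix global symbols with an underscore.
    enum ObjectFormat { ELF, UNDERSCORE_PREFIXED }

    static String asmName(String cName, ObjectFormat format) {
        return format == ObjectFormat.ELF ? cName : "_" + cName;
    }

    // Builds the declaration line placed at the top of the assembler file.
    static String externDirective(String cName, ObjectFormat format) {
        return "extern " + asmName(cName, format);
    }

    public static void main(String[] args) {
        System.out.println(externDirective("print", ObjectFormat.ELF));                 // extern print
        System.out.println(externDirective("print", ObjectFormat.UNDERSCORE_PREFIXED)); // extern _print
    }
}
```

Keeping the decision in one place means the rest of the generator can refer to C routines by their C names regardless of the target format.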

Calling C functions

The C calling convention

1. The caller pushes the function's parameters on the stack, one after another, in reverse order (right to left, so that the first argument specified to the function is pushed last).
2. The caller then executes a near `CALL' instruction to pass control to the callee.

So if we have a C function foo(int a, int b) being called as
foo(x,4)
we would generate the code
push 4
push dword [x]
call foo
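The caller's side of the convention is mechanical enough to generate from a list of operands. The following is a minimal sketch under stated assumptions: the class `CallEmitter` and its method are hypothetical, every argument is assumed to occupy 4 bytes, and operands are passed in as ready-made assembler operand strings. The stack adjustment after the call follows the pattern of the print example earlier (add esp, 4 for one argument).

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical emitter for the caller's side of the C calling convention.
class CallEmitter {
    // operands are listed in source order (leftmost argument first).
    static String emitCall(String function, List<String> operands) {
        StringBuilder out = new StringBuilder();
        // Push right to left, so the leftmost argument is pushed last.
        for (int i = operands.size() - 1; i >= 0; i--) {
            out.append("push ").append(operands.get(i)).append('\n');
        }
        out.append("call ").append(function).append('\n');
        // The caller removes the arguments again: 4 bytes per argument (assumed size).
        if (!operands.isEmpty()) {
            out.append("add esp, ").append(4 * operands.size()).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // foo(x, 4)
        System.out.print(emitCall("foo", Arrays.asList("dword [x]", "4")));
    }
}
```

For foo(x,4) this produces the three instructions shown above, followed by add esp, 8 to discard the two parameters.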

Calling C functions

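Stack as seen by foo once it has set up EBP as its frame pointer:
[ebp+12]   second parameter (b = 4)
[ebp+8]    first parameter (a = x)
[ebp+4]    retaddr
[ebp]      Old ebp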

3. The callee receives control, and typically (although this is not actually necessary, in functions which do not need to access their parameters) starts by saving the value of `ESP' in `EBP' so as to be able to use `EBP' as a base pointer to find its parameters on the stack. However, the caller was probably doing this too, so part of the calling convention states that `EBP' must be preserved by any C function. Hence the callee, if it is going to set up `EBP' as a frame pointer, must push the previous value first.
foo: push ebp
     mov ebp,esp
4. The callee may then access its parameters relative to `EBP'. The doubleword at `[EBP]' holds the previous value of `EBP' as it was pushed; the next doubleword, at `[EBP+4]', holds the return address, pushed implicitly by `CALL'. The parameters start after that, at `[EBP+8]'. The leftmost parameter of the function, since it was pushed last, is accessible at this offset from `EBP'; the others follow, at successively greater offsets. Thus, in a function such as `printf' which takes a variable number of parameters, the pushing of the parameters in reverse order means that the function knows where to find its first parameter, which tells it the number and type of the remaining ones.

Calling C functions

5. The callee may also wish to decrease `ESP' further, so as to allocate space on the stack for local variables, which will then be accessible at negative offsets from `EBP'.
6. The callee, if it wishes to return a value to the caller, should leave the value in `AL', `AX' or `EAX' depending on the size of the value. Floating-point results are typically returned in `ST0'.
7. Once the callee has finished processing, it restores `ESP' from `EBP' if it had allocated local stack space, then pops the previous value of `EBP', and returns via `RET'.
mov eax, retval
mov esp,ebp
pop ebp
ret
8. When the caller regains control from the callee, the function parameters are still on the stack, so it typically adds an immediate constant to `ESP' to remove them (instead of executing a number of slow `POP' instructions). In our example:
call foo
add esp,8 ; removes the two parameters
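The whole sequence of steps can be traced with a toy model of the stack. The sketch below is an illustration only: the class `StackFrameSketch` is hypothetical, the return address is a dummy value, and the body of foo (returning a + b) is an assumption, since the slides do not say what foo computes.

```java
// Toy model of a 32-bit stack: addresses are in bytes, each slot holds 4 bytes.
class StackFrameSketch {
    private final int[] memory = new int[64];
    int esp = memory.length * 4;   // the stack grows downwards from the top
    int ebp = 0;

    void push(int value) { esp -= 4; memory[esp / 4] = value; }
    int pop() { int v = memory[esp / 4]; esp += 4; return v; }
    int load(int address) { return memory[address / 4]; }

    // Models a call foo(x, 4), with foo assumed to return a + b.
    int callFoo(int x) {
        // caller
        push(4);                    // push 4          (b, rightmost, pushed first)
        push(x);                    // push dword [x]  (a, leftmost, pushed last)
        push(0x1234);               // call foo        (pushes a return address; dummy here)
        // callee prologue
        push(ebp);                  // push ebp
        ebp = esp;                  // mov ebp,esp
        // callee body: parameters are found at positive offsets from EBP
        int a = load(ebp + 8);
        int b = load(ebp + 12);
        int eax = a + b;            // result is left in EAX
        // callee epilogue
        esp = ebp;                  // mov esp,ebp
        ebp = pop();                // pop ebp
        pop();                      // ret             (pops the return address)
        // caller again
        esp += 8;                   // add esp,8       (removes the two parameters)
        return eax;
    }

    public static void main(String[] args) {
        StackFrameSketch s = new StackFrameSketch();
        System.out.println(s.callFoo(38));   // prints 42
    }
}
```

After the call both esp and ebp hold the values they had before it, which is the property the prologue, epilogue and caller cleanup are jointly responsible for.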
