
CS 105 : Compiler Design Project

Phase I. Programming Language Design:

1. Instructions:
You will be given a group code, which consists of the last letter of your section and a group number.
Design your own small original imperative programming language.
Give your programming language a name.
The more complex your language, the higher its potential score. However, a very
complex language could cause difficulties in the succeeding project phases.
This language score will be used to compute your grade in this phase and in the succeeding
phases of the project.

2. Language Requirements:
Document your language. Identify the following features of your language:

2a. Minimum Requirements:


• What is the character set (alphabet) of your language?
• Is your language case-sensitive? Explain.
• White spaces:
▪ You must be able to handle spaces, tabs, end-of-line chars, and end-of-file char.
▪ How do you indicate comments till end-of-line?
▪ Do you require other control characters? For what purpose?
• How do you start, end and name your programs?
• What are your tokens?
▪ List all your tokens in a table.
▪ What are your reserved words?
▪ Do you have special constants like TRUE? Enumerate.
▪ What are your arithmetic, logical and relational operators?
▪ What is/are your assignment operator/s?
▪ How does each token begin and end? You cannot require that tokens must be delimited
by only one special character like a space. However, a space can be used as one of
several ways to end a token.
▪ Create DFA(s) that recognize every kind of token you have.
• What data types are allowed for identifiers and constants?
▪ You must have at least one boolean type, a numeric type and a string type for identifiers
and constants.
▪ How do you specify constants for each type?
• What are the rules for forming identifiers?
▪ How are identifiers associated with their data types? Explicitly or implicitly? What are your
identifier declaration rules?
▪ Create regular expressions and/or DFAs for the identifiers and different kinds of constants.
• Are you using statement separators like semi-colons, periods, etc.? What? How? Each one
is considered a token.
• Are you using block scope? If so, how do you start and end your blocks?
• Do you explicitly or implicitly convert any data type to string? How?
• Do you explicitly or implicitly convert string values to other data types? How?
• How do you get input(s) from the user?
• How do you send output(s) to the user?
• How do you implement assignment statements?
• How do you implement conditional statements?
▪ You must have if/then/else and if/then constructs that can be nested.
• What kind of looping constructs do you implement? How?
▪ You are required to implement your version of the for <initializer-condition-action> loop
<loop-body> looping construct. The initializer statement initializes an iterator variable.
Each time the condition is met, the statements in the loop-body are performed, after which
the action statement is performed. The action statement should be allowed to increase
or decrease the value of the iterator. The number of iterations should be allowed to be
either fixed or variable, depending on the condition and action statements.
▪ You may implement more than one looping construct.
▪ You must be able to handle nested loops.
• Expressions
▪ What binary arithmetic operators will be included? How?
▪ You must include addition and multiplication.
▪ What logical operators are included?
▪ You must include a version of and, or, and not.
▪ What relational operators are included? For what data types are these applicable?
▪ You must include a version of smaller, larger and equals for numbers.
▪ You must include a version of equals for boolean and strings.
▪ How do you parenthesize expressions?
▪ You must allow nested parenthesization.
▪ Provide a table indicating the order of precedence (from highest to lowest) and
associativity rules (left-to-right or right-to-left) for all your operators.
• Does your language have any semantic rules not captured by your DFAs and CFGs? If so,
enumerate and describe them.
• What other constructs are required in your language? (What must be done?)
• What are the limitations of your language? (What can't be done?)

2b. Optional Features (You'll earn more points for each optional feature. Enumerate and
describe each of these features.)
• Comments: Allowing in-line and multiline comments.
• Allowing constant data types: Identifiers to hold constant data.
▪ You must allow initialization and check that their values are not changed.
• Allowing tokens including constants to span more than one line in source code.
• Allowing tokens and constants that can be continued after whitespaces (like FORTRAN).
• Having special characters indicated by escape sequences, like newline characters in strings.
• Including noise tokens, i.e. optional reserved words.
• Having more data types, various levels of precision or non-decimal radixes.
▪ Negative, fixed point or floating point numbers.
▪ short and int and long; float and double; with characteristic and mantissa;
▪ characters
▪ binary, octal, and hexadecimal data types
▪ Your compiler will be (later) required to perform type-checking for each data type.
• Including other data types:
▪ records/structs,
▪ composite types like ordered pairs, etc.
▪ enumerations
• Arrays
▪ 1 dimensional or 2 dimensional or n-dimensional;
▪ same type or mixed-types
• Allowing definitions of new types from existing types (like typedef)
• Allowing both global and block/local identifiers.
• Allowing identifiers with the same name but of different types or of different scope
• Additional kinds of statements
▪ Switch/case,
▪ Goto statements and statement labels
▪ Break loop or continue loop
▪ Additional looping constructs
• Having more operators in expressions
▪ Unary plus, unary minus (more points if you use the same symbol as binary plus and
binary minus)
▪ subtraction, division, exponentiation,
▪ built-in numeric functions (trig, log, e, hyperbolic, statistical, etc.)
▪ left-shift and right-shift operators
▪ Logical operators: NAND, NOR, XOR, etc.
▪ Relational operators: >=, <=, !=, etc.
▪ String lexicographic relational comparison (< or >).
▪ built-in string functions: concatenation, substring, search, length, etc.
• Allowing mixed-mode expressions (operands of different types, upcasting, downcasting)
• Handling lists of items
▪ ex. In declarations, listing ids of the same type,
▪ ex. In I/O, listing items to input or output,
▪ ex. In procedure calls, listing ids in parameter lists, etc.
• Having programmer-defined callable procedures
▪ with or without parameters?
▪ with return value?
▪ what parameter passing methods are allowed?
▪ allowing recursion
• Object-Oriented Features (like Java or C++)
▪ Classes, Objects
• Functional Programming (like Lisp)
• Logic Programming (like Prolog)
• etc.

3. Specify initial DFA(s) for your tokens


• for reserved words
• for whitespaces
• for arithmetic operators (ex. +, -, ^, ), (, etc.), logical operators (ex. and, or, not) and
relational operators (ex. :=, ==, <=, <<, >=, !=, <<<, etc. if you have these),
• for numeric constants, string constants, boolean constants, etc.
• for identifiers
• for special symbols (ex. statement terminators/separators, EOF, EOL, etc.)
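
As a rough illustration of what a table-driven DFA for two of these token kinds might look like, here is a minimal Python sketch. The state names, the token names ID and CONSTINT, and the helper `longest_match` are assumptions made for this example only; your own DFAs depend on the language you design.

```python
# Hypothetical DFA covering two token kinds: identifiers and unsigned integer constants.

def char_class(ch):
    """Map a character to the input class used by the transition table."""
    if ch.isalpha():
        return "letter"
    if ch.isdigit():
        return "digit"
    return "other"

# TRANSITIONS[state][input class] -> next state; a missing entry means "no move".
TRANSITIONS = {
    "start": {"letter": "in_id", "digit": "in_int"},
    "in_id": {"letter": "in_id", "digit": "in_id"},
    "in_int": {"digit": "in_int"},
}

# Accepting states and the (assumed) token type each one recognizes.
ACCEPTING = {"in_id": "ID", "in_int": "CONSTINT"}

def longest_match(text, pos=0):
    """Run the DFA from text[pos]; return (token_type, lexeme) for the
    longest accepted prefix, or None if no prefix is accepted."""
    state = "start"
    last_accept = None
    i = pos
    while i < len(text):
        next_state = TRANSITIONS.get(state, {}).get(char_class(text[i]))
        if next_state is None:
            break                      # no transition: the token ends here
        state = next_state
        i += 1
        if state in ACCEPTING:
            last_accept = (ACCEPTING[state], text[pos:i])
    return last_accept
```

Because the DFA stops at the first character with no transition, `longest_match("count1+2")` ends the identifier at `+` without needing a space as a delimiter.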

4. Specify an initial CFG for your programming language.


• What is your start variable?
• Do not create grammar rules for each token; those are handled by the DFAs.
• How are several tokens structured to form the syntax of your statements and your programs?
• Ensure your grammar enforces syntactic rules and most of the language rules, including
precedence and associativity of operations.
▪ Your grammar must be able to parse statements like
➢ if ctr1 + ctr2 < 3 * (ctr1 + 5) AND flag = FALSE
• Your CFG should be able to distinguish between valid and invalid programs of your language.
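
One common way to make a grammar enforce precedence and associativity is to use one grammar variable per precedence level. The Python sketch below parses the condition from the sample statement above with such a layered grammar. The operator set, the token pattern, and all names (`tokenize`, `Parser`, `parse_expression`) are assumptions for illustration, and the loops implement left-associative operators only.

```python
import re

OPERATOR_WORDS = {"AND", "OR"}   # assumed logical operator spellings

def tokenize(text):
    """Split text into identifiers/keywords, integers, and one-character symbols."""
    return re.findall(r"[A-Za-z]\w*|\d+|[()+*<>=-]", text)

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def left_assoc(self, operand, operators):
        """level -> operand (op operand)*, building a left-leaning tree."""
        node = operand()
        while self.peek() in operators:
            op = self.take()
            node = (op, node, operand())
        return node

    # One method per precedence level, lowest precedence first.
    def parse_or(self):
        return self.left_assoc(self.parse_and, {"OR"})

    def parse_and(self):
        return self.left_assoc(self.parse_rel, {"AND"})

    def parse_rel(self):
        return self.left_assoc(self.parse_add, {"<", ">", "="})

    def parse_add(self):
        return self.left_assoc(self.parse_mul, {"+", "-"})

    def parse_mul(self):
        return self.left_assoc(self.parse_atom, {"*"})

    def parse_atom(self):
        tok = self.take()
        if tok == "(":
            node = self.parse_or()
            if self.take() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok is None or tok in OPERATOR_WORDS or not tok[0].isalnum():
            raise SyntaxError(f"unexpected token {tok!r}")
        return tok

def parse_expression(text):
    parser = Parser(tokenize(text))
    tree = parser.parse_or()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree
```

With these assumed levels, `parse_expression("ctr1 + ctr2 < 3 * (ctr1 + 5) AND flag = FALSE")` groups the arithmetic first, then the comparisons, and applies AND last.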
5. Provide Sample Programs:
Write one or more example error-free programs that demonstrate the features of your language.
Write example programs that will result in scanning errors.
Add comments to identify known errors that will not be caught by your scanner.
Write example programs that will result in parsing errors.
Add comments to identify known errors that will not be caught by your parser.

6. Grading:
• Your grade in this phase will be based on the following:
▪ complexity of your language,
▪ the completeness, clarity and accuracy of your write-up and examples,
▪ ability to correctly answer questions about the language and the documentation,
▪ your timeliness,
▪ your peer evaluation.
• If you do not deliver on promised features:
▪ Points earned for those features in this phase will be removed.
▪ You get additional small deductions.
• Optional features added after the language design phase could increase your language score, but
not as much as if they were declared during the language design phase.
Phase II. Write a scanner.
1. Requirements:
• Write at least two methods or programs: a tester and a scanner.
▪ The tester specifies the input file containing the source program to be scanned.
▪ The tester creates a symbol table for identifiers, with the lexeme as key, and details about
the identifier as value. Initially, the details will include the lexeme (again) and token type
(id) but later would include its data type and its value.
▪ The tester makes one request per token to the scanner. The tester must keep requesting
tokens from the scanner until the end of the input file is reached.
▪ The scanner returns the next good token for each request from the tester.
▪ If the scanner detects an identifier, it should check if the id is already in the symbol table.
If not, it adds a new entry for the identifier.
▪ If the scanner consumes an error, it should print it. If an error is passed to the tester, the
tester should print it.
▪ The tester should print all good tokens.
▪ Tokens should be printed depending on their type.
➢ Whitespaces should not be printed.
➢ Reserved words and operators should be printed using their token names. (ex.
[COMMA],[READ],[GT],[SUBT],[END] )
➢ Identifiers should be printed to show their lexemes. (ex. x or numStudents)
➢ Constant values should be printed to show their actual values. (ex. "Hello, World!" or
TRUE)
▪ You may write separate routines that perform one or more of the following:
➢ returns the current position of the next character to be read from the file,
➢ reads the source program and returns the next character, if any,
➢ reads a sequence of characters from the file
➢ unreads a character or a sequence of characters
▪ You must build your own scanner from scratch. You cannot use pre-existing scanners
(like Lex and Flex) created by others.
▪ Optional features of your scanner (You earn more points for doing some of these):
➢ Indicating line number of errors,
➢ Fixing errors,
➢ Skipping sequences of characters that are in error or do not form real tokens
• Write test input source program(s) for scanning.
▪ They must have all kinds of tokens to be identified.
▪ They must have representative character sequences that do not make good tokens.
▪ You will use these to demonstrate what your scanner can and cannot do.
• Document all changes you made in the language specifications since the documentation you
provided for the language design phase. Highlight these changes. Did you add, subtract or
change features of your language? The graded original and the revised version of your
language specifications must both be part of your phase 2 documentation.
• Document your phase 2:
▪ Identify the tokens and draw the DFA(s) actually used by the scanner.
➢ How do these compare against your DFA(s) from the language design phase?
➢ Why did you make these changes?
➢ If you have multiple DFAs, in what sequence are these DFAs executed?
➢ Indicate if reserved words are checked by table or by DFA(s).
▪ Provide a system architecture diagram showing your modules and the interfaces between
modules.
➢ Show the inputs to each module and where these inputs come from (user file or
which module).
➢ Show the outputs from each module and who (user or which module) consumes these
outputs.
➢ Explain the purpose of each module.
▪ How do you handle errors discovered by the scanner?
▪ What errors cannot be handled by your scanner, if any?
▪ Specify if your scanner is consuming whitespaces and/or erroneous character sequences,
or if they are being passed to the caller.
• Demonstrate your scanner to the teacher with your computer.
• Hand-in your printed documentation as described above.
▪ Show the source code of your scanner routines and the input source code for scanning.
▪ The teacher may modify some of your input source code to test language features you
promised to implement in your language design.
• Have a back-up of all your source code.
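
The tester/scanner interaction described above could be organized roughly as in the following Python sketch. It is hand-written, character by character, with no scanner generator. The reserved words READ and END, the one-character tokens, and the names `Scanner` and `run_tester` are assumptions for this example. Errors are collected in a list instead of printed so that the result is easy to check.

```python
RESERVED = {"READ", "END"}                          # hypothetical reserved words
SYMBOLS = {",": "COMMA", ">": "GT", "-": "SUBT"}    # hypothetical one-character tokens

class Scanner:
    def __init__(self, source, symbol_table):
        self.source = source
        self.pos = 0
        self.line = 1
        self.symbols = symbol_table
        self.errors = []

    def next_token(self):
        """Return the next good token as (token_type, lexeme); ("EOF", "") at end of input."""
        src = self.source
        while self.pos < len(src):
            ch = src[self.pos]
            if ch in " \t\r\n":                     # consume whitespace, count lines
                if ch == "\n":
                    self.line += 1
                self.pos += 1
            elif ch.isalpha():                      # identifier or reserved word
                start = self.pos
                while self.pos < len(src) and src[self.pos].isalnum():
                    self.pos += 1
                lexeme = src[start:self.pos]
                if lexeme in RESERVED:              # reserved words checked by table
                    return (lexeme, lexeme)
                # Add the identifier to the symbol table only if it is new.
                self.symbols.setdefault(lexeme, {"lexeme": lexeme, "token": "ID"})
                return ("ID", lexeme)
            elif ch.isdigit():                      # unsigned integer constant
                start = self.pos
                while self.pos < len(src) and src[self.pos].isdigit():
                    self.pos += 1
                return ("NUM", src[start:self.pos])
            elif ch in SYMBOLS:
                self.pos += 1
                return (SYMBOLS[ch], ch)
            else:                                   # consume the error and record it
                self.errors.append(f"line {self.line}: unexpected character {ch!r}")
                self.pos += 1
        return ("EOF", "")

def run_tester(source):
    """Request one token per call until EOF; return printable tokens, symbol table, errors."""
    symbols = {}
    scanner = Scanner(source, symbols)
    printed = []
    while True:
        kind, lexeme = scanner.next_token()
        if kind == "EOF":
            break
        # Identifiers and constants show their lexeme; everything else shows its token name.
        printed.append(lexeme if kind in ("ID", "NUM") else f"[{kind}]")
    return printed, symbols, scanner.errors
```

In this sketch the scanner consumes whitespace and erroneous characters itself, so the tester only ever receives good tokens. Passing errors to the tester instead is an equally valid design, as long as the documentation says which one is used.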

2. Grading:
• Your grade in this phase will be based on the following:
▪ Delivery of scanner requirements
▪ Quality of test input programs
▪ Correct classification and printout of tokens
▪ Error Handling
▪ Maintenance of symbol table for identifiers
▪ Scanner optional features
▪ Presentation and ability to answer questions correctly
▪ the completeness, clarity and accuracy of documentation
▪ your timeliness,
▪ your peer evaluation
Phase III. Write a parser.
1. Requirements:
• Write at least two new methods or programs: a tester and a parser.
▪ The tester specifies the input file containing the source program to be parsed.
▪ The tester calls a routine that prints all the tokens of the source program to produce an
output similar to the printout of the scanner phase.
➢ Note: This step is unnecessary and redundant for parsing. It would however be useful
to check the sequence of tokens that will be processed during parsing.
▪ The tester calls the parser to parse the source program, and receives back the resulting
k-way parse or syntax tree.
➢ It is preferable that an abstract syntax tree be created instead of a concrete syntax tree
(parse tree) in preparation for phase 4.
▪ The tester prints the k-way tree generated.
▪ The parser parses the source program, and generates and returns the corresponding
k-way parse or syntax tree.
▪ The parser should call the scanner repeatedly to fetch one token from the file per call.
▪ You must manually write your own Recursive Descent, LL, LALR, or LR parser.
➢ For top-down parsers, you have to remove left-recursions and left-factor your CFG.
➢ For bottom-up parsers, you have to compute first and follow sets.
▪ You must build your own parser from scratch. You cannot use pre-existing parsers (like
YACC) created by others.
▪ Optional features of your parser (You earn more points for doing some of these):
➢ Indicating line numbers of statements that have parsing errors,
➢ Fixing parsing errors,
➢ Skipping sequences of tokens in error or bypassing grammar variables
• Write test input source program(s) for parsing.
▪ They must demonstrate the usage of most of your grammar rules.
▪ Some of them must include code segments with parsing errors.
▪ If you don't provide test programs for optional features, you can't get the additional points.
• Document all changes you made in the language specifications since the documentation you
provided for the scanner phase. Highlight these changes. Did you add, subtract or change
features of your language?
• Document your phase 3:
▪ The graded original versions of the phase 1 and 2 documentation should be submitted as
part of your phase 3 documentation.
▪ Specify in a table the CFG actually used by your parser.
➢ How does this CFG compare against your CFG from the language design phase?
➢ Did you remove left-recursion or left-factor your CFG? What are the results?
➢ Did you compute first and follow sets? What are the results?
➢ Why did you make these changes?
▪ Indicate the parsing method used: Recursive Descent, LL, LALR, or LR.
▪ Did you use a parsing table? If so, your documentation should refer to a spreadsheet that
stores your parsing table.
▪ Provide a system architecture diagram showing your modules and the interfaces between
modules. Include the modules in phase 2, too.
➢ Show the inputs to each module and where these inputs come from (user file or
which module).
➢ Show the outputs from each module and who (user or which module) consumes these
outputs.
➢ Explain the purpose of each module.
▪ Include a short sample source program and draw (using nodes and edges) the
corresponding parse or syntax tree generated by the parser as part of your
documentation.
▪ How do you handle errors discovered by the parser?
▪ What errors cannot be handled by your parser, if any?
• Demonstrate your parser to the teacher with your computer.
• Hand-in your printed documentation as described above.
▪ Show the source code of your parser routines and the input source code for parsing.
▪ The teacher may modify some of your input source code to test language features you
promised to implement in your language design.
• Have a back-up of all your source code.
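
The k-way tree that the parser returns and the tester prints could be represented as simply as in the Python sketch below. The `Node` class, the `render` function, and the indentation-based print format are assumptions for this example; any representation that allows an arbitrary number of children per node would do.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str                                      # grammar variable, operator, or lexeme
    children: list = field(default_factory=list)    # any number of child nodes (k-way)

def render(node, depth=0):
    """Return the tree as indented lines, one node per line, in pre-order."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines

# Hypothetical abstract syntax tree for an assignment such as: x := 1 + 2
tree = Node("assign", [Node("x"), Node("+", [Node("1"), Node("2")])])
```

Calling `print("\n".join(render(tree)))` prints each node on its own line, indented by its depth.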

2. Grading:
• Your grade in this phase will be based on the following:
▪ Delivery of parser requirements
▪ Quality of test input programs
▪ Actual parsing vs. expected correct parsing
▪ Correct resolution of ambiguities
▪ Error Handling
▪ Maintenance of symbol table, if any
▪ Working LR(1) and LALR(1) parsers will get significantly higher scores than working LL(1)
and recursive descent parsers applied to the same language.
▪ Parser optional features
▪ Presentation and ability to answer questions correctly
▪ the completeness, clarity and accuracy of documentation
▪ your timeliness,
▪ your peer evaluation
Phase IV. Write an interpreter and semantic type checker.
1. Requirements:
• Write at least two new methods or programs: a tester and an interpreter.
▪ The tester specifies the input file containing the source program to be interpreted.
▪ The tester calls the parser to generate a tree for the input source program.
▪ The tester prints the obtained tree.
▪ The tester calls the interpreter to type-check and execute the source program's tree.
▪ The interpreter should check the semantic type rules of the language.
▪ The interpreter should execute the statements in the tree.
▪ The interpreter should update the symbol table to include information such as data types
and data values.
▪ The interpreter should print informational and error messages.
▪ Optional features of your interpreter (You earn more points for doing some of these):
➢ Indicating line numbers of statements that have interpreter errors,
➢ Fixing type errors by automatic casting, insertion of declarations, providing default values,
etc.
➢ Providing meaningful error messages during interpretation of input source programs.
• Write test input source program(s) for interpretation.
▪ They must demonstrate the usage of ALL of your language features.
▪ Some of them must include code segments with interpreter / semantic checking errors.
▪ If you don't provide test programs for your language features, you don't get additional
points. Points may also be taken off from your previous phases.
• Document all changes you made in the language specifications since the documentation you
provided for the parser phase. Highlight these changes. Did you add, subtract or change
features of your language?
• Document your phase 4:
▪ The graded original versions of the phase 1, 2 and 3 documentation should be submitted as
part of your phase 4 documentation.
▪ Enumerate the semantic rules you actually checked in your interpreter. (Ex. Declaration
of an identifier and its type before use, assignment of a value whose type is incompatible
with the type of the identifier, etc.)
➢ How do these compare against the semantic rules you declared in the language
design phase?
▪ Enumerate run-time checking that your interpreter actually performs. (Ex. division by
zero, assignment of a value that is out of the range of the type of an identifier, etc.)
▪ Enumerate the language constructs you are able to interpret correctly, and those that you
have difficulties with. Explain these difficulties briefly.
▪ Does your interpreter implicitly or explicitly create an abstract syntax tree from a parse
tree? Or is the abstract syntax tree already provided by the parser?
▪ Provide a system architecture diagram showing your modules and the interfaces between
modules. Include the modules in phases 2 and 3, too.
➢ Show the inputs to each module and where these inputs come from (user file or
which module).
➢ Show the outputs from each module and who (user or which module) consumes these
outputs.
➢ Explain the purpose of each module.
▪ Include a short sample source program, draw (using nodes and edges) the corresponding
abstract syntax tree, indicate a sample input to this source program, and the
corresponding output.
▪ How do you handle errors discovered by the interpreter?
▪ What errors cannot be handled by your interpreter, if any?
• Demonstrate your interpreter to the teacher with your computer.
• Hand-in your printed documentation as described above.
▪ Show the source code of your interpreter routines and the input source code for interpretation.
▪ The teacher may modify some of your input source code to test language features you
promised to implement in your language design.
• Have a back-up of all your source code.
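
To make the split between semantic type checking and run-time checking concrete, here is a small Python sketch of an interpreter over a tuple-based abstract syntax tree. The node shapes (`("decl", type, name)`, `("assign", name, expr)`, `("add", l, r)`, `("div", l, r)`), the type names, and the `SemanticError` class are all assumptions for this example. Reading a declared but unassigned identifier is not handled here.

```python
class SemanticError(Exception):
    """Raised for both type errors and run-time errors in this sketch."""

def type_of(node, symbols):
    """Type-check an expression node; return 'int', 'bool', or 'string'."""
    kind = node[0]
    if kind in ("int", "bool", "string"):           # literal: ("int", 3)
        return kind
    if kind == "id":                                # identifier: ("id", "x")
        if node[1] not in symbols:
            raise SemanticError(f"{node[1]} used before declaration")
        return symbols[node[1]]["type"]
    if kind in ("add", "div"):
        if type_of(node[1], symbols) != "int" or type_of(node[2], symbols) != "int":
            raise SemanticError(f"operands of {kind} must be int")
        return "int"
    raise SemanticError(f"unknown node kind {kind}")

def evaluate(node, symbols):
    """Compute the value of an expression that already passed type_of."""
    kind = node[0]
    if kind in ("int", "bool", "string"):
        return node[1]
    if kind == "id":
        return symbols[node[1]]["value"]
    left, right = evaluate(node[1], symbols), evaluate(node[2], symbols)
    if kind == "add":
        return left + right
    if right == 0:                                  # run-time check for "div"
        raise SemanticError("division by zero")
    return left // right

def execute(statements, symbols):
    """Run declarations and assignments, updating types and values in the symbol table."""
    for stmt in statements:
        if stmt[0] == "decl":
            symbols[stmt[2]] = {"type": stmt[1], "value": None}
        elif stmt[0] == "assign":
            _, name, expr = stmt
            if name not in symbols:
                raise SemanticError(f"{name} used before declaration")
            if type_of(expr, symbols) != symbols[name]["type"]:
                raise SemanticError(f"type mismatch assigning to {name}")
            symbols[name]["value"] = evaluate(expr, symbols)
    return symbols
```

Here the type check runs before evaluation, so an incompatible assignment is rejected without executing anything, while division by zero can only be detected once operand values are known.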

2. Grading:
• Your grade in this phase will be based on the following:
▪ Delivery of interpreter requirements
▪ Quality of test input programs
▪ Checking for and catching type errors correctly
▪ Correct execution of test input source programs
▪ Error Handling
▪ Maintenance of symbol table, if any
▪ Interpreter / Semantic Type Checker optional features
▪ Presentation and ability to answer questions correctly
▪ the completeness, clarity and accuracy of documentation
▪ your timeliness,
▪ your peer evaluation
An Example Language:
Language Specifications :
not case sensitive
tokens and constants cannot span more than one line.

Character Set: Upper-case characters, digits, !@#$%^&*()_+-={}[]|:;<>?,./,space,tab,newline


Special characters: \ (newline within strings), ~ (concatenation operator)
All lower-case symbols (even in literal constants and input/output) are immediately converted to uppercase.

Whitespaces:
Comments:
// comment till the end of line
/* */ inline and multiline comment; ignore anything in between /* and */
Comments serve as delimiters of tokens; comments cannot be inside any single token
Blanks, Tabs, Line breaks also delimit tokens
Statements can continue for several lines
Noise word: ANG is skipped wherever it appears as a lexeme

identifiers: start with a letter; followed by letters and digits; delimited by non-letters/digits

Data Types:
boolean: TUTOT, with values OO and HINDI
    for comparison: OO > HINDI

integer: BILANG, covering positive, zero, and negative integers
    values: (optional sign + or -, + if omitted) 1 or more decimal digits; leading zeroes allowed but
    trimmed; 0 if all zeroes.

float: LUTANG, covering real numbers
    values: (optional sign + or -, + if omitted) 1 or more decimal digits with a decimal point anywhere in
    the sequence of digits.

strings: SALITA
    values: sequence of upper and lower case letters, digits, special characters
    !@#$%^&*()_+-={}[]|\:;<>?,./ and space, between a pair of double-quotes.
    escape sequences not allowed; double-quotes and back-quotes not allowed; backslash
    interpreted as newline character.
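
The constant formats above can be approximated with regular expressions, as in this Python sketch. The patterns and the function `classify_constant` are one possible reading of the rules, not the author's definition. In particular, the string pattern accepts any character other than a double-quote, back-quote or line break, which is looser than the listed character set.

```python
import re

# Assumed patterns derived from the descriptions above.
BILANG = re.compile(r"[+-]?[0-9]+")                         # optional sign, 1+ digits
LUTANG = re.compile(r"[+-]?(?:[0-9]+\.[0-9]*|\.[0-9]+)")    # digits with one decimal point
SALITA = re.compile(r'"[^"`\n]*"')                          # one-line text in double quotes

def classify_constant(lexeme):
    """Return (token_type, value) for a constant lexeme, or ("ERR", lexeme)."""
    if BILANG.fullmatch(lexeme):
        return ("CONSTBILANG", int(lexeme))                 # int() trims leading zeroes
    if LUTANG.fullmatch(lexeme):
        return ("CONSTLUTANG", float(lexeme))
    if SALITA.fullmatch(lexeme):
        # Strip the quotes, convert to uppercase, and read a backslash as a newline.
        return ("CONSTSALITA", lexeme[1:-1].upper().replace("\\", "\n"))
    return ("ERR", lexeme)
```

Note that the grammar later in this example treats unary plus and minus as separate tokens (UPLUS, UMINUS), so a real scanner for this language would have to decide whether a leading sign belongs to the constant or is an operator.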

Implicit Conversion/Coercion Rules :


No mixed-types computations except for the following:
- conversion to SALITA: for ISULAT and concatenation:
o TUTOT: OO, HINDI as "OO", "HINDI"
o BILANG/LUTANG: based on default unformatted toString method of Java
o SALITA: Convert Lowercase to Uppercase.
- assignment conversion from SALITA: for BASAHIN
o TUTOT: "OO" and "TRUE" to OO, "HINDI" and "FALSE" to HINDI
o BILANG: to integer equivalent
o LUTANG: to double (real number) equivalent
- With math operators, if BILANG is mixed with LUTANG, the result is LUTANG.
- Comparison:
o BILANG and LUTANG can be compared with each other but not with SALITA nor TUTOT.
o SALITA can be compared with SALITA but not with TUTOT, BILANG nor LUTANG.
o TUTOT can be compared with TUTOT but not with SALITA, BILANG nor LUTANG.
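
The conversion rules above might be implemented along the lines of this Python sketch. The function names are assumptions, Python's `bool`, `int`, `float` and `str` stand in for TUTOT, BILANG, LUTANG and SALITA, and `str()` is only a stand-in for Java's default `toString` formatting, which can differ for some floating-point values.

```python
NUMERIC = {"BILANG", "LUTANG"}

def to_salita(value):
    """Conversion to SALITA, as used by ISULAT and concatenation."""
    if isinstance(value, bool):             # test bool first: bool is a subclass of int
        return "OO" if value else "HINDI"
    if isinstance(value, str):
        return value.upper()                # lowercase is converted to uppercase
    return str(value)                       # stand-in for Java's toString

def from_salita(text, target_type):
    """Assignment conversion from SALITA, as used by BASAHIN."""
    text = text.upper()
    if target_type == "TUTOT":
        if text in ("OO", "TRUE"):
            return True
        if text in ("HINDI", "FALSE"):
            return False
        raise ValueError(f"cannot convert {text!r} to TUTOT")
    if target_type == "BILANG":
        return int(text)
    if target_type == "LUTANG":
        return float(text)
    return text                             # SALITA stays a string

def arithmetic_result_type(left, right):
    """BILANG mixed with LUTANG gives LUTANG; other mixes are not allowed."""
    if left not in NUMERIC or right not in NUMERIC:
        raise TypeError("math operators need BILANG or LUTANG operands")
    return "LUTANG" if "LUTANG" in (left, right) else "BILANG"

def comparable(left, right):
    """Numeric types compare with each other; SALITA and TUTOT only with themselves."""
    return left == right or (left in NUMERIC and right in NUMERIC)
```

For example, `from_salita("true", "TUTOT")` yields the boolean true value, and `comparable("SALITA", "TUTOT")` is false.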

Statements:
All statements begin with reserved words; no statement separators are used.
PROGRAMA id SIMULA statements TAPOS
Declarations: URI type-name comma-separated identifiers
All identifiers must be declared before use.
All identifiers are global regardless of block structure.
Names of identifiers must be unique and different from reserved words.
Assignments: ILAGAYSA-ANG (ILAGAYSA X ANG 3); ANG is optional
Conditional: KUNG-EHDI-IBA-KUNGFI (if then else endif; if then endif)
Loops: HABANG-GAWIN statement (while-do loop)
I/O: BASAHIN - reads from the keyboard a string ended by a newline character
(BASAHIN ANG X); ANG is optional
I/O: ISULAT - writes a string to the output display; no formatting except newline
(ISULAT "SAGOT=" ~ X)
Blocks are indicated by SIMULA and TAPOS.
No procedures, methods or functions.

Expressions:
Parenthesization allowed
Operators:
SALITA: ~ (concatenation), <, >, =, !=, <=, >= (lexicographic comparison)
BILANG: + (unary), - (unary), + (binary), - (binary), *, / (integer division), <, >, =, !=, <=, >=
LUTANG: + (unary), - (unary), + (binary), - (binary), *, / (real number division), <, >, =, !=, <=, >=
TUTOT: ATSAKA (and), OKAYA (or), DEHINS (not), <, >, =, !=, <=, >= (OO < HINDI)
Precedence level      Operations                                   Associativity
(highest to lowest)
1                     parenthesization                             L to R
2                     unary plus (+), unary minus (-)              R to L
3                     * and /                                      L to R
4                     binary plus (+), binary minus (-)            L to R
5                     comparison operators <, >, =, !=, <=, >=     L to R
6                     DEHINS                                       R to L
7                     ATSAKA                                       L to R
8                     OKAYA                                        L to R
9                     concatenation (~)                            L to R

Special characters:
\ - in String literals denote a newline

Tokens (Internal)
CONSTBILANG, CONSTLUTANG, CONSTSALITA,
RELOP (comparison operator),
MULT, DIV, ADD, SUBT, COMMA, UPLUS, UMINUS, LPAREN, RPAREN, DIKIT,
ID, EOF,
SKIP (comments, EOL, ANG),
ERR

Tokens (Reserved Words)


PROGRAMA (program)
SIMULA (begin) TAPOS (end)
URI (decl), BILANG (int), LUTANG (double), SALITA (String),
TUTOT (boolean), OO (true), HINDI (false)
KUNG (if) EHDI (then) IBA (else) KUNGFI (endif)
HABANG (while) GAWIN (do)
ILAGAYSA (let) ANG (be/the)
BASAHIN (read), ISULAT (write)
ATSAKA (and), OKAYA (or), DEHINS (not)
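
Since the language is not case sensitive and ANG is a noise word, checking reserved words by table could look like the Python sketch below. The function name `classify_word` and the choice to return the reserved word itself as its token type are assumptions for illustration.

```python
RESERVED_WORDS = {
    "PROGRAMA", "SIMULA", "TAPOS", "URI", "BILANG", "LUTANG", "SALITA", "TUTOT",
    "OO", "HINDI", "KUNG", "EHDI", "IBA", "KUNGFI", "HABANG", "GAWIN",
    "ILAGAYSA", "ANG", "BASAHIN", "ISULAT", "ATSAKA", "OKAYA", "DEHINS",
}
NOISE_WORDS = {"ANG"}

def classify_word(lexeme):
    """Classify a letter/digit lexeme as SKIP (noise), a reserved word, or ID."""
    word = lexeme.upper()               # lowercase is converted to uppercase first
    if word in NOISE_WORDS:
        return ("SKIP", word)
    if word in RESERVED_WORDS:
        return (word, word)
    return ("ID", word)
```

With this table, `classify_word("kung")` gives the KUNG token, `classify_word("Ang")` is skipped, and `classify_word("sagot1")` is an identifier.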
An Example Grammar:
s → pgm
pgm → PROGRAMA ID block
block → SIMULA stmt-list TAPOS
stmt-list → stmt | stmt stmt-list
stmt → decl | assign | cond | loop | read | write | block
decl → URI type vars
type → TUTOT | BILANG | LUTANG | SALITA
vars → ID | ID COMMA vars
assign → ILAGAYSA ID expr
    interpreter: If ID is SALITA, expr can be of any type; auto-convert expr to SALITA. If ID is
    LUTANG, expr BILANG is converted to LUTANG.
    semantics: Otherwise, ID and expr must be of the same type.
cond → KUNG exprt EHDI stmt else
else → KUNGFI | IBA stmt KUNGFI
loop → HABANG exprt GAWIN stmt
read → BASAHIN ID
    semantics: ID must have been previously declared in a URI statement.
    interpreter: ID will be assigned the input after conversion from SALITA to the type of ID
write → ISULAT expr
    interpreter: The value of expr will be converted to SALITA and printed
expr → expr DIKIT exprt | exprt
    interpreter: the operands of DIKIT will be converted to SALITA and concatenated
exprt → exprt OKAYA exprt1 | exprt1
    semantics: the operands of OKAYA must both be of type TUTOT
exprt1 → exprt1 ATSAKA exprt2 | exprt2
    semantics: the operands of ATSAKA must both be of type TUTOT
exprt2 → DEHINS exprt2 | exprt3
    semantics: the operand of DEHINS must be of type TUTOT
exprt3 → expr RELOP expr | exprn
    semantics: the operands of RELOP must be compatible types (SALITA; BILANG and
    LUTANG; TUTOT). Note that this does not drill down to exprn because of the desire to
    compare SALITAs. Difficulty in precedence level could cause shift-reduce or reduce-reduce
    conflicts. Ambiguity to be resolved by specified order of precedence in the parsing table. If
    not possible, the parse table entry will be overridden in special cases in the parsing program.
exprn → exprn ADD exprnterm | exprn SUBT exprnterm | exprnterm
    semantics: the operands of ADD and SUBT must be of type BILANG or LUTANG
exprnterm → exprnterm MULT exprnfactor | exprnterm DIV exprnfactor | exprnfactor
    semantics: the operands of MULT and DIV must be of type BILANG or LUTANG
exprnfactor → UPLUS exprnfactor | UMINUS exprnfactor
    | LPAREN expr RPAREN | unit
    semantics: the operands of UPLUS and UMINUS must be of type BILANG or LUTANG
unit → ID | CONSTSALITA | CONSTLUTANG | CONSTBILANG | OO | HINDI
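
Several rules in this grammar are left-recursive (for example exprn and exprnterm), which suits bottom-up parsers but has to be rewritten for a top-down parser. The Python sketch below applies the standard transformation for immediate left recursion, turning A → A α | β into A → β A' and A' → α A' | ε. The function name and the primed variable name (such as `exprn'`) are assumptions for this example, and indirect left recursion is not handled.

```python
def remove_immediate_left_recursion(head, alternatives):
    """Rewrite A -> A alpha | beta as A -> beta A', A' -> alpha A' | epsilon.

    Each alternative is a list of grammar symbols; [] stands for epsilon.
    Returns a dict mapping each variable to its list of alternatives."""
    recursive = [alt[1:] for alt in alternatives if alt and alt[0] == head]
    others = [alt for alt in alternatives if not alt or alt[0] != head]
    if not recursive:
        return {head: alternatives}         # nothing to rewrite
    tail = head + "'"                       # new helper variable, e.g. exprn'
    return {
        head: [alt + [tail] for alt in others],
        tail: [alpha + [tail] for alpha in recursive] + [[]],
    }

# The exprn rule from the grammar above.
exprn_rules = remove_immediate_left_recursion(
    "exprn",
    [["exprn", "ADD", "exprnterm"], ["exprn", "SUBT", "exprnterm"], ["exprnterm"]],
)
```

The result corresponds to exprn → exprnterm exprn' and exprn' → ADD exprnterm exprn' | SUBT exprnterm exprn' | ε. A parser built on the rewritten rules still has to rebuild left-associative trees, since the rewritten grammar nests to the right.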
