Dec. 27, 2016
Group Members:
- Xudong Wang (hsu [AT] whu.edu.cn): Language core, built-in functions, test cases
- Puxuan Yu (pxyuwhu [AT] gmail.com): GUI code editor
- Yi Ren
This project is for the course "Compiler Implementation Practice" at International School of Software, Wuhan University.
- The project was implemented in C++11. No 3rd party libraries are used in the interpreter core;
- It is a cross-platform interpreter. We have tested it under macOS, Linux, and Windows;
- No parser generators (e.g. Bison, Antlr) were used. The lexer and parser were purely hand-written for more flexibility and more user-friendly error message;
- A GUI code editor was provided to let users edit and run CMM scripts more conveniently.
Modules \ OS | Windows | Linux | macOS |
---|---|---|---|
CMM Language core | âś… | âś… | âś… |
math library | âś… | âś… | âś… |
multiprocessing | ❌ | ✅ | ✅ |
Ncurses library | ❌ | ✅ | ✅ |
GUI code editor | Not tested | Not tested | âś… |
To show you that CMM is a somewhat useful language that can create relatively non-trivial pieces of code, we implemented a console snake game in purely CMM.
This also serves as a good example for you to get familiar with the CMM grammar quickly.
The language we implemented, named CMM, has a grammar similar to C family languages, but its internal mechanism is close to Python. Some features from Haskell, Scala, and Lisp are also borrowed.
The grammar of CMM is not exact identical to what is required in the task specification. All required syntax are supported by our implementation, in the meanwhile, many new features are supported. The detailed grammar can be checkout out at section CMM Grammar BNF.
The basic syntax of CMM is very similar to C. Here are some points worthy to know:
- Empty statements are allowed (namely a single semicolon as statement). A warning message will be generated upon empty statements;
- Like C, the expressions in
for
loops may be omitted. E.g., a common way to write a infinite loop isfor (;;) {}
- Boolean logic operators are short-circuited. Consider the following snippet:
bool foo() { println("hello"); return true; }
bool bar() { println("world"); return false; }
if (foo() || bar)
;
The output will be "hello".
There are 4 primitive types in CMM: int
, double
, bool
, string
.
The declaration of arrays are similar to C. The following statement
Type Identifier [Expr1][Expr2]...[ExprN]
declares an N-dimensional array of type T with name Identifier. Its valid range of the ith index is [0, Expri - 1].
Note: Different from the required of the assignment and the rule in C89, the declaration of arrays do not require size expression to be constant.
Implicit Conversion Rules:
- When operand of a operator are
int
anddouble
, the integer will be promoted todouble
; - If any operator of
+
is a string, then another operand will be converted to string; - Arrays of type T are considered to be a subtype of T. That is to say, a variable of type T can store array of that type;
- All operands of logical operators will be converted to boolean values. E.g., numeric zeros
and empty strings are
false
, otherwisetrue
.
main
function are optional in CMM. If the programmer defined such a function, then it will
be invoked after all top-level statements and definitions executed.
The parameter list of main
can be empty, or it can take a string
as input.
The main
function may return arbitrary type (although int
and void
are recommended).
returned value will be converted to integers and be will returned to the operating system.
As a convention, a zero value indicates successful execution. Non-zero values means failure.
Here's a simple example of a valid definition of the main function:
int main(string args)
{
if (len(args) < 1)
return -1;
println("Hello " + args[0]);
return 0;
}
In this example, the args
of is a string array (remember that T arrays are subtype of T).
The length of command line arguments can be calculated by function len
.
Say we have a file MainArgs.cmm
with following contents:
void main(string args) {
int i;
for (i = 0; i < len(args); i = i + 1)
println(i, args[i]);
}
By enter command cmm MainArgs.cmm how are you
, the output on the screen will be
0 how
1 are
2 you
In many languages like Scala and Ruby, the value of last statement or expression that was executed in a function will be the default return value of it. In CMM the we implemented the same feature.
E.g., a function that returns the maximum value can be written as:
int max(int a, int b) {
if (a>b) a; else b;
}
Or even more concise:
int max(int a, int b) if (a>b) a; else b;
Haskell allow its user to create new operators. In CMM we have a similar feature.
Unlike the overloading of operators in C++, which allows users to define the behaviours of operators on specific type, we let users use new symbols to create infix (binary) operators and define the precedence. The definition of a custom operator is:
infix [precedence] left-hand-side-operand operator-symbol right-hand-side block
the precedence is optional and default to be 12. The operator symbol may begin with
a character in \@#$:`?
, and followed by arbitrary number of special characters.
Here's an example that defines an operator :+: which sums the natural numbers up in range [LHS, RHS]:
infix low:+:high {
int sum = 0, i;
for (i = low; i <= hight; i = i+1) {
sum = sum + i;
}
sum;
}
Then expression 1 :+: 100
will be evaluated to 5050.
When the block itself is an expression, we can express it in a more natural way:
infix [precedence] = ;
E.g., define a power operator with precedence 12: infix 12 x`^y = pow(x, y);
Dynamic Binding is a feature similar to that in Emacs-Lisp. Firstly, there're two concept that should be known before we go into detail: Lexical binding and Dynamic binding.
- When you call a function with lexical binding, free variables in that function will be looked up in the environment of its definition. It is also called "static binding" because all bindings can be decided at compile time.
- Dynamic binding means free variables in functions will be looked up for definition from the environment where the function is invoked.
In CMM, functions called are by default static, which is also the choice of most programming languages. If you want to force the interpreter to do a dynamic binding, you can add a "!" between the function name and left parenthesis at call point.
An example:
int var = 1234;
void foo() { println(var); }
void main() {
string var = "Hello";
foo(); // Lexical binding
foo!(); // Dynamic binding
}
The output would be: 1234 Hello
CMM do garbage collection by the reference counting algorithm.
Out implementation is pretty straightforward: all new objects are created by
std::make_shared
and save them with smart pointer std::shared_ptr<T>
in C++11, which
automatically manages the reference count and delete the object when ref-count decreases to 0.
It is publicly known that there's a problem with reference counting algorithm: When objects reference form a cycle, and the cycle cannot be used directly or indirectly from top level, then they are leaked forever. We did not solve the problem and suggest avoid it in CMM.
A cycle reference example:
{
int A[3]; // new array object referenced by A
A[0] = A; // new array object referenced by A and its first element
}
/*
* Now A is not in scope, but the array is still referenced by
* its first element. We lost it forever.
*/
CMM implemented two simple optimization.
Constant folding evaluate the value of constant expressions at compile time as much as possible. We implemented a simple folding algorithm which, during the creation of abstract syntax tree.
In CMM, statement print(1 + 2 * 3)
will not be converted to a complex AST.
Instead, its AST will be equivalent to the one of print(7)
.
Dead Code Elimination removes blocks in program which are unreachable.
In our implementation we did that for the most obvious dead code. This snippet
if ("hora" && 0.0)
foo();
else
bar();
will be replaced by a simple bar()
invocation.
Whether a language is expressive or not is largely related to whether it has sufficient library. Users of CMM language can easily add system calls/library function calls, as follows:
- Add a line
ADD_FUNCTION(xxx)
in NativeFunctions.h, which is macro that expands to the function prototype; Write a wrapper function in NativeFunctions.cpp which wraps the library function: It takes as input an array ofcvm::BaiscValue
and returns acvm::BasicValue
; - Register this function in NativeFunctionMap (in Interpreter.cpp)
Display a line number at the left side and high light current line.
The reserved words in CMM are:
if else for while do break continue return int double
bool void string infix
Note: do
is not yet used.
If files are edited and not saved. When existing, a dialog will pop up to confirm.
The default is the Save option, and the newly created file is the Save As option.
If the current editing file is not saved, you are prompted to save it. On clicking the run button, terminal will started to compile and run.
Compilation errors and warnings are presented in the list. The list contains incorrect rows, columns, and error messages. Double-clicking on a line automatically jumps to the wrong location in the source code.
Help message:
Tokens Dump:
AST Dump:
Console Snake game written in CMM!
A glance at the program written in CMM
Operator | Precedence |
---|---|
= (assignment) | 1 |
|| (logical or) | 2 |
&& (logical and) | 3 |
| (bitwise or) | 4 |
^ (bitwise exclusive or) | 5 |
& (bitwise and) | 6 |
== (equal to) != (not equal to) | 7 |
< (less than) <= (less or equal than) > (greater than) >= (greater or equal than) | 8 |
<< (bitwise leftshift) >> (bitwise rightshift) | 9 |
+ (addition) - (subtraction) | 10 |
* (multiplication) / (division) % (modulo) | 11 |
The default precedence of user-defined binary operators | 12 |
Those functions are always available in CMM regardless of the operating system:
typeof
len
strlen
print
println (alias of puts)
system
random
rand
srand
time
exit
toint
todouble
tostring (alias of str)
tobool
read
readln
readint
sqrt
pow
exp
log
log10
The following functions are only available under Linux and macOS:
UnixFork
NcEndWin
NcInitScr
NcNoEcho
NcCursSet
NcKeypad
NcTimeout
NcGetCh
NcMvAddCh
NcMvAddStr
NcGetMaxY
NcGetMaxX
NcStartColor
NcInitPair
NcAttrOn
NcAttrOff
NcColorPair
Program ::= TopLevel*
TopLevel ::= infixOperatorDefinition
TopLevel ::= functionDeclaration
TopLevel ::= DeclarationStatement
TopLevel ::= Statement
infixOpDefinition ::= Kw_infix [Integer] Id infixOp Id Statement
infixOpDefinition ::= Kw_infix [Integer] Id infixOp Id ["="] ExprStatement
functionDefinition ::= typeSpecifier identifier _functionDefinition
functionDefinition ::= typeSpecifier identifier _functionDefinition
_functionDefinition ::= "(" ")" Statement
_functionDefinition ::= "(" parameterList ")" Statement
parameterList ::= "void"
parameterList ::= TypeSpecifier Identifier ("," TypeSpecifier Identifier)*
block ::= "{" statement* "}"
typeSpecifier ::= "bool" | "int" | "double" | "void" | "string"
OptionalArgList ::= epsilon
OptionalArgList ::= argumentList
argumentList ::= Expression ("," Expression)*
EmptyStatement ::= ";"
Statement ::= Block
Statement ::= IfStatement
Statement ::= WhileStatement
Statement ::= ForStatement
Statement ::= ReturnStatement
Statement ::= BreakStatement
Statement ::= ContinueStatement
Statement ::= EmptyStatement
Statement ::= DeclarationStatement
Statement ::= ExprStatement
expression ::= primaryExpr BinOpRHS*
parenExpr ::= "(" expression ")"
primaryExpr ::= parenExpr
primaryExpr ::= identifierExpr
primaryExpr ::= identifierExpr ("[" Expression "]")+
primaryExpr ::= constantExpr
primaryExpr ::= ("~" | "+" | "-" | "!") primaryExpr
identifierExpression ::= identifier
identifierExpression ::= identifier "(" optionalArgList ")"
constantExpr ::= IntExpression
constantExpr ::= DoubleExpression
constantExpr ::= BoolExpression
constantExpr ::= StringExpression
ifStatement ::= "if" "(" Expr ")" Statement
ifStatement ::= "if" "(" Expr ")" Statement "else" Statement
forStatement ::= "for" "(" Expr ";" Expr ";" Expr ")" Statement
whileStatement ::= "while" "(" Expression ")" Statement
exprStatement ::= Expression ";"
returnStatement ::= "return" ";"
returnStatement ::= "return" Expression ";"
breakStatement ::= "break" ";"
continueStatement ::= "continue" ";"
DeclarationStatement ::= TypeSpecifier _DeclarationStatement
_DeclarationStatement ::= SingleDeclaration+
SingleDeclaration ::= identifier "=" Expression
SingleDeclaration ::= identifier ("[" Expression "]")+