[High Integrity Compilation] Susan Stepney.
High Integrity Compilation : a case study .

Prentice-Hall, 1993.

2. Specifying a Language---by Example

2.1 Introduction

This chapter briefly explains the steps involved in the denotational specification of a programming language, using a trivial example language, Turandot (' T iny, U nfinished, R estricted, and O verly T erse'). This illustrative example is by no means complete; the explanation is intended solely to give an overview of the process of specifying and proving a compiler, in order to motivate the larger complete specifications and proofs in the later chapters.

The denotational semantics definition of a sizable programming language requires the use of a branch of mathematics known as domain theory . However, for the simple languages described below, the extra capability, and consequent complexity, provided by domains is not necessary. Set theory provides a sufficient basis for the specifications, and, since it is conceptually simpler, its use clarifies some of the later discussions.

It is necessary to use some particular notation to write a denotational semantics specification. Many programming language specifications introduce their notation for domains in a somewhat ad hoc manner; using the simpler set-based approach has the advantage that the well-defined formal specification language Z can be used as the specification notation. Section 2.6 describes a small liberty taken with the syntax of Z that helps improve the clarity of programming language specifications.

Note: in this chapter some of the syntactic categories are decorated with numerical subscripts. These decorations are purely a technical device to distinguish the tutorial definitions from each other and from the later 'true' definitions of Tosca and Aida; they have no further significance.

2.2 Syntax

The first step in specifying a language is a specification of its syntax . Such a specification gives rules for well-formed 'sentences' in the language. An example of a badly formed, or syntactically incorrect, English sentence is Douglas Hofstadter's example; This sentence no verb.

Traditionally, a programming language's syntax is defined concretely, in terms of well-formed strings of characters. A program text consists of such a string, which first has to be lexically analyzed into a sequence of tokens (keywords, identifiers, and so on), then these tokens have to be parsed into a tree structure or abstract syntax . However, lexing and parsingare solved problems, discussed at great length in many classic texts, and so are not addressed yet again here. The specification of the semantics is clearer if it is given directly in terms of the abstract syntax, rather than in concrete terms of strings, which can become cluttered with disambiguation mechanisms like parentheses, keywords, and operator precedence rules. This sort of separation of concerns also allows the concrete syntax to be changed without needing to change the abstract syntax, the semantics definitions or the correctness proofs.

2.2.1 Structure of a syntax specification

Syntax can be defined using three classes of construct: compounds, lists and selection.

A compound has a fixed number of components, usually of different types. Compounds can be modelled abstractly in Z using a Cartesian product of the components. For example, a choice command consisting of a test expression, and two branch commands, and an assignment command consisting of an identifier and an expression, can be defined as

   Choice == EXPR0 x CMD0 x CMD0
   Assign == NAME0 x EXPR0

The symbol ' == ' is Z's abbreviation definition , a way of providing a meaningful name for a more complicated expression. The name can be used to stand for the expression anywhere in the specification.

A list has an arbitrary number of components of the same type. Lists are modelled in Z using a sequence. For example, a list of commands, and a list of declarations, can be defined as

   CMDLIST0 == seq1 CMD0
   DECLLIST0 == seq DECL0

This says that a command list is a non-empty sequence of commands, and a declaration list is a possibly empty sequence of declarations. A block command, consisting of a list of declarations and commands, can be defined using a compound whose components are lists:

   Block == DECLLIST0 x CMDLIST0

A selection comprises a set of possible constructs of a particular type. Selections can be modelled in Z using its free type(disjoint union) definition. For example, defining a command to be a choice or an assignment or a block:

   CMD0 ::= choice0 << Choice >>
         |  assign0 << Assign >>
         |  block0 << Block >>

The free type notation is explained in Appendix C.

A syntax specification can be shortened by including the compounds and lists directly in the selections. The partial example above can be written more succinctly as

   CMD0 ::= choice0 << EXPR0 x CMD0 x CMD0 >>
         |  assign0 << NAME0 x EXPR0 >>
         |  block0 << seq DECL0 x seq1 CMD0 >>

2.2.2 Turandot's abstract syntax

Turandot has only three types of construct: binary operators, expressions, and commands.

   OP1 ::= plus
        |  lessThan
        |  equalTo
        |  ...

This is a typical non-recursive free type definition, and can be thought of as a simple 'enumerated type', listing all Turandot's binary operators, which include comparison and arithmetic operators.

Turandot's expression syntax is

   EXPR1 ::= number << Z >>
          |  variable << NAME1 >>
          |  negate << EXPR1 >>
          |  operation << EXPR1 x OP1 x EXPR1 >>

This is a typical recursive free type definition. It has two base cases (an expression can be an integer or a variable name) and two recursive cases (a new expression can be formed from another by negating it, or from two others by combining them with a binary operator). Note that all values in Turandot are just numbers; the value 1 is also used to indicate 'true' and 0 to indicate 'false'.

Turandot's command syntax is

   CMD1 ::= skip 
         |  assign << NAME1 x EXPR1 >>
         |  choice << EXPR1 x CMD1 x CMD1 >>
         |  compose << CMD1 x CMD1 >>

Again, this is a recursive definition. The base cases are the skip command, and the assignment command, which assigns the value of an expression to a variable. The others build new commands from smaller commands: choice chooses between two commands based on the value of expression, and compose composes two commands.

2.2.3 Turandot's concrete syntax

It is quite possible to specify a variety of concrete syntaxes from a single abstract syntax. For example, the concrete choice command can easily be defined as any one of the following:

Using the abstract syntax when specifying the semantics literally abstracts away from these unimportant typographical details. Unimportant for specifying the semantics, that is. Concrete syntax is important for making a particular language usable, and so should be chosen with care.

A concrete syntax can be specified by mapping each construct in the abstract syntax to a string of characters. For example, a concrete syntax for Turandot's commands can be specified by

   |   CMD<[_]> : CMD1 --> String
   |----------
   |   A x : NAME1; e : EXPR1; c1, c2 : CMD1 @
   |
   |      CMD<[ skip ]> = "skip"
   |
   |      /\ CMD<[ assign(x, e) ]> =
   |         NAME<[ x ]> ^ ":=" ^ EXPR<[ e ]>
   |
   |      /\ CMD<[ choice(e, c1, c2) ]> =
   |         "if" ^ EXPR<[ e ]> ^ "then" ^ CMD<[ c1 ]>
   |         ^ "else" ^ CMD<[ c2 ]> ^ "endif"
   |
   |      /\ CMD<[ compose(c1, c2) ]> =
   |         CMD<[ c2 ]> ^ ";" ^ CMD<[ c1 ]>

This specifies a function called CMD (it assumes that functions called NAME and EXPR are also specified) using a Z axiomatic definition. The part above the horizontal line declares the signature of the function: CMD maps elements of the syntactic category CMD 1 to strings. The parts below the line provide the definition of the function, by structural induction over the structure of the language syntax. There is one term for each sort of command in the abstract syntax, and the definition is recursive in terms of the translation of subcomponents of a particular command. The infix operator ' ^ ' is Z's concatenation operator for joining sequences.

Aside---a more thorough definition of concrete syntax would be broken down into steps: defining a mapping from abstract syntax to sequences of tokens, defining the character strings used to represent tokens, and defining how tokens may be separated by white space (spaces, tabs and newlines) and comment strings. The simpler treatment given here is adequate for my purpose: defining a concrete syntax in which to write example Tosca programs.

2.3 Semantics

In the specification of Turandot's syntax, the 'meaning' of each construct is given informally in natural language, if at all. It relies heavily on our intuitive understanding of phrases such as ''assigns the value of an expression to a variable''. Problems can arise if the intuition of language designer, implementor and user differ, or if areas of ambiguity are resolved differently. What if the expression refers to some variable that has not been assigned a value? Should some default value be assumed, and if so, what value, or should this be an error? In a typed language (like Tosca, but unlike Turandot), what if the expression and variable have a different type? Should the expression's type be silently coerced to that of the variable, or should this be an error?

The denotational approach to providing a mathematical specification is to define meaning functions , which map syntactic constructs to what they denote: mathematical values. Appeals to intuition and common understanding can be replaced by mathematical manipulations of these formal definitions.

2.3.1 A mathematical model of Turandot

Different kinds of mathematical model are appropriate for different classes of programming languages. A statement in a declarative language reports a fact about the world, for example, ''the cat is on the mat''. Reporting a fact does not change the state of the world. A statement in an imperative language, on the other hand, can change the state of the world, for example, ''I name this ship the Blaise Pascal ''. An appropriate model for an imperative language (the kind considered in this book) is a state, and a set of state transition functions.

An appropriate state for Turandot is a mapping from variables' names to their current values:

   State1 == NAME1 -|-> Z

State 1 is the set of all functions that map names to integers. A particular function s 1 of type State 1 can be declared as s 1 : State 1 .

Z models functions as sets of pairs; for state functions of the type given above, the first element of each pair is a name, the second is an integer. A particular state s 1 , which has two variables, x with the value 3, and y with the value 9, is

   s1 = { x |-> 3, y |-> 9 }

The maplet notation x |-> 3 , an alternative form of the more conventional notation for a pair ( x ,3) , highlights the 'mapping' nature of the term.

In Z, functions are partial : a state need map only some names to values, not all of them. The domain of a state is the set of all those names that have values defined, so dom s 1 = { x , y } . The range is the set of all those values that are mapped to, so ran s 1 = {1,9} . If some function was required to be total , to map every name to a value, it would be declared using an undecorated arrow:

   |   st : NAME1 --> Z

In this case, dom s t = NAME 1 .

Because a state is a function, applying it to a variable yields the value of that variable: s 1 x = 3 .

The simplest state change is to update the value of a single variable. The Z way to do this is by overriding the value, using the ' (+) ' operator: for values in the domain of s b the function s a (+) s b agrees with s b , and elsewhere it agrees with s a .

Thus, s (+) { a |-> v } produces a new state. If a was in the old state, its value in the new state is changed to v ; if it was not, it is added to the new state. So

   s1 (+) { x |-> 1 } = { x |-> 1, y |-> 9 }

   s1 (+) { z |-> 1 } = { x |-> 3, y |-> 9, z |-> 1 }

In Turandot, each binary operator denotes the corresponding mathematical operator, an expression denotes a value in the context of a state, and a command denotes a state transition function. (Tosca requires a slightly more elaborate model, with an environment as well as a state, as discussed in Chapter 7.)

2.3.2 The meaning of Turandot's binary operators

Let's take Turandot's operators first. Each operator denotes a corresponding mathematical operator. A mathematical binary operator is a function that maps two numbers, its arguments, to another number, the result. Hence it has the type Z x Z -|-> Z . The operator meaning function that maps syntactic operators onto the corresponding mathematical function is specified as

   |   DOP[|_|] : OP1 --> Z x Z -|-> Z
   |----------
   |   A x, y : Z @
   |
   |      DOP[| plus |](x, y) = x + y
   |
   |      /\ DOP[| lessThan |](x, y) = if x < y then 1 else 0

The function is specified for all Turandot's binary operators by specifying it for each branch of the OP 1 free type definition. Remember that Turandot uses the value 1 for 'true' and 0 for 'false'.

2.3.3 The meaning of Turandot's expressions

An expression denotes a value; what value is denoted can depend on the values of the variables in the current state. The expression meaning function maps an expression to the value it denotes in the context of a state

   |   DEXPR[|_|] : EXPR1 --> State1 -|-> Z
   |----------
   |   A z: Z; x: NAME1; e, e1, e2: EXPR1; w: OP1; s: State1 @
   |
   |      DEXPR[| number z |] s = z
   |
   |      /\ DEXPR[| variable x |] s = s
x
   |
   |      /\ DEXPR[| negate e |] s = - (DEXPR<[ e ]> s)
   |
   |      /\ DEXPR[| operation(e1, w, e2) |] s =
   |         DOP[| w |] (DEXPR[| e1 |] s, DEXPR[| e2 |] s)

The semantic definition follows the recursive structure of the EXPR 1 abstract syntax. The meaning of an expression consisting of a number is just that number, irrespective of the state. The meaning of an expression consisting of a variable name is the value that name denotes in the current state, found by applying the state function s to the name. (A variable that has not previously been assigned a value is not in the domain of the state function, and the result of applying the state function is undefined. A resolution of this problem is discussed below, in section 2.4 .) The meaning of a negate expression is the mathematical negation of the meaning of the subexpression. The meaning of an operation expression is found by using the meaning of the operator to combine the meanings of the subexpressions.

2.3.4 The meaning of Turandot's commands

A command denotes a state transition. Hence, the command meaning function maps a command to a state transition function:

   |   DCMD[|_|] : CMD1 --> State1 -|-> State1
   |----------
   |   A x: NAME1; e: EXPR1; c1, c2: CMD1; s: State1  @
   |
   |      DCMD[| skip |] s = s
   |
   |      /\ DCMD[| assign(x, e) |] s =
   |         s (+) { x |-> DEXPR[| e |] s }
   |
   |      /\ DCMD[| choice(e, c1, c2) |] s =
   |         if DEXPR[| e |] s = 1
   |         then DCMD[| c1 |] s else DCMD[| c2 |] s
   |
   |      /\ DCMD[| compose(c1, c2) |] s =
   |         (DCMD[| c2 |] o DCMD[| c1 |]) s

The semantic definition follows the recursive structure of the CMD 1 abstract syntax. skip is the identity function; it leaves the state unchanged. assign changes the value of one component of the state; the new state maps the relevant name to its new value. choice chooses between the two commands based on the value of the expression (remember that the value 1 is used to indicate 'true'). Composing commands composes the state transition functions that they denote.

Notice that the words in the paragraph accompanying the mathematical specification of the semantics are very similar to the words that accompanied the syntax specification. But now they have a mathematical meaning, too. This mathematical meaning can be referred to in case of ambiguity, and can be formally manipulated if necessary, to discover the precise meaning of complex constructs.

2.4 Static semantics

Some sentences in a language can be syntactically correct, but meaningless. For example, in English, Chomsky's famous sentence Colourless green ideas sleep furiously is syntactically well-formed, but does not mean anything. In many programming languages, a statement like x := 1 + true might be parsed according to the syntax rules, but might have no meaning, because numbers cannot be added to booleans. In the denotational approach, rules for meaningless constructs can be formalized by providing various static semantics. These look similar to the specifications above, but instead of saying an expression means something like '4', say an expression means 'badly typed'.

We can do this because the interpretation of the meaning of a piece of syntax is up to the specifier, and so we can choose alternative meaning functions for our different purposes. The meaning defined above specifies Turandot's dynamic semantics, the meaning we would expect to be associated with the executing program. But we can associate other, non-standard, meanings, too. For example, we can choose the state to be those names that have been assigned a value, ignoring what that value might be. Call this the s-state:

   SState1 == P NAME1

P is Z's power set constructor, so any state of type SState 1 has the type 'set of NAME 1 '.

Let's define a flag to determine whether or not variables are being used before they are assigned a value:

   SET1 ::= yes | no

Then we can specify an alternative meaning function for expressions, one that maps them to yes or no , to indicate whether the variables to which they refer have been properly assigned a value or not:

   |   SEXPR[|_|] : EXPR1 --> SState1 -|-> SET1
   |----------
   |   A z: Z; x: NAME1; e, e1, e2: EXPR1; w: OP1; s: SState1 @
   |
   |      SEXPR[| number z |] s = yes
   |
   |      /\ SEXPR[| variable x |] s = if x e s then yes else no
   |
   |      /\ SEXPR[| negate e |] s = SEXPR[| e |] s
   |
   |      /\ SEXPR[| operation(e1, w, e2) |] s =
   |         if SEXPR[| e1 |] s = yes /\ SEXPR[| e2 |] s = yes
   |         then yes else no

This defines a different meaning for Turandot's expressions: let's call it the s-meaning. So the s-meaning of an expression consisting of a number is always yes , irrespective of the s-state. The s-meaning of an expression consisting of a variable name is yes if and only if that name is in the current s-state (as we will see in the definition of SCMD[|_|] , the assignment statement adds names to the state). The s-meaning of a negate expression is the same as the s-meaning of the subexpression. The s-meaning of an operation expression is yes if and only if the s-meanings of both subexpressions are yes .

The s-meaning of a command is an s-state transition:

   |   SCMD[|_|] : CMD1 --> SState1 -|-> SState1
   |----------
   |   A x: NAME1; e: EXPR1; c1, c2: CMD1; s: SState1 @
   |
   |      SCMD[| skip |] s = s
   |
   |      /\ SCMD[| assign(x, e) |] s = s u {x}
   |
   |      /\ SCMD[| choice(e, c1, c2) |] s =
   |         SCMD[| c1 |] s n SCMD[| c2 |] s
   |
   |      /\ SCMD[| compose(c1, c2) |] s =
   |         (SCMD[| c2 |] o SCMD[| c1 |]) s

Notice that the only time a new name is added to the environment is by the assign meaning function. The choice s-meaning takes the intersection of the states for each branch: a new variable must be assigned a value in both branches before it is considered to be assigned by the whole command.

Such a specification can be used to provide a static check of whether variables are referenced before they are assigned a value. It is called static, because it can be done without actually executing the program (for example, at compile time). Hence such a non-standard interpretation of the meaning of a program is called static semantics . Various alternative meanings can be defined to provide various different static semantics. This is described in much more detail in Chapter 8. Tosca has three different static semantics in addition to its dynamic semantics.

2.5 Operational semantics

The target language itself can be specified in the same manner. For example, the target language's syntax may include instructions such as

   INSTR1 ::= store << NAME1 >>
           |  jump << LABEL1 >>
           |  goto << LABEL1 >>
           |  label << LABEL1 >>

The informal meanings of these instructions are as follows: store stores the value currently in the accumulator (a special location) in the location corresponding to a name; jump jumps to its label if the value in the accumulator is 'false', otherwise it does nothing; goto jumps unconditionally to its label. Other instructions are also needed, to load values and to combine and compare values. These meanings need to be defined formally, too, by specifying some appropriate meaning function INSTR[|_|] , that maps syntactic instructions to mathematical values. These further definitions are omitted here for brevity. See Chapter 10 for the specification of Aida, the 'real' target language.

The operational semantics (that is, the specification of the translations from source to target language) is defined in a similar manner. However, the 'operational meaning' function is a purely syntactic definition: it maps source language constructs not to mathematical values, but to sequences of target language instructions.

For example, the definition of the translation of Turandot's commands into the target language might look something like

   |   OCMD<[_]> : CMD1 --> seq INSTR1
   |----------
   |   A x: NAME1; e: EXPR1; c, c1, c2: CMD1 @
   |
   |      OCMD<[ skip ]> = < >
   |
   |      /\ OCMD<[ assign(x, e) ]> =
   |         OEXPR<[ e ]> ^ < store x >
   |
   |      /\ OCMD<[ choice(e, c1, c2) ]> =
   |         OEXPR<[ e ]> ^ < jump l1 >
   |         ^ OCMD<[ c1 ]> ^ < goto l2, label l1 >
   |         ^ OCMD<[ c2 ]> ^ < label l2 >
   |
   |      /\ OCMD<[ compose(c1, c2) ]> =
   |         OCMD<[ c1 ]> ^ OCMD<[ c2 ]>

skip does nothing, so translates to an empty sequence of instructions. The expression in an assign is translated to the appropriate sequence of instructions (the definition of OEXPR[|_|] is not given), and a store instruction is concatenated, in order to store the result of the expression at the relevant location. For a choice command, the expression is translated, then a jump on false instruction is added to jump to the false branch. This is followed by a translation of the true branch, and an unconditional goto , to jump over the false branch to the end. Then we get the false branch label, and the translation of the false branch itself. Finally, the end label is added. The instruction list for a compose command simply consists of the concatenation of the instruction lists of the composed commands.

This example has been simplified for brevity. The real definition is more involved: it needs to maintain a 'translation environment' to hold the mapping from variables to their storage locations, and to note what labels have been allocated. This is described in detail in Chapter 11. The overview here is merely intended to give a flavour, to motivate future definitions.

This translation to sequences of instructions is purely syntactic. However, given the denotational specification of the target language semantics, as a specification of a meaning function INSTR[|_|] , it is possible to calculate the meaning of this translated sequence of instructions. The proof of correctness becomes a proof that the meaning of a Turandot construct is the same as the meaning of the sequence of instructions that results from translating it:

   c: CMD1  |-  DCMD[| c |] = INSTR[| OCMD<[ c ]> |]

The Tosca compiler specification is proved correct in Chapter 13.

2.6 A liberty with Z

As illustrated above, Z is used to specify the various languages and their semantics. However, in order to minimize clutter in the definitions, a small liberty is taken with its syntax when defining Tosca and Aida below.

The definitions of the semantics functions are quantified over all the variables appearing on the left-hand side of the equation. For example, the meaning of Turandot's commands is defined for all possible values of names, expressions, commands, and state variables:

   |   DCMD<[_]> : CMD1 --> State1 -|-> State1
   |----------
   |   A x: NAME_1 ; e: EXPR_1 ; c1, c2: CMD_1 ; s: State_1  @
   |
   |      DCMD<[ skip ]> s = s
   |
   |      /\ DCMD<[ assign(x, e) ]> s = ...
   |
   |      /\ DCMD<[ choice(e, c1, c2) ]> s = ...

The continual occurrence of such quantifications tends to clutter the specification. So this is abbreviated, by omitting the declarations of all the arguments of the meaning functions (but not of any other functions), whose types can easily be deduced. You should assume an implicit quantification over all these arguments when reading a definition such as

   |   DCMD<[_]> : CMD1 --> State1 -|-> State1
   |----------
   |   DCMD<[ skip ]> s = s
   |
   |   DCMD<[ assign(x, e) ]> s = ...
   |
   |   DCMD<[ choice(e, c1, c2) ]> s = ...

To reduce any confusion this style of abbreviation might cause, the same symbols are consistently used to refer to things of the same type: this usage is summarized in Appendix D.

2.7 Distinguishing syntax from semantics

There is potential for confusion in distinguishing syntactic and semantic values, because some syntax is needed in order to write down the semantic value! For example, assume the value meaning function for Turandot values has been declared as

   |   DVAL[|_|] : VALUE1 --> Z

Then the meaning of a particular value might look something like DVAL[| 42 |] = 42 . At first sight, this looks rather odd. However, the explanation is straightforward. The digit sequence appearing on the left, inside the special double square brackets, is merely syntax (the value consisting of the character '4' followed by the character '2') whereas the one on the right is the corresponding mathematical number (forty two). The syntax can take a variety of forms, for example, XLII in Roman numerals, or 101010 in binary, but the mathematical value is the same in each case. The confusion occurs because the concrete syntaxes chosen to write programming language syntax and to write mathematical values can have strings in common.

It is possible to map syntactic numerals to mathematical numbers by defining a mapping from character strings to integers, String -|-> Z , but this is not very illuminating, since the integers tend themselves to be written as character strings, as above. So the syntactic and semantic domains corresponding to VALUE s, or to NAME s are not distinguished.

Because of this potential confusion, denotational semantics uses a typographical convention to distinguish syntactic arguments from mathematical values. Special double square brackets are used consistently to enclose syntactic arguments, for example, M[| 3 |] = 3 . Although this convention is not standard Z, it is used here for functions that return mathematical values. Z uses this form of bracket for bags (multisets); there is no potential for confusion here since bags are not used in the following specifications. In addition, special triangular brackets are used to distinguish the syntactic arguments in functions that return syntactic results (the concrete syntax and translation functions), for example, C<[ 3 ]> = "three" .

2.8 Further reading

See [Meyer 1990] and [Tennent 1991] for further discussion on the use of set theory and partial functions, instead of domain theory, in denotational semantics.

Descriptions of lexing and parsing can be found in many classic texts on compiler writing; see, for example, [Aho and Ullman 1977] .

The Z language is described in [Spivey 1992] . [Hayes 1987] contains many case studies, using a slightly older variant of Z. There are many tutorial introductions to Z, for example, [Potter et al . 1991] .

[Austin 1976] gives an account of how English can be considered an 'imperative' language.