Lecture 2, Lexical Structure
Part of the notes for CS:4980:1
It is useful to consider how a person would process a program before trying to think about how it is done by a program. For this purpose, consider this program in our toy programming language, Kestrel.
putstr( "Hello world"+LF, output )
When a person literate in the Roman alphabet looks at this text, an important, almost unconscious processing step takes place: The text is seen not as a random pattern of letters on the page, but as a sequence of distinct words and symbols. This processing step is formally called lexical analysis, and the words and similar structures recognized at this level are called lexemes. When a person literate in programming languages reads the above code, they will likely break it into the following 8 lexemes (each enclosed in square brackets):
[putstr] [(] ["Hello world"] [+] [LF] [,] [output] [)]
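This splitting step can be mimicked mechanically. The following is a rough Python sketch, purely illustrative and not the Kestrel implementation; it assumes quoted strings without escapes and single-character punctuation marks:

```python
def split_lexemes(text):
    # Split a line of source text into lexemes, the way a reader does.
    lexemes = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch.isspace():                      # white space separates lexemes
            i += 1
        elif ch == '"':                       # quoted string (no escape sequences)
            start = i
            i += 1
            while i < len(text) and text[i] != '"':
                i += 1
            i += 1                            # step past the closing quote
            lexemes.append(text[start:i])
        elif ch.isalnum():                    # a word-like lexeme: letters and digits
            start = i
            while i < len(text) and text[i].isalnum():
                i += 1
            lexemes.append(text[start:i])
        else:                                 # anything else: one-character punctuation
            lexemes.append(ch)
            i += 1
    return lexemes
```

Applied to the example line, this produces exactly the 8 lexemes listed above.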
If the person knows the language in which the text is written, a second and still possibly unconscious processing step will occur: Lexical elements of the text will be classified into structures according to their function in the text. A literate programmer, regardless of whether they know this language, will probably see the parenthesized text as a parameter list containing two expressions separated by a comma, where these are parameters for a call to something called putstr.
In the case of a computer language, literate readers will see statements, expressions, terms and factors; in English, literate readers will see subjects, objects, verbs, and subsidiary phrases. In our language, Kestrel, there are blocks, declarations, statements and expressions. This level of analysis is called syntactic analysis, and is performed with respect to the grammar or syntax of the language in question.
The analysis of the lexical structure of the text begins with classification of the characters in that text. If you look at the definition of the Kestrel language, for example, you will find that it begins with a description of the character set, where the following categories of characters are identified.
These are then used to define a broader class of symbols.
In the above, we are using a notation called BNF (either Backus-Naur Form or Backus Normal Form, there is dispute about what BNF stands for). Each BNF rule defines a class of strings named on the left of the ::= sign in terms of the symbols or named classes of symbols on the right, with each alternative separated by a vertical bar.
This allows a top-level description of a Kestrel program as a sequence of lexemes, with optional white space separating the lexemes.
The above definition is a bit dishonest, since it does not even mention the end-of-file mark at the end. In practice, the shortest possible program consists of an end-of-file mark, which we will classify as a kind of lexeme. The underlying system guarantees that there will never be multiple end-of-file marks, and that there will be exactly one at the end of every program, no matter how defective it may be. Note that Kestrel does permit an empty program consisting of just an end-of-file mark. Such a program does nothing.
In the above, we used a commonly-used extension to BNF to indicate zero or more repetitions of a symbol (or sequence of symbols). This is, perhaps, the most common extension to BNF. Curly braces in this notation indicate repetition, so the notation a{b}c means the infinite set of strings including ac, abc, abbc, abbbc, etc. (In the original version of BNF, recursion was used to represent such sets.)
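The repetition notation maps directly onto iterative code: each {x} in a rule becomes a loop in a recognizer for that rule. As an illustration (not part of the Kestrel definition), here is a minimal Python recognizer for the set of strings described by a{b}c:

```python
def matches_abc(s):
    # Recognize a{b}c: an 'a', then zero or more 'b's, then a final 'c'.
    i = 0
    if i < len(s) and s[i] == 'a':
        i += 1
    else:
        return False
    while i < len(s) and s[i] == 'b':   # the {b} repetition becomes a while loop
        i += 1
    return i + 1 == len(s) and s[i] == 'c'
```

Note that the recognizer never needs to back up or look ahead more than one character, a point we will return to below.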
An abridged lexical specification for Kestrel begins as follows:
<lexeme> ::= <identifier> | <number> | <string> | <punctuation> | <end of file>
Note the use of another extension to BNF above: Square brackets describe optional elements, so a[b] describes the set containing the strings a and ab. This is perhaps the second most common extension to BNF. In the original notation, this would have been described as a|ab.
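Where repetition becomes a loop, an optional element becomes a single conditional. A sketch in Python of a recognizer for a[b], again purely as an illustration:

```python
def matches_a_opt_b(s):
    # Recognize a[b]: an 'a' optionally followed by a single 'b'.
    i = 0
    if i < len(s) and s[i] == 'a':
        i += 1
    else:
        return False
    if i < len(s) and s[i] == 'b':      # the [b] option becomes a single if
        i += 1
    return i == len(s)
```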
We will stop here. For the complete description of the Kestrel lexical structure, refer to the official definition of the language. Notably, the full definition includes quoted strings, and it includes a less-than-obvious (at first blush) system for number bases between 16 and 32.
The primary problem with BNF definitions is that they are wordy. In addition to the symbols of the language, we have introduced numerous metasymbols (with their names enclosed in angle brackets) such as <letter> and <digit>.
In fact, BNF notation is far more powerful than we need for describing a language at the lexical level. To see this, all we need to do is condense the above description by substituting the definition of each metasymbol for the use of that symbol in the text.
<lexeme> ::= <letter> { <letter> | <digit> }
          |  ( <digit> { <digit> } ) [
                 '#' ( <digit> | <letter> ) {
                     ( <digit> | <letter> )
                 }
             ]
          |  <string>
          |  <punctuation>
          |  <end of file>
Note the use of another extension to BNF above: We have used parentheses to group subsidiary alternatives. Also, because of the complex nested structure of the collapsed text, we have begun to use line breaks and indenting to help the reader figure out which end paren goes with which begin paren.
The key thing to note in the above is that there is no recursion, just choices and loops. In fact, a complete functional lexical analyzer can be written as "flat" code using nothing but if and while statements, with no need to look at more than one character of input at a time or to remember any previous characters. Formally, we say that for most languages, the lexical level of the language is not merely context-free, but susceptible to analysis by a finite-state machine.
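To make this concrete, here is a sketch of such a flat lexer in Python for the abridged grammar above. It is an illustration only: strings and the '#' number notation are omitted, white space is assumed to be blanks, tabs, and newlines, and the end-of-file lexeme is appended by the lexer itself rather than by the underlying system:

```python
def lex(text):
    # A "flat" lexer: only if and while, one character of lookahead, no recursion.
    lexemes = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in ' \t\n':                      # white space: skip it
            i += 1
        elif ch.isalpha():                     # <letter> { <letter> | <digit> }
            start = i
            while i < len(text) and (text[i].isalpha() or text[i].isdigit()):
                i += 1
            lexemes.append(('identifier', text[start:i]))
        elif ch.isdigit():                     # <digit> { <digit> }
            start = i
            while i < len(text) and text[i].isdigit():
                i += 1
            lexemes.append(('number', text[start:i]))
        else:                                  # everything else: one-character punctuation
            lexemes.append(('punctuation', ch))
            i += 1
    lexemes.append(('end of file', ''))        # exactly one, at the very end
    return lexemes
```

Every decision in this code looks at exactly one character, which is what makes the lexical level recognizable by a finite-state machine.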
We're still cheating just a bit here: This definition suggests the possibility that a file might contain multiple <end of file> lexemes, and this is impossible.