Chapter 2, Assemblers

Lecture notes for 22C:50 Introduction to System Software, by Douglas W. Jones, University of Iowa Department of Computer Science

Preface to Part I

The following chapters explore one aspect of the programming environment, language processors. This is done by an in-depth examination of the problems encountered in writing an assembler. The problems encountered include lexical and syntactic analysis, symbol table management, and forward reference resolution, problems that an assembler shares with high level language processors.

These issues are discussed in the context of a minimal assembly language before the addition of many useful features. A section is devoted to macro and conditional processing because, of all the features of a language processor, this most clearly illustrates the difference between compile-time or assembly-time computation on the one hand and run-time computation on the other.

Finally, a section is included on object codes, linkers and loaders. These are commonly used to represent and process the output from a language processor before it is run, and serve as a bridge from the realm of language processors to operating systems.

What is an Assembler?

The first idea a new computer programmer has of how a computer works is learned from a programming language. Invariably, the language is a textual or symbolic method of encoding programs to be executed by the computer. In fact, this language is far removed from what the computer hardware actually "understands". At the hardware level, after all, computers only understand bits and bit patterns. Somewhere between the programmer and the hardware the symbolic programming language must be translated to a pattern of bits. The language processing software which accomplishes this translation is usually centered around either an assembler, a compiler, or an interpreter. The difference between these lies in how much of the meaning of the language is "understood" by the language processor.

An interpreter is a language processor which actually executes programs written in its source language. As such, it can be considered to fully understand that language. At the lowest level of any computer system, there must always be some kind of interpreter, since something must ultimately execute programs. Thus, the hardware may be considered to be the interpreter for the machine language itself. Languages such as BASIC, LISP, and SNOBOL are typically implemented by interpreter programs which are themselves interpreted by this lower level hardware interpreter.

Interpreters running as machine language programs introduce inefficiency because each instruction of the higher level language requires many machine instructions to execute. This motivates the translation of high level language programs to machine language. This translation is accomplished by either assemblers or compilers. If the translation can be accomplished with no attention to the meaning of the source language, then the language is called an assembly or low level language, and the translator is called an assembler. If the meaning must be considered, the translator is called a compiler and the source language is called a high level language. The distinction between high and low level languages is somewhat artificial since there is a continuous spectrum of possible levels of complexity in language design. In fact, many assembly languages contain some high level features, and some high level languages contain low level features.

Since assemblers are the simplest of symbolic programming languages, and since high level languages are complex enough to be the subject of entire texts, only assembly languages will be discussed here. Although this simplifies the discussion of language processing, it does not limit its applicability; most of the problems faced by an implementor of an assembly language are also faced in high level language implementations. Furthermore, most of these problems are present in even the simplest of assembly languages. For this reason, little reference will be made to the comparatively complex assembly languages of real machines in the following sections.

The Assembly Process

It is useful to consider how a person would process a program before trying to think about how it is done by a program. For this purpose, consider the program in Figure 2.1. It is important to note that the assembly process does not require any understanding of the program being assembled. Thus, it is unnecessary to understand the algorithm implemented by the code in Figure 2.1, and little understanding of the particular machine code being used is needed (for those who are curious, the code is written for an R6502 microprocessor, the processor used in the historically important Apple II family of personal computers from the late 1970's).

;   Takes dividend in A, divisor in Y
;   Returns remainder in A, quotient in Y
START: STA IDENDL     ;Store the low half of the dividend
       STY ISOR       ;Store the divisor
       LDA #0         ;Zero the high half of the dividend (in register A)
       TAX            ;Zero the loop counter (in register X)
  LOOP:  ASL IDENDL     ;Shift the dividend left (low half first)
         ROL            ;                        (high half second)
         CMP ISOR       ;Compare high dividend with divisor
         BCC NOSUB      ;If IDEND < ISOR don't subtract
           SBC ISOR       ;Subtract ISOR from IDEND
           INC IDENDL     ;Put a one bit in the quotient
  NOSUB: INX            ;Count times through the loop
         CPX #8
         BNE LOOP       ;Repeat loop 8 times
       LDY IDENDL     ;Return quotient in Y
       RTS            ;Return remainder in A

IDENDL:B 0            ;Reserve storage for the low dividend/quotient
ISOR:  B 0            ;Reserve storage for the divisor

Figure 2.1. An example assembly language program.

When a person who knows the Roman alphabet looks at text such as that illustrated in Figure 2.1, an important, almost unconscious processing step takes place: The text is seen not as a random pattern on the page, but as a sequence of lines, each composed of a sequence of punctuation marks, numbers, and word-like strings. This processing step is formally called lexical analysis, and the words and similar structures recognized at this level are called lexemes.

If the person knows the language in which the text is written, a second and still possibly unconscious processing step will occur: Lexical elements of the text will be classified into structures according to their function in the text. In the case of an assembly language, these might be labels, opcodes, operands, and comments; in English, they might be subjects, objects, verbs, and subsidiary phrases. This level of analysis is called syntactic analysis, and is performed with respect to the grammar or syntax of the language in question.

A person trying to hand translate the above example program must know that the R6502 microprocessor has a 16 bit memory address, that memory is addressed in 8 bit (one byte) units, and that instructions have a one byte opcode field followed optionally by additional bytes for the operands. The first step would typically involve looking at each instruction to find out how many bytes of memory it occupies. Table 2.1 lists the instructions used in the above example and gives the necessary information for this step.

Opcode   Bytes   Hex Code

ASL        3     0E aa aa
B          1     cc
BCC        2     90 oo
BNE        2     D0 oo
CMP        3     CD aa aa
CPX #      2     E0 cc
INC        3     EE aa aa
INX        1     E8
LDA #      2     A9 cc
LDY        3     AC aa aa
ROL        1     2A
RTS        1     60
SBC        3     ED aa aa
STA        3     8D aa aa
STY        3     8C aa aa
TAX        1     AA

Notes:	aa aa - two byte address, least significant byte first.
	oo - one byte relative address.
	cc - one byte of constant data.

Table 2.1. Opcodes on the R6502.

To begin the translation of the example program to machine code, we take the data from Table 2.1 and attach it to each line of code. Each significant line of an assembly language program includes the symbolic name of one machine instruction, for example, STA. This is called the opcode or operation code for that line. The programmer, of course, needs to know what the program is supposed to do and what these opcodes are supposed to do, but the translator has no need to know this! Here, we show the numerical equivalent of each opcode in hexadecimal, or base 16. We could have used any number base; inside the computer, the bytes are stored in binary, and because hexadecimal to binary conversion is trivial, we use that base here. While we're at it, we will strip off all the irrelevant commentary and formatting that was included only for the human reader, and leave only the textual description of the program.
8D aa aa START: STA IDENDL
8C aa aa        STY ISOR
A9 cc           LDA #0
AA              TAX
0E aa aa LOOP:  ASL IDENDL
2A              ROL
CD aa aa        CMP ISOR
90 oo           BCC NOSUB
ED aa aa        SBC ISOR
EE aa aa        INC IDENDL
E8       NOSUB: INX
E0 cc           CPX #8
D0 oo           BNE LOOP
AC aa aa        LDY IDENDL
60              RTS
cc       IDENDL:B 0
cc       ISOR:  B 0

Figure 2.2. Partial translation of the example to machine language

The result of this first step in the translation is shown in Figure 2.2. This certainly does not complete the job! Table 2.1 included constant data, relative offsets and addresses, as indicated by the lower case notations cc, oo and aa aa, and to finish the translation to machine code, we must substitute numeric values for these!

Constants are the easiest. We simply incorporate the appropriate constants from the source code into the machine code, translating each to hexadecimal. Relative offsets are a bit more difficult! These give the number of bytes ahead of (if positive) or behind (if negative) the location immediately following the offset byte itself. Negative offsets are represented using 2's complement notation.

8D aa aa START: STA IDENDL
8C aa aa        STY ISOR
A9 00           LDA #0
AA              TAX
0E aa aa LOOP:  ASL IDENDL
2A              ROL
CD aa aa        CMP ISOR
90 06           BCC NOSUB
ED aa aa        SBC ISOR
EE aa aa        INC IDENDL
E8       NOSUB: INX
E0 08           CPX #8
D0 EC           BNE LOOP
AC aa aa        LDY IDENDL
60              RTS
00       IDENDL:B 0
00       ISOR:  B 0

Figure 2.3. Additional translation of the example to machine language

The result of this next translation step is shown in Figure 2.3. We cannot complete the translation without determining where the code will be placed in memory. Suppose, for example, that we place this code in memory starting at location 0200 (hexadecimal). This allows us to determine which byte goes in which memory location, and it allows us to assign values to the two labels IDENDL and ISOR, and thus, fill in the values of all of the 2-byte address fields to complete the translation.
0200: 8D START: STA IDENDL
0201: 21
0202: 02
0203: 8C        STY ISOR
0204: 22
0205: 02
0206: A9        LDA #0
0207: 00
0208: AA        TAX
0209: 0E LOOP:  ASL IDENDL
020A: 21
020B: 02
020C: 2A        ROL
020D: CD        CMP ISOR
020E: 22
020F: 02
0210: 90        BCC NOSUB
0211: 06
0212: ED        SBC ISOR
0213: 22
0214: 02
0215: EE        INC IDENDL
0216: 21
0217: 02
0218: E8 NOSUB: INX
0219: E0        CPX #8
021A: 08
021B: D0        BNE LOOP
021C: EC
021D: AC        LDY IDENDL
021E: 21
021F: 02
0220: 60        RTS
0221: 00 IDENDL:B 0
0222: 00 ISOR:  B 0

Figure 2.4. Complete translation of the example to machine language

Figure 2.4 shows the completed translation to machine code. For hand assembly of a small program, we don't need anything additional, but if we were assembling a program that ran on for pages and pages, it would be helpful to read through it once to find the numerical address of each label in the program, and then read through it again, substituting those numerical values into the code where they are needed.

symbol	address

START	 0200
LOOP	 0209
NOSUB	 0218
IDENDL	 0221
ISOR	 0222

Table 2.2. The symbol table for Figure 2.4.

Table 2.2 shows the symbol table for this small example, sorted into numerical order. For a really large program, we might rewrite the table into alphabetical order before using it to finish the assembly.

It is worth noting the role which the meaning of the assembly code played in the assembly process. None! The programmer writing the line STA IDENDL must have understood its meaning, "store the value of the A register in the location labeled IDENDL", and the CPU, when it executes the corresponding binary instruction 8D 21 02 must know that this means "store the value of the A register in the location 0221", but there is no need for the person or computer program that translates assembly code to machine code to understand this!

To the translator performing the assembly process, the line STA IDENDL means "allocate 3 bytes of memory, put 8D in the first byte, and put the 16 bit value of the symbol IDENDL in the remaining 2 bytes." If the symbol IDENDL is mapped to the value 0221 by the symbol table, then the CPU's interpretation of the assembler's output is the same as the programmer's interpretation of the source code. These relationships are illustrated in Figure 2.5.

                          Source Text
                            /    \
          programmer's     /      \      assembler's
        view of meaning   /        \  view of meaning
                         /          \
            Abstract Meaning ----- Machine Code

                          hardware's
                       view of meaning

Figure 2.5. Views of the meaning of a program.

A Simple Assembly Language

In order to simplify this discussion of the translation process, an assembly language less complex than that used in the previous example will be used. The R6502 language used there is complicated by the fact that a single symbolic instruction may assemble in many different ways; for example, the symbolic instruction LDA assembles to A9, AD, A5, or others depending on the form of the operand field. If the operand field begins with a hash mark (#), the immediate form, A9, is used, while if the operand is an expression with a 16 bit value but is not preceded by a hash mark, the direct addressing form, AD, is used. In a simplified assembly language, these differences in addressing mode can be indicated by different symbolic names.

Another problem with using the R6502 assembly language is its size; it has 56 different symbolic instructions. None of the basic functions of the assembler depend on the number of different instructions, so a simple assembly language with two instructions will be used as an example for the remainder of this chapter. These instructions are B, which means initialize one byte (8 bits) of memory, and W, which means initialize one word (16 bits) of memory. These correspond to the .BYTE and .WORD directives in the MACRO-11 assembly language for the PDP-11 (circa 1970), or to variants of the DC directive in the IBM 360 (and 370) assembly language (circa 1965). The syntax of most modern assembly languages can be traced back to one or the other of these older languages, although many minor changes have been introduced in the years since these older languages were in widespread use.

These two simple instructions could be used to assemble code for the R6502 processor by writing as many B and W directives as are needed to make up each actual machine instruction, as is illustrated in Figure 2.6.

STA = #8D            ;STA direct addressing
STY = #8C            ;STY direct addressing
LDAI= #A9            ;LDA immediate operand
TAX = #AA            ;TAX
ASL = #0E            ;ASL

START: B STA          ; Store
          W IDENDL    ; ... the low half of the dividend
       B STY          ; Store
          W ISOR      ; ... the divisor
       B LDAI         ; Load register A (the high half of the dividend)
          B 0         ; ... with zero
       B TAX          ; Zero the loop counter (in register X)
LOOP:  B ASL          ; Shift left
          W IDENDL    ; ... the dividend

Figure 2.6. Part of Figure 2.1 recoded in the simple assembly language.

Figure 2.6 recodes the first 5 instructions of the original example, but the programmer has had to remember the instruction format and write one line per byte or per 16-bit word in the program, and the programmer had to begin his or her efforts by explicitly defining to the assembler the values to be assembled for each machine instruction. In the figure, indenting has been used to distinguish between instructions and their operands.

Informally, each line of this simple assembly language is either a definition or a statement. Definitions assign values to symbolic names and do not imply the loading of any values in memory; each of the two statements we have defined loads values in memory in its own way.

Each statement consists of an optional label followed by an opcode and an operand. Labels end with a colon and may begin anywhere on the line. Note that the freedom to indent labels is not common. Many assemblers require that labels begin at the left margin.

The valid opcodes are B and W; these mean, respectively, assemble one byte and assemble one word. The operand field, which is the same as the value field in a definition, may be either an identifier (a symbolic name), a decimal number, or a hexadecimal number; the latter is indicated by the use of the # symbol as a prefix (this should not be confused with the use of the # prefix in the official R6502 assembly language, where it means an immediate constant). If an identifier is used, it must be defined elsewhere in the program, either by its use as a label, or by its use in a definition.

Formal Definitions

The above informal definition is accurate as far as it goes, but its very informality leads to difficulties. If two different programmers used this definition and wrote their own assemblers, it is likely that they would end up supporting slightly different languages. With definitions of larger languages, the differences between independently written processors frequently become insurmountable.

Over the years, a number of formal definition techniques have been developed which help to overcome this problem. Perhaps the oldest of these is BNF notation.

The initials BNF stand for either Backus-Naur Form or Backus Normal Form (depending on who is talking). This notation became widely used after Peter Naur used it in the definition of Algol 60; Naur modified a notation used by John Backus (the developer of FORTRAN). Since Backus has claimed that he did not invent the notation himself, but merely used it, and since the notation is not (technically speaking) a normal form, perhaps it is best to forget what the initials BNF stand for.

An important limitation of this notation is that it only defines the syntax of a language, while informal definitions such as the one given above indicate something about the meaning or semantics involved. Thus, a BNF definition can describe how to construct an assembly language program, but it can not describe the meaning of the result. The small assembly language used here is defined in Figure 2.7, with added informal comments.

<program> ::= <line> <end of file> | <line> <program>
     -- a program is a sequence of 1 or more lines

<line> ::= <definition> | <statement> | <comment>
     -- a line is either a definition, statement or comment

<definition> ::= <identifier> = <operand> <comment>
     -- a definition is an identifier, followed by an
        equals sign, followed by an operand        

<statement> ::= <label> <instruction> | <instruction>
     -- the label part of a statement is optional

<instruction> ::= <opcode> <operand> <comment> | <comment>
     -- the opcode, operand part of an instruction is optional

<comment> ::= ;<text> <line end> | <line end>
     -- comments at ends of lines are optional

<label> ::= <identifier> :
     -- a label is a symbol followed by a colon

<opcode> ::= B | W
     -- the legal opcodes are B and W

<operand> ::= <identifier> | <number>
     -- an operand is either an identifier or a number

Figure 2.7. BNF definition of the small assembly language.

Each line in the formal part of the above definition is called a production rule because it defines how to produce an object in the language from simpler objects. For example, a definition is made by concatenating a symbol, an equals sign, an operand, and a comment. Similarly, a comment is made by either a line end or a semicolon followed by any text followed by a line end.

In BNF, the symbols < > | and ::= have special meanings. The ::= symbol is used to indicate that the object on the left is defined by the "expression" to the right. The vertical bar is used to separate alternatives, while the angle brackets are used to enclose "nonterminal" symbols (those which must be further defined elsewhere). All of these special symbols are called metasymbols because they are used to "speak about" symbols in the language being defined.

This definition has two faults: It is wordy, and it omits lexical details such as the rules governing spacing and the construction of identifiers and numbers. Using BNF, the latter details can be defined as shown in Figure 2.8:

<identifier> ::= <letter> | <identifier> <letter or digit>
     -- identifiers start with a letter

<letter> ::= A | B | C | ... | X | Y | Z

<digit> ::= 0 | 1 | 2 | ... | 7 | 8 | 9

<letter or digit> ::= <letter> | <digit>

<number> ::= <decimal> | #<hexadecimal>

<decimal> ::= <digit> | <digit> <decimal>
     -- a decimal number is a sequence of digits

<hexadecimal> ::= <hexdigit> | <hexdigit> <hexadecimal>

<hexdigit> ::= <digit> | A | B | C | D | E | F
Figure 2.8: Lexical details of the example language.

Note that Figure 2.8 does not mention the spaces between lexemes! It is fairly common to leave this detail out of the formal description of programming languages. Instead, the informal statement is made that spaces may be included before any lexeme or between lexemes but may not be included within them. It is sometimes necessary to include the additional rule that successive identifiers or numbers must be separated by at least one space.

There are a number of ways of formally including the treatment of spaces in the definition of the syntax of a language, but it is more common to do this in a formal description of the lexical structure, as will be discussed later.

The primary problem with the BNF definitions given above is that they are wordy. There are too many nonterminal symbols. The most common solution to this is to introduce new metasymbols which allow many BNF production rules to be combined into a single rule in the new notation. The symbols which are generally introduced are [], {}, and (). Square brackets enclose optional constructs, curly brackets enclose constructs which may be repeated zero or more times, and parentheses group alternatives.

Notations such as this are commonly called extended BNF or EBNF notations; this one derives from a merger of BNF with the form of definition used originally for COBOL, in which vertical groupings of symbols indicated alternatives, and the different kinds of brackets were used as they are here. Figure 2.9 gives the definition of the example assembly language in this notation.

<program> ::= <line> { <line> } <end of file>
     -- a program is a line followed by zero or more lines

<line> ::= ( <definition> | <statement> ) [ ;<text> ] <line end>
     -- a line is a definition or statement with an optional comment

<definition> ::= <identifier> = <operand>

<statement> ::= [ <identifier> : ] [ ( B | W ) <operand> ]

<operand> ::= <identifier> | <number>

Figure 2.9. An Extended BNF grammar for the example language.

The difficulty with the definition given in Figure 2.9 is that, by omitting nonterminal symbols such as <comment> and <label>, less of the meaning of the grammar has been conveyed by this definition of the syntax. Of course, if meaningless symbols such as <a> and <b> had been substituted for <comment> and <label> in the original BNF grammar, the same difficulty would have arisen. This illustrates that, by carefully naming nonterminal symbols in a grammar, the grammar can be made to informally describe the meaning of a language at the same time that it formally describes the syntax.

A third notation for the formal definition of the syntax of a language is known as RTN (Recursive Transition Network) notation. Definitions in this form are also frequently called syntax diagrams or railroad charts, and are frequently used for the definition of languages descended from Pascal. The syntax diagrams for the example assembly language are given in Figure 2.10.

program       ------
    ---------| line |-------(end of file)----
         /    ------    \

           --| definition |--
line     /    ------------    \
    ----                        -------------------------(line end)--
         \    -----------     /   \       ---------   /
           --| statement |---       -(;)-| comment |-
              -----------                 ---------

definition   ------------           ---------
    --------| identifier |---(=)---| operand |----------
             ------------           ---------

statement
    --------------------------------------------------------------
       \   ------------        /  \                           /
         -| identifier |--(:)-     \ --(B)--     ---------   /
           ------------             \         --| operand |-
                                      -(W)--     ---------
Figure 2.10. RTN notation for the example language.

In RTN notation, nonterminal symbols are boxed, while terminal symbols (those which appear in the language) are circled. These syntax diagrams are essentially translations of the Extended BNF grammar given previously. The term "railroad chart" comes from the similarity of these diagrams to the schematic descriptions of railroad networks frequently used in railroad control towers and dispatching centers. As with flowcharts, poorly structured syntax diagrams are possible which are not easily translated to a structured form such as Extended BNF.

RTN notation has an important property: The RTN diagrams for a language are isomorphic to the flowchart of a program which reads input in that language! Such a program is called a parser. The same observations can be made about Extended BNF notation. In that case, the relation to be noted is that there are operations for selection between alternatives (a|b is like if ? then a else b), for repetition ({a} is like while ? do a), and for conditional inclusion ([a] is like if ? then a). Additionally, in both Extended BNF and RTN notation, the inclusion of a nonterminal symbol in the definition is equivalent to a procedure or function call in a program (hence the R in RTN).

There is a problem with the relationship between language definitions and programs which process that language. This problem is hinted at by the question marks in the parenthetic remarks in the last paragraph. The problem is that, although the form of the parsing program is specified by the language definition, the conditions to be tested at each branch in the flowchart are not specified. This is the crux of the parsing problem.

Before discussing some solutions to the parsing problem, it is interesting to consider the reverse problem, that of writing a program which generates programs in the language being defined. In that case, each terminal symbol in the language definition maps to a write statement. A simple program generator for random programs would request a new random number to be used as the basis of each branch in the program. For example, if "random" is a function returning a random boolean value each time it is called, the random generator for lines of assembly code would have the form given in Figure 2.11.

procedure line;                    void line()
begin                              {
    if random then begin               if (random()) {
        if random                          if (random())
            then definition                    definition();
            else statement;                else
    end;                                       statement();
    if random then begin               }
        write(';');                    if (random()) {
        text;                              putchar(';');
    end                                    text();
end {line};                            }
                                   }  /* line */

Figure 2.11. A random program generator in Pascal and C.

Most of the "computer poetry" which is the subject of occasional jokes is produced using essentially this technique, except that the basic grammar is that of a language such as English, and variables are added to control such things as rhyme and meter. In artificial intelligence work, an RTN grammar with added variables is referred to as an ATN or Augmented Transition Network grammar. The use of ATN grammars is at the center of much work with natural language understanding.


Parsing

A program (or part of a program) which reads text in an input language and classifies its components according to their grammatical derivation is called a parser. In this section, we will deal only with parsers, not with the problem of what to do with the output of the parser. A language processing program where the parser directs the translation process is said to be a syntax directed translator; later sections will describe these. As has already been mentioned, the flowchart of a parser can be derived from the grammar of a language; there are other forms of parsers, for example, table driven ones, but these will not be discussed here.

The parsers discussed here are sometimes called top-down parsers because they begin with the assumption that the input will be a program and they operate by trying to decide which of the ways of constructing a program matches the input. An alternative, bottom-up parsing, involves putting pieces of the input together to see what they make, hoping eventually to reduce the entire input to a single object and then making sure that the result is a program. The differences between these two approaches are most apparent in the context of expression analysis, where they will be discussed in more detail. An important property of both techniques, however, is that parsing is accomplished as the input text is read; computer programming languages are designed so that a parser can operate by reading only a few lexemes at a time, without any need to hold the entire text in memory at once.

The basic problem faced in a top-down parser is that of differentiating between the various alternate forms that may be substituted for some nonterminal symbol. The example given in Figure 2.12 demonstrates this problem in the context of the nonterminal symbol <line> from the extended BNF grammar given in Figure 2.9:

B = 5
B : B 5
B 5
W 5

Figure 2.12. A parsing problem for the nonterminal <line>.

The first line in Figure 2.12 is a <definition> while the others are <statement>s. Clearly, these cannot be distinguished by their first lexeme, but the second lexeme does the job. The first lexemes of the second and third lines are the same, but they serve different purposes; again, the second lexeme distinguishes between these purposes. Only in the last two lines is the first lexeme sufficient to distinguish between the forms. These examples suggest (correctly) that the example assembly language can be parsed by reading one lexeme at a time, from left to right, with the added ability to peek ahead at the next lexeme from time to time when that is needed to distinguish between forms which do not differ in their first lexeme.

This process of 'peeking ahead' at the next lexeme is conventionally called looking ahead, or looking right in the input. The number of lexemes ahead of the current lexeme which must be examined in order to parse a language is commonly used as a measure of the complexity of the grammar for that language. Thus, a grammar which allows a language to be parsed without looking ahead is the simplest; such grammars are called LL0 grammars (for Left-to-right parsing, Leftmost derivation, looking right 0 places). The example assembly language is in the class LL1 because it requires one symbol of look-ahead. It is interesting to speculate about how far ahead one must look in order to parse English; is English an LL6 language?

Most grammars for English appear to require infinite look-ahead, but example sentences illustrating the need for more than a few words of look-ahead are very hard for real people to follow even though they may be correct under the commonplace grammars people use to describe natural languages. It may be that the human capacity for look-ahead is limited by the fact that human short-term memory can hold about 'seven plus or minus two' things at any time; if this is the case, it becomes reasonable to speculate that a grammar requiring somewhere between 5 and 9 symbols of lookahead might be adequate to describe English as it is actually used.

For the example assembly language, the main body of the parser is easy to propose. This is simply a loop which processes lines until the end of a file. Prior to the 1970's, of course, most parsers were written in assembly language or even machine language, but today, it is common to write language processors in decent high-level languages. Figure 2.13 shows how this might look in Pascal and C.

procedure program;		void program()
begin				{
    repeat			    do {  
        line;				line();
    until eof(input);		    } while (!feof(stdin));
end {program};			}

Figure 2.13. The main body of a parser in Pascal and C.

The predicates "eof(input)" or "feof(stdin)" can be formally treated as asking if the current lexeme is a special, invisible, "end of file" lexeme, although it would probably be implemented as a simple test for end of file. Note that the parser given in Figure 2.13 has not been coded to anticipate an empty input file; thus, it may well produce unexpected results for an empty file.
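
Since the repeat loop always calls "line" at least once, a while-loop version handles the empty file correctly. The C sketch below illustrates this; the stub input is purely hypothetical, standing in for the real lexical analyzer's end-of-file test:

```c
static int remaining_lines;     /* stub input: lines left to read */
static int lines_parsed = 0;    /* stub: counts the calls to line() */

static int  at_eof(void) { return remaining_lines == 0; }
static void line(void)   { lines_parsed++; remaining_lines--; }

/* The main body recast as a while loop: an empty input now causes
   zero calls to line() instead of one. */
void program(void)
{
    while (!at_eof())
        line();
}
```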

Processing a line is more complex, since there must be some way to examine the current and next lexeme. To allow this, we will use two variables, "lex.this" and "lex.next"; the variable "lex.this" always holds the current lexeme, while the variable "lex.next" always holds the lexeme that comes after the current one. Thus, examining the contents of "lex.next" corresponds to looking ahead in the input. The procedure "lex.scan" will be used to advance the state of the lexical analyzer.

Formally, the lexical analyzer is an object with two read-only public variables, "lex.this" and "lex.next", and one public procedure, "lex.scan". In a language that doesn't support objects, we can simply make these variables global, naming them "lex_this" and "lex_next", with no loss of utility, because we have no intention of ever introducing multiple instances of the lexical analyzer. In fact, if our programming environment requires that we name the lexical analyzer class and then instantiate it, our environment is forcing us to do something inappropriate by suggesting the possibility of multiple instances of this class.

For now, we will assume that the values of "lex.this" and "lex.next" are strings, although this would rarely be the case in a production parser; instead, in production, these really ought to be values of type "lexeme", where values of type lexeme carry compact encodings of the attributes of the lexeme as they are computed.

Using the extended BNF grammar of the example assembly language as a basis, a procedure to parse one line can be written as shown in Figure 2.14.

procedure line;                 void line()
begin                           {
    if lex.next = "="               if (!strcmp(lex.next,"="))
        then definition                 definition();
        else statement;             else
    skipline;                           statement();
end {line};                         skipline();
                                }

Figure 2.14. A parser for lines in Pascal and C.

Note that the inclusion of a comment after the body of the definition or statement has been ignored! Whatever follows the definition or statement up to the end of line has simply been skipped over by the call to "skipline". Detection of errors significantly complicates this code, as is illustrated in Figure 2.15.

procedure line {with error detection};
begin
    if is_identifier(lex.this) then
        if lex.next = "="
            then definition
            else statement;
    if (lex.this = ";") or is_eol(lex.this) then begin
        {a comment or end of line legally follows}
    end else begin
        error("comment expected, something else found");
    end;
    skipline;
end {line};

Figure 2.15. A parser with error detection.

Here, the predicate "is_identifier" has been used to check that the line begins with a valid identifier, since all legal nonblank lines start with a valid identifier. Similarly, the predicate "is_eol" has been used to check to see if the current lexeme is an end-of-line marker. In the remainder of this discussion of parsing, this extra code to handle errors will be ignored, but it should be kept in mind that this code frequently dominates the structure of production-quality parsers because users demand good error detection and reporting.

The procedures for parsing definitions and statements which were called from the above routines can easily be written as shown in Figure 2.16.

void definition ()
{
     lex_scan(); /* skip over identifier */
     lex_scan(); /* skip over equals sign */
     operand();  /* parse the operand */
} /* definition */

void statement ()
{
     if (!strcmp(lex.next,":")) {
          lex_scan(); /* skip over identifier */
          lex_scan(); /* skip over colon */
     }
     if (!strcmp(lex.this,"B")) {
          lex_scan();  /* skip over B */
          operand();   /* parse the operand */
     } else if (!strcmp(lex.this,"W")) {
          lex_scan();  /* skip over W */
          operand();   /* parse the operand */
     }
} /* statement */

Figure 2.16. Parsers for definitions and statements.

It is interesting to note that these versions of definition and statement would require no additional error checking code if called from the error checking version of line given in Figure 2.15, assuming that the operand procedure performs appropriate checks for malformed operands.

A Syntax Directed Assembler

The parser given in the previous section provides a convenient scaffolding on which to build the rest of an assembler. In order to do this, there must be a place to store the assembled code; here, this will be "M", standing for memory, an array of bytes.

It should be noted that most production assemblers do not directly store assembled code in memory, but store it in special files called object files; these will be discussed in detail in Chapter 7. When assembly is directly into memory, it becomes necessary to violate the "sane usage" constraints on pointers, perhaps by using a small assembly language routine that directly interprets an integer memory address as a pointer. A classic name for this routine would be "poke", after the common name for the built-in procedure in many early microcomputer BASIC implementations that did this. Typically, "poke(b,a)" has the effect of "M[a]:=b".
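
For concreteness, a C rendering of "poke" might look as follows; the cast is exactly the deliberate violation of sane pointer usage just described, and is only meaningful when the integer really is a valid machine address:

```c
#include <stdint.h>

/* A sketch of "poke" in C: interpret the integer a as a machine
   address and store the byte b there, so that poke(b,a) has the
   effect of M[a]:=b.  This deliberately violates the usual rules
   of sane pointer usage. */
void poke(uint8_t b, uintptr_t a)
{
    *(uint8_t *)a = b;
}
```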

We also need a mechanism to store the association of symbols with values in the symbol table. Logically, the symbol table is an object, perhaps named "st", with two access routines, "st.define" and "st.lookup"; the former defines (or redefines) a symbol by associating a value with it, while the latter returns the value associated with a symbol. Appropriate implementations for these routines will not be discussed until the next chapter, but it is worth noting that, again, the object-oriented paradigm poses minor problems. We don't really want to create a symbol-table class, with the suggestion that there might be multiple coexisting symbol tables in our assembler; rather, we want a guarantee that there will always be exactly one object, the symbol table, that is the only instance of this class. Furthermore, with only one instance, the need to prefix each use of an access routine for that instance with the instance name becomes annoying.
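
As a placeholder for the next chapter's discussion, the following C sketch shows one trivially simple (and inefficient) realization of these two access routines, flattened into the names "st_define" and "st_lookup" as suggested above for languages without objects; a production assembler would use hashing instead of linear search, would check for table overflow, and would report undefined symbols rather than silently returning zero:

```c
#include <string.h>

#define ST_MAX 100   /* arbitrary table size; no overflow check here */

static char st_names[ST_MAX][32];
static int  st_values[ST_MAX];
static int  st_count = 0;

/* define (or redefine) a symbol by associating a value with it */
void st_define(const char *sym, int value)
{
    int i;
    for (i = 0; i < st_count; i++) {
        if (strcmp(st_names[i], sym) == 0) {
            st_values[i] = value;   /* redefine existing symbol */
            return;
        }
    }
    strcpy(st_names[st_count], sym);   /* define new symbol */
    st_values[st_count] = value;
    st_count++;
}

/* return the value associated with a symbol; error handling ignored */
int st_lookup(const char *sym)
{
    int i;
    for (i = 0; i < st_count; i++)
        if (strcmp(st_names[i], sym) == 0)
            return st_values[i];
    return 0;   /* undefined symbols silently yield zero */
}
```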

We can now rewrite the procedures "definition" and "statement" as shown in Figure 2.17 for use in a real assembler.

procedure definition;
begin
     s := lex.this {save the symbol to be st_defined};
     lex_scan      {skip that symbol};
     lex_scan      {skip the equals sign};
     v := operand;
     st_define(s, v);
end {definition};

procedure statement;
begin
     if lex.next = ":" then begin
          s := lex.this {save symbol used as label};
          lex_scan  {skip label};
          lex_scan  {skip colon};
          st_define(s, location);
     end;

     if lex.this = "B" then begin
          lex_scan {skip B};
          M[location] := operand;
          location := location + 1;
     end else if lex.this = "W" then begin
          lex_scan {skip W};
          o := operand;
          M[location] := first_byte_of(o);
          M[location + 1] := second_byte_of(o);
          location := location + 2;
     end;
end {statement};

Figure 2.17. The heart of an assembler.

To paraphrase the actions taken by these procedures, when a definition is found, the identifier is set equal to the associated operand. In a statement, when a label is found, it is set equal to the current location. The opcode B causes the operand to be stored in the current location, after which the current location is incremented by one. The opcode W causes the operand to be stored in the current and next location (taken as a 16 bit word), after which the current location is incremented by two.

The variable called "location" above is an important component of any assembler. It is commonly called the location counter in the assembler, by analogy with the program counter maintained by the computer when it runs a program. The assembler uses the location counter to determine where to place assembled instructions in memory during the assembly process, while the computer uses the program counter to determine where to fetch instructions from in memory when it runs a program.

Lexical Analysis

Before the shortcomings of the above basic assembler are examined, we will examine the implementation of the lexical analysis package, with the access procedure "lex.scan" and the variables "lex.this" and "lex.next". The "lex.scan" procedure identifies lexemes (words, tokens, or other logical units) from the lexicon (vocabulary) of a language. Although the syntactic structures (grammars) of computer languages differ greatly, their lexical structures are very similar to each other and to the written forms of natural languages which use the same alphabet. Thus, spaces serve to delimit lexemes, as do punctuation marks, which are themselves lexemes. It is important to note that the process of lexical analysis never depends on the meaning of the language or on syntactic issues such as whether or not some lexeme is allowed in a particular context.

The lexical structure of the example assembly language can be summarized as follows: All lexemes are either symbolic names, numbers, or punctuation marks. B and W are simply symbolic names. A symbolic name is a letter followed by zero or more letters or digits. A number is either a string of digits or a pound sign followed by a string of hexadecimal digits. The allowed punctuation marks are the equals sign, colon, semicolon, line-end and end-of-file. Any number of spaces may be inserted between lexemes without changing the lexical structure of a string, but at least one space must initially separate successive symbolic names or numbers. The extended BNF grammar given in Figure 2.18 describes the lexical level of the example assembly language in more detail than that in Figure 2.8.

<program> ::= <lexeme> { <lexeme> }
     -- a program is a string of one or more lexemes

<lexeme> ::= { <blank> } ( <identifier> | <number> | <punctuation> )
     -- any lexeme may be preceded by blanks

<identifier> ::= <letter> { <letter> | <digit> }

<number> ::= # <hexdigit> { <hexdigit> } | <digit> { <digit> }

<punctuation> ::= : | ; | = | <line end> | <end of file>

Figure 2.18. Lexical details in EBNF.

This definition of the lexical level does not include the rule that consecutive identifiers or decimal numbers must be separated by spaces; thus, it is ambiguous. This does not cause a problem in lexical analysis, but programmers must be aware that the string "B12" will be interpreted as one identifier, even though the above rules would allow it to be interpreted as starting with the identifiers "B" or "B1" followed by the numbers "12" or "2". The reason this causes no problem in lexical analysis is that, for both parsers and lexical analyzers, a so called greedy approach is commonly used. That is, we assume that the parser or lexical analyzer will construct the largest identifier or number it can by following the rules for <identifier> or <number> before it returns to the level where it looks for the start of the next lexeme.
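
This greedy rule can be sketched in C; the helper below is hypothetical, but it shows how a scanner following the <identifier> rule consumes the longest possible run of letters and digits, so that "B12" comes out as one three-character identifier:

```c
#include <ctype.h>

/* Greedy scanning of an identifier: starting at pos, consume a letter
   followed by the longest possible run of letters and digits, and
   return the length consumed (0 if no identifier starts at pos). */
int scan_identifier(const char *line, int pos)
{
    int start = pos;
    if (isalpha((unsigned char)line[pos])) {
        pos++;
        while (isalnum((unsigned char)line[pos]))
            pos++;
    }
    return pos - start;
}
```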

An alternate way of formalizing the description of the lexical level of a language rests on the use of finite state transition diagrams or simple state transition networks. In such a definition, state changes are caused by the processing of successive input characters, and some state changes also signal the completion of the analysis of some lexeme. The notation used is very similar to RTN notation, and is shown in Figure 2.19.

        /                                                  \
start   \                                     identifier   /|
  -------->----------(letter)-------->--------------------  |
    /         \  \             /            \               |
    \         /   |           |\            /|              |
      (blank)     |           |  -(letter)-  |              |
                  |            \            /               |
                  |              -(digit)--                 |
                  |\                          number       /|
                  |  (#)----(hexdigit)---->---------------  |
                  |       /            \                    |
                  |       \____________/                    |
                  |\                          number       /|
                  |  ----(digit)---------->---------------  |
                  |    /         \                          |
                  |    \_________/                          |
                  |\                          punctuation  /|
                  |\ ---------(:)--------->--------------- /|
                  |\ -----(;)----------------------------- /|
                   \ --------------(line end)------------- / 
                     -(end of file)----------------------- 

Figure 2.19. Finite state description of the lexical level.

None of the rules given up to this point mention anything about a maximum length for identifiers, maximum value for numbers, maximum number of characters in a line, or maximum program size. These are frequently considered to be outside of the realm of formal definition, and may even vary from one implementation of a language to another. Typically, the informal part of the language specification will include minimum values for the line length, number of significant characters in an identifier, and the maximum number of digits allowed in a number.

A typical lexical analyzer will contain, as a private component, a line buffer which holds one line of input (a string variable or an array of characters). With this buffer is associated a variable which points to or indexes the first character in the buffer which has not yet been processed at the lexical level. Because of the need for look-ahead, processing at the lexical level will generally be a few lexemes ahead of processing at the syntactic level. We will use the variable "pos" to serve this purpose.

In addition, we need a more sophisticated way to represent the current lexemes than simple character strings! Instead, we will represent lexemes with a record or structure that contains information about the lexeme. Figure 2.20 illustrates appropriate type definitions:

type lextypes = (identifier, number, punctuation);
type lexeme = record
                 start: integer { index of start of lexeme on line };
                 stop: integer  { index of end of lexeme on line };
                 typ: lextypes  { nature of this lexeme };
              end;

enum lextypes { identifier, number, punctuation };
struct lexeme {
    int start;          /* index of start of lexeme on line */
    int stop;           /* index of end of lexeme on line */
    enum lextypes typ;  /* nature of this lexeme */
};

Figure 2.20. Type definitions for lexeme types in Pascal and C.

A programming language such as Ada allows a clear definition of the interface between the lexical analyzer and the rest of the world, as shown in Figure 2.21.

package lex is
   type lextype is (identifier, number, punctuation);
   type lexeme is record
         start: integer; -- starting position of lexeme on line
         stop: integer;  -- ending position of lexeme on line
         typ: lextype;   -- nature of this lexeme
      end record;

   this: lexeme; -- the current lexeme
   next: lexeme; -- the lexeme following the current one

   procedure init;     -- called to start the lexical analyzer
   procedure nextline; -- called to advance to the next line
	-- after a call to either of the above, this and next will
	-- be the first and second lexeme on the current line

   procedure scan;     -- called to advance to the next lexeme on the line
	-- after a call to scan, this and next will advance one lexeme
	-- within the current line
end lex;

Figure 2.21: An Ada interface to the Lexical Analyzer

As with C++ and Java, the Ada language allows interface specifications to be given separately from the implementation of an abstraction. All of the definitions in an Ada package declaration are publicly available to the rest of the program, including type definitions, variables and functions. Unlike C++ and Java, however, Ada packages are objects, not classes; Ada does include something called a generic package that corresponds to classes, but the purpose of this discussion is not to teach all of Ada.

It is fair to ask, why didn't we add a string field to the lexeme structure to hold the text of the current lexeme? The answer to this is that we are interested in writing efficient software, and copying strings is something that should be avoided if it is not necessary. Therefore, what we want in the lexeme data structure is not the text of the lexeme, but rather, the numerical value of numeric lexemes, some equally concise indication of what identifier is represented, and in the case of punctuation, a quick and easy way to determine what mark is involved. We will deal with these issues later.
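
To make that concrete, here is a hypothetical helper (not part of the interface above) showing how a parser can test a lexeme against a known string without copying it out of the line buffer, using only the start and stop fields of Figure 2.20 (the typ field is omitted here for brevity):

```c
#include <string.h>

struct lexeme { int start; int stop; };  /* typ field omitted */

/* True if the characters line[l.start .. l.stop] spell exactly s;
   no substring is ever copied out of the line buffer. */
int lexeme_is(const char *line, struct lexeme l, const char *s)
{
    int len = l.stop - l.start + 1;
    return (int)strlen(s) == len
        && strncmp(line + l.start, s, (size_t)len) == 0;
}
```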

Given an interface specification, we can go on to define the functions and private variables of the lexical analyzer as shown in Figure 2.22:

package body lex is
   line: array (0 .. linelen) of character;
   pos: integer; -- current position in line

   ...  -- we omit a few details (initialization etc)

   procedure scan is
   begin
      this := next;
      while line(pos) = ' ' loop
         pos := pos + 1;
      end loop;
      next.start := pos; -- mark start of lexeme
      if line(pos) in 'A' .. 'Z' then
         next.typ := identifier;
         loop
            pos := pos + 1;
            exit when   (line(pos) not in 'A' .. 'Z')
               and then (line(pos) not in '0' .. '9');
         end loop;
      elsif line(pos) in '0' .. '9' then
         next.typ := number;
         loop
            pos := pos + 1;
            exit when line(pos) not in '0' .. '9';
         end loop;
      elsif line(pos) = '#' then
         next.typ := number;
         loop
            pos := pos + 1;
            exit when   (line(pos) not in '0' .. '9')
               and then (line(pos) not in 'A' .. 'F');
         end loop;
      else
         -- we treat everything else as punctuation
         next.typ := punctuation;
         pos := pos + 1;
      end if;
      next.stop := pos - 1; -- remember where lexeme ends
   end scan;
end lex;

Figure 2.22. A lexical analyzer.

Note that important details have been ignored in this version of "lex.scan" such as initialization, checking for the end of a line, or handling of invalid characters; furthermore, we've provided no way for the user to inspect the current lexeme to determine if it is a particular identifier or a particular punctuation mark!

The version of "lex.scan" given in Figure 2.22 makes it clear that the cost of one lexeme look-ahead is a single assignment statement per lexeme processed, plus an extra variable to store the value of one lexeme. In fact, the assignment statement is not free, since it actually involves copying an entire record that is several words long, but we can afford this.

The fact that the cost of look-ahead is low was not understood in the design of some early programming languages, where the need for look-ahead was eliminated by having a leading keyword on each line to identify the type of that line. For example, all early versions of BASIC required the keyword "LET" at the start of each assignment statement.

It is common to make the lexical analyzer responsible for skipping comments; thus, semicolon would not be considered a lexeme type in the example assembly language; rather, the end of line lexeme would be considered to include the comments leading up to the end of line. In languages such as Pascal and PL/I, where comments may be interspersed between any lexemes, the lexical analyzer would identify and skip comments as part of the code responsible for skipping spaces between lexemes.

It is also common to integrate the production of a listing with the lexical analyzer. Thus, the routine to print a line is typically called from within the lexical analyzer as a consequence of finishing the analysis of the previous line, and error message formatting is tied to the lexical analyzer so that error messages can be printed under the lexeme to which they apply.


Alternatives

The assembler presented up to this point is incomplete, since it lacks any symbol table mechanism, and even if that were provided, it would not be able to handle identifiers which are defined after their first use. These problems will be solved in the next two chapters, but before solving them, it is useful to look at the alternatives which have been avoided in this presentation of parsing techniques.

A natural objection to the above presentation is that it avoids using powerful high level language features; specifically, it makes little use of string operations which are supposed to greatly simplify text processing. In fact, the extensive use of string operations can lead to trouble, as the following example illustrates:

Consider an assembler which, after reading a line in as a string, searches the line, using a string search operator, for any semicolon and uses substring operations to remove that and all following characters (the comment) from the line. The next step might be to use a search operation for an equals sign in order to distinguish between statements and definitions. For statements, a second search operation could be used to see if there is a colon, and if there is, substring operations could be used to remove the colon from the line and process it. Although there is no doubt that a working assembler could be written this way, this approach is also computationally expensive: Each substring operation is typically implemented by a loop which copies one character at a time, and string searches are typically implemented by sequentially testing successive characters. Even if these are done by hardware, the above approach leads to testing each character on a line many times, requiring many memory cycles where the lexical analyzer given requires only one.

Actually, there is an appropriate way to use string functions in the lexical analysis routine presented above. The key is to use the string function to do exactly the same processing as is explicitly indicated in the code given above; for example:

    while linebuf[pos] = ' ' do pos := pos + 1;

can be replaced by

    pos := (pos-1) + verify(substr(linebuf,pos),' ');

assuming the PL/I string functions "verify" and "substr", which return the position of the first character not in a given set and take a substring, respectively. Unfortunately, unless a good optimizing compiler is used, the "substr" operation will involve making an unnecessary copy of part of the line buffer, and it is not much harder to write explicit code for the operation in the first place.
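
In C, the same space-skipping step can be written with a standard library function that examines the buffer in place, a sketch of the copy-free style advocated here (the names "line" and "pos" follow the earlier figures):

```c
#include <string.h>

/* Skip spaces starting at pos without copying any part of the line:
   strspn returns the length of the initial run of characters of
   line+pos drawn from the set " ", touching each character once. */
int skip_spaces(const char *line, int pos)
{
    return pos + (int)strspn(line + pos, " ");
}
```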

Many early assembly languages were designed to eliminate as much of the parsing and lexical analysis problems as possible. For example, requiring labels to occupy columns 1 through 6, opcodes columns 8 through 10, and operands columns 15 through 20 allows each field to be trivially identified. The rigidity of this approach prevents the use of such techniques as indenting to document control structures, and the approach is clearly not generalizable to other kinds of computer languages. Remnants of this approach are clearly visible in the specification of the original versions of FORTRAN, where labels occupy columns 1 to 5 of a line, and the remainder of a statement occupies columns 7 to 72.


Terminology

The reader should be familiar with the following general terms after reading this section:

programming language            syntax diagram
language processor              parser
assembler                       syntax directed translator
compiler                        top-down parsing
interpreter                     bottom-up parsing
syntax                          token
semantics                       look-ahead
BNF notation                    location counter
production rule                 lexical analyzer
extended BNF notation           lexicon
RTN notation                    lexeme

Additionally, the reader should be familiar with the following components of the example assembly language:

statement                       opcode
definition                      operand
label                           comment


Exercises

  1. Describe a high level feature of what might normally be classified as a low level language. Describe a low level feature of what might normally be considered to be a high level language.

  2. Modify the BNF grammar for the simple assembly language given in Figure 2.7 to indicate where spaces are allowed in a program, using the following definition:
    <spaces> ::= <nothing> | <spaces> <space>

  3. Use syntax diagrams to define the lexical elements <identifier> and <number> as they are defined in Figure 2.8.

  4. Some assembly languages allow multiple labels on one line; suppose this feature were added to the example assembly language.
    a) Modify the Extended BNF grammar given in Figure 2.9 to allow this.

    b) Modify the procedure "statement" given in Figure 2.16 to allow this. (You may translate the code to the language of your choice.)

    c) Modify your answer from part b so that it will handle errors gracefully.

  5. In Figure 2.17, the functions "first byte of" and "second byte of" are used to obtain the two bytes of a 16 bit value for assignment to two consecutive memory locations. How can this be done legally in the programming language of your choice?

    On the R6502, the least significant byte is to be stored first; this is the same byte order used on the Intel x86, but the opposite byte order used on the Motorola 680x0 and several other computers. Also assume that a byte is 8 bits and that the machine on which the assembler runs does all arithmetic to at least 16 bits of precision; neither assumption is universal! There have been commercially successful machines with 6 and 9 bit bytes, and there have been commercially successful machines with word sizes of 12, 18, 24, 32, 36, 48, 60 and 64 bits.

  6. Write the function "operand" used in the code to assemble definitions and statements given in Figure 2.17. This function must use the symbol-table lookup routine to find the value associated with identifiers, and it can use a function evaluate(lexeme) to get the value of numeric lexemes.

  7. Rewrite the lexical analyzer's scan routine given in Figure 2.21 and Figure 2.22 so that it evaluates numeric lexemes and places the value in a new field of each lexeme called "value". The "value" field for non-numeric lexemes should be ignored. Ideally, evaluation should be done one character at a time without resort to built-in functions such as the classic C (and C++) "atoi" routine.

  8. Add a predicate (boolean function) to the lexical analyzer from Figure 2.21 and Figure 2.22 that allows easy testing of whether a lexeme is a particular punctuation mark. Call it "ispunc(l,c)", which returns true if the lexeme l is the punctuation mark c and false otherwise. Then modify the parser code in Figure 2.17 to use this new predicate instead of string comparisons.

  9. The structure of the example assembly language is clearly oriented towards machines with an 8 bit byte and a 16 bit address which fits exactly in one word. Although the PDP-11, R6502, Intel 8080 and many other older computers fit this model, many others do not. Ignoring the machine instructions themselves, but with attention to pseudo-operations for initializing memory locations, how might you modify the example assembly language so that it can generate code for each of the following machines:

    a) The PDP-8, which has a 12 bit address, a 12 bit word, and no concept of byte.

    b) The CDC-6600, which has a 60 bit word, where instructions are either 15 bits or 30 bits long (packed 2, 3, or 4 per word), and a memory address is 18 bits (the low 18 bits of a 30 bit instruction). On the CDC-6600, addresses refer to words, and there is no hardware defined concept of byte except as it applies to the packing of instructions within a word.

    c) The PDP-10, which has a 36 bit word, an 18 bit memory address, and the ability to manipulate bytes of any size starting at any point in a word. All instructions are exactly one word long. Addresses on the PDP-10 point to a word, but a special 36 bit pointer format exists for references to a byte within a word; in this format, the first 6 bits give the offset of the byte within the word, the second 6 bits give the size of the byte, and the last 18 bits give the address of the word holding the byte (6 bits are unused).


References

The field of parsing has been around for a long time and has accumulated a large and varied literature and an extensive body of theory. A reasonable introduction to this area is provided by

Syntax of Programming Languages by Roland C. Backhouse (Prentice-Hall International, 1979). An alternative book,

Principles of Compiler Design by Aho and Ullman (Addison-Wesley, 1977), is slightly more lucid but covers material at a slightly higher level.

Among the best examples of a well designed assembly language still in common use today is MACRO-11 on the PDP-11 family of computers built by Digital Equipment Corporation (now part of Compaq). This language served to inspire the assembly languages used for the Intel and Motorola families of microprocessors, but in both cases, the original is an improvement over its descendants. MACRO-11 serves as the basis of much of the example assembly language used here, and is documented in a number of texts, including

Introduction to Computer Systems by Glenn MacEwen (McGraw-Hill, 1980).

The syntactically similar assembly language of the DEC VAX and Compaq Alpha series of computers is descended from MACRO-11. The syntactically similar assembly language for the DECsystem-10 and DECsystem-20 computers was a predecessor of the PDP-11 assembly language; since the DEC-10 and DEC-20 have 36 bit words, while the PDP-11 has a 16 bit word, this demonstrates that many syntactic issues in assembly language design are completely independent of the machine for which the language is intended.

Relatively traditional views of assembler construction can be found in

Assemblers, Compilers, and Program Translation by Peter Calingaert (Computer Science Press, 1979), and in

Assemblers and Loaders by D. W. Barron (North-Holland, 1978, 3rd ed.). Both of these texts cover the assembly process at an introductory level.

The reader interested in the origins of BNF should refer to section 1.1 of the Revised Report on the Algorithmic Language Algol 60 edited by Peter Naur.

This has been reprinted in many places, including the book

Programming Systems and Languages edited by Saul Rosen (McGraw-Hill, 1967).

Many additional historical notes can be found in the

Proceedings of the ACM SIGPLAN History of Programming Languages Conference, published as SIGPLAN Notices, 13, 8 (August 1978), and later edited by R. L. Wexelblat and published as a monograph by Academic Press in 1981.