Lecture 25, Variables and Up-level Addressing

Part of the notes for CS:4908:0002 (22C:196:004)
by Douglas W. Jones
THE UNIVERSITY OF IOWA Department of Computer Science

Variables

The primary attributes of variables are their type and memory address. The type determines the amount of storage for that variable, and the memory address determines where it is allocated.

A special case, public global variables

Kestrel (and C and C++, for that matter) have a special case of global variable, specifically, global variables that are visible to separately compiled programs, even those compiled in other languages. In C and C++, if a variable is declared globally (outside of any function or class) and it is not declared static, storage for that variable is allocated by the linker, and it is visible to all separately compiled program components that have matching declarations.

Similarly, in Kestrel, if a variable is declared globally (at the outermost nesting level) and not marked as private or restricted), it is visible to all separately compiled linked programs, including C and C++ code.

The only attributes of variables of this sort are their type and name. The type includes the size of the memory block to be allocated to the variable. The code generator must include, in the object code, an indication of the name and size of the variable, given in a form the linker will understand so that the linker can allocate space for the variable.

References to such variables use absolute addressing and are done using assembly code that the linker will modify, replacing the symbolic variable name with its absolute memory address.

The general case

In the general case, variables are always allocated in terms of some block. The global block is the activation record of the main program. All other blocks are either instances of record types or they are activation records of procedures and functions.

Consider this block:

x: type record
    a: var int32;
    b: var array 1..10 of int32;
    c: var char;
    d: var boolean;
end;

In this block, there are 4 variables:

a, occupying 4 bytes starting at location 0 in the object.
b, occupying 40 bytes starting at location 4 in the object.
c, occupying 1 byte starting at location 44 in the object.
d, occupying 1 byte starting at location 45 in the object.

The total size of each object of type x 46 bytes; this is the sum of the sizes of all of the fields. Each field begins at an address within the object that is the sum of the sizes of all of the preceeding fields.

Note that the fields c and d in the above description are shown as one-byte fields, but one word could be allocated for each of these. Depending on the CPU -- that is, depending on both the architecture and the implementation of that architecture, there may be a significant advantage to allocating all memory fields so they are aligned on word boundaries.

In the above discussion, we assumed that the only fields in each block were the visible fields that are explicitly declared. In the general case, there may be several invisible fields. For example, in blocks that serve as activation records for procedures and functions, there must be a return-address field, and in blocks that contain code that references non-local variables, there must be an uplink, as discussed in the next section.

Up-level addressing

If we are only concerned about bindings of constants and types, the only attributes of identifiers that matter are their assembly time binding. The picture is more complex for variables. Consider this Kestrel program fragment:

x: type record
    a: const 5;
    b: type a .. a+3;
    c: var b;
    d: type record
        e: var b;
        f: var b;
        e = c;
        f = e;
    end;
    g: var d;
    h: var d;
end;
y: var x;
z: var x;

Looking at the above code, we see a record type d of which there are two instances, g and h within each instance of type x. The initializer code for the type d includes references to local variables, that is, to the fields e and f, and it includes a reference to a variable, c that is declared in the enclosing record type x. This means that the variable c is itself not a fixed location. At the outer level, our code creates two variables, y and z, and any particular reference to c in this code will either refer to y.c or to z.c, depending on the run-time context.

What does this imply about the atteributes bound to each identifier at compile time? Clearly, within the block d, the identifiers e and f are not simply bound to static memory addresses. Rather, when the declaration g: var d is elaborated, e refers to g.e, while when the declaration h:d is elaborated, e refers to h.e, a different memory location.

Therefore, at compile time, the first attribute of every variable binding is the displacement of that variable into an instance of some block. In the case of the identifier c in the above example, C is a displacement into the record x.

Notice that the enclosing block above is not global. It is itself a record declaration, and there are multiple instances of this block, y and z. Therefore, when evaluating c in the context of the assignment statement e=c above, the variable to which c refers is not statically determined.

In the object y, we can speak of y.g.e, and during the initialization of y.g, the assignment e=c really means y.g.e = y.c. Later, during the initialization of z.g, the assignment e=c means z.g.e = z.c.

This is called the up-level addressing problem. Languages like C eliminate this problem by forcing "flat programming", where all variables are either global (with static addresses) or local to a single subroutine. In C, you can nest blocks within a subroutine, but these are anonymous blocks that are (in Kestrel terms) extensions of their enclosing block, not first-class blocks.

When C was extended to make C++, a third level was added. At the outer level are static objects and variables, allocated at fixed memory addresses. Class definitions introduce a middle level, the fields of an instance of the class, and the local variables of the methods of a class are the inner level. While this is a 3-level nesting, no general mechanism is involved. Instead, C++ compilers treat the three levels as distinct unrelated special cases. The rules for declaring items at each of these levels are different and strictly constrained. While the definition int x; is allowed at any of these levels, other declarations are strictly limited to prevent generalized nesting and therefore avoid creating the up-level addressing problem.

Later, as an afterthought during the development of C++, nested classes were introduced. These create all of the complexity of nesting in Kestrel, but having been added late in the history of the language, they are frequently ignored by many programmers.

In languages like Algol 60, Pascal and Ada, arbitrary nesting has been allowed from the start. Kestrel falls in this category. In these languages, up-level addressing is routinely possible.

How is up-level addressing done? The answer is surprisingly simple: By default, unless an optimizer determines that there is no need for it, every instance of every block, whether it is a record variable or the activation record of a procedure or function, contains an implicitly defined pointer to the block instance of the enclosing block. When a block is instantiated, the first and entirely implicit parameter to the block's initializer is the pointer to the enclosing block. This is not a pointer to the block that asked for instantiation, it is a pointer to the block instance where the instantiator was found.

In many object oriented languages, there is a special identifier, self, that is a pointer to the current object. Consider the initializer for the record d in our introductory example:

d: type record
    e: var b;
    f: var b;
    e = c;
    f = e;
end;

In the above, we could rewrite the second assignment statement using explicit self-references as:

self@.f = self@.e;

The first assignment requires up-level addressing. If we name the implicitly declared pointer found in each block up, we can rewrite the first assignment as one or the other of the following:

self@.e = up@.c;

self@.e = self@.up@.c;

The above rewrites are not intended to be real Kestrel code, because Kestrel does not let users directly play with the self and up pointers. Nonetheless, these pointers exist inside the computer. At run-time, the self pointer is invariably stored in an index register whenever fields of an object are referenced, and when up-level addressing is required, the up-pointer is also loaded in an index register, at least temporarily.

The second version emphasizes that all variables in the activation records of a subroutine (including the initializer for a record variable) are referenced relative to the start of that activation record. In effect, the special variable self is the name for the register in the CPU that points to the activation record; in some coding models, this is the stack pointer, in others, the stack pointer and activation record pointers are distinct. Do not imagine using self@.self@.self.

Initializers

In the same spirit that led us to explicitly name the self and up pointers, we can also explicitly construct the initializer function that is an implicit attribute of every record. Consider, again, our introductory example:

x: type record
    a: const 5;
    b: type a .. a+3;
    c: var b;
    d: type record
        e: var b;
        f: var b;
        e = c;
        f = e;
    end;
    g: var d;
    h: var d;
end;
y: var x;
z: var x;

It contains several distinct things: First, it contains declarations for variables that are fields of the records. Second, it contains the code of the initializers for those records, and finally, it contains an implicit outer record type and associated code. We can separate these as follows:

global: type record
    up: @void;

    x: type record
        up: var @global;
        a: const 5;
        b: type a .. a+3;
        c: var b;
        d: type record
            up: var @x;
            e: var b;
            f: var b;
        end;
        g: var d;
        h: var d;
    
        initd: procedure( self: @d, up: @x )
            self@.up = up;
            self@.e = up@.c;
            self@.f = self@.e;
        end;
    end;

    initx: procedure( self: @x, up: @global )
        self@.up = up;
        x.initd( self@.g, self );
        x.initd( self@.h, self );
    end;

    y: var x;
    z: var x;
end;

main: procedure( self: @global, up: @void )
    self@.up = up;
    global.initx( self@.y, self );   
    global.initx( self@.z, self );   
end;

When we do this, we are forbidden to name any local variables in the initializers -- that is because the activation record of the initializer is actually the record pointed to by the self pointer. Self must be passed by the caller, since it is the caller that actually allocated the instance of the type d that needs initializing.

Note that we are being a bit informal here. The parameter passing modes being used for the initializers are a bit suspect, and the nesting of initializers within record definitions is meaningless because we have no implicit scope rules any more -- all variables are referenced relative to chains of pointers that start with self.