A View of C the Register Machine

tzbitsby
Jason Kantz
Oct 2018
 vol 2.

Here's one way of summarizing the job of a programmer: design computational processes that are carried out by machines in order to solve problems. By that I mean programmers create evolving patterns of data manipulation, and these patterns follow certain rules and lead to certain results. The particular rules that determine and constrain the changes to the data are the "execution model" or "model of evaluation".

There are numerous ways of representing data, controlling it in the machine, and describing this in textual (or visual) programs. The related idea of "programming paradigm" describes the endeavor by putting execution models and the ways of programming them into categories based on the various methods used or not used. An important thing to note about programming paradigms is that, defined this way, there are a lot of them. Take any two methods together, for example, exceptions (for changing flow of control) and closures (for creating new control flow under a given state), and that's a paradigm. If there are n techniques, pairing

them gives n choose 2 possible paradigms and many more when using more than two. A survey of the practical ones can be found in

(Van Roy, 2012), which explained:

Concepts are not combined arbitrarily to form paradigms. ... In a given paradigm, it can happen that programs become complicated for technical reasons that have no direct relationship to the specific problem that is being solved. This is a sign that there is a new concept waiting to be discovered.

This itself is a process: abstracting away unnecessary details, tailoring the removal of detail to a given pursuit, and then moving on to a new way of problem solving and design. When the results become mainstream, the culture as a whole advances with new tools for expressing ideas. This process is cataloged for various disciplines in "Notes on notation and thought" (Ye, 2018). Sussman (2004) provides examples of how the precise mathematical languages of geometry, algebra, and classical mechanics have become available in general culture, and, as a consequence, have made everyone smarter. Computational ideas have been going through a similar process as illustrated by Sussman's anecdote:

Back in 1980 (a long time ago!) I was walking around

an exhibit of primitive personal computers at a trade show. I passed a station where a small girl (perhaps 9 years old) was typing furiously at a computer. While I watched, she reached over to a man standing nearby and pulled on his sleeve and said: "Daddy! Daddy! This computer is very smart. Its BASIC knows about recursive definitions!" I am sure that her father had no idea what she was talking about. But notice: the idea of a recursive definition was only a mathematician's dream in the 1930s. It was advanced computer science in the 1950s and 1960s. By 1980 a little girl had a deep enough operational understanding of the idea to construct an effective test and to appreciate its significance.

So this process of inventing new notations or currency of thought has been going on within programming itself, and (perhaps?) making programmers smarter.

One example, pointed to by Van Roy (2012), is how exceptions can simplify some problems. Without exceptions, every procedure that may or may not succeed must return an additional value--an error code--which must be added to every function up a call chain until handled. Exceptions allow the control of flow to jump directly

to error handling code higher up the call stack without the need for the extra value. (While exceptions simplify some problems, they may add problems of their own, so Go doesn't have them, and the Google C++ style guide forbids them.)

There's a problem with viewing "programming paradigms" as simply combinations of execution models, programming practices, methods, or techniques. A few techniques on their own do not necessarily create a paradigm. In the context of C++ there is an idiom known as "Resource Acquisition is Initialization". In other contexts, it might not be called RAII, but something like it has been around for some time. For example, the concept is available in MIT Scheme, just not under that name:

(call-with-input-file "/tmp/foo"
  ;; port is closed when this procedure returns
  (lambda (port)
    (read-line port)))
=> "bar"

Somehow RAII remains a programming idiom and doesn't merit full elevation to paradigm. The missing bit is that paradigms capture more than just groupings of techniques. In practice, paradigms designate collections of computation techniques that branch off and evolve together from certain starting points as programming communities grow out of them to solve particular

problems or create particular kinds of applications. Pitman (1994) touched on this idea as it applied to fractious Lisp communities in the 1990's, and suggested the following thought experiment:

Suppose that the designers of some Lisp dialect L concluded for some reason that language design was a popularity contest and in a desperate attempt to narrow the gap, they exactly replaced the text of their definition of L with the definition of C in order to make L more attractive. Suppose that for some reason no one in the C community took notice, and so the Lisp community remained the sole consumers of L, which differed from C only in name. Think about L today and how it would change in future years. Starting from the same point, would C and L continue to look the same after some years had passed?

Viewed in this light, it's interesting that one execution model can serve as a starting point for most implementations of the current paradigms for

  • data
  • the ways of changing data
  • the ordering of the changes

That model is the register machine. By "register machine", I mean the general model presented by Abelson & Sussman (1996, p. 491). This is an execution model that has a program counter (PC) that points to the next instruction, goto statements that can alter the PC, and operations that can combine values from input storage locations (registers or positions in a vector of addressable memory) and place the result in an output storage location. By "starting point" I mean that other models can be built on it. A functional execution model might restrict register assignment to only a single assignment throughout a program, for example.

The C programming language is built on the register machine execution model and for several decades has been the main language for expressing the API interface to physical register machines. For POSIX operating systems in particular, any language runtime that relies on services from the host operating system must inter-operate with C.

C gets criticism for being a sequential language in a world where the trend is toward approachable concurrent models that lend themselves to parallel execution (Chisnall, 2018). However, we should keep in mind the distinction between "transformational programs" and "reactive programs" made in (Cooper, 2002). In the

context of POSIX, if one wants to create patterns of concurrent communicating processes, the C-centric environment is quite up to the task for transformational programs. Composable C processes excel at transformational tasks. Drake (2014) demonstrated this in "Command-line Tools can be 235x Faster than your Hadoop Cluster". However, reactive programs where the user is in the flow can be difficult to structure as C may need to escape from it's own sequential execution model to the non-deterministic pthreads execution model.

Given it's place in history, and the fact that it has direct or indirect effects on most programming communities in one way or another at some point in time, it's worth taking a closer look at the sequential execution model of C. C falls in the "imperative" paradigm, but what's most interesting about C is how it extends the "register machine" with the idea of structured programming popularized by (Dijkstra, 1968). C retains "goto", but the destination of a goto is a static label that cannot vary at runtime. C programs eschew goto in favor of control flow constructs currently taken for granted: subroutines, while blocks, and if/else blocks.

C represents data objects as storage locations composed of 1 or more uniquely addressable "storage units" (Harbison & Steele,

2002, p. 181). A storage unit is some number of bits and CHAR_BIT in limits.h defines that number for the particular C compiler being used. The number of bits in "char" is a measuring stick for one abstract storage unit, so for any implementation of C,

sizeof(char) == 1

This means the lower bound on the number of bits in an abstract storage unit is determined by the number of characters in C's basic character set.

C has several ways of establishing data objects of particular types through declarations: 1) top-level identifiers, 2) formal parameters in function definitions, and 3) block (local) identifiers (Harbison & Steele, p. 75). So for example,

// Top-level identifier allocates sizeof(int) storage units
// available to end of source file.
int red_chips = 10; 

// Formal parameter blue_chips is sizeof(int) storage units
// available until end of function block
float prob_blue(int blue_chips) {
    // Block local identifier total is sizeof(int) storage units
    // available until end of function block
    int total = red_chips + blue_chips;
    return (float)blue_chips / total;
}

An occurrence of a name evaluates to the storage location for a data object of the established type. C is prone to name conflicts and

particular rules govern which declaration establishes the storage for a name used within or across source files (Harbison & Steele, p. 76, p. 114).

C's primitive data types are integral, floating point, pointer, and void (Harbison & Steele, p. 124). Storage for compound data is established by declaring structures or arrays.

int ary[4] = { 0, 1, 2, 3};
struct {int red; int blue;} a[3] =
    {{10, 2}, {5, 8}, {7, 10}};
char str[] = "Hello World";

Each of these identifiers designates a storage location, and it is in this sense that C can be thought of as a "register machine". The declarations of storage locations (or pointers to them) can be viewed as an overall definition of a machine which includes a number of registers and the control-flow rules for updating them. Each particular C program is a description of one or more particular register machines--subroutines being ways of modularizing and separating different machine definitions. The runtime provides routines like "malloc" and "free" which provide the illusion that the machines can conjure new storage locations arbitrarily at runtime.

C provides a number of operators for combining data objects. Operators produce a result which is assigned to a storage location designated by an lvalue--an expression that refers to an object for

examination or alteration. An lvalue can be a name e or e[k], (e), e.name, e->name, or *e (Harbison & Steele, p. 204, Table 7-1). rvalues are non lvalues, or, generally, expressions that combine data objects for storage into the lvalue.

Assignment expressions are the unit of evaluation and they are composed into statements, the simplest of which is an "expression statement" that involves a single assignment (Harbison & Steele, p. 246). Expressions that don't assign a value to a storage location (i.e. ones with no side-effects) don't have to be evaluated and can be discarded (Harbison & Steele, p. 255).

C99 permits simple assignment of compatible structure or union types.

struct container {int red; int blue;};
struct container jar = { 5, 20 };
struct container more_blue() {
    struct container jar2 = jar;
    jar2.blue++;
    return jar2;
}

However, simple assignments cannot be used to copy entire contents of one array to another since name of an array is not a modifiable lvalue.

rvalues can be compound expressions involving multiple operators for changing or combining data objects and order of application depends on their precedence and associative rules

(Harbison & Steele, p. 205, Table 7-3). Expressions may be reordered by the compiler based on commutative and associative properties, but only if the result is guaranteed to be same. This is not always straightforward because of integer overflow and casting (p. 254):

(1.0 + -3) + (unsigned) 1 != 1.0 + (-3 + (unsigned) 1);

A single expression statement can contain a number of assignment expressions, but multiple modifications of single identifier between successive "sequence points" is undefined (Harbison & Steele, p. 255). Seq points include end of full expression after the first operand of &&, ||, ?:, or comma operator, and after evaluation of arguments and function expression in a function call.

For function calls, function and argument expressions are evaluated and the order is not specified (Harbison & Steele, p. 214). After evaluation, argument expressions are converted to corresponding formal parameter types. Functions use a call-by-value parameter passing convention (p. 299). Conceptually, values of parameters are copied into a storage area local to the called function. Within a function, the formal parameter name is a modifiable lvalue distinct from the actual argument. When the

function returns, the expression given in the return statement is converted to the return type of the function and stored as if by assignment to the point immediately following the call in calling code (p. 279).

In addition to the familiar while and if statements, the C standard library provides setjmp and longjmp for non-local exits, and signals for processing asynchronous events.

The interface to the machine below C is often an implementation of the Portable Operating System Interface (POSIX). POSIX is a super set of the C standard library, libc. The system calls to the operating system managing the machine's resources are themselves defined in terms of C functions on C data objects. Effectively, C's driving community are the people using it as a notation to describe operating systems.

The other interface to the actual machine below is usually a mechanism for embedding machine-specific assembly code within a C program. The GCC compiler provides an extension, the asm keyword, to embed assembler instructions within C code (Using the GNU Compiler Collection, 2018). The ARM C compiler, offers a similar mechanism with its _asm keyword. Both use the GCC inline assembly syntax (GAS), which supports machine dependent

features for 54 different machines (Documentation for binutils 2.31, 2018). Consider then, that the instructions for these machine targets are also evolving notations, each being used and possibly extended by communities of users who want to solve particular problems for particular disciplines. From that point of view, C the structured programming register machine, is, indeed, not a low-level language.

References

Abelson, Harold, Sussman, Gerald Jay. 1996. Structure and Interpretation of Computer Programs (2nd ed.). Cambridge, MA: The MIT Press. https://mitpress.mit.edu/sites/default/files/sicp/full-text/book/book.html

Chisnall, David. 2018. C is not a low-level language: your computer is not a fast PDP-11. acmqueue, 16(2). https://queue.acm.org/detail.cfm?id=3212479

Cooper, Gregory H., 2008. Integrating dataflow evaluation into a practical higher-order call-by-value language. Ph.D. dissertation, Brown University, Providence, Rhode Island.

Dijkstra, Edsger. 1968. Go To Statement Considered Harmful. Communications of the ACM. 11(3), 147-148. New York, NY: ACM.

https://homepages.cwi.nl/~storm/teaching/reader/Dijkstra68.pdf

Documentation for binutils 2.31. 2018. https://sourceware.org/binutils/docs-2.31/

Drake, Adam. 2014. Command-line Tools can be 235x Faster than your Hadoop Cluster. https://adamdrake.com

Google C++ Style Guide. https://google.github.io/styleguide/cppguide.html

Harbison, Samuel P., Steele, Guy L. 2002. C: A Reference Manual. Upper Saddle River, NJ: Prentice Hall.

Pitman, Kent M. 1994. Parenthetically speaking: more than just words: lambda, the ultimate political party. 7(4), 24-29. New York, NY: ACM SIGPLAN.

Sussman, Gerald Jay. 2004. The Legacy of Computer Science. In National Research Council, Computer science, reflections on the field, reflections from the field (pp. 180-183). Washington, DC: The National Academies Press. https://groups.csail.mit.edu/mac/users/gjs/essays/remember.pdf

Using the GNU Compiler Collection (GCC). 2018. https://gcc.gnu.org/onlinedocs/gcc/index.html

Van Roy, Peter. 2009. Programming paradigms for dummies: What every programmer should know. In Gérard Assayag and Andrew

Gerzso (Eds.), New Computational Paradigms for Computer Music. France: IRCAM/Delatour. https://www.info.ucl.ac.be/~pvr/VanRoyChapter.pdf

Ye, Katherine, et. al. 2018. Notes on notation and thought. https://github.com/hypotext/notation