Is That a Fast PDP-11 Emulator in Your Pocket?

tzbitsby
Jason Kantz
Oct 2018
vol. 2

A few months after Google's Project Zero announced the Spectre and Meltdown security vulnerabilities (Horn, 2018), acmqueue published an article titled "C is not a low level language" (Chisnall, 2018). This is a great article. It has a little bit of programming language war bait, and it also goes through several aspects of computer architecture at work behind the lower levels of programming. The article defines a low-level language as something that provides an abstract machine which closely maps to the machine that hosts it, and that raises a lot of questions like "what abstract machines /are/ we using today?" It also touches on a tension between the past and the future of programming.

The major claim is this: "The root cause of the Spectre and Meltdown vulnerabilities was that processor architects were trying to build not just fast processors, but fast processors that expose the same abstract machine as a PDP-11."

The PDP-11 was before my time. I had to do some research to find out what is really meant by this. I found the PDP-11 Processor Handbook (DEC, 1969) and a paper titled "What have we learned from the PDP-11?" (Bell, 1977). Bell's paper actually goes quite well with Chisnall's article, which reads as "What have we learned from Spectre and Meltdown?" What I hear Chisnall saying is that if we want to do parallel computing well, let's stop relying so much on implicit parallelism.

Here are some facts about the PDP-11:

  • Manufactured by DEC 1970-1990.
  • PDP = Programmed Data Processor.
  • There were various models released over time at different points in the cost/performance curve.
  • 16-bit address space with a flat memory model.
  • Eight 16-bit general registers; six 64-bit floating-point registers added later.
  • 16-bit instructions: a 4-bit opcode and two operand fields, each with 3 bits for addressing mode and 3 bits for register (see the sketch after this list).
  • Memory-mapped I/O: "The Unibus enables the processor to view peripheral devices as active memory locations which perform special functions. Peripherals can thus be addressed as memory."
  • Single bus: "I/O devices transferring directly to or from memory may steal memory cycles during instruction operation."
  • Later models offered segmented memory, multiple processors, and a cache.
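
To make that instruction encoding concrete, here is a minimal sketch in Go (my own illustration, not code from the handbook) that pulls the fields out of a double-operand instruction word:

    package main

    import "fmt"

    // decode splits a PDP-11 double-operand instruction word into its fields.
    // Layout per the handbook: bits 15-12 opcode, bits 11-6 source operand
    // (3-bit addressing mode, 3-bit register), bits 5-0 destination operand.
    func decode(word uint16) (op, srcMode, srcReg, dstMode, dstReg uint16) {
        op = word >> 12
        srcMode = (word >> 9) & 07
        srcReg = (word >> 6) & 07
        dstMode = (word >> 3) & 07
        dstReg = word & 07
        return
    }

    func main() {
        // 010204 (octal) is MOV R2, R4: opcode 01, source mode 0 / R2,
        // destination mode 0 / R4.
        op, sm, sr, dm, dr := decode(010204)
        fmt.Printf("op=%o src=(mode %o, R%o) dst=(mode %o, R%o)\n", op, sm, sr, dm, dr)
    }

The octal grouping is no accident; PDP-11 documentation writes instruction words in octal precisely because the 3-bit mode and register fields line up with octal digits.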

The basic paradigm does seem to have survived many years unchanged. In 1977, Bell had some great insights into system improvement:

There is a nearly universal tendency of designers to n + 1 their systems: incrementally improve the design forever. No design is so perfect that a redesign cannot improve it. The problems faced by computer designers can usually be attributed to one of two causes: inexperience or second-systemitis. Inexperience is just a problem of resources: Are there designers available? What are their backgrounds? Can a small group work effectively on architectural specifications? Perhaps most important is the principle that no matter who the architect might be, the design must be clearly understood by at least one person. As long as there is one person who clearly understands the total design, a project can usually succeed with many inexperienced designers.

Second-systemitis is the tendency of many designers to specify a system that solves all of the problems faced by prior systems. (p. 18)

So what are some results of n+1'ing and second-systemitis in modern CPU architecture that C is no longer low-level enough to control? Candidates are:

  • Instruction level parallelism & register renaming
  • Speculative execution
  • CPU cache hierarchy
  • Vectorization (SIMD instructions)

The argument is that these complex architectural optimizations exist to deliver performance gains while programmers continue to write sequential C code. Chisnall goes through several examples showing why it is so difficult to compile C into code that uses modern hardware well.
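
Chisnall's examples are about C specifically, but the flat-memory illusion is easy to poke at in any language. Here is a rough sketch in Go (my own illustration, not from the article) that sums the same row-major matrix twice, once along memory order and once against it; the abstract machine says the two loops do identical work, while the cache hierarchy says otherwise:

    package main

    import (
        "fmt"
        "time"
    )

    const n = 2048

    func main() {
        // One flat slice, viewed as an n x n row-major matrix.
        m := make([]int64, n*n)
        for i := range m {
            m[i] = int64(i)
        }

        // Row-major traversal touches memory sequentially, so each cache
        // line fetched from RAM is fully used before moving on.
        start := time.Now()
        var sum1 int64
        for i := 0; i < n; i++ {
            for j := 0; j < n; j++ {
                sum1 += m[i*n+j]
            }
        }
        fmt.Println("row-major:   ", time.Since(start), sum1)

        // Column-major traversal strides n*8 bytes between accesses, so
        // most of each cache line goes unused and misses pile up.
        start = time.Now()
        var sum2 int64
        for j := 0; j < n; j++ {
            for i := 0; i < n; i++ {
                sum2 += m[i*n+j]
            }
        }
        fmt.Println("column-major:", time.Since(start), sum2)
    }

The exact ratio depends on the machine, but the gap between the two loops is the cache hierarchy showing through an abstract machine that claims memory is flat and uniform.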

So while a lot of complexity is hidden under the hood, it must have been worthwhile in its context. How has inherently sequential code lived on so long, even when multiple processors are available? Bell commented on this in 1977:

Although it is not surprising that multiprocessors have not been used save in highly specialized applications, it is depressing. ... It is the nature of engineering to be conservative. Given that there are already a number of risks involved in bringing a product to the market, it is not clear why one should build a higher-risk structure that may require a new way of programming. What has resulted is a sort of deadlock situation: we cannot learn how to program multiprocessors until such machines exist, but we won't build the machine until we are sure that there will be a demand for it, i.e., that the programs will be ready. (p. 34)

This implies that the current reliance on implicit parallelism has been the solution to a deadlock. Multiprocessors might not have come to the market without architectures and techniques (ILP/Register Renaming?) that allow programmers to continue writing mostly sequential code so that "programs will be ready"--because they don't have to change much!

It's kind of funny to imagine that we're all just running layers of software and at the bottom is SIMH (an emulator framework) running a PDP-11. This idea of modern computers being PDP-11 emulators of sorts got me thinking about what abstract machines /are/ in use. What do people stake their programs and products on? As Bell puts it, "A computer is not solely determined by its architecture; it reflects the technological, economic, and human aspects of the environment in which it was designed and built." (1977, p. 7)

Diehl, Hartel, and Sestoft published a literature review of abstract machines in 2000. They note that particular machine architectures can be viewed as hardware implementations of an abstract machine. SIMH emulators and the various game console emulators bear this out. It's interesting to think of the history of computing as a search for appropriate abstract machines. In the 1950s there was a notion that there could be a UNCOL--a universal computer-oriented language.

A search has been going on for an abstract machine that wins at being approachable, portable, and performant. Examples? The JVM is hugely significant, of course, and it gets the credit for bringing garbage-collected memory to the mainstream. The Android Runtime (ART) is the Java runtime that powers Android phones. WebAssembly is a stack machine for delivering code over a network (as the JVM was originally). It seeks to be fast, hardware independent, platform independent, and independent of any high-level language (WebAssembly, 2018). Ethereum is an abstract machine for distributed computing based on blockchains (Ray, 2018). LLVM is an influential abstract register machine that many languages target:

It "aims to be a "universal IR" of sorts, by being at a low enough level that high-level ideas may be cleanly mapped to it (similar to how microprocessors are "universal IR's", allowing many source languages to be mapped to them)." (LLVM 7.0.0, 2017)

The Go language runtime is an abstract machine built around goroutines (lightweight threads) and unbuffered channels to provide a Communicating Sequential Processes (CSP) approach to concurrency (Cox, 2006). Parallelism can be added by configuring the runtime to schedule goroutines on more than one thread. "Go is making a bet: more cores better software. But the hardware folks will not put more cores in their hardware, if the software is not going to use them. So it's this balancing act of staring at each other, and we're hoping that Go is going to break through on the software side." (Hudson, 2015)
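
A minimal sketch of that CSP style (my own example, not from the Go documentation): two goroutines communicate over an unbuffered channel, and the runtime decides whether they share one core or run on several.

    package main

    import (
        "fmt"
        "runtime"
    )

    func main() {
        // GOMAXPROCS controls how many OS threads run Go code at once;
        // the CSP structure of the program is the same either way.
        fmt.Println("threads:", runtime.GOMAXPROCS(0))

        ch := make(chan int)   // unbuffered: send and receive rendezvous
        done := make(chan bool)

        // Producer goroutine: sends values, then closes the channel.
        go func() {
            for i := 0; i < 5; i++ {
                ch <- i * i
            }
            close(ch)
        }()

        // Consumer goroutine: receives until the channel is closed.
        go func() {
            for v := range ch {
                fmt.Println("got", v)
            }
            done <- true
        }()

        <-done
    }

The parallelism here is a scheduling decision made by the runtime, not something written into the program; the programmer only states which processes communicate.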

TensorFlow is a recent and significant abstract machine for parallel programming. In TensorFlow you explicitly construct a computation graph and can assign different pieces of the graph to different compute devices. This highlights a choice in approaches: is the parallelism explicitly controlled by the programmer, or is it something done automatically by the machine?

In some cases explicit parallelism is simpler and easier to understand, and Chisnall seems to argue for this approach. But just as ILP extracts parallelism from sequential C code automatically, we will likely see more and more sophisticated automated approaches to parallelism.

Approachable models of parallel execution seem to be the natural next step in the portability + performance search. Fran Allen described the situation in her 2008 Grace Hopper Celebration keynote:

But there's that curve, which is Moore's Law, which has been very steady and doing well in the high-performance field. But what is happening is that the drivers of this, the basic core technologies, are in trouble. ... So that is creating a serious gap in the expectations of this area. ... So now what's the solution? Well, the solution is to move to have explicit parallelism, back to a field I know a bit about, and do it in software. And have what's on the chip, the actual technology...the parts can be much simpler, much smaller, much cooler. And put all the work that they used to do in terms of having capabilities on each of those chips, lots of heavy-duty capabilities in the transistors on these chips, onto the code that's produced by software. ... we're going to get to make some very nice advances in compilers and languages and parallelism. But it's going to mean new languages, new compilers, new systems, and it's not going to happen overnight. Fortunately, the community recognizes all of that, and the community I think itself, from various companies and certainly universities, are all coming together and cooperating on different ideas and aspects of this issue.

Maybe a new programming language will eventually come to the market with a new set of primitives. Guy Steele (2015) argues that programs need the right algebraic properties to allow compilers to adjust parallelism as needed. The most important property is associativity, which lets an operation be regrouped and therefore computed in parallel. Steele also argues for a new mindset. How many programmers have parallel prefix operators in their toolkit? Steele sums up the need for a new mindset with a nice analogy:

I argue that the management of parallelism is very much like garbage collection, it's the automatic assignment of processors rather than the automatic assignment of storage--it's the automatic allocation of code to processors rather than data to memory. And again, we are going to fear this at first because those automatic strategies are going to have overheads that will look daunting at first, but I believe that over the next several decades as we continue to engineer those the same will happen as the garbage collectors and the overheads will be reduced over time but not disappear. (2015)
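
Returning to Steele's associativity point: because addition can be regrouped, a sum over a slice can be split into independent chunks and combined afterwards in any order of grouping. A small sketch in Go (my own illustration, not Steele's code):

    package main

    import (
        "fmt"
        "sync"
    )

    // parallelSum relies only on the associativity of +: each chunk is
    // summed independently, and the partial results are combined afterwards.
    func parallelSum(xs []int, workers int) int {
        partial := make([]int, workers)
        var wg sync.WaitGroup
        chunk := (len(xs) + workers - 1) / workers
        for w := 0; w < workers; w++ {
            lo, hi := w*chunk, w*chunk+chunk
            if lo > len(xs) {
                lo = len(xs)
            }
            if hi > len(xs) {
                hi = len(xs)
            }
            wg.Add(1)
            go func(w, lo, hi int) {
                defer wg.Done()
                for _, x := range xs[lo:hi] {
                    partial[w] += x
                }
            }(w, lo, hi)
        }
        wg.Wait()
        // Combine the partial sums; associativity makes the grouping irrelevant.
        total := 0
        for _, p := range partial {
            total += p
        }
        return total
    }

    func main() {
        xs := make([]int, 1000)
        for i := range xs {
            xs[i] = i + 1
        }
        fmt.Println(parallelSum(xs, 4)) // 500500
    }

The same regrouping trick generalizes to parallel prefix (scan), where every running total is produced, not just the final one; that is the operator Steele wants in more programmers' toolkits.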

To repeat Bell, "A computer is not solely determined by its architecture; it reflects the technological, economic, and human aspects of the environment in which it was designed and built." We are certainly in a different technological, economic, and human environment than the one in which the abstract machine for C developed. That is the conflict and tension underlying "C is not a low-level language," and it is what makes the article so interesting. There is a lot of network-effect value stored up in existing C code.

Chisnall ends his article by saying "C doesn't map to modern hardware very well," and by modern hardware I think he mostly means GPUs and multithreaded CPUs. If we pay attention to Bell's lessons learned from the PDP-11, C isn't going anywhere until there is a viable parallel systems language alternative. To be a viable alternative at the low level, the language (which includes the abstract machine it's built on) needs to replace a lot of C code. It needs to be a new way of programming that doesn't introduce a lot of risk, e.g., it needs to become stable through widespread adoption. Switching costs still have to allow next year's systems to be "constant dollar systems" that are better on performance and equal on portability.

The current environment offers some interesting new abstract machines. The economics of making them better is something to watch, and it will be interesting to see how quickly the current generation of system designers and programmers breaks from the past.

References

Allen, Frances E. 2008. Grace Hopper Celebration Keynote Address. Retrieved from https://amturing.acm.org/pdf/AllenTuringTranscript.pdf

Bell, Gordon C. 1977. What Have We Learned from the PDP-11? Perspectives on Computer Science, 10th Anniversary Symposium. Carnegie Mellon University, Pittsburgh, PA. Retrieved from http://gordonbell.azurewebsites.net/cgb%20files/what%20have%20we%20learned%20from%20the%20pdp-11%201977%20c.pdf

Chisnall, David. 2018. C is not a low-level language: your computer is not a fast PDP-11. acmqueue, 16(2). Retrieved from https://queue.acm.org/detail.cfm?id=3212479

Cox, Russ. 2006. Bell Labs and CSP Threads. Retrieved from https://swtch.com/~rsc/thread/

Diehl, Stephan, P. Hartel, and P. Sestoft. 2000. Abstract machines for programming language implementation. Future Generation Computer Systems. 16(2000) 730-751. Retrieved from http://www.inf.ed.ac.uk/teaching/courses/lsi/diehl_abstract_machines.pdf

Go Authors. 2018. The Go programming language: package runtime. Retrieved from https://golang.org/pkg/runtime/

Horn, Jann. 2018. Reading privileged memory with a side-channel. Project Zero: news and updates from the Project Zero team at Google. Retrieved from https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html

Hudson, Rick. 2015. GopherCon 2015: Rick Hudson - Go GC: Solving the Latency Problem. Retrieved from https://www.youtube.com/watch?v=aiv1JOfMjm0

Jouppi, Norm. 2016. "Google supercharges machine learning tasks with TPU custom chip". Google Cloud Platform Blog. Google. Retrieved from https://cloudplatform.googleblog.com/2016/05/Google-supercharges-machine-learning-tasks-with-custom-chip.html

LLVM Language Reference Manual. 2018. Retrieved from http://releases.llvm.org/7.0.0/docs/index.html

PDP-11 Processor Handbook. 1969. Digital Equipment Corporation. Retrieved from http://research.microsoft.com/users/GBell/Digital/PDP%2011%20Handbook%201969.pdf

Ray, James, et al. 2018. A Next-Generation Smart Contract and Decentralized Application Platform. Retrieved from https://github.com/ethereum/wiki/wiki/White-Paper

Steele, Guy. 2015. Four solutions to a trivial problem. Google Tech Talk. Cambridge, MA. Retrieved from https://www.youtube.com/watch?v=ftcIcn8AmSY

Supnik, Bob. 2004. Simulators: virtual machines of the past (and future). acmqueue 2(5). Retrieved from https://queue.acm.org/detail.cfm?id=1017002

WebAssembly Community Group. 2018. WebAssembly Specification, Release 1.0. Retrieved from https://webassembly.github.io/spec/core/index.html