A friend sent me a link to an article on changes coming in microprocessors: "The Lifer: Why Your Core i7 Processor May Be Obsolete Sooner Than You Think." It got me thinking about writing this post, not because the article has any great insight, but because of the opposite: the article is too shallow.
One of the topics mentioned is specialized computing. This is nothing new. While it wasn't the beginning, many people may remember the Intel 8087 floating point coprocessor that offloaded the 8086. Earlier there was the less well known 8231A. I have linked to a copy of the datasheet if you want to see how things used to be. The 8231A paired with the 8080 microprocessor. Interestingly, considering the two companies today, the 8231 and 8231A were licensed versions of AMD's Am9511 and Am9511A, introduced in 1977. Today, we take it for granted that this floating point capability is built into the processors we use.
Throughout computing history, research agencies have driven the need for large, somewhat specialized computers. From the CDC 6600 (1964) to the Cray-1 (1976) to Nebulae (2010), floating point performance has driven a class of supercomputers designed for scientific and military research. Originally these designs employed vector processors. Today, machines like Nebulae use off-the-shelf graphics processors as general-purpose computing engines (GPGPU). In particular, NVIDIA has started marketing to this area. The problem is that modern GPUs are basically SIMD machines and bring along many of the limitations found with a SIMD architecture. Working with the limitations of SIMD and mitigating them is a big topic with a large body of work, so I won't address it in depth here. For restricted problems such as graphics rendering it is a very effective approach. At the top end, the AMD Radeon HD 6990 graphics card contains two processor chips which together yield 3072 stream processors, 192 texture units, 128 Z/stencil ROP units, and 64 color ROP units. For graphics rendering this gives amazing performance. What it is not good at is general computing. In summary, specialized computing is nothing new and has been with us for a long time. Massively parallel specialized computing is here today.
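The SIMD limitation is easier to see with a small example. The sketch below is plain Haskell using the vector package, purely as an illustration of the shape of the two computations; it is not GPU code, and the function names are my own. A uniform element-wise operation maps perfectly onto lock-step hardware, while a data-dependent branch forces a SIMD machine to execute both arms for every lane and throw away the results it does not need.

    import qualified Data.Vector.Unboxed as V

    -- SIMD-friendly: every element receives exactly the same instruction
    -- sequence, so all lanes of a wide vector unit (or all threads in a
    -- GPU "warp") stay busy doing useful work.
    scaleAndBias :: V.Vector Float -> V.Vector Float
    scaleAndBias = V.map (\x -> 2.0 * x + 1.0)

    -- SIMD-unfriendly: the branch depends on the data, so lock-step hardware
    -- must execute both arms for every lane and mask off the unwanted
    -- results -- the divergence problem mentioned above.
    branchy :: V.Vector Float -> V.Vector Float
    branchy = V.map (\x -> if x > 0 then sqrt x * sin x else log (1 + abs x))

    main :: IO ()
    main = do
      let v = V.enumFromN (-4) 9 :: V.Vector Float
      print (V.sum (scaleAndBias v), V.sum (branchy v))

Graphics rendering is dominated by the first kind of code, which is why GPUs shine there; general computing is full of the second kind.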
In the article, Myslewski talks about large numbers of general-purpose computing cores. We have made great progress utilizing four-core and even eight-core systems. There are restricted problems, such as design rule verification of large chip designs, which are amenable to massively parallel systems. However, general-purpose computing has trouble utilizing even four cores effectively. More interesting than the straightforward approach Myslewski mentions are approaches which reconsider the very nature of what a processor is. I have been thinking about this lately after watching a talk by Steve Teig of Tabula:
http://www.c-eda.org/IEEE-CEDA-DAC-061510/IEEE-CEDA-DAC-061510.html
Steve mentions Haskell as a language of choice. This transition is needed and is fundamental. We currently force-fit a one-CPU ecosystem onto multi-CPU processors. We patch language structures and manually work to make task division successful. In graphics this is somewhat straightforward: you tell the different cores, "Core 1, you work on this area of the scene; core 2, you work over here; core 3 …" Except for specialized areas such as graphics, this model does not fit what we do today when we get beyond four cores. Right now we can, at a very simplistic level, say, "Core 1, you handle operating system commands; core 2, you run the program; core 3, you take care of the antivirus background tasks; core 4 …" What is wrong here is the process and mindset itself. That's why Steve mentions Haskell (a small sketch of what I mean follows at the end of this post). The mental process I just outlined is forcing the code onto the processor. What is needed is a new paradigm of code as architecture.

I am not talking about the Tensilica approach but something closer to the work discussed here. If you read through the various papers you will see a common theme related to the problem of limited FPGA size. The idea of time as a third dimension opens the door to a possible solution. What needs to be worked out is an interface that gets around the von Neumann memory bottleneck and allows continuous reconfiguration of the FPGA. Once that is achieved, arbitrarily large code can be executed with a three-dimensional FPGA (X, Y, time) as the direct instantiation of the code. For an example of this type of FPGA check out Tabula. Be careful not to get lost in the hardware, although that is a key component. The main advantage of the hardware is the ability to latch its state and rapidly reconfigure. More important than that functionality is compiling down into the FPGA in a way that maps code to circuitry which continuously reconfigures as the code executes, rather than the cycle of execute, save state, load code, reconfigure, execute.

Let me know what you think of the concept of code as architecture.
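As promised, here is a minimal Haskell sketch of the mindset shift I have in mind. It is my own illustration, not anything from Steve's talk, and it uses the parMap combinator from the parallel package (build with ghc -threaded, run with +RTS -N). The point is that the code describes what to compute, not which core computes it; because the function is pure, the compiler and runtime are free to divide the work however the available hardware allows, which is exactly the property you would want if the "hardware" were a continuously reconfiguring fabric rather than a fixed set of cores.

    import Control.Parallel.Strategies (parMap, rdeepseq)

    -- A pure function: its result depends only on its argument, so separate
    -- evaluations cannot interfere and may run in any order, on any core.
    heavy :: Int -> Int
    heavy n = length (filter odd (map collatzLen [1 .. n]))
      where
        collatzLen :: Int -> Int
        collatzLen = go 0
          where
            go acc 1 = acc
            go acc k
              | even k    = go (acc + 1) (k `div` 2)
              | otherwise = go (acc + 1) (3 * k + 1)

    main :: IO ()
    main =
      -- Nothing here names a core. parMap rdeepseq only declares that the
      -- list elements may be evaluated in parallel; the runtime decides how
      -- to spread the work across whatever hardware is present.
      print (sum (parMap rdeepseq heavy [20000 .. 20040]))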