John Mashey on 64-bit computing
John Mashey post in comp.arch, with formatting added by me.
Date: 19 Dec 2003 21:07:56 -0800 Message-ID: <ce9d692b.0312192107.325394c0@posting.google.com>
as promised, here's the original. I'll follow up (in a couple of weeks) with a 12-years-later retrospective summarizing what happened later.
There are a few typos in this, but most of it was pretty close, unsurprising as lots of people knew all this; I just wrote it down.
In mid-1991, I somehow :-) missed predicting that true 64-bit micros:
a) Would rapidly become crucial to the Internet, i.e., via many Cisco routers.
b) Would ship in low cost videogames (Nintendo N64) by 1996.
c) Would end up in disk controllers, wireless chips, set top boxes, communications processors, laser printers, etc.
64-BIT COMPUTING
What is a 64-bit microprocessor? Why would you want one, especially in a personal system?
John R. Mashey BYTE Magazine - September 1991, 135-142.
Today's most popular computers are built around 32-bit microprocessors. The next generation of chips - 64-bit microprocessors - will bring even more power to the desktop.
But what does it mean to call a chip 64-bit? It's easy to get confused, because different numbers of bits are used in different parts of a microprocessor. (see text box "What's in a Chip?" on page 138). Although the MIPS R4000 is currently the only 64-bit microprocessor, 64 bits is almost certainly a coming trend. At microprocessor conferences, sessions on the future of chip technology routinely predict widespread use of true 64-bit microprocessors by 1995 or earlier.
You may be thinking, "My PC software still runs in 16-bit mode and it will be years before the software catches up with 32 bits. But 64 bits? People who predict widespread use of true 64-bit microprocessors by 1995 must be raving lunatics!"
There are two reasons for the prediction: 64-bit integer processing and convenient use of more than 32 bits of address space. The first reason is a straightforward performance issue; the second has more widespread implications. As you'll see, applications for 64-bit microprocessors exist for both servers and desktops.
CPU architectures
When it comes to CPU architectures, it helps to distinguish between an instruction set architecture (ISA), which presents an assembly language programmer's view of a processor, and hardware implementations of that ISA. Successful ISAs persist unchanged or evolve in an upward-compatible direction for years. Distinct implementations are often built to yield different cost/performance points. At times people get confused about the difference between ISA and implementation sizes. Table 1 may help clear up the confusion.
In figure 1, the CPU's integer registers are R bits wide. Address arithmetic starts with R bits, either producing a virtual address size of V bits (V is the generated user address, V <= R) or using a segment register to expand R bits to V bits. The memory management unit translates V bits of virtual address to A bits of physical address that are actually used to access memory. For each access, up to D bits are transferred (i.e., the data bus is D bits wide). For user-level programs, R and V are programmer-visible properties of the ISA; A and D are usually less-visible implementation-specific characteristics. (Floating-point register size is almost always 64 or 80, and so is not included.)
========================================================================
Figure 1. Efficient address arithmetic is limited by the integer
register width (R). Actual memory access is limited by the virtual
address size (V) and address bus width (A). Efficient memory I/O size
is limited by the data bus width (D).

  ------------------------------------------------
 |  Segmentation            CPU Integer registers |
 | (on some machines)             (R bits)        |
 |       |                           |            |
 |       V                           V            |
 |      Generated virtual address (V bits)        |
 |                     |                          |
 |                     V                          |
 |            Memory Management Unit              |
  ------------------------------------------------
       Physical Address |       ^  Data
             (A bits)   |       |  (D bits)
                        V       V
               External Memory System
========================================================================
Table 1 lists numbers for well-known computer families. For simplicity, V is only given for user-level programs. The table shows that physical address size (A) and data bus size can vary within a processor family. The IBM S/360 family included five data bus sizes (8 to 128 bits); the 32-bit Intel 386 is sold with two data bus sizes - 32 bits (386DX) and 16 bits (386SX).
========================================================================
Table 1: The size that a microprocessor is called is generally the
integer register size.

                                 ISA characteristics  Hardware
                                                      implementation
                                 Integer   Gen'd
                   Year   Size   register  user addr  Phys addr  Data bus
CPU                rel'd  called size (R)  size (V)   size (A)   size (D)
------------------------------------------------------------------------
DEC PDP-11/45      1973    16      16        16*         18         32
DEC PDP-11/70      1976    16      16        16*         22         32
DEC VAX 11/780     1978    32      32        31          32         64
IBM S/360          1964    32      32        24          24       8-128
IBM S/370XA        1983    32      32        31          32        128
IBM ESA/370        1988    32      32        31*         32        128
IBM RISC           1990    32      32        32*         32      64-128
  System/6000
HP Precision       1986    32      32        32*         32       32-64
Intel 386DX        1985    32      32        32*         32         32
Intel 386SX        1987    32      32        32*         24         16
Intel 860          1989    64      32        32          32         64
Intel 486DX        1989    32      32        32*         32         32
Intel 486SX        1991    32      32        32*         32         32
MIPS R2000         1986    32      32        31          32         32
MIPS R4000         1990    64      64        40-62       36         64
Motorola 68000     1980    32      32        24          24         16
Motorola 68020     1985    32      32        32          32         32
Motorola 68030     1987    32      32        32          32         32
Motorola 68040     1990    32      32        32          32         32
Sun SPARC          1987    32      32        32          36       32-64
========================================================================
* These processors use some form of segmentation to obtain more bits
  of user address space when necessary.
Better performance with bigger integers
For years, PDP-11 Unix systems have used 16-bit integers for most applications, as do many PCs. Sometimes performance can improve merely by switching to larger integers. Integer code has proved resistant to recent speedup techniques that have greatly helped floating-point performance, so any integer improvement is welcome. Some applications for 64-bit integers are the following:
- Long strings of bits and bytes.
By using 64-bit instead of 32-bit integers, some programs may run up to twice as fast. First, operating systems often spend 10 percent to 20 percent of their time zeroing memory or copying blocks of memory; doubling the integer size can often help these operations. Second, modern global optimizing compilers spend a great deal of time performing logical operations on long bit vectors, where 64-bit integers nearly double the speed. Third, the growing disparity between CPU and I/O device speed is driving increased use of compression/decompression methods, some of which rely on the main CPU, where 64 bits may be helpful.
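As a rough illustration (mine, not the article's; the function names are invented), zeroing memory one 64-bit word at a time executes half as many store instructions as the 32-bit version, assuming the buffer size is a multiple of the word size:

```c
#include <stddef.h>
#include <stdint.h>

/* Zero a buffer with 32-bit stores: one store per 4 bytes. */
void zero32(void *buf, size_t nbytes) {
    uint32_t *p = (uint32_t *)buf;
    for (size_t i = 0; i < nbytes / 4; i++)
        p[i] = 0;
}

/* Zero the same buffer with 64-bit stores: one store per 8 bytes,
   so half the loop iterations on a machine with 64-bit registers. */
void zero64(void *buf, size_t nbytes) {
    uint64_t *p = (uint64_t *)buf;
    for (size_t i = 0; i < nbytes / 8; i++)
        p[i] = 0;
}
```

The same doubling applies to block copies and to AND/OR/XOR sweeps over the long bit vectors that optimizing compilers manipulate.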
- Graphics.
Graphics operations are a special, but important, case of the long bit-and-byte-string problem. Using 64-bit integer operations can speed the work required by raster graphics. The gain is especially large for area operations like scrolling and area fill, where performance may approach a full two times that of a 32-bit CPU. This approach helps raise the graphics performance of a minimal-cost design - a CPU plus a frame buffer, without graphics-support chips.
- Integer arithmetic.
Most chips make addition and subtraction of multiprecision integers (i.e., 64-bit, 96-bit, 128-bit, etc.) reasonably fast, but multiplication and division are often quite slow. Cryptography is a heavy user of multiple-precision multiplies and divides. Financial calculations could use integer arithmetic; 32-bit integers are far too small, but 64-bit integers are easily big enough to represent quantities like the US national debt or Microsoft's annual revenue to the penny.
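The penny-accuracy claim is easy to check. A hedged sketch (the function name and the roughly $3.6-trillion 1991 debt figure are my additions, not from the article): a signed 32-bit count of cents tops out around $21 million, while a signed 64-bit count reaches roughly $92 quadrillion.

```c
#include <stdint.h>

/* INT32_MAX cents is only about $21 million; INT64_MAX cents is
   about $92 quadrillion.  So 64-bit fixed-point cents hold any
   realistic monetary amount exactly, with no floating-point
   rounding. */
int fits_in_32_bits(int64_t cents) {
    return cents >= INT32_MIN && cents <= INT32_MAX;
}
```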
Big-time addressing
Perhaps more important than using 64-bit integers for performance is the extension of memory addressing above 32 bits, enabling applications that are otherwise difficult to program. It is especially important to distinguish between virtual addressing and physical addressing.
The virtual addressing scheme often can exceed the limits of possible physical addresses. A 64-bit address can handle literally a mountain of memory: Assuming that 1 megabyte of RAM requires 1 cubic inch of space (using 4-megabit DRAM chips), 2**64 bytes would require a square mile of DRAM piled more than 300 feet high! For now, no one expects to address this much DRAM, even with next-generation 16-Mb DRAM chips, but increasing physical memory slightly beyond 32 bits is definitely a goal. With 16-Mb DRAM chips, 2**32 bytes fits into just over 1 cubic foot (not including cooling) - feasible for deskside systems.
An even more important goal is the increase of virtual addresses substantially beyond 32 bits, so you can "waste" it to make programming easier - or even just possible. Although this goal is somewhat independent of the physical memory goal, the two are related.
Database systems often spread a single file across several disks. Current SCSI disks hold up to 2 gigabytes (i.e., they use 31-bit addresses), and calculating file locations as virtual memory addresses quickly requires arithmetic on more than 32 bits. Operating systems are accustomed to working around such problems, but the workarounds grow unpleasant; rather than just making things work well, programmers struggle just to make something work.
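A hedged sketch of the offset arithmetic (the 4096-byte block size and the names are illustrative, not from the article): computing the byte offset of a block in 32-bit arithmetic silently wraps once the file passes 4 GB, while 64-bit arithmetic stays exact.

```c
#include <stdint.h>

/* Byte offset of a fixed-size block within a large logical file.
   In 32-bit arithmetic the multiply wraps modulo 2^32 once the file
   passes 4 GB; in 64-bit arithmetic it stays exact. */
uint32_t block_offset_32(uint32_t block) { return block * 4096u; }
uint64_t block_offset_64(uint64_t block) { return block * 4096u; }
```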
The physical address limit is an implementation choice that is often easier to change than the virtual address limit. For most computers, virtual memory limits often exceed physical limits, because the simplest, cheapest way to solve many performance problems is to add physical memory. If the virtual limit is much smaller than the physical limit, adding memory doesn't help, because software cannot take advantage of it. Of course, some processors use segmentation schemes to extend the natural size of the integer registers until they are equal to or greater than the physical address limit.
The mainframe, minicomputer, and microprocessor
Reflect on this aphorism:
Every design mistake gets made at least three times: once by mainframe people, once by minicomputer people, and then at least once by microprocessor people.
An illustrative sequence is found among IBM mainframes, DEC superminicomputers, and various microprocessors.
IBM S/360 mainframes used 32-bit integers and pointers but computed addresses only to 24 bits, thus limiting virtual (and physical) memory to 16 MB (see reference 1). This seemed reasonable at the time, as systems used core memory, not DRAM chips. A "large" mainframe (such as a 360/75) provided at most 1 MB of memory, although truly huge mainframes (360/91) might offer as much as 6 MB. In addition, most S/360s did not support virtual memory, so user programs generated physical addresses directly. There was little need to consider addresses larger than the physical address size. Although it was unfortunate that only 16MB was addressable, it was even worse to ignore the high-order 8 bits rather than trap on non-zero bits. Assembly language programmers "cleverly" took advantage of this quirk to pack 8 bits of flags with a 24-bit address pointer.
As memory became cheaper, the "adequate" 16-MB limit clearly became inadequate, especially as virtual-addressing S/370s made it possible to run programs larger than physical memory. By 1983, 370-XA mainframes added a 31-bit addressing mode for user programs but were required to retain a 24-bit mode for upward compatibility. Much software had to be rewritten to work in the 31-bit mode. I admit I was one of those "clever" programmers and was somewhat surprised to discover that a large program I wrote in 1970 is still running on many mainframes - in 24-bit compatibility mode, because it won't run any other way. "The evil that men do lives after them; the good is oft interred with their bones."
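The flag-packing trick can be sketched in modern C (a reconstruction of the idea, not IBM's actual code): hide 8 flag bits in the byte above a 24-bit address, which is safe only while the hardware ignores that byte.

```c
#include <stdint.h>

/* The S/360-era trick: pack 8 flag bits above a 24-bit address in
   one 32-bit word.  It works only while address hardware ignores
   the high byte; once all address bits are interpreted, the packed
   word is no longer a usable address. */
uint32_t pack_ptr(uint32_t addr24, uint8_t flags) {
    return (addr24 & 0x00FFFFFFu) | ((uint32_t)flags << 24);
}
uint32_t addr_part(uint32_t packed) { return packed & 0x00FFFFFFu; }
uint8_t  flag_part(uint32_t packed) { return (uint8_t)(packed >> 24); }
```

The last assertion below is exactly the breakage: with nonzero flags, the packed word differs from the real address, so hardware that interprets all the bits dereferences the wrong location.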
By the mid-1980s, 31-bit addressing was also viewed as insufficient for certain applications, especially databases. ESA/370 was designed with a form of segmentation to allow code to access multiple 2-gigabyte regions of memory, although it took tricky programming to do so.
In the minicomputer phase of this error, the DEC PDP-11 was a 16-bit minicomputer. Unfortunately, a single task addressed only 64 kilobytes of data and perhaps 64 KB of instructions. Gordon Bell and Craig Mudge wrote, "The biggest and most common mistake that can be made in computer design is that of not providing enough address bits for memory addressing and management. The PDP-11 followed this hallowed tradition of skimping on address bits, but was saved on the principle that a good design can evolve through at least one major change. For the PDP-11, the limited address space was solved for the short run, but not with enough finesse to support a large family of minicomputers. This was indeed a costly oversight." (See reference 2.)
Some PDP-11/70 database applications rapidly grew awkward on machines with 4 MB of memory that could only be addressed in 64-KB pieces, requiring unnatural acts to break up simple programs into pieces that would fit. Although the VAX-11/780 was not much faster than the PDP-11/70, the increased address space was such a major improvement that it essentially ended the evolution of high-end PDP-11s. In discussing the VAX-11/780, William Strecker wrote, "For many purposes, the 65-Kbyte virtual address space typically provided on minicomputers such as the PDP-11 has not been and probably will not continue to be a severe limitation. However, there are some applications whose programming is impractical in a 65-Kbyte address space, and perhaps more importantly, others whose programming is appreciably simplified by having a large address space." (See reference 3.)
Finally, we come to microprocessors. The Intel 8086 was a 16-bit architecture and thus, likely to fall prey to the same issues as the PDP-11. Fortunately, unlike the PDP-11, it at least provided a mechanism for explicit segment manipulation by the program. This made it possible for a single program to access more than 64 KB of data, although it took explicit action to do so. Personal computer programmers are familiar with the multiplicity of memory models, libraries, compiler flags, extenders, and other artifacts needed to deal with the issues.
The Motorola MC68000 started with a more straightforward programming model, since it offered 32-bit integer registers and no segmentation. However, by ignoring the high 8 bits of a 32-bit address computation, it repeated the same mistake made 15 years earlier by the IBM S/360. Once again, "clever" programmers found uses for those bits, and when the MC68020 interpreted all 32 bits, programs broke. BYTE readers may recall problems with some applications when moving from the original Macintosh to the Mac II.
The need for big computers
Two common rules of thumb are that DRAM chips get four times bigger every three years and that virtual memory usage grows by a factor of 1.5 to 2 per year (see reference 4). Additional memory is often the cheapest and easiest solution to performance problems, but only if software can easily take advantage of it.
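Those growth rates turn into a back-of-envelope projection (my sketch; the 1-GB starting point is an assumed leading-edge 1991 virtual image): at a doubling per year, 1 GB reaches the 4-GB limit in two years, i.e., 1993; at the slower 1.5x rate, in about three and a half years.

```c
#include <stdint.h>

/* Whole years until a memory demand, doubling each year, reaches
   the 4-GB (2^32-byte) limit.  At 1.5x-per-year growth the same
   crossing from 1 GB takes about 3.4 years instead. */
int years_to_32bit_limit(uint64_t start_bytes) {
    int years = 0;
    while (start_bytes < (UINT64_C(1) << 32)) {
        start_bytes *= 2;
        years++;
    }
    return years;
}
```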
As the natural size of code and data reaches and then exceeds some virtual address limit, the level of programming pain rises rapidly, because programmers must resort to more and more unnatural restructuring. If the virtual address limit is lower than the physical limit, it is especially irritating, since buying DRAM won't do you any good. Fortunately, the virtual address limit is typically larger than the physical limit, so programs may work but run slowly. In this case, you can at least add physical memory until performance becomes adequate.
There is no definite ratio between maximum task virtual-address limits and physical address limit. Conversations with many people have convinced me that a 4-to-1 ratio is reasonable (i.e., you will actually see practical programs four times bigger than physical memory) if the operating system can support them. Some people claim that a ratio of 4 to 1 is terribly conservative and that advanced file-mapping techniques (as in Multics or Mach) use up virtual memory much faster than physical memory. Certainly, in the process of chip design and simulation at Mips Computer Systems, some of our 256-MB servers routinely run programs with virtual images that are four to eight times larger (1 to 2 gigabytes). Several companies (including Mips) already sell desktops with 128 MB of memory. With 16-Mb DRAM chips, similar designs will soon hit 512 MB - enough to have programs that could use at least 4 gigabytes of virtual memory.
32-BIT CRISIS IN 1993
Consider the history of microprocessor-based servers from Mips Computer Systems and Sun Microsystems. Figure 2 shows that the 32-bit limit will become an issue even for physical memory around 1993 or 1994.
As soon as 16-Mb DRAM chips are available, some microprocessors will be sold with 2 to 4 gigabytes of main memory - in fact, just by replacing memory boards in existing cabinets. You may now be convinced that Sun and Mips designers must be crazy to think of such things; but if so, they have plenty of company from others, like those at Silicon Graphics, Hewlett Packard, and IBM. Keeping pace with DRAM growth requires appropriate CPU chips in 1991 so that tools can be debugged in 1992 and applications debugged by 1993 or 1994 - barely in time.
========================================================================
HITTING THE 32-BIT LIMIT

Figure 2: The memory sizes of a Mips machine and a Sun machine, year by
year, using a logarithmic scale. The data points fall on a straight
line, gaining 2 bits every 3 years, as they naturally follow DRAM
curves. The top line shows virtual memory size at four times the
maximum physical memory size, hinting that large leading-edge
applications may already be pushing 32-bit limits in 1991 (and they
are). The line below shows physical memory size at 50 percent of
maximum size. Vendors actually sell a substantial number of such
machines.

[I can't draw it here: it has a vertical axis in number of bits, with a
band of points going from lower left to upper right.]

1991: 32-bit trouble for leading-edge systems
1994: 32-bit trouble for many systems
========================================================================
Why so much memory?
Finally, look at applications that put pressure on the size of virtual memory addressing. To handle virtual memory greater than 32 bits, you need either segmentation or 64-bit integer registers.
Why 64 and not something smaller, like 48? It is difficult to introduce a new architecture that runs the C language poorly. C prefers byte-addressed machines whose number of 8-bit bytes per word is a power of 2. The use of 6 bytes per word requires slow addressing hardware and breaks many C programs, so 64 is the next step after 32.
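The addressing-hardware point shows up in the offset arithmetic a compiler must generate for array indexing (a sketch of mine; the 6-byte word is hypothetical): with power-of-2 element sizes the multiply is a single shift, while a 6-byte word forces a genuine multiply and leaves no clean bit fields in the address.

```c
#include <stdint.h>

/* Array indexing compiles to offset = index * element_size.
   With 8-byte words, that multiply is one shift; with hypothetical
   6-byte (48-bit) words it is a real multiply, and no part of the
   address is a power-of-2-aligned bit field. */
uint64_t offset_8byte_words(uint64_t index) { return index << 3; }
uint64_t offset_6byte_words(uint64_t index) { return index * 6; }
```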
Segmentation may or may not be an acceptable solution, but there is insufficient space here to debate the relative merits. Suffice it to say that many people with segmentation experience consider it a close encounter of a strange kind.
The following applications tend to consume virtual memory space quickly and generally prefer convenient addressing of large memory space, whether it's contiguous or sparse.
- Databases.
Modern operating systems increasingly use file mapping, in which an entire file is directly mapped into a task's virtual memory. Since you can leave empty space for the file to grow, virtual memory is consumed much faster than physical memory. As CPUs rapidly increase their performance relative to disk-access speeds, disk accesses are often avoided by keeping disk blocks in large DRAM cache memories. Database managers on mainframes have long felt the pressure here, as many installations are already above 2**40 bytes. Distributed systems designs often use some bits of the address as a system node address.
- Video.
For uncompressed video, a 24-bit color, 1280-by-1024-pixel screen needs 3.75 MB of memory. At 24 frames per second, 4 gigabytes of memory is consumed by only 45 seconds of video.
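The arithmetic checks out, as a small sketch (mine) confirms: 1280 x 1024 pixels at 3 bytes each is 3,932,160 bytes per frame, and 2**32 bytes divided by 24 such frames per second is 45 whole seconds.

```c
#include <stdint.h>

/* Bytes per uncompressed 24-bit-color frame, and whole seconds of
   such video that fit in a 32-bit (4-GB) address space. */
uint64_t frame_bytes(uint64_t width, uint64_t height) {
    return width * height * 3;            /* 3 bytes per pixel */
}
uint64_t seconds_in_4gb(uint64_t width, uint64_t height, uint64_t fps) {
    return (UINT64_C(1) << 32) / (frame_bytes(width, height) * fps);
}
```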
- Images.
At 300 dots per inch, a 24-bit-color, 8 1/2- by 11-inch page uses 25 MB, so 4 gigabytes is filled by 160 of these pages. Databases of such objects get large very quickly.
- CAD.
CAD applications often include large networks of servers and desktops, in which the servers manage the databases and run large simulations. They naturally can make use of 64-bit software. Desktops navigate through the huge databases, and although they are not likely to map in as much data at one time as the servers, software compatibility is often desirable.
- Geographic information systems.
These systems combine maps, images, and other data and have most of the stressful characteristics of video, images, and databases.
- Traditional number crunching.
Of course, technical number-crunching application developers have never been satisfied with any memory limits on any machine that exists.
On the desktop?
Perhaps you now believe that 64-bit servers may be reasonable, but you still wonder about the desktop. Table 2 lists the application areas discussed, showing whether the primary use of 64-bit systems is for speed (either in desktop or server); for addressing large amounts of data simultaneously; or for using software in a desktop system identically to its use in a server but with less actual data. Such compatibility is likely to be crucial for CAD applications but is also important for others, if only to get software development done.
===========================================================================
APPLICABILITY OF 64 BITS

Table 2: The applicability of 64 bits differs for servers and desktop
systems.

                     Server               Workstation
                 ----------------    --------------------
Application      Speed  Addressing   Speed  Compatibility
---------------------------------------------------------
Byte pushing       X                   X
Graphics                               X
Big integers       X                   X         X
Database           X        X
Video                       X
Image                       X                    X
CAD                         X                    X
GIS*                        X                    X
Number crunch      X        X

* Geographic information systems
===========================================================================
For most readers, 64 bits is likely to be most important as an enabling technology to bring powerful new applications to the desktop. The history of the computing industry, especially of personal computers, shows there is some merit to thinking ahead. Some of us remember when a 640-KB limit was considered huge.
As 64-bit systems become available, some of the number-crunching people will recompile their FORTRAN programs immediately, and some other developers will start working in this direction. However, I'd expect only a small fraction of applications to jump to 64 bits quickly. For example, I do not expect to see 64-bit word processors soon. [Editor's note: However, see "ASCII Goes Global," July BYTE.] As a result, an important part of 64-bit chip and software design is the ability to mix 32-bit and 64-bit programs on the same system.
Although 64-bit applications may be relatively few, some are absolutely crucial and some are indirectly important to many people. You've probably seen vendors' predictions of huge numbers of transistors per chip over the next few years. Although you may not do electrical CAD yourself, you may buy a system with those big chips; so, somewhere people will be running programs to simulate those big chips, and those programs are huge.
I often give talks that compare computers to cars, using the CPU chip as the engine, exception handling as the brakes, and so forth. What kind of car is a 64-bit computer? Think of it as a car with four-wheel drive that you engage when necessary for better performance, but especially when faced with really tough problems, like driving up mountainsides. You wouldn't engage four-wheel drive to go to the grocery store, but when you'd need it, you'd need it very badly. Some people already have problems that require 64-bit processing, and soon more will. The necessary vehicles - 64-bit microprocessors - are on the way.
REFERENCES
1. Prasad, N.S. IBM Mainframes: Architecture and Design. New York: McGraw-Hill, 1989.
2. Bell, C. Gordon, and J. Craig Mudge. "The Evolution of the PDP-11." In Computer Engineering: A DEC View of Computer System Design, edited by C. Gordon Bell, J. Craig Mudge, and John E. McNamara. Bedford, MA: Digital Press, 1978.
3. Strecker, William D. "VAX-11/780: A Virtual Address Extension to the DEC PDP-11 Family." In Computer Engineering: A DEC View of Computer System Design, edited by C. Gordon Bell, J. Craig Mudge, and John E. McNamara. Bedford, MA: Digital Press, 1978.
4. Hennessy, John L., and David A. Patterson. Computer Architecture: A Quantitative Approach. San Mateo, CA: Morgan Kaufmann, 1990.