Section Three: The Great Dark Cloud Falls: IBM's Choice.

Part I: DEC PDP-11, benchmark for the first 16/32 bit generation. (1970) . . . .

The DEC PDP-11 was the most popular member of the PDP (Programmed Data Processors) line of minicomputers, a successor to the previously popular PDP-8, designed in part by Gordon Bell. It remained in production until the decision to discontinue the line as of September 30, 1997 (over 25 years - see note on the DEC Alpha intended lifetime). Many PDP-11 features have been carried forward to newer processors because the PDP-11 was the basis for the C programming language, which became the most prolific programming language in the world (in terms of variety of applications, not number), and which includes several low level processor dependent features that were useful to replicate in newer CPUs for this reason.

The PDP-8 continued for a while in certain applications, while the PDP-10 (1967) was a higher capacity 36-bit mainframe-like system (sixteen general registers and floating point operations), much adored and rumoured to have souls.

The PDP-11 had eight general purpose 16-bit registers (R0 to R7 - R6 was also the SP and R7 was the PC). It featured powerful register oriented (little-endian, byte addressable) addressing modes. Since the PC was treated as a general purpose register, constants were loaded using an autoincrement indirect mode on R7, which had the effect of loading the 16 bit word following the current instruction, then incrementing the PC past it to the next instruction before fetching. The SP could be accessed the same way (and any register could be used for a user stack (useful for FORTH)). A CC (or PSW) register held condition codes set by every instruction that executed.
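
This constant-loading trick can be illustrated with a toy sketch (not a real PDP-11 emulation - the instruction encoding here is simplified to put the destination register in the low bits):

```python
# Toy sketch of how "MOV (R7)+, Rn" loads a constant: since R7 is the PC,
# autoincrement-indirect on R7 reads the word following the instruction,
# then steps the PC past it to the next instruction.

def step_mov_immediate(memory, regs):
    """Execute a hypothetical 'MOV (R7)+, Rn' at the address in regs[7]."""
    PC = 7
    opcode_word = memory[regs[PC]]      # instruction word (dest register in low 3 bits here)
    dest = opcode_word & 0o7
    regs[PC] += 2                       # PC now points at the inline constant
    regs[dest] = memory[regs[PC]]       # autoincrement mode: read the word at (R7)...
    regs[PC] += 2                       # ...then bump R7 past it, landing on the next instruction
    return regs

memory = {0o1000: 0o0, 0o1002: 0o12345}  # toy instruction at 0o1000, inline constant after it
regs = [0] * 8
regs[7] = 0o1000
step_mov_immediate(memory, regs)
assert regs[0] == 0o12345                # constant landed in R0
assert regs[7] == 0o1004                 # PC stepped past the constant
```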

Adjacent registers could be implicitly grouped into a 32 bit register for multiply and divide results (a multiply result is stored in two registers if the destination is an even register, but not if it's odd; a divide source must be grouped - the quotient is stored in the high order (low numbered) register, the remainder in the low order one).
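
A minimal sketch of the divide semantics just described (assumed behaviour, unsigned values only):

```python
# PDP-11 style register-pair divide: the 32-bit dividend lives in an even/odd
# register pair, the quotient lands in the even (low-numbered, high-order)
# register and the remainder in the odd one.

def pdp11_div(regs, even, src):
    """DIV src, R<even>: divide the 32-bit value in R<even>:R<even+1> by src."""
    assert even % 2 == 0, "dividend must be in an even/odd pair"
    dividend = (regs[even] << 16) | regs[even + 1]   # even register holds the high-order half
    regs[even] = dividend // src                      # quotient -> even register
    regs[even + 1] = dividend % src                   # remainder -> odd register
    return regs

regs = [0] * 8
regs[0], regs[1] = 0x0001, 0x0000   # R0:R1 = 0x10000 = 65536
pdp11_div(regs, 0, 10)
assert regs[0] == 6553              # quotient
assert regs[1] == 6                 # remainder
```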

A floating point unit could be added, containing six 64 bit accumulators (AC0 to AC5, which could also be used as six 32-bit registers - values could only be loaded or stored using the first four).

PDP-11 addresses were 16 bits, limiting program space to 64K, though an MMU could be used to expand total address space (18-bits and 22-bits in different PDP-11 versions).

The LSI-11 (1975-ish) was a popular microprocessor implementation of the PDP-11 using the Western Digital MCP1600 microprogrammable CPU, and the architecture influenced the Motorola 68000, NS 320xx, and Zilog Z-8000 microprocessors in particular. There were also plans for a 32-bit PDP-11 as far back as its 1969 introduction. The PDP-11 was finally replaced by the VAX architecture (early versions included a PDP-11 emulation mode, and were called VAX-11).


Part II: TMS 9900, first of the 16 bits (June 1976) . .

One of the first true 16 bit microprocessors was the TMS 9900, by Texas Instruments (the first were probably the National Semiconductor PACE or IMP-16P, or AMD 2901 bit slice processors in a 16 bit configuration). It was designed as a single chip version of the TI 990 minicomputer series, much like the Intersil 6100 was a single chip PDP-8, and the Fairchild 9440 and Data General mN601 were both one chip versions of Data General's Nova. Unlike the 6100, however, the TMS 9900 had a mature, well thought out design.

It had a 15 bit address space and two internal 16 bit registers. One unique feature, though, was that all user registers were actually kept in memory - this included stack pointers and the program counter. A single workspace register pointed to the 16 register set in RAM, so when a subroutine was entered or an interrupt was processed, only the single workspace register had to be changed - unlike some CPUs which required a dozen or more register saves before acknowledging a context switch.
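
The workspace idea can be sketched briefly (names and memory layout here are illustrative, not from the actual chip):

```python
# TMS 9900 workspace sketch: all sixteen "registers" are just sixteen consecutive
# words in RAM, located by a single workspace pointer (WP). A context switch
# only has to change WP - the whole register set moves with one assignment.

ram = [0] * 256

def reg_read(wp, n):
    return ram[wp + n]          # register n of the workspace at wp

def reg_write(wp, n, value):
    ram[wp + n] = value

# Two tasks, each with its own 16-word workspace in RAM:
wp_task_a, wp_task_b = 0, 16
reg_write(wp_task_a, 0, 111)    # task A's R0
reg_write(wp_task_b, 0, 222)    # task B's R0

wp = wp_task_a
assert reg_read(wp, 0) == 111
wp = wp_task_b                  # "context switch": no register saving required
assert reg_read(wp, 0) == 222
```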

This was feasible at the time because RAM was often faster than the CPUs. A few modern designs, such as the INMOS Transputers, use this same approach with caches or rotating buffers, for the same reason of improved context switches. Other chips of the time, such as the 650x series, had a similar philosophy, using index registers, but the TMS 9900 went the farthest in this direction. Later versions added a write-through register buffer/cache.

That wasn't the only positive feature of the chip. It had good interrupt handling features and a very good instruction set. Serial I/O was available through address lines. In typical comparisons with the Intel 8086, the TMS 9900 had smaller and faster programs. The only disadvantages were the small address space and the need for fast RAM. The TMS 9995 was a later version, and the 99000 added fast on-chip memory and several instructions (arithmetic, stack, parallel I/O, in-memory bit manipulation) and an expanded addressing range. The 99110 added floating point support (as a sort of on-chip ROM library, I believe - the 99105 version dropped the floating point routines). Memory access could be further expanded with the 99610 MMU.

Despite the very poor support from Texas Instruments, the TMS 9900 had the potential at one point to surpass the 8086 in popularity. TI also produced an embedded version, the TMS 9940.

TI-99/4A Shrine - TI9900 CPU History
The Minicomputer Orphanage

Part III: Zilog Z-8000, another direct competitor . . . .

The Z-8000 was introduced not long after the 8086, but had superior features. It was basically a 16 bit processor, but could address up to 23 bits in some versions by using segment registers (to supply the upper 7 bits). There was also an unsegmented version, but both could be extended further with an additional MMU that used 64 segment registers. The Z-8070 was a memory mapped FPU.

Internally, the Z-8000 had sixteen 16 bit registers, but register size and use were exceedingly flexible - the first eight Z-8000 registers could be used as sixteen 8 bit subregisters (identified RH0, RL0, RH1 ...), or all sixteen could be grouped into eight 32 bit registers (RR0, RR2, RR4 ...), or four 64 bit registers. They were all general purpose registers - the stack pointer was typically register 15, with register 14 holding the stack segment (both accessed as one 32 bit register (RR14) for painless address calculations). The instruction set included 32-bit multiply (into 64 bits) and divide.
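
The register overlap can be sketched by modelling the register file as raw bytes (big-endian layout assumed for illustration):

```python
# Z-8000 register overlap sketch: RH0/RL0 are the two halves of R0, RR0 is the
# pair R0:R1 viewed as one 32-bit value, and so on down the register file.

regfile = bytearray(32)   # sixteen 16-bit registers = 32 bytes

def set_r(n, value):      # write 16-bit register Rn
    regfile[2*n:2*n+2] = value.to_bytes(2, "big")

def get_rr(n):            # read 32-bit register pair RRn (n even): Rn is the high half
    return int.from_bytes(regfile[2*n:2*n+4], "big")

def get_rh(n):            # read 8-bit subregister RHn (high byte of Rn, n < 8)
    return regfile[2*n]

set_r(0, 0x1234)
set_r(1, 0x5678)
assert get_rr(0) == 0x12345678   # RR0 sees R0 and R1 as one 32-bit value
assert get_rh(0) == 0x12         # RH0 is the high byte of R0
```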

The Z-8000 was one of the first to feature two modes, one for the operating system and one for user programs. The user mode prevented the user from messing about with interrupt handling and other potentially dangerous stuff (each mode had its own stack register).

Finally, like the Z-80, the Z-8000 featured automatic RAM refresh circuitry. Unfortunately the processor was somewhat slow, but the features generally made up for that.

A later version, the Z-80000, was introduced at the beginning of 1986, at about the same time as the 32 bit MC68020 and Intel 80386 CPUs, though the Z-80000 was quite a bit more advanced. It was fully expanded to 32 bits internally, giving it sixteen 32 bit physical registers (the 16 bit registers became subregisters) and doubling the number of 32 bit and 64 bit registers (sixteen 8-bit and 16-bit subregisters, sixteen 32-bit physical registers, eight 64-bit double registers). The system stack remained in RR14.

In addition to the addressing modes of the Z-8000, larger 24 bit (16MB) segment addressing was added, as well as an integrated MMU (absent in the 68020 but added later in the 68030), which included an on chip 16 line, 256-byte, fully associative write-through cache (which could be set to cache only data, instructions, or both, and could also be frozen by software once 'primed' - a feature also found on later versions of the AMD 29K). It also featured multiprocessor support by defining some memory pages to be exclusive and others to be shared (and non-cacheable), with separate memory signals for each (including GREQ (Global memory REQuest) and GACK lines). There was also support for coprocessors, which would monitor the data bus and identify instructions meant for them (the CPU had two coprocessor control lines (one in, one out), and would produce any needed bus transactions).

Finally, the Z-80000 was fully pipelined (six stages), while the fully pipelined 80486 and 68040 weren't introduced until 1991.

But despite being technically advanced, the Z-8000 and Z-80000 series never met mainstream acceptance, due to initial bugs in the Z-8000 (the complex design did not use microcode - it used only 17,500 transistors) and to delays in the Z-80000. There was a radiation resistant military version, and a CMOS version of the Z-80000 (the Z-320). Zilog eventually gave up and became a second source for the AT&T WE32000 32-bit (1986) CPU instead (a VAX-like microprocessor derived from the Bellmac 32A minicomputer, which also became obsolete).

The Z-8001 was used for Commodore's CBM 900 prototype, but the Unix based machine was never released - instead, Commodore bought Amiga, and released the 68000 based machine it was designing. A few companies did produce Z-8000 based computers, with Olivetti being the most famous, and the Plexus P40 being the last - the 68000 quickly became the processor of choice, although the Z8000 continued to be used in embedded systems.

ZiLOG Product Specifications:

Part IV: Motorola 68000, a refined 16/32 bit CPU (September 1979) . . . . . . . . .

The initial 8MHz 68000 was actually a 32 bit architecture internally, but had only a 16 bit data bus and 24 bit address bus to fit in a 64 pin package (address and data shared a bus in the 40 pin packages of the 8086 and Z-8000). Later the 68008 reduced the data bus to 8 bits and the address bus to 20 bits (very slow and not used for much - the cheap and quirky Sinclair QL being the most prominent use), and the 68020 was fully 32 bit externally. Addresses were computed as 32 bits (without using segment registers) - the unused upper bits were ignored in the 68000 and 68008, but some programmers stored type tags in the upper 8 bits, causing compatibility problems with the 68020's 32 bit addresses. The lack of forced segments made programming the 68000 easier than some competing processors, without the 64K size limit on directly accessed arrays or data structures.

Looking back it was a logical design decision, since most 8 bit processors featured direct 16 bit addressing without segments.
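
The tag-bit compatibility problem above comes down to simple masking, and can be sketched directly:

```python
# Why upper-byte tags broke on the 68020: the 68000's 24-bit bus ignores the
# top 8 address bits, so a pointer with a tag in those bits still reaches the
# same memory location; with full 32-bit addressing it points somewhere else.

def bus_address_68000(pointer):
    return pointer & 0x00FFFFFF      # only 24 address lines leave the chip

def bus_address_68020(pointer):
    return pointer & 0xFFFFFFFF      # all 32 bits are significant

tagged = 0x2A001234                  # 0x2A "type tag" smuggled into the upper byte
plain  = 0x00001234

assert bus_address_68000(tagged) == bus_address_68000(plain)   # tag is harmless on a 68000
assert bus_address_68020(tagged) != bus_address_68020(plain)   # tag now corrupts the address
```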

The 68000 had sixteen 32-bit registers, split into eight data and address registers. One address register was reserved for the Stack Pointer. Data registers could be used for any operation, including offset from an address register, but not as the source of an address itself. Operations on address registers were limited to move, add/subtract, or load effective address.

Like the Z-8000, the 68000 featured a supervisor and user mode (each with its own Stack Pointer). The Z-8000 and 68000 were similar in capabilities, but the 68000 used 32 bit units internally (though with 16 bit ALUs - two in parallel for 32-bit data, one for addresses - making some 32-bit operations slower than 16-bit ones), making it faster overall and eliminating forced segments. It was designed for expansion, including specifications for floating point and string operations (floating point was added in the 68040 (1991), with eight 80 bit floating point registers compatible with the 68881/2 coprocessor). Like many other CPUs of the time, the 68000 could fetch the next instruction during execution (a 2 stage pipeline). An instruction prefix (0xF) indicated coprocessor instructions (similar to the 80x86), so the coprocessor could "listen" to the instruction stream, and execute instructions it recognized, without a coprocessor bus.

The 68010 (1982) added virtual memory support (the 68000 couldn't restart interrupted instructions) and a special loop mode - small decrement-and-branch loops could be executed from the instruction fetch buffer. The 68020 (1984) expanded the external data and address buses to 32 bits, added a simple 3-stage pipeline and a 256 byte cache (loop buffer), and could use either a segmented (68451?) or paged (68851 - it supported two level pages (logical, physical) rather than the segment/page mapping of the Intel 80386 and IBM S/360 mainframe) memory management unit. The 68020 also added a coprocessor interface. The 68030 (1987) integrated the paged MMU onto the chip. The 68040 (January 1991) added fully cached Harvard busses (4K each for data and instructions, with a new MMU), a 6 stage pipeline, and an on chip FPU (a subset of the 68882, with some operations emulated).

Someone told me a Motorola techie indicated the 68000 was originally planned to use the IBM S/360 instruction set, but the MMU and architectural differences make this unlikely. The 68000 design was later involved in microprocessor versions of the IBM S/370.

The 68060 (April 1994) expanded the design to a superscalar version, like the Intel Pentium and NS320xx (Swordfish) series before it. Like the National Semiconductor Swordfish, and later the Nx586, AMD K5, and Intel's "Pentium Pro", the third stage of the 10-stage 68060 pipeline translates the 680x0 instructions to a decoded RISC-like form (stored in a 16 entry buffer in stage four). There is also a branch cache, and branches are folded into the decoded instruction stream like the AT&T Hobbit and other more recent processors, then dispatched to two pipelines (three stages: decode, address generate, operand fetch) and finally to two of three execution units (2 integer, 1 floating point) before reaching two 'writeback' stages. Cache sizes are doubled over the 68040.

The 68060 also includes many innovative power-saving features (3.3V operation, execution unit pipelines could actually be shut down, reducing power consumption at the expense of slower execution, and the clock could be reduced to zero), so power use is lower than the 68040's (3.9-4.9 watts vs. 4-6). Another innovation is that simple register-register instructions which don't generate addresses may use the address stage ALU to execute 2 cycles early.

The embedded market became the main market for the 680x0 series after workstation vendors (and the Apple Macintosh) turned to faster load-store processors, so a variety of embedded versions were introduced. Later, Motorola designed a successor called Coldfire (early 1995), in which complex instructions and addressing modes (added to the 68020) were removed and the instruction set was recoded, simplifying it at the expense of compatibility (source only, not binary) with the 680x0 line.

The Coldfire 52xx (version 2 - the 51xx version 1 was a 68040-based/compatible core) architecture resembles a stripped (single pipeline) 68060. The 5 stage pipeline is literally folded over itself - after two fetch stages and a 12-byte buffer, instructions pass through the decode and address generate stages, then loop back, so the decode stage becomes the operand fetch stage, and the address generate stage becomes the execute stage (so only one ALU is required for address and execution calculations). Simple (non-memory) instructions don't need to loop back. There is no translator stage as in the 68060 because Coldfire instructions are already in RISC-like form. The 53xx added a multiply-accumulate (MAC) unit and internal clock doubling. The 54xx adds branch and assignment folding with other instructions for a cheap form of superscalar execution with little added complexity, and uses a Harvard architecture for faster memory access, plus enhancements to the instruction set to improve code density and performance, and to add flexibility to the MAC unit.

At a quarter the physical size and a fraction of the power consumption, Coldfire is about as fast as a 68040 at the same clock rate, but the smaller design allows a faster clock rate to be achieved.

Few people wonder why Apple chose the Motorola 68000 for the Macintosh, while IBM's decision to use Intel's 8088 for the IBM PC has baffled many. It wasn't a straightforward decision though. The Apple Lisa was the predecessor to the Macintosh, and also used a 68000 (eventually - 8086 and slower bitslice CPUs (which Steve Wozniak thought were neat) were initially considered before the 68000 was available). It also included a fully multitasking, GUI based operating system, highly integrated software, high capacity (but incompatible) 'twiggy' 5 1/4" disk drives, and a large workstation-like monitor. It was better than the Macintosh in almost every way, but was correspondingly more expensive.

The Macintosh was to include the best features of the Lisa, but at an affordable price - in fact the original Macintosh came with only 128K of RAM and no expansion slots. Cost was such a factor that the 8 bit Motorola 6809 was the original design choice, and some prototypes were built, but they quickly realised that it didn't have the power for a GUI based OS, and they used the Lisa's 68000, borrowing some of the Lisa low level functions (such as graphics toolkit routines) for the Macintosh.

Competing personal computers such as the Amiga and Atari ST, and early workstations by Sun, Apollo, NeXT and most others also used 680x0 CPUs, including one of the earliest workstations, the Tandy TRS-80 Model 16, which used a 68000 CPU and Z-80 for I/O and VM support - the 68000 could not restart an instruction stopped by a memory exception, so it was suspended while the Z-80 loaded the page. Early Apollo workstations used a similar solution with a second 68000 handling paging.

Amiga - So the World May Know:
Atari compatible Milan computers:
Atari compatible Medusa computers:
Sinclair QL homepage:

Part V: National Semiconductor 32032, similar but different . . . .

Like the 68000, the 320xx family consisted of a CPU which was 32-bit internally, and 16 bits externally (later also 32 and 8), as indicated by the first and last two digits (originally reversed, but 16032 just seemed less impressive). It appeared a little later than the others here, and so was not really a choice for the IBM PC, but is still representative of the era.

Elegance and regular design were main goals of this processor, as well as completeness. It was similar to the 68000 in basic features, such as byte addressing, a 24-bit address bus in the first version, memory to memory instructions, and so on (the 320xx also included string and array instructions). Unlike the 68000, the 320xx had eight 32-bit registers instead of sixteen, and they were all general purpose, not split into data and address registers. There was also a useful scaled-index addressing mode, and unlike other CPUs of the time, only a few operations affected the condition codes (as in more modern CPUs).

Also different, the PC and stack registers were separate from the general register set - they were special purpose registers, along with the interrupt stack, and several "base registers" to provide multitasking support - the base data register pointed to the working memory of the current module (or process), the interrupt base register pointed to a table of interrupt handling procedures anywhere in memory (rather than a fixed location), and the module register pointed to a table of active modules.

The 320xx also had a coprocessor bus, similar to the 8-bit Ferranti F100-L CPU, and coprocessor instructions. Coprocessors included an MMU, and a Floating Point unit which included eight 32-bit registers, which could be used as four 64-bit registers.

The series found use mainly in embedded applications, and was expanded to that end, with timers, graphics enhancements, and even a Digital Signal Processor unit in the Swordfish version (1991, also known as the 32732 and 32764). The Swordfish was among the first truly superscalar microprocessors, with two 5-stage pipelines (integer A, and B, which consisted of an integer and a floating point pipeline - an instruction dispatched to B would execute in the appropriate pipe, leaving the other with an empty slot. The integer pipe could cycle twice in the memory stage to synchronise with the result of the floating point pipe, to ensure in-order completion when floating point operations could trap. B could also execute branches). This strategy was influenced by the Multiflow VLIW design. Instructions were always fetched two at a time from the instruction cache, which partially decoded the instruction pairs and set a bit to indicate whether they were dependent or could be issued simultaneously (effectively generating two-word VLIWs in the cache from an external stream of instructions). The cache decoder also generated branch target addresses to reduce branch latency, as in the AT&T CRISP/Hobbit CPU.

The Swordfish implemented the NS32K instruction set using a reduced instruction core - NS32K instructions were translated by the cache decoder into either: one internal instruction, a pair of internal instructions in the cache, or a partially decoded NS32K instruction which would be fully decoded into internal instructions after being fetched by the CPU. The Swordfish also had dynamic bus resizing (8, 16, 32, or 64 bits, allowing 2 instructions to be fetched at once) and clock doubling, 2 DMA channels, and in circuit emulation (ICE) support for debugging.

The Swordfish was later simplified into a load-store design and used to implement an instruction set called CompactRISC (also known as Piranha, an implementation independent instruction set supporting designs from 8 to 64 bits). CompactRISC has been implemented in three stage, 16-bit (CR16A), 20-bit (CR16B), and 32-bit (CR32A) address versions (the CR16B also included bit-oriented memory operations).

It seems interesting to note that in the case of the NS320xx and Z-80000, non mainstream processors gained many advanced design features well ahead of the more mainstream processors, which presumably had more development resources available. One possible reason for this is the greater importance of compatibility in processors used for computers and workstations, which limits the freedom of the designers. Or perhaps the non-mainstream processors were just more flexible designs to begin with. Or some might not have made it to the mainstream because the more ambitious designs resulted in more implementation bugs than competitors.

National Semiconductor - CompactRISC:

Part VI: MIL-STD-1750 - Military artificial intelligence (February 1979) .

The USAF created a draft standard for a 16-bit microprocessor meant to be used in all airborne computers and weapons systems, allowing software developed for one such system to be portable to other similar applications - similar to the intent behind the creation of Ada as the standard high level programming language for the U.S. Department of Defense (MIL-STD-1815, accepted October 1979 - 1815 was the year Ada Augusta, Countess of Lovelace and the world's first programmer, was born).

Like other 16 bit designs of the time, the 1750 was inspired by the PDP-11, but differs significantly - the exact origin of the design isn't known, and may be significantly older. Sixteen 16-bit registers were specified, and any adjacent pair (such as R0+R1 or R1+R2) could be used as a 32-bit register (the Z-8000 and PDP-11 could only use even pairs, and the PDP-11 only for specific uses) for integer or floating point (FP) values (there were no separate FP registers), or a triple for 48-bit extended precision FP (with the extra mantissa bits concatenated after the exponent - eg. 32-bit FP was [1s][23mantissa][8exp], 48-bit was [1s][23mantissa][8exp][16ext], meaning any 48-bit FP was also a valid 32-bit FP, only losing the extra precision). Also, only the upper four registers (R12 to R15) could be used as an address base (2 instruction bits instead of 4), and R0 can't be used as an index (using R0 implies no indexing, similar to the PowerPC). R15 is used as an implicit stack pointer, and the program counter is not user accessible.
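
The float layout described above can be sketched as pure bit packing (no arithmetic), showing why truncating a 48-bit value leaves a valid 32-bit one:

```python
# MIL-STD-1750 float layout sketch: 32-bit = [1 sign][23 mantissa][8 exponent],
# and the 48-bit extended format simply appends 16 more mantissa bits, so
# dropping them leaves a valid 32-bit value, losing only precision.

def pack32(sign, mantissa23, exp8):
    return (sign << 31) | (mantissa23 << 8) | exp8

def pack48(sign, mantissa23, exp8, ext16):
    return (pack32(sign, mantissa23, exp8) << 16) | ext16

f48 = pack48(0, 0b101_0000_0000_0000_0000_0000, 0x05, 0xBEEF)
f32 = f48 >> 16                      # truncate the 16 extension bits
assert f32 == pack32(0, 0b101_0000_0000_0000_0000_0000, 0x05)
```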

Address space is 16 bit word addressed (not bytes), but the design allows for an MMU to extend this to 20 bits. In addition, program and data memory can be separated using the MMU. A 4-bit Address State field in the processor status word (PSW) selects one of sixteen page groups, each containing sixteen registers for data memory and another sixteen for program memory (16x16x2 = 512 total). The top 4 bits of an address selects a register from the current AS group, which provides the upper 8 bits of a 20 bit address.
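
The translation step can be sketched as follows (a minimal model of the scheme just described, not of any particular 1750 implementation):

```python
# 1750 MMU sketch: the top 4 bits of a 16-bit logical address pick one of
# sixteen page registers in the current Address State group, and that register
# supplies the upper 8 bits of the 20-bit physical address.

def translate(logical16, page_group):
    """page_group: list of sixteen 8-bit page register values."""
    page_index = (logical16 >> 12) & 0xF      # top 4 bits select a page register
    offset = logical16 & 0x0FFF               # low 12 bits pass through
    return (page_group[page_index] << 12) | offset

group = [0x00] * 16
group[0x3] = 0xA5                             # page register 3 maps to physical 0xA5xxx
assert translate(0x3123, group) == 0xA5123
assert translate(0x3123, group) < 1 << 20     # fits in a 20-bit address
```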

Each page register also has a 4-bit access key. While other CPUs at the time provided user and supervisor modes, the 1750 provided sixteen modes, from supervisor (mode 0, which could access all pages), through fourteen user modes (modes 1 to 14 can only access pages with the same key, or key 15), to an unprivileged mode (mode 15 can only access pages with key 15). Program memory can occupy the same logical address space as data, but will select from the program page registers. Pages can also be write or execute protected.

Several I/O instructions are also included, and are used to access processor state registers.

The 1750 is a very practical 16 bit design, and is still being produced, mainly in expensive radiation resistant forms. It did not achieve widespread acceptance, likely because of the rapid advance of technology and the rise of the RISC paradigm.

XTC1750a ANSI software development environment for MIL-STD 1750A:
MIL-STD-1750A Resource Center/Index:
Public Ada Library (PAL):

Part VII: Intel 8086, IBM's choice (1978) . . . . . . . . . . . . . . . . . . . .

The Intel 8086 was based on the design of the 8080/8085 (source compatible with the 8080) with a similar register set, but was expanded to 16 bits. The Bus Interface Unit fed the instruction stream to the Execution Unit through a 6 byte prefetch queue, so fetch and execution were concurrent - a primitive form of pipelining (8086 instructions varied from 1 to 4 bytes).

It featured four 16 bit general registers, which could also be accessed as eight 8 bit registers, and four 16 bit index registers (including the stack pointer). The data registers were often used implicitly by instructions, complicating register allocation for temporary values. It featured 64K 8-bit I/O (or 32K 16-bit) ports and fixed vectored interrupts. There were also four segment registers that could be set from index registers.

The segment registers allowed the CPU to access 1 meg of memory through an odd process. Rather than just supplying missing upper bits, as most segmented processors did, the 8086 actually added the segment register (multiplied by 16, i.e. shifted left 4 bits) to the address. As a strange result of this unsuccessful attempt at extending the address space without adding address bits, it was possible to have two pointers with the same value point to two different memory locations, or two pointers with different values pointing to the same location, and typical data structures were limited to less than 64K. Most people consider this a brain damaged design (a better method might have been the one developed for the MIL-STD-1750 MMU).
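
The arithmetic, and the aliasing it causes, fits in a few lines:

```python
# 8086 address calculation: linear = segment * 16 + offset. Because segments
# overlap every 16 bytes, different segment:offset pairs alias the same byte.

def linear(segment, offset):
    return ((segment << 4) + offset) & 0xFFFFF   # 20-bit (1 MB) address space

# Two different pointers, one physical location:
assert linear(0x1000, 0x0010) == linear(0x1001, 0x0000) == 0x10010
# Wrap-around at the top of the 1 MB space:
assert linear(0xFFFF, 0x0010) == 0x00000
```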

Although this was largely acceptable for assembly language, where control of the segments was complete (it could even be useful then), in higher level languages it caused constant confusion (ex. near/far pointers). Even worse, this made expanding the address space to more than 1 MB difficult. The 80286 (1982) expanded the design only by adding a new mode (switching from 'Real' to 'Protected' mode was supported, but switching back required using a bug in the original 80286, which then had to be preserved), which greatly increased the number of segments by using a 16 bit selector for a 'segment descriptor', which contained the location within a 24 bit address space, the size (still less than 64K), and the attributes (for Virtual Memory support) of a segment.

But all memory access was still restricted to 64K segments until the 80386 (1985), which included much improved addressing: base reg + index reg * scale factor (1, 2, 4 or 8) + displacement (an 8 or 32 bit constant), producing a 32 bit address, in the form of paged segments (using six 16-bit segment registers), like the IBM S/360 series, and unlike the Motorola 68030. It also had several processor modes (including separate paged and segmented modes) for compatibility with the previous awkward design. In fact, with the right assembler, code written for the 8008 can still be run on the most recent Pentium Pro. The 80386 also added an MMU, security modes (called "rings" of privilege - kernel, system services, application services, applications) and new op codes in a fashion similar to the Z-80 (and Z-280).
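
The 80386 effective-address formula quoted above can be sketched directly (register names in the usage comment are illustrative):

```python
# 80386 effective address: base + index * scale + displacement,
# with a scale factor of 1, 2, 4 or 8, truncated to a 32-bit result.

def effective_address(base, index, scale, displacement):
    assert scale in (1, 2, 4, 8)
    return (base + index * scale + displacement) & 0xFFFFFFFF

# e.g. indexing a table of 4-byte entries at EBX with ESI as the element number:
assert effective_address(base=0x1000, index=5, scale=4, displacement=8) == 0x101C
```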

The 8087 was a floating point coprocessor which helped define the IEEE-754 floating point format and standard operations (the main competition was the VAX floating point format), and was based on an eight element stack of 80-bit values. An instruction prefix (the 'escape' opcodes 0xD8 to 0xDF) indicated coprocessor instructions (similar to the 68000), so the coprocessor could "listen" to the instruction stream, and execute instructions it recognized, without a coprocessor bus.

The 80486 (1989) added a full pipeline, a single on chip 8K cache, an on-chip FPU, and clock doubled versions (like the Z-280). Later, FPU-less 80486SX versions plus 80487 FPUs were introduced - initially these were normal 80486es where one unit or the other had failed testing, but versions with only one unit were produced later (smaller dies and reduced testing reduced costs).

The Pentium (late 1993) was superscalar (up to two instructions at once in dual integer units and single FPU) with separate 8K I/D caches. "Pentium" was the name Intel gave the 80586 version because it could not legally protect the name "586" to prevent other companies from using it - and in fact, the Pentium compatible CPU from NexGen is called the Nx586 (early 1995). Due to its popularity, the 80x86 line has been the most widely cloned processors, from the NEC V20/V30 (slightly faster clones of the 8088/8086 (could also run 8085 code)), AMD and Cyrix clones of the 80386 and 80486, to versions of the Pentium within less than two years of its introduction.

MMX (initially reported as MultiMedia eXtension, but later said by Intel to mean Matrix Math eXtension) is very similar to the earlier SPARC VIS or HP-PA MAX, or later MIPS MDMX instructions - they perform integer operations on vectors of 8, 16, or 32 bit words, using the 80 bit FPU stack elements as eight 64 bit registers (switching between FPU and MMX modes as needed - it's very difficult to use them as a stack and as MMX registers at the same time). The P55C Pentium version (January 1997) is the first Intel CPU to include MMX instructions, followed by the AMD K6, and Pentium II. Cyrix also added these instructions in its M2 CPU (6x86MX, June 1997), as well as IDT with its C6.
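
The packed-integer idea behind MMX can be sketched as one 64-bit value treated as independent lanes (this models the non-saturating byte add; actual MMX also offers saturating and wider-lane variants):

```python
# MMX-style packed arithmetic sketch: one 64-bit register treated as eight
# independent 8-bit lanes, added with per-lane wraparound and no carry
# propagation between lanes.

def paddb(a64, b64):
    result = 0
    for lane in range(8):
        shift = lane * 8
        s = (((a64 >> shift) & 0xFF) + ((b64 >> shift) & 0xFF)) & 0xFF  # wrap within the lane
        result |= s << shift
    return result

a = 0x01020304050607FF
b = 0x0101010101010101
assert paddb(a, b) == 0x0203040506070800   # low lane wraps 0xFF+1 -> 0x00, no carry out
```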

Interestingly, the old architecture is such a barrier to improvements that most of the Pentium compatible CPUs (NexGen Nx586/Nx686, AMD K5, IDT-C6), and even the "Pentium Pro" (Pentium's successor, late 1995) don't clone the Pentium, but emulate it with specialized hardware decoders like those introduced in the VAX 8700 and used in a simpler form by the National Semiconductor Swordfish, which convert Pentium instructions to RISC-like instructions which are executed on specially designed superscalar RISC-style cores faster than the Pentium itself. Intel also used BiCMOS in the Pentium and Pentium Pro to achieve clock rates competitive with CMOS load-store processors (the Pentium P55C (early 1997) version is a pure CMOS design).

IBM had been developing hardware or software to translate Pentium instructions for the PowerPC in a similar manner as part of the PowerPC 615 CPU (able to switch between the 80x86, 32-bit PowerPC, and 64-bit PowerPC instruction sets in five cycles (to drain the execution pipeline)), but the project was killed after significant development for marketing reasons. Rumour has it that engineers who worked on the project went on to Transmeta corporation.

The Cyrix 6x86 (early 1996), initially manufactured by IBM before Cyrix merged with National Semiconductor, still directly executes 80x86 instructions (in two integer and one FPU pipeline), but partly out of order, making it faster than a Pentium at the same clock speed. Cyrix also sold an integrated version with graphics and audio on-chip called the MediaGX. MMX instructions were added to the 6x86MX, and 3DNow! graphics instructions to the 6x86MXi. The M3 (mid 1998) turned to superpipelining (eleven stages compared to six (seven?) for the M2) for a higher clock rate (partly for marketing purposes, as MHz is often preferred to performance in the PC market), and was to provide dual floating point/MMX/3DNow! units. The Cyrix division of National Semiconductor was purchased by PC chipset maker Via, and the M3 was cancelled. National Semiconductor continued with the integrated Geode low-power/cost CPU.

The Pentium Pro (P6 execution core) is a 1- or 2-chip (CPU plus 256K or 512K L2 cache - I/D L1 cache (8K each) is on the CPU), 14-stage superpipelined processor. It uses extensive multiple branch prediction and speculative execution via register renaming. Three decoders (one for complex instructions (up to four micro-ops), two for simpler ones) each decode one 80x86 instruction into micro-ops (one per simple decoder + up to four from the complex decoder = three to six per cycle). Up to five (usually three) micro-ops can be issued in parallel and out of order (five units - integer+FPU ALU, integer ALU, two address, one load/store), but are held and retired (results written to registers or memory) as a group to prevent an inconsistent state (equivalent to half an instruction being executed when an interrupt occurs, for example). 80x86 instructions may produce several micro-ops in CPUs like this (and the Nx586 and AMD K5), so the actual instruction rate is lower. In fact, due to problems handling instruction alignment in the Pentium Pro, emulated 16-bit instructions execute slower than on a Pentium.

The Pentium II (April 1997) added MMX instructions to the P6 core (both ALUs), doubled cache to 32K, and was packaged in a processor card instead of an IC package. The Pentium III added Streaming SIMD Extensions (SSE) to the P6 core (both ALUs), which included eight 128-bit registers which could be used as vectors of four 32-bit integer or floating point values (like the PowerPC AltiVec extensions, but with fewer operations and data types). Unlike MMX (and like AltiVec), the SSE registers need to be saved separately during context switches, requiring OS modifications.
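The three-decoder arrangement described above can be sketched as a toy model. The instruction "sizes" (in micro-ops) are invented for illustration, and this is a simplification of the idea, not the actual P6 decode logic:

```python
# Toy model of a P6-style decode stage: one complex decoder (up to 4
# micro-ops) and two simple decoders (1 micro-op each), looking at up to
# three x86 instructions per cycle.

def decode_cycle(instructions):
    """Each instruction is represented only by its micro-op count.
    Returns (micro-ops emitted, instructions consumed) for one cycle."""
    uops = 0
    consumed = 0
    simple_slots = 2
    complex_used = False
    for n in instructions[:3]:
        if n == 1 and simple_slots > 0:
            simple_slots -= 1            # simple instruction, simple decoder
        elif n <= 4 and not complex_used:
            complex_used = True          # anything up to 4 micro-ops fits here
        else:
            break                        # no suitable decoder left this cycle
        uops += n
        consumed += 1
    return uops, consumed

# A complex instruction followed by two simple ones decodes in one cycle:
print(decode_cycle([4, 1, 1]))  # (6, 3): three to six micro-ops per cycle
# Two complex instructions back to back force the second to wait a cycle:
print(decode_cycle([3, 3, 1]))  # (3, 1)
```

The second example shows why compilers for this kind of core try to schedule complex instructions between runs of simple ones.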

In June 1998, Intel created two sub-brands of P6 CPUs, low cost (Celeron) and server oriented (Xeon). They differed in cache size and bus speed.

The P7 was first released as the Pentium 4 in December 2000. This equivalent to AMD's K7 (see below) was late due to the decision to concentrate on the development of the IA-64 architecture. Intel used two teams for alternating 80x86 designs: the P5 team started work on the P7, originally a 64-bit version like the AMD K8, while the other team worked on the P6. When the 64-bit P7 was changed to the IA-64, the P6 team started on a scaled-down P7 after the Pentium III was finished - meanwhile, Intel sold "overclocked" (small quantities able to run at a higher than designed clock rate) P6 CPUs to compete with the AMD K7, then later updated P6 designs.

The P7 extended the pipeline even further to over 20 stages (or 30 during cache misses), stressing clock speed over execution speed (for marketing reasons) - this led to some questionable design decisions. The three decoders are replaced by a single decoder and a trace cache - similar in concept to the decoded instruction cache of the AT&T Hobbit, but 80x86 instructions often decode into multiple micro-ops, so mapping the micro-ops to memory is more complex, and instructions are loaded ahead of time using branch prediction. This speeds execution within the cache, but the single decoder limits the external instruction stream to one at a time. Long micro-op sequences are stored in microcode ROM and fed to the dispatch unit without being stored in the cache.
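The trace cache concept can be sketched as a toy model. The addresses, micro-ops, and predictor below are invented, and a real trace cache must also manage capacity, invalidation, and self-modifying code, none of which is shown here:

```python
class TraceCache:
    """Toy trace cache: decoded micro-op sequences keyed by the address of
    the first x86 instruction, built along the predicted branch path."""
    def __init__(self):
        self.traces = {}  # start address -> list of micro-ops

    def fetch(self, start, decode, predict):
        if start in self.traces:          # hit: the decoder is bypassed
            return self.traces[start]
        trace, addr = [], start
        while addr is not None and len(trace) < 8:
            trace.extend(decode(addr))    # miss: decode one instruction at a time
            addr = predict(addr)          # follow predicted branches ahead of time
        self.traces[start] = trace
        return trace

# Invented 4-byte instructions, each decoding to one micro-op, falling
# through until a predicted-taken branch at 0x10c ends the trace.
decode = lambda a: [("uop", a)]
predict = lambda a: a + 4 if a < 0x10c else None

tc = TraceCache()
cold = tc.fetch(0x100, decode, predict)   # built through the slow decoder
warm = tc.fetch(0x100, decode, predict)   # served directly from the trace cache
print(len(cold), warm is cold)            # 4 True
```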

There are seven execution units: one FPU/MMX/SSE, one FP register load/store unit, two add/subtract integer units, one logic (shift and rotate) unit, one load and one store unit. The add/subtract units run at double the clock rate, basically as a two stage pipe, allowing two results within a single clock cycle, meaning up to nine micro-ops could be dispatched each cycle to the seven units, but in practice the trace cache is limited to three per cycle. A slower logic unit replaces two faster address units in the P6, slowing most code. Since the stack-oriented FPU registers are difficult to use for superscalar or out-of-order execution, Intel added floating-point SSE instructions (called SSE2), so that floating point operations can use the flat SSE registers, which will make future designs easier as the old FPU design becomes less important.
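Why the stack-oriented FPU registers hinder out-of-order execution can be illustrated with a toy dependence check. The register names are the real x87/SSE ones, but the check itself is a deliberate simplification of what a scheduler does:

```python
# Stack-style x87 ops all implicitly read and write ST0 (the stack top),
# so even logically independent additions conflict with each other.
# Flat-register SSE2 ops name their sources and destinations explicitly.

def independent(op1, op2):
    """Two ops can issue together if neither writes a register the other uses.
    Each op is (set of registers read, set of registers written)."""
    reads1, writes1 = op1
    reads2, writes2 = op2
    return not (writes1 & (reads2 | writes2)) and not (writes2 & reads1)

# x87-style: every fadd goes through the stack top.
fadd_a = ({"ST0", "ST1"}, {"ST0"})
fadd_b = ({"ST0", "ST2"}, {"ST0"})

# SSE2-style: the same two additions with explicit flat registers.
addsd_a = ({"XMM0", "XMM1"}, {"XMM0"})
addsd_b = ({"XMM2", "XMM3"}, {"XMM2"})

print(independent(fadd_a, fadd_b))    # False: serialized through the stack top
print(independent(addsd_a, addsd_b))  # True: free to execute out of order
```

Stack renaming hardware (as in the AMD K7's FPU) can remove some of these false dependencies, but a flat register file makes them disappear at the instruction set level.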

The bottlenecks might have been a result of rushing the design, or due to cost. As a result, the P7 executing existing code is actually slower than a P6 at slightly lower clock speed, and much slower than the AMD K7, but the intent of the design was to allow clock speed to be increased enough to make up for the difference. Possibly the bottlenecks will be removed in a future version.

A server (Xeon) version of the P7 (March 2002) introduced vertical multithreading (called "Hyperthreading" by Intel), like the IBM Northstar CPU.

AMD was a second source for Intel CPUs as far back as the AMD 9080 (AMD's version of the Intel 8080). The AMD K5 translates 80x86 code to ROPs (RISC OPerations), which execute on a RISC-style core based on an unproduced superscalar version of the AMD 29K. Up to four ROPs can be dispatched to six units (two integer, one FPU, two load/store, one branch unit), and five can be retired at a time. The complexity led to low clock speeds for the K5, prompting AMD to buy NexGen and integrate its designs into the next-generation K6.

The NexGen/AMD Nx586 (early 1995) is unique in being able to execute its micro-ops (called RISC86 code) directly, allowing optimised RISC86 programs to be written which are faster than an equivalent 80x86 program would be, but this feature is seldom used. It also features two 16K I/D L1 caches, a dedicated L2 cache bus (like that in the Pentium Pro 2-chip module) and an off-chip FPU (either as a separate chip, or later in a 2-chip module).

The Nx586's successor, the K6 (April 1997), actually has three caches - 32K each for data and instructions, and a half-size 16K cache containing instruction decode information. It also brings the FPU on-chip and eliminates the dedicated cache bus of the Nx586, allowing it to be pin-compatible with the P54C model Pentium. Another decoder is added (two complex decoders, compared to the Pentium Pro's one complex and two simple decoders) producing up to four micro-ops and issuing up to six (to seven units - load, store, complex/simple integer, FPU, branch, multimedia) and retiring four per cycle. It includes MMX instructions, licensed from Intel, and AMD designed and added 3DNow! graphics extensions without waiting for Intel's SSE additions.

AMD aggressively pursued a superscalar (fourteen-stage pipeline) design for the Athlon (K7, mid 1999), decoding 80x86 instructions into 'MacroOps' (made up of one or two 'micro-ops', a process similar to the branch folding in the AT&T Hobbit or instruction grouping in the T9000 Transputer and the Motorola 54xx ColdFire CPU) in two decoders (one for simple and one for complex instructions) producing up to three MacroOps per cycle. Up to nine decoded operations per cycle can be issued in six MacroOps to six functional units: three integer units, each able to execute one simple integer and one address op simultaneously, and three FPU/MMX/3DNow! units (FMUL mul/div/sqrt, FADD simple ops/comparisons, FSTORE load/store/move) with extensive stack and register renaming, plus a separate integer multiply unit which follows integer ALU 0 and can forward results to either ALU 0 or 1. The K7 replaces the Intel-compatible bus of the K6 with the high speed Alpha EV6 bus because Intel decided to prevent competitors from using its own higher speed bus designs (Dirk Meyer was director of engineering for the K7, as well as co-architect of the Alpha EV4 and EV6). This makes it easier to use either Alpha or AMD K7 processors in a single design. At introduction, the K7 managed to out-perform Intel's fastest P6 CPU.

Centaur, a subsidiary of Integrated Device Technology, introduced the IDT-C6 WinChip (May 1997), which uses a much simpler (6-stage, 2-way integer/simple-FPU execution) design than the Intel and AMD translation-based designs by using micro-ops more closely resembling 80x86 than RISC code, which allows for a higher clock rate and larger L1 (32K each I/D) and TLB caches in a lower cost, lower power consumption design. Simplifications include replacing branch prediction (less important with a short pipeline) with an eight-entry call/return stack, depending more on caches. The FPU unit includes MMX support. The C6+ version adds a second FPU/MMX unit and 3D graphics enhancements.

Like Cyrix, Centaur opted for a superpipelined eleven-stage design for added performance, combined with sophisticated early branch prediction in its WinChip 4. The design also pays attention to supporting common code sequences - for example, loads occur earlier in the pipeline than stores, allowing load-alu-store sequences to be more efficient.

The Cyrix division of National Semiconductor and the Centaur division of IDT were bought by Taiwanese motherboard chipset maker Via. The Cyrix CPU was cancelled, and the Centaur design was given the "Cyrix III" brand instead.

Intel, with partner Hewlett-Packard, developed a next generation 64-bit processor architecture called IA-64 (the 80x86 design was renamed IA-32) - the first implementation was named Itanium. It was intended to be compatible in some way with both the PA-RISC and the 80x86. This may finally produce the incentive to let the 80x86 architecture fade away.

On the other hand, the demand for compatibility will remain a strong market force. AMD announced its intention to extend the K7 design to produce an 80x86-compatible K8 (codenamed "Sledgehammer", then changed to just "Hammer" - variants indicate market segments, such as "Clawhammer" (keeping the Athlon brand name) for desktops, and "Sledgehammer" (named Opteron) for servers). The result was a 64-bit architecture called x86-64, in competition with the Intel IA-64. Rumours indicate that Intel is developing a CPU codenamed "Yamhill" which implements the x86-64 architecture and instruction set (under pressure from Microsoft to avoid creating yet another instruction set to support), originally as an unofficial project (maybe the original P7 plans dusted off), then officially when performance of the first Itanium disappointed.

When moving from the 80286 to the 80386 (IA-32), Intel took the opportunity to fix some of the least liked features remaining in the previous design. Moving to x86-64, AMD decided to further modernize the design, adding a cleaner 64-bit mode (selected by Code Segment Descriptor (CSD) register bits).

It's based on sixteen 64-bit integer and sixteen 128-bit vector/floating point (XMM) registers (the lower eight registers of each map to the original x86 integer and SSE/SSE2 registers), plus the 8087 FPU/MMX registers, with a 64-bit program counter. In 64-bit mode, integer registers are uniform and can be accessed as 8-, 16-, 32-, or 64-bit values. Address space is changed from mainly segmented to a flat space (keeping data segment registers in 32- or 16-bit sub-modes) with PC-relative addressing, although code segments (within the address space) are still used to define the modes for each segment. Older 8086 modes are supported in a separate "legacy" mode. These changes give compilers a larger, more regular register set to use, making optimizations easier.
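The uniform sub-register behaviour can be sketched in a few lines. The zero-extension of 32-bit writes is a genuine x86-64 rule (it avoids partial-register dependencies); the helper functions themselves are invented for illustration:

```python
# Sketch of x86-64 sub-register access on one 64-bit integer register:
# 8- and 16-bit writes merge into the low bits of the full register, while
# 32-bit writes are zero-extended to 64 bits, replacing the whole register.

MASK = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF, 64: 0xFFFFFFFFFFFFFFFF}

def write(reg, value, width):
    value &= MASK[width]
    if width >= 32:                      # 32- and 64-bit writes replace the register
        return value                     # (32-bit results are zero-extended)
    return (reg & ~MASK[width]) | value  # 8/16-bit writes merge into the low bits

def read(reg, width):
    return reg & MASK[width]

rax = write(0, 0x1122334455667788, 64)
rax = write(rax, 0xAB, 8)                # only the AL byte changes
print(hex(rax))                          # 0x11223344556677ab
rax = write(rax, 0xCCDDEEFF, 32)         # an EAX write clears the upper half
print(hex(rax))                          # 0xccddeeff
```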

So why did IBM choose the 8-bit 8088 (1979) version of the 8086 for the IBM 5150 PC (1981) when most of the alternatives were so much better? Apparently IBM's own engineers wanted to use the 68000, and it was used later in the forgotten IBM Instruments 9000 Laboratory Computer, but IBM already had rights to manufacture the 8086, in exchange for giving Intel the rights to its bubble memory designs. IBM was using 8086s in the IBM Displaywriter word processor (the 8080 and 8085 were also used in other products).

Other factors were the fact that the 8-bit 8088 could use existing low cost 8085-type components, and allowed the computer to be based on a modified 8085 design. 68000 components were not widely available, though it could use 6800 components to an extent. After the failure and expense of the IBM 5100 (1975, their first attempt at a personal computer - discrete random logic CPU with no bus, built in BASIC and APL as the OS, 16K RAM and 5 inch monochrome monitor - $10,000!), cost was a large factor in the design of the PC. Strategists were also not eager to have a microcomputer competing with IBM's low end minicomputers.

The availability of CP/M-86 is also likely a factor, since CP/M was the operating system standard for the computer industry at the time. However Digital Research founder Gary Kildall was unhappy with the legal demands of IBM, so Microsoft, a programming language company, was hired instead to provide the operating system (known at varying times as QDOS, SCP-DOS, and finally 86-DOS, it was purchased by Microsoft from Seattle Computer Products and renamed MS-DOS).

Digital Research did eventually produce CP/M 68K for the 68000 series, making the operating system choice less relevant than other factors.

Intel bubble memory was on the market for a while, but faded away as better and cheaper memory technologies arrived.
