The Great CPU List, Appendix C

Appendix C:

CPU Features:

Most of the terms in this list are defined somewhere within, and others are available in the Free On-line Dictionary of Computing, but here's clarification for a few terms:

Accumulator

A register that is used as the implicit source and destination of an operation (the register doesn't have to be specified separately). The PDP-8 has the best example in this document.

RISC processors use a load/store architecture instead - to add a value in memory to a register, the value must be loaded into an intermediate register first.
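
As a rough illustration, the following Python sketch contrasts the two styles. The tiny instruction sets and the names run_accumulator and run_load_store are invented for this example; they do not model the PDP-8 or any real RISC processor.

```python
def run_accumulator(program, memory):
    """Accumulator style: the single register `acc` is the implied operand."""
    acc = 0
    for op, addr in program:
        if op == "LOAD":        # acc <- memory[addr]
            acc = memory[addr]
        elif op == "ADD":       # acc <- acc + memory[addr]
            acc += memory[addr]
        elif op == "STORE":     # memory[addr] <- acc
            memory[addr] = acc
        else:
            raise ValueError(f"unknown operation {op}")
    return memory


def run_load_store(program, memory, num_regs=4):
    """Load/store style: arithmetic only works on explicitly named registers."""
    regs = [0] * num_regs
    for instr in program:
        op = instr[0]
        if op == "LOAD":        # ("LOAD", rd, addr): regs[rd] <- memory[addr]
            _, rd, addr = instr
            regs[rd] = memory[addr]
        elif op == "ADD":       # ("ADD", rd, ra, rb): regs[rd] <- regs[ra] + regs[rb]
            _, rd, ra, rb = instr
            regs[rd] = regs[ra] + regs[rb]
        elif op == "STORE":     # ("STORE", rs, addr): memory[addr] <- regs[rs]
            _, rs, addr = instr
            memory[addr] = regs[rs]
        else:
            raise ValueError(f"unknown operation {op}")
    return memory


# Both programs compute memory[2] = memory[0] + memory[1].
acc_program = [("LOAD", 0), ("ADD", 1), ("STORE", 2)]
ls_program = [("LOAD", 0, 0), ("LOAD", 1, 1), ("ADD", 2, 0, 1), ("STORE", 2, 2)]
print(run_accumulator(acc_program, [5, 7, 0]))
print(run_load_store(ls_program, [5, 7, 0]))
```

The accumulator program is shorter because the register never has to be named, while the load/store program needs an extra load before it can add.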

Asynchronous Design

A design which does not synchronize individual circuits using a clock signal, as synchronous designs do. Some other method (such as a "dummy circuit" which does nothing but consume the same amount of time as the real circuit) is used to generate a signal when the result is ready/valid, and the valid signals can be used to start the next operation.

There is an asynchronous version of the ARM architecture, and Sun is researching an asynchronous Transport-triggered architecture with a project called FleetZero.

Branch Prediction

The general method of keeping track of which path was taken by a particular branch instruction, and following that path the next time the same instruction is encountered. Generally a history table is used to indicate how often a branch at a given address is taken or not taken.
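
One common way to build such a history table is a two-bit saturating counter per entry. The sketch below assumes that scheme; the table size, the initial counter value and the names BranchHistoryTable and count_mispredictions are choices made for illustration and are not taken from any particular CPU.

```python
class BranchHistoryTable:
    """History table of two-bit counters: 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, entries=16):
        self.entries = entries
        self.counters = [1] * entries   # assumed start state: weakly not taken

    def _index(self, address):
        # Low bits of the branch address select the table entry.
        return address % self.entries

    def predict(self, address):
        return self.counters[self._index(address)] >= 2

    def update(self, address, taken):
        i = self._index(address)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def count_mispredictions(table, address, outcomes):
    """Predict each outcome in turn, then train the table with the real result."""
    misses = 0
    for taken in outcomes:
        if table.predict(address) != taken:
            misses += 1
        table.update(address, taken)
    return misses


# A loop branch that is taken nine times and then falls through.
table = BranchHistoryTable()
loop = [True] * 9 + [False]
print(count_mispredictions(table, 0x40, loop))
print(count_mispredictions(table, 0x40, loop))
```

The second pass over the same loop mispredicts less often than the first, because the counter remembers that the branch is usually taken.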

Branch Target Cache

The practice of saving one or more instructions which are executed immediately after a branch instruction, so that the next time the branch is encountered, the instructions have already been loaded.

Cache

You should know this term already. But if you don't, it refers to a small amount of fast memory which holds recently accessed data or instructions so that if they are used by the programs again, the cache can supply them transparently faster than main memory. Cache memory is typically organised into lines (several bytes are loaded at once, on the assumption that nearby memory will be used next). The lines are organised into sets, each set is mapped to a separate group of memory addresses, and there are usually between two and sixty-four lines per set (fewer lines per set are simpler, but access to more addresses than cache lines in the same set can cause data in the cache to be discarded before it can be used).
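
The mapping of addresses to lines and sets can be sketched in Python as follows. The line size, the number of sets, the two lines per set and the least-recently-used replacement policy are all assumptions chosen to keep the example small, and SetAssociativeCache is a made-up name.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Tracks which memory lines are cached; the data itself is not stored."""

    def __init__(self, line_size=16, num_sets=4, ways=2):
        self.line_size = line_size
        self.num_sets = num_sets
        self.ways = ways                      # lines per set
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def split(self, address):
        """Return (tag, set index) for a byte address."""
        line_number = address // self.line_size
        return line_number // self.num_sets, line_number % self.num_sets

    def access(self, address):
        """Return True on a hit; on a miss, load the line and return False."""
        tag, index = self.split(address)
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)            # mark as most recently used
            return True
        if len(lines) >= self.ways:
            lines.popitem(last=False)         # discard the least recently used line
        lines[tag] = True
        return False


cache = SetAssociativeCache()
# Addresses 0, 64 and 128 all fall into set 0, which only holds two lines.
for address in (0, 4, 64, 128, 0):
    print(address, "hit" if cache.access(address) else "miss")
```

Address 4 hits because it shares a line with address 0, but the final access to address 0 misses again: three lines competing for a two-line set pushed it out before it could be reused.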

Smaller caches are faster, so often a small level 1 cache is used, with a larger but slower level 2 cache supporting it. Level 3 caches can even be used in some cases.

Some cache controllers monitor the memory bus to detect when a cached memory value has been modified by another CPU, or a peripheral.

DSP

Digital Signal Processor, a CPU designed mainly for performing simple, repetitious operations on a stream or buffer of data - for example, decoding digital audio data from a CD. Generally meant for embedded applications, leaving out features of general purpose CPUs which aren't needed in a DSP application. There is usually little or no interrupt support, or memory management support.

EEPROM

Electrically Erasable Programmable ROM.

Endian

The order in which a multi-byte binary number is stored in byte-addressable memory. "Little-endian" means the least significant byte (the "little end") is stored in the first (lowest) address, "big-endian" means the most significant byte ("big end") has the first position in memory.
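
A short Python sketch of the two layouts follows. The helper name to_memory is invented for this example; the checks at the end compare it against Python's built-in int.to_bytes.

```python
def to_memory(value, size, order):
    """Return the bytes of `value` in the order they would occupy rising addresses."""
    # Extract bytes starting from the least significant one (the "little end").
    data = [(value >> (8 * i)) & 0xFF for i in range(size)]
    if order == "little":
        return data            # least significant byte at the lowest address
    if order == "big":
        return data[::-1]      # most significant byte at the lowest address
    raise ValueError("order must be 'little' or 'big'")


value = 0x12345678
print([hex(b) for b in to_memory(value, 4, "little")])
print([hex(b) for b in to_memory(value, 4, "big")])

# The built-in conversion agrees with the manual version.
assert bytes(to_memory(value, 4, "little")) == value.to_bytes(4, byteorder="little")
assert bytes(to_memory(value, 4, "big")) == value.to_bytes(4, byteorder="big")
```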

A potential source of code and communications incompatibility, but with no significant advantages to either, making the decision arbitrary (except for compatibility requirements). The term comes from an equally arbitrary disagreement in Lilliputian society (from Jonathan Swift's book "Gulliver's Travels") over which end to break boiled eggs (the big or little end), a distinction which caused civil wars. Swift was satirizing differences in the treatment of Catholics in his own time - fortunately there's been no documented case of CPU designers coming to blows over CPU endian-ness, despite the heated discussions that once took place (but which later became unfashionable after network endian order was standardised in TCP/IP).

Explicitly Parallel Instruction Computing

The HP/Intel term for a form of VLIW with Variable Length Instruction Groupings, which uses fields in the instruction stream or in the instructions themselves to group instructions (specifying instruction dependencies), rather than using a fixed length instruction word. Used in the TI 320C6x and the HP/Intel Merced/IA-64.

Two problems are usually identified with VLIW processors (like the Philips TriMedia). One is that if the instruction word can't be filled, the rest of the entries need to be filled with NOP instructions, which waste space. The other is that a fixed word length prevents future versions which may be able to execute more instructions in parallel, or lower cost versions which execute fewer. EPIC solves both problems, but requires a small semantic change: instructions within a group must be independent - that is, act the same whether they are executed in order or in parallel. By contrast, in the Multiflow TRACE systems a pair of instructions such as "MOVE A, B" and "MOVE B, A" could be in the same word because they were guaranteed to execute in parallel, with the result that values in A and B would be swapped.

EPROM

Erasable Programmable ROM (erased by exposing the EPROM to ultraviolet light).

Harvard Architecture

Strictly speaking, refers to a CPU with separate program and data spaces (specifically the PIC embedded processors), but it's often used more generally to refer to separate program and data buses (and usually caches too) for improved speed, though the address spaces are actually shared. Originally Harvard architecture computers were programmed using plug boards or something similar, and data was in a writable storage area. The von Neumann architecture introduced the idea of a stored program in the same writable memory that data was stored in.

Indirection Bit

Some designs used one address bit as an indirection bit, meaning that the value in memory is the address of the actual value. Other designs used a separate addressing mode for indirect addressing.
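
The indirection bit can be sketched like this. The 16-bit address field with the top bit used as the indirection bit is a hypothetical layout chosen for the example, and fetch_operand is an invented name; real machines differed in field width and in whether indirection could be chained.

```python
INDIRECT_BIT = 0x8000    # hypothetical: top bit of a 16-bit address field
ADDRESS_MASK = 0x7FFF    # remaining 15 bits hold the address itself


def fetch_operand(memory, address_field):
    """Return the operand selected by an address field (one level of indirection)."""
    address = address_field & ADDRESS_MASK
    if address_field & INDIRECT_BIT:
        # The addressed word holds the address of the actual value.
        address = memory[address] & ADDRESS_MASK
    return memory[address]


memory = {10: 20, 20: 99}
print(fetch_operand(memory, 10))                  # direct: the word at address 10
print(fetch_operand(memory, 10 | INDIRECT_BIT))   # indirect: the word that address 10 points to
```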

INTERCAL

An actual programming language designed to be as evil as possible.

Microcode

Earlier CPUs were designed to execute instructions with the circuitry directly decoding and executing program instructions. Microcode was a way of simplifying CPU design by allowing simpler hardware which executes simple microinstructions to interpret more complex machine instructions, first used commercially in the mid and low range IBM System/360. Microcode is often slower and increases CPU size (compare the transistor count of the microcoded Motorola 68000 (68,000) with the hardwired Zilog Z-8000 (17,500) - and the fact that the Z-8000 was both late and buggy).

Implementations generally use either 'horizontal' or 'vertical' microcode, which differ mainly in number of bits. Microinstructions include a condition code and jump address (jump if condition is true, next instruction if false), and the operation to be performed. In horizontal microcode, each operation bit triggers an individual control line (simple CPU controller but large microcode storage), in vertical microcode, the operation field is decoded to produce the control signals (smaller microcode but more complex controller). Some CPUs used a combination.
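
The difference between the two operation fields can be sketched as follows. The five control lines and the opcode assignments are invented for illustration and do not describe any real CPU's microcode.

```python
# Hypothetical control lines of a very small datapath.
CONTROL_LINES = ("mem_read", "mem_write", "alu_add", "alu_sub", "reg_write")


def horizontal_decode(operation_bits):
    """Horizontal: bit i of the operation field drives control line i directly."""
    return {line for i, line in enumerate(CONTROL_LINES) if (operation_bits >> i) & 1}


# Vertical: a short opcode goes through a decoder that produces the control signals.
VERTICAL_DECODER = {
    0: set(),
    1: {"mem_read", "reg_write"},
    2: {"mem_write"},
    3: {"alu_add", "reg_write"},
    4: {"alu_sub", "reg_write"},
}


def vertical_decode(opcode):
    return VERTICAL_DECODER[opcode]


horizontal_width = len(CONTROL_LINES)                     # one bit per control line
vertical_width = (len(VERTICAL_DECODER) - 1).bit_length() # bits needed to number the opcodes
print(horizontal_width, vertical_width)
print(horizontal_decode(0b10100) == vertical_decode(3))
```

The horizontal field needs no decoder but is wider, and it can express any combination of control lines; the vertical field is narrower but only offers the combinations the decoder knows about.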

Multithreading

The ability to share CPU resources among multiple threads. 'Vertical' multithreading allows a CPU to switch execution between threads without needing to save thread state (generally using duplicated registers, and usually used to continue execution with another thread when one thread hits a delay due to a cache miss and must wait). 'Horizontal' multithreading allows threads to share functional units without halting the execution of a thread (an idle functional unit can be assigned to any thread that needs it).

A simpler variation called a "barrel processor" cycles through threads on every clock cycle whether there is a delay or not, so when there are enough thread "slots" to cover any expected execution delay, it appears to the program that each instruction takes one cycle (in addition, no hardware is required to check for data dependencies in the pipeline).

Network order

Big-endian, used in TCP/IP standards.

Out Of Order Execution

A superscalar CPU may issue instructions in an order different than that in the program if state conflicts can be resolved (with renaming for example). For example:

1: add r1,r2->r8
2: sub r8,r3->r3
3: add r4,r5->r8
4: sub r8,r6->r6

Instructions 1 and 3 can be executed in parallel if r8 is renamed, and instructions 2 and 4 can then be executed in parallel. Instruction 3 is executed before 2, out of the order in which they appear in the program.

Predicated instructions

Instructions which are executed only if conditions are true, usually bits in a condition code register. This eliminates some branches, and in a superscalar machine can allow both branches in certain conditions to be executed in parallel, and the incorrect one discarded with no branch penalty. Used in the ARM and TMS320C6x, in some HP PA-RISC instructions, and the upcoming HP/Intel IA-64.

PROM

Programmable ROM (not erasable).

RAM

If you don't know what Random Access Memory is, why are you reading this in the first place?

Register Renaming

A number of extra registers can be assigned to hold the data that would normally be written to the destination register (in other words, the extra register is renamed as far as that particular instruction is concerned). One use for this is for speculative execution of branches - if the branch is eventually taken, then data in the rename register can be written to the real register, if not then the data is discarded. Another use is for out of order execution: renamed registers can produce an 'image' of the processor state which an instruction expects, while the actual processor state has already been modified by another instruction (a situation known as a write conflict).

The circuitry required to keep track of renamed registers can be complex.
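
A minimal sketch of the bookkeeping, applied to the four-instruction sequence from the Out Of Order Execution entry, might look like this. It assumes an unlimited supply of physical registers (named p0, p1, ... here) and never frees them, which real hardware cannot do; the function name rename is invented for the example.

```python
from itertools import count


def rename(instructions):
    """Give every register write a fresh physical register.

    Each instruction is (operation, source1, source2, destination).
    """
    mapping = {}        # architectural register -> current physical register
    fresh = count()     # assumed endless supply of unused physical registers

    def lookup(reg):
        # A register read before any write gets a physical register on first use.
        if reg not in mapping:
            mapping[reg] = f"p{next(fresh)}"
        return mapping[reg]

    renamed = []
    for op, src1, src2, dest in instructions:
        p1, p2 = lookup(src1), lookup(src2)    # sources use the current mapping
        mapping[dest] = f"p{next(fresh)}"      # the write gets a new register
        renamed.append((op, p1, p2, mapping[dest]))
    return renamed


program = [
    ("add", "r1", "r2", "r8"),
    ("sub", "r8", "r3", "r3"),
    ("add", "r4", "r5", "r8"),
    ("sub", "r8", "r6", "r6"),
]
for instruction in rename(program):
    print(instruction)
```

After renaming, the two writes to r8 go to different physical registers, so the third instruction no longer shares any register with the first two and can be issued alongside the first.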

Resource Renaming

A more general form of register renaming where resources other than registers are renamed.

ROM

Read Only RAM. It's really spelled ROR. Engineers know this, but don't tell anybody so that they can laugh at everyone who says 'ROM'. Really, this is the truth.

Saturation Arithmetic

When arithmetic operations produce values too large or too small for registers, the largest or smallest value that can be represented is substituted instead.
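
For example, a saturating add can be sketched in Python as below, with an ordinary wrapping add shown for contrast. The function names and the 8-bit default width are choices made for this example.

```python
def saturating_add(a, b, bits=8, signed=True):
    """Add two values, clamping the result to the range a `bits`-wide register can hold."""
    if signed:
        lowest, highest = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    else:
        lowest, highest = 0, (1 << bits) - 1
    return max(lowest, min(highest, a + b))


def wrapping_add(a, b, bits=8):
    """Unsigned add that discards the overflow, as non-saturating hardware would."""
    return (a + b) & ((1 << bits) - 1)


print(saturating_add(100, 100))                # clamps at the largest signed 8-bit value
print(saturating_add(-100, -100))              # clamps at the smallest signed 8-bit value
print(saturating_add(200, 100, signed=False))  # clamps at the largest unsigned 8-bit value
print(wrapping_add(200, 100))                  # wraps around instead
```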

Segment

Properly, a section of memory of almost any size and at any address, accessed through an identifier tag which includes protection bits, particularly useful for object oriented programming. A good idea which was missed by a painful margin with the Intel 8086.

Speculative Execution

In a pipelined processor, branch instructions in the execute stage affect the instruction fetch stage - there are two possible paths of execution, and the correct one isn't known until the conditional branch executes. If the CPU waits until the conditional branch executes, the stages between fetch and execute become empty, leading to a delay before execution can resume after a branch (the time taken for new instructions to fill the pipeline again). The alternative is to choose an execution path, and if that is the correct one, there is no branch delay. But if it's the wrong one, any results from the speculative execution have to either be discarded or undone.

Stack Frame

A segment of a stack which holds parameters, local variables, previous stack frame pointer and return address, created when calling a procedure, function (procedure which returns a value), or method (function or procedure which can access private data in an object) in most high level languages.

Superscalar

Refers to a processor which executes more than one instruction simultaneously, but more properly refers to the issuing of instructions (the CDC 6600 issues one, but executes many simultaneously).

Synchronous Design

A design which ensures that when two circuits take different amounts of time to perform a function, further operations will wait until a voltage signal (which switches between on and off at a specified frequency) changes. The changing signal is called the circuit's clock, and changes at the speed of the slowest circuit, in order to keep the faster circuits synchronized with it.

Designs which don't use a clock signal are called asynchronous.

Thread

A thread is a stream or path of execution where the state is entirely stored in the CPU, while a process includes extra state information - mainly operating system support to protect processes from unexpected and unwanted interferences (either from bugs or intentional attack). Threads are sometimes called lightweight processes.

Transport Triggered Architecture

Also called a Transfer Triggered Architecture, or Move Machine, a TTA is a design where operations are triggered by moving data to the functional units which operate on it, instead of moving data in response to the CPU operations (an Operation Triggered Architecture, or OTA).

For example, a TTA would have one unit for add, one for subtract, one for load, and so on. A number would be loaded by moving the address to the load unit, triggering it to load. The result could be transferred to the add unit, and another number from a register or another unit could be transferred, triggering the unit to add them together.
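
The sequence described above can be sketched in Python. The split of each unit into a plain operand port and a trigger port is an assumption made for this example, and the class and method names are invented; the only "instruction" in the program is a move.

```python
class LoadUnit:
    """Moving an address to the trigger port starts a load."""

    def __init__(self, memory):
        self.memory = memory
        self.result = 0

    def move_trigger(self, address):
        self.result = self.memory[address]


class AddUnit:
    """One operand is latched; moving the other to the trigger port starts the add."""

    def __init__(self):
        self.operand = 0
        self.result = 0

    def move_operand(self, value):
        self.operand = value                  # stored only, nothing happens yet

    def move_trigger(self, value):
        self.result = self.operand + value    # the move itself triggers the operation


memory = {100: 40}
register = 2

load = LoadUnit(memory)
add = AddUnit()

load.move_trigger(100)            # move the address to the load unit
add.move_operand(load.result)     # move the loaded value to the add unit
add.move_trigger(register)        # move a register value, which triggers the add
print(add.result)
```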

TTAs are primarily experimental, with researchers looking into using the very regular design properties for automated custom CPU designs. The TI MSP430 implements the multiplier as an on-chip peripheral, and Sun is researching high-speed asynchronous designs.

Very Long Instruction Word (VLIW)

An instruction which includes more than one operation, intended to be executed concurrently - either a fixed number of operations per instruction, or a variable number (Variable Length Instruction Grouping or Explicitly Parallel Instruction Computing (EPIC)).

Virtual Machine

A software emulation of a CPU, usually including an OS environment.
