Although partly shadowed by other design choices in this particular chip, the multiplexed bus limited performance slightly; transfers of 16-bit or 8-bit quantities were done in a four-clock memory access cycle.[11] As instructions varied from 1 to 6 bytes, fetch and execution were made concurrent (as it remains in today's x86 processors): The bus interface unit fed the instruction stream to the execution unit through a 6 byte prefetch queue (a form of loosely coupled pipelining), speeding up operations on registers and immediates, while memory operations unfortunately became slower (4 years later, this performance problem was fixed with the 80186 and 80286). However, the full (instead of partial) 16-bit architecture with a full width ALU meant that 16-bit arithmetic instructions could now be performed with a single ALU cycle (instead of two, via carry), speeding up such instructions considerably. Combined with orthogonalizations of operations versus operand-types and addressing modes, as well as other enhancements, this made the performance gain over the 8080 or 8085 fairly significant, despite cases where the older chips may be faster (see below).
Execution times for typical instructions (in clock cycles):
Timings are best case, depending on prefetch status, instruction alignment, and other factors.MOV reg,reg: 2, reg,im: 4, reg,mem: 8+EA, mem,reg: 9+EA, mem,im: 10+EA cyclesALU reg,reg: 3, reg,im: 4, reg,mem: 9+EA, mem,reg: 16+EA, mem,im: 17+EA cyclesJMP reg: 11, JMP label: 15, Jcc label: 16 (cc = condition code)MUL reg: 70..118 cyclesIDIV reg: 101..165 cycles
EA: time to compute effective address, ranging from 5 to 12 cycles.
As can be seen from these tables, operations on registers and immediates were fast (between 2 and 4 cycles), while memory-operand instructions and jumps were quite slow; jumps took more cycles than on the simple 8080 and 8085, and the 8088 (used in the IBM PC) was additionally hampered by its narrower bus. The reasons why most memory related instructions were slow were threefold:
Loosely coupled fetch and execution units are efficient for instruction prefetch, but not for jumps and random data access (without special measures).
No dedicated address calculation adder was afforded; the microcode routines had to use the main ALU for this (although there was a dedicated segment + offset adder).
The address and data buses were multiplexed, forcing a slightly longer (33~50%) bus cycle than in typical contemporary 8-bit processors.
It should be noted, however, that memory access performance was drastically enhanced with Intel's next generation chips. The 80186 and 80286 both had dedicated address calculation hardware, saving many cycles, and 80286 also had separate (non-multiplexed) address and data buses.
No comments:
Post a Comment