80960SB Datasheet, PDF (9/38 Pages) Intel Corporation – EMBEDDED 32-BIT MICROPROCESSOR WITH 16-BIT BURST DATA BUS
80-bit registers (fp0 through fp3). These registers
perform the same function as the general-purpose
registers provided in other popular microprocessors.
The term global refers to the fact that these registers
retain their contents across procedure calls.
The local registers, on the other hand, are procedure
specific. For each procedure call, the 80960SB
allocates 16 local registers (r0 through r15). Each
local register is 32 bits wide. Any register can also be
used for single- or double-precision floating-point
operations; the 80-bit floating-point registers are
provided for extended precision.
1.1.4 Multiple Register Sets
To further increase the efficiency of the register set,
multiple sets of local registers are stored on-chip
(see Figure 4). This cache holds up to four local
register frames, which means that up to three
procedure calls can be made without having to
access the procedure stack resident in memory.
Although programs may have procedure calls nested
many calls deep, a program typically oscillates back
and forth between only two to three levels. As a
result, with four stack frames in the cache, the
probability of having a free frame available in the cache
when a call is made is very high. In fact, runs of
representative C-language programs show that 80%
of the calls are handled without needing to access
memory.
If four or more procedures are active and a new
procedure is called, the 80960SB moves the oldest
local register set in the stack-frame cache to a
procedure stack in memory to make room for a new
set of registers. Global register g15 is the frame
pointer (FP) to the procedure stack.
Global and floating-point registers are not exchanged
on a procedure call, but retain their contents, making
them available to all procedures for fast parameter
passing.
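The frame-cache behavior described above can be illustrated with a small Python model. This is only a sketch under stated assumptions: the class name FrameCache, its counters, and the reload-on-return policy are hypothetical and are not taken from the datasheet. Only the capacity of four frames and the spilling of the oldest frame come from the text.

```python
class FrameCache:
    """Toy model of the on-chip cache of local register frames."""

    def __init__(self, capacity=4):
        self.capacity = capacity  # frames held on-chip (four, per the text)
        self.cached = 1           # the running procedure owns one frame
        self.in_memory = 0        # older frames spilled to the procedure stack
        self.memory_ops = 0       # spills plus reloads

    def call(self):
        # A call needs a free frame; if none is left, spill the oldest one.
        if self.cached == self.capacity:
            self.cached -= 1
            self.in_memory += 1
            self.memory_ops += 1
        self.cached += 1

    def ret(self):
        if self.cached + self.in_memory == 1:
            raise RuntimeError("no caller to return to")
        self.cached -= 1
        # Assumed policy: a spilled caller frame is reloaded only when needed.
        if self.cached == 0:
            self.in_memory -= 1
            self.cached = 1
            self.memory_ops += 1


shallow = FrameCache()
for _ in range(3):          # three nested calls fill the four frames
    shallow.call()

deep = FrameCache()
for _ in range(4):          # the fourth nested call forces one spill
    deep.call()

oscillating = FrameCache()
for _ in range(100):        # program moving between two or three levels
    oscillating.call()
    oscillating.call()
    oscillating.ret()
    oscillating.ret()

print(shallow.memory_ops, deep.memory_ops, oscillating.memory_ops)
```

In this model, three nested calls and the oscillating pattern never touch the procedure stack in memory, while a fourth nested call costs one spill.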
1.1.5 Instruction Cache
To further reduce memory accesses, the 80960SB
includes a 512-byte on-chip instruction cache. The
instruction cache exploits locality of reference: most
programs do not execute as a steady stream of
instructions but consist of many branches, loops and
procedure calls that jump back and forth within the
same small section of code. Thus, by maintaining a
block of instructions in
cache, the number of memory references required to
read instructions into the processor is greatly
reduced.
To load the instruction cache, instructions are
fetched in 16-byte blocks; up to four instructions can
be fetched at one time. An efficient prefetch
algorithm increases the probability that an instruction
will already be in the cache when it is needed.
Code for small loops often fits entirely within the
cache, leading to a great increase in processing
speed since further memory references might not be
necessary until the program exits the loop. Similarly,
when calling short procedures, the code for the
calling procedure is likely to remain in the cache so it
will be there on the procedure’s return.
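The effect of loop size on instruction fetches can be sketched with a simple cache model in Python. The 512-byte capacity and 16-byte block size come from the text. Everything else is an assumption made for illustration: a direct-mapped organization, one 4-byte word per instruction, no prefetching, and the hypothetical helper names block_fetches and loop_trace.

```python
CACHE_BYTES = 512    # on-chip instruction cache size (from the text)
BLOCK_BYTES = 16     # fetch block size (from the text)
INSTR_BYTES = 4      # assumed: every instruction is one 32-bit word
NUM_LINES = CACHE_BYTES // BLOCK_BYTES   # 32 blocks fit in the cache


def block_fetches(addresses):
    """Count 16-byte block fetches from memory for a trace of addresses.

    Assumes a direct-mapped cache; the datasheet text does not state
    the actual organization.
    """
    lines = [None] * NUM_LINES
    fetches = 0
    for addr in addresses:
        block = addr // BLOCK_BYTES
        index = block % NUM_LINES
        if lines[index] != block:     # miss: fetch the block from memory
            lines[index] = block
            fetches += 1
    return fetches


def loop_trace(start, n_instructions, iterations):
    """Instruction addresses for a straight-line loop body run repeatedly."""
    body = [start + INSTR_BYTES * i for i in range(n_instructions)]
    return body * iterations


small_loop = block_fetches(loop_trace(0, 100, 10))   # 400-byte loop body
large_loop = block_fetches(loop_trace(0, 200, 10))   # 800-byte loop body
print(small_loop, large_loop)
```

Under these assumptions the 400-byte loop fetches each of its 25 blocks once and then runs entirely from the cache, while the 800-byte loop keeps refetching the blocks that collide on every iteration.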
1.1.6 Register Scoreboarding
The instruction decoder is optimized in several ways.
One optimization method is the ability to overlap
instructions by using register scoreboarding.
Register scoreboarding occurs when a LOAD moves
a variable from memory into a register. When the
instruction initiates, a scoreboard bit on the target
register is set. Once the register is loaded, the bit is
reset. In between, any reference to the register
contents is accompanied by a test of the scoreboard
bit to ensure that the load has completed before
processing continues. Since the processor does not
need to wait for the LOAD to complete, it can execute
additional instructions placed between the LOAD
and the instruction that uses the register contents, as
shown in the following example:
    ld data_1, r4
    ld data_2, r5
    Unrelated instruction
    Unrelated instruction
    add r4, r5, r6
In essence, the two unrelated instructions between
LOAD and ADD are executed “for free” (i.e., take no
apparent time to execute) because they are
executed while the register is being loaded. Up to
three load instructions can be pending at one time
with three corresponding scoreboard bits set. By
exploiting this feature, system programmers and
compiler writers have a useful tool for optimizing
execution speed.
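The "for free" claim can be checked with a toy timing model in Python. This is a sketch, not a description of the actual pipeline: the load latency of three cycles, the single-issue in-order model, the registers r8 through r13 used by the unrelated instructions, and the function name run are all assumptions. The limit of three pending loads is not modeled.

```python
LOAD_LATENCY = 3   # assumed cycles from load issue until the register is valid


def run(program):
    """Return total cycles for a list of (op, sources, dest) instructions.

    ready_at plays the role of the scoreboard: a register with a pending
    load maps to the cycle at which its scoreboard bit is cleared.
    """
    ready_at = {}
    cycle = 0
    for op, sources, dest in program:
        # Stall until the scoreboard bit of every source register is clear.
        for reg in sources:
            cycle = max(cycle, ready_at.get(reg, 0))
        if op == "ld":
            ready_at[dest] = cycle + LOAD_LATENCY   # set the scoreboard bit
        cycle += 1                                  # one issue slot per instruction
    return cycle


loads = [("ld", [], "r4"), ("ld", [], "r5")]
unrelated = [("op", ["r8", "r9"], "r10"), ("op", ["r11", "r12"], "r13")]
add = [("add", ["r4", "r5"], "r6")]

with_fill = run(loads + unrelated + add)   # loads, two unrelated ops, add
without_fill = run(loads + add)            # add stalls on the scoreboard bits
print(with_fill, without_fill)
```

With the assumed latency, both sequences finish in the same number of cycles, so the two unrelated instructions occupy time that would otherwise be spent stalled on the scoreboard bits.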