N instructions

advertisement
Platform-based Design
RISC
Instruction Set
Implementation
Alternatives
== using MIPS as example ==
TU/e 5kk70
Henk Corporaal
Bart Mesman
Topics

MIPS ISA: Instruction Set Architecture
MIPS single cycle implementation
MIPS multi-cycle implementation
MIPS pipelined implementation
Pipeline hazards
Recap of RISC principles
Other architectures

Based on the book:







Many slides; I'll go quick and
skip some
H.Corporaal ACA 5MD00
2
Main Types of Instructions

Arithmetic



Memory access instructions


Integer
Floating Point
Load & Store
Control flow



Jump
Conditional Branch
Call & Return
H.Corporaal ACA 5MD00
3
MIPS arithmetic


Most instructions have 3 operands
Operand order is fixed (destination first)
Example:
C code:
A = B + C
MIPS code: add $s0, $s1, $s2
($s0, $s1 and $s2 are associated with variables by
compiler)
H.Corporaal ACA 5MD00
4
MIPS arithmetic
C code:
A = B + C + D;
E = F - A;
MIPS code: add $t0, $s1, $s2
add $s0, $t0, $s3
sub $s4, $s5, $s0


Operands must be registers, only 32 registers provided
Design Principle: smaller is faster.
Why?
H.Corporaal ACA 5MD00
5
Registers vs. Memory



Arithmetic instruction operands must be registers,
— only 32 registers provided
Compiler associates variables with registers
What about programs with lots of variables ?
CPU
Memory
register file
IO
H.Corporaal ACA 5MD00
6
Register allocation


Compiler tries to keep as many variables in registers as
possible
Some variables can not be allocated




large arrays (too few registers)
aliased variables (variables accessible through pointers in C)
dynamic allocated variables
 heap
 stack
Compiler may run out of registers => spilling
H.Corporaal ACA 5MD00
7
Memory Organization



Viewed as a large, single-dimension array, with an
address
A memory address is an index into the array
"Byte addressing" means that successive addresses are
one byte apart
0
8 bits of data
1
8 bits of data
2
8 bits of data
3
8 bits of data
4
8 bits of data
5
8 bits of data
6
8 bits of data
...
H.Corporaal ACA 5MD00
8
Memory Organization




Bytes are nice, but most data items use larger "words"
For MIPS, a word is 32 bits or 4 bytes.
0
32 bits of data
4
32 bits of data
8
32 bits of data
...
12
32 bits of data
Registers hold 32 bits of data
232 bytes with byte addresses from 0 to 232-1
230 words with byte addresses 0, 4, 8, ... 232-4
H.Corporaal ACA 5MD00
9
Memory layout: Alignment
31
address
0
4
23
15
7
0
this word is aligned; the others are not!
8
12
16
20
24
Words are aligned

What are the least 2 significant bits of a word
address?
H.Corporaal ACA 5MD00
10
Instructions: load and store
Example:
C code:
A[8] = h + A[8];
MIPS code: lw $t0, 32($s3)
add $t0, $s2, $t0
sw $t0, 32($s3)


Store word operation has no destination (reg) operand
Remember arithmetic operands are registers, not
memory!
H.Corporaal ACA 5MD00
11
Let's translate some C-code

Can we figure out the code?
swap(int v[], int k);
{ int temp;
temp = v[k]
v[k] = v[k+1];
v[k+1] = temp;
}
swap:
muli
add
lw
lw
sw
sw
jr
$2 ,
$2 ,
$15,
$16,
$16,
$15,
$31
$5, 4
$4, $2
0($2)
4($2)
0($2)
4($2)
Explanation:
index k : $5
base address of v: $4
address of v[k] is $4 + 4.$5
H.Corporaal ACA 5MD00
12
Machine Language

Instructions, like registers and words of data, are also 32 bits long



Example: add $t0, $s1, $s2
Registers have numbers: $t0=9, $s1=17, $s2=18
Instruction Format:
op
000000
6 bits
rs
rt
10001
10010
5 bits
5 bits
rd
01000
5 bits
shamt
00000
5 bits
funct
100000
6 bits
Can you guess what the field names stand for?
H.Corporaal ACA 5MD00
13
Machine Language

Consider the load-word and store-word instructions,



Introduce a new type of instruction format



What would the regularity principle have us do?
New principle: Good design demands a compromise
I-type for data transfer instructions
other format was R-type for register
Example: lw $t0, 32($s2)
H.Corporaal ACA 5MD00
35
18
9
op
rs
rt
32
16 bit number
14
Stored Program Concept
memory
OS
Program 1
CPU
code
global data
stack
heap
unused
Program 2
unused
H.Corporaal ACA 5MD00
15
Control

Decision making instructions



alter the control flow,
i.e., change the "next" instruction to be executed
MIPS conditional branch instructions:
bne $t0, $t1, Label
beq $t0, $t1, Label

Example:
if (i==j) h = i + j;
bne $s0, $s1, Label
add $s3, $s0, $s1
Label:
....
H.Corporaal ACA 5MD00
16
Control


MIPS unconditional branch instructions:
j label
Example:
if (i!=j)
h=i+j;
else
h=i-j;

beq $s4, $s5, Lab1
add $s3, $s4, $s5
j Lab2
Lab1:sub $s3, $s4, $s5
Lab2:...
Can you build a simple for loop?
H.Corporaal ACA 5MD00
17
So far:


Instruction
Meaning
add $s1,$s2,$s3
sub $s1,$s2,$s3
lw $s1,100($s2)
sw $s1,100($s2)
bne $s4,$s5,L
beq $s4,$s5,L
j Label
$s1 = $s2 + $s3
$s1 = $s2 – $s3
$s1 = Memory[$s2+100]
Memory[$s2+100] = $s1
Next instr. is at Label if $s4 ° $s5
Next instr. is at Label if $s4 = $s5
Next instr. is at Label
Formats:
R
op
rs
rt
rd
I
op
rs
rt
16 bit address
J
op
H.Corporaal ACA 5MD00
shamt
funct
26 bit address
18
Control Flow


We have: beq, bne, what about Branch-if-less-than?
New instruction:
meaning:
if
slt $t0, $s1, $s2


$s1 < $s2 then
$t0 = 1
else
$t0 = 0
Can use this instruction to build "blt $s1, $s2, Label"
— can now build general control structures
Note that the assembler needs a register to do this,
— use conventions for registers
H.Corporaal ACA 5MD00
19
MIPS compiler/assembler Conventions
Name Register number
Usage
$zero
0
the constant value 0
$v0-$v1
2-3
values for results and expression evaluation
$a0-$a3
4-7
arguments
$t0-$t7
8-15
temporaries
$s0-$s7
16-23
saved (by callee)
$t8-$t9
24-25
more temporaries
$gp
28
global pointer
$sp
29
stack pointer
$fp
30
frame pointer
$ra
31
return address
H.Corporaal ACA 5MD00
20
Constants


Small constants are used quite frequently (50% of operands)
e.g., A = A + 5;
B = B + 1;
C = C - 18;
Solutions? Why not?




put 'typical constants' in memory and load them
create hard-wired registers (like $zero) for constants like one
or …….
MIPS Instructions:
addi
slti
andi
ori
H.Corporaal ACA 5MD00
$29,
$8,
$29,
$29,
$29,
$18,
$29,
$29,
4
10
6
4
3
21
How about larger constants?


We'd like to be able to load a 32 bit constant into a register
Must use two instructions; new "load upper immediate"
instruction
lui $t0, 1010101010101010 filled with zeros
1010101010101010

0000000000000000
Then must get the lower order bits right, i.e.,
ori $t0, $t0, 1010101010101010
ori
H.Corporaal ACA 5MD00
1010101010101010
0000000000000000
0000000000000000
1010101010101010
1010101010101010
1010101010101010
22
Assembly Language vs. Machine Language


Assembly provides convenient symbolic representation

much easier than writing down numbers

e.g., destination first
Machine language is the underlying reality



e.g., destination is no longer first
Assembly can provide 'pseudoinstructions'

e.g., “move $t0, $t1” exists only in Assembly

would be implemented using “add $t0,$t1,$zero”
When considering performance you should count real
instructions
H.Corporaal ACA 5MD00
23
Addresses in Branches and Jumps

Instructions:
bne $t4,$t5,Label
beq $t4,$t5,Label
j Label


Next instruction is at Label if $t4  $t5
Next instruction is at Label if $t4 = $t5
Next instruction is at Label
Formats:
I
op
J
op
rs
rt
16 bit address
26 bit address
Addresses are not 32 bits
— How do we handle this with load and store instructions?
H.Corporaal ACA 5MD00
24
What's the next address?

Instructions:
Next instruction is at Label if $t4  $t5
Next instruction is at Label if $t4 = $t5
bne $t4,$t5,Label
beq $t4,$t5,Label

Formats:
I

op
rs
rt
16 bit address
Could specify a register (like lw and sw) and add it to address


use Instruction Address Register (PC = program counter)
most branches are local (principle of locality)

Jump instructions just use high order bits of PC

address boundaries of 256 MB
H.Corporaal ACA 5MD00
25
To summarize:
add
MIPS assembly language
Example
Meaning
add $s1, $s2, $s3
$s1 = $s2 + $s3
Three operands; data in registers
subtract
sub $s1, $s2, $s3
$s1 = $s2 - $s3
Three operands; data in registers
$s1 = $s2 + 100
$s1 = Memory[$s2 + 100]
Memory[$s2 + 100] = $s1
$s1 = Memory[$s2 + 100]
Memory[$s2 + 100] = $s1
Used to add constants
Category
Arithmetic
Instruction
addi $s1, $s2, 100
lw $s1, 100($s2)
load word
sw $s1, 100($s2)
store word
lb $s1, 100($s2)
Data transfer load byte
sb $s1, 100($s2)
store byte
load upper immediate lui $s1, 100
add immediate
Conditional
branch
Unconditional jump
$s1 = 100 * 2
16
Comments
Word from memory to register
Word from register to memory
Byte from memory to register
Byte from register to memory
Loads constant in upper 16 bits
branch on equal
beq
$s1, $s2, 25
if ($s1 == $s2) go to
PC + 4 + 100
Equal test; PC-relative branch
branch on not equal
bne
$s1, $s2, 25
if ($s1 != $s2) go to
PC + 4 + 100
Not equal test; PC-relative
set on less than
slt
$s1, $s2, $s3
if ($s2 < $s3) $s1 = 1;
else $s1 = 0
Compare less than; for beq, bne
set less than
immediate
slti
jump
j
jr
jal
jump register
jump and link
H.Corporaal ACA 5MD00
$s1, $s2, 100 if ($s2 < 100) $s1 = 1;
Compare less than constant
else $s1 = 0
2500
$ra
2500
Jump to target address
go to 10000
For switch, procedure return
go to $ra
$ra = PC + 4; go to 10000 For procedure call
26
MIPS (3+2) addressing modes overview
1. Immediate addressing
op
rs
rt
Immediate
2. Register addressing
op
rs
rt
rd
...
funct
Registers
Register
3. Base addressing
op
rs
rt
Memory
Address
+
Register
Byte
Halfword
Word
4. PC-relative addressing
op
rs
rt
Memory
Address
PC
+
Word
5. Pseudodirect addressing
op
Address
PC
H.Corporaal ACA 5MD00
Memory
Word
27
MIPS Datapath

Building a datapath


A single cycle processor datapath


all instruction actions in one (long) cycle
A multi-cycle processor datapath


support a subset of the MIPS-I instruction-set
each instructions takes multiple (shorter) cycles
For details see book (ch 5):
H.Corporaal ACA 5MD00
28
Datapath and Control
FSM
or
Microprogramming
Registers &
Memories
Multiplexors
Buses
ALUs
Control
H.Corporaal ACA 5MD00
Datapath
29
The Processor: Datapath & Control

Simplified MIPS implementation to contain only:




lw, sw
add, sub, and, or, slt
beq, j
Generic Implementation:





memory-reference instructions:
arithmetic-logical instructions:
control flow instructions:
use the program counter (PC) to supply instruction address
get the instruction from memory
read registers
use the instruction to decide exactly what to do
All instructions use the ALU after reading the registers
Why?
memory-reference?
 arithmetic?
 control flow?
H.Corporaal ACA 5MD00

30
More Implementation Details

Abstract / Simplified View:
Data
Address
PC
Instruction
memory
Instruction
Register #
Registers
Register #
ALU
Address
Data
memory
Register #
Data

Two types of functional units:


H.Corporaal ACA 5MD00
elements that operate on data values (combinational)
elements that contain state (sequential)
31
State Elements


Unclocked vs. Clocked
Clocks used in synchronous logic

when should an element that contains state be updated?
falling edge
cycle time
rising edge
H.Corporaal ACA 5MD00
32
An unclocked state element

The set-reset (SR) latch

output depends on present inputs and also on past inputs
R
Q
Q
S
Truth table:
H.Corporaal ACA 5MD00
R
0
0
1
1
S
0
1
0
1
Q
Q
1
0
?
state change
33
Latches and Flip-flops


Output is equal to the stored value inside the element
(don't need to ask for permission to look at the value)
Change of state (value) is based on the clock


Latches: whenever the inputs change, and the clock is asserted
Flip-flop: state changes only on a clock edge
(edge-triggered methodology)
A clocking methodology defines when signals can be read and written
— wouldn't want to read a signal at the same time it was being written
H.Corporaal ACA 5MD00
34
D-latch

Two inputs:



the data value to be stored (D)
the clock signal (C) indicating when to read & store D
Two outputs:

the value of the internal state (Q) and it's complement
C
D
Q
C
_
Q
D
H.Corporaal ACA 5MD00
Q
35
D flip-flop

Output changes only on the clock edge
D
D
C
D
latch
Q
D
Q
D
latch _
C
Q
Q
_
Q
C
D
C
Q
H.Corporaal ACA 5MD00
36
Our Implementation


An edge triggered methodology
Typical execution:



read contents of some state elements,
send values through some combinational logic,
write results to one or more state elements
State
element
1
Combinational logic
State
element
2
Clockcycle
H.Corporaal ACA 5MD00
37
Register File

3-ported: one write, two read ports
Read reg. #1
Read
data 1
Read reg.#2
Read
data 2
Write reg.#
Write
data
Write
H.Corporaal ACA 5MD00
38
Register file: read ports
• Register file built using D flip-flops
Read register
number 1
Register 0
Register 1
M
Register n – 1
u
x
Read data 1
Register n
Read register
number 2
M
u
Read data 2
x
Implementation of the read ports
H.Corporaal ACA 5MD00
39
Register file: write port

Note: we still use the real clock to determine when to
write
W r ite
0
1
R e g is te r n um b e r
n -to -1
C
R e g iste r 0
D
C
d e co d e r
n – 1
R e g iste r 1
D
n
C
R e g is te r n – 1
D
C
R e g iste r n
R e gister d ata
H.Corporaal ACA 5MD00
D
40
Building the Datapath
Use multiplexors to stitch them together

PCSrc
M
u
x
Add
Add ALU
result
4
Shift
left 2
Registers
PC
Read
address
Instruction
Instruction
memory
Read
register 1
Read
Read
data 1
register 2
Write
register
Write
data
RegWrite
16
H.Corporaal ACA 5MD00
ALUSrc
Read
data 2
Sign
extend
M
u
x
3 ALU operation
Zero
ALU ALU
result
MemWrite
MemtoReg
Address
Read
data
Data
Write memory
data
M
u
x
32
MemRead
41
Our Simple Control Structure



All of the logic is combinational
We wait for everything to settle down, and the right thing
to be done

ALU might not produce “right answer” right away

we use write signals along with clock to determine when to write
Cycle time determined by length of the longest path
S tate
elem ent
1
Com binational logic
State
elem ent
2
Clock cycle
We are ignoring some details like setup and hold times !
H.Corporaal ACA 5MD00
42
Control

Selecting the operations to perform (ALU, read/write, etc.)

Controlling the flow of data (multiplexor inputs)

Information comes from the 32 bits of the instruction

Example:
add $8, $17, $18
000000
op

Instruction Format:
10001
rs
10010
rt
01000
rd
00000
shamt
100000
funct
ALU's operation based on instruction type and function code
H.Corporaal ACA 5MD00
43
Control: 2 level implementation
31
6
Control 2
26
instruction register
Opcode
bit
2
ALUop
00: lw, sw
01: beq
10: add, sub, and, or, slt
Funct.
Control 1
H.Corporaal ACA 5MD00
5
6
3
ALUcontrol 000: and
001: or
010: add
110: sub
111: set on less than
ALU
0
44
Datapath with Control
0
M
u
x
Add ALU
result
Add
4
Instruction[31–26]
PC
Instruction
memory
Read
register 1
Instruction[20–16]
Instruction
[31–0]
Instruction[15–11]
Shift
left 2
RegDst
Branch
MemRead
MemtoReg
Control ALUOp
MemWrite
ALUSrc
RegWrite
Instruction[25–21]
Read
address
0
M
u
x
1
1
Read
data1
Read
register 2
Registers Read
Write
data2
register
0
M
u
x
1
Write
data
Zero
ALU ALU
result
Address
Write
data
Instruction[15–0]
16
Sign
extend
Read
data
Data
memory
1
M
u
x
0
32
ALU
control
Instruction[5–0]
H.Corporaal ACA 5MD00
45
ALU Control1


What should the ALU do with this instruction
example: lw $1, 100($2)
2
1
op
rs
rt
100
16 bit offset
ALU control input
000
001
010
110
111

35
AND
OR
add
subtract
set-on-less-than
Why is the code for subtract 110 and not 011?
H.Corporaal ACA 5MD00
46
ALU Control1

Must describe hardware to compute 3-bit ALU control input



given instruction type
00 = lw, sw
01 = beq,
10 = arithmetic
function code for arithmetic
ALU Operation class,
computed from instruction type
Describe it using a truth table (can turn into gates):
inputs
H.Corporaal ACA 5MD00
ALUOp
ALUOp1 ALUOp0
0
0
X
1
1
X
1
X
1
X
1
X
1
X
F5
X
X
X
X
X
X
X
Funct field
F4 F3 F2 F1
X X X X
X X X X
X 0 0 0
X 0 0 1
X 0 1 0
X 0 1 0
X 1 0 1
outputs
Operation
F0
X
X
0
0
0
1
0
010
110
010
110
000
001
111
47
ALU Control1

Simple combinational logic (truth tables)
ALUOp
ALU control block
ALUOp0
ALUOp1
F3
F2
F (5– 0)
Operation2
Operation1
Operation
F1
Operation0
F0
H.Corporaal ACA 5MD00
48
Deriving Control2 signals
Input
6-bits
9 control (output) signals
Memto- Reg Mem Mem
Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0
R-format
1
0
0
1
0
0
0
1
0
lw
0
1
1
1
1
0
0
0
0
sw
X
1
X
0
0
1
0
0
0
beq
X
0
X
0
0
0
1
0
1
Determine these control signals directly from the opcodes:
R-format: 0
lw:
35
sw:
43
beq:
4
H.Corporaal ACA 5MD00
49
Control 2
Inputs
Op5
Op4
Op3

PLA example
implementation
Op2
Op1
Op0
Outputs
R-format
Iw
sw
beq
RegDst
ALUSrc
MemtoReg
RegWrite
MemRead
MemWrite
Branch
ALUOp1
ALUOpO
H.Corporaal ACA 5MD00
50
Single Cycle Implementation

Calculate cycle time assuming negligible delays except:

memory (2ns), ALU and adders (2ns), register file access (1ns)
PCSrc
Add
ALU
Add result
4
RegWrite
Instruction [25– 21]
PC
Read
address
Instruction
[31– 0]
Instruction
memory
Instruction [20– 16]
1
M
u
Instruction [15– 11] x
0
RegDst
Instruction [15– 0]
Read
register 1
Read
register 2
Read
data 1
Read
Write
data 2
register
Write
Registers
data
16
Sign 32
extend
1
M
u
x
0
Shift
left 2
MemWrite
ALUSrc
1
M
u
x
0
Zero
ALU ALU
result
MemtoReg
Address
Write
data
ALU
control
Read
data
Data
memory
1
M
u
x
0
MemRead
Instruction [5– 0]
ALUOp
H.Corporaal ACA 5MD00
51
Single Cycle Implementation

Memory (2ns), ALU & adders (2ns), reg. file access (1ns)

Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns

Variable clock length (not realistic, just as exercise):






R-instr:
Load:
Store:
Branch:
Jump:
6 ns
8 ns
7 ns
5 ns
2 ns
Average depends on instruction mix
H.Corporaal ACA 5MD00
52
Where we are headed

Single Cycle Problems:



what if we had a more complicated instruction like floating point?
wasteful of area: NO Sharing of Hardware resources
One Solution:



use a “smaller” cycle time
have different instructions take different numbers of cycles
a “multicycle” datapath:
Instruction
register
PC
Address
ALU
Registers
Memory
data
register
MDR
H.Corporaal ACA 5MD00
A
Register #
Instruction
Memory
or data
Data
Data
IR
ALUOut
Register #
B
Register #
53
Multicycle Approach

We will be reusing functional units




Add registers after every major functional unit
Our control signals will not be determined solely by
instruction


ALU used to compute address and to increment PC
Memory used for instruction and data
e.g., what should the ALU do for a “subtract” instruction?
We’ll use a finite state machine (FSM) or microcode for
control
H.Corporaal ACA 5MD00
54
Review: finite state machines

Finite state machines:



a set of states and
next state function (determined by current state and the input)
output function (determined by current state and possibly input)
Current state
Next-state
function
Next
state
Clock
Inputs
Output
function

Outputs
We’ll use a Moore machine (output based only on current state)
H.Corporaal ACA 5MD00
55
Multicycle Approach

Break up the instructions into steps, each step takes a
cycle



At the end of a cycle



balance the amount of work to be done
restrict each cycle to use only one major functional unit
store values for use in later cycles (easiest thing to do)
introduce additional “internal” registers
Notice: we distinguish


processor state: programmer visible registers
internal state: programmer invisible registers (like IR, MDR, A, B,
and ALUout)
H.Corporaal ACA 5MD00
56
Multicycle Approach
PC
0
M
u
x
1
Address
Memory
Instruction
[25–21]
Read
register 1
Instruction
[20–16]
Read
Read
data1
register 2
Registers
Write
Read
register data2
MemData
Write
data
Instruction
[15–0] Instruction
[15–11]
Instruction
register
Instruction
[15–0]
Memory
data
register
H.Corporaal ACA 5MD00
0
M
u
x
1
A
B
0
M
u
x
1
Sign
extend
32
Zero
ALU ALU
result
ALUOut
0
4
Write
data
16
0
M
u
x
1
1M
u
2x
3
Shift
left 2
57
Multicycle Approach

Note that previous picture does not include:



branch support
jump support
Control lines and logic

Tclock > max (ALU delay, Memory access, Regfile access)

See book for complete picture
H.Corporaal ACA 5MD00
58
Five Execution Steps

Instruction Fetch

Instruction Decode and Register Fetch

Execution, Memory Address Computation, or Branch
Completion

Memory Access or R-type instruction completion

Write-back step
INSTRUCTIONS TAKE FROM 3 - 5 CYCLES!
H.Corporaal ACA 5MD00
59
Step 1: Instruction Fetch



Use PC to get instruction and put it in the Instruction
Register
Increment the PC by 4 and put the result back in the PC
Can be described succinctly using RTL "Register-Transfer
Language"
IR = Memory[PC];
PC = PC + 4;

Can we figure out the values of the control signals?

What is the advantage of updating the PC now?
H.Corporaal ACA 5MD00
60
Step 2: Instruction Decode and
Register Fetch




Read registers rs and rt in case we need them
Compute the branch address in case the instruction is a
branch
Previous two actions are done optimistically!!
RTL:
A = Reg[IR[25-21]];
B = Reg[IR[20-16]];
ALUOut = PC+(sign-extend(IR[15-0])<< 2);

We aren't setting any control lines based on the instruction
type
(we are busy "decoding" it in our control logic)
H.Corporaal ACA 5MD00
61
Step 3 (instruction dependent)

ALU is performing one of four functions, based on instruction type

Memory Reference:
ALUOut = A + sign-extend(IR[15-0]);

R-type:
ALUOut = A op B;

Branch:
if (A==B) PC = ALUOut;

Jump:
PC = PC[31-28] || (IR[25-0]<<2)
H.Corporaal ACA 5MD00
62
Step 4 (R-type or Memory-access)

Loads and stores access memory
MDR = Memory[ALUOut];
or
Memory[ALUOut] = B;

R-type instructions finish
Reg[IR[15-11]] = ALUOut;
The write actually takes place at the end of the cycle
on the edge
H.Corporaal ACA 5MD00
63
Write-back step

Memory read completion step
Reg[IR[20-16]]= MDR;
What about all the other instructions?
H.Corporaal ACA 5MD00
64
Summary execution steps
Steps taken to execute any instruction class
Step name
Instruction fetch
Action for R-type
instructions
Instruction
decode/register fetch
Action for memory-reference
Action for
instructions
branches
IR = Memory[PC]
PC = PC + 4
A = Reg [IR[25-21]]
B = Reg [IR[20-16]]
ALUOut = PC + (sign-extend (IR[15-0]) << 2)
Execution, address
computation, branch/
jump completion
ALUOut = A op B
ALUOut = A + sign-extend
(IR[15-0])
Memory access or R-type
completion
Reg [IR[15-11]] =
ALUOut
Load: MDR = Memory[ALUOut]
or
Store: Memory [ALUOut] = B
Memory read completion
H.Corporaal ACA 5MD00
if (A ==B) then
PC = ALUOut
Action for
jumps
PC = PC [31-28] II
(IR[25-0]<<2)
Load: Reg[IR[20-16]] = MDR
65
Simple Questions

How many cycles will it take to execute this code?
lw $t2, 0($t3)
lw $t3, 4($t3)
beq $t2, $t3, L1
add $t5, $t2, $t3
sw $t5, 8($t3)
L1: ...


#assume not taken
What is going on during the 8th cycle of execution?
In what cycle does the actual addition of $t2 and $t3 takes place?
H.Corporaal ACA 5MD00
66
Implementing the Control

Value of control signals is dependent upon:



Use the information we have accumulated to specify a finite
state machine (FSM)



what instruction is being executed
which step is being performed
specify the finite state machine graphically, or
use microprogramming
Implementation can be derived from specification
H.Corporaal ACA 5MD00
67
0
Start
How many
state bits will
we need?
Memory address
computation
6
ALUSrcA = 1
ALUSrcB = 10
ALUOp = 00
Memory
access
5
MemRead
IorD = 1
Write-back step
4
RegDst = 0
RegWrite
MemtoReg = 1
ALUSrcA = 0
ALUSrcB = 11
ALUOp = 00
Branch
completion
8
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 10
Memory
access
3
1
Execution
2
(Op = 'LW')

MemRead
ALUSrcA = 0
IorD = 0
IRWrite
ALUSrcB = 01
ALUOp = 00
PCWrite
PCSource = 00
MemWrite
IorD = 1
RegDst = 1
RegWrite
MemtoReg = 0
Jump
completion
9
ALUSrcA = 1
ALUSrcB = 00
ALUOp = 01
PCWriteCond
PCSource = 01
R-type completion
7
(Op = 'J')
Graphical Specification
of FSM
Instruction decode/
register fetch
Instruction fetch
PCWrite
PCSource = 10
Finite State Machine for Control
PCWrite
PCWriteCond
IorD
MemRead
Implementation:
MemWrite
IRWrite
Control logic
MemtoReg
PCSource
ALUOp
Outputs
ALUSrcB
ALUSrcA
RegWrite
RegDst
NS3
NS2
NS1
NS0
Instruction register
opcode field
H.Corporaal ACA 5MD00
S0
S1
S2
S3
Op0
Op1
Op2
Op3
Op4
Op5
Inputs
State register
69
Op4
opcode
PLA
Implementation
Op5
(see book)
Op3
Op2
Op1
Op0
S3
current
state
S2
S1
S0

If I picked a
horizontal or
vertical line could
you explain it ?
What type of
FSM is used?
Mealy or Moore?
datapath control

PCWrite
PCWriteCond
IorD
MemRead
MemWrite
IRWrite
MemtoReg
PCSource1
PCSource0
ALUOp1
ALUOp0
ALUSrcB1
ALUSrcB0
ALUSrcA
RegWrite
RegDst
NS3
NS2
NS1
NS0
next
state
H.Corporaal ACA 5MD00
70
Pipelined implementation




Pipelining
Pipelined datapath
Pipelined control
Hazards:






Structural
Data
Control
Exceptions
Scheduling
For details see the book (chapter 6):
H.Corporaal ACA 5MD00
71
Pipelining
Improve performance by increasing instruction throughput
P rog ram
e x ec ution
T im e
o rd er
(in in structio ns )
lw $ 1, 10 0 ($0 )
2
Ins truction
R eg
fe tch
lw $ 2, 20 0 ($0 )
4
6
8
A LU
Data
access
10
12
14
16
18
R eg
Instruction
R eg
fe tch
8 ns
lw $ 3, 30 0 ($0 )
D ata
acc ess
A LU
R eg
Instruction
fe tch
8 ns
...
8 ns
P rog ram
e x ec utio n
Tim e
o rd er
(in in struc tio n s)
2
lw $1 , 1 00 ($ 0)
Instruction
fetch
lw $2 , 2 00 ($ 0)
2 ns
lw $3 , 3 00 ($ 0)
4
Reg
Instruction
fetch
2 ns
6
ALU
Reg
Instruction
fetch
2 ns
H.Corporaal ACA 5MD00
8
Da ta
access
ALU
Reg
2 ns
10
14
12
Reg
Data
access
Reg
ALU
D a ta
access
2 ns
2 ns
Reg
2 ns
72
Pipelining

Ideal speedup = number of stages

Do we achieve this?
H.Corporaal ACA 5MD00
73
Pipelining

What makes it easy




What makes it hard?





all instructions are the same length
just a few instruction formats
memory operands appear only in loads and stores
structural hazards: suppose we had only one memory
control hazards: need to worry about branch instructions
data hazards: an instruction depends on a previous instruction
We’ll build a simple pipeline and look at these issues
We’ll talk about modern processors and what really makes it
hard:


H.Corporaal ACA 5MD00
exception handling
trying to improve performance with out-of-order execution, etc.
74
Basic idea: start from single cycle impl.
What do we need to add to actually split the datapath into stages?
IF: Instruction fetch
ID: Instruction decode/
register file read
EX: Execute/
address calculation
MEM: Memory access
WB: Write back
0
M
u
x
1
Add
Add
Add result
4
Shift
left 2
PC
Read
register 1
Address
Instruction
Instruction
memory
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal ACA 5MD00
Sign
extend
Read
data
1
M
u
x
0
32
75
Pipelined Datapath
Can you find a problem even if there are no dependencies?
What instructions can we execute to manifest the problem?
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
Add result
4
PC
Address
Instruction
memory
Instruction
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal ACA 5MD00
Sign
extend
Read
data
1
M
u
x
0
32
76
Corrected Datapath
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
4
Add
Add
result
PC
Address
Instruction
memory
I nstr ucti on
Shift
left 2
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
0
M
u
x
1
Zero
ALU ALU
result
Address
Data
memory
Write
data
16
H.Corporaal ACA 5MD00
Sign
extend
Read
data
1
M
u
x
0
32
77
Graphically Representing Pipelines
Time (in clock cycles)
Program
execution
order
(in instructions)
lw $10, 20($1)
sub $11, $2, $3

CC 1
CC 2
CC 3
CC 4
CC 5
IM
Reg
ALU
DM
Reg
ALU
DM
IM
Reg
CC 6
Reg
Can help with answering questions like:



how many cycles does it take to execute this code?
what is the ALU doing during cycle 4?
use this representation to help understand datapaths
H.Corporaal ACA 5MD00
78
Pipeline Control
PCSrc
0
M
u
x
1
IF/ID
ID/EX
EX/MEM
MEM/WB
Add
Add
4
Add
result
Branch
Shift
left 2
PC
Address
Instruction
memory
Instruction
RegWrite
Read
register 1
MemWrite
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
ALUSrc
Zero
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Address
Data
memory
Write
Read
data
1
M
u
x
0
data
Instruction
16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
0
M
u
x
1
ALU
control
MemRead
ALUOp
RegDst
H.Corporaal ACA 5MD00
79
Pipeline control

We have 5 stages. What needs to be controlled in each
stage?






Instruction Fetch and PC Increment
Instruction Decode / Register Fetch
Execution
Memory Stage
Write Back
How would control be handled in an automobile plant?


a fancy control center telling everyone what to do?
should we use a finite state machine?
H.Corporaal ACA 5MD00
80
Pipeline Control
Instruction
R-format
lw
sw
beq
Execution/Address
Calculation stage control
lines
Reg
ALU
ALU
ALU
Dst
Op1
Op0
Src
1
1
0
0
0
0
0
1
X
0
0
1
X
0
1
0
Memory access stage
control lines
Branc Mem
Mem
h
Read Write
0
0
0
0
1
0
0
0
1
1
0
0
Write-back
stage control
lines
Reg
Mem
write to Reg
1
0
1
1
0
X
0
X
(compare single cycle control!)
Pass control signals along
just like the data:
WB
Instruction
H.Corporaal ACA 5MD00
IF/ID
Control
M
WB
EX
M
WB
ID/EX
EX/MEM
MEM/WB
81
Datapath with Control
PCSrc
ID/EX
0
M
u
x
1
WB
Control
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
Add
ALUSrc
Read
register 1
Read
data 1
Read
register 2
Registers Read
Write
data 2
register
Write
data
Zero
ALU ALU
result
0
M
u
x
1
MemtoReg
Instruction
memory
Branch
Shift
left 2
MemWrite
Address
Instruction
PC
Add
Add result
RegWrite
4
Address
Data
memory
Read
data
Write
data
Instruction 16
[15– 0]
Instruction
[20– 16]
Instruction
[15– 11]
Sign
extend
32
6
ALU
control
0
M
u
x
1
1
M
u
x
0
MemRead
ALUOp
RegDst
H.Corporaal ACA 5MD00
82
H.Corporaal ACA 5MD00
83
Hazards: problems due to pipelining
Hazard types:

Structural


Data


same resource is needed multiple times in the same cycle
data dependencies limit pipelining
Control

next executed instruction may not be the next specified
instruction
H.Corporaal ACA 5MD00
84
Structural hazards
Examples:

Two accesses to a single ported memory

Two operations need the same function unit
at the same time

Two operations need the same function unit
in successive cycles, but the unit is not pipelined
Solutions:

stalling

add more hardware
H.Corporaal ACA 5MD00
85
Structural hazards on MIPS
Q: Do we have structural hazards on our simple MIPS
pipeline?
time
IF
H.Corporaal ACA 5MD00
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
86
Data hazards

Data dependencies:




(read-after-write)
(write-after-write)
(write-after-read)
Hardware solution:




RaW
WaW
WaR
Forwarding / Bypassing
Detection logic
Stalling
Software solution: Scheduling
H.Corporaal ACA 5MD00
87
Data dependences
Three types: RaW, WaR and WaW
add r1, r2, 5
sub r4, r1, r3
; r1 := r2+5
; RaW of r1
add r1, r2, 5
sub r2, r4, 1
; WaR of r2
add r1, r2, 5
sub r1, r1, 1
; WaW of r1
st
ld
; M[r2+5] := r1
; RaW if 5+r2 = 0+r4
r1, 5(r2)
r5, 0(r4)
WaW and WaR do not occur in simple pipelines, but they limit
scheduling freedom!
Problems for your compiler and Pentium!
 use register renaming to solve this!
H.Corporaal ACA 5MD00
88
RaW on MIPS pipeline
T im e (in clock cycle s)
V alue of
re giste r $ 2 :
CC 1
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
10
10
10
1 0 /– 2 0
– 20
– 20
– 20
– 20
IM
Reg
DM
Reg
P ro gra m
e xe cution
orde r
(in instru ctio ns)
sub $ 2 , $ 1 , $ 3
a nd $ 1 2 , $ 2 , $ 5
or $ 1 3 , $ 6 , $ 2
a dd $ 1 4 , $ 2 , $ 2
sw $ 1 5, 1 0 0 ($ 2 )
H.Corporaal ACA 5MD00
IM
DM
R eg
IM
DM
R eg
IM
R eg
DM
Reg
IM
R eg
R eg
R eg
DM
Reg
89
Forwarding
Use temporary results, don’t wait for them to be written


register file forwarding to handle read/write to same register
ALU forwarding
T ime (in clock cycles)
CC 1
V alue of re gister $ 2 : 10
V a lue of E X/M EM : X
V a lue of M EM /W B : X
CC 2
CC 3
CC 4
CC 5
CC 6
CC 7
CC 8
CC 9
10
X
X
10
X
X
10
– 20
X
1 0/– 20
X
– 20
– 20
X
X
– 20
X
X
– 20
X
X
– 20
X
X
DM
R eg
Program
e xe cution orde r
(in instructions)
sub $ 2, $1 , $ 3
What if this
$2 was $13?
a nd $ 12 , $ 2, $5
or $ 13 , $ 6, $ 2
a dd $ 14 , $ 2, $2
sw $ 15 , 1 00 ($2 )
H.Corporaal ACA 5MD00
IM
Reg
IM
R eg
IM
DM
R eg
IM
R eg
DM
DM
R eg
IM
Reg
R eg
Reg
DM
Reg
90
Forwarding hardware
ALU forwarding circuitry principle:
from register file
ALU
to register file
from register file
Note: there are two options
• buf - ALU – bypass – mux - buf
• buf - bypass – mux – ALU - buf
H.Corporaal ACA 5MD00
91
Forwarding
ID/EX
WB
Control
PC
Instruction
memory
Instruction
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
M
u
x
Registers
ForwardA
ALU
Data
memory
M
u
x
IF/ID.RegisterRs
Rs
IF/ID.RegisterRt
Rt
IF/ID.RegisterRt
Rt
IF/ID.RegisterRd
Rd
ForwardB
M
u
x
EX/MEM.RegisterRd
Forwarding
unit
H.Corporaal ACA 5MD00
M
u
x
MEM/WB.RegisterRd
92
Forwarding check


Check for matching register-ids:
For each source-id of operation in the EX-stage check if
there is a matching pending dest-id
Example:
if (EX/MEM.RegWrite) 
(EX/MEM.RegisterRd  0) 
(EX/MEM.RegisterRd = ID/EX.RegisterRs)
then ForwardA = 10
Q. How many comparators do we need?
H.Corporaal ACA 5MD00
93
Can't always forward

Load word can still cause a hazard:

an instruction tries to read register r following a load to the same r
Need a hazard detection unit to “stall” the load instruction

Time (in clock cycles)
Program
CC 1
execution
order
(in instructions)
lw $2, 20 ($1)
and $4, $2, $5
or $8, $2, $6
add $9, $4, $2
slt $1, $6, $7
H.Corporaal ACA 5MD00
IM
CC 2
CC 3
Reg
IM
CC 4
CC 5
DM
Reg
Reg
IM
DM
Reg
IM
CC 6
CC 8
CC 9
Reg
DM
Reg
IM
CC 7
Reg
DM
Reg
Reg
DM
Reg
94
Stalling
We can stall the pipeline by keeping an instruction in the same stage
Program
Time(inclockcycles)
execution
CC1
CC2
order
(ininstructions)
lw$2, 20($1)
and$4, $2, $5
or $8, $2, $6
IM
CC3
Reg
IM
CC4
CC5
DM
Reg
Reg
Reg
IM
IM
CC6
CC7
DM
Reg
Reg
DM
CC8
CC9
CC10
Reg
bubble
add$9, $4, $2
In
CC4 the ALU is not used,
slt $1, $6, $7
Reg, and IM are redone
H.Corporaal ACA 5MD00
IM
DM
Reg
IM
Reg
Reg
DM
Reg
95
Hazard Detection Unit
ID/EX.MemRead
Hazard
detection
unit
ID/EX
IF/IDWrite
WB
Control
0
M
u
x
PC
Instruction
memory
Instruction
PCWrite
IF/ID
EX/MEM
M
WB
EX
M
MEM/WB
WB
M
u
x
Registers
ALU
Data
memory
M
u
x
M
u
x
IF/ID.RegisterRs
IF/ID.RegisterRt
H.Corporaal ACA 5MD00
IF/ID.RegisterRt
Rt
IF/ID.RegisterRd
Rd
ID/EX.RegisterRt
Rs
Rt
M
u
x
EX/MEM.RegisterRd
Forwarding
unit
MEM/WB.RegisterRd
96
Software only solution?


Have compiler guarantee that no hazards occur
Example: where do we insert the “NOPs” ?
sub
and
or
add
sw

$2,
$12,
$13,
$14,
$13,
$1, $3
$2, $5
$6, $2
$2, $2
100($2)
sub
nop
nop
and
or
Problem: this really slows us down! add
nop
sw
H.Corporaal ACA 5MD00
$2,
$1, $3
$12, $2, $5
$13, $6, $2
$14, $2, $2
$13, 100($2)
97
Control hazards

Control operations may change the sequential flow of
instructions





branch
jump
call (jump and link)
return
(exception/interrupt and rti / return from interrupt)
H.Corporaal ACA 5MD00
98
Control hazard: Branch
Branch actions:

Compute new address

Determine condition

Perform the actual branch (if taken): PC := new address
H.Corporaal ACA 5MD00
99
Branch example
Progra m
Time (in clock cycle s)
execu tion
CC 1
CC 2
IM
Reg
CC 3
CC 4
CC 5
DM
R eg
CC 6
CC 7
CC 8
CC 9
order
(in instructions)
40 beq $1, $3, 7
44 an d $1 2, $2 , $ 5
48 or $13, $6 , $2
52 ad d $1 4, $2 , $ 2
72 lw $4 , 50($ 7)
H.Corporaal ACA 5MD00
IM
R eg
IM
DM
R eg
IM
R eg
DM
R eg
IM
R eg
DM
R eg
Reg
DM
R eg
100
Branching
Squash pipeline:

When we decide to branch, other instructions are in the
pipeline!

We are predicting “branch not taken”

need to add hardware for flushing instructions if we are wrong
H.Corporaal ACA 5MD00
101
Branch with predict not taken
Clock cycles
Branch L
IF
Predict
not taken
L:
H.Corporaal ACA 5MD00
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
102
Branch speedup

Earlier address computation
Earlier condition calculation

Put both in the ID pipeline stage



adder
comparator
Clock cycles
Branch L
Predict
not taken
L:
H.Corporaal ACA 5MD00
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
IF
ID
EX
MEM WB
103
Improved branching / flushing IF/ID
IF.Flush
Hazard
detection
unit
ID/EX
M
u
x
WB
Control
0
M
u
x
IF/ID
4
M
WB
EX
M
MEM/WB
WB
Shift
left 2
Registers
PC
EX/MEM
=
M
u
x
Instruction
memory
ALU
M
u
x
Data
memory
M
u
x
Sign
extend
M
u
x
Forwarding
unit
H.Corporaal ACA 5MD00
104
Exception support
Types of exceptions:

Overflow

I/O device request

Operating system call

Undefined instruction

Hardware malfunction

Page fault

Precise exception:


finish previous instructions (which are still in the pipeline)
flush excepting and following instructions, redo them after
handling the exception(s)
H.Corporaal ACA 5MD00
105
Exceptions
Changes needed for handling overflow exception of an
operation in EX stage (see book for details) :


Extend PC input mux with extra entry with fixed address
Add EPC register recording the ID/EX stage PC


this is the address of the next instruction !
Cause register recording exception type
E.g., in case of overflow exception insert 3 bubbles;
flush the following stages:

IF/ID stage

ID/EX stage

EX/MEM stage
H.Corporaal ACA 5MD00
106
Scheduling, why?
Let’s look at the execution time:
Texecution = Ncycles x Tcycle
= Ninstructions x CPI x Tcycle
Scheduling may reduce Texecution

Reduce CPI (cycles per instruction)
 early scheduling of long latency operations
 avoid pipeline stalls due to structural, data and control hazards
allow Nissue > 1 and therefore CPI < 1
Reduce Ninstructions
 compact many operations into each instruction (VLIW)


H.Corporaal ACA 5MD00
107
Scheduling data hazards:
example 1
Try and avoid RaW stalls (in this case load interlocks)!
E.g., reorder these instructions:
lw
lw
sw
sw
$t0,
$t2,
$t2,
$t0,
H.Corporaal ACA 5MD00
0($t1)
4($t1)
0($t1)
4($t1)
lw
lw
sw
sw
$t0,
$t2,
$t0,
$t2,
0($t1)
4($t1)
4($t1)
0($t1)
108
Scheduling data hazards
example 2
Avoiding RaW stalls:
Reordering instructions for
following program
(by you or the compiler)
Unscheduled code:
Lw R1,b
Lw R2,c
Add R3,R1,R2
interlock
Sw a,R3
Lw R1,e
Lw R2,f
Sub R4,R1,R2
interlock
Sw d,R4
Code:
a = b + c
d = e - f
H.Corporaal ACA 5MD00
Scheduled code:
Lw R1,b
Lw R2,c
Lw R5,e
extra reg. needed!
Add R3,R1,R2
Lw R2,f
Sw a,R3
Sub R4,R5,R2
Sw d,R4
109
Scheduling control hazards
Texecution = Ninstructions x CPI x Tcycle
CPI = CPIideal + fbranch x Pbranch
Pbranch = Ndelayslots x miss_rate

Modern processors tend to have large branch penalty,
Pbranch, due to:



many pipeline stages
multi-issue
Note that penalties have larger effect when CPIideal is
low
H.Corporaal ACA 5MD00
110
Scheduling control hazards
What can we do about control hazards and CPI
penalty?

Keep penalty Pbranch low:






Early computation of new PC
Early determination of condition
Visible branch delay slots filled by compiler (MIPS)
Branch prediction
Reduce control dependencies (control height
reduction) [Schlansker and Kathail, Micro’95]
Remove branches: if-conversion


Conditional instructions: CMOVE, cond skip next
Guarding all instructions: TriMedia
H.Corporaal ACA 5MD00
111
Branch delay slot

Add a branch delay slot:



the next instruction after a branch is always executed
rely on compiler to “fill” the slot with something useful
Is this a good idea?

let's look how it works
H.Corporaal ACA 5MD00
112
Branch delay slot scheduling
Q. What to put in the delay slot?
op 1
beq r1,r2, L
.............
'fall-through'
op 2
.............
branch target
L: op 3
.............
H.Corporaal ACA 5MD00
113
Summary



Modern processors are (deeply) pipelined, to reduce
Tcycle and aim at CPI = 1
Hazards increase CPI
Several software and hardware measure to avoid or
reduce hazards are taken
Not discussed, but important developments:

Multi-issue further reduces CPI

Branch prediction to avoid high branch penalties

Dynamic scheduling

In all cases: a scheduling compiler needed
H.Corporaal ACA 5MD00
114
Recap of MIPS





RISC architecture
Register space
Addressing
Instruction format
Pipelining
H.Corporaal ACA 5MD00
115
Why RISC? Keep it simple
RISC characteristics:

Reduced number of instructions

Limited addressing modes



Large register set





uniform (no distinction between e.g. address and data registers)
Limited number of instruction sizes (preferably one)


load-store architecture
enables pipelining
know directly where the following instruction starts
Limited number of instruction formats
Memory alignment restrictions
......
Based on quantitative analysis

" the famous MIPS one percent rule": don't even think about it
when its not used more than one percent
H.Corporaal ACA 5MD00
116
Register space
32 integer (and 32 floating point) registers of 32-bit
Name Register number
Usage
$zero
0
the constant value 0
$v0-$v1
2-3
values for results and expression evaluation
$a0-$a3
4-7
arguments
$t0-$t7
8-15
temporaries
$s0-$s7
16-23
saved (by callee)
$t8-$t9
24-25
more temporaries
$gp
28
global pointer
$sp
29
stack pointer
$fp
30
frame pointer
$ra
31
return address
H.Corporaal ACA 5MD00
117
Addressing
1. Immediate addressing
op
rs
rt
Immediate
2. Register addressing
op
rs
rt
rd
...
funct
Registers
Register
3. Base addressing
op
rs
rt
Memory
Address
+
Register
Byte
Halfword
Word
4. PC-relative addressing
op
rs
rt
Memory
Address
PC
+
Word
5. Pseudodirect addressing
op
Address
PC
H.Corporaal ACA 5MD00
Memory
Word
118
Instruction format
R
op
rs
rt
rd
I
op
rs
rt
16 bit address
J
op
Example instructions
Instruction
add $s1,$s2,$s3
addi $s2,$s3,4
lw $s1,100($s2)
bne $s4,$s5,L
j Label
H.Corporaal ACA 5MD00
shamt
funct
26 bit address
Meaning
$s1 = $s2 + $s3
$s2 = $s3 + 4
$s1 = Memory[$s2+100]
if $s4<>$s5 goto L
goto Label
119
Pipelining
All integer instructions fit into the following pipeline
time
IF
H.Corporaal ACA 5MD00
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
IF
ID
EX
MEM
WB
120
Other architecture styles

Accumulator architecture


Stack



three operands, all in registers
loads and stores are the only instructions accessing memory (i.e.
with a memory (indirect) addressing mode
Register-Memory


zero operand: all operands implicit (on TOS)
Register (load store)


one operand (in register or memory), accumulator almost always
implicitly used
two operands, one in memory
Memory-Memory

three operands, may be all in memory
(there are more varieties / combinations)
H.Corporaal ACA 5MD00
121
Accumulator architecture
Accumulator
latch
ALU
registers
address
Memory
latch
Example code: a = b+c;
load b;
add
c;
store a;
H.Corporaal ACA 5MD00
// accumulator is implicit operand
122
Stack architecture
latch
latch
top of
stack
ALU
latch
Example code: a = b+c;
push b;
push b
push c;
b
add;
stack:
pop a;
H.Corporaal ACA 5MD00
Memory
stack pt
push c
add
c
b
b+c
pop a
123
Other architecture styles
Let's look at the code for C = A + B
Stack
Architecture
Accumulator
Architecture
RegisterMemory
MemoryMemory
Register
(load-store)
Push A
Load A
Load r1,A
Add C,B,A
Load r1,A
Push B
Add
Add
Add
Store C
Pop
C
B
r1,B
Store C,r1
Load r2,B
Add
r3,r1,r2
Store C,r3
Q: What are the advantages / disadvantages of load-store (RISC) architecture?
H.Corporaal ACA 5MD00
124
Download