Platform-based Design RISC Instruction Set Implementation Alternatives == using MIPS as example == TU/e 5kk70 Henk Corporaal Bart Mesman Topics MIPS ISA: Instruction Set Architecture MIPS single cycle implementation MIPS multi-cycle implementation MIPS pipelined implementation Pipeline hazards Recap of RISC principles Other architectures Based on the book: Many slides; I'll go quick and skip some H.Corporaal ACA 5MD00 2 Main Types of Instructions Arithmetic Memory access instructions Integer Floating Point Load & Store Control flow Jump Conditional Branch Call & Return H.Corporaal ACA 5MD00 3 MIPS arithmetic Most instructions have 3 operands Operand order is fixed (destination first) Example: C code: A = B + C MIPS code: add $s0, $s1, $s2 ($s0, $s1 and $s2 are associated with variables by compiler) H.Corporaal ACA 5MD00 4 MIPS arithmetic C code: A = B + C + D; E = F - A; MIPS code: add $t0, $s1, $s2 add $s0, $t0, $s3 sub $s4, $s5, $s0 Operands must be registers, only 32 registers provided Design Principle: smaller is faster. Why? H.Corporaal ACA 5MD00 5 Registers vs. Memory Arithmetic instruction operands must be registers, — only 32 registers provided Compiler associates variables with registers What about programs with lots of variables ? CPU Memory register file IO H.Corporaal ACA 5MD00 6 Register allocation Compiler tries to keep as many variables in registers as possible Some variables can not be allocated large arrays (too few registers) aliased variables (variables accessible through pointers in C) dynamic allocated variables heap stack Compiler may run out of registers => spilling H.Corporaal ACA 5MD00 7 Memory Organization Viewed as a large, single-dimension array, with an address A memory address is an index into the array "Byte addressing" means that successive addresses are one byte apart 0 8 bits of data 1 8 bits of data 2 8 bits of data 3 8 bits of data 4 8 bits of data 5 8 bits of data 6 8 bits of data ... H.Corporaal ACA 5MD00 8 Memory Organization Bytes are nice, but most data items use larger "words" For MIPS, a word is 32 bits or 4 bytes. 0 32 bits of data 4 32 bits of data 8 32 bits of data ... 12 32 bits of data Registers hold 32 bits of data 232 bytes with byte addresses from 0 to 232-1 230 words with byte addresses 0, 4, 8, ... 232-4 H.Corporaal ACA 5MD00 9 Memory layout: Alignment 31 address 0 4 23 15 7 0 this word is aligned; the others are not! 8 12 16 20 24 Words are aligned What are the least 2 significant bits of a word address? H.Corporaal ACA 5MD00 10 Instructions: load and store Example: C code: A[8] = h + A[8]; MIPS code: lw $t0, 32($s3) add $t0, $s2, $t0 sw $t0, 32($s3) Store word operation has no destination (reg) operand Remember arithmetic operands are registers, not memory! H.Corporaal ACA 5MD00 11 Let's translate some C-code Can we figure out the code? swap(int v[], int k); { int temp; temp = v[k] v[k] = v[k+1]; v[k+1] = temp; } swap: muli add lw lw sw sw jr $2 , $2 , $15, $16, $16, $15, $31 $5, 4 $4, $2 0($2) 4($2) 0($2) 4($2) Explanation: index k : $5 base address of v: $4 address of v[k] is $4 + 4.$5 H.Corporaal ACA 5MD00 12 Machine Language Instructions, like registers and words of data, are also 32 bits long Example: add $t0, $s1, $s2 Registers have numbers: $t0=9, $s1=17, $s2=18 Instruction Format: op 000000 6 bits rs rt 10001 10010 5 bits 5 bits rd 01000 5 bits shamt 00000 5 bits funct 100000 6 bits Can you guess what the field names stand for? H.Corporaal ACA 5MD00 13 Machine Language Consider the load-word and store-word instructions, Introduce a new type of instruction format What would the regularity principle have us do? New principle: Good design demands a compromise I-type for data transfer instructions other format was R-type for register Example: lw $t0, 32($s2) H.Corporaal ACA 5MD00 35 18 9 op rs rt 32 16 bit number 14 Stored Program Concept memory OS Program 1 CPU code global data stack heap unused Program 2 unused H.Corporaal ACA 5MD00 15 Control Decision making instructions alter the control flow, i.e., change the "next" instruction to be executed MIPS conditional branch instructions: bne $t0, $t1, Label beq $t0, $t1, Label Example: if (i==j) h = i + j; bne $s0, $s1, Label add $s3, $s0, $s1 Label: .... H.Corporaal ACA 5MD00 16 Control MIPS unconditional branch instructions: j label Example: if (i!=j) h=i+j; else h=i-j; beq $s4, $s5, Lab1 add $s3, $s4, $s5 j Lab2 Lab1:sub $s3, $s4, $s5 Lab2:... Can you build a simple for loop? H.Corporaal ACA 5MD00 17 So far: Instruction Meaning add $s1,$s2,$s3 sub $s1,$s2,$s3 lw $s1,100($s2) sw $s1,100($s2) bne $s4,$s5,L beq $s4,$s5,L j Label $s1 = $s2 + $s3 $s1 = $s2 – $s3 $s1 = Memory[$s2+100] Memory[$s2+100] = $s1 Next instr. is at Label if $s4 ° $s5 Next instr. is at Label if $s4 = $s5 Next instr. is at Label Formats: R op rs rt rd I op rs rt 16 bit address J op H.Corporaal ACA 5MD00 shamt funct 26 bit address 18 Control Flow We have: beq, bne, what about Branch-if-less-than? New instruction: meaning: if slt $t0, $s1, $s2 $s1 < $s2 then $t0 = 1 else $t0 = 0 Can use this instruction to build "blt $s1, $s2, Label" — can now build general control structures Note that the assembler needs a register to do this, — use conventions for registers H.Corporaal ACA 5MD00 19 MIPS compiler/assembler Conventions Name Register number Usage $zero 0 the constant value 0 $v0-$v1 2-3 values for results and expression evaluation $a0-$a3 4-7 arguments $t0-$t7 8-15 temporaries $s0-$s7 16-23 saved (by callee) $t8-$t9 24-25 more temporaries $gp 28 global pointer $sp 29 stack pointer $fp 30 frame pointer $ra 31 return address H.Corporaal ACA 5MD00 20 Constants Small constants are used quite frequently (50% of operands) e.g., A = A + 5; B = B + 1; C = C - 18; Solutions? Why not? put 'typical constants' in memory and load them create hard-wired registers (like $zero) for constants like one or ……. MIPS Instructions: addi slti andi ori H.Corporaal ACA 5MD00 $29, $8, $29, $29, $29, $18, $29, $29, 4 10 6 4 3 21 How about larger constants? We'd like to be able to load a 32 bit constant into a register Must use two instructions; new "load upper immediate" instruction lui $t0, 1010101010101010 filled with zeros 1010101010101010 0000000000000000 Then must get the lower order bits right, i.e., ori $t0, $t0, 1010101010101010 ori H.Corporaal ACA 5MD00 1010101010101010 0000000000000000 0000000000000000 1010101010101010 1010101010101010 1010101010101010 22 Assembly Language vs. Machine Language Assembly provides convenient symbolic representation much easier than writing down numbers e.g., destination first Machine language is the underlying reality e.g., destination is no longer first Assembly can provide 'pseudoinstructions' e.g., “move $t0, $t1” exists only in Assembly would be implemented using “add $t0,$t1,$zero” When considering performance you should count real instructions H.Corporaal ACA 5MD00 23 Addresses in Branches and Jumps Instructions: bne $t4,$t5,Label beq $t4,$t5,Label j Label Next instruction is at Label if $t4 $t5 Next instruction is at Label if $t4 = $t5 Next instruction is at Label Formats: I op J op rs rt 16 bit address 26 bit address Addresses are not 32 bits — How do we handle this with load and store instructions? H.Corporaal ACA 5MD00 24 What's the next address? Instructions: Next instruction is at Label if $t4 $t5 Next instruction is at Label if $t4 = $t5 bne $t4,$t5,Label beq $t4,$t5,Label Formats: I op rs rt 16 bit address Could specify a register (like lw and sw) and add it to address use Instruction Address Register (PC = program counter) most branches are local (principle of locality) Jump instructions just use high order bits of PC address boundaries of 256 MB H.Corporaal ACA 5MD00 25 To summarize: add MIPS assembly language Example Meaning add $s1, $s2, $s3 $s1 = $s2 + $s3 Three operands; data in registers subtract sub $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; data in registers $s1 = $s2 + 100 $s1 = Memory[$s2 + 100] Memory[$s2 + 100] = $s1 $s1 = Memory[$s2 + 100] Memory[$s2 + 100] = $s1 Used to add constants Category Arithmetic Instruction addi $s1, $s2, 100 lw $s1, 100($s2) load word sw $s1, 100($s2) store word lb $s1, 100($s2) Data transfer load byte sb $s1, 100($s2) store byte load upper immediate lui $s1, 100 add immediate Conditional branch Unconditional jump $s1 = 100 * 2 16 Comments Word from memory to register Word from register to memory Byte from memory to register Byte from register to memory Loads constant in upper 16 bits branch on equal beq $s1, $s2, 25 if ($s1 == $s2) go to PC + 4 + 100 Equal test; PC-relative branch branch on not equal bne $s1, $s2, 25 if ($s1 != $s2) go to PC + 4 + 100 Not equal test; PC-relative set on less than slt $s1, $s2, $s3 if ($s2 < $s3) $s1 = 1; else $s1 = 0 Compare less than; for beq, bne set less than immediate slti jump j jr jal jump register jump and link H.Corporaal ACA 5MD00 $s1, $s2, 100 if ($s2 < 100) $s1 = 1; Compare less than constant else $s1 = 0 2500 $ra 2500 Jump to target address go to 10000 For switch, procedure return go to $ra $ra = PC + 4; go to 10000 For procedure call 26 MIPS (3+2) addressing modes overview 1. Immediate addressing op rs rt Immediate 2. Register addressing op rs rt rd ... funct Registers Register 3. Base addressing op rs rt Memory Address + Register Byte Halfword Word 4. PC-relative addressing op rs rt Memory Address PC + Word 5. Pseudodirect addressing op Address PC H.Corporaal ACA 5MD00 Memory Word 27 MIPS Datapath Building a datapath A single cycle processor datapath all instruction actions in one (long) cycle A multi-cycle processor datapath support a subset of the MIPS-I instruction-set each instructions takes multiple (shorter) cycles For details see book (ch 5): H.Corporaal ACA 5MD00 28 Datapath and Control FSM or Microprogramming Registers & Memories Multiplexors Buses ALUs Control H.Corporaal ACA 5MD00 Datapath 29 The Processor: Datapath & Control Simplified MIPS implementation to contain only: lw, sw add, sub, and, or, slt beq, j Generic Implementation: memory-reference instructions: arithmetic-logical instructions: control flow instructions: use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? H.Corporaal ACA 5MD00 30 More Implementation Details Abstract / Simplified View: Data Address PC Instruction memory Instruction Register # Registers Register # ALU Address Data memory Register # Data Two types of functional units: H.Corporaal ACA 5MD00 elements that operate on data values (combinational) elements that contain state (sequential) 31 State Elements Unclocked vs. Clocked Clocks used in synchronous logic when should an element that contains state be updated? falling edge cycle time rising edge H.Corporaal ACA 5MD00 32 An unclocked state element The set-reset (SR) latch output depends on present inputs and also on past inputs R Q Q S Truth table: H.Corporaal ACA 5MD00 R 0 0 1 1 S 0 1 0 1 Q Q 1 0 ? state change 33 Latches and Flip-flops Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) Change of state (value) is based on the clock Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) A clocking methodology defines when signals can be read and written — wouldn't want to read a signal at the same time it was being written H.Corporaal ACA 5MD00 34 D-latch Two inputs: the data value to be stored (D) the clock signal (C) indicating when to read & store D Two outputs: the value of the internal state (Q) and it's complement C D Q C _ Q D H.Corporaal ACA 5MD00 Q 35 D flip-flop Output changes only on the clock edge D D C D latch Q D Q D latch _ C Q Q _ Q C D C Q H.Corporaal ACA 5MD00 36 Our Implementation An edge triggered methodology Typical execution: read contents of some state elements, send values through some combinational logic, write results to one or more state elements State element 1 Combinational logic State element 2 Clockcycle H.Corporaal ACA 5MD00 37 Register File 3-ported: one write, two read ports Read reg. #1 Read data 1 Read reg.#2 Read data 2 Write reg.# Write data Write H.Corporaal ACA 5MD00 38 Register file: read ports • Register file built using D flip-flops Read register number 1 Register 0 Register 1 M Register n – 1 u x Read data 1 Register n Read register number 2 M u Read data 2 x Implementation of the read ports H.Corporaal ACA 5MD00 39 Register file: write port Note: we still use the real clock to determine when to write W r ite 0 1 R e g is te r n um b e r n -to -1 C R e g iste r 0 D C d e co d e r n – 1 R e g iste r 1 D n C R e g is te r n – 1 D C R e g iste r n R e gister d ata H.Corporaal ACA 5MD00 D 40 Building the Datapath Use multiplexors to stitch them together PCSrc M u x Add Add ALU result 4 Shift left 2 Registers PC Read address Instruction Instruction memory Read register 1 Read Read data 1 register 2 Write register Write data RegWrite 16 H.Corporaal ACA 5MD00 ALUSrc Read data 2 Sign extend M u x 3 ALU operation Zero ALU ALU result MemWrite MemtoReg Address Read data Data Write memory data M u x 32 MemRead 41 Our Simple Control Structure All of the logic is combinational We wait for everything to settle down, and the right thing to be done ALU might not produce “right answer” right away we use write signals along with clock to determine when to write Cycle time determined by length of the longest path S tate elem ent 1 Com binational logic State elem ent 2 Clock cycle We are ignoring some details like setup and hold times ! H.Corporaal ACA 5MD00 42 Control Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs) Information comes from the 32 bits of the instruction Example: add $8, $17, $18 000000 op Instruction Format: 10001 rs 10010 rt 01000 rd 00000 shamt 100000 funct ALU's operation based on instruction type and function code H.Corporaal ACA 5MD00 43 Control: 2 level implementation 31 6 Control 2 26 instruction register Opcode bit 2 ALUop 00: lw, sw 01: beq 10: add, sub, and, or, slt Funct. Control 1 H.Corporaal ACA 5MD00 5 6 3 ALUcontrol 000: and 001: or 010: add 110: sub 111: set on less than ALU 0 44 Datapath with Control 0 M u x Add ALU result Add 4 Instruction[31–26] PC Instruction memory Read register 1 Instruction[20–16] Instruction [31–0] Instruction[15–11] Shift left 2 RegDst Branch MemRead MemtoReg Control ALUOp MemWrite ALUSrc RegWrite Instruction[25–21] Read address 0 M u x 1 1 Read data1 Read register 2 Registers Read Write data2 register 0 M u x 1 Write data Zero ALU ALU result Address Write data Instruction[15–0] 16 Sign extend Read data Data memory 1 M u x 0 32 ALU control Instruction[5–0] H.Corporaal ACA 5MD00 45 ALU Control1 What should the ALU do with this instruction example: lw $1, 100($2) 2 1 op rs rt 100 16 bit offset ALU control input 000 001 010 110 111 35 AND OR add subtract set-on-less-than Why is the code for subtract 110 and not 011? H.Corporaal ACA 5MD00 46 ALU Control1 Must describe hardware to compute 3-bit ALU control input given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic function code for arithmetic ALU Operation class, computed from instruction type Describe it using a truth table (can turn into gates): inputs H.Corporaal ACA 5MD00 ALUOp ALUOp1 ALUOp0 0 0 X 1 1 X 1 X 1 X 1 X 1 X F5 X X X X X X X Funct field F4 F3 F2 F1 X X X X X X X X X 0 0 0 X 0 0 1 X 0 1 0 X 0 1 0 X 1 0 1 outputs Operation F0 X X 0 0 0 1 0 010 110 010 110 000 001 111 47 ALU Control1 Simple combinational logic (truth tables) ALUOp ALU control block ALUOp0 ALUOp1 F3 F2 F (5– 0) Operation2 Operation1 Operation F1 Operation0 F0 H.Corporaal ACA 5MD00 48 Deriving Control2 signals Input 6-bits 9 control (output) signals Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 Determine these control signals directly from the opcodes: R-format: 0 lw: 35 sw: 43 beq: 4 H.Corporaal ACA 5MD00 49 Control 2 Inputs Op5 Op4 Op3 PLA example implementation Op2 Op1 Op0 Outputs R-format Iw sw beq RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOpO H.Corporaal ACA 5MD00 50 Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (2ns), ALU and adders (2ns), register file access (1ns) PCSrc Add ALU Add result 4 RegWrite Instruction [25– 21] PC Read address Instruction [31– 0] Instruction memory Instruction [20– 16] 1 M u Instruction [15– 11] x 0 RegDst Instruction [15– 0] Read register 1 Read register 2 Read data 1 Read Write data 2 register Write Registers data 16 Sign 32 extend 1 M u x 0 Shift left 2 MemWrite ALUSrc 1 M u x 0 Zero ALU ALU result MemtoReg Address Write data ALU control Read data Data memory 1 M u x 0 MemRead Instruction [5– 0] ALUOp H.Corporaal ACA 5MD00 51 Single Cycle Implementation Memory (2ns), ALU & adders (2ns), reg. file access (1ns) Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns Variable clock length (not realistic, just as exercise): R-instr: Load: Store: Branch: Jump: 6 ns 8 ns 7 ns 5 ns 2 ns Average depends on instruction mix H.Corporaal ACA 5MD00 52 Where we are headed Single Cycle Problems: what if we had a more complicated instruction like floating point? wasteful of area: NO Sharing of Hardware resources One Solution: use a “smaller” cycle time have different instructions take different numbers of cycles a “multicycle” datapath: Instruction register PC Address ALU Registers Memory data register MDR H.Corporaal ACA 5MD00 A Register # Instruction Memory or data Data Data IR ALUOut Register # B Register # 53 Multicycle Approach We will be reusing functional units Add registers after every major functional unit Our control signals will not be determined solely by instruction ALU used to compute address and to increment PC Memory used for instruction and data e.g., what should the ALU do for a “subtract” instruction? We’ll use a finite state machine (FSM) or microcode for control H.Corporaal ACA 5MD00 54 Review: finite state machines Finite state machines: a set of states and next state function (determined by current state and the input) output function (determined by current state and possibly input) Current state Next-state function Next state Clock Inputs Output function Outputs We’ll use a Moore machine (output based only on current state) H.Corporaal ACA 5MD00 55 Multicycle Approach Break up the instructions into steps, each step takes a cycle At the end of a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit store values for use in later cycles (easiest thing to do) introduce additional “internal” registers Notice: we distinguish processor state: programmer visible registers internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout) H.Corporaal ACA 5MD00 56 Multicycle Approach PC 0 M u x 1 Address Memory Instruction [25–21] Read register 1 Instruction [20–16] Read Read data1 register 2 Registers Write Read register data2 MemData Write data Instruction [15–0] Instruction [15–11] Instruction register Instruction [15–0] Memory data register H.Corporaal ACA 5MD00 0 M u x 1 A B 0 M u x 1 Sign extend 32 Zero ALU ALU result ALUOut 0 4 Write data 16 0 M u x 1 1M u 2x 3 Shift left 2 57 Multicycle Approach Note that previous picture does not include: branch support jump support Control lines and logic Tclock > max (ALU delay, Memory access, Regfile access) See book for complete picture H.Corporaal ACA 5MD00 58 Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! H.Corporaal ACA 5MD00 59 Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction Register Increment the PC by 4 and put the result back in the PC Can be described succinctly using RTL "Register-Transfer Language" IR = Memory[PC]; PC = PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? H.Corporaal ACA 5MD00 60 Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch Previous two actions are done optimistically!! RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC+(sign-extend(IR[15-0])<< 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) H.Corporaal ACA 5MD00 61 Step 3 (instruction dependent) ALU is performing one of four functions, based on instruction type Memory Reference: ALUOut = A + sign-extend(IR[15-0]); R-type: ALUOut = A op B; Branch: if (A==B) PC = ALUOut; Jump: PC = PC[31-28] || (IR[25-0]<<2) H.Corporaal ACA 5MD00 62 Step 4 (R-type or Memory-access) Loads and stores access memory MDR = Memory[ALUOut]; or Memory[ALUOut] = B; R-type instructions finish Reg[IR[15-11]] = ALUOut; The write actually takes place at the end of the cycle on the edge H.Corporaal ACA 5MD00 63 Write-back step Memory read completion step Reg[IR[20-16]]= MDR; What about all the other instructions? H.Corporaal ACA 5MD00 64 Summary execution steps Steps taken to execute any instruction class Step name Instruction fetch Action for R-type instructions Instruction decode/register fetch Action for memory-reference Action for instructions branches IR = Memory[PC] PC = PC + 4 A = Reg [IR[25-21]] B = Reg [IR[20-16]] ALUOut = PC + (sign-extend (IR[15-0]) << 2) Execution, address computation, branch/ jump completion ALUOut = A op B ALUOut = A + sign-extend (IR[15-0]) Memory access or R-type completion Reg [IR[15-11]] = ALUOut Load: MDR = Memory[ALUOut] or Store: Memory [ALUOut] = B Memory read completion H.Corporaal ACA 5MD00 if (A ==B) then PC = ALUOut Action for jumps PC = PC [31-28] II (IR[25-0]<<2) Load: Reg[IR[20-16]] = MDR 65 Simple Questions How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, L1 add $t5, $t2, $t3 sw $t5, 8($t3) L1: ... #assume not taken What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place? H.Corporaal ACA 5MD00 66 Implementing the Control Value of control signals is dependent upon: Use the information we have accumulated to specify a finite state machine (FSM) what instruction is being executed which step is being performed specify the finite state machine graphically, or use microprogramming Implementation can be derived from specification H.Corporaal ACA 5MD00 67 0 Start How many state bits will we need? Memory address computation 6 ALUSrcA = 1 ALUSrcB = 10 ALUOp = 00 Memory access 5 MemRead IorD = 1 Write-back step 4 RegDst = 0 RegWrite MemtoReg = 1 ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 Branch completion 8 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 10 Memory access 3 1 Execution 2 (Op = 'LW') MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PCSource = 00 MemWrite IorD = 1 RegDst = 1 RegWrite MemtoReg = 0 Jump completion 9 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PCWriteCond PCSource = 01 R-type completion 7 (Op = 'J') Graphical Specification of FSM Instruction decode/ register fetch Instruction fetch PCWrite PCSource = 10 Finite State Machine for Control PCWrite PCWriteCond IorD MemRead Implementation: MemWrite IRWrite Control logic MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 Instruction register opcode field H.Corporaal ACA 5MD00 S0 S1 S2 S3 Op0 Op1 Op2 Op3 Op4 Op5 Inputs State register 69 Op4 opcode PLA Implementation Op5 (see book) Op3 Op2 Op1 Op0 S3 current state S2 S1 S0 If I picked a horizontal or vertical line could you explain it ? What type of FSM is used? Mealy or Moore? datapath control PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 PCSource0 ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 next state H.Corporaal ACA 5MD00 70 Pipelined implementation Pipelining Pipelined datapath Pipelined control Hazards: Structural Data Control Exceptions Scheduling For details see the book (chapter 6): H.Corporaal ACA 5MD00 71 Pipelining Improve performance by increasing instruction throughput P rog ram e x ec ution T im e o rd er (in in structio ns ) lw $ 1, 10 0 ($0 ) 2 Ins truction R eg fe tch lw $ 2, 20 0 ($0 ) 4 6 8 A LU Data access 10 12 14 16 18 R eg Instruction R eg fe tch 8 ns lw $ 3, 30 0 ($0 ) D ata acc ess A LU R eg Instruction fe tch 8 ns ... 8 ns P rog ram e x ec utio n Tim e o rd er (in in struc tio n s) 2 lw $1 , 1 00 ($ 0) Instruction fetch lw $2 , 2 00 ($ 0) 2 ns lw $3 , 3 00 ($ 0) 4 Reg Instruction fetch 2 ns 6 ALU Reg Instruction fetch 2 ns H.Corporaal ACA 5MD00 8 Da ta access ALU Reg 2 ns 10 14 12 Reg Data access Reg ALU D a ta access 2 ns 2 ns Reg 2 ns 72 Pipelining Ideal speedup = number of stages Do we achieve this? H.Corporaal ACA 5MD00 73 Pipelining What makes it easy What makes it hard? all instructions are the same length just a few instruction formats memory operands appear only in loads and stores structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction We’ll build a simple pipeline and look at these issues We’ll talk about modern processors and what really makes it hard: H.Corporaal ACA 5MD00 exception handling trying to improve performance with out-of-order execution, etc. 74 Basic idea: start from single cycle impl. What do we need to add to actually split the datapath into stages? IF: Instruction fetch ID: Instruction decode/ register file read EX: Execute/ address calculation MEM: Memory access WB: Write back 0 M u x 1 Add Add Add result 4 Shift left 2 PC Read register 1 Address Instruction Instruction memory Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal ACA 5MD00 Sign extend Read data 1 M u x 0 32 75 Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem? 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 PC Address Instruction memory Instruction Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal ACA 5MD00 Sign extend Read data 1 M u x 0 32 76 Corrected Datapath 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add 4 Add Add result PC Address Instruction memory I nstr ucti on Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal ACA 5MD00 Sign extend Read data 1 M u x 0 32 77 Graphically Representing Pipelines Time (in clock cycles) Program execution order (in instructions) lw $10, 20($1) sub $11, $2, $3 CC 1 CC 2 CC 3 CC 4 CC 5 IM Reg ALU DM Reg ALU DM IM Reg CC 6 Reg Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths H.Corporaal ACA 5MD00 78 Pipeline Control PCSrc 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add 4 Add result Branch Shift left 2 PC Address Instruction memory Instruction RegWrite Read register 1 MemWrite Read data 1 Read register 2 Registers Read Write data 2 register Write data ALUSrc Zero Zero ALU ALU result 0 M u x 1 MemtoReg Address Data memory Write Read data 1 M u x 0 data Instruction 16 [15– 0] Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 0 M u x 1 ALU control MemRead ALUOp RegDst H.Corporaal ACA 5MD00 79 Pipeline control We have 5 stages. What needs to be controlled in each stage? Instruction Fetch and PC Increment Instruction Decode / Register Fetch Execution Memory Stage Write Back How would control be handled in an automobile plant? a fancy control center telling everyone what to do? should we use a finite state machine? H.Corporaal ACA 5MD00 80 Pipeline Control Instruction R-format lw sw beq Execution/Address Calculation stage control lines Reg ALU ALU ALU Dst Op1 Op0 Src 1 1 0 0 0 0 0 1 X 0 0 1 X 0 1 0 Memory access stage control lines Branc Mem Mem h Read Write 0 0 0 0 1 0 0 0 1 1 0 0 Write-back stage control lines Reg Mem write to Reg 1 0 1 1 0 X 0 X (compare single cycle control!) Pass control signals along just like the data: WB Instruction H.Corporaal ACA 5MD00 IF/ID Control M WB EX M WB ID/EX EX/MEM MEM/WB 81 Datapath with Control PCSrc ID/EX 0 M u x 1 WB Control IF/ID EX/MEM M WB EX M MEM/WB WB Add ALUSrc Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data Zero ALU ALU result 0 M u x 1 MemtoReg Instruction memory Branch Shift left 2 MemWrite Address Instruction PC Add Add result RegWrite 4 Address Data memory Read data Write data Instruction 16 [15– 0] Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 ALU control 0 M u x 1 1 M u x 0 MemRead ALUOp RegDst H.Corporaal ACA 5MD00 82 H.Corporaal ACA 5MD00 83 Hazards: problems due to pipelining Hazard types: Structural Data same resource is needed multiple times in the same cycle data dependencies limit pipelining Control next executed instruction may not be the next specified instruction H.Corporaal ACA 5MD00 84 Structural hazards Examples: Two accesses to a single ported memory Two operations need the same function unit at the same time Two operations need the same function unit in successive cycles, but the unit is not pipelined Solutions: stalling add more hardware H.Corporaal ACA 5MD00 85 Structural hazards on MIPS Q: Do we have structural hazards on our simple MIPS pipeline? time IF H.Corporaal ACA 5MD00 ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 86 Data hazards Data dependencies: (read-after-write) (write-after-write) (write-after-read) Hardware solution: RaW WaW WaR Forwarding / Bypassing Detection logic Stalling Software solution: Scheduling H.Corporaal ACA 5MD00 87 Data dependences Three types: RaW, WaR and WaW add r1, r2, 5 sub r4, r1, r3 ; r1 := r2+5 ; RaW of r1 add r1, r2, 5 sub r2, r4, 1 ; WaR of r2 add r1, r2, 5 sub r1, r1, 1 ; WaW of r1 st ld ; M[r2+5] := r1 ; RaW if 5+r2 = 0+r4 r1, 5(r2) r5, 0(r4) WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! Problems for your compiler and Pentium! use register renaming to solve this! H.Corporaal ACA 5MD00 88 RaW on MIPS pipeline T im e (in clock cycle s) V alue of re giste r $ 2 : CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 10 10 10 1 0 /– 2 0 – 20 – 20 – 20 – 20 IM Reg DM Reg P ro gra m e xe cution orde r (in instru ctio ns) sub $ 2 , $ 1 , $ 3 a nd $ 1 2 , $ 2 , $ 5 or $ 1 3 , $ 6 , $ 2 a dd $ 1 4 , $ 2 , $ 2 sw $ 1 5, 1 0 0 ($ 2 ) H.Corporaal ACA 5MD00 IM DM R eg IM DM R eg IM R eg DM Reg IM R eg R eg R eg DM Reg 89 Forwarding Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding T ime (in clock cycles) CC 1 V alue of re gister $ 2 : 10 V a lue of E X/M EM : X V a lue of M EM /W B : X CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 X X 10 X X 10 – 20 X 1 0/– 20 X – 20 – 20 X X – 20 X X – 20 X X – 20 X X DM R eg Program e xe cution orde r (in instructions) sub $ 2, $1 , $ 3 What if this $2 was $13? a nd $ 12 , $ 2, $5 or $ 13 , $ 6, $ 2 a dd $ 14 , $ 2, $2 sw $ 15 , 1 00 ($2 ) H.Corporaal ACA 5MD00 IM Reg IM R eg IM DM R eg IM R eg DM DM R eg IM Reg R eg Reg DM Reg 90 Forwarding hardware ALU forwarding circuitry principle: from register file ALU to register file from register file Note: there are two options • buf - ALU – bypass – mux - buf • buf - bypass – mux – ALU - buf H.Corporaal ACA 5MD00 91 Forwarding ID/EX WB Control PC Instruction memory Instruction IF/ID EX/MEM M WB EX M MEM/WB WB M u x Registers ForwardA ALU Data memory M u x IF/ID.RegisterRs Rs IF/ID.RegisterRt Rt IF/ID.RegisterRt Rt IF/ID.RegisterRd Rd ForwardB M u x EX/MEM.RegisterRd Forwarding unit H.Corporaal ACA 5MD00 M u x MEM/WB.RegisterRd 92 Forwarding check Check for matching register-ids: For each source-id of operation in the EX-stage check if there is a matching pending dest-id Example: if (EX/MEM.RegWrite) (EX/MEM.RegisterRd 0) (EX/MEM.RegisterRd = ID/EX.RegisterRs) then ForwardA = 10 Q. How many comparators do we need? H.Corporaal ACA 5MD00 93 Can't always forward Load word can still cause a hazard: an instruction tries to read register r following a load to the same r Need a hazard detection unit to “stall” the load instruction Time (in clock cycles) Program CC 1 execution order (in instructions) lw $2, 20 ($1) and $4, $2, $5 or $8, $2, $6 add $9, $4, $2 slt $1, $6, $7 H.Corporaal ACA 5MD00 IM CC 2 CC 3 Reg IM CC 4 CC 5 DM Reg Reg IM DM Reg IM CC 6 CC 8 CC 9 Reg DM Reg IM CC 7 Reg DM Reg Reg DM Reg 94 Stalling We can stall the pipeline by keeping an instruction in the same stage Program Time(inclockcycles) execution CC1 CC2 order (ininstructions) lw$2, 20($1) and$4, $2, $5 or $8, $2, $6 IM CC3 Reg IM CC4 CC5 DM Reg Reg Reg IM IM CC6 CC7 DM Reg Reg DM CC8 CC9 CC10 Reg bubble add$9, $4, $2 In CC4 the ALU is not used, slt $1, $6, $7 Reg, and IM are redone H.Corporaal ACA 5MD00 IM DM Reg IM Reg Reg DM Reg 95 Hazard Detection Unit ID/EX.MemRead Hazard detection unit ID/EX IF/IDWrite WB Control 0 M u x PC Instruction memory Instruction PCWrite IF/ID EX/MEM M WB EX M MEM/WB WB M u x Registers ALU Data memory M u x M u x IF/ID.RegisterRs IF/ID.RegisterRt H.Corporaal ACA 5MD00 IF/ID.RegisterRt Rt IF/ID.RegisterRd Rd ID/EX.RegisterRt Rs Rt M u x EX/MEM.RegisterRd Forwarding unit MEM/WB.RegisterRd 96 Software only solution? Have compiler guarantee that no hazards occur Example: where do we insert the “NOPs” ? sub and or add sw $2, $12, $13, $14, $13, $1, $3 $2, $5 $6, $2 $2, $2 100($2) sub nop nop and or Problem: this really slows us down! add nop sw H.Corporaal ACA 5MD00 $2, $1, $3 $12, $2, $5 $13, $6, $2 $14, $2, $2 $13, 100($2) 97 Control hazards Control operations may change the sequential flow of instructions branch jump call (jump and link) return (exception/interrupt and rti / return from interrupt) H.Corporaal ACA 5MD00 98 Control hazard: Branch Branch actions: Compute new address Determine condition Perform the actual branch (if taken): PC := new address H.Corporaal ACA 5MD00 99 Branch example Progra m Time (in clock cycle s) execu tion CC 1 CC 2 IM Reg CC 3 CC 4 CC 5 DM R eg CC 6 CC 7 CC 8 CC 9 order (in instructions) 40 beq $1, $3, 7 44 an d $1 2, $2 , $ 5 48 or $13, $6 , $2 52 ad d $1 4, $2 , $ 2 72 lw $4 , 50($ 7) H.Corporaal ACA 5MD00 IM R eg IM DM R eg IM R eg DM R eg IM R eg DM R eg Reg DM R eg 100 Branching Squash pipeline: When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” need to add hardware for flushing instructions if we are wrong H.Corporaal ACA 5MD00 101 Branch with predict not taken Clock cycles Branch L IF Predict not taken L: H.Corporaal ACA 5MD00 ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 102 Branch speedup Earlier address computation Earlier condition calculation Put both in the ID pipeline stage adder comparator Clock cycles Branch L Predict not taken L: H.Corporaal ACA 5MD00 IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 103 Improved branching / flushing IF/ID IF.Flush Hazard detection unit ID/EX M u x WB Control 0 M u x IF/ID 4 M WB EX M MEM/WB WB Shift left 2 Registers PC EX/MEM = M u x Instruction memory ALU M u x Data memory M u x Sign extend M u x Forwarding unit H.Corporaal ACA 5MD00 104 Exception support Types of exceptions: Overflow I/O device request Operating system call Undefined instruction Hardware malfunction Page fault Precise exception: finish previous instructions (which are still in the pipeline) flush excepting and following instructions, redo them after handling the exception(s) H.Corporaal ACA 5MD00 105 Exceptions Changes needed for handling overflow exception of an operation in EX stage (see book for details) : Extend PC input mux with extra entry with fixed address Add EPC register recording the ID/EX stage PC this is the address of the next instruction ! Cause register recording exception type E.g., in case of overflow exception insert 3 bubbles; flush the following stages: IF/ID stage ID/EX stage EX/MEM stage H.Corporaal ACA 5MD00 106 Scheduling, why? Let’s look at the execution time: Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle Scheduling may reduce Texecution Reduce CPI (cycles per instruction) early scheduling of long latency operations avoid pipeline stalls due to structural, data and control hazards allow Nissue > 1 and therefore CPI < 1 Reduce Ninstructions compact many operations into each instruction (VLIW) H.Corporaal ACA 5MD00 107 Scheduling data hazards: example 1 Try and avoid RaW stalls (in this case load interlocks)! E.g., reorder these instructions: lw lw sw sw $t0, $t2, $t2, $t0, H.Corporaal ACA 5MD00 0($t1) 4($t1) 0($t1) 4($t1) lw lw sw sw $t0, $t2, $t0, $t2, 0($t1) 4($t1) 4($t1) 0($t1) 108 Scheduling data hazards example 2 Avoiding RaW stalls: Reordering instructions for following program (by you or the compiler) Unscheduled code: Lw R1,b Lw R2,c Add R3,R1,R2 interlock Sw a,R3 Lw R1,e Lw R2,f Sub R4,R1,R2 interlock Sw d,R4 Code: a = b + c d = e - f H.Corporaal ACA 5MD00 Scheduled code: Lw R1,b Lw R2,c Lw R5,e extra reg. needed! Add R3,R1,R2 Lw R2,f Sw a,R3 Sub R4,R5,R2 Sw d,R4 109 Scheduling control hazards Texecution = Ninstructions x CPI x Tcycle CPI = CPIideal + fbranch x Pbranch Pbranch = Ndelayslots x miss_rate Modern processors tend to have large branch penalty, Pbranch, due to: many pipeline stages multi-issue Note that penalties have larger effect when CPIideal is low H.Corporaal ACA 5MD00 110 Scheduling control hazards What can we do about control hazards and CPI penalty? Keep penalty Pbranch low: Early computation of new PC Early determination of condition Visible branch delay slots filled by compiler (MIPS) Branch prediction Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95] Remove branches: if-conversion Conditional instructions: CMOVE, cond skip next Guarding all instructions: TriMedia H.Corporaal ACA 5MD00 111 Branch delay slot Add a branch delay slot: the next instruction after a branch is always executed rely on compiler to “fill” the slot with something useful Is this a good idea? let's look how it works H.Corporaal ACA 5MD00 112 Branch delay slot scheduling Q. What to put in the delay slot? op 1 beq r1,r2, L ............. 'fall-through' op 2 ............. branch target L: op 3 ............. H.Corporaal ACA 5MD00 113 Summary Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1 Hazards increase CPI Several software and hardware measure to avoid or reduce hazards are taken Not discussed, but important developments: Multi-issue further reduces CPI Branch prediction to avoid high branch penalties Dynamic scheduling In all cases: a scheduling compiler needed H.Corporaal ACA 5MD00 114 Recap of MIPS RISC architecture Register space Addressing Instruction format Pipelining H.Corporaal ACA 5MD00 115 Why RISC? Keep it simple RISC characteristics: Reduced number of instructions Limited addressing modes Large register set uniform (no distinction between e.g. address and data registers) Limited number of instruction sizes (preferably one) load-store architecture enables pipelining know directly where the following instruction starts Limited number of instruction formats Memory alignment restrictions ...... Based on quantitative analysis " the famous MIPS one percent rule": don't even think about it when its not used more than one percent H.Corporaal ACA 5MD00 116 Register space 32 integer (and 32 floating point) registers of 32-bit Name Register number Usage $zero 0 the constant value 0 $v0-$v1 2-3 values for results and expression evaluation $a0-$a3 4-7 arguments $t0-$t7 8-15 temporaries $s0-$s7 16-23 saved (by callee) $t8-$t9 24-25 more temporaries $gp 28 global pointer $sp 29 stack pointer $fp 30 frame pointer $ra 31 return address H.Corporaal ACA 5MD00 117 Addressing 1. Immediate addressing op rs rt Immediate 2. Register addressing op rs rt rd ... funct Registers Register 3. Base addressing op rs rt Memory Address + Register Byte Halfword Word 4. PC-relative addressing op rs rt Memory Address PC + Word 5. Pseudodirect addressing op Address PC H.Corporaal ACA 5MD00 Memory Word 118 Instruction format R op rs rt rd I op rs rt 16 bit address J op Example instructions Instruction add $s1,$s2,$s3 addi $s2,$s3,4 lw $s1,100($s2) bne $s4,$s5,L j Label H.Corporaal ACA 5MD00 shamt funct 26 bit address Meaning $s1 = $s2 + $s3 $s2 = $s3 + 4 $s1 = Memory[$s2+100] if $s4<>$s5 goto L goto Label 119 Pipelining All integer instructions fit into the following pipeline time IF H.Corporaal ACA 5MD00 ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 120 Other architecture styles Accumulator architecture Stack three operands, all in registers loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode Register-Memory zero operand: all operands implicit (on TOS) Register (load store) one operand (in register or memory), accumulator almost always implicitly used two operands, one in memory Memory-Memory three operands, may be all in memory (there are more varieties / combinations) H.Corporaal ACA 5MD00 121 Accumulator architecture Accumulator latch ALU registers address Memory latch Example code: a = b+c; load b; add c; store a; H.Corporaal ACA 5MD00 // accumulator is implicit operand 122 Stack architecture latch latch top of stack ALU latch Example code: a = b+c; push b; push b push c; b add; stack: pop a; H.Corporaal ACA 5MD00 Memory stack pt push c add c b b+c pop a 123 Other architecture styles Let's look at the code for C = A + B Stack Architecture Accumulator Architecture RegisterMemory MemoryMemory Register (load-store) Push A Load A Load r1,A Add C,B,A Load r1,A Push B Add Add Add Store C Pop C B r1,B Store C,r1 Load r2,B Add r3,r1,r2 Store C,r3 Q: What are the advantages / disadvantages of load-store (RISC) architecture? H.Corporaal ACA 5MD00 124