Embedded Processor Architecture 5kk73 RISC Instruction Set Implementation Alternatives == using MIPS as example == TU/e 5kk73 Henk Corporaal 2014 Topics MIPS ISA: Instruction Set Architecture MIPS single cycle implementation MIPS multi-cycle implementation MIPS pipelined implementation Pipeline hazards Recap of RISC principles Memory hierarchy (in other slides) Other architectures Based on the book: ch2-5 (5th ed) Many slides; I'll go quick and skip some H.Corporaal EmbProcArch 5kk73 2 Main Types of Instructions Arithmetic Memory access instructions Integer Floating Point Load & Store Control flow Jump Conditional Branch Call & Return H.Corporaal EmbProcArch 5kk73 3 MIPS arithmetic Most instructions have 3 operands Operand order is fixed (destination first) Example: C code: A = B + C MIPS code: add $s0, $s1, $s2 ($s0, $s1 and $s2 are associated with variables by compiler) H.Corporaal EmbProcArch 5kk73 4 MIPS arithmetic C code: A = B + C + D; E = F - A; MIPS code: add $t0, $s1, $s2 add $s0, $t0, $s3 sub $s4, $s5, $s0 Operands must be registers, only 32 registers provided Design Principle: smaller is faster. Why? H.Corporaal EmbProcArch 5kk73 5 Registers vs. Memory Arithmetic instruction operands must be registers, — only 32 registers provided Compiler associates variables with registers What about programs with lots of variables ? CPU Memory register file IO H.Corporaal EmbProcArch 5kk73 6 Register allocation Compiler tries to keep as many variables in registers as possible Some variables can not be allocated large arrays (too few registers) aliased variables (variables accessible through pointers in C) e.g. int a = 4; int *b = &a; *b = 5; dynamic allocated variables heap stack Compiler may run out of registers => spilling H.Corporaal EmbProcArch 5kk73 7 Memory Organization Viewed as a large, single-dimension array, with an address A memory address is an index into the array "Byte addressing" means that successive addresses are one byte apart 0 8 bits of data 1 8 bits of data 2 8 bits of data 3 8 bits of data 4 8 bits of data 5 8 bits of data 6 8 bits of data ... H.Corporaal EmbProcArch 5kk73 8 Memory Organization Bytes are nice, but most data items use larger "words" For MIPS, a word is 32 bits or 4 bytes. 0 32 bits of data 4 32 bits of data 8 32 bits of data ... 12 32 bits of data Registers hold 32 bits of data 232 bytes with byte addresses from 0 to 232-1 230 words with byte addresses 0, 4, 8, ... 232-4 H.Corporaal EmbProcArch 5kk73 9 Memory layout: Alignment 31 address 0 4 23 15 7 0 this word is aligned; the others are not! 8 12 16 20 24 Words are aligned What are the least 2 significant bits of a word address? H.Corporaal EmbProcArch 5kk73 10 Memory hierarchy H.Corporaal EmbProcArch 5kk73 11 High end Intel i7 (Ivy Bridge) die L1+L2 L1+L2 L1+L2 L1+L2 •1.4 billion transistors •160 mm2 •3.5 GHz •775kk73 W / 22 nm H.Corporaal EmbProcArch •64KB L1 per core •256 KB L2 per core •8MB L3 shared 12 Instructions: load and store Example: C code: A[8] = h + A[8]; MIPS code: lw $t0, 32($s3) add $t0, $s2, $t0 sw $t0, 32($s3) Store word operation has no destination (reg) operand Remember arithmetic operands are registers, not memory! H.Corporaal EmbProcArch 5kk73 13 Let's translate some C-code Can we figure out the code? swap(int v[], int k); { int temp; temp = v[k] v[k] = v[k+1]; v[k+1] = temp; } swap: muli add lw lw sw sw jr $2 , $2 , $15, $16, $16, $15, $31 $5, 4 $4, $2 0($2) 4($2) 0($2) 4($2) Explanation: index k : $5 base address of v: $4 address of v[k] is $4 + 4.$5 H.Corporaal EmbProcArch 5kk73 14 Machine Language Instructions, like registers and words of data, are also 32 bits long Example: add $t0, $s1, $s2 Registers have numbers: $t0=9, $s1=17, $s2=18 Instruction Format: op 000000 6 bits rs rt 10001 10010 5 bits 5 bits rd 01001 5 bits shamt 00000 5 bits funct 100000 6 bits Can you guess what the field names stand for? H.Corporaal EmbProcArch 5kk73 15 Machine Language Consider the load-word and store-word instructions, Introduce a new type of instruction format What would the regularity principle have us do? New principle: Good design demands a compromise I-type for data transfer instructions other format was R-type for register Example: lw $t0, 32($s2) H.Corporaal EmbProcArch 5kk73 35 18 9 op rs rt 32 16 bit number 16 Stored Program Concept memory OS Program 1 CPU code global data stack heap unused Program 2 unused H.Corporaal EmbProcArch 5kk73 17 Control flow Decision making instructions alter the control flow, i.e., change the "next" instruction to be executed MIPS conditional branch instructions: bne $t0, $t1, Label beq $t0, $t1, Label Example: if (i==j) h = i + j; bne $s0, $s1, Label add $s3, $s0, $s1 Label: .... H.Corporaal EmbProcArch 5kk73 18 Control flow MIPS unconditional branch instructions: j label Example: if (i!=j) h=i+j; else h=i-j; beq $s4, $s5, Lab1 add $s3, $s4, $s5 j Lab2 Lab1:sub $s3, $s4, $s5 Lab2:... Can you build a simple for loop? H.Corporaal EmbProcArch 5kk73 19 So far: Instruction Meaning add $s1,$s2,$s3 sub $s1,$s2,$s3 lw $s1,100($s2) sw $s1,100($s2) bne $s4,$s5,L beq $s4,$s5,L j Label $s1 = $s2 + $s3 $s1 = $s2 – $s3 $s1 = Memory[$s2+100] Memory[$s2+100] = $s1 Next instr. is at Label if $s4 ° $s5 Next instr. is at Label if $s4 = $s5 Next instr. is at Label Formats: R op rs rt rd I op rs rt 16 bit address J op H.Corporaal EmbProcArch 5kk73 shamt funct 26 bit address 20 Control Flow We have: beq, bne, what about Branch-if-less-than? New instruction: meaning: if slt $t0, $s1, $s2 $s1 < $s2 then $t0 = 1 else $t0 = 0 Can use this instruction to build "blt $s1, $s2, Label" — can now build general control structures Note that the assembler needs a register to do this, — use conventions for registers H.Corporaal EmbProcArch 5kk73 21 MIPS compiler/assembler Conventions Name Register number Usage $zero 0 the constant value 0 $v0-$v1 2-3 values for results and expression evaluation $a0-$a3 4-7 arguments $t0-$t7 8-15 temporaries $s0-$s7 16-23 saved (by callee) $t8-$t9 24-25 more temporaries $gp 28 global pointer $sp 29 stack pointer $fp 30 frame pointer $ra 31 return address H.Corporaal EmbProcArch 5kk73 22 Constants Small constants are used quite frequently (50% of operands) e.g., A = A + 5; B = B + 1; C = C - 18; Solutions? Why not? put 'typical constants' in memory and load them create hard-wired registers (like $zero) for constants like 0, 1, 2, … or ……. MIPS Instructions: addi slti andi ori H.Corporaal EmbProcArch 5kk73 $29, $8, $29, $29, $29, $18, $29, $29, 4 10 6 4 3 23 How about larger constants? We'd like to be able to load a 32 bit constant into a register Must use two instructions; new "load upper immediate" instruction lui $t0, 1010101010101010 filled with zeros 1010101010101010 0000000000000000 Then must get the lower order bits right, i.e., ori $t0, $t0, 1010101010101010 ori 1010101010101010 0000000000000000 0000000000000000 1010101010101010 1010101010101010 1010101010101010 H.Corporaal EmbProcArch 5kk73 24 Assembly Language vs. Machine Language Assembly provides convenient symbolic representation much easier than writing down numbers e.g., destination first Machine language is the underlying reality e.g., destination is no longer first Assembly can provide 'pseudoinstructions' e.g., “move $t0, $t1” exists only in Assembly would be implemented using “add $t0,$t1,$zero” Another pseudo instr: blt $t1, $t2, label When considering performance you should count real instructions H.Corporaal EmbProcArch 5kk73 25 Addresses in Branches and Jumps Instructions: bne $t4,$t5,Label beq $t4,$t5,Label j Label Next instruction is at Label if $t4 $t5 Next instruction is at Label if $t4 = $t5 Next instruction is at Label Formats: I op J op rs rt 16 bit address 26 bit address Addresses are not 32 bits — How do we handle this with load and store instructions? H.Corporaal EmbProcArch 5kk73 26 What's the next address? Instructions: Next instruction is at Label if $t4 $t5 Next instruction is at Label if $t4 = $t5 bne $t4,$t5,Label beq $t4,$t5,Label Formats: I op rs rt 16 bit address Could specify a register (like lw and sw) and add it to address use Instruction Address Register (PC = program counter) most branches are local (principle of locality) Jump instructions just use high order bits of PC address boundaries of 256 MB H.Corporaal EmbProcArch 5kk73 27 To summarize: add MIPS assembly language Example Meaning add $s1, $s2, $s3 $s1 = $s2 + $s3 Three operands; data in registers subtract sub $s1, $s2, $s3 $s1 = $s2 - $s3 Three operands; data in registers addi $s1, $s2, 100 lw $s1, 100($s2) sw $s1, 100($s2) lb $s1, 100($s2) sb $s1, 100($s2) lui $s1, 100 $s1 = $s2 + 100 Used to add constants $s1 = Memory[$s2 + 100]Word from memory to register Memory[$s2 + 100] = $s1 Word from register to memory $s1 = Memory[$s2 + 100]Byte from memory to register Memory[$s2 + 100] = $s1 Byte from register to memory Loads constant in upper 16 bits $s1 = 100 * 2 16 beq $s1, $s2, 25 if ($s1 == $s2) go to PC + 4 + 100 Equal test; PC-relative branch branch on not equal bne $s1, $s2, 25 if ($s1 != $s2) go to PC + 4 + 100 Not equal test; PC-relative $s1, $s2, $s3 if ($s2 < $s3) $s1 = 1; else $s1 = 0 Compare less than; for beq, bne Category Arithmetic Instruction add immediate load w ord store w ord Data transfer load byte store byte load upper immediate branch on equal Conditional branch Unconditional jump set on less than slt set less than immediate slti jump jump register jump and link j jr jal H.Corporaal EmbProcArch 5kk73 $s1, $s2, 100 if ($s2 < 100) $s1 = 1; Comments Compare less than constant else $s1 = 0 2500 $ra 2500 go to 10000 Jump to target address go to $ra For sw itch, procedure return $ra = PC + 4; go to 10000 For procedure call 28 MIPS (3+2) addressing modes overview 1. Immediate addressing op rs rt Immediate 2. Register addressing op rs rt rd ... funct Registers Register 3. Base addressing op rs rt Memory Address + Register Byte Halfword Word 4. PC-relative addressing op rs rt Memory Address PC + Word 5. Pseudodirect addressing op Address PC H.Corporaal EmbProcArch 5kk73 Memory Word 29 MIPS Datapath Building a datapath A single cycle processor datapath all instruction actions in one (long) cycle A multi-cycle processor datapath support a subset of the MIPS-I instruction-set each instructions takes multiple (shorter) cycles For details see book (ch 5, 3rd ed. Or ch 4 in 4th ed.): H.Corporaal EmbProcArch 5kk73 30 Datapath and Control FSM or Microprogramming Registers & Memories Multiplexors Buses ALUs Control H.Corporaal EmbProcArch 5kk73 Datapath 31 The Processor: Datapath & Control Simplified MIPS implementation to contain only: lw, sw add, sub, and, or, slt beq, j Generic Implementation: memory-reference instructions: arithmetic-logical instructions: control flow instructions: use the program counter (PC) to supply instruction address get the instruction from memory read registers use the instruction to decide exactly what to do All instructions use the ALU after reading the registers Why? memory-reference? arithmetic? control flow? H.Corporaal EmbProcArch 5kk73 32 More Implementation Details Abstract / Simplified View: Data PC Address Instruction memory Instruction Register # Registers Register # ALU Address Data memory Register # Data Two types of functional units: elements that operate on data values (combinational) elements that contain state (sequential) H.Corporaal EmbProcArch 5kk73 33 State Elements Unclocked vs. Clocked Clocks used in synchronous logic when should an element that contains state be updated? falling edge cycle time rising edge H.Corporaal EmbProcArch 5kk73 34 An unclocked state element The set-reset (SR) latch output depends on present inputs and also on past inputs R Q Q S Truth table: H.Corporaal EmbProcArch 5kk73 R 0 0 1 1 S 0 1 0 1 Q Q 1 0 ? state change 35 Latches and Flip-flops Output is equal to the stored value inside the element (don't need to ask for permission to look at the value) Change of state (value) is based on the clock Latches: whenever the inputs change, and the clock is asserted Flip-flop: state changes only on a clock edge (edge-triggered methodology) A clocking methodology defines when signals can be read and written — wouldn't want to read a signal at the same time it was being written H.Corporaal EmbProcArch 5kk73 36 D-latch (level-sensitive) Two inputs: the data value to be stored (D) the clock signal (C) indicating when to read & store data (D) Two outputs: the value of the internal state (Q) and it's complement C D Q C _ Q D H.Corporaal EmbProcArch 5kk73 Q 37 D flip-flop (edge-triggered) Output changes only on the clock edge D D C D latch Q D Q D latch _ C Q Q _ Q C D C Q H.Corporaal EmbProcArch 5kk73 38 Our Implementation An edge triggered methodology Typical execution: read contents of some state elements, send values through some combinational logic, write results to one or more state elements State element 1 Combinational logic State element 2 Clockcycle H.Corporaal EmbProcArch 5kk73 39 Register File 3-ported: one write, two read ports Read reg. #1 Read data 1 Read reg.#2 Read data 2 Write reg.# Write data Write H.Corporaal EmbProcArch 5kk73 40 Register file: read ports • Register file built using D flip-flops Read register number 1 Register 0 Register 1 M Register n – 1 u x Read data 1 Register n Read register number 2 M u Read data 2 x Implementation of the read ports H.Corporaal EmbProcArch 5kk73 41 Register file: write port Note: we still use the real clock to determine when to write W r ite 0 1 R e g is te r n um b e r n -to -1 C R e g iste r 0 D C d e co d e r n – 1 R e g iste r 1 D n C R e g is te r n – 1 D C R e g iste r n R e gister d ata H.Corporaal EmbProcArch 5kk73 D 42 Building the Datapath Use multiplexors to stitch them together PCSrc M u x Add Add ALU result 4 Shift left 2 Registers PC Read address Instruction Instruction memory Read register 1 Read Read data 1 register 2 Write register Write data RegWrite 16 H.Corporaal EmbProcArch 5kk73 ALUSrc Read data 2 Sign extend M u x 3 ALU operation Zero ALU ALU result MemWrite MemtoReg Address Read data Data Write memory data M u x 32 MemRead 43 Our Simple Control Structure All of the logic is combinational We wait for everything to settle down, and the right thing to be done ALU might not produce “right answer” right away we use write signals along with clock to determine when to write Cycle time determined by length of the longest path S tate elem ent 1 Com binational logic State elem ent 2 Clock cycle We are ignoring some details like setup and hold times ! H.Corporaal EmbProcArch 5kk73 44 Control Selecting the operations to perform (ALU, read/write, etc.) Controlling the flow of data (multiplexor inputs) Information comes from the 32 bits of the instruction Example: add $8, $17, $18 000000 op Instruction Format: 10001 rs 10010 rt 01000 rd 00000 shamt 100000 funct ALU's operation based on instruction type and function code H.Corporaal EmbProcArch 5kk73 45 Control: 2 level implementation 31 6 Control 2 26 instruction register Opcode bit 2 ALUop 00: lw, sw 01: beq 10: add, sub, and, or, slt Funct. Control 1 5 6 3 ALUcontrol 000: and 001: or 010: add 110: sub 111: set on less than ALU 0 H.Corporaal EmbProcArch 5kk73 46 Datapath with Control 0 M u x Add ALU result Add 4 Instruction[31–26] PC Read register 1 Instruction[20–16] Instruction [31–0] Instruction memory Instruction[15–11] Shift left 2 RegDst Branch MemRead MemtoReg Control ALUOp MemWrite ALUSrc RegWrite Instruction[25–21] Read address 0 M u x 1 1 Read data1 Read register 2 Registers Read Write data2 register 0 M u x 1 Write data Zero ALU ALU result Address Write data Instruction[15–0] 16 Sign extend Read data Data memory 1 M u x 0 32 ALU control Instruction[5–0] H.Corporaal EmbProcArch 5kk73 47 ALU Control1 What should the ALU do with this instruction example: lw $1, 100($2) 35 2 1 op rs rt 16 bit offset ALU control input 000 001 010 110 111 100 AND OR add subtract set-on-less-than Why is the code for subtract 110 and not 011? H.Corporaal EmbProcArch 5kk73 48 ALU Control1 Must describe hardware to compute 3-bit ALU control input given instruction type 00 = lw, sw 01 = beq, 10 = arithmetic function code for arithmetic ALU Operation class, computed from instruction type Describe it using a truth table (can turn into gates): inputs ALUOp ALUOp1 ALUOp0 0 0 X 1 1 X 1 X 1 X 1 X 1 X H.Corporaal EmbProcArch 5kk73 F5 X X X X X X X Funct field F4 F3 F2 F1 X X X X X X X X X 0 0 0 X 0 0 1 X 0 1 0 X 0 1 0 X 1 0 1 outputs Operation F0 X X 0 0 0 1 0 010 110 010 110 000 001 111 49 ALU Control1 Simple combinational logic (truth tables) ALUOp ALU control block ALUOp0 ALUOp1 F3 F2 F (5– 0) Operation2 Operation1 Operation F1 Operation0 F0 H.Corporaal EmbProcArch 5kk73 50 Deriving Control2 signals Input 6-bits 9 control (output) signals Memto- Reg Mem Mem Instruction RegDst ALUSrc Reg Write Read Write Branch ALUOp1 ALUp0 R-format 1 0 0 1 0 0 0 1 0 lw 0 1 1 1 1 0 0 0 0 sw X 1 X 0 0 1 0 0 0 beq X 0 X 0 0 0 1 0 1 Determine these control signals directly from the opcodes: R-format: 0 lw: 35 sw: 43 beq: 4 H.Corporaal EmbProcArch 5kk73 51 Control 2 Inputs Op5 Op4 Op3 PLA example implementation Op2 Op1 Op0 Outputs R-format Iw sw beq RegDst ALUSrc MemtoReg RegWrite MemRead MemWrite Branch ALUOp1 ALUOpO H.Corporaal EmbProcArch 5kk73 52 Single Cycle Implementation Calculate cycle time assuming negligible delays except: memory (2ns), ALU and adders (2ns), register file access (1ns) PCSrc Add ALU Add result 4 RegWrite Instruction [25– 21] PC Read address Instruction [31– 0] Instruction memory Instruction [20– 16] 1 M u Instruction [15– 11] x 0 RegDst Instruction [15– 0] Read register 1 Read register 2 Read data 1 Read Write data 2 register Write Registers data 16 Sign 32 extend 1 M u x 0 Shift left 2 MemWrite ALUSrc 1 M u x 0 Zero ALU ALU result MemtoReg Address Write data ALU control Read data Data memory 1 M u x 0 MemRead Instruction [5– 0] ALUOp H.Corporaal EmbProcArch 5kk73 53 Single Cycle Implementation Memory (2ns), ALU & adders (2ns), reg. file access (1ns) Fixed length clock: longest instruction is the ‘lw’ which requires 8 ns Variable clock length (not realistic, just as exercise): R-instr: Load: Store: Branch: Jump: 6 ns 8 ns 7 ns 5 ns 2 ns Average depends on instruction mix H.Corporaal EmbProcArch 5kk73 54 Where we are headed Single Cycle Problems: what if we had a more complicated instruction like floating point? wasteful of area: NO Sharing of Hardware resources One Solution: use a “smaller” cycle time have different instructions take different numbers of cycles a “multicycle” datapath: Instruction register PC Address ALU Registers Memory data register MDR H.Corporaal EmbProcArch 5kk73 A Register # Instruction Memory or data Data Data IR ALUOut Register # B Register # 55 Multicycle Approach We will be reusing functional units Add registers after every major functional unit Our control signals will not be determined solely by instruction ALU used to compute address and to increment PC Memory used for instruction and data e.g., what should the ALU do for a “subtract” instruction? We’ll use a finite state machine (FSM) or microcode for control H.Corporaal EmbProcArch 5kk73 56 Review: finite state machines Finite state machines: a set of states and next state function (determined by current state and the input) output function (determined by current state and possibly input) Current state Next-state function Next state Clock Inputs Output function Outputs We’ll use a Moore machine (output based only on current state) H.Corporaal EmbProcArch 5kk73 57 Multicycle Approach Break up the instructions into steps, each step takes a cycle At the end of a cycle balance the amount of work to be done restrict each cycle to use only one major functional unit store values for use in later cycles (easiest thing to do) introduce additional “internal” registers Notice: we distinguish processor state: programmer visible registers internal state: programmer invisible registers (like IR, MDR, A, B, and ALUout) H.Corporaal EmbProcArch 5kk73 58 Multicycle Approach PC 0 M u x 1 Address Memory Instruction [25–21] Read register 1 Instruction [20–16] Read Read data1 register 2 Registers Write Read register data2 MemData Write data Instruction [15–0] Instruction [15–11] Instruction register Instruction [15–0] Memory data register H.Corporaal EmbProcArch 5kk73 0 M u x 1 A B 0 M u x 1 Sign extend 32 Zero ALU ALU result ALUOut 0 4 Write data 16 0 M u x 1 1M u 2x 3 Shift left 2 59 Multicycle Approach Note that previous picture does not include: branch support jump support Control lines and logic Tclock > max (ALU delay, Memory access, Regfile access) See book for complete picture H.Corporaal EmbProcArch 5kk73 60 Five Execution Steps Instruction Fetch Instruction Decode and Register Fetch Execution, Memory Address Computation, or Branch Completion Memory Access or R-type instruction completion Write-back step INSTRUCTIONS TAKE FROM 3 - 5 CYCLES! H.Corporaal EmbProcArch 5kk73 61 Step 1: Instruction Fetch Use PC to get instruction and put it in the Instruction Register Increment the PC by 4 and put the result back in the PC Can be described succinctly using RTL "Register-Transfer Language" IR = Memory[PC]; PC = PC + 4; Can we figure out the values of the control signals? What is the advantage of updating the PC now? H.Corporaal EmbProcArch 5kk73 62 Step 2: Instruction Decode and Register Fetch Read registers rs and rt in case we need them Compute the branch address in case the instruction is a branch Previous two actions are done optimistically!! RTL: A = Reg[IR[25-21]]; B = Reg[IR[20-16]]; ALUOut = PC+(sign-extend(IR[15-0])<< 2); We aren't setting any control lines based on the instruction type (we are busy "decoding" it in our control logic) H.Corporaal EmbProcArch 5kk73 63 Step 3 (instruction dependent) ALU is performing one of four functions, based on instruction type Memory Reference: ALUOut = A + sign-extend(IR[15-0]); R-type: ALUOut = A op B; Branch: if (A==B) PC = ALUOut; Jump: PC = PC[31-28] || (IR[25-0]<<2) H.Corporaal EmbProcArch 5kk73 64 Step 4 (R-type or Memory-access) Loads and stores access memory MDR = Memory[ALUOut]; or Memory[ALUOut] = B; R-type instructions finish Reg[IR[15-11]] = ALUOut; The write actually takes place at the end of the cycle on the edge H.Corporaal EmbProcArch 5kk73 65 Write-back step Memory read completion step Reg[IR[20-16]]= MDR; What about all the other instructions? H.Corporaal EmbProcArch 5kk73 66 Summary execution steps Steps taken to execute any instruction class Step name Instruction fetch Action for R-type instructions Instruction decode/register fetch Action for memory-reference Action for instructions branches IR = Memory[PC] PC = PC + 4 A = Reg [IR[25-21]] B = Reg [IR[20-16]] ALUOut = PC + (sign-extend (IR[15-0]) << 2) Execution, address computation, branch/ jump completion ALUOut = A op B ALUOut = A + sign-extend (IR[15-0]) Memory access or R-type completion Reg [IR[15-11]] = ALUOut Load: MDR = Memory[ALUOut] or Store: Memory [ALUOut] = B Memory read completion H.Corporaal EmbProcArch 5kk73 if (A ==B) then PC = ALUOut Action for jumps PC = PC [31-28] II (IR[25-0]<<2) Load: Reg[IR[20-16]] = MDR 67 Simple Questions How many cycles will it take to execute this code? lw $t2, 0($t3) lw $t3, 4($t3) beq $t2, $t3, L1 add $t5, $t2, $t3 sw $t5, 8($t3) L1: ... #assume not taken What is going on during the 8th cycle of execution? In what cycle does the actual addition of $t2 and $t3 takes place? H.Corporaal EmbProcArch 5kk73 68 Implementing the Control Value of control signals is dependent upon: Use the information we have accumulated to specify a finite state machine (FSM) what instruction is being executed which step is being performed specify the finite state machine graphically, or use microprogramming Implementation can be derived from specification H.Corporaal EmbProcArch 5kk73 69 0 Start How many state bits will we need? Memory address computation 6 ALUSrcA = 1 ALUSrcB = 10 ALUOp = 00 Memory access 5 MemRead IorD = 1 Write-back step 4 RegDst = 0 RegWrite MemtoReg = 1 ALUSrcA = 0 ALUSrcB = 11 ALUOp = 00 Branch completion 8 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 10 Memory access 3 1 Execution 2 (Op = 'LW') MemRead ALUSrcA = 0 IorD = 0 IRWrite ALUSrcB = 01 ALUOp = 00 PCWrite PCSource = 00 MemWrite IorD = 1 RegDst = 1 RegWrite MemtoReg = 0 Jump completion 9 ALUSrcA = 1 ALUSrcB = 00 ALUOp = 01 PCWriteCond PCSource = 01 R-type completion 7 (Op = 'J') Graphical Specification of FSM Instruction decode/ register fetch Instruction fetch PCWrite PCSource = 10 Finite State Machine for Control PCWrite PCWriteCond IorD MemRead Implementation: MemWrite IRWrite Control logic MemtoReg PCSource ALUOp Outputs ALUSrcB ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 Instruction register opcode field H.Corporaal EmbProcArch 5kk73 S0 S1 S2 S3 Op0 Op1 Op2 Op3 Op4 Op5 Inputs State register 71 Op4 opcode PLA Implementation Op5 (see book) Op3 Op2 Op1 Op0 S3 current state S2 S1 S0 If I picked a horizontal or vertical line could you explain it ? What type of FSM is used? Mealy or Moore? datapath control PCWrite PCWriteCond IorD MemRead MemWrite IRWrite MemtoReg PCSource1 PCSource0 ALUOp1 ALUOp0 ALUSrcB1 ALUSrcB0 ALUSrcA RegWrite RegDst NS3 NS2 NS1 NS0 next state H.Corporaal EmbProcArch 5kk73 72 Pipelined implementation Pipelining Pipelined datapath Pipelined control Hazards: Structural Data Control Exceptions Scheduling For details see the book (chapter 6): H.Corporaal EmbProcArch 5kk73 73 Pipelining Improve performance by increasing instruction throughput P rog ram e x ec ution T im e o rd er (in in structio ns ) lw $ 1, 10 0 ($0 ) 2 Ins truction R eg fe tch lw $ 2, 20 0 ($0 ) 4 6 8 A LU Data access 10 12 14 16 18 R eg Instruction R eg fe tch 8 ns lw $ 3, 30 0 ($0 ) D ata acc ess A LU R eg Instruction fe tch 8 ns ... 8 ns P rog ram e x ec utio n Tim e o rd er (in in struc tio n s) 2 lw $1 , 1 00 ($ 0) Instruction fetch lw $2 , 2 00 ($ 0) 2 ns lw $3 , 3 00 ($ 0) 4 Reg Instruction fetch 2 ns 6 ALU Reg Instruction fetch 2 ns H.Corporaal EmbProcArch 5kk73 8 Da ta access ALU Reg 2 ns 10 14 12 Reg Data access Reg ALU D a ta access 2 ns 2 ns Reg 2 ns 74 Pipelining Ideal speedup = number of stages Do we achieve this? H.Corporaal EmbProcArch 5kk73 75 Pipelining What makes it easy What makes it hard? all instructions are the same length just a few instruction formats memory operands appear only in loads and stores structural hazards: suppose we had only one memory control hazards: need to worry about branch instructions data hazards: an instruction depends on a previous instruction We’ll build a simple pipeline and look at these issues We’ll talk about modern processors and what really makes it hard: exception handling trying to improve performance with out-of-order execution, etc. H.Corporaal EmbProcArch 5kk73 76 Basic idea: start from single cycle impl. What do we need to add to actually split the datapath into stages? IF: Instruction fetch ID: Instruction decode/ register file read EX: Execute/ address calculation MEM: Memory access WB: Write back 0 M u x 1 Add Add Add result 4 Shift left 2 PC Read register 1 Address Instruction Instruction memory Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal EmbProcArch 5kk73 Sign extend Read data 1 M u x 0 32 77 Pipelined Datapath Can you find a problem even if there are no dependencies? What instructions can we execute to manifest the problem? 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add Add result 4 PC Address Instruction memory Instruction Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal EmbProcArch 5kk73 Sign extend Read data 1 M u x 0 32 78 Corrected Datapath 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add 4 Add Add result PC Address Instruction memory I nstr ucti on Shift left 2 Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data 0 M u x 1 Zero ALU ALU result Address Data memory Write data 16 H.Corporaal EmbProcArch 5kk73 Sign extend Read data 1 M u x 0 32 79 Graphically Representing Pipelines Time (in clock cycles) Program execution order (in instructions) CC 1 CC 2 CC 3 CC 4 CC 5 IM Reg ALU DM Reg ALU DM lw $10, 20($1) sub $11, $2, $3 IM Reg CC 6 Reg Can help with answering questions like: how many cycles does it take to execute this code? what is the ALU doing during cycle 4? use this representation to help understand datapaths H.Corporaal EmbProcArch 5kk73 80 Pipeline Control PCSrc 0 M u x 1 IF/ID ID/EX EX/MEM MEM/WB Add Add 4 Add result Branch Shift left 2 PC Address Instruction memory Instruction RegWrite Read register 1 MemWrite Read data 1 Read register 2 Registers Read Write data 2 register Write data ALUSrc Zero Zero ALU ALU result 0 M u x 1 MemtoReg Address Data memory Write Read data 1 M u x 0 data Instruction 16 [15– 0] Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 0 M u x 1 ALU control MemRead ALUOp RegDst H.Corporaal EmbProcArch 5kk73 81 Pipeline control We have 5 stages. What needs to be controlled in each stage? Instruction Fetch and PC Increment Instruction Decode / Register Fetch Execution Memory Stage Write Back How would control be handled in an automobile plant? a fancy control center telling everyone what to do? should we use a finite state machine? H.Corporaal EmbProcArch 5kk73 82 Pipeline Control Instruction R-format lw sw beq Execution/Address Calculation stage control lines Reg ALU ALU ALU Dst Op1 Op0 Src 1 1 0 0 0 0 0 1 X 0 0 1 X 0 1 0 Memory access stage control lines Branc Mem Mem h Read Write 0 0 0 0 1 0 0 0 1 1 0 0 Write-back stage control lines Reg Mem write to Reg 1 0 1 1 0 X 0 X (compare single cycle control!) Pass control signals along just like the data: WB Instruction H.Corporaal EmbProcArch 5kk73 IF/ID Control M WB EX M WB ID/EX EX/MEM MEM/WB 83 Datapath with Control PCSrc ID/EX 0 M u x 1 WB Control IF/ID EX/MEM M WB EX M MEM/WB WB Add ALUSrc Read register 1 Read data 1 Read register 2 Registers Read Write data 2 register Write data Zero ALU ALU result 0 M u x 1 MemtoReg Instruction memory Branch Shift left 2 MemWrite Address Instruction PC Add Add result RegWrite 4 Address Data memory Read data Write data Instruction 16 [15– 0] Instruction [20– 16] Instruction [15– 11] Sign extend 32 6 ALU control 0 M u x 1 1 M u x 0 MemRead ALUOp RegDst H.Corporaal EmbProcArch 5kk73 84 H.Corporaal EmbProcArch 5kk73 85 Hazards: problems due to pipelining Hazard types: Structural Data same resource is needed multiple times in the same cycle data dependencies limit pipelining Control next executed instruction may not be the next specified instruction H.Corporaal EmbProcArch 5kk73 86 Structural hazards Examples: Two accesses to a single ported memory Two operations need the same function unit at the same time Two operations need the same function unit in successive cycles, but the unit is not pipelined Solutions: stalling add more hardware H.Corporaal EmbProcArch 5kk73 87 Structural hazards on MIPS Q: Do we have structural hazards on our simple MIPS pipeline? time IF H.Corporaal EmbProcArch 5kk73 ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 88 Data hazards Data dependencies: (read-after-write) (write-after-write) (write-after-read) Hardware solution: RaW WaW WaR Forwarding / Bypassing Detection logic Stalling Software solution: Scheduling H.Corporaal EmbProcArch 5kk73 89 Data dependences Three types: RaW, WaR and WaW add r1, r2, 5 sub r4, r1, r3 ; r1 := r2+5 ; RaW of r1 add r1, r2, 5 sub r2, r4, 1 ; WaR of r2 add r1, r2, 5 sub r1, r1, 1 ; WaW of r1 st ld ; M[r2+5] := r1 ; RaW if 5+r2 = 0+r4 r1, 5(r2) r5, 0(r4) WaW and WaR do not occur in simple pipelines, but they limit scheduling freedom! Problems for your compiler and your Laptop processors! use register renaming to solve this! H.Corporaal EmbProcArch 5kk73 90 RaW on MIPS pipeline T im e (in clock cycle s) V alue of re giste r $ 2 : CC 1 CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 10 10 10 1 0 /– 2 0 – 20 – 20 – 20 – 20 IM Reg DM Reg P ro gra m e xe cution orde r (in instru ctio ns) sub $ 2 , $ 1 , $ 3 a nd $ 1 2 , $ 2 , $ 5 or $ 1 3 , $ 6 , $ 2 a dd $ 1 4 , $ 2 , $ 2 sw $ 1 5, 1 0 0 ($ 2 ) H.Corporaal EmbProcArch 5kk73 IM DM R eg IM DM R eg IM R eg DM Reg IM R eg R eg R eg DM Reg 91 Forwarding Use temporary results, don’t wait for them to be written register file forwarding to handle read/write to same register ALU forwarding T ime (in clock cycles) CC 1 V alue of re gister $ 2 : 10 V a lue of E X/M EM : X V a lue of M EM /W B : X CC 2 CC 3 CC 4 CC 5 CC 6 CC 7 CC 8 CC 9 10 X X 10 X X 10 – 20 X 1 0/– 20 X – 20 – 20 X X – 20 X X – 20 X X – 20 X X DM R eg Program e xe cution orde r (in instructions) sub $ 2, $1 , $ 3 What if this $2 was $13? a nd $ 12 , $ 2, $5 or $ 13 , $ 6, $ 2 a dd $ 14 , $ 2, $2 sw $ 15 , 1 00 ($2 ) H.Corporaal EmbProcArch 5kk73 IM Reg IM R eg IM DM R eg IM R eg DM DM R eg IM Reg R eg Reg DM Reg 92 Forwarding hardware ALU forwarding circuitry principle: from register file ALU to register file from register file Note: there are two options • buf - ALU – bypass – mux - buf • buf - bypass – mux – ALU - buf H.Corporaal EmbProcArch 5kk73 93 Forwarding ID/EX WB Control PC Instruction memory Instruction IF/ID EX/MEM M WB EX M MEM/WB WB M u x Registers ForwardA ALU Data memory M u x IF/ID.RegisterRs Rs IF/ID.RegisterRt Rt IF/ID.RegisterRt Rt IF/ID.RegisterRd Rd ForwardB M u x EX/MEM.RegisterRd Forwarding unit H.Corporaal EmbProcArch 5kk73 M u x MEM/WB.RegisterRd 94 Forwarding check Check for matching register-ids: For each source-id of operation in the EX-stage check if there is a matching pending dest-id Example: if (EX/MEM.RegWrite) (EX/MEM.RegisterRd 0) (EX/MEM.RegisterRd = ID/EX.RegisterRs) then ForwardA = 10 Q. How many comparators do we need? H.Corporaal EmbProcArch 5kk73 95 Can't always forward Load word can still cause a hazard: an instruction tries to read register r following a load to the same r Need a hazard detection unit to “stall” the load instruction Time (in clock cycles) Program CC 1 execution order (in instructions) lw $2, 20 ($1) and $4, $2, $5 or $8, $2, $6 add $9, $4, $2 slt $1, $6, $7 H.Corporaal EmbProcArch 5kk73 IM CC 2 CC 3 Reg IM CC 4 CC 5 DM Reg Reg IM DM Reg IM CC 6 CC 8 CC 9 Reg DM Reg IM CC 7 Reg DM Reg Reg DM Reg 96 Stalling We can stall the pipeline by keeping an instruction in the same stage Program Time(inclockcycles) execution CC1 CC2 order (ininstructions) lw$2, 20($1) and$4, $2, $5 or $8, $2, $6 IM CC3 Reg IM CC4 CC5 DM Reg Reg Reg IM IM CC6 CC7 DM Reg Reg DM CC8 CC9 CC10 Reg bubble add$9, $4, $2 In CC4 the ALU is not used, slt $1, $6, $7 Reg, and IM are redone H.Corporaal EmbProcArch 5kk73 IM DM Reg IM Reg Reg DM Reg 97 Hazard Detection Unit ID/EX.MemRead Hazard detection unit ID/EX IF/IDWrite WB Control 0 M u x PC Instruction memory Instruction PCWrite IF/ID EX/MEM M WB EX M MEM/WB WB M u x Registers ALU Data memory M u x M u x IF/ID.RegisterRs IF/ID.RegisterRt H.Corporaal EmbProcArch 5kk73 IF/ID.RegisterRt Rt IF/ID.RegisterRd Rd ID/EX.RegisterRt Rs Rt M u x EX/MEM.RegisterRd Forwarding unit MEM/WB.RegisterRd 98 Software only solution? Have compiler guarantee that no hazards occur Example: where do we insert the “NOPs” ? sub and or add sw $2, $12, $13, $14, $13, $1, $3 $2, $5 $6, $2 $2, $2 100($2) sub nop nop and or Problem: this really slows us down! add nop sw H.Corporaal EmbProcArch 5kk73 $2, $1, $3 $12, $2, $5 $13, $6, $2 $14, $2, $2 $13, 100($2) 99 Control hazards Control operations may change the sequential flow of instructions branch jump call (jump and link) return (exception/interrupt and rti / return from interrupt) H.Corporaal EmbProcArch 5kk73 100 Control hazard: Branch Branch actions: Compute new address Determine condition Perform the actual branch (if taken): PC := new address H.Corporaal EmbProcArch 5kk73 101 Branch example Progra m Time (in clock cycle s) execu tion CC 1 CC 2 IM Reg CC 3 CC 4 CC 5 DM R eg CC 6 CC 7 CC 8 CC 9 order (in instructions) 40 beq $1, $3, 7 44 an d $1 2, $2 , $ 5 48 or $13, $6 , $2 52 ad d $1 4, $2 , $ 2 72 lw $4 , 50($ 7) H.Corporaal EmbProcArch 5kk73 IM R eg IM DM R eg IM R eg DM R eg IM R eg DM R eg Reg DM R eg 102 Branching Squash pipeline: When we decide to branch, other instructions are in the pipeline! We are predicting “branch not taken” need to add hardware for flushing instructions if we are wrong H.Corporaal EmbProcArch 5kk73 103 Branch with predict not taken Clock cycles Branch L IF Predict not taken L: H.Corporaal EmbProcArch 5kk73 ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 104 Branch speedup Earlier address computation Earlier condition calculation Put both in the ID pipeline stage adder comparator Clock cycles Branch L Predict not taken L: H.Corporaal EmbProcArch 5kk73 IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB 105 Improved branching / flushing IF/ID IF.Flush Hazard detection unit ID/EX M u x WB Control 0 M u x IF/ID 4 M WB EX M MEM/WB WB Shift left 2 Registers PC EX/MEM = M u x Instruction memory ALU M u x Data memory M u x Sign extend M u x Forwarding unit H.Corporaal EmbProcArch 5kk73 106 Exception support Types of exceptions: Overflow I/O device request Operating system call Undefined instruction Hardware malfunction Page fault Precise exception: finish previous instructions (which are still in the pipeline) flush excepting and following instructions, redo them after handling the exception(s) H.Corporaal EmbProcArch 5kk73 107 Exceptions Changes needed for handling overflow exception of an operation in EX stage (see book for details) : Extend PC input mux with extra entry with fixed address Add EPC register recording the ID/EX stage PC this is the address of the next instruction ! Cause register recording exception type E.g., in case of overflow exception insert 3 bubbles; flush the following stages: IF/ID stage ID/EX stage EX/MEM stage H.Corporaal EmbProcArch 5kk73 108 Scheduling, why? Let’s look at the execution time: Texecution = Ncycles x Tcycle = Ninstructions x CPI x Tcycle Scheduling may reduce Texecution Reduce CPI (cycles per instruction) early scheduling of long latency operations avoid pipeline stalls due to structural, data and control hazards allow Nissue > 1 and therefore CPI < 1 Reduce Ninstructions compact many operations into each instruction (VLIW) H.Corporaal EmbProcArch 5kk73 109 Scheduling data hazards: example 1 Try and avoid RaW stalls (in this case load interlocks)! E.g., reorder these instructions: lw lw sw sw $t0, $t2, $t2, $t0, H.Corporaal EmbProcArch 5kk73 0($t1) 4($t1) 0($t1) 4($t1) lw lw sw sw $t0, $t2, $t0, $t2, 0($t1) 4($t1) 4($t1) 0($t1) 110 Scheduling data hazards example 2 Avoiding RaW stalls: Reordering instructions for following program (by you or the compiler) Unscheduled code: Lw R1,b Lw R2,c Add R3,R1,R2 interlock Sw a,R3 Lw R1,e Lw R2,f Sub R4,R1,R2 interlock Sw d,R4 Code: a = b + c d = e - f H.Corporaal EmbProcArch 5kk73 Scheduled code: Lw R1,b Lw R2,c Lw R5,e extra reg. needed! Add R3,R1,R2 Lw R2,f Sw a,R3 Sub R4,R5,R2 Sw d,R4 111 Scheduling control hazards Texecution = Ninstructions x CPI x Tcycle CPI = CPIideal + fbranch x Pbranch Pbranch = Ndelayslots x miss_rate Modern processors tend to have large branch penalty, Pbranch, due to: many pipeline stages multi-issue Note that penalties have larger effect when CPIideal is low H.Corporaal EmbProcArch 5kk73 112 Scheduling control hazards What can we do about control hazards and CPI penalty? Keep penalty Pbranch low: Early computation of new PC Early determination of condition Visible branch delay slots filled by compiler (MIPS) Branch prediction Reduce control dependencies (control height reduction) [Schlansker and Kathail, Micro’95] Remove branches: if-conversion Conditional instructions: CMOVE, cond skip next Guarding all instructions: TriMedia H.Corporaal EmbProcArch 5kk73 113 Branch delay slot Add a branch delay slot: the next instruction after a branch is always executed rely on compiler to “fill” the slot with something useful Is this a good idea? let's look how it works H.Corporaal EmbProcArch 5kk73 114 Branch delay slot scheduling Q. What to put in the delay slot? op 1 beq r1,r2, L ............. 'fall-through' op 2 ............. branch target L: op 3 ............. H.Corporaal EmbProcArch 5kk73 115 Summary Modern processors are (deeply) pipelined, to reduce Tcycle and aim at CPI = 1 Hazards increase CPI Several software and hardware measure to avoid or reduce hazards are taken Not discussed, but important developments: Multi-issue further reduces CPI Branch prediction to avoid high branch penalties Dynamic scheduling In all cases: a scheduling compiler needed H.Corporaal EmbProcArch 5kk73 116 Recap of MIPS RISC architecture Register space Addressing Instruction format Pipelining H.Corporaal EmbProcArch 5kk73 117 Why RISC? Keep it simple RISC characteristics: Reduced number of instructions Limited addressing modes Large register set uniform (no distinction between e.g. address and data registers) Limited number of instruction sizes (preferably one) load-store architecture enables pipelining know directly where the following instruction starts Limited number of instruction formats Memory alignment restrictions ...... Based on quantitative analysis " the famous MIPS one percent rule": don't even think about it when its not used more than one percent H.Corporaal EmbProcArch 5kk73 118 Register space Name Register Number $zero 0 $at 1 $v0 - $v1 2-3 $a0 - $a3 4-7 $t0 - $t7 8-15 $s0 - $s7 16-23 $t8 - $t9 24-25 $gp 28 $sp 29 $fp 30 $ra 31 H.Corporaal EmbProcArch 5kk73 32 integer (and 32 floating point) registers of 32-bit Usage Preserve on call? constant 0 (hardware) n.a. reserved for assembler n.a. returned values no arguments yes temporaries no saved values yes temporaries no global pointer yes stack pointer yes frame pointer yes return addr (hardware) yes 119 Addressing 1. Immediate addressing op rs rt Immediate 2. Register addressing op rs rt rd ... funct Registers Register 3. Base addressing op rs rt Memory Address + Register Byte Halfword Word 4. PC-relative addressing op rs rt Memory Address PC + Word 5. Pseudodirect addressing op Address PC H.Corporaal EmbProcArch 5kk73 Memory Word 120 Instruction format R op rs rt rd I op rs rt 16 bit address J op Example instructions Instruction add $s1,$s2,$s3 addi $s2,$s3,4 lw $s1,100($s2) bne $s4,$s5,L j Label H.Corporaal EmbProcArch 5kk73 shamt funct 26 bit address Meaning $s1 = $s2 + $s3 $s2 = $s3 + 4 $s1 = Memory[$s2+100] if $s4<>$s5 goto L goto Label 121 Pipelining All integer instructions fit into the following pipeline time IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM WB IF ID EX MEM H.Corporaal EmbProcArch 5kk73 WB 122 Other architecture styles Accumulator architecture Stack three operands, all in registers loads and stores are the only instructions accessing memory (i.e. with a memory (indirect) addressing mode Register-Memory zero operand: all operands implicit (on TOS) Register (load store) one operand (in register or memory), accumulator almost always implicitly used two operands, one in memory Memory-Memory three operands, may be all in memory (there are more varieties / combinations) H.Corporaal EmbProcArch 5kk73 123 Accumulator architecture Accumulator latch ALU registers address Memory latch Example code: a = b+c; load b; add c; store a; H.Corporaal EmbProcArch 5kk73 // accumulator is implicit operand 124 Stack architecture latch latch top of stack ALU latch Example code: a = b+c; push b; push b push c; b add; stack: pop a; H.Corporaal EmbProcArch 5kk73 Memory stack pt push c add c b b+c pop a 125 Other architecture styles Let's look at the code for C = A + B Stack Architecture Accumulator Architecture RegisterMemory MemoryMemory Register (load-store) Push A Load A Load r1,A Add C,B,A Load r1,A Push B Add Add Add Store C Pop C B r1,B Store C,r1 Load r2,B Add r3,r1,r2 Store C,r3 Q: What are the advantages / disadvantages of load-store (RISC) architecture? H.Corporaal EmbProcArch 5kk73 126