Computer Systems
The processor architecture
University of Amsterdam
Arnoud Visser

Basic Knowledge
• Relative timing of the elements is important

Programmer-visible state
• Program registers: %eax, %ecx, %edx, %ebx, %esi, %edi, %esp, %ebp
• Condition codes (CC)
• Program counter (PC)
• Memory
• Von Neumann architecture: both instructions and data in memory

Program counter
  Address      Region
  0xffffffff
               Kernel virtual memory (memory invisible to user code)
  0xc0000000
               User stack (created at runtime)
               Memory-mapped region for shared libraries (e.g. the printf() function)
  0x40000000
               Run-time heap (created at runtime by malloc)
               Read/write data (loaded from the hello executable file)
               Read-only code and data (loaded from the hello executable file; the PC points here)
  0x08048000
               Unused
  0
• The program counter holds the address of the instruction currently executed
• The next instruction has to be collected from memory (slow!)

Processing a single instruction
• Fetch – Read the instruction (1-6 bytes) from memory
• Decode – Read the values from the registers
• Execute – Perform an arithmetic/logic operation OR test the jump conditions
• Memory – Read/write to memory
• Write back – Update the registers
• PC update – Set the address of the next instruction

Sequential architecture
• Hardware connected with named wires (word & bytes, byte & bits, bit)
[Figure: sequential datapath, bottom to top. Fetch: PC, instruction memory (Byte 0, Bytes 1-5), Split, Align, PC increment, with signals Instr valid, Need regids, Need valC and outputs icode, ifun, rA, rB, valC, valP. Decode: register file (ports A, B, M, E). Execute: ALU and CC. Memory: data memory. Write back. PC.]

Stage Computation: ALU Operation
OPl rA, rB
  Stage       Computation              Comment
  Fetch       icode:ifun ← M1[PC]      Read instruction byte
              rA:rB ← M1[PC+1]         Read register byte
              valP ← PC+2              Compute next PC
  Decode      valA ← R[rA]             Read operand A
              valB ← R[rB]             Read operand B
  Execute     valE ← valB OP valA      Perform ALU operation
              Set CC                   Set condition code register
  Memory
  Write back  R[rB] ← valE             Write back result
  PC update   PC ← valP                Update PC
– Formulate instruction execution as a sequence of simple steps
– Use the same general form for all instructions

Stage Computation: procedure call
call Dest
  Stage       Computation              Comment
  Fetch       icode:ifun ← M1[PC]      Read instruction byte
              valC ← M4[PC+1]          Read destination address
              valP ← PC+5              Compute return point
  Decode      valB ← R[%esp]           Read stack pointer
  Execute     valE ← valB + –4         Decrement stack pointer
  Memory      M4[valE] ← valP          Write return value on stack
  Write back  R[%esp] ← valE           Update stack pointer
  PC update   PC ← valC                Set PC to destination
– Use ALU to decrement stack pointer
– Store incremented PC

Stage Computation: jump
jXX Dest
  Stage       Computation              Comment
  Fetch       icode:ifun ← M1[PC]      Read instruction byte
              valC ← M4[PC+1]          Read destination address
              valP ← PC+5              Fall-through address
  Decode
  Execute     Bch ← Cond(CC, ifun)     Take branch?
  Memory
  Write back
  PC update   PC ← Bch ? valC : valP   Update PC
– Compute both addresses
– Choose based on setting of condition codes and branch condition XX/ifun

Branch conditions
  jXX   icode  ifun  Condition codes     Description
  jmp   7      0     1                   Direct jump
  jle   7      1     (SF^OF) | ZF        Less or equal <=
  jl    7      2     SF^OF               Less <
  je    7      3     ZF                  Equal ==
  jne   7      4     ~ZF                 Not equal !=
  jge   7      5     ~(SF^OF)            Greater or equal >=
  jg    7      6     ~(SF^OF) & ~ZF      Greater >

Execute Logic
Datapaths & Control Logic
– ALU fun: select function
– ALU A: select input A
– ALU B: select input B
– Set CC: should the condition code register be loaded?
[Figure: execute stage. Inputs icode, ifun, valC, valA, valB; control blocks ALU fun., ALU A, ALU B, Set CC, bcond; units ALU and CC; outputs valE and Bch.]

Control logic: ALU A
  Instruction        Execute                  Comment
  OPl rA, rB         valE ← valB OP valA      Perform ALU operation
  rmmovl rA, D(rB)   valE ← valB + valC       Compute effective address
  popl rA            valE ← valB + 4          Increment stack pointer
  jXX Dest                                    No operation
  call Dest          valE ← valB + –4         Decrement stack pointer
  ret                valE ← valB + 4          Increment stack pointer

  int aluA = [
      icode in { IRRMOVL, IOPL } : valA;
      icode in { IIRMOVL, IRMMOVL, IMRMOVL } : valC;
      icode in { ICALL, IPUSHL } : -4;
      icode in { IRET, IPOPL } : 4;
      # Other instructions don't need ALU
  ];

Hardware structure
• This can be translated into silicon
[Figure: complete sequential hardware structure. Fetch (PC, instruction memory, PC increment; icode, ifun, rA, rB, valC, valP), Decode and Write back (register file with ports A, B, M, E; dstE, dstM, srcA, srcB; valA, valB), Execute (ALU fun., ALU A, ALU B, ALU, CC; valE, Bch), Memory (Mem. control with read/write, Addr, Data, data memory; valM), New PC (newPC).]

Sequential is too slow
• The clock has to be slow enough to let the signal propagate through all wires and transistors
• Critical path: the slowest path between any two storage devices

Pipelining
• Divide the operation into stages, and allow the next operation to start as soon as the previous operation has finished the first stage
[Figure: three stages in series, each consisting of combinational logic (A, B, C; 100 ps each) followed by a register (20 ps each), all driven by a common clock.]
• Increases the throughput, but also increases the latency

Insert registers between stages
• Pipeline registers mean extra silicon and delay
  Cycle  1  2  3  4  5  6  7  8  9
  I1     F  D  E  M  W
  I2        F  D  E  M  W
  I3           F  D  E  M  W
  I4              F  D  E  M  W
  I5                 F  D  E  M  W
  In cycle 5: I1 is in W, I2 in M, I3 in E, I4 in D, I5 in F.
[Figure: pipelined hardware structure with pipeline registers F, D, E, M, W between the stages Fetch, Decode, Execute, Memory and Write back. Signal labels in the figure: predPC, f_PC, valP; icode, ifun, rA, rB, valC; d_srcA, d_srcB, valA, valB; aluA, aluB, valE, CC, Bch; M_icode, M_Bch, M_valA, Addr, Data, valM; W_icode, W_valE, W_valM, W_dstE, W_dstM.]

Data hazards
Additional pipeline control is needed to prevent unintended interactions between instructions
• Stalling (wait a few cycles until the hazard is gone)
• Data forwarding (passing a value to E before M/W)
Pipeline architecture already used for i386
http://www.pcmech.com/show/processors/35/

Pipeline efficiency
Pipeline control can prevent many, but not all, interactions between instructions → bubbles
For the model described in the book:
• Load/use hazards (20% of load instr. → 1 bubble)
• Mispredicted branches (40% of jmp instr. → 2 bubbles)
• Return from procedure calls (100% of ret instr. → 3 bubbles)

Today’s architectures
• Superscalar (Pentium) (often two instructions/cycle)
• Dynamic execution (P6) (three instructions out-of-order/cycle)
• Explicit parallelism (Itanium) (six execution units)

Hyper-Threading
http://or1cedar.intel.com/media/training/detect_ht_dt_v1/tutorial/ch6/topic04.htm

Metrics of performance
Levels of the system (top to bottom): Application, Programming Language, Compiler, ISA, Datapath / Control, Function Units, Transistors / Wires / Pins
Metrics (top to bottom):
• Answers per month
• Scaling of algorithms
• (Millions of) instructions per second – MIPS
• (Millions of) floating-point operations per second – MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
Each metric has a place and a purpose, and each can be optimized

Summary
• Shown that an instruction set architecture can be translated onto multiple processor architectures
– Complicated control logic on datapaths
– Compilers have to optimize the control logic for multiple machines/targets
– A programmer can aid or frustrate the compiler

Assignment
• Practice Problem 4.21 (page 314)
  Block       A      B      C      D      E      F      Reg
  Delay       80 ps  30 ps  60 ps  50 ps  70 ps  10 ps  20 ps
Calculate the throughput and latency of an n-stage pipeline for the given 6 blocks
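
Note on the "Branch conditions" and "Stage Computation: jump" slides: the condition table and the PC update rule can be written out as a small Python sketch. The function names cond and next_pc and the tuple layout (ZF, SF, OF) are illustrative choices, not taken from the slides.

```python
def cond(ifun: int, zf: bool, sf: bool, of: bool) -> bool:
    """Evaluate Cond(CC, ifun) for a jXX instruction (icode 7)."""
    lt = sf != of  # SF ^ OF: signed "less than" after a compare/subtract
    table = {
        0: True,                # jmp: always taken
        1: lt or zf,            # jle
        2: lt,                  # jl
        3: zf,                  # je
        4: not zf,              # jne
        5: not lt,              # jge
        6: not lt and not zf,   # jg
    }
    return table[ifun]


def next_pc(ifun: int, cc: tuple, val_c: int, val_p: int) -> int:
    """PC update stage: PC <- Bch ? valC : valP."""
    bch = cond(ifun, *cc)  # cc is assumed to be (ZF, SF, OF)
    return val_c if bch else val_p
```

For example, with all flags clear (the last result was positive and non-zero), next_pc(6, (False, False, False), 0x100, 0x25) selects the jump target valC for jg, while the same flags with ifun 2 (jl) select the fall-through address valP.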
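
Note on the "Control logic: ALU A" slide: the HCL case expression for aluA reads as a chain of tests where the first matching case wins. A minimal Python sketch of the same selection follows; instruction codes are represented as strings here to avoid assuming their numeric encodings, and the function name alu_a is hypothetical.

```python
from typing import Optional


def alu_a(icode: str, val_a: int, val_c: int) -> Optional[int]:
    """Select ALU input A the way the HCL case expression does."""
    if icode in {"IRRMOVL", "IOPL"}:
        return val_a            # register operand
    if icode in {"IIRMOVL", "IRMMOVL", "IMRMOVL"}:
        return val_c            # constant or displacement
    if icode in {"ICALL", "IPUSHL"}:
        return -4               # decrement stack pointer
    if icode in {"IRET", "IPOPL"}:
        return 4                # increment stack pointer
    return None                 # other instructions don't need the ALU
```

Returning None for the remaining instructions is a choice of this sketch; in hardware the signal simply has no meaningful value in that case.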
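
Note on the "Pipelining" slide: the clock period of a pipeline is set by its slowest stage plus the register delay, the latency is the clock period times the number of stages, and the throughput is one instruction per clock period. The sketch below applies this to the 100 ps / 20 ps example from the slide; the function name pipeline_timing is hypothetical, and it assumes every stage is followed by one register with the same delay.

```python
def pipeline_timing(stage_delays_ps, reg_delay_ps):
    """Return (clock period in ps, latency in ps, throughput in GIPS)."""
    clock_ps = max(stage_delays_ps) + reg_delay_ps   # slowest stage + register
    latency_ps = clock_ps * len(stage_delays_ps)     # one period per stage
    throughput_gips = 1000.0 / clock_ps              # 1 instr per period, in 1e9/s
    return clock_ps, latency_ps, throughput_gips


# Three-stage pipeline from the slide: 100 ps logic + 20 ps register per stage.
pipelined = pipeline_timing([100, 100, 100], 20)
# Same logic as one unpipelined block followed by a single register.
unpipelined = pipeline_timing([300], 20)
```

With these numbers the pipelined version has a 120 ps clock and 360 ps latency, against 320 ps for both in the unpipelined version: throughput goes up, and so does latency.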
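
Note on the "Pipeline efficiency" slide: the bubble counts can be turned into an average number of cycles per instruction (CPI) once the instruction mix is known. The slide gives only the conditional percentages, so the mix below (25% loads, 20% conditional jumps, 2% returns) is an assumption, recalled from the textbook's model and to be checked against the book. The function name cpi is hypothetical.

```python
def cpi(penalties):
    """CPI = 1 + sum(instruction fraction * hazard probability * bubbles)."""
    return 1.0 + sum(frac * prob * bubbles for frac, prob, bubbles in penalties)


# (fraction of all instructions, probability of the hazard, bubbles per hazard)
# The first number in each tuple is an ASSUMED instruction mix, not from the slide.
model = [
    (0.25, 0.20, 1),  # load/use hazard
    (0.20, 0.40, 2),  # mispredicted branch
    (0.02, 1.00, 3),  # return from procedure call
]
average_cpi = cpi(model)
```

Under these assumptions the penalty terms are 0.05, 0.16 and 0.06, so the average comes out at about 1.27 cycles per instruction, with mispredicted branches as the largest contributor.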