Embedded Processor Architecture 5kk73
DSP
Henk Corporaal / Bart Mesman

[Figure: processor spectrum, trading flexibility against efficiency: Programmable CPU, Programmable DSP, Application specific instruction set processor (ASIP), Application specific processor]

Application examples (1)

[Figure: FIR filter with taps x4..x0 separated by Z^-1 delay elements, multiplied by coefficients c4..c0 and summed into output y]

    #define NTAPS 4
    int fir(int in)
    {
        int i;
        /* NTAPS+1 elements: the code below uses indices 0..NTAPS */
        static int state[NTAPS + 1];
        static int coeff[NTAPS + 1];
        int out[NTAPS + 1];

        state[NTAPS] = in;                  /* newest sample enters the delay line */
        out[0] = state[0] * coeff[0];
        for (i = 1; i < NTAPS + 1; i++) {
            out[i] = out[i-1] + state[i] * coeff[i];
            state[i-1] = state[i];          /* shift the delay line */
        }
        return out[NTAPS];
    }

Application examples (1)

    .L1000006:
        sll    $3, $2, 2             R3=R2<<2            R3=4*i (byte offset)
        addu   $14, $15, $3          R14=R15+R3
        lw     $24, 0($14)           R24=load(*R14)      R24=coeff[i-1]
        addiu  $12, $6, -4           R12=R6-4
        addu   $11, $12, $3          R11=R12+R3
        lw     $13, 0($11)           R13=load(*R11)      R13=state[i-1]
        nop
        mult   $24, $13              HI/LO=R24*R13
        addu   $25, $sp, $3          R25=sp+R3
        lw     $9, -4($25)           R9=load(R25-4)      R9=out[i-1]
        addiu  $2, $2, 1             R2=R2+1             i=i+1
        mflo   $13                                       R13=move from low mpy reg
        addu   $10, $9, $13          R10=R9+R13          R10=out[i]
        sw     $10, 0($25)           mem(*R25)=R10
        addu   $25, $7, $3           R25=R7+R3
        sw     $24, 0($25)           mem(*R25)=R24
        slti   $24, $2, 10
        bne    $24, $0, .L1000006
        addiu  $15, $7, -4

19 instructions per tap!!

Application examples (2)
Bit level operations: finite field arithmetic

    temp1 = input << 1
    temp2 = if (bit(input,7) == 1) then 29 else 0
    out   = temp1 exor temp2

             r1 = LB input             Load byte
             r2 = SLL r1, 1            Shift left logical
             r3 = ANDI r1, mask        AND immediate
             r4 = ADDI r3, -1          ADD immediate
             BNE (r4 != r0) nonzero    Branch on != to nonzero
             nop
             r5 = XORI(r2, 29)         Exclusive or immediate
             J common                  Jump
             nop
    nonzero: r5 = XOR(r2, r0)          Exclusive OR
    common:  …

10 instructions!!
Very simple in hardware
[Figure: wiring from in[0]..in[7] to out[0]..out[7] with three exor gates]

Application examples (2)
Bit level operations: DES example

[Figure: bits 27, 26, 25, 23, 22 and 20 of the source register ($2) are gathered into bits 7..2 of the destination register ($24)]

    srl   $13, $2, 20
    andi  $25, $13, 1
    srl   $14, $2, 21
    andi  $24, $14, 6
    or    $15, $25, $24
    srl   $13, $2, 22
    andi  $14, $13, 56
    or    $25, $15, $14
    sll   $24, $25, 2

Application examples (2)
Bit level operations: A5 example (GSM encryption)

[Figure: bits 18, 17, 16 and 13 of $5 are combined by exor into a single bit in $13]

    srl   $24, $5, 18
    srl   $25, $5, 17
    xor   $8, $24, $25
    srl   $9, $5, 16
    xor   $10, $8, $9
    srl   $11, $5, 13
    xor   $12, $10, $11
    andi  $13, $12, 1
    xor   …

Application examples: conclusions
• CPUs offer flexibility, but…
• not efficient in performance
• not efficient in code size
• not efficient in power consumption

Power consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design.
Solution in direction of:
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency
source: ISSCC 2001, Patrick Gelsinger, Intel

Amdahl's law
• Impact of an improvement on the execution time of a program depends on 2 parameters:
  – f = fraction of the original computation time that is affected by the improvement
  – s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall ≈ 1 / (1 - f)
• Example: 40 % of the program can be executed 10 x faster:
  speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56

Conclusions
• Programmable CPU cores are important for the control parts
of the application.
• They are well supported with tools for the development of end-user software (vs. deeply embedded sw).
• Keep it Simple heuristic (RISC vs. CISC)
• Make frequent cases fast and rare cases correct.
• Regular (orthogonal) instruction set
• No special features that match a high level language construct.
• At least 16 registers to ease register allocation.
• Embedded cores are often light cores, a compromise between performance, area and power dissipation (vs. stand-alone CPU cores, which are optimised for performance).

Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power
  sec/prog = instr/prog * cycles/instr * sec/cycle, with CPI (cycles/instr) = 1
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types, e.g. arrays, delay lines (vs. scalars for CPUs)
• difficult to compare architectures
  – e.g. FFT: DIT, DIF, radix 2/4, loop unrolling, scaling, shuffling, initialisation … can be included or forgotten
• benchmarking (Berkeley Design Technology Inc (BDTI)) (compare to SPECint benchmarks for CPUs)

Outline
• architectures for programmable DSPs
  – multiplier-accumulator
  – modified Harvard architecture
  – extension with an ALU (decision making)
  – controller architectures
  – examples: TI, Motorola, Philips
• code generation
• developments: VLIW (Very Long Instruction Word)
  – examples: C6 and TM

DSP data types
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantage FP: most specs are in FP (conversion to integer is time consuming, since the behavior may change)
• disadvantage FP: cost (area, speed, power)
• integer multiplication doubles the number of bits: n * n => 2n

[Figure: multiplier-accumulator datapath with inputs c(i), x(i) and control]
MAC datapath stages: MPY (Booth, Wallace, ...) -> PR (clocked register) -> ADDER -> ACR (clocked register) -> SHIFT / ROUND / TRUNCATE

Memory architectures for c(i) * x(i)
• Von Neumann (sequential): one prog/data memory, one EXU
• Harvard: prog mem. and data mem., one EXU
• Modified Harvard: prog mem., data mem. 1 and data mem. 2, one EXU
Goal = 1 cycle per iteration

[Figure: DSP block diagram: PC (Reset, +1, Interrupt address, Stack), Program Memory, IR, Control; ACU_A and ACU_B with AR_A and AR_B addressing RAM_A and RAM_B; DR_A, DR_B; Bus; MAC; Rfile]

• time loop
• filter loop (i): ci * xi, 1 cycle/tap?
• How to update the delay line?
[Figure: FIR filter with taps x5..x1, Z^-1 delay elements, coefficients c5..c1 and summed output y]

Solution 2: indirect addressing

    Memory location:   1 2 3 4 5 6 7 8
    output sample 1:   x1 x2 x3 x4 x5
    output sample 2:   x2 x3 x4 x5 x6
    output sample 3:   x3 x4 x5 x6 x7
    output sample 4:   x4 x5 x6 x7 x8
    output sample 5:   x9 x5 x6 x7 x8

• use of a pointer to mark the beginning of the delay line
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer

ACU architecture and instruction set

[Figure: ACU with registers A and S, a Modulo block, and the output to RAM]

    Instruction   Output   reg A   reg S
    Read_A        A        A       S
    Read_S        S        A       S
    incA          A+1      A+1     S
    decA          A-1      A-1     S
    Step          A+S      A+S     S
    Inc_step      S+1      A       S+1

(an unchanged register value = hold)
Modulo can be implemented as a mask operation if the size is 2^k, e.g. addresses 16 (10 000) to 23 (10 111).

Addressing modes

    mode                  example            meaning
    register              ADD R4, R3         R[R4] = R[R4] + R[R3]
    immediate             ADD R4, #3         R[R4] = R[R4] + #3
    direct                ADD R4, (100)      R[R4] = R[R4] + Mem[100]
    indirect              ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
    indirect w. inc/dec   ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
    indexed               ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays, e.g. x(n)
• index = for stepping through arrays, e.g. x(2n)

Addressing modes: extra for DSP
• 8 ARs (address or auxiliary registers) available
• extra indirect modes
  – circular:     *ARn ± %        post inc/dec by 1, circular
                  *ARn ± AR0 %    post inc/dec by AR0, circular
  – bit reverse:  *ARn ± AR0 B    post inc/dec by AR0, bit reversed

[Figure: the DSP block diagram again, now with an ALU next to the MAC: PC (Reset, +1, Interrupt address, Stack), Program Memory, IR, Control; ACU_A, ACU_B, AR_A, AR_B, RAM_A, RAM_B, DR_A, DR_B; Bus; MAC; ALU; Rfile]

First solution: c(i) * x(i)
(Not shown: coefficient RAM + ACU resources)

    LABEL   ALU             MPY-ACC                       RAM          ACU
                            Acc = 0                                    init (i=0)
            init counter
    loop                                                               incr (=i+1)
                                                          read x(i)
                            acc(i)=acc(i-1)+x(i)*c(i)
            dec counter
            branch to loop if counter > 0
            nop

Time runs downwards, in clock cycles (cc).
6 clock cycles/sample
limit: pipelines in the controller

Loopfolding (software pipelining)

Original loop:
    for i = 0 to n
        b(i) = f(a(i))
        c(i) = g(b(i))
        d(i) = h(c(i))

[Figure: successive iterations overlap in time: f on a2 runs together with g on b1 and h on c0]

Folded loop:
    for i = 2 to n
        b(i)   = f(a(i))
        c(i-1) = g(b(i-1))
        d(i-2) = h(c(i-2))

Loopfolding (software pipelining): c(i) * x(i)

    LABEL   ALU             MPY-ACC                           RAM           ACU
                            acc(i-1)=0                                      init (i=1)
            init counter                                      read x(i)     inc (=i+1)
    loop                    acc(i)=acc(i-1)+x(i)*c(i)         read x(i+1)   incr (=i+2)
            dec counter
            branch to loop if counter > 0
            nop
                            acc(n-1)=acc(n-2)+x(n-1)*c(n-1)   read x(n)
                            acc(n)=acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clock cycles/sample

Hardware support for loop control: c(i) * x(i)

    LABEL          ALU             MPY-ACC                           RAM           ACU
                                   acc(i-1)=0                                      init (i=1)
                   init counter                                      read x(i)     inc (=i+1)
    repeat n-2                     acc(i)=acc(i-1)+x(i)*c(i)         read x(i+1)   incr (=i+2)
                                   acc(n-1)=acc(n-2)+x(n-1)*c(n-1)   read x(n)
                                   acc(n)=acc(n-1)+x(n)*c(n)

1 clock cycle/sample
repeat instruction and repeat block

TMS320C5000
[Figure: TMS320C5000 datapath: T register; sign control on the operand inputs; multiplexers; 17*17 multiplier with fractional mode; 40-bit ALU; 40-bit adder; 40-bit accumulators A and B; barrel shifter; ZERO, SAT, ROUND; COMP; MSW/LSW select; TRN; TC]

Motorola 56K family
[Figure: Motorola 56K: 16-bit address bus with external address switch; address ALU generating P, X and Y addresses; 2,048-by-24-bit program memory ROM; X memory and Y memory, each 256-by-24-bit RAM and 256-by-24-bit ROM; 24-bit internal and external data-bus switches (X data, Y data, P data, global data); data ALU with 24-by-24-bit multiplier-accumulator producing a 56-bit result; program controller with clock and interrupt inputs; I/O ports; on-chip peripherals: host, synchronous serial interface, serial communications interface, programmed I/O, bus control]

R.E.A.L.
[Figure: R.E.A.L. DSP: data bus; two 16-by-16-bit multipliers (products P0, P1); two 40-bit arithmetic-logic units; four 40-bit accumulators; two address computation units; X data memory and Y data memory; program memory (Z data) with 96-bit instructions; program control unit; scale, shift and saturation units]
Buses: 16-bit buses for X data, Y data and Z data

[Figure: compiler flow from source to code]
• Front end: lexical analysis, syntax analysis, semantic analysis
• Intermediate machine independent representation
• Code generation: code selection, register allocation, scheduling (1 instr = // ops, order of instr)

Intermediate machine independent representation
[Figure: basic blocks BBi, BBj, BBk; one block is shown as a data-flow graph over a, b, c, d]

    t1  := a * b
    t2  := c + d
    t3  := t1 + c
    out := t2 * t3

Code selection example
[Figure: ADSP [Analog Devices] datapath: d memory and p memory; ALU (+, -) with input registers ax, ay and result registers ar, af; MAC (*, +) with input registers mx, my and result registers mr, mf]

Example of code selection = covering of intermediate representation with RTPs

    mr := dmem               (c)
    mx := dmem               (a)
    my := pmem               (b)
    mr := mr + (mx * my)     covers t1 and t3
    ax := dmem               (c)
    ay := pmem               (d)
    ar := ax + ay            covers t2
    my := ar
    mr := mr * my            covers out

Problems
• local decisions which have a global impact
• phase coupling: example
  – asap schedule
  – maximal freedom for scheduling
  – code selection during scheduling
  – register allocation comes afterwards
  – can lead to infeasible solutions

Phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture: develop an architecture which is still efficient but also a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word