DSP

Embedded Processor Architecture
5kk73
DSP

[Figure: processor spectrum running from flexibility to efficiency: programmable CPU, programmable DSP, application specific instruction set processor (ASIP), application specific processor]
Embedded Processor Architecture
Henk Corporaal / Bart Mesman
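Application examples (1)

[Figure: FIR filter. Delay line x4 ... x0 built from Z^-1 elements; each tap is multiplied by its coefficient c4 ... c0 and the products are summed into the output y]

#define NTAPS 4

int fir(int in)
{
    int i;
    static int state[NTAPS + 1];   /* delay line x0 .. x4 */
    static int coeff[NTAPS + 1];   /* coefficients c0 .. c4 */
    int out[NTAPS + 1];            /* running partial sums */

    state[NTAPS] = in;
    out[0] = state[0] * coeff[0];
    for (i = 1; i < NTAPS + 1; i++) {
        out[i] = out[i-1] + state[i] * coeff[i];
        state[i-1] = state[i];     /* shift the delay line */
    }
    return out[NTAPS];
}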
Application examples (1)

Instruction                   Register transfer     Meaning
.L1000006:
    sll    $3, $2, 2          R3=R2<<2              R3=i-1
    addu   $14, $15, $3       R14=R15+R3
    lw     $24, 0($14)        R24=load(*R14)        R24=coeff[i-1]
    addiu  $12, $6, -4        R12=R6-4
    addu   $11, $12, $3       R11=R12+R3
    lw     $13, 0($11)        R13=load(*R11)        R13=state[i-1]
    nop
    mult   $24, $13           R24=R24*R13
    addu   $25, $sp, $3       R25=sp+R3
    lw     $9, -4($25)        R9=load(R25-4)        R9=out[i-1]
    addiu  $2, $2, 1          R2=R2+1               i=i+1
    mflo   $13                                      R13=move from low mpy reg
    addu   $10, $9, $13       R10=R9+R13            R10=out[i]
    sw     $10, 0($25)        mem(*R25)=R10
    addu   $25, $7, $3        R25=R7+R3
    sw     $24, 0($25)        mem(*R25)=R24
    slti   $24, $2, 10
    bne    $24, $0, .L1000006
    addiu  $15, $7, -4

19 instructions per tap!!
Application examples (2)
Bit level operations: finite field arithmetic

temp1 = input << 1
temp2 = if (bit(input,7) == 1) then 29 else 0
out = temp1 exor temp2

         r1 = LB input           Load byte
         r2 = SLL r1             Shift left logical
         r3 = ANDI r1, mask      AND immediate
         r4 = ADDI r3, -1        ADD immediate
         BNE (r4 != r0)          Branch on != to nonzero
         nop
         r5 = XORI(r2, 29)       Exclusive or immediate
         J common                Jump
         nop
nonzero: r5 = XOR(r2, r0)        Exclusive OR
common:  …

10 instructions!!

[Figure: the same operation in hardware: in[0] ... in[7] are wired to out[0] ... out[7] through three exor gates]
Very simple in hardware
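
For comparison, the three pseudocode lines above can be written as a single C function. This is only a sketch: the function name gf_mul_x is hypothetical, and it assumes an 8-bit field element and the reduction constant 29 given on the slide.

```c
#include <stdint.h>

/* Hypothetical helper: shift left by one, then xor with 29 when
 * bit 7 of the input was set (the bit that is shifted out). */
static uint8_t gf_mul_x(uint8_t input)
{
    uint8_t temp1 = (uint8_t)(input << 1);
    uint8_t temp2 = (input & 0x80u) ? 29u : 0u;
    return (uint8_t)(temp1 ^ temp2);
}
```

A compiler still turns this into a shift, a test and an xor, which is why the hardware version with three exor gates is so much cheaper.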
Application examples (2)
Bit level operations: DES example

[Figure: bits 27, 26, 25, 23, 22 and 20 of the source register ($2) are gathered into bits 7, 6, 5, 4, 3 and 2 of the destination register ($24)]

    srl   $13, $2, 20
    andi  $25, $13, 1
    srl   $14, $2, 21
    andi  $24, $14, 6
    or    $15, $25, $24
    srl   $13, $2, 22
    andi  $14, $13, 56
    or    $25, $15, $14
    sll   $24, $25, 2
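
The nine instructions above correspond to the following C sketch. The function name des_gather is hypothetical; the shift amounts and masks are taken directly from the listing.

```c
#include <stdint.h>

/* Hypothetical helper mirroring the srl/andi/or/sll sequence:
 * gathers source bits 27,26,25,23,22,20 into result bits 7..2. */
static uint32_t des_gather(uint32_t src)
{
    uint32_t lo  = (src >> 20) & 1u;    /* bit 20       -> bit 0      */
    uint32_t mid = (src >> 21) & 6u;    /* bits 23..22  -> bits 2..1  */
    uint32_t hi  = (src >> 22) & 56u;   /* bits 27..25  -> bits 5..3  */
    return (lo | mid | hi) << 2;        /* move to bits 7..2          */
}
```

In hardware this permutation is just wiring, while the CPU spends nine instructions on it.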
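Application examples (2)
Bit level operations: A5 example (GSM encryption)

[Figure: bits 18, 17, 16 and 13 of register $5 are combined with xor; the result ends up in bit 0 of $13 (… 0 ... in all other bits), selected with the mask 1]

    srl   $24, $5, 18
    srl   $25, $5, 17
    xor   $8, $24, $25
    srl   $9, $5, 16
    xor   $10, $8, $9
    srl   $11, $5, 13
    xor   $12, $10, $11
    andi  $13, $12, 1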
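
As a sketch, the same feedback bit in C. The function name a5_feedback is hypothetical; the tap positions 18, 17, 16 and 13 are the ones used in the listing above.

```c
#include <stdint.h>

/* Hypothetical helper: xor of bits 18, 17, 16 and 13 of reg,
 * returned in bit 0 (all other result bits are 0). */
static uint32_t a5_feedback(uint32_t reg)
{
    uint32_t t = (reg >> 18) ^ (reg >> 17) ^ (reg >> 16) ^ (reg >> 13);
    return t & 1u;
}
```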
Application examples: conclusions
• CPUs offer flexibility, but…
• not efficient in performance
• not efficient in code size
• not efficient in power consumption
Power Consumption in microprocessors
Power consumption is (becoming) the limiting factor in processor design
Solution in direction of
• Hardware acceleration
• Instruction Level Parallelism instead of clock speed
• Code size efficiency
source: ISSCC2001, Patrick Gelsinger, Intel
Amdahl’s law
• Impact of an improvement on the execution time of a program depends on 2 parameters:
  – f = fraction of the original computation time that is affected by the improvement
  – s = speedup factor (local)
• exec_time_new = exec_time_old * (1 - f) + exec_time_old * f / s
• speedup_overall = exec_time_old / exec_time_new = 1 / (1 - f + f / s)
• if s >> 1 then speedup_overall ≈ 1 / (1 - f)
• Example: 40 % of program can be executed 10 x faster
  speedup_overall = 1 / (0.6 + 0.4 / 10) = 1.56
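
The formula is easy to check numerically. A minimal sketch, with a hypothetical function name amdahl_speedup and the slide's symbols f and s as parameters:

```c
/* Overall speedup when a fraction f of the original run time
 * is accelerated by a local speedup factor s. */
static double amdahl_speedup(double f, double s)
{
    return 1.0 / ((1.0 - f) + f / s);
}
```

Calling amdahl_speedup(0.4, 10.0) reproduces the example above (1 / 0.64), and letting s grow very large approaches the limit 1 / (1 - f).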
Conclusions
• Programmable CPU cores are important for the control parts of the
application.
• They are well supported by tools for developing end-user software
(vs. deeply embedded software).
• Keep it Simple heuristic (RISC vs. CISC)
• Make frequent cases fast and rare cases correct.
• Regular (orthogonal) instruction set
• No special features that match a high level language construct.
• At least 16 registers to ease register allocation.
• Embedded cores are often light cores which are a compromise
between performance, area and power dissipation.
(vs. stand-alone CPU cores which are optimised for performance)
Programmable Digital Signal Processors
• real-time worst-case processing = need for more compute power
$$\frac{\text{sec}}{\text{prog}} = \frac{\text{instr}}{\text{prog}} \times \frac{\text{cycles}}{\text{instr}} \times \frac{\text{sec}}{\text{cycle}}, \qquad \text{CPI} = 1$$
• instruction level parallelism (ILP)
• hardware support for loop control
• attention for high level data types e.g. arrays, delay lines
(vs. scalars for CPUs)
• difficult to compare architectures
• e.g. DIT, DIF, radix 2/4, FFT loop unrolling, scaling,
shuffling, initialisation … can be included or forgotten
• benchmarking (Berkeley Design Technology Inc (BDTI))
(compare to SpecInt benchmarks for CPUs)
Outline
• architectures for programmable DSPs
• multiplier-accumulator
• modified Harvard architecture
• extension with an ALU (decision making)
• controller architectures
• examples: TI, Motorola, Philips
• code generation
• developments: VLIW (Very Long Instruction Word)
examples: C6 and TM
DSP data types
• not every signal requires 32 bits
• 2 types of DSP: floating point and integer
• advantages FP: most specs are in FP
(conversion to int is time consuming since the behavior
may change)
• disadvantage FP: cost (area, speed, power)
• integer multiplication doubles the number of bits: n * n => 2n
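[Figure: multiplier-accumulator. Inputs c(i) and x(i) enter the multiplier MPY (Booth, Wallace, ...); the product is clocked into the product register PR; the ADDER adds PR to the accumulator register ACR, which is clocked and fed back to the adder; the result passes through SHIFT, ROUND, TRUNCATE; a control input steers the unit]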
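
The multiplier-accumulator datapath can be imitated in C to show why the product and the accumulator are wider than the operands. This is a sketch under assumptions that are not on the slide: 16-bit fractional (Q15) operands, a 64-bit integer standing in for the wide accumulator ACR, and a hypothetical function name mac_q15.

```c
#include <stdint.h>

/* Sum of c(i) * x(i) in Q15. Each 16 x 16 product needs 32 bits
 * (n * n => 2n); the accumulator is wider still to absorb the sum. */
static int16_t mac_q15(const int16_t *c, const int16_t *x, int n)
{
    int64_t acr = 0;                        /* accumulator register */
    for (int i = 0; i < n; i++) {
        int32_t pr = (int32_t)c[i] * x[i];  /* product register */
        acr += pr;
    }
    acr += 1 << 14;                         /* ROUND: add half an LSB */
    acr >>= 15;                             /* SHIFT back to Q15; assumes an
                                               arithmetic shift for negatives */
    if (acr > INT16_MAX) acr = INT16_MAX;   /* saturate, then TRUNCATE */
    if (acr < INT16_MIN) acr = INT16_MIN;
    return (int16_t)acr;
}
```

With c = x = {0.5, 0.5} in Q15 (16384) the result is 0.5 (16384); with all values at 32767 the sum overflows 16 bits and is clipped to 32767.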
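[Figure: three memory architectures, each feeding an execution unit (EXU)]
• Von Neumann (sequential): one combined prog/data memory
• Harvard: separate prog mem. and data mem.
• Modified Harvard: prog mem., data mem. 1 and data mem. 2
Σ c(i) * x(i)
Goal = 1 cycle per iteration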
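[Figure: DSP architecture. Controller: PC (inputs Reset, +1, Interrupt address), Stack, Program Memory, IR, Control Bus. Datapath: two address computation units ACU_A and ACU_B with address registers AR_A and AR_B, two data memories RAM_A and RAM_B with data registers DR_A and DR_B, MAC and register file (Rfile)]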
[Figure: FIR filter with delay line x5 ... x1 built from Z^-1 elements, coefficients c5 ... c1, multipliers and an adder producing y; the time loop runs over the samples, the filter loop over the taps i]
Σ ci * xi
1 cycle/tap ?
How to update the delay line?
Solution 2: indirect addressing

Memory location     1    2    3    4    5    6    7    8
output sample 1     x1   x2   x3   x4   x5
output sample 2          x2   x3   x4   x5   x6
output sample 3               x3   x4   x5   x6   x7
output sample 4                    x4   x5   x6   x7   x8
output sample 5     x9                  x5   x6   x7   x8

• use of a pointer to mark the beginning of the delay line
• problem: trashing of the whole memory
• solution: modulo addressing
• need for a register to store the pointer
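
A minimal C sketch of the pointer-plus-modulo idea: the samples stay where they are, and only the pointer to the start of the delay line moves. The names fir_circular, delay and head are hypothetical, and the 5-entry line is an assumption matching the x1 ... x5 picture.

```c
#define NTAPS 5              /* assumed delay line length */

static int delay[NTAPS];     /* fixed block of memory for the delay line */
static int head;             /* pointer (index) to the oldest sample */

/* Store the new sample over the oldest one, advance the pointer
 * modulo NTAPS, then sum coeff[i] * sample, oldest sample first. */
static int fir_circular(int in, const int coeff[NTAPS])
{
    int acc = 0;
    delay[head] = in;                   /* newest replaces oldest */
    head = (head + 1) % NTAPS;          /* head now marks the oldest */
    for (int i = 0; i < NTAPS; i++)
        acc += coeff[i] * delay[(head + i) % NTAPS];
    return acc;
}
```

No sample is ever copied, so the delay-line update costs one store and one pointer update per output sample instead of NTAPS moves.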
ACU architecture and Instruction set

[Figure: ACU with registers A and S; the selected output passes through the Modulo unit to RAM]

Instruction   Output   reg A   reg S
Read_A        A        A       S
Read_S        S        A       S
incA          A+1      A+1     S
decA          A-1      A-1     S
Step          A+S      A+S     S
Inc_step      S+1      A       S+1
(an unchanged register value = hold)

Modulo can be implemented as a mask operation if the size is 2^k
Example: addresses 16 (10 000) to 23 (10 111); the mask covers the three low-order bits
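
A sketch of the incA instruction with the mask-based modulo, assuming the buffer size is 2^k and the buffer starts at a multiple of 2^k (as in the 16 ... 23 example). The function name acu_inc_modulo is hypothetical.

```c
/* Increment address a inside a buffer of 2^k words aligned on 2^k.
 * mask = 2^k - 1: the high bits are held, the low k bits wrap. */
static unsigned acu_inc_modulo(unsigned a, unsigned mask)
{
    return (a & ~mask) | ((a + 1u) & mask);
}
```

With mask 7 the address sequence is 16, 17, ..., 23, 16, ...; no comparison or division is needed, only an adder and a few gates.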
Addressing modes

Mode          Example            Meaning
register      ADD R4, R3         R[R4] = R[R4] + R[R3]
immediate     ADD R4, #3         R[R4] = R[R4] + 3
direct        ADD R4, (100)      R[R4] = R[R4] + Mem[100]
indirect      ADD R4, (R3)       R[R4] = R[R4] + Mem[R[R3]]
w. inc/dec    ADD R4, (R3)±      R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± 1
indexed       ADD R4, (R3±R2)    R[R4] = R[R4] + Mem[R[R3]]; R[R3] = R[R3] ± R[R2]

Remarks
• direct = for static data
• indirect = for arrays
• inc/dec = for stepping through arrays e.g. Σ x(n)
• index = for stepping through arrays e.g. Σ x(2n)
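
The last two remarks map onto familiar C loops. A sketch with hypothetical function names sum_all and sum_strided; the comments name the addressing mode each access would use.

```c
/* inc/dec mode: sum of x(n), address post-incremented by 1 */
static int sum_all(const int *x, int n)
{
    int acc = 0;
    while (n-- > 0)
        acc += *x++;            /* like ADD R4, (R3)+ */
    return acc;
}

/* indexed mode: sum of x(2n) when step == 2, address advanced by
 * the contents of a second (step) register after each access */
static int sum_strided(const int *x, int n, int step)
{
    int acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i * step];     /* like ADD R4, (R3+R2) */
    return acc;
}
```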
Addressing modes: extra for DSP
• 8 ARs (address or auxiliary registers) available
• extra indirect modes
  • circular      *ARn ± %        post inc/dec by 1 - circular
                  *ARn ± AR0 %    post inc/dec by AR0 - circular
  • bit reverse   *ARn ± AR0 B    post inc/dec by AR0 - bit rev.
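
Bit-reversed post-increment can be pictured as an addition whose carry runs from the most significant bit towards the least significant one. The sketch below assumes a buffer of n = 2^k entries and an increment of n/2; the function name rev_carry_inc is hypothetical.

```c
/* Next index in bit-reversed order for a buffer of n = 2^k entries:
 * add n/2 with the carry propagating towards the low-order bits. */
static unsigned rev_carry_inc(unsigned a, unsigned n)
{
    unsigned bit = n >> 1;
    while (bit != 0 && (a & bit) != 0) {
        a ^= bit;               /* clear the bit, carry moves right */
        bit >>= 1;
    }
    return a | bit;             /* set the first clear bit (if any) */
}
```

For n = 8, starting at 0, repeated calls visit 4, 2, 6, 1, 5, 3, 7 and then wrap to 0, which is the order in which a radix-2 FFT reads or writes its data.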
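[Figure: DSP architecture extended with an ALU. Controller: PC (inputs Reset, +1, Interrupt address), Stack, Program Memory, IR, Control Bus. Datapath: ACU_A and ACU_B with address registers AR_A and AR_B, data memories RAM_A and RAM_B with data registers DR_A and DR_B, MAC, ALU and register file (Rfile)]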
first solution
Σ c(i) * x(i)
Not shown: coefficient RAM+ACU

Resources (columns) against time in clock cycles (rows):

LABEL   ALU                             MPY-ACC                      RAM          ACU
                                        Acc = 0                                   init (i=0)
        init counter
loop:                                                                             incr (=i+1)
                                                                     read x(i)
                                        acc(i)=acc(i-1)+x(i)*c(i)
        dec counter
        branch to loop if counter > 0
        nop

6 clock cycles/sample
limit pipelines in the controller
Loopfolding (software pipelining)

Original loop:
for i = 0 to n
    bi = f(ai)
    ci = g(bi)
    di = h(ci)

[Figure: one iteration is the chain ai → f → bi → g → ci → h → di. Unrolled in time the iterations overlap: f works on a2 while g works on b1 and h works on c0, and so on. In the folded loop body h, g and f run side by side on ci-2, bi-1 and ai, producing di-2, ci-1 and bi]

Folded loop:
for i = 2 to n
    bi = f(ai)
    ci-1 = g(bi-1)
    di-2 = h(ci-2)
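
A C sketch of the same transformation, including the preamble and postamble that the folded loop needs. The stage functions f, g and h are placeholders chosen only for illustration, N is an assumed element count, and plain_loop / folded_loop are hypothetical names.

```c
#define N 8   /* assumed number of elements, must be at least 2 */

static int f(int a) { return a + 1; }   /* placeholder stages */
static int g(int b) { return b * 2; }
static int h(int c) { return c - 3; }

static void plain_loop(const int a[N], int d[N])
{
    for (int i = 0; i < N; i++)
        d[i] = h(g(f(a[i])));
}

static void folded_loop(const int a[N], int d[N])
{
    int b, c;                   /* values passed between stages */
    b = f(a[0]);                /* preamble: fill the pipeline */
    c = g(b);
    b = f(a[1]);
    for (int i = 2; i < N; i++) {
        d[i - 2] = h(c);        /* d(i-2) = h(c(i-2)) */
        c = g(b);               /* c(i-1) = g(b(i-1)) */
        b = f(a[i]);            /* b(i)   = f(a(i))   */
    }
    d[N - 2] = h(c);            /* postamble: drain the pipeline */
    c = g(b);
    d[N - 1] = h(c);
}
```

Inside the folded body the three statements only use values produced in the previous iteration, so on a machine with three suitable units they can be issued in the same cycle.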
Loopfolding (software pipelining)
Σ c(i) * x(i)

LABEL   ALU                             MPY-ACC                            RAM           ACU
        init counter                    acc(i-1)=0                                       init (i=1)
                                                                           read x(i)     inc (=i+1)
loop:                                   acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)   incr (=i+2)
        dec counter
        branch to loop if counter > 0
        nop
                                        acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                                        acc(n)=acc(n-1)+x(n)*c(n)

Pre- and postamble
4 clock cycles/sample
hardware support for loop control
Σ c(i) * x(i)

LABEL   ALU            MPY-ACC                            RAM           ACU
        init counter   acc(i-1)=0                                       init (i=1)
                                                          read x(i)     inc (=i+1)
        repeat n-2     acc(i)=acc(i-1)+x(i)*c(i)          read x(i+1)   incr (=i+2)
                       acc(n-1)=acc(n-2)+x(n-1)*c(n-1)    read x(n)
                       acc(n)=acc(n-1)+x(n)*c(n)

1 clock cycle/sample
repeat instruction and repeat block
TMS320C5000
[Figure: datapath. T register; multiplier (17*17) with sign control on its inputs and a fractional mode; adder (40); ALU (40); accumulators A(40) and B(40); barrel shifter; input MUXes selecting among the buses and accumulators; COMP unit with MSW/LSW select, TRN and TC; ZERO, SAT and ROUND on the result path]
Motorola 56K family
[Figure: block diagram. Address ALU driving the X, Y and P address buses, connected through the external address switch to a 16-bit address bus; program memory (2,048-by-24-bit ROM); X memory and Y memory (each 256-by-24-bit RAM plus 256-by-24-bit ROM); 24-bit X data, Y data, P data and global data buses with internal and external data-bus switches; data ALU with a 24-by-24 bit multiplier-accumulator producing a 56 bit result; program controller with clock and interrupt inputs; on-chip peripherals (host, synchronous serial interface, serial communications interface, programmed I/O, bus control); I/O ports]
R.E.A.L.
[Figure: block diagram. Program control unit; program memory (Z data) with 96-bit instructions; X data memory and Y data memory; two address computation units; 16-bit buses for X data, Y data and Z data; input registers X, Y0 and Y1; two 16-by-16 bit multipliers with product registers P0 and P1; scale, shift and saturation units; two 40 bit arithmetic-logic units; four 40 bit accumulators]
[Figure: compiler flow]
source
→ Front end: lexical analysis, syntax analysis, semantic analysis
→ Intermediate machine independent representation
→ Code generation: code selection, register allocation, scheduling
  (annotations: 1 instr = // ops, order of instr)
→ code
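Intermediate machine independent representation
[Figure: control flow graph of basic blocks BBi, BBj, BBk; one basic block is expanded into the data flow graph of the code below, with inputs a, b, c, d, intermediate values t1, t2, t3 and operations *, +, +, *]
t1 := a * b
t2 := c + d
t3 := t1 + c
out := t2 * t3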
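Code selection example
ADSP [Analog Devices]
[Figure: datapath fed by d memory and p memory. ALU (+-): x input register ax, y input register ay, feedback register af, result register ar. MAC (*, +): x input register mx, y input register my, feedback register mf, result register mr]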
Example of code selection
= covering of intermediate representation with RTPs

[Figure: the data flow graph with inputs a, b, c, d and values t1, t2, t3, covered by three numbered patterns (1:, 2:, 3:)]

mx := dmem      my := pmem
mr := mr + (mx * my)
ax := dmem      ay := pmem
mr := dmem
ar := ax + ay
my := ar
mr := mr * my
Problems
• local decisions which have a global impact
• phase coupling: example
• asap schedule
• maximal freedom for scheduling
• code selection during scheduling
• register allocation comes afterwards
• can lead to infeasible solutions
phase coupling: discussion
It is very difficult, almost impossible, to develop robust and efficient DSP compilers.
Current DSP practice = programming in assembler
Solution:
1. Solve code generation for DSPs
2. Step back and rethink the architecture:
   develop an architecture which is still efficient but also a good model for building a compiler
Efficiency = exploit instruction level parallelism (ILP)
Compilation = systematic positioning of registers and regular interconnect
= VLIW = Very Long Instruction Word