Introduction to Many-Core Architectures
Henk Corporaal
www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems
Soesterberg, March 2010
Intel Trends (K. Olukotun)
[Figure: Intel processor scaling trends; data point shown: Core i7, 3 GHz, 100 W]
ASCI Winterschool 2010
Henk Corporaal
(2)
System-level integration
(Chuck Moore, AMD, at MICRO 2008)
 Single-chip CPU Era: 1986 – 2004
   Extreme focus on single-threaded performance
   Multi-issue, out-of-order execution plus a moderate cache hierarchy
 Chip Multiprocessor (CMP) Era: 2004 – 2010
   Early: hasty integration of multiple cores into the same chip/package
   Mid-life: address some of the HW scalability and interference issues
   Current: homogeneous CPUs plus moderate system-level functionality
 System-level Integration Era: ~2010 onward
   Integration of substantial system-level functionality
   Heterogeneous processors and accelerators
   Introspective control systems for managing on-chip resources & events
ASCI Winterschool 2010
Henk Corporaal
(3)
Why many core?
 Running into
 Frequency wall
 ILP wall
 Memory wall
 Energy wall
 Chip area enabler: Moore's law goes well below 22 nm
 What to do with all this area?
 Multiple processors fit easily on a single die
 Application demands
 Cost effective (just connect existing processors or processor cores)
 Low power: parallelism may allow lowering Vdd
 Performance/Watt is the new metric!
ASCI Winterschool 2010
Henk Corporaal
(4)
Low power through parallelism
 Sequential processor (one CPU)
   Switching capacitance C
   Frequency f
   Voltage V
   P1 = f C V^2
 Parallel processor (two times the number of units: CPU1, CPU2)
   Switching capacitance 2C
   Frequency f/2
   Voltage V' < V
   P2 = (f/2) (2C) V'^2 = f C V'^2 < P1
ASCI Winterschool 2010
Henk Corporaal
(5)
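A quick worked example (the voltage value is hypothetical, chosen only to illustrate the formula above): if doubling the number of units lets us halve the frequency and lower the supply to V' = 0.8 V, then

  P2 = (f/2) · 2C · (0.8 V)^2 = 0.64 · f C V^2 = 0.64 · P1

i.e. the same throughput at roughly 64% of the power.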
How low Vdd can we go?
 Subthreshold JPEG encoder (figure shows four parallel engines)
 Vdd range: 0.4 – 1.2 Volt
[Figure: energy per operation (pJ/operation, 0–8 pJ) versus supply voltage (1.2 V down to 0.4 V), with speedups of 3.4X, 4.4X, 5.6X and 8.3X annotated]
ASCI Winterschool 2010
Henk Corporaal
(6)
Computational efficiency: how many MOPS/Watt?
Yifan He et al., DAC 2010
ASCI Winterschool 2010
Henk Corporaal
(7)
Computational efficiency: what do we need?
[Figure: performance (Gops, 1–10000) versus power (Watts, 0.1–100), log-log; workload targets 3G Wireless, 4G Wireless and Mobile HD Video plotted against architectures such as IBM Cell, SODA (90 nm and 65 nm), Imagine, VIRAM, Pentium M and TI C6X]
Woh et al., ISCA 2009
ASCI Winterschool 2010
Henk Corporaal
(8)
Intel's opinion: 48-core x86
ASCI Winterschool 2010
Henk Corporaal
(9)
Outline
 Classifications of Parallel Architectures
 Examples
 Various (research) architectures
 GPUs
 Cell
 Intel multi-cores
 How much performance do you really get?
Roofline model
 Trends & Conclusions
ASCI Winterschool 2010
Henk Corporaal
(10)
Classifications
 Performance / parallelism driven:
 4-5 D
 Flynn
 Communication & Memory
 Message passing / Shared memory
 Shared memory issues: coherency, consistency,
synchronization
 Interconnect
ASCI Winterschool 2010
Henk Corporaal
(11)
Flynn's Taxonomy
 SISD (Single Instruction, Single Data)
   Uniprocessors
 SIMD (Single Instruction, Multiple Data)
   Vector architectures also belong to this class
   Multimedia extensions (MMX, SSE, VIS, AltiVec, …)
   Examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, …
 MISD (Multiple Instruction, Single Data)
 Systolic arrays / stream based processing
 MIMD (Multiple Instruction, Multiple Data)
 Examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin
 Flexible
 Most widely used
ASCI Winterschool 2010
Henk Corporaal
(12)
Flynn's Taxonomy
ASCI Winterschool 2010
Henk Corporaal
(13)
Enhance performance:
4 architecture methods
 (Super)-pipelining
 Powerful instructions
   MD-technique: multiple data operands per operation
   MO-technique: multiple operations per instruction
 Multiple instruction issue
   Single stream: superscalar
   Multiple streams
     Single core, multiple threads: Simultaneous MultiThreading
     Multiple cores
ASCI Winterschool 2010
Henk Corporaal
(14)
Architecture methods
Pipelined Execution of Instructions

Simple 5-stage pipeline:

  Instruction \ Cycle   1    2    3    4    5    6    7    8
  1                     IF   DC   RF   EX   WB
  2                          IF   DC   RF   EX   WB
  3                               IF   DC   RF   EX   WB
  4                                    IF   DC   RF   EX   WB

  IF: Instruction Fetch    DC: Instruction Decode    RF: Register Fetch
  EX: Execute instruction  WB: Write Result Register

 Purpose of pipelining:
   Reduce #gate_levels in the critical path
   Reduce CPI close to one (instead of a large number as in a multicycle machine)
   More efficient hardware
 Problems
   Hazards cause pipeline stalls:
     Structural hazards: add more hardware
     Control hazards, branch penalties: use branch prediction
     Data hazards: bypassing required
ASCI Winterschool 2010
Henk Corporaal
(15)
Architecture methods
Pipelined Execution of Instructions
 Superpipelining:
   Split one or more of the critical pipeline stages
 Superpipelining degree S:

   S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

   where:
   f(Op) is the frequency of operation Op
   lt(Op) is the latency of operation Op
ASCI Winterschool 2010
Henk Corporaal
(16)
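A small worked example (the operation mix is hypothetical, only to illustrate the definition above): if 80% of the executed operations have latency 1 and 20% have latency 3, then

  S = 0.8 · 1 + 0.2 · 3 = 1.4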
Architecture methods
Powerful Instructions (1)
 MD-technique
   Multiple data operands per operation
   SIMD: Single Instruction Multiple Data

Vector instruction example:

  C:                          Assembly:
  for (i = 0; i < 64; i++)      set   vl,64
    c[i] = a[i] + 5*b[i];       ldv   v1,0(r2)
                                mulvi v2,v1,5
  or (vector notation):         ldv   v1,0(r1)
  c = a + 5*b                   addv  v3,v1,v2
                                stv   v3,0(r3)
ASCI Winterschool 2010
Henk Corporaal
(17)
Architecture methods
Powerful Instructions (1)
 SIMD computing
   All PEs (Processing Elements) execute the same operation
   Typical mesh or hypercube connectivity
   Exploit data locality of e.g. image processing applications
   Dense encoding (few instruction bits needed)
[Diagram: SIMD execution method — instructions 1..n issued over time to PE1, PE2, …, PEn in lockstep]
ASCI Winterschool 2010
Henk Corporaal
(18)
Architecture methods
Powerful Instructions (1)
 Sub-word parallelism
   SIMD on a restricted scale
   Used for multimedia instructions
 Examples
   MMX, SSE, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II
 Example operation: Σ_{i=1..4} |a_i − b_i| (sum of absolute differences)
ASCI Winterschool 2010
Henk Corporaal
(19)
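As a concrete illustration of sub-word parallelism, here is a minimal C sketch using the SSE2 intrinsic _mm_sad_epu8, which computes a sum of absolute differences over 16 bytes in one instruction. The helper name sad16 and the 16-byte width are my own choices for the sketch; the slide's example uses only 4 elements.

  #include <emmintrin.h>  /* SSE2 intrinsics */
  #include <stdint.h>

  /* Sum of absolute differences over 16 bytes.
     _mm_sad_epu8 produces two partial sums, one per 64-bit half,
     which are added together here. */
  static uint32_t sad16(const uint8_t *a, const uint8_t *b)
  {
      __m128i va = _mm_loadu_si128((const __m128i *)a);
      __m128i vb = _mm_loadu_si128((const __m128i *)b);
      __m128i s  = _mm_sad_epu8(va, vb);
      return (uint32_t)(_mm_cvtsi128_si32(s) +
                        _mm_cvtsi128_si32(_mm_srli_si128(s, 8)));
  }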
Architecture methods
Powerful Instructions (2)
 MO-technique: multiple operations per instruction
 Two options:
   CISC (Complex Instruction Set Computer)
   VLIW (Very Long Instruction Word)

VLIW instruction example (one field per FU):

  FU 1           FU 2            FU 3            FU 4          FU 5
  sub r8, r5, 3  and r1, r5, 12  mul r6, r5, r2  ld r3, 0(r5)  bnez r5, 13
ASCI Winterschool 2010
Henk Corporaal
(20)
VLIW architecture: central Register File
[Diagram: one shared register file feeding nine exec units, grouped into issue slots 1–3 (exec units 1–3, 4–6 and 7–9)]
Q: How many ports does the register file need for n-issue?
ASCI Winterschool 2010
Henk Corporaal
(21)
Architecture methods
Multiple instruction issue (per cycle)
 Who guarantees semantic correctness?
   i.e. can instructions be executed in parallel?
 User: specifies multiple instruction streams
   Multi-processor: MIMD (Multiple Instruction, Multiple Data)
 HW: run-time detection of ready instructions
   Superscalar
 Compiler: compile into a dataflow representation
   Dataflow processors
ASCI Winterschool 2010
Henk Corporaal
(22)
Four dimensional representation of the
architecture design space <I, O, D, S>
[Figure: four axes — Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D', and Superpipelining degree 'S' — with example architectures plotted: SIMD and Vector high on D, CISC near the origin, Superscalar, MIMD and Dataflow high on I, VLIW high on O, Superpipelined high on S, RISC near <1,1,1,1>]
ASCI Winterschool 2010
Henk Corporaal
(23)
Architecture design space
Example values of <I, O, D, S> for different architectures:

  Architecture    I     O     D     S     Mpar = I*O*D*S
  CISC            0.2   1.2   1.1   1     0.26
  RISC            1     1     1     1.2   1.2
  VLIW            1     10    1     1.2   12
  Superscalar     3     1     1     1.2   3.6
  SIMD            1     1     128   1.2   154
  MIMD            32    1     1     1.2   38
  GPU             32    2     8     24    12288
  Top500 Jaguar                           ???

  S(architecture) = Σ_{Op ∈ I_set} f(Op) · lt(Op)

You should exploit this amount of parallelism (Mpar = I*O*D*S)!
ASCI Winterschool 2010
Henk Corporaal
(24)
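As a sanity check of the Mpar column, taking the GPU row of the table above:

  Mpar(GPU) = I · O · D · S = 32 · 2 · 8 · 24 = 12288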
Communication
 Parallel architecture extends traditional computer architecture with a communication network
   abstractions (HW/SW interface)
   organizational structure to realize the abstraction efficiently
[Diagram: several processing nodes connected by a communication network]
ASCI Winterschool 2010
Henk Corporaal
(25)
Communication models: Shared Memory
[Diagram: processes P1 and P2 both read and write a shared memory]
 Coherence problem
 Memory consistency issue
 Synchronization problem
ASCI Winterschool 2010
Henk Corporaal
(26)
Communication models: Shared memory
 Shared address space
 Communication primitives:
   load, store, atomic swap
 Two varieties:
   Physically shared => Symmetric Multi-Processors (SMP)
     usually combined with local caching
   Physically distributed => Distributed Shared Memory (DSM)
ASCI Winterschool 2010
Henk Corporaal
(27)
SMP: Symmetric Multi-Processor
 Memory: centralized, with uniform access time (UMA), bus interconnect, I/O
 Examples: Sun Enterprise 6000, SGI Challenge, Intel
[Diagram: several processors, each with one or more cache levels, connected through a shared interconnect (1 bus, N busses, or any network) to main memory and the I/O system]
ASCI Winterschool 2010
Henk Corporaal
(28)
DSM: Distributed Shared Memory
 Non-uniform access time (NUMA) and scalable interconnect (distributed memory)
[Diagram: each processor has its own cache and local memory; all nodes are connected through an interconnection network, together forming the main memory and I/O system]
ASCI Winterschool 2010
Henk Corporaal
(29)
Shared Address Model Summary
 Each processor can name every physical location in
the machine
 Each process can name all data it shares with other
processes
 Data transfer via load and store
 Data size: byte, word, ... or cache blocks
 Memory hierarchy model applies:
 communication moves data to local proc. cache
ASCI Winterschool 2010
Henk Corporaal
(30)
Three fundamental issues for shared memory multiprocessors
 Coherence
   about: do I see the most recent data?
 Consistency
   about: when do I see a written value?
   e.g. do different processors see writes at the same time (w.r.t. other memory accesses)?
 Synchronization
   how to synchronize processes?
   how to protect access to shared data?
ASCI Winterschool 2010
Henk Corporaal
(31)
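A minimal synchronization sketch in C with pthreads (the names worker and counter are illustrative, not from the slides): a mutex protects a shared counter so that concurrent increments from several threads are not lost.

  #include <pthread.h>

  static long counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  void *worker(void *arg)
  {
      for (int i = 0; i < 1000000; i++) {
          pthread_mutex_lock(&lock);   /* enter critical section */
          counter++;                   /* access to shared data */
          pthread_mutex_unlock(&lock); /* leave critical section */
      }
      return NULL;
  }

Each thread started with pthread_create on worker then updates counter without losing increments; without the lock, the result would depend on the interleaving.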
Communication models: Message Passing
 Communication primitives
   e.g., send and receive library calls
   standard MPI: Message Passing Interface
   www.mpi-forum.org
 Note that MP can be built on top of SM, and vice versa!
[Diagram: processes P1 and P2 exchanging messages via send/receive over FIFOs]
ASCI Winterschool 2010
Henk Corporaal
(32)
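A minimal MPI send/receive sketch in C (the value and tag are arbitrary; this assumes a standard MPI installation): rank 0 sends one integer to rank 1.

  #include <mpi.h>
  #include <stdio.h>

  int main(int argc, char **argv)
  {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("rank 1 received %d\n", value);
      }
      MPI_Finalize();
      return 0;
  }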
Message Passing Model
 Explicit message send and receive operations
 Send specifies local buffer + receiving process on remote computer
 Receive specifies sending process on remote computer + local buffer to place data
 Typically blocking communication, but may use DMA
 Message structure: Header | Data | Trailer
ASCI Winterschool 2010
Henk Corporaal
(33)
Message passing communication
[Diagram: each node has a processor, cache, local memory, DMA engine and network interface; nodes communicate through the interconnection network]
ASCI Winterschool 2010
Henk Corporaal
(34)
Communication Models: Comparison
 Shared-Memory:
 Compatibility with well-understood language mechanisms
 Ease of programming for complex or dynamic
communications patterns
 Shared-memory applications; sharing of large data
structures
 Efficient for small items
 Supports hardware caching
 Message Passing:
 Simpler hardware
 Explicit communication
 Implicit synchronization (with any communication)
ASCI Winterschool 2010
Henk Corporaal
(35)
Interconnect
 How to connect your cores?
 Some options:
   Connect everybody:
     Single bus
     Hierarchical bus
     NoC
       • multi-hop via routers
       • any topology possible
       • easy 2D layout helps
   Connect with e.g. neighbors only
     e.g. using shift operation in SIMD
     or using dual-ported memories to connect 2 cores
ASCI Winterschool 2010
Henk Corporaal
(36)
Bus (shared) or Network (switched)
 Network:
   claimed to be more scalable
   no bus arbitration
   point-to-point connections
   but router overhead
Example: NoC with 2x4 mesh routing network
[Diagram: 8 nodes, each attached to a router R; the routers are connected in a 2x4 mesh]
ASCI Winterschool 2010
Henk Corporaal
(37)
Historical Perspective
 Early machines were:
 Collection of microprocessors.
 Communication was performed using bi-directional queues
between nearest neighbors.
 Messages were forwarded by processors on path
 “Store and forward” networking
 There was a strong emphasis on topology in algorithms, in
order to minimize the number of hops => minimize time
ASCI Winterschool 2010
Henk Corporaal
(38)
Design Characteristics of a Network
 Topology (how things are connected):
 Crossbar, ring, 2-D and 3-D meshes or torus, hypercube, tree,
butterfly, perfect shuffle, ....
 Routing algorithm (path used):
 Example in 2D torus: all east-west then all north-south (avoids
deadlock)
 Switching strategy:
 Circuit switching: full path reserved for entire message, like the
telephone.
 Packet switching: message broken into separately-routed packets,
like the post office.
 Flow control and buffering (what if there is congestion):
 Stall, store data temporarily in buffers
 re-route data to other nodes
 tell source node to temporarily halt, discard, etc.
 QoS guarantees, Error handling, …., etc, etc.
ASCI Winterschool 2010
Henk Corporaal
(39)
Switch / Network Topology
 Topology determines:
 Degree: number of links from a node
 Diameter: max number of links crossed between nodes
 Average distance: number of links to random destination
 Bisection: minimum number of links that separate the
network into two halves
 Bisection bandwidth = link bandwidth * bisection
ASCI Winterschool 2010
Henk Corporaal
(40)
Bisection Bandwidth
 Bisection bandwidth: bandwidth across the smallest cut that divides the network into two equal halves
 Bandwidth across the “narrowest” part of the network
[Figure: a linear array where the bisection cut crosses one link (bisection bw = link bw) and a 2D mesh where it crosses sqrt(n) links (bisection bw = sqrt(n) * link bw); a cut that does not split the network into halves is not a bisection cut]
 Bisection bandwidth is important for algorithms in which all processors need to communicate with all others
ASCI Winterschool 2010
Henk Corporaal
(41)
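A quick numeric example with hypothetical numbers: for a 4x4 mesh (n = 16 nodes) with 1 GB/s links,

  bisection bw = sqrt(n) · link bw = sqrt(16) · 1 GB/s = 4 GB/s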
Common Topologies

  Type        Degree    Diameter        Ave Dist       Bisection
  1D mesh     2         N-1             N/3            1
  2D mesh     4         2(N^1/2 - 1)    2N^1/2 / 3     N^1/2
  3D mesh     6         3(N^1/3 - 1)    3N^1/3 / 3     N^2/3
  nD mesh     2n        n(N^1/n - 1)    nN^1/n / 3     N^(n-1)/n
  Ring        2         N/2             N/4            2
  2D torus    4         N^1/2           N^1/2 / 2      2N^1/2
  Hypercube   Log2N     n = Log2N       n/2            N/2
  2D Tree     3         2Log2N          ~2Log2N        1
  Crossbar    N-1       1               1              N^2/2

N = number of nodes, n = dimension
ASCI Winterschool 2010
Henk Corporaal
(42)
Topologies in Real High End Machines
(listed from newer to older)

  Machine                                  Topology
  Red Storm (Opteron + Cray network)       3D Mesh
  Blue Gene/L                              3D Torus
  SGI Altix                                Fat tree
  Cray X1                                  4D Hypercube (approx)
  Myricom (Millennium)                     Arbitrary
  Quadrics (in HP Alpha server clusters)   Fat tree
  IBM SP                                   Fat tree (approx)
  SGI Origin                               Hypercube
  Intel Paragon                            2D Mesh
  BBN Butterfly                            Butterfly
ASCI Winterschool 2010
Henk Corporaal
(43)
Network: Performance metrics
 Network Bandwidth
 Need high bandwidth in communication
 How does it scale with number of nodes?
 Communication Latency
 Affects performance, since processor may have to wait
 Affects ease of programming, since it requires more thought to
overlap communication and computation
 How can a mechanism help hide latency?
 overlap message send with computation,
 prefetch data,
 switch to other task or thread
ASCI Winterschool 2010
Henk Corporaal
(44)
Examples of many core / PE architectures
 SIMD
 Xetal (320 PEs), Imap (128 PEs), AnySP (Michigan Univ)
 VLIW
 Itanium,TRIPS / EDGE, ADRES,
 Multi-threaded
 idea: hide long latencies
 Denelcor HEP (1982), SUN Niagara (2005)
 Multi-processor
   RaW, PicoChip, Intel/AMD, GRID, Farms, …
 Hybrid, like Imagine, GPUs, XC-Core
   actually, most are hybrid!
ASCI Winterschool 2010
Henk Corporaal
(45)
IMAP from NEC
NEC IMAP: SIMD
• 128 PEs
• Supports indirect addressing, e.g. LD r1, (r2)
• Each PE is a 5-issue VLIW
ASCI Winterschool 2010
Henk Corporaal
(46)
TRIPS (Austin Univ / IBM)
a statically mapped data flow architecture
R: register file
E: execution unit
D: Data cache
I: Instruction cache
G: global control
ASCI Winterschool 2010
Henk Corporaal
(47)
Compiling for TRIPS
1. Form hyperblocks (use unrolling, predication, inlining to enlarge scope)
2. Spatially map the operations of each hyperblock
   registers are accessed at hyperblock boundaries
3. Schedule hyperblocks
ASCI Winterschool 2010
Henk Corporaal
(48)
Multithreaded Categories
[Figure: issue slots over time (processor cycles) for Superscalar, Fine-Grained, Coarse-Grained, Multiprocessing and Simultaneous Multithreading; colors distinguish threads 1–5 and idle slots]
Intel calls SMT 'Hyperthreading'
ASCI Winterschool 2010
Henk Corporaal
(49)
SUN Niagara processing element
 4 threads per processor
 4 copies of PC logic, Instr. buffer, Store buffer, Register file
ASCI Winterschool 2010
Henk Corporaal
(50)
Really BIG: Jaguar - Cray XT5-HE
 Oak Ridge National Lab
 224,256 AMD Opteron cores
 2.33 PetaFlop peak performance
 299 TByte main memory
 10 PetaByte disk
 478 GB/s memory bandwidth
 6.9 MegaWatt
 3D torus
 TOP500 #1 (Nov 2009)
ASCI Winterschool 2010
Henk Corporaal
(51)
Graphic Processing Units (GPUs)
NVIDIA GT 340
(2010)
ATI 5970
(2009)
ASCI Winterschool 2010
Henk Corporaal
(52)
Why GPUs
ASCI Winterschool 2010
Henk Corporaal
(53)
In Need of TeraFlops?
3 * GTX295
• 1440 PEs
• 5.3 TeraFlop
ASCI Winterschool 2010
Henk Corporaal
(54)
How Do GPUs Spend Their Die Area?
GPUs are designed to match the workload of 3D graphics.
Die photo of GeForce GTX 280 (source: NVIDIA)
J. Roca, et al. "Workload Characterization of 3D Games", IISWC 2006, link
T. Mitra, et al. "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", Micro 1999, link
ASCI Winterschool 2010
Henk Corporaal
(55)
How Do CPUs Spend Their Die Area?
CPUs are designed for low latency instead of high throughput
Die photo of Intel Penryn (source: Intel)
ASCI Winterschool 2010
Henk Corporaal
(56)
GPU: Graphics Processing Unit
From polygon mesh to image pixel.
The Utah teapot: http://en.wikipedia.org/wiki/Utah_teapot
ASCI Winterschool 2010
Henk Corporaal
(57)
The Graphics Pipeline
K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
ASCI Winterschool 2010
Henk Corporaal
(58)
The Graphics Pipeline
K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
ASCI Winterschool 2010
Henk Corporaal
(59)
The Graphics Pipeline
K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
ASCI Winterschool 2010
Henk Corporaal
(60)
The Graphics Pipeline
K. Fatahalian, et al. "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498
ASCI Winterschool 2010
Henk Corporaal
(61)
GPUs: what's inside?
Basically a SIMD machine:
• A single instruction stream operates on multiple data streams
• All PEs execute the same instruction at the same time
• PEs operate concurrently on their own piece of memory
• However, a GPU is far more complex!
[Diagram: a control processor fetches instructions from instruction memory and broadcasts them (e.g. Add) to PE 1 … PE 320; the PEs access data memory via address/data lines and an interconnect, and return status to the control processor]
ASCI Winterschool 2010
Henk Corporaal
(62)
GPU Programming: NVIDIA CUDA example
• A CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).
• Hardware converts TLP into DLP at run time.

Single thread program:

  float A[4][8];
  do-all(i=0; i<4; i++){
    do-all(j=0; j<8; j++){
      A[i][j]++;
    }
  }

CUDA program:

  float A[4][8];
  kernelF<<<(4,1),(8,1)>>>(A);

  __global__
  kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
  }
ASCI Winterschool 2010
Henk Corporaal
(63)
System Architecture
Erik Lindholm, et al. "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008, link
ASCI Winterschool 2010
Henk Corporaal
(64)
NVIDIA Tesla Architecture (G80)
Erik Lindholm, et al. "NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008, link
ASCI Winterschool 2010
Henk Corporaal
(65)
Texture Processor Cluster (TPC)
ASCI Winterschool 2010
Henk Corporaal
(66)
Deeply pipelined SM for high throughput
 One instruction is executed by a warp of 32 threads
 One warp is executed on 8 PEs over 4 shader cycles
Let's start with a simple example: execution of 1 instruction
ASCI Winterschool 2010
Henk Corporaal
(67)
Issue an Instruction for 32 Threads
ASCI Winterschool 2010
Henk Corporaal
(68)
Read Source Operands of 32 Threads
ASCI Winterschool 2010
Henk Corporaal
(69)
Buffer Source Operands to Op Collector
ASCI Winterschool 2010
Henk Corporaal
(70)
Execute Threads 0~7
ASCI Winterschool 2010
Henk Corporaal
(71)
Execute Threads 8~15
ASCI Winterschool 2010
Henk Corporaal
(72)
Execute Threads 16~23
ASCI Winterschool 2010
Henk Corporaal
(73)
Execute Threads 24~31
ASCI Winterschool 2010
Henk Corporaal
(74)
Write Back from Result Queue to Reg
ASCI Winterschool 2010
Henk Corporaal
(75)
Warp: Basic Scheduling Unit in Hardware
 One warp consists of 32 consecutive threads
 Warps are transparent to the programmer; they are formed at run time
ASCI Winterschool 2010
Henk Corporaal
(76)
Warp Scheduling
• Schedule at most 24 warps in an interleaved manner
• Zero overhead for interleaved issue of warps
ASCI Winterschool 2010
Henk Corporaal
(77)
Handling Branch
 Threads within a warp are free to branch.
if( $r17 > $r19 ){
$r16 = $r20 + $r31
}
else{
$r16 = $r21 - $r32
}
$r18 = $r15 + $r16
Assembly code on the right are disassembled from cuda binary (cubin) using "decuda", link
ASCI Winterschool 2010
Henk Corporaal
(78)
Branch Divergence within a Warp
 If threads within a warp diverge, both paths have to be executed.
 Masks are set to filter out threads not executing on current path.
ASCI Winterschool 2010
Henk Corporaal
(79)
GPU Programming: NVIDIA CUDA example (recap)
• A CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP).
• Hardware converts TLP into DLP at run time.

Single thread program:

  float A[4][8];
  do-all(i=0; i<4; i++){
    do-all(j=0; j<8; j++){
      A[i][j]++;
    }
  }

CUDA program:

  float A[4][8];
  kernelF<<<(4,1),(8,1)>>>(A);

  __global__
  kernelF(A){
    i = blockIdx.x;
    j = threadIdx.x;
    A[i][j]++;
  }
ASCI Winterschool 2010
Henk Corporaal
(80)
CUDA Programming
Both the grid and the thread block can have a two-dimensional index.

  kernelF<<<(2,2),(4,2)>>>(A);

  __global__
  kernelF(A){
    i = blockIdx.y  * gridDim.x  + blockIdx.x;
    j = threadIdx.y * blockDim.x + threadIdx.x;
    A[i][j]++;
  }
ASCI Winterschool 2010
Henk Corporaal
(81)
Mapping Thread Blocks to SMs
 One thread block can only run on one SM
 Thread block can not migrate from one SM to another SM
 Threads of the same thread block can share data using shared
memory
Example: mapping 12 thread blocks on 4 SMs.
ASCI Winterschool 2010
Henk Corporaal
(82)
Mapping Thread Blocks (0,0)/(0,1)/(0,2)/(0,3)
ASCI Winterschool 2010
Henk Corporaal
(83)
CUDA Compilation Trajectory
cudafe: CUDA front end
nvopencc: customized open64 compiler for CUDA
ptx: high level assemble code (documented)
ptxas: ptx assembler
cubin: CUDA binary
decuda, http://wiki.github.com/laanwj/decuda
ASCI Winterschool 2010
Henk Corporaal
(84)
Optimization Guide
 Optimizations on memory latency tolerance
 Reduce register pressure
 Reduce shared memory pressure
 Optimizations on memory bandwidth
 Global memory coalesce
 Shared memory bank conflicts
 Grouping byte access
 Avoid Partition camping
 Optimizations on computation efficiency
 Mul/Add balancing
 Increase floating point proportion
 Optimizations on operational intensity
 Use tiled algorithm
 Tuning thread granularity
ASCI Winterschool 2010
Henk Corporaal
(85)
Global Memory: Coalesced Access
perfectly coalesced
allow threads
skipping LD/ST
NVIDIA, "CUDA Programming Guide", link
ASCI Winterschool 2010
Henk Corporaal
(86)
Global Memory: Non-Coalesced Access
non-consecutive
address
starting address not
aligned to 128 Byte
non-consecutive
address
stride larger than one
word
NVIDIA, "CUDA Programming Guide", link
ASCI Winterschool 2010
Henk Corporaal
(87)
Shared Memory: without Bank Conflict
one access per bank
one access per bank
with shuffling
access the same address
(broadcast)
partial broadcast and
skipping some banks
NVIDIA, "CUDA Programming Guide", link
ASCI Winterschool 2010
Henk Corporaal
(88)
Shared Memory: with Bank Conflict
access more than one
address per bank
broadcast more than one
address per bank
NVIDIA, "CUDA Programming Guide", link
ASCI Winterschool 2010
Henk Corporaal
(89)
Optimizing MatrixMul
Matrix Multiplication example from the 5kk70 course at TU/e, link.
The CUDA@MIT course also provides Matrix Multiplication as a hands-on example, link.
ASCI Winterschool 2010
Henk Corporaal
(90)
ATI Cypress (RV870)
•
1600 shader ALUs
ref: tom's hardware, link
ASCI Winterschool 2010
Henk Corporaal
(91)
ATI Cypress (RV870)
•
VLIW PEs
ref: tom's hardware, link
ASCI Winterschool 2010
Henk Corporaal
(92)
Intel Larrabee
•
x86 core, 8/16/32 cores.
Larry Seiler, et al. "Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008, link
ASCI Winterschool 2010
Henk Corporaal
(93)
CELL
[Diagram: PS3 system — the Cell Broadband Engine (3.2 GHz) connects to the NVIDIA RSX 'reality synthesizer' GPU (15 GB/s towards the RSX, 20 GB/s back), to four XDR DRAM main memory chips (64 pins * 3.2 Gbps/pin = 25.6 GB/s), and to the South Bridge (2.5 GB/s each direction) for drives, USB, network and media; the RSX has its own GDDR3 video memory (128 pins * 1.4 Gbps/pin = 22.4 GB/s)]
ASCI Winterschool 2010
Henk Corporaal
(94)
CELL – the architecture
 1 x PPE, 64-bit PowerPC
   L1: 32 KB I$ + 32 KB D$
   L2: 512 KB
 8 x SPE cores:
   Local store: 256 KB
   128 x 128-bit vector registers
 Hybrid memory model:
   PPE: Rd/Wr
   SPEs: asynchronous DMA
 EIB: 205 GB/s sustained aggregate bandwidth
 Processor-to-memory bandwidth: 25.6 GB/s
 Processor-to-processor: 20 GB/s in each direction
ASCI Winterschool 2010
Henk Corporaal
(95)
ASCI Winterschool 2010
Henk Corporaal
(96)
Intel / AMD x86 – Historical overview
ASCI Winterschool 2010
Henk Corporaal
(97)
Nehalem architecture
 In novel processors
   Core i7 & Xeon 5500s
 Quad core
 3 cache levels
 2 TLB levels
 2 branch predictors
 Out-of-Order execution
 Simultaneous Multithreading
 DVFS: dynamic voltage & frequency scaling
[Die photo with one core highlighted]
ASCI Winterschool 2010
Henk Corporaal
(98)
Nehalem pipeline (1/2)
[Diagram: Instruction Fetch and PreDecode → Instruction Queue (fed by a Microcode ROM) → Decode → Rename/Alloc → Retirement unit (Re-Order Buffer) → Scheduler → three execution-unit clusters (0–2) plus Load and Store units → L1D cache and DTLB → L2 cache → inclusive L3 cache shared by all cores → Quick Path Interconnect (2 x 20 bit)]
ASCI Winterschool 2010
Henk Corporaal
(99)
Nehalem pipeline (2/2)
ASCI Winterschool 2010
Henk Corporaal
(100)
Tylersburg: connecting 2 quad cores
[Diagram: two quad-core dies, each core with its own L1D/L1I and unified L2U, sharing an on-die L3U; each die has its own memory controller to DDR3 main memory, and the dies connect to each other and to the IOH via QPI links]

  Level  Capacity      Assoc. (ways)  Line size (B)  Latency (clk)  Throughput (clk)  Write policy
  L1D    4 x 32 KiB    8              64             4              1                 Writeback
  L1I    4 x 32 KiB    4              N/A            N/A            N/A               N/A
  L2U    4 x 256 KiB   8              64             10             Varies            Writeback
  L3U    1 x 8 MiB     16             64             35-40          Varies            Writeback
ASCI Winterschool 2010
Henk Corporaal
(101)
Programming these architectures: N-tap FIR

  out[i] = Σ_{j=0}^{N-1} in[i+j] * coeff[j]

C-code:

  int i, j;
  for (i = 0; i < M; i++) {
    out[i] = 0;
    for (j = 0; j < N; j++)
      out[i] += in[i+j] * coeff[j];
  }
ASCI Winterschool 2010
Henk Corporaal
(102)
[Figure: 4-tap FIR dataflow — each output Y_i is formed by multiplying four consecutive inputs X_i..X_{i+3} with coefficients C0..C3 and summing the products; the same pattern slides along inputs X0..X11 to produce Y0..Y11]
ASCI Winterschool 2010
Henk Corporaal
(103)
FIR with x86 SSE Intrinsics

  /* Note: _mm_alignr_epi8 (SSSE3) works on integer vectors, so the
     float vectors are cast back and forth around it. */
  __m128 X, XH, XL, Y, C, H;
  int i, j;
  for (i = 0; i < (M/4); i++) {
    XL = _mm_load_ps(&in[i*4]);
    Y  = _mm_setzero_ps();
    for (j = 0; j < (N/4); j++) {
      XH = XL;
      XL = _mm_load_ps(&in[(i+j+1)*4]);
      C  = _mm_load_ps(&coeff[j*4]);

      H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(0,0,0,0));
      X = _mm_mul_ps(XH, H);  Y = _mm_add_ps(Y, X);

      H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(1,1,1,1));
      X = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(XL), _mm_castps_si128(XH), 4));
      X = _mm_mul_ps(X, H);   Y = _mm_add_ps(Y, X);

      H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(2,2,2,2));
      X = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(XL), _mm_castps_si128(XH), 8));
      X = _mm_mul_ps(X, H);   Y = _mm_add_ps(Y, X);

      H = _mm_shuffle_ps(C, C, _MM_SHUFFLE(3,3,3,3));
      X = _mm_castsi128_ps(_mm_alignr_epi8(_mm_castps_si128(XL), _mm_castps_si128(XH), 12));
      X = _mm_mul_ps(X, H);   Y = _mm_add_ps(Y, X);
    }
    _mm_store_ps(&out[i*4], Y);
  }

[Figure: how the vectorized loop builds Y0..Y3 — Y_k = X_k*C0 + X_{k+1}*C1 + X_{k+2}*C2 + X_{k+3}*C3, with the shifted input vectors produced by the align operations]
ASCI Winterschool 2010
Henk Corporaal
(104)
FIR using pthread

  pthread_t fir_threads[N_THREAD];
  fir_arg   fa[N_THREAD];
  tsize = M / N_THREAD;

  for (i = 0; i < N_THREAD; i++) {
    /* … initialize thread parameters fa[i] … */
    rc = pthread_create(&fir_threads[i], NULL,
                        fir_kernel, (void *)&fa[i]);
  }
  for (i = 0; i < N_THREAD; i++) {
    rc = pthread_join(fir_threads[i], &status);
  }

[Figure: the input is split over threads T0..T3; each runs the sequential or vectorized FIR kernel on its chunk, after which the results are joined]
ASCI Winterschool 2010
Henk Corporaal
(105)
x86 FIR speedup
 On Intel Core 2 Quad Q8300, gcc optimization level 2
 Input: ~5M samples
 #threads in pthread: 4
[Bar chart: speedup of the SSE, pthread and SSE+pthread versions over the sequential implementation, for a 4-tap and a 64-tap filter (y-axis up to 14x)]
ASCI Winterschool 2010
Henk Corporaal
(106)
FIR kernel on CELL SPE
Vectorization is similar to SSE:

  vector float X, XH, XL, Y, H;
  int i, j;
  for (i = 0; i < (M/4); i++) {
    XL = in[i];
    Y  = spu_splats(0.0f);
    for (j = 0; j < (N/4); j++) {
      XH = XL;
      XL = in[i+j+1];
      H = spu_splats(coeff[j*4]);
      Y = spu_madd(XH, H, Y);
      H = spu_splats(coeff[j*4+1]);
      X = spu_shuffle(XH, XL, SHUFFLE_X1);
      Y = spu_madd(X, H, Y);
      H = spu_splats(coeff[j*4+2]);
      X = spu_shuffle(XH, XL, SHUFFLE_X2);
      Y = spu_madd(X, H, Y);
      H = spu_splats(coeff[j*4+3]);
      X = spu_shuffle(XH, XL, SHUFFLE_X3);
      Y = spu_madd(X, H, Y);
    }
    out[i] = Y;
  }
ASCI Winterschool 2010
Henk Corporaal
(107)
SPE DMA double buffering

  float iBuf[2][BUF_SIZE];
  float oBuf[2][BUF_SIZE];
  int idx = 0;
  int buffers = size / BUF_SIZE;

  mfc_get(iBuf[idx], argp, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
  for (int i = 1; i < buffers; i++) {
    wait_for_dma(tag[idx]);
    next_idx = idx ^ 1;
    mfc_get(iBuf[next_idx], argp, BUF_SIZE*sizeof(float), 0, 0, 0);
    fir_kernel(oBuf[idx], iBuf[idx], coeff, BUF_SIZE, taps);
    mfc_put(oBuf[idx], outbuf, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
    idx = next_idx;
  }
  /* Finish up the last block ... */

[Timeline: while the SPU computes on iBuf0 and writes oBuf0, the DMA engine fetches iBuf1 and puts the previous output buffer; the two buffer sets alternate every iteration]
ASCI Winterschool 2010
Henk Corporaal
(108)
CELL FIR speedup
 On PlayStation 3: CELL with six accessible SPEs
 Input: ~6M samples
 Speedup compared to a scalar implementation on the PPE
[Bar chart: speedup of scalar and SIMD SPE versions on 1, 2, 4 and 6 SPEs (y-axis up to 35x)]
ASCI Winterschool 2010
Henk Corporaal
(109)
Roofline Model
Introduced by Samuel Williams and David Patterson
[Figure: attainable performance (GFlops/sec) versus operational intensity (Flops/Byte); the roof consists of a slanted memory-bandwidth-bound part and a flat peak-performance part that meet at the ridge point — a balanced architecture for the given application]
ASCI Winterschool 2010
Henk Corporaal
(110)
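The roofline bound itself is just a minimum of the two roofs; a minimal C sketch (function and parameter names are mine, not from the model's authors):

  /* Attainable performance according to the roofline model:
     the minimum of the compute roof and the memory roof. */
  double roofline_gflops(double peak_gflops,   /* compute roof */
                         double peak_bw_gbs,   /* DRAM bandwidth */
                         double op_intensity)  /* flops per byte */
  {
      double mem_roof = peak_bw_gbs * op_intensity;
      return mem_roof < peak_gflops ? mem_roof : peak_gflops;
  }

The ridge point is where both roofs meet, i.e. op_intensity = peak_gflops / peak_bw_gbs.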
Roofline Model of GT8800 GPU
ASCI Winterschool 2010
Henk Corporaal
(111)
Roofline Model
 Threads of one warp diverge into different paths at branch.
ASCI Winterschool 2010
Henk Corporaal
(112)
Roofline Model
 In G80 architecture, a non-coalesced global memory access will
be separated into 16 accesses.
ASCI Winterschool 2010
Henk Corporaal
(113)
Roofline Model
The previous examples assume that memory latency can be hidden;
otherwise the program can become latency-bound.
 rm : fraction of memory instructions in the total instruction mix
 tavg : average memory latency
 CPIexe : cycles per instruction (execution only)
• There is one memory instruction in every (1/rm) instructions.
• There is one memory instruction every (1/rm) x CPIexe cycles.
• It takes (tavg x rm / CPIexe) threads to hide the memory latency.
Z. Guz, et al, "Many-Core vs. Many-Thread Machines: Stay Away From the Valley", IEEE Comp Arch Letters, 2009, link
S. Hong, et al. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA09, link
ASCI Winterschool 2010
Henk Corporaal
(114)
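A worked example with hypothetical numbers, only to exercise the formula above: with rm = 0.3, tavg = 400 cycles and CPIexe = 1,

  #threads ≈ tavg · rm / CPIexe = 400 · 0.3 / 1 = 120

so roughly 120 threads are needed to keep the units busy while memory requests are outstanding.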
Roofline Model
If not enough threads to hide the memory latency, the memory
latency could become the bottleneck.
Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD Thesis, UC Berkeley, 2008, link
S. Hong, et al. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA09, link
ASCI Winterschool 2010
Henk Corporaal
(115)
Four Architectures
[Block diagrams of the four systems compared:
 – AMD Barcelona: 2 sockets of 4 Opteron cores, 512 KB victim L2 per core, 2 MB shared quasi-victim L3 (32-way) per socket, SRI/crossbar, two 64b memory controllers per socket to 667 MHz DDR2 DIMMs (10.66 GB/s each), HyperTransport links (4 GB/s each direction)
 – Sun Victoria Falls: 2 sockets of 8 MT SPARC cores, 4 MB shared L2 (16-way, 64B interleaved) per socket, crossbars (179 GB/s and 90 GB/s), 4 coherency hubs, two 128b controllers to 667 MHz FBDIMMs (21.33 GB/s and 10.66 GB/s), 8 x 6.4 GB/s links (1 per hub per direction)
 – IBM Cell Blade: 2 Cell processors, each a VMT PPE with 512 KB L2 plus 8 SPEs with 256 KB local store and MFC on the EIB ring network, XDR memory controllers to 512 MB XDR DRAM (25.6 GB/s per chip), BIF link between the chips (<20 GB/s each direction)
 – NVIDIA G80: 8 thread clusters on an interconnect, 192 KB L2 (textures only), 24 ROPs, six 64b memory controllers to 768 MB 900 MHz GDDR3 device DRAM (86.4 GB/s)]
ASCI Winterschool 2010
Henk Corporaal
(116)
32b Rooflines for the Four
(in-core parallelism)
[Figure: roofline models — attainable GFlop/s (32b) versus flop:DRAM byte ratio, log-log — for AMD Barcelona, Sun Victoria Falls, IBM Cell Blade and NVIDIA G80; ceilings show peak single precision and the losses from mul/add imbalance or lack of FMA, without SIMD, and without ILP]
 Roofline models for the SMPs used in this work
 Single precision
 Based on micro-benchmarks, experience, and manuals
 Ceilings = in-core parallelism
 Can the compiler find all this parallelism?
 NOTE:
   log-log scale
   assumes perfect SPMD
ASCI Winterschool 2010
Henk Corporaal
(117)
Let's conclude: Trends
 Reliability + fault tolerance
   requires run-time management, process migration
 Power is the new metric
   low power management at all levels – scenarios – subthreshold, back biasing, …
 Virtualization (1): do not disturb other applications
   composability
 Virtualization (2): one virtual target platform avoids the porting problem
   one intermediate supporting multiple targets
   huge RT management support, JITC
   multiple OS
 Compute servers
 Transactional memory
 3D: integrate different dies
ASCI Winterschool 2010
Henk Corporaal
(118)
3D using Through Silicon Vias (TSV)
 Can enlarge device area
 Using TSVs: Face-to-Back (scalable); 4 um pitch in 2011 (ITRS 2007)
 Flip-Chip: Face-to-Face (limited to 2 die tiers)
from Woo e.a., HPCA 2009
ASCI Winterschool 2010
Henk Corporaal
(119)
Don't forget Amdahl
However, see next slide!
ASCI Winterschool 2010
Henk Corporaal
(120)
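A minimal C sketch of Amdahl's law (the function name and the example numbers in the comment are illustrative):

  /* Amdahl's law: speedup of a program with parallel fraction f on n cores. */
  double amdahl_speedup(double f, int n)
  {
      return 1.0 / ((1.0 - f) + f / (double)n);
      /* e.g. f = 0.95, n = 64 gives about 15.4x, far below 64x */
  }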
Trends: Homogeneous vs Heterogeneous: where do we go?
 Homogeneous:
   Easier to program
   Favored by DLP / vector parallelism
   Fault tolerance / task migration
 Heterogeneous
   Energy efficiency demands
   Higher speedup
   Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multi-core era)
 Memory-dominated designs suggest a homogeneous sea of heterogeneous cores
 Sea of reconfigurable compute or processor blocks?
   many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, … etc.
ASCI Winterschool 2010
Henk Corporaal
(121)
How does a future architecture look?
 A couple of high performance (low latency) cores
   also sequential code should run fast
 Add a whole battery of wide vector processors
 Some shared memory (to reduce copying of large data structures)
   Levels 2 and 3 in 3D technology
   Huge bandwidth; exploit large vectors
 Accelerators for dedicated domains
 OS support (runtime mapping, DVFS, use of accelerators)
ASCI Winterschool 2010
Henk Corporaal
(122)
But the real problem is …
 Programming in parallel is the real bottleneck
   new programming models, like transaction-based programming
 That's what we will talk about this week…
ASCI Winterschool 2010
Henk Corporaal
(123)
ASCI Winterschool 2010
Henk Corporaal
(124)