Introduction to Many-Core Architectures
Henk Corporaal, www.ics.ele.tue.nl/~heco
ASCI Winterschool on Embedded Systems, Soesterberg, March 2010

Intel Trends (K. Olukotun)
[Chart: Intel processor trends; recent data point: Core i7, 3 GHz, 100 W.]

System-level integration (Chuck Moore, AMD, at MICRO 2008)
Single-chip CPU era, 1986-2004: extreme focus on single-threaded performance; multi-issue, out-of-order execution plus a moderate cache hierarchy.
Chip Multiprocessor (CMP) era, 2004-2010: early: hasty integration of multiple cores into the same chip/package; mid-life: address some of the HW scalability and interference issues; current: homogeneous CPUs plus moderate system-level functionality.
System-level integration era, ~2010 onward: integration of substantial system-level functionality; heterogeneous processors and accelerators; introspective control systems for managing on-chip resources and events.

Why many-core?
Running into the frequency wall, the ILP wall, the memory wall and the energy wall.
Chip area as enabler: Moore's law continues well below 22 nm. What to do with all this area? Multiple processors fit easily on a single die.
Application demands.
Cost effective: just connect existing processors or processor cores.
Low power: parallelism may allow lowering Vdd.
Performance/Watt is the new metric!

Low power through parallelism
Sequential processor: switching capacitance C, frequency f, voltage V, so P1 = f*C*V^2.
Parallel processor (two times the number of units): switching capacitance 2C, frequency f/2, voltage V' < V, so P2 = (f/2)*2C*V'^2 = f*C*V'^2 < P1.

How low can Vdd go?
Subthreshold JPEG encoder, Vdd 0.4 - 1.2 Volt.
[Chart: energy per operation (pJ) versus supply voltage, from 1.2 V down to 0.4 V, with roughly 3.4x to 8.3x improvement as Vdd is scaled down.]

Computational efficiency: how many MOPS/Watt?
[Chart from Yifan He et al., DAC 2010.]

Computational efficiency: what do we need?
[Chart: performance (Gops) versus power (Watts); workload points for 3G/4G wireless and mobile HD video against IBM Cell, SODA (90 nm and 65 nm), Imagine, VIRAM, Pentium M and TI C6x. From Woh et al., ISCA 2009.]

Intel's opinion: a 48-core x86.

Outline
Classifications of parallel architectures.
Examples: various (research) architectures, GPUs, Cell, Intel multi-cores.
How much performance do you really get? The Roofline model.
Trends & conclusions.
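Before moving into the classifications, a quick numeric check of the "Low power through parallelism" slide above. The sketch below plugs assumed values for C, f, V and V' into P = f*C*V^2; all numbers are illustrative, not taken from the slides.

    /* Dynamic-power comparison: one core at full speed versus two units at
     * half the clock and a lower supply voltage. Values are assumptions. */
    #include <stdio.h>

    int main(void) {
        double C  = 1e-9;   /* switched capacitance in F (assumed) */
        double f  = 1e9;    /* clock frequency in Hz (assumed)     */
        double V  = 1.1;    /* nominal supply voltage (assumed)    */
        double Vp = 0.8;    /* reduced supply for half-speed units */

        double P1 = f * C * V * V;                 /* sequential processor    */
        double P2 = (f / 2) * (2 * C) * Vp * Vp;   /* two units at f/2 and V' */

        printf("P1 = %.3f W, P2 = %.3f W, ratio = %.2f\n", P1, P2, P2 / P1);
        return 0;
    }

With these assumed values the two half-speed units deliver the same throughput at roughly half the power, which is exactly the point of the slide.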
Classifications
Performance / parallelism driven: the 4-5 dimensional design space; Flynn's taxonomy.
Communication & memory: message passing versus shared memory; shared-memory issues: coherency, consistency, synchronization.
Interconnect.

Flynn's Taxonomy
SISD (Single Instruction, Single Data): uniprocessors.
SIMD (Single Instruction, Multiple Data): vector architectures also belong to this class; multimedia extensions (MMX, SSE, VIS, AltiVec, ...); examples: Illiac-IV, CM-2, MasPar MP-1/2, Xetal, IMAP, Imagine, GPUs, ...
MISD (Multiple Instruction, Single Data): systolic arrays / stream-based processing.
MIMD (Multiple Instruction, Multiple Data): examples: Sun Enterprise 5000, Cray T3D/T3E, SGI Origin; flexible, and most widely used.
[Figure: the four Flynn classes.]

Enhance performance: 4 architecture methods
(Super)-pipelining.
Powerful instructions: MD-technique (multiple data operands per operation) and MO-technique (multiple operations per instruction).
Multiple instruction issue: a single stream (superscalar), or multiple streams (a single core with multiple threads: simultaneous multithreading; or multiple cores).

Architecture methods: pipelined execution of instructions
Simple 5-stage pipeline: IF (instruction fetch), DC (instruction decode), RF (register fetch), EX (execute instruction), WB (write result register).
[Diagram: instructions 1-4 flowing through the IF/DC/RF/EX/WB stages in consecutive cycles 1-8.]
Purpose of pipelining: reduce the number of gate levels in the critical path; reduce CPI close to one (instead of a large number for a multicycle machine); more efficient hardware.
Problems: hazards cause pipeline stalls. Structural hazards: add more hardware. Control hazards / branch penalties: use branch prediction. Data hazards: bypassing required.

Architecture methods: superpipelining
Superpipelining: split one or more of the critical pipeline stages.
Superpipelining degree: S(architecture) = sum over Op in I_set of f(Op) * lt(Op), where f(Op) is the frequency of operation Op and lt(Op) its latency.

Architecture methods: powerful instructions (1), MD-technique
Multiple data operands per operation; SIMD: Single Instruction, Multiple Data.
Vector instruction example, for (i = 0; i < 64; i++) c[i] = a[i] + 5*b[i];  i.e. c = a + 5*b:

    set   vl,64
    ldv   v1,0(r2)
    mulvi v2,v1,5
    ldv   v1,0(r1)
    addv  v3,v1,v2
    stv   v3,0(r3)
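For comparison with the vector code above, the same c = a + 5*b loop can be written with x86 sub-word (SSE) intrinsics, the restricted-scale MD-technique discussed a few slides further on. A minimal sketch only; it assumes 64-element float arrays and uses unaligned loads and stores.

    /* c[i] = a[i] + 5*b[i], four floats per instruction (SSE). */
    #include <xmmintrin.h>

    void add_scaled(const float *a, const float *b, float *c) {
        __m128 five = _mm_set1_ps(5.0f);            /* broadcast the constant */
        for (int i = 0; i < 64; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, _mm_mul_ps(five, vb)));
        }
    }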
Architecture methods: powerful instructions (1), SIMD computing
All PEs (processing elements) execute the same operation.
Connectivity: typically a mesh or hypercube; exploit the data locality of e.g. image processing applications.
[Diagram: SIMD execution method; instructions 1..n issued over time to PE1 ... PEn.]
Dense encoding: few instruction bits needed.

Architecture methods: powerful instructions (1), sub-word parallelism
SIMD on a restricted scale, used for multimedia instructions.
Examples: MMX, SSE, Sun VIS, HP MAX-2, AMD K7/Athlon 3DNow!, TriMedia II.
Example operation: sum over i = 1..4 of |a_i - b_i|.

Architecture methods: powerful instructions (2), MO-technique
Multiple operations per instruction. Two options: CISC (Complex Instruction Set Computer) and VLIW (Very Long Instruction Word).
VLIW instruction example (one field per FU):
    FU 1: sub  r8, r5, 3
    FU 2: and  r1, r5, 12
    FU 3: mul  r6, r5, r2
    FU 4: ld   r3, 0(r5)
    FU 5: bnez r5, 13

VLIW architecture: central register file
[Diagram: nine exec units grouped into three issue slots, all connected to one central register file.]
Q: how many ports does the register file need for n-issue?

Architecture methods: multiple instruction issue (per cycle)
Who guarantees semantic correctness, i.e. decides which instructions can be executed in parallel?
The user: he specifies multiple instruction streams -> multi-processor: MIMD (Multiple Instruction, Multiple Data).
The hardware: run-time detection of ready instructions -> superscalar.
The compiler: compile into a dataflow representation -> dataflow processors.

Four-dimensional representation of the architecture design space <I, O, D, S>
Axes: instructions/cycle 'I', operations/instruction 'O', data/operation 'D', superpipelining degree 'S'.
[Diagram: architectures positioned in the <I, O, D, S> space: SIMD and vector along the D axis; superscalar, MIMD and dataflow along the I axis; VLIW along the O axis; superpipelined machines along the S axis; CISC and RISC near the origin.]

Architecture design space: example values of <I, O, D, S> for different architectures

    Architecture     I     O     D     S     Mpar
    CISC            0.2   1.2   1.1   1      0.26
    RISC            1     1     1     1.2    1.2
    VLIW            1     10    1     1.2    12
    Superscalar     3     1     1     1.2    3.6
    SIMD            1     1     128   1.2    154
    MIMD            32    1     1     1.2    38
    GPU             32    2     8     24     12288
    Top500 Jaguar   -     -     -     -      ???

Mpar = I * O * D * S, with S(architecture) = sum over Op in I_set of f(Op) * lt(Op).
You should exploit this amount of parallelism!
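As a worked instance of the table, the GPU row combines as follows (the numbers are the ones from the table above):

    \[
      M_{par} \;=\; I \cdot O \cdot D \cdot S \;=\; 32 \cdot 2 \cdot 8 \cdot 24 \;=\; 12288
    \]

In other words, on the order of twelve thousand operations must be in flight to keep such a device busy.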
Communication
A parallel architecture extends a traditional computer architecture with a communication network:
abstractions (the HW/SW interface);
an organizational structure to realize the abstraction efficiently.
[Diagram: processing nodes connected by a communication network.]

Communication models: shared memory
Processes P1 and P2 both read and write a shared memory.
Issues: the coherence problem, the memory consistency issue, and the synchronization problem.

Communication models: shared memory
Shared address space; communication primitives: load, store, atomic swap.
Two varieties:
physically shared => Symmetric Multi-Processors (SMP), usually combined with local caching;
physically distributed => Distributed Shared Memory (DSM).

SMP: Symmetric Multi-Processor
Memory: centralized, with uniform access time (UMA); bus interconnect and I/O.
Examples: Sun Enterprise 6000, SGI Challenge, Intel.
[Diagram: several processors, each with one or more cache levels, connected to main memory and the I/O system; the interconnect can be one bus, N busses, or any network.]

DSM: Distributed Shared Memory
Non-uniform access time (NUMA) and a scalable interconnect (distributed memory).
[Diagram: processors with caches and local memories connected through an interconnection network to each other and to the I/O system.]

Shared address model summary
Each processor can name every physical location in the machine; each process can name all data it shares with other processes.
Data transfer via load and store; data size: byte, word, ..., or cache blocks.
The memory hierarchy model applies: communication moves data into the local processor cache.

Three fundamental issues for shared-memory multiprocessors
Coherence: do I see the most recent data?
Consistency: when do I see a written value? E.g., do different processors see writes at the same time (with respect to other memory accesses)?
Synchronization: how to synchronize processes; how to protect access to shared data?

Communication models: message passing
Communication primitives such as send and receive, provided as library calls.
Standard: MPI, the Message Passing Interface, www.mpi-forum.org.
Note that message passing can be built on top of shared memory, and vice versa!
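A minimal sketch of the send/receive primitives using the MPI standard named above; the ranks, tag and payload are illustrative and error handling is omitted.

    /* Two-process ping: rank 0 sends one double to rank 1.
     * Run with e.g. "mpirun -np 2 ./a.out". */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double msg = 0.0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            msg = 3.14;
            MPI_Send(&msg, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);      /* to P1   */
        } else if (rank == 1) {
            MPI_Recv(&msg, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);                               /* from P0 */
            printf("P1 received %f\n", msg);
        }
        MPI_Finalize();
        return 0;
    }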
[Diagram: processes P1 and P2 exchanging messages through FIFOs, each side issuing send and receive operations.]

Message passing model
Explicit message send and receive operations.
Send specifies a local buffer plus the receiving process on the remote computer.
Receive specifies the sending process on the remote computer plus a local buffer in which to place the data.
Typically blocking communication, but it may use DMA.
Message structure: header, data, trailer.

Message passing communication
[Diagram: processors with caches, memories, DMA engines and network interfaces connected by an interconnection network.]

Communication models: comparison
Shared memory: compatibility with well-understood language mechanisms; ease of programming for complex or dynamic communication patterns; shared-memory applications; sharing of large data structures; efficient for small items; supports hardware caching.
Message passing: simpler hardware; explicit communication; implicit synchronization (with any communication).

Interconnect
How to connect your cores? Some options:
Connect everybody: a single bus, a hierarchical bus, or a NoC (multi-hop via routers; any topology possible; an easy 2D layout helps).
Connect with neighbors only: e.g. using a shift operation in a SIMD, or using dual-ported memories to connect two cores.

Bus (shared) or network (switched)
A network is claimed to be more scalable: no bus arbitration, point-to-point connections, but router overhead.
[Example: a NoC with a 2x4 mesh routing network; each node attaches to a router R.]

Historical perspective
Early machines were collections of microprocessors; communication was performed using bi-directional queues between nearest neighbors.
Messages were forwarded by processors on the path: "store and forward" networking.
There was a strong emphasis on topology in algorithms, in order to minimize the number of hops and thereby minimize time.

Design characteristics of a network
Topology (how things are connected): crossbar, ring, 2D and 3D meshes or tori, hypercube, tree, butterfly, perfect shuffle, ...
Routing algorithm (which path is used): example in a 2D torus: route all east-west hops first, then all north-south hops (avoids deadlock).
Switching strategy: circuit switching (the full path is reserved for the entire message, like the telephone) versus packet switching (the message is broken into separately-routed packets, like the post office).
Flow control and buffering (what if there is congestion): stall, store data temporarily in buffers, re-route data to other nodes, tell the source node to temporarily halt, discard, etc.
QoS guarantees, error handling, etc.
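A small sketch of the deterministic routing idea above: dimension-order (XY) routing on a 2D mesh, first all hops in the X direction, then all hops in Y. The node coordinates and the printout are illustrative only.

    /* Dimension-order (XY) routing on a 2D mesh: X hops first, then Y hops. */
    #include <stdio.h>

    static void route_xy(int sx, int sy, int dx, int dy) {
        int x = sx, y = sy;
        while (x != dx) { x += (dx > x) ? 1 : -1; printf("hop to (%d,%d)\n", x, y); }
        while (y != dy) { y += (dy > y) ? 1 : -1; printf("hop to (%d,%d)\n", x, y); }
    }

    int main(void) {
        route_xy(0, 0, 3, 2);   /* route from node (0,0) to node (3,2) */
        return 0;
    }

Because every packet for a given source/destination pair takes the same path, this scheme is simple to implement in a router and, on a mesh, cannot deadlock.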
Switch / network topology
The topology determines:
degree: the number of links from a node;
diameter: the maximum number of links crossed between nodes;
average distance: the number of links to a random destination;
bisection: the minimum number of links that separate the network into two halves.
Bisection bandwidth = link bandwidth * bisection.

Bisection bandwidth
Bisection bandwidth: the bandwidth across the smallest cut that divides the network into two equal halves; the bandwidth across the "narrowest" part of the network.
[Figure: for a linear array the bisection cut crosses one link (bisection bw = link bw); for a 2D mesh it crosses sqrt(n) links (bisection bw = sqrt(n) * link bw); a cut that does not split the network into halves is not a bisection cut.]
Bisection bandwidth is important for algorithms in which all processors need to communicate with all others.

Common topologies (N = number of nodes, n = dimension)

    Type        Degree   Diameter         Ave Dist        Bisection
    1D mesh     2        N-1              N/3             1
    2D mesh     4        2(N^1/2 - 1)     2N^1/2 / 3      N^1/2
    3D mesh     6        3(N^1/3 - 1)     3N^1/3 / 3      N^2/3
    nD mesh     2n       n(N^1/n - 1)     nN^1/n / 3      N^(n-1)/n
    Ring        2        N/2              N/4             2
    2D torus    4        N^1/2            N^1/2 / 2       2N^1/2
    Hypercube   Log2N    n = Log2N        n/2             N/2
    2D tree     3        2Log2N           ~2Log2N         1
    Crossbar    N-1      1                1               N^2/2

Topologies in real high-end machines (listed roughly from newer to older)

    Red Storm (Opteron + Cray network, future)    3D mesh
    Blue Gene/L                                   3D torus
    SGI Altix                                     fat tree
    Cray X1                                       4D hypercube (approx.)
    Myricom (Millennium)                          arbitrary
    Quadrics (in HP Alpha server clusters)        fat tree
    IBM SP                                        fat tree (approx.)
    SGI Origin                                    hypercube
    Intel Paragon                                 2D mesh
    BBN Butterfly                                 butterfly

Network performance metrics
Network bandwidth: we need high bandwidth in communication; how does it scale with the number of nodes?
Communication latency: affects performance, since the processor may have to wait; affects ease of programming, since it requires more thought to overlap communication and computation.
How can a mechanism help hide latency? Overlap the message send with computation, prefetch data, or switch to another task or thread.

Examples of many-core / many-PE architectures
SIMD: Xetal (320 PEs), IMAP (128 PEs), AnySP (Michigan Univ).
VLIW: Itanium, TRIPS / EDGE, ADRES.
Multi-threaded (idea: hide long latencies): Denelcor HEP (1982), Sun Niagara (2005).
Multi-processor: RAW, PicoChip, Intel/AMD, GRID, farms, ...
Hybrid, like Imagine, GPUs, XC-Core; actually, most architectures are hybrid!

IMAP from NEC
NEC IMAP: SIMD with 128 PEs; supports indirect addressing, e.g. LD r1, (r2); each PE is a 5-issue VLIW.

TRIPS (Austin Univ / IBM)
A statically mapped dataflow architecture.
R: register file, E: execution unit, D: data cache, I: instruction cache, G: global control.

Compiling for TRIPS
1. Form hyperblocks (use unrolling, predication and inlining to enlarge the scope).
2. Spatially map the operations of each hyperblock; registers are accessed at hyperblock boundaries.
3. Schedule the hyperblocks.

Multithreaded categories
[Diagram: issue slots over time (processor cycles) for superscalar, fine-grained, coarse-grained, multiprocessing and simultaneous multithreading, with threads 1-5 and idle slots.]
Simultaneous multithreading is what Intel calls 'Hyperthreading'.
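The latency-hiding idea above (overlap communication with computation, or switch to another thread) can also be mimicked explicitly in software. The pthreads sketch below overlaps a simulated long-latency transfer with independent computation; the buffer size and the dummy work are assumptions, not taken from the slides.

    /* Overlap a (simulated) long-latency transfer with computation.
     * Compile with: gcc -O2 -pthread overlap.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <string.h>

    #define N (1 << 20)
    static float remote_buf[N], local_buf[N];

    static void *fetch(void *arg) {          /* stands in for a DMA / message */
        (void)arg;
        memcpy(local_buf, remote_buf, sizeof(remote_buf));
        return NULL;
    }

    int main(void) {
        pthread_t t;
        double acc = 0.0;
        pthread_create(&t, NULL, fetch, NULL);   /* start the "transfer"         */
        for (int i = 0; i < N; i++)              /* overlap: independent compute */
            acc += (double)i * 0.5;
        pthread_join(t, NULL);                   /* wait before using local_buf  */
        printf("acc = %.1f, local_buf[0] = %.1f\n", acc, local_buf[0]);
        return 0;
    }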
Sun Niagara processing element
4 threads per processor: 4 copies of the PC logic, the instruction buffer, the store buffer and the register file.

Really BIG: Jaguar, Cray XT5-HE, Oak Ridge National Lab
224,256 AMD Opteron cores; 2.33 PetaFlop peak performance; 299 TByte main memory; 10 PByte disk; 478 GB/s memory bandwidth; 6.9 MegaWatt; 3D torus; TOP500 #1 (Nov 2009).

Graphics Processing Units (GPUs)
NVIDIA GT 340 (2010); ATI 5970 (2009).

Why GPUs?
[Figure.]

In need of TeraFlops?
3 x GTX295: 1440 PEs, 5.3 TeraFlop.

How do GPUs spend their die area?
GPUs are designed to match the workload of 3D graphics.
[Die photo of the GeForce GTX 280 (source: NVIDIA).]
J. Roca et al., "Workload Characterization of 3D Games", IISWC 2006.
T. Mitra et al., "Dynamic 3D Graphics Workload Characterization and the Architectural Implications", MICRO 1999.

How do CPUs spend their die area?
CPUs are designed for low latency instead of high throughput.
[Die photo of Intel Penryn (source: Intel).]

GPU: Graphics Processing Unit
From polygon mesh to image pixel. The Utah teapot: http://en.wikipedia.org/wiki/Utah_teapot

The graphics pipeline
[Four slides stepping through the graphics pipeline; K. Fatahalian et al., "GPUs: a Closer Look", ACM Queue 2008, http://doi.acm.org/10.1145/1365490.1365498.]

GPUs: what's inside?
Basically a SIMD machine:
a single instruction stream operates on multiple data streams;
all PEs execute the same instruction at the same time;
PEs operate concurrently on their own piece of memory.
However, a GPU is far more complex!
[Diagram: a control processor issuing one instruction (e.g. Add) to PE 1 ... PE 320, each working on its own slice of the data memory, connected by an interconnect.]

GPU programming: NVIDIA CUDA example
A CUDA program expresses data-level parallelism (DLP) in terms of thread-level parallelism (TLP); the hardware converts TLP back into DLP at run time.

Single-thread program:
    float A[4][8];
    do-all(i=0; i<4; i++){
      do-all(j=0; j<8; j++){
        A[i][j]++;
      }
    }

CUDA program:
    float A[4][8];
    kernelF<<<(4,1),(8,1)>>>(A);
    __device__ kernelF(A){
      i = blockIdx.x;
      j = threadIdx.x;
      A[i][j]++;
    }
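The CUDA fragment above is schematic (types and qualifiers are omitted on the slide). A self-contained version that actually compiles might look as follows; the flattened indexing, the launch syntax and the printf check are additions for illustration, not the original course code.

    // Minimal runnable sketch of the kernelF example: 4 blocks of 8 threads,
    // one thread per element of the 4x8 array. Build with: nvcc kernelf.cu
    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void kernelF(float *A) {
        int i = blockIdx.x;     // one block per row (4 blocks)
        int j = threadIdx.x;    // one thread per column (8 threads per block)
        A[i * 8 + j] += 1.0f;
    }

    int main() {
        float h_A[4][8] = {{0.0f}};
        float *d_A;
        cudaMalloc((void **)&d_A, sizeof(h_A));
        cudaMemcpy(d_A, h_A, sizeof(h_A), cudaMemcpyHostToDevice);

        kernelF<<<4, 8>>>(d_A);                  // grid of 4 blocks, 8 threads each

        cudaMemcpy(h_A, d_A, sizeof(h_A), cudaMemcpyDeviceToHost);
        cudaFree(d_A);
        printf("A[3][7] = %.1f\n", h_A[3][7]);   // expect 1.0
        return 0;
    }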
"NVIDIA Tesla: A Unified Graphics and Computing Architecture", IEEE Micro 2008, link ASCI Winterschool 2010 Henk Corporaal (65) Texture Processor Cluster (TPC) ASCI Winterschool 2010 Henk Corporaal (66) Deeply pipelined SM for high throughput One instruction executed by a warp of 32 threads One warp is executed on 8 PEs over 4 shader cycles Let's start with a simple example: execution of 1 instruction ASCI Winterschool 2010 Henk Corporaal (67) Issue an Instruction for 32 Threads ASCI Winterschool 2010 Henk Corporaal (68) Read Source Operands of 32 Threads ASCI Winterschool 2010 Henk Corporaal (69) Buffer Source Operands to Op Collector ASCI Winterschool 2010 Henk Corporaal (70) Execute Threads 0~7 ASCI Winterschool 2010 Henk Corporaal (71) Execute Threads 8~15 ASCI Winterschool 2010 Henk Corporaal (72) Execute Threads 16~23 ASCI Winterschool 2010 Henk Corporaal (73) Execute Threads 24~31 ASCI Winterschool 2010 Henk Corporaal (74) Write Back from Result Queue to Reg ASCI Winterschool 2010 Henk Corporaal (75) Warp: Basic Scheduling Unit in Hardware One warp consists of 32 consecutive threads Warps are transparent to programmer, formed at run time ASCI Winterschool 2010 Henk Corporaal (76) Warp Scheduling • • ASCI Winterschool 2010 Schedule at most 24 warps in an interleaved manner Zero overhead for interleaved issue of warps Henk Corporaal (77) Handling Branch Threads within a warp are free to branch. if( $r17 > $r19 ){ $r16 = $r20 + $r31 } else{ $r16 = $r21 - $r32 } $r18 = $r15 + $r16 Assembly code on the right are disassembled from cuda binary (cubin) using "decuda", link ASCI Winterschool 2010 Henk Corporaal (78) Branch Divergence within a Warp If threads within a warp diverge, both paths have to be executed. Masks are set to filter out threads not executing on current path. ASCI Winterschool 2010 Henk Corporaal (79) CPU Programming: NVIDIA CUDA example • CUDA program expresses data level parallelism (DLP) in terms of thread level parallelism (TLP). • Hardware converts TLP into DLP at run time. Single thread program CUDA program float A[4][8]; do-all(i=0;i<4;i++){ do-all(j=0;j<8;j++){ A[i][j]++; } } float A[4][8]; ASCI Winterschool 2010 kernelF<<<(4,1),(8,1)>>>(A); __device__ kernelF(A){ i = blockIdx.x; j = threadIdx.x; A[i][j]++; } Henk Corporaal (80) CUDA Programming Both grid and thread block can have two dimensional index. kernelF<<<(2,2),(4,2)>>>(A); __device__ kernelF(A){ i = blockDim.x * blockIdx.y + blockIdx.x; j = threadDim.x * threadIdx.y + threadIdx.x; A[i][j]++; } ASCI Winterschool 2010 Henk Corporaal (81) Mapping Thread Blocks to SMs One thread block can only run on one SM Thread block can not migrate from one SM to another SM Threads of the same thread block can share data using shared memory Example: mapping 12 thread blocks on 4 SMs. 
Mapping thread blocks
[Diagram: thread blocks (0,0), (0,1), (0,2), (0,3), ... distributed over the SMs.]

CUDA compilation trajectory
cudafe: the CUDA front end.
nvopencc: a customized Open64 compiler for CUDA.
ptx: high-level assembly code (documented).
ptxas: the ptx assembler.
cubin: the CUDA binary.
decuda: http://wiki.github.com/laanwj/decuda

Optimization guide
Optimizations for memory latency tolerance: reduce register pressure; reduce shared memory pressure.
Optimizations for memory bandwidth: global memory coalescing; shared memory bank conflicts; grouping byte accesses; avoid partition camping.
Optimizations for computation efficiency: mul/add balancing; increase the floating-point proportion.
Optimizations for operational intensity: use a tiled algorithm; tune the thread granularity.

Global memory: coalesced access
Perfectly coalesced access patterns; coalescing still allows individual threads to skip their LD/ST.
(NVIDIA, "CUDA Programming Guide".)

Global memory: non-coalesced access
Non-consecutive addresses; a starting address not aligned to 128 bytes; a stride larger than one word.
(NVIDIA, "CUDA Programming Guide".)

Shared memory: without bank conflict
One access per bank; one access per bank with shuffling; all threads accessing the same address (broadcast); partial broadcast with some banks skipped.
(NVIDIA, "CUDA Programming Guide".)

Shared memory: with bank conflict
More than one address accessed per bank; broadcast of more than one address per bank.
(NVIDIA, "CUDA Programming Guide".)

Optimizing MatrixMul
Matrix multiplication example from the 5kk70 course at TU/e.
The CUDA@MIT course also provides matrix multiplication as a hands-on example.
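The optimization guide above points at tiled algorithms, coalesced loads and bank-conflict-free shared memory use; the classic shared-memory tiled matrix-multiply kernel combines all three. The sketch below is a generic version (the tile size, the names and the assumption that n is a multiple of the tile size are mine), not the 5kk70 or CUDA@MIT code.

    // Tiled matrix multiply C = A*B for row-major n x n matrices.
    // Launch with grid(n/TILE, n/TILE) and block(TILE, TILE); assumes n % TILE == 0.
    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n) {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; ++t) {
            // Each thread loads one element of each tile: coalesced global loads.
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();

            // Each element fetched from global memory is reused TILE times.
            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }

The tiling raises the operational intensity of the kernel, which is exactly the lever the Roofline model discussed later says you need.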
"Larrabee: a many-core x86 architecture for visual computing", SIGGRAPH 2008, link ASCI Winterschool 2010 Henk Corporaal (93) CELL Video Memory GDDR3 NVIDIA RSX PS3 GDDR3 GDDR3 reality synthesizer GDDR3 128pin * 1.4Gbps/pin = 22.4GB/sec 15 GB/sec 20 GB/sec Main Memory XDR DRAM Cell Broadband Engine 3.2 GHz XDR DRAM XDR DRAM XDR DRAM 64pin * 3.2Gbps/pin = 25.6GB/sec 2.5 GB/sec 2.5 GB/sec South Bridge ASCI Winterschool 2010 drives USB Network Media Henk Corporaal (94) CELL – the architecture 1 x PPE 64-bit PowerPC L1: 32 KB I$ + 32 KB D$ L2: 512 KB 8 x SPE cores: Local store: 256 KB 128 x 128 bit vector registers Hybrid memory model: PPE: Rd/Wr SPEs: Asynchronous DMA EIB: 205 GB/s sustained aggregate bandwidth Processor-to-memory bandwidth: 25.6 GB/s Processor-to-processor: 20 GB/s in each direction ASCI Winterschool 2010 Henk Corporaal (95) ASCI Winterschool 2010 Henk Corporaal (96) Intel / AMD x86 – Historical overview ASCI Winterschool 2010 Henk Corporaal (97) Nehalem architecture In novel processors Core i7 & Xeon 5500s Quad Core 3 cache levels 2 TLB levels 2 branch predictors 1 core Out-of-Order execution Simultaneous Multithreading DVFS: dynamic voltage & frequency scaling ASCI Winterschool 2010 Henk Corporaal (98) Nehalem pipeline (1/2) Instruction Fetch and PreDecode Instruction Queue Microcode ROM Decode Rename/Alloc Retirement unit (Re-Order Buffer) Scheduler EXE Unit Clust er 0 EXE Unit Clust er 1 EXE Unit Clust er 2 Load Store L1D Cache and DTLB L2 Cache Inclusive L3 Cache by all cores ASCI Winterschool 2010 Q PI Quick Path Interconnect (2x20 bit) Henk Corporaal (99) Nehalem pipeline (2/2) ASCI Winterschool 2010 Henk Corporaal (100) Tylersburg: connecting 2 quad cores Core L1D Core L1I L1D L2U Core L1I L1D L2U L1I L2U Core L1D Core L1I L1D L2U L1I L2U Core L1D L1I L1I L1D L2U L1I L2U QPI QPI Memory controller DDR3 QPI QP I QPI I QP Main memory IOH Main memory Level Capacity Associativity (ways) Line size (bytes) Access Latency (clocks) Access Throughput (clocks) Write Update Policy L1D 4 x 32 KiB 8 64 4 1 Writeback L1I 4 x 32 KiB 4 N/A N/A N/A N/A L2U 4 x 256KiB 8 64 10 Varies Writeback L3U 1 x 8 MiB 16 64 35-40 Varies Writeback ASCI Winterschool 2010 Core L3U QPI DDR3 L1D L2U L3U Memory controller Core Henk Corporaal (101) Programming these arechitectures: N-tap FIR N 1 out[i] in[i j ] * coeff [ j ] j 0 C-code: int i, j; for (i = 0; i < M; i ++){ out[i] = 0; for (j = 0; j < N; j ++) out[i] +=n[i+j]*coeff[j]; } ASCI Winterschool 2010 Henk Corporaal (102) X0 X1 X2 X3 X4 X5 X6 x x x x C0 C1 C2 C3 + x x x x C0 C1 C2 C3 + x x x x C0 C1 C2 C3 + x x x x C0 C1 C2 C3 Y4 Y5 Y6 X7 X8 X9 X10 X11 Y7 Y8 Y9 Y10 Y11 + Y0 ASCI Winterschool 2010 Y1 Y2 Y3 Henk Corporaal (103) FIR with x86 SSE Intrinsics __m128 X, XH, XL, Y, C, H; int i, j; for(i = 0; i < (M/4); i ++){ XL = _mm_load_ps(&in[i*4]); Y = _mm_setzero_ps(); for(j = 0; j < (N/4); j ++){ XH = XL; XL = _mm_load_ps(&in[(i+j+1)*4]); C =_mm_load_ps(&coeff[j*4]); H =_mm_shuffle_ps (C, C, _MM_SHUFFLE(0,0,0,0)); X = _mm_mul_ps (XH, H); Y = _mm_add_ps (Y, X); H =_mm_shuffle_ps (C, C, _MM_SHUFFLE(1,1,1,1)); X = _mm_alignr_epi8 (XL, XH, 4); X = _mm_mul_ps (X, H); Y = _mm_add_ps (Y, X); H = _mm_shuffle_ps (C, C, _MM_SHUFFLE(2,2,2,2)); X = _mm_alignr_epi8 (XL, XH, 8); X = _mm_mul_ps (X, H); Y = _mm_add_ps (Y, X); H = _mm_shuffle_ps (C, C, _MM_SHUFFLE(3,3,3,3)); X = _mm_alignr_epi8 (XL, XH, 12); X = _mm_mul_ps (X, H); Y = _mm_add_ps (Y, X); } _mm_store_ps(&out[i*4], Y); } Y X H X H X H X H Y0 X0 C0 X1 C1 X2 C2 X3 C3 Y1 X1 C0 X2 C1 X3 C2 X4 C3 = ASCI 
FIR using pthreads
    pthread_t fir_threads[N_THREAD];
    fir_arg   fa[N_THREAD];
    tsize = M / N_THREAD;
    for (i = 0; i < N_THREAD; i++) {
      /* ... initialize thread parameters fa[i] ... */
      rc = pthread_create(&fir_threads[i], NULL,
                          fir_kernel, (void *)&fa[i]);
    }
    for (i = 0; i < N_THREAD; i++) {
      rc = pthread_join(fir_threads[i], &status);
    }
The input is split over threads T0..T3, each running the sequential or the vectorized FIR kernel, after which the threads are joined.

x86 FIR speedup
On an Intel Core 2 Quad Q8300, gcc optimization level 2; input: ~5M samples; 4 pthreads.
[Bar chart: speedup of the sequential, SSE, pthread and SSE+pthread versions, for a 4-tap and a 64-tap filter.]

FIR kernel on a CELL SPE
Vectorization is similar to SSE:
    vector float X, XH, XL, Y, H;
    int i, j;
    for (i = 0; i < (M/4); i++) {
      XL = in[i];
      Y  = spu_splats(0.0f);
      for (j = 0; j < (N/4); j++) {
        XH = XL;
        XL = in[i+j+1];
        H  = spu_splats(coeff[j*4]);
        Y  = spu_madd(XH, H, Y);
        H  = spu_splats(coeff[j*4+1]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X1);
        Y  = spu_madd(X, H, Y);
        H  = spu_splats(coeff[j*4+2]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X2);
        Y  = spu_madd(X, H, Y);
        H  = spu_splats(coeff[j*4+3]);
        X  = spu_shuffle(XH, XL, SHUFFLE_X3);
        Y  = spu_madd(X, H, Y);
      }
      out[i] = Y;
    }

SPE DMA double buffering
    float iBuf[2][BUF_SIZE];
    float oBuf[2][BUF_SIZE];
    int idx = 0;
    int buffers = size / BUF_SIZE;
    mfc_get(iBuf[idx], argp, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
    for (int i = 1; i < buffers; i++) {
      wait_for_dma(tag[idx]);
      next_idx = idx ^ 1;
      mfc_get(iBuf[next_idx], argp, BUF_SIZE*sizeof(float), 0, 0, 0);
      fir_kernel(oBuf[idx], iBuf[idx], coeff, BUF_SIZE, taps);
      mfc_put(oBuf[idx], outbuf, BUF_SIZE*sizeof(float), tag[idx], 0, 0);
      idx = next_idx;
    }
    /* finish up the last block ... */
Timeline: get iBuf0; then get iBuf1 while using iBuf0 and writing oBuf0; then put oBuf0 and get iBuf0 while using iBuf1 and writing oBuf1; and so on.

CELL FIR speedup
On a PlayStation 3 (CELL with six accessible SPEs); input: ~6M samples; speedup compared to the scalar implementation on the PPE.
[Bar chart: scalar and SIMD speedup for 1, 2, 4 and 6 SPEs.]

Roofline model
Introduced by Samuel Williams and David Patterson.
Attainable performance (GFlops/sec) is plotted against operational intensity (Flops/Byte); below the ridge point the roof is set by memory bandwidth, above it by peak performance; the ridge point marks a balanced architecture for the given application.

Roofline model of the GT8800 GPU
[Roofline plot.]

Roofline model: effect of branch divergence
Threads of one warp diverge into different paths at a branch.

Roofline model: effect of non-coalesced access
In the G80 architecture, a non-coalesced global memory access is separated into 16 accesses.
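The roofline bound itself is easy to evaluate: attainable performance = min(peak performance, operational intensity x peak memory bandwidth). The sketch below tabulates this for a few operational intensities; the peak and bandwidth figures are assumed, ballpark G80-class values, not measurements from these slides.

    /* Roofline bound: min(peak GFLOP/s, OI (flop/byte) * peak bandwidth (GB/s)). */
    #include <stdio.h>

    static double roofline(double peak_gflops, double peak_bw_gbs, double oi) {
        double mem_bound = oi * peak_bw_gbs;
        return mem_bound < peak_gflops ? mem_bound : peak_gflops;
    }

    int main(void) {
        double peak = 345.0, bw = 86.4;   /* assumed peak compute and DRAM bandwidth */
        printf("ridge point = %.2f flop/byte\n", peak / bw);
        for (double oi = 0.25; oi <= 16.0; oi *= 2)
            printf("OI %5.2f flop/byte -> %7.1f GFLOP/s attainable\n",
                   oi, roofline(peak, bw, oi));
        return 0;
    }

Ceilings such as branch divergence or non-coalesced accesses, as on the slides above, can be modelled the same way by lowering the peak or the bandwidth term.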
"An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness", ISCA09, link ASCI Winterschool 2010 Henk Corporaal (114) Roofline Model If not enough threads to hide the memory latency, the memory latency could become the bottleneck. Samuel Williams, "Auto-tuning Performance on Multicore Computers", PhD Thesis, UC Berkeley, 2008, link S.ASCI Hong, et al. "An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness",Henk ISCA09, link Winterschool 2010 Corporaal (115) Four Architectures SRI / crossbar Crossbar SRI / crossbar 10.66 GB/s 10.66 GB/s 667MHz DDR2 DIMMs 90 GB/s 4MB Shared L2 (16 way) (64b interleaved) 2x64b memory controllers 667MHz DDR2 DIMMs 2x128b controllers 10.66 GB/s MT SPARC MT SPARC MT SPARC MT SPARC MT SPARC MT SPARC MT SPARC 4 Coherency Hubs 21.33 GB/s 10.66 GB/s 667MHz FBDIMMs SPE SPE SPE SPE SPE MFC 256K MFC 256K MFC 256K MFC 256K XDR memory controllers 25.6 GB/s 25.6 GB/s 512MB XDR DRAM VMT PPE 512K L2 Thread Cluster Thread Cluster Thread Cluster Thread Cluster Thread Cluster interconnect SPE MFC 256K SPE MFC 256K SPE SPE MFC 256K MFC 256K SPE MFC 256K SPE SPE MFC 256K MFC 256K EIB (ring network) XDR memory controllers 512MB XDR DRAM 90 GB/s NVIDIA G80 MFC 256K SPE MFC 256K BIF SPE MFC 256K <20GB/s (each direction) SPE MFC 256K EIB (ring network) BIF SPE MFC 256K 512K L2 179 GB/s 4MB Shared L2 (16 way) (64b interleaved) 2x128b controllers 667MHz FBDIMMs VMT PPE Crossbar 4 Coherency Hubs 21.33 GB/s IBM Cell Blade MT SPARC MT SPARC 2MB Shared quasi-victim (32 way) 179 GB/s 2x64b memory controllers MT SPARC 512KB victim MT SPARC 512KB victim MT SPARC 512KB victim MT SPARC 512KB victim 8 x 6.4 GB/s (1 per hub per direction) 2MB Shared quasi-victim (32 way) Opteron Opteron Opteron Opteron MT SPARC 512KB victim MT SPARC 512KB victim Sun Victoria Falls MT SPARC 512KB victim HyperTransport 512KB victim 4GB/s (each direction) Opteron Opteron Opteron Opteron HyperTransport AMD Barcelona Thread Cluster Thread Cluster Thread Cluster 192KB L2 (Textures only) 24 ROPs 6 x 64b memory controllers 86.4 GB/s 768MB 900MHz GDDR3 Device DRAM ASCI Winterschool 2010 Henk Corporaal (116) 32b Rooflines for the Four (in-core parallelism) AMD Barcelona Sun Victoria Falls 256 peak SP 128 mul / add imbalance 64 32 w/out SIMD 16 8 w/out ILP 4 Roofline models for the SMPs used in this work. 256 128 Based on micro- 64 32 peak SP 16 8 1/ 1/ 1 2 4 8 flop:DRAM byte ratio 4 2 Ceilings = 1/ 16 8 1/ 4 256 w/out FMA 128 w/out SIMD 32 16 w/out ILP 4 attainable Gflop/s (32b) peak SP 8 1/ 1/ 1 2 4 8 ASCI Winterschool 2010 flop:DRAM byte ratio 8 4 2 1 2 4 8 16 16 in-core parallelism Can the compiler find 512 peak SP 256 w/out FMA 128 64 32 all this parallelism ? NOTE: log-log scale Assumes perfect SPMD 16 8 4 1/ 2 NVIDIA G80 IBM Cell Blade 64 1/ flop:DRAM byte ratio 512 benchmarks, experience, and manuals 8 4 1/ attainable Gflop/s (32b) Single Precision 512 attainable Gflop/s (32b) attainable Gflop/s (32b) 512 1/ 8 1/ 1/ 1 2 4 8 flop:DRAM byte ratio 4 2 16 Henk Corporaal (117) Let's conclude: Trends Reliability + Fault Tolerance Requires run-time management, process migration Power is the new metric Low power management at all levels - Scenarios - subthreshold, back biasing, …. 
Let's conclude: trends
Reliability + fault tolerance: requires run-time management and process migration.
Power is the new metric: low-power management at all levels; scenarios; subthreshold operation, back biasing, ...
Virtualization (1): do not disturb other applications; composability.
Virtualization (2): one virtual target platform avoids the porting problem; one intermediate format supporting multiple targets; huge run-time management support, JIT compilation, multiple OSes.
Compute servers.
Transactional memory.
3D: integrate different dies.

3D using Through-Silicon Vias (TSVs)
Can enlarge the device area.
Using TSVs: face-to-back (scalable); 4 um pitch in 2011 (ITRS 2007).
Flip-chip: face-to-face (limited to 2 die tiers).
(From Woo et al., HPCA 2009.)

Don't forget Amdahl
However, see the next slide!

Trends: homogeneous vs heterogeneous, where do we go?
Homogeneous: easier to program; favored by DLP / vector parallelism; fault tolerance / task migration.
Heterogeneous: energy efficiency demands; higher speedup, Amdahl++ (see Hill and Marty, HPCA'08, on Amdahl's law in the multi-core era).
Memory-dominated designs suggest a homogeneous sea of heterogeneous cores.
A sea of reconfigurable compute or processor blocks? Many examples: Smart Memory, SmartCell, PicoChip, MathStar FPOA, Stretch, XPP, etc.

How does a future architecture look?
A couple of high-performance (low-latency) cores: sequential code should also run fast.
Add a whole battery of wide vector processors.
Some shared memory (to reduce copying of large data structures); levels 2 and 3 in 3D technology; huge bandwidth; exploit large vectors.
Accelerators for dedicated domains.
OS support (runtime mapping, DVFS, use of accelerators).

But the real problem is ...
Programming in parallel is the real bottleneck; new programming models are needed, like transaction-based programming.
That's what we will talk about this week...
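As a closing back-of-the-envelope check of the "Don't forget Amdahl" point above, with an assumed parallel fraction f = 0.9 on n = 64 cores:

    \[
      S(n) \;=\; \frac{1}{(1-f) + f/n}
           \;=\; \frac{1}{0.1 + 0.9/64} \;\approx\; 8.8
    \]

Even with 90% of the work parallelized, 64 cores buy less than a 9x speedup, which is why the sequential-performance and heterogeneity (Amdahl++) arguments of Hill and Marty matter.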