Sorting with multicore SIMD processors
Martijn van den Heuvel – 0547028
Bram Kersten – 0537059
December 2007

Presentation Contents:
•Introduction
•Sorting basics
•Parallel sorting using SIMD
•AA-Sort algorithm
  •Overview
  •In-core algorithm
  •Out-of-core algorithm
•Experimental results
•Paper discussion
•Questions

Introduction:
AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors (IEEE, September 2007)
•Hiroshi Inoue
•Takao Moriyama
•Hideaki Komatsu
•Toshio Nakatani
IBM Tokyo Research Laboratory
AA-Sort: Aligned-Access sort

Sorting basics:
Sorting is very important.
Algorithms:
•Bubblesort
•CombSort
•MergeSort
•QuickSort
Bottlenecks in sorting algorithms:
•Branch mispredictions
•Most popular algorithms are not suitable for exploiting SIMD instructions

Parallel sorting using SIMD:
•Divide the data into smaller blocks that fit into a processor's cache.
•Use aligned memory access.
•Use SIMD instructions:
  •Vector compare
  •Vector select
  •Vector permutation
•Merge all blocks.
•No branch mispredictions…

AA-Sort overview:
AA-Sort executes 3 phases:
•Divide all of the data to be sorted into blocks that fit in the cache or the local memory of the processor.
•Sort each block with the in-core sorting algorithm, in parallel across multiple threads, where each thread processes an independent block.
•Merge the sorted blocks with the out-of-core sorting algorithm, using multiple threads.
Assumptions:
•Processors have 128-bit SIMD registers.
•The keys are 32-bit integers, so each register holds four keys.

AA-Sort in-core algorithm:
•In-vector sorting using SIMD instructions:
  •Vector compare
  •Vector select
  Before: 5 2 7 4 | 3 1 0 6 | 9 …
  After:  2 4 5 7 | 0 1 3 6 | 9 …
•Transpose the registers.
  Before transpose:      After transpose:
   3  6 10 14             3  2  0  5
   2  4  8 11             6  4  1  9
   0  1  6 13            10  8  6 12
   5  9 12 15            14 11 13 15
•Apply a modified version of combSort to the transposed registers. Each step below shows the registers before (left) and after (right).
  Gap = 3, Vector_cmpswap:
   3  2  0  5    →    3  2  0  5
   6  4  1  9         6  4  1  9
  10  8  6 12        10  8  6 12
  14 11 13 15        14 11 13 15
  Gap = 3, Vector_cmpswap_skew *3:
   3  2  0  5    →    3  6  4  1
   6  4  1  9         2 10  8  6
  10  8  6 12         0 14 11 13
  14 11 13 15         5  9 12 15
  Gap = 2, Vector_cmpswap:
   3  6  4  1    →    0  6  4  1
   2 10  8  6         2  9  8  6
   0 14 11 13         3 14 11 13
   5  9 12 15         5 10 12 15
  Gap = 2, Vector_cmpswap_skew:
   0  6  4  1    →    0  6 14 11
   2  9  8  6         2  9 10 12
   3 14 11 13         3  4  1 13
   5 10 12 15         5  8  6 15
  Gap = 1, Vector_cmpswap:
   0  6 14 11    →    0  6 10 11
   2  9 10 12         2  4  1 12
   3  4  1 13         3  8  6 13
   5  8  6 15         5  9 14 15
  Gap = 1, Vector_cmpswap_skew:
   0  6 10 11    →    0  6 10 14
   2  4  1 12         2  4  1 12
   3  8  6 13         3  8  6 13
   5  9 14 15         5  9 11 15
  Gap = 1, Vector_cmpswap:
   0  6 10 14    →    0  4  1 12
   2  4  1 12         2  6  6 13
   3  8  6 13         3  8 10 14
   5  9 11 15         5  9 11 15
  Gap = 1, Vector_cmpswap_skew:
   0  4  1 12    →    0  5  9 12
   2  6  6 13         2  6  6 13
   3  8 10 14         3  8 10 14
   5  9 11 15         4  1 11 15
  Gap = 1, Vector_cmpswap*:
   0  5  9 12    →    0  4  8 12
   2  6  6 13         1  5  9 13
   3  8 10 14         2  6 10 14
   4  1 11 15         3  6 11 15
•Transpose back to the original order.
  Before transpose:      After transpose:
   0  4  8 12             0  1  2  3
   1  5  9 13             4  5  6  6
   2  6 10 14             8  9 10 11
   3  6 11 15            12 13 14 15

AA-Sort out-of-core algorithm:
Odd-even merge is implemented with SIMD instructions.
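The odd-even merge used by the out-of-core algorithm can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: it assumes Batcher's odd-even merge network for two sorted runs of four keys (one 128-bit register each), and the names STAGES and odd_even_merge are hypothetical. Python's min and max stand in for the vector compare and vector select instructions.

```python
# Batcher's odd-even merge network for two sorted runs of four keys each.
# Each stage lists independent (low, high) index pairs, so a SIMD unit could
# handle one stage with a single vector compare plus vector select
# (after a vector permutation to line the operands up).
STAGES = [
    [(0, 4), (1, 5), (2, 6), (3, 7)],
    [(2, 4), (3, 5)],
    [(1, 2), (3, 4), (5, 6)],
]


def odd_even_merge(a, b):
    """Merge two sorted 4-element lists into one sorted 8-element list."""
    if len(a) != 4 or len(b) != 4:
        raise ValueError("this sketch handles exactly four keys per input")
    v = list(a) + list(b)
    for stage in STAGES:
        for lo, hi in stage:
            # The same pairs are compared whatever the data is, so the
            # sequence of operations does not depend on the key values.
            v[lo], v[hi] = min(v[lo], v[hi]), max(v[lo], v[hi])
    return v


if __name__ == "__main__":
    print(odd_even_merge([1, 4, 6, 9], [0, 2, 3, 8]))
    # prints [0, 1, 2, 3, 4, 6, 8, 9]
```

Because the comparison pattern is fixed, the merge avoids the data-dependent branches that the deck lists as a bottleneck of conventional sorting algorithms.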
AA-Sort experimental results:
[Result figures not included in this text]

Strong points:
•Good scalability
•Use of data locality
•Data independent
•Tested on up-to-date hardware
•Convincing and clear paper
•Clear use of pseudocode

Weak points:
•GPUTeraSort is optimized for GPUs, so a comparison on a GPU would be nice.
•We don't expect GPUTeraSort to outperform AA-Sort.
•In-core results depend on heuristics (shrink factor).

Applicability:
•Searching
•Database management systems
•Scientific applications
•Depth buffer

The future:
•Integration in compilers
•Scalability with even more processor cores

Sources:
•AA-Sort: A New Parallel Sorting Algorithm for Multi-Core SIMD Processors. Hiroshi Inoue, Takao Moriyama, Hideaki Komatsu and Toshio Nakatani. IBM Tokyo Research Laboratory (Sep 2007)
•Using SIMD Registers and Instructions to Enable Instruction-Level Parallelism in Sorting Algorithms. Timothy Furtak, José Nelson Amaral and Robert Niewiadomski. Department of Computing Science, University of Alberta (Jun 2007)
•GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. Naga K. Govindaraju, Jim Gray, Ritesh Kumar and Dinesh Manocha. (Jun 2006)
•Odd-Even mergesort, www.iti.fhflensburg.de/lang/algorithmen/sortieren/networks/oemen.htm (Dec 2007)

Questions:
???