Active Routing for Near-Data Processing in Memory Networks
This presentation explores active routing and its role in optimizing data movement and computation in memory networks. Motivated by the need to process large datasets efficiently, it covers the architecture, implementation, and enhancements of Active-Routing, which leverages near-data processing and in-network computing to reduce energy consumption, exploit memory throughput, and improve processor bandwidth.
Presentation Transcript
Active-Routing: Compute on the Way for Near-Data Processing Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder, Sungkeun Kim, Rahul Boyapati, Ki Hwan Yum and EJ Kim
Outline: Motivation, Active-Routing Architecture, Implementation, Enhancements in Active-Routing, Evaluation, Conclusion
Data Is Exploding Graph processing (social networks) and deep learning (NLP) [Hestness et al. 2019] require ever more memory to process big data. Ref: https://griffsgraphs.files.wordpress.com/2012/07/facebook-network.png
Demand for More Memory 3D die-stacked memory [Loh ISCA 08], such as HMC and HBM, offers denser capacity and higher throughput (figure: an HMC cube with DRAM layers stacked on a logic layer of vault controllers, an intra-cube network, and I/O links). Memory networks [Kim et al. PACT 13, Zhan et al. MICRO 16] add scalable memory capacity and better processor bandwidth.
Enormous Data Movement Is Expensive It stalls processor computation and consumes energy. Near-data processing reduces data movement: processing-in-memory (PIM), e.g., PIM-Enabled Instructions (PEI) [Ahn et al. ISCA 2015]. For C = A x B, can we bring less data? In-network computing computes in router switches [Panda IPPS 95, Chen et al. SC 11] and in reconfigurable accelerator fabrics such as MAERI [Kwon et al. ASPLOS 18]. Active-Routing combines both for dataflow execution in the memory network: it reduces data movement, is more flexible, and exploits memory throughput and network concurrency.
System Architecture (figure) Host CPUs with O3 cores, caches, and a network-on-chip connect through HMC controllers and HMC interfaces to the memory network of cubes.
Active-Routing Flow An Active-Routing tree carries the dataflow for a compute kernel. Compute kernel example: for (i = 0; i < n; i++) { sum += (*Ai) * (*Bi); } The tree is rooted at the host CPU and spans the memory network, performing the reduction over intermediate results as the dataflow converges.
Active-Routing Three-Phase Processing (1) Active-Routing tree construction: for the kernel for (i = 0; i < n; i++) { sum += (*Ai) * (*Bi); }, Update packets issued by the host CPU build the tree as they route toward the cubes holding Ai and Bi (construction sketch below).
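To make the construction step concrete, here is a minimal C sketch of what one router on the path might record as an Update packet passes through; the FlowEntry fields and the lookup helper are assumptions for illustration, not the paper's exact design.

    #include <stdint.h>

    /* Illustrative per-router flow state for tree construction; a subset
     * of the full flow table entry shown later. */
    typedef struct {
        uint64_t flow_id;
        int      parent;      /* input port leading back toward the root */
        unsigned children;    /* bitmap of output ports with subtrees    */
    } FlowEntry;

    FlowEntry *flow_lookup_or_alloc(void *router, uint64_t flow_id);

    /* As an Update packet routes hop by hop toward the cube owning its
     * operands, each router on the path registers the flow: the arrival
     * port becomes the parent link, the departure port a child link. */
    void route_update_hop(void *router, uint64_t flow_id,
                          int in_port, int out_port) {
        FlowEntry *f = flow_lookup_or_alloc(router, flow_id);
        f->parent    = in_port;
        f->children |= 1u << out_port;
    }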
Active-Routing Three-Phase Processing (2) Update phase for data processing: along the tree, each engine issues operand requests for Ak and Bk and consumes the operand responses near the data.
Active-Routing Three-Phase Processing (3) Gather phase for tree reduction: gather requests propagate from the host CPU down the tree, and partial results are reduced back up to the root.
Programming Interface and ISA Extension Extended PIM instructions expose the Active-Routing execution API: Update(void *src1, void *src2, void *target, int op); Gather(void *target, int num_threads); The kernel for (i = 0; i < n; i++) { sum += (*Ai) * (*Bi); } becomes: for (i = 0; i < n; i++) { Update(Ai, Bi, &sum, MAC); } Gather(&sum, 16);
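To show the call pattern end to end, here is a minimal sketch of one of 16 threads offloading its slice of a dot product. Update and Gather are the extended instructions above and are only declared (they need the modified hardware); the MAC opcode value and the slice bounds are illustrative assumptions.

    /* Call-pattern sketch for one of 16 threads. */
    extern void Update(void *src1, void *src2, void *target, int op);
    extern void Gather(void *target, int num_threads);

    #define MAC 0                 /* hypothetical opcode encoding */

    double sum = 0.0;             /* reduction target shared by all threads */

    void dot_product_slice(double *A, double *B, int lo, int hi) {
        for (int i = lo; i < hi; i++)   /* offload each multiply-accumulate */
            Update(&A[i], &B[i], &sum, MAC);
        Gather(&sum, 16);               /* explicit barrier: wait for tree */
    }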
Programming Interface and ISA Extension (cont.) Offloading logic in the network interface keeps the offloading information in dedicated registers and converts Update/Gather calls into Update/Gather packets.
Active-Routing Engine (figure) Each cube's logic layer integrates an Active-Routing engine next to the vault controllers, intra-cube network, and I/O links. The engine comprises a packet processing unit, a flow table, an ALU, and operand buffers.
Packet Processing Unit Processes Update/Gather packets and schedules the corresponding actions on the flow table, ALU, and operand buffers.
Flow Table Each flow table entry holds: flowID (64-bit), opcode (6-bit), result (64-bit), req_counter (64-bit), resp_counter (64-bit), parent (2-bit), children flags (4-bit), Gflag (1-bit). The parent and children fields maintain the tree structure; the remaining fields keep track of the state information of each flow (struct sketch below).
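As a reading aid, here is the entry rendered as a C struct. The field widths follow the slide; the C packing and the completion rule in subtree_ready() are assumptions about how the counters and Gflag would be used.

    #include <stdbool.h>
    #include <stdint.h>

    /* One flow table entry; real hardware packs these fields into
     * registers rather than a C struct. */
    typedef struct {
        uint64_t flow_id;        /* 64-bit: identifies the flow (one tree) */
        uint8_t  opcode;         /*  6-bit: reduction operation, e.g. MAC  */
        uint64_t result;         /* 64-bit: partial result of this subtree */
        uint64_t req_counter;    /* 64-bit: operand requests issued        */
        uint64_t resp_counter;   /* 64-bit: operand responses consumed     */
        uint8_t  parent;         /*  2-bit: link back toward the tree root */
        uint8_t  children_flags; /*  4-bit: which child links are active   */
        bool     gflag;          /*  1-bit: a Gather request has arrived   */
    } FlowTableEntry;

    /* Assumed completion rule: once the Gather request has arrived and
     * every issued operand request has been answered, this node's partial
     * result can be reduced toward the parent (children are tracked via
     * children_flags and their own replies). */
    static inline bool subtree_ready(const FlowTableEntry *f) {
        return f->gflag && f->resp_counter == f->req_counter;
    }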
Operand Buffers Each operand buffer entry holds: flowID (64-bit), op_value1 (64-bit), op_ready1 (1-bit), op_value2 (64-bit), op_ready2 (1-bit). The buffers are shared temporary storage; an entry fires for computation in dataflow fashion once both operands are ready (sketch below).
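A minimal sketch of the entry and its firing condition, following the field widths on the slide; the struct layout and the ready_to_fire() helper are illustrative.

    #include <stdbool.h>
    #include <stdint.h>

    /* One operand buffer entry, shared across flows. */
    typedef struct {
        uint64_t flow_id;    /* flow this operand pair belongs to */
        uint64_t op_value1;  /* first operand, e.g. *Ai           */
        bool     op_ready1;
        uint64_t op_value2;  /* second operand, e.g. *Bi          */
        bool     op_ready2;
    } OperandEntry;

    /* Dataflow firing: compute as soon as both operands have arrived,
     * regardless of the order in which the responses came back. */
    static inline bool ready_to_fire(const OperandEntry *e) {
        return e->op_ready1 && e->op_ready2;
    }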
One Tree Per Flow A single tree rooted at one memory port grows deep and causes congestion at that port.
Multiple Trees Per Flow Build multiple trees per flow; threads (0-3 in the figure) root trees at different memory ports around the host CPU. ART-tid interleaves tree roots by thread ID; ART-addr picks the port nearest to the operands' addresses (selection sketch below).
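A sketch of how the tree root (memory port) might be chosen under each policy; NUM_PORTS and nearest_port_of() are hypothetical stand-ins for the host's port count and its address-to-cube mapping.

    #define NUM_PORTS 4

    int nearest_port_of(const void *addr);  /* port closest to owning cube */

    /* ART-tid: interleave trees across ports by thread ID. */
    int art_tid_port(int thread_id) {
        return thread_id % NUM_PORTS;
    }

    /* ART-addr: root the tree at the port nearest the operand data
     * (a real policy might weigh both operands; src1 is used here). */
    int art_addr_port(const void *src1, const void *src2) {
        (void)src2;
        return nearest_port_of(src1);
    }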
Exploit Memory Access Locality A pure reduction, for (i = 0; i < n; i++) { sum += *Ai; }, may access Ai regularly or irregularly (randomly). A reduction on intermediate results, for (i = 0; i < n; i++) { sum += (*Ai) * (*Bi); }, falls into the Irregular-Irregular (II), Regular-Irregular (RI), or Regular-Regular (RR) pattern. Regular accesses are offloaded at cache-block granularity.
Methodology Compared techniques: HMC baseline, PIM-Enabled Instructions (PEI), Active-Routing-threadID (ART-tid), and Active-Routing-address (ART-addr). Tools: Pin, McSimA+ and CasHMC. System configuration: 16 O3 cores at 2 GHz; 16 memory cubes in a dragonfly topology with minimal routing and virtual cut-through switching; Active-Routing engine at 1250 MHz with 16 flow table entries and 128 operand buffer entries.
Workloads Benchmarks (graph apps, ML kernels, etc.): backprop, lud, pagerank, sgemm, spmv. Microbenchmarks: reduce (sum reduction), rand_reduce, mac (multiply-and-accumulate), rand_mac.
Comparison of Enhancements in Active-Routing (figure) Normalized runtime speedup (log scale) over the HMC baseline for ART-Naïve, ART-tid, and ART-addr. Takeaway: multiple trees and cache-block-grained offloading make ART much more effective.
Benchmark Performance (figure) Normalized runtime speedup over the HMC baseline for PEI, ART-tid, and ART-addr. Imbalanced computation limits some workloads; in general, ART-addr > ART-tid > PEI.
Analysis of spmv (figure) Distribution of compute points and operands across the network under ART-tid versus ART-addr.
Benchmark Performance (figure, cont.) For the C = A x B kernel (sgemm), PEI suffers cache thrashing.
Microbenchmark Performance (figure) Normalized runtime speedup for PEI, ART-tid, and ART-addr; again ART-addr > ART-tid > PEI.
Energy-Delay Product (figure) Normalized energy-delay product (log scale) for ART-tid and ART-addr: Active-Routing reduces EDP by 80% on average.
Dynamic Offloading Case Study: lud (figure) Per-phase cycle counts and speedup for ART-tid versus ART-tid-adaptive across the first and second computation phases. ART-tid-adaptive performs dynamic offloading based on locality and reuse (heuristic sketch below).
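The slide does not spell out the adaptive policy, so the following is a hypothetical sketch of a locality/reuse test that decides between offloading and on-chip execution; the RegionStats counters and the threshold are assumptions for illustration only.

    #include <stdbool.h>

    /* Hypothetical reuse statistics for a data region in one phase. */
    typedef struct {
        unsigned touches;   /* accesses to the region this phase          */
        unsigned reuses;    /* accesses hitting recently used cache blocks */
    } RegionStats;

    bool should_offload(const RegionStats *s) {
        /* Offload only when data shows little reuse: heavily reused data
         * is cheaper to compute on-chip, where the caches capture it. */
        return s->touches > 0 && 2 * s->reuses < s->touches;
    }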
Conclusion We propose Active-Routing, an in-network computing architecture that computes near data in the memory network in dataflow style. We present a three-phase processing procedure for Active-Routing and categorize memory access patterns to exploit locality, offloading computations at various granularities. Active-Routing achieves up to 7x speedup with a 60% average performance improvement and reduces energy-delay product by 80% on average.
Thank You & Questions Jiayi Huang jyhuang@cse.tamu.edu
API Extensions UpdateRR(void *src1, void *src2, void *target, int op); offloads both operands at cache-block granularity. UpdateRI(void *src1, void *src2[], void *target, int op); fetches the irregular data, then offloads it to the regular operand's location for cache-block-granularity processing. UpdateII(void *src1, void *src2, void *target, int op); offloads both operands at single-element granularity. Update(type src1, type src2, void *target, int op); supports general PIM. Gather(void *target, int num_threads); acts as an explicit barrier. A usage sketch follows below.
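A usage sketch contrasting the RR and II variants on the multiply-accumulate kernel; the signatures mirror the API above, while the MAC opcode value and the BLOCK factor (doubles per 64-byte cache block) are assumptions.

    extern void UpdateRR(void *src1, void *src2, void *target, int op);
    extern void UpdateII(void *src1, void *src2, void *target, int op);

    #define MAC   0     /* hypothetical opcode encoding */
    #define BLOCK 8     /* doubles per 64-byte cache block */

    double sum = 0.0;

    /* Regular-Regular: both streams are sequential, so one UpdateRR
     * covers a whole cache block of operand pairs. */
    void mac_regular(double *A, double *B, int n) {
        for (int i = 0; i < n; i += BLOCK)
            UpdateRR(&A[i], &B[i], &sum, MAC);
    }

    /* Irregular-Irregular: randomly indexed operands are offloaded one
     * element at a time. */
    void mac_irregular(double *A, double *B, const int *idx, int n) {
        for (int i = 0; i < n; i++)
            UpdateII(&A[idx[i]], &B[idx[i]], &sum, MAC);
    }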
Workloads (Optimization Region and Dataset) Benchmarks: backprop: activation calculation in the feedforward pass (2M hidden units); lud: upper and lower triangular matrix decomposition (4K matrix dimension); pagerank: ranking score calculation (web-Google graph from the SNAP dataset); sgemm: matrix multiplication (4K x 4K matrices); spmv: sparse matrix-vector multiplication loop (4K dimension with 0.7 sparsity). Microbenchmarks: reduce: sum reduction over a sequential vector (64K dimension); rand_reduce: sum reduction over random elements (64K dimension); mac: multiply-and-accumulate over two sequential vectors (two 64K-D vectors); rand_mac: multiply-and-accumulate over two random element lists (two 64K-element lists).
On/Off-Chip Data Movement (figure) Normalized on/off-chip data movement for PEI, ART-tid, and ART-addr on backprop, lud, pagerank, sgemm, and spmv, broken down into normal requests/responses and active requests/responses. Fine-grained offloading adds overhead, and PEI's cache thrashing inflates its data movement.