Active Routing for Near-Data Processing in Memory Networks

Slide Note

Explore the concept of active routing and its role in optimizing data movement and computation in memory networks. Motivated by the need for efficient processing of large datasets, this research delves into architecture, implementation, and enhancements of active routing. By leveraging near-data processing and in-network computing techniques, the study aims to reduce energy consumption, improve memory throughput, and enhance processor bandwidth.

babyboygu Follow

Uploaded on Sep 15, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Active-Routing: Compute on the Way for Near-Data Processing Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder Sungkeun Kim, Rahul Boyapati, Ki Hwan Yum and EJ Kim

Outline Motivation Active-Routing Architecture Implementation Enhancements in Active-Routing Evaluation Conclusion 2

Motivation

Data Is Exploding Graph processing (social networks) Requires more memory to process big data Deep learning (NLP) [Hestess et al. 2019] 4 Ref: https://griffsgraphs.files.wordpress.com/2012/07/facebook-network.png

Demand More Memory 3D die-stacked memory [Loh ISCA 08] HMC and HBM Denser capacity higher throughput Vault Controller Vault Controller DRAM layer Logic Layer Intra-Cube Network Vault I/O I/O I/O I/O Memory network[Kim et al. PACT 13, Zhan et al. MICRO 16] Scalable memory capacity Better processor bandwidth 5

Enormous Data Movement Is Expensive Stall processor computation Consume energy Active-Routing for dataflow execution in memory network Active-Routing for dataflow execution in memory network Reduce data movement and more flexible Reduce data movement and more flexible Exploit memory throughput and network concurrency Active-Routing for dataflow execution in memory network Near-data processing to reduce data movement Processing-in-memory (PIM) PIM-Enabled Instruction (PEI) [Ahn et al. ISCA 2015] C = A x B (Can we bring less data?) In-network computing Compute in router switches [Panda IPPS 95, Chen et al. SC 11] MAERI [Kwon et al. ASPLOS 18] 6

Active-Routing Architecture

System Architecture HMC Interface O3core Network Network-on-Chip HMC Controller Cache Host CPU Interface O3core Network Cache Host CPU Memory Network 8

Active-Routing Flow Active-Routing tree dataflow for compute kernel Ai Bi Compute kernel example for (i = 0; i < n; i++) { sum += *Ai *Bi; } Host CPU Reduction over intermediate results Active-Routing tree dataflow 9

Active-Routing Three-Phase Processing Active-Routing Tree Construction Ai Bi Update Packet for (i = 0; i < n; i++) { sum += *Ai *Bi; } Host CPU 10

Active-Routing Three-Phase Processing Active-Routing Tree Construction Update Phase for data processing Ai Bi for (i = 0; i < n; i++) { sum += *Ai *Bi; } Host CPU Bk Operand response Bk Operand request Ak Operand response Ak Operand request 11

Active-Routing Three-Phase Processing Active-Routing Tree Construction Update Phase for data processing Gather Phase for tree reduction Ai Bi Gather request Gather request Gather request for (i = 0; i < n; i++) { sum += *Ai *Bi; } Gather request Gather request Host CPU Gather request 12

Implementation

Programming interface and ISA extension Extended PIM Instructions Active-Routing Execution API Update(void *src1, void *src2, void *target, int op); Gather(void *target, int num_threads); for (i = 0; i < n; i++) { Update(Ai, Bi, &sum, MAC); } Gather(&sum, 16); for (i = 0; i < n; i++) { sum += *Ai *Bi; } 14

Programming interface and ISA extension Extended PIM Instructions Active-Routing Execution API Update(void *src1, void *src2, void *target, int op); Gather(void *target, int num_threads); Offloading logic in network interface Dedicated registers for offloading information Convert to Update/Gather packets 15

Active-Routing Engine Packet Processing Unit Flow Table ALU Operand Buffers Controller Vault Vault Active-Routing Vault Controller Controller Vault Controller Engine DRAM layer Logic Layer Logic Layer Intra-Cube Network Intra-Cube Network Vault I/O I/O I/O I/O I/O I/O I/O I/O 16

Packet Processing Unit Packet Processing Unit Flow Table ALU Operand Buffers Process Update/Gather packets Schedule corresponding actions 17

Flow Table Packet Processing Unit Flow Table ALU Operand Buffers Flow Table Entry 64-bit 6-bit 64-bit 64-bit 64-bit 2-bit 4-bit 1-bit flowID opcode result req_counter resp_counter parent children flags Gflag 18

Flow Table Packet Processing Unit Flow Table ALU Operand Buffers Flow Table Entry 64-bit 6-bit 64-bit 64-bit 64-bit 2-bit 4-bit 1-bit flowID opcode result req_counter resp_counter parent children flags Gflag 19

Flow Table Packet Processing Unit Flow Table ALU Operand Buffers Flow Table Entry 64-bit 6-bit 64-bit 64-bit 64-bit 2-bit 4-bit 1-bit flowID opcode result req_counter resp_counter parent children flags Gflag Maintain tree structure 20

Flow Table Packet Processing Unit Flow Table ALU Operand Buffers Flow Table Entry 64-bit 6-bit 64-bit 64-bit 64-bit 2-bit 4-bit 1-bit flowID opcode result req_counter resp_counter parent children flags Gflag Maintain tree structure Keep track of state information of each flow 21

Flow Table Packet Processing Unit Flow Table ALU Operand Buffers Flow Table Entry 64-bit 6-bit 64-bit 64-bit 64-bit 2-bit 4-bit 1-bit flowID opcode result req_counter resp_counter parent children flags Gflag Maintain tree structure Keep track of state information of each flow 22

Operand Buffers Packet Processing Unit Flow Table ALU Operand Buffers Operand Buffer Entry 64-bit 64-bit 1-bit 64-bit 1-bit flowID op_value1 op_ready1 op_value2 op_ready2 23

Operand Buffers Packet Processing Unit Flow Table ALU Operand Buffers Operand Buffer Entry 64-bit 64-bit 1-bit 64-bit 1-bit flowID op_value1 op_ready1 op_value2 op_ready2 Shared temporal storage 24

Operand Buffers Packet Processing Unit Flow Table ALU Operand Buffers Operand Buffer Entry 64-bit 64-bit 1-bit 64-bit 1-bit flowID op_value1 op_ready1 op_value2 op_ready2 Shared temporal storage Fire for computation in dataflow 25

Enhancements in Active-Routing

One Tree Per Flow One tree from one memory port Deep tree Congestion Host CPU Deep tree Congestion at memory port 27

Multiple Trees Per Flow Build multiple trees Thread 1 Thread 0 Host CPU Thread 3 Thread 2 ART-tid: interleave the thread ID ART-addr: nearest port based on operands address 28

Exploit Memory Access Locality Pure reduction Irregular (random) Regular for (i = 0; i < n; i++) { sum += *Ai; } Reduction on intermediate results Irregular-Irregular (II) Regular-Irregular (RI) Regular-Regular (RR) for (i = 0; i < n; i++) { sum += *Ai *Bi; } Offload cache block granularity for regular accesses 29

Evaluation

Methodology Compared techniques HMC Baseline PIM-Enabled Instruction (PEI) Active-Routing-threadID (ART-tid) Active-Routing-address (ART-addr) Tools: Pin, McSimA+ and CasHMC System configurations 16 O3 cores at 2 GHz 16 memory cubes in dragonfly topology Minimum routing with virtual cut-through Active-Routing Engine 1250 MHz, 16 flow table entries, 128 operand entries 31

Workloads Benchmarks (graph app, ML kernels, etc.) backprop lud pagerank sgemm spmv Microbenchmarks reduce (sum reduction) rand_reduce mac (multipy-and-accumulate) rand_mac 32

Comparison of Enhancements in Active-Routing Normalized Speedup over HMC Baseline ART-Na ve ART-tid ART-addr 1.5 Normalized Runtime Speedup 1 0.5 0 -0.5 (log) -1 -1.5 33

Comparison of Enhancements in Active-Routing Normalized Speedup over HMC Baseline ART-Na ve ART-tid ART-addr 1.5 Normalized Runtime Speedup 1 0.5 0 -0.5 (log) -1 -1.5 34

Comparison of Enhancements in Active-Routing Normalized Speedup over HMC Baseline ART-Na ve ART-tid ART-addr 1.5 Normalized Runtime Speedup 1 0.5 0 -0.5 Multiple trees and cache-block grained offloading effectively make ART much better. (log) -1 -1.5 35

Benchmark Performance PEI ART-tid ART-addr 8 Normalized Runtime 6 Speedup 4 2 0 36

Benchmark Performance PEI ART-tid ART-addr 8 Normalized Runtime 6 Speedup 4 2 0 37

Benchmark Performance PEI ART-tid ART-addr 8 Normalized Runtime 6 Imbalance computations Speedup 4 2 0 In general, ART-addr > ART-tid > PEI 38

Analysis of spmv Compute Point Distribution Operand Distribution ART-tid ART-addr 0 4E+05 39

Benchmark Performance PEI ART-tid ART-addr PEI cache thrashing C = A x B 8 Normalized Runtime 6 Speedup 4 2 0 40

Microbenchmark Performance PEI ART-tid ART-addr 60 Normalized Runtime 50 ART-addr > ART-tid > PEI 40 Speedup 30 20 10 0 41

Energy-Delay Product ART-tid ART-addr 0.5 Normalized Energy- Delay-Product (log) 0 -0.5 -1 Reduce EDP by 80% on average -1.5 -2 42

Dynamic Offloading Case Study (lud) 2500 2.5 Thousands ART-tid 2000 2 ART-tid-adaptive 1.5 Speedup 1500 Cycles First Phase 1 1000 0.5 500 0 0 Second Phase 0 20 41 60 80 100 120 Phase ART-tid-adaptive: dynamic offloading based on locality and reuse 43

Conclusion Propose Active-Routingin-network computing architecture which computes near-data in the memory network in data- flow style Present a three-phase processing procedure for Active- Routing Categorize memory access patterns to exploit the locality and offload computations in various granularities Active-Routing achieves up to 7x speedup with 60% average performance improvement and reduce energy- delay product by 80% on average 44

Thank You & Questions Jiayi Huang jyhuang@cse.tamu.edu

Active-Routing: Compute on the Way for Near-Data Processing Jiayi Huang, Ramprakash Reddy Puli, Pritam Majumder Sungkeun Kim, Rahul Boyapati, Ki Hwan Yum and EJ Kim

Backup

API Extensions UpdateRR(void *src1, void *src2, void *target, int op); Offload both operands in cache block granularity UpdateRI(void *src1, void *src2[], void *target, int op); Fetch irregular data then offload to regular operands location for cache-block granularity processing UpdateII(void *src1, void *src2, void *target, int op); Offload both operands in single element granularity Update(type src1, type src2, void *target, int op); Support general PIM Gather(void *target, int num_threads); As explicit barrier 48

Workloads (Optimization Region and Dataset) Benchmarks backprop: activation calculation in feedforward pass (2M hidden units) lud: upper and lower triangular matrix decomposition (4K matrix dimension) pagerank: ranking score calculation (web-Google graph from SNAP Dataset) sgemm: matrix multiplication (4K x 4K matrix) spmv: matrix-vector multiplication loop (4K dimension with 0.7 sparsity) Microbenchmarks reduce: sum reduction over a sequential vector (64K dimension) rand_reduce: sum reduction over random elements (64K dimension) mac: multipy-and-accumulate over two sequential vectors (2 64K-D vectors) rand_mac: multiply-and-accumulate over two random element lists (2 64K-element lists) 49

On/Off-Chip Data Movement normal request normal response Fine-grained active request active response PEI cache thrashing 1.25 offloading overhead Normalized On/Off-chip Data 1 0.75 0.5 0.25 Movement 0 PEI ART-addr ART-addr ART-addr ART-addr ART-addr PEI PEI PEI PEI ART-tid ART-tid ART-tid ART-tid ART-tid backprop lud pagerank sgemm spmv 50

Active Routing for Near-Data Processing in Memory Networks

Download Presentation

Presentation Transcript

Related

More Related Content