Revitalizing GPU for Packet Processing Acceleration

"Explore the potential of GPU-accelerated networked systems for executing parallel packet operations with high power and bandwidth efficiency. Discover how GPU benefits from memory access latency hiding and compare CPU vs. GPU memory access hiding. Uncover the contributions of GPUs in packet processing algorithms and the performance advantages in integrated GPU systems. Dive into the world of discrete GPUs communicating with CPUs via PCIe lanes for enhanced computation power and memory bandwidth."

  • GPU acceleration
  • Packet processing
  • Networked systems
  • Memory access hiding
  • Discrete GPU


Presentation Transcript


  1. APUNet: Revitalizing GPU as Packet Processing Accelerator
     Younghwan Go, Muhammad Asim Jamshed, YoungGyoun Moon, Changho Hwang, and KyoungSoo Park
     School of Electrical Engineering, KAIST

  2. GPU-accelerated Networked Systems
     Execute the same or similar operations on each packet in parallel: high parallelization power, large memory bandwidth.
     Improvements shown in a number of research works: PacketShader [SIGCOMM 10], SSLShader [NSDI 11], Kargus [CCS 12], NBA [EuroSys 15], MIDeA [CCS 11], DoubleClick [APSys 12].
     [Figure: packets flowing from the CPU to the GPU for parallel processing]

  3. Source of GPU Benefits
     GPU acceleration mainly comes from memory access latency hiding: on a memory I/O, the GPU quickly context-switches to another thread so execution continues while the load completes in the background (sketched below).
     [Figure: two GPU threads interleaving "a = b + c; v = mem[a].val; d = e * f;" — while thread 1 waits on memory, thread 2 runs, and the load is prefetched in the background]
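     A minimal OpenCL C sketch of the idea, not code from the talk: each work-item performs the slide's dependent load, and with enough work-items resident the hardware scheduler hides the memory latency by running other wavefronts. The kernel name, table layout, and in-range indices are assumptions.

```c
/* Hypothetical lookup kernel, one work-item per packet.
 * The load tbl[a] is a long-latency memory access; while this
 * wavefront waits on it, the GPU scheduler switches to other
 * wavefronts, so the latency is hidden as long as enough
 * threads are in flight. */
__kernel void lookup(__global const int *b, __global const int *c,
                     __global const int *tbl, __global int *out)
{
    size_t i = get_global_id(0);
    int a = b[i] + c[i];   /* cheap ALU work */
    int v = tbl[a];        /* memory I/O: completes in the background
                              while other wavefronts execute */
    out[i] = v;
}
```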

  4. Memory Access Hiding in CPU vs. GPU
     G-Opt* re-orders CPU code to mask memory access latency via group prefetching and software pipelining (see the sketch below).
     Questions:
     • Can CPU code optimization be generalized to all network applications?
     • Which processor is more beneficial in packet processing?
     *Borrowed from the G-Opt slides: Raising the Bar for Using GPUs in Software Packet Processing [NSDI 15], Anuj Kalia, Dong Zhou, Michael Kaminsky, and David G. Andersen
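     A minimal C sketch of the group-prefetching idea; this is not G-Opt's actual code, and the hash, table layout, and batch size are assumptions. Indices for a whole batch are computed and prefetched first, so the DRAM misses overlap instead of serializing.

```c
#include <stddef.h>
#include <stdint.h>

#define BATCH 16

struct entry { uint32_t key; uint32_t val; };

/* Hypothetical batched table lookup with group prefetching:
 * pass 1 computes indices and issues software prefetches,
 * pass 2 reads the (now likely cached) entries. The prefetches
 * of the other batch elements hide each element's DRAM latency. */
static void lookup_batch(const struct entry *tbl, size_t mask,
                         const uint32_t *keys, uint32_t *vals)
{
    size_t idx[BATCH];
    for (int i = 0; i < BATCH; i++) {        /* pass 1 */
        idx[i] = keys[i] & mask;             /* assumed hash */
        __builtin_prefetch(&tbl[idx[i]]);    /* start the miss early */
    }
    for (int i = 0; i < BATCH; i++)          /* pass 2 */
        vals[i] = tbl[idx[i]].val;           /* likely a cache hit */
}
```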

  5. Contributions
     • Demystify processor-level effectiveness on packet processing algorithms: CPU optimization benefits light-weight memory-bound workloads but often does not help large-memory workloads; GPU is more beneficial for compute-bound workloads; the GPU's data transfer overhead is the main bottleneck, not its capacity.
     • A packet processing system built on an integrated GPU, without DMA overhead: addresses GPU kernel setup / data sync overhead and memory contention. Up to 4x performance over CPU-only approaches!

  6. Discrete GPU
     A peripheral device communicating with the CPU via PCIe lanes.
     • High computation power, high memory bandwidth (GDDR device memory)
     • Fast instruction/data access, fast context switch
     • But requires CPU-GPU DMA transfers over PCIe (see the sketch below)
     [Figure: host CPU and DRAM connected over PCIe lanes to the GPU's graphics processing clusters and streaming multiprocessors (scheduler, registers, instruction cache, L1/L2 caches, shared memory) backed by GDDR device memory]
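     A hedged OpenCL host-side sketch of the per-batch discrete-GPU path: both enqueued copies cross PCIe and are pure overhead relative to an integrated GPU that shares host DRAM. The function name and argument layout are hypothetical, and error handling is omitted.

```c
#include <CL/cl.h>

/* Hypothetical per-batch path on a discrete GPU:
 * host -> device copy, kernel launch, device -> host copy. */
static void process_batch_dgpu(cl_command_queue q, cl_kernel k,
                               cl_mem dev_buf, void *pkts,
                               size_t bytes, size_t npkts)
{
    clEnqueueWriteBuffer(q, dev_buf, CL_FALSE, 0, bytes, pkts,
                         0, NULL, NULL);               /* DMA in  */
    clSetKernelArg(k, 0, sizeof(cl_mem), &dev_buf);
    clEnqueueNDRangeKernel(q, k, 1, NULL, &npkts, NULL,
                           0, NULL, NULL);             /* compute */
    clEnqueueReadBuffer(q, dev_buf, CL_TRUE, 0, bytes, pkts,
                        0, NULL, NULL);                /* DMA out */
}
```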

  7. Integrated GPU
     The GPU is placed on the same die as the CPU and shares host DRAM, e.g., AMD Accelerated Processing Unit (APU), Intel HD Graphics.
     • No DMA transfer!
     • High computation power, fast instruction/data access, fast context switch
     • Low power and cost
     [Figure: CPU and GPU compute units (scheduler, registers, L1/L2 caches) sharing host DRAM through a unified northbridge and graphics northbridge]

  8. CPU vs. GPU: Cost Efficiency Analysis
     Performance-per-dollar on 8 popular packet processing algorithms, memory- or compute-intensive: IPv4, IPv6, Aho-Corasick pattern matching, ChaCha20, Poly1305, SHA-1, SHA-2, RSA.
     Test configurations: CPU baseline, G-Opt (optimized CPU), dGPU w/ copy, dGPU w/o copy, iGPU.
     CPU / discrete GPU platform:
     • CPU: Intel Xeon E5-2650 v2 (8 cores @ 2.6 GHz), cost $1143.9
     • GPU: NVIDIA GTX980 (2048 cores @ 1.2 GHz), cost $840
     • RAM: 64 GB DIMM DDR3 @ 1333 MHz
     APU / integrated GPU platform:
     • CPU: AMD RX-421BD (4 cores @ 3.4 GHz)
     • GPU: AMD R7 Graphics (512 cores @ 800 MHz); APU cost $67.5
     • RAM: 16 GB DIMM DDR3 @ 2133 MHz

  9. Cost Effectiveness of CPU-based Optimization
     G-Opt helps memory-intensive algorithms, but not compute-intensive ones: with more computation, the CPU's computation capacity becomes the bottleneck.
     [Charts: normalized performance per dollar of CPU baseline, G-Opt, and dGPU w/ copy on IPv6 table lookup, AC pattern matching, and SHA-2]
     Detailed analysis of CPU-based optimization in the paper.

  10. Cost Effectiveness of Discrete/Integrated GPUs
     The discrete GPU suffers from DMA transfer overhead; the integrated GPU is the most cost efficient in every case.
     [Charts: normalized performance per dollar of G-Opt, dGPU w/ copy, dGPU w/o copy, and iGPU on 2048-bit RSA decryption, ChaCha20, and IPv4 table lookup]
     Our approach: use the integrated GPU to accelerate packet processing!

  11. Contents
     • Introduction and motivation
     • Background on GPU
     • CPU vs. GPU: cost efficiency analysis
     • Research challenges
     • APUNet design
     • Evaluation
     • Conclusion

  12. Research Challenges
     • Frequent GPU kernel setup overhead: the set input / launch kernel / teardown kernel / retrieve result cycle repeats per batch; without DMA transfers to hide behind, this redundant overhead is fully exposed.
     • High data synchronization overhead: CPU-GPU cache coherency requires explicit synchronization, which becomes a bottleneck.
     • More contention on the shared DRAM: effective memory bandwidth is reduced (the GPU cache offers roughly 10x the bandwidth of DRAM).
     APUNet: a high-performance APU-accelerated network packet processor.

  13. Persistent Thread Execution Architecture
     Persistently run GPU threads without kernel teardown. Packets arrive from the NIC into a packet pool; the CPU master thread passes packet pointer addresses through a pointer array in shared virtual memory (SVM) to the persistent GPU worker threads (see the sketch below).
     [Figure: NIC -> packet pool -> CPU master thread -> pointer array in shared virtual memory -> persistent GPU worker threads]
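     A loose OpenCL C sketch of a persistent worker, assuming fine-grained shared virtual memory and a hypothetical state encoding; this is not APUNet's actual kernel. The thread is launched once and spins on its slot instead of being relaunched per batch, which removes the kernel setup/teardown cycle.

```c
/* Hypothetical persistent worker: launched once, never torn down.
 * The CPU master writes a packet-pool offset into slot_off[i] and
 * sets state[i] = READY; the GPU thread processes the packet and
 * sets DONE. Memory-ordering details are ignored here; they are
 * the subject of the next two slides. */
enum { IDLE = 0, READY = 1, DONE = 2, QUIT = 3 };

__kernel void persistent_worker(__global volatile int *state,
                                __global const int *slot_off,
                                __global uchar *pool)
{
    size_t i = get_global_id(0);
    for (;;) {
        int s = state[i];
        if (s == QUIT) break;            /* shutdown signal        */
        if (s != READY) continue;        /* spin until work arrives */
        __global uchar *pkt = pool + slot_off[i];
        pkt[0] ^= 0xFF;                  /* stand-in for real work  */
        state[i] = DONE;                 /* hand the result back    */
    }
}
```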

  14. Data Synchronization Overhead
     The synchronization point for GPU threads is the GPU L2 cache, so results require explicit synchronization to main memory before the CPU master can see them; handled per packet, the master can process only one request at a time (a sketch of this per-packet baseline follows).
     [Figure: GPU threads updating packets in shared virtual memory; the master cannot observe an updated result until it is explicitly synchronized out of the GPU L2 cache through the graphics northbridge]
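     For contrast with group synchronization on the next slide, here is a hedged sketch of the expensive per-packet baseline, assuming OpenCL 2.0 fine-grained SVM with atomics; the names and the stand-in processing are hypothetical. Each result is published with its own atomic release, forcing it out of the GPU cache one packet at a time.

```c
/* Hypothetical per-packet synchronization baseline: each result is
 * published with an atomic on fine-grained SVM (requires a device
 * with SVM atomics support), making the write visible to the CPU.
 * One atomic per packet is the overhead group sync removes. */
__kernel void worker_atomic(__global volatile atomic_int *done,
                            __global uchar *pool,
                            __global const int *slot_off)
{
    size_t i = get_global_id(0);
    __global uchar *pkt = pool + slot_off[i];
    pkt[0] ^= 0xFF;                            /* stand-in work */
    atomic_store_explicit(&done[i], 1,
                          memory_order_release,
                          memory_scope_all_svm_devices);
}
```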

  15. Solution: Group Synchronization
     Implicitly synchronize the memory of a whole group of processed packets at once by exploiting the LRU cache replacement policy: after a GPU thread group finishes its packets and passes a barrier, dummy memory I/O evicts the group's cache lines, flushing the results to shared virtual memory, where the CPU master verifies correctness (see the sketch below).
     For more details and tuning/optimizations, please refer to our paper.
     [Figure: GPU thread groups (e.g., 32 threads) processing packets, hitting barriers, and issuing dummy memory I/O while the master polls per-group state in shared virtual memory]
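     A very loose OpenCL C sketch of the group-synchronization idea as described on this slide; the dummy-buffer eviction trick, sizes, and names are assumptions, not APUNet's actual code.

```c
/* Hypothetical group sync: a work-group processes a batch, passes a
 * barrier, then touches a dummy region so LRU replacement evicts the
 * group's packet lines from the GPU L2, implicitly flushing results
 * to shared memory without a per-packet atomic. */
__kernel void worker_group_sync(__global uchar *pool,
                                __global const int *slot_off,
                                __global volatile int *group_state,
                                __global uchar *dummy, /* ~L2-sized */
                                int dummy_len)
{
    size_t i = get_global_id(0);
    size_t g = get_group_id(0);

    pool[slot_off[i]] ^= 0xFF;              /* process my packet   */
    barrier(CLK_GLOBAL_MEM_FENCE);          /* whole group is done */

    /* Dummy memory I/O: crowd the cache so the packet results are
     * evicted to memory where the CPU can see them. */
    for (int j = (int)get_local_id(0); j < dummy_len;
         j += (int)get_local_size(0))
        dummy[j]++;
    if (get_local_id(0) == 0)
        group_state[g] = 1;                 /* CPU polls this flag */
}
```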

  16. Zero-copy Based Packet Processing
     • Option 1: traditional method with a discrete GPU. NIC -> CPU buffers (standard allocation, e.g., mmap) -> copy -> GPU memory (GDDR, e.g., cudaMalloc). High overhead!
     • Option 2: zero-copy between CPU and GPU via a shared region (e.g., clSVMAlloc), but still a copy from the NIC's standard buffers (e.g., mmap) into that region. High overhead!
     • Our approach: integrate memory allocation for NIC, CPU, and GPU into one shared region (e.g., clSVMAlloc). No copy overhead (see the sketch below)!
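     A hedged host-side sketch of the integrated allocation, assuming OpenCL 2.0 shared virtual memory; the function name is hypothetical and NIC registration is platform-specific, so it appears only as a comment. One clSVMAlloc region backs the packet pool for NIC, CPU, and GPU, so no copies are needed.

```c
#include <CL/cl.h>

/* Hypothetical zero-copy setup on an APU: a single fine-grained SVM
 * region serves as the packet pool for the NIC, the CPU, and the
 * GPU, eliminating both copies from the slide's Options 1 and 2. */
static void *alloc_packet_pool(cl_context ctx, cl_kernel k,
                               size_t bytes)
{
    void *pool = clSVMAlloc(ctx,
                            CL_MEM_READ_WRITE |
                            CL_MEM_SVM_FINE_GRAIN_BUFFER,
                            bytes, 0);
    if (!pool) return NULL;
    /* ...register `pool` with the NIC driver for RX DMA here
       (driver-specific, outside OpenCL)... */
    clSetKernelArgSVMPointer(k, 0, pool);  /* GPU sees same memory */
    return pool;
}
```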

  17. Evaluation
     APUNet server (AMD Carrizo APU):
     • RX-421BD (4 cores @ 3.4 GHz), R7 Graphics (512 cores @ 800 MHz)
     • 16 GB DRAM, 40 Gbps NIC (Mellanox ConnectX-4)
     Client (packet/flow generator):
     • Xeon E3-1285 v4 (8 cores @ 3.5 GHz), 32 GB DRAM
     Questions:
     • How well does APUNet reduce latency and improve throughput?
     • How practical is APUNet in real-world network applications?

  18. Benefits of APUNet Design
     Workload: IPsec (128-bit AES-CBC + HMAC-SHA1).
     • Synchronization: group sync reduces per-packet latency from 5.31 us (atomics) to 0.93 us, a 5.7x reduction.
     • Throughput: GPU-ZC-PERSIST (zero-copy + persistent threads) outperforms GPU-ZC and GPU-Copy across 64-1451B packets (chart callouts: 5.4x and 1.5x).
     [Charts: packet processing latency, atomics vs. group sync; throughput (Gbps) of GPU-Copy, GPU-ZC, and GPU-ZC-PERSIST by packet size, including 64B]

  19. Real-world Network Applications
     Five real-world network applications: IPv4/IPv6 packet forwarding, IPsec gateway, SSL proxy, network IDS.
     • IPsec gateway: APUNet reaches 16.4 Gbps vs. 8.2 Gbps for G-Opt, about 2x, measured at 64B and 1451B packets.
     • SSL proxy: APUNet sustains up to 4241 HTTP transactions/sec, about 2.75x the CPU-based configurations, measured at 256 and 8192 concurrent connections.
     [Charts: IPsec gateway throughput (Gbps) and SSL proxy HTTP transactions/sec for CPU baseline, G-Opt, and APUNet]

  20. Real-world Network Applications: Network IDS
     Snort-based network IDS using Aho-Corasick pattern matching.
     • No benefit from CPU optimization: the IDS accesses many data structures, evicting already-cached data.
     • APUNet improves throughput by about 4x over the CPU baseline.
     • DFC*, a cache-friendly CPU-based algorithm that reduces memory accesses, outperforms the AC-based APUNet.
     [Chart: throughput (Gbps) of CPU baseline, G-Opt, APUNet, and DFC at 64B and 1514B packets]
     *DFC: Accelerating String Pattern Matching for Network Applications [NSDI 16], Byungkwon Choi, Jongwook Chae, Muhammad Jamshed, KyoungSoo Park, and Dongsu Han

  21. Conclusion
     Re-examined the efficacy of GPU-based packet processing:
     • The discrete GPU is bottlenecked by PCIe data transfer overhead; the integrated GPU is the most cost-effective processor.
     APUNet: an APU-accelerated networked system.
     • Persistent thread execution eliminates kernel setup overhead.
     • Group synchronization minimizes data synchronization overhead.
     • Zero-copy packet processing reduces memory contention.
     • Up to 4x performance improvement over the CPU baseline and G-Opt.
     APUNet: a high-performance, cost-effective platform for real-world network applications.

  22. Thank you. Q & A
