
Efficient Data Movement in Fully Disaggregated Memory Systems
Explore DaeMon, an innovative solution for optimizing data movement in Disaggregated Systems, showcasing significant performance improvements and cost savings. Discover the benefits of resource disaggregation, network advancements, and more.
Presentation Transcript
DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Memory Systems. Christina Giannoula, Kailong Huang, Jonathan Tang, Nectarios Koziris, Georgios Goumas, Zeshan Chishti, Nandita Vijaykumar
DaeMon Executive Summary. Problem: Efficient data movement support is a major system challenge for fully Disaggregated Systems (DSs). Contribution: DaeMon, the first adaptive data movement solution for fully DSs. Key Results: DaeMon achieves 2.39x better performance and 3.06x lower data access costs over the widely adopted scheme of moving data at page granularity.
What is resource disaggregation? (Diagram label: Network)
Monolithic vs Disaggregated Systems: disaggregation is possible thanks to recent advances in network technologies. (Diagram label: Network)
Benefits of Fully Disaggregated Systems: Resource Utilization. (Diagram: Monolithic System vs. Disaggregated System connected over a Network; labels: "fits a few jobs", "fits many jobs", "idle")
Benefits of Fully Disaggregated Systems: Failure Handling. (Diagram: Monolithic System vs. Disaggregated System connected over a Network)
Benefits of Fully Disaggregated Systems: Resource Scaling. (Diagram label: Network)
Benefits of Fully Disaggregated Systems: Heterogeneity. Many different types of hardware devices over the network.
Benefits of Fully Disaggregated Systems: Resource Utilization, Failure Handling, Resource Scaling, Heterogeneity. Disaggregated systems can significantly decrease data center costs.
Baseline Disaggregated System. (Diagram: several Compute Components, each with a CPU and Local Memory, connected over the Network to several Memory Components, each with a Controller and Remote Memory)
Baseline Disaggregated System: Local Memory hosts ~20% of the application's data.
Baseline Disaggregated System: Remote Memory hosts ~80% of the application's data.
Baseline Disaggregated System: data is typically moved at page granularity.
Baseline Disaggregated System: distributed OS modules.
Why is data movement challenging? (Diagram: CPU and Local Memory connected over the Network to a Controller and Remote Memory)
#1: Coarse-Grained Data Migrations. Page granularity (e.g., 4KB) data migrations offer software transparency, low metadata overheads, and high spatial locality. However, latency-critical cache lines are slowed down, and bandwidth consumption is high.
#1: Coarse-Grained Data Migrations. Page granularity (e.g., 4KB) data migrations offer software transparency, low metadata overheads, and high spatial locality. A latency-efficient and bandwidth-efficient solution is necessary.
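To make the bandwidth cost of page-granularity migration concrete, the sketch below compares the bytes moved when only some cache lines of a page are actually needed. The 64-byte cache line size is an assumption (the slides give only the 4KB page example), and `bytes_moved` and `amplification` are hypothetical helper names.

```python
PAGE_BYTES = 4 * 1024      # 4KB page, as in the slide's example
CACHE_LINE_BYTES = 64      # assumption: typical cache line size, not stated in the slides
LINES_PER_PAGE = PAGE_BYTES // CACHE_LINE_BYTES


def bytes_moved(lines_used: int, granularity: str) -> int:
    """Bytes sent over the network to serve `lines_used` cache lines of one page."""
    if not 1 <= lines_used <= LINES_PER_PAGE:
        raise ValueError("lines_used must be between 1 and LINES_PER_PAGE")
    if granularity == "page":
        return PAGE_BYTES                      # the whole page moves regardless of use
    if granularity == "cache_line":
        return lines_used * CACHE_LINE_BYTES   # only the requested lines move
    raise ValueError(f"unknown granularity: {granularity}")


def amplification(lines_used: int) -> float:
    """How many times more data a page migration moves than cache line migrations."""
    return bytes_moved(lines_used, "page") / bytes_moved(lines_used, "cache_line")
```

Under these assumptions, `amplification(1)` is 64.0 (one useful line costs a full page of traffic), while `amplification(64)` is 1.0 (a fully used page costs nothing extra). This is why the locality within pages matters for choosing a granularity.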
#2: Non-Conventional System Design. Disaggregated systems are not monolithic: they use distributed memory management. System-level solutions for hybrid/heterogeneous memory systems (CPU, DRAM cache, DRAM) rely on centralized memory management: Thermostat [ASPLOS'17], Kleio [HPDC'19], Chameleon [MICRO'18], HSCC [ICS'17], Nimble [ASPLOS'19].
#2: Non-Conventional System Design. Disaggregated systems are not monolithic. Hardware-level solutions for hybrid/heterogeneous memory systems rely on centralized hardware units in the CPU side and would incur high hardware overheads: Chop [HPCA'10], UH-MEM [CLUSTER'17], MemPod [HPCA'17], LGM [IPDPS'19].
#2: Non-Conventional System Design. Disaggregated systems are not monolithic. Prior solutions (system-level and hardware-level) are not suitable or efficient for disaggregated memory systems.
#3: Variability in Data Access Latencies. Data access latencies depend on the location of the remote memory component: an application's data (e.g., data1 and data2) can reside in different locations.
#3: Variability in Data Access Latencies. Data access latencies depend on the location of the remote memory component and on network contention: contention is high when concurrent jobs share the network.
#3: Variability in Data Access Latencies. Data placements can vary during runtime or between multiple executions. A robust solution to variability in data access latencies is necessary.
How can we build an efficient solution? DaeMon [Sigmetrics'23]
1. Disaggregated Hardware Support: dedicated units. Each Compute Component (CPU, LLC, Local Memory) has a DaeMon Compute Engine, and each Memory Component (Controller, Remote Memory) has a DaeMon Memory Engine, connected over the Network. Benefits: Independence, High Parallelism, High Scalability. (The diagram also labels an FPGA in each component pair.)
2. Multiple Granularity Data Movement. Both the DaeMon Compute Engine and the DaeMon Memory Engine have a Sub-block Queue and a Page Queue.
2. Multiple Granularity Data Movement. A Queue Controller in each engine moves cache lines through the Sub-block Queues and pages through the Page Queues, with prioritization of cache line migrations.
2. Multiple Granularity Data Movement. Benefits: Software Transparency, Low Metadata Overheads, High Spatial Locality, Latency-Efficiency in Critical Data.
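The separate sub-block and page queues can be illustrated with a minimal software sketch. It assumes strict priority for cache line requests, which is a simplification: the slides state only that cache line migrations are prioritized, not how the hardware arbitrates. `TwoQueueScheduler` and its methods are hypothetical names.

```python
from collections import deque
from typing import Deque, Optional, Tuple

Request = Tuple[str, int]  # (kind, address); kind is "cache_line" or "page"


class TwoQueueScheduler:
    """Hypothetical model of a sub-block queue and a page queue sharing one link."""

    def __init__(self) -> None:
        self.sub_block_queue: Deque[int] = deque()
        self.page_queue: Deque[int] = deque()

    def enqueue(self, kind: str, address: int) -> None:
        """Place a migration request in the queue that matches its granularity."""
        if kind == "cache_line":
            self.sub_block_queue.append(address)
        elif kind == "page":
            self.page_queue.append(address)
        else:
            raise ValueError(f"unknown request kind: {kind}")

    def next_request(self) -> Optional[Request]:
        """Pick the next migration to issue, or None if both queues are empty."""
        # Assumption: strict priority, so a pending cache line is never
        # stuck behind a large page transfer that arrived earlier.
        if self.sub_block_queue:
            return ("cache_line", self.sub_block_queue.popleft())
        if self.page_queue:
            return ("page", self.page_queue.popleft())
        return None
```

For example, if a page request is enqueued before a cache line request, `next_request()` still returns the cache line first, and the page afterwards.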
3. Link Compression in Page Migrations. Each engine has a (De)Compression Unit, so pages travel compressed inside the network while cache lines are sent as they are.
3. Link Compression in Page Migrations. Benefits: Bandwidth-Efficiency, Critical Cache Line Prioritization.
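As a rough software analogy for link compression, the sketch below compresses a page before it enters the network and decompresses it on arrival. It uses `zlib` purely as a stand-in: the slides do not name the compression algorithm used by the hardware units, and `send_page` and `receive_page` are hypothetical names.

```python
import zlib

PAGE_BYTES = 4 * 1024  # 4KB page, as in the slides' example


def send_page(page: bytes) -> bytes:
    """Compress a page on the sender side (stand-in for the (De)Compr. Unit)."""
    if len(page) != PAGE_BYTES:
        raise ValueError("expected exactly one 4KB page")
    return zlib.compress(page)


def receive_page(wire_bytes: bytes) -> bytes:
    """Decompress a page on the receiver side, restoring the original bytes."""
    return zlib.decompress(wire_bytes)


def wire_ratio(page: bytes) -> float:
    """Fraction of the original page size that actually crosses the network."""
    return len(send_page(page)) / PAGE_BYTES
```

A highly compressible page, such as one filled with zeros, crosses the network as far fewer than 4096 bytes, while a page of random bytes gains little or nothing. That difference corresponds to the data compressibility use case later in the deck.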
4. Selection Granularity Data Movement. For each request: cache line, page, or both?
4. Selection Granularity Data Movement. Inflight Sub-block and Page Buffers track pending data migrations.
4. Selection Granularity Data Movement. A Selection Granularity Unit, together with the Inflight Sub-block and Page Buffers, decides: cache line, page, or both?
4. Selection Granularity Data Movement. Benefits: Robustness, Versatility, Adaptivity to Runtime Changes.
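One way to picture the selection step is a small function that consults the inflight buffers before issuing migrations. This is a hypothetical heuristic, not the policy from the talk: the slides say only that inflight buffers track pending migrations and that a selection unit chooses cache line, page, or both. The 0.75 utilization threshold, the 64 lines per page, and all names here are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Set

LINES_PER_PAGE = 64  # assumption: 4KB pages and 64B cache lines


@dataclass
class InflightBuffers:
    """Tracks pending migrations so the same data is not requested twice."""
    capacity_sub_blocks: int
    capacity_pages: int
    sub_blocks: Set[int] = field(default_factory=set)  # pending cache line indices
    pages: Set[int] = field(default_factory=set)       # pending page numbers

    def sub_block_utilization(self) -> float:
        return len(self.sub_blocks) / self.capacity_sub_blocks

    def page_utilization(self) -> float:
        return len(self.pages) / self.capacity_pages


def select_granularity(line_index: int, buffers: InflightBuffers,
                       threshold: float = 0.75) -> List[str]:
    """Return the migrations to issue for a miss on cache line `line_index`.

    An empty list means the data is already covered by a pending migration,
    or both buffers are too full and the request has to be retried later.
    """
    page = line_index // LINES_PER_PAGE
    issued: List[str] = []
    # Issue a cache line only if it is not pending and the buffer has headroom.
    if line_index not in buffers.sub_blocks and buffers.sub_block_utilization() < threshold:
        buffers.sub_blocks.add(line_index)
        issued.append("cache_line")
    # Issue the enclosing page under the same two conditions.
    if page not in buffers.pages and buffers.page_utilization() < threshold:
        buffers.pages.add(page)
        issued.append("page")
    return issued
```

With empty buffers, the first miss issues both a cache line and a page. A later miss to another line of the same page issues only the cache line, because the page is already pending. Once the sub-block buffer fills up, new misses fall back to page migrations only.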
Why does this work? DaeMon
Use Case 1: Memory Access Patterns. High locality within pages. (Diagram: Compute Component with CPU, LLC, Local Memory, and a DaeMon Engine containing Inflight Buffers and the Selection Gran. Unit; Memory Component with Controller, DaeMon Memory Engine, and Remote Memory; cache lines and compressed pages move between them. Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Use Case 1: Memory Access Patterns. Low locality within pages. (Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Use Case 2: Network Characteristics. High bandwidth consumption. (Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Use Case 2: Network Characteristics. High bandwidth consumption and low bandwidth consumption. (Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Use Case 3: Data Compressibility. app1: high data compressibility. (Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Use Case 3: Data Compressibility. app1: high data compressibility; app2: low data compressibility. (Plot: Inflight Buffers Utilization, Sub-block vs. Page, over Time)
Speedup in Real Applications. (Bar chart. Y-axis: Speedup. X-axis: Workloads kc, tr, pr, nw, bf, bc, ts, sp, sl, hp, pf, dr, rs, and GM. Schemes: Page, ComprPage, CacheLine, CacheLine+Page, DaeMon-Compr, DaeMon)
Speedup in Real Applications. (Same chart; annotations: 14.6, 1.95x)
Speedup in Real Applications. (Same chart; annotations: 14.6, 1.29x)
Speedup in Real Applications. Low locality within pages. (Same chart; annotations: 14.6, 8.4, 1.09x, 0.95x)
Speedup in Real Applications. (Same chart; annotations: 11.7, 14.6, 8.4, 1.53x)
Speedup in Real Applications. DaeMon performs best in real-world applications. (Same chart; annotations: 11.7, 14.6, 8.4)
Data Access Costs in Real Applications. Low locality within pages. (Bar chart. Y-axis: Data Access Costs, 0 to 1. X-axis: workloads kc, tr, pr, nw, bf, bc, ts, sp, sl, hp, pf, dr, rs, and GM. Schemes: Page, ComprPage, DaeMon-Compr, DaeMon)
Data Access Costs in Real Applications. Medium locality within pages. (Same chart)
Data Access Costs in Real Applications. High locality within pages. (Same chart)