Efficient Data Movement in Fully Disaggregated Memory Systems


DaeMon is an adaptive data movement solution for fully disaggregated systems. It achieves 2.39x better performance and 3.06x lower data access costs than the widely adopted scheme of moving data at page granularity. The presentation also covers the benefits of resource disaggregation and the challenges of data movement in such systems.

  • Data Movement
  • Disaggregated Systems
  • Efficient
  • Memory Systems
  • Technology


Presentation Transcript


  1. DaeMon: Architectural Support for Efficient Data Movement in Fully Disaggregated Memory Systems. Authors: Christina Giannoula, Kailong Huang, Jonathan Tang, Nectarios Koziris, Georgios Goumas, Zeshan Chishti, Nandita Vijaykumar

  2. DaeMon Executive Summary. Problem: Efficient data movement support is a major system challenge for fully Disaggregated Systems (DSs). Contribution: DaeMon, the first adaptive data movement solution for fully DSs. Key Results: DaeMon achieves 2.39x better performance and 3.06x lower data access costs over the widely adopted scheme of moving data at page granularity.

  3. What is resource disaggregation? [Figure labels: Network]

  4. Monolithic vs Disaggregated Systems. Callout: thanks to recent advances in network technologies. [Figure labels: Network]

  5. Benefits of Fully Disaggregated Systems: Resource Utilization. [Figure labels: Monolithic System, Disaggregated System, Network, fits a few jobs, fits many jobs, idle]

  6. Benefits of Fully Disaggregated Systems: Failure Handling. [Figure labels: Monolithic System, Disaggregated System, Network]

  7. Benefits of Fully Disaggregated Systems: Resource Scaling. [Figure labels: Network]

  8. Benefits of Fully Disaggregated Systems: Heterogeneity. Callout: many different types of hardware devices over the network. [Figure labels: Network]

  9. Benefits of Fully Disaggregated Systems: Resource Utilization, Failure Handling, Resource Scaling, Heterogeneity. Disaggregated systems can significantly decrease data center costs.

  10. Baseline Disaggregated System. [Figure labels: Compute Components (CPU, Local Memory), Network, Memory Components (Controller, Remote Memory)]

  11. Baseline Disaggregated System. Callout: Local Memory hosts ~20% of the application's data. [Figure as in slide 10]

  12. Baseline Disaggregated System. Callout: Remote Memory hosts ~80% of the application's data. [Figure as in slide 10]

  13. Baseline Disaggregated System. Callout: data is typically moved at page granularity. [Figure as in slide 10]

  14. Baseline Disaggregated System. Callout: distributed OS modules. [Figure as in slide 10]

  15. Why is data movement challenging? [Figure labels: CPU, Local Memory, Network, Controller, Remote Memory]

  16. #1: Coarse-Grained Data Migrations. Page granularity (e.g., 4KB) data migrations offer software transparency, low metadata overheads, and high spatial locality. Drawbacks: latency-critical cache lines are slowed down; high bandwidth consumption. [Figure labels: CPU, Local Memory, Network, Controller, Remote Memory]

  17. #1: Coarse-Grained Data Migrations. Page granularity (e.g., 4KB) data migrations offer software transparency, low metadata overheads, and high spatial locality. A latency-efficient and bandwidth-efficient solution is necessary. [Figure as in slide 16]
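
The bandwidth cost of page-granularity migration can be illustrated with a small calculation. The sketch below is not taken from the presentation: the 64-byte cache line size and the function names are assumptions chosen for illustration, while the 4KB page size comes from the slide. It computes how much of a migrated page is useful when the application touches only a few of its cache lines.

```python
PAGE_BYTES = 4 * 1024    # page size mentioned on the slide (e.g., 4KB)
CACHE_LINE_BYTES = 64    # assumption: a common cache line size, not stated in the slides


def lines_per_page(page_bytes: int = PAGE_BYTES,
                   line_bytes: int = CACHE_LINE_BYTES) -> int:
    """Number of cache lines that fit in one page."""
    return page_bytes // line_bytes


def useful_fraction(lines_used: int,
                    page_bytes: int = PAGE_BYTES,
                    line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Fraction of a migrated page that the application actually touches."""
    total = lines_per_page(page_bytes, line_bytes)
    if not 0 <= lines_used <= total:
        raise ValueError("lines_used must be between 0 and lines_per_page")
    return lines_used / total


if __name__ == "__main__":
    # One touched line per page: most of the transferred bytes are unused.
    print(lines_per_page(), useful_fraction(1))
```

Under these assumptions, a page with low locality moves many bytes over the network for each byte the CPU needs, which is the motivation for a solution that is both latency-efficient and bandwidth-efficient.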

  18. #2: Non-Conventional System Design. Disaggregated systems are not monolithic: they have distributed memory management. Hybrid/heterogeneous memory systems, by contrast, have centralized memory management. System-Level Solutions for hybrid/heterogeneous memory systems: Thermostat [ASPLOS'17], Kleio [HPDC'19], Chameleon [MICRO'18], HSCC [ICS'17], Nimble [ASPLOS'19]. [Figure labels: CPU, Local Memory, Controller, Remote Memory, DRAM Cache, DRAM]

  19. #2: Non-Conventional System Design. Disaggregated systems are not monolithic. Hardware-Level Solutions for hybrid/heterogeneous memory systems: Chop [HPCA'10], UH-MEM [CLUSTER'17], MemPod [HPCA'17], LGM [IPDPS'19]. Callouts: centralized hardware units in the CPU side; would incur high hardware overheads. [Figure as in slide 18]

  20. #2: Non-Conventional System Design. Disaggregated systems are not monolithic. Prior solutions (System-Level and Hardware-Level) for hybrid/heterogeneous memory systems are not suitable or efficient for disaggregated memory systems. [Figure as in slide 18]

  21. #3: Variability in Data Access Latencies. Data access latencies depend on the location of the remote memory component. Callout: different locations for the application's data (data1, data2). [Figure labels: CPU, Local Memory, Controller, Remote Memory]

  22. #3: Variability in Data Access Latencies. Data access latencies depend on the location of the remote memory component and on network contention. Callout: high contention due to concurrent jobs sharing the network. [Figure as in slide 21]

  23. #3: Variability in Data Access Latencies. Data placements can vary during runtime or between multiple executions. A robust solution to variability in data access latencies is necessary. [Figure: placements of data1 and data2 across Remote Memory components over Time]

  24. How can we build an efficient solution? DaeMon [Sigmetrics'23]

  25. 1. Disaggregated Hardware Support. Callout: dedicated units (DaeMon Compute Engine, DaeMon Memory Engine). Properties: Independence, High Parallelism, High Scalability. [Figure labels: Compute Component, Memory Component, Network, CPU, LLC, FPGA, Controller, DaeMon Compute Engine, DaeMon Memory Engine, Local Memory, Remote Memory]

  26. 2. Multiple Granularity Data Movement. [Figure labels: Compute Component (CPU, LLC, Local Memory, DaeMon Compute Engine), Memory Component (Controller, Remote Memory, DaeMon Memory Engine); each engine has a Sub-block Queue and a Page Queue]

  27. 2. Multiple Granularity Data Movement. Cache lines move through the Sub-block Queues and pages through the Page Queues. Callout: prioritization of cache line migrations. [Figure as in slide 26, with Queue Controller added to each engine]

  28. 2. Multiple Granularity Data Movement. Properties: Software Transparency, Low Metadata Overheads, High Spatial Locality, Latency-Efficiency in Critical Data. [Figure as in slide 27]
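
The two-queue arrangement on these slides can be sketched in software as follows. This is only an illustration, not DaeMon's hardware design: the class and field names are hypothetical, and strict priority is an assumption chosen for brevity, since the slides state only that cache line migrations are prioritized.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Request:
    kind: str      # "cache_line" or "page"
    address: int


class TwoQueueScheduler:
    """Hypothetical model of a sub-block queue and a page queue."""

    def __init__(self) -> None:
        self.sub_block_queue: deque = deque()
        self.page_queue: deque = deque()

    def enqueue(self, req: Request) -> None:
        """Route a request to the queue that matches its granularity."""
        if req.kind == "cache_line":
            self.sub_block_queue.append(req)
        elif req.kind == "page":
            self.page_queue.append(req)
        else:
            raise ValueError(f"unknown request kind: {req.kind}")

    def next_request(self) -> Optional[Request]:
        """Serve cache line requests first (assumed strict priority)."""
        if self.sub_block_queue:
            return self.sub_block_queue.popleft()
        if self.page_queue:
            return self.page_queue.popleft()
        return None


if __name__ == "__main__":
    sched = TwoQueueScheduler()
    sched.enqueue(Request("page", 0x1000))
    sched.enqueue(Request("cache_line", 0x2040))
    print(sched.next_request())  # the cache line request is served first
```

Keeping the two granularities in separate queues means a latency-critical cache line never waits behind a large page transfer that was issued earlier.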

  29. 3. Link Compression in Page Migrations. Callout: compressed pages inside the network. [Figure labels: Compute Component (CPU, LLC, Local Memory, DaeMon Compute Engine), Memory Component (Controller, Remote Memory, DaeMon Memory Engine); each engine has a Sub-block Queue, a Page Queue, a Queue Controller, and a (De)Compr. Unit; Cache lines and Compressed pages travel between the engines]

  30. 3. Link Compression in Page Migrations. Properties: Bandwidth-Efficiency, Critical Cache Line Prioritization. [Figure as in slide 29]
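
The idea of compressing pages only while they cross the network can be sketched as below. The slides do not name a compression algorithm, so `zlib` is used here purely as a software stand-in for the hardware (de)compression units; the function names and the fixed 4KB page size are likewise assumptions.

```python
import zlib

PAGE_BYTES = 4096  # assumption: 4KB pages, matching the example size on the slides


def compress_page(page: bytes) -> bytes:
    """Compress a page before it is sent over the link (sender side)."""
    if len(page) != PAGE_BYTES:
        raise ValueError("expected a full page")
    return zlib.compress(page)


def decompress_page(payload: bytes) -> bytes:
    """Restore the original page after it arrives (receiver side)."""
    page = zlib.decompress(payload)
    if len(page) != PAGE_BYTES:
        raise ValueError("decompressed payload is not a full page")
    return page


def link_bytes_saved(page: bytes) -> int:
    """Bytes saved on the link; can be negative for incompressible data."""
    return PAGE_BYTES - len(compress_page(page))


if __name__ == "__main__":
    zero_page = bytes(PAGE_BYTES)  # highly compressible example page
    assert decompress_page(compress_page(zero_page)) == zero_page
    print(link_bytes_saved(zero_page))
```

Because the page is decompressed on arrival, memory on both sides still holds uncompressed data; only the bytes on the link shrink, and the saving depends on how compressible the application's data is.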

  31. 4. Selection Granularity Data Movement. Question: cache line, page, or both? [Figure labels: Compute Component (CPU, LLC, Local Memory, DaeMon Compute Engine), Memory Component (Controller, Remote Memory, DaeMon Memory Engine); each engine has a Sub-block Queue, a Page Queue, a Queue Controller, and a (De)Compr. Unit; Cache lines and Pages travel between the engines]

  32. 4. Selection Granularity Data Movement. Inflight Sub-block and Page Buffers track pending data migrations. [Figure as in slide 31, with Inflight Sub-block and Page Buffers added]

  33. 4. Selection Granularity Data Movement. The Selection Granularity Unit decides: cache line, page, or both? [Figure as in slide 32, with Selection Granularity Unit added]

  34. 4. Selection Granularity Data Movement. Properties: Robustness, Versatility, Adaptivity to Runtime Changes. [Figure as in slide 33]
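
One way to picture a granularity decision driven by inflight buffer utilization is the sketch below. The slides say only that the inflight buffers track pending migrations and that the unit chooses between cache line, page, or both; the threshold rule, the `high` parameter, and all names here are hypothetical and are not DaeMon's actual policy.

```python
from enum import Enum


class Granularity(Enum):
    CACHE_LINE = "cache_line"
    PAGE = "page"
    BOTH = "both"


def select_granularity(sub_block_util: float,
                       page_util: float,
                       high: float = 0.9) -> Granularity:
    """Pick a migration granularity from inflight buffer utilization.

    Hypothetical rule: avoid the granularity whose inflight buffer is
    nearly full, otherwise request both the cache line and the page.
    """
    for util in (sub_block_util, page_util):
        if not 0.0 <= util <= 1.0:
            raise ValueError("utilization must be in [0, 1]")
    page_busy = page_util >= high
    sub_block_busy = sub_block_util >= high
    if page_busy and not sub_block_busy:
        return Granularity.CACHE_LINE   # many page migrations already pending
    if sub_block_busy and not page_busy:
        return Granularity.PAGE         # many cache line migrations already pending
    return Granularity.BOTH


if __name__ == "__main__":
    print(select_granularity(0.1, 0.95))
```

Because the decision uses only counters that change as migrations are issued and completed, a rule of this kind adapts at runtime without needing to know the cause of the pressure (access pattern, network contention, or compressibility).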

  35. Why does this work? DaeMon

  36. Use Case 1: Memory Access Patterns. Case shown: high locality within pages. [Figure: Inflight Buffers Utilization (Sub-block vs. Page) over Time. Labels: Compute Component (CPU, LLC, Local Memory, DaeMon Engine with Inflight Buffers and Selection Gran. Unit), Memory Component (Controller, Remote Memory, DaeMon Memory Engine), Cache lines, Compressed pages]

  37. Use Case 1: Memory Access Patterns. Case shown: low locality within pages. [Figure as in slide 36]

  38. Use Case 2: Network Characteristics. Case shown: high bandwidth consumption. [Figure as in slide 36]

  39. Use Case 2: Network Characteristics. Cases shown: high bandwidth consumption and low bandwidth consumption. [Figure as in slide 36]

  40. Use Case 3: Data Compressibility. Case shown: high data compressibility (app1). [Figure as in slide 36]

  41. Use Case 3: Data Compressibility. Cases shown: high data compressibility (app1) and low data compressibility (app2). [Figure as in slide 36]

  42. Speedup in Real Applications. [Bar chart: Speedup (y-axis, 0 to 5) across Workloads kc, tr, pr, nw, bf, bc, ts, sp, sl, hp, pf, dr, rs, and GM, for the schemes Page, ComprPage, CacheLine, CacheLine+Page, DaeMon-Compr, and DaeMon]

  43. Speedup in Real Applications. Annotations: 14.6; 1.95x. [Bar chart as in slide 42]

  44. Speedup in Real Applications. Annotations: 14.6; 1.29x. [Bar chart as in slide 42]

  45. Speedup in Real Applications. Annotations: 14.6; 8.4; 1.09x; 0.95x. Callout: low locality within pages. [Bar chart as in slide 42]

  46. Speedup in Real Applications. Annotations: 11.7; 14.6; 8.4; 1.53x. [Bar chart as in slide 42]

  47. Speedup in Real Applications. Annotations: 11.7; 14.6; 8.4. DaeMon performs best in real-world applications. [Bar chart as in slide 42]

  48. Data Access Costs in Real Applications. Highlighted: low locality within pages. [Bar chart: Data Access Costs (y-axis, 0 to 1) across workloads kc, tr, pr, nw, bf, bc, ts, sp, sl, hp, pf, dr, rs, and GM, for the schemes Page, ComprPage, DaeMon-Compr, and DaeMon]

  49. Data Access Costs in Real Applications. Highlighted: medium locality within pages. [Bar chart as in slide 48]

  50. Data Access Costs in Real Applications. Highlighted: high locality within pages. [Bar chart as in slide 48]
