Optimizing Memory and Storage Support for Non-Volatile Memory Systems - Janus
Janus optimizes memory and storage support for non-volatile memory (NVM) systems by parallelizing independent sub-operations and pre-executing sub-operations as soon as their dependencies are resolved. This approach is roughly 2X faster than a serialized baseline. The system targets write latency on the critical path of NVM programs by dividing memory and storage support operations into smaller sub-operations.
Presentation Transcript
Janus: Optimizing Memory and Storage Support for Non-Volatile Memory Systems. Sihang Liu, Korakit Seemakhupt, Gennady Pekhimenko, Aasheesh Kolli, and Samira Khan
SUMMARY. Problem and Motivation: Non-volatile memory (NVM) acts as both memory and storage, so NVM systems require both memory and storage support. However, operations in memory and storage support increase write latency, which is on the critical path of NVM programs. Key Ideas: We observe that these operations can be divided into smaller sub-operations. We propose Janus, a software-hardware co-design that parallelizes sub-operations if they are independent and pre-executes sub-operations if their dependencies are resolved. Performance: 2X faster than the serialized baseline.
BACKGROUND. Non-volatile memory (NVM) is high-speed, persistent, and byte-addressable. NVM allows programs to directly manipulate persistent data in memory through a load/store interface. Unlike conventional DRAM, NVM acts as both memory and storage (example: Intel 3D XPoint). A practical NVM system requires both memory and storage support.
MEMORY AND STORAGE SUPPORT. Memory and storage support is designed for three goals. Security: prevent attackers from stealing or tampering with data (encryption, integrity verification, etc.). Bandwidth: improve NVM's limited bandwidth (deduplication, compression, etc.). Endurance: extend NVM's limited lifetime (wear-leveling, error correction, etc.). We refer to the memory and storage support as backend memory operations.
BACKEND MEMORY OPERATION LATENCY. [Figure: timeline of an NVM write, from the core through cache writeback and the memory controller to the NVM access.]
BACKEND MEMORY OPERATION LATENCY. [Figure: the same timeline with backend memory operations added at the memory controller, split into volatile and non-volatile regions.] Recent NVM support guarantees that writes accepted by the memory controller are non-volatile.
BACKEND MEMORY OPERATION LATENCY. [Figure: the same timeline, annotated with ~15 ns and >100 ns.] Latency to persistence.
WHY IS WRITE LATENCY IMPORTANT? NVM programs need to use crash consistency mechanisms that enforce data writeback. [Figure: a persist_barrier forces data from the core's volatile cache into non-volatile NVM.]
WRITE LATENCY IN NVM PROGRAMS. Example: steps in an undo logging transaction. [Figure: timeline of Backup, Update, and Commit, with a persist_barrier triggering writeback from the cache.] Execution cannot continue until writeback completes.
WRITE LATENCY IN NVM PROGRAMS. Example: steps in an undo logging transaction. [Figure: timeline of Backup, Update, and Commit, with write latency on the critical path.] The crash consistency mechanism puts write latency on the critical path.
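As a rough illustration of why writeback latency lands on the critical path, the sketch below models a transaction as a sequence of steps that each stall at a persist barrier until writeback completes. The names `txn_step_t` and `txn_latency_ns` are hypothetical, and the assumption that every step ends in a barrier is a simplification made here, not something the slides state.

```c
#include <stddef.h>

/* Hypothetical model: each transaction step does some compute work and then
 * waits at a persist barrier until its writeback becomes persistent. */
typedef struct {
    unsigned compute_ns;   /* time spent executing the step itself */
} txn_step_t;

/* Total time of a transaction whose steps (e.g. backup, update, commit)
 * run in order and each stall for persist_ns at a persist barrier. */
unsigned txn_latency_ns(const txn_step_t *steps, size_t n, unsigned persist_ns)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        /* The stall is added in full: nothing overlaps with it. */
        total += steps[i].compute_ns + persist_ns;
    }
    return total;
}
```

Because the stall is paid once per step, any increase in the latency to persistence is multiplied by the number of barriers in the transaction.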
WRITE LATENCY IN NVM PROGRAMS. [Figure: timeline of Backup, Update, and Commit, with backend memory operations lengthening each writeback.] Backend memory operations increase the writeback latency.
Backend memory operations are on the critical path. How can we reduce the latency?
OUTLINE. Background and Motivation; Key Ideas; Janus Mechanism; Evaluation; Conclusion
OBSERVATION. Each backend memory operation seems indivisible, so integrating them leads to serialized operations. Examples: counter-mode encryption, integrity verification, deduplication.
OBSERVATION. However, it is possible to decompose them into sub-operations. [Figure: counter-mode encryption decomposed into generating a counter, encrypting the counter, combining the data with the encrypted counter, and generating a MAC (for integrity verification).]
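The decomposition can be sketched as four small functions. This is a toy model under stated assumptions: `toy_cipher` is a non-cryptographic stand-in for a real block cipher such as AES, the 64-bit block width is arbitrary, and the exact MAC inputs are a guess, since the slide only names the sub-operations.

```c
#include <stdint.h>

/* Stand-in for a block cipher such as AES. NOT cryptographically secure;
 * it only gives the sketch a deterministic keyed function. */
static uint64_t toy_cipher(uint64_t key, uint64_t block)
{
    uint64_t x = block ^ key;
    x ^= x >> 33;
    x *= 0xff51afd7ed558ccdULL;
    x ^= x >> 33;
    return x;
}

/* Sub-operation 1: generate the next counter for a line (needs no data). */
uint64_t gen_counter(uint64_t *line_counter)
{
    return ++(*line_counter);
}

/* Sub-operation 2: encrypt the counter into a one-time pad.
 * Needs the address and counter, but not the data being written. */
uint64_t encrypt_counter(uint64_t key, uint64_t addr, uint64_t counter)
{
    return toy_cipher(key ^ counter, addr);
}

/* Sub-operation 3: XOR the data with the pad. The same call decrypts. */
uint64_t apply_pad(uint64_t data, uint64_t pad)
{
    return data ^ pad;
}

/* Sub-operation 4: MAC for integrity verification (inputs are an assumption). */
uint64_t gen_mac(uint64_t key, uint64_t ciphertext, uint64_t counter)
{
    return toy_cipher(key, ciphertext ^ toy_cipher(key, counter));
}
```

Only `apply_pad` and `gen_mac` touch the data, so in this sketch the first two sub-operations could start before the data arrives.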
KEY IDEA I: PARALLELIZATION. After decomposing the example operations: [Figure: sub-operations of counter-mode encryption, integrity verification, and deduplication.]
KEY IDEA I: PARALLELIZATION. There are two types of dependencies. 1. Intra-operation dependency: dependency within each operation. 2. Inter-operation dependency: dependency across different operations when they cooperate. [Figure: sub-operations of counter-mode encryption, integrity verification, and deduplication.]
KEY IDEA I: PARALLELIZATION. There are two types of dependencies: intra-operation dependency and inter-operation dependency. Sub-operations without a dependency are parallelizable: they can execute in parallel. [Figure: sub-operations of counter-mode encryption, integrity verification, and deduplication.]
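The benefit of running independent sub-operations in parallel can be estimated from their dependency graph. The sketch below is a minimal model, not the Janus hardware scheduler: `subop_t` and both functions are hypothetical names, latencies are in arbitrary units, and it assumes there are always enough execution units.

```c
#include <stddef.h>

#define MAX_DEPS 4

/* Hypothetical description of one sub-operation. Entries must be listed in
 * topological order: every dependency index is smaller than the entry's own. */
typedef struct {
    unsigned latency;         /* cost of this sub-operation (arbitrary units) */
    size_t   ndeps;           /* number of sub-operations it waits for */
    size_t   deps[MAX_DEPS];  /* indices of those sub-operations */
} subop_t;

/* Serialized baseline: sub-operations run back to back. */
unsigned serialized_latency(const subop_t *ops, size_t n)
{
    unsigned total = 0;
    for (size_t i = 0; i < n; i++)
        total += ops[i].latency;
    return total;
}

/* Parallelized: each sub-operation starts once all of its dependencies have
 * finished, so the total is the longest dependency chain.
 * finish[] is caller-provided scratch space with n entries. */
unsigned parallel_latency(const subop_t *ops, size_t n, unsigned *finish)
{
    unsigned longest = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned start = 0;
        for (size_t d = 0; d < ops[i].ndeps; d++) {
            unsigned f = finish[ops[i].deps[d]];
            if (f > start)
                start = f;
        }
        finish[i] = start + ops[i].latency;
        if (finish[i] > longest)
            longest = finish[i];
    }
    return longest;
}
```

Both intra-operation and inter-operation dependencies are expressed the same way here, as edges in one graph.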
KEY IDEA II: PRE-EXECUTION. A write consists of an address and data, which are the external dependencies of its sub-operations. Sub-operations can pre-execute as soon as their data/address dependency is resolved. [Figure: sub-operations of counter-mode encryption, integrity verification, and deduplication.]
KEY IDEA II: PRE-EXECUTION. Address-dependent sub-operations can pre-execute as soon as the address of the write is available.
KEY IDEA II: PRE-EXECUTION. Data-dependent sub-operations can pre-execute as soon as the data of the write is available.
KEY IDEA II: PRE-EXECUTION. Both-dependent sub-operations can pre-execute as soon as both the data and the address of the write are available.
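The three classes reduce to one readiness rule, which a bitmask expresses compactly. The names `DEP_ADDR`, `DEP_DATA`, and `can_pre_execute` are hypothetical and only illustrate the rule described on these slides.

```c
#include <stdbool.h>

/* External inputs a sub-operation may need from the write. */
enum { DEP_ADDR = 1u << 0, DEP_DATA = 1u << 1 };

/* True when every input the sub-operation needs is already available.
 * needs:     DEP_ADDR, DEP_DATA, or both (the sub-operation's class)
 * available: which parts of the write the program has produced so far */
bool can_pre_execute(unsigned needs, unsigned available)
{
    return (needs & available) == needs;
}
```

An address-dependent sub-operation is ready as soon as `DEP_ADDR` is set, while a both-dependent one has to wait until both bits are set.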
KEY IDEA II: PRE-EXECUTION. A write consists of an address and data. How can we know the address/data ahead of time?
AVAILABILITY OF ADDRESS AND DATA. Example: update a tree node <Key, Value> using an undo log. Steps: traverse the tree with Key, backup, update, commit. The data for the update is known from the start, and the location is known after the traversal. During backup, both the data and the address of the update are therefore known, so the address- and data-dependent sub-operations of the update can be pre-executed. The update takes the pre-executed results when it is written back.
OUR PROPOSAL: JANUS. Janus is a Roman god with two faces: one looks into the past, and the other into the future. When the dependent data and address become available, Janus pre-executes the operations whose dependencies are resolved. [Figure: the two faces of Janus, labeled Past and Future.]
JANUS OVERVIEW. [Figure: timeline of Backup, Update, and Commit, comparing the original writeback latency with serialized backend memory operations against Janus with parallelized operations.] Parallelization reduces the latency of each operation.
JANUS OVERVIEW. [Figure: the same timeline for Janus with both parallelization and pre-execution.] Pre-execution moves the latency of backend memory operations off the critical path.
JANUS OVERVIEW. [Figure: software/hardware stack. SW: the NVM program and the Janus SW interface. HW: the CPU core, the memory controller, and Janus HW. Together they provide parallelization and pre-execution.] The Janus software interface enables pre-execution.
OUTLINE. Background and Motivation; Key Ideas; Janus Mechanism; Evaluation; Conclusion
SOFTWARE INTERFACE. Janus provides functions for pre-executing address-, data-, and both-dependent sub-operations at object granularity. The Janus interface is hardware-independent: it only takes address and data.
    void updateTree(int key, item_t val) {
        // find tree node with key
        node* location = find(key);
        // add old val to undo log
        undo_log(location);
        // update val
        location->val = val;
        persist_barrier();
        // commit updates
        commit();
    }
Annotations: the data for the update (val) is known on entry; the address for the update (location) is known after find(). The steps are backup (undo_log), update, and commit.
SOFTWARE INTERFACE. Janus provides functions for pre-executing address-, data-, and both-dependent sub-operations at object granularity. The Janus interface is hardware-independent: it only takes address and data.
    void updateTree(int key, item_t val) {
        pre_t pre_obj;
        PRE_DATA(&pre_obj, &val, sizeof(item_t));
        // find tree node with key
        node* location = find(key);
        PRE_ADDR(&pre_obj, location, sizeof(item_t));
        // add old val to undo log
        undo_log(location);
        // update val
        location->val = val;
        persist_barrier();
        // commit updates
        commit();
    }
Annotations: pre_obj keeps track of the pre-execution (PRE_ID, Thread_ID, Transaction_ID, Address, Size). PRE_DATA pre-executes data-dependent sub-operations; PRE_ADDR pre-executes address-dependent sub-operations.
AUTOMATED INSTRUMENTATION. Manual instrumentation is effective at improving performance, but requires significant programmer effort. Janus provides a compiler pass to automatically instrument the program with the interface. [Figure: an NVM program goes through the compiler pass, which performs dependency analysis on address and data and then instruments the program with the Janus interface.]
JANUS OVERVIEW. [Figure: software/hardware stack. SW: the NVM program and the Janus SW interface. HW: the CPU core, the memory controller, and Janus HW. Together they provide parallelization and pre-execution.]
HARDWARE MECHANISM. Converter: converts pre-execution requests from object granularity to cache line granularity. Intermediate result buffer: stores pre-execution results to avoid changing processor/memory state. Correctness check: invalidates incorrect pre-execution. [Figure: requests from the core pass through the converter to the backend memory operations in the memory controller; results go to the intermediate result buffer, guarded by the correctness check.]
HARDWARE MECHANISM. Converter: converts pre-execution requests from object granularity to cache line granularity. Intermediate result buffer: stores pre-execution results to avoid changing processor/memory state. Correctness check: invalidates incorrect pre-execution. [Figure: PRE_BOTH X is converted into cache-line requests X1 and X2, whose results R1 and R2 are kept in the intermediate result buffer; a later WRITE X1 passes the correctness check, takes the results, and the write completes.]
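The converter's job, turning one object-granularity request into cache-line requests, comes down to address arithmetic. The sketch below assumes a 64-byte cache line, which the slides do not state; `first_line` and `lines_spanned` are hypothetical helper names.

```c
#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64u  /* assumed line size; not stated on the slides */

/* Address of the cache line that contains addr. */
uint64_t first_line(uint64_t addr)
{
    return addr & ~(uint64_t)(CACHE_LINE_SIZE - 1);
}

/* Number of cache lines an object of `size` bytes at `addr` spans. */
size_t lines_spanned(uint64_t addr, size_t size)
{
    if (size == 0)
        return 0;
    uint64_t first = first_line(addr);
    uint64_t last  = first_line(addr + size - 1);  /* line of the last byte */
    return (size_t)((last - first) / CACHE_LINE_SIZE) + 1;
}
```

Under this assumption a 128-byte object aligned to a line boundary becomes two requests, and even a small object can span two lines when it straddles a boundary.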
OUTLINE. Background and Motivation; Key Ideas; Janus Mechanism; Evaluation; Conclusion
METHODOLOGY. Gem5 simulator configuration:
    Processor: out-of-order, 4 GHz
    L1 D/I, L2 cache: 64/32 KB, 2 MB per core (shared)
    Backend memory operation cache: 512 KB per core for each operation (shared)
    Backend memory operation units: 4 units per core
    Intermediate result buffer: 64 entries per core (shared)
Design points: Serialized (all backend memory operations are serialized) and Janus (pre-execute parallelized backend memory operations). Instrumentation of Janus functions: manual and automated.
JANUS VS. BASELINE. [Figure: speedup over the serialized baseline, broken down into parallelization and pre-execution.] Janus provides a 2.35X speedup on average.
JANUS VS. BASELINE. Parallelization reduces the latency; pre-execution moves the latency off the critical path. [Figure: speedup over the serialized baseline, broken down into parallelization and pre-execution.] Pre-execution provides more speedup.
MANUAL VS. AUTO INSTRUMENTATION. [Figure: speedup over the serialized baseline for Janus (Manual) and Janus (Auto); the automated version is 15% slower.] The compiler pass provides close-to-manual performance.
CONCLUSION. Problem: Storage and memory support for NVM systems increases write latency, which is on the critical path of NVM programs. Key Ideas: We observe that these operations can be divided into smaller sub-operations. We propose Janus, a software-hardware co-design that parallelizes sub-operations if they are independent and pre-executes sub-operations if their dependencies are resolved. Results: Janus provides a 2.35x speedup compared to serializing all operations. Janus provides a compiler pass for automated instrumentation.
SW INTERFACE. Pre-execution object (pre_obj), used to match pre-execution with the actual write. PRE_ID: ID of the pre-executed NVM object. Thread_ID: ID of the current thread. Transaction_ID: ID of the current transaction. Address: address of the NVM object. Size: size of the NVM object. Pre-execution functions, chosen depending on the availability of address/data. PRE_ADDR(): pre-execute address-dependent sub-operations. PRE_DATA(): pre-execute data-dependent sub-operations. PRE_BOTH(): pre-execute all sub-operations.
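One way to picture the pre-execution object is as a plain struct carrying the five listed fields. The field names follow the slide, but the C types and the matching rule in `pre_matches_write` are assumptions made for illustration; the slides do not say how a write is matched to a pre-execution.

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Field names follow the slide; the C types are assumptions. */
typedef struct {
    uint32_t  pre_id;          /* ID of the pre-executed NVM object */
    uint32_t  thread_id;       /* ID of the current thread */
    uint32_t  transaction_id;  /* ID of the current transaction */
    uintptr_t addr;            /* address of the NVM object */
    size_t    size;            /* size of the NVM object in bytes */
} pre_t;

/* Hypothetical matching rule: a write matches a pre-execution when it comes
 * from the same thread and transaction and its address falls inside the
 * pre-executed object. */
bool pre_matches_write(const pre_t *p, uint32_t thread_id,
                       uint32_t transaction_id, uintptr_t write_addr)
{
    return p->thread_id == thread_id &&
           p->transaction_id == transaction_id &&
           write_addr >= p->addr &&
           write_addr - p->addr < p->size;
}
```

The range check subtracts before comparing, so it stays correct even when `addr + size` would overflow.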
JANUS CORRECTNESS. Pre-execution requests can be incorrect: pre-executing with the wrong address or data, redundant pre-execution, etc. Janus hardware always guarantees correctness. The intermediate result buffer keeps a copy of the pre-executed data and compares it against the write when the write takes the result. Unused results are discarded after a timeout. Correctness issues in using the interface only result in performance degradation.
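The compare-on-take check can be sketched in software as follows. This is a simplified model, not the hardware design: the 64-byte line size is assumed, `result` is a single word standing in for whatever the backend operations produce, the timeout-based discard is omitted, and all names are hypothetical. The 64-entry capacity mirrors the buffer size given in the methodology.

```c
#include <stdint.h>
#include <stdbool.h>
#include <string.h>

#define LINE_BYTES  64   /* assumed cache line size */
#define IRB_ENTRIES 64   /* mirrors the 64-entry buffer in the methodology */

typedef struct {
    bool     valid;
    uint64_t addr;              /* cache line address of the pre-execution */
    uint8_t  data[LINE_BYTES];  /* copy of the data used for pre-execution */
    uint64_t result;            /* stand-in for the pre-executed result */
} irb_entry_t;

typedef struct { irb_entry_t e[IRB_ENTRIES]; } irb_t;

/* Record a pre-execution result; returns false if the buffer is full. */
bool irb_insert(irb_t *b, uint64_t addr, const uint8_t *data, uint64_t result)
{
    for (int i = 0; i < IRB_ENTRIES; i++) {
        if (!b->e[i].valid) {
            b->e[i].valid = true;
            b->e[i].addr = addr;
            memcpy(b->e[i].data, data, LINE_BYTES);
            b->e[i].result = result;
            return true;
        }
    }
    return false;
}

/* Called by the actual write. The entry is consumed either way; the result
 * is handed out only if the write's data matches the stored copy. */
bool irb_take(irb_t *b, uint64_t addr, const uint8_t *data, uint64_t *result)
{
    for (int i = 0; i < IRB_ENTRIES; i++) {
        irb_entry_t *en = &b->e[i];
        if (en->valid && en->addr == addr) {
            en->valid = false;
            if (memcmp(en->data, data, LINE_BYTES) != 0)
                return false;  /* stale pre-execution: redo on the normal path */
            *result = en->result;
            return true;
        }
    }
    return false;  /* nothing was pre-executed for this line */
}
```

A failed take only sends the write down the ordinary path, so in this model a wrong or missing pre-execution costs time and never changes what is written.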
WORKLOADS.
    Array Swap: swap random items in an array
    Queue: randomly enqueue or dequeue queue nodes
    Hash Table: insert random values into a hash table
    RB-Tree: insert random values into a red-black tree
    B-Tree: insert random values into a B-tree
    TATP: update random records in the TATP benchmark
    TPCC: add new orders from the TPCC benchmark
MULTICORE. [Figure: speedup over the serialized baseline for 1, 2, 4, and 8 cores.] Speedup decreases as the number of cores increases, because memory bandwidth becomes the bottleneck.
PRE-EXECUTION PERCENTAGE. [Figure: fraction of complete and incomplete pre-executions over total operations for Array Swap, Queue, Hash Table, B-Tree, RB-Tree, TATP, and TPCC.]
JANUS VS. IDEAL. Compare Janus with the ideal case that has non-blocking writeback. [Figure: slowdown over non-blocking writeback for Serialized and Janus on Array Swap, Queue, Hash Table, B-Tree, RB-Tree, TATP, and TPCC, plus the geometric mean.]