Revolutionizing Data Storage: The BW-Tree Architecture for Modern Hardware Platforms


The Bw-tree presents a novel latch-free approach for high-performance data management on modern hardware. By leveraging processor caches and implementing log-structured storage, it offers efficient data organization and management. The architecture ensures thread efficiency and cache preservation, and targets flash storage for its high I/O operation rates. This presentation outlines the design, benefits, and techniques of the Bw-tree, emphasizing how it improves write performance and reduces cache invalidations.





Presentation Transcript


  1. THE BW-TREE: A B-TREE FOR NEW HARDWARE PLATFORMS Justin J. Levandoski, David B. Lomet, Sudipta Sengupta Presented By: Isha Narula

  2. INTRODUCTION This paper proposes a new B-tree, called the Bw-tree, that achieves high performance via a latch-free approach. The Bw-tree exploits the processor caches of modern multi-core chips. Its storage manager uses a unique form of log structuring that blurs the distinction between a page store and a record store. The paper presents the architecture and algorithms of the Bw-tree, describing its main-memory aspects, and details the latch-free technique that avoids cache-line invalidations.

  3. ATOMIC RECORD STORES An atomic record store (ARS) is a component of a more complete transactional system, given appropriate control operations. An ARS supports the reading and writing of individual records, each identified by a key. An ARS is more than an access method: it includes the management of stable storage and the requirement that updates be recoverable when the system crashes.

  4. The Bw-Tree The Bw-tree is latch-free, i.e., it ensures that a thread never yields or even redirects its activity in the face of conflicts. The Bw-tree performs delta updates and avoids updating a page in place, thus preserving the cached lines of pages. The Bw-tree targets flash storage because it offers high I/O operations per second. It performs log structuring itself at its storage layer, which avoids dependence on the FTL and keeps write performance as high as possible.

  5. PAPER OUTLINE The presentation gives an overview of the following topics: (1) Bw-Tree Architecture, (2) Bw-Tree Organization, (3) Structure Modifications, (4) Cache Management, and (5) Performance Experiments and Results.

  6. Bw-tree ARCHITECTURE The Bw-tree atomic record store is a classic B+-tree in many respects. The access method layer, the Bw-tree layer, sits at the top and interacts with the middle cache layer. The cache manager is built on top of the storage layer, which implements a log-structured store (LSS).

  7. Modern Hardware Changes in Bw-tree Design The main technique is that threads never block: latches are eliminated entirely. State changes are installed using the atomic compare-and-swap (CAS) instruction. Node updates in the Bw-tree are performed via delta updates (attaching an update to an existing page), not via update-in-place. Avoiding update-in-place reduces CPU cache invalidation and results in higher cache hit ratios.

  8. The Mapping Table The mapping table maps logical pages to physical pages. Logical pages are identified by a logical page identifier, or PID. The mapping table translates a PID into either a flash offset or a memory pointer, and it is the central location for managing the paginated tree. PIDs are used in the Bw-tree to link the nodes of the tree, severing the connection between a node's physical location and inter-node links. This allows the physical location of a Bw-tree node to change on every update. Bw-tree nodes are thus logical and do not occupy fixed locations, either on stable storage or in main memory. Page size is elastic, meaning we can split when convenient. (A minimal sketch of such a table follows.)
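A minimal C++ sketch of such a mapping table; the types and names here are illustrative, not the paper's implementation. Each PID indexes an atomic slot holding the page's current physical address, and all state changes go through a CAS on that slot.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

using PID = std::uint32_t;

struct MappingTable {
    // One atomic slot per PID; a CAS on a slot is the only way the
    // page's state pointer ever changes.
    std::vector<std::atomic<std::uintptr_t>> slots;

    explicit MappingTable(std::size_t capacity) : slots(capacity) {}

    // Translate a PID into the page's current physical address
    // (a memory pointer here; a flash offset would be encoded similarly).
    std::uintptr_t translate(PID pid) const {
        return slots[pid].load(std::memory_order_acquire);
    }
};
```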

  9. Delta Updating Page state changes are made by creating a delta record (describing the change) and prepending it to the existing page. The memory address of the delta record is installed into the page's physical address slot in the mapping table using the atomic CAS instruction. Delta updating simultaneously enables latch-free access in the Bw-tree and preserves processor data caches. The Bw-tree mapping table is the key enabler of these features via its ability to isolate the effects of node updates to that node alone.

  10. Bw-TREE STRUCTURE MODIFICATIONS Latches do not protect parts of our index during structure modifications (SMOs). A page split, for example, introduces changes to more than one page. To overcome this, an SMO is broken into a sequence of atomic actions, each installable via a CAS; a B-link design makes this easier, letting us decompose a node split into two "half split" atomic actions. To make sure that no thread has to wait for an SMO to complete, a thread that sees a partial SMO completes it before proceeding with its own operation (a sketch of this help-along rule follows).
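A rough sketch of the help-along rule, with a hypothetical Node type standing in for a real Bw-tree page; the real completion step would post the missing index term delta rather than clear a flag.

```cpp
#include <atomic>

// Minimal sketch (not the paper's code): a thread that observes a
// half-finished SMO completes it before doing its own work, so no
// thread ever blocks waiting for another.
struct Node {
    std::atomic<bool> half_split{false};  // split delta posted, parent
                                          // not yet updated
};

void complete_smo(Node& n) {
    // In the real tree this posts the index term delta to the parent;
    // here we just clear the flag to show the control flow.
    n.half_split.store(false, std::memory_order_release);
}

void access(Node& n) {
    if (n.half_split.load(std::memory_order_acquire)) {
        complete_smo(n);  // finish someone else's half split first
    }
    // ... proceed with this thread's own read or update ...
}
```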

  11. Log Structured Store Pages are written sequentially in a large batch, greatly reducing the number of separate write I/Os required. The LSS flushes only the deltas that represent the changes made to a page since its previous flush. The LSS cleans prior parts of flash representing the old parts of its log storage. Delta flushing reduces pressure on the LSS cleaner by reducing the amount of storage used per page.

  12. Managing Transactional Logs Our ARS needs to ensure that updates persist across system crashes. Each update operation is tagged with a unique identifier, typically the log sequence number (LSN) of the update. Pages are flushed lazily while honoring the write-ahead log (WAL) protocol. Because recent updates are separate deltas from the rest of the page, we can remove recent updates from pages when flushing.

  13. IN-MEMORY LATCH-FREE PAGES: Basic Structure The information on a Bw-tree page is similar to that of a B+-tree: internal index nodes contain (separator key, pointer) pairs sorted by key that direct searches down the tree, and data nodes contain (key, record) pairs. Pages also contain a low key, a high key, and a side pointer. Bw-tree pages are unique in that they are logical and identified through the mapping table. Pages are elastic, meaning there is no hard limit on how large a page may grow, and pages may have delta records prepended to them.

  14. Updates to a Bw-Tree Page A delta record describing the update is created and prepended to the existing page. Delta records let us incrementally update page state in a latch-free manner. A new delta record D is created that points to the page's current address P; the memory address of the delta record will serve as the new memory address of the page. An atomic compare-and-swap instruction replaces the current address P with the address of D: the CAS compares against the current address in the mapping table, and if that address still equals P, the CAS succeeds and D is installed; otherwise it fails and is retried. A minimal sketch of this install loop follows.
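A minimal sketch of the install loop against a mapping-table slot; Delta and install_delta are illustrative names, not the paper's definitions.

```cpp
#include <atomic>
#include <cstdint>

struct Delta {
    std::uintptr_t next;  // address of the prior page state (P)
    // ... payload describing the update would live here ...
};

// Latch-free install: swing the slot from P to D, retrying if another
// thread changed the slot between our read and our CAS.
void install_delta(std::atomic<std::uintptr_t>& slot, Delta* d) {
    std::uintptr_t expected = slot.load(std::memory_order_acquire);
    do {
        d->next = expected;  // re-point the delta at the current state
    } while (!slot.compare_exchange_weak(
                 expected, reinterpret_cast<std::uintptr_t>(d),
                 std::memory_order_acq_rel, std::memory_order_acquire));
}
```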

  15. Leaf-Level Update Operations Updates at the leaf level are of three types: (1) insert, (2) modify, and (3) delete. All update deltas contain an LSN provided by the client issuing the update; the LSN is used for transactional recovery. Insert and modify deltas contain a record representing the new payload. A leaf-page search traverses the delta chain; if a delta containing the key represents an insert or modify, the search succeeds and returns the record (see the sketch below).
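A sketch of a leaf search walking the delta chain, with assumed types; the real page layout and fall-through to the base page are more involved.

```cpp
#include <optional>
#include <string>

enum class DeltaKind { Insert, Modify, Delete };

struct LeafDelta {
    DeltaKind kind;
    int key;
    std::string record;     // payload for insert/modify
    const LeafDelta* next;  // older state; nullptr means base page next
};

// Newest delta is checked first: an insert/modify delta for the key
// yields its record, a delete delta means "not found", and the base
// page is consulted only when no delta mentions the key.
std::optional<std::string> search_chain(const LeafDelta* head, int key) {
    for (const LeafDelta* d = head; d != nullptr; d = d->next) {
        if (d->key == key) {
            if (d->kind == DeltaKind::Delete) return std::nullopt;
            return d->record;  // most recent insert or modify wins
        }
    }
    // ... fall through to a binary search of the consolidated base page ...
    return std::nullopt;
}
```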

  16. PAGE CONSOLIDATION Consolidation creates a new, reorganized base page containing all entries from the original base page as modified by the updates in the delta chain. The consolidating thread creates a new base page populated with a sorted vector containing the most recent version of each record, taken from either the delta chain or the old base page. The address of the consolidated page is then installed in the mapping table with a CAS, as sketched below.
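A simplified sketch of consolidation under assumed types; std::map is used merely to express "keep the most recent version of each key in sorted order".

```cpp
#include <map>
#include <string>
#include <vector>

struct Record { int key; std::string value; bool deleted; };

std::vector<Record> consolidate(const std::vector<Record>& base,
                                const std::vector<Record>& chain_newest_first) {
    std::map<int, Record> merged;
    for (const Record& r : base) merged[r.key] = r;
    // Apply deltas oldest-to-newest so later updates overwrite earlier ones.
    for (auto it = chain_newest_first.rbegin();
         it != chain_newest_first.rend(); ++it) {
        merged[it->key] = *it;
    }
    std::vector<Record> page;
    for (const auto& kv : merged)
        if (!kv.second.deleted) page.push_back(kv.second);  // drop deletes
    return page;  // afterwards installed via CAS on the mapping table
}
```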

  17. RANGE SCANS A range scan is specified by a key range; either boundary key can be omitted, i.e., one end of the key range can be open-ended. A scan specifies either ascending or descending key order for delivering the records. The scan maintains a cursor recording how far the search has progressed. Each next-record operation is treated as an atomic unit; the entire scan is not atomic. We assume transactional locking will prevent modifications to records we have already seen, but the Bw-tree itself does not dictate the form of concurrency control. The sketch below illustrates the cursor discipline.
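A toy sketch of the cursor discipline for an ascending scan, with std::map standing in for the tree; Scan and its members are illustrative. Each next() is its own unit: it re-enters the index from the cursor key, so updates landing between calls are observed.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>

struct Scan {
    const std::map<int, std::string>& index;
    int low, high;              // inclusive key-range bounds
    std::optional<int> cursor;  // last key returned; empty = not started

    std::optional<std::pair<int, std::string>> next() {
        auto it = cursor ? index.upper_bound(*cursor)
                         : index.lower_bound(low);
        if (it == index.end() || it->first > high) return std::nullopt;
        cursor = it->first;  // record how far the scan has progressed
        return *it;
    }
};
```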

  18. Garbage Collection We must not deallocate memory that is still accessed by another thread. An epoch mechanism is a way of protecting objects from being deallocated while in use: a thread joins an epoch when it wants to protect objects, and exits the epoch when its dependency is finished. A thread's duration in an epoch spans a single operation. A minimal sketch follows.
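A minimal sketch of epoch protection; the names and the fixed thread count are assumptions, and the paper's mechanism also tracks per-epoch garbage lists. The rule it encodes: memory retired in epoch E may be freed only once no thread remains in an epoch at or before E.

```cpp
#include <atomic>
#include <cstdint>

struct EpochManager {
    std::atomic<std::uint64_t> global_epoch{1};
    std::atomic<std::uint64_t> active[64] = {};  // per-thread; 0 = inactive

    std::uint64_t enter(int tid) {
        std::uint64_t e = global_epoch.load(std::memory_order_acquire);
        active[tid].store(e, std::memory_order_release);  // join epoch
        return e;
    }
    void exit(int tid) { active[tid].store(0, std::memory_order_release); }

    // Garbage retired in `epoch` is reclaimable only if every active
    // thread has moved past it.
    bool safe_to_reclaim(std::uint64_t epoch) const {
        for (const auto& a : active) {
            std::uint64_t e = a.load(std::memory_order_acquire);
            if (e != 0 && e <= epoch) return false;
        }
        return true;
    }
};
```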

  19. Bw-TREE STRUCTURE MODIFICATIONS Here is a more detailed view of the Bw-tree structure modification operations (SMOs), which are performed in a latch-free manner. The two operations are: (1) NODE SPLIT and (2) NODE MERGE.

  20. Node Split Splits are triggered by an accessor thread that notices a page's size has grown beyond a system threshold; after attempting its own operation, the thread performs the split. The Bw-tree employs the B-link atomic split installation technique, which works in two phases: the split is first atomically installed at the child level (the "half split"), and the parent node is then atomically updated with a new index term containing the new separator key and a pointer to the newly created split page.

  21. Child Split To split node P, the Bw-tree layer requests allocation of a new node Q in the mapping table. An appropriate separator key K is then found in P that provides a balanced split, and a new consolidated base state is created for Q containing the records from P with keys greater than K. Page Q also contains a side link to the former right sibling of P. The physical address of Q is then installed in the mapping table. At this point, the original (unsplit) state of P is still present in the mapping table, and Q is invisible to the rest of the index. We atomically install the split by prepending a split delta record to P.

  22. The split delta contains two pieces of information: (1) the separator key K, used to invalidate all records within P greater than K, and (2) a logical side pointer to the new sibling Q. At this point the index is valid, even without an index term for Q in the parent node O. All searches for a key contained within Q will first go to P; upon encountering the split delta on P, the search traverses the side link to Q when the search key is greater than the separator key K. Meanwhile, all searches for keys less than K remain at P. The routing sketch below illustrates this.
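A sketch of this routing decision with assumed types; in the real tree the split delta sits at the head of P's delta chain and the side pointer is a PID resolved through the mapping table.

```cpp
#include <cstdint>

using PID = std::uint32_t;

struct SplitDelta {
    int separator_key;  // K: records in P above this key are invalid
    PID right_sibling;  // logical side pointer to the new sibling Q
};

// Returns the PID on which the search must continue.
PID route(PID current, const SplitDelta& d, int search_key) {
    if (search_key > d.separator_key)
        return d.right_sibling;  // key now lives in Q
    return current;              // key still belongs to P
}
```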

  23. Parent Update An index term delta record is prepended to the parent of P and Q to complete the second half split. The index term delta contains (1) the separator key KP between P and Q, (2) a logical pointer to Q, and (3) the separator key KQ for Q. Having KP and KQ present in the index term delta is an optimization to improve search speed. The figure depicts our running split example after prepending the index entry delta to parent page O, where the dashed line represents the logical pointer to page Q.

  24. Node Merge Node merges are triggered when a thread encounters a node whose size is below some threshold. Node merges are also performed in a latch-free manner, but they are decidedly more complicated than splits and require more atomic actions. Marking for Delete: the node R to be merged is updated with a remove node delta, which stops all further use of R. A thread encountering a remove node delta in R must read or update the contents previously in R by going to the left sibling, into which the data from R will be merged.

  25. Merging Children: A node merge delta posted to L indicates that the contents of R are to be included in L. Further, it points directly to R's state, which is now logically considered part of L; the storage for R's state is thus transferred to L.

  26. Parent Update: The parent node P of R is now updated by deleting its index term associated with R. This is done by posting an index term delete delta that indicates not only that R is being deleted, but also that L will now include the data from the key space formerly contained by R.

  27. CACHE MANAGEMENT The cache layer is responsible for reading, flushing, and swapping pages between memory and flash. It maintains the mapping table and provides the abstraction of logical pages to the Bw-tree layer. When the Bw-tree layer requests a page reference using a PID, the cache layer returns the address in the mapping table if it is a memory pointer. All updates to pages, including those involving page-management operations like split and page flush, involve CAS operations on the mapping table at the location indexed by the PID.

  28. Write-Ahead Log Protocol and LSNs Record insert and update deltas in a Bw-tree page are tagged with the Log Sequence Number (LSN) of their operations. The highest LSN among the updates flushed is recorded in the flush delta describing the flush. LSNs are generated by the higher-level transactional component (TC) and are used by its transactional log.

  29. Flushing Pages to the LSS The LSS provides a large buffer into which the cache manager posts pages and system transactions describing Bw-tree structure modifications. We give a brief overview of this functionality. Page Marshalling: the cache manager marshals the bytes from the pointer representation of the page in main memory into a linear representation that can be written to the flush buffer. Incremental Flushing: when flushing a page, the cache manager marshals only those delta records whose LSN lies between the previously flushed largest LSN on that page and the current End of Stable Log (ESL) value, as sketched below.
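A sketch of the incremental selection, with assumed types and field names: deltas at or below the last flushed LSN are already on flash, and deltas beyond the ESL must wait for the log to become stable.

```cpp
#include <cstdint>
#include <vector>

struct DeltaRec { std::uint64_t lsn; /* ... payload ... */ };

// Keep only deltas whose LSN lies in (last_flushed_lsn, esl].
std::vector<DeltaRec> select_for_flush(const std::vector<DeltaRec>& chain,
                                       std::uint64_t last_flushed_lsn,
                                       std::uint64_t esl) {
    std::vector<DeltaRec> out;
    for (const DeltaRec& d : chain)
        if (d.lsn > last_flushed_lsn && d.lsn <= esl)
            out.push_back(d);  // marshalled into the flush buffer
    return out;
}
```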

  30. Flush Activity The flush buffer aggregates writes to the LSS up to a configurable threshold (currently set at 1 MB) to reduce I/O overhead. It uses ping-pong (double) buffers and alternates between them with asynchronous I/O calls to the LSS, so that the buffer for the next batch of page flushes can be prepared while the current one is in progress. After the I/O completes for a flush buffer, the states of the respective pages are updated in the mapping table. The sketch below shows the buffer-swapping discipline.
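A simplified, single-writer sketch of ping-pong buffering using std::async for the asynchronous write; the real cache manager's I/O path and synchronization are more elaborate, and the names here are illustrative.

```cpp
#include <cstddef>
#include <future>
#include <vector>

constexpr std::size_t kFlushThreshold = 1 << 20;  // 1 MB, as in the paper

struct FlushBuffers {
    std::vector<char> buf[2];
    int active = 0;        // buffer currently being filled
    std::future<void> io;  // pending asynchronous write, if any

    void append(const char* data, std::size_t n) {
        auto& b = buf[active];
        b.insert(b.end(), data, data + n);
        if (b.size() >= kFlushThreshold) swap_and_flush();
    }

    void swap_and_flush() {
        if (io.valid()) io.wait();  // prior async write must finish first
        int full = active;
        active = 1 - active;        // new flushes fill the other buffer
        io = std::async(std::launch::async, [this, full] {
            // ... write buf[full] sequentially to the LSS here ...
            buf[full].clear();
        });
    }
};
```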

  31. PERFORMANCE EVALUATION The performance of the Bw-tree is compared with the Berkeley DB key-value database, chosen for its good performance as a standalone storage engine. The Bw-tree is also compared to a latch-free skip list implementation. The skip list has become a popular alternative to B-trees for memory-optimized databases, since it can be implemented latch-free and exhibits fast insert performance while maintaining logarithmic search cost.

  32. Two aspects of the Bw-tree were mainly evaluated: (1) the effect of delta chain length on performance, and (2) the latch-free failure rate for posting delta updates. (Figure: effect of delta chain length on performance.)

  33. Comparing the Bw-Tree to a Traditional B-Tree Architecture For the Xbox workload, the Bw-tree exhibits a throughput of 10.4M operations/second, while Berkeley DB has a throughput of 555K operations/second, representing a speedup of 18.7x. The throughput of both systems drops for the update-intensive deduplication workload; however, the Bw-tree maintains an 8.6x speedup over Berkeley DB. In general, two main aspects lead to the superior performance of the Bw-tree: (1) latch freedom and (2) CPU cache efficiency.

  34. (Figure slide.)

  35. Comparing the Bw-Tree to a Latch-Free Skip List The performance of the Bw-tree is also compared to a latch-free skip list implementation. The skip list provides key-ordered access with logarithmic search overhead and can easily be implemented in a latch-free manner; for these reasons, it is starting to receive attention as a B-tree alternative in memory-optimized databases.

  36. Cache Performance The CPU cache efficiency of the Bw-tree was also evaluated against the skip list, and the Bw-tree exhibits very good cache efficiency: almost 90% of its memory reads come from the cache, compared to 75% for the skip list. For a given cache size, more of the memory accesses involved in a search hit the cache for the Bw-tree than for the skip list.

  37. CONCLUSION The Bw-tree implementation achieves very high performance, and it is sufficiently complete that normal operation performance can be measured reliably and consistently. It is designed and implemented so that it can exist as a free-standing atomic record store. The innovations proposed in this paper, such as elastic pages, delta updates, shared read-only state, latch-free operation, and log-structured storage, eliminate thread blocking, improve processor cache effectiveness, and reduce I/O demands. These techniques should work well in other settings as well, including hashing and multi-attribute access methods. Millions of operations per second were sustained on a single vanilla CPU.
