Exploring NoSQL Database Scalability Using Indexing Techniques

Slide Note
Embed
Share

Dive into the world of NoSQL database scalability by understanding how indexing enables richer queries and how local indexing impacts partitioning, updates, and lookups across distributed databases.


Uploaded on Oct 10, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Replex: A Multi-Index, Highly- Available Data Store Amy Tai* , Michael Wei*, Michael J. Freedman , Ittai Abraham*, Dahlia Malkhi* *VMware Research, Princeton University

  2. SQL vs NoSQL Key-value stores Ex: HyperDex, Cassandra Ex: CockroachDB, Yesquel, H-Store NoSQL SQL NewSQL Rich querying model Simple, key-based querying No or limited transaction support Rich querying model Transaction support Transaction support Poor performance Scalable, high performance Scalable

  3. SQL vs NoSQL Key-value stores Ex: HyperDex, Cassandra Ex: CockroachDB, Yesquel, H-Store NoSQL SQL NewSQL Rich querying model Simple, key-based querying No or limited transaction support Rich querying model Transaction support Transaction support Poor performance Scalable, high performance Scalable

  4. NoSQL Scales with Database Partitioning col2 colc col col . . . 1 0 rows r[0] r[1] r[2] . . . r[c] default partitioning key

  5. NoSQL Scales with Database Partitioning Partitions (with respect to the partitioning index)

  6. How to enable richer queries without sacrificing shared- nothing scale-out?

  7. Key Observations 1. Indexing enables richer queries

  8. Local Indexing on the Partitioning Index col2 colc col1 col0 . . . select * where col0= foo Index updates are local Index lookups are local

  9. Local Indexing Index stored locally at each partition col2 colc col1 col0 . . . insert (r[0] r[1] r[2] . . . r[c]) Index updates are local

  10. Local Indexing Each partition builds a local index col2 colc col1 col0 . . . select * where col2= foo Index updates are local Index lookups must be broadcast to all partitions Potential synchronization across partitions

  11. Global Indexing Distributed data structure spans all partitions col2 colc col1 col0 . . . insert (r[0] r[1] r[2] . . . r[c]) Index updates are multi-hop Index lookups executed on a single data structure

  12. Key Observations 1. Indexing enables richer queries 2. Indexes more efficient if data is partitioned according to that index

  13. Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

  14. Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

  15. Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

  16. Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key HyperDex

  17. Key Observations 1. Indexing enables richer queries 2. Indexes more efficient if data is partitioned according to that index 3. Rethink replication paradigm

  18. Replex New replication unit: replex -- data replica partitioned and sorted with respect to an associated index replexes Serves data replication and indexing

  19. Replex New replication unit: replex -- data replica partitioned and sorted with respect to an associated index replexes Replace data replicas with replexes Serves data replication and indexing

  20. System Architecture col2 colc col0 col1 . . .

  21. Inserts in Replex insert (r[0] r[1] r[2] . . . r[c]) ?

  22. Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

  23. Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

  24. Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

  25. Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c])

  26. Data Chains in Replex (x[0] x[1] x[2] . . . x[c]) (y[0] y[1] y[2] . . . y[c]) (z[0] z[1] z[2] . . . z[c]) Partitions in different replexes make up a data chain if they share rows

  27. Data Chains in Replex All pairs of partitions are potential data chains

  28. Main Challenge in Replex Increased failure complexity

  29. Partition Failures 1. Partition Recovery 2. Client Requests Index is unavailable, data is available

  30. Partition Failures Where is the data? Index is unavailable, data is available

  31. Finding Data After Failure

  32. Finding Data After Failure Client Requests Partition Recovery Tradeoff? Requests must be multicast Parallel Recovery Lookups reduce to linear scan at each partitions Recovery Time Request Amplification

  33. Hybrid Replex: Key Contributions 1. Recovery time vs request amplification tradeoff 2. Improve failure availability of multiple replexes 3. Storage vs recovery performance tradeoff 4. Graceful failure degradation

  34. Hybrid Replex: Key Contributions 1. Recovery time vs request amplification tradeoff 2. Improve failure availability of multiple replexes 3. Storage vs recovery performance tradeoff 4. Graceful failure degradation

  35. Hybrid Replex: What is it? Shared across r replexes Not associated with an index Partitioning function dependent on these r replexes

  36. Hybrid Replex hybrid replex

  37. Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

  38. Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

  39. Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c]) ? ? ? ?

  40. Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

  41. 1. Recovery time vs Request Amplification Recovery is less parallel but lower request amplification

  42. 2. Improve failure availability of multiple replexes

  43. 2. Improve failure availability of multiple replexes

  44. Improving Failure Availability w/o Hybrid Replexes 2x Storage!

  45. 3. Storage vs. recovery performance 1.5x Storage! but worse client requests

  46. Generalizing Hybrid Replexes k hybrid replexes may be shared across r replexes For the special case when r = 2, request amplification will be O(n1/(k+1)) For the special case when k = 1, request amplification will be O(n(r-1)/r)

  47. Implementation Built on top of HyperDex, ~700 LOC Implemented partition recovery and request re-routing on failure Implemented variety of hybrid replex configurations

  48. Evaluation 1. What is impact of replexes on steady-state performance? 2. How do hybrid replexes affect failure performance?

  49. Evaluation Setup Table with two indexes 12 server machines, 4 client machines All machines colocated in the same rack, connected via 1GB head-of-rack switch 8 CPU, 16GB memory per machine

  50. Systems Under Evaluation HyperDex Replex-2 Replex-3 Storage overhead: 2x Network hops per insert: 2 Failures Tolerated: 1 Storage overhead: 3x Network hops per insert: 3 Failures Tolerated: 2 Storage overhead: 6x Network hops per insert: 6 Failures Tolerated: 2

Related


More Related Content