Exploring NoSQL Database Scalability Using Indexing Techniques

Slide Note

Embed Share

Download Presentation

Dive into the world of NoSQL database scalability by understanding how indexing enables richer queries and how local indexing impacts partitioning, updates, and lookups across distributed databases.

mark_rla Follow

Uploaded on Oct 10, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Replex: A Multi-Index, Highly- Available Data Store Amy Tai* , Michael Wei*, Michael J. Freedman , Ittai Abraham*, Dahlia Malkhi* *VMware Research, Princeton University

SQL vs NoSQL Key-value stores Ex: HyperDex, Cassandra Ex: CockroachDB, Yesquel, H-Store NoSQL SQL NewSQL Rich querying model Simple, key-based querying No or limited transaction support Rich querying model Transaction support Transaction support Poor performance Scalable, high performance Scalable

SQL vs NoSQL Key-value stores Ex: HyperDex, Cassandra Ex: CockroachDB, Yesquel, H-Store NoSQL SQL NewSQL Rich querying model Simple, key-based querying No or limited transaction support Rich querying model Transaction support Transaction support Poor performance Scalable, high performance Scalable

NoSQL Scales with Database Partitioning col2 colc col col . . . 1 0 rows r[0] r[1] r[2] . . . r[c] default partitioning key

NoSQL Scales with Database Partitioning Partitions (with respect to the partitioning index)

How to enable richer queries without sacrificing shared- nothing scale-out?

Key Observations 1. Indexing enables richer queries

Local Indexing on the Partitioning Index col2 colc col1 col0 . . . select * where col0= foo Index updates are local Index lookups are local

Local Indexing Index stored locally at each partition col2 colc col1 col0 . . . insert (r[0] r[1] r[2] . . . r[c]) Index updates are local

Local Indexing Each partition builds a local index col2 colc col1 col0 . . . select * where col2= foo Index updates are local Index lookups must be broadcast to all partitions Potential synchronization across partitions

Global Indexing Distributed data structure spans all partitions col2 colc col1 col0 . . . insert (r[0] r[1] r[2] . . . r[c]) Index updates are multi-hop Index lookups executed on a single data structure

Key Observations 1. Indexing enables richer queries 2. Indexes more efficient if data is partitioned according to that index

Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key Partitioning by index must store data again

Partitioning by Index Storage Overheads col2 colc col1 col0 . . . default partitioning key HyperDex

Key Observations 1. Indexing enables richer queries 2. Indexes more efficient if data is partitioned according to that index 3. Rethink replication paradigm

Replex New replication unit: replex -- data replica partitioned and sorted with respect to an associated index replexes Serves data replication and indexing

Replex New replication unit: replex -- data replica partitioned and sorted with respect to an associated index replexes Replace data replicas with replexes Serves data replication and indexing

System Architecture col2 colc col0 col1 . . .

Inserts in Replex insert (r[0] r[1] r[2] . . . r[c]) ?

Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c]) ?

Partition Determined by a replexs Index insert (r[0] r[1] r[2] . . . r[c])

Data Chains in Replex (x[0] x[1] x[2] . . . x[c]) (y[0] y[1] y[2] . . . y[c]) (z[0] z[1] z[2] . . . z[c]) Partitions in different replexes make up a data chain if they share rows

Data Chains in Replex All pairs of partitions are potential data chains

Main Challenge in Replex Increased failure complexity

Partition Failures 1. Partition Recovery 2. Client Requests Index is unavailable, data is available

Partition Failures Where is the data? Index is unavailable, data is available

Finding Data After Failure

Finding Data After Failure Client Requests Partition Recovery Tradeoff? Requests must be multicast Parallel Recovery Lookups reduce to linear scan at each partitions Recovery Time Request Amplification

Hybrid Replex: Key Contributions 1. Recovery time vs request amplification tradeoff 2. Improve failure availability of multiple replexes 3. Storage vs recovery performance tradeoff 4. Graceful failure degradation

Hybrid Replex: Key Contributions 1. Recovery time vs request amplification tradeoff 2. Improve failure availability of multiple replexes 3. Storage vs recovery performance tradeoff 4. Graceful failure degradation

Hybrid Replex: What is it? Shared across r replexes Not associated with an index Partitioning function dependent on these r replexes

Hybrid Replex hybrid replex

Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c]) ? ? ? ?

Data Chains with Hybrid Replex insert (r[0] r[1] r[2] . . . r[c])

1. Recovery time vs Request Amplification Recovery is less parallel but lower request amplification

2. Improve failure availability of multiple replexes

2. Improve failure availability of multiple replexes

Improving Failure Availability w/o Hybrid Replexes 2x Storage!

3. Storage vs. recovery performance 1.5x Storage! but worse client requests

Generalizing Hybrid Replexes k hybrid replexes may be shared across r replexes For the special case when r = 2, request amplification will be O(n1/(k+1)) For the special case when k = 1, request amplification will be O(n(r-1)/r)

Implementation Built on top of HyperDex, ~700 LOC Implemented partition recovery and request re-routing on failure Implemented variety of hybrid replex configurations

Evaluation 1. What is impact of replexes on steady-state performance? 2. How do hybrid replexes affect failure performance?

Evaluation Setup Table with two indexes 12 server machines, 4 client machines All machines colocated in the same rack, connected via 1GB head-of-rack switch 8 CPU, 16GB memory per machine

Systems Under Evaluation HyperDex Replex-2 Replex-3 Storage overhead: 2x Network hops per insert: 2 Failures Tolerated: 1 Storage overhead: 3x Network hops per insert: 3 Failures Tolerated: 2 Storage overhead: 6x Network hops per insert: 6 Failures Tolerated: 2

Related

More Related Content