Modern Approaches to Database Scalability and Performance Engineering

Slide Note
Embed
Share

Explore the challenges and solutions for horizontally scaling RDBMS like PostgreSQL on Amazon servers, considering factors such as data distribution and multi-tenancy. Delve into the complexities of H-Scaling and the trade-offs between relational and non-relational databases in terms of performance and ACID compliance.


Uploaded on Aug 22, 2024 | 3 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. PERFORMANCE OF NON-RELATIONAL DATABASES Software Performance Engineering 1

  2. Consider H-Scaling an RDBMS e.g. how would you horizontally scale a PostgreSQL server of Amazon.com? Move tables to their own servers? Delegate read-query execution to their own places Reviews server has the likes table Products server has the product data when you look up a product A given thing is split up across lots of places One table across servers? RDBMS s have always had H-Scaling features like this Indexes are all partial indexes via keys like date or region but having queries hit many servers at once is VERY expensive which tables are on which server turns into a hard problem Is making this h-scaling fully transparent to developers a good idea? 2

  3. Consider H-Scaling an RDBMS Refactor app to many separate API interfaces, each with their own implementations Basically a microservices approach but even the simplest call must go through networks for this Not great for legacy need to re-architect and re-engineer legacy apps for this Use multi-tenant Same (or similar) database schema, but many databases deployed across servers Great for when your business has big customers or very regional e.g. A Rochester-based network (e.g. craigslist), or salesforce but now you re managing LOTs of db s, and they could all diverge How do you stay ACID compliant and performant? (Atomic, Consistent, Isolated, Durable) Answer: it s difficult. 3

  4. Not all persistence is relational Relational databases are only one way of approaching persistence Complex queries ACID compliance (Atomic, Consistent, Isolated, Durable) Drawbacks Schema is fixed Horizontal scaling will split an entity up many ways What if we organize our storage so that a single entity is in a single place? e.g. Amazon having everything about a product in a single place One database hit! No joins! H-scaling: have a book server , a hardware server , etc. Results: Document-oriented storage & Key-value storage For today, I will conflate the two, (but they are different) 4

  5. But what was the cost? NoSQL costs us a lot to gain this paradigm We don t have joins Developers are essentially implementing a join by themselves This can be fine for simple join-and-lookup-in-index operations Common practice to have lots of duplicate data We don t have ACID compliance NoSQL often relies upon sacrificing data de-duplication Atomicity is somewhat supported, but not the default Durability is not guaranteed Isolation and Consistency are sacrificed: you can read something from a database and have it be out-of-date Some NoSQL systems today have kept ACID compliance, but are largely not in use (according to Meneely s 2017 Unscientific Survey) 5

  6. There have always been alternatives Since the 1960 s there has been row-based storage and key- value storage Google in the early 2000s used row-based storage to store their map of the internet (graph-based storage) The term NoSQL took off around 2009 Coincided with a cultural exhaustion with relational databases Relational databases were used for all kinds of purposes that they were not intended for Often associated with Big Data, although perhaps that s misplaced (more on this later ) 6

  7. TLDR; Which is faster? Not a fair comparison. Which is faster, your bike or your car? depends on the terrain. Which is faster, your smart phone or your desktop? .depends on where you are. (lessons: don t make assumptions about the workload!) They are fundamentally different paradigms NoSQL will be better at pulling lots of quirky data, almost-the-same data at once SQL will be better at joining known data structures together Practically, large systems will probably employ both. Use SQL on the heavy lifting using but cache the results in NoSQL Caches can be fickle don t need ACID there Or use NoSQL for real-time features and then SQL for archival features NoSQL can still support some querying so your caches don t duplicate so much 7

  8. Consider the Blog Post A Post has many Comments, and each comment is on a Post Relational way: two tables Post table, Comments table Want to list blog posts? Quick table scan. Want to see them both? Need a join Most of the time we re joining, hitting indexes, hitting multiple tables Document-oriented way: each Post has Comments in it Want to see them both? Pull up the document Want to see a listing of Posts? Hit an index, and pull up parts of each document Hits multiple documents, so it s slower So far, NoSQL seems like it s winning... 8

  9. But What About Tags? Now let s suppose you want to group your blog posts with Tags Readers: Get all posts for a given tag Blogger: Count all comments for a given tag Relational Way: Much faster as joins are built-in NoSQL way? Get all comments for a tag? Application developer is essentially implementing her own join now. Horizontal scaling by document is NOT helping this situation So which is better? Comes down to workloads and scenarios How relational is this data? How often will people traverse the back catalog vs. look at the latest? How often do your users use the Tags? How often do your users hit the front page? How often do your users have tons of comments on one blog post at once? 9

  10. Paradoxes of NoSQL It s better for scaling! Big Data Enthusiasts But without joins you ll often need to duplicate data to remain performant so if I have 1 terabyte of data I now need 3 terabytes just for all the foreign values?? No fixed schema!! We can change things easily!! Developers Individual entities have lots of quirky attributes, e.g. page count on a book. NoSQL has no fixed schema it s just hash tables inside hash tables! But your application needs to know something about the data still to query it. So you usually have to define some sort of schema. Indexing still requires a schema, so having a consistent structure is still desirable Agile argument: is changing your schema really that hard? 10

  11. NoSQL shines in semi-persistence e.g. Redis Up front with you about a lack of ACID The amount of durability is configurable: default is to delay writing to disk as long as possible Great for a scalable publish-subscribe subsystem Great for Least Recently Used cache Allows for executing multiple queries at once ( pipelining ) Often seen using Redis in conjunction with relational data Twitter uses Redis with 105 TB of RAM (as of 2014 ) to do the Timeline and most No disk services Archiving activity is done through other services, but the main functionality is Redis 11

  12. Best of Both Worlds? PostgreSQL has a relatively new feature (circa 2014?) that allows for a column to be in JSONB (binary) format This allows for document-oriented storage to be inside traditional relational storage Specific indexing algorithms are used for this so that you can query inside of a JSONB easily Best of both worlds? You still get ACID compliance You get the flexibility of schemas changing at runtime You don t (necessarily) get the horizontal scaling benefits You still need a SQL schema sitting on top of your JSON columns 12

  13. Standard NoSQL language? There is no common standard for document-oriented querying i.e. no SQL for NoSQL Instead, NoSQL systems have their own APIs in various driver languages Pro: APIs can be idiosyncratic to the implementation of the individual system Con: knowledge of one NoSQL system doesn t necessarily translate to the others 13

  14. Graph-based storage A third way of thinking about persistence is with graphs Nodes Edges Properties Value of horizontal scaling is dependent on how dense/sparse the graph is Queries are about going to a node and examining nearby neighbors Today, the languages to query are still being developed e.g. GraphQL is storage-agnostic MS SQL Server has some graph-based storage features PostgreSQL has some features in development as of 2019 Other RDBMSs have extensions for graph-based storage 14

  15. e.g. GitHubs GraphQL API Here s an example query getting commit history from GitHub { repository(name: "apache", owner: "httpd") { ref(qualifiedName: "master") { target { ... on Commit { id history(first: 5) { pageInfo { hasNextPage } edges { node { messageHeadline message author { name email date } } } } } } } } } 15

  16. Todays Activity Research your favorite NoSQL system Find the part of their documentation about performance tuning Pick 3-5 performance improvements that you can do to this system Explain them to the other people at your table. 16

Related


More Related Content