Evolution of Database Systems: A Spark SQL Perspective

Slide Note

Explore the evolution of database systems, specifically focusing on Spark SQL, NoSQL, and column stores for OLAP. Learn about the history of parallel DB systems, common complaints, the story of NoSQL, and the advantages of column stores for data aggregation and compression in OLAP scenarios.

csic Follow

Uploaded on Oct 09, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Spark SQL

Some History (for Dremel and SparkSQL) Parallel DB Systems have been around for 20- 30 years prior Historical DB companies supporting parallelism include: Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, DB2

Common Complaints Complaints included Too slow (especially for internet scale applications) Too much loading time Too monolithic and complex Instruction manuals of ~500 pages Too much heft for internet scale applications Too expensive Too hard to understand Poor support for complex non-relational ops

NoSQL The story of NoSQL This is the OLAP story, not the OLTP story Online Analytical Processing not Online Transaction Processing OLTP story BigTable (06) => MegaStore (11) => Spanner, F1 (12) Less consistency => More consistency Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo

A Timeline Google: SQL on MR Others: Pig, Hive, Impala Column Stores (05) Dremel (10) DBs are Slow for OLAP Map Reduce (04) Google Spark (12) SparkSQL (14) Main-Mem MR SQL is bad! Yay NoSQL! SQL is good!

Column Stores For OLAP, column stores are a lot better than row stores Idea from the 80s, commercialized as Vertica in 2005. Key idea: store values for a single column together Why is this better for aggregation?

Column Stores For OLAP, column stores are a lot better than row stores Key idea: store values for a single column together Why is this better for aggregation? Better compression; can pack similar values together better Can skip over unnecessary columns Much less data read from disk

A Timeline Google: SQL on MR Others: Pig, Hive, Impala Column Stores (05) Dremel (10) DBs are Slow for OLAP Map Reduce (04) Google Spark (12) SparkSQL (14) Main-Mem MR SQL is bad! Yay NoSQL! SQL is good!

Map-Reduce 2004: Google published MapReduce. Parallel programming paradigm Pros: Fast fast fast Imperative Many real use-cases Cons: Checkpointing all intermediate results No real logic or optimization Very rigid , no room for improvement Many bottlenecks

NoSQL One OLAP story MapReduce (04) => Dremel (10) Less using pdb principles => More using pdb principles By 2010, Google had restricted MapReduce to complex batch processing, with Dremel for interactive analytics Contemporaries: MapReduce: Hadoop (Yahoo) PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook) PSQL-not-on-MapReduce: Impala

Along comes Dremel 2010: Eliminating limitations in MapReduce via multiple ways: ?

Along comes Dremel 2010: Eliminating limitations in MapReduce via multiple ways: Tree-based computation SQL-based specification Column Store encoding Native JSON support

Spark vs. Dremel 2012: Berkeley Folks Similar to Dremel in that the focus is on interactive ad-hoc tasks Caveat: Dremel is primarily aggregation primarily read-only moving away from the drawbacks of MR (but in different ways) Dremel uses Column Store ideas + Disk Spark uses Memory (Java objects) + Avoiding checkpointing + Persistence

Disadvantages of MapReduce 1. Extremely rigid data flow M R Other flows constantly hacked in M M M R Join, Union Chains Split 2. Common operations must be coded by hand Join, filter, projection, aggregates, sorting, distinct 3. Semantics hidden inside map-reduce functions Difficult to maintain, extend, and optimize

Not the first time! Similar proposals have been made to natively support other relational operators on top of MapReduce. PIG: Imperative style, like Spark. From Yahoo!

Another Example: PIG visits = load /data/visits as (user, url, time); gVisits = group visits by url; visitCounts = foreach gVisits generate url, count(urlVisits); urlInfo = load /data/urlInfo as (url, category, pRank); visitCounts = join visitCounts by url, urlInfo by url; gCategories = group visitCounts by category; topUrls = foreach gCategories generate top(visitCounts,10); store topUrls into /data/topUrls ;

Another Example: DryadLINQ Get string uri = @"file://\\machine\directory\input.pt"; PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri); Execution Plan Graph SM string separator = ","; var words = input.SelectMany(x => SplitLineRecord(separator)); G var groups = words.GroupBy(x => x); S var counts = groups.Select(x => new Pair(x.Key, x.Count())); var ordered = counts.OrderByDescending(x => x[2]); O var top = ordered.Take(k); Take top.ToDryadPartitionedTable("matching.pt");

Not the first time! Similar proposals have been made to natively support other relational operators on top of MapReduce. Unlike Spark, most of them cannot have datasets persist across queries. PIG: Imperative style, like Spark. From Yahoo! DryadLINQ: Imperative programming interface. From Microsoft. HIVE: SQL like. From Facebook HadoopDB: SQL like (hybrid of MR + databases). From Yale

A Timeline Google: SQL on MR Others: Pig, Hive, Impala Column Stores (05) Dremel (10) DBs are Slow for OLAP Map Reduce (04) Google Spark (12) SparkSQL (14) Main-Mem MR SQL is bad! Yay NoSQL! SQL is good!

What did you think of this paper?

This paper Appeared at the Industry Track of SIGMOD Lightly reviewed Use-cases and impact more important than new technical contributions Light on experiments Light on details Esp. on optimization

Key Benefits of SparkSQL Bridging the gap between procedural and relational Allowing analysts to mix both Not just fully A or fully B but intermingled At the same time, doesn t force one single format of intermingling Can issue fully SQL Can issue fully procedural Not better than impala: but not their contribution.

Impala From Cloudera Since 2012 SQL on Hadoop Clusters Open-source Support for Protocol Buffers like format (parquet) C++ based: less overhead of java/scala May circumvent MR by using a distributed query engine similar to parallel RDBMS

History lesson: earliest example of bridging the gap What s the earliest example of bridging the gap between procedural and relational?

History lesson: earliest example of bridging the gap What s the earliest example of bridging the gap between procedural and relational? UDFs Been there since the early 90s Rage back then: Object relational databases OOP was starting to pick up Representing and reasoning about objects in databases Postgres was one of the first to use it Used to call custom code in the middle of SQL

Evolution of Database Systems: A Spark SQL Perspective

Download Presentation

Presentation Transcript

Related

More Related Content