Evolution of Database Systems: A Spark SQL Perspective

Spark SQL
 
Some History (for Dremel and SparkSQL)
Parallel DB systems had been around for 20-30 years prior
Historical DB companies supporting parallelism include: Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, DB2
Common Complaints
Complaints included
Too slow (especially for internet scale applications)
Too much loading time
Too monolithic and complex
Instruction manuals of ~500 pages
Too much heft for “internet scale” applications
Too expensive
Too hard to understand
Poor support for complex non-relational ops
NoSQL
The story of NoSQL
This is the OLAP story, not the OLTP story (Online Analytical Processing, not Online Transaction Processing)
The OLTP story: BigTable (06) => MegaStore (11) => Spanner, F1 (12)
Less consistency => More consistency
Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo
A Timeline
[Timeline figure: DBs are slow for OLAP => MapReduce (04, Google; “SQL is bad! Yay NoSQL!”) => Column Stores (05) => Dremel (10; Google: SQL on MR; others: Pig, Hive, Impala) => Spark (12; main-memory MR) => SparkSQL (14; “SQL is good!”)]
For OLAP, column stores are a lot better than
row stores
Idea from the 80s, commercialized as Vertica
in 2005.
Key idea: store values for a single column
together
Why is this better for aggregation?
Column Stores
For OLAP, column stores are a lot better than
row stores
Key idea: store values for a single column
together
Why is this better for aggregation?
Better compression; can pack similar values
together better
Can skip over unnecessary columns
Much less data read from disk
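A minimal plain-Python sketch (toy data, not any engine's real storage format) of why the columnar layout helps aggregation: summing one column touches only that column's array, while the row layout drags every field through memory.

# Toy illustration: the same table stored row-wise vs. column-wise.
rows = [
    {"user": "alice", "url": "x.com", "time": 1},
    {"user": "bob",   "url": "y.com", "time": 2},
    {"user": "carol", "url": "x.com", "time": 5},
]
columns = {
    "user": ["alice", "bob", "carol"],
    "url":  ["x.com", "y.com", "x.com"],
    "time": [1, 2, 5],
}

# Row store: whole records are read even though only "time" is needed.
row_sum = sum(r["time"] for r in rows)

# Column store: only the "time" array is scanned; "user" and "url" are skipped,
# and an array of similar values compresses well (e.g., run-length encoding).
col_sum = sum(columns["time"])

assert row_sum == col_sum == 8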
A Timeline
[Same timeline figure as before: DBs are slow for OLAP => MapReduce (04) => Column Stores (05) => Dremel (10) => Spark (12) => SparkSQL (14)]
Map-Reduce
2004: Google published MapReduce.
Parallel programming paradigm
Pros:
Fast fast fast
Imperative
Many real use-cases
Cons:
Checkpointing all intermediate results
No real logic or optimization
Very “rigid”, no room for improvement
Many bottlenecks
NoSQL
One OLAP story
MapReduce (04) => Dremel (10)
Less use of parallel-DB principles => more use of parallel-DB principles
By 2010, Google had restricted MapReduce to complex batch processing, with Dremel for interactive analytics
Contemporaries:
MapReduce: Hadoop (Yahoo)
PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook)
PSQL-not-on-MapReduce: Impala
Along comes Dremel
2010:
Eliminating limitations of MapReduce in multiple ways:
?
Along comes Dremel
2010:
Eliminating limitations of MapReduce in multiple ways:
Tree-based computation
SQL-based specification
Column Store encoding
Native JSON support
Spark vs. Dremel
2012: Berkeley folks
Similar to Dremel in that the focus is on interactive ad-hoc tasks
Caveat: Dremel is primarily aggregation, primarily read-only
Both move away from the drawbacks of MR (but in different ways):
Dremel uses Column Store ideas + disk
Spark uses memory (Java objects) + avoiding checkpointing + persistence (see the caching sketch below)
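A minimal PySpark sketch (hypothetical input path and columns) of the in-memory style just described: the dataset is cached once and reused by several ad-hoc queries, with no intermediate results checkpointed to disk in between.

# Minimal PySpark sketch (hypothetical path/columns): cache once, query many times.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-sketch").getOrCreate()

visits = spark.read.json("/data/visits")   # hypothetical input
visits.cache()                             # persist the dataset in memory

# Both queries reuse the cached data instead of re-reading or checkpointing it.
visits.groupBy("url").count().show()
visits.filter(visits["user"] == "alice").count()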
Disadvantages of MapReduce
1. Extremely rigid data flow: always Map => Reduce
Other flows constantly hacked in: joins, unions, splits, chains of M/R jobs
2. Common operations must be coded by hand
Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
Difficult to maintain, extend, and optimize
(see the hand-coded join sketch below)
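To make points 2 and 3 concrete, here is a hand-coded reduce-side join in plain Python (a toy sketch imitating the MapReduce model, not real Hadoop code; the relation names follow the Pig example below). The join semantics live entirely inside user-written map and reduce functions, invisible to any optimizer.

# Toy reduce-side join in the MapReduce style: map -> shuffle/group by key -> reduce.
from collections import defaultdict

visits  = [("x.com", "alice"), ("y.com", "bob"), ("x.com", "carol")]
urlInfo = [("x.com", "news"),  ("y.com", "sports")]

def map_phase(records, tag):
    # Tag each record with its source relation so the reducer can tell the sides apart.
    for key, value in records:
        yield key, (tag, value)

def shuffle(mapped):
    # Group all tagged values by join key (the framework does this in real MapReduce).
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

def reduce_phase(key, tagged_values):
    # Emit the cross product of the two sides for this key: that is the join.
    left  = [v for t, v in tagged_values if t == "visits"]
    right = [v for t, v in tagged_values if t == "urlInfo"]
    for l in left:
        for r in right:
            yield (key, l, r)

mapped = list(map_phase(visits, "visits")) + list(map_phase(urlInfo, "urlInfo"))
for key, group in shuffle(mapped).items():
    for row in reduce_phase(key, group):
        print(row)   # e.g. ('x.com', 'alice', 'news')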
Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
PIG: Imperative style, like Spark. From Yahoo!
Another Example: PIG
-- Find the top 10 most-visited URLs in each category
visits       = load '/data/visits' as (user, url, time);
gVisits      = group visits by url;
visitCounts  = foreach gVisits generate url, count(visits);
urlInfo      = load '/data/urlInfo' as (url, category, pRank);
visitCounts  = join visitCounts by url, urlInfo by url;
gCategories  = group visitCounts by category;
topUrls      = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
(a rough Spark equivalent of this script is sketched below)
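For comparison with the "imperative style, like Spark" remark, here is a rough PySpark DataFrame version of the same pipeline (a hedged sketch: the paths, schemas, and the window-based reading of top(...,10) as "top 10 per category" are assumptions, not taken from the slide).

# Rough PySpark sketch of the Pig script above (hypothetical paths/schemas).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-urls").getOrCreate()

visits  = spark.read.csv("/data/visits",  schema="user STRING, url STRING, time STRING")
urlInfo = spark.read.csv("/data/urlInfo", schema="url STRING, category STRING, pRank DOUBLE")

visitCounts = visits.groupBy("url").count()          # group and count visits per URL
joined      = visitCounts.join(urlInfo, "url")       # join with URL metadata

# Top 10 URLs per category by visit count (one reasonable reading of top(...,10)).
w = Window.partitionBy("category").orderBy(F.desc("count"))
topUrls = joined.withColumn("rank", F.row_number().over(w)).filter("rank <= 10")

topUrls.write.mode("overwrite").csv("/data/topUrls")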
Another Example: DryadLINQ
// read a partitioned input table of lines
string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);

// split each line into words
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));

// group identical words and count each group
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));

// sort by count (descending), take the top k, and write the result out
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");

[Figure: execution plan graph, one node per operator in the query above]
Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
Unlike Spark, most of them cannot have datasets persist across queries.
PIG: Imperative style, like Spark. From Yahoo!
DryadLINQ: Imperative programming interface. From Microsoft.
HIVE: SQL-like. From Facebook.
HadoopDB: SQL-like (hybrid of MR + databases). From Yale.
A Timeline
[Same timeline figure as before: DBs are slow for OLAP => MapReduce (04) => Column Stores (05) => Dremel (10) => Spark (12) => SparkSQL (14)]
What did you think of this paper?
 
This paper
Appeared at the “Industry” Track of SIGMOD
Lightly reviewed
Use-cases and impact more important than new technical contributions
Light on experiments
Light on details
Esp. on optimization
Key Benefits of SparkSQL
Bridging the gap between procedural and relational
Allowing analysts to mix both
Not just fully A or fully B but intermingled
At the same time, doesn’t force one single format of intermingling:
Can issue fully SQL
Can issue fully procedural
Not better than Impala, but that’s not their contribution.
(a small mixed SQL + DataFrame sketch follows)
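A minimal PySpark sketch (made-up data and column names) of the intermingling described above: procedural Python/DataFrame code and a declarative SQL query over the same data in one program.

# Minimal PySpark sketch (made-up data/columns): relational and procedural code mixed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-mix").getOrCreate()

# Procedural side: build a DataFrame from ordinary Python objects.
visits = spark.createDataFrame(
    [("alice", "x.com"), ("bob", "y.com"), ("carol", "x.com")],
    ["user", "url"],
)
visits.createOrReplaceTempView("visits")

# Relational side: plain SQL over the same data.
counts = spark.sql("SELECT url, COUNT(*) AS cnt FROM visits GROUP BY url")

# Back to procedural: keep chaining DataFrame operations on the SQL result.
counts.orderBy(counts["cnt"].desc()).limit(10).show()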
Impala
From Cloudera
Since 2012
SQL on Hadoop clusters
Open-source
Support for a Protocol Buffers-like format (Parquet)
C++ based: less overhead than Java/Scala
May circumvent MR by using a distributed query engine similar to a parallel RDBMS
History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
UDFs
Been there since the early 90s
Rage back then: object-relational databases
OOP was starting to pick up
Representing and reasoning about objects in databases
Postgres was one of the first to use them
Used to call custom code in the middle of SQL (see the sketch below)
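Not the original Postgres mechanism, but the same idea as it appears in Spark SQL: a minimal PySpark sketch (made-up data and function) that registers custom Python code as a UDF and calls it in the middle of a SQL query.

# Minimal sketch: custom code (a Python function) called from inside SQL via a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

def domain(url):
    # arbitrary procedural logic, opaque to the SQL optimizer
    return url.split("/")[0]

spark.udf.register("domain", domain, StringType())

spark.createDataFrame([("alice", "x.com/a"), ("bob", "y.com/b")],
                      ["user", "url"]).createOrReplaceTempView("visits")

spark.sql("SELECT domain(url) AS site, COUNT(*) AS cnt FROM visits GROUP BY domain(url)").show()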