Evolution of Database Systems: A Spark SQL Perspective

Spark SQL
 
Some History (for Dremel and SparkSQL)
Parallel DB systems had been around for 20-30 years prior
Historical DB companies supporting parallelism include: Teradata, Tandem, Informix, Oracle, RedBrick, Sybase, DB2
Common Complaints
Complaints included
Too slow (especially for internet scale applications)
Too much loading time
Too monolithic and complex
Instruction manuals of ~500 pages
Too much heft for “internet scale” applications
Too expensive
Too hard to understand
Poor support for complex non-relational ops
NoSQL
The story of NoSQL
This is the OLAP story, not the OLTP story (Online Analytical Processing, not Online Transaction Processing)
The OLTP story: BigTable (06) => MegaStore (11) => Spanner, F1 (12)
Less consistency => More consistency
Contemporaries: PNUTS, Cassandra, HBase, CouchDB, Dynamo
A Timeline
[Timeline figure: DBs are slow for OLAP => MapReduce (04, Google; “SQL is bad! Yay NoSQL!”) => Column Stores (05) => Dremel (10; Google: SQL on MR; others: Pig, Hive, Impala) => Spark (12; main-memory MR) => SparkSQL (14; “SQL is good!”)]
For OLAP, column stores are a lot better than
row stores
Idea from the 80s, commercialized as Vertica
in 2005.
Key idea: store values for a single column
together
Why is this better for aggregation?
Column Stores
For OLAP, column stores are a lot better than
row stores
Key idea: store values for a single column
together
Why is this better for aggregation?
Better compression; can pack similar values
together better
Can skip over unnecessary columns
Much less data read from disk
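A minimal plain-Python sketch (toy data, not any engine's real storage format) of why the columnar layout helps aggregation: summing one column touches only that column's array, while the row layout drags every field through memory.

# Toy illustration: the same table stored row-wise vs. column-wise.
rows = [
    {"user": "alice", "url": "x.com", "time": 1},
    {"user": "bob",   "url": "y.com", "time": 2},
    {"user": "carol", "url": "x.com", "time": 5},
]
columns = {
    "user": ["alice", "bob", "carol"],
    "url":  ["x.com", "y.com", "x.com"],
    "time": [1, 2, 5],
}

# Row store: whole records are read even though only "time" is needed.
row_sum = sum(r["time"] for r in rows)

# Column store: only the "time" array is scanned; "user" and "url" are skipped,
# and an array of similar values compresses well (e.g., run-length encoding).
col_sum = sum(columns["time"])

assert row_sum == col_sum == 8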
A Timeline
[Same timeline figure as before: DBs are slow for OLAP => MapReduce (04) => Column Stores (05) => Dremel (10) => Spark (12) => SparkSQL (14)]
Map-Reduce
2004: Google published MapReduce.
Parallel programming paradigm
Pros:
Fast fast fast
Imperative
Many real use-cases
Cons:
Checkpointing all intermediate results
No real logic or optimization
Very “rigid”, no room for improvement
Many bottlenecks
NoSQL
One OLAP story
MapReduce (04) => Dremel (10)
Less use of parallel-DB principles => more use of parallel-DB principles
By 2010, Google had restricted MapReduce to complex batch processing, with Dremel for interactive analytics
Contemporaries:
MapReduce: Hadoop (Yahoo)
PSQL-on-MapReduce: Pig (Yahoo), Hive (Facebook)
PSQL-not-on-MapReduce: Impala
Along comes Dremel
2010:
Eliminating limitations of MapReduce in multiple ways:
?
Along comes Dremel
2010:
Eliminating limitations of MapReduce in multiple ways:
Tree-based computation
SQL-based specification
Column Store encoding
Native JSON support
Spark vs. Dremel
2012: Berkeley folks
Similar to Dremel in that the focus is on interactive ad-hoc tasks
Caveat: Dremel is primarily aggregation, primarily read-only
Both move away from the drawbacks of MR (but in different ways):
Dremel uses Column Store ideas + disk
Spark uses memory (Java objects) + avoiding checkpointing + persistence (see the caching sketch below)
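A minimal PySpark sketch (hypothetical input path and columns) of the in-memory style just described: the dataset is cached once and reused by several ad-hoc queries, with no intermediate results checkpointed to disk in between.

# Minimal PySpark sketch (hypothetical path/columns): cache once, query many times.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("persistence-sketch").getOrCreate()

visits = spark.read.json("/data/visits")   # hypothetical input
visits.cache()                             # persist the dataset in memory

# Both queries reuse the cached data instead of re-reading or checkpointing it.
visits.groupBy("url").count().show()
visits.filter(visits["user"] == "alice").count()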
Disadvantages of MapReduce
1. Extremely rigid data flow: always Map => Reduce
Other flows constantly hacked in: joins, unions, splits, chains of M/R jobs
2. Common operations must be coded by hand
Join, filter, projection, aggregates, sorting, distinct
3. Semantics hidden inside map-reduce functions
Difficult to maintain, extend, and optimize
(see the hand-coded join sketch below)
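To make points 2 and 3 concrete, here is a hand-coded reduce-side join in plain Python (a toy sketch imitating the MapReduce model, not real Hadoop code; the relation names follow the Pig example below). The join semantics live entirely inside user-written map and reduce functions, invisible to any optimizer.

# Toy reduce-side join in the MapReduce style: map -> shuffle/group by key -> reduce.
from collections import defaultdict

visits  = [("x.com", "alice"), ("y.com", "bob"), ("x.com", "carol")]
urlInfo = [("x.com", "news"),  ("y.com", "sports")]

def map_phase(records, tag):
    # Tag each record with its source relation so the reducer can tell the sides apart.
    for key, value in records:
        yield key, (tag, value)

def shuffle(mapped):
    # Group all tagged values by join key (the framework does this in real MapReduce).
    groups = defaultdict(list)
    for key, tagged in mapped:
        groups[key].append(tagged)
    return groups

def reduce_phase(key, tagged_values):
    # Emit the cross product of the two sides for this key: that is the join.
    left  = [v for t, v in tagged_values if t == "visits"]
    right = [v for t, v in tagged_values if t == "urlInfo"]
    for l in left:
        for r in right:
            yield (key, l, r)

mapped = list(map_phase(visits, "visits")) + list(map_phase(urlInfo, "urlInfo"))
for key, group in shuffle(mapped).items():
    for row in reduce_phase(key, group):
        print(row)   # e.g. ('x.com', 'alice', 'news')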
Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
PIG: Imperative style, like Spark. From Yahoo!
Another Example: PIG
-- Find the top 10 most-visited URLs in each category
visits       = load '/data/visits' as (user, url, time);
gVisits      = group visits by url;
visitCounts  = foreach gVisits generate url, count(visits);
urlInfo      = load '/data/urlInfo' as (url, category, pRank);
visitCounts  = join visitCounts by url, urlInfo by url;
gCategories  = group visitCounts by category;
topUrls      = foreach gCategories generate top(visitCounts,10);
store topUrls into '/data/topUrls';
(a rough Spark equivalent of this script is sketched below)
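For comparison with the "imperative style, like Spark" remark, here is a rough PySpark DataFrame version of the same pipeline (a hedged sketch: the paths, schemas, and the window-based reading of top(...,10) as "top 10 per category" are assumptions, not taken from the slide).

# Rough PySpark sketch of the Pig script above (hypothetical paths/schemas).
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.window import Window

spark = SparkSession.builder.appName("top-urls").getOrCreate()

visits  = spark.read.csv("/data/visits",  schema="user STRING, url STRING, time STRING")
urlInfo = spark.read.csv("/data/urlInfo", schema="url STRING, category STRING, pRank DOUBLE")

visitCounts = visits.groupBy("url").count()          # group and count visits per URL
joined      = visitCounts.join(urlInfo, "url")       # join with URL metadata

# Top 10 URLs per category by visit count (one reasonable reading of top(...,10)).
w = Window.partitionBy("category").orderBy(F.desc("count"))
topUrls = joined.withColumn("rank", F.row_number().over(w)).filter("rank <= 10")

topUrls.write.mode("overwrite").csv("/data/topUrls")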
Another Example: DryadLINQ
// read a partitioned input table of lines
string uri = @"file://\\machine\directory\input.pt";
PartitionedTable<LineRecord> input = PartitionedTable.Get<LineRecord>(uri);

// split each line into words
string separator = ",";
var words = input.SelectMany(x => SplitLineRecord(separator));

// group identical words and count each group
var groups = words.GroupBy(x => x);
var counts = groups.Select(x => new Pair(x.Key, x.Count()));

// sort by count (descending), take the top k, and write the result out
var ordered = counts.OrderByDescending(x => x[2]);
var top = ordered.Take(k);
top.ToDryadPartitionedTable("matching.pt");

[Figure: execution plan graph, one node per operator in the query above]
Not the first time!
Similar proposals have been made to natively support other relational operators on top of MapReduce.
Unlike Spark, most of them cannot have datasets persist across queries.
PIG: Imperative style, like Spark. From Yahoo!
DryadLINQ: Imperative programming interface. From Microsoft.
HIVE: SQL-like. From Facebook.
HadoopDB: SQL-like (hybrid of MR + databases). From Yale.
A Timeline
[Same timeline figure as before: DBs are slow for OLAP => MapReduce (04) => Column Stores (05) => Dremel (10) => Spark (12) => SparkSQL (14)]
What did you think of this paper?
 
This paper
Appeared at the “Industry” Track of SIGMOD
Lightly reviewed
Use-cases and impact more important than new technical contributions
Light on experiments
Light on details
Esp. on optimization
Key Benefits of SparkSQL
Bridging the gap between procedural and relational
Allowing analysts to mix both
Not just fully A or fully B but intermingled
At the same time, doesn’t force one single format of intermingling:
Can issue fully SQL
Can issue fully procedural
Not better than Impala, but that’s not their contribution.
(a small mixed SQL + DataFrame sketch follows)
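A minimal PySpark sketch (made-up data and column names) of the intermingling described above: procedural Python/DataFrame code and a declarative SQL query over the same data in one program.

# Minimal PySpark sketch (made-up data/columns): relational and procedural code mixed.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql-mix").getOrCreate()

# Procedural side: build a DataFrame from ordinary Python objects.
visits = spark.createDataFrame(
    [("alice", "x.com"), ("bob", "y.com"), ("carol", "x.com")],
    ["user", "url"],
)
visits.createOrReplaceTempView("visits")

# Relational side: plain SQL over the same data.
counts = spark.sql("SELECT url, COUNT(*) AS cnt FROM visits GROUP BY url")

# Back to procedural: keep chaining DataFrame operations on the SQL result.
counts.orderBy(counts["cnt"].desc()).limit(10).show()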
Impala
From Cloudera
Since 2012
SQL on Hadoop clusters
Open-source
Support for a Protocol Buffers-like format (Parquet)
C++ based: less overhead than Java/Scala
May circumvent MR by using a distributed query engine similar to a parallel RDBMS
History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
History lesson: earliest example of “bridging the gap”
What’s the earliest example of “bridging the gap” between procedural and relational?
UDFs
Been there since the early 90s
Rage back then: object-relational databases
OOP was starting to pick up
Representing and reasoning about objects in databases
Postgres was one of the first to use them
Used to call custom code in the middle of SQL (see the sketch below)
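Not the original Postgres mechanism, but the same idea as it appears in Spark SQL: a minimal PySpark sketch (made-up data and function) that registers custom Python code as a UDF and calls it in the middle of a SQL query.

# Minimal sketch: custom code (a Python function) called from inside SQL via a UDF.
from pyspark.sql import SparkSession
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("udf-sketch").getOrCreate()

def domain(url):
    # arbitrary procedural logic, opaque to the SQL optimizer
    return url.split("/")[0]

spark.udf.register("domain", domain, StringType())

spark.createDataFrame([("alice", "x.com/a"), ("bob", "y.com/b")],
                      ["user", "url"]).createOrReplaceTempView("visits")

spark.sql("SELECT domain(url) AS site, COUNT(*) AS cnt FROM visits GROUP BY domain(url)").show()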