Introduction to Interactive Data Analytics with Spark on Tachyon

 
Interactive Data Analytics with Spark on Tachyon in Baidu
 
Bin Fan (Tachyon Nexus)
binfan@tachyonnexus.com
 
Xiang Wen (Baidu)
wenxiang@baidu.com
 
Dec 02 2015 @ Strata + Hadoop World, Singapore
 
1
 
Who Are We?
 
Bin Fan
 
Tachyon Project
Contributor
 
Software Engineer
at Tachyon Nexus
 
Xiang Wen
 
From Baidu Big
Data Department
 
Senior Software
Engineer at Baidu
 
2
 
Team consists of Tachyon creators, top contributors
 
Series A ($7.5 million) from Andreessen Horowitz
Committed to Tachyon Open Source Project
www.tachyonnexus.com
 
 
3
 
Agenda
 
Tachyon Basics & New Features
Motivation
Building an interactive data service
Spark + Tachyon
Future Works
 
4
 
History of Tachyon
 
Started at UC Berkeley AMPLab
From summer 2012
 
Open sourced
April 2013 (two and a half years ago)
Apache License 2.0
Latest Release: Version 0.8.2 (November 2015)
 
5
 
One of the Fastest Growing Big Data Open Source Projects
 
> 170 Contributors (v0.8)
 
3x increase over the last year
 
http://tachyon-project.org/community/
 
6
 
One of the Fastest Growing Big Data Open Source Projects
 
> 50 Organizations
 
7
 
What is Tachyon?
 
An Open Source Memory-Centric Distributed Storage System
8
Tachyon Stack
9
 
Why Use Tachyon?
 
Memory-speed data sharing across jobs and frameworks
Data survives in memory after computation crashes
Off-heap storage, no GC
 
[Diagram labels: Tachyon in-memory data; HDFS / Amazon S3]
10
 
Enable Faster Innovation in
Storage Layer
 
 
11
 
What if
data size exceeds
memory capacity?
 
12
Tiered Storage: Tachyon Manages More
Than DRAM
 
MEM
 
SSD
 
HDD
 
Faster
 
Higher
Capacity
13
Configurable Storage Tiers
 
MEM only
 
MEM + HDD
 
SSD only
14
Pluggable Data Management Policy
 
Evict stale data to lower tier
Promote hot data to upper tier
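 
The evict/promote behaviour named on this slide can be sketched in a few lines. This is a minimal illustration only: `TwoTierStore`, the tier names, and the LRU rule are assumptions made for the example, not Tachyon's actual evictor or allocator interfaces.

```scala
import scala.collection.mutable

// Hypothetical two-tier block store: a small "MEM" tier above a larger "SSD" tier.
// The class name and the LRU rule are illustrative assumptions, not Tachyon's API.
object TieredStoreSketch {
  final class TwoTierStore(memCapacity: Int) {
    // LinkedHashMap keeps insertion order; re-inserting a block on access
    // moves it to the end, so the head is always the least recently used block.
    private val mem = mutable.LinkedHashMap.empty[String, Array[Byte]]
    private val ssd = mutable.Map.empty[String, Array[Byte]]

    def put(id: String, data: Array[Byte]): Unit = {
      ssd.remove(id)
      mem.remove(id)
      mem.put(id, data)
      evictIfNeeded()
    }

    def get(id: String): Option[Array[Byte]] =
      mem.remove(id) match {
        case Some(data) =>
          mem.put(id, data) // refresh recency in the upper tier
          Some(data)
        case None =>
          ssd.remove(id).map { data =>
            mem.put(id, data) // promote the hot block to the upper tier
            evictIfNeeded()
            data
          }
      }

    // Evict the stalest blocks to the lower tier until MEM fits its capacity.
    private def evictIfNeeded(): Unit =
      while (mem.size > memCapacity) {
        val (staleId, staleData) = mem.head
        mem.remove(staleId)
        ssd.put(staleId, staleData)
      }

    def tierOf(id: String): Option[String] =
      if (mem.contains(id)) Some("MEM")
      else if (ssd.contains(id)) Some("SSD")
      else None
  }
}
```

With a MEM capacity of two blocks, writing a third block pushes the oldest one down to SSD, and reading that block again brings it back up while the next-stalest block is evicted.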
15
 
Pin Data in Memory
 
16
 
Transparent Naming
 
 
17
 
Unified Namespace Across Under
Storage Systems
 
 
18
More Features
 
Remote Write Support
Easy deployment with Mesos and YARN
Initial Security Support
One Command Cluster Deployment
Metrics Reporting for Clients, Workers, and
Master
 
19
Rich Choice of Under Storage Support
20
How Easy to Use Tachyon in Spark
scala> val file = sc.textFile("hdfs://foo")
scala> val file = sc.textFile("tachyon://foo")
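 
Since the only change on the Spark side is the URI scheme, existing paths can be rewritten mechanically. The helper below is a hypothetical sketch: the object name, the host names, and the use of 19998 as the master port are assumptions for illustration and should be adjusted to the actual cluster.

```scala
import java.net.URI

// Hypothetical helper, not part of Spark or Tachyon.
object TachyonPathSketch {
  // Assumption: the Tachyon master listens on port 19998 in this example.
  def toTachyon(path: String, masterHost: String, masterPort: Int = 19998): String = {
    val original = new URI(path)
    // Keep only the file path; replace the scheme and authority with the Tachyon master.
    new URI("tachyon", null, masterHost, masterPort, original.getPath, null, null).toString
  }
}
```

In a Spark shell the result would simply be passed to `sc.textFile(...)` in place of the original HDFS path.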
21
 
Use Case: a SaaS Company
 
Framework: Impala
Under Storage: S3
Storage Media: MEM + SSD
 
15x Performance Improvement
 
22
 
Use Case: a Biotechnology Company
 
Framework: Spark & MapReduce
Under Storage: GlusterFS
Storage Media: MEM and SSD
 
23
 
When Tachyon Meets Baidu
 
~ 100 nodes in deployment, > 1 PB storage space
 
 
30X Acceleration of our Big Data Analytics Workload
 
24
 
Agenda
 
Tachyon Basics & New Features
Motivation
Building an interactive data service
Spark + Tachyon
Future Works
 
25
Background
 
 
 
 
 
26
 
Frustrated data explorers
 
Example:
John is a PM and he needs to keep track of the top user
actions for a new feature
Based on the top actions of the day, he will perform
additional analysis
But John is very frustrated that each query takes tens of
minutes to finish
 
27
 
A dedicated service for data exploring
 
Manages PBs of data
Most queries within one minute
 
 
 
 
 
 
 
28
User Scenario
29
 
Agenda
 
Tachyon Basics & New Features
Motivation
Building an interactive data service
Spark + Tachyon
Future Works
 
30
 
Choose Spark as compute solution
 
Compute 
Center
 
Data Center
 
31
 
Choose Tachyon as storage solution
 
Read from remote data center: ~100 – 150 sec
Read from Tachyon cluster local node: 10 – 15 sec
Read from Tachyon machine local node: ~5 sec
 
Tachyon Brings 30X Speed-up !
 
32
 
Overall Performance
 
Setup:
1. Use MR to query 6 TB of data
2. Use Spark to query 6 TB of data
3. Use Spark + Tachyon to query 6 TB of data
 
Results:
1. Spark + Tachyon achieves a 50-fold speedup compared to MR
 
33
 
Architecture
[Diagram labels: SparkContext, Operation Manager, View Manager, Scheduler, Cache, Data Warehouse, Run Query, Build Cache, Ask & Profile]
 
 
 
 
 
 
 
 
34
 
Catalyst helps to be ‘transparent’
 
lookupRelation
CacheableRelation
Tachyon
HDFS
Union
HiveTableScan
withUncachedPartitions
HiveTableScan
withCachedPartitions
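 
One plausible reading of this diagram is that a table lookup is resolved into a union of two scans: cached partitions are read from Tachyon and the remaining partitions from HDFS. The sketch below models that split with plain case classes. It is an assumption-laden illustration: the `Plan` types here are stand-ins that only borrow the slide's names, not Spark Catalyst classes.

```scala
// Illustrative stand-ins for plan nodes; these are NOT Spark Catalyst classes.
object CachedScanSketch {
  sealed trait Plan
  final case class HiveTableScan(table: String, partitions: Seq[String], source: String) extends Plan
  final case class Union(children: Seq[Plan]) extends Plan

  // Split the requested partitions into cached and uncached sets, then build
  // one scan per storage system and union them when both are needed.
  def resolve(table: String, requested: Seq[String], cached: Set[String]): Plan = {
    val (hit, miss) = requested.partition(cached.contains)
    val scans = Seq(
      HiveTableScan(table, hit, "tachyon"), // partitions already in the cache layer
      HiveTableScan(table, miss, "hdfs")    // partitions that must come from the warehouse
    ).filter(_.partitions.nonEmpty)
    scans match {
      case Seq(single) => single
      case many        => Union(many)
    }
  }
}
```

Because the split happens during relation lookup, the query text itself does not change, which is the sense in which the cache stays transparent to users.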
 
35
 
Cache Policy
 
Prefetch
Fetch the views daily in advance, when the system is idle
The views to fetch are chosen from patterns found by profiling past query history, e.g. 3 months of query logs
 
On-demand caching
Fetch the views at runtime, while the system is serving regular queries
A policy file for views/tables is generated monthly using machine learning
When a query accesses some views, and parts of those views match the pre-generated policy, those views are cached at that time
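 
The prefetch idea can be illustrated with a deliberately simple stand-in for the profiling step: count how often each view appears in the query history and prefetch the most frequent ones. The function name, the log format (one view name per past query), and the frequency rule are assumptions for this sketch; the slides do not describe the actual profiling or the machine-learned policy in this detail.

```scala
// Hypothetical prefetch selection; a frequency count stands in for real profiling.
object PrefetchSketch {
  // queryLog holds one entry per historical query: the name of the view it touched.
  // budget is how many views the cache layer can prefetch during idle time.
  def viewsToPrefetch(queryLog: Seq[String], budget: Int): Seq[String] =
    queryLog
      .groupBy(identity)
      .map { case (view, hits) => (view, hits.size) }
      .toSeq
      .sortBy { case (view, count) => (-count, view) } // most queried first, ties by name
      .take(budget)
      .map(_._1)
}
```

For a log such as `Seq("a", "b", "a", "c", "b", "a")` and a budget of 2, the sketch selects views `a` and `b`.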
 
36
 
Hot Query: Cache Hit
 
37
 
Cold Query: Cache Miss
 
38
 
Daily Stats with Cache
 
Daily
Table
Queries: 100 – 300
Hit Rate: ~40%
Partition
Queries: 80K – 120K
Hit Rate: ~40 – 50%
Performance with Cache
On average 2 – 3 times faster than without cache
 
39
 
Agenda
 
Tachyon Basics & New Features
Motivation
Building an interactive data service
Spark + Tachyon
Future Works
 
40
 
Improve Caching System
 
As Extended Meta Service
Improve legacy schema/input-format
Load block meta into cache layer
Index / Materialized View
Cost Based Caching/Optimizing
Better performance, hit rate & execution
Lower storage needs for cache layer
 
41
 
More User Scenarios
 
If John is a data scientist
He needs a convenient way to construct datasets
He usually makes many attempts on the same dataset
An interactive system should help a lot
Spark is an ideal solution
 
42
 
Hardware assisted
big data infrastructure
 
Hardware
GPU
FPGA
Applications
Accelerate common SQL and ML operators
TableScan && InputFormat && Serde
Lower the cost
10 dollars for big data
1 more dollar for interactive big data
 
43
 
Q&A
 
 
44

Explore the collaboration between Baidu and Tachyon Nexus in advancing interactive data analytics with Spark on Tachyon. Learn about the team, Tachyon's history, features, and why it's a fast-growing open-source project. Discover how Tachyon enables efficient memory-centric distributed storage and its significance in data processing.


Presentation Transcript


  1. Interactive Data Analytics with Spark on Tachyon in Baidu Xiang Wen (Baidu) wenxiang@baidu.com Bin Fan (Tachyon Nexus) binfan@tachyonnexus.com Dec 02 2015 @ Strata + Hadoop World, Singapore 1

  2. Who Are We? Bin Fan Xiang Wen Tachyon Project Contributor From Baidu Big Data Department Software Engineer at Tachyon Nexus Senior Software Engineer at Baidu 2

  3. Team consists of Tachyon creators, top contributors Series A ($7.5 million) from Andreessen Horowitz Committed to Tachyon Open Source Project www.tachyonnexus.com 3

  4. Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 4

  5. History of Tachyon Started at UC Berkeley AMPLab From summer 2012 Open sourced April 2013 (two and a half years ago) Apache License 2.0 Latest Release: Version 0.8.2 (November 2015) 5

  6. One of the Fastest Growing Big Data Open Source Projects http://tachyon-project.org/community/ > 170 Contributors (v0.8) 3x increase over the last year 6

  7. One of the Fastest Growing Big Data Open Source Projects > 50 Organizations 7

  8. What is An Open Source Memory-Centric Distributed Storage System 8

  9. Tachyon Stack 9

  10. Why Use Tachyon? Memory-speed data sharing across jobs and frameworks. Data survives in memory after computation crashes. Off-heap storage, no GC. [Diagram labels: Spark Job, Hadoop MR Job, crash, Spark mem, YARN, HDFS disk, Tachyon in-memory data, blocks 1 to 4, HDFS / Amazon S3] 10

  11. Enable Faster Innovation in Storage Layer 11

  12. What if data size exceeds memory capacity? 12

  13. Tiered Storage: Tachyon Manages More Than DRAM Faster MEM SSD HDD Higher Capacity 13

  14. Configurable Storage Tiers MEM only MEM + HDD SSD only 14

  15. Pluggable Data Management Policy Promote hot data to upper tier Evict stale data to lower tier 15

  16. Pin Data in Memory 16

  17. Transparent Naming 17

  18. Unified Namespace Across Under Storage Systems 18

  19. More Features Remote Write Support Easy deployment with Mesos and YARN Initial Security Support One Command Cluster Deployment Metrics Reporting for Clients, Workers, and Master 19

  20. Rich Choice of Under Storage Supports 20

  21. How Easy to Use Tachyon in Spark scala> val file = sc.textFile("hdfs://foo") scala> val file = sc.textFile("tachyon://foo") 21

  22. Use Case: a SaaS Company Framework: Impala Under Storage: S3 Storage Media: MEM + SSD 15x Performance Improvement 22

  23. Use Case: a Biotechnology Company Framework: Spark & MapReduce Under Storage: GlusterFS Storage Media: MEM and SSD 23

  24. When Tachyon Meets Baidu 30X Acceleration of our Big Data Analytics Workload ~ 100 nodes in deployment, > 1 PB storage space 24

  25. Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 25

  26. Background Data Warehouse Hours or Days Hours Logs Online Data Services 26

  27. Frustrated data explorers Example: John is a PM and he needs to keep track of the top user actions for a new feature Based on the top actions of the day, he will perform additional analysis But John is very frustrated that each query takes tens of minutes to finish 27

  28. A dedicated service for data exploring Manages PBs of data Most queries within one minute 28

  29. User Scenario
      Have first try:
        select some_action, count(*) from event_table where event_day = 20151123 group by user_action
      Try another way:
        select some_action, event_hour, count(*) from event_table where event_day = 20151123 group by user_action, event_hour
      Not as expected! See what happens in original log:
        select * from event_log where event_day = 20151123 and event_hour = 01 limit 10
      [Diagram labels: Web Site, Client, Service Gate, Query Engine, Structured Logs, Data Marts, Data Warehouse] 29

  30. Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 30

  31. Choose Spark as compute solution. 4X improvement, but not good enough! [Diagram labels: Service Gate, Hive, Map Reduce, Compute Center, Data Center, Data Warehouse, BFS] 31

  32. Choose Tachyon as storage solution. Read from remote data center: ~100 – 150 sec. Read from Tachyon cluster local node: 10 – 15 sec. Read from Tachyon machine local node: ~5 sec. Tachyon Brings 30X Speed-up! [Diagram labels: Compute Center, Spark Task, Spark mem, HDFS disk, Tachyon in-memory blocks 1 to 4, Data Center, Baidu File System (BFS)] 32

  33. Overall Performance. Setup: 1. Use MR to query 6 TB of data 2. Use Spark to query 6 TB of data 3. Use Spark + Tachyon to query 6 TB of data. Results: 1. Spark + Tachyon achieves a 50-fold speedup compared to MR. [Bar chart: query time in seconds for MR, Spark, and Spark + Tachyon; axis from 0 to 1200] 33

  34. Architecture Operation Manager View Manager Ask & Profile Run Query Scheduler Build Cache SparkContext Cache Data Warehouse 34

  35. Catalyst helps to be transparent lookupRelation CacheableRelation HiveTableScan withUncachedPartitions Union HiveTableScan withCachedPartitions Tachyon HDFS 35

  36. Cache Policy. Prefetch: Fetch the views daily in advance, when the system is idle. The views to fetch are chosen from patterns found by profiling past query history, e.g. 3 months of query logs. On-demand caching: Fetch the views at runtime, while the system is serving regular queries. A policy file for views/tables is generated monthly using machine learning. When a query accesses some views, and parts of those views match the pre-generated policy, those views are cached at that time. 36

  37. Hot Query: Cache Hit Query UI View Manager Operation Manager Cache Meta Spark HDFS Tachyon 37

  38. Cold Query: Cache Miss Query UI View Manager Operation Manager Cache Meta Spark HDFS Tachyon 38

  39. Daily Stats with Cache Daily Table Queries: 100 – 300 Hit Rate: ~40% Partition Queries: 80K – 120K Hit Rate: ~40 – 50% Performance with Cache: on average 2 – 3 times faster than without cache 39

  40. Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 40

  41. Improve Caching System As Extended Meta Service Improve legacy schema/input-format Load block meta into cache layer Index / Materialized View Cost Based Caching/Optimizing Better performance, hit rate & execution Lower storage needs for cache layer 41

  42. More User Scenarios If John is a data scientist He needs a convenient way to construct datasets He usually makes many attempts on the same dataset An interactive system should help a lot Spark is an ideal solution 42

  43. Hardware assisted big data infrastructure Hardware GPU FPGA Applications Accelerate common SQL and ML operators TableScan && InputFormat && Serde Lower the cost 10 dollars for big data 1 more dollar for interactive big data 43

  44. Q&A 44
