Introduction to Interactive Data Analytics with Spark on Tachyon

Interactive Data Analytics with Spark on Tachyon in Baidu

Bin Fan (Tachyon Nexus)

binfan@tachyonnexus.com

Xiang Wen (Baidu)

wenxiang@baidu.com

Dec 02 2015 @ Strata + Hadoop World, Singapore

Who Are We?

• Bin Fan
• Tachyon Project Contributor
• Software Engineer at Tachyon Nexus
• Xiang Wen
• From Baidu Big Data Department
• Senior Software Engineer at Baidu

• Team consists of Tachyon creators, top contributors
• Series A ($7.5 million) from Andreessen Horowitz
• Committed to Tachyon Open Source Project
• www.tachyonnexus.com

Agenda

• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
  – Spark + Tachyon
• Future Works

History of Tachyon

• Started at UC Berkeley AMPLab
  – From summer 2012
• Open sourced
  – April 2013 (two and a half years ago)
  – Apache License 2.0
  – Latest Release: Version 0.8.2 (November 2015)

One of the Fastest Growing Big Data Open Source Projects

> 170 Contributors (v0.8)

3x increase over the last year

http://tachyon-project.org/community/

One of the Fastest Growing Big Data Open Source Projects

> 50 Organizations

What is Tachyon?

An Open Source Memory-Centric Distributed Storage System

Tachyon Stack

Why Use Tachyon?

Enable Faster Innovation in Storage Layer

What if data size exceeds memory capacity?

Tiered Storage: Tachyon Manages More Than DRAM

MEM, SSD, HDD (upper tiers are faster; lower tiers have higher capacity)

Configurable Storage Tiers

• MEM only
• MEM + HDD
• SSD only

Pluggable Data Management Policy

Pin Data in Memory

Transparent Naming

Unified Namespace Across Under Storage Systems

More Features

• Remote Write Support
• Easy deployment with Mesos and YARN
• Initial Security Support
• One Command Cluster Deployment
• Metrics Reporting for Clients, Workers, and Master

Rich Choice of Under Storage Supports

How Easy to Use Tachyon in Spark

scala> val file = sc.textFile("hdfs://foo")

scala> val file = sc.textFile("tachyon://foo")
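The two calls above differ only in the URI scheme. As a minimal sketch of that idea (not part of the original slides), the helper below rewrites an existing path so that it points at a Tachyon master while keeping the file path unchanged. The name `toTachyon`, the host `tachyon-master`, and the port value are assumptions for illustration; only the JDK's `java.net.URI` is used, so no Spark or Tachyon dependency is needed.

```scala
import java.net.URI

object TachyonPaths {
  // Hypothetical helper: keep the file path, swap the scheme and authority
  // so the same dataset is addressed through a Tachyon master.
  def toTachyon(path: String, masterHost: String, masterPort: Int): String = {
    val original = new URI(path)
    new URI("tachyon", null, masterHost, masterPort, original.getPath, null, null).toString
  }
}

// In a Spark shell the result would be passed to sc.textFile, as on the slide:
//   val file = sc.textFile(TachyonPaths.toTachyon("hdfs://namenode:8020/logs/day1", "tachyon-master", 19998))
```

The point of the sketch is that application code stays the same; only the location string changes.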

Use Case: a SaaS Company

• Framework: Impala
• Under Storage: S3
• Storage Media: MEM + SSD
• 15x Performance Improvement

Use Case: a Biotechnology Company

• Framework: Spark & MapReduce
• Under Storage: GlusterFS
• Storage Media: MEM and SSD

When Tachyon Meets Baidu

~ 100 nodes in deployment, > 1 PB storage space

30X Acceleration of our Big Data Analytics Workload

Agenda

• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
  – Spark + Tachyon
• Future Works

Background

Frustrated data explorers

• Example:
  – John is a PM and he needs to keep track of the top user actions for a new feature
  – Based on the top actions of the day, he will perform additional analysis
  – But John is very frustrated that each query takes tens of minutes to finish

A dedicated service for data exploring

• Manages PBs of data
• Most queries finish within one minute

User Scenario

Agenda

• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
  – Spark + Tachyon
• Future Works

Choose Spark as compute solution

[Diagram labels: Compute Center, Data Center]

Choose Tachyon as storage solution

• Read from remote data center: ~ 100 – 150 sec
• Read from Tachyon cluster local node: 10 – 15 sec
• Read from Tachyon machine local node: ~ 5 sec

Tachyon Brings 30X Speed-up!
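The 30X figure appears consistent with dividing the upper bound of the remote read time by the machine-local read time. The sketch below only reproduces that arithmetic from the numbers on the slide; the object and method names are illustrative and not from the talk.

```scala
object Speedup {
  // Ratio of a baseline latency to an improved latency (both in seconds).
  def ratio(baselineSec: Double, improvedSec: Double): Double =
    baselineSec / improvedSec
}

// Numbers taken from the slide:
//   remote data center: 100 to 150 sec, Tachyon machine-local node: about 5 sec
//   Speedup.ratio(150, 5) gives the upper end of the range
//   Speedup.ratio(100, 5) gives the lower end of the range
```

Under this reading, the speed-up against the remote data center ranges from roughly 20X to 30X for machine-local reads.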

Overall Performance

Setup:

1. Use MR to query 6 TB of data
2. Use Spark to query 6 TB of data
3. Use Spark + Tachyon to query 6 TB of data

Results:

1. Spark + Tachyon achieves a 50-fold speedup compared to MR

Architecture

[Diagram components: Operation Manager, View Manager, Scheduler, SparkContext, Cache, Data Warehouse. Labeled flows: Ask & Profile, Run Query, Build Cache]

Catalyst helps to be ‘transparent’

[Diagram: lookupRelation resolves to a CacheableRelation, which is planned as a Union of:
  – HiveTableScan withCachedPartitions (read from Tachyon)
  – HiveTableScan withUncachedPartitions (read from HDFS)]
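The plan shape in the diagram can be sketched with a toy model. This is not the real Catalyst API: the `Plan`, `HiveTableScan`, and `Union` types below are simplified stand-ins defined locally, and `plan` is a hypothetical function. The sketch splits the requested partitions into cached and uncached sets and unions one scan per storage system, which is how the query stays transparent to the user.

```scala
object CacheablePlan {
  // Toy stand-ins for plan nodes; names mirror the slide, not Spark's classes.
  sealed trait Plan
  final case class HiveTableScan(table: String, partitions: Seq[String], source: String) extends Plan
  final case class Union(children: Seq[Plan]) extends Plan

  // Split requested partitions by cache membership and scan each side
  // from its own storage system.
  def plan(table: String, requested: Seq[String], cached: Set[String]): Plan = {
    val (hit, miss) = requested.partition(cached.contains)
    val scans: Seq[Plan] = Seq(
      if (hit.nonEmpty) Some(HiveTableScan(table, hit, "Tachyon")) else None,
      if (miss.nonEmpty) Some(HiveTableScan(table, miss, "HDFS")) else None
    ).flatten
    // A single scan needs no Union; an empty request yields an empty Union.
    if (scans.size == 1) scans.head else Union(scans)
  }
}
```

For example, requesting partitions 20151122 and 20151123 when only 20151122 is cached yields a Union of a Tachyon scan and an HDFS scan.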

Cache Policy

• Prefetch
  – Fetch the views daily in advance when the system is idle
  – The views fetched are chosen from patterns found by profiling past query history, e.g. 3 months of query logs
• On-demand caching
  – Fetch the views at runtime while the system is serving regular queries
  – Use machine learning to generate a policy file monthly for views/tables
  – When a query accesses some views, and parts of those views match the pre-generated policy, those views are cached at that time
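The two policies above can be sketched as follows. This is an illustration under stated assumptions, not Baidu's implementation: the slides say the policy file is generated with machine learning, whereas this sketch substitutes a plain frequency count over the query history. All names (`CachePolicy`, `prefetchCandidates`, `cacheOnDemand`) are hypothetical.

```scala
object CachePolicy {
  // Prefetch: pick the n views that appear most often in the query history.
  // Ties are broken by view name so the result is deterministic.
  def prefetchCandidates(history: Seq[String], n: Int): Seq[String] =
    history.groupBy(identity).toSeq
      .map { case (view, hits) => (view, hits.size) }
      .sortBy { case (view, count) => (-count, view) }
      .take(n)
      .map(_._1)

  // On demand: of the views a query touches, cache those that match the
  // pre-generated policy and are not cached yet.
  def cacheOnDemand(queryViews: Seq[String], policy: Set[String], cached: Set[String]): Seq[String] =
    queryViews.filter(v => policy.contains(v) && !cached.contains(v)).distinct
}
```

A real policy would likely weigh view size and recency as well; the frequency count is only the simplest stand-in for the profiling step.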

Hot Query: Cache Hit

Cold Query: Cache Miss

Daily Stats with Cache

• Daily
  – Table
    • Queries: 100 – 300
    • Hit Rate: ~40%
  – Partition
    • Queries: 80K – 120K
    • Hit Rate: ~40 – 50%
• Performance with Cache
  – On average 2 – 3 times faster than without cache

Agenda

• Tachyon Basics & New Features
• Motivation
• Building an interactive data service
  – Spark + Tachyon
• Future Works

Improve Caching System

• As Extended Meta Service
  – Improve legacy schema/input-format
  – Load block meta into cache layer
  – Index / Materialized View
• Cost Based Caching/Optimizing
  – Better performance, hit rate & execution
  – Lower storage needs for cache layer

More User Scenario

• If John is a data scientist
  – Needs a way to construct datasets conveniently
  – Usually makes many attempts with the same dataset
• An interactive system should help a lot
  – Spark is an ideal solution

Hardware-assisted big data infrastructure

• Hardware
  – GPU
  – FPGA
• Applications
  – Accelerate common SQL and ML operators
  – TableScan && InputFormat && Serde
• Lower the cost
  – 10 dollars for big data
  – 1 more dollar for interactive big data

Q&A


Explore the collaboration between Baidu and Tachyon Nexus in advancing interactive data analytics with Spark on Tachyon. Learn about the team, Tachyon's history, features, and why it's a fast-growing open-source project. Discover how Tachyon enables efficient memory-centric distributed storage and its significance in data processing.


Presentation Transcript

Interactive Data Analytics with Spark on Tachyon in Baidu Xiang Wen (Baidu) wenxiang@baidu.com Bin Fan (Tachyon Nexus) binfan@tachyonnexus.com Dec 02 2015 @ Strata + Hadoop World, Singapore 1

Who Are We? Bin Fan Xiang Wen Tachyon Project Contributor From Baidu Big Data Department Software Engineer at Tachyon Nexus Senior Software Engineer at Baidu 2

Tachyon Nexus: Team consists of Tachyon creators, top contributors. Series A ($7.5 million) from Andreessen Horowitz. Committed to Tachyon Open Source Project. www.tachyonnexus.com 3

Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 4

History of Tachyon Started at UC Berkeley AMPLab From summer 2012 Open sourced April 2013 (two and half years ago) Apache License 2.0 Latest Release: Version 0.8.2 (November 2015) 5

One of the Fastest Growing Big Data Open Source Projects. > 170 Contributors (v0.8), a 3x increase over the last year. http://tachyon-project.org/community/ 6

One of the Fastest Growing Big Data Open Source Project > 50 Organizations 7

What is An Open Source Memory-Centric Distributed Storage System 8

Tachyon Stack 9

Why Use Tachyon? Data survives in memory after computation crashes. Memory-speed data sharing across jobs and frameworks. Off-heap storage, no GC. [Diagram: Spark jobs and Hadoop MR jobs on YARN, with and without Tachyon; with Tachyon, data blocks 1 to 4 are held in memory and backed by HDFS / Amazon S3] 10

Enable Faster Innovation in Storage Layer 11

What if data size exceeds memory capacity? 12

Tiered Storage: Tachyon Manages More Than DRAM Faster MEM SSD HDD Higher Capacity 13

Configurable Storage Tiers MEM only MEM + HDD SSD only 14

Pluggable Data Management Policy Promote hot data to upper tier Evict stale data to lower tier 15
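The promote/evict behavior on this slide can be sketched with a small in-memory model. This is not Tachyon's code or API: `TieredStore` is a hypothetical class, capacities are counted in blocks rather than bytes, and least-recently-used ordering is an assumed policy (the slide only says the policy is pluggable).

```scala
import scala.collection.mutable

// Toy model of tiered storage: tiers are listed from fastest to slowest,
// each with a capacity measured in number of blocks.
final class TieredStore(capacities: Seq[(String, Int)]) {
  private val names: Vector[String] = capacities.map(_._1).toVector
  private val limits: Vector[Int] = capacities.map(_._2).toVector
  // One insertion-ordered set per tier; the head is the least recently used block.
  private val tiers: Vector[mutable.LinkedHashSet[String]] =
    capacities.map(_ => mutable.LinkedHashSet.empty[String]).toVector

  // Place a block in tier i; if the tier overflows, evict its stalest block
  // to the next lower tier. Blocks pushed below the last tier are dropped.
  private def insert(i: Int, block: String): Unit =
    if (i < tiers.length) {
      tiers(i) += block
      if (tiers(i).size > limits(i)) {
        val stale = tiers(i).head
        tiers(i) -= stale
        insert(i + 1, stale)
      }
    }

  // New or rewritten blocks always land in the top tier.
  def write(block: String): Unit = {
    tiers.foreach(t => t.remove(block))
    insert(0, block)
  }

  // Which tier currently holds the block, if any.
  def tierOf(block: String): Option[String] = {
    val idx = tiers.indexWhere(_.contains(block))
    if (idx < 0) None else Some(names(idx))
  }

  // Reading a block promotes it to the top tier; returns the tier it was found in.
  def read(block: String): Option[String] = {
    val idx = tiers.indexWhere(_.contains(block))
    if (idx < 0) None
    else {
      tiers(idx) -= block
      insert(0, block)
      Some(names(idx))
    }
  }
}
```

With `new TieredStore(Seq("MEM" -> 2, "SSD" -> 2, "HDD" -> 2))`, writing three blocks pushes the oldest one down to SSD, and reading it back promotes it to MEM while the next-stalest block takes its place in SSD.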

Pin Data in Memory 16

Transparent Naming 17

Unified Namespace Across Under Storage Systems 18

More Features Remote Write Support Easy deployment with Mesos and YARN Initial Security Support One Command Cluster Deployment Metrics Reporting for Clients, Workers, and Master 19

Rich Choice of Under Storage Supports 20

How Easy to Use Tachyon in Spark

scala> val file = sc.textFile("hdfs://foo")
scala> val file = sc.textFile("tachyon://foo")

21

Use Case: a SAAS Company Framework: Impala Under Storage: S3 Storage Media: MEM + SSD 15x Performance Improvement 22

Use Case: a Biotechnology Company Framework: Spark & MapReduce Under Storage: GlusterFS Storage Media: MEM and SSD 23

When Tachyon Meets Baidu 30X Acceleration of our Big Data Analytics Workload ~ 100 nodes in deployment, > 1 PB storage space 24

Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 25

Background Data Warehouse Hours or Days Hours Logs Online Data Services 26

Frustrated data explorers Example: John is a PM and he needs to keep track of the top user actions for a new feature Based on the top actions of the day, he will perform additional analysis But John is very frustrated that each query takes tens of minutes to finish 27

A dedicated service for data exploring Manages PBs of data Most queries within one minute 28

User Scenario [Diagram labels: Web Site Client, Service Gate, Query Engine, Data Marts, Data Warehouse, Structured Logs]

Have first try:
  select some_action, count(*) from event_table where event_day = 20151123 group by user_action

Try another way:
  select some_action, event_hour, count(*) from event_table where event_day = 20151123 group by user_action, event_hour

Not as expected! See what happens in original log:
  select * from event_log where event_day = 20151123 and event_hour = 01 limit 10

29

Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 30

Choose Spark as compute solution. 4X Improvement but not good enough! [Diagram labels: Service Gate, Hive, Map Reduce, Compute Center, Data Center, Data Warehouse, BFS] 31

Choose Tachyon as storage solution. Read from remote data center: ~ 100 – 150 sec. Read from Tachyon cluster local node: 10 – 15 sec. Read from Tachyon machine local node: ~ 5 sec. Tachyon Brings 30X Speed-up! [Diagram: Spark tasks in the Compute Center reading in-memory blocks from Tachyon, with the Baidu File System (BFS) in the Data Center as the remote store] 32

Overall Performance. Setup: 1. Use MR to query 6 TB of data. 2. Use Spark to query 6 TB of data. 3. Use Spark + Tachyon to query 6 TB of data. Results: 1. Spark + Tachyon achieves a 50-fold speedup compared to MR. [Bar chart: query time in seconds, axis from 0 to 1200, for MR, Spark, and Spark + Tachyon] 33

Architecture Operation Manager View Manager Ask & Profile Run Query Scheduler Build Cache SparkContext Cache Data Warehouse 34

Catalyst helps to be transparent lookupRelation CacheableRelation HiveTableScan withUncachedPartitions Union HiveTableScan withCachedPartitions Tachyon HDFS 35

Cache Policy Prefetch Fetch the views daily in advance when system is idle The views fetched are based on the pattern of the past query history profiling, e.g. 3 months query logs On Demand caches Fetch the views at runtime when system is serving regular queries. Using machine learning to generate policy file monthly for views/tables When a query is accessing some views, and parts of views match our pre-generated policy, those views will be cached at that time. 36

Hot Query: Cache Hit Query UI View Manager Operation Manager Cache Meta Spark HDFS Tachyon 37

Cold Query: Cache Miss Query UI View Manager Operation Manager Cache Meta Spark HDFS Tachyon 38

Daily Stats with Cache. Daily, by table: Queries: 100 – 300, Hit Rate: ~40%. Daily, by partition: Queries: 80K – 120K, Hit Rate: ~40 – 50%. Performance with Cache: on average 2 – 3 times faster than without cache. 39

Agenda Tachyon Basics & New Features Motivation Building an interactive data service Spark + Tachyon Future Works 40

Improve Caching System As Extended Meta Service Improve legacy schema/input-format Load block meta into cache layer Index / Materialized View Cost Based Caching/Optimizing Better performance, hit rate & execution Lower storage needs for cache layer 41

More User Scenario If John is data scientist Need a way to construct dataset conveniently Usually have many tries with same dataset An interactive system should help a lot Spark is an ideal solution 42

Hardware assisted big data infrastructure Hardware GPU FPGA Applications Accelerate common SQL and ML operators TableScan && InputFormat && Serde Lower down the cost 10 dollars for big data 1 more dollar for interactive big data 43

Q&A 44
