Data Processing and MapReduce: Concepts and Applications

 
Big Data Processing
CS 240: Computing Systems and Concurrency
Lecture 23
 
Marco Canini
 
Data-Parallel Computation
 
Ex: Word count using partial aggregation
 
1. Compute word counts from individual files
2. Then merge intermediate output
3. Compute word count on merged outputs
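
A minimal, runnable Python sketch of these three steps (my own illustration, not from the lecture; the in-memory "files" stand in for real input files):

from collections import Counter

def count_words(text):
    # Step 1: word counts for a single file
    return Counter(text.lower().split())

def merge_counts(partials):
    # Steps 2-3: merge the intermediate outputs into the final counts
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total

files = ["the quick brown fox", "the lazy dog", "the fox"]   # stand-ins for file contents
partials = [count_words(f) for f in files]
print(merge_counts(partials).most_common(3))                 # e.g., [('the', 3), ('fox', 2), ...]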
 
 
Putting it together…
 
map → combine → partition (“shuffle”) → reduce

Synchronization Barrier

Fault Tolerance in MapReduce
 
Map worker writes intermediate output to local disk, separated by partition; once a map task completes, it tells the master node
Reduce worker is told the locations of the map task outputs, pulls its partition’s data from each mapper, and executes the reduce function across that data
Note:
“All-to-all” shuffle between mappers and reducers
Data is written to disk (“materialized”) between each stage
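
As a rough illustration of the partitioning step (my own sketch, not code from the MapReduce paper), each intermediate key is routed to a reducer with a stable hash modulo the number of reduce tasks:

import zlib

def partition(key, num_reducers):
    # A stable hash ensures every mapper sends a given key to the same reducer
    return zlib.crc32(key.encode()) % num_reducers

for word in ["apple", "banana", "cherry"]:
    print(word, "-> reducer", partition(word, 4))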
 
Generality vs Specialization
 
 
General Systems
 
Can be used for many different applications
Jack of all trades, master of none: they pay a generality penalty
Once a specific application, or class of applications, becomes sufficiently important, it is time to build specialized systems
 
 
MapReduce is a General System
 
Can express large computations on large data; enables fault tolerant, parallel computation
Fault tolerance is an inefficient fit for many applications
Parallel programming model (map, reduce) within synchronous rounds is an inefficient fit for many applications

MapReduce for Google’s Index
 
Flagship application in the original MapReduce paper

Q: What is inefficient about MapReduce for computing web indexes?
A: “MapReduce and other batch-processing systems cannot process small updates individually as they rely on creating large batches for efficiency.”

Index moved to Percolator in ~2010 [OSDI ’10]
Incrementally processes updates to the index
Uses OCC (optimistic concurrency control) to apply updates
50% reduction in average age of documents

MapReduce for Iterative Computations
 
Iterative computations: compute on the same data as we update it
e.g., PageRank
e.g., logistic regression

Q: What is inefficient about MapReduce for these?
A: Writing data to disk between all iterations is slow

Many systems are designed for iterative computations; the most notable is Apache Spark
Key idea 1: Keep data in memory once loaded
Key idea 2: Provide fault tolerance via lineage (record ops)
 
MapReduce for Stream Processing
 
Stream processing: continuously process an infinite stream of incoming events
e.g., estimating traffic conditions from GPS data
e.g., identifying trending hashtags on Twitter
e.g., detecting fraudulent ad-clicks

Q: What is inefficient about MapReduce for these?

Stream Processing Systems
 
Many stream processing systems as well; typical structure:
Define the computation ahead of time
Set up machines to run specific parts of the computation and pass data around (the topology)
Stream data into the topology
Repeat forever
Trickiest part: fault tolerance!

Notable systems and their fault tolerance mechanisms:
Apache/Twitter Storm: record acknowledgment
Spark Streaming: micro-batches
Google Cloud Dataflow: transactional updates
Apache Flink: distributed snapshots

Specialization is much faster, e.g., click-fraud detection at Microsoft:
Batch-processing system: 6 hours
With StreamScope [NSDI ’16]: 20 minutes on average
 
In-Memory Data-Parallel Computation
 
Spark: Resilient Distributed Datasets
 
Let’s think of just having a big block of RAM, partitioned across machines…
And a series of operators that can be executed in parallel across the different partitions

That’s basically Spark
A distributed memory abstraction that is both fault-tolerant and efficient
 
Restricted form of distributed shared memory
Immutable, partitioned collections of records
Can only be built through coarse-grained deterministic transformations (map, filter, join, …)
They are called Resilient Distributed Datasets (RDDs)

Efficient fault recovery using lineage
Log one operation to apply to many elements
Recompute lost partitions on failure
No cost if nothing fails
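
A toy illustration of what lineage-based recovery means (my own sketch, not Spark’s internals): a lost partition is rebuilt by replaying the logged coarse-grained operations on its source data, rather than restoring a replica.

# The lineage is a log of operations, not a copy of the data.
lineage = [("map", lambda x: x * 2), ("filter", lambda x: x > 4)]

def recompute(source_partition, lineage):
    data = source_partition
    for op, fn in lineage:
        data = [fn(x) for x in data] if op == "map" else [x for x in data if fn(x)]
    return data

print(recompute([1, 2, 3, 4], lineage))   # [6, 8]: rebuilt from source + log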
 
Spark Programming Interface
 
Language-integrated API in Scala (+ Python)
Usable interactively via the Spark shell
Provides:
Resilient distributed datasets (RDDs)
Operations on RDDs: deterministic transformations (build new RDDs), actions (compute and output results)
Control of each RDD’s partitioning (layout across nodes) and persistence (storage in RAM, on disk, etc.)

Example: Log Mining
Load error messages from a log into memory, then interactively search for various patterns:

lines = spark.textFile("hdfs://...")                // base RDD
errors = lines.filter(_.startsWith("ERROR"))        // transformed RDD
messages = errors.map(_.split('\t')(2))
messages.persist()

messages.filter(_.contains("foo")).count            // action
messages.filter(_.contains("bar")).count

 
In-Memory Data Sharing
 
[Figure: one input feeds a chain of iterations (iter. 1, iter. 2, …); another is loaded with one-time processing and then shared in memory across queries 1–3]

Efficient Fault Recovery via Lineage
Maintain a reliable log of applied operations
Recompute lost partitions on failure

Generality of RDDs
 
Despite their restrictions, RDDs can express many parallel algorithms
These naturally apply the same operation to many items
Unify many programming models
Data flow models: MapReduce, Dryad, SQL, …
Specialized models for iterative apps: BSP (Pregel), iterative MapReduce (Haloop), bulk incremental, …
Support new apps that these models don’t
Enables apps to efficiently intermix these models
 
Spark Operations

Transformations (define a new RDD): map, filter, sample, groupByKey, reduceByKey, sortByKey, flatMap, union, join, cogroup, cross, mapValues
Actions (return a result to the driver program): collect, reduce, count, save, lookupKey, take
 
Spark Summary
 
Global aggregate computations that produce program state
compute the count() of an RDD, compute the max diff, etc.
Loops!
Spark makes it much easier to do multi-stage MapReduce
Built-in abstractions for some other common operations like joins
See also Apache Flink for a flexible big data platform
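
To illustrate the loops point, here is a minimal PySpark-style sketch (my own hypothetical example; the tiny linear model, constants, and app name are made up): the dataset is loaded and cached once, and each iteration reuses the in-memory RDD, shipping only a small scalar back to the driver.

from pyspark import SparkContext

sc = SparkContext(appName="iterative-sketch")

# Load once, keep in memory, reuse across iterations
points = sc.parallelize([(float(x), 3.0 * x) for x in range(100)]).cache()
n = points.count()

w = 0.0                                   # fit y ≈ w * x by gradient descent
for _ in range(20):
    grad = points.map(lambda p: 2 * (w * p[0] - p[1]) * p[0]).sum() / n
    w -= 0.0001 * grad                    # only this scalar crosses iterations

print("fitted w:", w)                     # approaches 3.0
sc.stop()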
 
Stream Processing
 
 
Simple stream processing

Single node/process
Read data from input source (e.g., network socket)
Process
Write output

Examples: Stateless conversion

CtoF: convert Celsius temperature to Fahrenheit
Stateless operation: emit (input * 9 / 5) + 32
 
Examples: Stateless filtering

Filter: the function can filter inputs
if (input > threshold) { emit input }

Examples: Stateful conversion

EWMA: compute an EWMA of the Fahrenheit temperature
new_temp = α * CtoF(input) + (1 - α) * last_temp
last_temp = new_temp
emit new_temp
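
A minimal Python sketch (mine, not from the lecture) of the stateless CtoF operator chained into this stateful EWMA operator, with streams modeled as generators:

def ctof(stream):
    for celsius in stream:                # stateless: output depends only on the input
        yield celsius * 9 / 5 + 32

def ewma(stream, alpha=0.5):
    last = None                           # stateful: remembers the previous output
    for temp in stream:
        last = temp if last is None else alpha * temp + (1 - alpha) * last
        yield last

print(list(ewma(ctof([20.0, 21.5, 19.0, 22.0]), alpha=0.3)))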
 
Examples: Aggregation (stateful)

Avg: e.g., average value per window
Window can be # elements (10) or time (1s)
Windows can be fixed (every 5s)
Windows can be “sliding” (5s window every 1s)
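
A similarly hedged sketch (my own) of a stateful aggregation over fixed, count-based windows of 10 elements:

def windowed_avg(stream, size=10):
    window = []                           # state: elements seen in the current window
    for value in stream:
        window.append(value)
        if len(window) == size:           # window full: emit one aggregate, reset state
            yield sum(window) / size
            window = []

print(list(windowed_avg(range(30), size=10)))   # [4.5, 14.5, 24.5]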
 
Stream processing as chain

[Figure: a pipeline of operators, e.g., CtoF, Avg, and Filter composed in sequence]

Stream processing as directed graph

[Figure: a dataflow graph with sources (sensor type 1, sensor type 2), operators (CtoF, KtoF, Avg, Filter), and sinks (alerts, storage)]
 
The challenge of stream processing

Large amounts of data to process in real time
Examples:
Social network trends (#trending)
Intrusion detection systems (networks, datacenters)
Sensors: detect earthquakes by correlating vibrations of millions of smartphones
Fraud detection
Visa: 2000 txn/sec on average, peak ~47,000/sec

Scale up: batching

Tuple-by-tuple:
    input = read()
    if (input > threshold) { emit input }

Micro-batch:
    inputs = read()
    out = []
    for input in inputs {
        if (input > threshold) {
            out.append(input)
        }
    }
    emit out
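
A runnable Python rendering of the same contrast (my own sketch; emit() just stands in for handing results to the next operator):

THRESHOLD = 10

def emit(x):                              # stand-in for pushing results downstream
    print("emit:", x)

def tuple_by_tuple(stream):
    for item in stream:                   # one downstream call per passing item
        if item > THRESHOLD:
            emit(item)

def micro_batch(batches):
    for batch in batches:                 # one downstream call per batch
        emit([item for item in batch if item > THRESHOLD])

tuple_by_tuple([5, 12, 7, 20])
micro_batch([[5, 12], [7, 20]])
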
Scale up

Tuple-by-tuple: lower latency, lower throughput
Micro-batch: higher latency, higher throughput

Why? Each read/write is a system call into the kernel. More cycles are spent on kernel/application transitions (context switches), and fewer are actually spent processing data.

Scale out
 
Stateless operations: trivially parallelized

[Figure: multiple independent CtoF instances running in parallel]

State complicates parallelization

Aggregations:
Need to join results across parallel computations
[Figure: each parallel pipeline of CtoF and Filter keeps partial state (Sum, Cnt) that must be combined to compute the global Avg]

Parallelization complicates fault-tolerance

Aggregations:
Need to join results across parallel computations
[Figure: the same parallel pipelines; a failed branch’s partial state (Sum, Cnt) blocks the joined Avg]

Can parallelize joins

Compute trending keywords, e.g.:
[Figure: each worker takes a portion of the tweets and computes Sum/key; the partial sums are joined into a global Sum/key, then Sort and top-k]

Can parallelize joins

[Figure: tweets are hash-partitioned by key; each partition computes Sum/key, Sort, and a local top-k; a final stage then 1. merges, 2. sorts, and 3. takes the global top-k]
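
A single-process Python simulation of this parallel plan (my own sketch, with made-up tweets): hash-partition words, count per partition (Sum/key), then merge, sort, and take the top-k.

from collections import Counter
import heapq, zlib

def partition_id(word, num_partitions):
    return zlib.crc32(word.encode()) % num_partitions

def trending(tweets, num_partitions=3, k=2):
    partial = [Counter() for _ in range(num_partitions)]   # per-partition Sum/key
    for tweet in tweets:
        for word in tweet.split():
            partial[partition_id(word, num_partitions)][word] += 1
    merged = Counter()                                     # 1. merge
    for counts in partial:
        merged.update(counts)
    return heapq.nlargest(k, merged.items(), key=lambda kv: kv[1])   # 2. sort, 3. top-k

print(trending(["spark spark flink", "spark storm", "flink flink storm"]))
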
Parallelization complicates fault-tolerance

[Figure: the same hash-partitioned Sum/key → Sort → top-k plan with the final merge/sort/top-k stage, repeated to illustrate the fault-tolerance problem]
 
Popular Streaming Frameworks

Various fault tolerance mechanisms:
1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)
 
Popular Streaming Frameworks

1. Record acknowledgement (Storm)
   At-least-once semantics
   Ensures each input is “fully processed”
   Tracks every processed tuple over the DAG, propagating ACKs upwards to the input source of the data
   Cons: apps need to deal with duplicate or out-of-order tuples
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)
 
Popular Streaming Frameworks

1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
   Each micro-batch may succeed or fail
   On failure, recompute the micro-batch
   Use lineage to track dependencies
   Checkpoint state to support failure recovery
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)
 
Popular Streaming Frameworks

1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
   Treat every processed record as a transaction, committed upon processing
   On failure, replay the log to restore a consistent state and replay lost records
4. Distributed snapshots (Flink)
 
Popular Streaming Frameworks

1. Record acknowledgement (Storm)
2. Micro-batches (Spark Streaming, Storm Trident)
3. Transactional updates (Google Cloud Dataflow)
4. Distributed snapshots (Flink)
   Take a system-wide consistent snapshot (the algorithm is a variation of Chandy-Lamport)
   Snapshot periodically
   On failure, recover the latest snapshot, rewind the stream source to the snapshot point, then replay inputs
 
Graph-Parallel Computation
 
Properties of Graph Parallel Algorithms
 
Dependency Graph
Iterative Computation
Factored Computation
 
ML Tasks Beyond Data-Parallelism
Data-Parallel (Map Reduce): Cross Validation, Feature Extraction, Computing Sufficient Statistics
Graph-Parallel (?): Graphical Models (Gibbs Sampling, Belief Propagation, Variational Opt.), Semi-Supervised Learning (Label Propagation, CoEM), Graph Analysis (PageRank, Triangle Counting), Collaborative Filtering (Tensor Factorization)
 
Pregel: Bulk Synchronous Parallel

Let’s slightly rethink the MapReduce model for processing graphs
Vertices
“Edges” are really messages
Compare to MapReduce keys → values?
“Think like a vertex”

The Basic Pregel Execution Model

A sequence of supersteps; for each vertex V, at superstep S:
Compute in parallel at each V
Read messages sent to V in superstep S-1
Update value / state
Optionally change topology
Send messages
Synchronization: wait till all communication is finished
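
A toy, single-process Python sketch of this superstep loop (my own illustration, not Google’s implementation), running PageRank on a three-vertex graph:

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}   # vertex -> out-edges
rank = {v: 1.0 / len(graph) for v in graph}         # per-vertex value
DAMPING, SUPERSTEPS = 0.85, 20

for _ in range(SUPERSTEPS):
    # Each vertex sends rank / out-degree along its out-edges ("edges are messages")
    inbox = {v: [] for v in graph}
    for v, out_edges in graph.items():
        for dst in out_edges:
            inbox[dst].append(rank[v] / len(out_edges))
    # Barrier, then every vertex reads its messages and updates its value in parallel
    rank = {v: (1 - DAMPING) / len(graph) + DAMPING * sum(inbox[v]) for v in graph}

print(rank)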
 
Termination Test

Based on every vertex voting to halt
Once a vertex deactivates itself, it does no further work unless triggered externally by receiving a message
The algorithm terminates when all vertices are simultaneously inactive
[Figure: vertex state machine: Active → Inactive on “vote to halt”; Inactive → Active on “message received”]
 
Distributed Machine Learning
Machine learning (ML)

ML algorithms can improve automatically through experience (data)

Most common approaches:

Supervised learning: train the model first, then use it
Unsupervised learning: the model learns by itself
Reinforcement learning (RL): the model learns while doing

ML training

Training: feed the ML model data, so that it can learn how to make decisions
Inference (or model serving): the ML model is in use, to process live data
[Figure: labeled examples (DOG/CAT) from a training dataset flow through the model; a loss function scores the (initially 100% wrong) predictions and produces an update Δ]

Distributed ML training: data parallel

Mini-batch: the amount of data processed by a single worker during 1 iteration
Global batch: the amount of data processed by all workers during 1 iteration
[Figure: WORKER 1 and WORKER 2 each run stochastic gradient descent (SGD) on their mini-batches and produce local updates Δ1 and Δ2, which are combined (Δ1 + Δ2) into the global update Δ]
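
A toy Python sketch of data-parallel training (my own illustration; the linear model, shard count, and learning rate are made up): each worker computes a gradient on its shard of the data, and the local updates are averaged into one global step.

import random

def gradient(w, batch):
    # d/dw of the mean squared error for the model y ≈ w * x
    return sum(2 * (w * x - y) * x for x, y in batch) / len(batch)

data = [(float(x), 3.0 * x) for x in range(40)]       # ground truth: w = 3
random.shuffle(data)
shards = [data[i::4] for i in range(4)]               # one shard per "worker"

w = 0.0
for step in range(100):
    grads = [gradient(w, shard) for shard in shards]  # computed "in parallel"
    w -= 0.0005 * sum(grads) / len(grads)             # combine (average) the local updates

print("learned w ≈", round(w, 3))                     # converges toward 3.0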
 
Distributed ML training: model parallel or hybrid

Model parallel
Hybrid model-data parallel
[Figure: training datasets and model parts assigned to WORKER 1 through WORKER 4 under the model-parallel and hybrid model-data-parallel layouts]
 
Weak scaling and strong scaling

Weak scaling:
Fixed local batch size per worker
More workers can process a larger global batch in one iteration
Same iteration time, fewer iterations
Same data transfers at each iteration
Time to accuracy does not scale linearly with the number of workers

Strong scaling:
Fixed global batch size
With more workers, the local batch size per worker decreases
Reduced iteration time (for computation)
Same data transfers at each iteration
More frequent synchronizations among workers (more network traffic)
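
A tiny illustration (mine) of how the batch sizes relate in the two regimes:

def weak_scaling(local_batch, workers):
    # local batch fixed; the global batch grows with the number of workers
    return {"local_batch": local_batch, "global_batch": local_batch * workers}

def strong_scaling(global_batch, workers):
    # global batch fixed; each worker's share shrinks as workers are added
    return {"local_batch": global_batch // workers, "global_batch": global_batch}

print(weak_scaling(local_batch=32, workers=8))      # global batch of 256
print(strong_scaling(global_batch=256, workers=8))  # local batch of 32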
 
When the network is the bottleneck

Compute accelerator performance improvements have so far outpaced network bandwidth increases
Newer, larger DNN models spend more time on communication