Introduction to MapReduce: Efficient Data Processing Technique

MapReduce
Francisco Moreno
Excerpts from
“Mining of Massive Datasets”
Rajaraman, Leskovec, and Ullman
Introduction
Modern data-mining applications, often called “big-data” analysis,
require us to manage immense amounts of data quickly.
There is ample opportunity to exploit 
parallelism
.
Parallelism
 is exploited not from a “supercomputer,” 
but from
“computing clusters”.
Introduction
In the past, applications that called for parallel processing, such as
large scientific calculations, were done on 
special-purpose parallel
computers 
with many processors and specialized hardware.
However, the prevalence of large-scaleWeb services has caused more
and more computing to be done on installations with thousands of
compute nodes operating more or less independently.
In these installations, the compute nodes are commodity
*
 hardware,
which greatly 
reduces the cost compared with special-purpose
parallel 
machines
.
*
 Producto básico, materia prima
Introduction
Central to this new software stack is a 
programming technique
 called
MapReduce
.
Implementations of MapReduce enable many of the most common
calculations on large-scale data to be performed on computing
clusters efficiently
MapReduce
MapReduce is a style of computing that has been implemented in
several systems, including Google’s internal implementation and the
popular open-source implementation 
Hadoop
It has also been implemented by:
   - Couchbase
   - CouchDB (Apache CouchDB)
   - Infinispan
   
-
 MongoDB
   - RavenDB
   - Riak
It can also be mimicked in other systems,
e.g., in Oracle
MapReduce
All you need to write are two functions, called 
Map
 and 
Reduce
, while
the system manages the parallel execution
, coordination of tasks
that execute Map or Reduce, and also deals with the possibility that
one of these tasks will fail to execute.
In brief, a MapReduce computation executes as follows:
MapReduce 
Step 1
Some number of Map tasks (mappers) each are given one or more
chunks (
a chunk is a collection of elements, e.g., documents
) from a
distributed file system.
These Map tasks turn the chunk into a sequence of 
key-value pairs
.
The way 
key-value pairs 
are produced from the input data is
determined by the code written by the user 
for the Map function.
Note que acá las claves (
keys
) no tienen el sentido de claves únicas (claves
primarias, claves alternativas). De hecho, una función map podría generar varias
parejas clave-valor (
key-value pairs
) idénticas…
MapReduce 
Step 2
The key-value pairs from each Map task are collected by a 
master
controller (combiner, shuffler) 
and 
grouped by key
*
.
The keys are divided among all the Reduce tasks (reducers), so all key-
value pairs with the same key wind up at the same Reduce 
task
**
.
*
 
Es un proceso similar a un GROUP BY de SQL, solo que acá no se aplican funciones
de agregación sobre los valores. Con los valores de cada clave (
key
) se forma una
colección.
** 
Esto es así ya que todos los valores de una misma clave están reunidos en una
colección: (clave, colección de valores).
MapReduce 
Step 3
The Reduce tasks work on one key at a time, and combine (usually,
aggregate, “consolidate”) 
all the values* 
associated with that key in
some way.
The manner of combination (aggregation, consolidation) of values is
determined by the code written by the user
 for the Reduce 
function.
*
 O sea, la colección de valores de cada clave
Schematic of a MapReduce computation
 
--
--
--
--
--
--
Input
chunks
Map 1
Map 2
Map
 n
C
o
n
t
r
o
l
l
e
r
Key-value
pairs (
k
, 
v
)
Key-value
pairs (
k
, 
v
)
Key-value
pairs (
k
, 
v
)
Reduce 1
Reduce 2
Reduce 3
Reduce 
m
Keys with all
their values
(collection of
values)
(k1, [values])
(k2, [values])
(k3, [values])
(k4, [values])
(k5, [values])
(k6, [values])
(k7, [values])
(k8, [values])
(k9, [values])
Combined
result
Group by
keys
A first schematic example
 
10
 Juan
21
 Pedro
Input
documents
Map 1
Map 2
Controller
(
0
, Juan)
(
1
, Pedro)
Reduce 1
Reduce 2
Map
: if 
cc
is odd then
1
 else 
0
(
0
, [Juan, Lisa, Ana])
(
1
, [Pedro, Pierre])
Combined
result
30
 Lisa
40
 Ana
35
 Pierre
(
0
, Lisa)
(
0
, Ana)
(
1
, Pierre)
Group by
keys
(
0
, [Juan, Lisa, Ana])
(
1
, [Pedro, Pierre])
Reduce
:
count the
number of
values
(
0
, 3)
(
1
, 2)
(
0
, 3)
(
1
, 2)
Keys with all
their values
cc
 
name
cc
 
name
A first schematic example
 
10
 Juan
21
 Pedro
Input
documents
Map 1
Map 2
Controller
(
0
, 
1
)
(
1
, 
1
)
Reduce 1
Reduce 2
(
0
, [1, 1, 1])
(
1
, [1, 1])
30
 Lisa
40
 Ana
35
 Pierre
(
0
, 
1
)
(
0
, 
1
)
(
1
, 
1
)
Group by
keys
(
0
, [
1
, 
1
, 
1
])
(
1
, [
1
, 
1
])
Reduce:
count the
number of
values
Here, instead of emitting the
names, just emits 
1s
Map: if 
cc
is odd then
1
 else 
0
(
0
, 3)
(
1
, 2)
(
0
, 3)
(
1
, 2)
Combined
result
Keys with all
their values
cc
 
name
cc
 
name
A first schematic example
 
10
 Juan
21
 Pedro
NA
 Eric
Input
documents
Map 1
Map 2
Controller
(
0
, 
1
)
(
1
, 
1
)
(
-1
, 
1
)
Reduce 1
Reduce 2
(
0
, [1, 1, 1])
(
1
, [1, 1])
(
-1
, [1, 1])
30
 Lisa
40
 Ana
35
 Pierre
NA
 Bill
(
0
, 
1
)
(
0
, 
1
)
(
1
, 
1
)
(
-1
, 
1
)
Group by
keys
(
0
, [
1
, 
1
, 
1
])
(
1
, [
1
, 
1
])
(
-1
, [1, 1])
Reduce:
count the
number of
values
Map: if 
cc 
is odd
then 
1
 else 
0
.
If 
cc
 = NA then -1
(
0
, 3)
(
1
, 2)
(
-1
, 2)
(
0
, 3)
(
1
, 2)
(
-1
, 2)
Combined
result
Keys with all
their values
cc
 
name
cc
 
name
Here, reduce 2 processes
two (key, values)
T
h
e
 
M
a
p
 
T
a
s
k
s
We view input files for a Map task as consisting of elements, which
can be any type
: a tuple or a document, for example.
A chunk is a collection of elements, and no element is stored across
two chunks.
Technically, all inputs 
to Map tasks and outputs from Reduce tasks are
of the 
key-value-pair
 form.
Insisting on this form for inputs and outputs is motivated by the
desire to allow 
composition
 of 
several
 MapReduce processes.
T
h
e
 
M
a
p
 
T
a
s
k
s
The Map function takes an input element as its argument and
produces zero or more key-value pairs.
The types of keys and values are each arbitrary.
Further, keys are not “keys” in the usual sense; 
they do not have to be
unique
: a Map task can produce several key-value pairs with the same
key, 
even 
from the same element
.
Example
We shall illustrate a MapReduce computation with what has become
the standard example application: counting the number of
occurrences for each word in a collection of documents.
In this example, the input file is a repository of documents, and each
document is an element
The Map function 
for this example uses keys that are of type String
(the words) and values that 
are integers.
Example
The Map task reads a document and breaks it into its sequence
   of words 
w1
,
w2
, . . . ,
w
n
.
It then emits a sequence of key-value pairs where the value is always
1. That is, the output of the Map task for this document is the
sequence of key-value pairs:
(w1, 1), (w2, 1), . . . , (w
n
, 1)
Example
Note that a single Map task will typically process many documents –
all the documents in one or more chunks.
Note also that if a word 
w
 appears 
m
 times among all the documents
assigned to that process, then there will be m key-value pairs 
(w, 1)
among its output.
An option, which 
we show later
, is to combine these 
m
 pairs into a
single pair 
(w, 
m
)
G
r
o
u
p
i
n
g
 
b
y
 
K
e
y
As soon as the Map tasks have all completed successfully, the key-
value pairs are grouped by key, and the values associated with each
key are formed into a list (collection) 
of values.
The grouping is performed by the controller, regardless of what the
Map 
and Reduce tasks do.
G
r
o
u
p
i
n
g
 
b
y
 
K
e
y
The master controller process knows how many Reduce tasks there
will be, say
 
r
 
such tasks or the user could tells the MapReduce system
what 
r
 should be
*
.
Then the master controller picks a hash function
**
 that applies to
keys and produces a bucket number from 0 to
 
r
 
- 1. This number
represents the corresponding Reduce process where the key-value
pair is destined.
* 
A similar situation for the number of map tasks.
**
Optionally, users can specify their own hash function or other method for
assigning keys to Reduce tasks. However, whatever algorithm is used, 
each key
is assigned to one and only 
one Reduce task.
G
r
o
u
p
i
n
g
 
b
y
 
K
e
y
The controller performs the grouping by key and distributes to the
Reduce tasks:
The input to the 
Reduce task that handles key 
k
 is a pair of the form
(k, [v1, v2, . . . , v
p
])
Where 
(k, v1), (k, v2), . . . , (k, v
p
)
 are all the key-value pairs with key k
coming from 
all the Map tasks.
T
h
e
 
R
e
d
u
c
e
 
T
a
s
k
s
The Reduce function’s argument (input parameter) is a pair consisting
of a key and its list of 
associated values.
The output of the Reduce function is a sequence of 
zero or 
more key-
value pairs
.
These key-value pairs can be of a type different from those sent from
Map tasks to Reduce tasks, but often they are the same type.
T
h
e
 
R
e
d
u
c
e
 
T
a
s
k
s
We shall refer to the application of the Reduce function to a single
key and its associated list of values as a 
reducer 
(reducción).
A Reduce task receives 
one or more 
keys and their associated value
lists. That is, a Reduce task executes one or more 
reducers
.
The outputs from all the Reduce tasks are merged into a single output
(output file) 
 Combined result
Example
Let us continue with our word-count example.
The Reduce function simply adds up all the values. The output of a
reducer consists of the word and the sum.
Thus, the output of all the Reduce tasks is a sequence of 
(w, m) 
pairs,
where 
w
 is a word that appears at least once among all the input
documents and 
m
 is indeed the total number of occurrences of 
w
among 
all those documents.
Example schematically:
 
…casa…gato
…niño…casa
…gato…
Map 1
Map 2
Controller
(casa, 1)
(gato, 1)
(niño, 1)
(casa, 1)
(gato, 1)
Reduce 1
Reduce 2
(casa, [1, 1, 1])
(gato, [1, 1])
(perro, [1, 1])
(niño, [1, 1, 1])
Combined
result
…casa…niño..
…perro…niño
…perro…
(casa, [1, 1, 1])
(gato, [1, 1])
(perro, [1, 1])
(niño, [1, 1, 1])
(casa, 3)
(gato, 2)
(perro, 2)
(niño, 3)
(casa, 1)
(niño, 1)
(perro, 1)
(niño, 1)
(perro, 1)
(casa, 3)
(gato, 2)
(perro, 2)
(niño, 3)
Por supuesto, en este ejemplo no se consideran las posibles palabras representadas por los puntos suspensivos…
An alternative for the Example
In our example, we can “
push
 
some of what the reducers do to the
Map tasks:
Instead of the previous Map tasks producing many pairs 
(w, 1), (w, 1),
. . .
, we could apply the process that the Reduce function performs
within
 the Map task, before the output of the Map tasks is subject to
grouping and subsequent aggregation by the Reduce function.
These key-value pairs would thus be replaced by one pair with key 
w
and value equal to the sum of all the 1’s in 
all those pairs.
An alternative for the Example
That is, the pairs with key 
w
 generated by a single Map task would be
replaced by a pair 
(w, 
m
)
, where 
m
 is the number of times that 
w
appears among the documents handled by this Map task.
Note that it is still necessary to do grouping and aggregation and to
pass the result to the Reduce tasks, since there will typically be one
key-value pair with key w coming from each of the Map tasks.
An alternative for the Example schematically:
 
…casa…gato
…niño…casa
…gato…
Map 1
Map 2
Controller
(casa, 
2
)
(gato, 
2
)
(niño, 
1
)
Reduce 1
Reduce 2
(casa, [2, 1])
(gato, [2])
(perro, [2])
(niño, [1, 2])
Combined
result
…casa…niño..
…perro…niño
…perro…
(casa, [2, 1])
(gato, [2])
(perro, [2])
(niño, [1, 2])
(casa, 3)
(gato, 2)
(perro, 2)
(niño, 3)
(casa, 
1
)
(niño, 
2
)
(perro, 
2
)
(casa, 3)
(gato, 2)
(perro, 2)
(niño, 3)
Computing Natural Join by MapReduce
The idea behind implementing 
natural join 
via MapReduce can be
seen if we look at the specific case of joining 
R
(A, 
B
) with 
S
(
B
, C).
We must find tuples that agree on their 
B
 components, that is the
second component from tuples of 
R
 and the first component of tuples
of 
S
.
Join attribute
Computing Natural Join by MapReduce
We shall use the 
B
-value of tuples from either relation as the 
key
. The
value will be 
the other component (attributes) and the name of the
relation
, so the Reduce function can know where each tuple came
from:
(
B
, 
(
the name of the relation, the other attributes of the relation)
)
Key
Value
Computing Natural Join by MapReduce
The 
Map
 Function:
For each tuple (
a
, 
b
) of 
R
(
A
, 
B
), produce the key-value pair (
b
, 
(R, a)
).
For each tuple (
b
, 
c
) of 
S
(
B
, 
C
), produce the key-value pair (
b
, 
(S, c)
).
Example
EMP
   
 
ced
 
dep
 
10
 
1
20
 
1
30
 
2
DEPT
   
 
dep
 
name
1
 
Ventas
2
 
Cocina
3
 
Mercadeo
Foreign key
EMP
 Natural Join 
DEPT
 
  
ced
 
dep
 
name
10
 
1
 
Ventas
20
 
1
 
Ventas
30
 
2
 
Cocina
Computing Natural Join by MapReduce
Tuples produced by the map functions:
From 
EMP
:
(
1
, 
(EMP, 10)
)
(
1
, 
(EMP, 20)
)
(
2
, 
(EMP, 30)
)
From 
DEPT
:
(
1
, 
(DEPT, Ventas)
)
(
2
, 
(DEPT, Cocina)
)
(
3
, 
(DEPT, Mercadeo)
)
Computing Natural Join by MapReduce
Controller:
(
1
, 
(EMP, 10)
)
(
1
, 
(EMP, 20)
)
(
2
, 
(EMP, 30)
)
(
1
, 
(DEPT, Ventas)
)
(
2
, 
(DEPT, Cocina)
)
(
3
, 
(DEPT, Mercadeo)
)
Map
tasks
(
1
, 
[(EMP, 10), (EMP, 20), (DEPT, Ventas)]
)
(
2
, 
[(EMP, 30), (DEPT, Cocina)]
)
(
3
, 
[(DEPT, Mercadeo)]
)
DEPT
EMP
C
o
n
t
r
o
l
l
e
r
Grouping
Computing Natural Join by MapReduce
The 
Reduce
 Function:
Now that the controller has associated each key value 
b
 with a list of pairs
   that are either of the form 
(R, a)
 or 
(S, c)
, 
construct all pairs consisting of
   one with first component 
R
 and the other with first component 
S
*
.
The output from this key and value list is a sequence of key-
value
 pairs.
Each 
value 
is one of the triples 
(a, 
b
, c) 
such that 
(R, a) 
and 
(S, c)
 are on the
input list of values of 
b
.
Here, the key (in the reduce output) is irrelevant
. The result (join) is made
up of all the triples (values).
*
 That is, the cartesian product
Computing Natural Join by MapReduce
(
1
, 
[(EMP, 10), (EMP, 20), (DEPT, Ventas)]
)
(
2
, 
[(EMP, 30), (DEPT, Cocina)]
)
(
3
, 
[(DEPT, Mercadeo)]
)
(
1
, 
[(10, 
1
, Ventas), (20, 
1
, Ventas])]
)
(
2
, 
[(30, 
2
, Cocina)]
)
(
3
, 
[]
)
Reducers
Result
Computing Natural Join by MapReduce
Result:
(10, 
1
, Ventas)
(20, 
1
, Ventas)
(30, 
2
, Cocina)
Matrix-Vector Multiplication by MapReduce
Suppose we have an 
n
×
n
 matrix 
M
, whose element in row 
i
 and
column 
j
 will be denoted 
m
ij
.
Suppose we also have a vector 
v
 of length 
n
, whose 
j
th
 element
   is 
v
j
 .
Then the matrix-vector product is the vector
 x 
of length 
n
, whose 
i
th
element 
x
i
 
is given by
Example
8
 
4
 
1
9
 
5
 
2
3
 
2
 
3
2
3
9
 
37
51
39
 
x
=
i
 = 1 : 8x2 + 4x3 + 1x9 = 37
i 
= 2 : 9x2 + 5x3 + 2x9 = 51
i 
= 3 : 3x2 + 2x3 + 3x9 = 39
Matrix-Vector Multiplication by MapReduce
The matrix 
M
 and the vector 
v
 each will be stored in a file of the DFS
(data file system).
The Map function is written to apply to one element of 
M
.
Each Map task will operate on a 
chunk
 of the matrix 
M
. From each
matrix element 
m
ij
 it produces the key-value pair 
(i, 
m
ij
v
j
)
.
Thus, all terms of the sum that make up the component 
x
i
 of the
matrix-vector product will get the 
same key,
 i
.
The Reduce Function: The Reduce function simply sums all the values
associated with a given key 
i
. The result will be a pair (
i
, 
x
i
).
Example schematically (1/2):
8
 
4
 
1
9
 
5
 
2
3
 
2
 
3
2
3
9
 
x
Map 1
Map 2
Map 3
Map 4
Map 5
Map 6
Map 7
Map 8
Map 9
(1, 8*
2
)
(1, 4*
3
)
(1, 1*
9
)
(2, 9*
2
)
(2, 5*
3
)
(2, 2*
9
)
(3, 3*
2
)
(3, 2*
3
)
(3, 3*
9
)
Controller
(1, [16, 12, 9])
(2, [18, 15, 18])
(3, [6, 6, 27])
Reducers
Row
Example schematically (2/2):
Controller
(1, [16, 12, 9])
(2, [18, 15, 18])
(3, [6, 6, 27])
Reduce 1
Reduce 2
Reduce 3
Combined
result
(1, 37)
(2, 51)
(3, 39)
(1, 37)
(2, 51)
(3, 39)
Matrix Multiplication by MapReduce
If 
M
 is a matrix with element 
m
ij
 
in row 
i
 and column 
j
, and 
N
 is a
matrix with element 
n
jk
 in row 
j
 and column
 k
, then the product 
P =
MN
 is the matrix 
P
 with element 
p
ik
 in row
 i 
and column 
k
, where
It is required that the number of columns of 
M
 (
h
) equals the number
of rows of 
N 
(
h
), so the sum over 
j
 makes sense.
h
Example
 
Tomado de: http://www.proferiera.comocreartuweb.es/material5/unidad2/objetos/multiplicacion-de-matrices2.gif
Matrix Multiplication by MapReduce
The 
Map
 Function: 
For 
each
 element
 
m
ij
 
of 
M
, produce all the key-
value pairs 
((i, k), (
M
, j,m
ij
))
 for 
k
 = 1, 2, . . ., 
up to the number of
columns
 of 
N
.
Similarly, 
for 
each
 element
 
n
jk
 of 
N
, produce all the key-value pairs
   ((i, k), (
N
, j, n
jk
)
 for 
i 
= 1, 2, . . ., 
up to the number of rows 
of 
M
.
Note that 
M
 and 
N
 are really indicators to tell which of the two
relations a value comes 
from.
Example step by step
Consider element 
m
i
j
 
=
 m
1
,
1
 
= 
3
 
then
For k = 1
((
1
, 1), (
M
, 
1
, 
3
))
For k = 2
((
1
, 2), (
M
, 
1
, 
3
))
3
 
2
 
1
1
 
1
 
3
0
 
2
 
1
So for 
each
 element
 
m
i
j
 
of 
M 
we have:
m
i
j
 
=
 m
1
,
1
 
= 
3
 
then
For k = 1
((
1
, 1), (
M
, 
1
, 
3
))
For k = 2
((
1
, 2), (
M
, 
1
, 
3
))
m
i
j
 
=
 m
1
,
2
 
= 
2
 
then
For k = 1
((
1
, 1), (
M
, 
2
, 
2
))
For k = 2
((
1
, 2), (
M
, 
2
, 
2
))
m
i
j
 
=
 m
1
,
3
 
= 
1
 
then
For k = 1
((
1
, 1), (
M
, 
3
, 
1
))
For k = 2
((
1
, 2), (
M
, 
3
, 
1
))
Row 
i 
=
 1
3
 
2
 
1
1
 
1
 
3
0
 
2
 
1
m
i
j
 
=
 m
2
,
1
 
= 
1
 
then
For k = 1
((
2
, 1), (
M
, 
1
, 
1
))
For k = 2
((
2
, 2), (
M
, 
1
, 
1
))
m
i
j
 
=
 m
2
,
2
 
= 
1
 
then
For k = 1
((
2
, 1), (
M
, 
2
, 
1
))
For k = 2
((
2
, 2), (
M
, 
2
, 
1
))
m
i
j
 
=
 m
2
,
3
 
= 
3
 
then
For k = 1
((
2
, 1), (
M
, 
3
, 
3
))
For k = 2
((
2
, 2), (
M
, 
3
, 
3
))
Row 
i 
=
 2
3
 
2
 
1
1
 
1
 
3
0
 
2
 
1
m
i
j
 
=
 m
3
,
1
 
= 
0
 
then
For k = 1
((
3
, 1), (
M
, 
1
, 
0
))
For k = 2
((
3
, 2), (
M
, 
1
, 
0
))
m
i
j
 
=
 m
3
,
2
 
= 
2
 
then
For k = 1
((
3
, 1), (
M
, 
2
, 
2
))
For k = 2
((
3
, 2), (
M
, 
2
, 
2
))
m
i
j
 
=
 m
3
,
3
 
= 
1
 
then
For k = 1
((
3
, 1), (
M
, 
3
, 
1
))
For k = 2
((
3
, 2), (
M
, 
3
, 
1
))
Row 
i 
=
 3
3
 
2
 
1
1
 
1
 
3
0
 
2
 
1
Consider element 
n
j
k
 = n
1
,
1
 = 
2
 then
For i = 1
((1, 
1
), (
N
, 
1
, 
2
))
For i = 2
((2, 
1
), (
N
, 
1
, 
2
))
For i = 3
((3, 
1
), (
N
, 
1
, 
2
))
2
 
1
1
 
0
3
 
2
n
j
k
 = n
1
,
1
 = 
2
 then
For i = 1
((1, 
1
), (
N
, 
1
, 
2
))
For i = 2
((2, 
1
), (
N
, 
1
, 
2
))
For i = 3
((3, 
1
), (
N
, 
1
, 
2
))
n
j
k
 = n
1
,
2
 = 
1
 then
For i = 1
((1, 
2
), (
N
, 
1
, 
1
))
For i = 2
((2, 
2
), (
N
, 
1
, 
1
))
For i = 3
((3, 
2
), (
N
, 
1
, 
1
))
Row 
j 
=
 1
So for 
each
 element
 
n
j
k
 
of 
N 
we have:
2
 
1
1
 
0
3
 
2
n
j
k
 = n
2
,
1
 = 
1
 then
For i = 1
((1, 
1
), (
N
, 
2
, 
1
))
For i = 2
((2, 
1
), (
N
, 
2
, 
1
))
For i = 3
((3, 
1
), (
N
, 
2
, 
1
))
n
j
k
 = n
2
,
2
 = 
0
 then
For i = 1
((1, 
2
), (
N
, 
2
, 
0
))
For i = 2
((2, 
2
), (
N
, 
2
, 
0
))
For i = 3
((3, 
2
), (
N
, 
2
, 
0
))
Row 
j 
=
 2
2
 
1
1
 
0
3
 
2
n
j
k
 = n
3
,
1
 = 
3
 then
For i = 1
((1, 
1
), (
N
, 
3
, 
3
))
For i = 2
((2, 
1
), (
N
, 
3
, 
3
))
For i = 3
((3, 
1
), (
N
, 
3
, 
3
))
n
j
k
 = n
3
,
2
 = 
2
 then
For i = 1
((1, 
2
), (
N
, 
3
, 
2
))
For i = 2
((2, 
2
), (
N
, 
3
, 
2
))
For i = 3
((3, 
2
), (
N
, 
3
, 
2
))
Row 
j 
=
 3
2
 
1
1
 
0
3
 
2
((1, 1), (M, 1, 3))
((1, 2), (M, 1, 3))
((1, 1), (M, 2, 2))
((1, 2), (M, 2, 2))
((1, 1), (M, 3, 1))
((1, 2), (M, 3, 1))
((2, 1), (M, 1, 1))
((2, 2), (M, 1, 1))
((2, 1), (M, 2, 1))
((2, 2), (M, 2, 1))
((2, 1), (M, 3, 3))
((2, 2), (M, 3, 3))
((3, 1), (M, 1, 0))
((3, 2), (M, 1, 0))
((3, 1), (M, 2, 2))
((3, 2), (M, 2, 2))
((3, 1), (M, 3, 1))
((3, 2), (M, 3, 1))
((1, 1), (N, 1, 2))
((2, 1), (N, 1, 2))
((3, 1), (N, 1, 2))
((1, 2), (N, 1, 1))
((2, 2), (N, 1, 1))
((3, 2), (N, 1, 1))
((1, 1), (N, 2, 1))
((2, 1), (N, 2, 1))
((3, 1), (N, 2, 1))
((1, 2), (N, 2, 0))
((2, 2), (N, 2, 0))
((3, 2), (N, 2, 0))
((1, 1), (N, 3, 3))
((2, 1), (N, 3, 3))
((3, 1), (N, 3, 3))
((1, 2), (N, 3, 2))
((2, 2), (N, 3, 2))
((3, 2), (N, 3, 2))
So these are the data
produced by the
maps and that are
sent to the controller
Controller
Grouping
(see next slide)
M
a
p
s
M
a
p
s
(
(1,1)
(1,1)
, [(M, 
1
, 
3
), (M, 
2
, 
2
), (M, 
3
, 
1
),
              (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(1,2)
(1,2)
, [(M, 
1
, 
3
), (M, 
2
, 
2
), (M, 
3
, 
1
),
              (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
(
(2,1)
(2,1)
, [(M, 
1
, 
1
), (M, 
2
, 
1
), (M, 
3
, 
3
),
             (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(2,2)
(2,2)
, [(M, 
1
, 
1
), (M, 
2
, 
1
), (M, 
3
, 
3
),
             (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
(
(3,1)
, [(M, 
1
, 
0
), (M, 
2
, 
2
), (M, 
3
, 
1
),
             (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(3,2)
, [(M, 
1
, 
0
), (M, 
2
, 
2
), (M, 
3
, 
1
),
             (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
Matrix Multiplication by MapReduce
The Reduce Function: Each key 
(i, k) 
has an associated list with all the
values 
(M, 
j
, 
m
ij
 ) 
and 
(N, 
j
, 
n
jk
)
, for all possible values of 
j
.
The Reduce function needs to connect the two values on the list that
have the same value
 
of 
j
, for each 
j
.
The 
j
th
 values on each list 
m
ij
 and 
n
jk
 are extracted and multiplied.
Then, these products are summed and the result is paired with 
(i, k)
in the output of the Reduce function.
Matrix Multiplication by MapReduce
For example, for element 
(i, k) 
=
 (1, 1) 
we have:
(
(1,1)
(1,1)
, [(M, 
1
, 
3
), (M, 
2
, 
2
), (M, 
3
, 
1
),
              (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
                  3
x
2
     +    
2
x
1    
+ 
    1
x
3 
= 11
(
(1,1)
(1,1)
, [(M, 
1
, 
3
), (M, 
2
, 
2
), (M, 
3
, 
1
),
              (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(1,2)
(1,2)
, [(M, 
1
, 
3
), (M, 
2
, 
2
), (M, 
3
, 
1
),
              (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
(
(2,1)
(2,1)
, [(M, 
1
, 
1
), (M, 
2
, 
1
), (M, 
3
, 
3
),
             (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(2,2)
(2,2)
, [(M, 
1
, 
1
), (M, 
2
, 
1
), (M, 
3
, 
3
),
             (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
(
(3,1)
, [(M, 
1
, 
0
), (M, 
2
, 
2
), (M, 
3
, 
1
),
             (N, 
1
, 
2
), (N, 
2
, 
1
), (N, 
3
, 
3
)]
)
(
(3,2)
, [(M, 
1
, 
0
), (M, 
2
, 
2
), (M, 
3
, 
1
),
             (N, 
1
, 
1
), (N, 
2
, 
0
), (N, 
3
, 
2
)]
)
(
(1,1)
(1,1)
, 
3
x
2
+
2
x
1
+
1
x
3
= 11
)
(
(1,2)
(1,2)
, 
3
x
1
+
2
x
0
+
1
x
2
= 5
)
(
(2,1)
(2,1)
, 
1
x
2
+
1
x
1
+
3
x
3
= 12
)
(
(2,2)
(2,2)
, 
1
x
1
+
1
x
0
+
3
x
2
= 7
)
(
(3,1)
, 
0
x
2
+
2
x
1
+
1
x
3
= 5
)
(
(3,2)
, 
0
x
1
+
2
x
0
+
1
x
2
= 2
)
Result
11
 
5
12
 
7
5
 
2
Matrix Multiplication using 
two 
MapReduce
First MapReduce:
The Map Function: For each matrix element 
m
ij
, produce the key
value pair 
(j, (
M
,
 
i, m
ij
))
.
Likewise, for each matrix element 
n
jk
, produce the key value 
pair
   (j, (
N
, k, n
jk
)
.
Example step by step
m
i
j
 
 
 
(
j
, (
M
, 
i
, 
m
i
j
)):
(
1
, (
M
, 
1
, 
3
))
  
(
2
, (
M
, 
1
, 
2
))
  
(
3
, (
M
, 
1
, 
1
))
(
1
, (
M
, 
2
, 
1
))
  
(
2
, (
M
, 
2
, 
1
))
  
(
3
, (
M
, 
2
, 
3
))
(
1
, (
M
, 
3
, 
0
))
  
(
2
, (
M
, 
3
, 
2
))
  
(
3
, (
M
, 
3
, 
1
))
3
 
2
 
1
1
 
1
 
3
0
 
2
 
1
Example step by step
n
j
k
 
 
(
j
, (
N
, 
k
, 
n
j
k
):
(
1
, (
N
, 
1
, 
2
))
  
(
1
, (
N
, 
2
, 
1
))
  
(
2
, (
N
, 
1
, 
1
))
  
(
2
, (
N
, 
2
, 
0
))
  
(
3
, (
N
, 
1
, 
3
))
  
(
3
, (
N
, 
2
, 
2
))
2
 
1
1
 
0
3
 
2
(1, (M, 1, 3))
(2, (M, 1, 2))
(3, (M, 1, 1))
(1, (M, 2, 1))
(2, (M, 2, 1))
(3, (M, 2, 3))
(1, (M, 3, 0))
(2, (M, 3, 2))
(3, (M, 3, 1))
(1, (N, 1, 2))
(1, (N, 2, 1))
(2, (N, 1, 1))
(2, (N, 2, 0))
(3, (N, 1, 3))
(3, (N, 2, 2))
Controller
Grouping
(see next slide)
M
a
p
s
M
a
p
s
So these are the data
produced by the
maps and that are
sent to the controller
(
1
1
, [
(M, 
1
, 
3
)
, 
(M, 
2
, 
1
)
, 
(M, 
3
, 
0
)
, (N, 
1
, 
2
), (N, 
2
, 
1
)]
)
(
2
2
, [
(M, 
1
, 
2
)
, 
(M, 
2
, 
1
)
, 
(M, 
3
, 
2
)
, (N, 
1
, 
1
), (N, 
2
, 
0
)]
)
(
3
3
, [
(M, 
1
, 
1
)
, 
(M, 
2
, 
3
) (M, 
3
, 
1
)
, (N, 
1
, 
3
), (N, 
2
, 
2
)]
)
 
j
          
i
              
i
             
i
             
k
           
k
Reducers
Grouping (controller):
 
The Reduce Function: For each key 
j
, examine its list of associated
values. 
For each value 
that comes from 
M
, say (
M
, 
i
, 
m
ij
), and each
value that comes from 
N
, say (
N
, 
k
, 
n
jk
), produce a key-value pair with
key equal to (
i
, 
k
) and value equal to the product of these elements,
m
ij
n
jk
.
((
1
, 
1
), 
6
) y ((
1
, 
2
), 
3
)
((
2
, 
1
), 
2
) y ((
2
, 
2
), 
1
)
((
3
, 
1
), 
0
) y ((
3
, 
2
), 
0
)
(
1
1
, [
(M, 
1
, 
3
)
, 
(M, 
2
, 
1
)
, 
(M, 
3
, 
0
)
,
(N, 
1
, 
2
), (N, 
2
, 
1
)]
)
(
2
2
, [
(M, 
1
, 
2
)
, 
(M, 
2
, 
1
)
, 
(M, 
3
, 
2
)
,
(N, 
1
, 
1
), (N, 
2
, 
0
)]
)
(
3
3
, [
(M, 
1
, 
1
)
, 
(M, 
2
, 
3
) (M, 
3
, 
1
)
,
(N, 
1
, 
3
), (N, 
2
, 
2
)]
)
((
1
, 
1
), 
2
) y ((
1
,
 2
), 
0
)
((
2
, 
1
), 
1
) y ((
2
, 
2
), 
0
)
((
3
, 
1
), 
2
) y ((
3
, 
2
), 
0
)
((
1
, 
1
), 
3
) y ((
1
, 
2
), 
2
)
((
2
, 
1
), 
9
) y ((
2
, 
2
), 
6
)
((
3
, 
1
), 
3
) y ((
3
, 
2
), 
2
)
This is the input
for the second
MapReduce
i
 , 
k
i
 , 
k
i
 , 
k
i
 , 
k
i
 , 
k
i
 , 
k
Reducers
   j
 
The Map Function: 
This function is just the identity
. That is, for every
input element with key (i, k) and value 
v
, produce exactly this key-
value pair.
((1, 1), 
6
) y ((1, 2), 
3
)
((2, 1), 
2
) y ((2, 2), 
1
)
((3, 1), 
0
) y ((3, 2), 
0
)
((1, 1), 
2
) y ((1, 2), 
0
)
((2, 1), 
1
) y ((2, 2), 
0
)
((3, 1), 
2
) y ((3, 2), 
0
)
((1, 1), 
3
) y ((1, 2), 
2
)
((2, 1), 
9
) y ((2, 2), 
6
)
((3, 1), 
3
) y ((3, 2), 
2
)
Controller
((1, 1), [
6
, 
2
, 
3
])
((1, 2), [
3
, 
0
, 
2
])
((2, 1), [
2
, 
1
, 
9
])
((2, 2), [
1
, 
0
, 
6
])
((3, 1), [
0
, 
2
, 
3
])
((3, 2), [
0
, 
0
, 
2
])
Reducers
Key        
List of values
 
The Reduce Function: For each key (i, k), produce the sum of the list
of values associated with this key. The result is a pair ((i, k), 
v
), where
v
 is the value of the element in row i and column k of the matrix P =
MN
((1,1), [
6
, 
2
, 
3
])
((1,2), [
3
, 
0
, 
2
])
((2,1), [
2
, 
1
, 
9
])
((2,2), [
1
, 
0
, 
6
])
((3,1), [
0
, 
2
, 
3
])
((3,2), [
0
, 
0
, 
2
])
Reducers
((1,1), 
11
)
((1,2), 
5
)
((2,1), 
12
)
((2,2), 
7
)
((3,1), 
5
)
((3,2), 
2
)
Result
11
 
5
12
 
7
5
 
2
Slide Note
Embed
Share

Modern data-mining applications require managing immense amounts of data quickly, leveraging parallelism in computing clusters. MapReduce, a programming technique, enables efficient large-scale data calculations on computing clusters, reducing costs compared to special-purpose machines. MapReduce is implemented in various systems, simplifying parallel execution and coordination tasks. The process involves Map tasks generating key-value pairs from input data, followed by Reduce tasks executing operations based on key-value pairs.

  • Data Mining
  • MapReduce
  • Parallel Computing
  • Programming Technique
  • Large-Scale Data

Uploaded on Oct 04, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. MapReduce Francisco Moreno Excerpts from Mining of Massive Datasets Rajaraman, Leskovec, and Ullman

  2. Introduction Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. There is ample opportunity to exploit parallelism. Parallelism is exploited not from a supercomputer, but from computing clusters .

  3. Introduction In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware. However, the prevalence of large-scaleWeb services has caused more and more computing to be done on installations with thousands of compute nodes operating more or less independently. In these installations, the compute nodes are commodity* hardware, which greatly reduces the cost compared with special-purpose parallel machines. * Producto b sico, materia prima

  4. Introduction Central to this new software stack is a programming technique called MapReduce. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently

  5. MapReduce MapReduce is a style of computing that has been implemented in several systems, including Google s internal implementation and the popular open-source implementation Hadoop It has also been implemented by: - Couchbase - CouchDB (Apache CouchDB) - Infinispan - MongoDB - RavenDB - Riak e.g., in Oracle It can also be mimicked in other systems,

  6. MapReduce All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordination of tasks that execute Map or Reduce, and also deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:

  7. MapReduce Step 1 Some number of Map tasks (mappers) each are given one or more chunks (a chunk is a collection of elements, e.g., documents) from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function. Note que ac las claves (keys) no tienen el sentido de claves nicas (claves primarias, claves alternativas). De hecho, una funci n map podr a generar varias parejas clave-valor (key-value pairs) id nticas

  8. MapReduce Step 2 The key-value pairs from each Map task are collected by a master controller (combiner, shuffler) and grouped by key*. The keys are divided among all the Reduce tasks (reducers), so all key- value pairs with the same key wind up at the same Reduce task**. * Es un proceso similar a un GROUP BY de SQL, solo que ac no se aplican funciones de agregaci n sobre los valores. Con los valores de cada clave (key) se forma una colecci n. ** Esto es as ya que todos los valores de una misma clave est n reunidos en una colecci n: (clave, colecci n de valores).

  9. MapReduce Step 3 The Reduce tasks work on one key at a time, and combine (usually, aggregate, consolidate ) all the values* associated with that key in some way. The manner of combination (aggregation, consolidation) of values is determined by the code written by the user for the Reduce function. * O sea, la colecci n de valores de cada clave

  10. Schematic of a MapReduce computation (k1, [values]) (k2, [values]) -- -- Key-value pairs (k, v) Map 1 Reduce 1 C o n t r o l l e r (k3, [values]) (k4, [values]) Reduce 2 Key-value pairs (k, v) -- -- (k5, [values]) (k6, [values]) Map 2 Combined result Reduce 3 (k7, [values]) (k8, [values]) (k9, [values]) -- -- Key-value pairs (k, v) Map n Reduce m Keys with all their values (collection of values) Input chunks Group by keys

  11. Reduce: count the number of values A first schematic example cc name (0, Juan) (1, Pedro) 10 Juan 21 Pedro Map 1 (0, [Juan, Lisa, Ana]) (0, 3) Reduce 1 Controller (0, [Juan, Lisa, Ana]) (1, [Pedro, Pierre]) (0, 3) (1, 2) (1, [Pedro, Pierre]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre (0, Lisa) (0, Ana) (1, Pierre) Group by keys Map 2 Combined result Input documents Map: if cc is odd then 1 else 0

  12. Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) 10 Juan 21 Pedro Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (0, 3) (1, 2) Input documents (1, [1, 1]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre Combined result (0, 1) (0, 1) (1, 1) Group by keys Map 2 Here, instead of emitting the names, just emits 1s Map: if cc is odd then 1 else 0

  13. Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) (-1, 1) 10 Juan 21 Pedro NA Eric Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (-1, [1, 1]) (0, 3) (1, 2) (-1, 2) Input documents (1, [1, 1]) (-1, [1, 1]) Reduce 2 (1, 2) (-1, 2)Combined cc name Keys with all their values (0, 1) (0, 1) (1, 1) (-1, 1) 30 Lisa 40 Ana 35 Pierre NA Bill Group by keys Map 2 result Here, reduce 2 processes two (key, values) Map: if cc is odd then 1 else 0. If cc = NA then -1

  14. The The Map Map Tasks Tasks We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes.

  15. The The Map Map Tasks Tasks The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not keys in the usual sense; they do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element.

  16. Example We shall illustrate a MapReduce computation with what has become the standard example application: counting the number of occurrences for each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element The Map function for this example uses keys that are of type String (the words) and values that are integers.

  17. Example The Map task reads a document and breaks it into its sequence of words w1,w2, . . . ,wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs: (w1, 1), (w2, 1), . . . , (wn, 1)

  18. Example Note that a single Map task will typically process many documents all the documents in one or more chunks. Note also that if a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. An option, which we show later, is to combine these m pairs into a single pair (w, m)

  19. Grouping Grouping by by Key Key As soon as the Map tasks have all completed successfully, the key- value pairs are grouped by key, and the values associated with each key are formed into a list (collection) of values. The grouping is performed by the controller, regardless of what the Map and Reduce tasks do.

  20. Grouping Grouping by by Key Key The master controller process knows how many Reduce tasks there will be, say r such tasks or the user could tells the MapReduce system what r should be*. Then the master controller picks a hash function** that applies to keys and produces a bucket number from 0 to r - 1. This number represents the corresponding Reduce process where the key-value pair is destined. * A similar situation for the number of map tasks. **Optionally, users can specify their own hash function or other method for assigning keys to Reduce tasks. However, whatever algorithm is used, each key is assigned to one and only one Reduce task.

  21. Grouping Grouping by by Key Key The controller performs the grouping by key and distributes to the Reduce tasks: The input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vp]) Where (k, v1), (k, v2), . . . , (k, vp) are all the key-value pairs with key k coming from all the Map tasks.

  22. The The Reduce Reduce Tasks Tasks The Reduce function s argument (input parameter) is a pair consisting of a key and its list of associated values. The output of the Reduce function is a sequence of zero or more key- value pairs. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are the same type.

  23. The The Reduce Reduce Tasks Tasks We shall refer to the application of the Reduce function to a single key and its associated list of values as a reducer (reducci n). A Reduce task receives one or more keys and their associated value lists. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged into a single output (output file) Combined result

  24. Example Let us continue with our word-count example. The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is indeed the total number of occurrences of w among all those documents.

  25. Example schematically: (casa, 1) (gato, 1) (ni o, 1) (casa, 1) (gato, 1) casa gato ni o casa gato Map 1 (casa, [1, 1, 1]) (gato, [1, 1]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [1, 1, 1]) (gato, [1, 1]) (perro, [1, 1]) (ni o, [1, 1, 1]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 1) (perro, 1) (ni o, 1) (perro, 1) (perro, [1, 1]) (ni o, [1, 1, 1]) casa ni o.. perro ni o perro Map 2 Combined result Por supuesto, en este ejemplo no se consideran las posibles palabras representadas por los puntos suspensivos

  26. An alternative for the Example In our example, we can push some of what the reducers do to the Map tasks: Instead of the previous Map tasks producing many pairs (w, 1), (w, 1), . . ., we could apply the process that the Reduce function performs within the Map task, before the output of the Map tasks is subject to grouping and subsequent aggregation by the Reduce function. These key-value pairs would thus be replaced by one pair with key w and value equal to the sum of all the 1 s in all those pairs.

  27. An alternative for the Example That is, the pairs with key w generated by a single Map task would be replaced by a pair (w, m), where m is the number of times that w appears among the documents handled by this Map task. Note that it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks, since there will typically be one key-value pair with key w coming from each of the Map tasks.

  28. An alternative for the Example schematically: (casa, 2) (gato, 2) (ni o, 1) casa gato ni o casa gato Map 1 (casa, [2, 1]) (gato, [2]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [2, 1]) (gato, [2]) (perro, [2]) (ni o, [1, 2]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 2) (perro, 2) (perro, [2]) (ni o, [1, 2]) casa ni o.. perro ni o perro Map 2 Combined result

  29. Computing Natural Join by MapReduce The idea behind implementing natural join via MapReduce can be seen if we look at the specific case of joining R(A, B) with S(B, C). We must find tuples that agree on their B components, that is the second component from tuples of R and the first component of tuples of S. Join attribute

  30. Computing Natural Join by MapReduce We shall use the B-value of tuples from either relation as the key. The value will be the other component (attributes) and the name of the relation, so the Reduce function can know where each tuple came from: (B, (the name of the relation, the other attributes of the relation)) Value Key

  31. Computing Natural Join by MapReduce The Map Function: For each tuple (a, b) of R(A, B), produce the key-value pair (b, (R, a)). For each tuple (b, c) of S(B, C), produce the key-value pair (b, (S, c)).

  32. Example EMP ced 10 20 30 DEPT dep name 1 2 3 dep 1 1 2 Ventas Cocina Mercadeo Foreign key

  33. EMP Natural Join DEPT ced dep name 10 1 20 1 30 2 Ventas Ventas Cocina

  34. Computing Natural Join by MapReduce Tuples produced by the map functions: From EMP: From DEPT: (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo))

  35. Computing Natural Join by MapReduce Controller: Grouping EMP C o n t r o l l e r (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) Map tasks (2, [(EMP, 30), (DEPT, Cocina)]) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo)) (3, [(DEPT, Mercadeo)]) DEPT

  36. Computing Natural Join by MapReduce The Reduce Function: Now that the controller has associated each key value b with a list of pairs that are either of the form (R, a) or (S, c), construct all pairs consisting of one with first component R and the other with first component S*. The output from this key and value list is a sequence of key-value pairs. Each value is one of the triples (a, b, c) such that (R, a) and (S, c) are on the input list of values of b. Here, the key (in the reduce output) is irrelevant. The result (join) is made up of all the triples (values). * That is, the cartesian product

  37. Computing Natural Join by MapReduce Reducers (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) (1, [(10, 1, Ventas), (20, 1, Ventas])]) (2, [(EMP, 30), (DEPT, Cocina)]) (2, [(30, 2, Cocina)]) (3, [(DEPT, Mercadeo)]) (3, []) Result

  38. Computing Natural Join by MapReduce Result: (10, 1, Ventas) (20, 1, Ventas) (30, 2, Cocina)

  39. Matrix-Vector Multiplication by MapReduce Suppose we have an n n matrix M, whose element in row i and column j will be denoted mij. Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vector product is the vector x of length n, whose ith element xi is given by

  40. Example 37 51 39 8 9 3 4 5 2 1 2 3 2 3 9 x = i = 1 : 8x2 + 4x3 + 1x9 = 37 i = 2 : 9x2 + 5x3 + 2x9 = 51 i = 3 : 3x2 + 2x3 + 3x9 = 39

  41. Matrix-Vector Multiplication by MapReduce The matrix M and the vector v each will be stored in a file of the DFS (data file system). The Map function is written to apply to one element of M. Each Map task will operate on a chunk of the matrix M. From each matrix element mij it produces the key-value pair (i, mijvj). Thus, all terms of the sum that make up the component xi of the matrix-vector product will get the same key, i. The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, xi).

  42. Example schematically (1/2): (1, 8*2) Map 1 Map 2 Map 3 (1, 4*3) (1, 1*9) Controller (2, 9*2) 8 9 3 4 5 2 1 2 3 2 3 9 Map 4 Map 5 Map 6 (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) x (2, 5*3) Reducers (2, 2*9) (3, 3*2) Map 7 Map 8 Map 9 (3, 2*3) (3, 3*9) Row

  43. Example schematically (2/2): Reduce 1 (1, 37) Controller (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) (1, 37) (2, 51) (3, 39) (2, 51) Reduce 2 Reduce 3 Combined result (3, 39)

  44. Matrix Multiplication by MapReduce If M is a matrix with element mij in row i and column j, and N is a matrix with element njk in row j and column k, then the product P = MN is the matrix P with element pik in row i and column k, where h It is required that the number of columns of M (h) equals the number of rows of N (h), so the sum over j makes sense.

  45. Example Tomado de: http://www.proferiera.comocreartuweb.es/material5/unidad2/objetos/multiplicacion-de-matrices2.gif

  46. Matrix Multiplication by MapReduce The Map Function: For each element mij of M, produce all the key- value pairs ((i, k), (M, j,mij)) for k = 1, 2, . . ., up to the number of columns of N. Similarly, for each element njk of N, produce all the key-value pairs ((i, k), (N, j, njk) for i = 1, 2, . . ., up to the number of rows of M. Note that M and N are really indicators to tell which of the two relations a value comes from.

  47. Example step by step 3 1 0 2 1 2 1 3 1 Consider element mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3))

  48. 3 1 0 2 1 2 1 3 1 So for each element mij of M we have: Row i = 1 mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3)) mij = m1,2 = 2 then For k = 1 ((1, 1), (M, 2, 2)) For k = 2 ((1, 2), (M, 2, 2)) mij = m1,3 = 1 then For k = 1 ((1, 1), (M, 3, 1)) For k = 2 ((1, 2), (M, 3, 1))

  49. 3 1 0 2 1 2 1 3 1 Row i = 2 mij = m2,1 = 1 then For k = 1 ((2, 1), (M, 1, 1)) For k = 2 ((2, 2), (M, 1, 1)) mij = m2,2 = 1 then For k = 1 ((2, 1), (M, 2, 1)) For k = 2 ((2, 2), (M, 2, 1)) mij = m2,3 = 3 then For k = 1 ((2, 1), (M, 3, 3)) For k = 2 ((2, 2), (M, 3, 3))

  50. 3 1 0 2 1 2 1 3 1 Row i = 3 mij = m3,1 = 0 then For k = 1 ((3, 1), (M, 1, 0)) For k = 2 ((3, 2), (M, 1, 0)) mij = m3,2 = 2 then For k = 1 ((3, 1), (M, 2, 2)) For k = 2 ((3, 2), (M, 2, 2)) mij = m3,3 = 1 then For k = 1 ((3, 1), (M, 3, 1)) For k = 2 ((3, 2), (M, 3, 1))

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#