Introduction to MapReduce: Efficient Data Processing Technique

MapReduce

Francisco Moreno

Excerpts from

“Mining of Massive Datasets”

Rajaraman, Leskovec, and Ullman

Introduction

•

Modern data-mining applications, often called “big-data” analysis,

require us to manage immense amounts of data quickly.

•

There is ample opportunity to exploit

parallelism

•

Parallelism

 is exploited not from a “supercomputer,”

but from

“computing clusters”.

Introduction

•

In the past, applications that called for parallel processing, such as

large scientific calculations, were done on

special-purpose parallel

computers

with many processors and specialized hardware.

•

However, the prevalence of large-scaleWeb services has caused more

and more computing to be done on installations with thousands of

compute nodes operating more or less independently.

•

In these installations, the compute nodes are commodity

 hardware,

which greatly

reduces the cost compared with special-purpose

parallel

machines

 Producto básico, materia prima

Introduction

•

Central to this new software stack is a

programming technique

 called

MapReduce

•

Implementations of MapReduce enable many of the most common

calculations on large-scale data to be performed on computing

clusters efficiently

MapReduce

•

MapReduce is a style of computing that has been implemented in

several systems, including Google’s internal implementation and the

popular open-source implementation

Hadoop

•

It has also been implemented by:

   - Couchbase

   - CouchDB (Apache CouchDB)

   - Infinispan

 MongoDB

   - RavenDB

   - Riak

It can also be mimicked in other systems,

e.g., in Oracle

MapReduce

•

All you need to write are two functions, called

Map

and

Reduce

, while

the system manages the parallel execution

, coordination of tasks

that execute Map or Reduce, and also deals with the possibility that

one of these tasks will fail to execute.

•

In brief, a MapReduce computation executes as follows:

MapReduce

Step 1

•

Some number of Map tasks (mappers) each are given one or more

chunks (

a chunk is a collection of elements, e.g., documents

) from a

distributed file system.

•

These Map tasks turn the chunk into a sequence of

key-value pairs

•

The way

key-value pairs

are produced from the input data is

determined by the code written by the user

for the Map function.

Note que acá las claves (

keys

) no tienen el sentido de claves únicas (claves

primarias, claves alternativas). De hecho, una función map podría generar varias

parejas clave-valor (

key-value pairs

) idénticas…

MapReduce

Step 2

•

The key-value pairs from each Map task are collected by a

master

controller (combiner, shuffler)

and

grouped by key

•

The keys are divided among all the Reduce tasks (reducers), so all key-

value pairs with the same key wind up at the same Reduce

task

**

Es un proceso similar a un GROUP BY de SQL, solo que acá no se aplican funciones

de agregación sobre los valores. Con los valores de cada clave (

key

) se forma una

colección.

**

Esto es así ya que todos los valores de una misma clave están reunidos en una

colección: (clave, colección de valores).

MapReduce

Step 3

•

The Reduce tasks work on one key at a time, and combine (usually,

aggregate, “consolidate”)

all the values*

associated with that key in

some way.

•

The manner of combination (aggregation, consolidation) of values is

determined by the code written by the user

 for the Reduce

function.

 O sea, la colección de valores de cada clave

Schematic of a MapReduce computation

--

--

--

--

--

--

Input

chunks

Map 1

Map 2

Map

Key-value

pairs (

Key-value

pairs (

Key-value

pairs (

Reduce 1

Reduce 2

Reduce 3

Reduce

Keys with all

their values

(collection of

values)

(k1, [values])

(k2, [values])

(k3, [values])

(k4, [values])

(k5, [values])

(k6, [values])

(k7, [values])

(k8, [values])

(k9, [values])

Combined

result

Group by

keys

…

…

…

A first schematic example

 Juan

 Pedro

Input

documents

Map 1

Map 2

Controller

, Juan)

, Pedro)

Reduce 1

Reduce 2

Map

: if

cc

is odd then

 else

, [Juan, Lisa, Ana])

, [Pedro, Pierre])

Combined

result

 Lisa

Ana

 Pierre

, Lisa)

, Ana)

, Pierre)

Group by

keys

, [Juan, Lisa, Ana])

, [Pedro, Pierre])

Reduce

count the

number of

values

, 3)

, 2)

, 3)

, 2)

Keys with all

their values

cc

name

cc

name

A first schematic example

 Juan

 Pedro

Input

documents

Map 1

Map 2

Controller

Reduce 1

Reduce 2

, [1, 1, 1])

, [1, 1])

 Lisa

Ana

 Pierre

Group by

keys

, [

])

, [

])

Reduce:

count the

number of

values

Here, instead of emitting the

names, just emits

1s

Map: if

cc

is odd then

 else

, 3)

, 2)

, 3)

, 2)

Combined

result

Keys with all

their values

cc

name

cc

name

A first schematic example

 Juan

 Pedro

NA

 Eric

Input

documents

Map 1

Map 2

Controller

-1

Reduce 1

Reduce 2

, [1, 1, 1])

, [1, 1])

-1

, [1, 1])

 Lisa

Ana

 Pierre

NA

 Bill

-1

Group by

keys

, [

])

, [

])

-1

, [1, 1])

Reduce:

count the

number of

values

Map: if

cc

is odd

then

 else

If

cc

 = NA then -1

, 3)

, 2)

-1

, 2)

, 3)

, 2)

-1

, 2)

Combined

result

Keys with all

their values

cc

name

cc

name

Here, reduce 2 processes

two (key, values)

•

We view input files for a Map task as consisting of elements, which

can be any type

: a tuple or a document, for example.

•

A chunk is a collection of elements, and no element is stored across

two chunks.

•

Technically, all inputs

to Map tasks and outputs from Reduce tasks are

of the

key-value-pair

 form.

•

Insisting on this form for inputs and outputs is motivated by the

desire to allow

composition

of

several

 MapReduce processes.

•

The Map function takes an input element as its argument and

produces zero or more key-value pairs.

•

The types of keys and values are each arbitrary.

•

Further, keys are not “keys” in the usual sense;

they do not have to be

unique

: a Map task can produce several key-value pairs with the same

key,

even

from the same element

Example

•

We shall illustrate a MapReduce computation with what has become

the standard example application: counting the number of

occurrences for each word in a collection of documents.

•

In this example, the input file is a repository of documents, and each

document is an element

•

The Map function

for this example uses keys that are of type String

(the words) and values that

are integers.

Example

•

The Map task reads a document and breaks it into its sequence

   of words

w1

w2

, . . . ,

•

It then emits a sequence of key-value pairs where the value is always

1. That is, the output of the Map task for this document is the

sequence of key-value pairs:

(w1, 1), (w2, 1), . . . , (w

, 1)

Example

•

Note that a single Map task will typically process many documents –

all the documents in one or more chunks.

•

Note also that if a word

 appears

 times among all the documents

assigned to that process, then there will be m key-value pairs

(w, 1)

among its output.

•

An option, which

we show later

, is to combine these

 pairs into a

single pair

(w,

•

As soon as the Map tasks have all completed successfully, the key-

value pairs are grouped by key, and the values associated with each

key are formed into a list (collection)

of values.

•

The grouping is performed by the controller, regardless of what the

Map

and Reduce tasks do.

•

The master controller process knows how many Reduce tasks there

will be, say

such tasks or the user could tells the MapReduce system

what

 should be

•

Then the master controller picks a hash function

**

 that applies to

keys and produces a bucket number from 0 to

- 1. This number

represents the corresponding Reduce process where the key-value

pair is destined.

A similar situation for the number of map tasks.

**

Optionally, users can specify their own hash function or other method for

assigning keys to Reduce tasks. However, whatever algorithm is used,

each key

is assigned to one and only

one Reduce task.

•

The controller performs the grouping by key and distributes to the

Reduce tasks:

•

The input to the

Reduce task that handles key

 is a pair of the form

(k, [v1, v2, . . . , v

])

•

Where

(k, v1), (k, v2), . . . , (k, v

 are all the key-value pairs with key k

coming from

all the Map tasks.

•

The Reduce function’s argument (input parameter) is a pair consisting

of a key and its list of

associated values.

•

The output of the Reduce function is a sequence of

zero or

more key-

value pairs

•

These key-value pairs can be of a type different from those sent from

Map tasks to Reduce tasks, but often they are the same type.

•

We shall refer to the application of the Reduce function to a single

key and its associated list of values as a

reducer

(reducción).

•

A Reduce task receives

one or more

keys and their associated value

lists. That is, a Reduce task executes one or more

reducers

•

The outputs from all the Reduce tasks are merged into a single output

(output file)



 Combined result

Example

•

Let us continue with our word-count example.

•

The Reduce function simply adds up all the values. The output of a

reducer consists of the word and the sum.

•

Thus, the output of all the Reduce tasks is a sequence of

(w, m)

pairs,

where

 is a word that appears at least once among all the input

documents and

 is indeed the total number of occurrences of

among

all those documents.

Example schematically:

…casa…gato

…niño…casa

…gato…

Map 1

Map 2

Controller

(casa, 1)

(gato, 1)

(niño, 1)

(casa, 1)

(gato, 1)

Reduce 1

Reduce 2

(casa, [1, 1, 1])

(gato, [1, 1])

(perro, [1, 1])

(niño, [1, 1, 1])

Combined

result

…casa…niño..

…perro…niño

…perro…

(casa, [1, 1, 1])

(gato, [1, 1])

(perro, [1, 1])

(niño, [1, 1, 1])

(casa, 3)

(gato, 2)

(perro, 2)

(niño, 3)

(casa, 1)

(niño, 1)

(perro, 1)

(niño, 1)

(perro, 1)

(casa, 3)

(gato, 2)

(perro, 2)

(niño, 3)

Por supuesto, en este ejemplo no se consideran las posibles palabras representadas por los puntos suspensivos…

An alternative for the Example

•

In our example, we can “

push

”

some of what the reducers do to the

Map tasks:

•

Instead of the previous Map tasks producing many pairs

(w, 1), (w, 1),

. . .

, we could apply the process that the Reduce function performs

within

 the Map task, before the output of the Map tasks is subject to

grouping and subsequent aggregation by the Reduce function.

•

These key-value pairs would thus be replaced by one pair with key

and value equal to the sum of all the 1’s in

all those pairs.

An alternative for the Example

•

That is, the pairs with key

 generated by a single Map task would be

replaced by a pair

(w,

, where

 is the number of times that

appears among the documents handled by this Map task.

•

Note that it is still necessary to do grouping and aggregation and to

pass the result to the Reduce tasks, since there will typically be one

key-value pair with key w coming from each of the Map tasks.

An alternative for the Example schematically:

…casa…gato

…niño…casa

…gato…

Map 1

Map 2

Controller

(casa,

(gato,

(niño,

Reduce 1

Reduce 2

(casa, [2, 1])

(gato, [2])

(perro, [2])

(niño, [1, 2])

Combined

result

…casa…niño..

…perro…niño

…perro…

(casa, [2, 1])

(gato, [2])

(perro, [2])

(niño, [1, 2])

(casa, 3)

(gato, 2)

(perro, 2)

(niño, 3)

(casa,

(niño,

(perro,

(casa, 3)

(gato, 2)

(perro, 2)

(niño, 3)

Computing Natural Join by MapReduce

•

The idea behind implementing

natural join

via MapReduce can be

seen if we look at the specific case of joining

(A,

) with

, C).

•

We must find tuples that agree on their

 components, that is the

second component from tuples of

 and the first component of tuples

of

Join attribute

Computing Natural Join by MapReduce

•

We shall use the

-value of tuples from either relation as the

key

. The

value will be

the other component (attributes) and the name of the

relation

, so the Reduce function can know where each tuple came

from:

the name of the relation, the other attributes of the relation)

Key

Value

Computing Natural Join by MapReduce

The

Map

 Function:

•

For each tuple (

) of

), produce the key-value pair (

(R, a)

).

•

For each tuple (

) of

), produce the key-value pair (

(S, c)

).

Example

EMP

ced

dep

DEPT

dep

name

Ventas

Cocina

Mercadeo

Foreign key

EMP

 Natural Join

DEPT

ced

dep

name

Ventas

Ventas

Cocina

Computing Natural Join by MapReduce

•

Tuples produced by the map functions:

From

EMP

(EMP, 10)

(EMP, 20)

(EMP, 30)

From

DEPT

(DEPT, Ventas)

(DEPT, Cocina)

(DEPT, Mercadeo)

Computing Natural Join by MapReduce

•

Controller:

(EMP, 10)

(EMP, 20)

(EMP, 30)

(DEPT, Ventas)

(DEPT, Cocina)

(DEPT, Mercadeo)

Map

tasks

[(EMP, 10), (EMP, 20), (DEPT, Ventas)]

[(EMP, 30), (DEPT, Cocina)]

[(DEPT, Mercadeo)]

DEPT

EMP

Grouping

Computing Natural Join by MapReduce

•

The

Reduce

 Function:

•

Now that the controller has associated each key value

 with a list of pairs

   that are either of the form

(R, a)

or

(S, c)

construct all pairs consisting of

   one with first component

 and the other with first component

•

The output from this key and value list is a sequence of key-

value

 pairs.

•

Each

value

is one of the triples

(a,

, c)

such that

(R, a)

and

(S, c)

 are on the

input list of values of

•

Here, the key (in the reduce output) is irrelevant

. The result (join) is made

up of all the triples (values).

 That is, the cartesian product

Computing Natural Join by MapReduce

[(EMP, 10), (EMP, 20), (DEPT, Ventas)]

[(EMP, 30), (DEPT, Cocina)]

[(DEPT, Mercadeo)]

[(10,

, Ventas), (20,

, Ventas])]

[(30,

, Cocina)]

[]

Reducers

Result

Computing Natural Join by MapReduce

•

Result:

(10,

, Ventas)

(20,

, Ventas)

(30,

, Cocina)

Matrix-Vector Multiplication by MapReduce

•

Suppose we have an

×

 matrix

, whose element in row

and

column

 will be denoted

ij

•

Suppose we also have a vector

 of length

, whose

th

 element

is

•

Then the matrix-vector product is the vector

of length

, whose

th

element

is given by

Example

 = 1 : 8x2 + 4x3 + 1x9 = 37

= 2 : 9x2 + 5x3 + 2x9 = 51

= 3 : 3x2 + 2x3 + 3x9 = 39

Matrix-Vector Multiplication by MapReduce

•

The matrix

 and the vector

 each will be stored in a file of the DFS

(data file system).

•

The Map function is written to apply to one element of

•

Each Map task will operate on a

chunk

 of the matrix

. From each

matrix element

ij

 it produces the key-value pair

(i,

ij

•

Thus, all terms of the sum that make up the component

 of the

matrix-vector product will get the

same key,

•

The Reduce Function: The Reduce function simply sums all the values

associated with a given key

. The result will be a pair (

).

Example schematically (1/2):

Map 1

Map 2

Map 3

Map 4

Map 5

Map 6

Map 7

Map 8

Map 9

(1, 8*

(1, 4*

(1, 1*

(2, 9*

(2, 5*

(2, 2*

(3, 3*

(3, 2*

(3, 3*

Controller

(1, [16, 12, 9])

(2, [18, 15, 18])

(3, [6, 6, 27])

Reducers

Row

Example schematically (2/2):

Controller

(1, [16, 12, 9])

(2, [18, 15, 18])

(3, [6, 6, 27])

Reduce 1

Reduce 2

Reduce 3

Combined

result

(1, 37)

(2, 51)

(3, 39)

(1, 37)

(2, 51)

(3, 39)

Matrix Multiplication by MapReduce

•

If

 is a matrix with element

ij

in row

 and column

, and

 is a

matrix with element

jk

 in row

 and column

, then the product

P =

MN

 is the matrix

 with element

ik

 in row

and column

, where

•

It is required that the number of columns of

) equals the number

of rows of

), so the sum over

 makes sense.

Example

Tomado de: http://www.proferiera.comocreartuweb.es/material5/unidad2/objetos/multiplicacion-de-matrices2.gif

Matrix Multiplication by MapReduce

•

The

Map

 Function:

For

each

 element

ij

of

, produce all the key-

value pairs

((i, k), (

, j,m

ij

))

for

 = 1, 2, . . .,

up to the number of

columns

of

•

Similarly,

for

each

 element

jk

of

, produce all the key-value pairs

   ((i, k), (

, j, n

jk

for

= 1, 2, . . .,

up to the number of rows

of

•

Note that

and

 are really indicators to tell which of the two

relations a value comes

from.

Example step by step

•

Consider element

then

•

For k = 1

((

, 1), (

))

•

For k = 2

((

, 2), (

))

•

So for

each

 element

of

we have:

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

Row

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

Row

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

then

For k = 1

((

, 1), (

))

For k = 2

((

, 2), (

))

Row

•

Consider element

= n

 then

•

For i = 1

((1,

), (

))

•

For i = 2

((2,

), (

))

•

For i = 3

((3,

), (

))

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

Row

•

So for

each

 element

of

we have:

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

Row

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

= n

 then

For i = 1

((1,

), (

))

For i = 2

((2,

), (

))

For i = 3

((3,

), (

))

Row

((1, 1), (M, 1, 3))

((1, 2), (M, 1, 3))

((1, 1), (M, 2, 2))

((1, 2), (M, 2, 2))

((1, 1), (M, 3, 1))

((1, 2), (M, 3, 1))

((2, 1), (M, 1, 1))

((2, 2), (M, 1, 1))

((2, 1), (M, 2, 1))

((2, 2), (M, 2, 1))

((2, 1), (M, 3, 3))

((2, 2), (M, 3, 3))

((3, 1), (M, 1, 0))

((3, 2), (M, 1, 0))

((3, 1), (M, 2, 2))

((3, 2), (M, 2, 2))

((3, 1), (M, 3, 1))

((3, 2), (M, 3, 1))

((1, 1), (N, 1, 2))

((2, 1), (N, 1, 2))

((3, 1), (N, 1, 2))

((1, 2), (N, 1, 1))

((2, 2), (N, 1, 1))

((3, 2), (N, 1, 1))

((1, 1), (N, 2, 1))

((2, 1), (N, 2, 1))

((3, 1), (N, 2, 1))

((1, 2), (N, 2, 0))

((2, 2), (N, 2, 0))

((3, 2), (N, 2, 0))

((1, 1), (N, 3, 3))

((2, 1), (N, 3, 3))

((3, 1), (N, 3, 3))

((1, 2), (N, 3, 2))

((2, 2), (N, 3, 2))

((3, 2), (N, 3, 2))

So these are the data

produced by the

maps and that are

sent to the controller

Controller

Grouping

(see next slide)

(1,1)

(1,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(1,2)

(1,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(2,1)

(2,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(2,2)

(2,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(3,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(3,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

Matrix Multiplication by MapReduce

•

The Reduce Function: Each key

(i, k)

has an associated list with all the

values

(M,

ij

and

(N,

jk

, for all possible values of

•

The Reduce function needs to connect the two values on the list that

have the same value

of

, for each

•

The

th

 values on each list

ij

and

jk

 are extracted and multiplied.

•

Then, these products are summed and the result is paired with

(i, k)

in the output of the Reduce function.

Matrix Multiplication by MapReduce

•

For example, for element

(i, k)

 (1, 1)

we have:

(1,1)

(1,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

= 11

(1,1)

(1,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(1,2)

(1,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(2,1)

(2,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(2,2)

(2,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(3,1)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(3,2)

, [(M,

), (M,

), (M,

),

(N,

), (N,

), (N,

)]

(1,1)

(1,1)

= 11

(1,2)

(1,2)

= 5

(2,1)

(2,1)

= 12

(2,2)

(2,2)

= 7

(3,1)

= 5

(3,2)

= 2

Result

Matrix Multiplication using

two

MapReduce

First MapReduce:

•

The Map Function: For each matrix element

ij

, produce the key

value pair

(j, (

i, m

ij

))

•

Likewise, for each matrix element

jk

, produce the key value

pair

   (j, (

, k, n

jk

Example step by step



, (

)):

, (

))

, (

))

, (

))

, (

))

, (

))

, (

))

, (

))

, (

))

, (

))

Example step by step



, (

):

, (

))

, (

))

, (

))

, (

))

, (

))

, (

))

(1, (M, 1, 3))

(2, (M, 1, 2))

(3, (M, 1, 1))

(1, (M, 2, 1))

(2, (M, 2, 1))

(3, (M, 2, 3))

(1, (M, 3, 0))

(2, (M, 3, 2))

(3, (M, 3, 1))

(1, (N, 1, 2))

(1, (N, 2, 1))

(2, (N, 1, 1))

(2, (N, 2, 0))

(3, (N, 1, 3))

(3, (N, 2, 2))

Controller

Grouping

(see next slide)

So these are the data

produced by the

maps and that are

sent to the controller

, [

(M,

(M,

(M,

, (N,

), (N,

)]

, [

(M,

(M,

(M,

, (N,

), (N,

)]

, [

(M,

(M,

) (M,

, (N,

), (N,

)]

Reducers

Grouping (controller):

•

The Reduce Function: For each key

, examine its list of associated

values.

For each value

that comes from

, say (

ij

), and each

value that comes from

, say (

jk

), produce a key-value pair with

key equal to (

) and value equal to the product of these elements,

ij

jk

((

),

) y ((

),

((

),

) y ((

),

((

),

) y ((

),

, [

(M,

(M,

(M,

(N,

), (N,

)]

, [

(M,

(M,

(M,

(N,

), (N,

)]

, [

(M,

(M,

) (M,

(N,

), (N,

)]

((

),

) y ((

),

((

),

) y ((

),

((

),

) y ((

),

((

),

) y ((

),

((

),

) y ((

),

((

),

) y ((

),

This is the input

for the second

MapReduce

Reducers

•

The Map Function:

This function is just the identity

. That is, for every

input element with key (i, k) and value

, produce exactly this key-

value pair.

((1, 1),

) y ((1, 2),

((2, 1),

) y ((2, 2),

((3, 1),

) y ((3, 2),

((1, 1),

) y ((1, 2),

((2, 1),

) y ((2, 2),

((3, 1),

) y ((3, 2),

((1, 1),

) y ((1, 2),

((2, 1),

) y ((2, 2),

((3, 1),

) y ((3, 2),

Controller

((1, 1), [

])

((1, 2), [

])

((2, 1), [

])

((2, 2), [

])

((3, 1), [

])

((3, 2), [

])

Reducers

Key

List of values

•

The Reduce Function: For each key (i, k), produce the sum of the list

of values associated with this key. The result is a pair ((i, k),

), where

 is the value of the element in row i and column k of the matrix P =

MN

((1,1), [

])

((1,2), [

])

((2,1), [

])

((2,2), [

])

((3,1), [

])

((3,2), [

])

Reducers

((1,1),

((1,2),

((2,1),

((2,2),

((3,1),

((3,2),

Result

Slide Note

Embed Share

Download

Modern data-mining applications require managing immense amounts of data quickly, leveraging parallelism in computing clusters. MapReduce, a programming technique, enables efficient large-scale data calculations on computing clusters, reducing costs compared to special-purpose machines. MapReduce is implemented in various systems, simplifying parallel execution and coordination tasks. The process involves Map tasks generating key-value pairs from input data, followed by Reduce tasks executing operations based on key-value pairs.

cama Follow

Uploaded on Oct 04, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

MapReduce Francisco Moreno Excerpts from Mining of Massive Datasets Rajaraman, Leskovec, and Ullman

Introduction Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. There is ample opportunity to exploit parallelism. Parallelism is exploited not from a supercomputer, but from computing clusters .

Introduction In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware. However, the prevalence of large-scaleWeb services has caused more and more computing to be done on installations with thousands of compute nodes operating more or less independently. In these installations, the compute nodes are commodity* hardware, which greatly reduces the cost compared with special-purpose parallel machines. * Producto b sico, materia prima

Introduction Central to this new software stack is a programming technique called MapReduce. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently

MapReduce MapReduce is a style of computing that has been implemented in several systems, including Google s internal implementation and the popular open-source implementation Hadoop It has also been implemented by: - Couchbase - CouchDB (Apache CouchDB) - Infinispan - MongoDB - RavenDB - Riak e.g., in Oracle It can also be mimicked in other systems,

MapReduce All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordination of tasks that execute Map or Reduce, and also deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:

MapReduce Step 1 Some number of Map tasks (mappers) each are given one or more chunks (a chunk is a collection of elements, e.g., documents) from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function. Note que ac las claves (keys) no tienen el sentido de claves nicas (claves primarias, claves alternativas). De hecho, una funci n map podr a generar varias parejas clave-valor (key-value pairs) id nticas

MapReduce Step 2 The key-value pairs from each Map task are collected by a master controller (combiner, shuffler) and grouped by key*. The keys are divided among all the Reduce tasks (reducers), so all key- value pairs with the same key wind up at the same Reduce task**. * Es un proceso similar a un GROUP BY de SQL, solo que ac no se aplican funciones de agregaci n sobre los valores. Con los valores de cada clave (key) se forma una colecci n. ** Esto es as ya que todos los valores de una misma clave est n reunidos en una colecci n: (clave, colecci n de valores).

MapReduce Step 3 The Reduce tasks work on one key at a time, and combine (usually, aggregate, consolidate ) all the values* associated with that key in some way. The manner of combination (aggregation, consolidation) of values is determined by the code written by the user for the Reduce function. * O sea, la colecci n de valores de cada clave

Schematic of a MapReduce computation (k1, [values]) (k2, [values]) -- -- Key-value pairs (k, v) Map 1 Reduce 1 C o n t r o l l e r (k3, [values]) (k4, [values]) Reduce 2 Key-value pairs (k, v) -- -- (k5, [values]) (k6, [values]) Map 2 Combined result Reduce 3 (k7, [values]) (k8, [values]) (k9, [values]) -- -- Key-value pairs (k, v) Map n Reduce m Keys with all their values (collection of values) Input chunks Group by keys

Reduce: count the number of values A first schematic example cc name (0, Juan) (1, Pedro) 10 Juan 21 Pedro Map 1 (0, [Juan, Lisa, Ana]) (0, 3) Reduce 1 Controller (0, [Juan, Lisa, Ana]) (1, [Pedro, Pierre]) (0, 3) (1, 2) (1, [Pedro, Pierre]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre (0, Lisa) (0, Ana) (1, Pierre) Group by keys Map 2 Combined result Input documents Map: if cc is odd then 1 else 0

Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) 10 Juan 21 Pedro Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (0, 3) (1, 2) Input documents (1, [1, 1]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre Combined result (0, 1) (0, 1) (1, 1) Group by keys Map 2 Here, instead of emitting the names, just emits 1s Map: if cc is odd then 1 else 0

Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) (-1, 1) 10 Juan 21 Pedro NA Eric Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (-1, [1, 1]) (0, 3) (1, 2) (-1, 2) Input documents (1, [1, 1]) (-1, [1, 1]) Reduce 2 (1, 2) (-1, 2)Combined cc name Keys with all their values (0, 1) (0, 1) (1, 1) (-1, 1) 30 Lisa 40 Ana 35 Pierre NA Bill Group by keys Map 2 result Here, reduce 2 processes two (key, values) Map: if cc is odd then 1 else 0. If cc = NA then -1

The The Map Map Tasks Tasks We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes.

The The Map Map Tasks Tasks The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not keys in the usual sense; they do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element.

Example We shall illustrate a MapReduce computation with what has become the standard example application: counting the number of occurrences for each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element The Map function for this example uses keys that are of type String (the words) and values that are integers.

Example The Map task reads a document and breaks it into its sequence of words w1,w2, . . . ,wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs: (w1, 1), (w2, 1), . . . , (wn, 1)

Example Note that a single Map task will typically process many documents all the documents in one or more chunks. Note also that if a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. An option, which we show later, is to combine these m pairs into a single pair (w, m)

Grouping Grouping by by Key Key As soon as the Map tasks have all completed successfully, the key- value pairs are grouped by key, and the values associated with each key are formed into a list (collection) of values. The grouping is performed by the controller, regardless of what the Map and Reduce tasks do.

Grouping Grouping by by Key Key The master controller process knows how many Reduce tasks there will be, say r such tasks or the user could tells the MapReduce system what r should be*. Then the master controller picks a hash function** that applies to keys and produces a bucket number from 0 to r - 1. This number represents the corresponding Reduce process where the key-value pair is destined. * A similar situation for the number of map tasks. **Optionally, users can specify their own hash function or other method for assigning keys to Reduce tasks. However, whatever algorithm is used, each key is assigned to one and only one Reduce task.

Grouping Grouping by by Key Key The controller performs the grouping by key and distributes to the Reduce tasks: The input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vp]) Where (k, v1), (k, v2), . . . , (k, vp) are all the key-value pairs with key k coming from all the Map tasks.

The The Reduce Reduce Tasks Tasks The Reduce function s argument (input parameter) is a pair consisting of a key and its list of associated values. The output of the Reduce function is a sequence of zero or more key- value pairs. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are the same type.

The The Reduce Reduce Tasks Tasks We shall refer to the application of the Reduce function to a single key and its associated list of values as a reducer (reducci n). A Reduce task receives one or more keys and their associated value lists. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged into a single output (output file) Combined result

Example Let us continue with our word-count example. The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is indeed the total number of occurrences of w among all those documents.

Example schematically: (casa, 1) (gato, 1) (ni o, 1) (casa, 1) (gato, 1) casa gato ni o casa gato Map 1 (casa, [1, 1, 1]) (gato, [1, 1]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [1, 1, 1]) (gato, [1, 1]) (perro, [1, 1]) (ni o, [1, 1, 1]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 1) (perro, 1) (ni o, 1) (perro, 1) (perro, [1, 1]) (ni o, [1, 1, 1]) casa ni o.. perro ni o perro Map 2 Combined result Por supuesto, en este ejemplo no se consideran las posibles palabras representadas por los puntos suspensivos

An alternative for the Example In our example, we can push some of what the reducers do to the Map tasks: Instead of the previous Map tasks producing many pairs (w, 1), (w, 1), . . ., we could apply the process that the Reduce function performs within the Map task, before the output of the Map tasks is subject to grouping and subsequent aggregation by the Reduce function. These key-value pairs would thus be replaced by one pair with key w and value equal to the sum of all the 1 s in all those pairs.

An alternative for the Example That is, the pairs with key w generated by a single Map task would be replaced by a pair (w, m), where m is the number of times that w appears among the documents handled by this Map task. Note that it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks, since there will typically be one key-value pair with key w coming from each of the Map tasks.

An alternative for the Example schematically: (casa, 2) (gato, 2) (ni o, 1) casa gato ni o casa gato Map 1 (casa, [2, 1]) (gato, [2]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [2, 1]) (gato, [2]) (perro, [2]) (ni o, [1, 2]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 2) (perro, 2) (perro, [2]) (ni o, [1, 2]) casa ni o.. perro ni o perro Map 2 Combined result

Computing Natural Join by MapReduce The idea behind implementing natural join via MapReduce can be seen if we look at the specific case of joining R(A, B) with S(B, C). We must find tuples that agree on their B components, that is the second component from tuples of R and the first component of tuples of S. Join attribute

Computing Natural Join by MapReduce We shall use the B-value of tuples from either relation as the key. The value will be the other component (attributes) and the name of the relation, so the Reduce function can know where each tuple came from: (B, (the name of the relation, the other attributes of the relation)) Value Key

Computing Natural Join by MapReduce The Map Function: For each tuple (a, b) of R(A, B), produce the key-value pair (b, (R, a)). For each tuple (b, c) of S(B, C), produce the key-value pair (b, (S, c)).

Example EMP ced 10 20 30 DEPT dep name 1 2 3 dep 1 1 2 Ventas Cocina Mercadeo Foreign key

EMP Natural Join DEPT ced dep name 10 1 20 1 30 2 Ventas Ventas Cocina

Computing Natural Join by MapReduce Tuples produced by the map functions: From EMP: From DEPT: (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo))

Computing Natural Join by MapReduce Controller: Grouping EMP C o n t r o l l e r (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) Map tasks (2, [(EMP, 30), (DEPT, Cocina)]) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo)) (3, [(DEPT, Mercadeo)]) DEPT

Computing Natural Join by MapReduce The Reduce Function: Now that the controller has associated each key value b with a list of pairs that are either of the form (R, a) or (S, c), construct all pairs consisting of one with first component R and the other with first component S*. The output from this key and value list is a sequence of key-value pairs. Each value is one of the triples (a, b, c) such that (R, a) and (S, c) are on the input list of values of b. Here, the key (in the reduce output) is irrelevant. The result (join) is made up of all the triples (values). * That is, the cartesian product

Computing Natural Join by MapReduce Reducers (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) (1, [(10, 1, Ventas), (20, 1, Ventas])]) (2, [(EMP, 30), (DEPT, Cocina)]) (2, [(30, 2, Cocina)]) (3, [(DEPT, Mercadeo)]) (3, []) Result

Computing Natural Join by MapReduce Result: (10, 1, Ventas) (20, 1, Ventas) (30, 2, Cocina)

Matrix-Vector Multiplication by MapReduce Suppose we have an n n matrix M, whose element in row i and column j will be denoted mij. Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vector product is the vector x of length n, whose ith element xi is given by

Example 37 51 39 8 9 3 4 5 2 1 2 3 2 3 9 x = i = 1 : 8x2 + 4x3 + 1x9 = 37 i = 2 : 9x2 + 5x3 + 2x9 = 51 i = 3 : 3x2 + 2x3 + 3x9 = 39

Matrix-Vector Multiplication by MapReduce The matrix M and the vector v each will be stored in a file of the DFS (data file system). The Map function is written to apply to one element of M. Each Map task will operate on a chunk of the matrix M. From each matrix element mij it produces the key-value pair (i, mijvj). Thus, all terms of the sum that make up the component xi of the matrix-vector product will get the same key, i. The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, xi).

Example schematically (1/2): (1, 8*2) Map 1 Map 2 Map 3 (1, 4*3) (1, 1*9) Controller (2, 9*2) 8 9 3 4 5 2 1 2 3 2 3 9 Map 4 Map 5 Map 6 (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) x (2, 5*3) Reducers (2, 2*9) (3, 3*2) Map 7 Map 8 Map 9 (3, 2*3) (3, 3*9) Row

Example schematically (2/2): Reduce 1 (1, 37) Controller (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) (1, 37) (2, 51) (3, 39) (2, 51) Reduce 2 Reduce 3 Combined result (3, 39)

Matrix Multiplication by MapReduce If M is a matrix with element mij in row i and column j, and N is a matrix with element njk in row j and column k, then the product P = MN is the matrix P with element pik in row i and column k, where h It is required that the number of columns of M (h) equals the number of rows of N (h), so the sum over j makes sense.

Example Tomado de: http://www.proferiera.comocreartuweb.es/material5/unidad2/objetos/multiplicacion-de-matrices2.gif

Matrix Multiplication by MapReduce The Map Function: For each element mij of M, produce all the key- value pairs ((i, k), (M, j,mij)) for k = 1, 2, . . ., up to the number of columns of N. Similarly, for each element njk of N, produce all the key-value pairs ((i, k), (N, j, njk) for i = 1, 2, . . ., up to the number of rows of M. Note that M and N are really indicators to tell which of the two relations a value comes from.

Example step by step 3 1 0 2 1 2 1 3 1 Consider element mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3))

3 1 0 2 1 2 1 3 1 So for each element mij of M we have: Row i = 1 mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3)) mij = m1,2 = 2 then For k = 1 ((1, 1), (M, 2, 2)) For k = 2 ((1, 2), (M, 2, 2)) mij = m1,3 = 1 then For k = 1 ((1, 1), (M, 3, 1)) For k = 2 ((1, 2), (M, 3, 1))

3 1 0 2 1 2 1 3 1 Row i = 2 mij = m2,1 = 1 then For k = 1 ((2, 1), (M, 1, 1)) For k = 2 ((2, 2), (M, 1, 1)) mij = m2,2 = 1 then For k = 1 ((2, 1), (M, 2, 1)) For k = 2 ((2, 2), (M, 2, 1)) mij = m2,3 = 3 then For k = 1 ((2, 1), (M, 3, 3)) For k = 2 ((2, 2), (M, 3, 3))

3 1 0 2 1 2 1 3 1 Row i = 3 mij = m3,1 = 0 then For k = 1 ((3, 1), (M, 1, 0)) For k = 2 ((3, 2), (M, 1, 0)) mij = m3,2 = 2 then For k = 1 ((3, 1), (M, 2, 2)) For k = 2 ((3, 2), (M, 2, 2)) mij = m3,3 = 1 then For k = 1 ((3, 1), (M, 3, 1)) For k = 2 ((3, 2), (M, 3, 1))

Introduction to MapReduce: Efficient Data Processing Technique

Download Presentation

Presentation Transcript

Related

More Related Content