Introduction to MapReduce: Efficient Data Processing Technique
Modern data-mining applications require managing immense amounts of data quickly, leveraging parallelism in computing clusters. MapReduce, a programming technique, enables efficient large-scale data calculations on computing clusters, reducing costs compared to special-purpose machines. MapReduce is implemented in various systems, simplifying parallel execution and coordination tasks. The process involves Map tasks generating key-value pairs from input data, followed by Reduce tasks executing operations based on key-value pairs.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
MapReduce Francisco Moreno Excerpts from Mining of Massive Datasets Rajaraman, Leskovec, and Ullman
Introduction Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. There is ample opportunity to exploit parallelism. Parallelism is exploited not from a supercomputer, but from computing clusters .
Introduction In the past, applications that called for parallel processing, such as large scientific calculations, were done on special-purpose parallel computers with many processors and specialized hardware. However, the prevalence of large-scaleWeb services has caused more and more computing to be done on installations with thousands of compute nodes operating more or less independently. In these installations, the compute nodes are commodity* hardware, which greatly reduces the cost compared with special-purpose parallel machines. * Producto b sico, materia prima
Introduction Central to this new software stack is a programming technique called MapReduce. Implementations of MapReduce enable many of the most common calculations on large-scale data to be performed on computing clusters efficiently
MapReduce MapReduce is a style of computing that has been implemented in several systems, including Google s internal implementation and the popular open-source implementation Hadoop It has also been implemented by: - Couchbase - CouchDB (Apache CouchDB) - Infinispan - MongoDB - RavenDB - Riak e.g., in Oracle It can also be mimicked in other systems,
MapReduce All you need to write are two functions, called Map and Reduce, while the system manages the parallel execution, coordination of tasks that execute Map or Reduce, and also deals with the possibility that one of these tasks will fail to execute. In brief, a MapReduce computation executes as follows:
MapReduce Step 1 Some number of Map tasks (mappers) each are given one or more chunks (a chunk is a collection of elements, e.g., documents) from a distributed file system. These Map tasks turn the chunk into a sequence of key-value pairs. The way key-value pairs are produced from the input data is determined by the code written by the user for the Map function. Note que ac las claves (keys) no tienen el sentido de claves nicas (claves primarias, claves alternativas). De hecho, una funci n map podr a generar varias parejas clave-valor (key-value pairs) id nticas
MapReduce Step 2 The key-value pairs from each Map task are collected by a master controller (combiner, shuffler) and grouped by key*. The keys are divided among all the Reduce tasks (reducers), so all key- value pairs with the same key wind up at the same Reduce task**. * Es un proceso similar a un GROUP BY de SQL, solo que ac no se aplican funciones de agregaci n sobre los valores. Con los valores de cada clave (key) se forma una colecci n. ** Esto es as ya que todos los valores de una misma clave est n reunidos en una colecci n: (clave, colecci n de valores).
MapReduce Step 3 The Reduce tasks work on one key at a time, and combine (usually, aggregate, consolidate ) all the values* associated with that key in some way. The manner of combination (aggregation, consolidation) of values is determined by the code written by the user for the Reduce function. * O sea, la colecci n de valores de cada clave
Schematic of a MapReduce computation (k1, [values]) (k2, [values]) -- -- Key-value pairs (k, v) Map 1 Reduce 1 C o n t r o l l e r (k3, [values]) (k4, [values]) Reduce 2 Key-value pairs (k, v) -- -- (k5, [values]) (k6, [values]) Map 2 Combined result Reduce 3 (k7, [values]) (k8, [values]) (k9, [values]) -- -- Key-value pairs (k, v) Map n Reduce m Keys with all their values (collection of values) Input chunks Group by keys
Reduce: count the number of values A first schematic example cc name (0, Juan) (1, Pedro) 10 Juan 21 Pedro Map 1 (0, [Juan, Lisa, Ana]) (0, 3) Reduce 1 Controller (0, [Juan, Lisa, Ana]) (1, [Pedro, Pierre]) (0, 3) (1, 2) (1, [Pedro, Pierre]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre (0, Lisa) (0, Ana) (1, Pierre) Group by keys Map 2 Combined result Input documents Map: if cc is odd then 1 else 0
Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) 10 Juan 21 Pedro Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (0, 3) (1, 2) Input documents (1, [1, 1]) Reduce 2 (1, 2) cc name Keys with all their values 30 Lisa 40 Ana 35 Pierre Combined result (0, 1) (0, 1) (1, 1) Group by keys Map 2 Here, instead of emitting the names, just emits 1s Map: if cc is odd then 1 else 0
Reduce: count the number of values A first schematic example cc name (0, 1) (1, 1) (-1, 1) 10 Juan 21 Pedro NA Eric Map 1 (0, [1, 1, 1]) (0, 3) Reduce 1 Controller (0, [1, 1, 1]) (1, [1, 1]) (-1, [1, 1]) (0, 3) (1, 2) (-1, 2) Input documents (1, [1, 1]) (-1, [1, 1]) Reduce 2 (1, 2) (-1, 2)Combined cc name Keys with all their values (0, 1) (0, 1) (1, 1) (-1, 1) 30 Lisa 40 Ana 35 Pierre NA Bill Group by keys Map 2 result Here, reduce 2 processes two (key, values) Map: if cc is odd then 1 else 0. If cc = NA then -1
The The Map Map Tasks Tasks We view input files for a Map task as consisting of elements, which can be any type: a tuple or a document, for example. A chunk is a collection of elements, and no element is stored across two chunks. Technically, all inputs to Map tasks and outputs from Reduce tasks are of the key-value-pair form. Insisting on this form for inputs and outputs is motivated by the desire to allow composition of several MapReduce processes.
The The Map Map Tasks Tasks The Map function takes an input element as its argument and produces zero or more key-value pairs. The types of keys and values are each arbitrary. Further, keys are not keys in the usual sense; they do not have to be unique: a Map task can produce several key-value pairs with the same key, even from the same element.
Example We shall illustrate a MapReduce computation with what has become the standard example application: counting the number of occurrences for each word in a collection of documents. In this example, the input file is a repository of documents, and each document is an element The Map function for this example uses keys that are of type String (the words) and values that are integers.
Example The Map task reads a document and breaks it into its sequence of words w1,w2, . . . ,wn. It then emits a sequence of key-value pairs where the value is always 1. That is, the output of the Map task for this document is the sequence of key-value pairs: (w1, 1), (w2, 1), . . . , (wn, 1)
Example Note that a single Map task will typically process many documents all the documents in one or more chunks. Note also that if a word w appears m times among all the documents assigned to that process, then there will be m key-value pairs (w, 1) among its output. An option, which we show later, is to combine these m pairs into a single pair (w, m)
Grouping Grouping by by Key Key As soon as the Map tasks have all completed successfully, the key- value pairs are grouped by key, and the values associated with each key are formed into a list (collection) of values. The grouping is performed by the controller, regardless of what the Map and Reduce tasks do.
Grouping Grouping by by Key Key The master controller process knows how many Reduce tasks there will be, say r such tasks or the user could tells the MapReduce system what r should be*. Then the master controller picks a hash function** that applies to keys and produces a bucket number from 0 to r - 1. This number represents the corresponding Reduce process where the key-value pair is destined. * A similar situation for the number of map tasks. **Optionally, users can specify their own hash function or other method for assigning keys to Reduce tasks. However, whatever algorithm is used, each key is assigned to one and only one Reduce task.
Grouping Grouping by by Key Key The controller performs the grouping by key and distributes to the Reduce tasks: The input to the Reduce task that handles key k is a pair of the form (k, [v1, v2, . . . , vp]) Where (k, v1), (k, v2), . . . , (k, vp) are all the key-value pairs with key k coming from all the Map tasks.
The The Reduce Reduce Tasks Tasks The Reduce function s argument (input parameter) is a pair consisting of a key and its list of associated values. The output of the Reduce function is a sequence of zero or more key- value pairs. These key-value pairs can be of a type different from those sent from Map tasks to Reduce tasks, but often they are the same type.
The The Reduce Reduce Tasks Tasks We shall refer to the application of the Reduce function to a single key and its associated list of values as a reducer (reducci n). A Reduce task receives one or more keys and their associated value lists. That is, a Reduce task executes one or more reducers. The outputs from all the Reduce tasks are merged into a single output (output file) Combined result
Example Let us continue with our word-count example. The Reduce function simply adds up all the values. The output of a reducer consists of the word and the sum. Thus, the output of all the Reduce tasks is a sequence of (w, m) pairs, where w is a word that appears at least once among all the input documents and m is indeed the total number of occurrences of w among all those documents.
Example schematically: (casa, 1) (gato, 1) (ni o, 1) (casa, 1) (gato, 1) casa gato ni o casa gato Map 1 (casa, [1, 1, 1]) (gato, [1, 1]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [1, 1, 1]) (gato, [1, 1]) (perro, [1, 1]) (ni o, [1, 1, 1]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 1) (perro, 1) (ni o, 1) (perro, 1) (perro, [1, 1]) (ni o, [1, 1, 1]) casa ni o.. perro ni o perro Map 2 Combined result Por supuesto, en este ejemplo no se consideran las posibles palabras representadas por los puntos suspensivos
An alternative for the Example In our example, we can push some of what the reducers do to the Map tasks: Instead of the previous Map tasks producing many pairs (w, 1), (w, 1), . . ., we could apply the process that the Reduce function performs within the Map task, before the output of the Map tasks is subject to grouping and subsequent aggregation by the Reduce function. These key-value pairs would thus be replaced by one pair with key w and value equal to the sum of all the 1 s in all those pairs.
An alternative for the Example That is, the pairs with key w generated by a single Map task would be replaced by a pair (w, m), where m is the number of times that w appears among the documents handled by this Map task. Note that it is still necessary to do grouping and aggregation and to pass the result to the Reduce tasks, since there will typically be one key-value pair with key w coming from each of the Map tasks.
An alternative for the Example schematically: (casa, 2) (gato, 2) (ni o, 1) casa gato ni o casa gato Map 1 (casa, [2, 1]) (gato, [2]) (casa, 3) (gato, 2) Reduce 1 Controller (casa, [2, 1]) (gato, [2]) (perro, [2]) (ni o, [1, 2]) (casa, 3) (gato, 2) (perro, 2) (ni o, 3) Reduce 2 (perro, 2) (ni o, 3) (casa, 1) (ni o, 2) (perro, 2) (perro, [2]) (ni o, [1, 2]) casa ni o.. perro ni o perro Map 2 Combined result
Computing Natural Join by MapReduce The idea behind implementing natural join via MapReduce can be seen if we look at the specific case of joining R(A, B) with S(B, C). We must find tuples that agree on their B components, that is the second component from tuples of R and the first component of tuples of S. Join attribute
Computing Natural Join by MapReduce We shall use the B-value of tuples from either relation as the key. The value will be the other component (attributes) and the name of the relation, so the Reduce function can know where each tuple came from: (B, (the name of the relation, the other attributes of the relation)) Value Key
Computing Natural Join by MapReduce The Map Function: For each tuple (a, b) of R(A, B), produce the key-value pair (b, (R, a)). For each tuple (b, c) of S(B, C), produce the key-value pair (b, (S, c)).
Example EMP ced 10 20 30 DEPT dep name 1 2 3 dep 1 1 2 Ventas Cocina Mercadeo Foreign key
EMP Natural Join DEPT ced dep name 10 1 20 1 30 2 Ventas Ventas Cocina
Computing Natural Join by MapReduce Tuples produced by the map functions: From EMP: From DEPT: (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo))
Computing Natural Join by MapReduce Controller: Grouping EMP C o n t r o l l e r (1, (EMP, 10)) (1, (EMP, 20)) (2, (EMP, 30)) (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) Map tasks (2, [(EMP, 30), (DEPT, Cocina)]) (1, (DEPT, Ventas)) (2, (DEPT, Cocina)) (3, (DEPT, Mercadeo)) (3, [(DEPT, Mercadeo)]) DEPT
Computing Natural Join by MapReduce The Reduce Function: Now that the controller has associated each key value b with a list of pairs that are either of the form (R, a) or (S, c), construct all pairs consisting of one with first component R and the other with first component S*. The output from this key and value list is a sequence of key-value pairs. Each value is one of the triples (a, b, c) such that (R, a) and (S, c) are on the input list of values of b. Here, the key (in the reduce output) is irrelevant. The result (join) is made up of all the triples (values). * That is, the cartesian product
Computing Natural Join by MapReduce Reducers (1, [(EMP, 10), (EMP, 20), (DEPT, Ventas)]) (1, [(10, 1, Ventas), (20, 1, Ventas])]) (2, [(EMP, 30), (DEPT, Cocina)]) (2, [(30, 2, Cocina)]) (3, [(DEPT, Mercadeo)]) (3, []) Result
Computing Natural Join by MapReduce Result: (10, 1, Ventas) (20, 1, Ventas) (30, 2, Cocina)
Matrix-Vector Multiplication by MapReduce Suppose we have an n n matrix M, whose element in row i and column j will be denoted mij. Suppose we also have a vector v of length n, whose jth element is vj . Then the matrix-vector product is the vector x of length n, whose ith element xi is given by
Example 37 51 39 8 9 3 4 5 2 1 2 3 2 3 9 x = i = 1 : 8x2 + 4x3 + 1x9 = 37 i = 2 : 9x2 + 5x3 + 2x9 = 51 i = 3 : 3x2 + 2x3 + 3x9 = 39
Matrix-Vector Multiplication by MapReduce The matrix M and the vector v each will be stored in a file of the DFS (data file system). The Map function is written to apply to one element of M. Each Map task will operate on a chunk of the matrix M. From each matrix element mij it produces the key-value pair (i, mijvj). Thus, all terms of the sum that make up the component xi of the matrix-vector product will get the same key, i. The Reduce Function: The Reduce function simply sums all the values associated with a given key i. The result will be a pair (i, xi).
Example schematically (1/2): (1, 8*2) Map 1 Map 2 Map 3 (1, 4*3) (1, 1*9) Controller (2, 9*2) 8 9 3 4 5 2 1 2 3 2 3 9 Map 4 Map 5 Map 6 (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) x (2, 5*3) Reducers (2, 2*9) (3, 3*2) Map 7 Map 8 Map 9 (3, 2*3) (3, 3*9) Row
Example schematically (2/2): Reduce 1 (1, 37) Controller (1, [16, 12, 9]) (2, [18, 15, 18]) (3, [6, 6, 27]) (1, 37) (2, 51) (3, 39) (2, 51) Reduce 2 Reduce 3 Combined result (3, 39)
Matrix Multiplication by MapReduce If M is a matrix with element mij in row i and column j, and N is a matrix with element njk in row j and column k, then the product P = MN is the matrix P with element pik in row i and column k, where h It is required that the number of columns of M (h) equals the number of rows of N (h), so the sum over j makes sense.
Example Tomado de: http://www.proferiera.comocreartuweb.es/material5/unidad2/objetos/multiplicacion-de-matrices2.gif
Matrix Multiplication by MapReduce The Map Function: For each element mij of M, produce all the key- value pairs ((i, k), (M, j,mij)) for k = 1, 2, . . ., up to the number of columns of N. Similarly, for each element njk of N, produce all the key-value pairs ((i, k), (N, j, njk) for i = 1, 2, . . ., up to the number of rows of M. Note that M and N are really indicators to tell which of the two relations a value comes from.
Example step by step 3 1 0 2 1 2 1 3 1 Consider element mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3))
3 1 0 2 1 2 1 3 1 So for each element mij of M we have: Row i = 1 mij = m1,1 = 3 then For k = 1 ((1, 1), (M, 1, 3)) For k = 2 ((1, 2), (M, 1, 3)) mij = m1,2 = 2 then For k = 1 ((1, 1), (M, 2, 2)) For k = 2 ((1, 2), (M, 2, 2)) mij = m1,3 = 1 then For k = 1 ((1, 1), (M, 3, 1)) For k = 2 ((1, 2), (M, 3, 1))
3 1 0 2 1 2 1 3 1 Row i = 2 mij = m2,1 = 1 then For k = 1 ((2, 1), (M, 1, 1)) For k = 2 ((2, 2), (M, 1, 1)) mij = m2,2 = 1 then For k = 1 ((2, 1), (M, 2, 1)) For k = 2 ((2, 2), (M, 2, 1)) mij = m2,3 = 3 then For k = 1 ((2, 1), (M, 3, 3)) For k = 2 ((2, 2), (M, 3, 3))
3 1 0 2 1 2 1 3 1 Row i = 3 mij = m3,1 = 0 then For k = 1 ((3, 1), (M, 1, 0)) For k = 2 ((3, 2), (M, 1, 0)) mij = m3,2 = 2 then For k = 1 ((3, 1), (M, 2, 2)) For k = 2 ((3, 2), (M, 2, 2)) mij = m3,3 = 1 then For k = 1 ((3, 1), (M, 3, 1)) For k = 2 ((3, 2), (M, 3, 1))