Introduction to Map Reduce Paradigm in Data Science

Intro to Data Science
Recitation #4
Tel Aviv University 2017/2018
Slava Novgorodov
Today’s lesson
Introduction to Data Science:
Map Reduce Paradigm
Recall
Examples
Connection to previous parts of the course
MapReduce vs. Spark
Discussion
Examples
MapReduce Paradigm
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
Map() – performs filtering and sorting
Reduce() – performs a summary operation
When to use MapReduce?
Problems that are huge, but not hard
Problems that are easy to parallelize (easily partitionable and combinable)
You should only implement Map and Reduce!
Example
The “Hello, World!” of MapReduce – WordCount
Given a file with many rows, find how many times each word appears in the whole file.
Input:
this is first line
and this is second line
and another line

Output:
this, 2
is, 2
first, 1
line, 3
and, 2
second, 1
another, 1
Example – solution
The “Hello, World!” of MapReduce – WordCount (solution diagram not reproduced)
WordCount Flow in M/R (flow diagrams not reproduced)
Another WordCount (figure not reproduced)
WordCount – implementation
Map:
def mapfn(k, v):
    for w in v.split():
        yield w, 1

Reduce:
def reducefn(k, vs):
    result = sum(vs)
    return result
This particular implementation is in Python (as is the rest of the recitation); Java, Scala and other languages are also supported.
It’s not important to remember the syntax; remember the pseudo-code.
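To make the data flow concrete, here is a tiny local driver (my own sketch, not part of the recitation code) that imitates what the framework does with these two functions: run the map over every input record, group (“shuffle”) the emitted pairs by key, and call the reduce once per key.

from collections import defaultdict

def mapfn(k, v):
    for w in v.split():
        yield w, 1

def reducefn(k, vs):
    return sum(vs)

def run_mapreduce(data, mapfn, reducefn):
    groups = defaultdict(list)
    for k, v in data.items():                      # map phase
        for out_key, out_value in mapfn(k, v):
            groups[out_key].append(out_value)      # shuffle: group values by key
    return {k: reducefn(k, vs) for k, vs in groups.items()}   # reduce phase

lines = ["this is first line", "and this is second line", "and another line"]
print(run_mapreduce(dict(enumerate(lines)), mapfn, reducefn))
# {'this': 2, 'is': 2, 'first': 1, 'line': 3, 'and': 2, 'second': 1, 'another': 1}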
Running our first MapReduce
The default option – Hadoop (installed on the TAU servers)
Installing a Hadoop distribution from a Docker image
Lightweight mode – a Python “simulator”, e.g. https://github.com/ziyuang/mincemeatpy
Running on Hadoop
hadoop \
  jar <path_to_hadoop.jar> \
  -mapper "python mapper.py" \
  -reducer "python reducer.py" \
  -input "wordcount/mobydick.txt" \
  -output "wordcount/output"

For simplicity, we will use the Python “simulator” of MapReduce.
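Note that with Hadoop Streaming the -mapper and -reducer programs are not the mapfn/reducefn above: they are standalone scripts that read lines from stdin and write tab-separated key/value pairs to stdout, and the reducer sees its input already sorted by key. A minimal sketch of what mapper.py and reducer.py might contain (my assumption, not the files used in the recitation):

# mapper.py – emit "word<TAB>1" for every word read from stdin
import sys
for line in sys.stdin:
    for w in line.split():
        print("%s\t%d" % (w, 1))

# reducer.py – input arrives sorted by key, so equal words are adjacent
import sys
current, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current:
        if current is not None:
            print("%s\t%d" % (current, count))
        current, count = word, 0
    count += int(value)
if current is not None:
    print("%s\t%d" % (current, count))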
Social Networks example
Task: Find all mutual friends of all pairs of users

Input:
A -> B C D
B -> A C D E
C -> A B D E
D -> A B C E
E -> B C D

Output:
('A', 'D') -> {'B', 'C'}
('A', 'C') -> {'D', 'B'}
('A', 'B') -> {'D', 'C'}
('B', 'C') -> {'D', 'A', 'E'}
('B', 'E') -> {'D', 'C'}
('B', 'D') -> {'A', 'C', 'E'}
('C', 'D') -> {'A', 'B', 'E'}
('C', 'E') -> {'D', 'B'}
('D', 'E') -> {'B', 'C'}
Social Networks example - solution
Solution: friends.py
Map:
def mapfn(k, v):
    d = v.split("->")
    friends = set(d[1].strip().split(" "))
    for w in friends:
        # order the pair so that (A, B) and (B, A) end up with the same key
        first = d[0].strip()
        second = w
        if first > second:
            temp = first
            first = second
            second = temp
        yield (first, second), friends

Reduce:
def reducefn(k, vs):
    # the mutual friends are the intersection of the friend sets emitted for this pair
    ret = vs[0]
    for s in vs:
        ret = ret.intersection(s)
    return ret
Social Networks example - data
Data: https://snap.stanford.edu/data/egonets-Facebook.html
The format is not the same as in the previous example.
How will we convert it to the same format? One possible sketch is below.
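A possible conversion, under my assumptions: the SNAP file facebook_combined.txt is a plain, undirected edge list with one “u v” pair per line, and the output file name is made up.

from collections import defaultdict

adjacency = defaultdict(set)
with open("facebook_combined.txt") as f:          # assumed SNAP edge-list file
    for line in f:
        u, v = line.split()
        adjacency[u].add(v)                       # the friendship graph is undirected,
        adjacency[v].add(u)                       # so record the edge in both directions

with open("friends_input.txt", "w") as out:       # hypothetical output file
    for user, friends in sorted(adjacency.items()):
        out.write("%s -> %s\n" % (user, " ".join(sorted(friends))))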
Crawling and incoming links
Task: Find all incoming links for web pages

Input:
A.com -> B.com C.com
B.com -> D.com E.com
C.com -> A.com E.com
D.com -> A.com E.com
E.com -> D.com

Output:
A.com -> ['C.com', 'D.com']
B.com -> ['A.com']
C.com -> ['A.com']
E.com -> ['B.com', 'C.com', 'D.com']
D.com -> ['B.com', 'E.com']
Incoming links example - solution
Solution: incoming_links.py
Map:
def mapfn(k, v):
    d = v.split("->")
    pages = set(d[1].strip().split(" "))
    for w in pages:
        yield w, d[0].strip()

Reduce:
def reducefn(k, vs):
    return vs
Important tokens in search queries
Task: Find all “important” tokens in search queries

Input:
cheap smartphone
new movies
smartphone
interesting movies

Output:
cheap REGULAR
smartphone IMPORTANT
new REGULAR
movies IMPORTANT
Important tokens example - solution
Solution: important_tokens.py
Map:
def mapfn(k, v):
    tokens = v.strip().split(" ")
    type = "PART"
    if len(tokens) == 1:       # the whole query is a single token
        type = "FULL"
    for w in set(tokens):
        yield w, type

Reduce:
def reducefn(k, vs):
    if "PART" in vs and "FULL" in vs:
        return k, "IMPORTANT"
    return k, "REGULAR"
Most popular term in search queries
With MapReduce – easy…
But what if we have only one machine with very limited memory, where we can store only 5 words and 5 counters, and we want to find the 5 most popular search terms?
Ideas?
Most popular term in search queries
Solution: the Misra-Gries algorithm for approximate counts on streams
Model the query terms as a stream and maintain at most K counters:
If a counter for the term already exists – increment it
If there is no counter for the term, but there is still enough memory – add a new counter
If there is no memory left – decrement every counter by 1 and remove counters that reach 0
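Below is a minimal Python sketch of this procedure (my illustration, not code from the recitation); k is the maximum number of counters kept in memory.

def misra_gries(stream, k):
    counters = {}
    for term in stream:
        if term in counters:
            counters[term] += 1              # a counter for the term exists: increment it
        elif len(counters) < k:
            counters[term] = 1               # still enough memory: add a new counter
        else:
            for t in list(counters):         # no memory: decrement every counter by 1
                counters[t] -= 1
                if counters[t] == 0:
                    del counters[t]          # remove counters that reached 0
    return counters

stream = ["cheap", "blue", "smartphone", "Barcelona", "cheap", "cheap",
          "Barcelona", "smartphone", "cheap", "Barcelona", "cheap", "Barcelona",
          "smartphone", "blue", "cheap", "Barcelona"]
print(misra_gries(stream, 2))                # {'Barcelona': 2, 'cheap': 2}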
Misra-Gries step-by-step
K = 2
Stream = “cheap, blue, smartphone, Barcelona, cheap, cheap, Barcelona, smartphone, cheap, Barcelona, cheap, Barcelona, smartphone, blue, cheap, Barcelona, …”

Processing now: cheap         Memory: {(cheap, 1)}
Processing now: blue          Memory: {(cheap, 1), (blue, 1)}
Processing now: smartphone    Memory: {(cheap, 0), (blue, 0)} (the zero counters are then removed)
Processing now: Barcelona     Memory: {(Barcelona, 1)}
Processing now: cheap         Memory: {(Barcelona, 1), (cheap, 1)}
Processing now: cheap         Memory: {(Barcelona, 1), (cheap, 2)}
Processing now: Barcelona     Memory: {(Barcelona, 2), (cheap, 2)}
Processing now: smartphone    Memory: {(Barcelona, 1), (cheap, 1)}
Processing now: cheap         Memory: {(Barcelona, 1), (cheap, 2)}
Processing now: Barcelona     Memory: {(Barcelona, 2), (cheap, 2)}
Processing now: cheap         Memory: {(Barcelona, 2), (cheap, 3)}
Processing now: Barcelona     Memory: {(Barcelona, 3), (cheap, 3)}
Processing now: smartphone    Memory: {(Barcelona, 2), (cheap, 2)}
Processing now: blue          Memory: {(Barcelona, 1), (cheap, 1)}
Processing now: cheap         Memory: {(Barcelona, 1), (cheap, 2)}
Processing now: Barcelona     Memory: {(Barcelona, 2), (cheap, 2)}
Output: {(Barcelona, 2), (cheap, 2)}
Real counts: {(cheap, 6), (Barcelona, 5), (smartphone, 3), (blue, 2)}
The Top-K is the same!
In the worst case the result may be wrong; it depends on the stream.
For theoretical results:
http://www.cs.utexas.edu/users/misra/scannedPdf.dir/FindRepeatedElements.pdf
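A useful way to read the theoretical guarantee (my paraphrase of the standard Misra-Gries bound, not a line from the slides): with k counters on a stream of m items, every reported count underestimates the true count by at most m/(k+1), and never overestimates it. In the example above m = 16 and k = 2, so each count can be off by at most 16/3 ≈ 5.3, which is consistent with cheap being reported as 2 instead of 6.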
Pseudo-synonyms
Data – queries:
buy cheap house
buy big house
buy new house
rent cheap car
rent new car
rent Volkswagen car

Desired output:
cheap – new (2)

Ideas? How many M/R stages are needed?
Implementation – HW
K-Means – Recall from Recitation 2
Used for clustering of unlabeled data
Example: Image compression
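As a quick reminder of the image-compression example, here is a short sketch under my assumptions (scikit-learn and matplotlib are available, and photo.png is some image; neither comes from the recitation): cluster the pixel colours into k groups and replace each pixel by the centroid of its cluster.

from matplotlib import image
from sklearn.cluster import KMeans

img = image.imread("photo.png")                   # hypothetical file; a (H, W, channels) array
pixels = img.reshape(-1, img.shape[-1])           # one row per pixel
kmeans = KMeans(n_clusters=16).fit(pixels)        # keep only 16 distinct colours
compressed = kmeans.cluster_centers_[kmeans.labels_].reshape(img.shape)
image.imsave("photo_compressed.png", compressed)  # write the result back (hypothetical name)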
K-Means – animation
Very good animation here: http://shabal.in/visuals/kmeans/2.html

K-Means – step #12 (figure not reproduced)
K-Means on MapReduce
Data – points (x, y) in [0,1]x[0,1]:
0.72    0.44
0.16    0.82
0.42    0.37
0.19    0.65

Desired output:
(0.72    0.44)    (0.55    0.83)
(0.16    0.82)    (0.55    0.83)
(0.42    0.37)    (0.29    0.16)

Python implementation – HW
K-Means – solution sketch
Map:
Check whether centroids already exist in a file; if not, randomly generate k points.
In the map stage, find for each point the closest centroid:
c1     (0.72    0.44)
c2     (0.16    0.82)
c3     (0.42    0.37)
c2     (0.19    0.65)

Reduce:
Calculate the new centroids and check whether any points want to change clusters…
Important note: the algorithm is iterative and should run until a stopping condition is reached (which stopping condition?)
Implementation – HW
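A rough sketch of what one iteration's map and reduce could look like (my illustration, not the official homework solution; it assumes the current centroids have already been loaded, e.g. from a shared file, before the map phase):

centroids = [(0.55, 0.83), (0.29, 0.16)]          # current centroids, assumed loaded beforehand

def closest(x, y):
    # index of the centroid nearest to (x, y), by squared Euclidean distance
    return min(range(len(centroids)),
               key=lambda i: (x - centroids[i][0]) ** 2 + (y - centroids[i][1]) ** 2)

def mapfn(k, v):
    x, y = map(float, v.split())                  # a line like "0.72    0.44" is one point
    yield closest(x, y), (x, y)                   # key = id of the nearest centroid

def reducefn(k, vs):
    # the new centroid k is the mean of all points assigned to it
    n = len(vs)
    return (sum(x for x, _ in vs) / n, sum(y for _, y in vs) / n)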
MapReduce disadvantages
Iterative tasks that need to be executed again and again (such as many ML algorithms) store intermediate results on the hard drive, i.e. we pay I/O for storing “useless” data.
MapReduce also launches a JVM for each iteration, so there is a constant start-up cost.
To solve such tasks we can use Spark, which generalizes the MapReduce idea and saves the unnecessary I/O.
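For comparison, the same WordCount written for Spark (a sketch assuming a local PySpark installation; the input path is reused from the Hadoop example): the whole pipeline is a chain of RDD transformations, and cache() keeps the result in memory for repeated use instead of writing it to disk between jobs.

from pyspark import SparkContext

sc = SparkContext("local[*]", "wordcount")
counts = (sc.textFile("wordcount/mobydick.txt")    # same input as the Hadoop run above
            .flatMap(lambda line: line.split())    # the "map" side: one record per word
            .map(lambda w: (w, 1))
            .reduceByKey(lambda a, b: a + b)       # the "reduce" side: sum the 1s per word
            .cache())                              # keep in memory for further queries
print(counts.take(10))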
Spark vs. Hadoop (comparison charts not reproduced)
When should we still use Hadoop?
How fast do you need your results? Examples:
You process updated data once a day, and the data should be ready the next day for review or analysis.
An application that sends recommended products to subscribers.
When you have a longer time frame, say a week or two, to process the data.
If you can afford longer data-processing latency, it will be much “cheaper”: instead of buying an expensive server that processes the data in 30 minutes and then sits idle for the remaining 23 hours and 30 minutes, you can buy a cheaper server and process the data in, say, 8 hours (e.g. overnight).
References
http://stevekrenzel.com/finding-friends-with-mapreduce
https://www.slideshare.net/andreaiacono/mapreduce-34478449
https://spark.apache.org/docs/latest/programming-guide.html
https://habrahabr.ru/post/103490/ (in Russian)
Questions?