Processing Big Data with Apache Pig in Hadoop Ecosystem

 
CC5212-1 Procesamiento Masivo de Datos, Autumn 2019
Lecture 4
Apache Pig
Aidan Hogan
aidhog@gmail.com
 
HADOOP: WRAPPING UP
 
Hadoop: Supermarket Example
Compute total sales per hour of the day?
More in Hadoop: Multiple Inputs
Multiple inputs, different map for each
One reducer
More in Hadoop: Chaining Jobs
Run and wait
Output of Job1 set to input of Job2
More in Hadoop: Number of Reducers
Set number of parallel reducer tasks for the job
Why would we ask for 1 reduce task?
Output requires a merge on one machine (for example, sorting, top-k)
Hadoop: Filtered Supermarket Example
Compute total sales per hour of the day, but exclude certain item IDs passed as an input file?
More in Hadoop: Distributed Cache
 
 
Some tasks need “global knowledge”
Hopefully not too much though
 
Use a distributed cache:
Makes global data available locally to all nodes
On the local hard-disk of each machine
How might we use this?
Make the filtered products global and read them (into memory?) when processing items
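The filtered supermarket example can be sketched in plain Python rather than with Hadoop's Java API. This is a minimal single-machine emulation, not the lecture's implementation: the list of excluded item IDs stands in for the file shipped through the distributed cache and is read into an in-memory set before the map phase. The function names, the record layout, the integer prices, and the third record's timestamp are assumptions made for illustration.

```python
from collections import defaultdict

def map_sale(record, excluded_items):
    """Map phase: emit (hour, price) unless the item is excluded."""
    cust, item, time, price = record
    if item in excluded_items:      # lookup against the locally cached set
        return []
    hour = time[11:13]              # "2014-03-31T08:47:57Z" -> "08"
    return [(hour, price)]

def reduce_sales(pairs):
    """Reduce phase: sum the prices for each hour."""
    totals = defaultdict(int)
    for hour, price in pairs:
        totals[hour] += price
    return dict(totals)

def total_sales_per_hour(records, excluded_items):
    excluded = set(excluded_items)  # emulates reading the cached file into memory
    pairs = [kv for r in records for kv in map_sale(r, excluded)]
    return reduce_sales(pairs)

# Illustrative records (hypothetical values, prices as plain integers)
records = [
    ("customer412", "1L_Leche", "2014-03-31T08:47:57Z", 900),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", 2000),
    ("customer413", "El_Mercurio", "2014-03-31T09:02:11Z", 500),
]
print(total_sales_per_hour(records, ["Nescafe"]))
```

Holding the excluded IDs in a set keeps each lookup cheap, which is why the slide hopes the "global knowledge" is not too large: it has to fit on every node.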
 
Apache Hadoop … Internals (if interested)
 
http://ercoppa.github.io/HadoopInternals/
 
HADOOP VS. SQL
 
Hadoop: (ಠ_ಠ)
 
SQL
So why not just use SQL?
Relational database engines are not typically built for large workloads over bulk data; they optimise for answering queries that touch a small fraction of the data.
At some stage, they will not scale further.
But this is a reason not to use a relational database.
The question was: why not just use SQL?
 
APACHE PIG: OVERVIEW
 
Apache Pig
 
Create MapReduce programs to run on Hadoop
Use a high-level scripting language called Pig Latin
Can embed User Defined Functions: call a Java function (or Python, Ruby, etc.)
Based on Pig Relations
Atwhay anguagelay isyay isthay? (Pig Latin for "What language is this?")
Pig Latin: Hello Word Count
 
input_lines = LOAD '/tmp/book.txt' AS (line:chararray);

-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;

-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';

-- create a group for each word
word_groups = GROUP filtered_words BY word;

-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;

-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;

STORE ordered_word_count INTO '/tmp/book-word-count.txt';
 
Any ideas which lines correspond to map and which to reduce?
Map: words, filtered_words
Reduce: word_groups, word_count
Map + Reduce: ordered_word_count
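One way to reason about the map/reduce question is to emulate the word count in plain Python. This is a rough sketch under stated assumptions: `str.split` is only an approximation of TOKENIZE, and the function names are hypothetical. Steps that look at one record at a time sit in the map function; steps that need all records for a key sit in the reduce function.

```python
import re
from collections import defaultdict

def map_phase(lines):
    """Tokenise and filter: both work on one record at a time."""
    for line in lines:
        for word in line.split():                 # rough stand-in for TOKENIZE
            if re.fullmatch(r"\w+", word):        # same pattern as MATCHES '\\w+'
                yield (word, 1)

def reduce_phase(pairs):
    """Group and count: needs all records for the same word in one place."""
    groups = defaultdict(int)
    for word, one in pairs:
        groups[word] += one
    return dict(groups)

def word_count(lines):
    counts = reduce_phase(map_phase(lines))
    # Ordering by count is a further sort over the reduced output
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))

print(word_count(["the pig ate", "the hog"]))
```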
 
APACHE PIG: AN EXAMPLE
 
Pig: Products by Hour
transact.txt
customer412    1L_Leche          2014-03-31T08:47:57Z    $900
customer412    Nescafe           2014-03-31T08:47:57Z    $2.000
customer412    Nescafe           2014-03-31T08:47:57Z    $2.000
customer413    400g_Zanahoria    2014-03-31T08:48:03Z    $1.240
customer413    El_Mercurio       2014-03-31T08:48:03Z    $500
customer413    Gillette_Mach3    2014-03-31T08:48:03Z    $8.250
customer413    Santo_Domingo     2014-03-31T08:48:03Z    $2.450
customer413    Nescafe           2014-03-31T08:48:03Z    $2.000
customer414    Rosas             2014-03-31T08:48:24Z    $7.000
customer414    Chocolates        2014-03-31T08:48:24Z    $9.230
customer414    300g_Frutillas    2014-03-31T08:48:24Z    $1.230
customer415    Nescafe           2014-03-31T08:48:35Z    $2.000
customer415    12 Huevos         2014-03-31T08:48:35Z    $2.200
Find the number of items sold per hour of the day
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;

userDefinedFunctions.jar
User-defined-functions written in Java (or Python, Ruby, etc. …)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);

View data as a (streaming) relation with fields (cust, item, etc.) and tuples (data rows) …
(Figure: contents of raw)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);

Filter tuples depending on their value for a given attribute (in this case, price >= 1000)
(Figures: contents of raw before the filter, then of premium after it)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
(Figures: contents of premium, then of hourly)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
grunt> unique = DISTINCT hourly;
(Figure: contents of hourly)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
grunt> unique = DISTINCT hourly;
grunt> hrItem = GROUP unique BY (item, hour);
(Figures: contents of unique, then of hrItem)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
grunt> unique = DISTINCT hourly;
grunt> hrItem = GROUP unique BY (item, hour);
grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
(Figures: contents of hrItem with a count per group, then of hrItemCnt)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
grunt> unique = DISTINCT hourly;
grunt> hrItem = GROUP unique BY (item, hour);
grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
grunt> hrItemCntSorted = ORDER hrItemCnt BY count DESC;
(Figures: contents of hrItemCnt, then of hrItemCntSorted)
Pig: Products by Hour
grunt> REGISTER userDefinedFunctions.jar;
grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
grunt> unique = DISTINCT hourly;
grunt> hrItem = GROUP unique BY (item, hour);
grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
grunt> hrItemCntSorted = ORDER hrItemCnt BY count DESC;
grunt> STORE hrItemCntSorted INTO 'output.txt';
(Figure: contents of hrItemCntSorted)
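The whole "Products by Hour" script can be traced on a few rows with a plain-Python emulation. This is only a sketch under assumptions: `min_price_1000` and `extract_hour` are hypothetical stand-ins for the UDFs in userDefinedFunctions.jar (whose code the slides do not show), and the dot in a price such as $2.000 is assumed to be a thousands separator.

```python
from collections import Counter

def min_price_1000(price):
    """Hypothetical stand-in for org.udf.MinPrice1000: keep prices of at least 1000."""
    return int(price.lstrip("$").replace(".", "")) >= 1000

def extract_hour(time):
    """Hypothetical stand-in for org.udf.ExtractHour."""
    return time[11:13]

def items_per_hour(raw):
    premium = [t for t in raw if min_price_1000(t[3])]                  # FILTER
    hourly = [(c, i, extract_hour(t), p) for (c, i, t, p) in premium]   # FOREACH ... GENERATE
    unique = set(hourly)                                                # DISTINCT
    counts = Counter((item, hour) for (_, item, hour, _) in unique)     # GROUP + COUNT
    flat = [(item, hour, n) for (item, hour), n in counts.items()]      # flatten the group key
    return sorted(flat, key=lambda r: (-r[2], r[0], r[1]))              # ORDER BY count DESC

raw = [
    ("customer412", "1L_Leche", "2014-03-31T08:47:57Z", "$900"),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", "$2.000"),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", "$2.000"),
    ("customer413", "Nescafe", "2014-03-31T08:48:03Z", "$2.000"),
    ("customer413", "Gillette_Mach3", "2014-03-31T08:48:03Z", "$8.250"),
]
print(items_per_hour(raw))
```

Note how DISTINCT collapses the two identical Nescafe rows of customer412 before grouping, so the count is per distinct (cust, item, hour, price) tuple.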
 
APACHE PIG: SCHEMA
 
Pig Relations
 
Pig Relations: Like relational tables
Except tuples can be “jagged”
Fields in the same column don’t need to be the same type
Relations are by default unordered

Pig Schema: Names for fields, etc.
… AS (cust, item, time, price);
 
Pig Fields
Pig Fields:
Reference using name (more readable!)
premium = FILTER raw BY org.udf.MinPrice1000(price);
… or position (starts at zero)
premium = FILTER raw BY org.udf.MinPrice1000($3);
 
APACHE PIG: TYPES
 
Pig Simple Types
 
Pig Types:
LOAD 'transact.txt' USING PigStorage('\t') AS (cust:chararray, item:chararray, time:datetime, price:int);

int, long, float, double, biginteger, bigdecimal, boolean, chararray (string), bytearray (blob), datetime
Pig Types: Duck Typing
What happens if you omit types?
Fields default to bytearray
Implicit conversions if needed (~duck typing)
A = LOAD 'data' AS (cust, item, hour, price);
B = FOREACH A GENERATE hour + 4 % 24;     -- hour an integer
C = FOREACH A GENERATE hour + 4f % 24;    -- hour a float
Pig Complex Types: Tuple
cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

X = FOREACH A GENERATE t1.t1a,t2.$0;

DUMP X;
(3,4)
(1,3)
(2,9)
Pig Complex Types: Bag
cat data;
(3,8,9)
(2,3,6)
(1,4,7)
(2,5,8)

A = LOAD 'data' AS (c1:int, c2:int, c3:int);
B = GROUP A BY c1;

DUMP B;
(1,{(1,4,7)})
(2,{(2,5,8),(2,3,6)})
(3,{(3,8,9)})
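What GROUP produces can be emulated in a few lines of Python: one (group, bag) pair per distinct key, where the bag holds every tuple with that key. This is a sketch only; `group_by` is a hypothetical helper, a Python list stands in for the bag, and the order of tuples inside a bag is not meaningful (Pig may list them differently, as in the dump above).

```python
from collections import defaultdict

def group_by(relation, key_index):
    """Emulate GROUP rel BY field: one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for tup in relation:
        bags[tup[key_index]].append(tup)
    return sorted(bags.items())

A = [(3, 8, 9), (2, 3, 6), (1, 4, 7), (2, 5, 8)]
B = group_by(A, 0)
for group, bag in B:
    print(group, bag)
```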
Pig Complex Types: Map
 
cat prices;
[Nescafe#"$2.000"]
[Gillette_Mach3#"$8.250"]

A = LOAD 'prices' AS (M:map []);
Pig Complex Types: Summary
 
tuple: A row in a table / a list of fields
e.g., (customer412, Nescafe, 08, $2.000)

bag: A collection of tuples (allows duplicates)
e.g., { (cust412, Nescafe, 08, $2.000), (cust413, Gillette_Mach3, 08, $8.250) }

map: A set of key–value pairs
e.g., [Nescafe#$2.000]
 
 
APACHE PIG: UNNESTING (FLATTEN)
 
Pig Latin: Hello Word Count
input_lines = LOAD '/tmp/book.txt' AS (line:chararray);
-- Extract words from each line and put them into a pig bag
-- datatype, then flatten the bag to get one word on each row
words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
-- filter out any words that are just white spaces
filtered_words = FILTER words BY word MATCHES '\\w+';
-- create a group for each word
word_groups = GROUP filtered_words BY word;
-- count the entries in each group
word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
-- order the records by count
ordered_word_count = ORDER word_count BY count DESC;
STORE ordered_word_count INTO '/tmp/book-word-count.txt';
Pig Complex Types: Flatten Tuples
cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

X = FOREACH A GENERATE flatten(t1), flatten(t2);

DUMP X;
(3,8,9,4,5,6)
(1,4,7,3,7,5)
(2,5,8,9,5,8)
Pig Complex Types: Flatten Tuples
cat data;
(3,8,9) (4,5,6)
(1,4,7) (3,7,5)
(2,5,8) (9,5,8)

A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));

DUMP A;
((3,8,9),(4,5,6))
((1,4,7),(3,7,5))
((2,5,8),(9,5,8))

Y = FOREACH A GENERATE t1, flatten(t2);

DUMP Y;
((3,8,9),4,5,6)
((1,4,7),3,7,5)
((2,5,8),9,5,8)
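Flattening tuple fields can be emulated in Python to check the two dumps: a flattened tuple has its fields spliced into the output row, while an unflattened one stays nested. This is a sketch; `flatten_fields` and its boolean flags are hypothetical names, not part of Pig.

```python
def flatten_fields(row, flatten_flags):
    """Emulate FOREACH ... GENERATE with flatten() applied to selected tuple fields."""
    out = []
    for field, do_flatten in zip(row, flatten_flags):
        if do_flatten:
            out.extend(field)      # splice the tuple's fields into the output row
        else:
            out.append(field)      # keep the nested tuple as a single field
    return tuple(out)

A = [((3, 8, 9), (4, 5, 6)), ((1, 4, 7), (3, 7, 5)), ((2, 5, 8), (9, 5, 8))]
X = [flatten_fields(row, (True, True)) for row in A]    # flatten(t1), flatten(t2)
Y = [flatten_fields(row, (False, True)) for row in A]   # t1, flatten(t2)
print(X)
print(Y)
```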
Pig Complex Types: Bag
cat data;
(3,8,9)
(2,3,6)
(1,4,7)
(2,5,8)

A = LOAD 'data' AS (c1:int, c2:int, c3:int);
B = GROUP A BY c1;

DUMP B;
(1,{(1,4,7)})
(2,{(2,5,8),(2,3,6)})
(3,{(3,8,9)})

C = FOREACH B GENERATE flatten(A);

DUMP C;
(3,8,9)
(2,3,6)
(2,5,8)
(1,4,7)
Pig Complex Types: Bag
cat data;
(3,8,9)
(2,3,6)
(1,4,7)
(2,5,8)

A = LOAD 'data' AS (c1:int, c2:int, c3:int);
B = GROUP A BY c1;

DUMP B;
(1,{(1,4,7)})
(2,{(2,5,8),(2,3,6)})
(3,{(3,8,9)})

D = FOREACH B GENERATE group, flatten(A);

DUMP D;
(3,3,8,9)
(2,2,3,6)
(2,2,5,8)
(1,1,4,7)
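Flattening a bag differs from flattening a tuple: it yields one output row per tuple in the bag, optionally repeated alongside the group key. The following Python sketch emulates both cases; `flatten_bag` is a hypothetical helper, and row order is not significant because relations are unordered.

```python
def flatten_bag(grouped, keep_group=False):
    """Emulate FOREACH B GENERATE [group,] flatten(A): one row per tuple in each bag."""
    rows = []
    for group, bag in grouped:
        for tup in bag:
            rows.append(((group,) + tup) if keep_group else tup)
    return rows

B = [(1, [(1, 4, 7)]), (2, [(2, 5, 8), (2, 3, 6)]), (3, [(3, 8, 9)])]
C = flatten_bag(B)                    # flatten(A)
D = flatten_bag(B, keep_group=True)   # group, flatten(A)
print(C)
print(D)
```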
Pig Complex Types: Bag
cat data;
(3,8,9)
(2,3,6)
(1,4,7)
(2,5,8)

A = LOAD 'data' AS (c1:int, c2:int, c3:int);
B = GROUP A BY c1;

DUMP B;
(1,{(1,4,7)})
(2,{(2,5,8),(2,3,6)})
(3,{(3,8,9)})

E = FOREACH B {
    FA = FILTER A BY c1 > 1;
    C1 = FA.c1;
    GENERATE group, COUNT(C1) AS cnt;
}
(Figures: contents of B, then of E)
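The nested FOREACH block runs its inner statements once per group, on that group's bag. A plain-Python emulation makes the per-bag scope explicit. This is a sketch under assumptions: `nested_foreach` is a hypothetical name, and emitting a row with count 0 for a group whose filtered bag is empty is assumed here rather than taken from the slides.

```python
def nested_foreach(grouped):
    """Emulate the nested block: filter each bag by c1 > 1, project c1, count."""
    result = []
    for group, bag in grouped:
        fa = [tup for tup in bag if tup[0] > 1]   # FILTER inside the bag
        c1 = [tup[0] for tup in fa]               # project the c1 column
        result.append((group, len(c1)))           # count the projected values
    return result

B = [(1, [(1, 4, 7)]), (2, [(2, 5, 8), (2, 3, 6)]), (3, [(3, 8, 9)])]
print(nested_foreach(B))
```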
 
APACHE PIG: OPERATORS
 
Pig Atomic Operators
 
Comparison: ==, !=, >, <, >=, <=, matches (regex)
Arithmetic: +, -, *, /
Reference: tuple.field, map#value
Boolean: AND, OR, NOT
Casting
 
Pig Conditionals
 
Ternary operator:
hr12 = FOREACH item GENERATE hour%12, (hour>12 ? 'pm' : 'am');

Cases:
X = FOREACH A GENERATE hour%12, (
        CASE
          WHEN hour>12 THEN 'pm'
          ELSE 'am'
        END
);
 
 
Pig Aggregate Operators
 
Grouping:
GROUP: group on a single relation
GROUP premium BY (item, hour);
COGROUP: group multiple relations
COGROUP premium BY (item, hour), cheap BY (item, hour);
(Can GROUP multiple items or COGROUP single item; COGROUP considered more readable for multiple items)

Aggregate Operations:
AVG, MIN, MAX, SUM, COUNT, SIZE, CONCAT
Pig Joins
cat data1;
(Nescafe,08,120)
(El_Mercurio,08,142)
(Nescafe,09,153)

cat data2;
(2000,Nescafe)
(8250,Gillette_Mach3)
(500,El_Mercurio)

A = LOAD 'data1' AS (prod:chararray, hour:int, count:int);
B = LOAD 'data2' AS (price:int, name:chararray);
X = JOIN A BY prod, B BY name;

DUMP X;
(El_Mercurio,08,142,500,El_Mercurio)
(Nescafe,08,120,2000,Nescafe)
(Nescafe,09,153,2000,Nescafe)
Pig Joins
Inner join: As shown (default)
Self join: Copy an alias and join with that
Outer joins: LEFT / RIGHT / FULL
Cross product: CROSS
Anyone remember what an INNER JOIN is versus an OUTER JOIN (LEFT / RIGHT / FULL) versus a CROSS PRODUCT?
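As a refresher on the join types, here is a small Python emulation over the product and price data from the join example. It is a sketch, not how Pig implements joins: the helper names are hypothetical, hours are written as plain integers, and `None` stands in for null padding.

```python
from itertools import product

A = [("Nescafe", 8, 120), ("El_Mercurio", 8, 142), ("Nescafe", 9, 153)]
B = [(2000, "Nescafe"), (8250, "Gillette_Mach3"), (500, "El_Mercurio")]

def inner_join(left, right, lkey, rkey):
    """Keep only pairs whose keys match."""
    return [l + r for l, r in product(left, right) if l[lkey] == r[rkey]]

def left_outer_join(left, right, lkey, rkey):
    """Like inner join, but unmatched left tuples are padded with None."""
    rows = []
    for l in left:
        matches = [r for r in right if l[lkey] == r[rkey]]
        if matches:
            rows.extend(l + r for r in matches)
        else:
            rows.append(l + (None,) * len(right[0]))
    return rows

def cross(left, right):
    """Every left tuple paired with every right tuple."""
    return [l + r for l, r in product(left, right)]

print(inner_join(A, B, 0, 1))        # Gillette_Mach3 has no sales, so it drops out
print(left_outer_join(B, A, 1, 0))   # Gillette_Mach3 is kept, padded with None
print(len(cross(A, B)))              # all combinations, matching or not
```

A right outer join is the mirror image of the left one, and a full outer join keeps unmatched tuples from both sides.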
Pig Aggregate/Join Implementations
Custom partitioning / number of reducers:
PARTITION BY specifies a UDF for partitioning
PARALLEL specifies number of reducers
X = GROUP A BY hour PARTITION BY org.udp.Partitioner PARALLEL 5;
X = JOIN A BY prod, B BY name PARTITION BY org.udp.Partitioner PARALLEL 5;
Pig: Disambiguate
cat data1;
(Nescafe,08,120)
(El_Mercurio,08,142)
(Nescafe,09,153)

cat data2;
(2000,Nescafe)
(8250,Gillette_Mach3)
(500,El_Mercurio)

A = LOAD 'data1' AS (prodName:chararray, hour:int, count:int);
B = LOAD 'data2' AS (price:int, prodName:chararray);
X = JOIN A BY prodName, B BY prodName;

DUMP X;
(El_Mercurio,08,142,500,El_Mercurio)
(Nescafe,08,120,2000,Nescafe)
(Nescafe,09,153,2000,Nescafe)

Y = FOREACH X GENERATE prodName;       -- which prodName?
Y = FOREACH X GENERATE A::prodName;
Pig: Split
raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
numeric = FOREACH raw GENERATE cust, item, time, org.udf.RemoveDollarSign(price) AS price;
SPLIT numeric INTO cheap IF price<1000, premium IF price>=1000;
(Figures: contents of numeric, split into cheap and premium)
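SPLIT can be emulated in Python as a set of independent conditions, each producing its own output relation. This sketch rests on assumptions: `remove_dollar_sign` is a hypothetical stand-in for org.udf.RemoveDollarSign, and it additionally treats the dot in $2.000 as a thousands separator.

```python
def remove_dollar_sign(price):
    """Hypothetical stand-in for org.udf.RemoveDollarSign; also drops the thousands dot."""
    return int(price.lstrip("$").replace(".", ""))

def split_by_price(numeric, threshold=1000):
    """Emulate SPLIT numeric INTO cheap IF price<threshold, premium IF price>=threshold."""
    cheap = [t for t in numeric if t[3] < threshold]
    premium = [t for t in numeric if t[3] >= threshold]
    return cheap, premium

raw = [
    ("customer412", "1L_Leche", "2014-03-31T08:47:57Z", "$900"),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", "$2.000"),
    ("customer413", "El_Mercurio", "2014-03-31T08:48:03Z", "$500"),
]
numeric = [(c, i, t, remove_dollar_sign(p)) for (c, i, t, p) in raw]
cheap, premium = split_by_price(numeric)
print(cheap)
print(premium)
```

Because each condition is evaluated on its own, conditions that overlap would place a tuple in more than one output, and a tuple matching none would be dropped; the two conditions here happen to partition the data exactly.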
Pig: Rank
raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
numeric = FOREACH raw GENERATE cust, item, time, org.udf.RemoveDollarSign(price) AS price;
ranked = RANK numeric;
(Figures: contents of numeric, then of ranked)

ranked = RANK numeric BY price ASC, cust DESC;
(Figures: contents of numeric, then of ranked)
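The two forms of RANK can be sketched in Python. This is an emulation under assumptions, not Pig's implementation: without BY, each tuple simply gets a sequential number; with BY, tuples are sorted on the given fields and tuples that tie on all sort fields share a rank. How the rank continues after a tie (with or without gaps) is an assumption in this sketch and is not exercised by the sample data.

```python
def rank_rows(relation):
    """RANK without BY: prepend a sequential number to each tuple, in input order."""
    return [(n,) + tup for n, tup in enumerate(relation, start=1)]

def rank_by_price_asc_cust_desc(relation):
    """RANK ... BY price ASC, cust DESC: tuples tying on (price, cust) share a rank."""
    ordered = sorted(relation, key=lambda t: t[0], reverse=True)   # cust DESC (secondary key)
    ordered = sorted(ordered, key=lambda t: t[3])                  # price ASC (primary key, stable)
    ranked, prev_key, rank = [], None, 0
    for position, tup in enumerate(ordered, start=1):
        key = (tup[3], tup[0])
        if key != prev_key:            # new sort-key value: assumed to take the row position
            rank, prev_key = position, key
        ranked.append((rank,) + tup)
    return ranked

# Illustrative rows with prices already converted to integers
numeric = [
    ("customer412", "1L_Leche", "2014-03-31T08:47:57Z", 900),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", 2000),
    ("customer412", "Nescafe", "2014-03-31T08:47:57Z", 2000),
    ("customer413", "Nescafe", "2014-03-31T08:48:03Z", 2000),
]
print(rank_rows(numeric))
print(rank_by_price_asc_cust_desc(numeric))
```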
Pig: Other Operators
 
 
FILTER: Filter tuples by an expression
LIMIT: Only return a certain number of tuples
MAPREDUCE: Run a native Hadoop .jar
ORDER BY: Sort tuples
SAMPLE: Sample tuples
UNION: Concatenate two relations
 
 
APACHE PIG: NULLS
 
Pig: Nulls
cat data1;
(Nescafe,08,)
(El_Mercurio,08,142)
(,09,153)
 
A = LOAD 'data1' AS (prodName:chararray, hour:int, count:int);

DUMP A;
(Nescafe,08,)
(El_Mercurio,08,142)
(,09,153)
 
Nulls represent incomplete information
Pig: Nulls with JOIN
Nulls represent incomplete information
They behave as per nulls in SQL
cat data1;
(Nescafe,08,)
(El_Mercurio,08,142)
(,09,153)

cat data2;
(2000,)
(8250,Gillette_Mach3)
(500,El_Mercurio)

A = LOAD 'data1' AS (prodName:chararray, hour:int, count:int);
B = LOAD 'data2' AS (price:int, prodName:chararray);
X = JOIN A BY prodName, B BY prodName;

DUMP X;
(El_Mercurio,08,142,500,El_Mercurio)
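The single-row result follows from SQL-style null semantics: a null join key never equals anything, not even another null. A minimal Python sketch of that rule, with `None` standing in for null and a hypothetical helper name:

```python
def inner_join_sql_nulls(left, right, lkey, rkey):
    """Inner join where a null (None) key never matches, not even another null."""
    rows = []
    for l in left:
        for r in right:
            if l[lkey] is not None and r[rkey] is not None and l[lkey] == r[rkey]:
                rows.append(l + r)
    return rows

A = [("Nescafe", 8, None), ("El_Mercurio", 8, 142), (None, 9, 153)]
B = [(2000, None), (8250, "Gillette_Mach3"), (500, "El_Mercurio")]
X = inner_join_sql_nulls(A, B, 0, 1)
print(X)
```

The tuple with a null prodName in A and the one in B both drop out, even though a naive equality test on `None` would have paired them.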
 
 
APACHE PIG: EXECUTION
 
 
Pig translated to MapReduce in Hadoop
 
Pig is only an interface/scripting language for MapReduce
Three Ways to Execute Pig: (i) Grunt
grunt> in_lines = LOAD '/tmp/book.txt' AS (line:chararray);
grunt> words = FOREACH in_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
grunt> filtered_words = FILTER words BY word MATCHES '\\w+';
grunt>
grunt> STORE ordered_word_count INTO '/tmp/book-word-count.txt';

Three Ways to Execute Pig: (ii) Script
wordcount.pig
$ pig wordcount.pig
Three Ways to Execute Pig: (iii) Embedded
 
More Reading
 
 
 
https://pig.apache.org/docs/r0.14.0/basic.html
 
APACHE HIVE: A MENTION
 
Apache Hive
 
SQL-style language that compiles into MapReduce jobs in Hadoop
 
 
 
 
 
Similar to Apache Pig but …
Pig more procedural whilst Hive more declarative
Questions?
 
Presentation Transcript


  1. CC5212-1 PROCESAMIENTO MASIVO DE DATOS OTO O 2019 Lecture 4 Apache Pig Aidan Hogan aidhog@gmail.com

  2. HADOOP: WRAPPING UP

  3. Hadoop: Supermarket Example Compute total sales per hour of the day?

  4. More in Hadoop: Multiple Inputs Multiple inputs, different map for each One reducer

  5. More in Hadoop: Chaining Jobs Run and wait Output of Job1 set to Input of Job2

  6. More in Hadoop: Number of Reducers Set number of parallel reducer tasks for the job Why would we ask for 1 reduce task? Output requires a merge on one machine (for example, sorting, top-k)

  7. Hadoop: Filtered Supermarket Example Compute total sales per hour of the day but exclude certain item IDs passed as an input file?

  8. More in Hadoop: Distributed Cache Some tasks need global knowledge Hopefully not too much though Use a distributed cache: Makes global data available locally to all nodes On the local hard-disk of each machine How might we use this? Make the filtered products global and read them (into memory?) when processing items

  9. Apache Hadoop Internals (if interested) http://ercoppa.github.io/HadoopInternals/

  10. HADOOPVS. SQL

  11. Hadoop: (_)

  12. SQL So why not just use SQL? Relational database engines not typically built for large workloads over bulk data; they optimise for answering queries that touch a small fraction of the data. At some stage, they will not scale further. But this is a reason not to use a relational database. The question was: why not just use SQL?

  13. APACHE PIG: OVERVIEW

  14. Apache Pig Create MapReduce programs to run on Hadoop run on Hadoop Use a high-level scripting language called Pig Latin Pig Latin Can embed User Defined Functions Functions: call a Java function (or Python, Ruby, etc.) User Defined Based on Pig Relations Pig Relations

  15. Apache Pig Create MapReduce programs to run on Hadoop run on Hadoop Use a high-level scripting language called Pig Latin Atwhay anguagelay isyay isthay ? Pig Latin Can embed User Defined Functions Functions: call a Java function (or Python, Ruby, etc.) User Defined Based on Pig Relations Pig Relations

  16. Pig Latin: Hello Word Count input_lines = LOAD '/tmp/book.txt' AS (line:chararray); -- Extract words from each line and put them into a pig bag -- datatype, then flatten the bag to get one word on each row words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word; Map Map -- filter out any words that are just white spaces filtered_words = FILTER words BY word MATCHES '\\w+'; -- create a group for each word word_groups = GROUP filtered_words BY word; Reduce Reduce -- count the entries in each group word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word; -- order the records by count ordered_word_count = ORDER word_count BY count DESC; Map Map + + Reduce Reduce STORE ordered_word_count INTO '/tmp/book-word-count.txt'; Any ideas which lines correspond to map and which to reduce?

  17. APACHE PIG: AN EXAMPLE

  18. Pig: Products by Hour
      transact.txt:
          customer412  1L_Leche        2014-03-31T08:47:57Z  $900
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240
          customer413  El_Mercurio     2014-03-31T08:48:03Z  $500
          customer413  Gillette_Mach3  2014-03-31T08:48:03Z  $8.250
          customer413  Santo_Domingo   2014-03-31T08:48:03Z  $2.450
          customer413  Nescafe         2014-03-31T08:48:03Z  $2.000
          customer414  Rosas           2014-03-31T08:48:24Z  $7.000
          customer414  Chocolates      2014-03-31T08:48:24Z  $9.230
          customer414  300g_Frutillas  2014-03-31T08:48:24Z  $1.230
          customer415  Nescafe         2014-03-31T08:48:35Z  $2.000
          customer415  12 Huevos       2014-03-31T08:48:35Z  $2.200
      Find the number of items sold per hour of the day.

  19. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
      userDefinedFunctions.jar: user-defined functions written in Java (or Python, Ruby, etc.)

  20. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
      View data as a (streaming) relation with fields (cust, item, etc.) and tuples (data rows).
      raw:
          cust         item            time                  price
          customer412  1L_Leche        2014-03-31T08:47:57Z  $900
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240

  21. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
      Filter tuples depending on their value for a given attribute (in this case, tuples with price < 1000 are dropped; going by the UDF name, those with a price of at least 1000 are kept).
      raw:
          cust         item            time                  price
          customer412  1L_Leche        2014-03-31T08:47:57Z  $900
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240

  22. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
      Filter tuples depending on their value for a given attribute (in this case, tuples with price < 1000 are dropped; going by the UDF name, those with a price of at least 1000 are kept).
      premium:
          cust         item            time                  price
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240
          customer413  Gillette_Mach3  2014-03-31T08:48:03Z  $8.250

  23. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
      premium:
          cust         item            time                  price
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240
          customer413  Gillette_Mach3  2014-03-31T08:48:03Z  $8.250

  24. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
      hourly:
          cust         item            hour  price
          customer412  Nescafe         08    $2.000
          customer412  Nescafe         08    $2.000
          customer413  400g_Zanahoria  08    $1.240
          customer413  Gillette_Mach3  08    $8.250

  25. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
      hourly:
          cust         item            hour  price
          customer412  Nescafe         08    $2.000
          customer412  Nescafe         08    $2.000
          customer413  400g_Zanahoria  08    $1.240
          customer413  Gillette_Mach3  08    $8.250

  26. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
      unique:
          cust         item            hour  price
          customer412  Nescafe         08    $2.000
          customer413  400g_Zanahoria  08    $1.240
          customer413  Gillette_Mach3  08    $8.250
          customer413  Santo_Domingo   08    $2.450

  27. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
      hrItem:
          [item,hour]          cust         item            hour  price
          [Nescafe,08]         customer412  Nescafe         08    $2.000
                               customer413  Nescafe         08    $2.000
                               customer415  Nescafe         08    $2.000
          [400g_Zanahoria,08]  customer413  400g_Zanahoria  08    $1.240

  28. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
          grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
      hrItem (count: the number of tuples in each group):
          [item,hour]          cust         item            hour  price
          [Nescafe,08]         customer412  Nescafe         08    $2.000
                               customer413  Nescafe         08    $2.000
                               customer415  Nescafe         08    $2.000
          [400g_Zanahoria,08]  customer413  400g_Zanahoria  08    $1.240

  29. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
          grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
      hrItemCnt:
          [item,hour]          count
          [400g_Zanahoria,08]  1
          [Nescafe,08]         3

  30. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
          grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
          grunt> hrItemCntSorted = ORDER hrItemCnt BY count DESC;
      hrItemCnt:
          [item,hour]          count
          [400g_Zanahoria,08]  1
          [Nescafe,08]         3

  31. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
          grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
          grunt> hrItemCntSorted = ORDER hrItemCnt BY count DESC;
      hrItemCntSorted:
          [item,hour]          count
          [Nescafe,08]         3
          [400g_Zanahoria,08]  1

  32. Pig: Products by Hour
          grunt> REGISTER userDefinedFunctions.jar;
          grunt> raw = LOAD 'transact.txt' USING PigStorage('\t') AS (cust, item, time, price);
          grunt> premium = FILTER raw BY org.udf.MinPrice1000(price);
          grunt> hourly = FOREACH premium GENERATE cust, item, org.udf.ExtractHour(time) AS hour, price;
          grunt> unique = DISTINCT hourly;
          grunt> hrItem = GROUP unique BY (item, hour);
          grunt> hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count;
          grunt> hrItemCntSorted = ORDER hrItemCnt BY count DESC;
          grunt> STORE hrItemCntSorted INTO 'output.txt';
      hrItemCntSorted:
          [item,hour]          count
          [Nescafe,08]         3
          [400g_Zanahoria,08]  1
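
To make the data flow of the script concrete, here is a plain-Python walk through the same steps over the rows of transact.txt. It is a sketch, not Pig: `min_price_1000` and `extract_hour` are hypothetical stand-ins for the UDFs in userDefinedFunctions.jar, whose real code is not shown in the slides, and the price parsing assumes "." is a thousands separator and that the filter keeps prices of at least 1000.

```python
from collections import defaultdict

def min_price_1000(price):
    """Assumed behaviour of org.udf.MinPrice1000: keep prices >= 1000."""
    return int(price.lstrip("$").replace(".", "")) >= 1000

def extract_hour(time):
    """Assumed behaviour of org.udf.ExtractHour: '2014-03-31T08:47:57Z' -> '08'."""
    return time[11:13]

T1, T2 = "2014-03-31T08:47:57Z", "2014-03-31T08:48:03Z"
T3, T4 = "2014-03-31T08:48:24Z", "2014-03-31T08:48:35Z"

# raw = LOAD 'transact.txt' ... AS (cust, item, time, price)
raw = [
    ("customer412", "1L_Leche", T1, "$900"),
    ("customer412", "Nescafe", T1, "$2.000"),
    ("customer412", "Nescafe", T1, "$2.000"),
    ("customer413", "400g_Zanahoria", T2, "$1.240"),
    ("customer413", "El_Mercurio", T2, "$500"),
    ("customer413", "Gillette_Mach3", T2, "$8.250"),
    ("customer413", "Santo_Domingo", T2, "$2.450"),
    ("customer413", "Nescafe", T2, "$2.000"),
    ("customer414", "Rosas", T3, "$7.000"),
    ("customer414", "Chocolates", T3, "$9.230"),
    ("customer414", "300g_Frutillas", T3, "$1.230"),
    ("customer415", "Nescafe", T4, "$2.000"),
    ("customer415", "12 Huevos", T4, "$2.200"),
]

# premium = FILTER raw BY MinPrice1000(price)
premium = [t for t in raw if min_price_1000(t[3])]
# hourly = FOREACH premium GENERATE cust, item, ExtractHour(time) AS hour, price
hourly = [(cust, item, extract_hour(time), price) for cust, item, time, price in premium]
# unique = DISTINCT hourly
unique = set(hourly)
# hrItem = GROUP unique BY (item, hour)
hr_item = defaultdict(list)
for t in unique:
    hr_item[(t[1], t[2])].append(t)
# hrItemCnt = FOREACH hrItem GENERATE flatten($0), COUNT($1) AS count
hr_item_cnt = [(item, hour, len(bag)) for (item, hour), bag in hr_item.items()]
# hrItemCntSorted = ORDER hrItemCnt BY count DESC
hr_item_cnt_sorted = sorted(hr_item_cnt, key=lambda row: row[2], reverse=True)

for row in hr_item_cnt_sorted:
    print(row)
```

Under these assumptions the first row printed is the Nescafe group with count 3, matching the slide; the order of the remaining count-1 groups is arbitrary.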

  33. APACHE PIG: SCHEMA

  34. Pig Relations
      Pig Relations: like relational tables, except that tuples can be jagged and fields in the same column don't need to be the same type. Relations are by default unordered.
      Pig Schema: names for fields, etc.
          AS (cust, item, time, price);
          cust         item            time                  price
          customer412  1L_Leche        2014-03-31T08:47:57Z  $900
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240

  35. Pig Fields
      Pig Fields: reference using name (more readable!)
          premium = FILTER raw BY org.udf.MinPrice1000(price);
      or position (starts at zero)
          premium = FILTER raw BY org.udf.MinPrice1000($3);
          cust         item            time                  price
          customer412  1L_Leche        2014-03-31T08:47:57Z  $900
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer412  Nescafe         2014-03-31T08:47:57Z  $2.000
          customer413  400g_Zanahoria  2014-03-31T08:48:03Z  $1.240
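
A loose Python analogy for the two ways of referencing a field: a `namedtuple` allows access by name or by zero-based position, much as `price` and `$3` refer to the same field above. The `Row` type is hypothetical and only illustrates the idea.

```python
from collections import namedtuple

# Hypothetical row type mirroring the schema (cust, item, time, price).
Row = namedtuple("Row", ["cust", "item", "time", "price"])

row = Row("customer412", "1L_Leche", "2014-03-31T08:47:57Z", "$900")

by_name = row.price   # like referencing the field as `price`
by_position = row[3]  # like referencing the field as `$3` (positions start at zero)

print(by_name, by_position)
```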

  36. APACHE PIG: TYPES

  37. Pig Simple Types
      Pig Types:
          LOAD 'transact.txt' USING PigStorage('\t') AS (cust:chararray, item:chararray, time:datetime, price:int);
      int, long, float, double, biginteger, bigdecimal, boolean, chararray (string), bytearray (blob), datetime

  38. Pig Types: Duck Typing
      What happens if you omit types? Fields default to bytearray, with implicit conversions if needed (~duck typing).
          A = LOAD 'data' AS (cust, item, hour, price);
          B = FOREACH A GENERATE hour + 4 % 24;    -- hour an integer
          C = FOREACH A GENERATE hour + 4f % 24;   -- hour a float
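
The implicit conversion described on this slide can be mimicked in Python: an untyped field is held as raw bytes and cast to the type of the other operand when it is used in arithmetic. `implicit_add` is a hypothetical helper written for this illustration, not part of Pig, and it covers only the int and float cases shown on the slide.

```python
def implicit_add(raw_field, other):
    """Cast an untyped (bytes) field to the type of the other operand, then add."""
    text = raw_field.decode("utf-8")
    if isinstance(other, float):
        return float(text) + other  # other operand is a float, so treat the field as a float
    return int(text) + other        # other operand is an int, so treat the field as an int

hour = b"08"  # an untyped field, analogous to a bytearray in Pig

as_int = implicit_add(hour, 4)      # mirrors `hour + 4`: hour treated as an integer
as_float = implicit_add(hour, 4.0)  # mirrors `hour + 4f`: hour treated as a float

print(as_int, as_float)
```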

  39. Pig Complex Types: Tuple
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          X = FOREACH A GENERATE t1.t1a, t2.$0;
      A:
          t1               t2
          t1a  t1b  t1c    t2a  t2b  t2c
          3    8    9      4    5    6
          1    4    7      3    7    5
          2    5    8      9    5    8

  40. Pig Complex Types: Tuple
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          X = FOREACH A GENERATE t1.t1a, t2.$0;
          DUMP X;
          (3,4)
          (1,3)
          (2,9)
      X:
          $0  $1
          3   4
          1   3
          2   9
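
The projection `t1.t1a, t2.$0` picks one field out of each nested tuple, once by name and once by position. A small Python sketch of the same selection over the data above (the `T1_FIELDS` lookup is only an illustrative stand-in for the schema):

```python
# Relation A: each row holds two nested tuples, t1 and t2.
A = [
    ((3, 8, 9), (4, 5, 6)),
    ((1, 4, 7), (3, 7, 5)),
    ((2, 5, 8), (9, 5, 8)),
]

# Stand-in for the schema of t1, so that a field can be looked up by name.
T1_FIELDS = ("t1a", "t1b", "t1c")

# X = FOREACH A GENERATE t1.t1a, t2.$0;
X = [(t1[T1_FIELDS.index("t1a")], t2[0]) for t1, t2 in A]

for row in X:
    print(row)
```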

  41. Pig Complex Types: Bag
          cat data;
          (3,8,9)
          (2,3,6)
          (1,4,7)
          (2,5,8)
          A = LOAD 'data' AS (c1:int, c2:int, c3:int);
          B = GROUP A BY c1;
      A:
          c1  c2  c3
          3   8   9
          2   3   6
          1   4   7
          2   5   8

  42. Pig Complex Types: Bag
          cat data;
          (3,8,9)
          (2,3,6)
          (1,4,7)
          (2,5,8)
          A = LOAD 'data' AS (c1:int, c2:int, c3:int);
          B = GROUP A BY c1;
          DUMP B;
          (1,{(1,4,7)})
          (2,{(2,5,8),(2,3,6)})
          (3,{(3,8,9)})
      B:
          group (c1)    A: c1  c2  c3
          3                3   8   9
          2                2   3   6
                           2   5   8
          1                1   4   7
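
`GROUP A BY c1` yields one row per distinct key, holding the key and a bag of all tuples of A with that key. In Python this resembles building a dictionary of lists; the sketch below sorts the groups by key only to make the printed output stable, whereas a Pig relation and its bags are unordered.

```python
from collections import defaultdict

# Relation A with fields (c1, c2, c3).
A = [(3, 8, 9), (2, 3, 6), (1, 4, 7), (2, 5, 8)]

# B = GROUP A BY c1;
groups = defaultdict(list)
for t in A:
    groups[t[0]].append(t)  # the list plays the role of the bag for key t[0]

# Sorted by key for a stable printout only.
B = sorted(groups.items())

for key, bag in B:
    print(key, bag)
```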

  43. Pig Complex Types: Map
          cat prices;
          [Nescafe#$2.000]
          [Gillette_Mach3#$8.250]
          A = LOAD 'prices' AS (M:map[]);

  44. Pig Complex Types: Summary
      tuple: a row in a table / a list of fields, e.g., (customer412, Nescafe, 08, $2.000)
      bag: a set of tuples (allows duplicates), e.g., { (cust412, Nescafe, 08, $2.000), (cust413, Gillette_Mach3, 08, $8.250) }
      map: a set of key-value pairs, e.g., [Nescafe#$2.000]

  45. APACHE PIG: UNNESTING (FLATTEN)

  46. Pig Latin: Hello Word Count
          input_lines = LOAD '/tmp/book.txt' AS (line:chararray);
          -- Extract words from each line and put them into a pig bag
          -- datatype, then flatten the bag to get one word on each row
          words = FOREACH input_lines GENERATE FLATTEN(TOKENIZE(line)) AS word;
          -- filter out any words that are just white spaces
          filtered_words = FILTER words BY word MATCHES '\\w+';
          -- create a group for each word
          word_groups = GROUP filtered_words BY word;
          -- count the entries in each group
          word_count = FOREACH word_groups GENERATE COUNT(filtered_words) AS count, group AS word;
          -- order the records by count
          ordered_word_count = ORDER word_count BY count DESC;
          STORE ordered_word_count INTO '/tmp/book-word-count.txt';

  47. Pig Complex Types: Flatten Tuples
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          X = FOREACH A GENERATE flatten(t1), flatten(t2);
      A:
          t1               t2
          t1a  t1b  t1c    t2a  t2b  t2c
          3    8    9      4    5    6
          1    4    7      3    7    5
          2    5    8      9    5    8

  48. Pig Complex Types: Flatten Tuples
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          X = FOREACH A GENERATE flatten(t1), flatten(t2);
          DUMP X;
          (3,8,9,4,5,6)
          (1,4,7,3,7,5)
          (2,5,8,9,5,8)
      X:
          t1a  t1b  t1c  t2a  t2b  t2c
          3    8    9    4    5    6
          1    4    7    3    7    5
          2    5    8    9    5    8

  49. Pig Complex Types: Flatten Tuples
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          Y = FOREACH A GENERATE t1, flatten(t2);
      A:
          t1               t2
          t1a  t1b  t1c    t2a  t2b  t2c
          3    8    9      4    5    6
          1    4    7      3    7    5
          2    5    8      9    5    8

  50. Pig Complex Types: Flatten Tuples
          cat data;
          (3,8,9) (4,5,6)
          (1,4,7) (3,7,5)
          (2,5,8) (9,5,8)
          A = LOAD 'data' AS (t1:tuple(t1a:int,t1b:int,t1c:int),t2:tuple(t2a:int,t2b:int,t2c:int));
          DUMP A;
          ((3,8,9),(4,5,6))
          ((1,4,7),(3,7,5))
          ((2,5,8),(9,5,8))
          Y = FOREACH A GENERATE t1, flatten(t2);
          DUMP Y;
          ((3,8,9),4,5,6)
          ((1,4,7),3,7,5)
          ((2,5,8),9,5,8)
      Y:
          t1               t2a  t2b  t2c
          t1a  t1b  t1c
          3    8    9      4    5    6
          1    4    7      3    7    5
          2    5    8      9    5    8
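
The difference between X (both tuples flattened) and Y (only t2 flattened) can be reproduced with tuple concatenation in Python. This is only a sketch of flattening nested tuples, using the data from the slides; it does not cover flattening bags.

```python
# Relation A: each row holds two nested tuples, t1 and t2.
A = [
    ((3, 8, 9), (4, 5, 6)),
    ((1, 4, 7), (3, 7, 5)),
    ((2, 5, 8), (9, 5, 8)),
]

# X = FOREACH A GENERATE flatten(t1), flatten(t2);
# Both nested tuples are unnested into one flat row of six fields.
X = [t1 + t2 for t1, t2 in A]

# Y = FOREACH A GENERATE t1, flatten(t2);
# t1 stays nested as a single field; only the fields of t2 are unnested.
Y = [(t1,) + t2 for t1, t2 in A]

for row in X:
    print(row)
for row in Y:
    print(row)
```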
