Data-Centered Crowdsourcing Workshop with Prof. Tova Milo and Slava Novgorodov

DATA-CENTERED CROWDSOURCING WORKSHOP
PROF. TOVA MILO, SLAVA NOVGORODOV
TEL AVIV UNIVERSITY
2016/2017

ADMINISTRATIVE NOTES
Course goal: Learn about crowd-data sourcing and prepare a final project (improving an existing problem's solution / solving a new problem)
Group size: ~4 students
Requirements: Databases (SQL) is recommended; Web programming (we will do a short overview); (optionally) Mobile development (we will not teach it)

ADMINISTRATIVE NOTES (2)
Schedule:
3 intro meetings
1st meeting – overview of crowdsourcing
2nd meeting – open problems, possible projects
3rd meeting – Web programming overview
Mid-term meeting
Final meeting and projects presentation
Dates: http://slavanov.com/teaching/crowd1617b/

WHAT IS CROWDSOURCING?
Crowdsourcing = Crowd + Outsourcing
Crowdsourcing is the act of sourcing tasks traditionally performed by specific individuals to a group of people or community (crowd) through an open call.

CROWDSOURCING
Main idea: Harness the crowd to a task
Task: solve bugs
Task: find an appropriate treatment to an illness
Task: construct a database of facts
Why now? Internet and smartphones – we are all connected, all of the time!

THE CLASSICAL EXAMPLE

GALAXY ZOO

MORE

AND EVEN MORE

CROWDSOURCING:
UNIFYING PRINCIPLES
Main goal
"Outsourcing" a task to a crowd of users
Kinds of tasks
Tasks that can be performed by a computer, but inefficiently
Tasks that can't be performed by a computer
Challenges
How to motivate the crowd?
Get data, minimize errors, estimate quality
Direct users to contribute where it is most needed / where they are experts

MOTIVATING THE CROWD
 
Altruism
Fun
Money

CROWD DATA SOURCING
Outsourcing data collection to the crowd of Web users:
When people can provide the data
When people are the only source of data
When people can efficiently clean and/or organize the data
Two main aspects [DFKK'12]:
Using the crowd to create better databases
Using database technologies to create better crowd data sourcing applications
[DFKK'12]: Crowdsourcing Applications and Platforms: A Data Management Perspective, A. Doan, M. J. Franklin, D. Kossmann, T. Kraska, VLDB 2011

MY FAVORITE EXAMPLE
reCAPTCHA
100,000 web sites
40 million words/day

CROWDSOURCING
RESEARCH GROUPS
An incomplete list of groups working on crowdsourcing:
Qurk (MIT)
CrowdDB (Berkeley and ETH Zurich)
Deco (Stanford and UCSC)
CrowdForge (CMU)
HKUST DB Group
WalmartLabs
MoDaS (Tel Aviv University)

CROWDSOURCING MARKETPLACES

MECHANICAL TURK
Requestor places Human Intelligence Tasks (HITs)
Min price: $0.01
Provide expiration date and UI
# of assignments
Requestor approves jobs and payments
Special API
Workers choose jobs, do them, and get paid
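
The "Special API" can be scripted. Below is a minimal sketch, assuming the boto3 MTurk client; the question XML, the external URL, and all parameter values are illustrative placeholders, not prescribed by the slides.

# Minimal sketch of posting a HIT through the MTurk API (boto3).
# Assumes AWS credentials are configured; the ExternalURL is a placeholder.
import boto3

question_xml = """<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.com/task</ExternalURL>
  <FrameHeight>400</FrameHeight>
</ExternalQuestion>"""

client = boto3.client("mturk", region_name="us-east-1")
hit = client.create_hit(
    Title="Is the person in the image a woman?",
    Description="Look at one image and answer a yes/no question.",
    Reward="0.01",                    # min price: $0.01
    MaxAssignments=5,                 # number of assignments
    LifetimeInSeconds=24 * 60 * 60,   # expiration date
    AssignmentDurationInSeconds=300,
    Question=question_xml,            # the task UI
)
print(hit["HIT"]["HITId"])
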
USES OF HUMAN COMPUTATION
Data cleaning/integration (ProPublica)
Finding missing people (Haiti, Fossett, Gray)
Translation/Transcription (SpeakerText)
Word Processing (Soylent)
Outsourced insurance claims processing
Data journalism (Guardian)

TYPES OF TASKS
Source: "Paid Crowdsourcing", SmartSheet.com

OVERVIEW OF RECENT RESEARCH
Crowdsourced databases, query evaluation, sorts/joins, top-k
CrowdDB, Qurk, Deco, …
Crowdsourced data collection/cleaning
AskIt, QOCO, …
Crowdsourced data mining
CrowdMining, OASSIS, …
Image tagging, media meta-data collection
Crowdsourced recommendations and planning

CROWDSOURCED DATABASES
Motivation: Why do we need crowdsourced databases?
There are many things (queries) that cannot be done (answered) in the classical DB approach.
We call them: DB-Hard queries
Examples:

DB-HARD QUERIES (1)
SELECT Market_Cap FROM Companies WHERE Company_Name = 'I.B.M'
Result: 0 rows
Problem: Entity Resolution

DB-HARD QUERIES (2)
SELECT Market_Cap FROM Companies WHERE Company_Name = 'Apple'
Result: 0 rows
Problem: Closed World Assumption

DB-HARD QUERIES (3)
SELECT Image FROM Images WHERE Theme = 'Business Success' ORDER BY relevance
Result: 0 rows
Problem: Missing Intelligence

CROWDDB
Use the crowd to answer DB-Hard queries
Use the crowd when:
Looking for new data (Open World Assumption)
Doing a fuzzy comparison
Recognizing patterns
Don't use the crowd when:
Doing anything the computer already does well

CLOSED WORLD VS OPEN WORLD
OWA: Used in knowledge representation
CWA: Used in classical DBMS
Example:
Statement: Mary is a citizen of France
Question: Is Paul a citizen of France?
CWA: No
OWA: Unknown

CROWDSQL – CROWD COLUMN
DDL Extension:
CREATE TABLE Department (
  university STRING,
  name STRING,
  url CROWD STRING,
  phone STRING,
  PRIMARY KEY (university, name)
);

CROWDSQL EXAMPLE #1
INSERT INTO Department (university, name) VALUES ('TAU', 'CS');
Result:
university: TAU | name: CS | url: CNULL | phone: NULL

CROWDSQL EXAMPLE #2
SELECT url FROM Department WHERE name = 'Math';
Side effect of this query:
Crowdsourcing of CNULL values of Math departments

CROWDSQL – CROWD TABLE
DDL Extension:
CREATE CROWD TABLE Professor (
  name STRING PRIMARY KEY,
  email STRING UNIQUE,
  university STRING,
  department STRING,
  FOREIGN KEY (university, department)
    REF Department (university, name)
);

CROWDSQL – SUBJECTIVE COMPARISONS
Two functions:
CROWDEQUAL
Takes 2 parameters and asks the crowd to decide if they are equal
~= is syntactic sugar
CROWDORDER
Used when we need the help of the crowd to rank or order results

CROWDEQUAL EXAMPLE
SELECT profile FROM department WHERE name ~= 'CS';
To ask for all "CS" departments, the query above could be posed. Here, the query writer asks the crowd to do entity resolution over the possibly different names given for Computer Science in the database.

CROWDORDER EXAMPLE
SELECT p FROM Picture WHERE subject = 'Golden Gate Bridge'
ORDER BY CROWDORDER(p, 'Which picture visualizes better %subject');
This CrowdSQL query asks for a ranking of pictures with regard to how well these pictures depict the Golden Gate Bridge.

UI GENERATION
Clear UI is key to quality of answers and response time
SQL schema to auto-generated UI
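
A minimal sketch of what "SQL schema to auto-generated UI" could mean, assuming a simple list-of-columns schema description; the real CrowdDB generator works from the DDL and is more elaborate.

# Hypothetical sketch: render an HTML form from table schema metadata,
# with an input box for each CROWD column whose value is still CNULL.
def generate_form(table, columns, row):
    """columns: list of (name, type, is_crowd); row: known values by name."""
    parts = [f"<form name='{table}'>"]
    for name, col_type, is_crowd in columns:
        value = row.get(name, "")
        if is_crowd and value == "":    # CNULL: ask the crowd
            parts.append(f"  <label>{name} ({col_type}): <input name='{name}'></label>")
        else:                           # known value: display read-only
            parts.append(f"  <label>{name}: <b>{value}</b></label>")
    parts.append("</form>")
    return "\n".join(parts)

columns = [("university", "STRING", False), ("name", "STRING", False),
           ("url", "STRING", True), ("phone", "STRING", False)]
print(generate_form("Department", columns, {"university": "TAU", "name": "CS"}))
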
QUERY PLAN GENERATION
Query:
SELECT * FROM Professor p, Department d
WHERE d.name = p.dep AND p.name = 'Karp'

DEALING WITH OPEN-WORLD

Qurk (MIT): Declarative workflow management system that allows human computation over data (the human is a part of query execution)

QURK: THE BEGINNING
Schema: celeb(name text, img url)
Query:
SELECT c.name FROM celeb AS c WHERE isFemale(c)
UDF (User Defined Function) – isFemale:
TASK isFemale(field) TYPE Filter:
  Prompt: "<table><tr> \
    <td><img src='%s'></td> \
    <td>Is the person in the image a woman?</td> \
    </tr></table>", tuple[field]
  YesText: "Yes"
  NoText: "No"
  Combiner: MajorityVote

ISFEMALE FUNCTION (UI)

JOIN
Schema: photos(img url)
Query:
SELECT c.name FROM celeb c JOIN photos p
ON samePerson(c.img, p.img)
samePerson:
TASK samePerson(f1, f2) TYPE EquiJoin:
  SingularName: "celebrity"
  PluralName: "celebrities"
  LeftPreview: "<img src='%s' class=smImg>", tuple1[f1]
  LeftNormal: "<img src='%s' class=lgImg>", tuple1[f1]
  RightPreview: "<img src='%s' class=smImg>", tuple2[f2]
  RightNormal: "<img src='%s' class=lgImg>", tuple2[f2]
  Combiner: MajorityVote

JOIN – UI EXAMPLE
# of HITs = |R| * |S|

JOIN – NAÏVE BATCHING
# of HITs = (|R| * |S|) / b

JOIN – SMART BATCHING
# of HITs = (|R| * |S|) / (r * s)
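
As a sanity check on the three formulas above, a short Python sketch with illustrative table sizes and batch parameters:

# HITs needed to crowd-join R with S under the three interfaces:
# one pair per HIT, naive batching (b pairs per HIT), and smart batching
# (an r x s grid of images per HIT).
from math import ceil

def simple_join(R, S):
    return R * S

def naive_batch_join(R, S, b):
    return ceil(R * S / b)

def smart_batch_join(R, S, r, s):
    return ceil(R / r) * ceil(S / s)

R, S = 20, 20
print(simple_join(R, S))                 # 400
print(naive_batch_join(R, S, b=10))      # 40
print(smart_batch_join(R, S, r=5, s=5))  # 16
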
FEATURE EXTRACTION
SELECT c.name FROM celeb c JOIN photos p
ON samePerson(c.img, p.img)
AND POSSIBLY gender(c.img) = gender(p.img)
AND POSSIBLY hairColor(c.img) = hairColor(p.img)
AND POSSIBLY skinColor(c.img) = skinColor(p.img)
TASK gender(field) TYPE Generative:
  Prompt: "<table><tr> \
    <td><img src='%s'> \
    <td>What is this person's gender? \
    </table>", tuple[field]
  Response: Radio("Gender", ["Male","Female",UNKNOWN])
  Combiner: MajorityVote

ECONOMICS OF FEATURE EXTRACTION
Dataset: Table1 [20 rows] x Table2 [20 rows]
Join with no filtering (cross product): 400 comparisons
Filtering on 1 parameter (say gender): +40 extra HITs
For example: 11 females, 9 males in Table1; 10 females, 10 males in Table2
Join after filtering: ~100 comparisons
No-Filter/Filter HITs ratio: 400/140
Decreases the number of HITs ~3x

POSSIBLY FILTERS SELECTION
Gender?
Skin color?
Hair color???

QURK – MORE FEATURES:
COUNTING WITH CROWD
Crowd-powered selectivity estimation (e.g., is the fraction 1% or 50%?)
Given a dataset of images, run queries on it (filtering, aggregation).
Images are unlabeled.
No prior knowledge of the distribution.

MOTIVATING EXAMPLE #1
Schema: people(name varchar2(32), photo img)
Query:
SELECT * FROM people
WHERE gender = male AND hairColor = red
Filter by gender(photo) = male
Filter by hairColor(photo) = red
Filter by gender(photo) = male, then by hairColor(photo) = red:
First pass: 10 HITs (result – 5 photos)
Second pass: 5 HITs
Total: 15 HITs
Filter by hairColor(photo) = red, then by gender(photo) = male:
First pass: 10 HITs (result – 1 photo)
Second pass: 1 HIT
Total: 11 HITs
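
The ordering effect can be computed directly from the filters' selectivities. A small sketch, assuming one photo per HIT and the selectivities from the example above:

# Cost of two crowd filters applied in sequence, one photo per HIT.
# Only photos surviving the first filter reach the second, so applying
# the more selective filter first is cheaper.
def sequential_filter_hits(n_photos, first_selectivity):
    first_pass = n_photos
    second_pass = round(n_photos * first_selectivity)
    return first_pass + second_pass

# gender=male keeps 5/10 photos; hairColor=red keeps 1/10.
print(sequential_filter_hits(10, 5 / 10))  # gender first:     10 + 5 = 15 HITs
print(sequential_filter_hits(10, 1 / 10))  # hair color first: 10 + 1 = 11 HITs
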
HOW MANY MALES/FEMALES?

INTERFACE: LABELING

INTERFACE: COUNTING

ESTIMATING COUNTS
Can't show all the images to every user
Show a random sample
Sampling error
Worker error
Dependent samples

COUNTING VS LABELING
Dataset: 500 images
Labeling:
10 images per HIT (can be 5–20)
5 workers per HIT (majority) (can be 3–7)
Total HITs = 500/10 * 5 = 250
Counting:
75 images per HIT (can be 50–150)
1 worker per HIT (spammer detection algo – later)
Total HITs = 500/75 ≈ 7
~37.5 times cheaper!
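
The comparison is plain arithmetic; a sketch with the slide's numbers:

# HIT budgets for labeling vs. counting over the same 500-image dataset.
from math import ceil

def labeling_hits(n_images, images_per_hit=10, workers_per_hit=5):
    return ceil(n_images / images_per_hit) * workers_per_hit

def counting_hits(n_images, images_per_hit=75, workers_per_hit=1):
    return ceil(n_images / images_per_hit) * workers_per_hit

lab, cnt = labeling_hits(500), counting_hits(500)
print(lab, cnt)                             # 250 vs. 7
print(f"~{lab / (500 / 75):.1f}x cheaper")  # ~37.5x
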
AVOIDING SPAMMERS:
FORMAL
If no spammers, just average all the results
Average the contribution of each user = F_i (really helps!)
Initialize: … Define: … Iterate: … Finally: …
[Figure: worker estimates of the fraction of males on a 0–1 scale, marking the average, actual fraction, outlier, confidence interval, and new average]
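
The Initialize/Define/Iterate/Finally loop is shown on the slide only as a chart, so the sketch below is an assumption reconstructed from the chart labels (average, confidence interval, outlier, new average), not the paper's exact algorithm:

# Hypothetical sketch of spammer-resistant averaging: repeatedly drop
# per-worker estimates that fall outside a confidence band around the mean.
import statistics

def robust_fraction(estimates, z=1.5):
    """estimates: per-worker fractions F_i in [0, 1]."""
    kept = list(estimates)
    while len(kept) > 2:
        mean = statistics.mean(kept)
        band = z * statistics.stdev(kept)   # crude confidence interval
        inliers = [f for f in kept if abs(f - mean) <= band]
        if len(inliers) == len(kept):       # no outliers left: done
            break
        kept = inliers                      # iterate with the new average
    return statistics.mean(kept)

# One spammer reporting 1.0 among honest workers near the true fraction 0.45.
print(robust_fraction([0.44, 0.47, 0.45, 0.43, 1.00]))  # ~0.45
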
AVOIDING SPAMMERS:
DEMONSTRATION
COORDINATED ATTACKS
[Figure: fraction of males on a 0–1 scale vs. the actual fraction, under a coordinated attack]

SOLUTION:
ADD RANDOM KNOWN RESULTS

GOLD STANDARD IMAGES
Not only 1 golden-truth task:
Each worker completes only 1–2 tasks
Spammers can identify those tasks
Instead, distribute golden truth over all tasks
Old approx.: F = C/R
New approx.: F = (C-G)/(R-G)
C – count provided by worker
R – number of items
G – number of golden truth images
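
In code, the gold-standard correction is a direct transcription of the formula above (assuming, for illustration, that all G injected gold items are known positives):

# Worker-reported fraction, with and without gold-standard correction.
def old_fraction(C, R):
    return C / R                      # biased by the injected gold items

def gold_corrected_fraction(C, R, G):
    return (C - G) / (R - G)          # fraction over the unknown items only

# A 75-image HIT seeded with 15 known males; the worker reports 45 males.
print(old_fraction(45, 75))                 # 0.60
print(gold_corrected_fraction(45, 75, 15))  # 0.50
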
RANDOM RESULTS WEAKEN COORDINATED ATTACKERS
[Figure: fraction of males on a 0–1 scale vs. the actual fraction]

TRADEOFF
Can withstand extreme coordinated attacks (>70% attackers) in exchange for a commensurate number of known labels

BEST USE OF RESOURCES:
HUMAN ASSISTED GRAPH SEARCH
Given a DAG and some unknown target(s), we can ask YES/NO questions, e.g., reachability:
Is it Asian? : YES
Is it Thai? : NO
Is it Chinese? : YES
Human Assisted Graph Search: It's Okay to Ask Questions, A. Parameswaran, A. Das Sarma, H. Garcia-Molina, N. Polyzotis, J. Widom, VLDB '11
Find an optimal set of questions to find the target nodes
Optimize cost: minimal # of questions
Optimize accuracy: minimal # of possible targets
Challenges:
Answer correlations (Falafel → Middle Eastern)
Location in the graph affects information gain (leaves are likely to get a NO)
Asking several questions in parallel to reduce latency
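
To make the cost objective concrete, here is a greedy sketch (an illustration, not the algorithm from the paper): each reachability question "is the target at or below node v?" splits the candidate set, and we repeatedly ask the question whose worst-case split is smallest.

# Greedy yes/no question selection over a DAG of categories.
# dag maps each node to its children; ask(v) answers whether the unknown
# target lies at or below v. Assumes some question always splits the
# remaining candidates.
def below(dag, v):
    """All nodes at or below v (v plus its descendants)."""
    seen, stack = set(), [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(dag.get(u, []))
    return seen

def locate(dag, candidates, ask):
    candidates = set(candidates)
    while len(candidates) > 1:
        # best question: node minimizing the worst-case surviving candidates
        v = min(dag, key=lambda n: max(len(candidates & below(dag, n)),
                                       len(candidates - below(dag, n))))
        d = below(dag, v)
        candidates = candidates & d if ask(v) else candidates - d
    return candidates.pop()

# Toy hierarchy: Is it Asian? YES. Is it Thai? NO. -> Chinese.
dag = {"food": ["asian", "middle-eastern"], "asian": ["thai", "chinese"],
       "middle-eastern": ["falafel"], "thai": [], "chinese": [], "falafel": []}

def ask(v):
    return "chinese" in below(dag, v)   # oracle standing in for the crowd

print(locate(dag, {"thai", "chinese", "falafel"}, ask))  # chinese
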
THE OBJECTIVE

PROBLEM DIMENSIONS
Single target / multiple targets
Online / offline
Online: one question at a time
Offline: pre-compute all questions
Hybrid approach
Graph structure

IMPORTANCE OF (GOOD) UI
Good UI – better results
Good UI – faster results
Bad UI – inaccurate results
Bad UI – workers leave without completing the task
CHALLENGES
Open vs. closed world assumption
Asking the right questions
Estimating the quality of answers
Incremental processing of updates

MORE CHALLENGES
Distributed management of huge data
Processing of textual answers
Semantics
More ideas?

RESEARCH AGENDA
Data Model/Query Language
Query Execution/Query Optimization
Quality Control
Storage/Caching
User Interfaces
Worker Behavior/Worker Relationship Management
Interactivity
Platform design
Hybrid Human/Machine algorithms
  
(from VLDB'11 Tutorial by Doan, Franklin, Kossmann, Kraska)
TEASERS

More – next week

Questions?

REFERENCES
This presentation is partially based on:
"Counting with the Crowd" slides by Adam Marcus
Crowdsourcing tutorial from VLDB'11 by Doan, Franklin, Kossmann and Kraska
"Introduction to Crowdsourcing" slides by Tova Milo
References:
https://users.soe.ucsc.edu/~alkis/papers/ivd.pdf
https://amplab.cs.berkeley.edu/wp-content/uploads/2011/06/CrowdDB-Answering-Queries-with-Crowdsourcing.pdf
http://db.csail.mit.edu/pubs/mturk-cameraready.pdf