Knowledge Bases and Harvesting Information

This content delves into the concept of knowledge bases, exploring how information is harvested, consolidated, and analyzed. It covers the extraction of data from various sources like WordNet, Wikipedia, and web content, providing insights into classes, facts, and common sense knowledge. The goal is to find classes, instances, and relationships within these knowledge bases.

  • Knowledge Bases
  • Information Harvesting
  • Data Extraction
  • WordNet
  • Web Content Analytics

Uploaded on Oct 10, 2024

Presentation Transcript


  1. Outline
     1. Introduction
     2. Harvesting Classes
     3. Harvesting Facts
     4. Common Sense Knowledge
     5. Knowledge Consolidation
     6. Web Content Analytics
     7. Wrap-Up
     Subtopics shown alongside the outline: KBs, Goal, WordNet, Extraction from Wikipedia, Extraction from text, Extraction from tables.

  2. Knowledge Bases are labeled graphs
     A knowledge base can be seen as a directed labeled multi-graph, where the nodes are entities and the edges are relations.
     • Classes / Concepts / Types: resource, person, location, singer, city (linked by subclassOf edges).
     • Instances / entities: for example Tupelo (linked to classes by type edges).
     • Relations / Predicates: for example bornIn.

  3. An entity can have different labels
     • The same entity has two labels: synonymy.
     • The same label for two entities: ambiguity.
     Figure residue: entities of type singer and person, connected by label edges to the labels "Elvis" and "The King".

  4. Different views of a knowledge base
     We use "RDFS Ontology" and "Knowledge Base (KB)" synonymously.
     • Triple notation (Subject, Predicate, Object): (Elvis, type, singer), (Elvis, bornIn, Tupelo), ...
     • Graph notation: Elvis -type-> singer, Elvis -bornIn-> Tupelo.
     • Logical notation: type(Elvis, singer), bornIn(Elvis, Tupelo), ...
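
     The three notations on this slide describe the same data. A minimal Python sketch of that equivalence follows; the helper names `to_logical` and `to_graph` are made up for illustration and do not come from the slides.

```python
# Toy knowledge base from the slide, stored as (subject, predicate, object) triples.
triples = [
    ("Elvis", "type", "singer"),
    ("Elvis", "bornIn", "Tupelo"),
]


def to_logical(triple):
    """Render one triple in logical notation: predicate(subject, object)."""
    subject, predicate, obj = triple
    return f"{predicate}({subject}, {obj})"


def to_graph(triples):
    """Build the graph view: node -> list of (edge label, target node)."""
    graph = {}
    for subject, predicate, obj in triples:
        graph.setdefault(subject, []).append((predicate, obj))
        graph.setdefault(obj, [])  # objects are nodes too, even without out-edges
    return graph


logical = [to_logical(t) for t in triples]
graph = to_graph(triples)
```

     Storing the triples once and deriving the other two views keeps the notations consistent by construction.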

  6. Goal: Finding classes and instances
     • Which classes exist? (aka entity types, unary predicates, concepts), e.g. person, singer
     • Which subsumptions hold? (subclassOf), e.g. singer subclassOf person
     • Which entities belong to which classes? (type)
     • Which entities exist?

  7. WordNet is a lexical knowledge base [Miller 1995, Fellbaum 1998]
     WordNet project (1985-now).
     • WordNet contains 82,000 classes.
     • WordNet contains 118,000 class labels.
     • WordNet contains thousands of subclassOf relationships.
     Example: singer subclassOf person subclassOf living being; the class person carries the labels "person", "individual", and "soul".
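
     The example on this slide can be sketched as two small dictionaries, one for subclassOf and one for labels. This is a toy model, not the WordNet API: it assumes each class has a single superclass, which real WordNet does not guarantee, and the function names are hypothetical.

```python
# Toy excerpt of the slide's example; real WordNet allows several superclasses.
subclass_of = {"singer": "person", "person": "living being"}
labels = {"person": ["person", "individual", "soul"]}


def superclasses(cls):
    """Follow subclassOf edges upward and return all superclasses in order."""
    chain = []
    while cls in subclass_of:
        cls = subclass_of[cls]
        chain.append(cls)
    return chain


def classes_for_label(label):
    """Return every class that carries the given label."""
    return [cls for cls, names in labels.items() if label in names]
```

     With this data, `superclasses("singer")` walks up to "living being", and `classes_for_label("soul")` finds the class person.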

  8. WordNet example: superclasses

  9. WordNet example: subclasses

  10. WordNet example: instances
      Only 32 singers!? 4 guitarists, 5 scientists, 0 enterprises, 2 entrepreneurs.
      WordNet classes lack instances.

  12. Wikipedia is a rich source of instances
      Examples shown: Jimmy Wales, Larry Sanger.

  13. Wikipedia's categories contain classes
      But: categories do not form a taxonomic hierarchy.

  14. Categories can be linked to WordNet
      Example: the Wikipedia category "American people of Syrian descent".
      • Noun group parsing: pre-modifier (American), head (people), post-modifier (of Syrian descent).
      • The head has to be plural.
      • Stemming: people -> person.
      • Map the stemmed head to the most frequent meaning of that label in WordNet, here the class person.
      WordNet classes shown in the figure: person, gr. person, singer, people, descent.
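
      The steps above can be sketched in a few lines of Python. Everything here is a toy stand-in: the preposition list, the plural table `SINGULAR`, and the dictionary `TOY_WORDNET` (senses ordered by assumed frequency) are invented for illustration, and a real system would use a proper noun-group parser, stemmer, and WordNet itself.

```python
# Hypothetical resources for illustration only.
PREPOSITIONS = {"of", "in", "from", "by", "at"}
SINGULAR = {"people": "person", "singers": "singer", "cities": "city"}
# Label -> classes, most frequent meaning first (assumed ordering).
TOY_WORDNET = {"person": ["person", "gr. person"], "singer": ["singer"]}


def parse_noun_group(category):
    """Split a category name into (pre-modifier, head, post-modifier)."""
    words = category.split()
    # The post-modifier starts at the first preposition, if there is one.
    cut = next((i for i, w in enumerate(words) if w.lower() in PREPOSITIONS),
               len(words))
    if cut == 0:
        return "", "", category  # no head in front of the preposition
    return " ".join(words[:cut - 1]), words[cut - 1], " ".join(words[cut:])


def category_to_class(category):
    """Map a category to a class, or None if the head is not a known plural."""
    _, head, _ = parse_noun_group(category)
    stem = SINGULAR.get(head.lower())  # only plural heads are accepted
    if stem is None:
        return None
    senses = TOY_WORDNET.get(stem)
    return senses[0] if senses else None  # most frequent meaning
```

      A category such as "Rock music" has a singular head and is therefore rejected, which is the point of the plural check: it filters out topical categories that are not classes.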

  15. YAGO = WordNet + Wikipedia
      • YAGO: 200,000 classes, 460,000 subclassOf links, 3 Mio. instances, 96% accuracy [Suchanek: WWW 07 and follow-ups].
      • Related project: WikiTaxonomy, 105,000 subclassOf links, 88% accuracy [Ponzetto & Strube: AAAI 07 and follow-ups].
      Example: Steve Jobs type American people of Syrian descent (Wikipedia) subclassOf person (WordNet) subclassOf organism (WordNet).

  16. Link Wikipedia & WordNet by Random Walks [Navigli 2010]
      • Construct a neighborhood around the source and target nodes.
      • Use contextual similarity (glosses etc.) as edge weights.
      • Compute personalized PageRank (PPR) with the source as start node.
      • Rank candidate targets by their PPR scores.
      Example: the Wikipedia category "Formula One drivers" with the candidate WordNet classes {driver, operator of vehicle} and {driver, device driver}. Other nodes in the figure: Formula One champions, truck drivers, motor racing, Michael Schumacher, Barney Oldfield, chauffeur, race driver, trucker, tool, causal agent, computer program.
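
      A minimal sketch of the PPR step, using plain power iteration on a small weighted graph. The edge weights below are invented for illustration (the slide gives none), the restart probability `alpha=0.15` is a common default rather than a value from the source, and the graph is a hand-picked subset of the figure's nodes.

```python
def build_graph(weighted_edges):
    """Build a symmetric adjacency map: node -> {neighbor: weight}."""
    graph = {}
    for a, b, weight in weighted_edges:
        graph.setdefault(a, {})[b] = weight
        graph.setdefault(b, {})[a] = weight
    return graph


def personalized_pagerank(graph, source, alpha=0.15, iterations=100):
    """Random walk with restart: jump back to `source` with probability alpha."""
    scores = {node: 0.0 for node in graph}
    scores[source] = 1.0
    for _ in range(iterations):
        nxt = {node: 0.0 for node in graph}
        nxt[source] = alpha  # restart mass (scores always sum to 1)
        for node, neighbors in graph.items():
            total = sum(neighbors.values())
            for neighbor, weight in neighbors.items():
                nxt[neighbor] += (1 - alpha) * scores[node] * weight / total
        scores = nxt
    return scores


VEHICLE = "{driver, operator of vehicle}"
DEVICE = "{driver, device driver}"

# Assumed similarity weights; higher means more contextual overlap.
graph = build_graph([
    ("Formula One drivers", "motor racing", 1.0),
    ("motor racing", "race driver", 1.0),
    ("race driver", VEHICLE, 1.0),
    ("Formula One drivers", VEHICLE, 0.2),
    ("Formula One drivers", DEVICE, 0.2),
    (DEVICE, "computer program", 1.0),
])

scores = personalized_pagerank(graph, "Formula One drivers")
ranked = sorted([VEHICLE, DEVICE], key=scores.get, reverse=True)
```

      Both candidates share the label "driver" and get the same weak direct edge, so the ranking is decided by the surrounding context: the vehicle sense is also reachable through motor racing and race driver.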

  18. Hearst patterns extract instances from text [M. Hearst 1992]
      Goal: find instances of classes.
      Hearst defined lexico-syntactic patterns for the type relationship, with X the class and Y the instance:
      X such as Y; X like Y; Y and other X; X including Y; X, especially Y
      Find such patterns in text (better with POS tagging):
      • companies such as Apple
      • Google, Microsoft and other companies
      • Internet companies like Amazon and Facebook
      • Chinese cities including Kunming and Shangri-La
      • computer pioneers like the late Steve Jobs
      Derive type(Y, X): type(Apple, company), type(Google, company), ...
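
      A rough regex-based sketch of such pattern matching, without POS tagging. It takes the single word before the cue phrase as the class and uses a naive plural stripper; both are simplifying assumptions, and the function names are hypothetical.

```python
import re

# Class-first patterns: "<class> such as|like|including|especially <instances>".
CLASS_FIRST = re.compile(r"(\w+),? (?:such as|like|including|especially) (.+)")
# Instance-first pattern: "<instances> and other <class>".
INSTANCE_FIRST = re.compile(r"(.+) and other (\w+)")


def split_instances(text):
    """Split 'A, B and C' into ['A', 'B', 'C']."""
    parts = re.split(r",\s*|\s+and\s+", text)
    return [part.strip() for part in parts if part.strip()]


def singular(noun):
    """Naive de-pluralization; a real system would use a lemmatizer."""
    if noun.endswith("ies"):
        return noun[:-3] + "y"
    if noun.endswith("s"):
        return noun[:-1]
    return noun


def hearst(phrase):
    """Return (instance, class) pairs found in a single phrase."""
    match = INSTANCE_FIRST.fullmatch(phrase)
    if match:
        instances, cls = split_instances(match.group(1)), match.group(2)
    else:
        match = CLASS_FIRST.search(phrase)
        if not match:
            return []
        cls, instances = match.group(1), split_instances(match.group(2))
    return [(instance, singular(cls)) for instance in instances]
```

      On "computer pioneers like the late Steve Jobs" this sketch yields the instance "the late Steve Jobs", which illustrates why the slide recommends POS tagging to isolate the proper noun.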

  19. Probase builds a taxonomy from the Web [Wu et al.: SIGMOD 2012]
      Probase: 2.7 Mio. classes from 1.7 Bio. Web pages.
      Use Hearst patterns liberally to obtain many instance candidates:
      • plants such as trees and grass
      • plants include water turbines
      • western movies such as The Good, the Bad, and the Ugly
      Problem: signal vs. noise.
      Assess candidate pairs statistically: P[X|Y] >> P[X*|Y] implies subclassOf(Y, X).
      Problem: ambiguity of labels.
      Group senses by co-occurring entities: "X such as Y1 and Y2" implies the same sense of X.
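
      One way to read the statistical test is as a likelihood-ratio check between the best class candidate X and its strongest competitor X*. The sketch below follows that reading, which is an assumption, as are the threshold `ratio=3.0`, the function name `assess`, and all counts (including the competing classes "object" and "machine", which do not appear on the slide).

```python
from collections import Counter


def assess(pair_counts, ratio=3.0):
    """Keep (Y, X) when P[X|Y] is at least `ratio` times P[X*|Y] for the
    strongest competing class X*. Counts come from pattern matches."""
    by_instance = {}
    for (cls, instance), count in pair_counts.items():
        by_instance.setdefault(instance, Counter())[cls] += count

    accepted = []
    for instance, counts in by_instance.items():
        total = sum(counts.values())
        ranked = counts.most_common()
        best_cls, best_count = ranked[0]
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        if best_count / total >= ratio * (runner_up / total):
            accepted.append((instance, best_cls))
    return accepted


# Invented match counts for "X such as Y" style extractions.
pair_counts = {
    ("plant", "tree"): 40,
    ("object", "tree"): 5,
    ("plant", "grass"): 30,
    ("plant", "water turbine"): 2,
    ("machine", "water turbine"): 12,
}
accepted = assess(pair_counts)
```

      Under these counts the noisy reading "water turbine is a plant" loses to a better-supported class, while the botanical pairs pass.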

  20. Recursively apply doubly-anchored patterns [Kozareva/Hovy 2010, Dalvi et al. 2012]
      Goal: find instances of classes.
      • Start with a set of seeds: companies = {Microsoft, Google}.
      • Parse Web documents and find the pattern "W, Y and Z".
      • If two of the three placeholders match seeds, harvest the third.
      Examples: "Google, Microsoft and Amazon" yields type(Amazon, company). "Cherry, Apple, and Banana" matches no seed, so nothing is harvested.
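
      The bootstrapping loop can be sketched as follows. The regex handles only single-word names, and the third input phrase is an invented example added to show the recursive step; neither comes from the slides.

```python
import re

# Matches "W, Y and Z" and "W, Y, and Z" for single-word names.
PATTERN = re.compile(r"(\w+), (\w+),? and (\w+)")


def harvest(phrases, seeds):
    """Grow the seed set until no phrase contributes a new instance."""
    known = set(seeds)
    changed = True
    while changed:  # newly harvested names act as seeds in the next pass
        changed = False
        for phrase in phrases:
            for match in PATTERN.finditer(phrase):
                slots = set(match.groups())
                new = slots - known
                # Exactly two of three distinct placeholders are already known.
                if len(slots) == 3 and len(new) == 1:
                    known |= new
                    changed = True
    return known


phrases = [
    "Google, Microsoft and Amazon",
    "Cherry, Apple, and Banana",
    "Amazon, Facebook and Google",  # invented phrase, needs Amazon as a seed
]
companies = harvest(phrases, {"Microsoft", "Google"})
```

      Requiring two anchors is what keeps "Apple" in the fruit list from being harvested as a company; the price is that a single wrong harvest can propagate in later passes.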

  21. Instances can be extracted from tables [Kozareva/Hovy 2010, Dalvi et al. 2012]
      Goal: find instances of classes.
      • Start with a set of seeds: cities = {Paris, Shanghai, Brisbane}.
      • Parse Web documents and find tables.
      • If at least two seeds appear in a column, harvest the others.
      Table 1 (rows): Paris, France; Shanghai, China; Berlin, Germany; London, UK.
      Table 2 (rows): Paris, Iliad; Helena, Iliad; Odysseus, Odyssey; Rama, Mahabharata.
      Result: type(Berlin, city), type(London, city). Table 2 contains only one seed (Paris), so nothing is harvested from it.
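
      A minimal sketch of the column rule, with each table given as a list of rows. The function name and the `min_seeds` parameter are illustrative; the table contents follow the slide.

```python
def harvest_from_tables(tables, seeds, min_seeds=2):
    """Harvest the non-seed cells of every column holding enough seeds."""
    harvested = set()
    for table in tables:
        for column in zip(*table):  # transpose rows into columns
            cells = set(column)
            if len(cells & seeds) >= min_seeds:
                harvested |= cells - seeds
    return harvested


city_table = [
    ("Paris", "France"),
    ("Shanghai", "China"),
    ("Berlin", "Germany"),
    ("London", "UK"),
]
epic_table = [
    ("Paris", "Iliad"),
    ("Helena", "Iliad"),
    ("Odysseus", "Odyssey"),
    ("Rama", "Mahabharata"),
]
cities = harvest_from_tables([city_table, epic_table],
                             {"Paris", "Shanghai", "Brisbane"})
```

      The second table shows why one seed is not enough: "Paris" there is the character from the Iliad, and the two-seed threshold keeps Helena, Odysseus, and Rama out of the city class.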

  22. Take-Home Lessons
      • Semantic classes for entities: > 10 Mio. entities in 100,000s of classes; backbone for other kinds of knowledge harvesting; great mileage for semantic search, e.g. politicians who are scientists, French professors who founded Internet companies, ...
      • Variety of methods: noun phrase analysis, random walks, extraction from tables, ...
      • Still room for improvement: higher coverage, deeper in long tail, ...

  23. Open Problems and Grand Challenges
      • Wikipedia categories reloaded: larger coverage; comprehensive & consistent instanceOf and subClassOf across Wikipedia and WordNet, e.g. people lost at sea, ACM Fellow, Jewish physicists emigrating from Germany to USA, ...
      • Long tail of entities beyond Wikipedia: domain-specific entity catalogs, e.g. music, books, book characters, electronic products, restaurants, ...
      • New name for known entity vs. new entity? e.g. Lady Gaga vs. Radio Gaga vs. Stefani Joanne Angelina Germanotta
      • Universal solution for taxonomy alignment, e.g. Wikipedias, dmoz.org, baike.baidu.com, amazon, librarything tags, ...
