Corpora and Statistical Methods
Lecture 6

Semantic similarity, vector space models and word-sense disambiguation

Part 1: Semantic similarity

Synonymy

Different phonological/orthographic words with highly related meanings:
sofa / couch
boy / lad

Traditional definition: w1 is synonymous with w2 if w1 can replace w2 in a sentence, salva veritate.

Is this ever the case? Can we substitute one word for another and keep our sentence identical?

The importance of text genre & register

With near-synonyms, there are often register-governed conditions of use.

E.g. naive vs gullible vs ingenuous:
You're so bloody gullible […]
[…] outside on the pavement trying to entice gullible idiots in […]
You're so ingenuous. You tackle things the wrong way.
The commentator's ingenuous query could just as well have been prompted […]
However, it is ingenuous to suppose that peace process […]

(source: BNC)

Synonymy vs. Similarity

The contextual theory of synonymy:
based on the work of Wittgenstein (1953) and Firth (1957)
"You shall know a word by the company it keeps" (Firth 1957)

Under this view, perfect synonyms might not exist.

But words can be judged as highly similar if people put them into the same linguistic contexts, and judge the change to be slight.

Synonymy vs. similarity: example

Miller & Charles (1991), the weak contextual hypothesis:
The similarity of the contexts in which two words appear contributes to the semantic similarity of those words.

E.g. snake is similar to [resp. a synonym of] serpent to the extent that we find snake and serpent in the same linguistic contexts.
It is much more likely that snake/serpent will occur in similar contexts than snake/toad.

NB: this is not a discrete notion of synonymy, but a continuous definition of similarity.

The Miller/Charles experiment

Subjects were given sentences with missing words and asked to place words they felt were OK in each context.

Method to compare words A and B:
find sentences containing A
find sentences containing B
delete A and B from the sentences and shuffle them
ask people to choose which sentences to place A and B in

Results:
People tend to put similar words in the same contexts, and this is highly correlated with occurrence in similar contexts in corpora.

Issues with similarity

"Similar" is a much broader concept than "synonymous":

Contextually related, though differing in meaning:
man / woman
boy / girl
master / pupil

Contextually related, but with opposite meanings:
big / small
clever / stupid

Uses of similarity

Assumption: semantically similar words behave in similar ways.

Information retrieval: query expansion with related terms.

K nearest neighbours, e.g.:
given: a set of elements, each assigned to some topic
task: classify an unknown word w by topic
method: find the topic that is most prevalent among w's semantic neighbours (see the sketch below)
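
A minimal sketch of this nearest-neighbour idea in Python, assuming we already have a context set for each word and using Jaccard overlap as a stand-in similarity measure; all words, topics and context sets below are invented for illustration:

```python
from collections import Counter

def knn_topic(word, topic_of, contexts, k=3):
    """Assign a topic to `word` by majority vote among its k most similar
    labelled words; similarity = Jaccard overlap of the words' context sets."""
    def sim(a, b):
        A, B = contexts[a], contexts[b]
        return len(A & B) / len(A | B) if A | B else 0.0
    neighbours = sorted(topic_of, key=lambda w: sim(word, w), reverse=True)[:k]
    votes = Counter(topic_of[w] for w in neighbours)
    return votes.most_common(1)[0][0]

# Toy data: hand-made context sets and topic labels
contexts = {
    "striker": {"scored", "match", "league"},
    "referee": {"match", "whistle", "league"},
    "ballot":  {"vote", "election", "count"},
    "senator": {"vote", "election", "bill"},
    "keeper":  {"scored", "match", "save"},   # the unknown word to classify
}
topic_of = {"striker": "sport", "referee": "sport",
            "ballot": "politics", "senator": "politics"}

print(knn_topic("keeper", topic_of, contexts, k=3))   # -> 'sport'
```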

Common approaches

Vector-space approaches:
represent word w as a vector containing the words (or other features) in the context of w
compare the vectors of w1 and w2
various vector-distance measures are available

Information-theoretic measures:
w1 is similar to w2 to the extent that knowing about w1 increases my knowledge (decreases my uncertainty) about w2

Vector-space models

Basic data structure

Matrix M:
M_ij = number of times w_i co-occurs with w_j (in some window)
We can also have a document-by-word matrix.
We can treat matrix cells as boolean: if M_ij > 0, then w_i co-occurs with w_j, else it does not.
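
A minimal sketch of building such a word-by-word co-occurrence matrix from a tokenised corpus, using a symmetric context window; the toy corpus and window size are arbitrary choices for illustration:

```python
from collections import defaultdict

def cooccurrence_matrix(sentences, window=2):
    """M[w1][w2] = number of times w2 occurs within `window` tokens of w1."""
    M = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, w in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    M[w][tokens[j]] += 1
    return M

corpus = [["the", "cosmonaut", "completed", "a", "spacewalk"],
          ["the", "red", "car", "overtook", "the", "truck"]]
M = cooccurrence_matrix(corpus, window=2)
print(M["car"]["red"])       # raw co-occurrence count
print(M["car"]["red"] > 0)   # the boolean view of the same cell
```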

Distance measures

Many measures take a set-theoretic perspective. Vectors can be:
binary (indicating co-occurrence or not)
real-valued (indicating frequency or probability)
Similarity is a function of what two vectors have in common.

Classic similarity/distance measures

Boolean vectors (sets):
Dice coefficient: $\mathrm{Dice}(X, Y) = \frac{2|X \cap Y|}{|X| + |Y|}$
Jaccard coefficient: $\mathrm{Jaccard}(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$

Real-valued vectors:
Dice coefficient: $\mathrm{Dice}(\vec{v}, \vec{w}) = \frac{2 \sum_i \min(v_i, w_i)}{\sum_i (v_i + w_i)}$
Jaccard coefficient: $\mathrm{Jaccard}(\vec{v}, \vec{w}) = \frac{\sum_i \min(v_i, w_i)}{\sum_i \max(v_i, w_i)}$

Dice vs. Jaccard

Dice(car, truck) on the boolean matrix: (2 × 2) / (4 + 2) ≈ 0.67
Jaccard(car, truck) on the boolean matrix: 2 / 4 = 0.5

Dice is more "generous"; Jaccard penalises lack of overlap more.
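
A sketch of both measures on boolean (set-valued) context vectors; the car and truck context sets below are invented so that the counts match the worked example (|car| = 4, |truck| = 2, overlap 2):

```python
def dice(A, B):
    """Dice coefficient on sets: 2|A ∩ B| / (|A| + |B|)."""
    return 2 * len(A & B) / (len(A) + len(B))

def jaccard(A, B):
    """Jaccard coefficient on sets: |A ∩ B| / |A ∪ B|."""
    return len(A & B) / len(A | B)

# Hypothetical boolean context sets chosen to reproduce the figures above
car   = {"red", "old", "engine", "road"}
truck = {"engine", "road"}

print(round(dice(car, truck), 2))     # 0.67
print(round(jaccard(car, truck), 2))  # 0.5
```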

Classic similarity/distance measures

Cosine similarity (the cosine of the angle between the two vectors), applicable to both boolean and real-valued vectors:

$\cos(\vec{v}, \vec{w}) = \frac{\vec{v} \cdot \vec{w}}{|\vec{v}|\,|\vec{w}|} = \frac{\sum_{i=1}^{n} v_i w_i}{\sqrt{\sum_{i=1}^{n} v_i^2}\,\sqrt{\sum_{i=1}^{n} w_i^2}}$
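
A minimal sketch of cosine similarity on real-valued count vectors; the two toy vectors are arbitrary:

```python
import math

def cosine(v, w):
    """Cosine of the angle between two equal-length count vectors."""
    dot = sum(vi * wi for vi, wi in zip(v, w))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return dot / (norm_v * norm_w) if norm_v and norm_w else 0.0

# Toy co-occurrence counts over the same three context words
car   = [5, 2, 0]
truck = [3, 1, 1]
print(round(cosine(car, truck), 3))
```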

Probabilistic approaches

Turning counts to probabilities

Each row of the count matrix is normalised by its row total, so that cell (i, j) becomes P(w_j | w_i), e.g.:
P(spacewalking | cosmonaut) = 1/2 = 0.5
P(red | car) = 1/4 = 0.25

NB: this transforms each row into a probability distribution corresponding to a word.
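
A sketch of this row-normalisation step on a nested-dict count matrix; the counts are invented so that the two conditional probabilities above come out as 0.5 and 0.25:

```python
def rows_to_probabilities(counts):
    """Turn each row of a count matrix into a distribution P(context | word)."""
    probs = {}
    for word, row in counts.items():
        total = sum(row.values())
        probs[word] = {ctx: c / total for ctx, c in row.items()} if total else {}
    return probs

# Hypothetical counts chosen to reproduce the example probabilities
counts = {
    "cosmonaut": {"spacewalking": 1, "Soviet": 1},
    "car":       {"red": 1, "old": 1, "fast": 1, "new": 1},
}
P = rows_to_probabilities(counts)
print(P["cosmonaut"]["spacewalking"])  # 0.5
print(P["car"]["red"])                 # 0.25
```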

Probabilistic measures of distance

KL-Divergence: treat w1 as an approximation of w2:

$D(v \| w) = \sum_x P(x \mid v) \log \frac{P(x \mid v)}{P(x \mid w)}$

Problems:
asymmetric: D(p||q) ≠ D(q||p)
not so useful for word-word similarity
if the denominator P(x|w) = 0 for some x where P(x|v) > 0, then D(v||w) is undefined
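
A sketch of KL-Divergence on distributions stored as dicts, showing both problems at once: swapping the arguments changes the value, and a zero in the denominator makes the divergence blow up (treated here as infinity). The two toy distributions are invented:

```python
import math

def kl_divergence(p, q):
    """D(p || q) = sum_x p(x) * log2(p(x) / q(x))."""
    total = 0.0
    for x, px in p.items():
        if px == 0:
            continue                 # 0 * log 0 treated as 0
        qx = q.get(x, 0.0)
        if qx == 0:
            return float("inf")      # undefined in theory; infinite in practice
        total += px * math.log(px / qx, 2)
    return total

p = {"red": 0.5, "old": 0.5}
q = {"red": 0.25, "old": 0.25, "fast": 0.25, "new": 0.25}
print(kl_divergence(p, q))   # 1.0
print(kl_divergence(q, p))   # inf: asymmetric, and q has mass where p has none
```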

Probabilistic measures of distance

Information radius (aka Jensen-Shannon divergence):
compares the total divergence between p and q to the average of p and q
symmetric!

$\mathrm{IRad}(v, w) = D(v \| \frac{v+w}{2}) + D(w \| \frac{v+w}{2})$

Dagan et al. (1997) showed this measure to be superior to KL-Divergence when applied to a word sense disambiguation task.
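
A sketch of the information radius on the same dict-based distributions as above; because the mixture (p + q)/2 is non-zero wherever either distribution is, the zero-denominator problem disappears:

```python
import math

def kl(p, q):
    """D(p || q); q is assumed non-zero wherever p is non-zero."""
    return sum(px * math.log(px / q[x], 2) for x, px in p.items() if px > 0)

def information_radius(p, q):
    """IRad(p, q) = D(p || m) + D(q || m), where m is the average of p and q."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return kl(p, m) + kl(q, m)

p = {"red": 0.5, "old": 0.5}
q = {"red": 0.25, "old": 0.25, "fast": 0.25, "new": 0.25}
print(round(information_radius(p, q), 3))
print(round(information_radius(q, p), 3))   # same value: the measure is symmetric
```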

Some characteristics of vector-space measures

1. Very simple conceptually;
2. Flexible: can represent similarity based on document co-occurrence, word co-occurrence, etc.;
3. Vectors can be arbitrarily large, representing wide context windows;
4. Can be expanded to take into account grammatical relations (e.g. head-modifier, verb-argument, etc.).

Grammar-informed methods: Lin (1998)

Intuition: the similarity of any two things (words, documents, people, plants) is a function of the information gained by having:
a joint description of a and b in terms of what they have in common
compared to
describing a and b separately

E.g. do we gain more by a joint description of:
apple and chair (both THINGS...)
apple and banana (both FRUIT: more specific)?

Lin's definition cont'd

Essentially, we compare the info content of the "common" description to the info content of the "separate" descriptions:

$\mathrm{sim}(a, b) = \frac{\text{info content of joint description}}{\text{info content of separate descriptions}}$

NB: essentially mutual information!

An application to corpora

From a corpus-based point of view, what do words have in common? Context, obviously.

How to define context?
just "bag-of-words" (typical of vector-space models)
more grammatically sophisticated

Kilgarriff's (2003) application

Definition of the notion of context, following Lin:

Define F(w) as the set of grammatical contexts in which w occurs.

A context is a triple <rel, w, w'>:
rel is a grammatical relation
w is the word of interest
w' is the other word in rel

Grammatical relations can be obtained using a dependency parser.
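
A minimal sketch of how F(w) might be stored once a dependency parser has produced the triples; the parser output below is typed in by hand, purely for illustration (since w is fixed within F(w), each context is kept as a (rel, w') pair):

```python
from collections import defaultdict

# Hand-written stand-in for dependency-parser output: (rel, w, w') triples
triples = [
    ("subject-of", "cell", "absorb"),
    ("object-of",  "cell", "attack"),
    ("nmod-of",    "cell", "architecture"),
    ("subject-of", "pupil", "read"),
]

def grammatical_contexts(triples):
    """F(w): the set of grammatical contexts (rel, w') in which w occurs."""
    F = defaultdict(set)
    for rel, w, w_prime in triples:
        F[w].add((rel, w_prime))
    return F

F = grammatical_contexts(triples)
print(F["cell"])
```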

Grammatical co-occurrence matrix for cell

[Matrix figure not reproduced here. Source: Jurafsky & Martin (2009), after Lin (1998)]

Example with w = cell

Example triples:
<subject-of, cell, absorb>
<object-of, cell, attack>
<nmod-of, cell, architecture>

Observe that each triple f consists of the relation r, the second word in the relation w', and the word of interest w.

We can now compute the level of association between the word w and each of its triples f:

$I(w, f) = \log_2 \frac{P(w, f)}{P(w)\, P(r \mid w)\, P(w' \mid w)}$

This is an information-theoretic measure that was proposed as a generalisation of the idea of pointwise mutual information.

Calculating similarity

Given that we have grammatical triples for our words of interest, the similarity of w1 and w2 is a function of:
the triples they have in common
the triples that are unique to each

$\mathrm{sim}_{\mathrm{Lin}}(w_1, w_2) = \frac{2 \times I(F(w_1) \cap F(w_2))}{I(F(w_1)) + I(F(w_2))}$

I.e.: the mutual information of what the two words have in common, divided by the sum of the mutual information of what each word has.
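
A sketch of this similarity computation, assuming an association score I(w, f) is already available for every word-feature pair (e.g. the measure above); following Lin (1998), the "2 × I(common)" term is computed as the information both words carry about the shared features. The feature sets and scores below are invented just to show the shape of the calculation:

```python
def sim_lin(w1, w2, F, assoc):
    """Information in the shared features, relative to the total information
    in each word's own feature set."""
    shared = F[w1] & F[w2]
    common = sum(assoc[(w1, f)] + assoc[(w2, f)] for f in shared)  # ~ 2 * I(common)
    total = sum(assoc[(w1, f)] for f in F[w1]) + sum(assoc[(w2, f)] for f in F[w2])
    return common / total if total else 0.0

# Invented feature sets and (uniform) association scores for a toy example
F = {"master": {("subj-of", "read"), ("subj-of", "ask"), ("mod", "good")},
     "pupil":  {("subj-of", "read"), ("subj-of", "make"), ("mod", "good")}}
assoc = {(w, f): 1.0 for w in F for f in F[w]}

print(round(sim_lin("master", "pupil", F, assoc), 2))   # 0.67 with these toy scores
```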
 

Sample results: master & pupil

Common:
Subject-of: read, sit, know
Modifier: good, form
Possession: interest

master only:
Subject-of: ask
Modifier: past (cf. past master)

pupil only:
Subject-of: make, find
PP_at-p: school

Concrete implementation

The online SketchEngine gives the grammatical relations of words, plus a thesaurus which rates words by similarity to a head word. This is based on the Lin (1998) model.

Limitations (or characteristics)

Only applicable as a measure of similarity between words of the same category:
it makes no sense to compare grammatical relations of different-category words.

Does not distinguish between near-synonyms and merely "similar" words:
student ~ pupil
master ~ pupil

MI is sensitive to low frequency: a relation which occurs only once in the corpus can come out as highly significant.