Language Modeling: An Overview of Probabilistic Models and Applications

Università di Pisa
Human Language Technologies
Giuseppe Attardi
Language Modeling
IP notice: some slides from Dan Jurafsky, Jim Martin, Sandiway Fong, Dan Klein
Outline
Language Modeling (N-grams)
N-gram Intro
The Chain Rule
The Shannon Visualization Method
Evaluation:
Perplexity
Smoothing:
Laplace (Add-1)
Add-prior
Probabilistic Language Model
Goal: assign a probability to a sentence
Machine Translation:
P(high winds tonite) > P(large winds tonite)
Spell Correction
"The office is about fifteen minuets from my house"
P(about fifteen minutes from) > P(about fifteen minuets from)
Speech Recognition
P(I saw a van) >>  P(eyes awe of an)
 
Summarization, question-answering, etc.
Why Language Models
We have an English speech recognition system,
which answer is better?
Speech → Interpretation:
speech recognition system
speech cognition system
speck podcast histamine
スピーチ が 救出 ストン
Language models tell us the answer!
Language Modeling
We want to compute
P(w1, w2, w3, w4, w5, …, wn) = P(W)
= the probability of a sequence of words
Alternatively we want to compute
P(w5 | w1, w2, w3, w4)
= the probability of a word given some previous words
The model that computes P(W) or P(wn | w1, w2, …, wn-1) is called the language model.
A better term for this would be "The Grammar"
But "language model" or LM is standard
Computing P(W)
How to compute this joint probability:
P(the, other, day, I, was, walking, along, and, saw, a, lizard)
Intuition: let's rely on the Chain Rule of Probability
The Chain Rule
Recall the definition of conditional probability:
P(B|A) = P(A,B) / P(A)
Rewriting:
P(A,B) = P(A) P(B|A)
More generally:
P(A,B,C,D) = P(A) P(B|A) P(C|A,B) P(D|A,B,C)
In general:
P(x1, x2, x3, …, xn) = P(x1) P(x2|x1) P(x3|x1,x2) … P(xn|x1, …, xn-1)
The Chain Rule applied to joint probability of words in sentence
P(the big red dog was) =
P(the) · P(big|the) · P(red|the big) · P(dog|the big red) · P(was|the big red dog)
Obvious estimate
How to estimate?
P(the | its water is so transparent that)
= C(its water is so transparent that the) / C(its water is so transparent that)
Unfortunately
There are a lot of possible sentences.
We will never be able to get enough data to compute the statistics for those long prefixes:
P(lizard | the, other, day, I, was, walking, along, and, saw, a)
or
P(the | its water is so transparent that)
Markov Assumption
Make the simplifying assumption:
P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | a)
or maybe:
P(lizard | the, other, day, I, was, walking, along, and, saw, a) = P(lizard | saw, a)
Markov Assumption
So for each component in the product, replace it with an approximation (assuming a prefix of N):
P(wn | w1, …, wn-1) ≈ P(wn | wn-N+1, …, wn-1)
Bigram model:
P(wn | w1, …, wn-1) ≈ P(wn | wn-1)
N-gram models
We can extend to trigrams, 4-grams, 5-grams.
In general this is an insufficient model of language, because language has long-distance dependencies:
"The computer which I had just put into the machine room on the fifth floor crashed."
But we can often get away with N-gram models.
Estimating bigram probabilities
The Maximum Likelihood Estimate:
P(wi | wi-1) = count(wi-1, wi) / count(wi-1)
An example
<s> I am Sam </s>
<s> Sam I am </s>
<s> I do not like green eggs and ham </s>
This is the Maximum Likelihood Estimate, because it is the one which maximizes P(Training set|Model)
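As a concrete illustration (not part of the original slides), here is a minimal Python sketch that computes these bigram MLE estimates from the three example sentences; the function and variable names are just illustrative.

from collections import defaultdict

def train_bigram_mle(sentences):
    # Count bigrams and their left-context words, then divide:
    # P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1})
    unigram = defaultdict(int)
    bigram = defaultdict(int)
    for sent in sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            unigram[prev] += 1
            bigram[(prev, cur)] += 1
    return {pair: c / unigram[pair[0]] for pair, c in bigram.items()}

corpus = ["I am Sam", "Sam I am", "I do not like green eggs and ham"]
probs = train_bigram_mle(corpus)
print(probs[("<s>", "I")])   # 2/3: two of the three sentences start with "I"
print(probs[("I", "am")])    # 2/3
print(probs[("am", "Sam")])  # 1/2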
Maximum Likelihood Estimates
The Maximum Likelihood Estimate of some parameter of a model M from a training set T is the estimate that maximizes the likelihood of the training set T given the model M.
Suppose the word Chinese occurs 400 times in a corpus of a million words (e.g. the Brown corpus).
What is the probability that a random word from some other text will be Chinese?
MLE estimate is 400/1,000,000 = .0004
This may be a bad estimate for some other corpus.
But it is the estimate that makes it most likely that Chinese will occur 400 times in a million-word corpus.
The maximum likelihood method (discrete distribution):
1. Write down the probability of each observation by using the model parameters
2. Write down the probability of all the data
3. Find the value of the parameter(s) that maximizes this probability
Maximum Likelihood
We want to estimate the probability, p, that individuals are infected with a certain kind of parasite.
Observed data: 6 of 10 sampled individuals are infected.
Likelihood function:
L(p) = Pr(Data | p) = p^6 (1 - p)^4
Find the value of the parameter that maximizes this probability.
Computing the MLE
Set the derivative to 0:
d/dp [p^6 (1 - p)^4] = 6p^5 (1 - p)^4 - 4p^6 (1 - p)^3 = p^5 (1 - p)^3 [6 - 10p] = 0
Solutions:
p = 0 (minimum)
p = 1 (minimum)
p = 0.6 (maximum)
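A quick numerical check of this derivation (a sketch, not from the slides): the likelihood p^6 (1-p)^4 should peak at p = 0.6.

# Grid search over p in [0, 1] in steps of 0.001 for the likelihood maximum.
best_p = max((i / 1000 for i in range(1001)),
             key=lambda p: p**6 * (1 - p)**4)
print(best_p)  # 0.6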
More examples: Berkeley Restaurant Project
can you tell me about any good cantonese restaurants close by
mid priced thai food is what i'm looking for
tell me about chez panisse
can you give me a listing of the kinds of food that are available
i'm looking for a good place to eat breakfast
when is caffe venezia open during the day
Raw bigram counts
Out of 9222 sentences
Raw bigram probabilities
Normalize by unigrams (divide by C(wn-1)):
Result:
Bigram estimates of sentence probabilities
P(<s> I want english food </s>) =
P(I|<s>) × P(want|I) × P(english|want) × P(food|english) × P(</s>|food)
= .000031
What kinds of knowledge?
P(english|want) = .0011
P(chinese|want) = .0065
P(to|want) = .66
P(eat|to) = .28
P(food|to) = 0
P(want|spend) = 0
P(i|<s>) = .25
Practical Issues
Compute in log space
Avoid underflow
Adding is faster than multiplying
log(p1 p2 p3 p4) = log(p1) + log(p2) + log(p3) + log(p4)
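A small Python sketch of this practice (illustrative only; the bigram probabilities below are made-up placeholders, not the estimates from the tables above):

import math

bigram_p = {("<s>", "i"): 0.25, ("i", "want"): 0.33,
            ("want", "food"): 0.05, ("food", "</s>"): 0.68}

def sentence_logprob(tokens, probs):
    # Sum log-probabilities instead of multiplying raw probabilities,
    # which avoids underflow for long sentences.
    words = ["<s>"] + tokens + ["</s>"]
    return sum(math.log(probs[(a, b)]) for a, b in zip(words, words[1:]))

lp = sentence_logprob(["i", "want", "food"], bigram_p)
print(lp, math.exp(lp))  # log-probability and the equivalent probability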
Shannon's Game
What if we turn these models around and use them to generate random sentences that are like the sentences from which the model was derived?
Jim Martin
The Shannon Visualization Method
Generate random sentences:
Choose a random bigram (<s>, w) according to its probability
Now choose a random bigram (w, x) according to its probability
And so on until we choose </s>
Then string the words together:
<s> I
    I want
      want to
        to eat
          eat Chinese
            Chinese food
              food </s>
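A minimal sketch of this generation procedure in Python (assuming bigram_p is any mapping from (previous word, word) pairs to probabilities, such as the toy model sketched earlier):

import random
from collections import defaultdict

def generate(bigram_p, max_len=20):
    # Shannon-style generation: repeatedly sample the next word from
    # P(w | previous word) until </s> is produced.
    by_context = defaultdict(list)
    for (prev, cur), p in bigram_p.items():
        by_context[prev].append((cur, p))
    word, out = "<s>", []
    for _ in range(max_len):
        options = by_context.get(word)
        if not options:       # no known continuation for this context
            break
        words, weights = zip(*options)
        word = random.choices(words, weights=weights)[0]
        if word == "</s>":
            break
        out.append(word)
    return " ".join(out)

# e.g. print(generate(probs)) with the toy bigram model trained earlier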
Approximating Shakespeare
Shakespeare as corpus
N = 884,647 tokens, V = 29,066
Shakespeare produced 300,000 bigram types out of V^2 = 844 million possible bigrams: so, 99.96% of the possible bigrams were never seen (have zero entries in the table)
Quadrigrams: what's coming out looks like Shakespeare because it is Shakespeare
The Wall Street Journal is not Shakespeare (no offense)
Lesson 1: the perils of overfitting
N-grams only work well for word prediction if the test corpus looks like the training corpus
In real life, it often doesn't
We need to train robust models, adapt to the test set, etc.
Train and Test Corpora
A language model must be trained on a large corpus of text to estimate good parameter values.
Model can be evaluated based on its ability to predict a high probability for a disjoint (held-out) test corpus (testing on the training corpus would give an optimistically biased estimate).
Ideally, the training (and test) corpus should be representative of the actual application data.
May need to adapt a general model to a small amount of new (in-domain) data by adding a highly weighted small corpus to the original training data.
Smoothing
 
Smoothing
Since there are a combinatorial number of possible word sequences, many rare (but not impossible) combinations never occur in training, so MLE incorrectly assigns zero to many parameters (aka sparse data).
If a new combination occurs during testing, it is given a probability of zero and the entire sequence gets a probability of zero (i.e. infinite perplexity).
In practice, parameters are smoothed (aka regularized) to reassign some probability mass to unseen events.
Adding probability mass to unseen events requires removing it from seen ones (discounting) in order to maintain a joint distribution that sums to 1.
Smoothing is like Robin Hood: steal from the rich and give to the poor (in probability mass).
Slide from Dan Klein
Laplace smoothing
Also called add-one smoothing
Just add one to all the counts!
Very simple
MLE estimate:
P(wi|wi-1) = c(wi-1, wi) / c(wi-1)
Laplace estimate:
PLaplace(wi|wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
Reconstructed counts:
c*(wi-1, wi) = (c(wi-1, wi) + 1) · c(wi-1) / (c(wi-1) + V)
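To make the add-one recipe concrete, a small Python sketch (an illustrative helper, not from the slides):

def laplace_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size):
    # Add-one (Laplace) smoothed estimate:
    # P(word | prev) = (c(prev, word) + 1) / (c(prev) + V)
    return (bigram_counts.get((prev, word), 0) + 1) / \
           (unigram_counts.get(prev, 0) + vocab_size)

# An unseen bigram now gets a small non-zero probability, 1 / (c(prev) + V).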
Laplace smoothed bigram counts
Berkeley Restaurant Corpus
Laplace-smoothed bigrams
Reconstituted counts
Note big change to counts
C(want to) went from 608 to 238!
P(to|want) from .66 to .26!
Discount d = c*/c
d for "chinese food" = .10: a 10x reduction!
So in general, Laplace is a blunt instrument.
But Laplace smoothing is not used for N-grams, as we have much better methods.
Despite its flaws, Laplace (add-k) is however still used to smooth other probabilistic models in NLP, especially:
For pilot studies
In domains where the number of zeros isn't so huge
Add-k
Add a small fraction k instead of 1, e.g. k = 0.01:
PAdd-k(wi|wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)
Even better: Bayesian unigram prior smoothing for bigrams
Maximum Likelihood Estimation:
P(w2|w1) = C(w1,w2) / C(w1)
Laplace Smoothing:
PLaplace(w2|w1) = (C(w1,w2) + 1) / (C(w1) + V)
Bayesian Prior Smoothing:
PPrior(w2|w1) = (C(w1,w2) + P(w2)) / (C(w1) + 1)
Lesson 2: zeros or not?
Zipf's Law:
A small number of events occur with high frequency
A large number of events occur with low frequency
You can quickly collect statistics on the high frequency events
You might have to wait an arbitrarily long time to get valid statistics on low frequency events
Result:
Our estimates are sparse! No counts at all for the vast bulk of things we want to estimate!
Some of the zeros in the table are really zeros. But others are simply low frequency events you haven't seen yet. After all, ANYTHING CAN HAPPEN!
How to address?
Answer: estimate the likelihood of unseen N-grams!
Slide from B. Dorr and J. Hirschberg
Zipf's law
The frequency f of a word is inversely proportional to its rank r: f ∝ 1/r, i.e. there is a constant k such that f · r = k.
Zipf's Law for the Brown Corpus (rank vs. frequency plot)
Zipf law: interpretation
Principle of least effort: both the speaker and the hearer in communication try to minimize effort:
Speakers tend to use a small vocabulary of common (shorter) words
Hearers prefer a large vocabulary of rarer, less ambiguous words
Zipf's law is the result of this compromise
Other laws…
Number of meanings m of a word obeys the law: m ∝ 1/f
Inverse relationship between frequency and length
Practical Issues
We do everything in log space
Avoid underflow
(also adding is faster than multiplying)
Language Modeling Toolkits
SRILM
http://www.speech.sri.com/projects/srilm/
IRSTLM
KenLM
Google N-Gram Release
Google Book N-grams
http://ngrams.googlelabs.com/
Google N-Gram Release
serve as the incoming 92
serve as the incubator 99
serve as the independent 794
serve as the index 223
serve as the indication 72
serve as the indicator 120
serve as the indicators 45
serve as the indispensable 111
serve as the indispensible 40
serve as the individual 234
http://googleresearch.blogspot.com/2006/08/all-our-n-gram-are-belong-to-you.html
Evaluation and Perplexity
 
Evaluation
Train parameters of our model on a training set.
How do we evaluate how well our model works?
Look at the model's performance on some new data.
This is what happens in the real world; we want to know how our model performs on data we haven't seen.
Use a test set: a dataset which is different from our training set.
Then we need an evaluation metric to tell us how well our model is doing on the test set.
One such metric is perplexity.
Evaluating N-gram models
Best evaluation for an N-gram:
Put model A in a task (language identification, speech recognizer, machine translation system)
Run the task, get an accuracy for A (how many languages identified correctly, or Word Error Rate, etc.)
Put model B in the task, get accuracy for B
Compare accuracy for A and B
Extrinsic evaluation
Language Identification task
Create an N-gram model for each language
Compute the probability of a given text under each model:
Plang1(text), Plang2(text), Plang3(text)
Select the language with highest probability:
lang = argmax_l P_l(text)
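A sketch of the selection step in Python (the per-language scoring functions are assumed to exist; the names here are hypothetical):

def identify_language(text, models):
    # models maps a language name to a function returning log P(text)
    # under that language's N-gram model; pick the argmax.
    return max(models, key=lambda lang: models[lang](text))

# e.g. models = {"en": en_logprob, "it": it_logprob, "fr": fr_logprob}
#      identify_language("the quick brown fox", models)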
Difficulty of extrinsic (in-vivo) evaluation of  N-gram models
Extrinsic evaluation is really time-consuming: it can take days to run an experiment.
So, as a temporary solution, in order to run experiments, to evaluate N-grams we often use an intrinsic evaluation, an approximation called perplexity.
But perplexity is a poor approximation unless the test data looks just like the training data.
So it is generally only useful in pilot experiments (generally not sufficient to publish).
Perplexity
The intuition behind perplexity as a measure is the notion of surprise.
How surprised is the language model when it sees the test set?
Where surprise is a measure of... "Gee, I didn't see that coming..."
The more surprised the model is, the lower the probability it assigned to the test set.
The higher the probability, the less surprised it was.
Perplexity
Measures how well a model "fits" the test data.
Uses the probability that the model assigns to the test corpus.
Normalizes for the number of words in the test corpus and takes the inverse.
Measures the weighted average branching factor in predicting the next word (lower is better).
Perplexity
Perplexity:
PP(W) = P(w1 w2 … wN)^(-1/N)
Chain rule:
PP(W) = [ Πi 1 / P(wi | w1 … wi-1) ]^(1/N)
For bigrams:
PP(W) = [ Πi 1 / P(wi | wi-1) ]^(1/N)
Minimizing perplexity is the same as maximizing probability
The best language model is one that best predicts an unseen test set
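A minimal sketch of computing bigram perplexity in log space (assuming prob(prev, word) returns a smoothed, non-zero estimate; otherwise a single unseen bigram makes the perplexity infinite):

import math

def bigram_perplexity(test_sentences, prob):
    # Perplexity = exp(-(1/N) * sum_i log P(w_i | w_{i-1})),
    # where N counts the predicted tokens (including </s>).
    log_sum, n = 0.0, 0
    for sent in test_sentences:
        tokens = ["<s>"] + sent.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            log_sum += math.log(prob(prev, cur))
            n += 1
    return math.exp(-log_sum / n)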
Perplexity as branching factor
How hard is the task of recognizing digits 0,1,2,3,4,5,6,7,8,9?
Perplexity: 10
Lower perplexity = better model
Model trained on 38 million words from
the Wall Street Journal (WSJ) using a
19,979 word vocabulary.
Evaluation on a disjoint set of 1.5 million
WSJ words.
Unknown Words
How to handle words in the test corpus that did not occur in the training data, i.e. out-of-vocabulary (OOV) words?
Train a model that includes an explicit symbol for an unknown word (<UNK>):
1. Choose a vocabulary in advance and replace other words in the training corpus with <UNK>, or
2. Replace the first occurrence of each word in the training data with <UNK>.
Unknown Words handling
Training of <UNK> probabilities:
Create a fixed lexicon L of size V
Any training word not in L is changed to <UNK>
Now we train its probabilities like a normal word
At decoding time:
In text input, use <UNK> probabilities for any word not seen in training
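A small sketch of the <UNK> mapping in Python (how the lexicon is chosen, e.g. the V most frequent training words, is an assumption here):

def apply_unk(tokens, lexicon):
    # Map any token outside the fixed lexicon L to the <UNK> symbol,
    # both when preparing training data and at decoding time.
    return [t if t in lexicon else "<UNK>" for t in tokens]

# apply_unk("the cat sat on the xylophone".split(), {"the", "cat", "sat", "on"})
# -> ['the', 'cat', 'sat', 'on', 'the', '<UNK>']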
Smoothing
 
Advanced LM stuff
Current best smoothing algorithm:
Kneser-Ney smoothing
Other stuff:
Interpolation
Backoff
Variable-length n-grams
Class-based n-grams
Clustering
Hand-built classes
Cache LMs
Topic-based LMs
Sentence mixture models
Skipping LMs
Parser-based LMs
Word Embeddings
Backoff and Interpolation
If we are estimating:
trigram P(z|x,y), but C(x,y,z) is zero
Use info from:
bigram P(z|y)
Or even:
unigram P(z)
How to combine the trigram/bigram/unigram info?
Backoff versus interpolation
Backoff: use trigram if you have it, otherwise bigram, otherwise unigram
Interpolation: mix all three
Backoff
Only use the lower-order model when data for the higher-order model is unavailable.
Recursively back off to weaker models until data is available:
PKatz(z|x,y) = P*(z|x,y) if C(x,y,z) > 0, otherwise α(x,y) · PKatz(z|y)
where P* is a discounted probability estimate that reserves mass for unseen events and the α's are back-off weights (see book for details).
Interpolation
Simple interpolation:
P̂(wn|wn-2, wn-1) = λ1 P(wn|wn-2, wn-1) + λ2 P(wn|wn-1) + λ3 P(wn), with λ1 + λ2 + λ3 = 1
Lambdas conditional on context:
P̂(wn|wn-2, wn-1) = λ1(wn-2, wn-1) P(wn|wn-2, wn-1) + λ2(wn-2, wn-1) P(wn|wn-1) + λ3(wn-2, wn-1) P(wn)
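A sketch of simple interpolation in Python (the component estimators and the lambda values are placeholders; in practice the lambdas are tuned on held-out data, as the next slide describes):

def interpolated_prob(z, y, x, p_uni, p_bi, p_tri, lambdas=(0.5, 0.3, 0.2)):
    # P(z | x, y) = l1*Ptri(z|x,y) + l2*Pbi(z|y) + l3*Puni(z),
    # with l1 + l2 + l3 = 1 so the result is still a probability.
    l1, l2, l3 = lambdas
    return l1 * p_tri(z, x, y) + l2 * p_bi(z, y) + l3 * p_uni(z)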
How to set the lambdas?
Use a held-out corpus.
Choose lambdas which maximize the probability of the held-out data:
i.e. fix the N-gram probabilities, then search for lambda values that, when plugged into the previous equation, give the largest probability for the held-out set.
Can use EM (Expectation Maximization) to do this search.
Training Data | Held-Out Data | Test Data
Intuition of backoff+discounting
How much probability to assign to all the zero trigrams?
Use Good-Turing or another discounting algorithm
How to divide that probability mass among different contexts?
Use the N-1 gram estimates
What do we do for the unigram words not seen in training?
Out Of Vocabulary (OOV) words
Problem for N-Grams: Long Distance Dependencies
Sometimes local context does not provide enough predictive clues, due to the presence of long-distance dependencies.
Syntactic dependencies:
"The man next to the large oak tree near the grocery store on the corner is tall."
"The men next to the large oak tree near the grocery store on the corner are tall."
Semantic dependencies:
"The bird next to the large oak tree near the grocery store on the corner flies rapidly."
"The man next to the large oak tree near the grocery store on the corner talks rapidly."
More complex models of language are needed to handle such dependencies.
ARPA format
Language Models
Language models assign a probability that a sentence is a legal string in a language.
They are useful as a component of many NLP systems, such as ASR, OCR, and MT.
Simple N-gram models are easy to train on unsupervised corpora and can provide useful estimates of sentence likelihood.
MLE gives inaccurate parameters for models trained on sparse data.
Smoothing techniques adjust parameter estimates to account for unseen (but not impossible) events.
Exercise
Write two programs:
train-unigram: creates a unigram model
test-unigram: reads a unigram model and calculates entropy and coverage for the test set
Test them on test/01-train-input.txt and test/01-test-input.txt
Train the model on data/wiki-en-train.word
Calculate entropy and coverage on data/wiki-en-test.word
Report your scores next week
Pseudo code: train-unigram
create a map counts
create a variable total_count = 0
for each line in the training_file:
    split line into an array of words
    append "</s>" to the end of words
    for each word in words:
        add 1 to counts[word]
        add 1 to total_count
open the model_file for writing
for each word, count in counts:
    probability = counts[word] / total_count
    print word, probability to model_file
Pseudo-code: test-unigram
Load model:
create a map probabilities
for each line in model_file:
    split line into w and P
    set probabilities[w] = P
Test and print:
for each line in test_file:
    split line into an array of words
    append "</s>" to the end of words
    for each w in words:
        add 1 to W
        set P = λ_unk / V
        if probabilities[w] exists:
            set P += λ_1 * probabilities[w]
        else:
            add 1 to unk
        add -log2 P to H
print "entropy = " + H/W
print "coverage = " + (W - unk)/W
Summary
Language Modeling (N-grams)
N-grams
The Chain Rule
The Shannon Visualization Method
Evaluation:
Perplexity
Smoothing:
Laplace (Add-1)
Add-k
Add-prior