N-Gram Models in Language Modelling

 
Language modelling using N-Grams
 
Corpora and Statistical Methods
Example task
 
The word-prediction task (Shannon game)

Given:
a sequence of words (the history)
a choice of next word

Predict:
the most likely next word

Generalises easily to other problems, such as predicting the POS of unknowns based on history.
Applications of the Shannon game
 
Automatic speech recognition (cf. tutorial 1):
given a sequence of possible words, estimate its probability
 
Context-sensitive spelling correction:
Many spelling errors are real words
He walked for miles in the dessert. (resp. desert)
 
Identifying such errors requires a global estimate of the
probability of a sentence.
Applications of N-gram models generally
 
POS Tagging:
predict the POS of an unknown word by looking at its history
 
Statistical parsing:
e.g. predict the group of words that together form a phrase
 
Statistical NL Generation:
given a semantic form to be realised as text, and several possible
realisations, select the most probable one.
A real-world example: Google’s “did you mean”

Google uses an n-gram model (based on sequences of characters, not words).

In this case, the sequence apple desserts is much more probable than apple deserts.
How it works
 
Documents provided by the search engine are added to:
An index (for fast retrieval)
A language model (based on probability of a sequence of characters)
 
A submitted query (“apple deserts”) can be modified (using character insertions, deletions, substitutions and transpositions) to yield a query that fits the language model better (“apple desserts”); a minimal sketch of this step follows below.

Outcome is a context-sensitive spelling correction:
“apple deserts” → “apple desserts”
“frod baggins” → “frodo baggins”
“frod” → “ford”
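A minimal Python sketch of this candidate-generation and rescoring idea (not Google's actual system): single_edits produces every query one character edit away, and a hypothetical character-bigram model, passed in as a dict of log-probabilities, is used to pick the best-scoring candidate.

```python
import math

ALPHABET = "abcdefghijklmnopqrstuvwxyz "

def single_edits(query):
    """All strings one character edit away from the query:
    deletions, transpositions, substitutions and insertions."""
    splits = [(query[:i], query[i:]) for i in range(len(query) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutions = [a + c + b[1:] for a, b in splits if b for c in ALPHABET]
    inserts = [a + c + b for a, b in splits for c in ALPHABET]
    return set(deletes + transposes + substitutions + inserts)

def char_bigram_logprob(s, logprob):
    """Score a string under a character-bigram model, given as a dict mapping
    character pairs to log-probabilities; unseen pairs get a small floor value."""
    padded = "#" + s  # '#' marks the start of the string
    return sum(logprob.get((a, b), math.log(1e-8)) for a, b in zip(padded, padded[1:]))

def correct(query, logprob):
    """Return whichever candidate (including the query itself) the model prefers."""
    candidates = single_edits(query) | {query}
    return max(candidates, key=lambda c: char_bigram_logprob(c, logprob))
```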
 
 
The noisy channel model
 
After Jurafsky and Martin (2009), Speech and Language Processing (2nd Ed). Prentice Hall, p. 198.
 
The assumptions behind n-gram models
 
The Markov Assumption
 
Markov models:
probabilistic models which predict the likelihood of a future unit based on limited history

in language modelling, this pans out as the local history assumption:
the probability of w_n depends on a limited number of prior words

utility of the assumption:
we can rely on a small n for our n-gram models (bigram, trigram)
long n-grams become exceedingly sparse
Probabilities become very small with long sequences
The structure of an n-gram model
 
The task can be re-stated in conditional probabilistic terms:

P(w_n | w_1 ... w_{n-1})

Limiting n under the Markov Assumption means:
greater chance of finding more than one occurrence of the sequence w_1 ... w_{n-1}
more robust statistical estimations

N-grams are essentially equivalence classes or bins
every unique n-gram is a type or bin
Structure of n-gram models (II)
 
If we construct a model where all histories with the same n-1 words are considered one class or bin, we have an (n-1)th order Markov Model.

Note terminology:
n-gram model = (n-1)th order Markov Model
Methodological considerations
 
We are often concerned with:
building an n-gram model
evaluating it

We therefore make a distinction between training and test data
You never test on your training data
If you do, you’re bound to get good results.
N-gram models tend to be overtrained, i.e.: if you train on a corpus C, your model will be biased towards expecting the kinds of events in C.
Another term for this: overfitting
Dividing the data
 
Given: a corpus of n units (words, sentences, … depending on the task)
A large proportion of the corpus is reserved for training.
A smaller proportion for testing/evaluation (normally 5-10%)
Held-out (validation) data
 
Held-out estimation:
during training, we sometimes estimate parameters for our model empirically

commonly used in smoothing (how much probability space do we want to set aside for unseen data?)

therefore, the training set is often split further into training data and validation data
normally, held-out data is 10% of the size of the training data
Development data
 
A common approach:
1. train an algorithm on the training data
   a. (estimate further parameters on held-out data if required)
2. evaluate it
3. re-tune it
4. go back to Step 1 until no further fine-tuning is necessary
5. carry out the final evaluation

For this purpose, it’s useful to have:
training data for step 1
a development set for steps 2-4
a final test set for step 5
Significance testing
 
Often, we compare the performance of our algorithm against some
baseline.
 
A single, raw performance score won’t tell us much. We need to test for significance (e.g. using a t-test).

Typical method:
Split the test set into several small test sets, e.g. 20 samples
evaluation carried out separately on each
mean and variance estimated based on the 20 different samples
test for a significant difference between the algorithm and a predefined baseline (a sketch follows below)
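A minimal sketch of this procedure, assuming SciPy is available and that model_scores and baseline_scores hold one evaluation score per small test set; a paired t-test is one reasonable choice here.

```python
from statistics import mean, variance
from scipy.stats import ttest_rel  # paired t-test over matched samples

def compare_to_baseline(model_scores, baseline_scores, alpha=0.05):
    """Compare per-sample scores of our algorithm against a baseline.

    Both arguments are lists with one evaluation score per small test set
    (e.g. the 20 samples described above)."""
    print("model:    mean=%.3f  var=%.5f" % (mean(model_scores), variance(model_scores)))
    print("baseline: mean=%.3f  var=%.5f" % (mean(baseline_scores), variance(baseline_scores)))
    t, p = ttest_rel(model_scores, baseline_scores)
    print("paired t-test: t=%.3f  p=%.4f  significant=%s" % (t, p, p < alpha))
    return p < alpha
```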
 
Size of n-gram models
 
In a corpus of vocabulary size N, the assumption is that any combination of n words is a potential n-gram.
For a bigram model: N^2 possible n-grams in principle
For a trigram model: N^3 possible n-grams
Size (continued)
 
Each n-gram in our model is a parameter used to estimate
probability of the next possible word.
too many parameters make the model unwieldy
too many parameters lead to data sparseness: most of them will have f
= 0 or 1
 
Most models stick to unigrams, bigrams or trigrams.
estimation can also combine different order models
Further considerations
 
When building a model, we tend to take into account the start-of-sentence symbol:
the girl swallowed a large green caterpillar
<s> the
the girl

Also typical to map all tokens w such that count(w) < k to <UNK>:
usually, tokens with frequency 1 or 2 are just considered “unknown” or “unseen”
this reduces the parameter space (a preprocessing sketch follows below)
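A small illustrative sketch of this preprocessing, assuming sentences are given as token lists; the threshold k and the end-of-sentence marker </s> are conventional choices, not prescribed by the slides.

```python
from collections import Counter

def preprocess(sentences, k=2):
    """Add a start-of-sentence marker and map rare tokens to <UNK>.

    `sentences` is a list of token lists; tokens seen fewer than k times are
    replaced by <UNK>, which reduces the parameter space. An end marker </s>
    is also added here, a common (though optional) convention."""
    counts = Counter(tok for sent in sentences for tok in sent)
    processed = []
    for sent in sentences:
        mapped = [tok if counts[tok] >= k else "<UNK>" for tok in sent]
        processed.append(["<s>"] + mapped + ["</s>"])
    return processed

# The sentence from the slide: the first two bigrams are (<s>, the) and (the, girl).
corpus = [["the", "girl", "swallowed", "a", "large", "green", "caterpillar"]]
print(preprocess(corpus, k=1)[0][:3])   # ['<s>', 'the', 'girl']
```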
 
Building models using Maximum Likelihood
Estimation
 
 
Maximum Likelihood Estimation Approach
 
Basic equation:

P(w_n | w_1 ... w_{n-1}) = P(w_1 ... w_n) / P(w_1 ... w_{n-1})
In a unigram model, this reduces to simple probability.
 
MLE models estimate probability using relative frequency.
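A minimal sketch of relative-frequency (MLE) estimation for a bigram model; the tiny corpus is purely illustrative.

```python
from collections import Counter

def mle_bigram_model(sentences):
    """MLE bigram model: P(w2 | w1) = count(w1 w2) / count(w1), i.e. relative frequency."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        unigrams.update(sent)
        bigrams.update(zip(sent, sent[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

corpus = [["<s>", "the", "girl", "swallowed", "a", "large", "green", "caterpillar"]]
model = mle_bigram_model(corpus)
print(model[("<s>", "the")])   # 1.0 -- 'the' always follows <s> in this tiny corpus
```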
Limitations of MLE
 
MLE builds the model that maximises the probability of the
training data.
 
Unseen events in the training data are assigned zero
probability.
Since n-gram models tend to be sparse, this is a real problem.
 
Consequences:
seen events are given more probability mass than they have
unseen events are given zero mass
Seen/unseen
[Diagram] A = probability mass of events in the training data; A’ = probability mass of events not in the training data.

The problem with MLE is that it distributes A’ among members of A.
 
The solution
 
Solution is to correct MLE estimation using a smoothing technique.

More on this in the next part.

But cf. Tutorial 1, which introduced the simplest method of smoothing known.
 
Adequacy of different order models
 
Manning/Schutze `99 report results for n-gram models of a corpus of the
novels of Austen.
 
Task: use n-gram model to predict the probability of a sentence in the test
data.
 
Models:
unigram: essentially a zero-context Markov model, uses only the probability of individual words
bigram
trigram
4-gram
 
Example test case
 
Training Corpus: five Jane Austen novels
 Corpus size = 617,091 words
 Vocabulary size = 14,585 unique types
 
Task: predict the next word of the trigram “inferior to ________”

From the test data (Persuasion):
“[In person, she was] inferior to both [sisters.]”
 
Selecting an n

Vocabulary (V) = 20,000 words

n              Number of bins (possible unique n-grams)
2 (bigrams)    400,000,000
3 (trigrams)   8,000,000,000,000
4 (4-grams)    1.6 × 10^17
 
Adequacy of unigrams
 
Problems with unigram models:
not entirely hopeless because most sentences contain a majority
of highly common words
ignores syntax completely:
P(In person she was inferior) = P(inferior was she person in)
 
Adequacy of bigrams
 
Bigrams:
improve situation dramatically
some unexpected results:
p(she|person) decreases compared to the unigram model. Though she is very common, it is uncommon after person.
 
Adequacy of trigrams
 
Trigram models will do brilliantly when they’re useful.
They capture a surprising amount of contextual variation in
text.
Biggest limitation:
most new trigrams in test data will not have been seen in training data.
 
Problem carries over to 4-grams, and is much worse!
 
Reliability vs. Discrimination
 
larger n: more information about the context of the specific instance (greater discrimination)

smaller n: more instances in training data, better statistical estimates (more reliability)
Backing off
 
Possible way of striking a balance between reliability and discrimination: a backoff model:
where possible, use a trigram
if the trigram is unseen, try and “back off” to a bigram model
if bigrams are unseen, try and “back off” to a unigram
 
Evaluating language models
 
Perplexity
 
Recall: Entropy is a measure of uncertainty:
high entropy = high uncertainty
high entropy = high uncertainty
 
perplexity:
if I’ve trained on a sample, how surprised am I when exposed to a
new sample?
a measure of uncertainty of a model on new data
Entropy as “expected value”

H(X) = −Σ_{x∈X} p(x) log p(x)
One way to think of the summation part is as a weighted average
of the information content.
 
We can view this average value as an “expectation”: the expected
surprise/uncertainty of our model.
Comparing distributions
 
We have a language model built from a sample. The sample is a probability distribution q over n-grams.
q(x) = the probability of some n-gram x in our model.

The sample is generated from a true population (“the language”) with probability distribution p.
p(x) = the probability of x in the true distribution.
Evaluating a language model
 
We’d like an estimate of how good our model is as a model of the language
i.e. we’d like to compare q to p

We don’t have access to p. (Hence, we can’t use KL-Divergence.)

Instead, we use our test data as an estimate of p.
Cross-entropy: basic intuition
 
Measure the number of bits needed to identify an event coming from p, if we code it according to q:
we draw sequences according to p, but we sum the log of their probability according to q.

This estimate is called the cross-entropy H(p,q).
Cross-entropy: p vs. q
 
Cross-entropy is an upper bound on the entropy of the true distribution p:
H(p) ≤ H(p,q)
if our model distribution (q) is good, H(p,q) ≈ H(p)
 
We estimate cross-entropy based on our test data.
Gives an estimate of the distance of our language model from
the distribution in the test sample.
Estimating cross-entropy
 
H(p,q) = −Σ_x p(x) log q(x)

(the probabilities p(x) come from the test set; the log-probabilities log q(x) come from the language model)
Perplexity
 
The perplexity of a language model with probability distribution q, relative to a test set with probability distribution p, is:

Perplexity(p,q) = 2^H(p,q)

A perplexity value of k (obtained on a test set) tells us:
our model is as surprised on average as it would be if it had to make k guesses for every sequence (n-gram) in the test data.
 
The lower the perplexity, the better the language model (the lower the surprise
on our test data).
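A small sketch of how cross-entropy and perplexity could be computed on test data, assuming the model q is a dict from n-grams to probabilities; the floor probability for unseen n-grams is an ad hoc stand-in for proper smoothing.

```python
import math

def cross_entropy(test_ngrams, q):
    """Per-n-gram cross-entropy of model q on a test sample.

    `test_ngrams` is the list of n-grams drawn from the test set (our stand-in
    for p); `q` maps an n-gram to its model probability. Unseen n-grams get a
    tiny floor value here just to keep the log finite -- a real model would
    rely on smoothing instead."""
    log_sum = sum(math.log2(q.get(ng, 1e-10)) for ng in test_ngrams)
    return -log_sum / len(test_ngrams)

def perplexity(test_ngrams, q):
    """Perplexity = 2 raised to the cross-entropy."""
    return 2 ** cross_entropy(test_ngrams, q)
```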
 
 
Perplexity example (Jurafsky & Martin, 2000,
p. 228)
 
 
Trained unigram, bigram and trigram models from a corpus
of news text (Wall Street Journal)
applied smoothing
38 million words
Vocab of 19,979 (low-frequency words mapped to UNK).
Computed perplexity on a test set of 1.5 million words.
J&M’s results

N-gram model    Perplexity
Unigram         962
Bigram          170
Trigram         109

Trigrams do best of all.
The value suggests the extent to which the model can fit the data in the test set.
Note: with unigrams, the model has to make lots of guesses!
 
Summary
 
Main point about Markov-based language models:
data sparseness is always a problem
smoothing techniques are required to estimate probability of
unseen events
 
Next part discusses more refined smoothing techniques than
those seen so far.
 
Smoothing (aka discounting) techniques
 
Part 2
 
Overview…
 
Smoothing methods:
Simple smoothing
Witten-Bell & Good-Turing estimation
Held-out estimation and cross-validation
 
Combining several n-gram models:
back-off models
Rationale behind smoothing
Sample frequencies: seen events with probability P; unseen events (including “grammatical zeroes”) with probability 0.

Real population frequencies: seen events (including the unseen events in our sample).

Smoothing approximates the real population frequencies from the sample frequencies. The result: lower probabilities for seen events (discounting), with the left-over probability mass distributed over unseens (smoothing).
 
 
Laplace’s law, Lidstone’s law and the Jeffreys-Perks law
 
Instances in the Training Corpus: “inferior to ________”
[Bar chart of observed frequencies F(w)]

Maximum Likelihood Estimate
[Chart of MLE estimates F(w); unknowns are assigned 0% probability mass]

Actual Probability Distribution
[Chart; the unknowns have non-zero probabilities in the real distribution]

LaPlace’s Law (Add-one smoothing)
[Chart of add-one-smoothed estimates F(w). NB: this method ends up assigning most probability mass to unseens]
Generalisation: Lidstone’s Law

P = (C(x) + λ) / (N + λV)

P = probability of a specific n-gram
C(x) = count of n-gram x in training data
N = total n-grams in training data
V = number of “bins” (possible n-grams)
λ = a small positive number

M.L.E.: λ = 0
LaPlace’s Law: λ = 1 (add-one smoothing)
Jeffreys-Perks Law: λ = ½ (see the sketch below)
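A minimal sketch of the Lidstone family of estimators; the counts used in the example calls are hypothetical, while V = 400,000,000 is the bigram bin count for a 20,000-word vocabulary from the earlier slide.

```python
def lidstone_prob(count_x, N, V, lam):
    """Lidstone estimate: P(x) = (C(x) + lambda) / (N + lambda * V).

    lam = 0   -> plain MLE
    lam = 1   -> LaPlace's law (add-one smoothing)
    lam = 0.5 -> Jeffreys-Perks law"""
    return (count_x + lam) / (N + lam * V)

# Hypothetical counts: N = 10,000 training bigrams over a 20,000-word vocabulary,
# so V = 400,000,000 possible bins (the figure from the "Selecting an n" slide).
print(lidstone_prob(0, 10_000, 400_000_000, 1))    # unseen bigram gets a tiny non-zero share
print(lidstone_prob(5, 10_000, 400_000_000, 0))    # MLE for a seen bigram: 5/10000 = 0.0005
```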
 
Jeffreys-Perks Law
[Chart of Jeffreys-Perks (λ = ½) estimates F(w)]
 
Objections to Lidstone’s Law
 
Need an a priori way to determine λ.

Predicts all unseen events to be equally likely.

Gives probability estimates linear in the M.L.E. frequency.
 
 
Witten-Bell discounting
Main intuition
 
A zero-frequency event can be thought of as an event which hasn’t happened (yet).
The probability of it happening can be estimated from the probability of something happening for the first time (i.e. the hapaxes in our corpus).

The count of things which are seen only once can be used to estimate the count of things that are never seen.
Witten-Bell method
 
1. Let T = the no. of times we saw an event for the first time = the no. of different n-gram types (bins) actually attested.

NB: T is the no. of types actually attested (unlike V, the no. of possible n-grams in our previous estimations).

2. Estimate the total probability mass of unseen n-grams:

T / (N + T)

where N is the no. of actual n-gram tokens and T the no. of actual types.

Basically, this is the MLE of the probability of a new type event occurring (“being seen for the first time”).
This is the total probability mass to be distributed among all zero events (unseens).
Witten-Bell method
 
3. Divide the total probability mass among all the zero n-grams.
Can distribute it equally.

4. Remove this probability mass from the non-zero n-grams (discounting):

p*(w) = c(w) / (N + T)
Witten-Bell vs. Add-one
 
If we work with unigrams, Witten-Bell and Add-one
smoothing give very similar results.
 
The difference is with n-grams for n>1.
 
Main idea: estimate the probability of an unseen bigram <w1,w2> from the probability of seeing a bigram starting with w1 for the first time.
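A minimal unigram sketch of steps 1-4 of the Witten-Bell method above (the bigram version would condition T and N on the first word, as described next).

```python
from collections import Counter

def witten_bell_unigram(tokens):
    """Witten-Bell smoothing for a unigram model (sketch).

    N = number of tokens, T = number of distinct types seen.
    The total mass reserved for unseen events is T / (N + T); each seen
    type w gets the discounted estimate count(w) / (N + T)."""
    counts = Counter(tokens)
    N, T = sum(counts.values()), len(counts)
    unseen_mass = T / (N + T)
    seen_probs = {w: c / (N + T) for w, c in counts.items()}
    return seen_probs, unseen_mass

probs, unseen = witten_bell_unigram("the girl swallowed a large green caterpillar the girl".split())
print(round(sum(probs.values()) + unseen, 10))   # sanity check: the total mass is 1.0
```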
Witten-Bell with bigrams

Generalised total probability mass estimate, conditioned on the first word:

Σ_{w: c(w1,w) = 0} p*(w | w1) = T(w1) / (N(w1) + T(w1))

where T(w1) = the no. of bigram types beginning with w1, and N(w1) = the no. of bigram tokens beginning with w1. This is the estimated total probability of unseen bigrams starting with w1.
Witten-Bell with bigrams
 
Non-zero bigrams get discounted as before, but again conditioning on the history:

p*(w | w1) = c(w1, w) / (N(w1) + T(w1))

Note: Witten-Bell won’t assign the same probability mass to all unseen n-grams.
The amount assigned will depend on the first word in the bigram (the first n-1 words in the n-gram).
 
 
 
Good-Turing Frequency Estimation
Good-Turing method
 
Introduced by Good (1953), but partly attributed to Alan
Turing
work carried out at Bletchley Park during WWII
 
“Simple Good-Turing” method (Gale and Sampson 1995)
 
Main idea:
re-estimate amount of probability mass assigned to low-frequency or
zero n-grams based on the number of n-grams (types) with higher
frequencies
 
Rationale
 
Given:
the sample frequency of a type (n-gram, aka bin, aka equivalence class)

GT provides:
an estimate of the true population frequency of a type
an estimate of the total probability of unseen types in the population
Ingredients
 
the sample frequency C(x) of an n-gram x in a corpus of size N with vocabulary size V

the no. of n-gram types with frequency C, written T_C

C*(x): the estimated true population frequency of an n-gram x with sample frequency C(x)
N.B. in a perfect sample, C(x) = C*(x)
in real life, C*(x) < C(x) (i.e. the sample overestimates the true frequency)
Some background
 
Suppose:
we had access to the true population probability of our n-grams
we treat each occurrence of an n-gram as a Bernoulli trial: either the n-gram is x, or not
i.e. a binomial assumption

Then, we could calculate the expected no. of types with frequency C, E(T_C) (the expected frequency of frequency), where:
T_C = the no. of n-gram types with frequency C
N = the total no. of n-grams
Background continued
 
Given an estimate of E(T_C), we could then calculate C*.
Fundamental underlying theorem:

C* = (C + 1) × E(T_{C+1}) / E(T_C)

Note: this makes the estimated (“true”) frequency C* a function of the expected number of types with frequency C+1.
Like Witten-Bell, it makes the adjusted count of zero-frequency events dependent on events of frequency 1.
Background continued
 
We can use the above to calculate adjusted frequencies directly.
Often, though, we want to calculate the total “missing probability mass” for zero-count n-grams (the unseens):

P(unseen) = T_1 / N

where:
T_1 is the number of types with frequency 1
N is the total number of items seen in training
Example of readjusted counts
 
From: Jurafsky & Martin 2009
Examples are bigram counts from two corpora.
A little problem
 
 
 
 
The GT theorem assumes that we know the expected population count of types!
We’ve assumed that we get this from a corpus, but this, of course, is not the case.

Secondly, T_{C+1} will often be zero! For example, it’s quite possible to find several n-grams with frequency 100, and no n-grams with frequency 101!
Note that this is more typical for high frequencies than low ones.
Low frequencies and gaps
 
Low C: linear trend.
Higher C: angular discontinuity.
Frequencies in the corpus display “jumps”, and so do frequencies of frequencies.
This implies the presence of gaps at higher frequencies.

[Plot of T_C (log10 frequency of frequency) against C (log10 frequency), after Gale and Sampson 1995]
Possible solution
 
1. Use Good-Turing for n-grams with corpus frequency less than some constant k (typically, k = 5).
Low-frequency types are numerous, so GT is reliable.
High-frequency types are assumed to be near the “truth”.

2. To avoid gaps (where T_{C+1} = 0), empirically estimate a function S(C) that acts as a proxy for E(T_C).
Proxy function for gaps
 
For any sample frequency C, let:

S_C = T_C / (0.5 × (C″ − C′))

where:
C″ is the next highest non-zero frequency
C′ is the previous non-zero frequency

[Plot of log10 S_C against log10 C, after Gale and Sampson 1995]
Gale and Sampson’s combined proposal
 
For low frequencies (< k), use the standard equation, assuming E(T_C) = T_C:

C* = (C + 1) × T_{C+1} / T_C

If we have gaps (i.e. T_C = 0), we use our proxy function S(C) for T_C, obtained through linear regression to fit the log-log curve:

C* = (C + 1) × S(C+1) / S(C)

And for high frequencies, we can assume that C* = C.

Finally, estimate the probability of the n-gram:

P_GT(x) = C*(x) / N
GT Estimation: Final step
 
GT gives approximations to probabilities.
Re-estimated probabilities of n-grams won’t sum to 1
necessary to re-normalise
Gale/Sampson 1995: rescale the estimates for seen n-grams so that, together with the total unseen mass T_1/N, they sum to 1 (a code sketch of the low-frequency re-estimation follows below).
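A small sketch of the low-frequency part of this recipe, assuming no gaps below k (so T_{C+1} > 0 there) and skipping the log-log regression and the final re-normalisation.

```python
from collections import Counter

def good_turing_adjusted_counts(ngram_counts, k=5):
    """Good-Turing re-estimation (sketch).

    For counts C < k uses C* = (C + 1) * T_{C+1} / T_C, falling back to the raw
    count if T_{C+1} happens to be zero; for C >= k keeps C* = C. Also returns
    the total probability mass reserved for unseen n-grams, T_1 / N. The log-log
    regression (the S(C) proxy) and the final re-normalisation are omitted."""
    N = sum(ngram_counts.values())
    T = Counter(ngram_counts.values())          # T[C] = no. of types with frequency C
    adjusted = {}
    for ngram, c in ngram_counts.items():
        if c < k and T.get(c + 1, 0) > 0:
            adjusted[ngram] = (c + 1) * T[c + 1] / T[c]
        else:
            adjusted[ngram] = c
    unseen_mass = T.get(1, 0) / N
    return adjusted, unseen_mass
```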
 
A final word on GT smoothing
 
In practice, GT is very seldom used on its own.

Most frequently, we use GT with backoff, about which, more later...
 
 
Held-out estimation & cross-validation
 
Held-out estimation: General idea
 
“hold back” some training data
 
create our language model
 
compare, for each n-gram (w_1 … w_n):
C_t: estimated frequency of the n-gram based on the training data
C_h: frequency of the n-gram in the held-out data
Held-out estimation
 
Define Tot_C as:
the total no. of times that n-grams with frequency C in the training corpus actually occurred in the held-out data

Re-estimate the probability (a sketch follows below):

P_ho(x) = Tot_C / (T_C × N_ho)

where C = C(x) is the training frequency of x, T_C is the no. of n-gram types with training frequency C, and N_ho is the total no. of n-grams in the held-out data.
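A minimal sketch of held-out estimation as defined above, assuming both the training and held-out data are given as lists of n-grams.

```python
from collections import Counter

def held_out_probs(train_ngrams, heldout_ngrams):
    """Held-out estimation (sketch): P(x) = Tot_C / (T_C * N_ho), where C is the
    training frequency of x, T_C the number of types with that training frequency,
    and Tot_C the total number of times those types occur in the held-out data."""
    train_counts = Counter(train_ngrams)
    heldout_counts = Counter(heldout_ngrams)
    N_ho = len(heldout_ngrams)

    T = Counter(train_counts.values())     # T[C] = no. of types with training frequency C
    tot = Counter()                        # tot[C] = Tot_C
    for ngram, c in train_counts.items():
        tot[c] += heldout_counts[ngram]

    return {ngram: tot[c] / (T[c] * N_ho) for ngram, c in train_counts.items()}
```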
 
Cross-validation
 
Problem with held-out estimation:
our training set is smaller
 
Way around this:
divide training data into training + validation data (roughly
equal sizes)
use each half first as training then as validation (i.e. train twice)
take a mean
Cross-Validation (a.k.a. deleted estimation)

Use training and validation data:

Split the training data into two halves, A and B:
Model 1: train on A, validate on B
Model 2: train on B, validate on A
Combine models 1 and 2 to give the final model.
Cross-Validation
 
Combined estimate (arithmetic mean), sketched below:

P_del(x) = ½ (P_1(x) + P_2(x))

where P_1 and P_2 are the held-out estimates from the two train/validate splits.
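A small sketch of the two-way scheme above, reusing the held_out_probs function from the previous sketch; combining by a plain arithmetic mean follows the slide's description.

```python
def deleted_estimation_probs(part_a, part_b):
    """Two-way cross-validation (sketch): train on A and validate on B, then
    swap, and combine the two held-out estimates by their arithmetic mean.
    Reuses held_out_probs from the previous sketch."""
    model_1 = held_out_probs(part_a, part_b)   # train on A, validate on B
    model_2 = held_out_probs(part_b, part_a)   # train on B, validate on A
    ngrams = set(model_1) | set(model_2)
    return {ng: (model_1.get(ng, 0.0) + model_2.get(ng, 0.0)) / 2 for ng in ngrams}
```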
 
 
Combining estimators: backoff and interpolation
 
The rationale
 
We would like to balance between reliability and
discrimination:
use trigram where useful
otherwise back off to bigram, unigram
 
How can you develop a model to utilize different length n-grams as appropriate?
Interpolation vs. Backoff
 
 
Interpolation: compute probability of an n-gram as a function
of:
The n-gram itself
All lower-order n-grams
Probabilities are linearly interpolated.
Lower-order n-grams are always used.
 
Backoff:
If n-gram exists in model, use that
Else fall back to lower order n-grams
Simple interpolation: trigram example
 
Combine all estimates, weighted by a factor (a sketch follows below):

P_interp(w_n | w_{n-2} w_{n-1}) = λ1 P(w_n | w_{n-2} w_{n-1}) + λ2 P(w_n | w_{n-1}) + λ3 P(w_n)

All parameters λ_i should sum to 1:

λ1 + λ2 + λ3 = 1

NB: we have different interpolation parameters for the various n-gram sizes.
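A minimal sketch of simple linear interpolation; the probability tables and the default lambda values are illustrative assumptions (in practice the lambdas are tuned on held-out data, as discussed below).

```python
def interpolated_trigram_prob(w1, w2, w3, p_uni, p_bi, p_tri,
                              lambdas=(0.5, 0.3, 0.2)):
    """Linear interpolation of trigram, bigram and unigram estimates.

    p_tri, p_bi, p_uni are dicts of (smoothed or MLE) probabilities keyed by
    (w1, w2, w3), (w2, w3) and w3 respectively. The lambda weights must sum
    to 1; the defaults here are arbitrary illustrative values."""
    l3, l2, l1 = lambdas
    assert abs(sum(lambdas) - 1.0) < 1e-9
    return (l3 * p_tri.get((w1, w2, w3), 0.0)
            + l2 * p_bi.get((w2, w3), 0.0)
            + l1 * p_uni.get(w3, 0.0))
```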
More sophisticated version
 
Suppose we have the trigrams:
(the dog barked)
(the puppy barked)

Suppose (the dog) occurs several times in our corpus, but not (the puppy).

In our interpolation, we might want to weight trigrams of the form (the dog _) more than (the puppy _) (because the former is composed of a more reliable bigram).

Rather than using the same parameter for all trigrams, we could condition on the initial bigram.
 
Sophisticated interpolation: trigram example

Combine all estimates, weighted by factors that depend on the context:

P_interp(w_n | w_{n-2} w_{n-1}) = λ1(w_{n-2} w_{n-1}) P(w_n | w_{n-2} w_{n-1}) + λ2(w_{n-2} w_{n-1}) P(w_n | w_{n-1}) + λ3(w_{n-2} w_{n-1}) P(w_n)
Where do parameters come from?
 
Typically:
We estimate counts from training data.
We estimate parameters from held-out data.
The lambdas are chosen so that they maximise the likelihood on
the held-out data.
 
Often, the expectation maximisation (EM) algorithm is used
to discover the right values to plug into the equations.
(more on this later)
 
Backoff
 
Recall that backoff models only use lower order n-grams
when the higher order one is unavailable.
 
Best known model by Katz (1987).
Uses backoff with smoothed probabilities
Smoothed probabilities obtained using Good-Turing estimation.
Backoff: trigram example
 
Backoff estimate (a simplified sketch follows below):

P_katz(w_n | w_{n-2} w_{n-1}) =
    P*(w_n | w_{n-2} w_{n-1})                   if C(w_{n-2} w_{n-1} w_n) > 0
    α(w_{n-2} w_{n-1}) × P_katz(w_n | w_{n-1})  otherwise

That is:
If the trigram has count > 0, we use the smoothed (P*) estimate
If not, we recursively back off to lower orders, weighting with a parameter (alpha)
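A skeleton of the backoff control flow only; it uses a single fixed penalty in place of the P* discounts and the properly computed alphas described on the following slides, so it returns scores rather than a normalised distribution.

```python
def backoff_score(w1, w2, w3, p_tri, p_bi, p_uni, penalty=0.4):
    """Skeleton of the backoff control flow: use the trigram if it was seen,
    otherwise fall back to the bigram, then to the unigram.

    A single fixed penalty stands in for the properly computed alpha weights
    and P* discounts, so the returned values are scores, not a normalised
    probability distribution."""
    if (w1, w2, w3) in p_tri:
        return p_tri[(w1, w2, w3)]
    if (w2, w3) in p_bi:
        return penalty * p_bi[(w2, w3)]
    return penalty * penalty * p_uni.get(w3, 0.0)
```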
 
Backoff vs. Simple smoothing
 
With Good-Turing smoothing, we typically end up with the “leftover” probability mass that is distributed equally among the unseens.
So GT tells us how much leftover probability there is.

Backoff gives us a better way of distributing this mass among unseen trigrams, by relying on the counts of their component bigrams and unigrams.
So backoff tells us how to divide that leftover probability.
Why we need those alphas
 
If we rely on true probabilities, then for a given n-gram window the probabilities of all possible continuations sum to 1:

Σ_{w3} P(w3 | w1 w2) = 1

But if we back off to a lower-order model when the trigram probability is 0, we’re adding extra probability mass, and the sum will now exceed 1.
We therefore need:
P* to discount the original MLE estimate (P)
alpha parameters to ensure that the probability from the lower-order n-grams sums up to exactly the amount we discounted in P*.
Computing the alphas -- I
 
Recall: we have C(w1 w2 w3) = 0.
Let β(w1 w2) represent the amount of probability mass left over when we discount the seen trigrams that continue (w1 w2):

β(w1 w2) = 1 − Σ_{w3: C(w1 w2 w3) > 0} P*(w3 | w1 w2)

The MLE probabilities P of the seen trigram continuations of (w1 w2) sum to 1. The smoothed probabilities P* sum to less than 1. We’re taking the remainder.
Computing the alphas -- II
 
We now compute alpha:

α(w1 w2) = β(w1 w2) / Σ_{w3: C(w1 w2 w3) = 0} P_katz(w3 | w2)

The denominator sums over all unseen trigrams involving our bigram. We distribute the remaining mass β(w1 w2) over all those trigrams.
What about unseen bigrams?
 
So what happens if even (w1 w2) in (w1 w2 w3) has count zero?

P_katz(w3 | w1 w2) = P_katz(w3 | w2)

I.e. we fall back to an even lower order. Moreover, the smoothed trigram estimate is zero:

P*(w3 | w1 w2) = 0

And the full mass is passed down:

β(w1 w2) = 1
Problems with Backing-Off
 
Suppose (w2 w3) is common but the trigram (w1 w2 w3) is unseen.

This may be a meaningful gap, rather than a gap due to chance and scarce data
i.e., a “grammatical null”

We may not want to back off to the lower-order probability
in this case, p = 0 is accurate!
 
References
 
Gale, W. A., and Sampson, G. (1995). Good-Turing frequency estimation without tears. Journal of Quantitative Linguistics, 2: 217–237.