Introduction to NLP: Smoothing and Interpolation

Smoothing

If the vocabulary size is |V| = 1M, there are too many parameters to estimate even for a unigram model, let alone bigram or trigram models.
MLE assigns a probability of 0 to unseen (yet not impossible) data.
Smoothing (regularization) reassigns some probability mass to unseen data.
 
Smoothing

How to model novel words? Or novel bigrams?
Distribute some of the probability mass to allow for novel events.
Add-one (Laplace) smoothing:
Bigrams: P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + |V|)
This method reassigns too much probability mass to unseen events.
Possible to do add-k instead of add-one.
Neither works well in practice (a minimal sketch follows below).
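As a concrete illustration of the add-one/add-k formula above, here is a minimal Python sketch; the helper names (bigram_counts, p_add_k) and the toy sentence are my own, not from the slides. Setting k = 1 gives Laplace smoothing; smaller values of k shift less mass to unseen bigrams.

```python
from collections import defaultdict

def bigram_counts(tokens):
    """Collect bigram and unigram (history) counts from a token list."""
    bi, uni = defaultdict(int), defaultdict(int)
    for prev, cur in zip(tokens, tokens[1:]):
        bi[(prev, cur)] += 1
        uni[prev] += 1
    return bi, uni

def p_add_k(w, prev, bi, uni, vocab_size, k=1.0):
    """Add-k estimate: P(w | prev) = (c(prev, w) + k) / (c(prev) + k * |V|).
    k = 1 gives add-one (Laplace) smoothing."""
    return (bi[(prev, w)] + k) / (uni[prev] + k * vocab_size)

tokens = "the cat sat on the mat".split()
bi, uni = bigram_counts(tokens)
V = len(set(tokens))
print(p_add_k("cat", "the", bi, uni, V))  # seen bigram: (1 + 1) / (2 + 5)
print(p_add_k("dog", "the", bi, uni, V))  # unseen bigram: still nonzero, 1 / 7
```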
Advanced Smoothing
 
Good-Turing
Try to predict the probabilities of unseen events based on the probabilities of seen events
Kneser-Ney
Class-based n-grams
Example
 
Corpus:
cat dog cat rabbit mouse fish fish mouse hamster hamster fish turtle tiger cat rabbit cat dog dog fox lion
What is the probability the next item is “mouse”?
P_MLE(mouse) = 2/20
What is the probability the next item is “elephant” or some other previously unseen animal?
Trickier. Is it 0/20?
Note that P(the next animal is unseen) > 0.
Therefore we need to discount the probabilities of the animals that have already been seen, so the smoothed estimate of P(mouse) must be less than P_MLE(mouse) = 2/20.
Good-Turing

Idea
Estimate the frequencies of unseen events based on the events seen once.
Actual counts c
N_c = number of n-grams that occur exactly c times in the corpus
N = total number of n-grams in the corpus
Revised counts c*
c* = (c+1) N_{c+1} / N_c
Example
 
Corpus:
cat dog cat rabbit mouse fish fish mouse hamster hamster fish turtle tiger cat rabbit cat dog dog fox lion
Counts
C(cat) = 4
C(dog) = 3
C(fish) = 3
C(mouse) = 2
C(rabbit) = 2
C(hamster) = 2
C(fox) = 1
C(turtle) = 1
C(tiger) = 1
C(lion) = 1
N_1 = 4, N_2 = 3, N_3 = 2, N_4 = 1
Example (cont’d)
 
N_1 = 4, N_2 = 3, N_3 = 2, N_4 = 1
Revised counts c* = (c+1) N_{c+1} / N_c (or a best-fit power-law estimate if the counts are unreliable)
C*(cat) = 4 (unchanged, since N_5 = 0 the formula cannot be applied to the highest count)
C*(dog) = (3+1) x 1/2 = 2
C*(fish) = (3+1) x 1/2 = 2
C*(mouse) = (2+1) x 2/3 = 2
C*(rabbit) = (2+1) x 2/3 = 2
C*(hamster) = (2+1) x 2/3 = 2
C*(fox) = (1+1) x 3/4 = 6/4
C*(turtle) = (1+1) x 3/4 = 6/4
C*(tiger) = (1+1) x 3/4 = 6/4
C*(lion) = (1+1) x 3/4 = 6/4
P*(“elephant” or any other unseen animal) = N_1/N = 4/20 (the probability mass reserved for unseen events)
Note that the resulting probabilities do not necessarily sum to 1, so they still need to be normalized.
P*(lion) = (6/4) / 20 = 6/80
More information: http://beyondexpectations.quora.com/An-Intuitive-Explanation-of-Good-Turing-Smoothing
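As a sanity check on the hand computation above, here is a minimal Python sketch of Good-Turing revised counts; the function name good_turing_counts and the fallback for the highest count are my own illustrative choices, not from the slides. When N_{c+1} = 0 the sketch simply keeps the original count, which matches C*(cat) = 4 above and stands in for the best-fit power-law estimate mentioned on the slide.

```python
from collections import Counter

def good_turing_counts(tokens):
    """Good-Turing revised counts c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of types seen exactly c times.
    When N_{c+1} = 0 (the largest observed count) the original count is kept."""
    counts = Counter(tokens)                    # c for each observed type
    n_c = Counter(counts.values())              # N_c: count-of-counts
    n_total = sum(counts.values())              # N: total number of tokens

    revised = {}
    for word, c in counts.items():
        revised[word] = (c + 1) * n_c[c + 1] / n_c[c] if n_c[c + 1] > 0 else c
    p_unseen = n_c[1] / n_total                 # mass reserved for unseen events
    return revised, p_unseen

corpus = ("cat dog cat rabbit mouse fish fish mouse hamster hamster "
          "fish turtle tiger cat rabbit cat dog dog fox lion").split()
revised, p_unseen = good_turing_counts(corpus)
print(revised["dog"], revised["mouse"], revised["fox"], p_unseen)  # 2.0 2.0 1.5 0.2
```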
Dealing with Sparse Data
 
Two main techniques are used:
Backoff
Interpolation
Backoff
 
Going back to the lower-order n-gram model if the higher-order model is sparse (e.g., frequency <= 1)
Learning the parameters from a development data set (a minimal sketch follows below)
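The slide does not pin down a specific backoff scheme, so below is a minimal Python sketch of simple (stupid-backoff-style) backoff; the helper names (ngram_counts, backoff_score), the scaling factor alpha = 0.4, the sparsity threshold min_count = 2 (matching the slide's "frequency <= 1" example), and the toy corpus are my own assumptions. Katz backoff would additionally discount the higher-order counts (e.g., with Good-Turing) so that the backed-off mass forms a proper probability distribution.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Counts of all n-grams up to order max_n, keyed by order."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

def backoff_score(w, history, counts, alpha=0.4, min_count=2):
    """Use the higher-order relative frequency c(h, w)/c(h) only if the n-gram count
    reaches min_count (the slide treats frequency <= 1 as sparse); otherwise back off
    to a shorter history, scaled by alpha. This sketch omits the discounting and
    normalization of Katz backoff, so it returns a score, not a true probability."""
    if not history:
        return counts[1][(w,)] / sum(counts[1].values())   # unigram relative frequency
    hist = tuple(history)
    n = len(hist) + 1
    if counts[n][hist + (w,)] >= min_count:
        return counts[n][hist + (w,)] / counts[len(hist)][hist]
    return alpha * backoff_score(w, list(history)[1:], counts, alpha, min_count)

tokens = "the cat sat on the mat the cat sat on the rug".split()
counts = ngram_counts(tokens, 3)
print(backoff_score("sat", ["the", "cat"], counts))   # "the cat sat" seen twice: 2/2 = 1.0
print(backoff_score("rug", ["the", "cat"], counts))   # sparse: backs off twice, alpha^2 * c(rug)/N
```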
Interpolation
 
If P’(w_i | w_{i-1}, w_{i-2}) is sparse:
Use λ_1 P’(w_i | w_{i-1}, w_{i-2}) + λ_2 P’(w_i | w_{i-1}) + λ_3 P’(w_i)
Ensure that λ_1 + λ_2 + λ_3 = 1 and 0 ≤ λ_1, λ_2, λ_3 ≤ 1
Better than backoff
Estimate the hyper-parameters λ_1, λ_2, λ_3 from held-out data (or using EM), e.g., using 5-fold cross-validation (a sketch follows below)
See [Chen and Goodman 1998] for more details
Software:
http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
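Here is a minimal Python sketch of linear interpolation with a brute-force held-out search for the λs; the helper names (mle_models, p_interp, pick_lambdas), the grid step, and the toy corpora are my own illustrative assumptions. In practice the λs are estimated with EM or cross-validation as the slide says, and toolkits such as the CMU SLM toolkit linked above provide this.

```python
import math
from collections import Counter

def mle_models(tokens):
    """MLE unigram/bigram/trigram estimators built from raw counts."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)
    p_uni = lambda w: uni[w] / n
    p_bi = lambda w, h1: bi[(h1, w)] / uni[h1] if uni[h1] else 0.0
    p_tri = lambda w, h2, h1: tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
    return p_uni, p_bi, p_tri

def p_interp(w, h2, h1, models, lambdas):
    """λ1 P'(w|h2,h1) + λ2 P'(w|h1) + λ3 P'(w), with λ1 + λ2 + λ3 = 1."""
    p_uni, p_bi, p_tri = models
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, h2, h1) + l2 * p_bi(w, h1) + l3 * p_uni(w)

def pick_lambdas(models, heldout, step=0.1):
    """Grid-search the λs on held-out trigrams, maximizing log-likelihood
    (EM or cross-validation would be used in practice)."""
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1 in grid:
        for l2 in grid:
            l3 = 1.0 - l1 - l2
            if l3 < -1e-9:
                continue
            ll = 0.0
            for h2, h1, w in zip(heldout, heldout[1:], heldout[2:]):
                p = p_interp(w, h2, h1, models, (l1, l2, max(l3, 0.0)))
                ll += math.log(p) if p > 0 else -1e9  # penalize zero probability
            if ll > best_ll:
                best, best_ll = (l1, l2, max(l3, 0.0)), ll
    return best

train = "the cat sat on the mat the cat slept on the mat".split()
heldout = "the cat sat on the mat".split()
models = mle_models(train)
lambdas = pick_lambdas(models, heldout)
print(lambdas, p_interp("sat", "the", "cat", models, lambdas))
```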
 
[slide from Michael Collins]