Introduction to NLP: Smoothing and Interpolation

Smoothing

If the vocabulary size is |V| = 1M, there are too many parameters to estimate even for a unigram model, let alone bigram or trigram models.
MLE assigns a probability of 0 to unseen (yet not impossible) data.
Smoothing (regularization) reassigns some probability mass to unseen data.
 
Smoothing

How to model novel words? Or novel bigrams?
Distribute some of the probability mass to allow for novel events.
Add-one (Laplace) smoothing:
Bigrams: P(w_i | w_{i-1}) = (c(w_{i-1}, w_i) + 1) / (c(w_{i-1}) + |V|)
This method reassigns too much probability mass to unseen events.
Possible to do add-k instead of add-one.
Neither works well in practice (a minimal sketch follows below).
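As a concrete illustration of the add-one/add-k formula above, here is a minimal Python sketch; the helper names (bigram_counts, p_add_k) and the toy sentence are my own, not from the slides. Setting k = 1 gives Laplace smoothing; smaller values of k shift less mass to unseen bigrams.

```python
from collections import defaultdict

def bigram_counts(tokens):
    """Collect bigram and unigram (history) counts from a token list."""
    bi, uni = defaultdict(int), defaultdict(int)
    for prev, cur in zip(tokens, tokens[1:]):
        bi[(prev, cur)] += 1
        uni[prev] += 1
    return bi, uni

def p_add_k(w, prev, bi, uni, vocab_size, k=1.0):
    """Add-k estimate: P(w | prev) = (c(prev, w) + k) / (c(prev) + k * |V|).
    k = 1 gives add-one (Laplace) smoothing."""
    return (bi[(prev, w)] + k) / (uni[prev] + k * vocab_size)

tokens = "the cat sat on the mat".split()
bi, uni = bigram_counts(tokens)
V = len(set(tokens))
print(p_add_k("cat", "the", bi, uni, V))  # seen bigram: (1 + 1) / (2 + 5)
print(p_add_k("dog", "the", bi, uni, V))  # unseen bigram: still nonzero, 1 / 7
```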
Advanced Smoothing
 
Good-Turing
Try to predict the probabilities of unseen events based on the probabilities of seen events
Kneser-Ney
Class-based n-grams
Example
 
Corpus:
cat dog cat rabbit mouse fish fish mouse hamster hamster fish turtle tiger cat rabbit cat dog dog fox lion
What is the probability the next item is “mouse”?
P_MLE(mouse) = 2/20
What is the probability the next item is “elephant” or some other previously unseen animal?
Trickier. Is it 0/20?
Note that P(the next animal is unseen) > 0.
Therefore we need to discount the probabilities of the animals that have already been seen, so the smoothed estimate of P(mouse) must be less than P_MLE(mouse) = 2/20.
Good-Turing

Idea
Estimate the frequencies of unseen events based on the events seen once.
Actual counts c
N_c = number of n-grams that occur exactly c times in the corpus
N = total number of n-grams in the corpus
Revised counts c*
c* = (c+1) N_{c+1} / N_c
Example
 
Corpus:
cat dog cat rabbit mouse fish fish mouse hamster hamster fish turtle tiger cat rabbit cat dog dog fox lion
Counts
C(cat) = 4
C(dog) = 3
C(fish) = 3
C(mouse) = 2
C(rabbit) = 2
C(hamster) = 2
C(fox) = 1
C(turtle) = 1
C(tiger) = 1
C(lion) = 1
N_1 = 4, N_2 = 3, N_3 = 2, N_4 = 1
Example (cont’d)
 
N_1 = 4, N_2 = 3, N_3 = 2, N_4 = 1
Revised counts c* = (c+1) N_{c+1} / N_c (or a best-fit power-law estimate if the counts are unreliable)
C*(cat) = 4 (unchanged, since N_5 = 0 the formula cannot be applied to the highest count)
C*(dog) = (3+1) x 1/2 = 2
C*(fish) = (3+1) x 1/2 = 2
C*(mouse) = (2+1) x 2/3 = 2
C*(rabbit) = (2+1) x 2/3 = 2
C*(hamster) = (2+1) x 2/3 = 2
C*(fox) = (1+1) x 3/4 = 6/4
C*(turtle) = (1+1) x 3/4 = 6/4
C*(tiger) = (1+1) x 3/4 = 6/4
C*(lion) = (1+1) x 3/4 = 6/4
P*(“elephant” or any other unseen animal) = N_1/N = 4/20 (the probability mass reserved for unseen events)
Note that the resulting probabilities do not necessarily sum to 1, so they still need to be normalized.
P*(lion) = (6/4) / 20 = 6/80
More information: http://beyondexpectations.quora.com/An-Intuitive-Explanation-of-Good-Turing-Smoothing
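As a sanity check on the hand computation above, here is a minimal Python sketch of Good-Turing revised counts; the function name good_turing_counts and the fallback for the highest count are my own illustrative choices, not from the slides. When N_{c+1} = 0 the sketch simply keeps the original count, which matches C*(cat) = 4 above and stands in for the best-fit power-law estimate mentioned on the slide.

```python
from collections import Counter

def good_turing_counts(tokens):
    """Good-Turing revised counts c* = (c + 1) * N_{c+1} / N_c,
    where N_c is the number of types seen exactly c times.
    When N_{c+1} = 0 (the largest observed count) the original count is kept."""
    counts = Counter(tokens)                    # c for each observed type
    n_c = Counter(counts.values())              # N_c: count-of-counts
    n_total = sum(counts.values())              # N: total number of tokens

    revised = {}
    for word, c in counts.items():
        revised[word] = (c + 1) * n_c[c + 1] / n_c[c] if n_c[c + 1] > 0 else c
    p_unseen = n_c[1] / n_total                 # mass reserved for unseen events
    return revised, p_unseen

corpus = ("cat dog cat rabbit mouse fish fish mouse hamster hamster "
          "fish turtle tiger cat rabbit cat dog dog fox lion").split()
revised, p_unseen = good_turing_counts(corpus)
print(revised["dog"], revised["mouse"], revised["fox"], p_unseen)  # 2.0 2.0 1.5 0.2
```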
Dealing with Sparse Data
 
Two main techniques are used:
Backoff
Interpolation
Backoff
 
Going back to the lower-order n-gram model if the higher-order model is sparse (e.g., frequency <= 1)
Learning the parameters from a development data set (a minimal sketch follows below)
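The slide does not pin down a specific backoff scheme, so below is a minimal Python sketch of simple (stupid-backoff-style) backoff; the helper names (ngram_counts, backoff_score), the scaling factor alpha = 0.4, the sparsity threshold min_count = 2 (matching the slide's "frequency <= 1" example), and the toy corpus are my own assumptions. Katz backoff would additionally discount the higher-order counts (e.g., with Good-Turing) so that the backed-off mass forms a proper probability distribution.

```python
from collections import Counter

def ngram_counts(tokens, max_n):
    """Counts of all n-grams up to order max_n, keyed by order."""
    return {n: Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
            for n in range(1, max_n + 1)}

def backoff_score(w, history, counts, alpha=0.4, min_count=2):
    """Use the higher-order relative frequency c(h, w)/c(h) only if the n-gram count
    reaches min_count (the slide treats frequency <= 1 as sparse); otherwise back off
    to a shorter history, scaled by alpha. This sketch omits the discounting and
    normalization of Katz backoff, so it returns a score, not a true probability."""
    if not history:
        return counts[1][(w,)] / sum(counts[1].values())   # unigram relative frequency
    hist = tuple(history)
    n = len(hist) + 1
    if counts[n][hist + (w,)] >= min_count:
        return counts[n][hist + (w,)] / counts[len(hist)][hist]
    return alpha * backoff_score(w, list(history)[1:], counts, alpha, min_count)

tokens = "the cat sat on the mat the cat sat on the rug".split()
counts = ngram_counts(tokens, 3)
print(backoff_score("sat", ["the", "cat"], counts))   # "the cat sat" seen twice: 2/2 = 1.0
print(backoff_score("rug", ["the", "cat"], counts))   # sparse: backs off twice, alpha^2 * c(rug)/N
```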
Interpolation
 
If P’(w_i | w_{i-1}, w_{i-2}) is sparse:
Use λ_1 P’(w_i | w_{i-1}, w_{i-2}) + λ_2 P’(w_i | w_{i-1}) + λ_3 P’(w_i)
Ensure that λ_1 + λ_2 + λ_3 = 1 and 0 ≤ λ_1, λ_2, λ_3 ≤ 1
Better than backoff
Estimate the hyper-parameters λ_1, λ_2, λ_3 from held-out data (or using EM), e.g., using 5-fold cross-validation (a sketch follows below)
See [Chen and Goodman 1998] for more details
Software:
http://www.speech.cs.cmu.edu/SLM/toolkit_documentation.html
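Here is a minimal Python sketch of linear interpolation with a brute-force held-out search for the λs; the helper names (mle_models, p_interp, pick_lambdas), the grid step, and the toy corpora are my own illustrative assumptions. In practice the λs are estimated with EM or cross-validation as the slide says, and toolkits such as the CMU SLM toolkit linked above provide this.

```python
import math
from collections import Counter

def mle_models(tokens):
    """MLE unigram/bigram/trigram estimators built from raw counts."""
    uni = Counter(tokens)
    bi = Counter(zip(tokens, tokens[1:]))
    tri = Counter(zip(tokens, tokens[1:], tokens[2:]))
    n = len(tokens)
    p_uni = lambda w: uni[w] / n
    p_bi = lambda w, h1: bi[(h1, w)] / uni[h1] if uni[h1] else 0.0
    p_tri = lambda w, h2, h1: tri[(h2, h1, w)] / bi[(h2, h1)] if bi[(h2, h1)] else 0.0
    return p_uni, p_bi, p_tri

def p_interp(w, h2, h1, models, lambdas):
    """λ1 P'(w|h2,h1) + λ2 P'(w|h1) + λ3 P'(w), with λ1 + λ2 + λ3 = 1."""
    p_uni, p_bi, p_tri = models
    l1, l2, l3 = lambdas
    return l1 * p_tri(w, h2, h1) + l2 * p_bi(w, h1) + l3 * p_uni(w)

def pick_lambdas(models, heldout, step=0.1):
    """Grid-search the λs on held-out trigrams, maximizing log-likelihood
    (EM or cross-validation would be used in practice)."""
    best, best_ll = None, -math.inf
    grid = [i * step for i in range(int(1 / step) + 1)]
    for l1 in grid:
        for l2 in grid:
            l3 = 1.0 - l1 - l2
            if l3 < -1e-9:
                continue
            ll = 0.0
            for h2, h1, w in zip(heldout, heldout[1:], heldout[2:]):
                p = p_interp(w, h2, h1, models, (l1, l2, max(l3, 0.0)))
                ll += math.log(p) if p > 0 else -1e9  # penalize zero probability
            if ll > best_ll:
                best, best_ll = (l1, l2, max(l3, 0.0)), ll
    return best

train = "the cat sat on the mat the cat slept on the mat".split()
heldout = "the cat sat on the mat".split()
models = mle_models(train)
lambdas = pick_lambdas(models, heldout)
print(lambdas, p_interp("sat", "the", "cat", models, lambdas))
```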
 
[slide from Michael Collins]