Advanced Smoothing Algorithms in Natural Language Processing

 
Natural Language Processing
Lecture 8

Advanced: Good-Turing Smoothing
 
Reminder: Add-1 (Laplace) Smoothing

P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
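As a concrete illustration, here is a minimal Python sketch of add-1 bigram smoothing on the toy "Sam I am" corpus that appears later in these slides; the names (add1_prob, bigram_counts) are illustrative, not part of the lecture.

    from collections import Counter

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    V = len(unigram_counts)  # vocabulary size

    def add1_prob(prev, word):
        # P_Add-1(wi | wi-1) = (c(wi-1, wi) + 1) / (c(wi-1) + V)
        return (bigram_counts[(prev, word)] + 1) / (unigram_counts[prev] + V)

    print(add1_prob("i", "am"))   # seen bigram
    print(add1_prob("i", "eat"))  # unseen bigram still gets nonzero probability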
 
 
More general formulations: Add-k

P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + kV)

Equivalently, with m = kV:
P_Add-k(wi | wi-1) = (c(wi-1, wi) + m(1/V)) / (c(wi-1) + m)
 
 
Unigram prior smoothing

P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m·P(wi)) / (c(wi-1) + m)
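A companion sketch for the add-k and unigram-prior formulations above, reusing the same toy counts; the settings k=0.05 and m=1.0 are arbitrary assumptions for illustration.

    from collections import Counter

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    V = len(unigram_counts)

    def addk_prob(prev, word, k=0.05):
        # P_Add-k(wi | wi-1) = (c(wi-1, wi) + k) / (c(wi-1) + k*V)
        return (bigram_counts[(prev, word)] + k) / (unigram_counts[prev] + k * V)

    def unigram_prior_prob(prev, word, m=1.0):
        # P_UnigramPrior(wi | wi-1) = (c(wi-1, wi) + m*P(wi)) / (c(wi-1) + m)
        p_w = unigram_counts[word] / len(tokens)
        return (bigram_counts[(prev, word)] + m * p_w) / (unigram_counts[prev] + m)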
 
 
Advanced smoothing algorithms

Intuition used by many smoothing algorithms:
Good-Turing
Kneser-Ney
Witten-Bell

Use the count of things we've seen once to help estimate the count of things we've never seen.
Notation: Nc = Frequency of frequency c

Nc = the count of things we've seen c times

Sam I am I am Sam I do not eat
I   3
sam 2
am  2
do  1
not 1
eat 1

N1 = 3
N2 = 2
N3 = 1
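A minimal sketch of computing the frequency-of-frequency counts Nc for the toy corpus above (freq_of_freq is an illustrative name):

    from collections import Counter

    tokens = "Sam I am I am Sam I do not eat".split()
    word_counts = Counter(w.lower() for w in tokens)  # i:3, sam:2, am:2, do:1, not:1, eat:1

    # Nc = how many distinct word types were seen exactly c times
    freq_of_freq = Counter(word_counts.values())
    print(freq_of_freq)  # Counter({1: 3, 2: 2, 3: 1})  ->  N1=3, N2=2, N3=1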
 
Good-Turing smoothing intuition

You are fishing (a scenario from Josh Goodman), and caught:
10 carp, 3 perch, 2 whitefish, 1 trout, 1 salmon, 1 eel = 18 fish

How likely is it that the next species is trout?
1/18

How likely is it that the next species is new (e.g. catfish or bass)?
Let's use our estimate of things-we-saw-once to estimate the new things.
3/18 (because N1 = 3)

Assuming so, how likely is it that the next species is trout?
Must be less than 1/18
How to estimate?
 
 
 
Good-Turing calculations

c* = (c+1) Nc+1 / Nc
P*_GT(things with zero frequency) = N1 / N

Unseen (bass or catfish):
c = 0
MLE p = 0/18 = 0
P*_GT(unseen) = N1/N = 3/18

Seen once (trout):
c = 1
MLE p = 1/18
c*(trout) = 2 · N2/N1 = (2 · 1)/3 = 2/3
P*_GT(trout) = (2/3)/18 = 1/27
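A small sketch of the Good-Turing calculation for the fishing example above, reproducing P*_GT(unseen) = 3/18 and P*_GT(trout) = 1/27; the variable names are illustrative.

    from collections import Counter

    catch = {"carp": 10, "perch": 3, "whitefish": 2, "trout": 1, "salmon": 1, "eel": 1}
    N = sum(catch.values())              # 18 fish
    Nc = Counter(catch.values())         # N1=3, N2=1, N3=1, N10=1

    def gt_adjusted_count(c):
        # c* = (c+1) * Nc+1 / Nc
        return (c + 1) * Nc[c + 1] / Nc[c]

    p_unseen = Nc[1] / N                 # P*_GT(unseen) = N1/N = 3/18
    p_trout = gt_adjusted_count(1) / N   # c*(trout) = 2*N2/N1 = 2/3, so (2/3)/18 = 1/27
    print(p_unseen, p_trout)             # 0.1666...  0.0370...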
 
Ney et al.'s Good-Turing Intuition

Held-out words:

H. Ney, U. Essen, and R. Kneser. 1995. On the estimation of 'small' probabilities by leaving-one-out. IEEE Trans. PAMI 17(12): 1202-1212.
 
Ney et al. Good-Turing Intuition (slide from Dan Klein)
 
Intuition from leave-one-out validation:
Take each of the c training words out in turn
c training sets of size c-1, held-out of size 1
What fraction of held-out words are unseen in training?  N1/c
What fraction of held-out words are seen k times in training?  (k+1) Nk+1 / c
So in the future we expect (k+1) Nk+1 / c of the words to be those with training count k
There are Nk words with training count k
Each should occur with probability:  (k+1) Nk+1 / c / Nk
... or expected count:  k* = (k+1) Nk+1 / Nk

[Figure: word classes by training count (N1, N2, N3, ..., N3511, ..., N4417) aligned with their count classes after holding one word out (N0, N1, N2, ..., N3510, ..., N4416); columns labeled "Training" and "After Held out"]
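A brief sketch that checks the leave-one-out intuition on a toy token list: a held-out token is unseen in the remaining training data exactly when its full-corpus count is 1, so the fraction of unseen held-out words equals N1/c (the corpus and names are illustrative).

    from collections import Counter

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    c = len(tokens)                     # c = number of training words (10 here)
    counts = Counter(tokens)
    N1 = sum(1 for n in counts.values() if n == 1)

    # A held-out token is unseen in the remaining c-1 tokens exactly when
    # its count in the full corpus is 1.
    unseen_heldout = sum(1 for w in tokens if counts[w] == 1)

    print(unseen_heldout / c, N1 / c)   # both 0.3 for this toy corpus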
 
Good-Turing complications (slide from Dan Klein)

Problem: what about "the"?  (say c = 4417)
For small k, Nk > Nk+1
For large k, too jumpy, zeros wreck estimates

Simple Good-Turing [Gale and Sampson]: replace empirical Nk with a best-fit power law once counts get unreliable
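A rough sketch of the Simple Good-Turing idea: fit a power law Nk ≈ A·k^b by linear regression in log-log space and read smoothed Nk values off the fit where the empirical counts get unreliable. This shows only the fitting step, with made-up Nk values; Gale and Sampson's full procedure also averages over gaps in k and switches between empirical and fitted values.

    import math

    # empirical frequency-of-frequency counts (illustrative values)
    Nk = {1: 268, 2: 112, 3: 70, 4: 41, 5: 24, 6: 14, 7: 15, 9: 5, 12: 3}

    xs = [math.log(k) for k in Nk]
    ys = [math.log(n) for n in Nk.values()]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    a = my - b * mx

    def smoothed_Nk(k):
        # best-fit power law: Nk ~ exp(a) * k**b
        return math.exp(a + b * math.log(k))

    def gt_count(c):
        # c* = (c+1) * smoothed_Nk(c+1) / smoothed_Nk(c)
        return (c + 1) * smoothed_Nk(c + 1) / smoothed_Nk(c)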
 
Resulting Good-Turing numbers

Numbers from Church and Gale (1991)
22 million words of AP Newswire

c* = (c+1) Nc+1 / Nc

Count c    Good-Turing c*
0          .0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25
 
Advanced: Kneser-Ney Smoothing
 
Resulting Good-Turing numbers

Numbers from Church and Gale (1991)
22 million words of AP Newswire

Count c    Good-Turing c*
0          .0000270
1          0.446
2          1.26
3          2.24
4          3.24
5          4.22
6          5.19
7          6.21
8          7.24
9          8.25

It sure looks like c* = (c - .75)
 
 
 
 
Absolute Discounting Interpolation

Save ourselves some time and just subtract 0.75 (or some d)!

P_AbsoluteDiscounting(wi | wi-1) = (c(wi-1, wi) - d) / c(wi-1)  +  λ(wi-1) · P(w)

discounted bigram:     (c(wi-1, wi) - d) / c(wi-1)
interpolation weight:  λ(wi-1)
unigram:               P(w)

(Maybe keeping a couple extra values of d for counts 1 and 2)
But should we really just use the regular unigram P(w)?
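A minimal sketch of absolute discounting with interpolation against the plain unigram distribution, with d = 0.75; λ(wi-1) is set so each conditional distribution sums to one, and the helper names are assumptions.

    from collections import Counter

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    N = len(tokens)
    d = 0.75

    def p_absolute_discounting(prev, word):
        # discounted bigram + lambda(prev) * regular unigram P(w)
        discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
        # lambda(prev): the discount mass removed, redistributed over the unigram
        n_following_types = sum(1 for (w1, _) in bigrams if w1 == prev)
        lam = d * n_following_types / unigrams[prev]
        return discounted + lam * unigrams[word] / N

    print(p_absolute_discounting("i", "am"))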
 
Kneser-Ney Smoothing I

Better estimate for probabilities of lower-order unigrams!

Shannon game: I can't see without my reading ____________________?  (Francisco / glasses)
"Francisco" is more common than "glasses"
... but "Francisco" always follows "San"

The unigram is useful exactly when we haven't seen this bigram!
Instead of P(w): "How likely is w?"
P_CONTINUATION(w): "How likely is w to appear as a novel continuation?"
For each word, count the number of bigram types it completes
Every bigram type was a novel continuation the first time it was seen
 
Kneser-Ney Smoothing II

How many times does w appear as a novel continuation?

P_CONTINUATION(w) ∝ |{wi-1 : c(wi-1, w) > 0}|

Normalized by the total number of word bigram types:

P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|
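A small sketch of the continuation probability: count the distinct left contexts of each word and normalize by the number of bigram types (names are illustrative).

    from collections import defaultdict

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    bigram_types = set(zip(tokens, tokens[1:]))

    left_contexts = defaultdict(set)
    for w1, w2 in bigram_types:
        left_contexts[w2].add(w1)

    def p_continuation(word):
        # |{wi-1 : c(wi-1, w) > 0}| / |{(wj-1, wj) : c(wj-1, wj) > 0}|
        return len(left_contexts[word]) / len(bigram_types)

    print(p_continuation("am"))  # "am" is preceded only by "i" -> 1 / number of bigram types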
 
 
 
 
 
 
Kneser-Ney Smoothing III

Alternative metaphor: the number of word types seen to precede w

|{wi-1 : c(wi-1, w) > 0}|

normalized by the # of words preceding all words:

P_CONTINUATION(w) = |{wi-1 : c(wi-1, w) > 0}| / Σ_w' |{w'i-1 : c(w'i-1, w') > 0}|

A frequent word (Francisco) occurring in only one context (San) will have a low continuation probability
 
 
 
Kneser-Ney Smoothing IV

P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0) / c(wi-1)  +  λ(wi-1) · P_CONTINUATION(wi)

λ is a normalizing constant: the probability mass we've discounted

λ(wi-1) = d / c(wi-1) · |{w : c(wi-1, w) > 0}|

d / c(wi-1) is the normalized discount
|{w : c(wi-1, w) > 0}| = the number of word types that can follow wi-1
                       = # of word types we discounted
                       = # of times we applied the normalized discount
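A sketch assembling the bigram case of interpolated Kneser-Ney from the pieces above, with d = 0.75; it is a minimal illustration on the toy corpus, not a production implementation.

    from collections import Counter, defaultdict

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    bigram_types = set(bigrams)
    d = 0.75

    left_contexts = defaultdict(set)     # word types that precede w
    following_types = defaultdict(set)   # word types that follow wi-1
    for w1, w2 in bigram_types:
        left_contexts[w2].add(w1)
        following_types[w1].add(w2)

    def p_continuation(w):
        return len(left_contexts[w]) / len(bigram_types)

    def p_kn(prev, word):
        # P_KN(wi | wi-1) = max(c(wi-1, wi) - d, 0)/c(wi-1) + lambda(wi-1) * P_CONTINUATION(wi)
        discounted = max(bigrams[(prev, word)] - d, 0) / unigrams[prev]
        lam = d / unigrams[prev] * len(following_types[prev])   # normalized discount * discounted types
        return discounted + lam * p_continuation(word)

    print(p_kn("i", "am"))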
 
Kneser-Ney Smoothing: Recursive formulation

P_KN(wi | wi-n+1..wi-1) = max(c_KN(wi-n+1..wi) - d, 0) / c_KN(wi-n+1..wi-1)  +  λ(wi-n+1..wi-1) · P_KN(wi | wi-n+2..wi-1)

c_KN(·) = count(·) for the highest order
        = continuation count(·) for lower orders

Continuation count = number of unique single-word contexts for ·
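A compact sketch of the recursive formulation for an arbitrary order: the highest order uses raw counts, lower orders use continuation counts, and the base case is the continuation unigram. All names are illustrative and edge cases (e.g. unseen histories) are handled only crudely.

    from collections import Counter, defaultdict

    def ngrams(tokens, n):
        return list(zip(*[tokens[i:] for i in range(n)]))

    def build_kn_counts(tokens, order):
        # c_KN: raw counts at the highest order, continuation counts at the lower orders
        raw = {n: Counter(ngrams(tokens, n)) for n in range(1, order + 1)}
        ckn = {order: raw[order]}
        for n in range(1, order):
            contexts = defaultdict(set)
            for gram in raw[n + 1]:
                contexts[gram[1:]].add(gram[0])   # distinct single-word left contexts
            ckn[n] = Counter({g: len(ctx) for g, ctx in contexts.items()})
        return ckn

    def p_kn(word, history, ckn, d=0.75):
        # P_KN(w | h) = max(c_KN(h+w) - d, 0) / sum_v c_KN(h+v) + lambda(h) * P_KN(w | shortened h)
        if not history:
            return ckn[1][(word,)] / sum(ckn[1].values())   # base case: continuation unigram
        counts = ckn[len(history) + 1]
        total = sum(cnt for g, cnt in counts.items() if g[:-1] == history)
        if total == 0:                                      # unseen history: back off entirely
            return p_kn(word, history[1:], ckn, d)
        types = sum(1 for g in counts if g[:-1] == history)
        lam = d * types / total
        return max(counts[history + (word,)] - d, 0) / total + lam * p_kn(word, history[1:], ckn, d)

    tokens = "Sam I am I am Sam I do not eat".lower().split()
    ckn = build_kn_counts(tokens, order=3)
    print(p_kn("am", ("sam", "i"), ckn))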