Understanding English Parts of Speech and Tagging

Part-of-Speech Tagging
 
CSE 628
Niranjan Balasubramanian
 
Many slides and material from:
Ray Mooney (UT Austin)
Mausam (IIT Delhi) *
 
* Mausam’s excellent deck was itself composed using material from other NLP greats!
 
 
Part-of-Speech Tagging
 
P
a
r
t
-
o
f
-
s
p
e
e
c
h
 
t
a
g
g
i
n
g
 
i
s
 
t
h
e
 
m
o
s
t
 
b
a
s
i
c
 
s
y
n
t
a
c
t
i
c
 
a
n
a
l
y
s
i
s
 
w
e
 
c
a
n
 
d
o
.
 
Input: A sequence of tokens (e.g., a sentence, tweet, search queries).
Output: An assignment of POS tags for each word in the input
 
2
English Parts of Speech
Noun:
 
 
Syntactic Function: 
 
Subjects or Objects of verbs.
 
Semantic Type:
  
Person, place or thing.
 
Sub-categories:
      
Singular (NN):  dog, fork
      
Plural (NNS):  dogs, forks
      
Proper (NNP, NNPS): John, Springfields
      
Personal pronoun (PRP): I, you, he, she, it
      
Wh-pronoun  (WP): who, what
Verb
 
Syntactic Function:
  
predicate, heads a verb phrase.
 
Semantic Type:
  
actions or processes.
 
Sub-categories:
      
Base, infinitive (VB):  eat
      
Past tense (VBD):  ate
      
Gerund (VBG):  eating
      
Past participle (VBN):  eaten
      
Non 3
rd
 person singular present tense (VBP): eat
      
3
rd
 person singular present tense: (VBZ): eats
      
Modal (MD): should, can
      
To (TO): to (to eat)
 
3
English Parts of Speech
Adjective (modify nouns)
Basic (JJ): red, tall
Comparative (JJR): redder, taller
Superlative (JJS): reddest, tallest
 
Adverb (modify verbs)
Basic (RB): quickly
Comparative (RBR): quicker
Superlative (RBS): quickest
 
Preposition (IN): on, in, by, to, with
 
Determiner:
Basic (DT) a, an, the
WH-determiner (WDT): which, that
 
Coordinating Conjunction (CC): and, but, or,
 
Particle (RP): off (took off), up (put up)
 
 
 
4
Closed vs. Open Class
C
l
o
s
e
d
 
c
l
a
s
s
 
c
a
t
e
g
o
r
i
e
s
 
a
r
e
 
c
o
m
p
o
s
e
d
 
o
f
 
a
 
s
m
a
l
l
,
 
f
i
x
e
d
 
s
e
t
 
o
f
 
g
r
a
m
m
a
t
i
c
a
l
 
f
u
n
c
t
i
o
n
w
o
r
d
s
 
f
o
r
 
a
 
g
i
v
e
n
 
l
a
n
g
u
a
g
e
.
Pronouns, Prepositions, Modals, Determiners, Particles, Conjunctions
 
O
p
e
n
 
c
l
a
s
s
 
c
a
t
e
g
o
r
i
e
s
 
h
a
v
e
 
l
a
r
g
e
 
n
u
m
b
e
r
 
o
f
 
w
o
r
d
s
 
a
n
d
 
n
e
w
 
o
n
e
s
 
a
r
e
 
e
a
s
i
l
y
i
n
v
e
n
t
e
d
.
Nouns: 
  
Transmogrifier, Selfie, …
Verbs: 
  
You can verb many nouns. E.g., Google it.
Adjectives:
 
Many nouns can be made into adjectives. E.g., geeky
 Abverb:
  
e.g., Webly supervised learning
 
5
English POS Tagsets
B
r
o
w
n
 
c
o
r
p
u
s
 
u
s
e
d
 
a
 
l
a
r
g
e
 
s
e
t
 
o
f
 
8
7
 
P
O
S
 
t
a
g
s
.
 
P
e
n
n
 
T
r
e
e
b
a
n
k
 
h
a
s
 
a
 
s
e
t
 
o
f
 
4
5
 
t
a
g
s
.
Most commonly used in NLP.
 Reduced from the Brown set for use in the context of a parsed corpus
 
C
5
 
 
t
a
g
s
e
t
 
u
s
e
d
 
f
o
r
 
t
h
e
 
B
r
i
t
i
s
h
 
N
a
t
i
o
n
a
l
 
C
o
r
p
u
s
 
(
B
N
C
)
 
h
a
s
 
6
1
 
t
a
g
s
.
 
6
 
What is a good tag set? Not settled. We will avoid this
discussion!
Why do POS tagging?
POS tagging is often used an early step in an NLP pipeline.
 
Text-to-speech e.g., How do you pronounce “lead”?
 
Can write regular expressions over POS tags to identify phrases etc.
 
e.g., (Det) Adj* N+ over the output for noun phrases, etc.
 
If you know the tag, you can back off to it in other tasks
Using proper nouns is a good back off method for finding people.
 
Assigning classes to words or phrases is quite useful
Noun, Verb, Adjective, …
Subject, Verb, Direct Object, Indirect Object, …
Person, Organization, Date, …
Drug, Disease, Side-effect, …
 
7
History
 
8
Punch-line
State-of-the-art techniques achieve > 98% accuracy on newswire texts.
Not a PhD topic anymore!
 
POS tagging in resource poor languages is harder (~87%)
 
POS tagging for Tweets is harder (state-of-the art ~ 88 %)
 
 
 
9
Why is POS-tagging hard?
Ambiguity
Same word can have different POS tags in different contexts.
 
 
 
 
 
 
 
About 11% of the vocabulary (and 40% of word tokens) in the Brown
I know 
that
 he is honest = IN
Yes, 
that
 play was nice = DT
You can’t go 
that
 far = RB
 
Cannot specify POS tags for all words
New words appear all the time.
 
 
 
 
10
Why is POS tagging hard?
Deciding on the correct part of speech can be difficult even for people
 
M
r
s
/
N
N
P
 
S
h
a
e
f
e
r
/
N
N
P
 
n
e
v
e
r
/
R
B
 
g
o
t
/
V
B
D
 
a
r
o
u
n
d
/
?
 
t
o
/
T
O
 
j
o
i
n
i
n
g
/
V
B
G
 
H
e
/
N
N
P
 
w
a
n
t
s
/
V
B
P
 
t
o
/
T
O
 
g
o
/
V
B
 
a
r
o
u
n
d
/
?
 
t
h
e
/
D
T
 
c
o
r
n
e
r
/
N
N
 
C
h
a
t
e
a
u
/
N
N
P
 
P
e
t
r
u
s
/
N
N
P
 
c
o
s
t
s
/
V
B
Z
 
a
r
o
u
n
d
/
?
 
2
5
0
/
C
D
 
 
I have to understand the meaning of the sentence before I can decide what POS
tag to assign.
 
 
11
Why is POS tagging hard?
Deciding on the correct part of speech can be difficult even for people
 
M
r
s
/
N
N
P
 
S
h
a
e
f
e
r
/
N
N
P
 
n
e
v
e
r
/
R
B
 
g
o
t
/
V
B
D
 
a
r
o
u
n
d
/
?
 
t
o
/
T
O
 
j
o
i
n
i
n
g
/
V
B
G
 
“got around” – replace with “managed” with roughly equivalent meaning.
 
This means that around is used as a particle – i.e., something that cannot stand
 
on its own.
 
H
e
/
N
N
P
 
w
a
n
t
s
/
V
B
P
 
t
o
/
T
O
 
g
o
/
V
B
 
a
r
o
u
n
d
/
?
 
t
h
e
/
D
T
 
c
o
r
n
e
r
/
N
N
 
“go around the corner” – around describes going with respect to the location.
 
You can replace around with “go to the corner” “go by the corner” etc.
 
C
h
a
t
e
a
u
/
N
N
P
 
P
e
t
r
u
s
/
N
N
P
 
c
o
s
t
s
/
V
B
Z
 
a
r
o
u
n
d
/
?
 
2
5
0
/
C
D
 
“costs around 250” – replace around with roughly to mean the same thing.
 
around is describing the action “costs”.
 
I have to understand the meaning of the sentence before I can decide what POS
tag to assign.
 
 
12
Why is POS tagging hard?
Deciding on the correct part of speech can be difficult even for people
 
M
r
s
/
N
N
P
 
S
h
a
e
f
e
r
/
N
N
P
 
n
e
v
e
r
/
R
B
 
g
o
t
/
V
B
D
 
a
r
o
u
n
d
/
R
P
 
t
o
/
T
O
 
j
o
i
n
i
n
g
/
V
B
G
 
H
e
/
N
N
P
 
w
a
n
t
s
/
V
B
P
 
t
o
/
T
O
 
g
o
/
V
B
 
a
r
o
u
n
d
/
I
N
 
t
h
e
/
D
T
 
c
o
r
n
e
r
/
N
N
 
C
h
a
t
e
a
u
/
N
N
P
 
P
e
t
r
u
s
/
N
N
P
 
c
o
s
t
s
/
V
B
Z
 
a
r
o
u
n
d
/
R
B
 
2
5
0
/
C
D
 
Average POS tagging disagreement amongst expert human judges for the Penn
Treebank was 3.5%
 
13
What information to use?
 
Bill
  
saw
  
that 
 
man 
 
yesterday
 
14
What information to use?
 
Bill
  
saw
  
that 
  
man 
 
yesterday
 
 
 
 
 
N
N
P
N
N
 
D
T
 
N
N
 
N
N
V
B
 
V
B
(
D
)
 
I
N
 
V
B
 
N
N
 
Knowledge of word probabilities
man is rarely used as a verb….
 
Knowledge of POS of neighboring words
 
A verb is less likely to follow another verb.
 
15
Learning Approach
16
Problem Formulations
Baseline
Assign most frequent tag for each word based on training.
 
Hidden Markov Models (HMM)
Previous tag and current word should influence current tag
 
Feature rich model
Maximum Entropy (Maxent) model
 
Discriminative model
Conditional Random Field (CRF)
 
 
17
Most Frequent Tag
Pros ?
 
 
 
 
Cons ?
 
18
Most Frequent Tag
Pros ?
Easy to implement 

90% accurate!
 
 
Cons ?
 
19
Most Frequent Tag
Pros ?
Easy to implement 

90% accurate!
 
 
Cons ?
Unknown words: You may not have seen all words.
 
Context dependence:
 
Bill paid me $50.
 
vs.
Bill me $50.
 
20
Hidden Markov Model
Probabilistic generative model for sequences.
 
A
s
s
u
m
e
 
a
n
 
u
n
d
e
r
l
y
i
n
g
 
s
e
t
 
o
f
 
h
i
d
d
e
n
 
(
u
n
o
b
s
e
r
v
e
d
)
 
s
t
a
t
e
s
 
i
n
 
w
h
i
c
h
 
t
h
e
 
m
o
d
e
l
 
c
a
n
 
b
e
(
e
.
g
.
 
p
a
r
t
s
 
o
f
 
s
p
e
e
c
h
)
.
 
Assume probabilistic transitions between states over time (e.g. transition from POS
to another POS as sequence is generated).
 
A
s
s
u
m
e
 
a
 
p
r
o
b
a
b
i
l
i
s
t
i
c
 
g
e
n
e
r
a
t
i
o
n
 
o
f
 
t
o
k
e
n
s
 
f
r
o
m
 
s
t
a
t
e
s
 
(
e
.
g
.
 
w
o
r
d
s
 
g
e
n
e
r
a
t
e
d
 
f
o
r
e
a
c
h
 
P
O
S
)
.
 
Assume current state is dependent only on the previous state.
 
21
POS HMMs: A Generative Story
There is some probabilistic process (or machine) that generates sentences.
a)
S
w
i
t
c
h
e
s
 
t
h
r
o
u
g
h
 
a
 
f
i
n
i
t
e
 
s
e
t
 
o
f
 
P
O
S
 
s
t
a
t
e
s
 
p
r
o
b
a
b
i
l
i
s
t
i
c
a
l
l
y
.
 
e.g., Switches from state i to j with probability Pr(S
j
 | S
i
)
 
b) 
 
In each state the machine emits some word with a probability that is
 
specific to the state.
 
e.g. Pr(dog | state 1) = 0.63
  
Pr(dog | state 2) = 0.21
 
A
n
y
 
s
e
n
t
e
n
c
e
 
i
s
 
a
 
s
e
q
u
e
n
c
e
 
o
f
 
(
o
b
s
e
r
v
e
d
)
 
w
o
r
d
s
 
e
m
i
t
t
e
d
 
b
y
a
 
f
i
n
i
t
e
 
s
t
a
t
e
 
m
a
c
h
i
n
e
 
t
h
a
t
 
s
w
i
t
c
h
e
s
 
t
h
r
o
u
g
h
 
(
h
i
d
d
e
n
)
 
P
O
S
 
s
t
a
t
e
s
 
 
22
Formal Definition of an HMM
A set of 
N +2
 states 
       
S 
= {
s
0
,
s
1
,
s
2
, …
 s
N,
 s
F
}
 
A set of 
M
 possible observations 
    
O 
= {
o
1
,
o
2
, …, 
o
M
}
 
A state transition probability distribution 
   
A 
= {
a
ij
}
 
 
 
 
Observation probability distribution
, 
    
B 
= {
b
0
(
k
), 
b
1
(
k
), 
b
2
(
k
), …, 
b
F
(
k
)}
 
 
 
Parameters of the model 
λ
={
A
,
B
}
 
23
Three things you can do with an HMM
Decode the state sequences that the machine went through.
 
Learn the parameters of the model.
 
Predict likelihood of observation sequences.
 
 
 
24
Observation Likelihood
 
25
Given the transition and emission tables:
 
What is the probability of observing the sentence given the model?
 
Pr(Dogs chase cats.) = 
 
Pr(NNS|START) x Pr(Dogs|NNS)
    
    
  
x Pr(VBP|NNS) x Pr(chase|VBP)
    
    
  
x Pr(NNS|VBP) x Pr(cats|NNS)
    
    
  
x Pr(DOT | NNS) x Pr(  . | DOT )
Decoding
 
26
Given the transition and emission probabilities tables:
 
What are the (most likely) hidden states S1, S2, and S3 that
the machine transitioned through in order to produce the
observed sentence?
Most Likely State Sequence (Decoding)
Given an observation sequence, 
O
, and a model, 
λ
,  what is the most likely state sequence,
Q
=
q
1
,
q
2
,…
q
T
, that generated this sequence from this model?
 
 
Used for sequence labeling.
Assume each state corresponds to a tag.
Determine the globally best assignment of tags to all tokens in a sequence.
Uses a principled approach grounded in probability theory.
 
27
Most Likely State Sequence (Decoding)
Suppose there are N states (pos tags + start + final states).
 
Let there be an input sentence that has T words.
 
Given the N x N transition matrix A and the N x T emission matrix B:
What is a naive algorithm for finding the most likely state sequence?
 
 
 
 
28
Most Likely State Sequence (Decoding)
Suppose there are N states (pos tags + start + final states).
 
Let there be an input sentence that has T words.
 
Given the N x N transition matrix A and the N x T emission matrix B:
What is a naive algorithm for finding the most likely state sequence?
 
 
 
 
 
 
 
What is the complexity of 
Brute
?
 
 
 
29
Brute(S, T):
 
Iterate through every possible state sequence S.
 
Compute Pr(S | T)
 
Select the S that maximizes Pr(S | T)
Most Likely State Sequence (Decoding)
Suppose there are N states (pos tags + start + final states).
 
Let there be an input sentence that has T words.
 
Given the N x N transition matrix A and the N x T emission matrix B:
What is a naive algorithm for finding the most likely state sequence?
 
 
 
 
 
 
 
What is the complexity of 
Brute
?
  
O(
number of state sequences
 x T)
 
 
 
 
30
Brute(S, T):
 
Iterate through every possible state sequence S.
 
Compute Pr(S | T)
 
Select the S that maximizes Pr(S | T)
A More Efficient Solution
Dynamic Programming can also be used
Exploit the Markov assumption
Efficiently determine the most likely state sequence for a given observation and
model.
 
Standard procedure is called the 
Viterbi algorithm
 (Viterbi, 1967)
Has O(N
2 
T) time complexity.
 
31
Viterbi Scores
Recursively compute the probability of the most likely subsequence of
states that accounts for the first 
t
 observations and ends in state 
s
j
.
 
32
 
 
Also record 
back pointers
 to trace back the most probable state
sequence.
bt
t
(
j
) stores the state at time 
t
-1 that maximizes the probability
that system was in state 
s
j
 at time 
t
 (given the observed sequence).
Computing the Viterbi Scores
Initialization
 
 
 
Recursion
 
 
 
Termination
 
33
Computing the Viterbi Backpointers
Initialization
 
Recursion
 
 
Termination
 
34
 
Final state in the most probable state sequence.
Follow back pointers to initial state to construct full sequence.
Viterbi Backpointers
 
35
 
s
1
 
s
2
 
s
N
 
 
 
s
0
 
s
F
 
 
    
   
 
    
   
 
    
   
 
    
   
 
t
1
 
t
2
 
t
3
 
t
T-1
 
t
T
Viterbi Backtrace
36
s
1
s
2
s
N
s
0
s
F
    
   
    
   
    
   
    
   
t
1
t
2
t
3
t
T-1
t
T
 
Most likely Sequence: s
0 
s
N 
s
1
 s
2
 …s
2
 s
F
Learning the parameters of the HMM
 
37
 
Transition Probabilities:
 
  
Pr(S
i
 | S
j
) or a
ij
 
Emission Probabilities:
 
  
Pr(v
k
 | S
j
) or b
j
(k)
Learning
Supervised Learning
:  All training sequences are completely labeled (tagged).
Unsupervised
 
Learning
: All training sequences are unlabeled (but generally know the
number of tags, i.e. states).
Semisupervised Learning
: Some training sequences are labeled, most are unlabeled.
 
38
Supervised HMM Training
If training sequences are labeled (tagged) with the underlying state sequences
that generated them, then the parameters, λ={A,B} can all be estimated
directly.
39
Supervised Parameter Estimation
Estimate state transition probabilities based on tag bigram and unigram statistics in
the labeled data.
 
 
 
Estimate the observation probabilities based on tag/word co-occurrence statistics
in the labeled data.
 
 
 
Use appropriate smoothing if training data is sparse.
 
40
Learning and Using HMM Taggers
Use a corpus of labeled sequence data to easily construct an HMM using supervised training.
 
 
Given a novel unlabeled test sequence to tag, use the Viterbi algorithm to predict the most
likely (globally optimal) tag sequence.
 
 
41
Evaluating Taggers
T
r
a
i
n
 
o
n
 
t
r
a
i
n
i
n
g
 
s
e
t
 
o
f
 
l
a
b
e
l
e
d
 
s
e
q
u
e
n
c
e
s
.
 
P
o
s
s
i
b
l
y
 
t
u
n
e
 
p
a
r
a
m
e
t
e
r
s
 
b
a
s
e
d
 
o
n
 
p
e
r
f
o
r
m
a
n
c
e
 
o
n
 
a
 
d
e
v
e
l
o
p
m
e
n
t
 
s
e
t
.
 
M
e
a
s
u
r
e
 
a
c
c
u
r
a
c
y
 
o
n
 
a
 
d
i
s
j
o
i
n
t
 
t
e
s
t
 
s
e
t
.
 
G
e
n
e
r
a
l
l
y
 
m
e
a
s
u
r
e
 
t
a
g
g
i
n
g
 
a
c
c
u
r
a
c
y
,
 
i
.
e
.
 
t
h
e
 
p
e
r
c
e
n
t
a
g
e
 
o
f
 
t
o
k
e
n
s
 
t
a
g
g
e
d
c
o
r
r
e
c
t
l
y
.
 
Accuracy of most modern POS taggers, including HMMs is 96−97% (for Penn tagset
trained on about 800K words) .
Generally matching human agreement level.
 
42
Stop Here.
Questions?
 
Next class
Intro to machine learning (possibly).
Unsupervised learning for HMMs
Issues with HMMs
MEMMs
 
Sign up on Piazza.
 
 
Think projects!
 
 
43
Slide Note
Embed
Share

English Parts of Speech and Tagging involve analyzing syntactic functions and semantic types of words in a sentence. This process assigns POS tags to each word based on its role in the sentence, such as nouns, verbs, adjectives, adverbs, prepositions, determiners, pronouns, and conjunctions. The distinction between closed and open class words is also explored, along with variations in POS tag sets used in linguistic corpora.


Uploaded on Jul 29, 2024 | 4 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Part-of-Speech Tagging CSE 628 Niranjan Balasubramanian Many slides and material from: Ray Mooney (UT Austin) Mausam (IIT Delhi) * * Mausam s excellent deck was itself composed using material from other NLP greats!

  2. Part-of-Speech Tagging Part-of-speech tagging is the most basic syntactic analysis we can do. Input: A sequence of tokens (e.g., a sentence, tweet, search queries). Output: An assignment of POS tags for each word in the input A DT tall JJ boy NN ran VBD home NN . . 2

  3. English Parts of Speech Noun: Syntactic Function: Semantic Type: Sub-categories: Subjects or Objects of verbs. Person, place or thing. Singular (NN): dog, fork Plural (NNS): dogs, forks Proper (NNP, NNPS): John, Springfields Personal pronoun (PRP): I, you, he, she, it Wh-pronoun (WP): who, what Verb Syntactic Function: Semantic Type: Sub-categories: actions or processes. predicate, heads a verb phrase. Base, infinitive (VB): eat Past tense (VBD): ate Gerund (VBG): eating Past participle (VBN): eaten Non 3rd person singular present tense (VBP): eat 3rd person singular present tense: (VBZ): eats Modal (MD): should, can To (TO): to (to eat) 3

  4. English Parts of Speech Adjective (modify nouns) Basic (JJ): red, tall Comparative (JJR): redder, taller Superlative (JJS): reddest, tallest Adverb (modify verbs) Basic (RB): quickly Comparative (RBR): quicker Superlative (RBS): quickest Preposition (IN): on, in, by, to, with Determiner: Basic (DT) a, an, the WH-determiner (WDT): which, that Coordinating Conjunction (CC): and, but, or, Particle (RP): off (took off), up (put up) 4

  5. Closed vs. Open Class Closed class categories are composed of a small, fixed set of grammatical function words for a given language. Pronouns, Prepositions, Modals, Determiners, Particles, Conjunctions Open class categories have large number of words and new ones are easily invented. Nouns: Transmogrifier, Selfie, Verbs: You can verb many nouns. E.g., Google it. Adjectives: Many nouns can be made into adjectives. E.g., geeky Abverb: e.g., Webly supervised learning 5

  6. English POS Tagsets Brown corpus used a large set of 87 POS tags. Penn Treebank has a set of 45 tags. Most commonly used in NLP. Reduced from the Brown set for use in the context of a parsed corpus C5 tagset used for the British National Corpus (BNC) has 61 tags. What is a good tag set? Not settled. We will avoid this discussion! 6

  7. Why do POS tagging? POS tagging is often used an early step in an NLP pipeline. Text-to-speech e.g., How do you pronounce lead ? Can write regular expressions over POS tags to identify phrases etc. e.g., (Det) Adj* N+ over the output for noun phrases, etc. If you know the tag, you can back off to it in other tasks Using proper nouns is a good back off method for finding people. Assigning classes to words or phrases is quite useful Noun, Verb, Adjective, Subject, Verb, Direct Object, Indirect Object, Person, Organization, Date, Drug, Disease, Side-effect, 7

  8. History 8

  9. Punch-line State-of-the-art techniques achieve > 98% accuracy on newswire texts. Not a PhD topic anymore! POS tagging in resource poor languages is harder (~87%) POS tagging for Tweets is harder (state-of-the art ~ 88 %) 9

  10. Why is POS-tagging hard? Ambiguity Same word can have different POS tags in different contexts. About 11% of the vocabulary (and 40% of word tokens) in the Brown I know that he is honest = IN Yes, that play was nice = DT You can t go that far = RB Cannot specify POS tags for all words New words appear all the time. 10

  11. Why is POS tagging hard? Deciding on the correct part of speech can be difficult even for people Mrs/NNP Shaefer/NNP never/RB got/VBD around/? to/TO joining/VBG He/NNP wants/VBP to/TO go/VB around/? the/DT corner/NN Chateau/NNP Petrus/NNP costs/VBZ around/? 250/CD I have to understand the meaning of the sentence before I can decide what POS tag to assign. 11

  12. Why is POS tagging hard? Deciding on the correct part of speech can be difficult even for people Mrs/NNP Shaefer/NNP never/RB got/VBD around/? to/TO joining/VBG got around replace with managed with roughly equivalent meaning. This means that around is used as a particle i.e., something that cannot stand on its own. He/NNP wants/VBP to/TO go/VB around/? the/DT corner/NN go around the corner around describes going with respect to the location. You can replace around with go to the corner go by the corner etc. Chateau/NNP Petrus/NNP costs/VBZ around/? 250/CD costs around 250 replace around with roughly to mean the same thing. around is describing the action costs . I have to understand the meaning of the sentence before I can decide what POS tag to assign. 12

  13. Why is POS tagging hard? Deciding on the correct part of speech can be difficult even for people Mrs/NNP Shaefer/NNP never/RB got/VBD around/RP to/TO joining/VBG He/NNP wants/VBP to/TO go/VB around/IN the/DT corner/NN Chateau/NNP Petrus/NNP costs/VBZ around/RB 250/CD Average POS tagging disagreement amongst expert human judges for the Penn Treebank was 3.5% 13

  14. What information to use? Bill saw that man yesterday 14

  15. What information to use? Bill saw NN VB(D) that DT IN man NN VB yesterday NN NN NNP VB Knowledge of word probabilities man is rarely used as a verb . Knowledge of POS of neighboring words A verb is less likely to follow another verb. 15

  16. Learning Approach Machine Learning Manually Annotated Training Corpora Tagging Model NLP System Automatically Annotated Text Raw Text 16

  17. Problem Formulations Baseline Assign most frequent tag for each word based on training. Hidden Markov Models (HMM) Previous tag and current word should influence current tag Feature rich model Maximum Entropy (Maxent) model Discriminative model Conditional Random Field (CRF) 17

  18. Most Frequent Tag Pros ? Cons ? 18

  19. Most Frequent Tag Pros ? Easy to implement . 90% accurate! Cons ? 19

  20. Most Frequent Tag Pros ? Easy to implement . 90% accurate! Cons ? Unknown words: You may not have seen all words. Context dependence: Bill paid me $50. vs. Bill me $50. 20

  21. Hidden Markov Model Probabilistic generative model for sequences. Assume an underlying set of hidden (unobserved) states in which the model can be (e.g. parts of speech). Assume probabilistic transitions between states over time (e.g. transition from POS to another POS as sequence is generated). Assume a probabilisticgeneration of tokens from states (e.g. words generated for each POS). Assume current state is dependent only on the previous state. 21

  22. POS HMMs: A Generative Story There is some probabilistic process (or machine) that generates sentences. a) Switches through a finite set of POS states probabilistically. e.g., Switches from state i to j with probability Pr(Sj | Si) b) In each state the machine emits some word with a probability that is specific to the state. e.g. Pr(dog | state 1) = 0.63 Pr(dog | state 2) = 0.21 Any sentence is a sequence of (observed) words emitted by a finite state machine that switches through (hidden) POS states 22

  23. Formal Definition of an HMM A set of N +2 states S = {s0,s1,s2, sN, sF} A set of M possible observations O = {o1,o2, , oM} A state transition probability distribution aij=P(qt+1=sj|qt=si) N + = 1 A = {aij} = 1 0 a a i N ij iF j Observation probability distribution, B = {b0(k), b1(k), b2(k), , bF(k)} bj(k)=P(vk at t|qt=sj) Parameters of the model ={A,B} 23

  24. Three things you can do with an HMM Decode the state sequences that the machine went through. Learn the parameters of the model. Predict likelihood of observation sequences. 24

  25. Observation Likelihood START NNS VBP NNS DOT chase cats . Dogs Given the transition and emission tables: What is the probability of observing the sentence given the model? Pr(Dogs chase cats.) = Pr(NNS|START) x Pr(Dogs|NNS) x Pr(VBP|NNS) x Pr(chase|VBP) x Pr(NNS|VBP) x Pr(cats|NNS) x Pr(DOT | NNS) x Pr( . | DOT ) 25

  26. Decoding START S1 S2 S3 . chase cats . Dogs Given the transition and emission probabilities tables: What are the (most likely) hidden states S1, S2, and S3 that the machine transitioned through in order to produce the observed sentence? 26

  27. Most Likely State Sequence (Decoding) Given an observation sequence, O, and a model, , what is the most likely state sequence, Q=q1,q2, qT, that generated this sequence from this model? Used for sequence labeling. Assume each state corresponds to a tag. Determine the globally best assignment of tags to all tokens in a sequence. Uses a principled approach grounded in probability theory. 27

  28. Most Likely State Sequence (Decoding) Suppose there are N states (pos tags + start + final states). Let there be an input sentence that has T words. Given the N x N transition matrix A and the N x T emission matrix B: What is a naive algorithm for finding the most likely state sequence? 28

  29. Most Likely State Sequence (Decoding) Suppose there are N states (pos tags + start + final states). Let there be an input sentence that has T words. Given the N x N transition matrix A and the N x T emission matrix B: What is a naive algorithm for finding the most likely state sequence? Brute(S, T): Iterate through every possible state sequence S. Compute Pr(S | T) Select the S that maximizes Pr(S | T) What is the complexity of Brute? 29

  30. Most Likely State Sequence (Decoding) Suppose there are N states (pos tags + start + final states). Let there be an input sentence that has T words. Given the N x N transition matrix A and the N x T emission matrix B: What is a naive algorithm for finding the most likely state sequence? Brute(S, T): Iterate through every possible state sequence S. Compute Pr(S | T) Select the S that maximizes Pr(S | T) What is the complexity of Brute? O(number of state sequences x T) 30

  31. A More Efficient Solution Dynamic Programming can also be used Exploit the Markov assumption Efficiently determine the most likely state sequence for a given observation and model. Standard procedure is called the Viterbi algorithm (Viterbi, 1967) Has O(N2 T) time complexity. 31

  32. Viterbi Scores Recursively compute the probability of the most likely subsequence of states that accounts for the first t observations and ends in state sj. = = ( ) max ,..., , 1 q ( , ,..., , ,..., , | ) v j P q q q o o q s 0 1 1 1 t t t t j q q 0 1 t Also record back pointers to trace back the most probable state sequence. btt(j) stores the state at time t-1 that maximizes the probability that system was in state sj at time t (given the observed sequence). 32

  33. Computing the Viterbi Scores Initialization Recursion = ( ) ( ) 1 v j a b o j N 1 0 1 j j Termination N vt(j)=max vt-1(i)aijbj(ot) 1 j N, 2 t T i=1 N P*=vT+1(sF)=max vT(i)aiF i=1 33

  34. Computing the Viterbi Backpointers Initialization Recursion = ( ) 1 bt j s j N 1 0 Termination N btt(j)=argmax vt-1(i)aijbj(ot) 1 j N, 2 t T i=1 N = = * ( ) argmax = v ( ) q bt s i a + 1 T T F T iF 1 i Final state in the most probable state sequence. Follow back pointers to initial state to construct full sequence. 34

  35. Viterbi Backpointers s1 s2 sF s0 sN t1 t2 t3 tT-1 tT 35

  36. Viterbi Backtrace s1 s2 sF s0 sN t1 t2 t3 tT-1 tT Most likely Sequence: s0 sN s1 s2 s2 sF 36

  37. Learning the parameters of the HMM Transition Probabilities: Pr(Si | Sj) or aij NN 0.2 0.3 0.2 VB 0.4 0.1 0.001 JJ 0.01 0.15 0.2 NN VB JJ Emission Probabilities: Pr(vk | Sj) or bj(k) NNS 0.4 0.3 0.2 VB 0.2 0.001 0.2 JJ 0.01 0.15 0.2 dogs cats chase 37

  38. Learning Supervised Learning: All training sequences are completely labeled (tagged). Unsupervised Learning: All training sequences are unlabeled (but generally know the number of tags, i.e. states). Semisupervised Learning: Some training sequences are labeled, most are unlabeled. 38

  39. Supervised HMM Training If training sequences are labeled (tagged) with the underlying state sequences that generated them, then the parameters, ={A,B} can all be estimated directly. Training Sequences John ate the apple A dog bit Mary Mary hit the dog John gave Mary the cat. ... Supervised HMM Training Det Noun PropNoun Verb 39

  40. Supervised Parameter Estimation Estimate state transition probabilities based on tag bigram and unigram statistics in the labeled data. = = ( q , = ) C q s s + t 1 t i j = a ij ( ) C q s t i Estimate the observation probabilities based on tag/word co-occurrence statistics in the labeled data. = = ( , = ) C q s o v i j i k = ( ) b k j ( ) C q s i j Use appropriate smoothing if training data is sparse. 40

  41. Learning and Using HMM Taggers Use a corpus of labeled sequence data to easily construct an HMM using supervised training. Given a novel unlabeled test sequence to tag, use the Viterbi algorithm to predict the most likely (globally optimal) tag sequence. 41

  42. Evaluating Taggers Train on training set of labeled sequences. Possibly tune parameters based on performance on a development set. Measure accuracy on a disjoint test set. Generally measure tagging accuracy, i.e. the percentage of tokens tagged correctly. Accuracy of most modern POS taggers, including HMMs is 96 97% (for Penn tagset trained on about 800K words) . Generally matching human agreement level. 42

  43. Stop Here. Questions? Next class Intro to machine learning (possibly). Unsupervised learning for HMMs Issues with HMMs MEMMs Sign up on Piazza. Think projects! 43

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#