Advanced NLP Modeling Techniques: Approximation-aware Training

 
Section 6: Approximation-aware Training

1
Outline

Do you want to push past the simple NLP models (logistic regression, PCFG, etc.) that we've all been using for 20 years? Then this tutorial is extremely practical for you!

1. Models: Factor graphs can express interactions among linguistic structures.
2. Algorithm: BP estimates the global effect of these interactions on each variable, using local computations.
3. Intuitions: What's going on here? Can we trust BP's estimates?
4. Fancier Models: Hide a whole grammar and dynamic programming algorithm within a single factor. BP coordinates multiple factors.
5. Tweaked Algorithm: Finish in fewer steps and make the steps faster.
6. Learning: Tune the parameters. Approximately improve the true predictions -- or truly improve the approximate predictions.
7. Software: Build the model you want!

2
 
Modern NLP

[Diagram: modern NLP sits at the intersection of Linguistics, Mathematical Modeling, Machine Learning, and Combinatorial Optimization.]

4
 
Machine Learning for NLP

Linguistics inspires the structures we want to predict.

5
 
Machine Learning for NLP

Our model defines a score for each structure.

6
 
 
Machine Learning for NLP

Our model defines a score for each structure. It also tells us what to optimize.

7
 
Machine Learning for NLP

Learning tunes the parameters of the model:
Given training instances {(x1, y1), (x2, y2), ..., (xn, yn)}, find the best model parameters, θ.

8
 
Machine Learning for NLP

Inference finds the best structure for a new sentence:
- Given a new sentence, x_new
- Search over the set of all possible structures (often exponential in the size of x_new)
- Return the highest-scoring structure, y*

(Inference is usually called as a subroutine in learning.)

11
Machine Learning for NLP

Inference finds the best structure for a new sentence:
- Given a new sentence, x_new
- Search over the set of all possible structures (often exponential in the size of x_new)
- Return the Minimum Bayes Risk (MBR) structure, y* (the decision rule is written out below)

(Inference is usually called as a subroutine in learning.)

12
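For reference, the MBR decision rule mentioned above can be written out as follows; this is the standard definition (not spelled out on the slide itself), and the Hamming-loss special case explains why MBR decoding only needs per-variable marginals:

\[
y^{*} \;=\; \operatorname*{argmin}_{\hat{y}} \; \mathbb{E}_{y \sim p_{\theta}(\cdot \mid x_{\text{new}})}\!\left[\, \ell(\hat{y}, y) \,\right]
\]

For Hamming loss (the number of incorrectly predicted variables), this decomposes into

\[
y_{i}^{*} \;=\; \operatorname*{argmax}_{v} \; p_{\theta}(y_{i} = v \mid x_{\text{new}}) \quad \text{for each variable } y_{i},
\]

i.e. pick the marginally most probable value of each variable, which is exactly where the (approximate) marginals from BP get used.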
Machine Learning for NLP

The same inference problem, annotated by difficulty: depending on the model, returning the MBR structure y* ranges from easy, to polynomial time, to NP-hard.

13
 
Modern NLP

- Linguistics inspires the structures we want to predict. It also tells us what to optimize.
- Our model defines a score for each structure.
- Learning tunes the parameters of the model.
- Inference finds the best structure for a new sentence. (Inference is usually called as a subroutine in learning.)

14
 
An Abstraction for Modeling

Now we can work at this level of abstraction.

15
Training

Thus far, we've seen how to compute (approximate) marginals, given a factor graph... but where do the potential tables ψα come from?
- Some have a fixed structure (e.g. Exactly1, CKYTree).
- Others could be trained ahead of time (e.g. TrigramHMM).
- For the rest, we define them parametrically and learn the parameters!

Two ways to learn:
1. Standard CRF training (very simple; often yields state-of-the-art results)
2. ERMA (less simple, but takes the approximations and the loss function into account)

16
Standard CRF Parameterization

Define each potential function in terms of a fixed set of feature functions over the observed variables (the sentence) and the predicted variables (the structure); the formulas are written out below.

17
Slides 18-19 instantiate the same parameterization on the running example: first the POS-tagging factors for "time flies like an arrow", then the constituency-tree factors added on top.

18-19
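Written out in standard CRF notation (consistent with the definition above), each potential is log-linear in the parameters, and the potentials multiply into a globally normalized conditional distribution:

\[
\psi_{\alpha}(y_{\alpha}, x) \;=\; \exp\!\Big( \sum_{k} \theta_{k}\, f_{k}(y_{\alpha}, x) \Big),
\qquad
p_{\theta}(y \mid x) \;=\; \frac{1}{Z(x)} \prod_{\alpha} \psi_{\alpha}(y_{\alpha}, x),
\qquad
Z(x) \;=\; \sum_{y'} \prod_{\alpha} \psi_{\alpha}(y'_{\alpha}, x).
\]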
What is Training?

That's easy: training = picking good model parameters!

But how do we know if the model parameters are any "good"?

20
Conditional Log-likelihood Training

1. Choose a model such that the derivative in #3 is easy.
2. Choose an objective: assign high probability to the things we observe and low probability to everything else.
3. Compute the derivative by hand using the chain rule.
4. Replace exact inference by approximate inference.

21
Conditional Log-likelihood Training

The same four steps, with the key detail for step 4: we can approximate the factor marginals by the (normalized) factor beliefs from BP. The gradient written out below makes explicit where those marginals enter.

22
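Step 3's derivative has the familiar "observed features minus expected features" form; writing it out (standard CRF algebra, stated here for reference) shows exactly where the factor marginals, and hence the BP beliefs, enter:

\[
\frac{\partial}{\partial \theta_{k}} \log p_{\theta}\big(y^{(i)} \mid x^{(i)}\big)
\;=\;
\sum_{\alpha} f_{k}\big(y^{(i)}_{\alpha}, x^{(i)}\big)
\;-\;
\sum_{\alpha} \sum_{y_{\alpha}} p_{\theta}\big(y_{\alpha} \mid x^{(i)}\big)\, f_{k}\big(y_{\alpha}, x^{(i)}\big).
\]

The factor marginals p_θ(y_α | x^(i)) in the second term are what loopy BP approximates with its factor beliefs.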
Stochastic Gradient Descent

Input: training data {(x(i), y(i)) : 1 ≤ i ≤ N}; initial model parameters, θ.
Output: trained model parameters, θ.

Algorithm:
While not converged:
- Sample a training example (x(i), y(i)).
- Compute the gradient of log p_θ(y(i) | x(i)) with respect to our model parameters θ.
- Take a (small) step in the direction of the gradient.

(A short code sketch of this loop follows below.)

23
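A minimal sketch of that loop in Python (hypothetical helper names; log_likelihood_and_gradient stands in for whatever inference routine computes log p_θ(y | x) and its gradient, exactly or approximately):

import random
import numpy as np

def sgd(train_data, num_params, log_likelihood_and_gradient,
        step_size=0.1, num_epochs=10, seed=0):
    """Stochastic gradient ascent on the conditional log-likelihood.

    train_data: list of (x, y) pairs.
    log_likelihood_and_gradient(theta, x, y) -> (float, np.ndarray):
        runs (possibly approximate) inference for one example and returns
        log p_theta(y | x) together with its gradient w.r.t. theta.
    """
    rng = random.Random(seed)
    theta = np.zeros(num_params)
    for _ in range(num_epochs):
        examples = list(train_data)
        rng.shuffle(examples)
        for x, y in examples:
            _, grad = log_likelihood_and_gradient(theta, x, y)
            theta += step_size * grad   # maximize: step in the gradient direction
    return theta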
What's wrong with the usual approach?

If you add too many factors, your predictions might get worse! The model might be richer, but we replace the true marginals with approximate marginals (e.g. beliefs computed by BP).

Approximate inference can cause gradients for structured learning to go awry! (Kulesza & Pereira, 2008)

24
What's wrong with the usual approach?

Mistakes made by standard CRF training:
1. Using BP (approximate)
2. Not taking the loss function into account
3. Should be doing MBR decoding

It's a big pile of approximations... which has tunable parameters. Treat it like a neural net, and run backprop!

25
 
Empirical Risk Minimization

1. Given training data: {(x1, y1), ..., (xn, yn)}.
2. Choose each of these:
   - a decision function (examples: linear regression, logistic regression, neural network)
   - a loss function (examples: mean-squared error, cross entropy)

27
Empirical Risk Minimization

1. Given training data.
2. Choose each of these: a decision function and a loss function.
3. Define the goal: minimize the average loss on the training data (written out below).
4. Train with SGD (take small steps opposite the gradient).

28-29
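The goal in step 3, written out in standard empirical risk notation (filling in the formula the extraction dropped):

\[
\theta^{*} \;=\; \operatorname*{argmin}_{\theta} \; \frac{1}{n} \sum_{i=1}^{n} \ell\Big( f_{\theta}\big(x^{(i)}\big),\, y^{(i)} \Big),
\qquad
\text{one SGD step:}\;\;
\theta \;\leftarrow\; \theta \;-\; \eta\, \nabla_{\theta}\, \ell\Big( f_{\theta}\big(x^{(i)}\big),\, y^{(i)} \Big).
\]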
 
Conditional Log-likelihood Training

1. Choose a model such that the derivative in #3 is easy.
2. Choose an objective: assign high probability to the things we observe and low probability to everything else.
3. Compute the derivative by hand using the chain rule.
4. Replace true inference by approximate inference.

30
What went wrong?

How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!

31
 
Error Back-Propagation

[Slides 32-41: a sequence of figures from (Stoyanov & Eisner, 2012) that unrolls the computation behind an approximate marginal such as P(y3 = noun | x). Each belief is a function of BP messages, each message is a function of other messages and potentials, and everything ultimately depends on the parameters ϴ.]
 
Error Back-Propagation

Applying the chain rule of differentiation over and over.
- Forward pass: regular computation (inference + decoding) in the model (+ remember intermediate quantities).
- Backward pass: replay the forward pass in reverse, computing gradients.

(A toy example follows below.)

42
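A toy illustration of that forward/backward pattern (my own example, not the BP computation itself): the forward pass computes a scalar loss and remembers intermediates; the backward pass replays them in reverse, applying the chain rule.

import math

def forward(theta, x):
    # Forward pass: regular computation, remembering intermediate quantities.
    a = theta * x              # intermediate 1
    b = math.tanh(a)           # intermediate 2
    loss = (b - 1.0) ** 2      # final scalar loss
    return loss, (x, a, b)

def backward(intermediates):
    # Backward pass: replay in reverse, multiplying local derivatives (chain rule).
    x, a, b = intermediates
    dloss_db = 2.0 * (b - 1.0)
    db_da = 1.0 - math.tanh(a) ** 2
    dloss_dtheta = dloss_db * db_da * x
    return dloss_dtheta

loss, saved = forward(0.5, 2.0)
grad = backward(saved)         # d loss / d theta at theta = 0.5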
Background: Backprop Through Time
(Robinson & Fallside, 1987; Werbos, 1988; Mozer, 1995)

Recurrent neural network: [figure of an RNN with shared parameters a, inputs x_t, hidden states b_t, and outputs y_t+1].

BPTT:
1. Unroll the computation over time.
2. Run backprop through the resulting feed-forward network.
(A code sketch follows below.)

43
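A hedged sketch of BPTT for a tiny tanh RNN in plain NumPy (hypothetical shapes, with a squared-error loss chosen for brevity): step 1 unrolls the network over time, step 2 walks back through the stored states, accumulating gradients for the shared parameter matrices.

import numpy as np

def bptt(Wxh, Whh, Why, xs, ys):
    """Backprop through time for a minimal tanh RNN with squared loss.

    xs, ys: lists of input / target column vectors, one per time step.
    Returns the loss and gradients for the shared parameter matrices.
    """
    T = len(xs)
    hs = {-1: np.zeros((Whh.shape[0], 1))}
    preds, loss = {}, 0.0
    # 1. Unroll the computation over time (forward pass).
    for t in range(T):
        hs[t] = np.tanh(Wxh @ xs[t] + Whh @ hs[t - 1])
        preds[t] = Why @ hs[t]
        loss += 0.5 * float(np.sum((preds[t] - ys[t]) ** 2))
    # 2. Run backprop through the resulting feed-forward network.
    dWxh, dWhh, dWhy = np.zeros_like(Wxh), np.zeros_like(Whh), np.zeros_like(Why)
    dh_next = np.zeros_like(hs[-1])
    for t in reversed(range(T)):
        dy = preds[t] - ys[t]
        dWhy += dy @ hs[t].T
        dh = Why.T @ dy + dh_next          # gradient flowing into h_t
        da = (1.0 - hs[t] ** 2) * dh       # back through tanh
        dWxh += da @ xs[t].T
        dWhh += da @ hs[t - 1].T
        dh_next = Whh.T @ da               # pass gradient on to h_{t-1}
    return loss, dWxh, dWhh, dWhy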
What went wrong?

How did we compute these approximate marginal probabilities anyway? By Belief Propagation, of course!

44
 
ERMA

Empirical Risk Minimization under Approximations (ERMA): apply backprop through time to loopy BP.
- Unrolls the BP computation graph
- Includes inference, decoding, loss, and all the approximations along the way

(Stoyanov, Ropson, & Eisner, 2011)

45
ERMA

Key idea: open up the black box!
1. Choose the model to be the computation with all its approximations.
2. Choose the objective to likewise include the approximations.
3. Compute the derivative by backpropagation (treating the entire computation as if it were a neural network).
4. Make no approximations! (Our gradient is exact.)

A code sketch of this unrolled computation follows below.

(Stoyanov, Ropson, & Eisner, 2011)

46
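Here is a minimal sketch of that unrolled computation, written in PyTorch under simplifying assumptions (pairwise factors only, synchronous message updates, a fixed number of BP iterations, and expected Hamming loss as a differentiable stand-in for decoder + loss). The function and tensor names are hypothetical, not the authors' implementation; the point is only that the whole pipeline -- potentials, loopy BP, beliefs, loss -- is ordinary differentiable code, so autodiff gives the exact gradient of this approximate objective.

import torch

def erma_style_loss(theta, unary_feats, pair_feats, edges, gold, T=5):
    """Unrolled loopy BP + loss for a pairwise factor graph (a sketch).

    theta:       [F] parameter vector with requires_grad=True.
    unary_feats: [V, L, F] features for each variable/label.
    pair_feats:  [E, L, L, F] features for each edge and label pair.
    edges:       list of (i, j) variable index pairs.
    gold:        [V] long tensor of gold label indices.
    """
    log_unary = unary_feats @ theta            # [V, L] log-potentials
    log_pair = pair_feats @ theta              # [E, L, L] log-potentials
    V, L = log_unary.shape

    # Messages in log space, one per directed edge, initialized to uniform.
    msgs = {}
    for (i, j) in edges:
        msgs[(i, j)] = torch.zeros(L)
        msgs[(j, i)] = torch.zeros(L)

    # Unroll T synchronous iterations of sum-product loopy BP.
    for _ in range(T):
        new_msgs = {}
        for e, (i, j) in enumerate(edges):
            for src, dst, pair in ((i, j, log_pair[e]), (j, i, log_pair[e].t())):
                incoming = log_unary[src] + sum(
                    (m for (k, d), m in msgs.items() if d == src and k != dst),
                    torch.zeros(L),
                )
                # m_{src->dst}(y_dst) = logsum_{y_src} [ pair(y_src, y_dst) + incoming(y_src) ]
                new_msgs[(src, dst)] = torch.logsumexp(pair + incoming.unsqueeze(1), dim=0)
        msgs = {k: v - torch.logsumexp(v, dim=0) for k, v in new_msgs.items()}

    # Beliefs (approximate marginals) for each variable.
    log_beliefs = [
        log_unary[v] + sum((m for (k, d), m in msgs.items() if d == v), torch.zeros(L))
        for v in range(V)
    ]
    beliefs = torch.softmax(torch.stack(log_beliefs), dim=1)   # [V, L]

    # Differentiable surrogate for MBR decoding + Hamming loss: expected Hamming loss.
    return (1.0 - beliefs[torch.arange(V), gold]).sum()

Training then looks just like the SGD loop sketched earlier, except the gradient comes from autodiff through the unrolled graph (given tensors shaped as documented above):

theta = torch.zeros(num_feats, requires_grad=True)
loss = erma_style_loss(theta, unary_feats, pair_feats, edges, gold)
loss.backward()        # exact gradient of the approximate objective; theta.grad drives the update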
ERMA

Key idea: open up the black box! ERMA is Empirical Risk Minimization applied to the whole pipeline, with the Minimum Bayes Risk (MBR) decoder included inside the computation being differentiated.

(Stoyanov, Ropson, & Eisner, 2011)

47
Approximation-aware Learning

Key idea: open up the black box!

What if we're using Structured BP instead of regular BP? No problem, the same approach still applies! The only difference is that we embed dynamic programming algorithms inside our computation graph. (A small example follows below.)

(Gormley, Dredze, & Eisner, 2015)

48
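To make "embed a dynamic programming algorithm inside the computation graph" concrete, here is a hedged sketch (my own illustration, not the tutorial's software): the forward algorithm for a linear chain, written as differentiable tensor code. Because log Z is computed by ordinary differentiable operations, backprop through the DP recovers the marginals, which is what lets a structured factor sit inside the larger network.

import torch

def chain_log_partition(log_unary, log_trans):
    """Forward algorithm for a linear-chain model as differentiable tensor code.

    log_unary: [T, L] log-potentials for each position and label.
    log_trans: [L, L] log-potentials for label transitions.
    Returns log Z, the log normalizer of the chain CRF.
    """
    alpha = log_unary[0]                                   # [L]
    for t in range(1, log_unary.shape[0]):
        # alpha[j] = logsumexp_i( alpha[i] + trans[i, j] ) + unary[t, j]
        alpha = torch.logsumexp(alpha.unsqueeze(1) + log_trans, dim=0) + log_unary[t]
    return torch.logsumexp(alpha, dim=0)

# Because the DP is differentiable, backprop recovers marginals:
# d log Z / d log_unary[t, j] = p(y_t = j | x).
log_unary = torch.randn(6, 4, requires_grad=True)
log_trans = torch.randn(4, 4)
chain_log_partition(log_unary, log_trans).backward()
unary_marginals = log_unary.grad     # [T, L] posterior marginals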
Connection to Deep Learning

[Figure: the unrolled inference computation viewed as a deep network that outputs P(y | x) ∝ exp(Θ_y · f(x)), with feature layers and shared weight matrices.]

(Gormley, Yu, & Dredze, in submission)

49
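Spelled out, the output layer of that view is just a softmax over the label/structure space (standard notation):

\[
P(y \mid x) \;=\; \frac{\exp\big(\Theta_{y} \cdot f(x)\big)}{\sum_{y'} \exp\big(\Theta_{y'} \cdot f(x)\big)}.
\]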
 
Empirical Risk Minimization under Approximations (ERMA)

Figure from (Stoyanov & Eisner, 2012): a 2x2 comparison of training methods by whether they are loss-aware and/or approximation-aware.
- Neither: MLE
- Loss-aware only: SVMstruct [Finley and Joachims, 2008], M3N [Taskar et al., 2003], Softmax-margin [Gimpel & Smith, 2010]
- Both loss-aware and approximation-aware: ERMA

50
