Learning to Prune: Exploring the Frontier of Fast & Accurate Parsing

Tim Vieira & Jason Eisner
Johns Hopkins University

Slide 1: Title
[Figure: scatter plot of parsing systems, accuracy vs. runtime, with the Pareto frontier highlighted.]
Slide 2: Learning to Prune: Exploring the Frontier of Fast & Accurate Parsing
[Figure: the same accuracy-vs-runtime plot, annotated "as fast" and "more accurate" relative to the Pareto frontier.]
Slide 3: Learning to Prune: Exploring the Frontier of Fast & Accurate Parsing
We already use ML to learn to be accurate. This work uses ML to learn to be fast too!
We need systems with tunable runtimes.
Slide 4: Learning to be Fast & Accurate
Tradeoff parameter λ: reward = accuracy - λ · runtime, where accuracy and runtime are those of the end-to-end system.
Optimizing for different λ's gives a Pareto frontier of systems.
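To make the Pareto frontier concrete, here is a minimal sketch (not from the talk; the function name and the example numbers are made up) of selecting the non-dominated systems from a set of (runtime, accuracy) measurements.

    # Hypothetical sketch: keep only systems that are not dominated, i.e. no
    # other system is both faster and at least as accurate.
    def pareto_frontier(points):
        """points: iterable of (runtime, accuracy) pairs."""
        frontier, best_acc = [], float("-inf")
        for runtime, acc in sorted(points):      # fastest systems first
            if acc > best_acc:                   # strictly better accuracy so far
                frontier.append((runtime, acc))
                best_acc = acc
        return frontier

    systems = [(0.2, 0.80), (0.5, 0.86), (0.4, 0.82), (0.9, 0.87), (1.0, 0.85)]
    print(pareto_frontier(systems))
    # [(0.2, 0.8), (0.4, 0.82), (0.5, 0.86), (0.9, 0.87)]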
 
 
 
Slide 5: Parsing
Example sentence: "Papa ate the caviar with a spoon ."
Toy grammar:
@S → NP VP
VP → VBD NP
NP → DT NN
DT → the
VBG → cooks
NN → chef
[Figure: parse tree of the example sentence.]
Slide 6: Parsing
Berkeley Grammar (sm6): up to 64 refinements to each non-terminal.
Refined grammar (sample rules):
@S_0 → NP_0 VP_0
@S_0 → NP_0 VP_1
@S_0 → NP_0 VP_2
@S_0 → NP_1 VP_0
@S_1 → NP_0 VP_0
VP_0 → VBD_0 NP_0
NP_0 → DT_0 NN_0
NN_39 → spoon
VBD_2 → ate
DT_0 → the
[Figure: the parse tree again, now labeled with refined non-terminals such as NP_31, VBD_2, IN_23, DT_15, NN_8, NN_39, <.>_2.]
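As a concrete (hypothetical) illustration of what such a refined grammar looks like as data, the sketch below stores weighted binary and lexical rules indexed the way a CKY parser wants them; the rule scores are invented.

    # Hypothetical sketch: a refined PCFG as two rule tables. Symbols such as
    # "NP_0" pair a coarse label with a refinement index, as in Berkeley sm6.
    from collections import defaultdict

    binary_rules = defaultdict(list)    # (left, right) -> [(parent, log_prob)]
    lexical_rules = defaultdict(list)   # word -> [(tag, log_prob)]

    def add_binary(parent, left, right, logp):
        binary_rules[(left, right)].append((parent, logp))

    def add_lexical(tag, word, logp):
        lexical_rules[word].append((tag, logp))

    add_binary("@S_0", "NP_0", "VP_0", -0.7)   # invented scores
    add_binary("VP_0", "VBD_0", "NP_0", -1.2)
    add_binary("NP_0", "DT_0", "NN_0", -0.4)
    add_lexical("DT_0", "the", -0.1)
    add_lexical("VBD_2", "ate", -2.3)
    add_lexical("NN_39", "spoon", -3.1)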
 
Slide 7: Pruned CKY parsing
Slide 8: Parsing in a nutshell
[Figure: CKY chart over "Papa ate the caviar with a spoon .", highlighting constituents such as VP and PP.]
O(|G| · width) runtime per span - SLOW!
Slide 9: Speeding up parsing
What could we easily prune? The final "." attaches at the top, and "[… with ] a" is a bad place to end a constituent.
Pruning a span prunes all nonterminals that might cover that span, and saves runtime by avoiding the fill operation.
(Roark & Hollingshead, 2008; Bodenstab et al., 2011)
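The following is a minimal pruned-CKY sketch (not the authors' implementation) over the rule tables above: spans whose mask entry is False are never filled, which is exactly where the savings come from. The chart layout and the requirement that `keep` cover every span of width ≥ 2 are assumptions of this sketch.

    # Hypothetical sketch: Viterbi CKY that skips pruned spans.
    # chart[(i, j)] maps a symbol to its best log score over span (i, j);
    # keep[(i, j)] is the pruning mask for spans of width >= 2.
    def pruned_cky(words, keep, binary_rules, lexical_rules):
        n = len(words)
        chart = {(i, i + 1): dict(lexical_rules.get(words[i], [])) for i in range(n)}
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                j = i + width
                if not keep[(i, j)]:        # pruned span: skip the fill operation
                    continue
                cell = {}
                for k in range(i + 1, j):   # split point
                    for lsym, lscore in chart.get((i, k), {}).items():
                        for rsym, rscore in chart.get((k, j), {}).items():
                            for parent, logp in binary_rules.get((lsym, rsym), []):
                                score = lscore + rscore + logp
                                if score > cell.get(parent, float("-inf")):
                                    cell[parent] = score
                chart[(i, j)] = cell
        return chart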
Slide 10: Ideally
Train a classifier to predict whether a span is GOLD or NON-GOLD (Bodenstab et al., 2011).
[Figure: the gold parse tree; its spans are GOLD, every other span is NON-GOLD.]
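A small, hypothetical sketch of that training signal: read the GOLD spans off a treebank tree and label every other span NON-GOLD. The tree encoding and helper names are assumptions, and the feature extraction and classifier are left abstract.

    # Hypothetical sketch: build GOLD / NON-GOLD span labels from a gold tree,
    # the supervision used by the Bodenstab et al. (2011) baseline.
    def gold_spans(tree):
        """tree: nested ("label", [children]) pairs with word strings as leaves."""
        spans = set()
        def visit(node, i):
            label, children = node
            j = i
            for child in children:
                j = j + 1 if isinstance(child, str) else visit(child, j)
            spans.add((i, j))
            return j
        visit(tree, 0)
        return spans

    def span_examples(sentence, tree):
        """Binary examples ((i, j), is_gold) for every span of width >= 2."""
        gold, n, out = gold_spans(tree), len(sentence), []
        for width in range(2, n + 1):
            for i in range(n - width + 1):
                span = (i, i + width)
                out.append((span, span in gold))
        return out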
 
Slide 11: Impact of PRUNE depends on context (sentence, grammar, other PRUNE choices)
Gold spans are bad to prune, but not equally bad …
Pruning one gold span: the parser recovers by finding a pretty good parse.
Pruning another gold span: the parser fails with no parse :-(
Slide 12: Impact of PRUNE depends on context (sentence, grammar, other PRUNE choices)
Gold spans are bad to prune, but not equally bad …
Non-gold spans are good to prune, but not equally good …
Pruning one non-gold span saves a lot of BUILD time for this span and for other spans built as a result :-)
Pruning another saves only a tiny bit of BUILD time, but also kills off a top-scoring but wrong parse :-)
Slide 13: Impact of PRUNE depends on context
We want our classifier to take this into account.
Bodenstab et al. (2011) just weight gold errors more than non-gold errors.
We refine this approach by considering the real cost of each error in context.
[Figure: contrasting one really bad error with a few mild errors.]
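One way to write that contrast, in our own notation rather than the paper's: the baseline pays fixed asymmetric penalties, while a context-sensitive loss weights each span s by the measured reward gap between keeping and pruning it (with y*(s) the action whose rolled-out reward is higher).

    \ell_{\text{baseline}}(\theta) = \sum_{s \in \text{gold}} w_{\text{g}}\,\mathbf{1}[\hat{y}_\theta(s)=\text{PRUNE}]
        + \sum_{s \notin \text{gold}} w_{\text{ng}}\,\mathbf{1}[\hat{y}_\theta(s)=\text{KEEP}]

    \ell_{\text{ours}}(\theta) = \sum_{s} \big|\, r(\text{keep } s) - r(\text{prune } s) \,\big|\;\mathbf{1}[\hat{y}_\theta(s) \neq y^{*}(s)]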
 
Slide 14: End-to-end training
Slide 15: End-to-end training
What if we prune here instead? A rollout re-parses under the changed decision and re-measures the reward, accuracy - λ · runtime.
Slide 16: Rolling out lots of changes to get training data
"Second-guess" each decision the current classifier makes, collect the outcomes into a weighted classification dataset, and retrain the classifier: LOLS (Chang et al., 2015).
But all those rollouts are prohibitively slow to train!
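A minimal sketch of that loop, under stated assumptions: spans_of, features, classifier, and parse_reward (which runs the pruned parser and returns accuracy - λ · runtime) are hypothetical helpers, and this naive version re-parses from scratch for every rollout, which is exactly the cost the next slides attack.

    # Hypothetical LOLS-style iteration: roll in with the current classifier,
    # second-guess every pruning decision, and turn the reward differences into
    # a cost-weighted classification dataset.
    def lols_iteration(sentences, classifier, parse_reward, lam):
        dataset = []
        for sent in sentences:
            mask = {s: classifier.predict(sent, s) for s in spans_of(sent)}
            for span in spans_of(sent):          # second-guess one decision
                rewards = {}
                for action in (True, False):     # keep vs. prune
                    trial = dict(mask)
                    trial[span] = action
                    rewards[action] = parse_reward(sent, trial, lam)  # full re-parse (slow!)
                best = max(rewards, key=rewards.get)
                weight = abs(rewards[True] - rewards[False])          # cost of getting it wrong
                dataset.append((features(sent, span), best, weight))
        classifier.fit(dataset)                  # retrain on the weighted examples
        return classifier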
 
Slide 17: Making rollouts fast
Slide 18: Change propagation
We already computed the chart for the current mask, and flipping one decision is such a tiny change. Do you really want to re-run the parser from scratch?
The immediate effects of the change lie in a limited range of the chart.
Slide 19: Changeprop
Change this span: changes may cascade, but they quickly "peter out" because each span only cares about its best derivation. In fact, 90% of rollouts change < 10% of the chart.
Homework: figure out the rest. Hints: use indexing data structures, a priority queue of changes to process, and an undo list (to flip back quickly). A sketch follows below.
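A hypothetical sketch of that agenda-driven idea (not the authors' code): after flipping the mask for one span, recompute only the cells whose inputs changed, in order of width, and record old values on an undo list so the chart can be restored after the rollout. recompute_cell and parents_of (the larger spans that use a given span) are assumed callbacks.

    # Hypothetical change-propagation sketch over the CKY chart above.
    import heapq

    def propagate(chart, keep, changed_span, recompute_cell, parents_of):
        agenda = [(changed_span[1] - changed_span[0], changed_span)]  # (width, span)
        queued, undo = {changed_span}, []
        while agenda:
            _, span = heapq.heappop(agenda)
            old = chart.get(span)
            new = recompute_cell(span) if keep[span] else {}
            if new == old:
                continue                       # the change "petered out" here
            undo.append((span, old))
            chart[span] = new
            for parent in parents_of(span):    # larger spans built from this one
                if parent not in queued:
                    queued.add(parent)
                    heapq.heappush(agenda, (parent[1] - parent[0], parent))
        return undo                            # replay in reverse to flip back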
Slide 20: Dynamic programming speedup
Rollouts are basically computing a gradient: a change in reward per change in input.
Slide 21: Dynamic programming speedup
The reward of the one-best parse is not differentiable! So we relax decoding to be stochastic and use expected reward as a differentiable proxy.
Slide 22: Dynamic programming speedup
Naive rollouts: O(n^2) rollouts · O(G · n^3) per parse = O(G · n^5).
With the expected-reward gradient*: O(G · n^3) in total, 10-14x faster!
* Efficient for a certain class of reward functions, with a variant of inside-outside (Li & Eisner, '09).
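In our notation (a sketch, not the exact objective of the paper), assume each span s is kept independently with probability p_s under the relaxed, stochastic mask b. Then the proxy and its gradient are

    J(\theta) = \mathbb{E}_{b \sim p_\theta}\left[\, r(b) \,\right], \qquad
    \frac{\partial J}{\partial p_s} = \mathbb{E}\left[\, r(b) \mid b_s = 1 \,\right] - \mathbb{E}\left[\, r(b) \mid b_s = 0 \,\right],

so one gradient recovers, for every span at once, the same keep-vs-prune contrast that a rollout measures; for suitable rewards it can be computed with a variant of inside-outside in O(G · n^3) total.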
 
Slide 23: Experiments
Slide 24: Controlled experiment
Setup
Data: English Penn Treebank
Reward: corpus-level F1 - λ · (number of edges built)
Grammar: Coarse = vanilla PTB grammar; Fine = Berkeley split-merge grammar, level 6
Two training speed-ups, each with its own surrogate reward:
changeprop (CP): per-sentence F1 - λ · (number of nodes built)
dynamic programming (DP): expected recall - λ · (size of mask)
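For concreteness, the three objectives as tiny functions (our naming; the F1 and recall computations are assumed to exist elsewhere):

    # Hypothetical helpers mirroring the rewards above; lam is the tradeoff λ.
    def full_reward(corpus_f1, edges_built, lam):
        return corpus_f1 - lam * edges_built            # evaluation objective

    def cp_surrogate(sentence_f1, nodes_built, lam):
        return sentence_f1 - lam * nodes_built          # changeprop training reward

    def dp_surrogate(expected_recall, mask_size, lam):
        return expected_recall - lam * mask_size        # dynamic-programming training reward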
Slide 25: Learning curve
Sweep the baseline curve by varying its asymmetric penalty.
Slide 26: Learning curve
Infer a λ value that the baseline seems to optimize.
Slide 27: Learning curve
Train with LOLS for 6 days. On the 7th day, LOLS rested, and performance was good!
Slide 28: Learning curve
Slide 29: Learning curve
 
Slide 30: Test set results: Coarse-grained grammar
[Plot: accuracy vs. millions of edges built per sentence; marked points have significantly higher reward (paired permutation test, p < 0.05).]
 
Slide 31: Test set results: Fine-grained grammar
Slide 32: Conclusions
Learning systems that are fast and accurate (in the Pareto sense) is an exciting new area.
We looked at pruned dynamic programming.
We found that optimizing end-to-end performance leads to a better frontier in the case of parsing.
Not all errors are equal, so we must measure the consequences of each pruning decision.
Measuring these consequences requires running the system over and over again with different pruning masks.
We sped this up with our change propagation and dynamic programming algorithms.
 
Slide 33: Thanks!
@xtimv
github.com/timvieira/learning-to-prune
 
Slide 35: Backup slides
Slide 36: Relationship to other parsers
Slightly faster and more accurate than the method that inspired our baseline.
Slide 37: Changes are really sparse!
[Plot: % of chart changed per rollout.]
Slide 38: Empirical comparison
[Plot: average time to do all rollouts, runtime in seconds vs. sentence length.]
 
Slide 39: Runtime?
CKY runtime is O(G · n^3): O(n^2) cells, and O(G · n) time to fill each cell.
Given the number of symbols in each cell of a realistic grammar, the number of grammar lookups needed to fill this one cell is
255·484 + 443·484 + 456·483 + 472·471 + 488·347 = 949,728.
Almost a million lookups for one cell!
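A quick check of that arithmetic (each term is presumably the product of the symbol counts in the left and right child cells for one split point of the span):

    splits = [(255, 484), (443, 484), (456, 483), (472, 471), (488, 347)]
    print(sum(left * right for left, right in splits))   # 949728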
 
Slide 40: Speeding-up rollouts
CP: coarse grammar 4.9-6.6x faster; fine grammar 1.9-2.4x faster.
DP: coarse grammar 10.4-11.9x faster; fine grammar 10.5-13.8x faster.
 
Slide 41: Runtime with Jacobian
General trick: apply the expansion before the nonlinearity.
This needs T Jacobian-vector products, which is asymptotically as slow as running inference n^2 times.
Computing marginals also has worse constant factors than one-best parsing, so changeprop is much faster here.
Slide 42: Hard to optimize
[Plot: cross-section of the objective along a random direction.]
The objective is high-dimensional, piecewise constant, and riddled with local optima, so it is hard to search directly in parameter space. Instead, we optimize by iteratively refining the supervision given to the classifier.
Slide Note

Today I'm going to talk about learning fast and accurate parsers using ideas from reinforcement learning.

But first, I want to emphasize what it means to be both fast *CLICK* and accurate at the same time.

I'll illustrate this with this scatter plot.

Here each point represents a system for doing some task, say parsing.

- Its POSITION is determined by measuring the system's *CLICK* accuracy and *CLICK* runtime on some evaluation data.

- The reason we have so many points is that we have

- varied the underlying models *CLICK*

- and the "black magic" used to make it fast *CLICK*

The Pareto frontier captures the essence of what it means to be both fast and accurate.

It is a curve that represents the BEST ACCURACY THAT CAN BE ACHIEVED WITHIN A RUNTIME BUDGET.

In other words, there is NO INCENTIVE to use a system that isn't on the Pareto frontier,

SUCH AS this one with the red "x" -- BECAUSE …

