Neural Shift-Reduce Dependency Parsing in Natural Language Processing


This content explores the concept of Shift-Reduce Dependency Parsing in the context of Natural Language Processing. It describes how a Shift-Reduce Parser incrementally builds a parse without backtracking, maintaining a buffer of input words and a stack of constructed constituents. The process involves shifting words onto the stack and reducing elements to form constituents in a deterministic manner. A sample parse of the sentence "Bob eats pasta" is used to illustrate the actions of shifting and reducing constituents within the parser to generate a parse tree.



Presentation Transcript


  1. CS 388: Natural Language Processing. Neural Shift-Reduce Dependency Parsing. Raymond J. Mooney, University of Texas at Austin.

  2. Shift-Reduce Parser Deterministically builds a parse incrementally, bottom-up, and left to right, without backtracking. Maintains a buffer of input words and a stack of constructed constituents. Performs a sequence of operations/actions: Shift: push the next word in the buffer onto the stack. Reduce: replace the top elements of the stack with a constituent composed of them.

  3. Sample Parse of "Bob eats pasta" Buffer: [Bob, eats, pasta]. Stack (top first): [].

  4. Sample Parse of "Bob eats pasta" Action: Shift. Buffer: [eats, pasta]. Stack (top first): Bob.

  5. Sample Parse of "Bob eats pasta" Action: Reduce(Bob → NP). Buffer: [eats, pasta]. Stack (top first): (NP Bob).

  6. Sample Parse of "Bob eats pasta" Action: Shift. Buffer: [pasta]. Stack (top first): eats, (NP Bob).

  7. Sample Parse of "Bob eats pasta" Action: Reduce(eats → VB). Buffer: [pasta]. Stack (top first): (VB eats), (NP Bob).

  8. Sample Parse of "Bob eats pasta" Action: Shift. Buffer: []. Stack (top first): pasta, (VB eats), (NP Bob).

  9. Sample Parse of "Bob eats pasta" Action: Reduce(pasta → NP). Buffer: []. Stack (top first): (NP pasta), (VB eats), (NP Bob).

  10. Sample Parse of "Bob eats pasta" Action: Reduce(VB NP → VP). Buffer: []. Stack (top first): (VP (VB eats) (NP pasta)), (NP Bob).

  11. Sample Parse of "Bob eats pasta" Action: Reduce(NP VP → S). Buffer: []. Stack (top first): (S (NP Bob) (VP (VB eats) (NP pasta))).
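
The slide sequence above can be replayed in a few lines of Python. This is a minimal sketch: the action sequence is hard-coded, constituents are represented as nested tuples, and the function names are illustrative; a real parser would choose each action with a grammar or a learned model.

```python
# Minimal sketch of the shift-reduce loop, replaying the hard-coded
# action sequence for "Bob eats pasta".

def shift(stack, buffer):
    # Shift: push the next buffer word onto the stack.
    stack.append(buffer.pop(0))

def reduce_(stack, n, label):
    # Reduce: replace the top n stack elements with one labeled constituent.
    children = tuple(stack[-n:])
    del stack[-n:]
    stack.append((label,) + children)

buffer = ["Bob", "eats", "pasta"]
stack = []
shift(stack, buffer)       # Stack: Bob
reduce_(stack, 1, "NP")    # Stack: (NP Bob)
shift(stack, buffer)       # Stack: eats, (NP Bob)
reduce_(stack, 1, "VB")    # Stack: (VB eats), (NP Bob)
shift(stack, buffer)       # Stack: pasta, (VB eats), (NP Bob)
reduce_(stack, 1, "NP")    # Stack: (NP pasta), (VB eats), (NP Bob)
reduce_(stack, 2, "VP")    # Stack: (VP (VB eats) (NP pasta)), (NP Bob)
reduce_(stack, 2, "S")     # Stack: (S (NP Bob) (VP (VB eats) (NP pasta)))
print(stack[0])
```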

  12. Shift-Reduce Parsing Must use look-ahead at the next words in the buffer to pick the correct action. Originally introduced to parse programming languages, which are deterministic context-free languages (DCFLs). Using it for NLP requires heuristics to pick an action at each step, which (due to ambiguity) could be wrong, resulting in a garden path. Can back up when an impasse is reached in order to search for a parse.

  13. Shift-Reduce Dependency Parser Easily adapted to dependency parsing by using reduce operators that introduce dependency arcs. In addition to a stack and buffer, maintains the set of dependency arcs created so far.

  14. Arc-Standard System (Nivre, 2004) Buffer b = [b1, b2, ..., bn]. Stack s = [s1, s2, ..., sm] (s1 = top). Arcs A = {label(wi, wj), ...}. Configuration c = (s, b, A). Initial config: ([ROOT], [w1, w2, ..., wn], {}). Final config: ([ROOT], [], {label(wi, wj), ...}).

  15. Arc-Standard Actions Shift: move b1 from the buffer onto the stack. LeftArc(l): add the arc l(s1, s2) and remove s2 from the stack, making the second stack item a dependent of the top. RightArc(l): add the arc l(s2, s1) and remove s1 from the stack, making the top a dependent of the item below it.
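
As a minimal sketch, the three transitions can be written as Python functions over a (stack, buffer, arcs) configuration. Representing arcs as (label, head, dependent) triples, and the function names themselves, are illustrative choices, not from the slides.

```python
# Minimal sketch of the three arc-standard transitions.
# A configuration is (stack, buffer, arcs); arcs are stored as
# (label, head, dependent) triples.

def shift(stack, buffer, arcs):
    # Shift: move the front word of the buffer onto the stack.
    stack.append(buffer.pop(0))

def left_arc(stack, buffer, arcs, label):
    # LeftArc(l): the second stack item becomes a dependent of the top.
    dependent = stack.pop(-2)
    arcs.add((label, stack[-1], dependent))

def right_arc(stack, buffer, arcs, label):
    # RightArc(l): the top stack item becomes a dependent of the one below.
    dependent = stack.pop()
    arcs.add((label, stack[-1], dependent))
```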

  16. Sample Parse of "He has good control" Stack (top first): ROOT. Arcs: {}. Buffer: [He, has, good, control].

  17. Sample Parse of "He has good control" Action: Shift. Stack (top first): He, ROOT. Arcs: {}. Buffer: [has, good, control].

  18. Sample Parse of "He has good control" Action: Shift. Stack (top first): has, He, ROOT. Arcs: {}. Buffer: [good, control].

  19. Sample Parse of "He has good control" Action: LeftArc(nsubj). Stack (top first): has, ROOT. Arcs: {nsubj(has, He)}. Buffer: [good, control].

  20. Sample Parse of "He has good control" Action: Shift. Stack (top first): good, has, ROOT. Arcs: {nsubj(has, He)}. Buffer: [control].

  21. Sample Parse of "He has good control" Action: Shift. Stack (top first): control, good, has, ROOT. Arcs: {nsubj(has, He)}. Buffer: [].

  22. Sample Parse of "He has good control" Action: LeftArc(amod). Stack (top first): control, has, ROOT. Arcs: {nsubj(has, He), amod(control, good)}. Buffer: [].

  23. Sample Parse of "He has good control" Action: RightArc(dobj). Stack (top first): has, ROOT. Arcs: {nsubj(has, He), amod(control, good), dobj(has, control)}. Buffer: [].

  24. Sample Parse of "He has good control" Action: RightArc(root). Stack (top first): ROOT. Arcs: {nsubj(has, He), amod(control, good), dobj(has, control), root(ROOT, has)}. Buffer: [].
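
The parse in slides 16-24 can be replayed with the transition functions sketched after slide 15 (word strings stand in for token indices here, which assumes no repeated words in the sentence):

```python
stack, buffer, arcs = ["ROOT"], ["He", "has", "good", "control"], set()

shift(stack, buffer, arcs)               # Stack: He, ROOT
shift(stack, buffer, arcs)               # Stack: has, He, ROOT
left_arc(stack, buffer, arcs, "nsubj")   # adds nsubj(has, He)
shift(stack, buffer, arcs)               # Stack: good, has, ROOT
shift(stack, buffer, arcs)               # Stack: control, good, has, ROOT
left_arc(stack, buffer, arcs, "amod")    # adds amod(control, good)
right_arc(stack, buffer, arcs, "dobj")   # adds dobj(has, control)
right_arc(stack, buffer, arcs, "root")   # adds root(ROOT, has)

assert stack == ["ROOT"] and buffer == []
print(sorted(arcs))
```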

  25. Stanford Neural Dependency Parser (Chen and Manning, 2014) Train a neural net to choose the best shift-reduce parser action to take at each step. Uses features (words, POS tags, arc labels) extracted from the current stack, buffer, and arcs as context. History (through the citation trail): neural shift-reduce parser (Mayberry & Miikkulainen, 1999); decision-tree shift-reduce parser (Hermjakob & Mooney, 1997); simple learned shift-reduce parser (Simmons & Yu, 1992).

  26. Neural Architecture (figure in the original slides: feed-forward network for parse-action classification).

  27. Context Features Used (rc = right child, lc = left child) The top 3 words on the stack and buffer: s1, s2, s3; b1, b2, b3. The first and second leftmost/rightmost children of the top two words on the stack: lc1(si), rc1(si), lc2(si), rc2(si), i = 1, 2. The leftmost-of-leftmost and rightmost-of-rightmost children of the top two words on the stack: lc1(lc1(si)), rc1(rc1(si)), i = 1, 2. Also include the POS tag and parent arc label (where available) for these same items.
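
A sketch of how these 18 word positions might be gathered, assuming helper functions lc1/rc1/lc2/rc2 that return a word's first or second leftmost/rightmost child, or a padding token when absent (all names here are illustrative):

```python
NULL = "<NULL>"  # illustrative padding token for positions that don't exist

def word_features(stack, buffer, lc1, rc1, lc2, rc2):
    # The child functions are assumed to map NULL to NULL as well.
    s = lambda i: stack[-i] if len(stack) >= i else NULL
    b = lambda i: buffer[i - 1] if len(buffer) >= i else NULL
    feats = [s(1), s(2), s(3), b(1), b(2), b(3)]        # top 3 of stack, buffer
    for i in (1, 2):                                    # top two stack words
        feats += [lc1(s(i)), rc1(s(i)), lc2(s(i)), rc2(s(i))]
        feats += [lc1(lc1(s(i))), rc1(rc1(s(i)))]       # grandchildren
    return feats                                        # 18 word features
```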

  28. Input Embeddings Instead of one-hot input encodings, words and POS tags are each mapped to dense 50-dimensional embedding vectors. Embedding POS tags is unusual since there are relatively few; however, it allows similar tags (e.g., NN and NNS) to have similar embeddings and thereby behave similarly.
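
A sketch of the lookup, with illustrative vocabulary sizes and random initialization: each feature indexes a 50-dimensional row, and the rows are concatenated to form the network input.

```python
import numpy as np

d = 50
E_word = 0.01 * np.random.randn(10000, d)   # word embedding table
E_tag = 0.01 * np.random.randn(45, d)       # POS-tag table: few rows, still dense

word_ids = [17, 42, 3]                      # e.g. ids of s1, s2, s3
tag_ids = [5, 5, 12]                        # their POS tags
x = np.concatenate([E_word[word_ids].ravel(),   # 3 * 50 word dims
                    E_tag[tag_ids].ravel()])    # 3 * 50 tag dims
```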

  29. Cube Activation Function An alternative nonlinearity for the hidden layer, g(x) = x^3, used in place of sigmoid or tanh (a softmax still produces the output probabilities). Allows modeling product terms xi xj xk for any three different input elements. Based on previous empirical results, capturing interactions of three elements seems important for shift-reduce dependency parsing.
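
A sketch of the hidden layer under this activation; expanding one unit shows where the three-way products come from:

```python
import numpy as np

# Cube activation: h = (W x + b) ** 3. Expanding one hidden unit,
# (sum_i w_i x_i + b)^3 contains terms w_i w_j w_k x_i x_j x_k, so the
# unit can model three-way conjunctions of input features directly.
def hidden_layer(x, W, b):
    return (W @ x + b) ** 3
```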

  30. Training Data Automatically construct dependency parses from treebank phrase-structure parse trees. Compute the correct sequence of oracle shift-reduce parse actions (transitions, ti) at each step from gold-standard parse trees. Determine the correct parse sequence by using a "shortest stack" oracle, which always prefers LeftArc over Shift.
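
A sketch of such an oracle for the unlabeled case; gold_head and the pending-dependent counts are assumed precomputed from the gold tree, and all names are illustrative:

```python
# "Shortest stack" static oracle sketch for arc-standard parsing.
# gold_head[w] is w's head in the gold tree; pending[w] counts w's
# dependents not yet attached. A word may only be popped (made a
# dependent) once it has collected all of its own dependents.
def oracle(stack, buffer, gold_head, pending):
    if len(stack) >= 2:
        s1, s2 = stack[-1], stack[-2]
        if gold_head.get(s2) == s1 and pending[s2] == 0:
            return "LeftArc"      # preferred over Shift whenever legal
        if gold_head.get(s1) == s2 and pending[s1] == 0:
            return "RightArc"
    return "Shift"
```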

  31. Training Algorithm The training objective is to minimize the cross-entropy loss of the gold transitions plus an L2-regularization term: L(θ) = −Σi log p(ti) + (λ/2) ‖θ‖². Initialize word embeddings to precomputed values such as word2vec. Use AdaGrad with dropout to compute model parameters that approximately minimize this objective.
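
A sketch of one training step under these choices, using PyTorch; the layer sizes, learning rate, and regularization strength are illustrative, and the real model predicts labeled transitions rather than the three unlabeled actions used here:

```python
import torch
import torch.nn.functional as F

n_in, n_hidden, n_actions = 48 * 50, 200, 3   # 3 = {Shift, LeftArc, RightArc}
W1 = torch.randn(n_hidden, n_in, requires_grad=True)
b1 = torch.zeros(n_hidden, requires_grad=True)
W2 = torch.randn(n_actions, n_hidden, requires_grad=True)
params = [W1, b1, W2]
opt = torch.optim.Adagrad(params, lr=0.01)

def train_step(x, gold_t, lam=1e-8):
    # x: (batch, n_in) concatenated embeddings; gold_t: (batch,) action ids.
    h = F.dropout((x @ W1.T + b1) ** 3, p=0.5)   # cube activation + dropout
    loss = F.cross_entropy(h @ W2.T, gold_t)     # cross-entropy over actions
    loss = loss + lam * sum((p ** 2).sum() for p in params)  # L2 term
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```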

  32. Evaluation Metrics for Dependency Parsing Unlabeled Attachment Score (UAS): % of tokens for which the system has predicted the correct parent. Labeled Attachment Score (LAS): % of tokens for which the system has predicted the correct parent with the correct arc label.
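
A sketch of computing both scores from per-token (head, label) pairs; the data layout is illustrative:

```python
def attachment_scores(gold, pred):
    # gold, pred: one (head_index, label) pair per token.
    n = len(gold)
    uas = sum(g[0] == p[0] for g, p in zip(gold, pred)) / n  # head only
    las = sum(g == p for g, p in zip(gold, pred)) / n        # head + label
    return uas, las

gold = [(2, "nsubj"), (0, "root"), (4, "amod"), (2, "dobj")]
pred = [(2, "nsubj"), (0, "root"), (4, "nmod"), (2, "dobj")]
print(attachment_scores(gold, pred))   # (1.0, 0.75)
```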

  33. Sample Results on Penn WSJ Treebank (results table shown as a figure in the original slides).

  34. Conclusions Shift-reduce parsing is an efficient and effective alternative to standard PCFG parsing, and is particularly effective for dependency parsing. It models the deterministic, left-to-right processing that seems to characterize human parsing (and is therefore subject to garden paths). Neural methods for selecting parse actions give state-of-the-art results.
