 
Target-Side Context for Discriminative Models in Statistical MT

Aleš Tamchyna, Alexander Fraser, Ondřej Bojar, Marcin Junczys-Dowmunt

ACL 2016
August 9, 2016
Outline
 
- Motivation
- Model Description
- Integration in Phrase-Based Decoding
- Experimental Evaluation
- Conclusion
Why Context Matters in MT: Source
Example: in "shooting of the expensive film", should "shooting" be translated as "střelba" (gunfire) or "natáčení" (filming)?

Wider source context is required for disambiguation of word sense.

Previous work has looked at using source context in MT.
Why Context Matters in MT: Target
Declension of Czech "kočka" (cat): kočka (nominative), kočky (genitive), kočce (dative), kočku (accusative), kočko (vocative), kočce (locative), kočkou (instrumental).

Example: "the man saw a cat ." The correct case depends on how we translate the previous words: "uviděl" (saw) governs the accusative "kočku", while "si všiml" (noticed) governs the genitive "kočky".

Wider target context is required for disambiguation of word inflection.
How Does PBMT Fare?
 
shooting of the film .            ->  natáčení filmu .
shooting of the expensive film .  ->  střelby na drahý film .

the man saw a cat .               ->  muž uviděl kočku(acc) .
the man saw a black cat .         ->  muž spatřil černou(acc) kočku(acc) .
the man saw a yellowish cat .     ->  muž spatřil nažloutlá(nom) kočka(nom) .

PBMT handles the short, frequent patterns, but it picks the wrong sense ("střelby") and the wrong case ("nažloutlá kočka" in the nominative) as soon as the disambiguating context falls outside a single phrase.
Outline
 
- Motivation
- Model Description
- Integration in Phrase-Based Decoding
- Experimental Evaluation
- Conclusion
A Discriminative Model of Source and Target Context
Let F, E be the source and target sentence, f a source phrase, and e a target phrase.

Model the following probability distribution:

    P(e | f, F, target context) =
        exp(w · fv(e, f, F, target context)) / Σ_e' exp(w · fv(e', f, F, target context))

where w is the weight vector, fv is the feature vector, and the sum runs over all competing translations e' of the source phrase f.
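To make the distribution concrete, here is a minimal sketch (all names are hypothetical; sparse binary features are sets of strings, the weight vector a dict):

    import math

    def dot(w, features):
        # Sparse dot product: features are binary, so just sum their weights.
        return sum(w.get(f, 0.0) for f in features)

    def p_translation(w, fv, candidates, src_sent, src_phrase, tgt_ctx):
        # Softmax over all competing translations of the same source phrase.
        scores = {e: dot(w, fv(e, src_phrase, src_sent, tgt_ctx))
                  for e in candidates}
        z = sum(math.exp(s) for s in scores.values())
        return {e: math.exp(s) / z for e, s in scores.items()}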
Model Features (1/2)
 
Label Independent (S = shared):
- source window: -1^saw -2^really ...
- source words: a cat
- source phrase: a_cat
- context window: -1^uviděl -2^vážně
- context bilingual: saw^uviděl really^vážně

(Running example: "the man really saw a cat .", translated so far as ". . . vážně uviděl", with "a cat" -> "kočku".)

Label Dependent (T = translation):
- target words: kočku
- target phrase: kočku

Full Feature Set: { S×T ∪ S ∪ T }
cat&kočku ... a_cat&kočku ... saw^uviděl&kočku ... -1^uviděl&kočku ... a_cat ... kočku
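A minimal sketch of assembling the full feature set from the shared (S) and label-dependent (T) parts; the helper name is hypothetical, the "&" joining convention mirrors the examples above:

    def full_feature_set(shared, label_dep):
        # S x T: conjoin every shared feature with every label-dependent one,
        # then add S and T on their own.
        conjunctions = [s + "&" + t for s in shared for t in label_dep]
        return conjunctions + list(shared) + list(label_dep)

    S = ["cat", "a_cat", "saw^uviděl", "-1^uviděl"]
    T = ["kočku"]
    # -> ['cat&kočku', 'a_cat&kočku', 'saw^uviděl&kočku', '-1^uviděl&kočku',
    #     'cat', 'a_cat', 'saw^uviděl', '-1^uviděl', 'kočku']
    print(full_feature_set(S, T))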
Model Features (2/2)
 
- train a single model where each class is defined by label-dependent features
- source factors: form, lemma, part of speech, dependency parent, syntactic role
- target factors: form, lemma, (complex) morphological tag (e.g. NNFS1-----A----)
- this allows the model to learn, e.g.:
  - subjects (role=Sb) often translate into nominative case
  - nouns are usually accusative when preceded by an adjective in accusative case
  - lemma "cat" maps to lemma "kočka" regardless of word form (inflection)
 
Outline
 
- Motivation
- Model Description
- Integration in Phrase-Based Decoding
- Experimental Evaluation
- Conclusion
 
Challenges in Decoding
 
 
[Figure: the search space for one sentence; translation options such as "ten/muž", "uviděl", "kočka/kočku/kočkou" combine into many partial hypotheses.]

the    man    saw       a cat    .
ten    pán    uviděl    kočka    .
       muž              kočkou
                        kočku

- source context remains constant when we decode a single sentence
- each translation option is evaluated in many different target contexts
- as many as for a language model
 
Trick #1: Source- and Target-Context Score Parts
 
[Figure: the same search space as on the previous slide.]

score(kočku | muž uviděl, a cat, the man saw a cat) =
   w · fv(kočku, muž uviděl, a cat, the man saw a cat)

- most features do not depend on the target-side context "muž uviděl"
- divide the feature vector into two components
- pre-compute the source-context-only part of the score before decoding (see the sketch below)
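A sketch of the split, with hypothetical names: the dot product decomposes into a source-only part, computed once per translation option before search, and a target-context part, evaluated during search.

    def source_part(w, source_features):
        # Pre-computed once per translation option before decoding:
        # these features ignore the target-side context entirely.
        return sum(w.get(f, 0.0) for f in source_features)

    def full_score(w, precomputed_src, target_ctx_features):
        # During search only the much smaller target-context component
        # must be evaluated for each hypothesis expansion.
        return precomputed_src + sum(w.get(f, 0.0) for f in target_ctx_features)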
Tricks #2 and #3
 
- Cache feature vectors
  - each translation option ("kočku") will be seen multiple times during decoding
  - cache its feature vector before decoding
  - target-side contexts repeat within a single search ("muž uviděl" -> *)
  - cache context features for each new context
- Cache final results
  - pre-compute and store scores for all possible translations of the current phrase
  - needed for normalization anyway (a combined sketch of both tricks follows below)
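One way tricks #2 and #3 could fit together, as a hypothetical sketch: memoize context features per target context and final normalized scores per (source phrase, target context) pair; per-option feature vectors are assumed pre-extracted (trick #2).

    import math

    class ScoreCache:
        def __init__(self, w, option_feats, extract_ctx_feats):
            self.w = w
            self.option_feats = option_feats      # option -> cached feature list
            self.extract_ctx_feats = extract_ctx_feats
            self.ctx_cache = {}                   # target context -> features
            self.result_cache = {}                # (phrase, context) -> scores

        def scores(self, src_phrase, candidates, tgt_ctx):
            key = (src_phrase, tgt_ctx)
            if key not in self.result_cache:
                ctx = self.ctx_cache.setdefault(
                    tgt_ctx, self.extract_ctx_feats(tgt_ctx))
                # Score all translations of the phrase at once: the softmax
                # normalization needs every competitor anyway (trick #3).
                raw = {e: sum(self.w.get(f, 0.0)
                              for f in self.option_feats[e] + ctx)
                       for e in candidates}
                z = sum(math.exp(s) for s in raw.values())
                self.result_cache[key] = {e: math.exp(s) / z
                                          for e, s in raw.items()}
            return self.result_cache[key]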
 
Evaluation of Decoding Speed

Integration        Avg. Time per Sentence
baseline           0.8 s
naive: only #3     13.7 s
+tricks #1, #2     2.9 s
Outline
 
- Motivation
- Model Description
- Integration in Phrase-Based Decoding
- Experimental Evaluation
- Conclusion
 
Scaling to Large Data
 
- BLEU scores, English-Czech translation
- training data: subsets of CzEng 1.0
 
Additional Language Pairs
Manual Evaluation
 
- blind evaluation of system outputs, 104 random test sentences
- English-Czech translation
- sample BLEU scores: 15.08, 16.22, 16.53

Setting                 Equal   Baseline is better   New is better
baseline vs. +source    52      26                   26
baseline vs. +target    52      18                   34
Conclusion
 
- novel discriminative model for MT that uses both source- and target-side context information
- (relatively) efficient integration directly into MT decoding
- significant improvement of BLEU for English-Czech even on large-scale data
- consistent improvement for three other language pairs
- model freely available as part of the Moses toolkit
 
 
Thank you!
 
 
Questions?
 
Extra slides
Intrinsic Evaluation
- the task: predict the correct translation (e.g., of "shooting") in the current context
- baseline: select the most frequent translation from the candidates, i.e., the translation with the highest P(e|f)
- English-Czech translation, tested on WMT13 test set

Model             Accuracy
baseline          51.5
+source context   66.3
+target context   74.8*
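The baseline is just an argmax over the candidate distribution; a minimal sketch with made-up probabilities for illustration:

    def baseline_predict(phrase_table, f):
        # Pick the candidate with the highest P(e|f), ignoring context.
        candidates = phrase_table[f]      # dict: translation e -> P(e|f)
        return max(candidates, key=candidates.get)

    table = {"shooting": {"střelba": 0.6, "natáčení": 0.4}}  # illustrative numbers
    print(baseline_predict(table, "shooting"))  # 'střelba', whatever the context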
Model Training: Parallel Data
gunmen fled after the shooting .     pachatelé po střelbě uprchli .
  + střelbě&gunmen střelbě&fled ...        - natáčení&gunmen natáčení&fled ...

shooting of an expensive film .      natáčení drahého filmu .
  + natáčení&film natáčení&expensive ...   - střelbě&film střelbě&expensive ...

the director left the shooting .     režisér odešel z natáčení .
  + natáčení&director natáčení&left ...    - střelbě&director střelbě&left ...

the man saw a black cat .            muž viděl černou|A4 kočku|N4 .
  + prev=A4&N4 prev=A4&kočku ...           - prev=A4&N1 prev=A4&kočka ...

the black cat noticed the man .      černá|A1 kočka|N1 viděla muže .
  + prev=A1&N1 prev=A1&kočka ...           - prev=A1&N4 prev=A1&kočku ...
Model Training
 
- Vowpal Wabbit
- quadratic feature combinations generated automatically
- objective function: logistic loss
- setting: --csoaa_ldf mc
- 10 iterations over data
- select best model based on held-out accuracy
- no regularization
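Put together from the settings above, a training invocation might look roughly like this; the namespace letters in -q and the file names are assumptions, not from the slides:

    # -q st: quadratic combinations across (s)ource and (t)arget namespaces
    # --passes 10 with -c: ten iterations over the data via a cache file
    vw --csoaa_ldf mc --loss_function logistic -q st --passes 10 -c \
       -d train.vw -f model.vw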
Training Efficiency
 
- huge number of features generated (hundreds of GBs when compressed)
- feature extraction
  - easily parallelizable task: simply split the data into many chunks
  - each chunk processed in a multithreaded instance of Moses
- model training
  - Vowpal Wabbit is fast
  - training can be parallelized using VW AllReduce
  - workers train on independent chunks, share parameter updates with a master node
  - linear speed-up with 10-20 jobs
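A rough sketch of the AllReduce setup (host name, ids, and file names are assumptions): start VW's spanning_tree daemon on the master, then launch one worker per data chunk; workers synchronize parameter updates through the master.

    # master node:
    spanning_tree

    # worker i of 10 (same flags as single-machine training, plus cluster options):
    vw --csoaa_ldf mc --loss_function logistic -q st --passes 10 -c \
       --span_server master.example.com --total 10 --node $i --unique_id 1234 \
       -d chunk.$i.vw -f model.vw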
Additional Language Pairs (1/2)
 
- English-German
  - parallel data: 4.3M sentence pairs (Europarl + Common Crawl)
  - dev/test: WMT13/WMT14
- English-Polish
  - not included in WMT so far
  - parallel data: 750k sentence pairs (Europarl + WIT)
  - dev/test: IWSLT sets (TED talks) 2010, 2011, 2012
- English-Romanian
  - included only in WMT16
  - parallel data: 600k sentence pairs (Europarl + SETIMES2)
  - dev/test: WMT16 dev set, split in half
 
LMs over Morphological Tags
 
- a stronger baseline: add LMs over tags for better morphological coherence
- do our models still improve translation?
- 1M sentence pairs, English-Czech translation

System     BLEU
baseline   13.0
+tag LM    14.0
+source    14.5
+target    14.8
Phrase-Based MT: Quick Refresher
the man saw a cat .
query phrase table
the    man     saw     a    cat    .
ten    pán     uviděl    kočka    .
    muž                      kočkou
                      uviděl kočku
decode
P_LM = P(muž|<s>) · P(uviděl kočku | <s> muž) · ... · P(</s> | kočku .)
System Outputs: Example
 
input:     the most intensive mining took place there from 1953 to 1962 .

baseline:  nejvíce intenzivní těžba došlo tam z roku 1953 , aby 1962 .
  (gloss:  the_most intensive mining(nom) there_occurred there from 1953 , in_order_to 1962 .)

+source:   nejvíce intenzivní těžby místo tam z roku 1953 do roku 1962 .
  (gloss:  the_most intensive mining(gen) place there from year 1953 until year 1962 .)

+target:   nejvíce intenzivní těžba probíhala od roku 1953 do roku 1962 .
  (gloss:  the_most intensive mining(nom) occurred from year 1953 until year 1962 .)