Machine Translation: Challenges and Solutions

Machine Translation: Introduction
Human Language Technologies
Dipartimento di Informatica
Università di Pisa
Outline
Intro and a little history
Language Similarities and Divergences
Three classic MT Approaches
Transfer
Interlingua
Direct
Modern Statistical MT
Neural MT
Evaluation
What is MT?
Translating a text from one language to another automatically
Google Translate
The translation
http://translate.google.com/translate?hl=en&sl=es&tl=en&u=http%3A%2F%2Fwww.cocinadominicana.com%2Facompanamientos-ensaladas-pastelones%2F1907-tostones.html
Fried banana is eaten in many parts of Latin America, and especially in the Caribbean
The original recipe for tostones
http://www.cocinadominicana.com/acompanamientos-ensaladas-pastelones/1907-tostones.html
Plátano frito se come en muchísimas partes de Latinoamérica, y en especial en el Caribe
Google Translate
French recipe
http://translate.google.com/translate?hl=en&sl=fr&u=http://www.tarte-tatin.info/recette-tarte-tatin.html&ei=BduiSYK3C4KOsQObvLm_CQ&sa=X&oi=translate&resnum=4&ct=result&prev=/search?q=tarte+tatin+recettes&num=100
Machine Translation
The Story of the Stone (The Dream of the Red Chamber)
Cao Xueqin 1792
Chinese gloss: Dai-yu alone on bed top think-of-with-gratitude Bao-chai again listen to window outside bamboo tip plantain leaf of on-top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come
Hawkes translation: As she lay there alone, Dai-yu's thoughts turned to Bao-chai. Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.
Machine Translation
Issues:
Sentence segmentation: 4 English sentences to 1 Chinese
Grammatical differences
Chinese rarely marks tense:
As, turned to, had begun
tou -> penetrated
No pronouns or articles in Chinese
Stylistic and cultural differences
Bamboo tip plantain leaf -> bamboos and plantains
Ma "curtain" -> curtains of her bed
Rain sound sigh drop -> insistent rustle of the rain
Alignment in Machine Translation
Not just literature
Hansards: Canadian parliamentary proceedings
What is MT already good enough for?
Tasks for which a rough translation is fine
Extracting information (finding recipes!)
Web pages
email
Tasks for which MT can be post-edited
MT as first pass
Computer-aided human translation
Tasks in sublanguage domains where high-quality MT is
possible
FAHQT (Fully Automatic High Quality Translation)
What is MT not yet good enough for?
Really hard stuff
Literature
Natural spoken speech (meetings, court reporting)
Really important stuff
Medical translation in hospitals
Emergency phone calls
MT History
1946 Booth and Weaver discuss MT at Rockefeller foundation in New York
1947-48 idea of dictionary-based direct translation
1949 Weaver memorandum popularized idea
1952 all 18 MT researchers in world meet at MIT
1954 IBM/Georgetown Demo Russian-English MT
1955-65 lots of labs take up MT
Warren Weaver memo
http://www.stanford.edu/class/linguist289/weaver001.pdf
There are certain invariant properties which are… to some statistically useful degree, common to all languages.
On March 4, 1947, having considerable exposure to computer design problems during the war, and being aware of the speed, capacity, and logical flexibility possible in modern electronic computers, Weaver suggested that computers be used for translation.
History of MT: Pessimism
1959/1960: Bar-Hillel "Report on the state of MT in US and GB"
Argued FAHQT too hard (semantic ambiguity, etc.)
Should work on semi-automatic instead of automatic
His argument:
Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy.
Only human knowledge lets us know that playpens are bigger than boxes, but writing pens are smaller
His claim: we would have to encode all of human knowledge
History of MT: Pessimism
The ALPAC report
Headed by John R. Pierce of Bell Labs
Conclusions:
Supply of human translators exceeds demand
All the Soviet literature is already being translated
MT has been a failure: all current MT work had to be post-edited
Sponsored evaluations which showed that intelligibility and informativeness were worse than for human translations
Results:
MT research suffered
Funding loss
Number of research labs declined
Association for Machine Translation and Computational Linguistics dropped MT from its name
History of MT
1976 Meteo, weather forecasts from English to French
Systran (Babelfish) has been used for 40 years
1970s
European focus in MT; mainly ignored in US
1980s
Ideas of using early AI techniques in MT (KBMT, CMU)
Focus on interlingua systems, especially in Japan
1990s
Commercial MT systems
Statistical MT
Speech-to-speech translation
2000s
Statistical MT takes off
Google Translate
2015
Neural MT takes off
Language Similarities and Divergences
Some aspects of human language are universal or near-universal, others diverge greatly
Typology: the study of systematic cross-linguistic similarities and differences
What are the dimensions along which human languages vary?
Morphology
Morpheme
Minimal meaningful unit of language
Word = Morpheme+Morpheme+Morpheme+…
Stems: also called lemma, base form, root, lexeme
hope+ing -> hoping
hop -> hopping
Affixes
Prefixes: Anti-dis-establishmentarianism
Suffixes: Antidisestablish-ment-ari-an-ism
Infixes: hingi (borrow) -> h-um-ingi (borrower) in Tagalog
Circumfixes: sagen (say) -> ge-sag-t (said) in German
Morphological Variation
Isolating languages
Cantonese, Vietnamese: each word generally has one morpheme
Vs. Polysynthetic languages
Siberian Yupik (Eskimo): single word may have very many morphemes
Agglutinative languages
Turkish: morphemes have clean boundaries
Vs. Fusional languages
Russian: single affix may have many morphemes
One word one phrase
Turkish
uygarlaştıramadıklarımızdanmışsınızcasına
uygar+laş+tır+ama+dık+lar+ımız+dan+mış+sınız+casına
Behaving as if you are among those whom we could not cause to become civilized
German
Donaudampfschiffahrtselektrizitätenhauptbetriebswerkbauunterbeamtengesellschaft
Danube steam shipping electricity main plant construction subordinate company
Donaudampfschifffahrtsgesellschaftskapitän
Donau+dampf+schifffahrts+gesellschafts+kapitän
Danube steam shipping company captain
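An MT system has to break such compounds into parts before it can look them up. The following is a minimal sketch of dictionary-based compound splitting, assuming a hand-made lexicon of compound parts; the function name `split_compound` and the lexicon are illustrative and do not come from the slides.

```python
def split_compound(word, lexicon):
    """Split a compound into known parts, trying the longest prefix first.

    Returns a list of parts, or None if no full segmentation exists.
    """
    word = word.lower()
    if not word:
        return []  # nothing left to split: segmentation succeeded
    for end in range(len(word), 0, -1):
        prefix = word[:end]
        if prefix in lexicon:
            rest = split_compound(word[end:], lexicon)
            if rest is not None:
                return [prefix] + rest
    return None  # no prefix leads to a complete segmentation


# Hypothetical lexicon of compound parts (linking -s kept on the part).
PARTS = {"donau", "dampf", "schifffahrts", "gesellschafts", "kapitän"}

print(split_compound("Donaudampfschifffahrtsgesellschaftskapitän", PARTS))
```

Real splitters also have to handle linking elements and ambiguous segmentations, which this sketch ignores.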
Index of synthesis
Scale from isolating to synthetic: Vietnamese, English, Russian, Oneida
Slide from Holger Diessel
Isolating language
Vietnamese (Comrie 1981: 43)
Khi   tôi  đến   nhà    bạn,    chúng  tôi  bắt đầu  làm  bài.
When  I    come  house  friend  PL     I    begin    do   lesson
When I came to my friend's house, we began to do lessons.
Cantonese
keui wa  chyuhn gwok    jeui   daaih gaan     nguk  haih li     gaan
he    say entire   country most big    building house is    this building
Slide from Holger Diessel
Synthetic language
(2) Kirundi (Whaley 1997:20)
Y-a-bi-gur-i-ye                  abâna
CL1-PST-CL8.them-buy-APPL-ASP    CL2.children
He bought them for the children.
Slide from Holger Diessel
Polysynthetic language
Noun-incorporation (cf. fox-hunting, bird-watching)
(3) Mohawk (Mithun 1984: 868)
a. r-ukwet-í:yo
   he-person-nice
   He is a nice person
b. wa-hi-sereth-óhare-se
   PST-he/me-car-wash-for
   He car-wash for me (= He washed my car)
c. kvtsyu v-kuwa-nyat-ó:ase
   fish FUT-they/her-throat-slit
   They will throat-slit a fish
Slide from Holger Diessel
Index of fusion
Scale from agglutinative to fusional: Swahili, Russian, Oneida
Slide from Holger Diessel
Agglutinative language
(1) Turkish (Comrie 1981: 44)
             SG        PL
Nominative   adam      adam-lar
Accusative   adam-ı    adam-lar-ı
Genitive     adam-ın   adam-lar-ın
Dative       adam-a    adam-lar-a
Locative     adam-da   adam-lar-da
Ablative     adam-dan  adam-lar-dan
Slide from Holger Diessel
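The Turkish paradigm is agglutinative because each form is the stem plus an optional plural morpheme plus a case morpheme, with clean boundaries. A minimal sketch of that composition follows, assuming the accusative and genitive suffixes are -ı and -ın and ignoring vowel harmony, so it is only valid for a back-vowel stem like adam; the names `CASES` and `inflect` are illustrative.

```python
PLURAL = "lar"
# One suffix per case, independent of number (the agglutinative property).
CASES = {
    "Nominative": "",
    "Accusative": "ı",
    "Genitive": "ın",
    "Dative": "a",
    "Locative": "da",
    "Ablative": "dan",
}


def inflect(stem, case, plural=False):
    """Concatenate stem + (plural) + (case), hyphenating morpheme boundaries."""
    morphemes = [stem]
    if plural:
        morphemes.append(PLURAL)
    if CASES[case]:
        morphemes.append(CASES[case])
    return "-".join(morphemes)


for case in CASES:
    print(case, inflect("adam", case), inflect("adam", case, plural=True))
```

A fusional paradigm such as the Russian one cannot be generated this way, because a single ending encodes case and number together.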
Fusional language
(2) Russian
               SG       PL        SG      PL
Nominative     stol     stol-y    lip-a   lip-y
Accusative     stol     stol-y    lip-u   lip-y
Genitive       stol-a   stol-ov   lip-y   lip
Dative         stol-u   stol-am   lip-e   lip-am
Instrumental   stol-om  stol-ami  lip-oj  lip-ami
Prepositional  stol-e   stol-ax   lip-e   lip-ax
Slide from Holger Diessel
Word Order
SVO (Subject-Verb-Object) languages
English, German, French, Mandarin
SOV languages
Japanese, Hindi
VSO languages
Irish, Classical Arabic
SVO languages generally use prepositions: to Yuriko
SOV languages generally use postpositions: Yuriko ni
Segmentation Variation
Not every writing system has word boundaries marked
Chinese, Japanese, Thai, Vietnamese
Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences:
Modern Standard Arabic, Chinese
Inferential Load: cold vs. hot langs
Some "cold" languages require the hearer to do more figuring out of who the various actors in the various events are:
Japanese, Chinese
Other "hot" languages are pretty explicit about saying who did what to whom:
English
Inferential Load (2)
All noun phrases in blue do not appear in the Chinese text …
But they are needed for a good translation
Lexical Divergences
Word to phrases:
English computer science = French informatique
POS divergences
English: she likes/VERB to sing
German: Sie singt gerne/ADV
English: I'm hungry/ADJ
Spanish: tengo hambre/NOUN
Lexical Divergences: Specificity
Grammatical constraints
English has gender on pronouns, Mandarin does not.
So when translating 3rd person from Chinese to English, we need to figure out the gender of the person!
Similarly from English they to French ils/elles
Semantic constraints
English: brother
Mandarin: gege (older) versus didi (younger)
English: wall
German: Wand (inside) Mauer (outside)
German: Berg
English: hill or mountain
Lexical Divergence: many-to-many
Lexical Divergence: lexical gaps
Japanese: no word for privacy
English: no word for Cantonese haauseun or Japanese oyakoko (something like 'filial piety')
English cow vs. beef, Cantonese ngau
English fish, Spanish pez vs. pescado
Event-to-argument divergences
English
The bottle floated out.
Spanish
La botella salió flotando.
The bottle exited floating
Verb-framed lang: mark direction of motion on verb
Spanish, French, Arabic, Hebrew, Japanese, Tamil, Polynesian, Mayan, Bantu families
Satellite-framed lang: mark direction of motion on satellite
Crawl out, float off, jump down, walk over to, run after
Rest of Indo-European, Hungarian, Finnish, Chinese
Structural divergences
German: Wir treffen uns am Mittwoch
English: We'll meet on Wednesday
Head Swapping
English: X swim across Y
Spanish: X cruzar Y nadando
English: I like to eat
German: Ich esse gern
English: I'd prefer vanilla
German: Mir wäre Vanille lieber
Thematic divergence
Spanish: Y me gusta
English: I like Y
German: Mir fällt der Termin ein
English: I remember the date
Divergence counts from Bonnie Dorr
32% of sentences in UN Spanish/English Corpus (5K)
3 Classical methods for MT
Direct
Transfer
Interlingua
Three MT Approaches: Direct, Transfer, Interlingual
Direct Translation
Proceed word-by-word through text
Translating each word
No intermediate structures except morphology
Knowledge is in the form of
Huge bilingual dictionary
word-to-word translation information
After word translation, can do simple reordering
Adjective ordering English -> French/Spanish
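The direct approach above can be sketched in a few lines: look each word up in a bilingual dictionary, then apply a simple local reordering. This is a toy illustration under stated assumptions, not a real system; the four-word English-Spanish lexicon, its POS tags, and the function name `direct_translate` are all made up for the example.

```python
# Hypothetical bilingual dictionary: source word -> (target word, part of speech).
LEXICON = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
    "sings": ("canta", "VERB"),
}


def direct_translate(sentence):
    # Step 1: word-by-word lookup; unknown words pass through unchanged.
    tagged = [LEXICON.get(w, (w, "UNK")) for w in sentence.lower().split()]
    # Step 2: simple reordering after word translation:
    # English Adjective Noun -> Spanish Noun Adjective.
    i = 0
    while i < len(tagged) - 1:
        if tagged[i][1] == "ADJ" and tagged[i + 1][1] == "NOUN":
            tagged[i], tagged[i + 1] = tagged[i + 1], tagged[i]
            i += 2
        else:
            i += 1
    return " ".join(word for word, _ in tagged)


print(direct_translate("The green witch sings"))  # la bruja verde canta
```

All knowledge lives in the dictionary and one hard-coded swap, which is why the approach is fast and cheap but breaks on anything needing longer-range restructuring.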
Direct MT Dictionary entry
Direct MT
Problems with direct MT
German
Chinese
The Transfer Model
Idea: apply contrastive knowledge, i.e., knowledge about the difference between two languages
Steps:
Analysis:  Syntactically parse Source language
Transfer: Rules to turn this parse into parse for Target language
Generation: Generate Target sentence from parse tree
English to French
Generally
English: Adjective Noun
French: Noun Adjective
Note: not always true
Route mauvaise -> bad road, badly-paved road
Mauvaise route -> wrong road
but is a reasonable first approximation
Rule:
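One way to picture the adjective-noun rule is as a transformation on parse trees, followed by generation from the transformed tree. The sketch below assumes the analysis step has already produced a tree (written here as nested tuples) and uses a tiny hand-made English-French lexicon; the tree encoding, the lexicon, and the function names are illustrative assumptions.

```python
def is_leaf(tree):
    # A leaf is (POS, word); an inner node is (label, child, child, ...).
    return len(tree) == 2 and isinstance(tree[1], str)


def transfer(tree):
    """Transfer step: rewrite every [ADJ NOUN] sibling pair as [NOUN ADJ]."""
    if is_leaf(tree):
        return tree
    label, *children = tree
    children = [transfer(child) for child in children]
    out, i = [], 0
    while i < len(children):
        if (i + 1 < len(children)
                and children[i][0] == "ADJ" and children[i + 1][0] == "NOUN"):
            out += [children[i + 1], children[i]]
            i += 2
        else:
            out.append(children[i])
            i += 1
    return (label, *out)


def leaves(tree):
    if is_leaf(tree):
        return [tree[1]]
    return [word for child in tree[1:] for word in leaves(child)]


# Hypothetical lexical transfer table (gender agreement is baked in by hand).
EN_FR = {"the": "la", "green": "verte", "house": "maison"}


def generate(tree):
    """Generation step: read off the leaves and translate each word."""
    return " ".join(EN_FR.get(word, word) for word in leaves(tree))


english_np = ("NP", ("DET", "the"), ("ADJ", "green"), ("NOUN", "house"))
print(generate(transfer(english_np)))  # la maison verte
```

As the note above says, the rule is only a first approximation: adjectives such as mauvaise can precede the noun with a different meaning.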
Transfer rules
Japanese
Lexical transfer
Transfer-based systems also need lexical transfer rules
Bilingual dictionary (like for direct MT)
English home:
German
nach Hause  (going home)
Heim (home game)
Heimat (homeland, home country)
zu Hause (at home)
Can list 
at home <-> zu Hause
Or do Word Sense Disambiguation
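The listing option can be sketched as a small set of context rules for English home. This is a hedged illustration only: the trigger words and the function name `translate_home` are assumptions, and returning None stands in for handing the word to a word sense disambiguation component.

```python
MOTION_VERBS = {"go", "goes", "going", "went"}  # hypothetical trigger words


def translate_home(prev_word, next_word):
    """Pick a German rendering for 'home' from its immediate neighbours.

    For 'at home' the returned string covers the whole phrase.
    Returns None when no listed context matches.
    """
    if prev_word == "at":
        return "zu Hause"      # at home
    if prev_word in MOTION_VERBS:
        return "nach Hause"    # going home
    if next_word == "game":
        return "Heim"          # home game
    if next_word == "country":
        return "Heimat"        # home country
    return None


print(translate_home("at", None))
print(translate_home("going", None))
print(translate_home("the", "game"))
```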
Systran: combining direct and transfer
Analysis
Morphological analysis, POS tagging
Chunking of NPs, PPs, phrases
Shallow dependency parsing
Transfer
Translation of idioms
Word sense disambiguation
Assigning prepositions based on governing verbs
Synthesis
Apply rich bilingual dictionary
Deal with reordering
Morphological generation
Transfer: some problems
N^2 sets of transfer rules!
Grammar and lexicon full of language-specific stuff
Hard to build, hard to maintain
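The quadratic growth is easy to check with a little arithmetic. In the sketch below, a transfer system needs one rule set per ordered pair of distinct languages, N(N-1), which grows as N^2; the comparison figure of 2N for an interlingua design (one analyzer and one generator per language) is a standard back-of-the-envelope assumption, and the function names are illustrative.

```python
def transfer_rule_sets(n_languages):
    # One rule set for each ordered (source, target) pair of distinct languages.
    return n_languages * (n_languages - 1)


def interlingua_components(n_languages):
    # One analyzer into the interlingua plus one generator out of it, per language.
    return 2 * n_languages


for n in (2, 5, 10, 20):
    print(n, transfer_rule_sets(n), interlingua_components(n))
```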
Interlingua
Intuition: Instead of lang-lang knowledge rules, use the meaning of the sentence to help
Steps:
1. Translate source sentence into meaning representation
2. Generate target sentence from meaning.
Interlingua
Idea is that some of the MT work that we need to do is part of other NLP tasks
E.g., disambiguating Eng:book -> Spa:libro from Eng:book -> Spa:reservar
So we could have concepts like BOOKVOLUME and RESERVE and solve this problem once for each language
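The two interlingua steps can be sketched with the book example: analysis maps a source word to a language-neutral concept, and generation maps the concept to a target word. The sketch assumes part of speech is enough to disambiguate, which is a simplification; the table names and `translate_word` are illustrative.

```python
# Analysis tables: (word, part of speech) -> interlingua concept, per source language.
ANALYSIS = {
    "en": {("book", "NOUN"): "BOOKVOLUME", ("book", "VERB"): "RESERVE"},
}

# Generation tables: interlingua concept -> word, per target language.
GENERATION = {
    "es": {"BOOKVOLUME": "libro", "RESERVE": "reservar"},
}


def translate_word(word, pos, src, tgt):
    concept = ANALYSIS[src][(word, pos)]  # step 1: source -> meaning
    return GENERATION[tgt][concept]       # step 2: meaning -> target


print(translate_word("book", "NOUN", "en", "es"))  # libro
print(translate_word("book", "VERB", "en", "es"))  # reservar
```

Adding a new target language only requires a new generation table; the English analysis table is reused unchanged.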
Direct MT: pros and cons (Bonnie Dorr)
Pros
Fast
Simple
Cheap
No translation rules hidden in lexicon
Cons
Unreliable
Not powerful
Rule proliferation
Requires lots of context
Major restructuring after lexical substitution
Interlingual MT: pros and cons (B. Dorr)
Pros
Avoids the N^2 problem
Easier to write rules
Cons:
Semantics is HARD
Useful information lost (paraphrase)
Moving toward Statistical MT
 
Warren Weaver (1947)
When I look at an article in Russian, I say to myself: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode.
Kevin Knight slide
Rosetta Stone
Carved in 196 BC
Found in 1799
Decoded in 1822
Egyptian hieroglyphs
Egyptian Demotic
Greek
Kevin Knight slide
Centauri/Arcturan [Knight, 1997]
Your assignment, translate this to Arcturan:
farok crrrok hihok yorok clok kantok ok-yurp
Kevin Knight slide
Centauri/Arcturan Parallel Corpus
 
Slide from Kevin Knight
Centauri to Arcturan Translation
 
Slide from Kevin Knight
Translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
Centauri/Arcturan Alignment
1a. ok-voon ororok sprok .
1b. at-voon bichat dat .
2a. ok-drubel ok-voon anok plok sprok .
2b. at-drubel at-voon pippat rrat dat .
3a. erok sprok izok hihok ghirok .
3b. totat dat arrat vat hilat .
4a. ok-voon anok drok brok jok .
4b. at-voon krat pippat sat lat .
5a. wiwok farok izok stok .
5b. totat jjat quat cat .
6a. lalok sprok izok jok stok .
6b. wat dat krat quat cat .
7a. lalok farok ororok lalok sprok izok enemok .
7b. wat jjat bichat wat dat vat eneat .
8a. lalok brok anok plok nok .
8b. iat lat pippat rrat nnat .
9a. wiwok nok izok kantok ok-yurp .
9b. totat nnat quat oloat at-yurp .
10a. lalok mok nok yorok ghirok clok .
10b. wat nnat gat mat bat hilat .
11a. lalok nok crrrok hihok yorok zanzanok .
11b. wat nnat arrat mat zanzanat .
12a. lalok rarok nok izok hihok mok .
12b. wat nnat forat arrat vat gat .
Slide from Kevin Knight
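The reasoning a human applies to this corpus can be sketched as set intersection: a source word's translation must appear in the target side of every sentence pair that contains the word. The sketch below uses only pairs 1, 5 and 7 from the corpus; the function name `candidates` is illustrative, and the method is a deliberately naive stand-in for statistical word alignment, not the procedure on the slides.

```python
# Sentence pairs 1, 5 and 7 from the parallel corpus (Centauri, Arcturan).
PAIRS = [
    ("ok-voon ororok sprok .", "at-voon bichat dat ."),
    ("wiwok farok izok stok .", "totat jjat quat cat ."),
    ("lalok farok ororok lalok sprok izok enemok .",
     "wat jjat bichat wat dat vat eneat ."),
]


def candidates(source_word, pairs):
    """Target words that occur in every pair whose source side has source_word."""
    result = None
    for src, tgt in pairs:
        if source_word in src.split():
            tgt_words = set(tgt.split()) - {"."}
            result = tgt_words if result is None else result & tgt_words
    return result or set()


print(candidates("farok", PAIRS))  # {'jjat'}
```

Words that always co-occur in the sample stay ambiguous, and a word with more than one translation can end up with an empty candidate set, which is one reason practical systems estimate translation probabilities rather than intersecting sets.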
Centauri/Arcturan Alignment
Your assignment, translate this to Arcturan: farok crrrok hihok yorok clok kantok ok-yurp
???
process of elimination
cognate?
Slide from Kevin Knight
Centauri/Arcturan Alignment
Your assignment, put these words in order: { jjat, arrat, mat, bat, oloat, at-yurp }
zero fertility
Slide from Kevin Knight
Clients do not sell pharmaceuticals in Europe => Clientes no venden medicinas en Europa
Slide from Kevin Knight
It’s Really Spanish/English
Summary
Intro and a little history
Language Similarities and Divergences
Three classic MT Approaches
Transfer
Interlingua
Direct

This material covers the basics of Machine Translation (MT), including a brief history, classic MT approaches, and modern techniques like Statistical MT and Neural MT. It also presents case studies using Google Translate for language translation and discusses the issues faced in MT, such as sentence segmentation and grammatical differences between languages.


Uploaded on Sep 18, 2024




Slides from Dan Jurafsky

  2. Outline Intro and a little history Language Similarities and Divergences Three classic MT Approaches Transfer Interlingua Direct Modern Statistical MT Neural MT Evaluation

  3. What is MT? Translating a text from one language to another automatically

  4. Google Translate The translation http://translate.google.com/translate?hl=en&sl= es&tl=en&u=http%3A%2F%2Fwww.cocinadomini cana.com%2Facompanamientos-ensaladas- pastelones%2F1907-tostones.html Fried banana is eaten in many parts of Latin America, and especially in the Caribbean Pl tano frito se come en much simas partes de Latinoam rica, y en especial en el Caribe The original recipe for tostones http://www.cocinadominicana.com/acompanam ientos-ensaladas-pastelones/1907-tostones.html

  5. Google Translate French recipe http://translate.google.com/translate?hl=en&sl=fr&u=http://www.tarte-tatin.info/recette-tarte- tatin.html&ei=BduiSYK3C4KOsQObvLm_CQ&sa=X&oi=translate&resnum=4&ct=result&prev=/sear ch?q=tarte+tatin+recettes&num=100

  6. Machine Translation The Story of the Stone ( The Dream of the Red Chamber ) Cao Xueqin 1792 Chinese gloss: Dai-yu alone on bed top think-of-with-gratitude Bao- chai again listen to window outside bamboo tip plantain leaf of on- top rain sound sigh drop clear cold penetrate curtain not feeling again fall down tears come Hawkes translation: As she lay there alone, Dai-yu s thoughts turned to Bao-chai. Then she listened to the insistent rustle of the rain on the bamboos and plantains outside her window. The coldness penetrated the curtains of her bed. Almost without noticing it she had begun to cry.

  7. Machine Translation Issues: Sentence segmentation: 4 English sentences to 1 Chinese Grammatical differences Chinese rarely marks tense: As, turned to, had begun tou penetrated No pronouns or articles in Chinese Stylistic and cultural differences Bamboo tip plantain leaf bamboos and plantains Ma curtain curtains of her bed Rain sound sigh drop insistent rustle of the rain

  8. Alignment in Machine Translation

  9. Not just literature Hansards: Canadian parliamentary proceedings

  10. What is MT already good enough for? Tasks for which a rough translation is fine Extracting information (finding recipes!) Web pages email Tasks for which MT can be post-edited MT as first pass Computer-aided human translation Tasks in sublanguage domains where high-quality MT is possible FAHQT (Fully Automatic High Quality Translation)

  11. What is MT not yet good enough for? Really hard stuff Literature Natural spoken speech (meetings, court reporting) Really important stuff Medical translation in hospitals Emergency phone calls

  12. MT History 1946 Booth and Weaver discuss MT at Rockefeller foundation in New York 1947-48 idea of dictionary-based direct translation 1949 Weaver memorandum popularized idea 1952 all 18 MT researchers in world meet at MIT 1954 IBM/Georgetown Demo Russian-English MT 1955-65 lots of labs take up MT

  13. Warren Weaver memo http://www.stanford.edu/class/linguist289/weaver001.pdf There are certain invariant properties which are to some statistically useful degree, common to all languages. On March 4, 1947, having considerable exposure to computer design problems during the war, and being aware of the speed, capacity, and logical flexibility possible in modern electronic computers , Weaver suggested that computers to be used for translation

  14. History of MT: Pessimism 1959/1960: Bar-Hillel Report on the state of MT in US and GB Argued FAHQT too hard (semantic ambiguity, etc.) Should work on semi-automatic instead of automatic His argument: Little John was looking for his toy box. Finally, he found it. The box was in the pen. John was very happy. Only human knowledge lets us know that playpens are bigger than boxes, but writing pens are smaller His claim: we would have to encode all of human knowledge

  15. History of MT: Pessimism The ALPAC report Headed by John R. Pierce of Bell Labs Conclusions: Supply of human translators exceeds demand All the Soviet literature is already being translated MT has been a failure: all current MT work had to be post-edited Sponsored evaluations which showed that intelligibility and informativeness was worse than human translations Results: MT research suffered Funding loss Number of research labs declined Association for Machine Translation and Computational Linguistics dropped MT from its name

  16. History of MT 1976 Meteo, weather forecasts from English to French Systran (Babelfish) been used for 40 years 1970 s European focus in MT; mainly ignored in US 1980 s ideas of using early AI techniques in MT (KBMT, CMU) Focus on interlingua systems, especially in Japan 1990 s Commercial MT systems Statistical MT Speech-to-speech translation 2000 s Statistical MT takes off Google Translate 2015 Neural MT takes off

  17. Language Similarities and Divergences Some aspects of human language are universal or near-universal, others diverge greatly Typology: the study of systematic cross-linguistic similarities and differences What are the dimensions along with human languages vary?

  18. Morphology Morpheme Minimal meaningful unit of language Word = Morpheme+Morpheme+Morpheme+ Stems: also called lemma, base form, root, lexeme hope+ing hoping hop hopping Affixes Prefixes: Antidisestablishmentarianism Suffixes: Antidisestablishmentarianism Infixes: hingi (borrow) humingi (borrower) in Tagalog Circumfixes: sagen (say) gesagt (said) in German

  19. Morphological Variation Isolating languages Cantonese, Vietnamese: each word generally has one morpheme Vs. Polysynthetic languages Siberian Yupik ( Eskimo ): single word may have very many morphemes Agglutinative languages Turkish: morphemes have clean boundaries Vs. Fusion languages Russian: single affix may have many morphemes

  20. One word one phrase Turkish uygarla t ramad klar m zdanm s n zcas na uygar+la +t r+ama+d k+lar+ m z+dan+m +s n z+cas na Behaving as if you are among those whom we could not cause to become civilized German Donaudampfschiffahrtselektrizit tenhauptbetriebswerkbauunterbeamtengesellsc haft Danube steam shipping electricity main plant construction subordinate company Donaudampfschifffahrtsgesellschaftskapit n Donau+dampf+Schiffahrts+gesellschafts+kapit n Danube steam shipping company captain

  21. Index of synthesis isolating synthetic Vietnamese English Russian Oneida Slide from Holger Diessel

  22. Isolating language Vietnamese (Comrie 1981: 43) Khi t i n nh b n, When I come house friend PL I begin do lesson ch ng t i b t u l m b i. When I came to my friend s house, we began to do lessons. Cantonese keui wa chyuhn gwok jeui daaih gaan nguk haih li gaan he say entire country most big building house is this building Slide from Holger Diessel

  23. Synthetic language (2) Kirundi (Whaley 1997:20) Y-a-bi-gur-i-ye CL1-PST-CL8.them-buy-APPL-ASP ab na CL2.children He bought them for the children. Slide from Holger Diessel

  24. Polysynthetic language Noun-incorporation (cf. fox-hunting, bird-watching) (3) Mohawk (Mithun 1984: 868) a. r-ukwe t- :yo he-person-nice He is a nice person b. wa-hi- sereth- hare- se PST-he/me-car-wash-for He car-wash for me (= He washed my car) c. kvtsyu v-kuwa-nya t- : ase fish FUT-they/her-throat-slit They will throat-slit a fish Slide from Holger Diessel

  25. Index of fusion agglutinative fusional Swahili Russian Oneida Slide from Holger Diessel

  26. Agglutinative language (1) Turkish (Comrie 1981: 44) SG PL Nominative adam Accusative adam-K Genitive Dative Locative Ablative adam-lar adam-lar-K adam-lar-Kn adam-lar-a adam-lar-da adam-lar-dan adam-Kn adam-a adam-da adam-dan Slide from Holger Diessel

  27. Fusional language (2) Russian SG PL SG PL Nominative Accusative Genitive Dative Instrumental Prepositional stol-e stol stol stol-a stol-u stol-om stol-ami lip-oj stol-ax stol-y stol-y stol-ov stol-am lip-e lip-a lip-u lip-y lip-y lip-y lip lip-am lip-ami lip-ax lip-e Slide from Holger Diessel

  28. Word Order SVO (Subject-Verb-Object) languages English, German, French, Mandarin SOV Languages Japanese, Hindi VSO languages Irish, Classical Arabic SVO languages generally use prepositions: to Yuriko VSO languages generally use postpositions: Yuriko ni

  29. Segmentation Variation Not every writing system has word boundaries marked Chinese, Japanese, Thai, Vietnamese Some languages tend to have sentences that are quite long, closer to English paragraphs than sentences: Modern Standard Arabic, Chinese

  30. Inferential Load: cold vs. hot languages. Some "cold" languages require the hearer to do more work figuring out who the various actors in the various events are: Japanese, Chinese. Other "hot" languages are quite explicit about who did what to whom: English.

  31. Inferential Load (2): None of the noun phrases in blue appear in the Chinese text, but they are needed for a good translation.

  32. Lexical Divergences. Word to phrases: English "computer science" = French "informatique". POS divergences: English: she likes/VERB to sing; German: Sie singt gerne/ADV. English: I'm hungry/ADJ; Spanish: tengo hambre/NOUN.

  33. Lexical Divergences: Specificity. Grammatical constraints: English has gender on pronouns, Mandarin does not. So when translating a 3rd-person pronoun from Chinese to English, we need to figure out the gender of the person! Similarly from English "they" to French "ils/elles". Semantic constraints: English: brother; Mandarin: gege (older) versus didi (younger). English: wall; German: Wand (inside), Mauer (outside). German: Berg; English: hill or mountain.

  34. Lexical Divergence: many-to-many

  35. Lexical Divergence: lexical gaps. Japanese: no word for "privacy". English: no word for Cantonese "haauseun" or Japanese "oyakoko" (something like "filial piety"). English "cow" vs. "beef", Cantonese "ngau". English "fish", Spanish "pez" vs. "pescado".

  36. Event-to-argument divergences. English: The bottle floated out. Spanish: La botella salió flotando ("The bottle exited floating"). Verb-framed languages mark direction of motion on the verb: Spanish, French, Arabic, Hebrew, Japanese, Tamil, and the Polynesian, Mayan, and Bantu families. Satellite-framed languages mark direction of motion on a satellite (crawl out, float off, jump down, walk over to, run after): the rest of Indo-European, Hungarian, Finnish, Chinese.

  37. Structural divergences. German: Wir treffen uns am Mittwoch. English: We'll meet on Wednesday.

  38. Head Swapping. English: X swim across Y; Spanish: X cruzar Y nadando. English: I like to eat; German: Ich esse gern. English: I'd prefer vanilla; German: Mir wäre Vanille lieber.

  39. Thematic divergence. Spanish: Y me gusta; English: I like Y. German: Mir fällt der Termin ein; English: I remember the date (literally "the date occurs to me").

  40. Divergence counts from Bonnie Dorr
      32% of sentences in UN Spanish/English Corpus (5K)
          Type           Spanish               English           Percent
          Categorial     X tener hambre        Y have hunger     98%
          Conflational   X dar puñaladas a Z   X stab Z          83%
          Structural     X entrar en Y         X enter Y         35%
          Head Swapping  X cruzar Y nadando    X swim across Y   8%
          Thematic       X gustar a Y          Y likes X         6%

  41. 3 Classical methods for MT: Direct, Transfer, Interlingua

  42. Three MT Approaches: Direct, Transfer, Interlingual

  43. Direct Translation: Proceed word-by-word through the text, translating each word. No intermediate structures except morphology. Knowledge is in the form of a huge bilingual dictionary with word-to-word translation information. After word translation, simple reordering is possible, e.g. adjective ordering English -> French/Spanish.
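
The direct approach described above can be sketched in a few lines: a word-by-word dictionary lookup followed by one local reordering rule. This is only a toy illustration under stated assumptions. The dictionary entries, the part-of-speech tags stored with them, and the function name `direct_translate` are all hypothetical, and a real direct system would also handle morphology.

```python
# Toy English -> Spanish dictionary (hypothetical entries).
# Each entry stores the target word and a coarse part-of-speech tag.
DICTIONARY = {
    "the": ("la", "DET"),
    "green": ("verde", "ADJ"),
    "witch": ("bruja", "NOUN"),
    "sings": ("canta", "VERB"),
}


def direct_translate(sentence):
    # Step 1: translate word by word; unknown words pass through unchanged.
    translated = [DICTIONARY.get(w, (w, "UNK")) for w in sentence.lower().split()]

    # Step 2: simple local reordering, ADJ NOUN -> NOUN ADJ.
    output = []
    i = 0
    while i < len(translated):
        is_adj_noun = (
            i + 1 < len(translated)
            and translated[i][1] == "ADJ"
            and translated[i + 1][1] == "NOUN"
        )
        if is_adj_noun:
            output.extend([translated[i + 1][0], translated[i][0]])
            i += 2
        else:
            output.append(translated[i][0])
            i += 1
    return " ".join(output)


print(direct_translate("the green witch sings"))  # la bruja verde canta
```

Because there is no parse, the rule only sees adjacent words, which is why long-distance reordering is out of reach for this approach.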

  44. Direct MT Dictionary entry

  45. Direct MT

  46. Problems with direct MT German Chinese

  47. The Transfer Model. Idea: apply contrastive knowledge, i.e., knowledge about the differences between two languages. Steps: (1) Analysis: syntactically parse the source language. (2) Transfer: rules to turn this parse into a parse for the target language. (3) Generation: generate the target sentence from the parse tree.

  48. English to French. Generally, English: Adjective Noun; French: Noun Adjective. Note: not always true. "Route mauvaise" -> bad road, badly-paved road; "Mauvaise route" -> wrong road. But it is a reasonable first approximation. Rule: Adjective Noun -> Noun Adjective
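
The adjective-noun rule can be sketched as a transfer step over a parse tree, followed by generation. The analysis step is skipped here: the source tree is written by hand, whereas a real system would get it from a parser. The tree encoding (nested tuples), the tiny French lexicon, and the names `transfer` and `generate` are assumptions made for illustration.

```python
# Trees are nested tuples: (label, child, ...); a leaf is (POS, word).
def is_leaf(tree):
    return len(tree) == 2 and isinstance(tree[1], str)


def transfer(tree):
    """Transfer step: rewrite ADJ NOUN as NOUN ADJ everywhere in the tree."""
    if is_leaf(tree):
        return tree
    label, *children = tree
    children = [transfer(child) for child in children]
    reordered = []
    i = 0
    while i < len(children):
        if (
            i + 1 < len(children)
            and children[i][0] == "ADJ"
            and children[i + 1][0] == "NOUN"
        ):
            reordered.extend([children[i + 1], children[i]])
            i += 2
        else:
            reordered.append(children[i])
            i += 1
    return (label, *reordered)


def generate(tree, lexicon):
    """Generation step: read off the leaves, translating each word."""
    if is_leaf(tree):
        return [lexicon.get(tree[1], tree[1])]
    words = []
    for child in tree[1:]:
        words.extend(generate(child, lexicon))
    return words


# Hypothetical lexicon and hand-written source parse.
LEXICON = {"the": "la", "bad": "mauvaise", "road": "route"}
source_tree = ("NP", ("DET", "the"), ("ADJ", "bad"), ("NOUN", "road"))

print(" ".join(generate(transfer(source_tree), LEXICON)))  # la route mauvaise
```

As the slide notes, the rule is only a first approximation: it cannot produce "mauvaise route" for the "wrong road" reading without extra lexical information.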

  49. Transfer rules Japanese

  50. Lexical transfer: Transfer-based systems also need lexical transfer rules, i.e. a bilingual dictionary (as for direct MT). English "home" -> German: nach Hause (going home), Heim (home game), Heimat (homeland, home country), zu Hause (at home). Can list "at home" <-> "zu Hause", or do Word Sense Disambiguation.
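
One way to picture the "list the phrase" option is a lookup that chooses the German rendering of "home" from a neighbouring cue word. The sketch below is a hand-written stand-in for real word sense disambiguation, not a description of any particular system. The cue words, the table `HOME_SENSES`, and the function `translate_home` are assumptions, and the returned string covers the whole English phrase (e.g. "at home").

```python
# Each rule: (cue words, position of the cue relative to "home", German rendering).
# The cue sets are hypothetical and far from complete.
HOME_SENSES = [
    ({"at"}, -1, "zu Hause"),
    ({"go", "going", "went"}, -1, "nach Hause"),
    ({"game"}, +1, "Heim"),
    ({"country"}, +1, "Heimat"),
]


def translate_home(tokens, index):
    """Pick a German rendering for tokens[index] == 'home', or None if no rule fires."""
    for cues, offset, german in HOME_SENSES:
        j = index + offset
        if 0 <= j < len(tokens) and tokens[j].lower() in cues:
            return german
    return None


print(translate_home("she is at home".split(), 3))         # zu Hause
print(translate_home("they are going home".split(), 3))    # nach Hause
print(translate_home("the home game was lost".split(), 1)) # Heim
```

A statistical or neural disambiguator would replace the fixed cue table with a model that scores each sense given the full context.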
