Comparative Study of Similar Southeast Asian Languages

Similar Southeast Asian Languages: Corpus-Based Case Study on Thai-Laotian and Malay-Indonesian
Chenchen Ding, Masao Utiyama, Eiichiro Sumita
Advanced Translation Technology Laboratory, ASTREC, NICT, Japan
Motivation
For similar languages
Specific and efficient approaches can be designed
Techniques on well-studied languages can be applied to low-resourced ones
How to measure the similarity
Scripts: related or comparable writing systems; similar letters
Vocabulary: etymologically related words; similar spellings
Syntax: phrase / sentence structure; similar word orders
Outline
Asian Language Treebank (ALT) project
Similar languages and related processing
Investigation and experiments
Conclusion and future work
Motivation of Asian Language Treebank
Compared with European languages
Most Asian languages are low-resourced and understudied
NLP techniques cannot be developed and applied
ALT can facilitate
Tokenization / POS tagging / Parsing
Cross-lingual processing
Establish a solid basis for Asian language processing
Details of Asian Language Treebank
Treebanks for six Asian languages and English
Burmese, Indonesian, Japanese, Khmer, Malay, Vietnamese
April 2016 -- March 2019
Candidate languages for the future
Laotian, Tagalog, Thai
All the raw parallel data are available
http://www2.nict.go.jp/astrec-att/member/mutiyama/ALT/
Similar Languages in ALT
URL
en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal
English sentences
Italy have defeated Portugal 31-5 in Pool C of the 2007 Rugby World Cup at Parc des Princes, Paris, France.
Similar Languages in ALT
URL
en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal
Indonesian and Malay translations
Indonesian: Italia berhasil mengalahkan Portugal 31-5 di grup C dalam Piala Dunia Rugby 2007 di Parc des Princes, Paris, Perancis.
Malay: Itali telah mengalahkan Portugal 31-5 dalam Pool C pada Piala Dunia Ragbi 2007 di Parc des Princes, Paris, Perancis.
Similar Languages in ALT
URL
en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal
Indonesian and Malay translations
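Indonesian: Italia berhasil mengalahkan Portugal 31-5 di grup C dalam Piala Dunia Rugby 2007 di Parc des Princes, Paris, Perancis.
Malay: Itali telah mengalahkan Portugal 31-5 dalam Pool C pada Piala Dunia Ragbi 2007 di Parc des Princes, Paris, Perancis.
Differing words (Indonesian / Malay): Italia / Itali, berhasil / telah, di / dalam, grup / Pool, dalam / pada, Rugby / Ragbi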
Similar Languages in ALT
URL
en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal
Laotian and Thai translations
Laotian: ອິຕາລີໄດ້ເສຍໃຫ້ປ໊ອກຕຸຍການ31ຕໍ່5ໃນພູລ C ຂອງການແຂ່ງຂັນຣັກບີ້ລະດັບໂລກປີ2007ທີ່ປາກເດແພຣັງປາຣີປະເທດຝຣັ່ງ.
Thai: อิตาลีได้เอาชนะโปรตุเกสด้วยคะแนน31ต่อ5ในกลุ่ม c ของการแข่งขันรักบี้เวิลด์คัพปี2007ที่สนามปาร์กเดแพร็งส์ที่กรุงปารีสประเ
Similar Languages in ALT
URL
en.wikinews.org/wiki/2007_Rugby_World_Cup:_Italy_31_-_5_Portugal
Laotian and Thai translations
Laotian: ອິຕາລີໄດ້ເສຍໃຫ້ປ໊ອກຕຸຍການ31ຕໍ່5ໃນພູລ C ຂອງການແຂ່ງຂັນຣັກບີ້ລະດັບໂລກປີ2007ທີ່ປາກເດແພຣັງປາຣີປະເທດຝຣັ່ງ.
Thai: อิตาลีได้เอาชนะโปรตุเกสด้วยคะแนน31ต่อ5ในกลุ่ม c ของการแข่งขันรักบี้เวิลด์คัพปี2007ที่สนามปาร์กเดแพร็งส์ที่กรุงปารีสประเ
Processing Similar Languages in NLP
Translation between Catalan and Spanish
Can we translate letters? D. Vilar et al., 2007, WMT
Translation between Japanese and Korean
WAT (Workshop on Asian Translation) in recent years
Character-based processing
Apply SMT techniques on Japanese to Burmese
Empirical dependency-based head finalization for statistical Chinese-, English-, and French-to-Myanmar (Burmese) MT. C. Ding et al., 2014, IWSLT
Two Southeast Asian Language Pairs
Thai-Laotian
Tonal languages from the Tai-Kadai language family, mutually intelligible
Abugida writing systems
Etymologically related words
Isolating in morphology, head-initial in syntax
Malay-Indonesian
From the Austronesian language family, mutually intelligible
Using the Latin script
“Different registers of one language”
Data and Pre-processing
Raw translations from ALT
Sentences (train / dev / test): 18,000 / 1,000 / 1,000
Tokens:
Simple tokenization for Malay and Indonesian
Punctuation marks detached
Unbreakable unit segmentation for Thai and Laotian
Dependent diacritics attached to independent letters
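The two pre-processing steps above can be sketched roughly as follows. This is only an illustration, not the authors' tool: the function names are hypothetical, punctuation is detached with a simple regular expression, and "dependent diacritics" are approximated by Unicode combining marks (categories Mn, Mc, Me), which may differ from the segmentation rules actually used.

```python
import re
import unicodedata

# Hypothetical rule: a word may keep an internal hyphen or apostrophe
# (e.g. "31-5"); every other punctuation mark becomes its own token.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def simple_tokenize(sentence):
    """Detach punctuation marks from words (Malay / Indonesian)."""
    return TOKEN_RE.findall(sentence)

# Assumption: a "dependent diacritic" is any Unicode combining mark.
COMBINING = {"Mn", "Mc", "Me"}

def unbreakable_units(text):
    """Attach each combining mark to the preceding base character."""
    units = []
    for ch in text:
        if ch.isspace():
            continue  # whitespace is dropped, not kept as a unit
        if units and unicodedata.category(ch) in COMBINING:
            units[-1] += ch
        else:
            units.append(ch)
    return units

print(simple_tokenize("di Parc des Princes, Paris, Perancis."))
print(unbreakable_units("ที่กรุง"))
```

Under this approximation, leading vowels and other spacing letters in Thai and Laotian remain separate units; only marks written above or below a base letter are attached.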
Word Order
Kendall’s tau on Thai and Laotian
Word Order
Kendall’s tau on Malay and Indonesian
For Comparison
Kendall’s tau on Japanese-English and English-French
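As a reminder of what the word-order plots measure, Kendall's tau for one sentence pair can be sketched as below. The slides do not say how the alignments were obtained or which variant of tau was used, so this sketch assumes a one-to-one alignment given as a list of target positions and computes the plain tau-a statistic; the function name is hypothetical.

```python
from itertools import combinations

def kendall_tau(target_positions):
    """Kendall's tau-a of a word alignment.

    target_positions[i] is the target-side position aligned to source
    token i (assumed one-to-one). Returns 1.0 for identical word order
    and -1.0 for fully reversed order.
    """
    n = len(target_positions)
    if n < 2:
        return 1.0  # a single token cannot be reordered
    concordant = discordant = 0
    # combinations() keeps source order, so a precedes b on the source side.
    for a, b in combinations(target_positions, 2):
        if a < b:
            concordant += 1
        elif a > b:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

print(kendall_tau([0, 1, 2, 3]))  # monotone alignment
print(kendall_tau([0, 2, 1, 3]))  # one local swap
```

A value close to 1 means the translation needs almost no reordering.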
Uncertainty in Token Correspondence
X-axis: log probability of Thai tokens
Y-axis: Entropy on corresponding Laotian tokens
Uncertainty in Token Correspondence
X-axis: log probability of Laotian tokens
Y-axis: Entropy on corresponding Thai tokens
Uncertainty in Token Correspondence
X-axis: log probability of Malay tokens
Y-axis: Entropy on corresponding Indonesian tokens
Uncertainty in Token Correspondence
X-axis: log probability of Indonesian tokens
Y-axis: Entropy on corresponding Malay tokens
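The two axes of these plots can be computed from aligned token pairs roughly as follows. This is a sketch under assumptions: the slides do not state the aligner or the logarithm base, so base 2 is used here, the function name is hypothetical, and the toy pairs are read off the Indonesian and Malay example sentence rather than taken from the actual experiment.

```python
import math
from collections import Counter, defaultdict

def token_uncertainty(aligned_pairs):
    """Map each source token to (log2 probability, entropy of its targets)."""
    src_counts = Counter(s for s, _ in aligned_pairs)
    tgt_counts = defaultdict(Counter)
    for s, t in aligned_pairs:
        tgt_counts[s][t] += 1
    total = sum(src_counts.values())
    points = {}
    for s, count in src_counts.items():
        log_prob = math.log2(count / total)  # x-axis
        entropy = -sum(                      # y-axis
            (c / count) * math.log2(c / count)
            for c in tgt_counts[s].values()
        )
        points[s] = (log_prob, entropy)
    return points

# Toy Indonesian -> Malay pairs, for illustration only.
pairs = [("di", "dalam"), ("di", "di"),
         ("dalam", "pada"), ("mengalahkan", "mengalahkan")]
for token, (x, y) in token_uncertainty(pairs).items():
    print(token, x, y)
```

A token that always maps to the same counterpart has zero entropy; frequent tokens with many counterparts appear in the upper right of the plot.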
For Comparison
X-axis: log probability of Japanese characters
Y-axis: Entropy on corresponding Korean characters
For Comparison
X-axis: log probability of Japanese tokens
Y-axis: Entropy on corresponding English words
Experimental Results from SMT
Moses phrase-based (PB) SMT
The parallel data in ALT is not sufficient for a practical system
Experiments to investigate the reordering requirement in translation
Conclusion and Future Work
The similarities between Thai-Laotian and Malay-Indonesian have been investigated in this study, based on the ALT data
The Thai-Laotian pair is similar to the Japanese-Korean pair
The Malay-Indonesian pair is extremely similar in word order
Future Work
Harmonized annotation of the language pairs in corpus construction
Unified techniques for NLP tasks / applications

This research analyzes the similarities within two Southeast Asian language pairs, Thai-Laotian and Malay-Indonesian, through corpus-based case studies. It considers how language similarity can be measured in terms of scripts, vocabulary, and syntax, and it highlights the role of the Asian Language Treebank project in developing natural language processing techniques for low-resourced Asian languages.


Uploaded on Aug 28, 2024

