Cross-Lingual Named Entity Recognition via Wikification

Chen-Tse Tsai, Stephen Mayhew, and Dan Roth
University of Illinois at Urbana-Champaign
Multilingual NER

- NER for many languages: Person, Organization, Location, Miscellaneous
- Challenge: no training data for most languages
- Approach: directly transfer a model trained on another language, using language-independent features

[Diagram: an NER model trained on English and German data is applied, via the Cross-Lingual Wikifier, to Turkish, Bengali, and Yoruba text.]

English training data: "U.N. official Ekeus heads for Baghdad"
German training data: "Verstehen läßt Albrecht Lehmann Flüchtlinge und ..."
  (fragment, roughly: "... understanding ... Albrecht Lehmann lets refugees and ...")
Turkish: "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı"
  ("Studying law in Taiwan, the US, and the UK earned Tsai an LL.B.")
Bengali: "এটি মূলত তুরস্কে কথিত হয়, তবে সাইপ্রাস, গ্রিস, ও পূর্ব ইউরোপের বহু দেশে তুর্কীভাষী সম্প্রদায় আছে"
  ("It is mainly spoken in Turkey, but there are Turkish-speaking communities in Cyprus, Greece, and many countries of Eastern Europe.")
Yoruba: "Indonésíà jẹ́ orílẹ̀-èdè olómìnira, pẹ̀lú aṣòfin àti ààrẹ adìbòyàn. Olúìlú rẹ̀ ni Jakarta."
  ("Indonesia is an independent republic, with a legislature and an elected president. Its capital is Jakarta.")
Cross-Lingual Wikification [Tsai and Roth, NAACL'16]

Given mentions in a non-English document, find the corresponding titles in the English Wikipedia.

Example (Turkish): "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı"
  e.g. Tayvan -> Taiwan, ABD -> United_States

- Only requires a multilingual Wikipedia dump
- Grounds to titles in the intersection of the target-language Wikipedia and the English Wikipedia
- A smaller target-language Wikipedia means poorer coverage

A toy version of this lookup is sketched below.
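To make the lookup concrete, here is a minimal sketch of the wikification step, assuming a precomputed candidate dictionary harvested from a Wikipedia dump. `CANDIDATES` and its prior scores are illustrative placeholders, not the authors' actual resources.

```python
# Minimal sketch of cross-lingual wikification, assuming a precomputed
# candidate dictionary built from anchor texts and inter-language links
# in a multilingual Wikipedia dump. Entries are illustrative only.

# mention string (any language) -> [(English title, prior probability), ...]
CANDIDATES = {
    "Tayvan": [("Taiwan", 0.97), ("Taiwan_(island)", 0.03)],
    "ABD":    [("United_States", 0.97), ("ABD_(disambiguation)", 0.03)],
}

def wikify(mention):
    """Ground a mention to an English Wikipedia title.

    The real model re-ranks candidates with contextual features;
    this sketch falls back to the prior P(title | mention) alone.
    """
    candidates = CANDIDATES.get(mention)
    if not candidates:
        return None  # mention not linkable with this dump
    return max(candidates, key=lambda pair: pair[1])[0]

print(wikify("Tayvan"))  # -> Taiwan
```

In the actual model, contextual features re-rank these candidates; the prior alone is only a simple baseline.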
Outline

- The Key Idea
- Cross-Lingual NER Model
- Wikifier-based features
- Evaluation and Analysis
Key Idea

The cross-lingual wikifier generates good language-independent features for NER by grounding n-grams (see the sketch below):

1. Words in any language are grounded to the English Wikipedia.
   Features extracted from the titles can be used across languages.
2. Instead of the traditional pipeline (NER -> Wikification),
   wikified n-grams provide features for the NER model.

Example (German): "nachvollziehenden Verstehen ... Albrecht Lehmann läßt Flüchtlinge und Vertriebene in Westdeutschland"
  (roughly: "... understanding ... Albrecht Lehmann lets refugees and expellees in West Germany ...")

  Verstehen       -> Understanding            (media_common, quotation_subject)
  Albrecht        -> Albert,_Duke_of_Prussia  (person, noble_person)
  Lehmann         -> Jens_Lehmann             (person, athlete)
  Flüchtlinge     -> Refugee                  (field_of_study, literature_subject)
  Westdeutschland -> Western_Germany          (location, country)
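The sketch below illustrates the grounding idea under toy assumptions: `ngrams` enumerates spans, and a hard-coded `grounding` dictionary stands in for the cross-lingual wikifier. The titles mirror the German example above.

```python
# Sketch of the key idea: ground every n-gram to English Wikipedia titles
# and use the titles (not the surface words) as features.
def ngrams(tokens, max_n=4):
    """Yield (start, end, n-gram string) for all n-grams with n < 5."""
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            yield i, i + n, " ".join(tokens[i:i + n])

tokens = ["Albrecht", "Lehmann", "läßt", "Flüchtlinge"]
grounding = {"Albrecht": "Albert,_Duke_of_Prussia",
             "Lehmann": "Jens_Lehmann",
             "Flüchtlinge": "Refugee"}  # toy wikifier output

for start, end, gram in ngrams(tokens):
    title = grounding.get(gram)
    if title:
        # The title is language-independent: a model trained on features
        # over titles transfers from English to German, Turkish, Bengali, ...
        print(f"span [{start},{end}) -> {title}")
```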
Cross-Lingual NER Model

We use Illinois NER [Ratinov and Roth, 2009] as the base model.

Features:

  Feature Group   Description
  Base features   Word forms and affixes; word type (contains capital? digit?)
  Tag context     Previous tag pattern; previous tags
  Gazetteers      Wikipedia titles in multiple languages
  Wikifier        FreeBase types and Wikipedia categories

Wikifier features facilitate direct transfer! A sketch of the base features follows.
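As a rough illustration of the "Base features" row, here is what form, affix, and word-type features might look like; this is a sketch with assumed feature names, not the Illinois NER implementation.

```python
# Illustrative extraction of base features (word forms, affixes,
# capitalization/digit word type) for a single token.
def base_features(word):
    return {
        "form": word.lower(),
        "prefix3": word[:3],
        "suffix3": word[-3:],
        "has_capital": any(c.isupper() for c in word),
        "has_digit": any(c.isdigit() for c in word),
    }

print(base_features("Baghdad"))
# {'form': 'baghdad', 'prefix3': 'Bag', 'suffix3': 'dad',
#  'has_capital': True, 'has_digit': False}
```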
Wikifier Features

We ground every n-gram (n < 5) to the English Wikipedia.

The cross-lingual wikifier is modified in two ways:
- In the candidate generation step, we only query title candidates with the whole mention string.
  Otherwise the bigram "in Germany" would be linked to the title Germany.
- In the ranking step, we exclude the ranking features that use "other mentions", since we do not know what the other mentions are.

NER features for each word (sketched below):
- The FreeBase types and Wikipedia categories of the top two titles of the current, previous, and next word
- The FreeBase types of the n-grams covering the word
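A minimal sketch of the word-level wikifier features, assuming `top_titles` and `types_of` are lookups into the wikifier's output (both hypothetical names):

```python
# Sketch of the wikifier feature group: for each word, take the FreeBase
# types and Wikipedia categories of the top-two titles of the current,
# previous, and next word.
def wikifier_features(words, position, top_titles, types_of):
    feats = []
    for offset, tag in [(-1, "prev"), (0, "cur"), (1, "next")]:
        i = position + offset
        if not 0 <= i < len(words):
            continue
        for rank, title in enumerate(top_titles(words[i])[:2]):
            for t in types_of(title):  # FreeBase types + Wikipedia categories
                feats.append(f"{tag}:top{rank + 1}:{t}")
    return feats

# toy lookups mirroring the German example
titles = {"Lehmann": ["Jens_Lehmann"], "Albrecht": ["Albert,_Duke_of_Prussia"]}
types = {"Jens_Lehmann": ["person", "athlete"],
         "Albert,_Duke_of_Prussia": ["person", "noble_person"]}
words = ["Albrecht", "Lehmann"]
print(wikifier_features(words, 1, lambda w: titles.get(w, []),
                        lambda t: types.get(t, [])))
# ['prev:top1:person', 'prev:top1:noble_person',
#  'cur:top1:person', 'cur:top1:athlete']
```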
Data Sets

The model can be used on 293 languages in Wikipedia.

English, Spanish, German, Dutch: CoNLL 2002/2003 shared task

              English  Spanish  German  Dutch
  # Training  23.5K    18.8K    11.9K   13.3K
  # Test      5.6K     3.6K     3.7K    3.9K

Turkish, Tagalog, Bengali, Tamil, Yoruba: LORELEI and REFLEX packages

              Turkish  Tagalog  Bengali  Tamil  Yoruba
  # Training  5.1K     4.6K     8.8K     7.0K   4.1K
  # Test      2.2K     3.4K     3.5K     4.6K   3.4K
Results

- Monolingual experiments (train and test on the same language)
- Direct transfer experiments (train on English)

[Bar charts comparing +Wikifier, Base+Gazetteers, and Wikifier-only F1 across the nine languages, with Wikipedia sizes from 5.1M titles (English) down to 31K (Yoruba); transfer baselines include Täckström '12 and Zhang '16. Bengali and Tamil use non-Latin scripts.]
Multiple Training Languages

The previous transfer model is trained on English only. Adding training data from other languages helps:

  Training Languages    Turkish  Tagalog  Yoruba  Average
  EN                    47.12    65.44    36.65   49.74
  EN+ES                 44.85    66.61    37.57   49.68
  EN+NL                 48.34    66.09    36.87   50.43
  EN+DE                 49.47    64.10    35.14   49.57
  EN+ES+NL+DE           49.00    66.37    38.02   51.13
  ALL (w/o test lang)   49.83    67.12    37.56   51.50

  (EN: English, ES: Spanish, NL: Dutch, DE: German; ALL: EN, ES, NL, DE, Turkish, Tagalog, Yoruba)
Domain Adaptation

Transfer is good, but not as good as monolingual results. Can we improve the monolingual results using transfer?

  Approach                Spanish  Dutch  Turkish  Tagalog  Average
  Target                  83.87    84.49  73.86    77.64    79.96
  Target + Source         84.17    84.81  74.52    77.80    80.33
  FrustEasy [Daumé, '07]  83.89    84.08  73.73    77.04    79.69

- Target: training on the target language
- Target + Source: adding English training data
- Target + Source can do slightly better; this is consistent with the analysis in Chang et al. (EMNLP'10). Both setups are sketched below.
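For concreteness, a sketch of the two adaptation setups: Target + Source simply concatenates the training sets, while FrustEasy [Daumé, '07] maps each feature into a shared copy plus a domain-specific copy so the learner can separate transferable from domain-specific signal. The feature dicts are toy examples.

```python
# Sketch of the two adaptation setups compared above.
def concat(target_data, source_data):
    """Target + Source baseline: just pool the training examples."""
    return target_data + source_data

def augment(features, domain):
    """Frustratingly easy domain adaptation: shared + per-domain copies."""
    out = {}
    for name, value in features.items():
        out[f"shared:{name}"] = value
        out[f"{domain}:{name}"] = value
    return out

print(augment({"suffix3": "dad", "has_capital": True}, domain="en"))
# {'shared:suffix3': 'dad', 'en:suffix3': 'dad',
#  'shared:has_capital': True, 'en:has_capital': True}
```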
Conclusion

- We propose a language-independent NER model based on a cross-lingual wikifier.
- We study a wide range of languages in both monolingual and cross-lingual settings, and show significant improvements over strong baselines.
- We are building an end-to-end cross-lingual wikification demo for most languages in Wikipedia, and will make the system available.

Thank you!