Cross-Lingual Named Entity Recognition via Wikification
Multilingual NER for many languages, covering Person, Organization, Location, and Miscellaneous entities. A model trained on one language is transferred directly to another using language-independent features, obtained via cross-lingual wikification: finding the corresponding English Wikipedia titles for mentions in non-English documents. Key idea: use a cross-lingual wikifier to generate language-independent features for NER.
Cross-Lingual Named Entity Recognition via Wikification
Chen-Tse Tsai, Stephen Mayhew, and Dan Roth
University of Illinois at Urbana-Champaign
Multilingual NER
NER for many languages: Person, Organization, Location, Miscellaneous.
Challenge: there is no training data for most languages.
Approach: directly transfer a model trained on another language, using language-independent features.
[Slide figure: training data in English ("U.N. official Ekeus heads for Baghdad") and German ("... Verstehen Albrecht Lehmann lässt Flüchtlinge und ...") feed an NER model, which is applied through the cross-lingual wikifier to test sentences in Turkish ("Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı"), Bengali, and Yoruba (a sentence stating that Indonesia is a republic whose capital is Jakarta).]
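To make the direct-transfer idea concrete, here is a minimal sketch (not the authors' code): a tagger trained on English tokens, represented only by features that are meaningful in any language, is applied unchanged to a Turkish token. The extract_features stub and the toy tokens are hypothetical stand-ins for the real feature extractor and CoNLL data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def extract_features(token):
    # Stand-in for the real feature extractor described later; everything
    # it returns must be language-independent (types, shape cues, ...).
    return {
        "is_capitalized": token["form"][:1].isupper(),
        "freebase_type": token.get("freebase_type", "NONE"),
    }

# Toy English training data: tokens with gold BIO tags (illustrative).
en_tokens = [
    {"form": "Ekeus", "freebase_type": "person", "tag": "B-PER"},
    {"form": "Baghdad", "freebase_type": "location", "tag": "B-LOC"},
    {"form": "heads", "freebase_type": "NONE", "tag": "O"},
]

vec = DictVectorizer()
X_en = vec.fit_transform([extract_features(t) for t in en_tokens])
y_en = [t["tag"] for t in en_tokens]
model = LogisticRegression().fit(X_en, y_en)

# Direct transfer: the same features on a Turkish token, with no
# Turkish training data at all.
tr_token = {"form": "Tayvan", "freebase_type": "location"}
print(model.predict(vec.transform([extract_features(tr_token)])))
```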
Cross-Lingual Wikification [Tsai and Roth, NAACL'16]
Given mentions in a non-English document, find the corresponding titles in the English Wikipedia.
Example (Turkish): "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı" — mentions such as "Tayvan", "ABD", and "İngiltere" are grounded to the English titles Taiwan, United_States, and England.
Only requires a multilingual Wikipedia dump.
Grounding is to titles in (target-language Wikipedia ∩ English Wikipedia): the smaller the Wikipedia, the poorer the coverage.
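A minimal sketch of the grounding step, assuming a mention-to-English-title index built offline from the inter-language links of a multilingual Wikipedia dump; the toy Turkish index below is illustrative. A mention absent from the target-language Wikipedia gets no English title, which is exactly why a smaller Wikipedia means poorer coverage.

```python
# Toy index: Turkish surface strings -> English Wikipedia titles,
# standing in for one derived from a full multilingual dump.
langlinks_tr_to_en = {
    "Tayvan": "Taiwan",
    "ABD": "United_States",
    "İngiltere": "England",
}

def wikify(mention, langlinks):
    """Return the English title for a mention, or None when the
    target-language Wikipedia has no page (or no inter-language
    link) for it."""
    return langlinks.get(mention)

for m in ["Tayvan", "ABD", "İngiltere", "hukuk okuması"]:
    print(m, "->", wikify(m, langlinks_tr_to_en))
```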
Outline
The Key Idea
Cross-Lingual NER Model
Wikifier-based Features
Evaluation and Analysis
Key Idea
The cross-lingual wikifier generates good language-independent features for NER by grounding n-grams.
Example (German): "... nachvollziehenden Verstehen Albrecht Lehmann lässt Flüchtlinge und Vertriebene in Westdeutschland ..." ("... empathetic understanding, Albrecht Lehmann lets refugees and expellees in West Germany ..."). The wikifier grounds "Verstehen" to Understanding, "Albrecht" to Albert,_Duke_of_Prussia, "Lehmann" to Jens_Lehmann, "Flüchtlinge" to Refugee, and "Westdeutschland" to Western_Germany; the titles carry Freebase types and Wikipedia categories (person, noble_person, athlete, field_of_study, literature_subject, location, country, ...), and the NER labels here are Person ("Albrecht Lehmann") and Location ("Westdeutschland").
1. Words in any language are grounded to the English Wikipedia, so features extracted from the titles can be used across languages.
2. Instead of the traditional pipeline (NER -> Wikification), wikified n-grams provide features for the NER model.
Cross-Lingual NER Model
We use Illinois NER [Ratinov and Roth, 2009] as the base model.
Features:
- Base features: word forms and affixes; word type (contains capitals? digits?); previous tag pattern; tag context; previous tags (sketched below)
- Gazetteers: Wikipedia titles in multiple languages
- Wikifier: Freebase types and Wikipedia categories
Wikifier features facilitate direct transfer!
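As a rough illustration of the base feature group, a sketch; the feature names are ours, not necessarily those used inside Illinois NER, and the tag-context features (which depend on previously predicted tags) are omitted for brevity.

```python
def base_features(words, i):
    """Lexical and word-type features for the i-th word of a sentence."""
    w = words[i]
    feats = {
        "form=" + w: 1,
        "prefix3=" + w[:3]: 1,        # affix features
        "suffix3=" + w[-3:]: 1,
        "has_capital": any(c.isupper() for c in w),
        "has_digit": any(c.isdigit() for c in w),
    }
    if i > 0:
        feats["prev_word=" + words[i - 1]] = 1
    return feats

print(base_features(["Ekeus", "heads", "for", "Baghdad"], 0))
```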
Wikifier Features
We ground every n-gram (n < 5) to the English Wikipedia.
The cross-lingual wikifier is modified in two ways:
- In the candidate generation step, we query title candidates only with the whole mention string; otherwise the bigram "in Germany" would be linked to the title Germany.
- In the ranking step, we exclude the ranking features that use other mentions, since we do not know what the other mentions are.
NER features for each word (see the sketch below):
- The Freebase types and Wikipedia categories of the top two titles of the current, previous, and next words
- The Freebase types of the n-grams covering the word
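The per-word wikifier features might be assembled as in the sketch below. The ground() function, the toy title index, the top-1 restriction on covering n-grams, and the feature-name scheme are our assumptions; the structure follows the slide: types of the top two titles of the current, previous, and next words, plus types of the n-grams (n < 5) covering the word.

```python
# Toy wikifier output: n-gram -> ranked (English title, Freebase types).
TOY_WIKIFIER = {
    "Lehmann": [("Jens_Lehmann", {"person", "athlete"}),
                ("Walter_Lehmann", {"person"})],
    "Westdeutschland": [("Western_Germany", {"location"})],
}

def ground(ngram):
    return TOY_WIKIFIER.get(ngram, [])

def wikifier_features(words, i, max_n=4):
    feats = set()
    # Types of the top-two titles of the current, previous, next word.
    for pos, j in [("cur", i), ("prev", i - 1), ("next", i + 1)]:
        if 0 <= j < len(words):
            for rank, (title, types) in enumerate(ground(words[j])[:2]):
                feats |= {f"{pos}_top{rank}_{t}" for t in types}
    # Types of every n-gram (2 <= n < 5) covering word i (top title only;
    # the exact cutoff is an assumption).
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(words):
                for title, types in ground(" ".join(words[start:start + n]))[:1]:
                    feats |= {f"ngram{n}_{t}" for t in types}
    return feats

print(wikifier_features(["Albrecht", "Lehmann", "in", "Westdeutschland"], 1))
```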
Data Sets
The model can be used on 293 languages in Wikipedia.
English, Spanish, German, Dutch: CoNLL 2002/2003 shared task

             English  Spanish  German  Dutch
  # Training  23.5K    18.8K    11.9K   13.3K
  # Test       5.6K     3.6K     3.7K    3.9K

Turkish, Tagalog, Bengali, Tamil, Yoruba: LORELEI and REFLEX packages

             Turkish  Tagalog  Bengali  Tamil  Yoruba
  # Training   5.1K     4.6K     8.8K    7.0K    4.1K
  # Test       2.2K     3.4K     3.5K    4.6K    3.4K
Results
Wikipedia sizes: English 5.1M, Dutch 1.9M, German 1.9M, Spanish 1.3M, Turkish 269K, Tagalog 64K, Yoruba 31K, Bengali 42K, Tamil 85K.
[Bar chart, monolingual experiments (train and test on the same language): Base+Gazetteers vs. Wikifier features only vs. Base+Gazetteers+Wikifier. Adding wikifier features helps in every language; e.g., F1 with wikifier features reaches 89.92 on English, 83.87 Spanish, 84.49 Dutch, 73.86 Turkish, 77.64 Tagalog.]
[Bar chart, direct transfer experiments (train on English): the same three systems plus the baselines Täckström '12 (Dutch 58.4, German 40.4, Spanish 59.3) and Zhang '16 (Turkish 43.6, Tagalog 51.3, Yoruba 36.0, Bengali 34.8, Tamil 26.0). Wikifier features drive the transfer, e.g. 65.44 on Tagalog, 47.12 Turkish, 36.65 Yoruba; Bengali and Tamil use non-Latin scripts, so lexical features do not transfer at all.]
Multiple Training Languages
The previous transfer model is trained on English; what if we train on several languages?

  Training Languages    Turkish  Tagalog  Yoruba  Average
  EN                     47.12    65.44    36.65    49.74
  EN+ES                  44.85    66.61    37.57    49.68
  EN+NL                  48.34    66.09    36.87    50.43
  EN+DE                  49.47    64.10    35.14    49.57
  EN+ES+NL+DE            49.00    66.37    38.02    51.13
  ALL (w/o test lang)    49.83    67.12    37.56    51.50

EN: English, ES: Spanish, NL: Dutch, DE: German; ALL: EN, ES, NL, DE, Turkish, Tagalog, Yoruba.
Adding training data from other languages helps.
Domain Adaptation
Transfer is good, but not as good as monolingual results. Can we improve the monolingual results using transfer?

  Approach                    Spanish  Dutch  Turkish  Tagalog  Average
  Target                       83.87   84.49   73.86    77.64    79.96
  Target + Source              84.17   84.81   74.52    77.80    80.33
  FrustEasy [Daumé, 07]        83.89   84.08   73.73    77.04    79.69

Target: training on the target language. Target + Source: adding English training data.
Target + Source does slightly better; this is consistent with the analysis in Chang et al. (EMNLP'10).
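FrustEasy is Daumé III's (2007) "frustratingly easy" feature augmentation. As a minimal sketch of the augmentation itself (applied here to a single feature dict; the naming scheme is ours): every feature gets a shared copy and a domain-specific copy, letting the learner decide which signals transfer between the English source and the target language.

```python
def augment(features, domain):
    """Daume-style feature augmentation for one example."""
    out = {}
    for name, value in features.items():
        out["shared=" + name] = value      # fires in both domains
        out[domain + "=" + name] = value   # fires in this domain only
    return out

print(augment({"is_capitalized": True}, domain="target"))
# {'shared=is_capitalized': True, 'target=is_capitalized': True}
```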
Conclusion
We propose a language-independent NER model based on a cross-lingual wikifier.
We study a wide range of languages in both monolingual and cross-lingual settings, and show significant improvements over strong baselines.
We are building an end-to-end cross-lingual wikification demo covering most languages in Wikipedia, and will make the system available.
Thank you!