Cross-Lingual Named Entity Recognition via Wikification
Multilingual NER for many languages, covering Person, Organization, Location, and Miscellaneous entities. A model trained on one language is transferred directly to another using language-independent features, obtained via cross-lingual wikification: finding the corresponding English Wikipedia titles for mentions in non-English documents. Key idea: use a cross-lingual wikifier to generate language-independent features for NER.
Cross-Lingual Named Entity Recognition via Wikification
Chen-Tse Tsai, Stephen Mayhew, and Dan Roth
University of Illinois at Urbana-Champaign
Multilingual NER
NER for many languages: Person, Organization, Location, Miscellaneous.
Challenge: there is no training data for most languages.
Approach: directly transfer a model trained on another language, using language-independent features.
[Slide figure: training data in English ("U.N. official Ekeus heads for Baghdad") and German ("... Verstehen Albrecht Lehmann lässt Flüchtlinge und ...") feed an NER model, which is applied through the cross-lingual wikifier to test sentences in Turkish ("Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı"), Bengali, and Yoruba (a sentence stating that Indonesia is a republic whose capital is Jakarta).]
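To make the direct-transfer idea concrete, here is a minimal sketch (not the authors' code): a tagger trained on English tokens, represented only by features that are meaningful in any language, is applied unchanged to a Turkish token. The extract_features stub and the toy tokens are hypothetical stand-ins for the real feature extractor and CoNLL data.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

def extract_features(token):
    # Stand-in for the real feature extractor described later; everything
    # it returns must be language-independent (types, shape cues, ...).
    return {
        "is_capitalized": token["form"][:1].isupper(),
        "freebase_type": token.get("freebase_type", "NONE"),
    }

# Toy English training data: tokens with gold BIO tags (illustrative).
en_tokens = [
    {"form": "Ekeus", "freebase_type": "person", "tag": "B-PER"},
    {"form": "Baghdad", "freebase_type": "location", "tag": "B-LOC"},
    {"form": "heads", "freebase_type": "NONE", "tag": "O"},
]

vec = DictVectorizer()
X_en = vec.fit_transform([extract_features(t) for t in en_tokens])
y_en = [t["tag"] for t in en_tokens]
model = LogisticRegression().fit(X_en, y_en)

# Direct transfer: the same features on a Turkish token, with no
# Turkish training data at all.
tr_token = {"form": "Tayvan", "freebase_type": "location"}
print(model.predict(vec.transform([extract_features(tr_token)])))
```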
Cross-Lingual Wikification [Tsai and Roth, NAACL'16]
Given mentions in a non-English document, find the corresponding titles in the English Wikipedia.
Example (Turkish): "Tayvan, ABD ve İngiltere'de hukuk okuması, Tsai'ye bir LL.B. kazandırdı" — mentions such as "Tayvan", "ABD", and "İngiltere" are grounded to the English titles Taiwan, United_States, and England.
Only requires a multilingual Wikipedia dump.
Grounding is to titles in (target-language Wikipedia ∩ English Wikipedia): the smaller the Wikipedia, the poorer the coverage.
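A minimal sketch of the grounding step, assuming a mention-to-English-title index built offline from the inter-language links of a multilingual Wikipedia dump; the toy Turkish index below is illustrative. A mention absent from the target-language Wikipedia gets no English title, which is exactly why a smaller Wikipedia means poorer coverage.

```python
# Toy index: Turkish surface strings -> English Wikipedia titles,
# standing in for one derived from a full multilingual dump.
langlinks_tr_to_en = {
    "Tayvan": "Taiwan",
    "ABD": "United_States",
    "İngiltere": "England",
}

def wikify(mention, langlinks):
    """Return the English title for a mention, or None when the
    target-language Wikipedia has no page (or no inter-language
    link) for it."""
    return langlinks.get(mention)

for m in ["Tayvan", "ABD", "İngiltere", "hukuk okuması"]:
    print(m, "->", wikify(m, langlinks_tr_to_en))
```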
Outline
The Key Idea
Cross-Lingual NER Model
Wikifier-based Features
Evaluation and Analysis
Key Idea
The cross-lingual wikifier generates good language-independent features for NER by grounding n-grams.
Example (German): "... nachvollziehenden Verstehen Albrecht Lehmann lässt Flüchtlinge und Vertriebene in Westdeutschland ..." ("... empathetic understanding, Albrecht Lehmann lets refugees and expellees in West Germany ..."). The wikifier grounds "Verstehen" to Understanding, "Albrecht" to Albert,_Duke_of_Prussia, "Lehmann" to Jens_Lehmann, "Flüchtlinge" to Refugee, and "Westdeutschland" to Western_Germany; the titles carry Freebase types and Wikipedia categories (person, noble_person, athlete, field_of_study, literature_subject, location, country, ...), and the NER labels here are Person ("Albrecht Lehmann") and Location ("Westdeutschland").
1. Words in any language are grounded to the English Wikipedia, so features extracted from the titles can be used across languages.
2. Instead of the traditional pipeline (NER -> Wikification), wikified n-grams provide features for the NER model.
Cross-Lingual NER Model
We use Illinois NER [Ratinov and Roth, 2009] as the base model.
Features:
- Base features: word forms and affixes; word type (contains capitals? digits?); previous tag pattern; tag context; previous tags (sketched below)
- Gazetteers: Wikipedia titles in multiple languages
- Wikifier: Freebase types and Wikipedia categories
Wikifier features facilitate direct transfer!
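As a rough illustration of the base feature group, a sketch; the feature names are ours, not necessarily those used inside Illinois NER, and the tag-context features (which depend on previously predicted tags) are omitted for brevity.

```python
def base_features(words, i):
    """Lexical and word-type features for the i-th word of a sentence."""
    w = words[i]
    feats = {
        "form=" + w: 1,
        "prefix3=" + w[:3]: 1,        # affix features
        "suffix3=" + w[-3:]: 1,
        "has_capital": any(c.isupper() for c in w),
        "has_digit": any(c.isdigit() for c in w),
    }
    if i > 0:
        feats["prev_word=" + words[i - 1]] = 1
    return feats

print(base_features(["Ekeus", "heads", "for", "Baghdad"], 0))
```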
Wikifier Features
We ground every n-gram (n < 5) to the English Wikipedia.
The cross-lingual wikifier is modified in two ways:
- In the candidate generation step, we query title candidates only with the whole mention string; otherwise the bigram "in Germany" would be linked to the title Germany.
- In the ranking step, we exclude the ranking features that use other mentions, since we do not know what the other mentions are.
NER features for each word (see the sketch below):
- The Freebase types and Wikipedia categories of the top two titles of the current, previous, and next words
- The Freebase types of the n-grams covering the word
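The per-word wikifier features might be assembled as in the sketch below. The ground() function, the toy title index, the top-1 restriction on covering n-grams, and the feature-name scheme are our assumptions; the structure follows the slide: types of the top two titles of the current, previous, and next words, plus types of the n-grams (n < 5) covering the word.

```python
# Toy wikifier output: n-gram -> ranked (English title, Freebase types).
TOY_WIKIFIER = {
    "Lehmann": [("Jens_Lehmann", {"person", "athlete"}),
                ("Walter_Lehmann", {"person"})],
    "Westdeutschland": [("Western_Germany", {"location"})],
}

def ground(ngram):
    return TOY_WIKIFIER.get(ngram, [])

def wikifier_features(words, i, max_n=4):
    feats = set()
    # Types of the top-two titles of the current, previous, next word.
    for pos, j in [("cur", i), ("prev", i - 1), ("next", i + 1)]:
        if 0 <= j < len(words):
            for rank, (title, types) in enumerate(ground(words[j])[:2]):
                feats |= {f"{pos}_top{rank}_{t}" for t in types}
    # Types of every n-gram (2 <= n < 5) covering word i (top title only;
    # the exact cutoff is an assumption).
    for n in range(2, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(words):
                for title, types in ground(" ".join(words[start:start + n]))[:1]:
                    feats |= {f"ngram{n}_{t}" for t in types}
    return feats

print(wikifier_features(["Albrecht", "Lehmann", "in", "Westdeutschland"], 1))
```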
Data Sets
The model can be used on 293 languages in Wikipedia.
English, Spanish, German, Dutch: CoNLL 2002/2003 shared task

             English  Spanish  German  Dutch
  # Training  23.5K    18.8K    11.9K   13.3K
  # Test       5.6K     3.6K     3.7K    3.9K

Turkish, Tagalog, Bengali, Tamil, Yoruba: LORELEI and REFLEX packages

             Turkish  Tagalog  Bengali  Tamil  Yoruba
  # Training   5.1K     4.6K     8.8K    7.0K    4.1K
  # Test       2.2K     3.4K     3.5K    4.6K    3.4K
Results
Wikipedia sizes: English 5.1M, Dutch 1.9M, German 1.9M, Spanish 1.3M, Turkish 269K, Tagalog 64K, Yoruba 31K, Bengali 42K, Tamil 85K.
[Bar chart, monolingual experiments (train and test on the same language): Base+Gazetteers vs. Wikifier features only vs. Base+Gazetteers+Wikifier. Adding wikifier features helps in every language; e.g., F1 with wikifier features reaches 89.92 on English, 83.87 Spanish, 84.49 Dutch, 73.86 Turkish, 77.64 Tagalog.]
[Bar chart, direct transfer experiments (train on English): the same three systems plus the baselines Täckström '12 (Dutch 58.4, German 40.4, Spanish 59.3) and Zhang '16 (Turkish 43.6, Tagalog 51.3, Yoruba 36.0, Bengali 34.8, Tamil 26.0). Wikifier features drive the transfer, e.g. 65.44 on Tagalog, 47.12 Turkish, 36.65 Yoruba; Bengali and Tamil use non-Latin scripts, so lexical features do not transfer at all.]
Multiple Training Languages
The previous transfer model is trained on English; what if we train on several languages?

  Training Languages    Turkish  Tagalog  Yoruba  Average
  EN                     47.12    65.44    36.65    49.74
  EN+ES                  44.85    66.61    37.57    49.68
  EN+NL                  48.34    66.09    36.87    50.43
  EN+DE                  49.47    64.10    35.14    49.57
  EN+ES+NL+DE            49.00    66.37    38.02    51.13
  ALL (w/o test lang)    49.83    67.12    37.56    51.50

EN: English, ES: Spanish, NL: Dutch, DE: German; ALL: EN, ES, NL, DE, Turkish, Tagalog, Yoruba.
Adding training data from other languages helps.
Domain Adaptation
Transfer is good, but not as good as monolingual results. Can we improve the monolingual results using transfer?

  Approach                    Spanish  Dutch  Turkish  Tagalog  Average
  Target                       83.87   84.49   73.86    77.64    79.96
  Target + Source              84.17   84.81   74.52    77.80    80.33
  FrustEasy [Daumé, 07]        83.89   84.08   73.73    77.04    79.69

Target: training on the target language. Target + Source: adding English training data.
Target + Source does slightly better; this is consistent with the analysis in Chang et al. (EMNLP'10).
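FrustEasy is Daumé III's (2007) "frustratingly easy" feature augmentation. As a minimal sketch of the augmentation itself (applied here to a single feature dict; the naming scheme is ours): every feature gets a shared copy and a domain-specific copy, letting the learner decide which signals transfer between the English source and the target language.

```python
def augment(features, domain):
    """Daume-style feature augmentation for one example."""
    out = {}
    for name, value in features.items():
        out["shared=" + name] = value      # fires in both domains
        out[domain + "=" + name] = value   # fires in this domain only
    return out

print(augment({"is_capitalized": True}, domain="target"))
# {'shared=is_capitalized': True, 'target=is_capitalized': True}
```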
Conclusion
We propose a language-independent NER model based on a cross-lingual wikifier.
We study a wide range of languages in both monolingual and cross-lingual settings, and show significant improvements over strong baselines.
We are building an end-to-end cross-lingual wikification demo covering most languages in Wikipedia, and will make the system available.
Thank you!