Unsupervised Multiword Expression Extraction Using Measure Clustering Approach
The goal of this study is to develop an unsupervised method for extracting multiword expressions (MWEs) such as idioms, terms, and proper names of different semantic types, in order to supplement lexical resources. The presentation covers the properties of MWEs (non-compositionality, statistical idiosyncrasy, and morphological/syntactic rigidity), the data, the measures used, and the clustering results. Statistical association measures, context measures, and measures based on vector representations are applied to MWE extraction. The method is intended to be, as far as possible, independent of specific linguistic resources and of the language.
Presentation Transcript
25th International Conference on Computational Linguistics and Intellectual Technologies, May 29 – June 1, Moscow, RSUH
Measure clustering approach to MWE extraction
Petr Rossyaykin, Moscow State University, petrrossyaykin@gmail.com
Natalia Loukachevitch, Moscow State University, louk_nat@mail.ru
Our goal Provide an unsupervised, possibly resource- and language-independent method to extract multiword expressions (namely, nominal phrases of 2 words) of different semantic types in order to supplement lexical resources
Outline
1. MWEs and their properties
2. Data & task
3. Measures: statistical association measures, context measures, distributional measures
4. Clustering
5. Results
6. Discussion
Multiword expressions Multiword expressions (MWEs) are lexical items that: (a) can be decomposed into multiple lexemes; and (b) display lexical, syntactic, semantic, pragmatic and/or statistical idiomaticity (Baldwin & Kim 2010)
Multiword expressions (MWEs) Examples: idioms (to kick the bucket, to spill the beans), terms (vowel harmony, extended projection principle), proper names (South Korea, Lady Gaga), lexicalized expressions (by and large, coat of arms), etc.
Properties of MWEs
1. Non-compositionality (e.g. blue devils)
2. Statistical idiosyncrasy (e.g. law and order vs. order and law)
3. Morphological/syntactic idiosyncrasy (e.g. by and large) and rigidity (John kicked the bucket vs. #John kicked a bucket vs. #The bucket was kicked by John)
+ pragmatic idiomaticity (e.g. good morning)
MWEs differ in regard to the prominence of these properties.
Overview of methods
Method | Properties used
statistical | statistical idiosyncrasy
context | non-compositionality
vector representations | non-compositionality
morphosyntactic | morphological/syntactic rigidity
external (machine translation, parallel corpora, thesauri) | non-compositionality
metalinguistic/heuristic | different
combinational (machine learning) | different
Task
The number of MWEs is comparable to that of single lexemes, and new ones appear in language constantly (Jackendoff 1997, Sag et al. 2002).
The task is to propose a method to supplement lexical resources with new MWEs (noun phrases).
The Russian language thesaurus RuThes (Loukachevitch et al. 2014) is used as a gold standard.
Data
Corpus: Russian news texts from the Internet published in 2011, 446 million tokens
Pre-processing: lemmatised, upper-cased, POS-tagged
Initial candidate expressions
N-N (e.g. "head of state") and Adj-N (e.g. "criminal case") bigrams
Observed frequency > 200
37767 candidate bigrams in total
Class | Size | Examples
Positive class (T) | 9837 expressions present in RuThes | human rights, social network, point of view
Negative class (N) | other 27930 expressions | investigation process, virtual communication, secrets of beauty
20 most frequent expressions: CRIMINAL CASE, LAST YEAR, LAW ENFORCEMENT AGENCY, RUSSIAN FEDERATION, NEAREST FUTURE, OFFICIAL SITE, THOUSAND RUBLE, HEAD OF STATE, THIS DAY, YOUNG MAN, GENERAL DIRECTOR, POINT OF VIEW, INVESTIGATIVE COMMITTEE, MILLION RUBLE, CURRENT YEAR, RESTRAINT OF LIBERTY (IMPRISONMENT), PROTEST RALLY, FOREIGN AFFAIR, INTERNAL AFFAIR, THOUSAND PEOPLE
Statistical association measures
1) t-score
2) log-likelihood ratio (LLR)
3) chi-square
4) Dice coefficient (DC)
5) modified DC
6) geometric mean
7) odds ratio
8) Poisson significance measure (PSM)
9) Piatetsky-Shapiro coefficient (PS)
10) confidence
11) sample deviation
12) pointwise mutual information (PMI)
13) augmented PMI
14) local PMI
15) cubic PMI (PMI3)
16) normalized PMI (NPMI)
17) normalized MI
18) MI / NF(α)
19) PMI / NF(α)
20) MI / NFmax
21) PMI / NFmax (Hoang et al. 2009)
22) NPMIC (Carlini et al. 2014)
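To make the list above more concrete, here is a minimal sketch of four of these measures computed from raw counts. The formulas are common textbook formulations (PMI as log2 of observed over expected frequency, cubic PMI as log2 of cubed observed over expected frequency) and are an assumption, not necessarily the exact variants used by the authors; the function name, argument names, and the toy counts are hypothetical.

```python
import math

def association_scores(f_xy, f_x, f_y, n):
    """Return a few association measures for one bigram.

    f_xy: observed bigram frequency; f_x, f_y: component frequencies;
    n: corpus size in tokens. All names are illustrative.
    """
    # Frequency expected if x and y occurred independently of each other
    expected = f_x * f_y / n
    return {
        "PMI": math.log2(f_xy / expected),
        "PMI3": math.log2(f_xy ** 3 / expected),  # cubic PMI (assumed count-based form)
        "t-score": (f_xy - expected) / math.sqrt(f_xy),
        "Dice": 2 * f_xy / (f_x + f_y),
    }

# Toy counts, not taken from the corpus described in the slides
scores = association_scores(f_xy=500, f_x=2000, f_y=1000, n=1_000_000)
```

Cubing the observed frequency rewards frequent bigrams, which offsets the tendency of plain PMI to favour rare pairs.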
Frequency distribution [figure]. Red points: positive class; blue points: negative class. X axis: log(expected frequency); Y axis: log(observed frequency). Black curve: PMI3 at the level of 30 (best among statistical measures).
Context measures
Idea: compare sets of contexts of n-grams
Measures:
1. gravity count
2. modified gravity count
3. type-LR
4. type-FLR
5. context intersection
6. independent context intersection
+ combinations
type-LR
We can model lexical rigidity (non-substitutability) using sets of internal contexts (neighboring words):
$$\text{type-LR}(x, y) = \sqrt{|r(x)| \cdot |l(y)|}$$
$$\text{type-FLR}(x, y) = f(x, y) \cdot \text{type-LR}(x, y)$$
where $r(x)$ is the set of unique words occurring in the corpus immediately to the right of the word $x$, $l(x)$ is the set of unique words occurring in the corpus immediately to the left of the word $x$, and $f(x, y)$ is the observed frequency of the bigram.
type-LR example
x = general, y = director
r(general) = { (general) director, (general) secretary, (general) agreement, (general) cleaning, ... }, 671 unique words in total
l(director) = { general (director), fired (director), former (director), claim (director), ... }, 17234 unique words in total
$$\text{type-LR}(\text{general}, \text{director}) = \sqrt{|r(\text{general})| \cdot |l(\text{director})|} = \sqrt{671 \cdot 17234} = 3400.59$$
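The worked example can be reproduced with a short sketch that collects right and left neighbour sets from a stream of adjacent word pairs. It assumes type-LR is the square root of the product of the two set sizes, which is what the arithmetic in the example implies; the function name and the toy bigram list are hypothetical.

```python
import math
from collections import defaultdict

def type_lr_table(bigrams):
    """bigrams: iterable of (left_word, right_word) pairs of adjacent tokens."""
    right_ctx = defaultdict(set)  # r(x): unique words seen immediately after x
    left_ctx = defaultdict(set)   # l(y): unique words seen immediately before y
    for left, right in bigrams:
        right_ctx[left].add(right)
        left_ctx[right].add(left)

    def type_lr(x, y):
        # Square root of the product of the two context-set sizes
        return math.sqrt(len(right_ctx[x]) * len(left_ctx[y]))

    return type_lr

# Toy data; repeated pairs do not enlarge the sets of unique neighbours
toy = [("general", "director"), ("general", "secretary"),
       ("general", "agreement"), ("former", "director"),
       ("fired", "director"), ("general", "director")]
type_lr = type_lr_table(toy)
value = type_lr("general", "director")

# The slide's own numbers: 671 right neighbours, 17234 left neighbours
slide_value = round(math.sqrt(671 * 17234), 2)
```

Multiplying `value` by the bigram frequency would give the corresponding type-FLR score under the same assumption.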
Context intersection
The following two measures compare external context sets of MWEs, context sets of their single components, and context sets of single components occurring outside candidate expressions: context intersection, CI(x, y), and independent context intersection, ICI(x, y). [Formulas not preserved in the transcript.]
Combinations with frequency
Combining context measures with the observed frequency $f(x, y)$ makes it possible to exploit both the non-compositionality and the statistical idiosyncrasy of MWEs:
$$\text{CI*freq}(x, y) = f(x, y) \cdot \text{CI}(x, y)$$
$$\text{ICI*log(f)}(x, y) = \log f(x, y) \cdot \text{ICI}(x, y)$$
Distributional measures
Loukachevitch & Parkhomenko (2018) achieved extremely high average precision (AP@100 = 0.95) on the same data with the following measures:
$$\text{DFsing} = \max_{w} \cos(\mathbf{v}_{xy}, \mathbf{v}_{w})$$
where $w$ is a word from the model vocabulary distinct from $x$ and $y$
$$\text{DFthes} = \max_{te} \cos(\mathbf{v}_{xy}, \mathbf{v}_{te})$$
where $te$ is a thesaurus entry (word or phrase)
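A minimal sketch of the DFsing idea follows: the maximum cosine similarity between a phrase vector and the vectors of single words other than the phrase's own components. How phrase vectors are obtained is not stated on the slide, so the sketch simply takes them as given; the function names and the two-dimensional toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def df_sing(phrase_vec, word_vecs, components):
    """Max cosine between the phrase vector and any single word
    that is not one of the phrase's own components."""
    return max(cosine(phrase_vec, vec)
               for word, vec in word_vecs.items()
               if word not in components)

# Toy 2-d "embeddings", purely illustrative
word_vecs = {"general": [1.0, 0.0], "director": [0.0, 1.0],
             "boss": [1.0, 1.0], "tree": [-1.0, 0.0]}
score = df_sing([2.0, 2.0], word_vecs, components={"general", "director"})
```

Replacing `word_vecs` with vectors of thesaurus entries, and dropping the component filter, would give a DFthes-style score under the same assumptions.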
Individual measures (best)
measure | AP@100 | AP@500 | AP@1000 | AP@2500
Best of 22 statistical association measures
LLR | 0.778 | 0.802 | 0.780 | 0.705
PMI3 | 0.907 | 0.821 | 0.795 | 0.726
Context measures
type-FLR | 0.818 | 0.825 | 0.796 | 0.74
CI*freq | 0.902 | 0.866 | 0.847 | 0.777
ICI*log(f) | 0.915 | 0.879 | 0.855 | 0.789
Distributional measures
DFsing | 0.846 | 0.770 | 0.694 | 0.583
DFsing*log(f) | 0.929 | 0.877 | 0.834 | 0.731
DFthes | 0.953 | 0.853 | 0.823 | 0.759
DFthes*log(f) | 0.950 | 0.910 | 0.879 | 0.807
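For readers unfamiliar with the AP@k columns, the sketch below shows one common way to compute average precision over the top k items of a ranked list: the mean of precision values taken at each position that holds a true MWE. The slides do not spell out the exact definition used, so this formulation is an assumption; the function name and the toy labels are hypothetical.

```python
def average_precision_at_k(ranked_labels, k):
    """ranked_labels: booleans, best-ranked candidate first
    (True = the expression is in the gold standard)."""
    hits = 0
    precisions = []
    for position, is_mwe in enumerate(ranked_labels[:k], start=1):
        if is_mwe:
            hits += 1
            # Precision of the list cut off at this position
            precisions.append(hits / position)
    return sum(precisions) / len(precisions) if precisions else 0.0

# Toy ranking: hits at positions 1, 2 and 4
ap = average_precision_at_k([True, True, False, True, False], k=5)
```

With this definition, errors near the top of the list cost more than errors further down, which is why the measure suits ranked candidate lists.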
Individual results (best) [figure: two panels, one with recall up to 1 and one with recall up to 0.14]
Top-10 lists: type-FLR & DFthes
Expressions in the two top-10 lists: Yedioth Ahronoth, kindergarten, salary, European Union, law-enforcing agency, nuclear power station, Central Partnership, nuclear station, point of view, international community, Autonews column, world community, criminal case, Near East, prosecutor-general's office, district court, alcohol intoxication, Troy ounce, government budget, detention center
Note the absence of matches between the two top-10 lists.
Combinational methods
1. Linear combinations (Zakharov 2017)
2. Machine learning: neural networks (Pecina 2008), Bayesian networks (Tsvetkov & Wintner 2011, Buljan & Šnajder 2017), clustering (Tutubalina 2015), etc.
Clustering
Idea:
1. Use measures as dimensions
2. Cluster expressions
3. Provide ranking function based on clustering results
expression vector = <PMI value, t-score, LLR, ...>
Clustering More ideas: Measures of different types account for different properties of MWEs => allow separating MWEs of different types from free phrases Increasing number of measures/dimensions gives more fine-grained clustering => yields better results
Feature subset
Best individual measures:
Group | Measures
statistical association measures | 1) PMI3 2) LLR
context measures | 3) type-FLR 4) ICI*log(freq)
distributional measures | 5) DFsing 6) DFsing*log(freq) 7) DFthes 8) DFthes*log(freq)
Normalization
Values assigned by different measures vary considerably: values → ranks → (binary) logarithms of ranks
expression | PMI | rank | log(rank)
all-night vigil | 13.78 | 1 | 0
Christmas tree | 7.71 | 2 | 1
biological weapon | 5.68 | 3 | 1.585
yesterday meeting | 4.61 | 4 | 2
healthy man | 3.07 | 5 | 2.322
new page | 2.08 | 6 | 2.585
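The normalization step can be sketched in a few lines: sort expressions by a measure's value, assign ranks starting from 1, and take the binary logarithm of each rank. The function name is hypothetical, and ties (which the slide does not discuss) are broken here simply by the order in which the sort happens to leave them.

```python
import math

def log_ranks(values):
    """values: {expression: raw measure value}, higher = better.
    Returns {expression: log2(rank)}, so the best expression gets 0."""
    order = sorted(values, key=values.get, reverse=True)  # best value first
    return {name: math.log2(rank) for rank, name in enumerate(order, start=1)}

# PMI values from the table on the slide
pmi = {"all-night vigil": 13.78, "Christmas tree": 7.71,
       "biological weapon": 5.68, "yesterday meeting": 4.61,
       "healthy man": 3.07, "new page": 2.08}
normalized = log_ranks(pmi)
```

Taking logarithms of ranks puts all measures on the same scale while stretching differences near the top of each list relative to those near the bottom.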
Ranking function
Since we are interested in ranking, rather than just classifying, bigrams, we used the following centroid-oriented scoring function:
$$\text{score}(xy) = d(\mathbf{xy}, c_0) - d(\mathbf{xy}, c_1)$$
where $d(\cdot, \cdot)$ is Euclidean distance, $\mathbf{xy}$ is the vector of an expression x y, $c_0$ is the centroid of the larger cluster, and $c_1$ is the centroid of the smaller cluster.
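A minimal sketch of a centroid-oriented score is given below. It assumes the score is the difference between the distance to the larger cluster's centroid and the distance to the smaller cluster's centroid; the operator between the two distances is not legible in the transcript, so the subtraction is an assumption. Cluster labels are taken as given (they could come from any two-cluster algorithm), and the function names and toy vectors are hypothetical.

```python
import math

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(points) for column in zip(*points)]

def centroid_scores(vectors, labels):
    """vectors: feature vectors; labels: 0/1 cluster label per vector."""
    groups = {0: [], 1: []}
    for vec, lab in zip(vectors, labels):
        groups[lab].append(vec)
    # c0: centroid of the larger cluster, c1: centroid of the smaller one
    larger, smaller = sorted(groups.values(), key=len, reverse=True)
    c0, c1 = centroid(larger), centroid(smaller)
    # Assumed form: distance to c0 minus distance to c1
    return [math.dist(v, c0) - math.dist(v, c1) for v in vectors]

# Toy data: three points near the origin, two far away
vectors = [[0, 0], [0, 1], [1, 0], [5, 5], [5, 6]]
labels = [0, 0, 0, 1, 1]
scores = centroid_scores(vectors, labels)
top_two = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)[:2]
```

Under this assumption, a higher score means an expression lies closer to the smaller cluster's centroid and further from the larger one's, so sorting by score yields a full ranking rather than a binary split.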
Clustering
247 feature subsets with more than 1 element
2 clustering algorithms (scikit-learn):
1. k-means (2 clusters)
2. agglomerative hierarchical clustering (2 clusters, Ward linkage)
vs 2 variants of linear combination:
1. sum of rank logarithms
2. product of rank logarithms
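The two linear-combination baselines can be sketched as follows, assuming each expression already has one log-rank per measure and that lower combined values are better. The function name and the toy rows are hypothetical, and the sketch does not cover the clustering variants themselves.

```python
import math

def combine_log_ranks(feature_rows, how="sum"):
    """feature_rows: {expression: [log2(rank) under each measure]}.
    Returns expressions ordered from best (lowest combined value) to worst."""
    combine = sum if how == "sum" else math.prod
    combined = {expr: combine(row) for expr, row in feature_rows.items()}
    return sorted(combined, key=combined.get)

# Toy log-ranks under three measures (0.0 = ranked first by that measure)
rows = {"criminal case": [0.0, 1.0, 0.0],
        "point of view": [1.0, 0.0, 1.0],
        "new page": [1.585, 1.585, 1.585]}
by_sum = combine_log_ranks(rows, how="sum")
by_product = combine_log_ranks(rows, how="product")
```

Note that the product is zero for any expression ranked first by at least one measure, so in this toy example the product variant cannot separate the first two expressions while the sum can.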
Results
method | measures used | AP@100 | AP@500 | AP@1000 | AP@2500
sum of rank logarithms | type-FLR, DFthes, DFsing*log(f) | 0.976 | 0.945 | 0.92 | 0.847
agglomerative clustering of rank logarithms | type-FLR, ICI*log(f), DFthes, DFsing*log(f) | 0.986 | 0.94 | 0.907 | 0.84
clustering of rank logarithms (algorithm label not preserved) | LLR, type-FLR, DFsing, DFthes, DFthes*log(f) | 0.991 | 0.955 | 0.917 | 0.847
clustering of rank logarithms (algorithm label not preserved) | LLR, type-FLR, ICI*log(f), DFsing, DFthes, DFsing*log(f), DFthes*log(f) | 0.988 | 0.95 | 0.914 | 0.844
Discussion
1. The best results are achieved with a clustering-based ranking function applied to relatively large subsets of features (from 4 to 7)
2. The two best variants use measures of all three types
3. Except for cubic PMI, all of the measures we tried to combine appear in the best setups at least twice
Top-20 list, best combination: reconditioning, Donetsk region, Krasnoyarsk region, Orenburg region, executive director, antigovernment rally, administrative liability, election campaign, exhibition game, Kievan Dynamo, Primorsky kray, narcotic substance, Tomsk region, oil extraction, mobile phone, uniformed service, mobile network operator, comprehensive school, federal budget, Irkutsk region
No matches with the lists extracted by type-FLR and DFthes.
Conclusion
Ranked lists extracted by measures of different types exhibit significant variation.
The highest average precision is achieved with the help of measures which utilize both frequency and context/distributional information.
=> The choice of measures is crucial for the efficiency of combinational MWE extraction.
Our unsupervised method makes it possible to incorporate a relatively large number of measures of different types, yielding high AP.
References
Baldwin T., Kim S. N. (2010), Multiword expressions. In Nitin Indurkhya and Fred J. Damerau, editors, Handbook of Natural Language Processing, pages 267–292. CRC Press, Taylor and Francis Group, Boca Raton, FL, USA, 2nd edition.
Buljan M., Šnajder J. (2017), Combining Linguistic Features for the Detection of Croatian Multiword Expressions. Proceedings of the 13th Workshop on Multiword Expressions (MWE 2017), Valencia, pages 194–199.
Jackendoff R. (1997), The Architecture of the Language Faculty. Number 28 in Linguistic Inquiry Monographs. MIT Press, Cambridge, MA, USA. 262 p.
Loukachevitch N., Dobrov B., Chetviorkin I. (2014), RuThes-Lite, a publicly available version of thesaurus of Russian language RuThes. Computational Linguistics and Intellectual Technologies: Papers from the Annual International Conference "Dialogue", Bekasovo, Russia.
Loukachevitch N., Parkhomenko E. (2018), Recognition of multiword expressions using word embeddings // Artificial Intelligence. RCAI 2018. Vol. 934 of Communications in Computer and Information Science. Springer, Cham. P. 112–124.
Pecina P. (2008), A Machine Learning Approach to Multiword Expression Extraction. In Proceedings of the LREC 2008 Workshop Towards a Shared Task for Multiword Expressions (MWE 2008), pages 54–57, Marrakech.
Sag I. A., Baldwin T., Bond F., Copestake A., Flickinger D. (2002), Multiword expressions: A pain in the neck for NLP. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics (CICLing 2002), pages 1–15, Mexico City, Mexico.
Tsvetkov Y., Wintner S. (2011), Identification of multi-word expressions by combining multiple linguistic information sources. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 836–845. Association for Computational Linguistics.
Tutubalina E. (2015), Clustering-based Approach to Multiword Expression Extraction and Ranking. In NAACL-HLT, pages 39–43, Denver, Colorado.
Zakharov V. (2017), Automatic Collocation Extraction: Association Measures Evaluation and Integration // Computational Linguistics and Intellectual Technologies: Papers from the Annual Conference "Dialogue". Volume 1 of 2. Computational Linguistics: Practical Applications. Moscow: RSUH. P. 396–407.