Enhancing Word Representation for Rare Words by Xiao Wenyi

This article discusses methods for improving word representation for rare words, building on the prediction-based Skip-gram model by Mikolov et al. and the count-based GloVe model by Pennington et al. It reviews Skip-gram and GloVe and the issues existing models face when handling rare words.



Presentation Transcript


  1. Improving Word Representation for Rare Words (XIAO Wenyi)

  2. Basic Work  Prediction-based method: Skip-gram by Mikolov et al. [1]. Count-based method: GloVe by Pennington et al. [2]  [1] Mikolov T, Sutskever I, Chen K, Corrado G, Dean J. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013. [2] Pennington J, Socher R, Manning C D. GloVe: Global Vectors for Word Representation. EMNLP, 2014: 1532-1543.

  3. Skip-gram  Basic idea: 1. Words with similar distributional contexts should be close to each other in the embedding space. 2. Manipulating the distributional context should lead to a similar translation in the embedding space, e.g. queen ≈ king - man + woman.  Skip-gram with Negative Sampling (SGNS): 1. Maximize the dot product between the focus word embedding and the embeddings of its sampled context words. 2. Minimize the dot product between the focus word and randomly sampled non-context words.

  4. Example  the quick brown fox jumped over the lazy dog  Process: 1. Using a window size of 1, we obtain the dataset ([the, brown], quick), ([quick, fox], brown), ([brown, jumped], fox), ... 2. From this we get (context, target) / (input, output) pairs: (quick, the), (quick, brown), (brown, quick), (brown, fox), ... 3. The goal is to update the embedding parameters θ to improve the following objective (shown here for the pair (the, quick)): J_NEG^(t) = log Q_θ(D = 1 | the, quick) + k E_{w~P_noise}[log Q_θ(D = 0 | w, quick)]  SGNS poorly utilizes the statistics of the corpus, since it trains on separate local context windows instead of on global co-occurrence counts.
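The pair-generation step above can be sketched in a few lines of Python; the function name and window handling are illustrative, not the authors' code:

```python
# Sketch: generate skip-gram (target, context) pairs with a window size of 1,
# matching the example sentence above.
def skipgram_pairs(tokens, window=1):
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pairs.append((target, tokens[j]))  # (focus word, context word)
    return pairs

sentence = "the quick brown fox jumped over the lazy dog".split()
print(skipgram_pairs(sentence)[:4])
# [('the', 'quick'), ('quick', 'the'), ('quick', 'brown'), ('brown', 'quick')]
```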

  5. GloVe  GloVe builds a word-context co-occurrence matrix X and trains only on its nonzero elements. Using stochastic gradient descent, GloVe learns the word embedding matrix W, the context embedding matrix W̃, and the bias terms b_i and b̃_j. The goal is to minimize the following cost function: J = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2  where X_ij is an entry of the word-context co-occurrence matrix and the weighting function f gives zero weight to unobserved pairs (f(0) = 0).
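A minimal sketch of this cost in Python/NumPy, assuming W and W̃ are row matrices of word and context vectors and b, b̃ are bias arrays; the weighting cutoffs follow the GloVe paper's defaults (x_max = 100, α = 0.75), which this presentation does not specify:

```python
import numpy as np

# GloVe weighting function: zero for unobserved pairs, capped at 1 for frequent pairs.
def glove_weight(x, x_max=100.0, alpha=0.75):
    return (x / x_max) ** alpha if x < x_max else 1.0

# Cost summed only over the observed (nonzero) entries of the co-occurrence matrix X.
def glove_cost(W, W_tilde, b, b_tilde, X):
    cost = 0.0
    for i, j in zip(*np.nonzero(X)):
        diff = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
        cost += glove_weight(X[i, j]) * diff ** 2
    return cost
```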

  6. GloVe  GloVe requires training time proportional to the number of observed co-occurrence pairs, which shortens training. As a consequence, it gives no penalty for placing features near one another when their co-occurrence has not been observed: entries with X_ij = 0 are not used to update the parameters of the model.

  7. Issue of existing models  Our model also uses the co-occurrence matrix, as GloVe does. We focus on the elements where X_ij = 0. There are two situations that may lead to X_ij = 0: 1. When both w_i and w_j are common words, X_ij = 0 means the two words are genuinely unrelated and never appear in each other's context. 2. When either w_i or w_j is a rare word, X_ij = 0 may simply mean that we have not seen enough data.

  8. Our Proposed Model: GloVe++

  9. Our Proposed Model: GloVe++  GloVe: J_GloVe = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2  GloVe++: J_GloVe++ = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log g(X_ij))^2  g(x) enlarges the co-occurrence frequency between two rare words, so that the adjusted count is greater than 0 even when X_ij = 0.

  10. GloVe++ Process  We focus on the inner structure of words, enabling rare words to be related to common words. Process: 1. Use a bidirectional LSTM as a word generator to embed words. 2. For every pair of words, compute their structure similarity. 3. Combine this structure similarity with the co-occurrence information to make some X_ij > 0.

  11. GloVe++: LSTM Generator  We use a bidirectional LSTM to generate the embedding of a word such as "unbelievably" from its characters. In this way, the embedding of the word "unbelievably" is similar to the embedding of the word "believably".
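A minimal sketch of such a character-level generator in PyTorch, assuming the dimensions from slide 19 (character embedding 64, word embedding 256); the class name and the way the two LSTM directions are merged (summing their final states) are assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class BiLSTMWordGenerator(nn.Module):
    """Embeds a word from its character sequence with a bidirectional LSTM."""
    def __init__(self, n_chars, char_dim=64, word_dim=256):
        super().__init__()
        self.char_emb = nn.Embedding(n_chars, char_dim)
        self.lstm = nn.LSTM(char_dim, word_dim, bidirectional=True, batch_first=True)

    def forward(self, char_ids):                 # char_ids: (batch, word_length)
        h = self.char_emb(char_ids)              # (batch, word_length, char_dim)
        out, _ = self.lstm(h)                    # (batch, word_length, 2 * word_dim)
        fwd = out[:, -1, :out.size(-1) // 2]     # forward direction, last character
        bwd = out[:, 0, out.size(-1) // 2:]      # backward direction, first character
        return fwd + bwd                         # merged 256-dimensional word embedding
```

Because words sharing morphemes (e.g. "believably" and "unbelievably") share most of their character sequence, such a generator tends to place them near each other.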

  12. GloVe++: Structure Matrix  We build a structure matrix S over the vocabulary whose entries S_ij = sim(w_i, w_j) are the structure similarities between the generator embeddings of w_i and w_j, and we use a parameter λ to adjust the structure matrix.

  13. GloVe++: Add Structure Information  We add the structure matrix S to the co-occurrence matrix X to obtain the adjusted matrix X': X'_ij = X_ij + S_ij. For a pair of words with X_ij = 0, this gives X'_ij = 0 + S_ij, so rare-word pairs that never co-occur in the corpus can still receive a nonzero entry.
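A minimal sketch of steps 12-13 in NumPy. Cosine similarity for sim(·,·) and the use of λ as a simple scale on S are assumptions; the slides only state that λ (set to 0.0354 on slide 19) adjusts the structure matrix:

```python
import numpy as np

# Structure matrix from generator embeddings V (one row per vocabulary word).
# Cosine similarity is an assumed choice for sim(w_i, w_j).
def structure_matrix(V):
    V_norm = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = V_norm @ V_norm.T          # S_ij = sim(w_i, w_j)
    np.fill_diagonal(S, 0.0)       # ignore self-similarity
    return S

# Combine with the co-occurrence matrix: X'_ij = X_ij + lam * S_ij,
# so pairs with X_ij = 0 can become nonzero and enter training.
def combine(X, S, lam=0.0354):
    return X + lam * S
```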

  14. Model Training  J_model = Σ_{i,j} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log g(X_ij))^2  We use adaptive gradient descent (AdaGrad) [1] to minimize the loss. AdaGrad is a modified form of stochastic gradient descent (SGD) that adapts the step size per parameter, giving relatively larger updates to rarely occurring features than to those that fire constantly.  [1] http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf
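A minimal sketch of a single AdaGrad update, assuming a per-parameter accumulator of squared gradients; the learning rate of 0.05 follows slide 19, and the epsilon term is a standard numerical-stability assumption:

```python
import numpy as np

# One AdaGrad step: parameters with a small accumulated gradient history
# (rarely firing features) receive relatively larger updates.
def adagrad_step(param, grad, accum, lr=0.05, eps=1e-8):
    accum += grad ** 2
    param -= lr * grad / (np.sqrt(accum) + eps)
    return param, accum
```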

  15. Experiment

  16. Training Dataset  A large English Wikipedia corpus obtained from the 2014 Wikipedia dump. After preprocessing (punctuation discarded), we get a 12 GB file, wiki.en.text, with one article per line.

  17. Evaluation  We evaluate our model on two types of tasks: Word syntactic questions, which test vector space substructure, e.g. amazing : amazingly :: apparent : apparently. Word semantic questions, which test "a is to b as c is to __", e.g. Paris : France :: London : Britain. The evaluation dataset is the Google analogy task, where each item is formatted as a question.
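A minimal sketch of answering one such analogy question by vector arithmetic, assuming word_vecs is a matrix with one (preferably length-normalized) embedding per row and vocab maps words to row indices; the function name is illustrative:

```python
import numpy as np

# Answer "a : b :: c : ?" by finding the word closest to b - a + c,
# excluding the three question words themselves.
def analogy(word_vecs, vocab, a, b, c):
    query = word_vecs[vocab[b]] - word_vecs[vocab[a]] + word_vecs[vocab[c]]
    query /= np.linalg.norm(query)
    sims = word_vecs @ query                  # cosine similarity if rows are normalized
    for w in (a, b, c):
        sims[vocab[w]] = -np.inf
    best = int(np.argmax(sims))
    return next(w for w, i in vocab.items() if i == best)
```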

  18. Evaluation Dataset  Semantic task: 8869 questions. Syntactic task: 10675 questions.

  19. Parameter Setting  Word generator: character embedding dimension d_c = 64 and char-level word embedding dimension d_w = 256. For the bidirectional LSTM, we merge the forward and backward embeddings to get an output word vector of 256 dimensions and use SGD as the optimizer. Global word vectors: we set λ = 0.0354 to adjust the structure matrix produced by the word generator. The optimizer is AdaGrad with learning rate 0.05. We run 50 iterations for vectors of 300 dimensions.

  20. Result and Comparison

                   Semantic Task   Syntactic Task   Total
  Skip-gram            33.9%           51.3%        43.6%
  GloVe                44.4%           34.1%        36.5%
  Ours: GloVe++        53.4%            2.4%        33.4%

  Adding structural information is effective at capturing inner connections among words. The linear adding process is simple.

  21. Related Work for Rare Words
  [1] Shazeer N, Doherty R, Evans C, et al. Swivel: Improving Embeddings by Noticing What's Missing. arXiv preprint arXiv:1602.02215, 2016.
  [2] Bojanowski P, Grave E, Joulin A, et al. Enriching Word Vectors with Subword Information. arXiv preprint arXiv:1607.04606, 2016.
  [3] Cao K, Rei M. A Joint Model for Word Embedding and Word Morphology. arXiv preprint arXiv:1606.02601, 2016.
  [4] Luong T, Socher R, Manning C D. Better Word Representations with Recursive Neural Networks for Morphology. CoNLL, 2013: 104-113.
  [5] Qiu S, Cui Q, Bian J, et al. Co-learning of Word Representations and Morpheme Representations. COLING, 2014: 141-150.

  22. Questions and Answers
