Understanding Word Embeddings and Training Methods

Explore the world of word embeddings, distributed representations, and training methods used in Natural Language Processing. Learn about the basics of word embeddings, representation techniques, and popular algorithms like Word2Vec. Dive into the concepts of CBOW and Skip-grams, along with Hierarchical Softmax and Negative Sampling. Discover how to create dense, low-dimensional embeddings, and gain insights into lemmatization and stemming. Enhance your knowledge of word context and representation with practical examples and insightful visuals.

  • Word Embeddings
  • NLP
  • Training Methods
  • Word2Vec
  • Distributed Representations

Presentation Transcript


  1. Word embeddings

  2. Word embeddings (basics)

  3. Distributed representations: each word is mapped (word -> vector) to a dense, low-dimensional embedding.

  4. Pre-trained embeddings are widely available, e.g., vectors trained on the Google News corpus.

  5. [Example: vector notation and the norm of a vector, e.g., ||(1, 2, 3)||]

  7. [Example: representing the word "dog"]

  8. TIP: each word corresponds to one position in a |V| x 1 vector.

  10. Lemmatization and stemming

  12. Context: the words within a window around a given word (example window size = 3). Each training position defines a center word and its context words. We learn two kinds of representations: (1) a center representation and (2) a context representation, stored in two |V| x d matrices (a center-embedding matrix and a context-embedding matrix). Learning: take the (center, context) training examples and fix the matrices to work for them (see the sketch below for how such pairs are generated).
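
A minimal sketch of how (center, context) training pairs could be generated from a tokenized corpus; the toy sentence and window size below are illustrative and not taken from the slides:

```python
# A toy sketch (not from the slides): build (center, context) training pairs
# from a tokenized sentence using a symmetric window of size m.
def window_pairs(tokens, m=3):
    pairs = []
    for i, center in enumerate(tokens):
        # every word within m positions of the center (excluding the center itself)
        for j in range(max(0, i - m), min(len(tokens), i + m + 1)):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(window_pairs("the cat sat on the mat".split(), m=2))
# [('the', 'cat'), ('the', 'sat'), ('cat', 'the'), ('cat', 'sat'), ('cat', 'on'), ...]
```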

  13. Notation: w = center-word representation, c = context-word representation (the probability formula shown on this slide is reconstructed below).
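
Only the labels w and c survive here. The standard Word2Vec softmax for the probability of a context word given a center word, which is presumably what the slide illustrated, is (a reconstruction stated as an assumption, not text recovered from the slide):

```latex
% Reconstruction (an assumption, not recovered from the slide):
% probability of the context word c given the center word w
P(c \mid w) = \frac{\exp(c^{\top} w)}{\sum_{c' \in V} \exp(c'^{\top} w)}
```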

  14. Notation: w = center-word representation, c = context-word representation. Negative sampling: instead of normalizing over the full vocabulary, contrast each observed (center, context) pair with a few randomly sampled negative words (a sketch of the resulting loss follows).
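
A minimal Python sketch of the per-pair negative-sampling loss, assuming the embeddings are plain numpy vectors; the function name and the way negatives are supplied are illustrative choices, not from the slides:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(w, c_pos, c_negs):
    """Loss for one (center, context) pair.
    w      : embedding of the center word
    c_pos  : embedding of the observed context word
    c_negs : embeddings of k randomly sampled 'negative' words"""
    loss = -np.log(sigmoid(c_pos @ w))                             # pull the true pair together
    loss -= sum(np.log(sigmoid(-c_neg @ w)) for c_neg in c_negs)   # push sampled negatives apart
    return loss
```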

  15. Word2Vec. Two algorithms: 1. Continuous Bag of Words (CBOW): predict the center word from a bag-of-words context. 2. Skip-gram (SG): predict the context words given the center word; positions are treated independently (the model does not account for distance from the center). Two training methods: 1. Hierarchical softmax. 2. Negative sampling. Reference: Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, Jeffrey Dean: Distributed Representations of Words and Phrases and their Compositionality. NIPS 2013: 3111-3119. (A usage sketch follows.)
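
These choices map directly onto library parameters; for example, with gensim (a sketch assuming gensim 4.x; the tiny corpus below is made up for illustration):

```python
from gensim.models import Word2Vec

# A tiny toy corpus (illustrative only); real training needs far more data.
sentences = [["the", "cat", "sat", "on", "the", "mat"],
             ["the", "dog", "sat", "on", "the", "rug"]]

# sg=1 selects skip-gram (sg=0 is CBOW); hs=1 would use hierarchical softmax,
# while hs=0 together with negative=5 trains with negative sampling (5 noise words).
model = Word2Vec(sentences, vector_size=50, window=2, sg=1, hs=0, negative=5, min_count=1)

vector = model.wv["cat"]   # learned embedding of a word that occurs in the corpus
```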

  16. [Diagram: the two prediction tasks, CBOW (fill in the missing center word from the surrounding words) vs. skip-gram (fill in the surrounding words from the center word)]

  18. One-hot vectors: each word is represented as a vector in R^(|V| x 1) that has a 1 at the position assigned to that word and 0s everywhere else, so every vocabulary word gets a distinct indicator vector.

  19. Given the embedding matrix W, the embedding of the i-th word is obtained by a lookup/projection: v_i = W x_i, where x_i is the one-hot (indicator) vector with all 0s except a 1 at position i, so the product simply selects the vector stored for word i in W (a numpy sketch follows).
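
A small numpy sketch of the lookup, using an arbitrary vocabulary size and embedding dimension and storing one embedding per row (the row-versus-column orientation is a convention choice, not fixed by the slide):

```python
import numpy as np

V, N = 10, 4                        # toy vocabulary size and embedding dimension
W = np.random.randn(V, N)           # embedding matrix, one row per word

i = 2
x_i = np.zeros(V)
x_i[i] = 1.0                        # one-hot (indicator) vector for word i
v_i = x_i @ W                       # multiplying by the one-hot vector ...
assert np.allclose(v_i, W[i])       # ... just looks up row i of W
```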

  20. CBOW. Notation: |V| = number of words, N = size of the embedding, m = size of the window (context). Use a window of context words to predict the center word. Input: the 2m context words; output: the center word; each is represented as a one-hot vector.

  21. CBOW uses a window of context words to predict the center word and learns two matrices (two embeddings per word: one used when the word is a context word, one when it is the center word). W (|V| x N): the context embeddings, used on the input side; the i-th entry is the embedding of the i-th word as a context word. W' (N x |V|): the center embeddings, used on the output side; the i-th entry is the embedding of the i-th word as the center word.

  22. CBOW intuition: the W'-embedding of the center word should be similar to the (sum of the) W-embeddings of its context words. We want the similarity to be close to 1 for the center word and close to 0 for all other words.

  23. CBOW, given window size m; x^(j) is the one-hot vector of the word at position j and y is the one-hot vector of the center word.
     1. INPUT: the one-hot vectors of the 2m context words x^(c-m), ..., x^(c-1), x^(c+1), ..., x^(c+m).
     2. GET THE EMBEDDINGS of the context words by looking them up in W: v_(c-m), ..., v_(c-1), v_(c+1), ..., v_(c+m).
     3. TAKE THE SUM (average) of these vectors: h = (v_(c-m) + ... + v_(c+m)) / 2m, with h in R^N.
     4. COMPUTE SIMILARITY: take the dot product of h with every center vector in W', giving a score vector z of length |V|.
     5. Turn the score vector into probabilities: y-hat = softmax(z). We want this to be close to 1 for the center word (a numpy sketch of this forward pass follows).
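
A minimal numpy sketch of this forward pass, with W of shape |V| x N for the context (input) embeddings and W' (here Wp) of shape N x |V| for the center (output) embeddings, as on slide 21; sizes and values are placeholders:

```python
import numpy as np

def cbow_forward(context_ids, W, Wp):
    """W  : |V| x N context (input) embeddings, one row per word.
    Wp : N x |V| center (output) embeddings, one column per word.
    context_ids : indices of the 2m context words."""
    h = W[context_ids].mean(axis=0)        # steps 2-3: look up and average the context embeddings
    z = h @ Wp                             # step 4: one similarity score per vocabulary word
    z -= z.max()                           # numerical stability for the exponentials
    y_hat = np.exp(z) / np.exp(z).sum()    # step 5: softmax over the vocabulary
    return y_hat                           # training pushes the true center word's entry toward 1

V, N = 10, 4
W, Wp = np.random.randn(V, N), np.random.randn(N, V)
print(cbow_forward([1, 3, 5, 7], W, Wp))
```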

  25. [Diagram: CBOW example. The context words "cat" and "on" enter as V-dimensional one-hot vectors (a 1 at each word's index in the vocabulary), pass through a hidden layer, and the output layer predicts the center word "sat", also as a one-hot vector]

  26. We must learn W and W'. [Diagram: the one-hot inputs for "cat" and "on" (V-dim) are mapped by W to the N-dimensional hidden layer and by W' to the V-dimensional output predicting "sat"; N is the size of the word vectors]

  27. [Worked example: multiplying the input matrix by the one-hot vector x_cat picks out the embedding of "cat"; the hidden vector is the average h = (v_cat + v_on) / 2]

  28. [Worked example, continued: multiplying the input matrix by the one-hot vector x_on picks out the embedding of "on"; again h = (v_cat + v_on) / 2]

  29. [Diagram: the hidden vector h is scored against the center vectors in W' to give z, and y-hat = softmax(z); y-hat_sat is the predicted probability of the true center word "sat"]

  30. We would prefer y-hat to be close to y_sat. [Diagram: example output probabilities after the softmax, small values (around 0.00-0.02) everywhere except roughly 0.7 at the position of "sat"]

  31. The matrices contain the word vectors: we can take either W (context) or W' (center) as a word's representation, or even take the average of the two. [Diagram: the learned matrices in the CBOW network]

  32. Skip-gram: given the center word, predict (or generate) the context words. Input: the center word; output: the 2m context words, each represented as a one-hot vector. Learn two matrices: W (N x |V|), the input matrix, holding each word's representation as a center word, and W' (|V| x N), the output matrix, holding each word's representation as a context word.

  34. Skip-gram, step by step; y^(j) is the one-hot vector of a context word.
     1. INPUT: the one-hot vector x of the center word.
     2. GET THE EMBEDDING of the center word: v_c = W x.
     3. COMPUTE a score vector: z = W' v_c.
     4. Turn the score vector into probabilities: y-hat = softmax(z). We want this to be close to 1 for each of the context words (a numpy sketch follows).
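
The corresponding skip-gram forward pass as a minimal numpy sketch, with W of shape N x |V| and W' (here Wp) of shape |V| x N as on slide 32; sizes and random values are placeholders:

```python
import numpy as np

def skipgram_forward(center_id, W, Wp):
    """W  : N x |V| center (input) embeddings, one column per word.
    Wp : |V| x N context (output) embeddings, one row per word.
    center_id : index of the center word."""
    v_c = W[:, center_id]                  # step 2: embedding of the center word
    z = Wp @ v_c                           # step 3: one score per vocabulary word
    z -= z.max()                           # numerical stability
    y_hat = np.exp(z) / np.exp(z).sum()    # step 4: softmax; context words should score high
    return y_hat

V, N = 10, 4
W, Wp = np.random.randn(N, V), np.random.randn(V, N)
print(skipgram_forward(3, W, Wp))
```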

  37. These representations are very good at encoding similarity and dimensions of similarity! Analogies testing dimensions of similarity can be solved quite well just by doing vector subtraction in the embedding space. Syntactically: x_apple - x_apples ~ x_car - x_cars ~ x_family - x_families, and similarly for verb and adjective morphological forms. Semantically: x_shirt - x_clothing ~ x_chair - x_furniture, and x_king - x_man ~ x_queen - x_woman (an example query follows).
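
Such analogy queries can be run directly on pre-trained vectors, for example with gensim's KeyedVectors; the file name below assumes a locally downloaded copy of the Google News vectors, which is an assumption and not something the slide specifies:

```python
from gensim.models import KeyedVectors

# The file name assumes a locally downloaded copy of the pre-trained Google News
# vectors; any word2vec-format file would work the same way.
wv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

# king - man + woman ~ queen, found purely by vector arithmetic plus cosine similarity
print(wv.most_similar(positive=["king", "woman"], negative=["man"], topn=3))
```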

  38. Word embeddings can also improve language translation: a bilingual embedding space (Chinese words in green, English words in yellow) is obtained by aligning the word embeddings of the two languages. [Figure]

  39. End of lecture. References: Christopher Manning and Pandu Nayak, CS276: Information Retrieval and Web Search, Lecture 14: Distributed Word Representations for Information Retrieval; Jordan Boyd-Graber, Natural Language Processing course, UMD; Analytics Vidhya, https://www.analyticsvidhya.com/blog/2017/06/word-embeddings-count-word2veec/; Chris McCormick, Word2Vec Tutorial: The Skip-Gram Model, http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/
