Word Vectors and Training with Gensim

 
Word2vec Tutorial
 
Zhe Ye
yezhejack@sjtu.edu.cn
2018.9.27
 
Outline
 
One-hot representation vs word vectors
Requirement
Virtual environment
Python
Corpus
Gensim
Training word vectors
Evaluation
Analogy
Word clustering
 
One-hot representation vs word vectors
 
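One-hot representation
Sparse: uses 3000K dimensions to represent a vocabulary of 3000K word types
Unrelated: any two distinct words have orthogonal vectors, so no similarity is captured
Word vectors
Model statistical information from the corpus
Dense: use 300 (or fewer) dimensions to represent a vocabulary of 3000K word types
Related: words used in similar contexts get similar vectors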
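
The contrast above can be sketched in a few lines of Python. This is only an illustration: the three-word vocabulary and the two-dimensional dense vectors are hand-picked assumptions, not trained values.

```python
import math

def one_hot(index, vocab_size):
    """Return a vector of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vocabulary (assumption for illustration only).
vocab = ["king", "queen", "apple"]
king_one_hot = one_hot(0, len(vocab))
queen_one_hot = one_hot(1, len(vocab))

# Hand-picked dense vectors, NOT the output of a trained model.
dense = {
    "king": [0.9, 0.8],
    "queen": [0.8, 0.9],
    "apple": [-0.7, 0.1],
}

# Distinct one-hot vectors are always orthogonal: similarity is exactly 0.
one_hot_sim = cosine(king_one_hot, queen_one_hot)
# Dense vectors can place related words close together.
dense_sim = cosine(dense["king"], dense["queen"])

print("one-hot similarity:", one_hot_sim)
print("dense similarity:", round(dense_sim, 3))
```

With one-hot vectors the similarity stays at zero no matter which two words are compared, while dense vectors leave room for graded similarity.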
 
Virtual environment
 
Features
Isolates each project's dependency libraries
Installs packages without admin or sudo rights
Two popular tools that provide these features
Anaconda
Convenient on Windows (e.g. for installing SciPy)
Virtualenv
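
Before installing packages, it can help to confirm that an isolated environment is actually active. The check below is a rough sketch using only the standard library; the function name and the returned labels are made up for this example, and the conda check relies on the `CONDA_DEFAULT_ENV` variable that conda sets on activation.

```python
import os
import sys

def active_environment():
    """Return a short label for the active isolated environment, or None."""
    # venv and recent virtualenv releases point sys.base_prefix at the
    # base interpreter, so it differs from sys.prefix inside an environment.
    if sys.prefix != getattr(sys, "base_prefix", sys.prefix):
        return "virtualenv/venv"
    # Older virtualenv releases set sys.real_prefix instead.
    if hasattr(sys, "real_prefix"):
        return "virtualenv (legacy)"
    # conda exports CONDA_DEFAULT_ENV when an environment is activated.
    if os.environ.get("CONDA_DEFAULT_ENV"):
        return "conda"
    return None

print(active_environment() or "no isolated environment detected")
```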
 
Python
 
It’s very popular in NLP
It’s very simple and easy to understand
Version: Python 3.6
 
Corpus
 
Tokenized plain text
我们很高兴 -> 我们 很 高兴 (Chinese for "We are very happy")
We are very happy. -> We are very happy .
Tokenized plain text resource
http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz
Tokenizer
LTP for Chinese (https://github.com/HIT-SCIR/ltp)
Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.shtml)
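
To make the target format concrete, here is a naive regex tokenizer that reproduces the English example above. It is a stand-in for illustration, not the Stanford Tokenizer or LTP: it only separates punctuation from words and does not segment Chinese text, which needs a dedicated tool such as LTP. The function names are hypothetical.

```python
import re

# A run of word characters, or a single non-space punctuation character.
TOKEN_PATTERN = re.compile(r"\w+|[^\w\s]")

def naive_tokenize(text):
    """Split text into word and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

def to_corpus_line(text):
    """Render a sentence as one space-separated line of tokens."""
    return " ".join(naive_tokenize(text))

print(to_corpus_line("We are very happy."))
```

A training corpus is then simply a text file with one such line per sentence.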
 
Training data
 
Gensim
 
A Python library that includes an implementation of word2vec
(original C tool: https://code.google.com/archive/p/word2vec/)
Provides a Python API
 
Training word vectors
 
Linux+virtualenv+gensim (recommended)
Windows 10 (64bit) + anaconda+gensim
 
 
 
 
Using word vectors
 
vec(‘China’) = {0.1502911, 0.19706184, -0.13560876, ..., 0.12463704}
 
Evaluation
 
Most similar words
China: Chinese, Beijing, Taiwan, Shanghai, Guangdong, Hainan, Hong_Kong, Shenzhen
king: kings, queen, monarch, crown_prince, prince, sultan, ruler, princes, throne
wonderful: marvelous, fantastic, great, fabulous, terrific, lovely, amazing, beautiful, magnificent, delightful
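
Lists like these come from ranking every other word by cosine similarity to the query word. In gensim the corresponding call is `model.wv.most_similar('king')`; the pure-Python sketch below shows the idea on hand-picked toy vectors, which are assumptions for illustration and not trained values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors, topn=3):
    """Rank all other words by cosine similarity to `word`."""
    query = vectors[word]
    scored = [
        (other, cosine(query, vec))
        for other, vec in vectors.items()
        if other != word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:topn]

# Toy vectors chosen by hand for this example.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "ruler": [0.7, 0.6, 0.3],
    "apple": [-0.5, 0.1, 0.9],
}

for neighbor, score in most_similar("king", toy_vectors, topn=2):
    print(neighbor, round(score, 3))
```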
 
 
Evaluation
 
Analogy
vector(‘Beijing’) - vector(‘China’) ≈ vector(‘Paris’) - vector(‘France’)
vector(‘Beijing’) - vector(‘China’) + vector(‘France’) ≈ vector(‘Paris’)
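
The analogy test can be sketched as vector arithmetic followed by a nearest-neighbor search. In gensim the corresponding call is `model.wv.most_similar(positive=['Beijing', 'France'], negative=['China'], topn=1)`. The version below is a simplified pure-Python illustration; the toy vectors are hand-built so that the third coordinate loosely encodes "is a capital", which is an assumption and not a property of trained vectors.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def analogy(a, b, c, vectors):
    """Return the word closest to vec(a) - vec(b) + vec(c).

    The three input words are excluded from the candidates.
    """
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

# Hand-built toy vectors: [China-related, France-related, capital-ness].
toy_vectors = {
    "China": [1.0, 0.0, 0.0],
    "Beijing": [1.0, 0.0, 1.0],
    "Shanghai": [1.0, 0.0, 0.2],
    "France": [0.0, 1.0, 0.0],
    "Paris": [0.0, 1.0, 1.0],
    "Lyon": [0.0, 1.0, 0.2],
}

print(analogy("Beijing", "China", "France", toy_vectors))
```

With trained vectors the equality holds only approximately, which is why the answer is taken as the nearest neighbor of the computed point rather than an exact match.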
 
