Understanding Word Vectors and Training with Gensim

Explore the differences between one-hot representations and word vectors, learn how to set up a Python virtual environment for training word vectors, and walk through evaluation by analogy and word clustering. The presentation covers tools such as Anaconda and Virtualenv, the Gensim library, and the role of training data and tokenization in corpus preparation.



Presentation Transcript


  1. Word2vec Tutorial. Zhe Ye, yezhejack@sjtu.edu.cn, 2018.9.27

  2. Outline: One-hot representation vs. word vectors; Requirements (virtual environment, Python, corpus, Gensim); Training word vectors; Evaluation (analogy, word clustering)

  3. One-hot representation vs. word vectors. One-hot representation: sparse (about 3,000K dimensions to represent a vocabulary of 3,000K word types), and the dimensions are unrelated to each other. Word vectors: model statistical information, are dense (300 dimensions or fewer to represent a vocabulary of 3,000K word types), and related words get related vectors.
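
A minimal illustration (not from the slides) of the two representations; the toy vocabulary and the random dense vector below are stand-ins for a real vocabulary and a trained vector:

import numpy as np

# Illustrative toy vocabulary (a real one has on the order of 3,000K word types).
vocab = ["we", "are", "very", "happy", "."]

# One-hot: one dimension per word type, a single 1 and zeros elsewhere;
# any two different word vectors are orthogonal, so they encode no relatedness.
one_hot = np.zeros(len(vocab))
one_hot[vocab.index("happy")] = 1.0

# Dense word vector: a small fixed number of real-valued dimensions (e.g. 300),
# learned so that related words get similar vectors.
dense = np.random.randn(300) * 0.01   # placeholder for a learned vector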

  4. Outline: One-hot representation vs. word vectors; Requirements (virtual environment, Python, corpus, Gensim); Training word vectors; Evaluation (analogy, word clustering)

  5. Virtual environment. Features: provides a separate set of dependency libraries, and does not require admin or sudo rights to install packages. Two popular tools that provide these features: Anaconda (convenient on Windows, e.g. for scipy) and Virtualenv.

  6. Python. It is very popular in NLP, and it is simple and easy to understand. Version: Python 3.6.

  7. Corpus. Tokenized plain text: "We are very happy." becomes "We are very happy ." (tokens, including punctuation, separated by spaces). Tokenized plain-text resource: http://www.statmt.org/lm-benchmark/1-billion-word-language-modeling-benchmark-r13output.tar.gz. Tokenizers: LTP for Chinese (https://github.com/HIT-SCIR/ltp) and the Stanford Tokenizer (https://nlp.stanford.edu/software/tokenizer.shtml).
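
A sketch (not from the slides) of reading such a tokenized corpus with gensim; the file name is a hypothetical placeholder for one of the benchmark files above, with one space-separated sentence per line:

from gensim.models.word2vec import LineSentence

sentences = LineSentence("news.tokenized.txt")   # hypothetical corpus file
for tokens in sentences:
    print(tokens)   # e.g. ['We', 'are', 'very', 'happy', '.']
    break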

  8. Training data

  9. Gensim. Implements a wrapper for word2vec (https://code.google.com/archive/p/word2vec/) and provides a Python API.

  10. Outline: One-hot representation vs. word vectors; Requirements (virtual environment, Python, corpus, Gensim); Training word vectors; Evaluation (analogy, word clustering)

  11. Training word vectors. Recommended setup: Linux + virtualenv + gensim. Alternative: Windows 10 (64-bit) + Anaconda + gensim.

  12. Training word vectors
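
The slide's code is not preserved in the transcript; below is a minimal training sketch with gensim under the setup above (note that gensim 4.x uses vector_size=, while the 2018-era gensim 3.x called this parameter size=):

from gensim.models import Word2Vec
from gensim.models.word2vec import LineSentence

sentences = LineSentence("news.tokenized.txt")   # hypothetical tokenized corpus
model = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the word vectors
    window=5,          # context window size
    min_count=5,       # ignore words seen fewer than 5 times
    workers=4,         # parallel training threads
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
model.save("word2vec.model")                    # full model, training can resume
model.wv.save_word2vec_format("vectors.txt")    # plain-text word vectors only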

  13. Using word vectors. vec('China') = {0.1502911, 0.19706184, -0.13560876, ..., 0.12463704}
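
A sketch of looking up such a vector, assuming the model file saved in the training sketch above:

from gensim.models import Word2Vec

model = Word2Vec.load("word2vec.model")
vec = model.wv["China"]   # a 300-dimensional numpy array
print(vec[:4])            # first few components, as shown on the slide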

  14. Outline: One-hot representation vs. word vectors; Requirements (virtual environment, Python, corpus, Gensim); Training word vectors; Evaluation (word clustering, analogy)

  15. Evaluation: most similar words. China: Chinese, Beijing, Taiwan, Shanghai, Guangdong, Hainan, Hong_Kong, Shenzhen. king: kings, queen, monarch, crown_prince, prince, sultan, ruler, princes, throne. wonderful: marvelous, fantastic, great, fabulous, terrific, lovely, amazing, beautiful, magnificant, delightful.
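
These neighbour lists come from nearest-neighbour queries by cosine similarity; a sketch with gensim, assuming the model loaded above:

print(model.wv.most_similar("China"))
print(model.wv.most_similar("king"))
print(model.wv.most_similar("wonderful"))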

  16. Evaluation: analogy. vector('Beijing') - vector('China') ≈ vector('Paris') - vector('France'), hence vector('Beijing') - vector('China') + vector('France') ≈ vector('Paris').
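
The same analogy expressed as a gensim query (a sketch, assuming the model loaded above):

# vector('Beijing') - vector('China') + vector('France') should land near vector('Paris').
result = model.wv.most_similar(positive=["Beijing", "France"], negative=["China"], topn=1)
print(result)   # expected: [('Paris', <cosine similarity>)]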
