Deep Learning Applications in Biotechnology: Word2Vec and Beyond

 
CS 886 Deep Learning for Biotechnology
Ming Li
 
Jan. 2022
 
Prelude: our world
 
What do they have in common?
 
0
 
These are all forms of life, encoded by DNA/RNA ranging from 30k bases to 3B bases and more.
Genes encoded in the DNA are translated into thousands of proteins, hence life.
 
Deep Learning
 
Since its invention, deep learning has changed many
research fields: speech recognition, image processing,
natural language processing, autonomous driving,
industrial control, and especially biotechnology
(for example, protein structure prediction). In this
class, we will review applications of deep learning in
biotechnology. The first few lectures cover the
necessary deep learning background.

01. Word2Vec
02. Attention / Transformer
03. Pretraining: GPT and BERT
04. Deep learning applications in proteomics
05. Student presentations begin
 
01
LECTURE ONE
 
Word2Vec: from discrete to
continuous space
 
Things you need to know:
 
01
 
Dot product:
        a · b = ||a|| ||b|| cos(θ_ab)
              = a_1 b_1 + a_2 b_2 + … + a_n b_n
From this one can derive the cosine similarity:
        cos θ_ab = (a · b) / (||a|| ||b||)
 
Softmax Function:
If we take an input of [1,2,3,4,1,2,3], the softmax of that is
[0.024, 0.064, 0.175, 0.475, 0.024, 0.064, 0.175].
The softmax function highlights the largest values and
suppresses the other values, so that the outputs are all
positive and sum to 1.
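
As a quick sanity check, here is a minimal NumPy sketch (not from the lecture) that implements both formulas and reproduces the softmax example above.

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta_ab) = (a . b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def softmax(x):
    # Subtract the max for numerical stability; outputs are positive and sum to 1.
    z = np.exp(x - np.max(x))
    return z / z.sum()

print(cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 1.0])))       # ~0.707
print(np.round(softmax(np.array([1, 2, 3, 4, 1, 2, 3], dtype=float)), 3))
# -> [0.024 0.064 0.175 0.475 0.024 0.064 0.175]
```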
 
 
Word2Vec
Transforming from discrete to “continuous” space
01
 
1. Calculus (for computing the space covered by a curve)
2. Word2Vec (for computing the "space" covered by the meaning of a word, or a gene, or a protein)
Word2Vec
Traditional representation of a word's meaning
01
 
1. Dictionary (or PDB): not too useful in computational linguistics research.
2. WordNet (cf. protein networks): a graph of words, with relationships like "is-a" and synonym sets.
   Problems: it depends on human labeling (hence misses a lot), and the process is hard to automate.
3. These approaches all use atomic symbols: hotel, motel, …, each equivalent to a one-hot vector:
           Hotel:    [0,0,0,0,0,0,0,0,1,0,0,0,0,0]
           Motel:    [0,0,0,0,1,0,0,0,0,0,0,0,0,0]
   These are called one-hot representations. They are very long: 13M dimensions for the Google web crawl vocabulary.
   Example: an inverted index.
Word2Vec
Problems.
01

One-hot vectors give no natural notion of similarity: hotel · motelᵀ = 0.
They are also very long: about 20 thousand words for daily speech, 50k for machine translation, 20 thousand proteins, millions of species, and 13M words in the Google web crawl.
Word2Vec
 
How do we solve the problem? This is what Newton and Leibniz did
for calculus:
 
01
 
Word2Vec
 
when #rectangles → ∞, the total area of the rectangles approaches the area under the curve.
Let's do something similar:
01
 
1. Use a lot of "rectangles", i.e. a vector of numbers, to approximate the meaning of a word.
2. What represents the meaning of a word?
         "You shall know a word by the company it keeps" – J.R. Firth, 1957
   Thus, our goal is to assign each word a vector such that similar words have similar vectors (by dot product).
 
We will believe J.R. Firth and use a neural network to train (low-dimensional) vectors
such that each time two words appear together in a text, their vectors get slightly
closer. This lets us use a massive corpus without annotation! Thus, we will scan
through the training data by looking at a window of 2d+1 words at a time: given a
center word, we try to predict the d words on its left and the d words on its right
(see the sketch below).
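
A minimal sketch of that windowed scan (the helper name and toy sentence are illustrative, not from the lecture):

```python
def skipgram_pairs(tokens, d=2):
    """Yield (center, context) pairs from a tokenized corpus with window d on each side."""
    for t, center in enumerate(tokens):
        for offset in range(-d, d + 1):
            if offset == 0:
                continue
            j = t + offset
            if 0 <= j < len(tokens):
                yield center, tokens[j]

corpus = "you shall know a word by the company it keeps".split()
print(list(skipgram_pairs(corpus, d=2))[:5])
# [('you', 'shall'), ('you', 'know'), ('shall', 'you'), ('shall', 'know'), ('shall', 'a')]
```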
Word2Vec
To design a neural network for this:
01
 
More specifically, for a center word w_t and "context words" w_{t'} within a window of
some fixed size, say 5 (t' = t-1, …, t-5, t+1, …, t+5), we use a neural network to
predict all w_{t'}, i.e. to maximize:

        p(w_{t'} | w_t) = …

This gives a loss function of the form

        L = 1 – p(w_{t'} | w_t)

By looking at many positions in a big corpus and continually adjusting these vectors to
minimize this loss, we arrive at a (low-dimensional) vector approximation of the meaning
of each word, in the sense that if two words often occur in close proximity, we consider
them similar.
 
Word2Vec
To design a neural network for this:
01
 
Thus the objective is to maximize the probability of every context word given the current center word:

        L'(θ) = Π_{t=1..T} Π_{d=-5..-1,1..5} P(w_{t+d} | w_t, θ)

where θ is the set of all variables we optimize (i.e. the vectors) and T = |training text|.
Taking the negative logarithm (and averaging per word) gives a quantity we can minimize:

        L(θ) = -(1/T) Σ_{t=1..T} Σ_{d=-5..-1,1..5} log P(w_{t+d} | w_t)

Then what is P(w_{t+d} | w_t)? We can approximate it by taking the dot product of the two
word vectors and then a softmax; letting v be the vector for word w:

        L(θ) ≈ -(1/T) Σ_{t=1..T} Σ_{d=-5..-1,1..5} log Softmax(v_{t+d} · v_t)
Word2Vec
To design a neural network for this:
01
 
From the last slide:

        L(θ) ≈ -(1/T) Σ_{t=1..T} Σ_{d=-5..-1,1..5} log Softmax(v_{t+d} · v_t)

The softmax for a center word c and a context/outside word o is

        Softmax(v_o · v_c) = e^(v_o · v_c) / Σ_{k=1..V} e^(v_k · v_c)

Note that the index k runs over the dictionary of size V, not the whole text T.
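
A minimal NumPy sketch of this full softmax (toy sizes; it uses the two-vector convention, u for context and v for center, introduced on the following slides):

```python
import numpy as np

V, N = 1000, 50                                  # vocabulary size, vector dimension
rng = np.random.default_rng(0)
Vmat = rng.normal(scale=0.1, size=(V, N))        # center vectors v
Umat = rng.normal(scale=0.1, size=(V, N))        # context/outside vectors u

def log_p_context_given_center(o, c):
    # log Softmax(u_o . v_c) = u_o . v_c - log sum_k exp(u_k . v_c)
    scores = Umat @ Vmat[c]                      # dot product with every word in the dictionary
    scores -= scores.max()                       # numerical stability
    return scores[o] - np.log(np.exp(scores).sum())

print(-log_p_context_given_center(o=42, c=7))    # one -log P term; note the O(V) sum over the dictionary
```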
Word2Vec
 
Negative Sampling in Word2Vec
 
01
 
Word2Vec
 
In our objective

        L(θ) ≈ -(1/T) Σ_{t=1..T} Σ_{d=-5..-1,1..5} log Softmax(v_{t+d} · v_t), where
        Softmax(v_o · v_c) = e^(v_o · v_c) / Σ_{k=1..V} e^(v_k · v_c),

each time we have to calculate Σ_{k=1..V} e^(v_k · v_c), which is too expensive.
To overcome this, we use negative sampling. The objective becomes L(θ) = -(1/T) Σ_{t=1..T} L_t(θ), with

        L_t(θ) = log σ(u_o^T v_c) + Σ_{i=1..k} E_{j~P(w)} [log σ(-u_j^T v_c)]
               = log σ(u_o^T v_c) + Σ_{j~P(w)} [log σ(-u_j^T v_c)]

where the sigmoid function σ(x) = 1/(1+e^(-x)) is treated as a probability. That is, we
maximize the first term while taking k (e.g. k=10) random negative samples in the second term.
For sampling, we can use the unigram distribution U(w), or U(w)^(3/4) to boost rare words.
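
A minimal NumPy sketch of one negative-sampling term (toy vectors and counts; names are illustrative, not from the lecture code):

```python
import numpy as np

rng = np.random.default_rng(0)
V, N = 1000, 50
Umat = rng.normal(scale=0.1, size=(V, N))        # context vectors u
Vmat = rng.normal(scale=0.1, size=(V, N))        # center vectors v
counts = rng.integers(1, 100, size=V)            # toy unigram counts

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_term(o, c, k=10):
    """L_t = log sigma(u_o^T v_c) + sum over k sampled j of log sigma(-u_j^T v_c)."""
    p = counts.astype(float) ** 0.75             # unigram^(3/4) sampling distribution
    p /= p.sum()
    negatives = rng.choice(V, size=k, p=p)
    pos = np.log(sigmoid(Umat[o] @ Vmat[c]))
    neg = np.log(sigmoid(-(Umat[negatives] @ Vmat[c]))).sum()
    return pos + neg                             # we maximize this; the loss sums its negative

print(-neg_sampling_term(o=42, c=7))             # cost O(k*N) per position instead of O(V*N)
```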
The skip-gram model
01
Word2Vec
 
Vocabulary size: V
Input layer: the center word in one-hot form.
The k-th row of W (V×N) is the center vector of the k-th word.
The k-th column of W' (N×V) is the context vector of the k-th word in V. Note that each word has 2 vectors, both randomly initialized.
Each output column y_{ij}, i = 1..C, is computed in 3 steps:
      1) use the context word's one-hot vector to choose its column in W' (N×V);
      2) take the dot product with h_i, the hidden vector of the center word;
      3) compute the softmax.
C = context window size
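
A minimal NumPy sketch of this forward pass (toy sizes; W and W_prime play the roles of the two weight matrices above):

```python
import numpy as np

V, N = 10, 4                                   # toy vocabulary and embedding sizes
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))         # row k = center vector of word k
W_prime = rng.normal(scale=0.1, size=(N, V))   # column k = context vector of word k

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

center = 3                                     # index of the center word
x = np.zeros(V); x[center] = 1.0               # one-hot input layer
h = x @ W                                      # hidden layer = W's row for the center word
scores = h @ W_prime                           # dot product with every context vector
y = softmax(scores)                            # predicted distribution over context words
print(y.round(3), y.sum())                     # the same y is scored against each of the C context positions
```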
 
The Training of θ
 
01
 
Word2Vec
 
We will train both W (V×N) and W' (N×V), i.e. compute all vector gradients.
Thus θ lives in R^(2NV), where N is the vector size and V is the number of words.
We need ∂L(θ)/∂v for every vector v in θ:

        θ = [ v_aardvark, v_a, …, v_zebra, u_aardvark, u_a, …, u_zebra ]ᵀ  ∈  R^(2NV)
 
Gradient Descent
 
01
 
Word2Vec
 
        θ_new = θ_old – α ∂L(θ_old)/∂θ_old

Stochastic gradient descent (SGD): just do one position (one center word and its context words) at a time.

        θ = [ v_aardvark, v_a, …, v_zebra, u_aardvark, u_a, …, u_zebra ]ᵀ  ∈  R^(2NV)
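
A minimal NumPy sketch of one SGD update for a single (center, context) position under the negative-sampling loss, using its analytic gradients (variable names and the learning rate are illustrative, not from the lecture code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgd_update(c, o, negatives, Umat, Vmat, alpha=0.025):
    """theta_new = theta_old - alpha * grad, for the parameters touched by one
    (center c, context o) pair and its sampled negatives."""
    v_c = Vmat[c].copy()                              # old value of the center vector
    g_o = sigmoid(Umat[o] @ v_c) - 1.0                # gradient factor for the positive pair
    g_neg = sigmoid(Umat[negatives] @ v_c)            # gradient factors for the negatives
    grad_v = g_o * Umat[o] + g_neg @ Umat[negatives]  # d(-L_t)/d v_c
    Umat[o] -= alpha * g_o * v_c                      # d(-L_t)/d u_o = g_o * v_c
    Umat[negatives] -= alpha * np.outer(g_neg, v_c)   # d(-L_t)/d u_j = g_j * v_c
    Vmat[c] -= alpha * grad_v

V, N = 1000, 50
rng = np.random.default_rng(0)
Umat = rng.normal(scale=0.1, size=(V, N))
Vmat = rng.normal(scale=0.1, size=(V, N))
sgd_update(c=7, o=42, negatives=rng.integers(0, V, size=10), Umat=Umat, Vmat=Vmat)
```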
 
CBOW
 
01
 
Word2Vec
 
What if we predict the center word given the context words, the opposite of the skip-gram model?
Yes, this is called the Continuous Bag Of Words (CBOW) model in the original Word2Vec paper.
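
A minimal NumPy sketch of the CBOW forward pass, for contrast with the skip-gram sketch above (the hidden layer averages the context vectors and predicts the center word):

```python
import numpy as np

V, N = 10, 4
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(V, N))         # input-side vectors
W_prime = rng.normal(scale=0.1, size=(N, V))   # output-side vectors

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

context = [1, 2, 4, 5]                         # indices of the surrounding words
h = W[context].mean(axis=0)                    # average the context vectors
p_center = softmax(h @ W_prime)                # distribution over the center word
print(p_center.argmax(), p_center.round(3))
```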
 
Results
 
01
 
Word2Vec
 
Results
 
01
 
Word2Vec
 
More realistic data – not everything is perfect
 
01
 
Word2Vec
An interesting application in materials science
01
Word2Vec

Nature, July 2019: V. Tshitoyan et al., "Unsupervised word embeddings capture
latent knowledge from materials science literature".
Lawrence Berkeley Lab materials scientists applied word embeddings to 3.3 million
scientific abstracts published between 1922 and 2018. V = 500k; vector size: 200
dimensions; skip-gram model.
With no explicit insertion of chemical knowledge, the embeddings captured things like
the periodic table and structure-property relationships in materials:
              ferromagnetic − NiFe + IrMn ≈ antiferromagnetic
They also identified new thermoelectric materials "years before their discovery".
Can you do something for proteins?
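
For flavor, a hedged sketch of how such an experiment might look with the gensim library (not the authors' code; the toy corpus stands in for millions of tokenized abstracts, so the output here is meaningless and only shows the API shape):

```python
from gensim.models import Word2Vec

abstracts = [
    ["NiFe", "is", "ferromagnetic", "at", "room", "temperature"],
    ["IrMn", "is", "a", "common", "antiferromagnetic", "alloy"],
    # ... millions more tokenized abstracts in the real setting
]

model = Word2Vec(
    abstracts,
    vector_size=200,   # 200-dimensional vectors, as in the paper
    sg=1,              # skip-gram
    negative=10,       # negative sampling
    window=5,
    min_count=1,
)

# ferromagnetic - NiFe + IrMn ≈ ?
print(model.wv.most_similar(positive=["ferromagnetic", "IrMn"], negative=["NiFe"], topn=3))
```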
Beyond Word2Vec
01
Word2Vec
 
Co-occurrence matrix
       Window-based co-occurrence
       Document-based co-occurrence
       Ignore very frequent words such as "the", "he", "has", …
       Weigh close proximity more …
In word2vec, every time a center word w appears again we repeat the whole process; here
all of its occurrences are processed together. Documents can also be considered.
The matrix is symmetric.
SVD decomposition of this matrix predates Word2Vec, but it is O(nm^2), too slow for large
data.
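
A minimal sketch of building a window-based co-occurrence matrix from a toy corpus (assumes NumPy):

```python
import numpy as np

corpus = "you shall know a word by the company it keeps".split()
vocab = sorted(set(corpus))
index = {w: i for i, w in enumerate(vocab)}

window = 2
X = np.zeros((len(vocab), len(vocab)), dtype=int)
for t, w in enumerate(corpus):
    for j in range(max(0, t - window), min(len(corpus), t + window + 1)):
        if j != t:
            X[index[w], index[corpus[j]]] += 1

print(vocab)
print(X)        # X is symmetric; X[i, j] counts co-occurrences of words i and j
```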
GloVe (Global vectors model)
01
Word2Vec
 
Combining the Word2Vec and co-occurrence matrix approaches. Minimize

        L(θ) = ½ Σ_{i,j=1..W} f(P_{i,j}) (u_i^T v_j – log P_{i,j})^2

where the u, v vectors are the same as before and P_{i,j} is the count of how often
words i and j co-occur. Essentially this says: the more often words i and j co-occur,
the larger the dot product u_i^T v_j should be. The weighting function f gets rid of
overly frequent co-occurrences.

What about the two vectors per word? Taking X = U + V works.
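
A minimal NumPy sketch of this objective, following the simplified form on the slide (no bias terms) and assuming the standard GloVe weighting f(x) = min(1, (x/x_max)^0.75):

```python
import numpy as np

def glove_loss(U, Vm, P, x_max=100.0, alpha=0.75):
    """L = 1/2 * sum_ij f(P_ij) * (u_i . v_j - log P_ij)^2 over nonzero P_ij."""
    i_idx, j_idx = np.nonzero(P)                  # only co-occurring pairs contribute
    p = P[i_idx, j_idx].astype(float)
    f = np.minimum(1.0, (p / x_max) ** alpha)     # down-weight very frequent pairs
    dots = np.sum(U[i_idx] * Vm[j_idx], axis=1)
    return 0.5 * np.sum(f * (dots - np.log(p)) ** 2)

W, N = 5, 3
rng = np.random.default_rng(0)
U = rng.normal(size=(W, N))
Vm = rng.normal(size=(W, N))
P = rng.integers(0, 50, size=(W, W))              # toy co-occurrence counts
print(glove_loss(U, Vm, P))
```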
Summary
01
Word2Vec
 
We have learned that representation is important: when you represent the space under a
curve by a lot of rectangles, you can approximate the curve, hence calculus; when you
represent a word by a vector, other words close in meaning can take nearby vectors,
measured by "cosine" similarity.

When a representation (short vectors) of an object allows similarity measures, we can
easily design neural networks (or other approaches) that place words with close meanings
in close vicinity.
Literature & Resources for Word2Vec
01
 
Bengio et al., 2003. A neural probabilistic language model.
Collobert & Weston, 2008. NLP (almost) from scratch.
Mikolov et al., 2013. The word2vec paper.
Pennington, Socher & Manning, 2014. The GloVe paper.
Rohde et al., 2005 (SVD paper). An improved model of semantic similarity based on lexical co-occurrence.
Plus thousands more.
Resources:
https://mccormickml.com/2016/04/27/word2vec-resources/
https://github.com/clulab/nlp-reading-group/wiki/Word2Vec-Resources
Word2Vec
Project Ideas
01
 
1. Similar to V. Tshitoyan et al.'s work, can you explore all biological literature and
   find interesting facts such as protein-protein interactions or biological name
   identity resolution?
2. Can we use word embedding to embed genomes (for example, shatter genomes into pieces
   as "words", but train one vector for each genome) and hence cluster species to build
   a phylogeny of their evolutionary history, similar to [1,2]? You can start with
   mitochondrial genomes, virus genomes (such as different strains of the COVID-19
   virus, for which you can add other factors such as geography and time), or bacterial
   genomes. (A tokenization sketch follows the references below.)
 
1. M. Li, J. Badger, X. Chen, S. Kwong, P. Kearney, H. Zhang, An information-based sequence distance and its application to whole
mitochondrial genome phylogeny, Bioinformatics, 17:2(2001), 149-154.
2. C.H. Bennett, M. Li and B. Ma, Chain letters and evolutionary histories. Scientific American, 288:6(June 2003) (feature article), 76-81.
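
For project idea 2, a hedged sketch of the genome "shattering" step: split a sequence into overlapping k-mers that can serve as "words" for embedding training (the function name and choice of k are illustrative, not from [1,2]):

```python
def kmer_words(sequence, k=6, stride=1):
    """Return overlapping k-mers of a DNA/RNA sequence as a list of 'words'."""
    sequence = sequence.upper()
    return [sequence[i:i + k] for i in range(0, len(sequence) - k + 1, stride)]

genome_fragment = "ATGGCGTACGTTAGC"
print(kmer_words(genome_fragment, k=6))
# ['ATGGCG', 'TGGCGT', 'GGCGTA', 'GCGTAC', 'CGTACG', 'GTACGT', 'TACGTT', 'ACGTTA', 'CGTTAG', 'GTTAGC']
```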
Word2Vec
 
 
02
 
Attention and Transformers
LECTURE TWO