LARGE LANGUAGE MODELS

David Kauchak
CS 159 – Spring 2023
Admin
Final project proposals due today
Start working on the projects!
Log hours that you work
Mentor hours this week?
A Single Neuron/Perceptron
Each input contributes: x_i * w_i
The neuron computes the weighted sum in = Σ_i w_i x_i and applies a threshold function g to produce the output y = g(in)
Activation functions
hard threshold: g(in) = 1 if in ≥ θ, 0 otherwise
sigmoid: g(x) = 1 / (1 + e^(-ax))
tanh(x)
Many other activation functions
Rectified Linear Unit
Softmax (for probabilities)
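To make this concrete, here is a minimal numpy sketch (not from the slides) of a single neuron: it forms the weighted sum in = Σ_i w_i x_i and passes it through one of the activation functions above. The inputs, weights, and threshold are made-up values for illustration.

```python
import numpy as np

# Activation functions from the slides
def hard_threshold(z, theta=0.0):
    return 1.0 if z >= theta else 0.0

def sigmoid(z, a=1.0):
    return 1.0 / (1.0 + np.exp(-a * z))

def relu(z):
    return max(0.0, z)

# A single neuron: weighted sum of inputs, then an activation
x = np.array([0.5, -1.0, 2.0])   # made-up inputs
w = np.array([0.8,  0.2, 0.1])   # made-up weights
z = np.dot(w, x)                  # in = sum_i w_i * x_i

print(hard_threshold(z), sigmoid(z), relu(z), np.tanh(z))
```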
Neural network
inputs
- Individual perceptrons/neurons
- Some inputs are provided/entered
- Each perceptron computes an answer
- Those answers become inputs for the next level
- Finally get the answer after all levels compute
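A short sketch of the same layer-by-layer process, with placeholder sizes and random (untrained) weights: provided inputs go into the first level of neurons, each level computes its answers, and those answers feed the next level.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer(x, W, b):
    # every neuron in the layer computes sigmoid(w . x + b)
    return sigmoid(W @ x + b)

rng = np.random.default_rng(0)
x = rng.normal(size=4)                         # provided inputs
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)  # first level of neurons
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)  # next level

h = layer(x, W1, b1)   # first level computes its answers
y = layer(h, W2, b2)   # those answers are inputs to the next level
print(y)
```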
Recurrent neural networks
inputs
hidden layer(s)
output
Recurrent neural nets
Figure 9.1 from Jurafsky and Martin
x_t = input
h_t = hidden layer output
y_t = output
Recurrent neural networks
Figure 9.2 from Jurafsky and Martin
x_t = input
h_t = hidden layer output
y_t = output
h_{t-1} = hidden layer output from previous input
Recurrent neural networks
Say you want the output for x1, x2, x3, …
Recurrent neural networks
Processing the sequence one input at a time:
x1 → h1 → y1
x2 (together with h1) → h2 → y2
x3 (together with h2) → h3 → y3
RNNs unrolled
Figure 9.2 from Jurafsky and Martin
Still just a single neural network
Figure 9.2 from Jurafsky and Martin
x_t = input
h_t = hidden layer output
y_t = output
h_{t-1} = hidden layer output from previous input
U, W and V are the weight matrices
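A minimal sketch of one recurrent step in this notation, h_t = g(U h_{t-1} + W x_t) and y_t = softmax(V h_t). The dimensions and weights below are made up; a real model would learn them.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, U, W, V):
    # h_t depends on the current input and the previous hidden state
    h_t = np.tanh(U @ h_prev + W @ x_t)
    y_t = softmax(V @ h_t)   # distribution over outputs
    return h_t, y_t

d_in, d_h, d_out = 5, 8, 5              # made-up sizes
rng = np.random.default_rng(0)
U = rng.normal(size=(d_h, d_h))
W = rng.normal(size=(d_h, d_in))
V = rng.normal(size=(d_out, d_h))

h = np.zeros(d_h)
for x_t in rng.normal(size=(3, d_in)):  # a short input sequence
    h, y = rnn_step(x_t, h, U, W, V)    # same U, W, V at every step
print(y)
```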
RNN language models
How can we use RNNs as language models p(w1, w2, …, wn)?
How do we input a word into a NN?
“One-hot” encoding
For a vocabulary of V words, have V input nodes
All inputs are 0 except for the one corresponding to the word
Example: for the word “apple”, x_t = [a: 0, apple: 1, banana: 0, …, zebra: 0]
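A sketch of one-hot encoding with a toy four-word vocabulary (the vocabulary and example word are just for illustration):

```python
import numpy as np

vocab = ["a", "apple", "banana", "zebra"]          # tiny example vocabulary
word_to_index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    # V input nodes: all zeros except the one for this word
    x = np.zeros(len(vocab))
    x[word_to_index[word]] = 1.0
    return x

print(one_hot("apple"))   # [0. 1. 0. 0.]
```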
RNN language model
V input nodes, N hidden nodes, V output nodes
RNN language model
Figure 9.6 from Jurafsky and Martin
Softmax = turn into probabilities
p(w1|<s>)
p(w2|<s> w1)
p(w3|<s> w1 w2)
p(w4|<s> w1 w2 w3)
Training RNN LM
Figure 9.6 from Jurafsky and Martin
Generation with RNN LM
Figure 9.9 from Jurafsky and Martin
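A sketch of the generation loop: start from <s>, sample a word from the softmax output, feed it back in as the next input, and repeat. The toy vocabulary and the untrained random weights are placeholders, so the output is gibberish; the point is the loop structure.

```python
import numpy as np

vocab = ["<s>", "the", "cat", "sat", "</s>"]        # toy vocabulary
V_size, d_h = len(vocab), 8
rng = np.random.default_rng(0)
U = rng.normal(size=(d_h, d_h)) * 0.1
W = rng.normal(size=(d_h, V_size)) * 0.1
Vout = rng.normal(size=(V_size, d_h)) * 0.1         # untrained, for illustration only

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def step(word_idx, h):
    x = np.zeros(V_size); x[word_idx] = 1.0          # one-hot input
    h = np.tanh(U @ h + W @ x)
    return h, softmax(Vout @ h)                      # p(next word | history)

# Autoregressive generation: sample a word, feed it back in
h, idx, out = np.zeros(d_h), vocab.index("<s>"), []
for _ in range(10):
    h, p = step(idx, h)
    idx = rng.choice(V_size, p=p)
    if vocab[idx] == "</s>":
        break
    out.append(vocab[idx])
print(" ".join(out))
```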
Stacked RNNs
Figure 9.10 from Jurafsky and Martin
Stacked RNNs
- Multiple hidden layers
- Still just a single network run over a sequence
- Allows for better generalization, but can take longer to train and need more data!
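A sketch of stacking, with placeholder weights: the hidden states produced by one RNN layer become the input sequence for the next layer.

```python
import numpy as np

def run_rnn_layer(xs, U, W):
    # run one RNN over the whole sequence, returning the hidden state at each step
    h, hs = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(0)
d_in, d_h, T = 5, 8, 4
xs = rng.normal(size=(T, d_in))                      # input sequence

layers = [(rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))),
          (rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_h)))]

seq = xs
for U, W in layers:          # outputs of one layer are the inputs to the next
    seq = run_rnn_layer(seq, U, W)
print(seq.shape)             # (T, d_h): top layer's hidden states
```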
Challenges with RNN LMs
What context is incorporated for predicting wi?
p(w1|<s>)
p(w2|<s> w1)
p(w3|<s> w1 w2)
p(w4|<s> w1 w2 w3)
Challenges with RNN LMs
Just like with an n-gram LM, we only use the previous history.
What are we missing if we’re predicting p(w1, w2, …, wn)?
p(w1|<s>)
p(w2|<s> w1)
p(w3|<s> w1 w2)
p(w4|<s> w1 w2 w3)
Bidirectional RNN
Figure 9.11 from Jurafsky and Martin
Normal forward RNN
Bidirectional RNN
Figure 9.11 from Jurafsky and Martin
Backward RNN, starting from the last word
Bidirectional RNN
Figure 9.11 from Jurafsky and Martin
Prediction uses collected information from the words before (left) and the words after (right)
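A sketch of the bidirectional idea, with placeholder weights: run one RNN left-to-right and a second RNN right-to-left over the same inputs, then concatenate the two hidden states at each position.

```python
import numpy as np

def run_rnn(xs, U, W):
    h, hs = np.zeros(U.shape[0]), []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return np.array(hs)

rng = np.random.default_rng(0)
d_in, d_h, T = 5, 6, 4
xs = rng.normal(size=(T, d_in))

Uf, Wf = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))    # forward RNN
Ub, Wb = rng.normal(size=(d_h, d_h)), rng.normal(size=(d_h, d_in))    # backward RNN

h_forward  = run_rnn(xs, Uf, Wf)              # summarizes the words before each position
h_backward = run_rnn(xs[::-1], Ub, Wb)[::-1]  # summarizes the words after each position
h_bi = np.concatenate([h_forward, h_backward], axis=1)
print(h_bi.shape)   # (T, 2 * d_h)
```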
Challenges with RNN LMs
Can we use them for translation (and related tasks)?
Any challenges?
p(w1|<s>)
p(w2|<s> w1)
p(w3|<s> w1 w2)
p(w4|<s> w1 w2 w3)
Challenges with RNN LMs
Can we use them for translation (and related tasks)?
Any challenges?
hasta luego y gracias por (roughly: “see you later and thanks for”)
Challenges with RNN LMs
Translation isn’t word-to-word
Worse for other tasks like summarization
No laila lōʻihi a mahalo no nā mea a pau
Encoder-decoder models
Figure 9.16 from Jurafsky and Martin
Idea:
- Process the input sentence (e.g., sentence to be translated) with a network
- Represent the sentence as some function of the hidden states (encoding)
- Use this context to generate the output
Encoder-decoder models: simple version
Figure 9.17 from Jurafsky and Martin
The context is the final hidden state of the encoder and is provided as input to the first step of the decoder
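A sketch of this simple setup, with placeholder weights: run the encoder RNN over the (already embedded) source sequence, take its final hidden state as the context, and start the decoder RNN from that context.

```python
import numpy as np

def rnn(xs, U, W, h):
    hs = []
    for x in xs:
        h = np.tanh(U @ h + W @ x)
        hs.append(h)
    return np.array(hs), h

rng = np.random.default_rng(0)
d, T_src, T_tgt = 6, 5, 4
Ue, We = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # encoder weights
Ud, Wd = rng.normal(size=(d, d)), rng.normal(size=(d, d))   # decoder weights

src = rng.normal(size=(T_src, d))                 # embedded source sentence
_, context = rnn(src, Ue, We, np.zeros(d))        # context = final encoder state

# decoder starts from the context and runs step by step
tgt_inputs = rng.normal(size=(T_tgt, d))          # e.g., embedded previous outputs
dec_states, _ = rnn(tgt_inputs, Ud, Wd, h=context)
print(dec_states.shape)
```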
Encoder-decoder models: improved
Figure 9.18 from Jurafsky and Martin
The context is some combination of all of the hidden states of the encoder
How is this better?
Each step of decoding has access to the original, full encoding/context
Encoder-decoder models: improved
Figure 9.18 from Jurafsky and Martin
Even with this model, different decoding steps may care about different parts of the context
Attention
The context depends on where we are in the decoding process and on the relationship between the encoder and decoder hidden states
Attention
Figure 9.23 from Jurafsky and Martin
In the simple version the attention is static, but we can also learn the attention mechanism (i.e., the relationship between encoder and decoder hidden states)
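A sketch of the attention computation for a single decoding step, using a plain dot product as the scoring function (one simple choice among several): score each encoder hidden state against the current decoder state, softmax the scores into weights, and take the weighted sum of encoder states as the context. All vectors here are made up.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d = 6
enc_states = rng.normal(size=(5, d))   # hidden states from the encoder
dec_state  = rng.normal(size=d)        # current decoder hidden state

scores  = enc_states @ dec_state       # dot-product score for each encoder state
weights = softmax(scores)              # how much to attend to each source position
context = weights @ enc_states         # context vector for this decoding step
print(weights, context.shape)
```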
Attention
Figure 9.23 from Jurafsky and Martin
Key RNN challenge: computation is sequential
- This prevents parallelization
- Harder to model contextual dependencies
Another model
Figure 10.1 from Jurafsky and Martin
How is this setup different from the RNN?
Another model
Figure 10.1 from Jurafsky and Martin
Do not rely on the hidden states for context information
Parallel: computation can all happen at once
Self-attention
Figure 10.1 from Jurafsky and Martin
Self-attention:
- Input is some context (for LMs, the previous words)
- Learn what parts of the context are important based on the current word
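A sketch of causal self-attention over a short context, with untrained placeholder projection matrices: each position builds a query, key, and value; queries are scored against the keys of the current and earlier positions; and the output is a weighted sum of values, computed for all positions at once.

```python
import numpy as np

def softmax_rows(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 4, 8
X = rng.normal(size=(T, d))            # one vector per word in the context

Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv       # queries, keys, values

scores = Q @ K.T / np.sqrt(d)          # how relevant each position is to each other
mask = np.triu(np.ones((T, T)), k=1).astype(bool)
scores[mask] = -1e9                    # LM: only attend to current and earlier words
A = softmax_rows(scores)               # attention weights
out = A @ V                            # each position: weighted sum of values
print(out.shape)                       # (T, d) — computed for all positions in parallel
```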
Self-attention
Figure 10.1 from Jurafsky and Martin
Transformer block
Figure 10.4 from Jurafsky and Martin
Transformer network
Transformer network vs. RNN
GPT
Generative: outputs things
Pre-trained: previously trained on a large corpus
Transformer: uses the transformer network
Pre-trained language models
Pre-trained language models are general purpose and are trained on a very large corpus
They can be used as-is to:
- Ask p(w1 w2 … wn)
- Generate text given some seed, p(wi | w1 w2 … wi-1)
They can also be “fine-tuned” for particular tasks: take the current weights and update them based on a specific application
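A sketch of the first use, scoring p(w1 w2 … wn) with the chain rule. `next_token_probs` is a hypothetical stand-in for whatever pretrained model provides p(wi | w1 … wi-1); here it just returns a fixed toy distribution.

```python
import math

def next_token_probs(history):
    # Hypothetical stand-in: a real pretrained LM would return a distribution
    # over its vocabulary given the history. Here: a fixed toy distribution.
    return {"the": 0.4, "cat": 0.3, "sat": 0.2, "</s>": 0.1}

def sentence_log_prob(words):
    # chain rule: p(w1 ... wn) = product over i of p(wi | w1 ... w(i-1))
    history, total = ["<s>"], 0.0
    for w in words:
        total += math.log(next_token_probs(history)[w])
        history.append(w)
    return total

print(sentence_log_prob(["the", "cat", "sat"]))
```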
ChatGPT