Understanding Attention Mechanism in Neural Machine Translation
In neural machine translation, attention mechanisms let the decoder selectively draw on the encoded input when generating each output token. By learning to jointly align and translate, attention models encode the input sequence into a set of vectors and focus on the most relevant ones during decoding. Soft attention does this with a probability distribution over the inputs, assigning higher weights to the most important parts. The encoder-decoder architecture incorporates a context vector built from these weights, which are produced by a small feed-forward network trained with the rest of the model. Key ideas include implementing attention as a probability distribution over inputs and extending the encoder-decoder pair with context information relevant to the current decoding step.
Presentation Transcript
Attention for translation
- Learn to encode multiple pieces of information and use them selectively for the output.
- Encode the input sentence into a sequence of vectors.
- Choose a subset of these adaptively while decoding (translating): choose the vectors most relevant to the current output.
- I.e., learn to jointly align and translate.
- Questions: How can we learn and use a vector to decide where to focus attention? How can we make that differentiable so it works with gradient descent?
Bahdanau et al., 2015: https://arxiv.org/pdf/1409.0473.pdf
Soft attention
- Use a probability distribution over all of the inputs.
- Classification assigns a probability to every possible output; attention uses a probability to weight every possible input.
- Learn to weight the more relevant parts more heavily (see the sketch after this slide).
https://distill.pub/2016/augmented-rnns/
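A minimal NumPy sketch of the idea, with toy arrays (`inputs` and `scores` are illustrative names, not from the slides): the softmax turns arbitrary relevance scores into a probability distribution, which then weights the inputs. Because every step is differentiable, this works with gradient descent.

```python
import numpy as np

def softmax(x):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - np.max(x))
    return e / e.sum()

inputs = np.random.randn(5, 8)   # 5 input vectors of dimension 8
scores = np.random.randn(5)      # unnormalized relevance scores (learned in practice)

weights = softmax(scores)        # probability distribution over the 5 inputs
attended = weights @ inputs      # weighted sum: more relevant inputs count more

print(weights.sum())             # ~1.0: a proper probability distribution
```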
Attention for translation
- The input x and output y are sequences.
- The encoder has a hidden state h_i associated with each input x_i.
- These states come from a bidirectional RNN, so each encoding carries information from both sides of the sentence (see the note below).
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
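In the usual notation for a bidirectional encoder (standard notation, not taken verbatim from the slide), each annotation concatenates the forward and backward hidden states:

```latex
h_i = \left[\,\overrightarrow{h_i}\,;\,\overleftarrow{h_i}\,\right]
```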
Attention for translation
- Decoder: each output y_i is predicted using the previous output y_{i-1}, the decoder hidden state s_i, and the context vector c_i.
- The decoder hidden state s_i depends on the previous output y_{i-1}, the previous state s_{i-1}, and the context vector c_i.
- Attention is embedded in the context vector via learned weights (written out below).
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
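In Bahdanau-style notation (a standard formulation, not copied from the slide), these dependencies can be sketched as:

```latex
s_i = f(s_{i-1},\, y_{i-1},\, c_i), \qquad
p(y_i \mid y_1, \dots, y_{i-1}, x) = g(y_{i-1},\, s_i,\, c_i)
```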
Context vector
- The encoder hidden states are combined in a weighted average to form a context vector c_t for the t-th output (see the formula below).
- This can capture features from each part of the input.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
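The weighted average can be written as follows (standard soft-attention form; the score function is produced by the feed-forward network described on the next slide):

```latex
c_t = \sum_{i=1}^{n} \alpha_{t,i}\, h_i, \qquad
\alpha_{t,i} = \frac{\exp\!\big(\mathrm{score}(s_{t-1},\, h_i)\big)}
                    {\sum_{j=1}^{n} \exp\!\big(\mathrm{score}(s_{t-1},\, h_j)\big)}
```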
Attention for translation
- The weights v_a and W_a are learned by a small feed-forward network trained jointly with the rest of the network (a sketch of this scoring function follows).
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
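A minimal NumPy sketch of the additive (concatenation) scoring function using v_a and W_a, followed by the softmax and weighted average from the previous slide. All shapes and names here are illustrative assumptions, not code from the slides.

```python
import numpy as np

def additive_score(s_prev, H, W_a, v_a):
    # score(s, h_i) = v_a^T tanh(W_a [s; h_i]) for every encoder state h_i
    S = np.tile(s_prev, (H.shape[0], 1))                           # repeat decoder state per h_i
    return np.tanh(np.concatenate([S, H], axis=1) @ W_a.T) @ v_a   # shape (n,)

def context_vector(s_prev, H, W_a, v_a):
    scores = additive_score(s_prev, H, W_a, v_a)
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # attention weights: a distribution over inputs
    return w @ H                                  # weighted average of encoder states

# Toy shapes: 6 encoder states of size 16, decoder state of size 32, attention size 24.
rng = np.random.default_rng(0)
H, s_prev = rng.standard_normal((6, 16)), rng.standard_normal(32)
W_a, v_a = rng.standard_normal((24, 32 + 16)), rng.standard_normal(24)
print(context_vector(s_prev, H, W_a, v_a).shape)  # (16,)
```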
Attention for translation
Key ideas:
- Implement attention as a probability distribution over inputs/features.
- Extend the encoder/decoder pair to include context information relevant to the current decoding task.
https://lilianweng.github.io/lil-log/2018/06/24/attention-attention.html
Attention with images
- A CNN can be combined with an RNN using attention.
- The CNN extracts high-level features.
- The RNN generates a description, using attention to focus on relevant parts of the image.
https://distill.pub/2016/augmented-rnns/
Self-attention
- Previous: focus attention on the input while working on the output.
- Self-attention: focus on other parts of the input while processing the input itself.
http://jalammar.github.io/illustrated-transformer/
Self-attention
- Use each input vector to produce a query, key, and value for that input.
- Each is obtained by multiplying the input embedding by one of the learned matrices W_Q, W_K, W_V (illustrated below).
http://jalammar.github.io/illustrated-transformer/
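A minimal sketch of the projections with toy NumPy matrices (the sizes d_model and d_k are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 6                  # 4 input tokens, embedding size 8, head size 6

X   = rng.standard_normal((n, d_model))    # input embeddings, one row per token
W_Q = rng.standard_normal((d_model, d_k))  # learned in practice; random here for illustration
W_K = rng.standard_normal((d_model, d_k))
W_V = rng.standard_normal((d_model, d_k))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # query, key, value for every input token
print(Q.shape, K.shape, V.shape)           # (4, 6) each
```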
Self-attention
- Similarity is determined by the dot product of the query of one input with the keys of all the inputs. E.g., for input 1, we get a vector of dot products (q_1 · k_1), (q_1 · k_2), ...
- Apply scaling and a softmax to get a distribution over the input vectors. This gives a distribution p_11, p_12, ... that is the attention of input 1 over all inputs.
- Use the attention vector to take a weighted sum over the value vectors of the inputs: z_1 = p_11 v_1 + p_12 v_2 + ... This is the output of self-attention for input 1 (see the sketch after this slide).
http://jalammar.github.io/illustrated-transformer/
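Putting the steps together (dot products, scaling by the square root of the key size, softmax, weighted sum of values) as a self-contained NumPy sketch; all shapes are toy assumptions:

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention over input embeddings X of shape (n, d_model)."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n): query i against every key j
    scores -= scores.max(axis=-1, keepdims=True)     # stabilize the softmax
    P = np.exp(scores)
    P /= P.sum(axis=-1, keepdims=True)               # row i: attention of input i over all inputs
    return P @ V                                     # z_i = sum_j p_ij * v_j

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 8))
W_Q, W_K, W_V = (rng.standard_normal((8, 6)) for _ in range(3))
Z = self_attention(X, W_Q, W_K, W_V)
print(Z.shape)                                       # (4, 6): one output vector per input
```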
Self-attention
- Multi-headed attention: run several copies of self-attention in parallel and concatenate their outputs for the next layer (a sketch follows).
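A hedged sketch of the parallel-heads-then-concatenate pattern, reusing the self-attention math above; the output projection W_O and all shapes are illustrative assumptions:

```python
import numpy as np

def multi_head_attention(X, heads, W_O):
    """Run several self-attention heads in parallel and concatenate their outputs.

    `heads` is a list of (W_Q, W_K, W_V) triples, one per head; W_O mixes the
    concatenated head outputs back to the model dimension."""
    outputs = []
    for W_Q, W_K, W_V in heads:
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        scores = Q @ K.T / np.sqrt(K.shape[-1])
        P = np.exp(scores - scores.max(axis=-1, keepdims=True))
        P /= P.sum(axis=-1, keepdims=True)
        outputs.append(P @ V)                        # one (n, d_k) output per head
    return np.concatenate(outputs, axis=-1) @ W_O    # concat heads, then project with W_O

rng = np.random.default_rng(2)
X = rng.standard_normal((4, 8))
heads = [tuple(rng.standard_normal((8, 4)) for _ in range(3)) for _ in range(2)]  # 2 heads
W_O = rng.standard_normal((2 * 4, 8))
print(multi_head_attention(X, heads, W_O).shape)     # (4, 8)
```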
Related work
- Neural Turing Machines combine an RNN with an external memory.
https://distill.pub/2016/augmented-rnns/
Neural Turing Machines
- Use attention to do weighted reads/writes at every memory location.
- Content-based attention can be combined with location-based attention to take advantage of both.
https://distill.pub/2016/augmented-rnns/
Related work
- Adaptive computation time for RNNs.
- Include a probability distribution over the number of computation steps taken for a single input.
- The final output is a weighted sum of the outputs of those steps.
https://distill.pub/2016/augmented-rnns/
Related work
- Neural programmer: determine a sequence of operations to solve a problem.
- Use a probability distribution to combine multiple possible sequences.
https://distill.pub/2016/augmented-rnns/
Attention: summary
- Attention uses a probability distribution to learn and use the most relevant inputs for an RNN's output.
- This can be used in multiple ways to augment RNNs:
  - Better use of the input to the encoder
  - External memory
  - Program control (adaptive computation)
  - Neural programming