Neural Image Caption Generation: Show and Tell with NIC Model Architecture


This presentation delves into neural image captioning, focusing on the Neural Image Caption (NIC) model. The NIC's primary goal is to automatically generate descriptive English sentences for images. Following the encoder-decoder structure, NIC uses a deep CNN as the encoder to transform an image into a fixed-length vector and an RNN as the decoder to produce the target sentence. The model is trained to maximize the likelihood p(Sentence|Image) of producing the correct word sequence, an objective inspired by machine translation. The architecture adopts the winning CNN from the ILSVRC 2014 classification competition for image encoding and an LSTM RNN for sentence generation. RNNs, and LSTMs in particular, are preferred for sequential tasks because they carry information from one step to the next; the presentation also discusses how LSTMs address the long-term dependency problem, helping produce accurate and coherent image descriptions.






Presentation Transcript


  1. Show and Tell: A Neural Image Caption Generator (CVPR 2015) Presenters: Tianlu Wang, Yin Zhang October 5th

  2. Neural Image Caption (NIC) Main goal: automatically describe the content of an image using properly formed English sentences. Human: A young girl asleep on the sofa cuddling a stuffed bear. NIC: A baby is asleep next to a teddy bear. Mathematically, the aim is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence|Image) of producing a target sequence of words.
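The objective on this slide can be written out with the standard chain-rule factorization for word sequences; the notation below is an illustrative sketch rather than the slide's own formula.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,\,S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p\bigl(S_t \mid I, S_1, \dots, S_{t-1}; \theta\bigr)
```

Each conditional p(S_t | I, S_1, ..., S_{t-1}) is what the RNN decoder models, one word at a time.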

  3. Inspiration from the machine translation task The target sentence is generated by maximizing the likelihood p(T|S), where T is the target-language sentence and S is the source-language sentence. Uses the encoder-decoder structure. Encoder (RNN): transforms the source sentence into a rich fixed-length vector. Decoder (RNN): takes the encoder output as input and generates the target sentence. Example: translating a sentence ABCD in the source language into a sentence XYZQ in the target language.

  4. NIC Model Architecture Follows the encoder-decoder structure. Encoder (deep CNN): transforms the image into a rich fixed-length vector. Decoder (RNN): takes the encoder output as input and generates the target sentence.
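A minimal PyTorch sketch of this encoder-decoder pairing (slides 3-4). The class and parameter names are illustrative, and torchvision's ResNet-152 is used only as a convenient stand-in for the ILSVRC 2014 winner mentioned on the next slide; this is not the authors' implementation.

```python
# Minimal NIC-style encoder-decoder sketch (assumes torch and torchvision).
import torch
import torch.nn as nn
import torchvision.models as models

class NICModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Encoder: a pretrained CNN turns the image into a fixed-length feature vector.
        cnn = models.resnet152(weights="DEFAULT")
        feat_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()                       # drop the classification head
        self.encoder = cnn
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        # Decoder: word embedding + LSTM + projection back onto the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image embedding is fed to the LSTM as if it were the first word.
        img = self.img_proj(self.encoder(images)).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                             # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))    # (B, T+1, H)
        return self.out(hidden)                                  # logits over the vocabulary
```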

  5. NIC Model Architecture Choice of CNN: the winner of the ILSVRC 2014 classification competition. Choice of RNN: an LSTM RNN (recurrent neural network with LSTM cells). During training, the CNN was left unchanged; only the RNN part was trained.
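One common way to realize "leave the CNN unchanged, only train the RNN part" is to freeze the encoder's parameters. A sketch building on the hypothetical NICModel above; the optimizer and hyperparameters are illustrative, not taken from the paper.

```python
# Freeze the CNN encoder; only the projection and decoder parameters get gradients.
model = NICModel(vocab_size=10000)
for p in model.encoder.parameters():
    p.requires_grad = False                          # leave the CNN unchanged

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)      # illustrative optimizer choice
```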

  6. RNN (Recurrent Neural Network) Why? Sequential tasks: speech, text, and video, e.g. translating a word based on the previous ones. Advantage: information is passed from one step to the next, so information persists. How? A loop: multiple copies of the same cell (module), each passing a message to its successor. Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
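A tiny NumPy sketch of that loop, with a vanilla RNN cell reused at every step; the weights and sizes here are made up for illustration.

```python
# Vanilla RNN: the same cell is applied at every step, passing its hidden
# state (the "message") on to its successor.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

input_dim, hidden_dim = 32, 64
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):   # a sequence of 10 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)      # information persists through h
```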

  7. RNN & LSTM Why is LSTM better? The long-term dependency problem: the translation of the last word may depend on information from the first word, and when the gap between the relevant information and where it is needed grows, a plain RNN fails. Long Short-Term Memory networks can remember information for long periods of time.

  8. LSTM (Long Short-Term Memory) Cell state: information flows along it. Gates: optionally let information through.

  9. LSTM Cont. (forget gate) From the input x and the previous output h, a sigmoid layer produces f, a vector whose elements lie between 0 and 1, deciding what information to throw away from the cell state.

  10. LSTM Cont. (input gate) Decide what values will be updated: the input gate (a sigmoid layer) decides what new information will be stored in the cell state, while a tanh layer creates new candidate values, pushed to lie between -1 and 1. The old cell state is then updated into the new cell state.

  11. LSTM Cont. (output gate) Decide what parts of the cell state we'll output, then output the parts we decided to.
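Putting slides 8-11 together, a small NumPy sketch of one LSTM step; the weight layout (a single matrix producing all four gate pre-activations) is an implementation convenience for the sketch, not something the slides prescribe.

```python
# One LSTM step: forget gate, input gate, candidate values, cell-state update,
# and output gate, as walked through on slides 8-11.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev, x_t] to the four gate pre-activations.
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what to throw away from the cell state (0..1)
    i = sigmoid(i)             # input gate: which new values to store
    g = np.tanh(g)             # candidate values, squashed to -1..1
    o = sigmoid(o)             # output gate: which parts of the cell state to emit
    c_t = f * c_prev + i * g   # update the old cell state into the new cell state
    h_t = o * np.tanh(c_t)     # output the parts we decided to
    return h_t, c_t
```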

  12. Results Generated captions are evaluated with the BLEU metric: https://en.wikipedia.org/wiki/BLEU
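For reference, one quick way to compute a sentence-level BLEU score is NLTK's implementation; the example below reuses the captions from slide 2 purely as illustration and has nothing to do with the paper's reported numbers.

```python
# Sentence-level BLEU between a candidate caption and a reference caption.
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "young", "girl", "asleep", "on", "the", "sofa", "cuddling", "a", "stuffed", "bear"]]
candidate = ["a", "baby", "is", "asleep", "next", "to", "a", "teddy", "bear"]

# weights=(1, 0, 0, 0) gives BLEU-1, i.e. unigram precision with a brevity penalty.
print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
```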

  13. References: Show and Tell: A Neural Image Caption Generator, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, https://arxiv.org/pdf/1411.4555v2.pdf; talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/; Understanding LSTM Networks, colah's blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
