Neural Image Caption Generation: Show and Tell with NIC Model Architecture

 
Show and Tell: A Neural Image Caption Generator (CVPR 2015)

Presenters: Tianlu Wang, Yin Zhang
October 5th
 
Neural Image Caption (NIC)

Main goal: automatically describe the content of an image using properly formed English sentences.

Example:
Human: A young girl asleep on the sofa cuddling a stuffed bear.
NIC: A baby is asleep next to a teddy bear.

Mathematically, the aim is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence | Image) of producing a target sequence of words.
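
Written out as in the paper, the model parameters θ are trained to maximize the log-likelihood of the correct caption S = (S_0, ..., S_N) given the image I, with each word conditioned on the image and all previous words:

\theta^{*} = \arg\max_{\theta} \sum_{(I, S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=0}^{N} \log p\big(S_t \mid I, S_0, \ldots, S_{t-1}; \theta\big)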
 
Inspiration from the Machine Translation task

The target sentence is generated by maximizing the likelihood P(T | S), where T is the target-language sentence and S is the source-language sentence.

Machine translation uses the Encoder-Decoder structure:
Encoder (RNN): transforms the source sentence into a rich fixed-length vector.
Decoder (RNN): takes the output of the encoder as input and generates the target sentence.

An example is translating words written in the source language, "ABCD", into those in the target language, "XYZQ".
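
To make the Encoder-Decoder structure concrete, here is a minimal sketch of an RNN encoder-decoder in PyTorch. It is an illustration only (GRU cells and made-up token ids, not the model from the original machine-translation papers):

```python
import torch
import torch.nn as nn

class Seq2SeqSketch(nn.Module):
    """Minimal RNN encoder-decoder: source sentence -> fixed-length vector -> target sentence."""

    def __init__(self, src_vocab, tgt_vocab, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.src_embed = nn.Embedding(src_vocab, emb_dim)
        self.tgt_embed = nn.Embedding(tgt_vocab, emb_dim)
        # Encoder RNN: compresses the whole source sentence into its final hidden state.
        self.encoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        # Decoder RNN: starts from that fixed-length vector and emits the target words.
        self.decoder = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, tgt_vocab)

    def forward(self, src_ids, tgt_ids):
        _, h = self.encoder(self.src_embed(src_ids))      # h: the "rich fixed length vector"
        dec_out, _ = self.decoder(self.tgt_embed(tgt_ids), h)
        return self.out(dec_out)                          # per-step scores over the target vocab

# Toy usage: translate "A B C D" into "X Y Z Q" with made-up token ids.
model = Seq2SeqSketch(src_vocab=10, tgt_vocab=10)
src = torch.tensor([[1, 2, 3, 4]])   # A B C D
tgt = torch.tensor([[5, 6, 7, 8]])   # X Y Z Q
print(model(src, tgt).shape)         # torch.Size([1, 4, 10])
```

During training, the decoder's per-step scores are compared against the true target words; at test time the decoder is fed its own previous prediction at each step.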
 
NIC Model Architecture

NIC follows the same Encoder-Decoder structure:
Encoder (deep CNN): transforms the image into a rich fixed-length vector.
Decoder (RNN): takes the output of the encoder as input and generates the target sentence.
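
A rough sketch of this architecture in PyTorch is shown below. It is not the authors' code: the torchvision GoogLeNet is used as a stand-in for the ILSVRC-2014 winner, its classifier head is replaced so it yields a 1024-dimensional image vector, and (as in the paper) the image embedding is fed to the LSTM as the first input, followed by the caption words:

```python
import torch
import torch.nn as nn
import torchvision.models as models

class NICSketch(nn.Module):
    """Illustrative CNN-encoder + LSTM-decoder in the spirit of NIC."""

    def __init__(self, vocab_size, emb_dim=256, hidden_dim=512):
        super().__init__()
        # Encoder: ImageNet-pretrained GoogLeNet (downloads weights on first use).
        cnn = models.googlenet(weights="DEFAULT")
        cnn.fc = nn.Identity()                   # keep the 1024-d pooled features
        self.cnn = cnn
        self.img_proj = nn.Linear(1024, emb_dim)
        # Decoder: word embeddings + LSTM + projection back onto the vocabulary.
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # images: (batch, 3, 224, 224); captions: (batch, T) word ids
        img_emb = self.img_proj(self.cnn(images)).unsqueeze(1)   # (batch, 1, emb_dim)
        word_emb = self.embed(captions)                          # (batch, T, emb_dim)
        seq = torch.cat([img_emb, word_emb], dim=1)              # image first, then words
        h, _ = self.lstm(seq)
        return self.out(h)                                       # per-step word scores
```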
 
NIC Model Architecture

Choice of CNN: the winner of the ILSVRC 2014 classification competition (GoogLeNet).
Choice of RNN: an LSTM RNN (a Recurrent Neural Network with LSTM cells).
In the training process, the authors left the CNN unchanged and trained only the RNN part.
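
Continuing the illustrative NICSketch above, "leave the CNN unchanged, train only the RNN part" can be expressed by freezing the encoder's parameters and giving the optimizer only the decoder's:

```python
model = NICSketch(vocab_size=10000)
for p in model.cnn.parameters():
    p.requires_grad = False          # the CNN weights stay exactly as pretrained

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)   # only embeddings, LSTM and output layer
```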
 
RNN (Recurrent Neural Network)

Why? Sequential tasks: speech, text and video, e.g. translating a word based on the previous one.
Advantage: information is passed from one step to the next, so information persists.
How? Loops: multiple copies of the same cell (module), each passing a message to its successor (see the sketch below).

Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
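
The "loop" can be made explicit in a few lines of NumPy: the same cell function is applied at every time step, and the only thing passed from one step to the next is the hidden state h:

```python
import numpy as np

def rnn_step(x_t, h_prev, Wxh, Whh, b):
    """One copy of the RNN cell: combine the current input with the previous state."""
    return np.tanh(x_t @ Wxh + h_prev @ Whh + b)

rng = np.random.default_rng(0)
Wxh, Whh, b = rng.normal(size=(3, 5)), rng.normal(size=(5, 5)), np.zeros(5)

h = np.zeros(5)                        # the message passed from cell to cell
for x_t in rng.normal(size=(4, 3)):    # a toy sequence of 4 inputs
    h = rnn_step(x_t, h, Wxh, Whh, b)  # information persists inside h
print(h)
```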
 
RNN & LSTM

Why is LSTM better?
The long-term dependency problem: the translation of the last word may depend on information from the first word, and when the gap between the relevant information and the point where it is needed grows, a plain RNN fails.
Long Short-Term Memory networks remember information for long periods of time.
 
LSTM (Long Short-Term Memory)

Cell state: information flows along it.
Gates: optionally let information through.
 
LSTM Cont. (forget gate)

Given the current input x and the previous output h, the forget gate computes f, a vector whose elements lie between 0 and 1, which decides what information to throw away from the cell state.
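
In the notation of the "Understanding LSTM Networks" post cited below, with σ the sigmoid function:

f_t = \sigma\left( W_f \cdot [h_{t-1}, x_t] + b_f \right)

A component of f_t close to 1 means "keep this part of the cell state"; close to 0 means "forget it".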
 
LSTM Cont. (input gate and cell state update)

Input gate: decides what new information will be stored in the cell state.
First decide what values will be updated, then create new candidate values (a tanh layer pushes each value to be between -1 and 1), and finally update the old cell state into the new cell state.
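
In the same notation, with ⊙ denoting element-wise multiplication:

i_t = \sigma\left( W_i \cdot [h_{t-1}, x_t] + b_i \right), \qquad
\tilde{C}_t = \tanh\left( W_C \cdot [h_{t-1}, x_t] + b_C \right), \qquad
C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t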
 
LSTM Cont. (output gate)

Decide what parts of the cell state we'll output, then output the parts we decided to.
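
Again in the same notation, the cell state is pushed through tanh and scaled by the output gate:

o_t = \sigma\left( W_o \cdot [h_{t-1}, x_t] + b_o \right), \qquad
h_t = o_t \odot \tanh(C_t)

Putting the three gates together, one LSTM step can be sketched in NumPy as follows (a single weight matrix holds all four blocks; this is an illustration, not the layout of any particular library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, b):
    """One LSTM step; W maps [h_prev, x_t] to 4*H units (forget, input, candidate, output)."""
    z = np.concatenate([h_prev, x_t]) @ W + b
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])          # forget gate: what to throw away from the cell state
    i = sigmoid(z[1*H:2*H])          # input gate: what new information to store
    C_tilde = np.tanh(z[2*H:3*H])    # candidate values, squashed to (-1, 1)
    o = sigmoid(z[3*H:4*H])          # output gate: what parts of the cell state to emit
    C = f * C_prev + i * C_tilde     # new cell state
    h = o * np.tanh(C)               # new output
    return h, C

# Toy usage with hidden size 4 and input size 3.
rng = np.random.default_rng(0)
H, D = 4, 3
W, b = rng.normal(size=(H + D, 4 * H)), np.zeros(4 * H)
h, C = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h, C)
```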
 
Result

Results are reported using the BLEU metric: https://en.wikipedia.org/wiki/BLEU
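
For a rough feel of the metric, BLEU between a reference caption and a generated one can be computed with NLTK (illustrative only; the sentences are the example from the NIC slide, and this is not the paper's evaluation pipeline):

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "a young girl asleep on the sofa cuddling a stuffed bear".split()
candidate = "a baby is asleep next to a teddy bear".split()

# sentence_bleu takes a list of tokenized references and one tokenized hypothesis.
score = sentence_bleu([reference], candidate,
                      smoothing_function=SmoothingFunction().method1)
print(f"BLEU = {score:.3f}")
```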
 
Reference:

Show and Tell: A Neural Image Caption Generator. Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan. https://arxiv.org/pdf/1411.4555v2.pdf
http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/
Understanding LSTM Networks, colah's blog. http://colah.github.io/posts/2015-08-Understanding-LSTMs/
 
 