Neural Image Caption Generation: Show and Tell with NIC Model Architecture


This presentation delves into neural image captioning, focusing on the Neural Image Caption (NIC) model. The NIC's primary goal is to automatically generate descriptive English sentences for images. Following the encoder-decoder structure, NIC uses a deep CNN as the encoder to transform an image into a fixed-length vector and an RNN as the decoder to produce the target sentence. The model is trained to maximize the likelihood p(Sentence|Image) of producing the correct word sequence, an objective inspired by machine translation. The architecture adopts the winning CNN from the ILSVRC 2014 classification competition for image encoding and an LSTM RNN for sentence generation. RNNs, and LSTMs in particular, are preferred for sequential tasks because they carry information from one step to the next; the presentation also discusses how LSTMs address the long-term dependency problem, helping produce accurate and coherent image descriptions.






Presentation Transcript


  1. Show and Tell: A Neural Image Caption Generator (CVPR 2015) Presenters: Tianlu Wang, Yin Zhang October 5th

  2. Neural Image Caption (NIC) Main goal: automatically describe the content of an image using properly formed English sentences. Human: A young girl asleep on the sofa cuddling a stuffed bear. NIC: A baby is asleep next to a teddy bear. Mathematically, the aim is to build a single joint model that takes an image I as input and is trained to maximize the likelihood p(Sentence|Image) of producing a target sequence of words.
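The objective on this slide can be written out with the standard chain-rule factorization for word sequences; the notation below is an illustrative sketch rather than the slide's own formula.

```latex
\theta^{*} = \arg\max_{\theta} \sum_{(I,\,S)} \log p(S \mid I; \theta),
\qquad
\log p(S \mid I; \theta) = \sum_{t=1}^{N} \log p\bigl(S_t \mid I, S_1, \dots, S_{t-1}; \theta\bigr)
```

Each conditional p(S_t | I, S_1, ..., S_{t-1}) is what the RNN decoder models, one word at a time.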

  3. Inspiration from the machine translation task The target sentence is generated by maximizing the likelihood p(T|S), where T is the target-language sentence and S is the source-language sentence. Uses the encoder-decoder structure. Encoder (RNN): transforms the source sentence into a rich fixed-length vector. Decoder (RNN): takes the encoder output as input and generates the target sentence. Example: translating a sentence ABCD in the source language into a sentence XYZQ in the target language.

  4. NIC Model Architecture Follows the encoder-decoder structure. Encoder (deep CNN): transforms the image into a rich fixed-length vector. Decoder (RNN): takes the encoder output as input and generates the target sentence.
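A minimal PyTorch sketch of this encoder-decoder pairing (slides 3-4). The class and parameter names are illustrative, and torchvision's ResNet-152 is used only as a convenient stand-in for the ILSVRC 2014 winner mentioned on the next slide; this is not the authors' implementation.

```python
# Minimal NIC-style encoder-decoder sketch (assumes torch and torchvision).
import torch
import torch.nn as nn
import torchvision.models as models

class NICModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512):
        super().__init__()
        # Encoder: a pretrained CNN turns the image into a fixed-length feature vector.
        cnn = models.resnet152(weights="DEFAULT")
        feat_dim = cnn.fc.in_features
        cnn.fc = nn.Identity()                       # drop the classification head
        self.encoder = cnn
        self.img_proj = nn.Linear(feat_dim, embed_dim)
        # Decoder: word embedding + LSTM + projection back onto the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # The image embedding is fed to the LSTM as if it were the first word.
        img = self.img_proj(self.encoder(images)).unsqueeze(1)   # (B, 1, E)
        words = self.embed(captions)                             # (B, T, E)
        hidden, _ = self.lstm(torch.cat([img, words], dim=1))    # (B, T+1, H)
        return self.out(hidden)                                  # logits over the vocabulary
```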

  5. NIC Model Architecture Choice of CNN: the winner of the ILSVRC 2014 classification competition. Choice of RNN: an LSTM RNN (recurrent neural network with LSTM cells). During training, the CNN was left unchanged; only the RNN part was trained.
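One common way to realize "leave the CNN unchanged, only train the RNN part" is to freeze the encoder's parameters. A sketch building on the hypothetical NICModel above; the optimizer and hyperparameters are illustrative, not taken from the paper.

```python
# Freeze the CNN encoder; only the projection and decoder parameters get gradients.
model = NICModel(vocab_size=10000)
for p in model.encoder.parameters():
    p.requires_grad = False                          # leave the CNN unchanged

trainable = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.SGD(trainable, lr=0.01)      # illustrative optimizer choice
```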

  6. RNN (Recurrent Neural Network) Why? Sequential tasks: speech, text, and video, e.g. translating a word based on the previous ones. Advantage: information is passed from one step to the next, so information persists. How? A loop: multiple copies of the same cell (module), each passing a message to its successor. Want to know more? http://karpathy.github.io/2015/05/21/rnn-effectiveness/
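A tiny NumPy sketch of that loop, with a vanilla RNN cell reused at every step; the weights and sizes here are made up for illustration.

```python
# Vanilla RNN: the same cell is applied at every step, passing its hidden
# state (the "message") on to its successor.
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
    # The new hidden state mixes the current input with the previous hidden state.
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

input_dim, hidden_dim = 32, 64
rng = np.random.default_rng(0)
W_xh = rng.normal(size=(input_dim, hidden_dim)) * 0.1
W_hh = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b_h = np.zeros(hidden_dim)

h = np.zeros(hidden_dim)
for x_t in rng.normal(size=(10, input_dim)):   # a sequence of 10 input vectors
    h = rnn_step(x_t, h, W_xh, W_hh, b_h)      # information persists through h
```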

  7. RNN & LSTM Why is LSTM better? The long-term dependency problem: the translation of the last word may depend on information from the first word, and when the gap between the relevant information and where it is needed grows, a plain RNN fails. Long Short-Term Memory networks can remember information for long periods of time.

  8. LSTM (Long Short-Term Memory) Cell state: information flows along it. Gates: optionally let information through.

  9. LSTM Cont. (forget gate) From the input x and the previous output h, a sigmoid layer produces f, a vector whose elements lie between 0 and 1, deciding what information to throw away from the cell state.

  10. LSTM Cont. (input gate) Decide what values will be updated: the input gate (a sigmoid layer) decides what new information will be stored in the cell state, while a tanh layer creates new candidate values, pushed to lie between -1 and 1. The old cell state is then updated into the new cell state.

  11. LSTM Cont. (output gate) Decide what parts of the cell state we'll output, then output the parts we decided to.
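Putting slides 8-11 together, a small NumPy sketch of one LSTM step; the weight layout (a single matrix producing all four gate pre-activations) is an implementation convenience for the sketch, not something the slides prescribe.

```python
# One LSTM step: forget gate, input gate, candidate values, cell-state update,
# and output gate, as walked through on slides 8-11.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W maps the concatenated [h_prev, x_t] to the four gate pre-activations.
    z = np.concatenate([h_prev, x_t]) @ W + b
    f, i, g, o = np.split(z, 4)
    f = sigmoid(f)             # forget gate: what to throw away from the cell state (0..1)
    i = sigmoid(i)             # input gate: which new values to store
    g = np.tanh(g)             # candidate values, squashed to -1..1
    o = sigmoid(o)             # output gate: which parts of the cell state to emit
    c_t = f * c_prev + i * g   # update the old cell state into the new cell state
    h_t = o * np.tanh(c_t)     # output the parts we decided to
    return h_t, c_t
```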

  12. Results Generated captions are evaluated with the BLEU metric: https://en.wikipedia.org/wiki/BLEU
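For reference, one quick way to compute a sentence-level BLEU score is NLTK's implementation; the example below reuses the captions from slide 2 purely as illustration and has nothing to do with the paper's reported numbers.

```python
# Sentence-level BLEU between a candidate caption and a reference caption.
from nltk.translate.bleu_score import sentence_bleu

reference = [["a", "young", "girl", "asleep", "on", "the", "sofa", "cuddling", "a", "stuffed", "bear"]]
candidate = ["a", "baby", "is", "asleep", "next", "to", "a", "teddy", "bear"]

# weights=(1, 0, 0, 0) gives BLEU-1, i.e. unigram precision with a brevity penalty.
print(sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
```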

  13. References: Show and Tell: A Neural Image Caption Generator, Oriol Vinyals, Alexander Toshev, Samy Bengio, Dumitru Erhan, https://arxiv.org/pdf/1411.4555v2.pdf; talk: http://techtalks.tv/talks/show-and-tell-a-neural-image-caption-generator/61592/; Understanding LSTM Networks, colah's blog, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
