Image Captioning
Image captioning can aid the blind when paired with text-to-speech conversion, enhancing self-reliance and safety. It also helps self-driving cars understand their environment and plan safe paths, and enables smart surveillance through automatic scene descriptions. Enriching web content with image descriptions is another of its varied applications.
Presentation Transcript
Image Captioning
Group #21: Saurabh Mirani, Eunji Song, Shiladitya Biswas
Background
Image captioning is the process of describing the contents of an image or scene using text. It combines two prominent fields of machine learning: Computer Vision and Natural Language Processing.
[Diagram: an image captioning system. An encoder extracts image features from the input image; a decoder, working with a dictionary of words, generates the output text.]
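As a rough sketch of the encoder half of this pipeline (PyTorch, with an off-the-shelf ResNet-50; the backbone choice and sizes are assumptions, not necessarily the group's exact setup), a pretrained CNN with its classification head removed turns an image into a grid of region features for the decoder to caption:

import torch
import torchvision

# Encoder sketch: a pretrained CNN with its classification head removed,
# so it outputs spatial image features instead of class scores.
resnet = torchvision.models.resnet50(pretrained=True)
encoder = torch.nn.Sequential(*list(resnet.children())[:-2])  # drop avgpool and fc

image = torch.randn(1, 3, 224, 224)              # one RGB image (dummy tensor)
features = encoder(image)                        # -> (1, 2048, 7, 7) feature grid
features = features.flatten(2).transpose(1, 2)   # -> (1, 49, 2048): 49 image regions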
Importance
1. Aid for the blind: converting the scene to text and then the text to speech enhances self-reliance in daily life.
2. Self-driving cars: generated scene descriptions help the vehicle understand its environment and plan an optimal, safe path.
3. Smart surveillance: scene descriptions from CCTV footage can be used to raise alarms in case of malicious activity.
4. Web content: enriching web pages by providing descriptions of images.
Literature Survey - w/o Deep Learning
Traditional methods (2011-2014):
Retrieval-based: search for the closest images in the training set, then select the best-fitting caption for the given image (Ordonez et al., 2011; Hodosh et al., 2013; Kuznetsova et al., 2012, 2014).
Template-based: fill in sentence templates using the results of object detection and attribute discovery (Kulkarni et al., 2011; Li et al., 2011; Yang et al., 2011; Mitchell et al., 2012; Elliott & Keller, 2013).
(Ordonez et al., Im2Text: Describing Images Using 1 Million Captioned Photographs, NIPS 2011; Kulkarni et al., Baby Talk: Understanding and Generating Simple Image Descriptions, CVPR 2011.)
Literature Survey - w/ Deep Learning
Neural network (encoder-decoder) methods (2014-): [Encoder] CNN for feature extraction; [Decoder] neural language model for caption generation.
Multimodal log-bilinear model (Kiros et al., 2014)
Multimodal RNN (Mao et al., 2014)
LSTM (Vinyals et al., 2014; Donahue et al., 2014)
LSTM + attention (Xu et al., 2015)
Object-semantics tagging (Li et al., 2020)
Visual feature extraction (Zhang et al., 2021)
(Kiros et al., Multimodal Neural Language Models, ICML 2014; Mao et al., Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), ICLR 2015; Vinyals et al., Show and Tell: A Neural Image Caption Generator, CVPR 2015.)
Why can deep learning solve this problem?
Example caption: "A man is scuba diving next to a turtle."
Image taken from: Automated Image Captioning with ConvNets and Recurrent Nets, Andrej Karpathy and Fei-Fei Li.
Dataset - COCO (Common Objects in Context)
COCO is a large-scale object detection, segmentation, and captioning dataset.
Images (2014 split): ~83k training images (13.5 GB), ~41k validation images (6.6 GB), ~41k test images (6.7 GB).
COCO features: object segmentation; recognition in context; superpixel stuff segmentation; 330K images (>200K labeled); 1.5 million object instances; 80 object categories; 91 stuff categories; 5 captions per image; 250,000 people with keypoints.
Captions (2014 annotations.json for train/val/test): general information, image file names, licenses for each image, and image descriptions/captions for each image.
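The caption annotation files can be read with the standard pycocotools COCO API; a minimal sketch, assuming the 2014 caption annotations sit in a local annotations/ folder (the path is a placeholder):

from pycocotools.coco import COCO

# Load the 2014 caption annotations (the path is a placeholder).
coco_caps = COCO("annotations/captions_train2014.json")

img_ids = coco_caps.getImgIds()
img = coco_caps.loadImgs(img_ids[0])[0]                 # file_name, height, width, license, ...
ann_ids = coco_caps.getAnnIds(imgIds=img["id"])
captions = [a["caption"] for a in coco_caps.loadAnns(ann_ids)]  # typically 5 captions per image
print(img["file_name"], captions)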
Details of Model
Encoder: ResNet (ResNet image taken from: Automobile Classification Using Transfer Learning on ResNet Neural Network Architecture)
Decoder (layer summary):
=========================================================
Layer (type:depth-idx)      Output Shape      Param #
=========================================================
Decoder                     --                --
|- Linear: 1-1              [-1, 512]         1,049,088
|- Tanh: 1-2                [-1, 512]         --
|- Linear: 1-3              [-1, 512]         1,049,088
|- Tanh: 1-4                [-1, 512]         --
|- Embedding: 1-5           [-1, 1, 512]      5,307,392
|- Attention: 1-6           [-1, 2048]        --
|  |- Linear: 2-1           [-1, 512]         262,656
|  |- Linear: 2-2           [-1, 49, 512]     1,049,088
|  |- Tanh: 2-3             [-1, 49, 512]     --
|  |- Linear: 2-4           [-1, 49, 1]       513
|  |- Softmax: 2-5          [-1, 49]          --
|- Linear: 1-7              [-1, 2048]        1,050,624
|- Sigmoid: 1-8             [-1, 2048]        --
|- LSTMCell: 1-9            [-1, 512]         6,295,552
|- Dropout: 1-10            [-1, 512]         --
|- Linear: 1-11             [-1, 10366]       5,317,758
=========================================================
Total params: 21,381,759
Trainable params: 21,381,759
Non-trainable params: 0
Total mult-adds (T): 2.59
=========================================================
Input size (MB): 6.43
Forward/backward pass size (MB): 5.13
Params size (MB): 85.53
Estimated Total Size (MB): 97.09
=========================================================
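These parameter counts follow directly from the layer sizes (2048-dimensional encoder features, 512-dimensional embeddings and hidden state, a 10,366-word vocabulary). A small PyTorch check under those assumptions; the names init_h and init_c are a guess that the two 2048-to-512 linear layers initialize the LSTM state, as in Show, Attend and Tell:

import torch.nn as nn

# Reproducing the main parameter counts from the layer sizes above.
embedding = nn.Embedding(10366, 512)      # 10,366 * 512                = 5,307,392
init_h    = nn.Linear(2048, 512)          # 2048*512 + 512              = 1,049,088
init_c    = nn.Linear(2048, 512)          # 2048*512 + 512              = 1,049,088
f_beta    = nn.Linear(512, 2048)          # 512*2048 + 2048             = 1,050,624
lstm      = nn.LSTMCell(512 + 2048, 512)  # 4*((2560 + 512)*512 + 1024) = 6,295,552
fc        = nn.Linear(512, 10366)         # 512*10366 + 10366           = 5,317,758

for name, m in [("embedding", embedding), ("init_h", init_h), ("init_c", init_c),
                ("f_beta", f_beta), ("lstm", lstm), ("fc", fc)]:
    print(name, sum(p.numel() for p in m.parameters()))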
Why RNN?
The decoder's job is to look at the encoded image and generate a caption word by word. Since it generates a sequence, a Recurrent Neural Network (RNN) is the natural choice; we use an LSTM.
Image taken from: A PyTorch tutorial to Image Captioning.
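A minimal sketch of this word-by-word loop (greedy decoding with nn.LSTMCell; attention is left out and a mean-pooled image context stands in for it, and the start/end token ids are hypothetical):

import torch
import torch.nn as nn

# Sizes follow the model summary; attention is replaced by a mean-pooled
# context here just to keep the loop self-contained.
vocab_size, embed_dim, hidden_dim, encoder_dim = 10366, 512, 512, 2048
embedding = nn.Embedding(vocab_size, embed_dim)
lstm = nn.LSTMCell(embed_dim + encoder_dim, hidden_dim)
fc = nn.Linear(hidden_dim, vocab_size)

features = torch.randn(1, 49, encoder_dim)  # encoder output: 49 image regions
context = features.mean(dim=1)              # stand-in for the attention context
h = torch.zeros(1, hidden_dim)
c = torch.zeros(1, hidden_dim)

word = torch.tensor([1])                    # hypothetical <start> token id
caption = []
for _ in range(20):                         # cap the caption length
    x = torch.cat([embedding(word), context], dim=1)  # previous word + image context
    h, c = lstm(x, (h, c))
    word = fc(h).argmax(dim=1)              # greedily pick the most likely next word
    if word.item() == 2:                    # hypothetical <end> token id
        break
    caption.append(word.item())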
Why Attention?
Image taken from: A PyTorch tutorial to Image Captioning.
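Attention lets the decoder weight the 49 image regions differently at each decoding step instead of using one fixed image vector. A minimal sketch of such a soft-attention module, consistent with the sublayer sizes in the model summary (two projections, a Tanh, a scoring layer, and a Softmax over the 49 regions) and in the spirit of Show, Attend and Tell; the names encoder_att, decoder_att, and full_att are illustrative, not necessarily the group's:

import torch
import torch.nn as nn

class Attention(nn.Module):
    # Soft attention over the 49 encoder regions; layer sizes mirror the summary.
    def __init__(self, encoder_dim=2048, decoder_dim=512, attention_dim=512):
        super().__init__()
        self.decoder_att = nn.Linear(decoder_dim, attention_dim)  # project decoder state
        self.encoder_att = nn.Linear(encoder_dim, attention_dim)  # project image features
        self.full_att = nn.Linear(attention_dim, 1)               # one score per region
        self.softmax = nn.Softmax(dim=1)

    def forward(self, encoder_out, decoder_hidden):
        # encoder_out: (batch, 49, 2048); decoder_hidden: (batch, 512)
        att1 = self.encoder_att(encoder_out)                        # (batch, 49, 512)
        att2 = self.decoder_att(decoder_hidden).unsqueeze(1)        # (batch, 1, 512)
        scores = self.full_att(torch.tanh(att1 + att2)).squeeze(2)  # (batch, 49)
        alpha = self.softmax(scores)                                # attention weights
        context = (encoder_out * alpha.unsqueeze(2)).sum(dim=1)     # (batch, 2048)
        return context, alpha

# One decoding step: weights over the 49 regions for a single image.
attn = Attention()
context, alpha = attn(torch.randn(1, 49, 2048), torch.randn(1, 512))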
Future Work
Planning to train with Wide Residual Networks (WideResNets) as the encoder.
Planning to train with ResNeXt as the encoder.
A sketch of how such a swap could look follows below.
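Both swaps are small changes with torchvision; a sketch assuming the standard wide_resnet50_2 and resnext50_32x4d variants (the 50-layer choice is an assumption). Both keep the 2048-channel feature maps the decoder already expects:

import torch
import torchvision

# Candidate encoder backbones from torchvision (50-layer variants chosen
# for illustration); both yield 2048-channel feature maps like ResNet-50.
wide_resnet = torchvision.models.wide_resnet50_2(pretrained=True)
resnext = torchvision.models.resnext50_32x4d(pretrained=True)

# As with ResNet, drop the classification head to keep spatial features.
encoder = torch.nn.Sequential(*list(wide_resnet.children())[:-2])
features = encoder(torch.randn(1, 3, 224, 224))   # -> (1, 2048, 7, 7), same shape as ResNet-50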
References
K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhutdinov, R. Zemel, and Y. Bengio, Show, Attend and Tell: Neural Image Caption Generation with Visual Attention, Proceedings of the 32nd International Conference on Machine Learning, PMLR 37:2048-2057, 2015.
V. Ordonez, G. Kulkarni, and T. L. Berg, Im2Text: Describing Images Using 1 Million Captioned Photographs, Advances in Neural Information Processing Systems, 2011, pp. 1143-1151.
M. Hodosh, P. Young, and J. Hockenmaier, Framing Image Description as a Ranking Task: Data, Models and Evaluation Metrics, Journal of Artificial Intelligence Research 47 (2013), pp. 853-899.
G. Kulkarni, V. Premraj, S. Dhar, S. Li, Y. Choi, A. Berg, and T. Berg, Baby Talk: Understanding and Generating Simple Image Descriptions, CVPR 2011, pp. 1601-1608, doi:10.1109/CVPR.2011.5995466.
R. Kiros, R. Salakhutdinov, and R. Zemel, Multimodal Neural Language Models, International Conference on Machine Learning, 2014, pp. 595-603.
J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille, Deep Captioning with Multimodal Recurrent Neural Networks (m-RNN), arXiv:1412.6632 [cs.CV], 2014.
O. Vinyals, A. Toshev, S. Bengio, and D. Erhan, Show and Tell: A Neural Image Caption Generator, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 3156-3164, doi:10.1109/CVPR.2015.7298935.
X. Li, X. Yin, C. Li, P. Zhang, X. Hu, L. Zhang, L. Wang, H. Hu, L. Dong, F. Wei, Y. Choi, and J. Gao, Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks, ECCV, 2020.
P. Zhang, X. Li, X. Hu, J. Yang, L. Zhang, L. Wang, Y. Choi, and J. Gao, VinVL: Revisiting Visual Representations in Vision-Language Models, CVPR, 2021.