Multimodal Recurrent Attention CNN for Image Aesthetic Prediction
This study proposes MRACNN, a multimodal recurrent attention convolutional neural network, as a unified approach to image aesthetic prediction that jointly learns visual and textual features. Inspired by the human attention mechanism, the vision stream attends to image regions recurrently, while the language stream encodes user comments; the collected AVA and photo.net comment datasets support research on multimodal modeling in image aesthetics. The architecture comprises a vision-stream feature extractor, a language-stream text-CNN, and multimodal factorized bilinear pooling.
Presentation Transcript
Beyond Vision: A Multimodal Recurrent Attention Convolutional Neural Network for Unified Image Aesthetic Prediction Tasks. Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He, and Jie Li. IEEE TMM, 2020.
Contributions
- Inspired by the human attention mechanism, a recurrent attention neural network is used to extract visual features.
- A multimodal network called MRACNN is proposed to jointly learn visual and textual features for image aesthetic prediction.
- The AVA comment dataset and the photo.net comment dataset are collected; these datasets can advance research on multimodal modeling in image aesthetics.
MRACNN architecture. The network is trained with the EMD (Earth Mover's Distance) loss over the predicted aesthetic score distribution.
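The EMD loss compares two distributions over ordered score buckets via their cumulative distributions. A minimal sketch of the squared-EMD form commonly used for aesthetic score distributions (the function name and `r=2` default are illustrative, not from the slides):

```python
import numpy as np

def emd_loss(p, q, r=2):
    """Earth Mover's Distance between two score distributions.

    p, q: probability vectors over ordered score buckets (e.g. scores 1..10),
    each summing to 1. Closed form over CDFs with norm r, typically r = 2.
    """
    cdf_p = np.cumsum(p)
    cdf_q = np.cumsum(q)
    return np.mean(np.abs(cdf_p - cdf_q) ** r) ** (1.0 / r)

# identical distributions have zero distance
uniform = np.full(10, 0.1)
print(emd_loss(uniform, uniform))  # 0.0
```

Because the buckets are ordered, EMD penalizes predictions that are far from the true score more than near misses, unlike cross-entropy.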
Vision Stream: feature extractor
- Base network: VGG-16 or another network architecture
- Input: image resized to 224x224
- Output: tensor of dimension (W, H, D), represented as a set of region features F = {f_1, ..., f_L}, f_i ∈ R^D, where L = W × H
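The flattening of the conv feature map into L region vectors can be sketched as follows (the 14×14×512 shape assumes a VGG-16-style conv5 output for a 224×224 input; the backbone itself is omitted):

```python
import numpy as np

# Hypothetical conv feature map of a VGG-16-style backbone for a 224x224 input:
# W = H = 14 spatial positions, D = 512 channels.
W, H, D = 14, 14, 512
feature_map = np.random.randn(W, H, D)

# Flatten the spatial grid into L = W * H location vectors, each D-dimensional,
# so the attention module can weight individual image regions.
L = W * H
F = feature_map.reshape(L, D)
print(F.shape)  # (196, 512)
```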
Vision Stream: LSTM with attention
- LSTM update: h_t = LSTM(z_t, h_{t-1})
- Attention: e_{t,i} = f_att(f_i, h_{t-1}); α_{t,i} = exp(e_{t,i}) / Σ_j exp(e_{t,j}); context z_t = Σ_i α_{t,i} f_i
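One soft-attention step can be sketched as below. This is a generic additive-attention form, not necessarily the exact scoring function of the paper; `W_f`, `W_h`, and `w` are hypothetical learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(F, h_prev, W_f, W_h, w):
    """One soft-attention step over L region features F (L x D).

    Scores each region against the previous LSTM hidden state h_prev,
    normalizes with softmax, and returns the weighted context vector.
    """
    scores = np.tanh(F @ W_f + h_prev @ W_h) @ w   # (L,) region scores
    alpha = softmax(scores)                        # attention weights, sum to 1
    context = alpha @ F                            # (D,) weighted region feature
    return context, alpha

rng = np.random.default_rng(0)
F = rng.standard_normal((196, 512))
h_prev = rng.standard_normal(256)
W_f = rng.standard_normal((512, 64))
W_h = rng.standard_normal((256, 64))
w = rng.standard_normal(64)
context, alpha = attend(F, h_prev, W_f, W_h, w)
```

At each recurrent step the context vector feeds the LSTM, whose new hidden state in turn re-weights the regions, so attention shifts over the image across steps.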
Multimodal Factorized Bilinear Pooling
Given the visual feature x ∈ R^m and the textual feature y ∈ R^n, the multimodal bilinear model can be defined as:
z_i = x^T W_i y
Factorizing each W_i = U_i V_i^T into low-rank matrices U_i ∈ R^{m×k} and V_i ∈ R^{n×k}, it can be rewritten as:
z_i = x^T U_i V_i^T y = 1^T (U_i^T x ∘ V_i^T y)
where ∘ denotes the elementwise (Hadamard) product and 1 ∈ R^k is an all-one vector.
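The factorized form can be sketched directly: project both features, multiply elementwise, then sum-pool each block of k elements. The dimensions (m=6, n=4, k=3, o=2) are illustrative only:

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Multimodal Factorized Bilinear (MFB) pooling sketch.

    x: visual feature (m,), y: textual feature (n,).
    U: (m, k*o) and V: (n, k*o) stack the low-rank factors U_i, V_i of
    each bilinear matrix W_i = U_i V_i^T (rank k, o output units).
    Returns z with z_i = 1^T (U_i^T x * V_i^T y) = x^T W_i y.
    """
    prod = (x @ U) * (y @ V)               # (k*o,) elementwise product
    o = prod.size // k
    return prod.reshape(o, k).sum(axis=1)  # sum-pool each block of k

# hypothetical dimensions: m=6 visual, n=4 textual, rank k=3, o=2 outputs
rng = np.random.default_rng(0)
x, y = rng.standard_normal(6), rng.standard_normal(4)
U, V = rng.standard_normal((6, 6)), rng.standard_normal((4, 6))
z = mfb_pool(x, y, U, V, k=3)
# each z[i] matches the unfactorized bilinear form x^T (U_i V_i^T) y
```

The factorization replaces o full m×n bilinear matrices with two thin projections, cutting the parameter count from o·m·n to k·o·(m+n).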
Comments
Pros: recurrent attention CNN; multimodal framework.
Cons: text data may not be available in real-world scenarios; spatial information is not considered in the attention module.