Multimodal Recurrent Attention CNN for Image Aesthetic Prediction


This study proposes MRACNN, a multimodal recurrent attention convolutional neural network that jointly learns visual and textual features for unified image aesthetic prediction. Inspired by the human attention mechanism, the vision stream extracts features with a recurrent attention network, a language stream encodes user comments with a Text-CNN, and the two streams are fused by multimodal factorized bilinear pooling. Comment datasets collected from AVA and photo.net support and can advance multimodal modelling in image aesthetics.





Presentation Transcript


  1. Beyond Vision: A Multimodal Recurrent Attention Convolutional Neural Network for Unified Image Aesthetic Prediction Tasks. Xiaodan Zhang, Xinbo Gao, Wen Lu, Lihuo He, and Jie Li. TMM 2020

  2. Contributions. Inspired by the human attention mechanism, a recurrent attention neural network is used to extract visual features. A multimodal network called MRACNN is proposed to jointly learn visual and textual features for image aesthetic prediction. The AVA comment dataset and the photo.net comment dataset are collected; these datasets can advance research on multimodal modelling in image aesthetics.

  3. AVA dataset with comments

  4. MRACNN Architecture. The network is trained with the EMD loss.
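The EMD loss named on this slide is commonly computed in closed form over the cumulative distributions of the ordered score buckets. A minimal numpy sketch, assuming the usual r = 2 variant (the paper's exact normalization may differ):

```python
import numpy as np

def emd_loss(p, q, r=2):
    """Earth Mover's Distance between two rating distributions.

    p, q: 1-D arrays over ordered score buckets (e.g. AVA's 1..10 ratings),
    each summing to 1. Uses the closed form over CDFs valid for ordered bins.
    """
    cdf_diff = np.cumsum(p) - np.cumsum(q)
    return (np.mean(np.abs(cdf_diff) ** r)) ** (1.0 / r)

# Identical distributions have zero distance.
uniform = np.full(10, 0.1)
print(emd_loss(uniform, uniform))  # → 0.0
```

Because the loss compares CDFs, it penalizes mass moved across distant score buckets more than mass moved between neighboring ones, which suits ordered aesthetic ratings.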

  5. Vision Stream: Feature Extractor. Base network: VGG-16 or another backbone. Input: image resized to 224x224. Output: a tensor of dimension (W, H, D), flattened into L = W x H local feature vectors of dimension D.
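The reshaping described above can be sketched as follows; the (14, 14, 512) shape is a stand-in for a typical VGG-16 convolutional output, not a value stated on the slide:

```python
import numpy as np

# Hypothetical stand-in for a VGG-16 conv feature map: (W, H, D) = (14, 14, 512).
W, H, D = 14, 14, 512
feature_map = np.random.rand(W, H, D)

# Flatten the spatial dims into L = W * H local feature vectors, each in R^D,
# so the attention module can weight individual image regions.
L = W * H
local_features = feature_map.reshape(L, D)
print(local_features.shape)  # (196, 512)
```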

  6. Vision Stream: LSTM with Attention. An LSTM iteratively updates its hidden state, and at each step an attention module weights the local features (the LSTM update and attention equations are given on the slide).
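A generic soft-attention step of the kind this slide describes can be sketched as below. The score function (a single tanh layer) and all weight shapes are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(features, hidden, W_a, W_h, v):
    """One soft-attention step: score each local feature against the LSTM
    hidden state, then return the attention-weighted context vector."""
    scores = np.tanh(features @ W_a + hidden @ W_h) @ v   # (L,)
    alpha = softmax(scores)                               # attention weights
    context = alpha @ features                            # weighted sum, (D,)
    return context, alpha

L, D, Hd, K = 196, 512, 256, 64
rng = np.random.default_rng(0)
feats = rng.standard_normal((L, D))
h = rng.standard_normal(Hd)
ctx, alpha = attend(feats, h, rng.standard_normal((D, K)),
                    rng.standard_normal((Hd, K)), rng.standard_normal(K))
print(ctx.shape)  # (512,); the weights alpha sum to 1
```

The context vector is then fed back into the LSTM, so that successive glimpses can attend to different image regions.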

  7. Language Stream: Text-CNN
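A Text-CNN of the standard Kim (2014) style, which this slide presumably follows, convolves filters of several widths over the word-embedding sequence and max-pools over time. A minimal sketch with assumed dimensions (20 words, 50-d embeddings, 100 filters per width):

```python
import numpy as np

def text_cnn_features(embeddings, filters):
    """For each filter width, slide the filters over the word-embedding
    sequence, apply ReLU, and max-pool over time; concatenate the results.
    `filters` maps width -> weight matrix of shape (n_filters, width*dim)."""
    T, dim = embeddings.shape
    pooled = []
    for width, Wf in filters.items():
        windows = np.stack([embeddings[t:t + width].ravel()
                            for t in range(T - width + 1)])  # (T-w+1, w*dim)
        conv = np.maximum(windows @ Wf.T, 0)                 # ReLU
        pooled.append(conv.max(axis=0))                      # max over time
    return np.concatenate(pooled)

rng = np.random.default_rng(0)
emb = rng.standard_normal((20, 50))            # 20 words, 50-d embeddings
filters = {w: rng.standard_normal((100, w * 50)) for w in (3, 4, 5)}
print(text_cnn_features(emb, filters).shape)   # (300,)
```

Max-over-time pooling makes the textual feature length-independent, which matters since user comments vary widely in length.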

  8. Multimodal Factorized Bilinear Pooling. Given the visual feature and the textual feature, a multimodal bilinear model is defined, then rewritten in a low-rank factorized form (the equations are given on the slide).
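The factorization this slide refers to presumably follows the standard MFB formulation (Yu et al.): the full bilinear form z_i = x^T W_i y is rewritten with W_i = U_i V_i^T, so each output is a sum over k rank-1 factors of elementwise products of projected features. A numpy sketch with assumed dimensions:

```python
import numpy as np

def mfb_pool(x, y, U, V, k):
    """Multimodal factorized bilinear pooling: project both modalities,
    multiply elementwise, then sum-pool over groups of k factors."""
    joint = (x @ U) * (y @ V)              # (k*o,) elementwise product
    z = joint.reshape(-1, k).sum(axis=1)   # sum-pool each group of k -> (o,)
    return z

rng = np.random.default_rng(0)
dx, dy, k, o = 512, 300, 5, 64             # feature dims, factors, output dim
z = mfb_pool(rng.standard_normal(dx), rng.standard_normal(dy),
             rng.standard_normal((dx, k * o)),
             rng.standard_normal((dy, k * o)), k)
print(z.shape)  # (64,)
```

The factorized form needs (dx + dy) * k * o parameters instead of dx * dy * o for the full bilinear tensor, which is what makes bilinear fusion tractable here.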

  9. Experiments: Feature Extractor

  10. Experiments: Ablation Study

  11. Experiments: Attention Map

  12. Experiments: Performance Comparison

  13. Experiments: Performance on Photo.net

  14. Comments. Pros: recurrent attention CNN; unified multimodal framework. Cons: text data may not be available in real-world scenarios; the attention module does not consider spatial information.
