Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Cited by: 74
Authors
Wang, Cheng [1]
Yang, Haojin [1]
Meinel, Christoph [1]
Affiliations
[1] Univ Potsdam, Hasso Plattner Inst, Prof Dr Helmert Str 2-3, D-14482 Potsdam, Germany
Keywords
Deep learning; LSTM; multimodal representations; image captioning; multi-task learning
DOI
10.1145/3115432
Chinese Library Classification number
TP [Automation technology, computer technology]
Subject classification code
0812
Abstract
Generating a novel and descriptive caption for an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional long short-term memory (Bi-LSTM) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of the nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent overfitting when training deep models. To understand how our models "translate" an image into a sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval, even without integrating an additional mechanism (e.g., object detection or an attention model). Our experiments also show that multi-task learning is beneficial for increasing model generality and improving performance. We also demonstrate that, in transfer learning, the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
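The augmentation step named in the abstract (multi-crop, multi-scale, mirror) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the five-crop layout (four corners plus center), the nearest-neighbour resize, and the reading of "vertical mirror" as a left-right flip about the vertical axis are all assumptions made for illustration.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel values (assumed layout)


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Return the size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def mirror(img: Image) -> Image:
    """Flip left-right, i.e. reflect about the vertical axis (assumed meaning)."""
    return [row[::-1] for row in img]


def resize_nearest(img: Image, size: int) -> Image:
    """Nearest-neighbour resize to size x size."""
    h, w = len(img), len(img[0])
    return [[img[r * h // size][c * w // size] for c in range(size)]
            for r in range(size)]


def augment(img: Image, crop_size: int, scales: List[int]) -> List[Image]:
    """Produce multi-scale, multi-crop views, each with a mirrored copy."""
    views: List[Image] = []
    for s in scales:
        if s < crop_size:
            raise ValueError("each scale must be at least crop_size")
        scaled = resize_nearest(img, s)
        last = s - crop_size
        # Hypothetical five-crop layout: four corners plus the center.
        offsets = [(0, 0), (0, last), (last, 0), (last, last),
                   (last // 2, last // 2)]
        for top, left in offsets:
            patch = crop(scaled, top, left, crop_size)
            views.append(patch)
            views.append(mirror(patch))
    return views
```

Under these assumptions, each scale yields five crops and five mirrored copies, so one training image with two scales expands to 20 views that all share the same caption.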
Pages: 20