Image Captioning with Deep Bidirectional LSTMs and Multi-Task Learning

Cited by: 74

Authors:

Wang, Cheng [1]

Yang, Haojin [1]

Meinel, Christoph [1]

Affiliations:

[1] Univ Potsdam, Hasso Plattner Inst, Prof Dr Helmert Str 2-3, D-14482 Potsdam, Germany

Source:

ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS | 2018, Vol. 14, No. 2

Keywords:

Deep learning; LSTM; multimodal representations; image captioning; multi-task learning

DOI:

10.1145/3115432

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology]

Discipline classification code:

0812

Abstract:

Generating a novel and descriptive caption for an image is drawing increasing interest in the computer vision, natural language processing, and multimedia communities. In this work, we propose an end-to-end trainable deep bidirectional Long Short-Term Memory (Bi-LSTM) model to address the problem. By combining a deep convolutional neural network (CNN) and two separate LSTM networks, our model is capable of learning long-term visual-language interactions by making use of history and future context information in a high-level semantic space. We also explore deep multimodal bidirectional models, in which we increase the depth of nonlinearity transition in different ways to learn hierarchical visual-language embeddings. Data augmentation techniques such as multi-crop, multi-scale, and vertical mirror are proposed to prevent over-fitting in training deep models. To understand how our models "translate" image to sentence, we visualize and qualitatively analyze the evolution of Bi-LSTM internal states over time. The effectiveness and generality of the proposed models are evaluated on four benchmark datasets: Flickr8K, Flickr30K, MSCOCO, and Pascal1K. We demonstrate that Bi-LSTM models achieve highly competitive performance on both caption generation and image-sentence retrieval even without integrating an additional mechanism (e.g., object detection, attention model). Our experiments also show that multi-task learning helps increase model generality and improve performance. We also demonstrate that, under transfer learning, the Bi-LSTM model significantly outperforms previous methods on the Pascal1K dataset.
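
Illustrative note (not part of the original record): the abstract names multi-crop and vertical mirror as augmentation techniques but gives no details. The sketch below shows one common reading, assuming that "multi-crop" means four corner crops plus a center crop and that "vertical mirror" means reflection about the vertical axis (a left-right flip). The function names, the list-of-rows image type, and the crop layout are assumptions made for illustration, not the authors' implementation; multi-scale resizing is omitted.

```python
from typing import List

# Assumption: a grayscale image stored as a list of rows of pixel values.
Image = List[List[int]]


def crop(img: Image, top: int, left: int, size: int) -> Image:
    """Return the size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in img[top:top + size]]


def mirror(img: Image) -> Image:
    """Reflect the image about its vertical axis (left-right flip)."""
    return [row[::-1] for row in img]


def multi_crop(img: Image, size: int) -> List[Image]:
    """Four corner crops followed by the center crop (assumed layout)."""
    h, w = len(img), len(img[0])
    if size > h or size > w:
        raise ValueError("crop size exceeds image size")
    offsets = [
        (0, 0),                            # top-left
        (0, w - size),                     # top-right
        (h - size, 0),                     # bottom-left
        (h - size, w - size),              # bottom-right
        ((h - size) // 2, (w - size) // 2),  # center
    ]
    return [crop(img, t, l, size) for t, l in offsets]


def augment(img: Image, size: int) -> List[Image]:
    """Return every crop followed by the mirrored version of every crop."""
    crops = multi_crop(img, size)
    return crops + [mirror(c) for c in crops]
```

Under these assumptions, calling `augment(img, size)` turns one training image into ten variants (five crops and their mirrors), each of which would be paired with the same caption.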

Pages: 20

50 entries in total

[1] A Multi-task Learning Approach for Image Captioning
Zhao, Wei
Wang, Benyou
Ye, Jianbo
Yang, Min
Zhao, Zhou
Luo, Ruotian
Qiao, Yu
[J]. PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 1205 - 1211
[2] MULTI-TASK LEARNING OF STRUCTURED OUTPUT LAYER BIDIRECTIONAL LSTMS FOR SPEECH SYNTHESIS
Li, Runnan
Wu, Zhiyong
Liu, Xunying
Meng, Helen
Cai, Lianhong
[J]. 2017 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2017, : 5510 - 5514
[3] Dependent Multi-Task Learning with Causal Intervention for Image Captioning
Chen, Wenqing
Tian, Jidong
Fan, Caoyun
He, Hao
Jin, Yaohui
[J]. PROCEEDINGS OF THE THIRTIETH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2021, 2021, : 2263 - 2270
[4] Multi-task Deep Learning for Image Understanding
Yu, Bo
Lane, Ian
[J]. 2014 6TH INTERNATIONAL CONFERENCE OF SOFT COMPUTING AND PATTERN RECOGNITION (SOCPAR), 2014, : 37 - 42
[5] Deep multi-task learning for malware image classification
Bensaoud, Ahmed
Kalita, Jugal
[J]. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS, 2022, 64
[6] Multi-task learning for captioning images with novel words
Zheng, He
Wu, Jiahong
Liang, Rui
Li, Ye
Li, Xuzhi
[J]. IET COMPUTER VISION, 2019, 13 (03) : 294 - 301
[7] Deep multi-task learning for image/video distortions identification
Ameur, Zoubida
Fezza, Sid Ahmed
Hamidouche, Wassim
[J]. Neural Computing and Applications, 2022, 34 (24) : 21607 - 21623
[8] Hand Image Understanding via Deep Multi-Task Learning
Zhang, Xiong
Huang, Hongsheng
Tan, Jianchao
Xu, Hongmin
Yang, Cheng
Peng, Guozhu
Wang, Lei
Liu, Ji
[J]. 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 11261 - 11272
[9] MULTI-TASK DEEP LEARNING FOR SATELLITE IMAGE PANSHARPENING AND SEGMENTATION
Khalel, Andrew
Tasar, Onur
Charpiat, Guillaume
Tarabalka, Yuliya
[J]. 2019 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2019), 2019, : 4869 - 4872
