MMT: A Multimodal Translator for Image Captioning

Cited by: 0
Authors
Liu, Chang [1 ]
Sun, Fuchun [1 ]
Wang, Changhu [2 ]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing, Peoples R China
[2] Toutiao AI Lab, Beijing, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Deep learning; Natural language generation;
DOI
Not available
CLC Number
TP18 [Theory of Artificial Intelligence];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Image captioning is a challenging problem. Unlike other computer vision tasks such as image classification and object detection, image captioning requires not only understanding the image but also knowledge of natural language. In this work, we formulate image captioning as a multimodal translation task. Analogous to machine translation, we present a sequence-to-sequence recurrent neural network (RNN) model for image caption generation. Unlike most existing work, where the whole image is represented by a single convolutional neural network (CNN) feature, we propose to represent the input image as a sequence of detected objects, which serves as the source sequence of the RNN model. In this way, the sequential representation of an image can be naturally translated into a sequence of words as the target sequence of the RNN model. To obtain the source sequence, objects are first detected by pre-trained detectors and then converted into a sequential representation using heuristic ordering strategies based on the saliency scores of the detected objects. We propose three ordering methods according to the saliency scores, namely descending, ascending, and random, in order to study the influence of ordering on the RNN cells. To obtain the target sequence, words are represented as one-hot feature vectors. The representations of the objects and the words are then mapped into a common hidden space, and the translation from the source sequence to the target sequence is performed by an LSTM. Extensive experiments on the benchmark MS COCO dataset show that the proposed approach achieves state-of-the-art performance. The approach was also evaluated on the evaluation server of the MS COCO captioning challenge and achieves very competitive results: for example, a CIDEr score of 93.2, ROUGE-L of 53.2, and BLEU-4 of 31.1.
We validate the contribution of each idea, namely the sequential representation and the ordering method, through comparison studies, showing that the sequential representation indeed improves performance over vanilla CNN + RNN baselines and that ascending ordering outperforms the other two ordering methods.
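The three saliency-based ordering strategies described in the abstract can be sketched as follows; the function name, data layout, and saliency values are illustrative assumptions, not taken from the authors' code.

```python
import random

def order_objects(objects, mode="ascending", seed=0):
    """Order detected objects by saliency to form the source sequence.

    `objects` is a list of (label, saliency_score) pairs produced by a
    pre-trained detector; `mode` selects one of the three heuristic
    ordering strategies: descending, ascending, or random.
    """
    if mode == "descending":
        return sorted(objects, key=lambda o: o[1], reverse=True)
    if mode == "ascending":
        return sorted(objects, key=lambda o: o[1])
    if mode == "random":
        rng = random.Random(seed)  # seeded so the shuffle is reproducible
        shuffled = list(objects)
        rng.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown ordering mode: {mode}")

# Hypothetical detections with saliency scores:
detections = [("dog", 0.92), ("frisbee", 0.60), ("grass", 0.15)]
print(order_objects(detections, "ascending"))
```

Per the comparison study summarized above, the ascending strategy is the one reported to perform best.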
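The mapping of object features and one-hot word vectors into a common hidden space can likewise be sketched with plain linear projections; all dimensions and weight values below are made-up toy numbers, and in the actual model such projections would be learned jointly with the LSTM.

```python
def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def one_hot(index, vocab_size):
    """One-hot feature vector for the word at `index`."""
    v = [0.0] * vocab_size
    v[index] = 1.0
    return v

# Toy dimensions: 3-d object features, 4-word vocabulary, 2-d common space.
W_obj = [[0.1, 0.2, 0.3],
         [0.4, 0.5, 0.6]]        # 2 x 3 projection for object features
W_word = [[0.5, 0.0, 0.0, 0.0],
          [0.0, 0.5, 0.0, 0.0]]  # 2 x 4 projection for one-hot word vectors

obj_feature = [1.0, 0.0, 1.0]    # hypothetical detector feature for one object
word_vec = one_hot(1, 4)         # one-hot vector for the word at index 1

# Both modalities now live in the same 2-d hidden space,
# ready to be consumed by an LSTM translator.
h_obj = matvec(W_obj, obj_feature)
h_word = matvec(W_word, word_vec)
```

Once both sequences are embedded in the same space, the source (objects) and target (words) can be handled by a standard sequence-to-sequence LSTM, as the abstract describes.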
Pages: 784-784
Number of pages: 1
Related Papers
50 records in total
  • [1] MAT: A Multimodal Attentive Translator for Image Captioning
    Liu, Chang
    Sun, Fuchun
    Wang, Changhu
    Wang, Feng
    Yuille, Alan
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4033 - 4039
  • [2] Multimodal Image Captioning for Marketing Analysis
    Harzig, Philipp
    Brehm, Stephan
    Lienhart, Rainer
    Kaiser, Carolin
    Schallner, Rene
    [J]. IEEE 1ST CONFERENCE ON MULTIMEDIA INFORMATION PROCESSING AND RETRIEVAL (MIPR 2018), 2018, : 158 - 161
  • [3] Improving multimodal datasets with image captioning
    Thao Nguyen
    Gadre, Samir Yitzhak
    Ilharco, Gabriel
    Oh, Sewoong
    Schmidt, Ludwig
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [4] A multimodal fusion approach for image captioning
    Zhao, Dexin
    Chang, Zhi
    Guo, Shutao
    [J]. NEUROCOMPUTING, 2019, 329 : 476 - 485
  • [5] Effective Multimodal Encoding for Image Paragraph Captioning
    Nguyen, Thanh-Son
    Fernando, Basura
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2022, 31 : 6381 - 6395
  • [6] Regular Constrained Multimodal Fusion for Image Captioning
    Wang, Liya
    Chen, Haipeng
    Liu, Yu
    Lyu, Yingda
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (11) : 11900 - 11913
  • [7] Towards Unsupervised Image Captioning with Shared Multimodal Embeddings
    Laina, Iro
    Rupprecht, Christian
    Navab, Nassir
    [J]. 2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 7413 - 7423
  • [8] Image Captioning Using Multimodal Deep Learning Approach
    Farkh, Rihem
    Oudinet, Ghislain
    Foued, Yasser
    [J]. COMPUTERS, MATERIALS AND CONTINUA, 2024, 81 (03) : 3951 - 3968
  • [9] Towards Personalized Image Captioning via Multimodal Memory Networks
    Park, Cesc Chunseong
    Kim, Byeongchang
    Kim, Gunhee
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2019, 41 (04) : 999 - 1012
  • [10] Multimodal Data Augmentation for Image Captioning using Diffusion Models
    Xiao, Changrong
    Xu, Sean Xin
    Zhang, Kunpeng
    [J]. PROCEEDINGS OF THE 1ST WORKSHOP ON LARGE GENERATIVE MODELS MEET MULTIMODAL APPLICATIONS, LGM3A 2023, 2023, : 23 - 33