Tell and guess: cooperative learning for natural image caption generation with hierarchical refined attention

Cited by: 0
Authors
Wenqiao Zhang
Siliang Tang
Jiajie Su
Jun Xiao
Yueting Zhuang
Affiliations
[1] Zhejiang University, College of Computer Science and Technology
Keywords
Image caption; Cooperative learning; Hierarchical refined attention
DOI
Not available
Abstract
Automatically generating a natural language description of an image is one of the most fundamental and challenging problems in Multimedia Intelligence, because it translates information between two different modalities and such translation requires the ability to understand both. Existing image captioning models have already achieved remarkable performance. However, they rely heavily on the Encoder-Decoder framework, a one-directional translation that is hard to improve further. In this paper, we design the “Tell and Guess” Cooperative Learning model with a Hierarchical Refined Attention mechanism (CL-HRA), which bidirectionally improves performance to generate more informative captions. The Cooperative Learning (CL) method combines an image caption module (ICM) with an image retrieval module (IRM). The ICM is responsible for the “Tell” function: it generates an informative, natural language description for a given image. The IRM then “Guesses”, trying to select that image from a lineup of images based on the given description. Such cooperation mutually improves the learning of the two modules. In addition, the Hierarchical Refined Attention (HRA) learns to selectively attend to the high-level attributes and the low-level visual features, then incorporates them into CL to narrow the gap between image and caption. The HRA can attend differently at different semantic levels to refine the visual representation, while the CL, with its human-like mindset, is more interpretable and generates captions that are more relevant to the corresponding image. Experimental results on the Microsoft COCO dataset show the effectiveness of CL-HRA in terms of several popular image caption generation metrics.
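The “Guess” step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the IRM scores images, so the cosine similarity, the softmax over the lineup, the negative log-likelihood loss, and all names and toy embeddings below (`guess_probabilities`, `guess_loss`, `lineup`) are assumptions made only for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def guess_probabilities(caption_vec, lineup):
    """Softmax over caption-to-image similarities for a lineup of images."""
    scores = [cosine(caption_vec, img) for img in lineup]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def guess_loss(caption_vec, lineup, target_index):
    """Negative log-probability that the retriever picks the described image."""
    probs = guess_probabilities(caption_vec, lineup)
    return -math.log(probs[target_index])

# Hypothetical 3-d embeddings (assumed); both captions describe image 0.
lineup = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
informative_caption = [0.9, 0.1, 0.0]  # close to image 0 only
vague_caption = [0.5, 0.5, 0.5]        # equally close to every image

probs = guess_probabilities(informative_caption, lineup)
best_guess = probs.index(max(probs))
```

Under these assumptions, a caption that singles out its image yields a lower `guess_loss` than a vague one, which is the kind of signal a retrieval module could feed back to a captioner to encourage more informative descriptions.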
Pages: 16267-16282
Number of pages: 15
Published in: Multimedia Tools and Applications, 2021, 80 (11)