Image Captioning with Text-Based Visual Attention

Cited: 15
Authors
He, Chen [1 ]
Hu, Haifeng [1 ]
Affiliations
[1] Sun Yat Sen Univ, Sch Elect & Informat Engn, Guangzhou 510006, Guangdong, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Multimodal recurrent neural network; Text-based visual attention; Transposed weight sharing;
DOI
10.1007/s11063-018-9807-7
CLC number
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Attention mechanisms have attracted considerable interest in image captioning due to their strong performance. However, many visual attention models do not consider the correlation between the image and the textual context, which can produce attention vectors that include irrelevant annotation vectors. To overcome this limitation, we propose a new text-based visual attention (TBVA) model that automatically focuses on salient objects by discarding information irrelevant to the previously generated text. The proposed end-to-end caption generation model adopts a multimodal recurrent neural network architecture, and we leverage the transposed weight sharing scheme to reduce the number of parameters and achieve better performance. The effectiveness of our model is validated on MS COCO and Flickr30k. The results show that TBVA outperforms state-of-the-art image captioning methods.
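The record does not reproduce the paper's equations, but the core idea the abstract describes, soft attention over image annotation vectors conditioned on the text generated so far, can be sketched as follows. This is a minimal illustrative sketch: the function name, weight matrices, and shapes are assumptions for exposition, not the authors' actual TBVA formulation.

```python
import numpy as np

def text_based_visual_attention(annotations, text_state, W_a, W_h, w):
    """Soft attention over image annotation vectors, conditioned on a
    hidden state summarizing the previously generated text.

    annotations: (L, D) array of L image region (annotation) vectors
    text_state:  (H,) RNN hidden state of the caption generated so far
    W_a (K, D), W_h (K, H), w (K,): illustrative projection weights
    Returns the attended visual context vector and attention weights.
    """
    # Score each region against the text context (additive attention).
    scores = w @ np.tanh(W_a @ annotations.T + (W_h @ text_state)[:, None])  # (L,)
    # Softmax: regions irrelevant to the generated text get low weight.
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()
    # Weighted sum of annotation vectors -> visual context for the next word.
    context = alpha @ annotations  # (D,)
    return context, alpha
```

In this formulation, conditioning the scores on `text_state` is what lets the model down-weight annotation vectors unrelated to the caption so far, which is the limitation of purely visual attention that the abstract points to.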
Pages: 177-185 (9 pages)