Attention-Aligned Transformer for Image Captioning

Cited by: 0
Authors
Fei, Zhengcong [1,2]
Affiliations
[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China
[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
Keywords
REPRESENTATION
DOI
Not available
CLC number
TP18 [Artificial intelligence theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Recently, attention-based image captioning models, which are expected to ground the correct image regions for proper word generation, have achieved remarkable performance. However, some researchers have pointed out a "deviated focus" problem in existing attention mechanisms when determining the effective and influential image features. In this paper, we present A², an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a masking operation to image regions through a learnable network to estimate their true contribution to the final description generation. We hypothesize that the necessary image-region features, for which a small disturbance causes an obvious performance degradation, deserve more attention weight. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended to in correspondence with the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A² Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
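The perturbation-based idea in the abstract can be sketched as follows. This is a minimal toy illustration, not the authors' implementation: `region_importance`, `align_attention`, `caption_loss`, and `alpha` are hypothetical names, and the single interpolation shown stands in for just one plausible alignment strategy rather than any of the paper's four.

```python
def region_importance(features, caption_loss, mask_value=0.0):
    """Probe each image-region feature by masking it and measuring how
    much the captioning loss degrades; larger degradation suggests a
    more necessary region. (Hypothetical sketch of the abstract's idea.)"""
    base = caption_loss(features)
    scores = []
    for i in range(len(features)):
        perturbed = list(features)
        perturbed[i] = mask_value          # mask out region i
        # only performance degradation (a loss increase) counts
        scores.append(max(caption_loss(perturbed) - base, 0.0))
    return scores

def align_attention(attn, importance, alpha=0.5):
    """One plausible alignment strategy (an assumption, not taken from
    the paper): interpolate the model's attention weights with the
    normalized importance distribution, then renormalize."""
    total = sum(importance) or 1.0
    target = [s / total for s in importance]
    mixed = [(1 - alpha) * a + alpha * t for a, t in zip(attn, target)]
    norm = sum(mixed)
    return [w / norm for w in mixed]

# Toy example: three regions and a stand-in "loss" that drops when
# informative features are present.
features = [1.0, 2.0, 0.0]
toy_loss = lambda f: -sum(f)
imp = region_importance(features, toy_loss)       # [1.0, 2.0, 0.0]
aligned = align_attention([1 / 3, 1 / 3, 1 / 3], imp)
```

Here masking region 1 hurts the toy loss most, so the aligned distribution shifts attention toward it while remaining a valid (sum-to-one) weighting.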
Pages: 607 - 615
Page count: 9
Related papers
50 records in total
  • [41] Multimodal attention-based transformer for video captioning
    Munusamy, Hemalatha
    Sekhar, C. Chandra
    [J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
  • [42] REFINING ATTENTION: A SEQUENTIAL ATTENTION MODEL FOR IMAGE CAPTIONING
    Fang, Fang
    Li, Qinyu
    Wang, Hanli
    Tang, Pengjie
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2018,
  • [43] Boosted Attention: Leveraging Human Attention for Image Captioning
    Chen, Shi
    Zhao, Qi
    [J]. COMPUTER VISION - ECCV 2018, PT XI, 2018, 11215 : 72 - 88
  • [44] Relational-Convergent Transformer for image captioning
    Chen, Lizhi
    Yang, You
    Hu, Juntao
    Pan, Longyue
    Zhai, Hao
    [J]. DISPLAYS, 2023, 77
  • [45] MIXED KNOWLEDGE RELATION TRANSFORMER FOR IMAGE CAPTIONING
    Chen, Tianyu
    Li, Zhixin
    Wei, Jiahui
    Xian, Tiantao
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022, : 4403 - 4407
  • [47] A Position-Aware Transformer for Image Captioning
    Deng, Zelin
    Zhou, Bo
    He, Pei
    Huang, Jianfeng
    Alfarraj, Osama
    Tolba, Amr
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01): : 2065 - 2081
  • [48] Context-aware transformer for image captioning
    Yang, Xin
    Wang, Ying
    Chen, Haishun
    Li, Jie
    Huang, Tingting
    [J]. NEUROCOMPUTING, 2023, 549
  • [49] Full-Memory Transformer for Image Captioning
    Lu, Tongwei
    Wang, Jiarong
    Min, Fen
    [J]. SYMMETRY-BASEL, 2023, 15 (01):
  • [50] Retrieval-Augmented Transformer for Image Captioning
    Sarto, Sara
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    [J]. 19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 1 - 7