Attention-Aligned Transformer for Image Captioning

Cited by: 0

Authors:

Fei, Zhengcong [1,2]

Affiliations:

[1] Chinese Acad Sci, Inst Comp Technol, Key Lab Intelligent Informat Proc, Beijing 100190, Peoples R China

[2] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

Source:

THIRTY-SIXTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FOURTH CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE / THE TWELFTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE | 2022

Keywords:

REPRESENTATION;

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Recently, attention-based image captioning models, which are expected to ground the correct image regions for proper word generation, have achieved remarkable performance. However, some researchers have pointed out a "deviated focus" problem in existing attention mechanisms when determining the effective and influential image features. In this paper, we present A^2, an attention-aligned Transformer for image captioning, which guides attention learning in a perturbation-based, self-supervised manner without any annotation overhead. Specifically, we apply a mask operation to image regions through a learnable network to estimate their true function in the final description generation. We hypothesize that the necessary image region features, for which a small disturbance causes an obvious performance degradation, deserve more attention weight. We then propose four alignment strategies that use this information to refine the attention weight distribution. Under this scheme, image regions are attended to correctly with respect to the output words. Extensive experiments conducted on the MS COCO dataset demonstrate that the proposed A^2 Transformer consistently outperforms baselines in both automatic metrics and human evaluation. Trained models and code for reproducing the experiments are publicly available.
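
The perturbation idea in the abstract can be illustrated with a small sketch. This is not the paper's method: the abstract describes a learnable mask network and four alignment strategies, none of which are specified here. The sketch below substitutes plain occlusion (zeroing one region at a time) for the learnable mask and a single KL-divergence term for the alignment step. The names `region_importance`, `normalize`, `alignment_loss`, and `toy_loss` are hypothetical, and `toy_loss` is only a stand-in for a captioning loss.

```python
import numpy as np

def region_importance(features, loss_fn):
    """Score each region by how much the loss rises when that region is zeroed.

    Assumption: occlusion replaces the learnable mask network from the abstract.
    """
    base = loss_fn(features)
    scores = np.empty(features.shape[0])
    for i in range(features.shape[0]):
        perturbed = features.copy()
        perturbed[i] = 0.0  # mask out region i
        scores[i] = max(loss_fn(perturbed) - base, 0.0)
    return scores

def normalize(scores, eps=1e-8):
    """Turn non-negative scores into a distribution (uniform if all are ~0)."""
    total = scores.sum()
    if total < eps:
        return np.full_like(scores, 1.0 / len(scores))
    return scores / total

def alignment_loss(attention, target, eps=1e-8):
    """KL(target || attention): one possible way to pull attention toward target."""
    return float(np.sum(target * (np.log(target + eps) - np.log(attention + eps))))

def toy_loss(f):
    # Hypothetical stand-in for a captioning loss: depends strongly on
    # region 0, weakly on region 2, and not at all on regions 1 and 3.
    return (1.0 - f[0].mean()) ** 2 + 0.1 * (1.0 - f[2].mean()) ** 2

features = np.ones((4, 3))           # 4 regions, 3-dimensional features
scores = region_importance(features, toy_loss)
target = normalize(scores)           # importance-derived attention target
uniform = np.full(4, 0.25)           # an attention map that ignores importance
loss_uniform = alignment_loss(uniform, target)
loss_matched = alignment_loss(target, target)
```

In this toy setup the region whose removal hurts the loss most receives the largest target weight, and an attention map that already matches the target incurs a lower alignment penalty than a uniform one.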

Pages: 607-615

Page count: 9
