Position-guided transformer for image captioning

Cited by: 5

Authors:

Hu, Juntao [1]

Yang, You [2]

Yao, Lu [1]

An, Yongzhi [1]

Pan, Longyue [1]

Affiliations:

[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China

[2] Chongqing Normal Univ, Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China

Source:

IMAGE AND VISION COMPUTING | 2022 / Vol. 128

Keywords:

Image captioning; Bi-positional attention; Position encoding; Group normalization; Transformer; Self-attention;

DOI:

10.1016/j.imavis.2022.104575

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Subject classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Transformer-based frameworks have shown strong performance in image captioning. However, such frameworks struggle to account for geometric interrelations among the visual contents of an image, and they fail to prevent changes in the distribution of each layer's input in self-attention. In this work, we first propose a Bi-Positional Attention (BPA) module, which incorporates absolute and relative position encoding to precisely explore the internal relations between objects and their geometric information in an image. Additionally, we use a Group Normalization (GN) method inside BPA to relieve shifts of the distribution and better exploit the channel dependence of visual features. To validate our proposals, we apply BPA and GN to the original Transformer to constitute our Position-Guided Transformer (PGT) network, which learns more comprehensive positional representations to augment spatial interactions among objects for image captioning. We conduct extensive experiments to verify the effectiveness of our model. Compared with non-pretraining state-of-the-art methods, experimental results on the MSCOCO benchmark dataset demonstrate that our PGT achieves competitive performance, reaching a 134.2% CIDEr score on the Karpathy split with a single model, and a 136.2% CIDEr score on the official testing server with an ensemble configuration. (c) 2022 Elsevier B.V. All rights reserved.
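
The abstract names three ingredients (absolute position encoding, relative position encoding, and group normalization inside the attention module) but does not give their formulation. The following is only a minimal NumPy sketch of how such ingredients could be combined in a single attention step; it is not the authors' BPA module. The function names, the fixed random projection `w_abs` standing in for learned weights, the distance-based relative bias, and the `(cx, cy, w, h)` box layout are all assumptions made for illustration.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def group_norm(x, num_groups, eps=1e-5):
    # x: (n, d). Split the d channels of each row into groups and
    # normalize every group to zero mean and (roughly) unit variance.
    n, d = x.shape
    g = x.reshape(n, num_groups, d // num_groups)
    mean = g.mean(axis=-1, keepdims=True)
    var = g.var(axis=-1, keepdims=True)
    return ((g - mean) / np.sqrt(var + eps)).reshape(n, d)

def toy_position_attention(feats, boxes, num_groups=4, seed=0):
    # feats: (n, d) region features; boxes: (n, 4) as (cx, cy, w, h).
    # Hypothetical illustration only, not the paper's formulation.
    n, d = feats.shape
    rng = np.random.default_rng(seed)

    # Absolute position: project box coordinates to d dimensions and add
    # them to the features (random matrix stands in for learned weights).
    w_abs = rng.standard_normal((4, d)) * 0.1
    x = feats + boxes @ w_abs

    # Scaled dot-product logits; learned Q/K/V projections are omitted.
    logits = x @ x.T / np.sqrt(d)

    # Relative position: subtract the pairwise distance between box
    # centers so that nearby regions receive higher attention scores.
    centers = boxes[:, :2]
    dist = np.linalg.norm(centers[:, None, :] - centers[None, :, :], axis=-1)
    attn = softmax(logits - dist, axis=-1)

    # Group normalization applied to the attended features.
    return group_norm(attn @ x, num_groups), attn

rng = np.random.default_rng(1)
feats = rng.standard_normal((5, 8))       # 5 regions, 8 channels
boxes = rng.uniform(0.1, 0.9, (5, 4))     # normalized box coordinates
out, attn = toy_position_attention(feats, boxes)
print(out.shape, attn.shape)
```

In this sketch the absolute encoding changes what each region's feature vector contains, while the relative term changes only how strongly pairs of regions attend to each other; whether the paper separates the two in the same way is not stated in the abstract.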


Pages: 11

50 items in total

[1] Position-Guided Point Cloud Panoptic Segmentation Transformer
Xiao, Zeqi
Zhang, Wenwei
Wang, Tai
Loy, Chen Change
Lin, Dahua
Pang, Jiangmiao
[J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024,
[2] A position-aware transformer for image captioning
Deng, Zelin
Zhou, Bo
He, Pei
Huang, Jianfeng
Alfarraj, Osama
Tolba, Amr
[J]. Deng, Zelin (zl_deng@sina.com), 2005, Tech Science Press (70): 2005 - 2021
[3] A Position-Aware Transformer for Image Captioning
Deng, Zelin
Zhou, Bo
He, Pei
Huang, Jianfeng
Alfarraj, Osama
Tolba, Amr
[J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01): 2065 - 2081
[4] Dual Position Relationship Transformer for Image Captioning
Wang, Yaohan
Qian, Wenhua
Nie, Rencan
Xu, Dan
Cao, Jinde
Kim, Pyoungwon
[J]. BIG DATA, 2022, 10 (06) : 515 - 527
[5] PCTrans: Position-Guided Transformer with Query Contrast for Biological Instance Segmentation
Chen, Qi
Huang, Wei
Liu, Xiaoyu
Li, Jiacheng
Xiong, Zhiwei
[J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 3905 - 3914
[6] PCATNet: Position-Class Awareness Transformer for Image Captioning
Tang, Ziwei
Yi, Yaohua
Yu, Changhui
Yin, Aiguo
[J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (03): 6007 - 6022
[7] Image Captioning Based on An Improved Transformer with IoU Position Encoding
Li, Yazhou
Shi, Yihui
Liu, Yun
Li, Ruifan
Ma, Zhanyu
[J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021, : 2066 - 2071
[8] Semantic association enhancement transformer with relative position for image captioning
Jia, Xin
Wang, Yunbo
Peng, Yuxin
Chen, Shengyong
[J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (15) : 21349 - 21367
[9] Semantic association enhancement transformer with relative position for image captioning
Xin Jia
Yunbo Wang
Yuxin Peng
Shengyong Chen
[J]. Multimedia Tools and Applications, 2022, 81 : 21349 - 21367
[10] Region-guided transformer for remote sensing image captioning
Zhao, Kai
Xiong, Wei
[J]. INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2024, 17 (01)
