Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14

Authors:

Wang, Yong ^{[1,2,4]}

Zhang, WenKai ^{[1,3]}

Liu, Qing ^{[1,3]}

Zhang, Zhengyuan ^{[1,2,4]}

Gao, Xin ^{[1,3]}

Sun, Xian ^{[1,3]}

Affiliations:

[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China

[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China

[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China

[4] Univ Chinese Acad Sci, Beijing, Peoples R China

Source:

MM '20: PROCEEDINGS OF THE 28TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA | 2020

Keywords:

Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;

DOI:

10.1145/3394171.3413877

Chinese Library Classification (CLC) number:

TP18 [Artificial Intelligence Theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

It is widely accepted that capturing relationships among multi-modality features helps in representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I^{2}RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Second, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, which improves on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments provide a comprehensive analysis of the proposed method.

Pages: 4190-4198

Page count: 9
