Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14
Authors
Wang, Yong [1, 2, 4]
Zhang, WenKai [1, 3]
Liu, Qing [1, 3]
Zhang, Zhengyuan [1, 2, 4]
Gao, Xin [1, 3]
Sun, Xian [1, 3]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;
DOI
10.1145/3394171.3413877
Chinese Library Classification (CLC) number
TP18 [Artificial Intelligence Theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is widely accepted that capturing relationships among multi-modality features helps in representing and ultimately describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I²RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, leading to improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
Pages: 4190-4198
Page count: 9