Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14
Authors
Wang, Yong [1 ,2 ,4 ]
Zhang, WenKai [1 ,3 ]
Liu, Qing [1 ,3 ]
Zhang, Zhengyuan [1 ,2 ,4 ]
Gao, Xin [1 ,3 ]
Sun, Xian [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;
DOI
10.1145/3394171.3413877
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
It is widely acknowledged that capturing relationships among multi-modality features helps in representing, and ultimately describing, an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed IR²T, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, which yields improvements on all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
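The abstract's core idea of strengthening intra-modality visual relations among objects builds on self-attention over detected region features. As a rough, illustrative sketch only (the function and variable names below are hypothetical and not taken from the paper; RETB's actual design has additional components), scaled dot-product self-attention lets each object feature be refined by the features of related objects:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def intra_modality_relations(features, w_q, w_k, w_v):
    """Illustrative intra-modality relation modeling.

    features: (n_objects, d) region features from an object detector
    w_q, w_k, w_v: (d, d) learned projection matrices
    """
    q, k, v = features @ w_q, features @ w_k, features @ w_v
    d = q.shape[-1]
    # Pairwise relation weights among objects: (n_objects, n_objects)
    attn = softmax(q @ k.T / np.sqrt(d))
    # Each refined feature is a relation-weighted mix of all object features
    return attn @ v

rng = np.random.default_rng(0)
n_objects, d = 5, 8  # e.g. 5 detected regions, 8-dim toy features
feats = rng.standard_normal((n_objects, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
refined = intra_modality_relations(feats, w_q, w_k, w_v)
print(refined.shape)  # (5, 8): one relation-enhanced feature per object
```

A full transformer block would add multi-head splitting, residual connections, layer normalization, and a feed-forward sublayer around this attention step; the sketch isolates only the relation-weighting mechanism.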
Pages: 4190 - 4198
Page count: 9
Related Papers
50 records in total
  • [31] Multimodal fake news detection through intra-modality feature aggregation and inter-modality semantic fusion
    Zhu, Peican
    Hua, Jiaheng
    Tang, Keke
    Tian, Jiwei
    Xu, Jiwei
    Cui, Xiaodong
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (04) : 5851 - 5863
  • [32] Intra and Inter-modality Incongruity Modeling and Adversarial Contrastive Learning for Multimodal Fake News Detection
    Wei, Siqi
    Wu, Bin
    PROCEEDINGS OF THE 4TH ANNUAL ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, ICMR 2024, 2024, : 666 - 674
  • [33] Improving Image Captioning Evaluation by Considering Inter References Variance
    Yi, Yanzhi
    Deng, Hangyu
    Hu, Jinglu
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020), 2020, : 985 - 994
  • [34] Intra- and inter-modal completion of a visual motion representation
    Teramoto, W.
    Hidaka, S.
    Gyoba, J.
    Suzuki, Y-I
    PERCEPTION, 2009, 38 : 132 - 132
  • [35] FUSION-BASED MULTIMODAL MEDICAL IMAGE REGISTRATION COMBINING INTER-MODALITY METRIC AND DISENTANGLEMENT
    Ji, Yu
    Zhu, Zhenyu
    Wei, Ying
2022 IEEE INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING (IEEE ISBI 2022), 2022
  • [36] Improving Image Captioning through Visual and Semantic Mutual Promotion
    Zhang, Jing
    Xie, Yingshuai
    Liu, Xiaoqiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 4716 - 4724
  • [37] Multi-view inter-modality representation with progressive fusion for image-text matching
    Wu, Jie
    Wang, Leiquan
    Chen, Chenglizhao
    Lu, Jing
    Wu, Chunlei
    NEUROCOMPUTING, 2023, 535 : 1 - 12
  • [38] I3N: Intra- and Inter-Representation Interaction Network for Change Captioning
    Yue, Shengbin
    Tu, Yunbin
    Li, Liang
    Yang, Ying
    Gao, Shengxiang
    Yu, Zhengtao
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 8828 - 8841
  • [39] Image enhancement in lensless inline holographic microscope by inter-modality learning with denoising convolutional neural network
    Chen, Ling
    Chen, Xin
    Cui, Hanchen
    Long, Yong
    Wu, Jigang
    OPTICS COMMUNICATIONS, 2021, 484
  • [40] Improved inter-modality image registration using normalized mutual information with coarse-binned histograms
    Nam, Haewon
    Renaut, Rosemary A.
    Chen, Kewei
    Guo, Hongbin
    Farin, Gerald E.
    COMMUNICATIONS IN NUMERICAL METHODS IN ENGINEERING, 2009, 25 (06): : 583 - 595