Improving Intra- and Inter-Modality Visual Relation for Image Captioning

Cited by: 14
Authors
Wang, Yong [1 ,2 ,4 ]
Zhang, WenKai [1 ,3 ]
Liu, Qing [1 ,3 ]
Zhang, Zhengyuan [1 ,2 ,4 ]
Gao, Xin [1 ,3 ]
Sun, Xian [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Aerosp Informat Res Inst, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Elect Elect & Commun Engn, Beijing, Peoples R China
[3] Chinese Acad Sci, Inst Elect, Key Lab Network Informat Syst Technol, Beijing, Peoples R China
[4] Univ Chinese Acad Sci, Beijing, Peoples R China
Keywords
Image Captioning; Intra- and Inter-Modality Visual Relation; Relation Enhanced Transformer Block; Visual Guided Alignment;
DOI
10.1145/3394171.3413877
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
It is widely acknowledged that capturing relationships among multi-modality features helps in representing and, ultimately, describing an image. In this paper, we present a novel Intra- and Inter-modality visual Relation Transformer, termed I²RT, to improve connections among visual features. First, we propose a Relation Enhanced Transformer Block (RETB) for image feature learning, which strengthens intra-modality visual relations among objects. Moreover, to bridge the gap between inter-modality feature representations, we align them explicitly via a Visual Guided Alignment (VGA) module. Finally, an end-to-end formulation is adopted to train the whole model jointly. Experiments on the MS-COCO dataset show the effectiveness of our model, which improves all commonly used metrics on the "Karpathy" test split. Extensive ablation experiments are conducted for a comprehensive analysis of the proposed method.
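The record contains no code; as a rough, hypothetical illustration of the intra-modality relation idea behind an RETB-style block (not the authors' implementation), scaled dot-product self-attention over detected object features can be sketched in NumPy as:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(features, wq, wk, wv):
    """Scaled dot-product self-attention over object region features.

    features: (n_objects, d) visual features, e.g. from an object detector
    wq, wk, wv: (d, d) learned projection matrices (random here, for illustration)
    Returns relation-enhanced features of shape (n_objects, d).
    """
    q, k, v = features @ wq, features @ wk, features @ wv
    # Pairwise object-to-object relation scores, normalized per query object.
    scores = q @ k.T / np.sqrt(k.shape[-1])
    return softmax(scores, axis=-1) @ v

rng = np.random.default_rng(0)
n, d = 5, 8  # toy sizes: 5 detected objects, 8-dim features
feats = rng.standard_normal((n, d))
out = self_attention(feats, *(rng.standard_normal((d, d)) for _ in range(3)))
print(out.shape)  # (5, 8)
```

In the actual model this attention would sit inside a transformer block with multiple heads, residual connections, and layer normalization; the sketch shows only the pairwise-relation computation that intra-modality attention performs over object features.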
Pages: 4190-4198 (9 pages)
Related Papers
50 records
  • [1] Dynamic Fusion with Intra- and Inter-modality Attention Flow for Visual Question Answering
    Gao, Peng
    Jiang, Zhengkai
    You, Haoxuan
    Lu, Pan
    Hoi, Steven
    Wang, Xiaogang
    Li, Hongsheng
    2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2019), 2019, : 6632 - 6641
  • [2] Intra- and inter-modality registration of functional and anatomical clinical images
    Eberl, S
    Braun, M
    NEW APPROACHES IN MEDICAL IMAGE ANALYSIS, 1999, 3747 : 102 - 114
  • [3] Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval
    Li, Ruochen
    Li, Nannan
    Wang, Wenmin
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12
  • [4] Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval
    Li, Ruochen
    Li, Nannan
    Wang, Wenmin
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2023, 12 (01)
  • [5] Contrastive Intra- and Inter-Modality Generation for Enhancing Incomplete Multimedia Recommendation
    Lin, Zhenghong
    Tan, Yanchao
    Zhan, Yunfei
    Liu, Weiming
    Wang, Fan
    Chen, Chaochao
    Wang, Shiping
    Yang, Carl
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 6234 - 6242
  • [6] Fusion of Intra- and Inter-modality Algorithms for Face-Sketch Recognition
    Galea, Christian
    Farrugia, Reuben A.
    COMPUTER ANALYSIS OF IMAGES AND PATTERNS, CAIP 2015, PT II, 2015, 9257 : 700 - 711
  • [7] Cross-Modal Image-Recipe Retrieval via Intra- and Inter-Modality Hybrid Fusion
    Li, Jiao
    Sun, Jialiang
    Xu, Xing
    Yu, Wei
    Shen, Fumin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 173 - 182
  • [8] Modeling Both Intra- and Inter-Modality Uncertainty for Multimodal Fake News Detection
    Wei, Lingwei
    Hu, Dou
    Zhou, Wei
    Hu, Songlin
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 7906 - 7916
  • [9] Intra- and Inter-Head Orthogonal Attention for Image Captioning
    Zhang, Xiaodan
    Jia, Aozhe
    Ji, Junzhong
    Qu, Liangqiong
    Ye, Qixiang
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2025, 34 : 594 - 607
  • [10] Semantic-aware matrix factorization hashing with intra- and inter-modality fusion for image-text retrieval
    Shi, Dongxue
    Liu, Zheng
    Gao, Shanshan
    Li, Ang
    APPLIED INTELLIGENCE, 2025, 55 (01)