Multimodal high-order relational network for vision-and-language tasks

Cited by: 6
Authors
Pan, Hao [1 ,2 ]
Huang, Jun [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[2] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China
Funding
National Key R&D Program of China
Keywords
High-order relations; Vision-and-language tasks
DOI
10.1016/j.neucom.2022.03.071
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Vision-and-language tasks require the understanding and learning of visual semantic relations, language syntactic relations and mutual relations between these two modalities. Existing methods only focus on intra-modality low-order relations by simply combining pairwise features while ignoring the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. We thus propose the multimodal high-order relational network (MORN) to simultaneously capture the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. The MORN model consists of three modules. A coarse-to-fine visual relation encoder first captures the fully-connected relations between all visual objects, and then refines the local relations between neighbor objects. Moreover, a textual relation encoder is used to capture the syntactic relations between text words. Finally, a relational multimodal transformer is designed to align the multimodal representations and model sophisticated correlations between textual and visual relations. Our proposed approach shows state-of-the-art performance on two vision-and-language tasks, including visual question answering (VQA) and visual grounding (VG). (c) 2022 Elsevier B.V. All rights reserved.
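
As a rough illustration of the three-module pipeline described in the abstract, the sketch below shows one plausible way to wire a coarse-to-fine visual relation encoder, a textual relation encoder, and a cross-modal fusion transformer together in PyTorch. It is not the authors' MORN implementation: the k-nearest-neighbour local refinement, the plain self-attention text encoder, the feature dimensions, and the VQA answer-head size are all illustrative assumptions.

```python
# Illustrative sketch only: module names, dimensions, the k-NN refinement and the
# answer-head size are assumptions, not the published MORN implementation.
import torch
import torch.nn as nn


class CoarseToFineVisualRelationEncoder(nn.Module):
    """Coarse: self-attention over all objects (fully-connected relations).
    Fine: refine each object with its k spatially nearest neighbours."""

    def __init__(self, dim=512, num_heads=8, k=5):
        super().__init__()
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_proj = nn.Linear(2 * dim, dim)
        self.k = k

    def forward(self, obj_feats, boxes):
        # obj_feats: (B, N, dim) region features; boxes: (B, N, 4) with (cx, cy, w, h)
        coarse, _ = self.global_attn(obj_feats, obj_feats, obj_feats)
        dist = torch.cdist(boxes[..., :2], boxes[..., :2])            # (B, N, N)
        knn = dist.topk(self.k + 1, largest=False).indices[..., 1:]   # drop self
        expanded = coarse.unsqueeze(1).expand(-1, coarse.size(1), -1, -1)
        idx = knn.unsqueeze(-1).expand(-1, -1, -1, coarse.size(-1))
        neighbours = torch.gather(expanded, 2, idx).mean(dim=2)       # local context
        return self.local_proj(torch.cat([coarse, neighbours], dim=-1))


class RelationalMultimodalTransformer(nn.Module):
    """Cross-attends textual relation features to visual relation features."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, text_feats, vis_feats):
        fused, _ = self.cross_attn(text_feats, vis_feats, vis_feats)
        x = self.norm1(text_feats + fused)
        return self.norm2(x + self.ffn(x))


class MORNSketch(nn.Module):
    """Three-module pipeline: visual relation encoder, textual relation encoder,
    relational multimodal fusion, followed by a task head (VQA here)."""

    def __init__(self, vocab_size=10000, dim=512, num_answers=3129):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, dim)
        self.text_attn = nn.MultiheadAttention(dim, 8, batch_first=True)  # stand-in for a syntactic relation encoder
        self.visual_enc = CoarseToFineVisualRelationEncoder(dim)
        self.fusion = RelationalMultimodalTransformer(dim)
        self.answer_head = nn.Linear(dim, num_answers)

    def forward(self, tokens, obj_feats, boxes):
        text = self.word_emb(tokens)
        text, _ = self.text_attn(text, text, text)
        vis = self.visual_enc(obj_feats, boxes)
        fused = self.fusion(text, vis)
        return self.answer_head(fused.mean(dim=1))


# Example: a batch of 2 questions (14 tokens each) over 36 detected regions.
model = MORNSketch()
logits = model(
    torch.randint(0, 10000, (2, 14)),
    torch.randn(2, 36, 512),
    torch.rand(2, 36, 4),
)
print(logits.shape)  # torch.Size([2, 3129])
```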
Pages: 62-75
Number of pages: 14
Related Papers
50 records in total
  • [21] HIGH-ORDER LANGUAGE - HOW HIGH IS UP
    GLASS, RL
    JOURNAL OF SYSTEMS AND SOFTWARE, 1989, 9 (01) : 1 - 2
  • [22] Depth-Aware Vision-and-Language Navigation using Scene Query Attention Network
    Tan, Sinan
    Ge, Mengmeng
    Guo, Di
    Liu, Huaping
    Sun, Fuchun
    2022 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA 2022, 2022, : 9390 - 9396
  • [23] PRGNN: Modeling high-order proximity with relational graph neural network for knowledge graph completion
    Zhu, Danhao
    NEUROCOMPUTING, 2024, 594
  • [24] A Comparison of Pre-trained Vision-and-Language Models for Multimodal Representation Learning across Medical Images and Reports
    Li, Yikuan
    Wang, Hanyin
    Luo, Yuan
    2020 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2020, : 1999 - 2004
  • [25] Multimodal High-Order Relationship Inference Network for Fashion Compatibility Modeling in Internet of Multimedia Things
    Jing, Peiguang
    Cui, Kai
    Zhang, Jing
    Li, Yun
    Su, Yuting
    IEEE INTERNET OF THINGS JOURNAL, 2024, 11 (01) : 353 - 365
  • [26] High-order graph attention network
    He, Liancheng
    Bai, Liang
    Yang, Xian
    Du, Hangyuan
    Liang, Jiye
    INFORMATION SCIENCES, 2023, 630 : 222 - 234
  • [27] A High-Order Double Network Hydrogel
    Xia, Penghui
    Zhang, Wanqi
    Peng, Chaoyi
    Yin, Hanfeng
    Wang, Dan Michelle
    Yang, Jun
    Tuan, Rocky S.
    Jiang, Lei
    Wang, Jianfeng
    MACROMOLECULES, 2024, 57 (23) : 11251 - 11265
  • [28] A Dual Semantic-Aware Recurrent Global-Adaptive Network for Vision-and-Language Navigation
    Wang, Liuyi
    He, Zongtao
    Tang, Jiagui
    Dang, Ronghao
    Wang, Naijia
    Liu, Chengju
    Chen, Qijun
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 1479 - 1487
  • [29] NEW HIGH-ORDER COMPUTER LANGUAGE FOR NAVY
    LOPER, WE
    NAVAL RESEARCH REVIEWS, 1974, 27 (5-6): : 57 - 60
  • [30] A Framework for Vision-Language Warm-up Tasks in Multimodal Dialogue Models
    Lee, Jaewook
    Park, Seongsik
    Park, Seong-Heum
    Kim, Hongjin
    Kim, Harksoo
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 2789 - 2799