Multimodal high-order relational network for vision-and-language tasks

Cited by: 6
Authors
Pan, Hao [1 ,2 ]
Huang, Jun [1 ,2 ]
Affiliations
[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China
[2] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China
Funding
National Key Research and Development Program of China;
Keywords
High-order relations; Vision-and-language tasks;
DOI
10.1016/j.neucom.2022.03.071
Chinese Library Classification (CLC)
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Vision-and-language tasks require the understanding and learning of visual semantic relations, language syntactic relations and mutual relations between these two modalities. Existing methods focus only on intra-modality low-order relations by simply combining pairwise features, while ignoring the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. We thus propose the multimodal high-order relational network (MORN) to simultaneously capture the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. The MORN model consists of three modules. A coarse-to-fine visual relation encoder first captures the fully-connected relations between all visual objects, and then refines the local relations between neighboring objects. Moreover, a textual relation encoder is used to capture the syntactic relations between text words. Finally, a relational multimodal transformer is designed to align the multimodal representations and model sophisticated correlations between textual and visual relations. Our proposed approach shows state-of-the-art performance on two vision-and-language tasks: visual question answering (VQA) and visual grounding (VG). (c) 2022 Elsevier B.V. All rights reserved.
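To make the three-module pipeline in the abstract concrete, here is a minimal PyTorch sketch. Everything below is an illustrative assumption, not the paper's implementation: the class names, the 512-dim features and 8 attention heads, the k-nearest-neighbor mask over box centers for the "fine" pass, and the use of plain nn.MultiheadAttention blocks in place of the paper's actual relation encoders.

```python
import torch
import torch.nn as nn


class CoarseToFineVisualEncoder(nn.Module):
    """Coarse pass: self-attention over all object features (fully-connected
    relations). Fine pass: self-attention restricted to each object's k
    nearest neighbors by box center (local relations). Hypothetical design."""

    def __init__(self, dim: int = 512, num_heads: int = 8, k: int = 5):
        super().__init__()
        self.coarse = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fine = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.num_heads = num_heads
        self.k = k

    def forward(self, feats: torch.Tensor, centers: torch.Tensor) -> torch.Tensor:
        # feats: (B, N, dim) region features; centers: (B, N, 2) box centers.
        x, _ = self.coarse(feats, feats, feats)            # global relations
        dist = torch.cdist(centers, centers)               # (B, N, N) distances
        k = min(self.k, feats.size(1))
        knn = dist.topk(k, largest=False).indices          # (B, N, k), incl. self
        blocked = torch.ones_like(dist, dtype=torch.bool)  # True = masked out
        blocked.scatter_(-1, knn, False)                   # unmask k neighbors
        mask = blocked.repeat_interleave(self.num_heads, dim=0)  # (B*heads, N, N)
        x, _ = self.fine(x, x, x, attn_mask=mask)          # refined local relations
        return x


class TextualRelationEncoder(nn.Module):
    """Self-attention over word embeddings, standing in for the paper's
    syntax-aware textual relation encoder."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, words: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(words, words, words)
        return out


class RelationalMultimodalTransformer(nn.Module):
    """Cross-attention aligning the modalities: text queries attend to
    visual keys/values."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.cross = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        fused, _ = self.cross(text, vision, vision)
        return fused


if __name__ == "__main__":
    feats = torch.randn(2, 36, 512)   # 36 detected regions per image (assumed)
    centers = torch.rand(2, 36, 2)    # normalized box centers (assumed)
    words = torch.randn(2, 14, 512)   # 14 question tokens (assumed)

    v = CoarseToFineVisualEncoder()(feats, centers)
    t = TextualRelationEncoder()(words)
    fused = RelationalMultimodalTransformer()(t, v)
    print(fused.shape)                # torch.Size([2, 14, 512])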
Pages: 62-75
Number of pages: 14
Related papers
50 records in total
  • [1] CLiMB: A Continual Learning Benchmark for Vision-and-Language Tasks
    Srinivasan, Tejas
    Chang, Ting-Yun
    Alva, Leticia Pinto
    Chochlakis, Georgios
    Rostami, Mohammad
    Thomason, Jesse
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [2] Unifying Vision-and-Language Tasks via Text Generation
    Cho, Jaemin
    Lei, Jie
    Tan, Hao
    Bansal, Mohit
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 139, 2021
  • [3] History Aware Multimodal Transformer for Vision-and-Language Navigation
    Chen, Shizhe
    Guhur, Pierre-Louis
    Schmid, Cordelia
    Laptev, Ivan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021
  • [4] Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers
    Frank, Stella
    Bugliarello, Emanuele
    Elliott, Desmond
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021: 9847-9857
  • [5] Joint Multimodal Embedding and Backtracking Search in Vision-and-Language Navigation
    Hwang, Jisu
    Kim, Incheol
    SENSORS, 2021, 21(3): 1-23
  • [6] Vision-and-Language Navigation: A Survey of Tasks, Methods, and Future Directions
    Gu, Jing
    Stefani, Eliana
    Wu, Qi
    Thomason, Jesse
    Wang, Xin Eric
    PROCEEDINGS OF THE 60TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2022), VOL 1 (LONG PAPERS), 2022: 7606-7623
  • [7] Multimodal Text Style Transfer for Outdoor Vision-and-Language Navigation
    Zhu, Wanrong
    Wang, Xin Eric
    Fu, Tsu-Jui
    Yan, An
    Narayana, Pradyumna
    Sone, Kazoo
    Basu, Sugato
    Wang, William Yang
    16TH CONFERENCE OF THE EUROPEAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EACL 2021), 2021: 1207-1221
  • [8] Multimodal Transformer with Variable-Length Memory for Vision-and-Language Navigation
    Lin, Chuang
    Jiang, Yi
    Cai, Jianfei
    Qu, Lizhen
    Haffari, Gholamreza
    Yuan, Zehuan
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696: 380-397
  • [9] Multimodal attention networks for low-level vision-and-language navigation
    Landi, Federico
    Baraldi, Lorenzo
    Cornia, Marcella
    Corsini, Massimiliano
    Cucchiara, Rita
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2021, 210
  • [10] HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
    Zhang, Zhengkun
    Guo, Wenya
    Meng, Xiaojun
    Wang, Yasheng
    Wang, Yadao
    Jiang, Xin
    Liu, Qun
    Yang, Zhenglu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023: 11442-11453