Multimodal high-order relational network for vision-and-language tasks

Cited by: 6

Authors:

Pan, Hao [1,2]

Huang, Jun [1,2]

Affiliations:

[1] Univ Chinese Acad Sci, Beijing 100049, Peoples R China

[2] Chinese Acad Sci, Shanghai Adv Res Inst, Shanghai 201210, Peoples R China

Source:

NEUROCOMPUTING | 2022 / Vol. 492

Funding:

National Key Research and Development Program of China;

Keywords:

High-order relations; Vision-and-language tasks;

DOI:

10.1016/j.neucom.2022.03.071

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Vision-and-language tasks require the understanding and learning of visual semantic relations, language syntactic relations, and mutual relations between these two modalities. Existing methods focus only on intra-modality low-order relations by simply combining pairwise features, while ignoring the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. We thus propose the multimodal high-order relational network (MORN) to simultaneously capture the intra-modality high-order relations and the sophisticated correlations between visual and textual relations. The MORN model consists of three modules. A coarse-to-fine visual relation encoder first captures the fully-connected relations between all visual objects, and then refines the local relations between neighboring objects. Moreover, a textual relation encoder is used to capture the syntactic relations between text words. Finally, a relational multimodal transformer is designed to align the multimodal representations and model sophisticated correlations between textual and visual relations. Our proposed approach shows state-of-the-art performance on two vision-and-language tasks, namely visual question answering (VQA) and visual grounding (VG). (c) 2022 Elsevier B.V. All rights reserved.

Pages: 62-75

Page count: 14
