Cross-modality Multiple Relations Learning for Knowledge-based Visual Question Answering

Cited by: 1
Authors
Wang, Yan [1 ,2 ]
Li, Peize [3 ]
Si, Qingyi [4 ,5 ]
Zhang, Hanwen [4 ,5 ]
Zang, Wenyu [6 ]
Lin, Zheng [4 ,5 ]
Fu, Peng [4 ,5 ]
Affiliations
[1] Jilin Univ, Coll Comp Sci & Technol, Sch Artificial Intelligence, Changchun 130012, Peoples R China
[2] Jilin Univ, Coll Comp Sci & Technol, Minist Educ, Key Lab Symbol Comp & Knowledge Engn, Changchun 130012, Peoples R China
[3] Jilin Univ, Sch Artificial Intelligence, Changchun 130012, Peoples R China
[4] Chinese Acad Sci, Inst Informat Engn, Beijing 100049, Peoples R China
[5] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing 100049, Peoples R China
[6] China Elect Corp, Beijing 100846, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Cross-modality relation; external knowledge; visual question answering;
DOI
10.1145/3618301
CLC number
TP [Automation Technology, Computer Technology];
Subject classification code
0812;
Abstract
Knowledge-based visual question answering must not only answer questions grounded in images but also incorporate external knowledge to support reasoning in the joint space of vision and language. To bridge the gap between visual content and semantic cues, it is important to capture question-related and semantics-rich vision-language connections. Most existing solutions model only simple intra-modality relations or represent a cross-modality relation with a single vector, which makes it difficult to model the complex connections between visual features and question features. We therefore propose a cross-modality multiple relations learning model that enriches cross-modality representations and constructs advanced multi-modality knowledge triplets. First, we design a simple yet effective method to generate multiple relations that capture the rich cross-modality connections; these relations link the textual question to the related visual objects, and the resulting multi-modality triplets efficiently align visual objects with their corresponding textual answers. Second, to encourage the multiple relations to better align with different semantic relations, we formulate a novel global-local loss: the global loss draws visual objects and their corresponding textual answers close to each other through cross-modality relations in the vision-language space, while the local loss preserves semantic diversity among the multiple relations. Experimental results on the Outside Knowledge VQA and Knowledge-Routed Visual Question Reasoning datasets demonstrate that our model outperforms state-of-the-art methods.
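The abstract describes the global-local loss only at a high level; the paper (DOI above) gives the exact formulation. As a rough, non-authoritative illustration of the idea, the sketch below pairs a translation-style triplet term in PyTorch (the global pull of object-plus-relation toward the answer embedding) with a pairwise cosine-similarity penalty that keeps the multiple relation vectors diverse (the local term). The function name, tensor shapes, margin, weight alpha, and in-batch negative sampling are all assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def global_local_loss(obj_emb, rel_emb, ans_emb, margin=1.0, alpha=0.1):
    """Hedged sketch of a global-local loss over (object, relation, answer) triplets.

    Assumed shapes (hypothetical): obj_emb (B, D) visual-object embeddings,
    rel_emb (B, K, D) with K cross-modality relation vectors per example,
    ans_emb (B, D) textual-answer embeddings.
    """
    B, K, D = rel_emb.shape

    # Global term: score object + relation against the answer embedding
    # (translation-style, as in TransE) with a margin over simple
    # in-batch negatives, pulling matched pairs together across modalities.
    pred = obj_emb.unsqueeze(1) + rel_emb                 # (B, K, D)
    pos = (pred - ans_emb.unsqueeze(1)).norm(dim=-1)      # (B, K)
    neg_ans = ans_emb.roll(shifts=1, dims=0)              # shifted answers as negatives
    neg = (pred - neg_ans.unsqueeze(1)).norm(dim=-1)      # (B, K)
    global_loss = F.relu(margin + pos - neg).mean()

    # Local term: penalize pairwise cosine similarity between the K relation
    # vectors so the multiple relations stay semantically diverse.
    r = F.normalize(rel_emb, dim=-1)                      # unit-norm relations
    sim = torch.bmm(r, r.transpose(1, 2))                 # (B, K, K) cosine matrix
    off_diag = sim - torch.eye(K, device=sim.device)      # zero out the diagonal
    local_loss = off_diag.clamp(min=0).sum(dim=(1, 2)).mean() / max(K * (K - 1), 1)

    return global_loss + alpha * local_loss

# Usage with hypothetical dimensions: batch of 8, K=4 relations, 512-dim embeddings.
loss = global_local_loss(torch.randn(8, 512),
                         torch.randn(8, 4, 512),
                         torch.randn(8, 512))
```

The split mirrors the abstract's intent: the global term acts across modalities on whole triplets, while the local term acts only within the set of relation vectors, so strengthening alignment does not collapse the multiple relations onto a single direction.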
Pages: 22