Cross-modal learning with prior visual relation knowledge

Cited by: 6
Authors
Yu, Jing [1 ,3 ]
Zhang, Weifeng [2 ]
Yang, Zhuoqian [4 ]
Qin, Zengchang [4 ]
Hu, Yue [1 ,3 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Jiaxing Univ, Coll Math Phys & Informat Engn, Jiaxing City, Zhejiang, Peoples R China
[3] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
[4] Beihang Univ, Sch ASEE, Intelligent Comp & Machine Learning Lab, Beijing, Peoples R China
Keywords
Visual relation reasoning; Relation embedding; Anisotropic graph convolutional networks; Visual question answering; Cross-modal information retrieval;
DOI
10.1016/j.knosys.2020.106150
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Visual relational reasoning is a central component in recent cross-modal analysis tasks; it aims to reason about the visual relationships between objects and their properties. These relationships provide rich semantics and help to enhance visual representations for improved cross-modal learning. Previous works have succeeded in modeling latent visual relationships or rigidly categorized visual relationships. However, such methods overlook the ambiguity inherent in visual relationships, which arises from the diverse relational semantics of different visual appearances. In this work, we model visual relationships with context-aware representations based on human prior knowledge. On top of such representations, we propose a novel plug-and-play visual relational reasoning module to enhance image encoding. Specifically, we design an Anisotropic Graph Convolution that exploits relation embeddings and the directionality of relations between objects to generate relation-aware image representations. We demonstrate the effectiveness of the relational reasoning module by applying it to both Visual Question Answering (VQA) and Cross-Modal Information Retrieval (CMIR) tasks. Extensive experiments are conducted on the VQA 2.0 and CMPlaces datasets, and superior performance is reported compared with state-of-the-art works. (C) 2020 Published by Elsevier B.V.
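For readers unfamiliar with the idea, the anisotropic graph convolution mentioned in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch only, not the authors' implementation: the direction-specific weight matrices (`W_in`, `W_out`), the relation-embedding projection (`W_rel`), the self-loop transform, and the mean aggregation are all assumptions made for illustration. The key contrast with an isotropic GCN is that messages here depend on the edge's relation embedding and on which direction the edge is traversed.

```python
import numpy as np

def anisotropic_graph_conv(X, edges, R, W_in, W_out, W_rel, W_self):
    """One anisotropic graph-convolution step (illustrative sketch).

    X      : (N, d)  node (object region) features
    edges  : list of directed edges (src, dst)
    R      : (E, r)  relation embedding for each edge
    W_in, W_out : (d, d) direction-specific transforms (assumed)
    W_rel  : (r, d)  projects relation embeddings into feature space
    W_self : (d, d)  self-loop transform
    """
    N, d = X.shape
    out = X @ W_self                   # self contribution
    deg = np.ones(N)                   # count the self-loop for normalization
    for e, (s, t) in enumerate(edges):
        rel = R[e] @ W_rel             # relation-conditioned message term
        out[t] += X[s] @ W_out + rel   # message along the edge direction
        out[s] += X[t] @ W_in + rel    # message against the edge direction
        deg[t] += 1
        deg[s] += 1
    out /= deg[:, None]                # mean aggregation
    return np.maximum(out, 0.0)        # ReLU nonlinearity
```

Because the message weights differ by edge direction and relation, two objects connected by, say, "holding" versus "next to" exchange different messages, which is what makes the convolution anisotropic rather than uniform over neighbors.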
Pages: 12
Related Papers
(50 records in total)
  • [1] Learning Visual Locomotion with Cross-Modal Supervision
    Loquercio, Antonio
    Kumar, Ashish
    Malik, Jitendra
    [J]. 2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION (ICRA 2023), 2023, : 7295 - 7302
  • [2] Towards Bridged Vision and Language: Learning Cross-Modal Knowledge Representation for Relation Extraction
    Feng, Junhao
    Wang, Guohua
    Zheng, Changmeng
    Cai, Yi
    Fu, Ze
    Wang, Yaowei
    Wei, Xiao-Yong
    Li, Qing
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (01) : 561 - 575
  • [3] Visual context learning based on cross-modal knowledge for continuous sign language recognition
    Liu, Kailin
    Hou, Yonghong
    Guo, Zihui
    Yin, Wenjie
    Ren, Yi
    [J]. VISUAL COMPUTER, 2024,
  • [4] Learning Relation Alignment for Calibrated Cross-modal Retrieval
    Ren, Shuhuai
    Lin, Junyang
    Zhao, Guangxiang
    Men, Rui
    Yang, An
    Zhou, Jingren
    Sun, Xu
    Yang, Hongxia
    [J]. 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 1 (ACL-IJCNLP 2021), 2021, : 514 - 524
  • [5] Learning Cross-Modal Context Graph for Visual Grounding
    Liu, Yongfei
    Wan, Bo
    Zhu, Xiaodan
    He, Xuming
    [J]. THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 11645 - 11652
  • [6] CROSS-MODAL LEARNING OF AUDITORY AND VISUAL RHYTHMS IN MAN
    COLE, M
    ETTLINGER, G
    CHOROVER, S
    [J]. BULLETIN OF THE BRITISH PSYCHOLOGICAL SOCIETY, 1961, (44): : A13 - A13
  • [7] DistilVPR: Cross-Modal Knowledge Distillation for Visual Place Recognition
    Wang, Sijie
    She, Rui
    Kang, Qiyu
    Jian, Xingchao
    Zhao, Kai
    Song, Yang
    Tay, Wee Peng
    [J]. THIRTY-EIGHTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 38 NO 9, 2024, : 10377 - 10385
  • [8] Cross-modal knowledge reasoning for knowledge-based visual question answering
    Yu, Jing
    Zhu, Zihao
    Wang, Yujing
    Zhang, Weifeng
    Hu, Yue
    Tan, Jianlong
    [J]. PATTERN RECOGNITION, 2020, 108
  • [9] Explicit Cross-Modal Representation Learning for Visual Commonsense Reasoning
    Zhang, Xi
    Zhang, Feifei
    Xu, Changsheng
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2022, 24 : 2986 - 2997
  • [10] Cross-Modal Learning for Audio-Visual Video Parsing
    Lamba, Jatin
    Abhishek
    Akula, Jayaprakash
    Dabral, Rishabh
    Jyothi, Preethi
    Ramakrishnan, Ganesh
    [J]. INTERSPEECH 2021, 2021, : 1937 - 1941