Scene Graph Refinement Network for Visual Question Answering

Cited by: 14
Authors
Qian, Tianwen [1]
Chen, Jingjing [1]
Chen, Shaoxiang [1]
Wu, Bo [2]
Jiang, Yu-Gang [1]
Affiliations
[1] Fudan Univ, Shanghai 200437, Peoples R China
[2] MIT IBM Watson AI Lab, Cambridge, MA 02141 USA
Keywords
Visualization; Task analysis; Cognition; Transformers; Feature extraction; Semantics; Noise measurement; Visual Question Answering; Scene Graph; Cross-modal Learning; LANGUAGE; VISION
DOI
10.1109/TMM.2022.3169065
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Visual Question Answering (VQA) aims to answer a free-form natural language question based on the visual clues in a given image. It is a difficult problem because it requires understanding the fine-grained structured information of both language and image for compositional reasoning. To enable compositional reasoning, recent works have introduced scene graphs into VQA. However, the generated scene graphs are usually quite noisy, which greatly limits question answering performance. This paper therefore proposes to refine the scene graphs to improve their effectiveness. Specifically, we present a novel Scene Graph Refinement network (SGR), which introduces a transformer-based refinement network to enhance the object and relation features for better classification. Moreover, as the question provides valuable clues for distinguishing whether the <subject, predicate, object> triplets are helpful or not, the SGR network exploits the semantic information present in the questions to select the most relevant relations for question answering. Extensive experiments on the GQA benchmark demonstrate the effectiveness of our method.
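The question-guided selection of <subject, predicate, object> triplets described in the abstract can be illustrated with a minimal sketch. The abstract does not say how relevance is scored, so everything here is an assumption made for illustration: cosine similarity as the relevance score, a fixed top-k cutoff, the hypothetical function name `select_relevant_triplets`, and the hand-written toy vectors, which stand in for learned question and triplet embeddings. It is not the SGR implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_relevant_triplets(question_vec, triplets, k=2):
    """Keep the k triplets whose embedding is most similar to the question embedding.

    `triplets` is a list of ((subject, predicate, object), embedding) pairs.
    Cosine scoring and top-k selection are illustrative assumptions.
    """
    scored = [(cosine(question_vec, vec), triplet) for triplet, vec in triplets]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [triplet for _, triplet in scored[:k]]


# Toy embeddings, made up for this example only.
question_vec = [1.0, 0.0, 0.2]  # e.g. "What color is the umbrella the man holds?"
triplets = [
    (("man", "holding", "umbrella"), [0.9, 0.1, 0.3]),
    (("sky", "above", "tree"), [0.0, 1.0, 0.0]),
    (("umbrella", "has color", "red"), [0.7, 0.2, 0.1]),
]

selected = select_relevant_triplets(question_vec, triplets, k=2)
print(selected)
```

With these toy vectors, the triplet unrelated to the question ("sky above tree") scores zero and is dropped, which is the kind of noise filtering the abstract attributes to the question-aware selection step.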
Pages: 3950-3961
Page count: 12