LOIS: Looking Out of Instance Semantics for Visual Question Answering

Cited by: 0
Authors
Zhang, Siyu [1]
Chen, Yeming [1]
Sun, Yaoru [1]
Wang, Fang [2]
Shi, Haibo [3]
Wang, Haoran [1]
Affiliations
[1] Tongji Univ, Dept Comp Sci & Technol, Shanghai 201804, Peoples R China
[2] Brunel Univ, Dept Comp Sci, Uxbridge UB8 3PH, England
[3] Shanghai Univ Finance & Econ, Sch Stat & Management, Shanghai 200433, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Visual question answering (VQA); instance semantics; visual features; multimodal relation attention; ATTENTION; NETWORK;
DOI
10.1109/TMM.2023.3347093
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Visual question answering (VQA) has been intensively studied as a multimodal task that requires bridging vision and language to infer the correct answer. Recent work has developed various attention-based modules for solving VQA tasks. However, model inference is largely bottlenecked by visual semantic comprehension. Most existing detection methods rely on bounding boxes, so it remains a serious challenge for VQA models to comprehend and correctly infer the causal nexus of contextual object semantics in images. To this end, we propose a finer model framework without bounding boxes, termed Looking Out of Instance Semantics (LOIS), to address this crucial issue. LOIS can achieve more fine-grained feature descriptions to generate visual facts. Furthermore, to overcome the label ambiguity caused by instance masks, two types of relation attention modules, 1) intra-modality and 2) inter-modality, are devised to infer the correct answers from different visual features. Specifically, we implement a mutual relation attention module to model sophisticated and deeper visual semantic relations between instance objects and background information. In addition, our proposed attention model can further analyze salient image regions by focusing on important word-related questions. Experimental results on four benchmark VQA datasets show that our proposed method performs favorably in improving visual reasoning capability.
Pages: 6202-6214
Page count: 13