Question-guided feature pyramid network for medical visual question answering

Cited: 5
Authors
Yu, Yonglin [1]
Li, Haifeng [1]
Shi, Hanrong [2]
Li, Lin [2]
Xiao, Jun [1,2]
Affiliations
[1] Zhejiang Univ, Childrens Hosp, Natl Clin Res Ctr Child Hlth, Sch Med, Dept Rehabil, Hangzhou, Peoples R China
[2] Zhejiang Univ, Coll Comp Sci & Technol, Hangzhou, Peoples R China
Funding
Natural Science Foundation of Zhejiang Province; National Natural Science Foundation of China
Keywords
Visual question answering; Feature pyramid network; Dynamic filter network
DOI
10.1016/j.eswa.2022.119148
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Medical VQA (VQA-Med) is a critical multi-modal task that has attracted increasing attention from the community. Existing models use only one high-level feature map (i.e., the last-layer feature map) extracted by a CNN and then fuse it with semantic features through a co-attention mechanism. However, relying solely on the high-level feature as the visual representation often ignores details of the image that are crucial for VQA-Med. In addition, the question often serves as a guide to the target of attention in the medical image. Therefore, in this paper, we propose a question-guided Feature Pyramid Network (QFPN) for VQA-Med. It extracts multi-level visual features with a feature pyramid network (FPN), so that the multi-scale information of medical images is captured through the high resolution of low-level features and the rich semantic information of high-level features. Besides, a novel question-guided dynamic filter network (DFN) is designed to modulate the fusion process of multi-level visual features and semantic features with respect to the raised question. Extensive results demonstrate the effectiveness of the QFPN. In particular, we beat the winner of the ImageCLEF 2019 challenge, achieving 63.8% accuracy and 65.7% BLEU on the ImageCLEF 2019 VQA-Med dataset.
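For illustration, the question-guided fusion described in the abstract can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions (pooled per-level FPN features, a fixed-size question embedding, channel-wise dynamic filters); all names such as QuestionGuidedDFN and filter_gen are hypothetical and do not come from the authors' code.

import torch
import torch.nn as nn

class QuestionGuidedDFN(nn.Module):
    # Sketch: the question embedding generates one channel-wise filter per
    # pyramid level, which modulates that level's visual feature before the
    # levels are summed into a single question-guided visual representation.
    def __init__(self, q_dim: int, v_dim: int, num_levels: int):
        super().__init__()
        self.num_levels = num_levels
        self.v_dim = v_dim
        # Dynamic filter generator: question embedding -> one filter per level.
        self.filter_gen = nn.Linear(q_dim, num_levels * v_dim)

    def forward(self, q_emb, pyramid_feats):
        # q_emb: (B, q_dim); pyramid_feats: list of num_levels tensors of
        # shape (B, v_dim), e.g., globally pooled FPN outputs.
        filters = self.filter_gen(q_emb).view(-1, self.num_levels, self.v_dim)
        fused = torch.zeros_like(pyramid_feats[0])
        for lvl, feat in enumerate(pyramid_feats):
            # Channel-wise modulation conditioned on the question.
            fused = fused + torch.sigmoid(filters[:, lvl]) * feat
        return fused  # (B, v_dim) question-guided multi-scale visual feature

# Usage sketch: 4 pyramid levels pooled to 512-d, 1024-d question embedding.
dfn = QuestionGuidedDFN(q_dim=1024, v_dim=512, num_levels=4)
q_emb = torch.randn(2, 1024)
feats = [torch.randn(2, 512) for _ in range(4)]
print(dfn(q_emb, feats).shape)  # torch.Size([2, 512])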
Pages: 8
Related papers
50 records in total
  • [41] Triple attention network for sentimental visual question answering
    Ruwa, Nelson
    Mao, Qirong
    Song, Heping
    Jia, Hongjie
    Dong, Ming
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2019, 189
  • [42] Scene Graph Refinement Network for Visual Question Answering
    Qian, Tianwen
    Chen, Jingjing
    Chen, Shaoxiang
    Wu, Bo
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3950 - 3961
  • [43] Fair Attention Network for Robust Visual Question Answering
    Bi, Y.
    Jiang, H.
    Hu, Y.
    Sun, Y.
    Yin, B.
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2024, 34 (09) : 1 - 1
  • [44] Question-guided stubborn set methods for state properties
    Kristensen, L. M.
    Schmidt, K.
    Valmari, A.
    FORMAL METHODS IN SYSTEM DESIGN, 2006, 29 : 215 - 251
  • [45] TRANS-VQA: Fully Transformer-Based Image Question-Answering Model Using Question-guided Vision Attention
    Koshti, Dipali
    Gupta, Ashutosh
    Kalla, Mukesh
    Sharma, Arvind
    INTELIGENCIA ARTIFICIAL-IBEROAMERICAN JOURNAL OF ARTIFICIAL INTELLIGENCE, 2024, 27 (73) : 111 - 128
  • [46] Locate Before Answering: Answer Guided Question Localization for Video Question Answering
    Qian, Tianwen
    Cui, Ran
    Chen, Jingjing
    Peng, Pai
    Guo, Xiaowei
    Jiang, Yu-Gang
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 4554 - 4563
  • [48] Deep Fuzzy Multi-Teacher Distillation Network for Medical Visual Question Answering
    Liu, Y.
    Chen, B.
    Wang, S.
    Lu, G.
    Zhang, Z.
    IEEE TRANSACTIONS ON FUZZY SYSTEMS, 2024, 32 (10) : 1 - 15
  • [49] Medical knowledge-based network for Patient-oriented Visual Question Answering
    Huang, Jian
    Chen, Yihao
    Li, Yong
    Yang, Zhenguo
    Gong, Xuehao
    Wang, Fu Lee
    Xu, Xiaohong
    Liu, Wenyin
    INFORMATION PROCESSING & MANAGEMENT, 2023, 60 (02)
  • [50] VQA: Visual Question Answering
    Antol, Stanislaw
    Agrawal, Aishwarya
    Lu, Jiasen
    Mitchell, Margaret
    Batra, Dhruv
    Zitnick, C. Lawrence
    Parikh, Devi
    2015 IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV), 2015, : 2425 - 2433