Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Cited by: 27
Authors
Naseem, Usman [1 ]
Khushi, Matloob [1 ,2 ]
Kim, Jinman [1 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Suffolk, Ipswich IP4 1QJ, Suffolk, England
Keywords
Pathology images; interpretability; visual question answering; vision-language
DOI
10.1109/JBHI.2022.3163751
Chinese Library Classification
TP [Automation Technology, Computer Technology]
Subject Classification Code
0812
Abstract
Pathology visual question answering (PathVQA) attempts to answer a medical question posed about pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interactions between the image (vision) and the question (language) to generate an answer. Existing methods treated vision and language features independently and were therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offered no capability to interpret the retrieved answers, which remain obscure to humans; the interpretability of models in justifying their retrieved answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), in which the transformer's encoder layers are embedded with vision and language features extracted using a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer is then embedded to upsample the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Our experiments also validated the robustness of our model on another medical VQA dataset, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reason for a retrieved answer in PathVQA.
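The pipeline the abstract describes (CNN-extracted image features and LM-extracted question features fused by transformer encoder layers, with a decoder producing the final prediction) can be summarized in a minimal sketch. The PyTorch code below is an illustrative reconstruction, not the authors' released implementation: ResNet-34 is assumed as the pre-trained CNN, externally supplied token embeddings stand in for the domain-specific LM, and the layer sizes, learned answer query, and classifier head are placeholder assumptions.

```python
# Minimal sketch of a TraP-VQA-style pipeline (illustrative; not the authors' code).
# Assumptions: ResNet-34 backbone, LM question embeddings supplied externally,
# placeholder dimensions throughout.
import torch
import torch.nn as nn
from torchvision.models import resnet34

class TraPVQASketch(nn.Module):
    def __init__(self, num_answers, d_model=512, nhead=8, num_layers=4, lm_dim=768):
        super().__init__()
        # Pre-trained CNN; drop pooling/fc to keep a 7x7 grid of region features.
        cnn = resnet34(weights="IMAGENET1K_V1")
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # (B, 512, 7, 7)
        self.vis_proj = nn.Linear(512, d_model)
        # Project the domain-specific LM's token embeddings to the shared width.
        self.lang_proj = nn.Linear(lm_dim, d_model)
        # Transformer encoder over the concatenated vision+language token sequence.
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
        # Decoder attends over the encoded features to build an answer representation.
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=1)
        self.answer_query = nn.Parameter(torch.randn(1, 1, d_model))  # hypothetical learned query
        self.classifier = nn.Linear(d_model, num_answers)

    def forward(self, image, question_emb):
        # image: (B, 3, 224, 224); question_emb: (B, T, lm_dim) LM token embeddings.
        v = self.backbone(image)                # (B, 512, 7, 7)
        v = v.flatten(2).transpose(1, 2)        # (B, 49, 512) region tokens
        tokens = torch.cat([self.vis_proj(v), self.lang_proj(question_emb)], dim=1)
        fused = self.encoder(tokens)            # jointly attended vision-language features
        q = self.answer_query.expand(image.size(0), -1, -1)
        out = self.decoder(q, fused)            # (B, 1, d_model)
        return self.classifier(out.squeeze(1))  # answer logits
```

Concatenating region and token embeddings into a single sequence lets the encoder's self-attention model cross-modal interactions directly, which is the capability the abstract argues independent vision and language pipelines lack; the attention weights of such a fused encoder are also one natural source for the text-and-image visualizations the abstract mentions.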
Pages: 1681-1690
Page count: 10
Related Papers
50 records in total
  • [1] Vision-Language Model for Visual Question Answering in Medical Imagery
    Bazi, Yakoub
    Al Rahhal, Mohamad Mahmoud
    Bashmal, Laila
    Zuair, Mansour
    BIOENGINEERING-BASEL, 2023, 10 (03)
  • [2] Transformer-based vision-language alignment for robot navigation and question answering
    Luo, Haonan
    Guo, Ziyu
    Wu, Zhenyu
    Teng, Fei
    Li, Tianrui
    INFORMATION FUSION, 2024, 108
  • [3] SELF-SUPERVISED VISION-LANGUAGE PRETRAINING FOR MEDICAL VISUAL QUESTION ANSWERING
    Li, Pengfei
    Liu, Gang
    Tan, Lin
    Liao, Jinying
    Zhong, Shenjun
    2023 IEEE 20TH INTERNATIONAL SYMPOSIUM ON BIOMEDICAL IMAGING, ISBI, 2023
  • [4] BIVL-Net: Bidirectional Vision-Language Guidance for Visual Question Answering
    Han, Cong
    Zhang, Feifei
    PATTERN RECOGNITION AND COMPUTER VISION, PT III, PRCV 2024, 2025, 15033: 481-495
  • [5] MiniMedGPT: Efficient Large Vision-Language Model for medical Visual Question Answering
    Alsabbagh, Abdel Rahman
    Mansour, Tariq
    Al-Kharabsheh, Mohammad
    Ebdah, Abdel Salam
    Al-Emaryeen, Roa'a
    Al-Nahhas, Sara
    Mahafza, Waleed
    Al-Kadi, Omar
    PATTERN RECOGNITION LETTERS, 2025, 189: 8-16
  • [6] Vision-language models for medical report generation and visual question answering: a review
    Hartsock, Iryna
    Rasool, Ghulam
    FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2024, 7
  • [7] Surgical-VQLA: Transformer with Gated Vision-Language Embedding for Visual Question Localized-Answering in Robotic Surgery
    Bai, Long
    Islam, Mobarakol
    Seenivasan, Lalithkumar
    Ren, Hongliang
    2023 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND AUTOMATION, ICRA, 2023: 6859-6865
  • [8] Compressing and Debiasing Vision-Language Pre-Trained Models for Visual Question Answering
    Si, Qingyi
    Liu, Yuanxin
    Lin, Zheng
    Fu, Peng
    Cao, Yanan
    Wang, Weiping
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023: 513-529
  • [9] VISION AND TEXT TRANSFORMER FOR PREDICTING ANSWERABILITY ON VISUAL QUESTION ANSWERING
    Le, Tung
    Nguyen, Huy Tien
    Nguyen, Minh Le
    2021 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2021: 934-938
  • [10] Faster, Stronger, and More Interpretable: Massive Transformer Architectures for Vision-Language Tasks
    Chen, Tong
    Liu, Sicong
    Chen, Zhiran
    Hu, Wenyan
    Chen, Dachi
    Wang, Yuanxin
    Lyu, Qi
    Le, Cindy X.
    Wang, Wenping
    ADVANCES IN ARTIFICIAL INTELLIGENCE AND MACHINE LEARNING, 2023, 3 (03): 1369-1388