Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Cited by: 27
Authors:
Naseem, Usman [1]
Khushi, Matloob [1,2]
Kim, Jinman [1]
Affiliations:
[1] Univ Sydney, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Suffolk, Ipswich IP4 1QJ, Suffolk, England
Keywords:
Pathology images; interpretability; visual question answering; vision-language
DOI:
10.1109/JBHI.2022.3163751
CLC Number:
TP [Automation Technology, Computer Technology]
Subject Classification:
0812
Abstract:
Pathology visual question answering (PathVQA) attempts to answer a medical question posed about pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interaction between the image (vision) and the question (language) to generate an answer. Existing methods treat vision and language features independently and are therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offer no means of interpreting the retrieved answers, which remain obscure to humans; model interpretability for justifying retrieved answers has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), in which the transformer's encoder layers are fed vision and language features extracted with a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer then upsamples the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Further experiments validated the robustness of our model on another medical VQA dataset, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reasoning behind a retrieved answer in PathVQA.
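The abstract's pipeline (CNN image features and LM question features jointly encoded by a transformer, then a decoder upsampling the encoded features for answer prediction) can be sketched roughly as below. This is a minimal illustrative sketch, not the authors' implementation: all module choices, feature dimensions, and the class name `TraPVQASketch` are assumptions.

```python
# Hypothetical sketch of the pipeline described in the abstract: project
# CNN region features and LM token features into a shared space, encode
# them jointly with a transformer encoder, then "upsample" (widen) the
# pooled representation before classifying over candidate answers.
# All sizes and layer counts are illustrative assumptions.
import torch
import torch.nn as nn


class TraPVQASketch(nn.Module):
    def __init__(self, vis_dim=2048, lang_dim=768, d_model=512, num_answers=100):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, d_model)    # project CNN features
        self.lang_proj = nn.Linear(lang_dim, d_model)  # project LM features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # decoder-style upsampling of the encoded features before prediction
        self.upsample = nn.Sequential(nn.Linear(d_model, 2 * d_model), nn.ReLU())
        self.classifier = nn.Linear(2 * d_model, num_answers)

    def forward(self, vis_feats, lang_feats):
        # vis_feats: (B, Nv, vis_dim) image regions; lang_feats: (B, Nt, lang_dim) tokens
        tokens = torch.cat([self.vis_proj(vis_feats), self.lang_proj(lang_feats)], dim=1)
        encoded = self.encoder(tokens)   # joint vision-language encoding
        pooled = encoded.mean(dim=1)     # pool over all vision + language tokens
        return self.classifier(self.upsample(pooled))  # answer logits


model = TraPVQASketch()
logits = model(torch.randn(2, 36, 2048), torch.randn(2, 12, 768))
print(logits.shape)  # torch.Size([2, 100])
```

Concatenating the projected vision and language tokens before self-attention is one common way to let the encoder model cross-modal interactions; the paper's actual fusion and decoding details may differ.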
Pages: 1681-1690
Page count: 10
Related Papers
(50 items in total)
  • [31] Interpretable Visual Question Answering by Visual Grounding from Attention Supervision Mining
    Zhang, Yundong
    Niebles, Juan Carlos
    Soto, Alvaro
    2019 IEEE WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2019, : 349 - 357
  • [32] Interpretable Complex Question Answering
    Chakrabarti, Soumen
    WEB CONFERENCE 2020: PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE (WWW 2020), 2020, : 2455 - 2457
  • [33] VLT: Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2023, 45 (06) : 7900 - 7916
  • [34] Probabilistic Neural-symbolic Models for Interpretable Visual Question Answering
    Vedantam, Ramakrishna
    Desai, Karan
    Lee, Stefan
    Rohrbach, Marcus
    Batra, Dhruv
    Parikh, Devi
    INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 97, 2019, 97
  • [35] VinVL: Revisiting Visual Representations in Vision-Language Models
    Zhang, Pengchuan
    Li, Xiujun
    Hu, Xiaowei
    Yang, Jianwei
    Zhang, Lei
    Wang, Lijuan
    Choi, Yejin
    Gao, Jianfeng
    2021 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION, CVPR 2021, 2021, : 5575 - 5584
  • [36] Aligning vision-language for graph inference in visual dialog
    Jiang, Tianling
    Shao, Hailin
    Tian, Xin
    Ji, Yi
    Liu, Chunping
    IMAGE AND VISION COMPUTING, 2021, 116
  • [37] Nouns for visual objects: A hypothesis of the vision-language interface
    Ursini, Francesco-Alessio
    Acquaviva, Paolo
    LANGUAGE SCIENCES, 2019, 72 : 50 - 70
  • [38] BRAVE: Broadening the Visual Encoding of Vision-Language Models
    Kar, Oguzhan Fatih
    Tonioni, Alessio
    Poklukar, Petra
    Kulshrestha, Achin
    Zamir, Amir
    Tombari, Federico
    COMPUTER VISION - ECCV 2024, PT XVI, 2025, 15074 : 113 - 132
  • [39] SurgicalGPT: End-to-End Language-Vision GPT for Visual Question Answering in Surgery
    Seenivasan, Lalithkumar
    Islam, Mobarakol
    Kannan, Gokul
    Ren, Hongliang
    MEDICAL IMAGE COMPUTING AND COMPUTER ASSISTED INTERVENTION, MICCAI 2023, PT IX, 2023, 14228 : 281 - 290
  • [40] Bilaterally Slimmable Transformer for Elastic and Efficient Visual Question Answering
    Yu, Zhou
    Jin, Zitian
    Yu, Jun
    Xu, Mingliang
    Wang, Hongbo
    Fan, Jianping
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 9543 - 9556