Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Cited by: 27
Authors
Naseem, Usman [1]
Khushi, Matloob [1,2]
Kim, Jinman [1]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Suffolk, Ipswich IP4 1QJ, Suffolk, England
Keywords
Pathology images; interpretability; visual question answering; vision-language
DOI
10.1109/JBHI.2022.3163751
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline code
0812
Abstract
Pathology visual question answering (PathVQA) attempts to answer a medical question posed about pathology images. Despite its great potential in healthcare, it is not widely adopted because it requires interactions between the image (vision) and the question (language) to generate an answer. Existing methods treated vision and language features independently and were therefore unable to capture the high- and low-level interactions required for VQA. Further, these methods offered no capability to interpret the retrieved answers, which remain obscure to humans; the models' ability to justify a retrieved answer has remained largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), in which the transformer's encoder layers are fed vision and language features extracted by a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer then upsamples the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Experiments on another medical VQA dataset validated the robustness of our model, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reason for a retrieved answer in PathVQA.
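The cross-modal fusion described in the abstract, where CNN-derived vision tokens and LM-derived question tokens are encoded jointly so the encoder can attend across both modalities, can be sketched minimally as below. All dimensions, weights, and names here are illustrative placeholders under assumed shapes, not the paper's actual TraP-VQA architecture or trained parameters:

```python
import numpy as np

# Hypothetical dimensions, for illustration only (not from the paper).
d_model = 64          # shared embedding size for both modalities
n_vision_tokens = 49  # e.g. a 7x7 CNN feature map flattened into tokens
n_text_tokens = 16    # question tokens from a language model
n_answers = 10        # size of the answer vocabulary

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over one token sequence."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(tokens.shape[-1]))
    return scores @ v

# Stand-ins for features from a pre-trained CNN and a domain-specific LM,
# both assumed already projected into the same d_model-dimensional space.
vision_tokens = rng.standard_normal((n_vision_tokens, d_model))
text_tokens = rng.standard_normal((n_text_tokens, d_model))

# Joint sequence: the encoder attends across both modalities at once,
# which is where cross-modal (vision-language) interactions are captured.
joint = np.concatenate([vision_tokens, text_tokens], axis=0)

Wq, Wk, Wv = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))
encoded = self_attention(joint, Wq, Wk, Wv)

# Pool and classify over the answer vocabulary (a crude stand-in for the
# decoder/prediction head the abstract mentions).
W_out = rng.standard_normal((d_model, n_answers)) * 0.1
logits = encoded.mean(axis=0) @ W_out
answer_id = int(np.argmax(logits))
```

In a real implementation the attention weights over the joint sequence are also what enables the interpretability visualizations the abstract describes: the rows of `scores` indicate which image regions and question words each token attended to.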
Pages: 1681-1690
Page count: 10
Related papers
50 in total
  • [21] Vision-language AI assistance in human pathology
    Marchal, Iris
    NATURE BIOTECHNOLOGY, 2024, 42 (07) : 1027 - 1027
  • [22] CLVIN: Complete language-vision interaction network for visual question answering
    Chen, Chongqing
    Han, Dezhi
    Shen, Xiang
    KNOWLEDGE-BASED SYSTEMS, 2023, 275
  • [23] ChatFFA: An ophthalmic chat system for unified vision-language understanding and question answering for fundus fluorescein angiography
    Chen, Xiaolan
    Xu, Pusheng
    Li, Yao
    Zhang, Weiyi
    Song, Fan
    He, Mingguang
    Shi, Danli
    ISCIENCE, 2024, 27 (07)
  • [24] Towards Visual Question Answering on Pathology Images
    He, Xuehai
    Cai, Zhuo
    Wei, Wenlan
    Zhang, Yichen
    Mou, Luntian
    Xing, Eric
    Xie, Pengtao
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021, : 708 - 718
  • [25] Object-less Vision-language Model on Visual Question Classification for Blind People
    Le, Tung
    Pho, Khoa
    Thong Bui
    Huy Tien Nguyen
    Minh Le Nguyen
    ICAART: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON AGENTS AND ARTIFICIAL INTELLIGENCE - VOL 3, 2022, : 180 - 187
  • [26] Vision-Language Transformer and Query Generation for Referring Segmentation
    Ding, Henghui
    Liu, Chang
    Wang, Suchen
    Jiang, Xudong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 16301 - 16310
  • [27] Learning Conditioned Graph Structures for Interpretable Visual Question Answering
    Norcliffe-Brown, Will
    Vafeias, Efstathios
    Parisot, Sarah
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [28] LANGUAGE AND VISUAL RELATIONS ENCODING FOR VISUAL QUESTION ANSWERING
    Liu, Fei
    Liu, Jing
    Fang, Zhiwei
    Lu, Hanqing
    2019 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING (ICIP), 2019, : 3307 - 3311
  • [29] RS-LLaVA: A Large Vision-Language Model for Joint Captioning and Question Answering in Remote Sensing Imagery
    Bazi, Yakoub
    Bashmal, Laila
    Al Rahhal, Mohamad Mahmoud
    Ricci, Riccardo
    Melgani, Farid
    REMOTE SENSING, 2024, 16 (09)
  • [30] Advancing Vietnamese Visual Question Answering with Transformer and Convolutional
    Nguyen, Ngoc Son
    Nguyen, Van Son
    Le, Tung
    COMPUTERS & ELECTRICAL ENGINEERING, 2024, 119