Vision-Language Transformer for Interpretable Pathology Visual Question Answering

Cited by: 27
Authors
Naseem, Usman [1 ]
Khushi, Matloob [1 ,2 ]
Kim, Jinman [1 ]
Affiliations
[1] Univ Sydney, Sch Comp Sci, Camperdown, NSW 2006, Australia
[2] Univ Suffolk, Ipswich IP4 1QJ, Suffolk, England
Keywords
Pathology images; interpretability; visual question answering; vision-language
DOI
10.1109/JBHI.2022.3163751
Chinese Library Classification (CLC)
TP [Automation Technology, Computer Technology]
Discipline Classification Code
0812
Abstract
Pathology visual question answering (PathVQA) aims to answer medical questions posed about pathology images. Despite its great potential in healthcare, PathVQA is not widely adopted because generating an answer requires joint reasoning over both the image (vision) and the question (language). Existing methods treat vision and language features independently and are therefore unable to capture the high- and low-level interactions required for VQA. Moreover, these methods offer no capability to interpret the retrieved answers, which remain obscure to humans; model interpretability for justifying retrieved answers has been largely unexplored. Motivated by these limitations, we introduce a vision-language transformer that embeds vision (image) and language (question) features for interpretable PathVQA. We present an interpretable transformer-based PathVQA model (TraP-VQA), whose transformer encoder layers are fed vision and language features extracted with a pre-trained CNN and a domain-specific language model (LM), respectively. A decoder layer then upsamples the encoded features for the final PathVQA prediction. Our experiments showed that TraP-VQA outperformed state-of-the-art comparative methods on the public PathVQA dataset. Experiments on another medical VQA dataset validated the robustness of our model, and an ablation study demonstrated the capability of our integrated transformer-based vision-language model for PathVQA. Finally, we present visualization results for both text and images, which explain the reasoning behind a retrieved answer in PathVQA.
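The pipeline the abstract describes (CNN visual features plus domain-specific LM question features fused in a transformer encoder, followed by a prediction stage) lends itself to a compact sketch. The following is a minimal, hypothetical PyTorch rendering of that idea, not the authors' published code: visual tokens come from a CNN feature grid, question-token embeddings from a domain-specific LM are assumed to be precomputed, a shared transformer encoder attends across both modalities, and the paper's upsampling decoder is simplified here to mean pooling plus a linear answer classifier. All names and dimensions are illustrative.

import torch
import torch.nn as nn
from torchvision.models import resnet50

class TraPVQASketch(nn.Module):
    """Illustrative joint vision-language transformer in the spirit of TraP-VQA."""
    def __init__(self, lang_dim=768, d_model=512, nhead=8,
                 num_layers=4, num_answers=1000):
        super().__init__()
        # Vision branch: CNN feature grid used as visual tokens.
        # TraP-VQA uses a pre-trained CNN; load pretrained weights in practice.
        cnn = resnet50(weights=None)
        self.backbone = nn.Sequential(*list(cnn.children())[:-2])  # (B, 2048, 7, 7)
        self.vis_proj = nn.Linear(2048, d_model)
        # Language branch: token embeddings from a domain-specific LM
        # (e.g. a biomedical BERT) assumed precomputed: (B, T, lang_dim).
        self.lang_proj = nn.Linear(lang_dim, d_model)
        # One encoder over the concatenated token sequence, so self-attention
        # captures high- and low-level vision-language interactions jointly.
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        # Stand-in for the paper's upsampling decoder: pool and classify.
        self.head = nn.Linear(d_model, num_answers)

    def forward(self, images, lang_feats):
        v = self.backbone(images)            # (B, 2048, 7, 7)
        v = v.flatten(2).transpose(1, 2)     # (B, 49, 2048): one token per region
        tokens = torch.cat([self.vis_proj(v), self.lang_proj(lang_feats)], dim=1)
        fused = self.encoder(tokens)         # joint cross-modal self-attention
        return self.head(fused.mean(dim=1))  # (B, num_answers) answer logits

# Smoke test with random inputs: 2 images, 12 question tokens each.
model = TraPVQASketch()
logits = model(torch.randn(2, 3, 224, 224), torch.randn(2, 12, 768))
print(logits.shape)  # torch.Size([2, 1000])

Because the image and question tokens share one encoder, the attention weights between modalities can be read out directly, which is what makes visualization-based explanations of the kind the abstract mentions possible.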
Pages: 1681-1690
Page count: 10
Related Papers
50 records in total
  • [42] A Transformer-based Medical Visual Question Answering Model
    Liu, Lei
    Su, Xiangdong
    Guo, Hui
    Zhu, Daobin
    2022 26TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2022, : 1712 - 1718
  • [43] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [44] PathologyVLM: a large vision-language model for pathology image understanding
    Dai, Dawei
    Zhang, Yuanhui
    Yang, Qianlan
    Xu, Long
    Shen, Xiaojing
    Xia, Shuyin
    Wang, Guoyin
    ARTIFICIAL INTELLIGENCE REVIEW, 58 (6)
  • [45] TRAR: Routing the Attention Spans in Transformer for Visual Question Answering
    Zhou, Yiyi
    Ren, Tianhe
    Zhu, Chaoyang
    Sun, Xiaoshuai
    Liu, Jianzhuang
    Ding, Xinghao
    Xu, Mingliang
    Ji, Rongrong
    2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2021), 2021, : 2054 - 2064
  • [46] Transformer Module Networks for Systematic Generalization in Visual Question Answering
    Yamada, Moyuru
    D'Amario, Vanessa
    Takemoto, Kentaro
    Boix, Xavier
    Sasaki, Tomotake
    IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2024, 46 (12) : 10096 - 10105
  • [47] Multiview Language Bias Reduction for Visual Question Answering
    Li, Pengju
    Tan, Zhiyi
    Bao, Bing-Kun
    IEEE MULTIMEDIA, 2023, 30 (01) : 91 - 99
  • [48] An Empirical Study on the Language Modal in Visual Question Answering
    Peng, Daowan
    Wei, Wei
    Mao, Xian-Ling
    Fu, Yuanyuan
    Chen, Dangyang
    PROCEEDINGS OF THE THIRTY-SECOND INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2023, 2023, : 4109 - 4117
  • [49] Language Transformers for Remote Sensing Visual Question Answering
    Chappuis, Christel
    Mendez, Vincent
    Walt, Eliot
    Lobry, Sylvain
    Le Saux, Bertrand
    Tuia, Devis
    2022 IEEE INTERNATIONAL GEOSCIENCE AND REMOTE SENSING SYMPOSIUM (IGARSS 2022), 2022, : 4855 - 4858
  • [50] Target-Driven Structured Transformer Planner for Vision-Language Navigation
    Zhao, Yusheng
    Chen, Jinyu
    Gao, Chen
    Wang, Wenguan
    Yang, Lirong
    Ren, Haibing
    Xia, Huaxia
    Liu, Si
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2022, 2022, : 4194 - 4203