Enhanced descriptive captioning model for histopathological patches

Cited by: 0
Authors
Samar Elbedwehy
T. Medhat
Taher Hamza
Mohammed F. Alrahmawy
Affiliations
[1] Kafrelsheikh University,Department of Data Science, Faculty of Artificial Intelligence
[2] Mansoura University,Department of Computer Science, Faculty of Computer and Information Science
[3] Kafrelsheikh University,Department of Electrical Engineering, Faculty of Engineering
Keywords
Image captioning; Medical-images; Word-embedding; Concatenation; Transformer
DOI
None available
Abstract
The interpretation of medical images into natural language is a developing field of artificial intelligence (AI) called image captioning. This field integrates two branches of AI: computer vision and natural language processing. It is a challenging task that goes beyond object recognition, segmentation, and classification, since it demands an understanding of the relationships between the various components of an image and of how those objects function as visual representations. Content-based image retrieval (CBIR) uses an image captioning model to generate captions for a user's query image. The common architecture of medical image captioning systems consists mainly of an image feature extraction subsystem followed by a caption-generation lingual subsystem. In this paper, we aim to build an optimized model for captioning histopathological images of stomach adenocarcinoma endoscopic biopsy specimens. For the image feature extraction subsystem, we performed two evaluations. First, we tested five vision models (VGG, ResNet, PVT, SWIN-Large, and ConvNEXT-Large) with three decoders (LSTM, RNN, and bidirectional RNN), and then compared the vision models under three settings (LSTM without augmentation, LSTM with augmentation, and BioLinkBERT-Large as an embedding layer with augmentation) to identify the most accurate one. Second, we tested three pairwise concatenations of vision models (SWIN-Large, PVT_v2_b5, and ConvNEXT-Large) to determine which pair yields the most expressive extracted feature vector for the image. For the caption-generation lingual subsystem, we compared a pre-trained language embedding model, BioLinkBERT-Large, against LSTM in both evaluations to select the more accurate model. Our experiments showed that a captioning system using the concatenation of ConvNEXT-Large and PVT_v2_b5 as the image feature extractor, combined with the BioLinkBERT-Large language embedding model, produces the best results among all tested combinations.
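The fusion step the abstract describes can be sketched minimally: the pooled feature vectors from the two backbones are concatenated into one joint representation before being passed to the caption-generation subsystem. The vector dimensions below are illustrative assumptions, not the paper's reported sizes.

```python
import numpy as np

# Hypothetical pooled feature vectors from the two vision backbones.
# 1536 and 512 are assumed dimensions for illustration only.
convnext_feat = np.random.rand(1536)  # e.g. ConvNEXT-Large global-pooled features
pvt_feat = np.random.rand(512)        # e.g. PVT_v2_b5 global-pooled features

# Fuse by simple concatenation into a single joint feature vector,
# which would then condition the caption-generation subsystem.
fused = np.concatenate([convnext_feat, pvt_feat])
print(fused.shape)  # (2048,)
```

The concatenated vector preserves both backbones' representations intact, at the cost of a wider input layer in the downstream language model.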
Pages: 36645-36664 (19 pages)
Related papers (showing items 31-40 of 50)
  • [31] Image Captioning with Masked Diffusion Model
    Tian, Weidong
    Xu, Wenzheng
    Zhao, Junxiang
    Zhao, Zhongqiu
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VIII, ICIC 2024, 2024, 14869 : 216 - 227
  • [32] Multiple Videos Captioning Model for Video Storytelling
    Han, Seung-Ho
    Go, Bo-Won
    Choi, Ho-Jin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 355 - 358
  • [33] Triple-level relationship enhanced transformer for image captioning
    Zheng, Anqi
    Zheng, Shiqi
    Bai, Cong
    Chen, Deng
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 1955 - 1966
  • [34] Memory-enhanced hierarchical transformer for video paragraph captioning
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    [J]. NEUROCOMPUTING, 2025, 615
  • [35] Integrating grid features and geometric coordinates for enhanced image captioning
    Zhao, Fengzhi
    Yu, Zhezhou
    Zhao, He
    Wang, Tao
    Bai, Tian
    [J]. APPLIED INTELLIGENCE, 2024, 54 (01) : 231 - 245
  • [36] Style-Enhanced Transformer for Image Captioning in Construction Scenes
    Song, Kani
    Chen, Linlin
    Wang, Hengyou
    [J]. ENTROPY, 2024, 26 (03)
  • [37] A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
    Peng, Jiajia
    Tang, Tianbing
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (06):
  • [38] Semantic Enhanced Video Captioning with Multi-feature Fusion
    Niu, Tian-Zi
    Dong, Shan-Shan
    Chen, Zhen-Duo
    Luo, Xin
    Guo, Shanqing
    Huang, Zi
    Xu, Xin-Shun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (06)
  • [39] Multimodal-enhanced hierarchical attention network for video captioning
    Maosheng Zhong
    Youde Chen
    Hao Zhang
    Hao Xiong
    Zhixiang Wang
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 : 2469 - 2482
  • [40] BENet: bi-directional enhanced network for image captioning
    Yan, Peixin
    Li, Zuoyong
    Hu, Rong
    Cao, Xinrong
    [J]. MULTIMEDIA SYSTEMS, 2024, 30 (01)