Enhanced descriptive captioning model for histopathological patches

Cited by: 0
Authors
Samar Elbedwehy
T. Medhat
Taher Hamza
Mohammed F. Alrahmawy
Affiliations
[1] Kafrelsheikh University,Department of Data Science, Faculty of Artificial Intelligence
[2] Mansoura University,Department of Computer Science, Faculty of Computer and Information Science
[3] Kafrelsheikh University,Department of Electrical Engineering, Faculty of Engineering
Keywords
Image captioning; Medical-images; Word-embedding; Concatenation; Transformer
DOI
None available
Abstract
The interpretation of medical images into natural language is a developing field of artificial intelligence (AI) called image captioning. This field integrates two branches of AI: computer vision and natural language processing. It is a challenging task that goes beyond object recognition, segmentation, and classification, since it demands an understanding of the relationships between the various components of an image and of how those objects function as visual representations. Content-based image retrieval (CBIR) uses an image captioning model to generate captions for a user's query image. The common architecture of medical image captioning systems consists mainly of an image feature extraction subsystem followed by a caption-generation lingual subsystem. In this paper, we aim to build an optimized model for captioning histopathological images of stomach adenocarcinoma endoscopic biopsy specimens. For the image feature extraction subsystem, we performed two evaluations. First, we tested five vision models (VGG, ResNet, PVT, SWIN-Large, and ConvNEXT-Large) with three decoders (LSTM, RNN, and bidirectional RNN), and then compared the vision models under three settings (LSTM without augmentation, LSTM with augmentation, and BioLinkBERT-Large as an embedding layer with augmentation) to identify the most accurate one. Second, we tested three pairwise concatenations of vision models (SWIN-Large, PVT_v2_b5, and ConvNEXT-Large) to determine which pair yields the most expressive extracted feature vector for the image. For the caption-generation lingual subsystem, we compared a pre-trained language embedding model, BioLinkBERT-Large, against LSTM in both evaluations to select the more accurate model. Our experiments showed that a captioning system using the concatenation of ConvNEXT-Large and PVT_v2_b5 as the image feature extractor, combined with the BioLinkBERT-Large language embedding model, produces the best results among all tested combinations.
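The fusion step the abstract describes can be sketched minimally: the pooled feature vectors from the two backbones are concatenated into one joint representation before being passed to the caption-generation subsystem. The vector dimensions below are illustrative assumptions, not the paper's reported sizes.

```python
import numpy as np

# Hypothetical pooled feature vectors from the two vision backbones.
# 1536 and 512 are assumed dimensions for illustration only.
convnext_feat = np.random.rand(1536)  # e.g. ConvNEXT-Large global-pooled features
pvt_feat = np.random.rand(512)        # e.g. PVT_v2_b5 global-pooled features

# Fuse by simple concatenation into a single joint feature vector,
# which would then condition the caption-generation subsystem.
fused = np.concatenate([convnext_feat, pvt_feat])
print(fused.shape)  # (2048,)
```

The concatenated vector preserves both backbones' representations intact, at the cost of a wider input layer in the downstream language model.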
Pages: 36645-36664 (19 pages)
Related papers (showing items 31-40 of 50)
  • [31] Image Captioning with Masked Diffusion Model
    Tian, Weidong
    Xu, Wenzheng
    Zhao, Junxiang
    Zhao, Zhongqiu
    [J]. ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT VIII, ICIC 2024, 2024, 14869 : 216 - 227
  • [32] Multiple Videos Captioning Model for Video Storytelling
    Han, Seung-Ho
    Go, Bo-Won
    Choi, Ho-Jin
    [J]. 2019 IEEE INTERNATIONAL CONFERENCE ON BIG DATA AND SMART COMPUTING (BIGCOMP), 2019, : 355 - 358
  • [33] Triple-level relationship enhanced transformer for image captioning
    Zheng, Anqi
    Zheng, Shiqi
    Bai, Cong
    Chen, Deng
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (04) : 1955 - 1966
  • [34] Memory-enhanced hierarchical transformer for video paragraph captioning
    Zhang, Benhui
    Gao, Junyu
    Yuan, Yuan
    [J]. NEUROCOMPUTING, 2025, 615
  • [35] Integrating grid features and geometric coordinates for enhanced image captioning
    Zhao, Fengzhi
    Yu, Zhezhou
    Zhao, He
    Wang, Tao
    Bai, Tian
    [J]. APPLIED INTELLIGENCE, 2024, 54 (01) : 231 - 245
  • [36] Style-Enhanced Transformer for Image Captioning in Construction Scenes
    Song, Kani
    Chen, Linlin
    Wang, Hengyou
    [J]. ENTROPY, 2024, 26 (03)
  • [37] A Unified Visual and Linguistic Semantics Method for Enhanced Image Captioning
    Peng, Jiajia
    Tang, Tianbing
    [J]. APPLIED SCIENCES-BASEL, 2024, 14 (06):
  • [38] Semantic Enhanced Video Captioning with Multi-feature Fusion
    Niu, Tian-Zi
    Dong, Shan-Shan
    Chen, Zhen-Duo
    Luo, Xin
    Guo, Shanqing
    Huang, Zi
    Xu, Xin-Shun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (06)
  • [39] Multimodal-enhanced hierarchical attention network for video captioning
    Maosheng Zhong
    Youde Chen
    Hao Zhang
    Hao Xiong
    Zhixiang Wang
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 : 2469 - 2482
  • [40] BENet: bi-directional enhanced network for image captioning
    Yan, Peixin
    Li, Zuoyong
    Hu, Rong
    Cao, Xinrong
    [J]. MULTIMEDIA SYSTEMS, 2024, 30 (01)