Enhancing medical text detection with vision-language pre-training and efficient segmentation

Cited by: 2
Authors
Li, Tianyang [1 ,2 ]
Bai, Jinxu [1 ]
Wang, Qingzhu [1 ]
Affiliations
[1] Northeast Elect Power Univ, Coll Comp Sci & Technol, Jilin 132012, Peoples R China
[2] Jiangxi New Energy Technol Inst, Xinyu 33800, Jiangxi, Peoples R China
Keywords
Vision-language pre-training; Medical text detection; Feature fusion; Differentiable binarization;
DOI
10.1007/s40747-024-01378-3
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Detecting text within medical images presents a formidable challenge in computer vision due to the intricate nature of textual backgrounds, dense text concentration, and the possible presence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to these challenges. The system incorporates an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three essential components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE), designed for multi-scale feature fusion based on RSEConv. Within the FPN module, standard convolutional layers are replaced with RSEConv layers, which pair a residual structure with a channel attention mechanism, further augmenting the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. The MSFM-RSE then fuses features from various depths and scales of the EFEM to generate comprehensive final features tailored for segmentation. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to determine the binarization threshold adaptively during training. Building on these architectural improvements, we introduce a vision-language pre-training model trained extensively on a range of visual-language understanding tasks.
This pre-trained model acquires detailed visual and semantic representations, further reinforcing both the accuracy and robustness of text detection when integrated with the segmentation module. The performance of the proposed model was evaluated on medical text image datasets, with excellent results. Multiple benchmark experiments validate its superior performance compared to existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
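The two concrete mechanisms the abstract names can be sketched in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: `se_channel_attention` is one possible reading of an RSEConv core (squeeze-and-excitation channel attention wrapped in a residual connection; the exact layer layout and weight shapes are assumptions), and `differentiable_binarization` is the standard DBNet approximation the abstract's post-processing module refers to.

```python
import numpy as np

def se_channel_attention(x, w1, w2):
    """SE-style channel attention with a residual connection (one possible
    reading of RSEConv; the exact layer layout is an assumption).
    x: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are the
    excitation weights of the squeeze-and-excitation bottleneck."""
    s = x.mean(axis=(1, 2))                      # squeeze: global average pool -> (C,)
    e = np.maximum(w1 @ s, 0.0)                  # excitation: FC + ReLU -> (C//r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ e)))          # FC + sigmoid -> per-channel weights (C,)
    return x + x * a[:, None, None]              # residual connection around the rescaled map

def differentiable_binarization(P, T, k=50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))) from DBNet:
    P is the probability map, T the learned threshold map, and k an
    amplification factor (k = 50 in the DBNet paper)."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

rng = np.random.default_rng(0)
x = rng.random((8, 4, 4))                        # toy (C, H, W) feature map
y = se_channel_attention(x, rng.random((2, 8)), rng.random((8, 2)))

P = np.array([0.9, 0.1])                         # pixel probabilities
T = np.array([0.5, 0.5])                         # per-pixel learned thresholds
B = differentiable_binarization(P, T)            # near 1 above threshold, near 0 below
```

Because the sigmoid is smooth, the binarization step stays differentiable, which is what lets the segmentation network learn the threshold map `T` jointly with the probability map rather than fixing a global cutoff by hand.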
Pages: 3995-4007 (13 pages)