Enhancing medical text detection with vision-language pre-training and efficient segmentation

Cited by: 2
Authors
Li, Tianyang [1 ,2 ]
Bai, Jinxu [1 ]
Wang, Qingzhu [1 ]
Affiliations
[1] Northeast Elect Power Univ, Coll Comp Sci & Technol, Jilin 132012, Peoples R China
[2] Jiangxi New Energy Technol Inst, Xinyu 33800, Jiangxi, Peoples R China
Keywords
Vision-language pre-training; Medical text detection; Feature fusion; Differentiable binarization;
DOI
10.1007/s40747-024-01378-3
Chinese Library Classification (CLC)
TP18 [Theory of Artificial Intelligence];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Detecting text within medical images presents a formidable challenge in computer vision due to the intricate nature of textual backgrounds, dense text concentration, and the possible presence of extreme aspect ratios. This paper introduces an effective and precise text detection system tailored to these challenges. The system incorporates an optimized segmentation module, a trainable post-processing method, and a vision-language pre-training model (oCLIP). Specifically, our segmentation head integrates three essential components: a Feature Pyramid Network (FPN) module that combines a residual structure with a channel attention mechanism; the Efficient Feature Enhancement Module (EFEM); and the Multi-Scale Feature Fusion with RSEConv (MSFM-RSE), designed for multi-scale feature fusion based on RSEConv. Within the FPN module, standard convolutional layers are replaced with RSEConv layers, which pair a residual structure with a channel attention mechanism, further augmenting the representational capacity of the feature maps. The EFEM, designed as a cascaded U-shaped module, incorporates a spatial attention mechanism to introduce multi-level information, thereby enhancing segmentation performance. The MSFM-RSE then fuses features from various depths and scales of the EFEM to generate comprehensive final features tailored for segmentation. Additionally, a post-processing module employs a differentiable binarization strategy, allowing the segmentation network to determine the binarization threshold adaptively during training. Building on these architectural improvements, we introduce a vision-language pre-training model trained extensively on a range of visual-language understanding tasks.
This pre-trained model acquires detailed visual and semantic representations, further reinforcing both the accuracy and robustness of text detection when integrated with the segmentation module. The performance of the proposed model was evaluated on medical text image datasets, with excellent results. Multiple benchmark experiments validate its superior performance compared to existing methods. Code is available at: https://github.com/csworkcode/VLDBNet.
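The two concrete mechanisms the abstract names can be sketched in a few lines. The following is a minimal NumPy sketch, not the authors' implementation: `se_channel_attention` is one possible reading of an RSEConv core (squeeze-and-excitation channel attention wrapped in a residual connection; the exact layer layout and weight shapes are assumptions), and `differentiable_binarization` is the standard DBNet approximation the abstract's post-processing module refers to.

```python
import numpy as np

def se_channel_attention(x, w1, w2):
    """SE-style channel attention with a residual connection (one possible
    reading of RSEConv; the exact layer layout is an assumption).
    x: (C, H, W) feature map; w1: (C//r, C) and w2: (C, C//r) are the
    excitation weights of the squeeze-and-excitation bottleneck."""
    s = x.mean(axis=(1, 2))                      # squeeze: global average pool -> (C,)
    e = np.maximum(w1 @ s, 0.0)                  # excitation: FC + ReLU -> (C//r,)
    a = 1.0 / (1.0 + np.exp(-(w2 @ e)))          # FC + sigmoid -> per-channel weights (C,)
    return x + x * a[:, None, None]              # residual connection around the rescaled map

def differentiable_binarization(P, T, k=50.0):
    """Approximate binarization B = 1 / (1 + exp(-k * (P - T))) from DBNet:
    P is the probability map, T the learned threshold map, and k an
    amplification factor (k = 50 in the DBNet paper)."""
    return 1.0 / (1.0 + np.exp(-k * (P - T)))

rng = np.random.default_rng(0)
x = rng.random((8, 4, 4))                        # toy (C, H, W) feature map
y = se_channel_attention(x, rng.random((2, 8)), rng.random((8, 2)))

P = np.array([0.9, 0.1])                         # pixel probabilities
T = np.array([0.5, 0.5])                         # per-pixel learned thresholds
B = differentiable_binarization(P, T)            # near 1 above threshold, near 0 below
```

Because the sigmoid is smooth, the binarization step stays differentiable, which is what lets the segmentation network learn the threshold map `T` jointly with the probability map rather than fixing a global cutoff by hand.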
Pages: 3995-4007 (13 pages)