EMBERT: A Pre-trained Language Model for Chinese Medical Text Mining

Cited by: 3
Authors
Cai, Zerui [1 ]
Zhang, Taolin [2 ,3 ]
Wang, Chengyu [3 ]
He, Xiaofeng [1 ]
Affiliations
[1] East China Normal Univ, Sch Comp Sci & Technol, Shanghai, Peoples R China
[2] East China Normal Univ, Sch Software Engn, Shanghai, Peoples R China
[3] Alibaba Grp, Hangzhou, Peoples R China
Keywords
Pre-trained language model; Chinese medical text mining; Self-supervised learning; Deep context-aware neural network
DOI
10.1007/978-3-030-85896-4_20
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Subject Classification Codes
081104; 0812; 0835; 1405
Abstract
Medical text mining aims to learn models that extract useful information from medical sources. A major challenge is obtaining large-scale labeled data in the medical domain for model training, which is highly expensive. Recent studies show that leveraging massive unlabeled corpora to pre-train language models alleviates this problem through self-supervised learning. In this paper, we propose EMBERT, an entity-level knowledge-enhanced pre-trained language model that leverages several distinct self-supervised tasks for Chinese medical text mining. EMBERT captures fine-grained semantic relations among medical terms through three self-supervised tasks: i) context-entity consistency prediction (whether entities are equivalent in meaning in a given context), ii) entity segmentation (segmenting entities into fine-grained semantic parts), and iii) bidirectional entity masking (predicting the atomic or adjective terms of long entities). Experimental results demonstrate that our model achieves significant improvements over five strong baselines on six public Chinese medical text mining datasets.
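The abstract describes the three pre-training tasks only at a high level. The Python sketch below illustrates one plausible reading of the bidirectional entity masking task: a long entity is split into a modifier ("adjective") part and an atomic head part, one side is masked, and the model must predict it from the other side and the context. The function name, the precomputed split boundary, and the 50/50 side selection are illustrative assumptions, not details taken from the paper.

import random

MASK = "[MASK]"

def bidirectional_entity_mask(tokens, entity_span, boundary):
    """Mask either the modifier part or the atomic head part of a long
    entity, so the model must predict one side from the other.

    tokens:      characters/subwords of the whole sentence
    entity_span: (start, end) indices of the entity, end exclusive
    boundary:    index inside the span separating modifier | atomic parts
                 (in the paper this split would come from the entity
                 segmentation task; here it is supplied directly)
    """
    start, end = entity_span
    masked = list(tokens)
    if random.random() < 0.5:
        mask_range = range(start, boundary)  # mask the modifier part
    else:
        mask_range = range(boundary, end)    # mask the atomic head part
    labels = {i: tokens[i] for i in mask_range}  # positions to predict
    for i in mask_range:
        masked[i] = MASK
    return masked, labels

# Toy example: the entity 急性阑尾炎 (acute appendicitis),
# split as 急性 (modifier) | 阑尾炎 (atomic head).
tokens = list("患者诊断为急性阑尾炎")
masked, labels = bidirectional_entity_mask(tokens, entity_span=(5, 10), boundary=7)
print("".join(masked), labels)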
Pages: 242-257
Number of pages: 16
Related Papers
50 records in total
  • [1] BioHanBERT: A Hanzi-aware Pre-trained Language Model for Chinese Biomedical Text Mining
    Wang, Xiaosu
    Xiong, Yun
    Niu, Hao
    Yue, Jingwen
    Zhu, Yangyong
    Yu, Philip S.
    [J]. 2021 21ST IEEE INTERNATIONAL CONFERENCE ON DATA MINING (ICDM 2021), 2021, : 1415 - 1420
  • [2] Using a Pre-Trained Language Model for Medical Named Entity Extraction in Chinese Clinic Text
    Zhang, Mengyuan
    Wang, Jin
    Zhang, Xuejie
    [J]. PROCEEDINGS OF 2020 IEEE 10TH INTERNATIONAL CONFERENCE ON ELECTRONICS INFORMATION AND EMERGENCY COMMUNICATION (ICEIEC 2020), 2020, : 312 - 317
  • [3] BioBERT: a pre-trained biomedical language representation model for biomedical text mining
    Lee, Jinhyuk
    Yoon, Wonjin
    Kim, Sungdong
    Kim, Donghyeon
    Kim, Sunkyu
    So, Chan Ho
    Kang, Jaewoo
    [J]. BIOINFORMATICS, 2020, 36 (04) : 1234 - 1240
  • [4] BioVAE: a pre-trained latent variable language model for biomedical text mining
    Trieu, Hai-Long
    Miwa, Makoto
    Ananiadou, Sophia
    [J]. BIOINFORMATICS, 2022, 38 (03) : 872 - 874
  • [5] FinBERT: A Pre-trained Financial Language Representation Model for Financial Text Mining
    Liu, Zhuang
    Huang, Degen
    Huang, Kaiyu
    Li, Zhuang
    Zhao, Jun
    [J]. PROCEEDINGS OF THE TWENTY-NINTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2020, : 4513 - 4519
  • [6] Improving text mining in plant health domain with GAN and/or pre-trained language model
    Jiang, Shufan
    Cormier, Stephane
    Angarita, Rafael
    Rousseaux, Francis
    [J]. FRONTIERS IN ARTIFICIAL INTELLIGENCE, 2023, 6
  • [7] ViHealthBERT: Pre-trained Language Models for Vietnamese in Health Text Mining
    Minh Phuc Nguyen
    Vu Hoang Tran
    Vu Hoang
    Ta Duc Huy
    Bui, Trung H.
    Truong, Steven Q. H.
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 328 - 337
  • [8] A Pre-trained Model for Chinese Medical Record Punctuation Restoration
    Yu, Zhipeng
    Ling, Tongtao
    Gu, Fangqing
    Sheng, Huangxu
    Liu, Yi
    [J]. PATTERN RECOGNITION AND COMPUTER VISION, PRCV 2023, PT VII, 2024, 14431 : 101 - 112
  • [9] RoBERTuito: a pre-trained language model for social media text in Spanish
    Manuel Perez, Juan
    Furman, Damian A.
    Alonso Alemany, Laura
    Luque, Franco
    [J]. LREC 2022: THIRTEENTH INTERNATIONAL CONFERENCE ON LANGUAGE RESOURCES AND EVALUATION, 2022, : 7235 - 7243
  • [10] Leveraging Pre-Trained Language Model for Summary Generation on Short Text
    Zhao, Shuai
    You, Fucheng
    Liu, Zeng Yuan
    [J]. IEEE ACCESS, 2020, 8 : 228798 - 228803