Extracting lung cancer staging descriptors from pathology reports: A generative language model approach

Cited by: 2
Authors
Cho, Hyeongmin [1]
Yoo, Sooyoung [2]
Kim, Borham [2]
Jang, Sowon [3]
Sunwoo, Leonard [3]
Kim, Sanghwan [1]
Lee, Donghyoung [1]
Kim, Seok [2]
Nam, Sejin [1]
Chung, Jin-Haeng [4, 5]
Affiliations
[1] ezCaretech Res & Dev Ctr, Seoul, South Korea
[2] Seoul Natl Univ, Bundang Hosp, Off eHlth Res & Business, Seongnam, South Korea
[3] Seoul Natl Univ, Bundang Hosp, Dept Radiol, Seongnam, South Korea
[4] Seoul Natl Univ, Coll Med, Dept Pathol, Seoul, South Korea
[5] Seoul Natl Univ, Bundang Hosp, Dept Pathol & Translat Med, Seongnam, South Korea
Keywords
Deep learning; Natural language processing; Large language model; Information extraction; Pathology report; Tumor-node classification; CLASSIFICATION; EDITION;
DOI
10.1016/j.jbi.2024.104720
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
Background: In oncology, electronic health records contain key textual information for the diagnosis, staging, and treatment planning of patients with cancer. However, processing text data is time-consuming and labor-intensive, which limits the use of these data. Recent advances in natural language processing (NLP), including large language models, can be applied to cancer research. In particular, the information required for pathological staging can be extracted from surgical pathology reports and used to update cancer staging according to the latest staging guidelines.
Objectives: This study has two main objectives. The first is to evaluate the performance of fine-tuned generative language models (GLMs) in extracting information from text-based surgical pathology reports of patients with lung cancer and in determining pathological stages from the extracted information. The second is to determine the feasibility of using relatively small GLMs for information extraction in a resource-constrained computing environment.
Methods: Lung cancer surgical pathology reports were collected from the Common Data Model database of Seoul National University Bundang Hospital (SNUBH), a tertiary hospital in Korea. We selected 42 descriptors necessary for tumor-node (TN) classification based on these reports and created a gold standard with validation by two clinical experts. The pathology reports and gold standard were used to generate prompt-response pairs for training and evaluating GLMs, which were then used to extract the information required for staging from pathology reports.
Results: We evaluated the information extraction performance of six trained models as well as their performance in TN classification using the extracted information. The Deductive Mistral-7B model, which was pre-trained with the deductive dataset, showed the best overall performance, with an exact match ratio of 92.24% in information extraction and an accuracy of 0.9876 in classification (predicting T and N classification concurrently).
Conclusion: This study demonstrated that training GLMs with deductive datasets can improve information extraction performance, and that GLMs with a relatively small number of parameters (approximately seven billion) can achieve high performance on this task. The proposed GLM-based information extraction method is expected to be useful in clinical decision support, lung cancer staging, and research.
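The two metrics reported in the Results (exact match ratio for information extraction, and accuracy when T and N are predicted concurrently) could be computed roughly as in the sketch below. The abstract does not define either metric precisely, so this assumes a report counts as an exact match only when every extracted descriptor equals the gold standard, and a classification counts as correct only when both T and N are right. The function names and descriptor keys are hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical representation: one dict of descriptor name -> value per report.
Descriptors = Dict[str, str]


def exact_match_ratio(gold: List[Descriptors], pred: List[Descriptors]) -> float:
    """Fraction of reports whose predicted descriptors all equal the gold standard.

    Assumption: "exact match" is judged per report, over all descriptors at once.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of reports")
    if not gold:
        return 0.0
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold)


def joint_tn_accuracy(
    gold: List[Tuple[str, str]], pred: List[Tuple[str, str]]
) -> float:
    """Fraction of reports where both the T and the N category are correct."""
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of reports")
    if not gold:
        return 0.0
    hits = sum(1 for g, p in zip(gold, pred) if g == p)
    return hits / len(gold)


# Toy data with made-up descriptor names, for illustration only.
gold_descriptors = [
    {"tumor_size_cm": "2.5", "pleural_invasion": "absent"},
    {"tumor_size_cm": "4.1", "pleural_invasion": "present"},
]
pred_descriptors = [
    {"tumor_size_cm": "2.5", "pleural_invasion": "absent"},
    {"tumor_size_cm": "4.1", "pleural_invasion": "absent"},  # one descriptor wrong
]
gold_tn = [("T1c", "N0"), ("T2a", "N1")]
pred_tn = [("T1c", "N0"), ("T2a", "N0")]  # N category wrong in second report

emr = exact_match_ratio(gold_descriptors, pred_descriptors)
acc = joint_tn_accuracy(gold_tn, pred_tn)
```

Under these assumptions a single wrong descriptor makes the whole report a miss, which is a stricter criterion than per-descriptor accuracy.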
Pages: 11