Natural Language Processing for Surveillance of Cervical and Anal Cancer and Precancer: Algorithm Development and Split-Validation Study

Cited by: 15
Authors
Oliveira, Carlos R. [1]
Niccolai, Patrick [1]
Ortiz, Anette Michelle [1]
Sheth, Sangini S. [2]
Shapiro, Eugene D. [1, 3]
Niccolai, Linda M. [3]
Brandt, Cynthia A. [4, 5, 6, 7, 8, 9, 10]
Affiliations
[1] Yale Univ, Dept Pediat, Sch Med, POB 208000, New Haven, CT 06520 USA
[2] Yale Univ, Dept Obstet Gynecol & Reprod Sci, Sch Med, New Haven, CT 06520 USA
[3] Yale Sch Publ Hlth, Dept Epidemiol Microbial Dis, New Haven, CT USA
[4] Yale Sch Med, Dept Emergency Med, New Haven, CT USA
[5] Yale Sch Med, Dept Biostat, New Haven, CT USA
[6] Yale Sch Med, Dept Hlth Informat, New Haven, CT USA
[7] Yale Sch Publ Hlth, Dept Emergency Med, New Haven, CT USA
[8] Yale Sch Publ Hlth, Dept Biostat, New Haven, CT USA
[9] Yale Sch Publ Hlth, Dept Hlth Informat, New Haven, CT USA
[10] Vet Affairs Connecticut Healthcare Syst, West Haven, CT USA
Funding
National Institutes of Health (USA)
Keywords
natural language processing; automated data extraction; human papillomavirus; surveillance; pathology reporting; cervical cancer; anal cancer; precancer; cancer; HPV; accuracy; HPV VACCINE; CLASSIFICATION; INFLUENZA; HEALTH;
DOI
10.2196/20826
Chinese Library Classification number
R-058
Discipline classification code
Abstract
Background: Accurate identification of new diagnoses of human papillomavirus-associated cancers and precancers is an important step toward the development of strategies that optimize the use of human papillomavirus vaccines. The diagnosis of human papillomavirus cancers hinges on a histopathologic report, which is typically stored in electronic medical records as free-form, or unstructured, narrative text. Previous efforts to perform surveillance for human papillomavirus cancers have relied on the manual review of pathology reports to extract diagnostic information, a process that is both labor- and resource-intensive. Natural language processing can be used to automate the structuring and extraction of clinical data from unstructured narrative text in medical records and may provide a practical and effective method for identifying patients with vaccine-preventable human papillomavirus disease for surveillance and research.
Objective: This study's objective was to develop and assess the accuracy of a natural language processing algorithm for the identification of individuals with cancer or precancer of the cervix and anus.
Methods: A pipeline-based natural language processing algorithm was developed, which incorporated machine learning and rule-based methods to extract diagnostic elements from the narrative pathology reports. To test the algorithm's classification accuracy, we used a split-validation study design. Full-length cervical and anal pathology reports were randomly selected from 4 clinical pathology laboratories. Two study team members, blinded to the classifications produced by the natural language processing algorithm, manually and independently reviewed all reports and classified them at the document level according to 2 domains (diagnosis and human papillomavirus testing results). Using the manual review as the gold standard, the algorithm's performance was evaluated using standard measurements of accuracy, recall, precision, and F-measure.
Results: The natural language processing algorithm's performance was validated on 949 pathology reports. The algorithm demonstrated accurate identification of abnormal cytology, histology, and positive human papillomavirus tests with accuracies greater than 0.91. Precision was lowest for anal histology reports (0.87, 95% CI 0.59-0.98) and highest for cervical cytology (0.98, 95% CI 0.95-0.99). The natural language processing algorithm missed 2 out of the 15 abnormal anal histology reports, which led to a relatively low recall (0.68, 95% CI 0.43-0.87).
Conclusions: This study outlines the development and validation of a freely available and easily implementable natural language processing algorithm that can automate the extraction and classification of clinical data from cervical and anal cytology and histology.
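Note: The abstract names accuracy, recall, precision, and F-measure as the evaluation measures, with manual review as the gold standard. The sketch below shows how such document-level measures can be computed from a 2x2 comparison of algorithm output against reviewer labels. It is an illustration only, not the authors' code: the names `ConfusionCounts`, `wilson_interval`, and `document_level_metrics` are hypothetical, the counts in any call are made up, and the Wilson score interval is an assumption, since the abstract does not say how its 95% CIs were derived.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass(frozen=True)
class ConfusionCounts:
    """Document-level agreement between algorithm and gold-standard review."""
    tp: int  # algorithm abnormal, reviewers abnormal
    fp: int  # algorithm abnormal, reviewers normal
    fn: int  # algorithm normal, reviewers abnormal (missed reports)
    tn: int  # algorithm normal, reviewers normal


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%).

    Assumption: the source does not state which interval method was used.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    centre = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def document_level_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return accuracy, precision, recall, and F-measure (F1) for one report type."""
    total = c.tp + c.fp + c.fn + c.tn
    if total == 0 or (c.tp + c.fp) == 0 or (c.tp + c.fn) == 0:
        raise ValueError("metrics are undefined for these counts")
    precision = c.tp / (c.tp + c.fp)   # share of flagged reports that are truly abnormal
    recall = c.tp / (c.tp + c.fn)      # share of abnormal reports that were flagged
    if precision + recall == 0:
        f_measure = 0.0
    else:
        f_measure = 2 * precision * recall / (precision + recall)
    return {
        "accuracy": (c.tp + c.tn) / total,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }
```

For example, `document_level_metrics(ConfusionCounts(tp=90, fp=10, fn=10, tn=90))` evaluates one hypothetical report type, and `wilson_interval(90, 100)` gives an interval around its recall. With few abnormal reports in a category, each missed report moves recall substantially and the interval is wide, which is why small categories are usually reported with their CIs.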
Pages: 8
Related papers
50 in total
  • [1] Natural Language Processing-Based Virtual Cofacilitator for Online Cancer Support Groups: Protocol for an Algorithm Development and Validation Study
    Leung, Yvonne W.
    Wouterloot, Elise
    Adikari, Achini
    Hirst, Graeme
    de Silva, Daswin
    Wong, Jiahui
    Bender, Jacqueline L.
    Gancarz, Mathew
    Gratzer, David
    Alahakoon, Damminda
    Esplen, Mary Jane
    JMIR RESEARCH PROTOCOLS, 2021, 10 (01):
  • [2] Portable Automated Surveillance of Surgical Site Infections Using Natural Language Processing: Development and Validation
    Bucher, Brian T.
    Shi, Jianlin
    Ferraro, Jeffrey P.
    Skarda, David E.
    Samore, Matthew H.
    Hurdle, John F.
    Gundlapalli, Adi V.
    Chapman, Wendy W.
    Finlayson, Samuel R. G.
    ANNALS OF SURGERY, 2020, 272 (04) : 629 - 636
  • [3] Mining Clinical Notes for Physical Rehabilitation Exercise Information: Natural Language Processing Algorithm Development and Validation Study
    Sivarajkumar, Sonish
    Gao, Fengyi
    Denny, Parker
    Aldhahwani, Bayan
    Visweswaran, Shyam
    Bove, Allyn
    Wang, Yanshan
    JMIR MEDICAL INFORMATICS, 2024, 12
  • [4] Development and Validation of a Natural Language Processing Algorithm for Extracting Clinical and Pathological Features of Breast Cancer From Pathology Reports
    Munzone, Elisabetta
    Marra, Antonio
    Comotto, Federico
    Guercio, Lorenzo
    Sangalli, Claudia Anna
    Lo Cascio, Martina
    Pagan, Eleonora
    Sangalli, Davide
    Bigoni, Ilaria
    Porta, Francesca Maria
    D'Ercole, Marianna
    Ritorti, Fabiana
    Bagnardi, Vincenzo
    Fusco, Nicola
    Curigliano, Giuseppe
    JCO CLINICAL CANCER INFORMATICS, 2024, 8
  • [5] Development and Validation of an Algorithm to Identify Prostate Cancer Related Mortality in Electronic Medical Records Using Natural Language Processing
    DiBello, Julia R.
    Wallner, Lauren P.
    Zheng, Chengyi
    Yu, Wei
    Li, Bonnie H.
    VanDenEeden, Stephen K.
    Weinmann, Sheila
    Ritzwoller, Debra
    Richert-Boe, Kathryn
    Jacobsen, Stephen J.
    PHARMACOEPIDEMIOLOGY AND DRUG SAFETY, 2015, 24 : 418 - 419
  • [6] Development and Validation of a Natural Language Processing Algorithm to Pseudonymize Documents in the Context of a Clinical Data Warehouse
    Tannier, Xavier
    Wajsburt, Perceval
    Calliger, Alice
    Dura, Basile
    Mouchet, Alexandre
    Hilka, Martin
    Bey, Romain
    METHODS OF INFORMATION IN MEDICINE, 2024, 63 (01/02) : 21 - 34
  • [7] Development of Natural Language Processing Algorithm for Dental Charting
    Zhang, Yifan
    Bogard, Brandon
    Zhang, Chengdui
    2020 IEEE 21ST INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2020), 2020 : 403 - 404
  • [8] Lemmatization Algorithm Development for Bangla Natural Language Processing
    Kowsher, Md
    Tahabilder, Anik
    Sarker, Md Murad Hossain
    Sanjid, Md Zahidul Islam
    Prottasha, Nusrat Jahan
    2020 JOINT 9TH INTERNATIONAL CONFERENCE ON INFORMATICS, ELECTRONICS & VISION (ICIEV) AND 2020 4TH INTERNATIONAL CONFERENCE ON IMAGING, VISION & PATTERN RECOGNITION (ICIVPR), 2020
  • [9] Development of Natural Language Processing Algorithm for Dental Charting
    Zhang Y.
    Bogard B.
    Zhang C.
    SN Computer Science, 2021, 2 (4)
  • [10] Development of a Structured Query Language and Natural Language Processing Algorithm to Identify Lung Nodules in a Cancer Centre
    Hunter, Benjamin
    Reis, Sara
    Campbell, Des
    Matharu, Sheila
    Ratnakumar, Prashanthi
    Mercuri, Luca
    Hindocha, Sumeet
    Kalsi, Hardeep
    Mayer, Erik
    Glampson, Ben
    Robinson, Emily J.
    Al-Lazikani, Bisan
    Scerri, Lisa
    Bloch, Susannah
    Lee, Richard
    FRONTIERS IN MEDICINE, 2021, 8