Large Scale Semi-Automated Labeling of Routine Free-Text Clinical Records for Deep Learning

Cited by: 14
Authors
Trivedi, Hari M. [1]
Panahiazar, Maryam [2]
Liang, April [3]
Lituiev, Dmytro [2]
Chang, Peter [1]
Sohn, Jae Ho [1]
Chen, Yunn-Yi [4]
Franc, Benjamin L. [1]
Joe, Bonnie [1]
Hadley, Dexter [2]
Affiliations
[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA
[2] Univ Calif San Francisco, Inst Computat Hlth Sci, San Francisco, CA 94143 USA
[3] Univ Calif San Francisco, Sch Med, San Francisco, CA USA
[4] Univ Calif San Francisco, Dept Pathol, San Francisco, CA 94140 USA
Keywords
IBM Watson; Machine learning; Artificial intelligence; Deep learning; Natural language processing (NLP); Pathology; Mammography; CANCER; CLASSIFICATION; ARCHITECTURE; MAMMOGRAPHY; MASSES
DOI
10.1007/s10278-018-0105-8
Chinese Library Classification (CLC) number
R8 [Special medicine]; R445 [Diagnostic imaging]
Discipline classification code
1002; 100207; 1009
Abstract
Breast cancer is a leading cause of cancer death among women in the USA. Screening mammography is effective in reducing mortality, but has a high rate of unnecessary recalls and biopsies. While deep learning can be applied to mammography, large-scale labeled datasets, which are difficult to obtain, are required. We aim to remove many barriers of dataset development by automatically harvesting data from existing clinical records using a hybrid framework combining traditional NLP and IBM Watson. An expert reviewer manually annotated 3521 breast pathology reports with one of four outcomes: left positive, right positive, bilateral positive, negative. Traditional NLP techniques using seven different machine learning classifiers were compared to IBM Watson's automated natural language classifier. Techniques were evaluated using precision, recall, and F-measure. Logistic regression outperformed all other traditional machine learning classifiers and was used for subsequent comparisons. Both traditional NLP and Watson's NLC performed well for cases under 1024 characters with weighted average F-measures above 0.96 across all classes. Performance of traditional NLP was lower for cases over 1024 characters with an F-measure of 0.83. We demonstrate a hybrid framework using traditional NLP techniques combined with IBM Watson to annotate over 10,000 breast pathology reports for development of a large-scale database to be used for deep learning in mammography. Our work shows that traditional NLP and IBM Watson perform extremely well for cases under 1024 characters and can accelerate the rate of data annotation.
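The evaluation described in the abstract (precision, recall, and a weighted average F-measure over the four outcome classes, reported separately for reports under and over 1024 characters) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the `text` field, and the choice to treat exactly 1024 characters as "long" are assumptions made for illustration only.

```python
from collections import Counter

# The four outcome classes named in the abstract.
LABELS = ["left positive", "right positive", "bilateral positive", "negative"]
# Length threshold mentioned in the abstract; the exact boundary handling is assumed.
CHAR_LIMIT = 1024


def per_class_scores(y_true, y_pred, label):
    """Return (precision, recall, F-measure) for one class, one-vs-rest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_measure


def weighted_f_measure(y_true, y_pred, labels=LABELS):
    """Average the per-class F-measures, weighted by each class's true count."""
    total = len(y_true)
    if total == 0:
        return 0.0
    support = Counter(y_true)
    weighted_sum = sum(
        per_class_scores(y_true, y_pred, label)[2] * support[label]
        for label in labels
    )
    return weighted_sum / total


def split_by_length(reports, limit=CHAR_LIMIT):
    """Split reports (dicts with a hypothetical 'text' key) into short and long groups."""
    short = [r for r in reports if len(r["text"]) < limit]
    long_ = [r for r in reports if len(r["text"]) >= limit]
    return short, long_
```

Scoring the short and long groups separately with `weighted_f_measure` mirrors how the abstract reports results by report length; a class with no true examples contributes zero weight to the average.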
Pages: 30-37
Page count: 8