Large Scale Semi-Automated Labeling of Routine Free-Text Clinical Records for Deep Learning

Cited by: 14

Authors:

Trivedi, Hari M. [1]

Panahiazar, Maryam [2]

Liang, April [3]

Lituiev, Dmytro [2]

Chang, Peter [1]

Sohn, Jae Ho [1]

Chen, Yunn-Yi [4]

Franc, Benjamin L. [1]

Joe, Bonnie [1]

Hadley, Dexter [2]

Affiliations:

[1] Univ Calif San Francisco, Dept Radiol & Biomed Imaging, San Francisco, CA 94143 USA

[2] Univ Calif San Francisco, Inst Computat Hlth Sci, San Francisco, CA 94143 USA

[3] Univ Calif San Francisco, Sch Med, San Francisco, CA USA

[4] Univ Calif San Francisco, Dept Pathol, San Francisco, CA 94140 USA

Source:

JOURNAL OF DIGITAL IMAGING | 2019 / Vol. 32 / Issue 01

Keywords:

IBM Watson; Machine learning; Artificial intelligence; Deep learning; Natural language processing (NLP); Pathology; Mammography; CANCER; CLASSIFICATION; ARCHITECTURE; MAMMOGRAPHY; MASSES;

DOI:

10.1007/s10278-018-0105-8

Chinese Library Classification (CLC) number:

R8 [Special Medicine]; R445 [Diagnostic Imaging];

Discipline classification codes:

1002 ; 100207 ; 1009 ;

Abstract:

Breast cancer is a leading cause of cancer death among women in the USA. Screening mammography is effective in reducing mortality, but has a high rate of unnecessary recalls and biopsies. While deep learning can be applied to mammography, large-scale labeled datasets, which are difficult to obtain, are required. We aim to remove many barriers of dataset development by automatically harvesting data from existing clinical records using a hybrid framework combining traditional NLP and IBM Watson. An expert reviewer manually annotated 3521 breast pathology reports with one of four outcomes: left positive, right positive, bilateral positive, negative. Traditional NLP techniques using seven different machine learning classifiers were compared to IBM Watson's automated natural language classifier. Techniques were evaluated using precision, recall, and F-measure. Logistic regression outperformed all other traditional machine learning classifiers and was used for subsequent comparisons. Both traditional NLP and Watson's NLC performed well for cases under 1024 characters with weighted average F-measures above 0.96 across all classes. Performance of traditional NLP was lower for cases over 1024 characters with an F-measure of 0.83. We demonstrate a hybrid framework using traditional NLP techniques combined with IBM Watson to annotate over 10,000 breast pathology reports for development of a large-scale database to be used for deep learning in mammography. Our work shows that traditional NLP and IBM Watson perform extremely well for cases under 1024 characters and can accelerate the rate of data annotation.
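
The evaluation described in the abstract (precision, recall, and a weighted average F-measure over four outcome classes, reported separately for reports under and over 1024 characters) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the toy labels, and the choice to treat a report of exactly 1024 characters as "long" are assumptions made for illustration only.

```python
MAX_CHARS = 1024  # length cutoff mentioned in the abstract


def split_by_length(reports, max_chars=MAX_CHARS):
    """Partition reports into (short, long) lists around the character cutoff.

    Assumption: a report of exactly `max_chars` characters counts as long.
    """
    short = [r for r in reports if len(r) < max_chars]
    long_ = [r for r in reports if len(r) >= max_chars]
    return short, long_


def per_class_scores(y_true, y_pred):
    """Return {label: (precision, recall, f_measure, support)} per true label."""
    scores = {}
    pairs = list(zip(y_true, y_pred))
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        scores[label] = (precision, recall, f_measure, tp + fn)
    return scores


def weighted_f_measure(y_true, y_pred):
    """Average the per-class F-measures, weighting each class by its support."""
    scores = per_class_scores(y_true, y_pred)
    total = sum(support for _, _, _, support in scores.values())
    return sum(f * support for _, _, f, support in scores.values()) / total


# Toy labels using the four outcome classes named in the abstract (made-up data).
y_true = ["left", "left", "right", "negative", "negative", "bilateral"]
y_pred = ["left", "right", "right", "negative", "negative", "bilateral"]
print(round(weighted_f_measure(y_true, y_pred), 4))
```

Weighting by support keeps a rare class such as bilateral positive from dominating the average, which is one plausible reading of "weighted average F-measure" in the abstract.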


Pages: 30-37

Page count: 8
