Uncovering Machine Learning-Ready Data from Public Clinical Trial Resources: A case-study on normalization across Aggregate Content of ClinicalTrials.gov

Cited by: 0
Authors
Hutchison, Emmette R. [1]
Zhang, Youyi [1]
Nampally, Sreenath [1]
Weatherall, Jim [2]
Khan, Faisal [1]
Shameer, Khader [1]
Affiliations
[1] AstraZeneca, Appl Analyt & Artificial Intelligence, Data Sci & Artificial Intelligence, R&D, Gaithersburg, MD 20878 USA
[2] AstraZeneca, Data Sci & Artificial Intelligence, R&D, Cambridge, England
Keywords
Clinical Data; Natural Language Processing; Database Normalization; Machine Learning;
DOI
10.1109/BIBM49941.2020.9313362
Chinese Library Classification number
Q5 [Biochemistry];
Subject classification code
071010 ; 081704 ;
Abstract
The state of clinical data is a barrier to the development of machine learning models to improve healthcare. Uncontrolled clinical free text is common in both patient records and clinical trials: the resulting spelling and grammatical errors, phrasing variation, and other variability make the data difficult to leverage. As part of our effort to harmonize the Aggregate Analysis of ClinicalTrials.gov (AACT) drop-withdrawal reasons to a controlled vocabulary, we explored two solutions. The first used Elastic's fuzzy matching capability to match entries in the AACT Drop-Withdrawal table to a list of user-specified terms (74.6% coverage). The second was a custom pipeline employing NLP preprocessing, Levenshtein Distance (Fuzzy Matching), and semantic similarity mapping using a pre-trained FastText Model (98% coverage). Although manual oversight is still required, the effort needed to harmonize with a controlled vocabulary is notably reduced. This work enables the rapid harmonization of clinical databases, allowing them to be leveraged for machine learning and analytics.
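As a rough illustration of the preprocessing and Levenshtein matching steps named in the abstract, the following minimal sketch maps free-text withdrawal reasons onto a controlled vocabulary. It is not the authors' pipeline: the vocabulary, the function names (`normalize`, `levenshtein`, `match_term`), and the `max_ratio` threshold are all assumptions made for this example, and the semantic similarity step with a pre-trained FastText model is omitted.

```python
import re

# Hypothetical controlled vocabulary; the term list used in the paper is not given here.
CONTROLLED_VOCAB = [
    "adverse event",
    "withdrawal by subject",
    "lost to follow-up",
    "protocol violation",
]


def normalize(text: str) -> str:
    """Lowercase the text and replace runs of non-alphanumerics with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def match_term(raw: str, vocab=CONTROLLED_VOCAB, max_ratio: float = 0.25):
    """Return the closest vocabulary term, or None if nothing is close enough.

    max_ratio is an assumed cutoff on distance / longer string length.
    """
    cleaned = normalize(raw)
    if not cleaned:
        return None
    best_term, best_ratio = None, None
    for term in vocab:
        target = normalize(term)
        ratio = levenshtein(cleaned, target) / max(len(cleaned), len(target))
        if best_ratio is None or ratio < best_ratio:
            best_term, best_ratio = term, ratio
    return best_term if best_ratio <= max_ratio else None


if __name__ == "__main__":
    for reason in ["Adverse Evnt", "Lost to follow up", "sponsor decision"]:
        print(reason, "->", match_term(reason))
```

Entries for which `match_term` returns `None` are the ones that would need a further step, such as embedding-based semantic similarity or manual review. Normalizing the distance by string length is one possible design choice; it keeps a single typo in a short term from counting the same as one in a long phrase.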
Pages: 2965-2967
Page count: 3