Criteria2Query: a natural language interface to clinical databases for cohort definition

Cited by: 69
Authors
Yuan, Chi [1, 2]
Ryan, Patrick B. [1, 3]
Ta, Casey [1]
Guo, Yixuan [1]
Li, Ziran [1]
Hardin, Jill [3]
Makadia, Rupa [3]
Jin, Peng [1]
Shang, Ning [1]
Kang, Tian [1]
Weng, Chunhua [1]
Affiliations
[1] Columbia Univ, Dept Biomed Informat, 622 West 168th St,PH-20,Room 407, New York, NY 10032 USA
[2] Nanjing Univ Sci & Technol, Dept Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[3] Janssen Res & Dev, Epidemiol Analyt, Titusville, NJ USA
Keywords
cohort definition; natural language processing; natural language interfaces to database; common data model; ELIGIBILITY CRITERIA; REPRESENTATION; EXTRACTION; SYSTEM
DOI
10.1093/jamia/ocy178
Chinese Library Classification
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: Cohort definition is a bottleneck for conducting clinical research and depends on subjective decisions by domain experts. Data-driven cohort definition is appealing but requires substantial knowledge of terminologies and clinical data models. Criteria2Query is a natural language interface that facilitates human-computer collaboration for cohort definition and execution using clinical databases.

Materials and Methods: Criteria2Query uses a hybrid information extraction pipeline, combining machine learning and rule-based methods, to systematically parse eligibility criteria text. It transforms the text first into a structured criteria representation and then into shareable, executable clinical data queries expressed as SQL conforming to the OMOP Common Data Model. Users can interactively review, refine, and execute queries in the ATLAS web application. To test effectiveness, we evaluated 125 criteria across different disease domains from ClinicalTrials.gov and 52 user-entered criteria. We evaluated F1 score and accuracy against 2 domain experts and calculated the average computation time for fully automated query formulation. We conducted an anonymous survey evaluating usability.

Results: Criteria2Query achieved F1 scores of 0.795 and 0.805 for entity recognition and relation extraction, respectively. Accuracies for negation detection, logic detection, entity normalization, and attribute normalization were 0.984, 0.864, 0.514, and 0.793, respectively. Fully automatic query formulation took 1.22 seconds per criterion. More than 80% (11+ of 13) of users would use Criteria2Query in their future cohort definition tasks.

Conclusions: We contribute a novel natural language interface to clinical databases. It is open source and supports fully automated and interactive modes for autonomous data-driven cohort definition by researchers with minimal human effort. We demonstrate its promising user friendliness and usability.
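
The second transformation step described in the abstract, from a structured criteria representation to SQL over the OMOP Common Data Model, can be illustrated with a minimal sketch. This is not Criteria2Query's implementation: the `Criterion` class, the `DOMAIN_TABLES` mapping, and both functions are hypothetical names introduced for illustration, and the concept IDs are assumed values that should be looked up in the OMOP vocabulary before real use. Only the table and column names (`person`, `condition_occurrence`, `drug_exposure`, and their concept ID columns) come from the OMOP CDM itself.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical mapping from an entity domain to an OMOP CDM table
# and the column holding its standard concept ID.
DOMAIN_TABLES = {
    "Condition": ("condition_occurrence", "condition_concept_id"),
    "Drug": ("drug_exposure", "drug_concept_id"),
}


@dataclass
class Criterion:
    """One normalized criterion: a domain, a concept ID, and a negation flag."""
    domain: str
    concept_id: int
    negated: bool = False


def criterion_to_sql(c: Criterion) -> str:
    """Render a single criterion as a query returning matching person_id values."""
    table, column = DOMAIN_TABLES[c.domain]
    op = "NOT IN" if c.negated else "IN"
    return (
        f"SELECT person_id FROM person WHERE person_id {op} "
        f"(SELECT person_id FROM {table} WHERE {column} = {int(c.concept_id)})"
    )


def cohort_sql(inclusion: List[Criterion],
               exclusion: Optional[List[Criterion]] = None) -> str:
    """Combine criteria: a person must satisfy every inclusion criterion
    and none of the exclusion criteria (exclusions are flipped to negations)."""
    parts = [criterion_to_sql(c) for c in inclusion]
    for c in exclusion or []:
        parts.append(criterion_to_sql(Criterion(c.domain, c.concept_id, not c.negated)))
    return "\nINTERSECT\n".join(parts)


# Assumed concept IDs, for illustration only; verify against the vocabulary.
TYPE_2_DIABETES = 201826
METFORMIN = 1503297

if __name__ == "__main__":
    print(cohort_sql(
        inclusion=[Criterion("Condition", TYPE_2_DIABETES)],
        exclusion=[Criterion("Drug", METFORMIN)],
    ))
```

A real system would also need to handle temporal constraints, value attributes, and concept hierarchies, all of which this sketch leaves out.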
Pages: 294-305
Page count: 12