Criteria2Query: a natural language interface to clinical databases for cohort definition

Cited by: 69
Authors
Yuan, Chi [1, 2]
Ryan, Patrick B. [1, 3]
Ta, Casey [1 ]
Guo, Yixuan [1 ]
Li, Ziran [1 ]
Hardin, Jill [3 ]
Makadia, Rupa [3 ]
Jin, Peng [1 ]
Shang, Ning [1 ]
Kang, Tian [1 ]
Weng, Chunhua [1 ]
Affiliations
[1] Columbia Univ, Dept Biomed Informat, 622 West 168th St,PH-20,Room 407, New York, NY 10032 USA
[2] Nanjing Univ Sci & Technol, Dept Comp Sci & Technol, Nanjing, Jiangsu, Peoples R China
[3] Janssen Res & Dev, Epidemiol Analyt, Titusville, NJ USA
Keywords
cohort definition; natural language processing; natural language interfaces to database; common data model; ELIGIBILITY CRITERIA; REPRESENTATION; EXTRACTION; SYSTEM;
DOI
10.1093/jamia/ocy178
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Objective: Cohort definition is a bottleneck for conducting clinical research and depends on subjective decisions by domain experts. Data-driven cohort definition is appealing but requires substantial knowledge of terminologies and clinical data models. Criteria2Query is a natural language interface that facilitates human-computer collaboration for cohort definition and execution using clinical databases.

Materials and Methods: Criteria2Query uses a hybrid information extraction pipeline combining machine learning and rule-based methods to systematically parse eligibility criteria text, transforming it first into a structured criteria representation and next into sharable and executable clinical data queries represented as SQL queries conforming to the OMOP Common Data Model. Users can interactively review, refine, and execute queries in the ATLAS web application. To test effectiveness, we evaluated 125 criteria across different disease domains from ClinicalTrials.gov and 52 user-entered criteria. We evaluated F1 score and accuracy against 2 domain experts and calculated the average computation time for fully automated query formulation. We conducted an anonymous survey evaluating usability.

Results: Criteria2Query achieved 0.795 and 0.805 F1 score for entity recognition and relation extraction, respectively. Accuracies for negation detection, logic detection, entity normalization, and attribute normalization were 0.984, 0.864, 0.514 and 0.793, respectively. Fully automatic query formulation took 1.22 seconds/criterion. More than 80% (11+ of 13) of users would use Criteria2Query in their future cohort definition tasks.

Conclusions: We contribute a novel natural language interface to clinical databases. It is open source and supports fully automated and interactive modes for autonomous data-driven cohort definition by researchers with minimal human effort. We demonstrate its promising user friendliness and usability.
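The last step described in the abstract, turning a structured criteria representation into SQL over the OMOP Common Data Model, can be illustrated with a minimal sketch. This is not the Criteria2Query implementation: the `Criterion` class, the `DOMAIN_TABLES` mapping, and both functions are hypothetical names introduced here, and the concept IDs in the usage note are placeholders rather than real vocabulary entries. Only the table and column names (`person`, `condition_occurrence`, `drug_exposure`, and their concept columns) come from the OMOP CDM itself.

```python
from dataclasses import dataclass
from typing import List

# Assumed mapping from an entity domain to an OMOP CDM table and its
# concept column. A real system would cover more domains and attributes.
DOMAIN_TABLES = {
    "condition": ("condition_occurrence", "condition_concept_id"),
    "drug": ("drug_exposure", "drug_concept_id"),
}


@dataclass
class Criterion:
    """Hypothetical structured form of one parsed eligibility criterion."""
    domain: str            # e.g. "condition" or "drug"
    concept_id: int        # normalized standard concept ID (placeholder here)
    negated: bool = False  # True when negation was detected in the text


def criterion_to_sql(c: Criterion) -> str:
    """Render one criterion as a predicate on person.person_id."""
    table, column = DOMAIN_TABLES[c.domain]
    op = "NOT IN" if c.negated else "IN"
    return (
        f"p.person_id {op} "
        f"(SELECT person_id FROM {table} WHERE {column} = {int(c.concept_id)})"
    )


def cohort_sql(criteria: List[Criterion]) -> str:
    """Combine all criteria with AND into a single cohort query."""
    base = "SELECT p.person_id FROM person p"
    if not criteria:
        return base
    where = " AND ".join(criterion_to_sql(c) for c in criteria)
    return f"{base} WHERE {where}"
```

For example, `cohort_sql([Criterion("condition", 111), Criterion("drug", 222, negated=True)])` yields a query selecting persons with at least one condition record for placeholder concept 111 and no drug exposure record for placeholder concept 222. The sketch joins every criterion with AND and ignores temporal windows and measurement values, which the paper's logic detection and attribute normalization steps would otherwise supply.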
Pages: 294-305
Number of pages: 12