Functional evaluation of out-of-the-box text-mining tools for data-mining tasks

Cited by: 30
Authors
Jung, Kenneth [1]
LePendu, Paea [2]
Iyer, Srinivasan
Bauer-Mehren, Anna
Percha, Bethany [1]
Shah, Nigam H. [2]
Affiliations
[1] Stanford Univ, Program Biomed Informat, Stanford, CA 94305 USA
[2] Stanford Univ, Ctr Biomed Informat Res, Stanford, CA 94305 USA
Keywords
electronic health records; natural language processing; text mining; CLINICAL TEXT; INFORMATION EXTRACTION; SYSTEM; ACCURACY; ARCHITECTURE; ALGORITHM; KNOWLEDGE; ARTHRITIS; ART
DOI
10.1136/amiajnl-2014-002902
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Objective: The trade-off between the speed and simplicity of dictionary-based term recognition and the richer linguistic information provided by more advanced natural language processing (NLP) is an area of active discussion in clinical informatics. In this paper, we quantify this trade-off by comparing text processing systems that balance speed and linguistic understanding differently. We tested both types of systems in three clinical research tasks: phase IV safety profiling of a drug, learning adverse drug-drug interactions, and learning used-to-treat relationships between drugs and indications.
Materials: We first benchmarked the accuracy of the NCBO Annotator and REVEAL on a manually annotated, publicly available dataset from the 2008 i2b2 Obesity Challenge. We then applied the NCBO Annotator and REVEAL to 9 million clinical notes from the Stanford Translational Research Integrated Database Environment (STRIDE) and used the resulting data for the three research tasks.
Results: With large datasets, the results of the three research tasks did not differ significantly between the NCBO Annotator and REVEAL. In one subtask, REVEAL achieved higher sensitivity with smaller datasets.
Conclusions: For a variety of tasks, employing simple term recognition methods instead of advanced NLP methods has little or no impact on accuracy when using large datasets. Simpler dictionary-based methods have the advantage of scaling well to very large datasets. Promoting the use of simple, dictionary-based methods for population-level analyses can advance adoption of NLP in practice.
Pages: 121-131
Page count: 11