Programming techniques for improving rule readability for rule-based information extraction natural language processing pipelines of unstructured and semi-structured medical texts

Cited by: 2
Authors
Ladas, Nektarios [1, 2, 4, 5]
Borchert, Florian [3]
Franz, Stefan [1, 2]
Rehberg, Alina [1, 2]
Strauch, Natalia [1, 2]
Sommer, Kim Katrin [1, 2]
Marschollek, Michael [1, 2]
Gietzelt, Matthias [1, 2]
Affiliations
[1] TU Braunschweig, Peter L Reichertz Inst Med Informat, Hannover, Germany
[2] Hannover Med Sch, Hannover, Germany
[3] Hasso Plattner Inst Digital Engn gGmbH, Potsdam, Germany
[4] TU Braunschweig, Peter L Reichertz Inst Med Informat, PLRI OE 8420, D-30625 Hannover, Niedersachsen, Germany
[5] Hannover Med Sch, PLRI OE 8420, D-30625 Hannover, Niedersachsen, Germany
Keywords
Natural language processing; clinical information systems; rule-based information extraction; extract-transform-load; electronic health record
DOI
10.1177/14604582231164696
Chinese Library Classification (CLC) number
R19 [Health care organization and services (health services administration)]
Abstract
Background: Extraction of medical terms and their corresponding values from semi-structured and unstructured texts of medical reports can be a time-consuming and error-prone process. Methods of natural language processing (NLP) can help define an extraction pipeline that transforms such texts into a structured format.
Objectives: In this paper, we build an NLP pipeline to extract values of the classification of malignant tumors (TNM) from unstructured and semi-structured pathology reports and import them into a structured data source for a clinical study. Our research interest is not focused on standard performance metrics like precision, recall, and F-measure on the test and validation data. We discuss how software programming techniques can improve the readability of rule-based (RB) information extraction (IE) pipelines, and thereby minimize the time needed to correct or update the rules and to transfer them efficiently to another programming language.
Methods: The extraction rules were manually programmed with training data of TNM classification and tested in two separate pipelines based on design specifications from domain experts and data curators. First, we implemented each rule directly in one line for each extraction item. Second, we reprogrammed the rules in a readable fashion through decomposition and intention-revealing names for the variable declarations. To measure the impact of both methods, we measured the time needed for fine-tuning and programming the extractions on test data of semi-structured and unstructured texts.
Results: We analyze the benefits of improving the readability of rule writing through parallel programming with regular expressions (REGEX) and the Apache UIMA Ruta language (AURL). The time for correcting the readable rules in AURL and REGEX was significantly reduced. Complicated rules in REGEX were decomposed, and intention-revealing declarations were reprogrammed in AURL in 5 min.
Conclusion: We discuss the importance of readability as a factor and how it can be improved when programming RB text IE pipelines. Independent of the features of the programming language and the tools applied, a readable coding strategy can prove beneficial for future maintenance and offers an interpretable solution for understanding the extraction and for transferring the rules to other domains and NLP pipelines.
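The contrast described in the Methods (one rule per line versus a decomposed rule with intention-revealing names) can be illustrated with a minimal Python sketch. This is not the authors' rule set: the TNM grammar below is a deliberately simplified assumption, and every name (`ONE_LINE_TNM`, `READABLE_TNM`, `extract_tnm`, the fragment constants) is hypothetical.

```python
import re

# Style 1: the whole rule packed into a single line (hypothetical, simplified TNM grammar).
ONE_LINE_TNM = re.compile(r"\b([pcy]{0,2})T([0-4][a-d]?|is|X)\s*([pc]?)N([0-3][a-c]?|X)\s*([pc]?)M([01]|X)\b")

# Style 2: the same rule decomposed into named fragments.
# Assumed meaning of prefixes: p = pathological, c = clinical, y = post-therapy.
T_PREFIX = r"[pcy]{0,2}"
NM_PREFIX = r"[pc]?"
TUMOR_VALUE = r"[0-4][a-d]?|is|X"
NODE_VALUE = r"[0-3][a-c]?|X"
METASTASIS_VALUE = r"[01]|X"
SEPARATOR = r"\s*"

TUMOR = rf"(?P<t_prefix>{T_PREFIX})T(?P<tumor>{TUMOR_VALUE})"
NODE = rf"(?P<n_prefix>{NM_PREFIX})N(?P<node>{NODE_VALUE})"
METASTASIS = rf"(?P<m_prefix>{NM_PREFIX})M(?P<metastasis>{METASTASIS_VALUE})"

READABLE_TNM = re.compile(rf"\b{TUMOR}{SEPARATOR}{NODE}{SEPARATOR}{METASTASIS}\b")


def extract_tnm(text):
    """Return the first T, N and M values found in text, or None if no match."""
    match = READABLE_TNM.search(text)
    if match is None:
        return None
    return {
        "T": match.group("tumor"),
        "N": match.group("node"),
        "M": match.group("metastasis"),
    }


if __name__ == "__main__":
    print(extract_tnm("Befund: pT2a pN1 M0, G2"))
```

Both patterns accept the same strings; the decomposed version only changes how the rule is written. Under this sketch, adjusting one fragment (for example `NODE_VALUE`) leaves the rest of the rule untouched, which is the kind of maintenance benefit the abstract attributes to readable rules.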
Pages: 14