Large language model based framework for automated extraction of genetic interactions from unstructured data

Cited by: 1
Authors
Gill, Jaskaran Kaur [1]
Chetty, Madhu [1]
Lim, Suryani [1]
Hallinan, Jennifer [1, 2]
Affiliations
[1] Federation University, Health Innovation and Transformation Centre, Ballarat, Vic, Australia
[2] BioThink, Brisbane, Qld, Australia
Source
PLOS ONE | 2024 / Volume 19 / Issue 05
Keywords
NEURAL-NETWORK; ENTITY; INTEGRATION
DOI
10.1371/journal.pone.0303231
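Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract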
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
Pages: 22
Related papers
50 in total
  • [1] Automated Extraction of PSA Values from Unstructured Medical Records with a Large Language Model for Cancer
    Eustace, N. J.
    Man, K.
    Lu, T.
    Mehdinia, S.
    Tam, A.
    Ladbury, C. J.
    Li, Y. R.
    INTERNATIONAL JOURNAL OF RADIATION ONCOLOGY BIOLOGY PHYSICS, 2024, 120 (02): E621-E622
  • [2] Extraction of clinical data on major pulmonary diseases from unstructured radiologic reports using a large language model
    Park, Hyung Jun
    Huh, Jin-Young
    Chae, Ganghee
    Choi, Myeong Geun
    EUROPEAN RESPIRATORY JOURNAL, 2024, 64
  • [3] Extraction of clinical data on major pulmonary diseases from unstructured radiologic reports using a large language model
    Park, Hyung Jun
    Huh, Jin-Young
    Chae, Ganghee
    Choi, Myeong Geun
    PLOS ONE, 2024, 19 (11)
  • [4] Exploring automated energy optimization with unstructured building data: A multi-agent based framework leveraging large language models
    Xiao, Tong
    Xu, Peng
    ENERGY AND BUILDINGS, 2024, 322
  • [5] Large language models for data extraction from unstructured and semi-structured electronic health records: a multiple model performance evaluation
    Ntinopoulos, Vasileios
    Biefer, Hector Rodriguez Cetina
    Tudorache, Igor
    Papadopoulos, Nestoras
    Odavic, Dragan
    Risteski, Petar
    Haeussler, Achim
    Dzemali, Omer
    BMJ HEALTH & CARE INFORMATICS, 2025, 32 (01)
  • [6] Collaborative large language models for automated data extraction in living systematic reviews
    Khan, Muhammad Ali
    Ayub, Umair
    Naqvi, Syed Arsalan Ahmed
    Khakwani, Kaneez Zahra Rubab
    Sipra, Zaryab bin Riaz
    Raina, Ammad
    Zhou, Sihan
    He, Huan
    Saeidi, Amir
    Hasan, Bashar
    Rumble, Robert Bryan
    Bitterman, Danielle S.
    Warner, Jeremy L.
    Zou, Jia
    Tevaarwerk, Amye J.
    Leventakos, Konstantinos
    Kehl, Kenneth L.
    Palmer, Jeanne M.
    Murad, Mohammad Hassan
    Baral, Chitta
    bin Riaz, Irbaz
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2025
  • [7] AUTOMATED DATA EXTRACTION IN SYSTEMATIC LITERATURE REVIEWS (SLRS): ASSESSING THE ACCURACY AND RELIABILITY OF A LARGE LANGUAGE MODEL (LLM)
    Shree, A.
    Farraia, M.
    Pathak, S.
    Slim, M.
    Cichewicz, A.
    Mittal, L.
    Casanas i Comabella, C.
    VALUE IN HEALTH, 2024, 27 (12)
  • [8] AutoLabel: Automated Textual Data Annotation Method Based on Active Learning and Large Language Model
    Ming, Xuran
    Li, Shoubin
    Li, Mingyang
    He, Lvlong
    Wang, Qing
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT IV, KSEM 2024, 2024, 14887: 400-411
  • [9] Image Text Extraction and Natural Language Processing of Unstructured Data from Medical Reports
    Malashin, Ivan
    Masich, Igor
    Tynchenko, Vadim
    Gantimurov, Andrei
    Nelyub, Vladimir
    Borodulin, Aleksei
    MACHINE LEARNING AND KNOWLEDGE EXTRACTION, 2024, 6 (02): 1361-1377
  • [10] Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing
    Geevarghese, Ruben
    Sigel, Carlie
    Cadley, John
    Chatterjee, Subrata
    Jain, Pulkit
    Hollingsworth, Alex
    Chatterjee, Avijit
    Swinburne, Nathaniel
    Bilal, Khawaja Hasan
    Marinelli, Brett
    JOURNAL OF CLINICAL PATHOLOGY, 2024