Large language model based framework for automated extraction of genetic interactions from unstructured data

Cited by: 1
Authors
Gill, Jaskaran Kaur [1]
Chetty, Madhu [1]
Lim, Suryani [1]
Hallinan, Jennifer [1, 2]
Affiliations
[1] Federat Univ, Hlth Innovat & Transformat Ctr, Ballarat, Vic, Australia
[2] BioThink, Brisbane, Qld, Australia
Source
PLOS ONE | 2024 / Vol. 19 / Issue 05
Keywords
NEURAL-NETWORK; ENTITY; INTEGRATION;
DOI
10.1371/journal.pone.0303231
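Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];
Discipline classification code
07; 0710; 09
Abstract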
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
Pages: 22