Large language model based framework for automated extraction of genetic interactions from unstructured data

Cited by: 1
Authors
Gill, Jaskaran Kaur [1]
Chetty, Madhu [1]
Lim, Suryani [1]
Hallinan, Jennifer [1, 2]
Affiliations
[1] Federat Univ, Hlth Innovat & Transformat Ctr, Ballarat, Vic, Australia
[2] BioThink, Brisbane, Qld, Australia
Source
PLOS ONE | 2024 / Vol. 19 / Issue 05
Keywords
NEURAL-NETWORK; ENTITY; INTEGRATION;
DOI
10.1371/journal.pone.0303231
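Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code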
07 ; 0710 ; 09 ;
Abstract
Extracting biological interactions from published literature helps us understand complex biological systems, accelerate research, and support decision-making in drug or treatment development. Despite efforts to automate the extraction of biological relations using text mining tools and machine learning pipelines, manual curation continues to serve as the gold standard. However, the rapidly increasing volume of literature pertaining to biological relations poses challenges in its manual curation and refinement. These challenges are further compounded because only a small fraction of the published literature is relevant to biological relation extraction, and the embedded sentences of relevant sections have complex structures, which can lead to incorrect inference of relationships. To overcome these challenges, we propose GIX, an automated and robust Gene Interaction Extraction framework, based on pre-trained Large Language models fine-tuned through extensive evaluations on various gene/protein interaction corpora including LLL and RegulonDB. GIX identifies relevant publications with minimal keywords, optimises sentence selection to reduce computational overhead, simplifies sentence structure while preserving meaning, and provides a confidence factor indicating the reliability of extracted relations. GIX's Stage-2 relation extraction method performed well on benchmark protein/gene interaction datasets, assessed using 10-fold cross-validation, surpassing state-of-the-art approaches. We demonstrated that the proposed method, although fully automated, performs as well as manual relation extraction, with enhanced robustness. We also observed GIX's capability to augment existing datasets with new sentences, incorporating newly discovered biological terms and processes. Further, we demonstrated GIX's real-world applicability in inferring E. coli gene circuits.
Pages: 22
Related papers
50 in total
  • [21] Large Language Model-Based Critical Care Big Data Deployment and Extraction: Descriptive Analysis
    Yang, Zhongbao
    Xu, Shan-Shan
    Liu, Xiaozhu
    Xu, Ningyuan
    Chen, Yuqing
    Wang, Shuya
    Miao, Ming-Yue
    Hou, Mengxue
    Liu, Shuai
    Zhou, Yi-Min
    Zhou, Jian-Xin
    Zhang, Linlin
    JMIR MEDICAL INFORMATICS, 2025, 13
  • [22] Automated data function extraction from textual requirements by leveraging semi-supervised CRF and language model
    Li, Mingyang
    Shi, Lin
    Wang, Yawen
    Wang, Junjie
    Wang, Qing
    Hu, Jun
    Peng, Xinhua
    Liao, Weimin
    Pi, Guizhen
    INFORMATION AND SOFTWARE TECHNOLOGY, 2022, 143
  • [23] Automated Extraction of Fine-Grained Standardized Product Information from Unstructured Multilingual Web Data
    Flick, Alexander
    Jaeger, Sebastian
    Trajanovska, Ivana
    Biessmann, Felix
    ADVANCES IN INFORMATION RETRIEVAL, ECIR 2023, PT III, 2023, 13982 : 230 - 235
  • [24] From text to insight: large language models for chemical data extraction
    Schilling-Wilhelmi, Mara
    Rios-Garcia, Martino
    Shabih, Sherjeel
    Gil, Maria Victoria
    Miret, Santiago
    Koch, Christoph T.
    Marquez, Jose A.
    Jablonka, Kevin Maik
    CHEMICAL SOCIETY REVIEWS, 2025, 54 (03) : 1125 - 1150
  • [25] Data extraction from polymer literature using large language models
    Gupta, Sonakshi
    Mahmood, Akhlak
    Shetty, Pranav
    Adeboye, Aishat
    Ramprasad, Rampi
    COMMUNICATIONS MATERIALS, 2024, 5 (01)
  • [26] Evaluating local open-source large language models for data extraction from unstructured reports on mechanical thrombectomy in patients with ischemic stroke
    Meddeb, Aymen
    Ebert, Philipe
    Bressem, Keno Kyrill
    Desser, Dmitriy
    Dell'Orco, Andrea
    Bohner, Georg
    Kleine, Justus F.
    Siebert, Eberhard
    Grauhan, Nils
    Brockmann, Marc A.
    Othman, Ahmed
    Scheel, Michael
    Nawabi, Jawed
    JOURNAL OF NEUROINTERVENTIONAL SURGERY, 2024
  • [27] Using a Large Language Model (LLM) for Automated Extraction of Discrete Elements from Clinical Notes for Creation of Cancer Databases
    Gilbert, M.
    Crutchfield, A.
    Luo, B.
    Thind, K.
    Ghanem, A. I.
    Siddiqui, F.
    INTERNATIONAL JOURNAL OF RADIATION ONCOLOGY BIOLOGY PHYSICS, 2024, 120 (02) : E625 - E625
  • [28] A scoping review of large language model based approaches for information extraction from radiology reports
    Reichenpfader, Daniel
    Muller, Henning
    Denecke, Kerstin
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [29] Large language models overcome the challenges of unstructured text data in ecology
    Castro, Andry
    Pinto, Joao
    Reino, Luis
    Pipek, Pavel
    Capinha, Cesar
    ECOLOGICAL INFORMATICS, 2024, 82
  • [30] Information Extraction from Unstructured Data using RDF
    Gandhi, Kalgi
    Madia, Nidhi
    PROCEEDINGS OF 2016 INTERNATIONAL CONFERENCE ON ICT IN BUSINESS INDUSTRY & GOVERNMENT (ICTBIG), 2016