Structured information extraction from scientific text with large language models

Cited by: 29
Authors
Dagdelen, John [1, 2]
Dunn, Alexander [1, 2]
Lee, Sanghoon [1, 2]
Walker, Nicholas [1]
Rosen, Andrew S. [1, 2]
Ceder, Gerbrand [1, 2]
Persson, Kristin A. [1, 2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Mat Sci & Engn Dept, Berkeley, CA USA
Keywords
CANCER RESISTANCE; CELLULAR SENESCENCE; PHYLOGENETIC ANALYSIS; PREMATURE SENESCENCE; MOLE-RAT; MECHANISMS; TRANSCRIPTION; DISCOVERY; ALIGNMENT; PROVIDES;
DOI
10.1038/s41467-024-45563-x
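Chinese Library Classification (CLC)
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];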
Discipline classification code
07; 0710; 09;
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers. Extracting scientific data from published research is a complex task that requires specialised tools. Here the authors present a scheme based on large language models to automate the retrieval of information from text in a flexible and accessible manner.
Pages: 14
Related papers
50 in total
  • [41] Information extraction from biomedical text
    Hobbs, JR
    JOURNAL OF BIOMEDICAL INFORMATICS, 2002, 35 (04) : 260 - 264
  • [42] Large language models for structured reporting in radiology: comment
    Amnuay Kleebayoon
    Viroj Wiwanitkit
    La radiologia medica, 2023, 128 : 1440 - 1440
  • [43] Towards Automatic Semantic Models by Extraction of Relevant Information from Online Text
    Krupp, Lars
    Gruenerbl, Agnes
    Bahle, Gernot
    Lukowicz, Paul
    2019 IEEE INTERNATIONAL CONFERENCE ON SMART COMPUTING (SMARTCOMP 2019), 2019, : 481 - 483
  • [44] SKILL: Structured Knowledge Infusion for Large Language Models
    Moiseev, Fedor
    Dong, Zhe
    Alfonseca, Enrique
    Jaggi, Martin
    NAACL 2022: THE 2022 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES, 2022, : 1581 - 1588
  • [45] Large language models for structured reporting in radiology: comment
    Kleebayoon, Amnuay
    Wiwanitkit, Viroj
    RADIOLOGIA MEDICA, 2023, 128 (11): : 1440 - 1440
  • [46] Language and Text. Data, Models, Information and Applications
    Buk, Solomija
    Rovenchak, Andrij
    GLOTTOMETRICS, 2022, 53
  • [47] Language and Text. Data, models, information and applications
    Kubat, Miroslav
    ROCZNIKI HUMANISTYCZNE, 2022, 70 (08): : 181 - 184
  • [48] Data extraction from polymer literature using large language models
    Gupta, Sonakshi
    Mahmood, Akhlak
    Shetty, Pranav
    Adeboye, Aishat
    Ramprasad, Rampi
    Communications Materials, 2024, 5 (01)
  • [49] Enhancing Relation Extraction from Biomedical Texts by Large Language Models
    Asada, Masaki
    Fukuda, Ken
    ARTIFICIAL INTELLIGENCE IN HCI, PT III, AI-HCI 2024, 2024, 14736 : 3 - 14
  • [50] Chinese resume information extraction based on semi-structured text
    Wentan, Yan
    Yupeng, Qiao
    Chinese Control Conference, CCC, 2017, : 11177 - 11182