Data extraction from polymer literature using large language models

被引:0
|
作者
Gupta, Sonakshi [1 ]
Mahmood, Akhlak [2 ]
Shetty, Pranav [1 ]
Adeboye, Aishat [3 ]
Ramprasad, Rampi [2 ]
机构
[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta,GA, United States
[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta,GA, United States
[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta,GA, United States
关键词
Natural language processing systems;
D O I
10.1038/s43246-024-00708-9
中图分类号
学科分类号
摘要
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (LlaMa 2) large language models (LLM), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~ 2.4 million full text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shots learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
引用
下载
收藏
相关论文
共 50 条
  • [31] Revisiting Relation Extraction in the era of Large Language Models
    Wadhwa, Somin
    Amir, Silvio
    Wallace, Byron C.
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15566 - 15589
  • [32] Trend Extraction and Analysis via Large Language Models
    Soru, Tommaso
    Marshall, Jim
    18TH IEEE INTERNATIONAL CONFERENCE ON SEMANTIC COMPUTING, ICSC 2024, 2024, : 285 - 288
  • [33] Extraction and classification of structured data from unstructured hepatobiliary pathology reports using large language models: a feasibility study compared with rules-based natural language processing
    Geevarghese, Ruben
    Sigel, Carlie
    Cadley, John
    Chatterjee, Subrata
    Jain, Pulkit
    Hollingsworth, Alex
    Chatterjee, Avijit
    Swinburne, Nathaniel
    Bilal, Khawaja Hasan
    Marinelli, Brett
    JOURNAL OF CLINICAL PATHOLOGY, 2024,
  • [34] PLRTE: Progressive learning for biomedical relation triplet extraction using large language models
    Zheng, Yi-Kai
    Zeng, Bi
    Feng, Yi-Chun
    Zhou, Lu
    Li, Yi-Xue
    Journal of Biomedical Informatics, 2024, 159
  • [35] Relation extraction using large language models: a case study on acupuncture point locations
    Li, Yiming
    Peng, Xueqing
    Li, Jianfu
    Zuo, Xu
    Peng, Suyuan
    Pei, Donghong
    Tao, Cui
    Xu, Hua
    Hong, Na
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2024,
  • [36] Demystifying Data Management for Large Language Models
    Miao, Xupeng
    Jia, Zhihao
    Cui, Bin
    COMPANION OF THE 2024 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, SIGMOD-COMPANION 2024, 2024, : 547 - 555
  • [37] Using large language models in psychology
    Demszky, Dorottya
    Yang, Diyi
    Yeager, David
    Bryan, Christopher
    Clapper, Margarett
    Chandhok, Susannah
    Eichstaedt, Johannes
    Hecht, Cameron
    Jamieson, Jeremy
    Johnson, Meghann
    Jones, Michaela
    Krettek-Cobb, Danielle
    Lai, Leslie
    Jonesmitchell, Nirel
    Ong, Desmond
    Dweck, Carol
    Gross, James
    Pennebaker, James
    NATURE REVIEWS PSYCHOLOGY, 2023, 2 (11): : 688 - 701
  • [38] Using large language models in psychology
    Dorottya Demszky
    Diyi Yang
    David S. Yeager
    Christopher J. Bryan
    Margarett Clapper
    Susannah Chandhok
    Johannes C. Eichstaedt
    Cameron Hecht
    Jeremy Jamieson
    Meghann Johnson
    Michaela Jones
    Danielle Krettek-Cobb
    Leslie Lai
    Nirel JonesMitchell
    Desmond C. Ong
    Carol S. Dweck
    James J. Gross
    James W. Pennebaker
    Nature Reviews Psychology, 2023, 2 : 688 - 701
  • [40] Comparative Analysis of Large Language Models in Structured Information Extraction from Job Postings
    Sioziou, Kyriaki
    Zervas, Panagiotis
    Giotopoulos, Kostas
    Tzimas, Giannis
    ENGINEERING APPLICATIONS OF NEURAL NETWORKS, EANN 2024, 2024, 2141 : 82 - 92