Data extraction from polymer literature using large language models

Cited by: 0
Authors
Gupta, Sonakshi [1 ]
Mahmood, Akhlak [2 ]
Shetty, Pranav [1 ]
Adeboye, Aishat [3 ]
Ramprasad, Rampi [2 ]
Affiliations
[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, United States
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process remains challenging for large spans of text due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~2.4 million full-text articles, our method identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights into effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data have been made publicly available to the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
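To make the in-context few-shot extraction step described in the abstract concrete, the sketch below shows one way a polymer-property extraction prompt could be assembled with the OpenAI Python client (openai>=1.0) and gpt-3.5-turbo. The JSON record schema, the few-shot example, and the prompt wording are illustrative assumptions for this sketch; the authors' actual prompts, the MaterialsBERT-based NER pre-filtering, and the Llama 2 pipeline are not reproduced here.

```python
# Minimal sketch of few-shot polymer-property extraction with the OpenAI
# Python client (openai>=1.0). The prompt wording, example record, and JSON
# schema are illustrative assumptions, not the paper's actual setup.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# In-context few-shot examples: source text paired with the expected records.
FEW_SHOT_EXAMPLES = [
    {
        "text": "Polystyrene films showed a glass transition temperature of 100 °C.",
        "records": [
            {"polymer": "polystyrene", "property": "glass transition temperature",
             "value": 100, "unit": "°C"}
        ],
    },
]

SYSTEM_PROMPT = (
    "You extract polymer-property records from materials science text. "
    "Return a JSON list of objects with keys: polymer, property, value, unit."
)

def extract_records(paragraph: str) -> list[dict]:
    """Ask the LLM to emit structured polymer-property records for one paragraph."""
    messages = [{"role": "system", "content": SYSTEM_PROMPT}]
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["records"])})
    messages.append({"role": "user", "content": paragraph})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,  # deterministic output is preferable for extraction
    )
    # NOTE: real code should validate the reply; the model may return non-JSON text.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    sample = ("The PMMA membrane exhibited a tensile strength of 65 MPa "
              "and a glass transition temperature of 105 °C.")
    print(extract_records(sample))
```

Temperature is set to 0 so that repeated runs give reproducible output; in a production pipeline the model's reply would additionally need schema validation and error handling before records are stored.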
Related papers (50 in total)
  • [1] Investigations on Scientific Literature Meta Information Extraction Using Large Language Models
    Guo, Menghao
    Wu, Fan
    Jiang, Jinling
    Yan, Xiaoran
    Chen, Guangyong
    Li, Wenhui
    Zhao, Yunhong
    Sun, Zeyi
    2023 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH, ICKG, 2023: 249 - 254
  • [2] Bioregulatory event extraction using large language models: a case study of rice literature
    Yao, Xinzhi
    He, Zhihan
    Xia, Jingbo
    GENOMICS & INFORMATICS, 22 (1)
  • [3] Automated knowledge extraction from polymer literature using natural language processing
    Shetty, Pranav
    Ramprasad, Rampi
    ISCIENCE, 2021, 24 (01)
  • [4] From Large Language Models to Large Multimodal Models: A Literature Review
    Huang, Dawei
    Yan, Chuan
    Li, Qing
    Peng, Xiaojiang
    APPLIED SCIENCES-BASEL, 2024, 14 (12)
  • [5] High-Throughput Extraction of Phase-Property Relationships from Literature Using Natural Language Processing and Large Language Models
    Montanelli, Luca
    Venugopal, Vineeth
    Olivetti, Elsa A.
    Latypov, Marat I.
    INTEGRATING MATERIALS AND MANUFACTURING INNOVATION, 2024, 13 (2): 396 - 405
  • [6] Causality Extraction from Medical Text Using Large Language Models (LLMs)
    Gopalakrishnan, Seethalakshmi
    Garbayo, Luciana
    Zadrozny, Wlodek
    INFORMATION (SWITZERLAND), 2025, 16 (01)
  • [7] Large Language Models for Data Extraction in Slot-Filling Tasks
    Bazan, Marek
    Gniazdowski, Tomasz
    Wolkiewicz, Dawid
    Sarna, Juliusz
    Marchwiany, Maciej E.
    SYSTEM DEPENDABILITY-THEORY AND APPLICATIONS, DEPCOS-RELCOMEX 2024, 2024, 1026: 1 - 18
  • [8] Performance of two large language models for data extraction in evidence synthesis
    Konet, Amanda
    Thomas, Ian
    Gartlehner, Gerald
    Kahwati, Leila
    Hilscher, Rainer
    Kugley, Shannon
    Crotty, Karen
    Viswanathan, Meera
    Chew, Robert
    RESEARCH SYNTHESIS METHODS, 2024
  • [9] Comprehensive testing of large language models for extraction of structured data in pathology
    Grothey, Bastian
    Odenkirchen, Jan
    Brkic, Adnan
    Schömig-Markiefka, Birgid
    Quaas, Alexander
    Büttner, Reinhard
    Tolkach, Yuri
    COMMUNICATIONS MEDICINE, 5 (1)
  • [10] Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning
    Ozdayi, Mustafa Safa
    Peris, Charith
    Fitzgerald, Jack
    Dupuy, Christophe
    Majmudar, Jimit
    Khan, Haidar
    Parikh, Rahil
    Gupta, Rahul
    61ST CONFERENCE OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 2, 2023: 1512 - 1521