Data extraction from polymer literature using large language models

Cited by: 0
Authors
Gupta, Sonakshi [1 ]
Mahmood, Akhlak [2 ]
Shetty, Pranav [1 ]
Adeboye, Aishat [3 ]
Ramprasad, Rampi [2 ]
Affiliations
[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, United States
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~2.4 million full-text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
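The abstract describes prompting commercial and open-source LLMs with in-context few-shot examples to pull polymer-property records out of article text. Below is a minimal Python sketch of what such a few-shot extraction call can look like; the prompt wording, JSON record schema, example passages, and the extract_records helper are illustrative assumptions, not the authors' actual prompts or pipeline.

```python
# Minimal sketch of in-context few-shot polymer-property extraction with an
# OpenAI-style chat API. Prompt text, schema, and examples are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single few-shot demonstration (real pipelines would use several).
FEW_SHOT_EXAMPLES = [
    {
        "text": "The glass transition temperature of polystyrene was measured to be 100 C.",
        "records": [{"polymer": "polystyrene",
                     "property": "glass transition temperature",
                     "value": 100, "unit": "C"}],
    },
]

def extract_records(paragraph: str) -> list[dict]:
    """Ask the LLM to return polymer-property records as a JSON list."""
    messages = [{"role": "system",
                 "content": ("Extract polymer-property records from the text. "
                             "Reply with a JSON list of objects with keys "
                             "polymer, property, value, unit. Reply [] if none.")}]
    # Append the demonstrations as prior user/assistant turns (few-shot prompting).
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["records"])})
    messages.append({"role": "user", "content": paragraph})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the study also evaluates an open-source Llama 2 model
        temperature=0,
        messages=messages,
    )
    try:
        return json.loads(response.choices[0].message.content or "[]")
    except json.JSONDecodeError:
        return []  # discard malformed generations

if __name__ == "__main__":
    sample = ("Poly(methyl methacrylate) films exhibited a tensile strength "
              "of 65 MPa and a glass transition temperature of 105 C.")
    print(extract_records(sample))
```

In the paper's setting, a passage would typically be routed to the LLM only after an NER filter (MaterialsBERT) flags it as likely to contain polymer-property data, which keeps inference costs down.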