Data extraction from polymer literature using large language models

Cited by: 0
Authors
Gupta, Sonakshi [1 ]
Mahmood, Akhlak [2 ]
Shetty, Pranav [1 ]
Adeboye, Aishat [3 ]
Ramprasad, Rampi [2 ]
Affiliations
[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, United States
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~2.4 million full-text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
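The abstract describes prompting commercial and open-source LLMs with in-context few-shot examples to pull polymer-property records out of article text. Below is a minimal Python sketch of what such a few-shot extraction call can look like; the prompt wording, JSON record schema, example passages, and the extract_records helper are illustrative assumptions, not the authors' actual prompts or pipeline.

```python
# Minimal sketch of in-context few-shot polymer-property extraction with an
# OpenAI-style chat API. Prompt text, schema, and examples are hypothetical.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A single few-shot demonstration (real pipelines would use several).
FEW_SHOT_EXAMPLES = [
    {
        "text": "The glass transition temperature of polystyrene was measured to be 100 C.",
        "records": [{"polymer": "polystyrene",
                     "property": "glass transition temperature",
                     "value": 100, "unit": "C"}],
    },
]

def extract_records(paragraph: str) -> list[dict]:
    """Ask the LLM to return polymer-property records as a JSON list."""
    messages = [{"role": "system",
                 "content": ("Extract polymer-property records from the text. "
                             "Reply with a JSON list of objects with keys "
                             "polymer, property, value, unit. Reply [] if none.")}]
    # Append the demonstrations as prior user/assistant turns (few-shot prompting).
    for ex in FEW_SHOT_EXAMPLES:
        messages.append({"role": "user", "content": ex["text"]})
        messages.append({"role": "assistant", "content": json.dumps(ex["records"])})
    messages.append({"role": "user", "content": paragraph})

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",  # the study also evaluates an open-source Llama 2 model
        temperature=0,
        messages=messages,
    )
    try:
        return json.loads(response.choices[0].message.content or "[]")
    except json.JSONDecodeError:
        return []  # discard malformed generations

if __name__ == "__main__":
    sample = ("Poly(methyl methacrylate) films exhibited a tensile strength "
              "of 65 MPa and a glass transition temperature of 105 C.")
    print(extract_records(sample))
```

In the paper's setting, a passage would typically be routed to the LLM only after an NER filter (MaterialsBERT) flags it as likely to contain polymer-property data, which keeps inference costs down.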