Data extraction from polymer literature using large language models

Cited by: 0

Authors:

Gupta, Sonakshi [1]

Mahmood, Akhlak [2]

Shetty, Pranav [1]

Adeboye, Aishat [3]

Ramprasad, Rampi [2]

Affiliations:

[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States

[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States

[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, United States

Source:

Communications Materials | 2024 / Volume 5 / Issue 1

Keywords:

Natural language processing systems

DOI:

10.1038/s43246-024-00708-9

Abstract:

Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~2.4 million full-text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
