Data extraction from polymer literature using large language models

Cited by: 0
Authors
Gupta, Sonakshi [1]
Mahmood, Akhlak [2]
Shetty, Pranav [1]
Adeboye, Aishat [3]
Ramprasad, Rampi [2]
Affiliations
[1] School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[2] School of Materials Science and Engineering, Georgia Institute of Technology, Atlanta, GA, United States
[3] School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, United States
Keywords
Natural language processing systems
DOI
10.1038/s43246-024-00708-9
Abstract
Automated data extraction from materials science literature at scale using artificial intelligence and natural language processing techniques is critical to advance materials discovery. However, this process for large spans of text continues to be a challenge due to the specific nature and styles of scientific manuscripts. In this study, we present a framework to automatically extract polymer-property data from full-text journal articles using commercially available (GPT-3.5) and open-source (Llama 2) large language models (LLMs), in tandem with the named entity recognition (NER)-based MaterialsBERT model. Leveraging a corpus of ~2.4 million full-text articles, our method successfully identified and processed around 681,000 polymer-related articles, resulting in the extraction of over one million records corresponding to 24 properties of over 106,000 unique polymers. We additionally conducted an extensive evaluation of the performance and associated costs of the LLMs used for data extraction, compared to the NER model. We suggest methodologies to optimize costs, provide insights on effective inference via in-context few-shot learning, and illuminate gaps and opportunities for future studies utilizing LLMs for natural language processing in polymer science. The extracted polymer-property data has been made publicly available for the wider scientific community via the Polymer Scholar website. © The Author(s) 2024.
Related papers
  • [41] Efficient anomaly detection in tabular cybersecurity data using large language models
    Xiaoyong Zhao
    Xingxin Leng
    Lei Wang
    Ningning Wang
    Yanqiong Liu
    Scientific Reports, 15 (1)
  • [42] Improving drug repositioning with negative data labeling using large language models
    Milan Picard
    Mickael Leclercq
    Antoine Bodein
    Marie Pier Scott-Boyer
    Olivier Perin
    Arnaud Droit
    Journal of Cheminformatics, 17 (1)
  • [43] A framework for human evaluation of large language models in healthcare derived from literature review
    Tam, Thomas Yu Chow
    Sivarajkumar, Sonish
    Kapoor, Sumit
    Stolyar, Alisa V.
    Polanska, Katelyn
    McCarthy, Karleigh R.
    Osterhoudt, Hunter
    Wu, Xizhi
    Visweswaran, Shyam
    Fu, Sunyang
    Mathur, Piyush
    Cacciamani, Giovanni E.
    Sun, Cong
    Peng, Yifan
    Wang, Yanshan
    NPJ DIGITAL MEDICINE, 2024, 7 (01)
  • [44] Context Compression and Extraction: Efficiency Inference of Large Language Models
    Zhou, Junyao
    Du, Ruiqing
    Tan, Yushan
    Yang, Jintao
    Yang, Zonghao
    Luo, Wei
    Luo, Zhunchen
    Zhou, Xian
    Hu, Wenpeng
    ADVANCED INTELLIGENT COMPUTING TECHNOLOGY AND APPLICATIONS, PT I, ICIC 2024, 2024, 14875: 221-232
  • [45] Can Large Language Models Predict Data Correlations from Column Names?
    Trummer, Immanuel
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (13): 4310-4323
  • [46] Emotion Recognition from Videos Using Multimodal Large Language Models
    Vaiani, Lorenzo
    Cagliero, Luca
    Garza, Paolo
    FUTURE INTERNET, 2024, 16 (07)
  • [47] An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study
    Wang, Lei
    Ma, Yinyao
    Bi, Wenshuai
    Lv, Hanlin
    Li, Yuxiang
    JOURNAL OF MEDICAL INTERNET RESEARCH, 2024, 26
  • [48] Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting
    Tan, Ryan Shea Ying Cong
    Lin, Qian
    Low, Guat Hwa
    Lin, Ruixi
    Goh, Tzer Chew
    Chang, Christopher Chu En
    Lee, Fung Fung
    Chan, Wei Yin
    Tan, Wei Chong
    Tey, Han Jieh
    Leong, Fun Loon
    Tan, Hong Qi
    Nei, Wen Long
    Chay, Wen Yee
    Tai, David Wai Meng
    Lai, Gillianne Geet Yi
    Cheng, Lionel Tim-Ee
    Wong, Fuh Yong
    Chua, Matthew Chin Heng
    Chua, Melvin Lee Kiang
    Tan, Daniel Shao Weng
    Thng, Choon Hua
    Tan, Iain Bee Huat
    Ng, Hwee Tou
    JOURNAL OF THE AMERICAN MEDICAL INFORMATICS ASSOCIATION, 2023, 30 (10): 1657-1664
  • [49] Information extraction from historical well records using a large language model
    Zhiwei Ma
    Javier E. Santos
    Greg Lackey
    Hari Viswanathan
    Daniel O’Malley
    Scientific Reports, 14 (1)
  • [50] Large Language Models for Software Engineering: A Systematic Literature Review
    Hou, Xinyi
    Zhao, Yanjie
    Liu, Yue
    Yang, Zhou
    Wang, Kailong
    Li, Li
    Luo, Xiapu
    Lo, David
    Grundy, John
    Wang, Haoyu
    ACM Transactions on Software Engineering and Methodology, 2024, 33 (08)