MaterialBERT for natural language processing of materials science texts

Cited by: 15
Authors
Yoshitake, Michiko [1 ]
Sato, Fumitaka [1 ,2 ]
Kawano, Hiroyuki [1 ,2 ]
Teraoka, Hiroshi [1 ,2 ]
Affiliations
[1] Natl Inst Mat Sci, MaDIS, 1-1 Namiki, Tsukuba, Ibaraki 3050044, Japan
[2] Ridgelinez, Business Sci Unit, Tokyo, Japan
Source
SCIENCE AND TECHNOLOGY OF ADVANCED MATERIALS-METHODS | 2022, Vol. 2, No. 1
Keywords
Word embedding; pre-training; BERT; literal information
DOI
10.1080/27660400.2022.2124831
CLC classification number
T [Industrial technology]
Subject classification code
08
Abstract
A BERT (Bidirectional Encoder Representations from Transformers) model, which we named "MaterialBERT", has been generated using scientific papers covering a wide range of materials science as a corpus. A new vocabulary list for the tokenizer was generated from the materials science corpus. Two BERT models with different tokenizer vocabularies were generated: one using the original vocabulary released by Google and the other using the newly created vocabulary. Word vectors embedded during pre-training with the two MaterialBERT models reasonably reflect the meanings of material names, both in material-class clustering and in the relationships between base materials and their compounds or derivatives, not only for inorganic materials but also for organic materials and organometallic compounds. Fine-tuning on CoLA (The Corpus of Linguistic Acceptability) starting from the pre-trained MaterialBERT gave a higher score than the original BERT. The two MaterialBERT models could also serve as a starting point for transfer learning of a narrower, domain-specific BERT.
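The abstract describes two ingredients that are easy to reproduce in outline: building a domain-specific WordPiece vocabulary from a materials corpus, and probing the embedded word vectors of material names from a pre-trained checkpoint. The sketch below is not the authors' code; it assumes a local corpus file ("materials_corpus.txt") and a local MaterialBERT checkpoint directory ("path/to/materialbert"), both hypothetical placeholders, and uses the Hugging Face tokenizers/transformers libraries.

```python
# Minimal sketch (not the authors' pipeline): (1) train a BERT-style WordPiece
# vocabulary on a domain corpus, (2) compare word vectors of material names
# from a pre-trained checkpoint. File and checkpoint paths are hypothetical.
import os
import torch
from tokenizers import BertWordPieceTokenizer
from transformers import AutoModel, AutoTokenizer

# 1) Domain-specific tokenizer vocabulary, analogous to the paper's new vocabulary list.
os.makedirs("materialbert_vocab", exist_ok=True)
wp = BertWordPieceTokenizer(lowercase=True)
wp.train(files=["materials_corpus.txt"], vocab_size=30522, min_frequency=2)
wp.save_model("materialbert_vocab")  # writes vocab.txt for use in BERT pre-training

# 2) Probe embeddings of material names from a pre-trained checkpoint.
tok = AutoTokenizer.from_pretrained("path/to/materialbert")
model = AutoModel.from_pretrained("path/to/materialbert")
model.eval()

def word_vector(text: str) -> torch.Tensor:
    """Mean-pool the last hidden states over the name's subword tokens."""
    inputs = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, seq_len, hidden_size)
    return hidden[0, 1:-1].mean(dim=0)  # drop [CLS]/[SEP], average subwords

v1, v2 = word_vector("titanium dioxide"), word_vector("TiO2")
print(f"cosine similarity: {torch.nn.functional.cosine_similarity(v1, v2, dim=0).item():.3f}")
```

Mean-pooling over subword tokens is one simple way to obtain a single vector for multi-token material names before clustering or comparing them; the paper does not specify its exact pooling choice, so this step is an assumption.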
Pages: 372-380
Number of pages: 9
Related papers
50 records in total
  • [21] Applications of natural language processing and large language models in materials discovery
    Jiang, Xue
    Wang, Weiren
    Tian, Shaohan
    Wang, Hao
    Lookman, Turab
    Su, Yanjing
    NPJ COMPUTATIONAL MATERIALS, 2025, 11 (01)
  • [22] Natural Language Processing for Materials Informatics of Literature Data
    Katsura, Yukari
    IEEJ Transactions on Fundamentals and Materials, 144 (09): 350-359
  • [23] Looking through glass: Knowledge discovery from materials science literature using natural language processing
    Venugopal, Vineeth
    Sahoo, Sourav
    Zaki, Mohd
    Agarwal, Manish
    Gosvami, Nitya Nand
    Krishnan, N. M. Anoop
    PATTERNS, 2021, 2 (07)
  • [24] Processing natural language without natural language processing
    Brill, E
    COMPUTATIONAL LINGUISTICS AND INTELLIGENT TEXT PROCESSING, PROCEEDINGS, 2003, 2588: 360-369
  • [25] ORGANIZATION OF A DICTIONARY IN LEARNING-SYSTEMS FOR PROCESSING NATURAL-LANGUAGE TEXTS
    GLADUN, VP
    SAKUNOV, IA
    CYBERNETICS, 1981, 17 (04): 561-565
  • [26] Semantic similarity of short texts in languages with a deficient natural language processing support
    Furlan, Bojan
    Batanovic, Vuk
    Nikolic, Bosko
    DECISION SUPPORT SYSTEMS, 2013, 55 (03): 710-719
  • [27] IDENTIFICATION OF DISABILITIES IN EDUCATIONAL TEXTS WITH THE APPLICATION OF NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING
    Pinho, Cintia Maria de Araujo
    de Moura, Amanda Ferreira
    Gaspar, Marcos Antonio
    Napolitano, Domingos Marcio Rodrigues
    ETD EDUCACAO TEMATICA DIGITAL, 2022, 24 (02): 350-372
  • [28] Hybrid natural language processing tool for semantic annotation of medical texts in Spanish
    Campillos-Llanos, Leonardo
    Valverde-Mateos, Ana
    Capllonch-Carrion, Adrian
    BMC BIOINFORMATICS, 2025, 26 (01)
  • [29] Natural language processing for social science research: A comprehensive review
    Hou, Yuxin
    Huang, Junming
    CHINESE JOURNAL OF SOCIOLOGY, 2025
  • [30] Connectionist natural language processing: readings from connection science
    Sharkey, Noel
    Machine Translation, 10 (04): 321-327