Unsupervised word embeddings capture latent knowledge from materials science literature

Cited by: 614
Authors
Tshitoyan, Vahe [1,3]
Dagdelen, John [1,2]
Weston, Leigh [1]
Dunn, Alexander [1,2]
Rong, Ziqin [1]
Kononova, Olga [2]
Persson, Kristin A. [1,2]
Ceder, Gerbrand [1,2]
Jain, Anubhav [1]
Affiliations
[1] Lawrence Berkeley Natl Lab, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA
[3] Google LLC, Mountain View, CA 94043 USA
Keywords
TOTAL-ENERGY CALCULATIONS; THERMAL-CONDUCTIVITY; EFFICIENCY
DOI
10.1038/s41586-019-1335-8
Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences]
Subject Classification Codes
07; 0710; 09
Abstract
The overwhelming majority of scientific knowledge is published as text, which is difficult to analyse by either traditional statistical analysis or modern machine learning methods. By contrast, the main source of machine-interpretable data for the materials research community has come from structured property databases(1,2), which encompass only a small fraction of the knowledge present in the research literature. Beyond property values, publications contain valuable knowledge regarding the connections and relationships between data items as interpreted by the authors. To improve the identification and use of this knowledge, several studies have focused on the retrieval of information from scientific literature using supervised natural language processing(3-10), which requires large hand-labelled datasets for training. Here we show that materials science knowledge present in the published literature can be efficiently encoded as information-dense word embeddings(11-13) (vector representations of words) without human labelling or supervision. Without any explicit insertion of chemical knowledge, these embeddings capture complex materials science concepts such as the underlying structure of the periodic table and structure-property relationships in materials. Furthermore, we demonstrate that an unsupervised method can recommend materials for functional applications several years before their discovery. This suggests that latent knowledge regarding future discoveries is to a large extent embedded in past publications. Our findings highlight the possibility of extracting knowledge and relationships from the massive body of scientific literature in a collective manner, and point towards a generalized approach to the mining of scientific literature.
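As a rough illustration of the similarity-based recommendation described in the abstract (this is not the authors' actual pipeline, which trains ~200-dimensional Word2Vec embeddings on millions of abstracts), candidate materials can be ranked by the cosine similarity between their word vectors and the vector of an application keyword such as "thermoelectric". The vectors below are toy values invented purely for illustration:

```python
from math import sqrt

# Toy, hand-made word vectors (purely illustrative; in the paper these
# are learned from the materials science literature without supervision).
embeddings = {
    "thermoelectric": [0.9, 0.1, 0.2],
    "Bi2Te3": [0.8, 0.2, 0.1],   # a well-known thermoelectric material
    "CuGaTe2": [0.7, 0.3, 0.2],  # hypothetical candidate vector
    "NaCl": [0.1, 0.9, 0.4],     # unrelated material
}

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def rank_candidates(query, candidates):
    """Rank candidate materials by similarity to an application keyword."""
    q = embeddings[query]
    return sorted(candidates, key=lambda w: cosine(embeddings[w], q), reverse=True)

print(rank_candidates("thermoelectric", ["NaCl", "CuGaTe2", "Bi2Te3"]))
# Materials whose vectors point in nearly the same direction as the
# application keyword rank first.
```

In the paper's retrospective test, this kind of ranking against historical embeddings surfaced materials years before they were reported as thermoelectrics in the literature.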
Pages: 95 / +
Page count: 12
Related papers
43 in total
  • [1] Unsupervised word embeddings capture latent knowledge from materials science literature
    Tshitoyan, Vahe
    Dagdelen, John
    Weston, Leigh
    Dunn, Alexander
    Rong, Ziqin
    Kononova, Olga
    Persson, Kristin A.
    Ceder, Gerbrand
    Jain, Anubhav
    NATURE, 2019, 571 : 95 - 98
  • [2] Comparison of Word Embeddings from Different Knowledge Graphs
    Simov, Kiril
    Osenova, Petya
    Popov, Alexander
    LANGUAGE, DATA, AND KNOWLEDGE, LDK 2017, 2017, 10318 : 213 - 221
  • [3] Supervised learning with word embeddings derived from PubMed captures latent knowledge about protein kinases and cancer
    Ravanmehr, Vida
    Blau, Hannah
    Cappelletti, Luca
    Fontana, Tommaso
    Carmody, Leigh
    Coleman, Ben
    George, Joshy
    Reese, Justin
    Joachimiak, Marcin
    Bocci, Giovanni
    Hansen, Peter
    Bult, Carol
    Rueter, Jens
    Casiraghi, Elena
    Valentini, Giorgio
    Mungall, Christopher
    Oprea, Tudor I.
    Robinson, Peter N.
    NAR GENOMICS AND BIOINFORMATICS, 2021, 3 (04)
  • [4] Device Fabrication Knowledge Extraction from Materials Science Literature
    Wadhwa, Neelanshi
    Sarath, S.
    Shah, Sapan
    Reddy, Sreedhar
    Mitra, Pritwish
    Jain, Deepak
    Rai, Beena
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 15416 - 15423
  • [5] Seeing the advantage: visually grounding word embeddings to better capture human semantic knowledge
    Merkx, Danny
    Frank, Stefan L.
    Ernestus, Mirjam
    PROCEEDINGS OF THE WORKSHOP ON COGNITIVE MODELING AND COMPUTATIONAL LINGUISTICS (CMCL 2022), 2022, : 1 - 11
  • [6] Mobility in Unsupervised Word Embeddings for Knowledge Extraction - The Scholars' Trajectories across Research Topics
    Lombardo, Gianfranco
    Tomaiuolo, Michele
    Mordonini, Monica
    Codeluppi, Gaia
    Poggi, Agostino
    FUTURE INTERNET, 2022, 14 (01)
  • [7] Enhancing Word Embeddings with Knowledge Extracted from Lexical Resources
    Biesialska, Magdalena
    Rafieian, Bardia
    Costa-jussa, Marta R.
    58TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2020): STUDENT RESEARCH WORKSHOP, 2020, : 271 - 278
  • [8] Fully unsupervised word translation from cross-lingual word embeddings especially for healthcare professionals
    Chauhan, Shweta
    Saxena, Shefali
    Daniel, Philemon
    INTERNATIONAL JOURNAL OF SYSTEM ASSURANCE ENGINEERING AND MANAGEMENT, 2022, 13 (SUPPL 1) : 28 - 37
  • [9] Reactions to science communication: discovering social network topics using word embeddings and semantic knowledge
    de Lima, Bernardo Cerqueira
    Baracho, Renata Maria Abrantes
    Mandl, Thomas
    Porto, Patricia Baracho
    SOCIAL NETWORK ANALYSIS AND MINING, 2023, 13 (01)