A Data Quality and Quantity Governance for Machine Learning in Materials Science

Authors
Liu Y. [1, 4]
Ma S. [1]
Yang Z. [1]
Zou X. [1]
Shi S. [2, 3]
Affiliations
[1] School of Computer Engineering and Science, Shanghai University, Shanghai
[2] School of Materials Science and Engineering, Shanghai University, Shanghai
[3] Materials Genome Institute, Shanghai University, Shanghai
[4] Shanghai Engineering Research Center of Intelligent Computing System, Shanghai
Keywords
data quality and quantity; domain knowledge; machine learning; materials science
DOI
10.14062/j.issn.0454-5648.20220991
Abstract
Data-driven machine learning is widely used in materials property prediction and structure-activity relationship research because of its accurate and efficient predictions. Data determine the upper limit of machine learning performance. However, materials data often suffer from quality and quantity problems (namely multiple sources, large noise, small samples, and high dimensionality) that hinder the application of machine learning in the materials field. In this paper, by analyzing these data quality and quantity problems and the related governance work, we find that the problem is determined jointly by data quality and data quantity. On this basis, we propose a data quality and quantity governance framework in which materials domain knowledge is embedded throughout the whole process of materials machine learning. We define twelve dimensions to characterize what materials data quality and quantity mean. A life cycle model of data quality and quantity governance is constructed to ensure that governance activities are carried out in an orderly manner. To manage data quality and quantity accurately and comprehensively, a series of corresponding governance processing models are established from both domain-knowledge and data-driven perspectives, which provide technical support for implementing the life cycle model. The framework enables the overall evaluation and improvement of materials data quality and quantity, offering theoretical guidance and candidate solutions for acquiring data of high quality and appropriate quantity, and accelerating the in-depth application of machine learning in materials research and development. © 2023 Chinese Ceramic Society. All rights reserved.
Pages: 427-437
Number of pages: 10