|Effects of data bias on machine-learning–based material discovery using experimental property data
Kurosaki, Ken https://orcid.org/0000-0002-3015-3206 (unconfirmed)
large-scale material data
|Taylor & Francis
|Science and Technology of Advanced Materials: Methods
|Materials informatics (MI) research, which is the discovery of new materials through machine learning (ML) using large-scale material data, has attracted considerable attention in recent years. However, in general, the large-scale material data used in MI are biased owing to differences in the targeted material domains. Moreover, most studies on MI have not clearly demonstrated the influence of data bias on ML models. In this study, we clarify the influence of data bias on ML models by combining the concept of the applicability domain and clustering for large-scale experimental property data in the Starrydata2 material database previously developed by our group. The results show that data bias influences the error and reliability of the predictions made by the ML model. The predictions of the ML model within the applicability domain are highly reliable compared to those made outside the domain. This indicates that the material space that can be reliably discovered by the constructed ML model is limited. Nonetheless, we apply the ML model to a large dataset comprising various material classes and find that new materials similar to known materials can be proposed within a limited space. Thus, our findings demonstrate the importance of considering data bias when constructing and evaluating ML models in MI.
|© 2022 The Author(s). Published by National Institute for Materials Science in partnership with Taylor & Francis Group
This is an Open Access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.