Geochemistry π: Automated Machine Learning Python Framework for Tabular Data

Cited by: 1
Authors
ZhangZhou, J. [1]
He, Can [2]
Sun, Jianhao [3]
Zhao, Jianming [1]
Lyu, Yang [1]
Wang, Shengxin [4]
Zhao, Wenyu [1]
Li, Anzhou [1]
Ji, Xiaohui [5]
Agarwal, Anant [6]
Affiliations
[1] Zhejiang Univ, Sch Earth Sci, Key Lab Geosci Big Data & Deep Resource Zhejiang P, Hangzhou, Peoples R China
[2] Natl Univ Singapore, Sch Comp, Singapore, Singapore
[3] China Univ Geosci, Sch Earth Sci, Wuhan, Peoples R China
[4] Lanzhou Univ, Sch Earth Sci, Lanzhou, Peoples R China
[5] China Univ Geosci, Sch Informat Engn, Beijing, Peoples R China
[6] Nissan Motor Corp, Dept Data Sci, Yokohama, Japan
Keywords
automated machine learning; Python framework; tabular data
DOI
10.1029/2023GC011324
Chinese Library Classification number
P3 [Geophysics]; P59 [Geochemistry]
Subject classification code
0708; 070902
Abstract
Although machine learning (ML) has brought new insights into geochemistry research, its implementation is laborious and time-consuming. Here, we announce Geochemistry π, an open-source automated ML Python framework. Geochemists only need to provide tabulated data and select the desired options to clean the data and run ML algorithms. The process operates in a question-and-answer format and therefore requires no coding experience. After either automatic or manual parameter tuning, the framework reports the performance and prediction results of the trained ML model. Built on the scikit-learn library, Geochemistry π establishes a customized automated process for implementing classification, regression, dimensionality reduction, and clustering algorithms. The framework achieves extensibility and portability through a hierarchical pipeline architecture that separates data transmission from algorithm application. The AutoML module uses the Cost-Frugal Optimization and Blended Search Strategy hyperparameter search methods from FLAML (A Fast and Lightweight AutoML Library), and the Ray distributed computing framework accelerates the model parameter optimization process. The MLflow library is integrated for ML lifecycle management, which allows users to compare multiple trained models at different scales and manage the data and diagrams generated. In addition, the web portal is built with separate front-end and back-end frameworks and presents the ML model and data science workflow through a user-friendly web interface. In summary, Geochemistry π provides a Python framework that helps users and developers mine data more efficiently, with both online and offline operation options.

Plain language summary
Geochemistry π is a helpful tool for scientists who work with geochemical data. One of its standout features is its simplicity. Scientists can use the tool to perform machine learning (ML) on tabular data by answering a series of questions about what they want to discover. The tool does the rest, using advanced ML techniques to uncover insights from the data. Even scientists without coding skills can use Geochemistry π effectively. The tool is built on the established scikit-learn library, so it works with a range of ML methods. It is also flexible, allowing researchers to customize it to fit their specific needs. Geochemistry π separates data processing from ML tasks, making it adaptable and expandable. It includes features for continuous training and for managing the entire ML process. To validate the tool, Geochemistry π was tested against previous geochemical studies covering regression, classification, clustering, and dimensionality reduction. The results showed that it could replicate the findings of these studies accurately. Accessible through a web portal or the command line, Geochemistry π is a valuable asset for geochemists and researchers looking to analyze large geochemical data sets.

Key points
• Open-source Python framework for machine learning applications in geochemistry
• Automated pipeline for tabular data
• Question-and-answer format obviates the need for coding experience
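The question-and-answer workflow and the separation of data handling from algorithm application described in the abstract can be illustrated with a minimal sketch. This is not Geochemistry π's actual code or API: every name below (`CLEANERS`, `MODELS`, `run_pipeline`, the answer keys) is hypothetical, and the toy models use only the Python standard library rather than scikit-learn.

```python
from statistics import mean

# --- Data layer (hypothetical): cleaning options the user can select ---
def drop_missing(rows):
    """Drop every row that contains a missing (None) value."""
    return [r for r in rows if None not in r]

def impute_mean(rows):
    """Replace each missing value with the mean of its column's observed values."""
    columns = list(zip(*rows))
    means = [mean(v for v in col if v is not None) for col in columns]
    return [tuple(m if v is None else v for v, m in zip(r, means)) for r in rows]

# --- Algorithm layer (hypothetical): toy regressors standing in for real ones ---
def fit_mean_regressor(X, y):
    """Baseline model: always predict the mean of the training targets."""
    mu = mean(y)
    return lambda x: mu

def fit_linear_regressor(X, y):
    """Ordinary least squares on the first feature only."""
    xs = [x[0] for x in X]
    mx, my = mean(xs), mean(y)
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sum((x - mx) * (t - my) for x, t in zip(xs, y)) / sxx
    intercept = my - slope * mx
    return lambda x: intercept + slope * x[0]

# Each "question" is a menu that maps the user's answer to an option.
CLEANERS = {"1": drop_missing, "2": impute_mean}
MODELS = {"1": fit_mean_regressor, "2": fit_linear_regressor}

def run_pipeline(rows, answers):
    """Clean the table, split features from the target, then fit the chosen model."""
    cleaned = CLEANERS[answers["cleaning"]](rows)   # data layer
    X = [r[:-1] for r in cleaned]                   # all columns but the last
    y = [r[-1] for r in cleaned]                    # last column is the target
    return MODELS[answers["model"]](X, y)           # algorithm layer

# Toy table: one feature column and one target column, with one missing value.
rows = [(1.0, 2.0), (2.0, 4.0), (None, 5.0), (3.0, 6.0)]
model = run_pipeline(rows, {"cleaning": "1", "model": "2"})
prediction = model((4.0,))
print(prediction)
```

Because the cleaning functions and the model-fitting functions share only the cleaned table, a new option can be added to either menu without touching the other layer, which is the kind of extensibility the abstract attributes to the hierarchical pipeline design.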
Pages: 14