A Large-Scale Study of ML-Related Python Projects

Authors
Idowu, Samuel [1 ,2 ]
Sens, Yorick [3 ]
Berger, Thorsten [1 ,2 ,3 ]
Kruger, Jacob [4 ]
Vierhauser, Michael [3 ]
Affiliations
[1] Chalmers, Gothenburg, Sweden
[2] Univ Gothenburg, Gothenburg, Sweden
[3] Ruhr Univ Bochum, Bochum, Germany
[4] Eindhoven Univ Technol, Eindhoven, Netherlands
Keywords
machine learning; ML-enabled systems; evolution; mining study; open-source projects; large-scale study; TensorFlow; scikit-learn
DOI
10.1145/3605098.3636056
Chinese Library Classification (CLC) number
TP39 [Computer applications]
Discipline classification code
081203; 0835
Abstract
The rise of machine learning (ML) for solving current and future problems has increased the production of ML-enabled software systems. Unfortunately, standardized tool chains for developing, employing, and maintaining such projects are not yet mature, which can mainly be attributed to a lack of understanding of the properties of ML-enabled software. For instance, it is still unclear how to manage and evolve ML-specific assets together with other software-engineering assets. In particular, ML-specific tools and processes, such as those for managing ML experiments, are often perceived as incompatible with practitioners' software-engineering tools and processes. To design new tools for developing ML-enabled software, it is crucial to understand the properties and current problems of developing these projects by eliciting empirical data from real projects, including the evolution of the different assets involved. Moreover, while studies in this direction have recently been conducted, identifying certain types of ML-enabled projects (e.g., experiments, libraries, and software systems) remains a challenge for researchers. We present a large-scale study of over 31,066 ML projects found on GitHub, with an emphasis on their development stages and evolution. Our contributions include a dataset, together with empirical data providing an overview of the existing project types and an analysis of the projects' properties and characteristics, especially regarding the implementation of different ML development stages and their evolution. We believe that our results support researchers, practitioners, and tool builders in conducting follow-up studies and especially in building novel tools for managing ML projects, ideally unified with traditional software-engineering tools.
Pages: 1272-1281
Number of pages: 10
Related papers
50 in total
  • [1] A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
    Grotov, Konstantin
    Titov, Sergey
    Sotnikov, Vladimir
    Golubev, Yaroslav
    Bryksin, Timofey
    [J]. 2022 MINING SOFTWARE REPOSITORIES CONFERENCE (MSR 2022), 2022, : 353 - 364
  • [2] GenomeDiagram: a python package for the visualization of large-scale genomic data
    Pritchard, L
    White, JA
    Birch, PRJ
    Toth, IK
    [J]. BIOINFORMATICS, 2006, 22 (05) : 616 - 617
  • [3] Enabling Empirical Research: A Corpus of Large-Scale Python Systems
    Omari, Safwan
    Martinez, Gina
    [J]. PROCEEDINGS OF THE FUTURE TECHNOLOGIES CONFERENCE (FTC) 2019, VOL 2, 2020, 1070 : 661 - 669
  • [4] Efficient Graph Analytics in Python for Large-Scale Data Science
    Zhou, Xiantian
    Ordonez, Carlos
    [J]. BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY (DAWAK 2021), 2021, 12925 : 158 - 164
  • [5] BioNet: A Python interface to NEURON for modeling large-scale networks
    Gratiy, Sergey L.
    Billeh, Yazan N.
    Dai, Kael
    Mitelut, Catalin
    Feng, David
    Gouwens, Nathan W.
    Cain, Nicholas
    Koch, Christof
    Anastassiou, Costas A.
    Arkhipov, Anton
    [J]. PLOS ONE, 2018, 13 (08):
  • [6] An Empirical Study of Type-Related Defects in Python Projects
    Khan, Faizan
    Chen, Boqi
    Varro, Daniel
    McIntosh, Shane
    [J]. IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2022, 48 (08) : 3145 - 3158
  • [7] Nengo: a Python tool for building large-scale functional brain models
    Bekolay, Trevor
    Bergstra, James
    Hunsberger, Eric
    DeWolf, Travis
    Stewart, Terrence C.
    Rasmussen, Daniel
    Choo, Xuan
    Voelker, Aaron Russell
    Eliasmith, Chris
    [J]. FRONTIERS IN NEUROINFORMATICS, 2014, 7
  • [8] Data Mining of Syntax Errors in a Large-Scale Online Python Course
    Lee, Jung A.
    Koprinska, Irena
    Jeffries, Bryn
    [J]. ARTIFICIAL INTELLIGENCE IN EDUCATION: POSTERS AND LATE BREAKING RESULTS, WORKSHOPS AND TUTORIALS, INDUSTRY AND INNOVATION TRACKS, PRACTITIONERS AND DOCTORAL CONSORTIUM, PT II, 2022, 13356 : 599 - 603
  • [9] A Large-Scale Security-Oriented Static Analysis of Python Packages in PyPI
    Ruohonen, Jukka
    Hjerppe, Kalle
    Rindell, Kalle
    [J]. 2021 18TH INTERNATIONAL CONFERENCE ON PRIVACY, SECURITY AND TRUST (PST), 2021,
  • [10] COLOSSUS: A Python Toolkit for Cosmology, Large-scale Structure, and Dark Matter Halos
    Diemer, Benedikt
    [J]. ASTROPHYSICAL JOURNAL SUPPLEMENT SERIES, 2018, 239 (02):