DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks

Cited by: 0
Authors
Ghahfarokhi, Mojtaba Mostafavi [1]
Asgari, Arash [1]
Abolnejadian, Mohammad [1]
Heydarnoori, Abbas [1, 2]
Affiliations
[1] Sharif Univ Technol, Dept Comp Engn, Tehran, Iran
[2] Bowling Green State Univ, Dept Comp Sci, Bowling Green, OH USA
Keywords
Open dataset; Kaggle; Jupyter notebooks; Code metrics; Code quality
DOI
10.1145/3643991.3644882
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Jupyter notebooks have become indispensable tools for data analysis and processing in various domains. However, despite their widespread use, there is a notable research gap in understanding and analyzing the contents and code metrics of these notebooks. This gap is primarily attributed to the absence of datasets that encompass both Jupyter notebooks and their extracted code metrics. To address this limitation, we introduce DistilKaggle, a unique dataset specifically curated to facilitate research on code metrics in Jupyter notebooks, utilizing the Kaggle repository as a prime source. Through an extensive study, we identify thirty-four code metrics that significantly impact Jupyter notebook code quality. These features, such as lines of code cell, mean number of words in markdown cells, and performance tier of developer, are crucial for understanding and improving the overall effectiveness of computational notebooks. The DistilKaggle dataset, which is derived from a vast collection of notebooks, comprises two distinct datasets: (i) the Code Cells and Markdown Cells Dataset, presented in two CSV files for easy integration into researchers' workflows as dataframes, which provides a granular view of the content structure within 542,051 Jupyter notebooks and enables detailed analysis of code and markdown cells; and (ii) the Notebook Code Metrics Dataset, which focuses on the identified code metrics of notebooks. Researchers can leverage this dataset to access Jupyter notebooks with specific code quality characteristics, surpassing the limitations of the filters available on the Kaggle website. Furthermore, the reproducibility of the notebooks in our dataset is ensured through the code cells and markdown cells datasets, offering a reliable foundation for researchers to build upon. Given its substantial size, our dataset is an invaluable resource for the research community, surpassing what individual Kaggle users could collect on their own.
For accessibility and transparency, both the dataset and the code utilized in crafting this dataset are publicly available at https://github.com/ISE-Research/DistilKaggle.
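As a rough illustration of the use case described in the abstract (selecting notebooks by code quality characteristics rather than by the Kaggle website's filters), the following minimal sketch filters a metrics table by thresholds. The column names (`kernel_id`, `loc`, `mean_words_markdown`, `performance_tier`), the sample rows, and the helper functions are all hypothetical and are not taken from the published CSV files; the actual schema should be checked in the repository.

```python
import csv
import io

# Illustrative stand-in for a notebook code metrics CSV. The column names and
# the rows below are hypothetical, not taken from the published dataset.
SAMPLE_METRICS_CSV = """kernel_id,loc,mean_words_markdown,performance_tier
1001,120,35.5,3
1002,45,4.0,1
1003,300,52.25,4
1004,80,18.0,2
"""


def load_metrics(text_stream):
    """Parse metric rows, converting numeric columns from strings."""
    rows = []
    for row in csv.DictReader(text_stream):
        rows.append({
            "kernel_id": int(row["kernel_id"]),
            "loc": int(row["loc"]),
            "mean_words_markdown": float(row["mean_words_markdown"]),
            "performance_tier": int(row["performance_tier"]),
        })
    return rows


def select_notebooks(rows, min_markdown_words=0.0, min_tier=0, max_loc=None):
    """Return the kernel ids whose metrics satisfy every threshold."""
    selected = []
    for row in rows:
        if row["mean_words_markdown"] < min_markdown_words:
            continue
        if row["performance_tier"] < min_tier:
            continue
        if max_loc is not None and row["loc"] > max_loc:
            continue
        selected.append(row["kernel_id"])
    return selected


metrics = load_metrics(io.StringIO(SAMPLE_METRICS_CSV))
# Example query: well-documented notebooks written by higher-tier developers.
well_documented = select_notebooks(metrics, min_markdown_words=30.0, min_tier=3)
print(well_documented)
```

With the real files, the in-memory string would be replaced by an opened CSV file, and the selected ids could then be used to look up the matching rows in the code cell and markdown cell tables.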
Pages: 647-651
Number of pages: 5