KGTorrent: A Dataset of Python Jupyter Notebooks from Kaggle

Cited by: 21
Authors
Quaranta, Luigi [1]
Calefato, Fabio [1]
Lanubile, Filippo [1]
Affiliations
[1] Univ Bari, Bari, Italy
Keywords
open dataset; repository; Kaggle; computational notebook; Jupyter
DOI
10.1109/MSR52588.2021.00072
Chinese Library Classification (CLC)
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Computational notebooks have become the tool of choice for many data scientists and practitioners for performing analyses and disseminating results. Despite their increasing popularity, the research community cannot yet count on a large, curated dataset of computational notebooks. In this paper, we fill this gap by introducing KGTorrent, a dataset of Python Jupyter notebooks with rich metadata retrieved from Kaggle, a platform hosting data science competitions for learners and practitioners at any level of expertise. We describe how we built KGTorrent and provide instructions on how to use it and how to refresh the collection to keep it up to date. Our vision is that the research community will use KGTorrent to study how data scientists, especially practitioners, use Jupyter Notebook in the wild and to identify potential shortcomings that can inform the design of its future extensions.
Pages: 550-554
Page count: 5
Related papers
  • [1] DistilKaggle: A Distilled Dataset of Kaggle Jupyter Notebooks
    Ghahfarokhi, Mojtaba Mostafavi
    Asgari, Arash
    Abolnejadian, Mohammad
    Heydarnoori, Abbas
    [J]. 2024 IEEE/ACM 21ST INTERNATIONAL CONFERENCE ON MINING SOFTWARE REPOSITORIES, MSR, 2024, : 647 - 651
  • [2] Error Identification Strategies for Python Jupyter Notebooks
    Robinson, Derek
    Ernst, Neil A.
    Vargas, Enrique Larios
    Storey, Margaret-Anne D.
    [J]. 30TH IEEE/ACM INTERNATIONAL CONFERENCE ON PROGRAM COMPREHENSION (ICPC 2022), 2022, : 253 - 263
  • [3] Python scripting for biochemistry and molecular biology in Jupyter Notebooks
    Craig, Paul A.
    Nash, Jessica A.
    Crawford, T. Daniel
    [J]. BIOCHEMISTRY AND MOLECULAR BIOLOGY EDUCATION, 2022, 50 (05) : 479 - 482
  • [4] Visualizing protein big data using Python and Jupyter notebooks
    Weiss, Charles J.
    [J]. BIOCHEMISTRY AND MOLECULAR BIOLOGY EDUCATION, 2022, 50 (05) : 431 - 436
  • [5] Series of Jupyter Notebooks Using Python for an Analytical Chemistry Course
    Menke, Erik J.
    [J]. JOURNAL OF CHEMICAL EDUCATION, 2020, 97 (10) : 3899 - 3903
  • [6] A Large-Scale Comparison of Python Code in Jupyter Notebooks and Scripts
    Grotov, Konstantin
    Titov, Sergey
    Sotnikov, Vladimir
    Golubev, Yaroslav
    Bryksin, Timofey
    [J]. 2022 MINING SOFTWARE REPOSITORIES CONFERENCE (MSR 2022), 2022, : 353 - 364
  • [7] A Creative Commons Textbook for Teaching Scientific Computing to Chemistry Students with Python and Jupyter Notebooks
    Weiss, Charles J.
    [J]. JOURNAL OF CHEMICAL EDUCATION, 2021, 98 (02) : 489 - 494
  • [8] PYNBLINT: A quality assurance tool to improve the quality of Python Jupyter notebooks
    Quaranta, Luigi
    Calefato, Fabio
    Lanubile, Filippo
    [J]. SoftwareX, 2024, 28
  • [9] Incorporating Jupyter and Python into analytical chemistry
    Menke, Erik
    [J]. ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2019, 258
  • [10] A Roadmap for Enriching Jupyter Notebooks Documentation with Kaggle Data
    Ghahfarokhi, Mojtaba Mostafavi
    Jahantigh, Hamed
    Asadi, Alireza
    Kianiangolafshani, Sepehr
    Khademian, Ashkan
    Heydarnoori, Abbas
    [J]. PROCEEDINGS 2024 IEEE/ACM 3RD INTERNATIONAL CONFERENCE ON AI ENGINEERING-SOFTWARE ENGINEERING FOR AI, CAIN 2024, 2024, : 271 - 272