Mining Software Engineering Data from GitHub

Cited by: 31
Authors
Gousios, Georgios [1]
Spinellis, Diomidis [2]
Affiliations
[1] Delft Univ Technol, Dept Software Technol, Delft, Netherlands
[2] Athens Univ Econ & Business, Dept Management Sci & Technol, Athens, Greece
Keywords
GitHub; GHTorrent; empirical software engineering; Git
DOI
10.1109/ICSE-C.2017.164
Chinese Library Classification (CLC) number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
Pages: 501-502
Page count: 2
Related papers
50 in total
  • [11] Data mining for software engineering and humans in the loop
    Minku L.L.
    Mendes E.
    Turhan B.
    Progress in Artificial Intelligence, 2016, 5 (04) : 307 - 314
  • [12] Application of Data Mining Technology in Software Engineering
    Ma, Jie
    PROCEEDINGS OF THE 2017 2ND INTERNATIONAL CONFERENCE ON MATERIALS SCIENCE, MACHINERY AND ENERGY ENGINEERING (MSMEE 2017), 2017, 123 : 169 - 172
  • [13] Mining Structures from Massive Text Data: Will It Help Software Engineering?
    Han, Jiawei
    PROCEEDINGS OF THE 2017 32ND IEEE/ACM INTERNATIONAL CONFERENCE ON AUTOMATED SOFTWARE ENGINEERING (ASE'17), 2017, : 2 - 2
  • [14] Mining Communication Patterns in Software Development: A GitHub Analysis
    Ortu, Marco
    Hall, Tracy
    Marchesi, Michele
    Tonelli, Roberto
    Bowes, David
    Destefanis, Giuseppe
    PROMISE'18: PROCEEDINGS OF THE 14TH INTERNATIONAL CONFERENCE ON PREDICTIVE MODELS AND DATA ANALYTICS IN SOFTWARE ENGINEERING, 2018, : 70 - 79
  • [15] Software Engineering for Data Mining (ML-Enabled) Software Applications
    Saeed, Sabeer
    Abubakar, Mohammed Mansur
    Karabatak, Murat
    9TH INTERNATIONAL SYMPOSIUM ON DIGITAL FORENSICS AND SECURITY (ISDFS'21), 2021,
  • [16] Mining Treatment-Outcome Constructs from Sequential Software Engineering Data
    Nayebi, Maleknaz
    Ruhe, Guenther
    Zimmermann, Thomas
    IEEE TRANSACTIONS ON SOFTWARE ENGINEERING, 2021, 47 (02) : 393 - 411
  • [17] Research Progress on Software Engineering Data Mining Technology
    Deng Fengxian
    PROCEEDINGS OF THE 2015 INTERNATIONAL CONFERENCE ON EDUCATION TECHNOLOGY, MANAGEMENT AND HUMANITIES SCIENCE (ETMHS 2015), 2015, 27 : 588 - 592
  • [18] The impact of GitHub on students' learning and engagement in a software engineering course
    Patani, Prutha
    Tiwari, Saurabh
    Rathore, Santosh Singh
    COMPUTER APPLICATIONS IN ENGINEERING EDUCATION, 2024,
  • [19] GitHub as backbone in Software Engineering course: Technology acceptance analysis
    Cizmesija, A.
    Stapic, Z.
    2019 42ND INTERNATIONAL CONVENTION ON INFORMATION AND COMMUNICATION TECHNOLOGY, ELECTRONICS AND MICROELECTRONICS (MIPRO), 2019, : 742 - 746
  • [20] Mining Twitter Data for a More Responsive Software Engineering Process
    Williams, Grant
    Mahmoud, Anas
    PROCEEDINGS OF THE 2017 IEEE/ACM 39TH INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING COMPANION (ICSE-C 2017), 2017, : 280 - 282