Mining Software Engineering Data from GitHub

Cited by: 31
Authors
Gousios, Georgios [1]
Spinellis, Diomidis [2]
Affiliations
[1] Delft Univ Technol, Dept Software Technol, Delft, Netherlands
[2] Athens Univ Econ & Business, Dept Management Sci & Technol, Athens, Greece
Keywords
GitHub; GHTorrent; empirical software engineering; Git
DOI
10.1109/ICSE-C.2017.164
Chinese Library Classification (CLC) number
TP31 [Computer software]
Discipline classification code
081202; 0835
Abstract
GitHub is the largest collaborative source code hosting site built on top of the Git version control system. The availability of a comprehensive API has made GitHub a target for many software engineering and online collaboration research efforts. In our work, we have discovered that a) obtaining data from GitHub is not trivial, b) the data may not be suitable for all types of research, and c) improper use can lead to biased results. In this tutorial, we analyze how data from GitHub can be used for large-scale, quantitative research, while avoiding common pitfalls. We use the GHTorrent dataset, a queryable offline mirror of the GitHub API data, to draw examples from and present pitfall avoidance strategies.
Pages: 501-502
Page count: 2