Measuring and Facilitating Data Repeatability in Web Science

Cited by: 0
Authors
Risch, Julian [1]
Krestel, Ralf [1]
Affiliation
[1] Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2–3, Potsdam, 14482, Germany
Source
Datenbank-Spektrum | 2019 / Vol. 19 / Issue 02
Keywords
Tools
DOI
10.1007/s13222-019-00316-9
Abstract
Accessible and reusable datasets are necessary for repeatable research. This requirement is a particular problem for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright protection or privacy regulations, which hinder the publication of datasets. To alleviate these problems and reach what we define as partial data repeatability, we present a process that consists of multiple components. To comply with legal limitations, researchers need to distribute only a scraper and not the data itself. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify which parts of the data have changed and by how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments. © 2019, Gesellschaft für Informatik e.V. and Springer-Verlag GmbH Germany, part of Springer Nature.
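The fingerprint comparison described in the abstract can be illustrated with a minimal sketch. The abstract does not specify how fingerprints are computed, so the choices here are assumptions made for illustration only: one SHA-256 digest per record, whitespace normalization before hashing, and the hypothetical names `fingerprint`, `fingerprint_dataset`, and `compare_versions`. It is not the authors' implementation.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Return a SHA-256 hex digest of a record's text.

    Assumption: whitespace is collapsed first so that purely cosmetic
    differences between two scrapes do not count as changes.
    """
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_dataset(records: dict) -> dict:
    """Map each record ID to the fingerprint of its text."""
    return {record_id: fingerprint(text) for record_id, text in records.items()}


def compare_versions(old_fp: dict, new_fp: dict) -> dict:
    """Compare two fingerprint sets without access to the original texts."""
    old_ids, new_ids = set(old_fp), set(new_fp)
    removed = old_ids - new_ids
    added = new_ids - old_ids
    modified = {i for i in old_ids & new_ids if old_fp[i] != new_fp[i]}
    # Share of the original records that were removed or modified.
    changed_share = (len(removed) + len(modified)) / len(old_fp) if old_fp else 0.0
    return {
        "added": added,
        "removed": removed,
        "modified": modified,
        "changed_share": changed_share,
    }


# Toy data (hypothetical): an original scrape and a later re-scrape.
original = {"c1": "Great article!", "c2": "I disagree.", "c3": "First"}
rescraped = {"c1": "Great   article!", "c2": "I strongly disagree.", "c4": "New comment"}

# Only the fingerprints of the original scrape would need to be published.
published = fingerprint_dataset(original)
report = compare_versions(published, fingerprint_dataset(rescraped))
print(report)
```

In this sketch only `published` has to be shared alongside the scraper. Someone who re-scrapes the data can then see which records were added, removed, or modified, and what share of the original changed, without ever receiving the original texts.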
Pages: 117-126
Number of pages: 9
Related papers
50 in total
  • [1] Analyzing software science data with partial repeatability
    Cai, KY
    Chen, L
    JOURNAL OF SYSTEMS AND SOFTWARE, 2002, 63 (03) : 173 - 186
  • [2] Facilitating Access to the Web of Data: A Guide for Librarians
    Armstrong, Annie
    JOURNAL OF ACADEMIC LIBRARIANSHIP, 2012, 38 (01) : 68 - 68
  • [3] Facilitating Access to the Web of Data: A Guide for Librarians
    Wood, Bob
    INTERNATIONAL JOURNAL OF INFORMATION MANAGEMENT, 2012, 32 (04) : 403 - 405
  • [4] Facilitating Access to the Web of Data: A Guide for Librarians
    Isfandyari-Moghaddam, Alireza
    JOURNAL OF LIBRARIANSHIP AND INFORMATION SCIENCE, 2012, 44 (02) : 139 - 140
  • [5] Web of science: Measuring and assessing science beyond SCI
    Satyanarayana, K
    Jain, NC
    CURRENT SCIENCE, 2004, 86 (05) : 627 - 629
  • [6] Facilitating access to the web of data: A guide for librarians
    Russell, Fiona
    AUSTRALIAN ACADEMIC & RESEARCH LIBRARIES, 2012, 43 (01) : 88 - 88
  • [7] Facilitating Access to the Web of Data: A Guide for Librarians
    Yeates, Robin
    PROGRAM-ELECTRONIC LIBRARY AND INFORMATION SYSTEMS, 2012, 46 (02) : 283 - 285
  • [8] The Web Observatory Extension: Facilitating Web Science Collaboration through Semantic Markup
    DiFranzo, Dominic
    Erickson, John S.
    Gloria, Marie Joan Kristine T.
    Luciano, Joanne S.
    McGuinness, Deborah L.
    Hendler, James
    WWW'14 COMPANION: PROCEEDINGS OF THE 23RD INTERNATIONAL CONFERENCE ON WORLD WIDE WEB, 2014, : 475 - 479
  • [9] The Role of Data Science in Web Science
    Phethean, Christopher
    Simperl, Elena
    Tiropanis, Thanassis
    Tinati, Ramine
    Hall, Wendy
    IEEE INTELLIGENT SYSTEMS, 2016, 31 (03) : 102 - 107
  • [10] Web of science: Measuring and assessing science beyond SCI - Response
    Garfield, E
    CURRENT SCIENCE, 2004, 86 (05) : 629 - 629