Measuring and Facilitating Data Repeatability in Web Science

Authors
Risch, Julian [1]
Krestel, Ralf [1]
Affiliations
[1] Hasso Plattner Institute, University of Potsdam, Prof.-Dr.-Helmert-Str. 2–3, Potsdam, 14482, Germany
Source
Datenbank-Spektrum | 2019 / Volume 19 / Issue 02
Keywords
Tools;
DOI
10.1007/s13222-019-00316-9
Abstract
Accessible and reusable datasets are a necessity to accomplish repeatable research. This requirement poses a problem particularly for web science, since scraped data comes in various formats and can change due to the dynamic character of the web. Further, usage of web data is typically restricted by copyright protection or privacy regulations, which hinder publication of datasets. To alleviate these problems and reach what we define as partial data repeatability, we present a process that consists of multiple components. Researchers need to distribute only a scraper and not the data itself to comply with legal limitations. If a dataset is re-scraped for repeatability after some time, the integrity of different versions can be checked based on fingerprints. Moreover, fingerprints are sufficient to identify what parts of the data have changed and how much. We evaluate an implementation of this process with a dataset of 250 million online comments collected from five different news discussion platforms. We re-scraped the dataset after pausing for one year and show that less than ten percent of the data has actually changed. These experiments demonstrate that providing a scraper and fingerprints enables recreating a dataset and supports the repeatability of web science experiments. © 2019, Gesellschaft für Informatik e.V. and Springer-Verlag GmbH Germany, part of Springer Nature.
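The fingerprint-based comparison mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how fingerprints are computed, so SHA-256 over whitespace-normalized comment text is an assumption, and the names `fingerprint`, `fingerprint_dataset`, and `compare_versions` are hypothetical. The sketch only shows how published fingerprints, without the underlying text, could reveal which parts of a re-scraped dataset differ and by how much.

```python
import hashlib
from typing import Dict, Set, Union


def fingerprint(text: str) -> str:
    """Hash one comment. SHA-256 and whitespace normalization are assumptions."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def fingerprint_dataset(comments: Dict[str, str]) -> Dict[str, str]:
    """Map each (hypothetical) comment ID to the fingerprint of its text."""
    return {cid: fingerprint(body) for cid, body in comments.items()}


def compare_versions(
    old: Dict[str, str], new: Dict[str, str]
) -> Dict[str, Union[Set[str], float]]:
    """Compare two fingerprint maps; no comment text is needed at this point."""
    old_ids, new_ids = set(old), set(new)
    changed = {cid for cid in old_ids & new_ids if old[cid] != new[cid]}
    removed = old_ids - new_ids
    added = new_ids - old_ids
    # Share of the original comments that were edited or are no longer available.
    fraction = (len(changed) + len(removed)) / len(old) if old else 0.0
    return {
        "changed": changed,
        "removed": removed,
        "added": added,
        "changed_fraction": fraction,
    }


# Toy data standing in for an original scrape and a later re-scrape.
original = {"c1": "Great article!", "c2": "I disagree.", "c3": "First"}
rescraped = {"c1": "Great  article!", "c2": "I disagree strongly.", "c4": "New"}

published = fingerprint_dataset(original)      # what would be distributed
recreated = fingerprint_dataset(rescraped)     # computed after re-scraping
report = compare_versions(published, recreated)
```

In this toy run, `c1` differs only in whitespace and therefore counts as unchanged under the normalization assumed here, while `c2` is reported as changed, `c3` as removed, and `c4` as added.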
Pages: 117-126
Page count: 9