WARChain: Consensus-based trust in web archives via proof-of-stake blockchain technology

Cited by: 4
Authors
Lendak, Imre [1, 3]
Indig, Balazs [2]
Palko, Gabor [2]
Affiliations
[1] Eotvos Lorand Univ, Fac Informat, Data Sci & Engn Dept, Budapest, Hungary
[2] Eotvos Lorand Univ, Fac Humanities, Dept Digital Humanities, Budapest, Hungary
[3] Univ Novi Sad, Fac Tech Sci, Novi Sad, Serbia
Keywords
Web archive; validation; blockchain; proof-of-stake; web crawling; censorship
DOI
10.3233/JCS-210040
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Web archives store born-digital documents, which are usually collected from the Internet by crawlers and stored in the Web Archive (WARC) format. The trustworthiness and integrity of web archives remain an open challenge, especially in the news portal domain, which faces the additional challenge of censorship even in democratic societies. The aim of this paper is to present a lightweight, blockchain-based solution for web archive validation that would ensure that documents retrieved by crawlers can be verified as authentic for many years to come. We developed our archive validation solution as an extension and continuation of our work on web crawler development, which mainly targets news portals. The system is designed as an overlay on a blockchain with a proof-of-stake (PoS) distributed consensus algorithm. PoS was chosen because of its lower ecological footprint compared to proof-of-work solutions (e.g. Bitcoin) and its lower expected investment in computing infrastructure. We based our prototype on the open-source Nxt blockchain and implemented it in Python. The prototype was tested on web archive content crawled from Hungarian news portals at two different timestamps, with more than 1 million articles in total. We concluded that the proposed solution is accessible, usable by different stakeholders to validate crawled content, and deployable on cheap commodity hardware, and that it tackles the archive integrity challenge and is capable of efficiently managing duplicate documents.
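The abstract does not describe the validation mechanism in detail, so the following is only a minimal sketch of the general idea of digest-based archive validation with duplicate handling, not the authors' implementation. It assumes SHA-256 content digests and uses a plain in-memory dictionary as a stand-in for the blockchain overlay; the names content_digest, DigestLedger, register and validate are hypothetical and do not come from the paper or from the Nxt API.

```python
import hashlib
from typing import Dict, Tuple


def content_digest(payload: bytes) -> str:
    """Return the SHA-256 hex digest of a crawled document's raw bytes."""
    return hashlib.sha256(payload).hexdigest()


class DigestLedger:
    """Hypothetical in-memory stand-in for a ledger of document digests.

    A real deployment would anchor these digests in blockchain
    transactions; here they are only kept in a dictionary.
    """

    def __init__(self) -> None:
        # Maps digest -> (url, crawl timestamp) of the first registration.
        self._entries: Dict[str, Tuple[str, str]] = {}

    def register(self, url: str, crawled_at: str, payload: bytes) -> bool:
        """Record a document; return False if the same content is already known."""
        digest = content_digest(payload)
        if digest in self._entries:
            return False  # duplicate content, nothing new to store
        self._entries[digest] = (url, crawled_at)
        return True

    def validate(self, payload: bytes) -> bool:
        """Check whether a document's digest was registered earlier."""
        return content_digest(payload) in self._entries
```

In this sketch, a document crawled twice with identical content is registered only once, and any later modification of the archived bytes changes the digest, so validate returns False for the altered copy.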
Pages: 499-515
Page count: 17