Sampling-based Collection and Updating of Online Big Graph Data

Cited by: 0

Authors:

Yin Z.-D. [1]

Yue K. [1]

Zhang B.-B. [1]

Li J. [2]

Affiliations:

[1] School of Information Science and Engineering, Yunnan University, Kunming

[2] School of Software, Yunnan University, Kunming

Source:

Ruan Jian Xue Bao/Journal of Software | 2020, Vol. 31, No. 11

Funding:

National Natural Science Foundation of China

Keywords:

Data collection; Data updating; Online big graph; Parallel crawler; Spark

DOI:

10.13328/j.cnki.jos.005843

Abstract:

The large volume of unstructured data obtained from Web pages, social media, and knowledge bases on the Internet can be represented as an online big graph (OBG). OBG data acquisition, which comprises data collection and updating, is the basis of massive data analysis and knowledge engineering, but it is challenged by the large scale, wide distribution, heterogeneity, and fast-changing nature of OBGs. This study proposes a sampling-based method for adaptive and parallel data collection and updating. First, the HD-QMC algorithm is presented for adaptive collection of OBG data; it combines the branch-and-bound method with quasi-Monte Carlo sampling. Next, the EPP algorithm, based on entropy and the Poisson process, is presented for efficient data updating so that the collected data reflect the dynamic changes of OBGs in real-world environments. Further, the effectiveness of the proposed algorithms is analyzed theoretically, and the various kinds of collected OBG data are uniformly represented as triples to provide an easy-to-use data foundation for OBG analysis and related studies. Finally, the proposed collection and updating algorithms are implemented with Spark, and experimental results on simulated and real-world datasets show the effectiveness and efficiency of the proposed method. © Copyright 2020, Institute of Software, the Chinese Academy of Sciences. All rights reserved.

Citation

Pages: 3540-3558

Page count: 18
