Sampling for Big Data Profiling: A Survey

Cited by: 12
Authors
Liu, Zhicheng [1]
Zhang, Aoqian [2]
Affiliations
[1] Tsinghua Univ, Sch Software, Beijing 100084, Peoples R China
[2] Univ Waterloo, Cheriton Sch Comp Sci, Waterloo, ON N2L 3G1, Canada
Funding
National Natural Science Foundation of China
Keywords
Big data; Data mining; Task analysis; Sampling methods; Metadata; Relational databases; Systematics; large amount; sampling; data profiling; distance thresholds; dependencies; discovery
DOI
10.1109/ACCESS.2020.2988120
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
With the development of internet technology and computer science, data volumes are growing at an exponential rate. Big data brings both new opportunities and new challenges. On the one hand, big data can be analyzed and mined to discover hidden information and extract more potential value. On the other hand, the 5V characteristics of big data, especially Volume (the sheer amount of data), pose challenges for storage and processing. Many traditional data mining algorithms, machine learning algorithms, and data profiling tasks struggle to handle data at this scale, because processing it demands substantial hardware resources and time. Sampling methods can effectively reduce the amount of data and speed up processing, and sampling has been widely used in the big data context. Data profiling is the activity of discovering metadata about a data set. It has many use cases, for example profiling relational data, graph data, and time series data to support anomaly detection and data repair. However, data profiling is computationally expensive, especially for large data sets. This article therefore surveys sampling for data profiling tasks in the big data context and examines how sampling is applied in different categories of data profiling. The experimental results reported in the surveyed studies show that results obtained from sampled data are close to, and in some cases better than, those obtained from the full data. Sampling therefore plays an important role in the era of big data, and there is reason to believe it will become an indispensable step in big data processing.
Pages: 72713-72726
Page count: 14