Data source selection for information integration in big data era

Cited by: 16
Authors
Lin, Yiming [1]
Wang, Hongzhi [1]
Li, Jianzhong [1]
Gao, Hong [1]
Affiliations
[1] Harbin Inst Technol, Harbin, Heilongjiang, Peoples R China
Keywords
Source selection; Data integration; Data cleaning;
DOI
10.1016/j.ins.2018.11.029
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
In the big data era, information integration often requires abundant data extracted from a massive number of data sources. Because accessing every source is costly and sometimes impossible, data source selection plays a crucial role in information integration. Source selection must address both efficiency and effectiveness. For efficiency, the approach should scale to a large number of data sources. For effectiveness, it should account for data quality and for overlap among sources. In this paper, we study the source selection problem in big data and propose methods that scale to datasets with up to millions of data sources while guaranteeing the quality of the results. We propose a new metric that uses the expected number of true values a source can provide as the criterion for evaluating the contribution of a data source. Based on this metric, we present a scalable algorithm and two pruning strategies that improve efficiency without sacrificing precision. Experimental results on both real-world and synthetic data sets show that our methods efficiently select sources providing a large proportion of true values and scale to massive numbers of data sources. (C) 2018 Elsevier Inc. All rights reserved.
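The abstract does not spell out the metric or the algorithm, so the following is only an illustrative sketch of the general idea, not the authors' method. It assumes each source has a single accuracy (the probability that a value it provides is true), that sources err independently, and that a fixed budget `k` of sources may be selected. The names `expected_true_values` and `greedy_select` are hypothetical. Under these assumptions, the expected number of true values obtained from a set of sources rewards accuracy and penalizes overlap, and a plain greedy loop picks the source with the largest marginal gain at each step.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical source model: (accuracy, set of items the source covers).
Source = Tuple[float, Set[str]]


def expected_true_values(selected: List[str], sources: Dict[str, Source]) -> float:
    """Expected number of items for which at least one selected source
    provides the true value, assuming independent source errors."""
    miss: Dict[str, float] = {}  # item -> probability that no selected source is right
    for name in selected:
        acc, items = sources[name]
        for item in items:
            miss[item] = miss.get(item, 1.0) * (1.0 - acc)
    return sum(1.0 - m for m in miss.values())


def greedy_select(sources: Dict[str, Source], k: int) -> List[str]:
    """Pick up to k sources, each time taking the largest marginal gain."""
    selected: List[str] = []
    miss: Dict[str, float] = {}
    for _ in range(min(k, len(sources))):
        best_name, best_gain = None, 0.0
        for name, (acc, items) in sources.items():
            if name in selected:
                continue
            # Adding this source lowers miss[item] by miss[item] * acc.
            gain = sum(miss.get(item, 1.0) * acc for item in items)
            if gain > best_gain:
                best_name, best_gain = name, gain
        if best_name is None:  # no remaining source adds anything
            break
        selected.append(best_name)
        acc, items = sources[best_name]
        for item in items:
            miss[item] = miss.get(item, 1.0) * (1.0 - acc)
    return selected


if __name__ == "__main__":
    demo = {
        "A": (0.9, {"x", "y", "z"}),
        "B": (0.9, {"x", "y", "z"}),  # fully overlaps with A
        "C": (0.6, {"u", "v"}),       # less accurate, but covers new items
    }
    chosen = greedy_select(demo, 2)
    print(chosen, expected_true_values(chosen, demo))
```

In the demo data, source B is as accurate as A but covers the same items, so after A is chosen the less accurate but non-overlapping source C offers the larger marginal gain. This sketch rescans every source in each round; the pruning strategies mentioned in the abstract are not reproduced here.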
Pages: 197-213
Number of pages: 17