Performance controlled data reduction for knowledge discovery in distributed databases

Authors
Vucetic, S [1]
Obradovic, Z [1]
Affiliation
[1] Washington State Univ, Sch Elect Engn & Comp Sci, Pullman, WA 99164 USA
Keywords
data reduction; data compression; sensitivity analysis; distributed databases; neural networks; learning curve
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
The objective of data reduction is to obtain a compact representation of a large data set, both to facilitate repeated use of non-redundant information with complex and slow learning algorithms and to allow efficient data transfer and storage. For a user-controllable allowed accuracy loss, we propose an effective data reduction procedure based on guided sampling for identifying a minimal-size representative subset, followed by a model-sensitivity analysis for determining an appropriate compression level for each attribute. Experiments were performed on three large data sets and, depending on an allowed accuracy loss margin ranging from 1% to 5% of the ideal generalization, the achieved compression rates ranged from 95 to 12,500 times. These results indicate that transferring reduced data sets from multiple locations to a centralized site for efficient and accurate knowledge discovery might often be possible in practice.
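The two stages named in the abstract (find a small representative subset, then pick a compression level per attribute, both under an allowed accuracy loss) can be illustrated with a rough sketch. This is not the authors' procedure: the abstract does not give its details, so everything below is an assumption made for illustration. A one-attribute least-squares line stands in for the neural networks, a doubling schedule stands in for guided sampling, uniform quantization stands in for the compression scheme, and the names `smallest_sample`, `fewest_bits`, and `loss` are hypothetical.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b on a single attribute."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def r2(model, xs, ys):
    """Coefficient of determination of the fitted line on (xs, ys)."""
    a, b = model
    my = sum(ys) / len(ys)
    ss_res = sum((y - (a * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1.0 - ss_res / ss_tot

def smallest_sample(xs, ys, val_x, val_y, loss=0.05, start=8):
    """Assumed schedule: double the sample until validation R^2 is
    within `loss` (relative) of the R^2 reached with all the data."""
    target = (1.0 - loss) * r2(fit_line(xs, ys), val_x, val_y)
    n = start
    while n < len(xs):
        if r2(fit_line(xs[:n], ys[:n]), val_x, val_y) >= target:
            return n
        n *= 2
    return len(xs)

def quantize(xs, bits):
    """Uniformly quantize values to 2**bits levels over their range."""
    lo, hi = min(xs), max(xs)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((x - lo) / step) * step for x in xs]

def fewest_bits(xs, ys, val_x, val_y, loss=0.05, max_bits=16):
    """Fewest bits for the attribute such that a model trained on the
    quantized values stays within `loss` of the unquantized model."""
    target = (1.0 - loss) * r2(fit_line(xs, ys), val_x, val_y)
    for bits in range(1, max_bits + 1):
        model = fit_line(quantize(xs, bits), ys)
        if r2(model, val_x, val_y) >= target:
            return bits
    return max_bits

# Synthetic data (an assumption): y = 2x + 1 plus Gaussian noise.
rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(4096)]
ys = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in xs]
val_x = [rng.uniform(0.0, 10.0) for _ in range(1024)]
val_y = [2.0 * x + 1.0 + rng.gauss(0.0, 1.0) for x in val_x]

n = smallest_sample(xs, ys, val_x, val_y, loss=0.05)
bits = fewest_bits(xs[:n], ys[:n], val_x, val_y, loss=0.05)
print(f"kept {n} of {len(xs)} rows, {bits} bits for the attribute")
```

In this sketch each stage applies its own loss margin relative to the model it starts from, so the two losses can compound; how the margin is divided between sampling and compression in the original work is not stated in the abstract.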
Pages: 29-39
Page count: 11