Implementation of Data Preprocessing Techniques on Distributed Big Data Platforms

Cited by: 1
Authors
Celik, Oguz [1]
Hasanbasoglu, Muruvvet [1]
Aktas, Mehmet S. [1]
Kalipsiz, Oya [1]
Kanli, Alper Nebi [2]
Affiliations
[1] Yildiz Tech Univ, Dept Comp Engn, Istanbul, Turkey
[2] Cybersoft, R&D Ctr, Istanbul, Turkey
Keywords
Big Data; Distributed Computing; Outlier Analysis; Missing Value Imputation
DOI
10.1109/ubmk.2019.8907230
Chinese Library Classification (CLC) number
TP301 [Theory, Methods]
Discipline classification code
081202
Abstract
We are now in the era of Big Data, and the need for tools that can process and analyze such data has yet to be met. Big data mining aims to extract meaningful and valuable information from voluminous data that traditional data mining tools cannot handle. One of the most vital steps of any data mining process is the preprocessing of the data. Our aim was to provide distributed implementations of some algorithms for two of the data preprocessing steps: outlier analysis and missing value imputation. The algorithms were implemented on Spark, and this paper focuses on the details and performance of these algorithms on different distributed system setups.
Pages: 73-78
Page count: 6