Analysis of Dimensionality Reduction Techniques on Big Data

Cited by: 461
Authors
Reddy, G. Thippa [1 ]
Reddy, M. Praveen Kumar [1 ]
Lakshmanna, Kuruva [1 ]
Kaluri, Rajesh [1 ]
Rajput, Dharmendra Singh [1 ]
Srivastava, Gautam [2 ,3 ]
Baker, Thar [4 ]
Affiliations
[1] VIT, Sch Informat Technol & Engn, Vellore 632014, Tamil Nadu, India
[2] Brandon Univ, Dept Math & Comp Sci, Brandon, MB R7A 6A9, Canada
[3] China Med Univ, Res Ctr Interneural Comp, Shenyang 10122, Peoples R China
[4] Liverpool John Moores Univ, Dept Comp Sci, Liverpool L3 3AF, Merseyside, England
Source
IEEE ACCESS | 2020 / Vol. 8
Keywords
Dimensionality reduction; Principal component analysis; Machine learning algorithms; Support vector machines; Medical diagnostic imaging; Feature extraction; Cardiotocography dataset; dimensionality reduction; feature engineering; linear discriminant analysis; machine learning; principal component analysis; MACHINE; CLASSIFIER; DIAGNOSIS; SYSTEM;
DOI
10.1109/ACCESS.2020.2980942
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Due to digitization, a huge volume of data is being generated across several sectors such as healthcare, production, sales, IoT devices, the Web, and organizations. Machine learning algorithms are used to uncover patterns among the attributes of this data. These patterns can then be used to make predictions that help medical practitioners and people at managerial level make executive decisions. Not all the attributes in the generated datasets are important for training machine learning algorithms. Some attributes might be irrelevant, and some might not affect the outcome of the prediction. Ignoring or removing these irrelevant or less important attributes reduces the burden on machine learning algorithms. In this work, two prominent dimensionality reduction techniques, Linear Discriminant Analysis (LDA) and Principal Component Analysis (PCA), are investigated on four popular Machine Learning (ML) algorithms, Decision Tree Induction, Support Vector Machine (SVM), Naive Bayes Classifier, and Random Forest Classifier, using the publicly available Cardiotocography (CTG) dataset from the University of California, Irvine (UCI) Machine Learning Repository. The experimental results show that PCA outperforms LDA on all measures. Also, the performance of the Decision Tree and Random Forest classifiers is not affected much by using PCA or LDA. To further analyze the performance of PCA and LDA, the experimentation is carried out on Diabetic Retinopathy (DR) and Intrusion Detection System (IDS) datasets. The results show that ML algorithms with PCA produce better results when the dimensionality of the datasets is high. When the dimensionality of the datasets is low, the ML algorithms without dimensionality reduction yield better results.
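The comparison described in the abstract (no reduction vs. PCA vs. LDA, each feeding the four classifiers) can be sketched as below. This is only an illustrative sketch, not the authors' code: it assumes scikit-learn, uses a synthetic dataset from `make_classification` as a stand-in for the CTG data, and the 95% variance threshold for PCA, the feature scaling step, the 5-fold cross-validation, and the helper name `evaluate` are all assumptions introduced here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the CTG data (assumption: 3 classes, 30 features).
X, y = make_classification(
    n_samples=600, n_features=30, n_informative=8, n_redundant=10,
    n_classes=3, random_state=0,
)

# Dimensionality reduction options; None means "use all features".
reducers = {
    "none": None,
    "PCA": PCA(n_components=0.95),  # keep 95% of variance (assumed threshold)
    "LDA": LinearDiscriminantAnalysis(n_components=2),  # at most n_classes - 1
}

# The four classifier families named in the abstract.
classifiers = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "SVM": SVC(),
    "NaiveBayes": GaussianNB(),
    "RandomForest": RandomForestClassifier(n_estimators=50, random_state=0),
}


def evaluate(X, y, cv=5):
    """Return mean cross-validated accuracy for every reducer/classifier pair."""
    scores = {}
    for r_name, reducer in reducers.items():
        for c_name, clf in classifiers.items():
            steps = [StandardScaler()]
            if reducer is not None:
                steps.append(reducer)
            steps.append(clf)
            # The reducer is fitted inside each fold, so no test data leaks in.
            pipe = make_pipeline(*steps)
            scores[(r_name, c_name)] = cross_val_score(pipe, X, y, cv=cv).mean()
    return scores


if __name__ == "__main__":
    for (r_name, c_name), acc in sorted(evaluate(X, y).items()):
        print(f"{r_name:>4} + {c_name:<12} accuracy = {acc:.3f}")
```

Accuracies on synthetic data say nothing about the paper's findings; the sketch only shows how such a comparison can be wired up so that each reducer is fitted on training folds alone.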
Pages: 54776-54788
Page count: 13
Related papers
50 in total
  • [31] Dimensionality Reduction Techniques for Visualizing Morphometric Data: Comparing Principal Component Analysis to Nonlinear Methods
    Du, Trina Y.
    [J]. EVOLUTIONARY BIOLOGY, 2019, 46 (01) : 106 - 121
  • [32] Sampling Techniques for Big Data Analysis
    Kim, Jae Kwang
    Wang, Zhonglei
    [J]. INTERNATIONAL STATISTICAL REVIEW, 2019, 87 : S177 - S191
  • [33] Big data analysis techniques.
    St-Pierre, N.
    [J]. JOURNAL OF ANIMAL SCIENCE, 2016, 94 : 624 - 624
  • [34] Dimensionality reduction of medical big data using neural-fuzzy classifier
    Ahmad Taher Azar
    Aboul Ella Hassanien
    [J]. Soft Computing, 2015, 19 : 1115 - 1127
  • [35] Big Data Dimensionality Reduction for Wireless Sensor Networks Using Stacked Autoencoders
    Sirshar, Muneeba
    Saleem, Sajid
    Ilyas, Muhammad U.
    Khan, Muhammad Murtaza
    Alkatheiri, Mohammed Saeed
    Alowibdi, Jalal S.
    [J]. RESEARCH & INNOVATION FORUM 2019: TECHNOLOGY, INNOVATION, EDUCATION, AND THEIR SOCIAL IMPACT, 2019, : 391 - 400
  • [36] Dimensionality reduction of medical big data using neural-fuzzy classifier
    Azar, Ahmad Taher
    Hassanien, Aboul Ella
    [J]. SOFT COMPUTING, 2015, 19 (04) : 1115 - 1127
  • [37] Low-Complexity Dimensionality Reduction for Big Data Analytics in the Smart Grid
    Mohajeri, M.
    Ghassemi, A.
    Gulliver, T. Aaron
    [J]. 2020 IEEE GLOBAL COMMUNICATIONS CONFERENCE (GLOBECOM), 2020,
  • [38] A BIG-DATA APPROACH TO ELECTRONIC HEALTH RECORD DATA - USING DIMENSIONALITY REDUCTION AND CLUSTERING TECHNIQUES TO STUDY LONGITUDINAL RELATIONSHIPS BETWEEN DISEASES
    Maurits, Marc
    Huizinga, Thomas
    Raychaudhuri, Soumya
    Reinders, Marcel
    Karlson, Elizabeth
    van den Akker, Erik
    Knevel, Rachel
    [J]. ANNALS OF THE RHEUMATIC DISEASES, 2019, 78 : 2102 - 2102
  • [39] Performance evaluation of dimensionality reduction techniques on hyperspectral data for mineral exploration
    Deepa, C.
    Shetty, Amba
    Narasimhadhan, A. V.
    [J]. EARTH SCIENCE INFORMATICS, 2023, 16 (01) : 25 - 36
  • [40] Overview and comparative study of dimensionality reduction techniques for high dimensional data
    Ayesha, Shaeela
    Hanif, Muhammad Kashif
    Talib, Ramzan
    [J]. INFORMATION FUSION, 2020, 59 : 44 - 58