Handling Imbalance Classification Virtual Screening Big Data Using Machine Learning Algorithms

Cited by: 11
Authors
Hussin, Sahar K. [1]
Abdelmageid, Salah M. [2]
Alkhalil, Adel [3]
Omar, Yasser M. [4]
Marie, Mahmoud I. [5]
Ramadan, Rabie A. [3, 6]
Affiliations
[1] Alshrouck Acad, Commun & Comp Engn Dept, Cairo, Egypt
[2] Taibah Univ, Comp Engn Dept, Coll Comp Sci & Engn, Medina, Saudi Arabia
[3] Univ Hail, Coll Comp Sci & Engn, Hail, Saudi Arabia
[4] Arab Acad Sci Technol & Maritime Transport, Cairo, Egypt
[5] Al Azhar Univ, Comp & Syst Engn Dept, Cairo, Egypt
[6] Cairo Univ, Comp Engn Dept, Cairo, Egypt
Keywords
K-means clustering;
DOI
10.1155/2021/6675279
Chinese Library Classification number
O1 [Mathematics];
Discipline classification codes
0701; 070101;
Abstract
Virtual screening is the most critical process in drug discovery, and it relies on machine learning to facilitate the screening process. It enables the discovery of molecules that bind to a specific protein to form a drug. Despite its benefits, virtual screening generates enormous amounts of data and suffers from drawbacks such as high dimensionality and class imbalance. This paper tackles data imbalance and aims to improve virtual screening accuracy, especially for the minority class. When a dataset is classified without accounting for its imbalanced nature, most classification methods achieve high predictive accuracy for the majority class but significantly poorer accuracy for the minority class. The paper proposes a K-means algorithm coupled with the Synthetic Minority Oversampling Technique (SMOTE) to overcome the problem of imbalanced datasets. The proposed algorithm is named KSMOTE. Using KSMOTE, minority data can be identified with high accuracy and detected with high precision. A large set of experiments was run on Apache Spark using numeric PaDEL and fingerprint descriptors. The proposed solution was compared with both a no-sampling method and SMOTE on the same datasets. Experimental results showed that the proposed solution outperformed the other methods.
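The abstract does not spell out how KSMOTE combines its two parts, so the following is only a minimal sketch of one plausible reading: cluster the minority samples with K-means, then create synthetic samples by SMOTE-style linear interpolation between two points of the same cluster. The function names (`kmeans`, `smote_within_clusters`), the toy data, and the choice to pair points at random within a cluster (standard SMOTE interpolates toward a nearest neighbour) are all assumptions made for illustration, not the authors' implementation, and the sketch uses plain Python rather than Apache Spark.

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # initial centers: k distinct data points
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(range(k), key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def smote_within_clusters(minority, n_new, k=2, seed=0):
    """Hypothetical helper: make n_new synthetic minority points by
    interpolating between two points that share a k-means cluster."""
    rng = random.Random(seed)
    labels = kmeans(minority, k, seed=seed)
    clusters = [[p for p, lab in zip(minority, labels) if lab == c] for c in range(k)]
    clusters = [c for c in clusters if len(c) >= 2]  # interpolation needs two points
    if not clusters:
        raise ValueError("no cluster has at least two minority samples")
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(rng.choice(clusters), 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return synthetic

# Toy minority class with two well-separated groups (made-up values).
minority = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
            [5.0, 5.0], [5.2, 5.1], [4.9, 5.3]]
new_points = smote_within_clusters(minority, n_new=4, k=2)
print(new_points)
```

Because interpolation is restricted to a single cluster, each synthetic point stays inside the region occupied by one group of minority samples instead of landing in the empty space between groups, which is the usual motivation for clustering before oversampling.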
Pages: 15
Related papers
50 in total
  • [21] Medical Data Clustering and Classification Using TLBO and Machine Learning Algorithms
    Dubey, Ashutosh Kumar
    Gupta, Umesh
    Jain, Sonal
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (03): : 4523 - 4543
  • [22] Classification of Road Traffic Accident Data Using Machine Learning Algorithms
    Kumeda, Bulbula
    Zhang, Fengli
    Zhou, Fan
    Hussain, Sadiq
    Almasri, Ammar
    Assefa, Maregu
    2019 IEEE 11TH INTERNATIONAL CONFERENCE ON COMMUNICATION SOFTWARE AND NETWORKS (ICCSN 2019), 2019, : 682 - 687
  • [23] Classification of Cardiovascular Risk Using Accelerometer Data and Machine Learning Algorithms
    Boiarskaia, Elena
    Liang, Feng
    Zhu, Weimo
    MEDICINE AND SCIENCE IN SPORTS AND EXERCISE, 2014, 46 (05): : 717 - 717
  • [24] CLASSIFICATION OF FACIAL EXPRESSIONS USING DATA MINING AND MACHINE LEARNING ALGORITHMS
    Faria, Brigida Monica
    Lau, Nuno
    Reis, Luis Paulo
    SISTEMAS E TECHNOLOGIAS DE INFORMACAO: ACTAS DA 4A CONFERENCIA IBERICA DE SISTEMAS E TECNOLOGIAS DE LA INFORMACAO, 2009, : 197 - +
  • [25] Big Data Analytics in Healthcare Using Machine Learning Algorithms: A Comparative Study
    Akundi, Sai Hanuman
    Soujanya, R.
    Madhuri, P. M.
    INTERNATIONAL JOURNAL OF ONLINE AND BIOMEDICAL ENGINEERING, 2020, 16 (13) : 19 - 32
  • [26] Handling Semantic Complexity of Big Data using Machine Learning and RDF Ontology Model
    Sajjad, Rauf
    Bajwa, Imran Sarwar
    Kazmi, Rafaqut
    SYMMETRY-BASEL, 2019, 11 (03):
  • [27] Big Data Mining and Classification of Intelligent Material Science Data Using Machine Learning
    Chittam, Swetha
    Gokaraju, Balakrishna
    Xu, Zhigang
    Sankar, Jagannathan
    Roy, Kaushik
    APPLIED SCIENCES-BASEL, 2021, 11 (18):
  • [28] Elastic extreme learning machine for big data classification
    Xin, Junchang
    Wang, Zhiqiong
    Qu, Luxuan
    Wang, Guoren
    NEUROCOMPUTING, 2015, 149 : 464 - 471
  • [29] An Integration of Extreme Learning Machine for Classification of Big Data
    Zhou, Guanwu
    Zhao, Yulong
    Xu, Wenju
    PROCEEDINGS OF 2013 INTERNATIONAL CONFERENCE ON INFORMATION SCIENCE AND COMPUTER APPLICATIONS (ICSA 2013), 2013, 92 : 81 - 86
  • [30] Petrofacies classification using machine learning algorithms
    Silva, Adrielle A.
    Tavares, Monica W.
    Carrasquilla, Abel
    Missagia, Roseane
    Ceia, Marco
    GEOPHYSICS, 2020, 85 (04) : WA101 - WA113