DPCF: A framework for imputing missing values and clustering data in drug discovery process

Cited by: 1
Authors
Bhagat, Hutashan Vishal [1]
Singh, Manminder [1]
Affiliation
[1] Sant Longowal Inst Engn & Technol, Longowal, Punjab, India
Keywords
Data missingness; Imputation; High dimensional datasets; Data clustering; Z_score; Data partitioning algorithms; Unsupervised datasets; Cluster validation index; K-MEANS; ALGORITHM; VALIDATION; PARTITION; NUMBER; FIND
DOI
10.1016/j.chemolab.2022.104686
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
The advent of modern Internet of Things (IoT) architectures has made data easier to collect and more widely available. The data generated by such architectures are large in both volume and dimensionality. As a result, data missingness and data labeling are common problems in the data collection process when the data volume is very large. Data clustering is a commonly used unsupervised pattern classification technique that helps identify the hidden structure of a dataset by grouping similar data items together. In chemometrics, clustering techniques play a significant role in identifying structure-property and structure-activity relationships among different compounds in the drug discovery process. Quantitative structure-property relationships (QSPR) are based on the similar property principle, which states that if compounds are clustered together based on structural descriptors, the compounds within the same cluster will have similar properties. Quantitative structure-activity relationships (QSAR) help determine the empirical relationships between the chemical structure and biological activity of a set of identical compounds. Hence, a variety of compounds can be divided into homogeneous subsets using cluster analysis. The effectiveness of a clustering algorithm depends largely on how well the initial cluster centers are identified. Traditional techniques, which mostly rely on random selection of initial cluster centers and on different parameter settings, may produce distinct clusters for the same dataset. Moreover, if the selected initial cluster centers are outliers, the resulting clusters are of poor quality. The quality of each cluster can be assessed using distinct cluster validation indices (CVIs). This paper aims to solve the problems of data missingness and data labeling for unsupervised datasets.
Continuing a previous study in which the NMVI (Nullify the Missing Values before Imputation) technique efficiently imputed the missing values in different datasets, this paper proposes DPCF (Data Partitioning-based Clustering Framework), which uses the NMVI technique to impute missing values, together with a novel Z-Clust clustering algorithm that efficiently clusters unlabeled data samples. Integrating the NMVI technique and the Z-Clust clustering technique makes the proposed framework well suited to the analysis of unsupervised datasets with missing values. The Z-Clust clustering technique is evaluated using five standard CVIs, and the results are compared with existing clustering techniques. The experimental results show that Z-Clust forms better clusters than the existing clustering techniques. Hence, the proposed DPCF is well suited to the analysis of unlabeled datasets with missing values.
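The abstract's point about initial cluster centers can be illustrated with a minimal sketch. It is not the paper's Z-Clust or NMVI method, which this record does not describe. It assumes plain Lloyd's k-means on a toy one-dimensional dataset, and it uses the within-cluster sum of squares (WCSS) as a simple stand-in quality score, which is not necessarily one of the five CVIs used in the paper. The names `kmeans_1d` and `wcss` and the data values are hypothetical.

```python
def kmeans_1d(points, centers, max_iter=100):
    """Lloyd's k-means on 1-D data, started from explicit initial centers."""
    centers = [float(c) for c in centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: (x - centers[j]) ** 2)
            for x in points
        ]
        if new_labels == labels:  # no point changed cluster: converged
            break
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for j in range(len(centers)):
            members = [x for x, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return labels, centers


def wcss(points, labels, centers):
    """Within-cluster sum of squared distances (lower means tighter clusters)."""
    return sum((x - centers[lab]) ** 2 for x, lab in zip(points, labels))


# Toy data (assumed for illustration): three well-separated groups.
data = [1, 2, 3, 11, 12, 13, 21, 22, 23]

# One initial center per group.
good_labels, good_centers = kmeans_1d(data, [2, 12, 22])
good_score = wcss(data, good_labels, good_centers)

# All initial centers drawn from the same group.
bad_labels, bad_centers = kmeans_1d(data, [1, 2, 3])
bad_score = wcss(data, bad_labels, bad_centers)

print(good_score)  # 6.0
print(bad_score)   # 154.5
```

On the same data and with the same number of clusters, the second start converges to a partition that merges two of the groups, which is the sensitivity to initial centers that the abstract describes.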
Pages: 16