An Approach for Data Labelling and Concept Drift Detection Based on Entropy Model in Rough Sets for Clustering Categorical Data

Cited by: 0
Authors
Reddy, H. [1 ]
Raju, S. [2 ]
Kumar, B. [1 ]
Jayachandra, C. [1 ]
Affiliations
[1] Vasavi Coll Engn, Dept Comp Sci & Engn, Hyderabad, Andhra Pradesh, India
[2] Jawaharlal Nehru Technol Univ Hyderabad, Dept Comp Sci & Engn, Hyderabad, Andhra Pradesh, India
Keywords
Data labelling; entropy; rough set; concept-drift; cluster purity; outlier
DOI
10.1142/S0219649214500208
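Chinese Library Classification (CLC) number
G25 [Library science, librarianship]; G35 [Information science, information work]
Discipline classification code
1205; 120501
Abstract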
Clustering is an important technique in data mining, but clustering a large data set is difficult and time consuming. Data labelling has been suggested as a way to improve efficiency when clustering large databases through sampling: a sample is selected at random for initial clustering, and each data point left out of the sample is then either assigned a cluster label or marked as an outlier by a data labelling technique. Data labelling is easy in the numerical domain because it relies on the distance between a cluster and an unlabelled data point. In the categorical domain, however, distance is not well defined either between data points or between a data point and a cluster, which makes data labelling difficult. This paper proposes a method for labelling categorical data using an entropy model in rough sets. Entropy, introduced by Shannon in the context of information theory, is a powerful mechanism for measuring uncertainty. In this method, data labelling is performed by integrating entropy with rough sets. The method is also applied to drift detection, to establish whether concept drift has occurred when clustering categorical data. Cluster purity is also discussed, using rough entropy, for data labelling and outlier detection. The experimental results show that the efficiency and clustering quality of this algorithm are better than those of previous algorithms.
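To make the idea of entropy-driven labelling concrete, the following is a minimal sketch only. It is not the paper's rough-set entropy model, which the abstract does not specify. It assumes a generic heuristic: an unlabelled categorical point is given the label of the cluster whose size-weighted Shannon entropy grows least when the point is added, and it is treated as an outlier if even that smallest growth exceeds a threshold. The function names and the `outlier_threshold` parameter are hypothetical and chosen for illustration.

```python
from collections import Counter
from math import log2


def attribute_entropy(values):
    """Shannon entropy (in bits) of one categorical attribute."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def cluster_entropy(points):
    """Sum of per-attribute entropies over a cluster of equal-length tuples."""
    if not points:
        return 0.0
    return sum(attribute_entropy(column) for column in zip(*points))


def label_point(point, clusters, outlier_threshold=3.0):
    """Pick the cluster whose size-weighted entropy grows least when `point`
    is added; return "outlier" if that smallest growth exceeds the threshold.

    `outlier_threshold` is an illustrative assumption, not a value from the paper.
    """
    best_label, best_increase = None, float("inf")
    for label, members in clusters.items():
        before = len(members) * cluster_entropy(members)
        after = (len(members) + 1) * cluster_entropy(members + [point])
        increase = after - before
        if increase < best_increase:
            best_label, best_increase = label, increase
    return best_label if best_increase <= outlier_threshold else "outlier"
```

In this sketch, a point that shares attribute values with an existing cluster adds little uncertainty to it and is absorbed, while a point whose values are new to every cluster raises entropy everywhere and is flagged as an outlier. Any real threshold would need to be tuned to the data.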
Pages: 14
Related papers
50 in total
  • [1] Incremental entropy-based clustering on categorical data streams with concept drift
    Li, Yanhong
    Li, Deyu
    Wang, Suge
    Zhai, Yanhui
    [J]. KNOWLEDGE-BASED SYSTEMS, 2014, 59 : 33 - 47
  • [2] Data Labeling method based on Rough Entropy for Categorical Data Clustering
    Sreenivasulu, G.
    Raju, S. Viswanadha
    Rao, N. Sambasiva
    [J]. 2014 INTERNATIONAL CONFERENCE ON ELECTRONICS, COMMUNICATION AND COMPUTATIONAL ENGINEERING (ICECCE), 2014, : 173 - 178
  • [3] An Efficient Approach for Clustering US Census Data Based on Cluster Similarity Using Rough Entropy on Categorical Data
    Sreenivasulu, G.
    Raju, S. Viswanadha
    Rao, N. Sambasiva
    [J]. INFORMATION AND COMMUNICATION TECHNOLOGY FOR COMPETITIVE STRATEGIES, 2019, 40 : 359 - 375
  • [4] Rough Set Approach for Categorical Data Clustering
    Herawan, Tutut
    Yanto, Iwan Tri Riyadi
    Deris, Mustafa Mat
    [J]. DATABASE THEORY AND APPLICATION, 2009, 64 : 179 - 186
  • [5] Clustering of concept-drift categorical data implementation in JAVA
    Reddy Madhavi, K.
    Vinaya Babu, A.
    Viswanadha Raju, S.
    [J]. Communications in Computer and Information Science, 2012, 270 CCIS (PART II): 639 - 654
  • [6] Data Labeling method based on Cluster Purity using Relative Rough Entropy for Categorical Data Clustering
    Reddy, H. Venkateswara
    Raju, S. Viswanadha
    Agrawal, Pratibha
    [J]. 2013 INTERNATIONAL CONFERENCE ON ADVANCES IN COMPUTING, COMMUNICATIONS AND INFORMATICS (ICACCI), 2013, : 500 - 506
  • [7] A Data Labeling method for Categorical Data Clustering using Cluster Entropies in Rough Sets
    Reddy, H. Venkateswara
    Kumar, B. Suresh
    Raju, S. Viswanadha
    [J]. 2014 FOURTH INTERNATIONAL CONFERENCE ON COMMUNICATION SYSTEMS AND NETWORK TECHNOLOGIES (CSNT), 2014, : 444 - 449
  • [8] Ensemble based rough fuzzy clustering for categorical data
    Saha, Indrajit
    Sarkar, Jnanendra Prasad
    Maulik, Ujjwal
    [J]. KNOWLEDGE-BASED SYSTEMS, 2015, 77 : 114 - 127
  • [9] Rough set based information theoretic approach for clustering uncertain categorical data
    Uddin, Jamal
    Ghazali, Rozaida
    Abawajy, Jemal H.
    Shah, Habib
    Husaini, Noor Aida
    Zeb, Asim
    [J]. PLOS ONE, 2022, 17 (05):
  • [10] Clustering of Concept-Drift Categorical Data Implementation in JAVA
    Madhavi, K. Reddy
    Babu, A. Vinaya
    Raju, S. Viswanadha
    [J]. GLOBAL TRENDS IN INFORMATION SYSTEMS AND SOFTWARE APPLICATIONS, PT 2, 2012, 270 : 639 - +