A Hybrid Method to Measure Distribution Consistency of Mixed-Attribute Datasets

Cited by: 2
Authors
He Y. [1, 2]
Ye X. [1, 2]
Huang D. [1, 2]
Fournier-Viger P. [1, 2]
Huang J.Z. [1, 2]
Affiliations
[1] College of Computer Science and Software Engineering, Shenzhen University, Shenzhen
[2] National Engineering Laboratory for Big Data System Computing Technology, Shenzhen University, Shenzhen
Keywords
Deep encoding (DE); distribution consistency; generalized maximum mean discrepancy (GMMD); mixed-attribute dataset; multilayer extreme learning machine (MLELM); one-hot encoding (OE)
DOI
10.1109/TAI.2022.3151724
Abstract
Random sample partition (RSP) is a recently developed data management and processing model for Big Data analysis. Applying the RSP model to Big Data computation tasks requires measuring the distribution consistency of different datasets. Existing measurement methods for continuous-attribute and discrete-attribute datasets cannot directly handle mixed-attribute datasets. In this article, we design a hybrid method, abbreviated as MLELM-GMMD, that measures the distribution consistency among different mixed-attribute datasets by using a multilayer extreme learning machine (MLELM) and the generalized maximum mean discrepancy (GMMD) criterion. MLELM first transforms the original mixed-attribute datasets into corresponding deep encoding datasets. The GMMD criterion is then applied to check the distribution consistency of the deep encoding datasets. Four experiments validate the feasibility and effectiveness of MLELM-GMMD: the impact of MLELM on the amount of information during mixed-attribute data transformation, the impact of MLELM on the distributions of mixed-attribute data, the distribution consistencies of RSP and non-RSP data blocks, and a comparison with other measurement methods. Experimental results show that the proposed MLELM-GMMD method measures the distribution consistency of mixed-attribute datasets more accurately than one-hot encoding-based methods. © 2022 IEEE.
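The encode-then-compare idea in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not from the article: it uses the standard (biased) squared maximum mean discrepancy with a Gaussian kernel rather than the GMMD criterion, it substitutes a plain numeric-plus-one-hot encoding for the MLELM deep encoding, and the data, function names, and kernel width are all hypothetical. It compares two randomly partitioned blocks (in the spirit of RSP) against two blocks cut from sorted data.

```python
import math
import random


def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length numeric vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(gaussian_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of the squared MMD (standard MMD, not the paper's GMMD)."""
    return (mean_kernel(xs, xs, sigma)
            + mean_kernel(ys, ys, sigma)
            - 2.0 * mean_kernel(xs, ys, sigma))


def encode(record, categories):
    """Assumed stand-in for deep encoding: numeric value followed by a one-hot label."""
    number, label = record
    return [number] + [1.0 if label == c else 0.0 for c in categories]


# Synthetic mixed-attribute records: one continuous and one discrete attribute.
rng = random.Random(0)
categories = ["a", "b", "c"]
records = [(rng.gauss(0.0, 1.0), rng.choice(categories)) for _ in range(200)]

# Random blocks: shuffle the records, then cut them into two halves.
shuffled = records[:]
rng.shuffle(shuffled)
rsp_a = [encode(r, categories) for r in shuffled[:100]]
rsp_b = [encode(r, categories) for r in shuffled[100:]]

# Non-random blocks: sort by the continuous attribute, then cut into two halves.
ordered = sorted(records, key=lambda r: r[0])
sorted_a = [encode(r, categories) for r in ordered[:100]]
sorted_b = [encode(r, categories) for r in ordered[100:]]

mmd_rsp = mmd_squared(rsp_a, rsp_b)
mmd_sorted = mmd_squared(sorted_a, sorted_b)
print("squared MMD, random blocks:", mmd_rsp)
print("squared MMD, sorted blocks:", mmd_sorted)
```

A smaller value indicates that the two blocks look more alike under the chosen kernel, so the randomly partitioned blocks should score lower than the sorted ones. The kernel width `sigma` strongly affects the scale of the result and would need tuning on real data.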
Pages: 182-196
Page count: 14