Similarity distance based approach for outlier detection by matrix calculation

Cited by: 0
Authors
Ye, Ou [1]
Li, Zhanli [1]
Affiliation
[1] Xi'an University of Science and Technology, Xi'an, China
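Funding
National Natural Science Foundation of China
Keywords
Statistics - Matrix algebra - Data mining - Calculations - Data handling
DOI
Not available
Chinese Library Classification number
Subject classification number
Abstract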
Purpose: String outliers in client information need to be detected and cleaned. Many current outlier detection algorithms consider only the semantics of the data and ignore its structure, which makes it difficult to ensure detection accuracy. To address this issue, this paper proposes an outlier detection method based on similarity distance. Methodology: We formulated a similarity calculation model for string data that combines semantic and structural factors. Following outlier detection theory in data cleansing, one-dimensional string data were projected into a two-dimensional space, where string outliers were detected using a new similarity measurement mechanism. Findings: We first obtained the word frequency of the string data by matrix calculation, and then calculated the semantic similarity and the structure similarity from the word frequency. After mapping the string data from one-dimensional to two-dimensional space, we identified the outliers using the similarity distance. Originality: We studied string outlier detection in data cleansing. First, we formulated a similarity calculation model that considers both the semantic factor and the structure factor. Second, by constructing a similarity cell into which the string data are projected, we carried out the similarity distance measurement within the similarity cell. Practical value: The method can be used to clean outlier string data in an enterprise's client information, helping to ensure the data quality of that information and to reduce data maintenance costs. Extensive simulation experiments were conducted to demonstrate the feasibility and rationality of the method. The results showed that the method improves the accuracy of string outlier detection. © Ou Ye, Zhanli Li, 2016.
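The abstract does not give the paper's formulas, so the following is only a minimal sketch of the pipeline it outlines (word frequency by matrix calculation, two similarities, projection to two dimensions, distance-based detection), not the authors' method. Everything specific here is an assumption made for illustration: whitespace tokenization, cosine similarity as the "semantic" measure, token-shape counts (digit, alpha, other) as the "structure" measure, the median as the reference point, and the hypothetical names `count_matrix`, `mean_similarity`, `detect_outliers`, and `threshold`.

```python
import numpy as np

def count_matrix(token_lists):
    """Frequency matrix: one row per string, one column per distinct token."""
    vocab = sorted({tok for toks in token_lists for tok in toks})
    col = {tok: j for j, tok in enumerate(vocab)}
    counts = np.zeros((len(token_lists), len(vocab)))
    for i, toks in enumerate(token_lists):
        for tok in toks:
            counts[i, col[tok]] += 1
    return counts

def mean_similarity(counts):
    """Mean cosine similarity of each row to all other rows (needs >= 2 rows)."""
    norms = np.linalg.norm(counts, axis=1, keepdims=True)
    norms[norms == 0] = 1.0           # avoid division by zero for empty rows
    unit = counts / norms
    sim = unit @ unit.T               # all pairwise cosines in one matrix product
    np.fill_diagonal(sim, 0.0)        # ignore each string's similarity to itself
    return sim.sum(axis=1) / (len(counts) - 1)

def token_shape(tok):
    """Assumed structural class of a token (not taken from the paper)."""
    if tok.isdigit():
        return "digit"
    if tok.isalpha():
        return "alpha"
    return "other"

def detect_outliers(strings, threshold=0.3):
    """Return indices of strings far from the median point in the 2-D space."""
    tokens = [s.lower().split() for s in strings]
    shapes = [[token_shape(t) for t in toks] for toks in tokens]
    semantic = mean_similarity(count_matrix(tokens))
    structure = mean_similarity(count_matrix(shapes))
    points = np.column_stack([semantic, structure])   # one 2-D point per string
    center = np.median(points, axis=0)
    dist = np.linalg.norm(points - center, axis=1)
    return [i for i, d in enumerate(dist) if d > threshold]

# Made-up client address strings; the last one is an obvious stray value.
clients = [
    "12 Main Street Springfield",
    "14 Main Street Springfield",
    "18 Main Street Springfield",
    "20 Main Street Springfield",
    "zzz",
]
print(detect_outliers(clients))
```

In this sketch each string becomes a point whose coordinates are its average semantic and structural similarity to the rest of the data set, so a string that shares few tokens or token shapes with the others lands far from the median point. The cutoff of 0.3 is arbitrary and would need tuning on real data.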
Pages: 99-105