Similarity distance based approach for outlier detection by matrix calculation

Authors
Ye, Ou [1]
Li, Zhanli [1]
Affiliations
[1] Xi'an University of Science and Technology, Xi'an, China
Funding
National Natural Science Foundation of China
Keywords
Statistics - Matrix algebra - Data mining - Calculations - Data handling
DOI
Not available
Abstract
Purpose: String outliers in client information need to be detected and cleaned. Many current outlier detection algorithms consider only the semantics of the data and ignore its structure, which makes accurate detection difficult. To address this issue, this paper proposes an outlier detection method based on similarity distance. Methodology: We formulated a similarity calculation model for string data that combines semantic and structural factors. Following outlier detection theory in data cleansing, one-dimensional string data were projected into a two-dimensional space, where string outliers were detected using a new similarity measurement mechanism. Findings: We first obtained the word frequencies of the string data by matrix calculation. The semantic similarity and structural similarity were then calculated from the word frequencies. After mapping the string data from one-dimensional to two-dimensional space, we identified the outliers by using the similarity distance. Originality: We studied string outlier detection in data cleansing. First, we formulated the similarity calculation model by considering both the semantic factor and the structural factor. Second, by constructing a similarity cell into which the string data are projected, we carried out the similarity distance measurement within that cell. Practical value: The method can be used to clean outlier string data in an enterprise's client information, helping to ensure the quality of that data and to reduce data maintenance costs. Extensive simulation experiments were conducted to demonstrate the feasibility and rationality of the method. The results showed that the method improves the accuracy of string outlier detection. © Ou Ye, Zhanli Li, 2016.
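The pipeline outlined in the abstract (word frequencies from a matrix, a semantic and a structural similarity, a two-dimensional projection, and a distance-based decision) can be illustrated with a minimal Python sketch. The abstract does not give the paper's formulas, so everything concrete here is an assumption made for illustration only: the tokenizer, cosine similarity to the mean frequency vector as the "semantic" score, a character-class signature compared with `difflib.SequenceMatcher` as the "structure" score, the Euclidean distance to the point (1, 1), and the threshold of 0.5. It is not the authors' model.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

import numpy as np


def term_frequency_matrix(strings):
    """Rows are strings, columns are vocabulary tokens, entries are counts."""
    tokenized = [re.findall(r"[a-z0-9]+", s.lower()) for s in strings]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: j for j, tok in enumerate(vocab)}
    tf = np.zeros((len(strings), len(vocab)))
    for i, toks in enumerate(tokenized):
        for tok in toks:
            tf[i, index[tok]] += 1
    return tf


def shape(s):
    """Hypothetical structure signature: letter runs -> 'a', digit runs -> '9'."""
    s = re.sub(r"[A-Za-z]+", "a", s)
    return re.sub(r"[0-9]+", "9", s)


def cosine_to_reference(matrix, reference):
    """Cosine similarity of every row to one reference vector."""
    norms = np.linalg.norm(matrix, axis=1) * np.linalg.norm(reference)
    # Rows with zero norm (no tokens) get similarity 0 instead of NaN.
    return np.divide(matrix @ reference, norms,
                     out=np.zeros(len(matrix)), where=norms > 0)


def similarity_points(strings):
    """Project each string to a 2-D point (semantic score, structure score)."""
    tf = term_frequency_matrix(strings)
    # Assumed semantic score: similarity to the average word-frequency vector.
    semantic = cosine_to_reference(tf, tf.mean(axis=0))
    # Assumed structure score: similarity to the most common signature.
    shapes = [shape(s) for s in strings]
    typical = Counter(shapes).most_common(1)[0][0]
    structure = np.array(
        [SequenceMatcher(None, sh, typical).ratio() for sh in shapes]
    )
    return np.column_stack([semantic, structure])


def detect_outliers(strings, threshold=0.5):
    """Flag strings whose 2-D point lies far from (1, 1), the 'fully similar' corner."""
    points = similarity_points(strings)
    distance = np.linalg.norm(1.0 - points, axis=1)
    return [s for s, d in zip(strings, distance) if d > threshold]


records = [
    "12 Main Street Springfield",
    "48 Main Street Springfield",
    "7 Oak Street Springfield",
    "31 Main Street Springfield",
    "n/a",
]
print(detect_outliers(records))
```

Scoring the two factors separately means a record can be flagged for unusual wording, unusual layout, or both. The threshold is a free parameter in this sketch and would need tuning on real client data.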
Pages: 99-105