Similarity distance based approach for outlier detection by matrix calculation

Cited by: 0
Authors
Ye, Ou [1]
Li, Zhanli [1]
Affiliations
[1] Xi'an University of Science and Technology, Xi'an, China
Funding
National Natural Science Foundation of China
Keywords
Statistics - Matrix algebra - Data mining - Calculations - Data handling
DOI
Not available
Abstract
Purpose: String outliers in client information need to be detected and cleaned. Many current outlier detection algorithms focus only on the semantics of the data and ignore its structure, which makes accurate outlier detection difficult. To address this issue, this paper proposes an outlier detection method based on similarity distance. Methodology: We formulated a similarity calculation model for string data that combines semantic and structural factors. Following outlier detection theory in data cleansing, one-dimensional string data were projected into a two-dimensional space, and string outliers were detected there using a new similarity measurement mechanism. Findings: We first obtained the word frequencies of the string data by matrix calculation. The semantic similarity and structural similarity were then calculated from the word frequencies. After mapping the string data from one-dimensional to two-dimensional space, we identified the outliers using the similarity distance. Originality: We studied string outlier detection in data cleansing. First, we formulated the similarity calculation model by considering both the semantic factor and the structural factor. Second, by constructing a similarity cell into which the string data are projected, we carried out the similarity distance measurement within that cell. Practical value: The method can be used by any enterprise to clean outlier strings in client information, which helps ensure the quality of client data and reduces data maintenance costs. Extensive simulation experiments were conducted to demonstrate the feasibility and rationality of the method. The results showed that the method improves the accuracy of string outlier detection. © Ou Ye, Zhanli Li, 2016.
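The abstract names the steps of the pipeline (word frequencies from a matrix, a semantic and a structural similarity, a projection to two dimensions, a distance test) but gives no formulas. The Python sketch below illustrates that general shape only and is not the authors' model. Every concrete choice in it is an assumption: cosine similarity against the summed term-frequency vector stands in for semantic similarity, a character-class "shape" compared with difflib stands in for structural similarity, the reference point (1, 1) and the threshold 0.5 are hypothetical, and all function names are made up for illustration.

```python
import math
import re
from collections import Counter
from difflib import SequenceMatcher


def tokenize(s):
    # Lowercase alphanumeric tokens; an assumed tokenization rule.
    return re.findall(r"[a-z0-9]+", s.lower())


def term_frequency_matrix(strings):
    # Rows are strings, columns are vocabulary tokens, entries are counts.
    docs = [Counter(tokenize(s)) for s in strings]
    vocab = sorted(set().union(*docs))
    return [[doc[t] for t in vocab] for doc in docs]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def shape(s):
    # Structural signature: letter runs become "a", digit runs become "9",
    # every other character is kept as-is.
    return re.sub(r"[0-9]+", "9", re.sub(r"[A-Za-z]+", "a", s))


def project(strings):
    # Map each string to a 2-D point (semantic similarity, structural similarity).
    tf = term_frequency_matrix(strings)
    total = [sum(col) for col in zip(*tf)]  # summed frequency vector
    shapes = [shape(s) for s in strings]
    modal_shape = Counter(shapes).most_common(1)[0][0]
    return [
        (cosine(row, total), SequenceMatcher(None, sh, modal_shape).ratio())
        for row, sh in zip(tf, shapes)
    ]


def find_outliers(strings, threshold=0.5):
    # Flag strings whose point lies farther than `threshold` from (1, 1),
    # the point that would mean "fully similar on both axes".
    points = project(strings)
    return [
        s for s, (sem, struct) in zip(strings, points)
        if math.hypot(1.0 - sem, 1.0 - struct) > threshold
    ]


clients = [
    "12 Main Street Springfield",
    "48 Main Street Springfield",
    "7 Oak Street Springfield",
    "93 Main Street Springfield",
    "??? n/a ###",
]
print(find_outliers(clients))
```

In this toy data the malformed last entry shares no vocabulary and no shape with the address records, so it sits far from (1, 1) on both axes. The threshold would need tuning on real client data, and a different reference point (for example the median of the projected points) is an equally plausible choice.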
Pages: 99-105