Dynamic order Markov model for categorical sequence clustering

Cited by: 0
Authors
Chen, Rongbo [1]
Sun, Haojun [2]
Chen, Lifei [3]
Zhang, Jianfei [1]
Wang, Shengrui [1]
Affiliations
[1] Univ Sherbrooke, Dept Comp Sci, Sherbrooke, PQ, Canada
[2] Shantou Univ, Dept Comp Sci, Shantou, Peoples R China
[3] Fujian Normal Univ, Dept Comp Sci, Fuzhou, Peoples R China
Funding
Natural Sciences and Engineering Research Council of Canada; National Natural Science Foundation of China;
Keywords
Sparse pattern; Pattern detection; Dynamic order Markov model; Categorical sequence clustering; CHAIN MODELS; SEARCH; ALIGNMENT;
DOI
10.1186/s40537-021-00547-2
Chinese Library Classification number
TP301 [Theory and methods];
Discipline classification code
081202;
Abstract
Markov models are extensively used for categorical sequence clustering and classification due to their inherent ability to capture complex chronological dependencies hidden in sequential data. Existing Markov models rest on an implicit assumption that the probability of the next state depends on a preceding context/pattern that consists of consecutive states. This restriction hampers the models: some patterns, disrupted by noise, may not be frequent enough in consecutive form yet are frequent in sparse form, so a model limited to consecutive patterns cannot make use of the information they carry. A sparse pattern is a pattern in which one or more of the states between the first and the last state are replaced by wildcards, each of which can be matched by a subset of values in the state set. In this paper, we propose a new model that generalizes the conventional Markov approach, making it capable of dealing with sparse patterns and of handling their length adaptively, i.e. allowing variable-length patterns with a variable number of wildcards. The model, named Dynamic order Markov model (DOMM), allows deriving a new similarity measure between a sequence and a set of sequences/cluster. DOMM builds a sparse pattern from sub-frequent patterns that contain significant statistical information veiled by the noise. To implement DOMM, we propose a sparse pattern detector (SPD), based on the probability suffix tree (PST), that is capable of discovering both sparse and consecutive patterns, and we then develop a divisive clustering algorithm, named DMSC, for Dynamic order Markov model for categorical sequence clustering. Experimental results on real-world datasets demonstrate the promising performance of the proposed model.
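The notion of a sparse pattern defined in the abstract can be illustrated with a minimal sketch. This is not the paper's SPD or PST construction; it is only a brute-force counter showing how a pattern that is infrequent in consecutive form can be frequent once a middle state is replaced by a wildcard. The names `WILDCARD`, `matches_at`, `count_occurrences`, and the toy sequences are assumptions made for illustration.

```python
from typing import Iterable, Optional, Sequence, Set

# Hypothetical marker for a wildcard position (not notation from the paper).
WILDCARD = None


def matches_at(seq: Sequence, start: int, pattern: Sequence,
               wildcard_values: Optional[Set] = None) -> bool:
    """Return True if `pattern` matches `seq` beginning at index `start`.

    A WILDCARD position matches any state, or only states in
    `wildcard_values` when that subset is given.
    """
    if start + len(pattern) > len(seq):
        return False
    for offset, symbol in enumerate(pattern):
        state = seq[start + offset]
        if symbol is WILDCARD:
            if wildcard_values is not None and state not in wildcard_values:
                return False
        elif symbol != state:
            return False
    return True


def count_occurrences(sequences: Iterable[Sequence], pattern: Sequence,
                      wildcard_values: Optional[Set] = None) -> int:
    """Count every position in every sequence where `pattern` matches."""
    total = 0
    for seq in sequences:
        for start in range(len(seq) - len(pattern) + 1):
            if matches_at(seq, start, pattern, wildcard_values):
                total += 1
    return total


# Toy data: the middle state of "a ? c" is perturbed in two sequences.
sequences = ["abcd", "axcd", "aycd", "abce"]

consecutive_count = count_occurrences(sequences, "abc")
sparse_count = count_occurrences(sequences, ["a", WILDCARD, "c"])
restricted_count = count_occurrences(sequences, ["a", WILDCARD, "c"],
                                     wildcard_values={"b", "x"})

print(consecutive_count, sparse_count, restricted_count)  # prints: 2 4 3
```

In this toy data the consecutive pattern `abc` occurs twice, while the sparse pattern `a ? c` occurs in all four sequences; restricting the wildcard to the subset `{b, x}` gives three matches, mirroring the abstract's remark that a wildcard can be matched by a subset of the state set.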
Pages: 25
Related papers
50 in total
  • [1] Dynamic order Markov model for categorical sequence clustering
    Rongbo Chen
    Haojun Sun
    Lifei Chen
    Jianfei Zhang
    Shengrui Wang
    [J]. Journal of Big Data, 8
  • [2] A Novel Variable-order Markov Model for Clustering Categorical Sequences
    Xiong, Tengke
    Wang, Shengrui
    Jiang, Qingshan
    Huang, Joshua Zhexue
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2014, 26 (10) : 2339 - 2353
  • [3] A MEASURE OF CATEGORICAL CLUSTERING BASED UPON A MODEL OF RECALL ORDER
    ROBERTSON, C
    [J]. BRITISH JOURNAL OF MATHEMATICAL & STATISTICAL PSYCHOLOGY, 1985, 38 (NOV) : 141 - 151
  • [4] A New Multivariate Markov Chain Model for Adding a New Categorical Data Sequence
    Wang, Chao
    Huang, Ting-Zhu
    Ching, Wai-Ki
    [J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2014, 2014
  • [5] Clustering sequence data using hidden Markov model representation
    Li, C
    Biswas, G
    [J]. DATA MINING AND KNOWLEDGE DISCOVERY: THEORY, TOOLS, AND TECHNOLOGY, 1999, 3695 : 14 - 21
  • [6] A SCALABLE CLUSTERING METHOD FOR CATEGORICAL SEQUENCE DATA
    Oh, Seung-Joon
    Kim, Jae-Yearn
    [J]. INTERNATIONAL JOURNAL OF COMPUTATIONAL METHODS, 2005, 2 (02) : 167 - 180
  • [7] A hierarchical clustering algorithm for categorical sequence data
    Oh, SJ
    Kim, JY
    [J]. INFORMATION PROCESSING LETTERS, 2004, 91 (03) : 135 - 140
  • [8] Hidden Markov Model Optimized by PSO Algorithm for Gene Sequence Clustering
    Soruri, Mohammad
    Sadri, Javad
    Zahiri, S. Hamid
    [J]. PROCEEDINGS OF THE SECOND INTERNATIONAL CONFERENCE ON INTERNET OF THINGS, DATA AND CLOUD COMPUTING (ICC 2017), 2017
  • [9] Sequence Clustering with the Self-Organizing Hidden Markov Model Map
    Ferles, Christos
    Stafylopatis, Andreas
    [J]. 8TH IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOENGINEERING, VOLS 1 AND 2, 2008 : 430 - 436
  • [10] A Modified Markov Clustering Approach for Protein Sequence Clustering
    Medves, Lehel
    Szilagyi, Laszlo
    Szilagyi, Sandor M.
    [J]. PATTERN RECOGNITION IN BIOINFORMATICS, PROCEEDINGS, 2008, 5265 : 110 - 120