Random Projection Based Clustering for Population Genomics

Authors
Tasoulis, Sotiris [1]
Cheng, Lu [2]
Valimaki, Niko [3]
Croucher, Nicholas J. [4]
Harris, Simon R. [5]
Hanage, William P. [6]
Roos, Teemu [1]
Corander, Jukka [7]
Affiliations
[1] Univ Helsinki, Dept Comp Sci, Helsinki Inst Informat Technol HIIT, FIN-00014 Helsinki, Finland
[2] Aalto Univ, Helsinki Inst Informat Technol HIIT, Dept Informat & Comp Sci, Espoo, Finland
[3] Univ Helsinki, Dept Comp Sci, FIN-00014 Helsinki, Finland
[4] Imperial Coll, Dept Infect Dis Epidemiol, London, England
[5] Wellcome Trust Sanger Inst, Cambridge, England
[6] Harvard Sch Publ Hlth, Ctr Communicable Dis Dynam, Dept Epidemiol, Boston, MA USA
[7] Univ Helsinki, Dept Math & Stat, Helsinki Inst Informat Technol HIIT, FIN-00014 Helsinki, Finland
Keywords
Clustering; Random Projection; Population Genomics; High Dimensionality; Algorithm
DOI
Not available
Chinese Library Classification number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
The recent data revolution in bacterial population genomics has increased the size of aligned sequence data sets by two to three orders of magnitude. This trend is expected to continue in the near future, which places an emphasis on the applicability of big data techniques for extracting biologically important insights. Moreover, as sampling becomes denser, it may also be necessary to combine alignment-free sequence analysis techniques with clustering to gain sufficient insight into the data. This leads to ultra-high-dimensional data with tens of millions of variables, which existing population genomic methods can no longer handle. Using the largest bacterial sequence data sets published to date, we demonstrate that random projection based clustering is highly accurate and several orders of magnitude faster than the Bayesian model-based analysis currently considered the state of the art, for both alignment-based and alignment-free genome data sets. Hence, clustering methods for big data hold considerable potential for important applications in genomics and could pave the way for novel analysis pipelines, even in the online setting, when executed in a massively parallel computing environment.
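The general idea named in the abstract, projecting very high-dimensional data to a low-dimensional space at random and clustering there, can be illustrated with a minimal sketch. This is not the authors' pipeline: the record does not describe their algorithm, so the sketch assumes a dense Gaussian projection followed by plain k-means, run on synthetic binary "genotype" vectors. All function names, the dimensions, and the noise rate are hypothetical choices made for illustration.

```python
import random

def random_projection_matrix(d, k, rng):
    """Return a k x d matrix with i.i.d. N(0, 1/k) entries (assumed Gaussian projection)."""
    scale = 1.0 / k ** 0.5
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]

def project(X, R):
    """Map each d-dimensional row of X to k dimensions by multiplying with R."""
    return [[sum(r * v for r, v in zip(row, x)) for row in R] for x in X]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def kmeans(points, n_clusters, n_iter=20):
    """Plain Lloyd's k-means with deterministic farthest-point initialization."""
    centers = [points[0]]
    while len(centers) < n_clusters:
        # Next center: the point farthest from all centers chosen so far.
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest center for every point.
        labels = [min(range(n_clusters), key=lambda c: sq_dist(p, centers[c]))
                  for p in points]
        # Update step: each center becomes the mean of its members.
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def make_toy_genotypes(n_per_group, d, rng, flip=0.05):
    """Synthetic data: two random binary prototypes, each copied with random site flips."""
    prototypes = [[rng.randint(0, 1) for _ in range(d)] for _ in range(2)]
    X = []
    for proto in prototypes:
        for _ in range(n_per_group):
            X.append([b ^ 1 if rng.random() < flip else b for b in proto])
    return X

rng = random.Random(0)                      # fixed seed so the run is repeatable
X = make_toy_genotypes(n_per_group=15, d=1000, rng=rng)
R = random_projection_matrix(d=1000, k=32, rng=rng)
Z = project(X, R)                           # 30 samples, now 32-dimensional
labels = kmeans(Z, n_clusters=2)
print(labels)
```

The clustering step only ever sees the 32-dimensional vectors in `Z`, which is where the speed advantage of this family of methods comes from; a practical implementation would use a numerical library and possibly a sparse projection matrix rather than pure-Python lists.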
Pages: 675-682
Number of pages: 8