Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

Authors
Bhusal, Prem [1]
Alam, A. K. M. Mubashwir [2]
Chen, Keke [2]
Jiang, Ning [3]
Xiao, Jun [4]
Affiliations
[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA
[2] Marquette Univ, Dept Comp Sci, Milwaukee, WI 53233 USA
[3] Univ Penn, Dept Bioengn, Philadelphia, PA USA
[4] ImmuDX LLC, Austin, TX USA
Keywords
clustering; sequence data; scalability; parallel processing; summarization; indexing; LARGE SETS; SEARCH
DOI
10.1109/BigData52589.2021.9671320
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Next-generation sequencing has enabled systems immunology researchers to conduct detailed immune repertoire analysis at the molecular level, which helps them assess the health of a patient's immune system. Recent studies have shown that the single-linkage clustering algorithm can give the best results for B cell clonality analysis, a critical type of immune repertoire sequencing (IR-Seq) analysis. Large sequence datasets (e.g., millions of sequences) are being collected to understand comprehensively how a specific person's immune system evolves over different stages of disease development. However, the classical single-linkage clustering algorithm does not scale well to such large sequence datasets, and, surprisingly, no study has addressed this scalability issue for immunology research and development. We study three strategies for scaling up the single-linkage algorithm for sequence data: (1) an approximate single-linkage algorithm enhanced with non-Euclidean indexing methods, (2) a Spark-based single-linkage algorithm (SparkMST), originally designed for vector data and modified here for sequence data, and (3) SCT, a new tree-based sequence summarization approach that aims to reduce the data for single-linkage clustering while preserving clustering quality. We have implemented these approaches and experimented with real sequence datasets for B cell clonality analysis, with the following findings. (1) The index-enhanced hierarchical clustering algorithm (e.g., VPT-HC, which uses the Vantage-Point tree for indexing) preserves clustering quality very well while significantly reducing the time complexity. (2) The SCT approach, serving as a preprocessing step, effectively reduces the data size for clustering. The combined pipeline, SCT followed by VPT-HC, is the fastest among the evaluated single-machine algorithms; however, it also slightly affects clustering quality. (3) The SparkMST parallel algorithm scales out well and gives exact single-linkage clustering results. However, SparkMST is tied to the single-linkage algorithm and cannot be extended to general hierarchical clustering algorithms. Although this study focuses on a specific application area, B cell clonality analysis, we believe the developed scalable techniques may be useful for other sequence data analysis problems.
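To make the scalability problem concrete, the following is a minimal sketch of the classical baseline that the abstract describes, not the authors' implementation. It assumes Levenshtein (edit) distance as the sequence metric and a fixed distance threshold for cutting the hierarchy; the abstract specifies neither, and the function names `edit_distance` and `single_linkage_clusters` are hypothetical. Cutting a single-linkage dendrogram at height t is equivalent to taking the connected components of the graph that links every pair of sequences at distance at most t, so a union-find structure suffices.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string on the inner loop
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def single_linkage_clusters(seqs, threshold):
    """Flat single-linkage clusters at an assumed distance threshold.

    Every pair is compared, so the cost is quadratic in len(seqs).
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in combinations(range(len(seqs)), 2):
        if edit_distance(seqs[i], seqs[j]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri  # merge the two components

    clusters = {}
    for i, s in enumerate(seqs):
        clusters.setdefault(find(i), []).append(s)
    return sorted(sorted(c) for c in clusters.values())


if __name__ == "__main__":
    # Toy sequences invented for illustration only.
    toy = ["CARDYW", "CARDFW", "CARGFW", "CTTTTTW"]
    print(single_linkage_clusters(toy, threshold=1))
```

The all-pairs loop is what fails to scale to millions of sequences. In terms of this sketch, a metric index such as a Vantage-Point tree would replace the exhaustive pair enumeration with neighbor queries, and a summarization step would shrink `seqs` before clustering.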
Pages: 1349-1358
Page count: 10