Scalable Sequence Clustering for Large-Scale Immune Repertoire Analysis

Cited by: 0

Authors:

Bhusal, Prem [1]

Alam, A. K. M. Mubashwir [2]

Chen, Keke [2]

Jiang, Ning [3]

Xiao, Jun [4]

Affiliations:

[1] Wright State Univ, Dept Comp Sci & Engn, Dayton, OH 45435 USA

[2] Marquette Univ, Dept Comp Sci, Milwaukee, WI 53233 USA

[3] Univ Penn, Dept Bioengn, Philadelphia, PA USA

[4] ImmuDX LLC, Austin, TX USA

Source:

2021 IEEE International Conference on Big Data (Big Data) | 2021

Keywords:

clustering; sequence data; scalability; parallel processing; summarization; indexing; large sets; search

DOI:

10.1109/BigData52589.2021.9671320

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Advances in next-generation sequencing have enabled systems immunology researchers to analyze immune repertoires in detail at the molecular level and thereby assess the health of a patient's immune system. Recent studies have shown that the single-linkage clustering algorithm gives the best results for B cell clonality analysis, a critical type of immune repertoire sequencing (IR-Seq) analysis. Large sequence datasets (e.g., millions of sequences) are being collected to comprehensively understand how a specific person's immune system evolves over different stages of disease development. However, the classical single-linkage clustering algorithm does not scale well to such large sequence datasets. Surprisingly, no study has addressed this scalability issue for immunology research and development. We study three strategies for scaling up the single-linkage algorithm for sequence data: (1) an approximate single-linkage algorithm enhanced with non-Euclidean indexing methods; (2) a Spark-based single-linkage algorithm (SparkMST), originally designed for vector data and modified here for sequence data; and (3) a new tree-based sequence summarization approach, SCT, which aims to reduce the data for single-linkage clustering while preserving clustering quality. We have implemented these approaches and experimented with real sequence datasets for B cell clonality analysis. Our findings are as follows. (1) The index-enhanced hierarchical clustering algorithm (e.g., VPT-HC, which uses the Vantage-Point tree for indexing) preserves the clustering quality very well while significantly reducing the time complexity. (2) The SCT approach, used as a preprocessing step, can effectively reduce the data size for clustering. The overall pipeline, SCT followed by VPT-HC, is the fastest among the evaluated single-machine algorithms, although it also slightly affects the clustering quality. (3) The SparkMST parallel algorithm scales out nicely and also gives exact single-linkage clustering results. However, SparkMST is tied to the single-linkage algorithm and cannot be extended to general hierarchical clustering algorithms. Although this study focuses on one specific application area, B cell clonality analysis, we believe the scalable techniques developed here may also be useful for other sequence data analysis problems.
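
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of naive single-linkage clustering on sequences. It is not the authors' implementation and involves none of VPT-HC, SparkMST, or SCT. It assumes Levenshtein edit distance as the sequence metric and a fixed distance cutoff; the abstract specifies neither, and the function names `edit_distance` and `single_linkage_clusters` are hypothetical. Cutting a single-linkage dendrogram at a given height yields the connected components of the graph that links every pair of sequences within that distance, which is what the union-find structure below computes.

```python
from itertools import combinations


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def single_linkage_clusters(seqs, threshold):
    """Single-linkage clusters at a distance cutoff (assumed parameter).

    Two sequences share a cluster if a chain of sequences connects them
    in which every consecutive pair is within `threshold` edits.
    """
    parent = list(range(len(seqs)))

    def find(x):
        # Find the set representative, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All-pairs comparison: the quadratic step that limits scalability.
    for i, j in combinations(range(len(seqs)), 2):
        if edit_distance(seqs[i], seqs[j]) <= threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[rj] = ri

    groups = {}
    for idx in range(len(seqs)):
        groups.setdefault(find(idx), []).append(seqs[idx])
    return sorted(sorted(g) for g in groups.values())
```

The all-pairs loop performs a number of distance computations quadratic in the number of sequences, which is why this baseline becomes impractical at the scale of millions of sequences mentioned in the abstract.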


Pages: 1349-1358

Page count: 10
