Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution

Citations: 3
Authors
Wang, Yixuan [1, 2]
Zhang, Xuanping [1, 2]
Xiao, Xiao [3]
Zhang, Fei-Ran [4]
Yan, Xinxing [1, 2]
Feng, Xuan [1, 2]
Zhao, Zhongmeng [1, 2]
Guan, Yanfang [1, 2, 5]
Wang, Jiayin [1, 2]
Affiliations
[1] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Xian 710048, Peoples R China
[2] Xi An Jiao Tong Univ, Sch Comp Sci & Technol, Shaanxi Engn Res Ctr Med & Hlth Big Data, Xian 710048, Peoples R China
[3] Xi An Jiao Tong Univ, Inst Hlth Adm & Policy, Sch Publ Policy & Adm, Xian 710048, Peoples R China
[4] Shantou Univ, Med Coll, Affiliated Hosp 1, Dept Gen Surg, Shantou 515041, Guangdong, Peoples R China
[5] Geneplus Beijing Inst, Beijing 100061, Peoples R China
Funding
China Postdoctoral Science Foundation; U.S. National Science Foundation;
Keywords
Cancer genomics; Genomic micro-satellite; Length distribution estimation; Tumor purity; Computational pipeline; Sequencing data analysis; MICROSATELLITE INSTABILITY DETECTION; CANCER; DNA;
DOI
10.1186/s12859-020-3349-5
Chinese Library Classification (CLC) number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Genomic micro-satellites are genomic regions that consist of short, repetitive DNA motifs. Estimating the length distribution and state of a micro-satellite region is an important computational step in cancer sequencing data pipelines, and it is suggested to facilitate downstream analysis and clinical decision support. Although several state-of-the-art approaches have been proposed to identify micro-satellite instability (MSI) events, they are limited in handling regions longer than one read length. Moreover, to the best of our knowledge, all of these approaches implicitly assume that the tumor purity of the sequenced samples is sufficiently high. This assumption is inconsistent with reality: it dilutes the data signal in the inferred length distribution and introduces false-positive errors.
Results: In this article, we propose a computational approach, named ELMSI, that detects MSI events based on next-generation sequencing data. ELMSI estimates the specific length distributions and states of micro-satellite regions from a mixed tumor sample paired with a control sample. It first estimates the purity of the tumor sample based on the read counts at filtered SNV loci. It then identifies the length distributions and states of short micro-satellites by adding a Maximum Likelihood Estimation (MLE) step to the existing algorithm. After that, ELMSI infers the length distributions of long micro-satellites by incorporating a simplified Expectation Maximization (EM) algorithm with the central limit theorem, and then uses statistical tests to output the states of these micro-satellites. In our experiments, ELMSI handled micro-satellites with lengths ranging from shorter than one read length to 10 kbp.
Conclusions: To verify the reliability of our algorithm, we first compared its ability to classify shorter micro-satellites from mixed samples with that of the existing algorithm MSIsensor. We then varied the number of micro-satellite regions, the read length and the sequencing coverage to separately test the performance of ELMSI in estimating longer micro-satellites from mixed samples. ELMSI performed well on mixed samples and is therefore of value for improving the recognition of micro-satellite regions and supporting clinical decision-making. The source code is maintained at https://github.com/YixuanWang1120/ELMSI for academic use only.
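Illustrative sketch (not the ELMSI implementation): the abstract describes estimating tumor purity from SNV read counts and then recovering tumor-specific length distributions from a mixed sample. The Python below shows one simple way such a purity deconvolution could look. It rests on assumptions that are not stated in the abstract: somatic SNVs are treated as heterozygous and clonal in copy-neutral diploid regions (so the expected variant allele fraction is purity / 2), and the mixed length distribution is treated as a linear mixture of tumor and normal distributions. The function names estimate_purity and deconvolve_lengths are hypothetical.

```python
from statistics import median


def estimate_purity(alt_counts, total_counts):
    """Rough purity estimate from somatic SNV read counts.

    Assumption (not from the abstract): SNVs are heterozygous, clonal and
    in copy-neutral diploid regions, so expected VAF = purity / 2.
    """
    vafs = [a / t for a, t in zip(alt_counts, total_counts) if t > 0]
    if not vafs:
        raise ValueError("no SNV loci with non-zero coverage")
    # The median is less sensitive to outlier loci than the mean.
    return min(1.0, 2.0 * median(vafs))


def deconvolve_lengths(mixed, normal, purity):
    """Recover a tumor length distribution from a mixed sample.

    Assumes mixed = purity * tumor + (1 - purity) * normal, where each
    argument maps micro-satellite length -> probability.
    """
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    lengths = sorted(set(mixed) | set(normal))
    raw = {}
    for length in lengths:
        value = (mixed.get(length, 0.0)
                 - (1.0 - purity) * normal.get(length, 0.0)) / purity
        # Sampling noise can push the estimate below zero; clip it.
        raw[length] = max(0.0, value)
    total = sum(raw.values())
    if total == 0.0:
        raise ValueError("deconvolved distribution is empty")
    # Renormalize so the result is a probability distribution.
    return {length: v / total for length, v in raw.items()}


if __name__ == "__main__":
    purity = estimate_purity([20, 30, 25], [100, 100, 100])
    normal = {10: 1.0}             # control sample: all reads show length 10
    mixed = {10: 0.75, 8: 0.25}    # mixed tumor sample
    print(purity, deconvolve_lengths(mixed, normal, purity))
```

In this toy input, a quarter of the mixed-sample reads show the contracted length 8, but after correcting for the estimated purity the tumor component carries that allele at a much higher fraction, which is the dilution effect the abstract refers to.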
Pages: 14
Related papers
5 items in total
  • [1] Accurately estimating the length distributions of genomic micro-satellites by tumor purity deconvolution
    Yixuan Wang
    Xuanping Zhang
    Xiao Xiao
    Fei-Ran Zhang
    Xinxing Yan
    Xuan Feng
    Zhongmeng Zhao
    Yanfang Guan
    Jiayin Wang
    [J]. BMC Bioinformatics, 21
  • [2] Estimating the Length Distributions of Genomic Micro-satellites from Next Generation Sequencing Data
    Feng, Xuan
    Hu, Huan
    Zhao, Zhongmeng
    Zhang, Xuanping
    Wang, Jiayin
    [J]. BIOINFORMATICS AND BIOMEDICAL ENGINEERING, IWBBIO 2018, PT I, 2018, 10813 : 461 - 472
  • [3] CMSI: A Bayesian model for estimating clonal micro-satellites instability from NGS data
    Wang, Yixuan
    Zhang, Xuanping
    Huang, Yi
    Liu, Tao
    Xiao, Xiao
    Wang, Jiayin
    [J]. CANCER RESEARCH, 2019, 79 (13)
  • [4] Accurately Estimating Tumor Purity of Samples with High Degree of Heterogeneity from Cancer Sequencing Data
    Geng, Yu
    Zhao, Zhongmeng
    Liu, Ruoyu
    Zheng, Tian
    Xu, Jing
    Huang, Yi
    Zhang, Xuanping
    Xiao, Xiao
    Wang, Jiayin
    [J]. INTELLIGENT COMPUTING THEORIES AND APPLICATION, ICIC 2017, PT II, 2017, 10362 : 273 - 285
  • [5] An assessment of computational methods for estimating purity and clonality using genomic data derived from heterogeneous tumor tissue samples
    Yadav, Vinod Kumar
    De, Subhajyoti
    [J]. BRIEFINGS IN BIOINFORMATICS, 2015, 16 (02) : 232 - 241