Exact algorithms for haplotype assembly from whole-genome sequence data

Cited by: 35
Authors
Chen, Zhi-Zhong [1]
Deng, Fei [2]
Wang, Lusheng [2]
Affiliations
[1] Tokyo Denki Univ, Div Informat Syst Design, Saitama 3500394, Japan
[2] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Keywords
ACCURATE ALGORITHM; RECONSTRUCTION;
DOI
10.1093/bioinformatics/btt349
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Haplotypes play a crucial role in genetic analysis and have many applications, such as genetic disease diagnosis, association studies and ancestry inference. The development of DNA sequencing technologies makes it possible to obtain haplotypes from a set of aligned reads originating from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that give optimal solutions to the haplotype assembly problem are in high demand. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long, even when executed on a PC cluster.
Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most of the previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this article, we consider the MEC model with or without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming problem, which is then solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with an i7-3960X CPU], using the filtered HuRef and NA12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach can optimally solve the whole HuRef dataset within a total time of 31 h (26 h for the most difficult block of chromosome 15 and only 5 h for the other blocks). To our knowledge, this is the first time that MEC optimal solutions have been completely obtained for the filtered HuRef dataset.
Moreover, in the general case (without the all-heterozygous assumption), our approach can optimally solve all chromosomes of the HuRef dataset, except the most difficult block in chromosome 15, within a total time of 12 days. For both the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites.
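To make the difference between the two MEC settings concrete, the following is a minimal Python sketch that finds an optimal MEC cost for a tiny read matrix by exhaustive enumeration. It is not the decomposition and integer linear programming method described in the abstract; brute force is swapped in purely for illustration and is only feasible for a handful of columns. The function names (`distance`, `mec_cost`, `complement`, `optimal_mec`) and the encoding of reads as strings over `0`, `1` and `-` (site not covered) are assumptions made for this example.

```python
from itertools import product


def distance(read, hap):
    """Number of covered sites where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != "-" and r != h)


def mec_cost(reads, h1, h2):
    """Assign each read to its closer haplotype and count the corrections needed."""
    return sum(min(distance(r, h1), distance(r, h2)) for r in reads)


def complement(hap):
    """Flip every allele; under the all-heterozygous assumption h2 = complement(h1)."""
    return "".join("1" if c == "0" else "0" for c in hap)


def optimal_mec(reads, all_heterozygous=True):
    """Exhaustively search haplotype pairs; returns (cost, h1, h2).

    Illustrative only: the search space grows exponentially with the
    number of columns, so this is usable only on very small matrices.
    """
    n = len(reads[0])
    haps = ["".join(bits) for bits in product("01", repeat=n)]
    best = None
    for h1 in haps:
        # All-heterozygous: h2 is forced. General case: h2 is free.
        candidates = [complement(h1)] if all_heterozygous else haps
        for h2 in candidates:
            cost = mec_cost(reads, h1, h2)
            if best is None or cost < best[0]:
                best = (cost, h1, h2)
    return best


# Hypothetical toy matrix: the third column is 0 in every read,
# i.e. it behaves like a false-heterozygous (actually homozygous) site.
reads = ["000", "000", "110", "110"]
print(optimal_mec(reads, all_heterozygous=True)[0])   # 2
print(optimal_mec(reads, all_heterozygous=False)[0])  # 0
```

In this toy matrix, forcing the two haplotypes to be complementary costs two corrections, while the general model explains every read with no corrections. This mirrors, on a very small scale, the abstract's observation that general-case optimal costs can be smaller when some columns are false-heterozygous.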
Pages: 1938-1945
Page count: 8