Exact algorithms for haplotype assembly from whole-genome sequence data

Times cited: 35
Authors
Chen, Zhi-Zhong [1]
Deng, Fei [2]
Wang, Lusheng [2]
Affiliations
[1] Tokyo Denki Univ, Div Informat Syst Design, Saitama 3500394, Japan
[2] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Keywords
ACCURATE ALGORITHM; RECONSTRUCTION;
DOI
10.1093/bioinformatics/btt349
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Motivation: Haplotypes play a crucial role in genetic analysis and have many applications, such as genetic disease diagnosis, association studies and ancestry inference. Advances in DNA sequencing technologies make it possible to obtain haplotypes from a set of aligned reads originating from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that give optimal solutions to the haplotype assembly problem are in high demand. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long, even when executed on a PC cluster.
Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most previous approaches assume that the columns of the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of them. In this article, we consider the MEC model both with and without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks, and then model the problem for each block as an integer linear programming problem, which is solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with an i7-3960X CPU], using the filtered HuRef and NA12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach optimally solves the whole HuRef dataset within a total time of 31 h (26 h for the most difficult block of chromosome 15 and only 5 h for the other blocks). To our knowledge, this is the first time that MEC-optimal solutions have been obtained for the complete filtered HuRef dataset. Moreover, in the general case (without the all-heterozygous assumption), our approach optimally solves all chromosomes of the HuRef dataset, except the most difficult block of chromosome 15, within a total time of 12 days. For both the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites.
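The MEC objective described in the abstract can be illustrated with a minimal sketch. This is not the authors' integer linear programming formulation: it is a brute-force enumeration that is only feasible for a handful of sites, and the read encoding (strings over `0`/`1` with `-` for an uncovered site), the function names and the toy reads are all assumptions made for illustration.

```python
from itertools import product

GAP = "-"  # assumed symbol: the read does not cover this site


def mismatches(read, hap):
    """Count covered positions where the read disagrees with the haplotype."""
    return sum(1 for r, h in zip(read, hap) if r != GAP and r != h)


def mec_cost(reads, h1, h2):
    """MEC cost: each read is charged its distance to the closer haplotype."""
    return sum(min(mismatches(r, h1), mismatches(r, h2)) for r in reads)


def complement(hap):
    """Flip every allele; under the all-heterozygous assumption h2 = complement(h1)."""
    return "".join("1" if a == "0" else "0" for a in hap)


def brute_force_mec(reads, all_heterozygous=True):
    """Exhaustively search haplotype pairs and return (cost, h1, h2).

    Exponential in the number of sites, so suitable for toy inputs only.
    """
    n = len(reads[0])
    haps = ["".join(p) for p in product("01", repeat=n)]
    best = None
    for h1 in haps:
        # All-heterozygous case: h2 is forced. General case: h2 is free.
        partners = [complement(h1)] if all_heterozygous else haps
        for h2 in partners:
            cost = mec_cost(reads, h1, h2)
            if best is None or cost < best[0]:
                best = (cost, h1, h2)
    return best


# Toy read matrix (hypothetical): the third site is homozygous in every read.
reads = ["000", "110", "00-", "-10", "000", "110"]
cost_het = brute_force_mec(reads, all_heterozygous=True)[0]
cost_general = brute_force_mec(reads, all_heterozygous=False)[0]
print(cost_het, cost_general)
```

On this toy input the all-heterozygous optimum costs 2 corrections while the general optimum costs 0, a small-scale analogue of the abstract's observation that a false-heterozygous column inflates the all-heterozygous cost.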
Pages: 1938-1945
Number of pages: 8