Exact algorithms for haplotype assembly from whole-genome sequence data

Cited by: 34
Authors
Chen, Zhi-Zhong [1]
Deng, Fei [2]
Wang, Lusheng [2]
Affiliations
[1] Tokyo Denki Univ, Div Informat Syst Design, Saitama 3500394, Japan
[2] City Univ Hong Kong, Dept Comp Sci, Hong Kong, Hong Kong, Peoples R China
Keywords
ACCURATE ALGORITHM; RECONSTRUCTION;
DOI
10.1093/bioinformatics/btt349
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Motivation: Haplotypes play a crucial role in genetic analysis and have many applications, such as genetic disease diagnosis, association studies and ancestry inference. Advances in DNA sequencing technologies make it possible to obtain haplotypes from a set of aligned reads originating from both copies of a chromosome of a single individual. This approach is often known as haplotype assembly. Exact algorithms that give optimal solutions to the haplotype assembly problem are in high demand. Unfortunately, previous algorithms for this problem either fail to output optimal solutions or take too long, even when executed on a PC cluster.
Results: We develop an approach to finding optimal solutions for the haplotype assembly problem under the minimum-error-correction (MEC) model. Most previous approaches assume that the columns in the input matrix correspond to (putative) heterozygous sites. This all-heterozygous assumption is correct for most columns, but it may be incorrect for a small number of columns. In this article, we consider the MEC model both with and without the all-heterozygous assumption. In our approach, we first use new methods to decompose the input read matrix into small independent blocks and then model the problem for each block as an integer linear programming problem, which is solved by an integer linear programming solver. We have tested our program on a single PC [a Linux (x64) desktop PC with an i7-3960X CPU], using the filtered HuRef and NA12878 datasets (after applying some variant calling methods). With the all-heterozygous assumption, our approach can optimally solve the whole HuRef dataset within a total time of 31 h (26 h for the most difficult block of chromosome 15 and only 5 h for the other blocks). To our knowledge, this is the first time that MEC optimal solutions have been obtained for the complete filtered HuRef dataset. Moreover, in the general case (without the all-heterozygous assumption), our approach can optimally solve all chromosomes of the HuRef dataset, except the most difficult block in chromosome 15, within a total time of 12 days. For both the HuRef and NA12878 datasets, the optimal costs in the general case are sometimes much smaller than those in the all-heterozygous case. This implies that some columns in the input matrix (after applying certain variant calling methods) still correspond to false-heterozygous sites.
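The difference between the two variants of the MEC model described in the abstract can be illustrated on a toy read matrix. The sketch below is an exhaustive search, not the block decomposition or integer linear programming formulation of the article, and it is only feasible for a handful of columns. The function names `mec_cost` and `brute_force_mec`, the `'0'`/`'1'` allele encoding and the `'-'` symbol for uncovered sites are assumptions made for illustration.

```python
from itertools import product


def mec_cost(reads, h1, h2):
    """Number of allele corrections needed so each read matches h1 or h2.

    Each read is a string over '0', '1' and '-' ('-' marks an uncovered site).
    """
    total = 0
    for read in reads:
        d1 = sum(1 for r, a in zip(read, h1) if r != '-' and r != a)
        d2 = sum(1 for r, a in zip(read, h2) if r != '-' and r != a)
        total += min(d1, d2)  # assign the read to the closer haplotype
    return total


def brute_force_mec(reads, all_heterozygous=True):
    """Return (cost, h1, h2) with minimum MEC cost by trying every pair.

    With all_heterozygous=True, h2 is forced to be the complement of h1.
    Otherwise h1 and h2 are chosen independently (the general case).
    """
    n = len(reads[0])
    best = None
    for h1 in product('01', repeat=n):
        if all_heterozygous:
            candidates = [tuple('1' if a == '0' else '0' for a in h1)]
        else:
            candidates = product('01', repeat=n)
        for h2 in candidates:
            cost = mec_cost(reads, h1, h2)
            if best is None or cost < best[0]:
                best = (cost, ''.join(h1), ''.join(h2))
    return best


# Toy matrix: the third column is homozygous ('0' on both copies).
reads = ["000", "0-0", "110", "110"]
het_cost = brute_force_mec(reads, all_heterozygous=True)[0]
gen_cost = brute_force_mec(reads, all_heterozygous=False)[0]
print(het_cost, gen_cost)
```

In this toy matrix the third column is homozygous, so forcing complementary haplotypes costs two corrections while the general case costs none. This mirrors the abstract's observation that false-heterozygous columns make the general-case optimum smaller.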
Pages: 1938 - 1945
Number of pages: 8