An MCMC algorithm for haplotype assembly from whole-genome sequence data

Cited by: 89
Authors
Bansal, Vikas [1 ]
Halpern, Aaron L. [2 ]
Axelrod, Nelson [2 ]
Bafna, Vineet [1 ]
Affiliations
[1] Univ Calif San Diego, Dept Comp Sci & Engn, La Jolla, CA 92093 USA
[2] J Craig Venter Inst, Rockville, MD 20850 USA
Funding
National Science Foundation (USA);
Keywords
DOI
10.1101/gr.077065.108
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology];
Discipline classification codes
071010 ; 081704 ;
Abstract
In comparison to genotypes, knowledge about haplotypes (the combination of alleles present on a single chromosome) is much more useful for whole-genome association studies and for making inferences about human evolutionary history. Haplotypes are typically inferred from population genotype data using computational methods. Whole-genome sequence data represent a promising resource for constructing haplotypes spanning hundreds of kilobases for an individual. In this article, we propose a Markov chain Monte Carlo (MCMC) algorithm, HASH (haplotype assembly for single human), for assembling haplotypes from sequenced DNA fragments that have been mapped to a reference genome assembly. The transitions of the Markov chain are generated using min-cut computations on graphs derived from the sequenced fragments. We have applied our method to infer haplotypes using whole-genome shotgun sequence data from a recently sequenced human individual. The high sequence coverage and presence of mate pairs result in fairly long haplotypes (N50 length ~350 kb). Based on comparison of the sequenced fragments against the individual haplotypes, we demonstrate that the haplotypes for this individual inferred using HASH are significantly more accurate than the haplotypes estimated using a previously proposed greedy heuristic and a simple MCMC method. Using haplotypes from the HapMap project, we estimate the switch error rate of the haplotypes inferred using HASH to be quite low, ~1.1%. Our Markov chain Monte Carlo algorithm represents a general framework for haplotype assembly that can be applied to sequence data generated by other sequencing technologies. The code implementing the methods and the phased individual haplotypes can be downloaded from http://www.cse.ucsd.edu/users/vibansal/HASH/.
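The abstract reports two summary metrics, an N50 haplotype length and a switch error rate. The sketch below illustrates how such metrics are commonly computed; it is not taken from the HASH code. The function names `n50` and `switch_error_rate`, the 0/1 string encoding of haplotypes, and the exact definitions used are assumptions made for illustration, and the paper's own definitions may differ in detail.

```python
def n50(lengths):
    """Return the N50 of a collection of block lengths.

    Assumed definition: the largest length L such that blocks of
    length >= L together cover at least half of the total length.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("lengths must be non-empty")


def switch_error_rate(inferred, truth):
    """Fraction of adjacent site pairs where the phase flips.

    Both arguments are equal-length sequences of alleles (for example
    strings of '0' and '1') over the same heterozygous sites. A switch
    is counted whenever agreement with the reference haplotype changes
    between consecutive sites, so a fully complemented haplotype has
    zero switches.
    """
    if len(inferred) != len(truth):
        raise ValueError("haplotypes must cover the same sites")
    if len(inferred) < 2:
        return 0.0
    match = [a == b for a, b in zip(inferred, truth)]
    switches = sum(1 for prev, cur in zip(match, match[1:]) if prev != cur)
    return switches / (len(match) - 1)
```

For instance, `n50([100, 200, 300, 400])` evaluates to 300, and `switch_error_rate("00110", "00001")` evaluates to 0.25, since the phase flips once across four adjacent pairs.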
Pages: 1336-1346
Page count: 11
Related papers
50 in total
  • [1] Exact algorithms for haplotype assembly from whole-genome sequence data
    Chen, Zhi-Zhong
    Deng, Fei
    Wang, Lusheng
    [J]. BIOINFORMATICS, 2013, 29 (16) : 1938 - 1945
  • [2] Optimal algorithms for haplotype assembly from whole-genome sequence data
    He, Dan
    Choi, Arthur
    Pipatsrisawat, Knot
    Darwiche, Adnan
    Eskin, Eleazar
    [J]. BIOINFORMATICS, 2010, 26 (12) : i183 - i190
  • [3] GAMIBHEAR: whole-genome haplotype reconstruction from Genome Architecture Mapping data
    Markowski, Julia
    Kempfer, Rieke
    Kukalev, Alexander
    Irastorza-Azcarate, Ibai
    Loof, Gesa
    Kehr, Birte
    Pombo, Ana
    Rahmann, Sven
    Schwarz, Roland F.
    [J]. BIOINFORMATICS, 2021, 37 (19) : 3128 - 3135
  • [4] Relationship Estimation from Whole-Genome Sequence Data
    Li, Hong
    Glusman, Gustavo
    Hu, Hao
    Shankaracharya
    Caballero, Juan
    Hubley, Robert
    Witherspoon, David
    Guthery, Stephen L.
    Mauldin, Denise E.
    Jorde, Lynn B.
    Hood, Leroy
    Roach, Jared C.
    Huff, Chad D.
    [J]. PLOS GENETICS, 2014, 10 (01):
  • [5] Assembly and annotation of whole-genome sequence of Fusarium equiseti
    Li, Xueping
    Xu, Shiyang
    Zhang, Jungao
    Li, Minquan
    [J]. GENOMICS, 2021, 113 (04) : 2870 - 2876
  • [6] Applications of the double-barreled data in whole-genome shotgun sequence assembly and analysis
    HAN Yujun 1
    2. Beijing Genomics Institute
    3. James D. Watson Institute of Genome Sciences
    [J]. Science China Life Sciences, 2005, (03) : 300 - 306
  • [7] Applications of the double-barreled data in whole-genome shotgun sequence assembly and analysis
    Han, YJ
    Ni, PX
    Lü, H
    Ye, J
    Hu, JF
    Chen, C
    Huang, XG
    Cong, LJ
    Li, GY
    Wang, J
    Gu, XC
    Yu, J
    Li, SG
    [J]. SCIENCE IN CHINA SERIES C-LIFE SCIENCES, 2005, 48 (03) : 300 - 306
  • [8] Applications of the double-barreled data in whole-genome shotgun sequence assembly and analysis
    Yujun Han
    Peixiang Ni
    Hong Lü
    Jia Ye
    Jianfei Hu
    Chen Chen
    Xiangang Huang
    Lijuan Cong
    Guangyuan Li
    Jing Wang
    Xiaocheng Gu
    Jun Yu
    Songgang Li
    [J]. Science in China Series C: Life Sciences, 2005, 48 (3) : 300 - 306
  • [9] Whole-genome sequence assembly of the water buffalo (Bubalus bubalis)
    Tantia, M. S.
    Vijh, R. K.
    Bhasin, V.
    Sikka, Poonam
    Vij, P. K.
    Kataria, R. S.
    Mishra, B. P.
    Yadav, S. P.
    Pandey, A. K.
    Sethi, R. K.
    Joshi, B. K.
    Gupta, S. C.
    Pathak, K. M. L.
    [J]. INDIAN JOURNAL OF ANIMAL SCIENCES, 2011, 81 (05) : 465 - 473
  • [10] Whole-genome sequence assembly for mammalian genomes: Arachne 2
    Jaffe, DB
    Butler, J
    Gnerre, S
    Mauceli, E
    Lindblad-Toh, K
    Mesirov, JP
    Zody, MC
    Lander, ES
    [J]. GENOME RESEARCH, 2003, 13 (01) : 91 - 96