High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

Cited by: 63
Authors
Dilthey, Alexander T. [1 ,2 ]
Gourraud, Pierre-Antoine [3 ,4 ]
Mentzer, Alexander J. [1 ]
Cereb, Nezih [5 ]
Iqbal, Zamin [1 ]
McVean, Gil [1 ,6 ]
Affiliations
[1] Univ Oxford, Wellcome Trust Ctr Human Genet, Oxford, England
[2] NHGRI, NIH, Bethesda, MD 20892 USA
[3] UCSF, Dept Neurol, San Francisco, CA USA
[4] Univ Nantes, Nantes Univ Hosp, INSERM, Unit ATIP 1064,Avenir Team 6, Nantes, France
[5] Histogenetics, Ossining, NY USA
[6] Univ Oxford, Li Ka Shing Ctr Hlth Informat & Discovery, Oxford, England
Funding
European Research Council; Wellcome Trust (UK);
Keywords
HIGH-RESOLUTION HLA; CLASS-I; SUSCEPTIBILITY;
DOI
10.1371/journal.pcbi.1005151
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. 
We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently approximately 30-250 CPU hours per sample) remain a significant challenge to practical application.
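The third step of the abstract (choosing the most likely pair of alleles at a locus) can be illustrated with a toy calculation. The sketch below is not the HLA*PRG implementation: it assumes that each read already has a per-allele likelihood P(read | allele), that reads are independent, and that each read originates from either allele of the pair with probability 0.5. The function name `best_allele_pair`, the allele labels, and all numbers are hypothetical.

```python
import math
from itertools import combinations_with_replacement


def best_allele_pair(read_likelihoods):
    """Return the allele pair maximizing a simple diploid log-likelihood.

    read_likelihoods maps an allele name to a list of P(read_i | allele),
    one entry per read, with the same read order for every allele.
    Assumption (not taken from the paper): each read comes from either
    allele of the pair with probability 0.5, and reads are independent.
    """
    alleles = sorted(read_likelihoods)
    n_reads = len(read_likelihoods[alleles[0]])
    best_pair, best_ll = None, -math.inf
    # Enumerate unordered pairs, including homozygous pairs (a, a).
    for a, b in combinations_with_replacement(alleles, 2):
        ll = 0.0
        for i in range(n_reads):
            p = 0.5 * read_likelihoods[a][i] + 0.5 * read_likelihoods[b][i]
            ll += math.log(p) if p > 0 else -math.inf
        if ll > best_ll:
            best_pair, best_ll = (a, b), ll
    return best_pair, best_ll


# Hypothetical toy data: four reads, three candidate alleles.
toy = {
    "A*01:01:01G": [0.9, 0.9, 0.01, 0.01],
    "A*02:01:01G": [0.01, 0.01, 0.9, 0.9],
    "A*03:01:01G": [0.1, 0.1, 0.1, 0.1],
}
pair, loglik = best_allele_pair(toy)
print(pair, round(loglik, 2))
```

In this toy input two reads fit the first allele and two fit the second, so the heterozygous pair of those two alleles scores higher than any homozygous pair or any pair involving the third allele. Exhaustive enumeration is quadratic in the number of candidate alleles, which is acceptable for a sketch but is only one of several possible search strategies.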
Pages: 16