High-Accuracy HLA Type Inference from Whole-Genome Sequencing Data Using Population Reference Graphs

Cited by: 63
Authors
Dilthey, Alexander T. [1, 2]
Gourraud, Pierre-Antoine [3, 4]
Mentzer, Alexander J. [1]
Cereb, Nezih [5]
Iqbal, Zamin [1]
McVean, Gil [1, 6]
Affiliations
[1] Univ Oxford, Wellcome Trust Ctr Human Genet, Oxford, England
[2] NHGRI, NIH, Bethesda, MD 20892 USA
[3] UCSF, Dept Neurol, San Francisco, CA USA
[4] Univ Nantes, Nantes Univ Hosp, INSERM, Unit ATIP 1064,Avenir Team 6, Nantes, France
[5] Histogenetics, Ossining, NY USA
[6] Univ Oxford, Li Ka Shing Ctr Hlth Informat & Discovery, Oxford, England
Funding
European Research Council; Wellcome Trust (UK);
Keywords
HIGH-RESOLUTION HLA; CLASS-I; SUSCEPTIBILITY;
DOI
10.1371/journal.pcbi.1005151
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification codes
071010; 081704;
Abstract
Genetic variation at the Human Leucocyte Antigen (HLA) genes is associated with many autoimmune and infectious disease phenotypes, is an important element of the immunological distinction between self and non-self, and shapes immune epitope repertoires. Determining the allelic state of the HLA genes (HLA typing) as a by-product of standard whole-genome sequencing data would therefore be highly desirable and enable the immunogenetic characterization of samples in currently ongoing population sequencing projects. Extensive hyperpolymorphism and sequence similarity between the HLA genes, however, pose problems for accurate read mapping and make HLA type inference from whole-genome sequencing data a challenging problem. We describe how to address these challenges in a Population Reference Graph (PRG) framework. First, we construct a PRG for 46 (mostly HLA) genes and pseudogenes, their genomic context and their characterized sequence variants, integrating a database of over 10,000 known allele sequences. Second, we present a sequence-to-PRG paired-end read mapping algorithm that enables accurate read mapping for the HLA genes. Third, we infer the most likely pair of underlying alleles at G group resolution from the IMGT/HLA database at each locus, employing a simple likelihood framework. We show that HLA*PRG, our algorithm, outperforms existing methods by a wide margin. We evaluate HLA*PRG on six classical class I and class II HLA genes (HLA-A, -B, -C, -DQA1, -DQB1, -DRB1) and on a set of 14 samples (3 samples with 2 x 100bp, 11 samples with 2 x 250bp Illumina HiSeq data). Of 158 alleles tested, we correctly infer 157 alleles (99.4%). We also identify and re-type two erroneous alleles in the original validation data. 
We conclude that HLA*PRG for the first time achieves accuracies comparable to gold-standard reference methods from standard whole-genome sequencing data, though high computational demands (currently approximately 30-250 CPU hours per sample) remain a significant challenge to practical application.
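The abstract's third step, picking the most likely pair of alleles at a locus under a simple likelihood framework, can be illustrated with a toy sketch. This is not the HLA*PRG implementation: the function names (`read_likelihood`, `best_allele_pair`), the allele names and sequences, the ungapped alignments, and the uniform per-base error rate `err` are all assumptions made for illustration. Each read is assumed to come from either allele of the pair with probability 0.5.

```python
import math
from itertools import combinations_with_replacement

def read_likelihood(read, offset, allele, err=0.01):
    """P(read | allele) for an ungapped alignment starting at `offset`.

    Assumes an independent, uniform per-base error rate `err` (hypothetical).
    """
    p = 1.0
    for i, base in enumerate(read):
        p *= (1 - err) if allele[offset + i] == base else err / 3
    return p

def best_allele_pair(reads, alleles, err=0.01):
    """Return the unordered allele pair with the highest log-likelihood.

    `reads` is a list of (sequence, offset) tuples; `alleles` maps a name
    to its sequence. Homozygous pairs are included as candidates.
    """
    best, best_ll = None, -math.inf
    for a1, a2 in combinations_with_replacement(sorted(alleles), 2):
        ll = 0.0
        for read, offset in reads:
            # A read originates from either allele with equal probability.
            ll += math.log(
                0.5 * read_likelihood(read, offset, alleles[a1], err)
                + 0.5 * read_likelihood(read, offset, alleles[a2], err)
            )
        if ll > best_ll:
            best, best_ll = (a1, a2), ll
    return best, best_ll

# Toy data: made-up allele names and sequences, not IMGT/HLA entries.
alleles = {"A*toy1": "ACGTACGT", "A*toy2": "ACGAACGT", "A*toy3": "TCGTACGA"}
reads = [("ACGT", 0), ("ACGA", 0), ("CGTA", 1), ("CGAA", 1)]
pair, ll = best_allele_pair(reads, alleles)
print(pair, round(ll, 3))
```

In this toy input, half of the reads match `A*toy1` exactly and the other half match `A*toy2`, so the heterozygous pair of those two alleles scores highest. A real pipeline would draw its reads from the graph mapping step and would have to handle paired ends, gaps, and base qualities, none of which this sketch attempts.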
Pages: 16