Secure discovery of genetic relatives across large-scale and distributed genomic data sets

Cited by: 2
Authors
Hong, Matthew M. [1]
Froelicher, David [1,2]
Magner, Ricky [2]
Popic, Victoria [2]
Berger, Bonnie [1,2,3]
Cho, Hyunghoon [4]
Affiliations
[1] MIT, Comp Sci & Artificial Intelligence Lab, Cambridge, MA 02139 USA
[2] Broad Inst Massachusetts Inst Technol & Harvard, Cambridge, MA 02142 USA
[3] MIT, Dept Math, Cambridge, MA 02139 USA
[4] Yale Univ, Dept Biomed Informat & Data Sci, New Haven, CT 06510 USA
Funding
U.S. National Institutes of Health
Keywords
CRYPTIC RELATEDNESS; ASSOCIATIONS; INFERENCE; MODEL;
DOI
10.1101/gr.279057.124
Chinese Library Classification number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
Finding relatives within a study cohort is a necessary step in many genomic studies. However, when the cohort is distributed across multiple entities subject to data-sharing restrictions, performing this step often becomes infeasible. Developing a privacy-preserving solution for this task is challenging owing to the burden of estimating kinship between all the pairs of individuals across data sets. We introduce SF-Relate, a practical and secure federated algorithm for identifying genetic relatives across data silos. SF-Relate vastly reduces the number of individual pairs to compare while maintaining accurate detection through a novel locality-sensitive hashing (LSH) approach. We assign individuals who are likely to be related together into buckets and then test relationships only between individuals in matching buckets across parties. To this end, we construct an effective hash function that captures identity-by-descent (IBD) segments in genetic sequences, which, along with a new bucketing strategy, enables accurate and practical private relative detection. To guarantee privacy, we introduce an efficient algorithm based on multiparty homomorphic encryption (MHE) to allow data holders to cooperatively compute the relatedness coefficients between individuals and to further classify their degrees of relatedness, all without sharing any private data. We demonstrate the accuracy and practical runtimes of SF-Relate on the UK Biobank and All of Us data sets. On a data set of 200,000 individuals split between two parties, SF-Relate detects 97% of third-degree or closer relatives within 15 h of runtime. Our work enables secure identification of relatives across large-scale genomic data sets.
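The bucketing idea in the abstract can be illustrated with a minimal plaintext sketch. This is not SF-Relate's hash function or protocol: the exact-match windows used as bucket keys are an assumed stand-in for the paper's IBD-capturing LSH, the function names and toy haplotypes are hypothetical, and no encryption is involved. It only shows how comparing individuals in matching buckets can shrink the set of cross-party pairs to test.

```python
from collections import defaultdict
from itertools import product


def bucket_keys(haplotype, window):
    """Return one key per non-overlapping window: (window index, allele substring).

    Assumption: an exactly shared window is a crude proxy for a shared IBD segment.
    """
    return {
        (start // window, haplotype[start:start + window])
        for start in range(0, len(haplotype) - window + 1, window)
    }


def build_buckets(cohort, window):
    """Map each bucket key to the set of individuals in one party's cohort."""
    buckets = defaultdict(set)
    for name, haplotype in cohort.items():
        for key in bucket_keys(haplotype, window):
            buckets[key].add(name)
    return buckets


def candidate_pairs(cohort_a, cohort_b, window):
    """Pair up individuals across parties only when they share a bucket key."""
    buckets_a = build_buckets(cohort_a, window)
    buckets_b = build_buckets(cohort_b, window)
    pairs = set()
    for key in buckets_a.keys() & buckets_b.keys():
        pairs.update(product(buckets_a[key], buckets_b[key]))
    return pairs


# Hypothetical toy data: a1 and b1 share their first 8-allele window.
party_a = {"a1": "0110100111001010", "a2": "1111000011110000"}
party_b = {"b1": "0110100100110101", "b2": "0000111100001111"}

pairs = candidate_pairs(party_a, party_b, window=8)
all_pairs = len(party_a) * len(party_b)
print(f"{len(pairs)} candidate pair(s) out of {all_pairs} possible: {sorted(pairs)}")
```

In this toy setting only the surviving candidate pairs would go on to a kinship computation, which in the paper is carried out under multiparty homomorphic encryption rather than in the clear.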
Pages: 1312-1323
Number of pages: 12