A new tool called DISSECT for analysing large genomic data sets using a Big Data approach

Cited by: 36
Authors
Canela-Xandri, Oriol [1, 2]
Law, Andy [1, 2]
Gray, Alan [3]
Woolliams, John A. [1, 2]
Tenesa, Albert [1, 2, 4]
Affiliations
[1] Univ Edinburgh, Roslin Inst, Edinburgh EH25 9RG, Midlothian, Scotland
[2] Univ Edinburgh, Royal Dick Sch Vet Studies, Edinburgh EH25 9RG, Midlothian, Scotland
[3] Univ Edinburgh, EPCC, Edinburgh EH9 3FD, Midlothian, Scotland
[4] Univ Edinburgh, MRC IGMM, MRC HGU, Edinburgh EH4 2XU, Midlothian, Scotland
Source
Nature Communications | 2015 / Volume 6
Funding
UK Medical Research Council; UK Biotechnology and Biological Sciences Research Council
Keywords
AVERAGE INFORMATION REML; MIXED-MODEL ANALYSIS; GENETIC RISK; ASSOCIATION; PREDICTION; DISEASE; TRAITS; ACCURACY
DOI
10.1038/ncomms10162
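Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09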
Abstract
Large-scale genetic and genomic data are increasingly available, and the major bottleneck in their analysis is a lack of sufficiently scalable computational tools. To address this problem in the context of complex trait analysis, we present DISSECT. DISSECT is new, freely available software that exploits the distributed-memory parallel computational architectures of compute clusters to perform a wide range of genomic and epidemiologic analyses that can currently be carried out only on reduced sample sizes or under restricted conditions. We demonstrate the usefulness of our new tool by addressing the challenge of predicting phenotypes from genotype data in human populations using mixed-linear model analysis. We analyse simulated traits from 470,000 individuals genotyped for 590,004 SNPs in ~4 h using the combined computational power of 8,400 processor cores. We find that prediction accuracies in excess of 80% of the theoretical maximum could be achieved with large sample sizes.
Pages: 6