Multidimensional scaling for large genomic data sets

Cited by: 70
Authors
Tzeng, Jengnan [1]
Lu, Henry Horng-Shing [2]
Li, Wen-Hsiung [1, 3]
Affiliations
[1] Acad Sinica, Genom Res Ctr, Taipei 115, Taiwan
[2] Natl Chiao Tung Univ, Inst Stat, Hsinchu 30050, Taiwan
[3] Univ Chicago, Dept Ecol & Evolut, Chicago, IL 60637 USA
Keywords
Singular Value Decomposition; Original Space; Yeast Cell Cycle; Bayesian Information Criterion Score; Cell Cycle Function;
DOI
10.1186/1471-2105-9-179
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Multidimensional scaling (MDS) aims to represent high-dimensional data in a low-dimensional space while preserving the similarities between data points. This reduction in dimensionality is crucial for analyzing and revealing the genuine structure hidden in the data. For noisy data, dimension reduction can effectively reduce the effect of noise on the embedded structure. For large data sets, dimension reduction can effectively reduce the complexity of information retrieval. MDS techniques are therefore used in many applications of data mining and gene network research. However, although a number of studies have applied MDS techniques to genomics research, the number of data points analyzed has been restricted by the high computational complexity of MDS. In general, a non-metric MDS method is faster than a metric MDS method, but it does not preserve the true relationships. The computational complexity of most metric MDS methods exceeds O(N^2), so it is difficult to process a data set with a large number of genes N, as in the case of whole-genome microarray data.
Results: We developed a new rapid metric MDS method with low computational complexity, making metric MDS applicable to large data sets. Computer simulation showed that the new method, split-and-combine MDS (SC-MDS), is fast, accurate and efficient. Our empirical studies using microarray data on the yeast cell cycle showed that K-means in the reduced-dimensional space performs similarly to or slightly better than K-means in the original space, while obtaining the clustering results about three times faster. Our clustering results using SC-MDS are also more stable than those in the original space. Hence, the proposed SC-MDS is useful for analyzing whole-genome data.
Conclusion: Our new method reduces the computational complexity from O(N^3) to O(N) when the dimension of the feature space is far less than the number of genes N, and it successfully reconstructs the low-dimensional representation, as classical MDS does. Its performance depends on the grouping method and the minimal number of intersection points between groups. Feasible grouping methods are suggested; each group must contain both neighboring and far-apart data points. Our method can represent large high-dimensional data sets in a low-dimensional space both efficiently and effectively.
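For orientation, the classical metric MDS baseline that the abstract compares against can be sketched as follows. This is a minimal illustration, not the authors' SC-MDS: it assumes Euclidean distances and NumPy, and the function names `classical_mds` and `pairwise_distances` and the synthetic data are hypothetical. The full eigendecomposition of the N x N matrix is the step whose cost grows as O(N^3).

```python
import numpy as np


def classical_mds(D, k):
    """Embed N points in k dimensions from an N x N Euclidean distance matrix D."""
    n = D.shape[0]
    # Centering matrix J = I - (1/n) * 11^T.
    J = np.eye(n) - np.ones((n, n)) / n
    # Double-centering the squared distances gives the Gram matrix of the
    # mean-centered configuration.
    B = -0.5 * J @ (D ** 2) @ J
    # Full symmetric eigendecomposition: the O(N^3) bottleneck.
    eigvals, eigvecs = np.linalg.eigh(B)
    # eigh returns eigenvalues in ascending order; keep the k largest.
    top = np.argsort(eigvals)[::-1][:k]
    # Clip tiny negative eigenvalues caused by round-off before the square root.
    scale = np.sqrt(np.clip(eigvals[top], 0.0, None))
    return eigvecs[:, top] * scale


def pairwise_distances(X):
    """Dense N x N matrix of Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


# Hypothetical data: 200 points with intrinsic dimension 3, observed in 50-D.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
observed = latent @ rng.normal(size=(3, 50))

D = pairwise_distances(observed)
Y = classical_mds(D, k=3)

# Largest absolute difference between original and embedded distances.
max_err = np.abs(pairwise_distances(Y) - D).max()
```

Because the synthetic points lie in a 3-dimensional subspace, a 3-dimensional embedding should reproduce the pairwise distances up to round-off, so `max_err` should be close to zero. On real microarray data the intrinsic dimension is not known in advance and the reconstruction is only approximate.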
Pages: 17
Related papers
50 in total
  • [1] Multidimensional scaling for large genomic data sets
    Jengnan Tzeng
    Henry Horng-Shing Lu
    Wen-Hsiung Li
    [J]. BMC Bioinformatics, 9
  • [2] ALTERNATIVE MULTIDIMENSIONAL SCALING METHODS FOR LARGE STIMULUS SETS
    RAO, VR
    KATZ, R
    [J]. JOURNAL OF MARKETING RESEARCH, 1971, 8 (04) : 488 - 494
  • [3] Eigensolver methods for progressive multidimensional scaling of large data
    Brandes, Ulrik
    Pich, Christian
    [J]. GRAPH DRAWING, 2007, 4372 : 42 - +
  • [4] Heterogeneous processing of large, multidimensional imaging data sets
    Kissick, David J.
    Ong, Ta-Hsuan
    Rubakhin, Stanislav
    Sweedler, Jonathan
    [J]. ABSTRACTS OF PAPERS OF THE AMERICAN CHEMICAL SOCIETY, 2013, 246
  • [5] Projection error evaluation for large multidimensional data sets
    Paulauskiene, Kotryna
    Kurasova, Olga
    [J]. NONLINEAR ANALYSIS-MODELLING AND CONTROL, 2016, 21 (01) : 92 - 102
  • [6] Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
    Hughes, Adam
    Ruan, Yang
    Ekanayake, Saliya
    Bae, Seung-Hee
    Dong, Qunfeng
    Rho, Mina
    Qiu, Judy
    Fox, Geoffrey
    [J]. BMC BIOINFORMATICS, 2012, 13
  • [7] Interpolative multidimensional scaling techniques for the identification of clusters in very large sequence sets
    Adam Hughes
    Yang Ruan
    Saliya Ekanayake
    Seung-Hee Bae
    Qunfeng Dong
    Mina Rho
    Judy Qiu
    Geoffrey Fox
    [J]. BMC Bioinformatics, 13
  • [8] Large-scale comparative visualisation of sets of multidimensional data
    Vohl, Dany
    Barnes, David G.
    Fluke, Christopher J.
    Poudel, Govinda
    Georgiou-Karistianis, Nellie
    Hassan, Amr H.
    Benovitski, Yuri
    Wong, Tsz Ho
    Kaluza, Owen L.
    Nguyen, Toan D.
    Bonnington, C. Paul
    [J]. PEERJ COMPUTER SCIENCE, 2016,
  • [9] A FAST ALGORITHM FOR TRANSPOSING LARGE MULTIDIMENSIONAL IMAGE DATA SETS
    VANHEEL, M
    [J]. ULTRAMICROSCOPY, 1991, 38 (01) : 75 - 83
  • [10] An Approximate Median Polish Algorithm for Large Multidimensional Data Sets
    Daniel Barbará
    Xintao Wu
    [J]. Knowledge and Information Systems, 2003, 5 (4) : 416 - 438