Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data

Cited by: 13
Authors
Sakellariou, Argiris [1, 2]
Sanoudou, Despina [3]
Spyrou, George [1]
Affiliations
[1] Acad Athens, Biomed Res Fdn, Biomed Informat Unit, Athens, Greece
[2] Natl & Kapodistrian Univ Athens, Dept Informat & Telecommun, Athens 11528, Greece
[3] Natl & Kapodistrian Univ Athens, Sch Med, Dept Pharmacol, Athens 11528, Greece
Source
BMC BIOINFORMATICS | 2012 / Vol. 13
Keywords
TUMOR CLASSIFICATION; FEATURE-SELECTION; SKELETAL-MUSCLE; NEMALINE MYOPATHY; MARKER GENES; R-PACKAGE; CANCER; ALGORITHMS; PREDICTION; RANKING;
DOI
10.1186/1471-2105-13-270
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification codes
071010 ; 081704 ;
Abstract
Background: A feature selection method for microarray gene expression data should be independent of platform, disease and dataset size. Our hypothesis is that among the statistically significant ranked genes in a gene list, there should be clusters of genes that share similar biological functions related to the investigated disease. Thus, instead of keeping the N top-ranked genes, it would be more appropriate to define and keep a number of gene cluster exemplars.
Results: We propose a hybrid feature selection (FS) method, mAP-KL, which combines multiple hypothesis testing and the affinity propagation (AP) clustering algorithm with the Krzanowski & Lai cluster quality index to select a small yet informative subset of genes. We applied mAP-KL to real microarray data, as well as to simulated data, and compared its performance against 13 other feature selection approaches. Across a variety of diseases and sample sizes, mAP-KL presents competitive classification results, particularly in neuromuscular diseases, where its overall AUC score was 0.91. Furthermore, mAP-KL generates concise yet biologically relevant and informative N-gene expression signatures, which can serve as a valuable tool for diagnostic and prognostic purposes, as well as a source of potential disease biomarkers in a broad range of diseases.
Conclusions: mAP-KL is a data-driven and classifier-independent hybrid feature selection method that applies to any disease classification problem based on microarray data, regardless of the number of available samples. Combining multiple hypothesis testing and AP leads to subsets of genes that classify unknown samples from both small and large patient cohorts with high accuracy.
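The idea in the abstract (rank genes with a multiple-testing procedure, cluster the significant ones with affinity propagation, and keep only the cluster exemplars) can be illustrated with a minimal sketch. This is not the authors' mAP-KL implementation. Several choices are assumptions made for illustration only: a single-step permutation maxT adjustment on Welch t-statistics, a negative squared Euclidean similarity between standardized gene profiles, and a median preference for AP. The Krzanowski & Lai index step that the abstract names is omitted here, so the number of clusters is whatever AP settles on. All function names and the toy data are hypothetical.

```python
import numpy as np

def welch_t(X, y):
    """Per-gene Welch t-statistic; X is samples x genes, y is a 0/1 label vector."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (b.mean(axis=0) - a.mean(axis=0)) / se

def maxt_adjusted_p(X, y, n_perm=200, seed=0):
    """Single-step maxT adjusted p-values from label permutations (assumed procedure)."""
    rng = np.random.default_rng(seed)
    t_obs = np.abs(welch_t(X, y))
    max_null = np.array([np.abs(welch_t(X, rng.permutation(y))).max()
                         for _ in range(n_perm)])
    exceed = (max_null[:, None] >= t_obs[None, :]).sum(axis=0)
    return t_obs, (1 + exceed) / (n_perm + 1)

def affinity_propagation(S, damping=0.7, n_iter=300):
    """Plain affinity propagation message passing; returns exemplar indices."""
    n = S.shape[0]
    rows = np.arange(n)
    R = np.zeros((n, n))  # responsibilities
    A = np.zeros((n, n))  # availabilities
    for _ in range(n_iter):
        # Responsibility update: compare each candidate with the best competitor.
        AS = A + S
        best = AS.argmax(axis=1)
        first = AS[rows, best]
        AS[rows, best] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[rows, best] = S[rows, best] - second
        R = damping * R + (1 - damping) * R_new
        # Availability update: accumulate positive support for each candidate.
        Rp = np.maximum(R, 0)
        Rp[rows, rows] = R[rows, rows]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[rows, rows].copy()
        A_new = np.minimum(A_new, 0)
        A_new[rows, rows] = diag
        A = damping * A + (1 - damping) * A_new
    return np.flatnonzero((A + R)[rows, rows] > 0)

def select_exemplar_genes(X, y, alpha=0.05, n_top=50):
    """Keep significant genes, cluster them with AP, return (ranked, exemplars)."""
    t_abs, p_adj = maxt_adjusted_p(X, y)
    keep = np.flatnonzero(p_adj <= alpha)
    keep = keep[np.argsort(-t_abs[keep])][:n_top]
    if len(keep) < 3:  # too few genes to cluster meaningfully
        return keep, keep
    Z = X[:, keep]
    Z = (Z - Z.mean(axis=0)) / Z.std(axis=0)  # standardize each gene profile
    S = -((Z.T[:, None, :] - Z.T[None, :, :]) ** 2).sum(axis=2)
    off_diagonal = ~np.eye(len(keep), dtype=bool)
    np.fill_diagonal(S, np.median(S[off_diagonal]))  # median preference (assumption)
    return keep, keep[affinity_propagation(S)]

# Toy data (hypothetical): 20 samples, 50 genes, two blocks of co-regulated genes.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 10)
X = rng.normal(size=(20, 50))
X[:, 0:3] += 8 * y[:, None] + rng.normal(size=(20, 1))
X[:, 3:6] += -8 * y[:, None] + rng.normal(size=(20, 1))

ranked, exemplars = select_exemplar_genes(X, y)
print("significant genes:", ranked)
print("exemplar genes:", exemplars)
```

In this sketch the exemplars are always a subset of the genes that survive the multiple-testing filter, which mirrors the abstract's point that a few representative genes replace a longer top-N list.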
Pages: 19
Related papers
12 in total
  • [1] Combining multiple hypothesis testing and affinity propagation clustering leads to accurate, robust and sample size independent classification on gene expression data
    Argiris Sakellariou
    Despina Sanoudou
    George Spyrou
    [J]. BMC Bioinformatics, 13
  • [2] Lung Cancer Classification and Gene Selection by Combining Affinity Propagation Clustering and Sparse Group Lasso
    Li, Juntao
    Chang, Mingming
    Gao, Qinghui
    Song, Xuekun
    Gao, Zhiyu
    [J]. CURRENT BIOINFORMATICS, 2020, 15 (07) : 703 - 712
  • [3] Combining Multiple Clustering and Network Analysis for Discoveries in Gene Expression Data
    Alhajj, Sleiman
    Alhajj, Aya
    Ozyer, Sibel Tariyan
    [J]. PROCEEDINGS OF THE 2021 IEEE/ACM INTERNATIONAL CONFERENCE ON ADVANCES IN SOCIAL NETWORKS ANALYSIS AND MINING, ASONAM 2021, 2021, : 502 - 509
  • [4] Optimal sample size for multiple testing: The case of gene expression microarrays
    Müller, P
    Parmigiani, G
    Robert, C
    Rousseau, J
    [J]. JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, 2004, 99 (468) : 990 - 1001
  • [5] Hierarchical clustering combining numerical and biological similarities for gene expression data classification
    Bosio, Mattia
    Salembier, Philippe
    Bellot, Pau
    Oliveras-Verges, Albert
    [J]. 2013 35TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY (EMBC), 2013, : 584 - 587
  • [6] Clustering by soft-constraint affinity propagation: applications to gene-expression data
    Leone, Michele
    Sumedha
    Weigt, Martin
    [J]. BIOINFORMATICS, 2007, 23 (20) : 2708 - 2715
  • [7] Rough Set based Attribute Clustering for Sample Classification of Gene Expression Data
    Nayak, Rudra Kalyan
    Mishra, Debahuti
    Shaw, Kailash
    Mishra, Sashikala
    [J]. INTERNATIONAL CONFERENCE ON MODELLING OPTIMIZATION AND COMPUTING, 2012, 38 : 1788 - 1792
  • [8] MULTIPLE HYPOTHESIS TESTING ADJUSTED FOR LATENT VARIABLES, WITH AN APPLICATION TO THE AGEMAP GENE EXPRESSION DATA
    Sun, Yunting
    Zhang, Nancy R.
    Owen, Art B.
    [J]. ANNALS OF APPLIED STATISTICS, 2012, 6 (04) : 1664 - 1688
  • [9] Combining multiple perspective as intelligent agents into robust approach for biomarker detection in gene expression data
    Alshalalfa, Mohammed
    Naji, Ghada
    Qabaja, Ala
    Alhajj, Reda
    Rokne, Jon
    [J]. INTERNATIONAL JOURNAL OF DATA MINING AND BIOINFORMATICS, 2011, 5 (03) : 332 - 350
  • [10] pySAPC, a python package for sparse affinity propagation clustering: Application to odontogenesis whole genome time series gene-expression data
    Cao, Huojun
    Amendt, Brad A.
    [J]. BIOCHIMICA ET BIOPHYSICA ACTA-GENERAL SUBJECTS, 2016, 1860 (11) : 2613 - 2618