STatistical Inference Relief (STIR) feature selection

Cited by: 39
Authors
Le, Trang T. [1]
Urbanowicz, Ryan J. [1]
Moore, Jason H. [1]
McKinney, Brett A. [2, 3]
Affiliations
[1] Univ Penn, Inst Biomed Informat, Dept Biostat Epidemiol & Informat, Philadelphia, PA 19104 USA
[2] Univ Tulsa, Dept Math, Tulsa, OK 74104 USA
[3] Univ Tulsa, Tandy Sch Comp Sci, Tulsa, OK 74104 USA
DOI
10.1093/bioinformatics/bty788
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification code
071010; 081704
Abstract
Motivation: Relief is a family of machine learning algorithms that uses nearest-neighbors to select features whose association with an outcome may be due to epistasis or statistical interactions with other features in high-dimensional data. Relief-based estimators are non-parametric in the statistical sense that they do not have a parameterized model with an underlying probability distribution for the estimator, making it difficult to determine the statistical significance of Relief-based attribute estimates. Thus, a statistical inferential formalism is needed to avoid imposing arbitrary thresholds to select the most important features. We reconceptualize the Relief-based feature selection algorithm to create a new family of STatistical Inference Relief (STIR) estimators that retains the ability to identify interactions while incorporating sample variance of the nearest neighbor distances into the attribute importance estimation. This variance permits the calculation of statistical significance of features and adjustment for multiple testing of Relief-based scores. Specifically, we develop a pseudo t-test version of Relief-based algorithms for case-control data.
Results: We demonstrate the statistical power and control of type I error of the STIR family of feature selection methods on a panel of simulated data that exhibits properties reflected in real gene expression data, including main effects and network interaction effects. We compare the performance of STIR when the adaptive radius method is used as the nearest neighbor constructor with STIR when the fixed-k nearest neighbor constructor is used. We apply STIR to real RNA-Seq data from a study of major depressive disorder and discuss STIR's straightforward extension to genome-wide association studies.
Availability and implementation: Code and data available at http://insilico.utulsa.edu/software/STIR.
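The idea of a pseudo t-test on nearest-neighbor differences can be illustrated with a small sketch. The code below is not the authors' implementation (that is at the URL given in the abstract). It assumes a fixed-k neighbor search with a range-normalized Manhattan distance, a pooled-variance two-sample t-statistic comparing per-attribute differences to nearest misses against differences to nearest hits, a normal approximation for the p-values, and a Bonferroni adjustment. The function names `relief_pseudo_t` and `simulate`, the parameter `k`, and the simulated data are all hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, variance


def relief_pseudo_t(X, y, k=3):
    """One pseudo t-statistic per attribute (illustrative sketch only)."""
    n, p = len(X), len(X[0])
    # Range-normalize each attribute so per-attribute diffs lie in [0, 1].
    lo = [min(row[a] for row in X) for a in range(p)]
    hi = [max(row[a] for row in X) for a in range(p)]
    span = [(hi[a] - lo[a]) or 1.0 for a in range(p)]

    def diff(a, i, j):
        return abs(X[i][a] - X[j][a]) / span[a]

    def dist(i, j):
        # Manhattan distance over all attributes (an assumed metric).
        return sum(diff(a, i, j) for a in range(p))

    hit_pairs, miss_pairs = [], []
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist(i, j))
        # k nearest neighbors with the same label (hits) and the other label (misses).
        hit_pairs += [(i, j) for j in others if y[j] == y[i]][:k]
        miss_pairs += [(i, j) for j in others if y[j] != y[i]][:k]

    scores = []
    for a in range(p):
        h = [diff(a, i, j) for i, j in hit_pairs]
        m = [diff(a, i, j) for i, j in miss_pairs]
        nh, nm = len(h), len(m)
        # Pooled standard deviation of hit and miss differences.
        pooled = math.sqrt(((nh - 1) * variance(h) + (nm - 1) * variance(m))
                           / (nh + nm - 2))
        scores.append((mean(m) - mean(h)) / (pooled * math.sqrt(1 / nh + 1 / nm)))
    return scores


def simulate(n=60, p=5, effect=3.0, seed=0):
    """Hypothetical case-control data: attribute 0 carries a main effect."""
    rng = random.Random(seed)
    y = [i % 2 for i in range(n)]
    X = [[rng.gauss(effect * y[i] if a == 0 else 0.0, 1.0) for a in range(p)]
         for i in range(n)]
    return X, y


X, y = simulate()
t_scores = relief_pseudo_t(X, y)
# One-sided p-values via a normal approximation (an assumption of this sketch).
p_values = [1.0 - NormalDist().cdf(t) for t in t_scores]
# Bonferroni adjustment as one simple multiple-testing correction.
p_adjusted = [min(1.0, pv * len(p_values)) for pv in p_values]
```

An attribute whose values differ more between an instance and its nearest misses than between the instance and its nearest hits gets a large positive statistic, so a significance threshold on the adjusted p-values can replace an arbitrary score cutoff. The neighbor pairs are not independent samples, which is one reason to regard such a statistic as a pseudo t-test and not an exact one.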
Pages: 1358-1365
Page count: 8