Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression

Times cited: 9
Authors
Jiang, Qin [1]
Jin, Min [1]
Affiliations
[1] Hunan Univ, Coll Comp Sci & Elect Engn, Changsha, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
breast cancer; machine learning; classification; feature selection; gradient boosted decision tree; BIOMARKER; BENIGN;
DOI
10.3389/fgene.2021.629946
Chinese Library Classification number
Q3 [Genetics];
Discipline classification code
071007 ; 090102 ;
Abstract
Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large volume of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that requires no prior information to identify mutated genes associated with breast cancer. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, onto which the mutated gene set is mapped to construct a co-expression profile. For feature selection, we propose a staged algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. For evaluation, we propose a weighted metric that modifies traditional accuracy to address the sample imbalance problem. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also apply the optimal model to the gene expression data alone and to the gold standard PAM50. The results show that the integrative network outperforms both gene expression alone and PAM50 on the average of most metrics, which indicates that integrating multiple data sources is effective and that the proposed method can discover mutated genes associated with breast cancer.
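Two of the steps described in the abstract, filtering genes by mutation frequency and scoring a classifier in a way that is robust to class imbalance, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the toy data, the gene labels, and the 0.5 threshold are all hypothetical, and balanced accuracy (mean per-class recall) is used here only as a common stand-in, because the abstract does not define the paper's weighted metric.

```python
import numpy as np


def filter_mutated_genes(mutation_matrix, gene_names, min_frequency):
    """Keep genes mutated in at least `min_frequency` of the samples.

    mutation_matrix: binary array of shape (n_samples, n_genes), 1 = mutated.
    Returns the retained gene names and the per-gene mutation frequency.
    """
    mutation_matrix = np.asarray(mutation_matrix)
    frequency = mutation_matrix.mean(axis=0)  # fraction of samples mutated per gene
    keep = frequency >= min_frequency
    selected = [gene for gene, kept in zip(gene_names, keep) if kept]
    return selected, frequency


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class contributes equally.

    Assumption: a stand-in for the paper's weighted metric, not its definition.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    recalls = [np.mean(y_pred[y_true == label] == label) for label in np.unique(y_true)]
    return float(np.mean(recalls))


# Toy data (hypothetical): 4 samples x 3 genes.
toy_matrix = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
])
toy_genes = ["GENE_A", "GENE_B", "GENE_C"]
selected, frequency = filter_mutated_genes(toy_matrix, toy_genes, min_frequency=0.5)

# Imbalanced toy labels: four samples of class 0, two of class 1.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]
plain_accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
weighted_accuracy = balanced_accuracy(y_true, y_pred)
```

On the toy labels, plain accuracy rewards the classifier for getting the majority class right, while the per-class average penalizes the missed minority sample more heavily. That difference is the kind of imbalance effect a weighted metric is meant to expose.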
Pages: 12
Related papers
50 in total
  • [31] Meta-heuristics for Feature Selection and Classification in Diagnostic Breast Cancer
    Khafaga, Doaa Sami
    Alhussan, Amel Ali
    El-kenawy, El-Sayed M.
    Takieldeen, Ali E.
    Hassan, Tarek M.
    Hegazy, Ehab A.
    Eid, Elsayed Abdel Fattah
    Ibrahim, Abdelhameed
    Abdelhamid, Abdelaziz A.
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01) : 748 - 765
  • [32] Breast cancer: A hybrid method for feature selection and classification in digital mammography
    Thawkar, Shankar
    Katta, Vijay
    Parashar, Ajay Raj
    Singh, Law Kumar
    Khanna, Munish
    [J]. INTERNATIONAL JOURNAL OF IMAGING SYSTEMS AND TECHNOLOGY, 2023, 33 (05) : 1696 - 1712
  • [33] Accuracy Enhancement for Breast Cancer Detection Using Classification and Feature Selection
    Jain, Somil
    Kumar, Puneet
    [J]. INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH, 2022, 12 (02)
  • [34] BREAST CANCER CLASSIFICATION USING A NOVEL HYBRID FEATURE SELECTION APPROACH
    Akkur, E.
    Turk, F.
    Erogul, Osman
    [J]. NEURAL NETWORK WORLD, 2023, 33 (02) : 67 - 83
  • [35] Ensemble Feature Selection for Breast Cancer Classification using Microarray Data
    Hengpraprohm, Supoj
    Jungjit, Suwimol
    [J]. INTELIGENCIA ARTIFICIAL-IBEROAMERICAN JOURNAL OF ARTIFICIAL INTELLIGENCE, 2020, 23 (65) : 100 - 114
  • [36] Unsupervised feature selection algorithm for multiclass cancer classification of gene expression RNA-Seq data
    Garcia-Diaz, Pilar
    Sanchez-Berriel, Isabel
    Martinez-Rojas, Juan A.
    Diez-Pascual, Ana M.
    [J]. GENOMICS, 2020, 112 (02) : 1916 - 1925
  • [37] Dimension Reduction and Classifier-Based Feature Selection for Oversampled Gene Expression Data and Cancer Classification
    Petinrin, Olutomilayo Olayemi
    Saeed, Faisal
    Salim, Naomie
    Toseef, Muhammad
    Liu, Zhe
    Muyide, Ibukun Omotayo
    [J]. PROCESSES, 2023, 11 (07)
  • [38] A simulation to analyze feature selection methods utilizing gene ontology for gene expression classification
    Gillies, Christopher E.
    Siadat, Mohammad-Reza
    Patel, Nilesh V.
    Wilson, George D.
    [J]. JOURNAL OF BIOMEDICAL INFORMATICS, 2013, 46 (06) : 1044 - 1059
  • [39] Toward integrating feature selection algorithms for classification and clustering
    Liu, H
    Yu, L
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2005, 17 (04) : 491 - 502
  • [40] SOMATIC GENE MUTATION AND BREAST-CARCINOMA
    HULTEN, M
    [J]. NATURE, 1984, 310 (5973) : 103 - 104