Feature Selection for Breast Cancer Classification by Integrating Somatic Mutation and Gene Expression

Cited by: 9
Authors
Jiang, Qin [1]
Jin, Min [1]
Affiliations
[1] Hunan University, College of Computer Science and Electronic Engineering, Changsha, People's Republic of China
Funding
National Natural Science Foundation of China
Keywords
breast cancer; machine learning; classification; feature selection; gradient boosted decision tree; BIOMARKER; BENIGN
DOI
10.3389/fgene.2021.629946
Chinese Library Classification number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
Exploring the molecular mechanisms of breast cancer is essential for the early prediction, diagnosis, and treatment of cancer patients. The large volume of data produced by high-throughput sequencing makes it difficult to identify driver mutations and a minimal optimal set of genes that are critical to cancer classification. In this study, we propose a novel method that requires no prior information to identify mutated genes associated with breast cancer. The somatic mutation data are processed into a mutation matrix, from which the mutation frequency of each gene is obtained. By setting a reasonable threshold on the mutation frequency, a mutated gene set is filtered from the mutation matrix. The gene expression data are used to generate a gene expression matrix, and the mutated gene set is mapped onto this matrix to construct a co-expression profile. For feature selection, we propose a staged algorithm that uses fold change and false discovery rate to select differentially expressed genes, mutual information to remove irrelevant and redundant features, and an embedded method based on a gradient boosting decision tree with Bayesian optimization to obtain an optimal model. For evaluation, we propose a weighted metric that modifies the traditional accuracy to address sample imbalance. We apply the proposed method to The Cancer Genome Atlas breast cancer data and identify a mutated gene set, among which the implicated genes are oncogenes or tumor suppressors previously reported to be associated with carcinogenesis. For comparison with the integrative network, we also run the optimal model on gene expression alone and on the gold standard PAM50. The results show that the integrative network outperforms gene expression alone and PAM50 on the average of most metrics, which indicates that the proposed method, by integrating multiple data sources, is effective and can discover mutated genes associated with breast cancer.
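Two steps in the abstract can be illustrated with a minimal sketch: filtering genes by mutation frequency, and replacing plain accuracy with a class-weighted score under sample imbalance. The abstract gives neither the threshold value nor the formula of the weighted metric, so everything below is an assumption made for illustration. The function names, the toy gene and sample identifiers, the 0.5 threshold, and the use of balanced accuracy (mean per-class recall) as a stand-in for the authors' metric are all hypothetical, and the toy values are not results from the paper.

```python
from collections import defaultdict


def mutation_frequency(mutation_matrix):
    """Fraction of samples in which each gene is mutated.

    mutation_matrix maps a sample ID to the set of genes mutated in it
    (a sparse form of a binary sample-by-gene matrix).
    """
    n_samples = len(mutation_matrix)
    counts = defaultdict(int)
    for genes in mutation_matrix.values():
        for gene in genes:
            counts[gene] += 1
    return {gene: count / n_samples for gene, count in counts.items()}


def filter_mutated_genes(mutation_matrix, threshold):
    """Keep genes whose mutation frequency is at least `threshold`."""
    freq = mutation_frequency(mutation_matrix)
    return {gene for gene, f in freq.items() if f >= threshold}


def plain_accuracy(y_true, y_pred):
    """Share of samples predicted correctly, ignoring class sizes."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return hits / len(y_true)


def class_weighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally.

    Assumption: a stand-in for the paper's weighted metric, whose exact
    definition is not stated in the abstract.
    """
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Toy data (hypothetical): four samples, three placeholder genes.
toy_matrix = {
    "S1": {"GENE_A", "GENE_B"},
    "S2": {"GENE_A"},
    "S3": {"GENE_B", "GENE_C"},
    "S4": {"GENE_A"},
}
selected = filter_mutated_genes(toy_matrix, threshold=0.5)

# Imbalanced toy labels: a classifier that always predicts "tumor".
y_true = ["tumor"] * 8 + ["normal"] * 2
y_pred = ["tumor"] * 10
acc = plain_accuracy(y_true, y_pred)
weighted = class_weighted_accuracy(y_true, y_pred)
```

On the toy labels, the always-"tumor" classifier looks strong under plain accuracy but scores no better than chance once each class is weighted equally, which is the kind of distortion a weighted metric is meant to expose.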
Pages: 12