Interpretable feature extraction and dimensionality reduction in ESM2 for protein localization prediction

Times cited: 0
Authors
Luo, Zeyu [1]
Wang, Rui [1]
Sun, Yawen [1]
Liu, Junhao [1]
Chen, Zongqing [2]
Zhang, Yu-Juan [1,3]
Affiliations
[1] Chongqing Normal Univ, Coll Life Sci, Chongqing, Peoples R China
[2] Chongqing Normal Univ, Sch Math Sci, Chongqing 400047, Peoples R China
[3] Chongqing Normal Univ, Coll Life Sci, Chongqing Key Lab Vector Insects, Chongqing Key Lab Anim Biol, Chongqing 401331, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
subcellular localization prediction; feature representation; large language models; model interpretation; Res-VAE; PHOSPHORYLATION; EXPLANATIONS; LANGUAGE; SEQUENCE;
DOI
Not available
Chinese Library Classification (CLC) number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
As the application of large language models (LLMs) has broadened into biological prediction, their capacity for self-supervised learning has been leveraged to create feature representations of amino acid sequences, and these models have set a new benchmark in downstream tasks such as subcellular localization. However, previous studies have focused primarily on either model architecture design or fine-tuning strategies, largely overlooking the nature of the features derived from LLMs. In this research, we propose different ESM2 representation extraction strategies that consider both the character type and the position within the ESM2 input sequence. Using dimensionality reduction, predictive analysis, and interpretability techniques, we illuminate potential associations between diverse feature types and specific subcellular localizations. In particular, predictions for the mitochondrion and Golgi apparatus favor features from segments closer to the N-terminus, and phosphorylation site-based features could mirror phosphorylation properties. We also evaluate the prediction performance and interpretability robustness of Random Forest and Deep Neural Networks with varied feature inputs. This work offers novel insights into maximizing the utility of LLMs, understanding their mechanisms, and extracting biological domain knowledge. The code, feature extraction API, and all relevant materials are available at https://github.com/yujuan-zhang/feature-representation-for-LLMs.
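The idea of extracting representations by character type and by position can be illustrated with a minimal sketch. It is not the authors' implementation (that lives in the repository linked above). It assumes per-token embeddings have already been computed and are laid out as one row for a leading `<cls>` token, one row per residue, and one row for a trailing `<eos>` token. The function name `extract_features`, the parameters `n_segments` and `site_positions`, and the random placeholder embeddings are all hypothetical.

```python
import numpy as np


def extract_features(token_embeddings, n_segments=4, site_positions=None):
    """Pool per-token embeddings into several candidate feature vectors.

    token_embeddings: array of shape (L + 2, D); row 0 is assumed to be <cls>,
    row -1 is assumed to be <eos>, rows 1..L are residues from N- to C-terminus.
    site_positions: optional 0-based residue indices (e.g. annotated
    phosphorylation sites); hypothetical input for illustration.
    """
    residues = token_embeddings[1:-1]
    if len(residues) < n_segments:
        raise ValueError("sequence is shorter than the number of segments")

    features = {
        "cls": token_embeddings[0],      # special-token feature
        "eos": token_embeddings[-1],     # special-token feature
        "mean": residues.mean(axis=0),   # average over all residues
    }

    # Position-based features: contiguous segments, segment_0 nearest the N-terminus.
    for i, segment in enumerate(np.array_split(residues, n_segments)):
        features[f"segment_{i}"] = segment.mean(axis=0)

    # Site-based feature: average only over the selected residue positions.
    if site_positions:
        features["sites"] = residues[list(site_positions)].mean(axis=0)

    return features


# Placeholder embeddings standing in for a real ESM2 output (100 residues, D = 8).
rng = np.random.default_rng(0)
emb = rng.normal(size=(102, 8))
feats = extract_features(emb, n_segments=4, site_positions=[3, 17, 42])
```

Each returned vector can then be passed separately to a classifier such as a Random Forest, which makes it possible to compare how informative each feature type is for a given localization.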
Pages: 16