Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study

Cited by: 1
Authors
Meher, Prabina Kumar [1]
Satpathy, Subhrajit [1]
Affiliations
[1] ICAR Indian Agr Stat Res Inst, New Delhi 110012, India
Keywords
Secondary structure; Computational biology; Machine learning; Splice junction; Nucleotide dependencies; PRE-MESSENGER-RNA; SUPPORT VECTOR MACHINES; 3'SS SELECTION; PREDICTION
DOI
10.1007/s13205-021-03036-8
Chinese Library Classification (CLC) number
Q81 [Bioengineering (Biotechnology)]; Q93 [Microbiology]
Subject classification codes
071005; 0836; 090102; 100705
Abstract
Identifying splice sites is an important step in predicting gene structure. Most existing splice site prediction studies have successfully combined machine learning algorithms with sequence-derived features. However, splice site identification that incorporates secondary structure information is lacking, particularly in plant species. In this study, we therefore evaluated the effect of structural features on splice site prediction accuracy in Arabidopsis thaliana. Prediction accuracies were evaluated with the sequence-derived features alone and with the structural features added to the sequence-derived features, using a support vector machine (SVM) as the prediction algorithm. Both short (40 base pairs) and long (105 base pairs) sequence datasets were considered. After incorporating the secondary structure features, accuracy improved only for the longer sequence dataset, and the improvement was larger for the sequence-derived features that account for nucleotide dependencies. For the short sequence dataset, little or no improvement in accuracy was found. The performance of SVM was further compared with that of the LogitBoost, Random Forest (RF), AdaBoost and XGBoost machine learning methods. The prediction accuracies of SVM, AdaBoost and XGBoost were on par with one another and higher than those of RF and LogitBoost. When prediction was performed using all the sequence-derived features together with the structural features, a small improvement in accuracy was found compared with combining individual sequence-based features with the structural features.
To the best of our knowledge, this is the first attempt at computational prediction of splice sites using machine learning methods that incorporate secondary structure information into the sequence-derived features. All the source codes are available at .
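The idea of appending structural features to sequence-derived features can be illustrated with a minimal Python sketch. It is not the authors' implementation: the encodings below (position-wise one-hot, adjacent dinucleotide frequencies as a simple stand-in for features that capture nucleotide dependencies, and a paired/unpaired indicator per position) are assumptions chosen for illustration, and the dot-bracket structure string is assumed to come from an external RNA folding tool. All function names are hypothetical.

```python
from itertools import product

NUCLEOTIDES = "ACGT"


def one_hot(seq):
    """Position-wise one-hot encoding: four values per nucleotide."""
    vec = []
    for base in seq.upper():
        vec.extend(1.0 if base == n else 0.0 for n in NUCLEOTIDES)
    return vec


def dinucleotide_freqs(seq):
    """Frequencies of the 16 adjacent dinucleotides (illustrative stand-in
    for sequence features that account for nucleotide dependencies)."""
    seq = seq.upper()
    pairs = ["".join(p) for p in product(NUCLEOTIDES, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        di = seq[i:i + 2]
        if di in counts:
            counts[di] += 1
    total = max(len(seq) - 1, 1)
    return [counts[p] / total for p in pairs]


def structure_features(dot_bracket):
    """Paired (1.0) / unpaired (0.0) indicator per position, followed by
    the overall fraction of paired positions."""
    paired = [0.0 if c == "." else 1.0 for c in dot_bracket]
    fraction = sum(paired) / len(paired) if paired else 0.0
    return paired + [fraction]


def combined_features(seq, dot_bracket):
    """Concatenate sequence-derived and structural features."""
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    return (one_hot(seq)
            + dinucleotide_freqs(seq)
            + structure_features(dot_bracket))


# Hypothetical 10-nt window with an assumed (not computed) structure.
seq = "CAGGTAAGTA"
structure = "((....)).."
features = combined_features(seq, structure)
print(len(features))
```

The resulting vector could then be passed to any classifier, such as an SVM; the sketch stops at feature construction so that it needs only the Python standard library.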
Pages: 13
Related papers
  • [1] Improved recognition of splice sites in A. thaliana by incorporating secondary structure information into sequence-derived features: a computational study
    Meher, Prabina Kumar
    Satpathy, Subhrajit
    [J]. 3 Biotech, 2021, 11
  • [2] Computational identification of protein S-sulfenylation sites by incorporating the multiple sequence features information
    Hasan, Md Mehedi
    Guo, Dianjing
    Kurata, Hiroyuki
    [J]. MOLECULAR BIOSYSTEMS, 2017, 13 (12) : 2545 - 2550
  • [3] Complementing sequence-derived features with structural information extracted from fragment libraries for protein structure prediction
    Liu, Siyuan
    Wang, Tong
    Xu, Qijiang
    Shao, Bin
    Yin, Jian
    Liu, Tie-Yan
    [J]. BMC BIOINFORMATICS, 2021, 22 (01)
  • [4] PSNO: Predicting Cysteine S-Nitrosylation Sites by Incorporating Various Sequence-Derived Features into the General Form of Chou's PseAAC
    Zhang, Jian
    Zhao, Xiaowei
    Sun, Pingping
    Ma, Zhiqiang
    [J]. INTERNATIONAL JOURNAL OF MOLECULAR SCIENCES, 2014, 15 (07) : 11204 - 11219
  • [5] Identification of S-glutathionylation sites in species-specific proteins by incorporating five sequence-derived features into the general pseudo-amino acid composition
    Zhao, Xiaowei
    Ning, Qiao
    Ai, Meiyue
    Chai, Haiting
    Yang, Guifu
    [J]. JOURNAL OF THEORETICAL BIOLOGY, 2016, 398 : 96 - 102