Statistical assessment of discriminative features for protein-coding and non coding cross-species conserved sequence elements

Cited by: 2
Authors
Creanza, Teresa M. [1 ,2 ]
Horner, David S. [2 ]
D'Addabbo, Annarita [1 ]
Maglietta, Rosalia [1 ]
Mignone, Flavio [3 ]
Ancona, Nicola [1 ]
Pesole, Graziano [4 ,5 ]
Affiliations
[1] CNR, Ist Studi Sistemi Intelligenti Automaz, I-70126 Bari, Italy
[2] Univ Milan, Dipartimento Sci Biomol & Biotecnol, Milan, Italy
[3] Univ Milan, Dipartimento Chim Strutturale & Stereochim Inorga, Milan, Italy
[4] Univ Bari, Dipartimento Biochim & Biol Mol, Bari, Italy
[5] CNR, Ist Tecnol Biomed, I-70126 Bari, Italy
Source
BMC BIOINFORMATICS | 2009 / Volume 10
Keywords
IDENTIFICATION; TOOL; REGIONS; SEARCH; MOUSE; BLAST; TAGS; RAT;
DOI
10.1186/1471-2105-10-S6-S2
Chinese Library Classification code
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: The identification of protein-coding elements in sets of mammalian conserved elements is one of the major challenges in current molecular biology research. Many features have been proposed for automatically distinguishing coding from non-coding conserved sequences, which makes a systematic statistical assessment of their differences necessary. A comprehensive study should comprise an association study, i.e. a comparison of the distributions of the features in the two classes, and a prediction study, in which the prediction accuracies of classifiers trained on single features and on groups of features are analyzed as a function of the species compared and of the sequence lengths.
Results: In this paper we compared the distributions of a set of comparative and non-comparative features and evaluated the prediction accuracy of classifiers trained to discriminate sequence elements conserved among human, mouse and rat. The association study showed that the distributions of the analyzed features differ significantly between the two classes. To study the influence of sequence length on feature performance, a predictive study was performed on different data sets composed of coding and non-coding alignments, equal in number and in length, with increasing average length across data sets. We found that the most discriminant feature was a comparative measure indicating the proportion of synonymous nucleotide substitutions per synonymous site. Moreover, linear discriminant classifiers trained on comparative features generally outperformed classifiers based on intrinsic ones. Finally, the prediction accuracy of classifiers trained on comparative features increased significantly when intrinsic features were added to the set of input variables, independently of sequence length (Kolmogorov-Smirnov P-value <= 0.05).
Conclusion: We observed distinct and consistent patterns for the individual and combined use of comparative and intrinsic classifiers, both with respect to different lengths of sequences/alignments and with respect to error rates in the classification of coding and non-coding elements. In particular, we noted that comparative features tend to be more accurate in the classification of coding sequences; this is likely related to the fact that such features capture deviations from strictly neutral evolution, expected as a consequence of the characteristics of the genetic code.
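The two-part design described in the abstract (an association study comparing feature distributions, then a prediction study measuring classifier accuracy) can be illustrated with a minimal sketch. This is not the authors' pipeline: the feature values are synthetic Gaussian draws, the function names `ks_statistic` and `fit_midpoint_classifier` are hypothetical, and the classifier is the simplest one-feature linear discriminant (equal priors and equal variances assumed, so the decision threshold is the midpoint of the two class means).

```python
import random
from bisect import bisect_right
from statistics import mean


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(a), sorted(b)

    def ecdf(sorted_vals, x):
        # Fraction of values less than or equal to x.
        return bisect_right(sorted_vals, x) / len(sorted_vals)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in a + b)


def fit_midpoint_classifier(coding, noncoding):
    """One-feature linear discriminant (equal priors, equal variances assumed)."""
    m_c, m_n = mean(coding), mean(noncoding)
    threshold = (m_c + m_n) / 2
    coding_is_low = m_c < m_n

    def predict(x):
        return "coding" if (x < threshold) == coding_is_low else "noncoding"

    return predict


# Synthetic, purely illustrative feature values (NOT data from the paper).
rng = random.Random(0)
coding = [rng.gauss(0.3, 0.1) for _ in range(200)]
noncoding = [rng.gauss(0.6, 0.1) for _ in range(200)]

# Association study: how different are the two feature distributions?
d = ks_statistic(coding, noncoding)

# Prediction study: train on the first half, evaluate on the second half.
predict = fit_midpoint_classifier(coding[:100], noncoding[:100])
held_out = [(x, "coding") for x in coding[100:]] + [(x, "noncoding") for x in noncoding[100:]]
accuracy = mean(1.0 if predict(x) == label else 0.0 for x, label in held_out)

print(f"KS statistic: {d:.3f}, held-out accuracy: {accuracy:.3f}")
```

In practice one would more likely use an established statistics library for the test and the classifier; the pure-Python version above only makes the two steps explicit.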
Pages: 12