RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure

Cited by: 13
Authors
Liu, Qi [2 ,3 ,4 ]
Yang, Yu [1 ]
Chen, Chun [1 ]
Bu, Jiajun [1 ]
Zhang, Yin [1 ]
Ye, Xiuzi [1 ]
Affiliations
[1] Zhejiang Univ, Coll Comp Sci, Hangzhou 310027, Peoples R China
[2] Zhejiang Univ, Zhejiang Calif Int Nanosyst Inst, Hangzhou 310029, Peoples R China
[3] Zhejiang Univ, Coll Life Sci, Hangzhou 310027, Peoples R China
[4] Zhejiang Univ, James D Watson Inst Genom Sci, Hangzhou 310008, Peoples R China
Keywords
DOI
10.1186/1471-2105-9-176
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Background: With the rapid emergence of RNA databases and newly identified non-coding RNAs, an efficient compression algorithm for RNA sequence and structural information is needed for the storage and analysis of such data. Although several algorithms for compressing DNA sequences have been proposed, none of them is suitable for compressing RNA sequences together with their secondary structures. Such compression not only facilitates the maintenance of RNA data, but also supplies a novel way to measure the informational complexity of RNA structural data. This raises the possibility of studying the relationship between the functional activities of RNA structures and their complexities, as well as various structural properties of RNA, on the basis of compression.
Results: RNACompress employs an efficient grammar-based model to compress RNA sequences and their secondary structures. The main goals of this algorithm are twofold: (1) to present a robust and effective way to compress RNA structural data; (2) to design a suitable model that represents RNA secondary structure and derives the informational complexity of the structural data from its compression. Our extensive tests have shown that RNACompress achieves a universally better compression ratio than other sequence-specific or general text compression algorithms, such as GenCompress, WinRAR and gzip. Moreover, a test comparing the activities of distinct GTP-binding RNAs (aptamers) with their structural complexity shows that our defined informational complexity can be used to describe how complexity varies with activity. These results lead to an objective means of comparing the functional properties of heteropolymers from the information perspective.
Conclusion: A universal algorithm for the compression of RNA secondary structure and the evaluation of its informational complexity is discussed in this paper. We have developed RNACompress as a useful tool for academic users. Extensive tests have shown that RNACompress is a universally efficient algorithm for the compression of RNA sequences with their secondary structures. RNACompress also serves as a good measure of the informational complexity of RNA secondary structure, which can be used to study the functional activities of RNA molecules.
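The idea of deriving informational complexity from compression can be illustrated with a minimal sketch. It does not reproduce RNACompress or its grammar-based model: it assumes a general-purpose compressor (Python's zlib, the DEFLATE method behind gzip, one of the baselines named in the abstract) and a dot-bracket string for the secondary structure. The function name `compressed_bits_per_base` is hypothetical.

```python
import zlib


def compressed_bits_per_base(sequence: str, structure: str) -> float:
    """Rough complexity proxy: bits of zlib output per nucleotide.

    Assumption: the structure is given in dot-bracket notation with one
    symbol per base. zlib is only a stand-in for a grammar-based model.
    """
    if not sequence:
        raise ValueError("sequence must not be empty")
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    # Compress the sequence and its structure together, as one payload.
    payload = (sequence + "\n" + structure).encode("ascii")
    compressed = zlib.compress(payload, 9)
    return 8 * len(compressed) / len(sequence)


if __name__ == "__main__":
    # A highly repetitive hairpin pattern compresses well, so its score is low.
    seq = "GGGGAAAACCCC" * 20
    struct = "((((....))))" * 20
    print(f"{compressed_bits_per_base(seq, struct):.2f} bits per base")
```

A lower value means the sequence and structure pair is more regular under this particular compressor. Values from a general-purpose compressor are only comparable with each other, not with complexities reported for RNACompress.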
Pages: 12
Related papers
50 in total
  • [1] RNACompress: Grammar-based compression and informational complexity measurement of RNA secondary structure
    Qi Liu
    Yu Yang
    Chun Chen
    Jiajun Bu
    Yin Zhang
    Xiuzi Ye
    [J]. BMC Bioinformatics, 9
  • [2] On the complexity of optimal grammar-based compression
    Arpe, Jan
    Reischuk, Rüdiger
    [J]. DCC 2006: DATA COMPRESSION CONFERENCE, PROCEEDINGS, 2006, : 173 - +
  • [3] Grammar-Based Tree Compression
    Lohrey, Markus
    [J]. DEVELOPMENTS IN LANGUAGE THEORY (DLT 2015), 2015, 9168 : 46 - 57
  • [4] Grammar-based graph compression
    Maneth, Sebastian
    Peternek, Fabian
    [J]. INFORMATION SYSTEMS, 2018, 76 : 19 - 45
  • [5] Grammar-Based Compression of Unranked Trees
    Gascon, Adria
    Lohrey, Markus
    Maneth, Sebastian
    Reh, Carl Philipp
    Sieber, Kurt
    [J]. COMPUTER SCIENCE - THEORY AND APPLICATIONS, CSR 2018, 2018, 10846 : 118 - 131
  • [6] Grammar-Based Compression of Unranked Trees
    Gascon, Adria
    Lohrey, Markus
    Maneth, Sebastian
    Reh, Carl Philipp
    Sieber, Kurt
    [J]. THEORY OF COMPUTING SYSTEMS, 2020, 64 (01) : 141 - 176
  • [7] Grammar-Based Compression of Unranked Trees
    Adrià Gascón
    Markus Lohrey
    Sebastian Maneth
    Carl Philipp Reh
    Kurt Sieber
    [J]. Theory of Computing Systems, 2020, 64 : 141 - 176
  • [8] Grammar-based compression of interpreted code
    Evans, WS
    Fraser, CW
    [J]. COMMUNICATIONS OF THE ACM, 2003, 46 (08) : 61 - 66
  • [9] Grammar-Based Compression in a Streaming Model
    Gagie, Travis
    Gawrychowski, Pawel
    [J]. LANGUAGE AND AUTOMATA THEORY AND APPLICATIONS, 2010, 6031 : 273 - +
  • [10] Approximation algorithms for grammar-based compression
    Lehman, E
    Shelat, A
    [J]. PROCEEDINGS OF THE THIRTEENTH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2002, : 205 - 212