Ambiguous genes due to aligners and their impact on RNA-seq data analysis

Citations: 2
Authors
Szabelska-Beresewicz, Alicja [1]
Zyprych-Walczak, Joanna [1]
Siatkowski, Idzi [1]
Okoniewski, Michal [2]
Affiliations
[1] Poznan Univ Life Sci, Dept Math & Stat Methods, Wojska Polskiego 28, PL-60637 Poznan, Poland
[2] Swiss Fed Inst Technol, Sci IT Serv, Weinbergstr 11, CH-8092 Zurich, Switzerland
Keywords
REPRODUCIBILITY
DOI
10.1038/s41598-023-41085-6
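Chinese Library Classification
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [Natural Sciences, General]
Discipline classification codes
07; 0710; 09
Abstract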
This study focuses on ambiguous genes, i.e. genes whose expression is difficult to estimate from data produced by next-generation sequencing technologies. We considered RNA sequencing (RNA-Seq) experiments performed on the Illumina platform. It is crucial to identify such genes and to understand why they are difficult, because they may be involved in disease. Misleading results for these genes could contribute to a misunderstanding of the cause of certain diseases and, in turn, to inappropriate treatment. We hypothesized that ambiguous genes are difficult to map because of their complex structure, so we analyzed RNA-Seq data with different mappers to find genes whose measurements differ between aligners. We identified such genes using a generalized linear model with two factors: the mapper and the groups introduced by the experiment. A large proportion of ambiguous genes are pseudogenes. The high sequence similarity of pseudogenes to functional genes may indicate problems in alignment procedures. In addition, a predictive analysis assessed how difficult genes perform in classification: we compared how effectively samples were classified into specific groups when the expression of difficult and non-difficult genes was used as covariates. In almost all cases considered, ambiguous genes had less predictive power.
Pages: 11