Identification of sample annotation errors in gene expression datasets

Authors
Miriam Lohr
Birte Hellwig
Karolina Edlund
Johanna S. M. Mattsson
Johan Botling
Marcus Schmidt
Jan G. Hengstler
Patrick Micke
Jörg Rahnenführer
Affiliations
[1] TU Dortmund University, Department of Statistics
[2] Leibniz Research Centre for Working Environment and Human Factors (IfADo) at TU Dortmund
[3] Uppsala University, Department of Immunology, Genetics and Pathology
[4] University Hospital, Department of Obstetrics and Gynecology
Source
Archives of Toxicology | 2015 / Volume 89 (12)
Keywords
Gene expression; Microarray; Misannotation; Quality control; Male–female classifier
DOI
Not available
Abstract
The comprehensive transcriptomic analysis of clinically annotated human tissue has found widespread use in oncology, cell biology, immunology, and toxicology. In cancer research, microarray-based gene expression profiling has successfully been applied to subclassify disease entities, predict therapy response, and identify cellular mechanisms. Public accessibility of raw data, together with corresponding information on clinicopathological parameters, offers the opportunity to reuse previously analyzed data and to gain statistical power by combining multiple datasets. However, results and conclusions depend on the reliability of the available information. Here, we propose gene expression-based methods for identifying sample misannotations in public transcriptomic datasets. Sample mix-ups can be detected by a classifier that differentiates between samples from male and female patients. Correlation analysis identifies multiple measurements of material from the same sample. The analysis of 45 datasets (including 4913 patients) revealed that erroneous sample annotation, affecting 40% of the analyzed datasets, may be a more widespread phenomenon than previously thought. Removal of erroneously labelled samples may influence the results of the statistical evaluation in some datasets. Our methods may help to identify individual datasets that contain numerous discrepancies and could be routinely incorporated into the statistical analysis of clinical gene expression data.
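The two checks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not specify the classifier, so the sketch assumes a deliberately crude rule that compares log-scale expression of two sex-specific marker genes (XIST and RPS4Y1 are an assumption here, as are all function names and the correlation threshold of 0.99). The data are synthetic and serve only to show the mechanics.

```python
import numpy as np


def predict_sex(xist, rps4y1):
    """Crude illustrative rule (an assumption, not the paper's classifier):
    call a sample 'female' when XIST expression exceeds RPS4Y1 expression."""
    xist = np.asarray(xist, dtype=float)
    rps4y1 = np.asarray(rps4y1, dtype=float)
    return np.where(xist > rps4y1, "female", "male")


def flag_sex_mismatches(annotated_sex, xist, rps4y1):
    """Return indices of samples whose annotated sex disagrees with the prediction."""
    predicted = predict_sex(xist, rps4y1)
    return np.flatnonzero(predicted != np.asarray(annotated_sex))


def find_duplicate_pairs(expr, threshold=0.99):
    """Return (i, j) index pairs of samples (columns of a genes x samples matrix)
    whose Pearson correlation exceeds the hypothetical threshold."""
    corr = np.corrcoef(np.asarray(expr, dtype=float), rowvar=False)
    i, j = np.triu_indices_from(corr, k=1)  # each unordered pair once
    mask = corr[i, j] > threshold
    return list(zip(i[mask].tolist(), j[mask].tolist()))


# Synthetic example: sample 2 is annotated male but looks female.
annotated = ["female", "male", "male", "female"]
xist = [10.2, 4.1, 9.8, 10.5]
rps4y1 = [3.9, 11.0, 4.2, 4.0]
mismatches = flag_sex_mismatches(annotated, xist, rps4y1)

# Synthetic example: sample 3 is a near-copy of sample 0 (repeated measurement).
rng = np.random.default_rng(0)
expr = rng.normal(size=(200, 3))
repeat = expr[:, [0]] + rng.normal(scale=0.01, size=(200, 1))
expr = np.hstack([expr, repeat])
duplicates = find_duplicate_pairs(expr)
```

In practice, a flagged sample is only a candidate for misannotation: a real classifier would need platform-specific probes and a validated decision rule, and a suitable correlation cut-off would depend on the dataset.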
Pages: 2265–2272
Number of pages: 8