GSC: efficient lossless compression of VCF files with fast query

Authors
Luo, Xiaolong [1]
Chen, Yuxin [2,3,4]
Liu, Ling [5]
Ding, Lulu [6]
Li, Yuxiang [2,3,4]
Li, Shengkang [2,3,4]
Zhang, Yong [2,3,4]
Zhu, Zexuan [6]
Affiliations
[1] Shenzhen Univ, Coll Comp Sci & Software Engn, Shenzhen 518060, Peoples R China
[2] BGI Res, Wuhan 430074, Peoples R China
[3] BGI Res, Shenzhen 518083, Peoples R China
[4] BGI Res, Guangdong Bigdata Engn Technol Res Ctr Life Sci, Shenzhen 518083, Peoples R China
[5] Xidian Univ, Guangzhou Inst Technol, Guangzhou 510555, Peoples R China
[6] Shenzhen Univ, Natl Engn Lab Big Data Syst Comp Technol, Shenzhen 518060, Peoples R China
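Source
GIGASCIENCE | 2024 / Vol. 13
Funding
National Natural Science Foundation of China
Keywords
VCF/BCF files; lossless compression; rapid random access
DOI
10.1093/gigascience/giae046
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09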
Abstract
Background: With the rise of large-scale genome sequencing projects, genotyping of thousands of samples has produced immense variant call format (VCF) files. It is becoming increasingly challenging to store, transfer, and analyze these voluminous files. Compression methods have been used to tackle these issues, aiming for both a high compression ratio and fast random access. However, existing methods have not yet achieved a satisfactory compromise between these 2 objectives.
Findings: To address this issue, we introduce GSC (Genotype Sparse Compression), a specialized and refined lossless compression tool for VCF files. In benchmark tests conducted across various open-source datasets, GSC showed exceptional performance in genotype data compression. Compared with the most advanced existing tools (namely, GBC and GTC), GSC achieved compression ratios 26.9% to 82.4% higher on these datasets. In lossless compression scenarios, a mode not supported by either GBC or GTC, GSC also performed robustly, with compression ratios 1.5x to 6.5x greater than those of general-purpose tools such as gzip, zstd, and BCFtools. Achieving such high compression ratios did require some reasonable trade-offs, including longer decompression times: GSC was 1.2x to 2x slower than GBC, yet 1.1x to 1.4x faster than GTC. Moreover, GSC maintained decompression query speeds equivalent to those of its competitors. In terms of RAM usage, GSC outperformed both counterparts. Overall, GSC's comprehensive performance surpasses that of the most advanced existing tools.
Conclusion: GSC balances high compression ratios with rapid data access, enhancing genomic data management. It supports seamless PLINK binary format conversion, simplifying downstream analysis.
Pages: 10