Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems

Cited by: 0
Authors
Zhao, Dongfang [1, 2]
Yin, Jian [2]
Qiao, Kan [1, 3]
Raicu, Ioan [1, 4]
Affiliations
[1] IIT, Chicago, IL 60616 USA
[2] Pacific Northwest Natl Lab, Richland, WA USA
[3] Google Inc, Mountain View, CA USA
[4] Argonne Natl Lab, Argonne, IL 60439 USA
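Keywords
DOI
Not available
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812;
Abstract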
Data compression could ameliorate the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional approach of naively applying data compression at the file or block level creates a dilemma between efficient random access and a high compression ratio. File-level compression can barely support efficient random access to the compressed data: any retrieval request must trigger decompression from the beginning of the compressed file. Block-level compression provides flexible random access to the compressed data, but introduces extra overhead because the compressor is applied to each and every block, which degrades the overall compression ratio. This paper introduces a concept called virtual chunks, which aims to support efficient random access to compressed scientific data without sacrificing its compression ratio. In essence, virtual chunks are logical blocks identified by appended references without breaking the physical continuity of the file content. These additional references allow decompression to start from an arbitrary position (efficient random access) while retaining the file's physical entirety to achieve a high compression ratio on par with file-level compression. One potential concern with virtual chunks lies in their space overhead (from the additional references), which degrades the compression ratio, but our analytic study and experimental results demonstrate that such overhead is negligible. We have implemented virtual chunks in two forms: a middleware to the GPFS parallel file system, and a module in the FusionFS distributed file system. Large-scale evaluations on up to 1,024 cores showed that virtual chunks could improve the I/O throughput by 2X.
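The idea of logical blocks marked by appended references can be illustrated with a minimal sketch. It is not the authors' implementation: delta encoding stands in for the compressor, and the names `encode`, `read_range`, and `chunk` are hypothetical. The encoded stream stays one continuous sequence, as in file-level compression, while a short list of references records the original value at every `chunk`-th position so that decoding can start at the nearest preceding reference instead of at the beginning.

```python
from typing import List, Tuple


def encode(data: List[int], chunk: int) -> Tuple[List[int], List[int]]:
    """Delta-encode `data` as one continuous stream and build references.

    deltas[0] is data[0]; deltas[i] is data[i] - data[i - 1] for i > 0.
    refs[j] is the original value at position j * chunk (an assumption of
    this sketch: one reference per fixed-size virtual chunk).
    """
    if chunk <= 0:
        raise ValueError("chunk must be positive")
    deltas = [
        data[i] if i == 0 else data[i] - data[i - 1]
        for i in range(len(data))
    ]
    refs = data[::chunk]
    return deltas, refs


def read_range(deltas: List[int], refs: List[int], chunk: int,
               start: int, stop: int) -> List[int]:
    """Return the original values in [start, stop).

    Decoding begins at the closest reference at or before `start`, so at
    most chunk - 1 extra deltas are applied before reaching `start`.
    """
    if not 0 <= start <= stop <= len(deltas):
        raise IndexError("range out of bounds")
    if start == stop:
        return []
    ref_index = start // chunk
    base = ref_index * chunk      # position covered by the reference
    value = refs[ref_index]       # original value at `base`
    out = [value] if base == start else []
    for i in range(base + 1, stop):
        value += deltas[i]        # undo one delta step
        if i >= start:
            out.append(value)
    return out
```

For example, with `data = [10, 12, 15, 15, 20, 18, 25, 30]` and `chunk = 3`, `encode` stores the three references `[10, 15, 25]`, and `read_range(deltas, refs, 3, 4, 7)` starts from the reference at position 3 and returns `[20, 18, 25]` without touching positions 0 to 2. In this sketch the space overhead is one stored value per `chunk` positions, which is the kind of overhead the abstract describes as negligible.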
Pages: 231 - 240
Page count: 10
Related papers
27 in total
  • [1] Dynamic Virtual Chunks: On Supporting Efficient Accesses to Compressed Scientific Data
    Zhao, Dongfang
    Qiao, Kan
    Yin, Jian
    Raicu, Ioan
    [J]. IEEE TRANSACTIONS ON SERVICES COMPUTING, 2016, 9 (01) : 96 - 109
  • [2] Data Chunks Placement Optimization for Hybrid Storage Systems
    Yolchuyev, Agil
    Levendovszky, Janos
    [J]. FUTURE INTERNET, 2021, 13 (07)
  • [3] USES OF VIRTUAL STORAGE SYSTEMS IN A SCIENTIFIC ENVIRONMENT
    CALLAWAY, PH
    THOMPSON, CH
    CONSIDINE, JP
    [J]. IBM SYSTEMS JOURNAL, 1972, 11 (03) : 200 - +
  • [4] Enabling Scientific Data Storage and Processing on Big-data Systems
    Biookaghazadeh, Saman
    Xu, Yiqi
    Zhou, Shujia
    Zhao, Ming
    [J]. PROCEEDINGS 2015 IEEE INTERNATIONAL CONFERENCE ON BIG DATA, 2015 : 1978 - 1984
  • [5] Streaming Machine Learning for Supporting Data Prefetching in Modern Data Storage Systems
    Lucas Filho, Edson Ramiro
    Yang, Lun
    Fu, Kebo
    Herodotou, Herodotos
    [J]. PROCEEDINGS OF THE 1ST WORKSHOP ON AI FOR SYSTEMS, AI4SYS 2023, 2023 : 7 - 12
  • [6] Optimizing data regeneration and storage with data dependency for cloud scientific workflow systems
    Fan, Lei
    Zhou, Lin
    Wang, Meijuan
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2024, 238
  • [7] A data dependency based strategy for intermediate data storage in scientific cloud workflow systems
    Yuan, Dong
    Yang, Yun
    Liu, Xiao
    Zhang, Gaofeng
    Chen, Jinjun
    [J]. CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2012, 24 (09) : 956 - 976
  • [8] Conditional Anonymous Certificateless Public Auditing Scheme Supporting Data Dynamics for Cloud Storage Systems
    Zhang, Xiaojun
    Wang, Xin
    Gu, Dawu
    Xue, Jingting
    Tang, Wei
    [J]. IEEE TRANSACTIONS ON NETWORK AND SERVICE MANAGEMENT, 2022, 19 (04) : 5333 - 5347
  • [9] Multidimensional data organization and random access in large-scale DNA storage systems
    Song, Xin
    Shah, Shalin
    Reif, John
    [J]. THEORETICAL COMPUTER SCIENCE, 2021, 894 : 190 - 202
  • [10] Random Slicing: Efficient and Scalable Data Placement for Large-Scale Storage Systems
    Miranda, Alberto
    Effert, Sascha
    Kang, Yangwook
    Miller, Ethan L.
    Popov, Ivan
    Brinkmann, Andre
    Friedetzky, Tom
    Cortes, Toni
    [J]. ACM TRANSACTIONS ON STORAGE, 2014, 10 (03)