Virtual Chunks: On Supporting Random Accesses to Scientific Data in Compressible Storage Systems

Cited by: 0

Authors:

Zhao, Dongfang [1,2]

Yin, Jian [2]

Qiao, Kan [1,3]

Raicu, Ioan [1,4]

Affiliations:

[1] IIT, Chicago, IL 60616 USA

[2] Pacific Northwest Natl Lab, Richland, WA USA

[3] Google Inc, Mountain View, CA USA

[4] Argonne Natl Lab, Argonne, IL 60439 USA

Source:

2014 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA) | 2014

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Discipline classification code:

0812 ;

Abstract:

Data compression can ease the I/O pressure of scientific applications on high-performance computing systems. Unfortunately, the conventional approach of naively applying compression at the file or block level forces a trade-off between efficient random access and a high compression ratio. File-level compression can barely support efficient random access to the compressed data: any retrieval request must trigger decompression from the beginning of the compressed file. Block-level compression allows flexible random access to the compressed data, but applying the compressor to each and every block introduces extra overhead that degrades the overall compression ratio. This paper introduces a concept called virtual chunks, which aims to support efficient random access to compressed scientific data without sacrificing compression ratio. In essence, virtual chunks are logical blocks identified by appended references, without breaking the physical continuity of the file content. These additional references allow decompression to start from an arbitrary position (efficient random access) while retaining the file's physical entirety, achieving a high compression ratio on par with file-level compression. One potential concern with virtual chunks is the space overhead of the additional references, which degrades the compression ratio, but our analytic study and experimental results demonstrate that this overhead is negligible. We have implemented virtual chunks in two forms: as middleware for the GPFS parallel file system, and as a module in the FusionFS distributed file system. Large-scale evaluations on up to 1,024 cores showed that virtual chunks could help improve I/O throughput by 2X.
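The idea of appended references can be illustrated with a toy example. The sketch below is not the authors' implementation: it assumes a simple delta encoding of an integer sequence as a stand-in for the compressor (the abstract does not name one), and the function names `compress` and `read_at` and the parameter `chunk_size` are hypothetical. The whole sequence is encoded as one continuous stream, and one reference (the original value at the start of each logical chunk) is kept alongside it so that decoding can begin at the nearest reference rather than at the beginning of the stream.

```python
def compress(values, chunk_size):
    """Delta-encode the entire sequence as ONE stream, then build references.

    Assumption for illustration: delta encoding stands in for the real
    compressor. The stream is never split into physical blocks.
    """
    values = list(values)
    # Each delta is the difference from the previous value (0 before the first).
    deltas = [cur - prev for prev, cur in zip([0] + values, values)]
    # One reference per virtual chunk: the original value at the chunk start.
    references = values[::chunk_size]
    return deltas, references


def read_at(deltas, references, chunk_size, index):
    """Recover the value at `index` starting from the nearest preceding reference."""
    chunk = index // chunk_size
    start = chunk * chunk_size
    value = references[chunk]
    # Apply only the deltas between the chunk start and the requested index.
    for delta in deltas[start + 1 : index + 1]:
        value += delta
    return value


if __name__ == "__main__":
    data = [10, 12, 15, 15, 20, 26, 27, 30]
    deltas, refs = compress(data, chunk_size=4)
    print(read_at(deltas, refs, 4, 6))
```

In this sketch a random read applies at most `chunk_size - 1` deltas, whereas decoding without references would apply `index` of them. The cost is one extra stored value per virtual chunk, which is the space overhead the abstract refers to.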


Pages: 231-240

Page count: 10
