Optimization Method for Storing Massive Small Files in Multi-modal Medical Data

Cited by: 0

Authors:

Zeng M. [1,4]

Zou B.-J. [1,4]

Zhang W.-S. [2]

Yang X.-B. [2]

Zhu C.-Z. [3,4]

Affiliations:

[1] School of Computer Science and Engineering, Central South University, Changsha

[2] Institute of Automation, Chinese Academy of Sciences, Beijing

[3] School of Literature and Journalism, Central South University, Changsha

[4] Hunan Engineering Research Center of Machine Vision and Intelligent Medicine, Central South University, Changsha

Source:

Ruan Jian Xue Bao/Journal of Software | 2023, Vol. 34, No. 3

Keywords:

HBase; HDFS; multi-modal medical data; small files; storage performance optimization

DOI:

10.13328/j.cnki.jos.006710

Abstract:

The Hadoop distributed file system (HDFS) is designed to store and manage large files; storing and processing a large number of small files consumes a great deal of NameNode memory and access time. The small file problem has therefore become an important factor restricting HDFS performance. To address the massive small files found in multi-modal medical data, a small file storage method based on two-layer hash coding and HBase is proposed to optimize the storage of massive small files on HDFS. When small files are merged, an extendible hash function is used to build index file buckets, so that the index file can grow dynamically as needed and files can be appended. To read a file in O(1) time and make file lookup more efficient, the MWHC hash function is used to store the position of each file's index information within the index file, so only the index information of the corresponding bucket has to be read rather than that of all files. To meet the storage needs of multi-modal medical data, HBase stores the index information, and an identification column marks the modality of each item, which simplifies the storage and management of data of different modalities and improves file reading speed. To further optimize storage performance, an LRU-based metadata prefetching mechanism is established, and the LZ4 compression algorithm is used to compress the merged files. The experiments compare file access performance and NameNode memory usage. The results show that, compared with the original HDFS, HAR, MapFile, TypeStorage, and HPF small file storage methods, the proposed method achieves a shorter file access time and can improve the overall performance of HDFS when processing massive small files in multi-modal medical data. © 2023 Chinese Academy of Sciences. All rights reserved.
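
The merge-and-index idea in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation, and every name in it (`MergedStore`, `bucket_of`, the file names) is hypothetical. Several stand-ins are used so that it runs on the Python standard library alone: a fixed number of buckets replaces extendible hashing, a per-bucket `dict` replaces the MWHC minimal perfect hash, `zlib` replaces LZ4, a `bytearray` replaces the merged file on HDFS, and a tuple field replaces the HBase identification column.

```python
import hashlib
import zlib
from collections import OrderedDict


def bucket_of(name: str, num_buckets: int) -> int:
    """Map a file name to an index bucket using a stable hash."""
    digest = hashlib.md5(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_buckets


class MergedStore:
    """Toy merged-file store: one blob plus a bucketed index (assumed design)."""

    def __init__(self, num_buckets: int = 4, cache_size: int = 2):
        self.blob = bytearray()                      # stand-in for the merged file
        self.buckets = [dict() for _ in range(num_buckets)]
        self.cache = OrderedDict()                   # LRU cache of index entries
        self.cache_size = cache_size

    def append(self, name: str, data: bytes, modality: str) -> None:
        """Compress a small file, append it to the blob, and index it."""
        packed = zlib.compress(data)                 # zlib used here instead of LZ4
        offset = len(self.blob)
        self.blob += packed
        bucket = self.buckets[bucket_of(name, len(self.buckets))]
        bucket[name] = (offset, len(packed), modality)

    def _lookup(self, name: str):
        """Return the index entry, touching only one bucket on a cache miss."""
        if name in self.cache:
            self.cache.move_to_end(name)             # mark as most recently used
            return self.cache[name]
        entry = self.buckets[bucket_of(name, len(self.buckets))][name]
        self.cache[name] = entry
        if len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)           # evict least recently used
        return entry

    def read(self, name: str) -> bytes:
        offset, length, _ = self._lookup(name)
        return zlib.decompress(bytes(self.blob[offset:offset + length]))

    def modality(self, name: str) -> str:
        return self._lookup(name)[2]


store = MergedStore()
store.append("ct_001.dcm", b"ct-bytes" * 10, "CT")
store.append("ecg_001.xml", b"<ecg/>", "ECG")
print(store.read("ecg_001.xml"), store.modality("ct_001.dcm"))
```

A lookup hashes the name to one bucket and reads a single index entry, which is the property the abstract attributes to the bucketed index; the cache here only retains recently used entries and does not model prefetching.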

Citation:

Pages: 1451-1469

Page count: 18

