On Distributed Hash Table's Applicability to Internet-of-Things Big Data Management

Cited by: 0

Authors:

An Y.-Z. [1]

Zhu Y.-Q. [2,3]

Wang J.-M. [1,2,3]

Affiliations:

[1] School of Software, Tsinghua University, Beijing

[2] Beijing National Research Center of Information Science and Technology (Tsinghua University), Beijing

[3] National Engineering Laboratory of Big Data System Software, Beijing

Source:

Jisuanji Xuebao/Chinese Journal of Computers | 2021, Vol. 44, No. 08

Funding:

National Natural Science Foundation of China

Keywords:

Distributed hash table; Internet of Things data management; Load balance; Time series; Time series database;

DOI:

10.11897/SP.J.1016.2021.01679

Abstract:

Targeting the emerging application scenarios of the Internet of Things (IoT) and their corresponding challenges, this work presents a theoretical analysis of the load rebalancing conditions of distributed hash tables (DHT), focusing on the unprecedentedly high write workload and the network bandwidth between nodes. While DHT is the state-of-the-practice system structure for large-scale data management, its design has not taken into account the workload characteristics of IoT applications, the most typical of which is the unprecedented intensity of writes. With respect to write workloads and network bandwidth, this paper derives the applicability conditions of DHT under constraints on bandwidth, storage, and time. For DHT-based IoT data management systems with load balancing, the theoretical results imply the following: (1) the maximum write throughput that a scalable IoT data management system can support is determined by the number n of nodes to scale to and by the network bandwidth of the system nodes; (2) while increasing the number N of system nodes increases the total storage capacity of the system, it cannot increase the maximum write throughput that the system can support; (3) scale-out processes with a large number n of nodes can cause sudden and heavy decreases in the maximum write throughput at each system node, leading to disruptive workload redistribution; and (4) scaling out by a small number n of nodes is more economical and complies with the Pay-as-You-Go design consideration of the cloud, but still does not address the problem of scalable IoT data management. Experiments on the widely used DHT-based system Cassandra and extensive simulations based on the standard network simulator ns-3 validate the theoretical results.
Using real IoT data management use cases, it is demonstrated that the theoretical results of this work can account for the problems encountered when exploiting DHT-based systems for IoT data storage, and can guide the design of IoT data management systems. The results of this paper are applied to analyze the designs of the top 10 time series databases ranked by the DB-Engines website. Among the time series databases that have a distributed version, some have adopted the DHT architecture, e.g., KairosDB and OpenTSDB; thus, they are expected to encounter the problems described in the paper. The others have circumvented the problem by using other architectures that are not as highly scalable as DHT. A further study of Google's time series database Monarch and IBM's DB2 Event Store shows that they have abandoned the DHT architecture and chosen layered architectures to avoid the problems that arise under IoT data management workloads. According to the results of this paper, DHT with a load rebalancing design is applicable only to limited-scale IoT data management, not to large-scale IoT data management, especially when the write workload keeps increasing. Unfortunately, as the number of IoT devices keeps increasing, the write workload will inevitably increase. Therefore, a redesign of the DHT load balancing technique or a reconsideration of the data distribution architecture is necessary for IoT data management. The results of this paper can be applied to the design, implementation, and analysis of large-scale IoT data management systems. © 2021, Science Press. All rights reserved.
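
As an illustration of the load rebalancing that the abstract refers to, the following is a minimal sketch of a consistent-hash ring, the placement scheme commonly associated with DHT-based stores. It is not the paper's model and not Cassandra's partitioner: the `HashRing` class, the `moved_fraction` helper, the SHA-256-based ring position, and the choice of 64 virtual nodes are all assumptions made for this example. The sketch only shows that when nodes are added, a share of the existing keys must be transferred to the new nodes; in a write-heavy system that transfer uses the same network links as incoming writes.

```python
import bisect
import hashlib


def ring_position(label: str) -> int:
    """Map a string to a position on a 2**64 ring (illustrative hash choice)."""
    return int.from_bytes(hashlib.sha256(label.encode()).digest()[:8], "big")


class HashRing:
    """Toy consistent-hash ring with virtual nodes (hypothetical, for illustration)."""

    def __init__(self, nodes, vnodes=64):
        # Each physical node is placed on the ring at `vnodes` positions.
        self._points = sorted(
            (ring_position(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )
        self._positions = [pos for pos, _ in self._points]

    def owner(self, key: str) -> str:
        """Return the node owning `key`: the first ring point clockwise from it."""
        idx = bisect.bisect_right(self._positions, ring_position(key))
        return self._points[idx % len(self._points)][1]


def moved_fraction(old_nodes, new_nodes, keys):
    """Return (fraction of keys that change owner, whether all moved keys
    land on one of the newly added nodes)."""
    before = HashRing(old_nodes)
    after = HashRing(list(old_nodes) + list(new_nodes))
    moved = [k for k in keys if before.owner(k) != after.owner(k)]
    only_to_new = all(after.owner(k) in new_nodes for k in moved)
    return len(moved) / len(keys), only_to_new


if __name__ == "__main__":
    old = [f"node{i}" for i in range(8)]
    keys = [f"sensor-{i}" for i in range(5000)]
    for added in (1, 2, 8):
        new = [f"new{i}" for i in range(added)]
        frac, only_new = moved_fraction(old, new, keys)
        print(f"added={added} moved={frac:.2%} only_to_new={only_new}")
```

In this toy setup the moved share grows with the number of nodes added at once, roughly in proportion to the new nodes' share of the enlarged ring. This is consistent in spirit with the abstract's observation that scaling out by many nodes at once is more disruptive than scaling out by a few, although the paper's actual conditions on bandwidth, storage, and time are not reproduced here.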

Citation

Pages: 1679-1695

Page count: 16
