A Model Checking Method for Verifying the Fault Tolerance of Distributed Protocol Liveness Properties

被引：0

作者：

Lu C.-Y. ^{[1
]}

Nie C.-H. ^{[1
]}

Cheung S.C. ^{[2
]}

机构：

[1] State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing

[2] The Hong Kong University of Science and Technology, Hong Kong

来源：

Jisuanji Xuebao/Chinese Journal of Computers | 2021年 / 44卷 / 08期

关键词：

Distributed system; Fault injection; Fault-tolerance mechanisms; Liveness properties; Model checking; Peer reduction policy;

D O I：

10.11897/SP.J.1016.2021.01714

中图分类号：

学科分类号：

摘要：

Cloud computing is a new mode that provides users with computing resources charged on demand through a communication network as a service. At present, enterprises are gradually transferring business deployment and data processing to cloud computing platforms. Because of various requirements such as scalability and performance, cloud platforms are deployed on distributed systems. Since the distributed system uses a large number of commodity machines to build through a complex structure, technicians cannot guarantee that these machines will work correctly all the time. Therefore, component failures in the distributed system are unavoidable. In order to improve the reliability of the distributed system, the technicians designed fault-tolerant mechanisms for them when developing the distributed system. In order to ensure that the fault-tolerant mechanisms can really work correctly when there is a failure in the distributed system, fault injection is one of the most effective ways to test the fault-tolerant mechanism, observing the behavior of the system and verifying whether the fault-tolerance mechanism is working correctly, by artificially injecting specific faults into the system under test. Due to the concurrent nature of distributed systems, it is very difficult for traditional software testing methods to fully test such systems. In recent years, technicians have increasingly used more systematic model checking techniques to verify distributed systems. However, existing model checking technology focuses on the checking of the safety properties and liveness properties of distributed systems, and ignores the checking of fault-tolerant mechanisms, especially liveness properties fault-tolerant mechanisms. As a result, how to verify the liveness properties fault-tolerant mechanism of the distributed system is currently a challenge. The use of abstract model checking method will introduce the problem of mismatch between the model and the actual system because of human error. At the same time, the use of implementation-level model checking method will aggravate the state space explosion problem. As a result, in this paper we propose an implementation-level model checking tool LTMC (Liveness Properties Fault Tolerance Model Checker), which combines fault injection technology to verify the liveness properties with their fault tolerance mechanisms and safety properties of the distributed protocols in the distributed systems. At the same time, based on the role of distributed system nodes, this paper proposes a PRP strategy (Peer Reduction Policy) to reduce the state space that LTMC needs to search, alleviating the problem of space explosion. LTMC can purposefully inject specific faults at specific moments when the system to be verified is running, instead of relying on random fault injection strategies; when the system to be verified changes, only slight modifications to the model checker are required; LTMC can systematically discover all bugs of the specified types in the distributed protocols. In addition, LTMC prioritizes the exploration of more practical execution paths of events by introducing a logic clock mechanism. At the end of this article, we apply LTMC to several protocols of ZooKeeper and Cassandra which are the most popular distributed systems on the market, and compare LTMC with depth-first exploration. LTMC has a state space reduction rate of 3.7-594.4 times. © 2021, Science Press. All right reserved.

引用

下载

页码：1714 / 1731

页数：17

共 39 条

[1] Fedoruk A, Deters R., Improving fault-tolerance by replicating agents, Proceedings of the 1st International Joint Conference on Autonomous Agents & Multiagent Systems, pp. 737-744, (2002)
[2] Yang J, Chen T, Wu M, Et al., MODIST: Transparent model checking of unmodified distributed systems, Proceedings of the 6th USENIX Symposium on Networked Systems Design and Implementation, pp. 213-228, (2009)
[3] Guo H, Wu M, Zhou L, Et al., Practical software model checking via dynamic interface reduction, Proceedings of the 23rd ACM Symposium on Operating Systems Principles, pp. 265-278, (2011)
[4] Hafeez A I, Faiq K, Osman H, Et al., McSeVIC: A model checking based framework for security vulnerability analysis of integrated circuits, IEEE Access, 6, pp. 32240-32257, (2018)
[5] Bai G, Ye Q, Wu Y, Et al., Towards model checking android applications, IEEE Transactions on Software Engineering, 44, 6, pp. 595-612, (2018)
[6] Tian T, Zhang Y, Zhou Q, Et al., ModelX: Using model checking to find design errors of cloud applications, Proceedings of the 2016 IEEE International Conference on Computer and Information Technology, pp. 607-610, (2016)
[7] Klai K, Ochi H., Model checking of composite cloud services, Proceedings of the IEEE International Conference on Web Services, pp. 356-363, (2016)
[8] Clarke E M, Henzinger T A, Veith H, Bloem R., Handbook of Model Checking, (2018)
[9] Killian C, Anderson J W, Jhala R, Et al., Life, death, and the critical transition: Finding liveness bugs in systems code, Proceedings of the 4th Symposium on Networked Systems Design and Implementation, pp. 18-32, (2007)
[10] Ezekiel J, Lomuscio A., Combining fault injection and model checking to verify fault tolerance in multi-agent systems, Proceedings of the 8th International Joint Conference on Autonomous Agents and Multiagent Systems, pp. 113-120, (2009)

← 1 2 3 4 →