A Dual Deep Network Based Secure Deep Reinforcement Learning Method

Authors
Zhu F. [1,2,3,4]
Wu W. [1]
Fu Y.-C. [1,5]
Liu Q. [1,2,3]
Affiliations
[1] School of Computer Science and Technology, Soochow University, Suzhou, 215006, Jiangsu
[2] Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing
[3] Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun
[4] Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, 215006, Jiangsu
[5] School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, 215500, Jiangsu
Funding
National Natural Science Foundation of China
Keywords
Deep Q-network; Deep reinforcement learning; Experience replay; Reinforcement learning; Safe artificial intelligence; Safe reinforcement learning
DOI
10.11897/SP.J.1016.2019.01812
Abstract
Reinforcement learning is a widely studied class of machine learning methods in which an agent continuously interacts with its environment with the goal of maximizing long-term return. Reinforcement learning is particularly prominent in areas such as control and optimal scheduling. Deep reinforcement learning, which takes large-scale high-dimensional data such as video and images as raw input, uses deep learning methods to extract abstract representations from them, and then applies reinforcement learning methods to obtain optimal strategies, has recently become a research hotspot in artificial intelligence, and a large body of work on it has emerged. For example, the deep Q-network (DQN), one of the most famous deep reinforcement learning models, is based on convolutional neural networks (CNNs) and the Q-learning algorithm and uses unprocessed images directly as input; it has been applied to learn strategies in complex environments with high-dimensional input. However, few deep reinforcement learning algorithms consider how to ensure security while learning in an unknown environment. Moreover, many reinforcement learning algorithms deliberately add random exploration schemes, such as ε-greedy, to guarantee the diversity of sampled data so that the algorithm can reach a better approximately optimal solution. Exploration without any security constraint, however, is very dangerous and carries a high risk of disastrous results. To address this problem, an algorithm named dual deep network based secure deep reinforcement learning (DDN-SDRL) is proposed. The DDN-SDRL algorithm maintains two experience pools: the first stores dangerous samples, i.e., the critical states and dangerous states that caused failures; the second stores secure samples, excluding critical and dangerous states. The algorithm trains an additional deep Q-network on the dangerous samples and reconstructs a new objective function by introducing a penalty component, so that the new objective is computed from the penalty component together with the original network's objective function. The penalty component, trained by a deep Q-network on samples from the critical-state experience pool, represents the critical states that precede failure. Because DDN-SDRL fully exploits the information in critical, dangerous, and secure states, the agent improves security by avoiding most dangerous states during training. DDN-SDRL is a general mechanism for enhancing security during learning and can be combined with a variety of deep network models, such as DQN, the dueling deep Q-network (DuDQN), and the deep recurrent Q-network (DRQN). In the simulated experiments, DQN, DuDQN, and DRQN were each used as the original deep network, with DDN-SDRL applied to ensure security. Results on six Atari 2600 games (CrazyClimber, Kangaroo, KungFuMaster, Pooyan, RoadRunner, and Zaxxon) indicate that the proposed DDN-SDRL algorithm makes control safer, more stable, and more effective.
It can be concluded that DDN-SDRL is best suited to environments in which: (1) there are many representable dangerous states that lead to failure; (2) dangerous states are clearly distinguishable from secure states; and (3) the action space is not too large, so the agent can improve through self-training. In such cases, DDN-SDRL yields the largest improvement over the original deep network. © 2019, Science Press. All rights reserved.
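To make the dual-network mechanism concrete, below is a minimal PyTorch sketch of the training scheme the abstract describes. It is a reconstruction under stated assumptions: the network sizes, the penalty weight LAMBDA, and the exact form in which the penalty component enters the objective are illustrative guesses, not the paper's specification.

# Minimal sketch of the DDN-SDRL idea (assumed details; see note above).
import random
from collections import deque

import torch
import torch.nn as nn
import torch.nn.functional as F


class QNet(nn.Module):
    # Small MLP stand-in for the convolutional Q-networks (DQN/DuDQN/DRQN).
    def __init__(self, state_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, n_actions)
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)


# The two experience pools from the abstract: dangerous samples (critical and
# dangerous states that caused failure) versus secure samples (all others).
danger_pool: deque = deque(maxlen=100_000)
secure_pool: deque = deque(maxlen=100_000)

q_net = QNet(state_dim=4, n_actions=2)       # main network
danger_net = QNet(state_dim=4, n_actions=2)  # additional network for dangerous samples
q_opt = torch.optim.Adam(q_net.parameters(), lr=1e-4)
d_opt = torch.optim.Adam(danger_net.parameters(), lr=1e-4)
GAMMA, LAMBDA = 0.99, 0.1  # LAMBDA is an assumed weight for the penalty component


def sample(pool, n):
    # Stack a random minibatch of (s, a, r, s2, done) transitions.
    s, a, r, s2, done = zip(*random.sample(pool, n))
    return (torch.stack(s), torch.tensor(a), torch.tensor(r, dtype=torch.float32),
            torch.stack(s2), torch.tensor(done, dtype=torch.float32))


def train_danger_net(n=32):
    # Ordinary TD update, but only on danger-pool transitions, so danger_net's
    # value estimates come to characterize states that precede failure.
    s, a, r, s2, done = sample(danger_pool, n)
    q_sa = danger_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        target = r + GAMMA * (1.0 - done) * danger_net(s2).max(dim=1).values
    loss = F.mse_loss(q_sa, target)
    d_opt.zero_grad(); loss.backward(); d_opt.step()


def train_main_net(n=32):
    # TD update on secure samples with a reconstructed objective: the usual
    # target adjusted by a penalty component supplied by danger_net.
    s, a, r, s2, done = sample(secure_pool, n)
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        q_next = q_net(s2).max(dim=1).values
        penalty = danger_net(s2).max(dim=1).values  # proximity to critical states
        target = r + GAMMA * (1.0 - done) * (q_next - LAMBDA * penalty)
    loss = F.mse_loss(q_sa, target)
    q_opt.zero_grad(); loss.backward(); q_opt.step()

In this sketch, the danger network is trained only on failure-related transitions, so its value estimates flag proximity to critical states; folding them into the TD target steers the main network away from such states, matching the abstract's claim that the agent avoids most dangerous states during training.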
Pages: 1812-1826 (14 pages)