Exploring the Use of Invalid Action Masking in Reinforcement Learning: A Comparative Study of On-Policy and Off-Policy Algorithms in Real-Time Strategy Games

Cited by: 0
Authors
Hou, Yueqi [1 ,2 ]
Liang, Xiaolong [1 ,2 ]
Zhang, Jiaqiang [1 ,2 ]
Yang, Qisong [3 ]
Yang, Aiwu [1 ,2 ]
Wang, Ning [1 ,2 ]
Affiliations
[1] Air Force Engn Univ, Air Traff Control & Nav Sch, Xian 710051, Peoples R China
[2] Air Force Engn Univ, Shaanxi Key Lab Meta Synth Elect & Informat Syst, Xian 710051, Peoples R China
[3] Xian Res Inst High Technol, Xian 710051, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2023, Vol. 13, Issue 14
Funding
National Natural Science Foundation of China;
Keywords
invalid action masking; reinforcement learning; policy gradient; proximal policy optimization; real-time strategy game;
DOI
10.3390/app13148283
CLC Number
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Invalid action masking is a practical technique in deep reinforcement learning that prevents agents from taking invalid actions. Existing approaches rely on action masking during both policy training and policy execution. This study focuses on developing reinforcement learning algorithms that incorporate action masking during training but can be deployed without action masking during policy execution. The study begins with a theoretical analysis that elucidates the distinction between the naive policy gradient and the invalid action policy gradient. Based on this analysis, we demonstrate that the naive policy gradient is a valid gradient and is equivalent to the proposed composite objective algorithm, which optimizes the masked policy and the original policy in parallel. Moreover, we propose an off-policy algorithm for invalid action masking that employs the masked policy for sampling while optimizing the original policy. To compare the effectiveness of these algorithms, experiments are conducted in Gym-μRTS, a simplified real-time strategy (RTS) game simulator. Based on the empirical findings, we recommend the off-policy algorithm for most tasks and the composite objective algorithm for more complex tasks.
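The paper's composite objective and off-policy variants are not detailed in this record, but a minimal sketch of the underlying masking mechanism, assuming a discrete action space and a logit-level mask (the function name masked_softmax and the -1e8 constant are illustrative choices, not taken from the paper), might look like the following:

```python
import numpy as np

def masked_softmax(logits, action_mask):
    """Apply an invalid action mask to raw policy logits.

    logits:      unnormalized action scores, shape (n_actions,)
    action_mask: boolean array, True where the action is valid

    Invalid actions are assigned a logit of -1e8, so their probability
    after the softmax is effectively zero.
    """
    masked_logits = np.where(action_mask, logits, -1e8)
    exp = np.exp(masked_logits - masked_logits.max())
    return exp / exp.sum()

if __name__ == "__main__":
    logits = np.array([1.2, -0.3, 0.8, 2.1])
    mask = np.array([True, False, True, False])  # actions 1 and 3 are invalid
    probs = masked_softmax(logits, mask)
    print(probs)        # probability mass only on the valid actions
    print(probs.sum())  # ~1.0
```

In the terms used in the abstract, the masked distribution produced above would correspond to the "masked policy", while an unmasked softmax over the same raw logits would correspond to the "original policy" that the proposed algorithms aim to optimize.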
Pages: 16