Mutual-Information Regularization in Markov Decision Processes and Actor-Critic Learning

Cited by: 0
Authors
Leibfried, Felix [1 ]
Grau-Moya, Jordi [1 ]
Affiliations
[1] PROWLER.io, Cambridge, England
Source
Keywords
Mutual-Information Regularization; MDP; Actor-Critic Learning;
DOI
Not available
CLC number
TP39 [Computer Applications]
Subject classification codes
081203; 0835
Abstract
Cumulative entropy regularization introduces a regularizing signal into the reinforcement learning (RL) problem that encourages policies with high-entropy actions, which is equivalent to enforcing small deviations from a uniform reference marginal policy. This has been shown to improve exploration and robustness, and it tackles the value overestimation problem. It also leads to a significant performance increase in tabular and high-dimensional settings, as demonstrated via algorithms such as soft Q-learning (SQL) and soft actor-critic (SAC). Cumulative entropy regularization has been extended to optimize over the reference marginal policy instead of keeping it fixed, yielding a regularization that minimizes the mutual information between states and actions. While this was initially proposed for Markov Decision Processes (MDPs) in tabular settings, it was recently shown that a similar principle leads to significant improvements over vanilla SQL in RL for high-dimensional domains with discrete actions and function approximators. Here, we follow the motivation of mutual-information regularization from an inference perspective and theoretically analyze the corresponding Bellman operator. Inspired by this Bellman operator, we devise a novel mutual-information regularized actor-critic learning (MIRACLE) algorithm for continuous action spaces that optimizes over the reference marginal policy. We empirically validate MIRACLE in the MuJoCo robotics simulator, where we demonstrate that it can compete with contemporary RL methods. Most notably, it can improve over the state-of-the-art model-free SAC algorithm, which implicitly assumes a fixed reference policy.
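As a reading aid, the following is a minimal sketch of the regularized objective and the associated soft Bellman operator that the abstract refers to. The notation (policy pi, state-independent reference marginal rho over actions, inverse temperature beta, discount gamma, state distribution d^pi) is an assumption of this sketch rather than a quotation from the paper, which should be consulted for the exact operator analyzed.

% Sketch under the assumptions above, not verbatim from the paper.
\begin{align}
  J(\pi,\rho) &= \mathbb{E}_{\pi}\Big[\sum_{t=0}^{\infty}\gamma^{t}\Big(r(s_t,a_t)
      - \tfrac{1}{\beta}\log\tfrac{\pi(a_t\mid s_t)}{\rho(a_t)}\Big)\Big], \\
  % Optimizing over the reference rho recovers the action marginal of pi, and the
  % expected KL penalty then equals the mutual information between states and actions:
  \rho^{\star}(a) &= \sum_{s} d^{\pi}(s)\,\pi(a\mid s), \qquad
  \mathbb{E}_{s\sim d^{\pi}}\big[\mathrm{KL}\big(\pi(\cdot\mid s)\,\big\|\,\rho^{\star}\big)\big] = I(S;A), \\
  % Soft Bellman operator for a fixed reference rho (written for discrete actions;
  % for continuous actions the sum over a' becomes an integral):
  (\mathcal{T}_{\rho}Q)(s,a) &= r(s,a) + \gamma\,\mathbb{E}_{s'\sim p(\cdot\mid s,a)}
      \Big[\tfrac{1}{\beta}\log\sum_{a'}\rho(a')\,e^{\beta Q(s',a')}\Big].
\end{align}

With a fixed uniform rho this reduces, up to an additive constant, to the entropy-regularized setting underlying SQL and SAC; letting rho be optimized is what turns the penalty into a mutual-information term.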
Pages: 14
Related papers (50 in total)
  • [1] Borkar, V. S. An actor-critic algorithm for constrained Markov decision processes. Systems & Control Letters, 2005, 54(3): 207-213.
  • [2] Bhatnagar, S.; Panigrahi, J. R. Actor-critic algorithms for hierarchical Markov decision processes. Automatica, 2006, 42(4): 637-644.
  • [3] Elhanany, I.; Niedzwiedz, C.; Liu, Z.; Livingston, S. Consolidated actor-critic model for partially-observable Markov decision processes. Electronics Letters, 2008, 44(22): 1317-U41.
  • [4] Bhatnagar, Shalabh; Lakshmanan, K. An online actor-critic algorithm with function approximation for constrained Markov decision processes. Journal of Optimization Theory and Applications, 2012, 153(3): 688-708.
  • [5] Bhatnagar, Shalabh. An actor-critic algorithm with function approximation for discounted cost constrained Markov decision processes. Systems & Control Letters, 2010, 59(12): 760-766.
  • [6] Konda, V. R.; Borkar, V. S. Actor-critic-type learning algorithms for Markov decision processes. SIAM Journal on Control and Optimization, 1999, 38(1): 94-123.
  • [7] Bhatnagar, S.; Kumar, S. A simultaneous perturbation stochastic approximation-based actor-critic algorithm for Markov decision processes. IEEE Transactions on Automatic Control, 2004, 49(4): 592-598.
  • [8] Cheng, Yu-Hu; Huang, Long-Yang; Hou, Di-Yuan; Zhang, Jia-Zhi; Chen, Jun-Long; Wang, Xue-Song. Generalized offline actor-critic with behavior regularization. Chinese Journal of Computers (Jisuanji Xuebao), 2023, 46(4): 843-855.
  • [9] Li, Luntong; Li, Dazi; Song, Tianheng; Xu, Xin. Actor-critic learning control with regularization and feature selection in policy gradient estimation. IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(3): 1217-1227.
  • [10] Peters, James F. Granular computing in actor-critic learning. 2007 IEEE Symposium on Foundations of Computational Intelligence, Vols 1 and 2, 2007: 59-64.