Off-policy and on-policy reinforcement learning with the Tsetlin machine

Cited by: 0
Authors
Saeed Rahimi Gorji
Ole-Christoffer Granmo
Affiliation
[1] University of Agder, Centre for Artificial Intelligence Research
Source
Applied Intelligence | 2023 / Vol. 53
Keywords
Tsetlin machine; Explainable machine learning; Learning automata; Reinforcement learning; Temporal difference learning; SARSA
DOI
Not available
Abstract
The Tsetlin Machine is a recent supervised learning algorithm that has obtained competitive results in both accuracy and resource usage across several benchmarks. It has been used for convolution, classification, and regression, producing interpretable rules in propositional logic. In this paper, we introduce the first framework for reinforcement learning based on the Tsetlin Machine. Our framework integrates the value iteration algorithm with the regression Tsetlin Machine as the value function approximator. To obtain accurate off-policy state-value estimation, we propose a modified Tsetlin Machine feedback mechanism that adapts to the dynamic nature of value iteration. In particular, we show that the Tsetlin Machine is able to unlearn and recover from the misleading experiences that often occur at the beginning of training. A key challenge that we address is mapping the intrinsically continuous nature of state-value learning to the propositional Tsetlin Machine architecture, leveraging probabilistic updates. While accurate off-policy, this mechanism learns significantly slower than neural networks on-policy. However, by introducing multi-step temporal-difference learning in combination with high-frequency propositional logic patterns, we are able to close the performance gap. Results on several gridworld instances show that our framework can outperform comparable neural network models, despite being based on simple one-level AND-rules in propositional logic. Finally, we propose how the class of models learnt by our Tsetlin Machine for the gridworld problem can be translated into a more understandable graph structure. The graph structure captures the state-value function approximation and the corresponding policy found by the Tsetlin Machine.
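The multi-step temporal-difference idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' framework: it uses a plain lookup table rather than a regression Tsetlin Machine as the value function, and the corridor environment, the always-move-right policy, the reward of -1 per step, and the function name `n_step_td` are all assumptions chosen for illustration.

```python
def n_step_td(num_states=6, n=3, alpha=0.5, gamma=1.0, episodes=200):
    """Tabular n-step TD prediction on a hypothetical 1-D corridor.

    States are 0..num_states-1 and the rightmost state is terminal.
    The fixed policy always moves right; every step yields reward -1.
    """
    values = [0.0] * num_states
    terminal = num_states - 1
    for _ in range(episodes):
        states = [0]
        rewards = [0.0]  # rewards[t] is the reward received on entering states[t]
        T = float("inf")  # episode length, unknown until the terminal state is reached
        t = 0
        while True:
            if t < T:
                next_state = states[t] + 1  # the policy always moves right
                states.append(next_state)
                rewards.append(-1.0)
                if next_state == terminal:
                    T = t + 1
            tau = t - n + 1  # time step whose value estimate is updated now
            if tau >= 0:
                end = min(tau + n, T)
                # Discounted sum of up to n observed rewards.
                G = sum(gamma ** (i - tau - 1) * rewards[i]
                        for i in range(tau + 1, end + 1))
                # Bootstrap from the current estimate if the episode has not ended.
                if tau + n < T:
                    G += gamma ** n * values[states[tau + n]]
                values[states[tau]] += alpha * (G - values[states[tau]])
            if tau == T - 1:
                break
            t += 1
    return values


if __name__ == "__main__":
    print(n_step_td())
```

With `gamma=1.0` the estimate for each state approaches the negative of its distance to the terminal state. Setting `n=1` gives ordinary one-step TD; larger `n` propagates reward information over more states per update.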
Pages: 8596 - 8613
Page count: 17
Related papers
50 in total
  • [1] Off-policy and on-policy reinforcement learning with the Tsetlin machine
    Gorji, Saeed Rahimi
    Granmo, Ole-Christoffer
    [J]. APPLIED INTELLIGENCE, 2023, 53 (08) : 8596 - 8613
  • [2] Interpolated Policy Gradient: Merging On-Policy and Off-Policy Gradient Estimation for Deep Reinforcement Learning
    Gu, Shixiang
    Lillicrap, Timothy
    Ghahramani, Zoubin
    Turner, Richard E.
    Scholkopf, Bernhard
    Levine, Sergey
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [3] A multi-step on-policy deep reinforcement learning method assisted by off-policy policy evaluation
    Zhang, Huaqing
    Ma, Hongbin
    Mersha, Bemnet Wondimagegnehu
    Jin, Ying
    [J]. APPLIED INTELLIGENCE, 2024, 54 (21) : 11144 - 11159
  • [4] Multi-player H∞ Differential Game using On-Policy and Off-Policy Reinforcement Learning
    An, Peiliang
    Liu, Mushuang
    Wan, Yan
    Lewis, Frank L.
    [J]. 2020 IEEE 16TH INTERNATIONAL CONFERENCE ON CONTROL & AUTOMATION (ICCA), 2020, : 1137 - 1142
  • [5] Adaptive Optimal Control for Stochastic Multiplayer Differential Games Using On-Policy and Off-Policy Reinforcement Learning
    Liu, Mushuang
    Wan, Yan
    Lewis, Frank L.
    Lopez, Victor G.
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2020, 31 (12) : 5522 - 5533
  • [6] On-Policy vs. Off-Policy Deep Reinforcement Learning for Resource Allocation in Open Radio Access Network
    Hammami, Nessrine
    Kim Khoa Nguyen
    [J]. 2022 IEEE WIRELESS COMMUNICATIONS AND NETWORKING CONFERENCE (WCNC), 2022, : 1461 - 1466
  • [7] Batch Reinforcement Learning With a Nonparametric Off-Policy Policy Gradient
    Tosatto, Samuele
    Carvalho, Joao
    Peters, Jan
    [J]. IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2022, 44 (10) : 5996 - 6010
  • [8] Variance Penalized On-Policy and Off-Policy Actor-Critic
    Jain, Arushi
    Patil, Gandharv
    Jain, Ayush
    Khetarpal, Khimya
    Precup, Doina
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 7899 - 7907
  • [9] Safe and efficient off-policy reinforcement learning
    Munos, Remi
    Stepleton, Thomas
    Harutyunyan, Anna
    Bellemare, Marc G.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 29 (NIPS 2016), 2016, 29
  • [10] Bounds for Off-policy Prediction in Reinforcement Learning
    Joseph, Ajin George
    Bhatnagar, Shalabh
    [J]. 2017 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2017, : 3991 - 3997