Multi-Feedback Bandit Learning with Probabilistic Contexts

Cited by: 0
Authors
Yang, Luting [1 ]
Yang, Jianyi [1 ]
Ren, Shaolei [1 ]
Affiliations
[1] Univ Calif Riverside, Riverside, CA 92521 USA
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contextual bandits are a classic multi-armed bandit setting in which side information (i.e., a context) is available before arm selection. A standard assumption is that the exact context is perfectly known prior to arm selection and that only a single feedback signal is returned. In this work, we focus on multi-feedback bandit learning with probabilistic contexts, where a bundle of contexts is revealed to the agent, along with their corresponding probabilities, at the beginning of each round. This models scenarios in which contexts are drawn from the probability output of a neural network and the reward function is jointly determined by multiple feedback signals. We propose a kernelized learning algorithm based on the upper confidence bound that chooses, for each context bundle, the optimal arm in a reproducing kernel Hilbert space. Moreover, we theoretically establish an upper bound on the cumulative regret with respect to an oracle that knows the optimal arm given the probabilistic contexts, and show that this bound grows sublinearly in time. Our simulation on machine learning model recommendation further validates the sublinearity of our cumulative regret and demonstrates that our algorithm outperforms an approach that selects arms based only on the most probable context.
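The abstract does not include the algorithm itself. As a rough intuition for the setup it describes, the following is a minimal illustrative sketch, not the authors' method: a kernel-ridge UCB agent that, given a bundle of contexts with probabilities, scores each arm by its probability-weighted upper confidence bound. All names (`BundleKernelUCB`, `rbf`), the feature construction, and the hyperparameters here are assumptions made for illustration only.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Pairwise RBF kernel between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class BundleKernelUCB:
    """Illustrative sketch (not the paper's algorithm): kernelized UCB
    scored in expectation over a probabilistic context bundle."""

    def __init__(self, n_arms, lam=1.0, beta=1.0, gamma=1.0):
        self.n_arms, self.lam, self.beta, self.gamma = n_arms, lam, beta, gamma
        self.X, self.y = [], []  # history of (context, arm) features and rewards

    def _feat(self, ctx, arm):
        # Joint feature: context concatenated with a one-hot arm indicator.
        onehot = np.zeros(self.n_arms)
        onehot[arm] = 1.0
        return np.concatenate([np.asarray(ctx, dtype=float), onehot])

    def _ucb(self, x):
        # Kernel ridge regression mean plus an exploration bonus.
        if not self.X:
            return np.inf  # unexplored: force optimism
        X = np.vstack(self.X)
        y = np.array(self.y)
        K = rbf(X, X, self.gamma) + self.lam * np.eye(len(X))
        k = rbf(x[None, :], X, self.gamma).ravel()
        mean = k @ np.linalg.solve(K, y)
        var = 1.0 - k @ np.linalg.solve(K, k)  # k(x, x) = 1 for the RBF kernel
        return mean + self.beta * np.sqrt(max(var, 0.0) / self.lam)

    def select(self, contexts, probs):
        # Score each arm by its probability-weighted UCB over the bundle.
        scores = [sum(p * self._ucb(self._feat(c, a))
                      for c, p in zip(contexts, probs))
                  for a in range(self.n_arms)]
        return int(np.argmax(scores))

    def update(self, ctx, arm, reward):
        self.X.append(self._feat(ctx, arm))
        self.y.append(reward)
```

A usage round would call `select` with the revealed bundle, play the returned arm, then `update` with the observed reward; with multiple feedback signals, the scalar `reward` would be whatever aggregate the application's reward function produces.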
Pages: 3087 - 3093
Page count: 7
Related Papers
50 items in total
  • [1] Curriculum Disentangled Recommendation with Noisy Multi-feedback
    Chen, Hong
    Chen, Yudong
    Wang, Xin
    Xie, Ruobing
    Wang, Rui
    Xia, Feng
    Zhu, Wenwu
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [2] Bandit Learning with Implicit Feedback
    Qi, Yi
    Wu, Qingyun
    Wang, Hongning
    Tang, Jie
    Sun, Maosong
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [3] Information thermodynamics for a multi-feedback process with time delay
    Kwon, Chulan
    Um, Jaegon
    Park, Hyunggyu
    [J]. EPL, 2017, 117 (01)
  • [4] Bandit Learning with Biased Human Feedback
    Tang, Wei
    Ho, Chien-Ju
    [J]. AAMAS '19: PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2019, : 1324 - 1332
  • [5] Learning with Bandit Feedback in Potential Games
    Cohen, Johanne
    Heliou, Amelie
    Mertikopoulos, Panayotis
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [6] Learning in Congestion Games with Bandit Feedback
    Cui, Qiwen
    Xiong, Zhihan
    Fazel, Maryam
    Du, Simon S.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [7] The Blinded Bandit: Learning with Adaptive Feedback
    Dekel, Ofer
    Hazan, Elad
    Koren, Tomer
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, 27
  • [8] Learning from eXtreme Bandit Feedback
    Lopez, Romain
    Dhillon, Inderjit S.
    Jordan, Michael I.
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 8732 - 8740
  • [9] Multi-feedback Pairwise Ranking via Adversarial Training for Recommender
    Wang, Jianfang
    Fu, Zhiyuan
    Niu, Mingxin
    Zhang, Pengbo
    Zhang, Qiuling
    [J]. CHINESE JOURNAL OF ELECTRONICS, 2020, 29 (04) : 615 - 622