Multi-Feedback Bandit Learning with Probabilistic Contexts

Cited by: 0
Authors
Yang, Luting [1 ]
Yang, Jianyi [1 ]
Ren, Shaolei [1 ]
Affiliations
[1] Univ Calif Riverside, Riverside, CA 92521 USA
Keywords
DOI
Not available
CLC Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Contextual bandits are a classic multi-armed bandit setting in which side information (i.e., a context) is available before arm selection. A standard assumption is that the exact context is perfectly known prior to arm selection and that only a single feedback signal is returned. In this work, we focus on multi-feedback bandit learning with probabilistic contexts, where a bundle of contexts is revealed to the agent, along with their corresponding probabilities, at the beginning of each round. This models scenarios in which contexts are drawn from the probability output of a neural network and the reward function is jointly determined by multiple feedback signals. We propose a kernelized learning algorithm based on the upper confidence bound that chooses, for each context bundle, the optimal arm in a reproducing kernel Hilbert space. Moreover, we theoretically establish an upper bound on the cumulative regret with respect to an oracle that knows the optimal arm given the probabilistic contexts, and show that this bound grows sublinearly in time. Our simulation on machine learning model recommendation further validates the sublinearity of our cumulative regret and demonstrates that our algorithm outperforms an approach that selects arms based only on the most probable context.
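The abstract does not include the algorithm itself. As a rough intuition for the setup it describes, the following is a minimal illustrative sketch, not the authors' method: a kernel-ridge UCB agent that, given a bundle of contexts with probabilities, scores each arm by its probability-weighted upper confidence bound. All names (`BundleKernelUCB`, `rbf`), the feature construction, and the hyperparameters here are assumptions made for illustration only.

```python
import numpy as np

def rbf(X, Y, gamma=1.0):
    # Pairwise RBF kernel between rows of X and rows of Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

class BundleKernelUCB:
    """Illustrative sketch (not the paper's algorithm): kernelized UCB
    scored in expectation over a probabilistic context bundle."""

    def __init__(self, n_arms, lam=1.0, beta=1.0, gamma=1.0):
        self.n_arms, self.lam, self.beta, self.gamma = n_arms, lam, beta, gamma
        self.X, self.y = [], []  # history of (context, arm) features and rewards

    def _feat(self, ctx, arm):
        # Joint feature: context concatenated with a one-hot arm indicator.
        onehot = np.zeros(self.n_arms)
        onehot[arm] = 1.0
        return np.concatenate([np.asarray(ctx, dtype=float), onehot])

    def _ucb(self, x):
        # Kernel ridge regression mean plus an exploration bonus.
        if not self.X:
            return np.inf  # unexplored: force optimism
        X = np.vstack(self.X)
        y = np.array(self.y)
        K = rbf(X, X, self.gamma) + self.lam * np.eye(len(X))
        k = rbf(x[None, :], X, self.gamma).ravel()
        mean = k @ np.linalg.solve(K, y)
        var = 1.0 - k @ np.linalg.solve(K, k)  # k(x, x) = 1 for the RBF kernel
        return mean + self.beta * np.sqrt(max(var, 0.0) / self.lam)

    def select(self, contexts, probs):
        # Score each arm by its probability-weighted UCB over the bundle.
        scores = [sum(p * self._ucb(self._feat(c, a))
                      for c, p in zip(contexts, probs))
                  for a in range(self.n_arms)]
        return int(np.argmax(scores))

    def update(self, ctx, arm, reward):
        self.X.append(self._feat(ctx, arm))
        self.y.append(reward)
```

A usage round would call `select` with the revealed bundle, play the returned arm, then `update` with the observed reward; with multiple feedback signals, the scalar `reward` would be whatever aggregate the application's reward function produces.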
Pages: 3087 - 3093
Page count: 7
Related Papers
50 items in total
  • [1] Curriculum Disentangled Recommendation with Noisy Multi-feedback
    Chen, Hong
    Chen, Yudong
    Wang, Xin
    Xie, Ruobing
    Wang, Rui
    Xia, Feng
    Zhu, Wenwu
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 34 (NEURIPS 2021), 2021, 34
  • [2] Bandit Learning with Implicit Feedback
    Qi, Yi
    Wu, Qingyun
    Wang, Hongning
    Tang, Jie
    Sun, Maosong
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 31 (NIPS 2018), 2018, 31
  • [3] Information thermodynamics for a multi-feedback process with time delay
    Kwon, Chulan
    Um, Jaegon
    Park, Hyunggyu
    [J]. EPL, 2017, 117 (01)
  • [4] Bandit Learning with Biased Human Feedback
    Tang, Wei
    Ho, Chien-Ju
    [J]. AAMAS '19: PROCEEDINGS OF THE 18TH INTERNATIONAL CONFERENCE ON AUTONOMOUS AGENTS AND MULTIAGENT SYSTEMS, 2019, : 1324 - 1332
  • [5] Learning with Bandit Feedback in Potential Games
    Cohen, Johanne
    Heliou, Amelie
    Mertikopoulos, Panayotis
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 30 (NIPS 2017), 2017, 30
  • [6] Learning in Congestion Games with Bandit Feedback
    Cui, Qiwen
    Xiong, Zhihan
    Fazel, Maryam
    Du, Simon S.
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022
  • [7] The Blinded Bandit: Learning with Adaptive Feedback
    Dekel, Ofer
    Hazan, Elad
    Koren, Tomer
    [J]. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 27 (NIPS 2014), 2014, 27
  • [8] Learning from eXtreme Bandit Feedback
    Lopez, Romain
    Dhillon, Inderjit S.
    Jordan, Michael I.
    [J]. THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 8732 - 8740
  • [9] Multi-feedback Pairwise Ranking via Adversarial Training for Recommender
    Wang, Jianfang
    Fu, Zhiyuan
    Niu, Mingxin
    Zhang, Pengbo
    Zhang, Qiuling
    [J]. CHINESE JOURNAL OF ELECTRONICS, 2020, 29 (04) : 615 - 622