For the multiarmed bandit, the classic result is probabilistic: each state of each bandit (a Markov chain with rewards) has an index that is determined by an optimal stopping time for that state's bandit, and expected discounted income is maximized by playing at each epoch a bandit whose current state has the largest index. Our approach is analytic, not probabilistic. It uses pairwise comparison in place of stopping times. A simple recursion assigns to each state of each bandit a utility and an amplification of future utility that depend solely on the data for that state's bandit. These utilities and amplifications determine whether or not one state dominates another. We show that it is optimal to play at each epoch any bandit whose current state is not dominated by the current states of the other bandits. We obtain this result by a coherent analysis that encompasses three models: one with risk-averse exponential utility, one with risk-seeking exponential utility, and one with linear utility and either stopping or discounting. We also show that the risk-seeking case and a model of Nash [Nash, P. 1980. A generalized bandit problem. J. Roy. Statist. Soc. B 42 165-169] are equivalent to each other.
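As a point of reference for the classic index result mentioned above, the following is a minimal sketch of how such indices can be computed for one finite-state bandit with linear utility and discounting. It is not the pairwise-comparison recursion of this paper, which the abstract does not spell out. It assumes the well-known largest-index-first scheme, in which states are ranked one at a time and each index is a ratio of expected discounted reward to expected discounted time up to the first exit from the set of already ranked states. The names `gittins_indices`, `choose_bandit`, `P`, `r`, `beta`, and `sweeps` are illustrative assumptions, and the small linear systems are solved by plain fixed-point iteration rather than a matrix solver.

```python
def gittins_indices(P, r, beta, sweeps=500):
    """Indices (in reward-per-period units) for one finite-state bandit.

    P[i][j] is the transition probability from state i to state j,
    r[i] is the reward earned when the bandit is played in state i,
    and 0 < beta < 1 is the discount factor. All names are assumptions
    made for this sketch.
    """
    n = len(r)
    index = [None] * n
    ranked = []  # states already ranked, in decreasing index order

    while len(ranked) < n:
        # For states in `ranked`, N[j] and D[j] are the expected discounted
        # reward and expected discounted time from j until the chain first
        # leaves `ranked`. Fixed-point iteration converges because beta < 1.
        N = {j: 0.0 for j in ranked}
        D = {j: 0.0 for j in ranked}
        for _ in range(sweeps):
            N = {j: r[j] + beta * sum(P[j][k] * N[k] for k in ranked)
                 for j in ranked}
            D = {j: 1.0 + beta * sum(P[j][k] * D[k] for k in ranked)
                 for j in ranked}

        # Candidate index of each unranked state: play once, then keep
        # playing only while the chain stays inside `ranked`.
        best_state, best_ratio = None, None
        for i in range(n):
            if index[i] is not None:
                continue
            num = r[i] + beta * sum(P[i][k] * N[k] for k in ranked)
            den = 1.0 + beta * sum(P[i][k] * D[k] for k in ranked)
            ratio = num / den
            if best_ratio is None or ratio > best_ratio:
                best_state, best_ratio = i, ratio

        index[best_state] = best_ratio
        ranked.append(best_state)

    return index


def choose_bandit(indices_per_bandit, current_states):
    """Index rule: play a bandit whose current state has the largest index."""
    return max(range(len(current_states)),
               key=lambda b: indices_per_bandit[b][current_states[b]])
```

For a two-state bandit in which state 0 pays 1 forever and state 1 pays 0 and then moves to state 0, a discount factor of 0.5 gives state 1 an index of 0.5: stopping after one play yields a ratio of 0, while never stopping yields discounted reward 1 over discounted time 2.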