Observatorio de I+D+i UPM

Memorias de investigación
Communications at congresses:
The Possibilistic Reward Methods for the Multi-Armed Bandit Problem
Year:2017
Research Areas
  • Operations research and mathematical programming
  • Information technology and data processing
Information
Abstract
The multi-armed bandit problem has been studied in great depth in statistics, becoming fundamental in different areas of economics, statistics and artificial intelligence, such as reinforcement learning and evolutionary programming. Two families of bandit settings can be distinguished. In the first, the distribution of rewards is assumed to belong to a family of probability distributions, whereas in the second, the rewards are only assumed to be bounded (say, between 0 and 1), and policies rely directly on the estimates of the expected rewards for each arm. Almost all the policies or allocation strategies in the literature focus on the first family, and they can be separated into two distinct approaches: the frequentist view and the Bayesian approach. In the frequentist view, the expected mean rewards corresponding to all arms are considered as unknown deterministic quantities and the aim of the algorithm is to reach the best parameter-dependent performance. In the Bayesian approach, each arm is characterized by a parameter which is endowed with a prior distribution. In this work, we propose a set of allocation strategies to deal with the multi-armed bandit problem that accounts for the Bayesian perspective: the possibilistic reward (PR) methods. First, we use possibilistic reward distributions to model the uncertainty about the expected reward for each arm, derived from an infinite set of confidence intervals nested around the expected value. Depending on the inequality (Chernoff-Hoeffding, Chernoff or Bernstein) used to compute the confidence intervals, we have three PR methods with different features. Next, we use a pignistic probability transformation borrowed from decision theory and the transferable belief model to convert these possibilistic functions into probability distributions following the insufficient reason principle. Finally, Thompson sampling techniques are used to identify the arm with the highest expected reward and play that arm.
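The construction described above can be sketched in code. This is a minimal illustration, not the authors' implementation: it assumes a Chernoff-Hoeffding possibility envelope around the sample mean on a discretized reward support, and applies the standard discrete pignistic transformation; all function and parameter names are illustrative.

```python
import numpy as np

def possibility_hoeffding(grid, sample_mean, n_pulls):
    # Nested confidence intervals from the Chernoff-Hoeffding inequality,
    # P(|mean - mu| >= eps) <= 2 exp(-2 n eps^2), induce these possibility
    # degrees for each candidate expected reward on the grid.
    return np.minimum(1.0, 2.0 * np.exp(-2.0 * n_pulls * (grid - sample_mean) ** 2))

def pignistic(pi):
    # Pignistic transformation (transferable belief model): each increment
    # of possibility mass, taken in decreasing order of possibility, is
    # split equally among the j most possible points, following the
    # insufficient reason principle.
    order = np.argsort(-pi)
    s = np.append(pi[order], 0.0)
    contrib = (s[:-1] - s[1:]) / np.arange(1, len(pi) + 1)
    p = np.empty_like(pi)
    p[order] = np.cumsum(contrib[::-1])[::-1]
    return p

grid = np.linspace(0.0, 1.0, 101)   # discretized reward support in [0, 1]
pi = possibility_hoeffding(grid, sample_mean=0.6, n_pulls=20)
p = pignistic(pi)                    # a proper pmf over the grid
```

Because the maximum possibility degree is 1, the resulting `p` sums to 1 and can be sampled from directly in a Thompson sampling step.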
For this, we carry out a simulation experiment by sampling from each arm according to its distribution. The picked arm is then pulled/played and a real reward is observed. Finally, the possibilistic function corresponding to the picked arm is updated and the process starts again. A numerical study analyses the performance of the proposed methods with respect to other policies in the literature. For this, five scenarios are considered: a Bernoulli distribution with very low success probabilities; a Bernoulli distribution with success probabilities close to 0.5; Gaussian rewards; a truncated Poisson distribution in [0,10]; and a truncated exponential distribution in [0,10].
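The sample-pick-pull-update loop described above can be sketched as follows. This is a self-contained toy run, not the paper's experiment: it assumes the Chernoff-Hoeffding variant of the PR method, a three-armed Bernoulli bandit with made-up success probabilities, and a neutral fictitious initial pull per arm.

```python
import numpy as np

rng = np.random.default_rng(0)

def pr_sample(mean, n, grid, rng):
    # One Thompson-style draw for an arm: build the Chernoff-Hoeffding
    # possibility envelope, convert it to a pmf with the pignistic
    # transformation, and sample one candidate expected reward from it.
    pi = np.minimum(1.0, 2.0 * np.exp(-2.0 * n * (grid - mean) ** 2))
    order = np.argsort(-pi)
    s = np.append(pi[order], 0.0)
    contrib = (s[:-1] - s[1:]) / np.arange(1, len(pi) + 1)
    p = np.empty_like(pi)
    p[order] = np.cumsum(contrib[::-1])[::-1]
    return rng.choice(grid, p=p / p.sum())

true_p = [0.3, 0.5, 0.7]               # illustrative arm success probabilities
grid = np.linspace(0.0, 1.0, 201)      # discretized reward support
counts = np.ones(3)                    # one fictitious initial pull per arm
means = np.full(3, 0.5)                # neutral initial mean estimates

for t in range(2000):
    draws = [pr_sample(means[a], counts[a], grid, rng) for a in range(3)]
    a = int(np.argmax(draws))             # pick the arm with the highest draw
    reward = rng.binomial(1, true_p[a])   # pull it and observe a real reward
    counts[a] += 1                        # update that arm's possibilistic
    means[a] += (reward - means[a]) / counts[a]  # statistics and start again
```

Over the run, the best arm (here the third one) accumulates most of the pulls, which is the behaviour the regret comparison in the numerical study measures.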
International
Yes
Congress
5th Symposium on Games and Decisions in Reliability and Risk
Place
Madrid, Spain
Reviewers
No
ISBN/ISSN
Start Date
07/06/2017
End Date
09/06/2017
From page
To page
Participants
  • Author: Miguel Martín Blanco
  • Author: Antonio Jimenez Martin (UPM)
  • Author: Alfonso Mateos Caballero (UPM)
Related Research Groups, Departments and Institutes
  • Creator: Research Group: Decision Analysis and Statistics Group
  • Department: Artificial Intelligence
S2i 2019 Research Observatory @ UPM, in collaboration with the UPM Social Council
Co-financed by MINECO under the INNCIDE 2011 Programme (OTR-2011-0236)
Co-financed by MINECO under the INNPACTO Programme (IPT-020000-2010-22)