Observatorio de I+D+i UPM

Research reports
Conference presentations:
The Possibilistic Reward Methods for the Multi-Armed Bandit Problem
Research areas
  • Operations research and mathematical programming
  • Computer science and information technology
The multi-armed bandit problem has been studied in great depth in statistics and has become fundamental in several areas of economics, statistics and artificial intelligence, such as reinforcement learning and evolutionary programming. Two families of bandit settings can be distinguished. In the first, the distribution of rewards is assumed to belong to a family of probability distributions, whereas in the second the rewards are only assumed to be bounded (say, between 0 and 1), and policies rely directly on the estimates of the expected reward for each arm. Almost all the policies or allocation strategies in the literature focus on the first family, and they can be separated into two distinct approaches: the frequentist view and the Bayesian approach. In the frequentist view, the expected mean rewards corresponding to all arms are considered unknown deterministic quantities, and the aim of the algorithm is to reach the best parameter-dependent performance. In the Bayesian approach, each arm is characterized by a parameter which is endowed with a prior distribution. In this work, we propose a set of allocation strategies to deal with the multi-armed bandit problem from the Bayesian perspective: the possibilistic reward (PR) methods. First, we use possibilistic reward distributions to model the uncertainty about the expected reward for each arm, derived from a set of infinitely many confidence intervals nested around the expected value. Depending on the inequality (Chernoff-Hoeffding, Chernoff or Bernstein) used to compute the confidence intervals, we have three PR methods with different features. Next, we use a pignistic probability transformation, borrowed from decision theory and the transferable belief model, to convert these possibilistic functions into probability distributions following the insufficient reason principle. Finally, Thompson sampling techniques are used to identify the arm with the highest expected reward and play that arm.
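The pipeline described above (confidence-bound-based possibility distribution, pignistic transform, Thompson-style draw) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the discretization grid, the use of the Chernoff-Hoeffding bound as the possibility degree, and all function names are assumptions made for the example.

```python
import numpy as np

def possibility_from_hoeffding(mu_grid, mean, n):
    # Chernoff-Hoeffding bound: P(|x_bar - mu| >= eps) <= exp(-2 * n * eps^2),
    # used here as the possibility degree of each candidate expected reward mu.
    if n == 0:
        return np.ones_like(mu_grid)  # total ignorance before any pull
    return np.minimum(1.0, np.exp(-2.0 * n * (mu_grid - mean) ** 2))

def pignistic(poss):
    # Possibility -> probability (pignistic transform / insufficient reason):
    # the mass between consecutive possibility levels is shared equally
    # among all values whose possibility reaches that level.
    order = np.argsort(-poss)                 # possibility degrees, decreasing
    pi = np.append(poss[order], 0.0)
    p_sorted = np.zeros(len(poss))
    for j in range(len(poss)):
        p_sorted[: j + 1] += (pi[j] - pi[j + 1]) / (j + 1)
    p = np.zeros(len(poss))
    p[order] = p_sorted
    return p / p.sum()

def pr_select_arm(means, counts, rng, grid_size=101):
    # Thompson-style step: one draw from each arm's pignistic distribution
    # over the expected reward, then play the arm whose draw is largest.
    mu_grid = np.linspace(0.0, 1.0, grid_size)
    draws = []
    for mean, n in zip(means, counts):
        p = pignistic(possibility_from_hoeffding(mu_grid, mean, n))
        draws.append(rng.choice(mu_grid, p=p))
    return int(np.argmax(draws))
```

The pignistic transform here is the standard one for a discretized possibility distribution: sorting degrees decreasingly and splitting each level difference equally among the values at or above it, which realizes the insufficient reason principle.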
To do so, we carry out a simulation experiment by sampling from each arm according to its distribution. The picked arm is then pulled/played and a real reward is observed, after which the possibilistic function corresponding to the picked arm is updated and the process starts again. A numerical study analyses the performance of the proposed methods with respect to other policies in the literature. Five scenarios are considered: a Bernoulli distribution with very low success probabilities; a Bernoulli distribution with success probabilities close to 0.5; Gaussian rewards; a truncated Poisson distribution in [0,10]; and a truncated exponential distribution in [0,10].
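The experimental loop in this paragraph (sample, play, observe a real reward, update, repeat, then measure performance) can be illustrated with a self-contained sketch. For brevity a standard Beta-Bernoulli Thompson sampler stands in for the PR policies, and the arm probabilities and horizon are illustrative values, not those of the study.

```python
import numpy as np

def run_bernoulli_experiment(probs, horizon, seed=0):
    """One bandit run: draw an index per arm, play the argmax, observe a
    reward, update that arm's statistics, repeat; track expected regret.
    A Beta-Bernoulli Thompson sampler is used as a stand-in policy."""
    rng = np.random.default_rng(seed)
    k = len(probs)
    successes = np.zeros(k)
    failures = np.zeros(k)
    counts = np.zeros(k, dtype=int)
    regret = 0.0
    best = max(probs)
    for _ in range(horizon):
        # posterior-sampling step: one draw per arm, play the largest
        draws = rng.beta(successes + 1.0, failures + 1.0)
        arm = int(np.argmax(draws))
        reward = float(rng.random() < probs[arm])   # pull the arm
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        counts[arm] += 1
        regret += best - probs[arm]                 # expected (pseudo-)regret
    return counts, regret

# Scenario with very low success probabilities, as in the numerical study
counts, regret = run_bernoulli_experiment([0.01, 0.02, 0.05], horizon=2000)
```

Each policy under comparison would be plugged into the same loop, and cumulative regret averaged over many runs gives the performance curves used to compare methods.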
Conference name
5th Symposium on Games and Decisions in Reliability and Risk
Type of participation
Conference venue
Madrid, Spain
Conference start date
Conference end date
From page
To page
Proceedings title
This activity belongs to research reports
  • Author: Miguel Martín Blanco
  • Author: Antonio Jimenez Martin (UPM)
  • Author: Alfonso Mateos Caballero (UPM)
Related research groups, departments, centres and R&D+i institutes
  • Creator: Research Group: Decision Analysis and Statistics Group
  • Department: Artificial Intelligence
S2i 2022 Research Observatory @ UPM, in collaboration with the Consejo Social UPM
Co-funded by MINECO under the INNCIDE 2011 Programme (OTR-2011-0236)
Co-funded by MINECO under the INNPACTO Programme (IPT-020000-2010-22)