Memorias de investigación
Communications at congresses:
The Possibilistic Reward Methods for the Multi-Armed Bandit Problem

Research Areas
  • Operations research and mathematical programming
  • Information technology and data processing

The multi-armed bandit problem has been studied in great depth in statistics and has become fundamental in different areas of economics, statistics and artificial intelligence, such as reinforcement learning and evolutionary programming. Two families of bandit settings can be distinguished. In the first, the distribution of rewards is assumed to belong to a family of probability distributions, whereas in the second the rewards are only assumed to be bounded (say, between 0 and 1), and policies rely directly on the estimates of the expected rewards for each arm. Almost all the policies or allocation strategies in the literature focus on the first family, and they can be separated into two distinct approaches: the frequentist view and the Bayesian approach. In the frequentist view, the expected mean rewards corresponding to all arms are considered unknown deterministic quantities and the aim of the algorithm is to reach the best parameter-dependent performance. In the Bayesian approach, each arm is characterized by a parameter which is endowed with a prior distribution.

In this work, we propose a set of allocation strategies to deal with the multi-armed bandit problem that account for the Bayesian perspective: the possibilistic reward (PR) methods. First, we use possibilistic reward distributions to model the uncertainty about the expected reward of each arm, derived from an infinite set of confidence intervals nested around the expected value. Depending on the inequality (Chernoff-Hoeffding, Chernoff or Bernstein) used to compute the confidence intervals, we obtain three PR methods with different features. Next, we use the pignistic probability transformation, borrowed from decision theory and the transferable belief model, to convert these possibilistic distributions into probability distributions following the insufficient reason principle. Finally, Thompson sampling techniques are used to identify the arm with the highest expected reward and play that arm.
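As a rough illustration of this pipeline, the following Python sketch implements a Chernoff-Hoeffding-style variant (function names are ours, not the paper's; rewards are assumed bounded in [0,1]). It uses the fact that sampling from the pignistic transform of a consonant possibility distribution amounts to drawing a level alpha uniformly and then drawing a point uniformly on the corresponding alpha-cut:

```python
import math
import random

def sample_pr_hoeffding(mean, n, lo=0.0, hi=1.0):
    """Sample from the pignistic probability induced by the possibility
    distribution pi(x) = min(1, exp(-2 * n * (x - mean)**2)), built from
    the nested Chernoff-Hoeffding confidence intervals around `mean`.
    Drawing alpha ~ U(0,1] and then a point uniformly on the alpha-cut
    {x : pi(x) >= alpha} spreads the mass of each nested interval
    uniformly over it, as the pignistic transformation prescribes."""
    if n == 0:
        return random.uniform(lo, hi)  # no observations yet: uninformative
    alpha = 1.0 - random.random()      # in (0, 1], avoids log(0)
    half = math.sqrt(-math.log(alpha) / (2.0 * n))
    return random.uniform(max(lo, mean - half), min(hi, mean + half))

def pr_bandit(arms, horizon):
    """Possibilistic-reward play: sample one value per arm, pull the
    argmax, observe the real reward and update that arm's empirical mean."""
    k = len(arms)
    counts, means = [0] * k, [0.0] * k
    for _ in range(horizon):
        draws = [sample_pr_hoeffding(means[i], counts[i]) for i in range(k)]
        i = max(range(k), key=draws.__getitem__)
        reward = arms[i]()                        # play the picked arm
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]
    return counts, means
```

With two Bernoulli arms of means 0.2 and 0.8, for instance, the sampled confidence intervals shrink as observations accumulate and the pulls concentrate on the better arm.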
To do so, we carry out a simulation experiment by sampling from each arm according to its distribution. The picked arm is then pulled/played and a real reward is observed, after which the possibilistic distribution corresponding to that arm is updated and the process starts again. A numerical study analyses the performance of the proposed methods with respect to other policies in the literature. Five scenarios are considered: a Bernoulli distribution with very low success probabilities; a Bernoulli distribution with success probabilities close to 0.5; Gaussian rewards; a truncated Poisson distribution in [0,10]; and a truncated exponential distribution in [0,10].
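The scenario reward distributions are straightforward to reproduce. A minimal Python sketch (sampler names are ours), with the truncated Poisson and exponential rewards rescaled from [0,10] to [0,1] so all arms are comparable on the unit interval:

```python
import math
import random

def bernoulli(p):
    """Reward sampler for a Bernoulli arm with success probability p."""
    return lambda: float(random.random() < p)

def truncated_exponential(rate, cap=10.0):
    """Exponential reward truncated to [0, cap] and rescaled to [0, 1],
    via inverse-CDF sampling restricted to the truncation interval."""
    def draw():
        u = random.random() * (1.0 - math.exp(-rate * cap))
        return -math.log(1.0 - u) / rate / cap
    return draw

def truncated_poisson(lam, cap=10):
    """Poisson reward truncated to {0, ..., cap} and rescaled to [0, 1],
    using rejection on Knuth's multiplicative Poisson sampler."""
    def draw():
        while True:
            L, k, prod = math.exp(-lam), 0, random.random()
            while prod > L:
                k += 1
                prod *= random.random()
            if k <= cap:
                return k / cap
    return draw
```

Each sampler returns a zero-argument callable, so arms for any of the scenarios can be passed directly to a bandit loop.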
5th Symposium on Games and Decisions in Reliability and Risk
Madrid, Spain

Related Research Groups, Departments and Institutes
  • Creator: Research Group: Grupo de análisis de decisiones y estadística
  • Department: Inteligencia Artificial