Description



The multi-armed bandit problem has been studied in great depth in statistics and has become fundamental in several areas of economics, statistics and artificial intelligence, such as reinforcement learning and evolutionary programming. Two families of bandit settings can be distinguished. In the first, the distribution of rewards is assumed to belong to a family of probability distributions, whereas in the second the rewards are only assumed to be bounded (say, between 0 and 1) and policies rely directly on estimates of the expected reward for each arm. Almost all policies or allocation strategies in the literature focus on the first family, and they can be separated into two distinct approaches: the frequentist view and the Bayesian approach. In the frequentist view, the expected mean rewards of all arms are treated as unknown deterministic quantities and the aim of the algorithm is to achieve the best parameter-dependent performance. In the Bayesian approach, each arm is characterized by a parameter endowed with a prior distribution. In this work, we propose a set of allocation strategies for the multi-armed bandit problem that account for the Bayesian perspective: the possibilistic reward (PR) methods. First, we use possibilistic reward distributions to model the uncertainty about the expected reward of each arm, derived from a set of infinitely many confidence intervals nested around the expected value. Depending on the inequality (Chernoff-Hoeffding, Chernoff or Bernstein) used to compute the confidence intervals, we obtain three PR methods with different features. Next, we use a pignistic probability transformation, borrowed from decision theory and the transferable belief model, to convert these possibilistic functions into probability distributions following the insufficient reason principle. Finally, Thompson sampling techniques are used to identify the arm with the highest expected reward and to play that arm.
To this end, we carry out a simulation experiment by sampling from each arm according to these distributions. The picked arm is then pulled and a real reward is output, after which the possibilistic function of the picked arm is updated and the process starts again. A numerical study analyses the performance of the proposed methods against other policies in the literature. Five scenarios are considered: a Bernoulli distribution with very low success probabilities; a Bernoulli distribution with success probabilities close to 0.5; Gaussian rewards; a truncated Poisson distribution in [0,10]; and a truncated exponential distribution in [0,10].
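As an illustration of the pipeline described above, here is a minimal Python sketch of a PR-style policy under the Chernoff-Hoeffding inequality, assuming rewards bounded in [0, 1]. The class and method names are hypothetical, and the pignistic sampling step (drawing a level alpha uniformly and then sampling uniformly from the corresponding alpha-cut of the possibility distribution) is one simple way to realise the insufficient reason principle, not necessarily the authors' exact implementation.

```python
import math
import random


class PRBandit:
    """Hypothetical sketch of a possibilistic reward (PR) bandit policy.

    Assumes rewards in [0, 1] and uses the Chernoff-Hoeffding bound,
    which yields the possibility distribution
        pi(x) = min(1, 2 * exp(-2 * n * (x - mean)^2))
    around the empirical mean of each arm.
    """

    def __init__(self, n_arms):
        self.counts = [0] * n_arms     # number of pulls per arm
        self.means = [0.0] * n_arms    # empirical mean reward per arm

    def _sample_pignistic(self, arm):
        # Pignistic sampling: draw a level alpha uniformly in (0, 1],
        # then sample uniformly from the alpha-cut
        # {x : pi(x) >= alpha} = [mean - w, mean + w], clipped to [0, 1].
        n, m = self.counts[arm], self.means[arm]
        if n == 0:
            return random.random()  # uninformative: whole unit interval
        alpha = 1.0 - random.random()  # in (0, 1], avoids log(2/0)
        w = math.sqrt(math.log(2.0 / alpha) / (2.0 * n))
        lo, hi = max(0.0, m - w), min(1.0, m + w)
        return random.uniform(lo, hi)

    def select_arm(self):
        # Thompson-style step: sample one value per arm, play the argmax.
        samples = [self._sample_pignistic(a) for a in range(len(self.counts))]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, arm, reward):
        # Incremental update of the empirical mean for the pulled arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]
```

In a simulation loop one would repeatedly call `select_arm()`, draw a real reward from the chosen arm's distribution (e.g. a Bernoulli arm), and feed it back via `update()`, mirroring the pull-observe-update cycle described in the abstract.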
International

Yes
Conference name

5th Symposium on Games and Decisions in Reliability and Risk 
Type of participation

960 
Conference venue

Madrid, Spain
Reviewers

No 
ISBN or ISSN


DOI


Conference start date

07/06/2017 
Conference end date

09/06/2017 
From page


To page


Proceedings title
