1 An empirical evalutation of Thompson Sampling (Chapelle and Li 2011)

Chapelle and Li (2011) use simulations as well as real data set to demonstrate empirically that Thompson Sampling asymptotically outperforms Upper Confidence Bounds for solving Multi-Armed Bandit (MAB) problems. In the following, I will replicate their findings using synthetic data.

1.1 A heads-up on the code

The empirical exercises have been implemented in R. To boost computational performance, bottleneck problems have been implemented in C++ through Rcpp. Where appropriate I use object-oriented programming (S3 classes). Code chunks in this document can be shown or hidden through buttons in their top right corner. All code is contained in the ./R folder.

1.2 Setting up the Bernoulli MAB

To start off with, let us create the MAB. As in Chapelle and Li (2011) we will consider a Bernoulli MAB: at each step \(T\) the bandit is faced with the problem of choosing one of \(K\) arms, all of which yield rewards that are generated by a Bernoulli distribution. For simulation purposes, two approaches come to mind:

Generate all \((T \times K)\) returns ex-ante and test the algorithms one that same data set.
Generate the returns on-line separately for each algorithm.

The first approach is perhaps more conventional in the context of benchmarking, since it provides a fair performance comparison of different algorithms with respect to the same data set. But since we will run the simulation many times over random realizations of the Bernoulli MAB, there is no need to have different algorithms work on the exact same realizations (asymptotically they are working with the same bandit problem).

The second approach is arguably more closely resembles real-world applications of reinforcement learning: at each step \(T\) the bandit chooses an action and only then they observe the reward. It also has computational advantages, some of which are obvious and others that will become clearer below. An obvious benefit of generating rewards on the fly is the smaller memory costs associated with this approach. Storing a \((T \times K)\) matrix consumes a fair bit of working memory and the for the simulations we will need as much of that as possible.

In the end I have therefore opted for the second approach. The mab function below takes basic input parameters concerning the time horizon of the game, the (Bernoulli) probabilities of each arm and their distribution (only implemented for Bernoulli, but easily extended). It returns an S3 class that will be used as the input for the different algorithms below.

#' Multi-Armed Bandit
#'
#' @param prob Bernoulli probabilities.
#' @param horizon Number of trials.
#' @param method Method to use. Only Bernoulli is implemented.
#'
#' @author Patrick Altmeyer
mab <- function(prob, horizon=1000, method="bernoulli") {
  # Parameters: ----
  K <- length(prob)
  v_star <- max(prob)

  # Reward generating function: ----
  if (method=="bernoulli") {
    generate_rewards <- function(prop) {
      sapply(
        prob,
        function(p) {
          rbinom(1,1,p)
        }
      )
    }
  }

  # Output: ----
  output <- list(
    prob = prob,
    generate_rewards = generate_rewards,
    v_star = v_star,
    K = K,
    horizon = horizon,
    method = method
  )
  class(output) <- "mab"
  return(output)
}

unpack.mab <- function(mab, envir=parent.env) {
  list2env(mab, envir = envir(environment()))
}

unpack <- function(mab, envir=parent.env) {
  UseMethod("unpack")
}

1.3 UCB

The code chunk below shows the implementation of the UCB algorithm in R. This implementation does not make use of Rcpp and is only shown for illustrative purposes. The code is thoroughly commented, so I here only talk to the high-level details. Like all other algorithms that we will see, ucb_R takes an object of class mab as its main input. The optional action_values argument can be used to supply initial values, should we have prior knowledge. By default all action values are initialized as zero. Regret is computed in each period as \(V^*-Q(a_T)\) where \(V^*\) is the optimal choice and \(Q(a_T)\) represents the true reward probability of the arm chosen in period \(T\). In particular, following Chapelle and Li (2011) here we have \(V^*=0.5=Q(a^{(1)})\) and \(Q(a^{(j)})=0.5-\varepsilon\) for all \(j\ne1\), \(j \in [1,K]\). It is easy to see that with this implementation, we can simply take the cumulative sum of the resulting vector of single-period regret to recover the total cumulative regret in period \(T\) as specified in Auer, Cesa-Bianchi, and Fischer (2002), namely

\[ \begin{equation} \begin{aligned} && V^*T- \sum_{j=1}^{K}Q(a^{(j)})N_j(T)&= \\ \end{aligned} \tag{1.1} \end{equation} \]

where \(N_j(T)\) denotes the number of times that arm \(j\) has been chosen up until (and including) time \(T\).

The function outputs an object of class policy, which contains the proposed solution to the MAB. The policy class uses a set of methods (e.g. print and plot) which will be used below.

#' UCB
#'
#' @param mab An object of class `mab`.
#' @param update_every Integer specifying length of interval after which parameters will be updated.
#' @param action_values Initial action value estimates (optional).
#'
#' @author Patrick Altmeyer
ucb_R <- function(
  mab,
  update_every = 1,
  action_values = NULL
) {

  # Parameters: ----
  unpack(mab) # unpack bandit problem
  # Uncertainty implemented according to Chapelle et al. (2011):
  uncertainty <- function(action_value,m,delta) {
    ifelse(m==0, Inf, sqrt((2 * action_value * log(1/delta)) / m) + (2*log(1/delta)) / m)
  }
  # If unambiguous, select arm with maximum UCB, else draw uniformly:
  select_arm <- function(ucb) {
    ifelse(sum(ucb==max(ucb))==1, which.max(ucb), sample(which(ucb==max(ucb)),1))
  }

  # Initialization: ----
  T_ <- 1 # iteration counter
  delta <- sqrt(1/T_)
  policy <- rep(0, horizon) # vector to keep track of choices
  regret <- rep(v_star, horizon) # initialize regret as v_star (maximal true action value)
  times_chosen <- rep(0,K) # count of times that actions were chosen
  if (is.null(action_values)) {
    action_values <- rep(0,K)
  }
  ucb <- action_values + uncertainty(action_values, times_chosen, delta) # initial upper confidence bounds

  # Recursion: ----
  while (T_ <= horizon) {

    # Select arm:
    arm <- select_arm(ucb)

    # Observe reward:
    r <- generate_rewards(prob)[arm]

    # Update choices in T:
    policy[T_] <- arm # update policy
    regret[T_] <- regret[T_] - prob[arm] # compute regret in this period
    times_chosen[arm] <- times_chosen[arm] + 1 # update counter
    q <- action_values[arm] # action value at T
    m <- times_chosen[arm] # times chose at T
    action_values[arm] <- q + 1/m * (r - q) # action value increment

    # Update parameters and estimates for T+1:
    T_ <- T_ + 1
    update_this_round <- T_ %% update_every == 0 # update only if modulus equal to zero
    if (update_this_round) {
      delta <- sqrt(1/T_)
      # action_values[arm] <- q + 1/m * (r - q) # action value increment
      ucb[arm] <- action_values[arm] + uncertainty(action_values[arm], m, delta) # UCB increment
    }
  }

  # Output: ----
  output <- list(
    policy = policy,
    regret = regret,
    times_chosen = times_chosen,
    action_values = action_values,
    algo = list(
      type = "ucb",
      upper_confidence_bounds = ucb
    ),
    mab = mab
  )
  class(output) <- "policy"
  return(output)
}

1.3.1 Example

Before moving on to Thompson sampling, let us run a single game to see how the algorithm performs. The code chunk below generates the game using the mab function introduced above.

eps <- 0.1
K <- 10
T_ <- 10000
prob <- c(0.5,rep(0.5-eps,K-1))
bernoulli_mab <- mab(prob, horizon = T_)

Below we use the Rcpp version of the UCB algorithm to fit our model. For \(T=10^{4}\) it only takes a few seconds to solve the problem. The ucb_Rcpp function is a wrapper function that executes the compiled C++ code under the hood and returns a policy object. The C++ code will not be discussed here but details can be found in the annex. For \(K=10\) and \(\varepsilon=0.1\), the cumulative regret after \(T=10^{4}\) periods is around the same levels as the average cumulative regret for that specification in Chapelle and Li (2011) (Figure 1.1).

set.seed(42)
fitted <- ucb_Rcpp(bernoulli_mab)
plot(fitted)

Figure 1.1: An example run of the UCB algorithm.

1.4 Thompson sampling

The implementation of Thompson sampling is very similar to UCB. Once again commented R code is shown below.

#' Thompson Sampling
#'
#' @param mab An object of class `mab`.
#' @param update_every Integer specifying length of interval after which parameters will be updated.
#' @param successes Initial success counts (optional).
#' @param failures Initial failure counts (optional).
#'
#' @author Patrick Altmeyer
thompson_R <- function(
  mab,
  update_every = 1,
  successes = NULL,
  failures = NULL
) {

  # Parameters: ----
  unpack(mab) # unpack bandit problem
  # Function to compute posterior means (action values):
  posterior_means <- function(successes, failures, method=method) {
    K <- length(successes)
    if (method=="bernoulli") {
      theta <- rbeta(K,successes,failures)
    }
    return(theta)
  }
  # Function to select based on action values:
  select_arm <- function(theta) {
    arm <- which.max(theta)
    return(arm)
  }

  # Initialization: ----
  T_ <- 1 # iteration counter
  policy <- rep(0, horizon) # vector to keep track of choices
  regret <- rep(v_star, horizon) # initialize regret as v_star (maximal true action value)
  times_chosen <- rep(0,K) # count of times that actions were chosen
  if (is.null(successes)) {
    successes <- rep(1,K) # initialize success count
  }
  if (is.null(failures)) {
    failures <- rep(1,K) # initialize failure count
  }
  theta <- posterior_means(successes = successes, failures = failures) # initilize action values

  # Recursion: ----
  while (T_ <= horizon) {

    # Select arm:
    arm <- select_arm(theta = theta)

    # Observe reward:
    r <- generate_rewards(prob)[arm]

    # Update choices in T:
    policy[T_] <- arm # update policy
    regret[T_] <- regret[T_] - prob[arm] # compute regret in this period
    times_chosen[arm] <- times_chosen[arm] + 1 # update counter
    if (r==1) {
      successes[arm] <- successes[arm] + 1
    } else {
      failures[arm] <- failures[arm] + 1
    }

    # Update parameters and estimates for T+1:
    T_ <- T_ + 1
    update_this_round <- T_ %% update_every == 0 # update only if modulus equal to zero
    if (update_this_round) {
      theta <- posterior_means(successes = successes, failures = failures)
    }

  }

  # Output: ----
  output <- list(
    policy = policy,
    regret = regret,
    times_chosen = times_chosen,
    action_values = theta,
    algo = list(
      type = "thompson",
      successes = successes,
      failures = failures
    ),
    mab = mab
  )
  class(output) <- "policy"
  return(output)
}

1.4.1 Example

Let us also look at an example run with Thompson sampling using the Rcpp implementation as before. We will use the same bandit as for the UCB above. As evident from Figure 1.2 the cumulative regret after \(T=10^{4}\) trials is much lower than for UCB. The curve describing the trajectory is also smoother than for UCB, perhaps indicating that when learning from posteriour probabilities, the bandit is less likely to stick with any particular choice throughout multiple consecutive periods.

set.seed(42)
fitted <- thompson_Rcpp(bernoulli_mab)
plot(fitted)

Figure 1.2: An example run of Thompson sampling.

1.5 Performance benchmark

The choice to compile the algorithms in C++ through Rcpp was driven by the fact that we are looking at computationally very intensive simulations.¹ This implementation leads to massive performance boosts: Figure 1.3 shows a micro benchmark of the computing times of the different implementations.

library(microbenchmark)
library(ggplot2)
mb <- microbenchmark(
    "UCB (R)" = {ucb_R(bernoulli_mab)},
    "UCB (Rcpp)" = {ucb_Rcpp(bernoulli_mab)},
    "Thompson (R)" = {thompson_R(bernoulli_mab)},
    "Thompson (Rcpp)" = {thompson_Rcpp(bernoulli_mab)},
    times = 100
)
p <- autoplot(mb)
ggsave("www/mb.png", width = 6, height = 4)

Figure 1.3: Reduction in computing time due to compiling through Rcpp.

2 Simulations

First, let us set up a grid for the simulation, which will use both to draw the different parameters (\(K\), \(\varepsilon\)):

library(data.table)
t_0 <- 1
T_ <- 1e7
K <- c(10,100)
eps <- c(0.02,0.1)
algos <- list(
  ucb = sim_ucb,
  thompson = sim_thompson
)
param_grid <- CJ(K=K,eps=eps,algo=names(algos))

The functions going into the list algos are compiled through Rcpp to also speed up that part of the process. Details can again be found in the annex.

We will also set up a corresponding table to store the results from the simulation and eventually plot them. This involves performing a cross-join across the different parameters and the trials (\(T=[1,10^{4}]\)), which is efficiently implemented in data.table.

simulation <- CJ(K=K,eps=eps,T_=t_0:T_)

2.1 Continuous updates

2.1.1 Lower bound

The first variable we will add to simulation is the lower bound for the different parameter specifications. The lower bound is implemented according to the formulation in Chapelle and Li (2011):

#' Lower bound
#'
#' @param t Time T.
#' @param prob Action values.
#' @param kl Function to compute KL-Divergence.
#'
#' @author Patrick Altmeyer
lb <- function(t, prob, kl) {
  p_star <- max(prob)
  log(t) * sum(
    sapply(
      prob,
      function(p) {
        ifelse(p_star-p>0,(p_star-p)/kl(p,p_star),0)
      }
    )
  )
}

Next we apply this function for each \(K\) and \(\varepsilon\). Following Chapelle and Li (2011) we will shift the lower bound such that it passes through the origin at \(T=100\). Again these computation can be done very efficiently using data.table:

simulation[,value:=lb(T_,c(0.5,rep(0.5-eps,K-1)),kl_bernoulli) - lb(100,c(0.5,rep(0.5-eps,K-1)),kl_bernoulli),by=.(K,eps)]
simulation[,variable:="lower_bound"]

2.1.2 Algorithms

Now let us finally turn to the simulation exercise. While the code Rcpp code runs fast, running multiple simulations of Bernoulli bandits with \(K=100\) arms over \(T=10^7\) trials is computationally challenging.² The code chunk below was used to run n_sim\(=25\) simulations as in Chapelle and Li (2011). While the authors chose to only update parameters every 100 iterations presumably to speed up computations, we do not use that simplification here and instead update every round.

start <- Sys.time()
simulation <- rbind(
  simulation,
  rbindlist(
    lapply(
      1:nrow(param_grid),
      function(param_spec) {
        list2env(c(param_grid[param_spec,]), envir = environment()) # unpack variables
        prob <- c(0.5,rep(0.5-eps,K-1)) # compute corresponding probabilities
        bernoulli_mab <- mab(prob, horizon = T_) # generate MAB
        unpack(bernoulli_mab) # unpack objects
        # initialize table to store output for parameter specification:
        simulation_spec <- data.table(K=K, eps=eps, T_=t_0:T_, value=0, variable=algo) 
        # run simulation:
        message(sprintf(
          "Starting with: %s, %i arms, eps=%0.2f, T=%i, n_sim=%i", 
          algo, 
          K, 
          eps,
          T_,
          n_sim
        ))
        regret <- algos[[algo]](
          n = n_sim,
          horizon = horizon,
          v_star = v_star,
          K = K,
          prob = prob,
          update_every = update_every
        )
        message(sprintf(
          "Done with: %s, %i arms, eps=%0.2f, T=%i, n_sim=%i", 
          algo, 
          K, 
          eps,
          T_,
          n_sim
        ))
        simulation_spec[,value:=cumsum(regret)] # store results
        return(simulation_spec)
      }
    )
  )
)
time_passed <- Sys.time() - start
print(time_passed)
saveRDS(simulation, file="results/simulation.rds")

2.1.3 Chart

The results are consistent with the findings in Chapelle and Li (2011): for all parameter specifications both the UCB and Thompson Sampling attain sub-linearity before reaching \(10^7\) trials (Figure 2.1). As expected, Thompson Sampling consistently outperforms UCB. Compared to the results in Chapelle and Li (2011), both the UCB and to a lesser extent Thompson Sampling yield somewhat lower absolute levels of cumulative regret for any given \(T\). This is likely due to the fact that here we have updated parameters continuously in each round, while Chapelle and Li (2011) only update every 100 trials.

Comparison of algorithms as in @chapelle2011empirical.

Figure 2.1: Comparison of algorithms as in Chapelle and Li (2011).

2.2 Discrete updates

For the complete picture we also run the simulation with updates at discrete intervals of 100 trials. The results shown in Figure 2.2 are broadly consistent with the results from above.

n_sim <- 25
run_simulation(n_sim=n_sim,horizon=1e7,update_every = 100,file="ps1_sim_discrete.png")

Comparison of algorithms as in @chapelle2011empirical. Parameter updates every 100 trials.

Figure 2.2: Comparison of algorithms as in Chapelle and Li (2011). Parameter updates every 100 trials.

References

Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. 2002. “Finite-Time Analysis of the Multiarmed Bandit Problem.” Machine Learning 47 (2): 235–56.

Chapelle, Olivier, and Lihong Li. 2011. “An Empirical Evaluation of Thompson Sampling.” Advances in Neural Information Processing Systems 24: 2249–57.

Annex

A `Rcpp` code

A.1 UCB

#include <Rcpp.h>
using namespace Rcpp;

// Helper function to generate rewards:
// [[Rcpp::export]]
NumericVector generate_rewards(NumericVector prob, String method="bernoulli") {
  int n = prob.size();
  NumericVector v = no_init(n);
  if (method=="bernoulli") {
    for (int i=0; i<n; i++) {v[i] = as<double>(rbinom(1, 1, prob[i]));}
  }
  return(v);
}

// Helper function to select arm:
// [[Rcpp::export]]
int select_arm(NumericVector ucb) {
  int out;
  if (sum(ucb==max(ucb))==1) {
    // if there is an unambiguous maximal value, choose that one
    out = which_max(ucb);
  } else {
    // else draw uniformly from maximal values
    IntegerVector indices = seq_along(ucb)-1; // indices of values
    LogicalVector is_max = ucb==max(ucb); // boolean vector identifying maximal values
    IntegerVector candidates = indices[is_max]; // indices of maximal values
    IntegerVector out_v = sample(candidates, 1); // sample from those
    out = out_v[0];
  }
  return out;
}

// Helper function to quantify uncertainty:
// [[Rcpp::export]]
NumericVector uncertainty(NumericVector action_values, NumericVector times_chosen, double delta) {
  NumericVector out=ifelse(
    times_chosen==0,
    INFINITY, // if arm has never been selected, set to infinity
    sqrt(2 * log(1/delta) *  action_values / times_chosen) + (2*log(1/delta)) / times_chosen
  );
  return out;
}

// Main UCB function:
// [[Rcpp::export]]
List ucb(
    int horizon,
    double v_star,
    int K,
    NumericVector prob,
    int update_every=1,
    String method="bernoulli",
    Nullable<NumericVector> action_values_ = R_NilValue
) {

  // Initialize:
  int T = 1;
  double delta = sqrt(1/T);
  NumericVector policy (horizon); // vector to keep track of choices
  NumericVector regret (horizon, v_star); // initialize regret as v_star (maximal true action value)
  NumericVector times_chosen (K); // counts of times selected
  NumericVector action_values (K); // action_values initialized to zeros
  // If initial action values have been supplied, then overwrite:
  if (action_values_.isNotNull()) {
    NumericVector action_values(action_values_); // casting to underlying type
  }
  NumericVector u = uncertainty(action_values, times_chosen, delta); // uncertainty
  NumericVector ucb = action_values + u; // upper confidence bound

  // Recursion:
  while(T <= horizon) {

    // Rcout << "UCB: " << ucb << "\n";

    // Select arm:
    int arm=select_arm(ucb);

    // Observe reward:
    NumericVector rewards = generate_rewards(prob, method);
    int r = rewards[arm];

    // Update choices in T:
    policy[T-1] = arm+1; //update policy
    regret[T-1] = regret[T-1] - prob[arm]; // compute regret in this period
    times_chosen[arm] = times_chosen[arm] + 1; // update counter
    double q = action_values[arm]; // action value at T
    int m = times_chosen[arm]; // times chose at T
    action_values[arm] = q + 1.0/m * (r - q); // action value increment for selected arm

    // Update:
    T++;
    // Check if modulus is zero:
    int update_this_round = T % update_every == 0;
    if (update_this_round) {
      double delta = sqrt(1.0/T);
      // action_values[arm] = q + 1.0/m * (r - q); // action value increment for selected arm
      u = uncertainty(action_values, times_chosen, delta); // update uncertainty
      ucb = action_values + u; // update UCB
    }
  }

  // Output
  List output = List::create(
    Named("policy") = policy,
    Named("regret") = regret,
    Named("times_chosen") = times_chosen,
    Named("action_values") = action_values,
    Named("ucb") = ucb
  );

  return output;
}

// Function to run simulation:
// [[Rcpp::export]]
NumericVector sim_ucb(
    double n,
    int horizon,
    double v_star,
    int K,
    NumericVector prob,
    int update_every=1,
    String method="bernoulli",
    Nullable<NumericVector> action_values_ = R_NilValue
) {
  NumericVector regret (horizon); // allocate memory
  double sim_counter=1.0; // counter
  while (sim_counter <= n) {
    Rcout << sim_counter << "/" << n << "\n";
    // Run algorithm:
    List policy = ucb(
      horizon,
      v_star,
      K,
      prob,
      update_every,
      method,
      action_values_
    );
    NumericVector reg_sim=policy["regret"];
    regret = regret + reg_sim / n; // increment regret
    sim_counter++;
  }
  return(regret);
}

// You can include R code blocks in C++ files processed with sourceCpp
// (useful for testing and development). The R code will be automatically
// run after the compilation.
//
//
// /*** R
// prob <- c(0.5,rep(0.4,9))
// bernoulli_mab <- mab(prob, horizon = 10000)
// unpack(bernoulli_mab)
// sim_ucb(
//   n = 50,
//   horizon = horizon,
//   v_star = v_star,
//   K = K,
//   prob = prob,
//   update_every=100,
//   method = method,
//   action_values_ = NULL
// )
// */

A.2 Thompson Sampling

#include <Rcpp.h>
using namespace Rcpp;

// Helper function to generate rewards:
// [[Rcpp::export]]
NumericVector generate_rewards(NumericVector prob, String method="bernoulli") {
  int n = prob.size();
  NumericVector v = no_init(n);
  if (method=="bernoulli") {
    for (int i=0; i<n; i++) {v[i] = as<double>(rbinom(1, 1, prob[i]));}
  }
  return(v);
}

// Helper function to compute posterior means (action values):
// [[Rcpp::export]]
NumericVector posterior_means(IntegerVector successes, IntegerVector failures, String method="bernoulli") {
  int K=successes.size();
  NumericVector theta = no_init(K);
  if (method=="bernoulli") {
    for (int i=0; i<K; i++) {theta[i] = as<double>(rbeta(1, successes[i], failures[i]));}
  }
  return(theta);
}

// Helper function to select arm:
// [[Rcpp::export]]
int select_arm(NumericVector theta) {
  int arm = which_max(theta);
  return arm;
}

// Main UCB function:
// [[Rcpp::export]]
List thompson(
    int horizon,
    double v_star,
    int K,
    NumericVector prob,
    int update_every=1,
    String method="bernoulli",
    Nullable<IntegerVector> successes_ = R_NilValue,
    Nullable<IntegerVector> failures_ = R_NilValue
) {

  // Initialize:
  int T = 1;
  NumericVector policy (horizon); // vector to keep track of choices
  NumericVector regret (horizon, v_star); // initialize regret as v_star (maximal true action value)
  NumericVector times_chosen (K); // counts of times selected
  IntegerVector successes (K,1); // successes initialized to one
  IntegerVector failures (K,1); // failures initialized to one
  // If initial success/failure values have been supplied, then overwrite:
  if (successes_.isNotNull()) {
    IntegerVector successes(successes_); // casting to underlying type
  }
  if (failures_.isNotNull()) {
    IntegerVector failures(failures_); // casting to underlying type
  }
  NumericVector theta=posterior_means(successes, failures, method);

  // Recursion:
  while(T <= horizon) {

    // Rcout << "Round: " << T << "\n";

    // Select arm:
    int arm=select_arm(theta);

    // Observe reward:
    NumericVector rewards = generate_rewards(prob, method);
    int r = rewards[arm];

    // Update choices in T:
    policy[T-1] = arm+1; //update policy
    regret[T-1] = regret[T-1] - prob[arm]; // compute regret in this period
    times_chosen[arm] = times_chosen[arm] + 1; // update counter

    // Update:
    T++;
    if (r==1) {
      successes[arm]++;
    } else {
      failures[arm]++;
    }
    // Check if modulus is zero:
    int update_this_round = T % update_every == 0;
    if (update_this_round) {
      theta=posterior_means(successes, failures, method);
    }
  }

  // Output
  List output = List::create(
    Named("policy") = policy,
    Named("regret") = regret,
    Named("times_chosen") = times_chosen,
    Named("action_values") = theta,
    Named("successes") = successes,
    Named("failures") = failures
  );

  return output;
}

// Function to run simulation:
// [[Rcpp::export]]
NumericVector sim_thompson(
    double n,
    int horizon,
    double v_star,
    int K,
    NumericVector prob,
    int update_every=1,
    String method="bernoulli",
    Nullable<IntegerVector> successes_ = R_NilValue,
    Nullable<IntegerVector> failures_ = R_NilValue
) {
  NumericVector regret (horizon); // allocate memory
  double sim_counter=1.0; // counter
  while (sim_counter <= n) {
    Rcout << sim_counter << "/" << n << "\n";
    // Run algorithm:
    List policy = thompson(
      horizon,
      v_star,
      K,
      prob,
      update_every,
      method,
      successes_,
      failures_
    );
    NumericVector reg_sim=policy["regret"];
    regret = regret + reg_sim / n; // increment regret
    sim_counter++;
  }
  return(regret);
}

// You can include R code blocks in C++ files processed with sourceCpp
// (useful for testing and development). The R code will be automatically
// run after the compilation.
//
//
// /*** R
// prob <- c(0.5,rep(0.4,9))
// bernoulli_mab <- mab(prob, horizon = 10000)
// unpack(bernoulli_mab)
// thompson(
//   # n = 10,
//   horizon = horizon,
//   v_star = v_star,
//   K = K,
//   prob = prob,
//   update_every = 1,
//   method = method,
//   successes_ = NULL,
//   failures_ = NULL
// )
// */

B Session info

utils::sessionInfo()

## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS High Sierra 10.13.6
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] Rcpp_1.0.6        scales_1.1.1      ggplot2_3.3.3     data.table_1.14.0
## 
## loaded via a namespace (and not attached):
##  [1] pillar_1.6.0      compiler_4.0.3    highr_0.8         tools_4.0.3      
##  [5] digest_0.6.27     evaluate_0.14     lifecycle_1.0.0   tibble_3.1.1     
##  [9] gtable_0.3.0      pkgconfig_2.0.3   png_0.1-7         rlang_0.4.10     
## [13] yaml_2.2.1        xfun_0.20         withr_2.4.2       stringr_1.4.0    
## [17] dplyr_1.0.2       knitr_1.30        generics_0.1.0    vctrs_0.3.7      
## [21] grid_4.0.3        tidyselect_1.1.0  glue_1.4.2        R6_2.5.0         
## [25] fansi_0.4.2       rmarkdown_2.6     bookdown_0.20     purrr_0.3.4      
## [29] magrittr_2.0.1    ellipsis_0.3.1    htmltools_0.5.1.1 colorspace_2.0-0 
## [33] utf8_1.2.1        stringi_1.5.3     munsell_0.5.0     crayon_1.4.1

It has in part been an exercise for me to get more familiar with Rcpp.↩︎
On my personal device, an old single-core MacBook Pro ’13 from mid-2010, these computations took 13 hours over all.↩︎

Problem Set 1

Thompson sampling

Patrick Altmeyer

13 May, 2021

1 An empirical evalutation of Thompson Sampling (Chapelle and Li 2011)

1.1 A heads-up on the code

1.2 Setting up the Bernoulli MAB

1.3 UCB

1.3.1 Example

1.4 Thompson sampling

1.4.1 Example

1.5 Performance benchmark

2 Simulations

2.1 Continuous updates

2.1.1 Lower bound

2.1.2 Algorithms

2.1.3 Chart

2.2 Discrete updates

References

Annex

A `Rcpp` code

A.1 UCB

A.2 Thompson Sampling

B Session info

Problem Set 1

Thompson sampling

Patrick Altmeyer

13 May, 2021

1 An empirical evalutation of Thompson Sampling (Chapelle and Li 2011)

1.1 A heads-up on the code

1.2 Setting up the Bernoulli MAB

1.3 UCB

1.3.1 Example

1.4 Thompson sampling

1.4.1 Example

1.5 Performance benchmark

2 Simulations

2.1 Continuous updates

2.1.1 Lower bound

2.1.2 Algorithms

2.1.3 Chart

2.2 Discrete updates

References

Annex

A Rcpp code

A.1 UCB

A.2 Thompson Sampling

B Session info

A `Rcpp` code