Sampling equiprobably with dice

An illustrative example of how to compute with the probabilities associated to dice, as well as the analysis of algorithms, is the following problem. Suppose you have a group of 19 individuals and an ordinary six-sided die. Your task is to use that die to select one of these individuals equiproabably, i.e. so that all individuals are equally likely, having probability $$1/19$$ of being selected, and using as few rolls of the die as possible. This is an example of the more general problem of using dice to sample a range equiprobably, where the range has no common factor with the number of faces of the die, which is in turn equivalent to one of the key problems of the algorithmics of random number generators. This is because with today's hardware, even if a random number generator samples a physical source, it will need to load the sample into a discrete register consisting of a finite number of bits. Hence the question arises of how to sample equiprobably from a range not being a power of two using a random number generator that returns some fixed number of random bits.

The main result is that the expected number $$E[T]$$ of rolls when an n-sided die is used to select equiprobably from a range m where $$n < m$$ and $$(n,m) = 1$$ and the discrete random variable T represents the number of rolls, obeys the bound
 * $$E[T] \le \lfloor \log_n m \rfloor + 1 +

\frac{m-1}{m} \frac{n}{n-1}.$$ This article presents the basic concepts and a basic analysis of the problem. For a much more sophisticated treatment of the case $$n=2$$, consult the article by H. Prodinger, who also discusses the problem in terms of leader election on a network. (The reference is here).

Building the algorithm
First, we return to the case of an ordinary die and nineteen individuals. A moment's thought reveals that a bounded number of rolls of the die, say k, will never suffice, as there is no way to distribute the values from one to nineteen among the $$6^k$$ possible outcomes so that all values are equally likely. We note, however, that if $$6^k$$ is large, and we use $$q \, \bmod \,19$$, where $$q\in [0, 6^k)$$ is the outcome, to choose the individual, the probabilities will be very close to $$\frac{1}{19}$$, being perturbed only slightly by the remainder $$r = 6^k \,\bmod \,19.$$ This suggests that we roll the die until the number $$6^k$$ of possible outcomes is at least nineteen, and take $$q \, \bmod \,19$$, if $$q < 6^k - r.\,$$ If $$q \ge 6^k - r,$$ however, we will need to roll the die again.

Recall that we require as few rolls of the die as possible. Therefore we should make use of the $$r$$ discarded outcomes that trigger another set of rolls, so as to not loose any valuable random bits. Each of these discarded outcomes may combine with any one of the outcomes of the next set of rolls, forming $$6r$$ possible pairs. This finally suggests the following procedure. Introduce the variable $$c_k$$, where k is the number of rolls, and set $$c_0 = 1.$$ This is the sequence of remainders r. Similarly, introduce the variable $$q_k$$ and set $$q_0 = 0.$$ This is the index into the remainder interval if applicable. At every step, roll the die, obtaining a value q between zero and five, i.e. modulo six. Let $$c_k = 6 c_{k-1} \,\bmod\, 19$$ and $$p = 6 q_{k-1} + q.\,$$ If $$p < 6 c_{k-1} - c_k\,$$, take $$p \, \bmod \,19$$ and halt, otherwise, set $$q_k = p - (6 c_{k-1} - c_k)\,$$ and repeat.

This algorithm is actually quite simple. Roll the die. If the value obtained is below the remainder interval, take the value modulo nineteen. If not, use six times the remainder interval as the new available range. Roll again and combine with the previous index into the remainder interval to obtain an index into the new range, and repeat.

Analysis
This procedure may be analysed in various ways. The simplest is to fix an individual and compute the probability of him/her being chosen. If this turns out to be $$\frac{1}{19}$$, then the algorithm is indeed correct. With this in mind we introduce the probability $$p_k$$ of the individual being chosen after k rolls, and make use of the fact that the sequence of remainders must repeat with a period that divides $$\varphi(19) = 18$$, according to Fermat's little theorem.

We find that

p_1 = \frac{0}{6} + \frac{6}{6} p_2, \quad p_2 = \frac{1}{36} + \frac{17}{36} p_3, \quad p_3 = \frac{5}{102} + \frac{7}{102} p_4 $$



p_4 = \frac{2}{42} + \frac{4}{42} p_5, \quad p_5 = \frac{1}{24} + \frac{5}{24} p_6, \quad p_6 = \frac{1}{30} + \frac{11}{30} p_7 $$



p_7 = \frac{3}{66} + \frac{9}{66} p_8, \quad p_8 = \frac{2}{54} + \frac{16}{54} p_9, \quad p_9 = \frac{5}{96} + \frac{1}{96} p_1. $$

This yields $$p_1 = \frac{1}{19}$$, so the algorithm is correct. In fact we have
 * $$p_1 = p_2 = p_3 = \ldots = p_9 = \frac{1}{19}

\quad \mbox{because} \quad \frac{1}{19} = \frac{1}{19} x + (1-x) \frac{1}{19}. $$

We introduce $$t_k$$, the expected number of rolls after k rolls, in order to verify that the algorithm fulfills the requirement of needing few rolls. Using the same probabilities as above, we find that

t_1 = \frac{0}{6} + \frac{6}{6} (1 + t_2), \quad t_2 = \frac{19}{36} + \frac{17}{36} (1 + t_3), \quad t_3 = \frac{95}{102} + \frac{7}{102} (1+ t_4) $$



t_4 = \frac{38}{42} + \frac{4}{42} (1 + t_5), \quad t_5 = \frac{19}{24} + \frac{5}{24} (1 + t_6), \quad t_6 = \frac{19}{30} + \frac{11}{30} (1 + t_7) $$



t_7 = \frac{57}{66} + \frac{9}{66} (1 + t_8), \quad t_8 = \frac{38}{54} + \frac{16}{54} (1 + t_9), \quad t_9 = \frac{95}{96} + \frac{1}{96} (1 + t_1). $$

This yields $$t_1 = \frac{25281276}{10077695} \approx 2.508636747,$$ so we need on average two and a half rolls of a six-sided die to select one of nineteen individuals equiprobably.

The same result may be obtained by introducing the probability generating function $$f(x)$$ given by

f(x) = \frac{19}{6^2} x^2 + \frac{95}{6^3} x^3 + \frac{38}{6^4} x^4 + \frac{19}{6^5} x^5 + \frac{19}{6^6} x^6 + \frac{57}{6^7} x^7 + \frac{38}{6^8} x^8 + \frac{95}{6^9} x^9 + \frac{1}{6^9} x^9 f(x)$$ and using the fact that
 * $$ t_1 = \left. \frac{d}{dx} f(x) \right|_{x=1}

= \frac{25281276}{10077695}.$$ The correctness of this equivalent approach will be shown in the next section.

Analysis of the general case
The analysis of the general case, i.e. the expected number of rolls where an n-sided die is used to select equiprobably from a range m where $$n < m$$ and $$(n, m)=1$$ is very instructive, and serves to illustrate the techniques used in the manipulation of discrete random variables.

Here it is useful to introduce an infinite tree, consisting of internal nodes and of outdegree n, where each level of the tree represents the $$n^k$$ possible outcomes after k rolls, the root being the initial state (zero rolls). We may think of the levels being divided from left to right into three zones, a blue zone, a green one, and a red one. The blue nodes represent outcomes that correspond to a halt after fewer than k rolls, the green ones, to a halt at the current level (exactly k rolls) and the red ones, to the remainder case (another roll is required). All children of the blue nodes are blue, as are children of the green nodes. There are $$c_k$$ red nodes, and of their children, $$c_{k+1} = n c_k \, \bmod \, m $$ are in red in turn, while $$ n c_k - c_{k+1}\, $$ are green. The number of green nodes is always a multiple of m and a fraction of $$1/m$$ of them correspond to a particular individual or value, thus showing by inspection that the algorithm samples the range equiprobably.

The three colors correspond to three sequences, $$(a_k), (b_k),$$ and $$(c_k)$$, where $$a_k/n^k$$ is the probability of having halted after fewer than k rolls, $$b_k/n^k$$ is the probability of halting after exactly k rolls, and $$c_k/n^k$$ is the probability of needing at least $$k+1$$ rolls. These sequences obey the following recurrences:
 * $$a_{k+1} = n a_k + n b_k, \quad

b_{k+1} = n c_k - c_{k+1}, \quad \mbox{and} \quad c_{k+1} = n c_k\, \bmod \, m$$ as well as
 * $$a_k + b_k + c_k = \,n^k.$$

Let T be the random variable giving the number of rolls. The fact that $$c_k/n^k$$ is the probability of needing at least $$k+1$$ rolls may also be derived from the conditional probabilities of needing another roll at any time.
 * $$P[T \ge k+1] =

\frac{c_k}{b_k+c_k} \frac{c_{k-1}}{b_{k-1}+c_{k-1}} \cdots \frac{c_1}{c_1+b_1}.$$ But $$b_k + c_k = \, n c_{k-1},$$ so this is

\frac{c_k}{n c_{k-1}} \frac{c_{k-1}}{n c_{k-2}} \cdots \frac{c_1}{n c_0} = \frac{c_k}{n^k}.$$

Using the definition of the expected value, we find that

E[T] = \sum_{k\ge 1} k P[T=k] = \sum_{k \ge 1} P[T\ge k] = \sum_{k\ge 0} \frac{c_k}{n^k}.$$ Recalling the definition of $$c_k$$, this becomes

\begin{array}{lcl} E[T] & = & \sum_{k\ge 0} \frac{n^k \, \bmod \, m}{n^k} \\ & = & \sum_{n^k < m} 1 + \sum_{n^k > m} \frac{n^k \, \bmod \, m}{n^k} \\ & = & \lfloor \log_n m \rfloor + 1 + \frac{1}{n^{\lfloor \log_n m \rfloor + 1}} \sum_{k\ge 0} \frac{n^{k + \lfloor \log_n m \rfloor + 1} \, \bmod \, m}{n^k} \\ & \le & \lfloor \log_n m \rfloor + 1 + \frac{1}{n^{\lfloor \log_n m \rfloor + 1}} \sum_{k\ge 0} \frac{m-1}{n^k} \\ & \le & \lfloor \log_n m \rfloor + 1 + \frac{1}{n^{\log_n m -1 + 1}} (m-1) \frac{1}{1-1/n} \\ & = & \lfloor \log_n m \rfloor + 1 + \frac{m-1}{m} \frac{n}{n-1}. \end{array} $$

Hence we need about two more rolls than $$\lfloor \log_n m \rfloor$$ to select equiprobably from m individuals using an n-sided die.