Fast sampling methods

Random sampling from a fixed set is used in many areas of statistical computing. The performance of this operation can be critical, especially when the sampled set is large. The fast RNGs provided in this package make very fast sampling possible when combined with suitably fast algorithms.

Benchmarks

By combining fast RNGs with a fast methods for creating integers in a range one gets good performance for sampling with replacement:

library(dqrng)
m <- 1e6
n <- 1e4
bm <- bench::mark(sample.int(m, n, replace = TRUE),
                  sample.int(1e4*m, n, replace = TRUE),
                  dqsample.int(m, n, replace = TRUE),
                  dqsample.int(1e4*m, n, replace = TRUE),
                  check = FALSE)

expression	min	median	itr/sec	mem_alloc
sample.int(m, n, replace = TRUE)	433.3µs	475µs	1999.342	49.1KB
sample.int(10000 * m, n, replace = TRUE)	846µs	900.4µs	1072.688	80.7KB
dqsample.int(m, n, replace = TRUE)	47.6µs	50µs	17170.274	58.2KB
dqsample.int(10000 * m, n, replace = TRUE)	53.3µs	57.5µs	14311.590	82.9KB

Note that sampling from 10^10 integers triggers “long-vector support” in R.

When sampling without replacement one has to consider an appropriate algorithm for making sure that no entry is repeated. When more than 50% of the population are sampled, dqrng shuffles an appropriate part of the full list and returns that. The algorithm used in R is similar but dqrng has the edge with respect to performance:

library(dqrng)
m <- 1e6
n <- 6e5
bm <- bench::mark(sample.int(m, n),
                  dqsample.int(m, n),
                  check = FALSE, min_iterations = 50)

expression	min	median	itr/sec	mem_alloc
sample.int(m, n)	55.96ms	63.7ms	15.45168	6.11MB
dqsample.int(m, n)	8.16ms	10.3ms	90.04322	6.1MB

For lower sampling ratios a set based rejection sampling algorithm is used by dqrng. In principle, R can make use of a similar algorithm based on a hashset. However, it is only used for larger input vectors even though it is faster than the default method. The algorithm in dqrng, which is based on a bitset, is even faster, though:

library(dqrng)
m <- 1e6
n <- 1e4
bm <- bench::mark(sample.int(m, n),
                  sample.int(m, n, useHash = TRUE),
                  dqsample.int(m, n),
                  check = FALSE)

expression	min	median	itr/sec	mem_alloc
sample.int(m, n)	1.19ms	1.94ms	433.0223	3.85MB
sample.int(m, n, useHash = TRUE)	590.9µs	657.81µs	1427.8970	169.65KB
dqsample.int(m, n)	99.66µs	102.62µs	8678.4395	39.11KB

As one decreases the sampling rate even more, dqrng switches to a hashset based rejection sampling. Both hashset based methods have similar performance and are much faster than R’s default method.

library(dqrng)
m <- 1e6
n <- 1e2
bm <- bench::mark(sample.int(m, n),
                  sample.int(m, n, useHash = TRUE),
                  dqsample.int(m, n),
                  check = FALSE)

expression	min	median	itr/sec	mem_alloc
sample.int(m, n)	478.87µs	692.85µs	799.9916	3.82MB
sample.int(m, n, useHash = TRUE)	14.64µs	16.57µs	56302.7572	3.98KB
dqsample.int(m, n)	4.76µs	5.42µs	158618.5998	448B

For larger sampling ranges R uses the hashset by default, though dqsample.int is still faster:

library(dqrng)
m <- 1e10
n <- 1e5
bm <- bench::mark(sample.int(m, n),
                  dqsample.int(m, n),
                  check = FALSE)

expression	min	median	itr/sec	mem_alloc
sample.int(m, n)	12.45ms	16.43ms	58.97732	1.76MB
dqsample.int(m, n)	1.98ms	3.04ms	302.74952	781.3KB

Technicalities

The following methods are used for sampling without replacement. The algorithms are presented in R-like pseudo code, even though the real implementation is in C++. For sampling rates above 50%, a partial Fisher-Yates shuffle is used:

no_replace_shuffle <- function(m, n) {
  tmp <- seq_len(m)
  for (i in seq_len(n))
    swap(tmp[i], tmp[i + random_int(m-i)])
  tmp[1:n]
}

where random_int(m-i) returns a random integer in [0, m-i]. Since the full population is kept in memory, this method is only suitable for high selection rates. One could expect that reservoir sampling should work well for lower selection rates. However, in my tests set based algorithms were faster:

no_replace_set <- function(m, n) {
  result <- vector(mode = "...", length = n) # integer or numeric
  elems <- new(set, m, n) # set object for storing n objects out of m possible values
  for (i in seq_len(n))
    while (TRUE) {
      v = random_int(m)
      if (elems.insert(v)) {
        result[i] = v
        break
      }
    }
  result
}

Here elems.insert(v) returns TRUE if the insert was successful, i.e. v was not in elems before, and FALSE otherwise. There are different strategies for implementing such a set. For intermediate sampling rates (currently between 0.1% and 50%) dqrng uses a bitset, i.e. a vector of m bits each representing one of the possible values. For lower sampling rates the memory usage of this algorithm is to expensive, which is why a hashset¹ is used, since there the used memory scales with n and not with m. One could expect that Robert Floyd’s sampling algorithm would be superior, but this was not the case in my tests, probably because it requires a final shuffling of the result to get a random permutation instead of a random combination.

For the specialists: Open addressing with a power-of-two size between 1.5 and 3 times n, identity hash function for the stored integers and quadratic probing.↩︎

Fast sampling methods

Ralf Stubner

2023-11-29

Benchmarks

Technicalities