--- title: "Ecume" author: "Hector Roux de BĂ©zieux" bibliography: Ecume.bib date: '`r format(Sys.time(), "%d %B , %Y")`' output: rmarkdown::html_document: toc: true toc_depth: 3 vignette: > %\VignetteIndexEntry{Ecume} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r packages, include=F} library(knitr) opts_chunk$set( fig.pos = "!h", out.extra = "", warning = FALSE, message = FALSE, fig.align = "center" ) library(stats) ``` # Aim of the package The _Ecume_ package provides statistical methods to test whether samples are from the same or distinct distributions. It contains non-parametric tests for different settings. # Two sample tests for univariate distribution ## General Setting Consider two distributions $P_1$ and $P_2$. We want to test the null hypothesis $H_0: P_1 = P_2$. To do this, we have two sets of observations $x_i\sim P_1$, with $i\in[1,\ldots m]$ and $y_j\sim P_1$, with $j\in[1,\ldots n]$. ## Kolmogorov-Smirnov test The Kolmogorov-Smirnov statistic for two samples relies on the empirical cumulative distributions $F_x$ and $F_y$ for **univariate** distributions and is computed as $$D_{n,m}=\sup _{x\in\{x_i, y_j\},i\in[1,\ldots m], j\in[1,\ldots n]}|F_x(x)-F_y(x)|$$ We can compute the distribution of $D_{n,m}$ under the null and device a test. If the distributions are identical we expect to not reject the null about 95% of the time, for a nominal level of $.05$. ```{r} set.seed(20) x <- rnorm(100, 0, 1) y <- rnorm(200, 0, 1) ks.test(x, y) ``` However, if the distributions are not identical, we see that we would reject the null hypothesis. ```{r} set.seed(20) x <- rnorm(100, 0, 1) y <- rnorm(200, 0, 2) ks.test(x, y) ``` ## Adding observation weights and threshold Now, let us imagine we also have observations weights $w_{x, i}$ and $w_{y, j}$ for all samples. Moreover, instead of testing $H_0:D_{m,n} =0$, we want to test against an effect size, i.e. $H_0:D_{m,n} \leq c$. Relying on [@Monahan2011], we can do this using the `ks_test` function: ```{r} library(Ecume) set.seed(20) x <- rnorm(100, 0, 1) w_x <- runif(100, 0, 1) y <- rnorm(200, 0, 1) w_y <- runif(200, 0, 1) ks_test(x = x, y = y, w_x = w_x, w_y = w_y, thresh = .01) ``` ```{r} set.seed(20) x <- rnorm(100, 0, 1) w_x <- runif(100, 0, 1) y <- rnorm(200, 0, 2) w_y <- runif(200, 0, 1) ks_test(x = x, y = y, w_x = w_x, w_y = w_y, thresh = .01) ``` # Multivariate Distributions However, the KS test does not work with multivariate distributions. Other statistics have been proposed. Here, we have implemented two ## The two-sample kernel test from [@Gretton2012] The two-sample kernel test relies on a kernel function $(x, y)\rightarrow k(x, y)$ and the Mean Maximum Discrepancy: $$MMD^2_u = \frac{1}{m(m-1)}\sum_{x\neq x'}k(x, x') + \frac{1}{n(n-1)}\sum_{y\neq y'}k(y, y') - \frac{2}{mn}\sum_{x, y}k(x, y)$$ While this statistic has some closed-form bounds under the null for some kernels, we can also compute its distribution using permutations ```{r} set.seed(20) x <- matrix(c(rnorm(100, 0, 1), rnorm(100, 0, 1)), ncol = 2) y <- matrix(c(rnorm(200, 0, 2), rnorm(200, 0, 1)), ncol = 2) mmd_test(x = x, y = y, iterations = 10^4) ``` If the number of samples is too large, we can use a "linear" form of the statistics that samples elements from the sums above. ```{r} set.seed(20) x <- matrix(c(rnorm(100, 0, 1), rnorm(100, 0, 1)), ncol = 2) y <- matrix(c(rnorm(200, 0, 2), rnorm(200, 0, 1)), ncol = 2) mmd_test(x = x, y = y, iterations = 10^4, type = "linear") ``` ## Classifier Test If we split the data into a training and test set and train a classifier $Cl$ on the training set to distinguish between points from $P_1$ and $P_2$, then under the null, the classifier will not do better than chance on the test set. So if we have $n_{test}$ samples in the test set, then the number of correctly classified points $c_{Cl, n_{test}}$ follows $$ c_{Cl, n_{test}}\sim_{H_0} Binom(.5, n_{test})$$ This provides a valid test statistic and its distribution under the null. The quality of the classifier only matters for the power of the test. To provide flexibility for the users, we rely on the `caret` package [@caret]. By default, we use a $k$-NN classifier where $k$ is selected through cross-validation on the training set. ```{r} set.seed(20) x <- matrix(c(rnorm(200, 0, 1), rnorm(200, 0, 1)), ncol = 2) y <- matrix(c(rnorm(200, 0, 2), rnorm(200, 0, 1)), ncol = 2) classifier_test(x = x, y = y) ``` # Considering more than 2 distributions Now we consider the case where we have more than two distributions. Instead we have $k$ distributions and we want to test the hypothesis $$H_0:\forall (i,j)\in[1\ldots k], P_i=P_j$$ The classifier test can be readily extended to more than two distributions. The main difference is that, under the null, $$ c_{Cl, n_{test}}\sim_{H_0} Binom(\frac{1}{k}, n_{test})$$ ```{r} set.seed(20) x1 <- matrix(c(rnorm(200, 0, 1), rnorm(200, 0, 1)), ncol = 2) x2 <- matrix(c(rnorm(200, 0, 2), rnorm(200, 0, 1)), ncol = 2) x3 <- matrix(c(rnorm(200, 1, 1), rnorm(200, 0, 1)), ncol = 2) classifier_test(x = list("x1" = x1, "x2" = x2, "x3" = x3)) ``` Note that here, we assume that we have as many as samples from each distribution. In case of class imbalance, we downsample everything to the smallest class. # References