--- title: "Getting Started with NNS: Comparing Distributions" author: "Fred Viole" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with NNS: Comparing Distributions} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE, message=FALSE} knitr::opts_chunk$set(echo = TRUE) library(NNS) library(data.table) data.table::setDTthreads(2L) options(mc.cores = 1) Sys.setenv("OMP_THREAD_LIMIT" = 2) ``` ```{r setup2,message=FALSE,warning = FALSE} library(NNS) library(data.table) require(knitr) require(rgl) ``` # Comparing Distributions **`NNS`** offers a multitude of ways to test if distributions came from the same population, or if they share the same mean or median. The underlying function for these tests is **`NNS.ANOVA()`**. The output from **`NNS.ANOVA()`** is a `Certainty` statistic, which compares CDFs of distributions from several shared quantiles and normalizes the similarity of these points to be within the interval $[0,1]$, with 1 representing identical distributions. For a complete analysis of `Certainty` to common p-values and the role of power, please see the [References](#References). ## Test if Same Population Below we run the analysis to whether automatic transmissions and manual transmissions have significantly different `mpg` distributions per the `mtcars` dataset. The plot on the left shows the robust `Certainty` estimate, reflecting the distribution of `Certainty` estimates over 100 random permutations of both variables. The plot on the right illustrates the control and treatment variables, along with the grand mean among variables, and the confidence interval associated with the control mean. ```{r cars, fig.width=7, fig.align='center'} mpg_auto_trans = mtcars[mtcars$am==1, "mpg"] mpg_man_trans = mtcars[mtcars$am==0, "mpg"] NNS.ANOVA(control = mpg_man_trans, treatment = mpg_auto_trans, robust = TRUE) ``` The `Certainty` shows that these two distributions clearly do not come from the same population. This is verified with the Mann-Whitney-Wilcoxon test, which also does not assume a normality to the underlying data as a nonparametric test of identical distributions. ```{r cars2, warning=FALSE} wilcox.test(mpg ~ am, data=mtcars) ``` ## Test if means are Equal Here we provide the output from **`NNS.ANOVA()`** and `t.test()` functions on two Normal distribution samples, where we are pretty certain these two means are equal. ```{r equalmeans, echo=TRUE, fig.width=7, fig.align='center'} set.seed(123) x = rnorm(1000, mean = 0, sd = 1) y = rnorm(1000, mean = 0, sd = 2) NNS.ANOVA(control = x, treatment = y, means.only = TRUE, robust = TRUE, plot = TRUE) t.test(x,y) ``` ## Test if means are Unequal By altering the mean of the `y` variable, we can start to see the sensitivity of the results from the two methods, where both firmly reject the null hypothesis of identical means. ```{r unequalmeans, echo=TRUE, fig.width=7, fig.align='center'} set.seed(123) x = rnorm(1000, mean = 0, sd = 1) y = rnorm(1000, mean = 1, sd = 1) NNS.ANOVA(control = x, treatment = y, means.only = TRUE, robust = TRUE, plot = TRUE) t.test(x,y) ``` The effect size from **`NNS.ANOVA()`** is calculated from the confidence interval of the control mean and the specified `y` shift of 1 is within the provided lower and upper effect boundaries. ## Medians In order to test medians instead of means, simply set both `means.only = TRUE` and `medians = TRUE` in **`NNS.ANOVA()`**. ```{r unequalmedians, echo=TRUE, fig.width=7, fig.align='center'} NNS.ANOVA(control = x, treatment = y, means.only = TRUE, medians = TRUE, robust = TRUE, plot = TRUE) ``` # Stochastic Dominance Another method of comparing distributions involves a test for stochastic dominance. The first, second, and third degree stochastic dominance tests are available in **`NNS`** via: - **`NNS.FSD()`** - **`NNS.SSD()`** - **`NNS.TSD()`** ```{r stochdom, fig.width=7, fig.align='center'} NNS.FSD(x, y) ``` **`NNS.FSD()`** correctly identifies the shift in the `y` variable we specified when testing for unequal means. ## Stochastic Dominant Efficient Sets **`NNS`** also offers the ability to isolate a set of variables that do not have any dominated constituents with the **`NNS.SD.efficient.set()`** function. `x2, x4, x6, x8` all dominate their preceding distributions yet do not dominate one another, and are thus included in the first degree stochastic dominance efficient set. ```{r stochdomset, eval=FALSE} set.seed(123) x1 = rnorm(1000) x2 = x1 + 1 x3 = rnorm(1000) x4 = x3 + 1 x5 = rnorm(1000) x6 = x5 + 1 x7 = rnorm(1000) x8 = x7 + 1 NNS.SD.efficient.set(cbind(x1, x2, x3, x4, x5, x6, x7, x8), degree = 1, status = FALSE) [1] "x4" "x2" "x8" "x6" ``` # References {#references} If the user is so motivated, detailed arguments and proofs are provided within the following: * [Continuous CDFs and ANOVA with NNS](https://doi.org/10.2139/ssrn.3007373) * [A Note on Stochastic Dominance](https://doi.org/10.2139/ssrn.3002675) * [LPM Density Functions for the Computation of the SD Efficient Set](http://dx.doi.org/10.4236/jmf.2016.61012) ```{r threads, echo = FALSE} Sys.setenv("OMP_THREAD_LIMIT" = "") ```