--- title: "Getting Started with NNS: Clustering and Regression" author: "Fred Viole" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Getting Started with NNS: Clustering and Regression} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ```{r setup, include=FALSE, message=FALSE} knitr::opts_chunk$set(echo = TRUE) library(NNS) library(data.table) data.table::setDTthreads(2L) options(mc.cores = 1) Sys.setenv("OMP_THREAD_LIMIT" = 2) ``` ```{r setup2, message=FALSE, warning=FALSE} library(NNS) library(data.table) require(knitr) require(rgl) ``` # Clustering and Regression Below are some examples demonstrating unsupervised learning with NNS clustering and nonlinear regression using the resulting clusters. As always, for a more thorough description and definition, please view the References. ## NNS Partitioning `NNS.part()` **`NNS.part`** is both a partitional and hierarchical clustering method. `NNS` iteratively partitions the joint distribution into partial moment quadrants, and then assigns a quadrant identification (1:4) at each partition. **`NNS.part`** returns a `data.table` of observations along with their final quadrant identification. It also returns the regression points, which are the quadrant means used in **`NNS.reg`**. ```{r linear} x = seq(-5, 5, .05); y = x ^ 3 for(i in 1 : 4){NNS.part(x, y, order = i, Voronoi = TRUE, obs.req = 0)} ``` ### X-only Partitioning **`NNS.part`** offers a partitioning based on $x$ values only **`NNS.part(x, y, type = "XONLY", ...)`**, using the entire bandwidth in its regression point derivation, and shares the same limit condition as partitioning via both $x$ and $y$ values. ```{r x part,results='hide'} for(i in 1 : 4){NNS.part(x, y, order = i, type = "XONLY", Voronoi = TRUE)} ``` Note the partition identifications are limited to 1's and 2's (left and right of the partition respectively), not the 4 values per the $x$ and $y$ partitioning. ```{r res2, echo=FALSE} NNS.part(x,y,order = 4, type = "XONLY") ``` ## Clusters Used in Regression The right column of plots shows the corresponding regression (plus endpoints and central point) for the order of `NNS` partitioning. ```{r depreg},results='hide'} for(i in 1 : 3){NNS.part(x, y, order = i, obs.req = 0, Voronoi = TRUE, type = "XONLY") ; NNS.reg(x, y, order = i, ncores = 1)} ``` # NNS Regression `NNS.reg()` **`NNS.reg`** can fit any $f(x)$, for both uni- and multivariate cases. **`NNS.reg`** returns a self-evident list of values provided below. ## Univariate: ```{r nonlinear,fig.width=5,fig.height=3,fig.align = "center"} NNS.reg(x, y, ncores = 1) ``` ## Multivariate: Multivariate regressions return a plot of $y$ and $\hat{y}$, as well as the regression points (`$RPM`) and partitions (`$rhs.partitions`) for each regressor. ```{r nonlinear multi,fig.width=5,fig.height=3,fig.align = "center"} f = function(x, y) x ^ 3 + 3 * y - y ^ 3 - 3 * x y = x ; z <- expand.grid(x, y) g = f(z[ , 1], z[ , 2]) NNS.reg(z, g, order = "max", plot = FALSE, ncores = 1) ``` ## Inter/Extrapolation `NNS.reg` can inter- or extrapolate any point of interest. The **`NNS.reg(x, y, point.est = ...)`** parameter permits any sized data of similar dimensions to $x$ and called specifically with **`NNS.reg(...)$Point.est`**. ## NNS Dimension Reduction Regression **`NNS.reg`** also provides a dimension reduction regression by including a parameter **`NNS.reg(x, y, dim.red.method = "cor", ...)`**. 
## NNS Dimension Reduction Regression

**`NNS.reg`** also provides a dimension reduction regression via the parameter **`NNS.reg(x, y, dim.red.method = "cor", ...)`**, reducing all regressors to a single dimension using the returned equation **`NNS.reg(..., dim.red.method = "cor", ...)$equation`**.

```{r nonlinear_class, fig.width=5, fig.height=3, fig.align="center", message=FALSE}
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1)$equation
```

```{r nonlinear_class2, fig.width=5, fig.height=3, fig.align="center", message=FALSE, echo=FALSE}
a = NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1, plot = FALSE)$equation
```

Thus, our model for this regression would be:

$$Species = \frac{`r round(a$Coefficient[1],3)`*Sepal.Length `r round(a$Coefficient[2],3)`*Sepal.Width + `r round(a$Coefficient[3],3)`*Petal.Length + `r round(a$Coefficient[4],3)`*Petal.Width}{4}$$
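To make the reduction explicit, the single synthetic regressor implied by the equation can be reproduced by hand. Below is a minimal sketch (not part of the original vignette), assuming `a` holds the `$equation` output generated above, with its first four coefficients corresponding to the four regressors:

```{r dim_red_sketch}
# Sketch: weight each regressor by its returned coefficient and average over
# the 4 regressors, mirroring the Species equation above.
reduced = as.matrix(iris[ , 1 : 4]) %*% a$Coefficient[1 : 4] / 4

head(reduced)
```

`NNS.reg` then fits the univariate regression of `Species` on this single reduced regressor.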
### Threshold

**`NNS.reg(x, y, dim.red.method = "cor", threshold = ...)`** offers a method of reducing regressors further by controlling the absolute value of the correlation required for a regressor to be retained.

```{r nonlinear class threshold, fig.width=5, fig.height=3, fig.align="center"}
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1)$equation
```

```{r nonlinear class threshold 2, fig.width=5, fig.height=3, fig.align="center", echo=FALSE}
a = NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1, plot = FALSE)$equation
```

Thus, our model for this further reduced dimension regression would be:

$$Species = \frac{`r round(a$Coefficient[1],3)`*Sepal.Length + `r round(a$Coefficient[2],3)`*Sepal.Width + `r round(a$Coefficient[3],3)`*Petal.Length + `r round(a$Coefficient[4],3)`*Petal.Width}{3}$$

The `point.est = (...)` parameter operates in the same manner as in the full regression above, and is again called with **`NNS.reg(...)$Point.est`**.

```{r final, fig.width=5, fig.height=3, fig.align="center"}
NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
```

# Classification

For a classification problem, we simply set **`NNS.reg(x, y, type = "CLASS", ...)`**.

**NOTE: The base category of the response variable should be 1, not 0, for classification problems.**

```{r class, fig.width=5, fig.height=3, fig.align="center", message=FALSE}
NNS.reg(iris[ , 1 : 4], iris[ , 5], type = "CLASS", point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est
```

# Cross-Validation `NNS.stack()`

The **`NNS.stack`** routine cross-validates, for a given objective function, the `n.best` parameter in the multivariate **`NNS.reg`** function as well as the `threshold` parameter in the dimension reduction **`NNS.reg`** version. **`NNS.stack`** can be used for classification, **`NNS.stack(..., type = "CLASS", ...)`**, or continuous dependent variables, **`NNS.stack(..., type = NULL, ...)`**.

Any objective function `obj.fn` can be called using `expression()` with the terms `predicted` and `actual`, even from external packages such as `Metrics`: **`NNS.stack(..., obj.fn = expression(Metrics::mape(actual, predicted)), objective = "min")`**.

```{r stack, fig.width=5, fig.height=3, fig.align="center", message=FALSE, eval=FALSE}
NNS.stack(IVs.train = iris[ , 1 : 4],
          DV.train = iris[ , 5],
          IVs.test = iris[1 : 10, 1 : 4],
          dim.red.method = "cor",
          obj.fn = expression( mean(round(predicted) == actual) ),
          objective = "max",
          type = "CLASS",
          folds = 1,
          ncores = 1)
```

```{r stackevalres, eval=FALSE}
Folds Remaining = 0 
Current NNS.reg(... , threshold = 0.935 ) MAX Iterations Remaining = 2 
Current NNS.reg(... , threshold = 0.795 ) MAX Iterations Remaining = 1 
Current NNS.reg(... , threshold = 0.44 ) MAX Iterations Remaining = 0 
Current NNS.reg(... , n.best = 1 ) MAX Iterations Remaining = 12 
Current NNS.reg(... , n.best = 2 ) MAX Iterations Remaining = 11 
Current NNS.reg(... , n.best = 3 ) MAX Iterations Remaining = 10 
Current NNS.reg(... , n.best = 4 ) MAX Iterations Remaining = 9 
$OBJfn.reg
[1] 1

$NNS.reg.n.best
[1] 4

$probability.threshold
[1] 0.43875

$OBJfn.dim.red
[1] 0.9666667

$NNS.dim.red.threshold
[1] 0.935

$reg
[1] 1 1 1 1 1 1 1 1 1 1

$reg.pred.int
NULL

$dim.red
[1] 1 1 1 1 1 1 1 1 1 1

$dim.red.pred.int
NULL

$stack
[1] 1 1 1 1 1 1 1 1 1 1

$pred.int
NULL
```

## Increasing Dimensions

Since multicollinearity is not an issue for nonparametric regressions as it is for OLS, a better option for an ill-fitting univariate model may be to increase the dimensionality of the regressors with a copy of itself and cross-validate the number of clusters `n.best` via: **`NNS.stack(IVs.train = cbind(x, x), DV.train = y, method = 1, ...)`**.

```{r stack2, message=FALSE, fig.width=5, fig.height=3, fig.align="center", results='hide', eval=FALSE}
set.seed(123)
x = rnorm(100); y = rnorm(100)

nns.params = NNS.stack(IVs.train = cbind(x, x),
                       DV.train = y,
                       method = 1, ncores = 1)
```

```{r stack2optim, echo=FALSE}
set.seed(123)
x = rnorm(100); y = rnorm(100)

nns.params = list()
nns.params$NNS.reg.n.best = 100
```

```{r stack2res, fig.width=5, fig.height=3, fig.align="center", results='hide'}
NNS.reg(cbind(x, x), y,
        n.best = nns.params$NNS.reg.n.best,
        point.est = cbind(x, x),
        residual.plot = TRUE,
        ncores = 1,
        confidence.interval = .95)
```

# References

If the user is so motivated, detailed arguments and further examples are provided within the following:

* [Nonlinear Nonparametric Statistics: Using Partial Moments](https://github.com/OVVO-Financial/NNS/blob/NNS-Beta-Version/examples/index.md)
* [Deriving Nonlinear Correlation Coefficients from Partial Moments](https://doi.org/10.2139/ssrn.2148522)
* [Nonparametric Regression Using Clusters](https://doi.org/10.1007/s10614-017-9713-5)
* [Clustering and Curve Fitting by Line Segments](https://doi.org/10.2139/ssrn.2861339)
* [Classification Using NNS Clustering Analysis](https://doi.org/10.2139/ssrn.2864711)
* [Partitional Estimation Using Partial Moments](https://doi.org/10.2139/ssrn.3592491)

```{r threads, echo=FALSE}
Sys.setenv("OMP_THREAD_LIMIT" = "")
```