Clustering and Regression

Below are some examples demonstrating unsupervised learning with NNS clustering and nonlinear regression using the resulting clusters. As always, for a more thorough description and definition, please view the References.

NNS Partitioning `NNS.part()`

NNS.part is both a partitional and hierarchical clustering method. NNS iteratively partitions the joint distribution into partial moment quadrants, and then assigns a quadrant identification (1:4) at each partition.

NNS.part returns a data.table of observations along with their final quadrant identification. It also returns the regression points, which are the quadrant means used in NNS.reg.

x = seq(-5, 5, .05); y = x ^ 3

for(i in 1 : 4){NNS.part(x, y, order = i, Voronoi = TRUE, obs.req = 0)}
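As a rough illustration of the first partition described above, the base R sketch below splits the joint distribution at the means of $x$ and $y$ and labels each observation with a quadrant identification in 1:4. It does not call NNS: the helper `quadrant_id()` and its numbering convention are assumptions made for illustration, and they may differ from the labels NNS.part assigns.

```r
# Same data as in the example above
x <- seq(-5, 5, 0.05)
y <- x^3

# Hypothetical helper (not part of NNS): one partition at the means of x and y.
# The numbering convention below is an assumption for illustration only.
quadrant_id <- function(x, y) {
  right <- x > mean(x)
  upper <- y > mean(y)
  ifelse(right & upper, 1L,
  ifelse(!right & upper, 2L,
  ifelse(right & !upper, 3L, 4L)))
}

q <- quadrant_id(x, y)

# Quadrant means play the role of first-order regression points
quadrant_means <- aggregate(cbind(x, y) ~ quadrant,
                            data = data.frame(x, y, quadrant = q),
                            FUN = mean)
print(quadrant_means)
```

Because $y = x^3$ is monotone, nearly all observations fall in the two diagonal quadrants; repeating the split inside each quadrant gives the higher orders.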

X-only Partitioning

NNS.part offers a partitioning based on $x$ values only, via NNS.part(x, y, type = "XONLY", ...), which uses the entire bandwidth in its regression point derivation and shares the same limit condition as partitioning via both $x$ and $y$ values.

for(i in 1 : 4){NNS.part(x, y, order = i, type = "XONLY", Voronoi = TRUE)}

Note that the partition identifications are limited to 1’s and 2’s (left and right of the partition, respectively), rather than the four values used when partitioning on both $x$ and $y$, as the order 4 output below shows.

## $order
## [1] 4
## 
## $dt
##          x         y quadrant prior.quadrant
##   1: -5.00 -125.0000    q1111           q111
##   2: -4.95 -121.2874    q1111           q111
##   3: -4.90 -117.6490    q1111           q111
##   4: -4.85 -114.0841    q1111           q111
##   5: -4.80 -110.5920    q1111           q111
##  ---                                        
## 197:  4.80  110.5920    q2222           q222
## 198:  4.85  114.0841    q2222           q222
## 199:  4.90  117.6490    q2222           q222
## 200:  4.95  121.2874    q2222           q222
## 201:  5.00  125.0000    q2222           q222
## 
## $regression.points
##    quadrant          x            y
## 1:     q111 -4.4523966 -89.21983502
## 2:     q112 -3.2250000 -31.43670868
## 3:     q121 -2.0023966  -7.41841667
## 4:     q122 -0.7569582  -0.49480202
## 5:     q211  0.3739355   0.07811065
## 6:     q212  1.3499632   2.24821307
## 7:     q221  2.6206250  16.32999350
## 8:     q222  4.1955267  75.62094504
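The way the 1/2 identifications accumulate into labels such as q111 or q222 can be sketched in base R. The helper `xonly_ids()` below is hypothetical and not part of NNS: it assumes each segment is split at the mean of its own $x$ values, so the resulting regression points need not match the output above.

```r
x <- seq(-5, 5, 0.05)
y <- x^3

# Hypothetical helper (not part of NNS): repeatedly split every segment
# at the mean of its x values; 1 = left of the split, 2 = right of it.
xonly_ids <- function(x, order) {
  id <- rep("q", length(x))
  for (k in seq_len(order)) {
    seg_mean <- ave(x, id, FUN = mean)   # mean of x within each current segment
    id <- paste0(id, ifelse(x <= seg_mean, 1, 2))
  }
  id
}

ids <- xonly_ids(x, order = 3)

# Segment means of x and y, one row per final segment
reg_pts <- aggregate(cbind(x, y) ~ id,
                     data = data.frame(x, y, id = ids),
                     FUN = mean)
print(reg_pts)
```

Each additional order appends one digit to the label and at most doubles the number of segments.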

Clusters Used in Regression

The right column of plots shows the corresponding regression (plus endpoints and central point) for the order of NNS partitioning.

for(i in 1 : 3){NNS.part(x, y, order = i, obs.req = 0, Voronoi = TRUE, type = "XONLY") ; NNS.reg(x, y, order = i, ncores = 1)}
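To make the link between clusters and the regression concrete, here is a minimal base R sketch that connects cluster means and the two endpoints with straight lines. It is an illustration under stated assumptions rather than what NNS.reg does internally: the clusters are eight equal-width bins from `cut()`, used as a stand-in for NNS.part, and no central point is added.

```r
x <- seq(-5, 5, 0.05)
y <- x^3

# Assumed x-only clusters: eight equal-width bins on x (a stand-in for NNS.part)
bins <- cut(x, breaks = 8, labels = FALSE)

# Cluster means act as regression points; the endpoints of the data are added
px <- c(min(x), tapply(x, bins, mean), max(x))
py <- c(y[which.min(x)], tapply(y, bins, mean), y[which.max(x)])

# Piecewise-linear fit through the regression points
y_hat <- approx(px, py, xout = x)$y

# Share of the variance in y explained by the piecewise-linear fit
r2 <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
print(r2)
```

Using more bins gives more regression points and a closer fit, which mirrors the effect of raising the order in the plots.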

NNS Regression `NNS.reg()`

NNS.reg can fit any $f(x)$, for both uni- and multivariate cases. NNS.reg returns a list of values, shown below for each case.

Univariate:

NNS.reg(x, y, ncores = 1)

## $R2
## [1] 0.9999858
## 
## $SE
## [1] 0.1826798
## 
## $Prediction.Accuracy
## NULL
## 
## $equation
## NULL
## 
## $x.star
## NULL
## 
## $derivative
##     Coefficient X.Lower.Range X.Upper.Range
##  1: 74.25250000    -5.0000000    -4.9750000
##  2: 72.47650000    -4.9750000    -4.8500000
##  3: 68.69350000    -4.8500000    -4.7250000
##  4: 64.90716418    -4.7250000    -4.5854167
##  5: 60.99688312    -4.5854167    -4.4250000
##  6: 57.66583333    -4.4250000    -4.2875000
##  7: 52.28005556    -4.2875000    -4.1000000
##  8: 55.30688406    -4.1000000    -3.9562500
##  9: 39.54454023    -3.9562500    -3.7750000
## 10: 41.10328358    -3.7750000    -3.6354167
## 11: 38.00441558    -3.6354167    -3.4750000
## 12: 34.71126866    -3.4750000    -3.3354167
## 13: 31.86863636    -3.3354167    -3.1750000
## 14: 29.29651515    -3.1750000    -3.0375000
## 15: 25.78422222    -3.0375000    -2.8500000
## 16: 23.19455128    -2.8500000    -2.6875000
## 17: 20.16506410    -2.6875000    -2.5250000
## 18: 18.23350000    -2.5250000    -2.4000000
## 19: 16.36150000    -2.4000000    -2.2750000
## 20: 15.46658730    -2.2750000    -2.1437500
## 21: 12.09709877    -2.1437500    -1.9750000
## 22: 10.85119403    -1.9750000    -1.8354167
## 23:  9.17213483    -1.8354167    -1.6500000
## 24:  7.72313333    -1.6500000    -1.4937500
## 25:  5.66932099    -1.4937500    -1.3250000
## 26:  4.77560606    -1.3250000    -1.1875000
## 27:  3.65064103    -1.1875000    -1.0250000
## 28:  2.72231343    -1.0250000    -0.8854167
## 29:  1.97227273    -0.8854167    -0.7250000
## 30:  1.29969697    -0.7250000    -0.5875000
## 31:  0.71474227    -0.5875000    -0.3854167
## 32:  0.25968750    -0.3854167    -0.1854167
## 33:  0.07922078    -0.1854167    -0.1052083
## 34:  0.01168831    -0.1052083    -0.0250000
## 35:  0.00625000    -0.0250000     0.0750000
## 36:  0.05125000     0.0750000     0.1750000
## 37:  0.17050000     0.1750000     0.3000000
## 38:  0.40450000     0.3000000     0.4250000
## 39:  0.68125000     0.4250000     0.5250000
## 40:  0.99625000     0.5250000     0.6250000
## 41:  1.29904762     0.6250000     0.7562500
## 42:  2.23629630     0.7562500     0.9250000
## 43:  2.85625000     0.9250000     1.0250000
## 44:  3.47125000     1.0250000     1.1250000
## 45:  4.21750000     1.1250000     1.2500000
## 46:  5.19250000     1.2500000     1.3750000
## 47:  6.18250000     1.3750000     1.5000000
## 48:  7.35250000     1.5000000     1.6250000
## 49:  7.75857143     1.6250000     1.7562500
## 50: 10.85161290     1.7562500     1.9500000
## 51: 10.92884615     1.9500000     2.1125000
## 52: 14.30443299     2.1125000     2.3145833
## 53: 17.95402174     2.3145833     2.5062500
## 54: 21.47258065     2.5062500     2.7000000
## 55: 20.49711538     2.7000000     2.8625000
## 56: 26.01281250     2.8625000     3.0625000
## 57: 32.71118954     3.0625000     3.2671585
## 58: 34.20366742     3.2671585     3.5000000
## 59: 33.56373418     3.5000000     3.6645833
## 60: 46.95383721     3.6645833     3.8437500
## 61: 42.67457143     3.8437500     4.0625000
## 62: 57.10865385     4.0625000     4.2250000
## 63: 55.21789474     4.2250000     4.3437500
## 64: 59.67890264     4.3437500     4.5671585
## 65: 66.35838622     4.5671585     4.8321864
## 66: 72.08274344     4.8321864     5.0000000
##     Coefficient X.Lower.Range X.Upper.Range
## 
## $Point.est
## NULL
## 
## $pred.int
## NULL
## 
## $regression.points
##              x             y
##  1: -5.0000000 -1.250000e+02
##  2: -4.9750000 -1.231437e+02
##  3: -4.8500000 -1.140841e+02
##  4: -4.7250000 -1.054974e+02
##  5: -4.5854167 -9.643748e+01
##  6: -4.4250000 -8.665256e+01
##  7: -4.2875000 -7.872351e+01
##  8: -4.1000000 -6.892100e+01
##  9: -3.9562500 -6.097064e+01
## 10: -3.7750000 -5.380319e+01
## 11: -3.6354167 -4.806585e+01
## 12: -3.4750000 -4.196931e+01
## 13: -3.3354167 -3.712420e+01
## 14: -3.1750000 -3.201194e+01
## 15: -3.0375000 -2.798367e+01
## 16: -2.8500000 -2.314913e+01
## 17: -2.6875000 -1.938001e+01
## 18: -2.5250000 -1.610319e+01
## 19: -2.4000000 -1.382400e+01
## 20: -2.2750000 -1.177881e+01
## 21: -2.1437500 -9.748823e+00
## 22: -1.9750000 -7.707437e+00
## 23: -1.8354167 -6.192792e+00
## 24: -1.6500000 -4.492125e+00
## 25: -1.4937500 -3.285385e+00
## 26: -1.3250000 -2.328687e+00
## 27: -1.1875000 -1.672042e+00
## 28: -1.0250000 -1.078812e+00
## 29: -0.8854167 -6.988229e-01
## 30: -0.7250000 -3.824375e-01
## 31: -0.5875000 -2.037292e-01
## 32: -0.3854167 -5.929167e-02
## 33: -0.1854167 -7.354167e-03
## 34: -0.1052083 -1.000000e-03
## 35: -0.0250000 -6.250000e-05
## 36:  0.0750000  5.625000e-04
## 37:  0.1750000  5.687500e-03
## 38:  0.3000000  2.700000e-02
## 39:  0.4250000  7.756250e-02
## 40:  0.5250000  1.456875e-01
## 41:  0.6250000  2.453125e-01
## 42:  0.7562500  4.158125e-01
## 43:  0.9250000  7.931875e-01
## 44:  1.0250000  1.078813e+00
## 45:  1.1250000  1.425938e+00
## 46:  1.2500000  1.953125e+00
## 47:  1.3750000  2.602188e+00
## 48:  1.5000000  3.375000e+00
## 49:  1.6250000  4.294063e+00
## 50:  1.7562500  5.312375e+00
## 51:  1.9500000  7.414875e+00
## 52:  2.1125000  9.190813e+00
## 53:  2.3145833  1.208150e+01
## 54:  2.5062500  1.552269e+01
## 55:  2.7000000  1.968300e+01
## 56:  2.8625000  2.301378e+01
## 57:  3.0625000  2.821634e+01
## 58:  3.2671585  3.491097e+01
## 59:  3.5000000  4.287500e+01
## 60:  3.6645833  4.839903e+01
## 61:  3.8437500  5.681159e+01
## 62:  4.0625000  6.614666e+01
## 63:  4.2250000  7.542681e+01
## 64:  4.3437500  8.198394e+01
## 65:  4.5671585  9.531671e+01
## 66:  4.8321864  1.129035e+02
## 67:  5.0000000  1.250000e+02
##              x             y
## 
## $Fitted.xy
##          x         y     y.hat   NNS.ID gradient  residuals standard.errors
##   1: -5.00 -125.0000 -125.0000 q1111111 74.25250  0.0000000      0.00000000
##   2: -4.95 -121.2874 -121.3318 q1111112 72.47650  0.0444000      0.07380015
##   3: -4.90 -117.6490 -117.7080 q1111121 72.47650  0.0589500      0.07380015
##   4: -4.85 -114.0841 -114.0841 q1111121 68.69350  0.0000000      0.05069967
##   5: -4.80 -110.5920 -110.6495 q1111122 68.69350  0.0574500      0.05069967
##  ---                                                                       
## 197:  4.80  110.5920  110.7677 q2222221 66.35839 -0.1756980      0.27460497
## 198:  4.85  114.0841  114.1876 q2222222 72.08274 -0.1034635      0.11950581
## 199:  4.90  117.6490  117.7917 q2222222 72.08274 -0.1427257      0.11950581
## 200:  4.95  121.2874  121.3959 q2222222 72.08274 -0.1084878      0.11950581
## 201:  5.00  125.0000  125.0000 q2222222 72.08274  0.0000000      0.11950581
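The $derivative and $regression.points elements above are closely related. The short check below copies the first four regression points from the printed output and computes the slope of each segment between neighbouring points; the results agree with the first three coefficients in $derivative, up to the rounding of the printed values.

```r
# First four regression points copied from the $regression.points output above
rp <- data.frame(
  x = c(-5.0000, -4.9750, -4.8500, -4.7250),
  y = c(-125.0000, -123.1437, -114.0841, -105.4974)
)

# Slope of each segment between neighbouring regression points
slopes <- diff(rp$y) / diff(rp$x)
print(round(slopes, 2))
```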

Multivariate:

Multivariate regressions return a plot of $y$ and $\hat{y}$, as well as the regression points ($RPM) and partitions ($rhs.partitions) for each regressor.

f = function(x, y) x ^ 3 + 3 * y - y ^ 3 - 3 * x
y = x ; z <- expand.grid(x, y)
g = f(z[ , 1], z[ , 2])
NNS.reg(z, g, order = "max", plot = FALSE, ncores = 1)

## $R2
## [1] 1
## 
## $rhs.partitions
##         Var1 Var2
##     1: -5.00   -5
##     2: -4.95   -5
##     3: -4.90   -5
##     4: -4.85   -5
##     5: -4.80   -5
##    ---           
## 40397:  4.80    5
## 40398:  4.85    5
## 40399:  4.90    5
## 40400:  4.95    5
## 40401:  5.00    5
## 
## $RPM
##        Var1  Var2         y.hat
##     1: -4.8 -4.80 -7.105427e-15
##     2: -4.8 -2.55 -8.726063e+01
##     3: -4.8 -2.50 -8.806700e+01
##     4: -4.8 -2.45 -8.883587e+01
##     5: -4.8 -2.40 -8.956800e+01
##    ---                         
## 40397: -2.6 -2.80  3.776000e+00
## 40398: -2.6 -2.75  2.770875e+00
## 40399: -2.6 -2.70  1.807000e+00
## 40400: -2.6 -2.65  8.836250e-01
## 40401: -2.6 -2.60  1.776357e-15
## 
## $Point.est
## NULL
## 
## $pred.int
## NULL
## 
## $Fitted.xy
##         Var1 Var2          y      y.hat      NNS.ID residuals
##     1: -5.00   -5   0.000000   0.000000     201.201         0
##     2: -4.95   -5   3.562625   3.562625     402.201         0
##     3: -4.90   -5   7.051000   7.051000     603.201         0
##     4: -4.85   -5  10.465875  10.465875     804.201         0
##     5: -4.80   -5  13.808000  13.808000    1005.201         0
##    ---                                                       
## 40397:  4.80    5 -13.808000 -13.808000 39597.40401         0
## 40398:  4.85    5 -10.465875 -10.465875 39798.40401         0
## 40399:  4.90    5  -7.051000  -7.051000 39999.40401         0
## 40400:  4.95    5  -3.562625  -3.562625 40200.40401         0
## 40401:  5.00    5   0.000000   0.000000 40401.40401         0

Inter/Extrapolation

NNS.reg can inter- or extrapolate any point of interest. The point.est parameter, NNS.reg(x, y, point.est = ...), accepts data of any size with the same dimensions as $x$, and the resulting estimates are called specifically with NNS.reg(...)$Point.est.
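A minimal base R sketch of the idea follows, using the last three regression points from the univariate output above. The helper `point_estimate()` is hypothetical and not part of NNS; interpolation is linear between regression points, and the extrapolation rule (extending the nearest end segment) is an assumption made for illustration.

```r
# Last three regression points copied from the univariate output above
rp_x <- c(4.5671585, 4.8321864, 5.0000000)
rp_y <- c(95.31671, 112.9035, 125.0000)

# Hypothetical helper (not part of NNS): interpolate inside the range of the
# regression points; outside it, extend the nearest end segment (an assumption).
point_estimate <- function(x0, px, py) {
  n <- length(px)
  inside <- approx(px, py, xout = x0)$y   # NA outside [min(px), max(px)]
  lo_slope <- (py[2] - py[1]) / (px[2] - px[1])
  hi_slope <- (py[n] - py[n - 1]) / (px[n] - px[n - 1])
  ifelse(x0 < px[1], py[1] + lo_slope * (x0 - px[1]),
  ifelse(x0 > px[n], py[n] + hi_slope * (x0 - px[n]), inside))
}

# One point inside the range (4.9) and one beyond it (5.1)
est <- point_estimate(c(4.9, 5.1), rp_x, rp_y)
print(est)
```

The interpolated value at $x = 4.9$ can be compared with the y.hat reported for that observation in $Fitted.xy above.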

NNS Dimension Reduction Regression

NNS.reg also provides a dimension reduction regression via the parameter NNS.reg(x, y, dim.red.method = "cor", ...). This reduces all regressors to a single dimension using the returned equation, NNS.reg(..., dim.red.method = "cor", ...)$equation.

NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", location = "topleft", ncores = 1)$equation

##        Variable Coefficient
## 1: Sepal.Length   0.7980781
## 2:  Sepal.Width  -0.4402896
## 3: Petal.Length   0.9354305
## 4:  Petal.Width   0.9381792
## 5:  DENOMINATOR   4.0000000

Thus, our model for this regression would be: \[Species = \frac{0.798 \cdot Sepal.Length - 0.44 \cdot Sepal.Width + 0.935 \cdot Petal.Length + 0.938 \cdot Petal.Width}{4} \]
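The model above can be applied directly in base R. The snippet below copies the coefficients and denominator from the printed $equation and computes the single reduced regressor for every row of iris; the variable name `x_star` is chosen here for illustration.

```r
# Coefficients and denominator copied from the $equation output above
coefs <- c(Sepal.Length = 0.7980781,
           Sepal.Width  = -0.4402896,
           Petal.Length = 0.9354305,
           Petal.Width  = 0.9381792)
denominator <- 4

# Weighted sum of the four regressors, divided by the denominator
x_star <- as.numeric(as.matrix(iris[, names(coefs)]) %*% coefs / denominator)

head(x_star)
```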

Threshold

NNS.reg(x, y, dim.red.method = "cor", threshold = ...) reduces the regressors further by setting the minimum absolute correlation a regressor must have in order to be retained.

NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, location = "topleft", ncores = 1)$equation

##        Variable Coefficient
## 1: Sepal.Length   0.7980781
## 2:  Sepal.Width   0.0000000
## 3: Petal.Length   0.9354305
## 4:  Petal.Width   0.9381792
## 5:  DENOMINATOR   3.0000000

Thus, our model for this further reduced dimension regression would be: \[Species = \frac{0.798 \cdot Sepal.Length + 0 \cdot Sepal.Width + 0.935 \cdot Petal.Length + 0.938 \cdot Petal.Width}{3} \]

The point.est = (...) argument operates in the same manner as in the full regression above, and is again called with NNS.reg(...)$Point.est.

NNS.reg(iris[ , 1 : 4], iris[ , 5], dim.red.method = "cor", threshold = .75, point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est

##  [1] 1 1 1 1 1 1 1 1 1 1
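The effect of the threshold on the equation can be sketched in base R. The rule below is inferred from the two $equation outputs above and is an assumption, not a statement of how NNS.reg is implemented: coefficients whose absolute value falls below the threshold are set to zero, and the denominator counts the regressors that remain.

```r
# Coefficients copied from the unthresholded $equation output above
coefs <- c(Sepal.Length = 0.7980781,
           Sepal.Width  = -0.4402896,
           Petal.Length = 0.9354305,
           Petal.Width  = 0.9381792)
threshold <- 0.75

# Assumed rule: drop (zero out) regressors below the threshold in absolute value
reduced <- ifelse(abs(coefs) >= threshold, coefs, 0)

# The denominator is the number of regressors that survive
denominator <- sum(reduced != 0)

print(reduced)
print(denominator)
```

With threshold = .75 only Sepal.Width is dropped, which reproduces the zero coefficient and the denominator of 3 shown above.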

Classification

For a classification problem, we simply set NNS.reg(x, y, type = "CLASS", ...).

NOTE: Base category of response variable should be 1, not 0 for classification problems.

NNS.reg(iris[ , 1 : 4], iris[ , 5], type = "CLASS", point.est = iris[1 : 10, 1 : 4], location = "topleft", ncores = 1)$Point.est

##  [1] 1 1 1 1 1 1 1 1 1 1
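The note above on the base category matters when labels are coded from 0. A small base R sketch of one way to recode them follows; the vector `y0` is made-up data for illustration only.

```r
# Hypothetical labels coded from 0 (not from the original example)
y0 <- c(0, 1, 0, 2, 1)

# Shift the labels so that the base category is 1
y1 <- y0 - min(y0) + 1
print(y1)

# Factor levels already map to integer codes starting at 1
species_codes <- as.integer(iris$Species)
print(range(species_codes))
```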

Cross-Validation `NNS.stack()`

For a given objective function, the NNS.stack routine cross-validates the n.best parameter of the multivariate NNS.reg function as well as the threshold parameter of the dimension reduction version of NNS.reg. NNS.stack can be used for classification:

NNS.stack(..., type = "CLASS", ...)

or continuous dependent variables:

NNS.stack(..., type = NULL, ...).

Any objective function obj.fn can be called using expression() with the terms predicted and actual, even from external packages such as Metrics.

NNS.stack(..., obj.fn = expression(Metrics::mape(actual, predicted)), objective = "min").
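To see what such an objective function computes, an expression written in terms of predicted and actual can be evaluated by hand in base R. The snippet below is only an illustration with made-up vectors; it shows how `eval()` resolves the two names and does not describe how NNS.stack evaluates obj.fn internally.

```r
# Two candidate objective functions in terms of `predicted` and `actual`
accuracy <- expression(mean(round(predicted) == actual))   # pair with objective = "max"
rmse     <- expression(sqrt(mean((predicted - actual)^2))) # pair with objective = "min"

# Made-up values for illustration only
vals <- list(predicted = c(1.2, 2.4, 2.6), actual = c(1, 2, 2))

# Evaluate each expression with the two names bound to the vectors above
acc <- eval(accuracy[[1]], vals)
err <- eval(rmse[[1]], vals)

print(acc)
print(err)
```

An accuracy-type measure is paired with objective = "max", while an error-type measure is paired with objective = "min".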

NNS.stack(IVs.train = iris[ , 1 : 4], 
          DV.train = iris[ , 5], 
          IVs.test = iris[1 : 10, 1 : 4],
          dim.red.method = "cor",
          obj.fn = expression( mean(round(predicted) == actual) ),
          objective = "max", type = "CLASS", 
          folds = 1, ncores = 1)

## Folds Remaining = 0 
## Current NNS.reg(... , threshold = 0.935 ) MAX Iterations Remaining = 2 
## Current NNS.reg(... , threshold = 0.795 ) MAX Iterations Remaining = 1 
## Current NNS.reg(... , threshold = 0.44 ) MAX Iterations Remaining = 0 
## Current NNS.reg(... , n.best = 1 ) MAX Iterations Remaining = 12 
## Current NNS.reg(... , n.best = 2 ) MAX Iterations Remaining = 11 
## Current NNS.reg(... , n.best = 3 ) MAX Iterations Remaining = 10 
## Current NNS.reg(... , n.best = 4 ) MAX Iterations Remaining = 9 
## $OBJfn.reg
## [1] 1
## 
## $NNS.reg.n.best
## [1] 4
## 
## $probability.threshold
## [1] 0.43875
## 
## $OBJfn.dim.red
## [1] 0.9666667
## 
## $NNS.dim.red.threshold
## [1] 0.935
## 
## $reg
##  [1] 1 1 1 1 1 1 1 1 1 1
## 
## $reg.pred.int
## NULL
## 
## $dim.red
##  [1] 1 1 1 1 1 1 1 1 1 1
## 
## $dim.red.pred.int
## NULL
## 
## $stack
##  [1] 1 1 1 1 1 1 1 1 1 1
## 
## $pred.int
## NULL

Increasing Dimensions

Because multicollinearity is not an issue for nonparametric regressions as it is for OLS, a better option for an ill-fit univariate model may be to increase the dimensionality of the regressors by adding a copy of the regressor itself, and to cross-validate the number of clusters n.best via:

NNS.stack(IVs.train = cbind(x, x), DV.train = y, method = 1, ...).

set.seed(123)
x = rnorm(100); y = rnorm(100)

nns.params = NNS.stack(IVs.train = cbind(x, x),
                        DV.train = y,
                        method = 1, ncores = 1)

NNS.reg(cbind(x, x), y, 
        n.best = nns.params$NNS.reg.n.best,
        point.est = cbind(x, x), 
        residual.plot = TRUE,  
        ncores = 1, confidence.interval = .95)

Getting Started with NNS: Clustering and Regression

Fred Viole

