Introduction to pcLasso

Introduction

pcLasso is a package that fits the principal components lasso, a new method for obtaining sparse models for supervised learning problems. pcLasso shrinks predictions toward the top principal components of the feature matrix. It combines the power of Principal Components Regression with the sparsity of the lasso. The method is also able to handle grouped features, where the features can belong to one of many groups. In that case, pcLasso shrinks the component of the parameter vector towards the top principal componets of the corresponding feature matrix.

We introduce some notation that we will use throughout this vignette. Let there be \(n\) observations, each with feature vector \(x_i \in \mathbb{R}^p\) and response \(y_i\). Let \(X \in \mathbb{R}^{n \times p}\) denote the overall feature matrix, and let \(y \in \mathbb{R}^n\) denote the vector of responses. Assume our data features come in \(K\) groups, and let \(X_k \in \mathbb{R}^{n \times p_k}\) denote the feature matrix for group \(k\). In the simplest case with \(K=1\), there is no feature grouping.

For each feature matrix \(X_k\), let \(X_k = U_k D_k V_k^T\) be its singular value decomposition (SVD). Let \(D_k\) have diagonal entries \(d_{k1}, d_{k2}, \dots\), and let \(D_{d_{k1}^2 - d_{kj}^2}\) denote the diagonal matrix such that the \(j\)th diagonal entry is \(d_{k1} - d_{kj}^2\).

pcLasso solves the optimization problem

\(\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{n} \displaystyle\sum_{i=1}^N w_i \ell (y_i, \beta_0 + \beta^T x_i) + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.\)

In the above, \(\ell(y, \eta)\) is the negative log-likelihood contribution for observation \(i\); e.g. in the Gaussian case \(\ell(y, \eta) = \frac{1}{2}(y - \eta)^2\). \(w_i\) is the observation weight given to observation \(i\) (by default \(w_i = 1\) for all \(i\)). \(\beta_k\) is the subvector of \(\beta\) which corresponds to group \(k\). \(\lambda\) and \(\theta\) are non-negative hyperparameters. pcLasso solves the optimization problem for a grid of \(\lambda\) values; \(\theta\) must be specified in advance by the user.

pcLasso uses cyclical coordinate descent, which successively optimizes the objective function over each parameter with all other parameters fixed, cycling repeatedly until convergence.

The package also includes methods for prediction and plotting, and a function which performs \(k\)-fold cross-validation.

For more details, please see our paper on arXiv.

Installation

The simplest way to obtain pcLasso is to install it directly from CRAN. Type the following command in R console:

install.packages("pcLasso")

This command downloads the R package and installs it to the default directories. Users may change add a repos option to the function call to specify which repository to download from, depending on their locations and preferences.

Alternatively, users can download the package source at CRAN and type Unix commands to install it to the desired location.

Quick Start

The purpose of this section is to give users a general sense of the package. We will briefly go over the main functions of the package, as well as some key options. More details are given in later sections.

First, we load the pcLasso package:

library(pcLasso)

Let’s generate some data:

set.seed(944)
n <- 100
p <- 20
x <- matrix(rnorm(n*p), n, p)
beta <- matrix(c(rep(2, 5), rep(0, 15)), ncol = 1)
y <- x %*% beta + rnorm(n)

We fit the model using the most basic call to pcLasso:

fit <- pcLasso(x, y, ratio = 0.8)

In addition to the feature matrix x and the response y, the user must specify either theta or ratio (but not both). theta is the hyperparameter value for the last term in the objective function. The scale for theta depends on x and y and hence can be hard to determine a priori. Instead of specifying theta, we recommend that the user specify ratio, a hyperparameter whose value lies in the interval \((0,1]\). Roughly speaking, smaller values of ratio represent greater shrinkage to the top principal components. (ratio = 1 gives the usual lasso.)

If the argument verbose is set to TRUE, while the function is running, pcLasso informs the user which lambda in the lambda sequence it is currently processing by printing it to the console.

If ratio is passed to pcLasso, it internally computes the corresponding value of theta and uses that in minimizing the objective function. This value can be retrieved as the value of theta in the output list.

The function pcLasso returns a pcLasso object. At the moment, the only way to extract model coefficients is to use list syntax on the returned object. For example, the code below extracts the intercept and coefficients for the model at the 20th lambda value:

# intercept
fit$a0[20]
#> [1] -0.05855159

# coefficients
fit$beta[, 20]
#>  [1]  0.26955362  0.34871182  0.31431625  0.44873422  0.17694987  0.21373963
#>  [7]  0.00000000 -0.05236147  0.00000000  0.00000000 -0.17767456  0.11606285
#> [13]  0.01645924  0.11182802  0.00000000 -0.10867816  0.06198964  0.12127325
#> [19] -0.04744480 -0.07523572

A pcLasso object has a nzero key which tells us the number of non-zero model coefficients at each value of lambda:

fit$nzero
#>   [1]  0  1  1  4  4  4  5  5  5  6  7  7 11 12 13 14 15 16 16 16 17 17 17 18 18
#>  [26] 18 18 18 18 18 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 20 20 20 20 20
#>  [51] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
#>  [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

We can make predictions using a pcLasso object by calling the predict method. Each column gives the predictions for a value of lambda.

# get predictions for 20th model
predict(fit, x[1:5, ])[, 20]
#> [1]  1.0830384 -0.3009271  1.8452090  1.3588647 -1.1403960

Grouped features

By default, pcLasso assumes that all the features in x belong to one group. If the features come in groups, pcLasso can make use of this information in the model-fitting procedure.

Assume our features come in 4 (non-overlapping) groups of size 5:

groups <- vector("list", 4)
for (k in 1:4) {
    groups[[k]] <- 5 * (k-1) + 1:5
}
groups
#> [[1]]
#> [1] 1 2 3 4 5
#> 
#> [[2]]
#> [1]  6  7  8  9 10
#> 
#> [[3]]
#> [1] 11 12 13 14 15
#> 
#> [[4]]
#> [1] 16 17 18 19 20

We can use this information in our fit by specifying the groups option. groups must be a list of length \(K\) (the number of groups). groups[[k]] is a vector of column indices representing the features which belong to group \(k\). (For example, groups[[1]] <- c(3, 4, 6) means that columns 3, 4 and 6 of x belong to group 1.) Every feature must belong to at least one group.

fit <- pcLasso(x, y, ratio = 0.8, groups = groups)

One advantage of pcLasso is that the algorithm works with overlapping groups as well. For example, we modify the groups list such that features 6 and 7 also belong to group 1. We can make the same call to pcLasso:

groups[[1]] <- 1:7
groups
#> [[1]]
#> [1] 1 2 3 4 5 6 7
#> 
#> [[2]]
#> [1]  6  7  8  9 10
#> 
#> [[3]]
#> [1] 11 12 13 14 15
#> 
#> [[4]]
#> [1] 16 17 18 19 20

fit <- pcLasso(x, y, ratio = 0.8, groups = groups)

One thing to note with overlapping groups is that the model coefficients and the number of non-zero coefficients are stored differently in the output object. To get the coefficients in the original feature space, look at the origbeta key:

# intercept at 20th model: same as before
fit$a0[20]
#> [1] -0.03609611

# coefficients at 20th model: look at origbeta instead
fit$origbeta[, 20]
#>  [1] 0.9453523 1.0498390 1.0494266 1.3574616 0.7357579 0.3295442 0.0000000
#>  [8] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
#> [15] 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000

To get the number of non-zero coefficients in the original feature space, look at orignzero:

fit$orignzero
#>   [1]  0  1  1  3  4  4  5  5  5  5  5  6  6  6  6  6  6  6  6  6  7  8  8  8 10
#>  [26] 10 12 12 12 13 13 13 15 15 16 17 17 17 17 17 17 18 19 19 19 19 19 19 19 19
#>  [51] 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19 19
#>  [76] 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20

For more information on what the algorithm does in the case of overlapping groups, see Section 3.2 of our paper.

Cross-validation (CV)

We can perform \(k\)-fold cross-validation (CV) with cv.pcLasso. It does 10-fold cross-validation by default:

cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8)

We can change the number of folds using the nfolds option:

cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, nfolds = 5)

If we want to specify which observation belongs to which fold, we can do that by specifying the foldid option, which is a vector of length \(n\), with the \(i\)th element being the fold number for observation \(i\).

foldid <- sample(rep(seq(10), length = n))
cvfit <- cv.pcLasso(x, y, groups = groups, ratio = 0.8, foldid = foldid)

A cv.pcLasso call returns a cv.pcLasso object. We can plot this object to get the CV curve with error bars (one standard error in each direction). The left vertical dotted line represents lambda.min, the lambda value which attains minimum CV error, while the right vertical dotted line represents lambda.1se, the largest lambda value with CV error within one standard error of the minimum CV error.

plot(cvfit)

The numbers at the top represent the number of non-zero coefficients for each model in the original feature space. If the groups are overlapping, we can plot the number of non-zero coefficients for each model in the expanded feature space instead by setting orignz = FALSE:

plot(cvfit, orignz = FALSE)

The two special lambda values can be extracted directly from the cv.pcLasso object as well:

cvfit$lambda.min
#> [1] 0.0002570626
cvfit$lambda.1se
#> [1] 0.1899878

Predictions can be made from the fitted cv.pcLasso object. By default, predictions are given for lambda being equal to lambda.1se. To get predictions are lambda.min, set s = "lambda.min".

predict(cvfit, x[1:5, ])   # s = lambda.1se
#> [1]  4.850616 -2.192256  5.120815  3.127318 -4.748510
predict(cvfit, x[1:5, ], s = "lambda.min")
#> [1]  5.059693 -2.553863  7.064400  4.051717 -4.870323

Other options

Here are some other options that one may specify for the pcLasso and cv.pcLasso functions:

w: The user can pass a vector of length \(n\) representing observation weights. The squared residual of the observations are weighted according to this vector. By default, this is set to 1 for all observations.
family: The default value for the family option of the pcLasso and cv.pcLasso functions is gaussian. Use this default when y is a quantitative variable (i.e. takes values along the real number line). The objective function for the Gaussian family is

\(\underset{\beta_0,\beta}{\text{minimize}} \quad \dfrac{1}{2n} \displaystyle\sum_{i=1}^N w_i (y_i - \beta_0 - \beta^T x_i)^2 + \lambda \|\beta\|_1 + \dfrac{\theta}{2} \sum_{k = 1}^K \beta_k^T V_k D_{d_{k1}^2 - d_{kj}^2} V_k^T \beta_k.\)

As before, \(\lambda\) and \(\theta\) are hyperparameters. The user passes a specific value of theta (or its ratio equivalent) to the pcLasso and cv.pcLasso functions, and the function computes the model fit for a path of lambda values.

For binary prediction, use family = binomial. In this setting, the response y should be a numeric vector containing just 0s and 1s.
SVD_info: A list containing SVD information that the algorithm uses. The user typically does not need to provide these options to pcLasso; pcLasso will compute them from the given data. Internally, cv.pcLasso provides these options to pcLasso for computational efficiency.
lambda: The lambda sequence at which the model fit will be computed. This is typically not provided by the user: the program can construct the sequence on its own. When automatically generated, the lambda sequence is determined by lambda.max (internally computed) and lambda.min.ratio. (lambda.min.ratio is the ratio of smallest value of the generated lambda sequence, say lambda.min, to lambda.max.) The program generates nlam values (default is 100) linear on the log scale from lambda.max down to lambda.min.
standardize: If set to TRUE, the columns of the feature matrix x are scaled to have unit variance before the algorithm is run.

For more information, type ?pcLasso or ?cv.pcLasso.