--- title: "HotellingEllipse" author: "Christian L. Goueguel" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{HotellingEllipse} %\VignetteEngine{knitr::knitr} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>", fig.retina = 2 ) ``` This package is specifically designed to help draw Hotelling’s T-squared ellipses on PCA or PLS score scatterplots, a crucial tool in chemometrics for multivariate data analysis and quality control. In the field of chemometrics, these ellipses serve as powerful visual aids for measurements assessment, outlier detection, and process monitoring. By superimposing Hotelling’s T-squared ellipses onto score plots, analysts can quickly identify abnormal samples and establish well-defined confidence regions for normal process operations. The package's functionality enables computing the Hotelling’s T-squared statistic, the semi-minor and semi-major axes of the Hotelling’s T-squared ellipse, and generates the coordinate points necessary for constructing a confidence ellipse. Specifically, the package provides two primary functions: * `ellipseParam()`, is used to calculate the Hotelling’s T-squared statistic and the semi-axes of the confidence ellipses at 99% and 95% confidence intervals. * `ellipseCoord()`, is primarily used to get the *x*, *y*, and *z* coordinates for plotting 2D or 3D confidence ellipses at user-defined confidence interval. The confidence interval is set at 95% by default. ```{r message=FALSE, warning=FALSE, include=FALSE} library(dplyr) library(FactoMineR) library(tibble) library(purrr) library(ggplot2) library(ggforce) ``` ```{r message=FALSE, warning=FALSE, include=FALSE} requireNamespace("rgl", quietly = TRUE) requireNamespace("scales", quietly = TRUE) requireNamespace("viridisLite", quietly = TRUE) ``` # Data ```{r message=FALSE, warning=FALSE} library(HotellingEllipse) ``` ```{r} data("specData", package = "HotellingEllipse") ``` # Principal component analysis In this example, we use `FactoMineR::PCA()` to perform the Principal Component Analysis (PCA) on a LIBS spectral dataset `specData`, and extract the PCA scores. ```{r} set.seed(002) pca_mod <- specData %>% select(where(is.numeric)) %>% PCA(scale.unit = FALSE, graph = FALSE) ``` ```{r} pca_scores <- pca_mod %>% pluck("ind", "coord") %>% as_tibble() %>% print() ``` # Hotelling’s T-Squared Statistic The Hotelling’s T-squared statistic can be computed using the `ellipseParam` function in two distinct ways. (1) By specifying the parameter `k`: This represents the number of principal components to retain. (2) By setting the `threshold` parameter: This defines the cumulative explained variance to determine the number of components. ## Calculate using the first k components ```{r} T2 <- ellipseParam(pca_scores, k = 3)$Tsquare$value ``` ```{r} T2 ``` ```{r} T2 <- ellipseParam(pca_scores, k = 5)$Tsquare$value ``` ```{r} T2 ``` ## Calculate using the cumulative variance threshold ```{r} T2 <- ellipseParam(pca_scores, threshold = 0.80)$Tsquare$value ``` ```{r} T2 ``` ```{r} T2 <- ellipseParam(pca_scores, threshold = 0.95)$Tsquare$value ``` ```{r} T2 ``` # Hotelling’s T-Squared Ellipse ## Calculating the semi-axes To visualize the confidence region for our multivariate data, we employ the `ellipseParam()` function to generate a confidence ellipse. Our objective is to calculate the lengths of the semi-axes for this ellipse, focusing on the bivariate relationship within the PC2-PC4 subspace of our principal component analysis. We maintain the default value for `k` (number of components) at 2. This ensures we're working with a two-dimensional representation. We specify `pcx = 2` and `pcy = 4` as inputs. This directs the function to use the 2nd and 4th principal components for the *x* and *y* axes, respectively. ```{r} ellipse_axes <- ellipseParam(pca_scores, pcx = 2, pcy = 4) ``` ```{r} str(ellipse_axes) ``` We can extract parameters for further use: * Semi-axes of the ellipse at 99% confidence level. ```{r} a1 <- ellipse_axes %>% pluck("Ellipse", "a.99pct") b1 <- ellipse_axes %>% pluck("Ellipse", "b.99pct") ``` * Semi-axes of the ellipse at 95% confidence level. ```{r} a2 <- ellipse_axes %>% pluck("Ellipse", "a.95pct") b2 <- ellipse_axes %>% pluck("Ellipse", "b.95pct") ``` * Hotelling’s T$^2$ statistic. ```{r} Tsq <- ellipse_axes %>% pluck("Tsquare", "value") ``` ```{r message=FALSE, warning=FALSE} pca_scores %>% ggplot(aes(x = Dim.2, y = Dim.4)) + geom_ellipse(aes(x0 = 0, y0 = 0, a = a1, b = b1, angle = 0), linewidth = .5, linetype = "solid", fill = "white") + geom_ellipse(aes(x0 = 0, y0 = 0, a = a2, b = b2, angle = 0), linewidth = .5, linetype = "dashed", fill = "white") + geom_point(aes(fill = Tsq), shape = 21, size = 3, color = "black") + scale_fill_viridis_c(option = "viridis") + geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) + geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) + labs( title = "Scatterplot of PCA scores", subtitle = "PC1 vs. PC3", x = "PC2", y = "PC4", fill = "T2", caption = "Figure 1" ) + theme_grey() ``` ## Calculating the x-y coordinates An alternative method for visualizing the Hotelling’s T-squared confidence region is to utilize the `ellipseCoord()` function. This function generates the *x* and *y* coordinates for plotting the confidence ellipse, offering greater flexibility in visualization and further analysis. It allows users to specify a custom confidence level via the `confi.limit` parameter. The default confidence level is set at 95%, which is commonly used in statistical analyses. The function returns a set of coordinates that define the ellipse's boundary in the chosen subspace, thereby complementing the semi-axes information provided by the `ellipseParam()` function. In the example below, we focus on the subspace spanned by the 1st and 5th components. ```{r} xy_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 5, conf.limit = 0.975, pts = 500) ``` ```{r} str(xy_coord) ``` ```{r message=FALSE, warning=FALSE} ggplot() + geom_polygon(data = xy_coord, aes(x, y), color = "black", fill = "white") + geom_point(data = pca_scores, aes(x = Dim.1, y = Dim.5), shape = 21, size = 5, fill = "blue", color = "black", alpha = 1/2) + geom_hline(yintercept = 0, linetype = "solid", color = "black", linewidth = .2) + geom_vline(xintercept = 0, linetype = "solid", color = "black", linewidth = .2) + labs( title = "Scatterplot of PCA scores", subtitle = "PC1 vs. PC5", x = "PC1", y = "PC5", caption = "Figure 2" ) + theme_grey() ``` # Hotelling’s T-Squared Ellipsoid The `ellipseCoord` function offers an optional parameter `pcz`. When activated, this parameter extends the function's capabilities from two-dimensional ellipses to three-dimensional ellipsoids, providing a more comprehensive visualization of multivariate data, particularly in situation where the first two principal components do not adequately capture the data's variability. ## Calculating the x-y-z coordinates In the example below, the resulting 3D Hotelling’s T-squared ellipsoid serves as a volumetric confidence region in the subspace spanned by the 1st, 2nd, and 3rd components. It encapsulates a specified proportion of data points, determined by the confidence level, providing a more holistic view of the data's distribution and outliers in three dimensions. ```{r} xyz_coord <- ellipseCoord(pca_scores, pcx = 1, pcy = 2, pcz = 3, conf.limit = 0.95, pts = 100) ``` ```{r} str(xyz_coord) ``` ```{r} T2 <- ellipseParam(pca_scores, k = 3)$Tsquare$value ``` ``` color_palette <- viridisLite::viridis(nrow(pca_scores)) scaled_T2 <- scales::rescale(T2, to = c(1, nrow(pca_scores))) point_colors <- color_palette[round(scaled_T2)] ``` ``` rgl::setupKnitr(autoprint = TRUE) rgl::plot3d( x = xyz_coord$x, y = xyz_coord$y, z = xyz_coord$z, xlab = "PC1", ylab = "PC2", zlab = "PC3", type = "l", lwd = 0.5, col = "lightgray", alpha = 0.5) rgl::points3d( x = pca_scores$Dim.1, y = pca_scores$Dim.2, z = pca_scores$Dim.3, col = point_colors, size = 5, add = TRUE) rgl::bgplot3d({ par(mar = c(0,0,0,0)) plot.new() color_legend <- as.raster(matrix(rev(color_palette), ncol = 1)) rasterImage(color_legend, 0.85, 0.1, 0.9, 0.9) text( x = 0.92, y = seq(0.1, 0.9, length.out = 5), labels = round(seq(min(T2), max(T2), length.out = 5), 2), cex = 0.7) text(x = 0.92, y = 0.95, labels = "T2", cex = 0.8)}) rgl::view3d(theta = 30, phi = 25, zoom = .8) ```