---
title: "kdml package"
author: "John R. J. Thompson & Jesse S. Ghashti"
date: ' '
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{kdml package}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(message = FALSE, warning = FALSE)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(np)
library(stats)
library(MASS)
library(kdml)
```

# Introduction

The \(\texttt{kdml}\) package calculates the pairwise distances between mixed-type observations consisting of continuous (numeric), nominal (factor), and ordinal (ordered factor) variables. This kernel metric learning methodology selects the bandwidth associated with each variable's kernel function using cross-validation and returns a distance matrix that can be utilized in any distance-based algorithm.

We define a kernel similarity between two data points $\mathbf{x}_{i}$ and $\mathbf{x}_{j}$ in two different ways, based on two papers. From Ghashti and Thompson (2024), the kernel product similarity \(\texttt{dkps}\) is given by
\begin{equation}\label{eq:kdsumsim}
s_{\text{dkps}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda}) = \prod_{k=1}^{p_c} \frac{1}{\lambda_k} K\left( \frac{x_{i,k} - x_{j,k}}{\lambda_k}\right) + \sum_{k=p_c+1}^{p_c + p_u} L(x_{i,k},x_{j,k},\lambda_k) + \sum_{k = p_c+p_u+1}^p \ell(x_{i,k},x_{j,k},\lambda_k),
\end{equation}
and from Ghashti (2024), the kernel summation similarity \(\texttt{kss}\) is given by
\begin{equation}\label{eq:dksssim}
s_{\text{kss}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda}) = \sum_{k=1}^{p_c} K\left( \frac{x_{i,k} - x_{j,k}}{\lambda_k}\right) + \sum_{k=p_c+1}^{p_c + p_u} L(x_{i,k},x_{j,k},\lambda_k) + \sum_{k = p_c+p_u+1}^p \ell(x_{i,k},x_{j,k},\lambda_k).
\end{equation}
For both $s_{\text{dkps}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$ and $s_{\text{kss}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$, $K(\cdot)$, $L(\cdot)$, and $\ell(\cdot)$ are kernel functions for continuous (numeric), nominal (factor), and ordinal (ordered factor) variables, respectively. The data frame consists of $p$ variables, where $p = p_c + p_u + p_o$ is the total number of continuous, nominal, and ordinal variables, respectively, and $\boldsymbol{\lambda}$ is a vector of length $p$ containing the variable-specific bandwidth for each kernel function.

Phillips and Venkatasubramanian (2011) discuss how to calculate a distance from a similarity function. Using either similarity function $s_{\text{dkps}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$ or $s_{\text{kss}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$, the kernel distance between two observations $\mathbf{x}_i$ and $\mathbf{x}_j$ is given by
\begin{equation}\label{eq:kerndist}
d^2(\mathbf{x}_i, \mathbf{x}_j \ | \ \boldsymbol{\lambda}) = s(\mathbf{x}_i, \mathbf{x}_i \ | \ \boldsymbol{\lambda}) + s(\mathbf{x}_j, \mathbf{x}_j \ | \ \boldsymbol{\lambda}) - 2s(\mathbf{x}_i, \mathbf{x}_j \ | \ \boldsymbol{\lambda}).
\end{equation}
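To make this similarity-to-distance conversion concrete, here is a minimal sketch in base R; the $n \times n$ similarity matrix \(\texttt{S}\), with entries $s(\mathbf{x}_i, \mathbf{x}_j \ | \ \boldsymbol{\lambda})$, is a hypothetical placeholder, and in practice \(\texttt{dkps}\) and \(\texttt{dkss}\) perform this conversion internally.

```{r, echo = TRUE, eval = FALSE}
# A sketch, not the package's internal code: convert a (hypothetical)
# n x n similarity matrix S into the matrix of kernel distances.
similarity_to_distance <- function(S) {
  self_sim <- diag(S)                           # s(x_i, x_i | lambda)
  d2 <- outer(self_sim, self_sim, "+") - 2 * S  # s(i,i) + s(j,j) - 2*s(i,j)
  sqrt(pmax(d2, 0))                             # guard against tiny negative values
}
```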
These two distances can be called in \(\texttt{R}\) with the package \(\texttt{kdml}\) using the function \(\texttt{dkps}\) for the similarity $s_{\text{dkps}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$ and the function \(\texttt{dkss}\) for the similarity $s_{\text{kss}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$; both are then used in the equation for $d^2(\mathbf{x}_i, \mathbf{x}_j \ | \ \boldsymbol{\lambda})$ for the pairwise distance calculations.

The vector of bandwidths $\boldsymbol{\lambda}$ may be a user-supplied numeric vector of length $p$, for which the admissible bandwidth values for each variable type are bounded based on the kernel choice: $\lambda > 0$ for continuous variables, $\lambda \in [0,1]$ for ordinal variables, and kernel-specific bounds for nominal variables. For example, $\lambda \in [0,1]$ for the 'u_aitken' kernel and $\lambda \in [0,(c-1)/c]$ for the 'u_aitchisonaitken' kernel, where $c$ is the number of unique values of a specific nominal variable. For an overview of kernel functions, we refer the reader to Aitchison and Aitken (1976), Cameron and Trivedi (2005), Härdle et al. (2004), Li and Racine (2007), Li and Racine (2003), Silverman (1986), Titterington and Bowman (1985), and Wang and van Ryzin (1981).

# Installing

You can install the stable version from CRAN using \(\texttt{install.packages("kdml")}\). The development version of \(\texttt{kdml}\) can be installed from [GitHub](https://github.com/jrjthompson/R-package-kdml) with:

```{r, echo = TRUE, eval = FALSE}
library(devtools)
install_github("jrjthompson/R-package-kdml")
library(kdml)
```

# Bandwidth Selection

The maximum similarity cross-validation (MSCV) method for bandwidth selection (Ghashti and Thompson, 2024) is based on the objective
\begin{equation}\label{eq:mscv}
\underset{\boldsymbol{\lambda}}{\text{argmax}}\left\{\frac{1}{n}\sum_{i=1}^n\log\left(\frac{1}{(n-1)}\sum_{\substack{j=1 \\ j \ne i}}^n s_{\boldsymbol{\lambda}}(\textbf{x}_i,\textbf{x}_j)\right)\right\},
\end{equation}
where $s_{\boldsymbol{\lambda}}(\cdot)$ is either of the similarity functions $s_{\text{dkps}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$ or $s_{\text{kss}}(\mathbf{x}_{i}, \mathbf{x}_{j} \vert \boldsymbol{\lambda})$. MSCV can be invoked implicitly within the distance calculation by setting the argument \(\texttt{bw = "mscv"}\) in the functions \(\texttt{dkps}\) and \(\texttt{dkss}\). Users also have the option of setting the argument \(\texttt{bw = "np"}\) to specify bandwidth selection using maximum-likelihood cross-validation from the highly optimized package \(\texttt{np}\) (Hayfield and Racine, 2008).
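To see what this objective computes, the sketch below evaluates the MSCV criterion in base R for a fixed bandwidth vector; the similarity matrix \(\texttt{S}\), with entries $s_{\boldsymbol{\lambda}}(\mathbf{x}_i, \mathbf{x}_j)$, is again a hypothetical placeholder, and the package functions \(\texttt{mscv.dkps}\) and \(\texttt{mscv.dkss}\) (described below) maximize this criterion over $\boldsymbol{\lambda}$ for the user.

```{r, echo = TRUE, eval = FALSE}
# A sketch, not the package's internal code: evaluate the MSCV objective
# for a (hypothetical) n x n similarity matrix S at a fixed bandwidth vector.
mscv_objective <- function(S) {
  diag(S) <- NA                          # exclude the j = i terms
  loo_avg <- rowMeans(S, na.rm = TRUE)   # (1/(n-1)) * sum over j != i of s(x_i, x_j)
  mean(log(loo_avg))                     # (1/n) * sum over i of log(...)
}
```

# Data Generation

To simulate data containing a mix of variable types and true class labels, we have included the function \(\texttt{confactord}\). This function performs a simulation like the one below automatically, and can be configured for any number of numeric, nominal, and ordinal variables. See the package manual for more details.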
We simulate a mix of continuous ($x_1$, $x_4$), nominal ($x_2$, $x_3$), and ordinal ($x_5$, $x_6$) variables and store them in a data frame as follows:

```{r}
df <- data.frame(
  x1 = runif(100, 0, 100),
  x2 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  x3 = factor(sample(c("A", "B", "C"), 100, replace = TRUE)),
  x4 = rnorm(100, 10, 3),
  x5 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High")),
  x6 = ordered(sample(c("Low", "Medium", "High"), 100, replace = TRUE),
               levels = c("Low", "Medium", "High"))
)
```

A minimal usage of the distance metric functions requires only the data frame, in which case the functions use their default kernel functions and the default bandwidth specification method \(\texttt{"mscv"}\).

```{r, echo = TRUE, eval = FALSE}
# DKPS distance
dis_dkps <- dkps(df = df)

# DKSS distance
dis_dkss <- dkss(df = df)
```

Bandwidths can instead be selected using the maximum-likelihood cross-validation technique from the package \(\texttt{np}\):

```{r, echo = TRUE, eval = FALSE}
# DKPS distance
dis_dkps_np <- dkps(df = df, bw = "np")

# DKSS distance
dis_dkss_np <- dkss(df = df, bw = "np")
```

Users also have many kernel functions available to them, which are specified through the additional arguments shown below; some of the kernel functions from \(\texttt{np}\) are not available. The kernels used for the bandwidth selection technique should be the same as those used for the distance calculation.

```{r, echo = TRUE, eval = FALSE}
dis_dkss_custom_kernels <- dkss(df = df, bw = "mscv",
                                cFUN = "c_epanechnikov",
                                uFUN = "u_aitken",
                                oFUN = "o_habbema")
```

If users require only the bandwidths selected by MSCV, and not the pairwise distance matrix obtained from \(\texttt{dkps}\) or \(\texttt{dkss}\), they may obtain them with the following function calls:

```{r, echo = TRUE, eval = FALSE}
# MSCV bandwidth specification using the similarity function in Equation (1)
bw_dkps <- mscv.dkps(df, nstart = NULL, ckernel = "c_gaussian",
                     ukernel = "u_aitken", okernel = "o_wangvanryzin",
                     verbose = TRUE)

# MSCV bandwidth specification using the similarity function in Equation (2)
bw_dkss <- mscv.dkss(df, nstart = NULL, ckernel = "c_gaussian",
                     ukernel = "u_aitken", okernel = "o_wangvanryzin",
                     verbose = TRUE)
```

For more details on the usage of each of these functions, consult the package documentation found on CRAN.
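Because the returned pairwise distances work with any distance-based algorithm, they can be passed directly to, for example, hierarchical clustering. The sketch below assumes that the object `dis_dkps` computed earlier holds the pairwise distance matrix; if the function instead returns a list, extract its distance-matrix component first (see the package documentation).

```{r, echo = TRUE, eval = FALSE}
# A sketch: plug the DKPS distances into hierarchical clustering.
# Assumes dis_dkps holds the n x n distance matrix; if dkps() returns a
# list, extract the distance-matrix component first (see ?dkps).
hc <- hclust(as.dist(dis_dkps), method = "average")
clusters <- cutree(hc, k = 3)   # e.g., cut the dendrogram into three clusters
table(clusters)
```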
**References**

[1] Aitchison, J. and C.G.G. Aitken (1976), “Multivariate binary discrimination by the kernel method”, Biometrika, 63, 413-420.

[2] Cameron, A. and P. Trivedi (2005), Microeconometrics: Methods and Applications, Cambridge University Press.

[3] Ghashti, J.S. (2024), “Similarity Maximization and Shrinkage Approach in Kernel Metric Learning for Clustering Mixed-type Data”, University of British Columbia.

[4] Ghashti, J.S. and J.R.J. Thompson (2024), “Mixed-type Distance Shrinkage and Selection for Clustering via Kernel Metric Learning”, Journal of Classification, Accepted.

[5] Härdle, W., M. Müller, S. Sperlich and A. Werwatz (2004), Nonparametric and Semiparametric Models (Vol. 1), Berlin: Springer.

[6] Hayfield, T. and J.S. Racine (2008), “Nonparametric Econometrics: The \(\texttt{np}\) Package”, Journal of Statistical Software, 27(5).

[7] Li, Q. and J.S. Racine (2007), Nonparametric Econometrics: Theory and Practice, Princeton University Press.

[8] Li, Q. and J.S. Racine (2003), “Nonparametric estimation of distributions with categorical and continuous data”, Journal of Multivariate Analysis, 86, 266-292.

[9] Phillips, J.M. and S. Venkatasubramanian (2011), “A gentle introduction to the kernel distance”, arXiv preprint arXiv:1103.1625.

[10] Silverman, B.W. (1986), Density Estimation, London: Chapman and Hall.

[11] Titterington, D.M. and A.W. Bowman (1985), “A comparative study of smoothing procedures for ordered categorical data”, Journal of Statistical Computation and Simulation, 21(3-4), 291-312.

[12] Wang, M.C. and J. van Ryzin (1981), “A class of smooth estimators for discrete distributions”, Biometrika, 68, 301-309.