--- title: "Kernel functions" author: "Elies Ramon" output: rmarkdown::html_vignette vignette: > %\VignetteIndexEntry{Kernel functions} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) ``` ```{r setup, include = FALSE} library(kerntools) ``` ## Purpose of this document When first knowing about the `kerntools` package, or perhaps after downloading it, you may have asked yourself some of these questions: - What are kernel functions and matrices? - Why are they useful? - How to compute them using `kerntools`? This document is meant to answer exactly that. The first section, [Introduction], deals with the basic concepts regarding the kernel approach. Subsequent sections review all kernels implemented in `kerntools` one by one, highlighting their most important traits and giving some examples of use. ## Introduction ### Informal definition Kernel methods are all those machine learning methods that use kernel functions. Let $X = {x_1, x_2, ... , x_N}$ be a dataset of *N* objects of any type (real-valued, graphs, rankings, matrices, categorical data, etc.). The kernel function evaluates all possible pairs of objects, which are then stored in a kernel matrix with dimension *NxN*: $$\mathbf{K_{ij}} = k(x_i , x_j )$$ Intuitively, we can understand kernels as functions $k(x_i , x_j )$ that measure the similarity between $x_i$ and $x_j$. Generally speaking, all kernels are similarity measures; however, not every similarity measure is a kernel: only those that generate proper kernel matrices. Kernel matrices are characterized by four properties: they are real-valued, squared, symmetric and PSD (positive semi-definite: all their eigenvalues are nonnegative). Thus, the kernel matrix **K**, and not the original dataset *X*, is the input data for our kernel method. In this approach, choosing a good kernel function is capital, because it's the "lens" that the method will use to "see" the data. ### Advantages If using kernels implies an "extra step" (computing **K**), why should we bother at all? Surely, it's easier to use any other machine learning method and dealing directly with *X*, isn't it? Well, the thing is that the kernel approach has some advantages: - Most machine learning methods require a numeric input (some of them, like Random Forests, also can accept categorical data). Usually, if our dataset is different from the typical real-valued matrix **X** with *D* features (variables), we will have to recode it somehow. But in kernel methods, the objects in *X* are never represented explicitly (only via their pairwise similarities), which has an interesting consequence: kernels can handle data types that are not standard. Instead, the challenge is using (or designing) kernels that extract valuable information from specific data types or problems. This requires a notion of what is considered “similar” in that given context. - Kernel functions can be combined or "custom-tailored" for a specific problem. For instance, previous knowledge about the problem can be encoded explicitly into the kernel. This is facilitated by the fact that kernel design is modular in nature: kernels can be combined. For example, the linear combination of kernel matrices is another kernel matrix. - When using nonlinear kernel functions, methods that were originally linear (for instance: Principal Components Analysis or PCA) "become" nonlinear. In some sense, we "trick" the method: it *still* works in a linear way, but as its input data has been transformed by a nonlinear function, the output it produces is nonlinear. - Curse of dimensionality: contrarily to a lot of other methods, this is not a big issue when working with kernels. In fact, when dealing with *NxD* datasets where *N << D*, kernels can decrease the computational burden if used cleverly, as the method will deal with a kernel matrix of dimension *NxN*. ### Feature space The most basic kernel function is the dot product of two real vectors (see [Linear kernel]). This definition is *very* important, because all kernels can be written as inner products in some space: $$x → φ(x)$$ $$k(x_i , x_j ) = <φ(x_i),φ(x_j)>$$ $φ(x)$ maps an object from the original (input) space to the so-called *feature space*. This feature space typically has a higher dimension than the original space (other times it has the same dimension) and it's endowed with an inner product. However, all the process happens "under the hood": the kernel function uses this mapping without telling so (recall that the objects' explicit representations are never used). Sometimes, even knowing the feature space associated to a given kernel can be tricky. However, this take avoids the computational cost of 1) explicitly mapping the dataset into feature space (which sometimes can be *very* high dimensional) + 2) computing the inner product there: the kernel conflates the two in a single step, which is known as the *kernel trick*. By the way, this also allows solving nonlinear problems using linear methods: the kernel takes the original data onto a higher dimensional space where it can be operated in a linear way with inner products. (A visual example of this can be found [here](https://es.m.wikipedia.org/wiki/Archivo:Kernel_trick_idea.svg)). ## Kernels for real vectors Now we introduce the kernels in `kerntools` that deal with real vectors. There are *a lot* of kernels that can handle this data, but here we are going to focus only on those implemented in this package. By real vectors I refer the typical continuous dataset **X** with *N* samples in rows and *D* continuous variables (or features) in columns. Thus, for each sample we have $\mathbf{x} \in {\rm I\!R}^D$. ### Linear kernel #### Definition The Linear kernel is the dot product of two real vectors: $$Linear(\mathbf{x_i},\mathbf{x_j}) = \mathbf{x_i}^T\mathbf{x_j}$$ It can take positive values, negative values, and even be zero. Obviously, it has a linear nature. Thus, if we use it in a "kernelized" machine learning method, the output will be the same that when using the non-kernelized method (for instance: linear kPCA is equivalent to standard PCA). The linear kernel is related to the cosine of the angle between two vectors, which is maximum when they share the same direction and 0 if they are perpendicular. Sometimes, is more easy to see the linear kernel as a "similarity score" if the cosine normalization is applied: $$Linear_{cosine}(\mathbf{x_i},\mathbf{x_j}) = \frac{\mathbf{x_i}^T\mathbf{x_j}}{\sqrt{\mathbf{x_i}^T\mathbf{x_i}}\sqrt{\mathbf{x_j}^T\mathbf{x_j}}}$$ This is also known as the "cosine similarity". A score of 1 implies that $\mathbf{x_i},\mathbf{x_j}$ are equal, while the lower the score, the lower the similarity. #### Feature space This is the only kernel that fulfills that feature space = input space. #### Usage This kernel is suitable for typical numeric, continuous datasets (especially if you know in advance that your problem has a linear solution). For instance, if we work with the famous iris dataset: ```{r,eval=FALSE} Linear(iris[,1:4]) ``` In general, however, it is recommended to standardize a dataset before presenting it to the linear kernel. Imagine that your features are very different in magnitude: for instance, if the range of feature 1 is $[0.01,0.1]$ and the range of feature 2 is $[10^2,10^3]$, feature 2 will absolutely dominate the output of the kernel, even if this feature is almost irrelevant to the problem you want to solve. The reverse case arises when you want to *manually* prioritize some features over others (for example, because you have previous knowledge about your problem). In that case, you can weight them in advance, specifying your desired coefficients. For example: ```{r,eval=FALSE} Linear(iris[,1:4], coeff = c(1/2, 1/4, 1/8, 1/8)) ``` Cosine-normalization of the linear kernel can be done with: ```{r,eval=FALSE} Linear(iris[,1:4], cos.norm = TRUE) ``` ### RBF kernel #### Definition The Gaussian Radial Basis Function (RBF) kernel is regarded as the gold standard among kernels. Typically is written as: $$RBF(\mathbf{x_i},\mathbf{x_j}) = e^{-\gamma {||\mathbf{x_i}-\mathbf{x_j}||}^2}$$ It is immediately obvious that it is nonlinear. The output ranges between 0 and 1; again, the higher the RBF kernel value, the higher the similarity between $\mathbf{x_i},\mathbf{x_j}$. The ${||\mathbf{x_i}-\mathbf{x_j}||}$ term correspond to the Euclidean distance between two real vectors, while $\gamma>0$ is a hyperparameter. There is an alternative definition of the RBF in some books or packages, where it has a "bandwidth" hyperparameter called $\sigma>0$. Don't worry. The relationship between the two is: $\gamma = \frac{1}{2\sigma^2}$. The RBF kernel is related to the linear kernel. This is because the inner product induces a L2 norm that, in real vectors, is equivalent to the Euclidean distance. But contrarily to the linear kernel, the RBF kernel is an universal approximator: it can be used to model any decision boundary. In this regard, $\gamma$ governs the "smoothness" of this kernel when adapting to data. The smaller the value, the more complex the resulting model, and the risk of overfitting is greater. Instead, if the value is too small, the decision boundary become very smooth and "linearized", to the point that (with very large values) the RBF kernel starts behaving as the linear kernel. #### Feature space With this kernel, data is mapped onto an infinite-dimensional feature space. Several techniques to approximate it exist. For instance, $φ(\mathbf{x})$ can be computed via Taylor series expansion (more info in [Explicit Approximations of the Gaussian Kernel](https://arxiv.org/abs/1109.4603)). #### Usage This kernel is perfectly fine for typical numeric, continuous datasets, especially if the solution to your problem is nonlinear. Here, you can see how to compute it with `kerntools` using the iris example dataset: ```{r,eval=FALSE} RBF(iris[,1:4], g= 0.1) ``` (Centering or not the dataset is irrelevant, but it is advisable to scale the variables by the their standard deviation before presenting the dataset to the RBF kernel. See [Linear kernel] for more info.) Here, `RBF(...,g)` is the value of the hyperparameter gamma, which should be chosen by the user. There are some heuristics that may be useful for choosing a good value. In that respect, `kerntools` provides this handy function: ```{r,eval=FALSE} estimate_gamma(iris[,1:4]) ``` (Remember that, in `kerntools`, the hyperparameter of the RBF kernel is $\gamma$, not $\sigma$.) Finally, `kerntools`'s implementation currently does not compute the feature space of the RBF kernel. ### Laplacian kernel #### Definition The Laplacian kernel is similar to the RBF kernel. However, instead of computing the L2 norm (Euclidean distance) between two real vectors, it uses the L1 norm: $$Laplacian(\mathbf{x_i},\mathbf{x_j}) = e^{-\gamma |\mathbf{x_i}-\mathbf{x_j}|}$$ It ranges between 0 and 1. The higher the laplacian kernel value, the higher the similarity between $\mathbf{x_i},\mathbf{x_j}$. The hyperparameter $\gamma > 0$ has a similar role that in the RBF kernel. However, now its relationship to $\sigma$ is: $\gamma = \frac{1}{\sigma}$. The hyperparameter governs how the laplacian kernel adapts to data, though using the L1 norm makes this kernel more "sharp" than the RBF kernel. #### Usage You can use this kernel with a typical numeric, continuous dataset, especially if the solution to your problem is nonlinear. An example of use with the famous iris dataset: ```{r,eval=FALSE} Laplace(iris[,1:4], g= 0.1) ``` Here, `Laplace(...,g)` is the value of the hyperparameter $\gamma$, which should be chosen by the user. Centering or not the dataset is irrelevant, but it is advisable to scale the variables by the their standard deviation before presenting it to the laplacian kernel. See [Linear kernel] for more info. Remember that in `kerntools`, the hyperparameter of the laplacian kernel is $\gamma$, not $\sigma$. Moreover, `kerntools`'s implementation right now does not compute the feature space of the laplace kernel. ## Kernels for real matrices The following kernel works with data such as each sample is a real matrix (its entries are real numbers). ### Frobenius kernel #### Definition The Frobenius kernel of two matrices $\mathbf{X_i},\mathbf{X_j}$ with dimension *NxD* is defined as: $$Frobenius(\mathbf{X_i},\mathbf{X_j}) = tr(\mathbf{X_i} \mathbf{X_j}^T) = \sum_{n=1}^N \sum_{d=1}^D\mathbf{X}_{ind} \circ \mathbf{X}_{jnd} $$ $\circ$ denotes the Hadamard product. It is an inner product and, thus, the Frobenius kernel is a sort of "linear kernel" for matrices. The result can be positive, negative or even zero. As always, the higher the kernel the higher the similarity between the two samples. This is understood more easily when the cosine normalization is applied: $$Frobenius_{cosine}(\mathbf{X_i},\mathbf{X_j}) = \frac{tr(\mathbf{X_i} \mathbf{X_j}^T)}{\sqrt{tr(\mathbf{X_i} \mathbf{X_i}^T)}\sqrt{tr(\mathbf{X_j} \mathbf{X_j}^T)}}$$ A score of 1 implies that the two samples (matrices) are equal. #### Feature space The feature space of this kernel is straightforward. It can be obtained by vectorizing (converting into column vectors) each one of the matrices. #### Usage This kernel can be used to compare matrix-formatted information, for instance image data. It can be also used to compare the similarity between two kernel matrices. In `kerntools`, the Frobenius kernel can be called typing: ```{r,eval=FALSE} ### Let's suppose our samples come from three different matrix tables: setosa <- iris[iris$Species == "setosa", 1:4] versicolor <- iris[iris$Species == "versicolor", 1:4] virginica <- iris[iris$Species == "virginica", 1:4] matrices <- list(setosa,versicolor,virginica) Frobenius(matrices) ``` Cosine-normalization of the Frobenius kernel can be done with: ```{r,eval=FALSE} Frobenius(matrices, cos.norm = TRUE) ``` and the feature space (cosine-normalized or not) can be recovered like this: ```{r,eval=FALSE} Frobenius(matrices, feat_space = TRUE) ``` ## Kernels for abundances (counts or frequencies) The following kernels deal with counts and frequency data, be it absolute or relative frequencies. This kind of data lives in datasets of dimension *NxD*, which (at first glance) may appear very similar to real data. However, some nuances should be taken into account. Count data consists in nonnegative integer vectors of dimension *D* $\mathbf{x} \in {\rm I\!R}_{>0}^D$ that quantify the abundance of some trait or analyzed variable. This data has a non-symmetric distribution, a potentially large dynamic range (from zero up to millions) and, when the quantification is done by an instrument, other factors like upper and lower bounds on detection can be at play. Thus, zeroes may be ambiguous: they may signal an actual absence, or a quantity below the detection limit. Sometimes, counts or absolute frequencies are converted to relative frequencies. This data is usually called *compositional* and constitutes a specific category within abundance data. In that case, all samples (rows) sum to an arbitrary number (for instance 1, or 100). This data does not live in the positive real space $\mathbf{x} \in {\rm I\!R}_{>0}^D$ but in the simplex, which is a *D - 1* subspace of it. This is important because the variables are *not* independent, but parts of a whole. Thus, increasing the abundance of a feature within a sample decreases the abundance of the rest. The kernels that follow are well known in some fields, like ecology, because of their relation to widespread diversity metrics: the Bray-Curtis dissimilarity and the Ruzicka distance. ### Bray-Curtis kernel #### Definition The Bray-Curtis kernel is defined as: $$ BCK(x_{i},x_{j}) = 2\frac{\sum_d min(x_{id},x_{jd})}{\sum_d(x_{id}+x_{jd})}$$ where $x_{id}, x_{jd}$ are two samples (that is: two rows of the dataset). It ranges between 0 and 1. #### Usage This kernel is $BCK = 1 - BCD$, where BCD is the Bray-Curtis dissimilarity (it is called "dissimilarity" because it is semimetric and, therefore, not a distance in the strict sense). It can be applied over counts and frequencies; however, the "correctness" of using directly this kernel over compositional data is disputed. In `kerntools`, it can be computed typing: ```{r,eval=FALSE} ### Our example dataset contains the bacterial abundance of *D* species in *N* soil samples. data <- soil$abund BrayCurtis(data) ``` ### Ruzicka kernel #### Definition The Ruzicka kernel (also known as: quantitative Jaccard, or min-max kernel) is defined as: $$ Ruzicka(x_{i},x_{j}) = \frac{\sum_d min(x_{id},x_{jd})}{\sum_dmax(x_{id}+x_{jd})}$$ where $x_{id}, x_{jd}$ are two samples (that is: two rows of the dataset). It is bounded between 0 and 1. #### Usage The Ruzicka kernel is related to the quantitative Jaccard distance, aka Ruzicka distance, in the same manner as the Bray-Curtis kernel is related to the Bray-Curtis dissimilarity. The Ruzicka distance is rank-order similar to BCD, but it fulfills the conditions to be a metric. It is also related to the Jaccard kernel (which is introduced in the [Jaccard kernel] subsection) for sets or presence/absence data. It can be applied over counts and frequencies; however, the "correctness" of using directly this kernel over compositional data is disputed. In `kerntools`, it can be computed typing: ```{r,eval=FALSE} ### Our example dataset contains the bacterial abundance of *D* species in *N* soil samples. data <- soil$abund Ruzicka(data) ``` ## Kernels for categorical data Categorical data consist on qualitative variables that can take a finite number of values (called classes, modalities, categories, or levels). Within this package, "categorical data" refers to nominal data whose categories don't have a meaningful order, in contrast with ordinal data. The typical categorical dataset is a matrix *X* with *N* samples and *D* categorical variables or features. (In R, "data.frames" having columns of class "factor" can be used for this kind of data). ### Dirac kernel #### Definition First we define the Overlap kernel over a single categorical variable as: $$x_{id}=x_{jd} \Leftrightarrow Overlap(x_{id},x_{jd}) = 1$$ $$x_{id} \neq x_{jd} \Leftrightarrow Overlap(x_{id},x_{jd}) = 0$$ where $x_{id},x_{jd}$ are two samples and *D = 1*. This is the most basic categorical kernel, akin in a way to the linear kernel for continuous data. From there, we can define the Dirac kernel over multivariate categorical data as the linear combination of the Overlap kernel's evaluations: $$ Dirac(x_{i},x_{j}) = \sum_d c_d Overlap(x_{id},x_{jd})$$ The coefficients $c_d>0$ can be all 1 (and then, the Dirac kernel is simply the sum of the Overlap kernel evaluations), *1/D* (so the Dirac kernel is the mean), or may be used to weight differently the variables (for example, because you have previous knowledge about your problem). When the coefficients are all set to *1/D*, the values for the Dirac kernel range between 0 and 1, the 1 meaning that the two samples are equal, and 0 that they are completely different. This is equivalent to computing the "sum" Dirac kernel and then cosine-normalizing it. When custom coefficients are used but $\sum_d c_d=1$, this output of this kernel is a weighted average bounded between 0 and 1 (thus, trying to cosine-normalize it is useless). #### Feature space We can map our data onto the feature space of the Overlap kernel via one-hot-encoding. That is: the map $φ(x)$ recodes a categorical variable (using the R nomenclature, "factor") with *L* levels into *L* dummy variables (each one representing a given level) that can be 0 or 1. For instance, if we have a variable that can take three levels ("A", "B" and "C"): ```{r} cat_feat <- data.frame(var=factor(sample(LETTERS[1:3],10,replace = TRUE))) rownames(cat_feat) <- 1:10 cat_feat ``` the feature space for the Overlap kernel is: ```{r} dummy_data(cat_feat) ``` The feature space for the Dirac kernel results of "concatenating" the feature spaces for all *D* categorical variables. If the weights $c_d \neq 1$, then data in feature space should be scaled accordingly. For instance, if $c_d = 1/D$ for all weights, which (as stated before) is the same than cosine-normalize the Dirac kernel, then the feature space is normalized to unit length. #### Usage In `kerntools`, we need a *NxD* dataset of class "matrix" (2-d dimensional array) that contains only characters, or instead a dataset of class "data.frame" (with all entries = "character", or columns = "factor"). `showdata` is an example dataset that meets this requirements. The Dirac kernel can be called doing: ```{r,eval=FALSE} Dirac(showdata) ``` Moreover, if we want to compute the feature space too, we can type: ```{r,eval=FALSE} Dirac(showdata,feat_space = TRUE) ``` By default `Dirac(...,comp = "sum")`; that is, all coefficients $c_d$ are set to 1. If we want the Dirac kernel computed as the mean of the Overlap kernel evaluations, we can do `Dirac(...,comp = "mean")`. Instead, to have the average weighting according to some coefficients, we can do: ```{r,eval=FALSE} coeffs <- c(1/8,1/4,1/4,1/4,1/8) Dirac(showdata, comp = "weighted", coeff=coeffs) ``` The feature space returned by this function will change accordingly to the coefficients chosen. Of course, `Dirac(...,comp = "mean")` is the same than doing: ```{r,eval=FALSE} coeffs <- rep(1/ncol(showdata),length(showdata)) Dirac(showdata, comp = "weighted", coeff=coeffs) ``` ## Kernels for sets A set *S* is a collection of different objects (characters, numbers, strings, ... even other sets). We call *universe* to the set of all objects we want to consider in a given problem. The objects of *S* are called their *elements*, and we denote as $|S|$ the size of the set: that is, the number of elements it contains. ### Intersect kernel #### Definition Imagine that we have a dataset (size *N*) such as each sample is a set: $S_1,...,S_N$. The most simple kernel for this data is the Intersect kernel: $$Intersect(S_i,S_j)= |S_i \cap S_j| $$ that is, the size of the two sets' shared elements. This is akin to the linear kernel for continuous data (see the "Feature space" subsection below). This kernel ranges from 0 (the sets are completely different) to any positive number. In general higher values indicate higher similarity; however, some sets may be bigger (that is: they contain more elements) than others. You can use the cosine-normalization to remove the size effect and have a result that is bounded between 0 and 1: $$Intersect_{cosine}(S_i,S_j)= \frac{|S_i \cap S_j|}{\sqrt{|S_i|·|S_j|}}$$ From there, we can consider a multivariate case: each sample has several (*D*) variables or features that happen to be sets. As we have done with the Dirac kernel, we can do a linear combination of the Intersect kernel's evaluations to produce a "multivariate" kernel: $$ Intersect_{multi}(S_{i},S_{j}) = \sum_{d=1}^{D} c_d Intersect(S_{id},S_{jd})$$ The coefficients $c_d>0$ can be all 1 (and then, this multivariate kernel is simply the sum of the Intersect kernel evaluations), *1/D* (so the Intersect-multi kernel is the mean of evaluations), or may be used to weight differently the variables (for example, because you have previous knowledge about your problem). In this last case, the Intersect-multi kernel is a lineal combination of the Intersect kernel's evaluations. When the coefficients are all set to *1/D* and we also perform a cosine normalization over the univariate evaluations, the values for the Intersect-multi kernel range between 0 and 1, the 1 meaning that the two samples are equal, and 0 that they are completely different. #### Feature space We can map our data onto the feature space of the Intersect kernel via a sort of one-hot-encoding. In this case, the map $φ(S)$ recodes a set coming from a universe with *L* elements into *L* dummy variables (each one representing a given level) that can be 1, if a given element *l* is present in the set *S*, or 0 if it is absent. For example: **Favorite colors of 3 people:** ```{r} universe <- c("blue","green","lightblue","orange","purple","red","white","yellow") person1 <- c(2, 6, 8) person2 <- c(1, 8) person3 <- c(2, 3, 6) list(person1=universe[person1],person2=universe[person2],person3=universe[person3]) ``` The data mapped onto the feature space for this kernel is: ```{r} colors <- matrix(0,nrow=3,ncol=length(universe)) rownames(colors) <- c("person1","person2","person3") colnames(colors) <- universe colors[1,person1] <- 1 colors[2,person2] <- 1 colors[3,person3] <- 1 colors ``` and now the Intersect kernel can be computed directly as the dot product between samples. If we use the Intersect-cosine, then the feature space is normalized accordingly to the unit vector: ```{r} cosnormX(colors) ``` Moreover, the Intersect-multi kernel's feature space results of "concatenating" the feature spaces of all sets associated to a sample, each one scaled accordingly by the weights $c_d$. #### Usage `kerntools` implementation requires that the set members (elements) are character symbols (length=1) combined into a string. Compare this with the last example: ```{r} universe <- c("b","g","l","o","p","r","w","y") colors <- matrix("",ncol=1,nrow=3) rownames(colors) <- c("person1","person2","person3") colors[1,] <-paste(universe[person1],collapse="") colors[2,] <-paste(universe[person2],collapse="") colors[3,] <-paste(universe[person3],collapse="") colors ``` That is a *NxD* "matrix" or "data.frame" containing characters. Each sample is in a different row. Now, you can do: ```{r,eval=FALSE} Intersect(colors,elements = universe) ``` This function requires that we state the universe's elements explicitly. The feature space can be returned too: ```{r,eval=FALSE} Intersect(colors, elements = universe, feat_space = TRUE) ``` If data is multivariate ($D>1$ columns, and each one contains a set feature), the universe should span all elements found in all the *D* sets. For instance, consider this second feature: **Best friends of the same 3 people:** (*Anna = A*; *Bernard = B*; *Cesc = C*; *Diana = D *; *Elsa = E *; *Fran = F*; *Gisele = G*; *Horaci = H*; *Iria = I*; *Jaume = J*) ```{r} universe2 <- LETTERS[1:10] person1 <- c(1,2,9,10) person2 <- c(1,4,6,7) person3 <- c(2,3,5,9,10) person1=universe2[person1] person2=universe2[person2] person3=universe2[person3] list(person1=person3,person2=person3,person3=person3) friends <- colors friends[1,] <- paste(person1,collapse="") friends[2,] <- paste(person2,collapse="") friends[3,] <- paste(person3,collapse="") friends ``` So the final dataset is: ```{r} set_data <- data.frame(colors=colors,friends=friends) set_data ``` And now we can compute the kernel: ```{r,eval=FALSE} Intersect(set_data,elements = c(universe,universe2),comp="sum") ``` (We should indicate the elements of the two universes: colors and friends). When `Intersect(...,comp = "sum")`, all $c_d = 1$ and, in principle, all "features" (following with our last example: "friends" and "colors") have the same weight. In practice you should be careful, because `kerntools` doesn't cosine-normalizes each univariate evaluation of the kernel. Thus, you may inadvertently introduce some bias in favor of "bigger" sets ("friends" will have, implicitly, more weight than "colors"). Instead, when `comp = "mean"` or `comp = "weighted"` is chosen, `kerntools` cosine-normalizes all univariate evaluations, so the result ranges between 0 and 1, and there are not implicit biases. `Intersect(...,comp = "mean")` sets $c_d = 1/D$ and therefore gives the same importance to all features. However, if you want to give more importance to some features over others, you can choose manually the weights that you want: ```{r,eval=FALSE} ### We will make that "friends" has 3 times more importance than "colors": coeffs <- c(1/4,3/4) Intersect(set_data, elements = c(universe,universe2),comp = "weighted", coeff=coeffs) ``` Here, the result of the kernel (between 0 and 1) is the weighted average of the features. If you set `Intersect(..., feat_space = TRUE)`, the feature space returned by this function will be scaled accordingly to the $c_d$ coefficients chosen. The same will occur when doing `Intersect(...,comp = "mean")`, but in this case the scaling consist in simply dividing by *D* (the set features). ### Jaccard kernel #### Definition The Jaccard index (or Jaccard kernel) of two sets is defined as: $$Jaccard(S_i,S_j)= \frac{|S_i \cap S_j|}{|S_i \cup S_j|} $$ It naturally ranges between 0 (the sets don't share any element) and 1 (the sets are identical). Thus, the cosine normalization is redundant for this kernel. Like the Dirac and the Intersect kernels, we can consider a multivariate case for the Jaccard kernel. If each sample has several (*D*) variables that happen to be sets, we can do a linear combination of the kernel's evaluations to produce a "multivariate" kernel: $$ Jaccard_{multi}(S_{i},S_{j}) = \sum_{d=1}^{D} c_d Jaccard(S_{id},S_{jd})$$ The coefficients $c_d>0$ can all be 1 (and then, this multivariate kernel is simply the sum of the Jaccard kernel evaluations), *1/D* (so the Jaccard-multi kernel is the mean of evaluations), or may be used to weight variables (sets) differently (for example, because you have previous knowledge about your problem). In this last case, the Jaccard-multi kernel is a weighted average of the Jaccard kernel's evaluations. When the coefficients are all set to $\sum c_d=1$, the values given by the Jaccard-multi kernel range between 0 and 1. #### Usage The usage is pretty similar to the Intersect kernel: ```{r,eval=FALSE} Jaccard(set_data,elements = c(universe,universe2),comp="sum") ``` The only nuance here is that the univariate Jaccard kernel always is bounded between 0 and 1. There are not underlying biases and all features have the same weight (unless you explicitly do `Intersect(...,comp = "weighted")` and enter some coefficients). Thus, this: ```{r,eval=FALSE} cosNorm(Jaccard(set_data,elements = c(universe,universe2),comp="sum")) ``` or this: ```{r,eval=FALSE} D <- ncol(set_data) Jaccard(set_data,elements = c(universe,universe2),comp="sum")/D ``` Is the same than this: ```{r,eval=FALSE} Jaccard(set_data,elements = c(universe,universe2),comp="mean") ``` Also, take into account that the `Jaccard()` function does not return the data mapped onto the feature space of the kernel. ## Kernels for ordinal data, ranks, and permutations Ordinal data consist on qualitative variables that can take a finite number of values that follow an intrinsic and meaningful order. In that regard, ordinal data is in-between continuous and categorical (nominal) data. For instance, in a survey, the typical agree/disagree questions that have answers like "strongly agree"/"agree"/"neither agree or disagree"/"disagree"/etc. are ordinal variables. This is related to the concept of "ranking", which consists in assigning the ordering labels "1st", "2nd", "3rd", ... to these values. ### Kendall's $\tau$ kernel #### Definition Kendall's $\tau$ coefficient measures the similarity or rank correlation between to two ordinal variables. It can be applied to different ordinal variables or to different rankings of the same variable. It is also a kernel function, defined as: $$Kendall_\tau (x,y)= \frac{nc-nd}{np}$$ Let $(x_1,y_1),...,(x_L,y_L)$ be two ordinal variables, or two samples of the same variable. For simplicity, we shall consider that all values taken by $x_i$ and $y_i$ are unique. Observations $(x_i, y_i)$ and ($x_j , y_j)$ with $i < j$ are "concordant" if we either observe $x_i < x_j$ and $y_i < y_j$, or $x_i > x_j$ and $y_i > y_j$. Otherwise, they are "discordant". (Please note that, as $x_i$ and $y_i$ are unique, there are no *ties*: it's impossible that $x_i = x_j$ or $y_i = y_j$). All possible pairwise evaluations are taken into account: $np$ is the total number of pairs, given by $\binom{L}{2}$, $nc$ is the number of total concordant pairs, and $nd$ is the number of total discordant pairs. The result of the Kendall's $\tau$ kernel can range between -1 and 1. 1 means that the two rankings are identical: the two ordinal samples/variables are equal, -1 that one ranking is the reverse of the other, and 0 that the two rankings are completely independent. #### Feature space If we call $r_i$ and $s_i$ the ranks of the i-th member in the two ordinal samples, we can then define two matrices **A** and **B** that: $$\mathbf{A}_{ij} = sign(r_j - r_i)$$ $$\mathbf{B}_{ij} = sign(s_j - s_i)$$ These matrices fulfill that $\mathbf{A}^T = -\mathbf{A}$ and that $\mathbf{B}^T = -\mathbf{B}$. Now, we can observe that: $$Kendall_\tau (x,y) = Frobenius_{cosine}(\mathbf{A},\mathbf{B})$$ From there, the data in feature space can be found the same way than in the Frobenius kernel. It can be deduced that, for the Kendall's $\tau$ kernel, feature space has dimension $\binom{L}{2}$, and each feature corresponds to a "relative order" ($+ = x_i < x_j$; $- = x_i > x_j$) between all pair of levels within a ranking. #### Usage A *NxL* or *LxN* dataset is required. *N* is the sample size and *L* the number of values (levels) of the ordinal variable. This is an example that such a dataset: **Three people are given a list of 10 colors. They rank them from most (1) to least (10) favorite.** ```{r} color_list <- c("black","blue","green","grey","lightblue","orange","purple", "red","white","yellow") survey1 <- 1:10 survey2 <- 10:1 survey3 <- sample(10) color <- cbind(survey1,survey2,survey3) # Samples in columns rownames(color) <- color_list color ``` In this case, the ranking is already done. The Kendall's $\tau$ kernel is computed like: ```{r,eval=FALSE} Kendall(color) ``` (If we prefer samples in rows, we can do:) ```{r,eval=FALSE} color <- t(color) Kendall(color,samples.in.rows=TRUE) ``` Alternatively, the ranking may be not given explicitly by the user. Consider this other dataset: **The same three people are asked the number of times they ate 5 different kinds of food during the last month** ```{r} food <- matrix(c(10, 1,18, 25,30, 7, 5,20, 5, 12, 7,20, 20, 3,22),ncol=3,nrow=5,byrow = TRUE) colnames(food) <- colnames(color) rownames(food) <- c("spinach", "chicken", "beef" , "salad","lentils") food ``` In that case, the ranking is done internally before computing the Kendall's $\tau$. ```{r,eval=FALSE} Kendall(food) ``` Furthermore, we can combine these two datasets, because the participants are the same: ```{r,eval=FALSE} ordinal_data <- list(color=color,food=food) #All samples in columns Kendall(ordinal_data) ``` By default, the Kendall's $\tau$ kernel values of the two datasets are averaged. The alternative is to sum them, doing: `Kendall(...,comp="sum")`. Right now, `Kendall()` does not return the data mapped onto the feature space. ## Kernels for strings, sequences, or text Strings or sequences are vectors of objects (usually represented as characters) in which repetitions are allowed and order matters. These characters are drawn from a given alphabet $A$. An example of an alphabet is: ```{r} LETTERS ``` We denote the size of an alphabet as $|A|$. In the example above, this is: ```{r} length(LETTERS) ``` ### Spectrum kernel #### Feature space Let $x$ be a string or sequence, and $s$ a substring of length *L* that consist of characters drawn from an alphabet $A$, so we can count the times ($t_s$) that $s$ occurs in $x$. Then, we do the same for all possible substrings of length *L* generated from $A$. Each $t_s$ is a feature in feature space, which has dimension $|A|^L$. #### Definition The Spectrum kernel is the dot product of two strings mapped onto the feature space we have just defined. Alternatively, it can also be written as: $$Spectrum(x_i,x_j) = \sum_{s \in A^L} |s \in x_i|·|s \in x_j| $$ This kernel ranges from 0 (the strings are completely different: they don't share any substring) to any positive number. In general, higher values indicate higher similarity; however, it's important to remember that some strings may be longer than others. This is fine if the strings' length is relevant to the problem you are studying; but if it is not, you can cosine-normalize the Spectrum kernel so the result is bounded between 0 and 1: $$Spectrum_{cosine}(x_i,x_j) = \frac{Spectrum(x_i,x_j)}{\sqrt{Spectrum(x_i,x_i)Spectrum(x_j,x_j)}}$$ A result of 0 means complete dissimilarity, 1 means complete similarity, and the strings' length is not taken into account. #### Usage You need a dataset containing *N* strings or text data. For instance: ```{r} alphabet <- c(letters,"_") strings <- c("hello_world","hello_word","hola_mon","kaixo_mundua", "saluton_mondo","ola_mundo", "bonjour_le_monde") names(strings) <- c("english1","english_typo","catalan","basque", "esperanto","galician","french") strings alphabet ``` Then you can compute the Spectrum kernel. For instance, we could consider substrings of length L = 2: ```{r,eval=FALSE} Spectrum(strings,l=2,alphabet=alphabet) ``` The feature space can be recovered doing: ```{r,eval=FALSE} Spectrum(strings,l=2,alphabet=alphabet, feat_space = TRUE) ``` The cosine-normalized version can be computed typing: ```{r,eval=FALSE} Spectrum(strings,l=2,alphabet=alphabet,cos.norm = TRUE) ``` (if you set `feat_space=TRUE`, the feature space returned by `Spectrum()` will be scaled accordingly). Right now, the Spectrum kernel implemented by `kerntools` is very basic and, in large datasets and/or when *L* is high, the computation may be excessively slow. In that case, there are other R packages that you may use. For instance, you may refer to `kernlab::stringdot()`, or check the `kebabs` package, which covers complex kernels for strings like the motif kernel, the mismatch kernel, the gappy pair kernel, etc.