Kernel density estimates for tidy and geospatial data in the eks package

Tidy data example

The data set we focus on is the crabs data set from the MASS package, with the variables FL (frontal lobe size, in mm), CW (carapace width, in mm) and sp (species: B for blue, O for orange).

library(eks)
library(colorspace)
library(ggplot2)
library(dplyr)

## crabs data 
data(crabs, package="MASS")
crabs2 <- select(crabs, FL, CW)
xlab <- "Frontal lobe size (mm)"
ylab <- "Carapace width (mm)"

The KDE for tidy data is computed by tidy_kde. From its output, the scatter plot of the data is generated by geom_point_ks and the contour plot of the KDE by geom_contour_ks. The bimodal structure of the data distribution, corresponding to the two species, is clearly visible in the KDE plot. This is due to the optimal choice of the matrix of smoothing parameters. This optimal smoothing matrix is the plug-in bandwidth computed by ks::Hpi, and it is suitable for a wide range of data sets. For further details of the computation of the kernel density estimate and the bandwidth, see ?ks::kde and ?ks::Hpi.

## KDE contour plot + scatter plot
tkde2 <- tidy_kde(crabs2)
gkde2 <- ggplot(tkde2, aes(x=FL, y=CW)) + labs(x=xlab, y=ylab)
gkde2 + geom_point_ks(colour=8) + geom_contour_ks(colour=1)
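The plug-in bandwidth matrix itself can be inspected by calling the ks package directly. The following is a minimal sketch, assuming that the ks package is installed; the object names x, H and fhat are chosen here for illustration only and are not created by tidy_kde.

```r
library(ks)

## crabs data as a numeric matrix with columns FL, CW
data(crabs, package="MASS")
x <- as.matrix(crabs[, c("FL", "CW")])

## plug-in bandwidth matrix: 2 x 2, symmetric, positive definite
H <- Hpi(x=x)
H

## KDE with this bandwidth; if H is omitted, kde selects a plug-in bandwidth itself
fhat <- kde(x=x, H=H)
```

The diagonal entries of H control the amount of smoothing along each axis, and the off-diagonal entry controls the orientation of the smoothing.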

On the other hand, the default bandwidth and the resulting KDE computed by ggplot2::geom_density_2d lead to an oversmoothed KDE which does not reveal the data bimodality.

## geom_density_2d KDE contour plot + scatter plot
mkde2 <- ggplot(crabs2, aes(x=FL, y=CW))
mkde2 + geom_point(colour=8) + geom_density_2d(colour=1, bins=4)
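To see where this difference may come from, the two bandwidths can be placed side by side. The sketch below assumes, following the ggplot2 and MASS documentation, that geom_density_2d relies on MASS::kde2d with per-axis bandwidths from MASS::bandwidth.nrd, and that kde2d divides these by 4 to obtain the kernel standard deviations. The names h_nrd and H_pi are illustrative only.

```r
library(ks)

data(crabs, package="MASS")
x <- as.matrix(crabs[, c("FL", "CW")])

## per-axis normal reference bandwidths, rescaled by 1/4 as in MASS::kde2d
h_nrd <- c(FL=MASS::bandwidth.nrd(x[, "FL"]),
   CW=MASS::bandwidth.nrd(x[, "CW"])) / 4

## plug-in bandwidth matrix, which is on the variance scale
H_pi <- Hpi(x=x)

## compare both on the standard deviation scale
rbind(nrd=h_nrd, plugin=sqrt(diag(H_pi)))

## off-diagonal entry: has no counterpart in the per-axis bandwidths
H_pi[1, 2]
```

The per-axis bandwidths smooth parallel to the coordinate axes only, whereas the unconstrained plug-in matrix can also adapt the orientation of the smoothing to correlated variables such as FL and CW.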

The default choice of the contour levels in the eks package is based on probability contours. Probability contours offer an intuitive approach to selecting the contour levels that reveal the pertinent characteristics of the data distribution. See Chacon & Duong (2018, Chapter 2.2). Filled contour plots, generated by geom_contour_filled_ks, can be coloured with an appropriate sequential colour scale. For example, the 25% contour region (red region) is the smallest region that contains 25% of the probability mass defined by the KDE. The 50% contour region consists of the union of the orange region and the red region, and it contains 50% of the probability mass, and so on. Note that the 25%, 50% and 75% contour regions of the crabs KDE are composed of separate, unconnected contour sub-regions.

## KDE filled contour plot
gkde2 + geom_contour_filled_ks(colour=1)
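The link between a probability contour and a density height can be illustrated with the ks package. This sketch approximates the height of each contour by a sample quantile of the density estimate evaluated at the data points; it illustrates the principle and is not necessarily the exact computation carried out inside geom_contour_filled_ks. The names fx and lev are illustrative only.

```r
library(ks)

data(crabs, package="MASS")
x <- as.matrix(crabs[, c("FL", "CW")])
fhat <- kde(x=x)

## density estimate evaluated at each data point
fx <- predict(fhat, x=x)

## the 25% region is {f >= c25}, where c25 is the 75% quantile of fx;
## likewise for the 50% and 75% regions
lev <- quantile(fx, probs=1 - c(25, 50, 75)/100)
lev

## proportion of data points inside each region: about 0.25, 0.50, 0.75
prop <- sapply(lev, function(l) mean(fx >= l))
prop
```

A higher contour height therefore corresponds to a smaller probability: the 25% region is bounded by the highest of the three levels.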

As an alternative to these discretised contours, the usual ggplot2::geom_raster generates a plot with a continuous colour scale.

## KDE continuous colour scale plot
gkde2 + geom_raster(aes(fill=estimate), interpolate=TRUE) + 
   scale_fill_continuous_sequential(palette="Heat")
#> Scale for fill is already present.
#> Adding another scale for fill, which will replace the existing scale.

One of the main advantages of ggplot2 is its ability to handle grouped data: in this case, the KDE plots for each species. The KDE with blue contours is for the B species, and orange contours for the O species.

crabs2g <- select(crabs, FL, CW, sp)
crabs2g <- group_by(crabs2g, sp)
tkde2g <- tidy_kde(crabs2g)
gkde2g <- ggplot(tkde2g, aes(x=FL, y=CW, group=sp)) + labs(x=xlab, y=ylab, colour="Species") + 
   scale_colour_manual(values=c(4, 7)) + 
   guides(colour=guide_legend(title="Species"))
#> Scale for colour is already present.
#> Adding another scale for colour, which will replace the existing scale.

## facetted KDE contour plots + scatter plots
gkde2g + geom_point_ks(colour=8) + 
   geom_contour_ks(aes(colour=sp)) + facet_wrap(~sp)

The probability contour levels computed in geom_contour_ks and geom_contour_filled_ks are relative to the grouping variable. So whilst the same 25% probability level is applied to the KDEs of both groups, the height of the 25% contour region for the blue species is 0.04143, and for the orange species it is 0.03171. For a direct comparison of different KDEs, it is useful to have a set of fixed contour heights for all KDEs. A heuristic solution is implemented in contour_breaks. For the crabs data, this gives 0.01418, 0.02690, 0.04143. Since the KDE for the B species exceeds the highest level 0.04143, whereas the O species KDE does not reach this level, the former KDE is more peaked.

## facetted KDE filled contour plots with fixed contour levels for all facets
bkde2g <- contour_breaks(tkde2g)
gkde2g + geom_contour_filled_ks(breaks=bkde2g, colour=1) + facet_wrap(~sp)
#> Ignoring unknown labels:
#> • colour : "Species"
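The dependence of the contour heights on the group can also be examined outside of ggplot2. The following sketch computes one KDE per species with the ks package, then the heights of its 25%, 50% and 75% probability contours and its maximum over the estimation grid. The names xs, fhats, levs and peaks are illustrative only, and the values may differ slightly from those quoted above if the estimation settings in tidy_kde differ from the ks defaults.

```r
library(ks)

data(crabs, package="MASS")

## one KDE per species, each with its own plug-in bandwidth
xs <- split(crabs[, c("FL", "CW")], crabs$sp)
fhats <- lapply(xs, function(d) kde(x=as.matrix(d)))

## per-group heights of the 25%, 50%, 75% probability contours
levs <- sapply(fhats, function(f) contourLevels(f, cont=c(25, 50, 75)))
levs

## maximum of each KDE over its estimation grid
peaks <- sapply(fhats, function(f) max(f$estimate))
peaks
```

Since the same probability corresponds to a different height in each column of levs, a common set of heights is required before the facets can be compared directly.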

Geospatial data example

GIS for geospatial data analysis in R is implemented in the sf package, and the eks package builds on this. To illustrate geospatial KDE, we focus on the grevilleasf data set in the eks package. It has 22303 rows, where each row corresponds to an observed grevillea plant in Western Australia. In addition, we utilise wa, the geospatial polygon for Western Australia. Both of these geospatial data sets are in the EPSG:7850 (GDA2020/MGA zone 50) projection.

library(sf)

## Grevillea data
data(grevilleasf, package="eks")
grevilleasf <- mutate(grevilleasf, species=factor(species))
paradoxa <- filter(grevilleasf, name %in% "Grevillea paradoxa")
eryngioides <- filter(grevilleasf, name %in% "Grevillea eryngioides")
grevillea_ep <- filter(grevilleasf, name %in% c("Grevillea eryngioides", 
   "Grevillea paradoxa"))
grevillea_ep <- group_by(grevillea_ep, name)
xlim <- c(1.2e5, 1.1e6); ylim <- c(6.1e6, 7.2e6)

## WA polygon
data(wa, package="eks")
gwa <- geom_sf(data=wa, fill=NA, colour=1)
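Before overlaying the two geospatial data sets, it can be worthwhile to confirm that they share the same coordinate reference system. A minimal check, using only sf::st_crs, could be as follows; the names crs_grev and crs_wa are illustrative only.

```r
library(sf)

data(grevilleasf, package="eks")
data(wa, package="eks")

## coordinate reference system of each data set
crs_grev <- st_crs(grevilleasf)
crs_wa <- st_crs(wa)
crs_grev

## TRUE if both data sets are in the same projection
crs_grev == crs_wa
```

If the projections were to differ, one of the data sets could be reprojected with sf::st_transform before computing the KDE.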

Since geospatial data can be visualised with both the base R and ggplot2 graphics engines, we provide code for both: their outputs are similar due to the standardisation of geospatial maps within GIS. However, these plots cannot be mixed, due to fundamental differences between the graphical rendering in base R and ggplot2.

## base R scatter plot
plot(st_geometry(wa), xlim=xlim, ylim=ylim)
plot(st_geometry(eryngioides), add=TRUE, col=3, pch=16, cex=0.5)
plot(st_geometry(paradoxa), add=TRUE, col=6, pch=17, cex=0.5)
mapsf::mf_legend(type="symb", val=c("Grevillea eryngioides", "Grevillea paradoxa"), 
   pal=c(3,6), pch=16:17, cex=c(1,1), title="Species", pos="bottomleft")

## geom_sf scatter plot 
theme_set(ggthemes::theme_map())
ggplot() + gwa + 
   geom_sf(data=grevillea_ep, aes(colour=name, shape=name)) + 
   coord_sf(xlim=xlim, ylim=ylim) + scale_colour_manual(values=c(3, 6)) + 
   guides(colour=guide_legend(title="Species"), shape=guide_legend(title="Species"))

The KDE for geospatial data is computed by st_kde. The calculations of the KDE, including the bandwidth matrix of smoothing parameters, are the same as in tidy_kde. However, unlike tidy_kde, where the probability contour regions are computed dynamically in geom_contour_filled_ks, st_kde explicitly computes the 1% to 99% regions as multipolygons, since this conversion can be computationally heavy to execute for each plot. For display, it is then a matter of selecting the appropriate contour regions. The quartile contours 25%, 50%, 75% are selected by default in geom_contour_filled_ks for tidy data. This is also the case for the base R plot.

skde1 <- st_kde(paradoxa)

## base R contour plot
plot(st_geometry(wa), xlim=xlim, ylim=ylim)
plot(st_geometry(paradoxa), add=TRUE, pch=16, col=8, cex=0.5)
plot(skde1, add=TRUE, col=NA, border=1, legend=FALSE)

On the other hand, we can’t replicate exactly the default contour selection in geom_sf, so we first apply the auxiliary function st_get_contour to the input of geom_sf.

## geom_sf contour plot
gs <- ggplot(skde1) + gwa + ggthemes::theme_map()
gs + geom_sf(data=paradoxa, col=8, size=0.5) + 
   geom_sf(data=st_get_contour(skde1), colour=1, fill=NA) + 
   coord_sf(xlim=xlim, ylim=ylim)
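Other contour regions can be selected in the same way. The sketch below assumes that st_get_contour accepts a cont argument containing the required percentages, with the quartiles as its default; see ?st_get_contour to confirm this. The name cont1 is illustrative only.

```r
library(eks)
library(sf)
library(dplyr)

data(grevilleasf, package="eks")
paradoxa <- filter(grevilleasf, name %in% "Grevillea paradoxa")
skde1 <- st_kde(paradoxa)

## extract the 10%, 50% and 90% contour regions instead of the quartiles
## (cont is assumed to be the argument holding the contour percentages)
cont1 <- st_get_contour(skde1, cont=c(10, 50, 90))
cont1
```

Since the result is an sf object, it can be passed to geom_sf via its data argument, or to the base R plot method for sf objects.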

To generate a filled contour plot, the only required change is to set the fill aesthetic to contperc (to produce legend labels with density contour percentages) or estimate (labels with density estimate levels).

## base R filled contour plot
plot(st_geometry(wa), xlim=xlim, ylim=ylim)
plot(skde1, add=TRUE)

## geom_sf filled contour
gs + geom_sf(data=st_get_contour(skde1), aes(fill=contperc)) + 
  coord_sf(xlim=xlim, ylim=ylim)

Since the output from st_kde is compatible with geom_sf, it is easy to generate multiple maps of related geospatial KDEs. For example, KDEs for each Grevillea species, with probability contour levels or with fixed contour levels:

skde1g <- st_kde(grevillea_ep)

## facetted geom_sf filled contour
gs + geom_sf(data=st_get_contour(skde1g), aes(fill=contperc)) + 
   facet_wrap(~name) + coord_sf(xlim=xlim, ylim=ylim)

## facetted geom_sf filled contour with fixed contour levels for all facets
bkde1g <- contour_breaks(skde1g)
gs + geom_sf(data=st_get_contour(skde1g, breaks=bkde1g), aes(fill=estimate)) +
   facet_wrap(~name) + coord_sf(xlim=xlim, ylim=ylim)

Tarn Duong https://mvstat.net/tduong/

07 December 2025
