OPTICS k-Xi

This R package provides a novel cluster extraction method for the OPTICS algorithm, OPTICS k-Xi, along with ggplot2 visualizations and a framework to compare clustering models with varying parameters using distance-based metrics.

Summary

Density-based clustering methods are well adapted to the clustering of high-dimensional data and enable the discovery of core groups of various shapes despite large amounts of noise.

The opticskxi R package provides a novel density-based cluster extraction method, OPTICS k-Xi, and a framework to compare k-Xi models using distance-based metrics to investigate datasets with an unknown number of clusters. The vignette first introduces density-based algorithms with simulated datasets, then presents and evaluates the k-Xi cluster extraction method. Finally, the model comparison framework is described and applied to two genetic datasets to identify groups and their discriminating features.

The k-Xi algorithm is a novel OPTICS cluster extraction method that specifies the number of clusters directly and, unlike the OPTICS Xi method, does not require fine-tuning of the steepness parameter. Combined with a framework that compares models with varying parameters, the OPTICS k-Xi method can identify groups in noisy datasets with an unknown number of clusters.
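
As a rough illustration of that difference, the sketch below runs both extraction methods on the same OPTICS ordering. It assumes the dbscan and opticskxi packages are installed and uses dbscan::extractXi as the OPTICS Xi reference. The xi = 0.05 value is an arbitrary choice for illustration, not a recommended setting.

```r
library(opticskxi)

# Example dataset used in the Usage section of this README
data('multishapes')
optics_shapes <- dbscan::optics(multishapes[1:2])

# OPTICS Xi: the steepness threshold xi is chosen by the user and the
# number of clusters is whatever that threshold happens to produce
# (xi = 0.05 is an arbitrary illustrative value)
xi_shapes <- dbscan::extractXi(optics_shapes, xi = 0.05)
table(xi_shapes$cluster)

# OPTICS k-Xi: the number of clusters is requested directly through n_xi
kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30)
str(kxi_shapes)
```

Changing xi in the first call typically changes how many clusters come out, whereas the second call takes the target number as an input.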

Installation

Using the devtools package in R:

  devtools::install_git('https://framagit.org/thomaschln/opticskxi.git')

Usage

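Compute the OPTICS profile and k-Xi clustering:

  library(opticskxi)

  # Example dataset shipped with the package
  data('multishapes')
  # OPTICS ordering on the two coordinate columns
  optics_shapes <- dbscan::optics(multishapes[1:2])
  # Extract 5 clusters of at least 30 points each
  kxi_shapes <- opticskxi(optics_shapes, n_xi = 5, pts = 30)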

Visualize with ggplot2:

  ggplot_optics(optics_shapes)
  ggplot_kxi_profile(kxi_shapes)

Compare multiple k-Xi models on a dataset with an unknown number of clusters and visualize the best models:

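  # Provides the %>% pipe, in case it is not already attached
  library(magrittr)

  data('hla')
  # Drop the first two columns and standardize the remaining variables
  m_hla <- hla[-c(1:2)] %>% scale
  # Grid of k-Xi parameters and distance measures to compare
  df_params_hla <- expand.grid(n_xi = 3:5, pts = c(20, 30, 40),
    dist = c('manhattan', 'euclidean', 'abscorrelation', 'abspearson'))
  df_kxi_hla <- opticskxi_pipeline(m_hla, df_params_hla)
  # Plot the metrics and the OPTICS profiles of the best models
  ggplot_kxi_metrics(df_kxi_hla, n = 8)
  gtable_kxi_profiles(df_kxi_hla) %>% plot
  # Extract the model at rank 2 and its cluster assignments
  best_kxi_hla <- get_best_kxi(df_kxi_hla, rank = 2)
  clusters_hla <- best_kxi_hla$clusters
  # Project on principal components and plot pairwise, colored by cluster
  fortify_pca(m_hla, sup_vars = data.frame(Clusters = clusters_hla)) %>%
    ggpairs('Clusters', ellipses = TRUE, variables = TRUE)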
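
To get a sense of how many models such a comparison involves, the parameter grid can be inspected on its own before running the pipeline. The sketch below uses only base R and the same values as the example above. The name df_params is illustrative.

```r
# Every combination of cluster count, minimum cluster size and distance
df_params <- expand.grid(n_xi = 3:5, pts = c(20, 30, 40),
  dist = c('manhattan', 'euclidean', 'abscorrelation', 'abspearson'))

# 3 values of n_xi * 3 values of pts * 4 distances = 36 rows
nrow(df_params)
head(df_params, 3)
```

Each row of the grid is one parameter combination, so widening any of the three ranges multiplies the number of rows accordingly.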

See the vignette for results and further details.

Acknowledgements

This work was inspired by Jérôme Wojcik (Precision for Medicine) and Sviatoslav Voloshynovskiy (University of Geneva).

License

This package is free and open source software, licensed under GPL-3.