---
title: "R HPC Benchmark"
author: "James R. McCombs"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{R HPC Benchmark}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

# Introduction

Benchmarking is a way to assess the performance of software on a computing
platform and to compare performance between different platforms.
The benchmark performance results can also be used to prioritize software
performance optimization efforts on emerging High Performance Computing (HPC)
systems.  Optimization of the R interpreter, its intrinsic functionality, and R
packages for specific hardware architectures will be necessary for data analysts
to take full advantage of the latest HPC clusters, and to obviate the need to
reengineer their analysis workflows.  We developed the `RHPCBenchmark` package
to determine the single-node run time performance of compute intensive linear
algebra kernels that are common to many data analytics algorithms, and the run
time performance of machine learning functionality commonly implemented with
linear algebra operations.

The benchmarks are divided into three categories: dense matrix linear
algebra kernels, sparse matrix linear algebra kernels, and machine learning
functionality.  All of the dense linear algebra kernels are implemented
around BLAS or LAPACK interfaces.  The sparse linear algebra kernels are
members of the R Matrix library.  The machine learning benchmarks currently
only cover variants of K-means functionality for clustering using the
`cluster` package.  The dense matrix linear algebra kernels, sparse
matrix linear algebra kernels, and machine learning functions that are
benchmarked are all part of the R interpreter's intrinsic functionality or
packages included the with the R programming environment standard
distributions from CRAN.

The following functions are intended to be called by users of this library
to perform benchmarking for each of the three categories of benchmarks:

- `RunDenseMatrixBenchmark` Executes the dense matrix microbenchmarks
- `RunSparseMatrixBenchmark` Executes the sparse matrix microbenchmarks
- `RunMachineLearningBenchmark` Executes the machine learning microbenchmarks

Each of these benchmark functions is a top-level function which
executes microbenchmarks on several R computational kernels and other
functionality relevant to each benchmark category.  The top-level benchmarks
execute multiple *performance trials* for each microbenchmark in which the
run time (synonymous with wall-clock time in this document) is obtained.  The
benchmarks then compute the average, standard deviation, maximum, and minimum
run time for each microbenchmark.  The user may also specify one or more
warm-up runs to ensure the R programming environment is settled before executing
the performance trials.  The results of each microbenchmark performance trial
are written to a data frame returned by the top-level benchmark functions.  The
summary statistics for each microbenchmark are written to files in CSV format
as the benchmark function progresses through the microbenchmarks, permitting
retention of data if the benchmark function is terminated prematurely.  If a
CSV file already exists in an output directory, the new results will be
concatenated to the existing file.

See the object documentation for the top-level benchmark functions and the
microbenchmark definition classes listed below for information on how to
configure the individual microbenchmarks.  If any of the microbenchmarks fails
to run in a timely manner or fails due to memory constraints, the matrix sizes
and number of performance trials per matrix can be adjusted.  Each benchmark
function takes as input microbenchmark definition objects that specify
data sets and microbenchmarking functions used to time the functionality
being benchmarked.  The microbenchmark definition classes supported are listed
below.  

Microbenchmark definition classes:

- `DenseMatrixMicrobenchmark` Specifies a dense matrix microbenchmark
- `SparseMatrixMicrobenchmark` Specifies a sparse matrix microbenchmark
- `ClusteringMicrobenchmark` Specifies a clustering for machine learning microbenchmark

Each microbenchmark definition class has an allocator function field and
a benchmark function field which specify functions for allocating the data for
the microbenchmark and the function that performs the timing of functionality
being tested.  The microbenchmark definition classes also have parameters used
to specify dimensions of the data sets to be used with the microbenchmarks.  See
the object documentation for the microbenchmark definition classes for
specifics on the class fields and how they should be populated.

# Supported Microbenchmarks

The microbenchmarks supported by each benchmark are described in this section.

## Supported Dense Matrix Microbenchmarks

The dense matrix microbenchmarks supported by the dense matrix benchmark
are: Cholesky factorization, matrix cross product, matrix determinant,
eigendecomposition, linear solve with multiple right hand sides, least squares
fit, matrix deformation and transpose, matrix-matrix multiplication,
matrix-vector multiplication, QR decomposition, and singular value
decomposition.

Each microbenchmark is tested with several matrices of increasing size.
The tested matrix dimensions are parameterized by $N$ with values of $N$ equal to:
1000, 2000, 4000, 8000, 10000, 15000, and 20000.
The lsfit microbenchmark tests matrices of dimension $2N$-by-$N/2$, while
the remaining microbenchmarks all test $N$-by-$N$ (square) matrices.

The table below lists the name of each microbenchmark supported
along with a brief description of the functionality it tests.  The kernel functions
are applied to matrix or vector operands, where A and B are input matrices and x
is an input vector generated by the corresponding allocator function.

| Microbenchmark name          | Functionality tested         | Kernel functions   |
|:-----------------------------|:-----------------------------|:-------------------|
| choleksy                     | Cholesky factorization       | chol(A)            |
| crossprod                    | Matrix cross product         | crossprod(A)       |
| deformtrans                  | Transpose, reshape, and retranspose matrix | B <- t(A); dim(B) <- c(N/2, 2*N); t(B) |
| determinant                  | Matrix determinant           | determinant(A)     |
| eigen                        | Eigendecomposition           | eigen(A, symmetric=FALSE, only.values=FALSE) |
| lsfit                        | Least squares fit            | lsfit(A, b, intercept=FALSE) |
| solve                        | Linear solve                 | solve(A, B)        |
| transpose                    | Matrix transpose             | t(A)               |
| matmat                       | Matrix-matrix multiplication | A %*% B            |
| matvec                       | Matrix-vector multiplication | A %*% x            |
| qr                           | QR decomposition             | qr(A, LAPACK=TRUE) |
| svd                          | Singular value decomposition | svd(A)             |

For fast performance of the dense matrix kernels, it is crucial to link
the R programming environment with optimized BLAS and LAPACK libraries.
It is also important to have substantial amounts of memory (16GB minimum)
to run most of the microbenchmarks.

The function `GetDenseMatrixDefaultMicrobenchmarks` defines the default
microbenchmarks referenced in the table.  It returns a vector of
`DenseMatrixMicrobenchmark` objects specifying each microbenchmark.
Microbenchmark parameters that can be specified include the dimensions of the
matrices to be performance tested with, number of performance trials per
matrix, and the allocator and microbenchmarking functions to be used.  By
default, the benchmark function `RunDenseMatrixBenchmark` calls
`GetDenseMatrixDefaultMicrobenchmarks` to specify the microbenchmarks to
be executed.  See the object documentation for
`GetDenseMatrixDefaultMicrobenchmarks` for more details.

The allocator functions for the dense matrix microbenchmarks take a
`DenseMatrixMicrobenchmark` object specifying the microbenchmark and an integer
index indicating which of the above values of $N$ is to be applied.  The
allocator must return a list of allocated data objects, including the matrix,
for the microbenchmark to operate on.  The microbenchmarking functions take the
`DenseMatrixMicrobenchmark` object defining the microbenchmark and the list of
data objects returned by the allocator function.

## Supported Sparse Matrix Microbenchmarks

The sparse matrix microbenchmarks supported by the sparse matrix benchmark
are: matrix-vector multiplication, Cholesky factorization, LU factorization,
and QR factorization.  Each sparse matrix kernel tested is from the `Matrix`
library included with the **R** distribution, and each kernel is performance
tested with two or three sparse matrices from different application domains.
The microbenchmarks, their associated identifiers and brief descriptions of the
tested matrices are given in the table below.  The table below lists each sparse
matrix tested in the sparse matrix benchmark along with some fundamental
characteristics of each matrix.  The Laplacian matrices were generated for use
with the benchmark, the rest of the matrices come from the
[University of Florida Sparse Matrix Collection](https://www.cise.ufl.edu/research/sparse/matrices/).
More detailed descriptions of the matrices from the *Sparse Matrix Collection*
can be found on the web site.  The R data files containing the sparse matrices
can be downloaded in the companion package [RHPCBenchmarkData](https://github.com/IUResearchAnalytics/RBenchmarking/blob/master/RHPCBenchmarkData_0.1.0.tar.gz).

| Matrix name            | Number of rows | Number of columns | Number of non-zeros | Symmetry and Structure    |
|:-----------------------|:--------------:|:-----------------:|:-------------------:|:-------------------------:| 
| laplacian7pt_100       | 1000000        | 1000000           | 6940000             | symmetric banded          |
| laplacian7pt_200       | 8000000        | 8000000           | 55760000            | symmetric banded          |
| ca2010                 | 710145         | 710145            | 3489366             | symmetric block           |  
| ct20stif               | 52329          | 52329             | 2600295             | symmetric block           |
| Andrews                | 60000          | 60000             | 760154              | symmetric unstructured    |
| G3_circuit             | 1585478        | 1585478           | 7660826             | symmetric block           |
| circuit5M_dc           | 3523317        | 3523317           | 14865409            | non-symmetric block/banded| 
| stomach                | 213360         | 213360            | 3021648             | non-symmetric banded      |
| torso3                 | 259156         | 259156            | 4429042             | non-symmetric banded      |
| Maragal                | 21255          | 10152             | 537694              | rectangular unstructured  |
| landmark               | 71952          | 2704              | 1146848             | rectangular banded        |

The following table lists the microbenchmarks supported by the sparse matrix
benchmark, the matrices each microbenchmark is executed with, and the kernel function
that is microbenchmarked.  The kernel functions are applied to matrix or vector
operands, where A and B are input matrices and x is an input vector generated by the
corresponding allocator function.

| Microbenchmark name          | Functionality tested                | Problem Domain Description                                         | Kernel function  |
|:-----------------------------|:------------------------------------|:-------------------------------------------------------------------|:-----------------|
| matvec_laplacian7pt_100      | Sparse matrix-vector multiplication | Laplacian 7-point stencil applied to 100x100x100 grid              | A %*% x          |
| matvec_laplacian7pt_200      | Sparse matrix-vector multiplication | Laplacian 7-point stencil applied to 200x200x200 grid              | A %*% x          |
| matvec_ca2010                | Sparse matrix-vector multiplication | Undirected weighted graph from congressional redistricting         | A %*% x          |
| cholesky_ct20stif            | Sparse Cholesky factorization       | Stiffness matrix from structural                                   | Matrix::Cholesky(A) |
| cholesky_Andrews             | Sparse Cholesky factorization       | Eigenvalue problem from computer graphics/vision                   | Matrix::Cholesky(A) |
| cholesky_G3_circuit          | Sparse Cholesky factorization       | Circuit simulation                                                 | Matrix::Cholesky(A) |
| lu_circuit5M_dc              | Sparse LU factorization             | Large circuit DC analysis                                          | Matrix::lu(A)       |
| lu_stomach                   | Sparse LU factorization             | 3D electro-physical model of duodenum                              | Matrix::lu(A)       |
| lu_torso3                    | Sparse LU factorization             | 3D electro-physcial model of torso                                 | Matrix::lu(A)       |
| qr_Maragal_6                 | Sparse QR factorization             | Rank deficient least squares                                       | Matrix::QR(A)       | 
| qr_landmark                  | Sparse QR factorization             | Least squares                                                      | Matrix::QR(A)       |

The functions `GetSparseMatrixVectorDefaultMicrobenchmarks`,
`GetSparseCholeskyDefaultMicrobenchmarks`, `GetSparseLuDefaultMicrobenchmarks`,
and `GetSparseQrDefaultMicrobenchmarks` define the default
microbenchmarks referenced in the table.  Each returns a vector of
`SparseMatrixMicrobenchmark` objects specifying each microbenchmark.
Microbenchmark parameters that can be specified include the dimensions of the
matrices to be performance tested with, the number of performance trials to be
conducted per matrix, and the allocator and microbenchmarking functions to be
used.  By default, the benchmark function `RunSparseMatrixBenchmark` calls each
of the `GetSparse*` functions to specify the microbenchmarks to be executed.
See the object documentation for `RunSparseMatrixBenchmarks` and the
`GetSparse*` functions for more details.  

The allocator functions for the sparse matrix microbenchmarks take a
`SparseMatrixMicrobenchmark` object specifying the microbenchmark and an integer
index indicating which matrix is being tested.  The integer index is unused by
the microbenchmarks specified by the `GetSparse*` default functions because the
sparse matrix microbenchmarks read the test matrices from files as opposed to
dynamically generating them.  The allocator returns a list of data objects,
including the matrix, for the microbenchmark to operate on.  The
microbenchmarking functions take the `SparseMatrixMicrobenchmark` object
defining the microbenchmark and the list of data objects returned by the
allocator function.

## Supported Machine Learning Microbenchmarks

Currently, clustering microbenchmarks are the only microbenchmarks supported by
the machine learning benchmark.  The clustering benchmark generates clusters of
normally distributed feature vectors in an $N$-dimensional, real-valued feature
space where the mean of one cluster is located at the origin and the means of two clusters are
each located at positions -1 and 1 of one or more of the axes.  See the object
documentation for the function `GenerateClusterData` for more details on how the
clusters are generated.  Microbenchmarking is performed with two cluster
identification functions `pam` and `clara`, from the `cluster` package provided
with the **R** distribution.  The `pam` function implements the *partitioning
around medoids* algorithm which has
quadratic time complexity.  The `clara` function is used for large data sets and
operates in linear time complexity.  The `pam` function is called with
`cluster.only=TRUE` so that only assignments of feature vectors to clusters is
reported.  The `clara` function is called with `samples=50`, `keep.data=FALSE`,
`rngG=TRUE`, and `pamLike=TRUE`.  See the object documentation for the
`RunMachineLearningBenchmark`, `pam`, and `clara` functions for more details.

| Microbenchmark name        | Functionality tested | Kernel function | Number of features        | Number of clusters | Number of feature vectors per cluster |
|:---------------------------|:---------------------|:----------------|:-------------------------:|:------------------:|:-------------------------------------:|
| pam_cluster_3_7_2500       | clustering           | cluster::pam    | 3                         | 7                  | 2500                                  |
| pam_cluster_3_7_5000       | clustering           | cluster::pam    | 3                         | 7                  | 5000                                  |
| pam_cluster_3_7_5715       | clustering           | cluster::pam    | 3                         | 7                  | 5715                                  |
| pam_cluster_16_33_1213     | clustering           | cluster::pam    | 16                        | 33                 | 1213                                  |
| pam_cluster_64_33_1213     | clustering           | cluster::pam    | 64                        | 33                 | 1213                                  |
| pam_cluster_16_7_2858      | clustering           | cluster::pam    | 16                        | 7                  | 2858                                  |
| pam_cluster_32_7_2858      | clustering           | cluster::pam    | 32                        | 7                  | 2858                                  |
| pam_cluster_64_7_5715      | clustering           | cluster::pam    | 64                        | 7                  | 5715                                  |
| clara_cluster_64_33_1213   | clustering           | cluster::clara  | 64                        | 33                 | 1213                                  |
| clara_cluster_1000_99_1000 | clustering           | cluster::clara  | 1000                      | 99                 | 1000                                  |

The function `GetClusteringDefaultMicrobenchmarks` defines the default
microbenchmarks referenced in the table.  The function returns a vector of
`ClusteringMicrobenchmark` objects specifying each microbenchmark.
By default, the benchmark function `RunMachineLearningBenchmark` calls
the `GetClusteringDefaultMicrobenchmarks` function to specify the
microbenchmarks given in the table.  See the object documentation for
`RunMachineLearningBenchmarks` and the `GetClusteringDefaultMicrobenchmarks`
functions for more details.  

The allocator functions for the clustering microbenchmarks take a
`ClusteringMicrobenchmark` object specifying the microbenchmark.  The
allocator must return a list of allocated data objects, including the matrix
of feature vectors, for the microbenchmark to operate on.  The microbenchmarking
functions take the `ClusteringMicrobenchmark` object defining the microbenchmark
and the list of data objects returned by the allocator function.

# Support For Multithreading

There is no explicit support for multithreading by the benchmarking functions in
the the R HPC benchmark.  However, if the functionality being microbenchmarked
is implemented with support for multithreading, and the number of threads can
be controlled through the use of environment variables, as is often the case,
then the benchmarks can be executed multithreaded.  For example, if a benchmark
user wishes to benchmark functionality implemented with a package engineered
with OpenMP support, they can set the `OMP_NUM_THREADS` environment variable
before running the benchmarks.  If a microbenchmark tests kernel functions
implemented with a multithreaded BLAS or LAPACK library, then the environment
variable specific to that library for controlling the number of threads should
be set.  For example, the `MKL_NUM_THREADS` environment variable should
be set when the dense matrix benchmarks are tested using an instance of **R**
that is linked to the parallel Intel Math Kernel Library which implements
multithreaded BLAS and LAPACK functionality.

Even though the benchmark functions do not control the number of threads
to be utilized, the benchmarks must still report the number of threads used
in the CSV files and data frames generated for reporting results.  Thus, the
environment variable specifying the number of threads must be retrievable
in a way that is portable regardless of which multithreaded library the R
programming environment is linked with.  To accomplish this, the environment
variable `R_BENCH_NUM_THREADS_VARIABLE` must be set by the benchmark user to
specify the name of the particular environment variable specifying the number
of threads to be used.  So if `R_BENCH_NUM_THREADS_VARIABLE` is set to
`MKL_NUM_THREADS`, and `MKL_NUM_THREADS` is set to 16, then the number of
threads reported in the timing results will be 16.

For consistent behavior, the user should set the environment variable for the
number of threads before the R programming environment is initialized.  This is
because the multithreading libraries with which the R environment may be linked
will not necessarily register changes to the number of threads after
initialization.

# Examples 

Examples are given in this section to show how to run each benchmark.  Before
running the examples, certain environment variables need to be set to
specify the number of threads for parallel processing; see the previous section
for how to properly do this.

## Dense Matrix Microbenchmark Examples

This example runs all of the default dense matrix microbenchmarks, saves the
summary statistics for each microbenchmark in the directory
`DenseMatrixResults`, and saves the data frame returned from the dense matrix
benchmark to a file named `allResultsFrame.RData`.
```
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./DenseMatrixResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./DenseMatrixResults/allResultsFrame.RData")
```

This example runs all but the matrix transpose microbenchmarks, which tend to
run very slowly, and saves the results to the same directory as in the previous
example.
```
# Get the default microbenchmark definition objects and deactivate
# the unwanted matrix transpose microbenchmarks.
myDenseMatrixMicrobenchmarks <- GetDenseMatrixDefaultMicrobenchmarks()
myDenseMatrixMicrobenchmarks[["deformtrans"]]$active <- FALSE
myDenseMatrixMicrobenchmarks[["transpose"]]$active <- FALSE
# Set an appropriate run identifier
runIdentifier <- "no_transpose"
exTransposeResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory, myDenseMatrixMicrobenchmarks)
save(exTransposeResults, file="./DenseMatrixResults/exTransposeResultsFrame.RData")
```

This example runs only the matrix-matrix microbenchmark, and it modifies
the default matrix dimensions to test only a few small matrices.
```
# Get the default microbenchmark definition objects and select only
# the matrix-matrix multiplication microbenchmark to run.
myMatmatMicrobenchmark <- list()
myMatmatMicrobenchmark[["matmat"]] <- GetDenseMatrixDefaultMicrobenchmarks()[["matmat"]]
# Set the matrix dimensions to be tested to 2000 and 3000.  The
# number of trials and number of warmup trials per matrix must also be
# updated.
myMatmatMicrobenchmark[["matmat"]]$dimensionParameters <- as.integer(c(2000, 3000))
myMatmatMicrobenchmark[["matmat"]]$numberOfTrials <- as.integer(c(3, 3))
myMatmatMicrobenchmark[["matmat"]]$numberOfWarmupTrials <- as.integer(c(1, 1))
# Set an appropriate run identifier
runIdentifier <- "matmat"
matmatResults <- RunDenseMatrixBenchmark(runIdentifier, resultsDirectory, microbenchmarks=myMatmatMicrobenchmark)
save(matmatResults, file="./DenseMatrixResults/matmatResultsFrame.RData")
```

## Sparse Matrix Microbenchmark Examples

This example runs all of the default sparse matrix microbenchmarks, saves the
summary statistics for each microbenchmark in the directory
`SparseMatrixResults`, and saves the data frame returned from the dense matrix
benchmark to a file named `allResultsFrame.RData`.
```
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./SparseMatrixResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunSparseMatrixBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./SparseMatrixResults/allResultsFrame.RData")
```

This example runs only the Cholesky factorization microbenchmarks and saves
the results to the same directory as in the previous example.
```
# Set an appropriate run identifier
runIdentifier <- "cholesky_only"
# Run only the sparse Cholesky factorization microbenchmarks
choleskyResults <- RunSparseMatrixBenchmark(runIdentifier, resultsDirectory,
   matrixVectorMicrobenchmarks=NULL, luMicrobenchmarks=NULL, qrMicrobenchmarks=NULL)
save(choleskyResults, file="./SparseMatrixResults/choleskyResultsFrame.RData")
```

## Machine Learning Microbenchmark Examples

This example runs all of the default machine learning microbenchmarks, saves the
summary statistics for each microbenchmark in the directory
`MachineLearningResults`, and saves the data frame returned from the dense
matrix benchmark to a file named `allResultsFrame.RData`.
```
# Create the directory where the CSV files and data frames will be saved.
resultsDirectory <- "./MachineLearningResults"
dir.create(resultsDirectory)
# Set the run identifier to be appended to the names of the CSV files generated
# so that they can be identified with this run.
runIdentifier <- "all"
# Run all of the default microbenchmarks
allResults <- RunMachineLearningMatrixBenchmark(runIdentifier, resultsDirectory)
save(allResults, file="./MachineLearningResults/allResultsFrame.RData")
```

This example shows how to specify a new clustering microbenchmark and run it.
```
# Create a new clustering microbenchmark that tests the `clara` method from the
# `cluster` package using a data set with 250 features, 50 clusters, and
# 2000 normally distributed feature vectors per cluster.
claraMicrobenchmark <- list()
claraMicrobenchmark[["clara_cluster_250_50_2000"]] <- methods::new(
      "ClusteringMicrobenchmark",
      active = TRUE,
      benchmarkName = "clara_cluster_250_50_2000",
      benchmarkDescription = "Clustering of 100000 250-dimensional feature vectors into 50 clusters using clara function",
      numberOfFeatures = as.integer(250),
      numberOfClusters = as.integer(50),
      numberOfFeatureVectorsPerCluster = as.integer(2000),
      dataObjectName = NA_character_,
      numberOfTrials = as.integer(3),
      numberOfWarmupTrials = as.integer(1),
      allocatorFunction = ClusteringAllocator,
      benchmarkFunction = ClaraClusteringBenchmark
   )
# Set an appropriate run identifier
runIdentifier <- "clara"
# Run the clara microbenchmark
claraResults <- RunMachineLearningBenchmark(runIdentifier, resultsDirectory,
   clusteringMicrobenchmarks=claraMicrobenchmark)
save(claraResults, file="./MachineLearningResults/claraResultsFrame.RData")
```