%% use JSS class but for now with nojss option
\documentclass[nojss,shortnames,article]{jss}
%\usepackage{float}
\usepackage{booktabs}

\author{Dirk Eddelbuettel\\Debian Project}
% \And Second Author\\Plus Affiliation}
\title{Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with \proglang{R}}
\Plainauthor{Dirk Eddelbuettel} % , Second Author} %% comma-separated
\Plaintitle{Benchmarking Single- and Multi-Core BLAS Implementations and GPUs for use with R}
\Shorttitle{Benchmarking BLAS and GPUs for use with R}

\Abstract{
  \noindent We provide timing results for common linear algebra subroutines
  across BLAS (Basic Linear Algebra Subprograms) and GPU (Graphics Processing
  Unit)-based implementations. Several BLAS implementations are compared. The
  first is the unoptimised reference BLAS which provides a baseline to
  measure against. The second is the Atlas tuned BLAS, configured for
  single-threaded mode. The third is the development version of Atlas,
  configured for multi-threaded mode. The fourth is the optimised and
  multi-threaded Goto BLAS. The fifth is the multi-threaded BLAS contained in
  the commercial Intel MKL package. We also measure the performance of a
  GPU-based implementation for \proglang{R} \citep{RCore:R} provided by the
  package \pkg{gputools} \citep{cran:gputools}.

  Several frequently-used linear algebra computations are compared across
  BLAS (and LAPACK) implementations and via GPU computing: matrix
  multiplication as well as QR, SVD and LU decompositions. The tests are
  performed from an end-user perspective, and `net' times (including all
  necessary data transfers) are compared.

  While results are by their very nature dependent on the hardware of the
  test platforms, a few general lessons can be drawn. Unsurprisingly,
  accelerated BLAS clearly outperform the reference implementation.
  Similarly, multi-threaded BLAS hold a clear advantage over single-threaded
  BLAS when used on modern multi-core machines. Among the multi-threaded BLAS
  implementations, Goto is seen to have a slight advantage over MKL and
  Atlas. GPU computing shows promise but requires relatively large matrices
  to outperform multi-threaded BLAS.

  We also provide the framework for computing these benchmark results as the
  corresponding \proglang{R} package \pkg{gcbd}. It enables other researchers
  to compute similar benchmark results which could form the basis for
  heuristics helping to select optimal computing strategies for a given
  platform, library, problem and size combination.
}

\Keywords{BLAS, Atlas, Goto, MKL, GPU, \proglang{R}, Linux} %% at least one keyword must be supplied
\Plainkeywords{BLAS, Atlas, Goto, MKL, GPU, R, Linux} %% without formatting

%% publication information
%% NOTE: Typically, this can be left commented and will be filled out by the technical editor
%% \Volume{13}
%% \Issue{9}
%% \Month{September}
%% \Year{2004}
%% \Submitdate{2004-09-29}
%% \Acceptdate{2004-09-29}

\Address{
  Dirk Eddelbuettel \\
  Debian Project \\
  Chicago, IL, USA\\
  %% Telephone: +43/1/31336-5053
  %% Fax: +43/1/31336-734
  E-mail: \email{edd@debian.org}\\
  URL: \url{http://dirk.eddelbuettel.com}
}

%% need no \usepackage{Sweave.sty}

% ------------------------------------------------------------------------
\begin{document}
\SweaveOpts{engine=R,eps=FALSE,echo=FALSE,prefix.string=figures/chart}
%\VignetteIndexEntry{BLAS and GPU Benchmarking for Use with R}
%\VignetteDepends{gcbd}
%\VignetteKeywords{blas,gpu,atlas,mkl,goto,R,Linux}
%\VignettePackage{gcbd}

\shortcites{Brodtkorb_et_al_2010,Blackford_et_al:2002,lapack}

%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.

<<>>=
options(width=50)
library(gcbd)
gcbd.version <- packageDescription("gcbd")$Version
gcbd.date <- packageDescription("gcbd")$Date
now.date <- strftime(Sys.Date(), "%B %d, %Y")
# create figures/ if not present
if ( ! (file.exists("figures") && file.info("figures")$isdir) ) dir.create("figures")
@
%
\makeatletter
\if@nojss
Detailed comments from Roger Bivand, Rog\'{e}rio Brito, Allan Engelhardt,
Bryan Lewis, James MacKinnon, Junji Nakano, Mark Seligman, Brian Smith and
Luke Tierney are very gratefully acknowledged. All remaining errors are mine.
\vfill
This version corresponds to \pkg{gcbd} version \Sexpr{gcbd.version} and was
typeset on \Sexpr{now.date}.
\fi
\makeatother

\pagebreak

\section[Introduction]{Introduction}

Analysts are often eager to reap the maximum performance from their computing
platforms. A popular suggestion in recent years has been to consider
optimised basic linear algebra subprograms (BLAS). Optimised BLAS libraries
have been included with some (commercial) analysis platforms for a decade
\citep{Moler:2000}, and have also been available for (at least some) Linux
distributions for an equally long time \citep{Maguire:1999}. Setting BLAS up
can be daunting: the \proglang{R} language and environment devotes a detailed
discussion to the topic in its \textsl{Installation and Administration}
manual \citep[appendix A.3.1]{RCore:InstAdmin}.

Among the available BLAS implementations, several popular choices have
emerged. Atlas (an acronym for \textsl{Automatically Tuned Linear Algebra
Software}) is popular as it has shown very good performance due to its
automated and CPU-specific tuning \citep{Whaley_Dongarra:1999,
Whaley_Petitet:2005}. It is also licensed in such a way that it permits
redistribution, leading to fairly wide availability of Atlas.\footnote{The
Atlas FAQ lists Maple, Matlab and Mathematica as using Atlas, and mentions
that GSL, Octave, R, Scientific Python, and Scilab can be used with Atlas.}
We deploy Atlas in both a single-threaded and a multi-threaded configuration.
Another popular BLAS implementation is Goto BLAS which is named after its
main developer, Kazushige Goto \citep{Goto_VanDeGeijin:2008}. While `free to
use', its license does not permit redistribution, putting the onus of
configuration, compilation and installation on the end-user. Lastly, the
Intel Math Kernel Library (MKL), a commercial product, also includes an
optimised BLAS library.
A recent addition to the tool chain of high-performance computing is the
graphics processing unit (GPU). Originally designed for optimised
single-precision arithmetic to accelerate computing as performed by graphics
cards, these devices are increasingly used in numerical analysis. Earlier
criticisms of insufficient floating-point precision or severe performance
penalties for double-precision calculations are being addressed by the newest
models. Dependence on particular vendors remains a concern: NVidia's CUDA
toolkit \citep{nVidia:2010} is currently still the preferred development
choice, whereas the newer OpenCL standard \citep{OpenCL:2010} may become a
more generic alternative that is independent of hardware vendors.
\citet{Brodtkorb_et_al_2010} provide an excellent recent survey.

But what has been lacking is a comparison of the effective performance of
these alternatives. This paper works towards filling this gap. By analysing
performance across five different BLAS implementations---as well as a
GPU-based solution---we are able to provide a reasonably broad comparison.

Performance is measured as an end-user would experience it: we record
computing times from launching commands in the interactive \proglang{R}
environment \citep{RCore:R} to their completion. While implemented in
\proglang{R}, these benchmark results are more general and valid beyond the
\proglang{R} system as there is only a very thin translation layer between
the higher-level commands and the underlying implementations (such as, say,
\code{dgemm} for double-precision matrix multiplications) in the respective
libraries. This lack of (strong) dependence on the test environment makes our
results more generally applicable. However, \proglang{R} is very useful as an
environment to generate test data, execute the benchmarks, and to collect the
results which are subsequently analysed and visualized.

The rest of the paper is organised as follows. In the next section, the
technical background is briefly discussed. The implementation of our
benchmark tests is outlined in section 3. We provide results in section 4,
before a summary concludes in section 5.

\section{Background}

Basic Linear Algebra Subprograms (BLAS) provide an Application Programming
Interface (API) for linear algebra. For a given task such as, say, a
multiplication of two conformant matrices, an interface is described via a
function declaration, in this case \code{sgemm} for single precision and
\code{dgemm} for double precision. The actual implementation becomes
interchangeable thanks to the API definition and can be supplied by different
approaches or algorithms. This is one of the fundamental code design features
we are using here to benchmark the performance differences between
implementations.

A second key aspect is the difference between static and shared linking. In
static linking, object code is taken from the underlying library and copied
into the resulting executable. This has several key implications. First, the
executable becomes larger due to the copy of the binary code. Second,
execution may be marginally faster as the library code is present and no
additional look-up and subsequent redirection has to be performed. The actual
amount of this performance difference is the subject of near-endless debate.
We should also note that this usually amounts to only a small load-time
penalty combined with a function pointer redirection---the actual computation
effort is unchanged as the actual object code is identical.
Third, it makes the program more robust as fewer external dependencies are
required. However, this last point also has a downside: no changes in the
underlying library will be reflected in the binary unless a new build is
executed.

Shared library builds, on the other hand, result in smaller binaries that may
run marginally slower---but which can make use of different libraries without
a rebuild. That last feature is key here: it offers us the ability to use
different BLAS implementations in a `plug and play' mode which facilitates
comparisons. Because of the standardised interface of the BLAS, and because
we have several alternative implementations at our disposal, we can switch
between these alternatives. Doing so easily makes use of the package
mechanism of Ubuntu, the Debian-based Linux distribution we employ. However,
a simpler (yet less robust) approach would also be available. This technical
aspect is discussed further below.

The first available BLAS implementation stems from the original unoptimised
code on Netlib. On Debian-based systems, it is provided by the (source)
package \pkg{blas} \citep{Blackford_et_al:2002} and used with the
\pkg{lapack} package \citep{lapack}. This implementation is commonly referred
to as `reference BLAS' (and we will use `ref' as a shorthand) as it provides
a reference implementation.

The second and third BLAS implementations are provided by Atlas
\citep{Whaley_Dongarra:1999,Whaley_Petitet:2005} which stands for
\textsl{Automatically Tuned Linear Algebra Software} as it optimises its
performance and parameters (such as cache sizes) during its initial build.
Another notable aspect is its liberal licensing which permits wide
distribution. Consequently, Atlas is available on most if not all Linux
distributions. Windows binaries are also widely available. We used two
different Atlas versions. The first one is the (pre-compiled) version in the
Ubuntu and Debian releases. It is based on Atlas 3.6.0 and built only for
single-threaded operation. The second version is based on the current Atlas
development version 3.9.25, built for multi-threaded mode and locally
compiled.\footnote{A critical change enabling multi-threaded mode which we
added was also filed as Debian bug report \#595326 and will be reflected in
future Debian / Ubuntu packages.} We will refer to this second variant as
`Atlas39' (or `Atl39') when we report results below.

The fourth BLAS implementation is provided by the Goto BLAS. Upon the
required user registration, these are freely available from the University of
Texas, albeit under a license that prohibits redistribution. This prevents
inclusion into popular Linux distributions, a clear disadvantage for easy
installation and de-installation. However, the contributed Debian repository
at the Institute of Statistical Mathematics in Japan provides a `helper
package' \citep{nakano_nakama:2009,Nakama:2010} which, given the required
registration information, downloads the source code from the University of
Texas site, and then configures and compiles the Goto BLAS in such a manner
that a local binary Debian package is produced---which is also optimised for
the local installation. This permits us to use the Goto BLAS via the
resulting package as a fourth BLAS alternative.

The fifth available BLAS implementation is part of the Intel Math Kernel
Library (MKL), a commercial product.
However, Ubuntu release 9.10 (``karmic'') contains a set of packages
sponsored by Revolution Analytics which comprises version 10.2 (dated March
2009) of the Intel MKL in a setup directly usable by \proglang{R}. We use
these packages here as a fifth set of optimised BLAS. The AMD Core Math
Library (ACML) could provide a sixth BLAS variant. However, given that Intel
is currently not truly challenged by AMD for high-performance CPUs, we
considered analysing the Intel MKL to be a sufficient representation of the
vendor BLAS market.

The first \proglang{R} extension for graphics processing units has been
implemented by the \pkg{gputools} package \citep*{cran:gputools}. It provides
a number of functions which use the GPU instead of the CPU. This can provide
a significant performance increase for `large enough' problems due to the
large number of cores---from several dozen to several hundred on newer
cards---on a GPU. However, because data has to be transferred from the CPU to
the GPU, a fixed communications cost has to be borne by every invocation of
the GPU. For sizable problems, this cost can be outweighed by the benefits of
the massively parallel GPU computation. Exactly where the indifference point
lies beyond which GPU computing has an advantage is unfortunately dependent
on the particular problem and algorithm as well as the given hardware and
software combination.

A recent addition to the set of \proglang{R} packages is \pkg{magma}
\citep{cran:magma} which interfaces the \pkg{Magma} library project of the
same name \citep{Tomov_et_al:2009,Tomov_et_al:2010} (where we will capitalize
the name of the library to distinguish it from the \proglang{R} package). The
stated goal of the Magma project is to \textsl{develop a dense linear algebra
library similar to LAPACK but for heterogeneous/hybrid architectures,
starting with current ``Multicore+GPU'' systems}. The current release of
Magma is version 0.2 which is available for Linux systems with NVidia's CUDA
toolkit. It provides LU, QR, and Cholesky factorizations as well as linear
solvers based on these, along with implementations of several BLAS functions
including \code{dgemm}, \code{dgemv}, and \code{dsymv} (as well as their
single-precision counterparts).

Figure~\ref{fig:magma} illustrates the changes of relative workload between
CPU and GPU when using Magma for a (single-precision) QR decomposition. It
clearly depicts how the share of the total computing effort that goes to the
GPU increases as a function of the matrix size.

\begin{figure}[t!]
  \centering
  \includegraphics[width=4in]{MagmaTimes.png}
  \caption{Breakdown of CPU and GPU computing percentages for
    single-precision QR decomposition using the hybrid Magma approach.
    (Source: \citet[p.7]{Tomov_et_al:2009})}
  \label{fig:magma}
\end{figure}

A flexible, hybrid approach has the potential to improve upon solutions that
use only the CPU or only the GPU. However, the current version of the
\pkg{Magma} library exhibited stability problems when used with the GPU card
employed here (a Quadro FX 4800). Analysing the performance of \pkg{Magma}
relative to accelerated BLAS and other GPU-based approaches is therefore left
as a topic for future research.

Finally, an important, and possibly under-appreciated, topic is how to
balance explicitly parallel code that distributes load across cores with
multi-threaded BLAS.
Execution of code may already be parallelised in a `coarse-grained' fashion
across all local cores or CPUs, maybe via tools that explicitly spawn
multiple threads, maybe via compiler-driven directives as for example with
OpenMP, or maybe via cluster-computing tools across a local network. If such
a setup `books' all available cores, a key assumption of multi-threaded BLAS
no longer holds: other cores are not idle but also busy. In this case
contention arises between the explicit parallelism and the implicit
parallelism from the multi-threaded BLAS, and performance is bound to suffer.
\citet{bivand2010} discusses these issues and provides several illustrations
using examples from data-intensive GIS applications. The simplest remedy is
to withdraw one of the mechanisms for multi-threaded use by limiting the
number of cores to one. This can be done for Goto BLAS using the
\code{GOTO_NUM_THREADS} environment variable. For the MKL one can use the
environment variable \code{OMP_NUM_THREADS} (of the underlying OpenMP
implementation), or the \proglang{R} function \code{setMKLthreads()} of the
corresponding package. Atlas, on the other hand, fixes the number of cores
used at compile-time and cannot vary this setting at run-time.

\section{Implementation}

\subsection{Requirements}

In order to undertake the automated benchmarking, we need to be able to
switch between different implementations of the BLAS API. As discussed above,
dynamic libraries are one possible solution that avoids having to rebuild
\proglang{R} explicitly for each library. However, this also requires that
\proglang{R} itself is built with shared-library support, as well as with
support for external BLAS and LAPACK libraries. This is, however, the default
configuration on Debian and Ubuntu systems.

The reference BLAS as well as Atlas have been available for essentially all
Debian and Ubuntu releases. Atlas 3.9.25 was packaged locally; the package
and helper scripts are available upon request. The Intel MKL is available for
Ubuntu following the Revolution R upload for Ubuntu release 9.10. For Goto
BLAS, we are using a helper script provided in the contributed
\texttt{gotoblas2-helper} package \citep{nakano_nakama:2009,Nakama:2010}.
This package arranges for a download of the Goto BLAS sources (provided the
required account and password information for the University of Texas
software download center is supplied) and an automated Debian package build
and installation via the command
\code{sudo /etc/init.d/gotoblas2-helper start}. Note that the initial
invocation of this command will trigger a build of the package which may take
up to two hours. While designed for Debian, it also works perfectly on the
Ubuntu systems used here.\footnote{The \texttt{/etc/init.d/gotoblas2-helper}
script required one change from \texttt{/bin/sh} to \texttt{/bin/bash}.}

For GPU-based testing we require the \proglang{R} package \pkg{gputools}
which in turn requires support for CUDA and the NVidia SDK (as well as
appropriate NVidia hardware). Detailed installation instructions are provided
by the package so we will defer to these and assume that the packages are in
fact installed. Helper scripts in our package will then verify this
availability while the benchmark is executed.

\subsection{Benchmark implementation}

The benchmarks described in this paper are produced by the package \pkg{gcbd}
which is part of the larger \pkg{gcb} project \citep*{rforge:gcb} on the
R-Forge hosting site \citep{RJournal:Theussl+Zeileis:2009}.
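Being hosted on R-Forge, the \pkg{gcbd} package can also be installed
directly from within \proglang{R}. The following is a minimal sketch rather
than a definitive recipe; it assumes that the R-Forge repository is reachable
and provides a current build of the package for the platform in use:
\begin{verbatim}
## a minimal sketch: install gcbd from R-Forge and load it
install.packages("gcbd", repos = "http://R-Forge.R-project.org")
library(gcbd)
requirements()   # asserts that the testable host requirements are met
\end{verbatim}
The call to \code{requirements()} corresponds to the feature detection
described next.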
The \pkg{gcbd} package (where the acronym expands to `GPU/CPU Benchmarking on
Deb-based systems') contains a number of \proglang{R} helper functions as
well as the actual benchmarking script which is executed. The helper
functions fall into two groups: utilities and benchmarks.

The utilities fall into several categories:
\begin{itemize}
\item initial feature detection via a function \code{requirements()} which
  asserts that a number of testable features of the host operating system
  are met;
\item feature tests via the function \code{hasGputools()} allowing the
  benchmark script to bypass GPU-based tests on systems without a GPU;
\item installation and removal functions which interface the package
  management layer and install (or remove) the Atlas, Atlas39, MKL or Goto
  BLAS packages, respectively, which helps ensure that at any one point in
  time only one accelerated BLAS library package is present;
\item database creation where the results database (and table schema) is
  created if not already present;
\item recording of simulation results in the database.
\end{itemize}

The benchmark functions can also be categorized:
\begin{itemize}
\item creation of random data for standard \pkg{Matrix} objects;
\item actual benchmarking code for
  \begin{itemize}
  \item matrix crossproducts;
  \item singular value (SV) decomposition;
  \item QR decomposition;
  \item LU decomposition;
  \end{itemize}
  for BLAS and \pkg{gputools}, respectively.
\end{itemize}

For the QR decomposition, we set the flag \code{LAPACK=TRUE}. By default,
\proglang{R} computes the QR decomposition via LINPACK, which relies on level
1 BLAS operations, whereas the LAPACK-based routine can reap a larger benefit
from accelerated or optimised BLAS libraries. For the LU decompositions, we
use the function from the \pkg{Matrix} package by \citet{cran:matrix}.

\subsection{Benchmark script}

The benchmark execution can then be triggered by the script
\code{benchmark.r} in the sub-directory \code{scripts} of the \pkg{gcbd}
package. It is implemented as an executable \proglang{R} script which uses
the \pkg{getopt} package \citep{cran:getopt} for command-line parsing:
\begin{verbatim}
$ ./benchmark.r -h
Usage: benchmark.r [-[-verbose|v]] [-[-help|h]] [-[-nobs|n] ] [-[-runs|r] ] [-[-benchmark|b] ]
    -v|--verbose     verbose operations, default is false
    -h|--help        help on options
    -n|--nobs        number of rows and columns in matrix, default is 250
    -r|--runs        number of benchmark runs, default is 30
    -b|--benchmark   benchmark to run (matmult [default], qr, svd, lu)
\end{verbatim}

Each benchmark experiment consists of $r$ runs for an $n \times n$ matrix
using the chosen benchmark---matrix cross product or one of the QR, LU or SVD
decompositions---over all five BLAS implementations and the GPU package. The
run against the GPU package is optional and depends on the \pkg{gputools}
package being present. At the end of each benchmark experiment, the results
are appended to a SQLite database.

We use the `elapsed time' as measured by the \proglang{R} function
\code{system.time()}. This measure is preferable to the sum of system time
and user time, which adds up to total CPU time without adjusting for multiple
threads or cores. Elapsed time correctly measures from start to finish,
whether one or multiple threads or cores are involved (but can be influenced
by system load). Using this script \code{benchmark.r}, users can collect
benchmark results on their systems.
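Once one or more experiments have been run, the recorded timings can be
retrieved from the SQLite database for inspection or further analysis. The
following minimal sketch assumes that results have been recorded under the
hostname \code{i7_920}; the \code{type} column distinguishes the four
benchmarks:
\begin{verbatim}
## a minimal sketch: retrieve and inspect recorded benchmark results
library(gcbd)
i7 <- getBenchmarkData("i7_920")          # host name as stored in the database
matmult <- subset(i7, type == "matmult")  # one of matmult, qr, svd, lu
summary(matmult)
\end{verbatim}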
Such user-contributed results could be aggregated into a larger body of
performance data from which realistic performance estimates for a much wider
variety of hardware configurations could be derived. Doing so is beyond the
scope of this paper but a possible avenue for future work.

\subsection{Alternative implementation}

On platforms that do not have access to pre-built BLAS library packages, an
alternative approach could consist of locally installing the different
libraries into sub-directories of, say, \code{/opt/blas}. One could then use
the environment variable \verb|LD_LIBRARY_PATH| to select one of these
directories at a time. However, such an approach places a larger burden on
the user of the benchmarking software as she would have to download,
configure, compile and install the BLAS libraries which is typically not a
trivial step.

\subsection{Hardware and software considerations}
\label{subsec:hardware}

For benchmarking linear algebra performance, hardware and software aspects
matter greatly for the overall results. Different CPUs, different GPUs,
different memory types as well as different compilers (or operating systems)
may generate different performance profiles. We have been running the results
presented here on these two platforms:
\begin{itemize}
\item a four-core Intel i7 920, 2.67 GHz clock speed, 8192 KB CPU cache, 6 GB
  RAM, in hyper-threaded mode for eight visible cores, running Ubuntu 10.04
  in 64-bit mode;
\item a dual-CPU four-core Intel Xeon X5570, 2.96 GHz clock speed, 8192 KB
  CPU cache, 16 GB RAM, in hyper-threaded mode for sixteen visible cores,
  running Ubuntu 10.04 in 64-bit mode.
\end{itemize}
The i7 system also has a NVidia Quadro FX 4800 GPU with 192 cores and 1.5 GB
of RAM. This card has a so-called `compute capability' of 1.3 whereas newer
models (`Fermi') have a compute capability of 2.0, signalling the
availability of more hardware capabilities. Different hardware platforms
could be reflected by other users installing the \pkg{gcbd} package on their
system and reporting results back by supplying the resulting SQLite database
files.

The software configuration was held constant by running on 64-bit Ubuntu
10.04 in both cases. It should be noted that hyper-threading is a potential
distraction. While it was not possible to reconfigure the machines used here,
it could be informative to compare results on identical hardware with
hyper-threading turned on and off, respectively. Lastly, comparing across
operating systems would also be of interest. While the test platform used
here makes extensive use of the packaging system available on Debian /
Ubuntu, it would of course be possible to set up something similar on, say,
OS X to measure the performance of the vecLib system.

%\pagebreak

\section[Results]{Results}

<<>>=
i7 <- getBenchmarkData("i7_920")
xeon <- getBenchmarkData("xeon_X5570")
D <- subset(i7[,-c(1:2,5)], type=="matmult")
@

%\subsection{BLAS comparison}

We present the benchmark results (which are also included in the \pkg{gcbd}
package in the file \code{sql/gcbd.sqlite}). We show the four different
benchmarks for the two test architectures (currently one of i7 or xeon, as
detailed in section~\ref{subsec:hardware}), yielding eight different panels
in a lattice plot \citep{lattice}. Each of these panels displays results for
all the available BLAS (and GPU) implementations, with the matrix dimension
on the horizontal axis and the elapsed time in seconds on the vertical axis.
In Figure~\ref{fig:lattice}, we first display the raw results with the
elapsed time capped at 30 seconds.
Figure~\ref{fig:logloglattice} then shows the same results using logarithmic
axes in order to better differentiate between alternative
implementations.\footnote{I am particularly grateful to Allan Engelhardt for
suggesting to switch from a plot with logged y-axis to a log-log plot.}

\setkeys{Gin}{width=0.99\textwidth}
\begin{figure}[t!]
  \centering
<<fig=TRUE>>=
figure_Lattice(titles=FALSE)
@
  \caption{Benchmark results for BLAS and GPUs across four tests.}
  \label{fig:lattice}
\end{figure}

\setkeys{Gin}{width=0.99\textwidth}
\begin{figure}[t!]
  \centering
<<fig=TRUE>>=
figure_LogLogLattice(titles=FALSE)
@
  \caption{Benchmark results for BLAS and GPUs across four tests using
    logarithmic axes.}
  \label{fig:logloglattice}
\end{figure}

\subsubsection{Matrix multiplication: i7 with GPU}

% \setkeys{Gin}{width=0.99\textwidth}
% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_MatMult_i7(i7)
% @
%   \caption{Matrix multiplications on i7 with GPU.}
% \end{figure}

It is immediately apparent that the reference BLAS underperform very
significantly. Similarly, the single-threaded Atlas performance is also
dominated by the multi-threaded BLAS (Atlas39, Goto, MKL) and the GPU-based
\pkg{gputools}. We also see (especially in the log/log plot) that the
GPU-based solution bears a clear penalty for smaller dimensions. It only
crosses below the single-threaded Atlas around $N=1000$ and approaches the
best performing multi-threaded BLAS at the very end for $N=5000$. On the
other hand, it clearly exhibits a much flatter slope, implying a smaller time
penalty as the matrix size increases. The three multi-threaded BLAS
implementations are essentially indistinguishable for this test and hardware
platform.

\subsubsection{Matrix multiplication: Xeon}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_MatMult_xeon(xeon)
% @
%   \caption{Matrix multiplications on xeon.}
% \end{figure}

As above, the reference BLAS underperform significantly, and single-threaded
Atlas is dominated by all multi-threaded BLAS implementations. On the other
hand, on this platform the log/log plot offers a better differentiation
between the multi-threaded BLAS implementations. Goto is clearly ahead of its
competitors, although Atlas39 catches up for the largest matrices. Atlas39
also beats MKL for essentially all sizes but cannot catch Goto BLAS'
performance. MKL shows a flatter slope and higher times for small sizes,
hinting at a suboptimal configuration for this size range.

%\clearpage

\subsubsection{QR decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_QR_i7(i7)
% @
%   \caption{QR decomposition on i7 with GPU.}
% \end{figure}

As before, reference BLAS are behind all other implementations. The
single-threaded Atlas is on par with the multi-threaded Atlas39---and the GPU
solution. Goto and MKL beat all others, with Goto dominating overall. The
log/log plot once again illustrates the higher `startup cost' of the GPU
solution due to the communications cost being high (in relative terms) for
small matrices where the computations are still very fast.

\subsubsection{QR decomposition: Xeon}

% \begin{figure}[b!]
%   \centering
% <<fig=TRUE>>=
% figure_QR_xeon(xeon)
% @
%   \caption{QR decomposition on xeon.}
% \end{figure}

Results fall basically into three groups. The Reference BLAS are behind
everybody. A second group of both Atlas variants and MKL forms the middle,
with MKL having a slight edge over the Atlas libraries. However, Goto is
ahead of everybody (yet shows a somewhat surprising non-linearity in the
log/log plot).
%\clearpage

\subsubsection{SVD decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_SVD_i7(i7)
% @
%   \caption{SVD on i7 with GPU.}
% \end{figure}

For the SVD, we have the familiar ordering of Reference BLAS behind
single-threaded Atlas which is behind multi-threaded Atlas. MKL, Goto and GPU
are all competitive, with the GPU lagging for small sizes but beating all
competitors for the largest matrix size tests. In the log/log chart, the GPU
performance also demonstrates a much flatter slope, albeit combined with a
much higher intercept.

\subsubsection{SVD: Xeon}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_SVD_xeon(xeon)
% @
%   \caption{SVD on xeon.}
% \end{figure}

On the Xeon, results are similar to the i7, with the main difference that the
multi-threaded Atlas39 is closer to its competitors, in particular the MKL.
Goto BLAS are once again ahead.

\subsubsection{LU decomposition: i7 with GPU}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_LU_i7(i7)
% @
%   \caption{LU decomposition on i7 with GPU.}
% \end{figure}

LU decomposition results, using the function from the \pkg{Matrix} package
\citep{cran:matrix}, are similar to earlier results. Reference BLAS are by
far the slowest, single-threaded Atlas beats them clearly and loses equally
clearly to the multi-threaded BLAS. Multi-threaded Atlas is very close to
Goto and MKL, which are ever so slightly ahead and essentially
indistinguishable. For this algorithm, no GPU-based performance numbers are
available as the current version of the \pkg{gputools} package does not
provide an LU decomposition.

\subsubsection{LU decomposition: Xeon}

% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_LU_xeon(xeon)
% @
%   \caption{LU decomposition on xeon.}
% \end{figure}

On the xeon chip we see once again a clearer separation between the three
accelerated BLAS implementations on the one hand and the single-threaded
Atlas and the Reference BLAS on the other hand. Goto has a clear edge over
MKL, which is also overtaken by Atlas39 for medium-to-large size matrices.
MKL also exhibits some irregularity for small-to-medium sized matrices.

\subsubsection{Comparison}

The charts shown above, and in particular Figure~\ref{fig:logloglattice},
suggest a further analysis: a regression of the logarithm of the elapsed
times on the logarithm of the matrix dimension. The resulting coefficient
estimates are shown in Table~\ref{tab:logloganalysis} below (a sketch of how
such a fit can be computed from the recorded data follows at the end of this
section). For these results, a higher slope coefficient implies a larger
increase in elapsed time per increase in matrix dimension, and moreover in a
non-linear fashion as we have taken logarithms.

<<>>=
C <- loglogAnalysis()[["coefs"]]
C[,-c(1,2)] <- format(round(C[,-c(1,2)], digits=2))
@

\begin{table}[b!]
\begin{center}
\begin{tabular*}{0.975\textwidth}{@{\extracolsep{\fill}} l r rr rr rr rr rr}
%\begin{tabular}{lrrrrrr}
  \toprule
  Platform & BLAS & \multicolumn{2}{c}{Mat.Mult} & \multicolumn{2}{c}{QR} &
  \multicolumn{2}{c}{SVD} & \multicolumn{2}{c}{LU} \\
  & & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
    & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
    & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$}
    & \multicolumn{1}{c}{$\alpha$} & \multicolumn{1}{c}{$\beta$} \\
  \cmidrule(r){3-4} \cmidrule(r){5-6} \cmidrule(r){7-8} \cmidrule(r){9-10} \\[2pt]
  \multicolumn{1}{r}{i7} & Ref & \Sexpr{C[2,3]} & \Sexpr{C[2,4]} & \Sexpr{C[3,3]} & \Sexpr{C[3,4]} & \Sexpr{C[4,3]} & \Sexpr{C[4,4]} & \Sexpr{C[1,3]} & \Sexpr{C[1,4]} \\
  & Atlas & \Sexpr{C[2,5]} & \Sexpr{C[2,6]} & \Sexpr{C[3,5]} & \Sexpr{C[3,6]} & \Sexpr{C[4,5]} & \Sexpr{C[4,6]} & \Sexpr{C[1,5]} & \Sexpr{C[1,6]} \\
  & Atlas39& \Sexpr{C[2,7]} & \Sexpr{C[2,8]} & \Sexpr{C[3,7]} & \Sexpr{C[3,8]} & \Sexpr{C[4,7]} & \Sexpr{C[4,8]} & \Sexpr{C[1,7]} & \Sexpr{C[1,8]} \\
  & Goto & \Sexpr{C[2,9]} & \Sexpr{C[2,10]}& \Sexpr{C[3,9]} & \Sexpr{C[3,10]}& \Sexpr{C[4,9]} & \Sexpr{C[4,10]}& \Sexpr{C[1,9]} & \Sexpr{C[1,10]}\\
  & MKL & \Sexpr{C[2,11]}& \Sexpr{C[2,12]}& \Sexpr{C[3,11]}& \Sexpr{C[3,12]}& \Sexpr{C[4,11]}& \Sexpr{C[4,12]}& \Sexpr{C[1,11]}& \Sexpr{C[1,12]}\\
  & GPU & \Sexpr{C[2,13]}& \Sexpr{C[2,14]}& \Sexpr{C[3,13]}& \Sexpr{C[3,14]}& \Sexpr{C[4,13]}& \Sexpr{C[4,14]}& & \\[4pt]
  \multicolumn{1}{r}{xeon} & Ref & \Sexpr{C[6,3]} & \Sexpr{C[6,4]} & \Sexpr{C[7,3]} & \Sexpr{C[7,4]} & \Sexpr{C[8,3]} & \Sexpr{C[8,4]} & \Sexpr{C[5,3]} & \Sexpr{C[5,4]} \\
  & Atlas & \Sexpr{C[6,5]} & \Sexpr{C[6,6]} & \Sexpr{C[7,5]} & \Sexpr{C[7,6]} & \Sexpr{C[8,5]} & \Sexpr{C[8,6]} & \Sexpr{C[5,5]} & \Sexpr{C[5,6]} \\
  & Atlas39& \Sexpr{C[6,7]} & \Sexpr{C[6,8]} & \Sexpr{C[7,7]} & \Sexpr{C[7,8]} & \Sexpr{C[8,7]} & \Sexpr{C[8,8]} & \Sexpr{C[5,7]} & \Sexpr{C[5,8]} \\
  & Goto & \Sexpr{C[6,9]} & \Sexpr{C[6,10]}& \Sexpr{C[7,9]} & \Sexpr{C[7,10]}& \Sexpr{C[8,9]} & \Sexpr{C[8,10]}& \Sexpr{C[5,9]} & \Sexpr{C[5,10]}\\
  & MKL & \Sexpr{C[6,11]}& \Sexpr{C[6,12]}& \Sexpr{C[7,11]}& \Sexpr{C[7,12]}& \Sexpr{C[8,11]}& \Sexpr{C[8,12]}& \Sexpr{C[5,11]}& \Sexpr{C[5,12]}\\
  \bottomrule
\end{tabular*}
\end{center}
\caption{Comparison of intercepts $\alpha$ and slopes $\beta$ in log/log
  analysis of BLAS performance.}
\label{tab:logloganalysis}
\end{table}

The results in Table~\ref{tab:logloganalysis}, as well as the visualization
of the different slope estimates in Figure~\ref{fig:slopeloglog}, formalise
the rank-ordering of BLAS implementations seen in the analysis of the
individual charts above.
\begin{itemize}
\item Unsurprisingly, Reference BLAS are clearly dominated by all other
  alternatives and exhibit the highest increase in elapsed time per increase
  in matrix dimension.
\item Single-threaded Atlas is clearly improving on the Reference BLAS, but
  is dominated by the multi-threaded libraries and the GPU-based solution;
  the QR decomposition is an exception where both Atlas variants and the MKL
  are comparable and just behind Goto.
\item Multi-threaded Atlas39 is roughly comparable to the MKL (and ahead for
  matrix multiplication) and just behind Goto BLAS.
\item The MKL BLAS perform well, generally on par with or ahead of Atlas39
  but trailing Goto BLAS for two of the four algorithms.
\item Goto BLAS are ahead for the QR and SVD tests, and on par or behind for
  matrix multiplication and LU decompositions.
\item GPU computing is seen to have the lowest slope, corresponding to the
  smallest increase in elapsed time per additional unit of matrix
  dimension---but this has to be balanced against the cost for small to
  medium sizes which can be seen from the corresponding analysis of the
  intercepts (not shown but available in the package).
\end{itemize}

\begin{figure}[t!]
  \centering
<<fig=TRUE>>=
figure_LogLogSlopes()
@
  \caption{Comparison of slopes in log/log analysis of BLAS performance.}
  \label{fig:slopeloglog}
\end{figure}

% \subsection{Magma: GPU and BLAS combined}
% \pkg{Magma}, as a hybrid system comprising both CPU and GPU-based computing, has a
% lot of potential for improving over solutions using just one of the
% processing units. In this section we are trying to measure just how much of
% this potential is already realised with the early versions of Magma.
% \subsubsection{Performance of magma and BLAS relative to BLAS}
% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_magma_MatMult_QR()
% @
%   \caption{Matrix multiplication and QR decomposition with Magma}
% \end{figure}
% This chart illustrates a significant improvement for matrix multiplication in
% the case of the single-threaded Atlas libraries. As matrix size increases,
% the ratio increases and approaches 3.5 indicating a 3.5-times speedup for
% standard Atlas when combined with \pkg{magma}. However, for the
% already-multithreaded BLAS, the opposite effect is realized: combined
% performance is actually \textsl{slower} in the hybrid case. This result seems
% invariant to the choice of multi-threaded BLAS as all three implementations
% perform similarly poorly when combined with \pkg{magma}. The other noteworthy
% effect is the variability of results as $N$ approaches 1000.
% Attempts to run a QR decomposition resulted in segmentation fault on the card
% used in these tests (Quadro FX 4800) once the matrix dimensions exceeded the
% (relatively modest) size of 64.\footnote{The same library and operating
% system combination worked correctly on a newer card (Fermi C 1060) for
% Brian Smith (personal communication). However, we were unable to test on
% that platform.} This has been reported to the \pkg{Magma} development
% team; hopefully a future library version will allow these tests to be
% performed.
% \begin{figure}[t!]
%   \centering
% <<fig=TRUE>>=
% figure_magma_SVD_LU()
% @
%   \caption{SVD and LU decompositions with Magma}
% \end{figure}
% For the SVD decomposition, we see strong variability for all four BLAS
% libraries tested along with Magma. An initial performance penalty (below
% $N=500$) gets outweighed by a performance gain up to $N=1000$. However, once
% the matrix dimension increases, performance becomes indistinguishable from the
% use of just the accelerated BLAS. We should note this case differs from the
% others as \pkg{Magma} has not yet implemented an algorithm for the SVD. In
% that sense, this test provides an illustration of the potential cost to naive
% users and it is in fact an acceptable result that the ratio generally
% approaches one.
% For LU decomposition, we experience similar difficulties as for QR
% decomposition once the matrix dimension exceeded 2000. For smaller matrices,
% results are promising for the single-threaded Atlas---similar to the case of
% matrix multiplication---and unimpressive for Goto and MKL. In all cases small
% sizes lead to performance penalties. For medium-sized matrices Atlas39 manages
% a ratio of just above 1.0 indicating a small gain. The opposite is true for
% Goto and MKL.
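The log/log fits summarised in Table~\ref{tab:logloganalysis} are computed by
\code{loglogAnalysis()}; the following minimal sketch shows how an individual
fit can be reproduced by hand. The column names \code{nobs} (matrix
dimension) and \code{goto} (elapsed time for the Goto BLAS) are assumptions
about the layout of the data returned by \code{getBenchmarkData()} and may
need to be adapted:
\begin{verbatim}
## a minimal sketch of an individual log/log fit; column names are assumed
i7  <- getBenchmarkData("i7_920")
D   <- subset(i7, type == "matmult")
fit <- lm(log(goto) ~ log(nobs), data = D)
coef(fit)    # intercept (alpha) and slope (beta)
\end{verbatim}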
\section[Summary]{Summary}

We present benchmarking results comparing five different BLAS implementations
as well as a GPU-based approach for four different matrix computations on two
different hardware platforms.

We find the reference BLAS to be dominated in all cases. Single-threaded
Atlas BLAS improves on the reference BLAS but loses to multi-threaded BLAS.
Among the multi-threaded BLAS, we find that the Goto BLAS dominate the Intel
MKL, with the single exception of the QR decomposition on the xeon-based
system which may reveal an error. The development version of Atlas, when
compiled in multi-threaded mode, is competitive with both Goto BLAS and the
MKL. GPU computing is found to be compelling only for very large matrix
sizes.

Our benchmarking framework in the \pkg{gcbd} package can be employed by
others through the \proglang{R} packaging system which could lead to a wider
set of benchmark results. These results could be helpful for next-generation
systems which may need to make heuristic choices about when to compute on the
CPU and when to compute on the GPU. A strong empirical basis may make these
heuristics more robust.

\section*{Computational details}

The results in this paper were obtained using \proglang{R}
\Sexpr{paste(R.Version()[6:7], collapse = ".")} with the packages
\pkg{RSQLite}, \pkg{DBI}, \pkg{getopt} and the now-archived package
\pkg{gputools}. \proglang{R} itself and all packages used are available from
CRAN at \url{http://CRAN.R-project.org/}. The following Ubuntu packages were
used to provide the different BLAS implementations: \pkg{libblas} 1.2-build1,
\pkg{liblapack3gf} 3.2.1-2, \pkg{libatlas3gf-base} 3.6.0-24ubuntu,
\pkg{revolution-mkl} 3.0.0-1ubuntu1 (containing MKL 10.2 dated March 2009),
and \pkg{gotoblas2} 1.13-1 which was built using \pkg{gotoblas2-helper}
0.1-12 by \citet{Nakama:2010}. Apart from \pkg{gotoblas2-helper} and
\pkg{gotoblas2}, all these packages are available via every Ubuntu mirror.

\section*{Disclaimers}

NVidia provided the Quadro FX 4800 GPU card for evaluation / research
purposes. REvolution Computing (now Revolution Analytics) employed the author
as a paid consultant for the creation of the \pkg{revolution-mkl} package and
integration of REvolution R into Ubuntu 9.10.

\bibliography{gcbd}

\end{document}