%% $Id: entropy_np.Rnw,v 1.43 2010/02/18 14:43:37 jracine Exp jracine $
%\VignetteIndexEntry{Parallel np: Using the npRmpi Package}
%\VignetteDepends{npRmpi,boot,cubature,MASS}
%\VignetteKeywords{nonparametric, kernel, entropy, econometrics, qualitative, categorical}
%\VignettePackage{npRmpi}

\documentclass[nojss]{jss}

%% need no \usepackage{Sweave.sty}
\usepackage{amsmath,amsfonts}
\usepackage[utf8x]{inputenc}

\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\N}{\field{N}}
\newcommand{\bbR}{\field{R}} %% Blackboard R
\newcommand{\bbS}{\field{S}} %% Blackboard S

\author{Jeffrey S.~Racine\\McMaster University}
\title{Parallel np: Using the npRmpi Package}
\Plainauthor{Jeffrey S.~Racine}
\Plaintitle{Parallel np: Using the npRmpi Package}

\Abstract{
  The \pkg{npRmpi} package is a parallel implementation of the
  \proglang{R} (\citet{R}) package \pkg{np} (\citet{np}). The
  underlying \proglang{C} code uses the message passing interface
  (`\proglang{MPI}') and is \proglang{MPI2} compliant.
}

\Keywords{nonparametric, semiparametric, kernel smoothing, categorical data}
\Plainkeywords{Nonparametric, kernel, econometrics, qualitative, categorical}

\Address{Jeffrey S.~Racine\\
  Department of Economics\\
  McMaster University\\
  Hamilton, Ontario, Canada, L8S 4L8\\
  E-mail: \email{racinej@mcmaster.ca}\\
  URL: \url{https://experts.mcmaster.ca/people/racinej}\\
}

\begin{document}

%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
%% Note - fragile using \label{} in \section{} - must be outside

%% For graphics
\setkeys{Gin}{width=\textwidth}

%% Achim's request for R prompt set invisibly at the beginning of the
%% paper

\section{Overview}

A common and understandable complaint often voiced about applied nonparametric kernel methods is the amount of computation time required for data-driven bandwidth selection when one has a large data set.
There is a certain irony at play here since nonparametric methods are ideally suited to situations involving large data sets yet, computationally speaking, their analysis may lie beyond the reach of many users. Some background may be in order. My co-authors and I favor data-driven methods of bandwidth selection such as cross-validation, among others. These methods possess a number of very desirable properties but have run times that are proportional to the square of the number of observations; hence doubling the number of observations will increase run time by a factor of four. For large data sets, computation may simply not be feasible in a serial (i.e.~single processor) environment. The solution adopted in the \pkg{npRmpi} package is to run the code in a parallel computing environment and exploit the presence of multiple processors when available. The underlying \proglang{C} code for \pkg{np} is \proglang{MPI}-aware (\proglang{MPI} denotes the `message passing interface', a popular parallel programming library that is an international standard), and we merge the \pkg{np} and \pkg{Rmpi} packages to form the \pkg{npRmpi} package (this requires minor modification of some of the underlying \pkg{Rmpi} code, which is why we cannot simply load the \pkg{Rmpi} package itself).\footnote{The \pkg{npRmpi} package incorporates the \pkg{Rmpi} package (authored by Hao Yu) with minor modifications, and we are extremely grateful to Hao Yu for his contributions to the \proglang{R} community.} All of the functions in \pkg{np} can exploit the presence of multiple processors.
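To see why run times grow quadratically, note that the leave-one-out objectives used by cross-validation require a kernel evaluation for every pair of observations. The following sketch (plain \proglang{R}, purely illustrative; \pkg{np} itself uses optimized \proglang{C} code for this) computes a likelihood cross-validation objective for a univariate Gaussian kernel density estimator, and the sum over all pairs of observations is the source of the $O(n^2)$ cost:
\begin{verbatim}
## Illustrative only: a naive O(n^2) implementation of the likelihood
## cross-validation objective for a univariate Gaussian kernel density
## estimator (np uses optimized C code for this)
cv.ml.objective <- function(h, x) {
  n <- length(x)
  cv <- 0
  for(i in 1:n) {
    ## The leave-one-out density estimate at x[i] requires n-1 kernel
    ## evaluations, so the full objective requires n*(n-1)
    f.i <- sum(dnorm((x[i] - x[-i])/h))/((n - 1)*h)
    cv <- cv + log(f.i)
  }
  cv
}
## Each evaluation of the objective is O(n^2), hence doubling n
## quadruples the work regardless of the optimizer used
set.seed(42)
x <- rnorm(100)
h.opt <- optimize(cv.ml.objective, interval=c(0.01, 2), x=x,
                  maximum=TRUE)$maximum
\end{verbatim}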
Run time is inversely proportional to the number of processors; hence doubling the number of processors will cut run time in half.\footnote{There is minor overhead involved with message passing, and for small samples the overhead can be substantial when the ratio of message passing to computing the kernel estimator increases; this will be negligible for sufficiently large samples.} Given the availability of commodity cluster computers and the presence of multiple cores in desktop and laptop machines, leveraging the \pkg{npRmpi} package for large data sets may present a feasible solution to the often lengthy computation times associated with nonparametric kernel methods. The code has been tested in the macOS and Linux environments, which allow the user to compile \proglang{R} packages on the fly (presuming, of course, that a \proglang{C} compiler exists on your system). Users running MS Windows will have to consult local tech support personnel and may also wish to draw on the resources available for the \pkg{Rmpi} package and its associated web site for further assistance. I cannot assist with installation issues beyond what is provided in this document and trust the reader will forgive me for this.

\section{Differences Between np and npRmpi}

There are only a few visible differences between running code in serial versus parallel environments. Typically you run your parallel code in batch mode, so the first step is to get your code running in batch mode using the \pkg{np} package (obviously on a subset of your data for large data sets). Once you have properly functioning code, you will next add some `hooks' necessary for \proglang{MPI} to run (see Section \ref{sec example} below for a detailed example), and finally you will run the job using either \code{mpirun} or, indirectly, via a batch scheduler on your cluster such as \code{sqsub} (kindly consult your local support personnel for assistance on using batch queueing systems on your cluster).
Note that, since the data has to be broadcast to the slave nodes, it is a good idea to place it in a data frame first, and it is always a good idea at this stage to cast your variables according to type.

\subsection*{Installation}

Installation will depend on your hardware and software configuration and will vary widely across platforms. If you are not familiar with parallel computing you are strongly advised to seek local advice. That being said, if you have current versions of \proglang{R} and \proglang{MPI} (e.g., MPICH or Open MPI) properly installed on your system, installation of the \pkg{npRmpi} package can be done in the standard manner from within \proglang{R} via
\begin{verbatim}
> install.packages("npRmpi")
\end{verbatim}
or, alternatively, you can download the \pkg{npRmpi} tarball from CRAN and, from a command shell, run
\begin{verbatim}
R CMD INSTALL npRmpi_foo.tar.gz
\end{verbatim}
where \code{foo} is the version number. For clusters you may additionally need to provide the locations of libraries (kindly see your local sysadmin as there are far too many variations for me to assist). On a local Linux cluster I use the following by way of illustration (for this illustration we use the Intel compiler suite and need to set the \proglang{MPI} library path and \proglang{MPI} root directory):
\begin{verbatim}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/sharcnet/mpich/current/intel/lib
export MPI_ROOT=/opt/sharcnet/mpich/current/intel
\end{verbatim}
Please seek local help for further assistance on installing and running parallel programs.
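Recall the advice given earlier in this section: before broadcasting your data to the slave nodes, place it in a data frame and cast each variable according to its type. A minimal sketch (the variable names here are purely illustrative):
\begin{verbatim}
## Cast variables by type: numeric for continuous variables, factor
## for unordered categorical variables, and ordered for ordinal ones
## (income, gender, and education are hypothetical variables)
mydat <- data.frame(income = as.numeric(income),
                    gender = factor(gender),
                    education = ordered(education))
## The entire data frame can then be broadcast to the slaves in one
## step, e.g., via mpi.bcast.Robj2slave(mydat)
\end{verbatim}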
\subsection{Parallel Batch Execution}

To run a parallel \pkg{np} job having successfully installed the \pkg{npRmpi} package, copy the \code{Rprofile} file in \code{npRmpi/inst} to the current directory and name it \code{.Rprofile} (or copy it to your home directory and again name it \code{.Rprofile}).\footnote{You will need to download the \pkg{npRmpi} source code and unpack it in order to get \code{Rprofile} from the \code{npRmpi/inst} directory.} Then, to run the batch code in the file \code{npudensml_npRmpi.R} using two processors on an \proglang{MPI} system, you will enter something like
\begin{verbatim}
mpirun -np 2 R CMD BATCH npudensml_npRmpi.R
\end{verbatim}
You can compare run times and any other differences by examining the files \code{npudensml_serial.Rout} and \code{npudensml_npRmpi.Rout} (see Section \ref{sec run time} below for some illustrative examples). Clearly you could do this with a subset of your data for large problems to judge the extent to which the parallel code reduces run time. If you have a batch scheduler installed on your cluster you might instead enter something like
\begin{verbatim}
sqsub -q mpi -n 2 R CMD BATCH npudensml_npRmpi.R
\end{verbatim}
Again, kindly consult local tech support personnel for issues concerning the use of batch queueing systems and compute clusters.

\subsection{Parallel Execution in an Interactive R Session}

You might instead wish to run your code interactively in an \proglang{R} terminal/console rather than in batch mode. Doing so requires the essential elements in Section \ref{sec example} below with two minor differences. In this case we do not require that the initialization file \code{.Rprofile} reside in the home or current directory (if it does it will be benign). To start a session, (i) load the \pkg{npRmpi} package in the normal manner via \code{library("npRmpi")}, then (ii) generate the slave nodes using the command \code{mpi.spawn.Rslaves(nslaves=foo)}.
For example, on a dual core desktop set \code{foo=1} so that \code{mpi.spawn.Rslaves(nslaves=1)} will spawn one slave node in addition to the master node on which \proglang{R} is running, for a total of two compute nodes. See the file \code{interactive_Rterm.R} in the demo directory for an illustration, but here is some sample code illustrating the additional elements required for an interactive session. Note also that in an interactive session the messages displayed during execution may be informative, so there is no need for the option \code{options(np.messages=FALSE)}, and you likely would want to comment this statement out in the demo files if you are running them interactively.
\begin{verbatim}
## This code can be run interactively inside the R terminal/console.
## It does not require that you have the .Rprofile file from
## npRmpi/inst/ in your current directory or home directory as we can
## load the npRmpi package in the standard manner.

library(npRmpi)

## Now we can spawn our slaves from within the R terminal/console manually.
## On a two core desktop we need an additional slave on top of the master
## node to allow both cores to run simultaneously.

mpi.spawn.Rslaves(nslaves=1)

## Then the rest of your program would follow along the lines of that
## in, say, Section \ref{sec example}...
\end{verbatim}
A number of illustrative examples are readily available in the interactive session via \code{example()}, e.g.~\code{example(npreg)} and so forth.

\subsection{Essential Program Elements}\label{sec example}

Here is a simple illustrative example of a serial batch program that you would typically run using the \pkg{np} package.
\begin{verbatim}
## This is the serial version of npudensml_npRmpi.R for comparison
## purposes (bandwidths ought to be identical, timing may
## differ). Study the differences between this file and its MPI
## counterpart for insight about your own problems.
library(np)

options(np.messages=FALSE)

## Generate some data

n <- 2500

set.seed(42)
x <- rnorm(n)

## A simple example with likelihood cross-validation

t <- system.time(bw <- npudensbw(~x, bwmethod="cv.ml"))

summary(bw)

cat("Elapsed time =", t[3], "\n")
\end{verbatim}
Below is the same code modified to run in parallel using the \pkg{npRmpi} package. The salient differences are as follows:
\begin{enumerate}
\item You {\em must} copy the \code{Rprofile} file from the \code{npRmpi/inst} directory of the tarball/zip file into either your home directory or current working directory and rename it \code{.Rprofile}.
\item You will notice that there are some \code{mpi.foo} commands, where \code{foo} is, for example, \code{bcast.cmd}. These are the \pkg{Rmpi} commands for telling the slave nodes what to run. The first thing we do is initialize the master and slave nodes using the \code{np.mpi.initialize()} command.
\item Next we broadcast our data to the slave nodes using the \code{mpi.bcast.Robj2slave()} command, which sends an \proglang{R} object to the slaves.
\item After this, we might compute the data-driven bandwidths. Note that we have wrapped the \pkg{np} command \code{npudensbw()} in \code{mpi.bcast.cmd()} with the option \code{caller.execute=TRUE}, which indicates that it is to execute on the master and slave nodes simultaneously.
\item Finally, we clean up gracefully by broadcasting the \code{mpi.quit()} command.
\item There are a number of example files (including that above and below) in the \code{npRmpi/demo} directory that you may wish to examine. Each of these runs correctly and has been tested in a range of environments (macOS, Linux).
\end{enumerate}
\begin{verbatim}
## Make sure you have the .Rprofile file from npRmpi/inst/ in your
## current directory or home directory. It is necessary.

## To run this on systems with an MPI implementation (e.g. MPICH)
## installed and working, try
## mpirun -np 2 R CMD BATCH npudensml_npRmpi.R.
## Check the time in the
## output file npudensml_npRmpi.Rout (the name of this file with
## extension .Rout), then try with, say, 4 processors and compare run time.

## Initialize master and slaves.

mpi.bcast.cmd(np.mpi.initialize(),
              caller.execute=TRUE)

## Turn off progress i/o as this clutters the output file (if you want
## to see search progress you can comment out this command)

mpi.bcast.cmd(options(np.messages=FALSE),
              caller.execute=TRUE)

## Generate some data and broadcast it to all slaves (it will be known
## to the master node)

n <- 2500

mpi.bcast.cmd(set.seed(42), caller.execute=TRUE)

x <- rnorm(n)

mpi.bcast.Robj2slave(x)

## A simple example with likelihood cross-validation

t <- system.time(mpi.bcast.cmd(bw <- npudensbw(~x, bwmethod="cv.ml"),
                               caller.execute=TRUE))

summary(bw)

cat("Elapsed time =", t[3], "\n")

## Clean up properly then quit()

mpi.bcast.cmd(mpi.quit(),
              caller.execute=TRUE)
\end{verbatim}
For more examples, including regression, conditional density estimation, and semiparametric models, see the files in the \code{npRmpi/demo} directory. Kindly study these files and the comments in each in order to extend the parallel examples to your specific problem. Note that the output from the serial and parallel runs ought to be identical save for execution time. If they are not, there is a problem with the underlying code and I would ask you to kindly report such things to me immediately along with the offending code.

\section{Summary}

The \pkg{npRmpi} package is a parallel implementation of the \pkg{np} package that can exploit the presence of multiple processors and the \proglang{MPI} interface for parallel computing to reduce the computational run time associated with kernel methods.
Run time is inversely proportional to the number of available processors, so two processors will complete a job in roughly one half the time of one processor, ten in one tenth, and so forth.\footnote{There is minor overhead involved with message passing, and for small samples the overhead can be substantial as the ratio of message passing to computing the kernel estimator increases; this will be negligible for sufficiently large samples.} Though installation of a working \proglang{MPI} implementation requires some familiarity with computer systems, local expertise exists at many institutions and help is often to be found there. That being said, the macOS operating system can readily run fully functioning versions of \proglang{MPI} (e.g. MPICH or Open MPI), so there is minimal additional effort required for the user to get up and running in this environment. Finally, any feedback for improving this document, reports of errors and bugs, and so forth is always encouraged and much appreciated.

\bibliography{npRmpi}

\appendix

\section{Illustrative Timed Runs}\label{sec run time}

The run times reported in Table \ref{run table} were generated using R 4.5.2 and MPICH 4.3.2 on an Apple M2 Mac Studio (12 cores) running macOS 15.2 (current date: February 7, 2026). Each example was run in serial mode ($np=1$) using the \pkg{np} package and then in parallel mode ($np=2, 3, 4$) using the \pkg{npRmpi} package. Elapsed time for the \pkg{np} functions is provided in Table \ref{run table} (seconds), as is the ratio of the elapsed time for each parallel run to the serial run.
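The ratios in Table \ref{run table} can be read against a simple (and admittedly stylized) timing model. If $T_1$ denotes serial run time, $p$ the number of processors, and $c(p)$ the message-passing overhead, then parallel run time is approximately
\begin{equation*}
  T_p \approx \frac{T_1}{p} + c(p),
  \qquad\text{so}\qquad
  \frac{T_p}{T_1} \approx \frac{1}{p} + \frac{c(p)}{T_1},
\end{equation*}
and the reported ratio approaches $1/p$ only when $T_1$ is large relative to the overhead, which is the pattern visible for the larger samples in the table.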
\begin{table}[!ht]
\begin{center}
\caption{\label{run table}Illustrative timed runs (seconds) with 1 processor (serial, \pkg{np} package) and 2, 3, and 4 processors (parallel, \pkg{npRmpi} package) on an Apple M2 Mac Studio.}
\scriptsize
\begin{tabular}{llrrrrrrr}
Function & $n$ & Secs(1) & Secs(2) & Secs(3) & Secs(4) & Ratio(2) & Ratio(3) & Ratio(4)\\
\hline
npcdensls & 1000 & 460.6 & 236.0 & 166.4 & 126.0 & 0.51 & 0.36 & 0.27\\
npcdensml & 2500 & 35.5 & 18.0 & 12.2 & 9.4 & 0.51 & 0.34 & 0.27\\
npcdistls & 2000 & 84.1 & 43.2 & 36.7 & 31.2 & 0.51 & 0.44 & 0.37\\
npcmstest & 616 & 10.0 & 5.8 & 4.4 & 3.8 & 0.58 & 0.44 & 0.38\\
npconmode & 189 & 8.9 & 5.3 & 4.1 & 3.5 & 0.60 & 0.47 & 0.40\\
npcopula & 5000 & 5.1 & 3.0 & 3.1 & 2.5 & 0.60 & 0.61 & 0.50\\
npdeneqtest & 2500 & 33.8 & 17.1 & 11.9 & 9.3 & 0.50 & 0.35 & 0.28\\
npdeptest & 2500 & 41.9 & 21.1 & 14.7 & 11.3 & 0.50 & 0.35 & 0.27\\
npglpreg & 1500 & 422.6 & 235.7 & 167.7 & 136.3 & 0.56 & 0.40 & 0.32\\
npindexich & 5000 & 18.6 & 10.2 & 6.8 & 5.2 & 0.55 & 0.37 & 0.28\\
npindexks & 5000 & 24.6 & 13.4 & 9.0 & 6.9 & 0.54 & 0.37 & 0.28\\
npplreg & 1000 & 10.1 & 6.0 & 4.3 & 3.4 & 0.59 & 0.42 & 0.34\\
npqreg & 1008 & 27.1 & 15.1 & 12.2 & 10.2 & 0.56 & 0.45 & 0.38\\
npregiv & 2500 & 139.1 & 112.2 & 97.5 & 91.5 & 0.81 & 0.70 & 0.66\\
npreglcaic & 5000 & 160.8 & 82.9 & 55.0 & 40.4 & 0.52 & 0.34 & 0.25\\
npreglcls & 5000 & 157.1 & 81.0 & 54.5 & 39.7 & 0.52 & 0.35 & 0.25\\
npregllaic & 5000 & 97.0 & 66.1 & 50.5 & 39.3 & 0.68 & 0.52 & 0.40\\
npregllls & 5000 & 96.7 & 64.1 & 48.9 & 38.4 & 0.66 & 0.51 & 0.40\\
npscoef & 10000 & 42.7 & 24.6 & 17.9 & 14.7 & 0.58 & 0.42 & 0.34\\
npsdeptest & 1500 & 70.8 & 37.3 & 26.4 & 20.8 & 0.53 & 0.37 & 0.29\\
npsigtest & 1000 & 56.9 & 42.9 & 47.1 & 37.9 & 0.75 & 0.83 & 0.67\\
npsymtest & 2500 & 38.9 & 20.4 & 14.5 & 11.1 & 0.52 & 0.37 & 0.28\\
npudensls & 10000 & 85.1 & 42.0 & 28.2 & 21.2 & 0.49 & 0.33 & 0.25\\
npudensml & 10000 & 42.5 & 20.9 & 14.0 & 10.7 & 0.49 & 0.33 & 0.25\\
npudistcdf & 10000 & 119.5 & 62.6 & 89.3 & 78.7 & 0.52 & 0.75 & 0.66\\
npunitest & 5000 & 155.5 & 76.4 & 52.2 & 39.5 & 0.49 & 0.34 & 0.25\\
\hline
\end{tabular}
\end{center}
\end{table}

Note that many of these illustrative examples use smallish sample sizes, hence the run time with 2 processors will not be $1/2$ that with 1 processor due to overhead. But for larger samples (i.e.~the ones for which you actually need parallel computing, not these toy illustrations) you ought to see a reduction in run time that is roughly inversely proportional to the number of processors.

\end{document}