%% $Id: entropy_np.Rnw,v 1.43 2010/02/18 14:43:37 jracine Exp jracine $
%\VignetteIndexEntry{Parallel np: Using the npRmpi Package}
%\VignetteDepends{npRmpi,boot,cubature,MASS}
%\VignetteKeywords{nonparametric, kernel, entropy, econometrics, qualitative, categorical}
%\VignettePackage{npRmpi}

\documentclass[nojss]{jss}

%% need no \usepackage{Sweave.sty}
\usepackage{amsmath,amsfonts}
\usepackage[utf8x]{inputenc}

\newcommand{\field}[1]{\mathbb{#1}}
\newcommand{\N}{\field{N}}
\newcommand{\bbR}{\field{R}} %% Blackboard R
\newcommand{\bbS}{\field{S}} %% Blackboard S

\author{Jeffrey S.~Racine\\McMaster University}
\title{Parallel np: Using the npRmpi Package}
\Plainauthor{Jeffrey S.~Racine}
\Plaintitle{Parallel np: Using the npRmpi Package}

\Abstract{
  The \pkg{npRmpi} package is a parallel implementation of the
  \proglang{R} (\citet{R}) package \pkg{np} (\citet{np}). The
  underlying \proglang{C} code uses the message passing interface
  (`\proglang{MPI}') and is \proglang{MPI2} compliant.
}

\Keywords{nonparametric, semiparametric, kernel smoothing, categorical data}
\Plainkeywords{Nonparametric, kernel, econometrics, qualitative, categorical}

\Address{Jeffrey S.~Racine\\
  Department of Economics\\
  McMaster University\\
  Hamilton, Ontario, Canada, L8S 4L8\\
  E-mail: \email{racinej@mcmaster.ca}\\
  URL: \url{https://experts.mcmaster.ca/people/racinej}\\
}

\begin{document}

%% Note that you should use the \pkg{}, \proglang{} and \code{} commands.
%% Note - fragile using \label{} in \section{} - must be outside

%% For graphics
\setkeys{Gin}{width=\textwidth}

%% Achim's request for R prompt set invisibly at the beginning of the
%% paper

\section{Overview}

A common and understandable complaint often voiced about applied nonparametric kernel methods is the amount of computation time required for data-driven bandwidth selection when one has a large data set.
There is a certain irony at play here since nonparametric methods are ideally suited to situations involving large data sets yet, computationally speaking, their analysis may lie beyond the reach of many users. Some background may be in order. My co-authors and I favor data-driven methods of bandwidth selection such as cross-validation, among others. These methods possess a number of very desirable properties but have run times that are proportional to the square of the number of observations; hence doubling the number of observations will increase run time by a factor of four. For large data sets, computation may simply not be feasible in a serial (i.e.~single processor) environment. The solution adopted in the \pkg{npRmpi} package is to run the code in a parallel computing environment and exploit the presence of multiple processors when available. The underlying \proglang{C} code for \pkg{np} is \proglang{MPI}-aware (\proglang{MPI} denotes the `message passing interface', a popular parallel programming library that is an international standard), and we merge the \pkg{np} and \pkg{Rmpi} packages to form the \pkg{npRmpi} package (this requires minor modification of some of the underlying \pkg{Rmpi} code, which is why we cannot simply load the \pkg{Rmpi} package itself).\footnote{The \pkg{npRmpi} package incorporates the \pkg{Rmpi} package (authored by Hao Yu) with minor modifications, and we are extremely grateful to Hao Yu for his contributions to the \proglang{R} community.} All of the functions in \pkg{np} can exploit the presence of multiple processors.
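To see why run times grow quadratically, note that the leave-one-out objectives used by cross-validation require a kernel evaluation for every pair of observations. The following sketch (plain \proglang{R}, purely illustrative; \pkg{np} itself uses optimized \proglang{C} code for this) computes a likelihood cross-validation objective for a univariate Gaussian kernel density estimator, and the sum over all pairs of observations is the source of the $O(n^2)$ cost:
\begin{verbatim}
## Illustrative only: a naive O(n^2) implementation of the likelihood
## cross-validation objective for a univariate Gaussian kernel density
## estimator (np uses optimized C code for this)
cv.ml.objective <- function(h, x) {
  n <- length(x)
  cv <- 0
  for(i in 1:n) {
    ## The leave-one-out density estimate at x[i] requires n-1 kernel
    ## evaluations, so the full objective requires n*(n-1)
    f.i <- sum(dnorm((x[i] - x[-i])/h))/((n - 1)*h)
    cv <- cv + log(f.i)
  }
  cv
}
## Each evaluation of the objective is O(n^2), hence doubling n
## quadruples the work regardless of the optimizer used
set.seed(42)
x <- rnorm(100)
h.opt <- optimize(cv.ml.objective, interval=c(0.01, 2), x=x,
                  maximum=TRUE)$maximum
\end{verbatim}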
Run time is inversely proportional to the number of processors; hence doubling the number of processors will cut run time in half.\footnote{There is minor overhead involved with message passing, and for small samples the overhead can be substantial when the ratio of message passing to computing the kernel estimator increases; this will be negligible for sufficiently large samples.} Given the availability of commodity cluster computers and the presence of multiple cores in desktop and laptop machines, leveraging the \pkg{npRmpi} package for large data sets may present a feasible solution to the often lengthy computation times associated with nonparametric kernel methods. The code has been tested in the macOS and Linux environments, which allow the user to compile \proglang{R} packages on the fly (presuming, of course, that a \proglang{C} compiler exists on your system). Users running MS Windows will have to consult local tech support personnel and may also wish to draw on the resources available for the \pkg{Rmpi} package and its associated web site for further assistance. I cannot assist with installation issues beyond what is provided in this document and trust the reader will forgive me for this.

\section{Differences Between np and npRmpi}

There are only a few visible differences between running code in serial versus parallel environments. Typically you run your parallel code in batch mode, so the first step is to get your code running in batch mode using the \pkg{np} package (obviously on a subset of your data for large data sets). Once you have properly functioning code, you will next add some `hooks' necessary for \proglang{MPI} to run (see Section \ref{sec example} below for a detailed example), and finally you will run the job using either \code{mpirun} or, indirectly, via a batch scheduler on your cluster such as \code{sqsub} (kindly consult your local support personnel for assistance on using batch queueing systems on your cluster).
Note that, since the data has to be broadcast to the slave nodes, it is a good idea to place it in a data frame first, and it is always a good idea at this stage to cast your variables according to type.

\subsection*{Installation}

Installation will depend on your hardware and software configuration and will vary widely across platforms. If you are not familiar with parallel computing you are strongly advised to seek local advice. That being said, if you have current versions of \proglang{R} and \proglang{MPI} (e.g., MPICH or Open MPI) properly installed on your system, installation of the \pkg{npRmpi} package can be done in the standard manner from within \proglang{R} via
\begin{verbatim}
> install.packages("npRmpi")
\end{verbatim}
or, alternatively, you can download the \pkg{npRmpi} tarball from CRAN and, from a command shell, run
\begin{verbatim}
R CMD INSTALL npRmpi_foo.tar.gz
\end{verbatim}
where \code{foo} is the version number. For clusters you may additionally need to provide the locations of libraries (kindly see your local sysadmin as there are far too many variations for me to assist). On a local Linux cluster I use the following by way of illustration (for this illustration we use the Intel compiler suite and need to set the \proglang{MPI} library path and \proglang{MPI} root directory):
\begin{verbatim}
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/sharcnet/mpich/current/intel/lib
export MPI_ROOT=/opt/sharcnet/mpich/current/intel
\end{verbatim}
Please seek local help for further assistance on installing and running parallel programs.
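Recall the advice given earlier in this section: before broadcasting your data to the slave nodes, place it in a data frame and cast each variable according to its type. A minimal sketch (the variable names here are purely illustrative):
\begin{verbatim}
## Cast variables by type: numeric for continuous variables, factor
## for unordered categorical variables, and ordered for ordinal ones
## (income, gender, and education are hypothetical variables)
mydat <- data.frame(income = as.numeric(income),
                    gender = factor(gender),
                    education = ordered(education))
## The entire data frame can then be broadcast to the slaves in one
## step, e.g., via mpi.bcast.Robj2slave(mydat)
\end{verbatim}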
\subsection{Parallel Batch Execution}

To run a parallel \pkg{np} job having successfully installed the \pkg{npRmpi} package, copy the \code{Rprofile} file in \code{npRmpi/inst} to the current directory and name it \code{.Rprofile} (or copy it to your home directory and again name it \code{.Rprofile}).\footnote{You will need to download the \pkg{npRmpi} source code and unpack it in order to get \code{Rprofile} from the \code{npRmpi/inst} directory.} Then, to run the batch code in the file \code{npudensml_npRmpi.R} using two processors on an \proglang{MPI} system, you will enter something like
\begin{verbatim}
mpirun -np 2 R CMD BATCH npudensml_npRmpi.R
\end{verbatim}
You can compare run times and any other differences by examining the files \code{npudensml_serial.Rout} and \code{npudensml_npRmpi.Rout} (see Section \ref{sec run time} below for some illustrative examples). Clearly you could do this with a subset of your data for large problems to judge the extent to which the parallel code reduces run time. If you have a batch scheduler installed on your cluster you might instead enter something like
\begin{verbatim}
sqsub -q mpi -n 2 R CMD BATCH npudensml_npRmpi.R
\end{verbatim}
Again, kindly consult local tech support personnel for issues concerning the use of batch queueing systems and compute clusters.

\subsection{Parallel Execution in an Interactive R Session}

You might instead wish to run your code interactively in an \proglang{R} terminal/console rather than in batch mode. Doing so requires the essential elements in Section \ref{sec example} below with two minor differences. In this case we do not require that the initialization file \code{.Rprofile} reside in the home or current directory (if it does it will be benign). To start a session, (i) load the \pkg{npRmpi} package in the normal manner via \code{library("npRmpi")}, then (ii) generate the slave nodes using the command \code{mpi.spawn.Rslaves(nslaves=foo)}.
For example, on a dual core desktop set \code{foo=1} so that \code{mpi.spawn.Rslaves(nslaves=1)} will spawn one slave node in addition to the master node on which \proglang{R} is running, for a total of two compute nodes. See the file \code{interactive_Rterm.R} in the demo directory for an illustration, but here is some sample code illustrating the additional elements required for an interactive session. Note also that in an interactive session the messages displayed during execution may be informative, so there is no need for the option \code{options(np.messages=FALSE)}, and you likely would want to comment this statement out in the demo files if you are running them interactively.
\begin{verbatim}
## This code can be run interactively inside the R terminal/console.
## It does not require that you have the .Rprofile file from
## npRmpi/inst/ in your current directory or home directory as we can
## load the npRmpi package in the standard manner.

library(npRmpi)

## Now we can spawn our slaves from within the R terminal/console manually.
## On a two core desktop we need an additional slave on top of the master
## node to allow both cores to run simultaneously.

mpi.spawn.Rslaves(nslaves=1)

## Then the rest of your program would follow along the lines of that
## in, say, Section \ref{sec example}...
\end{verbatim}
A number of illustrative examples are readily available in the interactive session via \code{example()}, e.g.~\code{example(npreg)} and so forth.

\subsection{Essential Program Elements}\label{sec example}

Here is a simple illustrative example of a serial batch program that you would typically run using the \pkg{np} package.
\begin{verbatim}
## This is the serial version of npudensml_npRmpi.R for comparison
## purposes (bandwidths ought to be identical, timing may
## differ). Study the differences between this file and its MPI
## counterpart for insight about your own problems.
library(np)

options(np.messages=FALSE)

## Generate some data

n <- 2500

set.seed(42)
x <- rnorm(n)

## A simple example with likelihood cross-validation

t <- system.time(bw <- npudensbw(~x, bwmethod="cv.ml"))

summary(bw)

cat("Elapsed time =", t[3], "\n")
\end{verbatim}
Below is the same code modified to run in parallel using the \pkg{npRmpi} package. The salient differences are as follows:
\begin{enumerate}
\item You {\em must} copy the \code{Rprofile} file from the \code{npRmpi/inst} directory of the tarball/zip file into either your home directory or current working directory and rename it \code{.Rprofile}.
\item You will notice that there are some \code{mpi.foo} commands, where \code{foo} is, for example, \code{bcast.cmd}. These are the \pkg{Rmpi} commands for telling the slave nodes what to run. The first thing we do is initialize the master and slave nodes using the \code{np.mpi.initialize()} command.
\item Next we broadcast our data to the slave nodes using the \code{mpi.bcast.Robj2slave()} command, which sends an \proglang{R} object to the slaves.
\item After this, we might compute the data-driven bandwidths. Note that we have wrapped the \pkg{np} command \code{npudensbw()} in \code{mpi.bcast.cmd()} with the option \code{caller.execute=TRUE}, which indicates that it is to execute on the master and slave nodes simultaneously.
\item Finally, we clean up gracefully by broadcasting the \code{mpi.quit()} command.
\item There are a number of example files (including that above and below) in the \code{npRmpi/demo} directory that you may wish to examine. Each of these runs correctly and has been tested in a range of environments (macOS, Linux).
\end{enumerate}
\begin{verbatim}
## Make sure you have the .Rprofile file from npRmpi/inst/ in your
## current directory or home directory. It is necessary.

## To run this on systems with an MPI implementation (e.g. MPICH)
## installed and working, try
## mpirun -np 2 R CMD BATCH npudensml_npRmpi.R.
## Check the time in the
## output file npudensml_npRmpi.Rout (the name of this file with
## extension .Rout), then try with, say, 4 processors and compare run time.

## Initialize master and slaves.

mpi.bcast.cmd(np.mpi.initialize(),
              caller.execute=TRUE)

## Turn off progress i/o as this clutters the output file (if you want
## to see search progress you can comment out this command)

mpi.bcast.cmd(options(np.messages=FALSE),
              caller.execute=TRUE)

## Generate some data and broadcast it to all slaves (it will be known
## to the master node)

n <- 2500

mpi.bcast.cmd(set.seed(42), caller.execute=TRUE)

x <- rnorm(n)

mpi.bcast.Robj2slave(x)

## A simple example with likelihood cross-validation

t <- system.time(mpi.bcast.cmd(bw <- npudensbw(~x, bwmethod="cv.ml"),
                               caller.execute=TRUE))

summary(bw)

cat("Elapsed time =", t[3], "\n")

## Clean up properly then quit()

mpi.bcast.cmd(mpi.quit(),
              caller.execute=TRUE)
\end{verbatim}
For more examples, including regression, conditional density estimation, and semiparametric models, see the files in the \code{npRmpi/demo} directory. Kindly study these files and the comments in each in order to extend the parallel examples to your specific problem. Note that the output from the serial and parallel runs ought to be identical save for execution time. If they are not, there is a problem with the underlying code and I would ask you to kindly report such things to me immediately along with the offending code.

\section{Summary}

The \pkg{npRmpi} package is a parallel implementation of the \pkg{np} package that can exploit the presence of multiple processors and the \proglang{MPI} interface for parallel computing to reduce the computational run time associated with kernel methods.
Run time is inversely proportional to the number of available processors, so two processors will complete a job in roughly one half the time of one processor, ten in one tenth, and so forth.\footnote{There is minor overhead involved with message passing, and for small samples the overhead can be substantial as the ratio of message passing to computing the kernel estimator increases; this will be negligible for sufficiently large samples.} Though installation of a working \proglang{MPI} implementation requires some familiarity with computer systems, local expertise exists at many institutions and help is often to be found there. That being said, the macOS operating system can readily run fully functioning versions of \proglang{MPI} (e.g. MPICH or Open MPI), so there is minimal additional effort required for the user to get up and running in this environment. Finally, any feedback for improving this document, reports of errors and bugs, and so forth is always encouraged and much appreciated.

\bibliography{npRmpi}

\appendix

\section{Illustrative Timed Runs}\label{sec run time}

The run times reported in Table \ref{run table} were generated using R 4.5.2 and MPICH 4.3.2 on an Apple M2 Mac Studio (12 cores) running macOS 15.2 (current date: February 7, 2026). Each example was run in serial mode ($np=1$) using the \pkg{np} package and then in parallel mode ($np=2, 3, 4$) using the \pkg{npRmpi} package. Elapsed time for the \pkg{np} functions is provided in Table \ref{run table} (seconds), as is the ratio of the elapsed time for each parallel run to the serial run.
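The ratios in Table \ref{run table} can be read against a simple (and admittedly stylized) timing model. If $T_1$ denotes serial run time, $p$ the number of processors, and $c(p)$ the message-passing overhead, then parallel run time is approximately
\begin{equation*}
  T_p \approx \frac{T_1}{p} + c(p),
  \qquad\text{so}\qquad
  \frac{T_p}{T_1} \approx \frac{1}{p} + \frac{c(p)}{T_1},
\end{equation*}
and the reported ratio approaches $1/p$ only when $T_1$ is large relative to the overhead, which is the pattern visible for the larger samples in the table.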
\begin{table}[!ht]
\begin{center}
\caption{\label{run table}Illustrative timed runs (seconds) with 1 processor (serial, \pkg{np} package) and 2, 3, and 4 processors (parallel, \pkg{npRmpi} package) on an Apple M2 Mac Studio.}
\scriptsize
\begin{tabular}{llrrrrrrr}
Function & $n$ & Secs(1) & Secs(2) & Secs(3) & Secs(4) & Ratio(2) & Ratio(3) & Ratio(4)\\
\hline
npcdensls & 1000 & 460.6 & 236.0 & 166.4 & 126.0 & 0.51 & 0.36 & 0.27\\
npcdensml & 2500 & 35.5 & 18.0 & 12.2 & 9.4 & 0.51 & 0.34 & 0.27\\
npcdistls & 2000 & 84.1 & 43.2 & 36.7 & 31.2 & 0.51 & 0.44 & 0.37\\
npcmstest & 616 & 10.0 & 5.8 & 4.4 & 3.8 & 0.58 & 0.44 & 0.38\\
npconmode & 189 & 8.9 & 5.3 & 4.1 & 3.5 & 0.60 & 0.47 & 0.40\\
npcopula & 5000 & 5.1 & 3.0 & 3.1 & 2.5 & 0.60 & 0.61 & 0.50\\
npdeneqtest & 2500 & 33.8 & 17.1 & 11.9 & 9.3 & 0.50 & 0.35 & 0.28\\
npdeptest & 2500 & 41.9 & 21.1 & 14.7 & 11.3 & 0.50 & 0.35 & 0.27\\
npglpreg & 1500 & 422.6 & 235.7 & 167.7 & 136.3 & 0.56 & 0.40 & 0.32\\
npindexich & 5000 & 18.6 & 10.2 & 6.8 & 5.2 & 0.55 & 0.37 & 0.28\\
npindexks & 5000 & 24.6 & 13.4 & 9.0 & 6.9 & 0.54 & 0.37 & 0.28\\
npplreg & 1000 & 10.1 & 6.0 & 4.3 & 3.4 & 0.59 & 0.42 & 0.34\\
npqreg & 1008 & 27.1 & 15.1 & 12.2 & 10.2 & 0.56 & 0.45 & 0.38\\
npregiv & 2500 & 139.1 & 112.2 & 97.5 & 91.5 & 0.81 & 0.70 & 0.66\\
npreglcaic & 5000 & 160.8 & 82.9 & 55.0 & 40.4 & 0.52 & 0.34 & 0.25\\
npreglcls & 5000 & 157.1 & 81.0 & 54.5 & 39.7 & 0.52 & 0.35 & 0.25\\
npregllaic & 5000 & 97.0 & 66.1 & 50.5 & 39.3 & 0.68 & 0.52 & 0.40\\
npregllls & 5000 & 96.7 & 64.1 & 48.9 & 38.4 & 0.66 & 0.51 & 0.40\\
npscoef & 10000 & 42.7 & 24.6 & 17.9 & 14.7 & 0.58 & 0.42 & 0.34\\
npsdeptest & 1500 & 70.8 & 37.3 & 26.4 & 20.8 & 0.53 & 0.37 & 0.29\\
npsigtest & 1000 & 56.9 & 42.9 & 47.1 & 37.9 & 0.75 & 0.83 & 0.67\\
npsymtest & 2500 & 38.9 & 20.4 & 14.5 & 11.1 & 0.52 & 0.37 & 0.28\\
npudensls & 10000 & 85.1 & 42.0 & 28.2 & 21.2 & 0.49 & 0.33 & 0.25\\
npudensml & 10000 & 42.5 & 20.9 & 14.0 & 10.7 & 0.49 & 0.33 & 0.25\\
npudistcdf & 10000 & 119.5 & 62.6 & 89.3 & 78.7 & 0.52 & 0.75 & 0.66\\
npunitest & 5000 & 155.5 & 76.4 & 52.2 & 39.5 & 0.49 & 0.34 & 0.25\\
\hline
\end{tabular}
\end{center}
\end{table}

Note that many of these illustrative examples use smallish sample sizes, hence the run time with 2 processors will not be $1/2$ that with 1 processor due to overhead. But for larger samples (i.e.~the ones for which you actually need parallel computing, not these toy illustrations) you ought to see a reduction in run time that is roughly inversely proportional to the number of processors.

\end{document}