
Implicit and Explicit Parallel Computing in R

Luke Tierney

Department of Statistics and Actuarial Science


University of Iowa, Iowa City, IA 52242, U.S.A., luke@stat.uiowa.edu

Abstract. This paper outlines two approaches to introducing parallel computing
to the R statistical computing environment. The first approach is based on implicitly
parallelizing basic R operations, such as vectorized arithmetic operations; this is
suitable for taking advantage of multi-core processors with shared memory. The
second approach is based on developing a small set of explicit parallel computation
directives and is most useful in a distributed memory framework.

Keywords: shared memory parallel computing, distributed memory parallel
computing, vectorized arithmetic

1 Introduction

With increasing availability of multi-core processors as well as computational
clusters it is useful to explore ways in which the R system can be modified or
extended so statisticians can take advantage of these resources and speed up
their computations or make computations feasible that would otherwise take
too long to complete. While a major rewrite of R to take advantage of parallel
resources would provide many interesting opportunities and challenges, the
current objectives are more modest: to see what can be accomplished with
reasonable levels of developer effort within the current R design. Several ap-
proaches are possible. One approach is to modify basic vectorized operations
to automatically split computations among all available processors. This im-
plicit approach tends to be most effective in a shared memory context. No
user input is required, thus making this approach simple to use.
A second approach is to develop a small set of simple constructs for
expressing parallel computations explicitly.
This approach requires some learning on the part of users but will allow
coarser-grained parallelization and is also suitable for use in distributed mem-
ory settings that often allow much larger numbers of processors to be used.
This paper outlines efforts to add both forms of parallel computing to R.
The next section discusses one approach to parallelizing some basic arithmetic
operations. The third section outlines a simple explicit parallel computing
framework. The final section provides some discussion and outlines future
work.

2 Implicit parallel computing in R


The basic idea in implicit parallelization of a high level language like R is to
identify operations that are computationally intensive and to arrange for the
work of these operations to be divided up among several processors without
requiring explicit intervention on the part of the user. This involves the use
of multiple threads. In some cases libraries using this approach are readily
available and can be used by linking them into R. For example, most lin-
ear algebra computations are based on the basic linear algebra subroutines
library, or BLAS. R provides its own basic implementation derived from an
open source reference implementation, but makes it easy to substitute an
alternate implementation, such as a hardware vendor library or one from the
ATLAS project (Whaley and Petitet 2005). A number of BLAS implemen-
tations provide threaded versions that try to improve performance by using
multiple threads. A major challenge is that there is overhead associated with
synchronization of threads, among other things, that can result in threaded
versions running slower than non-threaded ones. This has been observed in
the use of threaded BLAS libraries.
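Whether a substituted BLAS actually helps is easy to check empirically. The
following is a minimal sketch, not part of the benchmarks reported here: time
a BLAS-heavy operation such as a large cross-product before and after
switching libraries.

## Time a BLAS-heavy operation; with a tuned or threaded BLAS linked into
## R this should run noticeably faster than with the reference BLAS, while
## for small matrices the threading overhead can dominate.
a <- matrix(rnorm(2000 * 2000), 2000, 2000)
system.time(crossprod(a))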
Another candidate for implicit parallelization is R’s vectorized arithmetic
operations. The R math library includes many special functions, densities,
cumulative distribution functions, and quantile functions. R level versions
of these functions apply the functions to all elements of vector arguments.
This is currently implemented by a simple loop. If the work of this loop is
divided among several processors then the resulting computation may run
faster. However, care is needed as there is synchronization overhead, and
shared resources (memory, bus, etc.) impose bottlenecks. As a result, while
parallelization of vectorized operations will be beneficial for large vectors,
it can be harmful for short ones. Careful tuning is needed to ensure that
parallelization is only used if it will be helpful.
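The kind of measurement underlying Figure 1 below can be sketched directly
in R; the function and vector lengths used here are arbitrary choices, and on
an unmodified R build the code simply times the sequential loop.

## Median evaluation time, in milliseconds, of a vectorized function over a
## range of vector lengths (for very short vectors the results may be at
## the resolution of the timer).
time_vectorized <- function(f, lengths, reps = 100)
  sapply(lengths, function(n) {
    x <- runif(n)
    1000 * median(replicate(reps, system.time(f(x))["elapsed"]))
  })
lengths <- seq(100, 1000, by = 100)
plot(lengths, time_vectorized(qnorm, lengths), type = "b",
     xlab = "n", ylab = "Time in milliseconds")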
Figure 1 shows performance measurements for a range of vector sizes and
two functions on two 8-processor systems. Some simple empirical observations
from these and similar plots for other functions: Times are roughly linear in
vector length for each function/OS/thread number combination. The inter-
cepts are roughly the same for all functions on a given platform. If the slope
for P processors is sP then, at least for P = 2 and P = 4, the approximation
sP ≈ s1/P seems reasonable. Finally, the relative slopes for different functions
seem roughly independent of OS/architecture.
These observations motivate a simple strategy: Relative slopes are com-
puted using a number of runs on a range of platforms and recorded. The slope
for the normal density function dnorm is used as a base line; thus timings are
computed in units of single element dnorm calculations. Intercepts are esti-
mated for each OS/architecture combination. The two-processor intercept for
Linux/AMD/x86_64 is approximately 200 dnorm evaluations; for Mac OS X
10.4/Intel/i386 it is around 500. Using this information one can estimate for
each function f the value N2 (f ) such that using P = 2 processors is faster

[Figure 1: four panels plotting time in milliseconds against vector length n
(0 to 1000) for qnorm and pgamma on Linux/AMD/x86_64 and on Mac OS
X/Intel/i386, with separate curves for 1, 2, 4, and 8 threads.]

Fig. 1. Timings of vectorized function evaluations for qnorm and pgamma as a
function of vector length for two 8-processor systems. Plots for 10 replications are
shown.

than using a single processor for vectors of length n > N2(f). For P = 4
processors we use N4(f) = 2N2(f), and for P = 8 we use N8(f) = 4N2(f).
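One plausible way to turn these calibration quantities into cutoffs, sketched
below, is to equate the one- and two-thread cost models: writing the one-thread
time as s1*n and the two-thread time as c2 + s1*n/2, both in units of
single-element dnorm evaluations, two threads win once n exceeds 2*c2/s1.
This is an illustration only, not the calibration code used in the
implementation, and the slopes in the example calls are guesses rather than
measured values.

## N2: vector length above which two threads beat one, given the relative
## per-element slope of f (dnorm = 1) and the two-thread intercept in
## dnorm units (roughly 200 on the Linux/AMD system quoted above).
N2 <- function(rel_slope, intercept = 200) ceiling(2 * intercept / rel_slope)
N2(1)      # a dnorm-like function
N2(0.25)   # a cheaper function (guessed relative slope)
N2(100)    # a very expensive function (guessed relative slope)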
Figure 2 shows selected values of N2 (f ) for a Linux/AMD system. For
simple functions like sqrt parallel evaluation does not pay for vectors with
fewer than n = 2000 elements. For dnorm the cutoff is around 500. For some
very computationally intensive functions, such as qtukey, parallel evaluation
is useful for vectors of length n = 2.
Implementing the approach outlined above involves using threads to eval-
uate different parts of the basic vectorization loops. One possibility is to
directly use a basic threading API, such as pthreads, but a better choice is
to use Open MP (Chandra et al. 2000). Many commercial compilers as well
as gcc 4.2 support Open MP; Redhat has back-ported the Open MP support

[Figure 2: N2(f) values on Linux, on a scale from 0 to 2000, for sqrt, sin, cos,
exp, dnorm, pnorm, qnorm, dgamma, pgamma, qgamma, pbeta, qbeta, ptukey,
and qtukey.]

Fig. 2. Selected cutoff levels for switching to parallel evaluation of vectorized func-
tions.

to gcc 4.1 in recent Fedora and Redhat Linux releases. The current MinGW
Windows compiler suite also includes Open MP support.
Open MP uses compiler directives (#pragma statements in C; FORTRAN
uses structured comments) to request parallel implementation of a loop. For
example, Figure 3 shows the loop used for vectorizing a function of a single ar-
gument along with the Open MP parallelization directive. Functions of more

#pragma omp parallel for if (P > 0) num_threads(P) \
        default(shared) private(i) reduction(&&:naflag)
for (i = 0; i < n; i++) {
    double ai = a[i];
    MATH1_LOOP_BODY(y[i], f(ai), ai, naflag);
}

Fig. 3. Vectorization loop for function of one argument with Open MP paralleliza-
tion directive.

than one argument are somewhat more complicated because of conventions
for recycling shorter arguments. A compiler that does not support Open MP
will ignore the omp directive and compile this as a standard sequential loop.
If the compiler supports Open MP and is asked to use it, then this will be
compiled to use the number of threads specified by the variable P.
Use of Open MP eliminates the need to manually manage threads, but
some effort is still needed. Only loops with simple control structure can be
parallelized by Open MP, which requires rewriting some of the loops used in
the standard R code. Also, it is essential that the functions being called are
safe to call from multiple threads. For this to be true these functions cannot
use read/write global variables, call R’s memory manager, signal warnings
or errors, or check for user interrupts. Even creating internationalized error
messages can be problematic as the subroutines that do this are not guar-
anteed to be thread-safe. Almost all functions in the basic R math library
are either thread-safe or easily modified to be thread-safe. Exceptions are the
Bessel functions and the Wilcoxon and signed rank functions.
A preliminary implementation of the approach outlined here is available
as a package pnmath. Loading this package replaces the standard vectorized
functions in R by parallelized ones. For Linux and Mac OS X predetermined
intercept calibrations are used; for other platforms a calibration test is run at
package load time. The package requires a version of gcc that supports Open
MP and allows dlopen to be used with the support library libgomp.
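A minimal usage sketch, assuming the package builds on the platform at hand:
loading pnmath is the only step required, and subsequent calls to the vectorized
math functions use multiple threads whenever the vector is long enough to pass
the calibrated cutoff.

library(pnmath)        # replaces qnorm, pgamma, ... by parallel versions
x <- runif(1e6)
system.time(qnorm(x))  # long vector, so evaluated with multiple threads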

3 Explicit parallel computing in R

Several packages are available to support explicit parallel computing in R,
including Rmpi, rpvm, nws, and snow. These packages are based on the idea
of coordinating several separate R processes running either on a single multi-
core machine or on several machines connected by a network. The packages
Rmpi and rpvm provide access from R to most features of the MPI (Pacheco
1997) and PVM (Geist et al. 1994) message passing systems and as such are
powerful frameworks for parallel computing. However they are not easy to
use, and many parallel computations can be handled using a simpler frame-
work. The goal of the snow package is to provide such a simple framework
that is easy to use in an interactive context and is capable of expressing a
number of interesting parallel computations. A particular goal is to provide a
framework in which user code cannot create a deadlock situation, a common
error in parallel code written with many very general parallel frameworks.
A more extensive review of the snow framework is given in Rossini, Tierney,
and Li (2007). The snow package is designed to operate on top of a more ba-
sic communication mechanism; currently supported mechanisms are sockets,
PVM, and MPI.
Figure 4 shows a simple snow session. The call to the function makeCluster
creates a cluster of two worker R processes. clusterCall calls a specified
function with zero or more specified arguments in each worker process and

> cl <- makeCluster(2)
> clusterCall(cl, function() Sys.info()["nodename"])
[[1]]
[1] "node02"
[[2]]
[1] "node03"
> clusterApply(cl, 1:2, function(x) x + 1)
[[1]]
[1] 2
[[2]]
[1] 3
> stopCluster(cl)

Fig. 4. A minimal snow session.

returns a list of the results. clusterApply is a version of lapply that ap-
plies the specified function to each element of the list or vector argument,
one element per worker process, and returns a list of the results. Finally,
stopCluster shuts down the worker processes.
clusterCall and clusterApply are the two basic functions from which
other functions are constructed. Higher level functions include parLapply,
parApply, and parMap. These are parallel versions of the standard functions
lapply, apply, and Map, respectively. A simple rule is used to partition input
into roughly equal sized batches, with the number of batches equal to the
number of worker processes. The process of converting a sequential R program
to a parallel one using snow usually involves identifying a loop that can benefit
from parallelization, rewriting the loop in terms of a function such as lapply,
and, once the rewrite has been debugged, replacing the lapply call by a call
to parLapply.
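As an illustration of that workflow, consider a loop over independent
replications; the simulation function run_sim used here is a hypothetical
stand-in, and cl is a cluster created as in Figure 4.

run_sim <- function(i) mean(rnorm(1e6))   # stand-in for one replication
## 1. Original loop:
res <- numeric(100)
for (i in 1:100) res[i] <- run_sim(i)
## 2. Rewrite in terms of lapply and debug sequentially:
res <- lapply(1:100, run_sim)
## 3. Replace lapply by parLapply once the sequential version works:
res <- parLapply(cl, 1:100, run_sim)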
An important issue that needs to be addressed in parallel statistical com-
puting is pseudo-random number generation. If one uses standard R genera-
tors then there is a very good chance, though no guarantee, that all R worker
processes will see identical random number streams. If identical streams are
desired, as they might be at times for blocking purposes, then this can be
assured by setting a common seed. If, as is more commonly the case, one
wants to treat the workers as producing independent streams then it is best
to use R’s ability to replace the basic random number generator along with
one of several packages that are designed for parallel random number gen-
eration. snow supports using two of these packages, with the default being
the rlecuyer interface to the random number streams library of L'Ecuyer
et al. (2002). The function clusterSetupRNG is used to set up independent
random number streams on all cluster processes.
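Both settings are easy to express with the functions introduced above; the
following sketch again assumes a cluster cl created as in Figure 4.

## Identical streams on every worker, e.g. for blocking purposes:
clusterCall(cl, set.seed, 2007)
## Independent streams via the default rlecuyer back end:
clusterSetupRNG(cl)
clusterCall(cl, runif, 3)   # workers now return different values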
Figure 5 shows a comparison of a sequential bootstrap calculation and
a parallel one using a cluster of 10 worker processes. The elapsed time of
the parallel version is approximately one tenth of the elapsed time for the

## sequential version:
> R <- 1000
> system.time(nuke.boot <-
+ boot(nuke.data, nuke.fun, R=R, m=1,
+ fit.pred=new.fit, x.pred=new.data))
user system elapsed
12.703 0.001 12.706
## Parallel version, using 10 processes:
> clusterEvalQ(cl,library(boot))
> clusterSetupRNG(cl)
> system.time(cl.nuke.boot <-
+ clusterCall(cl,boot,nuke.data, nuke.fun,
+ R=R/length(cl), m=1,
+ fit.pred=new.fit, x.pred=new.data))
user system elapsed
0.009 0.004 1.246

Fig. 5. Bootstrap example from the boot help page.

sequential version. The function clusterEvalQ is a utility function used to
evaluate an expression on all worker processes; in this case it is used to ensure
that the boot package is loaded on all worker processes.
Linear performance speedup as seen in this bootstrap example is not
always achievable. One issue is the cost of communication. Data is transferred
to and from the workers. If the amount of computation on each worker is not
large relative to the communication overhead then speedup will be less, and in
extreme cases parallel versions can run slower than single process sequential
versions. Another issue is that sometimes the time needed by each worker to
perform its task may vary from worker to worker, either because of variations
in tasks themselves or because of differing load conditions on the machines
involved. This can be addressed by load balancing. Currently snow provides
one function for doing this, clusterApplyLB, a load balancing version of
clusterApply. For a cluster of P processes and a vector of n > P elements
this function assigns the first P jobs to the P processes, and then assigns job
P + 1 to the first process to complete its work, job P + 2 to the second process
to complete its work, and so on. As a result, the particular worker process
that handles a given task is non-deterministic. This can create complications
with simulations if random number streams are assigned to processes but can
be very useful for non-stochastic applications.
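A small sketch of the load-balanced variant, using artificial tasks that simply
sleep for random amounts of time:

slow_task <- function(i) { Sys.sleep(runif(1, 0, 2)); i^2 }
## Jobs beyond the first P are handed to whichever worker finishes first,
## so the worker that handles a given task is not deterministic.
res <- clusterApplyLB(cl, 1:20, slow_task)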

4 Discussion and future work

Work on implicit parallelization within R is still in its early stages. The par-
allel vectorized math library package described in Section 2 above is a first
step. Over the next few months this work will be folded into the base R
distribution, and we will explore other possibilities of using implicit parallelization
implemented via Open MP. Some reduction operations, such as row or col-
umn sum calculations, may also be amenable to this approach. One aspect
that will also be explored is whether the parallelization framework developed
within the R internals, such as the loop shown in Figure 3, can be made
available to package writers so package writers can easily define their own
parallel vectorized functions without reimplementing what has already been
done for the built-in functions.
Implicit parallelization is easiest from the user point of view as it requires
no special user action. However, in an interpreted framework such as R im-
plicit parallelization is only easily applied at a very fine level of granularity
of individual vector or linear algebra operations. This means that speed im-
provements are only achievable for large data sets. It is hoped that work
currently underway on developing a byte code compiler for R may allow par-
allelization to be moved to a somewhat higher level of granularity by fusing
together several vector operations. This should significantly reduce the syn-
chronization overhead and allow parallel computation for much smaller data
sets. Compilation may also help in automatically parallelizing certain simple
uses of the apply family of functions.
Explicit parallel computing can more easily achieve substantial speedups
both because it is possible to work at higher levels of granularity and because
it is possible to bring to bear larger numbers of processors (though the num-
ber of processors available in multi-core computers is likely to increase in the
next few years; see for example Asanovic et al. 2006). More work is needed
on the interface provided by the snow framework. One area under consider-
ation is a redesign of the approach to load balancing to make it possible for
all parallel functions to optionally use load balancing. Another area is the
development of tools for measuring and displaying the parallel computation,
and the communication overhead in particular. Tools for doing this within
PVM and certain MPI frameworks are available, but it should be possible
to build on R’s own facilities and develop some useful tools that work more
easily and on all communications back ends.
The current snow framework is already quite effective for implementing a
range of parallel algorithms. It can easily handle any computation expressible
as a sequence of scatter-compute-gather steps. A useful addition would be to
allow some intermediate results to remain on the worker processes between
such scatter-compute-gather steps, while ensuring that these results are cleaned
up after a complete computation. Also useful would be the ability to request
limited transfer of data between nodes. In Markov random field simulations
for example, one might divide the field among workers and need to exchange
boundary information in between iterations. Both of these ideas fit well within
a formalism known as bulk synchronous parallel computing (BSP; Bisseling
2004). Like snow, the BSP model is designed so that code using the model cannot
create deadlock situations and is thus a good fit for generalizing the snow
model. Extensions to snow to support the BSP model are currently being
explored.
More extensive rewriting of the R implementation might enable the inte-
gration of more advanced parallel libraries, such as ScaLAPACK (Blackford
et al. 1997), and more advanced parallel programming approaches. This is
the subject of future research.

5 Acknowledgements
This work was supported in part by National Science Foundation grant DMS
06-04593. Some of the computations for this paper were performed on equip-
ment funded by National Science Foundation grant DMS 06-18883.

References
ASANOVIC, K., BODIK, R., CATANZARO, B.C., GEBIS, J.J., HUSBANDS,
P., KEUTZER, K., PATTERSON, D.A., PLISHKER, L.W., SHALF, J.,
WILLIAMS, S.W., YELICK, K.A. (2006): The landscape of parallel computing
research: a view from Berkeley, EECS Department, University of California,
Berkeley, Technical Report No. UCB/EECS-2006-183.
BISSELING, R.H. (2004): Parallel Scientific Computation: A Structured Approach
using BSP and MPI, Oxford University Press, Oxford.
BLACKFORD, L.S., CHOI, J., CLEARY, A., D’AZEVEDO, E., DEMMEL, J.,
DHILLON, I., DONGARRA, J., HAMMARLING, S., HENRY, G., PETITET,
A., STANLEY, K., WALKER, D., WHALEY, R.C. (1997): ScaLAPACK
Users’ Guide. Society for Industrial and Applied Mathematics, Philadelphia.
CHANDRA, R., MENON, R., DAGUM, L., KOHR, D. (2000): Parallel Program-
ming in OpenMP. Morgan Kaufmann, San Francisco.
GEIST, A., BEGUELIN, A., DONGARRA, J., JIANG, W. (1994): PVM: Parallel
Virtual Machine, MIT Press, Cambridge.
L’ECUYER, P., SIMARD, R., CHEN, E.J., KELTON, W.D. (2002): An objected-
oriented random-number package with many long streams and substreams,
Operations Research, 50 (6), 1073–1075.
PACHECO, P. (1997): Parallel Programming with MPI, Morgan Kaufmann, San
Francisco.
ROSSINI, A.J., TIERNEY, L., LI, N. (2007): Simple Parallel Statistical Computing
in R, Journal of Computational and Graphical Statistics, 16 (1), 399–420.
WHALEY, R.C., PETITET, A. (2005): Minimizing development and maintenance
costs in supporting persistently optimized BLAS, Software: Practice and Ex-
perience, 35 (2), 101–121.
