ARTICLE IN PRESS

Computers & Geosciences 30 (2004) 683–691

Multivariable geostatistics in S: the gstat package


Edzer J. Pebesma*
Department of Physical Geography, Utrecht University, P.O. Box 80.115, 3508 TC Utrecht, Netherlands

Received 2 July 2003; received in revised form 17 March 2004; accepted 18 March 2004

Abstract

This paper discusses advantages and shortcomings of the S environment for multivariable geostatistics, in particular when extended with the gstat package, an extension package for the S environments (R, S-Plus). The gstat S package provides multivariable geostatistical modelling, prediction and simulation, as well as several visualisation functions. In particular, it makes the calculation, simultaneous fitting, and visualisation of a large number of direct and cross (residual) variograms very easy. Gstat was started 10 years ago and was released under the GPL in 1996; gstat.org was started in 1998. Gstat was not initially written for teaching purposes, but for research purposes, emphasising flexibility, scalability and portability. It can deal with a large number of practical issues in geostatistics, including change of support (block kriging), simple/ordinary/universal (co)kriging, fast local neighbourhood selection, flexible trend modelling, variables with different sampling configurations, efficient simulation of large spatially correlated random fields, indicator kriging and simulation, and (directional) variogram and cross variogram modelling. The formula/models interface of the S language is used to define multivariable geostatistical models. This paper introduces the gstat S package, and discusses a number of design and implementation issues. It also draws attention to a number of papers on integration of spatial statistics software, GIS and the S environment that were presented at the spatial statistics workshop and sessions during the conference Distributed Statistical Computing 2003.
© 2004 Elsevier Ltd. All rights reserved.

Keywords: Kriging; Cokriging; Linear model of coregionalisation; Open source software; S language; Stochastic simulation

Code available from server at http://www.iamg.org/CGEditor/index.htm.
*Tel.: +31-30-2533051; fax: +31-30-2531145.
E-mail address: e.pebesma@geog.uu.nl (E.J. Pebesma).

0098-3004/$ - see front matter © 2004 Elsevier Ltd. All rights reserved.
doi:10.1016/j.cageo.2004.03.012

1. Introduction

S is a high-level language for data analysis and graphics. Currently, it has one commercial implementation, S-Plus (S-Plus home page: http://www.insightful.com/; Becker et al., 1988; Chambers, 1998), and an open-source implementation, called R (Ihaka and Gentleman, 1996; Bivand, 2000; R home page: http://www.r-project.org/; Comprehensive R archive network: http://cran.r-project.org/ and mirrors). Geostatistics (Isaaks and Srivastava, 1989) is not a new subject to the S community, and several S packages or libraries are available. Some of these were developed for teaching purposes, and some have very advanced functionality. Still, all of the currently available S packages lack features that are commonly used in applied geostatistics, notably block kriging, kriging in a local neighbourhood, multivariable variogram modelling, cokriging and cosimulation. This paper introduces the gstat S package, which fills this gap.

Gstat (Pebesma and Wesseling, 1998; gstat home page: http://www.gstat.org/) used to be a stand-alone computer program that provides all these features, but with no graphics capabilities of its own: it has an interactive user interface for variogram modelling, but
uses the gnuplot graphics program for visualising variograms. The gstat stand-alone program works well with several GIS systems, as it can read and write point and/or grid map data to and from more than 20 GIS formats. Graphical user interfaces that use gstat as a back-end have been developed within PCRaster, Idrisi32 and ArcGIS environments.

The S (R/S-Plus) environment has much to offer for multivariable geostatistical analysis. The Trellis/Lattice graphics functions allow visualising high-dimensional data by creating structured, composite graphs. The gstat S package now offers the major geostatistical functionality of the gstat stand-alone program to S users, provides new functions for fast modelling of arbitrarily many cross and direct variograms, and provides a number of useful functions for plotting spatial point data, multiple grid maps, and multivariable or directional variograms. In the following, gstat will refer to the gstat S package.

2. DSC2003 and spatial statistics in S

During the conference Distributed Statistical Computing 2003 (DSC2003) held in Vienna on March 19–22, 2003, a 1-day workshop and three paper sessions were devoted to spatial statistics, and the handling of spatial data in S environments, R in particular. The overview given by Bivand (2003) shows that at least six other R packages deal with geostatistics; three packages deal with point pattern analysis; one package deals with lattice (polygon) data and 10 packages with interfacing R to GIS formats (e.g. Bivand, 2000), of which one uses the generic spatial data abstraction layer GDAL (Gdal home page: http://www.remotesensing.org/gdal/). All of these packages share a need for S data structures that are aware of their spatial topology. An initiative for a public mailing list and a CVS repository aimed at dealing with spatial data and spatial statistics in S was started as a result of this workshop.

3. Multivariable geostatistics

Multivariable geostatistics involves the simultaneous prediction (or simulation) of multiple variables based on single or multiple predictors, as well as the modelling of all necessary direct and cross variograms. This section is meant to introduce notation for the multivariable geostatistical model as implemented in gstat, as briefly as possible, but necessary for the explanation of the functionality of the gstat package for S. Further theory is also found in various papers and text books, e.g. Cressie (1993), Ver Hoef and Cressie (1993) and Wackernagel (1998).

3.1. Univariable prediction

Let $Z(s)$ be a vector of length $n$ with observations $Z(s_1),\ldots,Z(s_n)$ observed at spatial locations $s_i$ arbitrarily spread in $\mathbb{R}^1$, $\mathbb{R}^2$ or $\mathbb{R}^3$. The variability in observations $Z(s)$ is usually thought of as consisting of a trend and a residual, and the trend is modelled as a linear function

\[ Z(s) = \sum_{j=0}^{p} X_j(s)\beta_j + e(s) = X\beta + e(s) \tag{1} \]

with $X_j(s)$, $j>0$, the $p$ explanatory or predictor variables, with $\beta_0$ usually being an intercept and $X_0(s) \equiv 1$, with $\beta$ the vector with unknown regression coefficients, and with $e(s)$ the residual vector. For spatial data, residuals are usually spatially correlated, and given the covariance matrix $V$ of $e(s)$, best linear unbiased prediction (kriging) of $Z(s_0)$ at an unobserved location $s_0$ is obtained by

\[ \hat{Z}(s_0) = x(s_0)\hat{\beta} + v' V^{-1} \left( Z(s) - X\hat{\beta} \right) \tag{2} \]

with $x(s_0)$ the row of $X$ that would have corresponded to $Z(s_0)$, with $\hat{\beta} = (X' V^{-1} X)^{-1} X' V^{-1} Z(s)$ the generalised least-squares estimate of the trend coefficients where $X'$ denotes the transpose of $X$, and with $v = (\mathrm{Cov}(Z(s_0), Z(s_1)), \ldots, \mathrm{Cov}(Z(s_0), Z(s_n)))'$ where $\mathrm{Cov}(\cdot,\cdot)$ denotes covariance.

The corresponding prediction error variance is

\[ \sigma^2(s_0) = \sigma_0^2 - v' V^{-1} v + \left( x(s_0) - v' V^{-1} X \right) (X' V^{-1} X)^{-1} \left( x(s_0) - v' V^{-1} X \right)' \tag{3} \]

where $\sigma_0^2$ is $\mathrm{Var}(Z(s_0))$.

3.2. Multivariable prediction

Multivariable prediction involves the joint prediction of multiple, both spatially and cross-variable correlated variables. Consider $m$ distinct variables, and let $\{Z_i(s), X_i, \beta_i, e_i(s), x_i(s_0), v_i, V_i\}$ correspond to $\{Z(s), X, \beta, e(s), x(s_0), v, V\}$ of the $i$th variable. Next, let $Z(s) = (Z_1(s)', \ldots, Z_m(s)')'$, $B = (\beta_1', \ldots, \beta_m')'$, $e(s) = (e_1(s)', \ldots, e_m(s)')'$,

\[ X = \begin{bmatrix} X_1 & 0 & \cdots & 0 \\ 0 & X_2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & X_m \end{bmatrix}, \qquad x(s_0) = \begin{bmatrix} x_1(s_0) & 0 & \cdots & 0 \\ 0 & x_2(s_0) & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & x_m(s_0) \end{bmatrix} \]
with 0 conforming zero matrices, and

\[ v = \begin{bmatrix} v_{1,1} & v_{1,2} & \cdots & v_{1,m} \\ v_{2,1} & v_{2,2} & \cdots & v_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ v_{m,1} & v_{m,2} & \cdots & v_{m,m} \end{bmatrix}, \qquad V = \begin{bmatrix} V_{1,1} & V_{1,2} & \cdots & V_{1,m} \\ V_{2,1} & V_{2,2} & \cdots & V_{2,m} \\ \vdots & \vdots & \ddots & \vdots \\ V_{m,1} & V_{m,2} & \cdots & V_{m,m} \end{bmatrix}, \]

where element $i$ of $v_{k,l}$ is $\mathrm{Cov}(Z_k(s_i), Z_l(s_0))$, and where element $(i,j)$ of $V_{k,l}$ is $\mathrm{Cov}(Z_k(s_i), Z_l(s_j))$.

The multivariable prediction equations equal Eqs. (2) and (3) when all matrices are substituted by their multivariable forms (see also Ver Hoef and Cressie, 1993), and when in (3) $\sigma_0^2$ is substituted by $\Sigma$ with $\mathrm{Cov}(Z_i(s_0), Z_j(s_0))$ in its $(i,j)$th element. Note that (3) is now a prediction error covariance matrix.

The implementation of this model in gstat does not pose restrictions to the number of variables $m$, and each variable can have its own set of predictor variables, number of observations, and unique observation locations. Covariances are specified by way of variogram functions and cross variogram functions.

3.3. Extensions

Gstat provides a number of highly useful extensions to the straightforward application of Eqs. (2) and (3):

Kriging in a local neighbourhood: Instead of using all data, only data in a local neighbourhood around $s_0$ are used for predicting $Z(s_0)$, where neighbourhood can be defined for each variable in terms of distance to $s_0$ or in terms of the number of nearest observations. There are at least two good reasons for restricting kriging to a local neighbourhood. First, the system $V^{-1}X$ becomes prohibitively large when data are abundant ($n \gg 10^3$) or when sequential simulation is used to simulate large fields. Second, the assumption of spatially constant trend coefficients in Eq. (1) may need to be relaxed to apply only to local neighbourhoods. Gstat takes care of cases where one or more of the variables are missing in a local neighbourhood, defined by a distance criterion. An efficient, scalable quadtree-based neighbourhood algorithm (Hjaltason and Samet, 1995; Quadtree demos: http://www.cs.umd.edu/~brabec/quadtree/index.html) is used to select data in a local neighbourhood.

Block kriging or simulation: Instead of predicting $Z(s_0)$ (point kriging), block kriging (Journel and Huijbregts, 1978) aims at predicting the average of $Z(\cdot)$ over a larger support (area or volume) $B_0$: $Z(B_0) = |B_0|^{-1} \int_{B_0} Z(s)\,ds$, with $|B_0|$ the area (or volume) of $B_0$. Blocks $B_0$ may be rectangular or irregular (specified by a number of points discretising $B_0$). Although the interest was originally limited to mining applications, block kriging is now widely used in environmental applications when spatially aggregated predictions for larger areas are required, or when point support predictions are too inaccurate.

Simple and ordinary kriging: In certain cases, the trend coefficients can be assumed known, e.g. when another mechanism, such as an external deterministic model, takes care of estimating them. In this case, called simple kriging, $\beta$ is substituted for $\hat{\beta}$ in Eq. (2), and the third term on the right-hand side of Eq. (3) disappears. Another simplified version of universal kriging is ordinary kriging, which contains only an intercept ($p=0$).

Shared trend coefficients and colocated cokriging: When two variables measure the same phenomenon with different devices, they will show different variability, but share a common mean value. In this case, they should be treated as two variables having common mean (or trend) coefficient(s). Gstat allows the sharing of any two (or more) coefficients across pairs of variables. The simplest case of this corresponds to standardised ordinary cokriging with one single unbiasedness constraint, or colocated ordinary cokriging (Goovaerts, 1997; note that this is different from Wackernagel's (1998) interpretation of colocated ordinary cokriging). Simple colocated cokriging is a special case of simple cokriging with a neighbourhood size of one for secondary variables.

Generalised linear models: Regression models for count data or for presence/absence (1/0) data are usually dealt with by generalised linear models. Gotway and Stroup (1997) extended these models to the case where residuals are spatially correlated, in which case residuals have mean-related non-stationary covariances. Prediction of residuals for several variance functions is implemented in gstat.

Debugging results: Near-singularities may occur for a number of reasons, such as near-zero distances between data points, or linear dependencies among columns of a (locally formed) matrix $X$. Gstat has many debug modes for obtaining information on all aspects of the systems, and can verify that estimated condition numbers of $V$ and $X' V^{-1} X$ stay below a threshold.

3.4. Sequential simulation

Sequential simulation (Johnson, 1987; Gomez-Hernandez and Journel, 1993) involves the generation of many independent realisations of a Gaussian (or in case of indicator simulation, binary) random field, conditional to observed data, that honour the variogram (covariance) of the random field. Gstat uses the sequential simulation algorithm because it is versatile, efficient, and suitable for large to very large fields (number of nodes $\gg 10^6$).
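
The idea of sequential simulation can be sketched in a few lines of base R. The sketch below is only an illustration under stated assumptions, not the gstat implementation: it works in one dimension, assumes a known zero mean (the simple kriging case), assumes an exponential covariance, and uses a fresh random path for every realisation. The names cov_exp and seq_gauss_sim and all parameter values are hypothetical.

```r
# Minimal sketch of sequential Gaussian simulation in one dimension.
# Assumptions (not from the paper): known zero mean (simple kriging),
# exponential covariance, and all helper names used below.
cov_exp <- function(h, sill = 1, range = 10) sill * exp(-h / range)

seq_gauss_sim <- function(x_obs, z_obs, x_new, nmax = 8, sill = 1, range = 10) {
  x_known <- x_obs                 # conditioning locations, grows during the run
  z_known <- z_obs                 # conditioning values, grows during the run
  z_sim <- numeric(length(x_new))
  path <- sample(length(x_new))    # random visiting order of the nodes
  for (i in path) {
    d <- abs(x_known - x_new[i])
    nb <- order(d)[seq_len(min(nmax, length(d)))]    # nmax nearest data
    V <- cov_exp(abs(outer(x_known[nb], x_known[nb], "-")), sill, range)
    v <- cov_exp(d[nb], sill, range)
    w <- solve(V, v)                                 # simple kriging weights
    m <- sum(w * z_known[nb])                        # conditional mean
    s2 <- max(sill - sum(w * v), 0)                  # conditional variance
    z_sim[i] <- rnorm(1, mean = m, sd = sqrt(s2))    # draw from the conditional
    x_known <- c(x_known, x_new[i])                  # simulated value becomes data
    z_known <- c(z_known, z_sim[i])
  }
  z_sim
}

set.seed(1)
sims <- replicate(3, seq_gauss_sim(c(0, 50, 100), c(1, -1, 0.5),
                                   seq(5, 95, by = 10)))
dim(sims)   # one column per realisation
```

Each node is drawn from its conditional distribution given the observations and all previously simulated nodes, which is why the neighbourhood size (nmax here) governs the cost of every step.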
Traditionally, simulation algorithms only involved the simulation of the residual part of Eq. (1), although some attempts to stretch this have been reported (Goovaerts, 1997). This can be seen as the simulation equivalent of simple kriging. Gstat implements a wider class that allows accounting for statistical uncertainty on trend coefficients, using the algorithm reported (although somewhat hidden) by Abrahamsen and Espen Benth (2001). For each realisation, it involves the simulation of trend coefficients, followed by simulating residuals with respect to the trend coefficients drawn. It is the simulation equivalent of universal kriging. For the simulation of trend coefficients, the multivariate normal distribution with mean $\hat{\beta}$ and covariance $(X' V^{-1} X)^{-1}$ is used.

3.5. Variogram modelling

All methods mentioned above assume that the residual covariance is known. A common convention is to enter the covariance by way of the variogram. Gstat calculates direct sample variograms, cross variograms (classical cross variograms for variables that have identical locations, pseudo-cross variograms (Ver Hoef and Cressie, 1993) when locations do not coincide), and can fit nested variogram models to sample variograms. In fitting direct and cross variogram models, it can also guarantee that the fitted model obeys the linear model of coregionalisation (Goovaerts, 1997), ensuring that cross covariance matrices are always positive definite. Furthermore, gstat can calculate and visualise directional variograms and variogram clouds, and provides identification through interactive examination (for example of extreme points) in the variogram cloud.

Variogram models may consist of simple models such as the Nugget, Exponential, Spherical, Gaussian, Linear, Power model, or the nested sum of one or more basic models. Each simple model can have its own 2D or 3D geometric or zonal anisotropy parameters defined. The gstat R package also includes the Matern class (strongly recommended by Stein, 1999), but does not automatically fit its smoothness parameter.

4. Implementation

4.1. The S environment

S is a functional language: functions are called with data and specifications as the function arguments. The gstat S package provides a set of functions, most of which are listed in Table 1. These functions consist of about 1000 lines of S code, and for a part they hide calls to the underlying 40,000 lines of C code in the gstat

Table 1
User functions in package gstat

Function               Description

gstat                  Add variable definition to gstat object

Variogram modelling
variogram              Calculate sample variogram, directional sample variograms, or direct and cross variograms
fit.variogram          Fit variogram model coefficients to sample variogram
fit.lmc                Fit a linear model of coregionalisation to direct and cross variograms
variogram.line         Calculate variogram values from a variogram model

Prediction/simulation
predict.gstat          Spatial prediction or simulation, see also Fig. 3
krige                  Univariable wrapper around gstat and predict.gstat
krige.cv               Leave-one-out or n-fold cross-validation wrapper for krige
zerodist               Detect observation pairs with identical locations

Plotting
bubble                 Bubble scatter plot for data or residuals (using colour for sign, size for value)
plot.variogram         Plot sample variogram (optionally with number of point pairs) and fitted model; uses conditioning plots for directional or multivariable variograms (Fig. 2)
plot.variogram.cloud   Plot variogram cloud, with options for interactive point pairs identification
plot.point.pairs       Plot point pairs, identified by plot.variogram.cloud, in a map
image.data.frame       Draw image for (x,y,z) values, stored in columns of a data frame
map.to.lev             Stack data in the form (x,y,z1,z2,…,zn) to a form suitable for plotting with levelplot
mapasp                 Calculate aspect ratio for geographically correct levelplot
package. Data in S are typically stored in data frames, tables which contain in each column one variable, of categorical (factor) or numerical mode.

4.2. Examples

In the following, examples are given that use the data set of heavy metal pollution in the topsoil of a floodplain along the Meuse river near Stein, Netherlands (Burrough and McDonnell, 1998). This data set is supplied with gstat.

4.3. Formula interface

The gstat S package uses the S formula interface (Chambers and Hastie, 1992), which is also found in the regression and ANOVA functions (lm, aov), generalised linear models (glm), and many other regression modelling or prediction methods. The first function argument is a formula, like y~x1+x2, to express that variable y depends on x1 and x2, and possibly in a later argument the data frame that contains y, x1 and x2 as columns. Formulas may contain mathematical functions of variables (e.g., sqrt(y) instead of y), complex relationships (like interactions, x1:x2, or nested effects), and predictor variables may be factor (nominal) variables, in which case they are automatically converted into the necessary set of dummy (0–1) regressor variables.

Gstat uses one formula to define how the response depends on the predictor variables, and a second formula to define the spatial coordinates. Suppose we model zinc concentrations $z(s)$ as a linear regression function of distance to the river $D$, $\log(z(s)) = \beta_0 + \beta_1 D(s) + e(s)$, we can calculate the residual variogram of log(zinc) as a function of dist with spatial coordinates in x and y, found in data frame meuse by (> is the S prompt):

> zn.vgm <- variogram(log(zinc)~dist, ~x+y, meuse)

which saves the results in zn.vgm, to be shown, plotted or fitted:

> zn.mod <- fit.variogram(zn.vgm, model = vgm(1, "Exp", 300, 1))
> plot(zn.vgm, model = zn.mod)

fits an exponential variogram model and plots sample variogram and fitted model (Fig. 1a). By default, ordinary least-squares residuals are used, but generalised least-squares residuals given a variogram model are optional. Note that plot is a generic function: as its first argument is of class variogram, in reality the function plot.variogram of the gstat package is called; this function adds a number of options useful for plotting variograms.

Univariate universal kriging on locations defined in meuse.grid, using a fitted (residual) variogram model zn.mod, is obtained by

> zn.krg <- krige(log(zinc)~dist, ~x+y, meuse, meuse.grid, zn.mod)
> levelplot(var1.pred~x+y, data = zn.krg, asp = mapasp(zn.krg))

Fig. 1. Sample variogram and fitted model for log(zinc) residuals (a; semivariance against distance); universal kriging predictions for log(zinc) (b; map with coordinates x and y).
for which the plot is shown in Fig. 1b. Alternatively, 50 conditional simulations are obtained by

> krige(log(zinc)~dist, ~x+y, meuse, meuse.grid, zn.mod, nmax = 20, nsim = 50)

where nmax refers to the neighbourhood size, limited for fast sequential simulation.

For multivariable prediction or simulation, we need to specify for each variable at least two formulas and a data frame. All this information is stored in an object of class gstat, which is built one variable at a time, by a function (surprisingly) called gstat:

> meuse.g <- gstat(id = "log-zn", formula = log(zinc)~1, locations = ~x+y, data = meuse)
> meuse.g <- gstat(object = meuse.g, id = "log-cu", formula = log(copper)~1, locations = ~x+y, data = meuse)
…

that can accumulate an arbitrary number of variables. Suppose meuse.g is filled with the four heavy metal variables measured in the meuse data set, then the five commands

> meuse.g <- gstat(meuse.g, model = vgm(1, "Sph", 900, nugget = 1), fill.all = T)
> x <- variogram(meuse.g, cutoff = 1000)
> meuse.fit <- fit.lmc(x, meuse.g)
> plot(x, model = meuse.fit)
> meuse.cok <- predict(meuse.fit, newdata = meuse.grid)

(i) fill all variogram models with the same initial (Nugget+Spherical) variogram model, (ii) calculate sample variograms and cross variograms, (iii) fit a linear model of coregionalisation to direct and cross variograms, (iv) plot the variograms and fitted models (Fig. 2), and (v) store four-variate cokriging predictions and prediction error (co)variances in meuse.cok.

The prediction function, predict.gstat, is the prediction and simulation engine of gstat. Depending on the data it is fed with, it decides what to do; Fig. 3 shows the decision tree for this. The list of user functions in package gstat is shown in Table 1.

Fig. 2. Direct sample variograms (diagonal: zn, cu, cd, pb), cross variograms (off-diagonal: zn.cu, zn.cd, cu.cd, zn.pb, cu.pb, cd.pb) and fitted linear model of coregionalisation for four heavy metal variables in meuse data set; each panel shows semivariance against distance.
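
What a fitted linear model of coregionalisation has to satisfy can be illustrated with a small base R check. In such a model every basic structure (for example nugget and spherical) has its own matrix of partial sills, with direct variogram sills on the diagonal and cross variogram sills off the diagonal, and each of these matrices has to be positive semi-definite. The sketch below uses invented numbers for two variables and a hypothetical helper is_psd; it is not how fit.lmc works internally.

```r
# Sketch: check whether partial sills form a valid linear model of
# coregionalisation. All numbers are made up for illustration.
is_psd <- function(B, tol = 1e-8) {
  ev <- eigen(B, symmetric = TRUE, only.values = TRUE)$values
  all(ev >= -tol)                  # no eigenvalue clearly below zero
}

vars <- c("zn", "cu")
nugget_sills <- matrix(c(0.05, 0.02,
                         0.02, 0.04), 2, 2, dimnames = list(vars, vars))
sph_sills    <- matrix(c(0.60, 0.35,
                         0.35, 0.30), 2, 2, dimnames = list(vars, vars))
# cross sill too large relative to the direct sills: not admissible
bad_sills    <- matrix(c(0.60, 0.50,
                         0.50, 0.30), 2, 2, dimnames = list(vars, vars))

sapply(list(nugget = nugget_sills, spherical = sph_sills, invalid = bad_sills),
       is_psd)
```

For two variables the condition reduces to the squared cross sill not exceeding the product of the two direct sills, which is why the third matrix fails.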
Variograms?
  No:  Trend functions given?
         No:  Inverse distance weighted interpolation
         Yes: (local) trend surface prediction
  Yes: Simulations?
         No:  Trend coefficients given?
                Yes: Simple (co)kriging
                No:  Trend has only intercept?
                       Yes: Ordinary (co)kriging
                       No:  Universal (co)kriging, or BLUE
         Yes: Trend coefficients given?
                Yes: "simple"
                No:  "universal"
              indicators?
                No:  Sequential Gaussian (co)simulation
                Yes: Sequential Indicator (co)simulation

Fig. 3. Decision tree for predict.gstat (or krige); each of the prediction/simulation methods may apply to points, rectangular blocks, or irregular blocks, and may use all data or a selection of local data in a local neighbourhood around each prediction location.
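
The decision tree of Fig. 3 can be written down as a small function, which may help when reasoning about which method a given model specification leads to. This is a sketch of one reading of the figure: the function choose_method and its argument names are hypothetical and are not arguments of predict.gstat, and attaching the "simple"/"universal" label to both simulation branches is an assumption.

```r
# Sketch of the Fig. 3 decision tree as a plain function. Argument names are
# illustrative only; they are not arguments of predict.gstat or krige.
choose_method <- function(variograms, trend_functions = FALSE,
                          simulations = FALSE, beta_given = FALSE,
                          intercept_only = TRUE, indicators = FALSE) {
  if (!variograms) {
    # without a variogram model no kriging is possible
    if (trend_functions) return("(local) trend surface prediction")
    return("inverse distance weighted interpolation")
  }
  if (simulations) {
    flavour <- if (beta_given) "simple" else "universal"
    kind <- if (indicators) "indicator" else "Gaussian"
    return(paste0("sequential ", kind, " (co)simulation (", flavour, ")"))
  }
  if (beta_given) return("simple (co)kriging")
  if (intercept_only) return("ordinary (co)kriging")
  "universal (co)kriging, or BLUE"
}

choose_method(variograms = TRUE, intercept_only = FALSE)
choose_method(variograms = FALSE)
```

In the package itself this choice is implicit: it follows from whether a variogram model, trend coefficients or a number of simulations were supplied, not from an explicit method argument.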

The location argument is necessary because S data frames do not register which columns contain spatial coordinates. As this is not likely to ever change, a more elegant solution would be to use a data class that is aware of its own spatial topology, in which case the location formula could be left out altogether.

4.4. C code

The gstat C code used for the gstat package consists of approximately 25,000 lines of native gstat code, and 14,000 lines of C code in the Meschach matrix library (Stewart and Leyk, 1994; Meschach home page: http://www.math.uiowa.edu/~dstewart/meschach/) used by gstat. Because originally gstat was written as a stand-alone program (Pebesma and Wesseling, 1998), a large part of the effort of writing the gstat S package was dedicated towards making the code suitable as a callable library. This involved removing many static variables, re-initialising the full state of the library after every call from S, and writing wrapper functions around all log, warning and error messages.

Two important optimised algorithms are implemented in the gstat C code. The first is a fast neighbourhood search algorithm, based on the PR-bucket quadtree search index structure (Hjaltason and Samet, 1995). The second is the realisation of many simulated random fields in a single call following a single random path through the simulation locations, re-using the expensive results, i.e. the neighbourhood selection and $V^{-1}X$.

All variogram models defined in the gstat package are in the gstat C code, which provides no easy way to use variogram functions defined in S. Adding a new variogram function to the gstat C code is straightforward, though.

5. Relation to other geostatistics packages

Ripley (2001) gives a short overview of available R packages for spatial statistics. Geostatistics packages on CRAN (R home page: http://www.r-project.org/; Comprehensive R archive network: http://cran.r-project.org/ and mirrors) include spatial, sgeostat, geoR/geoRglm (geoR home page: http://www.est.ufpr.br/geoR/), fields and RandomFields. Most of these packages provide variogram modelling, trend surface analysis and/or universal kriging. None of them provides kriging in a local neighbourhood, block kriging, cokriging, or three-dimensional kriging. S-Plus has a commercial module, S+SpatialStats, that provides block kriging. Large parts of the geoR/geoRglm code address the uncertainty of estimated covariance parameters in a Bayesian framework (also called model-based kriging; Diggle et al., 1998), an issue that seems to be relevant especially for smaller data sets (Moyeed and Papritz, 2002).

6. Code availability

For R, the gstat package can be installed from CRAN (R home page: http://www.r-project.org/; Comprehensive R archive network: http://cran.r-project.org/ and mirrors), which means that a single mouse click on Windows versions or a single command for Unix versions
is sufficient to install the package on computers with an internet connection. For S-Plus, the gstat library is available in binary form for Windows versions of S-Plus, and in source code form for Unix/Linux versions of S-Plus from the gstat home page (http://www.gstat.org/). Installation instructions are also found there.

7. Conclusions

The gstat package provides a robust and flexible suite of univariable or multivariable geostatistical methods. From the following five items:

* one-, two- or three-dimensional,
* point, regular block, or irregular block,
* univariable, multiple (uncorrelated), or multivariable (correlated) cokriging,
* (co)kriging, unconditional or conditional (co)simulation,
* using a global or a local neighbourhood,

any combination (e.g. three-dimensional universal irregular block cosimulation) can be obtained by the gstat package. Also, routines are available for very fast fitting of large numbers of direct and cross variograms. The objection to cokriging or cosimulation that the modelling of a large number of (cross) variograms is prohibitively tedious can now only be put in the past tense. The open-source gstat extension package makes the S environment (the R or S-Plus programs) a very powerful environment for (multivariable) geostatistics.

The package offers several methods for handling one or more exhaustive grids of secondary information for prediction or simulation of a primary variable:

* secondary variables can be treated as explanatory or predictor variables, leading to linear regression or universal kriging prediction (sometimes referred to as external drift kriging);
* secondary variables can be treated as (realisations of) random fields, leading to a cokriging formulation;
* colocated ordinary or simple cokriging can be used, limiting the availability of the secondary variable to that of the prediction location.

A discussion on structural differences between these approaches is found in Rivoirard (2002).

8. Discussion

8.1. S visualisation

One major reason why S is a suitable environment for doing multivariable geostatistics with gstat is its graphics capabilities. The gstat package gratefully uses the Trellis/Lattice functions to visualise its results, notably

* xyplot for visualising directional variograms and multivariable (direct and cross) variograms (e.g. Fig. 2), and to visualise spatial data and cross-validation residuals;
* levelplot for visualising (multiple) grid maps, using the aspect argument to make them geographically correct (1 km north equals 1 km east, a convention that even S+SpatialStats ignores);
* image for fast display of many grid maps; and
* plot and identify to identify extreme point pairs in a variogram cloud.

The graphics functions in Table 1 are no more than simple wrapper functions around the S graphics functions, but may be among the most critical ones to make a multivariable analysis successful.

8.2. Gstat stand-alone features missing in the S package

The major functionality of gstat is made available in the package, but a number of advanced features are missing. Most of them can be added easily once a common set of S data structures for spatial data (grids, lines) is defined. Gstat stand-alone features missing in the S package are:

Stratified mode: the gstat program has an efficient way of dealing with a stratification, where each stratum has its own data, variogram and prediction locations.

Variogram maps: two-dimensional variogram maps, calculated on a regular grid, are not yet implemented in the S package.

Efficient variogram calculation for gridded data: knowing the gridded topology of data, sample variograms can be calculated in $O(N)$, instead of $O(N^2)$.

Multi-step simulation (Gomez-Hernandez and Journel, 1993): the gstat code can use a recursively refining random visiting sequence (Pebesma and Wesseling, 1998) for sequential simulation, but needs to know the grid topology of prediction locations; currently a simple random path is chosen.

Edges: open or closed polygons can be defined to further constrain the search neighbourhood.

Quadrant/octant search neighbourhoods, variogram distance: these are other methods to refine search neighbourhoods based on direction or correlation.

Latin hypercube sampling of Gaussian random fields (Pebesma and Heuvelink, 1999) is an issue that should be easy to re-implement in S.

8.3. Handling spatial data in S

Prediction locations are often gridded, and observations sometimes are. As noted above, a number of efficiency gains can be obtained when the grid topology of data, if present, is available to gstat. Storing prediction results as grids (2D matrices) can be wasteful, because a large part of the area may be filled with NAs.
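
The storage trade-off between a table of predictions and a full grid can be made concrete with a toy example in base R. The helper table_to_grid, the coordinates and the cell size below are assumptions made for illustration; the sketch is not a function of the gstat package.

```r
# Sketch: long-table versus full-grid storage of predictions on a masked grid.
# The toy coordinates and the helper name table_to_grid are assumptions.
table_to_grid <- function(x, y, z, cellsize) {
  col <- round((x - min(x)) / cellsize) + 1      # column index from x
  row <- round((y - min(y)) / cellsize) + 1      # row index from y (row 1 = lowest y)
  grid <- matrix(NA_real_, nrow = max(row), ncol = max(col))
  grid[cbind(row, col)] <- z                     # a single pass over the table
  grid
}

# predictions exist only along a diagonal strip of a 40 m grid
pred <- data.frame(x = c(0, 40, 80, 120), y = c(0, 40, 80, 120),
                   var1.pred = c(5.1, 5.4, 5.9, 6.3))
g <- table_to_grid(pred$x, pred$y, pred$var1.pred, cellsize = 40)
mean(is.na(g))   # share of grid cells that hold no prediction
```

The table stores one row per prediction location, whereas the matrix reserves a cell for every node of the bounding rectangle, so irregularly shaped study areas inflate the grid with missing values.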
Currently, gstat resolves coordinates and explanatory variables at prediction locations using model.matrix, which requires both observation data and prediction locations to be in a data frame. Storing output of predict.gstat as grids might be beneficial when they are plotted with image, but not when plotted with levelplot. The conversion of table data to gridded data is close to $O(N)$ (see function xyz2img in package gstat).

Currently, an open-source effort (r-spatial project page: http://www.sourceforge.net/projects/r-spatial/) is being undertaken to provide spatial classes for R (and potentially S-Plus), for point, grid, and polygon data, and gstat supports this. It requires prior specification of which variables in a data frame refer to spatial coordinates, and removes the need to specify coordinates in subsequent gstat library function calls.

Acknowledgements

The development of the gstat S package was supported financially by the Dutch National Institute for Coastal and Marine Management (RIKZ); Richard Duin (RIKZ) played a very stimulating role in the work presented here. Roger Bivand not only cotriggered the writing of the package presented here, but also organised the workshop and three sessions on spatial statistics during the DSC2003 conference (DSC2003: http://www.ci.tuwien.ac.at/Conferences/DSC-2003/). One anonymous reviewer provided valuable comments.

References

Abrahamsen, P., Espen Benth, F., 2001. Kriging with inequality constraints. Mathematical Geology 33 (6), 719–744.
Becker, R.A., Chambers, J.M., Wilks, A.R., 1988. The New S Language. Chapman & Hall, London 702pp.
Bivand, R.S., 2000. Using the R statistical data analysis language on GRASS 5.0 GIS data base files. Computers & Geosciences 26, 1043–1052.
Bivand, R.S., 2003. Approaches to classes for spatial data in R. In: Hornik, K., Leisch, F. (Eds.), Proceedings of the Third International Workshop on Distributed Statistical Computing (DSC 2003), March 20–22, Vienna, Austria. ISSN 1609-395X; available from DSC2003: http://www.ci.tuwien.ac.at/Conferences/DSC-2003/.
Burrough, P.A., McDonnell, R.A., 1998. Principles of Geographical Information Systems. Oxford University Press, New York, NY 431pp.
Chambers, J.M., 1998. Programming with Data. Springer, New York 469pp.
Chambers, J.M., Hastie, T.J., 1992. Statistical Models in S. Chapman & Hall, London 428pp.
Cressie, N.A.C., 1993. Statistics for Spatial Data, Revised Edn. Wiley, New York 900pp.
Diggle, P.J., Tawn, J.A., Moyeed, R.A., 1998. Model-based geostatistics. Applied Statistics 47 (3), 299–350.
Goovaerts, P., 1997. Geostatistics for Natural Resources Evaluation. Oxford University Press, New York, NY 483pp.
Gomez-Hernandez, J.J., Journel, A.G., 1993. Joint sequential simulation of multiGaussian fields. In: Soares, A. (Ed.), Geostatistics Troia, Vol. 92. Kluwer, Dordrecht, pp. 85–94.
Gotway, C.A., Stroup, W.W., 1997. A generalized linear model approach to spatial data analysis and prediction. Journal of Agricultural, Biological and Environmental Statistics 2 (2), 157–178.
Hjaltason, G., Samet, H., 1995. Ranking in spatial databases. In: Egenhofer, M.J., Herring, J.R. (Eds.), Advances in Spatial Databases: Fourth Symposium, SSD'95, Lecture Notes in Computer Science, Vol. 951. Springer, Berlin, pp. 83–95. See also Quadtree demos: http://www.cs.umd.edu/~brabec/quadtree/index.html.
Ihaka, R., Gentleman, R., 1996. R: a language for data analysis and graphics. Journal of Computational and Graphical Statistics 5 (3), 299–314.
Isaaks, E., Srivastava, R.M., 1989. An Introduction to Applied Geostatistics. Oxford University Press, New York, NY 561pp.
Johnson, M.E., 1987. Multivariate Statistical Simulation. Wiley, New York 230pp.
Journel, A.G., Huijbregts, Ch.J., 1978. Mining Geostatistics. Academic Press, London 600pp.
Moyeed, R.A., Papritz, A., 2002. An empirical comparison of kriging methods for nonlinear spatial point prediction. Mathematical Geology 34 (4), 365–386.
Pebesma, E.J., Heuvelink, G.B.M., 1999. Latin hypercube sampling of Gaussian random fields. Technometrics 41 (4), 303–312.
Pebesma, E.J., Wesseling, C.G., 1998. Gstat, a program for geostatistical modelling, prediction and simulation. Computers & Geosciences 24 (1), 17–31.
Ripley, B.D., 2001. Spatial statistics in R. R News 1 (2), 14–15.
Rivoirard, J., 2002. On the structural link between variables in kriging with external drift. Mathematical Geology 34 (7), 797–808.
Stein, M.L., 1999. Interpolation of Spatial Data: Some Theory for Kriging. Springer, New York, NY 247pp.
Stewart, D.E., Leyk, Z., 1994. Meschach: matrix computations in C. Proceedings of the Centre for Mathematics and its Applications, Vol. 32. Australian National University 240pp. See also Meschach home page: http://www.math.uiowa.edu/~dstewart/meschach/.
Ver Hoef, J.M., Cressie, N.A.C., 1993. Multivariable spatial prediction. Mathematical Geology 25 (2), 219–240.
Wackernagel, H., 1998. Multivariate Geostatistics; An Introduction with Applications, 2nd Edition. Springer, Berlin 291pp.
