The Canadian Journal of Statistics / La revue canadienne de statistique
Vol. 42, No. 1, 2014, Pages 126–141

Robust small area estimation under semi-parametric mixed models

Jon N. K. RAO, Sanjoy K. SINHA* and Laura DUMITRESCU
School of Mathematics and Statistics, Carleton University, Ottawa, ON, Canada K1S 5B6
* Author to whom correspondence may be addressed. E-mail: sinha@math.carleton.ca
Key words and phrases: Bootstrap; mean squared prediction error; outliers; random effects; small area
mean; unit level model.
MSC 2010: Primary 62F10; secondary 62F35
Abstract: Small area estimation has been extensively studied under unit level linear mixed models. In particular, empirical best linear unbiased predictors (EBLUPs) of small area means and associated estimators of mean squared prediction error (MSPE) that are unbiased to second order have been developed. However, the EBLUP can be sensitive to outliers. Sinha & Rao (2009) developed a robust EBLUP method and demonstrated its advantages over the EBLUP in the presence of outliers in the random small area effects and/or unit level errors in the model. A bootstrap method for estimating the MSPE of the robust EBLUP was also proposed. In this paper, we relax the assumption of linear regression for the fixed part of the model and replace it by the weaker assumption of a semi-parametric regression. By approximating the semi-parametric mixed model by a penalized spline mixed model, we develop robust EBLUPs of small area means and bootstrap estimators of MSPE. Results of a simulation study are also presented. The Canadian Journal of Statistics 42: 126–141; 2014 © 2013 Statistical Society of Canada
Résumé: Small area estimation has been extensively studied using unit level linear mixed models. In that setting, empirical best linear unbiased predictors (EBLUPs) of small area means have been developed, together with second-order unbiased estimators of the associated mean squared prediction error (MSPE). However, EBLUPs can be sensitive to outliers. Sinha & Rao (2009) developed robust EBLUPs and demonstrated their advantages over EBLUPs in the presence of outliers in the random small area effects or in the unit level error terms. These authors also proposed a bootstrap method for estimating the MSPE of the robust EBLUPs. In this paper, the authors relax the assumption of linear regression for the fixed part of the model and replace it by a weaker assumption of semi-parametric regression. By approximating the semi-parametric mixed model by a penalized spline mixed model, they develop robust EBLUPs of small area means and bootstrap estimators of the MSPE. Results of a simulation study are also presented. La revue canadienne de statistique 42: 126–141; 2014 © 2013 Statistical Society of Canada

1. INTRODUCTION
Traditional area-specific direct estimators of small area totals or means are not reliable because the area-specific sample sizes are small. As a result, indirect estimation methods that borrow information across related areas through explicit linking models are widely used for small area estimation (Rao, 2003; Jiang & Lahiri, 2006). Linking models use auxiliary population information, such as census and administrative data.
We focus on unit level models in this paper. A basic unit level linear mixed model, called the
nested error regression model (Battese, Harter, & Fuller, 1988), received considerable attention
in recent years. Letting $y$ be a variable of interest and $x_1, \ldots, x_p$ be the associated auxiliary variables, the model is defined by
$$y_{ij} = x_{ij}^t \beta + v_i + e_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, N_i, \qquad (1)$$
where $x_{ij} = (1, x_{1ij}, \ldots, x_{pij})^t$, $m$ is the number of small areas, $N_i$ is the number of population units $j$ in area $i$, $v_i \overset{iid}{\sim} (0, \sigma_v^2)$ denote random small area effects that account for variation in $y$ not explained by the auxiliary variables, and $e_{ij} \overset{iid}{\sim} (0, \sigma_e^2)$ are unit errors assumed to be independent of the $v_i$. The population values of the auxiliary variables $x_1, \ldots, x_p$ for each area $i$ are assumed to be known.
A sample of $n_i\ (\geq 1)$ units is drawn from each area, and the variable of interest $y$ and the covariates $x_1, \ldots, x_p$ are observed. Sampling is assumed to be non-informative in the sense that the population model (1) also holds for the sample; for example, simple random sampling within each area. The best linear unbiased predictor (BLUP) of the area mean $\bar Y_i$, for a given $\theta = (\sigma_v^2, \sigma_e^2)^t$, is first obtained from the sample model, given by (1) with $N_i$ changed to $n_i$, and then $\theta$ is replaced by a consistent estimator $\hat\theta$ to obtain an empirical BLUP (EBLUP) of $\bar Y_i$. The EBLUP is also an empirical Bayes (EB) predictor under normality of $v_i$ and $e_{ij}$. The maximum likelihood (ML) or restricted ML (REML) method is often used for the estimation of $\theta$, assuming normality, but the resulting $\hat\theta$ remains consistent without the normality assumption (Jiang, 1996). Jiang (2010, Chapter 13) gives a detailed account of EBLUPs and associated second-order approximations to the mean squared prediction error (MSPE) and its estimator.
The assumption of a parametric mean function $m(x, \beta) = x^t\beta$ in (1) may be restrictive in practice. To get around this limitation, Ruppert, Wand, & Carroll (2003) studied semi-parametric regression with an unspecified mean function $m_0(x)$, assuming that $x$ is a continuous scalar variable and focusing on regression models without random area effects: $y = m_0(x) + e$. The unknown function $m_0(x)$ is assumed to be approximated sufficiently well by a penalized spline (P-spline) function with a truncated polynomial spline basis. Using a mixed model formulation, Ruppert, Wand, & Carroll (2003) obtained an EBLUP of the P-spline function. Opsomer et al. (2008) extended the traditional P-spline model to small area estimation by including the random area effects $v_i$ and obtained the EBLUP of $\bar Y_i$. They also studied the estimation of the MSPE of the EBLUP.
The EBLUP under the P-spline small area model can be sensitive to outliers in $v_i$ and $e_{ij}$, as in the case of the EBLUP under the nested error regression model (1). In this paper, we obtain a robust EBLUP (REBLUP) of the mean $\bar Y_i$ under a P-spline nested error regression model, using Fellner's (1986) results on robust mixed model equations and his two-step iterative method. The robust ML approach of Sinha & Rao (2009) for estimating $\theta$ under the nested error model (1) runs into difficulty in the context of P-spline mixed models. We also propose a conditional bootstrap estimator of the MSPE of the REBLUP. Results of a simulation study are presented.
2. P-SPLINE MIXED MODEL
Consider first the case of the regression model $y = m_0(x) + e$ with a scalar $x$. The penalized spline approximates $m_0(x)$ by $m(x, \beta, u) = z^t\delta$, where $z = \{1, x, \ldots, x^h, (x - q_1)_+^h, \ldots, (x - q_K)_+^h\}^t$ is the truncated polynomial basis (Ruppert, Wand, & Carroll, 2003). Here $h\ (\geq 1)$ is the degree of the spline, $(x - q)_+ = \max(0, x - q)$, $q_1 < \cdots < q_K$ is a set of fixed knots, and $\delta^t = (\beta^t, u^t)$ with $\beta = (\beta_0, \ldots, \beta_h)^t$ and $u = (u_1, \ldots, u_K)^t$ denoting the coefficient vectors corresponding to the parametric and the spline portions of the model, respectively. A linear P-spline corresponds to $h = 1$. Penalized least squares estimates of $\beta$ and $u$ are obtained by minimizing $\sum_{i=1}^n \{y_i - m(x_i, \beta, u)\}^2 + \lambda^2 u^t u$, where $\lambda$ is a smoothing parameter and $\{(y_i, x_i): i = 1, \ldots, n\}$ denotes the data set. The resulting ridge-type estimator of $\delta$ depends on $\lambda$. One approach to determining $\lambda$ is to use cross-validation or related methods. An alternative approach, suggested by

Ruppert, Wand, & Carroll (2003), uses a mixed formulation of zt by treating u as a random
effect vector with mean 0 and covariance matrix u2 I. The resulting best linear unbiased estimator
(BLUE) of and BLUP of u are equivalent to the corresponding penalized least squares estimator
with 2 = e2 /u2 . The ML method may be used to estimate 2 . Because of the familiarity with
linear mixed models and ready availability of software (such as, SAS), the second approach has
gained popularity among users.
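To make the basis concrete, the following is a minimal sketch in Python/NumPy (a language not used in the paper) of the truncated polynomial basis and the resulting design matrices; the function name `spline_design` and the quantile-based knot choice are our own illustration, not notation from the paper.

```python
import numpy as np

def spline_design(x, knots, degree=1):
    """Truncated polynomial P-spline basis: columns 1, x, ..., x^h and (x - q_k)_+^h."""
    x = np.asarray(x, dtype=float)
    X = np.vander(x, degree + 1, increasing=True)                            # parametric part: 1, x, ..., x^h
    W = np.maximum(x[:, None] - np.asarray(knots)[None, :], 0.0) ** degree   # spline part: (x - q_k)_+^h
    return X, W

# Example: a linear spline (h = 1) with K = 20 knots placed at quantiles of x
rng = np.random.default_rng(1)
x = rng.normal(1.0, 1.0, size=160)
knots = np.quantile(np.unique(x), np.linspace(0, 1, 22)[1:-1])   # 20 interior quantiles
X, W = spline_design(x, knots, degree=1)
print(X.shape, W.shape)   # (160, 2) (160, 20)
```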
Opsomer et al. (2008) extended the P-spline mixed model to small area estimation by including the random effects $v_i$. For a scalar $x$, this model is given by
$$y_{ij} = x_{ij}^t \beta + w_{ij}^t u + v_i + e_{ij}, \quad j = 1, \ldots, n_i; \; i = 1, \ldots, m, \qquad (2)$$
where $x_{ij} = (1, x_{ij}, \ldots, x_{ij}^h)^t$, with specified $h \geq 1$, and $w_{ij} = \{(x_{ij} - q_1)_+^h, \ldots, (x_{ij} - q_K)_+^h\}^t = (w_{ij1}, \ldots, w_{ijK})^t$. Additional auxiliary variables, including categorical variables, that may be linearly related to $y$ can be included in the vector $x_{ij}$. Similarly, additive nonparametric terms in the mean function of the form $m_0(x_1) + g_0(x_2)$ can be handled by using P-spline approximations to both $m_0(x_1)$ and $g_0(x_2)$ and then modifying $w_{ij}$ and $u$ to include random vectors $u_1$ and $u_2$, corresponding to $m_0(x_1)$ and $g_0(x_2)$, with means $0$ and covariance matrices $\sigma_{u_1}^2 I$ and $\sigma_{u_2}^2 I$ (Ruppert, Wand, & Carroll, 2003). In this paper, we focus on the case of a single non-parametric term in the mean function.
Letting $X_i$ and $W_i$ denote the $n_i \times p$ matrix with rows $x_{ij}^t$ and the $n_i \times K$ matrix with rows $w_{ij}^t$, respectively, the sample spline mixed model (2) may be expressed in matrix notation as
$$y_i = X_i\beta + W_i u + Z_i v_i + e_i, \quad i = 1, \ldots, m, \qquad (3)$$
where $y_i$ and $e_i$ are $n_i \times 1$ vectors with elements $y_{ij}$ and $e_{ij}$, respectively, and $Z_i = 1_{n_i}$, where $1_{n_i}$ is the $n_i \times 1$ vector with all elements equal to $1$. We further assume that $u$, $v_i$, and $e_i$ are mutually independent. Unlike (1), model (3) does not have a block-diagonal covariance structure for $y^t = (y_1^t, \ldots, y_m^t)$ because of the common random spline effect $u$ across the areas $i$. We can express (3) as a linear mixed model
$$y = X\beta + Wu + Zv + e,$$
where $X = \mathrm{col}_{1 \leq i \leq m}(X_i)$, $W = \mathrm{col}_{1 \leq i \leq m}(W_i)$, $Z = \mathrm{diag}_{1 \leq i \leq m}(Z_i)$, $y = \mathrm{col}_{1 \leq i \leq m}(y_i)$, $v = (v_1, \ldots, v_m)^t \sim (0, \sigma_v^2 I_m)$, $e = \mathrm{col}_{1 \leq i \leq m}(e_i) \sim (0, \sigma_e^2 I_n)$, and $u \sim (0, \sigma_u^2 I_K)$. Here $n = \sum_i n_i$ is the total sample size.
Opsomer et al. (2008) assumed that the area mean of the spline approximation to the true semi-parametric population model was the parameter of interest. That is, $\bar Y_i \approx \mu_i := \bar X_i^t\beta + \bar W_i^t u + v_i$, assuming that the area population size $N_i$ is large, where $\bar X_i = (\bar X_{1i}, \ldots, \bar X_{pi})^t$ and $\bar W_i = (\bar W_{1i}, \ldots, \bar W_{Ki})^t$ is the vector of means of the $w_{ijk}$ ($j = 1, \ldots, N_i$; $k = 1, \ldots, K$). Both $\bar X_i$ and $\bar W_i$ are assumed to be known.
3. ESTIMATION OF MODEL PARAMETERS AND RANDOM EFFECTS
We assume that $u \sim (0, \sigma_u^2 I_K)$, $v \sim (0, \sigma_v^2 I_m)$, and $e \sim (0, \sigma_e^2 I_n)$. Henderson (1963) showed that the best linear unbiased estimator of $\beta$ and the BLUPs of $u$ and $v$, for fixed $\theta = (\sigma_u^2, \sigma_v^2, \sigma_e^2)^t$, could be obtained by solving the mixed model equations
$$
\begin{pmatrix}
\sigma_e^{-2} X^t X & \sigma_e^{-2} X^t W & \sigma_e^{-2} X^t Z \\
\sigma_e^{-2} W^t X & \sigma_u^{-2} I_K + \sigma_e^{-2} W^t W & \sigma_e^{-2} W^t Z \\
\sigma_e^{-2} Z^t X & \sigma_e^{-2} Z^t W & \sigma_v^{-2} I_m + \sigma_e^{-2} Z^t Z
\end{pmatrix}
\begin{pmatrix} \tilde\beta \\ \tilde u \\ \tilde v \end{pmatrix}
=
\begin{pmatrix} \sigma_e^{-2} X^t y \\ \sigma_e^{-2} W^t y \\ \sigma_e^{-2} Z^t y \end{pmatrix}. \qquad (4)
$$

Denoting the inverse of the partitioned matrix on the left side of (4) by
$$
T = \begin{pmatrix}
T_{11} & T_{12} & T_{13} \\
T_{21} & T_{22} & T_{23} \\
T_{31} & T_{32} & T_{33}
\end{pmatrix}, \qquad (5)
$$
the solution to (4) may be expressed in the form
$$
\begin{pmatrix} \tilde\beta \\ \tilde u \\ \tilde v \end{pmatrix}
=
\begin{pmatrix} Q_1^t y \\ Q_2^t y \\ Q_3^t y \end{pmatrix}, \qquad (6)
$$
where $Q_1^t = \sigma_e^{-2}(T_{11} X^t + T_{12} W^t + T_{13} Z^t)$, $Q_2^t = \sigma_e^{-2}(T_{21} X^t + T_{22} W^t + T_{23} Z^t)$, and $Q_3^t = \sigma_e^{-2}(T_{31} X^t + T_{32} W^t + T_{33} Z^t)$. The BLUE of $\beta$ and the BLUPs of $v$ and $u$ given by (6) are equivalent to Equations (4)–(6) of Opsomer et al. (2008). However, (6) requires the inversion of a fixed $(p + K + m)$-dimensional matrix, unlike Equations (4)–(6) of Opsomer et al. (2008), which require the inversion of the $n \times n$ covariance matrix of $y$, namely $V = \sigma_u^2 WW^t + \sigma_v^2 ZZ^t + \sigma_e^2 I_n$. Moreover, representation (6) is useful in deriving the variances of the elements of $\tilde u$ and $\tilde v$, which are used in Section 5 to modify the Fellner (1986) method of robust estimation of $\theta$ in the presence of outliers.
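As an illustration of how (4)–(6) can be implemented, the following Python/NumPy sketch assembles the coefficient matrix of (4), inverts it to obtain $T$ as in (5), and returns the solution (6); the helper name `solve_mme` is our own, and a production implementation would use a Cholesky factorization rather than an explicit inverse.

```python
import numpy as np

def solve_mme(y, X, W, Z, s2u, s2v, s2e):
    """Solve the mixed model equations (4); return (beta, u, v) and T, the inverse
    of the (p + K + m)-dimensional coefficient matrix in (5)."""
    p, K, m = X.shape[1], W.shape[1], Z.shape[1]
    M = np.hstack([X, W, Z])                      # M = (X  W  Z)
    A = M.T @ M / s2e                             # sigma_e^{-2} M^t M
    A[p:p + K, p:p + K] += np.eye(K) / s2u        # add sigma_u^{-2} I_K block
    A[p + K:, p + K:] += np.eye(m) / s2v          # add sigma_v^{-2} I_m block
    T = np.linalg.inv(A)
    sol = T @ (M.T @ y / s2e)                     # equals (Q_1^t y, Q_2^t y, Q_3^t y)
    return sol[:p], sol[p:p + K], sol[p + K:], T
```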
By writing $\sigma_e^{-2}(X\ W\ Z)^t(X\ W\ Z)$ as $T^{-1} - \mathrm{diag}(0, \sigma_u^{-2} I_K, \sigma_v^{-2} I_m)$, it is easy to verify that
$$
\begin{pmatrix} Q_1^t \\ Q_2^t \\ Q_3^t \end{pmatrix} (X\ W\ Z)
=
\begin{pmatrix}
I_p & -\sigma_u^{-2} T_{12} & -\sigma_v^{-2} T_{13} \\
0 & I_K - \sigma_u^{-2} T_{22} & -\sigma_v^{-2} T_{23} \\
0 & -\sigma_u^{-2} T_{32} & I_m - \sigma_v^{-2} T_{33}
\end{pmatrix}. \qquad (7)
$$
It now follows from (7) that $E(\tilde\beta) = Q_1^t X\beta = \beta$, $E(\tilde u - u) = Q_2^t X\beta - 0 = 0$, and $E(\tilde v - v) = Q_3^t X\beta - 0 = 0$, demonstrating the unbiasedness of $\tilde\beta$, $\tilde u$, and $\tilde v$. It also follows from (7) that
$$E(\tilde\beta\tilde\beta^t) = T_{11} + \beta\beta^t, \quad E(\tilde u\tilde u^t) = \sigma_u^2 I_K - T_{22}, \quad E(\tilde v\tilde v^t) = \sigma_v^2 I_m - T_{33}. \qquad (8)$$
Using (8), we get
$$E\Big(\sum_{k=1}^K \tilde u_k^2\Big) = \mathrm{tr}\{E(\tilde u\tilde u^t)\} = \sigma_u^2 (K - t_2), \qquad (9)$$
and
$$E\Big(\sum_{i=1}^m \tilde v_i^2\Big) = \mathrm{tr}\{E(\tilde v\tilde v^t)\} = \sigma_v^2 (m - t_3), \qquad (10)$$
where $t_2 = \mathrm{tr}(T_{22})/\sigma_u^2$ and $t_3 = \mathrm{tr}(T_{33})/\sigma_v^2$. Turning to $\sigma_e^2$, letting $\tilde e = y - X\tilde\beta - W\tilde u - Z\tilde v$, we have $\tilde e = (I_n - \sigma_e^{-2} MTM^t)y$, where $M = (X\ W\ Z)$ and
$$E(\tilde e\tilde e^t) = E(\tilde e e^t) = \sigma_e^2 I_n - MTM^t. \qquad (11)$$


Using (11), we get
$$E\Big(\sum_{i=1}^m \sum_{j=1}^{n_i} \tilde e_{ij}^2\Big) = \mathrm{tr}\{E(\tilde e\tilde e^t)\} = \sigma_e^2 (n - t_4), \qquad (12)$$
where $t_4 = \mathrm{tr}(MTM^t)/\sigma_e^2$.
It follows from (9), (10), and (12) that $\sigma_u^2$, $\sigma_v^2$, and $\sigma_e^2$ may be estimated iteratively from the following fixed-point equations:
$$\sigma_u^2 = \frac{\sum_{k=1}^K \tilde u_k^2}{K - t_2}, \qquad (13)$$
$$\sigma_v^2 = \frac{\sum_{i=1}^m \tilde v_i^2}{m - t_3}, \qquad (14)$$
$$\sigma_e^2 = \frac{\sum_{i=1}^m \sum_{j=1}^{n_i} \tilde e_{ij}^2}{n - t_4}, \qquad (15)$$
where $t_4 = p + (K - t_2) + (m - t_3)$. Note that normality of $v$, $u$, and $e$ is not assumed in the derivation of (13)–(15), but under normality the REML equations for $\sigma_u^2$, $\sigma_v^2$, and $\sigma_e^2$ given by Fellner (1986) have the same form as (13)–(15). The iterative method proceeds as follows: (i) Compute the terms on the right side of (13)–(15), using $\tilde\beta$, $\tilde u$, $\tilde v$, and $\tilde e$ from (4) with a starting value $\theta^{(0)}$ of $\theta = (\sigma_u^2, \sigma_v^2, \sigma_e^2)^t$, to get $\theta^{(1)}$. (ii) Using $\theta^{(1)}$, compute the next updated estimate $\theta^{(2)}$ from (13)–(15), and so on until convergence, leading to the estimator $\hat\theta = (\hat\sigma_u^2, \hat\sigma_v^2, \hat\sigma_e^2)^t$. General asymptotic results for linear mixed models (Jiang, 1996) show that the REML estimator $\hat\theta$ is consistent for $\theta$ under the spline mixed model, provided the number of knots increases with $n$.
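A minimal sketch of the fixed-point iteration (i)–(ii), assuming the `solve_mme` helper sketched after (6); the stopping rule and starting values are illustrative choices, not taken from the paper.

```python
import numpy as np

def reml_fixed_point(y, X, W, Z, theta0=(1.0, 1.0, 1.0), tol=1e-6, max_iter=200):
    """Iterate the fixed-point equations (13)-(15); theta = (s2u, s2v, s2e)."""
    p, K, m, n = X.shape[1], W.shape[1], Z.shape[1], len(y)
    s2u, s2v, s2e = theta0
    for _ in range(max_iter):
        beta, u, v, T = solve_mme(y, X, W, Z, s2u, s2v, s2e)
        e = y - X @ beta - W @ u - Z @ v
        t2 = np.trace(T[p:p + K, p:p + K]) / s2u
        t3 = np.trace(T[p + K:, p + K:]) / s2v
        t4 = p + (K - t2) + (m - t3)
        new = (u @ u / (K - t2), v @ v / (m - t3), e @ e / (n - t4))   # (13)-(15)
        if max(abs(a - b) for a, b in zip(new, (s2u, s2v, s2e))) < tol:
            return new, (beta, u, v, T)
        s2u, s2v, s2e = new
    return (s2u, s2v, s2e), (beta, u, v, T)
```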
Substituting $\hat\theta$ for $\theta$ in (6), we obtain the empirical BLUE (EBLUE) of $\beta$ and the EBLUPs of $u$ and $v$, denoted by $\hat\beta$, $\hat u$, and $\hat v$. The EBLUP of the small area mean $\bar Y_i$ is given by
$$\hat\mu_i = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij} + \sum_{j \in \bar s_i} \hat y_{ij}\Big), \qquad (16)$$
where $s_i$ and $\bar s_i$ represent the set of sampled units and the set of nonsampled units in area $i$, and $\hat y_{ij} = x_{ij}^t\hat\beta + w_{ij}^t\hat u + \hat v_i$ is the predictor of $y_{ij}$ for $j \in \bar s_i$ using the spline approximation. If $N_i$ is large, then $\hat\mu_i \approx \bar X_i^t\hat\beta + \bar W_i^t\hat u + \hat v_i =: \hat\mu_{ia}$. Note that $\hat\mu_{ia}$ may be expressed as $\hat\mu_{ia} = N_i^{-1}\sum_{j=1}^{N_i}\hat y_{ij}$, where the observed $y_{ij}$ for $j \in s_i$ is replaced by the predictor $\hat y_{ij}$.
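For a single area, the EBLUP (16) is simply the sampled responses plus model predictions for the non-sampled units; a small sketch in the same Python/NumPy style as the earlier sketches (the argument names are our own):

```python
import numpy as np

def eblup_area_mean(y_sampled, sampled, X_i, W_i, beta, u, v_i):
    """EBLUP (16) of one area mean.

    X_i, W_i  : design matrices for all N_i population units of the area
    sampled   : boolean mask of length N_i marking the sampled units
    y_sampled : observed responses for the sampled units (same order as the mask)
    """
    y_hat = X_i @ beta + W_i @ u + v_i                        # predictions for every unit
    return (y_sampled.sum() + y_hat[~sampled].sum()) / len(sampled)
```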


We now consider the estimation of means for non-sampled areas. Suppose that the P-spline model (2) holds for a non-sampled area $l$, that is, $y_{lj} = x_{lj}^t\beta + w_{lj}^t u + v_l + e_{lj}$, $j = 1, \ldots, N_l$. Then we use a synthetic predictor $\hat y_{lj} = x_{lj}^t\hat\beta + w_{lj}^t\hat u$ of $y_{lj}$, $j = 1, \ldots, N_l$, and the resulting predictor of the area mean $\bar Y_l$ is
$$\hat\mu_l = \frac{1}{N_l}\sum_{j=1}^{N_l}\hat y_{lj} = \bar X_l^t\hat\beta + \bar W_l^t\hat u, \qquad (17)$$
where $\bar X_l^t$ and $\bar W_l^t$ are the known means of $x$ and $w$ for area $l$. Opsomer et al. (2008) did not study the case of non-sampled areas.

4. MEAN SQUARED PREDICTION ERROR


Opsomer et al. (2008) derived a second-order approximation to the mean squared prediction error (MSPE) of $\hat\mu_{ia}$, given by $\mathrm{MSPE}(\hat\mu_{ia}) = E(\hat\mu_{ia} - \mu_i)^2$, where the expectation is with respect to the spline model (3). They also proposed a non-parametric bootstrap method to estimate $\mathrm{MSPE}(\hat\mu_{ia})$. Bootstrap replicates $u^*$, $v^*$, and $e^*$ were generated from the standardized BLUPs $[\hat V(\hat u)]^{-1/2}(\hat u - \bar{\hat u})$, $[\hat V(\hat v)]^{-1/2}(\hat v - \bar{\hat v})$, and $[\hat V(\hat e)]^{-1/2}(\hat e - \bar{\hat e})$, evaluated at $\hat\theta$. Using $u^*$, $v^*$, and $e^*$, bootstrap observations $y^* = X\hat\beta + Wu^* + Zv^* + e^*$ are constructed, and the spline mixed model is then fitted to $y^*$ to get the corresponding bootstrap EBLUP $\hat\mu_{ia}^*$. The MSPE of $\hat\mu_{ia}$ is then estimated by $B^{-1}\sum_{b=1}^B\{\hat\mu_{ia}^*(b) - \mu_i^*(b)\}^2$, where $\mu_i^*(b) = \bar X_i^t\hat\beta + \bar W_i^t u^* + v_i^*$ is the approximation to the $i$th area bootstrap population mean. The expressions for the variances of $\hat u$, $\hat v$, and $\hat e$ used by Opsomer et al. (2008) involve the inverse of the $n \times n$ matrix $V$. This can be avoided by using the alternative formulae (8) and (11) for $V(\hat u)$, $V(\hat v)$, and $V(\hat e)$, which depend only on $T$, given by (5), the inverse of a fixed $(p + K + m)$-dimensional matrix.
Ruppert, Wand, & Carroll (2003, p. 138) remark that the mixed model formulation of penalized splines is a convenient fiction used to estimate smoothing parameters. They further note that the randomness of the spline term $u$ is a device used to model curvature and that inferences should be conditional on $u$. We therefore propose a conditional bootstrap method for estimating the MSPE. We assume normality of $v$ and $e$ here, but normality may be relaxed by drawing bootstrap replicates $v^*$ and $e^*$ from the standardized BLUPs as described in the previous paragraph.
We generate $v^*$ and $e^*$ from $N(0, \hat\sigma_v^2 I_m)$ and $N(0, \hat\sigma_e^2 I_n)$, respectively, and obtain bootstrap responses $y_{ij}^* = x_{ij}^t\hat\beta + w_{ij}^t\hat u + v_i^* + e_{ij}^*$, $j = 1, \ldots, N_i$ and $i = 1, \ldots, m$. Using the corresponding sample data $\{(y_{ij}^*, x_{ij}, w_{ij}),\ j \in s_i,\ i = 1, \ldots, m\}$, we obtain the bootstrap estimates $\hat\beta^*$, $\hat u^*$, and $\hat v_i^*$ and the predicted values $\hat y_{ij}^* = x_{ij}^t\hat\beta^* + w_{ij}^t\hat u^* + \hat v_i^*$ for $j \in \bar s_i$. The resulting bootstrap EBLUP of $\bar Y_i$ is
$$\hat\mu_i^* = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij}^* + \sum_{j \in \bar s_i} \hat y_{ij}^*\Big)$$
and the bootstrap population mean is $\bar Y_i^* = N_i^{-1}\sum_{j=1}^{N_i} y_{ij}^*$. Performing the above bootstrap operations $B$ times, the conditional bootstrap estimator of the MSPE of the EBLUP of $\bar Y_i$ is given by
$$M_{boot}(\hat\mu_i) = \frac{1}{B}\sum_{b=1}^B \{\hat\mu_i^*(b) - \bar Y_i^*(b)\}^2, \qquad (18)$$
where $\hat\mu_i^*(b)$ and $\bar Y_i^*(b)$ are the values of $\hat\mu_i^*$ and $\bar Y_i^*$ for the $b$th bootstrap replicate; see González-Manteiga et al. (2008) for a description of bootstrap MSPE estimation for finite populations. In the unconditional bootstrap method, bootstrap responses are given by $y_{ij}^* = x_{ij}^t\hat\beta + w_{ij}^t u^* + v_i^* + e_{ij}^*$, where $u^*$ is generated from $N(0, \hat\sigma_u^2 I_K)$.
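The conditional bootstrap (18) can be sketched as follows in Python; `fit_eblup`, `Xpop`, `Wpop`, `area`, and `sampled` are hypothetical names for a refitting routine and population-level design information, not objects defined in the paper. The spline coefficients $\hat u$ are held fixed across replicates, so only $v^*$ and $e^*$ are regenerated.

```python
import numpy as np

def conditional_boot_mspe(fit_eblup, Xpop, Wpop, area, sampled,
                          beta, u, s2v, s2e, B=200, seed=0):
    """Conditional bootstrap MSPE estimator (18), returned as one value per area."""
    rng = np.random.default_rng(seed)
    m, N = int(area.max()) + 1, Xpop.shape[0]     # areas assumed coded 0, ..., m-1
    sq_err = np.zeros((B, m))
    for b in range(B):
        v_star = rng.normal(0.0, np.sqrt(s2v), size=m)
        e_star = rng.normal(0.0, np.sqrt(s2e), size=N)
        y_star = Xpop @ beta + Wpop @ u + v_star[area] + e_star      # bootstrap population
        Ybar_star = np.array([y_star[area == i].mean() for i in range(m)])
        mu_star = fit_eblup(y_star[sampled], Xpop, Wpop, area, sampled)  # refit, area-mean EBLUPs
        sq_err[b] = (mu_star - Ybar_star) ** 2
    return sq_err.mean(axis=0)
```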
5. ROBUST ESTIMATION
ML or REML estimators of the model parameters and the associated EBLUPs of small area means can be sensitive to outliers in the responses $y_{ij}$, leading to an increase in MSPE. Sinha & Rao (2009) studied robust EBLUPs (REBLUPs) of small area means under linear mixed models with a block-diagonal covariance structure, in particular the nested error model (1). Robust ML (RML) estimators of $\beta$ and $\theta = (\sigma_v^2, \sigma_e^2)^t$ are first obtained by robustifying the score equations, using Huber's (1964) $\psi$-function, and then solving the equations using the Newton–Raphson (NR) iterative method. The RML estimator $\hat\theta = (\hat\sigma_v^2, \hat\sigma_e^2)^t$ is then substituted for $\theta$ in the robustified mixed model equations, which are solved iteratively using the NR method to get a robust estimator $\hat\beta_M$ and robust predictors $\hat v_{iM}$ of $\beta$ and $v_i$ ($i = 1, \ldots, m$). Robust predictors $\hat y_{ijM} = x_{ij}^t\hat\beta_M + \hat v_{iM}$ for $j \in \bar s_i$ are then used to get the REBLUP of $\bar Y_i$ as
$$\hat\mu_{iM} = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij} + \sum_{j \in \bar s_i} \hat y_{ijM}\Big). \qquad (19)$$
The RML method of Sinha & Rao (2009) runs into difficulties in the context of spline mixed models. The robust ML equations involve the Huber $\psi$-functions $\psi(r_{ij})$, where $r_{ij}$ are the elements of $r = U^{-1/2}(y - X\beta)$ and $U$ is a diagonal matrix with diagonal elements equal to the diagonal elements of the covariance matrix $V$. For spline mixed models, the majority of the elements of $r$ will be large in absolute value because the spline term $Wu$ is not included in $r$, which in turn makes the corresponding elements of the derivative of $\psi(r)$ equal to zero. This leads to difficulties in implementing the NR method due to near singularity of the matrix whose inverse is used in the NR algorithm.
We avoid the above difficulty with the robust ML method by first robustifying the fixed-point Equations (14) and (15) involving the BLUP of $v$, to get
$$\sigma_v^2 = \sigma_v^2\,\frac{\sum_{i=1}^m \psi^2(\sigma_v^{-1}\tilde v_{iF})}{(m - t_3)\,h}, \qquad (20)$$
$$\sigma_e^2 = \sigma_e^2\,\frac{\sum_{i=1}^m \sum_{j=1}^{n_i} \psi^2(\sigma_e^{-1}\tilde e_{ijF})}{(n - t_4)\,h}, \qquad (21)$$
where $h = E\{\psi^2(Z)\}$, with $Z \sim N(0, 1)$, and the Huber (1964) $\psi$-function is $\psi(a) = \psi_b(a) = a\min(1, b/|a|)$, with $b > 0$ a tuning constant commonly chosen as $b = 1.345$. In the above, $\tilde e_{ijF} = y_{ij} - x_{ij}^t\tilde\beta_F - w_{ij}^t\tilde u_F - \tilde v_{iF}$. The fixed-point Equation (13) for $\sigma_u^2$, given by
$$\sigma_u^2 = \frac{\sum_{k=1}^K \tilde u_{kF}^2}{K - t_2}, \qquad (22)$$
is not robustified because $\tilde u$ is not affected by outliers. In (20) and (22), $\tilde u_F$ and $\tilde v_F$ are robust BLUPs of $u$ and $v$, for given $\theta = (\sigma_u^2, \sigma_v^2, \sigma_e^2)^t$, obtained from Fellner's mixed model equations as shown below in (25).
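The Huber $\psi$-function and the constant $h = E\{\psi^2(Z)\}$ are straightforward to compute; a short sketch (the numerical value quoted in the comment is our own calculation, not stated in the paper):

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

def psi_huber(a, b=1.345):
    """Huber psi-function: psi_b(a) = a * min(1, b/|a|)."""
    return np.clip(a, -b, b)

b = 1.345
h = quad(lambda z: psi_huber(z, b) ** 2 * norm.pdf(z), -np.inf, np.inf)[0]
print(round(h, 3))   # roughly 0.71 for b = 1.345
```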
We now describe the Fellner iterative algorithm for obtaining the REBLUPs of $u$ and $v$ and robust estimators of $\beta$, $\sigma_u^2$, $\sigma_v^2$, and $\sigma_e^2$ simultaneously.
Step 1. Given a starting value $\theta^{(0)}$ of $\theta = (\sigma_u^2, \sigma_v^2, \sigma_e^2)^t$, solve the mixed model Equations (4) to obtain $\beta^{(0)}$, $u^{(0)}$, $v^{(0)}$ and compute $e^{(0)} = y - X\beta^{(0)} - Wu^{(0)} - Zv^{(0)}$.
Step 2. Substitute the values from Step 1 in the right side of (13), (20), and (21) to get an updated estimate $\theta^{(1)}$ of $\theta$.
Step 3. Compute the pseudo-values
$$y^* = X\beta + Wu + Zv + \sigma_e\,\psi(\sigma_e^{-1}e), \qquad (23)$$
$$v_0^* = v - \sigma_v\,\psi(\sigma_v^{-1}v), \qquad (24)$$
where $\psi(u) = (\psi(u_1), \psi(u_2), \ldots)^t$, using $\beta^{(0)}$, $u^{(0)}$, $v^{(0)}$, and $e^{(0)}$ for $\beta$, $u$, $v$, and $e$, and $\theta^{(1)}$ for $\theta$.
Step 4. Solve the robust mixed model equations
$$
\begin{pmatrix}
\sigma_e^{-2} X^t X & \sigma_e^{-2} X^t W & \sigma_e^{-2} X^t Z \\
\sigma_e^{-2} W^t X & \sigma_u^{-2} I_K + \sigma_e^{-2} W^t W & \sigma_e^{-2} W^t Z \\
\sigma_e^{-2} Z^t X & \sigma_e^{-2} Z^t W & \sigma_v^{-2} I_m + \sigma_e^{-2} Z^t Z
\end{pmatrix}
\begin{pmatrix} \beta \\ u \\ v \end{pmatrix}
=
\begin{pmatrix} \sigma_e^{-2} X^t y^* \\ \sigma_e^{-2} W^t y^* \\ \sigma_e^{-2} Z^t y^* + \sigma_v^{-2} v_0^* \end{pmatrix}, \qquad (25)
$$
using $\theta^{(1)}$ and the corresponding pseudo-values obtained in Step 3. This leads to new values $\beta^{(1)}$, $u^{(1)}$, $v^{(1)}$, and using these values compute $e^{(1)} = y - X\beta^{(1)} - Wu^{(1)} - Zv^{(1)}$.
Step 5. Repeat Steps 2–4 using the new values until convergence is achieved. At convergence, we get the robust estimators $\hat\beta_F$, $\hat\sigma_{uF}^2$, $\hat\sigma_{vF}^2$, $\hat\sigma_{eF}^2$ and the REBLUPs $\hat u_F$, $\hat v_F$.
Note that (25) reduces to the mixed model Equations (4) when $\psi(\sigma_e^{-1}e) = \sigma_e^{-1}e$ and $\psi(\sigma_v^{-1}v) = \sigma_v^{-1}v$, noting that $v_0^*$ then reduces to $0$. Also, one could use different $\psi$-functions in (23) and (24).
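Steps 1–5 can be sketched as follows, reusing the `solve_mme` and `psi_huber` helpers from the earlier sketches; the starting values, convergence criterion, and the default value of $h$ (our approximate value for $b = 1.345$) are illustrative choices, not prescriptions from the paper.

```python
import numpy as np

def fellner_robust(y, X, W, Z, theta0=(1.0, 1.0, 1.0), b=1.345, h=0.71,
                   tol=1e-6, max_iter=200):
    """Fellner's iterative algorithm (Steps 1-5) for the P-spline mixed model."""
    p, K, m, n = X.shape[1], W.shape[1], Z.shape[1], len(y)
    s2u, s2v, s2e = theta0
    M = np.hstack([X, W, Z])
    # Step 1: non-robust solve of (4) at the starting value
    beta, u, v, T = solve_mme(y, X, W, Z, s2u, s2v, s2e)
    e = y - X @ beta - W @ u - Z @ v
    for _ in range(max_iter):
        # Step 2: update the variance components via (13), (20), (21)
        t2 = np.trace(T[p:p + K, p:p + K]) / s2u
        t3 = np.trace(T[p + K:, p + K:]) / s2v
        t4 = p + (K - t2) + (m - t3)
        s2u = u @ u / (K - t2)
        s2v = s2v * np.sum(psi_huber(v / np.sqrt(s2v), b) ** 2) / ((m - t3) * h)
        s2e = s2e * np.sum(psi_huber(e / np.sqrt(s2e), b) ** 2) / ((n - t4) * h)
        # Step 3: pseudo-values (23) and (24) at the updated theta
        y_star = X @ beta + W @ u + Z @ v + np.sqrt(s2e) * psi_huber(e / np.sqrt(s2e), b)
        v0_star = v - np.sqrt(s2v) * psi_huber(v / np.sqrt(s2v), b)
        # Step 4: robust mixed model equations (25)
        A = M.T @ M / s2e
        A[p:p + K, p:p + K] += np.eye(K) / s2u
        A[p + K:, p + K:] += np.eye(m) / s2v
        rhs = M.T @ y_star / s2e
        rhs[p + K:] += v0_star / s2v
        T = np.linalg.inv(A)
        sol = T @ rhs
        # Step 5: stop when the coefficients stabilize
        converged = np.max(np.abs(sol - np.concatenate([beta, u, v]))) < tol
        beta, u, v = sol[:p], sol[p:p + K], sol[p + K:]
        e = y - X @ beta - W @ u - Z @ v
        if converged:
            break
    return beta, u, v, (s2u, s2v, s2e)
```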
Stahel & Welsh (1997) proposed a modification of Fellner's method that standardizes the quantities to be robustified by more suitably defined quantities than those used in (20) and (21). In particular, we replace $\sigma_v^2$ and $\sigma_e^2$ on the right-hand side of (20) and (21) by the corresponding variances $\sigma_{v_i}^2$ and $\sigma_{e_{ij}}^2$ of $\tilde v_i$ and $\tilde e_{ij}$. Thus, the modified Fellner equations are given by (22) and
$$\sigma_v^2 = \frac{\sum_{i=1}^m \sigma_{v_i}^2\,\psi^2(\sigma_{v_i}^{-1}\tilde v_{iF})}{(m - t_3)\,h}, \qquad (26)$$
$$\sigma_e^2 = \frac{\sum_{i=1}^m \sum_{j=1}^{n_i} \sigma_{e_{ij}}^2\,\psi^2(\sigma_{e_{ij}}^{-1}\tilde e_{ijF})}{(n - t_4)\,h}, \qquad (27)$$
where, using (8) and (11),
$$\sigma_{v_i}^2 = \sigma_v^2 - T_{i33}, \qquad \sigma_{e_{ij}}^2 = \sigma_e^2 - (MTM^t)_l,$$
with $T_{i33}$ the $i$th diagonal element of $T_{33}$ and $(MTM^t)_l$ the $l$th diagonal element of $MTM^t$. Here $l$ is the position in the $n$-dimensional vector corresponding to the $ij$th component, that is, $\tilde e_{ij} = \tilde e_l$. The iterative solution for obtaining the robust estimators $\hat\beta_{MF}$, $\hat\sigma_{uMF}^2$, $\hat\sigma_{vMF}^2$, $\hat\sigma_{eMF}^2$ and the REBLUPs $\hat u_{MF}$, $\hat v_{MF}$ under the modified Fellner (MF) method follows along the lines of Steps 1–5, using (26) and (27) in place of (20) and (21).
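The only change in the modified Fellner update is the per-component standardization; the required variances follow directly from the diagonal elements of $T_{33}$ and $MTM^t$, as in this small sketch (function and argument names are our own):

```python
import numpy as np

def blup_error_sds(T, M, p, K, s2v, s2e):
    """Standard deviations used in (26)-(27): Var(v_i tilde) = s2v - T_{i33} and
    Var(e_ij tilde) = s2e - (M T M^t)_ll, which follow from (8) and (11)."""
    sd_v = np.sqrt(s2v - np.diag(T)[p + K:])                 # diagonal of T_33
    sd_e = np.sqrt(s2e - np.einsum('ij,jk,ik->i', M, T, M))  # diagonal of M T M^t
    return sd_v, sd_e
```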
The REBLUP of $\bar Y_i$ under the Fellner (F) method is given by (16) with $\hat y_{ij}$ changed to $\hat y_{ijF} = x_{ij}^t\hat\beta_F + w_{ij}^t\hat u_F + \hat v_{iF}$:
$$\hat\mu_{iF} = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij} + \sum_{j \in \bar s_i} \hat y_{ijF}\Big). \qquad (28)$$
Similarly, for the modified Fellner (MF) method, the REBLUP of $\bar Y_i$ is given by (16) with $\hat y_{ij}$ changed to $\hat y_{ijMF} = x_{ij}^t\hat\beta_{MF} + w_{ij}^t\hat u_{MF} + \hat v_{iMF}$:
$$\hat\mu_{iMF} = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij} + \sum_{j \in \bar s_i} \hat y_{ijMF}\Big). \qquad (29)$$

Note that the REBLUP of $\mu_i = \bar X_i^t\beta + \bar W_i^t u + v_i$ may be written as $\hat\mu_{iaF} = N_i^{-1}\sum_{j=1}^{N_i}\hat y_{ijF}$ under the Fellner method and as $\hat\mu_{iaMF} = N_i^{-1}\sum_{j=1}^{N_i}\hat y_{ijMF}$ under the modified Fellner method, where the observed $y_{ij}$, for $j \in s_i$, is replaced by the robust predictors $\hat y_{ijF}$ and $\hat y_{ijMF}$. If the sampling fraction $n_i/N_i$ is small, then $\hat\mu_{iF}$ and $\hat\mu_{iaF}$, as predictors of $\bar Y_i$, perform similarly in terms of efficiency; otherwise $\hat\mu_{iF}$ is more efficient because it makes use of the sample observations $y_{ij}$, $j \in s_i$, which are part of the population mean $\bar Y_i = N_i^{-1}(\sum_{j \in s_i} y_{ij} + \sum_{j \in \bar s_i} y_{ij})$. A similar comment applies to $\hat\mu_{iMF}$ and $\hat\mu_{iaMF}$ as estimators of $\bar Y_i$.
For a non-sampled area $l$, the Fellner synthetic predictor of $y_{lj}$ is $\hat y_{ljF} = x_{lj}^t\hat\beta_F + w_{lj}^t\hat u_F$, $j = 1, \ldots, N_l$, and the resulting predictor of $\bar Y_l$ is
$$\hat\mu_{lF} = \frac{1}{N_l}\sum_{j=1}^{N_l}\hat y_{ljF} = \bar X_l^t\hat\beta_F + \bar W_l^t\hat u_F. \qquad (30)$$
Similarly, the modified Fellner predictor of $y_{lj}$ is given by $\hat y_{ljMF} = x_{lj}^t\hat\beta_{MF} + w_{lj}^t\hat u_{MF}$, $j = 1, \ldots, N_l$, and the resulting predictor of $\bar Y_l$ is
$$\hat\mu_{lMF} = \frac{1}{N_l}\sum_{j=1}^{N_l}\hat y_{ljMF} = \bar X_l^t\hat\beta_{MF} + \bar W_l^t\hat u_{MF}. \qquad (31)$$
In the simulation study of Section 7, we have not studied the case of non-sampled areas.
6. BOOTSTRAP MSPE ESTIMATOR OF REBLUP
Given the complex form of the REBLUP (28) of the area mean $\bar Y_i$ and the lack of knowledge of the underlying distributions of $u$, $v_i$, and $e_i$, MSPE estimation becomes difficult. One could follow Opsomer et al. (2008) and use a non-parametric bootstrap method to generate bootstrap observations. But in the presence of outliers in $v_i$ and/or $e_i$, the proportion of outliers in the bootstrap samples may be much higher than that in the original data set, and this difficulty may lead to poor performance of the non-parametric bootstrap method in the presence of outliers. Salibian-Barrera & Van Aelst (2008) proposed a fast and robust bootstrap method for linear regression models that addresses the above difficulty, but its extension to handle spline mixed models needs further study.
Sinha & Rao (2009) proposed a parametric bootstrap for the nested error model (1) to estimate the MSPE of the REBLUP. Their method uses robust estimates $\hat\beta_M$ and $\hat\theta_M$ and bootstrap values $v_i^*$ and $e_{ij}^*$ generated from $N(0, \hat\sigma_{vM}^2)$ and $N(0, \hat\sigma_{eM}^2)$, respectively, to obtain bootstrap responses $y_{ij}^* = x_{ij}^t\hat\beta_M + v_i^* + e_{ij}^*$, $j = 1, \ldots, N_i$ and $i = 1, \ldots, m$. The above method was motivated by noting that the focus is on deviations from the working assumption of normality of $v$ and $e$ and that it is natural to use robust parameter estimates for drawing robust bootstrap samples, since the MSPE of $\hat\mu_{iM}$ is not sensitive to outliers. Simulation results showed that the proposed bootstrap method of estimating $\mathrm{MSPE}(\hat\mu_{iM})$ performs well in the sense that the simulated expected value of the estimated MSPE is close to the corresponding simulated MSPE in the presence of outliers.
We follow the Sinha–Rao method to obtain a conditional bootstrap estimator of the MSPE of $\hat\mu_{iF}$ under the spline mixed model. We generate $v^*$ and $e^*$ from $N(0, \hat\sigma_{vF}^2 I_m)$ and $N(0, \hat\sigma_{eF}^2 I_n)$, respectively, and obtain bootstrap responses $y_{ij}^* = x_{ij}^t\hat\beta_F + w_{ij}^t\hat u_F + v_i^* + e_{ij}^*$, $j = 1, \ldots, N_i$ and $i = 1, \ldots, m$, that are free of outliers. Using the corresponding bootstrap sample data $\{(y_{ij}^*, x_{ij}, w_{ij}),\ j \in s_i;\ i = 1, \ldots, m\}$, we obtain bootstrap estimates $\hat\beta^*$, $\hat u^*$, and $\hat v_i^*$ and the
predicted values $\hat y_{ij}^* = x_{ij}^t\hat\beta^* + w_{ij}^t\hat u^* + \hat v_i^*$ for $j \in \bar s_i$, where $\hat\beta^*$ is the BLUE of $\beta$ and $\hat u^*$ and $\hat v_i^*$ are the EBLUPs of $u$ and $v_i$, respectively. The resulting bootstrap REBLUP of $\bar Y_i$ is
$$\hat\mu_i^* = \frac{1}{N_i}\Big(\sum_{j \in s_i} y_{ij}^* + \sum_{j \in \bar s_i} \hat y_{ij}^*\Big)$$
and the bootstrap population mean is $\bar Y_i^* = N_i^{-1}\sum_{j=1}^{N_i} y_{ij}^*$. Note that the use of EBLUPs rather than REBLUPs makes sense because the bootstrap sample is free of outliers. However, our simulation results indicate that the use of REBLUPs gives similar results. An advantage of using EBLUPs over REBLUPs is that the bootstrap computations are greatly reduced.
Performing the above bootstrap operation $B$ times, the bootstrap estimator of the MSPE of $\hat\mu_{iF}$ is given by
$$M_{boot}(\hat\mu_{iF}) = \frac{1}{B}\sum_{b=1}^B \{\hat\mu_i^*(b) - \bar Y_i^*(b)\}^2, \qquad (32)$$
where $\hat\mu_i^*(b)$ and $\bar Y_i^*(b)$ are the values of $\hat\mu_i^*$ and $\bar Y_i^*$ for the $b$th bootstrap replicate, and $\hat\mu_{iF}$ is given by (28). It may be possible to get better bootstrap MSPE estimators than (32) by using a double bootstrap (Hall & Maiti, 2006).
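In code, the robust bootstrap of this section differs from the conditional bootstrap sketched after (18) only in its inputs; a usage sketch under the assumption that the `conditional_boot_mspe` and `fellner_robust` helpers from the earlier sketches are available, with `fit_eblup` and the sample-level arrays `y_s`, `X_s`, `W_s`, `Z_s` again hypothetical names:

```python
# Bootstrap populations are generated from the robust Fellner fits, and each
# replicate is refitted with the ordinary (non-robust) EBLUP, since the
# bootstrap data contain no outliers.
beta_F, u_F, v_F, (s2u_F, s2v_F, s2e_F) = fellner_robust(y_s, X_s, W_s, Z_s)
mspe_reblup = conditional_boot_mspe(fit_eblup, Xpop, Wpop, area, sampled,
                                    beta_F, u_F, s2v_F, s2e_F, B=200)
```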
7. SIMULATION STUDY
In this section, we report some results of a limited simulation study on the performance of the proposed P-spline REBLUP of a small area mean. We used the following true semi-parametric nested error model to generate samples for the simulation study:
$$y_{ij} = m_0(x_{ij}) + v_i + e_{ij}, \quad j = 1, \ldots, N_i; \; i = 1, \ldots, m, \qquad (33)$$
where $N_i = 40$ for all areas $i$, $m = 40$, and three different choices of $m_0(x)$ were used: (1) $m_0(x) = 1 + x$ (linear), (2) $m_0(x) = 1 + x + x^2$ (quadratic), and (3) $m_0(x) = 1 - x + 0.5\exp(x)$ (exponential 1), as used by Breidt, Claeskens, & Opsomer (2005).
Further, to reflect outlying observations, we assumed that $v_i$ and $e_{ij}$ are generated from the contaminated normal distributions
$$v_i \overset{iid}{\sim} (1 - \delta_1)N(0, \sigma_v^2) + \delta_1 N(0, \sigma_{v1}^2), \qquad e_{ij} \overset{iid}{\sim} (1 - \delta_2)N(0, \sigma_e^2) + \delta_2 N(0, \sigma_{e1}^2),$$
where $\sigma_v^2 = \sigma_e^2 = 1$ and $\sigma_{v1}^2 = \sigma_{e1}^2 = 25$. Four combinations of distributions for $v_i$ and $e_{ij}$, denoted $(0, 0)$, $(e, 0)$, $(0, v)$, and $(e, v)$, were studied, where $(0, 0)$ indicates no contamination ($\delta_1 = \delta_2 = 0$), $(0, v)$ indicates contamination in $v$ only ($\delta_1 = 0.1$, $\delta_2 = 0$), $(e, 0)$ indicates contamination in $e$ only ($\delta_1 = 0$, $\delta_2 = 0.1$), and $(e, v)$ indicates contamination in both $v$ and $e$ ($\delta_1 = 0.1$, $\delta_2 = 0.1$).
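A sketch of the contaminated-normal generator behind the four scenarios (the function name and scenario dictionary are our own illustration):

```python
import numpy as np

def contaminated_normal(size, delta, sd0=1.0, sd1=5.0, rng=None):
    """(1 - delta) N(0, sd0^2) + delta N(0, sd1^2); sd1 = 5 gives outlier variance 25."""
    rng = rng or np.random.default_rng()
    outlier = rng.random(size) < delta
    return np.where(outlier, rng.normal(0.0, sd1, size), rng.normal(0.0, sd0, size))

scenarios = {'(0,0)': (0.0, 0.0), '(e,0)': (0.0, 0.1),
             '(0,v)': (0.1, 0.0), '(e,v)': (0.1, 0.1)}   # (delta_1 for v, delta_2 for e)
d1, d2 = scenarios['(e,v)']
m, Ni = 40, 40
rng = np.random.default_rng(2014)
v = contaminated_normal(m, d1, rng=rng)         # area effects v_i
e = contaminated_normal(m * Ni, d2, rng=rng)    # unit errors e_ij
```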
We generated the population $x_{ij}$, $j = 1, \ldots, N_i = 40$, $i = 1, \ldots, m = 40$, from $N(1, 1)$ and held them fixed over the simulation runs $r = 1, \ldots, R = 500$. We generated simple random samples, $s_i$, each of size $n_i = 4$, from the 40 areas in each simulation run. The population data for the $r$th simulation run are $\{(y_{ij}^{(r)}, x_{ij}),\ j = 1, \ldots, 40;\ i = 1, \ldots, 40\}$ and the corresponding sample data are $\{(y_{ij}^{(r)}, x_{ij}),\ j \in s_i;\ i = 1, \ldots, 40\}$, where $y_{ij}^{(r)} = m_0(x_{ij}) + v_i^{(r)} + e_{ij}^{(r)}$, and $v_i^{(r)}$ and $e_{ij}^{(r)}$ are generated from the specified distributions of $v_i$ and $e_{ij}$.
For the simulation study, we used the linear P-spline approximation ($h = 1$) for $m_0(x_{ij})$. Regarding the choice of the number of knots $K$ and the location of the knots $q_1, \ldots, q_K$, we followed Ruppert, Wand, & Carroll (2003, Sec. 5.5): (1) one needs enough knots to ensure sufficient flexibility to fit the data, but beyond that, additional knots do not change the fit of the model; and (2) place the knots at the quantiles of the unique population $x$-values, which gives an equal or nearly equal number of $x$-values between the knots, and use the same knots in calculating the P-spline EBLUP or REBLUP from the sample data.

Table 1: Simulated mean squared errors of estimators of variance components for the Fellner (F) method and the modified Fellner (MF) method, K = 20 knots.

| Contamination | Parameter | Quadratic model, F | Quadratic model, MF | Exponential(1) model, F | Exponential(1) model, MF |
|---|---|---|---|---|---|
| No contamination | $\sigma_v^2 = 1$ | 0.1306 | 0.1321 | 0.1350 | 0.1349 |
| No contamination | $\sigma_e^2 = 1$ | 0.0430 | 0.0309 | 0.0594 | 0.0313 |
| Contamination in e | $\sigma_v^2 = 1$ | 0.1771 | 0.1693 | 0.1946 | 0.1782 |
| Contamination in e | $\sigma_e^2 = 1$ | 0.3963 | 0.1412 | 0.6052 | 0.1975 |
| Contamination in v | $\sigma_v^2 = 1$ | 0.5199 | 0.3098 | 0.5392 | 0.3227 |
| Contamination in v | $\sigma_e^2 = 1$ | 0.0430 | 0.0318 | 0.0504 | 0.0315 |
| Contamination in (e, v) | $\sigma_v^2 = 1$ | 0.5438 | 0.2998 | 0.5681 | 0.3100 |
| Contamination in (e, v) | $\sigma_e^2 = 1$ | 0.4078 | 0.1288 | 0.4871 | 0.1412 |
Following Stahel & Welsh (1997), we focus on estimating the variance parameters, $\sigma_v^2$ and $\sigma_e^2$, of the central model, which is the $N(0, 1)$ model; in our case 90% of the $v_i$ and/or $e_{ij}$ follow the central model. From each generated data set $r$, we calculated the estimators $\hat\sigma_{vF}^{2(r)}$ and $\hat\sigma_{eF}^{2(r)}$ under the Fellner method and $\hat\sigma_{vMF}^{2(r)}$ and $\hat\sigma_{eMF}^{2(r)}$ under the modified Fellner method, using the spline model with $K = 20$ knots. Table 1 reports the simulated mean squared errors $R^{-1}\sum_{r=1}^R(\hat\sigma_{vF}^{2(r)} - 1)^2$ and $R^{-1}\sum_{r=1}^R(\hat\sigma_{eF}^{2(r)} - 1)^2$ for the Fellner (F) method and $R^{-1}\sum_{r=1}^R(\hat\sigma_{vMF}^{2(r)} - 1)^2$ and $R^{-1}\sum_{r=1}^R(\hat\sigma_{eMF}^{2(r)} - 1)^2$ for the modified Fellner (MF) method, under the quadratic model (model 2) and the exponential model (model 3) for $m_0(x)$ in (33). It is clear from Table 1 that MF leads to significantly smaller mean squared error (MSE), especially for the parameter corresponding to the contamination: $\sigma_e^2$ for $(e, 0)$ and $(e, v)$, and $\sigma_v^2$ for $(0, v)$ and $(e, v)$. The increase in MSE for F is due to larger bias.
We now turn to the performance of the two methods (F and MF) in estimating the small area means under model (33). We computed the estimates $\hat\mu_{iF}^{(r)}$ and $\hat\mu_{iMF}^{(r)}$ using (28) and (29) for each data set $r$. Simulated MSPEs of $\hat\mu_{iF}$ and $\hat\mu_{iMF}$ were computed as $R^{-1}\sum_{r=1}^R\{\hat\mu_{iF}^{(r)} - \bar Y_i^{(r)}\}^2$ and $R^{-1}\sum_{r=1}^R\{\hat\mu_{iMF}^{(r)} - \bar Y_i^{(r)}\}^2$, respectively, where $\bar Y_i^{(r)} = \sum_{j=1}^{40} y_{ij}^{(r)}/40$. Table 2 reports the simulated MSPEs of the estimators $\hat\mu_{iF}$ and $\hat\mu_{iMF}$, averaged over areas, for the two methods. It is clear from Table 2 that the two methods perform similarly in terms of the average MSPE, unlike in the case of estimating variance components. The modified Fellner method did not perform well relative to the Fellner method in estimating the MSPE using the bootstrap method of Section 6. In view of the above results on estimating small area means, we focused on the original Fellner method, and detailed results on MSPE and the bootstrap estimator of MSPE are reported in Tables 3 and 4, respectively.
Table 2: Simulated mean squared prediction errors of robust estimators of small area means (averaged over areas) for the Fellner (F) method and the modified Fellner (MF) method (K = 20 knots).

| Contamination | Quadratic model, F | Quadratic model, MF | Exponential(1) model, F | Exponential(1) model, MF |
|---|---|---|---|---|
| (0, 0) | 0.1993 | 0.2042 | 0.2063 | 0.2111 |
| (e, 0) | 0.3273 | 0.3320 | 0.3485 | 0.3541 |
| (0, v) | 0.2105 | 0.2134 | 0.2340 | 0.2356 |
| (e, v) | 0.3586 | 0.3608 | 0.4161 | 0.4169 |
We computed the simulated bias and MSPE of the EBLUP $\hat\mu_i$, given by (16), and the REBLUP $\hat\mu_{iF}$, given by (28), for the true models linear, quadratic, and exponential (1), and for $K = 0$, 20, and 30 knots. The special case of $K = 0$ corresponds to the standard nested error model (1) and the associated EBLUP and REBLUP studied by Sinha & Rao (2009).
Table 3 reports the values of absolute bias and MSPE averaged over the 40 areas. Some broad observations from Table 3 are the following: (1) In the case of no contamination (0, 0), the EBLUP (K = 20 or 30) and the EBLUP (K = 0) are more efficient than the corresponding REBLUP (K = 20 or 30) and REBLUP (K = 0), as expected. However, the loss in efficiency of the latter estimators is small, confirming their robustness. (2) For a given K, the REBLUP is much more efficient than the corresponding EBLUP in the case of contamination in e only or in both v and e, but only slightly more efficient in the case of (0, v), because the expected number of outliers is significantly smaller under (0, v) relative to (e, 0) or (e, v). For example, for K = 20 and the quadratic true model, the average MSPE of the EBLUP is 0.4792 compared to 0.3201 for the REBLUP under (e, 0), and 0.6250 compared to 0.3560 under (e, v). (3) The MSPE of the estimators is not much affected by the choice of the number of knots, K, when we compare the values for K = 20 to the corresponding values for K = 30. This result suggests that K = 20 is a good choice. (4) In the case of the nested error model (1), that is, when the true model is linear, the increase in MSPE of the EBLUP (K = 20 or 30) over the EBLUP (K = 0) is minimal across the four contamination combinations. On the other hand, the EBLUP (K = 0) leads to a large increase in MSPE relative to the EBLUP (K = 20 or 30) when the true model is quadratic or exponential. For example, under the quadratic model and no contamination (0, 0), the average MSPE of the EBLUP (K = 0) in Table 3 is 0.3997, compared to 0.1921 for the EBLUP (K = 20); a similar result holds for the REBLUP (K = 0) and the REBLUP (K = 20): 0.3756 versus 0.1976. The increase in MSPE of the EBLUP (K = 0) over the EBLUP (K = 20), or of the REBLUP (K = 0) over the REBLUP (K = 20), is due to larger absolute bias when the model is not linear.
The values reported in Table 3 are based on R = 500 simulated samples, but the values are reliable because they refer to averages over the 40 areas. On the other hand, to make comparisons of area-specific MSPEs of the estimators it is necessary to generate a much larger number of simulated samples. Because of the substantial increase in computations, we focused on the case of K = 20 knots and the quadratic model (model 2) and generated R = 10,000 simulated samples to compare the MSPEs of the EBLUP and REBLUP. Figure 1a reports the area-specific MSPE values for the cases (0, 0) and (0, v), with no outliers in e, and Figure 1b for the cases (e, 0) and (e, v), with outliers in e. Figure 1a clearly shows that in the case of no contamination (0, 0), the REBLUP is slightly less efficient than the EBLUP for each area, as expected.
Table 3: Simulated absolute biases and mean squared prediction errors of the EBLUP and REBLUP of small area means (averaged over areas).

| True model | Contamination | Method | Bias (K=0) | MSPE (K=0) | Bias (K=20) | MSPE (K=20) | Bias (K=30) | MSPE (K=30) |
|---|---|---|---|---|---|---|---|---|
| Linear | (0, 0) | EBLUP | 0.0193 | 0.1878 | 0.0127 | 0.1915 | 0.0136 | 0.1886 |
| Linear | (0, 0) | REBLUP | 0.0193 | 0.1940 | 0.0120 | 0.1968 | 0.0142 | 0.1936 |
| Linear | (e, 0) | EBLUP | 0.0246 | 0.4732 | 0.0273 | 0.4682 | 0.0242 | 0.4707 |
| Linear | (e, 0) | REBLUP | 0.0208 | 0.3168 | 0.0192 | 0.3118 | 0.0210 | 0.3165 |
| Linear | (0, v) | EBLUP | 0.0142 | 0.2115 | 0.0175 | 0.2138 | 0.0148 | 0.2141 |
| Linear | (0, v) | REBLUP | 0.0140 | 0.2044 | 0.0173 | 0.2076 | 0.0148 | 0.2079 |
| Linear | (e, v) | EBLUP | 0.0256 | 0.6201 | 0.0245 | 0.6310 | 0.0286 | 0.6220 |
| Linear | (e, v) | REBLUP | 0.0203 | 0.3447 | 0.0206 | 0.3433 | 0.0226 | 0.3430 |
| Quadratic | (0, 0) | EBLUP | 0.0543 | 0.3997 | 0.0174 | 0.1921 | 0.0166 | 0.1969 |
| Quadratic | (0, 0) | REBLUP | 0.1017 | 0.3756 | 0.0179 | 0.1976 | 0.0164 | 0.2022 |
| Quadratic | (e, 0) | EBLUP | 0.0643 | 0.6262 | 0.0287 | 0.4792 | 0.0288 | 0.4756 |
| Quadratic | (e, 0) | REBLUP | 0.1030 | 0.5251 | 0.0235 | 0.3201 | 0.0212 | 0.3210 |
| Quadratic | (0, v) | EBLUP | 0.0521 | 0.5556 | 0.0156 | 0.2183 | 0.0198 | 0.2202 |
| Quadratic | (0, v) | REBLUP | 0.1156 | 0.4535 | 0.0143 | 0.2112 | 0.0200 | 0.2130 |
| Quadratic | (e, v) | EBLUP | 0.0681 | 0.9451 | 0.0260 | 0.6250 | 0.0283 | 0.6404 |
| Quadratic | (e, v) | REBLUP | 0.1336 | 0.6631 | 0.0217 | 0.3560 | 0.0220 | 0.3529 |
| Exponential(1) | (0, 0) | EBLUP | 0.0676 | 0.4398 | 0.0233 | 0.2012 | 0.0291 | 0.2117 |
| Exponential(1) | (0, 0) | REBLUP | 0.1307 | 0.3672 | 0.0253 | 0.2059 | 0.0327 | 0.2158 |
| Exponential(1) | (e, 0) | EBLUP | 0.1151 | 0.7040 | 0.0389 | 0.4926 | 0.0377 | 0.4891 |
| Exponential(1) | (e, 0) | REBLUP | 0.1614 | 0.5407 | 0.0334 | 0.3350 | 0.0334 | 0.3357 |
| Exponential(1) | (0, v) | EBLUP | 0.0841 | 0.7053 | 0.0358 | 0.2713 | 0.0290 | 0.2362 |
| Exponential(1) | (0, v) | REBLUP | 0.1610 | 0.4602 | 0.0408 | 0.2423 | 0.0298 | 0.2260 |
| Exponential(1) | (e, v) | EBLUP | 0.1305 | 1.2551 | 0.0312 | 0.6489 | 0.0330 | 0.6529 |
| Exponential(1) | (e, v) | REBLUP | 0.2186 | 0.7423 | 0.0337 | 0.3773 | 0.0286 | 0.3646 |

Similarly, in the case of the contamination model (0, v), the REBLUP is slightly more efficient than the EBLUP for each area. On the other hand, in the case of the contamination models (e, 0) and (e, v), with outliers in e, the REBLUP leads to large gains in efficiency over the EBLUP for each area, particularly for the case (e, v). The above results are in agreement with those reported in Table 3.

Figure 1: Area-specific MSPEs of the EBLUP and REBLUP for K = 20 knots and the true quadratic model (model 2): (a) cases (0, 0) and (0, v); (b) cases (e, 0) and (e, v).
We also computed the bootstrap MSPE estimators (18) and (32) of the EBLUP $\hat\mu_i$ and the REBLUP $\hat\mu_{iF}$ for each simulation run, using $B = 200$ bootstrap samples, and then averaged them over the simulation runs to obtain the simulated expected value of the bootstrap MSPE estimator for each area $i$. Table 4 reports the simulated values of the MSPE and the associated simulated expected values of the MSPE estimator for $K = 20$, averaged over the areas. The values of the percent absolute relative bias (ARB %) of the MSPE estimator, averaged over the areas, are also reported. It is clear from Table 4 that the bootstrap MSPE estimator performs well in terms of average ARB, which is less than 12% in most cases.
Table 4: Simulated MSPE, expected value, and ARB (%) of the bootstrap MSPE estimator of the EBLUP and REBLUP (averaged over areas), K = 20 knots, bootstrap size B = 200.

| True model | Contamination | Method | MSPE | Boot-MSPE | ARB (%) |
|---|---|---|---|---|---|
| Quadratic | (0, 0) | EBLUP | 0.1921 | 0.1908 | 5.9 |
| Quadratic | (0, 0) | REBLUP | 0.1976 | 0.2146 | 9.4 |
| Quadratic | (e, 0) | EBLUP | 0.4792 | 0.4585 | 7.7 |
| Quadratic | (e, 0) | REBLUP | 0.3201 | 0.2813 | 12.0 |
| Quadratic | (0, v) | EBLUP | 0.2183 | 0.2177 | 5.2 |
| Quadratic | (0, v) | REBLUP | 0.2112 | 0.2263 | 8.1 |
| Quadratic | (e, v) | EBLUP | 0.6250 | 0.6142 | 4.9 |
| Quadratic | (e, v) | REBLUP | 0.3560 | 0.3095 | 12.7 |
| Exponential(1) | (0, 0) | EBLUP | 0.2012 | 0.1962 | 7.0 |
| Exponential(1) | (0, 0) | REBLUP | 0.2059 | 0.2199 | 9.6 |
| Exponential(1) | (e, 0) | EBLUP | 0.4926 | 0.4674 | 8.5 |
| Exponential(1) | (e, 0) | REBLUP | 0.3350 | 0.2910 | 12.5 |
| Exponential(1) | (0, v) | EBLUP | 0.2713 | 0.2542 | 13.2 |
| Exponential(1) | (0, v) | REBLUP | 0.2423 | 0.2373 | 11.0 |
| Exponential(1) | (e, v) | EBLUP | 0.6489 | 0.6339 | 6.3 |
| Exponential(1) | (e, v) | REBLUP | 0.3773 | 0.3242 | 13.3 |

We also considered $t$-distributions with 3 degrees of freedom to generate $v_i$ and $e_{ij}$ and obtained results similar to those under the contamination models. The $t$-distribution is also symmetric and exhibits long tails that can lead to outliers.
All in all, our simulation study indicates that the proposed REBLUP with enough knots (say K = 20) performs well in terms of MSPE when the true mean function is not necessarily linear. Also, the proposed bootstrap MSPE estimator seems to track the MSPE quite well.
8. DISCUSSION
Ugarte et al. (2009) used a B-spline basis, instead of the traditional truncated polynomial basis used by Opsomer et al. (2008), Ruppert, Wand, & Carroll (2003), and in our paper. Ruppert, Wand, & Carroll (2003) noted that a P-spline fit of degree $h$ may be expressed in terms of the B-spline basis of the same degree and the same knot locations. Hence, the linear mixed model representation can also be used to handle a B-spline basis. Opsomer et al. (2008) adapted the P-spline basis to handle radial basis functions for use in the context of spatial modelling.
We assumed symmetric contamination of the random area effects, $v_i$, and the unit errors $e_{ij}$. Chambers et al. (2009) and Gershunskaya (2010) proposed bias-corrected REBLUPs to handle asymmetric contamination in $v_i$ and $e_{ij}$ in the context of the nested error model (1). Extensions of this method to handle semi-parametric mixed models would be useful.

In the context of models without random area effects, Oh, Nychka, & Lee (2007) proposed robust curve estimators based on M-type estimation and penalty-based smoothing. Extensions of this approach to handle semi-parametric mixed models might also be useful.
We assumed sampling to be non-informative in the sense that the assumed population model also holds for the sample. It would be useful to extend the approach of Pfeffermann & Sverchkov (2007) to handle informative sampling.
Finally, it would be useful to study the properties of the synthetic predictors (17), (30), and (31) for non-sampled areas and of the associated bootstrap estimators of the MSPE.
ACKNOWLEDGEMENTS
The authors are thankful to the editor, associate editor, and two referees for constructive comments and suggestions. The authors also wish to thank the Natural Sciences and Engineering Research Council of Canada (NSERC) for support of this research.
BIBLIOGRAPHY
Battese, G. E., Harter, R. M., & Fuller, W. A. (1988). An error-components model for prediction of county crop areas using survey and satellite data. Journal of the American Statistical Association, 83, 28–36.
Breidt, F. J., Claeskens, G., & Opsomer, J. D. (2005). Model-assisted estimation for complex surveys using penalised splines. Biometrika, 92, 831–846.
Chambers, R., Chandra, H., Salvati, N., & Tzavidis, N. (2009). Outlier robust small area estimation. CSSM Working Paper, University of Wollongong, New South Wales, Australia.
Fellner, W. H. (1986). Robust estimation of variance components. Technometrics, 28, 51–60.
Gershunskaya, J. (2010). Robust small area estimation using a mixture model. Proceedings of the Section on Survey Research Methods, American Statistical Association.
González-Manteiga, W., Lombardía, M. J., Molina, I., Morales, D., & Santamaría, L. (2008). Bootstrap mean squared error of a small-area EBLUP. Journal of Statistical Computation and Simulation, 78, 443–462.
Hall, P. & Maiti, T. (2006). Nonparametric estimation of mean-squared prediction error in nested-error regression models. Annals of Statistics, 34, 1733–1750.
Henderson, C. R. (1963). Selection index and expected genetic advance. In Statistical Genetics and Plant Breeding, National Research Council Publication 982, National Academy of Sciences, Washington, DC, pp. 141–163.
Huber, P. J. (1964). Robust estimation of a location parameter. The Annals of Mathematical Statistics, 35, 73–101.
Jiang, J. (1996). REML estimation: Asymptotic behavior and related topics. The Annals of Statistics, 24, 255–286.
Jiang, J. (2010). Large Sample Techniques for Statistics. Springer, New York.
Jiang, J. & Lahiri, P. (2006). Mixed model prediction and small area estimation. Test, 15, 1–96.
Oh, H., Nychka, D., & Lee, T. (2007). The role of pseudo data for robust smoothing with application to wavelet regression. Biometrika, 94, 893–904.
Opsomer, J. D., Claeskens, G., Ranalli, M. G., Kauermann, G., & Breidt, F. J. (2008). Non-parametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society, Series B, 70, 265–286.
Pfeffermann, D. & Sverchkov, M. (2007). Small-area estimation under informative probability sampling of areas and within the selected areas. Journal of the American Statistical Association, 102, 1427–1439.
Rao, J. N. K. (2003). Small Area Estimation. Wiley, New York.
Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric Regression. Cambridge University Press, New York, NY.
Salibian-Barrera, M. & Van Aelst, S. (2008). Robust model selection using fast and robust bootstrap. Computational Statistics and Data Analysis, 52, 5121–5135.
Sinha, S. K. & Rao, J. N. K. (2009). Robust small area estimation. The Canadian Journal of Statistics, 37, 381–399.
Stahel, W. A. & Welsh, A. (1997). Approaches to robust estimation in the simplest variance components model. Journal of Statistical Planning and Inference, 57, 295–319.
Ugarte, M. D., Goicoa, T., Militino, A. F., & Durbán, M. (2009). Spline smoothing in small area trend estimation and forecasting. Computational Statistics and Data Analysis, 53, 3616–3629.
Received 18 December 2012
Accepted 17 July 2013