
Journal of the Korean Statistical Society 46 (2017) 298–307


Optimal generalized case–cohort analysis with accelerated failure time model

Yongxiu Cao, Qinglong Yang, Jichang Yu ∗
School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan 430073, PR China

Article history: Received 24 January 2016; Accepted 29 October 2016; Available online 18 November 2016.

AMS 2000 subject classifications: 62D05; 62N01.

Keywords: Accelerated failure time model; Generalized case–cohort design; Induced smoothing; Optimal allocation; Survival data.

Abstract. The case–cohort design has been widely advocated in large cohort studies when the disease rate is low. When the event is not rare, it is desirable to consider a generalized case–cohort design, in which the covariates are observed only for a subcohort randomly selected from the underlying cohort and for a subset of additional failures outside the subcohort. In this article, we propose a smoothed weighted Gehan estimating equation for the regression parameters in the accelerated failure time model under the generalized case–cohort design. Asymptotic properties of the proposed estimators are developed. To demonstrate the effectiveness of the generalized case–cohort sampling, we compare it with simple random sampling in terms of asymptotic relative efficiency. Furthermore, we derive the optimal allocation of the subsamples for the proposed design. The finite sample properties are evaluated via simulation studies, and a real data set is analyzed to illustrate the estimation procedure.

© 2016 The Korean Statistical Society. Published by Elsevier B.V. All rights reserved.

1. Introduction

The case–cohort design (Prentice, 1986) has been widely used as a cost-effective sampling design in large epidemiological
studies with time-to-event data. In the case–cohort design, expensive covariates are measured only for a subcohort randomly
selected from the full cohort at the beginning of the study and for the additional failures outside the subcohort. This
design can be almost as efficient as the full cohort design when studying rare diseases or events. Statistical methods for the
case–cohort design have been well studied in the literature (e.g., Chen & Lo, 1999; Kalbfleisch & Lawless, 1988; Kong & Cai,
2009; Kong, Cai, & Sen, 2004; Lin & Ying, 1993; Self & Prentice, 1988; Sun, Sun, & Flournoy, 2004, and so on).
The case–cohort design is especially useful when the failure rate is low. In many studies, however, the failure rate may
not be low, and it is impractical to assemble the covariates of all failures due to financial constraints. Under such situations,
several authors have proposed to sample a subset of the failures instead of all of them in the case–cohort design, which is
called the generalized case–cohort (GCC) design. For example, Chen (2001) proposed the generalized case–cohort design and
studied its statistical properties under the Cox proportional hazards model (Cox, 1972). Cai and Zeng (2007) compared the
hazard functions of two samples by the log-rank test and provided a formula for the power function under the GCC design.
Kang and Cai (2009) studied multivariate failure time data with multiple disease outcomes from the GCC design under the
Cox proportional hazards model. The aforementioned articles are all within the framework of the Cox proportional hazards
model. Yu, Shi, Yang, and Liu (2014) studied the GCC design under the additive hazards model via weighted estimating
equations.

∗ Corresponding author.
E-mail address: yujc@zuel.edu.cn (J. Yu).

http://dx.doi.org/10.1016/j.jkss.2016.10.006

The Cox proportional hazards model and the additive hazards model are commonly used in survival analysis to assess the
effects of risk factors on the failure time; both are based on modeling the hazard function. However, in many applications it
may be more attractive to relate the failure time to the covariates directly. The accelerated failure time model, which linearly
relates the logarithm of the failure time to the covariates, has therefore gained increasing attention. The statistical methods
for the accelerated failure time model have been studied by many authors in the literature (e.g., Jin, Lin, Wei, & Ying, 2003;
Tsiatis, 1990; Wei, Ying, & Lin, 1990; Ying, 1993). Kong and Cai (2009) and Nan, Yu, and Kalbfleisch (2006) studied the
case–cohort design under the accelerated failure time model. Chiou, Kang, and Yan (2014) proposed fast rank-based inference
procedures for accelerated failure time models under the case–cohort design. To the best of our knowledge, however, the GCC
design has not yet been studied under the accelerated failure time model. In this article, we propose statistical inference
methods for analyzing generalized case–cohort data under the accelerated failure time model. To provide guidance for
practitioners, we also propose a method to design an optimal GCC sampling scheme for the accelerated failure time model.
The article is organized as follows. In Section 2, we introduce the generalized case–cohort design and propose the
smoothed weighted Gehan estimating equation approach for the unknown regression parameters in the accelerated failure
time model. The asymptotic properties of the proposed estimator are established in Section 3. In Section 4, we discuss the
optimal allocation of subsamples in the GCC design. In Section 5, we present the simulation studies to evaluate the finite
sample performance of the proposed method. A real data analysis is given to illustrate the proposed method in Section 6.
We provide some concluding remarks in Section 7. Proofs of the theoretical results are outlined in the Appendix.

2. Statistical inference procedure

2.1. Model

Let $T$ and $C$ denote the failure time and the corresponding censoring time. Due to right censoring, we can only observe
$X = \min(T, C)$ and $\delta = I(T \le C)$, where $I(\cdot)$ is the indicator function. Let $Z$ be a $p$-dimensional vector of covariates. Given the
covariates $Z$, the failure time $T$ and the censoring time $C$ are assumed to be independent. We consider the following accelerated
failure time model:
$$\log(T) = \beta_0' Z + \epsilon, \qquad (2.1)$$
where $\beta_0$ is an unknown $p$-vector of regression parameters and $\epsilon$ is a random error with an unspecified distribution
function $F$ and corresponding density function $f$.

2.2. Generalized case–cohort sampling

Suppose the underlying cohort has $n$ subjects, and $\{X_i, \delta_i, Z_i,\ i = 1, \ldots, n\}$ are independent copies of $(X, \delta, Z)$. In the
generalized case–cohort design, let the binary random variable $\xi_i$ indicate whether the $i$th subject is selected into the
subcohort, and let $\eta_i$ be the selection indicator for whether the $i$th subject is selected into the supplemental failure sample,
with $P(\eta_i = 1 \mid \xi_i = 0, \delta_i = 1) = p_r$, the supplemental failure sampling probability. Suppose the covariates $Z$ are
observed only for the selected subjects. Then $\xi_i + (1 - \xi_i)\delta_i\eta_i$ indicates whether $Z_i$ is observed, and the
observed data can be summarized as
$$\{X_i,\ \delta_i,\ Z_i(\xi_i + (1 - \xi_i)\delta_i\eta_i),\ i = 1, \ldots, n\}.$$
Assume that the sample sizes of the subcohort and the supplemental failures are $n_0$ and $n_1$, respectively, and that $n_v/n \to \rho_v$,
$n_0/n_v \to \rho_0$ and $n_1/n_v \to \rho_1$ in probability, where $n_v = n_0 + n_1$. Hence $p_r = \rho_1\rho_v/\{(1 - \rho_0\rho_v)\pi\}$ with $\pi = P(\delta = 1)$, which
can be estimated by the fraction of failures in the underlying cohort.
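
To see where this expression for $p_r$ comes from, one can match expected counts (a one-line derivation under the limits above):
$$E(n_1) = p_r\, E\,\#\{i : \xi_i = 0,\ \delta_i = 1\} = p_r\, n\pi(1 - \rho_0\rho_v) \;\Longrightarrow\; p_r = \frac{n_1/n}{\pi(1 - \rho_0\rho_v)} \to \frac{\rho_1\rho_v}{(1 - \rho_0\rho_v)\pi},$$
using $n_1/n = (n_1/n_v)(n_v/n) \to \rho_1\rho_v$ and $P(\xi_i = 1) = n_0/n \to \rho_0\rho_v$.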

2.3. Inference procedure

Define $e_i(\beta) = \log(X_i) - \beta' Z_i$, $i = 1, \ldots, n$. Let $N_i(\beta; t) = I(e_i(\beta) \le t, \delta_i = 1)$ and $Y_i(\beta; t) = I(e_i(\beta) \ge t)$ denote
the counting process and the at-risk process, respectively, where $t$ ranges over $(-\infty, \infty)$. In the full cohort design, where the
data $\{X_i, \delta_i, Z_i,\ i = 1, \ldots, n\}$ are completely observed, the regression parameters in model (2.1) can be estimated from the
following estimating equation:
$$U_{n,\psi}(\beta) = \sum_{i=1}^{n}\int_{-\infty}^{\infty}\psi(\beta; t)\,[Z_i - \bar{Z}(\beta; t)]\,dN_i(\beta; t) = 0, \qquad (2.2)$$
where $\psi$ is a possibly data-dependent weight function and $\bar{Z}(\beta; t) = S^{(1)}(\beta; t)/S^{(0)}(\beta; t)$ with
$S^{(d)}(\beta; t) = n^{-1}\sum_{j=1}^{n} Y_j(\beta; t) Z_j^d$ for $d = 0, 1$. The weight functions $\psi(\beta; t) = 1$ and $\psi(\beta; t) = S^{(0)}(\beta; t)$ correspond to the log-rank and
Gehan statistics, respectively. The asymptotic properties of the estimator from Eq. (2.2) have been studied by many authors
in the literature (e.g., Lai & Ying, 1991; Tsiatis, 1990; Ying, 1993). In the GCC design, the covariate history $Z$ is not completely

available for each subject in the underlying cohort, and the distribution of the selected supplemental failures differs from
that of the underlying population. Therefore, we propose the following weights to adjust for the biased sampling mechanism in
the GCC design:
$$W_i = \delta_i\xi_i + \frac{(1 - \delta_i)\xi_i}{\rho_0\rho_v} + \frac{(1 - \xi_i)\delta_i\eta_i\,\pi(1 - \rho_0\rho_v)}{\rho_1\rho_v}, \quad i = 1, \ldots, n, \qquad (2.3)$$
which are inverse probability weights (Horvitz & Thompson, 1952).
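
To make the sampling scheme and the weights in (2.3) concrete, the following sketch draws a subcohort of size $n_0$ by simple random sampling, draws supplemental failures outside the subcohort with probability $p_r$, and computes $W_i$. This is an illustrative sketch, not the authors' code; the failure indicators are simulated only so that the snippet is self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

n, n0, n1 = 600, 120, 60                     # cohort and subsample sizes
delta = rng.binomial(1, 0.3, size=n)         # failure indicators (illustrative rate)

rho_v = (n0 + n1) / n
rho0, rho1 = n0 / (n0 + n1), n1 / (n0 + n1)
pi_hat = delta.mean()                        # estimates pi = P(delta = 1)

# Subcohort: simple random sample of size n0 from the full cohort.
xi = np.zeros(n, dtype=int)
xi[rng.choice(n, size=n0, replace=False)] = 1

# Supplemental failures outside the subcohort, selected with probability p_r.
p_r = rho1 * rho_v / ((1 - rho0 * rho_v) * pi_hat)
eta = rng.binomial(1, min(p_r, 1.0), size=n)

# Inverse probability weights of Eq. (2.3); W_i = 0 exactly when Z_i is unobserved.
W = (delta * xi
     + (1 - delta) * xi / (rho0 * rho_v)
     + (1 - xi) * delta * eta * pi_hat * (1 - rho0 * rho_v) / (rho1 * rho_v))

observed = xi + (1 - xi) * delta * eta       # indicator that Z_i is observed
print(W.mean(), observed.sum())              # E(W_i) should be close to 1
```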
Under the GCC design, the regression coefficients $\beta_0$ in model (2.1) can be estimated by solving the following weighted
estimating equation:
$$\tilde{U}_{n,\tilde\psi}(\beta) = \sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta; t)\,W_i\,[Z_i - \tilde{Z}(\beta; t)]\,dN_i(\beta; t) = 0, \qquad (2.4)$$
where $\tilde\psi$ is a possibly data-dependent weight function and $\tilde{Z}(\beta; t) = \tilde{S}^{(1)}(\beta; t)/\tilde{S}^{(0)}(\beta; t)$ with
$\tilde{S}^{(d)}(\beta; t) = n^{-1}\sum_{j=1}^{n} W_j Y_j(\beta; t) Z_j^d$ for $d = 0, 1$. The weight functions $\tilde\psi(\beta; t) = 1$ and $\tilde\psi(\beta; t) = \tilde{S}^{(0)}(\beta; t)$ correspond to the log-rank
and Gehan statistics, respectively. In this article, we consider the case $\tilde\psi(\beta; t) = \tilde{S}^{(0)}(\beta; t)$. Hence, the weighted
Gehan estimating equation can be written as
$$\tilde{U}_{n,G}(\beta) = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta_i W_i W_j (Z_i - Z_j)\, I\{e_i(\beta) \le e_j(\beta)\} = 0, \qquad (2.5)$$
where $\tilde{U}_{n,G}(\beta)$ is monotone in each component of $\beta$ (Fygenson & Ritov, 1994). Let $\tilde\beta_n$ denote the estimator obtained by solving
estimating equation (2.5).
Due to the indicator function $I\{e_i(\beta) \le e_j(\beta)\}$ in (2.5), the weighted Gehan estimating equation is not continuous, which
makes solving (2.5) challenging. We therefore adopt the induced smoothing procedure, which yields a continuously
differentiable estimating equation that can be solved by standard numerical methods. Let $V$ be a $p$-dimensional standard
normal random vector; the estimating equation (2.5) can be replaced with
$$\bar{U}_{n,G}(\beta) = E_V\{\tilde{U}_{n,G}(\beta + n^{-1/2}V)\} = 0,$$
where the expectation is taken with respect to $V$ (Brown & Wang, 2007; Johnson & Strawderman, 2009). The smoothed
weighted Gehan estimating equation can then be written as
$$\bar{U}_{n,G}(\beta) = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta_i W_i W_j (Z_i - Z_j)\,\Phi\!\left(\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right) = 0, \qquad (2.6)$$
where $r_{ij}^2 = n^{-1}(Z_j - Z_i)'(Z_j - Z_i)$ and $\Phi(\cdot)$ is the standard normal distribution function. Let $\hat\beta_n$ denote the estimator obtained by solving
(2.6). In the next section, we show that $\hat\beta_n$ is strongly consistent for $\beta_0$ and asymptotically normal, and that its
asymptotic distribution is the same as that of $\tilde\beta_n$.
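
The following minimal sketch evaluates (2.6) and solves it with a generic root finder on toy full-cohort data (all $W_i = 1$); it is an illustration of the estimating equation under assumed inputs, not the authors' implementation. Under the GCC design one would pass the weights from Eq. (2.3) instead.

```python
import numpy as np
from scipy.optimize import root
from scipy.stats import norm

def smoothed_gehan_ee(beta, logX, delta, Z, W):
    """Smoothed weighted Gehan estimating equation (2.6), vectorized over pairs."""
    n = len(logX)
    e = logX - Z @ beta                        # e_i(beta) = log(X_i) - beta' Z_i
    dZ = Z[:, None, :] - Z[None, :, :]         # Z_i - Z_j, shape (n, n, p)
    r = np.sqrt(np.sum(dZ ** 2, axis=2) / n)   # r_ij; r_ij = 0 iff Z_i = Z_j
    safe_r = np.where(r > 0, r, 1.0)           # tied pairs vanish via (Z_i - Z_j) = 0
    Phi = norm.cdf((e[None, :] - e[:, None]) / safe_r)   # Phi((e_j - e_i)/r_ij)
    coef = (delta * W)[:, None] * W[None, :] * Phi       # delta_i W_i W_j Phi(...)
    return np.einsum('ij,ijp->p', coef, dZ) / n

# Toy full-cohort data just to exercise the solver.
rng = np.random.default_rng(1)
n = 200
Z = rng.binomial(1, 0.5, size=(n, 1)).astype(float)
T = np.exp(Z @ np.array([0.5]) + rng.normal(size=n))
C = rng.uniform(0, 8, size=n)
logX = np.log(np.minimum(T, C))
delta = (T <= C).astype(float)
W = np.ones(n)

beta_hat = root(smoothed_gehan_ee, x0=np.zeros(1),
                args=(logX, delta, Z, W), method='hybr').x
print(beta_hat)   # should be near the true value 0.5
```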

3. Asymptotic properties

Let $s^{(0)}(\beta; t)$, $s^{(1)}(\beta; t)$ and $e_Z(\beta; t)$ be the limits of $S^{(0)}(\beta; t)$, $S^{(1)}(\beta; t)$ and $\bar{Z}(\beta; t)$, respectively; by the strong law of
large numbers (Pollard, 1990), this convergence is almost sure. Define $M_i(\beta; t) = N_i(\beta; t) - \Lambda_i(\beta; t)$ with
$\Lambda_i(\beta; t) = \int_{-\infty}^{t} Y_i(\beta; u)\lambda(u)\,du$, where $\lambda(\cdot)$ is the common hazard function of the error terms. Then $M_i(\beta_0; t)$ is a
martingale. In order to obtain the asymptotic properties of the proposed estimator, we impose the following regularity
conditions:
(C1) The parameter space $\mathcal{B}$ containing $\beta_0$ is a compact subset of $\mathbb{R}^p$.
(C2) For any $t$ and $\beta \in \mathcal{B}$, $P(Y(\beta; t) = 1) > 0$.
(C3) The covariates $Z$ are bounded almost surely by a nonrandom constant $K$.
(C4) The density function $f$ of the random error and its derivative $f'$ are bounded, and $\int \{f'(t)/f(t)\}^2 f(t)\,dt < \infty$.
(C5) The matrix $\Sigma_A(\beta_0)$, defined as the limit of $n^{-1}\,\partial\bar{U}_{n,G}(\beta_0)/\partial\beta$, is nonsingular.

Theorem 1. Under conditions C1–C3, $n^{-1/2}\tilde{U}_{n,G}(\beta_0)$ is asymptotically normal with mean zero and covariance matrix
$$\Sigma_F(\beta_0) + \frac{1 - \rho_0\rho_v}{\rho_0\rho_v}\,\Sigma_C(\beta_0) + \frac{(1 - \rho_0\rho_v)\{\pi(1 - \rho_0\rho_v) - \rho_1\rho_v\}}{\rho_1\rho_v}\,\Sigma_G(\beta_0).$$

We defer the proof and the definitions of $\Sigma_F(\beta_0)$, $\Sigma_C(\beta_0)$ and $\Sigma_G(\beta_0)$ to the Appendix.

Theorem 2. Under conditions C1–C3, the weighted Gehan estimating function and the smoothed weighted Gehan estimating
function are asymptotically equivalent:
$$n^{-1/2}\tilde{U}_{n,G}(\beta_0) = n^{-1/2}\bar{U}_{n,G}(\beta_0) + o_p(1).$$

We defer the proof to the Appendix.

Theorem 3. Under conditions C1–C5, $\hat\beta_n$ is strongly consistent and $\sqrt{n}(\hat\beta_n - \beta_0)$ converges in distribution to a zero-mean
normal vector with covariance matrix
$$\Sigma_A^{-1}(\beta_0)\left[\Sigma_F(\beta_0) + \frac{1 - \rho_0\rho_v}{\rho_0\rho_v}\,\Sigma_C(\beta_0) + \frac{(1 - \rho_0\rho_v)\{\pi(1 - \rho_0\rho_v) - \rho_1\rho_v\}}{\rho_1\rho_v}\,\Sigma_G(\beta_0)\right]\{\Sigma_A^{-1}(\beta_0)\}'.$$

The proof is given in the Appendix.
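
In practice, the covariance in Theorem 3 is estimated by plugging in consistent estimates of $\Sigma_A$, $\Sigma_F$, $\Sigma_C$ and $\Sigma_G$ (defined in the Appendix). A sketch of assembling the sandwich, assuming those matrix estimates are already available:

```python
import numpy as np

def theorem3_covariance(Sigma_A, Sigma_F, Sigma_C, Sigma_G,
                        rho0, rho1, rho_v, pi):
    """Sandwich covariance of sqrt(n)(beta_hat - beta0) from Theorem 3."""
    c_C = (1 - rho0 * rho_v) / (rho0 * rho_v)
    c_G = ((1 - rho0 * rho_v)
           * (pi * (1 - rho0 * rho_v) - rho1 * rho_v)) / (rho1 * rho_v)
    meat = Sigma_F + c_C * Sigma_C + c_G * Sigma_G
    A_inv = np.linalg.inv(Sigma_A)
    return A_inv @ meat @ A_inv.T
```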

4. Optimal GCC sampling

An important practical question is how much efficiency is lost if we instead adopt the usual simple random sampling scheme to
obtain the $n_v$ subjects whose covariates are measured. Moreover, when $n_v$ is fixed in the GCC sampling, how to allocate the
subsamples between the subcohort and the supplemental failures is also of interest. As guidance, we propose a method to
design an optimal GCC sampling scheme for the accelerated failure time model.

Let $\hat\beta_G$ and $\hat\beta_N$ denote the estimators of $\beta_0$ from the proposed GCC design and from a simple random sampling
design with the same sample size, respectively. The asymptotic relative efficiency between the estimators $\hat\beta_N$ and $\hat\beta_G$ is
denoted by $ARE(\hat\beta_N, \hat\beta_G)$. From Theorem 3, we obtain
$$ARE(\hat\beta_N, \hat\beta_G) = \rho_v I_p + \frac{1 - \rho_0\rho_v}{\rho_0}\,\Sigma_A(\beta_0)\Sigma_F^{-1}(\beta_0)\Sigma_C(\beta_0)\Sigma_A^{-1}(\beta_0) + \frac{(1 - \rho_0\rho_v)\{\pi(1 - \rho_0\rho_v) - \rho_1\rho_v\}}{\rho_1}\,\Sigma_A(\beta_0)\Sigma_F^{-1}(\beta_0)\Sigma_G(\beta_0)\Sigma_A^{-1}(\beta_0),$$
where $I_p$ is the $p \times p$ identity matrix. Assume that the total size of the underlying cohort, $n$, and the number
of subjects with observed covariates, $n_v$, are fixed, which is equivalent to fixing $\rho_v$. The optimal GCC design is the
allocation of $n_0$ and $n_1$ that minimizes the trace of the asymptotic relative efficiency $ARE(\hat\beta_N, \hat\beta_G)$ under the constraint
that $n_v$ is fixed (Yu, Liu, Cai, Sandler, & Zhou, 2016; Yu, Liu, Sandler, & Zhou, 2015). Hence, finding the optimal allocation is
equivalent to the following optimization problem: minimize $TARE(\hat\beta_N, \hat\beta_G)$ subject to $\rho_0 + \rho_1 = 1$, where
$$TARE(\hat\beta_N, \hat\beta_G) = \rho_v p + \frac{1 - \rho_0\rho_v}{\rho_0}\,\mathrm{Trace}\{\Sigma_A(\beta_0)\Sigma_F^{-1}(\beta_0)\Sigma_C(\beta_0)\Sigma_A^{-1}(\beta_0)\} + \frac{(1 - \rho_0\rho_v)\{\pi(1 - \rho_0\rho_v) - \rho_1\rho_v\}}{\rho_1}\,\mathrm{Trace}\{\Sigma_A(\beta_0)\Sigma_F^{-1}(\beta_0)\Sigma_G(\beta_0)\Sigma_A^{-1}(\beta_0)\}, \qquad (4.1)$$
which is a function of $\rho_v$, $\rho_0$ and $\rho_1$. Because $\rho_v$ is fixed and $\rho_0 + \rho_1 = 1$, it is easy to obtain the optimal fraction $\rho_0^*$,
which tells us how many subjects should be selected into the subcohort in the GCC design to enhance the study efficiency.
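
Because $\rho_v$ is fixed and $\rho_1 = 1 - \rho_0$, (4.1) is a one-dimensional function of $\rho_0$, so $\rho_0^*$ can be found by a simple grid search. A sketch, where the two trace terms are assumed to have been estimated beforehand (the numeric values below are placeholders, not results from the paper):

```python
import numpy as np

def tare(rho0, rho_v, pi, p, trace_C, trace_G):
    """TARE in Eq. (4.1) as a function of rho0, with rho1 = 1 - rho0."""
    rho1 = 1 - rho0
    t = rho_v * p
    t += (1 - rho0 * rho_v) / rho0 * trace_C
    t += (1 - rho0 * rho_v) * (pi * (1 - rho0 * rho_v) - rho1 * rho_v) / rho1 * trace_G
    return t

# trace_C and trace_G stand for Trace{Sigma_A Sigma_F^{-1} Sigma_C Sigma_A^{-1}}
# and the analogous Sigma_G term, estimated from data; values here are made up.
grid = np.linspace(0.01, 0.99, 99)
vals = [tare(r0, rho_v=0.3, pi=0.2, p=2, trace_C=1.0, trace_G=0.5) for r0 in grid]
rho0_star = grid[int(np.argmin(vals))]
print(rho0_star)
```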

5. Simulation studies

In this section, we conduct simulation studies to investigate the finite sample performance of the proposed method.
We consider the following accelerated failure time model:
$$\log T = \beta_0 E + \gamma_0 Z + \epsilon, \qquad (5.1)$$
where $E$ follows the standard normal distribution, $Z$ follows the Bernoulli distribution with success probability 0.5, and the error
term $\epsilon$ follows the standard normal distribution. We set $\beta_0 = 0$ and $\gamma_0 = 0.5$. The censoring time is generated from the
uniform distribution over $[0, c]$, where $c$ is chosen to yield approximately 80% or 70% censoring. Let $\theta_0$ denote $(\beta_0, \gamma_0)'$. We
compare the proposed estimator $\hat\theta_G$ with four estimators, $\hat\theta_F$, $\hat\theta_S$, $\hat\theta_N$ and $\hat\theta_L$, which are the estimators based on the full
cohort, the simple random sampled subcohort alone, a simple random sampling design with the same sample size as the GCC
design, and the GCC design with the log-rank weight, respectively.
The full cohort size is $n = 600$ and we investigate different scenarios for the subsample sizes $(n_0, n_1)$. For each
configuration, we generate 1000 simulated data sets. The sample standard deviation of the 1000 estimates is given in the
column "SD". The column "SE" gives the average of the estimated standard errors, and "CI" is the empirical coverage of the
nominal 95% confidence interval based on the estimated standard errors. The simulation results are summarized in Table 1.
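
A sketch of one replication of the data generation under model (5.1); tuning $c$ numerically to hit a target censoring rate is our implementation choice, as the paper does not state how $c$ was selected.

```python
import numpy as np

rng = np.random.default_rng(2023)
n, beta0, gamma0 = 600, 0.0, 0.5

def simulate(c):
    """One data set from model (5.1) with Uniform[0, c] censoring."""
    E = rng.normal(size=n)                     # standard normal covariate
    Zb = rng.binomial(1, 0.5, size=n)          # Bernoulli(0.5) covariate
    T = np.exp(beta0 * E + gamma0 * Zb + rng.normal(size=n))
    C = rng.uniform(0, c, size=n)
    X, delta = np.minimum(T, C), (T <= C).astype(int)
    return X, delta, np.column_stack([E, Zb])

# Increase c until the empirical censoring rate drops to roughly 80%.
for c in np.linspace(0.1, 5.0, 50):
    rate = np.mean([1 - simulate(c)[1].mean() for _ in range(20)])
    if rate <= 0.80:
        break
print(round(c, 2), round(rate, 3))
```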
From Table 1, we can make the following observations. First, under all the situations considered here, the estimators
$\hat\theta_F$, $\hat\theta_S$, $\hat\theta_N$, $\hat\theta_L$ and $\hat\theta_G$ are all essentially unbiased. Second, the estimator $\hat\theta_F$ is more efficient than the other methods
because it is based on the full cohort. Third, the proposed variance estimator tracks the sample standard errors well, and
the confidence intervals attain coverage close to the nominal 95% level. Finally, the proposed GCC design is more efficient

Table 1
Simulation results under model (5.1) with β0 = 0 and γ0 = 0.5.

                                ------------ β0 ------------   ------------ γ0 ------------
Censor   (n0, n1)    Method     Mean     SD      SE      CI     Mean     SD      SE      CI
80%      (120, 60)   θF          0.003   0.060   0.063   0.958   0.503   0.134   0.128   0.928
                     θS          0.011   0.150   0.146   0.930   0.527   0.339   0.305   0.926
                     θN         −0.008   0.117   0.119   0.934   0.504   0.255   0.246   0.940
                     θL          0.001   0.105   0.103   0.924   0.507   0.192   0.201   0.952
                     θG         −0.001   0.092   0.092   0.962   0.489   0.190   0.187   0.934
         (240, 45)   θF          0.002   0.062   0.063   0.942   0.498   0.123   0.128   0.962
                     θS         −0.002   0.099   0.101   0.958   0.496   0.198   0.204   0.944
                     θN          0.005   0.096   0.091   0.932   0.498   0.181   0.187   0.958
                     θL          0.005   0.085   0.086   0.950   0.506   0.174   0.171   0.950
                     θG         −0.001   0.080   0.078   0.952   0.498   0.161   0.161   0.930
70%      (120, 85)   θF          0.000   0.057   0.055   0.936   0.493   0.116   0.112   0.936
                     θS         −0.004   0.140   0.132   0.922   0.516   0.264   0.266   0.928
                     θN          0.004   0.095   0.097   0.940   0.516   0.195   0.196   0.936
                     θL          0.001   0.106   0.096   0.934   0.512   0.193   0.187   0.940
                     θG         −0.001   0.087   0.085   0.934   0.497   0.170   0.170   0.952
         (240, 65)   θF         −0.001   0.054   0.055   0.946   0.501   0.113   0.112   0.936
                     θS         −0.002   0.086   0.090   0.964   0.498   0.182   0.181   0.942
                     θN         −0.002   0.077   0.079   0.948   0.510   0.156   0.158   0.950
                     θL          0.005   0.081   0.078   0.926   0.503   0.163   0.154   0.942
                     θG         −0.004   0.069   0.072   0.950   0.498   0.145   0.143   0.946

Notations: θF, θS, θN, θL and θG are the estimators based on the full cohort, the subcohort, the simple random sampling design with the same sample size as the GCC design, the GCC design with the log-rank weight, and the GCC design with the Gehan weight, respectively.

Table 2
Simulation results under model (5.2) with β0 = 0.5 and γ0 = 1.

                                ------------ β0 ------------   ------------ γ0 ------------
Censor   (n0, n1)    Method     Mean     SD      SE      CI     Mean     SD      SE      CI
80%      (120, 60)   θF          0.496   0.066   0.070   0.950   0.998   0.133   0.141   0.952
                     θS          0.523   0.177   0.160   0.918   1.034   0.350   0.338   0.928
                     θN          0.511   0.144   0.131   0.920   1.028   0.267   0.267   0.956
                     θL          0.506   0.117   0.110   0.930   1.014   0.227   0.220   0.934
                     θG          0.495   0.101   0.101   0.948   1.000   0.197   0.200   0.958
         (240, 45)   θF          0.506   0.073   0.070   0.944   1.003   0.140   0.143   0.956
                     θS          0.511   0.116   0.112   0.936   1.017   0.244   0.231   0.932
                     θN          0.511   0.109   0.102   0.922   1.016   0.204   0.210   0.956
                     θL          0.505   0.100   0.094   0.930   1.016   0.190   0.190   0.950
                     θG          0.508   0.088   0.086   0.946   0.999   0.178   0.175   0.950
70%      (120, 85)   θF          0.504   0.065   0.061   0.942   1.003   0.124   0.120   0.946
                     θS          0.513   0.143   0.139   0.938   0.998   0.291   0.277   0.946
                     θN          0.511   0.106   0.105   0.956   1.003   0.212   0.210   0.944
                     θL          0.506   0.105   0.100   0.934   1.011   0.195   0.198   0.952
                     θG          0.503   0.092   0.088   0.944   0.998   0.192   0.177   0.934
         (240, 65)   θF          0.499   0.065   0.061   0.936   1.001   0.129   0.121   0.934
                     θS          0.500   0.108   0.096   0.914   0.995   0.207   0.192   0.926
                     θN          0.504   0.089   0.085   0.928   0.998   0.180   0.171   0.926
                     θL          0.505   0.089   0.083   0.926   1.001   0.164   0.165   0.952
                     θG          0.496   0.080   0.076   0.936   0.991   0.165   0.152   0.912

Notations are the same as in Table 1.

than the simple random sampling design with the same sample size, and the Gehan weight is more efficient than the
log-rank weight, due to the fact that the Gehan weight can overcome the discontinuity of the estimating equations.
We consider another situation where the failure time is generated from the following model:
$$\log T = \beta_0 E + \gamma_0 Z + \epsilon, \qquad (5.2)$$
where $\beta_0 = 0.5$, $\gamma_0 = 1$ and the error term $\epsilon$ follows the standard normal distribution. The full cohort size is $n = 600$ and
we generate 1000 simulated data sets. We investigate different scenarios for the subsample sizes $(n_0, n_1)$, and the
simulation results are summarized in Table 2.

Table 3
Asymptotic relative efficiency between θG and θN.

Simulation model   Censor   (n0, n1)    ARE(β0)   ARE(γ0)
Model (5.1)        80%      (120, 60)   1.673     1.731
                            (240, 45)   1.361     1.349
                   70%      (120, 85)   1.302     1.329
                            (240, 65)   1.204     1.221
Model (5.2)        80%      (120, 60)   1.682     1.782
                            (240, 45)   1.407     1.440
                   70%      (120, 85)   1.424     1.408
                            (240, 65)   1.251     1.266
[Fig. 1. The trace of the asymptotic relative efficiency TARE(βN, βG) under different SRS fractions ρ0. (a) Censor = 80%. (b) Censor = 70%.]

Simulation results in Table 2 also confirm that the proposed estimator is unbiased, the proposed variance estimator
provides a good estimate of the sample standard errors, the proposed GCC design is more efficient than the simple
random sampling design with the same sample size, and the Gehan weight performs better than the log-rank weight.
Table 3 reports the asymptotic relative efficiency between the proposed estimator $\hat\theta_G$ and the standard estimator $\hat\theta_N$, which
is based on a simple random sampling design with the same sample size as the proposed GCC design.
From Table 3, we see that the asymptotic relative efficiency increases as the censoring rate increases. When the
censoring rate is fixed, the asymptotic relative efficiency decreases as the subcohort size $n_0$ increases. Most importantly, the
results in Table 3 confirm that the proposed design is more efficient than a simple random sampling design with the same
sample size.
We conduct a further simulation to evaluate the performance of the optimal allocation method proposed in Section 4. The
failure time is generated from model (5.1) and the censoring time is generated from the uniform distribution over $[0, c]$,
where $c$ is chosen to yield approximately 80% or 70% censoring. The full cohort size is set to $n = 800$. We generate 1000
simulated data sets and assess the performance of the optimal GCC design for $\rho_v$ equal to 0.3 and 0.5 by evaluating the
trace of the asymptotic relative efficiency between the proposed GCC design and the simple random sampling design. The
simulation results are presented in Fig. 1.

Table 4
Analysis results for NWTSG.

Method   Covariate   Estimate   SD      p-value
βN       histol      −3.338     0.416   0.000
         age         −0.088     0.088   0.316
         stage2      −2.666     0.617   0.000
         stage3      −1.348     0.584   0.021
         stage4      −1.676     0.814   0.040
         study       −0.113     0.460   0.805
βG       histol      −2.785     0.358   0.000
         age         −0.182     0.074   0.014
         stage2      −1.675     0.486   0.001
         stage3      −1.556     0.466   0.001
         stage4      −2.245     0.515   0.000
         study       −0.441     0.343   0.198

From Fig. 1(a), when the censoring rate is 80%, the optimal $\rho_0$ is 0.67 for $\rho_v = 0.3$ and 0.8 for $\rho_v = 0.5$. From Fig. 1(b),
when the censoring rate is 70%, the optimal $\rho_0$ is 0.5 for $\rho_v = 0.3$ and 0.6 for $\rho_v = 0.5$.

6. Real data analysis

In this section, we analyze a real data set from the National Wilms' Tumor Study Group (NWTSG) using the proposed method.
NWTSG is a cancer research group that studies Wilms' tumor, a kidney tumor affecting children (Green, Breslow, & Beckwith,
1998). We are interested in evaluating the relationship between the time to tumor relapse and the covariates. The tumor
histology, however, is difficult and expensive to measure. According to the cell type, the tumor histology can be classified
into two categories, favorable and unfavorable, and we let the variable histol indicate the category of the tumor histology.
Other covariates include patient age, disease stage and study group (NWTSG-3 and NWTSG-4).
We consider the following accelerated failure time model:
$$\log T = \beta_1\,\mathrm{histol} + \beta_2\,\mathrm{age} + \beta_3\,\mathrm{stage2} + \beta_4\,\mathrm{stage3} + \beta_5\,\mathrm{stage4} + \beta_6\,\mathrm{study} + \epsilon,$$
where the variables stage2, stage3 and stage4 indicate the disease stages and the variable study indicates the study group. The
full cohort includes 4028 subjects, of whom 571 experienced relapse. We select the subcohort by simple random sampling with
probability p = 0.1 from the full cohort and select the supplemental failure subjects outside the subcohort with probability
q = 0.2. We compare the proposed estimator $\hat\beta_G$ with $\hat\beta_N$, which is based on a simple random sampling design with the
same sample size as the GCC design. The results for NWTSG are summarized in Table 4.
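
A sketch of the subsampling step on the NWTSG cohort, assuming the data are available as a data frame with a binary relapse indicator; the column name `relapse` is hypothetical, not the variable name in the original data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)

def gcc_subsample(cohort: pd.DataFrame, p: float = 0.1, q: float = 0.2):
    """Bernoulli subcohort (prob. p) plus supplemental relapses (prob. q)."""
    m = len(cohort)
    xi = rng.binomial(1, p, size=m).astype(bool)      # subcohort indicator
    eta = rng.binomial(1, q, size=m).astype(bool)     # supplemental indicator
    relapse = (cohort['relapse'] == 1).to_numpy()     # hypothetical column name
    keep = xi | (~xi & relapse & eta)                 # subjects with histology measured
    return cohort.loc[keep], xi[keep]
```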
From Table 4, both methods confirm that tumor histology is significantly associated with cancer relapse. The proposed
method also finds age to be significantly associated with relapse, which differs from the result based on $\hat\beta_N$.

7. Concluding remarks

In this article, we studied the generalized case–cohort design under the accelerated failure time model. To account for the
biased sampling mechanism, we introduced a weighted estimating equation for the regression coefficients. Because the
proposed estimating equation is not smooth, which makes it challenging to solve, we adopted the induced smoothing
procedure to smooth the weighted Gehan estimating equation; this leads to continuously differentiable estimating equations
that can be solved by standard numerical methods. Simulation studies were conducted to assess the finite sample performance
of the proposed method, and we illustrated the method by analyzing a real data set from the National Wilms' Tumor Study
Group.
In this article, we considered time-invariant covariates. In future work, we will consider time-dependent covariates
in the accelerated failure time model under the GCC design. We also assumed Bernoulli-type sampling of the subcohort in
the GCC design; it would be interesting to investigate the performance of stratified sampling of the subcohort to enhance the
study efficiency. Research along these directions is currently under way.

Acknowledgments

The authors are grateful for the valuable comments and suggestions from the Associate Editor and the referees, which
substantially improved the article. This work was partly supported by the National Natural Science Foundation of China grants
11501578 (for Yu), 11301545 and 11671311 (for Yang).

Appendix

Proof of Theorem 1. By the definition of the martingale $M_i(\beta_0; t)$, we obtain
$$n^{-1/2}\tilde{U}_{n,G}(\beta_0) = n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)W_i[Z_i - \tilde{Z}(\beta_0;t)]\,dN_i(\beta_0;t) = n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)W_i[Z_i - \tilde{Z}(\beta_0;t)]\,dM_i(\beta_0;t) + n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)W_i[Z_i - \tilde{Z}(\beta_0;t)]\,d\Lambda_i(\beta_0;t). \qquad (A.1)$$
The second term of (A.1) is zero. Hence,
$$n^{-1/2}\tilde{U}_{n,G}(\beta_0) = n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)W_i[Z_i - e_Z(\beta_0;t)]\,dM_i(\beta_0;t) + n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)[e_Z(\beta_0;t) - \tilde{Z}(\beta_0;t)]\,dW_iM_i(\beta_0;t). \qquad (A.2)$$
First, we show that the second term of (A.2) satisfies
$$\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)[e_Z(\beta_0;t) - \tilde{Z}(\beta_0;t)]\,d\left\{\frac{1}{\sqrt{n}}\sum_{i=1}^{n}W_iM_i(\beta_0;t)\right\} = o_p(1). \qquad (A.3)$$
Without loss of generality, assume $Z_i \ge 0$; otherwise, decompose each $Z_i$ into its positive and negative parts. For each
$i$, the zero-mean process $W_iM_i(\beta_0;t)$ can be expressed as a sum of two monotone processes on the interval $[-M, M]$.
By conditions C1 and C3 and the boundedness of the follow-up time, the range of integration $(-\infty, \infty)$ in (A.1) can be
restricted to an interval $[-M, M]$, a compact subset of $\mathbb{R}$ with $M$ a positive constant, similar to condition A in Tsiatis (1990).
Thus, by Example 2.11.16 of van der Vaart and Wellner (1996), $n^{-1/2}\sum_{i=1}^{n}W_iM_i(\beta_0;t)$ converges weakly to a tight
Gaussian process with continuous sample paths on $[-M, M]$. Because $\tilde{Z}(\beta_0;t)$ is a product of two monotone processes, it
converges uniformly in probability to $e_Z(\beta_0;t)$ on the compact set $[-M, M]$. Therefore, (A.3) holds by Lemma A.1 of
Kulich and Lin (2000).
Second, the first term of (A.2) can be written as
$$n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)[Z_i - e_Z(\beta_0;t)]\,dM_i(\beta_0;t) + n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)(1-\delta_i)\{\xi_i/(\rho_0\rho_v) - 1\}[Z_i - e_Z(\beta_0;t)]\,dM_i(\beta_0;t) + n^{-1/2}\sum_{i=1}^{n}\int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)(1-\xi_i)\delta_i\{\pi(1-\rho_0\rho_v)\eta_i/(\rho_1\rho_v) - 1\}[Z_i - e_Z(\beta_0;t)]\,dM_i(\beta_0;t). \qquad (A.4)$$
Define $H_i(\beta_0) = \int_{-\infty}^{\infty}\tilde\psi(\beta_0;t)[Z_i - e_Z(\beta_0;t)]\,dM_i(\beta_0;t)$. Then (A.4) can be written as
$$n^{-1/2}\sum_{i=1}^{n}H_i(\beta_0) + n^{-1/2}\sum_{i=1}^{n}\frac{(1-\delta_i)(\xi_i - \rho_0\rho_v)}{\rho_0\rho_v}H_i(\beta_0) + n^{-1/2}\sum_{i=1}^{n}\frac{(1-\xi_i)\delta_i\{\pi(1-\rho_0\rho_v)\eta_i - \rho_1\rho_v\}}{\rho_1\rho_v}H_i(\beta_0). \qquad (A.5)$$
Some basic calculations show that the three terms on the right-hand side of (A.5) all have mean zero and are mutually
uncorrelated, and each is a scaled sum of independent and identically distributed zero-mean random vectors. It follows
from the multivariate central limit theorem that $n^{-1/2}\tilde{U}_{n,G}(\beta_0)$ converges in distribution to a zero-mean normal vector
with covariance matrix
$$\Sigma_F(\beta_0) + \frac{1-\rho_0\rho_v}{\rho_0\rho_v}\Sigma_C(\beta_0) + \frac{(1-\rho_0\rho_v)\{\pi(1-\rho_0\rho_v)-\rho_1\rho_v\}}{\rho_1\rho_v}\Sigma_G(\beta_0),$$
where $\Sigma_F(\beta_0) = E[H_1(\beta_0)^{\otimes 2}]$, $\Sigma_C(\beta_0) = E[(1-\delta_1)H_1(\beta_0)^{\otimes 2}]$, $\Sigma_G(\beta_0) = E[\delta_1 H_1(\beta_0)^{\otimes 2}]$ and
$a^{\otimes 2} = aa'$ for a vector $a$. Therefore, Theorem 1 holds.

Proof of Theorem 2. We have
$$n^{-1/2}\{\tilde{U}_{n,G}(\beta_0) - \bar{U}_{n,G}(\beta_0)\} = \frac{1}{n^{3/2}}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta_iW_iW_j(Z_i - Z_j)\left[I\{e_j(\beta_0) - e_i(\beta_0) \ge 0\} - \Phi\!\left(\frac{e_j(\beta_0) - e_i(\beta_0)}{r_{ij}}\right)\right].$$
Since $\bigl|I\{e_j(\beta_0) - e_i(\beta_0) \ge 0\} - \Phi\bigl(\{e_j(\beta_0) - e_i(\beta_0)\}/r_{ij}\bigr)\bigr| \le \Phi\bigl(-|e_j(\beta_0) - e_i(\beta_0)|/r_{ij}\bigr)$, we have
$$\bigl\|n^{-1/2}\{\tilde{U}_{n,G}(\beta_0) - \bar{U}_{n,G}(\beta_0)\}\bigr\| \le \frac{1}{n^2}\sum_{i=1}^{n}\sum_{j=1}^{n}\frac{\|\delta_iW_iW_j(Z_i - Z_j)\|\,\sqrt{n}\,r_{ij}}{|e_j(\beta_0) - e_i(\beta_0)|}\,\Phi\!\left(-\left|\frac{e_j(\beta_0) - e_i(\beta_0)}{r_{ij}}\right|\right)\left|\frac{e_j(\beta_0) - e_i(\beta_0)}{r_{ij}}\right|.$$
Note that $\lim_{x\to+\infty} x\Phi(-x) = 0$, since $\Phi(-x) \le (\sqrt{2\pi}\,x)^{-1}\exp\{-x^2/2\}$. Therefore, Theorem 2 holds by applying the strong
law of large numbers (Pollard, 1990) and the fact that $nr_{ij}^2 = (Z_j - Z_i)'(Z_j - Z_i)$.

Proof of Theorem 3. Recognizing that $\tilde{U}_{n,G}(\beta)$ is the gradient of the convex objective function
$$L_n(\beta) = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta_iW_iW_j\{e_j(\beta) - e_i(\beta)\}\,I\{e_j(\beta) - e_i(\beta) \ge 0\}, \qquad (A.6)$$
a parameter estimator can be obtained by minimizing $L_n(\beta)$ with respect to $\beta$, and the resulting set of solutions is convex.
However, the lack of smoothness presents computational challenges. Using standard results for normal random
variables and integration by parts, we obtain
$$\bar{L}_n(\beta) = n^{-1}\sum_{i=1}^{n}\sum_{j=1}^{n}\delta_iW_iW_j\left[\{e_j(\beta) - e_i(\beta)\}\,\Phi\!\left(\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right) + r_{ij}\,\phi\!\left(\frac{e_j(\beta) - e_i(\beta)}{r_{ij}}\right)\right], \qquad (A.7)$$
where $\phi(\cdot)$ is the standard normal density function. A straightforward calculation shows that $\bar{U}_{n,G}(\beta) = \partial\bar{L}_n(\beta)/\partial\beta$. Let
$\hat\beta_n = \arg\min_{\beta\in\mathcal{B}}\bar{L}_n(\beta)$. The smoothed objective function $\bar{L}_n(\beta)$ is convex and continuously differentiable, and standard
numerical methods can be used to obtain $\hat\beta_n$. By Lemmas 1 and 2 of Johnson and Strawderman (2009), the
respective minimizers $\tilde\beta_n$ and $\hat\beta_n$ of $L_n(\beta)$ and $\bar{L}_n(\beta)$ converge almost surely to $\beta_0$ (Andersen & Gill, 1982, Corollary II.2).
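
As a sanity check on (A.7), one can verify numerically that $\partial\bar{L}_n/\partial\beta$ reproduces $\bar{U}_{n,G}$; a sketch reusing `smoothed_gehan_ee` and the toy data (`logX`, `delta`, `Z`, `W`) from the Section 2.3 snippet:

```python
import numpy as np
from scipy.stats import norm

def smoothed_objective(beta, logX, delta, Z, W):
    """Smoothed convex objective (A.7); its gradient is Eq. (2.6)."""
    n = len(logX)
    e = logX - Z @ beta
    d = e[None, :] - e[:, None]                      # e_j(beta) - e_i(beta)
    dZ = Z[:, None, :] - Z[None, :, :]
    r = np.sqrt(np.sum(dZ ** 2, axis=2) / n)         # r_ij
    safe_r = np.where(r > 0, r, 1.0)
    term = np.where(r > 0,
                    d * norm.cdf(d / safe_r) + r * norm.pdf(d / safe_r),
                    np.maximum(d, 0.0))              # ties in Z: no smoothing occurs
    return np.sum((delta * W)[:, None] * W[None, :] * term) / n

# Central finite differences should agree with smoothed_gehan_ee(beta, ...).
beta, h = np.array([0.3]), 1e-6
num_grad = np.array([
    (smoothed_objective(beta + h * ek, logX, delta, Z, W)
     - smoothed_objective(beta - h * ek, logX, delta, Z, W)) / (2 * h)
    for ek in np.eye(len(beta))])
print(num_grad, smoothed_gehan_ee(beta, logX, delta, Z, W))
```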
Next, we prove the asymptotic normality of $\hat\beta_n$. A Taylor expansion of $\bar{U}_{n,G}(\beta)$ around $\beta_0$ gives
$$\bar{U}_{n,G}(\beta) - \bar{U}_{n,G}(\beta_0) = \frac{\partial\bar{U}_{n,G}(\beta)}{\partial\beta}\bigg|_{\beta=\beta^*}(\beta - \beta_0).$$
Inserting $\hat\beta_n$ into the above equation and using $\bar{U}_{n,G}(\hat\beta_n) = 0$, we have
$$n^{-1/2}\bar{U}_{n,G}(\beta_0) = -n^{-1}\frac{\partial\bar{U}_{n,G}(\beta^*)}{\partial\beta}\,\sqrt{n}(\hat\beta_n - \beta_0),$$
where $\beta^*$ lies between $\hat\beta_n$ and $\beta_0$. The asymptotic normality of $\hat\beta_n$ then follows from Theorems 1 and 2 together with
conditions C4 and C5. Therefore, Theorem 3 holds.

References

Andersen, P. K., & Gill, R. D. (1982). Cox's regression model for counting processes: A large sample study. Annals of Statistics, 10, 1100–1120.
Brown, B. M., & Wang, Y. G. (2007). Induced smoothing for rank regression with censored survival times. Statistics in Medicine, 26, 828–836.
Cai, J., & Zeng, D. (2007). Power calculation for case-cohort studies with nonrare events. Biometrics, 63, 1288–1295.
Chen, K. (2001). Generalized case-cohort sampling. Journal of the Royal Statistical Society: Series B, 63, 791–809.
Chen, K., & Lo, S. H. (1999). Case–cohort and case–control analysis with Cox’s model. Biometrika, 86, 755–764.
Chiou, S., Kang, S., & Yan, J. (2014). Fast accelerated failure time modeling for case-cohort data. Statistics and Computing, 24, 559–568.
Cox, D. R. (1972). Regression models and life tables (with discussion). Journal of the Royal Statistical Society: Series B, 34, 187–220.
Fygenson, M., & Ritov, Y. (1994). Monotone estimating equations for censored data. Annals of Statistics, 22, 732–746.
Green, D. M., Breslow, N. E., Beckwith, J. B., et al. (1998). Comparison between single-dose and divided-dose administration of dactinomycin and doxorubicin
for patients with Wilms' tumor: A report from the National Wilms Tumor Study. Journal of Clinical Oncology, 16, 237–245.
Horvitz, D. G., & Thompson, D. J. (1952). A generalization of sampling without replacement from a finite universe. Journal of the American Statistical Association,
47, 663–685.
Jin, Z., Lin, D. Y., Wei, L. J., & Ying, Z. (2003). Rank-based inference for the accelerated failure time model. Biometrika, 90, 341–353.
Johnson, L. M., & Strawderman, R. L. (2009). Induced smoothing for the semiparametric accelerated failure time model: asymptotics and extensions to
clustered data. Biometrika, 96, 577–590.

Kalbfleisch, J. D., & Lawless, J. F. (1988). Likelihood analysis of multi-state models for disease incidence and mortality. Statistics in Medicine, 7, 147–160.
Kang, S., & Cai, J. (2009). Marginal hazards model for case-cohort studies with multiple disease outcomes. Biometrika, 96, 887–901.
Kong, L., & Cai, J. (2009). Case–cohort analysis with accelerated failure time model. Biometrics, 65, 135–142.
Kong, L., Cai, J., & Sen, P. K. (2004). Weighted estimating equations for semiparametric transformation models with censored data from a case-cohort design.
Biometrika, 91, 305–319.
Kulich, M., & Lin, D. Y. (2000). Additive hazards regression with covariate measurement error. Journal of the American Statistical Association, 95, 238–248.
Lai, T. L., & Ying, Z. (1991). Rank regression methods for left truncated and right-censored data. Annals of Statistics, 19, 531–556.
Lin, D. Y., & Ying, Z. (1993). Cox regression with incomplete covariate measurements. Journal of the American Statistical Association, 88, 1341–1349.
Nan, B., Yu, M., & Kalbfleisch, J. D. (2006). Censored linear regression for case-cohort studies. Biometrika, 93, 747–762.
Pollard, D. (1990). Empirical processes: theory and applications. Hayward: Institute of Mathematical Statistics.
Prentice, R. L. (1986). A case-cohort design for epidemiologic cohort studies and disease prevention trials. Biometrika, 73, 1–11.
Self, S. G., & Prentice, R. L. (1988). Asymptotic distribution theory and efficiency results for case-cohort studies. Annals of Statistics, 16, 64–81.
Sun, J., Sun, L., & Flournoy, N. (2004). Additive hazards model for competing risks analysis of the case-cohort design. Communications in Statistics - Theory
and Methods, 33, 351–366.
Tsiatis, A. A. (1990). Estimating regression parameters using linear rank tests for censored data. Annals of Statistics, 18, 354–372.
van der Vaart, A. W., & Wellner, J. A. (1996). Weak convergence and empirical processes. New York: Springer-Verlag.
Wei, L. J., Ying, Z., & Lin, D. Y. (1990). Linear regression analysis of censored survival data based on rank test. Biometrika, 77, 845–851.
Ying, Z. (1993). A large sample study of rank estimation for censored regression data. Annals of Statistics, 21, 76–99.
Yu, J., Liu, Y., Cai, J., Sandler, D. P., & Zhou, H. (2016). Outcome-dependent sampling design and inference for Cox's proportional hazards model. Journal of
Statistical Planning and Inference, 178, 24–36.
Yu, J., Liu, Y., Sandler, D. P., & Zhou, H. (2015). Statistical inference for the additive hazards model under outcome-dependent sampling. The Canadian Journal
of Statistics, 43, 436–453.
Yu, J., Shi, Y., Yang, Q., & Liu, Y. (2014). Additive hazards regression under generalized case-cohort sampling. Acta Mathematica Sinica, English Series, 30,
251–260.
