ASC3034 Survival Models BSc (Hons) in Actuarial Studies

CHAPTER 1
ESTIMATION FOR COMPLETE DATA

1.1) Introduction

A data-dependent distribution is at least as complex as the data or knowledge that
produced it, and the number of “parameters” increases as the number of data points or
amount of knowledge increases.

A parametric distribution is a set of distribution functions, each member of which is
determined by specifying one or more values called parameters. The number of
parameters is fixed and finite.

Complete data for a study means that every relevant observation is available and the exact
value of every observation is known. Examples of data that are not complete are:
- Grouped data, in which all that is recorded is the range of values in which the
observation belongs.
- Observations below a certain number are not available.
- For observations above a certain number, you are only told that the observation is
above that number. For example, in a mortality study, the data points may be the
amounts of time until death. For some individuals, you may only be told that the
person survived 5 years, but not exactly how long the person survived.

The random variable $X$ with which we will usually be concerned is of one of the following
two types.
- A loss random variable can describe various types of loss-related quantities, such
as the amount of damage to property during a specified period of time, or the number
of accidents that a particular driver has in a one-year period.
- A failure time random variable describes the time until the occurrence of a
particular event.

When analysing and estimating properties of the distribution of a random variable $X$,
sample information is available in one of the following formats.
- A random sample (independent observations) $x_1, x_2, \ldots, x_n$ of $n$ individual
observations.
- Grouped data, in which the range of the random variable is broken into a series of
intervals, $(-\infty, c_0], (c_0, c_1], (c_1, c_2], \ldots, (c_{r-1}, c_r], (c_r, \infty)$, and the
number of observations in interval $(c_{j-1}, c_j]$ is the integer $n_j$ (the individual
observation values are not known).

We will consider estimation based on data in the form of a random sample. The sample is
used to construct the empirical distribution.

PREPARED BY: LOW ANN ANN (AUGUST 2016 SEMESTER) 1


1.2) Empirical Distribution for Complete, Individual Data

The empirical distribution is a discrete random variable constructed from a random
sample.

Suppose that the random sample consists of $n$ observations, $x_1, x_2, \ldots, x_n$. If the data
is from a loss distribution, then the $x_i$'s are loss amounts, and if the data is from a
survival distribution, they are times of death or failure. Knowing the exact value of each
outcome is what is referred to as complete data.

The empirical distribution assigns a probability of $\frac{1}{n}$ to each $x_i$. For a sample of
size $n$, let $y_1 < y_2 < \cdots < y_k$ be the $k$ unique values that appear in the sample, ordered
from smallest to largest, where $k$ is less than or equal to $n$. Let $s_j$ be the number of times
the observation $y_j$ appears in the sample. Thus, $\sum_{j=1}^{k} s_j = n$ (the total number of
observed values).

For instance, if we have a sample of $n = 8$ points, say $x_1, x_2, \ldots, x_8$ are 7, 2, 4, 4, 6, 2, 1, 9,
then we have $k = 6$ distinct values. In numerical order, $y_1 = 1$, $y_2 = 2$, $y_3 = 4$, $y_4 = 6$,
$y_5 = 7$, $y_6 = 9$ with $s_1 = 1$, $s_2 = 2$, $s_3 = 2$, $s_4 = 1$, $s_5 = 1$, $s_6 = 1$. Note that $y_4 = 6$
indicates that time 6 is the 4th time point at which some deaths occur. This empirical
distribution is a 6-point discrete random variable based on the numerical values of the
$y$'s, and it has all the properties of a discrete random variable.

The empirical distribution probability function is defined to be

$$p_n(y_j) = \frac{\text{number of } x_i\text{'s that are equal to } y_j}{n} = \frac{s_j}{n}$$

The empirical distribution function is defined to be

$$F_n(t) = \frac{\text{number of } x_i\text{'s} \le t}{n}$$

The empirical survival function is $S_n(t) = 1 - F_n(t) = \frac{\text{number of } x_i\text{'s} > t}{n}$.
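
As an illustration, the three definitions above can be evaluated directly on the 8-point
sample 7, 2, 4, 4, 6, 2, 1, 9. The following is a minimal Python sketch; the function names
(`empirical_pf`, `empirical_cdf`, `empirical_survival`) are hypothetical helpers chosen for
this illustration, and exact fractions are used so that no rounding is involved.

```python
from collections import Counter
from fractions import Fraction


def empirical_pf(sample):
    """Return {y_j: s_j / n} for the distinct values y_j in the sample."""
    n = len(sample)
    counts = Counter(sample)  # s_j for each distinct value y_j
    return {y: Fraction(s, n) for y, s in sorted(counts.items())}


def empirical_cdf(sample, t):
    """F_n(t): proportion of observations less than or equal to t."""
    return Fraction(sum(1 for x in sample if x <= t), len(sample))


def empirical_survival(sample, t):
    """S_n(t) = 1 - F_n(t): proportion of observations greater than t."""
    return 1 - empirical_cdf(sample, t)


sample = [7, 2, 4, 4, 6, 2, 1, 9]
pf = empirical_pf(sample)

print(pf[2])                            # 1/4, since s_2 = 2 and n = 8
print(empirical_cdf(sample, 4))         # 5/8
print(empirical_survival(sample, 4.5))  # 3/8
```

The probabilities in `pf` sum to 1, as they must for a discrete distribution.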

The cumulative hazard rate function is defined as $H(x) = -\ln S(x)$.

If $S(x)$ is differentiable, $H'(x) = -\frac{S'(x)}{S(x)} = \frac{f(x)}{S(x)} = h(x)$ and $H(x) = \int_{-\infty}^{x} h(y)\,dy$.

The distribution function can be obtained from $F(x) = 1 - S(x) = 1 - e^{-H(x)}$. Therefore,
estimating the cumulative hazard function provides an alternative way to estimate the
distribution function.


The risk set at $y_j$ is denoted $r_j$ and is defined to be the set of observed values that are
greater than or equal to $y_j$. This can be interpreted in the survival time context. The number
that are at risk to die at a point in time is the number who are alive (and under observation)
at that time and will die then or at some later time. In this complete data situation, we know
the time of death of each individual.

When we have a random sample of $n$ observations, the number at risk at each death point
is as follows:
- $r_1 = n$ are at risk at the first death time $y_1$;
- there are $s_1$ deaths at time $y_1$, so there are $r_2 = n - s_1$ at risk at the second death
time $y_2$;
- there are $s_2$ deaths at time $y_2$, so there are $r_3 = r_2 - s_2 = n - (s_1 + s_2)$ at risk at the
third death time $y_3$, …

We can also formulate the risk set as

$$r_j = \sum_{i=j}^{k} s_i = s_j + s_{j+1} + \cdots + s_k.$$

In other words, everyone at risk (alive) at time $y_j$ will die at some death point $y_j, y_{j+1}, \ldots$.
If we add up all the deaths from death point $y_j$ and later, that totals everyone currently still
alive (coming up to death point $y_j$).

For instance, with the 8-point sample above, there is one death at time 1, two deaths at
time 2, two deaths at time 4, one death at time 6, one death at time 7 and one death at time 9.
There are 6 death points.

The risk set at time 1 is $r_1 = \sum_{i=1}^{6} s_i = 8$; this is the number at risk of death just before
the first death point at time 1. Similarly,

$$r_2 = \sum_{i=2}^{6} s_i = 7, \quad r_3 = 5, \quad r_4 = 3, \quad r_5 = 2, \quad r_6 = 1.$$

Note that the subscripts of $r$ identify the number of the death point (not the actual death
time). So $r_4 = 3$ means that at the 4th death point (which is time 6), there are 3 at risk just
before the death at time 6.
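
The recursion $r_1 = n$, $r_{j+1} = r_j - s_j$ can be sketched in a few lines of Python. This is
only an illustrative sketch; `risk_sets` is a hypothetical helper name, not part of any
library.

```python
from collections import Counter


def risk_sets(sample):
    """Return (y, s, r): distinct values, death counts, and numbers at risk."""
    counts = Counter(sample)
    y = sorted(counts)             # death points y_1 < ... < y_k
    s = [counts[v] for v in y]     # s_j deaths at y_j
    r = []
    at_risk = len(sample)          # r_1 = n
    for s_j in s:
        r.append(at_risk)
        at_risk -= s_j             # r_{j+1} = r_j - s_j
    return y, s, r


y, s, r = risk_sets([7, 2, 4, 4, 6, 2, 1, 9])
print(y)  # [1, 2, 4, 6, 7, 9]
print(s)  # [1, 2, 2, 1, 1, 1]
print(r)  # [8, 7, 5, 3, 2, 1]
```

Each `r[j]` also equals `sum(s[j:])`, which is the summation form of the risk set.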

The empirical survival function is

$$S_n(t) = \frac{r_j}{n} \quad \text{if } y_{j-1} \le t < y_j$$

This is the proportion still alive at time $t$ just before death time $y_j$.


For instance, $S_8(4.5) = \frac{r_4}{n} = \frac{3}{8}$, since $y_3 = 4 \le 4.5 < 6 = y_4$.

$r_j$ denotes the number of elements in the risk set at $y_j$, and we can also interpret $r_j$ as
denoting the set of elements at risk at $y_j$.

For instance, suppose that the original sample of $x_i$'s are times of death of $n$ people, and
$y_j$ is one of the times at which deaths occur. Then $s_j$ would be the number of deaths that
occur at time $y_j$. $r_j$ denotes the number of people who are still alive just before time $y_j$,
and $r_j$ also denotes the set of people alive just before time $y_j$ (the distinction is between
the set and the number of objects in the set; $r_j$ is used to denote both).

The empirical distribution function can be formulated as

$$F_n(t) = 1 - \frac{r_j}{n} = \frac{n - r_j}{n} \quad \text{if } y_{j-1} \le t < y_j$$

The numerator is the number of deaths from time 0 up to (not including) time $y_j$.
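
The risk-set form of $S_n(t)$ and $F_n(t)$ can be sketched as follows, using the 8-point sample
7, 2, 4, 4, 6, 2, 1, 9. This is a hedged illustration: `survival_from_risk_sets` is a
hypothetical name, and the sketch assumes the natural conventions $S_n(t) = 1$ before the first
death point and $S_n(t) = 0$ from the last death point onward.

```python
from bisect import bisect_right
from collections import Counter
from fractions import Fraction


def survival_from_risk_sets(sample, t):
    """S_n(t) = r_j / n, where y_j is the first death point strictly after t."""
    n = len(sample)
    counts = Counter(sample)
    y = sorted(counts)
    # r_j = number of observations greater than or equal to y_j
    r = [sum(counts[v] for v in y[j:]) for j in range(len(y))]
    j = bisect_right(y, t)   # 0-based index of the first y_j with y_j > t
    if j == len(y):          # t is at or beyond the last death point
        return Fraction(0)
    return Fraction(r[j], n)


sample = [7, 2, 4, 4, 6, 2, 1, 9]
S = survival_from_risk_sets(sample, 4.5)
print(S)      # 3/8, that is r_4 / n
print(1 - S)  # 5/8, the value of F_8(4.5)
```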

Example 1a (Empirical distribution function)

In a mortality study on 10 lives, times at death are 22, 35, 78, 101, 125, 237, 350, 350, 484,
600. The empirical distribution is used as a model for the underlying distribution of time
to death for the population. Calculate F10 (100) .

$$F_{10}(100) = \frac{\text{number of } x_i\text{'s} \le 100}{10} = \frac{3}{10} = 0.3$$

Example 1b (Empirical probability and distribution function)

Number of accidents    Number of drivers
0                      81,714
1                      11,306
2                       1,618
3                         250
4                          40
5                           7

Provide the empirical probability function and empirical distribution function for the data.


The empirical probability function is

$$p_{94935}(x) = \begin{cases}
81714/94935 = 0.860736, & x = 0, \\
11306/94935 = 0.119092, & x = 1, \\
1618/94935 = 0.017043, & x = 2, \\
250/94935 = 0.002633, & x = 3, \\
40/94935 = 0.000421, & x = 4, \\
7/94935 = 0.000074, & x = 5.
\end{cases}$$

The empirical distribution function is a step function with jumps at each data point.

$$F_{94935}(x) = \begin{cases}
0/94935 = 0, & x < 0, \\
81714/94935 = 0.860736, & 0 \le x < 1, \\
93020/94935 = 0.979828, & 1 \le x < 2, \\
94638/94935 = 0.996872, & 2 \le x < 3, \\
94888/94935 = 0.999505, & 3 \le x < 4, \\
94928/94935 = 0.999926, & 4 \le x < 5, \\
94935/94935 = 1, & x \ge 5.
\end{cases}$$

Example 1c (Empirical probability function)

27, 82, 115, 126, 155, 161, 243, 294, 340, 384
Provide the empirical probability function for the data above.


Example 1d (Risk set and Empirical distribution function)

A data set contains the numbers 1.0, 1.3, 1.5, 1.5, 2.1, 2.1, 2.1, 2.8. Determine the values
in the table and then obtain the empirical distribution function.

j    $y_j$    $s_j$    $r_j$
1
2
3
4
5


1.3) Empirical Estimates

Empirical estimates of distribution-related quantities for the random variable $X$ are found
by calculating the quantity in question for the empirical distribution.
The empirical estimate of the mean is the mean of the empirical distribution and the
empirical estimate of the variance is the variance of the empirical distribution.

The empirical estimate of the mean of $X$ is the mean of the empirical distribution,

$$\hat{\mu}'_1 = \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i, \quad \text{also denoted } \hat{\mu}.$$

This is the sample mean and is the mean of the empirical distribution.

The empirical estimate of the $k$th (raw) moment is $\hat{\mu}'_k = \frac{1}{n} \sum_{i=1}^{n} x_i^k$ (this is the $k$th
moment of the empirical distribution).

The empirical estimate of the variance is the variance of the empirical distribution,
which is

$$\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2 \quad \text{or} \quad \frac{1}{n} \sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$

Note: If a question asks for the sample variance of a sample, we use the form
$\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$, but if the question asks for the empirical estimate of the
variance for the same sample, we use the form $\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2$.
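
The difference between the two divisors can be seen numerically. The sketch below uses the
8-point sample 7, 2, 4, 4, 6, 2, 1, 9 from Section 1.2; `raw_moment` is a hypothetical helper
name, while `statistics.pvariance` (divisor $n$) and `statistics.variance` (divisor $n - 1$)
are functions of the Python standard library.

```python
import statistics


def raw_moment(sample, k):
    """Empirical estimate of the k-th raw moment: (1/n) * sum of x_i^k."""
    return sum(x ** k for x in sample) / len(sample)


sample = [7, 2, 4, 4, 6, 2, 1, 9]

mean = raw_moment(sample, 1)
empirical_var = raw_moment(sample, 2) - mean ** 2  # divisor n
sample_var = statistics.variance(sample)           # divisor n - 1

print(mean)                          # 4.375
print(empirical_var)                 # 6.734375
print(statistics.pvariance(sample))  # same value as empirical_var
print(sample_var)                    # larger by the factor n / (n - 1)
```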

The empirical estimate of $E[(X \wedge u)^k]$, the $k$th limited moment with limit $u$, is the $k$th
limited moment of the empirical distribution:

$$\frac{1}{n} \left[ \sum_{x_i \le u} x_i^k + u^k \cdot [\text{number of } x_i\text{'s} > u] \right]$$
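
A minimal Python sketch of the limited moment formula follows, evaluated on the 8-point sample
7, 2, 4, 4, 6, 2, 1, 9 with an illustrative limit $u = 5$ (the limit and the function name
`limited_moment` are assumptions made for this example only).

```python
def limited_moment(sample, u, k=1):
    """Empirical estimate of E[(X ^ u)^k]: values above u are capped at u."""
    n = len(sample)
    below = sum(x ** k for x in sample if x <= u)  # sum of x_i^k over x_i <= u
    above = sum(1 for x in sample if x > u)        # number of x_i greater than u
    return (below + u ** k * above) / n


sample = [7, 2, 4, 4, 6, 2, 1, 9]
print(limited_moment(sample, 5))       # (13 + 5 * 3) / 8 = 3.5
print(limited_moment(sample, 5, k=2))  # (41 + 25 * 3) / 8 = 14.5
```

When $u$ is at least as large as every observation, the limited moment reduces to the ordinary
empirical moment.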

The Nelson-Aalen estimator estimates the cumulative hazard function. Suppose the
cumulative hazard rate before time $y_1$ is known to be $b$. If at that time $s_1$ lives out of a
risk set of $r_1$ die, the hazard at time $y_1$ is $\frac{s_1}{r_1}$. Therefore the cumulative hazard
function is increased by the amount $\frac{s_1}{r_1}$, and becomes $b + \frac{s_1}{r_1}$. In general, the increase
at death time $y_j$ is $\frac{s_j}{r_j}$. The Nelson-Aalen estimator sets $\hat{H}(0) = 0$ and then at each time
$y_j$ at which an event occurs, $\hat{H}(y_j) = \hat{H}(y_{j-1}) + \frac{s_j}{r_j}$.


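[Figure: timeline in $t$. $H(t) = b$ before $y_1$; $s_1$ deaths out of $r_1$ lives at $y_1$ give
$H(t) = b + \frac{s_1}{r_1}$; $s_2$ deaths out of $r_2$ lives at $y_2$ give $H(t) = b + \frac{s_1}{r_1} + \frac{s_2}{r_2}$.]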

The Nelson-Aalen estimate of the cumulative hazard rate function is

$$\hat{H}(t) = \begin{cases}
0, & t < y_1, \\
\sum_{i=1}^{j-1} \frac{s_i}{r_i}, & y_{j-1} \le t < y_j, \; j = 2, 3, \ldots, k, \\
\sum_{i=1}^{k} \frac{s_i}{r_i}, & t \ge y_k.
\end{cases}$$

The Nelson-Aalen estimate of the survival function is $\hat{S}(x) = e^{-\hat{H}(x)}$,
and the Nelson-Aalen estimate of the distribution function is $\hat{F}(x) = 1 - e^{-\hat{H}(x)}$.
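
The Nelson-Aalen estimate can be sketched as a running sum of $s_j / r_j$ over the death
points that do not exceed $t$. The code below is an illustration on the 8-point sample
7, 2, 4, 4, 6, 2, 1, 9; the function name `nelson_aalen` and the evaluation point $t = 4.5$
are assumptions made for this example.

```python
import math
from collections import Counter
from fractions import Fraction


def nelson_aalen(sample, t):
    """Nelson-Aalen estimate of H(t) for complete individual data."""
    counts = Counter(sample)
    at_risk = len(sample)       # r_1 = n
    H = Fraction(0)
    for y in sorted(counts):    # death points in increasing order
        if y > t:
            break
        H += Fraction(counts[y], at_risk)  # add s_j / r_j
        at_risk -= counts[y]               # r_{j+1} = r_j - s_j
    return H


sample = [7, 2, 4, 4, 6, 2, 1, 9]
H_hat = nelson_aalen(sample, 4.5)
S_hat = math.exp(-H_hat)   # estimated survival function at 4.5
F_hat = 1 - S_hat          # estimated distribution function at 4.5

print(H_hat)  # 227/280, that is 1/8 + 2/7 + 2/5
```

Note that `S_hat` generally differs from the empirical survival value $S_8(4.5) = 3/8$,
because the two estimators are constructed differently.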

The smoothed empirical estimate of the 100$p$th percentile $\hat{\pi}_p$ is found in the following
way:
i) Order the sample values from smallest to largest, $x_{(1)} \le \cdots \le x_{(n)}$.
ii) Find the integer $g$ such that $\frac{g}{n+1} \le p \le \frac{g+1}{n+1}$.
iii) $\hat{\pi}_p$ is found by linear interpolation,
$$\hat{\pi}_p = [g + 1 - (n+1)p]\, x_{(g)} + [(n+1)p - g]\, x_{(g+1)}.$$
Under this approach, $x_{(1)}$ is the $100\left(\frac{1}{n+1}\right)$-th sample percentile, $x_{(2)}$ is the
$100\left(\frac{2}{n+1}\right)$-th sample percentile, …, $x_{(n)}$ is the $100\left(\frac{n}{n+1}\right)$-th sample
percentile. If $(n+1)p = g$ is an integer, then the smoothed empirical estimate
of the 100$p$th percentile is $x_{(g)}$.
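
Steps i) to iii) can be sketched in Python as follows. This is an illustrative sketch on the
8-point sample 7, 2, 4, 4, 6, 2, 1, 9; `smoothed_percentile` is a hypothetical name, and the
sketch assumes $1 \le (n+1)p \le n$ so that both order statistics needed for the
interpolation exist. Passing $p$ as a `Fraction` avoids floating-point rounding in $(n+1)p$.

```python
import math
from fractions import Fraction


def smoothed_percentile(sample, p):
    """Smoothed empirical estimate of the 100p-th percentile."""
    xs = sorted(sample)   # x_(1) <= ... <= x_(n)
    n = len(xs)
    pos = (n + 1) * p
    if pos < 1 or pos > n:
        raise ValueError("need 1 <= (n + 1) * p <= n")
    g = math.floor(pos)   # g / (n + 1) <= p <= (g + 1) / (n + 1)
    w = pos - g           # interpolation weight on x_(g+1)
    if w == 0:            # (n + 1) * p is an integer: the estimate is x_(g)
        return xs[g - 1]
    return (1 - w) * xs[g - 1] + w * xs[g]


sample = [7, 2, 4, 4, 6, 2, 1, 9]
print(smoothed_percentile(sample, 0.75))            # 0.25 * 6 + 0.75 * 7 = 6.75
print(smoothed_percentile(sample, Fraction(1, 3)))  # (n + 1) * p = 3, so x_(3) = 2
```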


Example 1e (Empirical estimates)


A random sample of $n = 8$ values from the distribution of $X$ is given: 3, 4, 8, 10, 12, 18, 22,
35. Find
a) The empirical estimate of the mean
b) The empirical estimate of the variance
c) The empirical limited expected value with limit $u = 20$
d) The Nelson-Aalen empirical estimate of $H(10)$ and $F(10)$
e) The smoothed empirical estimate of the 25th percentile
f) The empirical estimate of the expected cost per loss and the estimate of the expected
cost per payment when there is a deductible of 5.


1.4) Empirical Estimation from Grouped Data

Grouped data has a set of intervals and the number of losses in each interval, but does not
give the exact value of each loss. This means that we know the empirical cumulative
distribution function only at the endpoints of intervals. Grouped data is not complete data,
but we will consider a modification to the empirical distribution to handle it.
Let the group boundaries be $c_0 < c_1 < \cdots < c_r$, where often $c_0 = 0$. The total number of
observations is $n = n_1 + n_2 + \cdots + n_r$. For the estimation methods presented in this section,
we require that the right endpoint of the final interval is a finite number, $c_r < \infty$.
For grouped data, the distribution function is usually approximated by connecting the
points with straight lines, so that $F_n(c_0) = 0$, $F_n(c_1) = \frac{n_1}{n}$, $F_n(c_2) = \frac{n_1 + n_2}{n}$, etc., with
linear interpolation between successive $c_j$'s.

The empirical distribution function at the interval point $c_j$ is $F_n(c_j) = \frac{1}{n} \sum_{i=1}^{j} n_i$.
$F_n(c_j)$ is the fraction of all of the observations that are $\le c_j$.

The graph of the distribution function is denoted by $F_n(x)$ and is called the ogive. The
formula is

$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}} F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}} F_n(c_j), \quad c_{j-1} \le x \le c_j$$

The empirical density function can be obtained by differentiating the ogive. The
resulting function is called a histogram. It is constant between endpoints of intervals.
The formula is

$$f_n(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} = \frac{n_j}{n (c_j - c_{j-1})}, \quad c_{j-1} \le x < c_j,$$

since there are $n_j$ points in the interval and $n$ points altogether.
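
The ogive and histogram formulas can be sketched in Python as below. The grouped data used
here (boundaries 0, 10, 20, 50 with counts 2, 5, 3) is hypothetical and chosen only to keep
the arithmetic easy to follow; `ogive` and `histogram` are likewise illustrative names.

```python
from bisect import bisect_right


def ogive(boundaries, counts, x):
    """Piecewise-linear F_n(x) for boundaries c_0 < ... < c_r and counts n_1..n_r."""
    n = sum(counts)
    if x <= boundaries[0]:
        return 0.0
    if x >= boundaries[-1]:
        return 1.0
    j = bisect_right(boundaries, x)   # interval index with c_{j-1} <= x < c_j
    c_lo, c_hi = boundaries[j - 1], boundaries[j]
    F_lo = sum(counts[:j - 1]) / n    # F_n(c_{j-1})
    F_hi = sum(counts[:j]) / n        # F_n(c_j)
    return ((c_hi - x) * F_lo + (x - c_lo) * F_hi) / (c_hi - c_lo)


def histogram(boundaries, counts, x):
    """Empirical density f_n(x) = n_j / (n * (c_j - c_{j-1})) on c_{j-1} <= x < c_j."""
    n = sum(counts)
    if x < boundaries[0] or x >= boundaries[-1]:
        return 0.0
    j = bisect_right(boundaries, x)
    return counts[j - 1] / (n * (boundaries[j] - boundaries[j - 1]))


# Hypothetical grouped data: 2 values in (0, 10], 5 in (10, 20], 3 in (20, 50]
boundaries = [0, 10, 20, 50]
counts = [2, 5, 3]

print(ogive(boundaries, counts, 15))      # halfway between F(10) = 0.2 and F(20) = 0.7
print(histogram(boundaries, counts, 15))  # 5 / (10 * 10)
```

Under this model, the probability of falling between two points $a < b$ is approximated by
`ogive(boundaries, counts, b) - ogive(boundaries, counts, a)`.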


Example 1f (Histogram)
50 observed losses have been recorded in millions and grouped by size of loss as follows:

Size of Loss (X)    Number of Observed Losses
(0, 2]              25
(2, 10]             10
(10, 100]           10
(100, 1000]          5
Total               50

a) Find the empirical distribution function of the ogive corresponding to this data
set.
b) Find the density function corresponding to this data set.


Example 1g (Ogive and Histogram)

The data set shows the number of slugs eaten by 1000 purple slug-eating monsters.

Number of slugs eaten    Number of monsters in this range
0 – 35                    18
35 – 95                  230
95 – 145                 320
145 – 300                120
300 – 500                312

a) Find the distribution function of the ogive corresponding to this data set.
b) Find the probability density function of the histogram corresponding to this data.
c) Using an ogive and/or histogram, approximate the probability that a given purple
slug-eating monster has eaten between 234 and 315 slugs, inclusive.


The empirical estimate of the mean of $X$ (first moment of $X$) is

$$\sum_{j=1}^{r} \frac{n_j}{n} \left( \frac{c_j + c_{j-1}}{2} \right).$$

This can be interpreted as the weighted average of the interval midpoints, where the
weight for an interval is the proportion of all observations that are in that interval. The
general approach to finding empirical estimates for moments when data is in
interval-grouped form is to make the assumption that loss amounts are uniformly distributed
within each interval. For the first moment, we find the mean for each interval (the
midpoint of interval $(c_{j-1}, c_j]$ is $\frac{c_j + c_{j-1}}{2}$) and then apply the proportion of observations
in that interval as the weight for that interval $\left( \frac{n_j}{n} \right)$, and then sum over all intervals.

Since $\frac{c_j + c_{j-1}}{2}$ is the average value of an observation in the interval $(c_{j-1}, c_j]$, and there
are $n_j$ observations in that interval, the total of the observed values from that interval is
$n_j \cdot \frac{c_j + c_{j-1}}{2}$. The total of all observed values would be the sum over all intervals,
$\sum_{j=1}^{r} n_j \left( \frac{c_j + c_{j-1}}{2} \right)$. And the average of all $n$ values would be $\frac{1}{n} \sum_{j=1}^{r} n_j \left( \frac{c_j + c_{j-1}}{2} \right)$.

Applying the concept of uniform distribution of $x$'s within each interval to the $k$th
moment gives the empirical estimate of the $k$th moment:

$$\sum_{j=1}^{r} \frac{n_j}{n} \cdot \frac{c_j^{k+1} - c_{j-1}^{k+1}}{(k+1)(c_j - c_{j-1})}$$
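
The grouped-data moment formula can be sketched as a single loop over the intervals. The
data below (boundaries 0, 10, 20, 50 with counts 2, 5, 3) is hypothetical, and
`grouped_moment` is an illustrative name rather than an established function.

```python
def grouped_moment(boundaries, counts, k=1):
    """Empirical k-th moment for grouped data, assuming uniformity within intervals."""
    n = sum(counts)
    total = 0.0
    for j, n_j in enumerate(counts, start=1):
        lo, hi = boundaries[j - 1], boundaries[j]   # interval (c_{j-1}, c_j]
        total += n_j * (hi ** (k + 1) - lo ** (k + 1)) / ((k + 1) * (hi - lo))
    return total / n


# Hypothetical grouped data: 2 values in (0, 10], 5 in (10, 20], 3 in (20, 50]
boundaries = [0, 10, 20, 50]
counts = [2, 5, 3]

m1 = grouped_moment(boundaries, counts, k=1)  # weighted midpoints: (2*5 + 5*15 + 3*35) / 10
m2 = grouped_moment(boundaries, counts, k=2)
print(m1)            # 19.0
print(m2 - m1 ** 2)  # empirical variance estimate from the grouped data
```

For $k = 1$ the summand reduces to $n_j$ times the interval midpoint, which matches the mean
formula for grouped data.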

The empirical estimate of the $k$th limited moment with limit $u$ is found by first
identifying the interval for which $c_{j-1} < u \le c_j$.

Case 1, $u = c_j$ is an interval endpoint

The empirical estimate of $E[X \wedge u]$ is

$$\sum_{i=1}^{j} \frac{n_i}{n} \cdot \frac{c_i + c_{i-1}}{2} + u \sum_{i=j+1}^{r} \frac{n_i}{n}.$$

For the $k$th limited moment, the empirical estimate of $E[(X \wedge u)^k]$ is

$$\sum_{i=1}^{j} \frac{n_i (c_i^{k+1} - c_{i-1}^{k+1})}{n (k+1)(c_i - c_{i-1})} + u^k \sum_{i=j+1}^{r} \frac{n_i}{n}.$$


Case 2, $c_{j-1} < u < c_j$

The empirical estimate of $E[X \wedge u]$ is

$$\sum_{i=1}^{j-1} \frac{n_i}{n} \left( \frac{c_{i-1} + c_i}{2} \right) + \frac{n_j}{n} \left( \frac{c_{j-1} + u}{2} \right) \left( \frac{u - c_{j-1}}{c_j - c_{j-1}} \right) + \frac{n_j}{n}\, u \left( \frac{c_j - u}{c_j - c_{j-1}} \right) + u \sum_{i=j+1}^{r} \frac{n_i}{n}.$$

The empirical estimate of $E[(X \wedge u)^k]$ is

$$\sum_{i=1}^{j-1} \frac{n_i (c_i^{k+1} - c_{i-1}^{k+1})}{n (k+1)(c_i - c_{i-1})} + \frac{n_j (u^{k+1} - c_{j-1}^{k+1})}{n (k+1)(c_j - c_{j-1})} + \frac{n_j u^k (c_j - u)}{n (c_j - c_{j-1})} + u^k \sum_{i=j+1}^{r} \frac{n_i}{n}$$

The empirical estimate of the 100$p$th percentile of $X$ is found by solving for $\hat{\pi}_p$ from
the equation $F_n(\hat{\pi}_p) = p$ using the ogive of the empirical cumulative distribution function.
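
Because the ogive is linear on each interval, the equation $F_n(\hat{\pi}_p) = p$ can be solved by
inverse linear interpolation. The sketch below uses hypothetical grouped data (boundaries
0, 10, 20, 50 with counts 2, 5, 3); `ogive_percentile` is an illustrative name, and empty
intervals are skipped because the ogive is flat there and the solution is not unique.

```python
def ogive_percentile(boundaries, counts, p):
    """Solve F_n(x) = p on the ogive by inverse linear interpolation."""
    n = sum(counts)
    cum = 0
    for j, n_j in enumerate(counts, start=1):
        F_lo = cum / n             # F_n(c_{j-1})
        F_hi = (cum + n_j) / n     # F_n(c_j)
        if n_j > 0 and F_lo <= p <= F_hi:
            lo, hi = boundaries[j - 1], boundaries[j]
            return lo + (p - F_lo) / (F_hi - F_lo) * (hi - lo)
        cum += n_j
    raise ValueError("p must lie between 0 and 1")


# Hypothetical grouped data: 2 values in (0, 10], 5 in (10, 20], 3 in (20, 50]
boundaries = [0, 10, 20, 50]
counts = [2, 5, 3]

# F(10) = 0.2 and F(20) = 0.7, so the median lies 3/5 of the way through (10, 20]
print(ogive_percentile(boundaries, counts, 0.5))
```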

Example 1h (Empirical estimates for grouped data)

100 observations are grouped into $k = 5$ intervals:

(0, 100]        25
(100, 200]      20
(200, 500]      20
(500, 1000]     20
(1000, 2000]    15

(a) Find the empirical estimate of the 1st and 2nd moments
(b) Find the empirical estimate of the variance
(c) Find the empirical limited expected values with
(i) limit $u = 1000$
(ii) limit $u = 1400$
