CHAPTER 1
ESTIMATION FOR COMPLETE DATA
1.1) Introduction
Complete data for a study means that every relevant observation is available and the exact
value of every observation is known. Examples of data that are not complete are:
- Grouped data, in which all that is recorded is the range of values in which the
observation belongs.
- Observations below a certain number are not available.
- For observations above a certain number, you are only told that the observation is
above that number. For example, in a mortality study, the data points may be the
amounts of time until death. For some individuals, you may only be told that the
person survived 5 years, but not exactly how long he survived.
The random variable X with which we will usually be concerned is one of the following two
types.
- A loss random variable can describe various types of loss-related quantities, such
as the amount of damage to property during a specified period of time, or the number
of accidents that a particular driver has in a one-year period.
- A failure time random variable describes the time until the occurrence of a
particular event.
We will consider estimation based on data in the form of a random sample. The sample is
used to construct the empirical distribution.
Suppose that the random sample consists of $n$ observations, $x_1, x_2, \ldots, x_n$. If the data is from
a loss distribution, then the $x_i$'s are loss amounts, and if the data is from a survival
distribution, they are times of death or failure. Knowing the exact value of each outcome
is what is referred to as complete data.
The empirical distribution assigns a probability of $\frac{1}{n}$ to each $x_j$. For a sample of size $n$,
let $y_1 < y_2 < \cdots < y_k$ be the $k$ unique values that appear in the sample, ordered from smallest
to largest, where $k$ is less than or equal to $n$. Let $s_j$ be the number of times the observation
$y_j$ appears in the sample. Thus, $\sum_{j=1}^{k} s_j = n$ (the total number of observed values).
The risk set at $y_j$ is denoted $r_j$ and is defined to be the set of observed values that are
greater than or equal to $y_j$. This can be interpreted in the survival time context. The number
that are at risk to die at a point in time is the number who are alive (and under observation)
at that time and will die then or at some later time. In this complete data situation, we know
the time of death of each individual.
When we have a random sample of $n$ observations, the number at risk at each death point
is as follows:
$r_1 = n$ are at risk at the first death time $y_1$;
there are $s_1$ deaths at time $y_1$, so there are $r_2 = n - s_1$ at risk at the second death time $y_2$;
there are $s_2$ deaths at time $y_2$, so there are $r_3 = r_2 - s_2 = n - (s_1 + s_2)$ at risk at the third
death time $y_3$, …
In other words, everyone at risk (alive) at time $y_j$ will die at some death point $y_j, y_{j+1}, \ldots$.
If we add up all the deaths from death point $y_j$ and later, that totals everyone still
alive coming up to death point $y_j$.
For instance, with the 8-point sample above, there is a death at time 1, two deaths at time
2, two deaths at time 4, one death at time 6, one death at time 7 and one death at time 9.
There are 6 death points.
The risk set at time 1 is $r_1 = \sum_{i=1}^{6} s_i = 8$; this is the number at risk of death just before the first
death time. Similarly, $r_2 = \sum_{i=2}^{6} s_i = 7$, $r_3 = 5$, $r_4 = 3$, $r_5 = 2$, $r_6 = 1$.
Note that the subscripts of $r$ identify the number of the death point (not the actual death
time). So $r_4 = 3$ means that at the 4th death point (which is time 6), there are 3 at risk just
before the death at time 6.
For instance, $S_8(4.5) = \frac{r_4}{n} = \frac{3}{8}$, since $y_3 = 4 \le 4.5 < 6 = y_4$.
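The bookkeeping for $y_j$, $s_j$ and $r_j$ can be sketched in Python. This is a minimal illustration, not part of the original notes: the function names are hypothetical, and the sample list is an assumption reconstructed from the death counts stated in the text (one death at time 1, two at time 2, and so on).

```python
from collections import Counter

def risk_table(sample):
    """Return rows (y_j, s_j, r_j) for a complete-data sample."""
    counts = Counter(sample)
    rows = []
    at_risk = len(sample)            # r_1 = n
    for y in sorted(counts):         # distinct values y_1 < y_2 < ... < y_k
        s = counts[y]                # s_j: number of observations equal to y_j
        rows.append((y, s, at_risk))
        at_risk -= s                 # r_{j+1} = r_j - s_j
    return rows

def empirical_survival(sample, t):
    """S_n(t): fraction of observations strictly greater than t."""
    return sum(1 for x in sample if x > t) / len(sample)

# Sample reconstructed from the death counts in the text (an assumption).
sample = [1, 2, 2, 4, 4, 6, 7, 9]
table = risk_table(sample)
for j, (y, s, r) in enumerate(table, start=1):
    print(j, y, s, r)
print(empirical_survival(sample, 4.5))
```

Counting the observations greater than 4.5 directly gives the same value as $r_4 / n$, because the risk set at the next death point consists of exactly those observations.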
$r_j$ denotes the number of elements in the risk set at $y_j$, and we can also interpret $r_j$ as
denoting the set of elements at risk at $y_j$.
For instance, suppose that the original sample of $x_i$'s are times of death of $n$ people, and
$y_j$ is one of the times at which deaths occur. Then $s_j$ would be the number of deaths that
occur at time $y_j$. $r_j$ denotes the number of people who are still alive just before time $y_j$,
and $r_j$ also denotes the set of people alive just before time $y_j$ (the distinction is between
the set and the number of objects in the set; $r_j$ is used to denote both).
In a mortality study on 10 lives, the times of death are 22, 35, 78, 101, 125, 237, 350, 350, 484,
600. The empirical distribution is used as a model for the underlying distribution of time
to death for the population. Calculate $F_{10}(100)$.
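As a quick check of this kind of calculation, the empirical distribution function at a point is just the fraction of observations at or below that point. The sketch below uses a hypothetical helper name, `empirical_cdf`, and the ten death times given above.

```python
def empirical_cdf(sample, x):
    """F_n(x): fraction of observations less than or equal to x."""
    return sum(1 for v in sample if v <= x) / len(sample)

times = [22, 35, 78, 101, 125, 237, 350, 350, 484, 600]
f_100 = empirical_cdf(times, 100)   # three of the ten deaths occur by time 100
print(f_100)
```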
Provide the empirical probability function and empirical distribution function for the data.
The empirical distribution function is a step function with jumps at each data point.
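$$
F_{94935}(x) =
\begin{cases}
0/94935 = 0, & x < 0, \\
81714/94935 = 0.860736, & 0 \le x < 1, \\
93020/94935 = 0.979828, & 1 \le x < 2, \\
94638/94935 = 0.996872, & 2 \le x < 3, \\
94888/94935 = 0.999505, & 3 \le x < 4, \\
94928/94935 = 0.999926, & 4 \le x < 5, \\
94935/94935 = 1, & x \ge 5.
\end{cases}
$$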
27, 82, 115, 126, 155, 161, 243, 294, 340, 384
Provide the empirical probability function for the data above.
A data set contains the numbers 1.0, 1.3, 1.5, 1.5, 2.1, 2.1, 2.1, 2.8. Determine the values
in the table and then obtain the empirical distribution function.
$j$    $y_j$    $s_j$    $r_j$
1
2
3
4
5
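One way to check a hand-filled table is to compute the empirical probability function and the values of the distribution function at its jump points. The sketch below is only an illustration: the function name is hypothetical, and the notation $p_n(y_j) = s_j / n$ is an assumption, since the notes do not fix a symbol for the empirical probability function.

```python
from collections import Counter
from fractions import Fraction

def empirical_pf_and_cdf(sample):
    """Return rows (y_j, p_n(y_j), F_n(y_j)) using exact fractions."""
    n = len(sample)
    counts = Counter(sample)
    rows, cumulative = [], 0
    for y in sorted(counts):
        cumulative += counts[y]                       # observations <= y_j
        rows.append((y, Fraction(counts[y], n), Fraction(cumulative, n)))
    return rows

data = [1.0, 1.3, 1.5, 1.5, 2.1, 2.1, 2.1, 2.8]
rows = empirical_pf_and_cdf(data)
for y, p, F in rows:
    print(y, p, F)
```

Between jump points the distribution function stays at the value it took at the previous jump, and it is 0 below the smallest observation.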
Empirical estimates of distribution-related quantities for the random variable X are found
by calculating the quantity in question within the empirical distribution.
The empirical estimate of the mean is the mean of the empirical distribution, and the
empirical estimate of the variance is the variance of the empirical distribution.
The empirical estimate of the mean of X is
$$\hat{\mu}'_1 = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$
also denoted $\hat{\mu}$. This is the sample mean.
The empirical estimate of the kth (raw) moment is $\hat{\mu}'_k = \frac{1}{n}\sum_{i=1}^{n} x_i^k$ (this is the kth moment
of the empirical distribution).
The empirical estimate of the variance is the variance of the empirical distribution,
which is
$$\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \quad\text{or}\quad \frac{1}{n}\sum_{i=1}^{n} x_i^2 - \bar{x}^2.$$
Note: If a question asks for the sample variance of a sample, we use the form
$\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$, but if the question asks for the empirical estimate of the variance for
the same sample, we use the form $\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$.
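The difference between the two variance formulas is easy to see numerically. The following sketch uses hypothetical function names and, as an assumption for illustration, the ten death times from the mortality study earlier in the section.

```python
def raw_moment(sample, k):
    """Empirical estimate of the kth raw moment: (1/n) * sum of x_i^k."""
    return sum(x ** k for x in sample) / len(sample)

def empirical_variance(sample):
    """Variance of the empirical distribution (divisor n)."""
    mean = raw_moment(sample, 1)
    return sum((x - mean) ** 2 for x in sample) / len(sample)

def sample_variance(sample):
    """Sample variance (divisor n - 1)."""
    mean = raw_moment(sample, 1)
    return sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)

times = [22, 35, 78, 101, 125, 237, 350, 350, 484, 600]
mean = raw_moment(times, 1)
emp_var = empirical_variance(times)
smp_var = sample_variance(times)
shortcut = raw_moment(times, 2) - mean ** 2   # second raw moment minus squared mean
print(mean, emp_var, smp_var, shortcut)
```

The two forms of the empirical variance agree up to floating-point rounding, and the sample variance is larger by the factor $n/(n-1)$.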
The empirical estimate of $E[(X \wedge u)^k]$, the kth limited moment with limit $u$, is the kth
limited moment of the empirical distribution:
$$\frac{1}{n}\left[\sum_{x_i \le u} x_i^k + u^k \cdot (\text{number of } x_i\text{'s} > u)\right].$$
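Since each observation contributes $\min(x_i, u)^k$, the limited moment can be sketched in one line. The function name and the limit $u = 300$ are hypothetical choices for illustration; the data are the ten death times from the mortality study earlier in the section.

```python
def empirical_limited_moment(sample, u, k=1):
    """Empirical estimate of E[(X ^ u)^k]: average of min(x_i, u)^k."""
    return sum(min(x, u) ** k for x in sample) / len(sample)

times = [22, 35, 78, 101, 125, 237, 350, 350, 484, 600]
lev_300 = empirical_limited_moment(times, 300)   # four observations are capped at 300
print(lev_300)
```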
The Nelson-Aalen estimator estimates the cumulative hazard function. Suppose the
cumulative hazard before time $y_1$ is known to be $b$. If at that time $s_1$ lives out of a risk
set of $r_1$ die, the hazard at time $y_1$ is estimated as $\frac{s_1}{r_1}$. Therefore the cumulative hazard
function is increased by the amount $\frac{s_1}{r_1}$ (in general, by $\frac{s_j}{r_j}$ at $y_j$) and becomes $b + \frac{s_1}{r_1}$. The Nelson-Aalen estimator
sets $\hat{H}(0) = 0$ and then, at each time $y_j$ at which an event occurs, sets
$$\hat{H}(y_j) = \hat{H}(y_{j-1}) + \frac{s_j}{r_j}.$$
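[Figure: timeline in $t$ showing $H(t) = b$ before $y_1$, $H(t) = b + \frac{s_1}{r_1}$ between $y_1$ and $y_2$, and $H(t) = b + \frac{s_1}{r_1} + \frac{s_2}{r_2}$ after $y_2$.]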
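The recursion can be sketched for complete data as follows. The function name is hypothetical, and the sample is an assumption: it is the 8-point sample described earlier by its death counts (one death at time 1, two at time 2, two at time 4, and one each at times 6, 7 and 9). Exact fractions are used so that the increments $s_j / r_j$ are visible.

```python
from collections import Counter
from fractions import Fraction

def nelson_aalen(sample):
    """Return [(y_j, H_hat(y_j))] for complete data, starting from H_hat(0) = 0."""
    counts = Counter(sample)
    at_risk = len(sample)            # r_1 = n
    H = Fraction(0)
    steps = []
    for y in sorted(counts):
        s = counts[y]
        H += Fraction(s, at_risk)    # add s_j / r_j at each event time
        steps.append((y, H))
        at_risk -= s                 # r_{j+1} = r_j - s_j
    return steps

sample = [1, 2, 2, 4, 4, 6, 7, 9]   # reconstructed from the stated death counts
steps = nelson_aalen(sample)
for y, H in steps:
    print(y, H, float(H))
```

With complete data the last increment is always 1, because everyone still at risk at the final death point dies there.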
The smoothed empirical estimate of the 100pth percentile, $\hat{\pi}_p$, is found in the following
way:
i) Order the sample values from smallest to largest, $x_{(1)}, \ldots, x_{(n)}$.
ii) Find the integer $g$ such that $\frac{g}{n+1} \le p < \frac{g+1}{n+1}$.
iii) $\hat{\pi}_p$ is found by linear interpolation:
$$\hat{\pi}_p = [g + 1 - (n+1)p]\,x_{(g)} + [(n+1)p - g]\,x_{(g+1)}.$$
Under this approach, $x_{(1)}$ is the $100\cdot\frac{1}{n+1}$-th sample percentile, $x_{(2)}$ is the
$100\cdot\frac{2}{n+1}$-th sample percentile, …, and $x_{(n)}$ is the $100\cdot\frac{n}{n+1}$-th sample
percentile. If $(n+1)p = g$ is an integer, then the smoothed empirical estimate
of the 100pth percentile is $x_{(g)}$.
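The three steps above can be sketched as a short function. The name `smoothed_percentile` is hypothetical, and raising an error when $(n+1)p$ falls outside $[1, n]$ is a design choice made here, since the interpolation formula has no order statistics to work with in that range. The data are the ten death times from the mortality study earlier in the section.

```python
import math

def smoothed_percentile(sample, p):
    """Smoothed empirical estimate of the 100p-th percentile."""
    xs = sorted(sample)              # x_(1) <= ... <= x_(n)
    n = len(xs)
    h = (n + 1) * p
    g = math.floor(h)                # g / (n + 1) <= p < (g + 1) / (n + 1)
    if g < 1 or h > n:
        raise ValueError("p must lie between 1/(n+1) and n/(n+1)")
    if h == g:                       # (n + 1) p is an integer: estimate is x_(g)
        return xs[g - 1]
    # Linear interpolation between x_(g) and x_(g+1); lists are 0-indexed.
    return (g + 1 - h) * xs[g - 1] + (h - g) * xs[g]

times = [22, 35, 78, 101, 125, 237, 350, 350, 484, 600]
median = smoothed_percentile(times, 0.5)   # (n + 1) p = 5.5, so g = 5
print(median)
```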
Grouped data has a set of intervals and the number of losses in each interval, but does not
give the exact value of each loss. This means that we know the empirical cumulative
distribution function only at the endpoints of the intervals. Grouped data is not complete data, but
we will consider a modification of the empirical distribution to handle it.
Let the group boundaries be $c_0 < c_1 < \cdots < c_r$, where often $c_0 = 0$, and let $n_j$ be the number of
observations in the interval $(c_{j-1}, c_j]$. The total number of
observations is $n = n_1 + n_2 + \cdots + n_r$. For the estimation methods presented in this section, we
require that the right endpoint of the final interval, $c_r$, is a finite number.
For grouped data, the distribution function is usually approximated by connecting the
points with straight lines, so that $F_n(c_0) = 0$, $F_n(c_1) = \frac{n_1}{n}$, $F_n(c_2) = \frac{n_1 + n_2}{n}$, etc., with linear
interpolation between successive $c_j$'s.
The empirical distribution function at the interval endpoint $c_j$ is
$$F_n(c_j) = \frac{1}{n}\sum_{i=1}^{j} n_i.$$
$F_n(c_j)$ is the fraction of all of the observations that are $\le c_j$.
The distribution function obtained in this way is denoted by $F_n(x)$ and is called the ogive. The
formula is
$$F_n(x) = \frac{c_j - x}{c_j - c_{j-1}}\,F_n(c_{j-1}) + \frac{x - c_{j-1}}{c_j - c_{j-1}}\,F_n(c_j), \qquad c_{j-1} \le x \le c_j.$$
The empirical density function can be obtained by differentiating the ogive. The
resulting function is called a histogram. It is constant between endpoints of intervals.
The formula is
$$f_n(x) = \frac{F_n(c_j) - F_n(c_{j-1})}{c_j - c_{j-1}} = \frac{n_j}{n\,(c_j - c_{j-1})}, \qquad c_{j-1} \le x < c_j,$$
where there are $n_j$ points in the interval and $n$ points altogether.
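The ogive and histogram formulas can be sketched as two small functions. The function names are hypothetical. The boundaries and counts are taken from the exercise at the end of this section, on the assumption that its second column gives the number of observations in each interval.

```python
from bisect import bisect_left, bisect_right

boundaries = [0, 100, 200, 500, 1000, 2000]   # c_0 < c_1 < ... < c_r
counts = [25, 20, 20, 20, 15]                 # n_1, ..., n_r (assumed to be counts)

def ogive(x, c, counts):
    """Piecewise-linear F_n(x) through the points (c_j, F_n(c_j))."""
    n = sum(counts)
    if x <= c[0]:
        return 0.0
    if x >= c[-1]:
        return 1.0
    j = bisect_left(c, x)                     # c[j-1] < x <= c[j]
    below = sum(counts[:j - 1])               # observations at or below c[j-1]
    f_lo = below / n
    f_hi = (below + counts[j - 1]) / n
    w = (x - c[j - 1]) / (c[j] - c[j - 1])    # interpolation weight on F_n(c_j)
    return (1 - w) * f_lo + w * f_hi

def histogram_density(x, c, counts):
    """f_n(x) = n_j / (n (c_j - c_{j-1})) on c_{j-1} <= x < c_j, else 0."""
    n = sum(counts)
    if x < c[0] or x >= c[-1]:
        return 0.0
    j = bisect_right(c, x)                    # c[j-1] <= x < c[j]
    return counts[j - 1] / (n * (c[j] - c[j - 1]))

print(ogive(150, boundaries, counts))
print(histogram_density(150, boundaries, counts))
```

A probability such as $P(a < X \le b)$ is then approximated by `ogive(b, ...) - ogive(a, ...)`, which equals the area under the histogram between $a$ and $b$.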
Example 1f (Histogram)
50 observed losses have been recorded in millions and grouped by size of loss as follows:
a) Find the empirical distribution function of the ogive corresponding to this data
set.
b) Find the density function corresponding to this data set.
Data set shows the number of slugs eaten by 1000 purple slug-eating monsters.
a) Find the distribution function of the ogive corresponding to this data set.
b) Find the probability density function of the histogram corresponding to this data.
c) Using an ogive and/or histogram, approximate the probability that a given purple
slug-eating monster has eaten between 234 and 315 slugs, inclusive.
The empirical estimate of the mean of X (first moment of X) is
$$\sum_{j=1}^{r} \frac{n_j}{n}\cdot\frac{c_j + c_{j-1}}{2}.$$
This can be interpreted as the weighted average of the interval midpoints, where the
weight for an interval is the proportion of all observations that are in that interval. The
general approach to finding empirical estimates for moments when data is in interval-grouped
form is to make the assumption that loss amounts are uniformly distributed
within each interval. For the first moment, we find the mean for each interval (the
midpoint of interval $(c_{j-1}, c_j]$ is $\frac{c_j + c_{j-1}}{2}$), apply the proportion of observations
in that interval, $\frac{n_j}{n}$, as the weight for that interval, and then sum over all intervals.
Since $\frac{c_j + c_{j-1}}{2}$ is the average value of an observation in the interval $(c_{j-1}, c_j]$, and there
are $n_j$ observations in that interval, the total of the observed values from that interval is
$n_j \cdot \frac{c_j + c_{j-1}}{2}$. The total of all observed values would be the sum over all intervals,
$\sum_{j=1}^{r} n_j \cdot \frac{c_j + c_{j-1}}{2}$, and the average of all $n$ values would be $\frac{1}{n}\sum_{j=1}^{r} n_j \cdot \frac{c_j + c_{j-1}}{2}$.
Applying the concept of uniform distribution of x's within each interval to the kth
moment gives the empirical estimate of the kth moment:
$$\sum_{j=1}^{r} \frac{n_j\,(c_j^{k+1} - c_{j-1}^{k+1})}{n\,(k+1)\,(c_j - c_{j-1})}.$$
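The kth-moment formula translates directly into code. The function name is hypothetical. The boundaries and counts are those of the exercise at the end of this section, on the assumption that its second column gives the number of observations per interval.

```python
boundaries = [0, 100, 200, 500, 1000, 2000]   # c_0 < c_1 < ... < c_r
counts = [25, 20, 20, 20, 15]                 # n_1, ..., n_r (assumed to be counts)

def grouped_moment(c, counts, k=1):
    """Empirical kth moment for grouped data, uniform within each interval."""
    n = sum(counts)
    return sum(
        nj * (hi ** (k + 1) - lo ** (k + 1)) / (n * (k + 1) * (hi - lo))
        for nj, lo, hi in zip(counts, c[:-1], c[1:])
    )

m1 = grouped_moment(boundaries, counts, 1)    # weighted average of midpoints
m2 = grouped_moment(boundaries, counts, 2)
variance = m2 - m1 ** 2                       # variance of the grouped empirical model
print(m1, m2, variance)
```

For $k = 1$ the summand reduces to $\frac{n_j}{n}\cdot\frac{c_j + c_{j-1}}{2}$, so the function reproduces the midpoint formula for the mean.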
The empirical estimate of the kth limited moment with limit $u$ is found by first
identifying the interval for which $c_{j-1} < u \le c_j$.
Case 1, $u = c_j$.
The empirical estimate of $E[X \wedge u]$ is
$$\sum_{i=1}^{j} \frac{n_i}{n}\cdot\frac{c_i + c_{i-1}}{2} + u\sum_{i=j+1}^{r} \frac{n_i}{n}.$$
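Case 2, $c_{j-1} < u < c_j$.
The empirical estimate of $E[(X \wedge u)^k]$ is
$$\sum_{i=1}^{j-1} \frac{n_i\,(c_i^{k+1} - c_{i-1}^{k+1})}{n\,(k+1)\,(c_i - c_{i-1})} + \frac{n_j\,(u^{k+1} - c_{j-1}^{k+1})}{n\,(k+1)\,(c_j - c_{j-1})} + \frac{n_j\,u^k\,(c_j - u)}{n\,(c_j - c_{j-1})} + u^k\sum_{i=j+1}^{r} \frac{n_i}{n}.$$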
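Both cases can be handled by one loop over the intervals: intervals entirely below $u$ contribute their full kth moment, intervals entirely above $u$ contribute $u^k$ times their proportion, and the interval containing $u$ is split at $u$. The sketch below uses a hypothetical function name and the boundaries and counts from the exercise at the end of this section, assumed to be the number of observations per interval.

```python
boundaries = [0, 100, 200, 500, 1000, 2000]   # c_0 < c_1 < ... < c_r
counts = [25, 20, 20, 20, 15]                 # n_1, ..., n_r (assumed to be counts)

def grouped_limited_moment(c, counts, u, k=1):
    """Empirical estimate of E[(X ^ u)^k] for grouped data."""
    n = sum(counts)
    total = 0.0
    for nj, lo, hi in zip(counts, c[:-1], c[1:]):
        if hi <= u:      # interval entirely at or below the limit
            total += nj * (hi ** (k + 1) - lo ** (k + 1)) / (n * (k + 1) * (hi - lo))
        elif lo >= u:    # interval entirely above the limit: every value capped at u
            total += u ** k * nj / n
        else:            # lo < u < hi: uncapped part plus capped part
            total += nj * (u ** (k + 1) - lo ** (k + 1)) / (n * (k + 1) * (hi - lo))
            total += nj * u ** k * (hi - u) / (n * (hi - lo))
    return total

lev_1000 = grouped_limited_moment(boundaries, counts, 1000)   # limit at an endpoint
lev_1400 = grouped_limited_moment(boundaries, counts, 1400)   # limit inside an interval
print(lev_1000, lev_1400)
```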
The empirical estimate of the 100pth percentile of X is found by solving for $\hat{\pi}_p$ from
the equation $F_n(\hat{\pi}_p) = p$, using the ogive as the empirical cumulative distribution function.
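Because the ogive is linear on each interval, the equation can be solved by locating the interval in which the cumulative proportion first reaches $p$ and interpolating inside it. The sketch below uses a hypothetical function name and the grouped data from the exercise that follows, assuming its second column gives the number of observations per interval.

```python
boundaries = [0, 100, 200, 500, 1000, 2000]   # c_0 < c_1 < ... < c_r
counts = [25, 20, 20, 20, 15]                 # n_1, ..., n_r (assumed to be counts)

def ogive_percentile(c, counts, p):
    """Solve F_n(x) = p for x, with F_n linear on each interval."""
    if not 0 <= p <= 1:
        raise ValueError("p must be between 0 and 1")
    n = sum(counts)
    target = p * n                  # number of observations that must lie at or below x
    cumulative = 0
    for nj, lo, hi in zip(counts, c[:-1], c[1:]):
        if nj > 0 and cumulative + nj >= target:
            # Fraction of this interval needed to reach the target count.
            return lo + (target - cumulative) / nj * (hi - lo)
        cumulative += nj
    return c[-1]

median = ogive_percentile(boundaries, counts, 0.5)
print(median)
```

If an interval has no observations, the ogive is flat there and the solution is not unique; this sketch skips such intervals and returns the left end of the next interval that has data.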
Interval          Number of observations
(0, 100]          25
(100, 200]        20
(200, 500]        20
(500, 1000]       20
(1000, 2000]      15
(a) Find the empirical estimate of the 1st and 2nd moments
(b) Find the empirical estimate of the variance
(c) Find the empirical limited expected values with
(i) limit $u = 1000$
(ii) limit $u = 1400$