
Biometrics 2011

Lecture 7

Friday is a Dean’s Holiday, but…


I will hold the Biometrics classes…

László Pótó
From the sample to the population…
 Remember: biometrics is about drawing conclusions from
the collected data (the sample) about the unknown population.
Typical questions:
- Does a given lab value (of a group of patients) differ from the
„healthy” value? (What is the expected value for healthy people?)
- Is a measuring tool or process precise enough (pipette, drug content
of pills, box of sugar, and so on…)?
- Does a complete series of measurements prove that the
values are above a certain limit (air or water pollution, …)?

 The problem: how to draw conclusions from x̄ and sx about µ and σ?
– x̄ and sx (and n, i.e. the measures of the sample) are known, but
what about µ and σ? In other words: which population does the sample come from?
Two methods: - estimation
- hypothesis testing
The estimation
 Point estimation: x̄ → µ and sx → σ
We did this when, based on the body height data of 50 students, we supposed
that µ = 170 cm and σ = 8 cm for the population.
 Interval estimation:
x̄, sx → a, b (two values), so that we can say that
µ is inside the (a, b) interval with a given probability.
 This „given probability” is the confidence of the estimation.
 It is chosen close to 100%, e.g. 99%, 95% or 90%.
 We already know some such intervals with a given confidence:
 for the body height (bh) data (µ = 170 cm, σ = 8 cm), from last week's table:
µ ± 2σ: 154-186 cm (95%)    µ ± 3σ: 146-194 cm (99.7%)
 for the means of 16-data samples (µ = 170 cm, σ/√n = 2 cm):
µ ± 2σ/√n: 166-174 cm (95%)    µ ± 3σ/√n: 164-176 cm (99.7%)

 In the above examples we made the estimations knowing µ and σ, but
the question now is this: estimate the unknown µ and σ based on the
known sample! How? → In 3 steps.
1st step: Distribution of the sample mean around µ
 Example for the bh data: the means of 16-data samples are distributed around µ:
[Figure: distribution of the sample means around µ; the µ ± z × σ/√n ranges contain 68%, 95% and 99.7% of the means]

 Let’s take all the 100 different means and draw around each of them
the ±2σ/√n (95%) interval!
How many of these 100 intervals contain µ?
[Figure: 100 intervals drawn around the sample means; 95 cases contain µ, 5 cases do not]

 The centers (means) of some of those ±2σ/√n intervals are located
inside the µ ± 2σ/√n range. These intervals will contain µ, but some means
are outside, and those intervals will not. „Out of 100” means: the 95 „inside”
intervals would contain µ (confidence), but the other 5 would not (error risk).
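The „95 out of 100” statement can be checked by a small simulation. The sketch below is only an illustration: µ = 170 cm, σ = 8 cm and n = 16 are taken from the slides, while the function name, the seed and the number of repetitions (20000 instead of 100, to reduce the random noise) are assumptions made for this example. With the rounded multiplier 2 the share comes out slightly above 95%.

```python
import math
import random

def coverage_known_sigma(mu=170.0, sigma=8.0, n=16, z=2.0, repeats=20000, seed=1):
    """Share of the intervals mean ± z*sigma/sqrt(n) that contain mu (sigma is known)."""
    rng = random.Random(seed)                 # fixed seed, so the run is repeatable
    half_width = z * sigma / math.sqrt(n)     # the same length for every interval
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        if mean - half_width <= mu <= mean + half_width:
            hits += 1
    return hits / repeats

coverage = coverage_known_sigma()
print(f"share of intervals containing mu: {coverage:.3f}")
```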
2nd step: σ is unknown, so replace σ/√n by sx/√n
 The S.E. (sx/√n) can be smaller or larger than σ/√n,
so what happens to the confidence?
[Figure: the 100 intervals, now with different lengths, on a 164-176 cm axis; 95 cases and 5 cases]

 How many of the 95 will be shorter, and how many of the 5 longer?
(based on the binomial distribution…)
 The confidence decreases!
 The decrease depends on the sample size!

 The background is the t distribution:
- a little lower in the middle and
wider at the tails than the normal distribution
- a different curve for each n
[Figure: t distribution curves compared with the normal curve; x axis from -3 to 3]
3rd step: increase the length of the intervals!
Instead of the x̄ ± z*σ/√n intervals use x̄ ± t*sx/√n!
In the case of 16 data, the mean ± 2.13*sx/√n interval
contains the expected value of the population with 95% probability.
t values (p = 95%):
n-1     t
2       4.30
5       2.57
8       2.31
10      2.23
15      2.13
20      2.09
50      2.01
1000    1.96
z       1.96
[Figure: the t distribution curve and the 100 lengthened intervals (1., 2., 3., … 95., 96., … 100.)]
Summary: Because of the lengthened intervals (the t value depends on n), 95
out of the 100 intervals contain µ again, although both their centers (the means are
different) and their lengths (the standard deviations are different) vary.
The x̄ ± t*sx/√n interval is the p% confidence interval for µ
(at n = 16 and p = 95%: t = 2.13).
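The effect of steps 2 and 3 can be illustrated by simulation as well. This is a sketch under stated assumptions: the population (µ = 170 cm, σ = 8 cm), n = 16 and the multipliers 1.96 and 2.13 come from the slides, while the function name, the seed and the 20000 repetitions are chosen only for this example. Each sample gets its own S.E. = sx/√n, so the intervals differ in both center and length.

```python
import math
import random

def coverage_estimated_sd(multiplier, mu=170.0, sigma=8.0, n=16, repeats=20000, seed=1):
    """Share of the intervals mean ± multiplier*s/sqrt(n) that contain mu (sigma is NOT used)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        mean = sum(sample) / n
        variance = sum((x - mean) ** 2 for x in sample) / (n - 1)   # sample variance
        se = math.sqrt(variance / n)                                # S.E. = s / sqrt(n)
        if abs(mean - mu) <= multiplier * se:
            hits += 1
    return hits / repeats

cov_z = coverage_estimated_sd(1.96)   # z multiplier with the estimated S.D.
cov_t = coverage_estimated_sd(2.13)   # t multiplier for n - 1 = 15
print(f"z = 1.96: {cov_z:.3f}   t = 2.13: {cov_t:.3f}")
```

With the z multiplier the share of intervals containing µ should fall noticeably below 95%, and with the t multiplier it should return to about 95%.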
Calculation of the confidence interval - 1
 1st example: based on the body height data of a group of 16:
Let us suppose the mean is 175 cm and the S.D. is 10 cm.
Can the expected value of the population be 170 cm?
The 95% C.I.: 175 ± 2.13*10/√16 cm = 175 ± 2.13*2.5 cm =
= 175 ± 5.33 cm = (169.67, 180.33) cm
The 170 cm is inside, so it is a possible µ! (with 95% confidence)

 2nd example: the serum albumin data of 9 patients:


 The mean of the 9 data is 3.9 mg/100 ml, the S.D. is 0.6 mg/100 ml.
Can the expected value (for this type of patient) be equal to
4.2 mg/100 ml, which is the mean of healthy people?
 The 95% C.I. is 3.9 ± 2.3*0.6/√9 mg/100 ml = 3.9 ± 2.3*0.2 (…)
= 3.9 ± 0.46 = (3.44, 4.36) mg/100 ml
 The 4.2 is inside, so it is a possible µ! (with 5% error risk)
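The two calculations above can be reproduced with a few lines of code. This is a minimal sketch: the t values are copied from the lecture table (95% confidence, keyed by n - 1), and the names `T_95` and `confidence_interval_95` are made up for this example. Because the table value 2.31 is used for n - 1 = 8 instead of the rounded 2.3, the albumin interval differs slightly in the third decimal.

```python
import math

# t values for 95% confidence, keyed by degrees of freedom (n - 1), from the lecture table
T_95 = {2: 4.30, 5: 2.57, 8: 2.31, 10: 2.23, 15: 2.13, 20: 2.09, 50: 2.01, 1000: 1.96}

def confidence_interval_95(mean, sd, n):
    """Return (low, high) of mean ± t*sd/sqrt(n); works only for the n - 1 values in T_95."""
    t = T_95[n - 1]                      # raises KeyError for sample sizes not in the table
    half_width = t * sd / math.sqrt(n)
    return mean - half_width, mean + half_width

# 1st example: body height, n = 16, mean 175 cm, S.D. 10 cm
bh_low, bh_high = confidence_interval_95(175.0, 10.0, 16)
print(f"body height: ({bh_low:.2f}, {bh_high:.2f}) cm, 170 inside: {bh_low <= 170 <= bh_high}")

# 2nd example: serum albumin, n = 9, mean 3.9, S.D. 0.6
alb_low, alb_high = confidence_interval_95(3.9, 0.6, 9)
print(f"albumin: ({alb_low:.2f}, {alb_high:.2f}), 4.2 inside: {alb_low <= 4.2 <= alb_high}")
```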
Calculation of the confidence interval - 2
 The drug content of pills at a pharmaceutical factory was checked
by measuring a sample of 16 pills.
 The measures are: n = 16, mean = 102.1 mg, S.D. = 4 mg
(note: these 3 numbers). Can the expected value be 100 mg?
The 95% conf. interval (in mg): 102.1 ± 2.13*4/√16 = 102.1 ± 2.13 =
= (99.97, 104.23) mg
 The 100 mg is inside it, so 100 mg is a possible µ!
(with 95% confidence or 5% error risk)
Interpretation: if we repeated the experiment 100 times, we would have 100 datasets with
100 different means and S.D.s. Calculating the 95% CI from each in the
above way, 95 out of the 100 different CIs would contain the real expected
value (µ) and only 5 CIs would not.
But please note that we cannot know which of the above 100 our single C.I. is:
one of the 95 (that „contain” µ) or one of the 5 (that do not)!
Let’s see the second method for answering such questions:
the hypothesis testing method!
The hypothesis testing – 1
 An „everyday life” model
 I seem to remember hearing the noise of heavy rain during the night. How
can I decide in the morning whether it was rain or just a dream?

1, Let’s suppose it was not rain… (it was just a dream…)


(2,) Decide what I mean by „probable” and by „not probable”…
(this is more or less obvious now!)
3, Estimate how probable the observed fact would be if the
hypothesis of point 1 were true. (Suppose it IS true now!)
4, Decide about the hypothesis („no rain” in this case)
a, When the result of point 3 is „not probable”, reject…
b, When the result of point 3 is „probable”, do not reject…
5, Conclusion

 Checking the method: try the opposite hypothesis at point 1
The hypothesis testing – 2.
 Hypothesis testing in biometrics
 „The drug content of 16 pills…” example.
Mean: 102.1 mg, S.D.: 4 mg. Can the expected value be 100 mg?
1, Suppose that µ = 100 mg is true! → No significant difference, the
difference is just by chance! This is the „null” hypothesis: H0
2, Let’s choose 5% as the lower end of „probable”. The „border for
decision” is α, so let it be α = 0.05 now.
3, If µ = 100 mg, then how probable is it that the mean of 16 data
would differ from 100 mg by at least 2.1 mg?
- As we saw last week: the difference between the mean and µ is t*S.E. (here
S.E. = 4 mg/√16 = 1 mg), where „t” follows the t distribution with df = n-1 (here 15).
- In our case t = 2.1/1 = 2.1 (times the S.E.). On the t15 curve, ±2.13 (figure!)
would „cut off” a 5% area (probability), so the probability of a difference of
„at least 2.1 times” the S.E. is > 5%. So p > 0.05 (= „probable”) (figure)
4, Decide about the hypothesis („µ = 100 mg”)
Because p > α at point 3 („probable”), do not reject it!
5, Conclusion: the mean is not significantly different from the
hypothetical expected value. So µ can be 100 mg!
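Steps 3 and 4 of the pill example can be written out as a short calculation. This is only a sketch: the numbers (mean 102.1 mg, S.D. 4 mg, n = 16, µ = 100 mg, critical t = 2.13 for df = 15 at α = 0.05) come from the slides, and the names `one_sample_t` and `T_CRIT_15` are chosen for this illustration. Comparing |t| with the critical value gives the same decision as comparing p with α.

```python
import math

def one_sample_t(mean, sd, n, mu0):
    """t = (mean - mu0) / S.E., where S.E. = sd / sqrt(n)."""
    se = sd / math.sqrt(n)
    return (mean - mu0) / se

T_CRIT_15 = 2.13          # from the lecture table: df = 15, alpha = 0.05

t_pills = one_sample_t(mean=102.1, sd=4.0, n=16, mu0=100.0)
reject_h0 = abs(t_pills) > T_CRIT_15   # |t| above the critical value means p < alpha

print(f"t = {t_pills:.2f}, reject H0: {reject_h0}")
```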
What did we do here?
 We checked how different the mean is from a hypothetical („H0”)
expected value (in S.E. units: „t” times).
 When the difference „t” is big
(= the area under the t curve outside ±t is small, which means that
a difference of at least this size has a small probability if H0 is true),
then our sample (the fact) is against our hypothesis (the null hypothesis).
See the everyday life model of hypothesis testing: reject the null hypothesis!

 When the difference „t” is small
(= the area under the t curve outside ±t is big, which means that
a difference of at least this size has a large probability if H0 is true),
then our sample (the fact) is not against our hypothesis
(the null hypothesis).
See the everyday life model of hypothesis testing: do not reject the null hypothesis!

 The probability (area) can be calculated from „t” (and n) using the
probability density function. By computer: „p =” (exact value), or from a table: „p <”.
This is the one sample t test.
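How a computer could obtain the „p =” value from t and n can be sketched by integrating the t probability density numerically. The density formula is the standard one for the t distribution; the use of Simpson's rule, the number of steps and the function names are assumptions of this example, not necessarily what a statistical package does internally.

```python
import math

def t_pdf(x, df):
    """Probability density function of the t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Area outside the (-|t|, |t|) interval; the inside area is found by Simpson's rule."""
    upper = abs(t)
    if upper == 0:
        return 1.0
    h = upper / steps                          # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    inside_half = total * h / 3                # area between 0 and |t|
    return 1 - 2 * inside_half                 # the density is symmetric around 0

p_pills = two_sided_p(2.1, 15)    # pill example: t = 2.1, df = 15
p_border = two_sided_p(2.13, 15)  # the table value that cuts off about 5%
print(f"p(t=2.1) = {p_pills:.4f}   p(t=2.13) = {p_border:.4f}")
```

For the pill example the result should be a little above 0.05, in line with the „p > 0.05” decision.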
Another (special) case
 The effect of diet + training was checked: did it lower the blood cholesterol? The
lab data of the 12 patients (2 datasets, but in paired arrangement…):
serial    1    2    3    4    5    6    7    8    9   10   11   12
before  201  231  221  260  228  237  326  235  240  267  284  201
after   200  236  216  233  224  216  296  195  207  247  210  209
diff     -1    5   -5  -27   -4  -21  -30  -40  -33  -20  -74    8
The „difference”: x̄ = -20.17, sx = 23.13, S.E. = sx/√n = 23.13/√12 = 6.68

1, H0: µ = 0 (the treatment is ineffective). 2, α = 0.05 (decision border).


3, The differences (in pairs): mean -20.17, S.D. 23.13, so the
given „difference” from 0 (the t value) is -20.17/6.68 = -3.02.
The probability of „at least this difference” is 1.17% (figure), less than α.
4, Here p < α, so reject H0.
5, Conclusion: the diet + training was effective.
The difference was significant.

This method is the paired t test.
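The numbers of the cholesterol example can be recomputed from the raw data. This sketch uses only the Python standard library; the data are copied from the table of the slide and the variable names are chosen for this illustration. The paired t test is simply the one sample t test applied to the differences, with µ = 0 as H0.

```python
import math
import statistics

before = [201, 231, 221, 260, 228, 237, 326, 235, 240, 267, 284, 201]
after = [200, 236, 216, 233, 224, 216, 296, 195, 207, 247, 210, 209]

# The differences (after - before) are the "one sample"
diffs = [a - b for a, b in zip(after, before)]

n = len(diffs)
mean_diff = statistics.mean(diffs)
sd_diff = statistics.stdev(diffs)      # sample S.D. (divides by n - 1)
se_diff = sd_diff / math.sqrt(n)
t_paired = (mean_diff - 0) / se_diff   # H0: the expected difference is 0

print(f"mean = {mean_diff:.2f}, S.D. = {sd_diff:.2f}, S.E. = {se_diff:.2f}, t = {t_paired:.2f}")
```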


The one sample t test - summary
 The variable (the data) is continuous and normally distributed…
 …and the question is about the mean (and the expected value)
(Is there some difference or effect… or was it just by chance?)

Then we can give the answer by the one sample t test. Calculate:


How probable is „at least this difference”
(t times the S.E.), due to chance only, when H0 is true?
The probability is the area outside the (-t, t) interval under the t
probability density function (by computer or from tables).
- If this probability (p) is less than a predefined limit (α), it means:
- the difference is big, the sample mean is „far” from the hypothetical expected value
- and it is rather unlikely to find a difference „at least this big” due to chance only,
so reject the null hypothesis.

- If this probability (p) is not less than the predefined limit (α), there is
no reason to reject the null hypothesis.

Special case: two pairwise connected datasets: paired t test.


In this case the difference data are the „one sample” and µ = 0 is the H0.
An (already known) example
 The effect of diet + training was checked: did it lower the blood cholesterol? The
lab data of the 12 patients are (2 datasets, but in paired arrangement…):
serial    1    2    3    4    5    6    7    8    9   10   11   12
before  201  231  221  260  228  237  326  235  240  267  284  201
after   200  236  216  233  224  216  296  195  207  247  210  209
diff     -1    5   -5  -27   -4  -21  -30  -40  -33  -20  -74    8
The „difference”: the treatment was effective in 10 cases out of 12 (negative difference), while in 2 it was not (positive).

1, H0: the treatment was ineffective, so the number of positive differences follows B(12, 0.5). 2, α = 0.05 (border value).
3, How probable is „at least that difference” from the expected k = 6?
This probability is 2*(p(k=0)+p(k=1)+p(k=2)) (figure)
= 2*(0.02%+0.29%+1.61%) ≈ 2*1.93% = 3.86%. It is less than α = 5%.
4, Here p < α, so reject H0.
5, Conclusion: the diet + training was effective.
The difference was significant.
This method is the sign test.
Please note: normal distribution was not supposed! The method can be applied precisely in those
cases when the data are not normally distributed (the t test is not applicable): this test works well there.
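The sign test probability can be checked with the binomial formula. This is a minimal sketch: the differences are copied from the slide, B(12, 0.5) is the H0 distribution given there, and the function and variable names are chosen for this example. It assumes there are no zero differences, which holds for these data.

```python
import math

def binom_pmf(k, n, p=0.5):
    """Probability of exactly k successes in n trials with success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

diffs = [-1, 5, -5, -27, -4, -21, -30, -40, -33, -20, -74, 8]

n = len(diffs)
positives = sum(d > 0 for d in diffs)          # cases where the cholesterol did not decrease
k = min(positives, n - positives)              # the rarer sign

# Two-sided p: probability of a result at least this far from the expected n/2
p_sign = min(1.0, 2 * sum(binom_pmf(i, n) for i in range(k + 1)))

print(f"positive differences: {positives} of {n}, p = {p_sign:.4f}")
```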
Goals for the 7th week - What was it today?
 The 3-step „road” to last week's method of statistical inference:
the interval estimation (understand it)
The confidence interval for the expected value:
x̄ ± t*sx/√n (calculate)
 The hypothesis testing
 an everyday life model for the 5 steps of the method (was there rain?)
1. formulate the starting hypothesis (H0)
2. what is the limit between „small” and „large” probability (α)?
3. what is the probability of observing
„at least this difference” from µ (when H0 holds)?
– t is the size, p is the probability (significance).
4. decide about H0: - when p < α, reject… („not probable…”)
- when p ≥ α, do not reject… („probable…”)
5. conclusion (based on H0, what is the meaning of the decision?)
 3 methods: one sample (+ paired) t tests and the sign test
 Coming next: compare the two methods, errors, …
From the textbooks:

 Belágyi: pp. 58-69 and 71-75


 Moore: pp. 340-364 and 411-434

Thank you for your attention!
