
ES 670 Environmental Statistics

Data Analysis

Mean and Median


It is impossible to conduct a chemical analysis free of error or uncertainty, and data of unknown quality are worthless. Replicates (a set of repeated measurements) are therefore required instead of a single measurement.

The central value of the set is a more reliable measure than any individual result; the mean or the median is used. The variation among replicates provides a measure of uncertainty, which can be expressed as the standard deviation, the variance, or the coefficient of variation.

Mean

$$\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$$

Median

The middle result when replicate data are arranged in order from smallest to largest.
E.g. for a set of 5 measurements: 9, 2, 7, 11, 14
Arranged in ascending order: 2, 7, 9, 11, 14
The median is 9; its rank is (N + 1)/2 = 3.
If N is even, the median is the average of the two middle measurements.
The median is less sensitive to extreme values than the mean.

For data that are symmetrically distributed about the mean, the mean and median are equal.
For a skewed distribution, the mean shifts in the direction of the skewness (towards the longer tail).
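The following is a minimal Python sketch of the mean and median, using the five measurements from the median example. The second data set, with 14 replaced by 140, is a hypothetical illustration of how an extreme value moves the mean but not the median.

```python
import statistics

# Replicate measurements from the median example.
data = [9, 2, 7, 11, 14]

mean = statistics.mean(data)      # arithmetic mean: sum of values / N
median = statistics.median(data)  # middle value after sorting

# Hypothetical variant: the largest value is replaced by an extreme one.
with_outlier = [9, 2, 7, 11, 140]
mean_outlier = statistics.mean(with_outlier)
median_outlier = statistics.median(with_outlier)

print("mean:", mean, "median:", median)
print("with outlier, mean:", mean_outlier, "median:", median_outlier)
```

The median stays at the middle-ranked value in both cases, while the mean is pulled towards the extreme value.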

Precision vs. Accuracy


Precision:
Describes the reproducibility of measurements and can be determined by repeating the experiment.
It is indicated by the standard deviation, variance, or coefficient of variation.
It relates to the deviation from the mean:

$$d_i = x_i - \bar{x}$$

Accuracy:
Indicates the closeness of a measurement to its true or accepted value.
It is expressed by the error (absolute or relative).
It can never be determined exactly, since the true value is never known exactly.

Absolute error: $E = x_i - x_t$
Relative error: $E_r = \frac{x_i - x_t}{x_t} \times 100\%$
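As a small illustration of the absolute and relative error, here is a Python sketch. The measured and true values are hypothetical numbers chosen only for the example; the function names are likewise illustrative.

```python
def absolute_error(x_i, x_t):
    """Absolute error E: measured value minus true (accepted) value."""
    return x_i - x_t

def relative_error_percent(x_i, x_t):
    """Relative error E_r: absolute error as a percentage of the true value."""
    return (x_i - x_t) / x_t * 100.0

# Hypothetical values, not taken from the slides.
measured = 19.8
true_value = 20.0

E = absolute_error(measured, true_value)
E_r = relative_error_percent(measured, true_value)
print(f"E = {E:.2f}, E_r = {E_r:.2f}%")
```

The sign is kept: a negative error means the measurement is below the true value.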

Types of Errors in Experimental Data


Random (indeterminate) errors: affect precision.
Systematic (determinate) errors: reflect a bias and cause a series of measurements to be all high or all low.
Gross errors: produce an outlier, which may be very high or very low.

For systematic errors, the cause can be assigned, and it affects all data in the same way. Sources:
Instrument errors (e.g. an error in glassware markings)
Method errors (e.g. non-ideal behaviour of reagents)
Personal errors (e.g. an error in detecting a colour change)

Precision Vs Accuracy in Measurements


[Figure: absolute error, $\bar{x}_i - x_t$, in the micro-Kjeldahl determination of nitrogen for two compounds (1 and 2) by 4 different analysts. Analysts 1 and 2 analysed compound 1; Analysts 3 and 4 analysed compound 2. x axis: absolute error.]

Effect of Systematic Errors & its Detection


Constant error: the magnitude of the error does not depend on the size of the quantity measured, so it is more serious when the quantity measured is small.
E.g. the excess reagent required to cause a colour change in a titration.

Proportional error: the error increases or decreases in proportion to the sample size.
E.g. the presence of interfering compounds in the sample.

Systematic instrumental errors can be determined by calibration.
Personal errors can be minimized by self-discipline.
Bias in analytical methods can be minimized by using standard reference materials or an alternative, reliable analytical method.
Blank determinations can reveal errors due to interfering contaminants, e.g. a titration end-point correction can be made with blanks.

Random Errors in Analysis


Indeterminate errors are caused by uncontrollable variables; the variables that contribute to them cannot be identified or measured individually.
They cause a random scatter in the data.
Empirically, for most experimental data the distribution of replicates approaches a Gaussian curve (normal distribution).
Theoretically, this distribution arises from the combination of a large number of individual error components.

Calibration of a 10 mL Pipette

The exact volume of water delivered by a 10 mL pipette was measured gravimetrically 50 times; mass was converted to volume using the density of water at the measured temperature.
The data were rearranged to obtain a frequency distribution: the data series was divided into 0.003 mL groups, and the number and percentage of observations in each group were determined.

The frequency distribution was plotted as a bar graph (histogram).

Range: 9.969 to 9.995 mL
Mean: 9.982 mL
Median: 9.982 mL
Spread: 0.025 mL
SD: 0.0056 mL

As the number of measurements increases, the histogram approaches a continuous curve with a Gaussian (normal) distribution.
The Gaussian curve has the same mean and the same standard deviation (SD) as the histogram; it therefore has the same precision and the same area under the curve.
Sources of random uncertainty: reading the 10 mL mark, drainage time, the angle at which the pipette is held, temperature fluctuations affecting viscosity and the performance of the balance, and vibration or drafts affecting the balance.

Reproducibility / Repeatability

Both terms relate to precision.
Repeatability (within-run precision): an analyst makes 5 replicate measurements in quick succession using the same reagents and glassware.
Reproducibility (between-run precision): the same analyst takes the same readings on 5 different occasions; the data are then subject to differences in reagents, glassware, and lab conditions.

Error estimates based on sequentially repeated observations may give a false sense of security regarding precision.
More emphasis needs to be put on reproducibility, which highlights the differences in observations when replicate experiments are performed in random sequence.

Normality / Randomness / Independence


Most statistical procedures assume normality, randomness, and independence.
Normality: the measurement error comes from a normal distribution. Because of the central limit effect, many additive component errors lead to a normal-like distribution.

This is not a very restrictive criterion.
If errors are not normally distributed, transformations are available to make them normal-like.
Most tests are robust to deviations from normality.

Randomness: observations are drawn from a population such that every element of the population has an equal chance of being drawn.
Randomization of sampling helps ensure that observations are independent.
Example of non-randomness: if, in 10 replicate measurements, all the early measurements are high compared with the late ones, the measurements are not random.

Randomness can be checked with a plot of measurement error versus order of observation.
It is good to check for randomness with respect to each identifiable factor that can affect the measurement.

Independence: implies that the simple multiplicative law of probability holds, i.e. the probability of the joint occurrence of two events is the product of the probabilities of the individual occurrences.
Lack of independence can seriously distort the variance and the results of statistical tests.

Statistical treatment of Random Error


Sample vs. Population


Sample: the finite number of experimental observations.
Population: the infinite number of possible observations that could in principle be made given infinite time.
Statistical laws are derived for populations; when applied to small samples, these laws may need to be modified.

Population Mean & sample mean


If the number of observations is small, the two are not the same: $\bar{x} \neq \mu$.
The population mean $\mu$ is the true mean of the population.
The sample mean $\bar{x}$ is an estimator of the population mean.
If a measurement has no systematic error, $\mu = x_t$, the true value.
The difference between $\bar{x}$ and $\mu$ decreases as N increases.

Population and Sample mean


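[Figure: normal error curves A and B, plotted against x and against z.]

The population standard deviation (a measure of precision):

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N}}$$

$$z = \frac{x - \mu}{\sigma}$$

z is the deviation from the mean expressed in units of standard deviation.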

Characteristics of the normal error curve


The mean occurs at the central point of maximum frequency.
There is a symmetrical distribution of positive and negative deviations about the mean.
There is an exponential decrease in frequency as the magnitude of the deviations increases, i.e. small random uncertainties are observed more often than large ones.

Areas under a Gaussian curve:
68.3% of the area lies within $\pm 1\sigma$ of the mean
95.5% of the area lies within $\pm 2\sigma$ of the mean
99.7% of the area lies within $\pm 3\sigma$ of the mean

The standard deviation is therefore a very useful predictive tool.
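The areas under the Gaussian curve can be checked numerically. The sketch below uses the standard identity that the fraction of a normal distribution within k standard deviations of the mean is erf(k/√2); the function name `area_within` is only illustrative.

```python
import math

def area_within(k):
    """Fraction of a normal distribution within +/- k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within +/- {k} sigma: {100 * area_within(k):.2f}%")
```

The computed fractions agree with the tabulated percentages to within rounding of the last digit.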

The Sample Standard Deviation


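It applies to small data sets; $\bar{x}$ replaces $\mu$.

$$s = \sqrt{\frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N-1}}$$

Use of the simplified (second) formula may give rise to large round-off errors.

The denominator, N - 1, is the number of degrees of freedom. If N were used instead, s would on average be less than $\sigma$; using N - 1 therefore prevents a negative bias.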
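The two forms of the sample standard deviation can be compared in a short Python sketch. The data are the five measurements from the median example; the function names are illustrative, and `statistics.stdev` is used only as a cross-check since it also divides by N - 1.

```python
import math
import statistics

def sd_definition(xs):
    """Sample SD from squared deviations about the mean (N - 1 in the denominator)."""
    n = len(xs)
    mean = sum(xs) / n
    return math.sqrt(sum((x - mean) ** 2 for x in xs) / (n - 1))

def sd_shortcut(xs):
    """Sample SD from the simplified formula using sum(x^2) and (sum(x))^2 / N.

    This subtracts two nearly equal numbers when the spread is small
    relative to the mean, which is where round-off errors can arise.
    """
    n = len(xs)
    sum_x = sum(xs)
    sum_x2 = sum(x * x for x in xs)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

data = [9, 2, 7, 11, 14]
print(sd_definition(data), sd_shortcut(data), statistics.stdev(data))
```

For well-scaled data such as these, the two forms agree; the deviation-based form is generally the safer choice in floating-point arithmetic.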

Alternative Measures of Precision


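Variance

$$s^2 = \frac{\sum_{i=1}^{N}(x_i - \bar{x})^2}{N-1}$$

Relative standard deviation

$$\mathrm{RSD} = \frac{s}{\bar{x}} \times 100\%$$

Coefficient of variation

$$\mathrm{CV} = \frac{s}{\bar{x}} \times 100\%$$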

Spread or range

The difference between the largest and smallest values in a set of replicates.

Standard deviation of computed results

Obtained by propagation of errors. For example:

$$y = c_1 x + c_2$$

y is computed; x is measured. To obtain the standard deviation of y, $s_y$, we need to know $s_x$, $s_{c_1}$, and $s_{c_2}$, i.e. the SD of each of the variables.

Error Propagation Formulas


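Type of calculation: addition or subtraction
Example: $y = a + b - c$
Std. dev. of y: $s_y = \sqrt{s_a^2 + s_b^2 + s_c^2}$

Type of calculation: multiplication or division
Example: $y = \frac{a \cdot b}{c}$
Std. dev. of y: $\frac{s_y}{y} = \sqrt{\left(\frac{s_a}{a}\right)^2 + \left(\frac{s_b}{b}\right)^2 + \left(\frac{s_c}{c}\right)^2}$

Type of calculation: exponentiation
Example: $y = a^x$
Std. dev. of y: $\frac{s_y}{y} = x \, \frac{s_a}{a}$

Type of calculation: logarithm
Example: $y = \log_{10} a$
Std. dev. of y: $s_y = 0.434 \, \frac{s_a}{a}$

Type of calculation: antilogarithm
Example: $y = \mathrm{antilog}_{10}\, a$
Std. dev. of y: $\frac{s_y}{y} = 2.303 \, s_a$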
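The propagation rules can be written as small helper functions. This is a sketch only: the function names and all numerical values are hypothetical, chosen for illustration rather than taken from the slides.

```python
import math

def sd_sum(*sds):
    """y = a + b - c: absolute SDs combine in quadrature."""
    return math.sqrt(sum(s ** 2 for s in sds))

def rel_sd_product(*pairs):
    """y = a*b/c: relative SDs combine in quadrature.

    Each pair is (value, sd). Returns s_y / y.
    """
    return math.sqrt(sum((s / v) ** 2 for v, s in pairs))

def rel_sd_power(a, s_a, x):
    """y = a**x: returns s_y / y = |x| * s_a / a (sign dropped, an SD is non-negative)."""
    return abs(x) * s_a / a

def sd_log10(a, s_a):
    """y = log10(a): returns s_y = 0.434 * s_a / a."""
    return 0.434 * s_a / a

def rel_sd_antilog10(s_a):
    """y = 10**a: returns s_y / y = 2.303 * s_a."""
    return 2.303 * s_a

# Hypothetical example: y = a + b - c with SDs 0.03, 0.04 and 0.00.
print(sd_sum(0.03, 0.04, 0.0))
# Hypothetical example: y = a*b with 1% relative SD on each factor.
print(rel_sd_product((10.0, 0.1), (20.0, 0.2)))
```

Note that sums and differences work with absolute standard deviations, while products and quotients work with relative ones.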

Calibration curve

y is plotted as a function of known x for a series of standards.
x: independent variable
y: dependent variable
The best-fit line is obtained by regression analysis using the method of least squares.

Standard Error of a mean


The standard deviation s refers to the probable error of a single measurement.

If instead the distribution of a set of mean values, each based on N measurements, is observed, less scatter is seen as N increases.

The standard deviation of the mean is called the standard error, $s_m$:

$$s_m = \frac{s}{\sqrt{N}}$$
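As a brief illustration of the standard error, the sketch below applies s/√N to the pipette calibration figures quoted earlier (SD = 0.0056 mL from 50 deliveries). Treating that SD as s for this purpose is an assumption made for the example.

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of N."""
    return s / math.sqrt(n)

s_pipette = 0.0056  # mL, SD of a single delivery (pipette calibration)
n_pipette = 50      # number of replicate deliveries

s_m = standard_error(s_pipette, n_pipette)
print(f"standard error of the mean: {s_m:.5f} mL")
```

The mean of 50 deliveries is thus known roughly seven times more precisely than any single delivery.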

Reliability of s as a measure of precision


The reliability of s increases as N increases.
For simple measurements (e.g. pH or chromatographic measurements), s can be determined a priori using a large number of replicates.
For more complicated experiments, data from a series of samples accumulated over time can be used to obtain a pooled estimate of s.
This is a better estimate than that from any single subset.

Pooled Std Deviation


$$s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1}(x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2}(x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3}(x_i - \bar{x}_3)^2 + \cdots}{N_1 + N_2 + N_3 + \cdots + N_t - t}}$$

$N_1$ = number of data in set 1
t = number of data sets
It assumes the same sources of random error in all subsets.
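A minimal Python sketch of the pooled standard deviation follows. The two data sets are hypothetical, and `pooled_sd` is an illustrative name; the function sums the squared deviations of each set about its own mean and divides by the total number of data minus the number of sets.

```python
import math

def pooled_sd(data_sets):
    """Pooled SD over several data sets sharing the same source of random error."""
    total_ss = 0.0  # sum of squared deviations, each set about its own mean
    total_n = 0     # total number of data points
    for xs in data_sets:
        mean = sum(xs) / len(xs)
        total_ss += sum((x - mean) ** 2 for x in xs)
        total_n += len(xs)
    # One degree of freedom is lost per data set (one mean estimated per set).
    return math.sqrt(total_ss / (total_n - len(data_sets)))

# Hypothetical replicate sets, e.g. two samples analysed on different days.
sets = [[1.0, 2.0, 3.0], [4.0, 6.0, 8.0]]
print(pooled_sd(sets))
```

Because each set is centred on its own mean, differences between the sample means do not inflate the pooled estimate.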

Example: Hardness Measurement

Measurements were conducted by 13 students using the EDTA titrimetric method.
Task: draw a frequency distribution of the deviation from the true value, plotting frequency versus z.

Is the distribution a normal distribution?
Are there any outliers?
What are the assumptions?
Is there any bias in the measurement?

Analysis of Hardness Measurement


Error analysis (concentrations in mg/L)

Sr. No.   True conc. ht   Measured conc. h   Error h-ht
1         110             116                 6.00
2         90              95                  5.00
3         80              86                  6.00
4         80              82.64               2.64
5         100             105.32              5.32
6         100             97.2               -2.80
7         100             100                 0.00
8         100             100                 0.00
9         100             100                 0.00
10        60              64                  4.00
11        110             112                 2.00
12        90              90.4                0.40
13        90              84                 -6.00

Mean (h-ht): 1.74
Stdev (h-ht): 3.62
No. of observations: 13

Frequency distribution of the error, with z = (x - mean)/sd

Lower limit   Upper limit   Midpoint x   z       Freq   Relative freq
-12           -10.1         -11.05       -3.53   0      0.00
-10           -8.1          -9.05        -2.98   0      0.00
-8            -6.1          -7.05        -2.43   0      0.00
-6            -4.1          -5.05        -1.88   1      0.08
-4            -2.1          -3.05        -1.32   1      0.08
-2            -0.1          -1.05        -0.77   0      0.00
0             1.9           0.95         -0.22   4      0.31
2             3.9           2.95          0.34   2      0.15
4             5.9           4.95          0.89   3      0.23
6             7.9           6.95          1.44   2      0.15
8             9.9           8.95          1.99   0      0.00
10            11.9          10.95         2.55   0      0.00
12            13.9          12.95         3.10   0      0.00
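The hardness error analysis can be reproduced with a short Python sketch. The data are the 13 true and measured concentrations from the table; the binning rule (bins 2 mg/L wide starting at -12, with midpoints 0.95 above each lower limit) is inferred from the tabulated limits and is therefore an assumption about how the table was built.

```python
import math
import statistics

true_conc = [110, 90, 80, 80, 100, 100, 100, 100, 100, 60, 110, 90, 90]
measured = [116, 95, 86, 82.64, 105.32, 97.2, 100, 100, 100, 64, 112, 90.4, 84]

# Error of each student's result relative to the true value.
errors = [h - ht for h, ht in zip(measured, true_conc)]
mean_err = statistics.mean(errors)
sd_err = statistics.stdev(errors)  # sample SD (N - 1 in the denominator)

# Bins 2 mg/L wide with lower limits -12, -10, ..., 12.
lower_limits = list(range(-12, 14, 2))
freq = [0] * len(lower_limits)
for e in errors:
    freq[math.floor((e + 12) / 2)] += 1

rel_freq = [f / len(errors) for f in freq]
# z value of each bin midpoint: (midpoint - mean) / sd.
z_mid = [((lo + 0.95) - mean_err) / sd_err for lo in lower_limits]

print(f"mean error = {mean_err:.2f}, sd = {sd_err:.2f}")
for lo, z, f, rf in zip(lower_limits, z_mid, freq, rel_freq):
    print(f"{lo:>4} {z:6.2f} {f:2d} {rf:.2f}")
```

A positive mean error suggests a possible bias in the measurements; whether it is significant depends on the assumptions the questions above ask about.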

Relative Frequency Distribution Error in Hardness Measurement


[Figure: bar chart of relative frequency (y axis, 0.0 to 0.4) versus z (x axis, -4 to 4).]

Student's t-Distribution

W.S. Gosset experimentally determined the Student's t distribution in 1908.
For a distribution of sample means, z for large samples is defined as

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$$

If s is substituted for $\sigma$ in z, the resulting quantity is the t statistic:

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$

Characteristics of t-Distribution

Mound-shaped and symmetrical about t = 0.
It is more variable than z:
z varies only because of $\bar{x}$.
The variability in t is due to two independent random quantities, $\bar{x}$ and s.
The variability in t decreases as n increases, since s approaches $\sigma$. When $n = \infty$, t = z.
d.f. = n - 1
Degrees of freedom = number of squared deviations available for estimating $\sigma^2$.

The t-distribution depends on the sample size, n.


t-Distribution

Confidence Limits

Confidence limits define a confidence interval: a region around the experimentally determined mean within which the population mean lies with a given degree of probability.

The size of the interval
is derived from the sample standard deviation (s);
is also affected by how closely s (sample std. dev.) approaches $\sigma$ (population std. dev.).
As s approaches $\sigma$ (as N increases), the confidence limits get narrower.

Confidence Interval: Large Sample Size


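Definition of the z statistic:

$$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}, \qquad \sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$$

Confidence limit when s is a good approximation of $\sigma$:

$$\text{CL for } \mu = \bar{x} \pm \frac{z\sigma}{\sqrt{N}}$$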

Confidence Interval and Sample Size


No. of measurements   Relative size of confidence interval
1                     1.00
2                     0.71
3                     0.58
4                     0.50
5                     0.45
6                     0.41
10                    0.32
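The relative sizes in this table follow from the 1/√N dependence of the confidence limit for a fixed z and σ. A short Python sketch, with an illustrative function name, reproduces them:

```python
import math

def relative_ci_size(n):
    """Width of the confidence interval for N measurements, relative to N = 1."""
    return 1 / math.sqrt(n)

for n in (1, 2, 3, 4, 5, 6, 10):
    print(n, round(relative_ci_size(n), 2))
```

The gain from each additional replicate diminishes quickly: going from 1 to 4 measurements halves the interval, but a further halving needs 16.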

Confidence Interval: Small Sample Size


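Confidence limit when $\sigma$ is not known:

$$\text{CL for } \mu = \bar{x} \pm \frac{ts}{\sqrt{N}}$$

Definition of the t statistic:

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}, \qquad s_{\bar{x}} = \frac{s}{\sqrt{N}}$$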

t-Statistic

Values of t are available in tabulated form.

t > z
If s is based on 3 measurements, d.f. = 2.
The value of t for a 95% CI is then 4.3, compared with a z value of 1.96.

Values of t depend on the degrees of freedom (d.f.) as well as on the confidence level.
$t \to z$ as d.f. $\to \infty$.

Depiction of Confidence Interval based on Normal Error Curve


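95 times out of 100, the true mean will be within $\pm 1.96\sigma$.

[Figure: normal error curves with the areas within $\pm 0.67\sigma$, $\pm 1.29\sigma$, $\pm 1.64\sigma$, and $\pm 1.96\sigma$ marked. x axis: $z = (x - \mu)/\sigma$; y axis: relative frequency.]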

Confidence levels for various values of z


Confidence level (%)   z
50                     0.67
68                     1.00
80                     1.29
90                     1.64
95                     1.96
96                     2.00
99                     2.58
99.7                   3.00
99.9                   3.29

Confidence Limits based on t & z statistics


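Concentration of a contaminant in water (expressed in %): 0.084, 0.089, 0.079.
To determine the 95% CL when no additional knowledge of the precision is available:

$$\sum x_i = 0.252, \qquad \bar{x} = \frac{0.252}{3} = 0.084, \qquad \sum x_i^2 = 0.021218$$

$$s = \sqrt{\frac{\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{N}}{N-1}} = \sqrt{\frac{0.021218 - \frac{(0.252)^2}{3}}{3-1}} = 0.005\%$$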

$$95\%\ \text{CL} = \bar{x} \pm \frac{ts}{\sqrt{N}} = 0.084 \pm \frac{4.3 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.012\%$$

t = 4.3 for d.f. = 2 and 95% confidence.

To determine the 95% CL if it is known from previous experiments that $\sigma = 0.005\%$:

The z statistic can now be used; z = 1.96 for 95% confidence.

$$95\%\ \text{CL} = \bar{x} \pm \frac{z\sigma}{\sqrt{N}} = 0.084 \pm \frac{1.96 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.006\%$$

A sure knowledge of $\sigma$ decreases the confidence interval significantly.
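Both confidence-limit calculations for the contaminant data can be checked with a few lines of Python. The critical values t = 4.3 (d.f. = 2) and z = 1.96 are taken from the text rather than computed, since the standard library has no t-distribution; the variable names are illustrative.

```python
import math
import statistics

data = [0.084, 0.089, 0.079]  # contaminant concentration, %
n = len(data)
mean = statistics.mean(data)
s = statistics.stdev(data)    # sample SD, N - 1 in the denominator

t_95 = 4.3     # tabulated t for 95% confidence, d.f. = 2
z_95 = 1.96    # z for 95% confidence
sigma = 0.005  # population SD assumed known from previous experiments, %

half_width_t = t_95 * s / math.sqrt(n)      # sigma unknown: use t and s
half_width_z = z_95 * sigma / math.sqrt(n)  # sigma known: use z and sigma

print(f"t-based: {mean:.3f} +/- {half_width_t:.3f} %")
print(f"z-based: {mean:.3f} +/- {half_width_z:.3f} %")
```

With only three measurements the t-based interval is about twice as wide as the z-based one, even though s and σ are numerically the same here.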

Quality Assurance & Control


There must be unequivocal evidence that the data from chemical measurements are reliable; quality assurance studies provide such evidence.
Quality assessment involves evaluating the accuracy and precision of methods of measurement, e.g. instruments need to be calibrated frequently with standard samples to ensure accuracy and precision.
Quality assurance of manufactured products is also very important, e.g. fluoride levels in toothpaste are regulated.
Control charts can be used to monitor quality.

Quality Assurance & Control: Example


The accuracy and precision of a balance can be monitored by periodically weighing standard weights.
Determine whether measurements made on subsequent days are within certain limits of the standard.

Upper and lower control limits:

$$\text{UCL} = \mu + \frac{3\sigma}{\sqrt{N}}, \qquad \text{LCL} = \mu - \frac{3\sigma}{\sqrt{N}}$$

$\mu$ = population mean
$\sigma$ = population standard deviation
For a normal error curve, the measurements are expected to lie in this range 99.7% of the time.

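Mass of standard weight: $\mu = 20.000$ g, $\sigma = 0.00012$ g.
For the mean of 5 measurements: $\sigma_{\bar{x}} = 0.00012/\sqrt{5} = 0.000054$ g, so $3\sigma/\sqrt{N} = 0.00016$ g.
UCL = 20.00016 g
LCL = 19.99984 g
The balance is almost out of control on day 17.

[Figure: control chart of the mass of the standard weight versus sample (day), with the UCL and LCL marked.]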
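The control limits for the balance example can be computed directly from μ, σ, and N. In this sketch the function names and the two daily means passed to `in_control` are hypothetical; only μ = 20.000 g, σ = 0.00012 g, and N = 5 come from the example.

```python
import math

def control_limits(mu, sigma, n):
    """Return (LCL, UCL) = mu -/+ 3*sigma/sqrt(N) for means of N measurements."""
    half_width = 3 * sigma / math.sqrt(n)
    return mu - half_width, mu + half_width

lcl, ucl = control_limits(mu=20.000, sigma=0.00012, n=5)
print(f"LCL = {lcl:.5f} g, UCL = {ucl:.5f} g")

def in_control(daily_mean):
    """True if a daily mean of N weighings lies within the control limits."""
    return lcl <= daily_mean <= ucl

# Hypothetical daily means, for illustration only.
print(in_control(20.00010), in_control(20.00020))
```

The limits are symmetric about μ, so a daily mean is flagged whenever it deviates from the standard mass by more than 3σ/√N in either direction.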

The Q Test for detection of Gross Errors


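A rationale for excluding outlying results that differ excessively from the average:

$$Q_{exp} = \frac{|x_q - x_n|}{w} = \frac{d}{w}$$

where $x_q$ is the questionable result, $x_n$ is its nearest neighbour, and w is the spread of the entire set.

$Q_{exp}$ is compared with $Q_{critical}$. If $Q_{exp} > Q_{critical}$, the questionable result can be rejected with the specified confidence level.

[Figure: six results $x_1$ to $x_6$ on a number line, with $d = x_6 - x_5$ and $w = x_6 - x_1$.]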

Qcrit @ Specified Confidence Level


No. of observations   90%     95%     99%
3                     0.941   0.970   0.994
4                     0.765   0.829   0.926
5                     0.642   0.710   0.821
6                     0.560   0.625   0.740
7                     0.507   0.568   0.680
8                     0.468   0.526   0.634
9                     0.437   0.493   0.598
10                    0.412   0.466   0.568

Assumption: the distribution of the population data is normal.
A cautious approach to the rejection of outliers is wise.
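A minimal Python sketch of the Q test, using the critical values tabulated above, is given below. The function name `q_test`, the choice to test whichever extreme value has the larger gap to its neighbour, and the five example results are all assumptions made for illustration.

```python
# Critical Q values by confidence level (%) and number of observations.
Q_CRIT = {
    90: {3: 0.941, 4: 0.765, 5: 0.642, 6: 0.560, 7: 0.507, 8: 0.468, 9: 0.437, 10: 0.412},
    95: {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625, 7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466},
    99: {3: 0.994, 4: 0.926, 5: 0.821, 6: 0.740, 7: 0.680, 8: 0.634, 9: 0.598, 10: 0.568},
}

def q_test(values, confidence=95):
    """Return (suspect value, Q_exp, reject?) for the most extreme result.

    Only 3 to 10 observations are covered by the table; other sizes raise KeyError.
    """
    xs = sorted(values)
    w = xs[-1] - xs[0]  # spread of the whole set
    if w == 0:
        raise ValueError("all values are identical; there is no outlier to test")
    gap_low = xs[1] - xs[0]     # gap between smallest value and its neighbour
    gap_high = xs[-1] - xs[-2]  # gap between largest value and its neighbour
    if gap_high >= gap_low:
        suspect, d = xs[-1], gap_high
    else:
        suspect, d = xs[0], gap_low
    q_exp = d / w
    return suspect, q_exp, q_exp > Q_CRIT[confidence][len(xs)]

# Hypothetical replicate results, for illustration only.
results = [55.95, 56.00, 56.04, 56.08, 56.23]
suspect, q_exp, reject = q_test(results, confidence=95)
print(suspect, round(q_exp, 3), "reject" if reject else "retain")
```

In this hypothetical set the highest value looks suspicious, but Q_exp is below the critical value for 5 observations, so the test gives no statistical ground for rejecting it.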

Recommendation for treatment of Outliers


Re-examine all data and observations relating to the outlying result; maintain a lab notebook with all observations and data.
Estimate the precision of the procedure to ensure that the outlying result is actually questionable.
Repeat the analysis and check for agreement between the new data and the original set.
Apply the Q test to decide whether the data should be retained or rejected on statistical grounds.
If the Q test indicates retention, consider reporting the median instead of the mean.
