Lec Set 1 Data Analysis

ES 670 Environmental Statistics
Data Analysis
Mean and Median

It is impossible to conduct a chemical analysis that is free of error or uncertainty, and data of unknown quality are worthless. Replicates (a set of repeated measurements) are therefore required instead of a single measurement.
The central value of the set is a more reliable measure than any individual result; the mean or the median is used. The variation among replicates provides a measure of uncertainty, quantified by the standard deviation, the variance, or the coefficient of variation.

Mean
$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i$
Median
Middle result when replicate data are arranged in order from smallest to largest.
E.g., median of a set of 5 measurements: 9, 2, 7, 11, 14
Arranged in ascending order: 2, 7, 9, 11, 14
Median is 9; its rank is (N + 1)/2 = 3.
If N is even, the median is the average of the two middle measurements.
The median is less sensitive to extreme values than the mean.
For data that are symmetrically distributed about the mean, the mean and median are equal. For a skewed distribution, the mean shifts in the direction of the skew.
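The behaviour of the two central values can be checked numerically. The sketch below uses the five measurements from the median example; the extra value 60 is a hypothetical outlier added only for illustration.

```python
from statistics import mean, median

# The five replicate measurements from the median example above.
replicates = [9, 2, 7, 11, 14]
central_mean = mean(replicates)
central_median = median(replicates)

# Hypothetical extreme value, added only to show sensitivity to outliers.
with_outlier = replicates + [60]
shift_mean = mean(with_outlier) - central_mean
shift_median = median(with_outlier) - central_median

print(f"mean = {central_mean}, median = {central_median}")
print(f"shift caused by outlier: mean {shift_mean:.2f}, median {shift_median}")
```

The mean moves much further than the median when the extreme value is added, which is the sense in which the median is less sensitive.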
Precision vs. Accuracy

Precision: describes the reproducibility of measurements and can be determined by repeating the experiment.
It is indicated by the standard deviation, variance, or coefficient of variation.
It relates to the deviation from the mean:
$d_i = x_i - \bar{x}$
Accuracy: indicates the closeness of a measurement to its true/accepted value.
It is expressed by the error (absolute or relative).
It can never be determined exactly, since the true value is not known exactly.
Absolute error: $E = x_i - x_t$
Relative error: $E_r = \frac{x_i - x_t}{x_t} \times 100\%$
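A minimal sketch of the two error definitions follows. The function names and the numbers (a measured value of 19.8 against an accepted value of 20.0) are assumptions chosen for illustration, not data from the lecture.

```python
def absolute_error(x_i, x_t):
    """E = x_i - x_t; the sign shows whether the result is high or low."""
    return x_i - x_t

def relative_error_percent(x_i, x_t):
    """E_r = (x_i - x_t) / x_t * 100 %."""
    return (x_i - x_t) / x_t * 100.0

# Hypothetical measurement and accepted value.
measured, accepted = 19.8, 20.0
print(f"E   = {absolute_error(measured, accepted):+.2f}")
print(f"E_r = {relative_error_percent(measured, accepted):+.2f} %")
```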
Types of Errors in Experimental Data

Random / indeterminate errors: reflect precision.
Systematic / determinate errors: reflect a bias. They cause a series of measurements to be all high or all low.
Gross errors: produce outliers, which may be very high or very low.
Systematic errors: the cause can be assigned, and it affects all data in the same way. Sources:

Instrument errors (e.g., glassware marking error)
Method errors (e.g., non-ideal behaviour of reagents)
Personal errors (e.g., error in detecting a colour change)
Precision Vs Accuracy in Measurements

[Figure: absolute error, $\bar{x}_i - x_t$, in the micro-Kjeldahl determination of nitrogen for two compounds (1 and 2) by four analysts. Analysts 1 and 2 analysed compound 1; Analysts 3 and 4 analysed compound 2. Horizontal axis: absolute error.]
Effect of Systematic Errors & its Detection

Constant error: the magnitude of the error does not depend on the size of the quantity measured. It is more serious when the quantity measured is small.
E.g., excess reagent required to cause a colour change in a titration.
Proportional error: the error increases or decreases in proportion to the sample size.

E.g., presence of interfering compounds in the sample.
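The difference between the two kinds of systematic error shows up in the relative error. The sketch below assumes a hypothetical constant error of 0.10 mL and a hypothetical proportional error of 1 % of the measured volume; neither figure comes from the lecture.

```python
CONSTANT_ERROR_ML = 0.10       # hypothetical fixed excess of titrant, in mL
PROPORTIONAL_FRACTION = 0.01   # hypothetical error of 1 % of the quantity

def relative_error_constant(volume_ml):
    """Relative error (%) when the absolute error is fixed."""
    return 100.0 * CONSTANT_ERROR_ML / volume_ml

def relative_error_proportional(volume_ml):
    """Relative error (%) when the absolute error scales with the quantity."""
    return 100.0 * (PROPORTIONAL_FRACTION * volume_ml) / volume_ml

for volume in (10.0, 50.0):
    print(f"{volume:5.1f} mL: constant {relative_error_constant(volume):.2f} %, "
          f"proportional {relative_error_proportional(volume):.2f} %")
```

A constant error weighs five times more heavily on the 10 mL titration than on the 50 mL one, while a proportional error gives the same relative error at both sizes.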
Effect of Systematic Error and its detection

Systematic instrumental errors can be determined by calibration.
Personal errors can be minimized by self-discipline.
Bias in analytical methods can be minimized by using standard reference materials, or by using an alternative, reliable analytical method.
Blank determination can reveal error due to interfering contaminants, e.g., a titration end-point correction can be made with blanks.
Random Errors in Analysis

Indeterminate errors are caused by uncontrollable variables.
The variables that contribute to these errors cannot be identified or measured.
They cause a random scatter in the data.
An empirical observation is that, for most experimental data, the distribution of replicates approaches a Gaussian curve (normal distribution).
Theoretically, this distribution arises from a large number of individual error components.
Calibration of a 10 mL Pipette
The exact volume of water delivered by a 10 mL pipette was measured gravimetrically 50 times. Mass was converted to volume using density values at the measured temperatures.
The data were rearranged to obtain a frequency distribution: the data series was divided into 0.003 mL groups, and the number and percentage of observations in each group were determined.
The frequency distribution was plotted as a bar graph (histogram).

Range: 9.969 to 9.995 mL
Mean: 9.982 mL
Median: 9.982 mL
Spread: 0.025 mL
SD: 0.0056 mL
As the number of measurements increases, the histogram approaches a continuous curve with a Gaussian (normal) distribution.
The Gaussian has the same mean and the same standard deviation (SD) as the histogram; it therefore has the same precision and the same area under the curve.
Sources of random uncertainty:
Visual reading of the 10 mL mark
Variation in drainage time and in the angle at which the pipette is held
Temperature fluctuations affecting viscosity and the performance of the balance
Vibration and drafts affecting the balance
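The grouping step used to build the histogram can be sketched as follows. The 50 original pipette readings are not reproduced in these notes, so the values below are simulated from a normal distribution with the reported mean (9.982 mL) and SD (0.0056 mL); they are synthetic and only illustrate the procedure. The function name is an assumption.

```python
import random
from collections import Counter

def frequency_distribution(values, start, width):
    """Group values into bins of equal width.

    Returns (lower limit, count, percentage) for each occupied bin.
    """
    counts = Counter(int((v - start) // width) for v in values)
    n = len(values)
    return [(start + k * width, counts[k], 100.0 * counts[k] / n)
            for k in sorted(counts)]

# Synthetic stand-in for the 50 calibration results (not the real data).
rng = random.Random(0)
simulated = [rng.gauss(9.982, 0.0056) for _ in range(50)]

table = frequency_distribution(simulated, start=min(simulated), width=0.003)
for lower, count, percent in table:
    print(f"{lower:.3f} to {lower + 0.003:.3f} mL: {count:2d} ({percent:4.1f} %)")
```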
Reproducibility / Repeatability

Both terms relate to precision.
An analyst makes 5 replicate measurements in quick succession using the same reagents and glassware. Such measurements reflect repeatability = within-run precision.
The same analyst takes the same readings on 5 different occasions. The data are then subject to differences in reagents, glassware, and lab conditions, and reflect reproducibility = between-run precision.
Error estimates based on sequentially repeated observations may give a false sense of security regarding precision.
More emphasis needs to be put on reproducibility, which highlights the differences in observations when replicate experiments are performed in random sequence.
Normality / Randomness / Independence

Most statistical procedures are based on normality, randomness, and independence.

Normality: the measurement error comes from a normal distribution. Because of the central limit effect, many additive component errors lead to a normal-like distribution.
This is not a very restrictive criterion.
If errors are not normally distributed, transformations are available to make the errors normal-like.
Most tests are robust to deviations from normality.

Randomness: observations are drawn from a population such that every element of the population has an equal chance of being drawn.
Randomization of sampling can ensure that observations are independent.
Example of non-randomness in measurements: if, in 10 replicate measurements, all the early measurements are high compared to the late ones, the measurement is non-random.
Randomness is indicated by a plot of measurement error vs. order of observation.
It is good to check for randomness with respect to each identifiable factor that can affect the measurement.

Independence: implies that the simple multiplicative law of probability holds.
The probability of the joint occurrence of two events is given by the product of the probabilities of the individual occurrences.
Lack of independence can seriously distort the variance and the results of statistical tests.
Statistical treatment of Random Error

Sample vs. Population

Sample: the finite number of experimental observations.
Population: the infinite number of possible observations that could, in principle, be made given infinite time.
The statistical laws are derived on the basis of a population; when applied to smaller samples, these laws may need to be modified.
Population Mean & sample mean

If the number of observations is small, the two are not the same.
The population mean $\mu$ is the true mean of the population.
The sample mean $\bar{x}$ is an estimator of the population mean.
If a measurement has no systematic error, $\mu = x_t$, the true value.
The difference between $\bar{x}$ and $\mu$ decreases as N increases.
Population and Sample mean

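[Figure: distribution curves for two data sets, A and B, plotted against x and against z.]

The population standard deviation $\sigma$ (a measure of precision):
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$

Deviation from the mean expressed in units of standard deviation:
$z = \frac{x - \mu}{\sigma}$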
Characteristics of the normal error curve

The mean occurs at the central point of maximum frequency.
There is a symmetrical distribution of positive and negative deviations about the mean.
There is an exponential decrease in frequency as the magnitude of the deviations increases, i.e., small random uncertainties are observed more often than larger ones.

Areas under a Gaussian curve

68.3% of the area lies within $\pm 1\sigma$ of the mean
95.4% of the area lies within $\pm 2\sigma$ of the mean
99.7% of the area lies within $\pm 3\sigma$ of the mean
Therefore the standard deviation is a very useful prediction tool.
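These areas can be reproduced from the error function, since the fraction of a normal distribution within $\pm k\sigma$ of the mean is $\mathrm{erf}(k/\sqrt{2})$. The helper name `area_within` is an assumption made for this sketch.

```python
from math import erf, sqrt

def area_within(k):
    """Fraction of a normal distribution lying within +/- k sigma of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(f"+/- {k} sigma: {100 * area_within(k):.2f} % of the area")
```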
The Sample Standard Deviation

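It applies to small data sets; $\bar{x}$ replaces $\mu$ and N - 1 replaces N.
$s = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}} = \sqrt{\frac{\sum_{i=1}^{N} x_i^2 - \frac{\left(\sum_{i=1}^{N} x_i\right)^2}{N}}{N - 1}}$
Use of the simplified (second) formula may give rise to large round-off errors.
Denominator = N - 1 = degrees of freedom. If the denominator were N, s would on average be less than $\sigma$; using N - 1 therefore prevents a negative bias.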
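The two forms of the formula can be compared directly. This sketch uses the three contaminant concentrations from the confidence-limit example later in the lecture (0.084, 0.089, 0.079); the function names are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev, stdev

def s_definition(x):
    """Sample SD from squared deviations about the sample mean."""
    x_bar = mean(x)
    return sqrt(sum((xi - x_bar) ** 2 for xi in x) / (len(x) - 1))

def s_shortcut(x):
    """Sample SD from the simplified sum-of-squares form."""
    n = len(x)
    return sqrt((sum(xi ** 2 for xi in x) - sum(x) ** 2 / n) / (n - 1))

data = [0.084, 0.089, 0.079]
print(f"definition: {s_definition(data):.4f}")
print(f"shortcut:   {s_shortcut(data):.4f}")
print(f"divide by N instead of N - 1: {pstdev(data):.4f}")
```

Dividing by N (here `pstdev`) gives a smaller value than dividing by N - 1 (`stdev`), which is the negative bias mentioned above.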
Alternative Measures of Precision

Variance
$s^2 = \frac{\sum_{i=1}^{N} (x_i - \bar{x})^2}{N - 1}$

Relative Standard Deviation
$RSD = \frac{s}{\bar{x}} \times 100\%$

Coefficient of variation
$CV = \frac{s}{\bar{x}} \times 100\%$
Spread or Range
Difference between the largest and smallest values in a set of replicates.

Standard Deviation of computed results
Obtained by propagation of errors. For example:
$y = c_1 x + c_2$

y is computed, x is measured. To obtain the standard deviation of y, $s_y$, we would need to know $s_x$, $s_{c_1}$ and $s_{c_2}$, i.e., the SD of each of the variables.
Error Propagation Formulas

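Type of calculation: example; standard deviation of y

Addition or subtraction: $y = a + b - c$; $s_y = \sqrt{s_a^2 + s_b^2 + s_c^2}$
Multiplication or division: $y = \frac{a \cdot b}{c}$; $\frac{s_y}{y} = \sqrt{\left(\frac{s_a}{a}\right)^2 + \left(\frac{s_b}{b}\right)^2 + \left(\frac{s_c}{c}\right)^2}$
Exponential: $y = a^x$; $\frac{s_y}{y} = x \, \frac{s_a}{a}$
Logarithm: $y = \log_{10} a$; $s_y = 0.434 \, \frac{s_a}{a}$
Antilogarithm: $y = \mathrm{antilog}_{10}\, a$; $\frac{s_y}{y} = 2.303 \, s_a$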
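The propagation rules translate into a few short functions. This is a minimal sketch: the function names are assumptions, and the standard deviations in the usage lines are hypothetical values chosen so the results are easy to verify by hand.

```python
from math import sqrt

def sd_sum(*sds):
    """s_y for y = a + b - c: absolute SDs add in quadrature."""
    return sqrt(sum(s ** 2 for s in sds))

def rel_sd_product(pairs):
    """s_y / y for y = a * b / c: relative SDs add in quadrature.

    `pairs` holds (value, sd) tuples.
    """
    return sqrt(sum((sd / value) ** 2 for value, sd in pairs))

def rel_sd_power(a, s_a, x):
    """s_y / y for y = a ** x."""
    return x * s_a / a

def sd_log10(a, s_a):
    """s_y for y = log10(a)."""
    return 0.434 * s_a / a

def rel_sd_antilog10(s_a):
    """s_y / y for y = 10 ** a."""
    return 2.303 * s_a

# Hypothetical standard deviations, for illustration only.
print(f"sum:     s_y   = {sd_sum(0.3, 0.4):.2f}")
print(f"product: s_y/y = {rel_sd_product([(1.0, 0.03), (1.0, 0.04)]):.2f}")
```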
Calibration curve

y is plotted as a function of known x for a series of standards (x: independent variable; y: dependent variable).
The best-fit line is obtained by regression analysis using the method of least squares.
Standard Error of a mean

The standard deviation s refers to the probable error of a single measurement.

If several sets of N replicates are taken and the distribution of the set means is observed, less scatter is observed as N increases.
The standard deviation of the mean is called the standard error, $s_m$:
$s_m = \frac{s}{\sqrt{N}}$
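The $1/\sqrt{N}$ behaviour can be checked with a small simulation. The sketch below draws synthetic replicate sets from a normal distribution with $\sigma = 1$ and compares the scatter of the set means with $\sigma/\sqrt{N}$; the function name, seed, and number of sets are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import mean, stdev

def sd_of_means(n, sets=2000, sigma=1.0, seed=1):
    """SD of the means of `sets` simulated replicate sets of size n."""
    rng = random.Random(seed)
    means = [mean(rng.gauss(0.0, sigma) for _ in range(n)) for _ in range(sets)]
    return stdev(means)

for n in (4, 16):
    print(f"N = {n:2d}: observed {sd_of_means(n):.3f}, "
          f"sigma/sqrt(N) = {1.0 / sqrt(n):.3f}")
```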
Reliability of s as a measure of precision

The reliability of s increases as N increases.
s can be determined a priori using a large number of replicates, particularly for simple measurements, e.g., pH measurement or chromatographic measurement.
For more complicated experiments, data from a series of samples accumulated over time can be used to get a pooled estimate of s.
This is a better estimate than that from a single subset.
Pooled Std Deviation

$s_{pooled} = \sqrt{\frac{\sum_{i=1}^{N_1} (x_i - \bar{x}_1)^2 + \sum_{i=1}^{N_2} (x_i - \bar{x}_2)^2 + \sum_{i=1}^{N_3} (x_i - \bar{x}_3)^2 + \cdots}{N_1 + N_2 + N_3 + \cdots + N_t - t}}$
$N_1$ = number of data in set 1; t = number of data sets.
It assumes the same source of random errors in all subsets.
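A minimal sketch of the pooled estimate follows. The function name and the two small data sets are assumptions chosen so the result can be verified by hand: the squared deviations sum to 2 + 2 = 4 and the denominator is 6 - 2 = 4.

```python
from math import sqrt
from statistics import mean

def pooled_sd(data_sets):
    """Pooled SD: squared deviations about each set's own mean are summed,
    then divided by (total number of data - number of sets)."""
    sum_sq = 0.0
    n_total = 0
    for subset in data_sets:
        subset_mean = mean(subset)
        sum_sq += sum((x - subset_mean) ** 2 for x in subset)
        n_total += len(subset)
    return sqrt(sum_sq / (n_total - len(data_sets)))

# Hypothetical data sets, for illustration only.
example_sets = [[1, 2, 3], [4, 5, 6]]
print(f"s_pooled = {pooled_sd(example_sets):.3f}")
```

One degree of freedom is lost for each set, because each set contributes its own mean.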
Example: Hardness Measurement

Measurement conducted by 13 students using the EDTA titrimetric method.
Aim: to draw a frequency distribution of the deviation from the true value, as frequency versus z.

Is the distribution a normal distribution?
Are there any outliers?
What are the assumptions?
Is there any bias in the measurement?
Analysis of Hardness Measurement

Error analysis

Sr No.   True conc. h_t (mg/L)   Measured conc. h (mg/L)   h - h_t (mg/L)
1        110                     116                        6.00
2        90                      95                         5.00
3        80                      86                         6.00
4        80                      82.64                      2.64
5        100                     105.32                     5.32
6        100                     97.2                      -2.80
7        100                     100                        0.00
8        100                     100                        0.00
9        100                     100                        0.00
10       60                      64                         4.00
11       110                     112                        2.00
12       90                      90.4                       0.40
13       90                      84                        -6.00

Mean (h - h_t): 1.74
Stdev (h - h_t): 3.62
No. of observations: 13

Frequency distribution, with z = (x - mean)/sd

Lower limit   Upper limit   Midpoint x   z       Freq   Relative freq
-12           -10.1         -11.05       -3.53   0      0.00
-10           -8.1          -9.05        -2.98   0      0.00
-8            -6.1          -7.05        -2.43   0      0.00
-6            -4.1          -5.05        -1.88   1      0.08
-4            -2.1          -3.05        -1.32   1      0.08
-2            -0.1          -1.05        -0.77   0      0.00
0             1.9           0.95         -0.22   4      0.31
2             3.9           2.95          0.34   2      0.15
4             5.9           4.95          0.89   3      0.23
6             7.9           6.95          1.44   2      0.15
8             9.9           8.95          1.99   0      0.00
10            11.9          10.95         2.55   0      0.00
12            13.9          12.95         3.10   0      0.00
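The summary statistics and frequency counts in the tables can be reproduced from the 13 paired values. The sketch below uses the data as tabulated; the variable names and the binning approach (2 mg/L classes starting at -12) are choices made for this illustration.

```python
from math import floor
from statistics import mean, stdev

true_conc = [110, 90, 80, 80, 100, 100, 100, 100, 100, 60, 110, 90, 90]
measured = [116, 95, 86, 82.64, 105.32, 97.2, 100, 100, 100, 64, 112, 90.4, 84]

errors = [h - h_t for h, h_t in zip(measured, true_conc)]
mean_err = mean(errors)
sd_err = stdev(errors)   # sample SD, N - 1 in the denominator

# Thirteen classes of width 2 mg/L with lower limits -12, -10, ..., 12.
freq = [0] * 13
for e in errors:
    freq[floor((e + 12) / 2)] += 1

midpoints = [-11.05 + 2 * k for k in range(13)]
z_scores = [(x - mean_err) / sd_err for x in midpoints]

print(f"mean = {mean_err:.2f}, sd = {sd_err:.2f}")
for x, z, f in zip(midpoints, z_scores, freq):
    print(f"midpoint {x:7.2f}  z {z:6.2f}  freq {f}  rel freq {f / len(errors):.2f}")
```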
Relative Frequency Distribution Error in Hardness Measurement

[Figure: relative frequency (0.0 to 0.4) plotted against z (-4 to 4).]
Student's t-Distribution

W. S. Gosset determined the Student's t-distribution experimentally in 1908.
For a distribution of sample means, the definition of z for large samples is
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$
If s is substituted for $\sigma$ in z, the resulting quantity is the t statistic:

$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$
Characteristics of t-Distribution

Mound shaped; symmetrical about t = 0.
It is more variable than z:

z varies only due to $\bar{x}$.
The variability in t is due to two independent random quantities: $\bar{x}$ and s.
The variability in t decreases as n increases, since s approaches $\sigma$. When $n = \infty$, t = z.
d.f. = n - 1. Degrees of freedom = number of squared deviations available for estimating $\sigma^2$.
The t-distribution depends on sample size, n.

[Figure: t-distribution]
Confidence Limits
Confidence limits define a confidence interval: a region around the experimentally determined mean within which the population mean lies with a given degree of probability.

The size of the interval
is derived from the sample standard deviation (s);
is also affected by how closely s (sample std. dev.) approaches $\sigma$ (population std. dev.).
As s approaches $\sigma$ (as N increases), the confidence limits get narrower.
Confidence Interval: Large Sample Size

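Definition: z statistic
$z = \frac{\bar{x} - \mu}{\sigma_{\bar{x}}}$, with $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{N}}$
Confidence limit when s is a good approximation of $\sigma$:
$\mu = \bar{x} \pm \frac{z \sigma}{\sqrt{N}}$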
Confidence Interval and Sample Size

No. of measurements   Relative size of confidence interval
1                     1.0
2                     0.71
3                     0.58
4                     0.5
5                     0.45
6                     0.41
10                    0.32
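The relative sizes in this table follow from the $1/\sqrt{N}$ factor in the confidence limit $\bar{x} \pm z\sigma/\sqrt{N}$. A short check (the dictionary name is an assumption):

```python
from math import sqrt

# Width of the interval for N measurements relative to a single measurement.
relative_size = {n: round(1 / sqrt(n), 2) for n in (1, 2, 3, 4, 5, 6, 10)}

for n, size in relative_size.items():
    print(f"N = {n:2d}: relative size {size}")
```

Halving the interval requires four times as many measurements.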
Confidence Interval: Small Sample Size

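Confidence limit when $\sigma$ is not known:
$CL = \bar{x} \pm \frac{t s}{\sqrt{N}}$
Define the t statistic:
$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$, with $s_{\bar{x}} = \frac{s}{\sqrt{N}}$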
t-Statistics
t values are available in tabulated form.

t > z. If s is based on 3 measurements, d.f. = 2, and the value of t for a 95% CI is 4.3, compared with a z value of 1.96.
t values depend on the degrees of freedom (d.f.) as well as on the confidence level.
$t \to z$ as d.f. $\to \infty$.
Depiction of Confidence Interval based on Normal Error Curve

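[Figure: normal error curve, relative frequency (y axis) versus $z = \frac{x - \mu}{\sigma}$ (x axis), with confidence intervals marked at several values of z.]
95 times out of 100 the true mean will be within $\pm 1.96\sigma$.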
Confidence levels for various values of z

Confidence level (%)   z
50                     0.67
68                     1.00
80                     1.28
90                     1.64
95                     1.96
95.4                   2.00
99                     2.58
99.7                   3.00
99.9                   3.29
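The z values can be obtained from the inverse of the normal cumulative distribution: for a two-sided confidence level C, z is the point that leaves (1 - C)/2 in each tail. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the function name is an assumption.

```python
from statistics import NormalDist

def z_for_confidence(level_percent):
    """Two-sided z value for a given confidence level in percent."""
    cumulative = (1 + level_percent / 100) / 2
    return NormalDist().inv_cdf(cumulative)

for level in (50, 80, 90, 95, 99, 99.9):
    print(f"{level:>5} %  z = {z_for_confidence(level):.2f}")
```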
Confidence Limits based on t & z statistics

Conc. of a contaminant in water (expressed in %): 0.084, 0.089, 0.079
To determine the 95% CL when no additional knowledge of precision is available:
$\sum x_i = 0.252$, $\bar{x} = \frac{0.252}{3} = 0.084$, $\sum x_i^2 = 0.021218$
$s = \sqrt{\frac{\sum x_i^2 - \frac{(\sum x_i)^2}{N}}{N - 1}} = \sqrt{\frac{0.021218 - \frac{(0.252)^2}{3}}{3 - 1}} = 0.005\%$

t = 4.3 for d.f. = 2 and 95% confidence
$95\%\ CL = \bar{x} \pm \frac{t s}{\sqrt{N}} = 0.084 \pm \frac{4.3 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.012\%$

To determine the 95% CL if it is known from previous experiments that $\sigma = 0.005\%$:

Now the z statistic can be used.
z = 1.96 for 95% confidence
$95\%\ CL = \bar{x} \pm \frac{z \sigma}{\sqrt{N}} = 0.084 \pm \frac{1.96 \times 0.005}{\sqrt{3}} = 0.084 \pm 0.006\%$
A sure knowledge of $\sigma$ decreases the confidence interval significantly.
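The arithmetic of this example can be checked in a few lines. The t and z multipliers are the values quoted in the example (4.3 for d.f. = 2, and 1.96); the variable and function names are assumptions.

```python
from math import sqrt
from statistics import mean, stdev

data = [0.084, 0.089, 0.079]   # contaminant concentration, %
T_95_DF2 = 4.3                 # t for 95 % confidence, d.f. = 2 (from the example)
Z_95 = 1.96                    # z for 95 % confidence
SIGMA_KNOWN = 0.005            # sigma assumed known from previous experiments, %

def half_width(multiplier, sd, n):
    """Half-width of the confidence interval: multiplier * sd / sqrt(n)."""
    return multiplier * sd / sqrt(n)

x_bar = mean(data)
s = stdev(data)
half_t = half_width(T_95_DF2, s, len(data))
half_z = half_width(Z_95, SIGMA_KNOWN, len(data))

print(f"mean = {x_bar:.3f} %, s = {s:.3f} %")
print(f"95 % CL with t: {x_bar:.3f} +/- {half_t:.3f} %")
print(f"95 % CL with z: {x_bar:.3f} +/- {half_z:.3f} %")
```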
Quality Assurance & Control

There must be unequivocal evidence that the data from chemical measurements are reliable. Quality assurance studies provide such evidence.
Quality assessment involves evaluating the accuracy and precision of methods of measurement.
E.g., instruments need to be calibrated frequently with standard samples to ensure accuracy and precision.
Quality assurance of manufactured products is also very important, e.g., fluoride levels in toothpaste are regulated.
Control charts can be used to monitor quality.
Quality Assurance & Control: Example

The accuracy and precision of a balance can be monitored by periodically weighing a standard weight.
Determine whether measurements made on subsequent days are within certain limits of the standard.
Upper and lower control limits:
$UCL = \mu + \frac{3\sigma}{\sqrt{N}}$
$LCL = \mu - \frac{3\sigma}{\sqrt{N}}$
$\mu$ = population mean; $\sigma$ = population standard deviation.
For a normal error curve, the measurements are expected to lie in this range 99.7% of the time.

Mass of standard weight: $\mu$ = 20.000 g, $\sigma$ = 0.00012 g, for the mean of 5 measurements.
$\sigma_{\bar{x}} = \frac{0.00012}{\sqrt{5}} = 0.000054$ g, so $\frac{3\sigma}{\sqrt{N}} = 0.00016$ g
UCL = 20.00016 g
LCL = 19.99984 g

[Figure: control chart of the mass of the standard weight versus sample (day), with UCL and LCL marked. The balance is almost out of control on day 17.]
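The control limits for the balance example can be computed as below, using the stated values ($\mu$ = 20.000 g, $\sigma$ = 0.00012 g, N = 5). The function name and the sample daily mean are assumptions for illustration.

```python
from math import sqrt

def control_limits(mu, sigma, n):
    """Return (LCL, UCL) = mu -/+ 3 * sigma / sqrt(n)."""
    margin = 3 * sigma / sqrt(n)
    return mu - margin, mu + margin

lcl, ucl = control_limits(mu=20.000, sigma=0.00012, n=5)
print(f"LCL = {lcl:.5f} g, UCL = {ucl:.5f} g")

def in_control(daily_mean):
    """True if a daily mean of 5 weighings lies inside the control limits."""
    return lcl <= daily_mean <= ucl

# Hypothetical daily mean, for illustration only.
print(in_control(20.00010))
```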
The Q Test for detection of Gross Errors

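A rationale for excluding outlying results that differ excessively from the average:
$Q_{exp} = \frac{|x_q - x_n|}{w}$
where $x_q$ is the questionable result, $x_n$ is its nearest neighbour, and w is the spread of the entire set.
E.g., for six results $x_1$ to $x_6$ in ascending order with $x_6$ questionable: $d = x_6 - x_5$, $w = x_6 - x_1$, and $Q_{exp} = d / w$.
$Q_{exp}$ is compared with $Q_{critical}$. If $Q_{exp} > Q_{critical}$, the questionable result can be rejected with the specified confidence level.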
Qcrit @ Specified Confidence Level

No. of observations   90%     95%     99%
3                     0.941   0.970   0.994
4                     0.765   0.829   0.926
5                     0.642   0.710   0.821
6                     0.560   0.625   0.740
7                     0.507   0.568   0.680
8                     0.468   0.526   0.634
9                     0.437   0.493   0.598
10                    0.412   0.466   0.568

Assumption: the distribution of population data is normal.
A cautious approach to rejection of outliers is wise.
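A minimal sketch of the Q test, using the 95% critical values from the table above. The function name, the rule of testing whichever end value has the larger gap, and the five data values are assumptions made for illustration.

```python
# Critical Q values at 95 % confidence, from the table above.
Q_CRIT_95 = {3: 0.970, 4: 0.829, 5: 0.710, 6: 0.625,
             7: 0.568, 8: 0.526, 9: 0.493, 10: 0.466}

def q_test(values, q_crit=Q_CRIT_95):
    """Return (Q_exp, reject) for the more suspect of the two extreme values.

    Q_exp = gap between the suspect value and its nearest neighbour,
    divided by the spread of the whole set.
    """
    data = sorted(values)
    n = len(data)
    if n not in q_crit:
        raise ValueError("critical values are tabulated for 3 to 10 observations")
    spread = data[-1] - data[0]
    if spread == 0:
        return 0.0, False
    gap = max(data[1] - data[0], data[-1] - data[-2])
    q_exp = gap / spread
    return q_exp, q_exp > q_crit[n]

# Hypothetical replicate results; 56.23 is the questionable value.
q_exp, reject = q_test([55.95, 56.00, 56.04, 56.08, 56.23])
print(f"Q_exp = {q_exp:.3f}, reject at 95 %: {reject}")
```

Here Q_exp is about 0.54, which is below the critical value of 0.710 for five observations, so the questionable result is retained.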
Recommendation for treatment of Outliers

Re-examine all data and observations relating to the outlying result; maintain a lab notebook with all observations and data.
Estimate the precision of the procedure to ensure that the outlying result is actually questionable.
Repeat the analysis. Check for agreement between the new data and the original set.
Apply the Q test to decide whether the data should be retained or rejected on statistical grounds.
If the Q test indicates retention, consider reporting the median instead of the mean.
