STATISTICAL ANALYSIS OF WATER QUALITY DATA

1. LEARNING OBJECTIVES
The learning objective is for students to understand some guidelines and techniques for
water quality data analysis and presentation.

2. INTRODUCTION
Data analysis and presentation, together with interpretation of the results and report
writing, form the last step in the water quality assessment process. It is this phase that
shows how successful the monitoring activities have been in attaining the objectives of
the assessment. It is also the step that provides the information needed for decision
making, such as choosing the most appropriate solution to a water quality problem,
assessing the state of the environment or refining the water quality assessment process
itself.

Water quality data are often collected at different sites over time to improve water quality
management. Water quality data usually exhibit the following characteristics: non-normal
distribution, presence of outliers, missing values, values below detection limits
(censored), and serial dependence. It is essential to apply appropriate statistical
methodology when analyzing water quality data to draw valid conclusions and hence
provide useful advice in water management.

The collection of appropriate numbers of samples from representative locations is
particularly important for the final stages of data analysis and interpretation of results.
The subject of statistical sampling and programme design is complex and cannot be
discussed in detail here.

Designing a water quality data storage system needs careful consideration to ensure that
all the relevant information is stored such that it maintains data accuracy and allows easy
access, retrieval, and manipulation of the data. Although it is difficult to recommend one
single system that will serve all agencies carrying out water quality studies, some general
principles may serve as a framework in designing and implementing effective water
quality data storage and retrieval systems which will serve the particular needs of each
agency or country.

3. DESCRIBING WATER QUALITY


How would you go about describing water quality?

Water quality is characterized by variation

Statistics is the science of variation

Statistical Thinking/Statistical Perspective

Thinking in terms of variation

Thinking in terms of distribution

The present problem is multivariate: WATER QUALITY as a function of TIME, under the
influence of covariates like FLOW, at multiple LOCATIONS.

Water Quality Versus Time

4. BASIC STATISTICS

Statistics is the science that deals with the collection, tabulation and analysis of numerical
data. Statistical methods can be used to summarise and assess small or large, simple or
complex data sets. Descriptive statistics are used to summarise water quality data sets into
simpler and more understandable forms, such as the mean or median.
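
As a minimal sketch of such a summary, the mean and median of a small data set can be computed with Python's standard library. The concentrations below are hypothetical values chosen for illustration, not measurements from any study.

```python
import statistics

# Hypothetical concentrations (mg/L) at one site; the last value is unusually high.
concentrations = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3, 9.6]

mean_value = statistics.mean(concentrations)
median_value = statistics.median(concentrations)

print(f"mean   = {mean_value:.2f} mg/L")
print(f"median = {median_value:.2f} mg/L")
```

In this illustrative set the single high value pulls the mean well above the median, which is one reason both summaries are often reported together.
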
Questions about the dynamic nature of water quality can also be addressed with the aid of
statistics. Examples of such questions are:

What is the general water quality at a given site?

Is the water quality improving or getting worse?


How do certain variables relate to one another at given sites?

What are the mass loads of materials moving in and out of water systems?

What are the sources of pollutants and what is their magnitude?

Can water quality be predicted from past water quality?

When these and other questions are re-stated in the form of hypotheses then inductive
statistics, such as detecting significant differences, correlations and regressions, can be used
to provide the answers.
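
One way to examine how two variables relate at a site is a rank correlation. The sketch below is an illustration only: it computes Spearman's rank correlation by ranking each variable (tied values share the average rank) and then correlating the ranks. The function names and the flow and turbidity values are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1 upward; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Hypothetical paired observations at one site.
flow = [1.2, 3.4, 2.0, 8.5, 5.1]        # m3/s
turbidity = [3, 10, 4, 95, 22]          # NTU
print(f"Spearman correlation = {spearman(flow, turbidity):.2f}")
```

Because only the ordering of the values is used, the very high turbidity reading carries no more weight than any other observation.
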

5. DATA CHARACTERISTICS
a) Recognition of data types
The types of data collected in water quality studies are many and varied. Frequently,
water quality data possess statistical properties which are characteristic of a particular
type of data. Recognising the type of data can often, therefore, save much
preliminary, uninformed assessment, or eventual application of inappropriate
statistical procedures.

Data sets typically have various, recognisable patterns of distribution of the individual
values. Values in the middle of the range of a data set may occur frequently, whereas
those values close to the extremes of the range occur only very infrequently.
Measurement data are of two types:

Direct: data which result from studies which directly quantify the water
quality of interest in a scale of magnitude, e.g. concentration, temperature,
species population numbers, time.

Indirect: these data are not measured directly, but are derived from other
appropriately measured data, e.g. rates of many kinds, ratios, percentages and
indices.

Both of the above types of measurement data can be sub-divided into two further types:

Continuous: data in which the measurement, in principle, can assume any value
within the scale of measurement. For example, between temperatures of 5.6 °C
and 5.7 °C, there are infinitely many intermediate values, even though they are beyond
the resolution of practical measurement instruments. In water quality studies,
continuous data types are predominantly the chemical and physical measurements.

Discontinuous: data in which the measurements may only, by their very nature, take
discrete values. They include counts of various types, and in water quality studies are
derived predominantly from biological methods.

Ranked data: Some water quality descriptors may only be specified in more general
terms, for example on a scale of first, second, third,...; trophic states, etc. In such scales, it
is not the intention that the difference between rank 1 and rank 2 should necessarily be
equal to the difference between rank 2 and rank 3.

Data attributes: This data type is generally qualitative rather than quantitative, e.g. small,
intermediate, large, clear, dark. For many such data types, it may also be possible to
express them as continuous measurement data: small to large, for example, could be
specified as a continuous scale of areas or volumes.

Data sets derived from continuous measurements (e.g. concentrations) may show
frequency distributions which are either normal or non-normal. Conversely,
discontinuous measurements (e.g. counts) will almost always be non-normal. Their
non-normality may also depend on such factors as population spatial distributions and
sampling techniques, and may be shown in various well-defined manners characterised by
particular frequency distributions. Considerable manipulation of non-normal, raw data
may be required before they are amenable to the established array of powerful
distribution-based statistical techniques. Otherwise, it is necessary to use so-called
non-parametric or distribution-free techniques.

Ranked data have their own branch of statistical techniques. Ratios and proportions,
however, can give rise to curious distributions, particularly when derived from
discontinuous variables. For example, 10 per cent survivors from batches of test
organisms could arise from a minimum of one survivor in ten, whereas 50 per cent could
arise from one out of two, two out of four, three out of six, four out of eight or five out
of ten; and similarly for other ratios.
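
The arithmetic in this example can be checked with a short sketch that lists every combination of survivors and batch size, for batches of up to ten organisms, giving a stated percentage. The function name and the limit of ten are assumptions made for illustration.

```python
from fractions import Fraction


def batches_giving(percent, max_batch=10):
    """Return (survivors, batch_size) pairs that yield exactly `percent` per cent."""
    target = Fraction(percent, 100)
    return [
        (survivors, size)
        for size in range(1, max_batch + 1)
        for survivors in range(0, size + 1)
        if Fraction(survivors, size) == target
    ]


print(batches_giving(10))  # [(1, 10)]
print(batches_giving(50))  # [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]
```

Exact fractions are used rather than floating-point division so that the equality test is reliable.
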

b) Data validation
To ensure that the data contained in the storage and retrieval system can be used for
decision making in the management of water resources, each agency must define its
data quality needs, i.e. the required accuracy and precision. It must be noted that all
phases of the water quality data collection process, i.e. planning, sample collection
and transport, laboratory analysis and data storage, contribute to the quality of the
data finally stored.

Of particular importance are care and checking in the original coding and keyboard
entry of data. Only careful design of data codes and entry systems will minimise input
errors. Experience also shows that major mistakes can be made in transferring data
from laboratories to databases, even when using standardised data forms. It is
absolutely essential that there is a high level of confidence in the validity of the data
to be analysed and interpreted. Without such confidence, further data manipulation is
fruitless. If invalid data are subsequently combined with valid data, the integrity of the
latter is also impaired.
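
A simple automated screen at the point of data entry can catch some of these mistakes. The sketch below is only an illustration: the variable names and acceptance limits are hypothetical, and each agency would define its own according to its data quality needs.

```python
# Hypothetical acceptance limits; each agency would define its own.
LIMITS = {
    "pH": (0.0, 14.0),
    "temperature_C": (0.0, 40.0),
    "dissolved_oxygen_mg_L": (0.0, 20.0),
}


def validate_record(record):
    """Return a list of problems found in one record (empty list if none)."""
    problems = []
    for variable, (low, high) in LIMITS.items():
        value = record.get(variable)
        if value is None:
            problems.append(f"{variable}: missing")
        elif not low <= value <= high:
            problems.append(f"{variable}: {value} outside {low} to {high}")
    return problems


# A misplaced decimal point (72.0 instead of 7.2) and a missing entry.
print(validate_record({"pH": 72.0, "temperature_C": 18.5}))
```

A record that fails the screen would be returned for checking rather than stored, so that invalid values are not combined with valid ones.
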

c) Data outliers
In water quality studies data values may be encountered which do not obviously
belong to the perceived measurement group as a whole. For example, in a set of
chemical measurements, many data may be found to cluster near some central value,
with fewer and fewer occurring as either much larger or much smaller values.
Infrequently, however, a value may arise which is far smaller or larger than the usual
values. A decision must be made as to whether or not this outlying datum is an
occasional, but appropriate, member of the measurement set or whether it is an outlier
which should be amended, or excluded from subsequent statistical analyses because
of the distortions it may introduce. An outlier is, therefore, a value which does not
conform to the general pattern of a data set.
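
The text above does not prescribe a screening rule. As one common convention, assumed here purely for illustration, values lying more than 1.5 interquartile ranges beyond the quartiles can be flagged for review. The data and the function name are hypothetical.

```python
import statistics


def flag_outliers(values, k=1.5):
    """Return values lying more than k interquartile ranges outside the quartiles."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical chloride results (mg/L); the last value is far from the others.
chloride = [4.1, 4.3, 4.0, 4.4, 4.2, 4.5, 4.3, 12.9]
print(flag_outliers(chloride))  # [12.9]
```

Flagging only identifies candidates. Whether a flagged value is amended, excluded or kept as an occasional but appropriate member of the set remains the decision described above.
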

d) Quality assurance of data


Quality assurance should be applied at all stages of data gathering and subsequent
handling. For the collection of field data, design of field records must be such that
sufficient, necessary information is recorded with as little effort as possible.
Pre-printed record sheets requiring minimal, and simple, entries are essential. Field
operations often have to take place in adverse conditions and the weather, for
example, can directly affect the quality of the recorded data and can influence the care
taken when filling in unnecessarily complex record sheets.

Analytical results must be verified by the analysts themselves checking, where
appropriate, the calculations, data transfers and certain ratios or ionic balances.
Laboratory managers must further check the data before they allow them to leave the
laboratory. Checks at this level should include a visual screening and, if possible, a
comparison with historical values of the same sampling site. The detection of
abnormal values should lead to re-checks of the analysis, related computations and
data transcriptions.
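
An ionic balance check can be sketched as follows. Concentrations of the major ions are converted from mg/L to milliequivalents per litre, and the percentage difference between the cation and anion sums is reported. The sample values and the choice of ions are assumptions for illustration; the acceptable size of the difference is for each laboratory to define.

```python
# Molar mass (g/mol) and charge magnitude of some major ions.
IONS = {
    "Ca": (40.08, 2), "Mg": (24.31, 2), "Na": (22.99, 1), "K": (39.10, 1),
    "HCO3": (61.02, 1), "Cl": (35.45, 1), "SO4": (96.06, 2),
}
CATIONS = ("Ca", "Mg", "Na", "K")
ANIONS = ("HCO3", "Cl", "SO4")


def to_meq(ion, mg_per_l):
    """Convert a concentration in mg/L to milliequivalents per litre."""
    molar_mass, charge = IONS[ion]
    return mg_per_l * charge / molar_mass


def ionic_balance_error(sample):
    """Percentage difference between cation and anion sums (meq/L)."""
    cations = sum(to_meq(ion, sample.get(ion, 0.0)) for ion in CATIONS)
    anions = sum(to_meq(ion, sample.get(ion, 0.0)) for ion in ANIONS)
    return 100.0 * (cations - anions) / (cations + anions)


# Hypothetical analysis results in mg/L.
sample = {"Ca": 40, "Mg": 12, "Na": 23, "K": 3.9,
          "HCO3": 170, "Cl": 35.5, "SO4": 24}
print(f"ionic balance error = {ionic_balance_error(sample):.1f} %")
```

A large imbalance suggests that at least one result, calculation or transcription should be re-checked before the data leave the laboratory.
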

The quality assurance of data storage procedures ensures that the transfer of field and
laboratory data and information to the storage system is done without introducing any
errors. It also ensures that all the information needed to identify the sample has been
stored, together with the relevant information about the sample site and the methods used.

6. PARAMETRIC AND NON-PARAMETRIC STATISTICS


Some examples of parametric and non-parametric basic statistics are worked through in
detail in the following sections for those without access to more advanced statistical aids.
A choice often has to be made between these statistical approaches and formal methods
are available to aid this choice. However, as water quality data are usually asymmetrically
distributed, using non-parametric methods as a matter of course is generally a reliable
approach, resulting in little or no loss of statistical efficiency. Future developments in the
applicability and scope of non-parametric methods will probably further support this
view. Nevertheless, some project objectives may still require parametric methods (usually
following data transformation), although these methods would usually only be used where
sufficient statistical advice and technology are available.

In principle, before a water quality data set is analysed statistically, its frequency
distribution should be determined. In reality, some simple analysis can be done without
going to this level of detail. It is usually good practice to graph out the data in a suitable
manner, as this helps the analyst get an overall concept of the shape of the data sets
involved.
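
Where no plotting software is available, even a text histogram gives a first impression of the frequency distribution. The sketch below is illustrative only; the function name, bin width and data are hypothetical.

```python
def text_histogram(values, bin_width):
    """Print counts per bin as rows of asterisks and return the counts."""
    start = (min(values) // bin_width) * bin_width
    counts = {}
    for v in values:
        index = int((v - start) // bin_width)
        counts[index] = counts.get(index, 0) + 1
    n_bins = max(counts) + 1
    for index in range(n_bins):
        low = start + index * bin_width
        stars = "*" * counts.get(index, 0)
        print(f"{low:6.1f} to {low + bin_width:6.1f} | {stars}")
    return [counts.get(index, 0) for index in range(n_bins)]


# Hypothetical concentrations (mg/L).
data = [1.2, 1.5, 1.8, 2.1, 2.3, 2.4, 2.9, 3.3, 3.8, 4.6, 6.2, 9.7]
bin_counts = text_histogram(data, bin_width=2.0)
```

For these values most results fall in the lower bins with a long tail towards high values, the asymmetric shape that is typical of water quality data.
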

a) Parametric statistics
Just as the water or biota samples taken in water quality studies are only a small fraction
of the overall environment, sets of water quality data to be analysed are considered only
samples of an underlying population data set, which cannot itself be analysed. The sample
statistics are, therefore, estimations of the population parameters. Hence, the sample
statistical mean is really only an estimate of the population parametric mean. The value of
the sample mean may vary from sample to sample, but the parametric mean is a particular
value. Ideally, sample statistics should be unbiased. This implies that the statistics from
repeat sample sets, regardless of sample size, will when averaged give the parametric value.
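
The idea of an unbiased statistic can be illustrated by simulation. In the sketch below, which is an illustration under stated assumptions rather than a proof, many small samples are drawn from a skewed population whose mean is assumed to be 10. Individual sample means vary widely, but their average lies close to the population mean.

```python
import random
import statistics

rng = random.Random(42)          # fixed seed so the run is repeatable
POPULATION_MEAN = 10.0           # assumed parametric mean of the population


def sample_mean(size):
    """Mean of one sample drawn from a right-skewed (exponential) population."""
    return statistics.fmean(
        rng.expovariate(1 / POPULATION_MEAN) for _ in range(size)
    )


sample_means = [sample_mean(5) for _ in range(20000)]
average_of_means = statistics.fmean(sample_means)

print(f"smallest sample mean:        {min(sample_means):.2f}")
print(f"largest sample mean:         {max(sample_means):.2f}")
print(f"average of all sample means: {average_of_means:.2f}")
```
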

By making presumptions about the data frequency distribution of the population data set,
statistics have been devised which have the property of being unbiased. Because a
frequency distribution has been presumed, it has also been possible to design procedures
which test quantitatively hypotheses about the data set. All such statistics and tests are,
therefore, termed parametric, to indicate their basis in a presumed, underlying data
frequency distribution. This also places certain requirements on data sets before it is valid
to use a particular procedure on them. Parametric statistics are powerful in hypothesis
testing wherever parametric test requirements are met.

b) Non-parametric statistics
Since most water quality data sets do not meet the requirements mentioned above, and many
cannot be made to do so by transformation, alternative techniques which do not make
frequency distribution assumptions are usually preferable. These tests are termed
non-parametric to indicate their freedom from any restrictive presumptions as to the underlying,
theoretical data population. The range (maximum value to minimum value) is an example of
a traditional non-parametric statistic. Recent developments have been to provide additional
testing procedures to the more traditional descriptive statistics.

Non-parametric methods can have several advantages over the corresponding parametric
methods:

They require no assumptions about the distribution of the population,

Results are resistant to distortion by outliers and missing data,

They can be used to compute statistics even from censored data,

They are easier to compute,

They are intuitively simpler to understand, and

Some can be effective with small samples.
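
The point about censored data can be illustrated with the median. Results below the detection limit have unknown values but a known rank position, below all quantified results, so the median can still be found whenever it falls among the quantified results. The sketch below assumes a single detection limit and marks censored results with None; the function name and data are hypothetical.

```python
def censored_median(values):
    """Median of data in which None marks a result below the detection limit.

    Assumes one detection limit, so every censored result ranks below every
    quantified result. Returns None when the median cannot be determined
    exactly because it falls on a censored result.
    """
    censored = sum(1 for v in values if v is None)
    quantified = sorted(v for v in values if v is not None)
    ordered = [None] * censored + quantified
    n = len(ordered)
    mid = n // 2
    middle = [ordered[mid]] if n % 2 == 1 else [ordered[mid - 1], ordered[mid]]
    if any(v is None for v in middle):
        return None
    return sum(middle) / len(middle)


# Hypothetical results (mg/L) with a detection limit of 0.05 mg/L.
results = [None, None, 0.08, 0.12, 0.05, 0.20, None]
print(censored_median(results))  # 0.05
```

The mean, by contrast, cannot be computed from such data without first substituting some value for each censored result.
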

Non-parametric tests are likely to be more powerful than parametric tests in hypothesis
testing when even slight non-normality exists in the data set, which is the usual case in water
quality studies.