
# 26 Statistical Models

The development of mathematical models based on fundamentals of atmospheric
chemistry and physics has been discussed in Chapter 25. These models are essential tools
in tracking emissions from many sources, their atmospheric transport and transformation,
and finally their contribution to concentrations at a given location (receptor). A number
of factors may often limit the application of these mathematical models, including the need
for spatially resolved, time-dependent emission inventories and meteorological fields.
In some cases an alternative to the use of atmospheric chemical transport models is
available. It is possible to attack the source contribution identification problem in
reverse order, proceeding from concentrations at a receptor site backward to responsible
emission sources. The corresponding tools, named receptor models, attempt to relate
measured concentrations at a given site to their sources without reconstructing the
dispersion patterns of the material. Sometimes, receptor and atmospheric chemical
transport models are used synergistically. For example, receptor models can refine the
input emission information used by atmospheric chemical transport models.
In the first part of this chapter a number of receptor modeling approaches will be
discussed. These models are used for apportionment of the contributions of each
source, identification of sources and their emission composition, and for determination
of the spatial distribution of emission fluxes from a group of sources. In the
second part, we will develop the tools needed to analyze the statistical character of
air quality data.

## 26.1 RECEPTOR MODELING METHODS

Receptor models are based on measured mass concentrations and the use of appropriate
mass balances. For example, assume that the total concentration of particulate iron
measured at a site can be considered to be the sum of contributions from a number of
independent sources

$$\mathrm{Fe_{total}} = \mathrm{Fe_{soil}} + \mathrm{Fe_{auto}} + \mathrm{Fe_{coal}} + \cdots \tag{26.1}$$

where $\mathrm{Fe_{total}}$ is the measured iron concentration, $\mathrm{Fe_{soil}}$ and $\mathrm{Fe_{auto}}$ are the concentrations
contributed by soil emissions and automobiles, and so on. Let us start from a rather simple
scenario illustrating the major concepts used in receptor modeling.

Atmospheric Chemistry and Physics: From Air Pollution to Climate Change, Second Edition, by John H. Seinfeld
and Spyros N. Pandis. Copyright © 2006 John Wiley & Sons, Inc.

Source Apportionment. Assume that for a rural site the measured PM10 concentration
is 32 µg m⁻³, containing 2.58 µg m⁻³ Si and 3.084 µg m⁻³ Fe. The two major
sources contributing to the location's particulate concentration are a coal-fired power
plant and soil-related dust. Analysis of the emissions of these sources indicates that the
soil contains 200 mg(Si) g⁻¹ (20% of the total emissions) and 32 mg(Fe) g⁻¹ (3.2% of
the total emissions), while the particles emitted by the power plant contain
10 mg(Si) g⁻¹ (1%) and 150 mg(Fe) g⁻¹ (15%). Neglecting Si and Fe contributions
from other sources, we have

$$\mathrm{Si_{total}} = \mathrm{Si_{soil}} + \mathrm{Si_{power}} \tag{26.2}$$

$$\mathrm{Fe_{total}} = \mathrm{Fe_{soil}} + \mathrm{Fe_{power}} \tag{26.3}$$

If $S$ and $P$ are the total aerosol contributions (in µg m⁻³) from dust and the power plant
to the PM10 concentration at the receptor, then

$$\mathrm{PM_{10}} = S + P + E \tag{26.4}$$

where E is the contribution from any additional sources. If the composition of the
particles does not change during their transport from the sources to the receptor,
then, using the initial composition of the emissions, we obtain

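$$\begin{aligned} \mathrm{Si_{soil}} &= 0.2\,S, & \mathrm{Fe_{soil}} &= 0.032\,S \\ \mathrm{Si_{power}} &= 0.01\,P, & \mathrm{Fe_{power}} &= 0.15\,P \end{aligned} \tag{26.5}$$

and therefore

$$\begin{aligned} \mathrm{Si_{total}} &= 0.2\,S + 0.01\,P \\ \mathrm{Fe_{total}} &= 0.032\,S + 0.15\,P \end{aligned} \tag{26.6}$$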

The preceding is an algebraic system of two equations with two unknowns, the
contributions of the two sources, S and P, to the receptor aerosol concentration. The
solution of the system using the measured Si and Fe concentrations is
$S = 12$ µg m⁻³ and $P = 18$ µg m⁻³. Using (26.4), we also find that $E = 2$ µg m⁻³,
and therefore the power plant is contributing 56.2%, the dust 37.5%, and the
unknown sources 6.3% to the PM10 of the specific location. Recall that we have
implicitly assumed that the unknown sources contribute negligible Si or Fe to the
levels measured at the location.
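
The arithmetic of this example can be checked with a few lines of Python. The following is a minimal sketch, not part of the original text: the function name `apportion` and the constant names are illustrative, and the measured Fe concentration is taken as 3.084 µg m⁻³, the value that is consistent with the source profiles and with the stated solution $S = 12$ and $P = 18$ µg m⁻³.

```python
# Source profiles as mass fractions, taken from the example:
# soil dust is 20% Si and 3.2% Fe; power plant particles are 1% Si and 15% Fe.
A_SI_SOIL, A_SI_POWER = 0.200, 0.010
A_FE_SOIL, A_FE_POWER = 0.032, 0.150


def apportion(si_total, fe_total, pm10):
    """Solve Si = 0.2 S + 0.01 P and Fe = 0.032 S + 0.15 P by Cramer's rule.

    Returns the soil (S), power plant (P), and unexplained (E) contributions,
    all in the same units as the inputs (here ug m^-3).
    """
    det = A_SI_SOIL * A_FE_POWER - A_SI_POWER * A_FE_SOIL
    soil = (si_total * A_FE_POWER - A_SI_POWER * fe_total) / det
    power = (A_SI_SOIL * fe_total - A_FE_SOIL * si_total) / det
    other = pm10 - soil - power  # E in (26.4)
    return soil, power, other


# Assumed measured values (Fe = 3.084 is an assumption, see the lead-in).
soil, power, other = apportion(si_total=2.58, fe_total=3.084, pm10=32.0)

# Percentage of PM10 attributed to each term.
shares = {name: 100.0 * value / 32.0
          for name, value in (("soil", soil), ("power", power), ("other", other))}
print(soil, power, other, shares)
```

Only the two source profiles and the receptor composition enter the calculation; no meteorological or emission-rate information is used.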

This example describes a simple scenario but demonstrates the utility of receptor
modeling. One can calculate the contribution of several sources to the atmospheric
concentrations at a given location with knowledge of only source and receptor compositions.
No information regarding meteorology, topography, location, and magnitude of sources is
necessary.
A general mathematical framework can be developed for solution of problems similar
to the example above. Suppose that for a given area there are m sources and n species.
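If $a_{ij}$ is the fraction of chemical species $i$ in the particulate emissions from source $j$, then
the composition of sources can be described by a matrix $\mathbf{A}$. For the conditions of the
previous example

$$\mathbf{A} = \begin{pmatrix} a_{\mathrm{Si,soil}} & a_{\mathrm{Si,power}} \\ a_{\mathrm{Fe,soil}} & a_{\mathrm{Fe,power}} \end{pmatrix} = \begin{pmatrix} 0.2 & 0.01 \\ 0.032 & 0.15 \end{pmatrix}$$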

Let $c_i$ be the concentration (in µg m⁻³) of element $i$ ($i = 1, 2, \ldots, n$) at a specific site, and
let $f_{ij}$ be a fraction representing any modification to the source composition $a_{ij}$ due to
atmospheric processes (e.g., gravitational settling) that occur between the source and the
receptor points. Then $f_{ij} a_{ij}$ will be the fraction of species $i$ in the particulate concentrations
from source $j$ at the receptor. If $s_j$ is the total contribution (in µg m⁻³) of the particles from
source $j$ to the particulate concentration at the receptor site, we can express the
concentration of element $i$ at the site as

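$$c_i = \sum_{j=1}^{m} f_{ij}\, a_{ij}\, s_j, \qquad i = 1, 2, \ldots, n \tag{26.7}$$

For the two-source example above,

$$\begin{aligned} c_{\mathrm{Fe}} &= f_{\mathrm{Fe,soil}}\, a_{\mathrm{Fe,soil}}\, s_{\mathrm{soil}} + f_{\mathrm{Fe,power}}\, a_{\mathrm{Fe,power}}\, s_{\mathrm{power}} \\ c_{\mathrm{Si}} &= f_{\mathrm{Si,soil}}\, a_{\mathrm{Si,soil}}\, s_{\mathrm{soil}} + f_{\mathrm{Si,power}}\, a_{\mathrm{Si,power}}\, s_{\mathrm{power}} \end{aligned} \tag{26.8}$$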

Usually fij is assumed equal to unity, thus assuming that the source signature aij is not
modified by processes (reactions, removal, etc.) occurring during atmospheric transport
between source and receptor. In this case we simply have

$$c_i = \sum_{j=1}^{m} a_{ij}\, s_j, \qquad i = 1, 2, \ldots, n \tag{26.9}$$

Thus the concentration of each chemical element at a receptor site becomes a linear
combination of the contributions of each source to the particulate matter at that site. Given
the chemical composition of the ambient sample $c_i$ and the source emission signatures $a_{ij}$,
(26.9) can then be solved to provide the source contributions $s_j$.
If there are k ambient aerosol samples, then let cik be the concentration of element i in
the sample k. The source contributions will in general be different from sampling period to
sampling period depending on wind direction, emission strength, and so on. Equation
(26.9) can then be written in a more general form for the k samples as

$$c_{ik} = \sum_{j=1}^{m} a_{ij}\, s_{jk} \tag{26.10}$$

where $s_{jk}$ is the concentration (in µg m⁻³) of material from source $j$ collected in
sample $k$. A number of approaches based on (26.10) have been used to develop our
understanding of source-receptor relationships for nonreactive species in an airshed.
These methods include the chemical mass balance (CMB) used for source apportionment,
the principal-component analysis (PCA) used for source identification, and the
empirical orthogonal function (EOF) method used for identification of the location and
strengths of emission sources. Detailed reviews of the variations of these basic
treatments are given by Watson (1984), Henry et al. (1984), Cooper and Watson (1980), Watson
et al. (1981), Macias and Hopke (1981), Dattner and Hopke (1982), Pace (1986), Watson
et al. (1989), Gordon (1980, 1988), Stevens and Pace (1984), Hopke (1985, 1991), and
Javitz et al. (1988).

## 26.1.1 Chemical Mass Balance (CMB)

The CMB model combines the chemical and physical characteristics of particles or
gases measured at the sources and the receptors to quantify the source contributions
to the receptor (Winchester and Nifong 1971; Miller et al. 1972). CMB is a method
for the solution of the set of equations (26.9) to determine the unknown $s_j$. The source
profiles aij, that is, the fractional amount of the species in the emissions from each
source type, and the receptor concentrations, with appropriate uncertainty estimates,
serve as input to the CMB model. We start by analyzing the case where one
particulate sample is available. The first assumption of CMB is that all sources
contributing to the measured concentrations ci in the receptor have been identified.
Each measured concentration $c_i$ can then be expressed as the sum of the true value $\tilde{c}_i$
and a random error $e_i$:

$$c_i = \tilde{c}_i + e_i, \qquad i = 1, 2, \ldots, n \tag{26.11}$$

It is assumed in CMB that the measurement errors $e_i$ are random, uncorrelated, and
normally distributed about a mean value of zero. These errors can be characterized
statistically by the standard deviation $\sigma_i$ of their normal distributions.
For an initial guess of source contributions sj, the predicted concentrations pi for all
elements are given by

$$p_i = \sum_{j=1}^{m} a_{ij}\, s_j \tag{26.12}$$

If $c_i$ are the corresponding measured elemental concentrations, we would like to minimize
the "distance" between the measurements $c_i$ and the predictions $p_i$. This distance can be
expressed by the sum

$$\sum_{i=1}^{n} (c_i - p_i)^2$$

Because of the measurement uncertainty, no choice of $s_j$ values will result in perfect
agreement between predictions and observations. The measurement uncertainty depends
on the element, and to account for these different degrees of uncertainty, $1/\sigma_i^2$ are used
as weighting factors. Summarizing, one needs to minimize

$$\xi^2 = \sum_{i=1}^{n} \frac{(c_i - p_i)^2}{\sigma_i^2} \tag{26.13}$$

by choosing appropriate values of the contributions $s_j$. Note that by using the weighting
factors $1/\sigma_i^2$, elements with large uncertainties contribute less to the $\xi^2$ function than
elements with smaller uncertainties. Combining (26.12) and (26.13), we obtain

$$\xi^2 = \sum_{i=1}^{n} \frac{1}{\sigma_i^2} \left( c_i - \sum_{j=1}^{m} a_{ij}\, s_j \right)^2 \tag{26.14}$$

where $n$ is the number of species and $m$ is the number of sources. The solution approach is
to minimize the value of $\xi^2$ with respect to each of the $m$ coefficients $s_j$, yielding a set of $m$
simultaneous equations with $m$ unknowns ($s_1, s_2, \ldots, s_m$). This is the common multiple
regression analysis problem. The solution is the vector $\mathbf{s}$ of source contributions given by

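$$\mathbf{s} = [\mathbf{A}^T \mathbf{W} \mathbf{A}]^{-1} \mathbf{A}^T \mathbf{W} \mathbf{c} \tag{26.15}$$

where $\mathbf{A}$ is the $n \times m$ source matrix with the source compositions $a_{ij}$, $\mathbf{W}$ is the $n \times n$
diagonal matrix with elements of the weighting factors, $w_{ii} = 1/\sigma_i^2$, $\mathbf{A}^T$ is the $m \times n$
transpose of $\mathbf{A}$, $\mathbf{c}$ is the vector with the measurements of the $n$ elements, and $\mathbf{s}$ is the vector
with the $m$ source contributions. Note that $[\mathbf{A}^T \mathbf{W} \mathbf{A}]$ is an $m \times m$ square matrix so it can be
inverted.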
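
The weighted least-squares solution (26.15) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the CMB software itself: the helper names `solve_linear` and `cmb_weighted` are hypothetical, the third row of the profile matrix (Al) and all uncertainties are made-up values, and the normal equations are solved by Gaussian elimination rather than by forming the inverse explicitly.

```python
def solve_linear(M, b):
    """Solve M x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(M[r]) + [b[r]] for r in range(n)]
    for col in range(n):
        # Bring the largest remaining entry of this column onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for k in range(col, n + 1):
                aug[r][k] -= factor * aug[col][k]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x


def cmb_weighted(A, c, sigma):
    """Return s solving [A^T W A] s = A^T W c with w_ii = 1 / sigma_i**2."""
    n, m = len(A), len(A[0])
    w = [1.0 / sig ** 2 for sig in sigma]
    normal = [[sum(A[i][j] * w[i] * A[i][l] for i in range(n)) for l in range(m)]
              for j in range(m)]
    rhs = [sum(A[i][j] * w[i] * c[i] for i in range(n)) for j in range(m)]
    return solve_linear(normal, rhs)


# Hypothetical profile matrix: n = 3 species (rows), m = 2 sources (columns).
A = [[0.200, 0.010],   # Si
     [0.032, 0.150],   # Fe
     [0.090, 0.020]]   # Al (illustrative values, not from the text)
true_s = [12.0, 18.0]
# Error-free synthetic measurements, so the fit should return true_s.
c = [sum(a * s for a, s in zip(row, true_s)) for row in A]
sigma = [0.1, 0.1, 0.05]  # assumed measurement standard deviations

s_hat = cmb_weighted(A, c, sigma)
print(s_hat)
```

With error-free data any choice of positive weights returns the same contributions; the weights matter only when the measurements contain errors and $n > m$.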
The solution of the receptor problem using (26.15) considers uncertainties in the
measurements $c_i$ but neglects the inherent uncertainty in the source compositions $a_{ij}$. Let
us denote by $\sigma_{a_{ij}}$ the standard deviation of a determination of the fraction $a_{ij}$ of element $i$ in
the emissions of source $j$. The solution can then be calculated by an expression analogous
to (26.15) (Watson 1979; Hopke 1985):

$$\mathbf{s} = [\mathbf{A}^T \mathbf{V} \mathbf{A}]^{-1} \mathbf{A}^T \mathbf{V} \mathbf{c} \tag{26.16}$$

Here $\mathbf{V}$ is the diagonal matrix with elements

$$v_{ii} = \frac{1}{\sigma_i^2 + \sum_{j=1}^{m} \sigma_{a_{ij}}^2\, s_j^2} \tag{26.17}$$

The unknown source contributions $s_j$ are included in the elements of the $\mathbf{V}$ matrix, and
therefore an iterative solution of (26.16) is necessary. The first step is to assume that $\sigma_{a_{ij}} = 0$,
solve (26.16) directly, and calculate the first approximation of $s_j$. Then $v_{ii}$ can be calculated
from (26.17) and a second approximation is found. If this approach converges, the solution is
found. This approach using (26.16) and (26.17) is known as the effective variance method.

Footnote 1: The transpose of an $n \times m$ matrix $\mathbf{A}$, denoted by $\mathbf{A}^T$, is simply the $m \times n$ matrix obtained by interchanging all the rows and columns.
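
The iteration just described can be sketched as follows. This is a minimal illustration restricted to two sources, not the implementation used in any CMB package: the function names are hypothetical, the profile uncertainties and the noisy concentrations are made-up values, and the weights are assumed to take the usual effective-variance form $v_{ii} = 1/(\sigma_i^2 + \sum_j \sigma_{a_{ij}}^2 s_j^2)$.

```python
def weighted_fit_two_sources(A, c, v):
    """Solve the 2x2 normal equations [A^T V A] s = A^T V c (two sources only)."""
    n = len(c)
    m00 = sum(v[i] * A[i][0] * A[i][0] for i in range(n))
    m01 = sum(v[i] * A[i][0] * A[i][1] for i in range(n))
    m11 = sum(v[i] * A[i][1] * A[i][1] for i in range(n))
    b0 = sum(v[i] * A[i][0] * c[i] for i in range(n))
    b1 = sum(v[i] * A[i][1] * c[i] for i in range(n))
    det = m00 * m11 - m01 * m01
    return [(b0 * m11 - m01 * b1) / det, (m00 * b1 - m01 * b0) / det]


def effective_variance(A, sig_a, c, sig_c, tol=1e-9, max_iter=100):
    """Iterate (26.16) with weights recomputed from the current s estimate."""
    n = len(c)
    # First step: sigma_a = 0, so the weights reduce to 1 / sigma_c**2.
    s = weighted_fit_two_sources(A, c, [1.0 / sc ** 2 for sc in sig_c])
    for iteration in range(1, max_iter + 1):
        v = [1.0 / (sig_c[i] ** 2
                    + sum((sig_a[i][j] * s[j]) ** 2 for j in range(2)))
             for i in range(n)]
        s_new = weighted_fit_two_sources(A, c, v)
        if max(abs(a - b) for a, b in zip(s_new, s)) < tol:
            return s_new, iteration
        s = s_new
    raise RuntimeError("effective variance iteration did not converge")


# Hypothetical inputs: 3 species, 2 sources (all values are illustrative).
A = [[0.200, 0.010], [0.032, 0.150], [0.090, 0.020]]
sig_a = [[0.020, 0.002], [0.005, 0.020], [0.010, 0.004]]  # profile std devs
c = [2.60, 3.05, 1.45]        # noisy measurements near the s = (12, 18) case
sig_c = [0.10, 0.10, 0.05]    # measurement std devs

s_ev, n_iter = effective_variance(A, sig_a, c, sig_c)
print(s_ev, n_iter)
```

Because the weights depend only weakly on the contributions, a handful of iterations is normally enough in a well-conditioned problem such as this one.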
The major assumptions used by the CMB model are

1. Compositions of source emissions are constant.
2. Species included are not reactive.
3. All sources contributing significantly to the receptor have been included in the
calculations.
4. There is no relationship among the source uncertainties.
5. The number of sources is less than or equal to the number of species.
6. Measurement uncertainties are random, uncorrelated, and normally distributed.

These assumptions are fairly restrictive and may be difficult to satisfy for most CMB
applications. When they are not satisfied, the CMB predictions may be unrealistic (e.g.,
negative contributions) or may include significant uncertainties.
The application of CMB to an area poses a number of difficulties in addition to the
assumptions of the method. Let us assume that a particulate sample has been collected in the
area of interest and its elemental composition has been determined. The first issue that one
needs to address is which sources should be included in the model. If an emission inventory
exists for the region, it can be used to determine the major sources. The second issue is
which source profiles should be used. Profiles used by studies in other areas may be
applicable to only that specific source. For example, the emission fingerprint of a power
plant in Ohio may not be representative of a power plant in Texas. Local sources of road and
soil dust are usually different from location to location. To complicate things even further,
emission profiles often change with time. For example, motor vehicle emission composition
has changed dramatically in the last 40 years with the introduction of new fuels (unleaded
gasoline), new engines, and control technologies. Uncertainties or errors in the CMB results
can be reduced noticeably by obtaining source profile measurements that correspond to the
period of the ambient measurements (Glover et al. 1991). It is clearly essential for the CMB
application to know the area that is to be modeled (Hopke 1985).
When multiple samples are available CMB should be applied to each sample separately
and then the results can be averaged (Hopke 1985). This approach, even if more time-
consuming, is more accurate than the CMB application to the averaged measurements.
Information is generally lost during averaging of sample composition data and cannot be
recovered later by CMB.

CMB Application to Central California PM. Chow et al. (1992) apportioned
source contributions to aerosol concentrations in the San Joaquin Valley of California.
The source profiles used for CMB application are shown in Table 26.1. The standard
deviations $\sigma_{a_{ij}}$ of the profiles (three or more samples were taken) are also included. To
account for secondary aerosol components in the CMB calculations, ammonium sulfate,
ammonium nitrate, sodium nitrate, and organic carbon were expressed as secondary
source profiles using the stoichiometry of each compound. The average elemental
concentrations observed at one of the receptors (Fresno, California, in 1988-1989)
are shown in Table 26.2. The ambient concentrations of some species (e.g., Ga, As, Y,
Mo, Ag) included in the source profiles were below the detection limits. These species
cannot be used for source apportionment in this specific case. Results of the source
apportionment using the CMB method (using CMB for each sample and then
averaging the results) are shown in Table 26.3. The major contributors to the
annual average PM10 concentrations that exceeded 50 µg m⁻³ were primary geologic
material and ammonium nitrate. For the PM2.5, secondary NH₄NO₃ and (NH₄)₂SO₄,
together with primary motor vehicle emissions and vegetative burning, were the major
contributors.

TABLE 26.1 Source Profiles (Percent of Mass Emitted) for Central California

| Chemical Species | Paved Road Dust | Vegetative Burning | Primary Crude Oil | Motor Vehicle | Limestone |
|---|---|---|---|---|---|
| NO₃⁻ | 0 ± 0.47 | 0.462 ± 0.123 | 0 ± 0.002 | 0 ± 0.001 | 0 ± 0.001 |
| SO₄²⁻ | 0.547 ± 1.17 | 1.423 ± 0.423 | 20.32 ± 4.24 | 3.11 ± 3.55 | 3.06 ± 0.3 |
| NH₄⁺ | 0 ± 0.008 | 0.0852 ± 0.057 | 0.0076 ± 0.005 | 0 ± 0.001 | 0 ± 0.001 |
| Na⁺ | 0.181 ± 0.055 | 0.143 ± 0.052 | 0.762 ± 0.399 | 0 ± 0.001 | 0 ± 0.001 |
| EC | 2.69 ± 1.44 | 15.89 ± 5.80 | 0 ± 0.072 | 54.15 ± 19.78 | 0 ± 0.001 |
| OC | 19.5 ± 4.67 | 44.60 ± 7.94 | 0.0894 ± 0.118 | 49.81 ± 24.15 | 0 ± 0.001 |
| Al | 9.34 ± 1.11 | 0.0019 ± 0.027 | 0 ± 0.009 | 0.077 ± 0.051 | 2.11 ± 0.21 |
| Si | 23.2 ± 2.62 | 0 ± 0.015 | 0.011 ± 0.016 | 0.957 ± 1.39 | 6.5 ± 0.65 |
| P | 0.304 ± 0.05 | 0 ± 0.022 | 0 ± 0.17 | 0.057 ± 0.02 | 0 ± 0.001 |
| S | 0.520 ± 0.17 | 0.521 ± 0.176 | 5.45 ± 0.39 | 1.037 ± 1.182 | 1.02 ± 0.1 |
| Cl | 0.163 ± 0.031 | 1.908 ± 0.64 | 0.024 ± 0.021 | 0.029 ± 0.02 | 0.46 ± 0.05 |
| K | 1.95 ± 0.28 | 3.993 ± 1.24 | 0.044 ± 0.054 | 0.008 ± 0.008 | 0.16 ± 0.04 |
| Ca | 2.98 ± 0.43 | 0.0659 ± 0.056 | 0.062 ± 0.005 | 0.072 ± 0.079 | 29.52 ± 2.95 |
| Ti | 0.499 ± 0.067 | 0.0009 ± 0.016 | 0.012 ± 0.002 | 0.001 ± 0.003 | 0.08 ± 0.04 |
| V | 0.0311 ± 0.008 | 0.0005 ± 0.007 | 0.823 ± 0.058 | 0.001 ± 0.002 | 0 ± 0.1 |
| Cr | 0.0299 ± 0.003 | 0 ± 0.0016 | 0.007 ± 0.025 | 0 ± 0.002 | 0 ± 0.01 |
| Mn | 0.106 ± 0.016 | 0.0007 ± 0.001 | 0.0056 ± 0.001 | 0.028 ± 0.024 | 0.05 ± 0.03 |
| Fe | 5.41 ± 0.88 | 0.0006 ± 0.001 | 0.2134 ± 0.022 | 0.001 ± 0.005 | 1.04 ± 0.1 |
| Co | 0.0059 ± 0.076 | 0.0001 ± 0.001 | 0.0185 ± 0.002 | 0 ± 0.001 | 0 ± 0.001 |
| Ni | 0.0111 ± 0.001 | 0.0001 ± 0.001 | 0.789 ± 0.093 | 0 ± 0.002 | 0 ± 0.1 |
| Cu | 0.02 ± 0.002 | 0.0001 ± 0.001 | 0.0009 ± 0.003 | 0.005 ± 0.003 | 0.02 ± 0.01 |
| Zn | 0.172 ± 0.026 | 0.0866 ± 0.036 | 0.260 ± 0.034 | 0.053 ± 0.028 | 0.1 ± 0.01 |
| Ga | 0.0003 ± 0.006 | 0 ± 0.0021 | 0.0132 ± 0.002 | 0.002 ± 0.002 | 0 ± 0.001 |
| As | 0.0014 ± 0.042 | 0.0002 ± 0.002 | 0.0006 ± 0.001 | 0.004 ± 0.012 | 0 ± 0.001 |
| Se | 0.0001 ± 0.002 | 0.0004 ± 0.001 | 0.0114 ± 0.002 | 0 ± 0.002 | 0 ± 0.001 |
| Br | 0.0095 ± 0.001 | 0.0096 ± 0.002 | 0.0003 ± 0.0002 | 0.264 ± 0.152 | 0.03 ± 0.01 |
| Sr | 0.0794 ± 0.006 | 0.0007 ± 0.001 | 0.0015 ± 0.0003 | 0 ± 0.003 | 0 ± 0.001 |
| Y | 0.0025 ± 0.004 | 0.0001 ± 0.001 | 0.0008 ± 0.0003 | 0 ± 0.004 | 0 ± 0.001 |
| Zr | 0.0091 ± 0.002 | 0 ± 0.0019 | 0.0006 ± 0.0004 | 0 ± 0.019 | 0 ± 0.001 |
| Mo | 0.0004 ± 0.006 | 0 ± 0.0033 | 0.0168 ± 0.002 | 0 ± 0.012 | 0 ± 0.001 |
| Ag | 0 ± 0.016 | 0.0003 ± 0.007 | 0.0002 ± 0.002 | 0 ± 0.016 | 0 ± 0.001 |
| Cd | 0.0015 ± 0.017 | 0.0007 ± 0.008 | 0.0006 ± 0.002 | 0 ± 0.02 | 0 ± 0.001 |
| In | 0.0030 ± 0.02 | 0.0001 ± 0.009 | 0.0009 ± 0.002 | 0 ± 0.026 | 0 ± 0.001 |
| Sn | 0.0037 ± 0.027 | 0 ± 0.012 | 0.0007 ± 0.003 | 0 ± 0.031 | 0 ± 0.001 |
| Sb | 0.0054 ± 0.03 | 0.0022 ± 0.014 | 0.0006 ± 0.003 | 0 ± 0.069 | 0 ± 0.001 |
| Ba | 0.064 ± 0.103 | 0.0095 ± 0.05 | 0.0013 ± 0.011 | 0 ± 0.129 | 0 ± 0.001 |
| La | 0.0142 ± 0.117 | 0.0016 ± 0.056 | 0.0041 ± 0.013 | 0 ± 0.236 | 0 ± 0.001 |
| Hg | 0.0015 ± 0.008 | 0 ± 0.0037 | 0 ± 0.0009 | 0 ± 0.002 | 0 ± 0.001 |
| Pb | 0.265 ± 0.032 | 0.004 ± 0.003 | 0 ± 0.0013 | 0.373 ± 0.207 | 0.27 ± 0.03 |

| Chemical Species | Marine | (NH₄)₂SO₄ | NH₄NO₃ | Secondary OC | NaNO₃ |
|---|---|---|---|---|---|
| NO₃⁻ | 0 ± 0.001 | 0 | 77.5 | 0 | 72.9 |
| SO₄²⁻ | 10.0 ± 4.0 | 72.7 | 0 | 0 | 0 |
| NH₄⁺ | 0 ± 0.001 | 27.3 | 22.5 | 0 | 0 |
| Na⁺ | 40.0 ± 4.0 | 0 | 0 | 0 | 27.1 |
| EC | 0 ± 0.001 | 0 | 0 | 0 | 0 |
| OC | 0 ± 0.001 | 0 | 0 | 100 | 0 |
| S | 3.3 ± 1.3 | 0 | 0 | 0 | 0 |
| Cl | 40.0 ± 10.0 | 0 | 0 | 0 | 0 |
| K | 1.4 ± 0.2 | 0 | 0 | 0 | 0 |
| Ca | 1.4 ± 0.2 | 0 | 0 | 0 | 0 |
| Br | 0.2 ± 0.05 | 0 | 0 | 0 | 0 |

Source: Chow et al. (1992).

TABLE 26.2 Annual (1988-1989) Aerosol Composition in Fresno, California

| Chemical Species | PM2.5, µg m⁻³ | PM10, µg m⁻³ |
|---|---|---|
| NO₃⁻ | 9.43 ± 11.43 | 10.26 ± 10.52 |
| SO₄²⁻ | 2.75 ± 1.32 | 3.20 ± 1.51 |
| NH₄⁺ | 4.04 ± 3.89 | 4.06 ± 3.85 |
| EC | 6.27 ± 5.68 | 6.73 ± 5.68 |
| OC | 8.05 ± 5.31 | 12.89 ± 7.66 |
| Al | 0.15 ± 0.18 | 2.94 ± 2.63 |
| Si | 0.38 ± 0.46 | 7.49 ± 6.31 |
| P | 0.013 ± 0.012 | 0.072 ± 0.067 |
| S | 1.12 ± 0.54 | 1.27 ± 0.76 |
| Cl | 0.17 ± 0.22 | 0.34 ± 0.37 |
| K | 0.28 ± 0.18 | 0.85 ± 0.53 |
| Ca | 0.072 ± 0.068 | 0.85 ± 0.64 |
| Ti | 0.016 ± 0.015 | 0.14 ± 0.12 |
| V | 0.0034 ± 0.0019 | 0.01 ± 0.008 |
| Cr | 0.0016 ± 0.002 | 0.0081 ± 0.0061 |
| Mn | 0.0073 ± 0.0058 | 0.035 ± 0.029 |
| Fe | 0.17 ± 0.16 | 1.48 ± 1.22 |
| Ni | 0.0023 ± 0.0024 | 0.0051 ± 0.0029 |
| Cu | 0.069 ± 0.064 | 0.0077 ± 0.095 |
| Zn | 0.069 ± 0.052 | 0.087 ± 0.066 |
| Se | 0.0016 ± 0.003 | 0.0019 ± 0.004 |
| Br | 0.017 ± 0.011 | 0.017 ± 0.008 |
| Sr | 0.0007 ± 0.0006 | 0.0043 ± 0.0035 |
| Zr | 0.0011 ± 0.0028 | 0.0031 ± 0.0029 |
| Ba | 0.013 ± 0.014 | 0.044 ± 0.034 |
| Pb | 0.051 ± 0.034 | 0.067 ± 0.034 |

Source: Chow et al. (1993).

TABLE 26.3 Estimated Annual Average Source Contributions (µg m⁻³) to PM10 and PM2.5 in Fresno, California, Based on CMB

| Source | PM10 | PM2.5 |
|---|---|---|
| Geologic | 31.78 | 2.26 |
| Motor vehicle | 6.80 | 9.24 |
| Vegetative burning | 5.10 | 5.92 |
| Primary crude oil | 0.29 | 0.25 |
| (NH₄)₂SO₄, secondary | 3.58 | 3.48 |
| NH₄NO₃, secondary | 10.39 | 12.35 |
| Seasalt (NaCl-NaNO₃) | 0.96 | 0.45 |
| OC, secondary | 0.07 | 0.36 |
| Calculated mass | 58.97 | 34.34 |
| Measured mass | 71.49 | 49.30 |

Source: Chow et al. (1992).

CMB Evaluation. A method often used to evaluate CMB is to use only selected
measured elements to estimate the source contributions and then to use the
remaining elements, together with their predicted concentrations, as a test of the analysis. For
example, Kowalczyk et al. (1982) used the CMB and nine elements (Na, V, Pb, Zn, Ca, Al,
Fe, Mn, As) to calculate contributions of seven sources to the Washington, DC, aerosol.
Each of the selected elements was characteristic of a source: Na for seasalt, V for fuel oil,
Pb for motor vehicles, Zn for refuse incineration, Ca for limestone, Al and Fe for coal and
soil, and As for coal. The authors used 130 samples from a network of 10 stations. Cr, Ni,
Cu, and Se were significantly underestimated, but the concentrations of the remaining
elements were successfully reproduced by CMB. Kowalczyk et al. (1982) repeated the
exercise using 9-30 marker elements and found little difference in the results as long as the key
elements (Pb, Na, and V) were included. They also observed that including some elements
(namely, Br and Ba) as markers gave erroneous results for several other elements.
The absolute accuracy of the CMB cannot be tested easily, because the true results are
unknown. However, artificial data sets can be created by assuming a realistic distribution of
sources, source strengths, and meteorology, simulating the scenario with a deterministic
transport model (see Chapter 25), and using CMB to apportion the source contributions to the
modeled concentrations. Gerlach et al. (1983) reported the results of such a test using a
typical city plan and 13 known sources. The results of the CMB application indicated that the
contributions of nine of the sources were accurately predicted (errors less than 20%) while
errors as much as a factor of 4 were found for the remaining four sources. The contributions
of the six most important sources were accurately predicted by CMB and the errors were
associated with sources of secondary importance.

CMB Resolution. A final issue that may complicate the application of the CMB to
ambient data sets is the existence of two sources with similar fingerprints or, more generally, a
source whose profile is a linear combination of other source profiles. This is called the
collinearity problem. If this is the case, then the matrix $[\mathbf{A}^T \mathbf{W} \mathbf{A}]$ used in (26.15) has two
columns that are nearly identical, or one column that is nearly a linear combination of several others. From a
mathematical point of view this matrix is close to singular, and the result of its inversion is extremely
sensitive to small errors. Often, if this is the case, the results of CMB are large positive and
negative source contributions. The simplest solution to this problem is identification of the
"offending" sources and elimination of one of them. Physically, because the sources are too
similar, it is difficult for CMB to quantify the contribution of each. Thus there are limits to
how far source contributions can be resolved even with almost perfect information; only
significantly different sources can be treated by CMB. Similar sources have to be combined
into a lumped source. Henry (1983) and Hopke (1985) have proposed algorithms that can be
used for the a priori identification of estimable sources and the estimable source
combinations that can be determined for a given source matrix.
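
The sensitivity caused by collinearity can be illustrated with a deliberately tiny case. The sketch below is an assumption-laden toy, not a CMB calculation: it uses two hypothetical sources with almost identical two-species profiles, solves the exactly determined 2×2 system by Cramer's rule instead of the weighted fit (26.15), and perturbs one measured concentration by 6%.

```python
def solve_two_sources(A, c):
    """Solve the 2x2 system A s = c by Cramer's rule."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    s1 = (c[0] * A[1][1] - A[0][1] * c[1]) / det
    s2 = (A[0][0] * c[1] - A[1][0] * c[0]) / det
    return s1, s2


# Hypothetical profiles (mass fractions): the two columns are nearly identical.
A = [[0.20, 0.21],   # species 1 in source 1 and source 2
     [0.10, 0.10]]   # species 2 in source 1 and source 2

true_s = (10.0, 10.0)
c_exact = [A[0][0] * true_s[0] + A[0][1] * true_s[1],
           A[1][0] * true_s[0] + A[1][1] * true_s[1]]

# A 6% error in the first measured concentration only.
c_noisy = [1.06 * c_exact[0], c_exact[1]]

s_clean = solve_two_sources(A, c_exact)
s_noisy = solve_two_sources(A, c_noisy)
print(s_clean, s_noisy)
```

In this toy case the total $s_1 + s_2$ remains 20 while the individual contributions move to about −14.6 and 34.6, which is why lumping similar sources into a single source is the usual remedy.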
We should note, once more, that during the derivation of (26.15) and (26.16) we have
assumed that the atmospheric transformation terms $f_{ij}$ are equal to unity [see also (26.17)].
Therefore these equations should not be applied to species that are produced or consumed
(e.g., sulfate) during transport from source to receptor. Gravitational settling is often
assumed not to modify aij (the elemental fractions of the source emissions) to a first
approximation, even if it changes the net concentrations of these elements. This
assumption is equivalent to assuming that all elements have the same size distribution.
Application of (26.15) to gaseous pollutants that react in the atmosphere is generally not
appropriate.

## 26.1.2 Factor Analysis

If the nature of the major sources influencing a particular receptor is unknown, statistical
factor analysis methods can be combined with ambient measurements to estimate the
source composition. Assuming that for a particular location several ambient particulate
samples are collected and analyzed for several elements, the resulting data will probably
include information about the fingerprints of the sources affecting the location.
Principal-component analysis (PCA) is one of the factor analysis methods used to
unravel the hidden source information from a rich ambient measurement data set. Factor
analysis models are mathematically complex, and their results are often difficult to
interpret.
Let us consider first a simple example given by Hopke (1985). A specific location,
without our knowing it, is heavily influenced by two sources: automobiles and a coal-fired
power plant. All samples are analyzed for aluminum (Al), lead (Pb), and bromine (Br). We
assume that Al is emitted only by the power plant, while lead and bromine (mass ratio
3:1) are emitted only by automobiles. Our samples, depending on the prevailing wind
direction, traffic intensity, and so on, will have different Al, Pb, and Br concentrations. A
three-dimensional plot of these concentrations (Figure 26.1) reveals little information
about the underlying sources. However, all the data points are actually located on the same
plane defined by the Al axis and the line $\mathrm{Br} = \tfrac{1}{3}\mathrm{Pb}$ (Figure 26.2). If the $z$-Al plane is
rotated, all the measurements can be plotted on a two-dimensional graph, with axes
defined by Al and $z$ (Figure 26.3). Note that the three-dimensional data set has collapsed to
a two-dimensional data set and the two axes correspond to the composition of the sources.
The Al and $\mathrm{Br} = \tfrac{1}{3}\mathrm{Pb}$ axes are the principal factors influencing the aerosol concentration
at the receptor.

FIGURE 26.1 Measured aerosol composition of three elements in seven samples for a site
influenced by automobiles (emitting Pb and Br) and a coal-fired power plant (emitting Al).

FIGURE 26.2 Plane passing through the data points of Figure 26.1. The $z$ axis is defined by
$\mathrm{Al} = 0$ and $\mathrm{Br} = \tfrac{1}{3}\mathrm{Pb}$.
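
The collapse from three dimensions to two can be reproduced numerically. The sketch below is illustrative only: the seven pairs of source contributions are invented, the source signatures are normalized over the three measured elements alone, and the helper names `to_plane` and `from_plane` are hypothetical.

```python
from math import sqrt

# Hypothetical source signatures over the three measured elements only.
# Automobiles emit Pb and Br in a 3:1 mass ratio; the power plant emits Al.
AUTO = (0.0, 0.75, 0.25)    # (Al, Pb, Br) per unit of automobile contribution
POWER = (1.0, 0.0, 0.0)     # (Al, Pb, Br) per unit of power plant contribution

# Invented (automobile, power plant) contributions for seven samples.
contributions = [(1.2, 0.4), (0.4, 2.0), (2.0, 1.0), (0.8, 1.6),
                 (1.6, 0.2), (0.2, 0.9), (1.0, 1.3)]

# Each sample is a linear combination of the two source signatures.
samples = [tuple(a * x + p * y for x, y in zip(AUTO, POWER))
           for a, p in contributions]


def to_plane(sample):
    """Map (Al, Pb, Br) to (Al, z), with z measured along the line Br = Pb/3."""
    al, pb, br = sample
    return al, sqrt(pb ** 2 + br ** 2)


def from_plane(al, z):
    """Recover (Al, Pb, Br) from the two in-plane coordinates."""
    return al, 3.0 * z / sqrt(10.0), z / sqrt(10.0)


plane_points = [to_plane(s) for s in samples]

# Largest reconstruction error over all samples and elements.
max_error = max(abs(u - v)
                for s, (al, z) in zip(samples, plane_points)
                for u, v in zip(s, from_plane(al, z)))
print(plane_points, max_error)
```

Two coordinates per sample reproduce all three measured concentrations, which is the sense in which two factors explain the data.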
If there are more than three aerosol species, then we need to work with higher
than three-dimensional spaces, and locating hyperplanes passing through (or close
to) all the data points becomes a complicated exercise. The first step in the procedure
is, of course, the collection of the data set, say, $k$ samples of $n$ aerosol species. These
species measurements are then analyzed for the calculation of the correlation
coefficients. If Al and Pb were two of the species measured, one would have
available the values ($c_{\mathrm{Al},1}, c_{\mathrm{Al},2}, \ldots, c_{\mathrm{Al},k}$) and ($c_{\mathrm{Pb},1}, c_{\mathrm{Pb},2}, \ldots, c_{\mathrm{Pb},k}$), where $c_{\mathrm{Al},i}$
and $c_{\mathrm{Pb},i}$ are the Al and Pb concentrations in the $i$th sample. The mean Al and Pb
values will be

$$\bar{c}_{\mathrm{Al}} = \frac{1}{k} \sum_{i=1}^{k} c_{\mathrm{Al},i}, \qquad \bar{c}_{\mathrm{Pb}} = \frac{1}{k} \sum_{i=1}^{k} c_{\mathrm{Pb},i} \tag{26.18}$$

FIGURE 26.3 Two-dimensional depiction of the data shown in Figures 26.1 and 26.2.

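The correlation coefficient around the mean between lead and aluminum, $r_{\mathrm{Pb,Al}}$, is then
defined as

$$r_{\mathrm{Pb,Al}} = \frac{\sum_{i=1}^{k} (c_{\mathrm{Pb},i} - \bar{c}_{\mathrm{Pb}})(c_{\mathrm{Al},i} - \bar{c}_{\mathrm{Al}})}{\left[ \sum_{i=1}^{k} (c_{\mathrm{Pb},i} - \bar{c}_{\mathrm{Pb}})^2 \sum_{i=1}^{k} (c_{\mathrm{Al},i} - \bar{c}_{\mathrm{Al}})^2 \right]^{1/2}} \tag{26.19}$$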

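and is a measure of the interrelationship between Pb and Al concentrations. If $\sigma_{\mathrm{Al}}$ and $\sigma_{\mathrm{Pb}}$ are
the standard deviations of the corresponding samples, the correlation coefficient is given by

$$r_{\mathrm{Pb,Al}} = \frac{1}{k-1} \sum_{i=1}^{k} \frac{(c_{\mathrm{Pb},i} - \bar{c}_{\mathrm{Pb}})(c_{\mathrm{Al},i} - \bar{c}_{\mathrm{Al}})}{\sigma_{\mathrm{Pb}}\, \sigma_{\mathrm{Al}}} \tag{26.20}$$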

If the two are completely unrelated, rPb,Al = 0. If the two variables are strongly related to
each other (positively or negatively), they have a high correlation coefficient (positive or
negative). It is important to stress at this point that high correlation coefficients do not
necessarily imply a cause-and-effect relationship. Variables can be related to each other
indirectly through a common cause. For example, let us assume that for a given receptor
Pb and Al concentrations are highly correlated; high Al values are always accompanied by
high Pb values and vice versa. One would be tempted to conclude that they have a common
source. However, the same correlation can also be a result of the fact that the lead source (a
major traffic artery) is next to the Al source (a coal-fired power plant), and depending on
the wind direction, their concentrations in the receptor vary proportionally to each other.
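
The correlation coefficient defined in (26.19) can be computed directly from its definition. The sketch below uses invented concentrations for five samples; the function name `correlation` is illustrative, and the Pb and Br series are constructed with a fixed 3:1 ratio so that their correlation is exactly one.

```python
from math import sqrt


def correlation(x, y):
    """Correlation coefficient around the mean of two equally long series."""
    k = len(x)
    mean_x, mean_y = sum(x) / k, sum(y) / k
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Invented concentrations (ug m^-3) in five samples.
pb = [0.9, 0.3, 1.5, 0.6, 1.2]
br = [0.3, 0.1, 0.5, 0.2, 0.4]   # Br = Pb / 3 in every sample
al = [1.0, 2.5, 0.5, 2.0, 1.5]   # varies independently of the Pb:Br ratio

r_pb_br = correlation(pb, br)
r_pb_al = correlation(pb, al)
print(r_pb_br, r_pb_al)
```

A value of $r$ near ±1, as for Pb and Br here, shows only that the two series vary together; as noted above, it does not by itself establish a common source.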
Correlation coefficients can be calculated for each pair of elements and a correlation
matrix (n x n) can be constructed. Note that because

$$r_{\mathrm{Pb,Al}} = r_{\mathrm{Al,Pb}}, \qquad r_{\mathrm{Pb,Pb}} = r_{\mathrm{Al,Al}} = 1 \tag{26.21}$$

the matrix will be symmetric, and the elements in the diagonal will be equal to unity. A
correlation matrix for nine elements measured at Whiteface Mountain, New York, is
shown in Table 26.4.

TABLE 26.4 Correlation Matrix for Elements Measured in Whiteface Mountain, New York

| | Na | K | Sc | Mn | Fe | Zn | As | Br | Sb |
|---|---|---|---|---|---|---|---|---|---|
| Na | 1 | 0.48 | 0.03 | 0.22 | 0.21 | 0.14 | 0.13 | 0.04 | 0.08 |
| K | 0.48 | 1 | 0.46 | 0.57 | 0.61 | 0.03 | 0.30 | 0.28 | 0.15 |
| Sc | 0.03 | 0.46 | 1 | 0.82 | 0.72 | -0.26 | 0.64 | -0.06 | 0.07 |
| Mn | 0.22 | 0.57 | 0.82 | 1 | 0.88 | -0.08 | 0.65 | -0.03 | 0.13 |
| Fe | 0.21 | 0.61 | 0.72 | 0.88 | 1 | -0.03 | 0.46 | 0.08 | 0.09 |
| Zn | 0.14 | 0.03 | -0.26 | -0.08 | -0.03 | 1 | -0.12 | 0.04 | 0.62 |
| As | 0.13 | 0.30 | 0.64 | 0.65 | 0.46 | -0.12 | 1 | -0.18 | 0.07 |
| Br | 0.04 | 0.28 | -0.06 | 0.03 | 0.08 | 0.04 | -0.18 | 1 | 0.27 |
| Sb | 0.08 | 0.15 | 0.07 | 0.13 | 0.09 | 0.62 | 0.07 | 0.27 | 1 |

Source: Parekh and Husain (1981).

The correlation matrix $\mathbf{C}$ is the basis of PCA. Let $\lambda_1, \lambda_2, \ldots, \lambda_n$ be its
eigenvalues and $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_n$ the corresponding eigenvectors. Each eigenvector
corresponds to a particular aerosol composition influencing the receptor. Each aerosol sample
collected at the receptor can be expressed as a linear combination of these eigenvectors.
The corresponding eigenvalues can be viewed as a measure of the importance of each
eigenvector for the receptor. For example, for the correlation matrix for Whiteface
Mountain, the eigenvalues are 3.6, 1.83, 1.16, 1.01, 0.54, 0.32, 0.31, 0.16, and 0.078.
Eigenvectors corresponding to eigenvalues close to zero are neglected because they are
usually artifacts due to either the numerical or the sampling and analysis procedures. The
decision of which eigenvectors should be retained is not straightforward. In practice,
eigenvectors corresponding to eigenvalues larger than one are usually retained. Following
this rule of thumb, the four corresponding eigenvectors are shown in Table 26.5.

TABLE 26.5 Normalized Eigenvectors for Whiteface Mountain, New York

| Element | Eigenvector 1 | Eigenvector 2 | Eigenvector 3 | Eigenvector 4 |
|---|---|---|---|---|
| Na | 0.174 | 0.262 | 0.403 | 0.696 |
| K | 0.381 | 0.222 | 0.397 | 0.083 |
| Sc | 0.455 | -0.182 | -0.148 | -0.193 |
| Mn | 0.499 | 0.044 | -0.104 | -0.035 |
| Fe | 0.468 | 0.014 | 0.027 | -0.095 |
| Zn | -0.05 | 0.591 | -0.394 | 0.188 |
| As | 0.375 | -0.158 | -0.309 | 0.096 |
| Br | 0.024 | 0.357 | 0.504 | -0.617 |
| Sb | 0.081 | 0.588 | -0.377 | -0.191 |

Source: Parekh and Husain (1981).
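
The eigenvalue rule of thumb can be checked against the numbers quoted for Whiteface Mountain. The short sketch below only post-processes the nine eigenvalues given in the text; the variable names are illustrative, and the "fraction of variance" interpretation relies on the fact that the eigenvalues of an $n \times n$ correlation matrix sum to its trace, $n$.

```python
# Eigenvalues of the 9 x 9 Whiteface Mountain correlation matrix (from the text).
eigenvalues = [3.6, 1.83, 1.16, 1.01, 0.54, 0.32, 0.31, 0.16, 0.078]

# Rule of thumb: keep eigenvectors whose eigenvalue exceeds one.
retained = [lam for lam in eigenvalues if lam > 1.0]

# The eigenvalues should sum to about n = 9 (the trace of the correlation
# matrix); the small excess reflects rounding of the reported values.
total = sum(eigenvalues)

# Share of the total variance carried by the retained factors.
explained = sum(retained) / total
print(len(retained), total, explained)
```

Four factors are retained, carrying roughly 84% of the total variance. The fourth eigenvalue (1.01) only just passes the threshold, which illustrates why the retention decision is not clear-cut.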
At this stage one is left with a set of vectors corresponding to the specific elemental
combinations influencing the receptor (Table 26.5). These factors often have little physical
significance. For example, the factors in Table 26.5 are difficult to interpret and so a further
transformation is performed. The remaining eigenvectors are rotated in space in such a way
as to maximize the number of values that are close to unity (Hopke 1985). This rotation,
known as "varimax rotation" (Kaiser 1959), is a rather controversial step because it is an
arbitrary procedure. In any case, the final eigenvectors are usually interpreted as fingerprints
of emission sources based on the chemical species with which they are highly correlated.
The rotated eigenvectors are shown in Table 26.6, and their physical significance can be
discussed even if there are only a few elements available. For example, factor 1 containing
elements coming from crustal sources is probably soil and other crustal materials. Factor 2,
characterized by Zn and Sb, apparently results from refuse incineration. Factor 3, containing
Na and K, could be marine aerosol with road salt. Bromine on factor 4 indicates a motor
vehicle component. One should note that the above solution is not unique. Hopke (1985)
reanalyzed the same data using a different rotation and concluded that there were five factors
affecting the location with composition different from the one in Table 26.6.
The major assumptions of PCA modeling are

1. The composition of emission sources is constant.

2. Chemical species used in PCA do not interact with each other, and their
concentrations are linearly additive.

## TABLE 26.6 Rotated Factors for Whiteface Mountain, New York

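| Element | Factor 1 | Factor 2 | Factor 3 | Factor 4 |
|---|---|---|---|---|
| Na | 0.047 | 0.086 | 0.943 | -0.071 |
| K | 0.478 | 0.020 | 0.628 | 0.436 |
| Sc | 0.901 | -0.128 | -0.015 | 0.104 |
| Mn | 0.913 | 0.006 | 0.210 | 0.103 |
| Fe | 0.815 | -0.006 | 0.251 | 0.251 |
| Zn | -0.152 | 0.897 | 0.144 | -0.068 |
| As | 0.803 | 0.021 | 0.038 | -0.274 |
| Br | 0.134 | 0.126 | 0.141 | 0.890 |
| Sb | 0.756 | 0.887 | -0.039 | 0.227 |

Source: Parekh and Husain (1981).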

3. Measurement errors are random and uncorrelated.

4. The variability of the concentrations from sample to sample is dominated by
changes in source contributions and not by the measurement uncertainty or changes
in the source composition.
5. The effect of processes that affect all sources equally (e.g., atmospheric dispersion)
is much smaller than the effect of processes that influence individual sources (e.g.,
wind direction, emission rate changes).
6. There are many more samples than source types for a statistically meaningful
calculation.
7. Eigenvector rotations (if used) are physically meaningful.

Examples of PCA application can be found in Henry and Hidy (1979, 1981), Wolff and
Korsog (1985), Cheng et al. (1988), Henry and Kim (1989), Koutrakis and Spengler
(1987), and Zeng and Hopke (1989). PCA provides a rather qualitative description of
source fingerprints, which can be used later as input to a CMB model or a similar source
apportionment tool. For further details the reader is referred to Hopke (1985).

## 26.1.3 Empirical Orthogonal Function Receptor Models

Receptor models can also be used together with spatial distribution of measurements to
estimate the spatial distribution of emission fluxes. The empirical orthogonal function
(EOF) method is one of the most popular models for this. Henry et al. (1991) improved the
EOF method by using wind direction in addition to spatially distributed concentration
measurements as input. We describe this approach below.
Let us assume that during a given period the ground-level concentration field of an
atmospheric species can be written as

$$c(x, y, t) = \sum_{i=1}^{N} a_i(t)\,\phi_i(x, y) \tag{26.22}$$

where $\phi_i(x, y)$ are N orthogonal functions and $a_i(t)$ are time weighting functions. The
functions $\phi_i(x, y)$ include the information about the sources of the species while the time
weighting functions represent its atmospheric transport. Neglecting vertical concentration
variations and dispersion, the atmospheric diffusion equation for this species is

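$$\frac{\partial c}{\partial t} + u\,\frac{\partial c}{\partial x} + v\,\frac{\partial c}{\partial y} = Q \tag{26.23}$$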

where u(x,y, t) and v(x, y, t) are the wind components and Q(x, y, t) is the net source term
for the species including emissions and removal processes. Using (26.22)

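$$\frac{\partial c}{\partial x} = \sum_{i=1}^{N} a_i(t)\,\frac{\partial \phi_i}{\partial x} \tag{26.24}$$

and

$$\frac{\partial c}{\partial y} = \sum_{i=1}^{N} a_i(t)\,\frac{\partial \phi_i}{\partial y} \tag{26.25}$$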

Substituting (26.24) and (26.25) into (26.23) and integrating from time zero to the end of
the sampling period T, one gets

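$$\bar{Q}(x, y) = \frac{c(x, y, T) - c(x, y, 0)}{T} + \sum_{i=1}^{N}\left[\bar{u}_i(x, y)\,\frac{\partial \phi_i}{\partial x} + \bar{v}_i(x, y)\,\frac{\partial \phi_i}{\partial y}\right] \tag{26.26}$$

where $\bar{Q}(x, y) = \frac{1}{T}\int_0^T Q(x, y, t)\,dt$ is the average source strength spatial distribution and

$$\bar{u}_i(x, y) = \frac{1}{T}\int_0^T u(x, y, t)\,a_i(t)\,dt \tag{26.27}$$

$$\bar{v}_i(x, y) = \frac{1}{T}\int_0^T v(x, y, t)\,a_i(t)\,dt \tag{26.28}$$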

Equation (26.26) suggests that the spatial distribution of the source strength of the species
can be found using concentration and wind data if the EOFs $\phi_i(x, y)$ and the time
weighting functions $a_i(t)$ can be calculated (Henry et al. 1991).
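
A discrete version of this calculation can be sketched as follows. The sketch assumes that the transport equation has the advective form $\partial c/\partial t + u\,\partial c/\partial x + v\,\partial c/\partial y = Q$, so that averaging over the sampling period gives a storage term $[c(x,y,T) - c(x,y,0)]/T$ plus the gradients of the EOFs weighted by the time averages of $u\,a_i$ and $v\,a_i$. The function name, the array layout, the regular grid, and the use of centered differences and the trapezoidal rule are all assumptions made for illustration.

```python
import numpy as np

def mean_source_strength(phi, a, u, v, dx, dy, dt):
    """Time-averaged source field from EOFs, time weights, and winds.

    phi  : (N, ny, nx) EOFs on a regular grid
    a    : (N, nt) time weighting functions at equally spaced times
    u, v : (nt, ny, nx) wind components at the same times
    """
    n_funcs, n_times = a.shape
    period = dt * (n_times - 1)

    # Storage term [c(T) - c(0)] / T, with c = sum_i a_i * phi_i
    c_start = np.tensordot(a[:, 0], phi, axes=1)
    c_end = np.tensordot(a[:, -1], phi, axes=1)
    q_bar = (c_end - c_start) / period

    for i in range(n_funcs):
        # np.gradient returns the derivative along axis 0 (y), then axis 1 (x)
        dphi_dy, dphi_dx = np.gradient(phi[i], dy, dx)
        weight = a[i][:, None, None]
        ua = u * weight
        va = v * weight
        # Trapezoidal time averages of u * a_i and v * a_i
        u_bar = 0.5 * (ua[1:] + ua[:-1]).sum(axis=0) * dt / period
        v_bar = 0.5 * (va[1:] + va[:-1]).sum(axis=0) * dt / period
        q_bar = q_bar + u_bar * dphi_dx + v_bar * dphi_dy
    return q_bar
```

As a consistency check, a single EOF that increases linearly with x, a constant time weight, and a steady wind u = 2 with v = 0 give a uniform mean source strength of 2 in the same units, since the storage term vanishes.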
The EOFs can be found with the following procedure. Assume that a given species is
measured simultaneously at s sites during n sampling periods. The measurements can be
used to construct the $n \times s$ concentration matrix C. The first column of C contains all the
measurements at the first site, the second column the measurements at the second site, and
so on. Equation (26.22) can be viewed as the continuous form of the singular value
decomposition of matrix C

$$C = U B V^{T} \tag{26.29}$$

where U is an $n \times n$ orthogonal matrix whose columns are the eigenvectors of $CC^{T}$, V is
the matrix of the eigenvectors of $C^{T}C$, and B is a diagonal matrix of singular values
(square roots of the eigenvalues). In this case there are at most s nonzero singular values. If a
singular value is zero or very small, the corresponding eigenvectors can be removed from
the matrices U and V, leaving us with N significant eigenvectors. This step is similar to
selection of the principal components during PCA. Let $U^*$ be the $n \times N$ matrix with the
significant eigenvectors, $B^*$ the $N \times N$ diagonal matrix with the remaining singular values,
and $V^*$ the corresponding $s \times N$ matrix. Then let

$$\tau = B^* V^{*T} \tag{26.30}$$

The rows of $\tau$ (each of length s, one value per site) are then the discrete EOFs and the
columns of $U^*$ are the discrete time functions. These discrete EOFs can be interpolated in space to obtain the continuous EOFs
using, for example, $1/r^2$ interpolation. Then (26.26) can be applied to obtain the spatial
distribution of the source strength.
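
The procedure above can be sketched with a standard SVD routine. The function names, the relative tolerance used to discard small singular values, and the small offset that avoids division by zero at a site location are assumptions made for illustration. In this layout `tau` has shape N × s, so row i of `tau` holds EOF i evaluated at the s sites.

```python
import numpy as np

def discrete_eofs(C, rel_tol=1e-8):
    """Truncated SVD of the n x s concentration matrix, C = U B V^T."""
    U, sing, Vt = np.linalg.svd(C, full_matrices=False)
    # Keep only singular values that are not negligible relative to the largest
    keep = sing > rel_tol * sing[0]
    U_star = U[:, keep]                    # n x N discrete time functions
    tau = sing[keep, None] * Vt[keep]      # N x s, equal to B* V*^T
    return U_star, tau

def inverse_distance_squared(sites, values, points, eps=1e-12):
    """Interpolate site values onto points using 1/r^2 weights."""
    diff = points[:, None, :] - sites[None, :, :]
    dist_sq = (diff ** 2).sum(axis=2)
    weights = 1.0 / (dist_sq + eps)        # eps avoids division by zero at a site
    return (weights @ values) / weights.sum(axis=1)

# Synthetic example: two spatial patterns observed at 4 sites over 6 periods
rng = np.random.default_rng(0)
C = (np.outer(rng.random(6), rng.random(4))
     + np.outer(rng.random(6), rng.random(4)))
U_star, tau = discrete_eofs(C)
print(U_star.shape, tau.shape)
```

Each row of `tau` can then be passed as `values` to `inverse_distance_squared`, together with the site coordinates, to obtain an EOF on a regular grid.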
Note that while PCA is applied to many samples from the same site taken over a
number of sampling periods, the EOF operates on many samples from many sites taken
over the same period.
Assumptions implicit in the use of the EOF are

1. The fluxes of spatially distributed species are linearly additive.

2. Species are homogeneously distributed in the mixed layer.
3. Measurement errors are random and uncorrelated.
4. The number of sampling sites exceeds the number of sources.
5. Measurements are located in areas where there are significant spatial concentration
gradients.

The last two assumptions are rarely met in practice. The spatial resolution of the EOF is
limited by the number of observation sites and the distance between them. Sudden changes
of wind speed and direction during a sampling period often result in problems.
Applications of the EOF have been presented by Gebhart et al. (1990), Ashbaugh et al.
(1984), Wolff et al. (1985), and Henry et al. (1991). Henry et al. (1991) compared
simulated two-dimensional data generated by a simple dispersion model and the above-
described version of the EOF using simple wind fields. One of the comparisons is shown in
Table 26.7. For this comparison a sampling site was located in each square and the model
was able to reproduce the location of the two sources. However, the source strength is
underpredicted as a result of numerical diffusion to the neighboring cells.

## TABLE 26.7 Predicted Emission Location and Strength^a

-0.5 -0.5 -0.8 -1.5 -1.6 -1.3 0.8 2.9 3.7
-0.3 -0.2 -1.0 -0.8 -0.5 -0.1 2.4 3.0 4.2
-0.6 -0.5 -1.9 2.2 9.4 9.7 5.2 2.9 3.3
-1.6 -2.0 -2.6 7.3 28.5 31.3 10.8 0.3 -0.9
-0.6 -1.2 -0.6 8.1 25.5 28.4 11.4 1.2 -1.5
2.0 -4.6 5.7 5.5 6.2 7.0 4.1 0.4 -0.2
3.9 14.3 15.5 5.5 -1.0 -1.4 -1.3 -1.5 -0.4
4.0 13.3 14.2 5.4 0.1 -0.4 0.4 0.6 0.4
2.1 3.1 3.6 2.2 1.2 0.7 0.7 0.8 0.7
^a Actual emissions in the shaded cells are 25 and 50 emission units. For the rest the actual emissions are zero.
Source: Henry et al. (1991).

## 26.2 PROBABILITY DISTRIBUTIONS FOR AIR POLLUTANT CONCENTRATIONS

Air pollutant concentrations are inherently random variables because of their dependence
on the fluctuations of meteorological and emission variables. We already have seen from
Chapter 18 that the concentration predicted by atmospheric diffusion theories is the mean
concentration $\langle c \rangle$. There are important instances in analyzing air pollution where the
ability simply to predict the theoretical mean concentration $\langle c \rangle$ is not enough. Perhaps the
most important situation in this regard is in ascertaining compliance with ambient air
quality standards. Air quality standards are frequently stated in terms of the number of
times per year that a particular concentration level can be exceeded. In order to estimate
whether such an exceedance will occur, or how many times it will occur, it is necessary to
consider statistical properties of the concentration. One object of this chapter is to develop
the tools needed to analyze the statistical character of air quality data.
Hourly average concentrations are the most common way in which urban air pollutant
data are reported. These hourly average concentrations may be obtained from an
instrument that actually requires a 1-h sample in order to produce a data point or by
averaging data taken by an instrument having a sampling time shorter than 1 h. If we deal
with 1 h average concentrations, those concentrations would be denoted by $c_\tau(t_i)$, where
τ = 1 h. For convenience we will omit the subscript τ henceforth; however, it should
be kept in mind that concentrations are usually based on a fixed averaging time. There
are 8760 h in a year, so that if we are interested in the statistical distribution of the 1 h
average concentrations measured at a particular location in a region, we will deal with a
sample of 8760 values of the random variable c.
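
Counting exceedances in such a sample is straightforward, as the following minimal sketch shows. The function name, the threshold, and the synthetic lognormal series standing in for a year of measurements are hypothetical and used only for illustration.

```python
import numpy as np

HOURS_PER_YEAR = 365 * 24  # 8760 one-hour averages in a (non-leap) year

def count_exceedances(hourly_conc, level):
    """Return the number and the fraction of hourly values above a level."""
    hourly_conc = np.asarray(hourly_conc, dtype=float)
    exceeds = hourly_conc > level
    return int(exceeds.sum()), float(exceeds.mean())

# Synthetic one-year record (hypothetical values, arbitrary concentration units)
rng = np.random.default_rng(0)
hourly = rng.lognormal(mean=np.log(0.05), sigma=0.6, size=HOURS_PER_YEAR)

n_exceed, fraction = count_exceedances(hourly, level=0.12)
print(f"{n_exceed} exceedances out of {hourly.size} hours ({fraction:.2%})")
```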
The random variable is characterized by a probability density function p(c), such that
p(c) dc is the probability that the concentration c of a particular species at a particular location
will lie between c and c + dc. Our first task will be to identify probability density functions
(pdfs) that are appropriate for representing air pollutant concentrations. Once we have
determined a form for p(c), we can proceed to calculate the desired statistical properties of c.
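
Before a functional form is chosen, p(c) can be estimated from data with a normalized histogram. The following is a minimal sketch; the function name, the number of bins, and the synthetic lognormal sample used in place of measurements are assumptions made for illustration.

```python
import numpy as np

def empirical_pdf(conc, bins=30):
    """Histogram estimate of p(c): density per bin and the bin edges."""
    density, edges = np.histogram(conc, bins=bins, density=True)
    return density, edges

# Synthetic sample of 8760 hourly values (hypothetical, arbitrary units)
rng = np.random.default_rng(1)
sample = rng.lognormal(mean=np.log(0.05), sigma=0.6, size=8760)

density, edges = empirical_pdf(sample)
widths = np.diff(edges)

# density * width approximates p(c) dc, the probability of falling in each bin
bin_probability = density * widths
print(f"total probability: {bin_probability.sum():.3f}")
```

With `density=True` the histogram is scaled so that the bin probabilities sum to one, which is the discrete counterpart of requiring that p(c) integrate to unity.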
If we plot the frequency of occurrence of a concentration versus concentration, we
would expect to obtain a histogram like that sketched in Figure 26.4a. As the number of
data points increases, the histogram should tend to a smooth curve such as that in
Figure 26.4b. Note that very low and very high concentrations occur only rarely. We recall
that aerosol size distributions exhibited a similar overall behavior; there are no particles of

FIGURE 26.4 Hypothetical distributions of atmospheric concentrations: (a) histogram and (b)
continuous distribution.