
Structural Safety, 9 (1990) 97-116

Elsevier


PROBABILITY ESTIMATION AND INFORMATION PRINCIPLES


R. Baker
Faculty of Civil Engineering, Technion I.I.T., Haifa, Israel
(Received February 27, 1989; accepted in revised form December 12, 1989)

Key words: entropy; estimation; probability density.

ABSTRACT

The paper presents a procedure for the estimation of probability density functions. The procedure is based on information theory concepts. It combines Jaynes' maximum entropy formalism with Akaike's approach to the selection of the optimal member of a hierarchy of models. In the present version of the procedure, prior information about the random variable under consideration is specified in terms of lower and upper bounds on the possible values of the variable. We deal with continuous random variables defined on a finite range, with continuous probability density functions and finite (but unknown) moments. The procedure automatically adjusts the level of sophistication of the resulting probability assignment according to the nature and quantity of the available data, thus preventing us from using too sophisticated models (models with too many adjustable parameters) if the data base is not extensive enough. In a sense the procedure provides an effective combination of 'subjective probability', characterizing the uncertainty of the situation in the case of a small sample, and 'objective probability', representing measured relative frequencies when the data base is large.

1. INTRODUCTION

The value of almost any quantity of interest in civil engineering is to a certain extent uncertain, and hence may be considered as a random variable (r.v.). The data base in most cases is quite limited, and the distribution type of the r.v. is not known. This situation makes the application of classical estimation procedures extremely difficult. Under such conditions of limited information it appears reasonable to base the probability estimation on concepts borrowed from information theory. The classical maximum entropy formalism (MEF) due to Jaynes [1] is the most familiar example of such an approach. Using Jaynes' principle it is possible, under certain circumstances, to derive the form of the probability distribution of a r.v. X. Unfortunately the classical formulation of this formalism presumes the availability of a set of population moments, and therefore is not applicable to most problems of engineering interest. In order to overcome this problem we combine in this work Jaynes' MEF with an estimation procedure due to Akaike [2,3]. This combination results in a two stage estimation procedure which appears particularly suitable for the type of problem often encountered in civil engineering. The natural consistency of the proposed procedure is due to the fact that both Jaynes' and Akaike's approaches are based on information theory concepts, and in fact both are defined here as two different aspects of minimization of the Kullback-Leibler relative (cross) entropy [4,5].

An information theory approach to probability estimation is essentially a Bayesian approach, and therefore requires a precise definition of the 'prior information', i.e. information that is available prior to the stage of testing. In the present work it is assumed that the prior information consists of the following elements:
- It is assumed that the r.v. X is bounded in a finite interval Xmin ≤ X ≤ Xmax, where Xmax and Xmin are known a priori. The motivation for this choice is that an engineer should be able to place a lower and an upper bound on every quantity of interest prior to running any tests. The exact values of Xmax and Xmin are probably less important than the fact that parameters of engineering interest are usually finite, and therefore can be linearly transformed to some standard range such as 0-1.
- The analysis is restricted to continuous r.v. with continuous probability density functions. It is possible in principle to extend the same approach to discontinuous density functions, but such an extension is not considered in the present work.
- It is assumed that all the moments of the r.v. under consideration are finite.

2. THE CLASSICAL MAXIMUM ENTROPY FORMALISM

2.1. The concept of entropy


In information theory the term entropy represents a quantitative measure of the 'information content' of a probability distribution function. Within this framework, flat uniform distributions are considered less informative than narrow peaked ones. The reason for this statement can be established as follows:
- Consider a random variable X which is characterized by a uniform probability distribution p(x), with p(x) = const. in some given range Xmin ≤ X ≤ Xmax. Such a distribution does not help us in predicting the result of a future test on X; initially we had no reason to prefer any x value in the given range and this situation does not change if we receive the 'message' that X is characterized by a uniform distribution. Consequently the uniform distribution p(x) = const. 'carries' essentially no additional information beyond the a priori given range.
- Assume next that p(x) = δ(x − c), where δ is the Dirac delta 'function' and Xmin ≤ c ≤ Xmax. Such a distribution is considered to be highly informative since it makes it possible to state with almost absolute certainty that a future test on X will yield the value x = c. A probability statement which makes it possible to predict the result of a future test without actually performing the test obviously carries a large amount of information.


- The amount of information of every other probability distribution can be considered as an intermediate state between the uniform distribution and the Dirac 'spike'. Thus it is possible to rank all probability density functions according to their 'information content' (entropy). Evidently these notions have meaning only if probability is considered a measure of uncertainty (Bayesian probability) rather than measured relative frequency (objective probability). This view of entropy is essentially due to Jaynes [1].

2.2. Kullback's entropy functional


For a discrete r.v. a "natural" measure of information content is the Shannon entropy [6,7]:

H[P(x_i)] = -\sum_{i=1}^{n} P(x_i) \ln[P(x_i)]    (1)
where P(x_i) is the probability that the r.v. X will have the value x_i, and n is the number of possible values of X. It is not difficult to verify that the function H[P(x_i)] is a consistent measure of information in the sense of the previous section. It can be shown that Shannon's entropy cannot be defined for a continuous r.v., since the value of this measure approaches infinity in the limiting process of transformation from the discrete to the continuous case. The difference of entropy between two distributions remains however finite, and therefore meaningful. The Kullback-Leibler [4] information functional H[p_1(x), p_2(x)] can be considered as a measure of the entropy difference between two probability assignments p_1(x) and p_2(x). This functional is defined as:

H[p_1(x), p_2(x)] = \int_D p_1(x) \ln\left[\frac{p_1(x)}{p_2(x)}\right] dx    (2)
where D represents the range of the r.v. X. The most important properties of Kullback's entropy functional are:
- H[p_1(x), p_2(x)] is invariant to all monotonic transformations of the r.v. X.
- H[p_1(x), p_2(x)] ≥ 0 for all possible distribution functions p_1(x) and p_2(x).
- H[p_1(x), p_2(x)] = 0 implies that p_1(x) = p_2(x).
The proof of these relations is quite simple (see, e.g. [8,9]). In view of these properties H[p_1(x), p_2(x)] may be considered to measure the 'distance' between the two distributions p_1(x) and p_2(x). Formally, however, H is only a semi-metric since H[p_1(x), p_2(x)] ≠ H[p_2(x), p_1(x)], and the triangle inequality is not satisfied.
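The 'distance' interpretation is easy to illustrate numerically. The short Python sketch below is an illustration added here, not part of the original paper; the two densities on the finite range 0-1 are arbitrary choices and the integral of eqn. (2) is approximated by the trapezoidal rule.

```python
import numpy as np
from scipy.stats import beta

# Hypothetical example: two densities on the finite range [0, 1].
x = np.linspace(1e-6, 1 - 1e-6, 2001)     # integration grid
p1 = beta(2.0, 5.0).pdf(x)                # a peaked density
p2 = np.ones_like(x)                      # the uniform density

def kl_entropy(pa, pb, x):
    """Kullback-Leibler functional H[pa, pb] = integral of pa*ln(pa/pb) dx, eqn (2)."""
    return np.trapz(pa * np.log(pa / pb), x)

print(kl_entropy(p1, p2, x))   # > 0: the peaked density is 'far' from the uniform one
print(kl_entropy(p1, p1, x))   # = 0: a density is at zero 'distance' from itself
print(kl_entropy(p2, p1, x))   # differs from H[p1, p2]: H is only a semi-metric
```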

2.3. Jaynes' maximum entropy formalism


Jaynes' maximum entropy principle [1] may be stated as follows:

The best (minimally prejudiced) assignment of probabilities is the one which minimizes the entropy subject to the satisfaction of the constraints imposed by the available information. Thus, taking H[p_1(x), p_2(x)] as the relevant information measure, Jaynes' principle implies that the best probability assignment p*(x) is the solution of the following minimization problem:


Minimize H[p(x), p_0(x)] = \int_D p(x) \ln\left[\frac{p(x)}{p_0(x)}\right] dx    (3)

subject to the satisfaction of the following requirements:

p(x) ≥ 0    ∀ x ∈ D    (4)

\int_D p(x) dx = 1    (5)

I_k[p(x)] = 0,    k = 1, 2, ..., K    (6)
where p_0(x) is the 'prior distribution' of X and I_k[p(x)] = 0, k = 1, 2, ..., K is a set of K relations (constraints) representing the available information. In other words Jaynes' prescription is to take the best probability assignment p*(x) as 'close' as possible to the prior distribution p_0(x) without however contradicting the available physical information (eqn. (6)) and the general requirements of any legitimate density function (eqns. (4) and (5)). Viewed from this perspective Jaynes' principle is so obvious that it appears almost trivial. The significance of Jaynes' contribution is however the realization that Shannon's and Kullback's entropies are the appropriate measures of the distance between probability distributions in the discrete and continuous cases, respectively (footnote 1). The most common choice for the set of constraints represented by eqn. (6) is to assume the availability of K population moments μ_k, k = 1, ..., K (footnote 2). In that case eqn. (6) has the form:

I_k[p(x)] = \int_D x^k p(x) dx - \mu_k = 0,    k = 1, ..., K    (7)
The solution of the minimization problem defined by eqns. (3)-(5) and (7) is well known [9,11], and it may be written in the form:

p_K(x|\mu) = p_0(x) \exp\left[Z_0 + \sum_{k=1}^{K} \lambda_k x^k\right]    (8)

where λ_k, k = 1, ..., K, is a set of Lagrangian multipliers associated with the physical constraints (7), Z_0 is the multiplier associated with the normalization constraint (5), and μ = (μ_1, ..., μ_K) is a vector of given population moments. The notation p_K(x|μ) is used in order to emphasize that eqn. (8) corresponds to a given vector of population moments. Substituting the solution (8) into the constraint eqn. (5) results in:

\exp(-Z_0) = \int_D p_0(x) \exp\left[\sum_{k=1}^{K} \lambda_k x^k\right] dx    (9)

This equation shows that Z_0 is fixed by the vector of Lagrangian multipliers λ = (λ_1, λ_2, ..., λ_K). Notice that if K = 0, i.e. when there is no physical information except the given range, then all the λ_k are zero, so that exp(Z_0) = 1, and we get p_K(x|μ) = p_0(x) as expected.
Footnote 1: Jaynes defined the entropy as -H[p_1(x), p_2(x)], and maximized this quantity, hence the name 'maximum entropy formalism'.
Footnote 2: Some other choice for I_k is possible, e.g. information in the form of known population fractiles was utilized as a constraint in [10].


Substituting eqns. (8) and (9) into eqn. (7), it is possible to show [9,11] that the values of the multipliers λ may be obtained as the solution of the following system of nonlinear equations:

\frac{\int_D x^k p_0(x) \exp\left[\sum_{j=1}^{K} \lambda_j x^j\right] dx}{\int_D p_0(x) \exp\left[\sum_{j=1}^{K} \lambda_j x^j\right] dx} = \mu_k,    k = 1, ..., K    (10)
Considering the maximum entropy distribution p_K(x|μ) as a 'model', it is not difficult to see that the L.H.S. of eqn. (10) represents the 'theoretical' model moments μ_k^M. Hence this equation may be written simply as μ^M = μ. In other words the MEF determines the Lagrangian multipliers λ using the requirement that the model moments be equal to the given population moments. Finally, substituting (8) into (3) and using the constraint eqns. (5) and (7), the entropy of the optimal distribution p_K(x|μ) is given by:

H[p_K(x|\mu), p_0(x)] = Z_0 + \sum_{k=1}^{K} \lambda_k \mu_k    (11)
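To make the classical formalism concrete, the following Python sketch (an illustration, not part of the original paper; the prior, the standard range and the two population moment values are assumed) solves the nonlinear system (10) for the multipliers λ with a uniform prior on 0-1 and evaluates the resulting maximum entropy density of eqn. (8).

```python
import numpy as np
from scipy.optimize import fsolve

# Assumed setting: uniform prior p0(x) = 1 on the standard range [0, 1],
# and two given population moments; the numbers are illustrative only.
x = np.linspace(0.0, 1.0, 2001)
mu = np.array([0.40, 0.20])                    # given mu_1, mu_2

def exponent(lam, t):
    """The exponent of eqn (8) without Z_0: sum_k lam_k * t**k."""
    return sum(l * t**(k + 1) for k, l in enumerate(lam))

def model_moments(lam):
    """L.H.S. of eqn (10) for a uniform prior."""
    w = np.exp(exponent(lam, x))
    z = np.trapz(w, x)                          # equals exp(-Z_0), eqn (9)
    return np.array([np.trapz(x**(k + 1) * w, x) / z for k in range(len(lam))])

lam = fsolve(lambda l: model_moments(l) - mu, np.zeros(len(mu)))
Z0 = -np.log(np.trapz(np.exp(exponent(lam, x)), x))
p_K = np.exp(Z0 + exponent(lam, x))             # the maximum entropy density, eqn (8)
```

With all λ_k set to zero the density reduces to the uniform prior, in agreement with the remark following eqn. (9).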

2.4. Discussion of the classical approach


Originally Jaynes derived his formalism as a tool for the solution of problems in statistical physics [12]. In this type of problem X is a microscopic variable such as the velocity of a molecule, and μ represents the result of a measurement on a macroscopic scale such as pressure or temperature. Since pressure and temperature can be related to the average of the molecular velocity distribution, the parameters μ are fixed by the measurements and can indeed be considered as given. In that case it is possible to utilize eqns. (8)-(10) in order to establish the probability distribution of the molecular velocities. A similar situation occurs sometimes in a geotechnical context. Consider for example the case when the local shear strength of a clay layer is modeled as a random variable. Let the experimental information be the result of a loading test on a pile in the clay layer. Neglecting the deterministic trend and the spatial correlation between the values of the shear strength at different points along the pile, the measured pile capacity can be related to the average shear strength of the clay by a direct relation involving only the geometry of the soil-pile interface. In that case the average (first moment μ_1) is fixed by the measurement, and the classical approach applies. In general, however, the r.v. X and the measurements are defined on the same scale, and therefore the result of the testing program is a set of N measured values {x_1, ..., x_N} rather than a set of population moments. Given the N data points, it is possible to calculate N independent sample moments μ̂_k, k = 1, ..., N, as:

\hat{\mu}_k = \frac{1}{N} \sum_{j=1}^{N} (x_j)^k,    k = 1, ..., N    (12)

In applications (e.g. [8,11,13-16]) it is frequently assumed that the unknown population moments μ_k can be replaced by the known sample moments μ̂_k. Such a procedure raises two questions:
- Merely replacing μ_k by μ̂_k in eqn. (11) results in a situation where the entropy, which is presumably a measure of information, does not depend on the number of tests used in order to establish the sample moments, an obviously unacceptable result. Jaynes [17], who briefly considered this question, is of the opinion that "within the context of the problem μ is exact by definition." He also added that "... this is the number given to us, and our job is not to question it, but to do the best we can with it." Such an approach does not appear reasonable; in a typical situation population moments are not "given", and their values are usually not known with a high degree of reliability. It should be "our job" to analyze the effect of this uncertainty on the "predictive probability" which is the end product of the analysis.
- From a sample of size N it is possible to calculate N independent sample moments using eqn. (12). It is well known, however, that the high order moments calculated this way are extremely unreliable estimates of the corresponding population moments. As a result it is not clear how many moments should be included as constraints in the MEF when the available information is a sample of N measured values. To the best of our knowledge this question has never been considered before.
It is the purpose of the present work to address these two questions, and to present an approach that may open the way for the application of the MEF in situations where the available information consists of N measured sample values.

3. THE FAMILY OF MAXIMUM ENTROPY DISTRIBUTIONS


For unknown values of the population moments, the density p_K(x|μ) defined in eqn. (8) represents a family of distributions parametrized by the K unknown constants λ_k, k = 1, ..., K. Equations (9) and (10) show that μ = f(λ) and Z_0 = f(λ). It can be shown [18] that these relations are one-to-one and hence can be inverted. Consequently it is possible to consider the Lagrangian multipliers λ as the unknown parameters of the distribution, instead of the moments μ. From now on we will write p_K(x|λ) for the Kth order Maximum Entropy Family of Distributions (MEFD). As noted in [11], the MEFD is extremely 'elastic' and can model a wide variety of shapes. This is hardly surprising in view of the fact that p_K(x|λ) has the form of exp(Polynomial(x)), eqn. (8). A celebrated theorem by Weierstrass guarantees that on a finite domain every continuous function can be represented with any required degree of accuracy by a polynomial of sufficiently high order [19]. The fact that in eqn. (8) the polynomial appears in the exponent merely ensures that p_K(x|λ) ≥ 0, but does not impose any other restriction on the shape of the probability distribution of X. We may conclude therefore that restricting the search for a probabilistic model of X to members of the MEFD does not imply any loss of generality with respect to the shape of the probability distribution. Nevertheless, the MEFD does imply certain restrictions with respect to the nature of the r.v. X as compared with a general probability distribution. These restrictions are:
- The variational technique used for the solution of the minimization problem defined by eqns. (3)-(5), (7) is restricted to functions which are continuous and possess continuous first derivatives. Hence r.v. with discontinuous density functions are not modeled.
- The constraints of eqn. (7) used in the derivation of the MEFD presume that the moments of X do exist despite the fact that they are not known. This is not a trivial point, since distributions which do not possess moments are well known (e.g. the Cauchy distribution). For continuous distributions such 'pathological' behavior occurs only in infinite domains. Since the present work is geared towards finite domains, we do not consider this restriction to be a practical limitation on the range of validity of the approach.

4. AKAIKE'S INFORMATION CRITERION

Using Jaynes' maximum entropy formalism a family of distributions with unknown parameters (eqn. (8)) was established. The next problem is to determine both the number of parameters and their values in such a way that the resulting probability distribution properly reflects the information contained in a given sample. Problems of this type have been considered by Akaike and others [2,3]. Let g(x) represent the true but unknown distribution of the r.v. X, and p_K(x|λ) a Kth order model. A measure of the 'distance' (in terms of information) between g(x) and p_K(x|λ) is given again by the Kullback-Leibler entropy as follows:

H[g(x), p_K(x|\lambda)] = \int_D g(x) \ln\left[\frac{g(x)}{p_K(x|\lambda)}\right] dx    (13)
The best choice for the vector λ is the one which minimizes the 'distance' H[g(x), p_K(x|λ)] between the Kth order model p_K(x|λ) and 'reality' g(x). We cannot evaluate H[g(x), p_K(x|λ)] since g(x) is not known, but eqn. (13) can be written in the form:

H[g(x), p_K(x|\lambda)] = C - L(\lambda, K)    (14)

where:

C = \int_D g(x) \ln[g(x)] dx    (15)

and

L(\lambda, K) = \int_D g(x) \ln[p_K(x|\lambda)] dx    (16)
The term C does not depend on λ, hence for the purpose of minimizing H with respect to λ this term is simply an irrelevant constant. From eqn. (16) we see that the term L(λ, K) is the expected value of ln[p_K(x|λ)], so from a sample of N measurements we can obtain the following 'natural estimate' L̂(λ, K) of L(λ, K):

\hat{L}(\lambda, K) = \frac{1}{N} \sum_{j=1}^{N} \ln[p_K(x_j|\lambda)]    (17)

where x_j, j = 1, ..., N, represents the N measured sample values. The corresponding estimate Ĥ of H is:

\hat{H}(\lambda, K) = C - \hat{L}(\lambda, K) = C - \frac{1}{N} \sum_{j=1}^{N} \ln[p_K(x_j|\lambda)]    (18)
and the 'best' choice λ* of λ is obtained by minimizing Ĥ with respect to the vector of unknown parameters λ, i.e.:

\min_{\lambda} \{\hat{H}(\lambda, K)\} = C - \frac{1}{N} \max_{\lambda} \left\{\sum_{j=1}^{N} \ln[p_K(x_j|\lambda)]\right\}    (19)

It is noted that the term \sum_{j=1}^{N} \ln[p_K(x_j|\lambda)] is the familiar log likelihood function. Hence eqn. (19) shows that the parameters λ* which minimize the entropy estimate Ĥ are the well known maximum likelihood estimates of λ. The properties of maximum likelihood estimates have been studied extensively (e.g. [20,21]) and need not be elaborated here. For our purpose it is sufficient to recall that for a finite sample size N, the maximum likelihood estimates are often biased estimates of the true parameters. Since the λ* are biased estimates, it follows that L̂(λ*, K) and Ĥ(λ*, K) are also biased estimates of their respective 'true' values. On the basis of this observation Akaike [2] suggested that better estimates of λ* will be obtained if we maximize not the 'natural estimate' L̂ of the likelihood function (which is biased), but an unbiased estimate of this function. Moreover, he showed that an unbiased estimate L̃ of L is given by:

\tilde{L}(\lambda, K) = \hat{L}(\lambda, K) - \frac{K}{N}    (20)

Hence an unbiased estimate H̃ of H can be written as:

\tilde{H}(\lambda, K) = C - \tilde{L}(\lambda, K) = C - \hat{L}(\lambda, K) + \frac{K}{N} = \hat{H}(\lambda, K) + \frac{K}{N}    (21)

or explicitly

\tilde{H}(\lambda, K) = C - \frac{1}{N} \sum_{j=1}^{N} \ln[p_K(x_j|\lambda)] + \frac{K}{N}    (22)
The 'bias' term K/N is proportional to the model order K, i.e. to the number of parameters which we try to estimate using a given sample, and inversely proportional to the number of data points N in the sample. It is easily verified that H̃(λ, K) is proportional to Akaike's information criterion [3]. Akaike's estimation procedure may now be summarized as follows:
- For a given value of K, minimize the unbiased estimate of the entropy given in eqn. (22), and obtain the optimal values of the parameters λ. We will use the notation λ*_K to signify the optimal parameters for the Kth order model, obtained by minimizing H̃(λ, K).
- Following the determination of the parameters λ*_K, calculate the entropy associated with the 'best' Kth order model using eqn. (22). At this stage λ*_K is known, so H̃(λ*_K, K) = H̃(K) is a function of K only.
- Repeating this process for a sequence of K values, obtain the relation H̃(K) as a function of K.
- Locate the optimal order of approximation which minimizes the value of H̃(K) as a function of K, i.e. find K_opt from the relation:

\min_K \{\tilde{H}(K)\} \rightarrow K_{opt}    (23)

This program was established by Akaike [2] as a general procedure for the determination of the optimal model order of probability assignments. In Akaike's presentation p_K(x|λ) can be any given hierarchy of models, not necessarily the MEFD. This extra generality may be considered as a deficiency of Akaike's procedure, since we have no criteria for the selection of the appropriate family of models on which this procedure should be applied.
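Akaike's procedure is indeed not tied to any particular family. The following Python sketch, added here as an illustration (the candidate hierarchy, the simulated sample and the parameter count are assumptions, not from the paper), applies the penalized criterion of eqn. (22), up to the irrelevant constant C, to a simple hierarchy of piecewise-constant densities on the range 0-1 and selects the order that minimizes it.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.beta(2.0, 5.0, size=200)       # assumed data on the finite range [0, 1]
N = len(sample)

def criterion(K):
    """Penalized entropy estimate of eqn (22), up to the constant C, for a
    piecewise-constant density model with K equal cells on [0, 1]."""
    counts, edges = np.histogram(sample, bins=K, range=(0.0, 1.0))
    dens = counts / (N * np.diff(edges))              # maximum likelihood cell densities
    loglik = np.sum(counts[counts > 0] * np.log(dens[counts > 0]))
    n_par = K - 1                                     # free parameters of this model
    return -loglik / N + n_par / N                    # smaller is better

orders = list(range(1, 21))
scores = [criterion(K) for K in orders]
print("optimal model order:", orders[int(np.argmin(scores))])
```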

5. APPLICATION OF AKAIKE'S ESTIMATION PROCEDURE TO THE FAMILY OF MAXIMUM ENTROPY DISTRIBUTIONS

Jaynes' MEF and Akaike's estimation procedure complement each other in a natural way. Taken separately each one of these two procedures is underspecified, but together they constitute a complete program. The MEF provides a complete solution for the case of given population moments; however, when the moments are not given this procedure defines only an infinite hierarchy of models (MEFD), specified in eqn. (8). Akaike's estimation procedure, on the other hand, is defined only for a given hierarchy of models, and provides no criteria for the selection of such a hierarchy. Therefore, it appears natural to apply Akaike's procedure to the family of models generated by the MEF. The fact that both the MEF and Akaike's procedure can be presented as two particular cases of minimization of the Kullback-Leibler information functional adds a sense of unity and consistency to such a combined procedure. In order to obtain an explicit expression for the combined procedure, substitute the analytical form of the MEFD (eqn. (8)) into the expression for the unbiased estimate of the entropy (eqn. (22)), yielding:

\tilde{H}(\lambda, K) = C - \frac{1}{N} \sum_{j=1}^{N} \ln[p_0(x_j)] - Z_0(\lambda, K) - \sum_{k=1}^{K} \lambda_k \frac{1}{N} \sum_{j=1}^{N} (x_j)^k + \frac{K}{N}    (24)
The term (1/N)\sum_{j=1}^{N} \ln[p_0(x_j)] is independent of λ and K, and therefore it can be included in the constant C. The terms (1/N)\sum_{j=1}^{N} (x_j)^k are just the sample moments μ̂_k (eqn. (12)), so that eqn. (24) becomes:

\tilde{H}(\lambda, K) = C - Z_0(\lambda, K) - \sum_{k=1}^{K} \lambda_k \hat{\mu}_k + \frac{K}{N}    (25)
It is convenient at this stage to eliminate the constant C from the formulation; setting K = 0 in eqn. (25) and noticing that for K = 0 also Z_0 = 0, we get:

\tilde{H}(K=0) = C    (26)

Define now:

\Delta H(\lambda, K) = \tilde{H}(\lambda, K) - \tilde{H}(K=0) = \frac{K}{N} - Z_0(\lambda, K) - \sum_{k=1}^{K} \lambda_k \hat{\mu}_k    (27)
We shall refer to ΔH(λ, K) as the 'differential entropy'. Notice that, unlike the entropy, this quantity can be positive or negative; in addition it is not invariant to monotonic transformations of the r.v. X. For a constant value of K, the minimum of ΔH(λ, K) is obtained at λ satisfying:

\frac{\partial \Delta H(\lambda, K)}{\partial \lambda_k} = -\frac{\partial Z_0(\lambda, K)}{\partial \lambda_k} - \hat{\mu}_k = 0,    k = 1, ..., K    (28)

or

\frac{\partial Z_0(\lambda, K)}{\partial \lambda_k} = -\hat{\mu}_k,    k = 1, ..., K    (29)
Using the explicit form of Z_0 (eqn. (9)), eqn. (29) becomes:

\frac{\int_D x^k p_0(x) \exp\left[\sum_{j=1}^{K} \lambda_j x^j\right] dx}{\int_D p_0(x) \exp\left[\sum_{j=1}^{K} \lambda_j x^j\right] dx} = \hat{\mu}_k,    k = 1, ..., K    (30)

It is of interest to note that the system of eqn. (30) has exactly the same structure as the system of eqn. (10) which was derived for the classical maximum entropy formalism; merely the population moments μ_k in eqn. (10) are replaced by the sample moments μ̂_k in eqn. (30). Moreover, recall that the L.H.S. of eqn. (10) represents the theoretical model moments. Hence eqn. (30) shows that the optimal parameters λ*_K are obtained from the requirement that the model moments be equal to the sample moments, as in the conventional method of moments. In the previous section we have shown that the optimal parameters λ*_K are maximum likelihood estimates; hence we have here the situation that the method of moments and the method based on maximum likelihood yield exactly the same result. The identification of the optimal parameters λ*_K can be achieved either by solution of the system of nonlinear eqns. (30) or by direct minimization of ΔH(λ, K) (using eqns. (27) and (9)). In fact it was found that the best results (from a numerical point of view) were obtained by combining both of these possibilities. As a result of this stage one obtains the function ΔH(K) = ΔH(λ*_K, K), and the optimal model order is located simply by graphical identification of the K value which yields the minimum of ΔH(K) as a function of K.

The existence of such a minimum can be established in the following way. Recall that p_K(x|λ) is the family of maximum entropy distributions. We have established that, using this family, it is possible to approximate any distribution g(x) with an arbitrary degree of accuracy by simply taking a sufficient number of parameters in the vector λ, i.e. by considering a model of sufficiently high order K. It follows that it is possible to make the entropy Ĥ(K) as small as we please by increasing the value of K. In other words the function Ĥ(K) (which does not include the term K/N) decreases as a function of K, approaching the value of zero asymptotically. For a constant value of N, the term K/N increases linearly as a function of K. Consequently H̃(K) = Ĥ(K) + K/N (eqn. (21)) must have a minimum for some K value. ΔH(K) differs from H̃(K) only by the constant C, so the same must be true for the differential entropy. We see that K/N acts as a 'penalty term' preventing us from establishing too 'elaborate' models (with too many 'free' parameters) which cannot be justified by the given data. It may be noted that the existence of such a minimum is not self evident for an arbitrary family of models p_K(x|λ) as used in Akaike's original procedure.

Recalling that the model order K is equal to the number of constraints used in the MEF, this procedure provides a clear answer to the question which was posed at the beginning of the present paper, namely: 'how many moments should be included as constraints in the MEF when the available information is a sample of N measured values?' The answer is that the number of constraints should be the value of K which yields the smallest value of the differential entropy (the minimum value of Akaike's information criterion).
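A compact numerical sketch of the combined procedure is given below. It is an illustration, not the author's program: the simulated data, the integration grid and the optimizer are assumptions. For each order K the multipliers are found by matching model moments to sample moments (eqn. (30)), the differential entropy of eqn. (27) is evaluated, and the order minimizing ΔH(K) is selected.

```python
import numpy as np
from scipy.optimize import fsolve

rng = np.random.default_rng(1)
data = rng.beta(2.0, 5.0, size=100)         # assumed sample, already on the range [0, 1]
N = len(data)
x = np.linspace(0.0, 1.0, 2001)             # integration grid, uniform prior p0(x) = 1

def exponent(lam, t):
    """The exponent of eqn (8) without Z_0: sum_k lam_k * t**k."""
    return sum(l * t**(k + 1) for k, l in enumerate(lam))

def Z0(lam):
    """Normalization constant of eqn (9): exp(-Z0) = integral of exp(exponent)."""
    return -np.log(np.trapz(np.exp(exponent(lam, x)), x))

def moment_residual(lam, mu_hat):
    """Model moments minus sample moments, eqn (30)."""
    w = np.exp(exponent(lam, x))
    z = np.trapz(w, x)
    model = np.array([np.trapz(x**(k + 1) * w, x) / z for k in range(len(lam))])
    return model - mu_hat

def delta_H(K):
    """Differential entropy of the optimal K-th order model, eqn (27)."""
    if K == 0:
        return 0.0
    mu_hat = np.array([np.mean(data**(k + 1)) for k in range(K)])   # eqn (12)
    lam = fsolve(moment_residual, np.zeros(K), args=(mu_hat,))
    return K / N - Z0(lam) - np.dot(lam, mu_hat)

scores = {K: delta_H(K) for K in range(0, 5)}
print("optimal model order:", min(scores, key=scores.get))
```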


In order to understand more completely the role of the MEF in the present combined procedure, notice that the sample information is reflected in eqn. (30) only through the sample moments μ̂_k. From eqn. (9) we see that the constant Z_0(λ, K) is also completely specified by the information contained in the sample moments. Consequently, the differential entropy ΔH(λ, K) (eqn. (27)), and hence also the final solution, depends on the sample information only through the sample moments. We see that the use of moment constraints in Jaynes' MEF results in a procedure in which the sample information is expressed through sample moments only. A little reflection shows that if population fractiles are used as constraints in the MEF, then quantities related to sample fractiles become the variables summarizing the sample information. From this point of view the use of different types of constraints in the MEF represents essentially different parametrizations of the estimation problem. One may speculate that for each choice of statistics summarizing the sample information it may be possible to find a set of constraints whose use in the MEF will result in these statistics being the carrier of the sample information. This speculation is supported by the results in [22] on the 'inverse problem of entropy maximization'. In other words, if one believes that sample moments provide an efficient summary of the sample information (as they actually do for certain types of problems), then moment constraints should be used in the MEF. However, if one suspects that in certain situations sample moments should not be used (as is the case when sensitivity to outliers is a problem), then fractile constraints or other 'robust' statistics should be used in the MEF. The important point is that for every choice of statistics summarizing the sample information, the MEF provides the appropriate analytical form of the maximum entropy family of distributions. This argument provides a partial 'justification' for our use of moment constraints in the MEF, despite the fact that the values of these moments are not known.

Up to this point we have kept the prior distribution p_0(x) unspecified in order to maintain the generality of the derivation (the prior distribution still appears in the equation for Z_0, eqn. (9)). In general the specification of prior distributions is a controversial subject. However, if the range Xmin ≤ x ≤ Xmax is the only prior information, then it is probably justifiable to take a uniform distribution for p_0(x) [23]. In that case p_0(x) is simply the constant 1/(Xmax − Xmin), and the estimation procedure is completely specified. For the purpose of numerical evaluation it is convenient to transform the r.v. X to the standard range 0-1 by the linear transformation:

Y = \frac{X - X_{min}}{X_{max} - X_{min}}    (31)

This transformation has the effect of reducing all the sample moments to approximately the same order of magnitude, resulting in a more stable behavior of the numerical procedure. For this reason we performed all the calculations in the transformed domain (i.e. for the r.v. Y). After the determination of the Lagrangian multipliers λ*_K in the transformed domain, these multipliers are transformed back to the original domain of the problem. It was mentioned before that, unlike the entropy itself, the differential entropy ΔH(λ, K) is not invariant to the linear transformation defined by eqn. (31). This, however, is of no practical consequence since only the relative magnitude of ΔH(λ, K) is used for the determination of the optimal model order.
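For completeness, a minimal sketch of the transformation of eqn. (31) and of mapping a density fitted for Y back to the original variable (an illustration with assumed bounds, not the paper's code):

```python
import numpy as np

x_min, x_max = 0.0, 250.0          # assumed prior bounds on X
span = x_max - x_min

def to_standard(x):
    """Eqn (31): map X to the standard range 0-1."""
    return (x - x_min) / span

def density_back(p_y, x):
    """Given a density p_y(y) fitted on [0, 1], return p_X(x) on [x_min, x_max]."""
    return p_y(to_standard(x)) / span      # Jacobian of the linear map
```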

6. RESULTS AND DISCUSSION

6.1. Live loads on warehouse floors

This example illustrates the use of the procedure with a relatively large sample of 220 data points representing measured values of warehouse floor loads (in lb/ft²), as reported by Dunham et al. [24].

Fig. 1. Loads on warehouse floors, Xmax = 250: (a) data and density functions; (b) differential entropy.

Figure 1(a) shows the data in the form of a histogram. Figure 1(b) shows the differential entropy (in the transformed domain) as a function of the model order for the case Xmin = 0 and Xmax = 250. The upper limit was chosen simply by inspection of the data set. Such a choice is not really consistent with the definition of the boundaries as 'prior information', and we will return to this point shortly. Inspection of Fig. 1(b) shows that the optimal model is of the 3rd order. Superimposed on the histogram in Fig. 1(a) are 3 density functions: the optimal 3rd order model is represented by the solid line, the 10th order model by the dashed line, and the 2nd order model by the dotted line. Inspection of this figure shows that the 10th order model fits the histogram slightly better than the optimal model, particularly in the zone x = 175-250 where the histogram has a little 'plateau'. In general, however, the difference between these two models is very small, and this 'advantage' of the high order model definitely does not justify the complication of having a model with 11 coefficients. Compared next are the 2nd and the 3rd order models; evidently the second order model does not provide as good a fit to the histogram as the 3rd order one. One may conclude therefore that the choice by the procedure of the 3rd order model as the optimal one is reasonable.

Fig. 2. Loads on warehouse floors: effect of the range on the optimal probability assignments.

Figure 2 shows solutions for 3 different values of Xmax: Xmax = 250 (solid line), Xmax = 500 (dotted line), and Xmax = 350 (dashed line). Three observations are apparent from inspection of this figure:
- A moderate change of the upper bound from 250 to 350 has only a small effect on the resulting optimal distribution. Hence one should not be too worried about the exact numbers used for Xmin and Xmax.
- Large changes of the upper bound from 250 to 500 (which presumably represent a genuine difference in the state of the 'prior information') have some effect on the resulting 'best probability assignment'. Naturally the limit Xmax = 500 represents a poorer state of prior information than Xmax = 250. The corresponding optimal probability assignment properly reflects this situation by being flatter and more uniform (closer to a constant distribution) than the one associated with the limit Xmax = 250.
- It is interesting to note that when the prior range is 0-500, while all the data is concentrated in the range 0-230, the resulting optimal probability assignment is heavily 'biased' towards the data, producing essentially zero probability of obtaining values beyond x = 300. In other words, in the present case of a relatively large sample, if our preconceived ideas (prior probabilities) are not in agreement with the experimental findings (data), the procedure pays more attention to the data than to the opinions.

6.2. Friction angle of gravelly sand


This example illustrates the use of the procedure with a moderate size sample of 29 measured values of the effective friction angle of a gravelly sand [25]. The data in the form of a histogram is shown in Fig. 3(a). The problem was analyzed using Xmin = 25° and Xmax = 50°. These limits were established using the time honored procedure of Bayesian statisticians, namely asking an expert. Discussing this problem with J.G. Zeitlen of our department, it was concluded that for this type of cohesionless soil, values of the friction angle outside the range 25°-50° cannot be considered credible. With this information the entropy function shown in Fig. 3(b) was obtained. The result shows clearly that the optimal model for this case is of the second order. The resulting probability distribution is shown in Fig. 3(a). Recall that the first order model corresponds to a truncated exponential distribution; hence for any data set which shows a concentration of results in a central zone with reduced frequencies on both sides, the second order is the lowest reasonable model order. We see that with a moderate size data set the procedure produces a low order model, as expected.

It is of interest to recall that classical Bayesian estimation procedures require the specification of prior probabilities for the parameters of the distribution (λ in the present case). It is completely unrealistic to expect a practical engineer (who naturally has the widest experience and therefore should play the part of the 'expert' specifying the prior information) to have a definite opinion about the values of the distribution's parameters. In the present approach, 'experience' is codified in the prior distribution of the r.v. X itself, for which the engineer has a much better feel.

Fig. 3. Friction angle of gravelly sand, 29 data points: (a) data and density function; (b) differential entropy.

As a bare minimum, the present procedure requires the expert to specify only upper and lower bounds for the r.v. X. This should not be seen as belittling the role of the 'expert' in the general framework. To see this, consider the case of a really small data set. Using a table of random digits, a sub-sample containing just 3 points was selected from the sample shown in Fig. 3(a). The 'histogram' of the 3 points is shown in Fig. 4(a). The entropy function is presented in Fig. 4(b). Figure 4(b) shows that the optimal model order is zero; hence a uniform distribution on the a priori range is implied (see Fig. 4(a)). We see that in the case of a very small sample the procedure completely ignored the data in favor of the prior distribution provided by the 'expert'.

Fig. 4. Friction angle of gravelly sand, 3 data points: (a) data and density function; (b) differential entropy.

Comparing this result with Fig. 2, we see that the relative importance of the data as compared to the 'expert' depends (as it should) on the amount of available data. Since the geotechnical engineer frequently works with rather small samples, a typical situation is that neither the data nor the 'expert' dominates the final probability assignment. This example demonstrates quite clearly the fallacy of the popular opinion among practical engineers that the use of statistical design procedures requires a substantial amount of data (information). The present approach can be utilized with very small samples, in which case the expert opinion predominates, or with a significant amount of data, in which case the result depends mainly on the data.

7. STOCHASTIC FOUNDATION DESIGN

The above two case studies make it possible to illustrate some implications of the proposed estimation procedure, using as an example the design of a footing against bearing capacity failure. For a square footing founded on the surface of a cohesionless soil, the factor of safety F_s against bearing capacity failure can be written as:

F_s = \frac{B^3 \gamma N_\gamma(\Phi)}{2 L^2 q}    (32)
where B is the width of the footing, γ the unit weight of the soil, N_γ(Φ) a bearing capacity factor, Φ the angle of internal friction, L the average distance between columns, and q = q_0 + q_1 the footing's contact stress, with q_0 representing the dead load (weight of the structure) and q_1 the live (useful) load. There are a number of analytical expressions defining N_γ as a function of Φ. We will use the following expression, which was suggested by Vesic [26]:

N_\gamma(\Phi) = 2[N_q(\Phi) + 1] \tan(\Phi)    (33)

where:

N_q(\Phi) = e^{\pi \tan(\Phi)} \tan^2(\pi/4 + \Phi/2)    (34)
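The bearing capacity factors of eqns. (33) and (34) are easily evaluated numerically. The short Python sketch below is an added illustration, not part of the paper; it assumes Vesic's usual form of eqn. (33) as reconstructed above.

```python
import numpy as np

def N_q(phi_deg):
    """Eqn (34): N_q = exp(pi tan(phi)) tan^2(pi/4 + phi/2)."""
    phi = np.radians(phi_deg)
    return np.exp(np.pi * np.tan(phi)) * np.tan(np.pi / 4 + phi / 2) ** 2

def N_gamma(phi_deg):
    """Eqn (33), Vesic's expression: N_gamma = 2 (N_q + 1) tan(phi)."""
    phi = np.radians(phi_deg)
    return 2.0 * (N_q(phi_deg) + 1.0) * np.tan(phi)

print(N_gamma(35.0))   # roughly 48 for phi = 35 degrees
```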

For the present purpose it is convenient to define a variable G:

G = \frac{2 L^2 F_s}{B^3 \gamma}    (35)

Hence eqn. (32) can be rewritten as:

G = \frac{N_\gamma(\Phi)}{q_0 + q_1}    (36)

Modeling q_1 and Φ as random variables implies that F_s and G are random variables. The cumulative distribution F(g) of G is given by:

F(g) = \Pr\{G \le g\} = \iint_{D(g)} p(N_\gamma, q) \, dN_\gamma \, dq    (37)
Fig. 5. Stochastic foundation design. Region of integration in (a) the N_γ-q space, and in (b) the Φ-q_1 space.

where p(N_γ, q) is the joint probability density function of N_γ and q, g is a particular value of G, and D(g) is the region in the q-N_γ space in which G ≤ g. This region is shown crosshatched in Fig. 5(a); the upper boundary of this region is the line N_γ = gq. Since external loads and soil properties can be assumed to be statistically independent, it is possible to write p(N_γ, q) = p(N_γ)p(q). Moreover, N_γ is an increasing function of Φ and q is an increasing function of q_1; therefore we have p(N_γ(Φ)) = p(Φ) dΦ/dN_γ and p(q) = p(q_1) dq_1/dq, where p(Φ) and p(q_1) are the density functions of the angle of internal friction and the live load, respectively. Using these relations eqn. (37) becomes:

F(g) = \iint_{\tilde{D}(g)} p(\Phi) \, p(q_1) \, d\Phi \, dq_1    (38)

where D̃(g) is the image of the region D(g) in the Φ-q_1 space (see Fig. 5(b)). The upper boundary of this region is defined by the equation:

q_1(\Phi) = \frac{N_\gamma(\Phi)}{g} - q_0    (39)
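A sketch of how F(g) in eqn. (38) can be evaluated numerically, for example with the five point Gaussian quadrature mentioned below, is given here as an added illustration. It is not the author's program: the density functions p_phi and p_q1 (assumed vectorized), the prior ranges, the dead load q0 and the bearing capacity function n_gamma are all user-supplied placeholders.

```python
import numpy as np

nodes, weights = np.polynomial.legendre.leggauss(5)     # five point Gauss rule

def gauss(f, a, b):
    """Five point Gauss-Legendre integration of f over [a, b]."""
    t = 0.5 * (b - a) * nodes + 0.5 * (a + b)
    return 0.5 * (b - a) * np.dot(weights, f(t))

def F_of_g(g, p_phi, p_q1, phi_rng, q1_rng, q0, n_gamma):
    """Eqn (38): integrate p(phi) p(q1) over the region where G <= g,
    i.e. q1 >= N_gamma(phi)/g - q0 (the boundary of eqn (39))."""
    def outer(phis):
        vals = np.empty_like(phis)
        for i, phi in enumerate(phis):
            lo = max(q1_rng[0], n_gamma(phi) / g - q0)   # q1 limit at this phi
            vals[i] = gauss(p_q1, lo, q1_rng[1]) if lo < q1_rng[1] else 0.0
        return p_phi(phis) * vals
    return gauss(outer, phi_rng[0], phi_rng[1])
```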

The analysis of the previous two case studies yielded the functions p(q_1) and p(Φ). The following two cases will be considered:
Case 1. The friction angle of the gravelly sand is characterized by the density function shown in Fig. 3(a). This case corresponds to a relatively high level of experimental information (29 data points).
Case 2. The friction angle of the gravelly sand is characterized by the uniform density function shown in Fig. 4(a). This case corresponds to a low level of experimental information (3 data points).
In both cases the live load is characterized by the optimal 3rd order model shown in Fig. 1(a). For the purpose of illustration the dead load q_0 is taken as q_0 = 0.7 t/m². The integration of eqn. (38) was done using two dimensional five point Gaussian quadrature. The resulting cumulative probability function F(g) is shown in Fig. 6(a). Since both Φ and q_1 are defined over finite domains, G is also bounded, and it varies between the values g_min = 4.98 m²/t and g_max = 1095.57 m²/t. Figure 6(b) shows an expanded view of F(g) in the range of low g values.

Fig. 6. The cumulative probability function F(g): (a) general view; (b) low end tail.

This figure can be utilized in two different ways:
- Consider a given design with γ = 1.8 t/m³, B = 1.0 m and L = 4.0 m. Using the definition of G (eqn. (35)), the probability of failure P_f of the footing is:

P_f = \Pr\{F_s \le 1.0\} = \Pr\{G \le 2 \cdot 1.0 \cdot 4.0^2 / (1.0^3 \cdot 1.8)\} = F(g = 17.8)

From Fig. 6(b) it is seen that (P_f)_Case 1 = 3% while (P_f)_Case 2 = 19%. These values show clearly that when the level of available information is low (Case 2), a given design corresponds to a higher risk (higher probability of failure) than in the case of a high level of information (Case 1). Such a result is naturally to be expected. The relevance of this result to the subject of structural safety is self evident.
- Figure 6(b) can be utilized also as a design tool. Assume that it is required to determine the width of the footing so that, with 90% confidence, the factor of safety (see footnote 3) is not less than 3.0. Entering Fig. 6(b) with F(g) = 0.1 one gets (g)_Case 1 = 27.75 and (g)_Case 2 = 12.58. On the basis of eqn. (35) it is possible to write:

B = \left[\frac{2 F_s L^2}{g \gamma}\right]^{1/3}    (40)
Using this relation one gets B_Case 1 = 1.25 m while B_Case 2 = 1.60 m, i.e. with a small amount of experimental information we are compelled to use a more conservative design. Again this is exactly the type of result which one would expect. Comparing the expense of performing an additional 26 (29 − 3) strength tests with the savings from using a foundation size of 1.25 m instead of 1.60 m makes it possible to assign a monetary value to the geotechnical testing program. This type of 'cost-benefit' analysis may reduce to a certain extent the atmosphere of 'oriental bazaar' which frequently surrounds the determination of the size of such programs.
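The design use of Fig. 6(b) amounts to reading g at the required confidence level and applying eqn. (40). A small numerical check (an illustration using the g values and material data quoted in the text):

```python
gamma, L, Fs_required = 1.8, 4.0, 3.0      # t/m^3, m, required factor of safety

def design_width(g_at_confidence):
    """Eqn (40): footing width giving F_s >= Fs_required with the stated confidence."""
    return (2.0 * Fs_required * L**2 / (g_at_confidence * gamma)) ** (1.0 / 3.0)

# g values read from F(g) = 0.1 in Fig. 6(b):
print(design_width(27.75))   # Case 1 (29 data points): about 1.25 m
print(design_width(12.58))   # Case 2 (3 data points):  about 1.60 m
```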

Footnote 3: The value F_s = 3.0 is a carry-over from a deterministic design approach and does not have any justification in the present framework. It is used here for demonstration purposes only.

There are a number of reservations which must be stated with respect to the type of analysis presented above:
- The analysis of foundation bearing capacity involves additional uncertainties beyond those associated with the numerical values of the live load and the angle of internal friction. For example, eqns. (33) and (34) are not exact and therefore introduce additional errors (model error).
- The purpose of a geotechnical site investigation is not only to determine the numerical value of Φ but, probably more important, to establish the nature of the soil at the site. The present analysis assumes a site consisting of 'gravelly sand'. Any uncertainties associated with this assumption are not taken into account.
- The type of analysis presented above is equivalent to the assumption that Φ values are perfectly correlated in space (random variable approach). In reality, modeling the spatial distribution of Φ as a 'random function' may be more appropriate. Usually the random variable approximation yields conservative results, but this is not always the case.
- The probability density functions p(Φ) and p(q_1) were established using all the available data. In the analysis of bearing capacity only the lower tail of the function F(g) proved to be relevant. It is not clear whether the present estimation procedure is appropriate in the vicinity of the tail of the distribution. Specific 'tail information' may however not be critical in the present case where all variables are defined on finite domains.
Despite these reservations it appears that the present estimation procedure provides a natural tool for the incorporation of certain types of uncertainties into engineering design. Most of the reservations may be overcome by incorporating the additional uncertainties into the analysis using essentially the same approach.

8. SUMMARY AND CONCLUSIONS

The paper presents an estimation procedure which is based on concepts from information theory. The nature of the procedure may be summarized by considering the following limiting cases:
(A): The procedure will produce a flat, nearly uniform probability distribution in one of the following two cases:
- (A1): A small data set, irrespective of the nature of the underlying random variable.
- (A2): A large data set, if the measured values are spread evenly throughout the given interval.
(B): The procedure will produce a narrow, sharply peaked distribution only if both of the following two criteria are satisfied:
- (B1): A large data set.
- (B2): The sample data is concentrated in a narrow zone.
It is demonstrated that the procedure produces consistent results for both small and large sample sizes. It automatically adjusts the level of 'sophistication' of the resulting probability assignment according to the nature and quantity of the available data. In other words it prevents us from using too sophisticated a model (with too many adjustable parameters) if the data set is not extensive enough.

Prior information about the r.v. X is summarized in the prior distribution p_0(x). This is a more 'natural' way of codifying the 'experience' of practical engineers than the classical Bayesian procedure, which requires the specification of a prior distribution for the parameters of the density function. As a bare minimum, in the present version of the procedure a given range is assumed to be the only prior information. This range is considered to represent the 'expert' opinion about the random variable under consideration.


It is shown that in the case of a small sample the resulting probability assignment reflects mainly the expert opinion rather than the data. On the other hand, for a large sample size the final result depends mainly on the data rather than on the opinion of the expert. In particular, if the data points are concentrated in only one region of the a priori range, and the sample size is large, then the resulting probability assignment will show close to zero probability of occurrence for values of the random variable far removed from the region of measured sample values. These properties of the proposed procedure provide essentially an effective combination of 'subjective probability', characterizing one's uncertainty in the case of small samples, and 'objective probability', representing measured relative frequencies when the data base is large. Such a combination appears particularly useful for engineering purposes, where one needs an objective probability assignment with a degree of conservatism which decreases as the amount of measured data increases. The analysis of a bearing capacity problem illustrates the applicability of the procedure in a more practical engineering framework.

REFERENCES
1 E.T. Jaynes, in: R.D. Rosenkrantz (Ed.), Papers on Probability, Statistics and Statistical Physics, Reidel, Dordrecht, 1982.
2 H. Akaike, Information theory and an extension of the maximum likelihood principle, in: B.N. Petrov and F. Csáki (Eds.), 2nd Int. Symp. on Information Theory, Akadémiai Kiadó, Budapest, 1973, pp. 267-281.
3 Y. Sakamoto, M. Ishiguro and G. Kitagawa, Akaike Information Criterion Statistics, Reidel, Dordrecht, 1986.
4 S. Kullback, Information Theory and Statistics, Wiley, New York, NY, 1959.
5 J. Shore and R.W. Johnson, Axiomatic derivation of the principle of maximum entropy and the principle of minimum cross-entropy, IEEE Trans. Inf. Theory, 26(1) (1980) 26-37.
6 C.E. Shannon and W. Weaver, The Mathematical Theory of Communication, The University of Illinois Press, Champaign, IL, 1959.
7 J. Aczél, B. Forte and C.T. Ng, Why the Shannon and Hartley entropies are 'natural', Adv. Appl. Probability, 6 (1974) 131-146.
8 H. Theil and D.G. Fiebig, Exploiting Continuity: Maximum Entropy Estimation of Continuous Distributions, Ballinger, Boston, MA, 1984.
9 M. Tribus, Rational Descriptions, Decisions and Designs, Pergamon, Oxford, 1969.
10 N.C. Lind and V. Solana, Cross entropy estimation of random variables with fractile constraints, IRR Paper No. 11, Institute for Risk Research, University of Waterloo, Waterloo, Canada, 1988.
11 J.N. Siddall, Probabilistic Engineering Design, Marcel Dekker, New York, NY, 1983.
12 E.T. Jaynes, Information theory and statistical mechanics, Phys. Rev., 106 (1957) 620-630.
13 T.M. Brown, Information theory and the spectrum of homogeneous turbulence, J. Phys. A, 15 (1982) 2285-2306.
14 J.O. Sonuga, Entropy principle applied to the rainfall-runoff process, J. Hydrol., 30 (1976) 81-94.
15 M. Tribus, The use of the maximum entropy estimate in the estimation of reliability, in: R.E. Machol and P. Gray (Eds.), Symp. on Recent Developments in Information and Decision Processes, Macmillan, New York, NY, 1961, pp. 102-139.
16 A.B. Vistelius, Geochemical problems and measures of information, in: Studies in Mathematical Geology, 1967, pp. 157-174. Translated from Russian by the Consultants Bureau, New York.
17 E.T. Jaynes, Where do we stand on maximum entropy?, in: R.D. Levine and M. Tribus (Eds.), The Maximum Entropy Formalism, MIT Press, Cambridge, MA, 1978, pp. 15-118.
18 J.M. Einbu, On the existence of a class of maximum-entropy density functions, IEEE Trans. Inf. Theory, 23(6) (1977) 772-775.
19 T.J. Rivlin, An Introduction to the Approximation of Functions, Blaisdell, New York, NY, 1969.
20 R.A. Fisher, The mathematical foundation of theoretical statistics, Philos. Trans. R. Soc. London, Ser. A, 222 (1922) 306-368.

21 C.R. Rao, Linear Statistical Inference and its Applications, Wiley, New York, NY, 1965.
22 J.P. Noonan, N.S. Tzannes and T. Costello, On the inverse problem of entropy maximization, IEEE Trans. Inf. Theory, 22 (1976) 120-123.
23 E.T. Jaynes, Prior probabilities, IEEE Trans. Syst. Sci. Cybern., 4 (1968) 227-241.
24 J.W. Dunham, G.N. Brekke and G.N. Thompson, Live loads on floors in buildings, Building Materials and Structures, National Bureau of Standards, 1972.
25 E. Schultze, Some aspects concerning the application of statistics and probability to foundation engineering, in: Proc. 2nd Int. Conf. on the Application of Statistics and Probability in Soil and Structural Engineering, Vol. 2, 1965, pp. 457-494.
26 A.S. Vesic, Analysis of ultimate loads of shallow foundations, J. Soil Mech. Found. Div., ASCE, 99(1) (1973) 45-73.
