Table of Contents
1 KE 1
1.1 MULTIVARIATE MODEL
1.2 MULTIDIMENSIONAL NORMAL DISTRIBUTION
1.3 ESTIMATION
1.3.1 METHOD OF MOMENTS
1.3.2 MAXIMUM LIKELIHOOD
1.3.3 LEAST SQUARES
1.4 TESTING
1.4.1 DISTRIBUTIONS
1.4.2 TESTS
2 KE 2
2.1 TWO SAMPLE PROBLEMS
2.1.1 ESTIMATORS
2.1.2 TESTS FOR ONE SAMPLE PROBLEMS
2.1.3 TESTS FOR TWO SAMPLE PROBLEMS
2.2 CORRELATION ANALYSIS
2.2.1 CORRELATION OF TWO NORMALLY DISTRIBUTED VARIABLES
2.2.2 RANK CORRELATION BETWEEN CHARACTERISTICS
2.2.3 PARTIAL CORRELATION
2.2.4 MULTIPLE CORRELATION
2.2.5 CANONICAL CORRELATION
2.2.6 CORRELATION OF DISCRETE CHARACTERISTICS
2.2.7 TESTING FOR INDEPENDENCE OF P CHARACTERISTICS
3 KE 3 FACTOR ANALYSIS
3.1 ESTIMATION OF LOADINGS MATRIX
3.1.1 ML-METHOD AND CANONICAL FACTOR ANALYSIS
3.2 OTHER METHODS OF ESTIMATING THE LOADINGS MATRIX
3.2.1 PRINCIPAL COMPONENT ANALYSIS / PRINCIPAL FACTOR ANALYSIS
3.2.2 CENTROID AND JÖRESKOG METHOD
3.3 ROTATION OF FACTORS
3.3.1 ORTHOGONAL ROTATION
3.3.2 OBLIQUE ROTATION
3.4 ESTIMATING FACTOR VALUES
3.5 FACTOR ANALYSIS SUMMARY
4 KE 4 SCALING PROCEDURES
4.1 INTRODUCTION
4.2 SCALING OF ORDINAL CHARACTERISTICS
4.2.1 MARGINAL NORMALIZATION
4.2.2 PERCENTILE RANKS
4.2.3 SCALING NOMINAL CHARACTERISTICS
4.3 MULTIDIMENSIONAL SCALING
4.3.1 PRINCIPAL COORDINATE METHOD
4.3.2 KRUSKAL'S METHOD
5 KE 5 CLASSIFICATION AND IDENTIFICATION
5.1 CLUSTER ANALYSIS
5.1.1 CONSTRUCTING A PARTITION
5.1.2 CONSTRUCTING A HIERARCHY
5.2 DISCRIMINANT ANALYSIS
5.2.1 DA FOR PARTITIONS
5.2.2 DISCRIMINANT ANALYSIS FOR HIERARCHIES
6 KE 6 MULTIVARIATE LINEAR MODEL
6.1 INTRODUCTION
6.2 GENERAL MULTIVARIATE REGRESSION
6.3 MULTIVARIATE ANALYSIS OF VARIANCE
6.3.1 ONE-WAY MANOVA
6.3.2 TWO-WAY ANOVA WITH INTERACTION TERMS
6.4 PROFILE ANALYSIS
7 KE 7 DISCRETE REGRESSION
7.1 GENERAL APPROACH
7.2 BINARY RESPONSE MODEL
7.2.1 BERKSON-THEIL METHOD
7.2.2 ML ESTIMATION IN THE BINARY RESPONSE MODEL
7.3 MULTI-RESPONSE MODEL
7.4 RELATIONSHIP BETWEEN DISCRETE REGRESSION AND DISCRIMINANT ANALYSIS
8 KE 8 GRAPHICAL APPROACHES
1 KE 1
1.1 Multivariate model
Univariate: n objects are observed for one characteristic → n data points. Multivariate: n objects are observed for p > 1 characteristics → n·p data points in an n×p matrix. The univariate normal density is
$$\varphi(z) = f_Z(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(z-\mu)^2}{2\sigma^2}}$$
1.2 Multidimensional normal distribution
Let Z = (Z1,…,Zp)' be a p-dimensional random vector. Let this vector have the expected value vector μ = (μ1,…,μp)' and covariance matrix Σ, where diag_i(Σ) = σ_i² and Σ(i,j) = σ_ij = Σ(j,i) = σ_ji, so Σ is symmetric. Then Z ~ N(μ,Σ) iff Z = AX + μ, where A is the Cholesky factor of Σ = AA' and X = (X1,…,Xp)' with each X_i ~ N(0,1) iid standard normal. For positive definite Σ a density can be given for the distribution of the random vector Z:
$$f_Z(z) = \frac{1}{\sqrt{(2\pi)^p \det\Sigma}}\, e^{-\frac{1}{2}(z-\mu)'\Sigma^{-1}(z-\mu)}$$
This multidimensional distribution has the following properties:
1. Let Y ~ N(μ,Σ) be p×1 and B m×p, c m×1; then X = BY + c ~ N(Bμ + c; BΣB'), i.e. a linear combination of normally distributed variables is again normally distributed.
2. The marginal distributions of multivariate normal distributions are again normal.
3. The components of Y are (pairwise) uncorrelated iff Corr(Y) = I_p.
Matrix-variate normal distribution: Z an n×p matrix, μ an n×p matrix, and an np×np covariance matrix Σ.
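A minimal numeric sketch of the Z = AX + μ construction in numpy; μ and Σ are arbitrary hypothetical values, not taken from the course text:

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0, 0.5])                  # hypothetical mean vector
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 0.5]])              # hypothetical pos. definite covariance
A = np.linalg.cholesky(Sigma)                    # Cholesky factor: Sigma = A A'
X = rng.standard_normal((3, 10000))              # iid N(0,1) components
Z = (A @ X).T + mu                               # rows of Z are draws from N(mu, Sigma)
print(Z.mean(axis=0))                            # ~ mu
print(np.cov(Z, rowvar=False))                   # ~ Sigma
```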
1.3 Estimation
Repeating an experiment concerning the iid variables X1,…,Xn gives the realizations x1,…,xn. A true parameter of the distribution of the X_i might e.g. be θ. We are looking for an estimator θ̂(x1,…,xn) as a function of the realizations that is as close as possible to θ. We want the estimator to be unbiased, i.e. E[θ̂(X1,…,Xn)] = θ. An estimator is asymptotically unbiased if lim_{n→∞} E[θ̂(X1,…,Xn)] = θ. Also we want the estimator to be consistent, i.e. with increasing sample size the estimates get better: P(|θ̂_n − θ| > ε) → 0 as n → ∞. Quality criterion: the mean squared error
MSE(θ, θ̂) = E[(θ̂ − θ)²] = Var(θ̂) + (E(θ̂) − θ)².
The second term on the RHS is the squared bias. The estimator is finally called efficient if it has finite variance and if there is no other estimator with a lower variance.
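A small Monte Carlo check of the MSE decomposition, using the biased variance estimator (divisor n) of a normal sample as example; all sample parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma2, n, reps = 4.0, 10, 200000
samples = rng.normal(0.0, np.sqrt(sigma2), size=(reps, n))
est = samples.var(axis=1, ddof=0)          # biased variance estimator (divisor n)
mse = np.mean((est - sigma2)**2)           # direct Monte Carlo MSE
bias = est.mean() - sigma2                 # E[est] - sigma2 (theoretically -sigma2/n)
print(mse, est.var() + bias**2)            # MSE = Var + bias^2, both ~ equal
```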
1.4 Testing
1.4.1 Distributions
Let the standard normal distribution N(0,1) have α-quantiles u_α. Let X1,…,Xn be iid standard normal variables. The sum of squares of these variables gives a chi-squared distributed variable with n degrees of freedom: χ² = Σ_{i=1}^n X_i² ~ χ²_n. Let the quantiles of the χ²_n distribution be χ²_{n;α}. Let X0, X1,…,Xn be iid standard normal variables. Then the variable t = X0 / √((1/n) Σ_{i=1}^n X_i²) follows a t-distribution with n degrees of freedom and α-quantiles t_{n;α}.
Finally let X1,…,Xm, Y1,…,Yn be standard normal iid variables. Then the variable F = [(1/m) Σ_{i=1}^m X_i²] / [(1/n) Σ_{i=1}^n Y_i²] follows an F-distribution with m and n degrees of freedom: F ~ F_{m,n}. The α-quantiles are F_{m,n;α}. The α-quantile of a continuous distribution with density f_X(x) is that value on the x-axis that separates the total probability mass in the proportion α : 1−α. One may also refer to it as the (1−α)-fractile.
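The quantiles u_α, χ²_{n;α}, t_{n;α} and F_{m,n;α} used throughout can be looked up numerically, e.g. with scipy (a sketch, not part of the original text):

```python
from scipy import stats

alpha = 0.95
print(stats.norm.ppf(alpha))              # u_alpha
print(stats.chi2.ppf(alpha, df=5))        # chi2_{5;alpha}
print(stats.t.ppf(alpha, df=5))           # t_{5;alpha}
print(stats.f.ppf(alpha, dfn=3, dfd=12))  # F_{3,12;alpha}
```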
1.4.2 Tests
While testing the null hypothesis H0 against the alternative hypothesis H1 two types of error can occur:
Type 1: H0 is wrongly rejected.
Type 2: H0 is wrongly accepted.
The level of significance α is specified so that the frequency of type 1 errors does not exceed α. Let T be a statistic T(x1,…,xn) such that large values indicate higher probability of H1 and low values indicate H0. Let the critical value then be c_{1−α} such that P(T > c_{1−α}) ≤ α for all θ ∈ Θ0.
One-sided hypothesis/test: H0: θ ≤ θ0 vs. H1: θ > θ0.
Two-sided hypothesis/test: H0: θ = θ0 vs. H1: θ ≠ θ0.
Likelihood quotient test, H0: θ ∈ Θ0 vs. H1: θ ∈ Θ\Θ0:
Λ(x1,…,xn) = sup_{θ∈Θ0} L(x1,…,xn,θ) / sup_{θ∈Θ\Θ0} L(x1,…,xn,θ);
for the simple hypothesis θ = θ0: Λ(x1,…,xn) = L(θ0)/sup_{θ∈Θ} L(θ). Using the estimator θ̂ in log-likelihoods ℓ:
−2 ln Λ(x1,…,xn) = 2(ℓ(θ̂) − ℓ(θ0)) ~ χ²_q.
2 KE 2
2.1 Two sample problems
2.1.1 Estimators
For p characteristics and n objects in a sample from a population with mean vector μ and covariance matrix Σ, measure the y_ij, where i = 1,…,n indexes the data points (rows) and j = 1,…,p the characteristics. Estimate the mean vector μ = (μ1,…,μp)' by the vector ȳ = (ȳ1,…,ȳp)', where each ȳ_j is the mean of a column. This estimator is unbiased. An unbiased estimator for the covariance matrix Σ is
$$S = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar y)(y_i-\bar y)' = \frac{1}{n-1}\Big(\sum_{i=1}^{n} y_i y_i' - n\,\bar y\,\bar y'\Big)$$
where y_i denotes the observation vector for the ith object. The matrix S has on the diagonal the empirical variances s_j² = 1/(n−1) Σ_{i=1}^n (y_ij − ȳ_j)² and off the diagonal the empirical covariances s_jk = 1/(n−1) Σ_{i=1}^n (y_ij − ȳ_j)(y_ik − ȳ_k) for j,k = 1,…,p.
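A short numpy sketch of both equivalent forms of the estimators, with a hypothetical 4×2 data matrix:

```python
import numpy as np

Y = np.array([[4.0, 2.0], [6.0, 3.0], [5.0, 7.0], [9.0, 4.0]])  # hypothetical n=4, p=2
n = Y.shape[0]
ybar = Y.mean(axis=0)                                 # estimates mu
S = (Y - ybar).T @ (Y - ybar) / (n - 1)               # unbiased estimator of Sigma
S_alt = (Y.T @ Y - n * np.outer(ybar, ybar)) / (n - 1)  # second form
assert np.allclose(S, S_alt)
```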
2.1.2 Tests for one sample problems
a) Known covariance matrix Σ: to test H0: μ = μ* use the statistic
T² = n(ȳ − μ*)' Σ⁻¹ (ȳ − μ*);
reject the null for T² > χ²_{p;1−α} (i.e. at the level α, μ is significantly different from μ*).
b) Unknown (estimated) covariance matrix, via Hotelling's T²:
(n−p)/(p(n−1)) · T² = (n−p)/(p(n−1)) · n(ȳ − μ*)' S⁻¹ (ȳ − μ*);
reject the null for T² > p(n−1)/(n−p) · F_{p,n−p;1−α}.
Simplified calculation of Hotelling's T²: for a block matrix
$$G = \begin{pmatrix} A & B \\ C & D \end{pmatrix},\qquad \det G = \begin{cases}\det A\,\det(D - CA^{-1}B), & \det A \neq 0\\[2pt] \det D\,\det(A - BD^{-1}C), & \det D \neq 0.\end{cases}$$
Let
$$G = \begin{pmatrix} (n-1)S & \sqrt{n}\,(\bar y - \mu^*) \\ \sqrt{n}\,(\bar y - \mu^*)' & -1 \end{pmatrix}$$
so that det G = det((n−1)S) · (−1 − T²/(n−1)), and solve for T².
c) Testing for a symmetric expected value vector H0: μ1 = μ2 = … = μp: compute the transformation z_i = (y_i1 − y_ip, y_i2 − y_ip, …, y_i,p−1 − y_ip)', i.e. subtract the last column entry from every entry in every row, re-estimate mean and covariance matrix accordingly (one entry/row/column less) and apply the above tests with μ* = 0 and p* = p−1.
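A sketch of test b), the one-sample Hotelling T² with estimated covariance matrix, in numpy/scipy; the function name and interface are my own:

```python
import numpy as np
from scipy import stats

def hotelling_one_sample(Y, mu0, alpha=0.05):
    """T2 test of H0: mu = mu0 with estimated covariance (sketch)."""
    n, p = Y.shape
    ybar = Y.mean(axis=0)
    S = np.cov(Y, rowvar=False)                      # divisor n-1
    d = ybar - mu0
    T2 = n * d @ np.linalg.solve(S, d)
    crit = p * (n - 1) / (n - p) * stats.f.ppf(1 - alpha, p, n - p)
    return T2, crit, T2 > crit                       # reject if T2 > crit
```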
2.1.3 Tests for two sample problems
a) For disjoint samples with a common covariance matrix, test H0: μ(1) = μ(2) with the two-sample Hotelling statistic based on the pooled covariance estimator S:
T² = n1n2/(n1+n2) · (ȳ(1) − ȳ(2))' S⁻¹ (ȳ(1) − ȳ(2)).
Reject H0 if T² > p(n1+n2−2)/(n1+n2−p−1) · F_{p,n1+n2−p−1;1−α}.
b) For disjoint samples and different covariance matrices, identical sample sizes n1 = n2 = n: compute the transformation k_i = y_i(1) − ȳ(1) − y_i(2) + ȳ(2) and use the statistic
T² = n(n−1) (ȳ(1) − ȳ(2))' (Σ_{i=1}^n k_i k_i')⁻¹ (ȳ(1) − ȳ(2)).
Reject H0 if T² > p(n−1)/(n−p) · F_{p,n−p;1−α}.
c) For disjoint samples and different covariance matrices, different sample sizes n1 < n2: compute the transformation
c_i = (y_i(1) − ȳ(1)) − √(n1/n2) (y_i(2) − (1/n1) Σ_{j=1}^{n1} y_j(2)), i = 1,…,n1,
compute the matrix B = Σ_{i=1}^{n1} c_i c_i' and use the statistic
T² = n1(n1−1) (ȳ(1) − ȳ(2))' B⁻¹ (ȳ(1) − ȳ(2)).
Reject H0 if T² > p(n1−1)/(n1−p) · F_{p,n1−p;1−α}.
d) For correlated samples and different covariance matrices, identical sample sizes n1 = n2 = n (correlated samples: the ith vector y_i(1) is somehow related to the ith vector y_i(2)): compute the transformation k_i = y_i(1) − ȳ(1) − y_i(2) + ȳ(2) and use the statistic
T² = n(n−1) (ȳ(1) − ȳ(2))' (Σ_{i=1}^n k_i k_i')⁻¹ (ȳ(1) − ȳ(2)).
Reject H0 if T² > p(n−1)/(n−p) · F_{p,n−p;1−α}.
To find a confidence interval for the correlation ρ: first find a confidence interval for ζ = artanh(ρ):
[z1, z2] = [z − u_{1−α/2}/√(n−3), z + u_{1−α/2}/√(n−3)],
where z = ½ ln((1+r)/(1−r)) is the Fisher transformation of the sample correlation r. Apply the inverse Fisher transformation r_i = (e^{2z_i} − 1)/(e^{2z_i} + 1) to find the interval [r1, r2].
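A sketch of this confidence interval, assuming numpy/scipy and hypothetical values of r and n:

```python
import numpy as np
from scipy import stats

def corr_ci(r, n, alpha=0.05):
    z = 0.5 * np.log((1 + r) / (1 - r))               # Fisher transform
    h = stats.norm.ppf(1 - alpha / 2) / np.sqrt(n - 3)
    inv = lambda z: (np.exp(2 * z) - 1) / (np.exp(2 * z) + 1)  # inverse transform
    return inv(z - h), inv(z + h)

print(corr_ci(0.6, 50))   # hypothetical r and n
```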
The multiple coefficient of determination r²_{X,(Y1,…,Yp)} indicates the amount of variation in X explained by the characteristics Y1,…,Yp. To test for multiple independence H0: ρ_{X,(Y1,…,Yp)} = 0 vs. H1: ρ_{X,(Y1,…,Yp)} ≠ 0 (i.e. ρ_{X,Yi} ≠ 0 for some i), use the statistic
F = [(n−p−1) r²_{X,(Y1,…,Yp)}] / [p(1 − r²_{X,(Y1,…,Yp)})].
Reject H0 if F > F_{p,n−p−1;1−α}.
3 KE 3 Factor analysis
Aim: to explain observed characteristics by means of a smaller number of latent variables, the factors. Usually when analyzing p characteristics Y1,…,Yp, correlations between the characteristics will exist and will point to the existence of some latent factors. First obtain an empirical correlation matrix from standardized (N(0,1)) data: data matrix → correlation matrix.
Given the n×p data matrix Y = (y_ij), compute R from the standardized data matrix:
$$R = (r_{jk}) = \frac{1}{n-1}\, Y_{st}' Y_{st},\qquad y_{ij}^{st} = \frac{y_{ij} - \bar y_j}{s_j}\ \text{ for } i = 1,\dots,n;\ j = 1,\dots,p,$$
where ȳ_j and s_j are the empirical expected value and standard deviation of the jth characteristic. Now assume that every standardized data point y_ij^st can be expressed as a linear combination of q factors F1,…,Fq:
y_ij^st = l_j1 f_i1 + … + l_jq f_iq for i = 1,…,n and j = 1,…,p, i.e. Y_st = F L',
with F = (f_ik) the n×q matrix of factor values and L = (l_jk) the p×q factor pattern containing the factor loadings. The loading l_jk is an indicator of the relationship between the kth factor and the jth characteristic; f_ik is called the factor value of the kth factor F_k for object i. F is assumed to be standardized N(0,1), i.e. each column has mean 0 and variance 1. Both matrices are unknown and must be estimated from the data. Estimate the loadings matrix from
$$R = \frac{1}{n-1} Y_{st}' Y_{st} = \frac{1}{n-1} L F' F L' = L R_F L',$$
where R_F = 1/(n−1) F'F denotes the correlation matrix of the latent factors.
For independent factors and linear relationships between factors and variables, R_F = I and R = LL', and l_jk ∈ [−1,1], since these numbers give the correlation between the jth characteristic and the kth factor (Fundamental Theorem of Factor Analysis). A factor for which only one of the loadings l_1k,…,l_pk is significantly different from zero is called a unique factor, whereas several nonzero loadings indicate a common factor. A general factor has all loadings significantly different from zero. The complexity of a characteristic Y_j is the number of high loadings l_jk on common factors.
The full loadings matrix can be decomposed into common and unique parts:
$$\tilde L = [\,L \mid 0\,] + [\,0 \mid U\,],$$
where L holds the loadings on the q common factors and U = diag(l_{1,q+1},…,l_{p,q+p}) the loadings on the unique factors. For orthogonal factors this implies (the cross terms vanish)
R = L̃ L̃' = LL' + UU' = LL' + U²,
where the elements of the matrix U² are those parts of the variances of the characteristics that cannot be explained by the common factors F1,…,Fq: U² = UU' = diag(σ1²,…,σp²). For orthogonal factors it is possible to give a reduced empirical correlation matrix from the loadings matrix:
R̃ = R − U² = LL'.
Its diagonal elements k_j² = 1 − σ_j² are called the communality of the jth characteristic. Communalities indicate what portion of the variance of the jth characteristic is explained by the common factors. For non-orthogonal factors one obtains R̃ = R − U² = L R_F L' and therefore
k_j² = Σ_{k=1}^q l_jk² + 2 Σ_{k=1}^{q−1} Σ_{k'=k+1}^q l_jk l_jk' r_{F_k F_k'}.
Let y_st be a p×1 random vector ~ N(0, Σ̃), let Λ be a p×q theoretical loadings matrix, let f be a q×1 random vector ~ N(0, I_q) and let e be a p×1 random vector whose components are ~ N(0, σ_j²). Assume independence of f and e. The model y_st = Λf + e yields the covariance matrix of y_st as
Σ̃ = ΛΛ' + diag(σ1²,…,σp²).
To obtain unique estimators L for the loadings matrix and U² = diag(σ̂1²,…,σ̂p²) for the characteristic-specific (unexplained) variances, the ML method additionally requires that L'U⁻²L be a diagonal matrix. Not required: providing ex ante communalities k1²,…,kp².
Required: ex ante specification of the number of common factors q. The empirical covariance matrix S is an unbiased estimator for Σ̃; the elements of S are jointly Wishart-distributed (multivariate χ²). Maximizing the likelihood in two steps (first w.r.t. Λ holding diag(σ1²,…,σp²) constant, then inserting this conditional ML estimator into the likelihood function and maximizing w.r.t. diag(σ1²,…,σp²)) leads to the eigenvalue problem
(R − U²)U⁻²A = AJ,
where J = A'U⁻²A is the diagonal matrix of the eigenvalues of (R − U²)U⁻² and A is the matrix of normalized eigenvectors. The ML estimator for Λ turns out to be L = AJ^{1/2}. It can only be obtained for known U, which in turn can be obtained only numerically. Canonical factor analysis demands that the q canonical correlations between characteristics and latent factors be maximized. The first canonical correlation is the maximum correlation between a linear combination of characteristics and a linear combination of factors; the second is the same, only orthogonal to the first, etc. For identical numbers of factors both approaches (ML and canonical) lead to identical estimates of the loadings matrix. This can be seen from the eigenvalue problems: left-multiplying (R − U²)U⁻²A = AJ by U⁻¹ gives
U⁻¹(R − U²)U⁻¹ (U⁻¹A) = (U⁻¹A) J, and L = UCJ^{1/2},
where J is the eigenvalue matrix of U⁻¹(R − U²)U⁻¹ and C = U⁻¹A the matrix of normalized eigenvectors. This is therefore equivalent to the above statement. Solve iteratively (a numeric sketch follows the significance test below):
1. Choose appropriate starting values σ_{j0}² for the p specific variances σ_j² to obtain a starting value U0² for the diagonal matrix of specific variances U². First let the communality of the jth characteristic Y_j, k_j², equal some starting value k_{j0}², e.g.
o the multiple coefficient of determination, i.e. the squared multiple correlation of the jth characteristic Y_j with the other p−1 characteristics, or
o the maximum absolute off-diagonal entry |r_jj'| in the jth row of R.
From this value define σ_{j0}² = 1 − k_{j0}².
2. Calculate the positive eigenvalues γ_{10},…,γ_{q0} of U0⁻¹(R − U0²)U0⁻¹ and the respective normalized eigenvectors c_{k0} = (c_{1k0},…,c_{pk0})'.
3. In the tth step calculate U_t² = diag(σ_{1t}²,…,σ_{pt}²), where σ_{jt}² = 1 − σ_{j,t−1}² Σ_{k=1}^q γ_{k,t−1} c²_{jk,t−1}, as well as the q largest eigenvalues of U_t⁻¹(R − U_t²)U_t⁻¹ and the corresponding eigenvectors c_{kt}.
4. Repeat and stop once the estimated U_{t+1}⁻¹ differs from U_t⁻¹ by no more than some small quantity ε. Find the loadings matrix from J^{1/2} = diag(√γ_{1t},…,√γ_{qt}), C = [c_{1t},…,c_{qt}]:
L = [l_1,…,l_q] = U_t C J^{1/2}.
To test which of the q factors contribute significantly to explaining the correlation of the p characteristics, test successively whether the first q factors suffice to reproduce R̃ = R − U_t²:
H0: Σ̃ = ΛΛ' + diag(σ1²,…,σp²) with q factors, vs. H1: Σ̃ an arbitrary positive definite symmetric p×p matrix.
Use the likelihood quotient test. For q ≤ ½(1 + 2p − √(1 + 8p)) use the statistic
χ² = (n − 1 − (2p+5)/6 − 2q/3) · ln[ det(L_q L_q' + U²_(q)) / det R ],
where L_q = [l_1,…,l_q] and U²_(q) = diag(1 − Σ_{k=1}^q l²_{1k}, …, 1 − Σ_{k=1}^q l²_{pk}).
Reject H0 if χ² exceeds the (1−α)-quantile of the χ² distribution with v = ½((p−q)² − p − q) degrees of freedom.
5. Repeat until R̃_{t+1} is sufficiently close to zero. Then the approximate loadings matrix for q = t factors F1,…,Fq is L = (l_jk), j = 1,…,p, k = 1,…,q.
The Jöreskog method assumes that the covariance matrix of the standardized characteristics has the structure Σ̃ = ΛΛ' + diag(σ1²,…,σp²)
and that the σ_j² are proportional to the reciprocals of the diagonal elements of the inverted covariance matrix: diag(σ1²,…,σp²) = θ (diag Σ̃⁻¹)⁻¹. In applications the covariance matrix is estimated by the empirical correlation matrix R; then diag R⁻¹ = diag(r^{11},…,r^{pp}) is an estimator for diag Σ̃⁻¹. Compute the eigenvalues γ1 ≥ … ≥ γp of
R* = (diag R⁻¹)^{1/2} R (diag R⁻¹)^{1/2}
and use the criterion
Σ_{j=1}^q γ_j / Σ_{j=1}^p γ_j > α and Σ_{j=1}^{q−1} γ_j / Σ_{j=1}^p γ_j ≤ α
to define the number of factors q. The factor of proportionality can then be estimated by θ̂ = 1/(p−q) Σ_{j=q+1}^p γ_j. The characteristic-specific variances can be approximated by the matrix diag(σ̂1²,…,σ̂p²) = θ̂ (diag R⁻¹)⁻¹. Now compute the normalized eigenvectors c1,…,cq of lengths √(γ1 − θ̂),…,√(γq − θ̂) and the estimate for the loadings matrix as
L = (diag R⁻¹)^{−1/2}(c1,…,cq).
If the angles of rotation are not identical for all factors, i.e. the factors lose their orthogonality, the rotation is oblique. Multiplication of the loadings matrix by the rotation matrix T then merely yields the factor structure: L_fs = LT. To obtain the loadings matrix of the oblique factors one must multiply by the inverse of the factor correlation matrix (R_F = T'T):
L_rot = L_fs R_F⁻¹ = LT(T'T)⁻¹.
For varimax, maximize the variance of the communality-normalized factor loadings
z_jk = l_jk / k_j, j = 1,…,p; k = 1,…,q, where k_j = √(Σ_{k=1}^q l_jk²),
i.e. the varimax criterion
V = Σ_{k=1}^q (1/p) Σ_{j=1}^p (z_jk² − z̄_{.k}²)², where z̄_{.k}² = (1/p) Σ_{j=1}^p z_jk².
The iterative varimax procedure is as follows (see the sketch after the quartimax procedure below):
1. For a given loadings matrix L and q factors F1,…,Fq compute the normalized loadings matrix Z0, normalizing each element z_jk^(0) as above. Also compute the varimax criterion V0.
2. For the tth step compute, for each factor pairing (k,k'), the auxiliary values
u_j = z_jk² − z_jk'², v_j = 2 z_jk z_jk', A = Σ_j u_j, B = Σ_j v_j, C = Σ_j (u_j² − v_j²), D = 2 Σ_j u_j v_j.
3. The step-t rotation angle φ for the (k,k') factor pairing then is
φ = ¼ arctan[(D − 2AB/p) / (C − (A² − B²)/p)].
4. Compute Z_{t+1} = Z_t Θ_t = Z_t Θ_{12}(t) ⋯ Θ_{q−1,q}(t) as well as V_{t+1}; proceed to the next step if V_{t+1} is substantially greater than V_t.
5. Finally compute L_rot, whose elements are l_jk^rot = z_jk^(t+1) k_j = z_jk^(t+1) √(Σ_{k=1}^q l_jk²).
The quartimax method maximizes the quartimax criterion
Q = Σ_{j=1}^p Σ_{k=1}^q l_jk⁴.
The aim is to obtain the closest approximation to a unifactor solution, i.e. one factor is supposed to explain as much of the information in the data as possible.
1. First calculate the quartimax criterion Q0 for a given loadings matrix L0.
2. In the tth step calculate, for every pair of factors, the corresponding auxiliary values; this gives the rotation angle φ.
3. Compute L_{t+1} = L_t Θ_t and Q_{t+1}; stop if Q_t and Q_{t+1} are almost identical. Then L_rot = L_{t+1}.
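A numpy sketch of the pairwise rotation procedure, shown for the varimax criterion from the previous passage; the helper names are my own and the step-3 angle is the standard Kaiser form assumed above:

```python
import numpy as np

def varimax_criterion(Z):
    p = Z.shape[0]
    return (((Z**2) - (Z**2).mean(axis=0))**2).sum() / p

def varimax(L, eps=1e-8, max_iter=50):
    """Pairwise varimax rotation of a loadings matrix (sketch)."""
    p, q = L.shape
    k = np.sqrt((L**2).sum(axis=1))          # communality norms k_j
    Z = L / k[:, None]                       # normalized loadings z_jk
    for _ in range(max_iter):
        V_old = varimax_criterion(Z)
        for a in range(q - 1):
            for b in range(a + 1, q):
                u = Z[:, a]**2 - Z[:, b]**2
                v = 2.0 * Z[:, a] * Z[:, b]
                A, B = u.sum(), v.sum()
                C = (u**2 - v**2).sum()
                D = 2.0 * (u * v).sum()
                phi = 0.25 * np.arctan2(D - 2*A*B/p, C - (A**2 - B**2)/p)
                c, s = np.cos(phi), np.sin(phi)
                Za = c * Z[:, a] + s * Z[:, b]     # apply plane rotation Theta
                Zb = -s * Z[:, a] + c * Z[:, b]
                Z[:, a], Z[:, b] = Za, Zb
        if varimax_criterion(Z) - V_old < eps:     # stop when V barely improves
            break
    return Z * k[:, None]                    # l_jk_rot = z_jk * k_j
```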
A first step in factor analysis is the description of the standardized data via a linear combination of factor values where the weights are the factor loadings. The reduced empirical correlation matrix can be written as R̃ = R − U², where U² is the diagonal matrix of characteristic-specific variances. This is approximately
R̃ = 1/(n−1) Y_st' Y_st − U² ≈ L R_F L',
where L is an arbitrary loadings matrix for q factors and R_F the correlation matrix of these factors. Then we obtain an approximation of the standardized data matrix from the unknown factor value matrix F: Ŷ_st = F L'. F describes the n objects in terms of q factors and has p−q fewer columns than Y_st. For a principal components analysis F is easily obtained from Y_st = FL' and Y_st L = FL'L = FI = F, since L here is an orthogonal matrix. For other cases one must estimate F.
First estimation method: assume that some standardized random vector y_st can be represented as a linear combination of two uncorrelated random vectors f and e weighted by L resp. U:
y_st = Lf + Ue.
This gives the theoretical covariance matrix Σ_yst = LL' + UU' = LL' + U².
Replace Σ_yst by the empirical correlation matrix R. The respective vector of factor values then is
f̂ = L'R⁻¹ y_st, F̂ = Y_st R⁻¹ L.
For orthogonal rotation L_rot = LΘ and for oblique rotation L_rot = L_fs(T'T)⁻¹ = LT(T'T)⁻¹, so the factor values relative to the rotated factors are estimated by
f̂_rot = L_rot' R⁻¹ y_st (F̂_rot = Y_st R⁻¹ L_rot) for orthogonal rotation, and
f̂_rot = L_fs' R⁻¹ y_st (F̂_rot = Y_st R⁻¹ L_fs) for oblique rotation.
Second estimation method: this method assumes that the nonstandardized data can be represented as the linear combination y = Lf + Ue, where L and U derive from the covariance matrix and the characteristic-specific variance matrix of the nonstandardized data. Here f has E f = f̃ and Cov f = I, and e has E e = 0, Cov e = I. Decompose f into a deterministic component f̃ and a random component δ with E δ = 0, Cov δ = I:
y = L f̃ + Lδ + Ue, so that E y = L f̃ and Cov(y) = Σ_y = LL' + U².
For estimated L, U replace Σ_y by the empirical covariance matrix S. If S and L'S⁻¹L are invertible one obtains the estimators
f̂~ = (L'S⁻¹L)⁻¹ L'S⁻¹ y, δ̂ = (L'S⁻¹L)⁻¹L'S⁻¹y − (L'S⁻¹L)⁻¹L'S⁻¹y = 0,
so estimate f by f̂ = f̂~ + δ̂ = f̂~. Again we have L → L_rot for both orthogonal and oblique rotation; therefore the factor values w.r.t. the rotated factors can be estimated by
f̂_rot = (L_rot' S⁻¹ L_rot)⁻¹ L_rot' S⁻¹ y.
For standardized data vectors replace the empirical covariance matrix S by the empirical correlation matrix R of the characteristics.
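A one-function sketch of the first estimation method, F̂ = Y_st R⁻¹ L, in numpy:

```python
import numpy as np

def factor_scores(Y_st, L):
    """F_hat = Y_st R^{-1} L, with R the empirical correlation matrix (sketch)."""
    n = Y_st.shape[0]
    R = (Y_st.T @ Y_st) / (n - 1)
    return Y_st @ np.linalg.solve(R, L)    # n x q matrix of estimated factor values
```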
4 KE 4 Scaling procedures
4.1 Introduction
Data matrix vs. distance matrix: for observations of p characteristics on n objects one obtains a data matrix, every row i of which is a p-dimensional vector of observations for object i and every column of which is an n-dimensional vector of values for one characteristic. If the matrix contains both qualitative and quantitative data call it mixed, else the one or the other. A distance matrix for n objects is an n×n symmetric matrix with zeros on the main diagonal and distances between elements off the diagonal, where the entry d(i,j) = d(j,i) indicates the degree of difference between objects i and j. Possible distance measures (see the sketch below):
L_r distance: d(i,j) = (Σ_{k=1}^p |y_ik − y_jk|^r)^{1/r}, which gives for
r = 1: d(i,j) = Σ_{k=1}^p |y_ik − y_jk| (city block distance),
r = 2: d(i,j) = √(Σ_{k=1}^p (y_ik − y_jk)²) (Euclidean distance),
r = ∞: d(i,j) = max{|y_i1 − y_j1|,…,|y_ip − y_jp|} (Tchebycheff distance).
Weighting the Euclidean distance by the empirical covariance matrix gives the Mahalanobis distance: d(i,j) = ((y_i − y_j)' S⁻¹ (y_i − y_j))^{1/2}.
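A sketch of these distance measures in numpy; the function names are my own:

```python
import numpy as np

def lr_distance(yi, yj, r):
    """L_r (Minkowski) distance; r=1 city block, r=2 Euclidean."""
    return (np.abs(yi - yj) ** r).sum() ** (1.0 / r)

def tchebycheff(yi, yj):
    """Limit r -> infinity."""
    return np.abs(yi - yj).max()

def mahalanobis(yi, yj, S):
    """Euclidean distance weighted by the empirical covariance matrix S."""
    d = yi - yj
    return np.sqrt(d @ np.linalg.solve(S, d))
```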
x̄ = (1/n) Σ_{i=1}^l n_{i.} x_i = (1/n) Σ_{j=1}^c n_{.j} y_j = ȳ = 0 and s_x² = (1/n) Σ_{i=1}^l n_{i.} x_i² = (1/n) Σ_{j=1}^c n_{.j} y_j² = s_y² = 1.
Define vector-valued dummy variables U = (U1,…,Ul)' and V = (V1,…,Vc)' with realizations that are always unit vectors u = e_i^l = (0,…,0,1,0,…,0)', v = e_j^c = (0,…,0,1,0,…,0)', where the 1 is in the ith or jth row respectively. Now we look for random variables X* = α'U and Y* = β'V where the weights α, β are chosen such that the sample correlation between X* and Y*, r_XY, is maximal. The weight vectors result from the normalized eigenvectors of the empirical canonical correlation between U and V. Any observation in the contingency table can be written as (x_i*, y_j*) = (α_i, β_j), and the scaled values for x and y result from normalizing α and β, i.e. from the l×l matrix Q where each
q_ik = q_ki = (Σ_{j=1}^c n_ij n_kj / n_{.j} − n_{i.} n_{k.}/n) / √(n_{i.} n_{k.}).
The first canonical correlation between U and V is the square root of the largest eigenvalue γ_G of Q: r_XY = √γ_G. Find an eigenvector f belonging to γ_G and compute the weights as α_i = f_i/√(n_{i.}). Then the scaled values x result from the normalized α's:
x_i = α_i √n / √(Σ_{i=1}^l n_{i.} α_i²), i = 1,…,l.
The weights and the scaled values for characteristic y result as
β_j = (1/n_{.j}) Σ_{i=1}^l n_ij α_i and y_j = β_j √n / √(Σ_{j=1}^c n_{.j} β_j²), j = 1,…,c.
This stress function is invariant w.r.t. changes in the coordinate system and has values between 0 and
gS_max(y1,…,yn) = (Σ_{i=1}^{n−1} Σ_{j=i+1}^n (d*(i,j) − d̄*)²)^{1/2}, where d̄* = 2/(n(n−1)) Σ_{i=1}^{n−1} Σ_{j=i+1}^n d*(i,j).
Very good values of the configuration are gS < 0.05 gS_max; then in steps of 0.05 up to values of 0.20 gS_max and above, which are not considered satisfactory. For a given dimension q of the representation space choose a starting configuration y1^0 = (y_11^0,…,y_1q^0),…,yn^0 = (y_n1^0,…,y_nq^0). For the τth step compute for all pairs y_i^{τ−1}, y_j^{τ−1} their Euclidean distances d*_{τ−1}(i,j) and sort according to the proximities d(i,j), so that the distance between the pair of objects with lowest proximity comes first. To compute the stress function gS, transform the distances d*_{τ−1}(i,j) monotonically into d̂_{τ−1}(i,j).
The among-class heterogeneity is judged by heterogeneity measures v(K_i1, K_i2) ≥ 0 with v(K_i, K_i) = 0 as well as v(K_i1, K_i2) = v(K_i2, K_i1) for all classes K_i ∈ K. For disjoint classes (i.e. members of a partition or classes on one level of a hierarchy) one uses
single linkage: v(K_i1, K_i2) = min_{j∈K_i1, k∈K_i2} d(j,k), i.e. the heterogeneity of the most similar members;
complete linkage: v(K_i1, K_i2) = max_{j∈K_i1, k∈K_i2} d(j,k), i.e. the maximum inter-class inter-object distance.
The homogeneity indicators can be used for quasi-hierarchies and covers; not so the heterogeneity indicators. These must be modified so that overlapping classes are reduced by the overlapping objects before measuring heterogeneity. To judge the quality of classifications that are partitions one uses
homogeneity based: g(K) = Σ_{K_i ∈ K} h(K_i), or normalized by the number of classes, g(K) = (1/|K|) Σ_{K_i ∈ K} h(K_i);
heterogeneity based: the average pairwise heterogeneity over the (|K| choose 2) class pairs, g(K) = (1/(|K| choose 2)) Σ_{K_i1, K_i2 ∈ K, i1 < i2} v(K_i1, K_i2).
A partition with only one class can be judged only on the basis of homogeneity; however, using the first measure leads to an n-object class. Hierarchies are never assessed in their totality but rather level by level, based on each level's partition.
In step t generate the partition K^t = {K_1^t,…,K_{n−t}^t} from the previous step according to inter-class heterogeneity:
K_i^t = K_{i1*}^{t−1} ∪ K_{i2*}^{t−1} for i = min{i1*, i2*};
K_i^t = K_{i+1}^{t−1} for i ≥ max{i1*, i2*};
K_i^t = K_i^{t−1} else,
where i1* and i2* are chosen so that
v(K_{i1*}^{t−1}, K_{i2*}^{t−1}) = min_{i1,i2 ∈ {1,…,n−(t−1)}, i1 ≠ i2} v(K_{i1}^{t−1}, K_{i2}^{t−1}).
The heterogeneity can be measured by single, complete and average linkage, although single linkage is most common as it discovers very broad classes. The downside: very heterogeneous classes might be merged only because a single object lies in between. This procedure requires computation of the heterogeneities in every step, so it can become computationally costly. The following recursion reduces the step-t heterogeneities to those of step t−1. For t = 1,2,…,n−1:
v(K_{i1*}^{t−1} ∪ K_{i2*}^{t−1}, K_i^t) = λ1 v(K_{i1*}^{t−1}, K_i^t) + λ2 v(K_{i2*}^{t−1}, K_i^t) + λ3 |v(K_{i1*}^{t−1}, K_i^t) − v(K_{i2*}^{t−1}, K_i^t)|,
where the weights λ_k are chosen according to the measure of heterogeneity used:
single linkage: λ1 = 0.5, λ2 = 0.5, λ3 = −0.5;
complete linkage: λ1 = 0.5, λ2 = 0.5, λ3 = 0.5;
average linkage: λ1 = |K_{i1*}^{t−1}| / (|K_{i1*}^{t−1}| + |K_{i2*}^{t−1}|), λ2 = |K_{i2*}^{t−1}| / (|K_{i1*}^{t−1}| + |K_{i2*}^{t−1}|), λ3 = 0.
The resulting 2n−1 different classes can be represented as a dendrogram (see the sketch below). Sometimes one also uses
g(K^t) = min_{K_{i1}^{t−1}, K_{i2}^{t−1} ∈ K^{t−1}} v(K_{i1}^{t−1}, K_{i2}^{t−1}),
i.e. the heterogeneity of the fused classes, as an indicator for the quality of the partition on one level of the hierarchy.
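The single/complete/average linkage fusions (including the Lance-Williams recursion) are implemented in scipy; a minimal sketch with hypothetical data:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(2)
Y = rng.normal(size=(8, 3))              # hypothetical 8 objects, 3 characteristics
d = pdist(Y)                             # condensed Euclidean distance matrix
for method in ("single", "complete", "average"):
    Z = linkage(d, method=method)        # agglomerative fusion, Lance-Williams updates
    print(method, Z[:, 2])               # fusion heterogeneities, level by level
```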
If n objects are allocated to m disjoint classes K1,…,Km, which therefore form a partition of the n objects, then the objects of every class can be viewed as a learning sample from a population. Discrimination of the (n+1)st object means deciding which of the m populations it belongs to. Distinguish two data situations:
1. There are observations of p jointly normally distributed characteristics for each object.
2. There exists only a distance matrix for the objects.
Assume m = 2; then there are two populations N(μ(1), Σ) and N(μ(2), Σ). The densities are given by
f_i(y) = 1/√((2π)^p det Σ) · exp[−½(y − μ(i))' Σ⁻¹ (y − μ(i))], i = 1, 2.
Take the log of the density quotient f1(y)/f2(y) and find
h̃12(y) = −½[(y − μ(1))'Σ⁻¹(y − μ(1)) − (y − μ(2))'Σ⁻¹(y − μ(2))]
= (μ(1) − μ(2))'Σ⁻¹y − ½(μ(1) − μ(2))'Σ⁻¹(μ(1) + μ(2)),
the quadratic form in y cancelling. h̃12(y) is Fisher's linear discriminant function, which allocates an object with vector y to population i = 1 if h̃12(y) > 0, else to i = 2. In the case of unknown expected value vectors and covariance matrix, substitute the empirical estimators in the above formula. In the univariate case the formula reduces to
h12(y) = (ȳ1 − ȳ2)y/s² − ½(ȳ1 − ȳ2)(ȳ1 + ȳ2)/s² = (ȳ1 − ȳ2)/s² · [y − ½(ȳ1 + ȳ2)],
so the cutoff point is midway between the populations. This classification by Euclidean distance from the expected value vector does not carry over to the multivariate case unless Σ = σ²I. For the computations note h_ij(y) = −h_ji(y).
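A sketch of Fisher's linear discriminant function with the empirical estimators substituted, assuming numpy and the two samples given as row matrices:

```python
import numpy as np

def fisher_discriminant(Y1, Y2):
    """Returns h(y) built from the pooled empirical estimators (sketch)."""
    n1, n2 = len(Y1), len(Y2)
    m1, m2 = Y1.mean(axis=0), Y2.mean(axis=0)
    S = ((n1 - 1) * np.cov(Y1, rowvar=False)
         + (n2 - 1) * np.cov(Y2, rowvar=False)) / (n1 + n2 - 2)   # pooled covariance
    w = np.linalg.solve(S, m1 - m2)                # S^{-1}(m1 - m2)
    def h(y):
        return w @ (y - 0.5 * (m1 + m2))           # allocate to population 1 if h(y) > 0
    return h
```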
A larger value of T² indicates better discrimination of the p characteristics between the m classes. To test whether discriminating between the populations is significantly possible (H0: it is not possible to discriminate), use the statistic T²(Y1,…,Yp) and reject the null if T² > c_HL;1−α(p, n−m, m−1), where
c_HL;1−α(p, n−m, m−1) ≈ [s²(2u+s+1)] / [2(sv+1)] · F_{s(2u+s+1), 2(sv+1); 1−α}
and s = min(p, m−1), u = ½(|p−m+1|−1), v = ½(n−m−p−1). For m = 2:
T²(Y1,…,Yp) = tr(S_h S_e⁻¹) = n1n2/(n(n−2)) · (ȳ1 − ȳ2)'S⁻¹(ȳ1 − ȳ2),
where S = S_e/(n−2) is an estimator for Σ. In this case the approximation of the quantiles of the Hotelling-Lawley statistic is exact:
T²(Y1,…,Yp) · (n−p−1)/p ~ F_{p,n−p−1}.
The characteristics discriminate significantly at the level α if
(n−p−1)/p · n1n2/(n(n−2)) · (ȳ1 − ȳ2)'S⁻¹(ȳ1 − ȳ2) > F_{p,n−p−1;1−α}.
If the ith row of Y is given by y_i' = (y_i1,…,y_ip), the ith row of the design matrix X is given by x_i' = (x_i1,…,x_im), and β(j) is the jth column of the parameter matrix B, then the model functions as follows: for some value of the input variables x_i observe the corresponding response variables y_i; the parameter vector β(j) is supposed to explain the relationship between the input variables and the jth response variable as closely as possible. In the following assume that rg X = m, i.e. full column rank, so that X'X is regular. Estimator for the parameter matrix:
B̂ = (X'X)⁻¹X'Y.
Estimator for the covariance matrix: Σ̂ = S_e/(n−m), where S_e is the error matrix
S_e = Y'Y − Y'X(X'X)⁻¹X'Y.
To test hypotheses of the form H0: KB = 0 vs. H1: KB ≠ 0, where K is a w×m test matrix with full row rank, compute the hypothesis matrix
S_h = Y'X(X'X)⁻¹K'(K(X'X)⁻¹K')⁻¹K(X'X)⁻¹X'Y.
The test statistics are functions of the eigenvalues of S_h S_e⁻¹: λ1 ≥ λ2 ≥ … ≥ λp ≥ 0. The critical values depend on the number p of response variables as well as the degrees of freedom of the error, n_e, and of the hypothesis, n_h. For the general multivariate regression model we have n_e = n − rg X = n − m and n_h = rg K = w.
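A numpy sketch computing B̂, S_e, S_h and Σ̂ exactly as defined above; the function name and interface are my own:

```python
import numpy as np

def mv_regression(Y, X, K):
    """Estimators and hypothesis/error matrices for H0: K B = 0 (sketch)."""
    n, m = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    B = XtX_inv @ X.T @ Y                          # parameter matrix estimator
    Se = Y.T @ Y - Y.T @ X @ XtX_inv @ X.T @ Y     # error matrix
    M = K @ XtX_inv @ K.T
    Sh = (K @ B).T @ np.linalg.solve(M, K @ B)     # hypothesis matrix
    Sigma_hat = Se / (n - m)                       # covariance estimator
    return B, Se, Sh, Sigma_hat
```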
Assume confidence level 1−α.
Wilks test, H0: KB = 0 vs. H1: KB ≠ 0. Use the statistic
Λ_W = det S_e / det(S_e + S_h)
and reject the null for Λ_W < c_W;α(p, n_e, n_h).
Hotelling-Lawley test. Use the statistic HL = Σ_{i=1}^p λ_i and reject the null for HL > c_HL;1−α(p, n_e, n_h), where
c_HL;1−α(p, n_e, n_h) ≈ [s²(2u+s+1)] / [2(sv+1)] · F_{s(2u+s+1), 2(sv+1); 1−α}
with s = min(p, n_h), u = ½(|p−n_h|−1), v = ½(n_e−p−1).
Pillai-Bartlett test. Use the statistic PB = Σ_{i=1}^p λ_i/(1+λ_i) and reject the null for PB > c_PB;1−α(p, n_e, n_h), approximated by
PB/(s − PB) > [(2u+s+1)/(2v+s+1)] · F_{s(2u+s+1), s(2v+s+1); 1−α},
with s, u and v as above.
Roy test. Use the statistic θ = λ1/(1+λ1) and reject the null for θ > c_R;1−α(p, n_e, n_h), where c_R is tabulated.
For p = 1 all tests reduce to a regular F-test, and for n_h = 1 an exact F-test can be given, for example in the case of Wilks:
c_W;α(p, n_e, 1) = 1/[1 + p/(n_e−p+1) · F_{p,n_e−p+1;1−α}].
Accordingly the degrees of freedom are n_h = r−1, n_e = r(s−1). The covariance matrix is estimated just as in multivariate regression: Σ̂ = S_e/n_e = S_e/(r(s−1)).
The simple profile analysis looks at the influence of one qualitative factor A with r levels on s objects through time. The model is identical to the univariate one-way classification:
y_ij = μ + α_i + e_ij, i = 1,…,r; j = 1,…,s; n = rs; y_ij = (y_ij(t1), y_ij(t2),…, y_ij(tp))',
where μ is the average vector and α_i is the effect of level i of factor A. The estimators are identical to the one-way classification:
μ̂ = ȳ.. = (1/n) Σ_{i=1}^r Σ_{j=1}^s y_ij, α̂_i = ȳ_i. − ȳ.. = ȳ_i. − μ̂, where ȳ_i. = (1/s) Σ_{j=1}^s y_ij.
Thus for the time step t the estimated means are
μ̂(t) = ȳ..(t) = (1/n) Σ_{i=1}^r Σ_{j=1}^s y_ij(t), α̂_i(t) = ȳ_i.(t) − ȳ..(t) = ȳ_i.(t) − μ̂(t), where ȳ_i.(t) = (1/s) Σ_{j=1}^s y_ij(t).
One commonly tests the hypothesis H0A: α1 = … = αr = 0. To be able to use the known tests, compute the hypothesis and error matrices
S_h^A = s Σ_{i=1}^r (ȳ_i. − ȳ..)(ȳ_i. − ȳ..)', S_e = Σ_{i=1}^r Σ_{j=1}^s (y_ij − ȳ_i.)(y_ij − ȳ_i.)',
with degrees of freedom n_h^A = r−1, n_e = r(s−1).
A different common hypothesis is parallelism: H01: there exist k1,…,kr such that α1 + k1·1_p = … = αr + kr·1_p, where 1_p = (1,…,1)'. One tests whether the impact of the r levels of factor A is identical up to a constant. Hypothesis and error matrices:
S_h(1) = (I_{p−1} | −1_{p−1}) S_h^A (I_{p−1} | −1_{p−1})', S_e(1) = (I_{p−1} | −1_{p−1}) S_e (I_{p−1} | −1_{p−1})',
with degrees of freedom n_h(1) = n_h^A = r−1; n_e(1) = n_e = r(s−1), p* = p−1.
Identity of time-means: H02: ᾱ1 = … = ᾱr, with ᾱ_i = (1/p) Σ_{τ=1}^p α_i(t_τ) for i = 1,…,r. One tests whether the effects of the r levels of factor A are identical on average over the p time steps. Hypothesis and error matrices are scalars:
S_h(2) = 1_p' S_h^A 1_p, S_e(2) = 1_p' S_e 1_p,
with degrees of freedom n_h(2) = n_h^A = r−1; n_e(2) = n_e = r(s−1), p* = 1.
Also one can group the observations y1,…,yn at m measurement points x1,…,xm. It is then possible that at the point x_i (i = 1,…,m) the value y was observed n_i times. If the event y = 1 took place n_i(1) times and the event y = 0 took place n_i(0) times (n_i = n_i(1) + n_i(0); n_1 + … + n_m = n), then the probability of observing y = 1 at position x_i, p_i = P(y = 1 | x = x_i), can be estimated by p̂_i = n_i(1)/n_i.
The regression relationship p = β0 + β1x linearly explains the probability of observing y = 1 given the value x of the regressor X; this is therefore called the linear model of discrete regression. However, the predicted probabilities can leave the interval [0,1], so one uses transformations which generate estimates that represent a genuine probability. The probit (normit) model uses the standard normal distribution function as transformation: p = Φ(β0 + β1x), where
Φ(x) = ∫_{−∞}^x φ(t) dt = ∫_{−∞}^x (1/√(2π)) e^{−t²/2} dt.
First calculate the probits g_i^prob = Φ⁻¹(p̂_i) = u_{p̂_i}, where u_{p̂_i} is the p̂_i-quantile of the standard normal distribution. Then estimate by least squares the parameters of the regression relation g^prob = Φ⁻¹(p) = β0 + β1x. From this one obtains the estimated regression relation p̂ = Φ(ĝ^prob) = Φ(β̂0 + β̂1x). Since Φ is a distribution function one always obtains values between 0 and 1.
A different frequently used transformation function is the logistic distribution function F_lgt(z) = 1/(1 + e^{−z}). In the logit model one computes the logits g_i^lgt = F_lgt⁻¹(p̂_i) = ln(p̂_i/(1 − p̂_i)). Then again estimate by least squares the parameters of the regression relation g^lgt = F_lgt⁻¹(p) = ln(p/(1−p)) = β0 + β1x. This yields the estimated regression relation
p̂ = F_lgt(ĝ^lgt) = F_lgt(β̂0 + β̂1x) = 1/(1 + exp(−(β̂0 + β̂1x))).
Simple canonical summary (see the sketch below):
1. Find relative frequencies as estimators for the probabilities at the observation points.
2. Compute identities/probits/logits by applying the inverse transformation G⁻¹.
3. Regress these on the regressors.
4. Insert the estimated weights into G to obtain probability estimates for unknown values of the regressors.
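A sketch of steps 1-4 for the probit and logit transformations, using plain (unweighted) least squares in step 3; the Berkson-Theil method below refines this with weights. All data values are hypothetical:

```python
import numpy as np
from scipy.stats import norm

# hypothetical grouped data: points x_i, n_i trials, n_i(1) successes
x  = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n  = np.array([40, 40, 40, 40, 40])
n1 = np.array([4, 10, 21, 30, 36])

p_hat = n1 / n                              # step 1: relative frequencies
g_logit  = np.log(p_hat / (1 - p_hat))      # step 2: logits
g_probit = norm.ppf(p_hat)                  #         probits
X = np.column_stack([np.ones_like(x), x])
beta_logit,  *_ = np.linalg.lstsq(X, g_logit,  rcond=None)   # step 3
beta_probit, *_ = np.linalg.lstsq(X, g_probit, rcond=None)
x_new = np.column_stack([np.ones(2), np.array([1.5, 4.5])])  # step 4: predictions
print(1 / (1 + np.exp(-(x_new @ beta_logit))))
print(norm.cdf(x_new @ beta_probit))
```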
7.2.1 Berkson-Theil-Method
The Berkson-Theil method uses weighted least squares to estimate the βs. To this end one obtains, for each of m measurement points x_i = (x_i1,…,x_i,h−1)', i = 1,…,m, related to the regressors X1,…,X_{h−1}, n_i observations of the binary regressand Y. If one observes the event y = 1 n_i(1) times, one can estimate the probability of observing y = 1 at the point x_i, i.e. p_i = P(y = 1 | x = x_i), by p̂_i = n_i(1)/n_i. Since the individual events follow a Bernoulli distribution we obtain for the expected value resp. the variance of p̂_i:
E(p̂_i) = E(n_i(1)/n_i) = (1/n_i) E(n_i(1)) = (1/n_i) n_i p_i = p_i,
Var(p̂_i) = Var(n_i(1)/n_i) = (1/n_i²) Var(n_i(1)) = (1/n_i²) n_i p_i(1−p_i) = p_i(1−p_i)/n_i.
For sufficiently large n_i we have s_i² = p̂_i(1−p̂_i)/n_i as an estimator for the variance of p̂_i. This estimator is used for the least squares estimation of β0, β1,…,β_{h−1}. To estimate a relationship of the type P = G(β0 + Σ_{j=1}^{h−1} β_j x_j), where G⁻¹ exists, first estimate ĝ_i = G⁻¹(p̂_i) and s_{G,i}² as an estimator for the variance of ĝ_i. Then regress g = G⁻¹(P) on X1,…,X_{h−1} via the relationship G⁻¹(P) = β0 + Σ_{j=1}^{h−1} β_j x_j. The regression model is
ĝ_i = G⁻¹(p̂_i) = β0 + Σ_{j=1}^{h−1} β_j x_ij + e_i,
where the e_i are independent error terms with variances estimated by s_{G,i}². In matrix form:
$$X = \begin{pmatrix} 1 & x_1' \\ 1 & x_2' \\ \vdots & \vdots \\ 1 & x_m' \end{pmatrix},\quad \hat g = \begin{pmatrix} \hat g_1 \\ \vdots \\ \hat g_m \end{pmatrix},\quad \beta = \begin{pmatrix} \beta_0 \\ \vdots \\ \beta_{h-1} \end{pmatrix},\quad \hat\Sigma_G = \mathrm{diag}(s_{G,1}^2,\dots,s_{G,m}^2).$$
Here Σ̂_G is the estimator for the diagonal covariance matrix of ĝ resp. e = (e1,…,em)'. Then the weighted least squares estimator for the parameter β is given by
β̂ = (β̂0, β̂1,…,β̂_{h−1})' = (X'Σ̂_G⁻¹X)⁻¹X'Σ̂_G⁻¹ĝ.
For G = id we have the relationship P = β0 + Σ_{j=1}^{h−1} β_j x_j, estimated by P̂ = β̂0^linear + Σ_{j=1}^{h−1} β̂_j^linear x_j, where β̂^linear is the weighted least squares estimator for β estimated using ĝ_i = p̂_i and Σ̂_id = diag(s_{id,1}²,…,s_{id,m}²) = diag(s1²,…,sm²).
For G = Φ we have the relationship Φ⁻¹(P) = β0 + Σ_{j=1}^{h−1} β_j x_j, estimated by
P̂ = Φ(β̂0^probit + Σ_{j=1}^{h−1} β̂_j^probit x_j), ĝ = Φ⁻¹(P̂) = β̂0^probit + Σ_{j=1}^{h−1} β̂_j^probit x_j,
where β̂^probit is the weighted least squares estimator for β estimated using ĝ_i = g_i^probit = Φ⁻¹(p̂_i) = u_{p̂_i} and
Σ̂_Φ = diag(s_{Φ,1}²,…,s_{Φ,m}²) = diag(s1²/φ(ĝ1)²,…,sm²/φ(ĝm)²)
as the estimator of the covariance matrix. This covariance matrix is derived as follows. p̂_i denotes the relative frequency of y = 1 at x_i; then p̂_i = p_i + u_i, where u_i is the unknown error with E(u_i) = 0, Var(u_i) = p_i(1−p_i)/n_i. Now apply the inverse standard normal distribution function: Φ⁻¹(p̂_i) = Φ⁻¹(p_i + u_i). The Taylor series expansion around p_i gives
Φ⁻¹(p̂_i) = Φ⁻¹(p_i) + u_i ∂Φ⁻¹(p_i)/∂p_i + R_i,
where ∂Φ⁻¹(p_i)/∂p_i = 1/φ(Φ⁻¹(p_i)) and R_i is the remainder term, which converges in probability to zero as n_i → ∞. Therefore Φ⁻¹(p̂_i) ≈ Φ⁻¹(p_i) + u_i/φ(Φ⁻¹(p_i)), and with ĝ_i = Φ⁻¹(p̂_i) = β0 + Σ_{j=1}^{h−1} β_j x_ij + e_i the error e_i has expected value E(e_i) = 0 and variance
Var(e_i) = Var[u_i/φ(g_i)] = p_i(1−p_i)/(n_i φ(g_i)²) = s_i²/φ(g_i)².
Therefore the estimator of the covariance matrix turns out as Σ̂_Φ = diag(s1²/φ(ĝ1)²,…,sm²/φ(ĝm)²).
Finally, for the logit model with G = F_lgt we have the relationship
P = F_lgt(β0 + Σ_{j=1}^{h−1} β_j x_j) = 1/[1 + exp(−(β0 + Σ_{j=1}^{h−1} β_j x_j))], F_lgt⁻¹(P) = ln[P/(1−P)] = β0 + Σ_{j=1}^{h−1} β_j x_j,
where β̂^logit is the weighted least squares estimator for β estimated using ĝ_i = g_i^logit = F_lgt⁻¹(p̂_i) = ln[p̂_i/(1−p̂_i)] and
Σ̂_Flgt = diag(s_{Flgt,1}²,…,s_{Flgt,m}²) = diag(1/(n1²s1²),…,1/(nm²sm²))
as the estimator of the covariance matrix. This covariance matrix is obtained from the following. With p̂_i = p_i + u_i the odds ratio is
p̂_i/(1−p̂_i) = (p_i + u_i)/(1 − p_i − u_i) = p_i/(1−p_i) · (1 + u_i/p_i)/(1 − u_i/(1−p_i)),
so the log-odds ratio becomes
ln[p̂_i/(1−p̂_i)] = ln[p_i/(1−p_i)] + ln(1 + u_i/p_i) − ln(1 − u_i/(1−p_i)).
Expanding the last two terms as Taylor series in u_i/p_i resp. u_i/(1−p_i) and dropping higher-order terms:
ln[p̂_i/(1−p̂_i)] ≈ ln[p_i/(1−p_i)] + u_i/p_i + u_i/(1−p_i)
= β0 + Σ_{j=1}^{h−1} β_j x_ij + u_i/(p_i(1−p_i)),
where ĝ_i = β0 + Σ_{j=1}^{h−1} β_j x_ij + e_i and ĝ_i = ln[p̂_i/(1−p̂_i)]. Then e_i has expected value E(e_i) = 0 and variance
Var(e_i) = Var[u_i/(p_i(1−p_i))] = 1/(p_i²(1−p_i)²) · Var(u_i) = 1/(p_i²(1−p_i)²) · p_i(1−p_i)/n_i = 1/(n_i p_i(1−p_i)) = 1/(n_i² s_i²).
The estimator for the covariance matrix Σ_Flgt then becomes Σ̂_Flgt = diag(1/(n1²s1²),…,1/(nm²sm²)), and the resulting estimators for g and P are
P̂ = F_lgt(β̂0^logit + Σ_{j=1}^{h−1} β̂_j^logit x_j) = 1/[1 + exp(−(β̂0^logit + Σ_{j=1}^{h−1} β̂_j^logit x_j))],
ĝ = F_lgt⁻¹(P̂) = ln[P̂/(1−P̂)] = β̂0^logit + Σ_{j=1}^{h−1} β̂_j^logit x_j.
The goodness of fit of the regression curves
P̂ = β̂0^linear + Σ_{j=1}^{h−1} β̂_j^linear x_j, Φ⁻¹(P̂) = β̂0^probit + Σ_{j=1}^{h−1} β̂_j^probit x_j, F_lgt⁻¹(P̂) = β̂0^logit + Σ_{j=1}^{h−1} β̂_j^logit x_j
can be measured by the multiple coefficient of determination.
To directly compare the quality of the estimations P̂ at the points x_i one should use the χ² values (the lower, the better the fit):
G_χ² = Σ_{i=1}^m n_i [p̂_i − p̂(x_i)]² / [p̂(x_i)(1 − p̂(x_i))].
Then find for the log-likelihood
ln L_G = Σ_{i=1}^m [ln C(n_i, n_i(1)) + n_i(1) ln p_i^G(β) + n_i(0) ln(1 − p_i^G(β))]
= Σ_{i=1}^m [ln C(n_i, n_i(1)) + n_i(1) ln(p_i^G(β)/(1 − p_i^G(β))) + n_i ln(1 − p_i^G(β))],
which is maximized w.r.t. the components of the parameter vector β. To maximize, compute first and second derivatives (see the Newton-Raphson sketch below). Write η_i = Σ_{j=0}^{h−1} β_j x_ij with x_i0 = 1.
Linear model: p_i^linear(β) = η_i:
ln L_id = Σ_{i=1}^m [ln C(n_i, n_i(1)) + n_i(1) ln η_i + n_i(0) ln(1 − η_i)]
∂ln L_id/∂β_l = Σ_{i=1}^m [n_i(1)/η_i − n_i(0)/(1 − η_i)] x_il
∂²ln L_id/∂β_l ∂β_v = −Σ_{i=1}^m [n_i(1)/η_i² + n_i(0)/(1 − η_i)²] x_il x_iv.
Then the ML estimator is β̂ = β̂_ML^linear = (β̂_{ML,0}^linear,…,β̂_{ML,h−1}^linear)' and
P̂ = Ĝ = β̂_{ML,0}^linear + Σ_{j=1}^{h−1} β̂_{ML,j}^linear x_j.
Probit model: p_i^probit(β) = Φ(η_i):
ln L_Φ = Σ_{i=1}^m [ln C(n_i, n_i(1)) + n_i(1) ln Φ(η_i) + n_i(0) ln(1 − Φ(η_i))],
where Φ is the distribution function of the standard normal distribution and φ its density:
∂ln L_Φ/∂β_l = Σ_{i=1}^m x_il φ(η_i) [n_i(1) − n_i Φ(η_i)] / [Φ(η_i)(1 − Φ(η_i))]
∂²ln L_Φ/∂β_l ∂β_v ≈ −Σ_{i=1}^m n_i x_il x_iv φ(η_i)² / [Φ(η_i)(1 − Φ(η_i))].
Then the ML estimator is β̂_ML^probit and P̂ = Φ(Ĝ), Ĝ = β̂_{ML,0}^probit + Σ_{j=1}^{h−1} β̂_{ML,j}^probit x_j.
Logit model: p_i^logit(β) = F_lgt(η_i) = 1/[1 + exp(−η_i)]:
ln L_Flgt = Σ_{i=1}^m [ln C(n_i, n_i(1)) + (n_i(1) − n_i) η_i − n_i ln(1 + exp(−η_i))],
where F_lgt is the logistic distribution function:
∂ln L_Flgt/∂β_l = Σ_{i=1}^m x_il [n_i(1) − n_i/(1 + exp(−η_i))]
∂²ln L_Flgt/∂β_l ∂β_v = −Σ_{i=1}^m n_i x_il x_iv exp(−η_i)/[1 + exp(−η_i)]².
Then the ML estimator is β̂_ML^logit and
P̂ = 1/[1 + exp(−(β̂_{ML,0}^logit + Σ_{j=1}^{h−1} β̂_{ML,j}^logit x_j))].
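A Newton-Raphson sketch for the logit log-likelihood, using the score and the (sign-reversed) Hessian derived above; the interface and the hypothetical data are my own:

```python
import numpy as np

def logit_ml(X, n, n1, max_iter=50, tol=1e-10):
    """Newton-Raphson for the grouped-data logit log-likelihood (sketch)."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        score = X.T @ (n1 - n * p)               # first derivatives
        W = n * p * (1.0 - p)                    # negative Hessian is X' diag(W) X
        step = np.linalg.solve(X.T @ (W[:, None] * X), score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# hypothetical grouped data: intercept column plus one regressor
X  = np.column_stack([np.ones(5), np.array([1.0, 2.0, 3.0, 4.0, 5.0])])
n  = np.array([40, 40, 40, 40, 40])
n1 = np.array([4, 10, 21, 30, 36])
print(logit_ml(X, n, n1))                        # ML estimate of beta^logit
```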
The unknown probabilities p_i are given as functions of the observation point x_i conditional on unknown parameters. These parameters are estimated by maximizing the likelihood function w.r.t. these parameters. The conditional logit model gives
p_i1 = 1/[1 + Σ_{k=2}^q exp(β_k0 + Σ_{j=1}^{h−1} β_kj x_ij)] = 1/[1 + Σ_{k=2}^q exp((1, x_i')β_k)],
p_ik = exp((1, x_i')β_k) / [1 + Σ_{k'=2}^q exp((1, x_i')β_k')], k = 2,…,q.
For every realization k of Y a different parameter vector β_k = (β_k0,…,β_k,h−1)' is allowed, and L is maximized with respect to the parameter vectors β_2,…,β_q.
With x̄(0) = (x̄1(0),…,x̄_{h−1}(0))' and x̄(1) = (x̄1(1),…,x̄_{h−1}(1))', where
x̄_j(0) = (1/n(0)) Σ_{i=1}^m n_i(0) x_ij and x̄_j(1) = (1/n(1)) Σ_{i=1}^m n_i(1) x_ij for j = 1,…,h−1,
S(0) = 1/(n(0)−1) Σ_{i=1}^m n_i(0) (x_i − x̄(0))(x_i − x̄(0))', S(1) = 1/(n(1)−1) Σ_{i=1}^m n_i(1) (x_i − x̄(1))(x_i − x̄(1))',
S = 1/(n−2) [(n(0)−1)S(0) + (n(1)−1)S(1)],
Fisher's discriminant function can be given as
h(x) = (x̄(0) − x̄(1))'S⁻¹x − ½ (x̄(0) − x̄(1))'S⁻¹(x̄(0) + x̄(1)).
The decision rule for x = x* is then ŷ = 0 if h(x*) > 0, and ŷ = 1 else.
8 KE 8 Graphical approaches
Need a quantitative data matrix; use scaling for qualitative data or MDS for distance data to obtain a quantitative data matrix.
1- and 2-dimensional:
o stem-and-leaf plots
o box plots
o scatter plots
o quantile-quantile (QQ) plots: normality, outliers
o bi-plots
3- and n-dimensional:
o profiles, polygons, radii etc.
o faces
o Andrews plots
o trees, boxes, castles according to cluster and discriminant analysis
8.1.3 Bi-Plot
The aim is to visualize rows and columns of a data matrix Y, or of the column-centered matrix Y* with y_ij* = y_ij − ȳ_.j, simultaneously. To visualize a matrix with rank > 2 in two dimensions use a rank-2 approximation Y2 for Y*. This can be obtained from the singular value decomposition: find the two largest eigenvalues of Y*'Y*, call these λ1, λ2, with corresponding eigenvectors q1, q2, and define the normalized vectors p_k = (1/√λ_k) Y* q_k. Then
Y2 = (p1, p2) diag(√λ1, √λ2) (q1, q2)'.
Now factor Y2 = HM', where H is n×2 with orthogonal columns, H = √(n−1) (p1, p2), and M is accordingly
M = 1/√(n−1) (√λ1 q1, √λ2 q2) = (M1',…,Mp')' (p×2).
Then the ith row of H represents the ith object and the jth row of M the jth characteristic. Represent the rows of H as points and the rows of M as vectors from the origin to those points. Then the Euclidean distance of the points i and i' approximates the Mahalanobis distance of the ith and i'th rows of Y*, i.e. of the ith and i'th objects. The dot product M_j'M_j' represents the covariance of the characteristics y_j and y_j', the length of the jth vector represents the standard deviation of y_j, and the cosine of the angle between the jth and j'th vectors represents the correlation between the jth and j'th characteristics.
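A sketch of the rank-2 biplot factorization via the SVD (equivalent to the eigendecomposition of Y*'Y*), in numpy:

```python
import numpy as np

def biplot_factors(Y):
    """Return H (object points) and M (characteristic vectors), Y2 = H M' (sketch)."""
    n = Y.shape[0]
    Ystar = Y - Y.mean(axis=0)                   # column-centered data
    U, s, Vt = np.linalg.svd(Ystar, full_matrices=False)
    P, Q = U[:, :2], Vt[:2].T                    # p_k, q_k for the 2 largest singular values
    H = np.sqrt(n - 1) * P                       # rows represent objects
    M = (Q * s[:2]) / np.sqrt(n - 1)             # rows represent characteristics
    return H, M
```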
Polygonal lines: same as bar charts, only the data points are plotted above the characteristics and connected with lines.
Polygons in polar coordinates: these function in the same way if all values are positive. The results are stars; the angle is 2π/p, the radius corresponds to the actual values.
The same holds for suns, only here circle and center are not needed.
Glyphs: subdivide the top 1/6 of a circle into p partitions and plot the values on rays through the subdivision points and the center, pointing outside the circle. Two plotting alternatives:
o lower 1/3 of values: length 0 (on the radius); middle 1/3: length c; top 1/3 of values: length 2c
o lengths proportional to the values
For fewer than 36 parameters keep some features constant; for fewer than 18 choose symmetrical faces or vary only one half of the face. Extremely subjective due to the choice of the characteristic-to-feature matching.
$$f_i(t) = \frac{c_{1i}}{\sqrt{2}} + c_{2i}\sin t + c_{3i}\cos t + c_{4i}\sin 2t + c_{5i}\cos 2t + \dots + c_{p-1,i}\sin\frac{p-1}{2}t + c_{pi}\cos\frac{p-1}{2}t$$
To represent Andrews plots in polar coordinates plot the function f̃_i(t) = f_i(t) + c, t ∈ [−π, π], where c ≥ |min_{t∈[−π,π], i=1,…,n} f_i(t)|.
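A sketch evaluating the Andrews function f_i(t) for one observation; the coefficient vector is hypothetical:

```python
import numpy as np

def andrews_curve(c, t):
    """Evaluate f_i(t) = c1/sqrt(2) + c2 sin t + c3 cos t + c4 sin 2t + ..."""
    f = c[0] / np.sqrt(2.0) * np.ones_like(t)
    for j in range(1, len(c)):
        k = (j + 1) // 2                      # harmonic index: 1,1,2,2,...
        f += c[j] * (np.sin(k * t) if j % 2 == 1 else np.cos(k * t))
    return f

t = np.linspace(-np.pi, np.pi, 200)
y = andrews_curve(np.array([1.0, 0.5, -0.3, 0.8]), t)   # hypothetical observation
```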
For small numbers of characteristics one can also use rectangular blocks:
o cluster the characteristics into three groups (e.g. use the level of a hierarchy with three classes),
o assign each group a corresponding dimension of the block (height, width, length),
o draw each dimension proportional to the sum of the values of the distance metrics in the corresponding group,
o finally mark the individual characteristic boundaries on the blocks. This makes them look like parcels.
Another visualization can be achieved with Kleiner-Hartigan trees, which use as basis the dendrogram of the hierarchy. Begin with the stem, i.e. the class which encompasses all objects, and draw thinner branches as the partitions become finer. To determine the angle between two branches proceed as follows:
o Assign a maximum angle φ_max (e.g. 80°) to the branching of the stem and a minimum angle φ_min (e.g. 30°) to the two-element class with the minimum heterogeneity (or maximum homogeneity) of its two constituent one-element classes.
o All other branchings are allocated values in between. Call A, B, C, … the classes of the hierarchy, where A denotes the final class with all p elements, and call the heterogeneities of the classes g_A, g_B, g_C etc. Calculate the angle
φ_X = [φ_min(ln(g_A+1) − ln(g_X+1)) + φ_max(ln(g_X+1) − ln(g_min+1))] / [ln(g_A+1) − ln(g_min+1)] for X = B, C, …,
where g_min denotes the heterogeneity of the classes that are fused first.
o An angle is divided along the vertical proportionally to the width of the branches.
o When dividing the stem the direction is reversed with each branching.
o When branches are subdivided the thicker one is chosen to go away from the trunk.
o Choose the lengths of the branches proportional to the average value of the characteristics contained in them and the width proportional to the number of characteristics in a branch.
A similar approach works for castles, also by Kleiner-Hartigan, only that the angles are 0. The height of one storey (turret) above the ground is proportional to the minimum of the characteristic values of the characteristics contained in the turret, minus some factor d times q, the minimum characteristic number contained in the class (e.g. if the class contains characteristics 3, 2, 8 then q = 2).