Least Squares: Excel vs. by Hand


04/13/10

ExcelProject.nb

Abstract

Microsoft Excel is a popular spreadsheet program used widely in industry, academia, and education. Whether in a high-school physics classroom or in the accounting departments of large Wall Street firms, people rely on Microsoft Excel to give them accurate results. One of the most used functions of Excel is Least Squares Fitting: finding a best-fitting curve for a given set of points by minimizing the squared error. This paper explores four different ways in which a user can calculate a Least Squares Linear Fit with Excel and analyzes how the four methods perform on mostly ill-conditioned data. We will compare Excel's results to those of another popular program, Matlab.

Suppose we have 6 datasets of the form:

    X             Y
    1 + 41/2^p    1
    1 + 42/2^p    2
    1 + 43/2^p    3
    1 + 44/2^p    4
    1 + 45/2^p    5
    1 + 46/2^p    6
    1 + 47/2^p    7
    1 + 48/2^p    8

where p ∈ {25, 26, 27, 28, 51, 52}.
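Before turning to Excel, it is worth seeing how closely spaced these x-values are in ordinary double-precision arithmetic (a sketch in Python; Excel's internals differ, so this is only an analogy):

```python
# The x-values are 1 + (40+k)/2^p, k = 1..8.  Their spacing, 1/2^p, shrinks
# toward the spacing of IEEE double-precision numbers near 1.0 (machine
# epsilon, 2^-52) as p grows, which is what makes the datasets ill-conditioned.
import sys

def x_values(p):
    return [1 + (40 + k) / 2**p for k in range(1, 9)]

eps = sys.float_info.epsilon        # 2**-52, about 2.22e-16
for p in (25, 26, 27, 28, 51, 52):
    spacing = x_values(p)[1] - x_values(p)[0]
    print(p, spacing, spacing / eps)
```

For p = 51 the spacing is only two machine epsilons; the eight x-values are still distinct doubles, but barely.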

The task is seemingly simple: find a least squares linear fit for the 6 datasets. Let us try Excel and see how it fares. First inputting the data into a spreadsheet, then plotting it on X-Y scatter plots, and finally using the Excel function "Add Trendline", we get the results shown in Figures 1-6 for the least squares linear fits for the datasets:

FIGURE 1: [X-Y scatter plot of the p=25 dataset with its trendline; Excel displays the fit equation and R²]

FIGURE 2: [X-Y scatter plot of the p=26 dataset with its trendline; Excel displays the fit equation and R²]

FIGURE 3: [X-Y scatter plot of the p=27 dataset with its trendline; Excel displays the fit equation and R²]

FIGURE 4: [X-Y scatter plot of the p=28 dataset; Excel displays no fit equation or R²]

FIGURE 5: [X-Y scatter plot of the p=51 dataset; Excel displays no fit equation or R²]

FIGURE 6: [X-Y scatter plot of the p=52 dataset; Excel displays the fit equation and R² but no trendline]

Something strange is happening here. The only dataset for which Excel seems to perform well is p=25: it gives a linear fit with an R² value of 1.0. R² is known in statistics as the coefficient of determination; its value, which ranges from 0 to 1, describes the "goodness of fit" of the model. More specifically, it is the proportion of the variability in the data explained by the model over the total variability of the data. So the R² that Excel shows in Figure 1 denotes that it found a perfect linear fit. Analytically, we can confirm that this should be the case: the points are evenly spaced on both the X-axis and the Y-axis, so they all fall on one line. In fact this is true of all 6 datasets.

For the datasets p=26 and p=27, Excel obtains fits which it claims have R² values greater than 1; mathematically this is meaningless. For datasets with 28 ≤ p ≤ 51, Excel (without any warning message) refuses to give a linear fit or to display an equation or an R² value (Figures 4 and 5 show the endpoints of this interval). We can attempt to explain this behavior by saying that if Excel cannot distinguish between the x values for 28 ≤ p ≤ 51, that is, if the numbers 1 + 41/2^p and 1 + 42/2^p are equal in Excel, then it refuses to fit a linear function to the data, because a vertical line is not a mathematical function. This explanation is contradicted by the fact that for p=28, Excel can clearly distinguish between the x values (refer to the plot of Figure 4). Furthermore, for p=52, Excel no longer has any problems displaying the equation and R² of the linear fit (although it does not draw an actual line on the plot). For p>52, Excel's behavior becomes unpredictable: for some values of p it displays the equation and R² value, while for others it does not. It seems that this behavior should be added to the long list of mysterious behaviors that plague Excel.

On another interesting note, with the Trendline function a user can choose to display up to 99 decimal places in the values of slope, intercept, and R²! However, Excel can actually only display up to 15 significant decimal digits [2]. Although this bug may seem harmless, significant digits are of great importance in scientific computation (e.g. in uncertainty analysis).


"By Hand"

Unsatisfied with the results we got from the "Trendline" function in Excel, we can take another approach to calculating the linear least squares fit for our data: input our own formulas into the Excel spreadsheet. From any introductory statistics textbook [5], we find that the formulas for the slope and intercept of a linear least squares fit are:

    b2 = [ n(Σxy) - (Σx)(Σy) ] / [ n(Σx²) - (Σx)² ]

    b1 = [ (Σy)(Σx²) - (Σx)(Σxy) ] / [ n(Σx²) - (Σx)² ]

where b2 and b1 are the slope and y-intercept respectively, n is the number of data points, and x and y are the data points.

We implement these formulas in steps as follows:

(1) Calculate intermediate products (e.g. "xy", "x²")

(2) Calculate intermediate sums (e.g. "Σx", "Σx²", "Σy", "Σxy")

(3) Calculate the numerators "n(Σxy) - (Σx)(Σy)" and "(Σy)(Σx²) - (Σx)(Σxy)"

(4) Calculate the denominator "n(Σx²) - (Σx)²"

(5) Calculate b1 and b2 using (3) and (4)

Having calculated b2 and b1, we can calculate R² using the following formula:

    R² = 1 - SSerr/SStot

    SSerr = Σ_i (y_i - f_i)²

    SStot = Σ_i (y_i - ȳ)²

where ȳ is the mean of the observed data (all datapoints y), the f_i are the values predicted by the linear fit, i.e. f_i = b2·x_i + b1, and the index i runs over all data points.

The formulas for R² are similarly implemented in steps.
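These steps can be mirrored outside Excel; here is a minimal Python sketch of the same textbook formulas (function names are our own, chosen for illustration):

```python
# Textbook linear least squares, implemented exactly as in the spreadsheet
# steps above.  Checked on a well-conditioned example, y = 2x + 1.
def linear_fit(xs, ys):
    n = len(xs)
    sx  = sum(xs)
    sy  = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    denom = n * sxx - sx**2             # the denominator Excel computes as 0
    b2 = (n * sxy - sx * sy) / denom    # slope
    b1 = (sy * sxx - sx * sxy) / denom  # intercept
    return b2, b1

def r_squared(xs, ys, b2, b1):
    ybar  = sum(ys) / len(ys)
    fs    = [b2 * x + b1 for x in xs]
    sserr = sum((y - f) ** 2 for y, f in zip(ys, fs))
    sstot = sum((y - ybar) ** 2 for y in ys)
    return 1 - sserr / sstot

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]   # exactly y = 2x + 1
b2, b1 = linear_fit(xs, ys)
print(b2, b1, r_squared(xs, ys, b2, b1))
```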

Here are the results for the 6 datasets:

Dataset (p=)    b2            b1             R²
25              33 554 432    -33 554 472    1
26              #DIV/0!       #DIV/0!        #DIV/0!
27              #DIV/0!       #DIV/0!        #DIV/0!
28              #DIV/0!       #DIV/0!        #DIV/0!
51              #DIV/0!       #DIV/0!        #DIV/0!
52              #DIV/0!       #DIV/0!        #DIV/0!

Our fit performs even worse than Excel's native "Trendline" function! The results for the p=25 dataset are the same as those produced by Trendline (refer to Figure 1); however, for every other dataset we get a "Division by zero" error. The division by 0 occurs when calculating the slope and intercept: the denominator n(Σx²) - (Σx)² is calculated to be 0 for the other 5 datasets. Where is


our error? Can we fix this?
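Before looking for the answer, we can reproduce the cancellation itself. In exact arithmetic n(Σx²) - (Σx)² = n·Σ(x - x̄)² = 336/4^p for our datasets; the sketch below (Python doubles, which behave similarly though not identically to Excel) shows the computed denominator losing essentially all of its significant digits by p=28:

```python
# Computing the denominator n*(Σx²) - (Σx)² the way the spreadsheet does,
# and comparing with the analytically exact value 336/4^p.
def denominator(p):
    xs = [1 + (40 + k) / 2**p for k in range(1, 9)]
    n, sx = len(xs), sum(xs)
    sxx = sum(x * x for x in xs)
    return n * sxx - sx**2          # catastrophic cancellation happens here

for p in (25, 26, 27, 28):
    exact = 336 / 4**p
    rel_err = abs(denominator(p) - exact) / exact
    print(p, denominator(p), exact, rel_err)
```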

It turns out that the error is not our own but Excel's! Prof. Velvel Kahan found that for some computations in Excel, an extra set of parentheses around the entire expression (which analytically does not change its value) changed the result of the computation [2]. Can a few parentheses really change the results of our fit? After placing an extra set of parentheses around the formula of each of the steps taken above to calculate b2, b1, and R², the results were the following:

Dataset (p=)    b2                    b1                     R²
25              33 554 432.0000000    -33 554 472.0000000    1.000000000000000
26              70 464 307.2000000    -70 464 349.6000000    0.991666667198851
27              176 160 768.000000    -176 160 824.000000    0.0673363095238095
28              #DIV/0!               #DIV/0!                #DIV/0!
51              #DIV/0!               #DIV/0!                #DIV/0!
52              0.000000000000000     8.00000000000000       -2.33333333333333

Every slope and intercept in the table above matches those produced by the Trendline function! Datasets in the interval between p=28 and p=51 still produce the #DIV/0! error, but those are exactly the datasets for which the Trendline could not compute the linear fit. Notice however that the R² values differ for the datasets p=26, 27, 52. This is because Excel uses another, less general but (in the case of simple linear regression) equivalent formula to calculate R², that is:

    R² = SSreg/SStot

where SSreg = Σ_i (f_i - f̄)² and f̄ is the mean of the values predicted by the model. Implementing this formula in the spreadsheet, we reproduce exactly those values for R² that the Trendline function produced. That two equivalent forms of R² produce drastically different results hints at a large numerical instability of our algorithms and ill-conditioning of our data.

Had it not been for Kahan's paper, such a bug would never have been considered, and hence discovered, in our implementation of the linear fit. The problem that the linear fit fails in the interval 28 ≤ p ≤ 51 still persists in our implementation; this could be a limitation of our method in conjunction with finite-precision arithmetic.
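The difference between the two formulas is easy to demonstrate: they agree only when f is the true least-squares fit. A quick Python check using the (clearly wrong) p=52 fit from the table above, slope 0 and intercept 8, on y = 1..8:

```python
# The two "equivalent" R² formulas diverge for a line that is not the true
# least-squares fit.  Here f is the constant line y = 0*x + 8.
ys = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
fs = [8.0] * 8                      # predictions of the line y = 0*x + 8
ybar = sum(ys) / len(ys)            # 4.5
fbar = sum(fs) / len(fs)            # 8.0

sserr = sum((y - f) ** 2 for y, f in zip(ys, fs))   # 140
sstot = sum((y - ybar) ** 2 for y in ys)            # 42
ssreg = sum((f - fbar) ** 2 for f in fs)            # 0

r2_general = 1 - sserr / sstot      # -2.3333..., as in our table for p=52
r2_regression = ssreg / sstot       # 0.0
print(r2_general, r2_regression)
```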


Let us reformulate the problem of a linear least squares fit in matrix notation. Consider the 8 x 2 matrix A:

        [ 1   1 + 41/2^p ]
        [ 1   1 + 42/2^p ]
        [ 1   1 + 43/2^p ]
    A = [ 1   1 + 44/2^p ]
        [ 1   1 + 45/2^p ]
        [ 1   1 + 46/2^p ]
        [ 1   1 + 47/2^p ]
        [ 1   1 + 48/2^p ]

the 2 x 1 vector x:

    x = ( b1 )
        ( b2 )

and the 8 x 1 vector b:

    b = ( 1, 2, 3, 4, 5, 6, 7, 8 )ᵀ

Our task is to find the vector x that minimizes the squared Euclidean norm of the residual r = b - Ax, that is: min over x of ||b - Ax||₂². We can extend this to the general case for A of size m x n (m ≥ n), x of size n x 1, and b of size m x 1.

The most straightforward approach to solving the least squares problem is called the method of Normal Equations. The derivation for the method is as follows. We can define the residual as a vector function of x,

    r(x) = b - Ax

We are trying to minimize the squared Euclidean norm of the residual, or:

    E(x) = ||b - Ax||₂² = Σ_{i=1..m} r_i(x)²


To minimize E(x) we need to find an x such that the gradient of E(x) is zero, that is, the partial derivative with respect to each x_j is zero:

    ∂E(x)/∂x_j = 0 = 2 Σ_{i=1..m} r_i · (∂r_i/∂x_j)

Since r_i(x) = b_i - Σ_{k=1..n} A_ik x_k, we have ∂r_i/∂x_j = -A_ij, and so:

    ∂E(x)/∂x_j = -2 Σ_{i=1..m} A_ij (b_i - Σ_{k=1..n} A_ik x_k)
               = -2 Σ_{i=1..m} A_ij b_i + 2 Σ_{i=1..m} Σ_{k=1..n} A_ij A_ik x_k = 0

So we just have to solve the following expression:

    Σ_{i=1..m} Σ_{k=1..n} A_ij A_ik x_k = Σ_{i=1..m} A_ij b_i    (1)

or, in matrix form:

    (AᵀA) x = Aᵀb    (2)

These are the normal equations.

The equations for the slope (b2) and y-intercept (b1) of the linear fit that we used in the previous section are actually derived directly from the Normal Equations, by expanding equation (1) above.

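As a sanity check on a well-conditioned problem, the normal equations can be solved directly; a NumPy sketch (the data here, y = 2x + 1, is our own toy example):

```python
# Solve (AᵀA) x = Aᵀb, i.e. equation (2), and check the 2-norm conditioning
# relation cond(AᵀA) = cond(A)² on a small, well-conditioned example.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([3.0, 5.0, 7.0, 9.0])      # exactly y = 2x + 1

x = np.linalg.solve(A.T @ A, A.T @ b)   # x = (b1, b2)

cond_A = np.linalg.cond(A)
cond_AtA = np.linalg.cond(A.T @ A)
print(x, cond_A, cond_AtA)
```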

It is fairly straightforward to show that the method of Normal Equations is numerically unstable. Suppose that the matrix A (also called the Vandermonde matrix) is ill conditioned, that is:

    cond(A) >> 1

To solve the linear least squares problem using the method of normal equations we have to solve the system:

    (AᵀA) x = Aᵀb

and we can show that:

    cond(AᵀA) = cond(A)²

and so:

    cond(AᵀA) >> cond(A) >> 1

Clearly, there is large growth in the algorithm. The method of Normal Equations is known to be numerically unstable and to perform poorly on ill-conditioned problems. Perhaps this is the reason why the Excel implementation of the linear fit fails in the interval 28 ≤ p ≤ 51.

In the discussion above we referred to the condition number of the matrix A, cond(A). The condition number is defined as:

    cond(A) = ||A|| · ||A⁻¹||

but A is an m x n matrix with m ≥ n, that is, A is in general rectangular. If A is non-square, it is non-invertible. How then do we calculate the condition number of A?

We introduce the concept of a generalized inverse, more specifically the Moore-Penrose pseudoinverse A⁺. For a general m x n matrix A, A⁺ is an n x m matrix with the following properties [1]:

(1) A A⁺ A = A
(2) A⁺ A A⁺ = A⁺
(3) (A A⁺)* = A A⁺
(4) (A⁺ A)* = A⁺ A

where * denotes the conjugate transpose.

For general m x n matrices A we define the condition number to be:

    cond(A) = ||A|| · ||A⁺||

Using this definition we can find how ill conditioned each of our 6 datasets is:


Dataset Matrix Ap    cond(Ap)
A25                  2.928874824058564e+07
A26                  5.857745763834818e+07
A27                  1.171548756109411e+08
A28                  2.343097090872832e+08
A51                  1.925820207179830e+15
A52                  3.404401319607318e+15

It is interesting to note that the condition number of the datasets seems to follow the general trend:

    cond(A_{p+1}) ≈ 2 · cond(A_p)

Our Excel implemented solution stops giving meaningful results at around p=26, and breaks down completely at p>27. We must remember, however, that since we are essentially using the method of normal equations to solve the least squares problem, our condition number is roughly squared, and so we can say our solution breaks down at about:

    cond(A26ᵀ A26) = 3.431318543372478e+15 ≈ cond(A26)²
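The table can be reproduced with NumPy's `cond` routine (which, like Matlab's, uses the 2-norm by default); exact magnitudes may differ slightly from the values above:

```python
# Condition numbers of the design matrices A_p.  The x-values are exactly
# representable doubles for these p, so we are measuring the data itself.
import numpy as np

def A_matrix(p):
    return np.array([[1.0, 1 + (40 + k) / 2**p] for k in range(1, 9)])

conds = {p: np.linalg.cond(A_matrix(p)) for p in (25, 26, 27, 28, 51, 52)}
for p, c in sorted(conds.items()):
    print(p, c)
```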

QR Decomposition

Since our datasets are substantially ill-conditioned, it would be wise to use a more numerically stable algorithm to solve the linear least squares problem. One such algorithm is based on the QR Decomposition (or factorization). With QR decomposition, we can factor a general m x n matrix A into a product of an m x m orthogonal matrix Q and an m x n matrix R partitioned into two parts: an n x n upper triangular block Rn and an (m-n) x n block of zeros. We can express that mathematically as:

    A = Q [ Rn ]
          [ 0  ]

Using this factorization we can derive a solution to the linear least squares problem. The residual, as previously defined, is:

    r = b - Ax

We can multiply the expression above by Qᵀ to get:

    Qᵀr = Qᵀb - QᵀAx = Qᵀb - (QᵀQ)Rx = Qᵀb - Rx = [ (Qᵀb)n - Rn x ] = ( c1 )
                                                  [ (Qᵀb)m-n      ]   ( c2 )

Since Q is orthogonal, multiplication by Q does not change the norm, so:

    ||r||₂² = rᵀr = rᵀQQᵀr = ( c1ᵀ c2ᵀ ) ( c1 ) = ||c1||₂² + ||c2||₂²
                                         ( c2 )


Since the value of x does not affect the value of c2, to minimize ||r||₂² we must choose an x that minimizes c1, that is, makes it equal to zero. Therefore the equation governing the solution of the linear least squares problem becomes:

    (Qᵀb)n - Rn x = 0

or:

    Rn x = (Qᵀb)n

To solve the linear least squares problem, we must solve the equation above.

The QR factorization is accomplished via orthogonal transformations which, by definition, preserve Euclidean norms. Therefore, we expect the algorithm to be more numerically stable than the method of Normal Equations, because there is no growth in the QR algorithm, that is:

    cond(A) = cond(Rn)
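In NumPy this QR-based solve is a few lines (a sketch on the same toy data, y = 2x + 1):

```python
# Least squares via QR: factor A = QR (reduced form), then solve
# R x = Qᵀb for the coefficients.
import numpy as np

A = np.array([[1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([3.0, 5.0, 7.0, 9.0])   # exactly y = 2x + 1

Q, R = np.linalg.qr(A)               # Q: 4x2 with orthonormal columns, R: 2x2
x = np.linalg.solve(R, Q.T @ b)      # R is upper triangular

print(x, b - A @ x)
```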

LINEST

Starting with the 2003 version, the developers of Excel implemented a least squares algorithm using QR decomposition in a function called "LINEST" (before Excel 2003, LINEST used the method of Normal Equations). Instead of the R² value as a measure of the "goodness of fit", we report the squared Euclidean norm of the residual r = b - Ax, that is, the very quantity the least squares solution is trying to minimize. Using LINEST we get the following results for our 6 datasets:

Dataset (p=)    b2                   b1                 ||r||₂²
25              0.000000000000000    4.50000000000000   42.000000000000000
26              0.000000000000000    4.50000000000000   42.000000000000000
27              0.000000000000000    4.50000000000000   42.000000000000000
28              0.000000000000000    4.50000000000000   42.000000000000000
51              0.000000000000000    4.50000000000000   42.000000000000000
52              0.000000000000000    4.50000000000000   42.000000000000000

An algorithm that is supposed to be more numerically stable and to perform better on ill-conditioned problems gives the same bad linear fit for all 6 of our datasets! What is wrong here? Is it the QR algorithm that is causing the problem, or Excel's implementation?

To compare, we can use Matlab's implementation of QR decomposition to find the linear least squares fits for our datasets. Here are the results produced by Matlab's implementation of linear least squares with QR decomposition:


Dataset (p=)    b2                     b1                      ||r||₂²
25              3.355443200800436e7    -3.355447200800437e7    3.3307e-16
26              6.710886380847121e7    -6.710890380847107e7    1.998401444325282e-15
27
28              2.684354541191144e8    -2.684354941191140e8    7.105427357601003e-15
51              4.499999999999912      0                       41.999999999999844
52              4.499999999999957      0                       41.999999999999908

The Matlab implemented algorithm performs very well: it gives good linear fits up to p=51. On the other hand, Excel's implementation gives meaningful fits only up to p=23. Furthermore, the fits that it gives for p>23 have a larger residual than any fit given by Matlab. So the developers of Excel implemented a more numerically stable QR algorithm to perform better on ill-conditioned data, but implemented it in such a way that it performs worse than the most naive algorithms (even Excel's own Trendline).
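An analogous experiment can be run with NumPy's least squares solver (which uses an SVD-based LAPACK driver rather than QR, so the rank-deficient cases p ≥ 51 need not match Matlab's QR behavior):

```python
# The p = 25 dataset is exactly linear: y = 2^25 * x - (2^25 + 40).
import numpy as np

def dataset(p):
    A = np.array([[1.0, 1 + (40 + k) / 2**p] for k in range(1, 9)])
    b = np.arange(1.0, 9.0)
    return A, b

A, b = dataset(25)
x, res, rank, sv = np.linalg.lstsq(A, b, rcond=None)
print(x, rank)    # intercept near -(2^25 + 40), slope near 2^25
```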

Rank Deficiency

When using QR to solve linear least squares on datasets with p ≥ 51, Matlab warns the user with the following error: "Warning: Rank deficient, rank = 1". A matrix M of order n is rank deficient if rank(M) < n, that is, if it has fewer than n linearly independent columns. Indeed, for p ≥ 51 Matlab considers the matrix A

        [ 1   1 + 41/2^p ]
        [ 1   1 + 42/2^p ]
        [ 1   1 + 43/2^p ]
    A = [ 1   1 + 44/2^p ]
        [ 1   1 + 45/2^p ]
        [ 1   1 + 46/2^p ]
        [ 1   1 + 47/2^p ]
        [ 1   1 + 48/2^p ]

to have at most 1 linearly independent column (the operation rank() on the matrix returns 1). It's interesting to note that for square matrices the condition number is a measure of how close the matrix is to being singular, while for general rectangular matrices it is a measure of how close the matrix is to being rank deficient. According to the Matlab documentation, for rank deficient matrices M, linear least squares implemented with QR decomposition no longer returns the minimal length solution, as it is bound by the semantics of QR factorization to return a solution with rank(M) non-zero values.

Meanwhile, the method of Normal Equations breaks down completely for rank deficient matrices A. The solution for x from the


Normal Equations is:

    x = (AᵀA)⁻¹ Aᵀb

For a rank deficient matrix A, however, AᵀA is singular, so the inverse (AᵀA)⁻¹ does not exist. This is the reason that our implementations of the method of Normal Equations for the linear least squares problem begin to break down at high condition numbers: the matrix A becomes (numerically) rank deficient. Is there any way to calculate a linear least squares solution for rank deficient matrices?

Yes! The way to calculate a linear least squares solution for rank deficient matrices is by using the Singular Value Decomposition. Before getting into the details of the algorithm, let's first look at the motivation.

Notice that from the normal equations (eq. 2 above) we can find an expression for the solution x:

    x = (AᵀA)⁻¹ Aᵀb

(AᵀA)⁻¹Aᵀ is actually A⁺, the pseudoinverse of A: it satisfies each of the four properties listed above. (Note: since in our case we are working over the reals, we can take M* = Mᵀ.)

(1) A A⁺ A = A (AᵀA)⁻¹ Aᵀ A = A (AᵀA)⁻¹ (AᵀA) = A. QED

(2) A⁺ A A⁺ = (AᵀA)⁻¹ Aᵀ A (AᵀA)⁻¹ Aᵀ = (AᵀA)⁻¹ (AᵀA) (AᵀA)⁻¹ Aᵀ = (AᵀA)⁻¹ Aᵀ = A⁺. QED

(3) (A A⁺)ᵀ = (A (AᵀA)⁻¹ Aᵀ)ᵀ = A ((AᵀA)⁻¹)ᵀ Aᵀ = A ((AᵀA)ᵀ)⁻¹ Aᵀ = A (AᵀA)⁻¹ Aᵀ = A A⁺. QED

(4) (A⁺ A)ᵀ = ((AᵀA)⁻¹ (AᵀA))ᵀ = Iᵀ = I = (AᵀA)⁻¹ (AᵀA) = A⁺ A. QED

In a sense, the problem of linear least squares is reduced to finding the pseudoinverse of the Vandermonde matrix A.

With the Singular Value Decomposition, we can factor a matrix A as follows:

    A = U S Vᵀ, where S = diag(s1, s2, ..., sn)

The si satisfy s1 ≥ s2 ≥ ... ≥ sn ≥ 0 and are called the singular values of A. The columns of U are called the left singular vectors, and the columns of V are called the right singular vectors. Using the SVD, we can calculate the pseudoinverse A⁺ by the relation:

    A⁺ = V S⁺ Uᵀ, where S⁺ = diag(1/s1, 1/s2, ..., 1/sn)

What makes the SVD so powerful is that it allows us to manually "fiddle" with the singular values. For instance, if a matrix is rank deficient, at least one of its singular values is zero. Hence, when calculating the pseudoinverse, we set the corresponding entry of S⁺ to zero. By "manually" changing the singular value, we avoid the erroneous infinite result we would have gotten from the division by zero. The SVD is powerful for nearly rank deficient matrices as well: we can set a certain threshold value and ensure that if a singular value is smaller than the threshold, we set it to zero (this is the way the algorithm is implemented in Matlab). By setting small singular values to zero, we are essentially making the matrix less ill-conditioned (one of the definitions of the condition number is s1/sn, where s1 and sn are the largest and smallest non-zero singular values respectively).
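The thresholding idea is a few lines of NumPy (our own `pinv_svd` sketch, with an illustrative tolerance of 1e-10):

```python
# Building the pseudoinverse from the SVD with a singular-value threshold,
# as described above.  A deliberately rank-deficient matrix (two identical
# columns) shows why the thresholding matters.
import numpy as np

def pinv_svd(A, tol=1e-10):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_plus = np.array([1 / x if x > tol else 0.0 for x in s])  # "fiddle" step
    return Vt.T @ np.diag(s_plus) @ U.T

A = np.ones((3, 2))              # rank 1: the two columns are identical
b = np.array([3.0, 3.0, 3.0])
x = pinv_svd(A) @ b              # minimal-norm solution of x1 + x2 = 3
print(x)
```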

Using Matlab's pinv() routine, we find the least squares linear fit for our datasets:

Dataset Matrix Ap    b2                       b1                        ||r||2^2
A25                  3.355443195571761e7      -3.355447195571753e7      2.164934898019056e-15
A26                  6.710886391143523e7      -6.710890391143516e7      1.776356839400251e-15
A27                  1.342177268800614e8      -1.342177668800610e8      5.240252676230738e-14
A28                  2.684354499888867e8      -2.684354899888856e8      7.105427357601003e-14
A51                  2.249999999999956        2.250000000000000         41.999999999999080
A52                  2.249999999999977        2.249999999999999         41.999999999999571
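The Matlab computation above can be sketched in NumPy, whose pinv() plays the same role (a rough re-creation of the experiment, not our actual code; the function name pinv_fit is ours):

```python
import numpy as np

def pinv_fit(p):
    """Fit y = b1 + b2*x to the dataset x_i = 1 + (40+i)/2**p, y_i = i."""
    i = np.arange(1, 9, dtype=float)
    x = 1.0 + (40.0 + i) / 2.0**p
    A = np.column_stack([np.ones(8), x])   # the matrix A_p
    coef = np.linalg.pinv(A) @ i           # minimum-norm least squares solution
    resid = np.linalg.norm(A @ coef - i) ** 2
    return coef, resid

# For p = 25 the exact-arithmetic answer is b2 = 2**25, b1 = -(2**25 + 40),
# with an essentially zero residual.
(b1, b2), r2 = pinv_fit(25)
assert abs(b2 - 2**25) < 1.0 and r2 < 1e-10
```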

Unfortunately, unlike any serious statistical or mathematical software, Excel does not have a built-in routine to calculate the SVD or the pseudoinverse. We were, however, able to find an open-source macro called Biplot [3], which has the capability of calculating the SVD. Using Excel with the Biplot macro, we find roughly the same (not very good) fit for each of our 6 datasets:


Dataset Matrix Ap       b2    b1    ||r||2^2
A25,26,27,28,51,52      [table values missing from the extracted text]

Conclusion

In the course of this paper, we have explored 4 different methods by which to calculate the Least Squares Linear Fit in Excel. Using Excel's native Trendline function, we found meaningful results only in datasets with p = 25 (or cond(A) = 2.928874824058564e+07). We then implemented commonly used statistical formulas for the least squares linear fit in the spreadsheet, and essentially reproduced the results of Excel's Trendline function. We confirmed that both of the fits use the method of Normal Equations, which we showed was numerically unstable (it exhibits large error growth), and hence does not perform well on ill-conditioned data.
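The instability is easy to observe numerically (a sketch with NumPy calls standing in for Excel's Trendline and for a stable solver; p = 25 is one of our datasets):

```python
import numpy as np

# Normal Equations vs. an orthogonalization-based solver on the p = 25 dataset.
# Since cond(A^T A) = cond(A)^2, the Normal Equations lose roughly twice as
# many significant digits as QR/SVD-based least squares.
i = np.arange(1, 9, dtype=float)
x = 1.0 + (40.0 + i) / 2.0**25
A = np.column_stack([np.ones(8), x])

coef_ne = np.linalg.solve(A.T @ A, A.T @ i)      # Normal Equations
coef_ls, *_ = np.linalg.lstsq(A, i, rcond=None)  # stable least squares

r_ne = np.linalg.norm(A @ coef_ne - i)
r_ls = np.linalg.norm(A @ coef_ls - i)
assert r_ls < 1e-6   # the stable solver essentially interpolates the data
```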

We went on to discuss a more numerically stable algorithm, which calculates the Least Squares Linear Fit using QR Decomposition. Using Matlab, we were able to get reasonably good fits for datasets with p < 51 (or cond(A) < 1.925820207179830e+15), at which point Matlab considered our matrix A to be rank deficient. Unfortunately, Excel's built-in LINEST function, which supposedly uses QR decomposition to give a linear fit, does not fare as well. LINEST gives good fits for datasets with p ≤ 23. For p > 23, LINEST returns the same (very poor) fit. LINEST performs even worse than the first algorithms we discussed (e.g. Trendline), which are just naive implementations of the method of Normal Equations.

Finally, we discussed how to solve the Least Squares Linear Fit for the most ill-conditioned problems (those that are, or nearly are, rank deficient) using the concept of the pseudoinverse and the Singular Value Decomposition. Using Matlab, we were able to use the SVD to find reasonable linear fits for our most ill-conditioned datasets with p = 51, 52 (cond(A) ≈ 10^15). Using the method of SVD, we find a linear fit with a slightly smaller (but comparable) residual than QR decomposition. Since Excel does not implement a native routine to calculate the SVD, we found an open-source macro which did the job. However, just as with LINEST, the macro only gives meaningful fits on datasets with p < 25. For p > 25, the macro returns roughly the same linear fit for all datasets, with a residual that is greater than the largest residual returned by Matlab's QR and SVD algorithms for any of the datasets.

It is hard to say why Excel performs so poorly without delving deep into the source code, to which we of course do not have access. Even in the course of this project we discovered at least two substantial bugs in the software: one has to do with the fact that Excel displays up to 99 decimal digits in the Trendline function (30 digits elsewhere), while it really only has 15 significant decimal digits of precision; the other, more serious, bug has to do with Excel evaluating two mathematically equivalent expressions differently because of an extra set of parentheses surrounding the expression. This second bug had caused our code to fail, producing a "#DIV/0!" error for all datasets with p > 25. Without being aware of this bug (the only place we found a mention of it is in Prof. Kahan's paper [2]), the code would have been virtually undebuggable. Perhaps all of these bugs stem from the fact that Excel tries to make its arithmetic seem decimal rather than binary. One direct side effect of this is that Excel can only use 15 significant digits of precision as opposed to 17. Perhaps developers (both of Excel and not) have caught on to the fact that Excel does not perform well with floating point arithmetic (especially with very ill-conditioned data), and hence have written their algorithms to not even execute on ill-conditioned data, to avoid unpredictable results (this could be the case with LINEST and the open-source macro previously discussed). In any case, one thing is certain: unless your data is very well conditioned, you should not use Excel to find a Least Squares Linear Fit.


Bibliography

(1) Burdick, Prof. Joel. "The Moore-Penrose Pseudo Inverse." Web. <http://robotics.caltech.edu/~jwb/courses/ME115/handouts/pseudo.pdf>.

(2) Kahan, William. "How Futile Are Mindless Assessments of Roundoff in Floating-Point Computation?" Web.

<http://www.cs.berkeley.edu/~wkahan/Mindless.pdf>.

(3) Lipkovich, Ilya, and Eric P. Smith. "Biplot and Singular Value Decomposition Macros for Excel." Virginia Tech Department

of Statistics. Web. <http://filebox.vt.edu/artsci/stats/vining/keying/biplot.doc>.

(4) Markovsky, Prof. Ivan. "Least Squares and Singular Value Decomposition." University of Southampton. Web.

<http://users.ecs.soton.ac.uk/im/bari08/svd.pdf>.

(5) Taylor, John R. An Introduction to Error Analysis: the Study of Uncertainties in Physical Measurements. Sausalito, Calif.:

University Science, 1997. Print.
