
8/5/2008

Trade Marks, Copyrights & Stuff


This presentation is copyright by Ray Wicks 2008.

Predictive Statistics (Trending): A Tutorial, CMG Brazil


Ray Wicks 561-236-5846 RayWicks@us.ibm.com RayWicks@yahoo.com

Many terms are trademarks of different companies and are owned by them. Some foils that appear in this presentation are not in the handout. This is to prevent you from looking ahead and spoiling my jokes and surprises.

IBM 2008


Abstract
Predictive Statistics (Trending) A Tutorial
This session reviews some of the trending techniques which can be useful in capacity planning. The basic statistical concept of regression analysis will be introduced and examined, and simple linear regression will be shown.

How Accurate Is It?


[Chart: a prediction curve over Time, starting from an initial point at t0]

Starting from an initial point of maybe dubious accuracy, we apply a growth rate (also dubious) and then recommend actions costing lots of money.


Trending CMG Brazil (c) Ray Wicks 2008


Accuracy

How Accurate Is It?

[Charts: four versions of the prediction-over-time chart from t0, with the prediction drawn with varying precision]


Accuracy is found in values that are close to the expected curve. This closeness implies an expected bound or variation in reality, so a thicker line makes sense.

At time t, is the prediction a precise point p or a fuzzy patch?


Statistical Discourse
Perceptual Structure:
[Chart: standard normal density, =NORMDIST(x,0,1,0), plotted from -4 to 4]

A Conversation

You: The answer is 42.67.
Them: I measured it and the answer is 42.663!
You: Give me a break.
Them: I just want to be exact.
You: OK, the answer is around 42.67.
Them: How far around?
You: ????

Conceptual Structure: Blah, blah, blah



Confidence Interval or How Thick is the Line?

[Chart: prediction over Time from t0, drawn as a band, with an inset of the standard normal density =NORMDIST(x,0,1,0) and the tail area z(α/2) marked]

Confidence Interval
[x̄ - 1.96·σ/√n, x̄ + 1.96·σ/√n]
[x̄ - z(α/2)·σ/√n, x̄ + z(α/2)·σ/√n]

P[μ - 2σ < X < μ + 2σ] = 0.954
P[μ - 1.96σ < X < μ + 1.96σ] = 0.95, or 95%

[L,U] is called the 100(1-α)% confidence interval. 1-α is called the level of confidence associated with [L,U].

Using a Standard Normal Probability table, 95% confidence (2 tail) is found by looking for the z score with tail area 0.025. In Excel: =CONFIDENCE(α, σ, n) returns the half-width z(α/2)·σ/√n; for example, =CONFIDENCE(0.05, 1, 100) = 0.196 (that is, 1.96·σ/√n with σ = 1, n = 100).
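The interval arithmetic above is easy to check in code. A minimal sketch in plain Python (standard library only; the sample data is invented for illustration):

```python
from statistics import NormalDist, mean, stdev

def confidence_interval(xs, alpha=0.05):
    """Return (lower, upper) for the 100*(1-alpha)% CI of the mean,
    using the normal approximation: x_bar +/- z(alpha/2) * s / sqrt(n)."""
    n = len(xs)
    x_bar = mean(xs)
    s = stdev(xs)                                # sample standard deviation
    z = NormalDist().inv_cdf(1 - alpha / 2)      # 1.96 for alpha = 0.05
    half = z * s / n ** 0.5                      # what Excel's =CONFIDENCE() returns
    return x_bar - half, x_bar + half

print(round(NormalDist().inv_cdf(0.975), 2))     # 1.96
```

The half-width is exactly the "thickness of the line" the slide describes: a band of plausible values around the estimated mean.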


Summary
Given a list of numbers X={Xi} i=1 to n

Statistics

Term                       | Formula                 | Excel                 | PS View
Count (number of items)    | n                       | =COUNT(X)             | Number of points plotted
Average                    | X̄ = Sum(X)/n            | =AVERAGE(X)           | Center of gravity
Median                     | X[ROUND DOWN 1+n*0.5]   | =MEDIAN(X)            | Middle number
Variance                   | V = Σ(Xi - X̄)²/n        | =VAR(X)               | Spread of data
Standard Deviation         | s = SQRT(V)             | =STDEV(X)             | Spread of data
Coefficient of Variation   | CV = s/X̄                |                       | Spread of data around average
Minimum                    | First in sorted list    | =MIN(X)               | Bottom of plot
Maximum                    | Last in sorted list     | =MAX(X)               | Top of plot
Range                      | [Minimum, Maximum]      |                       | Distance between top and bottom
90th percentile            | X[ROUND DOWN 1+n*0.9]   | =PERCENTILE(X,0.9)    | 10% from the top
Confidence interval        | Look in book            | =CONFIDENCE(0.05,s,n) | Expected variability of average (a thick line)

Percentile formulae assume a sorted list, low to high.

Linear Regression (for Trending)

[Chart: MIPS Used vs. Week with linear fit y = 3.0504x + 385.42, R² = 0.7881]

Obtain a useful fit of the data (y = mx + b) and then extend the values of X to obtain predicted values of Y. But remember, as Niels Bohr said: Prediction is very difficult, especially about the future.
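The statistics in the table above can be reproduced outside Excel. A sketch in plain Python (the data list is invented; note that the slide's variance formula divides by n, i.e. the population variance, whereas Excel's =VAR() divides by n-1):

```python
from statistics import mean, median, pvariance, pstdev

X = [2, 4, 4, 4, 5, 5, 7, 9]           # hypothetical sample

n = len(X)                             # =COUNT(X)
avg = mean(X)                          # =AVERAGE(X)
med = median(X)                        # =MEDIAN(X)
V = pvariance(X)                       # sum((x - avg)**2 for x in X) / n
s = pstdev(X)                          # sqrt(V)
cv = s / avg                           # coefficient of variation
lo, hi = min(X), max(X)                # bottom and top of the plot
p90 = sorted(X)[int(1 + n * 0.9) - 1]  # the slide's X[ROUND DOWN 1 + n*0.9]

print(n, avg, med, V, s, cv, lo, hi, p90)
```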


Trending Assumptions & Questions

[Charts: CPU% vs. Week; "Reality": MIPS Used vs. Week with fit y = 3.0504x + 385.42, R² = 0.7881]

The future will be like the past.
How much history is too much? You should look at era segments.
Shape and scale of the graph can be interesting.
You may need more than numbers: what about the business and technical environment?
Be smart and lazy. What questions are you answering?

Linear regression predictions assume that the future looks like the past.

Coding Implementation: The Butterfly Effect

Linear Fit for {Xi, Yi}

[Chart: points (Xi, Yi) scattered around the fitted line Ŷi = B0 + B1·Xi; e is the vertical distance from a point to the line]

Algorithm 1:
Xn+1 = s·Xn if Xn < 0.5; Xn+1 = s·(1 - Xn) otherwise.
In Excel: cell Xn+1 is =IF(Xn<0.5, S*Xn, S*(1-Xn))

Algorithm 2:
Xn+1 = s·(0.5 - |Xn - 0.5|)
In Excel: cell Xn+1 is =S*(0.5-ABS(Xn-0.5))

The two are mathematically equal.
(Ref. Chaos Under Control, section on the Butterfly Effect.)
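The two algorithms are indeed the same map (the tent map). A sketch in Python, with a hypothetical gain s = 1.99, showing both the equivalence and the butterfly effect: two starting points that differ by one billionth soon diverge completely:

```python
def tent1(x, s):
    # Algorithm 1: piecewise form
    return s * x if x < 0.5 else s * (1 - x)

def tent2(x, s):
    # Algorithm 2: closed form; mathematically equal to tent1
    return s * (0.5 - abs(x - 0.5))

s = 1.99                          # hypothetical gain; chaotic for s near 2
a, b = 0.4, 0.4 + 1e-9            # two seeds differing by 1e-9
diffs = []
for _ in range(80):
    a, b = tent1(a, s), tent1(b, s)
    diffs.append(abs(a - b))

print(max(diffs) > 0.05)          # True: the tiny difference has blown up
```

Each iteration multiplies a small difference by roughly s, so the 1e-9 gap grows exponentially until it is as large as the orbit itself.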

Goodness of Fit: R² = Σ(Ŷi - Ȳ)² / Σ(Yi - Ȳ)²

On the line would be perfect. Next to that would be a line with minimum error (e). Actually, minimum Σe² is better.


Excel Help

Searching Excel Help for "R Squared" returns: RSQ: Returns the square of the Pearson product moment correlation coefficient through data points in known_y's and known_x's. For more information, see PEARSON. The r-squared value can be interpreted as the proportion of the variance in y attributable to the variance in x.

Correlation

[Chart: scatter of DASD I/O Rate vs. CPU%]

Correlation = COV(X,Y) / (σx·σy) = σxy / (σx·σy) = E[(x - μx)(y - μy)] / (σx·σy)
Correlation ∈ [-1, 1]
=CORREL(CPU%, DASDIO) = 0.86
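The covariance formula above is easy to check directly. A sketch in plain Python, reusing the CPU% vs. units-of-work data from the regression example (the 0.86 on the slide came from the CPU%/DASD I/O series, which is not reproduced in the handout):

```python
from math import sqrt

x = [1.3, 1.4, 1.45, 1.5, 1.6]       # units of work
y = [62.3, 64.3, 70.8, 71.1, 75.8]   # CPU%

n = len(x)
mx, my = sum(x) / n, sum(y) / n
cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n   # COV(X,Y)
sx = sqrt(sum((a - mx) ** 2 for a in x) / n)               # sigma_x
sy = sqrt(sum((b - my) ** 2 for b in y) / n)               # sigma_y
r = cov / (sx * sy)                  # same value as Excel's =CORREL(x, y)

print(round(r, 4), round(r * r, 4))  # 0.9624 0.9262
```

Note that r² here equals the R² reported for the linear fit of the same data, as the Excel Help text says.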


Briefly: Correlation is not Causality

Cause -> Effect (sufficient cause)
~Effect -> ~Cause (necessary cause)

R² or CORR(C,E) may indicate a linear relationship without there being a causal connection. In cities of various sizes:
C = number of TVs is highly correlated with E = number of murders.
C = religious events is highly correlated with E = number of suicides.

Causality & Correlation

Claim: Eating Cheerios will lower your cholesterol.
Cause: Eating Cheerios. Effect: Lower Cholesterol.
Test: is the real cause an intervening variable? [Diagram: Cheerios replace Bacon & Eggs; less Bacon & Eggs lowers cholesterol]

There is a correlation between eating Cheerios and lower cholesterol, but is there a causal relationship?


Matrix Solution for Linear Fit

Solve Y = B0 + B1·X with B = (Mᵀ·M)⁻¹·Mᵀ·Y, where M is the 5x2 design matrix [1 | X]:

X     Y
1.3   62.3
1.4   64.3
1.45  70.8
1.5   71.1
1.6   75.8
Averages: X̄ = 1.45, Ȳ = 68.86

Mᵀ is 2x5; Mᵀ·M is 2x2; INV(MᵀM) is 2x2; INV(MᵀM)·Mᵀ is 2x5; INV(MᵀM)·Mᵀ·Y is 2x1.
Mᵀ·M = [[5, 7.25], [7.25, 10.563]]; (Mᵀ·M)⁻¹ = [[42.25, -29], [-29, 20]]
Result: B0 = 0.275, B1 = 47.3

In Excel, the matrix formulas are array formulas, entered with ctl-shift-enter.

Excel Solution

[Chart: CPU% vs. Units of Work with fit y = 47.3x + 0.275, R² = 0.9262]

R² = Σ(Ŷ - Ȳ)² / Σ(Y - Ȳ)² = 0.9262, computed in the worksheet as =(SUM(F3:F7)/SUM(G3:G7)) from columns of (Ŷ - Ȳ)² and (Y - Ȳ)².
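The normal-equations arithmetic on this slide can be reproduced with numpy (assumed available); same data, same B = (MᵀM)⁻¹·Mᵀ·Y:

```python
import numpy as np

X = np.array([1.3, 1.4, 1.45, 1.5, 1.6])
Y = np.array([62.3, 64.3, 70.8, 71.1, 75.8])

M = np.column_stack([np.ones_like(X), X])    # 5x2 design matrix [1 | X]
B = np.linalg.inv(M.T @ M) @ M.T @ Y         # (Mt*M)^-1 * Mt * Y

b0, b1 = B
print(round(b0, 3), round(b1, 3))            # 0.275 47.3
```

In practice np.linalg.lstsq is preferred over forming the inverse explicitly, but the explicit form mirrors the slide's matrix algebra.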


Impact of Outlier

[Chart: CPU% vs. Units of Work with an outlier; fit y = -50.8x + 149.06, R² = 0.2358]

A perfect fit is always possible

[Chart: CPU% vs. Units of Work with fit y = 58111x⁴ - 338194x³ + 736689x² - 711801x + 257442, R² = 1]

With five data points, a fourth-degree polynomial passes through every point exactly. Albeit meaningless in this case.
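With five points and five polynomial coefficients the interpolation is exact, which is all that R² = 1 is telling you. A sketch with numpy (assumed available), solving the Vandermonde system for the same five (x, y) points:

```python
import numpy as np

x = np.array([1.3, 1.4, 1.45, 1.5, 1.6])
y = np.array([62.3, 64.3, 70.8, 71.1, 75.8])

V = np.vander(x, 5)                  # columns x^4, x^3, x^2, x, 1
coef = np.linalg.solve(V, y)         # degree-4 polynomial through all 5 points

print(np.round(coef[0]))             # ~58111, the x^4 coefficient
print(np.allclose(np.polyval(coef, x), y))   # True: zero residual, R^2 = 1
```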


Confidence of Fit

[Chart: CPU% vs. Units of Work with fit y = 47.3x + 0.275, R² = 0.9262, plus lower and upper confidence bounds (LB, UB) around the line]

SAS

Analyze -> Linear Regression, then Run:

Root MSE        1.72313    R-Square  0.9262
Coeff Var       2.50236    Adj R-Sq  0.9017
Dependent Mean  68.86000

Parameter Estimates
Variable   Label      DF  Parameter Estimate  Standard Error  t Value  Pr > |t|
Intercept  Intercept  1   0.27500             11.20033        0.02     0.9820
X          X          1   47.30000            7.70606         6.14     0.0087
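The SAS numbers can be reproduced by hand. A sketch in plain Python computing Root MSE and the standard errors of the estimates from the same five data points and the fit found earlier:

```python
from math import sqrt

x = [1.3, 1.4, 1.45, 1.5, 1.6]
y = [62.3, 64.3, 70.8, 71.1, 75.8]
b0, b1 = 0.275, 47.3                 # fit from the matrix solution

n = len(x)
mx = sum(x) / n
sxx = sum((a - mx) ** 2 for a in x)              # = 0.05
sse = sum((b - (b0 + b1 * a)) ** 2 for a, b in zip(x, y))
rmse = sqrt(sse / (n - 2))                       # Root MSE
se_b1 = rmse / sqrt(sxx)                         # std error of the slope
se_b0 = rmse * sqrt(1 / n + mx ** 2 / sxx)       # std error of the intercept

print(round(rmse, 5), round(se_b1, 5), round(se_b0, 5))
# 1.72313 7.70606 11.20033
```

The slope's t value is 47.3 / 7.70606 ≈ 6.14, matching the SAS output; the intercept's tiny t value (0.02) says the data cannot distinguish the intercept from zero.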


Results: Residuals

For each Xi, plot the residual e = Yi - Ŷi.

[Chart: residuals vs. Units of Work, scattered around 0]

Look for a random distribution around 0.
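A least-squares fit with an intercept has a useful property here: its residuals always sum to (numerically) zero, so the plot should scatter around 0 with no pattern. A sketch with the same five points:

```python
x = [1.3, 1.4, 1.45, 1.5, 1.6]
y = [62.3, 64.3, 70.8, 71.1, 75.8]
b0, b1 = 0.275, 47.3                          # least-squares fit

residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print([round(e, 3) for e in residuals])
# [0.535, -2.195, 1.94, -0.125, -0.155]
print(abs(sum(residuals)) < 1e-9)             # True
```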


Interesting Case

[Chart: CPU% vs. Blocks with linear fit y = 0.0335x, R² = 0.8569]

Notice the points are below the line until Blocks > 600. Typical of DB/DC. Does it mean less efficiency as the load increases? The residuals have a pattern; that usually means a second-level effect.

Regression other than Linear

[Chart: CPU% vs. Blocks with exponential fit y = 1.234e^(0.0043x), R² = 0.9457]

An exponential fit is useful when computing compound growth.
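An exponential trend y = a·e^(bx) can be fitted with ordinary linear regression by taking logs: ln y = ln a + b·x. A sketch in plain Python using synthetic data generated from the slide's fitted curve (a = 1.234 and b = 0.0043 are read off the chart; the underlying block data is not in the handout):

```python
from math import log, exp

a_true, b_true = 1.234, 0.0043
xs = [100, 200, 300, 400, 500, 600, 700, 800]
ys = [a_true * exp(b_true * x) for x in xs]   # synthetic, noise-free

# Least squares on (x, ln y): ln y = ln a + b*x
n = len(xs)
ly = [log(v) for v in ys]
mx, mly = sum(xs) / n, sum(ly) / n
b = sum((x - mx) * (l - mly) for x, l in zip(xs, ly)) \
    / sum((x - mx) ** 2 for x in xs)
a = exp(mly - b * mx)

print(round(a, 3), round(b, 4))   # 1.234 0.0043
```

With noisy data, fitting in log space minimizes relative rather than absolute error, which is also what Excel's exponential trendline (LOGEST) does.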


PS to CS Dissonance

Perceptual to Conceptual Dissonance?

[Chart: weekly values from 05/21/04 to 11/05/04, ranging roughly 0.72 to 0.84, with linear fit y = -0.0002x + 8.2996, R² = 0.4388]
(PS: It's a line) (CS: Not a good line)

[Chart: the same data with polynomial fit y = -6E-08x³ + 0.0063x² - 241.55x + 3E+06, R² = 0.7817]
(PS: Polynomial fit looks good) (CS: Fit looks good)

???

[Chart: the same data replotted on a 0-to-0.9 axis and extended from 05/21/04 to 03/25/05, with the same linear fit y = -0.0002x + 8.2996, R² = 0.4388]
(PS: Visual variability is scale dependent) (CS: Variability is scale independent)

Extrapolated, the trend says that in 144 days the $ will be worthless.


Regression Analysis is not a Crystal Ball

[Charts: the 2004 series (about 0.74 to 0.84) with fit y = -0.0002x + 8.2996, R² = 0.4388, beside a 2007 series (about 1.28 to 1.37, 1/18/07 to 7/17/07) that the fit does not predict]

Philosophical Remark

Sensation -> Negotiation -> Context (Lights Up)

In reaching a conclusion, we negotiate between the potential perceptual structures, the potential conceptual structures, and memory events.

Model Building: Which is Best?

Stepwise procedure to find the best combination of variables:
Y = b + a1X1
Y = b + a1X1 + a2X2
Y = b + a2X2 + a3X3
Y = b + a1X1 + a2X2 + a3X3 + a4X4
Using the Hald data from Draper:

X1  X2  X3  X4   Y
 7  26   6  60   78.5
 1  29  15  52   74.3
11  56   8  20  104.3
11  31   8  47   87.6
 7  52   6  33   95.9
11  55   9  22  109.2
 3  71  17   6  102.7
 1  31  22  44   72.5
 2  54  18  22   93.1
21  47   4  26  115.9
 1  40  23  34   83.8
11  66   9  12  113.3
10  68   8  12  109.4

Stepwise Results (using the Add-In from Levine)

Table of Results for General Stepwise

X4 entered.
            df  SS           MS           F           Significance F
Regression   1  1831.89616   1831.89616   22.7985202  0.000576232
Residual    11  883.8669169  80.3515379
Total       12  2715.763077

           Coefficients   Standard Error  t Stat        P-value      Lower 95%     Upper 95%
Intercept  117.5679312    5.262206511     22.34194552   1.62424E-10  105.9858927   129.1499696
X4         -0.738161808   0.154595996     -4.774779597  0.000576232  -1.078425302  -0.397898315

X1 entered.
            df  SS           MS           F            Significance F
Regression   2  2641.000965  1320.500482  176.6269631  1.58106E-08
Residual    10  74.76211216  7.476211216
Total       12  2715.763077

           Coefficients   Standard Error  t Stat        P-value      Lower 95%     Upper 95%
Intercept  103.0973816    2.123983606     48.53963154   3.32434E-13  98.36485126   107.829912
X4         -0.613953628   0.048644552     -12.62122063  1.81489E-07  -0.722340445  -0.505566811
X1         1.439958285    0.13841664      10.40307211   1.10528E-06  1.131546793   1.748369777

No other variables could be entered into the model. Stepwise ends.
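The two models the stepwise procedure selected can be reproduced with numpy (assumed available), fitting Y ~ X4 and then Y ~ X4 + X1 on the Hald data:

```python
import numpy as np

X1 = np.array([7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10], float)
X4 = np.array([60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12], float)
Y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])

ones = np.ones_like(Y)

# Step 1: Y = b + a4*X4
M1 = np.column_stack([ones, X4])
c1, *_ = np.linalg.lstsq(M1, Y, rcond=None)
print(np.round(c1, 3))    # approx [117.568  -0.738]

# Step 2: X1 enters -> Y = b + a4*X4 + a1*X1
M2 = np.column_stack([ones, X4, X1])
c2, *_ = np.linalg.lstsq(M2, Y, rcond=None)
print(np.round(c2, 3))    # approx [103.097  -0.614  1.44]
```

The coefficients match the stepwise output above; a full stepwise loop would also compute each candidate variable's partial F (or t) statistic to decide what enters or leaves.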


Looking for I/O = F(MIPS). Don't give up too quickly.

[Chart: I/O rate vs. MIPS with the Y intercept forced to 0; fit y = 2.4545x, R² = 0.3726]

Look at the ratio in time:

[Chart: IO/MIPS ratio by hour of day, 0:00 through 23:00]
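The ratio-in-time idea is just dividing the two series interval by interval. A minimal sketch in plain Python (the hourly samples are hypothetical):

```python
# Hypothetical hourly samples: hour -> (MIPS used, I/O rate)
samples = {0: (2000, 4200), 1: (1800, 3900), 8: (3500, 9100), 14: (4200, 14200)}

ratio = {hour: io / mips for hour, (mips, io) in samples.items()}
for hour in sorted(ratio):
    print(f"{hour:02d}:00  IO/MIPS = {ratio[hour]:.2f}")
```

A stable ratio supports a simple I/O = F(MIPS) relationship; a time-of-day pattern says the relationship shifts with workload mix.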


Trending: What to Do?

[Chart: Average In & Ready over time, with 90th percentile]

Options?

[Chart: Average In & Ready, 90th percentile, with linear and exponential fits; Expon.: y = 7.2692e^(0.0042x), R² = 0.6615]


How About A Polynomial?

Y = b0 + b1X + b2X² + b3X³ + ... + bnXⁿ

[Chart: Average In & Ready, 90th percentile, with polynomial fit]

A polynomial can be made to fit about any wandering data within the bounds of the data [min, max]. Beyond the bounds, any prediction is suspect.

Time Series

A time series is a sequence of observations which are ordered in time (or space). If observations are made on some phenomenon throughout time, it is most sensible to display the data in the order in which they arose, particularly since successive observations will probably be dependent. Time series are best displayed in a scatter plot: the series value X is plotted on the vertical axis and time t on the horizontal axis. Time is called the independent variable (in this case, however, something over which you have little control). There are two kinds of time series data:
1. Continuous, where we have an observation at every instant of time, e.g. lie detectors, electrocardiograms. We denote this using observation X at time t, X(t).
2. Discrete, where we have observations at (usually regularly) spaced intervals. We denote this as Xt.
See http://www.cas.lancs.ac.uk/glossary_v1.1/tsd.html#timeseries
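The caveat above is easy to demonstrate: the exact degree-4 fit through the five CPU% points behaves wildly just beyond the data's [min, max] bounds. A sketch with numpy (assumed available):

```python
import numpy as np

x = np.array([1.3, 1.4, 1.45, 1.5, 1.6])
y = np.array([62.3, 64.3, 70.8, 71.1, 75.8])

coef = np.linalg.solve(np.vander(x, 5), y)   # perfect degree-4 fit, R^2 = 1

inside = np.polyval(coef, 1.45)   # within [1.3, 1.6]: hits the data exactly
outside = np.polyval(coef, 2.0)   # just beyond the bounds
linear = 47.3 * 2.0 + 0.275       # what the linear trend would predict

print(round(float(inside), 1))    # 70.8
print(round(float(linear), 3))    # 94.875
print(outside > 1000)             # True: the polynomial has exploded
```

Inside the data range the polynomial is perfect; at x = 2.0 it predicts thousands of CPU% while the linear trend predicts about 95.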


Bibliography

Applied Regression Analysis, Draper & Smith, Wiley. Definitive source for regression analysis. Highly technical.

Statistical Concepts and Methods, Bhattacharyya & Johnson, Wiley, 1977. This has both a discussion of meaning and the formulae.

Applied Statistics for Engineers and Scientists, Levine, Ramsey & Smidt, Prentice Hall, 2001. This has a good approach to statistics and Excel implementations. A CD comes with the book, which has some powerful Excel Add-ins.

The Art of Computer Systems Performance Analysis, Raj Jain, Wiley. I like this one. For performance analysis and capacity planning, it is thorough and complete. A very good reference. It may be hard to find.

Chaos Under Control, Peak & Frame, Freeman & Co.

http://www.itl.nist.gov/div898/handbook/pmc/pmc.htm is a good web site to explore statistics.
