
MONEYBALL:

FACT OR FICTION?

Statistics & Data Analysis


Data Analysis Project
Professor Jeffrey Simonoff
December 9, 2003
I. Introduction

Despite a payroll smaller than all but four other major league teams, the Oakland

Athletics (commonly referred to as the A’s) won the fourth most games in Major League

Baseball in 2003. This year was no fluke, however, as the A’s have now made the playoffs

in four consecutive years while teams like the New York Mets and Texas Rangers have

generally floundered while spending over $50 million more on their players. Recently,

thanks to the publication of the best-selling book Moneyball, much has been made of the

methods of the team’s general manager, who focuses on new and non-traditional statistics

when evaluating talent rather than on traditional ones and scouting reports. In particular,

Billy Beane, the GM, focuses on a player’s on-base percentage (OBP) and slugging

percentage (SLG) when measuring offensive capability rather than on batting average

(AVG), which has always been baseball’s most famous and widely published statistic. In

addition, Beane eschews the practice of sacrifice bunts (SH) and stolen bases (SB) as he

believes they actually decrease the chances of scoring runs (R), rather than increase them

as is commonly believed. While there are certain circumstances where he feels these last

two moves are appropriate, as a general rule he avoids attempting to sacrifice bunt or steal

bases, and the A’s during Beane’s tenure (1998-present) are known to be near the bottom of

the league in these categories1.

While the A’s also have unconventional ways of measuring defense and a pitcher’s

capability (and many fairly argue that the A’s pitching has more to do with their recent

success than their offensive makeup), the offensive side has become the most controversial

as it clearly conflicts with the traditional way of looking at things.

1
After collecting my data, I checked to make sure that this was indeed the case. Since 2000, the A’s have
never been higher than 12th (out of 14 teams) in sacrifice bunts or stolen bases. Prior to 2000, the evidence is
not as clear-cut. However, it is important to remember that since Beane inherited a team with a manager and
players used to “playing by the book”, it took him a few years to fully implement his system throughout the
organization (a fact that is well detailed in Moneyball). Given the weight of the evidence during the team’s
most recent years, I feel Beane has successfully implemented his system.

Based on my love of baseball and interest in the subject matter, I have decided to

examine Billy Beane’s methods for my data analysis project. I intend to focus on offense

from a team perspective. Since the sole purpose of a team offense is to score as many runs

as possible, my target variable will be team runs scored per game. I then will try to predict

this variable based on the following statistics: OBP, SLG, AVG, SH (per game) and SB

(per game).2 Since my instincts agreed with many of the ideas in Moneyball, I took many

of its statistical findings at face value. However, I will now examine for myself whether

OBP and SLG are more highly correlated with runs scored than the traditional AVG and

whether sacrifice bunting and stealing bases are indeed poor strategies.

II. Thinking about the Data

A question that arose before collecting the data was how far back in time to

examine. Since statistics are available dating back to the beginning of the 20th century, I

had many potential observations to examine. However, to make the project less unwieldy, I

took data from American League teams from the past eight seasons (1996-2003). I had

decided to exclude National League teams since, without the presence of the designated

hitter, the sacrifice bunt would (and should) be used more frequently when the pitcher is

batting. Therefore, using National League teams might provide an inaccurate answer for

the correlation between sacrifice bunts and team runs. Considering there have been 14

American League teams since 1993 (one team switched leagues in 1998, but was replaced

with an expansion team), this approach will provide 112 observations (14 teams × 8

seasons).

2
Depending on the results of this regression model, I may run a separate regression model that replaces OBP
and SLG with OPS, another common statistic that is the sum of OBP and SLG. If the two models have
similar predictive power, then this would indicate that OBP and SLG are about equally important when
determining runs scored per game. However, I do not believe that this will be the case.

III. Data Collection

While many sports sites can provide the data I desire, Major League Baseball’s

official website (www.mlb.com) had perhaps the easiest to use statistical interface where I

was able to obtain all the statistics I desired for my analysis. In fact, I gathered much more

data than just team totals for R, OBP, SLG, AVG, SH, SB and games played (to determine

the per game numbers for the cumulative statistics). I realized when collecting the data that

other statistics could provide some interesting side analyses. While they may not factor

into my completed project, for my own curiosity I would like to see how well some other

offensive statistics predict runs scored.

I am fortunate that I had no real problems collecting the data. The only problem I

had, albeit a minor one, was that there was no easy way to download each year’s data. A

simple “copy and paste” technique created some formatting errors, but these were easily

corrected.

IV. First Look at the Data

I will start by analyzing the descriptive statistics for each of the variables. Since

OBP, SLG and AVG are all measured in percentages, I have decided to multiply each

observation by 1000. While this will have no effect on the regression analysis, it will make

the interpretation of the coefficients a bit easier to understand, especially for baseball fans

who are familiar with what the statistics represent. Having said that, let’s look at our data:

Descriptive Statistics:

Variable        N      Mean    Median    TrMean     StDev   SE Mean
R per G       112    5.0437    5.0216    5.0527    0.5474    0.0517
OBP           112    340.36    341.00    340.65     15.36      1.45
SLG           112    433.21    435.50    433.43     27.00      2.55
AVG           112    270.80    271.00    271.11     11.59      1.09
SB per G      112    0.6370    0.6285    0.6333    0.2003    0.0189
SH per G      112   0.23698   0.24383   0.23600   0.07297   0.00690

Variable      Minimum   Maximum        Q1        Q3
R per G        3.5714    6.2671    4.6588    5.4228
OBP            300.00    374.00    329.25    352.00
SLG            375.00    491.00    411.25    453.75
AVG            240.00    293.00    263.00    279.00
SB per G       0.2284    1.2112    0.4884    0.7731
SH per G      0.06790   0.40994   0.18519   0.28571

An initial examination of the data does not reveal anything alarming. The means

and medians for each variable are very close to being equal, indicating that no variable

appears to have a distribution that is skewed in either direction. To confirm, I examined a

histogram for each variable and, as expected, all had roughly normal distributions. I also

examined box plots for each variable, which only revealed one clear outlier: the maximum

value for SB per game of 1.21 (the 1996 Kansas City Royals). Based on the above, I feel

all the data looks relatively normal and there is no need to log any data or perform similar

adjustments.
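
As a rough cross-check of the box plot result, the sketch below applies the usual 1.5 × IQR fence rule to the minimum, maximum and quartiles reported in the descriptive statistics. The fence multiplier of 1.5 is an assumption about how the box plots were drawn, and the function names are hypothetical; only the numbers come from the output above.

```python
# Minimum, Q1, Q3 and maximum copied from the descriptive statistics above.
summary = {
    "R per G":  {"min": 3.5714,  "q1": 4.6588,  "q3": 5.4228,  "max": 6.2671},
    "OBP":      {"min": 300.00,  "q1": 329.25,  "q3": 352.00,  "max": 374.00},
    "SLG":      {"min": 375.00,  "q1": 411.25,  "q3": 453.75,  "max": 491.00},
    "AVG":      {"min": 240.00,  "q1": 263.00,  "q3": 279.00,  "max": 293.00},
    "SB per G": {"min": 0.2284,  "q1": 0.4884,  "q3": 0.7731,  "max": 1.2112},
    "SH per G": {"min": 0.06790, "q1": 0.18519, "q3": 0.28571, "max": 0.40994},
}


def tukey_fences(q1, q3, k=1.5):
    """Return the (lower, upper) fences; k = 1.5 is an assumed multiplier."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def variables_with_outliers(stats):
    """List the variables whose minimum or maximum falls outside the fences."""
    flagged = []
    for name, s in stats.items():
        low, high = tukey_fences(s["q1"], s["q3"])
        if s["min"] < low or s["max"] > high:
            flagged.append(name)
    return flagged


flagged = variables_with_outliers(summary)
print(flagged)
```

Under that assumption only SB per game is flagged: its upper fence is about 1.20, just below the maximum of 1.2112.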

The next step is to find fitted line plots of each of my predictors against R per

game. While these plots will not illustrate how these variables work together to predict my

target variable, they can be useful in determining the general relationship between the

predictors and the dependent variable:

[Figure: Regression Plot, R per Game versus OBP]
R per Game = -6.23417 + 0.0331354 OBP
S = 0.202521   R-Sq = 86.4%   R-Sq(adj) = 86.3%

[Figure: Regression Plot, R per Game versus SLG]
R per Game = -2.75902 + 0.0180112 SLG
S = 0.252520   R-Sq = 78.9%   R-Sq(adj) = 78.7%

[Figure: Regression Plot, R per Game versus AVG]
R per Game = -5.37003 + 0.0384549 AVG
S = 0.319513   R-Sq = 66.2%   R-Sq(adj) = 65.9%

[Figure: Regression Plot, R per Game versus SB per Game]
R per Game = 5.13135 - 0.137618 SB per Game
S = 0.549179   R-Sq = 0.3%   R-Sq(adj) = 0.0%

[Figure: Regression Plot, R per Game versus SH per Game]
R per Game = 5.23589 - 0.811045 SH per Game
S = 0.546653   R-Sq = 1.2%   R-Sq(adj) = 0.3%

While it is clear that OBP, SLG and AVG have a strong positive correlation with R

per game, it is interesting to note the low correlation of SB and SH per game with R per

game. In fact, each has a slightly negative correlation, indicating that increases in these

variables actually are correlated with fewer runs scored (score one for Billy Beane!).

However, the correlations for these two variables are so low that I do not put too much

weight into the signs but rather just emphasize how poor they are in predicting R per game

by themselves. Perhaps they add more value when working with the other variables.
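
For readers who prefer correlation coefficients to R-Sq values, the following sketch converts each fitted line's R-Sq into a correlation. It relies on the standard fact that, in a one-predictor regression, the correlation is the square root of R-Sq carrying the sign of the slope. The slopes and R-Sq values are copied from the plots above; the function name is hypothetical.

```python
import math

# (slope, R-Sq as a proportion) from the five fitted line plots above.
fits = {
    "OBP":      (0.0331354, 0.864),
    "SLG":      (0.0180112, 0.789),
    "AVG":      (0.0384549, 0.662),
    "SB per G": (-0.137618, 0.003),
    "SH per G": (-0.811045, 0.012),
}


def correlation_from_fit(slope, r_squared):
    """Correlation implied by a simple regression: sqrt(R-Sq) with the slope's sign."""
    return math.copysign(math.sqrt(r_squared), slope)


correlations = {name: correlation_from_fit(slope, r2)
                for name, (slope, r2) in fits.items()}

for name, r in correlations.items():
    print(f"{name:10s} r = {r:+.2f}")
```

Because the R-Sq values are rounded to one decimal place, the correlations for SB and SH per game are only approximate, but both are clearly close to zero.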

V. Preliminary Multiple Regression Model

As mentioned, these plots only tell us part of the story. We now need to examine a

regression output to see how these variables relate together in predicting our target

variable.

Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is

R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG
             + 0.0725 SB per Game + 0.082 SH per Game

Predictor          Coef     SE Coef         T       P
Constant        -5.8078      0.3530    -16.45   0.000
OBP            0.022958    0.002146     10.70   0.000
SLG           0.0086819   0.0009370      9.27   0.000
AVG           -0.002914    0.002593     -1.12   0.264
SB per G        0.07245     0.07767      0.93   0.353
SH per G         0.0822      0.2053      0.40   0.690

S = 0.1518   R-Sq = 92.7%   R-Sq(adj) = 92.3%

Analysis of Variance

Source            DF        SS       MS        F       P
Regression         5   30.8164   6.1633   267.34   0.000
Residual Error   106    2.4437   0.0231
Total            111   33.2601

The adjusted R² value of 92.3% indicates that the model accounts for much of the

variability in runs scored per game. As we can see, holding all else constant, a one “point”

increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is

associated with an increase in the expected R per game for the team of 0.022958. Of

course, a .001 increase is not very meaningful, and thus such a small effect on R per game

is not surprising. However, say a team is able to raise their OBP by 50 points. This would

result in an expected increase of over a run a game (1.1479 to be exact). Needless to say,

an additional run per game over the course of a season could easily be the difference

between making the playoffs and finishing in the middle of the pack.
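
The arithmetic behind the 50-point example can be sketched as follows. The coefficient is taken from the regression output above; the 162-game season length is my own assumption (the standard schedule) and is used only to put the per-game figure on a season scale.

```python
OBP_COEF = 0.022958      # runs per game per OBP "point", from the output above
GAMES_PER_SEASON = 162   # assumption: length of a standard regular season


def expected_run_change(points, coef=OBP_COEF):
    """Expected change in R per game for a change in OBP, holding all else constant."""
    return points * coef


per_game = expected_run_change(50)
per_season = per_game * GAMES_PER_SEASON

print(f"{per_game:.4f} runs per game, roughly {per_season:.0f} runs per season")
```

This reproduces the 1.1479 runs per game quoted above. Like the coefficient itself, it describes an association in the data rather than a guaranteed effect.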

The standard error of the estimate of 0.1518 implies the model can predict R per

game to within ±0.3036 (±2 × 0.1518) about 95% of the time. Considering that the range

of R per game was about 2.7 R per game and the interquartile range was about 0.8 R per

game, the model seems to be a highly useful predictor, which we would expect with such a

high R² value.
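
The summary figures quoted in this section can all be recovered from the Analysis of Variance table. The sketch below does so using the textbook formulas for R-Sq, adjusted R-Sq, S and F. The sums of squares are copied from the output above; the function and variable names are hypothetical.

```python
import math

N_OBS = 112          # team-seasons
N_PREDICTORS = 5     # OBP, SLG, AVG, SB per G, SH per G
SS_REGRESSION = 30.8164
SS_ERROR = 2.4437
SS_TOTAL = 33.2601


def fit_summary(ss_reg, ss_err, ss_tot, n, p):
    """Recompute R-Sq, adjusted R-Sq, S and F from an ANOVA table."""
    df_err = n - p - 1
    mse = ss_err / df_err
    return {
        "r_sq": 1 - ss_err / ss_tot,
        "r_sq_adj": 1 - mse / (ss_tot / (n - 1)),
        "s": math.sqrt(mse),
        "f": (ss_reg / p) / mse,
    }


stats = fit_summary(SS_REGRESSION, SS_ERROR, SS_TOTAL, N_OBS, N_PREDICTORS)
half_width = 2 * stats["s"]   # rough 95% prediction half-width

print(f"R-Sq = {stats['r_sq']:.1%}, R-Sq(adj) = {stats['r_sq_adj']:.1%}")
print(f"S = {stats['s']:.4f}, F = {stats['f']:.2f}, +/- 2S = {half_width:.4f}")
```

The recomputed values agree with the Minitab output to the precision shown there, which confirms that the quoted figures are internally consistent.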

VI. Residual Plots and Checking Assumptions

We must examine the behavior of the residuals in order to identify any unusual

observations and to determine whether the assumptions made on the ε_i terms in the model

are appropriate.

[Figure: Histogram of the Residuals (response is R per Game); Frequency versus Residual]

The distribution of the residuals seems fairly normal. I next will plot the residuals

versus each of the predicting variables to see if there is any apparent structure:

[Figure: Residuals Versus OBP (response is R per Game)]

[Figure: Residuals Versus SLG (response is R per Game)]

[Figure: Residuals Versus AVG (response is R per Game)]

[Figure: Residuals Versus SB per G (response is R per Game)]

[Figure: Residuals Versus SH per G (response is R per Game)]

I do not see any apparent structures. I next will examine the normal plot of the

residuals:

[Figure: Normal Probability Plot of the Residuals (response is R per Game); Normal Score versus Residual]

As the plot indicates, the residuals are roughly normally distributed. There appears

to be an outlier or two (circled), but nothing to cause an assumption to be violated. We will

address the outliers later in this paper, in section VIII, after we perform the “best” model

selection process.

Now I will examine the residuals versus the fitted values plot to validate the next

two assumptions:

[Figure: Residuals Versus the Fitted Values (response is R per Game)]

Fortunately, there does not appear to be any structure to the errors as no apparent

patterns are seen. In addition, the plot shows constant variance and thus the

homoscedasticity assumption appears fine.

Although not really time-sequenced, the data does come from eight different

baseball seasons. Thus, it will still be useful to plot the residuals versus the order of the

data to ensure that no ε_i terms are related to each other and that the assumption is not

violated:

[Figure: Residuals Versus the Order of the Data (response is R per Game); Residual versus Observation Order]

Clearly, there do not appear to be any patterns. Each baseball season comprises 14

observations, and moving from left to right in the chart (from 2003 to 1996) does not show

any distinct changes in the residuals. This makes sense, as while baseball has gone through

“deadball” and offensive periods over time (and stadium size changes, etc.), the game has

not fundamentally changed over the course of the previous 8 seasons. Thus, we can

conclude that this assumption has not been violated.

VII. Model Improvement

As stated earlier, there are a few potential outliers and leverage points that need to

be examined to see if their removal could improve the model. However, before doing so, I

will first look at ways to improve the model, perhaps by eliminating variables

that add little predictive value. Following that, my next step will be to revisit the outliers.

Lastly, I will have a final look at the residual plots to see if the regression assumptions

hold. Let’s examine again the preliminary regression results:

Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is

R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG
             + 0.0725 SB per Game + 0.082 SH per Game

Predictor          Coef     SE Coef         T       P
Constant        -5.8078      0.3530    -16.45   0.000
OBP            0.022958    0.002146     10.70   0.000
SLG           0.0086819   0.0009370      9.27   0.000
AVG           -0.002914    0.002593     -1.12   0.264
SB per G        0.07245     0.07767      0.93   0.353
SH per G         0.0822      0.2053      0.40   0.690

S = 0.1518   R-Sq = 92.7%   R-Sq(adj) = 92.3%

Analysis of Variance

Source            DF        SS       MS        F       P
Regression         5   30.8164   6.1633   267.34   0.000
Residual Error   106    2.4437   0.0231
Total            111   33.2601

The t-statistics for the variables indicate which variables add the most given all the

other variables. In this case, it appears that OBP and SLG clearly add the most while SB

and SH per game, given the other variables, add the least, with AVG somewhere in the

middle. The high p-values for SB, SH per game and AVG indicate that there may be

potential to simplify the model by eliminating variables without losing much predictive

power for my dependent variable of R per game. We can get a sneak preview of what

simplified models may look like by using the “Best Subsets” functionality in Minitab:

Best Subsets Regression: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

Response is R per Game

Predictor columns (left to right): OBP, SLG, AVG, SB per G, SH per G

Vars   R-Sq   R-Sq(adj)     C-p         S   Predictors
   1   86.4        86.3    87.7   0.20252   X
   1   78.9        78.7   196.3   0.25252   X
   2   92.5        92.4     2.2   0.15125   X X
   2   86.6        86.4    86.7   0.20188   X X
   3   92.6        92.4     3.2   0.15129   X X X
   3   92.6        92.3     3.5   0.15146   X X X
   4   92.6        92.4     4.2   0.15124   X X X X
   4   92.6        92.3     4.9   0.15174   X X X X
   5   92.7        92.3     6.0   0.15184   X X X X X

As suggested earlier, it definitely appears that OBP and SLG alone can provide a model

with excellent predictive power for R per game. In fact, a model with just these two

variables provides an adjusted R² of 92.4%. The fact that the adjusted R² remains virtually

unchanged (in fact, it even increased by 0.1 percentage points) with the elimination of the other three

variables indicates that these three are rather unimportant for a multiple regression fit.
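
The C-p column in the Best Subsets output points the same way. As a sketch, Mallows' C-p for each row can be recomputed from that row's S and the S of the full five-predictor model, using the standard definition C-p = SSE_p / MSE_full - n + 2(p + 1), where p is the number of predictors. I am assuming Minitab uses this definition; the function name is hypothetical, and the inputs are copied from the table above.

```python
N_OBS = 112
S_FULL = 0.15184   # S of the five-predictor model (last row of the table)

# (number of predictors, S, reported C-p) for each row of the Best Subsets output.
rows = [
    (1, 0.20252, 87.7),
    (1, 0.25252, 196.3),
    (2, 0.15125, 2.2),
    (2, 0.20188, 86.7),
    (3, 0.15129, 3.2),
    (3, 0.15146, 3.5),
    (4, 0.15124, 4.2),
    (4, 0.15174, 4.9),
    (5, 0.15184, 6.0),
]


def mallows_cp(n_predictors, s, n=N_OBS, s_full=S_FULL):
    """Mallows' C-p from a subset model's S and the full model's S."""
    sse = s ** 2 * (n - n_predictors - 1)   # S^2 times the error degrees of freedom
    return sse / s_full ** 2 - n + 2 * (n_predictors + 1)


recomputed = [mallows_cp(p, s) for p, s, _ in rows]

for (p, s, reported), cp in zip(rows, recomputed):
    print(f"{p} predictors: reported C-p = {reported:6.1f}, recomputed = {cp:6.2f}")
```

A subset model with little bias should have C-p close to its number of parameters (p + 1). The two-variable OBP and SLG model has C-p of about 2.2 against a target of 3, so dropping the other three predictors does not appear to cost anything.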

Thus, let us re-run the regression using only OBP and SLG:

Regression Analysis: R per Game versus OBP, SLG

The regression equation is

R per Game = - 5.89 + 0.0216 OBP + 0.00823 SLG

Predictor          Coef     SE Coef         T       P
Constant        -5.8870      0.3206    -18.36   0.000
OBP            0.021644    0.001540     14.06   0.000
SLG           0.0082269   0.0008759      9.39   0.000

S = 0.1512   R-Sq = 92.5%   R-Sq(adj) = 92.4%

Analysis of Variance

Source            DF       SS       MS        F       P
Regression         2   30.767   15.383   672.49   0.000
Residual Error   109    2.493    0.023
Total            111   33.260

As stated earlier, the adjusted R² value of 92.4% for this new model indicates that it

accounts for much of the variability in runs scored per game. The very high F-statistic

indicates that this regression model for predicting R per game is overall very significant.

The interpretation of the coefficients is similar to before. Holding all else constant, a one

“point” increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is

associated with an increase in the expected R per game for the team of 0.021644. A team

raising its OBP by 50 points would now result in an expected increase of over a run a game

(1.0822 to be exact).

The standard error of the estimate is now 0.1512, implying that this model can

predict R per game to within ±0.3024 (±2 × 0.1512) about 95% of the time. Once again,

given that the range of R per game is about 2.7 R per game and the interquartile range was

about 0.8 R per game, the new model is a highly useful predictor of the dependent variable.
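
To illustrate how the two-variable model would be used, the sketch below predicts R per game for a team at the sample means of OBP and SLG (340.36 and 433.21, from the descriptive statistics) and attaches the rough ±2S band described above. The coefficients come from the regression output; the function names are hypothetical, and the ±2S band is only an approximation to a proper prediction interval.

```python
INTERCEPT = -5.8870
OBP_COEF = 0.021644
SLG_COEF = 0.0082269
S = 0.1512   # standard error of the estimate for the two-variable model


def predict_runs_per_game(obp_points, slg_points):
    """Fitted R per game for OBP and SLG expressed in points (e.g. .340 -> 340)."""
    return INTERCEPT + OBP_COEF * obp_points + SLG_COEF * slg_points


def rough_95_interval(prediction, s=S):
    """Approximate 95% band: prediction plus or minus two standard errors."""
    return prediction - 2 * s, prediction + 2 * s


at_means = predict_runs_per_game(340.36, 433.21)
low, high = rough_95_interval(at_means)

print(f"Predicted R per game: {at_means:.4f} (roughly {low:.2f} to {high:.2f})")
```

The prediction at the means comes out at about 5.04, matching the sample mean of R per game. A least-squares fit always passes through the means, so this is a useful sanity check on the reported coefficients.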

VIII. Examining Unusual Observations

Since we have eliminated three variables, the outliers that appeared to exist before

may no longer be relevant. Let’s examine the new residual plots to see if we can find any

remaining ones. I will first plot the residuals versus each of the predicting variables. This

will be followed by a normal plot of the residuals and finally the residuals versus the fitted

values.

[Figure: Residuals Versus OBP (response is R per Game)]

[Figure: Residuals Versus SLG (response is R per Game)]

[Figure: Normal Probability Plot of the Residuals (response is R per Game); Normal Score versus Residual]

[Figure: Residuals Versus the Fitted Values (response is R per Game)]

The two outliers that I have circled on the extreme ends of the normal plot of the

residuals are observations 45 and 72, the 2000 Chicago White Sox and the 1998 Tampa

Bay Devil Rays, respectively.3 Since outliers can have a strong effect on the fitted

regression and the measures of fit, we perhaps can remove them from the data set and

analyze without them. However, we need to be sure to inform the reader about what we are

doing. I will eliminate these two observations to see if the model becomes even stronger:

Regression Analysis: R per Game versus OBP, SLG

The regression equation is

R per Game = - 5.73 + 0.0218 OBP + 0.00774 SLG

Predictor          Coef     SE Coef         T       P
Constant        -5.7261      0.3065    -18.68   0.000
OBP            0.021788    0.001459     14.94   0.000
SLG           0.0077434   0.0008392      9.23   0.000

S = 0.1432   R-Sq = 92.9%   R-Sq(adj) = 92.7%

Analysis of Variance

Source            DF       SS       MS        F       P
Regression         2   28.598   14.299   697.05   0.000
Residual Error   107    2.195    0.021
Total            109   30.793

Omitting those two teams did not change very much, although the model is slightly

stronger as the adjusted R² did slightly increase and the standard error of the estimate

slightly decreased. While things did not improve much, if I were to present this model, I

would make clear that it does not apply to those two teams. Finally, since I omitted two

observations, I, in a sense, have an entirely new data set and thus need to again perform a

model selection process. However, a look at the t-statistics and p-values indicates that both

variables seem to have a high predictive power, with OBP more than SLG. There is no

need to change the model selection again. These two variables together provide the highest

predictive power in the simplest models (a new check of the best subsets confirmed this).
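
The t-statistics referred to above are simply each coefficient divided by its standard error. The short sketch below recomputes them for the final model, using values copied from the output; the names are hypothetical.

```python
# (coefficient, standard error, reported t) for the final 110-observation model.
final_model = {
    "Constant": (-5.7261, 0.3065, -18.68),
    "OBP":      (0.021788, 0.001459, 14.94),
    "SLG":      (0.0077434, 0.0008392, 9.23),
}


def t_statistic(coef, se):
    """t-statistic for testing whether a coefficient differs from zero."""
    return coef / se


recomputed = {name: t_statistic(coef, se)
              for name, (coef, se, _) in final_model.items()}

for name, (_, _, reported) in final_model.items():
    print(f"{name:9s} reported t = {reported:7.2f}, recomputed = {recomputed[name]:7.2f}")
```

Small differences in the last digit arise because the printed coefficients and standard errors are themselves rounded.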

3
The 1998 Devil Rays, an expansion team, had a much lower R per game than their team OBP and SLG
would suggest, which would naturally help explain their terrible 63-win inaugural season. The 2000 White
Sox, on the other hand, had a much higher R per game than their OBP and SLG would predict, helping them
to a division-winning 95-67 record.

IX. Final Checking of Assumptions with Residual Plots

Because I have made several changes to the model since I last checked the regression

assumptions, I need to perform the check one final time before the analysis is complete.

[Figure: Histogram of the Residuals (response is R per Game); Frequency versus Residual]

The distribution of the residuals seems fairly normal. I next will plot the residuals

versus each of the predicting variables to see if there is any apparent structure:

[Figure: Residuals Versus OBP (response is R per Game)]

[Figure: Residuals Versus SLG (response is R per Game)]

I do not see any apparent structures in the plot of the residuals versus each of the

predicting variables.

[Figure: Normal Probability Plot of the Residuals (response is R per Game); Normal Score versus Residual]

As the plot above indicates, the residuals are roughly normally distributed.

Now I will examine the residuals versus the fitted values plot to validate the next

two assumptions:

[Figure: Residuals Versus the Fitted Values (response is R per Game)]

There does not appear to be any structure to the errors as no apparent patterns are seen. In

addition, the plot shows constant variance and thus the homoscedasticity assumption

appears fine.

Finally, the plot of the residuals versus the order of the data should ensure that no ε_i

terms are related to each other:

[Figure: Residuals Versus the Order of the Data (response is R per Game); Residual versus Observation Order]

Once again, no patterns are apparent and we can conclude that this assumption has

not been violated.

X. Conclusion

This analysis has provided ample evidence that Billy Beane knows what he is

doing. As we have seen, SB and SH per game have virtually no correlation with R per

game when acting by themselves and do not add more prediction value when working

together with the other variables. On the other hand, we saw how OBP, SLG and AVG

have a strong positive correlation with R per game. However, of those three statistics, the

traditional and most often used AVG had the lowest correlation. Furthermore, when we

examined how these variables work together in predicting R per game, we noticed that

OBP and SLG alone could do just as good a job predicting R as a model that included

AVG. 4 In fact, my regression analysis has confirmed that R per game can be highly

predicted by OBP and SLG, with about 92.9% of the variability in R per game accounted

for by these two variables.

What practical implications does the analysis provide? Well, perhaps a GM would

do best to focus on his team’s OBP and SLG over AVG to improve his team’s run

output for the season. In addition, since SB and SH seem insignificant, focusing on speed

on the bases and “small ball” may be counterproductive in a team’s ultimate offensive

goal: scoring runs. However, I hesitate to conclude that these factors should be ignored. In

addition to the important fact that correlation does not imply causation, there is another

angle that was not addressed. My analysis looked for correlations between these factors

and total team runs over the course of a season (expressed as R per game). There is an

argument that in certain situations where you are playing for one run (e.g., a tying, go-ahead

or an “insurance run” in a late inning, etc.), these methods can increase the probability of

scoring that one particular run. While this one run may come at the expense of maximizing

your total team runs scored, it may be beneficial to winning that game. I am skeptical of

this argument and have read analyses that have attempted to disprove this line of

reasoning. However, an examination of this topic goes beyond the scope of this paper.

4
To revisit an earlier discussion from footnote 2, the analysis also showed how, holding all else constant and
among just those variables in the final regression model, OBP has more predicting power than SLG. Thus, it
was not necessary to run a separate regression model that replaces OBP and SLG with OPS.
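
One hedged way to put numbers on the comparison in footnote 4 is sketched below. It compares the OBP and SLG coefficients from the two-variable, 112-observation model both per point and per standard deviation of each statistic, taking the standard deviations from the descriptive statistics. This is my own illustrative comparison, not a substitute for actually fitting the OPS model, and the variable names are hypothetical.

```python
OBP_COEF, SLG_COEF = 0.021644, 0.0082269   # two-variable model, 112 observations
OBP_SD, SLG_SD = 15.36, 27.00              # StDev from the descriptive statistics

# Raw comparison: expected runs per game associated with one extra point of each.
per_point_ratio = OBP_COEF / SLG_COEF

# Scale-adjusted comparison: expected runs per game associated with a
# one-standard-deviation increase in each statistic.
obp_per_sd = OBP_COEF * OBP_SD
slg_per_sd = SLG_COEF * SLG_SD

print(f"One OBP point is associated with about {per_point_ratio:.2f} SLG points' worth of runs")
print(f"Per standard deviation: OBP {obp_per_sd:.3f}, SLG {slg_per_sd:.3f} runs per game")
```

Since OPS weights OBP and SLG equally by construction, a per-point ratio well above 1 suggests that an OPS-based model would impose a restriction the data do not favor. OBP still comes out ahead after adjusting for SLG's larger spread, though by a smaller margin.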
