FACT OR FICTION?
Despite a payroll smaller than all but four other major league teams, the Oakland
Athletics (commonly referred to as the A’s) won the fourth most games in Major League
Baseball in 2003. This year was no fluke, however: the A’s have now made the playoffs
in four consecutive years, whereas teams like the New York Mets and Texas Rangers have
generally floundered despite spending over $50 million more on their players. Recently,
thanks to the publication of the best-selling book Moneyball, much has been made of the
methods of the team’s general manager, who focuses on new and non-traditional statistics
when evaluating talent rather than on traditional ones and scouting reports. In particular,
Billy Beane, the GM, focuses on a player’s on-base percentage (OBP) and slugging
percentage (SLG) when measuring offensive capability rather than on batting average
(AVG), which has always been baseball’s most famous and well-published statistic. In
addition, Beane eschews the practice of sacrifice bunts (SH) and stolen bases (SB) as he
believes they actually decrease the chances of scoring runs (R), rather than increase them
as is commonly believed. While there are certain circumstances where he feels these last
two moves are appropriate, as a general rule he avoids attempting to sacrifice bunt or steal
bases, and the A’s during Beane’s tenure (1998-present) are known to be near the bottom of
While the A’s also have unconventional ways of measuring defense and a pitcher’s
capability (and many fairly argue that the A’s pitching has more to do with their recent
success than their offensive makeup), the offensive side has become the most controversial

1
After collecting my data, I checked to make sure that this was indeed the case. Since 2000, the A’s have
never been higher than 12th (out of 14 teams) in sacrifice bunts or stolen bases. Prior to 2000, the evidence is
not as clear-cut. However, it is important to remember that since Beane inherited a team with a manager and
players used to “playing by the book”, it took him a few years to fully implement his system throughout the
organization (a fact that is well detailed in Moneyball). Given the weight of the evidence during the team’s
most recent years, I feel Beane has successfully implemented his system.

Based on my love of baseball and interest in the subject matter, I have decided to
examine Billy Beane’s methods for my data analysis project. I intend to focus on offense
from a team perspective. Since the sole purpose of a team offense is to score as many runs
as possible, my target variable will be team runs scored per game. I then will try to predict
this variable based on the following statistics: OBP, SLG, AVG, SH (per game) and SB
(per game).2 Since my instincts agreed with many of the ideas in Moneyball, I took many
of its statistical findings at face value. However, I will now examine for myself whether
OBP and SLG are more highly correlated with runs scored than the traditional AVG and
whether sacrifice bunting and stealing bases are indeed poor strategies.
A question that arose before collecting the data was how far back in time to
examine. Since statistics are available dating back to the beginning of the 20th century, I
had many potential observations to examine. However, to make the project less unwieldy, I
took data from American League teams from the past eight seasons (1996-2003). I had
decided to exclude National League teams since, without the presence of the designated
hitter, the sacrifice bunt would (and should) be used more frequently when the pitcher is
batting. Therefore, using National League teams might provide an inaccurate answer for
the correlation between sacrifice bunts and team runs. Considering there have been 14
American League teams since 1993 (one team switched leagues in 1998, but was replaced

2
Depending on the results of this regression model, I may run a separate regression model that replaces OBP
and SLG with OPS, another common statistic that is the sum of OBP and SLG. If the two models have
similar predictive power, then this would indicate that OBP and SLG are about equally important when
determining runs scored per game. However, I do not believe that this will be the case.

While many sports sites can provide the data I desire, Major League Baseball’s
official website (www.mlb.com) had perhaps the easiest to use statistical interface where I
was able to obtain all the statistics I desired for my analysis. In fact, I gathered much more
data than just team totals for R, OBP, SLG, AVG, SH, SB and games played (to determine
the per game numbers for the cumulative statistics). I realized when collecting the data that
other statistics could provide some interesting side analyses. While they may not factor
into my completed project, for my own curiosity I would like to see how well some other
I am fortunate that I had no real problems collecting the data. The only problem I
had, albeit a minor one, was that there was no easy way to download each year’s data. A
simple “copy and paste” technique created some formatting errors, but these were easily
corrected.
I will start by analyzing the descriptive statistics for each of the variables. Since
OBP, SLG and AVG are all measured in percentages, I have decided to multiply each
observation by 1000. While this will have no effect on the regression analysis, it will make
the interpretation of the coefficients a bit easier to understand, especially for baseball fans
who are familiar with what the statistics represent. Having said that, let’s look at our data:
Descriptive Statistics:

Variable    N    Mean     Median   TrMean   StDev    SE Mean
R per G     112  5.0437   5.0216   5.0527   0.5474   0.0517
OBP         112  340.36   341.00   340.65   15.36    1.45
SLG         112  433.21   435.50   433.43   27.00    2.55
AVG         112  270.80   271.00   271.11   11.59    1.09
SB per G    112  0.6370   0.6285   0.6333   0.2003   0.0189
SH per G    112  0.23698  0.24383  0.23600  0.07297  0.00690

An initial examination of the data does not reveal anything alarming. The means
and medians for each variable are very close to being equal, indicating that no variable
is noticeably skewed. I examined a histogram for each variable and, as expected, all had
roughly normal distributions. I also examined box plots for each variable, which revealed
only one clear outlier: the maximum value for SB per game of 1.21 (the 1996 Kansas City
Royals). Based on the above, I feel all the data look relatively normal and there is no need
to log-transform any data or perform similar adjustments.
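
As a quick consistency check on the descriptive statistics, the SE Mean column should equal StDev divided by the square root of N. The following is a minimal Python sketch that uses only the figures printed in the table; the names `table` and `se_check` are my own, and small differences are expected because the printed StDev values are rounded.

```python
import math

N = 112  # team-seasons: 14 AL teams x 8 seasons (1996-2003)

# (StDev, SE Mean) exactly as printed in the descriptive statistics table
table = {
    "R per G":  (0.5474, 0.0517),
    "OBP":      (15.36, 1.45),
    "SLG":      (27.00, 2.55),
    "AVG":      (11.59, 1.09),
    "SB per G": (0.2003, 0.0189),
    "SH per G": (0.07297, 0.00690),
}

# Standard error of the mean = sample standard deviation / sqrt(N)
se_check = {name: sd / math.sqrt(N) for name, (sd, _) in table.items()}

for name, (sd, se_printed) in table.items():
    print(f"{name:9s} computed {se_check[name]:.5f}  printed {se_printed}")
```

Each computed value should agree with the printed one to within rounding, which suggests the table survived the copy-and-paste step intact.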
The next step is to find fitted line plots of each of my predictors against R per
game. While these plots will not illustrate how these variables work together to predict my
target variable, they can be useful in determining the general relationship between the
[Fitted line plot: R per Game versus OBP. R per Game = -6.23417 + 0.0331354 OBP]
[Fitted line plot: R per Game versus SLG. R per Game = -2.75902 + 0.0180112 SLG]
[Fitted line plot: R per Game versus AVG. R per Game = -5.37003 + 0.0384549 AVG]
[Fitted line plot: R per Game versus SB per Game. R per Game = 5.13135 - 0.137618 SB per Game]
[Fitted line plot: R per Game versus SH per Game. R per Game = 5.23589 - 0.811045 SH per Game]
While it is clear that OBP, SLG and AVG have a strong positive correlation with R
per game, it is interesting to note the low correlation of SB and SH per game with R per
game. In fact, each has a slightly negative correlation, indicating that increases in these
variables actually are correlated with fewer runs scored (score one for Billy Beane!).
However, the correlations for these two variables are so low that I do not put too much
weight into the signs but rather just emphasize how poor they are in predicting R per game
by themselves. Perhaps they add more value when working with the other variables.
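
The strength of each relationship can be recovered approximately from numbers already reported. For a one-predictor least-squares line, the correlation equals the slope multiplied by the ratio of the predictor's standard deviation to the response's standard deviation. The sketch below applies that identity to the fitted-line slopes and the StDev column of the descriptive statistics; the variable names are my own, and because the inputs are rounded the results are approximate.

```python
# Slopes from the fitted line plots and standard deviations from the
# descriptive statistics table (all values copied from the paper).
sd_runs = 0.5474  # StDev of R per G

predictors = {
    # name: (slope of R per G on the predictor, StDev of the predictor)
    "OBP":      (0.0331354, 15.36),
    "SLG":      (0.0180112, 27.00),
    "AVG":      (0.0384549, 11.59),
    "SB per G": (-0.137618, 0.2003),
    "SH per G": (-0.811045, 0.07297),
}

# For a simple regression line, r = slope * (s_x / s_y)
implied_r = {name: b * sx / sd_runs for name, (b, sx) in predictors.items()}

for name, r in implied_r.items():
    print(f"{name:9s} r = {r:+.3f}  r^2 = {r * r:.3f}")
```

Under these rounded inputs the implied correlations are roughly 0.93 for OBP, 0.89 for SLG and 0.81 for AVG, against about -0.05 for SB per game and -0.11 for SH per game.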
As mentioned, these plots only tell us part of the story. We now need to examine a
regression output to see how these variables relate together in predicting our target
variable.
Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game
Analysis of Variance
Source DF SS MS F P
Regression 5 30.8164 6.1633 267.34 0.000
Residual Error 106 2.4437 0.0231
Total 111 33.2601
The adjusted R2 value of 92.3% indicates that the model accounts for much of the
variability in runs scored per game. As we can see, holding all else constant, a one “point”
increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is
associated with an increase in the expected R per game for the team of 0.022958. Of
course, a .001 increase is not very meaningful, and thus such a small effect on R per game
is not surprising. However, say a team is able to raise their OBP by 50 points. This would
result in an expected increase of over a run a game (1.1479 to be exact). Needless to say,
an additional run per game over the course of a season could easily be the difference
between making the playoffs and finishing in the middle of the pack.
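
The arithmetic behind the 50-point example is sketched below. The coefficient is the one quoted in the text; the 162-game season length is my own addition (the standard length of a Major League regular season), and the function name is hypothetical. Note that 50 points is more than three standard deviations of team OBP in this data set (StDev 15.36), so the result is an extrapolation that also assumes every other predictor stays fixed.

```python
OBP_COEF = 0.022958      # runs per game per OBP point, from the full model
GAMES_PER_SEASON = 162   # assumed: standard MLB regular-season length

def expected_run_gain(obp_points, coef=OBP_COEF):
    """Expected change in R per game for a given change in OBP points."""
    return obp_points * coef

gain_per_game = expected_run_gain(50)
gain_per_season = gain_per_game * GAMES_PER_SEASON
print(f"{gain_per_game:.4f} runs per game, "
      f"about {gain_per_season:.0f} runs per season")
```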
The standard error of the estimate of 0.1518 implies the model can predict R per
game to within ±0.3036 (2 x 0.1518) about 95% of the time. Considering that the range
of R per game was about 2.7 R per game and the interquartile range was about 0.8 R per
game, the model seems to be a highly useful predictor, which we would expect with such a
high R2 value.
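
The summary measures quoted above can be reproduced from the Analysis of Variance table alone. This is a minimal sketch using the standard definitions of R-squared, adjusted R-squared, the standard error of the estimate and the F statistic; the variable names are my own.

```python
import math

# Values from the Analysis of Variance table for the five-predictor model
ss_regression, df_regression = 30.8164, 5
ss_error, df_error = 2.4437, 106
ss_total, df_total = 33.2601, 111

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_squared = ss_regression / ss_total
adj_r_squared = 1 - ms_error / (ss_total / df_total)
std_error = math.sqrt(ms_error)     # standard error of the estimate, S
f_statistic = ms_regression / ms_error
half_width = 2 * std_error          # rough 95% prediction band

print(f"R-sq = {r_squared:.1%}, R-sq(adj) = {adj_r_squared:.1%}")
print(f"S = {std_error:.4f}, F = {f_statistic:.2f}, +/- {half_width:.4f}")
```

The half-width computed here differs from the 0.3036 in the text only in the last digit, because the text doubles the already rounded value of S.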
We must examine the behavior of the residuals in order to identify any unusual
observations and to determine whether the assumptions made on the error terms (ε_i) in the
model are appropriate.
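
For readers who want to repeat this kind of residual check outside Minitab, here is a minimal NumPy sketch. It runs on synthetic data that I generate purely for illustration (not the team data from this paper), and the cutoffs of 2.5 for standardized residuals and 2.5p/n for leverage are common rules of thumb that I am assuming, not values taken from the analysis.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for 112 team-seasons (NOT the paper's data)
n = 112
obp = rng.normal(340, 15, n)
slg = rng.normal(433, 27, n)
runs = -5.0 + 0.02 * obp + 0.008 * slg + rng.normal(0, 0.15, n)

X = np.column_stack([np.ones(n), obp, slg])   # design matrix with intercept
beta, *_ = np.linalg.lstsq(X, runs, rcond=None)
residuals = runs - X @ beta

# Leverage = diagonal of the hat matrix H = X (X'X)^-1 X'
leverage = np.einsum("ij,ji->i", X, np.linalg.pinv(X))
p = X.shape[1]
s = np.sqrt(residuals @ residuals / (n - p))  # standard error of the estimate
standardized = residuals / (s * np.sqrt(1 - leverage))

# Assumed rule-of-thumb cutoffs for flagging unusual observations
outliers = np.flatnonzero(np.abs(standardized) > 2.5)
high_leverage = np.flatnonzero(leverage > 2.5 * p / n)
print("possible outliers:", outliers)
print("high-leverage points:", high_leverage)
```

Flagged observations are candidates for a closer look, not automatic deletions.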
[Figure: Histogram of the residuals]
The distribution of the residuals seems fairly normal. I next will plot the residuals
versus each of the predicting variables to see if there is any apparent structure:
[Figure: Residuals versus OBP (response is R per Game)]
[Figure: Residuals versus SLG (response is R per Game)]
[Figure: Residuals versus AVG (response is R per Game)]
[Figure: Residuals versus SB per G (response is R per Game)]
[Figure: Residuals versus SH per G (response is R per Game)]
I do not see any apparent structures. I next will examine the normal plot of the
residuals:
[Figure: Normal probability plot of the residuals (response is R per Game)]
As the plot indicates, the residuals are roughly normally distributed. There appear
to be a few potential outliers, which I will address later in this paper, in section VIII,
after we perform the “best” model selection process.
Now I will examine the residuals versus the fitted values plot to validate the next
two assumptions:
[Figure: Residuals versus the fitted values (response is R per Game)]
Fortunately, there does not appear to be any structure to the errors as no apparent
patterns are seen. In addition, the plot shows constant variance and thus the
homoscedasticity assumption appears fine.
Although not really time-sequenced, the data does come from eight different
baseball seasons. Thus, it will still be useful to plot the residuals versus the order of the
data to ensure that no ε_i terms are related to each other and that the assumption is not
violated:
[Figure: Residuals versus the order of the data (response is R per Game)]
Clearly, there do not appear to be any patterns. Each baseball season comprises 14
observations, and moving from left to right in the chart (from 2003 to 1996) does not show
any distinct changes in the residuals. This makes sense: while baseball has gone through
“deadball” and offensive periods over time (and stadium size changes, etc.), the game has
not fundamentally changed over the course of the previous 8 seasons. Thus, we can
conclude that this assumption is not violated.
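
A numeric complement to eyeballing the order plot is the Durbin-Watson statistic. This is a technique I am adding for illustration; it is not part of the analysis above. The sketch below implements the textbook formula and applies it to two short residual sequences invented for the example. Since the observations here are ordered by season and then by team, the statistic would only be meaningful to the extent that this ordering is.

```python
def durbin_watson(residuals):
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals. Values near 2 suggest no
    first-order autocorrelation; values near 0 or 4 suggest positive or
    negative autocorrelation, respectively."""
    num = sum((b - a) ** 2 for a, b in zip(residuals, residuals[1:]))
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residual sequences, for illustration only
alternating = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]   # sign flips every step
drifting = [-0.3, -0.2, -0.1, 0.1, 0.2, 0.3]      # slow upward drift

print(durbin_watson(alternating))  # well above 2
print(durbin_watson(drifting))     # well below 2
```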
As stated earlier, there are a few potential outliers and leverage points that need to
be examined to see if their removal could improve the model. However, before doing so, I
will first look at ways to improve my model by perhaps eliminating variables
that add little predictive value. Following that, my next step will be to revisit the outliers.
Lastly, I will have a final look at the residual plots to see if the regression assumptions
still hold.
Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game
Analysis of Variance
Source DF SS MS F P
Regression 5 30.8164 6.1633 267.34 0.000
Residual Error 106 2.4437 0.0231
Total 111 33.2601
The t-statistics for the variables indicate which variables add the most given all the
other variables. In this case, it appears that OBP and SLG clearly add the most while SB
and SH per game, given the other variables, add the least, with AVG somewhere in the
middle. The high p-values for SB, SH per game and AVG indicate that there may be
potential to simplify the model by eliminating variables without losing much predictive
power for my dependent variable of R per game. We can get a sneak preview of what
simplified models may look like by using the “Best Subsets” functionality in Minitab:
Best Subsets Regression: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game
Response is R per Game

Vars  R-Sq  R-Sq(adj)  C-p  S  OBP  SLG  AVG  SB per G  SH per G
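
For readers without Minitab, the idea behind a best-subsets search can be sketched in a few lines of NumPy: fit every subset of the candidate predictors and compare R-squared, adjusted R-squared and Mallows' Cp. The function names and the synthetic data below are my own and are for illustration only; this is not Minitab's implementation, and the numbers it prints are not results from this paper.

```python
from itertools import combinations
import numpy as np

def sse(X, y):
    """Residual sum of squares of an OLS fit with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    r = y - A @ beta
    return float(r @ r)

def best_subsets(data, y):
    """Return (names, R-sq, adj R-sq, Mallows Cp) for every predictor subset,
    sorted by adjusted R-sq (best first)."""
    n, names = len(y), list(data)
    sst = float(((y - y.mean()) ** 2).sum())
    full = np.column_stack([data[k] for k in names])
    mse_full = sse(full, y) / (n - len(names) - 1)
    rows = []
    for k in range(1, len(names) + 1):
        for subset in combinations(names, k):
            e = sse(np.column_stack([data[c] for c in subset]), y)
            r2 = 1 - e / sst
            adj = 1 - (e / (n - k - 1)) / (sst / (n - 1))
            cp = e / mse_full - n + 2 * (k + 1)
            rows.append((subset, r2, adj, cp))
    return sorted(rows, key=lambda row: -row[2])

# Synthetic stand-in data (NOT the paper's): runs depend on OBP and SLG only
rng = np.random.default_rng(1)
n = 112
data = {
    "OBP": rng.normal(340, 15, n),
    "SLG": rng.normal(433, 27, n),
    "SB": rng.normal(0.64, 0.2, n),
    "SH": rng.normal(0.24, 0.07, n),
}
y = -5.0 + 0.02 * data["OBP"] + 0.008 * data["SLG"] + rng.normal(0, 0.15, n)

results = best_subsets(data, y)
for subset, r2, adj, cp in results[:3]:
    print(subset, f"{r2:.3f} {adj:.3f} {cp:.1f}")
```

A subset is attractive when its adjusted R-squared is close to that of the full model and its Cp is close to the number of fitted parameters.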
As suggested earlier, it definitely appears that OBP and SLG alone can provide a model
with excellent predictive power for R per game. In fact, a model with just these two
variables provides an adjusted R2 of 92.4%. The fact that the adjusted R2 remains virtually
unchanged (in fact, it even increased slightly, from 92.3% to 92.4%) with the elimination of
the other three variables indicates that these three are rather unimportant for a multiple
regression fit. Thus, let us re-run the regression using only OBP and SLG:
Analysis of Variance
Source DF SS MS F P
Regression 2 30.767 15.383 672.49 0.000
Residual Error 109 2.493 0.023
Total 111 33.260
As stated earlier, the adjusted R2 value of 92.4% for this new model indicates that it
accounts for much of the variability in runs scored per game. The very high F-statistic
indicates that this regression model for predicting R per game is overall very significant.
The interpretation of the coefficients is similar to before. Holding all else constant, a one
“point” increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is
associated with an increase in the expected R per game for the team of 0.021644. A team
raising its OBP by 50 points would now expect an increase of over a run a game
(1.0822 to be exact).
The standard error of the estimate is now 0.1512, implying that this model can
predict R per game to within ±0.3024 (2 x 0.1512) about 95% of the time. Once again,
given that the range of R per game is about 2.7 R per game and the interquartile range was
about 0.8 R per game, the new model is a highly useful predictor of the dependent variable.
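
One way to formalize the claim that AVG, SB per game and SH per game add little is a partial F-test comparing the two nested models. This test is my own addition and was not run in the analysis above. The sketch uses only the residual sums of squares and degrees of freedom from the two Analysis of Variance tables; the reduced-model value is printed to three decimals, so the result is approximate.

```python
# Residual sums of squares and degrees of freedom from the two ANOVA tables
sse_full, df_full = 2.4437, 106        # OBP, SLG, AVG, SB per G, SH per G
sse_reduced, df_reduced = 2.493, 109   # OBP and SLG only

dropped = df_reduced - df_full         # number of predictors removed

# Partial F: extra error per dropped predictor relative to full-model MSE
partial_f = ((sse_reduced - sse_full) / dropped) / (sse_full / df_full)
print(f"partial F({dropped}, {df_full}) = {partial_f:.2f}")
```

The statistic comes out at roughly 0.7. An F value below 1 cannot be significant at conventional levels, which is consistent with dropping the three variables.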
Since we have eliminated three variables, the outliers that appeared to exist before
may no longer be relevant. Let’s examine the new residual plots to see if we can find any
remaining ones. I will first plot the residuals versus each of the predicting variables. This
will be followed by a normal plot of the residuals and finally the residuals versus the fitted
values.
[Figure: Residuals versus OBP (response is R per Game)]
[Figure: Residuals versus SLG (response is R per Game)]
[Figure: Normal probability plot of the residuals (response is R per Game)]
[Figure: Residuals versus the fitted values (response is R per Game)]
The two outliers that I have circled on the extreme ends of the normal plot of the
residuals are observations 45 and 72, the 2000 Chicago White Sox and the 1998 Tampa
Bay Devil Rays, respectively.3 Since outliers can have a strong effect on the fitted
regression and the measures of fit, we perhaps can remove them from the data set and
analyze without them. However, we need to be sure to inform the reader about what we are
doing. I will eliminate these two observations to see if the model becomes even stronger:
Analysis of Variance
Source DF SS MS F P
Regression 2 28.598 14.299 697.05 0.000
Residual Error 107 2.195 0.021
Total 109 30.793
Omitting those two teams did not change very much, although the model is slightly
stronger: the adjusted R2 increased slightly and the standard error of the estimate
decreased slightly. While things did not improve much, if I were to present this model, I
would make clear that it does not apply to those two teams. Finally, since I omitted two
observations, I, in a sense, have an entirely new data set and thus need to again perform a
model selection process. However, a look at the t-statistics and p-values indicates that both
variables seem to have high predictive power, with OBP more than SLG. There is no
need to change the model selection again. These two variables together provide the highest
predictive power in the simplest models (a new check of the best subsets confirmed this).
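
The size of the improvement can be checked directly from the two Analysis of Variance tables for the OBP and SLG model, before and after the two observations were removed. This is a minimal sketch; the helper name `summarize` is my own.

```python
import math

def summarize(ss_error, df_error, ss_total, df_total):
    """Return (adjusted R-sq, standard error of the estimate) from ANOVA sums."""
    ms_error = ss_error / df_error
    return 1 - ms_error / (ss_total / df_total), math.sqrt(ms_error)

# OBP + SLG model, all 112 team-seasons
adj_all, s_all = summarize(2.493, 109, 33.260, 111)
# OBP + SLG model, observations 45 and 72 removed
adj_trim, s_trim = summarize(2.195, 107, 30.793, 109)

print(f"all 112:  R-sq(adj) = {adj_all:.1%}, S = {s_all:.4f}")
print(f"trimmed:  R-sq(adj) = {adj_trim:.1%}, S = {s_trim:.4f}")
```

On these rounded inputs the adjusted R-squared moves from about 92.4% to about 92.7%, and the standard error of the estimate falls from about 0.151 to about 0.143.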
Since I made several changes to the model since I last checked the regression
assumptions, I need to perform the check one final time before the analysis is complete.
[Figure: Histogram of the residuals]
The distribution of the residuals seems fairly normal. I next will plot the residuals
versus each of the predicting variables to see if there is any apparent structure:
[Figure: Residuals versus OBP (response is R per Game)]
[Figure: Residuals versus SLG (response is R per Game)]
I do not see any apparent structures in the plot of the residuals versus each of the
predicting variables.
[Figure: Normal probability plot of the residuals (response is R per Game)]
As the plot above indicates, the residuals are roughly normally distributed.
Now I will examine the residuals versus the fitted values plot to validate the next
two assumptions:
[Figure: Residuals versus the fitted values (response is R per Game)]
There does not appear to be any structure to the errors as no apparent patterns are seen. In
addition, the plot shows constant variance and thus the homoscedasticity assumption
appears fine.
Finally, the plot of the residuals versus the order of the data should ensure that no ε_i
terms are related to each other:
[Figure: Residuals versus the order of the data (response is R per Game)]
Once again, no patterns are apparent and we can conclude that this assumption has
not been violated.
X. Conclusion
This analysis has provided ample evidence that Billy Beane knows what he is
doing. As we have seen, SB and SH per game have virtually no correlation with R per
game when acting by themselves and do not add more predictive value when working
together with the other variables. On the other hand, we saw how OBP, SLG and AVG
have a strong positive correlation with R per game. However, of those three statistics, the
traditional and most often used AVG had the lowest correlation. Furthermore, when we
examined how these variables work together in predicting R per game, we noticed that
OBP and SLG alone could do just as good a job predicting R as a model that included
AVG.4 In fact, my regression analysis has confirmed that R per game can be predicted
well by OBP and SLG, with about 92.9% of the variability in R per game accounted for.
What practical implications does the analysis provide? Well, perhaps a GM would
do best to focus on his team’s OBP and SLG over AVG to improve his team’s run
output for the season. In addition, since SB and SH seem insignificant, focusing on speed
on the bases and “small ball” may be counterproductive in a team’s ultimate offensive
goal: scoring runs. However, I hesitate to conclude that these factors should be ignored. In
addition to the important fact that correlation does not imply causation, there is another
angle that was not addressed. My analysis looked for correlations between these factors
and total team runs over the course of a season (expressed as R per game). There is an
argument that in certain situations where you are playing for one run (e.g., a tying, go-ahead
or “insurance” run in a late inning), these methods can increase the probability of
scoring that one particular run. While this one run may come at the expense of maximizing
your total team runs scored, it may be beneficial to winning that game. I am skeptical of
this argument and have read analyses that have attempted to disprove this line of
reasoning. However, an examination of this topic goes beyond the scope of this paper.
4
To revisit an earlier discussion from footnote 2, the analysis also showed how, holding all else constant and
among just those variables in the final regression model, OBP has more predicting power than SLG. Thus, it
was not necessary to run a separate regression model that replaces OBP and SLG with OPS.