Money Ball

MONEYBALL:
FACT OR FICTION?
Statistics & Data Analysis

Data Analysis Project
Professor Jeffrey Simonoff
December 9, 2003
I. Introduction
Despite a payroll smaller than all but four other major league teams, the Oakland
Athletics (commonly referred to as the A’s) won the fourth most games in Major League
Baseball in 2003. This year was no fluke, however, as the A’s have now made the playoffs
in four consecutive years, whereas teams like the New York Mets and Texas Rangers have
generally floundered despite spending over $50 million more on their players. Recently,
thanks to the publication of the best-selling book Moneyball, much has been made of the
methods of the team’s general manager, who focuses on new and non-traditional statistics
when evaluating talent rather than on traditional ones and scouting reports. In particular,
Billy Beane, the GM, focuses on a player’s on-base percentage (OBP) and slugging
percentage (SLG) when measuring offensive capability rather than on batting average
(AVG), which has always been baseball’s most famous and most widely published statistic. In
addition, Beane eschews the practice of sacrifice bunts (SH) and stolen bases (SB) as he
believes they actually decrease the chances of scoring runs (R), rather than increase them
as is commonly believed. While there are certain circumstances where he feels these last
two moves are appropriate, as a general rule he avoids attempting to sacrifice bunt or steal
bases, and the A’s during Beane’s tenure (1998-present) are known to be near the bottom of
the league in these categories.1
While the A’s also have unconventional ways of measuring defense and a pitcher’s
capability (and many fairly argue that the A’s pitching has more to do with their recent
success than their offensive makeup), the offensive side has become the most controversial
as it clearly conflicts with the traditional way of looking at things.
1
After collecting my data, I checked to make sure that this was indeed the case. Since 2000, the A’s have
never been higher than 12th (out of 14 teams) in sacrifice bunts or stolen bases. Prior to 2000, the evidence is
not as clear-cut. However, it is important to remember that since Beane inherited a team with a manager and
players used to “playing by the book”, it took him a few years to fully implement his system throughout the
organization (a fact that is well detailed in Moneyball). Given the weight of the evidence during the team’s
most recent years, I feel Beane has successfully implemented his system.
Based on my love of baseball and interest in the subject matter, I have decided to
examine Billy Beane’s methods for my data analysis project. I intend to focus on offense
from a team perspective. Since the sole purpose of a team offense is to score as many runs
as possible, my target variable will be team runs scored per game. I then will try to predict
this variable based on the following statistics: OBP, SLG, AVG, SH (per game) and SB
(per game).2 Since my instincts agreed with many of the ideas in Moneyball, I took many
of its statistical findings at face value. However, I will now examine for myself whether
OBP and SLG are more highly correlated to runs scored than the traditional AVG and
whether sacrifice bunting and stealing bases are indeed poor strategies.
II. Thinking about the Data
A question that arose before collecting the data was how far back in time to
examine. Since statistics are available dating back to the beginning of the 20th century, I
had many potential observations to examine. However, to make the project less unwieldy, I
took data from American League teams from the past eight seasons (1996-2003). I
decided to exclude National League teams since, without the presence of the designated
hitter, the sacrifice bunt would (and should) be used more frequently when the pitcher is
batting. Therefore, using National League teams might provide an inaccurate answer for
the correlation between sacrifice bunts and team runs. Considering there have been 14
American League teams since 1993 (one team switched leagues in 1998, but was replaced
with an expansion team), this approach will provide 112 observations (14 teams × 8 seasons).
2
Depending on the results of this regression model, I may run a separate regression model that replaces OBP
and SLG with OPS, another common statistic that is the sum of OBP and SLG. If the two models have
similar predictive power, then this would indicate that OBP and SLG are about equally important when
determining runs scored per game. However, I do not believe that this will be the case.
III. Data Collection
While many sports sites can provide the data I desire, Major League Baseball’s
official website (www.mlb.com) had perhaps the easiest to use statistical interface where I
was able to obtain all the statistics I desired for my analysis. In fact, I gathered much more
data than just team totals for R, OBP, SLG, AVG, SH, SB and games played (to determine
the per game numbers for the cumulative statistics). I realized when collecting the data that
other statistics could provide some interesting side analyses. While they may not factor
into my completed project, for my own curiosity I would like to see how well some other
offensive statistics predict runs scored.
I am fortunate that I had no real problems collecting the data. The only problem I
had, albeit a minor one, was that there was no easy way to download each year’s data. A
simple “copy and paste” technique created some formatting errors, but these were easily
corrected.
IV. First Look at the Data
I will start by analyzing the descriptive statistics for each of the variables. Since
OBP, SLG and AVG are all measured in percentages, I have decided to multiply each
observation by 1000. While this will have no effect on the regression analysis, it will make
the interpretation of the coefficients a bit easier to understand, especially for baseball fans
who are familiar with what the statistics represent. Having said that, let’s look at our data:
Descriptive Statistics:

Variable      N      Mean    Median    TrMean     StDev   SE Mean
R per G     112    5.0437    5.0216    5.0527    0.5474    0.0517
OBP         112    340.36    341.00    340.65     15.36      1.45
SLG         112    433.21    435.50    433.43     27.00      2.55
AVG         112    270.80    271.00    271.11     11.59      1.09
SB per G    112    0.6370    0.6285    0.6333    0.2003    0.0189
SH per G    112   0.23698   0.24383   0.23600   0.07297   0.00690

Variable     Minimum   Maximum        Q1        Q3
R per G       3.5714    6.2671    4.6588    5.4228
OBP           300.00    374.00    329.25    352.00
SLG           375.00    491.00    411.25    453.75
AVG           240.00    293.00    263.00    279.00
SB per G      0.2284    1.2112    0.4884    0.7731
SH per G     0.06790   0.40994   0.18519   0.28571

An initial examination of the data does not reveal anything alarming. The mean
and median for each variable are very close to being equal, indicating that no variable
appears to have a distribution that is skewed in either direction. To confirm, I examined a
histogram for each variable and, as expected, all had roughly normal distributions. I also
examined box plots for each variable, which only revealed one clear outlier: the maximum
value for SB per game of 1.21 (the 1996 Kansas City Royals). Based on the above, I feel
all the data looks relatively normal and there is no need to log any data or perform similar
adjustments.
The next step is to find fitted line plots of each of my predictors against R per
game. While these plots will not illustrate how these variables work together to predict my
target variable, they can be useful in determining the general relationship between the
predictors and the dependent variable:
[Regression Plot: R per Game versus OBP]
R per Game = -6.23417 + 0.0331354 OBP
S = 0.202521   R-Sq = 86.4%   R-Sq(adj) = 86.3%

[Regression Plot: R per Game versus SLG]
R per Game = -2.75902 + 0.0180112 SLG
S = 0.252520   R-Sq = 78.9%   R-Sq(adj) = 78.7%

[Regression Plot: R per Game versus AVG]
R per Game = -5.37003 + 0.0384549 AVG
S = 0.319513   R-Sq = 66.2%   R-Sq(adj) = 65.9%

[Regression Plot: R per Game versus SB per Game]
R per Game = 5.13135 - 0.137618 SB per Game
S = 0.549179   R-Sq = 0.3%   R-Sq(adj) = 0.0%

[Regression Plot: R per Game versus SH per Game]
R per Game = 5.23589 - 0.811045 SH per Game
S = 0.546653   R-Sq = 1.2%   R-Sq(adj) = 0.3%

While it is clear that OBP, SLG and AVG have a strong positive correlation with R
per game, it is interesting to note the low correlation of SB and SH per game with R per
game. In fact, each has a slightly negative correlation, indicating that increases in these
variables actually are correlated with fewer runs scored (score one for Billy Beane!).
However, the correlations for these two variables are so low that I do not put too much
weight into the signs but rather just emphasize how poor they are in predicting R per game
by themselves. Perhaps they add more value when working with the other variables.
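To put numbers on "strong" and "slightly negative", the R-Sq values from the fitted line plots can be converted into correlation coefficients. The sketch below relies on the standard fact that, in a one-predictor regression, the correlation has the sign of the slope and a magnitude equal to the square root of R-Sq. The slopes and R-Sq values are copied from the plots above; the function and variable names are illustrative only, and the results inherit the rounding of the reported R-Sq figures.

```python
import math

# (slope, R-Sq in percent) as reported for each fitted line plot
fits = {
    "OBP": (0.0331354, 86.4),
    "SLG": (0.0180112, 78.9),
    "AVG": (0.0384549, 66.2),
    "SB per Game": (-0.137618, 0.3),
    "SH per Game": (-0.811045, 1.2),
}

def correlation(slope, r_sq_percent):
    """Simple regression: r takes the slope's sign and has size sqrt(R-Sq)."""
    return math.copysign(math.sqrt(r_sq_percent / 100.0), slope)

correlations = {name: correlation(*fit) for name, fit in fits.items()}

for name, r in correlations.items():
    print(f"{name:12s} r = {r:+.3f}")
```

On these figures the correlations for OBP, SLG and AVG come out at roughly 0.93, 0.89 and 0.81, while those for SB and SH per game are about -0.05 and -0.11, which is why the sign of the last two carries so little weight.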
V. Preliminary Multiple Regression Model
As mentioned, these plots only tell us part of the story. We now need to examine a
regression output to see how these variables relate together in predicting our target
variable.
Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is
R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG
             + 0.0725 SB per Game + 0.082 SH per Game

Predictor         Coef     SE Coef        T       P
Constant       -5.8078      0.3530   -16.45   0.000
OBP           0.022958    0.002146    10.70   0.000
SLG          0.0086819   0.0009370     9.27   0.000
AVG          -0.002914    0.002593    -1.12   0.264
SB per G       0.07245     0.07767     0.93   0.353
SH per G        0.0822      0.2053     0.40   0.690

S = 0.1518   R-Sq = 92.7%   R-Sq(adj) = 92.3%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         5   30.8164    6.1633   267.34   0.000
Residual Error   106    2.4437    0.0231
Total            111   33.2601

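As a quick consistency check, the summary figures in this output can be recomputed from the analysis of variance table using the textbook definitions of R-Sq, adjusted R-Sq, the F-statistic and S. The following is a minimal sketch in Python; the inputs are copied from the output above and the variable names are illustrative, not part of the Minitab output.

```python
n = 112                      # observations (team-seasons)
k = 5                        # predictors in the preliminary model
ss_reg, ss_err, ss_tot = 30.8164, 2.4437, 33.2601

df_err = n - k - 1           # 106, matching the Residual Error row
mse = ss_err / df_err        # mean squared error

r_sq = ss_reg / ss_tot                        # share of variability explained
adj_r_sq = 1 - mse / (ss_tot / (n - 1))       # penalised for extra predictors
f_stat = (ss_reg / k) / mse                   # overall F-statistic
s = mse ** 0.5                                # standard error of the estimate
t_obp = 0.022958 / 0.002146                   # t = Coef / SE Coef for OBP

print(f"R-Sq      = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * adj_r_sq:.1f}%")
print(f"F         = {f_stat:.2f}")
print(f"S         = {s:.4f}")
print(f"t (OBP)   = {t_obp:.2f}")
```

The recomputed values agree with the reported 92.7%, 92.3%, 267.34, 0.1518 and 10.70 to the precision shown.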
The adjusted R² value of 92.3% indicates that the model accounts for much of the
variability in runs scored per game. As we can see, holding all else constant, a one “point”
increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is
associated with an increase in the expected R per game for the team of 0.022958. Of
course, a .001 increase is not very meaningful, and thus such a small effect on R per game
is not surprising. However, say a team is able to raise its OBP by 50 points. This would
result in an expected increase of over a run a game (50 × 0.022958 = 1.1479 to be exact). Needless to say,
an additional run per game over the course of a season could easily be the difference
between making the playoffs and finishing in the middle of the pack.
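The arithmetic behind the 50-point example can be written out as a short sketch. The coefficient is the OBP estimate from the preliminary model; the 162-game season length is an assumption added here for illustration (it is the standard regular-season schedule, not a figure used in the regression), and the calculation holds all other predictors constant, as the coefficient interpretation requires.

```python
COEF_OBP = 0.022958       # expected R per game per one-point rise in OBP
GAMES_PER_SEASON = 162    # assumption: standard regular-season schedule

def expected_run_change(obp_points, coef=COEF_OBP):
    """Expected change in R per game for a given change in OBP points."""
    return coef * obp_points

per_game = expected_run_change(50)
per_season = per_game * GAMES_PER_SEASON

print(f"50 OBP points -> {per_game:.4f} more runs per game")
print(f"over {GAMES_PER_SEASON} games -> about {per_season:.0f} more runs")
```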
The standard error of the estimate of 0.1518 implies the model can predict R per
game to within ±0.3036 (±2 × 0.1518) about 95% of the time. Considering that the range
of R per game was about 2.7 R per game and the interquartile range was about 0.8 R per
game, the model seems to be a highly useful predictor, which we would expect with such a
high R² value.
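The comparison between the model's precision and the spread of the data can be made explicit. This is a small sketch using only the descriptive statistics and the standard error reported above; the variable names are illustrative, and the "2 × S" band is the rough 95% rule of thumb used in the text rather than an exact prediction interval.

```python
S = 0.1518                                # standard error of the estimate
r_min, r_max = 3.5714, 6.2671             # minimum and maximum R per game
q1, q3 = 4.6588, 5.4228                   # quartiles of R per game

half_width = 2 * S                        # rough 95% band, plus or minus
data_range = r_max - r_min
iqr = q3 - q1

print(f"prediction band : +/- {half_width:.4f} R per game")
print(f"range           : {data_range:.4f}")
print(f"IQR             : {iqr:.4f}")
print(f"band as share of IQR: {half_width / iqr:.0%}")
```

The half-width of about 0.30 runs per game is well under half of the interquartile range, which supports the reading that the model separates typical teams from one another.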
VI. Residual Plots and Checking Assumptions
We must examine the behavior of the residuals in order to identify any unusual
observations and to determine whether the assumptions made on the ε_i terms in the model
are appropriate.
[Figure: Histogram of the Residuals (response is R per Game)]
The distribution of the residuals seems fairly normal. I next will plot the residuals
versus each of the predicting variables to see if there is any apparent structure:
[Figure: Residuals Versus OBP]
[Figure: Residuals Versus SLG]
[Figure: Residuals Versus AVG]
[Figure: Residuals Versus SB per G]
[Figure: Residuals Versus SH per G]
I do not see any apparent structures. I next will examine the normal plot of the
residuals:
[Figure: Normal Probability Plot of the Residuals]
As the plot indicates, the residuals are roughly normally distributed. There appears
to be an outlier or two (circled), but nothing to cause an assumption to be violated. We will
address the outliers later in this paper, in section VIII, after we perform the “best” model
selection process.
Now I will examine the residuals versus the fitted values plot to validate the next
two assumptions:
[Figure: Residuals Versus the Fitted Values]
Fortunately, there does not appear to be any structure to the errors as no apparent
patterns are seen. In addition, the plot shows constant variance and thus the
homoscedasticity assumption appears fine.
Although not really time-sequenced, the data does come from eight different
baseball seasons. Thus, it will still be useful to plot the residuals versus the order of the
data to ensure that no ε_i terms are related to each other and that the independence
assumption is not violated:
[Figure: Residuals Versus the Order of the Data]
Clearly, there do not appear to be any patterns. Each baseball season comprises 14
observations, and moving from left to right in the chart (from 2003 to 1996) does not show
any distinct changes in the residuals. This makes sense, as while baseball has gone through
“deadball” and offensive periods over time (and stadium size changes, etc.), the game has
not fundamentally changed over the course of the previous 8 seasons. Thus, we can
conclude that this assumption has not been violated.
VII. Model Improvement
As stated earlier, there are a few potential outliers and leverage points that need to
be examined to see if their removal could improve the model. However, before doing so, I
will first look at whether the model can be improved by eliminating variables
that add little predictive value. Following that, my next step will be to revisit the outliers.
Lastly, I will have a final look at the residual plots to see if the regression assumptions
hold. Let’s examine again the preliminary regression results:
Regression Analysis: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

The regression equation is
R per Game = - 5.81 + 0.0230 OBP + 0.00868 SLG - 0.00291 AVG
             + 0.0725 SB per Game + 0.082 SH per Game

Predictor         Coef     SE Coef        T       P
Constant       -5.8078      0.3530   -16.45   0.000
OBP           0.022958    0.002146    10.70   0.000
SLG          0.0086819   0.0009370     9.27   0.000
AVG          -0.002914    0.002593    -1.12   0.264
SB per G       0.07245     0.07767     0.93   0.353
SH per G        0.0822      0.2053     0.40   0.690

S = 0.1518   R-Sq = 92.7%   R-Sq(adj) = 92.3%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         5   30.8164    6.1633   267.34   0.000
Residual Error   106    2.4437    0.0231
Total            111   33.2601

The t-statistics for the variables indicate which variables add the most given all the
other variables. In this case, it appears that OBP and SLG clearly add the most while SB
and SH per game, given the other variables, add the least, with AVG somewhere in the
middle. The high p-values for SB, SH per game and AVG indicate that there may be
potential to simplify the model by eliminating variables without losing much predictive
power for my dependent variable of R per game. We can get a sneak preview of what
simplified models may look like by using the “Best Subsets” functionality in Minitab:
Best Subsets Regression: R per Game versus OBP, SLG, AVG, SB per Game, SH per Game

Response is R per Game
Predictor columns (marked X when included): OBP, SLG, AVG, SB per G, SH per G

Vars   R-Sq   R-Sq(adj)     C-p         S   Predictors
   1   86.4        86.3    87.7   0.20252   X
   1   78.9        78.7   196.3   0.25252   X
   2   92.5        92.4     2.2   0.15125   X X
   2   86.6        86.4    86.7   0.20188   X X
   3   92.6        92.4     3.2   0.15129   X X X
   3   92.6        92.3     3.5   0.15146   X X X
   4   92.6        92.4     4.2   0.15124   X X X X
   4   92.6        92.3     4.9   0.15174   X X X X
   5   92.7        92.3     6.0   0.15184   X X X X X

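The C-p column in this table can be reproduced from the S column alone, which helps in reading it. The sketch below assumes the usual definition of Mallows' C-p, namely SSE_p / MSE_full - (n - 2p), where p counts the fitted parameters including the constant and MSE_full comes from the five-predictor model. The S and C-p values are copied from the table; the function and variable names are illustrative only, and small differences arise because S is reported to five decimals.

```python
N_OBS = 112          # team-seasons in the data set
S_FULL = 0.15184     # S of the full five-predictor model

# (number of predictors, S, C-p as reported), in table order
ROWS = [
    (1, 0.20252, 87.7),
    (1, 0.25252, 196.3),
    (2, 0.15125, 2.2),
    (2, 0.20188, 86.7),
    (3, 0.15129, 3.2),
    (3, 0.15146, 3.5),
    (4, 0.15124, 4.2),
    (4, 0.15174, 4.9),
    (5, 0.15184, 6.0),
]

def mallows_cp(n_predictors, s, n=N_OBS, s_full=S_FULL):
    """C-p = SSE_p / MSE_full - (n - 2p), with p counting the constant."""
    p = n_predictors + 1
    sse = s ** 2 * (n - p)          # S^2 is SSE divided by its degrees of freedom
    return sse / s_full ** 2 - (n - 2 * p)

recomputed = [mallows_cp(k, s) for k, s, _ in ROWS]

for (k, s, reported), cp in zip(ROWS, recomputed):
    print(f"{k} predictors  S = {s:.5f}  C-p reported {reported:6.1f}  recomputed {cp:6.2f}")
```

A subset whose C-p is close to its parameter count (p) shows little sign of bias from the omitted variables. The best two-predictor model has C-p of about 2.2 against p = 3, whereas the best single-predictor model has C-p near 88.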
As suggested earlier, it definitely appears that OBP and SLG alone can provide a model
with excellent predictive power for R per game. In fact, a model with just these two
variables provides an adjusted R² of 92.4%. The fact that the adjusted R² remains virtually
unchanged (in fact, it even increased slightly, from 92.3% to 92.4%) with the elimination of the other three
variables indicates that these three are rather unimportant for a multiple regression fit.
Thus, let us re-run the regression using only OBP and SLG:
Regression Analysis: R per Game versus OBP, SLG

The regression equation is
R per Game = - 5.89 + 0.0216 OBP + 0.00823 SLG

Predictor         Coef     SE Coef        T       P
Constant       -5.8870      0.3206   -18.36   0.000
OBP           0.021644    0.001540    14.06   0.000
SLG          0.0082269   0.0008759     9.39   0.000

S = 0.1512   R-Sq = 92.5%   R-Sq(adj) = 92.4%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         2    30.767    15.383   672.49   0.000
Total            111    33.260

As stated earlier, the adjusted R² value of 92.4% for this new model indicates that it
accounts for much of the variability in runs scored per game. The very high F-statistic
indicates that this regression model for predicting R per game is overall very significant.
The interpretation of the coefficients is similar to before. Holding all else constant, a one
“point” increase in OBP (equivalent to one-tenth of one percent, or .001 in a team OBP) is
associated with an increase in the expected R per game for the team of 0.021644. A team
raising its OBP by 50 points would now expect an increase of over a run a game
(50 × 0.021644 = 1.0822 to be exact).
The standard error of the estimate is now 0.1512, implying that this model can
predict R per game to within ±0.3024 (±2 × 0.1512) about 95% of the time. Once again,
given that the range of R per game is about 2.7 R per game and the interquartile range was
about 0.8 R per game, the new model is a highly useful predictor of the dependent variable.
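To show how the two-predictor equation would be used in practice, here is a minimal prediction sketch. The coefficients and S are those of the OBP and SLG model above; the function name and the choice of an "average team" as the example input are illustrative. The band is the rough ±2 × S rule from the text, not an exact prediction interval.

```python
# Coefficients of the two-predictor model (OBP and SLG in points, i.e. x 1000)
INTERCEPT = -5.8870
COEF_OBP = 0.021644
COEF_SLG = 0.0082269
S = 0.1512                      # standard error of the estimate

def predict_runs_per_game(obp, slg):
    """Return (point prediction, lower, upper) using a rough +/- 2S band."""
    fit = INTERCEPT + COEF_OBP * obp + COEF_SLG * slg
    return fit, fit - 2 * S, fit + 2 * S

# Example: a team at the sample means reported in the descriptive statistics
fit, low, high = predict_runs_per_game(obp=340.36, slg=433.21)
print(f"predicted R per game: {fit:.4f}  (roughly {low:.2f} to {high:.2f})")
```

For a team at the sample means of OBP and SLG the prediction is about 5.04 runs per game, which matches the sample mean of R per game, as it should for a least squares fit with a constant term.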
VIII. Examining Unusual Observations
Since we have eliminated three variables, the outliers that appeared to exist before
may no longer be relevant. Let’s examine the new residual plots to see if we can find any
remaining ones. I will first plot the residuals versus each of the predicting variables. This
will be followed by a normal plot of the residuals and finally the residuals versus the fitted
values.
[Figure: Residuals Versus OBP]
[Figure: Residuals Versus SLG]
[Figure: Normal Probability Plot of the Residuals]
[Figure: Residuals Versus the Fitted Values]
The two outliers that I have circled on the extreme ends of the normal plot of the
residuals are observations 45 and 72, the 2000 Chicago White Sox and the 1998 Tampa
Bay Devil Rays, respectively.3 Since outliers can have a strong effect on the fitted
regression and the measures of fit, we can perhaps remove them from the data set and
repeat the analysis without them. However, we need to be sure to inform the reader about what we are
doing. I will eliminate these two observations to see if the model becomes even stronger:
Regression Analysis: R per Game versus OBP, SLG

The regression equation is
R per Game = - 5.73 + 0.0218 OBP + 0.00774 SLG

Predictor         Coef     SE Coef        T       P
Constant       -5.7261      0.3065   -18.68   0.000
OBP           0.021788    0.001459    14.94   0.000
SLG          0.0077434   0.0008392     9.23   0.000

S = 0.1432   R-Sq = 92.9%   R-Sq(adj) = 92.7%

Analysis of Variance
Source            DF        SS        MS        F       P
Regression         2    28.598    14.299   697.05   0.000
Total            109    30.793

Omitting those two teams did not change very much, although the model is slightly
stronger: the adjusted R² increased slightly (from 92.4% to 92.7%) and the standard error
of the estimate decreased slightly (from 0.1512 to 0.1432). While things did not improve much, if I were to present this model, I
would make clear that it does not apply to those two teams. Finally, since I omitted two
observations, I, in a sense, have an entirely new data set and thus need to again perform a
model selection process. However, a look at the t-statistics and p-values indicates that both
variables seem to have a high predictive power, with OBP more than SLG. There is no
need to change the model selection again. These two variables together provide the highest
predictive power in the simplest models (a new check of the best subsets confirmed this).
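In this paper the unusual observations were picked out by eye from the normal plot. A common numerical complement, sketched below, is to compute standardized residuals and leverage values and flag observations whose standardized residual exceeds about 2.5 in absolute value. Everything in the example is an assumption made for illustration: the data are made up (20 hypothetical teams, not the 112 team-seasons analyzed here), the 2.5 cutoff is a rule of thumb, and the function name is hypothetical.

```python
import numpy as np

def standardized_residuals(predictors, response):
    """Fit OLS with a constant; return (standardized residuals, leverages)."""
    X = np.column_stack([np.ones(len(response)), predictors])
    beta, *_ = np.linalg.lstsq(X, response, rcond=None)
    resid = response - X @ beta
    n, p = X.shape
    s = np.sqrt(resid @ resid / (n - p))      # standard error of the estimate
    Q, _ = np.linalg.qr(X)                    # hat matrix diagonal via QR
    leverage = np.sum(Q ** 2, axis=1)
    return resid / (s * np.sqrt(1.0 - leverage)), leverage

# Made-up data for 20 hypothetical teams (NOT the data set used in this paper)
i = np.arange(20)
obp = np.linspace(300, 374, 20)
slg = 430 + 30 * np.sin(i)
runs = -5.887 + 0.021644 * obp + 0.0082269 * slg + 0.05 * np.cos(3 * i)
runs[7] += 0.6                                # plant one unusual team

std_resid, leverage = standardized_residuals(np.column_stack([obp, slg]), runs)
flagged = np.flatnonzero(np.abs(std_resid) > 2.5)
print("flagged observations:", flagged)
```

Applied to real data, the flagged rows would be candidates for the kind of closer look given above to the 2000 White Sox and the 1998 Devil Rays. A flag alone is not a reason to drop an observation.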
3
The 1998 Devil Rays, an expansion team, had a much lower R per game than their team OBP and SLG
would suggest, which would naturally help explain their terrible 63-win inaugural season. The 2000 White
Sox, on the other hand, had a much higher R per game than their OBP and SLG would predict, helping them
to a division-winning, 95-67 record.
IX. Final Checking of Assumptions with Residual Plots
Because I have made several changes to the model since I last checked the regression
assumptions, I need to perform the check one final time before the analysis is complete.
[Figure: Histogram of the Residuals]
The distribution of the residuals seems fairly normal. I next will plot the residuals
versus each of the predicting variables to see if there is any apparent structure:
[Figure: Residuals Versus OBP]
[Figure: Residuals Versus SLG]
I do not see any apparent structures in the plot of the residuals versus each of the
predicting variables.
[Figure: Normal Probability Plot of the Residuals]
As the plot above indicates, the residuals are roughly normally distributed.
Now I will examine the residuals versus the fitted values plot to validate the next
two assumptions:

[Figure: Residuals Versus the Fitted Values]
There does not appear to be any structure to the errors as no apparent patterns are seen. In
addition, the plot shows constant variance and thus the homoscedasticity assumption
appears fine.
Finally, the plot of the residuals versus the order of the data should ensure that no ε_i
terms are related to each other:
[Figure: Residuals Versus the Order of the Data]
Once again, no patterns are apparent and we can conclude that this assumption has
not been violated.
X. Conclusion
This analysis has provided ample evidence that Billy Beane knows what he is
doing. As we have seen, SB and SH per game have virtually no correlation with R per
game when acting by themselves and do not add more prediction value when working
together with the other variables. On the other hand, we saw how OBP, SLG and AVG
have a strong positive correlation with R per game. However, of those three statistics, the
traditional and most often used AVG had the lowest correlation. Furthermore, when we
examined how these variables work together in predicting R per game, we noticed that
OBP and SLG alone could do just as good a job predicting R per game as a model that included
AVG.4 In fact, my regression analysis has confirmed that R per game can be predicted
very well by OBP and SLG, with about 92.9% of the variability in R per game accounted
for by these two variables.
What practical implications does the analysis provide? Well, perhaps a GM would
do best to focus on his team’s OBP and SLG rather than AVG to improve his team’s run
output for the season. In addition, since SB and SH seem insignificant, focusing on speed
on the bases and “small ball” may be counterproductive in a team’s ultimate offensive
goal: scoring runs. However, I hesitate to conclude that these factors should be ignored. In
addition to the important fact that correlation does not imply causation, there is another
angle that was not addressed. My analysis looked for correlations between these factors
and total team runs over the course of a season (expressed as R per game). There is an
argument that in certain situations where you are playing for one run (e.g., a tying, go-ahead
or “insurance” run in a late inning), these methods can increase the probability of
scoring that one particular run. While this one run may come at the expense of maximizing
your total team runs scored, it may be beneficial to winning that game. I am skeptical of
this argument and have read analyses that have attempted to disprove this line of
reasoning. However, an examination of this topic goes beyond the scope of this paper.
4
To revisit an earlier discussion from footnote 2, the analysis also showed how, holding all else constant and
among just those variables in the final regression model, OBP has more predicting power than SLG. Thus, it
was not necessary to run a separate regression model that replaces OBP and SLG with OPS.