Garrett Herr
STAT 462
For many years, people in baseball have tried to determine which batting statistic matters most
in producing wins for a team. Many hypotheses have been proposed over the years, built on
statistics such as batting average, home runs hit, and runs scored. In recent years, sabermetric
measures have become more popular for predicting wins. These include OPS (On Base plus
Slugging Percentage), BABIP (Batting Average on Balls in Play), and WAR (Wins Above
Replacement). WAR will not be used in this study because it does not make sense in the team
setting. The goal of my study is to find which team batting statistics fit best into a model that
predicts the number of wins for a team in a given year. I will first run a correlation test to find
which statistics are already highly correlated with wins and will remove them from the study.
This is done because strongly correlated variables can adversely affect the predictive power of
the regression model. After removing the highly correlated variables, I will run a forward
stepwise regression to find which variables fit together best in a model. I will then run a multiple
linear regression on the selected variables and analyze the results. The data are the 2012 hitting
statistics for each team in MLB: at bats (AB), runs (R), hits (H), doubles (2B), triples (3B),
home runs (HR), total bases (TB), runs batted in (RBI), batting average (AVG), on base
percentage (OBP), slugging percentage (SLG), on base plus slugging percentage (OPS),
strikeouts (K), and walks (BB). The data come from espn.com.
There are many different views today about which batting statistic best helps a team win
games. Many people hold the classical view that batting average is the best gauge of how well a
team is doing, but newer sabermetric statistics have recently become more popular for that
purpose. I tend to side with the newer way of thinking, so my hypothesis is that OBP and OPS
will be the best predictors of a team's win total. I will use Minitab software for the analysis.
My first step was a correlation analysis, which produced the following output:
        Wins      AB       R       H      2B      3B      HR      TB     RBI
AB     0.192
       0.309
R      0.531   0.621
       0.003   0.000
H      0.267   0.799   0.765
       0.153   0.000   0.000
2B     0.144   0.580   0.478   0.663
       0.448   0.001   0.008   0.000
3B    -0.107  -0.005  -0.043   0.180   0.086
       0.575   0.979   0.823   0.341   0.653
HR     0.449   0.250   0.647   0.171   0.017  -0.451
       0.013   0.183   0.000   0.367   0.930   0.012
TB     0.459   0.702   0.929   0.797   0.553  -0.058   0.711
       0.011   0.000   0.000   0.000   0.002   0.762   0.000
RBI    0.537   0.603   0.995   0.757   0.457  -0.070   0.656   0.924
       0.002   0.000   0.000   0.000   0.011   0.714   0.000   0.000
AVG    0.268   0.684   0.749   0.985   0.637   0.221   0.134   0.765   0.745
       0.152   0.000   0.000   0.000   0.000   0.242   0.480   0.000   0.000
OBP    0.412   0.472   0.787   0.832   0.518   0.144   0.204   0.699   0.794
       0.024   0.009   0.000   0.000   0.003   0.448   0.279   0.000   0.000
SLG    0.488   0.580   0.924   0.732   0.503  -0.067   0.758   0.987   0.923
       0.006   0.001   0.000   0.000   0.005   0.725   0.000   0.000   0.000
OPS    0.498   0.585   0.947   0.818   0.544   0.003   0.627   0.962   0.947
       0.005   0.001   0.000   0.000   0.002   0.987   0.000   0.000   0.000
K      0.081  -0.458  -0.257  -0.643  -0.280  -0.127   0.289  -0.219  -0.276
       0.669   0.011   0.170   0.000   0.134   0.502   0.121   0.244   0.139
BB     0.364  -0.298   0.144  -0.189  -0.120  -0.089   0.077  -0.083   0.153
       0.048   0.110   0.447   0.317   0.528   0.641   0.685   0.662   0.421

Cell contents: Pearson correlation (top), p-value (bottom)
Using only the first column, which shows how correlated each predictor variable is with the
number of wins, I removed every variable that I deemed too highly correlated with wins. I based
this decision on the p-value at an alpha level of .05: any variable with a p-value below .05 was
removed. After this step, I was left with the variables AB, H, 2B, 3B, AVG, and K. This is a good
number of variables to use in a regression analysis because, with too many variables in the
equation, it is harder to form a good conclusion: there are more interactions among the
predictors, and they can negatively affect the adjusted R-squared value.
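The screening rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated (keep a variable when its p-value against Wins is at or above .05), not the Minitab procedure itself. The p-values are copied from the Wins column of the correlation output; the names `p_vs_wins`, `ALPHA`, `N_TEAMS`, and `correlation_t` are my own assumptions. The helper shows the standard t statistic behind each p-value, t = r * sqrt(n - 2) / sqrt(1 - r^2), assuming n = 30 teams.

```python
import math

ALPHA = 0.05   # significance level used in the paper
N_TEAMS = 30   # assumed sample size: one observation per MLB team

# p-values for each statistic's correlation with Wins (Wins column of the output)
p_vs_wins = {
    "AB": 0.309, "R": 0.003, "H": 0.153, "2B": 0.448, "3B": 0.575,
    "HR": 0.013, "TB": 0.011, "RBI": 0.002, "AVG": 0.152, "OBP": 0.024,
    "SLG": 0.006, "OPS": 0.005, "K": 0.669, "BB": 0.048,
}

def correlation_t(r, n):
    """t statistic for testing whether a Pearson correlation r differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# Keep only the statistics that are NOT significantly correlated with Wins
kept = [name for name, p in p_vs_wins.items() if p >= ALPHA]
print(kept)

# Example: the AB vs. Wins correlation of 0.192
print(round(correlation_t(0.192, N_TEAMS), 2))
```

The list that comes out is the same set of six variables carried into the regression: AB, H, 2B, 3B, AVG, and K.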
Next, I ran a forward stepwise regression to see which variables would be best to include in
a model. I used an alpha level of .05, and Minitab reported that the best model includes all six
variables. I then ran a multiple linear regression in Minitab with all six as predictor variables
and got this output:
Regression Analysis: Wins versus AB, H, 2B, 3B, AVG, K
The regression equation is
Wins = - 1144 + 0.178 AB - 0.66 H - 0.088 2B - 0.234 3B + 4417 AVG + 0.0674 K

Predictor      Coef  SE Coef      T      P
Constant      -1144     2350  -0.49  0.631
AB           0.1783   0.4224   0.42  0.677
H            -0.664    1.595  -0.42  0.681
2B          -0.0882   0.1221  -0.72  0.477
3B          -0.2344   0.2311  -1.01  0.321
AVG            4417     8840   0.50  0.622
K           0.06742  0.03311   2.04  0.053

S = 11.6600   R-Sq = 24.3%   R-Sq(adj) = 4.5%

Analysis of Variance

Source          DF      SS     MS     F      P
Regression       6  1003.0  167.2  1.23  0.328
Residual Error  23  3127.0  136.0
Total           29  4130.0
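
The summary statistics in this output can be re-derived from the Analysis of Variance table alone, which is a useful sanity check on the numbers. The sketch below uses the standard definitions (R-Sq = SS regression / SS total, adjusted R-Sq from the mean squares, F = MS regression / MS error, S = square root of MS error); the variable names are my own.

```python
import math

# Values copied from the Analysis of Variance table
ss_regression, df_regression = 1003.0, 6
ss_error, df_error = 3127.0, 23
ss_total, df_total = 4130.0, 29

# Mean squares: sum of squares divided by its degrees of freedom
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

r_sq = ss_regression / ss_total                    # share of variation explained
r_sq_adj = 1 - ms_error / (ss_total / df_total)    # penalized for the 6 predictors
f_stat = ms_regression / ms_error                  # overall F statistic
s = math.sqrt(ms_error)                            # standard error of the regression

print(f"R-Sq = {100 * r_sq:.1f}%")
print(f"R-Sq(adj) = {100 * r_sq_adj:.1f}%")
print(f"F = {f_stat:.2f}")
print(f"S = {s:.2f}")
```

The gap between R-Sq and adjusted R-Sq comes from the penalty for fitting six predictors to only 30 observations.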
[Figure: Residual Plots for Wins, four panels: normal probability plot (Percent vs. Residual), Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), and Versus Order (Residual vs. Observation Order).]
First, I will look at the slope coefficients to see how each variable affects the predicted
wins. Since the slope coefficients for H, 2B, and 3B are negative, the model predicts that,
holding the other variables fixed, the more hits, doubles, and triples a team gets, the fewer
games it will win. Since the slope coefficients for AB, AVG, and K are positive, the model
predicts more wins as each of these increases.
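
To make the meaning of the slopes concrete, the fitted equation can be applied to one team. The sketch below plugs Oakland's 2012 values from the data table into the equation and then raises strikeouts by 100 while holding everything else fixed. The coefficients are the rounded values printed by Minitab, so the prediction is only approximate; `predict_wins`, `COEF`, and `INTERCEPT` are names I made up for this illustration.

```python
# Coefficients copied from the Minitab coefficient table (rounded as printed)
INTERCEPT = -1144
COEF = {
    "AB": 0.1783, "H": -0.664, "2B": -0.0882,
    "3B": -0.2344, "AVG": 4417, "K": 0.06742,
}

def predict_wins(team):
    """Apply the fitted regression equation to one team's statistics."""
    return INTERCEPT + sum(COEF[name] * team[name] for name in COEF)

# Oakland's 2012 values for the six predictors, taken from the data table
oakland = {"AB": 5527, "H": 1315, "2B": 267, "3B": 32, "AVG": 0.238, "K": 1387}

# Same team with 100 more strikeouts and nothing else changed
oakland_more_k = dict(oakland, K=oakland["K"] + 100)

base = predict_wins(oakland)
change = predict_wins(oakland_more_k) - base
print(f"Predicted wins: {base:.1f}")
print(f"Change from 100 extra strikeouts: {change:+.2f}")
```

Because the model is linear, the change from 100 extra strikeouts is just 100 times the K coefficient, whichever team is used as the starting point.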
Now looking at the p-values, the only slope coefficient that comes close to statistical
significance is that of K (p = 0.053), which is just above the .05 alpha level. In this model,
strikeouts are therefore the predictor with the strongest evidence of a nonzero slope. All of the
other p-values are very high, so those variables are not good predictors. Also, the R-squared
value for the model is very low at 24.3%, which means that these six variables account for only
24.3% of the variation in wins in a year, and the adjusted R-squared is only 4.5%. The overall
F-test (F = 1.23, p = 0.328) shows that the model as a whole is not statistically significant. The
one good thing is that the residuals look very good and would be acceptable if the model itself
were statistically significant.
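
The T column in the regression output is each coefficient divided by its standard error, and the p-value follows from comparing that ratio with a t distribution on 23 degrees of freedom (the Residual Error DF). The sketch below recomputes the ratios and compares them with 2.069, the two-sided 5% critical value for 23 degrees of freedom from a standard t table. The names `rows` and `CRITICAL_T` are my own.

```python
# (Coef, SE Coef, T as printed) copied from the coefficient table
rows = {
    "Constant": (-1144, 2350, -0.49),
    "AB": (0.1783, 0.4224, 0.42),
    "H": (-0.664, 1.595, -0.42),
    "2B": (-0.0882, 0.1221, -0.72),
    "3B": (-0.2344, 0.2311, -1.01),
    "AVG": (4417, 8840, 0.50),
    "K": (0.06742, 0.03311, 2.04),
}

CRITICAL_T = 2.069  # two-sided 5% critical value, 23 degrees of freedom

t_values = {}
significant = {}
for name, (coef, se, t_printed) in rows.items():
    t = coef / se                      # t statistic = coefficient / standard error
    t_values[name] = t
    significant[name] = abs(t) > CRITICAL_T
    print(f"{name:8s} T = {t:6.2f} (printed {t_printed:5.2f})  "
          f"significant at .05: {significant[name]}")
```

The ratio for K falls just short of the critical value, which matches its p-value of 0.053.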
Through this research, it is easy to see why there are still many questions in the world of
baseball about which batting statistic is most important. The results I found were not
statistically significant. One of the main things I did see, though, was that some of the best
predictors might actually be the variables that were not included in the regression analysis.
Since they were highly correlated with wins, there is a chance that they would predict wins
better, but they were left out of the analysis because of that correlation. One thing that struck
me was that, in the multiple linear regression, the number of strikeouts had a positive
relationship with the number of wins. This runs against conventional thinking, because most
people in baseball would agree that a strikeout is one of the worst outcomes of an at bat: it
doesn't allow runners to advance and it doesn't give the defense a chance to make an error. The
p-value for the slope of K (0.053) was the closest to statistical significance, so the one thing I
could tentatively pull out of this analysis is that the more a team strikes out, the more wins that
team will have. If I could change anything about my analysis, I would use data from more years
to hopefully make my conclusions more significant.
Team  Wins    AB    R     H   2B  3B   HR    TB  RBI    AVG    OBP    SLG    OPS     K   BB
OAK     94  5527  713  1315  267  32  195  2231  676  0.238  0.310  0.404  0.714  1387  550
HOU     55  5407  583  1276  238  28  146  2008  545  0.236  0.302  0.371  0.673  1365  463
PIT     79  5412  651  1313  241  37  170  2138  620  0.243  0.304  0.395  0.699  1354  444
WAS     98  5615  731  1468  301  25  194  2401  688  0.261  0.322  0.428  0.750  1325  479
TB      90  5398  697  1293  250  30  175  2128  665  0.240  0.317  0.394  0.711  1323  571
BAL     93  5560  712  1375  270  16  214  2319  677  0.247  0.311  0.417  0.728  1315  480
ATL     94  5425  700  1341  263  30  149  2111  660  0.247  0.320  0.389  0.709  1289  567
CIN     97  5477  669  1377  296  30  172  2249  636  0.251  0.315  0.411  0.726  1266  481
ARI     81  5462  734  1416  307  33  165  2284  710  0.259  0.328  0.418  0.746  1266  539
SEA     75  5494  619  1285  241  27  149  2027  584  0.234  0.296  0.369  0.665  1259  466
TOR     73  5487  716  1346  247  22  198  2231  677  0.245  0.309  0.407  0.716  1251  473
NYM     74  5450  650  1357  286  21  139  2102  625  0.249  0.316  0.386  0.701  1250  503
MIL     83  5557  776  1442  300  39  202  2426  741  0.259  0.325  0.437  0.762  1240  466
SD      76  5422  651  1339  272  43  121  2060  610  0.247  0.319  0.380  0.699  1238  539
CHC     61  5411  613  1297  265  36  137  2045  570  0.240  0.302  0.378  0.680  1235  447
MIA     69  5437  609  1327  261  39  137  2077  576  0.244  0.308  0.382  0.690  1228  484
COL     64  5577  758  1526  306  52  166  2434  716  0.274  0.330  0.436  0.766  1213  450
CWS     85  5518  748  1409  228  29  211  2328  726  0.255  0.318  0.422  0.740  1203  461
BOS     69  5604  734  1459  339  16  165  2325  695  0.260  0.315  0.415  0.730  1197  428
STL     88  5622  765  1526  290  37  159  2367  732  0.271  0.338  0.421  0.759  1192  533
NYY     95  5524  804  1462  280  13  245  2503  774  0.265  0.337  0.453  0.790  1176  565
LAD     86  5438  637  1369  269  23  116  2032  607  0.252  0.317  0.374  0.690  1156  481
LAA     89  5536  767  1518  273  22  187  2396  732  0.274  0.332  0.433  0.764  1113  449
TEX     93  5590  808  1526  303  32  200  2493  780  0.273  0.334  0.446  0.780  1103  478
DET     88  5476  726  1467  279  39  163  2313  698  0.268  0.335  0.422  0.757  1103  511
SFG     94  5558  718  1495  287  57  103  2205  675  0.269  0.327  0.397  0.724  1097  483
PHI     81  5544  684  1414  271  28  158  2215  659  0.255  0.317  0.400  0.716  1094  454
CLE     68  5525  667  1385  266  24  136  2107  635  0.251  0.324  0.381  0.705  1087  555
MIN     66  5562  701  1448  270  30  131  2171  667  0.260  0.325  0.390  0.715  1069  505
KC      72  5636  676  1492  295  37  131  2254  643  0.265  0.317  0.400  0.716  1032  404

AVG, OBP, SLG, and OPS also appear in the data set multiplied by 1000 (for example, 0.238 as 238).