
Predicting Wins In Baseball

By Garrett Herr

Garrett Herr
STAT 462

For many years, people in baseball have tried to determine which batting statistic matters most
in producing wins for a team. Many candidates have been proposed over the years, including
batting average, home runs hit, and runs scored. In recent years, sabermetric measures have
become more popular for predicting wins. These include OPS (on-base plus slugging percentage),
BABIP (batting average on balls in play), and WAR (wins above replacement). WAR will not be
used in this study because it does not make sense in a team setting. The goal of my study is to
determine which team batting statistics fit best into a model that predicts the number of wins
for a team in a given year. I will first run a correlation test to find which statistics are
already highly correlated with wins and remove them from the study. I do this because strongly
correlated variables can adversely affect the predictive power of the regression model. After
removing the highly correlated variables, I will run a forward stepwise regression to find which
variables fit best together in a model. I will then run a multiple linear regression on those
variables and analyze the results. The data are the 2012 hitting statistics for each team in
MLB. I will include at bats (AB), runs (R), hits (H), doubles (2B), triples (3B), home runs
(HR), total bases (TB), runs batted in (RBI), batting average (AVG), on-base percentage (OBP),
slugging percentage (SLG), on-base plus slugging percentage (OPS), strikeouts (K), and walks
(BB). The data come from espn.com.
There are many opinions today about which batting statistic best helps a team win games. Many
people hold the classical view that batting average is the best gauge of how well a team is
doing, but newer sabermetric statistics have recently become more popular for that purpose. I
tend to side with the newer way of thinking. My hypothesis is that OBP and OPS will be the best
predictors of a team's win total. I will use Minitab software for my analysis.
My first step was a correlation analysis, which produced this output:
        Wins      AB       R       H      2B      3B      HR      TB     RBI
AB     0.192
       0.309

R      0.531   0.621
       0.003   0.000

H      0.267   0.799   0.765
       0.153   0.000   0.000

2B     0.144   0.580   0.478   0.663
       0.448   0.001   0.008   0.000

3B    -0.107  -0.005  -0.043   0.180   0.086
       0.575   0.979   0.823   0.341   0.653

HR     0.449   0.250   0.647   0.171   0.017  -0.451
       0.013   0.183   0.000   0.367   0.930   0.012

TB     0.459   0.702   0.929   0.797   0.553  -0.058   0.711
       0.011   0.000   0.000   0.000   0.002   0.762   0.000

RBI    0.537   0.603   0.995   0.757   0.457  -0.070   0.656   0.924
       0.002   0.000   0.000   0.000   0.011   0.714   0.000   0.000

AVG    0.268   0.684   0.749   0.985   0.637   0.221   0.134   0.765   0.745
       0.152   0.000   0.000   0.000   0.000   0.242   0.480   0.000   0.000

OBP    0.412   0.472   0.787   0.832   0.518   0.144   0.204   0.699   0.794
       0.024   0.009   0.000   0.000   0.003   0.448   0.279   0.000   0.000

SLG    0.488   0.580   0.924   0.732   0.503  -0.067   0.758   0.987   0.923
       0.006   0.001   0.000   0.000   0.005   0.725   0.000   0.000   0.000

OPS    0.498   0.585   0.947   0.818   0.544   0.003   0.627   0.962   0.947
       0.005   0.001   0.000   0.000   0.002   0.987   0.000   0.000   0.000

K      0.081  -0.458  -0.257  -0.643  -0.280  -0.127   0.289  -0.219  -0.276
       0.669   0.011   0.170   0.000   0.134   0.502   0.121   0.244   0.139

BB     0.364  -0.298   0.144  -0.189  -0.120  -0.089   0.077  -0.083   0.153
       0.048   0.110   0.447   0.317   0.528   0.641   0.685   0.662   0.421

Cell contents: Pearson correlation
               P-value
Using only the correlations with Wins to see how strongly each predictor variable was related to
the number of wins, I removed every variable that I deemed too highly correlated. I based this
decision on the p-value at an alpha level of .05, removing any variable whose p-value fell below
that level. After this step, I was left with the variables AB, H, 2B, 3B, AVG, and K. Six is a
reasonable number of variables for a regression analysis, because with too many variables in
the equation it becomes harder to draw a sound conclusion: the predictors interact more with one
another, which can hurt the adjusted R-squared value.
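The screening step can be reproduced outside Minitab as a rough check. The sketch below is an illustration, not the Minitab procedure itself: it converts each correlation with Wins into a t statistic, t = r * sqrt(n - 2) / sqrt(1 - r^2), and compares it with 2.048, the two-sided 5% critical value of the t distribution with 28 degrees of freedom (n = 30 teams). The function names `t_statistic` and `screen` are made up for this example.

```python
import math

N_TEAMS = 30        # one observation per MLB team in 2012
T_CRITICAL = 2.048  # two-sided 5% critical value, Student's t with 28 df

# Pearson correlation of each statistic with Wins, taken from the output above.
corr_with_wins = {
    "AB": 0.192, "R": 0.531, "H": 0.267, "2B": 0.144, "3B": -0.107,
    "HR": 0.449, "TB": 0.459, "RBI": 0.537, "AVG": 0.268, "OBP": 0.412,
    "SLG": 0.488, "OPS": 0.498, "K": 0.081, "BB": 0.364,
}


def t_statistic(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def screen(correlations, n, t_crit):
    """Split statistics into kept (not significant) and dropped (significant)."""
    kept, dropped = [], []
    for name, r in correlations.items():
        if abs(t_statistic(r, n)) > t_crit:
            dropped.append(name)
        else:
            kept.append(name)
    return kept, dropped


kept, dropped = screen(corr_with_wins, N_TEAMS, T_CRITICAL)
print("kept:", kept)
print("dropped:", dropped)
```

The kept list matches the six variables carried into the regression. BB is the closest call: its t statistic is only slightly above the critical value, which agrees with its p-value of 0.048.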
Next, I ran a forward stepwise regression to see which variables would be best to include in a
model. Using an alpha level of .05 in Minitab, I got the result that the best model includes all
six variables. I then went back into Minitab and ran a multiple linear regression with all six
as predictor variables, which gave this output:
Regression Analysis: Wins versus AB, H, 2B, 3B, AVG, K

The regression equation is
Wins = - 1144 + 0.178 AB - 0.66 H - 0.088 2B - 0.234 3B + 4417 AVG + 0.0674 K

Predictor      Coef  SE Coef      T      P
Constant      -1144     2350  -0.49  0.631
AB           0.1783   0.4224   0.42  0.677
H            -0.664    1.595  -0.42  0.681
2B          -0.0882   0.1221  -0.72  0.477
3B          -0.2344   0.2311  -1.01  0.321
AVG            4417     8840   0.50  0.622
K           0.06742  0.03311   2.04  0.053

S = 11.6600   R-Sq = 24.3%   R-Sq(adj) = 4.5%

Analysis of Variance

Source          DF      SS     MS     F      P
Regression       6  1003.0  167.2  1.23  0.328
Residual Error  23  3127.0  136.0
Total           29  4130.0
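
The summary statistics in this output all follow from the sums of squares and degrees of freedom in the Analysis of Variance table. The short sketch below recomputes them as a consistency check. It uses only the numbers printed by Minitab, and the variable names are chosen for this example.

```python
import math

# Values read from the Analysis of Variance table.
ss_regression = 1003.0
ss_error = 3127.0
df_regression = 6
df_error = 23

ss_total = ss_regression + ss_error   # 4130.0
df_total = df_regression + df_error   # 29

ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

# Share of the variation in wins explained by the model.
r_squared = ss_regression / ss_total

# Adjusted R-squared penalizes the model for the number of predictors.
adj_r_squared = 1 - ms_error / (ss_total / df_total)

# Overall F statistic and the standard error of the regression (S).
f_statistic = ms_regression / ms_error
s = math.sqrt(ms_error)

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {adj_r_squared:.1%}")
print(f"F = {f_statistic:.2f}, S = {s:.2f}")
```

The recomputed values agree with the reported R-Sq of 24.3%, R-Sq(adj) of 4.5%, F of 1.23, and S of 11.66. The large drop from R-Sq to R-Sq(adj) reflects six predictors being fitted to only 30 observations.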

[Figure: Residual Plots for Wins. Four panels: Normal Probability Plot (Percent vs. Residual),
Versus Fits (Residual vs. Fitted Value), Histogram (Frequency vs. Residual), and Versus Order
(Residual vs. Observation Order).]

First, I will look at the slope coefficients to see what each variable does to the predicted
wins. The slope coefficients for H, 2B, and 3B are negative, so the model predicts fewer wins as
a team collects more hits, doubles, and triples, holding the other variables fixed. The slope
coefficients for AB, AVG, and K are positive, so the model predicts more wins as each of these
increases.
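
To make the slope interpretation concrete, the sketch below plugs one team into the fitted equation and then adds 100 strikeouts while holding everything else fixed. It uses the coefficients from the Minitab output and Oakland's row from Appendix A. The function name `predict_wins` is made up for this example, and because the printed coefficients are rounded, the fitted value is approximate.

```python
# Coefficients from the Minitab regression output (rounded as printed).
INTERCEPT = -1144
COEFFICIENTS = {
    "AB": 0.1783,
    "H": -0.664,
    "2B": -0.0882,
    "3B": -0.2344,
    "AVG": 4417,
    "K": 0.06742,
}


def predict_wins(team):
    """Fitted wins: intercept plus the sum of slope times statistic."""
    return INTERCEPT + sum(COEFFICIENTS[name] * team[name] for name in COEFFICIENTS)


# Oakland's 2012 totals as listed in Appendix A.
oakland = {"AB": 5527, "H": 1315, "2B": 267, "3B": 32, "AVG": 0.238, "K": 1387}

baseline = predict_wins(oakland)

# Same team with 100 more strikeouts and every other statistic unchanged.
more_strikeouts = predict_wins({**oakland, "K": oakland["K"] + 100})

print(f"fitted wins: {baseline:.1f}")
print(f"change from 100 more strikeouts: {more_strikeouts - baseline:+.2f}")
```

For Oakland the fitted value is about 82 wins against an actual total of 94, and 100 extra strikeouts raise the prediction by about 6.7 wins, which is simply 100 times the K slope.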
Looking at the p-values, the only slope coefficient that comes close to statistical significance
is the one for K (p = 0.053), and even it falls just above the .05 level. Of the six predictors,
strikeouts are therefore the one whose slope is least likely to be zero. All of the other
p-values are very high, so those variables are not good predictors. The R-squared value for the
model is also very low at 24.3%, meaning that these six variables account for only 24.3% of the
variation in wins in a year, and the adjusted R-squared is just 4.5%. The overall F-test
(p = 0.328) shows that the model is not statistically significant. The one good thing is that
the residuals look very good and would be adequate if the model had produced statistically
significant results.
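
Each T value in the coefficient table is the coefficient divided by its standard error, and a slope is significant at the .05 level only if its T value exceeds about 2.069 in absolute value (the two-sided critical value of the t distribution with 23 error degrees of freedom). The sketch below is a rough check built on the printed Minitab numbers. The variable names are chosen for this example.

```python
T_CRITICAL = 2.069  # two-sided 5% critical value, Student's t with 23 df

# (coefficient, standard error) pairs from the Minitab coefficient table.
coef_table = {
    "Constant": (-1144, 2350),
    "AB": (0.1783, 0.4224),
    "H": (-0.664, 1.595),
    "2B": (-0.0882, 0.1221),
    "3B": (-0.2344, 0.2311),
    "AVG": (4417, 8840),
    "K": (0.06742, 0.03311),
}

# T value for each row: coefficient divided by its standard error.
t_values = {name: coef / se for name, (coef, se) in coef_table.items()}

# Rows whose slope differs from zero at the .05 level.
significant = [name for name, t in t_values.items() if abs(t) > T_CRITICAL]

for name, t in t_values.items():
    print(f"{name:<8} T = {t:6.2f}")
print("significant at .05:", significant)
```

The T value for K works out to about 2.04, just short of the critical value, which matches its p-value of 0.053.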


Through this research, it is easy to see why there are still many questions in the world of
baseball about which batting statistic is the most important. The results that I found were not
statistically significant in any shape or form. One of the main things that I did see, though,
was that some of the best predictors might actually be the variables that were excluded from the
regression analysis. Because they were highly correlated with wins, there is a chance that they
would predict wins better, but they were left out of the analysis because of that correlation.
One thing that struck me about my analysis was that, in the multiple linear regression, the
number of strikeouts had a positive relationship with the number of wins a team has. This runs
against conventional thinking, because most people in baseball would agree that a strikeout is
one of the worst outcomes of an at bat: it doesn't allow runners to advance and it doesn't give
the defense a chance to make an error. The p-value for the slope of K (0.053) was also the
closest to statistical significance, so the one thing that I could tentatively pull out of this
analysis is that the more a team strikes out, the more games it wins. If I could change anything
about my analysis, I would use data from more years to hopefully make my conclusions more
significant.


Appendix A: Data Used from espn.com


Wins  TEAM    AB    R     H   2B  3B   HR    TB  RBI    AVG    OBP    SLG    OPS  AVG  OBP  SLG  OPS     K   BB
  94  OAK   5527  713  1315  267  32  195  2231  676  0.238  0.310  0.404  0.714  238  310  404  714  1387  550
  55  HOU   5407  583  1276  238  28  146  2008  545  0.236  0.302  0.371  0.673  236  302  371  673  1365  463
  79  PIT   5412  651  1313  241  37  170  2138  620  0.243  0.304  0.395  0.699  243  304  395  699  1354  444
  98  WAS   5615  731  1468  301  25  194  2401  688  0.261  0.322  0.428  0.750  261  322  428  750  1325  479
  90  TB    5398  697  1293  250  30  175  2128  665  0.240  0.317  0.394  0.711  240  317  394  711  1323  571
  93  BAL   5560  712  1375  270  16  214  2319  677  0.247  0.311  0.417  0.728  247  311  417  728  1315  480
  94  ATL   5425  700  1341  263  30  149  2111  660  0.247  0.320  0.389  0.709  247  320  389  709  1289  567
  97  CIN   5477  669  1377  296  30  172  2249  636  0.251  0.315  0.411  0.726  251  315  411  726  1266  481
  81  ARI   5462  734  1416  307  33  165  2284  710  0.259  0.328  0.418  0.746  259  328  418  746  1266  539
  75  SEA   5494  619  1285  241  27  149  2027  584  0.234  0.296  0.369  0.665  234  296  369  665  1259  466
  73  TOR   5487  716  1346  247  22  198  2231  677  0.245  0.309  0.407  0.716  245  309  407  716  1251  473
  74  NYM   5450  650  1357  286  21  139  2102  625  0.249  0.316  0.386  0.701  249  316  386  701  1250  503
  83  MIL   5557  776  1442  300  39  202  2426  741  0.259  0.325  0.437  0.762  259  325  437  762  1240  466
  76  SD    5422  651  1339  272  43  121  2060  610  0.247  0.319  0.380  0.699  247  319  380  699  1238  539
  61  CHC   5411  613  1297  265  36  137  2045  570  0.240  0.302  0.378  0.680  240  302  378  680  1235  447
  69  MIA   5437  609  1327  261  39  137  2077  576  0.244  0.308  0.382  0.690  244  308  382  690  1228  484
  64  COL   5577  758  1526  306  52  166  2434  716  0.274  0.330  0.436  0.766  274  330  436  766  1213  450
  85  CWS   5518  748  1409  228  29  211  2328  726  0.255  0.318  0.422  0.740  255  318  422  740  1203  461
  69  BOS   5604  734  1459  339  16  165  2325  695  0.260  0.315  0.415  0.730  260  315  415  730  1197  428
  88  STL   5622  765  1526  290  37  159  2367  732  0.271  0.338  0.421  0.759  271  338  421  759  1192  533
  95  NYY   5524  804  1462  280  13  245  2503  774  0.265  0.337  0.453  0.790  265  337  453  790  1176  565
  86  LAD   5438  637  1369  269  23  116  2032  607  0.252  0.317  0.374  0.690  252  317  374  690  1156  481
  89  LAA   5536  767  1518  273  22  187  2396  732  0.274  0.332  0.433  0.764  274  332  433  764  1113  449
  93  TEX   5590  808  1526  303  32  200  2493  780  0.273  0.334  0.446  0.780  273  334  446  780  1103  478
  88  DET   5476  726  1467  279  39  163  2313  698  0.268  0.335  0.422  0.757  268  335  422  757  1103  511
  94  SFG   5558  718  1495  287  57  103  2205  675  0.269  0.327  0.397  0.724  269  327  397  724  1097  483
  81  PHI   5544  684  1414  271  28  158  2215  659  0.255  0.317  0.400  0.716  255  317  400  716  1094  454
  68  CLE   5525  667  1385  266  24  136  2107  635  0.251  0.324  0.381  0.705  251  324  381  705  1087  555
  66  MIN   5562  701  1448  270  30  131  2171  667  0.260  0.325  0.390  0.715  260  325  390  715  1069  505
  72  KC    5636  676  1492  295  37  131  2254  643  0.265  0.317  0.400  0.716  265  317  400  716  1032  404

The second set of AVG, OBP, SLG, and OPS columns repeats the same values multiplied by 1000.
