
A Hidden Markov Model of Run Scoring in Baseball

Ankur Agarwal
Department of Statistics
Rutgers University
December 20, 2005
Abstract
The goal of a baseball team is to score more runs (points) than its opponent. Accurately
modeling and predicting the run scoring process for a given team can help that team's
manager decide whether he is going to achieve this goal. We design a Hidden Markov
Model (HMM) where the state observations are runs scored, and we simulate it to show
that it is close to a complete model of the actual run scoring process. We compare its
resultant accuracy to mathematical formulas (e.g. linear regression) that estimate a team's
runs scored (dependent variable) based on relevant statistical data about the team's actual
performance (independent variables). Finally, we propose improvements to the HMM
based on the incorporation of minor (yet important) factors that impact the run scoring
process. We also discuss some theoretical and practical applications of this HMM,
including player evaluation and prediction of future performance.

Introduction
Baseball, like any game, is full of uncertainty. Will the batter hit a homerun or a single (or
something else)? How many runs will a certain team score? I will attempt to answer the
second question, which depends in part on the first (and more fundamental) question,
with a Markov model of the run scoring process. Formally, we can describe this model as
a Hidden Markov Model (HMM) where the observation in a given state is the number of
runs scored in that state. For those who know little (or nothing) about the game of
baseball, I will soon give a descriptive example to illustrate the run scoring process.
However, I would first like to present the basic rules and gameplay in baseball:

(Picture from http://en.wikipedia.org/wiki/Baseball)

Baseball is played on a diamond-shaped surface (pictured above) with 4 bases, or
plates, on its corners. The base sequence is Home-1st-2nd-3rd-Home; when a player
called the batter completes this sequence (by returning Home), his team scores one
run, or point. The batter (or hitter) stands in one of the two batter's boxes, both
of which are adjacent to Home plate. The pitcher (a player on the opposing team) throws
the baseball towards Home plate with the hopes of preventing the batter from getting a
hit, which is either a single (1B), double (2B), triple (3B), or homerun (HR). The batter
advances to 1st base on a 1B, 2nd base on a 2B, 3rd base on a 3B, and Home on a HR. A
HR automatically scores one run for the batter's team. In addition, a walk (BB) or
hit-by-pitch (HBP) is similar to a 1B in that it also advances the batter to 1st base. A player
who advances to a certain base except Home is called a baserunner, or a runner on
1st/2nd/3rd base. For further details on the gameplay in baseball, let us turn to an illustration
of the run scoring process:
A baseball team's lineup (like the following example) contains nine batters, ordered
from 1 to 9. This sequence of batters is repeated as the game moves along. So after batter
x has completed his plate appearance, or opportunity to hit, the next batter is (x +
1), where x is some number from 1 to 8. After the 9th batter, we return to the start of the
sequence with the 1st batter.
NY Mets Team Lineup:
1. Edgardo Alfonzo
2. Derek Bell
3. Mike Piazza
4. Todd Zeile
5. Benny Agbayani
6. Robin Ventura
7. Jay Payton
8. Mike Bordick
9. Mike Hampton

The following sequence, called a game log, documents the progression of a sample
game with the example lineup above. There are usually nine innings in a game
(here I have displayed the first seven), each beginning with no baserunners, no outs,
and the first batter due to hit in the batter's box. In the 1st inning (start of the game),
Edgardo Alfonzo, who is 1st in the lineup order, is the first batter. Each batter makes a
play, which maps to one of seven possible outcomes: BB/HBP, 1B, 2B, 3B, HR, Out, DP.
When a batter makes an Out, he does not advance to any base. A DP stands for Double
Play, or two Outs, which erases a baserunner in addition to retiring the batter. A DP
can only occur when there is at least one baserunner. After explaining parts of this game
log, I will give an example of a DP for clarification.
Inning 1                                  Baserunners   Runs
Edgardo Alfonzo drew a walk (or HBP).     1 0 0         0
Derek Bell made an out.                   1 0 0         0
Mike Piazza hit a single.                 1 1 0         0
Todd Zeile made an out.                   1 1 0         0
Benny Agbayani made an out.               1 1 0         0

Inning 2                                  Baserunners   Runs
Robin Ventura hit a single.               1 0 0         0
Jay Payton drew a walk (or HBP).          1 1 0         0
Mike Bordick hit a single.                1 1 0         1
Mike Hampton made an out.                 1 1 0         1
Edgardo Alfonzo drew a walk (or HBP).     1 1 1         1
Derek Bell made an out.                   1 0 1         2
Mike Piazza made an out.                  1 0 1         2

Inning 3                                  Baserunners   Runs
Todd Zeile made an out.                   0 0 0         2
Benny Agbayani made an out.               0 0 0         2
Robin Ventura made an out.                0 0 0         2

Inning 4                                  Baserunners   Runs
Jay Payton made an out.                   0 0 0         2
Mike Bordick hit a single.                1 0 0         2
Mike Hampton made an out.                 1 0 0         2
Edgardo Alfonzo hit a homerun.            0 0 0         4
Derek Bell made an out.                   0 0 0         4

Inning 5                                  Baserunners   Runs
Mike Piazza made an out.                  0 0 0         4
Todd Zeile made an out.                   0 0 0         4
Benny Agbayani made an out.               0 0 0         4

Inning 6                                  Baserunners   Runs
Robin Ventura made an out.                0 0 0         4
Jay Payton hit a double.                  0 1 0         4
Mike Bordick made an out.                 0 1 0         4
Mike Hampton made an out.                 0 1 0         4

Inning 7                                  Baserunners   Runs
Edgardo Alfonzo hit a double.             0 1 0         4
Derek Bell made an out.                   0 1 0         4
Mike Piazza hit a single.                 1 0 1         4
Todd Zeile made an out.                   1 0 1         4
Benny Agbayani hit a single.              1 1 0         5
Robin Ventura hit a triple.               0 0 1         7
Jay Payton made an out.                   0 0 1         7

In the 1st inning, Alfonzo draws a BB and advances to 1st base, which is represented by 1
0 0 in the Baserunners column. This means that there is a runner on 1st base, and no
runner on 2nd or 3rd base after the play (BB). After Derek Bell makes an out, Alfonzo
might advance to 2nd base (with a certain probability), but here an out leads to no change.
Piazza hits a 1B and Alfonzo advances to 2nd base. The next two batters make outs, and
the inning is over. An inning ends when the team makes 3 outs, and we move on to the
next inning, which begins with 0 outs. The first batter (Ventura) in the next inning is the
one right after the batter who made the last out in the previous inning (Agbayani).
The NY Mets lineup scores 4 runs in the first 6 innings. In the 7th inning, Alfonzo is the
first batter again as Hampton (the #9 hitter) makes an out to end the previous inning.
Alfonzo hits a 2B and advances to 2nd base. Bell makes an out. Piazza hits a 1B and
Alfonzo advances to 3rd base. Usually, a runner on 2nd base will score (advance to Home)
on a 1B, but sometimes he is only able to advance to 3rd base. So now there are runners
on 1st and 3rd base. Zeile makes an out, and the runners do not advance. There is about a
50/50 chance that a runner on 3rd base will score on an out. In this case, he does not score.
Agbayani hits a single, advances to 1st base, and Piazza advances to 2nd base. Alfonzo
advances to Home, scoring a run, and is no longer a baserunner now. So the Mets have
now scored a total of 5 runs. Ventura hits a triple, advances to 3rd base, and both
baserunners score to give the Mets 7 runs. Payton makes an out, and the inning is over.
The key point to notice is that baserunners will always advance on a hit by the batter and
sometimes advance on an out by the batter. In addition, the runners will advance at least
the same number of bases as the batter does. For example, if a batter hits a 2B, a runner
on 1st base advances to at least 3rd base (a runner on 2nd or 3rd base scores automatically).
The following is an example of a DP from a different game:
Inning 3                                  Baserunners   Runs
Mike Hampton hit a single.                1 0 0         0
Edgardo Alfonzo hit into a double play.   0 0 0         0
Derek Bell made an out.                   0 0 0         0

Alfonzo's DP erases Hampton from 1st base. Bell makes the third out to end the inning. If
a DP occurs when there is more than one baserunner, the runner on 1st base is the one
who is usually erased. A DP rarely occurs when there is no runner on 1st base.
Also, note that each play occurs within the context of a certain inning. A DP cannot occur
in an inning that has already recorded two outs because one more out automatically ends the
current inning, and this leads us to the next inning, which begins with no outs.

Now that I have explained the run scoring process, I will describe the probability
distribution over the random action Play. This probability distribution is specific to a
certain player; for example, some players are more likely to hit HRs than others. Recall
that plate appearances (PA) are the number of opportunities a batter gets to hit during a
season. A season is 162 games; each year there is a new season. As an example:
Player            Team      Year   PA    BB/HBP   1B    2B   3B   HR   Out   DP
Edgardo Alfonzo   NY Mets   2000   650   100      109   40   2    25   362   12

Since Alfonzo hit a total of 109 singles in 650 PA during the 2000 season, his probability
of hitting a single was simply 109/650 = 16.8%. So Pr(Play = 1B) = 16.8% for Alfonzo
in the 2000 season. The open-source database at www.baseball-databank.org provided me
with the necessary data to generate probability distributions for each player. The Markov
model's transition probabilities are based in part on these probability distributions. The
game logs above are sample results from simulations of this model.
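As a minimal sketch of how a player's Play distribution can be derived from season counts, the snippet below divides each outcome count by total plate appearances. The function name `play_distribution` is hypothetical, and the 3B count of 2 is an assumption inferred so that the seven outcomes add up to the 650 PA given in the text.

```python
# Season counts for Edgardo Alfonzo (2000) as given in the text.
# ASSUMPTION: the 3B count (2) is inferred so that all outcomes sum to 650 PA.
alfonzo_2000 = {
    "BB/HBP": 100, "1B": 109, "2B": 40, "3B": 2,
    "HR": 25, "Out": 362, "DP": 12,
}

def play_distribution(counts):
    """Convert outcome counts into probabilities that sum to 1."""
    total = sum(counts.values())
    return {play: n / total for play, n in counts.items()}

probs = play_distribution(alfonzo_2000)
print(round(probs["1B"], 3))  # 0.168
```

Each probability is simply a relative frequency, so the seven values sum to one by construction.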

Description of the Hidden Markov Model


Each state is composed of three elements: the situation before the play occurred, the play,
and the situation after the play occurred. A situation is described as the batter currently in
the batter's box (referenced by his lineup # between 1-9), the baserunners (e.g. 1 0 1), the
current inning (1-9), and the current number of outs recorded in the inning (0, 1, or 2).
Mathematical Description of the Before/After Situation in a State:
Before the play: Lineup #, X baserunners, Inning, N outs
After the play: (Lineup # + 1), X' baserunners, Inning', N' outs
X (or X') = (n1, n2, n3) where n1 = 1 if runner on 1st (otherwise 0), n2 = 1 if runner on 2nd, etc.
In the After the Play specification, Lineup # + 1 refers to the next batter in the lineup. If Lineup # = 9,
however, the next batter is actually (Lineup # + 1) modulo 9 = 1 because the lineup repeats.
A play (as described earlier) maps to one of these outcomes: BB/HBP, 1B, 2B, 3B, HR, Out, or DP.
As an example, let Before the Play equal Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs, Play equal
1B, and After the Play equal Bell (#2 in lineup), (1, 0, 0), 8th inning, 0 outs. This state means that
Alfonzo was the batter when there was a runner on 2nd base and 0 outs in the 8th inning. Alfonzo hit a
single, the runner scored, and Bell was the next hitter with Alfonzo on 1st base and 0 outs in the 8th inning.
This state description directly determines how many runs score on one play (in this case, 1 run scored).
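One way to picture the state description is as a small record type. The sketch below is illustrative only: the class names `Situation` and `State` and the helper `next_batter` are hypothetical, not taken from the author's implementation. The helper wraps the lineup so that batter 9 is followed by batter 1.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Situation:
    lineup: int                   # lineup position of the batter, 1..9
    bases: Tuple[int, int, int]   # (n1, n2, n3): 1 if a runner is on that base
    inning: int                   # current inning
    outs: int                     # outs recorded so far in the inning: 0, 1, or 2

@dataclass(frozen=True)
class State:
    before: Situation
    play: str                     # "BB/HBP", "1B", "2B", "3B", "HR", "Out", or "DP"
    after: Situation

def next_batter(lineup: int) -> int:
    """Lineup position of the following batter; 9 wraps around to 1."""
    return lineup % 9 + 1

# The example state from the text: Alfonzo singles with a runner on 2nd.
example = State(
    before=Situation(lineup=1, bases=(0, 1, 0), inning=8, outs=0),
    play="1B",
    after=Situation(lineup=next_batter(1), bases=(1, 0, 0), inning=8, outs=0),
)
```

Writing the wrap-around as `lineup % 9 + 1` avoids a special case for the 9th batter.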
Possible Observations:
In a certain state we will observe 0, 1, 2, 3, or 4 runs scored. The probability = 100% for one of these 5
possibilities because each state implies exactly one possibility. 4 runs is the maximum number that can
score on one play: a HR with a runner on every base (except Home).
Examples
Observation: 0 runs scored
State: Before equals Bell (#2 in lineup), (0, 1, 0), 3rd inning, 2 outs, Play equals Out, and After equals
Piazza (#3 in lineup), (0, 0, 0), 4th inning, 0 outs.

Bell makes an out to end the 3rd inning and so the situation after the play represents the beginning of the 4th
inning. No more runs can score in an inning after the third out has been recorded.
----------------------------------------------------------------
Observation: 1 run scored
State: Before equals Bordick (#8 in lineup), (1, 1, 0), 2nd inning, 0 outs, Play equals 1B, and After
equals Hampton (#9 in lineup), (1, 1, 0), 2nd inning, 0 outs.
Bordick hits a single, the runner on 2nd base scores, the runner on 1st base advances to 2nd base, and Bordick
advances to 1st base. Thus, we still have runners on 1st and 2nd base after the play.
----------------------------------------------------------------
Observation: 2 runs scored
State: Before equals Ventura (#6 in lineup), (1, 1, 0), 7th inning, 2 outs, Play equals 3B, and After
equals Payton (#7 in lineup), (0, 0, 1), 7th inning, 2 outs.
Ventura hits a triple, the two baserunners before the play score, and Ventura advances to 3rd base.
----------------------------------------------------------------
Observation: 3 runs scored
State: Before equals Agbayani (#5 in lineup), (1, 1, 1), 9th inning, 1 out, Play equals 2B, and After
equals Ventura (#6 in lineup), (0, 1, 0), 9th inning, 1 out.
Agbayani hits a double, all three baserunners before the play are able to score, and Agbayani advances to
2nd base.
----------------------------------------------------------------
Observation: 4 runs scored
State: Before equals Piazza (#3 in lineup), (1, 1, 1), 5th inning, 1 out, Play equals HR, and After equals
Zeile (#4 in lineup), (0, 0, 0), 5th inning, 1 out.
Piazza hits a homerun, all three baserunners before the play score, and Piazza also scores automatically,
thus totaling 4 runs on the play.
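The five examples above can be checked with a simple counting argument: everyone who was on base, plus the batter, must end up on base, out, or home. The sketch below is an illustration under that assumption and under the model's rule that no runs score on a play that records the third out; the function name `runs_scored` is hypothetical.

```python
def runs_scored(bases_before, outs_before, play, bases_after):
    """Runs observed in a state, derived by counting runners.

    bases_before / bases_after are (n1, n2, n3) tuples; outs_before is 0, 1, or 2.
    ASSUMPTION: a play that produces the third out scores no runs (model rule).
    """
    outs_made = {"Out": 1, "DP": 2}.get(play, 0)
    if outs_before + outs_made >= 3:
        return 0  # inning is over; the After situation resets to (0, 0, 0)
    # runners before + batter = runners after + outs made + runs
    return sum(bases_before) + 1 - outs_made - sum(bases_after)

print(runs_scored((1, 1, 1), 1, "HR", (0, 0, 0)))  # 4
```

This makes explicit why each state implies exactly one observation: the Before situation, the play, and the After situation together leave no freedom in the run count.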
Transition Probabilities:
T(S, S') represents the probability that state S transitions to state S'. T(S, S') > 0 only if the After situation
(After the Play) in S equals the Before situation (Before the Play) in S'. Formally, T(S, S') = the probability
that the Play value in S' (e.g. 2B) will occur, times the probability that the After situation in S (or the
Before situation in S') will lead to the After situation in S' given that the Play value in S' has occurred.
As an example, let S be: Before the Play equals Hampton (#9 in lineup), (0, 0, 0), 8th inning, 0 outs, Play
equals 2B, and After the Play equals Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs. Let S' be:
Before the Play equals Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs, Play equals 1B, and After
the Play equals Bell (#2 in lineup), (1, 0, 0), 8th inning, 0 outs. Notice that the After situation in state S
equals the Before situation in state S'. In state S, Hampton hits a double and Alfonzo is now the batter.
Next, in state S', Alfonzo hits a single, Hampton scores, and Bell is the next batter.
The transition probability T(S, S') = Pr(Play = 1B) x Pr(X' = (1, 0, 0) | X = (0, 1, 0), Play = 1B), i.e. the
probability that Alfonzo hits a single times the probability that the runner on 2nd base (Hampton) advances
to Home given that Alfonzo has hit a single. The question remains, how do we calculate this transition
probability? This depends on the particular batter involved (in this case, Edgardo Alfonzo). Recall
Alfonzo's 2000 stats, now given below in the form of proportions or probabilities:
Pr(BB/HBP)   Pr(1B)   Pr(2B)   Pr(3B)   Pr(HR)   Pr(Out)   Pr(DP)
.154         .168     .062     .003     .038     .557      .018

The only problem with these probabilities is that a DP (double play) can only occur with fewer than 2 outs
(there is a limit of 3 outs per inning) and at least one runner on base (usually 1st base). The other plays can
occur in any situation, but their probabilities will change (though their relative proportions will stay the
same) when Pr(DP) = 0. Based on batting stats (called splits) at www.retrosheet.org, I found that about
60% of any batter's total PA occur when there are 2 outs or no baserunners (or both). What this means, for
example, is that about 60% of Alfonzo's 650 PA (390) in 2000 occurred when there were either 2 outs or no
baserunners, i.e. when Pr(DP) = 0.
For any batter (in this case, Alfonzo), the equation becomes Pr(DP) = (.60 x 0) + (.33 x Y) + (.07 x Z) = .018.
Y represents the Pr(DP) when there is at least one baserunner, one of whom is on 1st base, and fewer
than 2 outs (about 33% of the PA). Z represents the Pr(DP) when there is at least one baserunner, none of
whom are on 1st base, and fewer than 2 outs (about 7% of the PA). I have set Z equal to .03 because this
rarely occurs (for any batter, including Alfonzo). Since .07 x .03 = .0021, we can then solve for Y:
(.018 - .0021) / .33 = .048. Z is a constant value for all batters whereas Y varies with the particular batter.
The following table presents my results for Alfonzo:
Situation                                                          Pr(BB/HBP)   Pr(1B)   Pr(2B)   Pr(3B)   Pr(HR)   Pr(Out)   Pr(DP)
No baserunners or 2 outs                                           .157         .171     .063     .003     .039     .567      0
Runner on 1st base and less than 2 outs                            .149         .163     .060     .003     .037     .540      .048 (Y)
Runner on 2nd or 3rd base (not on 1st base) and less than 2 outs   .152         .166     .061     .003     .038     .550      .03 (Z)
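The situational table can be reproduced, to three decimals, by solving for Y and then rescaling the non-DP plays so their relative proportions are preserved. The sketch below assumes that proportional rescaling is how the adjusted probabilities were obtained (the text says the relative proportions stay the same, and the numbers agree); the function names are hypothetical.

```python
# Alfonzo's overall 2000 probabilities, as rounded in the text.
base = {"BB/HBP": .154, "1B": .168, "2B": .062, "3B": .003,
        "HR": .038, "Out": .557, "DP": .018}

SHARE_FIRST = 0.33   # share of PA with a runner on 1st and fewer than 2 outs
SHARE_OTHER = 0.07   # share of PA with runners, none on 1st, fewer than 2 outs
Z = 0.03             # Pr(DP) assumed in the text for the "other runners" case

def dp_with_runner_on_first(p_dp_overall):
    """Solve .33*Y + .07*Z = Pr(DP) for Y."""
    return (p_dp_overall - SHARE_OTHER * Z) / SHARE_FIRST

def situational(base, p_dp_situation):
    """Rescale non-DP plays proportionally so the distribution sums to 1.

    ASSUMPTION: proportional rescaling, inferred from the text's remark that
    relative proportions stay the same.
    """
    scale = (1 - p_dp_situation) / (1 - base["DP"])
    dist = {play: p * scale for play, p in base.items() if play != "DP"}
    dist["DP"] = p_dp_situation
    return dist

Y = dp_with_runner_on_first(base["DP"])
for label, p_dp in [("no DP possible", 0.0), ("runner on 1st", Y), ("other runners", Z)]:
    print(label, {play: round(p, 3) for play, p in situational(base, p_dp).items()})
```

Because only one scale factor is applied per situation, ratios such as Pr(1B)/Pr(HR) are identical in every row.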

Returning to the question of how we calculate T(S, S'), we now know that Pr(Play = 1B) = .166 because
Alfonzo is batting with a runner on 2nd base and 0 outs (less than 2 outs). But what about the Pr(X' = (1, 0,
0) | X = (0, 1, 0), Play = 1B)? The following discussion will explain how to calculate the probability that
some After situation in S will lead to some After situation in S', given that a certain play has occurred.
Given a certain play (e.g. 1B) and an After situation in S (e.g. Alfonzo, X = (0, 1, 0), 8th inning, 0 outs), we
can directly determine the number of outs in the After situation in S'. The non-outs (BB/HBP, 1B, 2B, 3B,
HR) keep the number of outs unchanged. A single out results in 1 more out, and a DP results in 2 more
outs. Once we reach 3 outs from one of these two plays, we automatically reset the number of outs to
zero, the X' component (in the After situation in S') to (0, 0, 0), and the current inning to the next inning.
X', however, can often equal one of several possibilities, and so we need to use probabilities. When the
play is a BB/HBP, 3B, or HR, however, there is only one possibility:
Let X = (n1, n2, n3) in the After situation in S (or the Before situation in S')
BB/HBP:
If n1 = 1 and n2 = 1 then X' = (1, 1, 1) and runs scored = n3
If n1 = 1 and n2 = 0 then X' = (1, 1, n3) and runs scored = 0
Else X' = (1, n2, n3) and runs scored = 0
3B:
X' = (0, 0, 1) and runs scored = n1 + n2 + n3
HR:
X' = (0, 0, 0) and runs scored = 1 + n1 + n2 + n3
Explanation: If a batter's play is a BB/HBP, then the batter advances to 1st base. If there is a runner on 1st
base before the play, then this runner moves to 2nd base. This continues like a chain reaction as long as there
are runners who are adjacent in the base sequence. For example, if there are runners on every base before
the BB/HBP, then the runner on 3rd base scores, the runner on 2nd base advances to 3rd base, the runner on
1st base advances to 2nd base, and the batter advances to 1st base. If a batter's play is a 3B, he advances to
3rd base and any baserunners existing before the play score. A HR is the same as a 3B except that the batter
advances to Home and also scores. Triples usually occur far less often than any other play.
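The three deterministic rules translate directly into code. The following is a minimal sketch of the BB/HBP, 3B, and HR cases exactly as stated above; the function name `advance_deterministic` is hypothetical.

```python
def advance_deterministic(bases, play):
    """Return (bases_after, runs) for plays with a single possible outcome.

    bases is (n1, n2, n3) before the play.
    """
    n1, n2, n3 = bases
    if play == "BB/HBP":
        if n1 == 1 and n2 == 1:
            return (1, 1, 1), n3      # bases forced all the way around
        if n1 == 1:
            return (1, 1, n3), 0      # runner on 1st forced to 2nd
        return (1, n2, n3), 0         # nobody is forced
    if play == "3B":
        return (0, 0, 1), n1 + n2 + n3
    if play == "HR":
        return (0, 0, 0), 1 + n1 + n2 + n3
    raise ValueError(f"{play} does not have a single deterministic outcome")

print(advance_deterministic((1, 1, 1), "BB/HBP"))  # ((1, 1, 1), 1)
```

The walk case is the only one that needs branching, because a runner advances only when forced by the runner behind him.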

When Play = 1B or 2B, there are often several possibilities for X', given an After situation in S (X):

X           X'          Pr(X' | X, Play = 1B)   Runs
(0, 0, 0)   (1, 0, 0)   1                       0
(1, 0, 0)   (1, 1, 0)   0.85                    0
(1, 0, 0)   (1, 0, 1)   0.15                    0
(0, 1, 0)   (1, 0, 0)   0.8                     1
(0, 1, 0)   (1, 0, 1)   0.2                     0
(0, 0, 1)   (1, 0, 0)   1                       1
(1, 1, 0)   (1, 1, 0)   0.68                    1
(1, 1, 0)   (1, 0, 1)   0.12                    1
(1, 1, 0)   (1, 1, 1)   0.2                     0
(0, 1, 1)   (1, 0, 0)   0.8                     2
(0, 1, 1)   (1, 0, 1)   0.2                     1
(1, 0, 1)   (1, 1, 0)   0.85                    1
(1, 0, 1)   (1, 0, 1)   0.15                    1
(1, 1, 1)   (1, 1, 1)   0.2                     1
(1, 1, 1)   (1, 1, 0)   0.68                    2
(1, 1, 1)   (1, 0, 1)   0.12                    2

X           X'          Pr(X' | X, Play = 2B)   Runs
(0, 0, 0)   (0, 1, 0)   1                       0
(1, 0, 0)   (0, 1, 1)   0.7                     0
(1, 0, 0)   (0, 1, 0)   0.3                     1
(0, 1, 0)   (0, 1, 0)   1                       1
(0, 0, 1)   (0, 1, 0)   1                       1
(1, 1, 0)   (0, 1, 1)   0.7                     1
(1, 1, 0)   (0, 1, 0)   0.3                     2
(0, 1, 1)   (0, 1, 0)   1                       2
(1, 0, 1)   (0, 1, 1)   0.7                     1
(1, 0, 1)   (0, 1, 0)   0.3                     2
(1, 1, 1)   (0, 1, 1)   0.7                     2
(1, 1, 1)   (0, 1, 0)   0.3                     3
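In a simulation, the Play = 1B table is used by drawing X' at random with the listed probabilities. The sketch below encodes that table as a dictionary and samples from it with `random.choices`. The names `SINGLE` and `sample_after_single` are hypothetical; the rows with probability 1.0 are assumptions implied by there being only one listed outcome, and runs are computed by counting runners rather than read from the table.

```python
import random

# Pr(X' | X, Play = 1B). Keys are bases before the play, values map X' to its probability.
# ASSUMPTION: single-outcome rows carry probability 1.0.
SINGLE = {
    (0, 0, 0): {(1, 0, 0): 1.0},
    (1, 0, 0): {(1, 1, 0): 0.85, (1, 0, 1): 0.15},
    (0, 1, 0): {(1, 0, 0): 0.8, (1, 0, 1): 0.2},
    (0, 0, 1): {(1, 0, 0): 1.0},
    (1, 1, 0): {(1, 1, 0): 0.68, (1, 0, 1): 0.12, (1, 1, 1): 0.2},
    (0, 1, 1): {(1, 0, 0): 0.8, (1, 0, 1): 0.2},
    (1, 0, 1): {(1, 1, 0): 0.85, (1, 0, 1): 0.15},
    (1, 1, 1): {(1, 1, 1): 0.2, (1, 1, 0): 0.68, (1, 0, 1): 0.12},
}

def sample_after_single(bases, rng=random):
    """Draw the bases after a single and return (bases_after, runs)."""
    options = SINGLE[bases]
    after = rng.choices(list(options), weights=list(options.values()))[0]
    # No outs are made on a hit, so runners before + batter = runners after + runs.
    runs = sum(bases) + 1 - sum(after)
    return after, runs

rng = random.Random(2005)  # seeded generator so a run is repeatable
print(sample_after_single((0, 1, 0), rng))
```

A quick sanity check on any such table is that the probabilities for each starting X sum to one.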

When Play = Out or DP, there are often several possibilities for X', given an After situation in S (X):
(If a DP occurs with 1 out or an Out occurs with 2 outs, then X' = (0, 0, 0) since now there are 3 outs)

X           X'          Pr(X' | X, Play = Out, 0 or 1 outs)   Runs
(0, 0, 0)   (0, 0, 0)   1                                     0
(1, 0, 0)   (1, 0, 0)   0.95                                  0
(1, 0, 0)   (0, 1, 0)   0.05                                  0
(0, 1, 0)   (0, 1, 0)   0.9                                   0
(0, 1, 0)   (0, 0, 1)   0.1                                   0
(0, 0, 1)   (0, 0, 0)   0.5                                   1
(0, 0, 1)   (0, 0, 1)   0.5                                   0
(1, 1, 0)   (1, 0, 1)   0.6                                   0
(1, 1, 0)   (0, 1, 1)   0.1                                   0
(1, 1, 0)   (1, 1, 0)   0.3                                   0
(0, 1, 1)   (0, 1, 1)   0.5                                   0
(0, 1, 1)   (0, 1, 0)   0.5                                   1
(1, 0, 1)   (1, 0, 0)   0.5                                   1
(1, 0, 1)   (1, 0, 1)   0.5                                   0
(1, 1, 1)   (1, 1, 1)   0.4                                   0
(1, 1, 1)   (1, 1, 0)   0.3                                   1
(1, 1, 1)   (1, 0, 1)   0.3                                   1

X           X'          Pr(X' | X, Play = DP, 0 outs)         Runs
(1, 0, 0)   (0, 0, 0)   1                                     0
(0, 1, 0)   (0, 0, 0)   1                                     0
(0, 0, 1)   (0, 0, 0)   1                                     0
(1, 1, 0)   (0, 0, 1)   0.7                                   0
(1, 1, 0)   (0, 1, 0)   0.2                                   0
(1, 1, 0)   (1, 0, 0)   0.1                                   0
(0, 1, 1)   (0, 0, 1)   0.5                                   0
(0, 1, 1)   (0, 1, 0)   0.5                                   0
(1, 0, 1)   (0, 0, 0)   1                                     1
(1, 1, 1)   (0, 1, 1)   0.45                                  0
(1, 1, 1)   (0, 0, 1)   0.5                                   1
(1, 1, 1)   (1, 1, 0)   0.05                                  0

Going back to our example, the above tables tell us that there are only two possibilities for X' given that X
= (0, 1, 0) and Play = 1B; either X' = (1, 0, 0) or (1, 0, 1), meaning that either Hampton scores or only
advances to 3rd base. The former is more probable, i.e. Pr(X' = (1, 0, 0) | X, 1B) = 0.8. Thus, T(S, S') = .166
x .8 = .1328.
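The worked example amounts to multiplying a play probability by a baserunner-advancement probability. A minimal sketch, using only the two numbers quoted above; the function and table names are hypothetical, and pairs not present in the table are treated as impossible.

```python
# Pr(X' | X = (0, 1, 0), Play = 1B), as quoted in the example.
ADVANCE_ON_SINGLE = {
    (0, 1, 0): {(1, 0, 0): 0.8, (1, 0, 1): 0.2},
}

def transition_probability(pr_play, advance_table, bases_before, bases_after):
    """T(S, S') = Pr(Play) x Pr(X' | X, Play); zero if X' is unreachable from X."""
    return pr_play * advance_table.get(bases_before, {}).get(bases_after, 0.0)

# Alfonzo singles (Pr = .166 with a runner on 2nd and fewer than 2 outs)
# and the runner on 2nd scores.
t = transition_probability(0.166, ADVANCE_ON_SINGLE, (0, 1, 0), (1, 0, 0))
print(round(t, 4))  # 0.1328
```

The same function returns .166 x .2 = .0332 for the alternative in which Hampton stops at 3rd base.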

Rationale for Modeling the Run Scoring Process with a HMM:

The result of any baseball game (in the news) is typically described as Home Team 5, Visiting
Team 3 (for example); this means that the home team won the game because it scored 5 runs,
which was more than what the visiting team scored (3 runs). Baseball fans often look at a
scoreboard, which is a listing of different games and their current results, and see (for example)
that the score is 6-2 in the fifth inning of one game. The scoreboard also often tells them how
many runs were scored by each team in each inning.
However, if they did not see this particular game (and the game log is unavailable), they can only
guess what sequence of plays led to the resulting score in each inning. Similarly, the observable
parameter in our HMM is a sequence of runs scored, and one challenge is to determine the most
likely sequence of states (including plays) that could account for a certain observed sequence. The
states in our HMM are the hidden parameters because different situations and plays can account
for the same observed sequence.
Another interesting question that has mostly theoretical value (no practical application) is: Given
a certain lineup of hitters, what is the probability of a certain observation sequence of runs scored
(during a single game)? We will discuss a similar question that does have practical value; namely,
what will be the sum total of runs scored (on average) per game, given a certain lineup? The
HMM is designed so that we can predict (or estimate) the answer to this latter question, both in
theory and through actual simulations. The following sections discuss this key issue in detail, and
reach a conclusion on the HMM's accuracy.

Practical Applications: Predicting Before the Fact (Future) and Estimating After the Fact (Past)

You may ask, what is the practical value of modeling the run scoring process? First of all,
one must know that the objective of a baseball team is to score more runs than its
opponent (each team usually receives 9 innings to try and score runs in a game). Thus,
the more runs a team scores (on average) each game, the more games it is likely to win.
Accurately predicting runs scored can help a team's manager decide whether or not he
needs to improve his lineup's ability to score runs.
Specifically, we wish to predict, given a team's (projected) lineup for each game, how
many runs they will score in a certain season. For example, given the NY Mets lineup for
each game of the 2001 season, we could generate the Play probability distributions for
each player in the lineup from each player's cumulative performance between 1998-2000.
Thus, we would be using data from previous seasons to predict each player's
performance in 2001 as well as the team's runs scored in 2001 (before the fact).
Inaccurately predicting future performance, however, will probably tell us more about the
inaccuracy of our prediction algorithm than about the inaccuracy of our model. For
example, if our predictions for each player's Play probability distribution for 2001 are
very inaccurate, then a simulation of our model will likely give inaccurate predictions
(for the team's runs scored), regardless of whether or not the model is accurate.
Thus, before we can predict future performance, we need to test whether the model is an
accurate simulator of the run scoring process. To accomplish this, we want to estimate a
team's runs scored after the fact, i.e. we want to simulate the 2000 season, for example,
based on the probability distributions generated from the actual 2000 stats. As a specific
example, if we did this for the 2000 NY Mets, we would expect Edgardo Alfonzo (or any
other player on the team) to have (on average) the same proportion of singles, outs, etc.
that he had in actuality during the 2000 season. Similarly, if the model is accurate, we
would expect the runs scored in different simulations to be (on average) equal to the
actual runs scored by the 2000 NY Mets.
This idea of estimating after the fact is also the basis for the design of mathematical
formulas that accurately estimate runs scored from actual team stats. There are linear and
non-linear formulas. The following two formulas are examples of each type, respectively:
XR (Extrapolated Runs) = (.50 x 1B) + (.72 x 2B) + (1.04 x 3B) + (1.44 x HR) + (.34 x BB/HBP) + (-.096 x Outs)
BsR (Base Runs) = [A * B/(B + Outs)] + HR (A and B represent baserunners and advancement, respectively)
A = 1B + 2B + 3B + BB/HBP
B = [2*HR + 3.6*3B + 2.2*2B + .8*1B + .1*BB/HBP] * 1.02

The XR formula takes the total number of 1B, 2B, 3B, etc. hit by a team during a certain
season and estimates the team's runs scored using linear weights for each stat. As an
example, the 2000 Mets had 945 singles, 282 doubles, 20 triples, 198 HRs, 720 BB/HBP,
and 4330 Outs. XR estimates that the team scored .5*945 + .72*282 + ... = 811 runs (they
actually scored 807 runs). Note: the XR formula is fitted to data before the 2000 season.
B/(B+Outs) in the BsR formula represents the percentage of baserunners (A) who
advance to Home plate (i.e. score). After hitting a HR, a batter automatically scores and
does not become a baserunner. Thus, we separately add the total number of HRs.
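Both formulas are easy to evaluate for the 2000 Mets totals quoted above. The sketch below simply transcribes XR and BsR as written in the text; the function names are hypothetical.

```python
# 2000 NY Mets team totals, as quoted in the text.
mets_2000 = {"1B": 945, "2B": 282, "3B": 20, "HR": 198, "BB/HBP": 720, "Outs": 4330}

def extrapolated_runs(s):
    """XR: a linear-weights estimate of runs scored."""
    return (.50 * s["1B"] + .72 * s["2B"] + 1.04 * s["3B"] + 1.44 * s["HR"]
            + .34 * s["BB/HBP"] - .096 * s["Outs"])

def base_runs(s):
    """BsR: baserunners (A) times the estimated share who score, plus home runs."""
    a = s["1B"] + s["2B"] + s["3B"] + s["BB/HBP"]
    b = (2 * s["HR"] + 3.6 * s["3B"] + 2.2 * s["2B"]
         + .8 * s["1B"] + .1 * s["BB/HBP"]) * 1.02
    return a * b / (b + s["Outs"]) + s["HR"]

print(round(extrapolated_runs(mets_2000)))  # 811
print(round(base_runs(mets_2000)))          # 810
```

For these totals A works out to 1,967 baserunners and B/(B + Outs) to roughly 0.31, so both estimates land within a few runs of the actual 807.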
David Smyth, the creator of BsR, stated (6/6/05): Runs equals baserunners (A) times the
proportion of baserunners who score (B/B+Outs), plus home runs. This statement is so
obviously true that some people have called it an 'identity' instead of an equation or theory.

This identity is certainly correct, but the problem is that we usually cannot determine the
exact proportion of baserunners who scored based on recorded stats like BB, HR, 2B, etc.
Unless we use a game log, or an ordered record of each play, we cannot tell the value
of this proportion. For example, the sequence 1B, DP, 2B, 3B yields one run but a
different sequence like 3B, 2B, 1B, DP may yield two runs, even though each sequence
records the same plays and the same number of baserunners (3). A Markov model
becomes useful here because it accounts for the variability in the run scoring process, as
it can produce different plays and orderings of plays from simulation to simulation.
BsR estimates (on average) how each of the recorded team stats affects the proportion of
baserunners who score. Sometimes (but not always), it estimates this proportion very
accurately. As an example, it calculates that about 31% (612) of the 1,967 baserunners for
the 2000 Mets scored runs. So it estimates that the team scored a total of 612 + 198 = 810
runs, only 3 runs above the actual total. Similarly, XR (like all linear formulas) estimates
the average run value of each stat. Since a HR automatically scores at least one run, and
may score up to 3 more baserunners, its expected run value is about 1.44.
If we let R = actual runs scored by a certain team, then a linear regression formula may
give us a good estimate of R based on the actual team stats (R is the dependent variable
and BB/HBP, 1B, 2B, etc. are the independent variables). If I simulated my model (for a
certain team after the fact) about 100 times, the sample mean runs scored (which is an
unbiased estimator of the HMM's expected value) would hopefully estimate R more
precisely. R is often difficult for mathematical formulas to accurately estimate because
they do not account for the randomness in play orderings as well as the context of innings
and games, both of which are captured by the HMM.

Results & Conclusion


Let the random variable H = the number of runs scored by a certain team after simulating
the HMM based on the probability distributions generated from the team's players'
season stats (either before or after the fact). I believe that E(H) is difficult to formally
calculate (mathematically) because of several key factors in real baseball games:
(1) Each game is independent of any other game, i.e. a team could have a different
lineup for each game. One lineup could score 2 runs in a game and another lineup
could score 10 runs in another game, but these run values do not depend on each
other. In other words, what happens in one game (theoretically) has no effect on
what happens in any other game. Thus, we have to calculate E(H) for each distinct
game, given a certain lineup for that game.
(2) All runs score within the context of 3 outs in each inning and innings are
dependent on each other as they determine which batters hit in other innings.
Because an inning could continue indefinitely as long as the batters do not make 3
outs, an infinite number of runs scored in an inning is theoretically possible.
(3) Lineups often change dynamically during each game.
(4) Games can extend beyond nine innings to break tie scores between two teams, i.e.
each team bats for a minimum of eight innings (and usually nine) but there is no
definite maximum.
I will attempt to give a general idea of how one might try to mathematically calculate
E(H), but the above factors make it difficult to formally implement this function. The
initial state for any random game in the 1st inning is one of the following:
S1: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals BB/HBP, and After equals
(#2 in lineup), (1, 0, 0), 1st inning, 0 outs.
S2: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals 1B, and After equals (#2 in
lineup), (1, 0, 0), 1st inning, 0 outs.
S3: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals 2B, and After equals (#2 in
lineup), (0, 1, 0), 1st inning, 0 outs.

S4: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals 3B, and After equals (#2 in
lineup), (0, 0, 1), 1st inning, 0 outs.
S5: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals HR, and After equals (#2 in
lineup), (0, 0, 0), 1st inning, 0 outs.
S6: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals Out, and After equals (#2 in
lineup), (0, 0, 0), 1st inning, 1 out.

Before the first play of the game, there are no baserunners (0, 0, 0) and 0 outs. There are
6 possible plays in this situation and 5 possible situations after the play. The initial state
probability distribution equals the first batter's Play probability distribution when there
are no baserunners. For example, Pr(Initial State = S1) = Pr(Play = BB/HBP). The
reward or observation in each of these possible initial states is 0 runs, except S5.
Because a HR with no baserunners scores 1 run, S5 has a reward of 1 run. Let R(S) =
reward in state S. Then the following formula calculates E(H):
E(H) = Σ (Pr(Si) x U(Si))          (over the initial states Si)
U(S) = R(S) + Σ (T(S, S') x U(S'))   (over all possible states S')

Si stands for some initial state (1 <= i <= 6) and U(S) stands for the utility of S. Once we
transition to a state in which 3 outs have been recorded in the 9th inning after a certain play
(either an Out or a DP), we can no longer transition to another state (i.e. the game is over).
The interested reader is encouraged to try to compute results from this formula, but
beware: an infinite (or very long) sequence of state transitions is improbable but possible.
The average sequence probably contains about 40-45 transitions.
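The recursion for U(S) can be illustrated on a much smaller problem. The sketch below is not the
full model described in this paper: it assumes a single half-inning, one "average" batter (so the
lineup position drops out of the state), hypothetical Play probabilities, and simplified
baserunner advancement (every runner advances exactly as far as the batter on a hit, and only
forced runners advance on a BB/HBP). Each state is reduced to its After situation (baserunners,
outs), and the names PLAY_PROBS, apply_play and solve_utilities are invented for this illustration.
Because some situations can repeat (e.g. a walk with the bases loaded), the utilities are found
by repeated sweeps rather than by a single pass.

```python
from itertools import product

# Hypothetical Play probabilities for one "average" batter (they sum to 1).
PLAY_PROBS = {"BB/HBP": 0.09, "1B": 0.16, "2B": 0.05,
              "3B": 0.005, "HR": 0.03, "Out": 0.665}
BASES_GAINED = {"1B": 1, "2B": 2, "3B": 3, "HR": 4}


def apply_play(bases, outs, play):
    """Return (new_bases, new_outs, runs) under the simplified rules above."""
    first, second, third = bases
    if play == "Out":
        return bases, outs + 1, 0
    if play == "BB/HBP":
        # Only forced runners move; a run scores only with the bases loaded.
        if first and second and third:
            return (1, 1, 1), outs, 1
        if first and second:
            return (1, 1, 1), outs, 0
        if first:
            return (1, 1, third), outs, 0
        return (1, second, third), outs, 0
    gained = BASES_GAINED[play]
    # Position 0 is the batter; positions 1-3 are occupied bases.
    runners = [0] + [pos for pos, occupied in zip((1, 2, 3), bases) if occupied]
    new_bases, runs = [0, 0, 0], 0
    for pos in runners:
        dest = pos + gained
        if dest >= 4:
            runs += 1          # runner crosses home plate
        else:
            new_bases[dest - 1] = 1
    return tuple(new_bases), outs, runs


def solve_utilities(play_probs, max_outs=3, tol=1e-12):
    """Iterate U(s) = sum over plays of Pr(play) * (runs + U(next)) to a fixed point."""
    states = [(b, o) for b in product((0, 1), repeat=3) for o in range(max_outs)]
    value = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for bases, outs in states:
            total = 0.0
            for play, prob in play_probs.items():
                nb, no, runs = apply_play(bases, outs, play)
                future = 0.0 if no >= max_outs else value[(nb, no)]
                total += prob * (runs + future)
            delta = max(delta, abs(total - value[(bases, outs)]))
            value[(bases, outs)] = total
        if delta < tol:
            return value


values = solve_utilities(PLAY_PROBS)
# E(H) for the half-inning: weight each initial play by its probability.
expected = 0.0
for play, prob in PLAY_PROBS.items():
    bases, outs, runs = apply_play((0, 0, 0), 0, play)
    expected += prob * (runs + values[(bases, outs)])
print("Expected runs in one half-inning (toy model):", round(expected, 3))
```

Extending this sketch to the state definition used here (lineup position, inning, DPs, and the
four complicating factors listed above) enlarges the state space considerably, which is part of
why an exact calculation is harder than simulation.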
I, however, have decided not to do this (for now) due to time constraints and the
previously mentioned factors. (In actual simulations of this HMM, I have accounted for
and/or incorporated those four factors, so that I get results as accurate as I would if I had
implemented these factors in the mathematical calculation of E(H).) Instead, I will use the
sample mean of runs scored from my HMM simulations, which is an unbiased estimator of E(H),
to test the accuracy of my HMM. I ran 100 "after the fact" simulations for each of the 16 teams
in the National League (a league is a group of teams) in each season from 2000 to 2004, and
compared each team's mean runs scored in simulation to its actual runs scored. For these
"after the fact" predictions, I hoped that the sample mean would be as good as, or perhaps
better than, any mathematical formula (e.g. XR) that estimates runs scored. This gives a total
of 16 x 5 = 80 teams (the 2001 Mets and the 2000 Mets count as different teams). I
calculated the difference (non-absolute and absolute) between each team's average simulated
runs scored and its actual runs scored. This difference can also be called the error. If E(H)
gives a good estimate of R, meaning that the HMM is an accurate model of the actual run
scoring process, then this error should (on average) be very small. The table below gives
the mean non-absolute error and the mean absolute error for the HMM as well as for the two
formulas XR and BsR:

          Mean Non-Abs Error    Mean Abs Error
HMM             -5.2                 16.5
XR             +17.0                 21.2
BsR            +21.6                 24.5

As an example of how to calculate these errors, the 2000 Mets scored an average of 829
runs in my 100 simulations, and in actuality scored 807 runs. So the non-absolute error =
829 - 807 = +22 runs. The absolute error is the absolute value of +22, which is 22. Since we
would like the average difference between the estimate and R to be 0, meaning that the
expected value of the non-absolute error would be 0, the HMM seems to do a job that is
comparable to (if not better than) both estimation formulas. But since this is a sample of
only 80 teams, these data do not necessarily prove my HMM's superiority to these formulas.
In addition, its mean non-absolute error (-5.2 runs) is still not negligible, and so the model
still needs improvement (not much, though).
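The two error measures can be written out in a few lines. The sketch below is only an
illustration of the arithmetic: the first record is the 2000 Mets example from the text, while
the second record is a hypothetical team invented to show how errors of opposite sign partly
cancel in the non-absolute mean but not in the absolute mean. The function name error_summary
is also an assumption of this sketch.

```python
def error_summary(records):
    """records: list of (simulated mean runs, actual runs) pairs, one per team.

    Returns (mean non-absolute error, mean absolute error).
    """
    errors = [simulated - actual for simulated, actual in records]
    mean_signed = sum(errors) / len(errors)
    mean_abs = sum(abs(e) for e in errors) / len(errors)
    return mean_signed, mean_abs


records = [
    (829, 807),  # 2000 Mets: simulated average vs. actual runs (from the text)
    (700, 715),  # hypothetical team, invented for illustration only
]

mean_signed, mean_abs = error_summary(records)
print("Mean non-absolute error:", mean_signed)
print("Mean absolute error:", mean_abs)
```

A model can therefore have a mean non-absolute error near 0 while still missing individual teams
by a wide margin, which is why both columns are reported.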
To make future improvements, I would need to expand the HMM to account for minor
(yet important) factors such as baserunning speed, stolen bases, caught stealing, errors,
wild pitches, passed balls, runners thrown out, triple plays, etc., and then see whether its
performance improves significantly. In order to appreciate how incorporating these
factors in the HMM would likely improve its accuracy, the reader must first understand
what these factors are and how they influence the gameplay in baseball. The following
link is a good beginner's reference on baseball: http://en.wikipedia.org/wiki/Baseball.

Other interesting questions that my HMM could answer (further research):

(1) If we replace one batter (e.g. an average one) in some lineup with another one (e.g. Mike
Piazza), how many more (or fewer) runs will the team score? (This is important in player
evaluation.)
(2) Does the order of the batters in a lineup make a difference in run production?

Appendix: Example Simulations of the HMM


The example below shows the results of two different "after the fact" simulations for
the 2000 NY Mets, along with the stats of certain key players (RBI, OBP, and SLG are
not necessary for the reader to understand, but I calculated them for those who are more
familiar with baseball and may be interested):
TEAM: NY Mets    YEAR: 2000    SEASON 1    Runs Scored: 844

Player             PA   BB/HBP   Hits   2B   3B   HR   RBI   OBP    SLG
Todd Zeile         626     73    165    34    3   29   106   .359   .528
Robin Ventura      554     88    111    21    1   30   102   .338   .481
Mike Piazza        548     76    153    29    0   34   109   .387   .602
Derek Bell         629     60    171    34    2   20    81   .350   .473
Edgardo Alfonzo    655     95    179    53    2   19    76   .402   .523

TEAM: NY Mets    YEAR: 2000    SEASON 2    Runs Scored: 783

Player             PA   BB/HBP   Hits   2B   3B   HR   RBI   OBP    SLG
Todd Zeile         626     77    142    33    3   29   110   .335   .488
Robin Ventura      555     67    107    25    1   24    77   .294   .422
Mike Piazza        549     58    149    27    0   29    91   .353   .536
Derek Bell         630     77    138    23    0    8    48   .313   .335
Edgardo Alfonzo    652    120    196    49    1   29    85   .480   .628

Notice that Alfonzo averages 24 HR and 108 BB/HBP in the two simulations; these are
comparable to his actual totals in the 2000 season (25 HR, 100 BB/HBP). If we had
simulated the HMM for the 2000 season based on the probability distributions generated
from Alfonzo's and other players' cumulative stats from 1997-1999 ("before the fact"),
then we would generally get different results. For example, Alfonzo had hit more
than 20 HRs in a season only once before 2000, and thus our simulator would likely output a
result of fewer than 20 HRs for him. Here is an example of such a simulation:

TEAM: NY Mets    YEAR: 2000    SEASON 1    Runs Scored: 750

Player             PA   BB/HBP   Hits   2B   3B   HR   RBI   OBP    SLG
Edgardo Alfonzo    671     68    181    29    6   15    82   .358   .443
Derek Bell         628     53    171    35    3   14    83   .336   .442
Todd Zeile         667     89    164    24    4   33   113   .357   .510
Mike Piazza        588     64    161    23    0   31    88   .349   .529
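
As a rough illustration of how a Play probability distribution might be derived from a batter's
counting stats, the sketch below converts one stat line into relative frequencies per plate
appearance. It uses Alfonzo's simulated "before the fact" line above purely as sample input (the
1997-1999 cumulative stats themselves are not reproduced here), and it assumes that every plate
appearance that is not a BB/HBP or a hit is an Out, which ignores DPs and the other factors
discussed earlier. It also recomputes SLG as total bases divided by (PA - BB/HBP), which is an
assumption about how the SLG figures were obtained. The function names are invented for this sketch.

```python
def play_distribution(pa, bb_hbp, hits, doubles, triples, homers):
    """Relative frequency of each Play per plate appearance (assumed estimator)."""
    singles = hits - doubles - triples - homers
    outs = pa - bb_hbp - hits  # assumption: everything else is an Out
    counts = {"BB/HBP": bb_hbp, "1B": singles, "2B": doubles,
              "3B": triples, "HR": homers, "Out": outs}
    return {play: count / pa for play, count in counts.items()}


def slugging(pa, bb_hbp, hits, doubles, triples, homers):
    """Total bases per at-bat, taking at-bats as PA minus BB/HBP (assumption)."""
    total_bases = hits + doubles + 2 * triples + 3 * homers
    return total_bases / (pa - bb_hbp)


# Alfonzo's simulated line: PA 671, BB/HBP 68, Hits 181, 2B 29, 3B 6, HR 15.
alfonzo = dict(pa=671, bb_hbp=68, hits=181, doubles=29, triples=6, homers=15)

probs = play_distribution(**alfonzo)
for play, prob in probs.items():
    print(f"Pr(Play = {play}) = {prob:.3f}")
print("SLG =", round(slugging(**alfonzo), 3))
```

Under these assumptions the recomputed SLG agrees with the .443 shown in the simulated line.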
