Ankur Agarwal
Department of Statistics
Rutgers University
December 20, 2005
Abstract
The goal of a baseball team is to score more runs (points) than its opponent. Accurately
modeling and predicting the run scoring process for a given team can help that team's
manager decide whether he is going to achieve this goal. We design a Hidden Markov
Model (HMM) where the state observations are runs scored, and we simulate it to show
that it is close to a complete model of the actual run scoring process. We compare its
resultant accuracy to mathematical formulas (e.g. linear regression) that estimate a team's
runs scored (dependent variable) based on relevant statistical data about the team's actual
performance (independent variables). Finally, we propose improvements to the HMM
based on the incorporation of minor (yet important) factors that impact the run scoring
process. We also discuss some theoretical and practical applications of this HMM,
including player evaluation and prediction of future performance.
Introduction
Baseball, like any game, is full of uncertainty. Will the batter hit a homerun or a single (or
something else)? How many runs will a certain team score? I will attempt to answer the
second question, which depends in part on the first (and more fundamental) question,
with a Markov model of the run scoring process. Formally, we can describe this model as
a Hidden Markov Model (HMM) where the observation in a given state is the number of
runs scored in that state. For those who know little (or nothing) about the game of
baseball, I will soon give a descriptive example to illustrate the run scoring process.
However, I would first like to present the basic rules and gameplay in baseball:
The following sequence, called a game log, documents the progression of a sample
game with the example lineup above. There are usually nine innings in a game
(here I have displayed the first seven), each beginning with no baserunners, no outs,
and the first batter due to hit in the batter's box. In the 1st inning (start of the game),
Edgardo Alfonzo, who is 1st in the lineup order, is the first batter. Each batter makes a
play, which maps to one of seven possible outcomes: BB/HBP, 1B, 2B, 3B, HR, Out, DP.
When a batter makes an Out, he does not advance to any base. A DP stands for Double
Play, or two Outs, which erases a baserunner in addition to retiring the batter. A DP
can only occur when there is at least one baserunner. After explaining parts of this game
log, I will give an example of a DP for clarification.
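Inning 1

| Play | Baserunners | Runs |
| --- | --- | --- |
| Edgardo Alfonzo drew a walk (or HBP). | 1 0 0 | 0 |
| Derek Bell made an out. | 1 0 0 | 0 |
| Mike Piazza hit a single. | 1 1 0 | 0 |
| Todd Zeile made an out. | 1 1 0 | 0 |
| Benny Agbayani made an out. | 1 1 0 | 0 |

Inning 2

| Play | Baserunners | Runs |
| --- | --- | --- |
| Robin Ventura hit a single. | 1 0 0 | 0 |
| Jay Payton drew a walk (or HBP). | 1 1 0 | 0 |
| Mike Bordick hit a single. | 1 1 0 | 1 |
| Mike Hampton made an out. | 1 1 0 | 1 |
| Edgardo Alfonzo drew a walk (or HBP). | 1 1 1 | 1 |
| Derek Bell made an out. | 1 0 1 | 2 |
| Mike Piazza made an out. | 1 0 1 | 2 |

Inning 3

| Play | Baserunners | Runs |
| --- | --- | --- |
| Todd Zeile made an out. | 0 0 0 | 2 |
| Benny Agbayani made an out. | 0 0 0 | 2 |
| Robin Ventura made an out. | 0 0 0 | 2 |

Inning 4

| Play | Baserunners | Runs |
| --- | --- | --- |
| Jay Payton made an out. | 0 0 0 | 2 |
| Mike Bordick hit a single. | 1 0 0 | 2 |
| Mike Hampton made an out. | 1 0 0 | 2 |
| Edgardo Alfonzo hit a homerun. | 0 0 0 | 4 |
| Derek Bell made an out. | 0 0 0 | 4 |

Inning 5

| Play | Baserunners | Runs |
| --- | --- | --- |
| Mike Piazza made an out. | 0 0 0 | 4 |
| Todd Zeile made an out. | 0 0 0 | 4 |
| Benny Agbayani made an out. | 0 0 0 | 4 |

Inning 6

| Play | Baserunners | Runs |
| --- | --- | --- |
| Robin Ventura made an out. | 0 0 0 | 4 |
| Jay Payton hit a double. | 0 1 0 | 4 |
| Mike Bordick made an out. | 0 1 0 | 4 |
| Mike Hampton made an out. | 0 1 0 | 4 |

Inning 7

| Play | Baserunners | Runs |
| --- | --- | --- |
| Edgardo Alfonzo hit a double. | 0 1 0 | 4 |
| Derek Bell made an out. | 0 1 0 | 4 |
| Mike Piazza hit a single. | 1 0 1 | 4 |
| Todd Zeile made an out. | 1 0 1 | 4 |
| Benny Agbayani hit a single. | 1 1 0 | 5 |
| Robin Ventura hit a triple. | 0 0 1 | 7 |
| Jay Payton made an out. | 0 0 1 | 7 |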
In the 1st inning, Alfonzo draws a BB and advances to 1st base, which is represented by 1
0 0 in the Baserunners column. This means that there is a runner on 1st base, and no
runner on 2nd or 3rd base after the play (BB). After Derek Bell makes an out, Alfonzo
might advance to 2nd base (with a certain probability), but here an out leads to no change.
Piazza hits a 1B and Alfonzo advances to 2nd base. The next two batters make outs, and
the inning is over. An inning ends when the team makes 3 outs, and we move on to the
next inning, which begins with 0 outs. The first batter (Ventura) in the next inning is the
one right after the batter who made the last out in the previous inning (Agbayani).
The NY Mets lineup scores 4 runs in the first 6 innings. In the 7th inning, Alfonzo is the
first batter again as Hampton (the #9 hitter) makes an out to end the previous inning.
Alfonzo hits a 2B and advances to 2nd base. Bell makes an out. Piazza hits a 1B and
Alfonzo advances to 3rd base. Usually, a runner on 2nd base will score (advance to Home)
on a 1B, but sometimes he is only able to advance to 3rd base. So now there are runners
on 1st and 3rd base. Zeile makes an out, and the runners do not advance. There is about a
50/50 chance that a runner on 3rd base will score on an out. In this case, he does not score.
Agbayani hits a single, advances to 1st base, and Piazza advances to 2nd base. Alfonzo
advances to Home, scoring a run, and is no longer a baserunner now. So the Mets have
now scored a total of 5 runs. Ventura hits a triple, advances to 3rd base, and both
baserunners score to give the Mets 7 runs. Payton makes an out, and the inning is over.
The key point to notice is that baserunners will always advance on a hit by the batter and
sometimes advance on an out by the batter. In addition, the runners will advance at least
the same number of bases as the batter does. For example, if a batter hits a 2B, a runner
on 1st base advances to at least 3rd base (a runner on 2nd or 3rd base scores automatically).
The following is an example of a DP from a different game:
Inning 3

| Play | Baserunners | Runs |
| --- | --- | --- |
| Mike Hampton hit a single. | 1 0 0 | 0 |
| Edgardo Alfonzo hit into a double play. | 0 0 0 | 0 |
| Derek Bell made an out. | 0 0 0 | 0 |
Alfonzo's DP erases Hampton from 1st base. Bell makes the third out to end the inning. If
a DP occurs when there is more than one baserunner, the runner on 1st base is the one
who is usually erased. A DP rarely occurs when there is no runner on 1st base.
Also, note that each play occurs within the context of a certain inning. A DP cannot occur
in an inning that has already recorded two outs because one more out automatically ends the
current inning, and this leads us to the next inning, which begins with no outs.
Now that I have explained the run scoring process, I will describe the probability
distribution over the random action Play. This probability distribution is specific to a
certain player; for example, some players are more likely to hit HRs than others. Recall
that plate appearances (PA) are the number of opportunities a batter gets to hit during a
season. A season is 162 games; each year there is a new season. As an example:
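| Player | Team | Year | PA | BB/HBP | 1B | 2B | 3B | HR | Out | DP |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Edgardo Alfonzo | NY Mets | 2000 | 650 | 100 | 109 | 40 | 2 | 25 | 362 | 12 |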
Since Alfonzo hit a total of 109 singles in 650 PA during the 2000 season, his probability
of hitting a single was simply 109/650 = 16.8%. So Pr(Play = 1B) = 16.8% for Alfonzo
in the 2000 season. The open-source database at www.baseball-databank.org provided me
with the necessary data to generate probability distributions for each player. The Markov
model's transition probabilities are based in part on these probability distributions. The
game logs above are sample results from simulations of this model.
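
As a minimal sketch of the calculation just described, the following Python snippet turns a player's outcome counts into a Play probability distribution. The names `ALFONZO_2000` and `play_distribution` are illustrative, not part of the original simulator, and the 3B count of 2 is an assumption inferred from the requirement that the outcomes sum to 650 PA.

```python
# Outcome counts for Edgardo Alfonzo in 2000, taken from the text.
# ASSUMPTION: the 3B count (2) is inferred so that the counts sum to 650 PA.
ALFONZO_2000 = {
    "BB/HBP": 100, "1B": 109, "2B": 40, "3B": 2,
    "HR": 25, "Out": 362, "DP": 12,
}

def play_distribution(counts):
    """Convert outcome counts into Pr(Play = outcome) for one player."""
    total_pa = sum(counts.values())
    return {play: n / total_pa for play, n in counts.items()}

dist = play_distribution(ALFONZO_2000)
print(round(dist["1B"], 3))  # probability of a single
```

Each probability is simply the outcome count divided by total PA, so the distribution sums to 1 by construction.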
Bell makes an out to end the 3rd inning and so the situation after the play represents the beginning of the 4th
inning. No more runs can score in an inning after the third out has been recorded.

Observation: 1 run scored
State: Before equals Bordick (#8 in lineup), (1, 1, 0), 2nd inning, 0 outs, Play equals 1B, and After
equals Hampton (#9 in lineup), (1, 1, 0), 2nd inning, 0 outs.
Bordick hits a single, the runner on 2nd base scores, the runner on 1st base advances to 2nd base, and Bordick
advances to 1st base. Thus, we still have runners on 1st and 2nd base after the play.

Observation: 2 runs scored
State: Before equals Ventura (#6 in lineup), (1, 1, 0), 7th inning, 2 outs, Play equals 3B, and After
equals Payton (#7 in lineup), (0, 0, 1), 7th inning, 2 outs.
Ventura hits a triple, the two baserunners before the play score, and Ventura advances to 3rd base.

Observation: 3 runs scored
State: Before equals Agbayani (#5 in lineup), (1, 1, 1), 9th inning, 1 out, Play equals 2B, and After
equals Ventura (#6 in lineup), (0, 1, 0), 9th inning, 1 out.
Agbayani hits a double, all three baserunners before the play are able to score, and Agbayani advances to
2nd base.

Observation: 4 runs scored
State: Before equals Piazza (#3 in lineup), (1, 1, 1), 5th inning, 1 out, Play equals HR, and After equals
Zeile (#4 in lineup), (0, 0, 0), 5th inning, 1 out.
Piazza hits a homerun, all three baserunners before the play score, and Piazza also scores automatically,
thus totaling 4 runs on the play.

Transition Probabilities:
T(S, S') represents the probability that state S transitions to state S'. T(S, S') > 0 only if the After situation
(After the Play) in S equals the Before situation (Before the Play) in S'. Formally, T(S, S') = the probability
that the Play value in S' (e.g. 2B) will occur, times the probability that the After situation in S (or the
Before situation in S') will lead to the After situation in S' given that the Play value in S' has occurred.
As an example, let S be: Before the Play equals Hampton (#9 in lineup), (0, 0, 0), 8th inning, 0 outs, Play
equals 2B, and After the Play equals Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs. Let S' be:
Before the Play equals Alfonzo (#1 in lineup), (0, 1, 0), 8th inning, 0 outs, Play equals 1B, and After
the Play equals Bell (#2 in lineup), (1, 0, 0), 8th inning, 0 outs. Notice that the After situation in state S
equals the Before situation in state S'. In state S, Hampton hits a double and Alfonzo is now the batter.
Next, in state S', Alfonzo hits a single, Hampton scores, and Bell is the next batter.
The transition probability T(S, S') = Pr(Play = 1B) x Pr(X' = (1, 0, 0) | X = (0, 1, 0), Play = 1B), i.e. the
probability that Alfonzo hits a single times the probability that the runner on 2nd base (Hampton) advances
to Home given that Alfonzo has hit a single. The question remains, how do we calculate this transition
probability? This depends on the particular batter involved (in this case, Edgardo Alfonzo). Recall
Alfonzo's 2000 stats, now given below in the form of proportions or probabilities:
| Pr(BB/HBP) | Pr(1B) | Pr(2B) | Pr(3B) | Pr(HR) | Pr(Out) | Pr(DP) |
| --- | --- | --- | --- | --- | --- | --- |
| .154 | .168 | .062 | .003 | .038 | .557 | .018 |
The only problem with these probabilities is that a DP (double play) can only occur with fewer than 2 outs
(there is a limit of 3 outs per inning) and at least one runner on base (usually 1st base). The other plays can
occur in any situation, but their probabilities will change (though their relative proportions will stay the
same) when Pr(DP) = 0. Based on batting stats (called splits) at www.retrosheet.org, I found that about
60% of any batter's total PA occur when there are 2 outs or no baserunners (or both). What this means, for
example, is that about 60% of Alfonzo's 650 PA (390) in 2000 occurred when there were either 2 outs or no
baserunners, i.e. when Pr(DP) = 0.
For any batter (in this case, Alfonzo), the equation becomes Pr(DP) = (.60 x 0) + (.33 x Y) + (.07 x Z) = .018.
Y represents the Pr(DP) when there is at least one baserunner, one of whom is on 1st base, and fewer
than 2 outs (about 33% of the PA). Z represents the Pr(DP) when there is at least one baserunner, none of
whom are on 1st base, and fewer than 2 outs (about 7% of the PA). I have set Z equal to .03 because this
rarely occurs (for any batter, including Alfonzo). We can then solve for Y: since .07 x .03 = .0021,
Y = (.018 - .0021) / .33 = .048. Z is a constant value for all batters whereas Y varies with the particular batter.
The following table presents my results for Alfonzo:
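| Situation | Pr(BB/HBP) | Pr(1B) | Pr(2B) | Pr(3B) | Pr(HR) | Pr(Out) | Pr(DP) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| No baserunners or 2 outs | .157 | .171 | .063 | .003 | .039 | .567 | 0 |
| Runner on 1st base, fewer than 2 outs | .149 | .163 | .060 | .003 | .037 | .540 | .048 (Y) |
| Baserunner(s) but none on 1st base, fewer than 2 outs | .152 | .166 | .061 | .003 | .038 | .550 | .03 (Z) |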
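
The situational probabilities for Alfonzo can be reproduced with a short calculation. The sketch below is one reading of the procedure described above: it solves the Pr(DP) equation for Y and then rescales the other plays so that their relative proportions stay the same. The function names `solve_y` and `situational` are illustrative assumptions, not the author's code.

```python
# Overall 2000 proportions for Alfonzo (from the text).
OVERALL = {"BB/HBP": .154, "1B": .168, "2B": .062, "3B": .003,
           "HR": .038, "Out": .557, "DP": .018}

# Shares of PA by situation and the constant Z (from the text).
SHARE_NO_DP = .60          # 2 outs or no baserunners: DP impossible
SHARE_RUNNER_ON_1ST = .33  # runner on 1st, fewer than 2 outs
SHARE_OTHER_RUNNERS = .07  # baserunner(s) but none on 1st, fewer than 2 outs
Z = .03

def solve_y(pr_dp, z=Z):
    """Solve pr_dp = (.60 x 0) + (.33 x Y) + (.07 x Z) for Y."""
    return (pr_dp - SHARE_OTHER_RUNNERS * z) / SHARE_RUNNER_ON_1ST

def situational(overall, pr_dp_here):
    """Rescale the non-DP plays to sum to 1 - pr_dp_here,
    keeping their relative proportions unchanged."""
    scale = (1 - pr_dp_here) / (1 - overall["DP"])
    dist = {play: p * scale for play, p in overall.items() if play != "DP"}
    dist["DP"] = pr_dp_here
    return dist

y = solve_y(OVERALL["DP"])
no_dp = situational(OVERALL, 0.0)
runner_on_first = situational(OVERALL, y)
other_runners = situational(OVERALL, Z)
```

Rounding `no_dp["1B"]`, `runner_on_first["1B"]`, and `other_runners["1B"]` to three decimals gives the Pr(1B) values quoted for the three situations.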
Returning to the question of how we calculate T(S, S'), we now know that Pr(Play = 1B) = .166 because
Alfonzo is batting with a runner on 2nd base and 0 outs (fewer than 2 outs). But what about the Pr(X' = (1, 0,
0) | X = (0, 1, 0), Play = 1B)? The following discussion will explain how to calculate the probability that
some After situation in S will lead to some After situation in S', given that a certain play has occurred.
Given a certain play (e.g. 1B) and an After situation in S (e.g. Alfonzo, X = (0, 1, 0), 8th inning, 0 outs), we
can directly determine the number of outs in the After situation in S'. The non-outs (BB/HBP, 1B, 2B, 3B,
HR) keep the number of outs unchanged. A single out results in 1 more out, and a DP results in 2 more
outs. Once we reach 3 outs from one of these two plays, we automatically reset the number of outs to
zero, the X' component (in the After situation in S') to (0, 0, 0), and the current inning to the next inning.
X', however, can often equal one of several possibilities, and so we need to use probabilities. When the
play is a BB/HBP, 3B, or HR, there is only one possibility:
Let X = (n1, n2, n3) in the After situation in S (or the Before situation in S')
BB/HBP:
If n1 = 1 and n2 = 1 then X' = (1, 1, 1) and runs scored = n3
If n1 = 1 and n2 = 0 then X' = (1, 1, n3) and runs scored = 0
Else X' = (1, n2, n3) and runs scored = 0
3B:
X' = (0, 0, 1) and runs scored = n1 + n2 + n3
HR:
X' = (0, 0, 0) and runs scored = 1 + n1 + n2 + n3
Explanation: If a batter's play is a BB/HBP, then the batter advances to 1st base. If there is a runner on 1st
base before the play, then this runner moves to 2nd base. This continues like a chain reaction as long as there
are runners who are adjacent in the base sequence. For example, if there are runners on every base before
the BB/HBP, then the runner on 3rd base scores, the runner on 2nd base advances to 3rd base, the runner on
1st base advances to 2nd base, and the batter advances to 1st base. If a batter's play is a 3B, he advances to
3rd base and any baserunners existing before the play score. A HR is the same as a 3B except that the batter
advances to Home and also scores. Triples usually occur far less often than any other play.
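
The three deterministic cases above translate directly into code. The following is a minimal sketch; the function name `advance_forced` and the tuple encoding of the bases are illustrative choices rather than the author's implementation.

```python
def advance_forced(bases, play):
    """Return (new_bases, runs_scored) for plays with exactly one outcome.

    bases is (n1, n2, n3), where 1 means the base is occupied before the play.
    """
    n1, n2, n3 = bases
    if play == "BB/HBP":
        if n1 == 1 and n2 == 1:
            # Runners on 1st and 2nd are both forced; a runner on 3rd scores.
            return (1, 1, 1), n3
        if n1 == 1:
            # Only the runner on 1st is forced to 2nd.
            return (1, 1, n3), 0
        # Nobody is forced; the batter simply takes 1st base.
        return (1, n2, n3), 0
    if play == "3B":
        # Every baserunner scores and the batter stops at 3rd base.
        return (0, 0, 1), n1 + n2 + n3
    if play == "HR":
        # Every baserunner scores, and so does the batter.
        return (0, 0, 0), 1 + n1 + n2 + n3
    raise ValueError(f"{play} does not have a single deterministic outcome")

print(advance_forced((1, 1, 1), "BB/HBP"))
print(advance_forced((1, 1, 0), "3B"))
```

Plays with several possible outcomes (1B, 2B, Out, DP) are deliberately rejected here because they need probabilities.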
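When Play = 1B or 2B, there are often several possibilities for X', given an After situation in S (X):

Play = 1B

| X | X' | Runs | Probability |
| --- | --- | --- | --- |
| (0, 0, 0) | (1, 0, 0) | 0 | 1 |
| (1, 0, 0) | (1, 1, 0) | 0 | 0.85 |
| (1, 0, 0) | (1, 0, 1) | 0 | 0.15 |
| (0, 1, 0) | (1, 0, 0) | 1 | 0.8 |
| (0, 1, 0) | (1, 0, 1) | 0 | 0.2 |
| (0, 0, 1) | (1, 0, 0) | 1 | 1 |
| (1, 1, 0) | (1, 1, 0) | 1 | 0.68 |
| (1, 1, 0) | (1, 0, 1) | 1 | 0.12 |
| (1, 1, 0) | (1, 1, 1) | 0 | 0.2 |
| (0, 1, 1) | (1, 0, 0) | 2 | 0.8 |
| (0, 1, 1) | (1, 0, 1) | 1 | 0.2 |
| (1, 0, 1) | (1, 1, 0) | 1 | 0.85 |
| (1, 0, 1) | (1, 0, 1) | 1 | 0.15 |
| (1, 1, 1) | (1, 1, 1) | 1 | 0.2 |
| (1, 1, 1) | (1, 1, 0) | 2 | 0.68 |
| (1, 1, 1) | (1, 0, 1) | 2 | 0.12 |

Play = 2B

| X | X' | Runs | Probability |
| --- | --- | --- | --- |
| (0, 0, 0) | (0, 1, 0) | 0 | 1 |
| (1, 0, 0) | (0, 1, 1) | 0 | 0.7 |
| (1, 0, 0) | (0, 1, 0) | 1 | 0.3 |
| (0, 1, 0) | (0, 1, 0) | 1 | 1 |
| (0, 0, 1) | (0, 1, 0) | 1 | 1 |
| (1, 1, 0) | (0, 1, 1) | 1 | 0.7 |
| (1, 1, 0) | (0, 1, 0) | 2 | 0.3 |
| (0, 1, 1) | (0, 1, 0) | 2 | 1 |
| (1, 0, 1) | (0, 1, 1) | 1 | 0.7 |
| (1, 0, 1) | (0, 1, 0) | 2 | 0.3 |
| (1, 1, 1) | (0, 1, 1) | 2 | 0.7 |
| (1, 1, 1) | (0, 1, 0) | 3 | 0.3 |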
When Play = Out or DP, there are often several possibilities for X', given an After situation in S (X):
(If a DP occurs with 1 out or an Out occurs with 2 outs, then X' = (0, 0, 0) since now there are 3 outs)

Play = Out

| X | X' | Probability |
| --- | --- | --- |
| (0, 0, 0) | (0, 0, 0) | 1 |
| (1, 0, 0) | (1, 0, 0) | 0.95 |
| (1, 0, 0) | (0, 1, 0) | 0.05 |
| (0, 1, 0) | (0, 1, 0) | 0.9 |
| (0, 1, 0) | (0, 0, 1) | 0.1 |
| (0, 0, 1) | (0, 0, 0) | 0.5 |
| (0, 0, 1) | (0, 0, 1) | 0.5 |
| (1, 1, 0) | (1, 0, 1) | 0.6 |
| (1, 1, 0) | (0, 1, 1) | 0.1 |
| (1, 1, 0) | (1, 1, 0) | 0.3 |
| (0, 1, 1) | (0, 1, 1) | 0.5 |
| (0, 1, 1) | (0, 1, 0) | 0.5 |
| (1, 0, 1) | (1, 0, 0) | 0.5 |
| (1, 0, 1) | (1, 0, 1) | 0.5 |
| (1, 1, 1) | (1, 1, 1) | 0.4 |
| (1, 1, 1) | (1, 1, 0) | 0.3 |
| (1, 1, 1) | (1, 0, 1) | 0.3 |

Play = DP

| X | X' | Probability |
| --- | --- | --- |
| (1, 0, 0) | (0, 0, 0) | 1 |
| (0, 1, 0) | (0, 0, 0) | 1 |
| (0, 0, 1) | (0, 0, 0) | 1 |
| (1, 1, 0) | (0, 0, 1) | 0.7 |
| (1, 1, 0) | (0, 1, 0) | 0.2 |
| (1, 1, 0) | (1, 0, 0) | 0.1 |
| (0, 1, 1) | (0, 0, 1) | 0.5 |
| (0, 1, 1) | (0, 1, 0) | 0.5 |
| (1, 0, 1) | (0, 0, 0) | 1 |
| (1, 1, 1) | (0, 1, 1) | 0.45 |
| (1, 1, 1) | (0, 0, 1) | 0.5 |
| (1, 1, 1) | (1, 1, 0) | 0.05 |
Going back to our example, the above tables tell us that there are only two possibilities for X' given that X
= (0, 1, 0) and Play = 1B; either X' = (1, 0, 0) or (1, 0, 1), meaning that either Hampton scores or only
advances to 3rd base. The former is more probable, i.e. Pr(X' = (1, 0, 0) | X, 1B) = 0.8. Thus, T(S, S') = .166
x .8 = .1328.
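
The worked example can be checked in a few lines. This is only a sketch of the product rule for T(S, S'): the dictionary `SINGLE_ADVANCE` holds just the one row of the 1B table needed here, and `transition_probability` is a hypothetical helper name.

```python
# Pr(X' | X, Play = 1B) for X = (0, 1, 0), using the values quoted in the text.
SINGLE_ADVANCE = {
    (0, 1, 0): {(1, 0, 0): 0.8, (1, 0, 1): 0.2},
}

def transition_probability(pr_play, advance_table, bases_before, bases_after):
    """T(S, S') = Pr(Play) x Pr(X' | X, Play); 0 if X' cannot follow X."""
    return pr_play * advance_table[bases_before].get(bases_after, 0.0)

# Alfonzo bats with a runner on 2nd and fewer than 2 outs, so Pr(Play = 1B) = .166.
t = transition_probability(0.166, SINGLE_ADVANCE, (0, 1, 0), (1, 0, 0))
print(round(t, 4))
```

An After situation that does not appear in the table for the given X yields a transition probability of 0.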
The result of any baseball game (in the news) is typically described as "Home Team 5, Visiting
Team 3" (for example); this means that the home team won the game because it scored 5 runs,
which was more than what the visiting team scored (3 runs). Baseball fans often look at a
scoreboard, which is a listing of different games and their current results, and see (for example)
that the score is 6-2 in the fifth inning of one game. The scoreboard also often tells them how
many runs were scored by each team in each inning.
However, if they did not see this particular game (and the game log is unavailable), they can only
guess what sequence of plays led to the resulting score in each inning. Similarly, the observable
parameter in our HMM is a sequence of runs scored, and one challenge is to determine the most
likely sequence of states (including plays) that could account for a certain observed sequence. The
states in our HMM are the hidden parameters because different situations and plays can account
for the same observed sequence.
Another interesting question that has mostly theoretical value (no practical application) is: Given
a certain lineup of hitters, what is the probability of a certain observation sequence of runs scored
(during a single game)? We will discuss a similar question that does have practical value; namely,
what will be the sum total of runs scored (on average) per game, given a certain lineup? The
HMM is designed so that we can predict (or estimate) the answer to this latter question, both in
theory and through actual simulations. The following sections discuss this key issue in detail, and
reach a conclusion on the HMM's accuracy.
based on the probability distributions generated from the actual 2000 stats. As a specific
example, if we did this for the 2000 NY Mets, we would expect Edgardo Alfonzo (or any
other player on the team) to have (on average) the same proportion of singles, outs, etc.
that he had in actuality during the 2000 season. Similarly, if the model is accurate, we
would expect the runs scored in different simulations to be (on average) equal to the
actual runs scored by the 2000 NY Mets.
This idea of estimating after the fact is also the basis for the design of mathematical
formulas that accurately estimate runs scored from actual team stats. There are linear and
non-linear formulas. The following two formulas are examples of each type, respectively:
XR (Extrapolated Runs) = (.50 x 1B) + (.72 x 2B) + (1.04 x 3B) + (1.44 x HR) + (.34 x BB/HBP) + (-.096 x Outs)
BsR (Base Runs) = [A * B/(B + Outs)] + HR (A and B represent baserunners and advancement, respectively)
A = 1B + 2B + 3B + BB/HBP
B = [2*HR + 3.6*3B + 2.2*2B + .8*1B + .1*BB/HBP] * 1.02
The XR formula takes the total number of 1B, 2B, 3B, etc. hit by a team during a certain
season and estimates the team's runs scored using linear weights for each stat. As an
example, the 2000 Mets had 945 singles, 282 doubles, 20 triples, 198 HRs, 720 BB/HBP,
and 4330 Outs. XR estimates that the team scored .5*945 + .72*282 + ... = 811 runs (they
actually scored 807 runs). Note: the XR formula is fitted to data before the 2000 season.
B/(B+Outs) in the BsR formula represents the percentage of baserunners (A) who
advance to Home plate (i.e. score). After hitting a HR, a batter automatically scores and
does not become a baserunner. Thus, we separately add the total number of HRs.
David Smyth, the creator of BsR, stated (6/6/05): Runs equals baserunners (A) times the
proportion of baserunners who score (B/B+Outs), plus home runs. This statement is so
obviously true that some people have called it an 'identity' instead of an equation or theory.
This identity is certainly correct, but the problem is that we usually cannot determine the
exact proportion of baserunners who scored based on recorded stats like BB, HR, 2B, etc.
Unless we use a game log, or an ordered record of each play, we cannot tell the value
of this proportion. For example, the sequence 1B, DP, 2B, 3B yields one run but a
different sequence like 3B, 2B, 1B, DP may yield two runs, even though each sequence
records the same plays and the same number of baserunners (3). A Markov model
becomes useful here because it accounts for the variability in the run scoring process, as
it can produce different plays and orderings of plays from simulation to simulation.
BsR estimates (on average) how each of the recorded team stats affects the proportion of
baserunners who score. Sometimes (but not always), it estimates this proportion very
accurately. As an example, it calculates that about 31% (612) of the 1,967 baserunners for
the 2000 Mets scored runs. So it estimates that the team scored a total of 612 + 198 = 810
runs, only 3 runs above the actual total. Similarly, XR (like all linear formulas) estimates
the average run value of each stat. Since a HR automatically scores at least one run, and
may score up to 3 more baserunners, its expected run value is about 1.44.
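
Both formulas are easy to evaluate directly. The sketch below applies XR and BsR, exactly as written above, to the 2000 Mets totals quoted in the text; the function and dictionary names are illustrative.

```python
# 2000 NY Mets team totals (from the text).
METS_2000 = {"1B": 945, "2B": 282, "3B": 20, "HR": 198,
             "BB/HBP": 720, "Outs": 4330}

def extrapolated_runs(s):
    """XR: a linear-weights estimate of runs scored."""
    return (.50 * s["1B"] + .72 * s["2B"] + 1.04 * s["3B"]
            + 1.44 * s["HR"] + .34 * s["BB/HBP"] - .096 * s["Outs"])

def base_runs(s):
    """BsR: baserunners (A) times the share who score, B/(B + Outs), plus HR."""
    a = s["1B"] + s["2B"] + s["3B"] + s["BB/HBP"]
    b = (2 * s["HR"] + 3.6 * s["3B"] + 2.2 * s["2B"]
         + .8 * s["1B"] + .1 * s["BB/HBP"]) * 1.02
    return a * b / (b + s["Outs"]) + s["HR"]

xr = extrapolated_runs(METS_2000)
bsr = base_runs(METS_2000)
print(round(xr), round(bsr))
```

Rounded to whole runs, these reproduce the estimates of 811 (XR) and 810 (BsR) against the actual total of 807.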
If we let R = actual runs scored by a certain team, then a linear regression formula may
give us a good estimate of R based on the actual team stats (R is the dependent variable
and BB/HBP, 1B, 2B, etc. are the independent variables). If I simulated my model (for a
certain team after the fact) about 100 times, the sample mean runs scored (which is an
unbiased estimator of the HMM's expected value) would hopefully estimate R more
precisely. R is often difficult for mathematical formulas to accurately estimate because
they do not account for the randomness in play orderings as well as the context of innings
and games, both of which are captured by the HMM.
S4: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals 3B, and After equals (#2 in
lineup), (0, 0, 1), 1st inning, 0 outs.
S5: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals HR, and After equals (#2 in
lineup), (0, 0, 0), 1st inning, 0 outs.
S6: Before equals (#1 in lineup), (0, 0, 0), 1st inning, 0 outs, Play equals Out, and After equals (#2 in
lineup), (0, 0, 0), 1st inning, 1 out.
Before the first play of the game, there are no baserunners (0, 0, 0) and 0 outs. There are
6 possible plays in this situation and 5 possible situations after the play. The initial state
probability distribution equals the first batter's Play probability distribution when there
are no baserunners. For example, Pr(Initial State = S1) = Pr(Play = BB/HBP). The
reward or observation in each of these possible initial states is 0 runs, except S5.
Because a HR with no baserunners scores 1 run, S5 has a reward of 1 run. Let R(S) =
reward in state S. Then the following formula calculates E(H):
E(H) = Σ_i (Pr(Si) x U(Si)), summed over the initial states Si
U(S) = R(S) + Σ_S' (T(S, S') x U(S')), summed over the states S' that can follow S
Si stands for some initial state (1 <= i <= 6) and U(S) stands for utility of S. Once we
transition to a state in which there are 3 outs recorded in the 9th inning after a certain play
(either an Out or DP), we can no longer transition to another state (i.e. the game is over).
The interested reader is encouraged to try and output results from this formula, but
beware: an infinite (or very long) sequence of state transitions is improbable but possible.
The average sequence probably contains about 40-45 transitions.
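
To make the recursion concrete, here is a minimal sketch on a toy chain. Everything in `TOY_STATES` and `INITIAL` is hypothetical (two batters who can only hit a HR or make an Out); it is not the real state space, and the recursion terminates only because this toy chain has no cycles.

```python
from functools import lru_cache

# HYPOTHETICAL toy chain: state -> (reward R(S), {next state S': T(S, S')}).
TOY_STATES = {
    "b1_HR":  (1, {"b2_HR": 0.1, "b2_Out": 0.9}),
    "b1_Out": (0, {"b2_HR": 0.1, "b2_Out": 0.9}),
    "b2_HR":  (1, {}),   # terminal: no further transitions
    "b2_Out": (0, {}),   # terminal: no further transitions
}

# HYPOTHETICAL initial state distribution Pr(Si).
INITIAL = {"b1_HR": 0.1, "b1_Out": 0.9}

@lru_cache(maxsize=None)
def utility(state):
    """U(S) = R(S) + sum over S' of T(S, S') x U(S')."""
    reward, transitions = TOY_STATES[state]
    return reward + sum(p * utility(nxt) for nxt, p in transitions.items())

def expected_runs(initial):
    """E(H) = sum over initial states Si of Pr(Si) x U(Si)."""
    return sum(p * utility(s) for s, p in initial.items())

e_h = expected_runs(INITIAL)
print(round(e_h, 3))
```

Memoizing `utility` matters in a larger model because many different paths pass through the same state.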
I, however, have decided not to do this (for now) due to time constraints and the
previously mentioned factors (in actual simulations of this HMM, I have accounted for
and/or incorporated those four factors so that I get accurate results when running the
model, much like I would if I implemented these factors in the mathematical calculation
of E(H) ). Instead, I will use the sample mean runs scored from my HMM, which is an
unbiased estimator of E(H), to test the accuracy of my HMM. I did 100 simulations for
each of the 16 teams in the National League (a league is a group of teams), and compared
each team's mean runs scored in simulation to their actual runs scored. For after the
fact predictions, I hoped that the sample mean would be as good or perhaps better than
any mathematical formula (e.g. XR) that estimates runs scored. So I did 100 after the
fact simulations for each team and for each season between 2000-2004. Thus, I did this
for a total of 16 x 5 = 80 teams (the 2001 Mets and 2000 Mets are different teams). I
calculated the difference (non-absolute and absolute) between each team's average runs
scored and their actual runs scored. This difference can also be called error. If E(H) gives
a good estimate of R, meaning that the HMM is an accurate model of the actual run
scoring process, then this error (on average) should be very small. The table below gives
the mean non-absolute error and the mean absolute error for the HMM as well as the two
formulas XR and BsR:
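| Method | Mean non-absolute error | Mean absolute error |
| --- | --- | --- |
| HMM | -5.2 | 16.5 |
| XR | +17.0 | 21.2 |
| BsR | +21.6 | 24.5 |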
As an example of how to calculate these errors, the 2000 Mets scored an average of 829
runs in my 100 simulations, and in actuality scored 807 runs. So non-absolute error = 829 -
807 = +22 runs. The absolute error is the absolute value of +22, which is 22. Since we
would like the average difference between the estimate and R to be 0, meaning the
expected value of the non-absolute error would be 0, the HMM seems to do a job that is
comparable to (if not better than) both estimation formulas. But since this is a sample of
80 teams, this data does not necessarily prove my HMM's superiority to these formulas.
In addition, its non-absolute error is still somewhat significant, and so it still needs
improvement (not much, though).
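
The two error measures can be sketched as follows. The 2000 Mets figures (829, 811, 810, and 807) come from the text; the second pair passed to `mean_errors` is a made-up value included only to show how signed errors can cancel while absolute errors cannot. The function names are illustrative.

```python
ACTUAL_2000_METS = 807
ESTIMATES_2000_METS = {"HMM": 829, "XR": 811, "BsR": 810}  # from the text

def errors(estimate, actual):
    """Return (non-absolute error, absolute error) for one team."""
    signed = estimate - actual
    return signed, abs(signed)

def mean_errors(pairs):
    """Mean non-absolute and mean absolute error over (estimate, actual) pairs."""
    signed = [estimate - actual for estimate, actual in pairs]
    n = len(signed)
    return sum(signed) / n, sum(abs(d) for d in signed) / n

per_method = {name: errors(est, ACTUAL_2000_METS)
              for name, est in ESTIMATES_2000_METS.items()}

# The second pair (800, 807) is HYPOTHETICAL, used only for illustration.
mean_signed, mean_abs = mean_errors([(829, 807), (800, 807)])
print(per_method, mean_signed, mean_abs)
```

A mean non-absolute error near 0 indicates little systematic bias, while the mean absolute error measures the typical size of a miss.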
To make future improvements, I would need to expand the HMM to account for minor
(yet important) factors such as baserunning speed, stolen bases, caught stealing, errors,
wild pitches, passed balls, runners thrown out, triple plays, etc. to see whether its
performance improves significantly. In order to appreciate how incorporating these
factors in the HMM would likely improve its accuracy, the reader must first understand
what these factors are and how they influence the gameplay in baseball. The following
link is a good beginner's reference on baseball: http://en.wikipedia.org/wiki/Baseball.
YEAR: 2000
SEASON 1
Hits: 165
2B: 34
Hits: 111
3B: 3
2B: 21
HR: 29
3B: 1
RBI:
HR: 30
Hits: 153
Hits: 171
2B: 29
2B: 34
3B: 0
3B: 2
Hits: 179
TEAM: NY Mets
YEAR: 2000
SEASON 2
Hits: 142
2B: 33
Hits: 107
Hits: 149
Hits: 138
2B: 53
2B: 27
2B: 23
Hits: 196
HR: 20
3B: 2
3B: 3
2B: 25
HR: 34
RBI:
HR: 19
HR: 29
3B: 1
3B: 0
3B: 0
RBI:
HR: 24
HR: 29
HR: 8
2B: 49
3B: 1
RBI:
HR: 29
Notice that Alfonzo averages 24 HR and 108 BB/HBP in the two simulations; these are
comparable to his actual totals in the 2000 season (25 HR, 100 BB/HBP). If we had
simulated the HMM for the 2000 season based on the probability distributions generated
from Alfonzo's and other players' cumulative stats from 1997-1999 (before the fact),
then we would generally get different results. For example, Alfonzo had only hit more
than 20 HRs in a season once before 2000, and thus our simulator would likely output a
result of less than 20 HRs for him. Here is an example of such a simulation:
TEAM: NY Mets
YEAR: 2000
SEASON 1
2B: 29
3B: 6
HR: 15
Hits: 171
2B: 35
3B: 3
HR: 14
RBI:
Hits: 164
2B: 24
3B: 4
HR: 33
RBI:
Hits: 161
2B: 23
3B: 0
HR: 31