1 Introduction
suits), the dealer's card, and the player's score (if we define the score as the sum of the cards), we have on the order of 6.5 billion states¹, more than the number of breaths an average human takes in a lifetime. Even storing the optimal policy would be a challenge. Consider now that most casinos have 5 or 6 decks at a table.
Hence we use approximate dynamic programming (ADP) to solve the problem. The advantage of this method is that it gives a standard way of dealing with the problem and is easy to adapt to different casino rules. The main disadvantage of the ADP methodology is that it requires simulation in order to compute the optimal action. A complementary line of work to consider is applying machine learning algorithms to the resulting policy in order to obtain rules that approximate it. The approximate dynamic programming scheme we use is the smoothed approximate linear program (S-ALP) described in [5].
We compare three different policies: the Wikipedia policy, the simple S-ALP and the smart S-ALP. The Wikipedia policy is the result of a player following the strategy chart that can be found in the Wikipedia entry on Blackjack. The simple S-ALP is a policy with just a constant basis function, representing a player that only simulates the effect of the next stage. The smart S-ALP is a policy where the basis functions are chosen based on what would be important for the player.
The report is organized as follows. First, in Section 2 we describe the model we use, covering the controls, state space, probability matrix and dynamic programming formulation. Next, we describe the approximate dynamic programming scheme we use in some detail. The implementation, results and conclusions end the report.
2 Blackjack Model
In order to understand the model, we begin by explaining the sequence of events in a blackjack game. First, the player must decide how much to bet. Once the player has made his bet, he is dealt two cards, and the dealer is dealt a face-down card and a face-up card. Here the player can take four different actions²: Hit, Stand, Double Down and Surrender.

¹ Deck configurations $\times$ dealer's face-up card $\times$ player's sum of cards $= (5^9 \cdot 17) \cdot 10 \cdot 21$.
² In reality there are five different actions, but the game was simplified to not include Split for the purposes of this project.
2.1 States
A state in the problem consists of the dealer's face-up card, the player's cards, the deck configuration and the current bet.

The dealer's face-up card is straightforward to describe: we use a numerical value, d, between 1 and 10.
The player's cards can be stored as two values, (p, a), with p being the sum of the cards in the player's hand and a the number of aces held by the player. This significantly reduces the possible number of states as opposed to storing each of the cards received by the player.

Figure 3: Strategy followed after the first action. H: Hit, Su: Surrender, S: Stand, D: Double Down.
The deck configuration, c: we store the number of cards of each value still left in the deck. This requires an array of length 10 composed of numerical values. For example, if there are only 13 cards scoring 10 points and 5 cards scoring 4 points left, the deck configuration is c = (0, 0, 0, 0, 5, 0, 0, 0, 0, 13).

The current bet, b, can be stored as its numerical value.
Therefore we can describe the state space as the following set:

$$S = \{(d, (p, a), c, b) \in \mathbb{N} \times \mathbb{N}^2 \times \mathbb{N}^{10} \times B \mid d \in [1, 10],\ p \in [2, 20],\ a \in [0, 2],\ c \in [0, 4n]^9 \times [0, 16n]\}$$

where B is the set of possible bets and n is the number of decks used by the dealer. The size of the state space is considerable: for a dealer using three decks we have $|S| \geq 10^{13}$, a number larger than the number of stars estimated to be in the Milky Way.

Figure 4: Flow chart of the model. Purple represents the decision of the player; the rest is the transition process to a new state.
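To fix ideas, the following is a minimal Python sketch of this state representation (an illustration only; the names are ours and not taken from the actual implementation):

from typing import NamedTuple, Tuple

class State(NamedTuple):
    """A state (d, (p, a), c, b) of the blackjack model."""
    dealer_card: int        # d: dealer's face-up card, 1-10
    player_sum: int         # p: sum of the cards in the player's hand
    player_aces: int        # a: number of aces held by the player
    deck: Tuple[int, ...]   # c: cards of each value (1-10) still left in the deck
    bet: float              # b: the current bet

# Example state using the deck configuration given above; the dealer card,
# player sum and bet are arbitrary illustrative values.
example = State(dealer_card=7, player_sum=13, player_aces=0,
                deck=(0, 0, 0, 0, 5, 0, 0, 0, 0, 13), bet=1.0)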
2.2 Control
2.3 Costs
There are two factors that contribute to the cost of a control given a state: the next bet, and the expected profit of the action taken. Hence the cost of a control $u = (b', a)$ given a state $x$ can be written as:

$$g(x, u) = \begin{cases} b' - 2b\,\mathbb{P}(\mathrm{Win} \mid x) & \text{if } a = \text{Stand or Hit} \\ b' + b - 4b\,\mathbb{P}(\mathrm{Win} \mid x) & \text{if } a = \text{Double Down} \\ b' - 0.5b & \text{if } a = \text{Surrender} \end{cases}$$

Here $b'$ represents the next bet (part of the control), and $b$ represents the previous bet.
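As a hedged illustration of this cost structure, the cases above can be coded directly; the win-probability argument below stands in for whatever estimate of $\mathbb{P}(\mathrm{Win} \mid x)$ the simulator produces and is not a function from the report:

def stage_cost(next_bet: float, bet: float, action: str, p_win: float) -> float:
    """Cost g of the control (next_bet, action), given an estimate p_win of P(Win | x).

    Lower cost is better: the cost is money put on the table minus the
    expected amount returned by the action, following the cases above.
    """
    if action in ("Stand", "Hit"):
        return next_bet - 2.0 * bet * p_win        # a win returns twice the bet
    if action == "DoubleDown":
        return next_bet + bet - 4.0 * bet * p_win  # extra bet b, a win returns 4b
    if action == "Surrender":
        return next_bet - 0.5 * bet                # half the bet is returned
    raise ValueError(f"unknown action: {action}")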
2.4 Transitions
Given that the post-game actions, as well as the dealer's actions, can lead to many different states, an analytical expression for all the reachable states and their transition probabilities is hard to obtain. However, since we use simulation to approximate the expected values of the basis functions as well as the expected costs, such an analytical expression is not necessary.
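Because transitions are only ever sampled, a simulator that draws cards from the current deck configuration is all that is required. A minimal sketch of that sampling primitive, assuming the 10-slot deck representation from Section 2.1 (the function name is ours):

import random

def draw_card(deck: list) -> int:
    """Sample one card value (1-10) in proportion to how many of each value
    remain, and remove it from the deck configuration."""
    if sum(deck) == 0:
        raise RuntimeError("deck is empty")
    value = random.choices(range(1, 11), weights=deck, k=1)[0]
    deck[value - 1] -= 1
    return value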
2.5 Objective Function
Both formulations, average cost and discounted cost, could be used to formulate the objective function of our dynamic programming problem. We decided to formulate the problem as a discounted maximization problem. The main reason for this is that it is the criterion of choice in the paper whose methodology we follow to solve the ADP [5]. Also, after consulting V. Farias about why Tetris is solved in [3] as a discounted profit problem rather than a stochastic shortest path problem, he mentioned that the discounted cost setting has better numerical stability.
2.6 Infinite Stages
In order to use the discounted profit setting, we need to have infinitely many stages. Of course this cannot happen with a finite number of decks unless we decide to shuffle the deck and reset the game. This is exactly what we do: for a fixed number of cards l, if the deck has l or fewer cards at the beginning of a game, then the deck is reshuffled and the new game is drawn from a freshly reshuffled deck. It is interesting to examine how the profit of the player changes based on this number.
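The reshuffle rule takes only a couple of lines; in the sketch below, threshold plays the role of the fixed number l and the 4n/16n counts follow the deck representation of Section 2.1:

def maybe_reshuffle(deck: list, n_decks: int, threshold: int) -> list:
    """At the start of a game, reset the deck if too few cards remain,
    so that the process has infinitely many stages."""
    if sum(deck) <= threshold:
        # Fresh deck: 4n cards of each value 1-9 and 16n ten-valued cards.
        return [4 * n_decks] * 9 + [16 * n_decks]
    return deck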
Given the model described, we can now formulate the dynamic programming problem associated with it. In this case the Bellman equation is the following:

$$J((d, (p, a), c, b)) = \min_{u \in U} \Big\{ g((d, (p, a), c, b), u) + \alpha \sum_{s \in S((d,(p,a),c,b),u)} p_{(d,(p,a),c,b)\,s}(u)\, J(s) \Big\}$$

where $\alpha$ is the discount factor and $S((d,(p,a),c,b),u)$ is the set of states reachable under control $u$.
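Since the expectation in the Bellman equation is approximated by simulation, the greedy action is computed roughly as in the sketch below. This is a schematic of the procedure rather than the report's code; sample_next, cost_fn and value_fn stand for the simulator, the stage cost and the fitted approximation $\Phi r$:

def greedy_action(state, controls, sample_next, cost_fn, value_fn,
                  alpha=0.95, n_samples=200):
    """Pick the control u minimizing a Monte Carlo estimate of
    g(x, u) + alpha * E[J(next state)], mirroring the Bellman equation above."""
    best_u, best_q = None, float("inf")
    for u in controls:
        future = sum(value_fn(sample_next(state, u)) for _ in range(n_samples))
        q = cost_fn(state, u) + alpha * future / n_samples
        if q < best_q:
            best_u, best_q = u, q
    return best_u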
3.1 Smoothed Approximate Linear Programming
There are several methods in the approximate dynamic programming literature for computing the coefficients r when approximating the value function. The method we decided to use is the smoothed approximate linear program, following the paper by V. V. Desai, V. F. Farias and C. C. Moallemi [5]. The reason for this choice is that, rather than relying on projections and fixed points, it uses linear programming to compute the coefficients r. As MIT students we have access to IBM's CPLEX, and hence we can compute the coefficients r efficiently using this software.
The smoothed ALP (approximate linear program) approach is based on the fact that the solution to the following problem gives a solution to the Bellman equation:

$$\max \; J \qquad \text{s.t. } J \leq TJ$$

where we follow the operator notation $TJ(x) = \min_{u \in U} \{ g(x, u) + \alpha \sum_s p(x, s) J(s) \}$. Even though the inequality looks nonlinear, it is in fact linear, as we can replace it with the following set of inequalities:

$$J(x) \leq g(x, u) + \alpha \sum_s p(x, s) J(s) \qquad \forall u \in U$$
The reason this delivers the solution of the Bellman equation is that the optimization problem yields the largest $J$ satisfying $J \leq TJ$, which is precisely the fixed point of $T$.

Moreover, we can replace the objective coefficients by any positive weights, or a probability distribution, and the solution of the resulting LP still yields the solution to the Bellman equation. The ALP approach suggests solving the following problem:
$$\max \; c' J \qquad \text{s.t. } J \leq TJ,\quad J \in \mathrm{span}(\Phi)$$

where $\Phi$ is the matrix of the basis functions we want to use to approximate the value function. Equivalently, this problem can be written as the following LP:

$$\max \; c' \Phi r \qquad \text{s.t. } \Phi r \leq T \Phi r$$
A good reference for the ALP is given by D. P. de Farias and B. Van Roy in [2]. Given the large number of constraints of this LP, in contrast with the small number of variables, constraint sampling can be used. Hence we only need to use a small portion of the constraints in order to obtain a fairly good value for r. An improvement on this method is to bootstrap the policies and re-solve the ALP, as shown in [3].
Further improvements to the solution can be made: in [5], the authors use the smoothed linear program, which solves:

$$\max_{r, s} \; c' \Phi r \qquad \text{s.t. } \Phi r \leq T \Phi r + s,\quad \pi' s \leq \theta,\quad s \geq 0$$

where $s$ is a vector of slack variables, $\pi$ is a probability distribution over states and $\theta \geq 0$ is a budget on the allowed constraint violation.
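To make the formulation concrete, the following is a small, hedged sketch of how such a sampled, smoothed ALP could be assembled and solved with an off-the-shelf LP solver. We use scipy's linprog purely for illustration (the actual computation described here used CPLEX), and the argument names such as expected_phi_next and theta are our own:

import numpy as np
from scipy.optimize import linprog

def solve_smoothed_alp(phi, expected_phi_next, expected_cost, c_weights, pi_weights,
                       alpha=0.95, theta=1.0):
    """Solve a sampled smoothed ALP:
        max  c' Phi r
        s.t. Phi(x_i) r <= g_hat(x_i, u) + alpha * E_hat[Phi(x') r] + s_i  for sampled (i, u)
             pi' s <= theta,  s >= 0,
    where the expectations are Monte Carlo estimates.

    phi:               (M, K) array of basis-function values at the M sampled states
    expected_phi_next: dict (i, u) -> length-K estimate of E[phi(next state)]
    expected_cost:     dict (i, u) -> estimate of g(state_i, u)
    c_weights, pi_weights: length-M arrays of state-relevance weights
    """
    M, K = phi.shape
    n_vars = K + M                                    # variables: r (K) then slacks s (M)

    # linprog minimizes, so negate the objective on r; slacks get zero weight.
    obj = np.concatenate([-(c_weights @ phi), np.zeros(M)])

    rows, rhs = [], []
    for (i, u), phi_next in expected_phi_next.items():
        row = np.zeros(n_vars)
        row[:K] = phi[i] - alpha * np.asarray(phi_next)   # Phi(x_i) r - alpha E[Phi(x') r]
        row[K + i] = -1.0                                 # ... - s_i <= g_hat(x_i, u)
        rows.append(row)
        rhs.append(expected_cost[(i, u)])

    # Smoothing budget: pi' s <= theta.
    rows.append(np.concatenate([np.zeros(K), np.asarray(pi_weights)]))
    rhs.append(theta)

    bounds = [(None, None)] * K + [(0, None)] * M         # r free, s >= 0
    res = linprog(obj, A_ub=np.array(rows), b_ub=np.array(rhs),
                  bounds=bounds, method="highs")
    return res.x[:K], res.x[K:]                            # (r, slacks)

If the sampled constraints are too sparse the LP may be unbounded; in practice enough states must be sampled for the problem to be well posed.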
3.2 Basis Functions
Based on knowledge of the game, we used two concepts to define our basis functions: deck load and cards-to-shuffle. The deck load is defined as the number of high-value cards remaining in the deck (A, K, Q, J, 10) minus the number of low-value cards remaining in the deck (2, 3, 4, 5, 6). A deck that has recently been shuffled has a deck load of zero. By cards-to-shuffle we mean the number of cards to be dealt before the deck is reshuffled.
Using these two notions we define our basis functions $\phi_0, \ldots, \phi_5$ as:

$\phi_0$: constant.

$\phi_1, \ldots, \phi_4$: the deck load combined with the bet, with a separate basis function for each combination of bet and sign of the deck load.

$\phi_5$: cards-to-shuffle.

The reason we separated the combinations of bets and signs of the deck load is that we are going to try to interpret the coefficients that multiply each basis function.
One of the reasons for settling on these basis functions, as opposed to functions that depend on the dealer's card or the player's cards, is that such values have very high variability, as they are basically an independent sample from the deck. They would therefore require a large number of simulations in order to obtain reasonable stability of $\mathbb{E}[\phi_i \mid (b, s)]$.
We compare the results we obtain by choosing these basis functions with the results we obtain by choosing only the constant basis function (meaning pure simulation).
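As an illustration, the two underlying quantities can be computed as below; since the exact way the bet and the sign of the deck load enter $\phi_1, \ldots, \phi_4$ is not spelled out, the sketch only computes the two raw features, and the reshuffle-threshold argument is our own assumption:

def deck_load(deck):
    """High cards remaining (aces in slot 1; tens, J, Q, K in slot 10) minus
    low cards remaining (values 2-6); a freshly shuffled deck has load zero."""
    return (deck[0] + deck[9]) - sum(deck[1:6])

def cards_to_shuffle(deck, reshuffle_threshold):
    """Cards that can still be dealt before the deck is reshuffled, assuming
    the deck is reset once at most reshuffle_threshold cards remain."""
    return max(sum(deck) - reshuffle_threshold, 0)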
4 Implementation
As we find a policy and start simulating it, we find that policies with a large discount factor tend to make inappropriate decisions when playing. After careful consideration, we realized that this mainly occurred due to misestimating future costs. Given the variability of outcomes in the next game, it is hard to appropriately estimate the next state and future costs without exhaustive sampling. Using a small enough discount factor provides better results while still taking future costs into account.
Figure 6: Profit collected after playing 1000 hands, for 100 different samples.
In Figure 5, the variability of the profits can be appreciated. Each of the dots represents the profit after playing a thousand hands. The variability of the profits is a well-known fact in the gambling community: independent of the level of skill (how good the policy is), positive profits are not guaranteed in the short run [1]. The performance of a policy can only be appreciated in the long term, which in this case means more than 1000 hands. The inability to quantify the performance of the policy reliably prevents us from taking full advantage of the S-ALP, as it is not easy to decide whether or not to increase the discount factor.
The average profit of the ADP policy with a small discount factor is 11.8 over 100 repetitions of 1000 games. The average profit of the policy found on the internet is -11.335 over 100 repetitions of 1000 games.
5 Conclusions
References
[1] American Publishing Corporation. How To Beat The Dealer in Any Casino. American Publishing Corporation, 2000.
[2] D. P. de Farias and B. Van Roy. The linear programming approach to approximate dynamic programming. Operations Research, 2003.
[3] V. F. Farias and B. Van Roy. Tetris: A study of randomized constraint sampling. In Probabilistic and Randomized Methods for Design Under Uncertainty. Springer, 2006.
[4] E. O. Thorp. Beat the Dealer: A Winning Strategy for the Game of Twenty-One. Vintage, 1966.
[5] V. V. Desai, V. F. Farias, and C. C. Moallemi. Approximate dynamic programming via a smoothed approximate linear program. Operations Research, 2009.