Bootstrap Student Presentation

Alternative Forecasting Methods: Bootstrapping
Bryce Bucknell
Jim Burke
Ken Flores
Tim Metts
Agenda
Scenario
Obstacles
Regression Model
Bootstrapping
Applications and Uses
Results
Scenario
You have been recently hired as the statistician for the University
of Notre Dame football team. You are tasked with performing a
statistical analysis for the first year of the Charlie Weis era.
Specifically, you have been asked to develop a regression model
that explains the relationship between key statistical categories
and the number of points scored by the offense. You have a
limited number of data points, so you must also find a way to
ensure that the regression results generated by the model are
reliable and significant.
Problems/Obstacles:
Central Limit Theorem

Replication of data
Sampling
Variance of error terms
Constrained by the Central Limit Theorem
In selecting simple random samples of size n from a population, the sampling distribution of the sample mean x̄ can be approximated by a normal probability distribution as the sample size becomes large. It is generally accepted that the sample size must be 30 or greater to satisfy the large-sample condition of the theorem.
[Figure: sampling distributions of the mean for sample sizes N = 1, 2, 3, and 4 (source 1)]
1. http://www.statisticalengineering.com/central_limit_theorem_(summary).htm
Central Limit Theorem
The Central Limit Theorem is the foundation for many statistical procedures: the distribution of the phenomenon under study does NOT have to be normal, because its sample average WILL tend toward a normal distribution as the sample size grows.
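A minimal sketch of this idea, assuming an exponential population (chosen here only because it is clearly non-normal; it is not from the slides). The helper names `sample_means` and `skewness` are hypothetical. The skewness of the sample means should shrink toward 0, the value for a normal distribution, as n grows.

```python
import random
import statistics

def sample_means(n, trials=5000, seed=0):
    """Draw `trials` samples of size n from an exponential population
    (an assumed, strongly skewed distribution) and return their means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(trials)
    ]

def skewness(values):
    """Population skewness: 0 for a symmetric (e.g. normal) distribution."""
    m = statistics.fmean(values)
    s = statistics.pstdev(values)
    return statistics.fmean([((v - m) / s) ** 3 for v in values])

if __name__ == "__main__":
    for n in (1, 2, 5, 30):
        means = sample_means(n)
        print(f"n={n:2d}  mean of means={statistics.fmean(means):.3f}  "
              f"skewness={skewness(means):.3f}")
```

The population mean is 1.0 in this setup, so the mean of the sample means should stay near 1.0 for every n while the skewness falls.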
Why is the assumption of a normal distribution important?
A normal distribution allows for the application of the empirical rule: about 68%, 95%, and 99.7% of values fall within 1, 2, and 3 standard deviations of the mean
Chebyshev's Theorem (which holds for any distribution): no more than 1/4 of the values are more than 2 standard deviations away from the mean, no more than 1/9 are more than 3 standard deviations away, no more than 1/25 are more than 5 standard deviations away, and so on
The assumption of normally distributed data allows descriptive statistics to be used to explain the nature of the population
Not enough data available?
Monte Carlo simulation, often run as a spreadsheet simulation, randomly generates values for uncertain variables over and over to simulate a model
Monte Carlo methods randomly select values to create scenarios
The random selection process is repeated many times to create multiple scenarios
Through the random selection process, the scenarios give a range of possible solutions, some of which are more probable and some less probable
As the process is repeated many times (10,000 or more), the average solution gives an approximate answer to the problem
The accuracy can be improved by increasing the number of scenarios selected
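The repeat-and-average idea can be sketched with a toy problem that is not from the slides: estimating the probability that two dice sum to 10 or more, whose exact value is 6/36. The function name `estimate_probability` is hypothetical.

```python
import random

def estimate_probability(trials=100_000, seed=42):
    """Monte Carlo estimate of P(sum of two dice >= 10).

    Each trial is one randomly generated scenario; the fraction of
    scenarios that meet the condition approximates the probability.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        total = rng.randint(1, 6) + rng.randint(1, 6)
        if total >= 10:
            hits += 1
    return hits / trials

if __name__ == "__main__":
    for trials in (100, 10_000, 100_000):
        print(trials, estimate_probability(trials))
    print("exact:", 6 / 36)
```

Running with more trials should bring the estimate closer to the exact value, which is the sense in which accuracy improves with the number of scenarios.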
Sampling without Replacement
Simple Random Sampling
A simple random sample from a population is a sample chosen randomly, so that each possible sample has the same probability of being chosen
In small populations such sampling is typically done "without replacement"
Sampling without replacement deliberately avoids choosing any member of the population more than once
This process should be used when a unit cannot be selected twice, e.g. the cards dealt in a poker hand
Sampling with Replacement
The initial data set is not large enough to use simple random sampling without replacement
Through Monte Carlo simulation we have been able to replicate the original population
Units are sampled from the population one at a time, with each unit being replaced before the next is sampled
One outcome does not affect the other outcomes
Allows a greater number of potential outcomes than sampling without replacement
If observations were not replaced, there would not be enough independent observations to create a sample size of n ≥ 30
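A short sketch of the difference, using Python's standard `random.sample` (without replacement) and `random.choices` (with replacement). The list `OPPONENTS` reuses the opponent labels from the deck's bootstrapping example purely as placeholder units; the function names are hypothetical.

```python
import random

# Placeholder population: opponent labels from the deck's bootstrapping example.
OPPONENTS = [
    "Pittsburgh", "Michigan", "Michigan State", "Washington", "Purdue", "USC",
    "BYU", "Tennessee", "Navy", "Syracuse", "Stanford", "Ohio State",
]

def draw_without_replacement(population, k, seed=1):
    """Each unit can appear at most once; k cannot exceed len(population)."""
    return random.Random(seed).sample(population, k)

def draw_with_replacement(population, k, seed=1):
    """Each unit is 'put back' after it is drawn, so k may exceed len(population)."""
    return random.Random(seed).choices(population, k=k)

if __name__ == "__main__":
    print(draw_without_replacement(OPPONENTS, 5))
    print(draw_with_replacement(OPPONENTS, 30))
```

With only 12 units, a sample of 30 is possible only with replacement; `random.sample` raises `ValueError` if asked for more units than the population holds.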
Heteroscedasticity vs. Homoscedasticity
Homoscedasticity: constant variance
[Figure: residual plot]
All random variables have the same finite variance
Simplifies mathematical and computational treatment
Leads to good estimation results in data mining and regression
Heteroscedasticity: nonconstant variance
[Figure: residual plot]
Random variables may have different variances
Standard errors of regression coefficients may be understated
T-ratios may be larger than actual
More common with cross-sectional data
Regression Model For ND Points Scored
ND Points = 38.54 + 0.079*b1 - 0.170*b2 - 0.662*b3 - 3.16*b4
b1 = Total Yards Gained
b2 = Penalty Yards
b3 = Total Plays
b4 = Turnovers
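As an illustration, the fitted equation can be evaluated directly. The coefficients below are the ones on the slide; the function name `predict_points` and the game inputs in the usage lines are hypothetical, not actual game data.

```python
def predict_points(total_yards, penalty_yards, total_plays, turnovers):
    """Evaluate the slide's fitted model:
    ND Points = 38.54 + 0.079*b1 - 0.170*b2 - 0.662*b3 - 3.16*b4
    """
    return (38.54
            + 0.079 * total_yards     # b1 = Total Yards Gained
            - 0.170 * penalty_yards   # b2 = Penalty Yards
            - 0.662 * total_plays     # b3 = Total Plays
            - 3.16 * turnovers)       # b4 = Turnovers

if __name__ == "__main__":
    # Hypothetical game: 450 yards, 50 penalty yards, 70 plays, 1 turnover.
    print(round(predict_points(450, 50, 70, 1), 2))
    # Same game with one more turnover: the prediction drops by 3.16 points.
    print(round(predict_points(450, 50, 70, 2), 2))
```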
4 Checks of a Regression Model
1. Do the coefficients have the correct sign?
2. Are the slope terms statistically significant?
3. How well does the model fit the data?
4. Is there any serial correlation?

1. Do the coefficients have the correct sign?
Could this represent a big play factor?

2. Are the slope terms statistically significant?

3. How well does the model fit the data?
Adjusted R^2 = 74.22%

4. Is there any serial correlation?
Data is cross sectional

With limited data points, how useful is this regression in describing how well the model fits the actual data? Is there a way to test its reliability?
How to test the significance of the analysis
What happens when the sample size is not large enough (n < 30)?
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
Commonly used statistical significance tests determine the likelihood of a result given a random sample and a sample size of n
If the population is not random and does not allow a large enough sample to be drawn, the Central Limit Theorem would not hold true
Thus, the statistical significance of the data would not hold
Bootstrapping uses replication of the original data to simulate a larger population, thus allowing many samples to be drawn and statistical tests to be calculated
How It Works
Bootstrapping is a method for estimating the sampling distribution of an estimator by resampling with replacement from the original sample.
The bootstrap procedure is a means of estimating the statistical accuracy . . . from the data in a single sample.
Bootstrapping is used to mimic the process of selecting many samples when the population is too small to do otherwise
The samples are generated from the data in the original sample by copying it many times (Monte Carlo simulation)
Samples can then be selected at random, and descriptive statistics calculated or regressions run for each sample
The results generated from the bootstrap samples can be treated as if they were the result of actual sampling from the original population
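The procedure above can be sketched for the simplest estimator, the sample mean. The `points` values are hypothetical placeholders, not the team's data, and `bootstrap_se_of_mean` is an illustrative name. For the mean, the bootstrap standard error should land close to the textbook value s / sqrt(n), which gives a quick sanity check.

```python
import math
import random
import statistics

def bootstrap_se_of_mean(sample, n_resamples=5000, seed=0):
    """Estimate the standard error of the mean by resampling with replacement."""
    rng = random.Random(seed)
    n = len(sample)
    # Each resample has the same size as the original sample.
    means = [statistics.fmean(rng.choices(sample, k=n)) for _ in range(n_resamples)]
    return statistics.stdev(means)

if __name__ == "__main__":
    # HYPOTHETICAL points-per-game values, for illustration only.
    points = [17, 24, 20, 31, 27, 34, 38, 30, 41, 45, 38, 49]
    print("bootstrap SE:", round(bootstrap_se_of_mean(points), 3))
    print("analytic SE :", round(statistics.stdev(points) / math.sqrt(len(points)), 3))
```

The same loop works for estimators with no simple formula (a median, an adjusted R^2) by swapping out the statistic computed on each resample.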
Characteristics of Bootstrapping
[Figure: sampling with replacement from the full sample]
Bootstrapping Example
[Figure: an original data set with a limited number of observations is expanded to 109 copies of each observation, creating a much larger sample with which to work; a 1st random sample is then drawn from it. Opponent labels shown in the figure: Pittsburgh, Navy, Michigan, Ohio State, Michigan State, USC, Washington, Purdue, BYU, Tennessee, Stanford, Syracuse]
Random sampling with replacement can be employed to create multiple independent samples for analysis
When it should be used
Bootstrapping is especially useful in situations when no analytic formula for the sampling distribution is available.
Traditional forecasting methods, like exponential smoothing, work well when demand is constant and patterns are easily recognized by software
In contrast, when demand is irregular, patterns may be difficult to recognize
Therefore, when faced with irregular demand, bootstrapping may be used to provide more accurate forecasts, making some important assumptions
Assumptions and Methodology
Bootstrapping makes no assumption regarding the population
No normality of error terms
No equal variance
Allows for accurate forecasts of intermittent demand
If the sample is a good approximation of the population, the sampling distribution may be estimated by generating a large number of new samples
For small data sets, taking a small representative sample of the data and replicating it will yield superior results

Applications and Uses: Criminology
Statistical significance testing is important in criminology and criminal justice
Six of the most popular journals in criminology and criminal justice are dominated by quantitative methods that rely on statistical significance testing
However, it poses two potential problems: tautology and violations of assumptions

Applications and Uses: Criminology (continued)
Tautology: the null hypothesis is always false, because virtually all null hypotheses may be rejected at some sample size
Violation of the assumptions of regression: that errors have constant variance and are normally distributed
Bootstrapping provides a user-friendly alternative to cross-validation and the jackknife to augment statistical significance testing

Applications and Uses: Actuarial Practice
The process of developing an actuarial model begins with the creation of probability distributions of input variables
Input variables are generally asset-side generated cash flows (financial) or cash flows generated from the liabilities side (underwriting)
Traditional actuarial methodologies are rooted in parametric approaches, which fit a prescribed distribution of losses to the data

Applications and Uses: Actuarial Practice (continued)
However, experience from the last two decades has shown greater interdependence of loss variables with asset variables
Increased complexity has been accompanied by increased competitive pressures and more frequent insolvencies
There is a need to use nonparametric methods in modeling loss distributions
Bootstrap standard errors and confidence intervals are used to derive the distribution

Applications and Uses: Classifications Used by Ecologists
Ecologists often use cluster analysis as a tool in the classification and mapping of entities such as communities or landscapes
However, the researcher has to choose an adequate group partition level, and cluster analysis techniques will always reveal groups
The bootstrap is used to test statistically for fuzziness of the partitions in cluster analysis
Partitions found in bootstrap samples are compared to the observed partition by the similarity of the sampling units that form the groups

Applications and Uses: Human Nutrition
Inverse regression was used to estimate the vitamin B-6 requirement of young women
Standard statistical methods were used to estimate the mean vitamin B-6 requirement
The bootstrap procedure was used as a further check on the mean vitamin B-6 requirement, by looking at the standard error estimates and confidence intervals
Applications and Uses: Outsourcing
Agilent Technologies determined it was time to transfer manufacturing of its 3070 in-circuit test systems from Colorado to Singapore
A major concern was the change in environmental test conditions (dry vs. humid)
Because Agilent tests to tighter factory limits (guard banding), they needed to adjust the guard band for Singapore
The bootstrap was used to determine the appropriate guard band for the Singapore facility
An Alternative to the Bootstrap: Jackknife
A statistical method for estimating and removing bias* and for deriving robust estimates of standard errors and confidence intervals
Created by systematically dropping out subsets of data one at a time and assessing the resulting variation
*Bias: A statistical sampling or testing error caused by systematically favoring some outcomes over others
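A minimal sketch of the leave-one-out form of the jackknife, assuming subsets of size one are dropped (the slides do not specify the subset size). The function name `jackknife_se` and the sample values are hypothetical. No random numbers are involved, so the result is identical on every run.

```python
import math
import statistics

def jackknife_se(sample, estimator=statistics.fmean):
    """Leave-one-out jackknife estimate of an estimator's standard error."""
    n = len(sample)
    # Recompute the estimator n times, each time dropping one observation.
    leave_one_out = [estimator(sample[:i] + sample[i + 1:]) for i in range(n)]
    center = statistics.fmean(leave_one_out)
    return math.sqrt((n - 1) / n * sum((t - center) ** 2 for t in leave_one_out))

if __name__ == "__main__":
    # HYPOTHETICAL values, for illustration only.
    points = [17, 24, 20, 31, 27, 34, 38, 30, 41, 45, 38, 49]
    print("jackknife SE of mean:", round(jackknife_se(points), 4))
    print("s / sqrt(n)         :", round(statistics.stdev(points) / math.sqrt(len(points)), 4))
```

For the sample mean, the jackknife standard error reduces algebraically to s / sqrt(n), so the two printed values should agree.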
A Comparison of the Bootstrap & Jackknife
Bootstrap
Yields slightly different results when repeated on the same data (when estimating the standard error)
Not bound to theoretical distributions
Jackknife
Less general technique
Explores sample variation differently
Yields the same result each time
Similar data requirements
Another Alternative Method: Cross-Validation
The practice of partitioning a sample of data into sub-samples such that the initial analysis is conducted on a single sub-sample (training data), while further sub-samples (test or validation data) are held back "blind" for subsequent use in confirming and validating the initial analysis
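One common way to organize such a partition is k-fold cross-validation, where each sub-sample takes a turn as the held-back validation set. This is a variant named here for illustration rather than the exact scheme on the slide, and `k_fold_splits` is a hypothetical helper that only produces row indices.

```python
import random

def k_fold_splits(n_rows, k=4, seed=0):
    """Return k (train_indices, test_indices) pairs covering rows 0..n_rows-1.

    Rows are shuffled once, dealt into k folds, and each fold is used
    exactly once as the held-back test set.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    splits = []
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits

if __name__ == "__main__":
    for train, test in k_fold_splits(12, k=4):
        print("train:", sorted(train), "test:", sorted(test))
```

With 12 rows and 4 folds, each model would be fit on 9 rows and checked on 3, which hints at why the technique is more comfortable with larger data sets.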
Bootstrap vs. Cross-Validation
Bootstrap
Requires a small amount of data
More complex technique; time consuming
Cross-Validation
Not a resampling technique
Requires large amounts of data
Extremely useful in data mining and artificial intelligence
Methodology for ND Points Model
Use bootstrapping on the ND points scored regression model
Goal: determine the reliability of the model
Replication, random sampling, and numerous independent regressions
Calculation of a confidence interval for adjusted R^2
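The workflow can be sketched as below, with several loudly stated assumptions: the data are HYPOTHETICAL, the model is cut down to a single predictor (total yards) so the example needs only the standard library, and the resample size of 30 is a guess motivated by the n ≥ 30 rule of thumb. Only the 109 copies per observation and the 24 resamples come from the slides. All function names are illustrative.

```python
import random

# HYPOTHETICAL game data (not the team's actual statistics).
TOTAL_YARDS = [310, 355, 390, 420, 445, 470, 500, 520, 560, 590, 610, 650]
POINTS = [17, 24, 20, 31, 27, 34, 38, 30, 41, 45, 38, 49]

def adjusted_r2(xs, ys, n_predictors=1):
    """Fit y = a + b*x by ordinary least squares and return adjusted R^2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - sse / syy
    return 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)

def bootstrap_adjusted_r2(xs, ys, n_samples=24, copies=109,
                          sample_size=30, seed=0):
    """Replicate the data, draw random samples, and refit the regression."""
    rng = random.Random(seed)
    pool = list(zip(xs, ys)) * copies          # 109 copies of each observation
    results = []
    for _ in range(n_samples):
        draw = rng.choices(pool, k=sample_size)  # sampling with replacement
        bx, by = zip(*draw)
        results.append(adjusted_r2(bx, by))
    return results

if __name__ == "__main__":
    for i, value in enumerate(bootstrap_adjusted_r2(TOTAL_YARDS, POINTS), start=1):
        print(i, round(value, 4))
```

Drawing with replacement from the 109-fold copy gives every original observation the same chance on every draw, so it is equivalent in distribution to drawing with replacement from the original 12 rows.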
Bootstrapping Results
R^2 Data
Sample #   Adjusted R^2     Sample #   Adjusted R^2
1          0.7351           13         0.7482
2          0.7545           14         0.8719
3          0.7438           15         0.7391
4          0.7968           16         0.9025
5          0.5164           17         0.8634
6          0.6449           18         0.7927
7          0.9951           19         0.6797
8          0.9253           20         0.6765
9          0.8144           21         0.8226
10         0.7631           22         0.9902
11         0.8257           23         0.8812
12         0.9099           24         0.9169
The mean, standard deviation, and 95% and 99% confidence intervals are then calculated in Excel from the 24 observations
Bootstrapping Results
R^2 Data
Mean:      0.8046
STDEV:     0.1131
Conf 95%:  0.0453, or 75.93 - 84.98%
Conf 99%:  0.0595, or 74.51 - 86.41%
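These figures can be reproduced from the 24 adjusted R^2 values. The sketch below assumes the half-widths come from the normal approximation z * s / sqrt(n), which is consistent with the reported numbers; the slides only say the calculation was done in Excel. The function name `normal_confidence_interval` is hypothetical.

```python
import math
import statistics

# The 24 bootstrap adjusted R^2 values from the results table.
ADJ_R2 = [
    0.7351, 0.7545, 0.7438, 0.7968, 0.5164, 0.6449,
    0.9951, 0.9253, 0.8144, 0.7631, 0.8257, 0.9099,
    0.7482, 0.8719, 0.7391, 0.9025, 0.8634, 0.7927,
    0.6797, 0.6765, 0.8226, 0.9902, 0.8812, 0.9169,
]

def normal_confidence_interval(values, level):
    """Return (mean, sample std dev, half-width) using z * s / sqrt(n)."""
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, sd, z * sd / math.sqrt(n)

if __name__ == "__main__":
    for level in (0.95, 0.99):
        mean, sd, hw = normal_confidence_interval(ADJ_R2, level)
        print(f"{level:.0%}: mean={mean:.4f} sd={sd:.4f} "
              f"+/- {hw:.4f} ({mean - hw:.4f} to {mean + hw:.4f})")
```

With only 24 values, an interval based on the t distribution would be slightly wider than this normal-based one.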
So what does this mean for the results of the regression?
Can we rely on this model to help predict the number of points per game that will be scored by the 2006 team?
Questions?