$\mu_{\bar{x}} = \mu$ and
$\sigma_{\bar{x}} = \sigma \sqrt{1/n - 1/N}$
4. What is the sampling distribution of the proportion?
In a population of size N, suppose that the probability of the
occurrence of an event (dubbed a success) is P, and the probability
of the event's non-occurrence (dubbed a failure) is Q. From this
population, suppose that we draw all possible samples of size n. And
finally, within each sample, suppose that we determine the proportion
of successes p and failures q. In this way, we create a sampling
distribution of the proportion.
5. Show the mathematical expression of the sampling distribution
of the proportion.
We find that the mean of the sampling distribution of the proportion
($\mu_p$) is equal to the probability of success in the population (P).
And the standard error of the sampling distribution ($\sigma_p$) is
determined by the standard deviation of the population ($\sigma$), the
population size, and the sample size. These relationships are shown in
the equations below:
$\mu_p = P$ and
$\sigma_p = \sigma \sqrt{1/n - 1/N} = \sqrt{PQ/n - PQ/N}$
where $\sigma = \sqrt{PQ}$.
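As a quick numerical check of the two equivalent forms of the standard error above, here is a minimal Python sketch. The function name and the example values (P = 0.5, n = 100, N = 10,000) are illustrative assumptions, not part of the source.

```python
import math

def proportion_sampling_distribution(P, n, N):
    """Return (mean, standard error) of the sample proportion.

    P is the population probability of success, n the sample size,
    N the population size (hypothetical helper for illustration).
    """
    Q = 1.0 - P                      # probability of failure
    sigma = math.sqrt(P * Q)         # population standard deviation
    mean_p = P                       # mean of the sampling distribution
    se_p = sigma * math.sqrt(1.0 / n - 1.0 / N)
    return mean_p, se_p

# Example values are assumptions chosen only to exercise the formula.
mean_p, se_p = proportion_sampling_distribution(P=0.5, n=100, N=10_000)
print(mean_p, se_p)
```

The second form, sqrt(PQ/n - PQ/N), gives the same number because sigma squared equals PQ.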
Estimation
1. When will the sampling distribution be normally distributed?
Generally, the sampling distribution will be approximately normally
distributed if any of the following conditions apply.
The population distribution is normal.
The population distribution is symmetric, unimodal,
without outliers, and the sample size is 15 or less.
The population distribution is moderately skewed, unimodal,
without outliers, and the sample size is between 16 and 40.
The sample size is greater than 40, without outliers.
2. Get the variability of the sample mean.
Suppose k possible samples of size n can be selected from a
population of size N. The standard deviation of the sampling
distribution measures the typical deviation between the k sample
means and the true population mean, $\mu$. The standard deviation of
the sample mean, $\sigma_{\bar{x}}$, is:
$\sigma_{\bar{x}} = \sigma \sqrt{(1/n)(1 - n/N)[N/(N - 1)]}$
where $\sigma$ is the standard deviation of the population, N is the
population size, and n is the sample size. When the population size is
much larger (at least 10 times larger) than the sample size, the
standard deviation can be approximated by:
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$
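The exact formula and its large-population approximation can be compared with a short sketch. The function name and the numbers used are hypothetical, chosen only to show that the two results converge when N is much larger than n.

```python
import math

def sd_of_sample_mean(sigma, n, N=None):
    """Standard deviation of the sample mean.

    With N given, applies the finite-population formula from the text;
    with N omitted, uses the approximation sigma / sqrt(n).
    """
    if N is None:
        return sigma / math.sqrt(n)
    return sigma * math.sqrt((1.0 / n) * (1.0 - n / N) * (N / (N - 1.0)))

# Assumed example: sigma = 10, n = 25.
approx = sd_of_sample_mean(10, 25)
exact_large_population = sd_of_sample_mean(10, 25, N=1_000_000)
exact_small_population = sd_of_sample_mean(10, 25, N=100)
print(approx, exact_large_population, exact_small_population)
```

With N = 100 the sample is a quarter of the population, so the correction is no longer negligible and the exact value is visibly smaller than the approximation.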
3. How can the standard error be calculated?
When the standard deviation of the population is unknown, the
standard deviation of the sampling distribution cannot be calculated.
Under these circumstances, use the standard error. The standard error
(SE) provides an estimate of that standard deviation. It can
be calculated from the equation below.
$SE_{\bar{x}} = s \sqrt{(1/n)(1 - n/N)[N/(N - 1)]}$
where s is the standard deviation of the sample, N is the population
size, and n is the sample size. When the population size is much larger
(at least 10 times larger) than the sample size, the standard error can
be approximated by:
$SE_{\bar{x}} = s / \sqrt{n}$
4. How to find the confidence interval of the mean?
A confidence interval expresses the uncertainty in a sample estimate
of the population mean.
The confidence level describes the uncertainty of a sampling method.
Often, researchers choose 90%, 95%, or 99% confidence levels; but any
percentage can be used.
The confidence interval is defined by the sample statistic ± margin
of error. And the uncertainty is denoted by the confidence level.
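The "sample statistic ± margin of error" rule can be sketched as follows. This assumes the margin of error is a standard normal critical value times the approximate standard error s / sqrt(n); the text does not define the margin of error, so that choice, the function name, and the sample data are all assumptions. For small samples a t critical value would normally replace the normal one.

```python
import math
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Return (lower, upper) bounds for the population mean.

    Assumes a large-sample normal approximation and a population
    much larger than the sample (SE = s / sqrt(n)).
    """
    n = len(sample)
    x_bar = mean(sample)                      # sample statistic
    se = stdev(sample) / math.sqrt(n)         # standard error of the mean
    z = NormalDist().inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    margin = z * se                           # margin of error
    return x_bar - margin, x_bar + margin

# Hypothetical measurements, for illustration only.
sample = [9.8, 10.2, 10.1, 9.9, 10.4, 10.0, 9.7, 10.3]
print(mean_confidence_interval(sample, 0.90))
print(mean_confidence_interval(sample, 0.99))
```

A higher confidence level gives a wider interval for the same data.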
Testing of Hypothesis in case of large & small samples
1. What is a statistical hypothesis?
A statistical hypothesis is an assumption about a
population parameter. This assumption may or may not be true.
2. What are the types of statistical hypothesis?
There are two types of statistical hypotheses.
Null hypothesis. The null hypothesis, denoted by $H_0$, is usually
the hypothesis that sample observations result purely from
chance.
Alternative hypothesis. The alternative hypothesis, denoted by
$H_1$ or $H_a$, is the hypothesis that sample observations are
influenced by some non-random cause.
3. What is hypothesis testing?
Statisticians follow a formal process to determine whether to reject a
null hypothesis, based on sample data. This process is
called hypothesis testing.
4. Define the steps of hypothesis testing?
Hypothesis testing consists of four steps.
State the hypotheses. This involves stating the null and
alternative hypotheses. The hypotheses are stated in such a way
that they are mutually exclusive. That is, if one is true, the other
must be false.
Formulate an analysis plan. The analysis plan describes how to
use sample data to evaluate the null hypothesis. The evaluation
often focuses on a single test statistic.
Analyze sample data. Find the value of the test statistic (mean
score, proportion, t-score, z-score, etc.) described in the analysis
plan.
Interpret results. Apply the decision rule described in the
analysis plan. If the value of the test statistic is unlikely, based
on the null hypothesis, reject the null hypothesis.
5. What are decision errors?
Two types of errors can result from a hypothesis test.
Type I error. A Type I error occurs when the researcher rejects
a null hypothesis when it is true. The probability of committing
a Type I error is called the significance level. This probability is
also called alpha, and is often denoted by $\alpha$.
Type II error. A Type II error occurs when the researcher fails
to reject a null hypothesis that is false. The probability of
committing a Type II error is called beta, and is often denoted
by $\beta$. The probability of not committing a Type II error is
called the power of the test.
6. How to arrive at a decision on hypothesis?
The decision rule can be stated in two ways: with reference to a
P-value or with reference to a region of acceptance.
P-value. The strength of evidence against a null hypothesis
is measured by the P-value. Suppose the test statistic is equal
to S. The P-value is the probability of observing a test statistic at
least as extreme as S, assuming the null hypothesis is true. If the
P-value is less than the significance level, we reject the null
hypothesis.
Region of acceptance. The region of acceptance is a range of
values. If the test statistic falls within the region of acceptance,
the null hypothesis is not rejected. The region of acceptance is
defined so that the chance of making a Type I error is equal to
the significance level. The set of values outside the region of
acceptance is called the region of rejection. If the test statistic
falls within the region of rejection, the null hypothesis is
rejected. In such cases, we say that the hypothesis has been
rejected at the $\alpha$ level of significance.
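The two decision rules lead to the same conclusion, which a small sketch can show. It assumes a two-tailed z-test for a mean with known population standard deviation; the test choice, the function name, and the numbers are assumptions made for illustration.

```python
import math
from statistics import NormalDist

def z_test_two_tailed(x_bar, mu0, sigma, n, alpha=0.05):
    """Apply both decision rules to a two-tailed z-test for a mean."""
    z = (x_bar - mu0) / (sigma / math.sqrt(n))        # test statistic
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # both tails
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    region_of_acceptance = (-z_crit, z_crit)
    reject_by_p_value = p_value < alpha
    reject_by_region = not (-z_crit <= z <= z_crit)
    return z, p_value, region_of_acceptance, reject_by_p_value, reject_by_region

# Hypothetical data: sample mean 10.5 from n = 100, H0: mean = 10, sigma = 2.
z, p, region, by_p, by_region = z_test_two_tailed(10.5, 10.0, 2.0, 100)
print(z, p, region, by_p, by_region)
```

Because the region of acceptance is built from the same significance level, `reject_by_p_value` and `reject_by_region` always agree.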
7. Explain one-tailed and two-tailed tests?
A test of a statistical hypothesis, where the region of rejection is on
only one side of the sampling distribution, is called a one-tailed test.
For example, suppose the null hypothesis states that the mean is less
than or equal to 10. The alternative hypothesis would be that the mean
is greater than 10. The region of rejection would consist of a range of
numbers located on the right side of the sampling distribution;
that is, a set of numbers greater than 10.
A test of a statistical hypothesis, where the region of rejection is on
both sides of the sampling distribution, is called a two-tailed test. For
example, suppose the null hypothesis states that the mean is equal to
10. The alternative hypothesis would be that the mean is less than 10
or greater than 10. The region of rejection would consist of a range of
numbers located on both sides of the sampling distribution; that
is, the region of rejection would consist partly of numbers that were
less than 10 and partly of numbers that were greater than 10.
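The difference between the two kinds of test shows up in how the P-value is computed from the same test statistic. The sketch below assumes a z statistic with a standard normal sampling distribution; the function name and the `tail` labels are hypothetical.

```python
from statistics import NormalDist

def p_value(z, tail="two"):
    """P-value of a z statistic for a right-, left-, or two-tailed test."""
    std_normal = NormalDist()
    if tail == "right":      # e.g. H0: mean <= 10, Ha: mean > 10
        return 1.0 - std_normal.cdf(z)
    if tail == "left":       # e.g. H0: mean >= 10, Ha: mean < 10
        return std_normal.cdf(z)
    if tail == "two":        # e.g. H0: mean = 10, Ha: mean != 10
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    raise ValueError("tail must be 'right', 'left', or 'two'")

z = 1.8  # assumed test statistic
print(p_value(z, "right"), p_value(z, "two"))
```

For a positive statistic, the two-tailed P-value is twice the right-tailed one, so a result can be significant in a one-tailed test and not in a two-tailed test at the same significance level.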
What is Chi-Square in Statistics?
Suppose Sachin plays 100 tests, and 20 times he made 50. Is he a
good player?
In statistics, the chi-square test measures how well a series of
numbers fits a distribution. In this module, we only test whether
results fit an even distribution. It doesn't simply say yes or no.
Instead, it gives upper and lower bounds on the likelihood that the
variation in your data is due to chance.
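A minimal sketch of the chi-square statistic for a fit to an even distribution follows. It computes only the statistic and its degrees of freedom; turning that into a probability requires a chi-square table or a statistics library, which is not shown. The function name and the counts are illustrative assumptions.

```python
def chi_square_even(observed):
    """Chi-square statistic for observed counts against an even distribution.

    Returns (statistic, degrees_of_freedom).
    """
    k = len(observed)
    expected = sum(observed) / k          # same expected count per category
    statistic = sum((o - expected) ** 2 / expected for o in observed)
    return statistic, k - 1

# Hypothetical counts of die faces over 60 rolls.
counts = [8, 12, 9, 11, 10, 10]
print(chi_square_even(counts))
```

A statistic of zero means the counts are perfectly even; larger values mean a worse fit.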
There are basically two types of random variables, and they yield two
types of data: numerical and categorical.
A chi-square ($\chi^2$) statistic is used to investigate whether
distributions of categorical variables differ from one another.
Categorical variables yield data in categories, and numerical
variables yield data in numerical form.
Responses to such questions as "What is your major?" or "Do you own
a car?" are categorical because they yield data such as "biology" or
"no". In contrast, responses to such questions as "How tall are you?"
or "What is your G.P.A.?" are numerical. Numerical data can be
either discrete or continuous.
F-Distribution and Analysis of variance (ANOVA)
1. What is ANOVA?
Analysis of variance (ANOVA) is a collection of statistical models
and their associated procedures in which the observed variance is
partitioned into components due to different sources of variation.
ANOVA provides a statistical test of whether or not the means of
several groups are all equal.
2. What are the assumptions in ANOVA?
The following assumptions are made to perform ANOVA:
Independence of cases: this is an assumption of the model that
simplifies the statistical analysis.
Normality: the distributions of the residuals are normal.
Equality (or homogeneity) of variances,
called homoscedasticity: the variance of data in groups should
be the same. Model-based approaches usually assume that the
variance is constant. The constant-variance property also
appears in the randomization (design-based) analysis of
randomized experiments, where it is a necessary consequence of
the randomized design and the assumption of unit treatment
additivity (Hinkelmann and Kempthorne): If the responses of a
randomized balanced experiment fail to have constant variance,
then the assumption of unit treatment additivity is necessarily
violated. It has been shown, however, that the F-test is robust to
violations of this assumption.
3. What is the logic of ANOVA?
Partitioning of the sum of squares
The fundamental technique is a partitioning of the total sum of
squares (abbreviated SS) into components related to the effects used
in the model. For example, the model for a simplified ANOVA with one
type of treatment at different levels is:
$SS_{Total} = SS_{Error} + SS_{Treatments}$
The number of degrees of freedom (abbreviated df) can be
partitioned in a similar way, and specifies the chi-square
distribution which describes the associated sums of squares:
$df_{Total} = df_{Error} + df_{Treatments}$
4. What is the F-test?
The F-test is used for comparisons of the components of the total
deviation. For example, in one-way, or single-factor ANOVA,
statistical significance is tested for by comparing the F test statistic
$F = \frac{MS_{Treatments}}{MS_{Error}} = \frac{SS_{Treatments}/(I - 1)}{SS_{Error}/(n_T - I)}$
where MS denotes a mean sum of squares (a sum of squares divided by
its degrees of freedom),
I = number of treatments
and
$n_T$ = total number of cases,
to the F-distribution with $I - 1$, $n_T - I$ degrees of freedom.
Using the F-distribution is a natural candidate because the test
statistic is the quotient of two mean sums of squares, each of which
follows a scaled chi-square distribution.
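To make the F statistic concrete, here is a minimal sketch that partitions the sum of squares for a one-way layout and forms the ratio of mean squares. The function name and the three small groups are hypothetical; the resulting F would then be compared with an F table at (I - 1, n_T - I) degrees of freedom.

```python
def one_way_anova_f(groups):
    """Return (F, df_treatments, df_error) for a one-way ANOVA."""
    I = len(groups)                              # number of treatments
    n_T = sum(len(g) for g in groups)            # total number of cases
    grand_mean = sum(sum(g) for g in groups) / n_T
    group_means = [sum(g) / len(g) for g in groups]

    # Between-group (treatment) and within-group (error) sums of squares.
    ss_treatments = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    ss_error = sum(
        (x - m) ** 2 for g, m in zip(groups, group_means) for x in g
    )

    df_treatments = I - 1
    df_error = n_T - I
    F = (ss_treatments / df_treatments) / (ss_error / df_error)
    return F, df_treatments, df_error

# Hypothetical responses under three treatments.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
print(one_way_anova_f(groups))
```

For these groups the treatment sum of squares is 26 and the error sum of squares is 6, which add up to the total sum of squares of 32.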
5. Why is ANOVA helpful?
ANOVAs are helpful because they possess a certain advantage over a
two-sample t-test. Doing multiple two-sample t-tests would result in a
greatly increased chance of committing a Type I error. For this reason,
ANOVAs are useful in comparing three or more means.
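The inflation of the Type I error rate can be illustrated with a rough calculation. The sketch assumes the pairwise t-tests are independent, which is a simplification (tests sharing a group are correlated), and the function name is hypothetical.

```python
def familywise_error_rate(k, alpha=0.05):
    """Approximate chance of at least one Type I error when all pairs
    of k group means are compared with separate tests at level alpha.

    Assumes independent tests, which is only an approximation.
    """
    m = k * (k - 1) // 2              # number of pairwise comparisons
    return 1.0 - (1.0 - alpha) ** m

for k in (2, 3, 5):
    print(k, familywise_error_rate(k))
```

With three groups there are three comparisons, and the approximate chance of at least one false rejection is already well above the nominal 5%.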
Simple correlation and Regression
1. What is correlation?
Correlation is a measure of association between two variables. The
variables are not designated as dependent or independent.
2. What can be the values for correlation coefficient?
The value of a correlation coefficient can vary from -1 to +1. A -1
indicates a perfect negative correlation and a +1 indicates a perfect
positive correlation. A correlation coefficient of zero means there is
no linear relationship between the two variables.
3. What is the interpretation of the correlation coefficient values?
When there is a negative correlation between two variables, as the
value of one variable increases, the value of the other variable
decreases, and vice versa. In other words, for a negative correlation,
the variables move in opposite directions. When there is a positive
correlation between two variables, as the value of one variable
increases, the value of the other variable also increases. The variables
move together.
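The sign and range of the coefficient can be seen in a small sketch of the Pearson correlation coefficient, which is the usual measure meant by "correlation" here (an assumption, since the text does not name it). The function name and data are illustrative.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: y rises with x in the first case, falls in the second.
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```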
4. What is simple regression?
Simple regression is used to examine the relationship between one
dependent and one independent variable. After performing an
analysis, the regression statistics can be used to predict the dependent
variable when the independent variable is known. Regression goes
beyond correlation by adding prediction capabilities.
5. Explain the mathematical analysis of regression?
In the regression equation, y is always the dependent variable and x is
always the independent variable. Here are three equivalent ways to
mathematically describe a linear regression model.
y = intercept + (slope × x) + error
y = constant + (coefficient × x) + error
y = a + bx + e
The significance of the slope of the regression line is determined from
the t-statistic. The associated P-value is the probability that a
correlation at least as strong as the observed one would occur by
chance if the true correlation is zero. Some researchers prefer to
report the F-ratio instead of the t-statistic. In simple regression,
the F-ratio is equal to the t-statistic squared.
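The relationship between the slope's t-statistic and the F-ratio can be verified numerically. The sketch below fits y = a + bx by ordinary least squares, which is the standard fitting method but is not named in the text; the function name and data are assumptions.

```python
import math

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x.

    Returns (a, b, t, F): intercept, slope, t-statistic of the slope,
    and the F-ratio of the regression.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    b = sxy / sxx                      # slope
    a = mean_y - b * mean_x            # intercept

    fitted = [a + b * xi for xi in x]
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))   # error SS
    ssr = sum((fi - mean_y) ** 2 for fi in fitted)           # regression SS
    mse = sse / (n - 2)                # error mean square, n - 2 df

    t = b / math.sqrt(mse / sxx)       # slope divided by its standard error
    F = ssr / mse                      # regression SS has 1 df
    return a, b, t, F

# Hypothetical observations.
x = [1, 2, 3, 4, 5]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b, t, F = simple_regression(x, y)
print(a, b, t, F, t * t)
```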
Business Forecasting
1. What is forecasting?
Forecasting is a prediction of what will occur in the future, and it is an
uncertain process. Because of the uncertainty, the accuracy of a
forecast is as important as the outcome predicted by the forecast.
2. What are the various business forecasting techniques?
3. How to model a causal time series?
With multiple regression, we can use more than one predictor. It is
always best, however, to be parsimonious, that is, to use as few
variables as predictors as necessary to get a reasonably accurate
forecast. Multiple regressions are best modeled with a commercial
package such as SAS or SPSS. The forecast takes the form:
$Y = b_0 + b_1 X_1 + b_2 X_2 + \dots + b_n X_n$,
where $b_0$ is the intercept, and $b_1, b_2, \dots, b_n$ are
coefficients representing the contribution of the independent
variables $X_1, X_2, \dots, X_n$.
4. What are the various smoothing techniques?
Simple Moving Average: The best-known forecasting method is the
moving average, which simply takes a certain number of past periods,
adds them together, and divides by the number of periods. The simple
moving average (MA) is an effective and efficient approach provided
the time series is stationary in both mean and variance. The following
formula is used to find the moving average of order n, MA(n), for
period t+1:
$MA_{t+1} = [D_t + D_{t-1} + \dots + D_{t-n+1}] / n$
where n is the number of observations used in the calculation.
Weighted Moving Average: Very powerful and economical. Weighted
moving averages are widely used where repeated forecasts are required,
using methods such as sum-of-the-digits and trend adjustment. As an
example, a weighted moving average of order 3 is:
$\text{Weighted MA}(3) = w_1 D_t + w_2 D_{t-1} + w_3 D_{t-2}$
where the weights are any positive numbers such that
$w_1 + w_2 + w_3 = 1$.
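Both moving-average forecasts can be sketched in a few lines. The function names, the demand series, and the weights are hypothetical; the first weight is applied to the most recent observation, matching the order of the terms in the weighted formula.

```python
import math

def moving_average_forecast(demand, n):
    """MA(n) forecast for the next period: mean of the last n observations."""
    if n < 1 or n > len(demand):
        raise ValueError("n must be between 1 and len(demand)")
    return sum(demand[-n:]) / n

def weighted_moving_average_forecast(demand, weights):
    """Weighted MA forecast; weights[0] applies to the most recent value."""
    if len(weights) > len(demand):
        raise ValueError("more weights than observations")
    if any(w <= 0 for w in weights) or not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must be positive and sum to 1")
    most_recent_first = demand[::-1][:len(weights)]
    return sum(w * d for w, d in zip(weights, most_recent_first))

# Hypothetical demand history, oldest first.
demand = [10, 20, 30, 40]
print(moving_average_forecast(demand, 3))
print(weighted_moving_average_forecast(demand, [0.5, 0.3, 0.2]))
```

Putting the largest weight on the latest period makes the weighted forecast react faster to recent changes than the simple average does.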
5. Explain exponential smoothing techniques?
Single Exponential Smoothing: It calculates the smoothed series as a
damping coefficient times the actual series plus 1 minus the damping
coefficient times the lagged value of the smoothed series. The
extrapolated smoothed series is a constant, equal to the last value of
the smoothed series during the period when actual data on the
underlying series are available.
$F_{t+1} = a D_t + (1 - a) F_t$
where:
$D_t$ is the actual value
$F_t$ is the forecasted value
a is the weighting factor, which ranges from 0 to 1
t is the current time period.
Double Exponential Smoothing: It applies the process described
above twice to account for linear trend. The extrapolated series has a
constant growth rate, equal to the growth of the smoothed series at the
end of the data period.
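The single exponential smoothing recursion can be sketched as follows. The text does not say how the first forecast is chosen, so this sketch assumes it equals the first observation unless another value is supplied; the function name and data are also assumptions.

```python
def single_exponential_smoothing(demand, a, initial_forecast=None):
    """Apply F[t+1] = a * D[t] + (1 - a) * F[t] over a demand series.

    Returns the list of forecasts: one per observed period plus one
    final entry, which is the forecast for the next (unobserved) period.
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must be between 0 and 1")
    # Assumption: seed the first forecast with the first observation.
    forecast = demand[0] if initial_forecast is None else initial_forecast
    forecasts = [forecast]
    for d in demand:
        forecast = a * d + (1.0 - a) * forecast
        forecasts.append(forecast)
    return forecasts

# Hypothetical demand history.
print(single_exponential_smoothing([10, 20, 30], a=0.5))
```

A value of a near 1 makes the forecast follow the latest observation closely, while a value near 0 smooths heavily and reacts slowly.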
6. What are time series models?
A time series is a set of numbers that measures the status of some
activity over time. It is the historical record of some activity, with
measurements taken at equally spaced intervals (exception: monthly)
with a consistency in the activity and the method of measurement.
Time Series Analysis
1. What is time series forecasting?
A time series can be represented as a curve that evolves over time.
Forecasting the time series means that we extend the historical values
into the future, where the measurements are not available yet.
2. What are the different models in time series forecasting?
Simple moving average
Weighted moving average
Simple exponential smoothing
Holt's double exponential smoothing
Winters' triple exponential smoothing
Forecast by linear regression
3. Explain simple moving average and weighted moving average
models?
Simple Moving Average: The best-known forecasting method is the
moving average, which simply takes a certain number of past periods,
adds them together, and divides by the number of periods. The simple
moving average (MA) is an effective and efficient approach provided
the time series is stationary in both mean and variance. The following
formula is used to find the moving average of order n, MA(n), for
period t+1:
$MA_{t+1} = [D_t + D_{t-1} + \dots + D_{t-n+1}] / n$
where n is the number of observations used in the calculation.
Weighted Moving Average: Very powerful and economical. Weighted
moving averages are widely used where repeated forecasts are required,
using methods such as sum-of-the-digits and trend adjustment. As an
example, a weighted moving average of order 3 is:
$\text{Weighted MA}(3) = w_1 D_t + w_2 D_{t-1} + w_3 D_{t-2}$
where the weights are any positive numbers such that
$w_1 + w_2 + w_3 = 1$.
4. Explain the exponential smoothing techniques?
Single Exponential Smoothing: It calculates the smoothed series as a
damping coefficient times the actual series plus 1 minus the damping
coefficient times the lagged value of the smoothed series. The
extrapolated smoothed series is a constant, equal to the last value of
the smoothed series during the period when actual data on the
underlying series are available.
$F_{t+1} = a D_t + (1 - a) F_t$
where:
$D_t$ is the actual value
$F_t$ is the forecasted value
a is the weighting factor, which ranges from 0 to 1
t is the current time period.
Double Exponential Smoothing: It applies the process described
above twice to account for linear trend. The extrapolated series has a
constant growth rate, equal to the growth of the smoothed series at the
end of the data period.
Triple Exponential Smoothing: It applies the process described
above three times to account for nonlinear trend.
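One way to read "applying the smoothing process twice" is Brown's one-parameter double exponential smoothing, sketched below. This is an assumption about which variant is meant; it is not Holt's two-parameter method. The seeding of both smoothed series with the first observation, the function name, and the data are likewise assumptions.

```python
def brown_double_smoothing_forecast(demand, a, horizon=1):
    """Forecast `horizon` periods ahead with Brown's double smoothing.

    The series is smoothed once (s1), then the smoothed series is
    smoothed again (s2); level and trend are derived from the two.
    """
    if not 0.0 < a < 1.0:
        raise ValueError("a must be strictly between 0 and 1")
    s1 = s2 = demand[0]          # assumption: seed with the first value
    for d in demand[1:]:
        s1 = a * d + (1.0 - a) * s1      # first smoothing pass
        s2 = a * s1 + (1.0 - a) * s2     # second pass over the first
    level = 2.0 * s1 - s2
    trend = (a / (1.0 - a)) * (s1 - s2)
    return level + horizon * trend

# Hypothetical upward-trending demand.
demand = [1, 2, 3, 4, 5, 6]
print(brown_double_smoothing_forecast(demand, a=0.5, horizon=1))
print(brown_double_smoothing_forecast(demand, a=0.5, horizon=2))
```

Unlike single smoothing, whose extrapolation is flat, the forecast here keeps growing with the horizon when the data trend upward.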
5. How should one forecast by linear regression?
Regression is the study of relationships among variables, a principal
purpose of which is to predict, or estimate the value of one variable
from known or assumed values of other variables related to it.
Types of Analysis
Simple Linear Regression: A regression using only one predictor is
called a simple regression.
Multiple Regression: Where there are two or more predictors,
multiple regression analysis is employed.
Index Numbers
1. What are index numbers?
Index numbers are used to measure changes in some quantity which
we cannot observe directly, for example, changes in business activity.
2. Describe the classification of index numbers?
Index numbers are classified in terms of the variables that they are
intended to measure. In business, the groups of variables most
commonly measured with index number techniques are
i) price ii) quantity iii) value iv) business activity.
3. What are simple and composite index numbers?
Simple index numbers: A simple index number is a number that
measures a relative change in a single variable with respect to a base.
Composite index numbers: A composite index number is a number
that measures an average relative change in a group of related
variables with respect to a base.
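The two definitions can be illustrated with a short sketch. It assumes the common convention that the base period equals 100 and that the composite index is an unweighted average of the individual relatives; the text specifies neither, and the function names and prices are hypothetical.

```python
def simple_index(current, base):
    """Relative change of one variable against its base value (base = 100)."""
    return current / base * 100.0

def composite_index(current_values, base_values):
    """Unweighted average of the simple indexes of several variables."""
    relatives = [simple_index(c, b) for c, b in zip(current_values, base_values)]
    return sum(relatives) / len(relatives)

# Hypothetical prices of two commodities in the base and current periods.
base_prices = [100, 50]
current_prices = [120, 55]
print(simple_index(current_prices[0], base_prices[0]))
print(composite_index(current_prices, base_prices))
```

An index above 100 indicates a rise relative to the base period, and one below 100 a fall.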
4. What are price index numbers?
Price index numbers measure the relative changes in the prices of
commodities between two periods. Prices can be retail or wholesale.
5. What are quantity index numbers?
These index numbers are considered to measure changes in the
physical quantity of goods produced, consumed, or sold of an item or
a group of items.