The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We
see and use data in our everyday lives.
We generally want to study a population, that is, a collection of persons, things, or objects under study. To do so, we
select a sample from the population. The idea of sampling is to select a subset of the larger population and
study that sample to gain information about the population. Data are the result of sampling from a
population.
Example:
Find the sample mean for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79,
80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405.
Step 1: Add up all of the numbers:
12 + 13 + 14 + 16 + 17 + 40 + 43 + 55 + 56 + 67 + 78 + 78 + 79 + 80 + 81 + 90 + 99 + 101 + 102 + 304 + 306
+ 400 + 401 + 403 + 404 + 405 = 3744.
Step 2: Count the number of items in your data set. In this particular data set there are 26 items.
Step 3: Divide the number you found in Step 1 by the number you found in Step 2: 3744/26 = 144.
In symbols, with n the number of items:
\[ \bar{x} = \frac{\sum x_i}{n} = \frac{3744}{26} = 144 \]
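The three steps of the sample mean calculation can be sketched in Python as follows. This is only an illustration of the arithmetic; the variable names are chosen here and do not come from the original notes.

```python
# Data set from the sample mean example.
data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79,
        80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405]

total = sum(data)        # Step 1: add up all of the numbers
n = len(data)            # Step 2: count the items
sample_mean = total / n  # Step 3: divide the sum by the count

print(total, n, sample_mean)  # 3744 26 144.0
```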
Mode: Mode is the value that occurs with the greatest frequency.
Example:
Find the mode for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
90, 99, 101, 102, 304, 306, 404, 401, 403, 400.
In the above set of numbers the mode is 78: it occurs twice, while every other value occurs only once.
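A quick way to confirm which value occurs most often is to count the frequencies. The sketch below uses `collections.Counter` from the Python standard library; the variable names are illustrative.

```python
from collections import Counter

# Data set from the mode example.
data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
        90, 99, 101, 102, 304, 306, 404, 401, 403, 400]

counts = Counter(data)                        # value -> number of occurrences
mode_value, frequency = counts.most_common(1)[0]

print(mode_value, frequency)  # 78 2
```

If several values tie for the highest frequency, `most_common(1)` reports only one of them, so a data set with more than one mode needs `counts.most_common()` inspected in full.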
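Percentiles: The pth percentile is a value such that at least p percent of the observations are less
than or equal to this value and at least (100 − p) percent of the observations are greater than or equal to
this value.
Formula: \[ i = \left(\frac{p}{100}\right) n \]
where n is the number of observations and i gives the position of the percentile in the ordered data.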
Example:
Determine the 75th percentile of the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
90, 99, 101, 102, 304, 306, 404, 401, 403, 400.
Arrange the data in ascending order:
12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404.
Compute the index with p = 75 and n = 25:
\[ i = \left(\frac{75}{100}\right) \times 25 = 18.75 \]
Because i is not an integer, the position of the 75th percentile is the next integer greater than 18.75, the 19th position.
The 75th percentile is the data value in the 19th position, which is 102.
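The position rule used in this example can be sketched as a small function. The rounding-up branch follows the example; the branch for a whole-number index (averaging the values at positions i and i + 1) is a common textbook convention and is an assumption here, since the notes only show the non-integer case. The function name is hypothetical.

```python
import math

def percentile_by_position(values, p):
    """Return the p-th percentile using the index i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    i = p * n / 100                   # index from the formula
    if i != int(i):
        position = math.ceil(i)       # next integer greater than i
        return ordered[position - 1]  # positions are 1-based
    # Assumed convention: for a whole-number i, average positions i and i + 1.
    position = int(i)
    return (ordered[position - 1] + ordered[position]) / 2

data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
        90, 99, 101, 102, 304, 306, 404, 401, 403, 400]

print(percentile_by_position(data, 75))  # 102
```

Library routines such as `statistics.quantiles` interpolate between positions by default, so they may return a slightly different value for the same data.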
Quartiles: A division of observations into four defined intervals based upon the values of the data
and how they compare to the entire set of observations.
Each quartile contains 25% of the total observations. The data are arranged from smallest to largest before the quartiles are located.
Example:
Let’s work with an example. Suppose, the distribution of math scores in a class of 19 students in
ascending order is:
59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98
First, mark down the median, Q2, which in this case is the tenth value: 75.
Q1 is the central point between the smallest score and the median, that is, the median of the nine scores
below Q2. In this case, Q1 is the fifth score: 68. [Note that the median can also be included when calculating Q1 or Q3 for an
odd set of values. If we were to include the median on either side of the middle point, then Q1 will be
the middle value between the first and tenth score, which is the average of the fifth and sixth score:
(fifth + sixth)/2 = (68 + 69)/2 = 68.5.]
Q3 is the middle value between Q2 and the highest score, which is the fifteenth score: 84. [Or, if you include the median, Q3 = (82 +
84)/2 = 83.]
Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) represents the first
quartile and is the 25th percentile. 68 is the median of the lower half of the score set in the available
data, i.e. the median of the nine scores below Q2.
Q1 tells us that about 25% of the scores are less than 68 and about 75% of the class scores are greater. Q2 (the
median) is the 50th percentile and shows that about 50% of the scores are less than 75, and about 50% of the scores
are above 75. Finally, Q3, the 75th percentile, reveals that about 25% of the scores are greater and about 75% are
less than 84.
Interquartile Range: The interquartile range is the range for the middle 50% of the data.
IQR = Q3 – Q1
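As a sketch, the quartiles of the 19 math scores and the resulting interquartile range can be computed with the median-of-halves method described in the example. The `include_median` flag is a hypothetical parameter introduced here to switch between the two conventions mentioned in the bracketed notes.

```python
from statistics import median

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75,
          76, 77, 81, 82, 84, 87, 90, 95, 98]

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1 and include_median:
        # Odd count: the middle value is counted in both halves.
        lower, upper = ordered[:half + 1], ordered[half:]
    else:
        # The middle value of an odd-sized set is left out of both halves.
        lower, upper = ordered[:half], ordered[n - half:]
    return median(lower), median(ordered), median(upper)

q1, q2, q3 = quartiles(scores)
print(q1, q2, q3, q3 - q1)   # 68 75 84 16

q1_inc, _, q3_inc = quartiles(scores, include_median=True)
print(q1_inc, q3_inc, q3_inc - q1_inc)   # 68.5 83.0 14.5
```

With the median excluded, IQR = 84 − 68 = 16; with the median included in both halves, IQR = 83 − 68.5 = 14.5. Statistical packages use several other quartile conventions, so small differences between tools are expected.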
Variance: Variance is the expectation of the squared deviation of a random variable from its mean.
Informally, it measures how far a set of numbers is spread out from its average value. The standard deviation is the square root of the variance.
Formula: sample standard deviation \[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \] population standard deviation \[ \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \]
Coefficient of variation: The coefficient of variation is a relative measure of variability; it
measures the standard deviation relative to the mean.
Formula: \[ \text{coefficient of variation} = \left(\frac{\text{standard deviation}}{\text{mean}} \times 100\right)\% \]
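The standard deviation and coefficient of variation formulas can be sketched directly from their definitions. The function names and the small data set are hypothetical, chosen here so the arithmetic is easy to follow by hand.

```python
import math

def sample_std(values):
    """Sample standard deviation: divide the squared deviations by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def population_std(values):
    """Population standard deviation: divide the squared deviations by N."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / n)

def coefficient_of_variation(values):
    """Sample standard deviation relative to the mean, as a percentage."""
    mean = sum(values) / len(values)
    return sample_std(values) / mean * 100

# Hypothetical data: mean 5, squared deviations sum to 32.
values = [2, 4, 4, 4, 5, 5, 7, 9]

print(population_std(values))   # 2.0, since sqrt(32 / 8) = 2
print(round(sample_std(values), 4))
print(round(coefficient_of_variation(values), 2))
```

The sample version is always a little larger than the population version for the same numbers, because it divides by n − 1 rather than N.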
Covariance: Covariance is a measure of the joint variability of two random variables. If the greater
values of one variable mainly correspond with the greater values of the other variable, and the same
holds for the lesser values, the covariance is positive. In the opposite case, when the greater values of
one variable mainly correspond to the lesser values of the other, the covariance is negative. The sign of
the covariance therefore shows the tendency in the linear relationship between the variables.
Formula: sample covariance \[ s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} \] population covariance \[ \sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N} \]
Correlation coefficient: A correlation coefficient is a numerical measure of some type
of correlation, meaning a statistical relationship between two variables. The variables may be
two columns of a given data set of observations, often called a sample, or two components of
a multivariate random variable with a known distribution.
Formula:
Pearson correlation coefficient (sample):
\[ r_{xy} = \frac{s_{xy}}{s_x s_y} \]
r_xy = sample correlation coefficient
s_xy = sample covariance
s_x = sample standard deviation of x
s_y = sample standard deviation of y
Pearson correlation coefficient (population):
\[ \rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y} \]
ρ_xy = population correlation coefficient
σ_xy = population covariance
σ_x = population standard deviation of x
σ_y = population standard deviation of y
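A minimal sketch of the sample covariance and sample correlation formulas follows. The paired observations are hypothetical, invented purely for illustration, and the function names are not from the notes.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_covariance(xs, ys):
    """s_xy: sum of products of deviations, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def sample_std(values):
    """s_x: the covariance of a variable with itself is its variance."""
    return math.sqrt(sample_covariance(values, values))

def sample_correlation(xs, ys):
    """r_xy = s_xy / (s_x * s_y)."""
    return sample_covariance(xs, ys) / (sample_std(xs) * sample_std(ys))

# Hypothetical paired data (not from the notes).
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 70, 72]

cov = sample_covariance(hours, scores)
r = sample_correlation(hours, scores)
print(cov, round(r, 4))
```

Here larger values of one variable go with larger values of the other, so the covariance is positive and r is close to +1. Reversing one of the lists flips the sign of both.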
Weighted Mean: A weighted mean is a kind of average. Instead of each data point contributing
equally to the final mean, some data points contribute more “weight” than others. If all the weights are
equal, then the weighted mean equals the arithmetic mean. Weighted means are very common in
statistics, especially when studying populations.
Formula:
\[ \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \]
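The weighted mean formula translates almost word for word into code. The values and weights below are hypothetical and serve only to show that equal weights reproduce the arithmetic mean.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical data: the third value counts twice as much as the others.
values = [10, 20, 30]

print(weighted_mean(values, [1, 1, 2]))  # 22.5
print(weighted_mean(values, [1, 1, 1]))  # 20.0, the arithmetic mean
```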
Random Variables
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a
random phenomenon.
There are two types of random variables, discrete and continuous.
Suppose a random variable X may take k different values, with the probability that X = xi defined to
be P(X = xi) = pi. The probabilities pi must satisfy the following:
1: 0 ≤ pi ≤ 1 for each i
2: p1 + p2 + ... + pk = 1
[Figure: probability histogram]
Cumulative distribution function
All random variables (discrete and continuous) have a cumulative distribution function. It is a function
giving the probability that the random variable X is less than or equal to x, for every value x. For a
discrete random variable, the cumulative distribution function is found by summing up the probabilities.
Example
The cumulative distribution function for the probability distribution P(X = 1) = 0.1, P(X = 2) = 0.3, P(X = 3) = 0.4, P(X = 4) = 0.2 is calculated as follows:
The probability that X is less than or equal to 1 is 0.1,
the probability that X is less than or equal to 2 is 0.1+0.3 = 0.4,
the probability that X is less than or equal to 3 is 0.1+0.3+0.4 = 0.8, and
the probability that X is less than or equal to 4 is 0.1+0.3+0.4+0.2 = 1.
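The running sums in this example can be reproduced with `itertools.accumulate`, using the probabilities 0.1, 0.3, 0.4, and 0.2 that appear in the sums above. Rounding to 10 decimal places is an implementation detail added here to hide floating-point noise.

```python
from itertools import accumulate

values = [1, 2, 3, 4]
probabilities = [0.1, 0.3, 0.4, 0.2]

# Running totals of the probabilities give P(X <= x) for each x.
cumulative = [round(c, 10) for c in accumulate(probabilities)]
cdf = dict(zip(values, cumulative))

for x, c in cdf.items():
    print(f"P(X <= {x}) = {c}")
```

The last cumulative value is 1 because the probabilities of a discrete distribution must sum to 1.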
A continuous random variable is not defined at specific values. It is defined over an interval of values, and
is represented by the area under a curve.
Suppose a random variable X may take all values over an interval of real numbers. Then the probability
that X is in the set of outcomes S, P(S), is defined to be the area above S and under a curve. The curve,
which represents a function f(x), must have no negative values (f(x) ≥ 0 for all x), and the total area under it must equal 1:
\[ \int_{\text{all } x} f(x)\,dx = 1 \]
Example
X is a continuous random variable with probability density function given by f(x) = cx for 0 ≤ x ≤ 1, where
c is a constant. Find c.
If we integrate f(x) between 0 and 1 we get c/2. Hence c/2 = 1 (from the useful fact above!), giving c = 2.
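Written out step by step, the integration in this example is:

```latex
\int_0^1 f(x)\,dx
  = \int_0^1 c\,x\,dx
  = c\left[\frac{x^2}{2}\right]_0^1
  = \frac{c}{2}
  = 1
\quad\Longrightarrow\quad c = 2
```

So the density is f(x) = 2x on 0 ≤ x ≤ 1, which is non-negative on that interval as a density must be.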
Discrete Probability Distribution
Formula (binomial distribution):
\[ P(X) = \frac{n!}{(n - X)!\,X!}\; p^{X} q^{\,n - X} \]
P(X) = the probability of X successes in n trials
n = the number of trials
p = the probability of a success on any one trial
q = the probability of a failure on any one trial (q = 1 − p)
Example:
80% of people who purchase pet insurance are women. If 9 pet insurance owners are randomly
selected, find the probability that exactly 6 are women.
Step 1: Identify ‘n’ from the problem. Using our sample question, n (the number of randomly selected
items) is 9.
Step 2: Identify ‘X’ from the problem. X (the number you are asked to find the probability for) is 6.
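Step 3: Work the first part of the formula. The first part of the formula is
n! / ((n − X)! X!) = 9! / (3! × 6!) = 84
Step 4: Find p and q. p is the probability of success and q is the probability of failure. We are given p =
80%, or 0.8. So the probability of failure is 1 − 0.8 = 0.2 (20%).
Step 5: Work the second part of the formula.
p^X
= 0.8^6
= 0.262144
Set this number aside for a moment.
Step 6: Work the third part of the formula.
q^(n − X)
= 0.2^3
= 0.008
Step 7: Multiply the results of Steps 3, 5, and 6 together: 84 × 0.262144 × 0.008 ≈ 0.176.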
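As a check on the pet insurance example, the whole binomial formula can be evaluated in a few lines. This sketch uses `math.comb` from the standard library for the n! / ((n − X)! X!) term; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """Probability of exactly x successes in n independent trials."""
    q = 1 - p                       # probability of failure on one trial
    return comb(n, x) * p ** x * q ** (n - x)

n, x, p = 9, 6, 0.8                 # 9 owners, exactly 6 women, p = 80%

print(comb(n, x))                        # 84
print(round(p ** x, 6))                  # 0.262144
print(round(binomial_pmf(x, n, p), 4))   # 0.1762
```

So the probability that exactly 6 of the 9 selected owners are women is about 0.176.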
Formula (Poisson distribution):
\[ P(x) = \frac{e^{-\mu} \mu^{x}}{x!} \]
P(x) = the probability of x occurrences in an interval
µ = expected value or mean number of occurrences in an interval
e = Euler’s number, approximately 2.71828
Example: Consider a computer system with Poisson job-arrival stream at an average of 2 per minute.
Determine the probability that in any one-minute interval there will be
(i) 0 jobs;
(ii) exactly 2 jobs;
(iii) at most 3 arrivals.
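Solution:
Job arrivals with µ = 2 per minute (the mean is often written λ)
(i) No job arrivals: P(0) = e^(−2) × 2^0 / 0! = e^(−2) ≈ 0.1353
(ii) Exactly 2 jobs: P(2) = e^(−2) × 2^2 / 2! = 2e^(−2) ≈ 0.2707
(iii) At most 3 arrivals: P(0) + P(1) + P(2) + P(3) = e^(−2) × (1 + 2 + 2 + 4/3) ≈ 0.8571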
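The three probabilities asked for in the job-arrival example can be computed from the Poisson formula as sketched below. The function name `poisson_pmf` is illustrative; only `math.exp` and `math.factorial` from the standard library are used.

```python
from math import exp, factorial

def poisson_pmf(x, mu):
    """Probability of exactly x occurrences when the mean count is mu."""
    return exp(-mu) * mu ** x / factorial(x)

mu = 2  # average of 2 job arrivals per minute

p_zero = poisson_pmf(0, mu)                                   # (i) 0 jobs
p_two = poisson_pmf(2, mu)                                    # (ii) exactly 2 jobs
p_at_most_three = sum(poisson_pmf(k, mu) for k in range(4))   # (iii) 0, 1, 2 or 3

print(round(p_zero, 4), round(p_two, 4), round(p_at_most_three, 4))
# 0.1353 0.2707 0.8571
```

"At most 3" is a cumulative probability, so it adds the probabilities for 0, 1, 2, and 3 arrivals.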
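Formula (hypergeometric distribution):
\[ P(x) = \frac{\binom{r}{x}\binom{N - r}{n - x}}{\binom{N}{n}} \]
where N is the population size, r is the number of successes in the population, n is the sample size, and x is the number of successes in the sample.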
Example:
A deck of cards contains 20 cards: 6 red cards and 14 black cards. 5 cards are drawn randomly without
replacement. What is the probability that exactly 4 red cards are drawn?
The probability of choosing exactly 4 red cards is:
P(4 red cards) = (number of samples with 4 red cards and 1 black card) / (number of possible 5-card samples)
\[ P(4) = \frac{\binom{6}{4}\binom{14}{1}}{\binom{20}{5}} \]
In shorthand, the above formula can be written as:
(6C4*14C1)/20C5
where
6C4 means that out of 6 possible red cards, we are choosing 4.
14C1 means that out of a possible 14 black cards, we’re choosing 1.
Solution:
= (6C4*14C1)/20C5
= 15*14/15504
= 0.0135
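The card example can be checked with `math.comb`, which computes the "n choose k" counts written above as 6C4, 14C1, and 20C5. The function and parameter names are illustrative only.

```python
from math import comb

def hypergeometric_pmf(x, population, successes, draws):
    """Probability of x successes in `draws` draws without replacement."""
    failures = population - successes
    return comb(successes, x) * comb(failures, draws - x) / comb(population, draws)

# 20 cards, 6 of them red, 5 drawn, exactly 4 red.
p_four_red = hypergeometric_pmf(x=4, population=20, successes=6, draws=5)

print(comb(6, 4), comb(14, 1), comb(20, 5))  # 15 14 15504
print(round(p_four_red, 4))                  # 0.0135
```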
Continuous probability distribution: If a random variable is a continuous variable,
its probability distribution is called a continuous probability distribution.
Continuous Probability Distribution
Formula (uniform distribution):
\[ P(x) = \frac{1}{b - a} \quad \text{for } a \le x \le b \]
Example:
The data in the table below are 55 smiling times, in seconds, of an eight-week-old baby.
10.4 19.6 18.8 13.9 17.8 16.8 21.6 17.9 12.5 11.1 4.9
12.8 14.8 22.8 20 15.9 16.3 13.4 17.1 14.5 19 22.8
1.3 0.7 8.9 11.9 10.9 7.3 5.9 3.7 17.9 19.2 9.8
5.8 6.9 2.6 5.8 21.7 11.8 3.4 2.1 4.5 6.3 10.7
8.9 9.4 9.4 7.6 10 3.3 6.7 7.8 11.6 13.8 18.6
The sample mean = 11.49 and the sample standard deviation = 6.23.
We will assume that the smiling times, in seconds, follow a uniform distribution between zero and 23
seconds, inclusive. This means that any smiling time from zero to and including 23 seconds is equally
likely. The histogram that could be constructed from the sample is an empirical distribution that closely
matches the theoretical uniform distribution.
The notation for the uniform distribution is X ~ U(a, b) where a = the lowest value of x and b = the
highest value of x.
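The probability density function is
\[ P(x) = \frac{1}{b - a} \quad \text{for } a \le x \le b, \]
which for this problem is
\[ P(x) = \frac{1}{23 - 0} = \frac{1}{23} \quad \text{for } 0 \le x \le 23. \]
The mean and standard deviation of a uniform distribution are
\[ \mu = \frac{a + b}{2} \quad \text{and} \quad \sigma = \sqrt{\frac{(b - a)^2}{12}} \]
For this problem, the theoretical mean and standard deviation are
\[ \mu = \frac{0 + 23}{2} = 11.50 \text{ seconds} \quad \text{and} \quad \sigma = \sqrt{\frac{(23 - 0)^2}{12}} = 6.64 \text{ seconds} \]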
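The theoretical values for the smiling-time example follow directly from a = 0 and b = 23. A short sketch of the calculation, with variable names chosen here for illustration:

```python
import math

a, b = 0, 23   # lowest and highest smiling time, in seconds

density = 1 / (b - a)                  # height of the uniform density
mu = (a + b) / 2                       # theoretical mean
sigma = math.sqrt((b - a) ** 2 / 12)   # theoretical standard deviation

print(round(density, 4), mu, round(sigma, 2))  # 0.0435 11.5 6.64
```

These are close to the sample mean of 11.49 and sample standard deviation of 6.23 quoted for the 55 observations.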
Formula (normal distribution):
\[ P(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-(x - \mu)^2 / (2\sigma^2)} \]
Where
µ = mean
σ = standard deviation
π = 3.14159
e = 2.71828
Example:
X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)
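Convert each x value to a z-score using z = (x − μ)/σ. What is meant here by area is the area under the standard normal curve.
a) For x = 40, z = (40 − 30)/4 = 2.5.
Hence P(x < 40) = P(z < 2.5) = [area to the left of 2.5] = 0.9938
b) For x = 21, z = (21 − 30)/4 = −2.25.
Hence P(x > 21) = P(z > −2.25) = [total area] − [area to the left of −2.25]
= 1 − 0.0122 = 0.9878
c) For x = 30, z = 0, and for x = 35, z = (35 − 30)/4 = 1.25.
Hence P(30 < x < 35) = P(0 < z < 1.25) = [area to the left of z = 1.25] − [area to the left of 0]
= 0.8944 − 0.5 = 0.3944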
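Instead of a printed z-table, the areas in this example can be looked up with `statistics.NormalDist` from the Python standard library (available from Python 3.8). This is a sketch for checking the table values, not part of the original solution.

```python
from statistics import NormalDist

X = NormalDist(mu=30, sigma=4)

p_a = X.cdf(40)               # a) P(x < 40): area to the left of 40
p_b = 1 - X.cdf(21)           # b) P(x > 21): total area minus area left of 21
p_c = X.cdf(35) - X.cdf(30)   # c) P(30 < x < 35): difference of two areas

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))  # 0.9938 0.9878 0.3944

# The same areas via z-scores and the standard normal curve.
Z = NormalDist()              # mean 0, standard deviation 1
z = (40 - 30) / 4
print(z, round(Z.cdf(z), 4))  # 2.5 0.9938
```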
Formula (exponential distribution):
\[ P(x) = \frac{1}{\mu}\, e^{-x/\mu} \quad \text{for } x \ge 0,\ \mu > 0 \]
Example:
Let X = amount of time (in minutes) a postal clerk spends with his or her customer. The time is known to
have an exponential distribution with the average amount of time equal to four minutes.
X is a continuous random variable since time is measured. It is given that μ = 4 minutes. To do any
calculations, you must know m, the decay parameter.
\[ m = \frac{1}{\mu}. \quad \text{Therefore, } m = \frac{1}{4} = 0.25 \]
The standard deviation, σ, is the same as the mean: μ = σ.
The probability density function is \( P(x) = m e^{-mx} \). The number e = 2.71828182846… It is a number that is
used often in mathematics. Scientific calculators have the key “e^x.” If you enter one for x, the calculator
will display the value e.
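A small sketch of the postal clerk example: it derives the decay parameter from the mean and evaluates the density at a couple of points. The function name `exponential_pdf` is illustrative, and `math.exp(1)` plays the role of the calculator key described above.

```python
import math

mu = 4        # average time with a customer, in minutes
m = 1 / mu    # decay parameter

def exponential_pdf(x, m):
    """Density m * e^(-m x) for x >= 0, and 0 for negative x."""
    return m * math.exp(-m * x) if x >= 0 else 0.0

print(m)                                  # 0.25
print(exponential_pdf(0, m))              # 0.25, the density is highest at x = 0
print(round(exponential_pdf(4, m), 4))    # 0.092
print(math.exp(1))                        # the number e
```

The density starts at m when x = 0 and decays toward zero as x grows, which is why m is called the decay parameter.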
Formula (sampling distribution of the sample mean):
Mean \( \mu_M = \mu \)
Example:
Draw all possible samples of size 2 without replacement from a population consisting of 3, 6, 9, 12, 15.
Form the sampling distribution of sample means and verify the results.
(i) \( E(\bar{x}) = \mu \)
(ii) \( \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N - n}{N - 1}\right) \)
Solution:
We have population values 3, 6, 9, 12, 15, population size N=5 and sample size n=2. Thus, the number of
possible samples which can be drawn without replacement is
\[ \binom{N}{n} = \binom{5}{2} = 10 \]
Sample No.   Sample Values   Sample Mean (x̄)
1            3, 6            4.5
2            3, 9            6.0
3            3, 12           7.5
4            3, 15           9.0
5            6, 9            7.5
6            6, 12           9.0
7            6, 15           10.5
8            9, 12           10.5
9            9, 15           12.0
10           12, 15          13.5
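The sampling distribution of the sample mean x̄ and its mean and variance are:
x̄        4.5    6.0    7.5    9.0    10.5   12.0   13.5
f(x̄)     1/10   1/10   2/10   2/10   2/10   1/10   1/10
\[ E(\bar{x}) = \sum \bar{x} f(\bar{x}) = \frac{90}{10} = 9 \]
\[ \mathrm{Var}(\bar{x}) = \sum \bar{x}^2 f(\bar{x}) - \left[\sum \bar{x} f(\bar{x})\right]^2 = \frac{877.5}{10} - \left(\frac{90}{10}\right)^2 = 87.75 - 81 = 6.75 \]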
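The mean and variance of the population are:
X     3    6    9    12    15    ∑X = 45
X²    9    36   81   144   225   ∑X² = 495
\[ \mu = \frac{\sum X}{N} = \frac{45}{5} = 9 \quad \text{and} \quad \sigma^2 = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2 = \frac{495}{5} - \left(\frac{45}{5}\right)^2 = 99 - 81 = 18 \]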
Verification:
(i) \( E(\bar{x}) = \mu = 9 \)
(ii) \( \mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N - n}{N - 1}\right) = \frac{18}{2}\left(\frac{5 - 2}{5 - 1}\right) = 6.75 \)
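The whole verification can be reproduced by enumerating every sample of size 2 with `itertools.combinations`. This is a sketch of the same calculation, with variable names chosen here for illustration.

```python
from itertools import combinations

population = [3, 6, 9, 12, 15]
N, n = len(population), 2

# Every possible sample of size n drawn without replacement, and its mean.
sample_means = [sum(s) / n for s in combinations(population, n)]
k = len(sample_means)

mean_of_means = sum(sample_means) / k
var_of_means = sum(m ** 2 for m in sample_means) / k - mean_of_means ** 2

# Population mean and variance.
mu = sum(population) / N
sigma2 = sum(x ** 2 for x in population) / N - mu ** 2

# Variance predicted by the finite-population formula.
formula_var = sigma2 / n * (N - n) / (N - 1)

print(k, mean_of_means, var_of_means)   # 10 9.0 6.75
print(mu, sigma2, formula_var)          # 9.0 18.0 6.75
```

Both routes give E(x̄) = 9 and Var(x̄) = 6.75, which confirms results (i) and (ii) for this population.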