
Definitions of Statistics

The science of statistics deals with the collection, analysis, interpretation, and presentation of data. We
see and use data in our everyday lives.
We generally want to study a population: a collection of persons, things, or objects under study. To do so, we
select a sample from the population. The idea of sampling is to select a subset of the larger population and
study that sample to gain information about the population. Data are the result of sampling from a
population.

Example:
Population: all the students at your school
Sample: 50 students selected from that school

Mean: The mean is a measure of the central tendency of the data.

Formula: sample mean $\bar{x} = \frac{\sum x_i}{n}$ and population mean $\mu = \frac{\sum x_i}{N}$

Example:
Find the sample mean for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79,
80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405.
Step 1: Add up all of the numbers:
12 + 13 + 14 + 16 + 17 + 40 + 43 + 55 + 56 + 67 + 78 + 78 + 79 + 80 + 81 + 90 + 99 + 101 + 102 + 304 + 306
+ 400 + 401 + 403 + 404 + 405 = 3744.
Step 2: Count the number of items in your data set. In this particular data set there are 26 items.
Step 3: Divide the number you found in Step 1 by the number you found in Step 2: 3744/26 = 144.
$\bar{x} = \frac{\sum x_i}{n} = \frac{3744}{26} = 144$
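
The three steps above can be reproduced with a short Python sketch. The variable names are illustrative only; the data are the 26 values from the example.

```python
# Data set from the worked example above.
data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79,
        80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404, 405]

total = sum(data)        # Step 1: add up all of the numbers
n = len(data)            # Step 2: count the items
sample_mean = total / n  # Step 3: divide the sum by the count

print(total, n, sample_mean)  # 3744 26 144.0
```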

Median: The median is the middle value of the data once the data are arranged in ascending order.

a) For an odd number of observations, the median is the middle value.
b) For an even number of observations, the median is the average of the two middle values.
Example:
Find the median for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
90, 99, 101, 102, 304, 306, 404, 401, 403, 400.
Arrange the data in ascending order:
a) 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404.
n = 25, which is odd, so the median is the middle (13th) value of the data: 79.
b) 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404,
405.
n = 26, which is even, so the median is the average of the two middle values (the 13th and 14th):
Median = $\frac{79 + 80}{2} = 79.5$
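
As a rough check of both cases, the odd/even rule can be sketched in Python. The helper name `median` is our own; Python's standard `statistics.median` applies the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

# The 25 values from the example, in the unsorted order in which they are given.
odd_set = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
           90, 99, 101, 102, 304, 306, 404, 401, 403, 400]
even_set = odd_set + [405]  # case b) adds one more value, so n = 26

print(median(odd_set))   # 79
print(median(even_set))  # 79.5
```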

Mode: Mode is the value that occurs with the greatest frequency.
Example:
Find the mode for the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
90, 99, 101, 102, 304, 306, 404, 401, 403, 400.
In the above set of numbers, the mode is 78: it occurs twice, while every other value occurs once.
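
A minimal Python sketch of the same count, using `collections.Counter` from the standard library. The variable names are illustrative.

```python
from collections import Counter

data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
        90, 99, 101, 102, 304, 306, 404, 401, 403, 400]

# most_common(1) returns the (value, frequency) pair with the highest frequency.
mode, frequency = Counter(data).most_common(1)[0]

print(mode, frequency)  # 78 2
```

If several values tied for the highest frequency, this sketch would report only one of them, so a data set with more than one mode needs extra handling.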

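Percentiles: The pth percentile is a value such that at least p percent of the observations are less
than or equal to this value and at least (100 − p) percent of the observations are greater than or equal to
this value.
Formula: $i = \left(\frac{p}{100}\right) \times n$, where i is the position of the pth percentile in the ordered data and n is the number of observations.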
Example:
Determine the 75th percentile of the following set of numbers: 12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
90, 99, 101, 102, 304, 306, 404, 401, 403, 400.
Arrange the data in ascending order:
12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81, 90, 99, 101, 102, 304, 306, 400, 401, 403, 404.
$i = \left(\frac{75}{100}\right) \times 25 = 18.75$
Because i is not an integer, the position of the 75th percentile is the next integer greater than 18.75: the 19th position.
The data value in the 19th position is 102, so the 75th percentile is 102.
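
The position rule can be sketched in Python as follows. The text only covers the case where i is not an integer; for an integer i this sketch averages the values in positions i and i + 1, which is a common textbook convention and an assumption on our part. The function name `percentile` is also ours.

```python
import math

def percentile(values, p):
    """Percentile by the position rule i = (p / 100) * n, for 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    n = len(ordered)
    i = p * n / 100
    if i.is_integer():
        # Assumption (not stated in the text): average positions i and i + 1.
        k = int(i)
        return (ordered[k - 1] + ordered[k]) / 2
    # Non-integer i: round up to the next position (positions are 1-based).
    return ordered[math.ceil(i) - 1]

data = [12, 13, 14, 16, 17, 40, 43, 55, 56, 67, 78, 78, 79, 80, 81,
        90, 99, 101, 102, 304, 306, 404, 401, 403, 400]

print(percentile(data, 75))  # 102
```

Statistical libraries often interpolate between positions instead, so their percentile values can differ slightly from this rule.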

Quartiles: A division of observations into four defined intervals based upon the values of the data
and how they compare to the entire set of observations.
Each quartile contains 25% of the total observations. The data are arranged from smallest to largest:

1. First quartile: the lowest 25% of numbers
2. Second quartile: between 25.1% and 50% (up to the median)
3. Third quartile: 50.1% to 75% (above the median)
4. Fourth quartile: the highest 25% of numbers

Example:
Let’s work with an example. Suppose, the distribution of math scores in a class of 19 students in
ascending order is:

59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98

First, mark down the median, Q2, which in this case is the tenth value: 75.

Q1 is the middle value between the smallest score and the median. In this case, Q1 is the middle of the nine
scores below the median, which is the fifth score: 68. [Note that the median can also be included when calculating Q1 or Q3 for an
odd set of values. If we include the median in each half, then Q1 is
the middle value of the first ten scores, which is the average of the fifth and sixth scores:
(fifth + sixth)/2 = (68 + 69)/2 = 68.5.]

Q3 is the middle value between Q2 and the highest score, which is the fifteenth score: 84. [Or, if you include the median, Q3 = (82 +
84)/2 = 83.]

Now that we have our quartiles, let’s interpret their numbers. A score of 68 (Q1) marks the first
quartile and is the 25th percentile. 68 is the median of the lower half of the score set in the available
data, i.e. the median of the nine scores that precede Q2.

Q1 tells us that about 25% of the scores are less than 68 and about 75% of the class scores are greater. Q2 (the
median) is the 50th percentile and shows that about 50% of the scores are less than 75 and about 50% of the scores
are above 75. Finally, Q3, the 75th percentile, reveals that about 25% of the scores are greater than 84 and about 75% are
less than 84.
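
Both conventions from the example (leaving the median out of each half, or including it) can be sketched in Python. The function names and the `include_median` flag are our own illustrative choices.

```python
def median(values):
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values, include_median=False):
    """Return (Q1, Q2, Q3) as medians of the lower half, whole set, and upper half."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd n: the middle value is either left out of both halves or put in both.
        lower = ordered[:mid + 1] if include_median else ordered[:mid]
        upper = ordered[mid:] if include_median else ordered[mid + 1:]
    else:
        lower, upper = ordered[:mid], ordered[mid:]
    return median(lower), median(ordered), median(upper)

scores = [59, 60, 65, 65, 68, 69, 70, 72, 75, 75, 76, 77, 81, 82, 84, 87, 90, 95, 98]

print(quartiles(scores))                       # (68, 75, 84)
print(quartiles(scores, include_median=True))  # (68.5, 75, 83.0)
```

The interquartile range then follows directly as Q3 minus Q1, which is 84 − 68 = 16 under the first convention.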

Range: The simplest measure of variability is the range.


Range = largest value – smallest value

Interquartile Range: The interquartile range is the range for the middle 50% of the data.
IQR = Q3 – Q1

Variance:  Variance is the expectation of the squared deviation of a random variable from its mean.
Informally, it measures how far a set of numbers are spread out from their average value.

Formula: sample variance $s^2 = \frac{\sum (x_i - \bar{x})^2}{n-1}$; population variance $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$


Standard Deviation:  The standard deviation is a measure of the amount of variation or
dispersion of a set of values. A low standard deviation indicates that the values tend to be close to the
mean of the set, while a high standard deviation indicates that the values are spread out over a wider
range.

Formula: sample standard deviation $s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}$; population standard deviation $\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$

Coefficient of variation: The coefficient of variation is a relative measure of variability; it
measures the standard deviation relative to the mean.

Formula: coefficient of variation = $\left(\frac{\text{standard deviation}}{\text{mean}} \times 100\right)\%$
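
The text gives no worked example for these measures, so the sketch below uses a small hypothetical data set (our assumption, not from the source) to show how the variance, standard deviation, and coefficient of variation formulas fit together.

```python
import math

# Hypothetical values chosen for easy arithmetic; not taken from the text.
data = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(data)
mean = sum(data) / n
squared_deviations = sum((x - mean) ** 2 for x in data)

sample_variance = squared_deviations / (n - 1)  # divide by n - 1 for a sample
population_variance = squared_deviations / n    # divide by N for a population
sample_sd = math.sqrt(sample_variance)
population_sd = math.sqrt(population_variance)

# Coefficient of variation, here based on the sample standard deviation.
cv_percent = sample_sd / mean * 100

print(mean, population_variance, population_sd)  # 5.0 4.0 2.0
```

Python's standard `statistics` module offers `variance`, `pvariance`, `stdev`, and `pstdev`, which should agree with these hand-computed values.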

Covariance: covariance is a measure of the joint variability of two random variables. If the greater
values of one variable mainly correspond with the greater values of the other variable, and the same
holds for the lesser values, the covariance is positive. In the opposite case, when the greater values of
one variable mainly correspond to the lesser values of the other, the covariance is negative. The sign of
the covariance therefore shows the tendency in the linear relationship between the variables.

Formula: sample covariance $s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1}$; population covariance $\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$

Correlation coefficient: A correlation coefficient is a numerical measure of some type
of correlation, meaning a statistical relationship between two variables. The variables may be
two columns of a given data set of observations, often called a sample, or two components of
a multivariate random variable with a known distribution.

Formula:
Pearson correlation coefficient for sample data

$r_{xy} = \frac{s_{xy}}{s_x s_y}$
$r_{xy}$ = sample correlation coefficient
$s_{xy}$ = sample covariance
$s_x$ = sample standard deviation of x
$s_y$ = sample standard deviation of y

Pearson correlation coefficient for population data

$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
$\rho_{xy}$ = population correlation coefficient
$\sigma_{xy}$ = population covariance
$\sigma_x$ = population standard deviation of x
$\sigma_y$ = population standard deviation of y
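
To illustrate how the sample covariance feeds into the sample correlation coefficient, here is a minimal sketch on hypothetical paired data (the x and y values are our assumption, not from the text).

```python
import math

# Hypothetical paired observations; not taken from the text.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Sample covariance: sum of products of deviations, divided by n - 1.
s_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / (n - 1)

# Sample standard deviations of x and y.
s_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
s_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))

# Pearson sample correlation coefficient.
r_xy = s_xy / (s_x * s_y)

print(s_xy, round(r_xy, 4))  # 1.5 0.7746
```

The positive covariance and the correlation of about 0.77 both indicate that larger x values tend to go with larger y values in this made-up sample.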

Weighted Mean: A weighted mean is a kind of average. Instead of each data point contributing
equally to the final mean, some data points contribute more “weight” than others. If all the weights are
equal, then the weighted mean equals the arithmetic mean. Weighted means are very common in
statistics, especially when studying populations.

Formula:
$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$

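
A short sketch of the weighted mean formula, using hypothetical scores and weights (our assumption). It also checks the statement above that equal weights give the ordinary arithmetic mean.

```python
def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical scores and weights; not taken from the text.
scores = [80, 90, 70]
weights = [2, 3, 5]

print(weighted_mean(scores, weights))    # 78.0
print(weighted_mean(scores, [1, 1, 1]))  # 80.0, same as the arithmetic mean
```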
Random Variables
A random variable, usually written X, is a variable whose possible values are numerical outcomes of a
random phenomenon.
There are two types of random variables, discrete and continuous.

Discrete Random Variables


A discrete random variable is one which may take on only a countable number of distinct values such as
0,1,2,3,4,.... Discrete random variables are usually counts. If a random variable can take only a finite
number of distinct values, then it must be discrete. Examples of discrete random variables include the
number of children in a family, the number of defective light bulbs in a box of ten.

Suppose a random variable X may take k different values, with the probability that X = $x_i$ defined to
be P(X = $x_i$) = $p_i$. The probabilities $p_i$ must satisfy the following:

1: 0 ≤ $p_i$ ≤ 1 for each i
2: $p_1 + p_2 + ... + p_k = 1$.

Example: Suppose a variable X can take the values 1, 2, 3, or 4.
The probabilities associated with each outcome are described by the following table:
Outcome       1     2     3     4
Probability   0.1   0.3   0.4   0.2
The probability that X is equal to 2 or 3 is the sum of the two probabilities: P(X = 2 or X = 3) = P(X = 2) +
P(X = 3) = 0.3 + 0.4 = 0.7. Similarly, the probability that X is greater than 1 is equal to 1 - P(X = 1) = 1 - 0.1
= 0.9, by the complement rule.

Probability histogram
Cumulative distribution function
All random variables (discrete and continuous) have a cumulative distribution function. It is a function
giving the probability that the random variable X is less than or equal to x, for every value x. For a
discrete random variable, the cumulative distribution function is found by summing up the probabilities.

Example
The cumulative distribution function for the above probability distribution is calculated as follows:
The probability that X is less than or equal to 1 is 0.1,
the probability that X is less than or equal to 2 is 0.1+0.3 = 0.4,
the probability that X is less than or equal to 3 is 0.1+0.3+0.4 = 0.8, and
the probability that X is less than or equal to 4 is 0.1+0.3+0.4+0.2 = 1.
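
The probabilities and the cumulative distribution function in the two examples above can be reproduced with a small sketch. The dictionary `pmf` simply encodes the table; the other names are illustrative.

```python
import math
from itertools import accumulate

# Probability table from the example: outcome -> probability.
pmf = {1: 0.1, 2: 0.3, 3: 0.4, 4: 0.2}

# The two conditions every discrete distribution must satisfy.
assert all(0 <= p <= 1 for p in pmf.values())
assert math.isclose(sum(pmf.values()), 1.0)

p_2_or_3 = pmf[2] + pmf[3]      # P(X = 2 or X = 3)
p_greater_than_1 = 1 - pmf[1]   # complement rule: P(X > 1) = 1 - P(X = 1)

# Cumulative distribution: running sum of the probabilities in outcome order.
outcomes = sorted(pmf)
cdf = dict(zip(outcomes, accumulate(pmf[x] for x in outcomes)))

print(round(p_2_or_3, 2), round(p_greater_than_1, 2))  # 0.7 0.9
print({x: round(c, 2) for x, c in cdf.items()})        # {1: 0.1, 2: 0.4, 3: 0.8, 4: 1.0}
```

The rounding is only there to hide floating-point noise in the printed sums.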

Probability histogram for the cumulative distribution

Continuous Random Variables


A continuous random variable is one which takes an infinite number of possible values. Continuous
random variables are usually measurements. Examples include height, weight, the amount of sugar in an
orange, the time required to run a mile.

A continuous random variable does not assign positive probability to any single value. Instead, it is defined over an interval of values, and
probabilities are represented by areas under a curve.

Suppose a random variable X may take all values over an interval of real numbers. Then the probability
that X is in the set of outcomes S, P(S), is defined to be the area above S and under a curve. The curve,
which represents a function p(x), must satisfy the following:

1: The curve has no negative values (p(x) ≥ 0 for all x)
2: The total area under the curve is equal to 1.

A curve meeting these requirements is known as a density curve.

For a continuous random variable with probability density function f(x), the total-area condition is

$\int_{\text{all } x} f(x)\,dx = 1$

Example
X is a continuous random variable with probability density function given by f(x) = cx for 0 ≤ x ≤ 1, where
c is a constant. Find c.

If we integrate f(x) = cx between 0 and 1 we get c/2. Because the total area under a density curve must equal 1, c/2 = 1, giving c = 2.
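
The value of c can also be checked numerically. The sketch below uses a simple midpoint rule, which is our choice of method and not part of the text; the helper name `integrate` is illustrative.

```python
def integrate(f, a, b, steps=1000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    width = (b - a) / steps
    return sum(f(a + (k + 0.5) * width) for k in range(steps)) * width

# Area under x on [0, 1] is 1/2, so the area under c*x is c/2.
area_per_unit_c = integrate(lambda x: x, 0.0, 1.0)
c = 1 / area_per_unit_c  # choose c so that the total area equals 1

total_area = integrate(lambda x: c * x, 0.0, 1.0)
print(round(c, 6), round(total_area, 6))  # 2.0 1.0
```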

Discrete probability distribution: A discrete probability distribution is the probability
distribution of a random variable that can take on a countable number of values. When there are infinitely many possible values, the probabilities
have to decline to zero fast enough to add up to 1. For example, if P(X = n) = 1/2^n for n = 1, 2, ..., the sum of the probabilities is 1/2
+ 1/4 + 1/8 + ... = 1.

Discrete probability distributions covered in this section: binomial, Poisson, and hypergeometric.

Binomial probability distribution: A binomial distribution is a specific probability
distribution. It models the probability of obtaining a given number of successes (X) in a fixed number of
independent trials (n) of a discrete random event. Each trial has only
two outcomes: the outcome of interest is called a success and any other outcome is a failure. The
probability of a success is p and the probability of a failure is q = 1 − p.

Formula:
$P(X) = \frac{n!}{(n-X)!\,X!}\, p^X q^{n-X}$
P(X) = the probability of X successes in n trials
n = the number of trials
p = the probability of a success on any one trial
q = the probability of a failure on any one trial

Example:
80% of people who purchase pet insurance are women.  If 9 pet insurance owners are randomly
selected, find the probability that exactly 6 are women.
Step 1: Identify ‘n’ from the problem. Using our sample question, n (the number of randomly selected
items) is 9.
Step 2: Identify ‘X’ from the problem. X (the number you are asked to find the probability for) is 6.
Step 3: Work the first part of the formula. The first part of the formula is
n! / ((n − X)! × X!)

Substitute your variables:

9! / ((9 − 6)! × 6!)

which equals 84. Set this number aside for a moment.

Step 4: Find p and q. p is the probability of success and q is the probability of failure. We are given p =
80%, or 0.8. So the probability of failure is 1 − 0.8 = 0.2 (20%).
Step 5: Work the second part of the formula.
$p^X = 0.8^6 = 0.262144$
Set this number aside for a moment.

Step 6: Work the third part of the formula.
$q^{n-X} = 0.2^{9-6} = 0.2^3 = 0.008$

Step 7: Multiply your answers from steps 3, 5, and 6 together.
84 × 0.262144 × 0.008 ≈ 0.176.
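
The seven steps can be condensed into a few lines of Python. `math.comb` (Python 3.8+) computes n! / ((n − X)! × X!); the function name `binomial_probability` is our own.

```python
import math

def binomial_probability(n, x, p):
    """P(X = x) for x successes in n independent trials with success probability p."""
    q = 1 - p
    return math.comb(n, x) * p ** x * q ** (n - x)

n, x, p = 9, 6, 0.8  # values from the pet insurance example

print(math.comb(n, x))                           # 84
print(round(binomial_probability(n, x, p), 3))   # 0.176
```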

Poisson probability distribution: Poisson distributions are used to calculate the
probability of a given number of events occurring over a certain interval. The interval can be one of time, area, volume or
distance.

Formula:
$P(x) = \frac{e^{-\mu} \mu^x}{x!}$
P(x) = the probability of x occurrences in an interval
µ = expected value, or mean number of occurrences, in an interval
e = Euler’s number, approximately 2.71828

Example: Consider a computer system with a Poisson job-arrival stream at an average of 2 jobs per minute.
Determine the probability that in any one-minute interval there will be
(i) 0 jobs;
(ii) exactly 3 jobs;
(iii) at most 3 arrivals;
(iv) more than 3 arrivals.

Solution:
Job arrivals with µ = 2
(i) No job arrivals:

$P(X = 0) = e^{-2} = 0.1353$

(ii) Exactly 3 job arrivals:

$P(X = 3) = \frac{e^{-2} \times 2^3}{3!} = 0.1804$

(iii) At most 3 arrivals:

$P(X \le 3) = P(0) + P(1) + P(2) + P(3)$
$= e^{-2} + \frac{e^{-2} \times 2}{1!} + \frac{e^{-2} \times 2^2}{2!} + \frac{e^{-2} \times 2^3}{3!}$
$= 0.1353 + 0.2707 + 0.2707 + 0.1804$
$= 0.8571$

(iv) More than 3 arrivals:

$P(X > 3) = 1 - P(X \le 3) = 1 - 0.8571 = 0.1429$
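
The probabilities worked out in the solution above can be checked with a short sketch of the Poisson formula. The function name `poisson_probability` is illustrative.

```python
import math

def poisson_probability(x, mu):
    """P(X = x) when the mean number of occurrences per interval is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)

mu = 2  # average of 2 job arrivals per minute

p_none = poisson_probability(0, mu)
p_three = poisson_probability(3, mu)
p_at_most_three = sum(poisson_probability(x, mu) for x in range(4))
p_more_than_three = 1 - p_at_most_three

print(round(p_none, 4), round(p_three, 4))                      # 0.1353 0.1804
print(round(p_at_most_three, 4), round(p_more_than_three, 4))   # 0.8571 0.1429
```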

Hypergeometric probability distribution:
The hypergeometric distribution is a discrete probability distribution that describes the probability
of x successes in n draws, without replacement, from a finite population of size N that contains exactly r
objects with the feature of interest, wherein each draw is either a success or a failure.
Formula:

$P(x) = \frac{\binom{r}{x}\binom{N-r}{n-x}}{\binom{N}{n}}$

Example:
A deck of cards contains 20 cards: 6 red cards and 14 black cards. 5 cards are drawn randomly without
replacement. What is the probability that exactly 4 red cards are drawn?
The probability of choosing exactly 4 red cards is:
P(4 red cards) = (number of samples with 4 red cards and 1 black card) / (number of possible 5-card samples)

Using the combinations formula, the problem becomes:

$P(4) = \frac{\binom{6}{4}\binom{14}{1}}{\binom{20}{5}}$

In shorthand, the above formula can be written as:
(6C4 × 14C1)/20C5
where
6C4 means that out of 6 possible red cards, we are choosing 4.
14C1 means that out of a possible 14 black cards, we’re choosing 1.
20C5 means that out of all 20 cards, we are choosing 5.

Solution:
(6C4 × 14C1)/20C5 = (15 × 14)/15504 = 210/15504 ≈ 0.0135
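
The card example can be verified with `math.comb`. In this sketch the parameter names `r`, `N`, `n`, and `x` stand for the number of success items, the population size, the number of draws, and the number of successes drawn; the function name is our own.

```python
import math

def hypergeometric_probability(x, n, r, N):
    """P(x successes in n draws without replacement; r success items among N)."""
    return math.comb(r, x) * math.comb(N - r, n - x) / math.comb(N, n)

# 20 cards, 6 of them red, 5 drawn, exactly 4 red wanted.
probability = hypergeometric_probability(x=4, n=5, r=6, N=20)

print(math.comb(6, 4), math.comb(14, 1), math.comb(20, 5))  # 15 14 15504
print(round(probability, 4))                                 # 0.0135
```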

Continuous probability distribution: If a random variable is a continuous variable,
its probability distribution is called a continuous probability distribution.

Continuous probability distributions covered in this section: uniform, normal, and exponential.

Uniform Probability Distribution: Uniform distribution refers to a probability
distribution for which all of the values that a random variable can take on occur with equal probability.

Formula:
$P(x) = \frac{1}{b-a}$ for a ≤ x ≤ b

Example:

The data in the table below are 55 smiling times, in seconds, of an eight-week-old baby.

10.4 19.6 18.8 13.9 17.8 16.8 21.6 17.9 12.5 11.1 4.9
12.8 14.8 22.8 20 15.9 16.3 13.4 17.1 14.5 19 22.8
1.3 0.7 8.9 11.9 10.9 7.3 5.9 3.7 17.9 19.2 9.8
5.8 6.9 2.6 5.8 21.7 11.8 3.4 2.1 4.5 6.3 10.7
8.9 9.4 9.4 7.6 10 3.3 6.7 7.8 11.6 13.8 18.6

The sample mean = 11.49 and the sample standard deviation = 6.23.

We will assume that the smiling times, in seconds, follow a uniform distribution between zero and 23
seconds, inclusive. This means that any smiling time from zero to and including 23 seconds is equally
likely. The histogram that could be constructed from the sample is an empirical distribution that closely
matches the theoretical uniform distribution.

Let X = length, in seconds, of an eight-week-old baby’s smile.

The notation for the uniform distribution is X ~ U(a, b) where a = the lowest value of x and b = the
highest value of x.

The probability density function is

$P(x) = \frac{1}{b-a}$

for a ≤ x ≤ b.

For this example, X ~ U(0, 23) and

$P(x) = \frac{1}{23-0} = \frac{1}{23}$

for 0 ≤ x ≤ 23.

Formulas for the theoretical mean and standard deviation are

$\mu = \frac{a+b}{2}$ and $\sigma = \sqrt{\frac{(b-a)^2}{12}}$

For this problem, the theoretical mean and standard deviation are

$\mu = \frac{0+23}{2} = 11.50$ seconds and $\sigma = \sqrt{\frac{(23-0)^2}{12}} = 6.64$ seconds
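
A minimal sketch of the theoretical values for U(0, 23), using only the formulas given above. The variable names are illustrative.

```python
import math

a, b = 0, 23  # lowest and highest smiling time, in seconds

density = 1 / (b - a)                 # constant height of the density on [a, b]
mean = (a + b) / 2                    # theoretical mean
sd = math.sqrt((b - a) ** 2 / 12)     # theoretical standard deviation

print(mean, round(sd, 2))  # 11.5 6.64
```

These theoretical values sit close to the sample mean of 11.49 and sample standard deviation of 6.23 reported for the 55 observations.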

Normal Probability Distribution: Normal distribution, also known as the
Gaussian distribution, is a probability distribution that is symmetric about the mean, showing that data
near the mean are more frequent in occurrence than data far from the mean.

Formula:
$P(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2 / (2\sigma^2)}$

where

µ = mean
σ = standard deviation
π ≈ 3.14159
e ≈ 2.71828
Example:
X is a normally distributed variable with mean μ = 30 and standard deviation σ = 4. Find
a) P(x < 40)
b) P(x > 21)
c) P(30 < x < 35)

In the solution, each x value is converted to a z-value, z = (x − μ)/σ, and "area" means the area under the standard normal curve.

a) For x = 40, the z-value z = (40 - 30) / 4 = 2.5

Hence P(x < 40) = P(z < 2.5) = [area to the left of 2.5] = 0.9938

b) For x = 21, z = (21 - 30) / 4 = -2.25

Hence P(x > 21) = P(z > -2.25) = [total area] - [area to the left of -2.25]

= 1 - 0.0122 = 0.9878

c) For x = 30 , z = (30 - 30) / 4 = 0 and for x = 35, z = (35 - 30) / 4 = 1.25

Hence P(30 < x < 35) = P(0 < z < 1.25) = [area to the left of z = 1.25] - [area to the left of 0]

= 0.8944 - 0.5 = 0.3944
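
Instead of reading the areas from a table, they can be computed from the error function in Python's standard `math` module, using the identity Φ(z) = ½(1 + erf(z/√2)). The helper name `normal_cdf` is our own.

```python
import math

def normal_cdf(x, mu, sigma):
    """Area to the left of x under a normal curve with mean mu and std dev sigma."""
    z = (x - mu) / sigma
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu, sigma = 30, 4

p_a = normal_cdf(40, mu, sigma)                              # P(x < 40)
p_b = 1 - normal_cdf(21, mu, sigma)                          # P(x > 21)
p_c = normal_cdf(35, mu, sigma) - normal_cdf(30, mu, sigma)  # P(30 < x < 35)

print(round(p_a, 4), round(p_b, 4), round(p_c, 4))  # 0.9938 0.9878 0.3944
```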

Exponential Probability Distribution: The exponential distribution is the probability
distribution of the time between events in a Poisson point process, i.e., a process in which events occur
continuously and independently at a constant average rate.

Formula:
$P(x) = \frac{1}{\mu}\, e^{-x/\mu}$ for x ≥ 0, µ > 0

Example:
Let X = amount of time (in minutes) a postal clerk spends with his or her customer. The time is known to
have an exponential distribution with the average amount of time equal to four minutes.

X is a continuous random variable since time is measured. It is given that μ = 4 minutes. To do any
calculations, you must know m, the decay parameter.

$m = \frac{1}{\mu}$. Therefore, $m = \frac{1}{4} = 0.25$.

The standard deviation, σ, is the same as the mean: μ = σ.

The distribution notation is X ~ Exp(m). Therefore, X ~ Exp(0.25).

The probability density function is $P(x) = m e^{-mx}$. The number e = 2.71828182846… It is a number that is
used often in mathematics. Scientific calculators have the key "e^x." If you enter one for x, the calculator
will display the value e.

The curve is:

$P(x) = 0.25\, e^{-0.25x}$, where x is at least zero and m = 0.25.

For example, $P(5) = 0.25\, e^{-(0.25)(5)} = 0.072$. This is the height of the density curve at x = 5, that is, when the postal clerk spends five minutes with a customer.

Notice that the graph of P(x) is a declining curve. When x = 0,

$P(x) = 0.25\, e^{-(0.25)(0)} = (0.25)(1) = 0.25 = m$. The maximum value on the y-axis is m.
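
The density values quoted above can be reproduced with a short sketch. The function name `exponential_density` is illustrative.

```python
import math

def exponential_density(x, m):
    """Height of the exponential density m * e^(-m x) for x >= 0."""
    return m * math.exp(-m * x)

mu = 4      # mean time with a customer, in minutes
m = 1 / mu  # decay parameter

print(m)                                    # 0.25
print(round(exponential_density(5, m), 3))  # 0.072
print(exponential_density(0, m))            # 0.25, the maximum height of the curve
```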

Sampling Distribution: A sampling distribution is a probability distribution of a statistic obtained through a large number of samples drawn
from a specific population. It describes the range of values the statistic could take across those samples
and how frequently each value occurs.

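Formula:

Mean: $\mu_{\bar{x}} = \mu$

Standard deviation:
Finite population: $\sigma_{\bar{x}} = \sqrt{\frac{N-n}{N-1}} \left(\frac{\sigma}{\sqrt{n}}\right)$
Infinite population: $\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$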

Example:
Draw all possible samples of size 2 without replacement from a population consisting of 3, 6, 9, 12, 15.
Form the sampling distribution of sample means and verify the following results:

(i) $E(\bar{x}) = \mu$
(ii) $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n} \left(\frac{N-n}{N-1}\right)$

Solution:
We have population values 3, 6, 9, 12, 15, population size N=5 and sample size n=2. Thus, the number of
possible samples which can be drawn without replacement is

$\binom{N}{n} = \binom{5}{2} = 10$

Sample No.   Sample Values   Sample Mean (x̄)
1            3, 6            4.5
2            3, 9            6.0
3            3, 12           7.5
4            3, 15           9.0
5            6, 9            7.5
6            6, 12           9.0
7            6, 15           10.5
8            9, 12           10.5
9            9, 15           12.0
10           12, 15          13.5

The sampling distribution of the sample mean x̄ and its mean and variance are:

x̄       f     f(x̄)    x̄ f(x̄)     x̄² f(x̄)
4.5     1     1/10    4.5/10     20.25/10
6.0     1     1/10    6.0/10     36.00/10
7.5     2     2/10    15.0/10    112.50/10
9.0     2     2/10    18.0/10    162.00/10
10.5    2     2/10    21.0/10    220.50/10
12.0    1     1/10    12.0/10    144.00/10
13.5    1     1/10    13.5/10    182.25/10
Total   10    1       90/10      877.5/10

$E(\bar{x}) = \sum \bar{x} f(\bar{x}) = \frac{90}{10} = 9$

$\mathrm{Var}(\bar{x}) = \sum \bar{x}^2 f(\bar{x}) - \left[\sum \bar{x} f(\bar{x})\right]^2 = \frac{877.5}{10} - \left(\frac{90}{10}\right)^2 = 87.75 - 81 = 6.75$

The mean and variance of the population are:

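X      3    6    9    12    15    ∑X = 45
X²     9    36   81   144   225   ∑X² = 495

$\mu = \frac{\sum X}{N} = \frac{45}{5} = 9$ and $\sigma^2 = \frac{\sum X^2}{N} - \left(\frac{\sum X}{N}\right)^2 = \frac{495}{5} - \left(\frac{45}{5}\right)^2 = 99 - 81 = 18$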

Verification:

(i) $E(\bar{x}) = \mu = 9$
(ii) $\mathrm{Var}(\bar{x}) = \frac{\sigma^2}{n}\left(\frac{N-n}{N-1}\right) = \frac{18}{2}\left(\frac{5-2}{5-1}\right) = 6.75$
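
The whole verification can be repeated by brute force: enumerate every sample of size 2, compute its mean, and compare the mean and variance of those sample means with the formulas. This is a sketch using the standard `itertools` and `statistics` modules; the variable names are illustrative.

```python
import math
from itertools import combinations
from statistics import mean, pvariance

population = [3, 6, 9, 12, 15]
N = len(population)
n = 2

# All samples of size n drawn without replacement (order ignored).
samples = list(combinations(population, n))
sample_means = [sum(s) / n for s in samples]

expected_mean = mean(sample_means)           # E(x-bar)
variance_of_means = pvariance(sample_means)  # Var(x-bar) over all possible samples

mu = mean(population)
sigma_squared = pvariance(population)
formula_variance = sigma_squared / n * (N - n) / (N - 1)

print(len(samples))  # 10
print(math.isclose(expected_mean, mu))                     # True
print(math.isclose(variance_of_means, formula_variance))   # True
```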
