
L04: Basic Statistical Descriptions of Data

Terminologies In Statistics – Statistics For Data Science

• A Population is the set of sources from which data is to be collected.
• A Sample is a subset of the Population.
• A Variable is any characteristic, number, or quantity that can be measured or counted. A variable may also be called a data item.

What are the measures of central tendency?


A measure of central tendency (also referred to as a measure of centre or central location) is a summary
measure that attempts to describe a whole set of data with a single value that represents the middle or
centre of its distribution.

There are three main measures of central tendency: the mode, the median and the mean. Each of these
measures gives a different indication of the typical or central value in the distribution.
Mean: the arithmetic average of all data points.
Median: the middle value when the data are listed from smallest to largest, so that half of the data lies above it and half below.
Mode: the most frequent value in the data.

Measuring the Central Tendency: Mean, Median, and Mode

# Mean
The most common and effective numeric measure of the “center” of a set of data is the
(arithmetic) mean. Let x1, x2, ..., xN be a set of N values or observations, such as for some
numeric attribute X, like salary. The mean of this set of values is

$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N} \qquad (2.1)$$

Example: The mean of 4, 1, and 7 is (4 + 1 + 7)/3 = 12/3 = 4.
Example (mean). Suppose we have the following values for salary (in thousands of
dollars), shown in increasing order: 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
Using Eq. (2.1), we have

$$\bar{x} = \frac{30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110}{12} = \frac{696}{12} = 58$$

Thus, the mean salary is $58,000.
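
The salary calculation can be checked with a few lines of Python. This is a minimal sketch: the variable names are illustrative, and `statistics.mean` comes from the Python standard library.

```python
from statistics import mean

# Salary values (in thousands of dollars) from the example above.
salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

# Arithmetic mean: sum of the observations divided by their count.
manual_mean = sum(salaries) / len(salaries)

# The standard library should give the same result.
library_mean = mean(salaries)

print(manual_mean, library_mean)
```

Both values correspond to a mean salary of $58,000.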

The mean is the sum of the value of each observation in a dataset divided by the number of
observations. This is also known as the arithmetic average.
Looking at the retirement age distribution again:

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

The mean is calculated by adding together all the values
(54 + 54 + 54 + 55 + 56 + 57 + 57 + 58 + 58 + 60 + 60 = 623) and dividing by the number of observations (11),
which equals 56.6 years.

Advantage of the mean:

The mean can be used for both continuous and discrete numeric data.

Limitations of the mean:

The mean cannot be calculated for categorical data, as the values cannot be summed.
Because the mean includes every value in the distribution, it is influenced by outliers and
skewed distributions.
What else do I need to know about the mean?
The population mean is indicated by the Greek symbol µ (pronounced ‘mu’). When the mean is
calculated on a distribution from a sample it is indicated by the symbol x̅ (pronounced X-bar).

Weighted arithmetic mean:

Sometimes, each value xi in a set may be associated with a weight wi for i = 1, …, N. The weights reflect the
significance, importance, or occurrence frequency attached to their respective values. In this case, we
can compute

$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$

This is called the weighted arithmetic mean or the weighted average.
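
A weighted mean can be sketched as follows. The function name `weighted_mean` is an assumption made for illustration. The weights are taken to be occurrence frequencies, using the retirement-age data from these notes (54 occurs three times, 55 once, and so on).

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i) for paired values and weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

# Distinct retirement ages and how often each one occurs.
ages = [54, 55, 56, 57, 58, 60]
counts = [3, 1, 1, 2, 2, 2]

print(round(weighted_mean(ages, counts), 1))
```

With frequencies as weights, the weighted mean equals the ordinary mean of the full list of 11 ages (623 / 11).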


Although the mean is the single most useful quantity for describing a data set, it is not
always the best way of measuring the center of the data. A major problem with the
mean is its sensitivity to extreme (e.g., outlier) values. Even a small number of extreme
values can corrupt the mean. For example, the mean salary at a company may be
substantially pushed up by that of a few highly paid managers. Similarly, the mean
score of a class in an exam could be pulled down quite a bit by a few very low scores.

To offset the effect of a few extreme values, we can instead use the trimmed mean, which is the mean obtained after
chopping off values at the high and low extremes. For example, we can sort the values observed for salary and remove
the top and bottom 2% before computing the mean. We should avoid trimming too large a portion (such as 20%) at both
ends, as this can result in the loss of valuable information.
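
One way to sketch a trimmed mean in Python is shown below. The function name `trimmed_mean` and the rule of dropping floor(proportion × n) values from each end are assumptions; other conventions exist. With only 12 salary values, a 2% trim removes nothing, so the example also uses a 10% trim purely for illustration.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping floor(proportion * n) values from each end."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # number of values cut from each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

salaries = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

print(trimmed_mean(salaries, 0.02))  # too few values for 2% to remove any
print(trimmed_mean(salaries, 0.10))  # drops the single lowest and highest value
```

Dropping 30 and 110 lowers the mean from 58 to 55.6, which shows how much the single high salary pulls the untrimmed mean upward.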

# Median
The median is the middle value in a distribution when the values are arranged in ascending or
descending order.

The median divides the distribution in half (there are 50% of observations on either side of
the median value). In a distribution with an odd number of observations, the median value
is the middle value.
Example 1
Find the median of this data:
1, 4, 2, 5, 0
Put the data in order first:
0, 1, 2, 4, 5
There is an odd number of data points, so the median is the middle data point.
The median is 2.
Example 2
Find the median of this data:
10, 40, 20, 50
Put the data in order first:
10, 20, 40, 50

There is an even number of data points, so the median is the average of the middle two data
points.
Median = (20 + 40)/2 = 60/2 = 30
The median is 30.

Looking at the retirement age distribution (which has 11 observations), the median is the middle
value, which is 57 years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

When the distribution has an even number of observations, the median value is the mean of the
two middle values. In the following distribution, the two middle values are 56 and 57, therefore
the median equals 56.5 years:

52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
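
The odd and even cases can be sketched in Python as follows. `median_manual` is an illustrative name; `statistics.median` from the standard library is used as a cross-check.

```python
from statistics import median

def median_manual(values):
    """Middle value of the sorted data, or the mean of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                      # odd count: single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even count: average the two

ages_odd = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]
ages_even = [52, 54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(median_manual(ages_odd), median(ages_odd))
print(median_manual(ages_even), median(ages_even))
```

Sorting first matters: the function gives the right answer even when the input is not already in order.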

Advantage of the median:


The median is less affected by outliers and skewed data than the mean, and is usually the
preferred measure of central tendency when the distribution is not symmetrical.

Limitation of the median:

The median cannot be identified for categorical nominal data, as it cannot be logically ordered.

*If two elements are in the middle of a set of data, find the median by adding the two
numbers together and dividing by two.*

# Mode

The mode is the element (number) in the set of data that has the highest frequency.

Example:

Set of data: 71, 75, 60, 84, 71, 73, 66

First rank the data in order:
60, 66, 71, 71, 73, 75, 84
71 appears twice, more often than any other value.
71 is the mode.
Midrange: the average of the highest and lowest values in a ranked set of data.

Example:
Set of data: 1, 1, 1, 1, 4, 4, 4, 4, 6, 8, 10, 12, 15, 21, 21.

The highest value in the set is 21 and the lowest value is 1, so the midrange is
(21 + 1)/2 = 22/2 = 11.
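
The midrange calculation is short enough to sketch directly. The function name `midrange` is illustrative.

```python
def midrange(values):
    """Average of the smallest and largest values in the data."""
    return (min(values) + max(values)) / 2

data = [1, 1, 1, 1, 4, 4, 4, 4, 6, 8, 10, 12, 15, 21, 21]

print(midrange(data))  # 11.0
```

Because it uses only the two extreme values, the midrange is even more sensitive to outliers than the mean.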
The mode is the most commonly occurring value in a distribution.
Consider this dataset showing the retirement age of 11 people, in whole years:
54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60
This table shows a simple frequency distribution of the retirement age data.
Age Frequency
54 3
55 1
56 1
57 2
58 2
60 2

The most commonly occurring value is 54, therefore the mode of this distribution is 54 years.
Advantage of the mode:

The mode has an advantage over the median and the mean as it can be found for both numerical
and categorical (non-numerical) data.

Limitations of the mode:

There are some limitations to using the mode. In some distributions, the mode may not reflect the
centre of the distribution very well. When the distribution of retirement age is ordered from
lowest to highest value, it is easy to see that the centre of the distribution is 57 years, but the
mode is lower, at 54 years.

54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60

It is also possible for there to be more than one mode for the same distribution of data (bimodal
or multimodal). The presence of more than one mode can limit the ability of the mode to
describe the centre or typical value of the distribution, because a single value to describe the
centre cannot be identified.

In some cases, particularly where the data are continuous, the distribution may have no mode at
all (i.e. if all values are different).

In cases such as these, it may be better to use the median or mean, or to group the data
into appropriate intervals and find the modal class.

The mode is another measure of central tendency. The mode for a set of data is the
value that occurs most frequently in the set. Therefore, it can be determined for qualitative
and quantitative attributes. It is possible for the greatest frequency to correspond to
several different values, which results in more than one mode. Data sets with one, two,
or three modes are respectively called unimodal, bimodal, and trimodal. In general, a
data set with two or more modes is multimodal. At the other extreme, if each data
value occurs only once, then there is no mode.
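
The cases described above (one mode, several modes, no mode) can be sketched with a frequency count. The function name `modes` is an assumption. It follows the convention stated in these notes that a data set in which every value occurs once has no mode, which differs from `statistics.multimode`, which would return every value in that case.

```python
from collections import Counter

def modes(values):
    """Return all most-frequent values, or [] if every value occurs once."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # each value occurs only once: no mode
    return [v for v, c in counts.items() if c == top]

retirement_ages = [54, 54, 54, 55, 56, 57, 57, 58, 58, 60, 60]

print(modes(retirement_ages))        # unimodal
print(modes([1, 1, 2, 2, 3]))        # bimodal
print(modes([1, 2, 3]))              # no mode
print(modes(["red", "blue", "red"]))  # works for categorical data too
```

Returning a list keeps the unimodal, multimodal, and no-mode cases in a single return type.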
# Variance:

The variance measures how far a data set is spread out. It is defined as the average of the
squared differences from the mean: take the difference between each number in the set and the
mean, square each difference, and average the results.

The deviation of an individual data value from the mean is xi − µ.



Example
You and your friends have just measured the heights of your dogs (in millimeters):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.
Your first step is to find the mean:
Answer:
Mean = (600 + 470 + 170 + 430 + 300) / 5
= 1970 / 5
= 394
So the mean (average) height is 394 mm.
Now we calculate each dog's difference from the mean: 206, 76, −224, 36 and −94.

To calculate the variance, take each difference, square it, and then average the result:

Variance

σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5

(each difference is a height minus the mean: 600 − 394, 470 − 394, ...)

= (42436 + 5776 + 50176 + 1296 + 8836) / 5

= 108520 / 5

= 21704

So the variance is 21,704.
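
The same steps can be written out in Python. This is a sketch of the population variance (dividing by the number of values, as in the example), cross-checked against `statistics.pvariance`. Note that `statistics.variance` computes the sample variance, dividing by n − 1, and would give a different number.

```python
from statistics import pvariance

heights = [600, 470, 170, 430, 300]  # dog heights in mm

mu = sum(heights) / len(heights)          # mean height
deviations = [h - mu for h in heights]    # difference of each dog from the mean
squared = [d ** 2 for d in deviations]    # squared differences
variance = sum(squared) / len(heights)    # average of the squared differences

print(mu)
print(deviations)
print(variance, pvariance(heights))
```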

# Standard Deviation:
The standard deviation is a measure of how spread out numbers are: it quantifies the amount of
variation or dispersion of a set of values.
Its symbol is σ (the Greek letter sigma).

A low standard deviation indicates that the values tend to be close to the mean (also called
the expected value) of the set, while a high standard deviation indicates that the values are
spread out over a wider range.

The formula is simple: the standard deviation is the square root of the variance.

For the dog heights, the variance is 21,704, so:
Standard Deviation

σ = √21704
= 147.32...
= 147 (to the nearest mm)
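
A short Python check of this square-root step, assuming the population form of the standard deviation (`statistics.pstdev` in the standard library):

```python
import math
from statistics import pstdev

heights = [600, 470, 170, 430, 300]  # dog heights in mm

sigma_from_variance = math.sqrt(21704)  # square root of the variance found above
sigma_from_data = pstdev(heights)       # computed directly from the heights

print(round(sigma_from_variance, 2), round(sigma_from_data, 2))
```

Unlike the variance, the standard deviation is in the same unit as the data (mm), which makes it easier to interpret.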

The standard deviation tells you how spread out the examples in a set are from the mean.
Why is this useful? Here's an example: if you are comparing test scores for different schools, the
standard deviation will tell you how diverse the test scores are for each school.

A standard deviation is a number that tells us to what extent a set of numbers lie apart.

Consider the following three data sets A, B and C.


A = {9,10,11,7,13}
B = {10,10,10,10,10}
C = {1,1,10,19,19}
a) Calculate the mean of each data set.
b) Calculate the standard deviation of each data set.
c) Which set has the largest standard deviation?
d) Is it possible to answer question c) without calculations of the standard deviation?
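
Answers to (a) to (c) can be checked with a short script. This sketch assumes the population standard deviation is intended and uses `statistics.pstdev`; the dictionary layout is only for illustration.

```python
from statistics import mean, pstdev

datasets = {
    "A": [9, 10, 11, 7, 13],
    "B": [10, 10, 10, 10, 10],
    "C": [1, 1, 10, 19, 19],
}

# Mean and population standard deviation of each data set.
summary = {name: (mean(xs), pstdev(xs)) for name, xs in datasets.items()}

for name, (m, s) in summary.items():
    print(name, m, round(s, 2))

# The set whose values lie furthest from their mean has the largest spread.
widest = max(summary, key=lambda name: summary[name][1])
print(widest)
```

For (d), note that all three sets share the same mean, so the spread can be compared by looking at how far the values sit from 10.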
Find the standard deviation for the following data series:
12, 6, 7, 3, 15, 10, 18, 5.
The population standard deviation formula is:

$$\sigma = \sqrt{\frac{\sum (x - \mu)^2}{n}}$$

where
σ = population standard deviation
Σ = sum of...
µ = population mean (pronounced 'mu')
n = number of scores.
Find the standard deviation of 4, 9, 11, 12, 17, 5, 8, 12, 14
First work out the mean: 10.222
Now, subtract the mean individually from each of the numbers given and square the result. This
is equivalent to the (x − µ)² step. x refers to the values given in the question.

x          4     9     11    12    17    5     8     12    14
(x − µ)²   38.7  1.49  0.60  3.16  45.9  27.3  4.94  3.16  14.3

Now add up these results (this is the Σ, 'sigma', in the formula): 139.55
Divide by n. n is the number of values, so in this case it is 9. This gives us: 15.51
And finally, take the square root of this: 3.94
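
The steps of this worked example map directly onto a few lines of Python. This is a sketch with illustrative variable names. It keeps full precision until the end, so the sum of squared differences prints as 139.56 rather than 139.55, which comes from adding the already-rounded table entries. The final results agree to two decimal places.

```python
import math

data = [4, 9, 11, 12, 17, 5, 8, 12, 14]
n = len(data)

mu = sum(data) / n                            # step 1: the mean
squared_devs = [(x - mu) ** 2 for x in data]  # step 2: (x - mu)^2 for each value
total = sum(squared_devs)                     # step 3: add them up
variance = total / n                          # step 4: divide by n
sigma = math.sqrt(variance)                   # step 5: take the square root

print(round(mu, 3), round(total, 2), round(variance, 2), round(sigma, 2))
```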
