Вы находитесь на странице: 1из 15

Definitions

Statistics
Collection of methods for planning experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing, interpreting, and drawing
conclusions.
Variable
Characteristic or attribute that can assume different values
Random Variable
A variable whose values are determined by chance.
Population
All subjects possessing a common characteristic that is being studied.
Sample
A subgroup or subset of the population.
Parameter
Characteristic or measure obtained from a population.
Statistic (not to be confused with Statistics)
Characteristic or measure obtained from a sample.
Descriptive Statistics
Collection, organization, summarization, and presentation of data.
Inferential Statistics
Generalizing from samples to populations using probabilities. Performing
hypothesis testing, determining relationships between variables, and making
predictions.
Qualitative Variables
Variables which assume non-numerical values.
Quantitative Variables
Variables which assume numerical values.
Discrete Variables
Variables which assume a finite or countable number of possible values.
Usually obtained by counting.
Continuous Variables
Variables which assume an infinite number of possible values. Usually obtained
by measurement.
Nominal Level
Level of measurement which classifies data into mutually exclusive, all
inclusive categories in which no order or ranking can be imposed on the data.
Ordinal Level
Level of measurement which classifies data into categories that can be ranked.
Differences between the ranks do not exist.
Interval Level
Level of measurement which classifies data that can be ranked and differences
are meaningful. However, there is no meaningful zero, so ratios are
meaningless.
Ratio Level
Level of measurement which classifies data that can be ranked, differences are
meaningful, and there is a true zero. True ratios exist between the different units
of measure.
Random Sampling

Sampling in which the data is collected using chance methods or random


numbers.
Systematic Sampling
Sampling in which data is obtained by selecting every kth object.
Convenience Sampling
Sampling in which data is which is readily available is used.
Stratified Sampling
Sampling in which the population is divided into groups (called strata)
according to some characteristic. Each of these strata is then sampled using one
of the other sampling techniques.
Cluster Sampling
Sampling in which the population is divided into groups (usually
geographically). Some of these groups are randomly selected, and then all of
the elements in those groups are selected.

Population vs Sample
The population includes all objects of interest whereas the sample is only a portion of
the population. Parameters are associated with populations and statistics with samples.
Parameters are usually denoted using Greek letters (mu, sigma) while statistics are
usually denoted using Roman letters (x, s).
There are several reasons why we don't work with populations. They are usually large,
and it is often impossible to get data for every object we're studying. Sampling does
not usually occur without cost, and the more items surveyed, the larger the cost.
We compute statistics, and use them to estimate parameters. The computation is the
first part of the statistics course (Descriptive Statistics) and the estimation is the
second part (Inferential Statistics)

Discrete vs Continuous
Discrete variables are usually obtained by counting. There are a finite or countable
number of choices available with discrete data. You can't have 2.63 people in the
room.
Continuous variables are usually obtained by measuring. Length, weight, and time are
all examples of continous variables. Since continuous variables are real numbers, we
usually round them. This implies a boundary depending on the number of decimal
places. For example: 64 is really anything 63.5 <= x < 64.5. Likewise, if there are two
decimal places, then 64.03 is really anything 63.025 <= x < 63.035. Boundaries
always have one more decimal place than the data and end in a 5.

Levels of Measurement
There are four levels of measurement: Nominal, Ordinal, Interval, and Ratio. These go
from lowest level to highest level. Data is classified according to the highest level
which it fits. Each additional level adds something the previous level didn't have.
Nominal is the lowest level. Only names are meaningful here.
Ordinal adds an order to the names.
Interval adds meaningful differences
Ratio adds a zero so that ratios are meaningful.

Types of Sampling
There are five types of sampling: Random, Systematic, Convenience, Cluster, and
Stratified.
Random sampling is analogous to putting everyone's name into a hat and
drawing out several names. Each element in the population has an equal chance
of occuring. While this is the preferred way of sampling, it is often difficult to
do. It requires that a complete list of every element in the population be
obtained. Computer generated lists are often used with random sampling. You
can generate random numbers using the TI82 calculator.
Systematic sampling is easier to do than random sampling. In systematic
sampling, the list of elements is "counted off". That is, every kth element is
taken. This is similar to lining everyone up and numbering off "1,2,3,4; 1,2,3,4;
etc". When done numbering, all people numbered 4 would be used.
Convenience sampling is very easy to do, but it's probably the worst technique
to use. In convenience sampling, readily available data is used. That is, the first
people the surveyor runs into.
Cluster sampling is accomplished by dividing the population into groups -usually geographically. These groups are called clusters or blocks. The clusters
are randomly selected, and each element in the selected clusters are used.
Stratified sampling also divides the population into groups called strata.
However, this time it is by some characteristic, not geographically. For
instance, the population might be separated into males and females. A sample is
taken from each of these strata using either random, systematic, or convenience
sampling.

Sampling Lab

The purpose of this laboratory exercise is to familiarize yourself with the


different sampling techniques.
You need one page from a movie listing (like contained in TV-Guide). Note, if
you actually use TV Guide, then you need to use two facing pages. Pick a
page with little extraneous material, other than the listings, on it.
For the purposes of this sampling project, a movie is included on the page or in
a cluster if the running time for the movie falls on the page.
Random Sampling
Number each movie on the page. If there are a lot of movies, you may wish to
number every other or every third movie.
Generate a random sample on 8 numbers between 1 and the number of movies
on the page. Write down the # generated and the running time for the movie
corresponding to that number.

Systematic Sampling
Generate a random number between 1 and 6. Beginning with the movie
corresponding to that number, and then taking every 6th movie thereafter, write
the # of the movie and the running length of the movie.
Convenience Sampling
Write down the running time of the first eight movies.
Stratified Sampling
On a separate piece of paper, write down the running times of all PG/PG13, R,
and not-rated (either NR or no rating given) movies in three columns -- ignore
all other types (NC17, G, etc). Split a sample of 8 proportionally to each type of
movie (if R is 40%, then sample 40% of 8 = 3.2 -> 3 R movies). Use random
sampling within each movie type. Record the running lengths of the movies
selected.
Cluster Sampling
Divide the page into equal regions so that each region has roughly 3 - 4 movies
in each cluster. Randomly select 3 clusters, and record the running length of all
movies in those clusters.

Statistics: Frequency Distributions & Graphs


Definitions
Raw Data
Data collected in original form.
Frequency
The number of times a certain value or class of values occurs.
Frequency Distribution
The organization of raw data in table form with classes and frequencies.
Categorical Frequency Distribution
A frequency distribution in which the data is only nominal or ordinal.
Ungrouped Frequency Distribution
A frequency distribution of numerical data. The raw data is not grouped.
Grouped Frequency Distribution
A frequency distribution where several numbers are grouped into one class.
Class Limits
Separate one class in a grouped frequency distribution from another. The limits
could actually appear in the data and have gaps between the upper limit of one
class and the lower limit of the next.
Class Boundaries
Separate one class in a grouped frequency distribution from another. The
boundaries have one more decimal place than the raw data and therefore do not
appear in the data. There is no gap between the upper boundary of one class
and the lower boundary of the next class. The lower class boundary is found by
subtracting 0.5 units from the lower class limit and the upper class boundary is
found by adding 0.5 units to the upper class limit.
Class Width

The difference between the upper and lower boundaries of any class. The class
width is also the difference between the lower limits of two consecutive classes
or the upper limits of two consecutive classes. It is not the difference between
the upper and lower limits of the same class.
Class Mark (Midpoint)
The number in the middle of the class. It is found by adding the upper and
lower limits and dividing by two. It can also be found by adding the upper and
lower boundaries and dividing by two.
Cumulative Frequency
The number of values less than the upper class boundary for the current class.
This is a running total of the frequencies.
Relative Frequency
The frequency divided by the total frequency. This gives the percent of values
falling in that class.
Cumulative Relative Frequency (Relative Cumulative Frequency)
The running total of the relative frequencies or the cumulative frequency
divided by the total frequency. Gives the percent of the values which are less
than the upper class boundary.
Histogram
A graph which displays the data by using vertical bars of various heights to
represent frequencies. The horizontal axis can be either the class boundaries,
the class marks, or the class limits.
Frequency Polygon
A line graph. The frequency is placed along the vertical axis and the class
midpoints are placed along the horizontal axis. These points are connected with
lines.
Ogive
A frequency polygon of the cumulative frequency or the relative cumulative
frequency. The vertical axis the cumulative frequency or relative cumulative
frequency. The horizontal axis is the class boundaries. The graph always starts
at zero at the lowest class boundary and will end up at the total frequency (for a
cumulative frequency) or 1.00 (for a relative cumulative frequency).
Pareto Chart
A bar graph for qualitative data with the bars arranged according to frequency.
Pie Chart
Graphical depiction of data as slices of a pie. The frequency determines the size
of the slice. The number of degrees in any slice is the relative frequency times
360 degrees.
Pictograph
A graph that uses pictures to represent data.
Stem and Leaf Plot
A data plot which uses part of the data value as the stem and the rest of the data
value (the leaf) to form groups or classes. This is very useful for sorting data
quickly.

Statistics: Data Description


Definitions
Statistic
Characteristic or measure obtained from a sample
Parameter
Characteristic or measure obtained from a population
Mean
Sum of all the values divided by the number of values. This can either be a
population mean (denoted by mu) or a sample mean (denoted by x bar)
Median
The midpoint of the data after being ranked (sorted in ascending order). There
are as many numbers below the median as above the median.
Mode
The most frequent number
Skewed Distribution
The majority of the values lie together on one side with a very few values (the
tail) to the other side. In a positively skewed distribution, the tail is to the right
and the mean is larger than the median. In a negatively skewed distribution, the
tail is to the left and the mean is smaller than the median.
Symmetric Distribution
The data values are evenly distributed on both sides of the mean. In a
symmetric distribution, the mean is the median.
Weighted Mean
The mean when each value is multiplied by its weight and summed. This sum is
divided by the total of the weights.
Midrange
The mean of the highest and lowest values. (Max + Min) / 2
Range
The difference between the highest and lowest values. Max - Min
Population Variance
The average of the squares of the distances from the population mean. It is the
sum of the squares of the deviations from the mean divided by the population
size. The units on the variance are the units of the population squared.
Sample Variance
Unbiased estimator of a population variance. Instead of dividing by the
population size, the sum of the squares of the deviations from the sample mean

is divided by one less than the sample size. The units on the variance are the
units of the population squared.
Standard Deviation
The square root of the variance. The population standard deviation is the square
root of the population variance and the sample standard deviation is the square
root of the sample variance. The sample standard deviation is not the unbiased
estimator for the population standard deviation. The units on the standard
deviation is the same as the units of the population/sample.
Coefficient of Variation
Standard deviation divided by the mean, expressed as a percentage. We won't
work with the Coefficient of Variation in this course.
Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean
is at least
where k > 1. Chebyshev's theorem can be applied to any

distribution regardless of its shape.


Empirical or Normal Rule
Only valid when a distribution in bell-shaped (normal). Approximately 68%
lies within 1 standard deviation of the mean; 95% within 2 standard deviations;
and 99.7% within 3 standard deviations of the mean.
Standard Score or Z-Score
The value obtained by subtracting the mean and dividing by the standard
deviation. When all values are transformed to their standard scores, the new
mean (for Z) will be zero and the standard deviation will be one.
Percentile
The percent of the population which lies below that value. The data must be
ranked to find percentiles.
Quartile
Either the 25th, 50th, or 75th percentiles. The 50th percentile is also called the
median.
Decile
Either the 10th, 20th, 30th, 40th, 50th, 60th, 70th, 80th, or 90th percentiles.
Lower Hinge
The median of the lower half of the numbers (up to and including the median).
The lower hinge is the first Quartile unless the remainder when dividing the
sample size by four is 3.
Upper Hinge
The median of the upper half of the numbers (including the median). The upper
hinge is the 3rd Quartile unless the remainder when dividing the sample size by
four is 3.
Box and Whiskers Plot (Box Plot)
A graphical representation of the minimum value, lower hinge, median, upper
hinge, and maximum. Some textbooks, and the TI-82 calculator, define the five
values as the minimum, first Quartile, median, third Quartile, and maximum.
Five Number Summary
Minimum value, lower hinge, median, upper hinge, and maximum.
InterQuartile Range (IQR)
The difference between the 3rd and 1st Quartiles.
Outlier

An extremely high or low value when compared to the rest of the values.
Mild Outliers
Values which lie between 1.5 and 3.0 times the InterQuartile Range below the
1st Quartile or above the 3rd Quartile. Note, some texts use hinges instead of
Quartiles.
Extreme Outliers
Values which lie more than 3.0 times the InterQuartile Range below the 1st
Quartile or above the 3rd Quartile. Note, some texts use hinges instead of
Quartiles.
Averages

In statistics, an average is defined as the number that measures the central tendency of a given set of
numbers. There are a number of different averages including but not limited to: mean, median, mode
and range.

Mean
Mean is what most people commonly refer to as an average. The mean refers to the number you
obtain when you sum up a given set of numbers and then divide this sum by the total number in the
set. Mean is also referred to more correctly as arithmetic mean.

Given a set of n elements from a1 to an

The mean is found by adding up all the a's and then dividing by the total number, n

This can be generalized by the formula below:

Mean Example Problems


Example 1

Find the mean of the set of numbers below

Solution

The first step is to count how many numbers there are in the set, which we shall call n

The next step is to add up all the numbers in the set

The last step is to find the actual mean by dividing the sum by n

Mean can also be found for grouped data, but before we see an example on that, let us first define
frequency.

Frequency in statistics means the same as in everyday use of the word. The frequency an element in a
set refers to how many of that element there are in the set. The frequency can be from 0 to as many
as possible. If you're told that the frequency an element a is 3, that means that there are 3 as in the
set.

Example 2

Find the mean of the set of ages in the table below

Age (years)

Frequency

10

11

12

13

14

Solution

The first step is to find the total number of ages, which we shall call n. Since it will be tedious to count
all the ages, we can find nby adding up the frequencies:

Next we need to find the sum of all the ages. We can do this in two ways: we can add up each
individual age, which will be a long and tedious process; or we can use the frequency to make things
faster.

Since we know that the frequency represents how many of that particular age there are, we can just
multiply each age by its frequency, and then add up all these products.

The last step is to find the mean by dividing the sum by n

Population Mean vs Sample Mean


In the Introduction to Statistics section, we defined a population and a sample whereby a sample is a
part of a population.

In statistics there are two kinds of means: population mean and sample mean. A population mean is
the true mean of the entire population of the data set while a sample mean is the mean of a small
sample of the population. These different means appear frequently in both statistics and probability
and should not be confused with each other.

Population mean is represented by the Greek letter (pronounced mu) while sample mean is
represented by xx (pronounced x bar). The total number of elements in a population is represented
by N while the number of elements in a sample is represented by n. This leads to an adjustment in the
formula we gave above for calculating the mean.

The sample mean is commonly used to estimate the population mean when the population mean is
unknown. This is because they have the same expected value.

Median
The median is defined as the number in the middle of a given set of numbers arranged in order of
increasing magnitude. When given a set of numbers, the median is the number positioned in the exact
middle of the list when you arrange the numbers from the lowest to the highest. The median is also a
measure of average. In higher level statistics, median is used as a measure of dispersion. The median
is important because it describes the behavior of the entire set of numbers.

Example 3

Find the median in the set of numbers given below

Solution

From the definition of median, we should be able to tell that the first step is to rearrange the given set
of numbers in order of increasing magnitude, i.e. from the lowest to the highest

Then we inspect the set to find that number which lies in the exact middle.

Lets try another example to emphasize something interesting that often occurs when solving for the
median.

Example 4

Find the median of the given data

Solution

As in the previous example, we start off by rearranging the data in order from the smallest to the
largest.

Next we inspect the data to find the number that lies in the exact middle.

We can see from the above that we end up with two numbers (4 and 5) in the middle. We can solve for
the median by finding the mean of these two numbers as follows:

Mode
The mode is defined as the element that appears most frequently in a given set of elements. Using the
definition of frequency given above, mode can also be defined as the element with the largest
frequency in a given data set.

For a given data set, there can be more than one mode. As long as those elements all have the same
frequency and that frequency is the highest, they are all the modal elements of the data set.

Example 5

Find the Mode of the following data set.

Solution

Mode = 3 and 15

Mode for Grouped Data


As we saw in the section on data, grouped data is divided into classes. We have defined mode as the
element which has the highest frequency in a given data set. In grouped data, we can find two kinds of
mode: the Modal Class, or class with the highest frequency and the mode itself, which we calculate
from the modal class using the formula below.

where

L is the lower class limit of the modal class

f1 is the frequency of the modal class

f0 is the frequency of the class before the modal class in the frequency table

f2 is the frequency of the class after the modal class in the frequency table

h is the class interval of the modal class

Example 6

Find the modal class and the actual mode of the data set below

Number

Frequency

1-3

4-6

7-9

10 - 12

13 - 15

16 - 18

19 - 21

22 - 24

25 - 27

28 - 30

Solution

Modal class = 10 - 12

where

L = 10

f1 = 9

f0 = 4

f2 = 2

h=3

therefore,

Solving the above using the order of operations:

Range
The range is defined as the difference between the highest and lowest number in a given data set.

Example 7

Find the range of the data set below

Solution

Вам также может понравиться