
Introduction to Statistics
CHAPTER 1
BKU2032

CONTENT
1.1 Overview
1.2 Statistical Problem-Solving Methodology
1.3 Review on Descriptive Statistics
1.3.1 Measures of Central Tendency
1.3.2 Measures of Variation
1.3.3 Concept of Variance
1.3.3.1 Chebyshev’s Theorem
1.3.3.2 Game of Darts

OBJECTIVES
By the end of this chapter, you should be able to:

• Define statistics, population, sample, parameter, statistic, descriptive statistics and inferential statistics.
• Understand and explain why a knowledge of statistics is needed.
• Outline the 6 basic steps in the statistical problem-solving methodology.
• Identify various methods of obtaining samples.
• Discuss the role of computers and data analysis software in statistical work.
• Summarize data using measures of central tendency, such as the mean, median and mode.
• Describe data using measures of variation, such as the range, variance and standard deviation.
• Explain the accuracy and precision of data using the game of darts.

1.1 OVERVIEW
• Define statistics, population, sample, parameter, statistic, descriptive statistics and inferential statistics.

• Understand and explain why a knowledge of statistics is needed.

What is Statistics?
Most people become familiar with probability and statistics through
radio, television, newspapers, and magazines. For example, the
following statements were found in newspapers:
• Tens of thousands of parents in Malaysia have chosen StemLife as their trusted
stem cell bank.
• The death rate from lung cancer was 10 times as high for smokers as for
nonsmokers.
• The average cost of a wedding is nearly RM10,000.
• In the USA, the average salary for men with a bachelor’s degree is $49,982, while
the average salary for women with a bachelor’s degree is $35,408.
• Globally, an estimated 500,000 children under the age of 15 live with Type 1
diabetes.
• Women who eat fish once a week are 29% less likely to develop heart disease.

Statistics
The science of conducting studies to collect, organize, summarize, analyze and draw conclusions from data.

Data
Any values (observations or measurements) that have been collected.

The basic idea behind all statistical methods of data analysis is to make inferences about a population by studying a small sample chosen from it.

Population
The complete collection of measurements, outcomes, objects or individuals under study.
• Tangible: always finite; the total number of members is finite and consists of actual physical objects. After the population is sampled, the population size decreases by 1.
• Conceptual: a population that consists of all the values that might possibly have been observed; it does not consist of actual objects.

Parameter
A number that describes a population characteristic.

Sample
A subset of a population, containing the objects or outcomes that are actually observed.

Statistic
A number that describes a sample characteristic.

EXERCISE 1.1
1. The freshman class at Engineering College has 317 students and
an IQ pre-test is given to all of them in their first week. The dean
of admission collected data on 27 of them and found their mean
score on the IQ pre-test was 51. The mean for the entire
freshman class was therefore estimated to be approximately 51 on
this test. A subsequent computer analysis of all freshmen showed
the true mean to be 52.
Based on the above problem,

a) What is the population?


b) Is the population tangible or conceptual?
c) What is the sample?
d) What number is a parameter?
e) What number is a statistic?

Descriptive & Inferential Statistics

Descriptive statistics
• Consists of the collection, organization, classification, summarization, and presentation of data obtained from the sample.
• Used to describe the characteristics of the sample.
• Used to determine whether the sample represents the target population by comparing the sample statistic with the population parameter.

Inferential statistics
• Consists of generalizing from samples to populations, performing estimations and hypothesis testing, determining relationships among variables, and making predictions.
• Used to describe, infer, estimate or approximate the characteristics of the target population.
• Used when we want to draw a conclusion from the data obtained from the sample.

EXERCISE 1.1
2. In each of these statements, tell whether descriptive or inferential
statistics have been used.
a) Tens of thousands of parents in Malaysia have chosen StemLife
as their trusted stem cell bank. (Descriptive)
b) The death rate from lung cancer was 10 times as high for smokers
as for nonsmokers. (Inferential)

3. Suppose UMP wants to estimate the average time a student takes to find a
proper parking spot. During semester 1 2009/2010, an administrator randomly
asked 200 students and recorded their parking times and found that it takes on
average 10 minutes to find a parking spot.

(a) What is the population in this study?


(b) Describe the sample.
(c) What is the descriptive statement?
(d) How is the inference expressed?
An overview of descriptive statistics and statistical inference

START
1. Gathering of data
2. Classification, summarization, and processing of data
3. Presentation and communication of summarized information
4. Is the information from a sample?
   Yes (statistical inference): Use sample information to make inferences about the population, then draw conclusions about the population characteristic (parameter) under study.
   No (descriptive statistics): Use census data to analyze the population characteristic under study.
STOP

Need for Statistics
You need a knowledge of statistics to help you:

1. Describe and understand numerical relationships between variables
   • There are a lot of data in this world, so we need to identify the right variables.

2. Make better decisions
   • Statistical methods allow people to make better decisions in the face of uncertainty.

1.2: STATISTICAL PROBLEM-SOLVING METHODOLOGY
• Outline the 6 basic steps in the statistical problem-solving methodology.

• Identify various methods of obtaining samples.

• Discuss the role of computers and data analysis software in statistical work.

STATISTICAL PROBLEM-SOLVING METHODOLOGY
6 Basic Steps
1. Identifying the problem or opportunity
2. Deciding on the method of data collection
3. Collecting the data
4. Classifying and summarizing the data
5. Presenting and analyzing the data
6. Making the decision

STEP 1
Identifying the problem or opportunity

• The objective/goal of the study must be clearly understood and correctly defined
• If not, time and effort are wasted
• Is the goal to study some population?
• Is it to impose some treatment on the group and then test the response?
• Can the study goal be achieved through simple counts or measurements of the group?
• Must an experiment be performed on the group?
• If samples are needed, how large should they be and how should they be taken? The larger the better (more than 30)

Characteristics of Sample Size
• The larger the sample, the smaller the magnitude of sampling errors.
• Survey studies need large samples because survey returns are voluntary.
• A large sample is easy to divide into subgroups.
• In mail surveys the response rate may be as low as 20%-30%, so a larger sample is required.
• Subject availability and cost factors are legitimate considerations in determining an appropriate sample size.

STEP 2
Deciding on the Method of Data Collection
• The data gathered must be accurate, as complete as possible, and relevant to the problem.

• Data can be obtained in 3 ways:

1. Data that are made available by others (internal, external, primary or secondary data)
2. Data resulting from an experiment (experimental study)
- the researcher manipulates one of the variables and tries to determine how the manipulation influences other variables.
- the subjects should be assigned to groups randomly.
- the treatments should be assigned to the groups at random.
3. Data collected in an observational study (observation, survey, questionnaire, interview)
- the researcher merely observes what is happening or what has happened in the past and tries to draw conclusions based on these observations.

STEP 3
Collecting the data
A. Nonprobability data
• A nonprobability sample is one in which the judgement of the experimenter, the method by which the data are collected, or other factors could affect the results of the sample.
• 3 basic methods: judgement samples, voluntary samples and convenience samples

B. Probability data
• A probability sample is one in which the chance of selection of each item in the population is known before the sample is picked.
• 4 basic methods: random, systematic, stratified, and cluster.

A) Nonprobability Data Samples
1. Judgment samples
• Based on the opinion of one or more experts
• Ex: A political campaign manager intuitively picks certain voting districts as reliable places to measure public opinion of his candidate

2. Voluntary samples
• Questions are posed to the public by broadcasting them over radio or TV (phone or SMS)

3. Convenience samples
• Take an ‘easy sample’ (the most conveniently available)
• Ex: A surveyor stands in one location and asks passers-by the survey questions

B) Probability Data Samples
1. Random samples
• Selected using chance or random methods
• Example:
A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list of all 5,000 students, numbers it from 1 to 5,000, randomly selects 100 of those numbers, and invites the corresponding students to participate in the study.
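
The random-sample example can be sketched in Python. This is only an illustration: the function name `simple_random_sample` and the seed value are assumptions made for the sketch, not part of the course material.

```python
import random

POPULATION_SIZE = 5_000   # students on the list, numbered 1..5000
SAMPLE_SIZE = 100         # students to invite


def simple_random_sample(population_size, sample_size, seed=None):
    """Return `sample_size` distinct student numbers chosen without replacement.

    Every student number from 1 to `population_size` has the same chance
    of being selected.
    """
    rng = random.Random(seed)               # seed only makes the draw repeatable
    student_numbers = range(1, population_size + 1)
    return sorted(rng.sample(student_numbers, sample_size))


selected = simple_random_sample(POPULATION_SIZE, SAMPLE_SIZE, seed=2032)
print(len(selected), "students selected; first five numbers:", selected[:5])
```

Sampling without replacement guarantees that no student is invited twice.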

B) Probability Data Samples
2. Systematic samples
• Each subject in the population is numbered and every kth subject is selected.
- The first subject is selected at random from 1 through k; every kth subject after it is then selected.

• Example:
A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. She obtains a list of all 5,000 students, numbers it from 1 to 5,000 and randomly picks one of the first 50 students (5000/100 = 50) on the list. If the number picked is 30, the 30th student on the list is invited first. She then invites every 50th student on the list after this random start (the 80th student, the 130th student, etc.) to produce a sample of 100 students to participate in the study.
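
A minimal Python sketch of the systematic procedure follows. The function name `systematic_sample` and its `start` parameter are illustrative assumptions; when `start` is omitted, the starting point is drawn at random from 1 through k.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Select every k-th numbered subject after a random start in 1..k."""
    k = population_size // sample_size        # sampling interval, 5000 // 100 = 50
    if start is None:
        start = random.Random(seed).randint(1, k)   # random start among the first k
    if not 1 <= start <= k:
        raise ValueError("start must lie between 1 and k")
    # start, start + k, start + 2k, ... up to the end of the list
    return list(range(start, population_size + 1, k))[:sample_size]


# Reproduce the example: the number picked is 30
invited = systematic_sample(5_000, 100, start=30)
print(invited[:3])    # the 30th, 80th and 130th students
print(len(invited))   # 100 students in total
```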
B) Probability Data Samples
3. Stratified samples
• The population is divided into groups according to some characteristic that is important to the study, and a sample is then taken from each group.
- Subjects are selected by dividing the population into groups (strata), and subjects within each group are randomly selected.
• Example:
A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample of size 100 to take a physical fitness test. Assume that, because of different lifestyles, the level of physical fitness differs between male and female students. To account for this variation in lifestyle, the population of students can easily be stratified into male and female students. She can then use either the random method or the systematic method to select the participants. For example, she can use a random sample to choose 50 male students and the systematic method to choose another 50 female students, or vice versa.
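
The stratified procedure might look like this in Python. The roster below is hypothetical: odd-numbered students are treated as male and even-numbered students as female purely so the sketch is runnable. The function name `stratified_sample` is also an assumption, and random selection is used within both strata.

```python
import random


def stratified_sample(strata, per_stratum, seed=None):
    """Draw a simple random sample of the requested size from each stratum."""
    rng = random.Random(seed)
    return {
        name: sorted(rng.sample(members, per_stratum[name]))
        for name, members in strata.items()
    }


# Hypothetical roster: odd student numbers are male, even numbers are female
strata = {
    "male": [i for i in range(1, 5_001) if i % 2 == 1],
    "female": [i for i in range(1, 5_001) if i % 2 == 0],
}

sample = stratified_sample(strata, {"male": 50, "female": 50}, seed=2032)
print({name: len(chosen) for name, chosen in sample.items()})
```

Because each stratum is sampled separately, both groups are guaranteed to appear in the sample in the chosen numbers.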
B) Probability Data Samples
4. Cluster samples
• The population is divided into sections/clusters, some of those clusters are randomly selected, and all members of the selected clusters are chosen.
- Subjects are selected by using an intact group that is representative of the population.
• Using cluster sampling can reduce cost and time.
• Example:
A lecturer wants to study the physical fitness levels of students at her university. There are 5,000 students enrolled at the university, and she wants to draw a sample to take a physical fitness test. Assume that, because of different lifestyles, the level of physical fitness differs between freshmen, juniors and seniors. To account for this variation in lifestyle, the population of students can easily be clustered into freshmen, juniors and seniors. She can then randomly choose one cluster, such as the freshmen, and take all the freshmen as participants.
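
As a rough Python sketch of cluster sampling: the cluster sizes below are invented for illustration (the slide gives only the total of 5,000 students), and `cluster_sample` is a hypothetical helper name.

```python
import random


def cluster_sample(clusters, n_clusters=1, seed=None):
    """Randomly pick whole clusters and return every member of the picked ones."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)   # choose cluster names at random
    participants = [student for name in chosen for student in clusters[name]]
    return chosen, participants


# Hypothetical cluster sizes that add up to 5,000 students
clusters = {
    "freshmen": list(range(1, 2_001)),
    "juniors": list(range(2_001, 3_501)),
    "seniors": list(range(3_501, 5_001)),
}

chosen, participants = cluster_sample(clusters, n_clusters=1, seed=2032)
print(chosen, len(participants))
```

Unlike stratified sampling, only the selected cluster contributes participants, and every one of its members is included.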
STEP 4
Classifying and Summarizing the Data

• Organize or group the facts/raw sample data for study and investigation
• Classifying: identifying items with like characteristics and arranging them into groups or classes.
  - Ex: Production data (product make, location, production process, …)
  - Data can be classified as qualitative (categorical/attribute) data and quantitative (numerical) data.

• Summarization
  - Graphical and descriptive statistics (tables, charts, measures of central tendency, measures of variation, measures of position)

Variables & Data Classification
• Data are the values that variables can assume.
• A variable is a characteristic or attribute that can assume different values.
• Variables whose values are determined by chance are called random variables.

Variables can be classified:
• As quantitative and qualitative
• By how they are categorized, counted or measured (level of measurement of data)

Types of Data

Qualitative (categorical/attributes)
• Data that refer only to name classification (may be coded using numbers: 1, 2, …)
• Can be placed into distinct categories according to some characteristic or attribute.
  - Nominal data (cannot be ranked). Ex: gender, race, citizenship, etc.
  - Ordinal data (can be ranked). Ex: feeling (dislike to like), color (dark to bright), Likert scale, etc.

Quantitative (numerical)
• Data that represent counts or measurements (can be counted or measured)
• Are numerical in nature and can be ordered or ranked.
  - Discrete variables: assume values that can be counted and are finite. Ex: number of defective parts
  - Continuous variables:
    1. Can assume all values between any two specific values and are obtained by measuring
    2. Have boundaries and must be rounded because of the limits of the measuring device
    Ex: weight, age, salary, height, temperature, etc.

EXERCISE 1.2
4. The Lemon Marketing Corporation has asked you for information about
the car you drive. For each question, identify each of the types of data
requested as either attribute data or numeric data. When numeric data
is requested, identify the variable as discrete or continuous.

a) What is the weight of your car?


b) In what city was your car made?
c) How many people can be seated in your car?
d) What’s the distance traveled from your home to your school?
e) What’s the color of your car?
f) How many cars are in your household?
g) What’s the length of your car?
h) What’s the normal operating temperature (in degrees Fahrenheit) of
your car’s engine?
i) What gas mileage (miles per gallon) do you get in city driving?
j) Who made your car?
k) How many cylinders are there in your car’s engine?
l) How many miles have you put on your car’s current set of tyres?
Level of Measurements of Data

Nominal-level data: classifies data into mutually exclusive (nonoverlapping), exhaustive categories in which no order or ranking can be imposed on the data.
Ordinal-level data: classifies data into categories that can be ranked; however, precise differences between the ranks do not exist.
Interval-level data: ranks data, and precise differences between units of measure do exist; however, there is no meaningful zero.
Ratio-level data: possesses all the characteristics of interval measurement, and there exists a true zero.

Examples

EXERCISE 1.2
5. The chart shows the number of job-related injuries for each of the
transportation industries for 1998.

Industry Number of injuries


Railroad 4520
Intercity bus 5100
Subway 6850
Trucking 7144
Airline 9950

a) What are the variables under study?


b) Categorize each variable as qualitative or quantitative.
c) Categorize each quantitative variable as discrete or continuous.
d) Identify the level of measurement for each variable.

STEP 5
Presenting and Analyzing the data
• Summarize and analyze the information using
  - graphical statistics (graphs and charts)
  - descriptive statistics (refer to topic 1.3)
• Identify the relationships in the information
• Make any relevant statistical inferences
  - confidence interval
  - hypothesis testing
  - ANOVA
  - regression analysis
  - control charts, box plot, etc.


Types of Graph & Chart

• The purpose of graphs in statistics is to convey the data to the viewer in pictorial form.
• Graphs are useful in getting the audience’s attention in a publication or a presentation.

Distribution Shapes for Histogram
• Bell-shaped: has a single peak and tapers off at either end; approximately symmetric (roughly the same on both sides of a line running through the center).
• Uniform: basically flat/rectangular.
• J-shaped: has a few data values on the left side, increasing as one moves to the right.
• Reverse J-shaped: the opposite of J-shaped; has a few data values on the right side, increasing as one moves to the left.
• Right-skewed: the peak is to the left; the data values taper off to the right.
• Left-skewed: the peak is to the right; the data values taper off to the left.
• Bimodal: has 2 peaks of the same height.
• U-shaped: the shape is a U.

STEP 6
Making the decision

• The researcher lists all the options and decisions that can achieve the objective and goal of the research, weighs them, and chooses the option that represents the ‘best’ solution to the problem.

• The correctness of this choice depends on the researcher's analytical skill and the quality of the information.

Statistical Problem-Solving Methodology

START
1. Identify the problem or opportunity.
2. Gather available internal and external facts relevant to the problem.
3. Are the available facts sufficient?
   No: Gather new data from populations and samples using instruments, interviews, questionnaires, etc.
   Yes: Continue with step 4.
4. Classify, summarize, and process data using tables, charts, and numerical descriptive measures.
5. Present and communicate summarized information in the form of tables, charts and descriptive measures.
6. Is the information from a sample?
   Yes: Use sample information to (1) estimate the value of a parameter and (2) test assumptions about a parameter; then interpret the results, draw conclusions, and make decisions.
   No: Use census information to evaluate alternative courses of action and make decisions.
STOP

Role of the Computer in Statistics

Two software tools commonly used for data analysis:
1. Spreadsheets
   • Microsoft Excel & Lotus 1-2-3

2. Statistical packages
   • MINITAB, SAS, SPSS and S-PLUS

Data Analysis Application in EXCEL
• Graphs and charts
• Formulas
• Add-in: Analysis ToolPak (Data Analysis)

1.3: REVIEW ON DESCRIPTIVE STATISTICS
• Summarize data using measures of central tendency, such as the mean, median and mode.

• Describe data using measures of variation, such as the range, variance and standard deviation.

• Explain the accuracy and precision of data using the game of darts.

DATA DESCRIPTION

Measures of Central Tendency
Determine the center of the distribution
- mean, median, mode

Measures of Variation
Determine the spread of the data
- range, variance, standard deviation

Measures of Position
- percentile, decile, quartile

Exploratory Data Analysis
- box plot, five-number summary

TIPS: INSERT & CLEAR DATA by using Scientific Calculator

Casio fx-570MS
• Insert data: MODE SD, data, M+
• Shift 1
• Shift 2
• Clear data: Shift CLR 1

Casio fx-570W
• Insert data: MODE SD, data, M+
• Shift 1
• Shift 2
• Shift 3
• Shift 4
• Clear data: Shift AC/ON =

ROUNDING RULE: The value of a statistic/parameter should be rounded to one more decimal place than occurs in the raw data.

1.3.1 Measures of Central Tendency

Mean

The sum of the values divided by the total number of values.

Population mean: $\mu = \dfrac{\sum_{i=1}^{N} x_i}{N}$, where $N$ is the population size.

Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$, where $n$ is the sample size.

Example: 9 2 1 4 3 3 7 5 8 6, $\mu = \bar{x} = 4.8$
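
The example can be checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and `statistics.mean` from the standard library is shown alongside the formula for comparison.

```python
from statistics import mean

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

# Sum of the values divided by the number of values
manual_mean = sum(data) / len(data)

print(manual_mean)   # 4.8
print(mean(data))    # same result from the standard library
```

The same arithmetic gives the population mean or the sample mean; only the interpretation of the data set differs.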
Properties of Mean
• The mean is computed by using all the values of the data.
• The mean varies less than the median or mode when samples are taken from the same population and all three measures are computed for these samples.
• The mean is used in computing other statistics, such as the variance.
• The mean for the data set is unique, and not necessarily one of the data values.
• The mean cannot be computed for an open-ended frequency distribution.
• The mean is affected by extremely high or low values and may not be the appropriate average to use in these situations.

Note: Assume the data are obtained from samples unless otherwise specified.
1.3.1 Measures of Central Tendency

Median
The middle number of n ordered data values (smallest to largest).

If n is odd: $\mathrm{MD} = x_{\frac{n+1}{2}}$

If n is even: $\mathrm{MD} = \dfrac{x_{\frac{n}{2}} + x_{\frac{n}{2}+1}}{2}$

Example (n odd): 9 2 1 3 3 7 5 8 6
Step 1: 1 2 3 3 5 6 7 8 9
Step 2: MD = 5

Example (n even): 9 2 1 4 3 3 7 5 8 6
Step 1: 1 2 3 3 4 5 6 7 8 9
Step 2: MD = (4 + 5)/2 = 4.5
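
The two position rules can be written as a short Python function. The name `median_by_position` is an assumption made for this sketch; Python lists are indexed from 0, so each position in the formula is shifted down by one.

```python
from statistics import median


def median_by_position(values):
    """Median via the odd/even position rules applied to the ordered data."""
    ordered = sorted(values)                 # Step 1: order smallest to largest
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]     # x_{(n+1)/2}
    # average of x_{n/2} and x_{n/2 + 1}
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_example = [9, 2, 1, 3, 3, 7, 5, 8, 6]
even_example = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

print(median_by_position(odd_example))    # 5
print(median_by_position(even_example))   # 4.5
```

The standard-library function `statistics.median` follows the same rules and can be used to cross-check the results.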
Properties of Median

• The median is used when one must find the center or middle value of a data set.

• The median is used when one must determine whether the data values fall into the upper half or lower half of the distribution.

• The median is used to find the average of an open-ended distribution.

• The median is affected less than the mean by extremely high or extremely low values.
  - Try this data: 19 2 1 4 3 3 7 5 8 6

1.3.1 Measures of Central Tendency

Mode
The most commonly occurring value in a data series.

• The mode is used when the most typical case is desired.

• The mode is the easiest average to compute.

• The mode can be used when the data are nominal, such as religious preference, gender, or political affiliation.

• The mode is not always unique. A data set can have more than one mode, or the mode may not exist for a data set.

Example: 9 2 1 4 3 3 7 5 8 6, Mode = 3
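
A small Python sketch shows how the unique, multiple and no-mode cases can be handled. The function name `modes` is hypothetical, and treating a data set in which every value occurs exactly once as having no mode is a convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return all most frequent values, or an empty list if nothing repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([9, 2, 1, 4, 3, 3, 7, 5, 8, 6]))   # [3]
print(modes([1, 1, 2, 2, 3]))                  # two modes
print(modes([1, 2, 3]))                        # no mode
print(modes(["car", "bus", "car"]))            # works for nominal data too
```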
Types of Distribution
• Symmetric
• Positively skewed or right-skewed
• Negatively skewed or left-skewed

EXERCISE 1.3.1
6. Determine the type of distribution of the following data

a) Mean = Mode = Median = 11

b) Mean = 25, Mode = 13, Median = 17

c) Mean = 5, Mode = 73, Median = 17

1.3.2 Measures of Variation / Dispersion

• Measures of variation determine the spread (variation) of the data values.

• They are used when the measure of central tendency alone is not informative (ex: the means of two data sets are the same).
  - If the means of X and Y are the same and the dispersion of X < the dispersion of Y, then population X is better (more consistent) than population Y.

• They measure the variability that exists in a data set.

• They help form a judgment about how well the average value depicts the data.

• They show the extent of the scatter so that steps may be taken to control the existing variation.

1.3.2 Measures of Variation / Dispersion

Range

The difference between the highest value and the lowest value in a data set.
The symbol R is used for the range.

R = highest value - lowest value

Example: 9 2 1 4 3 3 7 5 8 6
R = 9 - 1 = 8

1.3.2 Measures of Variation / Dispersion

Variance
The average of the squares of the distance each value is from the mean.

Population variance: $\sigma^2 = \dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$, where $N$ is the population size.

Sample variance: $s^2 = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$, where $n$ is the sample size.

Example: 9 2 1 4 3 3 7 5 8 6

$\sigma^2 = 6.4$, $s^2 = 7.1$

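The two variance formulas differ only in the divisor, which a short Python sketch makes visible. Variable names are illustrative; `statistics.pvariance` and `statistics.variance` are the standard-library counterparts.

```python
from statistics import pvariance, variance

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
centre = sum(data) / n                                   # 4.8

# Sum of squared distances from the mean
squared_deviations = sum((x - centre) ** 2 for x in data)

population_variance = squared_deviations / n             # divide by N
sample_variance = squared_deviations / (n - 1)           # divide by n - 1

print(round(population_variance, 1))   # 6.4
print(round(sample_variance, 1))       # 7.1
```

The unrounded values are 6.36 and about 7.07; the slide reports them to one decimal place, following the rounding rule.
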
1.3.2 Measures of Variation / Dispersion

Standard Deviation
The square root of the variance.

Population standard deviation: $\sigma = \sqrt{\dfrac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$, where $N$ is the population size.

Sample standard deviation: $s = \sqrt{\dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$, where $n$ is the sample size.

Example: 9 2 1 4 3 3 7 5 8 6

$\sigma = 2.5$, $s = 2.7$

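Continuing the same example, the standard deviations follow by taking square roots. This is a sketch with illustrative variable names; `statistics.pstdev` and `statistics.stdev` give the same values.

```python
import math
from statistics import pstdev, stdev

data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]
n = len(data)
centre = sum(data) / n
squared_deviations = sum((x - centre) ** 2 for x in data)

population_sd = math.sqrt(squared_deviations / n)        # sigma
sample_sd = math.sqrt(squared_deviations / (n - 1))      # s

print(round(population_sd, 1))   # 2.5
print(round(sample_sd, 1))       # 2.7
```
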
Properties of Variance & Standard Deviation
• Variances and standard deviations can be used to determine the spread of the data. If the variance or standard deviation is large, the data are more dispersed. The information is useful in comparing two or more data sets to determine which is more variable.
• The measures of variance and standard deviation are used to determine the consistency of a variable.
• The variance and standard deviation are used to determine the number of data values that fall within a specified interval in a distribution.
• The variance and standard deviation are used quite often in inferential statistics.
• The standard deviation is used to estimate the amount of spread in the population from which the sample was drawn.

EXERCISE 1.3.2
7. A testing lab wishes to test two experimental brands of outdoor paint
to see how long each will last before fading. The testing lab makes 6
gallons of each paint to test. Since different chemical agents are
added to each group and only 6 cans are involved, these two groups
constitute two small populations. The results (in months) are shown.

A 10 20 30 40 50 60
B 25 30 35 40 45 35

a. Find the mean of each group.
b. Find the standard deviation and variance for each group.
c. Compare the variation between groups A and B and interpret
the result. (Hint: Which data set is less dispersed? Which data set
is more variable? Which data set is better?)

1.3.3 Concept of Variance:
1.3.3.1 Chebyshev’s Theorem
• Chebyshev’s Theorem states that the proportion of values from any data set that lie within k standard deviations of the mean is at least 1 - 1/k², where k is any number greater than 1.
• In other words, if a data set has mean μ and standard deviation σ, then at least a proportion 1 - 1/k² of the data lie within (μ - kσ, μ + kσ).

Chebyshev’s theorem applies to any distribution regardless of its shape. When a distribution is bell-shaped, however, other rules apply.
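
The bound can be compared with an actual data set in Python. The sketch below reuses the ten-value example from section 1.3 and treats it as a population; the function names `chebyshev_lower_bound` and `fraction_within` are assumptions made for illustration.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k**2


def fraction_within(values, k):
    """Observed proportion of values inside (mu - k*sigma, mu + k*sigma)."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / n)
    inside = [x for x in values if mu - k * sigma < x < mu + k * sigma]
    return len(inside) / n


data = [9, 2, 1, 4, 3, 3, 7, 5, 8, 6]

for k in (1.5, 3):
    print(k, round(chebyshev_lower_bound(k), 3), fraction_within(data, k))
```

For every k greater than 1 the observed proportion is at least as large as the bound, whatever the shape of the data.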

Exercise 1.3.3.1: A machine produces bullet shells with a mean length of 2 cm and a standard
deviation of 0.01 cm. Using Chebyshev’s Theorem, at least what percentage of
bullet shells have a length within the range 1.98 cm to 2.02 cm?

1.3.3.2 Game of Darts

• There are two important concepts in the game of darts:

• Accuracy: how close a measurement is to the true value
• Precision: how reproducible repeated measurements are (how close they are to one another)
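
One way to put numbers on the two concepts is sketched below in Python. The measurements and the true value are invented for illustration. Summarizing accuracy by the distance between the mean and the true value, and precision by the standard deviation, is an assumption of this sketch rather than a definition from the slides.

```python
from statistics import mean, pstdev

TRUE_VALUE = 10.0   # hypothetical bullseye


def accuracy_error(measurements, true_value):
    """Distance between the average measurement and the true value."""
    return abs(mean(measurements) - true_value)


def precision_spread(measurements):
    """Spread of repeated measurements around their own mean."""
    return pstdev(measurements)


# Hypothetical throws: tightly grouped but off target
precise_but_inaccurate = [12.0, 12.1, 11.9, 12.0]
# Hypothetical throws: centred on target but scattered
accurate_but_imprecise = [8.0, 12.0, 9.0, 11.0]

for throws in (precise_but_inaccurate, accurate_but_imprecise):
    print(round(accuracy_error(throws, TRUE_VALUE), 2),
          round(precision_spread(throws), 2))
```

A small accuracy error with a large spread corresponds to darts scattered around the bullseye; a small spread with a large accuracy error corresponds to darts clustered away from it.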
Conclusion
• The applications of statistics are many and varied. People encounter them in everyday life, such as in reading newspapers or magazines, listening to the radio, or watching television.
• By combining all of the descriptive statistics techniques discussed in this chapter, the student is now able to collect, organize, summarize and present data.
Thank You
NEXT: Chapter 2 Sampling Distribution and Confidence Interval