Descriptive Statistics - Lec1 PDF

STAT – 835
PROBABILITY AND STATISTICS
FALL 2018
Dr. Muhammad Irfan

Outline of Today’s Lecture
• Scope of the Course
• Planned Curriculum for STAT – 835 (Probability and Statistics) in Fall 2018
• Miscellaneous Course Information
• Descriptive Statistics (Lecture # 1)
◦ Populations, Samples, Processes
◦ Descriptive statistics
  - Measure of Central Tendency (Location)
  - Measure of Variation
STAT - 835: Probability and Statistics 2

Scope of the Course
To serve as a comprehensive introduction to the ‘probability concepts’ and
‘statistical methods & applications’ most likely to be encountered and used
by students in pursuit of their careers in engineering

STAT – 835 Probability and Statistics
Planned Curriculum for Fall 2018 (18 Weeks)
• Descriptive Statistics (6 hours)
◦ Populations, Samples, Processes
◦ Mean, Median, Quartiles, Percentiles, Trimmed mean
◦ Measures of Variability (variance, standard deviation)
◦ Pictorial and Tabular Methods in Descriptive Statistics (stem-and-leaf, box plot, dot plots, histogram)
• Probability (18 hours)
◦ Sample Spaces, Events
◦ Axioms, Interpretations and Properties of Probability
◦ Conditional Probability, Independence and Bayes’ Theorem
◦ Discrete and Continuous Random Variables
◦ Discrete and Continuous Probability Distributions

Planned Curriculum for Fall 2018 (Cont...)
• Statistical Inferences (6 hours)
◦ Confidence Interval / Significance level
◦ Hypotheses and Test Procedures
◦ Test about a population Mean
◦ Inferences based on two samples (two-sample t test)
• Regression and Statistical Modeling (15 hours)
◦ Simple linear Regression Model
◦ Estimating Model Parameters and their inferences
◦ Correlation
◦ Diagnostics and Remedial Measures
◦ Nonlinear and Multiple Regression
◦ Software Learning (pH Stat, SAS 9.1.3, SPSS (PASW 18.0), Minitab, NVivo)
• 2 x Class Test and Revisions (2+1 hours)
Miscellaneous Course Information
STAT – 835: Probability and Statistics
• Time and Location: as per Weekly Program, PG Block Class Room
• Instructor: Dr. Muhammad Irfan
Email: mirfan@mce.nust.edu.pk; mirfans.36@gmail.com
Office Hours for Students: on weekdays or by appointment
Textbooks:
1. Probability and Statistics for Engineering and the Sciences, by Jay L. Devore (8th Edition) (available in MCE Library)
2. Applied Linear Statistical Models, by Michael Kutner, Christopher Nachtsheim, John Neter, and William Li (5th Edition) (available in MCE Library)
Exams:
There will be two Class Tests (one hour each) and one final examination (3 hours). These will contribute the majority
of the final grade. The 1st Class Test will cover Descriptive Statistics and Probability and will be held in the 7th / 8th week of
the semester. The 2nd Class Test will cover Hypothesis Testing and a portion of Regression Analysis and will be held in the 13th /
14th week of the semester. The final examination will be held during the final exam week and covers the entire course.
Homework:
Homework will be given on a bi-weekly basis (a total of 5/6 homework assignments).
Quiz and Attendance:
There will be 5/6 quizzes, including a couple of pop quizzes in class. Students are expected to attend almost all
classes. Poor attendance will affect the final grade.
Final Grade:
The final grade will depend on the following components, with the proportion of each given in parentheses:
homework (15%), quizzes (15%), Class Tests (30%), final exam (40%).

Break-down of Course Activities
Important Dates / Deadlines
Fall 2018 (1st Oct 2018 – 1st Feb 2019)
3rd Oct 2018 – Commencement of Classes
17th Oct 2018 – Homework 1
– Quiz 1
24th Oct 2018 – Homework 1 (Due for Submission)
– Homework 2
31st Oct 2018 – Quiz 2
7th Nov 2018 – Class Test 1
– Homework 2 (Due for Submission)
14th Nov 2018 – Homework 3
21st Nov 2018 – Quiz 3
– Homework 3 (Due for Submission)
28th Nov 2018 – Homework 4
5th Dec 2018 – Class Test 2
19th Dec 2018 – Homework 5
26th Dec 2018 – Quiz 4
9th Jan 2019 – Last Day of Classes (all Tests (except ESE), Quizzes and Homework marked and results disseminated)
19th Jan – 31st Jan 2019 – Fall End Semester Exam (ESE)
DESCRIPTIVE STATISTICS (1)
Dr. Muhammad Irfan

October 3rd, 2018
Population and Sample and Processes
• Engineers and scientists are constantly exposed to collections of facts, or data
• Statistics provides methods for organizing and summarizing data and for drawing conclusions based on data
• An investigation will typically focus on a well-defined collection of objects constituting a population (e.g. all graduating students of a university)
• If the desired information is available for all objects in the population, we have what is called a census
Population and Sample
• Population: The entire collection of individuals or measurement objects about which information is desired, e.g. all 5-year-old children in Pakistan (to find their average height), or all steel from a complete production run (to find its average tensile strength).
• Sample: A subset of the population selected for study. The primary objective is to create a subset of the population whose center, spread and shape are as close as possible to those of the population. There are many methods of sampling: random (simple or systematic) sampling, stratified or cluster sampling, etc.
• Random Sample: A simple random sample of size n from a population is a subset of n elements chosen in such a way that every possible subset of n units has the same chance of being selected (so every unit of the population has the same chance of being included).
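The definition of a simple random sample can be illustrated with a short Python sketch. The population of 500 graduate ID numbers and the seed are hypothetical, chosen only to mirror the "50 out of 500 graduates" example used in this lecture; `random.sample` from the standard library draws without replacement.

```python
import random

# Hypothetical population: ID numbers of 500 graduates (illustrative only).
population = list(range(1, 501))

# A fixed seed makes the draw reproducible; any seed would do.
rng = random.Random(2018)

# Draw n = 50 distinct units; every subset of size 50 is equally likely.
n = 50
sample = rng.sample(population, n)

print(len(sample), len(set(sample)))  # 50 distinct units
```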
Population and Sample and Processes (cont...)
• A census is usually impractical or infeasible. Why? Constraints on time, money and other scarce resources
• Instead, a subset of the population, a sample, is selected in some prescribed manner (e.g. 50 students randomly selected out of 500 graduates)
• In order to draw inferences/conclusions about a population, certain characteristics of its objects are investigated (e.g. age, gender, GPA), each recorded as a categorical or numerical variable
• A variable is any characteristic whose value may change from one object to another
• Data sets may be univariate, bivariate or multivariate

Univariate, Bivariate, and Multivariate Data
• Depending on how many variables we are measuring on the individuals or objects in our sample, we will have one of the following three types of data sets
◦ Univariate: Measurements made on only one variable per observation.
◦ Bivariate: Measurements made on two variables per observation.
◦ Multivariate: Measurements made on more than two variables per observation.

Population and Sample (Cont…)
• Why do we need randomness in sampling?
◦ It reduces the possibility of subjective biases (e.g. selectivity bias).
◦ The mean and variance of a random sample (with the variance computed using the n − 1 divisor) are unbiased estimates of the population mean and variance, respectively.
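The unbiasedness claim can be checked exactly on a toy case. The sketch below uses a hypothetical four-value population and assumes sampling with replacement (independent draws), then lists every possible sample of size 2 and averages the sample means and the sample variances (n − 1 divisor). Exact fractions are used so that no rounding is involved.

```python
from fractions import Fraction
from itertools import product

population = [2, 4, 6, 8]  # hypothetical toy population
N = len(population)

mu = Fraction(sum(population), N)                    # population mean
sigma2 = sum((x - mu) ** 2 for x in population) / N  # population variance (divide by N)

def sample_mean(xs):
    return Fraction(sum(xs), len(xs))

def sample_variance(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)  # divide by n - 1

# Assumption: draws are made with replacement, so all 16 ordered pairs are equally likely.
n = 2
samples = list(product(population, repeat=n))

avg_of_means = sum(sample_mean(s) for s in samples) / len(samples)
avg_of_vars = sum(sample_variance(s) for s in samples) / len(samples)

print(mu, avg_of_means)     # population mean vs. average of all sample means
print(sigma2, avg_of_vars)  # population variance vs. average of all sample variances
```

In this toy case both averages coincide with the population values, which is what "unbiased" means. Dividing by n instead of n − 1 would make the average sample variance too small.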

Census and Inference
• Census: Complete enumeration of population units.
• Inference: We sample the population (in a manner that ensures the sample correctly represents the population), take measurements on our sample, and then infer (or generalize) back to the population.
Example: We may want to know the average height of all adults (over 18 years old) in Pakistan. Our population is then
all adults over 18 years of age. If we were to conduct a census, we would measure every adult and then compute the average.
By using statistics, we can take a random sample of adults over 18 years of age, compute their average height, and then
infer that the average height of the total population is “close to” the average height of our sample.

Population and Sample and Processes
[Diagram: Population and Sample linked by two arrows]
Probability: Population → Sample
Deductive (logic based on known properties)
• Properties of the population under study are assumed to be known
• Deals with questions involving samples taken from the population
Inferential Statistics: Sample → Population
Inductive (logic based on observed instances)
• Statistics of the sample are known and are used to infer about the population
• Point estimation
• Hypothesis testing
• Estimation by confidence interval
• Any samples used should be representative of the target population

Parameter and Statistic
• Parameter: Any statistical characteristic of a population. Population mean, population median and population standard deviation are examples of parameters.
• Statistic: Any statistical characteristic of a sample. Sample mean, sample median and sample standard deviation are some examples of statistics.
• Statistical Methods: Describing a population through a census, or making inferences from a sample by estimating the value of a parameter using a statistic.

Some Differences between Population and Sample
                            POPULATION        SAMPLE
Size                        Large             Small
Size notation               N                 n
Easy to collect data?       No                Yes
Term used to describe       A “parameter”     A “statistic”
its nature                  (e.g., μ, σ)      (e.g., x̄, s)

Some Differences between Population and Sample (Cont’d)
                            POPULATION                                SAMPLE
Mean (notation)             μ                                         x̄
Std deviation (notation)    σ                                         s
Mean (formula)              $\mu = \frac{\sum x}{N}$                  $\bar{x} = \frac{\sum x}{n}$
Variance (formula)          $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$   $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$

Statistics!
What is it? What does it involve?
• The art or science of making intelligent judgments, informed decisions and confident conclusions about the attributes of a system or collection of systems
• Involves:
- taking a small sample from a larger set (Sampling)
- analyzing data from the small sample (Data analysis)
- testing hypotheses to ascertain whether the data support them (Hypothesis Testing)
- drawing conclusions about the larger set (Statistical Inference)
- presenting your findings to an audience (Information Delivery)

Using Statistics in Research
• Carrying out research means the collection and collation of data. Statistics is a way of making use of this data
◦ Descriptive Statistics: used to describe characteristics of the sample
  - Statistics describe samples
  - Gives numerical and graphic procedures to summarize a collection of data in a clear and understandable way
◦ Inferential Statistics: used to generalise from our sample to our population
  - Parameters describe populations
  - Provides procedures to draw inferences about a population from a sample

… there are countless instances in civil engineering where we’ll have to take only a small sample
from a large population of systems or system components in order to investigate an issue and
provide needed answers.

Some of the questions we may be required to answer as civil engineers:
- What is the quality of aggregates at a certain quarry? (Construction/Materials Engineering)
- What is the ratio of auto use to transit use? (Transportation Planning)
- What fraction of vehicles in the traffic stream on a particular highway (say M-2) are “semi” trucks? (Highway vehicle classification)

- Do the new traffic signals at a particular city location actually reduce accidents? (Traffic Studies)
- What is the strength of concrete being used in constructing a certain structure? (Construction/Materials Engineering)
- What is the quality of water produced by a water treatment plant? (Environmental Engineering)
- What has been the long-term settlement of high-rise buildings in a city? (Geotechnical)

- How deep down can we generally expect to hit groundwater in a district? (Geotechnical/Hydrology)
- Is people’s health being affected by the heavy smog and air pollution in a certain city? (Environmental Engineering)
- How many of the steel I-sections provided by a certain supplier have a lower-than-specified strength? (Structural Engineering)
- What is the quality of water in a water reservoir? (Environmental Engineering)

Because we draw the sample from the population, the sample is called a subset of the population (recall Set Theory).
The population is also referred to as the “Universe”, or the “Sample Space”.
[Diagram: the Sample shown as a subset inside the Population]

Ideally, we seek a sample that is a miniature copy of the population.
But there is no guarantee that we can achieve such a sample.
This dilemma leads to 2 very important questions …

Important Questions …
1. Is our sample a good copy of the population?
In other words, what quantitative means can we use to determine whether our sample is “close” enough to the population?
2. What steps can we take to ensure that our sample is a good miniature copy of the population?

Every engineer involved in statistical analysis of his/her system hopes that his/her sample is a good representative of the population,
i.e., the engineer “prays” that the statistics of his/her sample closely match the true (but unknown) parameters of the population.
Otherwise, any conclusion he/she draws from the sample will not reflect the entire population.
POPULATION parameters: μ, σ
SAMPLE statistics: x̄, s
Back to “Important Questions, #1”
Is our sample a good copy (close enough) of the population?
We might compare the population parameters and the sample statistics. However, the parameters of the
population are unknown, so how can we measure the closeness of our sample to the population?
We use the concepts of Bias and Efficiency (to be discussed under “Inferential Statistics”).
“Statistical Inference” helps to determine the bias or efficiency of estimates, in order to see how good our samples are.
Back to “Important Questions, #2”
What steps can we take to ensure that our sample is a good miniature copy of the population?
Answer: Sampling must be random (and representative),
i.e., all elements of the population should have an equal chance of being picked for the sample

Methods of Random Sampling
There are 4 major ways in which sampling can be carried out to ensure that the sample is random and yet
is a true miniature copy of the population:
- Simple Random Sampling
- Systematic Random Sampling
- Stratified (or Clustered) Random Sampling
- Combos of the above
The choice of any specific sampling technique above depends on
- the composition of the population
- the availability of sampling resources
Simple Random Sampling
This is just a simple selection of elements of the population without regard to the nature of the population.
Advantages:
- Less effort in preparations for the survey
- Less effort for conduct of the survey
- Is best when all elements in the population have similar characteristics (besides that under investigation).
Disadvantage:
- May not be truly representative of the population, especially if the population has diverse characteristics.

Systematic Random Sampling
This sampling method is …
- Systematic in time (i.e., sampling elements from the population within specified time intervals, at the same location), or
- Systematic in space (i.e., sampling elements from the population at selected locations at the same time).
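One common way to make systematic sampling concrete is to take every k-th element of an ordered population, starting from a randomly chosen position. The sketch below is an illustration under that assumption; the function name `systematic_sample`, the list of 100 vehicles and the interval k = 10 are all hypothetical.

```python
import random

def systematic_sample(items, k, rng):
    """Pick a random start among the first k positions, then take every k-th item."""
    start = rng.randrange(k)
    return items[start::k]

# Hypothetical ordered population: 100 vehicles numbered in order of arrival.
vehicles = list(range(100))

rng = random.Random(0)  # fixed seed for reproducibility
sample = systematic_sample(vehicles, 10, rng)

print(sample)  # 10 vehicles, spaced 10 apart
```

The random start is what keeps the scheme random: each vehicle has a 1-in-k chance of being selected.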

Stratified Random Sampling
This sampling method first divides the entire population into different groups, or strata, on the basis of
certain characteristics of the population.
Next, a random sample is obtained within each stratum to obtain the desired sample size.
[Illustration: the MAIN POPULATION is divided into SUB-POPULATIONS #1 to #4, and a SAMPLE is drawn from each]
Sub-populations may be of the same size or of different sizes

Stratified Random Sampling (continued)
A stratified sampling approach is most effective when three conditions are met:
• Variability within strata is minimized
• Variability between strata is maximized
• The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantage:
Stratified random sampling ensures that each group in the population is represented in the sample.
It is therefore ideal for populations having diverse groups.
Disadvantage:
Relatively more preparation time is needed to calculate the proportion of each group in the population, and
from that the proportion of each group in the sample
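The proportion calculation mentioned above can be sketched as follows. This is a minimal illustration assuming proportional allocation (each stratum contributes to the sample in proportion to its share of the population). The function name `stratified_sample` and the three vehicle strata are hypothetical.

```python
import random

def stratified_sample(strata, total_n, rng):
    """Draw a simple random sample within each stratum, sized in proportion to the stratum."""
    population_size = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(total_n * len(units) / population_size)  # stratum sample size
        sample[name] = rng.sample(units, n_h)
    return sample

# Hypothetical population of 1000 vehicles split into three strata.
strata = {
    "cars": [f"car-{i}" for i in range(600)],
    "buses": [f"bus-{i}" for i in range(100)],
    "trucks": [f"truck-{i}" for i in range(300)],
}

rng = random.Random(1)  # fixed seed for reproducibility
chosen = stratified_sample(strata, 50, rng)
sizes = {name: len(units) for name, units in chosen.items()}

print(sizes)  # {'cars': 30, 'buses': 5, 'trucks': 15}
```

With less convenient numbers the rounded stratum sizes may not add up exactly to the requested total, so a real survey design would need a rule for distributing the remainder.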
Combinations of the 3 major methods of random sampling
Sampling schemes which are combinations of the 3 methods can also be used.
For example, you may decide to carry out a stratified and systematic random sampling of your population.

In Summary ...
- We can afford to take only a small sample from a large population of systems or system components in order to investigate the population.
- Our sample must, as much as possible, reflect the population from which it is drawn.
- Good sampling should be random and representative. Systematic and stratified sampling are useful to ensure that the sample is representative of the population.
- Only a good sample can result in accurate inferences and predictions about the population.

Introduction to Statistics
Types of Statistical Analysis
Descriptive
  Graphical: Scaled Figures, Dot Plots, Scatter Plots, Box Plots, Stem-and-leaf Plots, Bar Charts/Histograms
  Non-graphical: Central Tendency, Dispersion/Variance, Range, Shape
Inferential
  Point Estimation, Hypothesis Testing, Confidence Interval, Statistical Regression
Descriptive Statistics
◦ Statistical procedures used to summarise, organise, and simplify data. This process should be carried out in such a way that it reflects the overall findings
  - Raw data is made more manageable
  - Raw data is presented in a logical form
  - Patterns can be seen from organised data
  - Frequency tables
  - Graphical techniques
  - Measures of Central Tendency
  - Measures of Spread (variability)

Descriptive Measures
• Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
• Variation or Variability measures. They describe “data spread”, or how far away the measurements are from the center.

Measures of Central Tendency
• Mean: Sum of all measurements divided by the number of measurements.
• Median: A number such that at most half of the measurements are below it and at most half of the measurements are above it.
• Mode: The most frequent measurement in the data.

Mean
• Sum of the values divided by the number of cases
$\bar{y} = \frac{\sum y_i}{n}$

Summation notation
• The $y_i$ ($y_1, y_2, \ldots, y_n$) are the n values of the variable Y
• The sum of the values is then denoted as
$\sum y_i = \sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$

Calculating the mean for high temperatures
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442
• Add values: $\sum y_i = 442$
• Number of cases: $n = 10$
• Calculate mean: $\bar{y} = \frac{\sum y_i}{n} = \frac{442}{10} = 44.2$
Notice that every single observation enters into the computation of the mean.
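The same calculation can be reproduced in a few lines of Python. The variable names are illustrative, and the data are the ten high temperatures from the table.

```python
# High temperatures for 2-Jan to 11-Jan, as listed in the table.
temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

total = sum(temps)  # add values
n = len(temps)      # number of cases
mean = total / n    # calculate mean

print(total, n, mean)  # 442 10 44.2
```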
Median
• The median represents the middle of the ordered sample data
• When the sample size is odd, the median is the middle value
• When the sample size is even, the median is the midpoint/mean of the two middle values

Calculating the median for high temperatures
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42   <=== Middle value
4-Jan     43   <=== Middle value
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60
$\text{Median} = \frac{42 + 43}{2} = 42.5$
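A small Python sketch of the odd/even rule is shown here. The function name `median` is illustrative, and the standard library's `statistics.median` gives the same result.

```python
def median(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
print(median(temps))  # 42.5
```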
Mode
• The mode is the value that occurs most frequently
• It is the least useful (and least used) of the three measures of central tendency

Calculating the mode for high
temperatures
High
Date Temperature
2-Jan 59
3-Jan 60
4-Jan 43
5-Jan 42
6-Jan 35
7-Jan 32 <===Mode
8-Jan 32 <===Mode
9-Jan 46
10-Jan 41
11-Jan 52
mode = 32
Another Example of Mode
Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
• In this case the data have two modes: 5 and 7
• Both measurements are repeated twice
• Notice that it is possible for a data set not to have any mode.
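Both mode examples can be checked with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value tied for the highest frequency.

```python
from statistics import multimode

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
x = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]

temp_modes = multimode(temps)
x_modes = multimode(x)

print(temp_modes)  # [32]
print(x_modes)     # [5, 7]
```

If every value occurs exactly once, `multimode` returns all of the values, which is one way to recognise data with no meaningful mode.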

Measures of central tendency and levels of measurement
• Mean assumes numerical values and requires interval data
• Median requires ordering of values and can be used with both interval and ordinal data
• Mode only involves determination of the most common value and can be used with interval, ordinal, and nominal data

Comparison of mean and median
• Mean
◦ Uses all of the data
◦ Has desirable statistical properties
◦ Affected by extreme high or low values (outliers)
◦ May not best characterize skewed distributions
• Median
◦ Not affected by outliers
◦ May better characterize skewed distributions

The mean and median and the distribution of the data
• For symmetric distributions, the mean and the median are the same
• For skewed distributions, the mean lies in the direction of the skew (the longer tail) relative to the median

Central Tendencies and Distribution Shape
Asymmetrical / Skewed distributions

A Few Notes on Central Tendency
• When the mean is greater than the median, the data distribution is skewed to the right.
• When the median is greater than the mean, the data distribution is skewed to the left.
• When the mean and median are very close to each other, the data distribution is approximately symmetric.
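These notes amount to a quick rule of thumb, sketched below in Python. The helper `skew_hint` is hypothetical and only compares the mean with the median; it suggests a direction of skew and is not a formal test.

```python
from statistics import mean, median

def skew_hint(values):
    """Rule of thumb: compare the mean with the median to suggest the direction of skew."""
    m, med = mean(values), median(values)
    if m > med:
        return "right"
    if m < med:
        return "left"
    return "symmetric"

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
print(skew_hint(temps))  # right (mean 44.2 is greater than median 42.5)
```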

Measures of variation
• Range
• Variance and standard deviation
• Interquartile range

Range
• Range is the difference between the maximum and minimum values

Calculating the range for high temperatures
range = 60 – 32 = 28
Variance and standard deviation
• The variance $s^2$ is the sum of the squared deviations from the mean divided by the number of cases minus 1
$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
• The standard deviation s is the square root of the variance
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$
• It is a measure of “spread”
• Notice that the larger the deviations (positive or negative), the larger the variance
Variance (for a sample)
• Steps:
◦ Compute each deviation
◦ Square each deviation
◦ Sum all the squares
◦ Divide by the data size (sample size) minus one: n − 1

Calculating the variance and standard deviation for high temperatures
Date      High Temperature   Difference (X - mean)   Difference Squared
2-Jan     59                 14.80                   219.04
3-Jan     60                 15.80                   249.64
4-Jan     43                 -1.20                   1.44
5-Jan     42                 -2.20                   4.84
6-Jan     35                 -9.20                   84.64
7-Jan     32                 -12.20                  148.84
8-Jan     32                 -12.20                  148.84
9-Jan     46                 1.80                    3.24
10-Jan    41                 -3.20                   10.24
11-Jan    52                 7.80                    60.84
Sum       442                                        931.60
n         10
Mean      44.2

$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{931.60}{10 - 1} = 103.51$
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{103.51} = 10.2$
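The table can be reproduced step by step in Python. This sketch follows the four steps listed for the sample variance and also computes the range; the variable names are illustrative.

```python
import math

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

n = len(temps)
mean = sum(temps) / n

deviations = [y - mean for y in temps]  # compute each deviation
squared = [d ** 2 for d in deviations]  # square each deviation
sum_sq = sum(squared)                   # sum all the squares
variance = sum_sq / (n - 1)             # divide by n - 1
std_dev = math.sqrt(variance)           # standard deviation
value_range = max(temps) - min(temps)   # range

print(round(sum_sq, 2), round(variance, 2), round(std_dev, 1), value_range)
```

The standard library functions `statistics.variance` and `statistics.stdev` use the same n − 1 divisor, so they can be used to cross-check the hand calculation.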

Percentiles
• The pth percentile is a number such that at most p% of the measurements are below it and at most (100 – p)% of the measurements are above it.
• Example: if the 85th percentile of a data set is 340, then at most 15% of the measurements are above 340 and at most 85% of the measurements are below 340.
• Notice that the median is the 50th percentile
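The percentile definition can be turned directly into a check. The sketch below is an illustration only: the helper `satisfies_percentile` is hypothetical and tests whether a given number meets the "at most p% below, at most (100 − p)% above" condition. Statistical software typically uses interpolation rules that can give slightly different percentile values.

```python
def satisfies_percentile(data, value, p):
    """True if at most p% of the data lie below `value` and at most (100 - p)% lie above it."""
    n = len(data)
    below = sum(1 for x in data if x < value)
    above = sum(1 for x in data if x > value)
    # Integer arithmetic avoids floating-point rounding in the comparison.
    return below * 100 <= p * n and above * 100 <= (100 - p) * n

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

print(satisfies_percentile(temps, 42.5, 50))  # True: the median is the 50th percentile
print(satisfies_percentile(temps, 59, 85))    # True: 8 of 10 values below, 1 of 10 above
print(satisfies_percentile(temps, 60, 50))    # False: 9 of 10 values lie below 60
```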
