
STAT – 835

PROBABILITY AND STATISTICS

FALL 2018

Dr. Muhammad Irfan


Outline of Today’s Lecture
 Scope of the Course

 Planned Curriculum for STAT – 835 (Probability and Statistics) in Fall 2018

 Miscellaneous Course Information

 Descriptive Statistics (Lecture # 1)


◦ Populations, Samples, Processes
◦ Descriptive statistics
 Measure of Central Tendency (Location)
 Measure of Variation

STAT - 835: Probability and Statistics 2


Scope of the Course

To serve as a comprehensive introduction to


‘probability concepts’ and ‘statistical
methods & applications’ most likely to be
encountered and used by students in pursuit
of their careers in engineering



STAT – 835 Probability and Statistics
Planned Curriculum for Fall 2018 (18 Weeks)

 Descriptive Statistics (6 hours)


◦ Populations, Samples, Processes
◦ Mean, Median, Quartiles, Percentiles, Trimmed mean
◦ Measures of Variability (variance, standard deviation)
◦ Pictorial and Tabular Methods in Descriptive Statistics (Stem-
and-leaf, box plot, dot plots, histogram)
 Probability (18 hours)
◦ Sample Spaces, Events
◦ Axioms, Interpretations and Properties of Probability
◦ Conditional Probability, Independence and Bayes’ Theorem
◦ Discrete and Continuous Random Variables
◦ Discrete and Continuous Probability Distributions



STAT – 835 Probability and Statistics
Planned Curriculum for Fall 2018 (Cont...)
 Statistical Inferences ( 6 hours)
◦ Confidence Interval / Significance level
◦ Hypotheses and Test Procedures
◦ Test about a population Mean
◦ Inferences based on two samples (two-sample t test)

 Regression and Statistical Modeling (15 hours)


◦ Simple linear Regression Model
◦ Estimating Model Parameters and their inferences
◦ Correlation
◦ Diagnostics and Remedial Measures
◦ Nonlinear and Multiple Regression
◦ Software Learning (pH Stat, SAS 9.1.3, SPSS (PASW 18.0), Minitab,
Nvivo)
 2 x Class Test and Revisions (2+1 hours)
Miscellaneous Course Information
STAT – 835: Probability and Statistics
 Time and Location: as per Weekly Program
PG Block Class Room

 Instructor: Dr. Muhammad Irfan.


Email: mirfan@mce.nust.edu.pk; mirfans.36@gmail.com
Office Hours for Students: Office Hours on Weekdays or by Appointment
Textbooks:
1. Probability and Statistics for Engineering and the Sciences, by Jay L. Devore (8th Edition) (Available in MCE Library)
2. Applied Linear Statistical Models, by Michael Kutner, Christopher Nachtsheim, John Neter, and William Li (5th Edition) (Available in MCE Library)
Exams:
There will be two Class Tests (one hour each) and one final examination (3 hours). These will contribute the majority of the final grade. The 1st Class Test will cover Descriptive Statistics and Probability and will be held in the 7th / 8th week of the semester. The 2nd Class Test will cover Hypothesis Testing and a portion of Regression Analysis and will be held in the 13th / 14th week of the semester. The final examination will be held during the final exam week and covers the entire course.
Home Work:
Homework will be given on a bi-weekly basis (a total of 5/6 Homework Assignments).
Quiz and Attendance:
There will be 5/6 quizzes, including a couple of pop quizzes in class. Students are expected to attend almost all classes. Poor attendance will affect the final grade of students.
Final Grade:
Final grade will depend on the following components with the proportions mentioned against each:
homework (15%), quiz (15%), Class Tests(30%), final exam (40%).



Break-down of Course Activities
Important Dates/ Dead-lines
Fall 2018 (1st Oct 2018 – 1st Feb 2019)
3rd Oct 2018 – Commencement of Classes
17th Oct 2018 – Homework 1
– Quiz 1
24th Oct 2018 – Homework 1 (Due for Submission)
– Homework 2
31st Oct 2018 – Quiz 2
7th Nov 2018 – Class Test 1
– Homework 2 (Due for Submission)
14th Nov 2018 – Homework 3
21st Nov 2018 – Quiz 3
– Homework 3 (Due for Submission)
28th Nov 2018 – Homework 4
5th Dec 2018 – Class Test 2
– Homework 4 (Due for Submission)
19th Dec 2018 – Homework 5
26th Dec 2018 – Quiz 4
– Homework 5 (Due for Submission)
9th Jan 2019 – Last Day of Classes (All Tests (except ESE), Quizzes and Homework Marked and Results Disseminated)
19th Jan – 31st Jan 2019 – Fall End Semester Exam (ESE)

STAT – 835 Probability and Statistics

DESCRIPTIVE STATISTICS (1)

Dr. Muhammad Irfan


October 3rd, 2018
Population and Sample and Processes
 Engineers and scientists are constantly exposed to collections of facts, or data
 Statistics provides methods for organizing and summarizing data and for drawing conclusions based on data
 An investigation will typically focus on a well-defined collection of objects constituting a population (e.g. all graduating students of a University)
 If the desired information is available for all objects in the population, we have what is called a census

Population and Sample
 Population: The entire collection of individuals or measurement objects about which information is desired, e.g. all 5-year-old children in Pakistan when their average height is of interest, or a complete production run of steel when its average tensile strength is of interest.
 Sample: A subset of the population selected for study. The primary objective is to create a subset of the population whose center, spread and shape are as close as possible to those of the population. There are many methods of sampling: random (simple or systematic) sampling, stratified or cluster sampling, etc.
 Random Sample: A simple random sample of size n from a population is a subset of n elements from that population, chosen in such a way that every unit of the population has the same chance of being selected (more precisely, every possible subset of n units is equally likely to be chosen).
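The definition of a simple random sample can be sketched in a few lines of Python. This is only an illustration: the helper name `simple_random_sample`, the list of 500 numeric IDs and the seed are assumptions made for the example, not part of the course material.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct units so that every subset of size n is equally likely."""
    rng = random.Random(seed)          # a seed only makes the draw repeatable
    return rng.sample(population, n)   # sampling without replacement

# Hypothetical population: ID numbers 1..500.
population_ids = list(range(1, 501))
sample_ids = simple_random_sample(population_ids, 50, seed=42)
print(len(sample_ids), "units selected")
```

`random.sample` draws without replacement, so no unit can appear twice in the sample.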
Population and Sample and Processes
(cont...)
 A census is usually impractical or infeasible. Why? Constraints on time, money and other scarce resources
 Instead, a subset of the population, called a sample, is selected in some prescribed manner (e.g. 50 students randomly selected out of 500 graduates)
 In order to draw inferences/conclusions about a population, certain characteristics of the objects of the population are investigated (e.g. age, gender, GPA); each is a categorical or numerical variable
 A variable is any characteristic whose value may change from one object to another
 A data set may be univariate, bivariate or multivariate


Univariate, Bivariate, and Multivariate
Data
 Depending on how many variables we are
measuring on the individuals or objects in our
sample, we will have one of the three following
types of data sets
◦ Univariate: Measurements made on only one variable per
observation.
◦ Bivariate: Measurements made on two variables per
observation.
◦ Multivariate: Measurements made on more than two
variables per observation.



Population and Sample (Cont…)
 Why do we need randomness in sampling?
It reduces the possibility of subjective biases (e.g. selectivity bias).
The mean and the variance of a random sample (the variance computed with the n − 1 divisor) are unbiased estimates of the population mean and variance respectively.


Census and Inference
 Census: Complete enumeration of population units.
 Inference: We sample the population (in a manner that ensures the sample correctly represents the population), take measurements on our sample, and then infer (or generalize) back to the population.
Example: We may want to know the average height of all adults (over 18 years old) in Pakistan. Our population is then all adults over 18 years of age. If we were to conduct a census, we would measure every adult and then compute the average. By using statistics, we can take a random sample of adults over 18 years of age, measure their average height, and then infer that the average height of the total population is “close to” the average height of our sample.
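The census-versus-sample idea in this example can be imitated with a small simulation. The sketch below is an assumption-laden toy: the "population" is synthetic (20,000 made-up heights drawn from a normal distribution with mean 165 cm and standard deviation 10 cm), not real survey data, and the sample size of 500 is arbitrary.

```python
import random
import statistics

rng = random.Random(2018)

# Synthetic population of heights in cm (made-up numbers, for illustration only).
population = [rng.gauss(165.0, 10.0) for _ in range(20_000)]
population_mean = statistics.mean(population)    # what a census would report

# A simple random sample of 500 units stands in for the survey.
sample = rng.sample(population, 500)
sample_mean = statistics.mean(sample)            # what we can actually observe

gap = abs(sample_mean - population_mean)
print(f"census mean = {population_mean:.2f}, sample mean = {sample_mean:.2f}")
```

With a random sample of this size the two means typically differ by well under one centimetre, which is the sense in which the sample mean is "close to" the population mean.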



Population and Sample and Processes
[Diagram: Population and Sample linked by two arrows, one for probability and one for inferential statistics]
 Probability: Deductive (logic based on known properties)
◦ Properties of the population under study are assumed to be known
◦ Deals with questions involving samples taken from the population
 Inferential Statistics: Inductive (logic based on observed instances)
◦ Statistics of the sample are known and are used to infer about the population
◦ Point estimation
◦ Hypothesis testing
◦ Estimation by confidence interval
 Any samples used should be representative of the target population


Parameter and Statistic
 Parameter: Any statistical characteristic of a population. Population mean, population median and population standard deviation are examples of parameters.
 Statistic: Any statistical characteristic of a sample. Sample mean, sample median and sample standard deviation are examples of statistics.
 Statistical Methods: Describing a population through a census, or making inferences from a sample by estimating the value of a parameter using a statistic.


Some Differences between Population and Sample

                                    POPULATION                             SAMPLE
Size                                Large                                  Small
Size notation                       N                                      n
Easy to collect data?               No                                     Yes
Term used to describe its nature    A “parameter”, e.g. $\mu$, $\sigma$    A “statistic”, e.g. $\bar{x}$, $s$


Some Differences between Population and Sample
(Cont’d)

                            POPULATION                                 SAMPLE
Mean (notation)             $\mu$                                      $\bar{x}$
Std Deviation (notation)    $\sigma$                                   $s$
Mean (formula)              $\mu = \frac{\sum x}{N}$                   $\bar{x} = \frac{\sum x}{n}$
Variance (formula)          $\sigma^2 = \frac{\sum (x - \mu)^2}{N}$    $s^2 = \frac{\sum (x - \bar{x})^2}{n - 1}$
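The only difference between the two variance formulas is the divisor: N for a population, n − 1 for a sample. A minimal sketch using Python's standard `statistics` module shows the effect; the eight numbers are made up purely for illustration.

```python
import statistics

# Illustrative values only (not course data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Treating the values as a complete population: divide by N.
pop_var = statistics.pvariance(values)

# Treating the same values as a sample: divide by n - 1.
samp_var = statistics.variance(values)

print(pop_var, samp_var)
```

The sum of squared deviations is 32 in both cases, so the population variance is 32/8 = 4 while the sample variance is 32/7, which is slightly larger.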



Statistics!
What is it? What does it involve?
 The art or science of making intelligent judgments, informed decisions
and confident conclusions about the attributes of a system or collection of
systems

 Involves:
- taking a small sample from a larger set (Sampling)
- analyzing data from the small sample (Data analysis)
- testing the hypotheses to ascertain if true (Hypothesis Testing)
- making conclusions about the larger set (Statistical Inference)
- presenting your findings to an audience (Information Delivery)



Using Statistics in Research
 Carrying out research means the collection and
collation of data. Statistics are a way of making use
of this data
◦ Descriptive Statistics: used to describe characteristics
of the sample
 Statistics describe samples
 Gives numerical and graphic procedures to summarize a collection
of data in a clear and understandable way
◦ Inferential Statistics: used to generalise from our
sample to our population
 Parameters describe populations
 Provides procedures to draw inferences about a population from a
sample



… there are countless instances in civil engineering where we will have to take only a small sample from a large population of systems or system components in order to investigate an issue and provide needed answers.


Some of the questions we may be required to answer as civil engineers:
- What is the quality of aggregates at a certain quarry? (Construction/Materials Engineering)
- What is the ratio of auto use to transit use? (Transportation Planning)
- What fraction of vehicles in the traffic stream on a particular highway (say M-2) are “semi” trucks? (Highway vehicle classification)
- Do the new traffic signals at a particular city location actually reduce accidents? (Traffic Studies)
- What is the strength of concrete being used in constructing a certain structure? (Construction/Materials Engineering)
- What is the quality of water produced by a water treatment plant? (Environmental Engineering)
- What has been the long-term settlement of high-rise buildings in a city? (Geotechnical)
- How deep down can we generally expect to hit groundwater in a district? (Geotechnical/Hydrology)
- Is people’s health being affected by the heavy smog and air pollution in a certain city? (Environmental Engineering)
- How many of the steel I-sections provided by a certain supplier have a lower-than-specified strength? (Structural Engineering)
- What is the quality of water in a water reservoir? (Environmental Engineering)


Because we draw the sample from the population, the sample is called a subset of the population (recall Set Theory).
The population is also referred to as the “Universe”, or the “Sample Space”.
[Diagram labels: Sample, Population]


Ideally, we seek a sample that is a miniature copy of the population.
But there is no guarantee that we can achieve such a sample.
This dilemma leads to 2 very important questions …


Important Questions …

1. Is our sample a good copy of the population?
In other words, what quantitative means can we use to determine whether our sample is “close” enough to the population?
2. What steps can we take to ensure that our sample is a good miniature copy of the population?


Every engineer involved in the statistical analysis of his/her system hopes that his/her sample is a good representative of the population, i.e., the engineer “prays” that the statistics of the sample closely match the true (but unknown) parameters of the population.
Otherwise, any conclusion he/she draws from the sample does not reflect the entire population.

POPULATION                        SAMPLE
Parameters: $\mu$, $\sigma$       Statistics: $\bar{x}$, $s$

Back to “Important Questions, #1”

Is our sample a good copy (close enough) of the population?
We may compare the population parameters and the sample statistics. However, the parameters of the population are unknown, so how can we measure the closeness of our sample to the population?
We use the concepts of Bias and Efficiency (to be discussed under “Inferential Statistics”).
“Statistical Inference” helps to determine the bias or efficiency of estimates, in order to see how good our samples are.

Back to “Important Questions #2”

What steps can we take to ensure that our sample is a good miniature copy of the population?
Answer: Sampling must be random (and representative), i.e., all elements of the population should have an equal chance of being picked in the sample.


Methods of Random Sampling

There are 4 major ways by which sampling can be carried out to ensure that the sample is random and yet represents a true miniature copy of the population:
- Simple Random Sampling
- Systematic Random Sampling
- Stratified (or Clustered) Random Sampling
- Combos of the above
The choice of any specific sampling technique above depends on
- the composition of the population
- the availability of sampling resources

Simple Random Sampling
This is just a simple selection of elements of the
population without regard to the nature of the
population.
Advantages:
- Less effort in preparations for the survey
- Less effort for conduct of the survey
- Is best when all elements in the population have similar characteristics (besides that under investigation).
Disadvantage: May not be truly representative of the population, especially if the population has diverse characteristics.


Systematic Random Sampling

This sampling method is …
Systematic in time (i.e., sampling elements from the population within specified time intervals, at the same location), or
Systematic in space (i.e., sampling elements from the population at selected locations at the same time).
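One common way to carry out systematic sampling is to pick a random starting point and then take every k-th element of an ordered list (ordered in time or in space). The slide does not prescribe this particular scheme, so the sketch below is an assumption; the function name `systematic_sample`, the 100 placeholder readings and the interval k = 10 are hypothetical.

```python
import random

def systematic_sample(items, k, seed=None):
    """Take every k-th item, starting from a random offset in [0, k)."""
    rng = random.Random(seed)
    start = rng.randrange(k)     # the random start keeps the scheme random
    return items[start::k]

# Hypothetical ordered readings, e.g. vehicles passing a counting station.
readings = list(range(100))
picked = systematic_sample(readings, 10, seed=1)
print(picked)
```

Because 100 readings are sampled at an interval of 10, the sample always contains exactly 10 readings, evenly spaced through the list.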



Stratified Random Sampling

This sampling method first divides the entire population into different groups, or strata, on the basis of certain characteristics of the population.
Next, a random sample is obtained within each stratum to obtain the desired sample size.
See illustration on next slide …


[Diagram: MAIN POPULATION divided into SUB-POPULATION #1, #2, #3 and #4, with a SAMPLE drawn from each sub-population]
Sub-populations may be of the same size or of different sizes


Stratified Random Sampling (continued)
A stratified sampling approach is most effective when three conditions are met:
 Variability within strata is minimized
 Variability between strata is maximized
 The variables upon which the population is stratified are strongly correlated with the desired dependent variable.
Advantage:
Stratified random sampling ensures that each group in the population is represented in the sample. It is therefore ideal for populations having diverse groups.

Disadvantage:
Relatively more preparation time is needed to calculate the proportion of each group in the population, and therefore to determine their proportions in the sample.
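The proportion calculation mentioned under the disadvantage can be sketched as proportional allocation: each stratum contributes to the sample in proportion to its share of the population. The function name `stratified_sample` and the three vehicle strata below are hypothetical, chosen only to make the arithmetic visible.

```python
import random

def stratified_sample(strata, total_n, seed=None):
    """Draw a simple random sample inside each stratum, sized by its population share."""
    rng = random.Random(seed)
    population_size = sum(len(units) for units in strata.values())
    chosen = {}
    for name, units in strata.items():
        # Proportional allocation; rounding may shift the overall total slightly.
        n_h = round(total_n * len(units) / population_size)
        chosen[name] = rng.sample(units, n_h)
    return chosen

# Hypothetical strata of unequal size (1000 units in all).
strata = {
    "cars": [f"car{i}" for i in range(600)],
    "buses": [f"bus{i}" for i in range(100)],
    "trucks": [f"truck{i}" for i in range(300)],
}
chosen = stratified_sample(strata, 50, seed=7)
sizes = {name: len(units) for name, units in chosen.items()}
print(sizes)
```

With a total sample of 50, the strata receive 30, 5 and 15 units respectively, so even the smallest group is represented.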
Combinations of the 3 major methods of random sampling.

Sampling schemes which are combinations of the 3 methods can also be used.
For example, you may decide to carry out a stratified and systematic random sampling of your population.


In Summary ...
- We can afford to take only a small sample from a large population of systems or system components in order to investigate the population.
- Our sample must, as much as possible, reflect the population from which it is drawn.
- Good sampling should be random and representative. Systematic and stratified sampling are useful to ensure that the sample is representative of the population.
- Only a good sample can result in accurate inferences and predictions about the population.


Introduction to Statistics

Types of Statistical Analysis
 Descriptive
◦ Graphical: Scaled Figures, Dot Plots, Scatter Plots, Box Plots, Stem-and-leaf Plots, Bar Charts/Histograms
◦ Non-graphical: Central Tendency, Dispersion/Variance, Range, Shape
 Inferential
◦ Point Estimation, Hypothesis Testing, Confidence Interval, Statistical Regression

Descriptive Statistics
◦ Statistical procedures used to summarise,
organise, and simplify data. This process
should be carried out in such a way that
reflects overall findings
 Raw data is made more manageable
 Raw data is presented in a logical form
 Patterns can be seen from organised data
 Frequency tables
 Graphical techniques
 Measures of Central Tendency
 Measures of Spread (variability)



Descriptive Measures
 Central Tendency measures. They are computed to give a “center” around which the measurements in the data are distributed.
 Variation or Variability measures. They describe “data spread”, or how far away the measurements are from the center.


Measures of Central Tendency

 Mean:
Sum of all measurements divided by the number
of measurements.

 Median:
A number such that at most half of the
measurements are below it and at most half of the
measurements are above it.

 Mode:
The most frequent measurement in the data.



Mean
 Sum of the values divided by the number of cases
$\bar{y} = \frac{\sum y_i}{n}$


Summation notation
 The $y_i$ ($y_1, y_2, \ldots, y_n$) are the n values of the variable Y
 The sum of the values is then denoted as
$\sum y_i = \sum_{i=1}^{n} y_i = y_1 + y_2 + \cdots + y_n$


Calculating the mean for high
temperatures
Date      High Temperature
2-Jan     59
3-Jan     60
4-Jan     43
5-Jan     42
6-Jan     35
7-Jan     32
8-Jan     32
9-Jan     46
10-Jan    41
11-Jan    52
Sum       442

 Add values: $\sum y_i = 442$
 Number of cases: $n = 10$
 Calculate mean: $\bar{y} = \frac{\sum y_i}{n} = \frac{442}{10} = 44.2$

Notice that every single observation enters into the computation of the mean.
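The three steps (add the values, count the cases, divide) can be checked with a few lines of Python. The variable names are arbitrary; the ten temperatures are the ones from the slide.

```python
# High temperatures for 2-Jan through 11-Jan, as listed on the slide.
temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

total = sum(temps)    # step 1: add the values
n = len(temps)        # step 2: number of cases
mean = total / n      # step 3: divide the sum by the number of cases

print(total, n, mean)  # prints: 442 10 44.2
```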
Median
 The median represents the middle of the
ordered sample data
 When the sample size is odd, the median
is the middle value
 When the sample size is even, the median
is the midpoint/mean of the two middle
values



Calculating the median for high
temperatures
Date      High Temperature
7-Jan     32
8-Jan     32
6-Jan     35
10-Jan    41
5-Jan     42   <=== Middle values
4-Jan     43   <=== Middle values
9-Jan     46
11-Jan    52
2-Jan     59
3-Jan     60

$\text{Median} = \frac{42 + 43}{2} = 42.5$
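The odd/even rule for the median can be written out directly. The helper `median_of` below is a hypothetical name used for illustration; Python's standard `statistics.median` follows the same rule and is used here only as a cross-check.

```python
import statistics

def median_of(values):
    """Middle value of the ordered data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
med = median_of(temps)
print(med, statistics.median(temps))  # prints: 42.5 42.5
```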
Mode
 The mode is the value that occurs most
frequently
 It is the least useful (and least used) of the
three measures of central tendency



Calculating the mode for high
temperatures
Date      High Temperature
2-Jan 59
3-Jan 60
4-Jan 43
5-Jan 42
6-Jan 35
7-Jan 32 <===Mode
8-Jan 32 <===Mode
9-Jan 46
10-Jan 41
11-Jan 52

mode = 32
Another Example of Mode

Measurements x: 3, 5, 5, 1, 7, 2, 6, 7, 0, 4
 In this case the data have two modes: 5 and 7
 Both measurements occur twice
 Notice that it is possible for a data set not to have any mode.
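Both mode examples can be reproduced with `statistics.multimode` from the Python standard library (available from Python 3.8), which returns every value that ties for the highest frequency. The variable names are arbitrary.

```python
import statistics

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]   # high-temperature data
measurements = [3, 5, 5, 1, 7, 2, 6, 7, 0, 4]      # two-mode example

temp_modes = statistics.multimode(temps)           # [32]
two_modes = statistics.multimode(measurements)     # [5, 7]

# When no value repeats, every value ties for "most frequent", so the
# function returns all of them; in the slide's terms there is no mode.
no_repeat = statistics.multimode([1, 2, 3])        # [1, 2, 3]

print(temp_modes, two_modes, no_repeat)
```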



Measures of central tendency and
levels of measurement
 Mean assumes numerical values and
requires interval data
 Median requires ordering of values and
can be used with both interval and ordinal
data
 Mode only involves determination of
most common value and can be used with
interval, ordinal, and nominal data



Comparison of mean and median
 Mean
◦ Uses all of the data
◦ Has desirable statistical properties
◦ Affected by extreme high or low values (outliers)
◦ May not best characterize skewed distributions
 Median
◦ Not affected by outliers
◦ May better characterize skewed distributions



The mean and median and the
distribution of the data
 For symmetric distributions, the mean and the median are the same
 For skewed distributions, the mean lies in the direction of the skew (the longer tail) relative to the median


Central Tendencies and Distribution Shape
Asymmetrical / Skewed distributions



Few Notes of Central Tendency
 When the Mean is greater than the Median, the data distribution is skewed to the Right.
 When the Median is greater than the Mean, the data distribution is skewed to the Left.
 When the Mean and Median are very close to each other, the data distribution is approximately symmetric.
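These three rules of thumb translate into a simple comparison of mean and median. The function `skew_hint` and its `tolerance` parameter (how close counts as "very close") are assumptions introduced for this sketch; the slide does not define a numerical threshold.

```python
import statistics

def skew_hint(data, tolerance=0.0):
    """Compare mean and median and report the skew direction suggested by the rule of thumb."""
    mean = statistics.mean(data)
    median = statistics.median(data)
    if mean - median > tolerance:
        return "skewed right"
    if median - mean > tolerance:
        return "skewed left"
    return "approximately symmetric"

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]
hint = skew_hint(temps)
print(hint)  # mean 44.2 > median 42.5, so this prints: skewed right
```

This is only a quick indication; a histogram or box plot of the data gives a fuller picture of the shape.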



Measures of variation
 Range
 Variance and standard deviation
 Interquartile range



Range
 Range is the difference between the maximum and minimum values


Calculating the range for high
temperatures

range = 60 – 32 = 28
Variance and standard deviation
 The variance $s^2$ is the sum of the squared deviations from the mean divided by the number of cases minus 1
$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$
 The standard deviation $s$ is the square root of the variance
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}}$
 It is a measure of “spread”
 Notice that the larger the deviations (positive or negative), the larger the variance

Variance (for a sample)

 Steps:
◦ Compute each deviation
◦ Square each deviation
◦ Sum all the squares
◦ Divide by the data size (sample size) minus
one: n-1



Calculating the variance and standard
deviation for high temperatures
Date   High Temperature   Difference (X - mean)   Difference Squared
2-Jan 59 14.80 219.04
3-Jan 60 15.80 249.64
4-Jan 43 -1.20 1.44
5-Jan 42 -2.20 4.84
6-Jan 35 -9.20 84.64
7-Jan 32 -12.20 148.84
8-Jan 32 -12.20 148.84
9-Jan 46 1.80 3.24
10-Jan 41 -3.20 10.24
11-Jan 52 7.80 60.84
Sum 442 931.60
n 10
Mean 44.2

$s^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1} = \frac{931.60}{10 - 1} = 103.51$
$s = \sqrt{\frac{\sum (y_i - \bar{y})^2}{n - 1}} = \sqrt{103.51} = 10.2$
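The table can be reproduced step by step: compute each deviation from the mean, square it, sum the squares, and divide by n − 1. The variable names below are arbitrary; `statistics.variance` from the standard library is used only as a cross-check.

```python
import math
import statistics

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

n = len(temps)
mean = sum(temps) / n                      # 44.2
deviations = [y - mean for y in temps]     # the "X - mean" column
squares = [d ** 2 for d in deviations]     # the "squared" column
sum_sq = sum(squares)                      # about 931.60
variance = sum_sq / (n - 1)                # about 103.51
std_dev = math.sqrt(variance)              # about 10.2

print(round(sum_sq, 2), round(variance, 2), round(std_dev, 1))
```

Dividing by n − 1 rather than n matches the sample formula used throughout these slides.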



Percentiles

 The pth percentile is a number such that at most p% of the measurements are below it and at most (100 – p)% of the measurements are above it.
 For example, if the 85th percentile of a certain data set is 340, then at most 85% of the measurements are below 340 and at most 15% of the measurements are above 340.
 Notice that the median is the 50th percentile
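Percentiles can be computed with `statistics.quantiles` from the Python standard library (Python 3.8+). Note that software packages use slightly different interpolation conventions, so values other than the median may differ a little between tools; the sketch below uses the function's default method and the high-temperature data from the earlier slides.

```python
import statistics

temps = [59, 60, 43, 42, 35, 32, 32, 46, 41, 52]

# n=100 returns the 99 cut points that split the data into 100 equal groups,
# so index p - 1 holds the p-th percentile.
cuts = statistics.quantiles(temps, n=100)
p50 = cuts[49]    # 50th percentile
p85 = cuts[84]    # 85th percentile

print(p50, statistics.median(temps))  # the 50th percentile equals the median
```

For these ten observations the 85th percentile falls between the two largest values, 59 and 60.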
