
Basic Statistics

ECB-652

Note: Materials in these slides have been collected and adapted from different internet sources and books
Statistics

• Statistics
– The science of collecting, organizing, presenting,
analysing, and interpreting data to assist in making more
effective decisions

• Statistical analysis
– Used to manipulate, summarize, and investigate data so
that useful decision-making information results.
A Taxonomy of Statistics
Statistical Methods

• Descriptive statistics
– Methods of organizing, summarizing, and presenting data
in an informative way
• Inferential statistics
– The methods used to determine something about a
population on the basis of a sample
– Population –The entire set of individuals or objects of
interest or the measurements obtained from all individuals
or objects of interest
– Sample – A portion, or part, of the population of interest
Population and Sample
Inferential statistics

• Estimation
– e.g., Estimate the
population mean weight
using the sample mean
weight
• Hypothesis testing
– e.g., Test the claim that
the population mean
weight is 70 kg
Inference is the process of drawing conclusions or making decisions
about a population based on sample results
Statistical data
• The collection of data that are relevant to the problem being studied is
commonly the most difficult, expensive, and time-consuming part of the
entire research project.
• Statistical data are usually obtained by counting or measuring items.
  – Primary data are collected specifically for the analysis desired.
  – Secondary data have already been compiled and are available for
statistical analysis.
• A variable is an item of interest that can take on many different
numerical values.
• A constant has a fixed numerical value.
Data

Statistical data are usually obtained by counting or measuring items.
Most data can be put into the following categories:
• Qualitative
– Data are measurements that each fall into one of several
categories. (hair color, ethnic groups and other attributes
of the population)
• Quantitative
– Data are observations that are measured on a numerical
scale (distance traveled to college, number of children in
a family, etc.)
Qualitative data

Qualitative data are generally described by words or letters.
They are not as widely used as quantitative data because many
numerical techniques do not apply to qualitative data.
– For example, it does not make sense to find an average hair color or
blood type.
Qualitative data can be separated into two subgroups:
• Dichotomic, if it takes the form of a word with two options
(gender: male or female)
• Polynomic, if it takes the form of a word with more than two
options (education: primary school, secondary school, university)
Quantitative data

Quantitative data are always numbers and are the result of
counting or measuring attributes of a population.
Quantitative data can be separated into two subgroups:
• discrete
– if it is the result of counting (the number of students of a given ethnic
group in a class, the number of books on a shelf, ...)
• continuous
– if it is the result of measuring (distance traveled, weight of luggage,
…)
Types of variables

Variables
• Qualitative
  – Dichotomic (e.g., gender, marital status)
  – Polynomic (e.g., brand of PC, hair color)
• Quantitative
  – Discrete (e.g., children in a family, students in a class)
  – Continuous (e.g., amount of income tax paid, weight of a student)
Data in Economics

• Time Series
  – Data collected on the same variable over a period of time
  – e.g. GDP of India

• Cross-sectional Data
  – Data collected on the same variable for more than one unit at a
particular point of time
  – e.g. GDP of India, Pakistan, USA in 2010

• Panel Data
  – Data collected on more than one unit over a period of time (see the
sketch below)
  – e.g. GDP of India, Pakistan, USA from 1990 to 2010
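A rough Python sketch (not part of the original slides) of how the three data shapes differ. The GDP values below are hypothetical placeholders, not actual figures.

```python
# Illustrative sketch of the three data shapes; all numbers are made-up placeholders.

# Time series: one unit (India), one variable (GDP), several years.
time_series = {2008: 1.2, 2009: 1.3, 2010: 1.7}  # year -> GDP (hypothetical values)

# Cross-sectional data: several units, one variable, a single year (2010).
cross_section = {"India": 1.7, "Pakistan": 0.18, "USA": 15.0}

# Panel data: several units observed over several years.
panel = {
    "India":    {2008: 1.2,  2009: 1.3,  2010: 1.7},
    "Pakistan": {2008: 0.17, 2009: 0.17, 2010: 0.18},
    "USA":      {2008: 14.7, 2009: 14.4, 2010: 15.0},
}
```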
Statistical Description of Data

• Statistics describes a numeric set of data by its
  – Center
  – Variability
  – Shape
• Statistics describes a categorical set of data by
  – Frequency, percentage or proportion of each category
Frequency Distribution

Consider a data set of 26 children of ages 1-6 years. Then the frequency
distribution of variable ‘age’ can be tabulated as follows:

Frequency Distribution of Age

Age         1   2   3   4   5   6
Frequency   5   3   7   5   4   2

Grouped Frequency Distribution of Age

Age Group   1-2   3-4   5-6
Frequency    8    12     6
Cumulative Frequency

Cumulative frequency of the data on the previous slide:

Age                    1   2   3    4    5    6
Frequency              5   3   7    5    4    2
Cumulative Frequency   5   8   15   20   24   26

Age Group              1-2   3-4   5-6
Frequency               8    12     6
Cumulative Frequency    8    20    26
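As an illustration, a minimal Python sketch that reproduces the frequency and cumulative frequency tables above, assuming the raw ages are available as a list (the particular ordering below is hypothetical but consistent with the tabulated counts):

```python
from collections import Counter
from itertools import accumulate

# Ages of the 26 children; this arrangement is hypothetical but matches the counts above.
ages = [1]*5 + [2]*3 + [3]*7 + [4]*5 + [5]*4 + [6]*2

freq = Counter(ages)                   # {1: 5, 2: 3, 3: 7, 4: 5, 5: 4, 6: 2}
values = sorted(freq)
counts = [freq[v] for v in values]
cum_counts = list(accumulate(counts))  # [5, 8, 15, 20, 24, 26]

print("Age:                 ", values)
print("Frequency:           ", counts)
print("Cumulative frequency:", cum_counts)
```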
Descriptive Statistics

• Collect data
– e.g., Survey

• Present data
– e.g., Tables and graphs

• Summarize data
  – e.g., Sample mean: $\bar{x} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
Data Presentation

• Two types of statistical presentation of data: graphical and numerical.
• Graphical Presentation: We look for the overall pattern and for striking
deviations from that pattern. The overall pattern is usually described by
the shape, center, and spread of the data. An individual value that falls
outside the overall pattern is called an outlier.
• Bar diagrams and pie charts are used for categorical variables.
• Histograms, stem-and-leaf plots and box-plots are used for numerical
variables.
Data Presentation –Categorical Variable
• Bar Diagram: Lists the categories and presents the
percent or count of individuals who fall in each category.
Figure 1: Bar chart of subjects in treatment groups (y-axis: Number of
Subjects, 0-30; x-axis: Treatment Group 1-3)

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100
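A short matplotlib sketch that would draw a bar chart like Figure 1 from the frequency column above; the styling choices are mine, not from the slides.

```python
import matplotlib.pyplot as plt

groups = ["1", "2", "3"]        # treatment groups
frequencies = [15, 25, 20]      # number of subjects per group (from the table above)

plt.bar(groups, frequencies)
plt.xlabel("Treatment Group")
plt.ylabel("Number of Subjects")
plt.title("Bar Chart of Subjects in Treatment Groups")
plt.show()
```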
Data Presentation –Categorical Variable
• Pie Chart: Lists the categories and presents
the percent or count of individuals who fall in
each category.
Pie chart of subjects in treatment groups (slices: Group 1 = 25%,
Group 2 = 42%, Group 3 = 33%)

Treatment Group   Frequency   Proportion        Percent (%)
1                 15          (15/60) = 0.250   25.0
2                 25          (25/60) = 0.417   41.7
3                 20          (20/60) = 0.333   33.3
Total             60          1.000             100
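Similarly, a minimal matplotlib sketch for the pie chart; labels and percentage formatting are my own choices.

```python
import matplotlib.pyplot as plt

frequencies = [15, 25, 20]                 # subjects per treatment group (table above)
labels = ["Group 1", "Group 2", "Group 3"]

plt.pie(frequencies, labels=labels, autopct="%1.0f%%")  # shows 25%, 42%, 33%
plt.title("Pie Chart of Subjects in Treatment Groups")
plt.show()
```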


Numerical Presentation

• A fundamental concept in summary statistics is that of a central value for a
set of observations and the extent to which the central value characterizes
the whole set of data. Measures of central value such as the mean or
median must be coupled with measures of data dispersion (e.g., average
distance from the mean) to indicate how well the central value
characterizes the data as a whole.
• To understand how well a central value characterizes a set of
observations, consider the following two data sets:
  – A: 30, 50, 70
  – B: 40, 50, 60
• The mean of both data sets is 50, but the observations in data set A lie
farther from the mean than those in data set B. Thus, the mean of data
set B is a better representation of its data than the mean is for set A.
Methods of Center Measurement
• Center measurement is a summary measure of the overall level of a
dataset

• Commonly used methods are the mean, median, mode, geometric mean, etc.

• Mean: Sum all the observations and divide by the number of
observations. The mean of 20, 30, 40 is (20+30+40)/3 = 30.

• Notation: Let $x_1, x_2, \ldots, x_n$ be $n$ observations of a variable $x$.
Then the mean of this variable is

  $\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n} = \dfrac{\sum_{i=1}^{n} x_i}{n}$
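A minimal Python sketch of the mean formula above.

```python
def mean(observations):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(observations) / len(observations)

print(mean([20, 30, 40]))  # 30.0
```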
Methods of Center Measurement

• Median: The middle value in an ordered sequence of observations. That is,
to find the median we order the data set and then find the middle value.
In case of an even number of observations, the average of the two middle
values is the median.
  – For example, to find the median of {9, 3, 6, 7, 5}, we first sort the data
giving {3, 5, 6, 7, 9}, then choose the middle value, 6. If the number of
observations is even, e.g., {9, 3, 6, 7, 5, 2}, then the median is the
average of the two middle values from the sorted sequence, in this
case (5 + 6) / 2 = 5.5.

• Mode: The value that is observed most frequently. The mode is undefined
for sequences in which no observation is repeated.
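A small Python sketch of the median and mode rules described above; the helper names `median` and `mode` are mine.

```python
from collections import Counter

def median(observations):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    data = sorted(observations)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2

def mode(observations):
    """Most frequent value; returns None if no observation is repeated (mode undefined)."""
    value, count = Counter(observations).most_common(1)[0]
    return value if count > 1 else None

print(median([9, 3, 6, 7, 5]))     # 6
print(median([9, 3, 6, 7, 5, 2]))  # 5.5
print(mode([2, 3, 3, 5]))          # 3
```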
Mean or Median

• The median is less sensitive to outliers (extreme scores) than the mean
and is thus a better measure than the mean for highly skewed
distributions, e.g. family income.
  – For example, the mean of 20, 30, 40, and 990 is (20+30+40+990)/4 = 270.
  – The median of these four observations is (30+40)/2 = 35.
  – Here 3 observations out of 4 lie between 20 and 40, so the mean of 270
fails to give a realistic picture of the major part of the data; it is pulled
up by the extreme value 990.
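The worked example can be checked with Python's standard statistics module.

```python
import statistics

data = [20, 30, 40, 990]
print(statistics.mean(data))    # 270
print(statistics.median(data))  # 35.0
```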
Methods of Variability Measurement

• Variability (or dispersion) measures the amount of scatter in a dataset.
  – Commonly used methods: range, variance, standard deviation,
interquartile range, coefficient of variation, etc.

• Range: The difference between the largest and the smallest observations.
The range of 10, 5, 2, 100 is (100 − 2) = 98. It is a crude measure of
variability.
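A one-line Python sketch of the range computation above.

```python
def data_range(observations):
    """Range: largest observation minus the smallest."""
    return max(observations) - min(observations)

print(data_range([10, 5, 2, 100]))  # 98
```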
Methods of Variability Measurement

• Variance: The variance of a set of observations is the average of the
squares of the deviations of the observations from their mean. In symbols,
the variance of the n observations $x_1, x_2, \ldots, x_n$ is

  $S^2 = \dfrac{(x_1 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$

• Variance of 5, 7, 3? The mean is (5+7+3)/3 = 5 and the variance is

  $\dfrac{(5-5)^2 + (3-5)^2 + (7-5)^2}{3-1} = 4$

• Standard Deviation: The square root of the variance. The standard
deviation of the above example is 2.
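A short Python sketch that reproduces the variance and standard deviation example above, dividing by n − 1 as in the formula.

```python
import math

def sample_variance(observations):
    """Sample variance: sum of squared deviations from the mean, divided by n - 1."""
    n = len(observations)
    xbar = sum(observations) / n
    return sum((x - xbar) ** 2 for x in observations) / (n - 1)

def sample_std(observations):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(observations))

print(sample_variance([5, 7, 3]))  # 4.0
print(sample_std([5, 7, 3]))       # 2.0
```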
Shape of Data

• Shape of data is measured by
  – Skewness
  – Kurtosis
Skewness

• Measures the asymmetry of the data
  – Positive or right skewed: longer right tail
  – Negative or left skewed: longer left tail
  – Skewness of a normal distribution is 0

• Let $x_1, x_2, \ldots, x_n$ be n observations. Then

  $\text{Skewness} = \dfrac{\sqrt{n}\,\sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$
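A Python sketch of the skewness formula above; note that the √n factor in the numerator is the standard sample-skewness convention assumed here. The example data are made up.

```python
def skewness(observations):
    """Sample skewness: sqrt(n) * sum((x - xbar)^3) / (sum((x - xbar)^2))^(3/2)."""
    n = len(observations)
    xbar = sum(observations) / n
    m3 = sum((x - xbar) ** 3 for x in observations)
    m2 = sum((x - xbar) ** 2 for x in observations)
    return (n ** 0.5) * m3 / m2 ** 1.5

print(skewness([1, 2, 3, 4, 5]))   # 0.0 (symmetric data)
print(skewness([1, 1, 1, 1, 10]))  # positive (long right tail)
```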
Kurtosis

• Measures the peakedness of the distribution of the data. The kurtosis of a
normal distribution is 3.

• Let $x_1, x_2, \ldots, x_n$ be n observations. Then the excess kurtosis
(kurtosis minus 3, which equals 0 for a normal distribution) is

  $\text{Kurtosis} = \dfrac{n \sum_{i=1}^{n} (x_i - \bar{x})^4}{\left(\sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{2}} - 3$
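A matching Python sketch of the excess-kurtosis formula above; the example data are made up.

```python
def excess_kurtosis(observations):
    """Excess kurtosis: n * sum((x - xbar)^4) / (sum((x - xbar)^2))^2, minus 3."""
    n = len(observations)
    xbar = sum(observations) / n
    m4 = sum((x - xbar) ** 4 for x in observations)
    m2 = sum((x - xbar) ** 2 for x in observations)
    return n * m4 / m2 ** 2 - 3

print(excess_kurtosis([2, 4, 4, 4, 5, 5, 7, 9]))  # about -0.22 for this made-up sample
```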
Bivariate: Covariance and Correlation
• Covariance
– Covariance provides insight into how two variables are
related to one another.
– More precisely, covariance refers to the measure of how
two random variables in a data set will change together.
– A positive covariance means that the two variables at hand
are positively related, and they move in the same
direction.
– A negative covariance means that the variables are
inversely related, or that they move in opposite directions
  $\text{Cov}(x, y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

• where x and y are the variables, n is the number of observations, and
$\bar{x}$ and $\bar{y}$ are the means of x and y.
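A minimal Python sketch of the sample covariance formula above; the example data are made up.

```python
def covariance(x, y):
    """Sample covariance: sum of (x_i - xbar)(y_i - ybar), divided by n - 1."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

# Made-up example: y rises with x, so the covariance is positive.
print(covariance([1, 2, 3, 4], [2, 4, 6, 8]))  # 3.33...
```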
Correlation
• Correlation is a bivariate analysis that measures the strength of
association between two variables and the direction of the linear
relationship.
– In terms of the strength of relationship, the value of the correlation
coefficient varies between +1 and -1.
– A value of ± 1 indicates a perfect degree of association between the two
variables.
– As the correlation coefficient value goes towards 0, the relationship
between the two variables will be weaker.
– The direction of the relationship is indicated by the sign of the
coefficient
– A + sign indicates a positive relationship and a – sign indicates a negative
relationship.
– Correlation is symmetrical in nature
• the correlation between X and Y is the same as the correlation between Y and X
– $r = \dfrac{N \sum xy - (\sum x)(\sum y)}{\sqrt{\left[N \sum x^2 - (\sum x)^2\right]\left[N \sum y^2 - (\sum y)^2\right]}}$
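A Python sketch of Pearson's r using the computational formula above; the example data are made up.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula above."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(xi * yi for xi, yi in zip(x, y))
    sum_x2 = sum(xi ** 2 for xi in x)
    sum_y2 = sum(yi ** 2 for yi in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # 1.0  (perfect positive relationship)
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # -1.0 (perfect negative relationship)
```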
Regression

• Correlation: association between two variables.
• If the two variables are related, it means that when one changes by a
certain amount the other changes, on average, by a certain amount.
• By using this relationship one may predict one variable from the other.
• One variable may depend on the other variable.
• If y represents the dependent variable and x the independent
variable, this relationship is described as the regression of y on x.
• The relationship can be represented by a simple equation
called the regression equation.
• In this context "regression" simply means that the average
value of y is a "function" of x, that is, it changes with x.
Regression

• Equation: y = a + bx
  – b is the gradient, slope or regression coefficient
  – a is the intercept of the line on the y-axis, or regression constant
  – y is a value of the outcome (dependent) variable
  – x is a value of the predictor (independent) variable
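A rough Python sketch (not from the slides) of fitting y = a + bx by least squares, using the standard formulas b = Cov(x, y)/Var(x) and a = ȳ − b·x̄; the data are made up for illustration.

```python
def simple_regression(x, y):
    """Least-squares fit of y = a + b*x: b = cov(x, y) / var(x), a = ybar - b * xbar."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sxy / sxx
    a = ybar - b * xbar
    return a, b

# Made-up example: y is roughly 1 + 2x plus a little noise.
a, b = simple_regression([1, 2, 3, 4, 5], [3.1, 4.9, 7.2, 9.0, 11.1])
print(a, b)  # intercept close to 1, slope close to 2
```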
Covariance Vs Correlation
• A measure that indicates the extent to which two random variables change
in tandem is known as covariance. A measure that represents how strongly
two random variables are related is known as correlation.
• Covariance is an unscaled measure of co-movement; correlation is the
scaled (standardized) form of covariance.
• The value of the correlation lies between −1 and +1. Conversely, the
value of the covariance can lie anywhere between −∞ and +∞.
• Covariance is affected by a change in scale: if all the values of one
variable are multiplied by a constant and all the values of the other variable
are multiplied by the same or a different constant, then the covariance
changes. In contrast, correlation is not influenced by a change in scale.
• Correlation is dimensionless, i.e. it is a unit-free measure of the
relationship between variables, whereas covariance is expressed in the
product of the units of the two variables.
Regression and Correlation
Basis for comparison of correlation and regression:
• Meaning
  – Correlation: a statistical measure which determines the co-relationship
or association of two variables.
  – Regression: describes how an independent variable is numerically
related to the dependent variable.
• Usage
  – Correlation: to represent the linear relationship between two variables.
  – Regression: to fit a best line and estimate one variable on the basis of
another variable.
• Dependent and independent variables
  – Correlation: no difference between the variables.
  – Regression: the two variables are different (one dependent, one
independent).
• Indicates
  – Correlation: the correlation coefficient indicates the extent to which two
variables move together.
  – Regression: indicates the impact of a unit change in the known variable
(x) on the estimated variable (y).
• Objective
  – Correlation: to find a numerical value expressing the relationship
between variables.
  – Regression: to estimate values of a random variable on the basis of the
values of a fixed variable.
