
PLT 6133 QUANTITATIVE DATA ANALYSIS

CORRELATION AND REGRESSIONS

Statistics may be regarded as a method of dealing with data. This definition stresses the view that statistics is a tool concerned with collection, organization and analysis of numerical facts and observations... The major concern of descriptive statistics is to present information in a convenient, usable, and understandable form.

- Richard Runyon & Audrey Haber

Summary of Major Types of Descriptive Statistics


TYPE OF TECHNIQUE: Univariate
STATISTICAL TECHNIQUE: Frequency distribution, measures of central tendency, std deviation
PURPOSE: Describe one variable

TYPE OF TECHNIQUE: Bivariate
STATISTICAL TECHNIQUE: Correlation, percentage table, chi-square
PURPOSE: Describe a relationship or the association between two variables

TYPE OF TECHNIQUE: Multivariate
STATISTICAL TECHNIQUE: Elaboration paradigm, linear and multiple regression
PURPOSE: Describe relationships among several variables, or see how several independent variables have an effect on a dependent variable

Three Broad Types of Research Questions:

Descriptive Research Questions

Associational Research Questions

Difference Research Questions

DESCRIPTIVE RESEARCH QUESTIONS

Descriptive Research Questions are not answered with inferential statistics. They merely describe or summarize data, without trying to generalize to a larger population of individuals. Examples: mean, percentage, standard deviation (SD), mode, median, etc.

INFERENTIAL STATISTICS rely on principles from probability sampling, whereby a researcher uses a random process to select cases from the entire population. Inferential statistics are a precise way to talk about how confident a researcher can be when inferring from the results in a sample to the population.

ASSOCIATIONAL RESEARCH QUESTIONS

Associational Research Questions are those in which 2 or more variables are associated or related. This approach usually involves an attempt to see how 2 or more variables covary (as one grows larger, the other grows larger or smaller) or how one or more variables enable one to predict another variable. Examples: Pearson correlation, Spearman correlation, eta correlation, etc.

DIFFERENCE RESEARCH QUESTIONS

Difference Research Questions: For these questions, we compare scores (on the dependent variable) of 2 or more different groups, each of which is composed of individuals with one of the values or levels of the independent variable. This type of question attempts to demonstrate that groups are not the same on the dependent variable. Examples: t-test, ANOVA, ANCOVA, MANOVA, MANCOVA, etc.

CORRELATION
The correlation is one of the most common and most useful statistics.
Definition - A correlation is a single number that describes the degree of relationship (dependence) between two variables. It characterizes the existence of a relationship between variables. Relationship between 2 variables can vary from strong to weak. More accurately, correlation is the co-variation of standardized variables.

However, correlation does not imply causation. A strong positive or strong negative correlation between 2 variables does not mean that one variable is caused by the other. Many statisticians stress that a strong correlation, by itself, never implies a cause-effect relationship between two variables.

GENERALLY
Two variables may correlate with each other in 3 possible ways:

Positive relationship: Both variables vary in the same direction; as one goes up, the other goes up. E.g. salary and years of education are positively correlated because people who get the highest salaries tend to be the ones who have gone to school the longest.

Negative relationship: The two variables vary in opposite directions; as one goes up, the other goes down. E.g. the number of problems faced and the amount of immunoglobulin A in a person's system are negatively correlated because as the number of problems goes up, the amount of immunoglobulin A tends to go down.

Zero relationship: The two variables have no relationship with each other; one changes without affecting the other. E.g. the average speed of a car being driven and the average speed of a mouse. Also, the relationship between personality fluctuations and the movement of distant stars has a zero correlation.

Degree of Correlation: How Strongly are variables correlated?


The degree of correlation between two variables can be established using two methods:
Scatter plot: a graph with plotted values for the two variables being compared.
Correlation coefficient methods.

SCATTER PLOTS

Scatter Plots - Example


Example of positive correlation - Cardiovascular fitness score and months of machine owned

Example of negative correlation - Hours of exercise per week and months of machine owned

Example of uncorrelated data - Height and months of machine owned

Scatter Plots - Example

Example of (a) weak and (b) strong correlation

Scatter Plots - Example


Researchers laid out 10 circular plots, each 4 meters in diameter, in an area where beavers were cutting down cottonwood trees. The number of stumps and the number of clusters of beetle larvae were recorded in each plot with the following results:
Stumps:         2   2   1   3   4   1   5   3   1   2
Beetle larvae:  10  30  12  24  40  11  56  40  8   14

Scatter Plots - Example


The scatter plot for the previous data:

From the scatter plot, there appears to be a fairly strong positive association between the number of cottonwood stumps and the number of clusters of beetle larvae.
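The visual impression from the scatter plot can be checked numerically. The following is a minimal Python sketch, not part of the original slides, that computes Pearson's correlation for the stump and beetle larvae counts from deviations about the means. The function name `pearson_r` and the variable names are illustrative assumptions.

```python
from math import sqrt

# Data from the beaver example: one value per plot
stumps = [2, 2, 1, 3, 4, 1, 5, 3, 1, 2]
larvae = [10, 30, 12, 24, 40, 11, 56, 40, 8, 14]

def pearson_r(xs, ys):
    """Pearson correlation computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of the deviations
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

r_beaver = pearson_r(stumps, larvae)
print(round(r_beaver, 2))  # about 0.92
```

A value this close to 1 is consistent with the fairly strong positive association seen in the plot.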

CORRELATION COEFFICIENT

Correlation coefficient
The correlation coefficient is used to measure the degree of correlation between variables. It is a quantitative indicator. There are several types of correlation coefficient, depending on the type of relationship.

The most common is Pearson's correlation coefficient (denoted by r), which is sensitive only to a linear relationship between two variables.
Other common types of correlation coefficient include Spearman's rank correlation coefficient (denoted by ρ) and Kendall's rank correlation coefficient (denoted by τ).

Correlation Coefficient
A correlation coefficient is a calculated number that indicates the degree of correlation between two variables:
Perfect positive correlation is a value of 1 (or 100%).
Perfect negative correlation is a value of -1.
A value of zero shows no correlation at all.

Correlation Coefficient
TABLE 1.0 Interpreting a Correlation Coefficient

Size of the correlation coefficient    General interpretation
0.8 to 1.0                             Very strong relationship
0.6 to 0.8                             Strong relationship
0.4 to 0.6                             Moderate relationship
0.2 to 0.4                             Weak relationship
0.0 to 0.2                             Weak or no relationship

Correlation Coefficient
A much more precise way to interpret the correlation coefficient: Computing the coefficient of determination. The coefficient of determination is the percentage of variance in one variable that is accounted for by the variance in the other variable. Coefficient of determination = Square of correlation coefficient

Example: If the correlation between GPA and the number of hours of study is 0.7, then the coefficient of determination is _______. This means _______% of the variance in GPA can be explained by the variance in studying time. The stronger the correlation, the more the variance can be explained. However, this means that _______ % cannot be explained. The amount of unexplained variance is called the coefficient of alienation (or coefficient of non-determination).
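The relationship between the correlation coefficient, the coefficient of determination and the coefficient of alienation can be sketched in a few lines of Python. This is an illustrative sketch only; the function name and the value r = 0.5 are assumptions chosen for demonstration, not values from the slides.

```python
def explained_and_unexplained(r):
    """Return (coefficient of determination, coefficient of alienation) for a correlation r."""
    if not -1 <= r <= 1:
        raise ValueError("a correlation coefficient must lie between -1 and 1")
    r_squared = r ** 2          # share of variance accounted for
    return r_squared, 1 - r_squared  # the remainder is unexplained

# Hypothetical correlation of 0.5 between two variables
explained, unexplained = explained_and_unexplained(0.5)
print(f"{explained:.0%} explained, {unexplained:.0%} unexplained")
# prints "25% explained, 75% unexplained"
```

Because the coefficient is squared, a correlation of -0.5 gives the same split as a correlation of 0.5.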

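Pearson's Correlation Coefficient


If we have a series of n measurements of X and Y written as x_i and y_i where i = 1, 2, ..., n, then the sample correlation coefficient can be used to estimate the population Pearson correlation between X and Y. The sample correlation coefficient r is written as:

$$ r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n - 1)\, s_x s_y} $$

where \bar{x} and \bar{y} are the sample means of X and Y, and s_x and s_y are the sample standard deviations of X and Y. This can also be written as:

$$ r = \frac{n \sum x_i y_i - \sum x_i \sum y_i}{\sqrt{n \sum x_i^2 - \left(\sum x_i\right)^2}\, \sqrt{n \sum y_i^2 - \left(\sum y_i\right)^2}} $$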

Correlation Coefficient Example


Is there a linear relationship between the age at which a child first begins to speak and his or her mental ability later on? To answer this question a study was conducted in which the age (in months) at which a child first spoke and the child's score on an aptitude test as a teenager were recorded:

Age:    15  26  10  9   15   20  18  11   8    20
Score:  95  71  83  91  102  87  93  100  104  94

Draw a scatter plot and determine whether there appears to be a linear relationship between these two variables. If so, describe the relationship, calculate r, and determine what percentage of the variability in the aptitude score can be explained by the variability in the age at which a child begins speaking.

Correlation Coefficient Example


The scatter plot for the data:

There appears to be a moderate negative association between the age at which a baby first begins to speak and mental ability later in life.

Correlation Coefficient Example


Calculation of the correlation coefficient:

$$ r = \frac{10(13676) - (152)(920)}{\sqrt{\left(10(2616) - 152^2\right)\left(10(85510) - 920^2\right)}} = -0.5973301213 \approx -0.60 $$


The variability in the age at which a child first speaks explains only about 36% (r^2 = 0.36) of the variability in aptitude test scores later in life.
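The calculation above can be reproduced with a short Python sketch that uses the computational (sums) form of Pearson's formula. The variable names are illustrative assumptions; the data pairs are the ten age and score values from the example, and their sums match the figures in the calculation (152, 920, 13676, 2616 and 85510).

```python
from math import sqrt

# Age (months) at first speech and aptitude score as a teenager
age = [15, 26, 10, 9, 15, 20, 18, 11, 8, 20]
score = [95, 71, 83, 91, 102, 87, 93, 100, 104, 94]

n = len(age)
sum_x = sum(age)                                  # 152
sum_y = sum(score)                                # 920
sum_xy = sum(x * y for x, y in zip(age, score))   # 13676
sum_x2 = sum(x * x for x in age)                  # 2616
sum_y2 = sum(y * y for y in score)                # 85510

# Computational form of the sample correlation coefficient
r_speech = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(round(r_speech, 2))       # -0.6
print(round(r_speech ** 2, 2))  # 0.36
```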

Exercise
Compute the correlation between the men's heights (in cm) and weights (in kg) for the following data:

Man    Height (X)    Weight (Y)
A      182           86
B      167           61
C      175           70
D      182           75
E      180           70

When is a correlation strong enough?

<0.2          slight; almost negligible relationship
0.2 to 0.4    low correlation; definite but small relationship
0.4 to 0.7    moderate correlation; substantial relationship
0.7 to 0.9    high correlation; marked relationship
>0.9          very high correlation; very dependable relationship

Words of Caution
Examine your data distribution (e.g. using a scatter plot) before you do anything with the correlation, and make sure you know the dos and don'ts of the correlation coefficient!
The correlation coefficient is just an index of relationship, which tells nothing about the cause and effect of the relationship!
Limit yourself to linear relationships if you don't have an adequate statistical background!

REGRESSION

Regression Analysis
In statistics, regression analysis is a statistical technique for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables.
More specifically, regression analysis helps one understand how the typical value of the dependent variable changes when any one of the independent variables is varied, while the other independent variables are held fixed. All regression analyses test whether a significant quantitative relationship exists.

Some Commonly Used Jargon


Linear Regression
Line of Best Fit
Regression Equation

The General Idea About Regression


Suppose we are asked to investigate the relationship between two variables, namely Variable P (the independent variable) and Variable Q (the dependent variable):

Pair      Variable P    Variable Q
Pair 1    10            7
Pair 2    20            12
Pair 3    30            17
Pair 4    40            22

What would be the predicted value of Q if P = 15? If P = 25? How do you predict these?

[Figure: scatter plot of Pair 1 to Pair 4, with the P variable (0 to 40) on the horizontal axis and the Q variable on the vertical axis]

Notice that if we connect these points, we would get a straight line. This line fits ALL the observed points. This straight line is called the line of best fit or regression line.

The line of best fit defines a basis for predicting values of Q, given values of P (and vice versa).
The concept of the line of best fit can be extended to form a basis for linear regression as well as non-linear regression.
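The prediction question for P = 15 and P = 25 can be sketched in Python. This is a minimal illustration, not the procedure from the slides: it fits a straight line through the four pairs by least squares and reads off the predicted Q values. The function names `fit_line` and `predict_q` are assumptions introduced here.

```python
# The four observed pairs from the example
p_values = [10, 20, 30, 40]
q_values = [7, 12, 17, 22]

def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return intercept, slope

intercept, slope = fit_line(p_values, q_values)

def predict_q(p):
    """Predicted Q for a given P using the fitted line."""
    return intercept + slope * p

print(intercept, slope)  # 2.0 0.5
print(predict_q(15))     # 9.5
print(predict_q(25))     # 14.5
```

Because the four points lie exactly on one straight line, the fitted line Q = 2 + 0.5P passes through all of them, and both predictions fall inside the observed range of P.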

Linear Regression

Non-Linear Regression

Regression Models
Regression models involve the following variables:
The unknown parameters, denoted as β, which may represent a scalar or a vector.
The independent variables, X.
The dependent variable, Y.

Regression models can predict a value of the Y variable given values of the X variables. Prediction within the range of values in the dataset used for model-fitting is known informally as interpolation. Prediction outside this range of the data is known as extrapolation.

Linear Regression
In linear regression, data is modeled using linear predictor functions, and unknown model parameters are estimated from the data.

Such models are called linear models. Most commonly, linear regression refers to a model in which the conditional mean of Y given the value of X is an affine function of X. Less commonly, linear regression could refer to a model in which the median, or some other quantile of the conditional distribution of Y given X is expressed as a linear function of X.
Like all forms of regression analysis, linear regression focuses on the conditional probability distribution of Y given X, rather than on the joint probability distribution of Y and X, which is the domain of multivariate analysis.

Non-Linear Regression
In non-linear regression, data are modeled by a function which is a non-linear combination of the model parameters and depends on one or more independent variables. As linear regression is much easier, some non-linear regression can be transformed or segmented to a linear regression.
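As an illustration of transforming a non-linear regression into a linear one, the sketch below assumes an exponential model y = a·exp(b·x). Taking logarithms gives ln y = ln a + b·x, which is linear in x, so an ordinary least squares line can be fitted to (x, ln y). The model, the data and all names here are hypothetical and are not taken from the slides.

```python
from math import exp, log

# Hypothetical data generated exactly from y = 2 * exp(0.5 * x)
xs = [0, 1, 2, 3, 4]
ys = [2.0 * exp(0.5 * x) for x in xs]

def fit_line(x_vals, y_vals):
    """Return (intercept, slope) of the least squares line through the points."""
    n = len(x_vals)
    mean_x = sum(x_vals) / n
    mean_y = sum(y_vals) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(x_vals, y_vals)) / sum(
        (x - mean_x) ** 2 for x in x_vals
    )
    return mean_y - slope * mean_x, slope

# Fit a straight line to (x, ln y), then transform the intercept back
log_intercept, b_hat = fit_line(xs, [log(y) for y in ys])
a_hat = exp(log_intercept)

print(round(a_hat, 6), round(b_hat, 6))  # 2.0 0.5
```

With real, noisy data the recovered parameters would only approximate the underlying values, and fitting on the log scale weights the errors differently from fitting the original curve directly.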

Method of least squares


The method of least squares gives a way to find the best estimate of a particular measurement or data, assuming that the errors (i.e. the differences from the true value) are random and unbiased. "Least squares" means that the overall solution minimizes the sum of the squares of the errors made in the results of every single equation. The best fit in the least-squares sense minimizes the sum of squared residuals, a residual being the difference between an observed value and the fitted value provided by a model.

Method of least squares: the line of best fit


The method of least squares calculates the line of best fit by minimising the sum of the squares of the vertical distances of the points to the line. Let's illustrate with a simple example.


Example - Method of least squares


Fit a least squares line to the following data.

X:  1  2  3  4  5
Y:  2  5  3  8  7

Example - Method of least squares


Solution:

X    Y    XY    X^2
1    2    2     1
2    5    10    4
3    3    9     9
4    8    32    16
5    7    35    25

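The equation of the least squares line is Y = a + bX. From the data, n = 5, \sum X = 15, \sum Y = 25, \sum XY = 88 and \sum X^2 = 55.

Normal equation for a:

$$ \sum Y = na + b \sum X \quad\Rightarrow\quad 25 = 5a + 15b \qquad (1) $$

Normal equation for b:

$$ \sum XY = a \sum X + b \sum X^2 \quad\Rightarrow\quad 88 = 15a + 55b \qquad (2) $$

To eliminate a from equations (1) and (2), multiply equation (1) by 3 and subtract it from equation (2). This gives 13 = 10b, so b = 1.3, and substituting into equation (1) gives a = 1.1. The equation of the least squares line becomes Y = 1.1 + 1.3X.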
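The worked solution can be checked with a short Python sketch that builds the same column sums and solves the two normal equations for a and b. The variable names are illustrative assumptions.

```python
# Data from the least squares example
x = [1, 2, 3, 4, 5]
y = [2, 5, 3, 8, 7]

n = len(x)
sum_x = sum(x)                                # 15
sum_y = sum(y)                                # 25
sum_xy = sum(xi * yi for xi, yi in zip(x, y)) # 88
sum_x2 = sum(xi * xi for xi in x)             # 55

# Normal equations:
#   sum_y  = n * a     + b * sum_x
#   sum_xy = a * sum_x + b * sum_x2
# Eliminating a gives the slope b; back-substitution gives the intercept a.
b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
a = (sum_y - b * sum_x) / n

print(round(a, 2), round(b, 2))  # 1.1 1.3

# Sum of squared residuals for the fitted line
ssr = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
```

The last line computes the quantity that the method of least squares minimises; any other straight line through these data would give a larger sum of squared residuals.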

Exercise
A researcher investigates the relationship between individuals' scores on a Reading Aptitude Test and the average number of hours they spend reading (simply called Hours). The data gathered from 10 students are as follows:

Student    Score on Reading Aptitude Test (X)    Hours (Y)
S1         20                                    5
S2         5                                     1
S3         5                                     2
S4         40                                    7
S5         30                                    8
S6         35                                    9
S7         5                                     3
S8         5                                     2
S9         15                                    5
S10        40                                    8

DO NOT WORRY ABOUT APPLYING THE EQUATIONS! You will use SPSS (Statistical Package for the Social Sciences) to obtain all the analyses.

The first step in any applied research is to get a good THEORETICAL grasp of the topic to be studied. The best data analysts don't start with the data; they start with theory.

THANK YOU

PREPARED BY ASSOC. PROF. DR NORMAH MULOP