
BTP - Analytics

Analytics Team
Mjunction Services

confidential - mjunction services limited


Data Science
• Explaining the Past: Exploration
– Univariate
– Bivariate
• Predicting the Future: Modelling
– Classification
– Regression
– Clustering
– Association Rule


Data Exploration
• Univariate
– Categorical: Count, Count %; Pie Chart, Bar Chart
– Numeric: Min, Max, Mean, Median, Mode; Range, Quartile, Variance, Standard Deviation, Coef of Variation; Skewness, Kurtosis; Histograms, Box Plots
• Bivariate
– Categorical & Categorical: Chi Square test; Bar Chart, 2 Y Axis Plot
– Numerical & Numerical: Correlation; Scatter Plot
– Numerical & Categorical: Z test, T test, Anova; Bar and Line Chart, 2 Y Axis Plot


Classification
• Frequency table: ZeroR, OneR, Naïve Bayesian, Decision Tree
• Covariance Matrix: LDA, Logistic Regression
• Similarity Function: K Nearest Neighbors
• Others: Artificial Neural Networks, SVM (Support Vector Machines)


Regression
• Frequency table: Decision Trees
• Covariance Matrix: Multi Linear Regression
• Similarity Function: K Nearest Neighbors
• Others: Artificial Neural Networks, SVM (Support Vector Machines)

Clustering
• Hierarchical: Agglomerative, Divisive
• Partitive: K Means, Self Organizing Maps
Solving an Optimization Problem
using Excel Solver



Zooter Industries: Products, Profits, Demand

• Zooter Industries (ZI) manufactures high-end kick-scooters for the North American market
• ZI's main product models are Razor and Navajo, with profit contributions of $150 and $160 per unit, respectively
• At present, ZI's scooters are so popular that the company can sell all the units it makes


Zooter Industries: Manufacturing Process
• The production process for each model includes three main steps:
– frame manufacturing
– wheels and deck assembly
– quality assurance and packaging

Each unit of the two scooter models requires the following processing times in these production steps:

Model     Frame Manufacturing (hours)   Wheels and Deck Assembly (hours)   Quality Assurance and Packaging (hours)
Razor     4                             1.5                                1
Navajo    5                             2                                  0.8


Zooter Industries: Supply Side
ZI's capacity available at each production step is shown below for the coming week

Production Step                    Available Time in Coming Week (hours)
Frame Manufacturing                5610
Wheels and Deck Assembly           2200
Quality Assurance and Packaging    1200

How many units of each model should ZI produce in the coming week in order to
maximize its weekly profit?
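
The deck solves this problem with Excel Solver. As a cross-check, here is a minimal Python sketch of the same linear program. It does not reproduce Solver's algorithm: it uses the corner-point method (with two decision variables, an optimum lies at a vertex of the feasible region), and the names `PROFIT`, `CONSTRAINTS`, `intersect`, `feasible` and `solve` are illustrative choices, not part of the original material.

```python
from itertools import combinations

# Objective: maximize 150*R + 160*N (profit contributions from the slides).
PROFIT = (150.0, 160.0)

# Each constraint is (a, b, c), meaning a*R + b*N <= c.
CONSTRAINTS = [
    (4.0, 5.0, 5610.0),   # frame manufacturing hours
    (1.5, 2.0, 2200.0),   # wheels and deck assembly hours
    (1.0, 0.8, 1200.0),   # quality assurance and packaging hours
    (-1.0, 0.0, 0.0),     # R >= 0
    (0.0, -1.0, 0.0),     # N >= 0
]

def intersect(c1, c2):
    """Point where two constraint boundaries cross (Cramer's rule), or None if parallel."""
    a1, b1, r1 = c1
    a2, b2, r2 = c2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        return None
    return ((r1 * b2 - r2 * b1) / det, (a1 * r2 - a2 * r1) / det)

def feasible(point, tol=1e-6):
    """True if the point satisfies every constraint (within a small tolerance)."""
    r, n = point
    return all(a * r + b * n <= c + tol for a, b, c in CONSTRAINTS)

def profit(point):
    return PROFIT[0] * point[0] + PROFIT[1] * point[1]

def solve():
    """Enumerate the feasible corner points and return the most profitable one."""
    corners = []
    for c1, c2 in combinations(CONSTRAINTS, 2):
        point = intersect(c1, c2)
        if point is not None and feasible(point):
            corners.append(point)
    best = max(corners, key=profit)
    return best, profit(best)

best_plan, best_profit = solve()
print(f"Razor: {best_plan[0]:.1f}, Navajo: {best_plan[1]:.1f}, profit: ${best_profit:,.0f}")
```

With the numbers from the slides, the best corner point is 840 Razor and 450 Navajo units, for a weekly profit of $198,000; frame manufacturing and quality assurance are fully used, while assembly has slack. Corner enumeration only scales to tiny problems, which is why a solver is used in practice.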



Assuming Away Uncertainty: Pros and Cons

• The Zooter example treats profit contributions, manufacturing requirements and supply availabilities as non-random quantities
• If ZI decides to make a certain number of units of each scooter model in the coming week, it will know for sure
– how much profit it will make
– whether it will have sufficient supply of each resource

• The “no uncertainty” assumption simplifies the search for the best production plan
• In practice, it allows us to tackle analytics models with large numbers of products and resources

• We will be using Excel Solver to solve this optimization problem



Sample Designs, Procedures
And Hypothesis Testing



Sampling Terminology

Population: The total collection of elements about which we want to make inferences

Sample: Subset of the population

Census: Investigation of all individual elements that make up the population

Sampling: The process of using a small number of items or parts of a larger population to draw conclusions about the whole population


Sample Selection

Population, sample and individual cases


Why Sampling?

Sampling is a valid alternative to a census when:
• a survey of the entire population is impractical
• budget and time constraints restrict data collection
• results from data collection are needed quickly


Sampling Techniques



Sampling Techniques: Probability Sampling
Known, nonzero probability of selection for every element. Used in statistical and industry-based research.

• Simple random sampling: A sampling procedure that ensures that each element in the population has an equal chance of being included in the sample. It is easy to implement, but it costs both time and money when larger samples are required. It is the most widely used sampling technique.

• Systematic sampling: A simple process in which every nth name in the list is drawn. The output can be biased and skewed if the list has a periodic pattern. Used in research.

• Stratified sampling: Subsamples are drawn within different strata. Each stratum is more or less homogeneous on some characteristic. The process is expensive. Used in research.

• Cluster sampling: The purpose of cluster sampling is to sample economically while retaining the characteristics of a probability sample. The sampling unit is no longer an individual element of the population but a cluster of elements. The U.S. uses this sampling technique to create clusters for its population.
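
The four probability sampling techniques above can be sketched in a few lines of Python. This is an illustration only: the population of 100 customers in 4 regions is hypothetical, and the function names are chosen for this sketch rather than taken from any library.

```python
import random

def simple_random_sample(population, n, rng):
    """Each element has an equal chance of being included in the sample."""
    return rng.sample(population, n)

def systematic_sample(population, step, rng):
    """Start at a random offset, then draw every step-th element of the list."""
    start = rng.randrange(step)
    return population[start::step]

def stratified_sample(population, stratum_of, n_per_stratum, rng):
    """Draw a simple random subsample within each stratum."""
    strata = {}
    for element in population:
        strata.setdefault(stratum_of(element), []).append(element)
    sample = []
    for members in strata.values():
        sample.extend(rng.sample(members, min(n_per_stratum, len(members))))
    return sample

def cluster_sample(clusters, n_clusters, rng):
    """Select whole clusters at random and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [element for name in chosen for element in clusters[name]]

rng = random.Random(2018)  # fixed seed so the sketch is repeatable

# Hypothetical population: 100 customers spread evenly over 4 regions.
regions = ["north", "south", "east", "west"]
population = [(i, regions[i % 4]) for i in range(100)]
clusters = {r: [p for p in population if p[1] == r] for r in regions}

srs = simple_random_sample(population, 10, rng)
systematic = systematic_sample(population, 10, rng)
stratified = stratified_sample(population, lambda p: p[1], 3, rng)
clustered = cluster_sample(clusters, 2, rng)
print(len(srs), len(systematic), len(stratified), len(clustered))
```

Note the difference between the last two functions: stratified sampling draws a few elements from every group, while cluster sampling keeps every element of a few randomly chosen groups.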



Sampling Techniques: Non-Probability Sampling
Used when the probability of selecting a unit is unknown. It is mostly used in marketing research.

• Convenience sampling: A sampling procedure where element selection is based on ease of accessibility. Convenience samples are the least reliable but the cheapest and easiest to conduct. Street interviews are the best example.

• Judgment sampling: An experienced individual selects the sample based on his or her judgment about some appropriate characteristics required of the sample member.

• Quota sampling: Ensures that the various subgroups in a population are represented on pertinent sample characteristics.

• Snowball sampling: Initial respondents are selected by probability methods. Additional respondents are obtained from information provided by the initial respondents.


Hypothesis Testing

What is a hypothesis?

• An assumption about the population parameter.
• A population parameter is a characteristic of a population, such as its variance or mean.
• The parameter(s) must be identified beforehand. Example: in SAS, when two lamps picked at random out of 10 are defective, is the entire batch defective or not?
Hypothesis Testing

Example: Assume the average weight of the packets is 10 kg. This is our initial hypothesis.

• To test the validity of the assumed or hypothetical value of a population parameter, we gather sample data and determine the difference between our hypothesized value and the actual sample value.
• Then we judge whether the difference is significant or not. If the difference is not significant, we retain our hypothesized value (here the mean weight); otherwise our hypothesis is rejected.
Procedure of Hypothesis Testing

Hypothesize: Establish H0 and Hα.

Test: Determine the statistical test, set alpha (the probability of a Type I error), establish a decision rule, then gather and analyse the data.

Take a statistical action to reach a conclusion.

Determine the business implication.
Hypothesis Testing

Null Hypothesis
• The assumption that we wish to test is the Null Hypothesis (H0: read as H-nought). We begin with the assumption that what has been happening is correct.
• E.g.: The average weight of students in a class is 58 kg.
• (H0: µ = 58)
Alternative Hypothesis
• The radical claim that the new theory is correct or that there are changes happening in the said system. It is stated as the Alternative Hypothesis (Hα: read as H-alpha).
• E.g.: The average weight of students in a class is not 58 kg.
• (Hα: µ ≠ 58)
Decision Making: Rejection and Non-Rejection Region Approach
Test Statistic: The sample statistic one uses to either reject H0 (and conclude Hα) or not reject H0.

If the null hypothesis is rejected, statistically it means that the test statistic lies in the rejection region.

If the null hypothesis is not rejected, statistically it means that the test statistic lies in the non-rejection region.

Critical values are the boundary values of the rejection region; they determine whether the null hypothesis will be rejected or not.

P-value: The p-value (or probability value) is the probability that the test statistic equals the observed value or a more extreme value, under the assumption that the null hypothesis is true.
Conditions for Hypothesis Testing
• If a z-test for one proportion with sample size n: n × p0 and n × (1 − p0) must both be large enough for the normal approximation to hold (rules of thumb commonly require at least 5 or 10 each).

• If a t-test: the data comes from an approximately normal distribution or the sample size is at least 30.

• Decide the significance level α.

• Compute the value of the test statistic:

Z* = (p-hat − p0) / sqrt(p0 × (1 − p0) / n)

where p-hat is the sample proportion and p0 is the population proportion.

t* = (x-bar − µ0) / (S / sqrt(n))

where x-bar is the sample mean, µ0 is the population mean, S is the standard deviation of the sample and n is the sample size.
When to use what..
Example:
We take a random sample of 500 Penn State students and find that 278 are from Pennsylvania. Can we conclude that the proportion is larger than 0.5 at a 5% level of significance? That is, can we conclude that a majority of the students are from Pennsylvania?
Step 1: Use the one-proportion z-test, since the hypothesized value p0 is 0.5 (population proportion) and we can check that n × p0 = 500 × 0.5 = 250 and n × (1 − p0) = 250 are both large enough.

Hence, setting up the hypotheses: H0: p = 0.5 versus Hα: p > 0.5

Step 2: Decide the significance level, α = 0.05

Step 3: Compute the value of the test statistic, where p-hat is 278/500 = 0.556 (sample proportion):
Z* = (p-hat − p0) / sqrt(p0 × (1 − p0) / n) = (0.556 − 0.5) / sqrt(0.5 × 0.5 / 500) = 2.504
When to use what..
Step 4: Find the appropriate critical value for the test using the z-table. From the table, Z0 = 1.645 is the critical value. Because the question is whether the proportion is larger than 0.5, this is a right-tailed test, and the rejection region is given by: Z* > 1.645

Step 5: Check whether the value of the test statistic falls in the rejection region. If it does, then reject H0 in favour of Hα.
The observed Z-value is 2.504; this is our test statistic Z*. Since Z* falls within the rejection region, we reject H0 in favour of Hα.

Step 6: Conclude that a majority of the students are from Pennsylvania.
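
The arithmetic of this example can be checked with a short Python sketch. The critical value 1.645 is taken from the slide; the p-value line is an extra computed from the standard normal CDF via `math.erf`, and the function names are illustrative.

```python
import math

def one_proportion_z(successes, n, p0):
    """z statistic for H0: p = p0, using the standard error under H0."""
    p_hat = successes / n
    standard_error = math.sqrt(p0 * (1 - p0) / n)
    return (p_hat - p0) / standard_error

def normal_cdf(z):
    """Cumulative distribution function of the standard normal distribution."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

z_star = one_proportion_z(successes=278, n=500, p0=0.5)
critical_value = 1.645                # right-tailed test at alpha = 0.05 (from the slide)
p_value = 1 - normal_cdf(z_star)      # right-tail probability under H0
reject_h0 = z_star > critical_value

print(f"z* = {z_star:.3f}, p-value = {p_value:.4f}, reject H0: {reject_h0}")
```

The rejection-region approach (z* compared with 1.645) and the p-value approach (p-value compared with 0.05) always lead to the same decision.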

When to use what..

Example:
The mean length of the lumber is supposed to be 8.5 feet. A builder wants to check whether the shipment of lumber she receives has a mean length different from 8.5 feet. The builder observes that the sample mean of 61 pieces of lumber is 8.3 feet, with a sample standard deviation of 1.2 feet. What will she conclude? Conduct this test at a 1% level of significance (α).

Step 1: Use the t-test, since the sample size is 61 > 30. Then set up the hypotheses: H0: µ = 8.5 versus Hα: µ ≠ 8.5

Step 2: Decide the significance level, α = 0.01

Step 3: Compute the value of the test statistic:
t* = (x-bar − µ0) / (S / sqrt(n)) = (8.3 − 8.5) / (1.2 / sqrt(61)) = −1.3

Step 4: Find the appropriate critical values for the test using the t-table. From the table, with 60 degrees of freedom, the critical value is 2.660. The rejection region for the two-tailed test is given by: t* < −2.660 or t* > 2.660
When to use what..

Step 5: Check whether the value of the test statistic falls in the rejection region. If it does, then reject H0 (and conclude Hα). If it does not fall in the rejection region, do not reject H0.
The observed t-value is −1.3; this is our test statistic t*. Since t* does not fall within the rejection region, we fail to reject H0.

Step 6: Conclude that, with a test statistic of −1.3 and critical values of ±2.660 at a 1% level of significance, there is not enough evidence that the mean length of lumber differs from 8.5 feet.
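
A minimal Python sketch of the lumber calculation follows. The critical value 2.660 is copied from the slide (the standard library has no t-distribution quantile function), and the function name is illustrative.

```python
import math

def one_sample_t(sample_mean, mu0, sample_sd, n):
    """t statistic for H0: mu = mu0, with n - 1 degrees of freedom."""
    return (sample_mean - mu0) / (sample_sd / math.sqrt(n))

t_star = one_sample_t(sample_mean=8.3, mu0=8.5, sample_sd=1.2, n=61)
critical_value = 2.660                     # two-tailed, alpha = 0.01, 60 df (from the slide)
reject_h0 = abs(t_star) > critical_value   # two-tailed rejection region

print(f"t* = {t_star:.2f}, reject H0: {reject_h0}")
```

Since |t*| is about 1.30, well below 2.660, the sketch does not reject H0: the sample gives no significant evidence that the mean length differs from 8.5 feet.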

Type I and Type II Errors

Rejecting a true null hypothesis is a Type I error.
Example: An innocent man is sent to jail. Or, an employee is fired for stealing from the company without enough evidence.
Probability(Type I error) = significance level = α

Failing to reject a false null hypothesis is a Type II error.
Example: A guilty man is let go. Or, an employee who is actually stealing cannot be fired due to lack of evidence.
Probability(Type II error) = β; the power of the test is 1 − β
Analysis of Variance (ANOVA)
When to use ANOVA?
To compare the mean values of a certain characteristic among two or more groups. Basically, to see whether two or more groups are equal (or different) on a given metric characteristic.

H0 in ANOVA:
There are no differences among the mean values of the groups being compared (i.e., the group means are all equal):
H0: µ1 = µ2 = µ3 = … = µk

Hα (conclusion if H0 is rejected):
Not all group means are equal (i.e., at least one group mean is different from the rest).

Scenario 1. When comparing 2 groups, it is a one-step test: Group A and Group B
Step 1: Check to see if the two groups are different or not, and if so, how.

Scenario 2. When comparing 3 or more groups, if H0 is rejected, it is a two-step test: Group A, Group B and Group C
Step 1: Overall test that examines if all groups are equal or not. And, if not all are equal (H0 rejected), then:
Step 2: Pair-wise (post-hoc) comparison tests to see where (i.e., among which groups) the differences exist, and how.
Logic Behind an Analysis of Variance (ANOVA)
In ANOVA you take 1 continuous (“response”) variable and 1 categorical (“factor”) variable and test the null hypothesis that all group means for the categorical variable are equal.

Example: Analyse the effects of the machine operator on the valve opening measurements of valves produced in a manufacturing plant. The measurements for the openings of 24 valves randomly selected from an assembly line are given in the table below. The mean opening is 6.34 centimeters (cm).

Question: Why do the valve openings vary? And how can the variation be explained?
Logic Behind an Analysis of Variance (ANOVA)

Independent (Factor) Variable:
The machine operator.
Treatment/Classification Levels of the independent variable:
The 4 machine operators 1, 2, 3 and 4

Dependent (Response) Variable:
The opening measurement of the valves.

Our Hypothesis:
H0: µ1 = µ2 = µ3 = µ4 (the mean valve opening is the same for all four operators)
Hα: at least one operator's mean valve opening is different from the rest
Logic Behind an Analysis of Variance (ANOVA)

SST = SSC + SSE
Where SST: Total Sum of Squares. It measures all the variation in the dependent or response variable.

SSC: Sum of Squares of columns, i.e. the variation between the columns/treatments.

SSE: Sum of Squares of errors, i.e. the variation within the columns/treatments.

In ANOVA, the F-statistic is the ratio of between-group variability to within-group variability:
F = (SSC / (C − 1)) / (SSE / (N − C)), where C is the number of treatment levels and N is the total number of observations.
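
The decomposition SST = SSC + SSE and the F ratio can be sketched in plain Python. The three groups below are made-up numbers for illustration, not the valve measurements from the example, and `one_way_anova` is a name chosen for this sketch.

```python
def one_way_anova(groups):
    """Return (SST, SSC, SSE, F) for a list of groups of observations."""
    all_values = [value for group in groups for value in group]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    group_means = [sum(group) / len(group) for group in groups]

    # Total variation of every observation around the grand mean.
    sst = sum((value - grand_mean) ** 2 for value in all_values)
    # Between-group variation: group means around the grand mean.
    ssc = sum(len(group) * (mean - grand_mean) ** 2
              for group, mean in zip(groups, group_means))
    # Within-group variation: observations around their own group mean.
    sse = sum((value - mean) ** 2
              for group, mean in zip(groups, group_means) for value in group)

    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ssc / df_between) / (sse / df_within)
    return sst, ssc, sse, f_stat

# Hypothetical data: three treatment levels with three observations each.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
sst, ssc, sse, f_stat = one_way_anova(groups)
print(f"SST = {sst:.2f}, SSC = {ssc:.2f}, SSE = {sse:.2f}, F = {f_stat:.2f}")
```

For these made-up groups SSC is 26, SSE is 6 and SST is 32, so F = (26 / 2) / (6 / 6) = 13. The observed F is then compared with the table F value for (C − 1, N − C) degrees of freedom.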

ANOVA using Excel
Here we observe that the values for treatment level 3 seem to be located differently from those of levels 2 and 4.

Treatment level 1 seems to be closest to the mean valve measurement.

The observed F value of 10.18 is larger than the table F value of 3.10, so the null hypothesis is rejected. Not all means are equal, so there is a significant difference in the mean valve openings by machine operator.
Regression and Clustering



Linear Regression
What is Regression

Regression analysis is a statistical technique for studying linear relationships among variables. It
includes many techniques for modeling and analyzing several variables, when the focus is on the
relationship between dependent variable and one or more independent variables (or 'predictors'). More
specifically, regression analysis helps one understand how the typical value of the dependent variable
(or 'criterion variable') changes when any one of the independent variables is varied, while the other
independent variables are held fixed.



Regression Equation
Suppose the linear model is Y = β0 + β1X + e

• where Y is the dependent variable and X is the independent variable.
• β0 and β1 are two unknown constants that represent the intercept and slope, also known as coefficients or parameters, and e is the error term.
• The model estimates the effect of an explanatory variable on the dependent variable (effect size: the magnitude of the β coefficient).
• This is the Simple Linear Regression equation.

And when the linear model is Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + β20X20 + e

• This is the Multiple Linear Regression equation.
• The X's are the independent variables.


Example Of Linear Regression
Advertisement Sales
5 62
10 100
16 148
30 300
33 357
40 400
5 700
20 ??????



Plot The Data
Plot the sales on a graph
[Figure: scatter plot of the data; x-axis: Sales, y-axis: Advertisement]


Linear Regression
Join the data points with a line… maybe a curve?
[Figure: scatter plot with the data points joined; x-axis: Sales, y-axis: Advertisement]


Linear Regression

It has to be a straight line… but which one?
[Figure: scatter plot; x-axis: Sales, y-axis: Advertisement]


Least Square Method
In regression, the differences between the observed points and the fitted line are called residuals, not deviations.
[Figure: scatter plot with a fitted line; x-axis: Sales, y-axis: Advertisement]


Least Square Methods
• The least squares method is a form of mathematical regression analysis that finds the line of best fit for a set of data, providing a visual demonstration of the relationship between the data points. Each data point represents the relationship between a known independent variable and an unknown dependent variable.

• It aims to create a straight line that minimizes the sum of the squares of the errors, that is, the squared residuals: the differences between each observed value and the value predicted by the model.

• Clarification https://nptel.ac.in/courses/122104019/numerical-analysis/Rathish-kumar/least-square/r1.htm
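
As an illustration of the least squares method, the sketch below fits Y = β0 + β1X to the advertisement and sales table from the earlier example and predicts sales for an advertisement value of 20. One assumption is made loudly here: only the first six rows of the table are used, because the row "5, 700" looks like an outlier or an extraction error. The function name `fit_line` is illustrative.

```python
def fit_line(xs, ys):
    """Closed-form least squares estimates (intercept, slope) for y = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = sum of cross-deviations divided by sum of squared x-deviations.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# First six rows of the advertisement/sales table (the "5, 700" row is left out).
advertisement = [5, 10, 16, 30, 33, 40]
sales = [62, 100, 148, 300, 357, 400]

b0, b1 = fit_line(advertisement, sales)
predicted_sales = b0 + b1 * 20

# Residuals: observed value minus the value predicted by the fitted line.
residuals = [y - (b0 + b1 * x) for x, y in zip(advertisement, sales)]

print(f"intercept = {b0:.2f}, slope = {b1:.2f}, prediction at 20 = {predicted_sales:.1f}")
```

Under this assumption the slope is about 10.16 and the predicted sales at an advertisement value of 20 are about 204. With an intercept in the model, the residuals of a least squares fit always sum to zero, which is a quick sanity check on the result.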



Linear Regression
Model Validation

o ANOVA
o P value
o R Square
o Adjusted R Square
o MAPE (Mean Absolute Percentage Error)

ANOVA

• H0: There is no relationship between X and Y
• H1: There is some relationship between X and Y
• For the model to be considered significant, the p-value of the ANOVA F-test (Significance F in the Excel output) should be less than 0.05.


Model Validation
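P Value
• For each independent variable, the p-value should be less than 0.05. If it is not, we discard the variable from the model, as it is not statistically significant.
R Square (Goodness of Fit)
• Its value ranges from 0 to 1.
• It measures how well the line fits the data.
Adjusted R Square (Goodness of Fit)
• Problem with R Square: it increases with the number of predictors included in the model, even when they add no explanatory power.
• Adjusted R Square solves this problem by penalizing extra predictors.
• Adjusted R Square <= R Square
MAPE (Mean Absolute Percentage Error)
• It measures how different the predictions are from the actual values, as a fraction of the actual values.
• It is 0 for perfect predictions and has no upper bound: it usually lies between 0 and 1 (0% to 100%), but it exceeds 1 when the errors are larger than the actual values.
• The lower the MAPE, the better the model.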
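
The three goodness-of-fit measures can be written out in plain Python. The actual and predicted values below are hypothetical numbers chosen for illustration, and the function names are not from any library.

```python
def r_square(actual, predicted):
    """1 minus the ratio of residual variation to total variation."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total

def adjusted_r_square(r2, n_observations, n_predictors):
    """Penalize R Square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

def mape(actual, predicted):
    """Mean absolute percentage error as a fraction (actual values must be nonzero)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical actual values and model predictions.
actual = [100, 200, 300, 400]
predicted = [110, 190, 310, 390]

r2 = r_square(actual, predicted)
adj_r2 = adjusted_r_square(r2, n_observations=len(actual), n_predictors=1)
error = mape(actual, predicted)
print(f"R Square = {r2:.3f}, Adjusted R Square = {adj_r2:.3f}, MAPE = {error:.3f}")
```

For these numbers R Square is 0.992, Adjusted R Square is 0.988 (never larger than R Square) and MAPE is about 0.052, i.e. the predictions are off by roughly 5% on average.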



Linear Regression Example
• D:\Working\LPG\Working\Regression.xlsx



What Is Cluster Analysis & Its Usage
• Cluster : a collection of data objects
– Similar to one another within the same cluster
– Dissimilar to the objects in other clusters
• Cluster analysis
– Grouping a set of data objects into clusters

• Suppose you are the head of a retail store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for the customers in each of these 10 groups. This is what we call clustering.


K-means Clustering
• The k-means algorithm clusters n objects into k partitions based on their attributes, where k < n.
• Each cluster is associated with a centroid (centre point)
• Each point is assigned to the cluster with the closest centroid
• The number of clusters k must be specified


Clustering: Example - Steps 1 to 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Figures: one scatter plot per step (Step 1 to Step 5); x-axis: expression in condition 1, y-axis: expression in condition 2; each plot shows the three centroids k1, k2 and k3, whose positions change from step to step]
How does the K-Means clustering algorithm work?

Keep repeating the process till the centroids don't change anymore.
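
The loop described above (assign each point to the closest centroid, recompute the centroids, repeat until they stop changing) can be sketched in plain Python. Choosing the initial centroids as k random data points is an assumption of this sketch, and the six points are hypothetical.

```python
import math
import random

def k_means(points, k, rng, max_iterations=100):
    """Cluster points (tuples of numbers) into k groups using Euclidean distance."""
    centroids = rng.sample(points, k)          # assumption: start from k random points
    clusters = []
    for _ in range(max_iterations):
        # Assignment step: each point joins the cluster with the closest centroid.
        clusters = [[] for _ in range(k)]
        for point in points:
            nearest = min(range(k), key=lambda i: math.dist(point, centroids[i]))
            clusters[nearest].append(point)
        # Update step: move each centroid to the mean of its assigned points.
        new_centroids = [
            tuple(sum(coords) / len(cluster) for coords in zip(*cluster))
            if cluster else centroids[i]       # keep an empty cluster's centroid in place
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:         # stop when the centroids no longer change
            break
        centroids = new_centroids
    return centroids, clusters

# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = k_means(points, k=2, rng=random.Random(7))
print(sorted(centroids))
```

On well-separated data like this the result does not depend on the starting points, but in general k-means can converge to different clusterings for different initial centroids, so it is common to run it several times.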



Hierarchical Clustering
• Hierarchical clustering, as the name suggests, is an algorithm that builds a hierarchy of clusters. The algorithm starts with every data point assigned to a cluster of its own. Then the two nearest clusters are merged into one cluster, and this step is repeated. The algorithm terminates when there is only a single cluster left.
• The results of hierarchical clustering can be shown using a dendrogram.


Explanation Of Hierarchical Clustering

• At the bottom, we start with 25 data points, each assigned to a separate cluster. The two closest clusters are then merged, repeatedly, till we have just one cluster at the top. The height in the dendrogram at which two clusters are merged represents the distance between the two clusters in the data space.
• The number of clusters that best depicts the different groups can be chosen by observing the dendrogram. The best choice is the number of vertical lines in the dendrogram cut by a horizontal line that can traverse the maximum distance vertically without intersecting a cluster.
• In this example, the best choice of the number of clusters is 4, as the red horizontal line in the dendrogram covers the maximum vertical distance AB.
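
The merging procedure and the "largest vertical gap" rule can be sketched in plain Python. The slides do not say how the distance between two clusters is measured, so this sketch assumes single linkage (the distance between the closest pair of points); the data and function names are hypothetical.

```python
import math

def agglomerative(points):
    """Merge the two nearest clusters until one is left; return the merge history."""
    clusters = [[point] for point in points]   # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage (assumption): distance of the closest pair of points.
                distance = min(math.dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or distance < best[0]:
                    best = (distance, i, j)
        distance, i, j = best
        merges.append((distance, clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for index, c in enumerate(clusters) if index not in (i, j)]
        clusters.append(merged)
    return merges

def suggest_cluster_count(merges):
    """Cut the dendrogram inside the largest gap between successive merge heights."""
    heights = [distance for distance, _, _ in merges]
    gaps = [heights[i + 1] - heights[i] for i in range(len(heights) - 1)]
    widest = max(range(len(gaps)), key=lambda i: gaps[i])
    # Merges 0..widest are kept; each merge reduces the cluster count by one.
    return len(heights) - widest

# Hypothetical one-dimensional data points.
points = [(0,), (1,), (10,), (11,), (50,)]
merges = agglomerative(points)
print([distance for distance, _, _ in merges], suggest_cluster_count(merges))
```

The merge distances play the role of the heights in the dendrogram. For these points the heights are 1, 1, 9 and 39, the largest gap lies between 9 and 39, and cutting there leaves 2 clusters.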



Dendrogram



Clustering Example
• D:\Working\LPG\Working\Clustering.csv



Dimensionality Reduction techniques



Dimensionality reduction techniques

Addressing the “Curse of Dimensionality”, where having too many variables does more harm than good.
• These techniques are used when there are too many variables
• When you need to visualize certain results in a two-dimensional plane
• When computing time is an issue with systems
The Usage of Dimensionality Reduction
• Forecasting with many variables
• Face Recognition
• Image Compression
• Gene Expression Analysis
• Data Reduction
• Data Classification
• Trend Analysis
• Factor Analysis
• Noise Reduction


Dimensionality Reduction Techniques

• Feature Selection
– Out of the existing features, select the most relevant ones based on the target
• Feature Extraction
– Out of the existing variables (n in number), derive a new, smaller set of variables (say k < n) that retains the maximum information
– Factor Analysis
– Principal Component Analysis
– Linear Discriminant Analysis


PCA Illustration

We can picture PCA (Principal Component Analysis) as a technique that finds the directions of maximal variance.
[Figure: illustration of the directions of maximal variance]


What is Principal Component Analysis?

• Principal components are the directions where there is the most variance, the directions where the data is most spread out.


To find the direction where there is most variance, find the straight line where the data is most spread out when projected onto it. For comparison, first consider a vertical straight line with the points projected onto it.
[Figure: the data points projected onto a vertical straight line]


On a horizontal line the data is far more spread out: it has a large variance.
In fact, there isn't a straight line you can draw that has a larger variance than a horizontal one. A horizontal line is therefore the principal component in this example.


The Steps Involved in Doing an Actual Principal Component Analysis

• Standardize the data.
• Perform Singular Value Decomposition to get the eigenvectors and eigenvalues.
• Sort the eigenvalues in descending order and choose the k eigenvectors with the largest eigenvalues.
• Construct the projection matrix from the selected k eigenvectors.
• Transform the original dataset via the projection matrix to obtain a k-dimensional feature subspace.
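
The steps above can be sketched in plain Python for the simplest case, k = 1. Two deviations from the slide are made deliberately and should be read as assumptions of this sketch: the eigenvector is found with power iteration on the covariance matrix instead of a Singular Value Decomposition, and only the first principal component is extracted. The data set is a small made-up example with two correlated variables.

```python
import math

def standardize(rows):
    """Rescale every column to mean 0 and sample standard deviation 1."""
    columns = list(zip(*rows))
    means = [sum(col) / len(col) for col in columns]
    sds = [math.sqrt(sum((v - m) ** 2 for v in col) / (len(col) - 1))
           for col, m in zip(columns, means)]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]

def covariance_matrix(centered_rows):
    """Sample covariance matrix of rows that are already centered."""
    n, d = len(centered_rows), len(centered_rows[0])
    return [[sum(row[i] * row[j] for row in centered_rows) / (n - 1)
             for j in range(d)] for i in range(d)]

def leading_eigenpair(matrix, iterations=200):
    """Power iteration: converges to the eigenvector with the largest eigenvalue."""
    d = len(matrix)
    vector = [1.0 / (i + 1) for i in range(d)]   # arbitrary nonzero starting vector
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vector[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in product))
        vector = [x / norm for x in product]
    eigenvalue = sum(vector[i] * sum(matrix[i][j] * vector[j] for j in range(d))
                     for i in range(d))
    return eigenvalue, vector

def project(rows, direction):
    """Transform each row into its coordinate along the chosen direction."""
    return [sum(a * b for a, b in zip(row, direction)) for row in rows]

# Made-up data: two positively correlated variables.
data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2), (3.1, 3.0),
        (2.3, 2.7), (2.0, 1.6), (1.0, 1.1), (1.5, 1.6), (1.1, 0.9)]

standardized = standardize(data)
eigenvalue, component = leading_eigenpair(covariance_matrix(standardized))
scores = project(standardized, component)     # the 1-dimensional feature subspace
print([round(c, 3) for c in component], round(eigenvalue, 3))
```

For two standardized, positively correlated variables the first principal component always points along the diagonal (both weights about 0.707), and its eigenvalue is the variance of the projected scores.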



LDA- Linear Discriminant Analysis

LDA attempts to find a feature subspace that maximizes class separability (note that LD 2 would be a very bad linear discriminant in the figure).


LDA- Linear Discriminant Analysis
Introduction
Linear Discriminant Analysis (LDA) is used for dimensionality reduction of data with a large number of attributes.

It is a pre-processing step for pattern-classification and machine learning applications.

It is used for feature extraction.
It is a linear transformation that maximizes the separation between multiple classes.
It is “supervised”: it uses the class labels.

It is also used for classification, including multi-class classification.

Basically, it tries to maximize Fisher's ratio (mu1 - mu2)^2 / (sigma1^2 + sigma2^2)


Feature Subspace:

To reduce the dimensions of a d-dimensional data set by projecting it onto a k-dimensional subspace (where k < d)

How do we know that the data is well represented in the feature subspace?
• Compute the scatter matrices of the dataset
• Compute their eigenvectors
• Generate the k-dimensional data from the d-dimensional dataset


Scatter Matrix:

• Within-class scatter matrix
• Between-class scatter matrix

Maximize the between-class measure and minimize the within-class measure.


LDA steps:

• Compute the d-dimensional mean vectors for the different classes.
• Compute the scatter matrices (within-class and between-class).
• Compute the eigenvectors and corresponding eigenvalues for the scatter matrices.
• Sort the eigenvectors by decreasing eigenvalue and choose the k eigenvectors with the largest eigenvalues to form a d×k dimensional matrix.
• Use this matrix to transform the samples onto the new subspace.


Using LDA for Classification
• The basis for this usage is Bayes' Theorem
• muk = 1/nk * sum(x), where muk is the mean of the inputs within class k, nk is the number of instances with class k, and x is an instance within class k.
• sigma^2 = 1 / (n-K) * sum((x – muk)^2), where n is the total number of instances, K is the number of classes, each x is compared with the mean muk of its own class, and sigma^2 is the variance.

Briefly, Bayes' Theorem can be used to estimate the probability of the output class (k) given the input (x), using the probability of each class and the probability of the data belonging to each class:
P(Y=k|X=x) = (PIk * fk(x)) / sum(PIl * fl(x))

Where PIk refers to the base probability of each class (k) observed in your training data (e.g. 0.5 for a 50-50 split in a two-class problem). In Bayes' Theorem this is called the prior probability.
• PIk = nk/n
• The fk(x) above is the estimated probability of x belonging to class k.
• Dk(x) = x * (muk/sigma^2) – (muk^2/(2*sigma^2)) + ln(PIk)
• Dk(x) is the discriminant function for class k given input x; muk, sigma^2 and PIk are all estimated from your data. The class with the largest Dk(x) is the predicted class.
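
The formulas above translate directly into a few lines of Python for a single input variable. The training data (six values in two classes, "a" and "b") is hypothetical, and the function names are chosen for this sketch.

```python
import math

def fit_lda(xs, labels):
    """Estimate class means (muk), priors (PIk) and the pooled variance (sigma^2)."""
    classes = sorted(set(labels))
    n = len(xs)
    means = {k: sum(x for x, y in zip(xs, labels) if y == k) / labels.count(k)
             for k in classes}
    priors = {k: labels.count(k) / n for k in classes}
    # Pooled variance: squared distance of each x from the mean of its own class.
    variance = sum((x - means[y]) ** 2 for x, y in zip(xs, labels)) / (n - len(classes))
    return means, priors, variance

def discriminant(x, k, means, priors, variance):
    """Dk(x) = x * (muk / sigma^2) - muk^2 / (2 * sigma^2) + ln(PIk)."""
    return x * means[k] / variance - means[k] ** 2 / (2 * variance) + math.log(priors[k])

def predict(x, means, priors, variance):
    """Predict the class with the largest discriminant value."""
    return max(means, key=lambda k: discriminant(x, k, means, priors, variance))

# Hypothetical training data: one input variable, two classes.
xs = [1, 2, 3, 7, 8, 9]
labels = ["a", "a", "a", "b", "b", "b"]

means, priors, variance = fit_lda(xs, labels)
print(means, priors, variance)
print(predict(4, means, priors, variance), predict(6, means, priors, variance))
```

Here the class means are 2 and 8, the priors are equal and the pooled variance is 1, so the decision boundary sits halfway between the means at x = 5: an input of 4 is assigned to class "a" and an input of 6 to class "b".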



Factor Analysis
• Factor analysis is a method for investigating whether a number of variables of interest Y1, Y2, …, Yl are linearly related to a smaller number of unobservable factors F1, F2, …, Fk.
• It originally started from customer surveys, where multiple survey questions reflected the same underlying factor.
• Let's say some variables are highly correlated. These variables can be grouped by their correlations, i.e. all variables in a particular group can be highly correlated among themselves but have low correlation with the variables of other group(s). Here each group represents a single underlying construct or factor. These factors are small in number compared to the large number of dimensions. However, these factors are difficult to observe. There are basically two methods of performing factor analysis:
– EFA (Exploratory Factor Analysis)
– CFA (Confirmatory Factor Analysis)


An Illustrated Example
Student No   Finance (Y1)   Marketing (Y2)   Policy (Y3)
1            3              6                5
2            7              3                3
3            10             9                8
4            3              9                7
5            10             6                5

Say Y1, Y2 and Y3 represent the grades of students in 3 subjects.

It has been suggested that these grades are functions of two underlying factors, F1 and F2, tentatively and rather loosely described as quantitative ability and verbal ability, respectively.

Y1 = B10 + B11F1 + B12F2 + e1
Y2 = B20 + B21F1 + B22F2 + e2
Y3 = B30 + B31F1 + B32F2 + e3

The error terms e1, e2 and e3 serve to indicate that the hypothesized relationships are not exact.


Thank you

© mjunction services limited 2018 | all rights reserved | confidential

