Analytics Team
Mjunction Services
[Topic map: Data Science techniques]
Data Science
• Classification: OneR, LDA, Naïve Bayesian, Logistic Regression, Decision Tree, K Nearest Neighbors (similarity function), SVM (Support Vector Machines), others
• Regression: K Nearest Neighbors (similarity function), Decision Tree
• Clustering: Hierarchical (divisive), K Means (partitive), Self-Organizing Maps
• Predicting the Future / Modelling: Association Rule, Correlation (covariance matrix)
• Describing numerical & categorical data: bar and line chart, 2-Y-axis plot, frequency table
confidential - mjunction services limited
Solving an Optimization Problem
using Excel Solver
Each unit of the two scooter models requires the following processing times in these
production steps:
Model  | Frame Manufacturing (hours) | Wheels and Deck Assembly (hours) | Quality Assurance and Packaging (hours)
Razor  | 4                           | 1.5                              | 1
Navajo | 5                           | 2                                | 0.8

Production Step                 | Available Time in Coming Week (hours)
Frame Manufacturing             | 5610
Wheels and Deck Assembly        | 2200
Quality Assurance and Packaging | 1200
How many units of each model should ZI produce in the coming week in order to
maximize its weekly profit?
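In Excel Solver this is set up as a linear program with the two production quantities as decision variables, the three hour limits as constraints, and weekly profit as the maximization objective. As a cross-check, the same search can be sketched in code. Note that the per-unit profit contributions are not given in this excerpt, so the figures below (45 for Razor, 50 for Navajo) are hypothetical placeholders.

```python
# Brute-force search for ZI's profit-maximizing integer production plan.
# Hours per unit and available hours come from the table above; the
# per-unit profits (45 and 50) are HYPOTHETICAL, not from the slides.
RAZOR_PROFIT, NAVAJO_PROFIT = 45, 50
RAZOR_HOURS = (4, 1.5, 1)       # frame, wheels/deck, QA+packaging
NAVAJO_HOURS = (5, 2, 0.8)
AVAILABLE = (5610, 2200, 1200)  # hours available in the coming week

best = (0, 0, 0)  # (profit, razors, navajos)
max_razors = int(min(cap / h for cap, h in zip(AVAILABLE, RAZOR_HOURS)))
for r in range(max_razors + 1):
    # For a fixed Razor count, build as many Navajos as the remaining
    # hours allow (profit per Navajo is positive, so more is better).
    remaining = [cap - r * h for cap, h in zip(AVAILABLE, RAZOR_HOURS)]
    n = int(min(rem / h for rem, h in zip(remaining, NAVAJO_HOURS)))
    profit = r * RAZOR_PROFIT + n * NAVAJO_PROFIT
    if profit > best[0]:
        best = (profit, r, n)

print(best)
```

Solver's Simplex LP engine finds the same optimum directly, without enumerating plans.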
The Zooter example treats profit contributions, manufacturing requirements, and supply availabilities as non-random quantities.
If ZI decides to make a certain number of units of each scooter model in the coming week, it will know for sure:
• How much profit it will make
• Whether it will have sufficient supply of each resource
The “no uncertainty” assumption simplifies the search for the best production plan.
In practice, it allows us to tackle analytics models with large numbers of products and resources
Sampling
The process of using a small number of items, or parts of a larger population, to make conclusions about the whole population. Sampling is used because surveying the entire population is impractical.
• Simple Random Sampling: A sampling procedure that ensures that each element in the population has an equal chance of being included in the sample. Implementation is easy, but it becomes both time- and money-consuming when larger samples are required. It is the most widely used sampling technique.
• Systematic sampling: A simple process where every nth name in a list is drawn. The output can be biased and skewed. Used in research.
• Stratified sampling: Subsamples are drawn within different strata. Each stratum is more or less equal on
some characteristic. The process is expensive. Used in research.
• Cluster sampling: The purpose of cluster sampling is to sample economically while retaining the characteristics of the population. It is no longer based on individual elements of the population. The U.S. uses this sampling technique to create clusters of its population.
• Convenience Sampling: A sampling procedure where element selection is based on ease of accessibility. Convenience samples are the least reliable, but are the cheapest and easiest to conduct. Street interviews are a classic example.
• Judgment sampling: An experienced individual selects the sample based on his or her judgment about
some appropriate characteristics required of the sample member.
• Quota sampling: Ensures that the various subgroups in a population are represented on pertinent sample
characteristics.
• Snowball sampling: Initial respondents are selected by probability methods. Additional respondents are
obtained from information provided by the initial respondents.
What is a hypothesis?
Hypothesis Testing
Procedure of Hypothesis Testing
Hypothesis Testing
Null Hypothesis
• The assumption that we wish to test is the Null Hypothesis (H0, read as “H-naught”). We begin with the assumption that what has been happening is correct.
• E.g.: The average weight of students in a class is 58 kg.
• (H0 : µ = 58)
Alternative Hypothesis
• The radical claim that the new theory is correct, or that changes are happening in the said system. It is expressed as the Alternative Hypothesis (Hα, read as “H-alpha”).
• E.g.: The average weight of students in a class is not 58 kg.
• (Hα : µ ≠ 58)
Decision Making: Rejection and Non-Rejection Region Approach
Test Statistic: The sample statistic one uses to either reject H0 (and conclude Hα) or not to reject H0.
Critical values are the values that determine whether the null hypothesis will be rejected or not.
Conditions for Hypothesis Testing
• If a z-test for one proportion with n samples: np0 ≥ 5 and n(1 − p0) ≥ 5, where p0 is the hypothesized population proportion.
• If a t-test: the data come from an approximately normal distribution, or the sample size is at least 30.
When to use what..
Example:
We take a random sample of 500 Penn State students and find that 278 are from Pennsylvania. Can we conclude, at a 5% level of significance, that the proportion is larger than 0.5, i.e., that a majority of students are from Pennsylvania?
Step 1: Use the one-proportion z-test, since the hypothesized value is p0 = 0.5 (population proportion) and we can check that np0 = 250 ≥ 5 and n(1 − p0) = 250 ≥ 5.
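The condition check and the test statistic for this example can be computed directly; a minimal sketch of the one-proportion z-test calculation:

```python
import math

# One-proportion z-test for the Penn State example on this slide:
# 278 of 500 sampled students are from Pennsylvania; H0: p = 0.5.
n, x, p0 = 500, 278, 0.5
p_hat = x / n  # sample proportion = 0.556

# Check the z-test condition: n*p0 and n*(1-p0) must both be at least 5
assert n * p0 >= 5 and n * (1 - p0) >= 5

# Test statistic: z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
z = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)
print(round(z, 3))  # 2.504, matching the slide's observed Z-value
```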
When to use what..
Step 4: Finding the appropriate critical value for the test using the z-table. From the table, Z0 = 1.645, which is the critical value. The rejection region for this right-tailed test is Z* > 1.645.
Step 5: Check whether the value of the test statistic falls in the rejection region. If it does, then reject H0 in favour of Hα.
The observed Z-value is 2.504; this is our test statistic. Since Z* falls within the rejection region, we reject H0 in favour of Hα.
When to use what..
Example:
The mean length of the lumber is supposed to be 8.5 feet. A builder wants to check whether the shipment of lumber she receives has a mean length different from 8.5 feet. The builder observes that the sample mean of 61 pieces of lumber is 8.3 feet, with a sample standard deviation of 1.2 feet. What will she conclude? Conduct this test at a 1% level of significance (α).
Step 1: Using the t-test, since the population standard deviation is unknown and the sample size is 61 > 30. Then setting the hypotheses: H0 : µ = 8.5 vs. Hα : µ ≠ 8.5.
Step 4: Finding the appropriate critical value for the test using the t-table. From the table (60 degrees of freedom, α = 0.01), the critical value is 2.660. The rejection region for the two-tailed test is |t*| > 2.660.
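The test statistic for this example can be computed directly; a minimal sketch of the one-sample t-test calculation for the numbers above:

```python
import math

# Two-tailed one-sample t-test for the lumber example: H0: mu = 8.5.
n, x_bar, s, mu0 = 61, 8.3, 1.2, 8.5

# Test statistic: t = (x_bar - mu0) / (s / sqrt(n))
t = (x_bar - mu0) / (s / math.sqrt(n))
print(round(t, 1))  # -1.3, the slide's observed t-value

# Critical value at alpha = 0.01 with 60 degrees of freedom (t-table)
t_crit = 2.660
print(abs(t) > t_crit)  # False: t does not fall in the rejection region
```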
When to use what..
Step 5: Check whether the value of the test statistic falls in the rejection region. If it does, then reject H0 (and conclude Hα). If it does not fall in the rejection region, do not reject H0.
The observed t-value is -1.3; this is our test statistic. Since t* does not fall within the rejection region, we fail to reject H0.
Step 6: Concluding that, with a test statistic of -1.3 and critical values of ±2.660 at a 1% level of significance, there is not enough evidence to say that the mean length of lumber differs from 8.5 feet.
Type I and Type II Errors
Analysis of Variance (ANOVA)
When to use ANOVA?
To compare the mean values of a certain characteristic among two or more groups; basically, to see whether two or more groups are equal (or different) on a given metric characteristic.
H0 in ANOVA:
There are no differences among the mean values of the groups being compared (i.e., the group means are all equal):
H0 : µ1 = µ2 = µ3 = … = µk
Hα (conclusion if H0 is rejected)?
Not all group means are equal (i.e., at least one group mean is different from the rest).
Scenario 2: When comparing three or more groups (e.g., Group A, Group B and Group C), it is a two-step test:
Step 1: An overall test that examines whether all group means are equal or not. If not all are equal (H0 rejected), then:
Step 2: Pair-wise (post-hoc) comparison tests to see where (i.e., among which groups) the differences exist, and how.
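The overall test in Step 1 can be sketched by hand: partition the variation into a between-group and a within-group sum of squares and form the F statistic. The three groups and their values below are hypothetical, chosen only to illustrate the calculation.

```python
# One-way ANOVA computed by hand on hypothetical data: three groups
# measured on the same metric characteristic.
groups = {
    "A": [5.1, 4.9, 5.3, 5.0],
    "B": [5.6, 5.8, 5.5, 5.7],
    "C": [5.0, 5.2, 4.8, 5.1],
}
values = [v for g in groups.values() for v in g]
grand_mean = sum(values) / len(values)

# Between-group (treatment) and within-group (error) sums of squares
ssc = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
          for g in groups.values())
sse = sum((v - sum(g) / len(g)) ** 2
          for g in groups.values() for v in g)

df_between = len(groups) - 1              # k - 1 = 2
df_within = len(values) - len(groups)     # n - k = 9
f_stat = (ssc / df_between) / (sse / df_within)
print(round(f_stat, 1))  # 19.3: a large F, so H0 would be rejected here
```

A large F leads to rejecting H0, after which pair-wise post-hoc tests locate the differing group (group B in this made-up data).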
Logic Behind an Analysis of Variance (ANOVA)
ANOVA—you take 1 continuous (“response”) variable and 1 categorical (“factor”) variable and test the null
hypothesis that all group means for the categorical variable are equal.
Example: Analyse the effects of the machine operator on the valve opening measurements of valves produced in a manufacturing plant. The measurements of the openings of 24 valves randomly selected from an assembly line are given in the table below. The mean opening is 6.34 centimeters (cm).
Logic Behind an Analysis of Variance (ANOVA)
Independent/Predictor Variable (factor):
The machine operator.
Treatment/Classification Levels of the factor:
The 4 machine operators: 1, 2, 3 and 4.
Dependent/Response Variable:
The opening measurement of the valves.
Our Hypothesis:
H0 : µ1 = µ2 = µ3 = µ4 (the mean valve opening is the same for all four operators).
Logic Behind an Analysis of Variance (ANOVA)
SST = SSC + SSE
where SST (Total Sum of Squares) measures all the variation in the dependent or response variable; SSC (Sum of Squares between Columns, i.e., treatments) measures the variation between the treatment-level means; and SSE (Sum of Squares of Error) measures the variation within each treatment level.
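The identity SST = SSC + SSE can be verified numerically; a tiny sketch on hypothetical data (two treatment columns, three observations each — not the valve data from the slides):

```python
# Numerical check of the ANOVA identity SST = SSC + SSE on a tiny
# hypothetical data set: two treatment columns, three observations each.
cols = [[6.3, 6.4, 6.2], [6.5, 6.6, 6.4]]
all_vals = [v for c in cols for v in c]
grand = sum(all_vals) / len(all_vals)

sst = sum((v - grand) ** 2 for v in all_vals)                     # total
ssc = sum(len(c) * (sum(c) / len(c) - grand) ** 2 for c in cols)  # between
sse = sum((v - sum(c) / len(c)) ** 2 for c in cols for v in c)    # within

print(abs(sst - (ssc + sse)) < 1e-12)  # True: the partition holds
```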
ANOVA using Excel
Here we observe that the values for treatment
level 3 seem to be located differently from those
of levels 2 and 4.
Regression and Clustering
Regression analysis is a statistical technique for studying linear relationships among variables. It includes many techniques for modeling and analyzing several variables when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors'). More specifically, regression analysis helps one understand how the typical value of the dependent variable (or 'criterion variable') changes when any one of the independent variables is varied, while the other independent variables are held fixed.
With many predictors, the linear model is Y = β0 + β1X1 + β2X2 + β3X3 + β4X4 + … + β20X20 + e
[Scatter plots: Sales on the x-axis vs. Advertisement on the y-axis, with candidate fitted lines. The fit has to be a straight line — but which one?]
It aims to create a straight line that minimizes the sum of the squares of the errors, i.e., the squared residuals: the differences between the observed values and the values predicted by the model.
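The least-squares line has a closed form for a single predictor; a minimal sketch, using hypothetical Advertisement/Sales-style numbers (the slides show plots but no data):

```python
# Ordinary least squares for one predictor, via the closed-form slope
# and intercept. The x/y values below are hypothetical.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]   # e.g. advertisement spend
ys = [2.1, 3.9, 6.2, 8.0, 9.9]   # e.g. sales

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
# Slope b1 = Sxy / Sxx; intercept b0 = y_bar - b1 * x_bar
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
print(round(b1, 2), round(b0, 2))  # slope 1.97, intercept 0.11
```

This pair (b0, b1) minimizes the sum of squared residuals described above.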
• Clarification https://nptel.ac.in/courses/122104019/numerical-analysis/Rathish-kumar/least-square/r1.htm
o ANOVA
o P value
o R Square
o Adjusted R Square
o MAPE (Mean Absolute Percentage Error)
Clustering
• Suppose you are the head of a retail store and wish to understand the preferences of your customers to scale up your business. Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits, and use a separate strategy for the customers in each of these 10 groups. And this is what we call clustering.
Clustering: Example - Step 1
[Scatter plot: expression in condition 1 (x-axis) vs. expression in condition 2 (y-axis), showing the data points and three initial centroids k1, k2, k3]
Clustering: Example - Step 2
Algorithm: k-means, Distance Metric: Euclidean Distance
[Scatter plot: points assigned to the nearest of the centroids k1, k2, k3]
Clustering: Example - Step 3
Algorithm: k-means, Distance Metric: Euclidean Distance
[Scatter plot: centroids k1, k2, k3 recomputed from their assigned points]
Clustering: Example - Step 4
Algorithm: k-means, Distance Metric: Euclidean Distance
[Scatter plot: points reassigned to the updated centroids k1, k2, k3]
Clustering: Example - Step 5
Algorithm: k-means, Distance Metric: Euclidean Distance
[Scatter plot: final positions of the centroids k1, k2, k3]
How does the K-Means Clustering algorithm work?
Keep repeating the process till the centroids don’t change anymore.
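The assign/update loop above can be sketched directly; a minimal k-means in plain Python, with hypothetical 2-D points and starting centroids:

```python
# Minimal k-means: assign each point to its nearest centroid (squared
# Euclidean distance), recompute each centroid as the mean of its
# cluster, and repeat until the centroids stop changing.
def kmeans(points, centroids):
    while True:
        # Assignment step
        clusters = [[] for _ in centroids]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Update step: move each centroid to the mean of its cluster
        new = [(sum(p[0] for p in cl) / len(cl),
                sum(p[1] for p in cl) / len(cl)) if cl else c
               for cl, c in zip(clusters, centroids)]
        if new == centroids:  # stop when centroids don't change anymore
            return centroids, clusters
        centroids = new

# Hypothetical data and starting centroids
points = [(1, 1), (1.5, 2), (3, 4), (5, 7), (3.5, 5), (4.5, 5), (3.5, 4.5)]
centers, clusters = kmeans(points, [(1, 1), (5, 7)])
print(centers)  # converges to one centroid per visible group
```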
Addressing the “Curse of Dimensionality”, where at times too many variables are not good. These techniques are used:
• When there are too many variables
• When you need to visualize certain results in a two-dimensional plane
• When computing time is an issue with the systems at hand
The Usage of Dimensionality reduction
• Forecasting with many variables
• Face Recognition
• Image Compression
• Gene Expression Analysis
• Data Reduction
• Data Classification
• Trend Analysis
• Factor Analysis
• Noise Reduction
• Feature Selection
  – Out of the existing features, select the most relevant based on the target
• Feature Extraction
  – Out of the existing variables (n in number), derive a new, smaller set of variables (say k < n) that contains the maximum information
  – Factor Analysis
  – Principal Component Analysis
  – Linear Discriminant Analysis
We can picture PCA (Principal Component Analysis) as a technique that finds the directions of maximal variance:
• They are the directions where there is the most variance, the directions where the data is most spread out.
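For two variables, those directions can be found by hand: center the data, form the 2×2 covariance matrix, and solve its eigenproblem in closed form. The data points below are hypothetical.

```python
# PCA as "directions of maximal variance", sketched for two variables.
import math

data = [(2.5, 2.4), (0.5, 0.7), (2.2, 2.9), (1.9, 2.2),
        (3.1, 3.0), (2.3, 2.7), (2.0, 1.6), (1.0, 1.1)]
n = len(data)
mx = sum(x for x, _ in data) / n
my = sum(y for _, y in data) / n

# Sample covariance matrix [[sxx, sxy], [sxy, syy]]
sxx = sum((x - mx) ** 2 for x, _ in data) / (n - 1)
syy = sum((y - my) ** 2 for _, y in data) / (n - 1)
sxy = sum((x - mx) * (y - my) for x, y in data) / (n - 1)

# Closed-form eigenvalues of a symmetric 2x2 matrix; the larger one is
# the variance along the first principal component.
mean_diag = (sxx + syy) / 2
offset = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
lam1, lam2 = mean_diag + offset, mean_diag - offset
print(round(lam1, 3), round(lam2, 3))
```

The first principal component captures lam1 / (lam1 + lam2) of the total variance, which is how PCA decides which directions to keep.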
LDA attempts to find a feature subspace that maximizes class separability (note that LD 2 would be a very bad linear discriminant in the figure above).
Basically, it tries to maximize Fisher’s ratio: (µ1 − µ2)² / (σ1² + σ2²)
Briefly, Bayes’ Theorem can be used to estimate the probability of the output class (k) given the input (x), using the probability of each class and the probability of the data belonging to each class:
P(Y=k|X=x) = (πk * fk(x)) / Σl (πl * fl(x))
where πk refers to the base probability of each class (k) observed in your training data (e.g. 0.5 for a 50-50 split in a two-class problem). In Bayes’ Theorem this is called the prior probability:
πk = nk/n
fk(x) above is the estimated probability of x belonging to class k.
Dk(x) = x * (µk/σ²) − µk²/(2σ²) + ln(πk)
Dk(x) is the discriminant function for class k given input x; µk, σ² and πk are all estimated from your data.
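The discriminant function can be evaluated directly: the class with the largest Dk(x) is predicted. A minimal sketch with two hypothetical classes sharing one variance (the means, priors and σ² below are made up):

```python
# Evaluating the LDA discriminant Dk(x) = x*(mu_k/sigma^2)
# - mu_k^2/(2*sigma^2) + ln(pi_k) for two HYPOTHETICAL classes with a
# shared variance; the class with the larger Dk(x) is predicted.
import math

params = {  # class -> (mu_k, pi_k)
    "A": (4.0, 0.5),
    "B": (8.0, 0.5),
}
sigma2 = 1.5  # shared variance estimate

def discriminant(x, mu, pi):
    return x * (mu / sigma2) - mu ** 2 / (2 * sigma2) + math.log(pi)

x = 5.0
scores = {k: discriminant(x, mu, pi) for k, (mu, pi) in params.items()}
print(max(scores, key=scores.get))  # "A": x = 5 is closer to mu_A = 4
```

With equal priors, the rule reduces to picking the class whose mean is nearest to x, which matches the intuition behind Fisher’s ratio.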
The error terms e1, e2, and e3 serve to indicate that the hypothesized relationships are not exact.