
Chapter 1 Intro to Business Analytics

Business Analytics: process of transforming data into actions through
analysis and insights in the context of organizational decision making and
problem solving.
Some common types of decisions that can be enhanced using analytics:
o Pricing, customer segmentation, merchandising, location
Business Intelligence (BI) / Information Systems (IS): collection, management,
analysis, and reporting of data
Operations Research/Management Science (OR/MS): involves use of
applications that use modeling and optimization for translating real problems
into mathematics, spreadsheets, or other computer languages

[Figure: components of business analytics: Data Mining; Business Intelligence/Information Systems; Statistics; Visualization; Simulation and Risk; What if?; Modeling and Optimization]

Data mining: focused on better understanding characteristics and patterns
among variables in large databases using a variety of statistical and
analytical tools
Simulation and Risk Analysis: spreadsheet models and stats analysis to
examine impacts of uncertainty assumptions
What-if analysis: how specific combos of inputs that reflect key
assumptions will affect model outputs
Visualization: visualizing data and results of analyses provides a way of
easily communicating data at all levels of a business
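The what-if idea above can be sketched in a few lines. This is only an illustration: the profit model and every number in it (price, unit cost, demand, fixed cost) are hypothetical and not taken from the notes.

```python
def profit(price, unit_cost, demand, fixed_cost):
    """Toy profit model: contribution margin times demand, minus fixed cost."""
    return (price - unit_cost) * demand - fixed_cost


def what_if(base, **changes):
    """Re-evaluate the model with some of the base-case inputs overridden."""
    scenario = {**base, **changes}
    return profit(**scenario)


# Hypothetical base-case assumptions.
base_case = {"price": 10.0, "unit_cost": 6.0, "demand": 1000, "fixed_cost": 2500.0}

scenarios = {
    "base": what_if(base_case),
    "demand falls 10%": what_if(base_case, demand=900),
    "price cut to 9": what_if(base_case, price=9.0),
}

for name, value in scenarios.items():
    print(f"{name}: {value:.2f}")
```

Each scenario changes one key assumption and leaves the rest at the base case, which makes the effect of that single input on the output easy to read.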

Scope of Business Analytics:

Begins with collection, organization, and manipulation of data, supported by 3
major components
1. Descriptive Analytics: use of data to understand past and current
business performance and make informed decisions.
- Summarizes data using charts, tables, reports
- Answers Question: Which factory has lowest productivity?
2. Predictive Analytics: seeks to predict the future by examining historical
data, detecting patterns, or relationships in these data and then
extrapolating these relationships forward in time
- Answers Question: What will happen if demand falls by 10%?

3. Prescriptive Analytics: uses optimization to identify best alternatives to
minimize or maximize some objective
- Answers Question: How much should we produce to maximize
profit?
Data for Business Analytics:

Data: numerical facts and figures collected through some type of
measurement process
Information: comes from analyzing data and extracting meaning from it
Data set: collection of data
Database: collection of related files containing records on people, places, or
things
Big data: refers to massive amounts of business data from a wide variety of
sources, much of which is available in real time, and much of which is
uncertain or unpredictable
Metric: unit of measurement that provides way to objectively quantify
performance (ex: net profit, ROI, market share)
Measurement: act of obtaining data associated with a metric
Measures: numerical values associated with a metric
Discrete metrics: derived from counting something (ex: delivery is either on
time or not)
Continuous metrics: based on continuous scale of measurement (ex:
dollars, length, time)
4 groups of Data: classification is hierarchical; each level includes all the
information content of the one preceding it
1. Categorical (nominal) data: qualitative, usually summarized as
proportions or percentages
2. Ordinal data: ordered or ranked
3. Interval data: quantitative; ordinal but have constant differences
between observations (no natural zero)
4. Ratio data: continuous and have a natural zero (ex: dollars and time);
ratios of this data are meaningful
Data Reliability: data is accurate and consistent
Validity: data correctly measure what they are supposed to measure

Models in Business Analytics

Model: abstraction or representation of a real system, idea, or object.
Captures the most important features of a problem and presents them in a form
that is easy to interpret.
Ex: Descriptive model: influence diagram describes how various elements of
the model influence, or relate to, others
Decision Models: logical or mathematical representation of a problem or
business situation that can be used to understand, analyze, or facilitate
making a decision. Most have 3 inputs:
o Data
o Uncontrollable Variables

o Decision Variables
Uncertainty: imperfect knowledge of what will happen
Risk: associated with consequences and likelihood of what might happen

Prescriptive Decision Models:

Optimization: process of finding a set of values for decision variables
that minimize or maximize some quantity of interest (profit, revenue,
cost, etc.), called the objective function
Optimal solution: set of decision variables that optimizes the objective
function
Constraints: limitations imposed on any solution
Algorithm: systematic procedure that finds a solution to a problem
Search algorithms: find a GOOD solution, but not necessarily the best one
Prescriptive models can be:
o Deterministic: one in which all model input info is either known or
assumed to be known with certainty
o Stochastic model: one in which some of the model input info is
uncertain
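To make the vocabulary above concrete, here is a minimal deterministic sketch. The demand curve, cost figures, and capacity limits are all hypothetical, and the exhaustive enumeration used here is just the simplest possible solution method, not one prescribed by the notes.

```python
def profit(quantity):
    """Objective function: profit as a function of the decision variable."""
    price = 50 - 0.1 * quantity      # hypothetical: price falls as more is sold
    cost = 500 + 10 * quantity       # hypothetical fixed plus variable cost
    return price * quantity - cost


def best_quantity(capacity):
    """Enumerate every feasible quantity and keep the most profitable one.

    The constraint is 0 <= quantity <= capacity (whole units only).
    """
    feasible = range(0, capacity + 1)
    return max(feasible, key=profit)


unconstrained = best_quantity(capacity=300)   # constraint is not binding
constrained = best_quantity(capacity=150)     # constraint is binding

print(unconstrained, round(profit(unconstrained), 2))
print(constrained, round(profit(constrained), 2))
```

Because all inputs are assumed known, this is a deterministic model. A stochastic version would treat something like the demand curve as uncertain.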

Problem Solving with Analytics:


1. Recognizing problem
- Problems arise when there is a gap between what is happening and
what we think should be happening
2. Defining problem
- Finding problem and distinguishing it from symptoms
3. Structuring Problem
- Involves stating goals and objectives, characterizing possible
decisions, and identifying any constraints or restrictions
- Often involves developing a formal model
4. Analyzing Problem
- Using analytics techniques
5. Interpreting Results and Making a Decision
6. Implementing the solution

Chapter 8 Trendlines and Regression Analysis:


Modeling Relationships and Trends in Data:

Linear Function: y = a + bx; shows steady increases or decreases over the
range of x
Logarithmic Function: y = ln(x); used when the rate of change in a variable
increases or decreases quickly and then levels out (ex: diminishing returns to
scale)
Polynomial Function: y = ax^2 + bx + c; used for things like revenue
models that incorporate price elasticity
Power Function: y = ax^b; defines phenomena that increase at a specific rate
Exponential Function: y = ab^x; has the property that y rises or falls at
constantly increasing rates
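As a quick illustration, the five functional forms above can be written as small Python functions and evaluated side by side. The parameter values used in the loop are arbitrary choices for demonstration only.

```python
import math


def linear(x, a, b):
    return a + b * x


def logarithmic(x):
    return math.log(x)          # natural log, as in y = ln(x)


def polynomial(x, a, b, c):
    return a * x ** 2 + b * x + c


def power(x, a, b):
    return a * x ** b


def exponential(x, a, b):
    return a * b ** x


# Arbitrary illustrative parameters.
for x in (1, 2, 4, 8):
    print(
        x,
        linear(x, a=1, b=3),
        round(logarithmic(x), 3),
        polynomial(x, a=1, b=1, c=1),
        power(x, a=3, b=2),
        exponential(x, a=2, b=2),
    )
```

Printing the values for a few x makes the shapes visible: the logarithmic column flattens out while the exponential column grows ever faster.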

Linear Regression

Least Squares Regression Line: Y = B0 + B1*X + E
o Intercept (B0) is the mean value of Y when X = 0
o Slope (B1) is the change in the mean value of Y as X changes by one unit
o For a specific value of X, we have many possible values of Y that vary
around the mean; this is why we add the error term E
Residuals (ei): one way to quantify the relationship between each observed point
and the estimated regression equation is to measure the vertical distance between
them (observed y minus predicted y-hat)
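A minimal sketch of how the least-squares intercept, slope, and residuals could be computed by hand for simple regression. The five data points are made up for illustration, and the function names are not from the notes.

```python
def fit_line(xs, ys):
    """Return (b0, b1) for the least-squares line y-hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx                 # slope
    b0 = y_bar - b1 * x_bar        # intercept
    return b0, b1


def residuals(xs, ys, b0, b1):
    """Vertical distance between each observed y and its predicted value."""
    return [y - (b0 + b1 * x) for x, y in zip(xs, ys)]


# Made-up sample data.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]

b0, b1 = fit_line(xs, ys)
res = residuals(xs, ys, b0, b1)
print(round(b0, 4), round(b1, 4))
print([round(e, 4) for e in res])
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is a handy sanity check.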

Steps for Conducting Regression Analysis and Evaluating a Model:


1. Determine the least-squares regression line and interpret coefficients
Part B Overall Model Assessment
2. Interpret Standard Error of Estimate (S_e): the variability of observed Y-values
around the predicted values (y-hat). The lower it is, the better the indication
that the model is a good fit
3. Coefficient of Determination R^2: quantifies how much of the variation in y
(dependent variable) can be explained by the variation in x (independent
variable). Higher value = better model
- R^2 never decreases when an independent variable is added to the model
- Adjusted R-squared: proportion of variation in y
explained by all independent variables, adjusted for the number of
independent variables used.
Can either INCREASE or DECREASE when a variable is added, so it is more
useful for judging whether a variable adds value to the model. If adjusted R-squared
increases when a variable is added, this indicates that the model has
improved
4. F-Test (ANOVA)
H0: b1 = b2 = b3 = b4 = b5 = 0
HA: at least one bi ≠ 0
If Significance F is less than alpha, reject the null hypothesis and conclude
that the slope of at least one independent variable is not 0 and is thus statistically
significant. Some of the variation in y can be explained by variation in the
independent variables. At least one of them affects y.
5. At this point, look at the p-values of all independent variables. If some of them
have a p-value higher than alpha, they are candidates for removal. Remove them one by one to
see if the model improves.
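The overall-assessment quantities in the steps above can be sketched from the observed and predicted values. This assumes a least-squares fit with an intercept (so that SST = SSR + SSE); the sample values and the function name `assess` are illustrative only. Turning the F statistic into Significance F requires an F-distribution table or a statistics package, which is left out here.

```python
import math


def assess(y, y_hat, k):
    """Return standard error, R^2, adjusted R^2, and F for k predictors."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))   # unexplained
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total
    ssr = sst - sse                                         # explained
    df_resid = n - k - 1
    return {
        "std_error": math.sqrt(sse / df_resid),
        "r2": 1 - sse / sst,
        "adj_r2": 1 - (sse / df_resid) / (sst / (n - 1)),
        "f_stat": (ssr / k) / (sse / df_resid),
    }


# Made-up observations and their fitted values from y-hat = 0.09 + 1.97x.
y = [2.1, 3.9, 6.2, 7.8, 10.0]
y_hat = [2.06, 4.03, 6.00, 7.97, 9.94]

stats = assess(y, y_hat, k=1)
for name, value in stats.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R^2 divides each sum of squares by its degrees of freedom, which is why it can fall when a weak variable is added even though R^2 itself cannot.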
Testing Individual Variables:
H0: b1 = 0 (no linear relationship)
HA: b1 ≠ 0 (linear relationship does exist between x and y)
Two Methods:
1. Using P-values of the T-test
Reject the null hypothesis if the p-value is less than alpha and conclude that a
linear relationship exists between x and y in this model
2. Using T-critical and Rejection Regions
Rejection Region
o one-tailed test: t > t_(alpha, n-k-1) or t < -t_(alpha, n-k-1), depending on the direction of HA
o two-tailed test: t > t_(alpha/2, n-k-1) OR t < -t_(alpha/2, n-k-1)
(n-k-1 = n-2 in simple regression, where k = 1)
T-statistic
o t = (bi - Bi)/s_bi, where Bi is the hypothesized value of the coefficient (0 under H0) and s_bi is the standard error of bi

If the t-statistic falls in the rejection region, reject the null hypothesis and conclude that a
linear relationship exists
At this point, you can use t-stats to remove a variable. If |t| < 1, the
variable should be removed and adjusted R-squared will increase
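The rejection-region method can be sketched as follows. The coefficient, its standard error, and the critical value are hypothetical inputs; in practice the critical value would be looked up in a t-table for the chosen alpha and n-k-1 degrees of freedom.

```python
def t_statistic(b, se_b, hypothesized=0.0):
    """t = (estimated coefficient - hypothesized value) / standard error."""
    return (b - hypothesized) / se_b


def reject_two_tailed(t, t_critical):
    """Reject H0 when t falls in either tail beyond the critical value."""
    return abs(t) > t_critical


def worth_keeping(t):
    """Rule of thumb from the notes: drop a variable whose |t| is below 1."""
    return abs(t) >= 1


# Hypothetical values for illustration.
t = t_statistic(b=1.2, se_b=0.5)
print(round(t, 3), reject_two_tailed(t, t_critical=1.98), worth_keeping(t))
```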

Part C- Checking Assumptions


1. Need Normality of Errors: plot a histogram of the STANDARDIZED RESIDUALS
(which describe how far each residual is from its mean in units of standard
deviations) and see whether the distribution is normal or not
2. Need Independence of Errors: residuals should be independent for each
value of the independent variable. Usually not a problem for cross-sectional
data.
o For time-series data, successive observations can appear to be
correlated (autocorrelation), identified by clusters of residuals having
the same sign
o Use the Durbin-Watson statistic to test for it if needed
3. Need Homoscedasticity
o Variation about the regression line is constant for all values of the
independent variable (constant variance). Check the spread of the residuals: they should
be scattered without showing an increasing or decreasing pattern
4. Need Linearity: residuals should appear to be scattered about zero with no
apparent pattern
5. Check for Outliers that could potentially influence the results
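Two of the checks above are easy to compute directly. The sketch below standardizes residuals using the sample standard deviation (an assumption; software packages differ slightly in how they scale) and computes the Durbin-Watson statistic. The residual sequences are made up to show the two extremes.

```python
import math


def standardize(residuals):
    """Express each residual in units of the residuals' standard deviation."""
    n = len(residuals)
    mean = sum(residuals) / n
    sd = math.sqrt(sum((e - mean) ** 2 for e in residuals) / (n - 1))
    return [(e - mean) / sd for e in residuals]


def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squares.

    Values near 2 suggest no autocorrelation; values well below 2 suggest
    positive autocorrelation (clusters of same-sign residuals).
    """
    num = sum(
        (residuals[t] - residuals[t - 1]) ** 2 for t in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den


clustered = [1, 1, 1, -1, -1, -1]      # same-sign runs
alternating = [1, -1, 1, -1, 1, -1]    # sign flips every period

print(round(durbin_watson(clustered), 3))
print(round(durbin_watson(alternating), 3))
print([round(z, 3) for z in standardize(clustered)])
```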
Other things to check for in Multiple Regression:

Multicollinearity: strong correlations BETWEEN the independent variables
Parsimony: choose a simple model; using fewer independent variables is better
and helps avoid overfitting

Part D Making Predictions


1. Confidence Interval for Slope
The value of the coefficient of the independent variable will fall between (lower
confidence) and (upper confidence), with 95% confidence
2. Confidence Interval
The mean of the dependent variable, given a certain value for the
independent variable, will fall between (lower confidence) and
(upper confidence), with 95% confidence
3. Prediction Interval
An individual predicted value of the dependent variable falls between these
values with 95% confidence
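The difference between the confidence interval for the mean and the prediction interval can be sketched for simple regression using the standard textbook formulas. The data are made up, and the critical value 3.182 is the two-tailed 95% t-value for 3 degrees of freedom (n - 2 with n = 5).

```python
import math


def fit_and_intervals(xs, ys, x0, t_critical):
    """Return (y_hat, confidence interval, prediction interval) at x0."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s_e = math.sqrt(sse / (n - 2))          # standard error of estimate
    y_hat = b0 + b1 * x0
    spread = 1 / n + (x0 - x_bar) ** 2 / sxx
    ci_half = t_critical * s_e * math.sqrt(spread)        # mean response
    pi_half = t_critical * s_e * math.sqrt(1 + spread)    # single new value
    return (
        y_hat,
        (y_hat - ci_half, y_hat + ci_half),
        (y_hat - pi_half, y_hat + pi_half),
    )


xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.0]
y_hat, ci, pi = fit_and_intervals(xs, ys, x0=3, t_critical=3.182)
print(round(y_hat, 3), [round(v, 3) for v in ci], [round(v, 3) for v in pi])
```

The prediction interval is always wider because it adds the variability of an individual observation on top of the uncertainty in the estimated mean.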

Interpreting the Coefficients:


Intercept
Without the influence of any of the independent variables (regardless of age,
number of people dined with, gender, or the location), the average student spends
about $3.04 on a meal
x1 = Age
For every one year older the person is, the average amount of money spent on a
meal increases by $0.15. Therefore, there is a small positive correlation between
age and the amount of money spent on a meal.
X2 = # of People Dined With
For every additional person that one dines with, the average amount of money
spent on a meal increases by $0.90 thus indicating a small positive correlation
between the number of people one dined with and the amount of money spent on a
meal.
X3 = Gender, 1 if female, 0 if male
Since this is a dummy variable the input to gender can be only 1(female) or 0
(male). Thus, females tend to spend $0.54 less on an average meal than males.
X4 = 1 if York Lanes, 0 if not

Since this is a dummy variable, the input to Location1 can only be 0 (somewhere
other than York Lanes), or 1 (York Lanes). Thus, it can be interpreted from this slope
that a person who eats at York Lanes will spend approximately $1.59 more than one
who eats at Seneca (because Seneca is when both Location1 and Location2 are
equal to 0)
X5 = 1 if TEL, 0 if not
Since this is a dummy variable, the input to Location2 can only be 0 (somewhere
other than TEL), or 1 (TEL). Therefore, it can be interpreted from this slope that one
who eats at TEL will spend approximately $0.57 less than one who eats at Seneca
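The coefficient interpretations above can be checked by plugging values into the fitted equation. The coefficients below are the ones quoted in these notes; the function and argument names are made up for this sketch.

```python
# Coefficients as quoted in the notes (dollars per unit change).
COEFFICIENTS = {
    "intercept": 3.04,
    "age": 0.15,
    "people": 0.90,
    "female": -0.54,      # 1 if female, 0 if male
    "york_lanes": 1.59,   # 1 if York Lanes, 0 if not
    "tel": -0.57,         # 1 if TEL, 0 if not
}


def predicted_spend(age, people, female=0, york_lanes=0, tel=0):
    """Predicted meal spend; Seneca is the baseline (both location dummies 0)."""
    c = COEFFICIENTS
    return (
        c["intercept"]
        + c["age"] * age
        + c["people"] * people
        + c["female"] * female
        + c["york_lanes"] * york_lanes
        + c["tel"] * tel
    )


seneca = predicted_spend(age=20, people=1)
york = predicted_spend(age=20, people=1, york_lanes=1)
print(round(seneca, 2), round(york, 2), round(york - seneca, 2))
```

Holding everything else fixed, switching a dummy variable from 0 to 1 shifts the prediction by exactly that variable's coefficient, which is what the interpretations describe.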

Part B - Overall Model Assessment/Individual Variables Assessment:


1. Standard Error of Estimate
The standard error is 5.195 which tells us that the average distance of the data
points from the regression line is approximately 5.20. Given that the mean amount
of money spent on a meal for this sample is $7.57 the standard error is not that
high. Thus, we can infer that overall this model is a good fit for the data. However,
since we cannot compare this standard error to other models it makes it difficult to
utilize in order to make a completely accurate assessment of the overall quality of
the model.

2. Coefficient of Determination
R^2= 0.0952
This indicates that 9.52% of the variation in amount of money spent
on a meal by students is explained by the variation in the
independent variables. This indicates a weak linear relationship
between the dependent and independent variables. Since the value
of R squared is quite low we can infer that this model is not a good
fit for the data.
Adjusted R Square = 0.0619
This indicates that 6.19% of the variation in the
amount of money spent is explained by all the independent
variables, adjusted for the number of independent variables used.
Since the Adjusted R squared is also fairly low it can be concluded
that for this model, there is a weak relationship between some of
the independent variables and the amount of money spent on a
meal.
3. F-test (ANOVA)
H0: b1 = b2 = b3 = b4 = b5 = 0
HA: at least one bi ≠ 0
Level of significance is set at 0.05
Significance F is 0.0172 which is less than 0.05, thus we can reject
the null hypothesis and conclude that the slope of at least one independent
variable is not zero and is therefore statistically significant. We can
conclude that some of the variation of the amount of money spent
on a meal (y) can be explained by the variation in the independent
variables. More specifically we can infer that at least one of the
independent variables affects the amount of money spent on a
meal.
Testing Individual Variables:
1. Using P-values of the T-test:
Confidence Level is 95%; alpha is taken as 0.025 since this is a two-tailed test
a. Age = x1
H0: b1 = 0 (no linear relationship)
HA: b1 ≠ 0 (linear relationship does exist between the amount of money spent
on a meal and age)
The p-value for Age is 0.0631 which is greater than alpha, therefore the
null hypothesis cannot be rejected and it can be concluded that no
linear relationship exists between age and the amount of money
spent on a meal in this model.
b. Number of People Dined With = x2
H0: b2 = 0 (no linear relationship)
HA: b2 ≠ 0 (linear relationship does exist between the amount of money spent
on a meal and the number of people dined with)
The p-value for this independent variable is 0.0246 which is less
than alpha, therefore the null hypothesis can be rejected and it can be
concluded that a linear relationship exists and this variable is
significant in this model
c. Gender = x3
H0: b3 = 0 (no linear relationship)
HA: b3 ≠ 0 (linear relationship does exist between the amount of money spent
on a meal and gender)
The p-value for this independent variable is 0.5484 which is greater
than alpha, therefore the null hypothesis cannot be rejected and it can
be concluded that no linear relationship exists between gender and
the amount of money spent on a meal in this model.
d. Location 1 (York Lanes) = x4
H0: b4 = 0 (no linear relationship)
HA: b4 ≠ 0 (linear relationship does exist between the amount of money
people spent on a meal and whether people ate at York Lanes or not)
The p-value for this independent variable is 0.1546 which is greater
than alpha, therefore the null hypothesis cannot be rejected and it can
be concluded that no linear relationship exists between this variable
and the amount of money spent on a meal in this model.

e. Location 2 (TEL) = x5
H0: b5 = 0 (no linear relationship)
HA: b5 ≠ 0 (linear relationship does exist between the amount of money
people spent on a meal and whether people ate at TEL or not)
The p-value for this independent variable is 0.6097 which is greater
than alpha, therefore the null hypothesis cannot be rejected and it can
be concluded that no linear relationship exists between this variable
and the amount of money spent on a meal in this model.
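The five decisions above follow one rule, so they can be tabulated in a short loop. The p-values and the 0.025 cutoff are the ones stated in these notes; the dictionary keys are labels chosen for this sketch.

```python
ALPHA = 0.025   # cutoff used in the notes for the two-tailed tests

# p-values as reported in the notes.
p_values = {
    "Age": 0.0631,
    "Number of People Dined With": 0.0246,
    "Gender": 0.5484,
    "Location 1 (York Lanes)": 0.1546,
    "Location 2 (TEL)": 0.6097,
}

# A variable is significant when its p-value is below the cutoff.
significant = {name: p < ALPHA for name, p in p_values.items()}

for name, is_sig in significant.items():
    decision = "reject H0" if is_sig else "fail to reject H0"
    print(f"{name}: p = {p_values[name]:.4f} -> {decision}")
```

Only the number of people dined with clears the cutoff, which matches the conclusions written out for each variable.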
Part C Checking Assumptions
Normality of Errors:

[Figure: Histogram of standardized residuals; x-axis: Bin, y-axis: Frequency]

The histogram indicates that the standardized residuals of the data are
relatively normally distributed, allowing us to conclude that this
assumption holds true.
Independence of Errors:
Since the data is cross-sectional we can assume that this
assumption holds.
a. Age

[Figure: Age Residual Plot; x-axis: Age, y-axis: Residuals]

1. Linearity:
The residuals appear to be randomly scattered about zero with no
apparent pattern. Thus, there is not enough evidence to conclude that
some other functional form would fit the data better and this
assumption holds.
2. Homoscedasticity: The spread of the residuals seems to be fairly
consistent, which indicates that the variance is fairly constant, so the
homoscedasticity assumption holds.
