BZAN6310 Chapter 2

BZAN 6310: Quantitative Analysis for Business
Lecture 1. Introduction and Describing Data
Dr. Jinghui (Jove) Hou

What are we trying to accomplish?
1. What influences my firm’s return on investment?
2. Why are some salespeople more productive than others?
3. How should the advertising budget be distributed effectively?
4. How can next year’s revenue be predicted?
5. Which employee(s) should I hire?
6. What prompts my customers to stick around?
Textbook
Installing StatTools
• Palisade DecisionTools Suite: StatTools, a statistics add-in for Excel
• Install Excel 2016 (or later)
• Go to Blackboard “Course Syllabus and Information etc” for the link
• Check email -> Download -> Save File -> Extract
• Close all Excel windows
• Run .exe -> install (accept the default)
Installing StatTools cont’d
• Run Excel (or open an Excel file)
• Menu -> StatTools -> Click “OK” -> Click “Close”
• New tab in the Excel menu: “StatTools”
Resources on Blackboard
• Excel tutorials
• StatTools tutorials
Describing Data
• What kinds of data do you encounter in your daily business?
2-2 Basic Concepts
• Several important concepts:
  • Populations and samples
  • Data sets
  • Variables and observations
  • Types of data
© 2017 Cengage Learning. All Rights Reserved. May not be scanned, copied or duplicated, or posted to a publicly accessible website, in whole or in part.
2-2a Populations and Samples
• A population includes all of the entities of interest in a study (people, households, machines, etc.).
• Examples:
  • All potential voters in a presidential election
  • All subscribers to cable television
  • All invoices submitted for Medicare reimbursement by nursing homes
• A sample is a subset of the population, often randomly chosen and preferably representative of the population as a whole.
2-2b Data Sets, Variables, and Observations
• A data set is usually a rectangular array of data, with variables in columns and observations in rows.
• A variable (or field or attribute) is a characteristic of members of a population, such as height, gender, or salary.
• An observation (or value, case, or record) is a list of all variable values for a single member of a population.
Example 2.1: Data from an Environmental Survey
• Objective: To illustrate variables and observations in a typical data set.
• Solution: The data set includes observations on 30 people who responded to a questionnaire on the president’s environmental policies.
  • Variables include age, gender, state, children, salary, and opinion.
  • Include a row that lists variable names.
  • Include a column that shows an index of the observation.
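To make the row-and-column layout concrete, here is a minimal Python sketch of such a data set. The variable names follow the example; the three records and all of their values are hypothetical and are not taken from the actual survey file.

```python
# A data set as a rectangular array: one dict per observation (row),
# one key per variable (column). All values are made up for illustration.
variable_names = ["Person", "Age", "Gender", "State", "Children", "Salary", "Opinion"]

observations = [
    {"Person": 1, "Age": 35, "Gender": "Male", "State": "Minnesota",
     "Children": 1, "Salary": 65400, "Opinion": "Strongly agree"},
    {"Person": 2, "Age": 61, "Gender": "Female", "State": "Texas",
     "Children": 2, "Salary": 62000, "Opinion": "Disagree"},
    {"Person": 3, "Age": 35, "Gender": "Male", "State": "Ohio",
     "Children": 0, "Salary": 63200, "Opinion": "Neutral"},
]

# Each observation lists a value for every variable.
for row in observations:
    assert set(row) == set(variable_names)

# A single variable is a column: collect its value from every row.
ages = [row["Age"] for row in observations]
print(len(observations), "observations,", len(variable_names), "variables")
print("Age column:", ages)
```

The "Person" key plays the role of the index column, and `variable_names` plays the role of the header row.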
2-2c Types of Data
• A variable is numerical if meaningful arithmetic can be performed on it.
• Otherwise, the variable is categorical.
• There is also a third data type, a date variable.
  • Excel® stores dates as numbers, but dates are treated differently from typical numbers.
• A categorical variable is ordinal if there is a natural ordering of its possible values.
• If there is no natural ordering, it is nominal.
Types of Data
• A numerical variable is discrete if it results from a count, such as the number of children.
• A continuous variable is the result of an essentially continuous measurement, such as weight or height.
Types of Data
Categorical: arithmetic operations do NOT make sense.
• Nominal: measurements are categories, and any numerical codes have no mathematical significance. Example: color.
• Ordinal: categories have a rank order. Example: horse racing results.
Numerical: arithmetic operations make sense.
• Discrete: measurements come from counts. Example: size of household.
• Continuous: measurements are made on a continuous scale. Example: bank savings, volume.
Test Yourself
Determine Types of Data.
Challenge Level 1:
• political party affiliation
• grades (A, A-, B+… F)
• test scores (SAT, ACT)
• size of household
Test Yourself
Challenge Level 2:
• Are you a smartphone owner? (1 = Y; 0 = N)
• President Trump
• Q1: Dependable _ _ _ _ _ _ _ _ _ Undependable
• Q2: Trustworthy _ _ _ _ _ _ _ _ _ Untrustworthy
• BZAN6310 is known to be extremely difficult.
1=Strongly Disagree 2=Disagree 3=Neutral 4=Agree 5=Strongly Agree
Types of Data
• Categorical variables can be coded numerically.
• A dummy variable is a 0–1 coded variable for a specific category.
  • It is coded as 1 for all observations in that category and 0 for all observations not in that category.
• A binned (or discretized) variable corresponds to a numerical variable that has been categorized into discrete categories.
  • These categories are usually called bins.
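Dummy coding and binning can be sketched in a few lines of Python. The ages, genders, and bin cut points below are hypothetical and chosen only to illustrate the two ideas.

```python
# Hypothetical raw data for eight people.
ages = [16, 22, 25, 26, 40, 55, 56, 71]
genders = ["F", "M", "F", "F", "M", "M", "F", "M"]

def age_bin(age):
    """Return a bin code for a numerical age (cut points are illustrative)."""
    if age < 18:
        return 1  # under 18
    if age <= 25:
        return 2  # 18 to 25
    if age <= 55:
        return 3  # 26 to 55
    return 4      # above 55

# Binned variable: a numerical variable turned into discrete categories.
age_binned = [age_bin(a) for a in ages]

# Dummy variable: 1 for observations in the category "F", 0 otherwise.
is_female = [1 if g == "F" else 0 for g in genders]

print(age_binned)
print(is_female)
```

Note that the bin codes 1 to 4 are ordinal labels, not quantities: arithmetic on them is not meaningful even though they are stored as numbers.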
Test Yourself
Challenge Level 3:
• Age
• What is your age? ___________
• 1 = under 18
• 2 = 18 – 25
• 3 = 26 – 55
• 4 = above 55
• Young
• Old
Types of Data
• Cross-sectional data are data on a cross-section of a population at a distinct point in time.
• Time series data are data collected over time.
2-3 Descriptive Measures for Categorical Variables
• There are only a few possibilities for describing a categorical variable, all based on counting:
  • Count the number of categories.
  • Give the categories names.
  • Count the number of observations in each category. (The resulting counts can be reported as “raw counts” or as percentages of totals.)
• Once you have the counts, you can display them graphically, usually in a column chart or a pie chart.
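The counting steps above can be sketched in Python with the standard library. The `regions` list is a hypothetical categorical variable, not a column from any course data set.

```python
from collections import Counter

# Hypothetical categorical variable with eight observations.
regions = ["West", "East", "West", "West", "South", "East", "West", "East"]

# Count the observations in each category (raw counts).
counts = Counter(regions)

# Convert raw counts to percentages of the total.
total = len(regions)
percentages = {category: 100 * n / total for category, n in counts.items()}

print("Number of categories:", len(counts))
for category, n in counts.most_common():
    print(f"{category:<6} count={n}  percent={percentages[category]:.1f}")
```

These counts or percentages are exactly what a column chart or pie chart would display.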
Example 2.2: Supermarket Sales
• Objective: To summarize categorical variables in a large data set.
• Solution: The data set contains transactions made by supermarket customers over a two-year period.
  • Children, Units Sold, and Revenue are numerical.
  • Purchase Date is a date variable.
  • Transaction and Customer ID are used only for identification.
  • All of the other variables are categorical.
• To get the counts in column S, use the Excel® function COUNTIF.
• To get the percentages in column T, divide each count by the total number of observations.
• Keep charts simple so that the information they contain emerges as clearly as possible.
• Another efficient way to find counts for a categorical variable is to use dummy (0–1) variables.
  • Recode each variable so that one category is replaced by 1 and all others by 0.
  • This can be done using a simple IF formula.
  • Find the count of that category by summing the 0s and 1s.
  • Find the percentage of that category by averaging the 0s and 1s.
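The dummy-variable shortcut can be checked with a small Python sketch. The category values are hypothetical; the list comprehension plays the role of the IF formula.

```python
# Hypothetical categorical column with five observations.
genders = ["F", "M", "F", "F", "M"]

# Recode: 1 for the category of interest, 0 for all others
# (the analogue of an Excel formula such as =IF(cell="F",1,0)).
dummy = [1 if g == "F" else 0 for g in genders]

# Summing the 0s and 1s gives the count of the category.
count_f = sum(dummy)

# Averaging the 0s and 1s gives the proportion; multiply by 100 for a percentage.
percent_f = 100 * sum(dummy) / len(dummy)

print(count_f, percent_f)
```

The average of a 0–1 variable is a proportion between 0 and 1, so it needs the factor of 100 (or percentage formatting in Excel) to read as a percentage.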
In-class Lab 1
• Descriptive Statistics for Categorical Variables exercise
• Excel: COUNTIF Function (also see Excel Tutorial)
• Charts
• Data set: Supermarket Transaction.xlsx
2-4 Descriptive Measures for Numerical Variables
• There are many ways to summarize numerical variables, both with numerical summary measures and with charts.
• We begin with a numerical variable such as Salary, where there is one observation for each person. Our basic goal is to learn how these salaries are distributed across people by asking:
1. What are the most “typical” salaries?
2. How spread out are the salaries?
3. What are the “extreme” salaries on either end?
4. Is a chart of the salaries symmetric about some middle value, or is it skewed in one direction?
Example 2.3: Baseball Salaries
• Objective: To learn how salaries are distributed across all 2015 MLB players.
• Solution: The data set contains data on 868 Major League Baseball players in the 2015 season.
  • Variables are player’s name, team, position, and salary.
2-4a Numerical Summary Measures
• Throughout this section, we focus on a Salary variable.
  • Measures of Central Tendency
  • Minimum, Maximum, Percentiles, and Quartiles
  • Measures of Variability
  • Empirical Rules for Interpreting Standard Deviation
  • Measures of Shape
Measures of Central Tendency
• The mean is the average of all values.
  • If the data set represents a sample from some larger population, this measure is called the sample mean and is denoted by X̄ (“X-bar”).
  • If the data set represents the entire population, it is called the population mean and is denoted by μ.
  • In Excel®, the mean can be calculated with the AVERAGE function.
• The median is the middle observation when the data are sorted from smallest to largest.
  • If the number of observations is odd, the median is literally the middle observation.
  • If the number of observations is even, the median is usually defined as the average of the two middle observations.
  • In Excel®, the median can be calculated with the MEDIAN function.
• The mode is the value that appears most often.
  • In Excel®, the mode can be calculated with the MODE function.
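As a rough Python counterpart to AVERAGE, MEDIAN, and MODE, the sketch below uses the standard `statistics` module. The salary figures are hypothetical (in thousands of dollars) and include one large value to show how the mean and median can differ.

```python
import statistics

# Hypothetical salaries in $1000s; the last value is an extreme one.
salaries = [40, 50, 50, 60, 300]

mean_salary = statistics.mean(salaries)      # sum of values / number of values
median_salary = statistics.median(salaries)  # middle value after sorting
mode_salary = statistics.mode(salaries)      # most frequent value

# With an even number of observations, the median is the average
# of the two middle values.
median_even = statistics.median([40, 50, 60, 300])

print(mean_salary, median_salary, mode_salary, median_even)
```

The single extreme salary pulls the mean well above the median, which is why the median is often the better "typical" value for skewed data.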
Minimum, Maximum, Percentiles, and Quartiles
• For any percentage p, the pth percentile is the value such that a percentage p of all values are less than it.
• The quartiles divide the data into four groups, each with (approximately) a quarter of all observations.
  • The first, second, and third quartiles are the percentiles corresponding to p = 25%, p = 50%, and p = 75%.
  • By definition, the second quartile (p = 50%) is equal to the median.
• In Excel®, the minimum and maximum values can be calculated with the MIN and MAX functions, and the percentiles and quartiles with the PERCENTILE and QUARTILE functions.
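A minimal Python sketch of these measures follows, using `statistics.quantiles` (Python 3.8 or later). The data are hypothetical, and the choice of `method="inclusive"` is an assumption: it interpolates between sorted values in the way Excel's inclusive percentile functions are generally understood to, but other software may use slightly different conventions.

```python
import statistics

# Hypothetical data: nine sorted values.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9]

minimum, maximum = min(values), max(values)

# Quartiles: the three cut points that split the data into four groups.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

# Percentiles: n=100 returns 99 cut points; the pth percentile is at index p - 1.
percentiles = statistics.quantiles(values, n=100, method="inclusive")
p90 = percentiles[89]

print(minimum, q1, q2, q3, maximum, p90)
```

The second quartile returned here equals the median of the data, as the definition requires.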
Measures of Variability
• The range is the maximum value minus the minimum value.
• The interquartile range (IQR) is the third quartile minus the first quartile.
  • Thus, it is the range of the middle 50% of the data.
  • It is less sensitive to extreme values than the range.
• The variance is essentially the average of the squared deviations from the mean.
  • If Xi is a typical observation, its squared deviation from the mean is (Xi – mean)².
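The range, IQR, and squared deviations can be computed directly, as in this sketch. The eight values are hypothetical, and the quartiles use the "inclusive" interpolation method, which is one of several conventions.

```python
import statistics

# Hypothetical data.
values = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: maximum minus minimum.
data_range = max(values) - min(values)

# IQR: third quartile minus first quartile (middle 50% of the data).
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Squared deviation of each observation from the mean.
mean = statistics.mean(values)
squared_deviations = [(x - mean) ** 2 for x in values]

print(data_range, iqr, squared_deviations)
```

Changing the largest value from 9 to 90 would change the range dramatically while leaving the IQR untouched, which illustrates why the IQR is less sensitive to extremes.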
• The sample variance is denoted by s², and the population variance by σ².
  • If all observations are close to the mean, their squared deviations from the mean, and hence the variance, will be relatively small.
  • If at least a few of the observations are far from the mean, their squared deviations from the mean, and hence the variance, will be large.
  • In Excel®, use the VAR.S function to obtain the sample variance and the VAR.P function to obtain the population variance.
• A fundamental problem with variance is that it is in squared units (e.g., dollars become squared dollars, $²).
• A more natural measure is the standard deviation, which is the square root of the variance.
• The sample standard deviation, denoted by s, is the square root of the sample variance.
• The population standard deviation, denoted by σ, is the square root of the population variance.
• In Excel®, use the STDEV.S function to find the sample standard deviation or the STDEV.P function to find the population standard deviation.
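The difference between the sample and population versions is only the divisor: the sample variance divides the sum of squared deviations by n - 1, the population variance by n. The sketch below computes both by hand on hypothetical data and compares them with Python's `statistics` functions, which are assumed here to play the same roles as VAR.S, VAR.P, STDEV.S, and STDEV.P.

```python
import math
import statistics

# Hypothetical data.
values = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(values)
mean = sum(values) / n

# Sum of squared deviations from the mean.
sum_sq = sum((x - mean) ** 2 for x in values)

population_variance = sum_sq / n        # divisor n
sample_variance = sum_sq / (n - 1)      # divisor n - 1

# Standard deviations are the square roots, back in the original units.
population_sd = math.sqrt(population_variance)
sample_sd = math.sqrt(sample_variance)

# Cross-check against the standard library.
assert math.isclose(population_variance, statistics.pvariance(values))
assert math.isclose(sample_variance, statistics.variance(values))
assert math.isclose(population_sd, statistics.pstdev(values))
assert math.isclose(sample_sd, statistics.stdev(values))

print(population_variance, sample_variance, population_sd, sample_sd)
```

For large data sets the two versions are nearly identical; the distinction matters mostly for small samples.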
Empirical Rules for Interpreting Standard Deviation
• The interpretation of the standard deviation can be stated as three empirical rules.
  • “Empirical” means that they are based on commonly observed data, as opposed to theoretical mathematical arguments.
• If the values of a variable are approximately normally distributed (symmetric and bell-shaped), then the following rules hold:
  • Approximately 68% of the observations are within one standard deviation of the mean.
  • Approximately 95% of the observations are within two standard deviations of the mean.
  • Approximately 99.7% of the observations are within three standard deviations of the mean.
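One way to see the three rules at work is to simulate normally distributed data and count. This is only an illustrative sketch: the mean of 100, standard deviation of 15, sample size, and random seed are arbitrary assumptions, and the printed shares will be close to, not exactly, 68%, 95%, and 99.7%.

```python
import random
import statistics

# Simulated, approximately normal data (parameters are arbitrary assumptions).
random.seed(0)
values = [random.gauss(100, 15) for _ in range(10_000)]

mean = statistics.mean(values)
sd = statistics.stdev(values)

def share_within(k):
    """Fraction of observations within k standard deviations of the mean."""
    inside = sum(abs(x - mean) <= k * sd for x in values)
    return inside / len(values)

shares = {k: share_within(k) for k in (1, 2, 3)}
for k, share in shares.items():
    print(f"within {k} sd: {100 * share:.1f}%")
```

Running the same counts on clearly skewed data would give shares that depart from the three rules, which is the reason for applying them with caution.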
[Figure: distribution skewed to the right (positively skewed)]
• The empirical rules should be applied with caution, especially when the data are clearly skewed, as is the case for the baseball salaries.
• The mean absolute deviation (MAD) is the average of the absolute deviations from the mean.
  • In Excel®, use the AVEDEV function to calculate MAD.
• There is another empirical rule for MAD: For many variables, the standard deviation is approximately 25% larger than MAD.
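The MAD calculation and the 25% rule can be sketched as follows. The small data set is hypothetical, and the second part uses simulated normal data (arbitrary parameters and seed) as one case where the ratio of standard deviation to MAD comes out near 1.25.

```python
import random
import statistics

def mean_absolute_deviation(values):
    """Average of the absolute deviations from the mean."""
    mean = statistics.mean(values)
    return sum(abs(x - mean) for x in values) / len(values)

# Small hypothetical example: the deviations are easy to verify by hand.
small = [2, 4, 4, 4, 5, 5, 7, 9]
mad_small = mean_absolute_deviation(small)

# Simulated, approximately normal data to illustrate the empirical rule.
random.seed(1)
simulated = [random.gauss(50, 10) for _ in range(10_000)]
ratio = statistics.stdev(simulated) / mean_absolute_deviation(simulated)

print(mad_small, round(ratio, 3))
```

The ratio is a rule of thumb rather than a guarantee; for heavily skewed variables it can differ noticeably from 1.25.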
Measures of Shape
• Skewness occurs when there is a lack of symmetry.
  • A variable can be skewed to the right (or positively skewed) because of some really large values (e.g., really large baseball salaries).
  • Or it can be skewed to the left (or negatively skewed) because of some really small values (e.g., temperature lows in Antarctica).
  • In Excel®, a measure of skewness can be calculated with the SKEW function.
• Kurtosis has to do with the “fatness” of the tails of the distribution relative to the tails of a normal distribution.
  • A distribution with high kurtosis has many more extreme observations than a normal distribution.
  • In Excel®, kurtosis can be calculated with the KURT function.
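For readers who want to see what such measures compute, here is a sketch of the standard bias-adjusted sample skewness and excess kurtosis formulas. They are assumed, not confirmed by the slides, to correspond to what SKEW and KURT return; the two small data sets are hypothetical.

```python
import statistics

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum of cubed z-scores."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

def sample_excess_kurtosis(values):
    """Adjusted sample kurtosis relative to a normal distribution (normal = 0)."""
    n = len(values)
    mean = statistics.mean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 50]   # one really large value

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_skewed)
kurt_symmetric = sample_excess_kurtosis(symmetric)

print(skew_symmetric, skew_right, kurt_symmetric)
```

A skewness near zero indicates symmetry, a positive value indicates skewness to the right, and a negative excess kurtosis indicates thinner tails than a normal distribution.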
Histogram Example: Late or Lost Baggage
2-4b Numerical Summary Measures with StatTools
• Although built-in functions in Excel® can be used to calculate a number of summary measures, a much quicker way is to use the StatTools add-in.
• StatTools is part of the Palisade DecisionTools Suite®. Once the suite is installed, load StatTools by double-clicking the StatTools item in the list of programs on the Windows Start menu.
• You will know that StatTools is loaded when you see the StatTools tab and ribbon.
Basic StatTools Features
• Before you can perform any statistical analysis, you must define a StatTools data set.
  • Make sure any cell in the data set is selected, and click the Data Set Manager button.
  • StatTools makes several guesses about your data set. You can always override them.
• To generate the summary measures for the Salary variable, select One-Variable Summary from the Summary Statistics dropdown list.
• The resulting dialog box is typical of StatTools. In the top section, you can select a StatTools data set and one or more variables. In the bottom section, you can select the measures you want.
• You might want to choose only your favorite summary measures as the defaults.
• There are several other things to note about the StatTools output.
  • First, it formats the results according to its own rules.
  • Second, the fact that there are formulas in these result cells indicates that they are “live.” If you go back to the data and change any of the salaries, the summary measures will update automatically.
In-class Lab 2
• Descriptive Statistics for Numerical Variables exercise
• Excel: StatTools (also see StatTools Tutorial)
• Data set: Baseball Salaries.xlsx
2-4d Charts for Numerical Variables
• There are many graphical ways to indicate the distribution of a numerical variable.
• For cross-sectional variables:
  • Histograms
  • Box plots (also called box-whisker plots)
• For time series variables:
  • Time series graphs
Histograms
• A histogram is the most common type of chart for showing the distribution of a numerical variable.
  • It is based on binning the variable, that is, dividing it up into discrete categories.
  • It is a column chart of the counts in the various categories (with no gaps between the vertical bars).
• A histogram is great for showing the shape of a distribution: whether the distribution is symmetric or skewed in one direction.
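The binning-and-counting idea behind a histogram can be sketched without any charting library. The values and the bin width of 10 are hypothetical choices; the text bars stand in for the columns of the chart.

```python
from collections import Counter

# Hypothetical numerical variable.
values = [3, 7, 8, 12, 13, 14, 18, 21, 22, 29]
bin_width = 10

# Assign each value to the bin [start, start + bin_width) that contains it,
# then count how many values fall in each bin.
bin_counts = Counter((v // bin_width) * bin_width for v in values)

# Print a simple text histogram, one row per bin.
for start in sorted(bin_counts):
    bar = "#" * bin_counts[start]
    print(f"[{start:>2}, {start + bin_width:>2})  {bar}  {bin_counts[start]}")
```

Changing `bin_width` changes the picture, which is why both Excel and StatTools let you adjust the bins.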
• Objective: To see the shape of the salary distribution through a histogram.
• Solution: Using Excel 2016®, select the Salary variable and choose the Histogram chart type from the Statistics group on the Insert ribbon.
  • If you are not yet using Excel 2016®, it is possible, but tedious, to create a histogram with Excel® tools alone.
• Excel® automatically chooses bins for the histogram, but you can right-click the horizontal axis and select Format Axis, where you can choose the Bin width or the Number of bins (but not both).
• The resulting histogram has one added enhancement: data labels “outside” the bars indicate the counts of the various bins.
  • You can use the Chart Elements dropdown list on the Chart Tools Design ribbon to add the data labels.
• It is much easier to create a histogram with StatTools.
  • First, designate a StatTools data set.
  • Next, select Histogram from the Summary Graphs dropdown list.
  • In the dialog box, select the Salary variable and click OK to get the histogram with the StatTools “auto” bins.
  • Change the options at the bottom of the dialog box to fine-tune the bins.
• This example file contains the count of bags late or lost for 456 flights.
• The “natural” bins are the integer values 0 to 8, which match the values taken by the counts.
Box Plots
• A box plot (or box-whisker plot) is an alternative type of chart for showing the distribution of a variable.
• Side-by-side box plots are very useful for comparing distributions.
• Box plots and histograms are complementary ways of displaying the distribution of a numerical variable.
• As with histograms, box plots are “big picture” charts.
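The numbers behind a box plot can be sketched as follows. The data are hypothetical, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here; the slides do not state which rule Excel or StatTools applies, and those tools may differ in detail.

```python
import statistics

# Hypothetical data with one unusually large value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]

# The box runs from the first to the third quartile, with a line at the median.
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in values if x < lower_fence or x > upper_fence]

# Whiskers extend to the most extreme observations that are not outliers.
inside = [x for x in values if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print(q1, median, q3, whisker_low, whisker_high, outliers)
```

An outlier far above the box with none below it, as in this example, is the kind of pattern that signals skewness to the right.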
• Objective: To illustrate the features of a box plot, particularly how it indicates skewness.
• Solution in Excel®: Create a box plot of a single variable like Salary almost exactly as you create a histogram.
  • Select the variable and then choose the Box and Whisker chart type from the Statistics group on the Insert ribbon.
• In StatTools, select Box-Whisker Plot from the Summary Graphs dropdown list and fill in the dialog box.
  • To get a generic box plot, you can check the bottom option.

2-5 Time Series Data
• Our main interest in time series variables is how they change over time, and this information is lost in traditional summary measures and in histograms or box plots.
• For time series data, a time series graph is used. This is a graph of the values of one or more time series, using time on the horizontal axis.
  • This is always the place to start a time series analysis.
Example 2.4: Crime in the United States
• Objective: To see how time series graphs help to detect trends in crime data.
• Solution: The data set contains annual data on violent and property crimes for the years 1960 to 2010.
  • In StatTools, designate a StatTools data set.
  • Then select Time Series Graph from the Time Series and Forecasting dropdown list and fill in the resulting dialog box.
[Figure: Violent and Property Crime Rates]
[Figure: Rates of Violent Crime Types]
Example 2.5: The DJIA Index
• Objective: To find useful ways to summarize the monthly Dow data.
• Solution: The data set contains monthly values of the Dow from 1950 through mid-2015.
  • Create summary measures and time series graphs for monthly values and percentage changes of the Dow.
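Percentage changes are computed from consecutive values of the series. The sketch below uses four hypothetical monthly index values, not actual Dow data.

```python
# Hypothetical month-end index values, in time order.
index_values = [100.0, 105.0, 102.9, 102.9]

# Percentage change from each month to the next:
# 100 * (current - previous) / previous.
pct_changes = [
    100 * (current - previous) / previous
    for previous, current in zip(index_values, index_values[1:])
]

for month, change in enumerate(pct_changes, start=2):
    print(f"month {month}: {change:+.2f}%")
```

The series of changes has one fewer observation than the original series, because the first month has no previous value to compare with.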
In-class Lab 3 & 4
Lab 3
• Charts for Numerical Variables exercise
• Excel: StatTools (also see StatTools Tutorial)
• Histogram & box plots
• Data set: Baseball Salaries.xlsx
Lab 4
• Time series data exercise
• Excel: StatTools
• Time series graph
• Data set: Crime in US.xlsx
Class Summary
To do
• Quiz 0
• Reading Assignments (on Syllabus)
• Homework (#38, 41, 48; pp. 73-76)
• Say hello to Jove (optional)
2-7 Excel® Tables for Filtering, Sorting, and Summarizing
• Tables are a tool introduced in Excel® 2007.
• You now have the ability to designate a rectangular data set as a table and then employ a number of powerful tools for analyzing tables.
• These tools include:
  • Filtering
  • Sorting
  • Summarizing
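The three table operations have simple analogues in code. This Python sketch uses a hypothetical four-row data set with made-up players, teams, and salaries; it illustrates the operations themselves, not how Excel implements them.

```python
from collections import defaultdict

# Hypothetical rectangular data set: one dict per row.
players = [
    {"Player": "A", "Team": "Reds", "Position": "Pitcher", "Salary": 500_000},
    {"Player": "B", "Team": "Cubs", "Position": "Catcher", "Salary": 2_000_000},
    {"Player": "C", "Team": "Reds", "Position": "Catcher", "Salary": 1_500_000},
    {"Player": "D", "Team": "Cubs", "Position": "Pitcher", "Salary": 750_000},
]

# Filtering: keep only the rows that meet a condition.
pitchers = [p for p in players if p["Position"] == "Pitcher"]

# Sorting: order the rows by a variable, here Salary from largest to smallest.
by_salary = sorted(players, key=lambda p: p["Salary"], reverse=True)

# Summarizing: total salary for each team.
total_by_team = defaultdict(int)
for p in players:
    total_by_team[p["Team"]] += p["Salary"]

print([p["Player"] for p in pitchers])
print([p["Player"] for p in by_salary])
print(dict(total_by_team))
```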
