
MSIT 3000: Unit 3

Analyzing Data

Graphing Data
Now that we have our data, we want to do something with them! The first thing we tend to do is
visualize our data. This allows us to look through and see trends in our data without scouring the rows
and columns of the data table.

Example 1: Tables vs. Graphs

Consider the following ways of presenting data on the maximum monthly temperature of a Hawaiian
island across 3 years:

A. Tabular form:

B. Graphical form:

[Figure: Max Temperature (vertical axis, 70 to 100) plotted against Month (horizontal axis, 1 to 12), with one series each for 2012, 2013, and 2014]

1. Which year was the warmest? Which year was the coolest? Which one (table or graph) did you use
to get that information?

2014 was warmest

2012 was coolest

2. Which single month had the highest maximum temperature? What was that temperature?

July 100.15

3. As our dataset gets larger, it will become increasingly _______________ to scour tables.

We should treat categorical and quantitative data ____________. Therefore, we should probably use
____________ methods for analyzing each type of variable.
Categorical Data

Categorical data gives us counts in a category || measures of a quantity. As such, we might want to start
analyzing categorical data by looking at ______________________________________. A good way to
do this is a frequency table.
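
Outside of JMP, the same idea can be sketched in a few lines of Python. The brand names and responses below are made up for illustration; they are not the Phones.jmp data.

```python
from collections import Counter

# Hypothetical responses (NOT the Phones.jmp data): one phone brand per person.
responses = ["Apple", "Samsung", "Apple", "LG", "Samsung", "Apple", "Motorola"]

# Counter tallies how many times each category appears.
frequency_table = Counter(responses)

# Print the categories from most to least common.
for brand, count in frequency_table.most_common():
    print(f"{brand:<10}{count}")
```

Each row of the output pairs a category with its count, which is exactly what a frequency table holds.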

Example 2: Frequency Tables in JMP

We'll use the Phones.jmp dataset found on eLC. To make a frequency table in JMP, we:

1. Load data
2. Analyze -> Distribution

Bar Charts are used for graphically displaying the information in a frequency table. They should only be
used for counts of categorical variables. The X-axis has the different values of the variable, and the Y-axis
gives the counts:

Bar Charts can handle situations where people select more than one answer.

Example 3: Bar Charts in JMP

There are two ways to build bar charts in JMP. The first way can be done from the Analyze ->
Distribution module. We'll click the red arrow -> Histogram Options and change it to suit us.

The second way involves the Graph Builder. We'll use Graph -> Graph Builder, and select the bar chart
up top. Then, just drag the variable of interest over to the drop zone.

Pie Charts are used when we want to focus on percentage of a whole rather than raw counts. They
require that all the percentages add up to 100%.
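
As a rough sketch of what a pie chart displays, the Python snippet below converts raw counts into percentages of the whole. The counts are hypothetical and chosen only for illustration.

```python
# Hypothetical category counts (made up for illustration).
counts = {"Apple": 3, "Samsung": 2, "LG": 1, "Motorola": 1}

total = sum(counts.values())

# Each slice of the pie is that category's share of the whole.
percentages = {brand: 100 * n / total for brand, n in counts.items()}

for brand, pct in percentages.items():
    print(f"{brand:<10}{pct:5.1f}%")

# The shares add up to 100% (up to floating-point rounding).
print(round(sum(percentages.values()), 1))
```

If people could select more than one answer, the shares computed this way would no longer describe parts of a single whole, which is why a bar chart is the safer choice in that situation.
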
Example 4: Pie Charts in JMP

1. Load the Data


2. Graph Builder -> Pie Chart.
3. Drag the variable we want to chart into the middle.

Would a pie-chart be useful for variables with 30 different categories?

No, it gets cluttered.

Quantitative Data

Histograms are useful ways to visualize quantitative data. Unlike bar graphs, there are no gaps between
the bars in a histogram: the bars cover adjacent ranges of the quantitative variable
(called bins) that combine to cover the whole range of the data.

Changing the width of the bins can significantly alter the shape of the graph, so it's often useful to
change the bin width and plot multiple times.
Note that you cannot make a bar chart for quantitative data and you cannot make a histogram for
categorical data!
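
To see how bin width changes the picture, here is a minimal Python sketch that bins the same values two ways and prints a text histogram for each. It borrows the employee ages from Example 5a; the helper names `bin_counts` and `print_histogram` are made up for this illustration, and the bin labels assume whole-number data.

```python
from collections import Counter

# Employee ages from Example 5a of this unit.
ages = [16, 28, 30, 23, 38, 37, 20, 37, 22, 27]


def bin_counts(values, width):
    """Count how many values fall into each bin [start, start + width)."""
    counts = Counter((v // width) * width for v in values)
    return dict(sorted(counts.items()))


def print_histogram(values, width):
    """Print one row of stars per non-empty bin (empty bins are skipped)."""
    for start, count in bin_counts(values, width).items():
        print(f"{start:>3}-{start + width - 1:<3} {'*' * count}")


print_histogram(ages, 10)  # wide bins
print()
print_histogram(ages, 5)   # narrower bins, same data
```

With bins of width 10 the ages pile up in the 20s and 30s; with bins of width 5 the same data show separate clusters in the low 20s and upper 30s.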

Shapes

Peaks or humps seen in a histogram are called modes. A distribution with one peak is called unimodal;
two peaks is called bimodal; three or more is called multimodal.

Histograms with all the bars at approximately the same height (i.e. there is no peak and no mode) are
called uniform.

How many modes are in the scores data from the top of the page?

unimodal

How many modes are in the phone manufacturer data?

Modality doesn't apply to categorical data

A distribution is symmetric if the halves on either side of the center look (approximately) like mirror
images.
Circle the symmetric distributions of data:
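[Figure: plot with Frequency (0 to 8) on the vertical axis and values from 0.4 to 1.0 on the horizontal axis]

[Figure: plot with count (0 to 30) on the vertical axis and values from -3 to 4 on the horizontal axis]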

When one tail of the data stretches out longer than the other, the distribution is said to be skewed
towards the side with the longer tail.

Identify the skewed plots above. Write the direction of their skew over them.
Useful Statistics for Quantitative Variables
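The mean of a variable, often called the average, is calculated by adding up all the values and
dividing by the number of values:

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
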
It's a good measure of center for unimodal, symmetric distributions. However, the mean is very sensitive to
outliers and skewed data.

Example 5a:

Consider the following dataset of employee ages at a grocery store:

16 28 30 23 38 37 20 37 22 27

A. Find the mean age of the ten employees.

B. Say the store hires a 67-year-old. Find the new average age of all eleven employees.
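
Both parts can be checked with a short Python sketch. The helper name `mean` is just a label chosen for this illustration.

```python
# Ages of the ten original employees.
ages = [16, 28, 30, 23, 38, 37, 20, 37, 22, 27]


def mean(values):
    """Add up the values and divide by how many there are."""
    return sum(values) / len(values)


mean_before = mean(ages)         # part A: ten employees
mean_after = mean(ages + [67])   # part B: the 67-year-old is added

print(mean_before)               # 27.8
print(round(mean_after, 2))      # 31.36
```

A single new hire raises the average by more than three years, which shows how strongly one outlier can move the mean.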

The median is another measure of center for a quantitative variable. To find the median, we first sort
the data from smallest to largest. Then:

For an odd number of values, the median is the single value in the middle.
For an even number of values, the median is the average of the two middle values.

The median is resistant to skewness and outliers, which makes it more representative of the typical value
in skewed distributions!

Example 5b:

Consider still the ages of the employees at the grocery store (listed above).

A. Find the median age of the ten original employees.

B. Find the median age of the eleven employees after the 67-year-old is hired.
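
The sort-then-pick-the-middle rule can be sketched in Python as follows. The function name `median` is chosen for this illustration; Python's standard library offers `statistics.median` for the same job.

```python
# Ages of the ten original employees.
ages = [16, 28, 30, 23, 38, 37, 20, 37, 22, 27]


def median(values):
    """Sort the values, then take the middle one (or average the middle two)."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd count: a single value sits in the middle.
        return ordered[middle]
    # Even count: average the two middle values.
    return (ordered[middle - 1] + ordered[middle]) / 2


print(median(ages))          # 27.5
print(median(ages + [67]))   # 28
```

Adding the 67-year-old moves the median by only half a year, while the mean of the same ages rises from 27.8 to about 31.4.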

When a distribution is symmetric, the mean and the median will be roughly equal. In a skewed
distribution, however, the mean typically gets pulled toward the longer tail!
The mean here is $10,260, and the median is $8,619. You can see that the mean is getting pulled to the
right since the distribution is right-skewed.

In addition to measures of center (mean and median), we also want to have a measure of spread, which
tells us how far the values are spread out around the center. The two main measures of spread we use
are the standard deviation and the interquartile range (IQR).

Standard Deviation measures the average distance of points from the mean. Data that is more spread
out from the center will have a larger standard deviation than data with all points close to the center
(mean). The formula to calculate standard deviation is:
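
As a sketch, assuming the sample version of the formula (the square root of the sum of squared deviations from the mean, divided by n - 1; the population version divides by n instead), the calculation looks like this in Python for the employee ages from Example 5a. The helper name `sample_std_dev` is made up for this illustration.

```python
import math
import statistics

# Employee ages from Example 5a of this unit.
ages = [16, 28, 30, 23, 38, 37, 20, 37, 22, 27]


def sample_std_dev(values):
    """Sample standard deviation, using the n - 1 denominator (an assumption)."""
    n = len(values)
    center = sum(values) / n
    squared_deviations = [(x - center) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (n - 1))


print(round(sample_std_dev(ages), 2))      # 7.71
print(round(statistics.stdev(ages), 2))    # same result from the standard library
```

Squaring the deviations keeps positive and negative distances from cancelling out, and the square root brings the result back to the original units (years).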

IQR gives the range between the first and third quartiles. Quartiles divide our data into quarters; the
first quartile (Q1) is greater than 25% of the data, the second quartile (Q2) is greater than 50% of the
data, the third quartile (Q3) is greater than __75__% of the data, and the fourth quartile (Q4) is greater
than __100__% of the data.

IQR = Q3 - Q1
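
A minimal Python sketch of the calculation, using the employee ages from Example 5a. Software packages follow slightly different conventions for locating quartiles, so the values below reflect the default method of Python's `statistics.quantiles` and may differ a little from what JMP reports.

```python
import statistics

# Employee ages from Example 5a of this unit.
ages = [16, 28, 30, 23, 38, 37, 20, 37, 22, 27]

# With n=4, quantiles returns the three cut points Q1, Q2, and Q3.
q1, q2, q3 = statistics.quantiles(ages, n=4)

iqr = q3 - q1

print(q1, q2, q3)   # 21.5 27.5 37.0
print(iqr)          # 15.5
```

The IQR describes the spread of the middle half of the data, so a single extreme value in either tail has little effect on it.
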
Do you know another name for the second quartile (Q2)?

Median

How about for the fourth quartile (Q4)?

Maximum
