431 Part A: Describing Data

Thomas E. Love September 11, 2016

Contents

1 A Quick Introduction
1.1 R Setup Chunk

2 Types of Data
2.1 Quantitative Data
2.2 Qualitative (Categorical) Data

3 Data Science - What this course is about, and how exploratory work fits in.

4 The National Youth Fitness Survey Data (nyfs1)
4.1 Looking over the Data Set
4.1.1 subject.id
4.1.2 sex
4.1.3 age.exam
4.1.4 bmi
4.1.5 bmi.cat
4.1.6 waist.circ
4.1.7 triceps.skinfold

4.2 Summarizing the Data Set
4.2.1 The Five Number Summary, Quantiles and IQR
4.3 Using favstats in the mosaic library to gather additional summaries
4.4 The Variance and the Standard Deviation
4.4.1 Defining the Variance and Standard Deviation
4.5 Chebyshev’s Inequality: One Interpretation of the Standard Deviation
4.6 Empirical Rule Interpretation of the Standard Deviation

5 Graphical Summaries: Histograms and Variants
5.1 The Histogram
5.1.1 Freedman-Diaconis Rule to select bin width
5.2 A Note on Colors
5.3 The Stem-and-Leaf
5.3.1 A Fancier Stem-and-Leaf Display
5.4 The Dot Plot to display a distribution
5.5 The Frequency Polygon
5.6 Plotting the Probability Density Function
5.7 Comparing a Histogram to a Normal Distribution
5.7.1 Histogram of counts with Normal model

6 Boxplots for Summarizing Distributions
6.1 Summarizing the Quantiles graphically in a Boxplot
6.1.1 Drawing a Boxplot for One Variable in ggplot2
6.2 A Simple Comparison Boxplot

7 Does a Normal Distribution Fit Our Data Well?
7.1 Tools for Assessing Normality
7.2 Empirical Rule Interpretation of the Standard Deviation
7.3 Describing Outlying Values with Z Scores
7.3.1 Fences and Z Scores
7.4 Does a Normal model work well for the Ages?
7.5 The Normal Q-Q Plot
7.5.1 An Alternative Method using base R plotting
7.5.2 A Fancy ggplot2 function - gg_qq
7.6 Interpreting the Normal Q-Q Plot
7.6.1 Data from a Normal distribution shows up as a straight line in a Normal Q-Q plot
7.6.2 Skew is indicated by monotonic curves in the Normal Q-Q plot
7.6.3 Direction of Skew
7.6.4 Outlier-proneness is indicated by “s-shaped” curves in a Normal Q-Q plot
7.7 Returning to the BMI data
7.8 Are the Waist Circumferences Normally Distributed?

8 Using Transformations to “Normalize” Distributions
8.1 The Ladder of Power Transformations
8.2 Using the Ladder
8.3 Transformation Example 1: A Simulated Data Set with Right Skew
8.3.1 Does Squaring the Sample 1 Data Help “Normalize” the Picture?
8.3.2 Does Taking the Square Root of the Sample Help “Normalize” the Picture?
8.3.3 Does Taking the Logarithm of the Sample Help “Normalize” the Picture?
8.4 Transformation Example 1: Which Best “Normalizes” These Data?
8.5 Transformation Example 1: Ladder of Potential Transformations in Frequency Polygons
8.6 Looking at the Ladder with Normal Q-Q Plots
8.7 Transformation Example 2. A Simulated Data Set with Left Skew
8.8 Transformation Example 2: Ladder of Potential Transformations in Frequency Polygons
8.9 Transformation Example 2 Ladder with Normal Q-Q Plots

9 Additional Descriptive Numerical Summaries
9.1 Using describe in the psych library
9.1.1 The Trimmed Mean
9.1.2 The Median Absolute Deviation
9.2 Assessing Skew
9.2.1 Non-parametric Skew via skew1
9.2.2 A home-made function to calculate skew1
9.3 Assessing Kurtosis (Heavy-Tailedness)
9.3.1 The Standard Error of the Sample Mean
9.4 The describe function in the Hmisc library
9.5 xda from GitHub for numerical summaries for exploratory data analysis
9.6 What Summaries to Report

10 Summarizing data within subgroups
10.1 Using dplyr and summarize to build a tibble of summary information
10.2 Using the by function to summarize groups numerically

11 Comparing Distributions Across Subgroups Graphically
11.1 Boxplots to Relate an Outcome to a Categorical Predictor
11.2 Augmenting the Boxplot with the Sample Mean
11.3 Adding Notches to a Boxplot
11.4 Using Multiple Histograms to Make Comparisons
11.5 Using Multiple Density Plots to Make Comparisons
11.6 Building a Violin Plot