R Programming for Data Science Sovello Hildebrand Mgani sovellohpmgani@gmail.com

R Programming for Data Science

Sovello Hildebrand Mgani

Outline

History of R

Installation (Windows and Linux)

Data Types

Reading Data:

Tabular

Large datasets

Textual Data Formats

Subsetting:

Lists, Matrices, Partial matching

Removing missing values


Outline

Vectorized operations

Control Structures

If-else

for, while, repeat, next, break

Functions

Scoping

Dates and Times

Loop functions

lapply, tapply, apply, mapply, split

Simulation and profiling

Generating random numbers, simulating a linear model, random sampling

Visualizations


History of R

R originates from the S language. S was initiated in 1976 as an internal statistical analysis environment, originally implemented as Fortran libraries

History of S:

R development history: https://en.wikipedia.org/wiki/R_(programming_language)

R and Statistics

R developed from S which is a statistical analysis tool, and so is R

Its functionality is divided into modules

Need to load a module for different functionalities

Has more sophisticated graphics capabilities than most other statistical packages

Useful for interactive work: run from terminal

Contains a powerful programming language for developing new tools

Tools: for visualizations and analysis


Design of the R System

The “base” system, downloaded from CRAN

“All other stuff”

Packages in R

The “base” has the base package required to run R and has the most fundamental functions

Other packages are bundled with the “base” system: utils, stats, datasets, graphics, grDevices, tools, etc. Most of these are attached automatically when R starts; the rest (e.g. tools) need to be loaded before use.

Recommended packages: boot, class, cluster, codetools, foreign, lattice, etc.

Load packages with library(), or require()
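
A minimal sketch of the difference between the two loaders, added as an illustration: library() stops with an error when a package is missing, whereas require() returns FALSE. The package name notARealPackage123 is a made-up placeholder.

```r
# tools ships with every R installation, so this always succeeds
library(tools)

# require() returns TRUE/FALSE instead of stopping with an error;
# "notARealPackage123" is a hypothetical name used only for illustration
has_pkg <- suppressWarnings(
  require("notARealPackage123", character.only = TRUE, quietly = TRUE)
)
has_pkg
```

Because require() returns a logical value, it is often used inside if () checks, while library() is the usual choice at the top of a script.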


R Resources

CRAN:

Quick-R: a book

R bloggers (platform): not a social network

R-Bloggers is about empowering bloggers to empower other R users

R-Bloggers.com is a blog aggregator of content contributed by bloggers who write about R (in English)


Installation of R: Ubuntu

Run from terminal:

sudo apt-get install r-base r-base-dev

If this doesn’t work, you need to add the repository first:

echo "deb http://cran.rstudio.com/bin/linux/ubuntu xenial/" | sudo tee -a /etc/apt/sources.list

Add the keyring:

gpg --keyserver keyserver.ubuntu.com --recv-key E084DAB9

gpg -a --export E084DAB9 | sudo apt-key add -

Install R-Base

sudo apt-get update; sudo apt-get install r-base r-base-dev

You can also install from a PPA, which has the most recent versions

Add the PPA

sudo add-apt-repository ppa:marutter/rrutter

Install R-Base

sudo apt-get update; sudo apt-get install r-base r-base-dev


Installation of R: Windows

Visit CRAN: https://cran.r-project.org/

CRAN: Comprehensive R Archive Network

Installation of R: Windows

Click/Select Download R for Windows

Installation of R: Windows

Then click/select base or install R for the first time

Installation of R: Windows

Then click/select Download R X.X.X for Windows

After the download has finished, locate the downloaded file and install.

RStudio: www.rstudio.com


RStudio: Introduction

RStudio is a set of integrated tools designed to help you be more productive with R.

How?

It includes a console,

syntax-highlighting editor that supports direct code execution,

a variety of robust tools for


plotting,

viewing history,

debugging and

managing your workspace.


RStudio: Installation

From the RStudio home page, go to Products then select RStudio

Then scroll down and click Download RStudio Desktop

Then click Download under RStudio Desktop Personal License.

Select RStudio for your platform. Clicking on the link will download the file directly.

Locate the file in your system Downloads folder and start the installation.


RStudio: Parts

The Environment tab shows all the active objects.
The History tab shows a list of commands used so far.
The Console is where you write and run code interactively.
The Files tab shows all the files and folders in your default workspace as if you were on a PC/Mac window.
The Plots tab will show all your graphs.
The Packages tab will list a series of packages or add-ons needed to run certain processes.
For additional info see the Help tab.

RStudio: Working Directory

It is important to organize all files for a particular project under one main/parent directory

A working directory in RStudio is where all the files for a particular project are stored

All paths used in the console to load data files and scripts are relative to the working directory.


RStudio: Working Directory

To set the working directory:

Start RStudio the same way you start other programs in your computer

From the File menu select New Project, then New Directory, then Empty Project. Type the directory name (rprogramming), then under “Create project as subdirectory of” click Browse and select Desktop

R: Getting Started

A few basic commands to try in the console

getwd(): get the current working directory

setwd("/path/to/directory"): set the working directory to the specified path

install.packages("package_name"): install a package. Requires an internet connection

library(package_name), require(package_name): load and attach add-on packages

?object: provide documentation/help for an object. e.g. ?mtcars

summary(object): provide a summary of an object like a dataset e.g. summary(mtcars)

Every time you run library(package_name) and get the error “there is no package called ‘package_name’”, you need to install the package first and then call library() on it.
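
The commands above can be tried together in a short session. This is a sketch only; ggplot2 is used as an example package name, and the check with requireNamespace() is one possible way to avoid the “there is no package called” error, not the only one.

```r
# Where is R currently reading and writing files?
wd <- getwd()
wd

# summary() works on whole data sets and on single columns;
# mtcars ships with R in the datasets package
summary(mtcars$mpg)

# Check whether a package is installed before loading it.
# "ggplot2" is just an example package name.
pkg <- "ggplot2"
if (!requireNamespace(pkg, quietly = TRUE)) {
  message("Run install.packages(\"", pkg, "\") first")
} else {
  library(pkg, character.only = TRUE)
}
```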


Data Visualizations in R: Introduction

R has different systems (packages) for making graphs (visualizations)

Here we are going to use ggplot2, which is more elegant and versatile than many of the alternatives (ggvis, rgl, htmlwidgets, googleVis, etc.)

ggplot2 is built upon “The Layered Grammar of Graphics”

Data Visualizations in R: Tidyverse

Tidyverse is a set of packages

The packages work in harmony

Reason: they share common data representations and API design.

The tidyverse package makes it easy to install and load core packages from it in a single command

To install run: install.packages("tidyverse")

To use it run: library(tidyverse), which loads the tidyverse core packages: ggplot2, tibble, tidyr, readr, purrr, and dplyr.

Google each one of these packages to learn what they do

Data Visualizations: First Steps

library(tidyverse) loads all the core packages from tidyverse

The library() function also reports any conflicts with base R or other packages that arise from loading the named package.

e.g. in this case filter() and lag() are dplyr functions (loaded by tidyverse) that mask functions of the same name in the stats package

In such cases you may need to call a function explicitly from its package, in the form package::function(), e.g. ggplot2::ggplot() calls the ggplot function from the ggplot2 package.
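
As a small illustration of why the package:: prefix matters, the sketch below calls both functions named filter() explicitly. It assumes the dplyr package is installed; mtcars is a data set that ships with R.

```r
library(dplyr)

# dplyr::filter() keeps the rows of a data frame that match a condition
eight_cyl <- dplyr::filter(mtcars, cyl == 8)
nrow(eight_cyl)

# stats::filter() is unrelated: it applies a linear filter
# (here a 3-point moving average) to a numeric series
smoothed <- stats::filter(1:5, rep(1 / 3, 3))
as.numeric(smoothed)
```

Without the prefix, filter() refers to whichever package was attached last, which is dplyr after library(tidyverse).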


Data Visualizations: First Steps

Which is more fuel efficient: cars with big engines or cars with small engines?

The mpg data frame:

Data Frame: is a rectangular collection of variables in columns and observations in rows

The mpg data frame in ggplot2 contains observations collected by the US Environmental Protection Agency on 38 models of cars.

Run (from console) ?mpg to learn more about the data set.

First Steps Creating a ggplot

To answer the question about fuel efficiency, plot fuel efficiency (hwy, highway miles per gallon: y-axis) against engine size (displ, engine displacement in litres: x-axis)

See the magic of this command:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))


First Steps Creating a ggplot

> ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

The plot shows a negative relationship between engine size (displ) and fuel efficiency (hwy): cars with bigger engines use more fuel.

Creating a ggplot

In ggplot2,

You begin with the function ggplot()

ggplot() creates a coordinate system that you can add layers onto.

The first argument is the data set that you are going to use for plotting

To complete the graph add more layers to the coordinate system created by ggplot()

geom_point() function adds a layer of points to plot (which creates a scatter plot for this case)

Each geom function in ggplot2 takes a mapping argument, which defines how variables are mapped to visual properties.

The mapping argument is always paired with aes()

The x and y arguments of aes() specify which variables to map to the x and y axes.

ggplot2 looks for the mapped variable in the data argument, in this case, mpg
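
The layering described above can be inspected directly. The sketch below assumes ggplot2 is installed and uses the layers element of a plot object to count layers; the variable names base_plot and scatter are chosen for illustration only.

```r
library(ggplot2)

# ggplot() alone creates an empty coordinate system with no layers
base_plot <- ggplot(data = mpg)
length(base_plot$layers)

# adding a geom adds one layer; aes() maps columns of mpg to x and y
scatter <- base_plot + geom_point(mapping = aes(x = displ, y = hwy))
length(scatter$layers)
```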


Creating a ggplot: Template

A graphing template for ggplot

You can get a list of <GEOM_FUNCTION>s by following this link (http://docs.ggplot2.org/current/)
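
The template itself did not survive in the text of this slide. Assuming it is the usual ggplot2 pattern of a data set, a geom function and a set of mappings, it can be sketched as follows; the angle-bracket names are placeholders, not real R objects.

```r
library(ggplot2)

# Template (placeholders in angle brackets are not valid R until filled in):
#   ggplot(data = <DATA>) +
#     <GEOM_FUNCTION>(mapping = aes(<MAPPINGS>))

# One way to fill it in: DATA = mpg, GEOM_FUNCTION = geom_point,
# MAPPINGS = x = displ, y = hwy
p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))
p
```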


ggplot: Aesthetics Mappings

Look at the graph and note the circled dots

What is special about these big-engine cars?

ggplot: Aesthetics

ggplot2 aesthetic mappings can help answer the question

An aesthetic is a visual property of the objects in a plot.

These are things like size, shape or color of points.

You can therefore display a point in different ways by changing the values of its aesthetic properties.

You can convey information about your data by mapping the aesthetics in your plot to the variables in your dataset.

e.g. you can map the colors of your points to the class variable to reveal the class of each car.


ggplot: Aesthetics

New plot with aesthetics for class:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy, color = class))

Try the same for year and manufacturer and look at the trends

ggplot: Aesthetics

Other aesthetics:

Size: best for ordered variables; the size of each point reflects its value

Alpha: controls the transparency of the points

Shape: points will be of different shapes

Exercise: try plotting the same geom with these different aesthetics

ggplot2 takes care of selecting a reasonable scale to use with the aesthetic and constructs a legend


ggplot: Aesthetics

The aesthetic properties of a geom can be set manually.

For example:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

Will set all points to blue

Note color is outside the aes()
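
To see why the position of color matters, the following sketch contrasts setting an aesthetic (outside aes()) with mapping it (inside aes()). It assumes ggplot2 is installed; p_set and p_mapped are illustrative names.

```r
library(ggplot2)

# Setting: color is a fixed value for the whole layer, outside aes()
p_set <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy), color = "blue")

# Mapping: inside aes(), "blue" is treated as data (a one-level variable),
# so ggplot2 picks its own color from the default scale and adds a legend
p_mapped <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy, color = "blue"))
```

Printing p_mapped is a quick way to recognise this common mistake: the points are not blue and a legend labelled "blue" appears.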


ggplot: Facets


When the data has categorical variables, it is possible to split the plot into facets.

Facets are subplots that each display a subset of the data.

To plot facets, with a single variable, use the function facet_wrap(formula, …)

formula is created with ~ variable-name

formula is the name of a data structure in R, not a synonym for equation.

The variable (variable-name) should be discrete.


ggplot: Facets

For example:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), color = "red") + facet_wrap(~ class, nrow = 3)

This will produce one plot for each value of the class variable in mpg, and the plots will be arranged in three rows.

ggplot: Facets

Can we facet the plot using two discrete variables?

Do this:

?facet_grid

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + facet_grid(drv ~ cyl)

In the plot, why do we have empty sub-plots?


ggplot: Facets

Hack:

With facet_grid(), what happens when you use a . in place of one variable?

Is there an advantage of faceting over the color aesthetic? Any disadvantages? What if the dataset is very large?

In facet_wrap() what do nrow and ncol do?

When using facet_grid() put the variable with more unique levels in the columns (RHS of formula), why?

Why doesn’t facet_grid() have nrow and ncol arguments?
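
As a starting point for the first question, here is a sketch of the two ways a . can be used in a facet_grid() formula. It assumes ggplot2 is installed; the object names are illustrative.

```r
library(ggplot2)

p <- ggplot(data = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

# . ~ cyl : no row variable, one column of panels per value of cyl
p_cols <- p + facet_grid(. ~ cyl)

# drv ~ . : one row of panels per value of drv, no column variable
p_rows <- p + facet_grid(drv ~ .)
```

Print p_cols and p_rows to compare the layouts.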


ggplot2::Geometric objects (geoms)

These are the geometric objects used to represent the data.

e.g. bar geoms, point geoms, line geoms, smooth geoms, etc.

To change the geom in your plot, change the geom function (geom_xxx())

For example:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy))

Not every aesthetic works with every geom

e.g. you can set the shape of a point, but not the shape of a line

Read: ?geom_point, ?geom_smooth

ggplot2: geoms

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

Try:

ggplot(data = mpg) + geom_line(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot2: geoms

Plot:

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, linetype = drv))

ggplot(data = mpg) + geom_smooth(mapping = aes(x = displ, y = hwy, group = drv))

What is the difference? Which is better? Why?

ggplot2: combined geoms

Can we use more than one geom on the same plot?

Try:

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy)) + geom_smooth(mapping = aes(x = displ, y = hwy))

When using multiple geoms on the same plot you can use global mappings:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point() + geom_smooth()

Which makes the code easy to read and modify.


ggplot2: combined geoms

When you use global mappings and set some mappings in a geom function, these mappings will be treated as local to this layer only.

For example:

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) + geom_point(mapping = aes(color = class)) + geom_smooth()


ggplot2: combined geoms

In the same way, you can specify different data for each layer.

Say you only want to fit a smooth line for one class of cars

ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(mapping = aes(color = class)) +
  geom_smooth(data = filter(mpg, class == "subcompact"), se = FALSE)

Hack:

Can we plot more than one of the same geom? Try a smooth geom for a different car class

Combined Geoms: exercise

ggplot2: geoms

How many geoms does ggplot2 have?

Visit this page:

Look for Data Visualization Cheat Sheet

ggplot2 extensions provide more geoms to use. Take a look at available extensions from

ggplot2: statistical transformations

Read: ?diamonds

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))

Where does count come from?

Statistical Transformations

Some plots plot raw values

e.g. scatterplots,

Some plots use calculated values


bar charts, histograms, and frequency polygons bin your data and then plot bin counts, the number of points that fall in each bin.

smoothers fit a model to your data and then plot predictions from the model. (Remember regression lines)

boxplots compute a robust summary of the distribution and then display a specially formatted box.


Statistical Transformation

The algorithm used to calculate new values for a graph is called a stat (short for statistical transformation)

You can check which stat a geom uses by default by looking at the default value of its stat argument, e.g. in ?geom_bar.

geom_bar() uses count. Thus you can recreate the bar chart by running

ggplot(data = diamonds) + stat_count(mapping = aes(x = cut))

Every geom has a default stat; and vice-versa. This means that you can typically use geoms without worrying about the underlying statistical transformation.
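
To make the count stat less mysterious, the sketch below computes the same counts by hand with table() and compares them with what ggplot2 computes for the bar chart. It assumes ggplot2 is installed and uses layer_data() to extract the computed values.

```r
library(ggplot2)

# What stat_count() computes: number of rows per value of cut
manual_counts <- table(diamonds$cut)
manual_counts

# The same numbers, extracted from the plot after ggplot2 has applied the stat
p <- ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut))
computed <- layer_data(p)
computed$count
```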


Statistical Transformation

You can explicitly specify a stat:

When you want to override the default stat

e.g. Run

demo <- tribble(
  ~a,      ~b,
  "bar_1", 20,
  "bar_2", 30,
  "bar_3", 40
)

Then run

ggplot(data = demo) + geom_bar(mapping = aes(x = a, y = b), stat = "identity")

Statistical Transformation

Reasons to explicitly specify a stat: cntd

You want to override the default mapping from transformed variables to aesthetics.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, y = after_stat(prop), group = 1))

This will draw a bar chart of proportion instead of count. (after_stat(prop) refers to the prop variable computed by the stat; older ggplot2 versions write it as ..prop..)

Position Adjustments

A bar chart can be colored in either of two ways: color and fill.

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, colour = cut))

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = cut))


Position Adjustments

Check what the following plots look like

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity))

ggplot(data = diamonds, mapping = aes(x = cut, fill = clarity)) + geom_bar(alpha = 1/5, position = "identity")

ggplot(data = diamonds, mapping = aes(x = cut, colour = clarity)) + geom_bar(fill = NA, position = "identity")

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "fill")

ggplot(data = diamonds) + geom_bar(mapping = aes(x = cut, fill = clarity), position = "dodge")

Position Adjustments

Learn more about position adjustments

?position_dodge,

?position_fill,

?position_identity,

?position_jitter

?position_stack


Position Adjustments: overplotting

Recall: ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy))

It displays fewer points than the 234 observations in the data set (can you count them?)

The values of displ and hwy are rounded, so the points fall on a grid and many points overlap each other. This problem is called overplotting.

You can avoid this gridding by setting the position adjustment to "jitter"

position = "jitter" adds a small amount of random noise to each point

Since no two points are likely to receive the same amount of noise, the points are spread out.

Jittering makes the graph less accurate at small scales, however it will make the graph more revealing at large scales.

In ggplot2 the shorthand for geom_point(position = "jitter") is geom_jitter()
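
The overplotting claim can be checked in code, and the amount of noise can be controlled. This is a sketch assuming ggplot2 is installed; the width, height and seed values are arbitrary choices for illustration.

```r
library(ggplot2)

# 234 observations, but fewer distinct (displ, hwy) pairs,
# which is why the plain scatter plot shows fewer points
n_obs <- nrow(mpg)
n_distinct_points <- nrow(unique(mpg[, c("displ", "hwy")]))
c(n_obs, n_distinct_points)

# position_jitter() lets you choose how much noise to add;
# a fixed seed makes the jittered plot reproducible
p <- ggplot(data = mpg, mapping = aes(x = displ, y = hwy)) +
  geom_point(position = position_jitter(width = 0.1, height = 0.5, seed = 1))
```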


Position Adjustments: jitter

ggplot(data = mpg) + geom_point(mapping = aes(x = displ, y = hwy), position = "jitter")


Thank You! Asanteni!


Working with Data

In this part we are going to learn how to work with your data.

Getting data

Importing your own data

Tidying data

How to work with different data types:

Relational data,

Strings,

Factors,

Dates and Times


Importing Data

For importing files, we will use the readr package which is part of the tidyverse core packages.

Most readr functions turn flat files into data frames. A Data Frame is a tabular data format with rows and columns. It is a list of vectors of equal length.

read_csv(): reads comma-separated files

read_csv2(): reads semicolon-separated files

read_tsv(): reads tab-delimited files

read_delim(): reads files with any delimiter

Activity:

Check what read_table(), read_fwf() and read_log() do.

Importing Data: read_csv()

The first argument is the path to the file to read

read_csv("data/students.csv")

read_csv() prints out a column specification

read_csv() by default uses the first row as the column names

You can use skip = n to skip the first n lines if they contain data you don’t need (most likely metadata)

You can use comment = "#" to drop all lines that start with #, for example

Use col_names = FALSE so that read_csv() doesn’t treat the first row as the column names

Missing values in R are represented by NA. When loading files where missing values are marked differently, use the na argument, e.g. na = "." if missing values are marked by a period.

What will this line do?

read_csv("students.csv", skip = 2, comment = "//", col_names = FALSE, na = "-")
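
One way to find out is to try the call on a small file. The sketch below writes a made-up file to a temporary location (the contents and names are invented for illustration, not taken from any real students.csv) and reads it back with the same options. It assumes readr is installed.

```r
library(readr)

# Build a small hypothetical file: two metadata lines, one comment line,
# then two data rows where "-" marks a missing value
path <- tempfile(fileext = ".csv")
writeLines(c(
  "exported by a hypothetical system",
  "second line of metadata",
  "// a comment line",
  "Asha,21,-",
  "Juma,-,B"
), path)

# skip = 2 drops the metadata, comment = "//" drops the comment line,
# col_names = FALSE generates the names X1, X2, X3, and na = "-" gives NA
students <- read_csv(path, skip = 2, comment = "//",
                     col_names = FALSE, na = "-")
students
```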


Importing Data: Parsing

The parse_*() functions:

?parse_logical, ?parse_integer, ?parse_date

The parse functions take in a character vector and return a more specialized vector.

Character vectors can hold anything: letters, digits and other symbols, e.g. "dLab", "2013", "xyz3", "12.09"

A specialized vector contains only one kind of value, say only integers, only decimal numbers, or only dates, and this is what the parse functions do: they return a vector of a specific type

A vector in R can be created by listing its elements inside c()

For example

names <- c("John", "Jean", "Giovanni", "Joni")

dates_of_birth <- c("2012-12-31", "1988-05-02", "1990-01-06")

Importing Data: Parsing

What happens to the following?

parse_integer(c("1", "231", ".", "456"), na = ".")

x <- parse_integer(c("123", "345", "abc", "123.45"))

parse_logical() and parse_integer() parse logicals and integers respectively. There’s basically nothing that can go wrong with these parsers so I won’t describe them here further.

parse_double() is a strict numeric parser, and parse_number() is a flexible numeric parser. These are more complicated than you might expect because different parts of the world write numbers in different ways.

parse_character() seems so simple that it shouldn’t be necessary. But one complication makes it quite important: character encodings.

parse_factor() creates factors, the data structure that R uses to represent categorical variables with fixed and known values.

parse_datetime(), parse_date(), and parse_time() allow you to parse various date & time specifications. These are the most complicated because there are so many different ways of writing dates.
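
A short sketch of a few of these parsers in action, assuming readr is installed. The inputs "$1,234.5" and "1,23" are made-up examples; the first two calls are the ones from the question above.

```r
library(readr)

# "." is declared as the missing-value marker, so it becomes NA
a <- parse_integer(c("1", "231", ".", "456"), na = ".")
a

# "abc" and "123.45" are not integers: they become NA with a warning,
# and problems() lists what failed
x <- suppressWarnings(parse_integer(c("123", "345", "abc", "123.45")))
problems(x)

# parse_number() ignores non-numeric characters around the number
parse_number("$1,234.5")

# parse_double() is strict, but the decimal mark can be set via the locale
parse_double("1,23", locale = locale(decimal_mark = ","))
```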


Importing Data: parsing

One important thing to note when parsing characters is the encoding. UTF-8 is the most common encoding, and specifying it may save you hours of fixing problems. Specify it when parsing characters like this:

x <- "El Niño was particularly bad this year"

parse_character(x, locale = locale(encoding = "UTF-8"))

?parse_datetime, ?parse_date, ?parse_time

Generate correct format strings to parse each of the following dates and times

d1 <- "January 1, 2010"

d2 <- "2015-Mar-07"

d3 <- "06-Jun-2017"

d4 <- c("August 19 (2015)", "July 1 (2015)")

d5 <- "12/30/14" # Dec 30, 2014

t1 <- "1705"

t2 <- "11:15:10.12 PM"
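
One possible set of format strings for this exercise is sketched below; other formats may work equally well. It assumes readr is installed and relies on readr's default English locale for month names.

```r
library(readr)

d1 <- "January 1, 2010"
d2 <- "2015-Mar-07"
d3 <- "06-Jun-2017"
d4 <- c("August 19 (2015)", "July 1 (2015)")
d5 <- "12/30/14" # Dec 30, 2014
t1 <- "1705"
t2 <- "11:15:10.12 PM"

# %B = full month name, %b = abbreviated month name, %y = two-digit year
p1 <- parse_date(d1, "%B %d, %Y")
p2 <- parse_date(d2, "%Y-%b-%d")
p3 <- parse_date(d3, "%d-%b-%Y")
p4 <- parse_date(d4, "%B %d (%Y)")
p5 <- parse_date(d5, "%m/%d/%y")

# %H = 24-hour clock, %I = 12-hour clock (used with %p for AM/PM),
# %OS = seconds including the fractional part
pt1 <- parse_time(t1, "%H%M")
pt2 <- parse_time(t2, "%I:%M:%OS %p")
```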


Importing Data: parsing files

example_file <- read_csv(readr_example("challenge.csv"))

Use the problems() function to look at any issues with the import

problems(example_file)

Specify the column types explicitly when reading the file

example_file <- read_csv(readr_example("challenge.csv"),
  col_types = cols(
    x = col_double(),
    y = col_date()
  )
)

Use tail(dataframe, n = X) and head(dataframe, n = X) to look at the last and first X rows of the data frame.

Parsing files

One more strategy to get the column types is to use the guess_max option when reading in a file.

example_file2 <- read_csv(readr_example("challenge.csv"), guess_max = 1001)


Writing to a file

To save a data frame to a delimited text file, use

write_csv() (comma-separated) or write_tsv() (tab-separated), where you need to specify

The data frame you are saving

The file path (location) where to save it

Optionally:

you can set how missing values are written with na

You can also append to an existing file with append = TRUE
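
The options above can be sketched as follows, assuming readr is installed. The data frame, its values and the temporary file path are invented for illustration.

```r
library(readr)

# A small made-up data frame with one missing score
scores <- data.frame(name = c("Asha", "Juma"), score = c(90, NA))
path <- tempfile(fileext = ".csv")

# na = "-" controls how missing values are written
write_csv(scores, path, na = "-")

# append = TRUE adds rows to the existing file without repeating the header
write_csv(data.frame(name = "Neema", score = 75), path, append = TRUE)

written <- readLines(path)
written
```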


Parsing Files

Group Activity

Download the dataset: Number of Trainees with Special Needs enrolled in Vocational Training Centres from http://opendata.go.tz

Read it into a data frame and do some manipulations including making some plots

Inspect

read_rds() and write_rds() and see where you can use these functions

Explore these packages:

haven, readxl, DBI

Tidy Data

A tidy dataset has these features

Each variable is in its own column

Each observation is in its own row

Each value is in its own cell

?gather, ?spread

Missing Values:

Can be explicitly stated with NA

Can be implicit: not present in the data

With gather(…, na.rm = TRUE) explicit missing values are dropped, i.e. made implicit

You can use the complete() function to make implicit missing values explicit in tidy data.

?complete
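
A small sketch of both directions, assuming tidyr is installed. The stocks and wide data frames are invented toy data: the first has one explicit NA and one combination (2016, quarter 2) that is missing implicitly.

```r
library(tidyr)

# One explicit NA (2016 Q1) and one implicit missing row (2016 Q2)
stocks <- data.frame(
  year   = c(2015, 2015, 2016),
  qtr    = c(1, 2, 1),
  return = c(1.9, 0.6, NA)
)

# complete() adds a row for every year/qtr combination, filling with NA
full <- complete(stocks, year, qtr)
full

# gather() with na.rm = TRUE goes the other way and drops rows that are NA
wide <- data.frame(country = c("A", "B"),
                   y1999 = c(10, NA),
                   y2000 = c(20, 30))
long <- gather(wide, key = "year", value = "cases", y1999, y2000, na.rm = TRUE)
long
```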


Case Study

Optionally download the data from http://www.who.int/tb/country/data/download/en/

Load the data from the file or from the package: tidyr::who

Looking at the data:

country, iso2 and iso3 are redundant: all three identify the country

year is clearly a variable

The other columns have unclear names; look at the data dictionary

Case Study cntd

Gather all the other columns, removing all missing values

who1 <- who %>% gather(new_sp_m014:newrel_f65, key = "key", value = "cases", na.rm = TRUE)

Look at structure of the values in the new key by counting

who1 %>% count(key)

Use the data dictionary for the definition of the keys

who2 <- who1 %>% mutate(key = stringr::str_replace(key, "newrel", "new_rel"))

Separate the key variable into different columns

who3 <- who2 %>% separate(key, c("new", "type", "sexage"), sep = "_")

Look at new key

who3 %>% count(new)

Drop new column because it is constant

who4 <- who3 %>% select(-new)

Separate sexage into sex and age

who5 <- who4 %>% separate(sexage, c("sex", "age"), sep = 1)
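
The two separate() calls behave differently: a character sep splits at that character, while a numeric sep splits at a position. The sketch below shows this on a three-row toy data frame (invented values, with keys in the same style as the who data), assuming tidyr is installed.

```r
library(tidyr)

# Toy data with keys shaped like those in the who data set
toy <- data.frame(
  key   = c("new_sp_m014", "new_ep_f65", "newrel_m2534"),
  cases = c(5, 2, 7)
)

# Make the naming consistent, as in the who2 step
toy$key <- sub("newrel", "new_rel", toy$key)

# sep = "_" splits at every underscore
step1 <- separate(toy, key, c("new", "type", "sexage"), sep = "_")

# sep = 1 splits after the first character: sex, then the age group
step2 <- separate(step1, sexage, c("sex", "age"), sep = 1)
step2
```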


Writing Code in R

Create new objects with <- with the format object_name <- object_value

The <- symbol is the assignment operator

Examples:

first_name <- "Sovello"

date.of.birth <- "12/31/1980"

PlaceOfBirth <- "Njombe"

AGE <- 37

x = 200 * 5

Object names must start with a letter.

Object names can only contain letters, numbers, underscore (_), and period (.)

Look at the examples above
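
The naming rules can be checked with base R's make.names(), which rewrites a string into a syntactically valid name; a name is already valid if it comes back unchanged. This is a sketch, and the candidate names 2nd_try and my-name are invented examples of invalid names.

```r
# Both <- and = assign a value to a name
first_name <- "Sovello"
x = 200 * 5

# make.names() leaves valid names untouched and repairs invalid ones
candidates <- c("first_name", "date.of.birth", "2nd_try", "my-name")
valid <- make.names(candidates) == candidates
valid
```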


Writing code in R

You can inspect the value of an object by typing its name

You can also print an object explicitly

print(first_name)
[1] "Sovello"

The [1] shown in the output indicates that first_name is a vector and "Sovello" is its first element.

Writing code in R

Character (text) values must be enclosed in double or single quotes ("value" or 'value')

Look at definition of place.of.birth in the screenshot

Typos matter when using object names. Names are case-sensitive, so surname and Surname are not the same object.

The # character indicates a comment. Anything to the right of # is ignored by R

R has no multi-line comment syntax: every comment line must start with #

Group Exercise (5min)

What is wrong with this code snippet?

Surname <- "Mkulima"
surname

If you start typing a value for an object and press Enter before the closing quote or parenthesis, the console will look like this:

college <- "College of informatics

+

A + means you should continue typing. What would you do to fix, stop or escape from the problem?

Fix errors in this piece of code until it works

library(tidyverse)

ggplot(dota = mpg) +
  geom_point(mapping = aes(x = displ, y = hwy))

fliter(mpg, cyl = 8)

R Objects

R has five basic (atomic) classes of objects

Character

Numeric (real numbers)

Integer

Complex

Logical (TRUE/FALSE)

The most basic type of R object is a vector. An empty vector can be created with vector()

A vector can only contain objects of the same type.

Numbers are generally treated as numeric objects

If you want an integer, you have to request it explicitly with the L suffix.

1L is an integer

1 is a real number
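The points above can be checked directly at the console. The following is a minimal sketch; the variable names are arbitrary, and the last lines illustrate what happens when values of different types are combined in one vector, which follows from the rule that a vector holds a single type.

```r
x <- 1    # a plain number is stored as numeric (double)
y <- 1L   # the L suffix requests an integer

class(x)  # "numeric"
class(y)  # "integer"

# vector() creates an empty vector of a given mode and length
v <- vector("numeric", length = 3)
v         # 0 0 0

# Mixing types in c() converts every element to one common type
z <- c(1.5, "a", TRUE)
class(z)  # "character"
```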

R Objects

Inf is a special number which represents infinity.

You can use Inf in calculations like 1/Inf
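A few quick calculations show how Inf behaves in arithmetic. This is a small sketch with arbitrary variable names, not an exhaustive list of cases.

```r
a <- 1 / Inf     # 0
b <- 1 / 0       # Inf: division by zero does not raise an error
d <- -1 / 0      # -Inf
e <- Inf - Inf   # NaN: the result is undefined

is.finite(b)     # FALSE
is.nan(e)        # TRUE
```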

Creating vectors

Use the c() function to create vectors

> x <- c(0.5, 0.6) ## numeric

> x <- c(TRUE, FALSE) ## logical

> x <- c(T, F) ## logical

> x <- c("a", "b", "c") ## character

> x <- 9:29 ## integer

> x <- c(1+0i, 2+4i) ## complex

Coercion of R objects

You can explicitly coerce objects using the as.* functions. See ?as.integer, ?as.character, ?as.logical, ?as.numeric

> x <- 0:6

> class(x)

[1] "integer"

> as.numeric(x)

[1] 0 1 2 3 4 5 6

> as.logical(x)

[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE

> as.character(x)

[1] "0" "1" "2" "3" "4" "5" "6"

If R fails to coerce an object, it produces NAs.

> x <- c("a", "b", "c")

> as.numeric(x)

Warning: NAs introduced by coercion

[1] NA NA NA

> as.logical(x)

[1] NA NA NA

> as.complex(x)

Warning: NAs introduced by coercion

[1] NA NA NA
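In practice, failed coercion is often used to find bad input. The sketch below assumes a made-up character vector in which one entry is not a number; suppressWarnings() is used only to keep the console quiet.

```r
x <- c("10", "20", "abc", "40")   # made-up input with one bad entry

# suppressWarnings() hides the "NAs introduced by coercion" warning
nums <- suppressWarnings(as.numeric(x))
nums      # 10 20 NA 40

# Find which inputs could not be converted
failed <- x[is.na(nums)]
failed    # "abc"

# as.logical() accepts strings such as "TRUE", "false", and "T"
flags <- as.logical(c("TRUE", "false", "T", "yes"))
flags     # TRUE FALSE TRUE NA
```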

R Objects: Matrices

Matrices are vectors with a dimension attribute.

The dimension is an integer vector of length 2 (number of rows, number of columns)

> m <- matrix(nrow = 2, ncol = 3)

> m
     [,1] [,2] [,3]
[1,]   NA   NA   NA
[2,]   NA   NA   NA

> dim(m)

[1] 2 3

> attributes(m)
$dim
[1] 2 3

Matrices

Matrices are constructed column-wise: entries start at the “upper left” corner and run down the columns

> m <- matrix(1:6, nrow = 2, ncol = 3)

> m
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6

You can create matrices from vectors by adding a dimensions attribute

> m <- 1:10

> m
 [1]  1  2  3  4  5  6  7  8  9 10

> dim(m) <- c(2, 5)

> m
     [,1] [,2] [,3] [,4] [,5]
[1,]    1    3    5    7    9
[2,]    2    4    6    8   10

Every element of a matrix must be of the same class (e.g. all integers or all numeric).
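The column-wise fill order and the single-class rule can both be seen in a short sketch. The byrow argument of matrix() is not mentioned in the slides; it is shown here only as a contrast to the default.

```r
# Default: values fill the matrix column by column
by_col <- matrix(1:6, nrow = 2, ncol = 3)

# byrow = TRUE fills row by row instead
by_row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)

by_col[1, ]   # 1 3 5
by_row[1, ]   # 1 2 3

# Assigning one character value converts the whole matrix to character
mixed <- by_col
mixed[1, 1] <- "a"
class(mixed[2, 2])   # "character"
```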

Group work

What do cbind() and rbind() do?

Create 3 vectors and 3 matrices.

Create 3 matrices from vectors

Create 2 matrices using cbind() and rbind()

Read about R lists: how to create using list()

R Objects: Factors

Factors represent categorical data

Factors can be ordered or unordered

Factor objects can be created with the factor() function

> x <- factor(c("yes", "yes", "no", "yes", "no"))

> x

[1] yes yes no  yes no
Levels: no yes

> table(x)
x
 no yes
  2   3

Factors

Say you want to sort a vector

> x1 <- c("Dec", "Apr", "Jan", "Mar")

> sort(x1)

[1] "Apr" "Dec" "Jan" "Mar"

The goal was to sort the months in calendar order (Jan, Mar, Apr, Dec), but sort() orders a character vector alphabetically

To solve this problem we can make use of factors

Create a vector of months

month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

Then create a factor that uses month_levels as its levels.

> y1 <- factor(x1, levels = month_levels)

Applying sort() to the new factor returns the values in calendar order

> sort(y1)
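Put together, the steps above can be run as one piece. The second half is an added illustration with a deliberately misspelled month ("Jam"), showing that values missing from the levels become NA without any warning.

```r
month_levels <- c(
  "Jan", "Feb", "Mar", "Apr", "May", "Jun",
  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

# Sorting a factor follows the order of its levels
sorted <- sort(y1)
as.character(sorted)   # "Jan" "Mar" "Apr" "Dec"

# A value that is not among the levels silently becomes NA
x2 <- c("Dec", "Apr", "Jam", "Mar")
y2 <- factor(x2, levels = month_levels)
is.na(y2)              # FALSE FALSE TRUE FALSE
```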

R Objects: missing values

Missing values are denoted by NA; NaN denotes the result of an undefined mathematical operation

is.na() is used to test objects if they are NA

is.nan() is used to test for NaN

NA values have a class also, so there are integer NA, character NA, etc.

A NaN value is also NA but the converse is not true

> ## Create a vector with NAs in it

> x <- c(1, 2, NA, 10, 3)

> ## Return a logical vector indicating which elements are NA

> is.na(x)

[1] FALSE FALSE TRUE FALSE FALSE

> ## Return a logical vector indicating which elements are NaN

> is.nan(x)

[1] FALSE FALSE FALSE FALSE FALSE
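The relationship between NA and NaN is easier to see with a vector that contains both. This is a small sketch with made-up values; 0/0 is used to produce a NaN.

```r
x <- c(1, NA, 0 / 0, 4)   # 0/0 evaluates to NaN

na_flags  <- is.na(x)     # FALSE TRUE TRUE FALSE  (NaN counts as NA)
nan_flags <- is.nan(x)    # FALSE FALSE TRUE FALSE (NA is not NaN)

# NA values have a class of their own
class(NA)                 # "logical"
class(NA_integer_)        # "integer"
class(NA_character_)      # "character"

# na.rm = TRUE drops both NA and NaN before computing
m <- mean(x, na.rm = TRUE)
m                         # 2.5
```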

What is the difference between a missing value (NA) and zero?

R Objects: Data Frames

Data frames store tabular data in R

Data frames are represented as a special type of list where every element of the list has to have the same length.

Each element of the list can be thought of as a column and the length of each element of the list is the number of rows.

Unlike matrices, data frames can store different classes of objects in each column.

Data Frames

> x <- data.frame(foo = 1:4, bar = c(T, T, F, F))

> x
  foo   bar
1   1  TRUE
2   2  TRUE
3   3 FALSE
4   4 FALSE

> nrow(x)
[1] 4

> ncol(x)
[1] 2
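The list-like nature of a data frame, and its ability to hold a different class in each column, can be checked with a small sketch. The column names and values below are made up for illustration.

```r
# A made-up data frame with three columns of different classes
df <- data.frame(
  name   = c("Asha", "Juma", "Neema"),
  age    = c(31L, 27L, 45L),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE   # keep text as character in older R versions
)

# Each column keeps its own class
col_classes <- sapply(df, class)
col_classes

# A data frame is a list whose elements are the columns
is.list(df)   # TRUE
length(df)    # 3, the number of columns
nrow(df)      # 3, the length of each column
```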

Writing Code in R

Scripts:

Turning interactive code into scripts

Data Transformation

Filter rows with filter()

Comparisons: >, >=, <, <=, !=, ==

sqrt(2) ^ 2 == 2

Logical operators

And: &

Or: | (shorthand for several alternatives: x %in% y, e.g. 2 %in% c(1, 2, 3, 4))

Not: !

To test for missing values, use is.na(x)

Ordering: use arrange()
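The comparison and logical operators listed above behave the same way outside dplyr. The sketch below uses base R's subset() and order() in place of filter() and arrange() so that it runs without extra packages; the data frame is made up, and all.equal() is one base R way to compare floating-point numbers with a tolerance.

```r
# Floating-point arithmetic is approximate, so == can surprise you
exact  <- sqrt(2)^2 == 2                    # FALSE
approx <- isTRUE(all.equal(sqrt(2)^2, 2))   # TRUE

# A made-up data frame with one missing score
df <- data.frame(id = 1:5, score = c(8, NA, 3, 10, 5))

# Keep rows with a known score of at least 5 (what filter() would do)
kept <- subset(df, !is.na(score) & score >= 5)

# Order the result by score (what arrange() would do)
kept <- kept[order(kept$score), ]
kept$id      # 5 1 4

# %in% replaces a chain of == comparisons joined by |
picked <- df[df$id %in% c(2, 4), ]
picked$id    # 2 4
```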

Reading Data: large datasets

With much larger datasets, there are a few things that you can do that will make your life easier and will prevent R from choking.

Read the help page for read.table, which contains many hints

Estimate the memory needed to hold the dataset; if the file is larger than your RAM, stop here

Set comment.char = "" if there are no commented lines in your file.

Use the colClasses argument. Specifying this option instead of using the default can make read.table run much faster, often twice as fast. To use it, you have to know the class of each column.

Set nrows. This doesn’t make R run faster but it helps with memory usage.

Reading large datasets

A quick way to figure out the classes of each column is the following:

> initial <- read.table("datatable.txt", nrows = 100)

> classes <- sapply(initial, class)

> tabAll <- read.table("datatable.txt", colClasses = classes)
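The same idea can be tried end to end on a small file. This is a sketch under stated assumptions: the file is a temporary one written by the example itself with made-up contents, and header = TRUE is needed only because write.table() stores the column names in the first line.

```r
# Write a small example file to a temporary location (made-up data)
path <- tempfile(fileext = ".txt")
example <- data.frame(id = 1:200,
                      value = seq(0.5, 100, by = 0.5),
                      label = rep(c("a", "b"), 100),
                      stringsAsFactors = FALSE)
write.table(example, path, row.names = FALSE)

# Infer the column classes from the first 100 rows only
initial <- read.table(path, header = TRUE, nrows = 100)
classes <- sapply(initial, class)

# Read the full file using the hints from the slides
tab_all <- read.table(path, header = TRUE,
                      colClasses = classes,   # skip type guessing
                      nrows = 200,            # known number of rows
                      comment.char = "")      # no commented lines

dim(tab_all)   # 200 3
unlink(path)   # remove the temporary file
```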

Control Structures

Control structures allow you to control the flow of execution of a series of R expressions.

Control structures allow you to put some “logic” into R code, rather than just always executing the same R code every time.

Control structures allow you to respond to inputs or to features of the data and execute different R expressions accordingly.

Control Structures: if-else

This if-else structure allows you to test a condition and act on it depending on whether it’s true or false

You can use the if statement on its own

if(<condition>) {
    ## do something
}

## Continue with rest of code

Or use the complete if-else

if(<condition>) {
    ## do something
} else {
    ## do something else
}

You can have a series of tests by following the initial if with any number of else ifs.

if(<condition1>) {
    ## do something
} else if(<condition2>) {
    ## do something different
} else {
    ## do something different
}
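A concrete version of the if / else if / else chain might look like the sketch below. The function name classify_temp and its thresholds are hypothetical, chosen only to give each branch something to do.

```r
# Hypothetical helper: label a temperature in degrees Celsius
classify_temp <- function(temp_c) {
  if (temp_c < 15) {
    "cold"
  } else if (temp_c < 28) {
    "mild"
  } else {
    "hot"
  }
}

classify_temp(10)   # "cold"
classify_temp(20)   # "mild"
classify_temp(35)   # "hot"
```

Conditions are tested from top to bottom and only the first branch whose condition is TRUE runs. Each else sits on the same line as the preceding closing brace; at the console, an else that starts its own line is a syntax error.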

Example: if-else

## Generate a uniform random number

x <- runif(1, 0, 10)

if(x > 3) {

y <- 10

} else {

y <- 0

}

This is the same as executing

y <- if(x > 3) {
    10
} else {
    0
}

Control Structures: for

For loops are the most commonly used looping construct in R

for(x in sequence) {
    ## Execute code
}

For one line loops, the curly braces are not strictly necessary.

> x <- c("a", "b", "c", "d")
> for(i in 1:4) print(x[i])
[1] "a"
[1] "b"
[1] "c"
[1] "d"
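Hard-coding 1:4 works only while the vector has exactly four elements. The sketch below shows two common alternatives, seq_along() and looping over the elements directly; neither appears in the slides, and they are offered here as one reasonable way to write the same loop.

```r
x <- c("a", "b", "c", "d")

# seq_along(x) generates the index sequence 1, 2, ..., length(x)
for (i in seq_along(x)) {
  print(x[i])
}

# Loop over the elements themselves when the index is not needed
for (letter in x) print(letter)

# seq_along() is also safe for empty vectors, unlike 1:length(x)
empty  <- character(0)
n_bad  <- length(1:length(empty))    # 2, because 1:0 counts down: 1, 0
n_good <- length(seq_along(empty))   # 0, so the loop body never runs
```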

Control Structures: while

While loops begin by testing a condition

If it is true, the loop body is executed and the condition is tested again; this repeats until the condition is false

> count <- 0

> while(count < 10) {
+     print(count)
+     count <- count + 1
+ }


Control Structures: next

next is used to skip an iteration of a loop

for(i in 1:100) {
    if(i <= 20) {
        ## Skip the first 20 iterations
        next
    }
    ## Do something here
}


Control Structures: break

break is used to exit a loop immediately, regardless of which iteration the loop is on.

for(i in 1:100) {
    print(i)
    if(i > 20) {
        ## Stop the loop once i exceeds 20
        break
    }
}
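next and break are often used together in one loop. The following sketch uses a made-up vector: negative values are skipped with next, and the loop stops altogether at the first missing value with break.

```r
values <- c(4, -1, 7, NA, 9)   # made-up input
total <- 0

for (v in values) {
  if (is.na(v)) {
    break        # stop at the first missing value; 9 is never reached
  }
  if (v < 0) {
    next         # skip negative values and move to the next element
  }
  total <- total + v
}

total   # 11, i.e. 4 + 7
```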

Functions

Functions: scoping

Dates and Times

Loop functions

Simulating and Profiling

Vectorized Operations
